# ADR-053: Performance Monitoring Strategy
## Status
Accepted
## Context
StarPunk v1.1.1 introduces performance monitoring to help operators understand system behavior in production. Currently, we have no visibility into:
- Database query performance
- Memory usage patterns
- Request processing times
- Bottlenecks and slow operations
We need a lightweight, zero-dependency monitoring solution that provides actionable insights without impacting performance.
## Decision
Implement a built-in performance monitoring system using Python's standard library, with optional detailed tracking controlled by configuration.
### Architecture Overview
```
Request → Middleware (timing) → Handler
               ↓                    ↓
       Context Manager         Decorators
               ↓                    ↓
         Metrics Store  ←  Database Hooks
               ↓
        Admin Dashboard
```
### Core Components
#### 1. Metrics Collector
Location: `starpunk/monitoring/collector.py`
Responsibilities:
- Collect timing data
- Track memory usage
- Store recent metrics in memory
- Provide aggregation functions
Data Structure:
```python
from dataclasses import dataclass


@dataclass
class Metric:
    timestamp: float
    category: str    # "db", "http", "function"
    operation: str   # specific operation name
    duration: float  # in seconds
    metadata: dict   # additional context
```
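A minimal sketch of the collector itself, assuming the `Metric` dataclass above and the 1000-entry buffer described under Data Retention (class and method names are illustrative, not final API):
```python
import threading
import time
from collections import deque


class MetricsCollector:
    """In-memory, thread-safe store for recent Metric records."""

    def __init__(self, buffer_size: int = 1000):
        # deque(maxlen=...) gives the circular buffer: old entries
        # are evicted automatically once the buffer is full.
        self._buffer = deque(maxlen=buffer_size)
        self._lock = threading.Lock()

    def record(self, category: str, operation: str, duration: float, **metadata) -> None:
        metric = Metric(
            timestamp=time.time(),
            category=category,
            operation=operation,
            duration=duration,
            metadata=metadata,
        )
        with self._lock:
            self._buffer.append(metric)

    def snapshot(self) -> list:
        """Copy of recent metrics, for aggregation and dashboard display."""
        with self._lock:
            return list(self._buffer)
```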
#### 2. Database Performance Tracking
Location: `starpunk/monitoring/db_monitor.py`
Features:
- Query execution timing
- Slow query detection
- Query pattern analysis
- Connection pool monitoring
Implementation via SQLite callbacks:
```python
# Wrap database operations
with monitor.track_query("SELECT", "notes"):
    cursor.execute(query)
```
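Note that stdlib `sqlite3` does offer `Connection.set_trace_callback`, but it reports statement text rather than timing, so explicit wrapping is the simpler route. One possible shape for the monitor object used above, assuming the collector sketched earlier (all names illustrative):
```python
import time
from contextlib import contextmanager


class DatabaseMonitor:
    """Sketch of the db monitor; wraps operations and feeds the collector."""

    def __init__(self, collector, slow_query_threshold: float = 1.0):
        self.collector = collector
        self.slow_query_threshold = slow_query_threshold

    @contextmanager
    def track_query(self, query_type: str, table: str):
        """Time one database operation and record it as a 'db' metric."""
        start = time.perf_counter()
        try:
            yield
        finally:
            duration = time.perf_counter() - start
            self.collector.record(
                "db",
                f"{query_type} {table}",
                duration,
                slow=duration >= self.slow_query_threshold,  # feeds the slow query log
            )
```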
#### 3. Memory Tracking
Location: `starpunk/monitoring/memory.py`
Track:
- Process memory (RSS)
- Memory growth over time
- Per-request memory delta
- Memory high water mark
Uses `resource` module (stdlib).
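The `resource` module only exposes the peak (the high-water mark); there is no direct stdlib call for current RSS, so on Linux it can be read from `/proc`. A sketch, with the platform caveats noted in comments:
```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Memory high-water mark for this process."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is kilobytes on Linux but bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def current_rss_bytes() -> int:
    """Current resident set size; Linux-only, via /proc/self/statm."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * resource.getpagesize()
```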
#### 4. Request Performance
Location: `starpunk/monitoring/http.py`
Track:
- Request processing time
- Response size
- Status code distribution
- Slowest endpoints
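Assuming StarPunk's Flask application, the middleware can be a pair of request hooks; `collector` is the shared instance sketched earlier:
```python
import time

from flask import g, request


def install_http_monitoring(app, collector):
    """Register before/after hooks that time every request."""

    @app.before_request
    def _start_timer():
        g._perf_start = time.perf_counter()

    @app.after_request
    def _record_request(response):
        start = getattr(g, "_perf_start", None)
        if start is not None:
            collector.record(
                "http",
                f"{request.method} {request.url_rule or request.path}",
                time.perf_counter() - start,
                status=response.status_code,
                size=response.calculate_content_length(),
            )
        return response
```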
#### 5. Admin Dashboard
Location: `/admin/performance`
Display:
- Real-time metrics (last 15 minutes)
- Slow query log
- Memory usage graph
- Endpoint performance table
- Database statistics
### Data Retention
In-memory circular buffer approach:
- Last 1000 metrics retained
- Automatic old data eviction
- No persistent storage (privacy/simplicity)
- Reset on restart
### Performance Overhead
Target: <1% overhead when enabled
Strategies:
- Sampling for high-frequency operations (sketched below)
- Lazy computation of aggregates
- Minimal memory footprint (1MB max)
- Conditional instrumentation via config (no-ops when disabled)
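Sampling can be a single gate in front of `record()`; a sketch:
```python
import random


def should_sample(sample_rate: float) -> bool:
    """True if this event should be recorded (1.0 = always, 0.1 = ~10%)."""
    return sample_rate >= 1.0 or random.random() < sample_rate
```
High-frequency call sites check `should_sample(SAMPLE_RATE)` before constructing a `Metric`, so the sampled-out path costs only a float comparison.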
## Rationale
### Why Built-in Monitoring?
1. **Zero Dependencies**: Uses only Python stdlib
2. **Privacy**: No external services
3. **Simplicity**: No complex setup
4. **Integrated**: Direct access to internals
5. **Lightweight**: Minimal overhead
### Why Not External Tools?
**Prometheus/Grafana**:
- Requires external services
- Complex setup
- Overkill for single-user system
**APM Services** (New Relic, DataDog):
- Privacy concerns
- Subscription costs
- Network dependency
- Too heavy for our needs
**OpenTelemetry**:
- Large dependency
- Complex configuration
- Designed for distributed systems
### Design Principles
1. **Opt-in**: Disabled by default
2. **Lightweight**: Minimal resource usage
3. **Actionable**: Focus on useful metrics
4. **Temporary**: No permanent storage
5. **Private**: No external data transmission
## Consequences
### Positive
1. **Production Visibility**: Understand behavior under load
2. **Performance Debugging**: Identify bottlenecks quickly
3. **No Dependencies**: Pure Python solution
4. **Privacy Preserving**: Data stays local
5. **Simple Deployment**: No additional services
### Negative
1. **Limited History**: Only recent data available
2. **Memory Usage**: ~1MB for metrics buffer
3. **No Alerting**: Manual monitoring required
4. **Single Node**: No distributed tracing
### Mitigations
1. Export capability for external tools
2. Configurable buffer size
3. Webhook support for alerts (future)
4. Focus on most valuable metrics
## Alternatives Considered
### 1. Logging-based Monitoring
**Approach**: Parse performance data from logs
**Pros**: Simple, no new code
**Cons**: Log parsing complexity, no real-time view
**Decision**: Dedicated monitoring is cleaner
### 2. External Monitoring Service
**Approach**: Use service like Sentry
**Pros**: Full-featured, alerting included
**Cons**: Privacy, cost, complexity
**Decision**: Violates self-hosted principle
### 3. Prometheus Exporter
**Approach**: Expose /metrics endpoint
**Pros**: Standard, good tooling
**Cons**: Requires Prometheus setup
**Decision**: Too complex for target users
### 4. No Monitoring
**Approach**: Rely on logs and external tools
**Pros**: Simplest
**Cons**: Poor production visibility
**Decision**: v1.1.1 specifically targets production readiness
## Implementation Details
### Instrumentation Points
1. **Database Layer**
   - All queries automatically timed
   - Connection acquisition/release
   - Transaction duration
   - Migration execution
2. **HTTP Layer**
   - Middleware wraps all requests
   - Per-endpoint timing
   - Static file serving
   - Error handling
3. **Core Functions** (see the decorator sketch below)
   - Note creation/update
   - Search operations
   - RSS generation
   - Authentication flow
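A minimal sketch of such a decorator, assuming the shared `collector` from the Metrics Collector section (the `track` name and the `create_note` example are illustrative):
```python
import functools
import time


def track(category: str, operation: str):
    """Decorator that times a function and records it as a metric."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                collector.record(category, operation, time.perf_counter() - start)
        return wrapper
    return decorator


@track("function", "create_note")
def create_note(content: str):
    ...  # existing implementation unchanged
```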
### Performance Dashboard Layout
```
Performance Dashboard
═════════════════════

Overview
--------
Uptime: 5d 3h 15m
Requests: 10,234
Avg Response: 45ms
Memory: 128MB

Slow Queries (>1s)
------------------
[timestamp] SELECT ... FROM notes (1.2s)
[timestamp] UPDATE ... SET ... (1.1s)

Endpoint Performance
--------------------
GET /          : avg 23ms,  p99 45ms
GET /notes/:id : avg 35ms,  p99 67ms
POST /micropub : avg 125ms, p99 234ms

Memory Usage
------------
[ASCII graph showing last 15 minutes]

Database Stats
--------------
Pool Size: 3/5
Queries/sec: 4.2
Cache Hit Rate: 87%
```
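The avg/p99 columns in the endpoint table can be computed on demand from the buffer (lazy aggregation, per the overhead strategy above); a sketch using stdlib `statistics`:
```python
from statistics import quantiles


def endpoint_stats(metrics):
    """Group 'http' metrics by operation and compute (avg, p99) durations."""
    by_op: dict[str, list[float]] = {}
    for m in metrics:
        if m.category == "http":
            by_op.setdefault(m.operation, []).append(m.duration)
    return {
        op: (
            sum(ds) / len(ds),                                   # average
            quantiles(ds, n=100)[-1] if len(ds) > 1 else ds[0],  # 99th percentile
        )
        for op, ds in by_op.items()
    }
```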
### Configuration Options
```python
# All under STARPUNK_PERF_* prefix
MONITORING_ENABLED = False # Master switch
SLOW_QUERY_THRESHOLD = 1.0 # seconds
LOG_QUERIES = False # Log all queries
MEMORY_TRACKING = False # Track memory usage
SAMPLE_RATE = 1.0 # 1.0 = all, 0.1 = 10%
BUFFER_SIZE = 1000 # Number of metrics
DASHBOARD_ENABLED = True # Enable web UI
```
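A sketch of how these could be read from the environment under the stated prefix (the exact wiring into StarPunk's config layer is illustrative):
```python
import os


def _env_bool(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")


MONITORING_ENABLED = _env_bool("STARPUNK_PERF_MONITORING_ENABLED", False)
SLOW_QUERY_THRESHOLD = float(os.environ.get("STARPUNK_PERF_SLOW_QUERY_THRESHOLD", "1.0"))
SAMPLE_RATE = float(os.environ.get("STARPUNK_PERF_SAMPLE_RATE", "1.0"))
BUFFER_SIZE = int(os.environ.get("STARPUNK_PERF_BUFFER_SIZE", "1000"))
DASHBOARD_ENABLED = _env_bool("STARPUNK_PERF_DASHBOARD_ENABLED", True)
```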
## Testing Strategy
1. **Unit Tests**: Mock collectors, verify metrics (see example below)
2. **Integration Tests**: End-to-end monitoring flow
3. **Performance Tests**: Verify low overhead
4. **Load Tests**: Behavior under stress
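As a concrete example of the unit-test level, a test against the sketches above might look like this (pytest-style, assuming that is the project's test runner):
```python
def test_track_query_records_metric():
    collector = MetricsCollector(buffer_size=10)
    monitor = DatabaseMonitor(collector)
    with monitor.track_query("SELECT", "notes"):
        pass  # stand-in for cursor.execute(...)
    (metric,) = collector.snapshot()
    assert metric.category == "db"
    assert metric.operation == "SELECT notes"
    assert metric.duration >= 0
```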
## Security Considerations
1. Dashboard requires admin authentication
2. No sensitive data in metrics
3. No external data transmission
4. Metrics cleared on logout
5. Rate limiting on dashboard endpoint
## Migration Path
No migration required - monitoring is opt-in via configuration.
## Future Enhancements
v1.2.0 and beyond:
- Metric export (CSV/JSON)
- Alert thresholds
- Historical trending
- Custom metric points
- Plugin architecture
## References
- [Python resource module](https://docs.python.org/3/library/resource.html)
- [SQLite Query Performance](https://www.sqlite.org/queryplanner.html)
- [Web Vitals](https://web.dev/vitals/)
## Document History
- 2025-11-25: Initial draft for v1.1.1 release planning