# ADR-053: Performance Monitoring Strategy

## Status

Accepted

## Context

StarPunk v1.1.1 introduces performance monitoring to help operators understand system behavior in production. Currently, we have no visibility into:

- Database query performance
- Memory usage patterns
- Request processing times
- Bottlenecks and slow operations

We need a lightweight, zero-dependency monitoring solution that provides actionable insights without impacting performance.

## Decision

Implement a built-in performance monitoring system using Python's standard library, with optional detailed tracking controlled by configuration.

### Architecture Overview

```
Request → Middleware (timing) → Handler
               ↓                    ↓
        Context Manager        Decorators
               ↓                    ↓
         Metrics Store   ←   Database Hooks
               ↓
        Admin Dashboard
```

### Core Components

#### 1. Metrics Collector

Location: `starpunk/monitoring/collector.py`

Responsibilities:

- Collect timing data
- Track memory usage
- Store recent metrics in memory
- Provide aggregation functions

Data Structure:

```python
@dataclass
class Metric:
    timestamp: float
    category: str    # "db", "http", "function"
    operation: str   # specific operation name
    duration: float  # in seconds
    metadata: dict   # additional context
```

#### 2. Database Performance Tracking

Location: `starpunk/monitoring/db_monitor.py`

Features:

- Query execution timing
- Slow query detection
- Query pattern analysis
- Connection pool monitoring

Implementation via SQLite callbacks:

```python
# Wrap database operations
with monitor.track_query("SELECT", "notes"):
    cursor.execute(query)
```

#### 3. Memory Tracking

Location: `starpunk/monitoring/memory.py`

Track:

- Process memory (RSS)
- Memory growth over time
- Per-request memory delta
- Memory high water mark

Uses `resource` module (stdlib).

#### 4. Request Performance

Location: `starpunk/monitoring/http.py`

Track:

- Request processing time
- Response size
- Status code distribution
- Slowest endpoints

#### 5. Admin Dashboard

Location: `/admin/performance`

Display:

- Real-time metrics (last 15 minutes)
- Slow query log
- Memory usage graph
- Endpoint performance table
- Database statistics

### Data Retention

In-memory circular buffer approach (sketched below):

- Last 1000 metrics retained
- Automatic old data eviction
- No persistent storage (privacy/simplicity)
- Reset on restart
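The collector and retention policy above can be combined in a small amount of stdlib-only code. The following is a minimal sketch, not the shipped implementation: the `MetricsCollector` class, its method names, and the use of `time.perf_counter()` are illustrative assumptions layered on top of the `Metric` dataclass and the circular-buffer policy described in this ADR.

```python
# Hypothetical sketch of starpunk/monitoring/collector.py; names other than
# Metric are illustrative assumptions, not the actual StarPunk API.
import time
from collections import deque
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Metric:
    timestamp: float
    category: str     # "db", "http", "function"
    operation: str    # specific operation name
    duration: float   # in seconds
    metadata: dict = field(default_factory=dict)  # additional context


class MetricsCollector:
    """Keeps only the most recent metrics in a fixed-size ring buffer."""

    def __init__(self, buffer_size: int = 1000):
        # deque(maxlen=...) evicts the oldest entry automatically, which
        # gives the "last N metrics, reset on restart" retention for free.
        self._buffer = deque(maxlen=buffer_size)

    def record(self, category, operation, duration, **metadata):
        """Store one finished measurement."""
        self._buffer.append(
            Metric(time.time(), category, operation, duration, metadata)
        )

    @contextmanager
    def track(self, category, operation, **metadata):
        """Time a block of code and record the result, even on error."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(category, operation, time.perf_counter() - start, **metadata)

    def slowest(self, n=10):
        """Example aggregation: the n slowest operations currently buffered."""
        return sorted(self._buffer, key=lambda m: m.duration, reverse=True)[:n]
```

Usage would mirror the `track_query` example above, e.g. `with collector.track("db", "SELECT notes"): cursor.execute(query)`; the database, memory, and HTTP trackers would all feed the same `record()` path so the admin dashboard reads from a single buffer.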
### Performance Overhead

Target: <1% overhead when enabled

Strategies:

- Sampling for high-frequency operations
- Lazy computation of aggregates
- Minimal memory footprint (1MB max)
- Conditional instrumentation via config

## Rationale

### Why Built-in Monitoring?

1. **Zero Dependencies**: Uses only the Python stdlib
2. **Privacy**: No external services
3. **Simplicity**: No complex setup
4. **Integrated**: Direct access to internals
5. **Lightweight**: Minimal overhead

### Why Not External Tools?

**Prometheus/Grafana**:

- Requires external services
- Complex setup
- Overkill for a single-user system

**APM Services** (New Relic, DataDog):

- Privacy concerns
- Subscription costs
- Network dependency
- Too heavy for our needs

**OpenTelemetry**:

- Large dependency
- Complex configuration
- Designed for distributed systems

### Design Principles

1. **Opt-in**: Disabled by default
2. **Lightweight**: Minimal resource usage
3. **Actionable**: Focus on useful metrics
4. **Temporary**: No permanent storage
5. **Private**: No external data transmission

## Consequences

### Positive

1. **Production Visibility**: Understand behavior under load
2. **Performance Debugging**: Identify bottlenecks quickly
3. **No Dependencies**: Pure Python solution
4. **Privacy Preserving**: Data stays local
5. **Simple Deployment**: No additional services

### Negative

1. **Limited History**: Only recent data available
2. **Memory Usage**: ~1MB for the metrics buffer
3. **No Alerting**: Manual monitoring required
4. **Single Node**: No distributed tracing

### Mitigations

1. Export capability for external tools
2. Configurable buffer size
3. Webhook support for alerts (future)
4. Focus on the most valuable metrics

## Alternatives Considered

### 1. Logging-based Monitoring

**Approach**: Parse performance data from logs
**Pros**: Simple, no new code
**Cons**: Log parsing complexity, no real-time view
**Decision**: Dedicated monitoring is cleaner

### 2. External Monitoring Service

**Approach**: Use a service like Sentry
**Pros**: Full-featured, alerting included
**Cons**: Privacy, cost, complexity
**Decision**: Violates the self-hosted principle

### 3. Prometheus Exporter

**Approach**: Expose a /metrics endpoint
**Pros**: Standard, good tooling
**Cons**: Requires Prometheus setup
**Decision**: Too complex for target users

### 4. No Monitoring

**Approach**: Rely on logs and external tools
**Pros**: Simplest
**Cons**: Poor production visibility
**Decision**: v1.1.1 specifically targets production readiness

## Implementation Details

### Instrumentation Points

1. **Database Layer**
   - All queries automatically timed
   - Connection acquisition/release
   - Transaction duration
   - Migration execution

2. **HTTP Layer**
   - Middleware wraps all requests
   - Per-endpoint timing
   - Static file serving
   - Error handling

3. **Core Functions**
   - Note creation/update
   - Search operations
   - RSS generation
   - Authentication flow

### Performance Dashboard Layout

```
Performance Dashboard
═════════════════════

Overview
--------
Uptime: 5d 3h 15m
Requests: 10,234
Avg Response: 45ms
Memory: 128MB

Slow Queries (>1s)
------------------
[timestamp] SELECT ... FROM notes (1.2s)
[timestamp] UPDATE ... SET ... (1.1s)

Endpoint Performance
--------------------
GET /          : avg 23ms,  p99 45ms
GET /notes/:id : avg 35ms,  p99 67ms
POST /micropub : avg 125ms, p99 234ms

Memory Usage
------------
[ASCII graph showing last 15 minutes]

Database Stats
--------------
Pool Size: 3/5
Queries/sec: 4.2
Cache Hit Rate: 87%
```

### Configuration Options

```python
# All under STARPUNK_PERF_* prefix
MONITORING_ENABLED = False   # Master switch
SLOW_QUERY_THRESHOLD = 1.0   # seconds
LOG_QUERIES = False          # Log all queries
MEMORY_TRACKING = False      # Track memory usage
SAMPLE_RATE = 1.0            # 1.0 = all, 0.1 = 10%
BUFFER_SIZE = 1000           # Number of metrics
DASHBOARD_ENABLED = True     # Enable web UI
```

## Testing Strategy

1. **Unit Tests**: Mock collectors, verify metrics
2. **Integration Tests**: End-to-end monitoring flow
3. **Performance Tests**: Verify low overhead
4. **Load Tests**: Behavior under stress

## Security Considerations

1. Dashboard requires admin authentication
2. No sensitive data in metrics
3. No external data transmission
4. Metrics cleared on logout
5. Rate limiting on the dashboard endpoint

## Migration Path

No migration is required; monitoring is opt-in via configuration.

## Future Enhancements

v1.2.0 and beyond:

- Metric export (CSV/JSON)
- Alert thresholds
- Historical trending
- Custom metric points
- Plugin architecture

## References

- [Python resource module](https://docs.python.org/3/library/resource.html)
- [SQLite Query Performance](https://www.sqlite.org/queryplanner.html)
- [Web Vitals](https://web.dev/vitals/)

## Document History

- 2025-11-25: Initial draft for v1.1.1 release planning