# Metrics Instrumentation Specification - v1.1.2

## Overview

This specification completes the metrics instrumentation foundation started in v1.1.1, adding comprehensive coverage for database operations, HTTP requests, memory monitoring, and business-specific syndication metrics.

## Requirements

### Functional Requirements

1. **Database Performance Metrics**
   - Time all database operations
   - Track query patterns and frequency
   - Detect slow queries (>1 second)
   - Monitor connection pool utilization
   - Count rows affected/returned

2. **HTTP Request/Response Metrics**
   - Full request lifecycle timing
   - Request and response size tracking
   - Status code distribution
   - Per-endpoint performance metrics
   - Client identification (user agent)

3. **Memory Monitoring**
   - Continuous RSS memory tracking
   - Memory growth detection
   - High water mark tracking
   - Garbage collection statistics
   - Leak detection algorithms

4. **Business Metrics**
   - Feed request counts by format
   - Cache hit/miss rates
   - Content publication rates
   - Syndication success tracking
   - Format popularity analysis

### Non-Functional Requirements

1. **Performance Impact**
   - Total overhead <1% when enabled
   - Zero impact when disabled
   - Efficient metric storage (<2MB)
   - Non-blocking collection

2. **Data Retention**
   - In-memory circular buffer
   - Last 1000 metrics retained
   - 15-minute detail window
   - Automatic cleanup

## Design

### Database Instrumentation

#### Connection Wrapper

```python
import re
import sqlite3
import time
from typing import Optional


class MonitoredConnection:
    """SQLite connection wrapper with performance monitoring"""

    def __init__(self, db_path: str, metrics_collector: MetricsCollector):
        self.conn = sqlite3.connect(db_path)
        self.metrics = metrics_collector

    def execute(self, query: str, params: Optional[tuple] = None) -> sqlite3.Cursor:
        """Execute query with timing"""
        query_type = self._get_query_type(query)
        table_name = self._extract_table_name(query)

        start_time = time.perf_counter()
        try:
            cursor = self.conn.execute(query, params or ())
            duration = time.perf_counter() - start_time

            # Record successful execution. cursor.rowcount is only meaningful
            # for INSERT/UPDATE/DELETE; counting SELECT rows here would consume
            # the cursor before the caller could iterate it, so SELECT row
            # counts (db.rows.returned) must be recorded at fetch time instead.
            self.metrics.record_database_operation(
                operation_type=query_type,
                table_name=table_name,
                duration_ms=duration * 1000,
                rows_affected=cursor.rowcount
            )

            # Check for slow query (1 second threshold)
            if duration > 1.0:
                self.metrics.record_slow_query(query, duration, params)

            return cursor

        except Exception as e:
            duration = time.perf_counter() - start_time
            self.metrics.record_database_error(query_type, table_name, str(e), duration * 1000)
            raise

    def _get_query_type(self, query: str) -> str:
        """Extract query type from SQL"""
        query_upper = query.strip().upper()
        for query_type in ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'CREATE', 'DROP']:
            if query_upper.startswith(query_type):
                return query_type
        return 'OTHER'

    def _extract_table_name(self, query: str) -> Optional[str]:
        """Extract primary table name from query"""
        # Simple regex patterns for common cases
        patterns = [
            r'FROM\s+(\w+)',
            r'INTO\s+(\w+)',
            r'UPDATE\s+(\w+)',
            r'DELETE\s+FROM\s+(\w+)'
        ]
        # Implementation details...
```
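The wrapper above relies on recorder methods (`record_database_operation`, `record_database_error`, `record_slow_query`) that the generic `MetricsCollector` defined under Implementation Details does not itself expose. The sketch below shows one possible mapping onto the collector's generic primitives; it is shown as a subclass only for illustration, the class name and the exact fan-out are assumptions, and only the metric names come from the tables in this spec.

```python
import time
from typing import Optional


class InstrumentedCollector(MetricsCollector):
    """Sketch: domain-specific recorders layered on the generic collector."""

    def record_database_operation(self, operation_type: str, table_name: str,
                                  duration_ms: float, rows_affected: int) -> None:
        self.increment('db.query.count')                             # total queries
        self.increment(f'db.query.count.{operation_type.lower()}')   # by type
        self.record_histogram('db.query.duration', duration_ms)
        if rows_affected >= 0:                                        # -1 for SELECT/DDL
            self.record_histogram('db.rows.affected', rows_affected)
        self.record_metric('database', operation_type, duration_ms,
                           metadata={'table': table_name})

    def record_database_error(self, operation_type: str, table_name: str,
                              error: str, duration_ms: float) -> None:
        self.increment('db.query.errors')
        self.record_metric('database_error', operation_type, duration_ms,
                           metadata={'table': table_name, 'error': error})

    def record_slow_query(self, query: str, duration: float,
                          params: Optional[tuple]) -> None:
        # Query parameters are deliberately not stored (see Security Considerations)
        self.slow_queries.append({'query': query,
                                  'duration_s': duration,
                                  'timestamp': time.time()})
```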
#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `db.query.duration` | Histogram | Query execution time in ms |
| `db.query.count` | Counter | Total queries by type |
| `db.query.errors` | Counter | Failed queries by type |
| `db.rows.affected` | Histogram | Rows modified per query |
| `db.rows.returned` | Histogram | Rows returned per SELECT |
| `db.slow_queries` | List | Queries exceeding threshold |
| `db.connection.active` | Gauge | Active connections |
| `db.transaction.duration` | Histogram | Transaction time in ms |

### HTTP Instrumentation

#### Request Middleware

```python
import time
import uuid

from flask import Flask, g, request


class HTTPMetricsMiddleware:
    """Flask middleware for HTTP metrics collection"""

    def __init__(self, app: Flask, metrics_collector: MetricsCollector):
        self.app = app
        self.metrics = metrics_collector
        self.setup_hooks()

    def setup_hooks(self):
        """Register Flask hooks for metrics"""

        @self.app.before_request
        def start_request_timer():
            """Initialize request metrics"""
            g.request_metrics = {
                'start_time': time.perf_counter(),
                'start_memory': self._get_memory_usage(),
                'request_id': str(uuid.uuid4()),
                'method': request.method,
                'endpoint': request.endpoint,
                'path': request.path,
                'content_length': request.content_length or 0
            }

        @self.app.after_request
        def record_response_metrics(response):
            """Record response metrics"""
            if not hasattr(g, 'request_metrics'):
                return response

            # Calculate metrics
            duration = time.perf_counter() - g.request_metrics['start_time']
            memory_delta = self._get_memory_usage() - g.request_metrics['start_memory']

            # Record to collector
            self.metrics.record_http_request(
                method=g.request_metrics['method'],
                endpoint=g.request_metrics['endpoint'],
                status_code=response.status_code,
                duration_ms=duration * 1000,
                request_size=g.request_metrics['content_length'],
                response_size=len(response.get_data()),
                memory_delta_mb=memory_delta
            )

            # Add timing header for debugging
            if self.app.config.get('DEBUG'):
                response.headers['X-Response-Time'] = f"{duration * 1000:.2f}ms"

            return response
```
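The middleware calls a `_get_memory_usage` helper that the spec does not show. A minimal standard-library sketch is below; the helper name matches the call sites above, but the implementation is an assumption. Note that `ru_maxrss` is the process's *peak* RSS (kilobytes on Linux, bytes on macOS), so per-request deltas computed from it only move when a request pushes the process to a new high-water mark.

```python
import resource


def current_memory_mb() -> float:
    """Possible backing for HTTPMetricsMiddleware._get_memory_usage.

    Returns a memory figure in MB (assuming Linux, where ru_maxrss is
    reported in kilobytes) for computing per-request deltas.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```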
#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `http.request.duration` | Histogram | Total request processing time |
| `http.request.count` | Counter | Requests by method and endpoint |
| `http.request.size` | Histogram | Request body size distribution |
| `http.response.size` | Histogram | Response body size distribution |
| `http.status.{code}` | Counter | Response status code counts |
| `http.endpoint.{name}.duration` | Histogram | Per-endpoint timing |
| `http.memory.delta` | Gauge | Memory change per request |

### Memory Monitoring

#### Background Monitor Thread

```python
import gc
import logging
import resource
import time
from threading import Thread

logger = logging.getLogger(__name__)


class MemoryMonitor(Thread):
    """Background thread for continuous memory monitoring"""

    def __init__(self, metrics_collector: MetricsCollector, interval: int = 10):
        super().__init__(daemon=True)
        self.metrics = metrics_collector
        self.interval = interval
        self.running = True
        self.baseline_memory = None
        self.high_water_mark = 0

    def run(self):
        """Main monitoring loop"""
        # Establish baseline after startup
        time.sleep(5)
        self.baseline_memory = self._get_memory_info()

        while self.running:
            try:
                memory_info = self._get_memory_info()

                # Update high water mark
                self.high_water_mark = max(self.high_water_mark, memory_info['rss'])

                # Calculate growth rate in MB/hour
                if self.baseline_memory:
                    growth_rate = (
                        (memory_info['rss'] - self.baseline_memory['rss'])
                        / (time.time() - self.baseline_memory['timestamp']) * 3600
                    )

                    # Detect potential leak (>10MB/hour growth)
                    if growth_rate > 10:
                        self.metrics.record_memory_leak_warning(growth_rate)

                # Record metrics
                self.metrics.record_memory_usage(
                    rss_mb=memory_info['rss'],
                    vms_mb=memory_info['vms'],
                    high_water_mb=self.high_water_mark,
                    gc_stats=self._get_gc_stats()
                )

            except Exception as e:
                logger.error(f"Memory monitoring error: {e}")

            time.sleep(self.interval)

    def _get_memory_info(self) -> dict:
        """Get current memory usage"""
        usage = resource.getrusage(resource.RUSAGE_SELF)
        return {
            'timestamp': time.time(),
            # ru_maxrss is the peak RSS, reported in KB on Linux (bytes on macOS)
            'rss': usage.ru_maxrss / 1024,
            # ru_idrss (unshared data size) is reported as 0 on most platforms;
            # a true virtual memory size would need a /proc-based reading
            'vms': usage.ru_idrss
        }

    def _get_gc_stats(self) -> dict:
        """Get garbage collection statistics"""
        return {
            'collections': gc.get_count(),
            # gc.collect(0) triggers a generation-0 collection as a side effect
            'collected': gc.collect(0),
            'uncollectable': len(gc.garbage)
        }
```

#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `memory.rss` | Gauge | Resident set size in MB |
| `memory.vms` | Gauge | Virtual memory size in MB |
| `memory.high_water` | Gauge | Maximum RSS observed |
| `memory.growth_rate` | Gauge | MB/hour growth rate |
| `gc.collections` | Counter | GC collection counts by generation |
| `gc.collected` | Counter | Objects collected |
| `gc.uncollectable` | Gauge | Uncollectable object count |

### Business Metrics

#### Syndication Metrics

```python
class SyndicationMetrics:
    """Business metrics specific to content syndication"""

    def __init__(self, metrics_collector: MetricsCollector):
        self.metrics = metrics_collector

    def record_feed_request(self, format: str, cached: bool, generation_time: float):
        """Record feed request metrics"""
        self.metrics.increment(f'feed.requests.{format}')

        if cached:
            self.metrics.increment('feed.cache.hits')
        else:
            self.metrics.increment('feed.cache.misses')

        self.metrics.record_histogram('feed.generation.time', generation_time * 1000)

    def record_content_negotiation(self, accept_header: str, selected_format: str):
        """Track content negotiation results"""
        self.metrics.increment(f'feed.negotiation.{selected_format}')

        # Track client preferences
        if 'json' in accept_header.lower():
            self.metrics.increment('feed.client.prefers_json')
        elif 'atom' in accept_header.lower():
            self.metrics.increment('feed.client.prefers_atom')

    def record_publication(self, note_length: int, has_media: bool):
        """Track content publication metrics"""
        self.metrics.increment('content.notes.published')
        self.metrics.record_histogram('content.note.length', note_length)

        if has_media:
            self.metrics.increment('content.notes.with_media')
```

#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `feed.requests.{format}` | Counter | Requests by feed format |
| `feed.cache.hits` | Counter | Cache hit count |
| `feed.cache.misses` | Counter | Cache miss count |
| `feed.cache.hit_rate` | Gauge | Cache hit percentage |
| `feed.generation.time` | Histogram | Feed generation duration |
| `feed.negotiation.{format}` | Counter | Format selection results |
| `content.notes.published` | Counter | Total notes published |
| `content.note.length` | Histogram | Note size distribution |
| `content.syndication.success` | Counter | Successful syndications |
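The `feed.cache.hit_rate` gauge in the table above is derived rather than recorded directly by `SyndicationMetrics`. One way to compute it from the hit/miss counters is sketched below; the helper name and call pattern are assumptions, not part of the existing API.

```python
def update_cache_hit_rate(metrics: MetricsCollector) -> None:
    """Hypothetical helper: derive feed.cache.hit_rate (a percentage) from the
    hit/miss counters, e.g. after each call to record_feed_request."""
    hits = metrics.counters.get('feed.cache.hits', 0)
    misses = metrics.counters.get('feed.cache.misses', 0)
    total = hits + misses
    if total:
        metrics.set_gauge('feed.cache.hit_rate', hits / total * 100)
```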
## Implementation Details

### Metrics Collector

```python
import time
from collections import defaultdict, deque


class MetricsCollector:
    """Central metrics collection and storage"""

    def __init__(self, buffer_size: int = 1000):
        self.buffer = deque(maxlen=buffer_size)
        self.counters = defaultdict(int)
        self.gauges = {}
        self.histograms = defaultdict(list)
        self.slow_queries = deque(maxlen=100)

    def record_metric(self, category: str, name: str, value: float, metadata: dict = None):
        """Record a generic metric"""
        metric = {
            'timestamp': time.time(),
            'category': category,
            'name': name,
            'value': value,
            'metadata': metadata or {}
        }
        self.buffer.append(metric)

    def increment(self, name: str, amount: int = 1):
        """Increment a counter"""
        self.counters[name] += amount

    def set_gauge(self, name: str, value: float):
        """Set a gauge value"""
        self.gauges[name] = value

    def record_histogram(self, name: str, value: float):
        """Add value to histogram"""
        self.histograms[name].append(value)

        # Keep only last 1000 values
        if len(self.histograms[name]) > 1000:
            self.histograms[name] = self.histograms[name][-1000:]

    def get_summary(self, window_seconds: int = 900) -> dict:
        """Get metrics summary for dashboard"""
        cutoff = time.time() - window_seconds
        recent = [m for m in self.buffer if m['timestamp'] > cutoff]

        summary = {
            'counters': dict(self.counters),
            'gauges': dict(self.gauges),
            'histograms': self._calculate_histogram_stats(),
            'recent_metrics': recent[-100:],  # Last 100 metrics
            'slow_queries': list(self.slow_queries)
        }
        return summary

    def _calculate_histogram_stats(self) -> dict:
        """Calculate statistics for histograms"""
        stats = {}
        for name, values in self.histograms.items():
            if values:
                sorted_values = sorted(values)
                # Percentiles use simple nearest-rank indexing, which is
                # adequate for dashboard display with up to 1000 samples
                stats[name] = {
                    'count': len(values),
                    'min': min(values),
                    'max': max(values),
                    'mean': sum(values) / len(values),
                    'p50': sorted_values[len(values) // 2],
                    'p95': sorted_values[int(len(values) * 0.95)],
                    'p99': sorted_values[int(len(values) * 0.99)]
                }
        return stats
```

## Configuration

### Environment Variables

```ini
# Metrics collection toggles
STARPUNK_METRICS_ENABLED=true
STARPUNK_METRICS_DB_TIMING=true
STARPUNK_METRICS_HTTP_TIMING=true
STARPUNK_METRICS_MEMORY_MONITOR=true
STARPUNK_METRICS_BUSINESS=true

# Thresholds
STARPUNK_METRICS_SLOW_QUERY_THRESHOLD=1.0  # seconds
STARPUNK_METRICS_MEMORY_LEAK_THRESHOLD=10  # MB/hour

# Storage
STARPUNK_METRICS_BUFFER_SIZE=1000
STARPUNK_METRICS_RETENTION_SECONDS=900  # 15 minutes

# Monitoring intervals
STARPUNK_METRICS_MEMORY_INTERVAL=10  # seconds
```

## Testing Strategy

### Unit Tests

1. **Collector Tests**

   ```python
   def test_metrics_buffer_circular():
       collector = MetricsCollector(buffer_size=10)
       for i in range(20):
           collector.record_metric('test', 'metric', i)
       assert len(collector.buffer) == 10
       assert collector.buffer[0]['value'] == 10  # Oldest is 10, not 0
   ```

2. **Instrumentation Tests**

   ```python
   def test_database_timing():
       conn = MonitoredConnection(':memory:', collector)
       conn.execute('CREATE TABLE test (id INTEGER)')

       metrics = collector.get_summary()
       assert 'db.query.duration' in metrics['histograms']
       assert metrics['counters']['db.query.count'] == 1
   ```
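The nearest-rank percentile indexing in `_calculate_histogram_stats` is easy to get subtly wrong, so a further unit-test sketch may be worth adding; it is illustrative only and not part of the existing suite, with sample values and expected indices chosen purely for demonstration.

```python
def test_histogram_percentiles():
    collector = MetricsCollector()
    for value in range(1, 101):          # 100 samples: 1..100 ms
        collector.record_histogram('db.query.duration', value)

    stats = collector.get_summary()['histograms']['db.query.duration']
    assert stats['count'] == 100
    assert stats['min'] == 1 and stats['max'] == 100
    assert stats['p50'] == 51            # nearest-rank index 100 // 2
    assert stats['p95'] == 96            # nearest-rank index int(100 * 0.95)
    assert stats['p99'] == 100           # nearest-rank index int(100 * 0.99)
```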
### Integration Tests

1. **End-to-End Request Tracking**

   ```python
   def test_request_metrics():
       response = client.get('/feed.xml')

       metrics = app.metrics_collector.get_summary()
       assert 'http.request.duration' in metrics['histograms']
       assert metrics['counters']['http.status.200'] > 0
   ```

2. **Memory Leak Detection**

   ```python
   def test_memory_monitoring():
       monitor = MemoryMonitor(collector)
       monitor.start()

       # Simulate memory growth
       large_list = [0] * 1000000
       time.sleep(15)

       metrics = collector.get_summary()
       assert metrics['gauges']['memory.rss'] > 0

       monitor.running = False
   ```

## Performance Benchmarks

### Overhead Measurement

```python
def benchmark_instrumentation_overhead():
    # Baseline without instrumentation
    config.METRICS_ENABLED = False
    start = time.perf_counter()
    for _ in range(1000):
        execute_operation()
    baseline = time.perf_counter() - start

    # With instrumentation
    config.METRICS_ENABLED = True
    start = time.perf_counter()
    for _ in range(1000):
        execute_operation()
    instrumented = time.perf_counter() - start

    overhead_percent = ((instrumented - baseline) / baseline) * 100
    assert overhead_percent < 1.0  # Less than 1% overhead
```

## Security Considerations

1. **No Sensitive Data**: Never log query parameters that might contain passwords
2. **Rate Limiting**: Metrics endpoints should be rate-limited
3. **Access Control**: Metrics dashboard requires admin authentication
4. **Data Sanitization**: Escape all user-provided data in metrics (see the sketch at the end of this document)

## Migration Notes

### From v1.1.1

- Existing performance monitoring configuration remains compatible
- New metrics are additive, no breaking changes
- Dashboard enhanced but backward compatible

## Acceptance Criteria

1. ✅ All database operations are timed
2. ✅ HTTP requests fully instrumented
3. ✅ Memory monitoring thread operational
4. ✅ Business metrics for syndication tracked
5. ✅ Performance overhead <1%
6. ✅ Metrics dashboard shows all new data
7. ✅ Slow query detection working
8. ✅ Memory leak detection functional
9. ✅ All metrics properly documented
10. ✅ Security review passed
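To illustrate security consideration 4, the following is a minimal sketch of how user-supplied values (such as the user-agent string used for client identification) could be sanitized before being recorded; the helper name and the character rules are assumptions, not part of the spec.

```python
import re

# Hypothetical helper: restrict user-supplied metric labels to a safe subset
# before they are stored in metric metadata or rendered on the dashboard.
_UNSAFE_LABEL_CHARS = re.compile(r'[^A-Za-z0-9 ._/-]')


def sanitize_metric_label(value: str, max_length: int = 200) -> str:
    """Replace markup/control characters and cap length for user-provided data
    (e.g. User-Agent headers) before recording it as metric metadata."""
    return _UNSAFE_LABEL_CHARS.sub('_', value)[:max_length]
```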