# Metrics Instrumentation Specification - v1.1.2

## Overview

This specification completes the metrics instrumentation foundation started in v1.1.1, adding comprehensive coverage for database operations, HTTP requests, memory monitoring, and business-specific syndication metrics.

## Requirements

### Functional Requirements

1. **Database Performance Metrics**
   - Time all database operations
   - Track query patterns and frequency
   - Detect slow queries (>1 second)
   - Monitor connection pool utilization
   - Count rows affected/returned

2. **HTTP Request/Response Metrics**
   - Full request lifecycle timing
   - Request and response size tracking
   - Status code distribution
   - Per-endpoint performance metrics
   - Client identification (user agent)

3. **Memory Monitoring**
   - Continuous RSS memory tracking
   - Memory growth detection
   - High water mark tracking
   - Garbage collection statistics
   - Leak detection algorithms

4. **Business Metrics**
   - Feed request counts by format
   - Cache hit/miss rates
   - Content publication rates
   - Syndication success tracking
   - Format popularity analysis

### Non-Functional Requirements

1. **Performance Impact**
   - Total overhead <1% when enabled
   - Zero impact when disabled
   - Efficient metric storage (<2MB)
   - Non-blocking collection

2. **Data Retention**
   - In-memory circular buffer
   - Last 1000 metrics retained
   - 15-minute detail window
   - Automatic cleanup (see the retention sketch below)
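
A bounded `collections.deque` plus timestamp-based pruning is enough to satisfy these retention requirements. The sketch below is illustrative only; the `prune` helper is an assumption and is not part of the collector API defined later in this document.

```python
import time
from collections import deque

RETENTION_SECONDS = 900          # 15-minute detail window
buffer = deque(maxlen=1000)      # last 1000 metrics; oldest entries evicted automatically

def record(name: str, value: float) -> None:
    buffer.append({'timestamp': time.time(), 'name': name, 'value': value})

def prune() -> None:
    """Automatic cleanup: drop entries older than the detail window."""
    cutoff = time.time() - RETENTION_SECONDS
    while buffer and buffer[0]['timestamp'] < cutoff:
        buffer.popleft()
```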
## Design

### Database Instrumentation

#### Connection Wrapper

```python
import re
import sqlite3
import time
from typing import Optional


class MonitoredConnection:
    """SQLite connection wrapper with performance monitoring"""

    def __init__(self, db_path: str, metrics_collector: MetricsCollector):
        self.conn = sqlite3.connect(db_path)
        self.metrics = metrics_collector

    def execute(self, query: str, params: Optional[tuple] = None) -> sqlite3.Cursor:
        """Execute query with timing"""
        query_type = self._get_query_type(query)
        table_name = self._extract_table_name(query)

        start_time = time.perf_counter()
        try:
            cursor = self.conn.execute(query, params or ())
            duration = time.perf_counter() - start_time

            # Record successful execution. Note: sqlite3 reports rowcount as -1
            # for SELECTs, and counting returned rows here would exhaust the
            # cursor before the caller sees it, so returned-row counts are left
            # to the call site.
            self.metrics.record_database_operation(
                operation_type=query_type,
                table_name=table_name,
                duration_ms=duration * 1000,
                rows_affected=cursor.rowcount
            )

            # Check for slow query
            if duration > 1.0:
                self.metrics.record_slow_query(query, duration, params)

            return cursor

        except Exception as e:
            duration = time.perf_counter() - start_time
            self.metrics.record_database_error(query_type, table_name, str(e), duration * 1000)
            raise

    def _get_query_type(self, query: str) -> str:
        """Extract query type from SQL"""
        query_upper = query.strip().upper()
        for query_type in ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'CREATE', 'DROP']:
            if query_upper.startswith(query_type):
                return query_type
        return 'OTHER'

    def _extract_table_name(self, query: str) -> Optional[str]:
        """Extract primary table name from query"""
        # Simple regex patterns for common cases
        patterns = [
            r'FROM\s+(\w+)',
            r'INTO\s+(\w+)',
            r'UPDATE\s+(\w+)',
            r'DELETE\s+FROM\s+(\w+)'
        ]
        for pattern in patterns:
            match = re.search(pattern, query, re.IGNORECASE)
            if match:
                return match.group(1)
        return None
```

#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `db.query.duration` | Histogram | Query execution time in ms |
| `db.query.count` | Counter | Total queries by type |
| `db.query.errors` | Counter | Failed queries by type |
| `db.rows.affected` | Histogram | Rows modified per query |
| `db.rows.returned` | Histogram | Rows returned per SELECT |
| `db.slow_queries` | List | Queries exceeding threshold |
| `db.connection.active` | Gauge | Active connections |
| `db.transaction.duration` | Histogram | Transaction time in ms |
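
The wrapper calls `record_database_operation`, `record_slow_query`, and `record_database_error`, which this spec references but does not define. The sketch below shows one plausible mapping onto the collector's generic primitives (`increment`, `record_histogram`, `record_metric`), with names taken from the table above; the method bodies themselves are assumptions, not the approved implementation.

```python
import time
from typing import Optional


class DatabaseMetricsMixin:
    """Hypothetical helpers layered over MetricsCollector's primitives."""

    def record_database_operation(self, operation_type: str, table_name: Optional[str],
                                  duration_ms: float, rows_affected: int) -> None:
        self.increment('db.query.count')
        self.record_histogram('db.query.duration', duration_ms)
        if rows_affected >= 0:
            self.record_histogram('db.rows.affected', rows_affected)
        self.record_metric('database', operation_type, duration_ms,
                           metadata={'table': table_name})

    def record_slow_query(self, query: str, duration_seconds: float,
                          params: Optional[tuple]) -> None:
        # Store query text and timing only; parameter values are never stored
        # (see Security Considerations).
        self.slow_queries.append({
            'query': query,
            'duration_ms': duration_seconds * 1000,
            'param_count': len(params) if params else 0,
            'timestamp': time.time(),
        })

    def record_database_error(self, operation_type: str, table_name: Optional[str],
                              error: str, duration_ms: float) -> None:
        self.increment('db.query.errors')
        self.record_metric('database_error', operation_type, duration_ms,
                           metadata={'table': table_name, 'error': error})
```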
### HTTP Instrumentation

#### Request Middleware

```python
import time
import uuid

from flask import Flask, g, request


class HTTPMetricsMiddleware:
    """Flask middleware for HTTP metrics collection"""

    def __init__(self, app: Flask, metrics_collector: MetricsCollector):
        self.app = app
        self.metrics = metrics_collector
        self.setup_hooks()

    def setup_hooks(self):
        """Register Flask hooks for metrics"""

        @self.app.before_request
        def start_request_timer():
            """Initialize request metrics"""
            g.request_metrics = {
                'start_time': time.perf_counter(),
                'start_memory': self._get_memory_usage(),
                'request_id': str(uuid.uuid4()),
                'method': request.method,
                'endpoint': request.endpoint,
                'path': request.path,
                'content_length': request.content_length or 0
            }

        @self.app.after_request
        def record_response_metrics(response):
            """Record response metrics"""
            if not hasattr(g, 'request_metrics'):
                return response

            # Calculate metrics
            duration = time.perf_counter() - g.request_metrics['start_time']
            memory_delta = self._get_memory_usage() - g.request_metrics['start_memory']

            # Record to collector
            self.metrics.record_http_request(
                method=g.request_metrics['method'],
                endpoint=g.request_metrics['endpoint'],
                status_code=response.status_code,
                duration_ms=duration * 1000,
                request_size=g.request_metrics['content_length'],
                response_size=len(response.get_data()),
                memory_delta_mb=memory_delta
            )

            # Add timing header for debugging
            if self.app.config.get('DEBUG'):
                response.headers['X-Response-Time'] = f"{duration * 1000:.2f}ms"

            return response

    def _get_memory_usage(self) -> float:
        """RSS in MB via ru_maxrss (peak RSS, reported in KB on Linux);
        mirrors the memory monitor's read, so the per-request delta is only
        non-zero when the peak grows."""
        import resource
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```

#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `http.request.duration` | Histogram | Total request processing time |
| `http.request.count` | Counter | Requests by method and endpoint |
| `http.request.size` | Histogram | Request body size distribution |
| `http.response.size` | Histogram | Response body size distribution |
| `http.status.{code}` | Counter | Response status code counts |
| `http.endpoint.{name}.duration` | Histogram | Per-endpoint timing |
| `http.memory.delta` | Gauge | Memory change per request |
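
Registering the middleware is done once in the application factory. The sketch below assumes a `create_app` factory and a `METRICS_ENABLED` config key (mapped from `STARPUNK_METRICS_ENABLED`), neither of which is mandated here; the `app.metrics_collector` attribute matches the integration tests later in this document.

```python
from flask import Flask

def create_app() -> Flask:
    # Hypothetical app-factory wiring; the factory and config key names
    # are illustrative, not part of the approved design.
    app = Flask(__name__)

    if app.config.get('METRICS_ENABLED', True):
        collector = MetricsCollector(buffer_size=1000)
        HTTPMetricsMiddleware(app, collector)   # registers before/after request hooks
        app.metrics_collector = collector       # referenced by the integration tests

    return app
```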
### Memory Monitoring

#### Background Monitor Thread

```python
import gc
import logging
import resource
import time
from threading import Thread

logger = logging.getLogger(__name__)


class MemoryMonitor(Thread):
    """Background thread for continuous memory monitoring"""

    def __init__(self, metrics_collector: MetricsCollector, interval: int = 10):
        super().__init__(daemon=True)
        self.metrics = metrics_collector
        self.interval = interval
        self.running = True
        self.baseline_memory = None
        self.high_water_mark = 0

    def run(self):
        """Main monitoring loop"""
        # Establish baseline after startup
        time.sleep(5)
        self.baseline_memory = self._get_memory_info()

        while self.running:
            try:
                memory_info = self._get_memory_info()

                # Update high water mark
                self.high_water_mark = max(self.high_water_mark, memory_info['rss'])

                # Calculate growth rate in MB/hour
                if self.baseline_memory:
                    elapsed = time.time() - self.baseline_memory['timestamp']
                    growth_rate = ((memory_info['rss'] - self.baseline_memory['rss'])
                                   / elapsed * 3600)

                    # Detect potential leak (>10MB/hour growth)
                    if growth_rate > 10:
                        self.metrics.record_memory_leak_warning(growth_rate)

                # Record metrics
                self.metrics.record_memory_usage(
                    rss_mb=memory_info['rss'],
                    vms_mb=memory_info['vms'],
                    high_water_mb=self.high_water_mark,
                    gc_stats=self._get_gc_stats()
                )

            except Exception as e:
                logger.error(f"Memory monitoring error: {e}")

            time.sleep(self.interval)

    def _get_memory_info(self) -> dict:
        """Get current memory usage"""
        usage = resource.getrusage(resource.RUSAGE_SELF)
        return {
            'timestamp': time.time(),
            'rss': usage.ru_maxrss / 1024,  # ru_maxrss is in KB on Linux; convert to MB
            'vms': usage.ru_idrss  # often reported as 0 on Linux; kept for completeness
        }

    def _get_gc_stats(self) -> dict:
        """Get garbage collection statistics"""
        return {
            'collections': gc.get_count(),
            'collected': gc.collect(0),  # note: triggers a gen-0 collection each sample
            'uncollectable': len(gc.garbage)
        }
```

#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `memory.rss` | Gauge | Resident set size in MB |
| `memory.vms` | Gauge | Virtual memory size in MB |
| `memory.high_water` | Gauge | Maximum RSS observed |
| `memory.growth_rate` | Gauge | MB/hour growth rate |
| `gc.collections` | Counter | GC collection counts by generation |
| `gc.collected` | Counter | Objects collected |
| `gc.uncollectable` | Gauge | Uncollectable object count |
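
The monitor is constructed once at startup and gated on configuration. The sketch below assumes `METRICS_MEMORY_MONITOR` and `METRICS_MEMORY_INTERVAL` config keys mapped from the environment variables in the Configuration section; the helper name and wiring are illustrative.

```python
from typing import Optional

def start_memory_monitor(app, collector: MetricsCollector) -> Optional[MemoryMonitor]:
    """Illustrative startup wiring; config key names are assumptions."""
    if not app.config.get('METRICS_MEMORY_MONITOR', True):
        return None

    monitor = MemoryMonitor(
        collector,
        interval=int(app.config.get('METRICS_MEMORY_INTERVAL', 10)),
    )
    monitor.start()               # daemon thread, so it will not block shutdown
    app.memory_monitor = monitor  # keep a handle; set running = False to stop it
    return monitor
```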
### Business Metrics

#### Syndication Metrics

```python
class SyndicationMetrics:
    """Business metrics specific to content syndication"""

    def __init__(self, metrics_collector: MetricsCollector):
        self.metrics = metrics_collector

    def record_feed_request(self, format: str, cached: bool, generation_time: float):
        """Record feed request metrics"""
        self.metrics.increment(f'feed.requests.{format}')

        if cached:
            self.metrics.increment('feed.cache.hits')
        else:
            self.metrics.increment('feed.cache.misses')
            self.metrics.record_histogram('feed.generation.time', generation_time * 1000)

    def record_content_negotiation(self, accept_header: str, selected_format: str):
        """Track content negotiation results"""
        self.metrics.increment(f'feed.negotiation.{selected_format}')

        # Track client preferences
        if 'json' in accept_header.lower():
            self.metrics.increment('feed.client.prefers_json')
        elif 'atom' in accept_header.lower():
            self.metrics.increment('feed.client.prefers_atom')

    def record_publication(self, note_length: int, has_media: bool):
        """Track content publication metrics"""
        self.metrics.increment('content.notes.published')
        self.metrics.record_histogram('content.note.length', note_length)

        if has_media:
            self.metrics.increment('content.notes.with_media')
```

#### Metrics Collected

| Metric | Type | Description |
|--------|------|-------------|
| `feed.requests.{format}` | Counter | Requests by feed format |
| `feed.cache.hits` | Counter | Cache hit count |
| `feed.cache.misses` | Counter | Cache miss count |
| `feed.cache.hit_rate` | Gauge | Cache hit percentage |
| `feed.generation.time` | Histogram | Feed generation duration |
| `feed.negotiation.{format}` | Counter | Format selection results |
| `content.notes.published` | Counter | Total notes published |
| `content.note.length` | Histogram | Note size distribution |
| `content.syndication.success` | Counter | Successful syndications |
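
The `feed.cache.hit_rate` gauge in the table is derived rather than recorded directly. A minimal sketch of updating it from the hit/miss counters is shown below; the helper name and call site (per request or during summary generation) are assumptions.

```python
def update_cache_hit_rate(metrics: MetricsCollector) -> None:
    # Illustrative derivation of feed.cache.hit_rate from the counters above.
    hits = metrics.counters.get('feed.cache.hits', 0)
    misses = metrics.counters.get('feed.cache.misses', 0)
    total = hits + misses
    if total:
        metrics.set_gauge('feed.cache.hit_rate', 100.0 * hits / total)
```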
## Implementation Details

### Metrics Collector

```python
import time
from collections import defaultdict, deque


class MetricsCollector:
    """Central metrics collection and storage"""

    def __init__(self, buffer_size: int = 1000):
        self.buffer = deque(maxlen=buffer_size)
        self.counters = defaultdict(int)
        self.gauges = {}
        self.histograms = defaultdict(list)
        self.slow_queries = deque(maxlen=100)

    def record_metric(self, category: str, name: str, value: float, metadata: dict = None):
        """Record a generic metric"""
        metric = {
            'timestamp': time.time(),
            'category': category,
            'name': name,
            'value': value,
            'metadata': metadata or {}
        }
        self.buffer.append(metric)

    def increment(self, name: str, amount: int = 1):
        """Increment a counter"""
        self.counters[name] += amount

    def set_gauge(self, name: str, value: float):
        """Set a gauge value"""
        self.gauges[name] = value

    def record_histogram(self, name: str, value: float):
        """Add value to histogram"""
        self.histograms[name].append(value)
        # Keep only last 1000 values
        if len(self.histograms[name]) > 1000:
            self.histograms[name] = self.histograms[name][-1000:]

    def get_summary(self, window_seconds: int = 900) -> dict:
        """Get metrics summary for dashboard"""
        cutoff = time.time() - window_seconds
        recent = [m for m in self.buffer if m['timestamp'] > cutoff]

        summary = {
            'counters': dict(self.counters),
            'gauges': dict(self.gauges),
            'histograms': self._calculate_histogram_stats(),
            'recent_metrics': recent[-100:],  # Last 100 metrics
            'slow_queries': list(self.slow_queries)
        }

        return summary

    def _calculate_histogram_stats(self) -> dict:
        """Calculate statistics for histograms"""
        stats = {}
        for name, values in self.histograms.items():
            if values:
                sorted_values = sorted(values)
                stats[name] = {
                    'count': len(values),
                    'min': min(values),
                    'max': max(values),
                    'mean': sum(values) / len(values),
                    'p50': sorted_values[len(values) // 2],
                    'p95': sorted_values[int(len(values) * 0.95)],
                    'p99': sorted_values[int(len(values) * 0.99)]
                }
        return stats
```
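
`get_summary()` is the interface the dashboard consumes. A hedged sketch of exposing it as a JSON endpoint follows; the blueprint, route, and omitted admin/rate-limiting layers are assumptions — only the `get_summary()` call is defined by this spec.

```python
from flask import Blueprint, current_app, jsonify

# Hypothetical dashboard endpoint; admin authentication and rate limiting
# (see Security Considerations) are intentionally not shown here.
metrics_bp = Blueprint('metrics', __name__)

@metrics_bp.route('/admin/metrics/summary')
def metrics_summary():
    collector = current_app.metrics_collector
    return jsonify(collector.get_summary(window_seconds=900))
```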
## Configuration

### Environment Variables

```ini
# Metrics collection toggles
STARPUNK_METRICS_ENABLED=true
STARPUNK_METRICS_DB_TIMING=true
STARPUNK_METRICS_HTTP_TIMING=true
STARPUNK_METRICS_MEMORY_MONITOR=true
STARPUNK_METRICS_BUSINESS=true

# Thresholds
STARPUNK_METRICS_SLOW_QUERY_THRESHOLD=1.0  # seconds
STARPUNK_METRICS_MEMORY_LEAK_THRESHOLD=10  # MB/hour

# Storage
STARPUNK_METRICS_BUFFER_SIZE=1000
STARPUNK_METRICS_RETENTION_SECONDS=900  # 15 minutes

# Monitoring intervals
STARPUNK_METRICS_MEMORY_INTERVAL=10  # seconds
```
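
These variables arrive as strings and need to be coerced to booleans and numbers when loaded. A minimal loader sketch is below; the helper name and the config key names (the `STARPUNK_` prefix stripped) are assumptions, not part of the approved configuration contract.

```python
import os

def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable ('true'/'1'/'yes', case-insensitive)."""
    return os.environ.get(name, str(default)).strip().lower() in ('1', 'true', 'yes')

def load_metrics_config() -> dict:
    # Illustrative mapping from the environment variables above to typed values.
    return {
        'METRICS_ENABLED': _env_bool('STARPUNK_METRICS_ENABLED', True),
        'METRICS_MEMORY_MONITOR': _env_bool('STARPUNK_METRICS_MEMORY_MONITOR', True),
        'METRICS_SLOW_QUERY_THRESHOLD': float(os.environ.get('STARPUNK_METRICS_SLOW_QUERY_THRESHOLD', '1.0')),
        'METRICS_MEMORY_LEAK_THRESHOLD': float(os.environ.get('STARPUNK_METRICS_MEMORY_LEAK_THRESHOLD', '10')),
        'METRICS_BUFFER_SIZE': int(os.environ.get('STARPUNK_METRICS_BUFFER_SIZE', '1000')),
        'METRICS_RETENTION_SECONDS': int(os.environ.get('STARPUNK_METRICS_RETENTION_SECONDS', '900')),
        'METRICS_MEMORY_INTERVAL': int(os.environ.get('STARPUNK_METRICS_MEMORY_INTERVAL', '10')),
    }
```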
## Testing Strategy

### Unit Tests

1. **Collector Tests**

   ```python
   def test_metrics_buffer_circular():
       collector = MetricsCollector(buffer_size=10)
       for i in range(20):
           collector.record_metric('test', 'metric', i)
       assert len(collector.buffer) == 10
       assert collector.buffer[0]['value'] == 10  # Oldest is 10, not 0
   ```

2. **Instrumentation Tests**

   ```python
   def test_database_timing():
       conn = MonitoredConnection(':memory:', collector)
       conn.execute('CREATE TABLE test (id INTEGER)')

       metrics = collector.get_summary()
       assert 'db.query.duration' in metrics['histograms']
       assert metrics['counters']['db.query.count'] == 1
   ```

### Integration Tests

1. **End-to-End Request Tracking**

   ```python
   def test_request_metrics():
       response = client.get('/feed.xml')

       metrics = app.metrics_collector.get_summary()
       assert 'http.request.duration' in metrics['histograms']
       assert metrics['counters']['http.status.200'] > 0
   ```

2. **Memory Leak Detection**

   ```python
   def test_memory_monitoring():
       monitor = MemoryMonitor(collector)
       monitor.start()

       # Simulate memory growth
       large_list = [0] * 1000000
       time.sleep(15)

       metrics = collector.get_summary()
       assert metrics['gauges']['memory.rss'] > 0
   ```

## Performance Benchmarks

### Overhead Measurement

```python
def benchmark_instrumentation_overhead():
    # Baseline without instrumentation
    config.METRICS_ENABLED = False
    start = time.perf_counter()
    for _ in range(1000):
        execute_operation()
    baseline = time.perf_counter() - start

    # With instrumentation
    config.METRICS_ENABLED = True
    start = time.perf_counter()
    for _ in range(1000):
        execute_operation()
    instrumented = time.perf_counter() - start

    overhead_percent = ((instrumented - baseline) / baseline) * 100
    assert overhead_percent < 1.0  # Less than 1% overhead
```

## Security Considerations

1. **No Sensitive Data**: Never log query parameters that might contain passwords
2. **Rate Limiting**: Metrics endpoints should be rate-limited
3. **Access Control**: Metrics dashboard requires admin authentication
4. **Data Sanitization**: Escape all user-provided data in metrics (see the sketch below)
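
For item 4, metric names built from request data (for example the `{format}` segment in `feed.requests.{format}`) should be restricted to a known set rather than interpolated directly. A minimal sketch follows; the whitelist contents and helper name are assumptions.

```python
# Illustrative whitelist for user-influenced metric name segments.
ALLOWED_FEED_FORMATS = {'rss', 'atom', 'json'}

def safe_format_label(requested_format: str) -> str:
    """Collapse unexpected values so arbitrary input never becomes a metric name."""
    fmt = (requested_format or '').strip().lower()
    return fmt if fmt in ALLOWED_FEED_FORMATS else 'other'

# Usage: metrics.increment(f'feed.requests.{safe_format_label(fmt)}')
```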
## Migration Notes

### From v1.1.1

- Existing performance monitoring configuration remains compatible
- New metrics are additive, no breaking changes
- Dashboard enhanced but backward compatible

## Acceptance Criteria

1. ✅ All database operations are timed
2. ✅ HTTP requests fully instrumented
3. ✅ Memory monitoring thread operational
4. ✅ Business metrics for syndication tracked
5. ✅ Performance overhead <1%
6. ✅ Metrics dashboard shows all new data
7. ✅ Slow query detection working
8. ✅ Memory leak detection functional
9. ✅ All metrics properly documented
10. ✅ Security review passed