Metrics Instrumentation Specification - v1.1.2

Overview

This specification completes the metrics instrumentation foundation started in v1.1.1, adding comprehensive coverage for database operations, HTTP requests, memory monitoring, and business-specific syndication metrics.

Requirements

Functional Requirements

  1. Database Performance Metrics

    • Time all database operations
    • Track query patterns and frequency
    • Detect slow queries (>1 second)
    • Monitor connection pool utilization
    • Count rows affected/returned
  2. HTTP Request/Response Metrics

    • Full request lifecycle timing
    • Request and response size tracking
    • Status code distribution
    • Per-endpoint performance metrics
    • Client identification (user agent)
  3. Memory Monitoring

    • Continuous RSS memory tracking
    • Memory growth detection
    • High water mark tracking
    • Garbage collection statistics
    • Leak detection algorithms
  4. Business Metrics

    • Feed request counts by format
    • Cache hit/miss rates
    • Content publication rates
    • Syndication success tracking
    • Format popularity analysis

Non-Functional Requirements

  1. Performance Impact

    • Total overhead <1% when enabled
    • Zero impact when disabled (see the no-op collector sketch after this list)
    • Efficient metric storage (<2MB)
    • Non-blocking collection
  2. Data Retention

    • In-memory circular buffer
    • Last 1000 metrics retained
    • 15-minute detail window
    • Automatic cleanup
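
The "zero impact when disabled" requirement is easiest to guarantee by swapping in a no-op collector at startup, so instrumented code paths never branch per call. A minimal sketch of that pattern; the NullMetricsCollector and make_collector names are illustrative, and MetricsCollector is the class defined under Implementation Details:

class NullMetricsCollector:
    """No-op stand-in used when metrics are disabled"""

    def increment(self, name: str, amount: int = 1) -> None:
        pass

    def set_gauge(self, name: str, value: float) -> None:
        pass

    def record_histogram(self, name: str, value: float) -> None:
        pass

def make_collector(enabled: bool):
    # Same interface either way, so call sites stay branch-free
    return MetricsCollector() if enabled else NullMetricsCollector()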

Design

Database Instrumentation

Connection Wrapper

import re
import sqlite3
import time
from typing import Optional

class MonitoredConnection:
    """SQLite connection wrapper with performance monitoring"""

    def __init__(self, db_path: str, metrics_collector: MetricsCollector):
        self.conn = sqlite3.connect(db_path)
        self.metrics = metrics_collector

    def execute(self, query: str, params: Optional[tuple] = None) -> sqlite3.Cursor:
        """Execute query with timing"""
        query_type = self._get_query_type(query)
        table_name = self._extract_table_name(query)

        start_time = time.perf_counter()
        try:
            cursor = self.conn.execute(query, params or ())
            duration = time.perf_counter() - start_time

            # Record successful execution. Note: sqlite3 reports rowcount as -1
            # for SELECT; counting returned rows here would consume the cursor
            # before the caller sees it, so row iteration is left to the caller.
            self.metrics.record_database_operation(
                operation_type=query_type,
                table_name=table_name,
                duration_ms=duration * 1000,
                rows_affected=cursor.rowcount
            )

            # Check for slow query (threshold: 1 second)
            if duration > 1.0:
                self.metrics.record_slow_query(query, duration, params)

            return cursor

        except Exception as e:
            duration = time.perf_counter() - start_time
            self.metrics.record_database_error(query_type, table_name, str(e), duration * 1000)
            raise

    def _get_query_type(self, query: str) -> str:
        """Extract query type from SQL"""
        query_upper = query.strip().upper()
        for query_type in ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'CREATE', 'DROP']:
            if query_upper.startswith(query_type):
                return query_type
        return 'OTHER'

    def _extract_table_name(self, query: str) -> Optional[str]:
        """Extract primary table name from query"""
        # Simple regex patterns for common cases; first match wins
        patterns = [
            r'FROM\s+(\w+)',
            r'INTO\s+(\w+)',
            r'UPDATE\s+(\w+)',
            r'DELETE\s+FROM\s+(\w+)'
        ]
        for pattern in patterns:
            match = re.search(pattern, query, re.IGNORECASE)
            if match:
                return match.group(1)
        return None
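
A usage sketch showing pool-level wrapping, so callers get monitoring transparently; the get_connection helper and database filename are illustrative:

collector = MetricsCollector()

def get_connection(db_path: str) -> MonitoredConnection:
    # Wrapping at the pool boundary keeps monitoring invisible to callers
    return MonitoredConnection(db_path, collector)

conn = get_connection('starpunk.db')
cursor = conn.execute('SELECT id, content FROM notes WHERE published = ?', (1,))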

Metrics Collected

| Metric | Type | Description |
| --- | --- | --- |
| db.query.duration | Histogram | Query execution time in ms |
| db.query.count | Counter | Total queries by type |
| db.query.errors | Counter | Failed queries by type |
| db.rows.affected | Histogram | Rows modified per query |
| db.rows.returned | Histogram | Rows returned per SELECT |
| db.slow_queries | List | Queries exceeding threshold |
| db.connection.active | Gauge | Active connections |
| db.transaction.duration | Histogram | Transaction time in ms |

HTTP Instrumentation

Request Middleware

import time
import uuid

from flask import Flask, g, request

class HTTPMetricsMiddleware:
    """Flask middleware for HTTP metrics collection"""

    def __init__(self, app: Flask, metrics_collector: MetricsCollector):
        self.app = app
        self.metrics = metrics_collector
        self.setup_hooks()

    def setup_hooks(self):
        """Register Flask hooks for metrics"""

        @self.app.before_request
        def start_request_timer():
            """Initialize request metrics"""
            g.request_metrics = {
                'start_time': time.perf_counter(),
                'start_memory': self._get_memory_usage(),
                'request_id': str(uuid.uuid4()),
                'method': request.method,
                'endpoint': request.endpoint,
                'path': request.path,
                'content_length': request.content_length or 0
            }

        @self.app.after_request
        def record_response_metrics(response):
            """Record response metrics"""
            if not hasattr(g, 'request_metrics'):
                return response

            # Calculate metrics
            duration = time.perf_counter() - g.request_metrics['start_time']
            memory_delta = self._get_memory_usage() - g.request_metrics['start_memory']

            # Record to collector
            self.metrics.record_http_request(
                method=g.request_metrics['method'],
                endpoint=g.request_metrics['endpoint'],
                status_code=response.status_code,
                duration_ms=duration * 1000,
                request_size=g.request_metrics['content_length'],
                response_size=len(response.get_data()),
                memory_delta_mb=memory_delta
            )

            # Add timing header for debugging
            if self.app.config.get('DEBUG'):
                response.headers['X-Response-Time'] = f"{duration * 1000:.2f}ms"

            return response

    def _get_memory_usage(self) -> float:
        """Peak process RSS in MB (ru_maxrss is KiB on Linux);
        per-request deltas are therefore a coarse growth signal"""
        import resource
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
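
Wiring the middleware is a one-line affair in the application factory. A sketch, assuming a conventional create_app shape; the integration test later in this spec reads the collector from app.metrics_collector, which this mirrors:

def create_app() -> Flask:
    app = Flask(__name__)
    # Expose the collector on the app so routes and tests can reach it
    app.metrics_collector = MetricsCollector()
    HTTPMetricsMiddleware(app, app.metrics_collector)
    return app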

Metrics Collected

| Metric | Type | Description |
| --- | --- | --- |
| http.request.duration | Histogram | Total request processing time |
| http.request.count | Counter | Requests by method and endpoint |
| http.request.size | Histogram | Request body size distribution |
| http.response.size | Histogram | Response body size distribution |
| http.status.{code} | Counter | Response status code counts |
| http.endpoint.{name}.duration | Histogram | Per-endpoint timing |
| http.memory.delta | Gauge | Memory change per request |

Memory Monitoring

Background Monitor Thread

import gc
import logging
import resource
import time
from threading import Thread

logger = logging.getLogger(__name__)

class MemoryMonitor(Thread):
    """Background thread for continuous memory monitoring"""

    def __init__(self, metrics_collector: MetricsCollector, interval: int = 10):
        super().__init__(daemon=True)
        self.metrics = metrics_collector
        self.interval = interval
        self.running = True
        self.baseline_memory = None
        self.high_water_mark = 0

    def run(self):
        """Main monitoring loop"""
        # Establish baseline after startup
        time.sleep(5)
        self.baseline_memory = self._get_memory_info()

        while self.running:
            try:
                memory_info = self._get_memory_info()

                # Update high water mark
                self.high_water_mark = max(self.high_water_mark, memory_info['rss'])

                # Calculate growth rate in MB/hour
                if self.baseline_memory:
                    elapsed = time.time() - self.baseline_memory['timestamp']
                    if elapsed > 0:
                        growth_rate = ((memory_info['rss'] - self.baseline_memory['rss'])
                                       / elapsed * 3600)

                        # Detect potential leak (>10MB/hour growth)
                        if growth_rate > 10:
                            self.metrics.record_memory_leak_warning(growth_rate)

                # Record metrics
                self.metrics.record_memory_usage(
                    rss_mb=memory_info['rss'],
                    vms_mb=memory_info['vms'],
                    high_water_mb=self.high_water_mark,
                    gc_stats=self._get_gc_stats()
                )

            except Exception as e:
                logger.error(f"Memory monitoring error: {e}")

            time.sleep(self.interval)

    def stop(self):
        """Signal the monitoring loop to exit after the current interval"""
        self.running = False

    def _get_memory_info(self) -> dict:
        """Get current memory usage"""
        usage = resource.getrusage(resource.RUSAGE_SELF)
        return {
            'timestamp': time.time(),
            # ru_maxrss is peak RSS: KiB on Linux (bytes on macOS)
            'rss': usage.ru_maxrss / 1024,
            # ru_idrss is an unshared-data approximation and is 0 on Linux,
            # so treat this field as best-effort
            'vms': usage.ru_idrss
        }

    def _get_gc_stats(self) -> dict:
        """Get garbage collection statistics"""
        return {
            'collections': gc.get_count(),
            # Note: gc.collect(0) forces a generation-0 collection as a side effect
            'collected': gc.collect(0),
            'uncollectable': len(gc.garbage)
        }
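
One way to wire the monitor into startup, skipping it under test so runs stay deterministic; the TESTING config key and factory shape are assumptions:

def start_memory_monitor(app, collector):
    # Daemon thread is skipped in test mode
    if app.config.get('TESTING'):
        return None
    monitor = MemoryMonitor(collector, interval=10)
    monitor.start()
    return monitor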

Metrics Collected

| Metric | Type | Description |
| --- | --- | --- |
| memory.rss | Gauge | Resident set size in MB |
| memory.vms | Gauge | Virtual memory size in MB |
| memory.high_water | Gauge | Maximum RSS observed |
| memory.growth_rate | Gauge | MB/hour growth rate |
| gc.collections | Counter | GC collection counts by generation |
| gc.collected | Counter | Objects collected |
| gc.uncollectable | Gauge | Uncollectable object count |

Business Metrics

Syndication Metrics

class SyndicationMetrics:
    """Business metrics specific to content syndication"""

    def __init__(self, metrics_collector: MetricsCollector):
        self.metrics = metrics_collector

    def record_feed_request(self, format: str, cached: bool, generation_time: float):
        """Record feed request metrics"""
        self.metrics.increment(f'feed.requests.{format}')

        if cached:
            self.metrics.increment('feed.cache.hits')
        else:
            self.metrics.increment('feed.cache.misses')
            self.metrics.record_histogram('feed.generation.time', generation_time * 1000)

    def record_content_negotiation(self, accept_header: str, selected_format: str):
        """Track content negotiation results"""
        self.metrics.increment(f'feed.negotiation.{selected_format}')

        # Track client preferences
        if 'json' in accept_header.lower():
            self.metrics.increment('feed.client.prefers_json')
        elif 'atom' in accept_header.lower():
            self.metrics.increment('feed.client.prefers_atom')

    def record_publication(self, note_length: int, has_media: bool):
        """Track content publication metrics"""
        self.metrics.increment('content.notes.published')
        self.metrics.record_histogram('content.note.length', note_length)

        if has_media:
            self.metrics.increment('content.notes.with_media')
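
A route-level sketch of how these helpers might be called; the feed_cache lookup, generate_rss_feed function, and syndication_metrics instance are all illustrative names, not part of the framework:

import time

@app.route('/feed.xml')
def feed_xml():
    start = time.perf_counter()
    cached_body = feed_cache.get('rss')           # hypothetical cache lookup
    body = cached_body or generate_rss_feed()     # hypothetical generator
    syndication_metrics.record_feed_request(
        format='rss',
        cached=cached_body is not None,
        generation_time=time.perf_counter() - start
    )
    return body, 200, {'Content-Type': 'application/rss+xml'}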

Metrics Collected

| Metric | Type | Description |
| --- | --- | --- |
| feed.requests.{format} | Counter | Requests by feed format |
| feed.cache.hits | Counter | Cache hit count |
| feed.cache.misses | Counter | Cache miss count |
| feed.cache.hit_rate | Gauge | Cache hit percentage |
| feed.generation.time | Histogram | Feed generation duration |
| feed.negotiation.{format} | Counter | Format selection results |
| content.notes.published | Counter | Total notes published |
| content.note.length | Histogram | Note size distribution |
| content.syndication.success | Counter | Successful syndications |

Implementation Details

Metrics Collector

import time
from collections import defaultdict, deque

class MetricsCollector:
    """Central metrics collection and storage"""

    def __init__(self, buffer_size: int = 1000):
        self.buffer = deque(maxlen=buffer_size)
        self.counters = defaultdict(int)
        self.gauges = {}
        self.histograms = defaultdict(list)
        self.slow_queries = deque(maxlen=100)

    def record_metric(self, category: str, name: str, value: float, metadata: dict = None):
        """Record a generic metric"""
        metric = {
            'timestamp': time.time(),
            'category': category,
            'name': name,
            'value': value,
            'metadata': metadata or {}
        }
        self.buffer.append(metric)

    def increment(self, name: str, amount: int = 1):
        """Increment a counter"""
        self.counters[name] += amount

    def set_gauge(self, name: str, value: float):
        """Set a gauge value"""
        self.gauges[name] = value

    def record_histogram(self, name: str, value: float):
        """Add value to histogram"""
        self.histograms[name].append(value)
        # Keep only last 1000 values
        if len(self.histograms[name]) > 1000:
            self.histograms[name] = self.histograms[name][-1000:]

    def get_summary(self, window_seconds: int = 900) -> dict:
        """Get metrics summary for dashboard"""
        cutoff = time.time() - window_seconds
        recent = [m for m in self.buffer if m['timestamp'] > cutoff]

        summary = {
            'counters': dict(self.counters),
            'gauges': dict(self.gauges),
            'histograms': self._calculate_histogram_stats(),
            'recent_metrics': recent[-100:],  # Last 100 metrics
            'slow_queries': list(self.slow_queries)
        }

        return summary

    def _calculate_histogram_stats(self) -> dict:
        """Calculate statistics for histograms"""
        stats = {}
        for name, values in self.histograms.items():
            if values:
                sorted_values = sorted(values)
                stats[name] = {
                    'count': len(values),
                    'min': min(values),
                    'max': max(values),
                    'mean': sum(values) / len(values),
                    'p50': sorted_values[len(values) // 2],
                    'p95': sorted_values[int(len(values) * 0.95)],
                    'p99': sorted_values[int(len(values) * 0.99)]
                }
        return stats
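
The business metrics table lists feed.cache.hit_rate as a gauge, but nothing above derives it. A minimal sketch of that derivation; the helper name is illustrative:

def update_cache_hit_rate(collector: MetricsCollector) -> None:
    # Derive the hit-rate gauge from the two counters SyndicationMetrics maintains
    hits = collector.counters['feed.cache.hits']
    misses = collector.counters['feed.cache.misses']
    total = hits + misses
    if total:
        collector.set_gauge('feed.cache.hit_rate', hits / total * 100)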

Configuration

Environment Variables

# Metrics collection toggles
STARPUNK_METRICS_ENABLED=true
STARPUNK_METRICS_DB_TIMING=true
STARPUNK_METRICS_HTTP_TIMING=true
STARPUNK_METRICS_MEMORY_MONITOR=true
STARPUNK_METRICS_BUSINESS=true

# Thresholds
STARPUNK_METRICS_SLOW_QUERY_THRESHOLD=1.0  # seconds
STARPUNK_METRICS_MEMORY_LEAK_THRESHOLD=10  # MB/hour

# Storage
STARPUNK_METRICS_BUFFER_SIZE=1000
STARPUNK_METRICS_RETENTION_SECONDS=900  # 15 minutes

# Monitoring intervals
STARPUNK_METRICS_MEMORY_INTERVAL=10  # seconds
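
A minimal loader sketch for these variables; the names mirror the list above, but the helper itself and the defaults chosen are illustrative:

import os

def _env_bool(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).lower() in ('1', 'true', 'yes')

METRICS_ENABLED = _env_bool('STARPUNK_METRICS_ENABLED', True)
SLOW_QUERY_THRESHOLD = float(os.environ.get('STARPUNK_METRICS_SLOW_QUERY_THRESHOLD', '1.0'))
BUFFER_SIZE = int(os.environ.get('STARPUNK_METRICS_BUFFER_SIZE', '1000'))
MEMORY_INTERVAL = int(os.environ.get('STARPUNK_METRICS_MEMORY_INTERVAL', '10'))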

Testing Strategy

Unit Tests

  1. Collector Tests

    def test_metrics_buffer_circular():
        collector = MetricsCollector(buffer_size=10)
        for i in range(20):
            collector.record_metric('test', 'metric', i)
        assert len(collector.buffer) == 10
        assert collector.buffer[0]['value'] == 10  # Oldest is 10, not 0
    
  2. Instrumentation Tests

    def test_database_timing():
        conn = MonitoredConnection(':memory:', collector)
        conn.execute('CREATE TABLE test (id INTEGER)')
    
        metrics = collector.get_summary()
        assert 'db.query.duration' in metrics['histograms']
        assert metrics['counters']['db.query.count'] == 1
    

Integration Tests

  1. End-to-End Request Tracking

    def test_request_metrics():
        response = client.get('/feed.xml')
    
        metrics = app.metrics_collector.get_summary()
        assert 'http.request.duration' in metrics['histograms']
        assert metrics['counters']['http.status.200'] > 0
    
  2. Memory Leak Detection

    def test_memory_monitoring():
        monitor = MemoryMonitor(collector)
        monitor.start()
    
        # Simulate memory growth
        large_list = [0] * 1000000
        time.sleep(15)
    
        metrics = collector.get_summary()
        assert metrics['gauges']['memory.rss'] > 0
    

Performance Benchmarks

Overhead Measurement

def benchmark_instrumentation_overhead():
    # 'config' and 'execute_operation' stand in for the application's
    # settings object and a representative workload
    # Baseline without instrumentation
    config.METRICS_ENABLED = False
    start = time.perf_counter()
    for _ in range(1000):
        execute_operation()
    baseline = time.perf_counter() - start

    # With instrumentation
    config.METRICS_ENABLED = True
    start = time.perf_counter()
    for _ in range(1000):
        execute_operation()
    instrumented = time.perf_counter() - start

    overhead_percent = ((instrumented - baseline) / baseline) * 100
    assert overhead_percent < 1.0  # Less than 1% overhead

Security Considerations

  1. No Sensitive Data: Never log query parameters that might contain passwords (see the redaction sketch below)
  2. Rate Limiting: Metrics endpoints should be rate-limited
  3. Access Control: Metrics dashboard requires admin authentication
  4. Data Sanitization: Escape all user-provided data in metrics
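
A minimal redaction sketch for the first point; record_slow_query would store sanitize_params(params) instead of the raw tuple. The helper name and placeholder format are illustrative:

def sanitize_params(params: tuple) -> tuple:
    """Redact parameter values before they are attached to slow-query records"""
    # Keep only type and length, so a password value can never reach the buffer
    return tuple(f'<{type(p).__name__}:{len(str(p))}>' for p in params or ())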

Migration Notes

From v1.1.1

  • Existing performance monitoring configuration remains compatible
  • New metrics are additive, no breaking changes
  • Dashboard enhanced but backward compatible

Acceptance Criteria

  1. All database operations are timed
  2. HTTP requests fully instrumented
  3. Memory monitoring thread operational
  4. Business metrics for syndication tracked
  5. Performance overhead <1%
  6. Metrics dashboard shows all new data
  7. Slow query detection working
  8. Memory leak detection functional
  9. All metrics properly documented
  10. Security review passed