# v1.1.2-rc.1 Production Issues Investigation Report

**Date:** 2025-11-28
**Version:** v1.1.2-rc.1
**Investigator:** Developer Agent
**Status:** Issues Identified, Fixes Needed

## Executive Summary

Two critical issues identified in v1.1.2-rc.1 production deployment:

1. **CRITICAL**: Static files return 500 errors - site unusable (no CSS/JS)
2. **HIGH**: Database metrics showing zero - feature incomplete

Both issues have been traced to root causes and are ready for architect review.

---

## Issue 1: Static Files Return 500 Error

### Symptom

- All static files (CSS, JS, images) return HTTP 500
- Specifically: `https://starpunk.thesatelliteoflove.com/static/css/style.css` fails
- Site is unusable without stylesheets

### Error Message

```
RuntimeError: Attempted implicit sequence conversion but the response object is in direct passthrough mode.
```

### Root Cause

**File:** `starpunk/monitoring/http.py:74-78`

```python
# Get response size
response_size = 0
if response.data:  # <-- PROBLEM HERE
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length
```

### Technical Analysis

The HTTP monitoring middleware's `after_request` hook accesses `response.data` to calculate the response size for metrics. This works for normal (buffered) responses but breaks for streaming responses.

**How Flask serves static files:**

1. Flask's `send_from_directory()` returns a streaming response
2. Streaming responses are in "direct passthrough mode"
3. Accessing `.data` on a streaming response triggers implicit sequence conversion
4. This raises `RuntimeError` because the response is not buffered

**Why this affects all static files:**

- ALL static files use `send_from_directory()`
- ALL are served as streaming responses
- The `after_request` hook runs for EVERY response
- Therefore ALL static files fail

### Impact

- **Severity:** CRITICAL
- **User Impact:** Site completely unusable - no styling, no JavaScript
- **Scope:** All static assets (CSS, JS, images, fonts, etc.)

### Proposed Fix Direction

The middleware needs to:

1. Check if response is in direct passthrough mode before accessing `.data`
2. Fall back to `content_length` for streaming responses
3. Handle cases where size cannot be determined (record as 0 or unknown)

**Code location for fix:** `starpunk/monitoring/http.py:74-78`

---

## Issue 2: Database Metrics Showing Zero

### Symptom

- Admin dashboard shows 0 for all database metrics
- Database pool statistics work correctly
- Only operation metrics (count, avg, min, max) show zero

### Root Cause Analysis

#### The Architecture Is Correct

**Config:** `starpunk/config.py:90`

```python
app.config["METRICS_ENABLED"] = os.getenv("METRICS_ENABLED", "true").lower() == "true"
```

✅ Defaults to enabled

**Pool Initialization:** `starpunk/database/pool.py:172`

```python
metrics_enabled = app.config.get('METRICS_ENABLED', True)
```

✅ Reads config correctly

**Connection Wrapping:** `starpunk/database/pool.py:74-77`

```python
if self.metrics_enabled:
    from starpunk.monitoring import MonitoredConnection
    return MonitoredConnection(conn, self.slow_query_threshold)
```

✅ Wraps connections when enabled

**Metric Recording:** `starpunk/monitoring/database.py:83-89`

```python
record_metric(
    'database',
    f'{query_type} {table_name}',
    duration_ms,
    metadata,
    force=is_slow  # Always record slow queries
)
```

✅ Calls record_metric correctly

#### The Real Problem: Sampling Rate

**File:** `starpunk/monitoring/metrics.py:105-110`

```python
self._sampling_rates = sampling_rates or {
    "database": 0.1,  # Only 10% of queries recorded!
    "http": 0.1,
    "render": 0.1,
}
```

**File:** `starpunk/monitoring/metrics.py:138-142`

```python
if not force:
    sampling_rate = self._sampling_rates.get(operation_type, 0.1)
    if random.random() > sampling_rate:  # 90% chance to skip!
        return False
```

### Why Metrics Show Zero

1. **Low traffic:** Production site has minimal activity
2. **10% sampling:** Only 1 in 10 database queries is recorded
3. **Fast queries:** Queries complete in < 1 second, so `force=False`
4. **Statistical probability:** Low traffic combined with 10% sampling makes it likely that few or no metrics are recorded

Example scenario:

- 20 database queries during monitoring window
- 10% sampling = expect 2 metrics recorded
- But random sampling might record 0, 1, or 3 (statistical variation); the chance of recording none of the 20 is 0.9^20, about 12%, and it rises as the query count drops
- Dashboard shows 0 because no metrics were sampled

### Why Slow Queries Would Work

If there were slow queries (>= 1.0 second), they would be recorded with `force=True`, bypassing sampling. But production queries are all fast.

### Impact

- **Severity:** HIGH (feature incomplete, not critical to operations)
- **User Impact:** Cannot see database performance metrics
- **Scope:** Database operation metrics only (pool stats work fine)

### Design Questions for Architect

1. **Is 10% sampling rate appropriate for production?**
   - Pro: Reduces overhead, good for high-traffic sites
   - Con: Insufficient for low-traffic sites like this one
   - Alternative: Higher default (50-100%) or traffic-based adaptive sampling

2. **Should sampling be configurable?**
   - Already supported via `METRICS_SAMPLING_RATE` config (starpunk/config.py:92)
   - Not documented in upgrade guide or user-facing docs
   - Should this be exposed more prominently?

3. **Should there be a minimum recording guarantee?**
   - E.g., "Always record at least 1 metric per minute"
   - Or "First N operations always recorded"
   - Ensures metrics never show zero even with low traffic

---

## Configuration Check

Checked production configuration sources:

### Environment Variables (from config.py)

- `METRICS_ENABLED`: defaults to `"true"` (ENABLED ✅)
- `METRICS_SLOW_QUERY_THRESHOLD`: defaults to `1.0` seconds
- `METRICS_SAMPLING_RATE`: defaults to `1.0` (100%... wait, what?)

### WAIT - Config Discrepancy Detected!

**In config.py:92:**

```python
app.config["METRICS_SAMPLING_RATE"] = float(os.getenv("METRICS_SAMPLING_RATE", "1.0"))
```

Default: **1.0 (100%)**

**But this config is never used by MetricsBuffer!**

**In metrics.py:336-341:**

```python
try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rates = current_app.config.get('METRICS_SAMPLING_RATES', None)  # Note: plural!
except (ImportError, RuntimeError):
```

**The config key mismatch:**

- Config.py sets: `METRICS_SAMPLING_RATE` (singular, defaults to 1.0)
- Metrics.py reads: `METRICS_SAMPLING_RATES` (plural, expects dict)
- Result: Always returns `None`, falls back to hardcoded 10%

### Root Cause Confirmed

**The real issue is a configuration key mismatch:**

1. Config loads `METRICS_SAMPLING_RATE` (singular) = 1.0
2. MetricsBuffer reads `METRICS_SAMPLING_RATES` (plural) expecting dict
3. Key mismatch returns None
4. Falls back to hardcoded 10% sampling
5. Low traffic + 10% = no metrics

---

## Verification Evidence

### Code References

- `starpunk/monitoring/http.py:74-78` - Static file error location
- `starpunk/monitoring/database.py:83-89` - Database metric recording
- `starpunk/monitoring/metrics.py:105-110` - Hardcoded sampling rates
- `starpunk/monitoring/metrics.py:336-341` - Config reading with wrong key
- `starpunk/config.py:92` - Config setting with different key

### Container Logs

Error message confirmed in production logs (user reported)

### Configuration Flow

1. `starpunk/config.py` → Sets `METRICS_SAMPLING_RATE` (singular)
2. `starpunk/__init__.py` → Initializes app with config
3. `starpunk/monitoring/metrics.py` → Reads `METRICS_SAMPLING_RATES` (plural)
4. Mismatch → Falls back to 10%

---

## Recommendations for Architect

### Issue 1: Static Files (CRITICAL)

**Immediate action required:**

1. Fix `starpunk/monitoring/http.py` to handle streaming responses
2. Test with static files before any deployment
3. Consider adding integration test for static file serving

### Issue 2: Database Metrics (HIGH)

**Two problems to address:**

**Problem 2A: Config key mismatch**

- Fix either config.py or metrics.py to use the same key name
- Decision needed: singular or plural?
  - Singular (`METRICS_SAMPLING_RATE`) simpler if same rate for all types
  - Plural (`METRICS_SAMPLING_RATES`) allows per-type customization

**Problem 2B: Default sampling rate**

- 10% may be too low for low-traffic sites
- Consider higher default (50-100%) for better visibility
- Or make sampling traffic-adaptive

### Design Questions

1. Should there be a minimum recording guarantee so the dashboard never shows zero metrics?
2. Should sampling rate be per-operation-type or global?
3. What's the right balance between overhead and visibility?

---

## Next Steps

1. **Architect Review:** Review findings and provide design decisions
2. **Fix Implementation:** Implement approved fixes
3. **Testing:** Comprehensive testing of both fixes
4. **Release:** Deploy v1.1.2-rc.2 with fixes

---

## References

- v1.1.2 Implementation Plan: `docs/projectplan/v1.1.2-implementation-plan.md`
- Phase 1 Report: `docs/reports/v1.1.2-phase1-metrics-implementation.md`
- Developer Q&A: `docs/design/v1.1.2/developer-qa.md` (Questions Q6, Q12)
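
---

## Appendix: Sketch of the Issue 1 Fix Direction

The three-step fix direction proposed for Issue 1 could look roughly like the sketch below. It is not the approved fix: `get_response_size` and `FakeResponse` are hypothetical names introduced here for illustration. `FakeResponse` is a stand-in that mimics only the Werkzeug response attributes involved (`direct_passthrough`, `content_length`, `data`), so the example runs without Flask installed.

```python
from typing import Optional


class FakeResponse:
    """Hypothetical stand-in for a Werkzeug response (illustration only)."""

    def __init__(self, body: bytes, direct_passthrough: bool = False,
                 content_length: Optional[int] = None):
        self._body = body
        self.direct_passthrough = direct_passthrough
        self.content_length = content_length

    @property
    def data(self) -> bytes:
        # Mimics the failure seen in production: buffered access is
        # refused while the response is in direct passthrough mode.
        if self.direct_passthrough:
            raise RuntimeError(
                "Attempted implicit sequence conversion but the response "
                "object is in direct passthrough mode."
            )
        return self._body


def get_response_size(response) -> int:
    """Return the response size in bytes, or 0 when it is unknown."""
    # Step 1: check for direct passthrough before touching .data.
    if getattr(response, "direct_passthrough", False):
        # Steps 2 and 3: use content_length, recording 0 if it is unset.
        return getattr(response, "content_length", None) or 0

    # Buffered responses: reading .data is safe here.
    data = response.data
    if data:
        return len(data)
    return getattr(response, "content_length", None) or 0


# Buffered response: size comes from the body.
print(get_response_size(FakeResponse(b"hello")))
# Streaming (static file) response: size comes from content_length, no error.
print(get_response_size(FakeResponse(b"", direct_passthrough=True,
                                     content_length=2048)))
```

In the real middleware, a helper along these lines would replace the `if response.data:` block in the `after_request` hook. Whether an unknown size should be recorded as 0 or as a distinct "unknown" value is left to the architect's decision.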