This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:
1. CRITICAL: Static files returning 500 errors
- HTTP monitoring middleware was accessing response.data on streaming responses
- Fixed by checking direct_passthrough flag before accessing response data
- Static files (CSS, JS, images) now load correctly
- File: starpunk/monitoring/http.py
2. HIGH: Database metrics showing zero
- Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
buffer read METRICS_SAMPLING_RATES (plural)
- Fixed by standardizing on singular key name
- Modified MetricsBuffer to accept both float and dict for flexibility
- Changed default sampling from 10% to 100% for better visibility
- Files: starpunk/monitoring/metrics.py, starpunk/config.py
Version: 1.1.2-rc.2
Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md
Testing: All monitoring tests pass (28/28)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
v1.1.2-rc.1 Production Issues Investigation Report
Date: 2025-11-28
Version: v1.1.2-rc.1
Investigator: Developer Agent
Status: Issues Identified, Fixes Needed
Executive Summary
Two issues were identified in the v1.1.2-rc.1 production deployment:
- CRITICAL: Static files return 500 errors - site unusable (no CSS/JS)
- HIGH: Database metrics showing zero - feature incomplete
Both issues have been traced to root causes and are ready for architect review.
Issue 1: Static Files Return 500 Error
Symptom
- All static files (CSS, JS, images) return HTTP 500
- Specifically: https://starpunk.thesatelliteoflove.com/static/css/style.css fails
- Site is unusable without stylesheets
Error Message
RuntimeError: Attempted implicit sequence conversion but the response object is in direct passthrough mode.
Root Cause
File: starpunk/monitoring/http.py:74-78
# Get response size
response_size = 0
if response.data:  # <-- PROBLEM HERE
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length
Technical Analysis
The HTTP monitoring middleware's after_request hook attempts to access response.data to calculate response size for metrics. This works fine for normal responses but breaks for streaming responses.
How Flask serves static files:
- Flask's send_from_directory() returns a streaming response
- Streaming responses are in "direct passthrough mode"
- Accessing .data on a streaming response triggers implicit sequence conversion
- This raises RuntimeError because the response is not buffered
Why this affects all static files:
- ALL static files use send_from_directory()
- ALL are served as streaming responses
- The after_request hook runs for EVERY response
- Therefore ALL static files fail
Impact
- Severity: CRITICAL
- User Impact: Site completely unusable - no styling, no JavaScript
- Scope: All static assets (CSS, JS, images, fonts, etc.)
Proposed Fix Direction
The middleware needs to:
- Check if response is in direct passthrough mode before accessing .data
- Fall back to content_length for streaming responses
- Handle cases where size cannot be determined (record as 0 or unknown)
Code location for fix: starpunk/monitoring/http.py:74-78
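The fix direction above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: the helper name get_response_size is hypothetical, and it assumes the response object exposes Werkzeug's direct_passthrough, data, and content_length attributes.

```python
def get_response_size(response):
    """Return the response body size in bytes, or 0 if it cannot be determined.

    Hypothetical helper: it never touches ``response.data`` on streaming
    (direct passthrough) responses, because that access raises RuntimeError.
    """
    # Streaming responses (e.g. static files) are not buffered, so only the
    # declared Content-Length is safe to read.
    if getattr(response, "direct_passthrough", False):
        return getattr(response, "content_length", None) or 0

    # Buffered responses: measuring the body directly is safe.
    data = getattr(response, "data", None)
    if data:
        return len(data)

    # Empty body: fall back to Content-Length if one was set.
    return getattr(response, "content_length", None) or 0
```

Checking the flag first means the size lookup itself cannot trigger the implicit sequence conversion, and an undeterminable size is recorded as 0 rather than failing the request.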
Issue 2: Database Metrics Showing Zero
Symptom
- Admin dashboard shows 0 for all database metrics
- Database pool statistics work correctly
- Only operation metrics (count, avg, min, max) show zero
Root Cause Analysis
The Architecture Is Correct
Config: starpunk/config.py:90
app.config["METRICS_ENABLED"] = os.getenv("METRICS_ENABLED", "true").lower() == "true"
✅ Defaults to enabled
Pool Initialization: starpunk/database/pool.py:172
metrics_enabled = app.config.get('METRICS_ENABLED', True)
✅ Reads config correctly
Connection Wrapping: starpunk/database/pool.py:74-77
if self.metrics_enabled:
    from starpunk.monitoring import MonitoredConnection
    return MonitoredConnection(conn, self.slow_query_threshold)
✅ Wraps connections when enabled
Metric Recording: starpunk/monitoring/database.py:83-89
record_metric(
    'database',
    f'{query_type} {table_name}',
    duration_ms,
    metadata,
    force=is_slow  # Always record slow queries
)
✅ Calls record_metric correctly
The Real Problem: Sampling Rate
File: starpunk/monitoring/metrics.py:105-110
self._sampling_rates = sampling_rates or {
    "database": 0.1,  # Only 10% of queries recorded!
    "http": 0.1,
    "render": 0.1,
}
File: starpunk/monitoring/metrics.py:138-142
if not force:
    sampling_rate = self._sampling_rates.get(operation_type, 0.1)
    if random.random() > sampling_rate:  # 90% chance to skip!
        return False
Why Metrics Show Zero
- Low traffic: Production site has minimal activity
- 10% sampling: Only 1 in 10 database queries are recorded
- Fast queries: Queries complete in < 1 second, so force=False
- Statistical probability: With low traffic + 10% sampling = high chance of 0 metrics
Example scenario:
- 20 database queries during monitoring window
- 10% sampling = expect 2 metrics recorded
- But random sampling might record 0, 1, or 3 (statistical variation)
- Dashboard shows 0 because no metrics were sampled
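The scenario above can be quantified. Assuming each query is sampled independently with probability rate, the chance that none of n queries is recorded is (1 - rate) ** n. The function name below is illustrative only.

```python
def probability_of_zero_samples(num_queries: int, sampling_rate: float) -> float:
    """Chance that independent random sampling records none of the queries."""
    return (1.0 - sampling_rate) ** num_queries


if __name__ == "__main__":
    for n in (5, 20, 50):
        p = probability_of_zero_samples(n, 0.1)
        print(f"{n:>3} queries at 10% sampling: {p:.1%} chance of zero metrics")
```

For 20 queries the chance of an empty dashboard is roughly 12%, and for only 5 queries it is about 59%, so a quiet monitoring window can plausibly record nothing at all.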
Why Slow Queries Would Work
If there were slow queries (>= 1.0 second), they would be recorded with force=True, bypassing sampling. But production queries are all fast.
Impact
- Severity: HIGH (feature incomplete, not critical to operations)
- User Impact: Cannot see database performance metrics
- Scope: Database operation metrics only (pool stats work fine)
Design Questions for Architect
- Is 10% sampling rate appropriate for production?
  - Pro: Reduces overhead, good for high-traffic sites
  - Con: Insufficient for low-traffic sites like this one
  - Alternative: Higher default (50-100%) or traffic-based adaptive sampling
- Should sampling be configurable?
  - Already supported via METRICS_SAMPLING_RATE config (starpunk/config.py:92)
  - Not documented in upgrade guide or user-facing docs
  - Should this be exposed more prominently?
- Should there be a minimum recording guarantee?
  - E.g., "Always record at least 1 metric per minute"
  - Or "First N operations always recorded"
  - Ensures metrics never show zero even with low traffic
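To make the "First N operations always recorded" option concrete, here is one possible shape for it. This is a hypothetical sketch for discussion, not existing StarPunk code: the class name GuaranteedSampler and its parameters are assumptions, and it is not thread-safe as written.

```python
import random


class GuaranteedSampler:
    """Hypothetical sampler: always records the first N operations per type."""

    def __init__(self, sampling_rate: float = 0.1, guaranteed_first: int = 10):
        self.sampling_rate = sampling_rate
        self.guaranteed_first = guaranteed_first
        self._seen = {}  # operation_type -> number of operations observed so far

    def should_record(self, operation_type: str, force: bool = False) -> bool:
        count = self._seen.get(operation_type, 0)
        self._seen[operation_type] = count + 1
        # Forced (e.g. slow) operations and the first N are always recorded.
        if force or count < self.guaranteed_first:
            return True
        # After the guaranteed window, fall back to ordinary random sampling.
        return random.random() < self.sampling_rate
```

With a guarantee like this, a low-traffic site shows non-zero metrics as soon as the first queries run, while a high-traffic site still converges to the configured sampling rate.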
Configuration Check
Checked production configuration sources:
Environment Variables (from config.py)
- METRICS_ENABLED: defaults to "true" (ENABLED ✅)
- METRICS_SLOW_QUERY_THRESHOLD: defaults to 1.0 seconds
- METRICS_SAMPLING_RATE: defaults to 1.0 (100%... wait, what?)
WAIT - Config Discrepancy Detected!
In config.py:92:
app.config["METRICS_SAMPLING_RATE"] = float(os.getenv("METRICS_SAMPLING_RATE", "1.0"))
Default: 1.0 (100%)
But this config is never used by MetricsBuffer!
In metrics.py:336-341:
try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rates = current_app.config.get('METRICS_SAMPLING_RATES', None)  # Note: plural!
except (ImportError, RuntimeError):
The config key mismatch:
- Config.py sets: METRICS_SAMPLING_RATE (singular, defaults to 1.0)
- Metrics.py reads: METRICS_SAMPLING_RATES (plural, expects dict)
- Result: Always returns None, falls back to hardcoded 10%
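The mismatch can be reproduced in isolation. In this sketch a plain dict stands in for app.config, and HARDCODED_DEFAULTS is an illustrative name for the fallback dict shown earlier in metrics.py.

```python
# Plain dict standing in for app.config; the singular key mirrors config.py.
config = {"METRICS_SAMPLING_RATE": 1.0}

# Illustrative stand-in for the hardcoded fallback rates in MetricsBuffer.
HARDCODED_DEFAULTS = {"database": 0.1, "http": 0.1, "render": 0.1}

# The buffer asks for the plural key, which was never set.
sampling_rates = config.get("METRICS_SAMPLING_RATES", None)

# None is falsy, so the `or` silently selects the hardcoded 10% defaults.
effective_rates = sampling_rates or HARDCODED_DEFAULTS

print(sampling_rates)               # None
print(effective_rates["database"])  # 0.1
```

No error or warning is raised at any point, which is why the configured 100% rate was ignored without any visible symptom other than empty metrics.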
Root Cause Confirmed
The real issue is a configuration key mismatch:
- Config loads METRICS_SAMPLING_RATE (singular) = 1.0
- MetricsBuffer reads METRICS_SAMPLING_RATES (plural) expecting dict
- Key mismatch returns None
- Falls back to hardcoded 10% sampling
- Low traffic + 10% = no metrics
Verification Evidence
Code References
- starpunk/monitoring/http.py:74-78 - Static file error location
- starpunk/monitoring/database.py:83-89 - Database metric recording
- starpunk/monitoring/metrics.py:105-110 - Hardcoded sampling rates
- starpunk/monitoring/metrics.py:336-341 - Config reading with wrong key
- starpunk/config.py:92 - Config setting with different key
Container Logs
Error message confirmed in production logs (user reported)
Configuration Flow
- starpunk/config.py → Sets METRICS_SAMPLING_RATE (singular)
- starpunk/__init__.py → Initializes app with config
- starpunk/monitoring/metrics.py → Reads METRICS_SAMPLING_RATES (plural)
- Mismatch → Falls back to 10%
Recommendations for Architect
Issue 1: Static Files (CRITICAL)
Immediate action required:
- Fix starpunk/monitoring/http.py to handle streaming responses
- Test with static files before any deployment
- Consider adding integration test for static file serving
Issue 2: Database Metrics (HIGH)
Two problems to address:
Problem 2A: Config key mismatch
- Fix either config.py or metrics.py to use same key name
- Decision needed: singular or plural?
  - Singular (METRICS_SAMPLING_RATE) simpler if same rate for all types
  - Plural (METRICS_SAMPLING_RATES) allows per-type customization
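The two options need not be exclusive: a single key can carry either a global float or a per-type dict if the buffer normalizes the value on load, which is the direction the rc.2 commit message describes. The sketch below is an assumption about how that could look, not the project's implementation; normalize_sampling_rates and OPERATION_TYPES are hypothetical names.

```python
OPERATION_TYPES = ("database", "http", "render")


def normalize_sampling_rates(value, default=1.0):
    """Return a per-operation-type dict from a float, a dict, or None."""
    if value is None:
        # Nothing configured: use the default rate for every type.
        return {op: default for op in OPERATION_TYPES}
    if isinstance(value, (int, float)):
        # One global rate applied to every operation type.
        return {op: float(value) for op in OPERATION_TYPES}
    if isinstance(value, dict):
        # Per-type rates; unlisted operation types fall back to the default.
        return {op: float(value.get(op, default)) for op in OPERATION_TYPES}
    raise TypeError(f"Unsupported sampling rate config: {value!r}")
```

For example, normalize_sampling_rates(0.5) applies 50% to every type, while normalize_sampling_rates({"database": 0.2}) samples database operations at 20% and leaves the others at the default.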
Problem 2B: Default sampling rate
- 10% may be too low for low-traffic sites
- Consider higher default (50-100%) for better visibility
- Or make sampling traffic-adaptive
Design Questions
- Should there be a minimum recording guarantee so that metrics never show zero?
- Should sampling rate be per-operation-type or global?
- What's the right balance between overhead and visibility?
Next Steps
- Architect Review: Review findings and provide design decisions
- Fix Implementation: Implement approved fixes
- Testing: Comprehensive testing of both fixes
- Release: Deploy v1.1.2-rc.2 with fixes
References
- v1.1.2 Implementation Plan: docs/projectplan/v1.1.2-implementation-plan.md
- Phase 1 Report: docs/reports/v1.1.2-phase1-metrics-implementation.md
- Developer Q&A: docs/design/v1.1.2/developer-qa.md (Questions Q6, Q12)