This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:
1. CRITICAL: Static files returning 500 errors
- HTTP monitoring middleware was accessing response.data on streaming responses
- Fixed by checking direct_passthrough flag before accessing response data
- Static files (CSS, JS, images) now load correctly
- File: starpunk/monitoring/http.py
2. HIGH: Database metrics showing zero
- Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
buffer read METRICS_SAMPLING_RATES (plural)
- Fixed by standardizing on singular key name
- Modified MetricsBuffer to accept both float and dict for flexibility
- Changed default sampling from 10% to 100% for better visibility
- Files: starpunk/monitoring/metrics.py, starpunk/config.py
Version: 1.1.2-rc.2
Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md
Testing: All monitoring tests pass (28/28)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
# v1.1.2-rc.1 Production Issues Investigation Report

**Date:** 2025-11-28
**Version:** v1.1.2-rc.1
**Investigator:** Developer Agent
**Status:** Issues Identified, Fixes Needed

## Executive Summary

Two issues were identified in the v1.1.2-rc.1 production deployment, one critical and one high severity:

1. **CRITICAL**: Static files return 500 errors - site unusable (no CSS/JS)
2. **HIGH**: Database metrics showing zero - feature incomplete

Both issues have been traced to root causes and are ready for architect review.

---

## Issue 1: Static Files Return 500 Error

### Symptom
- All static files (CSS, JS, images) return HTTP 500
- Specifically: `https://starpunk.thesatelliteoflove.com/static/css/style.css` fails
- Site is unusable without stylesheets

### Error Message
```
RuntimeError: Attempted implicit sequence conversion but the response object is in direct passthrough mode.
```

### Root Cause
**File:** `starpunk/monitoring/http.py:74-78`

```python
# Get response size
response_size = 0
if response.data:  # <-- PROBLEM HERE
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length
```

### Technical Analysis

The HTTP monitoring middleware's `after_request` hook reads `response.data` to calculate the response size for metrics. This works for ordinary buffered responses but fails for responses in direct passthrough mode, which is how static files are served.

**How Flask serves static files:**
1. Flask's `send_from_directory()` returns a streaming response
2. That response is in "direct passthrough mode"
3. Accessing `.data` on a response in this mode triggers implicit sequence conversion
4. This raises `RuntimeError` because the response is not buffered

**Why this affects all static files:**
- ALL static files use `send_from_directory()`
- ALL are served as streaming responses
- The `after_request` hook runs for EVERY response
- Therefore ALL static files fail

### Impact
- **Severity:** CRITICAL
- **User Impact:** Site completely unusable - no styling, no JavaScript
- **Scope:** All static assets (CSS, JS, images, fonts, etc.)

### Proposed Fix Direction
The middleware needs to:
1. Check if response is in direct passthrough mode before accessing `.data`
2. Fall back to `content_length` for streaming responses
3. Handle cases where size cannot be determined (record as 0 or unknown)

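A minimal sketch of the three steps above, assuming the response object exposes the `direct_passthrough` flag and the `content_length` attribute that Werkzeug responses carry. The helper name `get_response_size` and the stand-in `FakeResponse` class are hypothetical; they exist only so the example runs without Flask and are not the code in `starpunk/monitoring/http.py`:

```python
class FakeResponse:
    """Hypothetical stand-in for a Flask/Werkzeug response (illustration only)."""

    def __init__(self, body=b"", direct_passthrough=False, content_length=None):
        self._body = body
        self.direct_passthrough = direct_passthrough
        self.content_length = content_length

    @property
    def data(self):
        # Mimic the failure reported in the error message above.
        if self.direct_passthrough:
            raise RuntimeError("response object is in direct passthrough mode")
        return self._body


def get_response_size(response):
    """Return the response size in bytes, or 0 when it cannot be determined."""
    # Step 1: never touch .data on a direct-passthrough response.
    if getattr(response, "direct_passthrough", False):
        # Step 2: use the declared length; step 3: record 0 if it is unknown.
        return getattr(response, "content_length", None) or 0
    data = response.data
    if data:
        return len(data)
    return getattr(response, "content_length", None) or 0


# Buffered response: size comes from the body.
print(get_response_size(FakeResponse(body=b"hello")))  # 5
# Static file with a declared length: .data is never read.
print(get_response_size(FakeResponse(direct_passthrough=True, content_length=2048)))  # 2048
# Static file with no declared length: recorded as 0.
print(get_response_size(FakeResponse(direct_passthrough=True)))  # 0
```

Recording 0 for an unknown size keeps the metric numeric; whether "unknown" should be distinguishable from a genuinely empty response is a design choice left open here.
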
**Code location for fix:** `starpunk/monitoring/http.py:74-78`

---

## Issue 2: Database Metrics Showing Zero

### Symptom
- Admin dashboard shows 0 for all database metrics
- Database pool statistics work correctly
- Only operation metrics (count, avg, min, max) show zero

### Root Cause Analysis

#### The Architecture Is Correct

**Config:** `starpunk/config.py:90`
```python
app.config["METRICS_ENABLED"] = os.getenv("METRICS_ENABLED", "true").lower() == "true"
```
✅ Defaults to enabled

**Pool Initialization:** `starpunk/database/pool.py:172`
```python
metrics_enabled = app.config.get('METRICS_ENABLED', True)
```
✅ Reads config correctly

**Connection Wrapping:** `starpunk/database/pool.py:74-77`
```python
if self.metrics_enabled:
    from starpunk.monitoring import MonitoredConnection
    return MonitoredConnection(conn, self.slow_query_threshold)
```
✅ Wraps connections when enabled

**Metric Recording:** `starpunk/monitoring/database.py:83-89`
```python
record_metric(
    'database',
    f'{query_type} {table_name}',
    duration_ms,
    metadata,
    force=is_slow  # Always record slow queries
)
```
✅ Calls record_metric correctly

#### The Real Problem: Sampling Rate

**File:** `starpunk/monitoring/metrics.py:105-110`

```python
self._sampling_rates = sampling_rates or {
    "database": 0.1,  # Only 10% of queries recorded!
    "http": 0.1,
    "render": 0.1,
}
```

**File:** `starpunk/monitoring/metrics.py:138-142`

```python
if not force:
    sampling_rate = self._sampling_rates.get(operation_type, 0.1)
    if random.random() > sampling_rate:  # 90% chance to skip!
        return False
```

### Why Metrics Show Zero

1. **Low traffic:** Production site has minimal activity
2. **10% sampling:** On average, only 1 in 10 database queries is recorded
3. **Fast queries:** Queries complete in under 1 second, so `force` stays `False`
4. **Statistical probability:** Low traffic combined with 10% sampling leaves a real chance that a given window records no metrics at all

Example scenario:
- 20 database queries during monitoring window
- 10% sampling: expect 2 metrics recorded on average
- But random sampling might record 0, 1, or 3 (statistical variation)
- Dashboard shows 0 whenever no metrics happen to be sampled

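To put a number on this scenario: if each query is sampled independently with probability p, the chance that none of n queries is recorded is (1 - p)^n. The sketch below computes that value. The function name `prob_no_samples` is illustrative, and independent sampling is an assumption, though it matches the per-call `random.random()` check quoted above:

```python
def prob_no_samples(num_queries: int, sampling_rate: float) -> float:
    """Probability that none of `num_queries` independent samples is recorded."""
    return (1.0 - sampling_rate) ** num_queries


# Compare a few window sizes at the hardcoded 10% rate.
for n in (5, 20, 50):
    p_zero = prob_no_samples(n, 0.1)
    print(f"{n} queries at 10% sampling: P(zero recorded) = {p_zero:.3f}")
```

For the 20-query window this gives about 12%, and for a 5-query window about 59%, so sparse traffic alone can plausibly explain an empty dashboard interval under 10% sampling.
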
### Why Slow Queries Would Work

If there were slow queries (>= 1.0 second), they would be recorded with `force=True`, bypassing sampling. But production queries are all fast.

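The interaction between the slow-query threshold and sampling can be sketched as follows. This illustrates the logic described in this report and is not the code in `starpunk/monitoring/`; the function `should_record` and its `rng` parameter are hypothetical, with `rng` injected so the example is deterministic:

```python
import random


def should_record(duration_s, slow_threshold_s=1.0, sampling_rate=0.1, rng=random.random):
    """Decide whether a query metric is kept (hypothetical helper)."""
    # Slow queries are forced through, regardless of sampling.
    force = duration_s >= slow_threshold_s
    if force:
        return True
    # Fast queries are kept only when the random draw falls within the rate.
    return rng() <= sampling_rate


def unlucky_draw():
    """Fixed draw above the 10% rate, standing in for random.random()."""
    return 0.95


print(should_record(0.02, rng=unlucky_draw))  # fast query, dropped by sampling: False
print(should_record(1.50, rng=unlucky_draw))  # slow query, forced: True
```
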
### Impact
- **Severity:** HIGH (feature incomplete, not critical to operations)
- **User Impact:** Cannot see database performance metrics
- **Scope:** Database operation metrics only (pool stats work fine)

### Design Questions for Architect

1. **Is a 10% sampling rate appropriate for production?**
   - Pro: Reduces overhead, good for high-traffic sites
   - Con: Insufficient for low-traffic sites like this one
   - Alternative: Higher default (50-100%) or traffic-based adaptive sampling

2. **Should sampling be configurable?**
   - Already supported via `METRICS_SAMPLING_RATE` config (starpunk/config.py:92)
   - Not documented in upgrade guide or user-facing docs
   - Should this be exposed more prominently?

3. **Should there be a minimum recording guarantee?**
   - E.g., "Always record at least 1 metric per minute"
   - Or "First N operations always recorded"
   - Ensures metrics never show zero even with low traffic

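One way the "first N operations always recorded" option could look, as a sketch only. The class `GuaranteedSampler` and its parameters are hypothetical and not part of the existing `MetricsBuffer`; the counter is also not thread-safe as written:

```python
import random


class GuaranteedSampler:
    """Hypothetical sampler: record the first N operations, then sample."""

    def __init__(self, sampling_rate=0.1, guaranteed_first=10, rng=random.random):
        self.sampling_rate = sampling_rate
        self.guaranteed_first = guaranteed_first
        self._rng = rng
        self._seen = 0  # number of operations observed so far

    def should_record(self, force=False):
        self._seen += 1
        # Forced metrics and the first N operations bypass sampling.
        if force or self._seen <= self.guaranteed_first:
            return True
        return self._rng() <= self.sampling_rate


# Deterministic demo: every random draw (0.95) is above the 10% rate.
sampler = GuaranteedSampler(sampling_rate=0.1, guaranteed_first=3, rng=lambda: 0.95)
decisions = [sampler.should_record() for _ in range(5)]
print(decisions)  # [True, True, True, False, False]
```

This guarantees a non-empty dashboard shortly after startup, but not during later quiet periods; a time-based guarantee ("at least 1 metric per minute") would need a timestamp instead of a counter.
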
---

## Configuration Check

Checked production configuration sources:

### Environment Variables (from config.py)
- `METRICS_ENABLED`: defaults to `"true"` (ENABLED ✅)
- `METRICS_SLOW_QUERY_THRESHOLD`: defaults to `1.0` seconds
- `METRICS_SAMPLING_RATE`: defaults to `1.0` (100%), which does not match the 10% sampling seen in `metrics.py`

### Config Discrepancy Detected

**In config.py:92:**
```python
app.config["METRICS_SAMPLING_RATE"] = float(os.getenv("METRICS_SAMPLING_RATE", "1.0"))
```
Default: **1.0 (100%)**

**But this config is never used by MetricsBuffer!**

**In metrics.py:336-341:**
```python
try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rates = current_app.config.get('METRICS_SAMPLING_RATES', None)  # Note: plural!
except (ImportError, RuntimeError):
    # (handler body not shown in this excerpt)
```

**The config key mismatch:**
- Config.py sets: `METRICS_SAMPLING_RATE` (singular, defaults to 1.0)
- Metrics.py reads: `METRICS_SAMPLING_RATES` (plural, expects dict)
- Result: the lookup always returns `None`, so the buffer falls back to the hardcoded 10%

### Root Cause Confirmed

**The real issue is a configuration key mismatch:**
1. Config loads `METRICS_SAMPLING_RATE` (singular) = 1.0
2. MetricsBuffer reads `METRICS_SAMPLING_RATES` (plural) expecting dict
3. Key mismatch returns None
4. Falls back to hardcoded 10% sampling
5. Low traffic combined with 10% sampling leaves few or no recorded metrics

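The mismatch can be reproduced with a plain dictionary standing in for `app.config`. The variable names below are illustrative; only the two key names and the hardcoded 10% fallback come from the code quoted in this report:

```python
# Hardcoded fallback, as quoted from metrics.py.
DEFAULT_RATES = {"database": 0.1, "http": 0.1, "render": 0.1}

# What config.py sets: the singular key, defaulting to 1.0.
config = {"METRICS_SAMPLING_RATE": 1.0}

# What metrics.py reads: the plural key, which was never set.
sampling_rates = config.get("METRICS_SAMPLING_RATES", None)

# `None or DEFAULT_RATES` silently selects the 10% fallback.
effective = sampling_rates or DEFAULT_RATES

print(sampling_rates)         # None
print(effective["database"])  # 0.1
```

Because `dict.get` returns the default rather than raising on a missing key, the typo produces no error or log line, which is why the configured 100% rate was ignored without any visible symptom other than the empty metrics.
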
---

## Verification Evidence

### Code References
- `starpunk/monitoring/http.py:74-78` - Static file error location
- `starpunk/monitoring/database.py:83-89` - Database metric recording
- `starpunk/monitoring/metrics.py:105-110` - Hardcoded sampling rates
- `starpunk/monitoring/metrics.py:336-341` - Config reading with wrong key
- `starpunk/config.py:92` - Config setting with different key

### Container Logs
Error message confirmed in production logs (user reported)

### Configuration Flow
1. `starpunk/config.py` → Sets `METRICS_SAMPLING_RATE` (singular)
2. `starpunk/__init__.py` → Initializes app with config
3. `starpunk/monitoring/metrics.py` → Reads `METRICS_SAMPLING_RATES` (plural)
4. Mismatch → Falls back to 10%

---

## Recommendations for Architect

### Issue 1: Static Files (CRITICAL)
**Immediate action required:**
1. Fix `starpunk/monitoring/http.py` to handle streaming responses
2. Test with static files before any deployment
3. Consider adding an integration test for static file serving

### Issue 2: Database Metrics (HIGH)
**Two problems to address:**

**Problem 2A: Config key mismatch**
- Fix either config.py or metrics.py so both use the same key name
- Decision needed: singular or plural?
- Singular (`METRICS_SAMPLING_RATE`) is simpler if the same rate applies to all types
- Plural (`METRICS_SAMPLING_RATES`) allows per-type customization

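If both forms are worth supporting, one option is to normalize whatever a single key holds into a per-type mapping. This is a sketch under that assumption; `normalize_sampling_rates` and `OPERATION_TYPES` are hypothetical names, the three operation types are taken from the hardcoded defaults in `metrics.py`, and the `default` of 0.1 simply mirrors the current fallback rather than recommending it:

```python
OPERATION_TYPES = ("database", "http", "render")


def normalize_sampling_rates(value, default=0.1):
    """Turn None, a single rate, or a per-type dict into a full per-type dict."""
    if value is None:
        return {op: default for op in OPERATION_TYPES}
    if isinstance(value, dict):
        # Per-type overrides; unspecified types keep the default.
        rates = {op: default for op in OPERATION_TYPES}
        rates.update(value)
        return rates
    # A single number applies to every operation type.
    rate = float(value)
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"sampling rate must be between 0 and 1, got {rate}")
    return {op: rate for op in OPERATION_TYPES}


print(normalize_sampling_rates(1.0))
# {'database': 1.0, 'http': 1.0, 'render': 1.0}
print(normalize_sampling_rates({"database": 0.5}))
# {'database': 0.5, 'http': 0.1, 'render': 0.1}
```

With a helper like this, the singular key can carry either a global rate or a per-type mapping, so the singular-versus-plural decision does not have to rule out per-type customization.
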
**Problem 2B: Default sampling rate**
- 10% may be too low for low-traffic sites
- Consider a higher default (50-100%) for better visibility
- Or make sampling traffic-adaptive

### Design Questions
1. Should there be a minimum recording guarantee so metrics never show zero?
2. Should sampling rate be per-operation-type or global?
3. What's the right balance between overhead and visibility?

---

## Next Steps

1. **Architect Review:** Review findings and provide design decisions
2. **Fix Implementation:** Implement approved fixes
3. **Testing:** Comprehensive testing of both fixes
4. **Release:** Deploy v1.1.2-rc.2 with fixes

---

## References

- v1.1.2 Implementation Plan: `docs/projectplan/v1.1.2-implementation-plan.md`
- Phase 1 Report: `docs/reports/v1.1.2-phase1-metrics-implementation.md`
- Developer Q&A: `docs/design/v1.1.2/developer-qa.md` (Questions Q6, Q12)