
Phil Skentelbery 1e2135a49a fix: Resolve v1.1.2-rc.1 production issues - Static files and metrics

This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:

1. CRITICAL: Static files returning 500 errors
   - HTTP monitoring middleware was accessing response.data on streaming responses
   - Fixed by checking direct_passthrough flag before accessing response data
   - Static files (CSS, JS, images) now load correctly
   - File: starpunk/monitoring/http.py

2. HIGH: Database metrics showing zero
   - Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
     buffer read METRICS_SAMPLING_RATES (plural)
   - Fixed by standardizing on singular key name
   - Modified MetricsBuffer to accept both float and dict for flexibility
   - Changed default sampling from 10% to 100% for better visibility
   - Files: starpunk/monitoring/metrics.py, starpunk/config.py

Version: 1.1.2-rc.2

Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md

Testing: All monitoring tests pass (28/28)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-28 09:46:31 -07:00


v1.1.2-rc.1 Production Issues Investigation Report

Date: 2025-11-28
Version: v1.1.2-rc.1
Investigator: Developer Agent
Status: Issues Identified, Fixes Needed

Executive Summary

Two issues were identified in the v1.1.2-rc.1 production deployment, one critical and one high severity:

CRITICAL: Static files return 500 errors - site unusable (no CSS/JS)
HIGH: Database metrics showing zero - feature incomplete

Both issues have been traced to root causes and are ready for architect review.

Issue 1: Static Files Return 500 Error

Symptom

All static files (CSS, JS, images) return HTTP 500
Specifically: https://starpunk.thesatelliteoflove.com/static/css/style.css fails
Site is unusable without stylesheets

Error Message

RuntimeError: Attempted implicit sequence conversion but the response object is in direct passthrough mode.

Root Cause

File: starpunk/monitoring/http.py:74-78

# Get response size
response_size = 0
if response.data:  # <-- PROBLEM HERE
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length

Technical Analysis

The HTTP monitoring middleware's after_request hook attempts to access response.data to calculate response size for metrics. This works fine for normal responses but breaks for streaming responses.

How Flask serves static files:

Flask's send_from_directory() returns a streaming response
Streaming responses are in "direct passthrough mode"
Accessing .data on a streaming response triggers implicit sequence conversion
This raises RuntimeError because the response is not buffered

Why this affects all static files:

ALL static files use send_from_directory()
ALL are served as streaming responses
The after_request hook runs for EVERY response
Therefore ALL static files fail

Impact

Severity: CRITICAL
User Impact: Site completely unusable - no styling, no JavaScript
Scope: All static assets (CSS, JS, images, fonts, etc.)

Proposed Fix Direction

The middleware needs to:

Check if response is in direct passthrough mode before accessing .data
Fall back to content_length for streaming responses
Handle cases where size cannot be determined (record as 0 or unknown)

Code location for fix: starpunk/monitoring/http.py:74-78

Issue 2: Database Metrics Showing Zero

Symptom

Admin dashboard shows 0 for all database metrics
Database pool statistics work correctly
Only operation metrics (count, avg, min, max) show zero

Root Cause Analysis

The Architecture Is Correct

Config: starpunk/config.py:90

app.config["METRICS_ENABLED"] = os.getenv("METRICS_ENABLED", "true").lower() == "true"

✅ Defaults to enabled

Pool Initialization: starpunk/database/pool.py:172

metrics_enabled = app.config.get('METRICS_ENABLED', True)

✅ Reads config correctly

Connection Wrapping: starpunk/database/pool.py:74-77

if self.metrics_enabled:
    from starpunk.monitoring import MonitoredConnection
    return MonitoredConnection(conn, self.slow_query_threshold)

✅ Wraps connections when enabled

Metric Recording: starpunk/monitoring/database.py:83-89

record_metric(
    'database',
    f'{query_type} {table_name}',
    duration_ms,
    metadata,
    force=is_slow  # Always record slow queries
)

✅ Calls record_metric correctly

The Real Problem: Sampling Rate

File: starpunk/monitoring/metrics.py:105-110

self._sampling_rates = sampling_rates or {
    "database": 0.1,  # Only 10% of queries recorded!
    "http": 0.1,
    "render": 0.1,
}

File: starpunk/monitoring/metrics.py:138-142

if not force:
    sampling_rate = self._sampling_rates.get(operation_type, 0.1)
    if random.random() > sampling_rate:  # 90% chance to skip!
        return False

Why Metrics Show Zero

Low traffic: Production site has minimal activity
10% sampling: On average, only 1 in 10 database queries is recorded
Fast queries: Queries complete in under 1 second, so they are recorded with force=False and remain subject to sampling
Statistical probability: Low traffic combined with 10% sampling makes a window with zero recorded metrics likely

Example scenario:

20 database queries during monitoring window
10% sampling means about 2 metrics recorded on average
Random sampling could equally record 0, 1, or 3; the chance of recording none of the 20 is 0.9^20, roughly 12%
When no metrics are sampled, the dashboard shows 0
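
The arithmetic behind this scenario can be checked directly. The sketch below assumes each query is sampled independently with the same probability, which matches the random.random() check in the excerpt above; the function names are illustrative only.

```python
import math


def prob_zero_recorded(num_queries: int, sampling_rate: float) -> float:
    """Probability that none of num_queries independent samples is recorded."""
    return (1.0 - sampling_rate) ** num_queries


def prob_exactly(k: int, num_queries: int, sampling_rate: float) -> float:
    """Binomial probability that exactly k of num_queries are recorded."""
    return (
        math.comb(num_queries, k)
        * sampling_rate ** k
        * (1.0 - sampling_rate) ** (num_queries - k)
    )


# Chance of an empty dashboard for a few traffic levels at 10% sampling.
for n in (5, 20, 50):
    print(f"{n} queries: P(no metrics) = {prob_zero_recorded(n, 0.1):.3f}")
```

For 20 queries the probability of an empty window is roughly 12%, and it rises quickly as traffic drops, which is why a low-traffic site is hit hardest.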

Why Slow Queries Would Work

If there were slow queries (>= 1.0 second), they would be recorded with force=True, bypassing sampling. But production queries are all fast.
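
A minimal sketch of that decision, mirroring the excerpt from metrics.py. The function name and the injectable rng parameter are assumptions added here so the behaviour can be exercised deterministically; they are not part of the StarPunk code.

```python
import random


def should_record(sampling_rate: float, force: bool = False, rng=random.random) -> bool:
    """Decide whether a metric is kept; forced metrics bypass sampling entirely."""
    if force:
        return True
    # Same comparison as the original: skip when the draw exceeds the rate.
    return rng() <= sampling_rate


fixed_draw = lambda: 0.5  # deterministic stand-in for random.random

print(should_record(0.1, force=False, rng=fixed_draw))  # fast query at 10% sampling
print(should_record(0.1, force=True, rng=fixed_draw))   # slow query, sampling bypassed
```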

Impact

Severity: HIGH (feature incomplete, not critical to operations)
User Impact: Cannot see database performance metrics
Scope: Database operation metrics only (pool stats work fine)

Design Questions for Architect

1. Is 10% sampling rate appropriate for production?
   - Pro: Reduces overhead, good for high-traffic sites
   - Con: Insufficient for low-traffic sites like this one
   - Alternative: Higher default (50-100%) or traffic-based adaptive sampling
2. Should sampling be configurable?
   - Already supported via METRICS_SAMPLING_RATE config (starpunk/config.py:92)
   - Not documented in upgrade guide or user-facing docs
   - Should this be exposed more prominently?
3. Should there be a minimum recording guarantee?
   - E.g., "Always record at least 1 metric per minute"
   - Or "First N operations always recorded"
   - Ensures metrics never show zero even with low traffic
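
To make the "First N operations always recorded" option concrete, here is one possible shape for it. Everything in this sketch is hypothetical (the class name, the always_first parameter, the per-type counter); it is offered to support the design discussion, not as a description of existing code.

```python
import random


class FirstNSampler:
    """Always records the first N operations of each type, then samples the rest."""

    def __init__(self, sampling_rate: float, always_first: int = 10, rng=random.random):
        self.sampling_rate = sampling_rate
        self.always_first = always_first
        self._seen = {}   # operation type -> number of operations observed so far
        self._rng = rng   # injectable so the behaviour can be tested deterministically

    def should_record(self, operation_type: str) -> bool:
        seen = self._seen.get(operation_type, 0)
        self._seen[operation_type] = seen + 1
        if seen < self.always_first:
            return True   # guarantee: early operations are never dropped
        return self._rng() <= self.sampling_rate


sampler = FirstNSampler(sampling_rate=0.0, always_first=3, rng=lambda: 0.5)
print([sampler.should_record("database") for _ in range(5)])
```

The counter here is not thread-safe and never resets, so a real implementation would need a lock and probably a time window; the sketch only shows that the guarantee costs one dictionary lookup per operation.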

Configuration Check

Checked production configuration sources:

Environment Variables (from config.py)

METRICS_ENABLED: defaults to "true" (ENABLED ✅)
METRICS_SLOW_QUERY_THRESHOLD: defaults to 1.0 seconds
METRICS_SAMPLING_RATE: defaults to 1.0 (100%... wait, what?)

WAIT - Config Discrepancy Detected!

In config.py:92:

app.config["METRICS_SAMPLING_RATE"] = float(os.getenv("METRICS_SAMPLING_RATE", "1.0"))

Default: 1.0 (100%)

But this config is never used by MetricsBuffer!

In metrics.py:336-341:

try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rates = current_app.config.get('METRICS_SAMPLING_RATES', None)  # Note: plural!
except (ImportError, RuntimeError):

The config key mismatch:

Config.py sets: METRICS_SAMPLING_RATE (singular, defaults to 1.0)
Metrics.py reads: METRICS_SAMPLING_RATES (plural, expects dict)
Result: Always returns None, falls back to hardcoded 10%

Root Cause Confirmed

The real issue is a configuration key mismatch:

Config loads METRICS_SAMPLING_RATE (singular) = 1.0
MetricsBuffer reads METRICS_SAMPLING_RATES (plural) expecting dict
Key mismatch returns None
Falls back to hardcoded 10% sampling
Low traffic + 10% = no metrics
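
The failure chain above can be reproduced with a plain dictionary standing in for app.config. The values come from the excerpts in this report; DEFAULT_RATES is an illustrative name for the hardcoded fallback in metrics.py.

```python
# Hardcoded fallback, as shown in the metrics.py excerpt.
DEFAULT_RATES = {"database": 0.1, "http": 0.1, "render": 0.1}

# What config.py stores: the singular key, defaulting to 1.0.
config = {"METRICS_SAMPLING_RATE": 1.0}

# What the buffer setup looks up: the plural key, which was never set.
sampling_rates = config.get("METRICS_SAMPLING_RATES", None)

# None is falsy, so the hardcoded 10% defaults silently win.
effective_rates = sampling_rates or DEFAULT_RATES

print(sampling_rates)               # the lookup misses
print(effective_rates["database"])  # the rate actually applied
```

Because dict.get returns the default instead of raising, the mismatch produces no error or warning; the configured 100% rate is simply ignored.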

Verification Evidence

Code References

starpunk/monitoring/http.py:74-78 - Static file error location
starpunk/monitoring/database.py:83-89 - Database metric recording
starpunk/monitoring/metrics.py:105-110 - Hardcoded sampling rates
starpunk/monitoring/metrics.py:336-341 - Config reading with wrong key
starpunk/config.py:92 - Config setting with different key

Container Logs

Error message confirmed in production logs (user reported)

Configuration Flow

starpunk/config.py → Sets METRICS_SAMPLING_RATE (singular)
starpunk/__init__.py → Initializes app with config
starpunk/monitoring/metrics.py → Reads METRICS_SAMPLING_RATES (plural)
Mismatch → Falls back to 10%

Recommendations for Architect

Issue 1: Static Files (CRITICAL)

Immediate action required:

Fix starpunk/monitoring/http.py to handle streaming responses
Test with static files before any deployment
Consider adding integration test for static file serving

Issue 2: Database Metrics (HIGH)

Two problems to address:

Problem 2A: Config key mismatch

Fix either config.py or metrics.py to use same key name
Decision needed: singular or plural?
- Singular (METRICS_SAMPLING_RATE) simpler if same rate for all types
- Plural (METRICS_SAMPLING_RATES) allows per-type customization
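
One way to keep the simpler singular key without giving up per-type customization is to accept either a single number or a dict under that key. The helper below is a hypothetical sketch of that idea; the function name, the OPERATION_TYPES tuple, and the default of 1.0 are assumptions for illustration.

```python
OPERATION_TYPES = ("database", "http", "render")


def normalize_sampling_rates(value, default: float = 1.0) -> dict:
    """Turn None, a single rate, or a per-type dict into a complete per-type dict."""
    if value is None:
        rates = {op: default for op in OPERATION_TYPES}
    elif isinstance(value, dict):
        # Start from the default and let explicit per-type entries override it.
        rates = {op: default for op in OPERATION_TYPES}
        rates.update({op: float(rate) for op, rate in value.items()})
    else:
        rate = float(value)
        rates = {op: rate for op in OPERATION_TYPES}

    for op, rate in rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"sampling rate for {op!r} must be within [0, 1], got {rate}")
    return rates


print(normalize_sampling_rates(0.5))
print(normalize_sampling_rates({"database": 1.0, "http": 0.1}))
```

Validating the range up front means a misconfigured rate fails loudly at startup rather than silently changing how many metrics are kept.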

Problem 2B: Default sampling rate

10% may be too low for low-traffic sites
Consider higher default (50-100%) for better visibility
Or make sampling traffic-adaptive

Design Questions

Should there be a minimum recording guarantee so that metrics never show zero?
Should sampling rate be per-operation-type or global?
What's the right balance between overhead and visibility?

Next Steps

Architect Review: Review findings and provide design decisions
Fix Implementation: Implement approved fixes
Testing: Comprehensive testing of both fixes
Release: Deploy v1.1.2-rc.2 with fixes

References

v1.1.2 Implementation Plan: docs/projectplan/v1.1.2-implementation-plan.md
Phase 1 Report: docs/reports/v1.1.2-phase1-metrics-implementation.md
Developer Q&A: docs/design/v1.1.2/developer-qa.md (Questions Q6, Q12)
