StarPunk/docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
Phil Skentelbery 1e2135a49a fix: Resolve v1.1.2-rc.1 production issues - Static files and metrics
This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:

1. CRITICAL: Static files returning 500 errors
   - HTTP monitoring middleware was accessing response.data on streaming responses
   - Fixed by checking direct_passthrough flag before accessing response data
   - Static files (CSS, JS, images) now load correctly
   - File: starpunk/monitoring/http.py

2. HIGH: Database metrics showing zero
   - Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
     buffer read METRICS_SAMPLING_RATES (plural)
   - Fixed by standardizing on singular key name
   - Modified MetricsBuffer to accept both float and dict for flexibility
   - Changed default sampling from 10% to 100% for better visibility
   - Files: starpunk/monitoring/metrics.py, starpunk/config.py

Version: 1.1.2-rc.2

Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md

Testing: All monitoring tests pass (28/28)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 09:46:31 -07:00


v1.1.2-rc.1 Production Issues Investigation Report

Date: 2025-11-28
Version: v1.1.2-rc.1
Investigator: Developer Agent
Status: Issues Identified, Fixes Needed

Executive Summary

Two issues were identified in the v1.1.2-rc.1 production deployment, one CRITICAL and one HIGH:

  1. CRITICAL: Static files return 500 errors - site unusable (no CSS/JS)
  2. HIGH: Database metrics showing zero - feature incomplete

Both issues have been traced to root causes and are ready for architect review.


Issue 1: Static Files Return 500 Error

Symptom

  • All static files (CSS, JS, images) return HTTP 500
  • Specifically: https://starpunk.thesatelliteoflove.com/static/css/style.css fails
  • Site is unusable without stylesheets

Error Message

RuntimeError: Attempted implicit sequence conversion but the response object is in direct passthrough mode.

Root Cause

File: starpunk/monitoring/http.py:74-78

# Get response size
response_size = 0
if response.data:  # <-- PROBLEM HERE
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length

Technical Analysis

The HTTP monitoring middleware's after_request hook attempts to access response.data to calculate response size for metrics. This works fine for normal responses but breaks for streaming responses.

How Flask serves static files:

  1. Flask's send_from_directory() returns a streaming response
  2. Streaming responses are in "direct passthrough mode"
  3. Accessing .data on a streaming response triggers implicit sequence conversion
  4. This raises RuntimeError because the response is not buffered

Why this affects all static files:

  • ALL static files use send_from_directory()
  • ALL are served as streaming responses
  • The after_request hook runs for EVERY response
  • Therefore ALL static files fail

Impact

  • Severity: CRITICAL
  • User Impact: Site completely unusable - no styling, no JavaScript
  • Scope: All static assets (CSS, JS, images, fonts, etc.)

Proposed Fix Direction

The middleware needs to:

  1. Check if response is in direct passthrough mode before accessing .data
  2. Fall back to content_length for streaming responses
  3. Handle cases where size cannot be determined (record as 0 or unknown)

Code location for fix: starpunk/monitoring/http.py:74-78
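
The three steps above might be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual fix: get_response_size and the FakeResponse stand-in are hypothetical names, and the stand-in only imitates the behaviour described in this report (accessing .data raises RuntimeError while direct_passthrough is set).

```python
from typing import Optional


class FakeResponse:
    """Hypothetical stand-in imitating the response behaviour described above."""

    def __init__(self, body: bytes = b"", direct_passthrough: bool = False,
                 content_length: Optional[int] = None):
        self._body = body
        self.direct_passthrough = direct_passthrough
        self.content_length = content_length

    @property
    def data(self) -> bytes:
        # Mirrors the production error: a passthrough response is not buffered.
        if self.direct_passthrough:
            raise RuntimeError("response object is in direct passthrough mode")
        return self._body


def get_response_size(response) -> int:
    """Return the response size in bytes, or 0 if it cannot be determined."""
    # Step 1: only touch .data when the response is not in passthrough mode.
    if not getattr(response, "direct_passthrough", False):
        return len(response.data)
    # Step 2: for streaming responses, fall back to the declared length.
    content_length = getattr(response, "content_length", None)
    # Step 3: record 0 when the size is unknown.
    return content_length or 0
```

With this shape, a static-file response is never buffered by the metrics hook; its size comes from content_length when the server sets it and is recorded as 0 otherwise.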


Issue 2: Database Metrics Showing Zero

Symptom

  • Admin dashboard shows 0 for all database metrics
  • Database pool statistics work correctly
  • Only operation metrics (count, avg, min, max) show zero

Root Cause Analysis

The Architecture Is Correct

Config: starpunk/config.py:90

app.config["METRICS_ENABLED"] = os.getenv("METRICS_ENABLED", "true").lower() == "true"

Defaults to enabled

Pool Initialization: starpunk/database/pool.py:172

metrics_enabled = app.config.get('METRICS_ENABLED', True)

Reads config correctly

Connection Wrapping: starpunk/database/pool.py:74-77

if self.metrics_enabled:
    from starpunk.monitoring import MonitoredConnection
    return MonitoredConnection(conn, self.slow_query_threshold)

Wraps connections when enabled

Metric Recording: starpunk/monitoring/database.py:83-89

record_metric(
    'database',
    f'{query_type} {table_name}',
    duration_ms,
    metadata,
    force=is_slow  # Always record slow queries
)

Calls record_metric correctly

The Real Problem: Sampling Rate

File: starpunk/monitoring/metrics.py:105-110

self._sampling_rates = sampling_rates or {
    "database": 0.1,  # Only 10% of queries recorded!
    "http": 0.1,
    "render": 0.1,
}

File: starpunk/monitoring/metrics.py:138-142

if not force:
    sampling_rate = self._sampling_rates.get(operation_type, 0.1)
    if random.random() > sampling_rate:  # 90% chance to skip!
        return False

Why Metrics Show Zero

  1. Low traffic: Production site has minimal activity
  2. 10% sampling: On average, only 1 in 10 database queries is recorded
  3. Fast queries: Queries complete in < 1 second, so force=False and sampling applies
  4. Statistical probability: Low traffic combined with 10% sampling gives a substantial chance that no metrics are recorded in a given window

Example scenario:

  • 20 database queries during monitoring window
  • 10% sampling = 2 metrics expected on average
  • But random sampling might record 0, 1, or 3 (statistical variation); with 20 queries the chance of recording none is about 12%
  • Dashboard shows 0 whenever no metrics happen to be sampled
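
To put a number on "statistical variation", the chance of sampling nothing can be computed directly. The sketch below assumes each query is sampled independently with the same probability, which matches the random.random() check shown earlier; the function name is hypothetical.

```python
def probability_of_zero_samples(queries: int, sampling_rate: float) -> float:
    """Chance that none of `queries` independently sampled operations is recorded."""
    return (1.0 - sampling_rate) ** queries


# Compare a few traffic levels at the hardcoded 10% rate.
for n in (5, 20, 50):
    print(n, round(probability_of_zero_samples(n, 0.1), 3))
```

At 10% sampling this gives roughly 59% for 5 queries, 12% for 20 queries, and under 1% for 50 queries, so an empty dashboard is plausible only when the monitoring window contains very few queries.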

Why Slow Queries Would Work

If there were slow queries (>= 1.0 second), they would be recorded with force=True, bypassing sampling. But production queries are all fast.

Impact

  • Severity: HIGH (feature incomplete, not critical to operations)
  • User Impact: Cannot see database performance metrics
  • Scope: Database operation metrics only (pool stats work fine)

Design Questions for Architect

  1. Is 10% sampling rate appropriate for production?

    • Pro: Reduces overhead, good for high-traffic sites
    • Con: Insufficient for low-traffic sites like this one
    • Alternative: Higher default (50-100%) or traffic-based adaptive sampling
  2. Should sampling be configurable?

    • Already supported via METRICS_SAMPLING_RATE config (starpunk/config.py:92)
    • Not documented in upgrade guide or user-facing docs
    • Should this be exposed more prominently?
  3. Should there be a minimum recording guarantee?

    • E.g., "Always record at least 1 metric per minute"
    • Or "First N operations always recorded"
    • Ensures metrics never show zero even with low traffic
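
As one way to picture the "first N operations always recorded" option, here is a minimal sketch. FirstNSampler and its parameters are hypothetical and do not exist in StarPunk; the sketch is not thread-safe and is only meant to make the design question concrete.

```python
import random


class FirstNSampler:
    """Hypothetical sampler that always records the first N operations per type."""

    def __init__(self, sampling_rate: float, guaranteed: int = 10, rng=None):
        self.sampling_rate = sampling_rate
        self.guaranteed = guaranteed
        self._seen = {}  # operation_type -> number of operations observed so far
        self._rng = rng or random.Random()

    def should_record(self, operation_type: str, force: bool = False) -> bool:
        seen = self._seen.get(operation_type, 0)
        self._seen[operation_type] = seen + 1
        # Forced operations (e.g. slow queries) and early operations bypass sampling.
        if force or seen < self.guaranteed:
            return True
        # Afterwards, record with probability sampling_rate.
        return self._rng.random() < self.sampling_rate
```

The trade-off is a small amount of per-type state in exchange for a dashboard that is populated shortly after startup even on a low-traffic site.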

Configuration Check

Checked production configuration sources:

Environment Variables (from config.py)

  • METRICS_ENABLED: defaults to "true" (enabled)
  • METRICS_SLOW_QUERY_THRESHOLD: defaults to 1.0 seconds
  • METRICS_SAMPLING_RATE: defaults to 1.0 (100%), which contradicts the 10% sampling observed above

WAIT - Config Discrepancy Detected!

In config.py:92:

app.config["METRICS_SAMPLING_RATE"] = float(os.getenv("METRICS_SAMPLING_RATE", "1.0"))

Default: 1.0 (100%)

But this config is never used by MetricsBuffer!

In metrics.py:336-341:

try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rates = current_app.config.get('METRICS_SAMPLING_RATES', None)  # Note: plural!
except (ImportError, RuntimeError):

The config key mismatch:

  • Config.py sets: METRICS_SAMPLING_RATE (singular, defaults to 1.0)
  • Metrics.py reads: METRICS_SAMPLING_RATES (plural, expects dict)
  • Result: Always returns None, falls back to hardcoded 10%

Root Cause Confirmed

The real issue is a configuration key mismatch:

  1. Config loads METRICS_SAMPLING_RATE (singular) = 1.0
  2. MetricsBuffer reads METRICS_SAMPLING_RATES (plural) expecting dict
  3. Key mismatch returns None
  4. Falls back to hardcoded 10% sampling
  5. Low traffic + 10% = no metrics
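
The failure mode, and one possible way to close it, can be sketched in a few lines. This is an illustration only: resolve_sampling_rates, DEFAULT_RATE and OPERATION_TYPES are hypothetical names, the default of 1.0 is taken from the config.py default quoted above, and the idea of accepting either a float or a dict under a single key is an assumption about how the two shapes could be reconciled.

```python
DEFAULT_RATE = 1.0  # assumed to match the config.py default quoted above
OPERATION_TYPES = ("database", "http", "render")


def resolve_sampling_rates(config: dict) -> dict:
    """Build per-type sampling rates from one config key holding a float or a dict."""
    value = config.get("METRICS_SAMPLING_RATE", DEFAULT_RATE)
    if isinstance(value, dict):
        # Per-type overrides; types not listed use the default rate.
        return {op: float(value.get(op, DEFAULT_RATE)) for op in OPERATION_TYPES}
    # A single number applies to every operation type.
    return {op: float(value) for op in OPERATION_TYPES}


config = {"METRICS_SAMPLING_RATE": 1.0}
# The mismatch: the plural key was never set, so the lookup silently yields None.
print(config.get("METRICS_SAMPLING_RATES"))
# Reading the key that config.py actually sets gives the intended rates.
print(resolve_sampling_rates(config))
```

Because dict.get returns None for a missing key instead of raising, the mismatch produced no error and surfaced only as missing metrics.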

Verification Evidence

Code References

  • starpunk/monitoring/http.py:74-78 - Static file error location
  • starpunk/monitoring/database.py:83-89 - Database metric recording
  • starpunk/monitoring/metrics.py:105-110 - Hardcoded sampling rates
  • starpunk/monitoring/metrics.py:336-341 - Config reading with wrong key
  • starpunk/config.py:92 - Config setting with different key

Container Logs

Error message confirmed in production logs (user reported)

Configuration Flow

  1. starpunk/config.py → Sets METRICS_SAMPLING_RATE (singular)
  2. starpunk/__init__.py → Initializes app with config
  3. starpunk/monitoring/metrics.py → Reads METRICS_SAMPLING_RATES (plural)
  4. Mismatch → Falls back to 10%

Recommendations for Architect

Issue 1: Static Files (CRITICAL)

Immediate action required:

  1. Fix starpunk/monitoring/http.py to handle streaming responses
  2. Test with static files before any deployment
  3. Consider adding integration test for static file serving

Issue 2: Database Metrics (HIGH)

Two problems to address:

Problem 2A: Config key mismatch

  • Fix either config.py or metrics.py so that both use the same key name
  • Decision needed: singular or plural?
    • Singular (METRICS_SAMPLING_RATE) simpler if same rate for all types
    • Plural (METRICS_SAMPLING_RATES) allows per-type customization

Problem 2B: Default sampling rate

  • 10% may be too low for low-traffic sites
  • Consider higher default (50-100%) for better visibility
  • Or make sampling traffic-adaptive

Design Questions

  1. Should there be a minimum recording guarantee so that metrics never show zero?
  2. Should sampling rate be per-operation-type or global?
  3. What's the right balance between overhead and visibility?

Next Steps

  1. Architect Review: Review findings and provide design decisions
  2. Fix Implementation: Implement approved fixes
  3. Testing: Comprehensive testing of both fixes
  4. Release: Deploy v1.1.2-rc.2 with fixes

References

  • v1.1.2 Implementation Plan: docs/projectplan/v1.1.2-implementation-plan.md
  • Phase 1 Report: docs/reports/v1.1.2-phase1-metrics-implementation.md
  • Developer Q&A: docs/design/v1.1.2/developer-qa.md (Questions Q6, Q12)