Files

Phil Skentelbery 1e2135a49a fix: Resolve v1.1.2-rc.1 production issues - Static files and metrics

This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:

1. CRITICAL: Static files returning 500 errors
   - HTTP monitoring middleware was accessing response.data on streaming responses
   - Fixed by checking direct_passthrough flag before accessing response data
   - Static files (CSS, JS, images) now load correctly
   - File: starpunk/monitoring/http.py

2. HIGH: Database metrics showing zero
   - Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
     buffer read METRICS_SAMPLING_RATES (plural)
   - Fixed by standardizing on singular key name
   - Modified MetricsBuffer to accept both float and dict for flexibility
   - Changed default sampling from 10% to 100% for better visibility
   - Files: starpunk/monitoring/metrics.py, starpunk/config.py

Version: 1.1.2-rc.2

Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md

Testing: All monitoring tests pass (28/28)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-28 09:46:31 -07:00

8.4 KiB

Raw Blame History

Architect Review: v1.1.2-rc.1 Production Issues

Date: 2025-11-28 Reviewer: StarPunk Architect Status: Design Decisions Provided

Executive Summary

The developer's investigation is accurate and thorough. Both root causes are correctly identified:

Static files issue: HTTP middleware doesn't handle streaming responses properly
Database metrics issue: Configuration key mismatch (METRICS_SAMPLING_RATE vs METRICS_SAMPLING_RATES)

Both issues require immediate fixes. This review provides clear design decisions and implementation guidance.

Issue 1: Static Files (CRITICAL)

Root Cause Validation

✅ Analysis Correct: The developer correctly identified that Flask's send_from_directory() returns streaming responses in "direct passthrough mode", and accessing .data on these triggers a RuntimeError.

Design Decision

Decision: Skip size tracking for streaming responses

The HTTP middleware should:

Check if response is in direct passthrough mode BEFORE accessing .data
Use content_length when available for streaming responses
Record size as 0 when size cannot be determined (not "unknown" - keep metrics numeric)

Rationale:

Streaming responses are designed to avoid loading entire content into memory
The content_length header (when present) provides sufficient size information
Recording 0 is better than excluding the metric entirely (preserves request count)
This aligns with the "minimal overhead" principle in ADR-053

Implementation Guidance

# File: starpunk/monitoring/http.py, lines 74-78
# REPLACE the current implementation with:

# Get response size (handle streaming responses)
response_size = 0
if hasattr(response, 'direct_passthrough') and response.direct_passthrough:
    # Streaming response - don't access .data
    if hasattr(response, 'content_length') and response.content_length:
        response_size = response.content_length
    # else: size remains 0 for unknown streaming responses
elif response.data:
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length

Key Points:

Check direct_passthrough FIRST to avoid the error
Fall back gracefully when size is unknown
Preserve the metric recording (don't skip static files entirely)

Issue 2: Database Metrics (HIGH)

Root Cause Validation

✅ Analysis Correct: Configuration key mismatch causes the system to always use 10% sampling, which is insufficient for low-traffic sites.

Design Decisions

Decision 1: Use Singular Configuration Key

Decision: Use METRICS_SAMPLING_RATE (singular) with a single float value

Rationale:

Simpler configuration model aligns with our "minimal code" principle
Single rate is sufficient for v1.x (no evidence of need for per-type rates)
Matches user expectation (config already uses singular form)
Can extend to per-type rates in v2.x if needed

Decision 2: Default Sampling Rate

Decision: Default to 100% sampling (1.0)

Rationale:

StarPunk is designed for single-user, low-traffic deployments
100% sampling has negligible overhead for typical usage
Ensures metrics are always visible (better UX)
Power users can reduce sampling if needed via environment variable
This matches the intent in config.py (which defaults to 1.0)

Decision 3: No Minimum Recording Guarantee

Decision: Keep simple percentage-based sampling without guarantees

Rationale:

Additional complexity not justified for v1.x
100% default sampling eliminates the zero-metrics problem
Minimum guarantees would complicate the clean sampling logic
YAGNI principle - we can add this if users report issues

Implementation Guidance

Step 1: Fix MetricsBuffer to accept float sampling rate

# File: starpunk/monitoring/metrics.py, lines 95-110
# Modify __init__ to accept either dict or float:

def __init__(self, max_size: int = 1000, sampling_rates: Optional[Union[Dict[str, float], float]] = None):
    """Initialize metrics buffer.

    Args:
        max_size: Maximum number of metrics to store
        sampling_rates: Either a float (0.0-1.0) for all operations,
                       or dict mapping operation type to rate
    """
    self.max_size = max_size
    self._buffer: Deque[Metric] = deque(maxlen=max_size)
    self._lock = Lock()
    self._process_id = os.getpid()

    # Handle both float and dict formats
    if sampling_rates is None:
        # Default to 100% sampling for low-traffic sites
        self._sampling_rates = {"database": 1.0, "http": 1.0, "render": 1.0}
    elif isinstance(sampling_rates, (int, float)):
        # Single rate for all operation types
        rate = float(sampling_rates)
        self._sampling_rates = {"database": rate, "http": rate, "render": rate}
    else:
        # Dict of per-type rates
        self._sampling_rates = sampling_rates

Step 2: Fix configuration reading

# File: starpunk/monitoring/metrics.py, lines 336-341
# Change to read the singular key:

try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rate = current_app.config.get('METRICS_SAMPLING_RATE', 1.0)  # Singular, defaults to 1.0
except (ImportError, RuntimeError):
    # Flask not available or no app context
    max_size = 1000
    sampling_rate = 1.0  # Default to 100% for low-traffic sites

_metrics_buffer = MetricsBuffer(
    max_size=max_size,
    sampling_rates=sampling_rate  # Pass the float directly
)

Priority and Release Strategy

Fix Priority

First: Issue 1 (Static Files) - Site is unusable without this
Second: Issue 2 (Database Metrics) - Feature incomplete but not blocking

Release Approach

Decision: Create v1.1.2-rc.2 (not a hotfix)

Rationale:

These are bugs in a release candidate, not a stable release
Following our git branching strategy, continue on the feature branch
Test thoroughly before promoting to stable v1.1.2

Implementation Steps

Fix static file handling (Issue 1)
Fix metrics configuration (Issue 2)
Add integration tests for both issues
Deploy v1.1.2-rc.2 to production
Monitor for 24 hours
If stable, tag as v1.1.2 (stable)

Testing Requirements

For Issue 1 (Static Files)

Test that all static files load correctly (CSS, JS, images)
Verify metrics still record for static files (with size when available)
Test with both small and large static files
Verify no errors in logs

For Issue 2 (Database Metrics)

Verify database metrics appear immediately (not zero)
Test with METRICS_SAMPLING_RATE=0.1 environment variable
Verify backwards compatibility (existing configs still work)
Check that slow queries (>1s) are always recorded regardless of sampling

Integration Test Additions

# tests/test_monitoring_integration.py

def test_static_file_metrics_recording():
    """Static files should not cause 500 errors and should record metrics."""
    response = client.get('/static/css/style.css')
    assert response.status_code == 200
    # Verify metric was recorded (even if size is 0)

def test_database_metrics_with_sampling():
    """Database metrics should respect sampling configuration."""
    app.config['METRICS_SAMPLING_RATE'] = 0.5
    # Perform operations and verify ~50% are recorded

Configuration Documentation Update

Update the deployment documentation to clarify:

# Environment Variables

## Metrics Configuration
- `METRICS_ENABLED`: Enable/disable metrics (default: true)
- `METRICS_SAMPLING_RATE`: Percentage of operations to record, 0.0-1.0 (default: 1.0)
  - 1.0 = 100% (recommended for low-traffic sites)
  - 0.1 = 10% (for high-traffic deployments)
- `METRICS_BUFFER_SIZE`: Number of metrics to retain (default: 1000)
- `METRICS_SLOW_QUERY_THRESHOLD`: Slow query threshold in seconds (default: 1.0)

Summary

The developer's investigation is excellent. The fixes are straightforward:

Static files: Add a simple check for direct_passthrough before accessing .data
Database metrics: Standardize on singular config key with 100% default sampling

Both fixes maintain our principles of simplicity and minimalism. No new dependencies, no complex logic, just fixing the bugs while keeping the code clean.

The developer should implement these fixes in order of priority, thoroughly test, and deploy as v1.1.2-rc.2.

Approved for implementation StarPunk Architect 2025-11-28

8.4 KiB Raw Blame History

Architect Review: v1.1.2-rc.1 Production Issues

Executive Summary

Issue 1: Static Files (CRITICAL)

Root Cause Validation

Design Decision

Implementation Guidance

Issue 2: Database Metrics (HIGH)

Root Cause Validation

Design Decisions

Decision 1: Use Singular Configuration Key

Decision 2: Default Sampling Rate

Decision 3: No Minimum Recording Guarantee

Implementation Guidance

Priority and Release Strategy

Fix Priority

Release Approach

Implementation Steps

Testing Requirements

For Issue 1 (Static Files)

For Issue 2 (Database Metrics)

Integration Test Additions

Configuration Documentation Update

Summary

8.4 KiB

Raw Blame History