Files
StarPunk/docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
Phil Skentelbery 1e2135a49a fix: Resolve v1.1.2-rc.1 production issues - Static files and metrics
This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:

1. CRITICAL: Static files returning 500 errors
   - HTTP monitoring middleware was accessing response.data on streaming responses
   - Fixed by checking direct_passthrough flag before accessing response data
   - Static files (CSS, JS, images) now load correctly
   - File: starpunk/monitoring/http.py

2. HIGH: Database metrics showing zero
   - Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
     buffer read METRICS_SAMPLING_RATES (plural)
   - Fixed by standardizing on singular key name
   - Modified MetricsBuffer to accept both float and dict for flexibility
   - Changed default sampling from 10% to 100% for better visibility
   - Files: starpunk/monitoring/metrics.py, starpunk/config.py

Version: 1.1.2-rc.2

Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md

Testing: All monitoring tests pass (28/28)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 09:46:31 -07:00

8.4 KiB

Architect Review: v1.1.2-rc.1 Production Issues

Date: 2025-11-28 Reviewer: StarPunk Architect Status: Design Decisions Provided

Executive Summary

The developer's investigation is accurate and thorough. Both root causes are correctly identified:

  1. Static files issue: HTTP middleware doesn't handle streaming responses properly
  2. Database metrics issue: Configuration key mismatch (METRICS_SAMPLING_RATE vs METRICS_SAMPLING_RATES)

Both issues require immediate fixes. This review provides clear design decisions and implementation guidance.

Issue 1: Static Files (CRITICAL)

Root Cause Validation

Analysis Correct: The developer correctly identified that Flask's send_from_directory() returns streaming responses in "direct passthrough mode", and accessing .data on these triggers a RuntimeError.

Design Decision

Decision: Skip size tracking for streaming responses

The HTTP middleware should:

  1. Check if response is in direct passthrough mode BEFORE accessing .data
  2. Use content_length when available for streaming responses
  3. Record size as 0 when size cannot be determined (not "unknown" - keep metrics numeric)

Rationale:

  • Streaming responses are designed to avoid loading entire content into memory
  • The content_length header (when present) provides sufficient size information
  • Recording 0 is better than excluding the metric entirely (preserves request count)
  • This aligns with the "minimal overhead" principle in ADR-053

Implementation Guidance

# File: starpunk/monitoring/http.py, lines 74-78
# REPLACE the current implementation with:

# Get response size (handle streaming responses)
response_size = 0
if hasattr(response, 'direct_passthrough') and response.direct_passthrough:
    # Streaming response - don't access .data
    if hasattr(response, 'content_length') and response.content_length:
        response_size = response.content_length
    # else: size remains 0 for unknown streaming responses
elif response.data:
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length

Key Points:

  • Check direct_passthrough FIRST to avoid the error
  • Fall back gracefully when size is unknown
  • Preserve the metric recording (don't skip static files entirely)

Issue 2: Database Metrics (HIGH)

Root Cause Validation

Analysis Correct: Configuration key mismatch causes the system to always use 10% sampling, which is insufficient for low-traffic sites.

Design Decisions

Decision 1: Use Singular Configuration Key

Decision: Use METRICS_SAMPLING_RATE (singular) with a single float value

Rationale:

  • Simpler configuration model aligns with our "minimal code" principle
  • Single rate is sufficient for v1.x (no evidence of need for per-type rates)
  • Matches user expectation (config already uses singular form)
  • Can extend to per-type rates in v2.x if needed

Decision 2: Default Sampling Rate

Decision: Default to 100% sampling (1.0)

Rationale:

  • StarPunk is designed for single-user, low-traffic deployments
  • 100% sampling has negligible overhead for typical usage
  • Ensures metrics are always visible (better UX)
  • Power users can reduce sampling if needed via environment variable
  • This matches the intent in config.py (which defaults to 1.0)

Decision 3: No Minimum Recording Guarantee

Decision: Keep simple percentage-based sampling without guarantees

Rationale:

  • Additional complexity not justified for v1.x
  • 100% default sampling eliminates the zero-metrics problem
  • Minimum guarantees would complicate the clean sampling logic
  • YAGNI principle - we can add this if users report issues

Implementation Guidance

Step 1: Fix MetricsBuffer to accept float sampling rate

# File: starpunk/monitoring/metrics.py, lines 95-110
# Modify __init__ to accept either dict or float:

def __init__(self, max_size: int = 1000, sampling_rates: Optional[Union[Dict[str, float], float]] = None):
    """Initialize metrics buffer.

    Args:
        max_size: Maximum number of metrics to store
        sampling_rates: Either a float (0.0-1.0) for all operations,
                       or dict mapping operation type to rate
    """
    self.max_size = max_size
    self._buffer: Deque[Metric] = deque(maxlen=max_size)
    self._lock = Lock()
    self._process_id = os.getpid()

    # Handle both float and dict formats
    if sampling_rates is None:
        # Default to 100% sampling for low-traffic sites
        self._sampling_rates = {"database": 1.0, "http": 1.0, "render": 1.0}
    elif isinstance(sampling_rates, (int, float)):
        # Single rate for all operation types
        rate = float(sampling_rates)
        self._sampling_rates = {"database": rate, "http": rate, "render": rate}
    else:
        # Dict of per-type rates
        self._sampling_rates = sampling_rates

Step 2: Fix configuration reading

# File: starpunk/monitoring/metrics.py, lines 336-341
# Change to read the singular key:

try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rate = current_app.config.get('METRICS_SAMPLING_RATE', 1.0)  # Singular, defaults to 1.0
except (ImportError, RuntimeError):
    # Flask not available or no app context
    max_size = 1000
    sampling_rate = 1.0  # Default to 100% for low-traffic sites

_metrics_buffer = MetricsBuffer(
    max_size=max_size,
    sampling_rates=sampling_rate  # Pass the float directly
)

Priority and Release Strategy

Fix Priority

  1. First: Issue 1 (Static Files) - Site is unusable without this
  2. Second: Issue 2 (Database Metrics) - Feature incomplete but not blocking

Release Approach

Decision: Create v1.1.2-rc.2 (not a hotfix)

Rationale:

  • These are bugs in a release candidate, not a stable release
  • Following our git branching strategy, continue on the feature branch
  • Test thoroughly before promoting to stable v1.1.2

Implementation Steps

  1. Fix static file handling (Issue 1)
  2. Fix metrics configuration (Issue 2)
  3. Add integration tests for both issues
  4. Deploy v1.1.2-rc.2 to production
  5. Monitor for 24 hours
  6. If stable, tag as v1.1.2 (stable)

Testing Requirements

For Issue 1 (Static Files)

  • Test that all static files load correctly (CSS, JS, images)
  • Verify metrics still record for static files (with size when available)
  • Test with both small and large static files
  • Verify no errors in logs

For Issue 2 (Database Metrics)

  • Verify database metrics appear immediately (not zero)
  • Test with METRICS_SAMPLING_RATE=0.1 environment variable
  • Verify backwards compatibility (existing configs still work)
  • Check that slow queries (>1s) are always recorded regardless of sampling

Integration Test Additions

# tests/test_monitoring_integration.py

def test_static_file_metrics_recording():
    """Static files should not cause 500 errors and should record metrics."""
    response = client.get('/static/css/style.css')
    assert response.status_code == 200
    # Verify metric was recorded (even if size is 0)

def test_database_metrics_with_sampling():
    """Database metrics should respect sampling configuration."""
    app.config['METRICS_SAMPLING_RATE'] = 0.5
    # Perform operations and verify ~50% are recorded

Configuration Documentation Update

Update the deployment documentation to clarify:

# Environment Variables

## Metrics Configuration
- `METRICS_ENABLED`: Enable/disable metrics (default: true)
- `METRICS_SAMPLING_RATE`: Percentage of operations to record, 0.0-1.0 (default: 1.0)
  - 1.0 = 100% (recommended for low-traffic sites)
  - 0.1 = 10% (for high-traffic deployments)
- `METRICS_BUFFER_SIZE`: Number of metrics to retain (default: 1000)
- `METRICS_SLOW_QUERY_THRESHOLD`: Slow query threshold in seconds (default: 1.0)

Summary

The developer's investigation is excellent. The fixes are straightforward:

  1. Static files: Add a simple check for direct_passthrough before accessing .data
  2. Database metrics: Standardize on singular config key with 100% default sampling

Both fixes maintain our principles of simplicity and minimalism. No new dependencies, no complex logic, just fixing the bugs while keeping the code clean.

The developer should implement these fixes in order of priority, thoroughly test, and deploy as v1.1.2-rc.2.


Approved for implementation StarPunk Architect 2025-11-28