This release candidate fixes two critical production issues discovered in v1.1.2-rc.1:
1. CRITICAL: Static files returning 500 errors
- HTTP monitoring middleware was accessing response.data on streaming responses
- Fixed by checking direct_passthrough flag before accessing response data
- Static files (CSS, JS, images) now load correctly
- File: starpunk/monitoring/http.py
2. HIGH: Database metrics showing zero
- Configuration key mismatch: config set METRICS_SAMPLING_RATE (singular),
buffer read METRICS_SAMPLING_RATES (plural)
- Fixed by standardizing on singular key name
- Modified MetricsBuffer to accept both float and dict for flexibility
- Changed default sampling from 10% to 100% for better visibility
- Files: starpunk/monitoring/metrics.py, starpunk/config.py
Version: 1.1.2-rc.2
Documentation:
- Investigation report: docs/reports/2025-11-28-v1.1.2-rc.1-production-issues.md
- Architect review: docs/reviews/2025-11-28-v1.1.2-rc.1-architect-review.md
- Implementation report: docs/reports/2025-11-28-v1.1.2-rc.2-fixes.md
Testing: All monitoring tests pass (28/28)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
238 lines
8.4 KiB
Markdown
238 lines
8.4 KiB
Markdown
# Architect Review: v1.1.2-rc.1 Production Issues
|
|
|
|
**Date:** 2025-11-28
|
|
**Reviewer:** StarPunk Architect
|
|
**Status:** Design Decisions Provided
|
|
|
|
## Executive Summary
|
|
|
|
The developer's investigation is accurate and thorough. Both root causes are correctly identified:
|
|
1. **Static files issue**: HTTP middleware doesn't handle streaming responses properly
|
|
2. **Database metrics issue**: Configuration key mismatch (`METRICS_SAMPLING_RATE` vs `METRICS_SAMPLING_RATES`)
|
|
|
|
Both issues require immediate fixes. This review provides clear design decisions and implementation guidance.
|
|
|
|
## Issue 1: Static Files (CRITICAL)
|
|
|
|
### Root Cause Validation
|
|
✅ **Analysis Correct**: The developer correctly identified that Flask's `send_from_directory()` returns streaming responses in "direct passthrough mode", and accessing `.data` on these triggers a `RuntimeError`.
|
|
|
|
### Design Decision
|
|
|
|
**Decision: Skip size tracking for streaming responses**
|
|
|
|
The HTTP middleware should:
|
|
1. Check if response is in direct passthrough mode BEFORE accessing `.data`
|
|
2. Use `content_length` when available for streaming responses
|
|
3. Record size as 0 when size cannot be determined (not "unknown" - keep metrics numeric)
|
|
|
|
**Rationale:**
|
|
- Streaming responses are designed to avoid loading entire content into memory
|
|
- The `content_length` header (when present) provides sufficient size information
|
|
- Recording 0 is better than excluding the metric entirely (preserves request count)
|
|
- This aligns with the "minimal overhead" principle in ADR-053
|
|
|
|
### Implementation Guidance
|
|
|
|
```python
|
|
# File: starpunk/monitoring/http.py, lines 74-78
|
|
# REPLACE the current implementation with:
|
|
|
|
# Get response size (handle streaming responses)
|
|
response_size = 0
|
|
if hasattr(response, 'direct_passthrough') and response.direct_passthrough:
|
|
# Streaming response - don't access .data
|
|
if hasattr(response, 'content_length') and response.content_length:
|
|
response_size = response.content_length
|
|
# else: size remains 0 for unknown streaming responses
|
|
elif response.data:
|
|
response_size = len(response.data)
|
|
elif hasattr(response, 'content_length') and response.content_length:
|
|
response_size = response.content_length
|
|
```
|
|
|
|
**Key Points:**
|
|
- Check `direct_passthrough` FIRST to avoid the error
|
|
- Fall back gracefully when size is unknown
|
|
- Preserve the metric recording (don't skip static files entirely)
|
|
|
|
## Issue 2: Database Metrics (HIGH)
|
|
|
|
### Root Cause Validation
|
|
✅ **Analysis Correct**: Configuration key mismatch causes the system to always use 10% sampling, which is insufficient for low-traffic sites.
|
|
|
|
### Design Decisions
|
|
|
|
#### Decision 1: Use Singular Configuration Key
|
|
|
|
**Decision: Use `METRICS_SAMPLING_RATE` (singular) with a single float value**
|
|
|
|
**Rationale:**
|
|
- Simpler configuration model aligns with our "minimal code" principle
|
|
- Single rate is sufficient for v1.x (no evidence of need for per-type rates)
|
|
- Matches user expectation (config already uses singular form)
|
|
- Can extend to per-type rates in v2.x if needed
|
|
|
|
#### Decision 2: Default Sampling Rate
|
|
|
|
**Decision: Default to 100% sampling (1.0)**
|
|
|
|
**Rationale:**
|
|
- StarPunk is designed for single-user, low-traffic deployments
|
|
- 100% sampling has negligible overhead for typical usage
|
|
- Ensures metrics are always visible (better UX)
|
|
- Power users can reduce sampling if needed via environment variable
|
|
- This matches the intent in config.py (which defaults to 1.0)
|
|
|
|
#### Decision 3: No Minimum Recording Guarantee
|
|
|
|
**Decision: Keep simple percentage-based sampling without guarantees**
|
|
|
|
**Rationale:**
|
|
- Additional complexity not justified for v1.x
|
|
- 100% default sampling eliminates the zero-metrics problem
|
|
- Minimum guarantees would complicate the clean sampling logic
|
|
- YAGNI principle - we can add this if users report issues
|
|
|
|
### Implementation Guidance
|
|
|
|
**Step 1: Fix MetricsBuffer to accept float sampling rate**
|
|
|
|
```python
|
|
# File: starpunk/monitoring/metrics.py, lines 95-110
|
|
# Modify __init__ to accept either dict or float:
|
|
|
|
def __init__(self, max_size: int = 1000, sampling_rates: Optional[Union[Dict[str, float], float]] = None):
|
|
"""Initialize metrics buffer.
|
|
|
|
Args:
|
|
max_size: Maximum number of metrics to store
|
|
sampling_rates: Either a float (0.0-1.0) for all operations,
|
|
or dict mapping operation type to rate
|
|
"""
|
|
self.max_size = max_size
|
|
self._buffer: Deque[Metric] = deque(maxlen=max_size)
|
|
self._lock = Lock()
|
|
self._process_id = os.getpid()
|
|
|
|
# Handle both float and dict formats
|
|
if sampling_rates is None:
|
|
# Default to 100% sampling for low-traffic sites
|
|
self._sampling_rates = {"database": 1.0, "http": 1.0, "render": 1.0}
|
|
elif isinstance(sampling_rates, (int, float)):
|
|
# Single rate for all operation types
|
|
rate = float(sampling_rates)
|
|
self._sampling_rates = {"database": rate, "http": rate, "render": rate}
|
|
else:
|
|
# Dict of per-type rates
|
|
self._sampling_rates = sampling_rates
|
|
```
|
|
|
|
**Step 2: Fix configuration reading**
|
|
|
|
```python
|
|
# File: starpunk/monitoring/metrics.py, lines 336-341
|
|
# Change to read the singular key:
|
|
|
|
try:
|
|
from flask import current_app
|
|
max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
|
|
sampling_rate = current_app.config.get('METRICS_SAMPLING_RATE', 1.0) # Singular, defaults to 1.0
|
|
except (ImportError, RuntimeError):
|
|
# Flask not available or no app context
|
|
max_size = 1000
|
|
sampling_rate = 1.0 # Default to 100% for low-traffic sites
|
|
|
|
_metrics_buffer = MetricsBuffer(
|
|
max_size=max_size,
|
|
sampling_rates=sampling_rate # Pass the float directly
|
|
)
|
|
```
|
|
|
|
## Priority and Release Strategy
|
|
|
|
### Fix Priority
|
|
1. **First**: Issue 1 (Static Files) - Site is unusable without this
|
|
2. **Second**: Issue 2 (Database Metrics) - Feature incomplete but not blocking
|
|
|
|
### Release Approach
|
|
|
|
**Decision: Create v1.1.2-rc.2 (not a hotfix)**
|
|
|
|
**Rationale:**
|
|
- These are bugs in a release candidate, not a stable release
|
|
- Following our git branching strategy, continue on the feature branch
|
|
- Test thoroughly before promoting to stable v1.1.2
|
|
|
|
### Implementation Steps
|
|
|
|
1. Fix static file handling (Issue 1)
|
|
2. Fix metrics configuration (Issue 2)
|
|
3. Add integration tests for both issues
|
|
4. Deploy v1.1.2-rc.2 to production
|
|
5. Monitor for 24 hours
|
|
6. If stable, tag as v1.1.2 (stable)
|
|
|
|
## Testing Requirements
|
|
|
|
### For Issue 1 (Static Files)
|
|
- Test that all static files load correctly (CSS, JS, images)
|
|
- Verify metrics still record for static files (with size when available)
|
|
- Test with both small and large static files
|
|
- Verify no errors in logs
|
|
|
|
### For Issue 2 (Database Metrics)
|
|
- Verify database metrics appear immediately (not zero)
|
|
- Test with `METRICS_SAMPLING_RATE=0.1` environment variable
|
|
- Verify backwards compatibility (existing configs still work)
|
|
- Check that slow queries (>1s) are always recorded regardless of sampling
|
|
|
|
### Integration Test Additions
|
|
|
|
```python
|
|
# tests/test_monitoring_integration.py
|
|
|
|
def test_static_file_metrics_recording():
|
|
"""Static files should not cause 500 errors and should record metrics."""
|
|
response = client.get('/static/css/style.css')
|
|
assert response.status_code == 200
|
|
# Verify metric was recorded (even if size is 0)
|
|
|
|
def test_database_metrics_with_sampling():
|
|
"""Database metrics should respect sampling configuration."""
|
|
app.config['METRICS_SAMPLING_RATE'] = 0.5
|
|
# Perform operations and verify ~50% are recorded
|
|
```
|
|
|
|
## Configuration Documentation Update
|
|
|
|
Update the deployment documentation to clarify:
|
|
|
|
```markdown
|
|
# Environment Variables
|
|
|
|
## Metrics Configuration
|
|
- `METRICS_ENABLED`: Enable/disable metrics (default: true)
|
|
- `METRICS_SAMPLING_RATE`: Percentage of operations to record, 0.0-1.0 (default: 1.0)
|
|
- 1.0 = 100% (recommended for low-traffic sites)
|
|
- 0.1 = 10% (for high-traffic deployments)
|
|
- `METRICS_BUFFER_SIZE`: Number of metrics to retain (default: 1000)
|
|
- `METRICS_SLOW_QUERY_THRESHOLD`: Slow query threshold in seconds (default: 1.0)
|
|
```
|
|
|
|
## Summary
|
|
|
|
The developer's investigation is excellent. The fixes are straightforward:
|
|
|
|
1. **Static files**: Add a simple check for `direct_passthrough` before accessing `.data`
|
|
2. **Database metrics**: Standardize on singular config key with 100% default sampling
|
|
|
|
Both fixes maintain our principles of simplicity and minimalism. No new dependencies, no complex logic, just fixing the bugs while keeping the code clean.
|
|
|
|
The developer should implement these fixes in order of priority, thoroughly test, and deploy as v1.1.2-rc.2.
|
|
|
|
---
|
|
|
|
**Approved for implementation**
|
|
StarPunk Architect
|
|
2025-11-28 |