# v1.1.2-rc.1 Production Issues Investigation Report

**Date:** 2025-11-28
**Version:** v1.1.2-rc.1
**Investigator:** Developer Agent
**Status:** Issues Identified, Fixes Needed

## Executive Summary

Two issues were identified in the v1.1.2-rc.1 production deployment, one critical and one high severity:

1. **CRITICAL**: Static files return 500 errors - site unusable (no CSS/JS)
2. **HIGH**: Database metrics showing zero - feature incomplete

Both issues have been traced to root causes and are ready for architect review.

---

## Issue 1: Static Files Return 500 Error

### Symptom
- All static files (CSS, JS, images) return HTTP 500
- Specifically: `https://starpunk.thesatelliteoflove.com/static/css/style.css` fails
- Site is unusable without stylesheets

### Error Message
```
RuntimeError: Attempted implicit sequence conversion but the response object is in direct passthrough mode.
```

### Root Cause
**File:** `starpunk/monitoring/http.py:74-78`

```python
# Get response size
response_size = 0
if response.data:  # <-- PROBLEM HERE
    response_size = len(response.data)
elif hasattr(response, 'content_length') and response.content_length:
    response_size = response.content_length
```

### Technical Analysis

The HTTP monitoring middleware's `after_request` hook attempts to access `response.data` to calculate response size for metrics. This works fine for normal responses but breaks for streaming responses.

**How Flask serves static files:**
1. Flask's `send_from_directory()` returns a streaming response
2. Streaming responses are in "direct passthrough mode"
3. Accessing `.data` on a streaming response triggers implicit sequence conversion
4. This raises `RuntimeError` because the response is not buffered

**Why this affects all static files:**
- ALL static files use `send_from_directory()`
- ALL are served as streaming responses
- The `after_request` hook runs for EVERY response
- Therefore ALL static files fail

### Impact
- **Severity:** CRITICAL
- **User Impact:** Site completely unusable - no styling, no JavaScript
- **Scope:** All static assets (CSS, JS, images, fonts, etc.)

### Proposed Fix Direction
The middleware needs to:
1. Check if response is in direct passthrough mode before accessing `.data`
2. Fall back to `content_length` for streaming responses
3. Handle cases where size cannot be determined (record as 0 or unknown)

**Code location for fix:** `starpunk/monitoring/http.py:74-78`
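
A minimal sketch of the fix direction above, not the final patch. The helper name `safe_response_size` is hypothetical. It relies only on the `direct_passthrough` and `content_length` attributes and the `data` property that Werkzeug response objects expose. `FakeStreamingResponse` and `FakeBufferedResponse` are stand-ins invented here so the example runs without Flask.

```python
from typing import Any


def safe_response_size(response: Any) -> int:
    """Return the response size in bytes without buffering a streamed body.

    Hypothetical helper: returns 0 when the size cannot be determined.
    """
    # Streaming responses (e.g. static files) must not have .data touched.
    if getattr(response, "direct_passthrough", False):
        return getattr(response, "content_length", None) or 0

    # Buffered responses: prefer the declared length, fall back to the body.
    content_length = getattr(response, "content_length", None)
    if content_length:
        return content_length
    try:
        return len(response.data)
    except RuntimeError:
        # Defensive: metrics collection should never break the response.
        return 0


class FakeStreamingResponse:
    """Stand-in that mimics a passthrough response (illustration only)."""

    direct_passthrough = True

    def __init__(self, content_length=None):
        self.content_length = content_length

    @property
    def data(self):
        raise RuntimeError("response object is in direct passthrough mode")


class FakeBufferedResponse:
    """Stand-in that mimics a normal buffered response (illustration only)."""

    direct_passthrough = False
    content_length = None

    def __init__(self, body: bytes):
        self.data = body


print(safe_response_size(FakeStreamingResponse(1234)))
print(safe_response_size(FakeStreamingResponse()))
print(safe_response_size(FakeBufferedResponse(b"hello")))
```

Checking `direct_passthrough` first means the body of a static file is never read by the metrics hook. Whether an undeterminable size should be recorded as 0 or as a separate "unknown" value is left to the architect.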

---

## Issue 2: Database Metrics Showing Zero

### Symptom
- Admin dashboard shows 0 for all database metrics
- Database pool statistics work correctly
- Only operation metrics (count, avg, min, max) show zero

### Root Cause Analysis

#### The Architecture Is Correct

**Config:** `starpunk/config.py:90`
```python
app.config["METRICS_ENABLED"] = os.getenv("METRICS_ENABLED", "true").lower() == "true"
```
✅ Defaults to enabled

**Pool Initialization:** `starpunk/database/pool.py:172`
```python
metrics_enabled = app.config.get('METRICS_ENABLED', True)
```
✅ Reads config correctly

**Connection Wrapping:** `starpunk/database/pool.py:74-77`
```python
if self.metrics_enabled:
    from starpunk.monitoring import MonitoredConnection
    return MonitoredConnection(conn, self.slow_query_threshold)
```
✅ Wraps connections when enabled

**Metric Recording:** `starpunk/monitoring/database.py:83-89`
```python
record_metric(
    'database',
    f'{query_type} {table_name}',
    duration_ms,
    metadata,
    force=is_slow  # Always record slow queries
)
```
✅ Calls record_metric correctly

#### The Real Problem: Sampling Rate

**File:** `starpunk/monitoring/metrics.py:105-110`

```python
self._sampling_rates = sampling_rates or {
    "database": 0.1,  # Only 10% of queries recorded!
    "http": 0.1,
    "render": 0.1,
}
```

**File:** `starpunk/monitoring/metrics.py:138-142`

```python
if not force:
    sampling_rate = self._sampling_rates.get(operation_type, 0.1)
    if random.random() > sampling_rate:  # 90% chance to skip!
        return False
```

### Why Metrics Show Zero

1. **Low traffic:** Production site has minimal activity
2. **10% sampling:** On average only 1 in 10 database queries is recorded
3. **Fast queries:** Queries complete in < 1 second, so `force=False`
4. **Statistical probability:** Low traffic combined with 10% sampling gives a high chance of recording no metrics at all

Example scenario:
- 20 database queries during monitoring window
- 10% sampling = 2 metrics expected on average
- Random sampling may record 0, 1, or 3 instead; the chance of recording none is 0.9^20, about 12%
- The dashboard shows 0 whenever no query in the window was sampled
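
The arithmetic behind this scenario can be checked directly. Assuming each query is sampled independently with probability `rate`, the chance that none of `n` queries is recorded is `(1 - rate) ** n`. The function name below is illustrative only.

```python
def prob_no_samples(n_queries: int, rate: float) -> float:
    """Probability that independent sampling at `rate` records none of `n_queries`."""
    return (1.0 - rate) ** n_queries


for n in (5, 20, 50):
    p = prob_no_samples(n, 0.1)
    print(f"{n:>3} queries at 10% sampling: P(zero recorded) = {p:.3f}")
```

At 10% sampling, a window with 5 queries records nothing about 59% of the time and a window with 20 queries about 12% of the time, so an empty dashboard is an expected outcome on a low-traffic site.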

### Why Slow Queries Would Work

If there were slow queries (>= 1.0 second), they would be recorded with `force=True`, bypassing sampling. But production queries are all fast.

### Impact
- **Severity:** HIGH (feature incomplete, not critical to operations)
- **User Impact:** Cannot see database performance metrics
- **Scope:** Database operation metrics only (pool stats work fine)

### Design Questions for Architect

1. **Is 10% sampling rate appropriate for production?**
   - Pro: Reduces overhead, good for high-traffic sites
   - Con: Insufficient for low-traffic sites like this one
   - Alternative: Higher default (50-100%) or traffic-based adaptive sampling

2. **Should sampling be configurable?**
   - A `METRICS_SAMPLING_RATE` setting exists (starpunk/config.py:92), but it is never read by `MetricsBuffer` (see Configuration Check below)
   - Not documented in upgrade guide or user-facing docs
   - Should this be exposed more prominently?

3. **Should there be a minimum recording guarantee?**
   - E.g., "Always record at least 1 metric per minute"
   - Or "First N operations always recorded"
   - Ensures metrics never show zero even with low traffic
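
One way to picture the "first N operations always recorded" option is a small wrapper around the sampling decision. This is a sketch under assumptions, not the existing `MetricsBuffer` logic: the class name `GuaranteedSampler`, the `always_first_n` parameter, and the injectable `rng` are all hypothetical, and thread safety is not addressed.

```python
import random
from collections import defaultdict
from typing import Callable, DefaultDict


class GuaranteedSampler:
    """Sampling decision that always keeps the first N operations per type."""

    def __init__(
        self,
        rate: float,
        always_first_n: int = 10,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.rate = rate
        self.always_first_n = always_first_n
        self._rng = rng
        self._seen: DefaultDict[str, int] = defaultdict(int)

    def should_record(self, operation_type: str, force: bool = False) -> bool:
        self._seen[operation_type] += 1
        # Forced (e.g. slow) operations and the first N always pass.
        if force or self._seen[operation_type] <= self.always_first_n:
            return True
        # Past the guaranteed window, fall back to probabilistic sampling.
        return self._rng() < self.rate


# A fixed rng makes the example deterministic: 0.5 is never below rate=0.1.
sampler = GuaranteedSampler(rate=0.1, always_first_n=3, rng=lambda: 0.5)
decisions = [sampler.should_record("database") for _ in range(5)]
print(decisions)
```

With this approach a freshly started process reports at least a few database metrics regardless of the sampling rate, at the cost of one counter per operation type.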

---

## Configuration Check

Checked production configuration sources:

### Environment Variables (from config.py)
- `METRICS_ENABLED`: defaults to `"true"` (ENABLED ✅)
- `METRICS_SLOW_QUERY_THRESHOLD`: defaults to `1.0` seconds
- `METRICS_SAMPLING_RATE`: defaults to `1.0` (100%), which conflicts with the 10% rate found above

### Config Discrepancy Detected

**In config.py:92:**
```python
app.config["METRICS_SAMPLING_RATE"] = float(os.getenv("METRICS_SAMPLING_RATE", "1.0"))
```
Default: **1.0 (100%)**

**But this config is never used by MetricsBuffer!**

**In metrics.py:336-341:**
```python
try:
    from flask import current_app
    max_size = current_app.config.get('METRICS_BUFFER_SIZE', 1000)
    sampling_rates = current_app.config.get('METRICS_SAMPLING_RATES', None)  # Note: plural!
except (ImportError, RuntimeError):
    ...  # fallback branch not shown in this excerpt
```

**The config key mismatch:**
- Config.py sets: `METRICS_SAMPLING_RATE` (singular, defaults to 1.0)
- Metrics.py reads: `METRICS_SAMPLING_RATES` (plural, expects dict)
- Result: The lookup always returns `None`, so `MetricsBuffer` falls back to the hardcoded 10%

### Root Cause Confirmed

**The real issue is a configuration key mismatch:**
1. Config loads `METRICS_SAMPLING_RATE` (singular) = 1.0
2. MetricsBuffer reads `METRICS_SAMPLING_RATES` (plural) expecting dict
3. Key mismatch returns None
4. Falls back to hardcoded 10% sampling
5. Low traffic + 10% = no metrics
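
One possible way to reconcile the two keys, shown only to illustrate the mismatch and pending the architect's decision: accept the per-type dict if present, otherwise apply the global rate to every operation type. The function name `resolve_sampling_rates` and the constants are hypothetical; the key names and operation types are taken from the excerpts above.

```python
from typing import Any, Dict, Mapping

OPERATION_TYPES = ("database", "http", "render")
HARDCODED_DEFAULT = 0.1  # the fallback currently hardcoded in metrics.py


def resolve_sampling_rates(config: Mapping[str, Any]) -> Dict[str, float]:
    """Build per-type sampling rates from either config key (sketch only)."""
    global_rate = config.get("METRICS_SAMPLING_RATE")
    per_type = config.get("METRICS_SAMPLING_RATES")

    # Base rate: the singular key if set, otherwise the hardcoded default.
    base = float(global_rate) if global_rate is not None else HARDCODED_DEFAULT

    # Per-type overrides from the plural key win over the base rate.
    if isinstance(per_type, dict):
        return {op: float(per_type.get(op, base)) for op in OPERATION_TYPES}
    return {op: base for op in OPERATION_TYPES}


# With only the singular key set (the current config.py default), every
# operation type is sampled at 100% instead of silently dropping to 10%.
print(resolve_sampling_rates({"METRICS_SAMPLING_RATE": 1.0}))
```

Supporting both keys keeps existing deployments working while still allowing per-type overrides; choosing a single key instead would be simpler but requires a migration note.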

---

## Verification Evidence

### Code References
- `starpunk/monitoring/http.py:74-78` - Static file error location
- `starpunk/monitoring/database.py:83-89` - Database metric recording
- `starpunk/monitoring/metrics.py:105-110` - Hardcoded sampling rates
- `starpunk/monitoring/metrics.py:336-341` - Config reading with wrong key
- `starpunk/config.py:92` - Config setting with different key

### Container Logs
Error message confirmed in production logs (user reported)

### Configuration Flow
1. `starpunk/config.py` → Sets `METRICS_SAMPLING_RATE` (singular)
2. `starpunk/__init__.py` → Initializes app with config
3. `starpunk/monitoring/metrics.py` → Reads `METRICS_SAMPLING_RATES` (plural)
4. Mismatch → Falls back to 10%

---

## Recommendations for Architect

### Issue 1: Static Files (CRITICAL)
**Immediate action required:**
1. Fix `starpunk/monitoring/http.py` to handle streaming responses
2. Test with static files before any deployment
3. Consider adding integration test for static file serving

### Issue 2: Database Metrics (HIGH)
**Two problems to address:**

**Problem 2A: Config key mismatch**
- Fix either config.py or metrics.py so that both use the same key name
- Decision needed: singular or plural?
  - Singular (`METRICS_SAMPLING_RATE`) simpler if same rate for all types
  - Plural (`METRICS_SAMPLING_RATES`) allows per-type customization

**Problem 2B: Default sampling rate**
- 10% may be too low for low-traffic sites
- Consider higher default (50-100%) for better visibility
- Or make sampling traffic-adaptive

### Design Questions
1. Should there be a minimum recording guarantee so that metrics never read zero under low traffic?
2. Should sampling rate be per-operation-type or global?
3. What's the right balance between overhead and visibility?

---

## Next Steps

1. **Architect Review:** Review findings and provide design decisions
2. **Fix Implementation:** Implement approved fixes
3. **Testing:** Comprehensive testing of both fixes
4. **Release:** Deploy v1.1.2-rc.2 with fixes

---

## References

- v1.1.2 Implementation Plan: `docs/projectplan/v1.1.2-implementation-plan.md`
- Phase 1 Report: `docs/reports/v1.1.2-phase1-metrics-implementation.md`
- Developer Q&A: `docs/design/v1.1.2/developer-qa.md` (Questions Q6, Q12)