docs: Add architect documentation for migration race condition fix

Add comprehensive architectural documentation for the migration race
condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: Answers to all 23 open questions
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:53:55 -07:00
parent 686d753fb9
commit 2240414f22
4 changed files with 1354 additions and 0 deletions


@@ -0,0 +1,238 @@
# Migration Race Condition Fix - Quick Implementation Reference
## Implementation Checklist
### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`
```python
# 1. Add imports at top
import time
import random

# 2. Replace entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
        raise  # Not a lock error, or retries exhausted
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass

# B. Error handling pattern (inside the migration logic above)
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message (raised after the retry loop exits without success)
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```
### Testing Requirements
#### 1. Unit Test File: `test_migration_race_condition.py`
```python
import multiprocessing
import sqlite3
from unittest.mock import MagicMock, patch

from starpunk.migrations import run_migrations


def _worker(barrier, results, worker_id):
    """Wait at the barrier so all workers initialize at the same instant."""
    barrier.wait()  # Synchronize start
    from starpunk import create_app
    create_app()
    results[worker_id] = True


def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    barrier = multiprocessing.Barrier(4)
    results = multiprocessing.Manager().dict()
    workers = [
        multiprocessing.Process(target=_worker, args=(barrier, results, i))
        for i in range(4)
    ]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    assert all(results.get(i) for i in range(4)), "Some workers failed"


def test_lock_retry(db_path):
    """Test retry logic with mock"""
    with patch("sqlite3.connect") as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock(),  # Success on 3rd try
        ]
        run_migrations(db_path)
        assert mock.call_count == 3
```
#### 2. Integration Test: `test_integration.sh`
```bash
#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID
echo "SUCCESS: All workers started without race condition"
```
#### 3. Container Test: `test_container.sh`
```bash
#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .

# Run with fresh database
podman run --rm -d --name race-test \
    -v $(pwd)/test-data:/data \
    starpunk:race-test

# Check logs for success patterns
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)"

# Cleanup
podman stop race-test
```
### Verification Patterns in Logs
#### Successful Migration (One Worker Wins)
```
Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
```
#### Performance Metrics to Check
- Single worker: < 100ms total
- 4 workers: < 500ms total
- 10 workers (stress): < 2000ms total
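A rough way to check the 4-worker target before a full gunicorn run — a hedged sketch that assumes `run_migrations(db_path)` from `starpunk.migrations` and a fresh database file; the helper name is illustrative:

```python
import multiprocessing
import time

from starpunk.migrations import run_migrations


def measure_concurrent_startup(db_path, workers=4):
    """Start N processes that each run migrations; return elapsed milliseconds."""
    processes = [
        multiprocessing.Process(target=run_migrations, args=(db_path,))
        for _ in range(workers)
    ]
    start = time.time()
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return (time.time() - start) * 1000


if __name__ == "__main__":
    elapsed_ms = measure_concurrent_startup("test.db")
    assert elapsed_ms < 500, f"4-worker startup too slow: {elapsed_ms:.0f}ms"
```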
### Rollback Plan if Issues
1. **Immediate Workaround**
   ```bash
   # Change to single worker temporarily
   gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
   ```
2. **Revert Code**
   ```bash
   git revert HEAD
   ```
3. **Emergency Patch**
   ```python
   # In app.py temporarily
   # Note: gunicorn does not set GUNICORN_WORKER_ID itself; it would need to be
   # exported by the launcher (e.g. a post_fork hook).
   import os
   if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
       init_db()  # Only first worker runs migrations
   ```
### Deployment Commands
```bash
# 1. Run tests
python -m pytest test_migration_race_condition.py -v
# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .
# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1
# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1
# 5. Deploy
kubectl rollout restart deployment/starpunk
```
---
## Critical Points to Remember
1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED
3. **30s per attempt, 120s total max** - Two different timeouts
4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
5. **Test at multiple levels** - Unit, integration, container
6. **Fresh database state** between tests
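For reference, the backoff parameters above produce the following delay schedule (a quick check; jitter omitted):

```python
base_delay = 0.1
delays = [base_delay * (2 ** n) for n in range(1, 10)]
print(delays)       # [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2]
print(sum(delays))  # ~102.2s of cumulative sleep, hence the 120s absolute cap
```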
## Support
If issues arise, check:
1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
3. SQLite lock states: `PRAGMA lock_status` while the issue is occurring (only available in debug builds of SQLite)
---
*Quick Reference v1.0 - 2025-11-24*


@@ -0,0 +1,477 @@
# Migration Race Condition Fix - Architectural Answers
## Status: READY FOR IMPLEMENTATION
All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.
---
## Critical Questions
### 1. Connection Lifecycle Management
**Q: Should we create a new connection for each retry or reuse the same connection?**
**Answer: NEW CONNECTION per retry**
- Each retry MUST create a fresh connection
- Rationale: Failed lock acquisition may leave connection in inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
```python
while retry_count < max_retries:
    conn = None  # Fresh connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        # ... attempt migration ...
    finally:
        if conn:
            conn.close()
```
### 2. Transaction Boundaries
**Q: Should init_db() wrap everything in one transaction?**
**Answer: NO - Separate transactions for different operations**
- Schema creation: Own transaction (already implicit in executescript)
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent
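A minimal sketch of that structure — hypothetical names except `run_migrations`; `SCHEMA_SQL` and the `app_config` insert stand in for the real schema and seed data:

```python
import sqlite3


def init_db(db_path):
    # 1. Schema creation: its own transaction (executescript commits implicitly)
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        conn.executescript(SCHEMA_SQL)
    finally:
        conn.close()

    # 2. Migrations: their own transaction, guarded by BEGIN IMMEDIATE
    run_migrations(db_path)

    # 3. Initial data: its own transaction
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR IGNORE INTO app_config (key, value) VALUES (?, ?)",
                ("initialized", "1"),
            )
    finally:
        conn.close()
```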
### 3. Lock Timeout vs Retry Timeout
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**
**Answer: This is BY DESIGN - No conflict**
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across attempts (the backoff delays of 0.1s × 2¹ … 2⁹ sum to roughly 102s)
- If one worker holds lock for 30s+, other workers timeout and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
```python
start_time = time.time()
max_total_time = 120 # 2 minutes absolute maximum
while retry_count < max_retries and (time.time() - start_time) < max_total_time:
```
### 4. Testing Strategy
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**
**Answer: BOTH - Different test levels**
- Unit tests: multiprocessing.Pool (fast, isolated)
- Integration tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
```
Level 1: Mock concurrent access (unit)
Level 2: multiprocessing.Pool (integration)
Level 3: gunicorn locally (system)
Level 4: Container with gunicorn (e2e)
```
### 5. BEGIN IMMEDIATE vs EXCLUSIVE
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**
**Answer: BEGIN IMMEDIATE is CORRECT choice**
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
- Migrations only need to prevent concurrent migrations (writes)
- Other workers can still read schema while one migrates
- Less contention, faster startup
- Only escalates to EXCLUSIVE when actually writing
- Keep BEGIN IMMEDIATE as specified
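The behaviour is easy to observe with two connections (a standalone illustration, not part of migrations.py):

```python
import sqlite3

writer = sqlite3.connect("demo.db", timeout=0.1)
writer.execute("CREATE TABLE IF NOT EXISTS t (x)")
writer.commit()

writer.execute("BEGIN IMMEDIATE")  # RESERVED lock: other writers now blocked

reader = sqlite3.connect("demo.db", timeout=0.1)
print(reader.execute("SELECT count(*) FROM t").fetchone())  # reads still work

other_writer = sqlite3.connect("demo.db", timeout=0.1)
try:
    other_writer.execute("BEGIN IMMEDIATE")  # second RESERVED lock attempt
except sqlite3.OperationalError as e:
    print(f"second writer blocked: {e}")     # "database is locked"

writer.rollback()  # release the lock
```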
---
## Edge Cases and Error Handling
### 6. Partial Migration Failure
**Q: What if a migration partially applies or rollback fails?**
**Answer: Transaction atomicity handles this**
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
```python
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        # Database potentially corrupt - fail hard
        raise SystemExit(1)
    raise MigrationError(e)
```
### 7. Migration File Consistency
**Q: What if migration files change during deployment?**
**Answer: Not a concern with proper deployment**
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
```python
# Store in schema_migrations: (name, checksum, applied_at)
# Verify checksum matches before applying
```
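A hedged sketch of what that validation could look like — it assumes a `checksum` column on `schema_migrations`, which the current schema does not have:

```python
import hashlib
from pathlib import Path


def migration_checksum(path):
    """SHA-256 of the migration file contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_checksum(conn, name, path):
    """Raise if a previously applied migration file has changed on disk."""
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE name = ?", (name,)
    ).fetchone()
    if row and row[0] != migration_checksum(path):
        raise MigrationError(
            f"Migration {name} changed after it was applied "
            f"(stored checksum does not match file on disk)"
        )
```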
### 8. Retry Exhaustion Error Messages
**Q: What error message when retries exhausted?**
**Answer: Be specific and actionable**
```python
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```
### 9. Logging Levels
**Q: What log level for lock waits?**
**Answer: Graduated approach**
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
```python
if retry_count <= 3:
    level = logging.DEBUG
elif retry_count <= 7:
    level = logging.INFO
else:
    level = logging.WARNING
logger.log(level, f"Retry {retry_count}/{max_retries}")
```
### 10. Index Creation Failure
**Q: How to handle index creation failures in migration 002?**
**Answer: Fail fast with clear context**
```python
for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            )
```
---
## Testing Strategy
### 11. Concurrent Testing Simulation
**Q: How to properly simulate concurrent worker startup?**
**Answer: Multiple approaches**
```python
# Approach 1: Barrier synchronization (all workers released together)
# Note: the worker must be a module-level function; a nested function
# cannot be handed to a child process.
def _migrate_after_barrier(barrier, db_path):
    barrier.wait()  # All start together
    return run_migrations(db_path)

def test_concurrent_migrations(db_path):
    barrier = multiprocessing.Barrier(4)
    workers = [
        Process(target=_migrate_after_barrier, args=(barrier, db_path))
        for _ in range(4)
    ]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

# Approach 2: Plain process start (near-simultaneous)
processes = []
for i in range(4):
    p = Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
```
### 12. Lock Contention Testing
**Q: How to test lock contention scenarios?**
**Answer: Inject delays**
```python
# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Verify retry logic gives up with MigrationError once retries are exhausted
    with pytest.raises(MigrationError):
        run_migrations(db_path)
```
### 13. Performance Tests
**Q: What timing is acceptable?**
**Answer: Performance targets**
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
```python
import timeit
from starpunk import create_app

setup_time = timeit.timeit(lambda: create_app(), number=1)
assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
```
### 14. Retry Logic Unit Tests
**Q: How to unit test retry logic?**
**Answer: Mock the lock failures**
```python
class TestRetryLogic:
    def test_retry_on_lock(self):
        with patch('sqlite3.connect') as mock:
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3
```
---
## SQLite-Specific Concerns
### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
**Q: Deep dive on lock choice?**
**Answer: Lock escalation path**
```
BEGIN DEFERRED  → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE
```
For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
### 16. WAL Mode Interaction
**Q: How does this work with WAL mode?**
**Answer: Works correctly with both modes**
- Rollback-journal mode (the default): BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
```python
cursor = conn.execute("PRAGMA journal_mode")
mode = cursor.fetchone()[0]
logger.debug(f"Database in {mode} mode")
```
### 17. Database File Permissions
**Q: How to handle permission issues?**
**Answer: Fail fast with helpful diagnostics**
```python
import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )
```
---
## Deployment/Operations
### 18. Container Startup and Health Checks
**Q: How to handle health checks during migration?**
**Answer: Return 503 during migration**
```python
# In app.py
MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False

@app.route('/health')
def health():
    if MIGRATION_IN_PROGRESS:
        return {'status': 'migrating'}, 503
    return {'status': 'healthy'}, 200
```
### 19. Monitoring and Alerting
**Q: What metrics/alerts are needed?**
**Answer: Key metrics to track**
```python
import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1            # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))
```
---
## Alternative Approaches
### 20. Version Compatibility
**Q: How to handle version mismatches?**
**Answer: Strict version checking**
```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )
```
### 21. File-Based Locking
**Q: Should we consider flock() as backup?**
**Answer: NO - Adds complexity without benefit**
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach
### 22. Gunicorn Preload
**Q: Would --preload flag help?**
**Answer: NO - Makes problem WORSE**
- --preload runs app initialization ONCE in master
- Workers fork from master AFTER migrations complete
- BUT: Doesn't work with lazy-loaded resources
- Current architecture expects per-worker initialization
- Keep current approach
### 23. Application-Level Locks
**Q: Should we add Redis/memcached for coordination?**
**Answer: NO - Violates simplicity principle**
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem
---
## Final Implementation Checklist
### Required Changes
1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic
### Test Coverage Required
1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention
### Documentation Updates
1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info
---
## Go/No-Go Decision
### ✅ GO FOR IMPLEMENTATION
**Rationale:**
- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with clear rollback plan
- Testing strategy is comprehensive
**Implementation Priority: IMMEDIATE**
- This is blocking v1.0.0-rc.4 release
- Production systems affected
- Fix is well-understood and low-risk
**Next Steps:**
1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned
---
*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*