docs: Add architect documentation for migration race condition fix
Add comprehensive architectural documentation for the migration race
condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

docs/reports/migration-race-condition-fix-implementation.md (new file, 431 lines)
# Migration Race Condition Fix - Implementation Guide

## Executive Summary

**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers race to apply migrations, causing container startup failures.

**Solution**: Implement database-level locking with retry logic in `migrations.py`.

**Urgency**: HIGH - This is a blocker for the v1.0.0-rc.4 release.

## Root Cause Analysis

### The Problem Flow

1. Container starts with `gunicorn --workers 4`
2. Each worker independently calls:
   ```
   app.py → create_app() → init_db() → run_migrations()
   ```
3. All 4 workers simultaneously try to:
   - INSERT into the `schema_migrations` table
   - Apply the same migrations
4. SQLite's UNIQUE constraint on `migration_name` causes workers 2-4 to crash
5. The container restarts and works on the second attempt (migrations already applied)

### Why This Happens

- **No synchronization**: Workers are independent processes
- **No locking**: The migration code does not prevent concurrent execution
- **Immediate failure**: The UNIQUE constraint violation crashes the worker
- **Gunicorn behavior**: A worker crash triggers a container restart
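
To make the failure mode concrete, here is a minimal, standalone reproduction. The file and table setup are hypothetical (this is not the actual StarPunk code): several processes insert the same row into a table with a UNIQUE `migration_name` column, and all but the first hit `sqlite3.IntegrityError`, which is the crash described above.

```python
#!/usr/bin/env python3
"""Minimal reproduction of the race: N workers record the same migration."""

import multiprocessing
import sqlite3

DB_PATH = "race-demo.db"  # throwaway database; delete it to rerun from scratch


def setup():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations "
        "(migration_name TEXT UNIQUE NOT NULL)"
    )
    conn.commit()
    conn.close()


def unprotected_migrate(worker_id):
    """What each gunicorn worker effectively does today: no lock, no retry."""
    conn = sqlite3.connect(DB_PATH)
    try:
        # Every worker believes it is first and records the same migration
        conn.execute(
            "INSERT INTO schema_migrations (migration_name) VALUES (?)",
            ("001_initial_schema.sql",),
        )
        conn.commit()
        return f"worker {worker_id}: applied"
    except sqlite3.IntegrityError as e:
        # Workers 2..N land here; in production this crashes the worker
        return f"worker {worker_id}: CRASHED ({e})"
    finally:
        conn.close()


if __name__ == "__main__":
    setup()
    with multiprocessing.Pool(4) as pool:
        for line in pool.map(unprotected_migrate, range(1, 5)):
            print(line)
```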

## Immediate Fix Implementation

### Step 1: Update migrations.py

Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
import time
import random
```
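
These imports drive the exponential backoff in the retry loop below. As a quick illustration of what that schedule works out to, here is a standalone sketch using the same `base_delay = 0.1` and `max_retries = 10` constants as the function in Step 2; the cumulative worst-case wait is roughly 102 seconds, consistent with the ~120s total maximum noted in the key decisions at the end of this document.

```python
import random

base_delay = 0.1   # seconds, same constant as in run_migrations below
max_retries = 10

total = 0.0
for retry_count in range(1, max_retries):  # the 10th failed attempt raises instead of sleeping
    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
    total += delay
    print(f"retry {retry_count}: wait ~{delay:.2f}s (cumulative ~{total:.1f}s)")
```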

### Step 2: Replace run_migrations function

Replace the entire `run_migrations` function (lines 304-462) with:

```python
def run_migrations(db_path, logger=None):
    """
    Run all pending database migrations with concurrency protection

    Uses database-level locking to prevent race conditions when multiple
    workers start simultaneously. Only one worker will apply migrations;
    others will wait and verify completion.

    Args:
        db_path: Path to SQLite database file
        logger: Optional logger for output

    Raises:
        MigrationError: If any migration fails to apply or lock cannot be acquired
    """
    if logger is None:
        logger = logging.getLogger(__name__)

    # Determine migrations directory
    migrations_dir = Path(__file__).parent.parent / "migrations"

    if not migrations_dir.exists():
        logger.warning(f"Migrations directory not found: {migrations_dir}")
        return

    # Retry configuration for lock acquisition
    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None
        try:
            # Connect with longer timeout for lock contention
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Attempt to acquire exclusive lock for migrations
            # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Ensure migrations tracking table exists
                create_migrations_table(conn)

                # Quick check: have migrations already been applied by another worker?
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                migration_count = cursor.fetchone()[0]

                # Discover migration files
                migration_files = discover_migration_files(migrations_dir)

                if not migration_files:
                    conn.commit()
                    logger.info("No migration files found")
                    return

                # If migrations exist and we're not the first worker, verify and exit
                if migration_count > 0:
                    # Check if all migrations are applied
                    applied = get_applied_migrations(conn)
                    pending = [m for m, _ in migration_files if m not in applied]

                    if not pending:
                        conn.commit()
                        logger.debug("All migrations already applied by another worker")
                        return
                    # If there are pending migrations, we continue to apply them
                    logger.info(f"Found {len(pending)} pending migrations to apply")

                # Fresh database detection (original logic preserved)
                if migration_count == 0:
                    if is_schema_current(conn):
                        # Schema is current - mark all migrations as applied
                        for migration_name, _ in migration_files:
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                        conn.commit()
                        logger.info(
                            f"Fresh database detected: marked {len(migration_files)} "
                            f"migrations as applied (schema already current)"
                        )
                        return
                    else:
                        logger.info("Fresh database with partial schema: applying needed migrations")

                # Get already-applied migrations
                applied = get_applied_migrations(conn)

                # Apply pending migrations (original logic preserved)
                pending_count = 0
                skipped_count = 0
                for migration_name, migration_path in migration_files:
                    if migration_name not in applied:
                        # Check if migration is actually needed
                        should_check_needed = (
                            migration_count == 0 or
                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
                        )

                        if should_check_needed and not is_migration_needed(conn, migration_name):
                            # Special handling for migration 002: if tables exist but indexes don't
                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
                                # Check if we need to create indexes
                                indexes_to_create = []
                                if not index_exists(conn, 'idx_tokens_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
                                if not index_exists(conn, 'idx_tokens_me'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
                                if not index_exists(conn, 'idx_tokens_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
                                if not index_exists(conn, 'idx_auth_codes_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
                                if not index_exists(conn, 'idx_auth_codes_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")

                                if indexes_to_create:
                                    for index_sql in indexes_to_create:
                                        conn.execute(index_sql)
                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")

                            # Mark as applied without executing full migration
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                            skipped_count += 1
                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
                        else:
                            # Apply the migration (within our transaction)
                            try:
                                # Read migration SQL
                                migration_sql = migration_path.read_text()

                                logger.debug(f"Applying migration: {migration_name}")

                                # Execute migration (already in transaction)
                                conn.executescript(migration_sql)

                                # Record migration as applied
                                conn.execute(
                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                    (migration_name,)
                                )

                                logger.info(f"Applied migration: {migration_name}")
                                pending_count += 1

                            except Exception as e:
                                # Roll back the transaction
                                raise MigrationError(f"Migration {migration_name} failed: {e}")

                # Commit all migrations atomically
                conn.commit()

                # Summary
                total_count = len(migration_files)
                if pending_count > 0 or skipped_count > 0:
                    if skipped_count > 0:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
                            f"(already in SCHEMA_SQL), {total_count} total"
                        )
                    else:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, "
                            f"{total_count} total"
                        )
                else:
                    logger.info(f"All migrations up to date ({total_count} total)")

                return  # Success!

            except MigrationError:
                conn.rollback()
                raise

            except Exception as e:
                conn.rollback()
                raise MigrationError(f"Migration system error: {e}")

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e).lower():
                # Another worker has the lock, retry with exponential backoff
                retry_count += 1

                if retry_count < max_retries:
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                    logger.debug(
                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
                        f"in {delay:.2f}s"
                    )
                    time.sleep(delay)
                    continue
                else:
                    raise MigrationError(
                        f"Failed to acquire migration lock after {max_retries} attempts. "
                        f"This may indicate a hung migration process."
                    )
            else:
                # Non-lock related database error
                error_msg = f"Database error during migration: {e}"
                logger.error(error_msg)
                raise MigrationError(error_msg)

        except MigrationError:
            # Already rolled back and wrapped above; propagate unchanged
            raise

        except Exception as e:
            # Unexpected error
            error_msg = f"Unexpected error during migration: {e}"
            logger.error(error_msg)
            raise MigrationError(error_msg)

        finally:
            if conn:
                try:
                    conn.close()
                except Exception:
                    pass  # Ignore errors during cleanup

    # Should never reach here, but just in case
    raise MigrationError("Migration retry loop exited unexpectedly")
```
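
The locking primitive this relies on can be sanity-checked in isolation. The following standalone sketch is not part of `migrations.py` and the database file name is made up: it opens two connections to the same file, the first holds a `BEGIN IMMEDIATE` transaction, and the second's `BEGIN IMMEDIATE` fails with "database is locked" once its busy timeout expires, which is exactly the `sqlite3.OperationalError` that the retry loop above catches and backs off on.

```python
"""Standalone check of BEGIN IMMEDIATE contention between two connections."""

import sqlite3

DB = "lock-demo.db"  # throwaway file; delete it after running

# Writer 1 takes the RESERVED lock and keeps its transaction open
holder = sqlite3.connect(DB, isolation_level=None)  # autocommit; we issue BEGIN ourselves
holder.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")

# Writer 2 uses a short busy timeout so the demo fails fast instead of waiting 30s
waiter = sqlite3.connect(DB, timeout=0.5, isolation_level=None)
try:
    waiter.execute("BEGIN IMMEDIATE")
    print("unexpected: second writer acquired the lock")
except sqlite3.OperationalError as e:
    print(f"second writer blocked as expected: {e}")  # "database is locked"
finally:
    waiter.close()
    holder.rollback()
    holder.close()
```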

### Step 3: Testing the Fix

Create a test script to verify the fix works:

```python
#!/usr/bin/env python3
"""Test migration race condition fix"""

import multiprocessing
import sys
from pathlib import Path

# Add project to path
sys.path.insert(0, str(Path(__file__).parent))


def worker_init(worker_id):
    """Simulate a gunicorn worker starting"""
    print(f"Worker {worker_id}: Starting...")

    try:
        from starpunk import create_app

        app = create_app()  # creating the app runs init_db() and migrations
        print(f"Worker {worker_id}: Successfully initialized")
        return True
    except Exception as e:
        print(f"Worker {worker_id}: FAILED - {e}")
        return False


if __name__ == "__main__":
    # Test with 10 workers (more than production to stress test)
    num_workers = 10

    print(f"Starting {num_workers} workers simultaneously...")

    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(worker_init, range(num_workers))

    success_count = sum(results)
    print(f"\nResults: {success_count}/{num_workers} workers succeeded")

    if success_count == num_workers:
        print("SUCCESS: All workers initialized without race condition")
        sys.exit(0)
    else:
        print("FAILURE: Race condition still present")
        sys.exit(1)
```

## Verification Steps

1. **Local Testing**:
   ```bash
   # Test with multiple workers
   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app

   # Check logs for retry messages
   # Should see "Database locked by another worker, retry..." messages
   ```

2. **Container Testing**:
   ```bash
   # Build container
   podman build -t starpunk:test -f Containerfile .

   # Run with fresh database
   podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test

   # Should start cleanly without restarts
   ```

3. **Log Verification**:
   Look for these patterns:
   - One worker: "Applied migration: XXX"
   - Other workers: "Database locked by another worker, retry..."
   - Final: "All migrations already applied by another worker"

## Risk Assessment

### Risk Level: LOW

The fix is safe because:

1. Uses SQLite's native transaction mechanism
2. Preserves all existing migration logic
3. Only adds a retry wrapper around existing code
4. Fails safely with clear error messages
5. No data loss is possible (transactions ensure atomicity; see the sketch after this list)
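
Point 5 rests on ordinary SQLite transaction semantics. The short standalone sketch below (in-memory database, simulated mid-migration failure) shows that a rollback discards the partially recorded work, so a crashed migration never leaves a half-applied record behind:

```python
import sqlite3

# In-memory database; isolation_level=None means we control transactions explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE schema_migrations (migration_name TEXT UNIQUE NOT NULL)")

conn.execute("BEGIN IMMEDIATE")
try:
    conn.execute(
        "INSERT INTO schema_migrations (migration_name) VALUES (?)",
        ("001_initial_schema.sql",),
    )
    raise RuntimeError("simulated failure halfway through a migration")
except RuntimeError:
    conn.rollback()  # undoes the INSERT along with everything else in the transaction

count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(f"rows recorded after rollback: {count}")  # 0 - nothing was partially applied
conn.close()
```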

### Rollback Plan

If issues occur:

1. Revert to the previous version
2. Start the container with a single worker temporarily: `--workers 1`
3. Once migrations apply, scale back to 4 workers

## Release Strategy

### Option 1: Hotfix (Recommended)

- Release as v1.0.0-rc.3.1
- Immediate deployment to fix the production issue
- Minimal testing required (focused fix)

### Option 2: Include in rc.4

- Bundle with other rc.4 changes
- More testing time
- Risk: production remains broken until rc.4

**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.

## Alternative Workarounds (If Needed Urgently)

While the proper fix is implemented, these temporary workarounds can be used:

### Workaround 1: Single Worker Startup
```bash
# In the Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]

# After the first successful start, rebuild with 4 workers
```

### Workaround 2: Pre-migration Script
```bash
#!/bin/bash
# Entrypoint script: run migrations once before gunicorn forks its workers
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...
```

### Workaround 3: Preload the Application
```bash
# --preload loads the app (and therefore runs migrations) once in the
# gunicorn master process before workers are forked, so only one process migrates
gunicorn --preload --workers 4 ...
```

## Summary

- **Problem**: Race condition when multiple workers apply migrations
- **Solution**: Database-level locking with retry logic
- **Implementation**: ~150 lines of code changes in migrations.py
- **Testing**: Verify with multi-worker startup
- **Risk**: LOW - Safe, atomic changes
- **Urgency**: HIGH - Blocks production deployment
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately

## Developer Questions Answered

All 23 architectural questions have been comprehensively answered in:
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`

**Key Decisions:**

- NEW connection per retry (not reused)
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
- Separate transactions for each operation
- Both multiprocessing.Pool AND gunicorn testing needed
- 30s timeout per attempt, 120s total maximum
- Graduated logging levels based on retry count (sketched below)
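
The answers document is not reproduced here, so the following is only a sketch of what the last decision might look like in practice; the helper name `log_lock_retry` and the thresholds are illustrative, not taken from the reference.

```python
import logging


def log_lock_retry(logger: logging.Logger, retry_count: int, max_retries: int, delay: float) -> None:
    """Log lock-contention retries quietly at first, more loudly as they pile up."""
    message = (
        f"Database locked by another worker, retry {retry_count}/{max_retries} "
        f"in {delay:.2f}s"
    )
    if retry_count <= 3:
        logger.debug(message)    # expected during normal multi-worker startup
    elif retry_count <= 7:
        logger.info(message)     # taking longer than usual
    else:
        logger.warning(message)  # approaching the retry limit
```
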
**Implementation Status: READY TO PROCEED**