# Migration Race Condition Fix - Implementation Guide ## Executive Summary **CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers racing to apply migrations causes container startup failures. **Solution**: Implement database-level advisory locking with retry logic in `migrations.py`. **Urgency**: HIGH - This is a blocker for v1.0.0-rc.4 release. ## Root Cause Analysis ### The Problem Flow 1. Container starts with `gunicorn --workers 4` 2. Each worker independently calls: ``` app.py → create_app() → init_db() → run_migrations() ``` 3. All 4 workers simultaneously try to: - INSERT into schema_migrations table - Apply the same migrations 4. SQLite's UNIQUE constraint on migration_name causes workers 2-4 to crash 5. Container restarts, works on second attempt (migrations already applied) ### Why This Happens - **No synchronization**: Workers are independent processes - **No locking**: Migration code doesn't prevent concurrent execution - **Immediate failure**: UNIQUE constraint violation crashes the worker - **Gunicorn behavior**: Worker crash triggers container restart ## Immediate Fix Implementation ### Step 1: Update migrations.py Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`: ```python import time import random ``` ### Step 2: Replace run_migrations function Replace the entire `run_migrations` function (lines 304-462) with: ```python def run_migrations(db_path, logger=None): """ Run all pending database migrations with concurrency protection Uses database-level locking to prevent race conditions when multiple workers start simultaneously. Only one worker will apply migrations; others will wait and verify completion. Args: db_path: Path to SQLite database file logger: Optional logger for output Raises: MigrationError: If any migration fails to apply or lock cannot be acquired """ if logger is None: logger = logging.getLogger(__name__) # Determine migrations directory migrations_dir = Path(__file__).parent.parent / "migrations" if not migrations_dir.exists(): logger.warning(f"Migrations directory not found: {migrations_dir}") return # Retry configuration for lock acquisition max_retries = 10 retry_count = 0 base_delay = 0.1 # 100ms while retry_count < max_retries: conn = None try: # Connect with longer timeout for lock contention conn = sqlite3.connect(db_path, timeout=30.0) # Attempt to acquire exclusive lock for migrations # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes conn.execute("BEGIN IMMEDIATE") try: # Ensure migrations tracking table exists create_migrations_table(conn) # Quick check: have migrations already been applied by another worker? cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations") migration_count = cursor.fetchone()[0] # Discover migration files migration_files = discover_migration_files(migrations_dir) if not migration_files: conn.commit() logger.info("No migration files found") return # If migrations exist and we're not the first worker, verify and exit if migration_count > 0: # Check if all migrations are applied applied = get_applied_migrations(conn) pending = [m for m, _ in migration_files if m not in applied] if not pending: conn.commit() logger.debug("All migrations already applied by another worker") return # If there are pending migrations, we continue to apply them logger.info(f"Found {len(pending)} pending migrations to apply") # Fresh database detection (original logic preserved) if migration_count == 0: if is_schema_current(conn): # Schema is current - mark all migrations as applied for migration_name, _ in migration_files: conn.execute( "INSERT INTO schema_migrations (migration_name) VALUES (?)", (migration_name,) ) conn.commit() logger.info( f"Fresh database detected: marked {len(migration_files)} " f"migrations as applied (schema already current)" ) return else: logger.info("Fresh database with partial schema: applying needed migrations") # Get already-applied migrations applied = get_applied_migrations(conn) # Apply pending migrations (original logic preserved) pending_count = 0 skipped_count = 0 for migration_name, migration_path in migration_files: if migration_name not in applied: # Check if migration is actually needed should_check_needed = ( migration_count == 0 or migration_name == "002_secure_tokens_and_authorization_codes.sql" ) if should_check_needed and not is_migration_needed(conn, migration_name): # Special handling for migration 002: if tables exist but indexes don't if migration_name == "002_secure_tokens_and_authorization_codes.sql": # Check if we need to create indexes indexes_to_create = [] if not index_exists(conn, 'idx_tokens_hash'): indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)") if not index_exists(conn, 'idx_tokens_me'): indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)") if not index_exists(conn, 'idx_tokens_expires'): indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)") if not index_exists(conn, 'idx_auth_codes_hash'): indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)") if not index_exists(conn, 'idx_auth_codes_expires'): indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)") if indexes_to_create: for index_sql in indexes_to_create: conn.execute(index_sql) logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002") # Mark as applied without executing full migration conn.execute( "INSERT INTO schema_migrations (migration_name) VALUES (?)", (migration_name,) ) skipped_count += 1 logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)") else: # Apply the migration (within our transaction) try: # Read migration SQL migration_sql = migration_path.read_text() logger.debug(f"Applying migration: {migration_name}") # Execute migration (already in transaction) conn.executescript(migration_sql) # Record migration as applied conn.execute( "INSERT INTO schema_migrations (migration_name) VALUES (?)", (migration_name,) ) logger.info(f"Applied migration: {migration_name}") pending_count += 1 except Exception as e: # Roll back the transaction raise MigrationError(f"Migration {migration_name} failed: {e}") # Commit all migrations atomically conn.commit() # Summary total_count = len(migration_files) if pending_count > 0 or skipped_count > 0: if skipped_count > 0: logger.info( f"Migrations complete: {pending_count} applied, {skipped_count} skipped " f"(already in SCHEMA_SQL), {total_count} total" ) else: logger.info( f"Migrations complete: {pending_count} applied, " f"{total_count} total" ) else: logger.info(f"All migrations up to date ({total_count} total)") return # Success! except MigrationError: conn.rollback() raise except Exception as e: conn.rollback() raise MigrationError(f"Migration system error: {e}") except sqlite3.OperationalError as e: if "database is locked" in str(e).lower(): # Another worker has the lock, retry with exponential backoff retry_count += 1 if retry_count < max_retries: # Exponential backoff with jitter delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1) logger.debug( f"Database locked by another worker, retry {retry_count}/{max_retries} " f"in {delay:.2f}s" ) time.sleep(delay) continue else: raise MigrationError( f"Failed to acquire migration lock after {max_retries} attempts. " f"This may indicate a hung migration process." ) else: # Non-lock related database error error_msg = f"Database error during migration: {e}" logger.error(error_msg) raise MigrationError(error_msg) except Exception as e: # Unexpected error error_msg = f"Unexpected error during migration: {e}" logger.error(error_msg) raise MigrationError(error_msg) finally: if conn: try: conn.close() except: pass # Ignore errors during cleanup # Should never reach here, but just in case raise MigrationError("Migration retry loop exited unexpectedly") ``` ### Step 3: Testing the Fix Create a test script to verify the fix works: ```python #!/usr/bin/env python3 """Test migration race condition fix""" import multiprocessing import time import sys from pathlib import Path # Add project to path sys.path.insert(0, str(Path(__file__).parent)) def worker_init(worker_id): """Simulate a gunicorn worker starting""" print(f"Worker {worker_id}: Starting...") try: from starpunk import create_app app = create_app() print(f"Worker {worker_id}: Successfully initialized") return True except Exception as e: print(f"Worker {worker_id}: FAILED - {e}") return False if __name__ == "__main__": # Test with 10 workers (more than production to stress test) num_workers = 10 print(f"Starting {num_workers} workers simultaneously...") with multiprocessing.Pool(num_workers) as pool: results = pool.map(worker_init, range(num_workers)) success_count = sum(results) print(f"\nResults: {success_count}/{num_workers} workers succeeded") if success_count == num_workers: print("SUCCESS: All workers initialized without race condition") sys.exit(0) else: print("FAILURE: Race condition still present") sys.exit(1) ``` ## Verification Steps 1. **Local Testing**: ```bash # Test with multiple workers gunicorn --workers 4 --bind 0.0.0.0:8000 app:app # Check logs for retry messages # Should see "Database locked by another worker, retry..." messages ``` 2. **Container Testing**: ```bash # Build container podman build -t starpunk:test -f Containerfile . # Run with fresh database podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test # Should start cleanly without restarts ``` 3. **Log Verification**: Look for these patterns: - One worker: "Applied migration: XXX" - Other workers: "Database locked by another worker, retry..." - Final: "All migrations already applied by another worker" ## Risk Assessment ### Risk Level: LOW The fix is safe because: 1. Uses SQLite's native transaction mechanism 2. Preserves all existing migration logic 3. Only adds retry wrapper around existing code 4. Fails safely with clear error messages 5. No data loss possible (transactions ensure atomicity) ### Rollback Plan If issues occur: 1. Revert to previous version 2. Start container with single worker temporarily: `--workers 1` 3. Once migrations apply, scale back to 4 workers ## Release Strategy ### Option 1: Hotfix (Recommended) - Release as v1.0.0-rc.3.1 - Immediate deployment to fix production issue - Minimal testing required (focused fix) ### Option 2: Include in rc.4 - Bundle with other rc.4 changes - More testing time - Risk: Production remains broken until rc.4 **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately. ## Alternative Workarounds (If Needed Urgently) While the proper fix is implemented, these temporary workarounds can be used: ### Workaround 1: Single Worker Startup ```bash # In Containerfile, temporarily change: CMD ["gunicorn", "--workers", "1", ...] # After first successful start, rebuild with 4 workers ``` ### Workaround 2: Pre-migration Script ```bash # Add entrypoint script that runs migrations before gunicorn #!/bin/bash python3 -c "from starpunk.database import init_db; init_db()" exec gunicorn --workers 4 ... ``` ### Workaround 3: Delayed Worker Startup ```bash # Stagger worker startup with --preload gunicorn --preload --workers 4 ... ``` ## Summary - **Problem**: Race condition when multiple workers apply migrations - **Solution**: Database-level locking with retry logic - **Implementation**: ~150 lines of code changes in migrations.py - **Testing**: Verify with multi-worker startup - **Risk**: LOW - Safe, atomic changes - **Urgency**: HIGH - Blocks production deployment - **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately ## Developer Questions Answered All 23 architectural questions have been comprehensively answered in: `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` **Key Decisions:** - NEW connection per retry (not reused) - BEGIN IMMEDIATE is correct (not EXCLUSIVE) - Separate transactions for each operation - Both multiprocessing.Pool AND gunicorn testing needed - 30s timeout per attempt, 120s total maximum - Graduated logging levels based on retry count **Implementation Status: READY TO PROCEED**