# ADR-022: Database Migration Race Condition Resolution

## Status

Accepted

## Context

In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.

When the container starts fresh, all 4 workers start simultaneously and attempt to:

1. Create the `schema_migrations` table
2. Apply pending migrations
3. Insert records into `schema_migrations`

This causes a race condition where:

- Worker 1 successfully applies the migration and inserts its record
- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
- Failed workers crash, causing container restarts
- After the restart, the migrations are already applied, so startup succeeds

## Decision

We will implement **database-level advisory locking** using SQLite's transaction mechanism with IMMEDIATE mode, combined with retry logic. This approach:

1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
2. Implements exponential backoff retry for workers that can't acquire the lock
3. Ensures only one worker can run migrations at a time
4. Has the other workers wait and then verify that migrations are complete

This is the simplest, most robust solution that:

- Requires minimal code changes
- Uses SQLite's native capabilities
- Doesn't require external dependencies
- Works across all deployment scenarios

## Rationale

### Options Considered

1. **File-based locking (fcntl)**
   - Pro: Simple to implement
   - Con: Doesn't work across containers/network filesystems
   - Con: Lock files can be orphaned if the process crashes

2. **Run migrations before workers start**
   - Pro: Cleanest separation of concerns
   - Con: Requires container entrypoint script changes
   - Con: Complicates development workflow
   - Con: Doesn't fix the root cause for non-container deployments

3. **Make migration insertion idempotent (INSERT OR IGNORE)**
   - Pro: Simple SQL change
   - Con: Doesn't prevent parallel migration execution
   - Con: Could corrupt the database if migrations partially apply
   - Con: Masks the real problem

4. **Database advisory locking (CHOSEN)**
   - Pro: Uses SQLite's native transaction locking
   - Pro: Guaranteed atomicity
   - Pro: Works across all deployment scenarios
   - Pro: Self-cleaning (no orphaned locks)
   - Con: Requires retry logic

### Why Database Locking?

SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing while still allowing them to read. This provides:

1. **Atomicity**: Either all migrations apply or none do
2. **Isolation**: Only one worker can modify the schema at a time
3. **Automatic cleanup**: Locks are released on connection close/crash
4. **No external dependencies**: Uses SQLite's built-in features

## Implementation

The fix will be implemented in `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
import logging
import random
import sqlite3
import time


def run_migrations(db_path, logger=None):
    """Run all pending database migrations with concurrency protection"""
    # Fall back to a module logger so the logging calls below never hit None
    logger = logger or logging.getLogger(__name__)

    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None  # Defined up front so the finally block is safe if connect() fails
        try:
            # A new connection is created for each attempt (no reuse)
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Acquire the write lock for migrations
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Create migrations table if needed
                create_migrations_table(conn)

                # Check if another worker already ran migrations
                # NOTE: a non-empty table shows only that some migration ran,
                # not that no migrations are pending
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                if cursor.fetchone()[0] > 0:
                    # Migrations already run by another worker
                    conn.commit()
                    logger.info("Migrations already applied by another worker")
                    return

                # Run migration logic (existing code)
                # ... rest of migration code ...

                conn.commit()
                return  # Success

            except Exception:
                # Undo any partial work before the error propagates
                conn.rollback()
                raise

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e):
                # Another worker holds the lock: back off exponentially with jitter
                retry_count += 1
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)

                if retry_count < max_retries:
                    logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
                    time.sleep(delay)
                else:
                    raise MigrationError(f"Failed to acquire migration lock after {max_retries} attempts") from e
            else:
                raise

        finally:
            if conn:
                conn.close()
```

Additional changes needed:

1. Add imports: `import time`, `import random`
2. Modify connection timeout from default 5s to 30s
3. Add early check for already-applied migrations
4. Wrap entire migration process in IMMEDIATE transaction

## Consequences

### Positive

- Eliminates race condition completely
- No container configuration changes needed
- Works in all deployment scenarios (container, systemd, manual)
- Minimal code changes (~50 lines)
- Self-healing (no manual lock cleanup needed)
- Provides clear logging of what's happening

### Negative

- Slight startup delay for workers that wait (100ms-2s typical)
- Adds complexity to migration runner
- Requires careful testing of retry logic

### Neutral

- Workers start sequentially for migration phase, then run in parallel
- First worker to acquire lock runs migrations for all
- Log output will show retry attempts (useful for debugging)

## Testing Strategy

1. **Unit test with mock**: Test retry logic with simulated lock contention
2. **Integration test**: Spawn multiple processes, verify only one runs migrations
3. **Container test**: Build container, verify clean startup with 4 workers
4. **Stress test**: Start 20 processes simultaneously, verify correctness

## Migration Path

1. Implement fix in `starpunk/migrations.py`
2. Test locally with multiple workers
3. Build and test container
4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
5. Monitor production logs for retry patterns

## Implementation Notes (Post-Analysis)

Based on comprehensive architectural review, the following clarifications have been established:

### Critical Implementation Details

1. **Connection Management**: Create NEW connection for each retry attempt (no reuse)
2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
4. **Logging Levels**: Graduated (DEBUG for retry 1-3, INFO for 4-7, WARNING for 8+)
5. **Transaction Boundaries**: Separate transactions for schema/migrations/data

### Test Requirements

- Unit tests with multiprocessing.Pool
- Integration tests with actual gunicorn
- Container tests with full deployment
- Performance target: <500ms with 4 workers

### Documentation

- Full Q&A: `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
- Implementation Guide: `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md`
- Quick Reference: `/home/phil/Projects/starpunk/docs/architecture/migration-fix-quick-reference.md`

## References

- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
- Issue: Production migration race condition with gunicorn workers

## Status Update

**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.
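
## Appendix: Lock Contention Sketch

The locking behavior described under "Why Database Locking?" can be observed with two connections to a throwaway database. This is a minimal sketch, not StarPunk code: `try_begin_immediate` and `demo` are hypothetical names introduced for illustration, the database lives in a temporary directory, and `timeout=0.0` is an assumption chosen so that a contending connection fails immediately instead of waiting, as it would with the 30s timeout used in `run_migrations()`.

```python
import os
import sqlite3
import tempfile


def try_begin_immediate(db_path, timeout=0.0):
    """Open a new connection and try to take SQLite's write lock.

    Returns the open connection (inside a transaction) on success,
    or None if another connection already holds the lock.
    """
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN,
    # so the explicit BEGIN IMMEDIATE is the only transaction control.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")
    except sqlite3.OperationalError as exc:
        conn.close()
        if "database is locked" in str(exc):
            return None
        raise
    return conn


def demo():
    """Return (first_got_lock, second_got_lock, third_got_lock)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")

        first = try_begin_immediate(db_path)   # plays the winning worker
        second = try_begin_immediate(db_path)  # contends while the lock is held

        # The winner does its work and commits, which releases the lock.
        first.execute(
            "CREATE TABLE schema_migrations (migration_name TEXT UNIQUE)"
        )
        first.execute("COMMIT")
        first.close()

        third = try_begin_immediate(db_path)   # the lock is free again
        result = (first is not None, second is not None, third is not None)

        # Close any connection that is still open before the directory is removed.
        for conn in (second, third):
            if conn is not None:
                conn.close()
        return result


if __name__ == "__main__":
    print(demo())
```

Calling `demo()` is expected to return `(True, False, True)`: the second connection is refused while the first holds the lock, and a later connection succeeds once the first has committed. In the real migration runner, the refused worker would sleep and retry rather than give up.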