# Migration Race Condition Fix - Quick Implementation Reference

## Implementation Checklist

### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`

```python
# 1. Add imports at top
import time
import random

# 2. Replace the entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
        else:
            raise  # Not a lock error; surface it immediately
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass

# B. Error handling pattern (wraps the "... migration logic ..." above)
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message (raised after the loop exits without acquiring the lock)
elapsed = time.time() - start_time
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### Testing Requirements

#### 1. Unit Test File: `test_migration_race_condition.py`

```python
import multiprocessing
import sqlite3
from unittest.mock import MagicMock, patch

from starpunk.migrations import run_migrations


def _worker(barrier, worker_id):
    """Create the app (which runs migrations) once all workers are ready."""
    barrier.wait()  # Synchronize start so all workers race for the lock
    from starpunk import create_app
    create_app()


def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    # Process (not Pool) so the Barrier can be passed at creation time
    barrier = multiprocessing.Barrier(4)
    procs = [
        multiprocessing.Process(target=_worker, args=(barrier, i))
        for i in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join(timeout=30)
    assert all(p.exitcode == 0 for p in procs), "Some workers failed"


def test_lock_retry(tmp_path):
    """Test retry logic with a mocked connection"""
    db_path = tmp_path / "test.db"
    with patch("sqlite3.connect") as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock(),  # Success on 3rd try
        ]
        run_migrations(db_path)
    assert mock.call_count == 3
```

#### 2. Integration Test: `test_integration.sh`

```bash
#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID
echo "SUCCESS: All workers started without race condition"
```

#### 3. Container Test: `test_container.sh`

```bash
#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .
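
# Added step: enforce "fresh database state between tests" (see Critical
# Points below). Assumption: the container keeps its SQLite DB under /data,
# matching the -v mount used in the run command that follows.
rm -rf test-data && mkdir -p test-data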
# Run with fresh database
podman run --rm -d --name race-test \
  -v $(pwd)/test-data:/data \
  starpunk:race-test

# Check logs for success patterns
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)"

# Cleanup
podman stop race-test
```

### Verification Patterns in Logs

#### Successful Migration (One Worker Wins)

```
Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
```

#### Performance Metrics to Check

- Single worker: < 100ms total
- 4 workers: < 500ms total
- 10 workers (stress): < 2000ms total

### Rollback Plan (If Issues Arise)

1. **Immediate Workaround**
   ```bash
   # Change to single worker temporarily
   gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
   ```

2. **Revert Code**
   ```bash
   git revert HEAD
   ```

3. **Emergency Patch**
   ```python
   # In app.py temporarily. Note: gunicorn does not set GUNICORN_WORKER_ID
   # itself; this only works if your launcher exports it per worker.
   import os
   if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
       init_db()  # Only the first worker runs migrations
   ```

### Deployment Commands

```bash
# 1. Run tests
python -m pytest test_migration_race_condition.py -v

# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .

# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1

# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1

# 5. Deploy
kubectl rollout restart deployment/starpunk
```

---

## Critical Points to Remember

1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED (see the one-winner sketch in the appendix below)
3. **30s per attempt, 120s total max** - Two different timeouts
4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
5. **Test at multiple levels** - Unit, integration, container
6. **Fresh database state** between tests

## Support

If issues arise, check:

1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
3. SQLite lock states: `PRAGMA lock_status` while the issue is occurring (available only in debug builds of SQLite)

---

*Quick Reference v1.0 - 2025-11-24*
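
---

## Appendix: One-Winner Locking Sketch

A minimal standalone sketch of why `BEGIN IMMEDIATE` (critical point 2) yields exactly one winner. This is illustrative only, not part of the fix: `demo.db` is a throwaway path, and `isolation_level=None` is used so Python's implicit-transaction handling does not interfere with the explicit `BEGIN`.

```python
import sqlite3

# The winner takes the write lock up front
winner = sqlite3.connect("demo.db", isolation_level=None)
winner.execute("BEGIN IMMEDIATE")

# The loser fails fast (timeout=0 disables SQLite's busy-wait)
loser = sqlite3.connect("demo.db", timeout=0, isolation_level=None)
try:
    loser.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError as e:
    print(e)  # "database is locked" -- exactly what the retry loop catches

winner.rollback()
winner.close()
loser.close()
```

With the default `BEGIN DEFERRED`, both workers could open transactions and only collide at their first write; `BEGIN EXCLUSIVE` would also serialize writers but additionally blocks readers, which is why the fix specifies `IMMEDIATE`.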