# Migration Race Condition Fix - Architectural Answers

## Status: READY FOR IMPLEMENTATION

All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.

---

## Critical Questions

### 1. Connection Lifecycle Management

**Q: Should we create a new connection for each retry or reuse the same connection?**

**Answer: NEW CONNECTION per retry**

- Each retry MUST create a fresh connection
- Rationale: A failed lock acquisition may leave the connection in an inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:

```python
while retry_count < max_retries:
    conn = None  # Fresh connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        # ... attempt migration ...
    finally:
        if conn is not None:
            conn.close()
```

### 2. Transaction Boundaries

**Q: Should init_db() wrap everything in one transaction?**

**Answer: NO - Separate transactions for different operations**

- Schema creation: Own transaction. Note that `executescript()` commits any pending transaction first and adds no transaction control of its own, so the script needs explicit `BEGIN`/`COMMIT` if atomicity is required
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent

### 3. Lock Timeout vs Retry Timeout

**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**

**Answer: This is BY DESIGN - No conflict**

- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds the lock for 30s+, the other workers time out and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add a total timeout check:

```python
start_time = time.monotonic()  # Monotonic clock is immune to wall-clock adjustments
max_total_time = 120  # 2 minutes absolute maximum

while retry_count < max_retries and (time.monotonic() - start_time) < max_total_time:
    ...  # retry body
```

### 4. Testing Strategy

**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**

**Answer: BOTH - Different test levels**

- Unit tests: Mocked concurrent access (fast, isolated)
- Integration tests: multiprocessing.Pool
- System tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:

```
Level 1: Mock concurrent access (unit)
Level 2: multiprocessing.Pool (integration)
Level 3: gunicorn locally (system)
Level 4: Container with gunicorn (e2e)
```

### 5. BEGIN IMMEDIATE vs EXCLUSIVE

**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**

**Answer: BEGIN IMMEDIATE is the CORRECT choice**

- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
  - Migrations only need to prevent concurrent migrations (writes)
  - Other workers can still read the schema while one migrates
  - Less contention, faster startup
  - Only escalates to EXCLUSIVE when changes are actually written to the database file
- Keep BEGIN IMMEDIATE as specified

---

## Edge Cases and Error Handling

### 6. Partial Migration Failure

**Q: What if a migration partially applies or rollback fails?**

**Answer: Transaction atomicity handles this**

- Within a transaction: Any error triggers a rollback in the handler, so no partial changes persist
- Rollback failure: Extremely rare (corrupt database)
- Strategy:

```python
try:
    ...  # apply migration inside the transaction
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        # Database potentially corrupt - fail hard
        raise SystemExit(1)
    raise MigrationError(e) from e
```

### 7. Migration File Consistency

**Q: What if migration files change during deployment?**

**Answer: Not a concern with proper deployment**

- Container deployments: Files are immutable in the image
- Traditional deployment: Use an atomic directory swap
- If concerned, add checksum validation:

```python
# Store in schema_migrations: (name, checksum, applied_at)
# Verify checksum matches before applying
```

### 8. Retry Exhaustion Error Messages

**Q: What error message when retries are exhausted?**

**Answer: Be specific and actionable**

```python
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### 9. Logging Levels

**Q: What log level for lock waits?**

**Answer: Graduated approach**

- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (worth noting)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:

```python
if retry_count <= 3:
    level = logging.DEBUG
elif retry_count <= 7:
    level = logging.INFO
else:
    level = logging.WARNING
logger.log(level, f"Retry {retry_count}/{max_retries}")
```

### 10. Index Creation Failure

**Q: How to handle index creation failures in migration 002?**

**Answer: Fail fast with clear context**

```python
for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            # Idempotent: an existing index is not an error
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            ) from e
```

---

## Testing Strategy

### 11. Concurrent Testing Simulation

**Q: How to properly simulate concurrent worker startup?**

**Answer: Multiple approaches**

```python
import multiprocessing
from multiprocessing import Process

# Approach 1: Barrier synchronization
# Pool workers must be module-level functions (nested functions cannot be
# pickled), and the barrier is handed to each worker via the pool initializer.
_barrier = None

def _init_worker(barrier):
    global _barrier
    _barrier = barrier

def _worker(db_path):
    _barrier.wait()  # All start together
    return run_migrations(db_path)

def test_concurrent_migrations():
    barrier = multiprocessing.Barrier(4)
    with multiprocessing.Pool(4, initializer=_init_worker, initargs=(barrier,)) as pool:
        results = pool.map(_worker, [db_path] * 4)

# Approach 2: Process start
processes = []
for i in range(4):
    p = Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
for p in processes:
    p.join()  # Wait for all workers to finish
```

### 12. Lock Contention Testing

**Q: How to test lock contention scenarios?**

**Answer: Inject delays**

```python
import sqlite3
import time
from unittest.mock import patch

# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Verify retry logic
```

### 13. Performance Tests

**Q: What timing is acceptable?**

**Answer: Performance targets**

- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:

```python
import timeit

setup_time = timeit.timeit(lambda: create_app(), number=1)
assert setup_time < 0.5, f"Startup too slow: {setup_time:.3f}s"
```

### 14. Retry Logic Unit Tests

**Q: How to unit test retry logic?**

**Answer: Mock the lock failures**

```python
import sqlite3
from unittest.mock import MagicMock, patch

class TestRetryLogic:
    def test_retry_on_lock(self):
        with patch('sqlite3.connect') as mock:
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3
```

---

## SQLite-Specific Concerns

### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)

**Q: Deep dive on lock choice?**

**Answer: Lock escalation path**

```
BEGIN DEFERRED  → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE

For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
```

### 16. WAL Mode Interaction

**Q: How does this work with WAL mode?**

**Answer: Works correctly with both modes**

- Rollback journal mode: BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:

```python
cursor = conn.execute("PRAGMA journal_mode")
mode = cursor.fetchone()[0]
logger.debug(f"Database in {mode} mode")
```

### 17. Database File Permissions

**Q: How to handle permission issues?**

**Answer: Fail fast with helpful diagnostics**

```python
import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )
```

---

## Deployment/Operations

### 18. Container Startup and Health Checks

**Q: How to handle health checks during migration?**

**Answer: Return 503 during migration**

```python
# In app.py
MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False

@app.route('/health')
def health():
    if MIGRATION_IN_PROGRESS:
        return {'status': 'migrating'}, 503
    return {'status': 'healthy'}, 200
```

### 19. Monitoring and Alerting

**Q: What metrics/alerts are needed?**

**Answer: Key metrics to track**

```python
import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1            # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))
```

---

## Alternative Approaches

### 20. Version Compatibility

**Q: How to handle version mismatches?**

**Answer: Strict version checking**

```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )
```

### 21. File-Based Locking

**Q: Should we consider flock() as backup?**

**Answer: NO - Adds complexity without benefit**

- SQLite locking is sufficient and portable
- flock() is not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach

### 22. Gunicorn Preload

**Q: Would --preload flag help?**

**Answer: NO - Does not fit the current architecture**

- --preload runs app initialization ONCE in the master process
- Workers fork from the master AFTER migrations complete
- BUT: It does not work with lazy-loaded resources
- The current architecture expects per-worker initialization
- Keep current approach

### 23. Application-Level Locks

**Q: Should we add Redis/memcached for coordination?**

**Answer: NO - Violates simplicity principle**

- Adds an external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before the app starts
- Solving a solved problem

---

## Final Implementation Checklist

### Required Changes

1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic

### Test Coverage Required

1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention

### Documentation Updates

1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info

---

## Go/No-Go Decision

### ✅ GO FOR IMPLEMENTATION

**Rationale:**

- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with a clear rollback plan
- Testing strategy is comprehensive

**Implementation Priority: IMMEDIATE**

- This is blocking the v1.0.0-rc.4 release
- Production systems are affected
- Fix is well-understood and low-risk

**Next Steps:**

1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned

---

*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*
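
The retry loop listed under Required Changes (items 1, 2, 3, 6 and 7) is described in pieces across the answers but never shown whole. A minimal sketch of how those pieces might fit together follows. The function names `acquire_migration_lock` and `backoff_delay`, the backoff base, the delay cap, and the default retry count are all hypothetical and not taken from the project; `MigrationError` here is a local stand-in for the project's own exception.

```python
import random
import sqlite3
import time


class MigrationError(Exception):
    """Local stand-in for the project's MigrationError (assumption)."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Seconds to sleep after failed attempt `attempt` (1-based).

    Doubles per attempt up to `cap`, plus random jitter in [0, base] so that
    workers that failed together do not retry in lockstep.
    The base and cap values are illustrative assumptions.
    """
    return min(cap, base * 2 ** (attempt - 1)) + random.uniform(0, base)


def acquire_migration_lock(db_path, max_retries=10, max_total_time=120.0,
                           connect_timeout=30.0, base=0.1):
    """Return a connection that holds the write lock via BEGIN IMMEDIATE.

    A fresh connection is opened for every attempt. The caller is responsible
    for COMMIT/ROLLBACK and for closing the returned connection.
    """
    start = time.monotonic()
    attempt = 0
    while attempt < max_retries and time.monotonic() - start < max_total_time:
        attempt += 1
        # isolation_level=None: no implicit BEGIN, transactions are explicit
        conn = sqlite3.connect(db_path, timeout=connect_timeout,
                               isolation_level=None)
        try:
            conn.execute("BEGIN IMMEDIATE")
            return conn
        except sqlite3.OperationalError as exc:
            conn.close()
            if "locked" not in str(exc).lower():
                raise  # Not lock contention: do not retry
        if attempt < max_retries:
            time.sleep(backoff_delay(attempt, base=base))
    elapsed = time.monotonic() - start
    raise MigrationError(
        f"Failed to acquire migration lock after {attempt} attempts "
        f"over {elapsed:.1f}s"
    )
```

A caller would run its migrations on the returned connection, then issue `COMMIT` (or `ROLLBACK` on error) and close it. The jitter term is a design choice to spread out workers that start at the same moment; the graduated logging from question 9 and the detailed error text from question 8 would slot into the `except` branch and the final `raise`.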