docs: Add architectural documentation for migration race condition fix

Add comprehensive architectural documentation for the migration race condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:53:55 -07:00
parent 686d753fb9
commit 2240414f22
4 changed files with 1354 additions and 0 deletions
--- a/docs/decisions/ADR-022-migration-race-condition-fix.md
+++ b/docs/decisions/ADR-022-migration-race-condition-fix.md
@@ -0,0 +1,208 @@
+# ADR-022: Database Migration Race Condition Resolution
+
+## Status
+Accepted
+
+## Context
+
+In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.
+
+When the container starts fresh, all 4 workers start simultaneously and attempt to:
+1. Create the `schema_migrations` table
+2. Apply pending migrations
+3. Insert records into `schema_migrations`
+
+This causes a race condition where:
+- Worker 1 successfully applies migration and inserts record
+- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
+- Failed workers crash, causing container restarts
+- After restart, migrations are already applied so it works
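The failure sequence above can be reproduced without gunicorn. The following is a minimal sketch, not StarPunk code: two connections in one process stand in for two workers, and the table definition, the migration name `001_example.sql`, and the function name `reproduce_race` are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def reproduce_race():
    """Two 'workers' both see a migration as pending, then both record it."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "race.db")
        worker_a = sqlite3.connect(db_path)
        worker_b = sqlite3.connect(db_path)
        try:
            # Assumed schema: only the UNIQUE column matters for this demo.
            worker_a.execute(
                "CREATE TABLE IF NOT EXISTS schema_migrations "
                "(migration_name TEXT UNIQUE NOT NULL)"
            )
            worker_a.commit()

            name = "001_example.sql"  # hypothetical migration name
            check = "SELECT COUNT(*) FROM schema_migrations WHERE migration_name = ?"
            # Both workers check before either one has written anything.
            a_pending = worker_a.execute(check, (name,)).fetchall()[0][0] == 0
            b_pending = worker_b.execute(check, (name,)).fetchall()[0][0] == 0

            insert = "INSERT INTO schema_migrations (migration_name) VALUES (?)"
            worker_a.execute(insert, (name,))
            worker_a.commit()
            try:
                worker_b.execute(insert, (name,))
                worker_b.commit()
            except sqlite3.IntegrityError as exc:
                return a_pending, b_pending, str(exc)
            return a_pending, b_pending, None
        finally:
            worker_a.close()
            worker_b.close()
```

Both checks report the migration as pending, and the second insert fails with the same UNIQUE constraint error seen in production. The check and the insert are not atomic, which is the gap the decision below closes.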
+
+## Decision
+
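+We will implement **database-level advisory locking** built on SQLite's `BEGIN IMMEDIATE` transaction mode, combined with retry logic. SQLite has no dedicated advisory-lock API, so the RESERVED lock taken by the transaction itself serves as the migration lock. This approach: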
+
+1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
+2. Implements exponential backoff retry for workers that can't acquire the lock
+3. Ensures only one worker can run migrations at a time
+4. Other workers wait and verify migrations are complete
+
+This is the simplest, most robust solution that:
+- Requires minimal code changes
+- Uses SQLite's native capabilities
+- Doesn't require external dependencies
+- Works across all deployment scenarios
+
+## Rationale
+
+### Options Considered
+
+1. **File-based locking (fcntl)**
+   - Pro: Simple to implement
+   - Con: Doesn't work across containers/network filesystems
+   - Con: Lock files can be orphaned if process crashes
+
+2. **Run migrations before workers start**
+   - Pro: Cleanest separation of concerns
+   - Con: Requires container entrypoint script changes
+   - Con: Complicates development workflow
+   - Con: Doesn't fix the root cause for non-container deployments
+
+3. **Make migration insertion idempotent (INSERT OR IGNORE)**
+   - Pro: Simple SQL change
+   - Con: Doesn't prevent parallel migration execution
+   - Con: Could corrupt database if migrations partially apply
+   - Con: Masks the real problem
+
+4. **Database advisory locking (CHOSEN)**
+   - Pro: Uses SQLite's native transaction locking
+   - Pro: Guaranteed atomicity
+   - Pro: Works across all deployment scenarios
+   - Pro: Self-cleaning (no orphaned locks)
+   - Con: Requires retry logic
+
+### Why Database Locking?
+
+SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing. This provides:
+
+1. **Atomicity**: Either all migrations apply or none do
+2. **Isolation**: Only one worker can modify schema at a time
+3. **Automatic cleanup**: Locks released on connection close/crash
+4. **No external dependencies**: Uses SQLite's built-in features
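The locking behaviour described above can be observed directly with Python's standard `sqlite3` module. This is an illustrative sketch only: the function name `second_writer_is_blocked` and the temporary database are assumptions, and `timeout=0` is used on the second connection so that it fails fast instead of waiting.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked():
    """Return the error a second connection sees while the first holds the lock."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "lock.db")
        # isolation_level=None: transactions are controlled explicitly.
        holder = sqlite3.connect(db_path, isolation_level=None)
        # timeout=0: do not wait for the lock, report contention immediately.
        waiter = sqlite3.connect(db_path, timeout=0, isolation_level=None)
        try:
            holder.execute("BEGIN IMMEDIATE")  # takes the RESERVED lock
            try:
                waiter.execute("BEGIN IMMEDIATE")
            except sqlite3.OperationalError as exc:
                blocked = str(exc)
            else:
                blocked = None
                waiter.execute("ROLLBACK")

            holder.execute("ROLLBACK")  # releases the lock
            waiter.execute("BEGIN IMMEDIATE")  # succeeds once the lock is free
            waiter.execute("ROLLBACK")
            return blocked
        finally:
            holder.close()
            waiter.close()
```

With a non-zero `timeout`, the second connection would instead wait for the lock up to that many seconds before raising, which is why the implementation pairs a 30s timeout with retries.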
+
+## Implementation
+
+The fix will be implemented in `starpunk/migrations.py`:
+
+```python
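+import logging
+import random
+import sqlite3
+import time
+
+# MigrationError is assumed to be defined elsewhere in this module.
+
+
+def run_migrations(db_path, logger=None):
+    """Run all pending database migrations with concurrency protection"""
+    # Fall back to a module logger so the logging calls below never hit None
+    logger = logger or logging.getLogger(__name__)
+
+    max_retries = 10
+    retry_count = 0
+    base_delay = 0.1  # 100ms
+
+    while retry_count < max_retries:
+        # Defined before the try so `finally` is safe if connect() fails
+        conn = None
+        try:
+            # New connection for every attempt (no reuse across retries)
+            conn = sqlite3.connect(db_path, timeout=30.0)
+
+            # Acquire the write (RESERVED) lock for migrations
+            conn.execute("BEGIN IMMEDIATE")
+
+            try:
+                # Create migrations table if needed
+                create_migrations_table(conn)
+
+                # Check if another worker already ran migrations.
+                # NOTE: COUNT(*) > 0 only shows that some migration has been
+                # recorded; the full runner should compare against the list
+                # of pending migrations so upgrades are not skipped.
+                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
+                if cursor.fetchone()[0] > 0:
+                    # Migrations already run by another worker
+                    conn.commit()
+                    logger.info("Migrations already applied by another worker")
+                    return
+
+                # Run migration logic (existing code)
+                # ... rest of migration code ...
+
+                conn.commit()
+                return  # Success
+
+            except Exception:
+                # Undo partial work, then let the outer handlers decide
+                conn.rollback()
+                raise
+
+        except sqlite3.OperationalError as e:
+            if "database is locked" in str(e):
+                retry_count += 1
+                # Exponential backoff with up to 100ms of jitter
+                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
+
+                if retry_count < max_retries:
+                    logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
+                    time.sleep(delay)
+                else:
+                    raise MigrationError(
+                        f"Failed to acquire migration lock after {max_retries} attempts"
+                    ) from e
+            else:
+                raise
+
+        finally:
+            # Closing the connection also releases any lock it still holds
+            if conn:
+                conn.close()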
+```
+
+Additional changes needed:
+
+1. Add imports: `import time`, `import random`
+2. Modify connection timeout from default 5s to 30s
+3. Add early check for already-applied migrations
+4. Wrap entire migration process in IMMEDIATE transaction
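To make the backoff arithmetic in the snippet above concrete, the sketch below lists the sleep taken after each failed attempt, using the same `max_retries` and `base_delay` values and ignoring jitter. The helper name `backoff_schedule` is an assumption for illustration and is not part of the planned code.

```python
def backoff_schedule(max_retries=10, base_delay=0.1):
    """Sleep in seconds (jitter excluded) after each failed lock attempt."""
    # retry_count is incremented before the delay is computed, so the first
    # sleep uses 2 ** 1. The final attempt raises instead of sleeping, so it
    # contributes no entry.
    return [base_delay * (2 ** retry) for retry in range(1, max_retries)]


delays = backoff_schedule()
print([round(d, 1) for d in delays])
# [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2]
print(round(sum(delays), 1))
# 102.2
```

Two points follow from the numbers: the first wait is 200ms rather than 100ms, and the sleeps alone can total roughly 102 seconds in the worst case, before jitter and before any time spent inside the 30s connection timeout.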
+
+## Consequences
+
+### Positive
+- Eliminates race condition completely
+- No container configuration changes needed
+- Works in all deployment scenarios (container, systemd, manual)
+- Minimal code changes (~50 lines)
+- Self-healing (no manual lock cleanup needed)
+- Provides clear logging of what's happening
+
+### Negative
+- Slight startup delay for workers that wait (100ms-2s typical)
+- Adds complexity to migration runner
+- Requires careful testing of retry logic
+
+### Neutral
+- Workers pass through the migration phase one at a time, then run in parallel
+- First worker to acquire lock runs migrations for all
+- Log output will show retry attempts (useful for debugging)
+
+## Testing Strategy
+
+1. **Unit test with mock**: Test retry logic with simulated lock contention
+2. **Integration test**: Spawn multiple processes, verify only one runs migrations
+3. **Container test**: Build container, verify clean startup with 4 workers
+4. **Stress test**: Start 20 processes simultaneously, verify correctness
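The integration test in item 2 could take roughly the following shape. This is a hedged sketch under several assumptions: threads stand in for worker processes so the example stays self-contained, `migrate_once` is a simplified placeholder for `run_migrations()`, and the schema and migration name are hypothetical.

```python
import os
import sqlite3
import tempfile
import threading


def migrate_once(db_path, applied_by, worker_id):
    """Simplified stand-in for run_migrations() with one hypothetical migration."""
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # waits up to `timeout` for the lock
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_migrations "
            "(migration_name TEXT UNIQUE NOT NULL)"
        )
        done = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchall()[0][0]
        if done == 0:
            conn.execute(
                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                ("001_example.sql",),
            )
            applied_by.append(worker_id)  # record which worker did the work
        conn.execute("COMMIT")
    finally:
        conn.close()  # rolls back if COMMIT was never reached


def run_workers(count=4):
    """Start `count` concurrent workers; return (who migrated, rows recorded)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "test.db")
        applied_by = []
        threads = [
            threading.Thread(target=migrate_once, args=(db_path, applied_by, i))
            for i in range(count)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchall()[0][0]
        finally:
            conn.close()
        return applied_by, rows
```

A passing run has exactly one entry in `applied_by` and exactly one row in `schema_migrations`, whichever worker wins the lock. A real test against gunicorn would replace the threads with separate processes.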
+
+## Migration Path
+
+1. Implement fix in `starpunk/migrations.py`
+2. Test locally with multiple workers
+3. Build and test container
+4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
+5. Monitor production logs for retry patterns
+
+## Implementation Notes (Post-Analysis)
+
+Based on comprehensive architectural review, the following clarifications have been established:
+
+### Critical Implementation Details
+
+1. **Connection Management**: Create NEW connection for each retry attempt (no reuse)
+2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
+3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
+4. **Logging Levels**: Graduated (DEBUG for retry 1-3, INFO for 4-7, WARNING for 8+)
+5. **Transaction Boundaries**: Separate transactions for schema/migrations/data
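Items 3 and 4 above are straightforward to express as small helpers. The sketch below is one possible reading, not the implemented code: the names `retry_log_level` and `budget_exhausted` are assumptions, and the 120s figure is taken from item 3.

```python
import logging


def retry_log_level(retry_count):
    """Graduated level: DEBUG for retries 1-3, INFO for 4-7, WARNING for 8+."""
    if retry_count <= 3:
        return logging.DEBUG
    if retry_count <= 7:
        return logging.INFO
    return logging.WARNING


def budget_exhausted(started_at, now, max_total=120.0):
    """True once the total time spent retrying reaches the overall cap."""
    return (now - started_at) >= max_total
```

In a retry loop these would be used as `logger.log(retry_log_level(retry_count), ...)` and `budget_exhausted(start, time.monotonic())`, with `start` captured from `time.monotonic()` before the first attempt.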
+
+### Test Requirements
+
+- Unit tests with multiprocessing.Pool
+- Integration tests with actual gunicorn
+- Container tests with full deployment
+- Performance target: <500ms with 4 workers
+
+### Documentation
+
+- Full Q&A: `docs/architecture/migration-race-condition-answers.md`
+- Implementation Guide: `docs/reports/migration-race-condition-fix-implementation.md`
+- Quick Reference: `docs/architecture/migration-fix-quick-reference.md`
+
+## References
+
+- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
+- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
+- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
+- Issue: Production migration race condition with gunicorn workers
+
+## Status Update
+
+**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.