# Migration Race Condition Fix - Implementation Guide
## Executive Summary
**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers race to apply migrations, causing container startup failures.
**Solution**: Implement database-level locking with retry logic in `migrations.py`.
**Urgency**: HIGH - This is a blocker for v1.0.0-rc.4 release.
## Root Cause Analysis
### The Problem Flow
1. Container starts with `gunicorn --workers 4`
2. Each worker independently calls:
```
app.py → create_app() → init_db() → run_migrations()
```
3. All 4 workers simultaneously try to:
   - INSERT into the `schema_migrations` table
   - Apply the same migrations
4. SQLite's UNIQUE constraint on `migration_name` causes workers 2-4 to crash (reproduced in the sketch at the end of this section)
5. Container restarts, works on second attempt (migrations already applied)
### Why This Happens
- **No synchronization**: Workers are independent processes
- **No locking**: Migration code doesn't prevent concurrent execution
- **Immediate failure**: UNIQUE constraint violation crashes the worker
- **Gunicorn behavior**: Worker crash triggers container restart
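The failure mode is easy to reproduce outside StarPunk. The following minimal sketch is illustrative only (the table mirrors `schema_migrations`, but the database file and migration name are made up): several processes unconditionally record the same migration, so every process after the first hits the UNIQUE constraint.
```python
# Standalone sketch of the race (not StarPunk code): every worker after the
# first one violates the UNIQUE constraint on migration_name and fails.
import multiprocessing
import sqlite3

DB = "race-demo.db"  # throwaway database, illustrative name


def setup() -> None:
    with sqlite3.connect(DB) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_migrations (migration_name TEXT UNIQUE)"
        )


def record_migration(worker_id: int) -> str:
    try:
        with sqlite3.connect(DB) as conn:  # commits on successful exit
            conn.execute(
                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                ("001_example.sql",),  # illustrative migration name
            )
        return f"Worker {worker_id}: ok"
    except sqlite3.IntegrityError as e:
        return f"Worker {worker_id}: CRASHED - {e}"


if __name__ == "__main__":
    setup()
    with multiprocessing.Pool(4) as pool:
        for line in pool.map(record_migration, range(4)):
            print(line)
```
Typically only one process prints `ok`; the rest report `UNIQUE constraint failed`, which is exactly what takes down workers 2-4 at container startup.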
## Immediate Fix Implementation
### Step 1: Update migrations.py
Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:
```python
import time
import random
```
### Step 2: Replace run_migrations function
Replace the entire `run_migrations` function (lines 304-462) with:
```python
def run_migrations(db_path, logger=None):
    """
    Run all pending database migrations with concurrency protection

    Uses database-level locking to prevent race conditions when multiple
    workers start simultaneously. Only one worker will apply migrations;
    others will wait and verify completion.

    Args:
        db_path: Path to SQLite database file
        logger: Optional logger for output

    Raises:
        MigrationError: If any migration fails to apply or lock cannot be acquired
    """
    if logger is None:
        logger = logging.getLogger(__name__)

    # Determine migrations directory
    migrations_dir = Path(__file__).parent.parent / "migrations"
    if not migrations_dir.exists():
        logger.warning(f"Migrations directory not found: {migrations_dir}")
        return

    # Retry configuration for lock acquisition
    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None
        try:
            # Connect with longer timeout for lock contention
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Attempt to acquire exclusive lock for migrations
            # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Ensure migrations tracking table exists
                create_migrations_table(conn)

                # Quick check: have migrations already been applied by another worker?
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                migration_count = cursor.fetchone()[0]

                # Discover migration files
                migration_files = discover_migration_files(migrations_dir)
                if not migration_files:
                    conn.commit()
                    logger.info("No migration files found")
                    return

                # If migrations exist and we're not the first worker, verify and exit
                if migration_count > 0:
                    # Check if all migrations are applied
                    applied = get_applied_migrations(conn)
                    pending = [m for m, _ in migration_files if m not in applied]
                    if not pending:
                        conn.commit()
                        logger.debug("All migrations already applied by another worker")
                        return
                    # If there are pending migrations, we continue to apply them
                    logger.info(f"Found {len(pending)} pending migrations to apply")

                # Fresh database detection (original logic preserved)
                if migration_count == 0:
                    if is_schema_current(conn):
                        # Schema is current - mark all migrations as applied
                        for migration_name, _ in migration_files:
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                        conn.commit()
                        logger.info(
                            f"Fresh database detected: marked {len(migration_files)} "
                            f"migrations as applied (schema already current)"
                        )
                        return
                    else:
                        logger.info("Fresh database with partial schema: applying needed migrations")

                # Get already-applied migrations
                applied = get_applied_migrations(conn)

                # Apply pending migrations (original logic preserved)
                pending_count = 0
                skipped_count = 0

                for migration_name, migration_path in migration_files:
                    if migration_name not in applied:
                        # Check if migration is actually needed
                        should_check_needed = (
                            migration_count == 0 or
                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
                        )

                        if should_check_needed and not is_migration_needed(conn, migration_name):
                            # Special handling for migration 002: if tables exist but indexes don't
                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
                                # Check if we need to create indexes
                                indexes_to_create = []
                                if not index_exists(conn, 'idx_tokens_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
                                if not index_exists(conn, 'idx_tokens_me'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
                                if not index_exists(conn, 'idx_tokens_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
                                if not index_exists(conn, 'idx_auth_codes_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
                                if not index_exists(conn, 'idx_auth_codes_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")

                                if indexes_to_create:
                                    for index_sql in indexes_to_create:
                                        conn.execute(index_sql)
                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")

                            # Mark as applied without executing full migration
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                            skipped_count += 1
                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
                        else:
                            # Apply the migration (within our transaction)
                            try:
                                # Read migration SQL
                                migration_sql = migration_path.read_text()
                                logger.debug(f"Applying migration: {migration_name}")

                                # Execute migration (already in transaction)
                                conn.executescript(migration_sql)

                                # Record migration as applied
                                conn.execute(
                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                    (migration_name,)
                                )
                                logger.info(f"Applied migration: {migration_name}")
                                pending_count += 1
                            except Exception as e:
                                # Roll back the transaction
                                raise MigrationError(f"Migration {migration_name} failed: {e}")

                # Commit all migrations atomically
                conn.commit()

                # Summary
                total_count = len(migration_files)
                if pending_count > 0 or skipped_count > 0:
                    if skipped_count > 0:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
                            f"(already in SCHEMA_SQL), {total_count} total"
                        )
                    else:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, "
                            f"{total_count} total"
                        )
                else:
                    logger.info(f"All migrations up to date ({total_count} total)")

                return  # Success!

            except MigrationError:
                conn.rollback()
                raise
            except Exception as e:
                conn.rollback()
                raise MigrationError(f"Migration system error: {e}")

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e).lower():
                # Another worker has the lock, retry with exponential backoff
                retry_count += 1
                if retry_count < max_retries:
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                    logger.debug(
                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
                        f"in {delay:.2f}s"
                    )
                    time.sleep(delay)
                    continue
                else:
                    raise MigrationError(
                        f"Failed to acquire migration lock after {max_retries} attempts. "
                        f"This may indicate a hung migration process."
                    )
            else:
                # Non-lock related database error
                error_msg = f"Database error during migration: {e}"
                logger.error(error_msg)
                raise MigrationError(error_msg)
        except Exception as e:
            # Unexpected error
            error_msg = f"Unexpected error during migration: {e}"
            logger.error(error_msg)
            raise MigrationError(error_msg)
        finally:
            if conn:
                try:
                    conn.close()
                except:
                    pass  # Ignore errors during cleanup

    # Should never reach here, but just in case
    raise MigrationError("Migration retry loop exited unexpectedly")
```
### Step 3: Testing the Fix
Create a test script to verify the fix works:
```python
#!/usr/bin/env python3
"""Test migration race condition fix"""
import multiprocessing
import time
import sys
from pathlib import Path
# Add project to path
sys.path.insert(0, str(Path(__file__).parent))
def worker_init(worker_id):
    """Simulate a gunicorn worker starting"""
    print(f"Worker {worker_id}: Starting...")
    try:
        from starpunk import create_app
        app = create_app()
        print(f"Worker {worker_id}: Successfully initialized")
        return True
    except Exception as e:
        print(f"Worker {worker_id}: FAILED - {e}")
        return False


if __name__ == "__main__":
    # Test with 10 workers (more than production to stress test)
    num_workers = 10
    print(f"Starting {num_workers} workers simultaneously...")

    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(worker_init, range(num_workers))

    success_count = sum(results)
    print(f"\nResults: {success_count}/{num_workers} workers succeeded")

    if success_count == num_workers:
        print("SUCCESS: All workers initialized without race condition")
        sys.exit(0)
    else:
        print("FAILURE: Race condition still present")
        sys.exit(1)
```
## Verification Steps
1. **Local Testing**:
```bash
# Test with multiple workers
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
# Check logs for retry messages
# Should see "Database locked by another worker, retry..." messages
```
2. **Container Testing**:
```bash
# Build container
podman build -t starpunk:test -f Containerfile .
# Run with fresh database
podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test
# Should start cleanly without restarts
```
3. **Log Verification**:
Look for these patterns:
- One worker: "Applied migration: XXX"
- Other workers: "Database locked by another worker, retry..."
- Final: "All migrations already applied by another worker"
## Risk Assessment
### Risk Level: LOW
The fix is safe because:
1. Uses SQLite's native transaction mechanism (illustrated in the sketch below)
2. Preserves all existing migration logic
3. Only adds retry wrapper around existing code
4. Fails safely with clear error messages
5. No data loss possible (transactions ensure atomicity)
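To make point 1 concrete, here is a minimal sketch (illustrative only, not StarPunk code; the database file name is made up) of the SQLite locking behavior the fix relies on: once one connection has started a write transaction with `BEGIN IMMEDIATE`, a second connection attempting the same waits for its busy timeout and then fails with "database is locked", the exact condition the retry loop catches.
```python
# Illustrative sketch of BEGIN IMMEDIATE lock contention (not StarPunk code)
import sqlite3

DB = "lock-demo.db"  # throwaway database, illustrative name

first = sqlite3.connect(DB, timeout=1.0)
first.execute("BEGIN IMMEDIATE")       # acquires the write (RESERVED) lock

second = sqlite3.connect(DB, timeout=1.0)
try:
    second.execute("BEGIN IMMEDIATE")  # blocks up to 1s, then fails
except sqlite3.OperationalError as e:
    print(e)                           # "database is locked" -> worker retries

first.rollback()                       # releasing the lock unblocks other workers
first.close()
second.close()
```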
### Rollback Plan
If issues occur:
1. Revert to previous version
2. Start container with single worker temporarily: `--workers 1`
3. Once migrations apply, scale back to 4 workers
## Release Strategy
### Option 1: Hotfix (Recommended)
- Release as v1.0.0-rc.3.1
- Immediate deployment to fix production issue
- Minimal testing required (focused fix)
### Option 2: Include in rc.4
- Bundle with other rc.4 changes
- More testing time
- Risk: Production remains broken until rc.4
**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.
## Alternative Workarounds (If Needed Urgently)
Until the proper fix is deployed, these temporary workarounds can be used:
### Workaround 1: Single Worker Startup
```bash
# In Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]
# After first successful start, rebuild with 4 workers
```
### Workaround 2: Pre-migration Script
```bash
#!/bin/bash
# Entrypoint script that runs migrations before starting gunicorn
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...
```
### Workaround 3: Preload the Application
```bash
# --preload loads the app (and runs migrations) once in the master process
# before the workers are forked
gunicorn --preload --workers 4 ...
```
## Summary
- **Problem**: Race condition when multiple workers apply migrations
- **Solution**: Database-level locking with retry logic
- **Implementation**: ~150 lines of code changes in migrations.py
- **Testing**: Verify with multi-worker startup
- **Risk**: LOW - Safe, atomic changes
- **Urgency**: HIGH - Blocks production deployment
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately
## Developer Questions Answered
All 23 architectural questions have been comprehensively answered in:
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
**Key Decisions:**
- NEW connection per retry (not reused)
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
- Separate transactions for each operation
- Both multiprocessing.Pool AND gunicorn testing needed
- 30s timeout per attempt, 120s total maximum
- Graduated logging levels based on retry count (sketched below)
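As a sketch of the graduated-logging decision (the helper name `log_retry` and the thresholds below are illustrative assumptions, not the shipped implementation), retries could be logged at increasing severity as the retry count grows:
```python
import logging


def log_retry(logger: logging.Logger, retry_count: int, max_retries: int, delay: float) -> None:
    """Illustrative helper: escalate log level as lock-contention retries accumulate."""
    message = (
        f"Database locked by another worker, retry {retry_count}/{max_retries} "
        f"in {delay:.2f}s"
    )
    if retry_count <= 3:
        logger.debug(message)    # early retries are expected during startup
    elif retry_count <= 7:
        logger.info(message)     # sustained contention is worth surfacing
    else:
        logger.warning(message)  # close to the retry limit
```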
**Implementation Status: READY TO PROCEED**