docs: Add architect documentation for migration race condition fix
Add comprehensive architectural documentation for the migration race
condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

docs/reports/migration-race-condition-fix-implementation.md (new file, 431 lines)
# Migration Race Condition Fix - Implementation Guide

## Executive Summary

**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers race to apply migrations, causing container startup failures.

**Solution**: Implement database-level locking with retry logic in `migrations.py`.

**Urgency**: HIGH - This is a blocker for the v1.0.0-rc.4 release.

## Root Cause Analysis

### The Problem Flow

1. Container starts with `gunicorn --workers 4`
2. Each worker independently calls:
   ```
   app.py → create_app() → init_db() → run_migrations()
   ```
3. All 4 workers simultaneously try to:
   - INSERT into the `schema_migrations` table
   - Apply the same migrations
4. SQLite's UNIQUE constraint on `migration_name` causes workers 2-4 to crash
5. The container restarts and works on the second attempt (migrations already applied)

### Why This Happens

- **No synchronization**: Workers are independent processes
- **No locking**: The migration code does not prevent concurrent execution
- **Immediate failure**: The UNIQUE constraint violation crashes the worker
- **Gunicorn behavior**: A worker crash triggers a container restart
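
To make the failure mode concrete, here is a minimal, standalone reproduction. The file and table setup are hypothetical (this is not the actual StarPunk code): several processes insert the same row into a table with a UNIQUE `migration_name` column, and all but the first hit `sqlite3.IntegrityError`, which is the crash described above.

```python
#!/usr/bin/env python3
"""Minimal reproduction of the race: N workers record the same migration."""

import multiprocessing
import sqlite3

DB_PATH = "race-demo.db"  # throwaway database; delete it to rerun from scratch


def setup():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations "
        "(migration_name TEXT UNIQUE NOT NULL)"
    )
    conn.commit()
    conn.close()


def unprotected_migrate(worker_id):
    """What each gunicorn worker effectively does today: no lock, no retry."""
    conn = sqlite3.connect(DB_PATH)
    try:
        # Every worker believes it is first and records the same migration
        conn.execute(
            "INSERT INTO schema_migrations (migration_name) VALUES (?)",
            ("001_initial_schema.sql",),
        )
        conn.commit()
        return f"worker {worker_id}: applied"
    except sqlite3.IntegrityError as e:
        # Workers 2..N land here; in production this crashes the worker
        return f"worker {worker_id}: CRASHED ({e})"
    finally:
        conn.close()


if __name__ == "__main__":
    setup()
    with multiprocessing.Pool(4) as pool:
        for line in pool.map(unprotected_migrate, range(1, 5)):
            print(line)
```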

## Immediate Fix Implementation

### Step 1: Update migrations.py

Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
import time
import random
```
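
These imports drive the exponential backoff in the retry loop below. As a quick illustration of what that schedule works out to, here is a standalone sketch using the same `base_delay = 0.1` and `max_retries = 10` constants as the function in Step 2; the cumulative worst-case wait is roughly 102 seconds, consistent with the ~120s total maximum noted in the key decisions at the end of this document.

```python
import random

base_delay = 0.1   # seconds, same constant as in run_migrations below
max_retries = 10

total = 0.0
for retry_count in range(1, max_retries):  # the 10th failed attempt raises instead of sleeping
    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
    total += delay
    print(f"retry {retry_count}: wait ~{delay:.2f}s (cumulative ~{total:.1f}s)")
```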

### Step 2: Replace run_migrations function

Replace the entire `run_migrations` function (lines 304-462) with:

```python
def run_migrations(db_path, logger=None):
    """
    Run all pending database migrations with concurrency protection

    Uses database-level locking to prevent race conditions when multiple
    workers start simultaneously. Only one worker will apply migrations;
    others will wait and verify completion.

    Args:
        db_path: Path to SQLite database file
        logger: Optional logger for output

    Raises:
        MigrationError: If any migration fails to apply or lock cannot be acquired
    """
    if logger is None:
        logger = logging.getLogger(__name__)

    # Determine migrations directory
    migrations_dir = Path(__file__).parent.parent / "migrations"

    if not migrations_dir.exists():
        logger.warning(f"Migrations directory not found: {migrations_dir}")
        return

    # Retry configuration for lock acquisition
    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None
        try:
            # Connect with longer timeout for lock contention
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Attempt to acquire exclusive lock for migrations
            # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Ensure migrations tracking table exists
                create_migrations_table(conn)

                # Quick check: have migrations already been applied by another worker?
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                migration_count = cursor.fetchone()[0]

                # Discover migration files
                migration_files = discover_migration_files(migrations_dir)

                if not migration_files:
                    conn.commit()
                    logger.info("No migration files found")
                    return

                # If migrations exist and we're not the first worker, verify and exit
                if migration_count > 0:
                    # Check if all migrations are applied
                    applied = get_applied_migrations(conn)
                    pending = [m for m, _ in migration_files if m not in applied]

                    if not pending:
                        conn.commit()
                        logger.debug("All migrations already applied by another worker")
                        return
                    # If there are pending migrations, we continue to apply them
                    logger.info(f"Found {len(pending)} pending migrations to apply")

                # Fresh database detection (original logic preserved)
                if migration_count == 0:
                    if is_schema_current(conn):
                        # Schema is current - mark all migrations as applied
                        for migration_name, _ in migration_files:
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                        conn.commit()
                        logger.info(
                            f"Fresh database detected: marked {len(migration_files)} "
                            f"migrations as applied (schema already current)"
                        )
                        return
                    else:
                        logger.info("Fresh database with partial schema: applying needed migrations")

                # Get already-applied migrations
                applied = get_applied_migrations(conn)

                # Apply pending migrations (original logic preserved)
                pending_count = 0
                skipped_count = 0
                for migration_name, migration_path in migration_files:
                    if migration_name not in applied:
                        # Check if migration is actually needed
                        should_check_needed = (
                            migration_count == 0 or
                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
                        )

                        if should_check_needed and not is_migration_needed(conn, migration_name):
                            # Special handling for migration 002: if tables exist but indexes don't
                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
                                # Check if we need to create indexes
                                indexes_to_create = []
                                if not index_exists(conn, 'idx_tokens_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
                                if not index_exists(conn, 'idx_tokens_me'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
                                if not index_exists(conn, 'idx_tokens_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
                                if not index_exists(conn, 'idx_auth_codes_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
                                if not index_exists(conn, 'idx_auth_codes_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")

                                if indexes_to_create:
                                    for index_sql in indexes_to_create:
                                        conn.execute(index_sql)
                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")

                            # Mark as applied without executing full migration
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                            skipped_count += 1
                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
                        else:
                            # Apply the migration (within our transaction)
                            try:
                                # Read migration SQL
                                migration_sql = migration_path.read_text()

                                logger.debug(f"Applying migration: {migration_name}")

                                # Execute migration (already in transaction)
                                conn.executescript(migration_sql)

                                # Record migration as applied
                                conn.execute(
                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                    (migration_name,)
                                )

                                logger.info(f"Applied migration: {migration_name}")
                                pending_count += 1

                            except Exception as e:
                                # Roll back the transaction
                                raise MigrationError(f"Migration {migration_name} failed: {e}")

                # Commit all migrations atomically
                conn.commit()

                # Summary
                total_count = len(migration_files)
                if pending_count > 0 or skipped_count > 0:
                    if skipped_count > 0:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
                            f"(already in SCHEMA_SQL), {total_count} total"
                        )
                    else:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, "
                            f"{total_count} total"
                        )
                else:
                    logger.info(f"All migrations up to date ({total_count} total)")

                return  # Success!

            except MigrationError:
                conn.rollback()
                raise

            except Exception as e:
                conn.rollback()
                raise MigrationError(f"Migration system error: {e}")

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e).lower():
                # Another worker has the lock, retry with exponential backoff
                retry_count += 1

                if retry_count < max_retries:
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                    logger.debug(
                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
                        f"in {delay:.2f}s"
                    )
                    time.sleep(delay)
                    continue
                else:
                    raise MigrationError(
                        f"Failed to acquire migration lock after {max_retries} attempts. "
                        f"This may indicate a hung migration process."
                    )
            else:
                # Non-lock related database error
                error_msg = f"Database error during migration: {e}"
                logger.error(error_msg)
                raise MigrationError(error_msg)

        except MigrationError:
            # Already rolled back and wrapped above; propagate unchanged
            raise

        except Exception as e:
            # Unexpected error
            error_msg = f"Unexpected error during migration: {e}"
            logger.error(error_msg)
            raise MigrationError(error_msg)

        finally:
            if conn:
                try:
                    conn.close()
                except Exception:
                    pass  # Ignore errors during cleanup

    # Should never reach here, but just in case
    raise MigrationError("Migration retry loop exited unexpectedly")
```
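
The locking primitive this relies on can be sanity-checked in isolation. The following standalone sketch is not part of `migrations.py` and the database file name is made up: it opens two connections to the same file, the first holds a `BEGIN IMMEDIATE` transaction, and the second's `BEGIN IMMEDIATE` fails with "database is locked" once its busy timeout expires, which is exactly the `sqlite3.OperationalError` that the retry loop above catches and backs off on.

```python
"""Standalone check of BEGIN IMMEDIATE contention between two connections."""

import sqlite3

DB = "lock-demo.db"  # throwaway file; delete it after running

# Writer 1 takes the RESERVED lock and keeps its transaction open
holder = sqlite3.connect(DB, isolation_level=None)  # autocommit; we issue BEGIN ourselves
holder.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")

# Writer 2 uses a short busy timeout so the demo fails fast instead of waiting 30s
waiter = sqlite3.connect(DB, timeout=0.5, isolation_level=None)
try:
    waiter.execute("BEGIN IMMEDIATE")
    print("unexpected: second writer acquired the lock")
except sqlite3.OperationalError as e:
    print(f"second writer blocked as expected: {e}")  # "database is locked"
finally:
    waiter.close()
    holder.rollback()
    holder.close()
```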

### Step 3: Testing the Fix

Create a test script to verify the fix works:

```python
#!/usr/bin/env python3
"""Test migration race condition fix"""

import multiprocessing
import sys
from pathlib import Path

# Add project to path
sys.path.insert(0, str(Path(__file__).parent))


def worker_init(worker_id):
    """Simulate a gunicorn worker starting"""
    print(f"Worker {worker_id}: Starting...")

    try:
        from starpunk import create_app

        app = create_app()  # creating the app runs init_db() and migrations
        print(f"Worker {worker_id}: Successfully initialized")
        return True
    except Exception as e:
        print(f"Worker {worker_id}: FAILED - {e}")
        return False


if __name__ == "__main__":
    # Test with 10 workers (more than production to stress test)
    num_workers = 10

    print(f"Starting {num_workers} workers simultaneously...")

    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(worker_init, range(num_workers))

    success_count = sum(results)
    print(f"\nResults: {success_count}/{num_workers} workers succeeded")

    if success_count == num_workers:
        print("SUCCESS: All workers initialized without race condition")
        sys.exit(0)
    else:
        print("FAILURE: Race condition still present")
        sys.exit(1)
```

## Verification Steps

1. **Local Testing**:
   ```bash
   # Test with multiple workers
   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app

   # Check logs for retry messages
   # Should see "Database locked by another worker, retry..." messages
   ```

2. **Container Testing**:
   ```bash
   # Build container
   podman build -t starpunk:test -f Containerfile .

   # Run with fresh database
   podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test

   # Should start cleanly without restarts
   ```

3. **Log Verification**:
   Look for these patterns:
   - One worker: "Applied migration: XXX"
   - Other workers: "Database locked by another worker, retry..."
   - Final: "All migrations already applied by another worker"

## Risk Assessment

### Risk Level: LOW

The fix is safe because:

1. Uses SQLite's native transaction mechanism
2. Preserves all existing migration logic
3. Only adds a retry wrapper around existing code
4. Fails safely with clear error messages
5. No data loss is possible (transactions ensure atomicity; see the sketch after this list)
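
Point 5 rests on ordinary SQLite transaction semantics. The short standalone sketch below (in-memory database, simulated mid-migration failure) shows that a rollback discards the partially recorded work, so a crashed migration never leaves a half-applied record behind:

```python
import sqlite3

# In-memory database; isolation_level=None means we control transactions explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE schema_migrations (migration_name TEXT UNIQUE NOT NULL)")

conn.execute("BEGIN IMMEDIATE")
try:
    conn.execute(
        "INSERT INTO schema_migrations (migration_name) VALUES (?)",
        ("001_initial_schema.sql",),
    )
    raise RuntimeError("simulated failure halfway through a migration")
except RuntimeError:
    conn.rollback()  # undoes the INSERT along with everything else in the transaction

count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(f"rows recorded after rollback: {count}")  # 0 - nothing was partially applied
conn.close()
```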

### Rollback Plan

If issues occur:

1. Revert to the previous version
2. Start the container with a single worker temporarily: `--workers 1`
3. Once migrations apply, scale back to 4 workers

## Release Strategy

### Option 1: Hotfix (Recommended)

- Release as v1.0.0-rc.3.1
- Immediate deployment to fix the production issue
- Minimal testing required (focused fix)

### Option 2: Include in rc.4

- Bundle with other rc.4 changes
- More testing time
- Risk: production remains broken until rc.4

**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.

## Alternative Workarounds (If Needed Urgently)

While the proper fix is implemented, these temporary workarounds can be used:

### Workaround 1: Single Worker Startup
```bash
# In the Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]

# After the first successful start, rebuild with 4 workers
```

### Workaround 2: Pre-migration Script
```bash
#!/bin/bash
# Entrypoint script: run migrations once before gunicorn forks its workers
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...
```

### Workaround 3: Preload the Application
```bash
# --preload loads the app (and therefore runs migrations) once in the
# gunicorn master process before workers are forked, so only one process migrates
gunicorn --preload --workers 4 ...
```

## Summary

- **Problem**: Race condition when multiple workers apply migrations
- **Solution**: Database-level locking with retry logic
- **Implementation**: ~150 lines of code changes in migrations.py
- **Testing**: Verify with multi-worker startup
- **Risk**: LOW - Safe, atomic changes
- **Urgency**: HIGH - Blocks production deployment
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately

## Developer Questions Answered

All 23 architectural questions have been comprehensively answered in:
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`

**Key Decisions:**

- NEW connection per retry (not reused)
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
- Separate transactions for each operation
- Both multiprocessing.Pool AND gunicorn testing needed
- 30s timeout per attempt, 120s total maximum
- Graduated logging levels based on retry count (sketched below)
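
The answers document is not reproduced here, so the following is only a sketch of what the last decision might look like in practice; the helper name `log_lock_retry` and the thresholds are illustrative, not taken from the reference.

```python
import logging


def log_lock_retry(logger: logging.Logger, retry_count: int, max_retries: int, delay: float) -> None:
    """Log lock-contention retries quietly at first, more loudly as they pile up."""
    message = (
        f"Database locked by another worker, retry {retry_count}/{max_retries} "
        f"in {delay:.2f}s"
    )
    if retry_count <= 3:
        logger.debug(message)    # expected during normal multi-worker startup
    elif retry_count <= 7:
        logger.info(message)     # taking longer than usual
    else:
        logger.warning(message)  # approaching the retry limit
```
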
**Implementation Status: READY TO PROCEED**