# Migration Race Condition Fix - Implementation Guide

## Executive Summary

**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers race to apply database migrations, causing container startup failures.

**Solution**: Implement database-level locking (via SQLite's `BEGIN IMMEDIATE`) with retry logic in `migrations.py`.

**Urgency**: HIGH - this is a blocker for the v1.0.0-rc.4 release.

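Reduced to its essence, the fix follows this pattern (a simplified sketch with illustrative names; the full implementation appears in Step 2 below):

```python
# Sketch of the core pattern: one writer wins the lock, the rest retry.
# Names (with_migration_lock, apply_migrations) are illustrative only.
import random
import sqlite3
import time

def with_migration_lock(db_path: str, apply_migrations) -> None:
    for attempt in range(1, 11):
        conn = sqlite3.connect(db_path, timeout=30.0)
        try:
            conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
            apply_migrations(conn)
            conn.commit()
            return
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e).lower():
                raise
            # Another worker holds the lock: back off and retry.
            time.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.1))
        finally:
            conn.close()  # new connection on every retry
    raise RuntimeError("could not acquire migration lock")
```
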
## Root Cause Analysis

### The Problem Flow

1. Container starts with `gunicorn --workers 4`
2. Each worker independently calls:

   ```
   app.py → create_app() → init_db() → run_migrations()
   ```

3. All 4 workers simultaneously try to:
   - INSERT into the `schema_migrations` table
   - Apply the same migrations
4. SQLite's UNIQUE constraint on `migration_name` causes workers 2-4 to crash
5. The container restarts and succeeds on the second attempt (migrations already applied)

### Why This Happens

- **No synchronization**: Workers are independent processes
- **No locking**: Migration code doesn't prevent concurrent execution
- **Immediate failure**: The UNIQUE constraint violation crashes the worker
- **Gunicorn behavior**: A worker crash triggers a container restart

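To make the failure concrete, here is a minimal sketch of the unguarded pattern (a hypothetical simplification, not the actual `run_migrations` code): every worker executes the same INSERT, and all but the first hit the UNIQUE constraint.

```python
# Hypothetical simplification of the unguarded pattern; not the actual
# migrations.py code. Every worker process runs this concurrently.
import sqlite3

def run_migrations_unsafe(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_migrations "
            "(migration_name TEXT UNIQUE)"
        )
        # No lock: workers 2-4 reach this line after worker 1 has already
        # inserted the row, and crash with sqlite3.IntegrityError.
        # The migration name below is illustrative.
        conn.execute(
            "INSERT INTO schema_migrations (migration_name) VALUES (?)",
            ("001_initial_schema.sql",),
        )
        conn.commit()
    finally:
        conn.close()
```
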
## Immediate Fix Implementation

### Step 1: Update migrations.py

Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
import time
import random
```

### Step 2: Replace run_migrations function

Replace the entire `run_migrations` function (lines 304-462) with:

```python
def run_migrations(db_path, logger=None):
    """
    Run all pending database migrations with concurrency protection

    Uses database-level locking to prevent race conditions when multiple
    workers start simultaneously. Only one worker will apply migrations;
    others will wait and verify completion.

    Args:
        db_path: Path to SQLite database file
        logger: Optional logger for output

    Raises:
        MigrationError: If any migration fails to apply or the lock
            cannot be acquired
    """
    if logger is None:
        logger = logging.getLogger(__name__)

    # Determine migrations directory
    migrations_dir = Path(__file__).parent.parent / "migrations"

    if not migrations_dir.exists():
        logger.warning(f"Migrations directory not found: {migrations_dir}")
        return

    # Retry configuration for lock acquisition
    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None
        try:
            # Connect with a longer timeout for lock contention
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Attempt to acquire the migration lock.
            # BEGIN IMMEDIATE acquires a RESERVED (write) lock,
            # blocking other writers while readers continue.
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Ensure migrations tracking table exists
                create_migrations_table(conn)

                # Quick check: have migrations already been applied by another worker?
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                migration_count = cursor.fetchone()[0]

                # Discover migration files
                migration_files = discover_migration_files(migrations_dir)

                if not migration_files:
                    conn.commit()
                    logger.info("No migration files found")
                    return

                # If migrations exist and we're not the first worker, verify and exit
                if migration_count > 0:
                    # Check if all migrations are applied
                    applied = get_applied_migrations(conn)
                    pending = [m for m, _ in migration_files if m not in applied]

                    if not pending:
                        conn.commit()
                        logger.debug("All migrations already applied by another worker")
                        return
                    # If there are pending migrations, we continue to apply them
                    logger.info(f"Found {len(pending)} pending migrations to apply")

                # Fresh database detection (original logic preserved)
                if migration_count == 0:
                    if is_schema_current(conn):
                        # Schema is current - mark all migrations as applied
                        for migration_name, _ in migration_files:
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                        conn.commit()
                        logger.info(
                            f"Fresh database detected: marked {len(migration_files)} "
                            f"migrations as applied (schema already current)"
                        )
                        return
                    else:
                        logger.info("Fresh database with partial schema: applying needed migrations")

                # Get already-applied migrations
                applied = get_applied_migrations(conn)

                # Apply pending migrations (original logic preserved)
                pending_count = 0
                skipped_count = 0
                for migration_name, migration_path in migration_files:
                    if migration_name not in applied:
                        # Check if migration is actually needed
                        should_check_needed = (
                            migration_count == 0 or
                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
                        )

                        if should_check_needed and not is_migration_needed(conn, migration_name):
                            # Special handling for migration 002: if tables exist but indexes don't
                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
                                # Check if we need to create indexes
                                indexes_to_create = []
                                if not index_exists(conn, 'idx_tokens_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
                                if not index_exists(conn, 'idx_tokens_me'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
                                if not index_exists(conn, 'idx_tokens_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
                                if not index_exists(conn, 'idx_auth_codes_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
                                if not index_exists(conn, 'idx_auth_codes_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")

                                if indexes_to_create:
                                    for index_sql in indexes_to_create:
                                        conn.execute(index_sql)
                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")

                            # Mark as applied without executing full migration
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                            skipped_count += 1
                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
                        else:
                            # Apply the migration
                            try:
                                # Read migration SQL
                                migration_sql = migration_path.read_text()

                                logger.debug(f"Applying migration: {migration_name}")

                                # Execute migration SQL (note: executescript()
                                # implicitly commits any open transaction
                                # before running the script)
                                conn.executescript(migration_sql)

                                # Record migration as applied
                                conn.execute(
                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                    (migration_name,)
                                )

                                logger.info(f"Applied migration: {migration_name}")
                                pending_count += 1

                            except Exception as e:
                                # Roll back the transaction
                                raise MigrationError(f"Migration {migration_name} failed: {e}")

                # Commit all migrations atomically
                conn.commit()

                # Summary
                total_count = len(migration_files)
                if pending_count > 0 or skipped_count > 0:
                    if skipped_count > 0:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
                            f"(already in SCHEMA_SQL), {total_count} total"
                        )
                    else:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, "
                            f"{total_count} total"
                        )
                else:
                    logger.info(f"All migrations up to date ({total_count} total)")

                return  # Success!

            except MigrationError:
                conn.rollback()
                raise

            except Exception as e:
                conn.rollback()
                raise MigrationError(f"Migration system error: {e}")

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e).lower():
                # Another worker has the lock; retry with exponential backoff
                retry_count += 1

                if retry_count < max_retries:
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                    logger.debug(
                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
                        f"in {delay:.2f}s"
                    )
                    time.sleep(delay)
                    continue
                else:
                    raise MigrationError(
                        f"Failed to acquire migration lock after {max_retries} attempts. "
                        f"This may indicate a hung migration process."
                    )
            else:
                # Non-lock-related database error
                error_msg = f"Database error during migration: {e}"
                logger.error(error_msg)
                raise MigrationError(error_msg)

        except Exception as e:
            # Unexpected error
            error_msg = f"Unexpected error during migration: {e}"
            logger.error(error_msg)
            raise MigrationError(error_msg)

        finally:
            if conn:
                try:
                    conn.close()
                except Exception:
                    pass  # Ignore errors during cleanup

    # Should never reach here, but just in case
    raise MigrationError("Migration retry loop exited unexpectedly")
```
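
A note on the locking choice: `BEGIN IMMEDIATE` takes SQLite's RESERVED lock as soon as the transaction opens, so a second worker's `BEGIN IMMEDIATE` fails with "database is locked" once its busy timeout expires, which is exactly the error the retry loop catches, while readers can continue. `BEGIN EXCLUSIVE` would also block readers without adding safety here; this matches the "BEGIN IMMEDIATE is correct" decision recorded at the end of this document.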

### Step 3: Testing the Fix

Create a test script to verify the fix works:

```python
#!/usr/bin/env python3
"""Test migration race condition fix"""

import multiprocessing
import sys
from pathlib import Path

# Add project to path
sys.path.insert(0, str(Path(__file__).parent))


def worker_init(worker_id):
    """Simulate a gunicorn worker starting"""
    print(f"Worker {worker_id}: Starting...")

    try:
        from starpunk import create_app
        app = create_app()
        print(f"Worker {worker_id}: Successfully initialized")
        return True
    except Exception as e:
        print(f"Worker {worker_id}: FAILED - {e}")
        return False


if __name__ == "__main__":
    # Test with 10 workers (more than production, to stress test)
    num_workers = 10

    print(f"Starting {num_workers} workers simultaneously...")

    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(worker_init, range(num_workers))

    success_count = sum(results)
    print(f"\nResults: {success_count}/{num_workers} workers succeeded")

    if success_count == num_workers:
        print("SUCCESS: All workers initialized without race condition")
        sys.exit(0)
    else:
        print("FAILURE: Race condition still present")
        sys.exit(1)
```
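
To run the stress test, save the script in the project root (a name like `test_migration_race.py` is illustrative), point the app at a fresh database, and execute it with `python3`. All ten workers should report success, with exactly one of them applying the migrations.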

## Verification Steps

1. **Local Testing**:

   ```bash
   # Test with multiple workers
   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app

   # Check logs for retry messages
   # Should see "Database locked by another worker, retry..." messages
   ```

2. **Container Testing**:

   ```bash
   # Build container
   podman build -t starpunk:test -f Containerfile .

   # Run with fresh database
   podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test

   # Should start cleanly without restarts
   ```

3. **Log Verification**: Look for these patterns:

   - One worker: "Applied migration: XXX"
   - Other workers: "Database locked by another worker, retry..."
   - Final: "All migrations already applied by another worker"

## Risk Assessment

### Risk Level: LOW

The fix is safe because it:

1. Uses SQLite's native transaction mechanism
2. Preserves all existing migration logic
3. Only adds a locking-and-retry wrapper around existing code
4. Fails safely with clear error messages
5. Cannot lose data (transactions ensure atomicity)

### Rollback Plan

If issues occur:

1. Revert to the previous version
2. Start the container with a single worker temporarily: `--workers 1`
3. Once migrations apply, scale back to 4 workers

## Release Strategy

### Option 1: Hotfix (Recommended)

- Release as v1.0.0-rc.3.1
- Immediate deployment to fix the production issue
- Minimal testing required (focused fix)

### Option 2: Include in rc.4

- Bundle with other rc.4 changes
- More testing time
- Risk: production remains broken until rc.4

**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.

## Alternative Workarounds (If Needed Urgently)

While the proper fix is implemented, these temporary workarounds can be used:

### Workaround 1: Single Worker Startup

```bash
# In the Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]

# After the first successful start, rebuild with 4 workers
```

### Workaround 2: Pre-migration Script

Use an entrypoint script that runs migrations once before gunicorn starts:

```bash
#!/bin/bash
# Entrypoint: apply migrations once, then start the workers
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...
```

### Workaround 3: Preload the Application

```bash
# --preload loads the app (and runs migrations) once in the master
# process before the workers fork, so workers never race
gunicorn --preload --workers 4 ...
```

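One caveat: gunicorn's `--reload` development option is incompatible with `--preload`, so this workaround suits production containers rather than live-reload development setups.
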
## Summary

- **Problem**: Race condition when multiple workers apply migrations
- **Solution**: Database-level locking with retry logic
- **Implementation**: ~150 lines of code changes in `migrations.py`
- **Testing**: Verify with multi-worker startup
- **Risk**: LOW - safe, atomic changes
- **Urgency**: HIGH - blocks production deployment
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately

## Developer Questions Answered

All 23 architectural questions have been comprehensively answered in:
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`

**Key Decisions:**

- NEW connection per retry (not reused)
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
- Separate transactions for each operation
- Both multiprocessing.Pool AND gunicorn testing needed
- 30s timeout per attempt, 120s total maximum
- Graduated logging levels based on retry count (see the sketch below)
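
For illustration, a minimal sketch of what "graduated logging" could look like (the thresholds and the helper name `log_retry` are assumptions, not part of the shipped code):

```python
# Hypothetical sketch: escalate log severity as retries accumulate.
# retry_count/max_retries mirror the variables in run_migrations above.
import logging

logger = logging.getLogger(__name__)

def log_retry(retry_count: int, max_retries: int, delay: float) -> None:
    message = (
        f"Database locked by another worker, "
        f"retry {retry_count}/{max_retries} in {delay:.2f}s"
    )
    if retry_count <= 3:
        logger.debug(message)    # early retries are expected and quiet
    elif retry_count <= 7:
        logger.info(message)     # sustained contention is worth noting
    else:
        logger.warning(message)  # nearing the retry limit
```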

**Implementation Status: READY TO PROCEED**