StarPunk/docs/reports/migration-race-condition-fix-implementation.md
Phil Skentelbery 2240414f22 docs: Add architect documentation for migration race condition fix
Add comprehensive architectural documentation for the migration race
condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:53:55 -07:00


Migration Race Condition Fix - Implementation Guide

Executive Summary

CRITICAL PRODUCTION ISSUE: Multiple gunicorn workers race to apply database migrations, causing container startup failures.

Solution: Implement database-level advisory locking with retry logic in migrations.py.

Urgency: HIGH - This is a blocker for v1.0.0-rc.4 release.

Root Cause Analysis

The Problem Flow

  1. Container starts with gunicorn --workers 4
  2. Each worker independently calls:
    app.py → create_app() → init_db() → run_migrations()
    
  3. All 4 workers simultaneously try to:
    • INSERT into schema_migrations table
    • Apply the same migrations
  4. SQLite's UNIQUE constraint on migration_name causes workers 2-4 to crash
  5. Container restarts, works on second attempt (migrations already applied)

Why This Happens

  • No synchronization: Workers are independent processes
  • No locking: Migration code doesn't prevent concurrent execution
  • Immediate failure: UNIQUE constraint violation crashes the worker
  • Gunicorn behavior: Worker crash triggers container restart

Immediate Fix Implementation

Step 1: Update migrations.py

Add these imports at the top of /home/phil/Projects/starpunk/starpunk/migrations.py:

import time
import random

Step 2: Replace run_migrations function

Replace the entire run_migrations function (lines 304-462) with:

def run_migrations(db_path, logger=None):
    """
    Run all pending database migrations with concurrency protection

    Uses database-level locking to prevent race conditions when multiple
    workers start simultaneously. Only one worker will apply migrations;
    others will wait and verify completion.

    Args:
        db_path: Path to SQLite database file
        logger: Optional logger for output

    Raises:
        MigrationError: If any migration fails to apply or lock cannot be acquired
    """
    if logger is None:
        logger = logging.getLogger(__name__)

    # Determine migrations directory
    migrations_dir = Path(__file__).parent.parent / "migrations"

    if not migrations_dir.exists():
        logger.warning(f"Migrations directory not found: {migrations_dir}")
        return

    # Retry configuration for lock acquisition
    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None
        try:
            # Connect with longer timeout for lock contention
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Attempt to acquire exclusive lock for migrations
            # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Ensure migrations tracking table exists
                create_migrations_table(conn)

                # Quick check: have migrations already been applied by another worker?
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                migration_count = cursor.fetchone()[0]

                # Discover migration files
                migration_files = discover_migration_files(migrations_dir)

                if not migration_files:
                    conn.commit()
                    logger.info("No migration files found")
                    return

                # If migrations exist and we're not the first worker, verify and exit
                if migration_count > 0:
                    # Check if all migrations are applied
                    applied = get_applied_migrations(conn)
                    pending = [m for m, _ in migration_files if m not in applied]

                    if not pending:
                        conn.commit()
                        logger.debug("All migrations already applied by another worker")
                        return
                    # If there are pending migrations, we continue to apply them
                    logger.info(f"Found {len(pending)} pending migrations to apply")

                # Fresh database detection (original logic preserved)
                if migration_count == 0:
                    if is_schema_current(conn):
                        # Schema is current - mark all migrations as applied
                        for migration_name, _ in migration_files:
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                        conn.commit()
                        logger.info(
                            f"Fresh database detected: marked {len(migration_files)} "
                            f"migrations as applied (schema already current)"
                        )
                        return
                    else:
                        logger.info("Fresh database with partial schema: applying needed migrations")

                # Get already-applied migrations
                applied = get_applied_migrations(conn)

                # Apply pending migrations (original logic preserved)
                pending_count = 0
                skipped_count = 0
                for migration_name, migration_path in migration_files:
                    if migration_name not in applied:
                        # Check if migration is actually needed
                        should_check_needed = (
                            migration_count == 0 or
                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
                        )

                        if should_check_needed and not is_migration_needed(conn, migration_name):
                            # Special handling for migration 002: if tables exist but indexes don't
                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
                                # Check if we need to create indexes
                                indexes_to_create = []
                                if not index_exists(conn, 'idx_tokens_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
                                if not index_exists(conn, 'idx_tokens_me'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
                                if not index_exists(conn, 'idx_tokens_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
                                if not index_exists(conn, 'idx_auth_codes_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
                                if not index_exists(conn, 'idx_auth_codes_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")

                                if indexes_to_create:
                                    for index_sql in indexes_to_create:
                                        conn.execute(index_sql)
                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")

                            # Mark as applied without executing full migration
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                            skipped_count += 1
                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
                        else:
                            # Apply the migration (within our transaction)
                            try:
                                # Read migration SQL
                                migration_sql = migration_path.read_text()

                                logger.debug(f"Applying migration: {migration_name}")

                                # Execute migration (already in transaction)
                                conn.executescript(migration_sql)

                                # Record migration as applied
                                conn.execute(
                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                    (migration_name,)
                                )

                                logger.info(f"Applied migration: {migration_name}")
                                pending_count += 1

                            except Exception as e:
                                # Re-raise as MigrationError; the outer
                                # exception handler rolls back the transaction
                                raise MigrationError(f"Migration {migration_name} failed: {e}")

                # Commit all migrations atomically
                conn.commit()

                # Summary
                total_count = len(migration_files)
                if pending_count > 0 or skipped_count > 0:
                    if skipped_count > 0:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
                            f"(already in SCHEMA_SQL), {total_count} total"
                        )
                    else:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, "
                            f"{total_count} total"
                        )
                else:
                    logger.info(f"All migrations up to date ({total_count} total)")

                return  # Success!

            except MigrationError:
                conn.rollback()
                raise

            except Exception as e:
                conn.rollback()
                raise MigrationError(f"Migration system error: {e}")

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e).lower():
                # Another worker has the lock, retry with exponential backoff
                retry_count += 1

                if retry_count < max_retries:
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                    logger.debug(
                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
                        f"in {delay:.2f}s"
                    )
                    time.sleep(delay)
                    continue
                else:
                    raise MigrationError(
                        f"Failed to acquire migration lock after {max_retries} attempts. "
                        f"This may indicate a hung migration process."
                    )
            else:
                # Non-lock related database error
                error_msg = f"Database error during migration: {e}"
                logger.error(error_msg)
                raise MigrationError(error_msg)

        except Exception as e:
            # Unexpected error
            error_msg = f"Unexpected error during migration: {e}"
            logger.error(error_msg)
            raise MigrationError(error_msg)

        finally:
            if conn:
                try:
                    conn.close()
                except Exception:
                    pass  # Ignore close() errors during cleanup

    # Should never reach here, but just in case
    raise MigrationError("Migration retry loop exited unexpectedly")
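The locking behavior this function relies on can be observed in isolation: one connection's BEGIN IMMEDIATE takes the RESERVED lock, and a second connection's BEGIN IMMEDIATE then fails with the "database is locked" error that the retry loop catches. A minimal sketch (the temp-file path is illustrative):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "lock-demo.db")
sqlite3.connect(db_path).close()  # create the database file

# "Worker A" acquires the RESERVED lock and holds it
holder = sqlite3.connect(db_path)
holder.execute("BEGIN IMMEDIATE")

# "Worker B", with a short busy timeout, cannot acquire it
locked = False
waiter = sqlite3.connect(db_path, timeout=0.2)
try:
    waiter.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError as e:
    locked = "locked" in str(e).lower()
finally:
    waiter.close()

holder.rollback()
holder.close()

print(locked)  # → True: this is the condition the retry loop detects
```

With the production timeout of 30 seconds, worker B would instead block and retry internally for up to 30s before raising, which is why only persistent contention reaches the exponential-backoff path.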

Step 3: Testing the Fix

Create a test script to verify the fix works:

#!/usr/bin/env python3
"""Test migration race condition fix"""

import multiprocessing
import time
import sys
from pathlib import Path

# Add project to path
sys.path.insert(0, str(Path(__file__).parent))

def worker_init(worker_id):
    """Simulate a gunicorn worker starting"""
    print(f"Worker {worker_id}: Starting...")

    try:
        from starpunk import create_app
        app = create_app()
        print(f"Worker {worker_id}: Successfully initialized")
        return True
    except Exception as e:
        print(f"Worker {worker_id}: FAILED - {e}")
        return False

if __name__ == "__main__":
    # Test with 10 workers (more than production to stress test)
    num_workers = 10

    print(f"Starting {num_workers} workers simultaneously...")

    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(worker_init, range(num_workers))

    success_count = sum(results)
    print(f"\nResults: {success_count}/{num_workers} workers succeeded")

    if success_count == num_workers:
        print("SUCCESS: All workers initialized without race condition")
        sys.exit(0)
    else:
        print("FAILURE: Race condition still present")
        sys.exit(1)

Verification Steps

  1. Local Testing:

    # Test with multiple workers
    gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
    
    # Check logs for retry messages
    # Should see "Database locked by another worker, retry..." messages
    
  2. Container Testing:

    # Build container
    podman build -t starpunk:test -f Containerfile .
    
    # Run with fresh database
    podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test
    
    # Should start cleanly without restarts
    
  3. Log Verification: Look for these patterns:

    • One worker: "Applied migration: XXX"
    • Other workers: "Database locked by another worker, retry..."
    • Final: "All migrations already applied by another worker"

Risk Assessment

Risk Level: LOW

The fix is safe because:

  1. Uses SQLite's native transaction mechanism
  2. Preserves all existing migration logic
  3. Only adds retry wrapper around existing code
  4. Fails safely with clear error messages
  5. No data loss possible (transactions ensure atomicity)

Rollback Plan

If issues occur:

  1. Revert to previous version
  2. Start container with single worker temporarily: --workers 1
  3. Once migrations apply, scale back to 4 workers

Release Strategy

Option 1: Hotfix Release (Recommended)

  • Release as v1.0.0-rc.3.1
  • Immediate deployment to fix production issue
  • Minimal testing required (focused fix)

Option 2: Include in rc.4

  • Bundle with other rc.4 changes
  • More testing time
  • Risk: Production remains broken until rc.4

Recommendation: Deploy as hotfix v1.0.0-rc.3.1 immediately.

Alternative Workarounds (If Needed Urgently)

Until the proper fix is deployed, these temporary workarounds can be used:

Workaround 1: Single Worker Startup

# In Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]

# After first successful start, rebuild with 4 workers

Workaround 2: Pre-migration Script

#!/bin/bash
# Entrypoint script: run migrations once before starting gunicorn
set -e
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...

Workaround 3: Preload the Application

# --preload loads the app (and runs migrations) once in the master
# process before forking workers
gunicorn --preload --workers 4 ...

Summary

  • Problem: Race condition when multiple workers apply migrations
  • Solution: Database-level locking with retry logic
  • Implementation: ~150 lines of code changes in migrations.py
  • Testing: Verify with multi-worker startup
  • Risk: LOW - Safe, atomic changes
  • Urgency: HIGH - Blocks production deployment
  • Recommendation: Deploy as hotfix v1.0.0-rc.3.1 immediately

Developer Questions Answered

All 23 architectural questions have been comprehensively answered in: /home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md

Key Decisions:

  • NEW connection per retry (not reused)
  • BEGIN IMMEDIATE is correct (not EXCLUSIVE)
  • Separate transactions for each operation
  • Both multiprocessing.Pool AND gunicorn testing needed
  • 30s timeout per attempt, 120s total maximum
  • Graduated logging levels based on retry count
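The "120s total maximum" budget can be sanity-checked against the retry parameters used above. With base_delay = 0.1 and max_retries = 10, the loop sleeps before retries 1 through 9, so the worst-case cumulative backoff (including the maximum 0.1s of jitter per retry) stays under the budget:

```python
base_delay = 0.1
max_retries = 10

# The loop only sleeps when retry_count < max_retries, i.e. retries 1..9
delays = [base_delay * (2 ** n) for n in range(1, max_retries)]

# Add the maximum possible jitter (random.uniform(0, 0.1)) per retry
worst_case = sum(delays) + 0.1 * len(delays)

print(f"Worst-case total backoff: {worst_case:.1f}s")  # → 103.1s, under the 120s budget
```

Note this counts only sleep time between attempts; each attempt can additionally block for up to its 30s connection timeout before raising, so the 120s figure applies to the backoff schedule, not wall-clock worst case.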

Implementation Status: READY TO PROCEED