Migration Race Condition Fix - Quick Implementation Reference

Implementation Checklist

Code Changes - /home/phil/Projects/starpunk/starpunk/migrations.py

# 1. Add imports at top
import time
import random

# 2. Replace entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
        else:
            raise  # Not a lock error; fail immediately instead of swallowing it
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass
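
# Worst-case backoff arithmetic (from the constants above): retries 1-9 sleep
# roughly 0.1 * 2**n seconds each, i.e. 0.2 + 0.4 + ... + 51.2 ≈ 102s plus up to
# ~0.9s of jitter, which is why max_total_time is 120s rather than something tighter.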

# B. Error handling pattern
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message (raised after the retry loop exits without success)
elapsed = time.time() - start_time
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)

Testing Requirements

1. Unit Test File: test_migration_race_condition.py

from multiprocessing import Barrier, Process, Queue

def _worker(barrier, results, worker_id):
    # Module-level so it can be used as a Process target
    barrier.wait()  # Synchronize start so all workers hit migrations at once
    from starpunk import create_app
    create_app()  # Runs migrations during startup
    results.put(worker_id)

def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    barrier = Barrier(4)
    results = Queue()
    workers = [Process(target=_worker, args=(barrier, results, i)) for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join(timeout=30)

    assert results.qsize() == 4, "Some workers failed"

def test_lock_retry(tmp_path):
    """Test retry logic with a mocked connection"""
    import sqlite3
    from unittest.mock import patch, MagicMock
    from starpunk.migrations import run_migrations

    with patch('sqlite3.connect') as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock()  # Success on 3rd try
        ]
        run_migrations(str(tmp_path / "test.db"))
        assert mock.call_count == 3
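
# Fresh database state between tests (see Critical Points below): a minimal
# pytest fixture sketch, assuming run_migrations(db_path) as above. The names
# here are illustrative, not part of the existing code.
import pytest

@pytest.fixture
def fresh_db(tmp_path):
    """Give each test its own empty SQLite database file."""
    db_path = tmp_path / "starpunk-test.db"
    yield str(db_path)
    # tmp_path (and the file) is cleaned up by pytest automatically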

2. Integration Test: test_integration.sh

#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID

echo "SUCCESS: All workers started without race condition"

3. Container Test: test_container.sh

#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .

# Run with fresh database
podman run --rm -d --name race-test \
    -v $(pwd)/test-data:/data \
    starpunk:race-test

# Check logs for success patterns (fail if neither pattern appears)
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)" \
    || { echo "FAILED: no migration success pattern in logs"; podman stop race-test; exit 1; }

# Cleanup
podman stop race-test

Verification Patterns in Logs

Successful Migration (One Worker Wins)

Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
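
To automate this check (for example against the output of podman logs in test_container.sh), a small helper along these lines could be used; it matches the exact phrases shown in the sample above, so adjust it if the real log wording differs.

def check_migration_logs(log_text):
    """Loose sanity check: some worker applied the migrations, the rest skipped them."""
    lines = log_text.splitlines()
    applied = [l for l in lines if "Applied migration:" in l]
    skipped = [l for l in lines if "already applied by another worker" in l]
    assert applied, "No worker applied any migration"
    assert skipped, "No worker reported migrations as already applied"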

Performance Metrics to Check

  • Single worker: < 100ms total (see the timing sketch after this list)
  • 4 workers: < 500ms total
  • 10 workers (stress): < 2000ms total
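
A rough way to measure the single-worker number, sketched under the assumption that run_migrations(db_path) can be called directly as in the unit tests above:

import time
from starpunk.migrations import run_migrations

start = time.perf_counter()
run_migrations("/tmp/perf-test.db")  # fresh path so all migrations actually run
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Migrations took {elapsed_ms:.0f} ms")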

Rollback Plan if Issues

  1. Immediate Workaround

    # Change to single worker temporarily
    gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
    
  2. Revert Code

    git revert HEAD
    
  3. Emergency Patch

    # In app.py temporarily. Note: gunicorn does not set GUNICORN_WORKER_ID
    # itself; it would have to be exported by a wrapper script or a post_fork
    # hook. See the on_starting sketch below for a cleaner single-run option.
    import os
    if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
        init_db()  # Only first worker runs migrations
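
    A cleaner alternative to the emergency patch, sketched as an assumption rather than part of the current fix: run the migrations once in the gunicorn master process via the on_starting server hook, so workers never compete for the lock.

    # gunicorn.conf.py
    def on_starting(server):
        """Called once in the master process, before any worker is forked."""
        from app import init_db  # assumed import path; adjust to wherever init_db lives
        init_db()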
    

Deployment Commands

# 1. Run tests
python -m pytest test_migration_race_condition.py -v

# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .

# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1

# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1

# 5. Deploy
kubectl rollout restart deployment/starpunk

Critical Points to Remember

  1. NEW CONNECTION EACH RETRY - Don't reuse connections
  2. BEGIN IMMEDIATE - Not EXCLUSIVE, not DEFERRED (IMMEDIATE takes the write lock up front, so a losing worker fails fast and can retry; DEFERRED would only fail at the first write, and EXCLUSIVE would also block readers)
  3. 30s per attempt, 120s total max - Two different timeouts
  4. Graduated logging - DEBUG → INFO → WARNING based on retry count
  5. Test at multiple levels - Unit, integration, container
  6. Fresh database state between tests

Support

If issues arise, check:

  1. /home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md - Full Q&A
  2. /home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md - Detailed implementation
  3. SQLite lock states: run PRAGMA lock_status while the issue is occurring (only available in SQLite builds compiled with SQLITE_DEBUG)

Quick Reference v1.0 - 2025-11-24