Migration Race Condition Fix - Quick Implementation Reference

Implementation Checklist

Code Changes - /home/phil/Projects/starpunk/starpunk/migrations.py

# 1. Add imports at top
import time
import random

# 2. Replace entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
        else:
            raise  # Not a lock error; fail immediately instead of swallowing it
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass
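
# Worst-case backoff arithmetic (from the constants above): retries 1-9 sleep
# roughly 0.1 * 2**n seconds each, i.e. 0.2 + 0.4 + ... + 51.2 ≈ 102s plus up to
# ~0.9s of jitter, which is why max_total_time is 120s rather than something tighter.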

# B. Error handling pattern
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message (raised after the retry loop exits without success)
elapsed = time.time() - start_time
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)

Testing Requirements

1. Unit Test File: test_migration_race_condition.py

from multiprocessing import Barrier, Process, Queue

def _worker(barrier, results, worker_id):
    # Module-level so it can be used as a Process target
    barrier.wait()  # Synchronize start so all workers hit migrations at once
    from starpunk import create_app
    create_app()  # Runs migrations during startup
    results.put(worker_id)

def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    barrier = Barrier(4)
    results = Queue()
    workers = [Process(target=_worker, args=(barrier, results, i)) for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join(timeout=30)

    assert results.qsize() == 4, "Some workers failed"

def test_lock_retry(tmp_path):
    """Test retry logic with a mocked connection"""
    import sqlite3
    from unittest.mock import patch, MagicMock
    from starpunk.migrations import run_migrations

    with patch('sqlite3.connect') as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock()  # Success on 3rd try
        ]
        run_migrations(str(tmp_path / "test.db"))
        assert mock.call_count == 3
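
# Fresh database state between tests (see Critical Points below): a minimal
# pytest fixture sketch, assuming run_migrations(db_path) as above. The names
# here are illustrative, not part of the existing code.
import pytest

@pytest.fixture
def fresh_db(tmp_path):
    """Give each test its own empty SQLite database file."""
    db_path = tmp_path / "starpunk-test.db"
    yield str(db_path)
    # tmp_path (and the file) is cleaned up by pytest automatically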

2. Integration Test: test_integration.sh

#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID

echo "SUCCESS: All workers started without race condition"

3. Container Test: test_container.sh

#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .

# Run with fresh database
podman run --rm -d --name race-test \
    -v $(pwd)/test-data:/data \
    starpunk:race-test

# Check logs for success patterns (fail if neither pattern appears)
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)" \
    || { echo "FAILED: no migration success pattern in logs"; podman stop race-test; exit 1; }

# Cleanup
podman stop race-test

Verification Patterns in Logs

Successful Migration (One Worker Wins)

Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
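
To automate this check (for example against the output of podman logs in test_container.sh), a small helper along these lines could be used; it matches the exact phrases shown in the sample above, so adjust it if the real log wording differs.

def check_migration_logs(log_text):
    """Loose sanity check: some worker applied the migrations, the rest skipped them."""
    lines = log_text.splitlines()
    applied = [l for l in lines if "Applied migration:" in l]
    skipped = [l for l in lines if "already applied by another worker" in l]
    assert applied, "No worker applied any migration"
    assert skipped, "No worker reported migrations as already applied"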

Performance Metrics to Check

  • Single worker: < 100ms total (see the timing sketch after this list)
  • 4 workers: < 500ms total
  • 10 workers (stress): < 2000ms total
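
A rough way to measure the single-worker number, sketched under the assumption that run_migrations(db_path) can be called directly as in the unit tests above:

import time
from starpunk.migrations import run_migrations

start = time.perf_counter()
run_migrations("/tmp/perf-test.db")  # fresh path so all migrations actually run
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Migrations took {elapsed_ms:.0f} ms")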

Rollback Plan if Issues

  1. Immediate Workaround

    # Change to single worker temporarily
    gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
    
  2. Revert Code

    git revert HEAD
    
  3. Emergency Patch

    # In app.py temporarily. Note: gunicorn does not set GUNICORN_WORKER_ID
    # itself; it would have to be exported by a wrapper script or a post_fork
    # hook. See the on_starting sketch below for a cleaner single-run option.
    import os
    if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
        init_db()  # Only first worker runs migrations
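
    A cleaner alternative to the emergency patch, sketched as an assumption rather than part of the current fix: run the migrations once in the gunicorn master process via the on_starting server hook, so workers never compete for the lock.

    # gunicorn.conf.py
    def on_starting(server):
        """Called once in the master process, before any worker is forked."""
        from app import init_db  # assumed import path; adjust to wherever init_db lives
        init_db()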
    

Deployment Commands

# 1. Run tests
python -m pytest test_migration_race_condition.py -v

# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .

# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1

# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1

# 5. Deploy
kubectl rollout restart deployment/starpunk

Critical Points to Remember

  1. NEW CONNECTION EACH RETRY - Don't reuse connections
  2. BEGIN IMMEDIATE - Not EXCLUSIVE, not DEFERRED (IMMEDIATE takes the write lock up front, so a losing worker fails fast and can retry; DEFERRED would only fail at the first write, and EXCLUSIVE would also block readers)
  3. 30s per attempt, 120s total max - Two different timeouts
  4. Graduated logging - DEBUG → INFO → WARNING based on retry count
  5. Test at multiple levels - Unit, integration, container
  6. Fresh database state between tests

Support

If issues arise, check:

  1. /home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md - Full Q&A
  2. /home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md - Detailed implementation
  3. SQLite lock states: run PRAGMA lock_status while the issue is occurring (only available in SQLite builds compiled with SQLITE_DEBUG)

Quick Reference v1.0 - 2025-11-24