# Migration Race Condition Fix - Quick Implementation Reference

## Implementation Checklist

### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`

```python
# 1. Add imports at top
import time
import random

# 2. Replace the entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
        else:
            raise  # Not a lock error; surface it immediately
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass

# B. Error handling pattern (wraps the "... migration logic ..." above)
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message (raised after the loop exits without acquiring the lock)
elapsed = time.time() - start_time
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### Testing Requirements

#### 1. Unit Test File: `test_migration_race_condition.py`

```python
import multiprocessing
import sqlite3
from unittest.mock import MagicMock, patch

from starpunk.migrations import run_migrations


def _worker(barrier, worker_id):
    """Create the app (which runs migrations) once all workers are ready."""
    barrier.wait()  # Synchronize start so all workers race for the lock
    from starpunk import create_app
    create_app()


def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    # Process (not Pool) so the Barrier can be passed at creation time
    barrier = multiprocessing.Barrier(4)
    procs = [
        multiprocessing.Process(target=_worker, args=(barrier, i))
        for i in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join(timeout=30)
    assert all(p.exitcode == 0 for p in procs), "Some workers failed"


def test_lock_retry(tmp_path):
    """Test retry logic with a mocked connection"""
    db_path = tmp_path / "test.db"
    with patch("sqlite3.connect") as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock(),  # Success on 3rd try
        ]
        run_migrations(db_path)
    assert mock.call_count == 3
```

#### 2. Integration Test: `test_integration.sh`

```bash
#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID
echo "SUCCESS: All workers started without race condition"
```

#### 3. Container Test: `test_container.sh`

```bash
#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .
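
# Added step: enforce "fresh database state between tests" (see Critical
# Points below). Assumption: the container keeps its SQLite DB under /data,
# matching the -v mount used in the run command that follows.
rm -rf test-data && mkdir -p test-data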
# Run with fresh database
podman run --rm -d --name race-test \
  -v $(pwd)/test-data:/data \
  starpunk:race-test

# Check logs for success patterns
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)"

# Cleanup
podman stop race-test
```

### Verification Patterns in Logs

#### Successful Migration (One Worker Wins)

```
Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
```

#### Performance Metrics to Check

- Single worker: < 100ms total
- 4 workers: < 500ms total
- 10 workers (stress): < 2000ms total

### Rollback Plan (If Issues Arise)

1. **Immediate Workaround**
   ```bash
   # Change to single worker temporarily
   gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
   ```

2. **Revert Code**
   ```bash
   git revert HEAD
   ```

3. **Emergency Patch**
   ```python
   # In app.py temporarily. Note: gunicorn does not set GUNICORN_WORKER_ID
   # itself; this only works if your launcher exports it per worker.
   import os
   if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
       init_db()  # Only the first worker runs migrations
   ```

### Deployment Commands

```bash
# 1. Run tests
python -m pytest test_migration_race_condition.py -v

# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .

# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1

# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1

# 5. Deploy
kubectl rollout restart deployment/starpunk
```

---

## Critical Points to Remember

1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED (see the one-winner sketch in the appendix below)
3. **30s per attempt, 120s total max** - Two different timeouts
4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
5. **Test at multiple levels** - Unit, integration, container
6. **Fresh database state** between tests

## Support

If issues arise, check:

1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
3. SQLite lock states: `PRAGMA lock_status` while the issue is occurring (available only in debug builds of SQLite)

---

*Quick Reference v1.0 - 2025-11-24*
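
---

## Appendix: One-Winner Locking Sketch

A minimal standalone sketch of why `BEGIN IMMEDIATE` (critical point 2) yields exactly one winner. This is illustrative only, not part of the fix: `demo.db` is a throwaway path, and `isolation_level=None` is used so Python's implicit-transaction handling does not interfere with the explicit `BEGIN`.

```python
import sqlite3

# The winner takes the write lock up front
winner = sqlite3.connect("demo.db", isolation_level=None)
winner.execute("BEGIN IMMEDIATE")

# The loser fails fast (timeout=0 disables SQLite's busy-wait)
loser = sqlite3.connect("demo.db", timeout=0, isolation_level=None)
try:
    loser.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError as e:
    print(e)  # "database is locked" -- exactly what the retry loop catches

winner.rollback()
winner.close()
loser.close()
```

With the default `BEGIN DEFERRED`, both workers could open transactions and only collide at their first write; `BEGIN EXCLUSIVE` would also serialize writers but additionally blocks readers, which is why the fix specifies `IMMEDIATE`.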