StarPunk/docs/decisions/ADR-037-migration-race-condition-fix.md

# ADR-022: Database Migration Race Condition Resolution

## Status
Accepted

## Context

In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.

When the container starts fresh, all 4 workers start simultaneously and attempt to:
1. Create the `schema_migrations` table
2. Apply pending migrations
3. Insert records into `schema_migrations`

This causes a race condition:
- Worker 1 applies the migration and inserts its record
- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
- The failed workers crash, causing the container to restart
- After the restart, the migrations are already applied, so startup succeeds
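
The failure can be reproduced deterministically in a single process. The following is a minimal sketch, not the StarPunk code: two connections stand in for two workers, and the table definition and the migration name `001_example.sql` are assumptions (only the `migration_name` column is known from the error message above).

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "race_demo.db")

# Two independent connections stand in for two gunicorn workers.
worker_a = sqlite3.connect(db_path)
worker_b = sqlite3.connect(db_path)

# Assumed schema: only the UNIQUE migration_name column is taken from the ADR.
worker_a.execute(
    "CREATE TABLE IF NOT EXISTS schema_migrations "
    "(migration_name TEXT UNIQUE NOT NULL)"
)
worker_a.commit()


def is_applied(conn, name):
    """Return True if the migration is already recorded."""
    row = conn.execute(
        "SELECT 1 FROM schema_migrations WHERE migration_name = ?", (name,)
    ).fetchone()
    return row is not None


name = "001_example.sql"  # hypothetical migration file name

# Both workers check before either one records the migration.
a_pending = not is_applied(worker_a, name)
b_pending = not is_applied(worker_b, name)

# Worker A wins the race and records the migration.
worker_a.execute(
    "INSERT INTO schema_migrations (migration_name) VALUES (?)", (name,)
)
worker_a.commit()

# Worker B acts on its stale check and hits the UNIQUE constraint.
error_message = None
try:
    worker_b.execute(
        "INSERT INTO schema_migrations (migration_name) VALUES (?)", (name,)
    )
    worker_b.commit()
except sqlite3.IntegrityError as exc:
    error_message = str(exc)
    worker_b.rollback()

print(error_message)
worker_a.close()
worker_b.close()
```

The gap between the check and the insert is what the fix has to close: both steps need to happen while a single worker holds the write lock.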

## Decision

We will implement **database-level advisory locking** using SQLite's `IMMEDIATE` transaction mode, combined with retry logic. This approach:

1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
2. Implements exponential backoff retry for workers that can't acquire the lock
3. Ensures only one worker can run migrations at a time
4. Makes the other workers wait, then verify that migrations are complete

This is the simplest, most robust solution that:
- Requires minimal code changes
- Uses SQLite's native capabilities
- Doesn't require external dependencies
- Works across all deployment scenarios

## Rationale

### Options Considered

1. **File-based locking (fcntl)**
   - Pro: Simple to implement
   - Con: Doesn't work across containers/network filesystems
   - Con: Lock files can be orphaned if process crashes

2. **Run migrations before workers start**
   - Pro: Cleanest separation of concerns
   - Con: Requires container entrypoint script changes
   - Con: Complicates development workflow
   - Con: Doesn't fix the root cause for non-container deployments

3. **Make migration insertion idempotent (INSERT OR IGNORE)**
   - Pro: Simple SQL change
   - Con: Doesn't prevent parallel migration execution
   - Con: Could corrupt database if migrations partially apply
   - Con: Masks the real problem

4. **Database advisory locking (CHOSEN)**
   - Pro: Uses SQLite's native transaction locking
   - Pro: Guaranteed atomicity
   - Pro: Works across all deployment scenarios
   - Pro: Self-cleaning (no orphaned locks)
   - Con: Requires retry logic

### Why Database Locking?

SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing. This provides:

1. **Atomicity**: Either all migrations apply or none do
2. **Isolation**: Only one worker can modify schema at a time
3. **Automatic cleanup**: Locks are released when the connection closes or the process crashes
4. **No external dependencies**: Uses SQLite's built-in features
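
The locking behaviour described above can be observed with two connections to the same file. This is an illustrative sketch under stated assumptions: the table `t` is a throwaway, `timeout=0` is chosen so the second connection fails immediately instead of waiting, and the database uses SQLite's default rollback-journal mode.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "lock_demo.db")

# isolation_level=None: the sqlite3 module issues no implicit BEGIN,
# so every transaction boundary below is explicit.
holder = sqlite3.connect(db_path, isolation_level=None)
holder.execute("CREATE TABLE t (x INTEGER)")

# timeout=0 disables busy-waiting so contention surfaces right away.
contender = sqlite3.connect(db_path, timeout=0, isolation_level=None)

# The holder takes the RESERVED (write) lock up front.
holder.execute("BEGIN IMMEDIATE")
holder.execute("INSERT INTO t VALUES (1)")

# A second writer cannot start its own IMMEDIATE transaction.
try:
    contender.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError as exc:
    blocked = "database is locked" in str(exc)

# Readers are not blocked by a RESERVED lock; they see committed data only.
visible_rows = contender.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Committing releases the lock; the contender can now proceed.
holder.execute("COMMIT")
contender.execute("BEGIN IMMEDIATE")
rows_after = contender.execute("SELECT COUNT(*) FROM t").fetchone()[0]
contender.execute("COMMIT")

print(blocked, visible_rows, rows_after)
holder.close()
contender.close()
```

With a non-zero `timeout`, the contender would instead wait inside SQLite's busy handler for up to that many seconds before raising the same error.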

## Implementation

The fix will be implemented in `starpunk/migrations.py`:

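```python
import logging
import random
import sqlite3
import time

# MigrationError and create_migrations_table() are defined elsewhere in this module.


def run_migrations(db_path, logger=None):
    """Run all pending database migrations with concurrency protection"""
    # Fall back to a module logger so the logging calls below never hit None
    logger = logger or logging.getLogger(__name__)

    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None  # Keeps the `finally` block safe if connect() itself fails
        try:
            # New connection for every attempt (no reuse across retries)
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Acquire the write (RESERVED) lock for migrations
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Create migrations table if needed
                create_migrations_table(conn)

                # Check if another worker already ran migrations.
                # Note: this treats any recorded migration as "up to date";
                # compare against the pending set if new migrations can ship later.
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                if cursor.fetchone()[0] > 0:
                    # Migrations already run by another worker
                    conn.commit()
                    logger.info("Migrations already applied by another worker")
                    return

                # Run migration logic (existing code)
                # ... rest of migration code ...

                conn.commit()
                return  # Success

            except Exception:
                # Undo partial work before the error propagates or is retried
                conn.rollback()
                raise

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e):
                retry_count += 1
                # Exponential backoff with up to 100ms of random jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)

                if retry_count < max_retries:
                    logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
                    time.sleep(delay)
                else:
                    raise MigrationError(f"Failed to acquire migration lock after {max_retries} attempts")
            else:
                raise

        finally:
            if conn:
                conn.close()
```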

Additional changes needed:

1. Add imports: `import time`, `import random`
2. Modify connection timeout from default 5s to 30s
3. Add early check for already-applied migrations
4. Wrap entire migration process in IMMEDIATE transaction

## Consequences

### Positive
- Eliminates race condition completely
- No container configuration changes needed
- Works in all deployment scenarios (container, systemd, manual)
- Minimal code changes (~50 lines)
- Self-healing (no manual lock cleanup needed)
- Provides clear logging of what's happening

### Negative
- Slight startup delay for workers that wait (100ms-2s typical)
- Adds complexity to migration runner
- Requires careful testing of retry logic

### Neutral
- Workers start sequentially for migration phase, then run in parallel
- First worker to acquire lock runs migrations for all
- Log output will show retry attempts (useful for debugging)

## Testing Strategy

1. **Unit test with mock**: Test retry logic with simulated lock contention
2. **Integration test**: Spawn multiple processes, verify only one runs migrations
3. **Container test**: Build container, verify clean startup with 4 workers
4. **Stress test**: Start 20 processes simultaneously, verify correctness
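
As a rough illustration of the integration and stress tests, the sketch below starts several workers at once and checks that exactly one of them applies the migration. It is an approximation under stated assumptions: threads with separate connections stand in for worker processes, `run_once()` is a simplified stand-in for `run_migrations()` with no retry loop, and the `notes` table and `001_initial.sql` name are hypothetical. A real integration test would spawn processes.

```python
import os
import sqlite3
import tempfile
import threading


def run_once(db_path, applied_by, worker_id):
    """Simplified stand-in for run_migrations(): one migration, no retries."""
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    try:
        # Only one worker at a time gets past this line.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_migrations "
            "(migration_name TEXT UNIQUE NOT NULL)"
        )
        count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
        if count == 0:
            # Hypothetical migration body
            conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY)")
            conn.execute("INSERT INTO schema_migrations VALUES ('001_initial.sql')")
            applied_by.append(worker_id)
        conn.execute("COMMIT")
    finally:
        # Closing rolls back any transaction left open by an error.
        conn.close()


def run_concurrency_check(n_workers=8):
    """Start n_workers simultaneously; return (applied_by, errors, recorded)."""
    db_path = os.path.join(tempfile.mkdtemp(), "concurrency_demo.db")
    applied_by, errors = [], []
    barrier = threading.Barrier(n_workers)

    def worker(worker_id):
        try:
            barrier.wait()  # release all workers at the same moment
            run_once(db_path, applied_by, worker_id)
        except Exception as exc:  # collected so the caller can assert on them
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    conn = sqlite3.connect(db_path)
    recorded = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
    conn.close()
    return applied_by, errors, recorded


applied_by, errors, recorded = run_concurrency_check()
print(len(applied_by), len(errors), recorded)
```

Which worker ends up in `applied_by` varies from run to run; the invariant under test is that the list has exactly one entry and that no worker raised.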

## Migration Path

1. Implement fix in `starpunk/migrations.py`
2. Test locally with multiple workers
3. Build and test container
4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
5. Monitor production logs for retry patterns

## Implementation Notes (Post-Analysis)

An architectural review established the following clarifications:

### Critical Implementation Details

1. **Connection Management**: Create NEW connection for each retry attempt (no reuse)
2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
4. **Logging Levels**: Graduated (DEBUG for retry 1-3, INFO for 4-7, WARNING for 8+)
5. **Transaction Boundaries**: Separate transactions for schema/migrations/data
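
The graduated logging levels and the total time budget could be expressed as two small helpers. This is a sketch only; the names `retry_log_level`, `budget_exhausted`, and `MAX_TOTAL_SECONDS` are hypothetical, while the thresholds come from the list above.

```python
import logging
import time

MAX_TOTAL_SECONDS = 120.0  # total maximum duration from the list above


def retry_log_level(retry_count):
    """Map a retry number to a log level: DEBUG 1-3, INFO 4-7, WARNING 8+."""
    if retry_count <= 3:
        return logging.DEBUG
    if retry_count <= 7:
        return logging.INFO
    return logging.WARNING


def budget_exhausted(started_at, now=None):
    """Return True once the total retry budget has been used up."""
    now = time.monotonic() if now is None else now
    return (now - started_at) >= MAX_TOTAL_SECONDS


# Example use inside a retry loop:
logger = logging.getLogger("starpunk.migrations")  # logger name is an assumption
started_at = time.monotonic()
for attempt in (1, 4, 8):
    logger.log(retry_log_level(attempt), "Database locked, retry %d", attempt)
```

Using `time.monotonic()` rather than wall-clock time keeps the budget unaffected by system clock adjustments during startup.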

### Test Requirements

- Unit tests with multiprocessing.Pool
- Integration tests with actual gunicorn
- Container tests with full deployment
- Performance target: <500ms with 4 workers

### Documentation

- Full Q&A: `docs/architecture/migration-race-condition-answers.md`
- Implementation Guide: `docs/reports/migration-race-condition-fix-implementation.md`
- Quick Reference: `docs/architecture/migration-fix-quick-reference.md`

## References

- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
- Issue: Production migration race condition with gunicorn workers

## Status Update

**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.