docs: Add architect documentation for migration race condition fix

Add comprehensive architectural documentation for the migration race condition fix, including: - ADR-022: Architectural decision record for the fix - migration-race-condition-answers.md: All 23 Q&A answered - migration-fix-quick-reference.md: Implementation checklist - migration-race-condition-fix-implementation.md: Detailed guide These documents guided the implementation in v1.0.0-rc.5. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:53:55 -07:00
parent 686d753fb9
commit 2240414f22
4 changed files with 1354 additions and 0 deletions
--- a/docs/architecture/migration-fix-quick-reference.md
+++ b/docs/architecture/migration-fix-quick-reference.md
@@ -0,0 +1,238 @@
+# Migration Race Condition Fix - Quick Implementation Reference
+
+## Implementation Checklist
+
+### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`
+
+```python
+# 1. Add imports at top
+import time
+import random
+
+# 2. Replace entire run_migrations function (lines 304-462)
+# See full implementation in migration-race-condition-fix-implementation.md
+
+# Key patterns to implement:
+
+# A. Retry loop structure
+max_retries = 10
+retry_count = 0
+base_delay = 0.1
+start_time = time.time()
+max_total_time = 120  # 2 minute absolute max
+
+while retry_count < max_retries and (time.time() - start_time) < max_total_time:
+    conn = None  # NEW connection each iteration
+    try:
+        conn = sqlite3.connect(db_path, timeout=30.0)
+        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
+        # ... migration logic ...
+        conn.commit()
+        return  # Success
+    except sqlite3.OperationalError as e:
+        if "database is locked" in str(e).lower():
+            retry_count += 1
+            if retry_count < max_retries:
+                # Exponential backoff with jitter
+                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
+                # Graduated logging
+                if retry_count <= 3:
+                    logger.debug(f"Retry {retry_count}/{max_retries}")
+                elif retry_count <= 7:
+                    logger.info(f"Retry {retry_count}/{max_retries}")
+                else:
+                    logger.warning(f"Retry {retry_count}/{max_retries}")
+                time.sleep(delay)
+                continue
+    finally:
+        if conn:
+            try:
+                conn.close()
+            except:
+                pass
+
+# B. Error handling pattern
+except Exception as e:
+    try:
+        conn.rollback()
+    except Exception as rollback_error:
+        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
+        raise SystemExit(1)
+    raise MigrationError(f"Migration failed: {e}")
+
+# C. Final error message
+raise MigrationError(
+    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
+    f"Possible causes:\n"
+    f"1. Another process is stuck in migration (check logs)\n"
+    f"2. Database file permissions issue\n"
+    f"3. Disk I/O problems\n"
+    f"Action: Restart container with single worker to diagnose"
+)
+```
+
+### Testing Requirements
+
+#### 1. Unit Test File: `test_migration_race_condition.py`
+```python
+import multiprocessing
+from multiprocessing import Barrier, Process
+import time
+
+def test_concurrent_migrations():
+    """Test 4 workers starting simultaneously"""
+    barrier = Barrier(4)
+
+    def worker(worker_id):
+        barrier.wait()  # Synchronize start
+        from starpunk import create_app
+        app = create_app()
+        return True
+
+    with multiprocessing.Pool(4) as pool:
+        results = pool.map(worker, range(4))
+
+    assert all(results), "Some workers failed"
+
+def test_lock_retry():
+    """Test retry logic with mock"""
+    with patch('sqlite3.connect') as mock:
+        mock.side_effect = [
+            sqlite3.OperationalError("database is locked"),
+            sqlite3.OperationalError("database is locked"),
+            MagicMock()  # Success on 3rd try
+        ]
+        run_migrations(db_path)
+        assert mock.call_count == 3
+```
+
+#### 2. Integration Test: `test_integration.sh`
+```bash
+#!/bin/bash
+# Test with actual gunicorn
+
+# Clean start
+rm -f test.db
+
+# Start gunicorn with 4 workers
+timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
+PID=$!
+
+# Wait for startup
+sleep 3
+
+# Check if running
+if ! kill -0 $PID 2>/dev/null; then
+    echo "FAILED: Gunicorn crashed"
+    exit 1
+fi
+
+# Check health endpoint
+curl -f http://127.0.0.1:8001/health || exit 1
+
+# Cleanup
+kill $PID
+
+echo "SUCCESS: All workers started without race condition"
+```
+
+#### 3. Container Test: `test_container.sh`
+```bash
+#!/bin/bash
+# Test in container environment
+
+# Build
+podman build -t starpunk:race-test -f Containerfile .
+
+# Run with fresh database
+podman run --rm -d --name race-test \
+    -v $(pwd)/test-data:/data \
+    starpunk:race-test
+
+# Check logs for success patterns
+sleep 5
+podman logs race-test | grep -E "(Applied migration|already applied by another worker)"
+
+# Cleanup
+podman stop race-test
+```
+
+### Verification Patterns in Logs
+
+#### Successful Migration (One Worker Wins)
+```
+Worker 0: Applying migration: 001_initial_schema.sql
+Worker 1: Database locked by another worker, retry 1/10 in 0.21s
+Worker 2: Database locked by another worker, retry 1/10 in 0.23s
+Worker 3: Database locked by another worker, retry 1/10 in 0.19s
+Worker 0: Applied migration: 001_initial_schema.sql
+Worker 1: All migrations already applied by another worker
+Worker 2: All migrations already applied by another worker
+Worker 3: All migrations already applied by another worker
+```
+
+#### Performance Metrics to Check
+- Single worker: < 100ms total
+- 4 workers: < 500ms total
+- 10 workers (stress): < 2000ms total
+
+### Rollback Plan if Issues
+
+1. **Immediate Workaround**
+   ```bash
+   # Change to single worker temporarily
+   gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
+   ```
+
+2. **Revert Code**
+   ```bash
+   git revert HEAD
+   ```
+
+3. **Emergency Patch**
+   ```python
+   # In app.py temporarily
+   import os
+   if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
+       init_db()  # Only first worker runs migrations
+   ```
+
+### Deployment Commands
+
+```bash
+# 1. Run tests
+python -m pytest test_migration_race_condition.py -v
+
+# 2. Build container
+podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .
+
+# 3. Tag for release
+podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1
+
+# 4. Push
+podman push git.philmade.com/starpunk:v1.0.0-rc.3.1
+
+# 5. Deploy
+kubectl rollout restart deployment/starpunk
+```
+
+---
+
+## Critical Points to Remember
+
+1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
+2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED
+3. **30s per attempt, 120s total max** - Two different timeouts
+4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
+5. **Test at multiple levels** - Unit, integration, container
+6. **Fresh database state** between tests
+
+## Support
+
+If issues arise, check:
+1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
+2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
+3. SQLite lock states: `PRAGMA lock_status` during issue
+
+---
+*Quick Reference v1.0 - 2025-11-24*
--- a/docs/architecture/migration-race-condition-answers.md
+++ b/docs/architecture/migration-race-condition-answers.md
@@ -0,0 +1,477 @@
+# Migration Race Condition Fix - Architectural Answers
+
+## Status: READY FOR IMPLEMENTATION
+
+All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.
+
+---
+
+## Critical Questions
+
+### 1. Connection Lifecycle Management
+**Q: Should we create a new connection for each retry or reuse the same connection?**
+
+**Answer: NEW CONNECTION per retry**
+- Each retry MUST create a fresh connection
+- Rationale: Failed lock acquisition may leave connection in inconsistent state
+- SQLite connections are lightweight; overhead is minimal
+- Pattern:
+  ```python
+  while retry_count < max_retries:
+      conn = None  # Fresh connection each iteration
+      try:
+          conn = sqlite3.connect(db_path, timeout=30.0)
+          # ... attempt migration ...
+      finally:
+          if conn:
+              conn.close()
+  ```
+
+### 2. Transaction Boundaries
+**Q: Should init_db() wrap everything in one transaction?**
+
+**Answer: NO - Separate transactions for different operations**
+- Schema creation: Own transaction (already implicit in executescript)
+- Migrations: Own transaction with BEGIN IMMEDIATE
+- Initial data: Own transaction
+- Rationale: Minimizes lock duration and allows partial success visibility
+- Each operation is atomic but independent
+
+### 3. Lock Timeout vs Retry Timeout
+**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**
+
+**Answer: This is BY DESIGN - No conflict**
+- 30s timeout: Maximum wait for any single lock acquisition attempt
+- 102s total: Maximum cumulative retry duration across multiple attempts
+- If one worker holds lock for 30s+, other workers timeout and retry
+- Pattern ensures no single worker waits indefinitely
+- Recommendation: Add total timeout check:
+  ```python
+  start_time = time.time()
+  max_total_time = 120  # 2 minutes absolute maximum
+  while retry_count < max_retries and (time.time() - start_time) < max_total_time:
+  ```
+
+### 4. Testing Strategy
+**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**
+
+**Answer: BOTH - Different test levels**
+- Unit tests: multiprocessing.Pool (fast, isolated)
+- Integration tests: Actual gunicorn with --workers 4
+- Container tests: Full podman/docker run
+- Test matrix:
+  ```
+  Level 1: Mock concurrent access (unit)
+  Level 2: multiprocessing.Pool (integration)
+  Level 3: gunicorn locally (system)
+  Level 4: Container with gunicorn (e2e)
+  ```
+
+### 5. BEGIN IMMEDIATE vs EXCLUSIVE
+**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**
+
+**Answer: BEGIN IMMEDIATE is CORRECT choice**
+- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
+- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
+- Rationale:
+  - Migrations only need to prevent concurrent migrations (writes)
+  - Other workers can still read schema while one migrates
+  - Less contention, faster startup
+  - Only escalates to EXCLUSIVE when actually writing
+- Keep BEGIN IMMEDIATE as specified
+
+---
+
+## Edge Cases and Error Handling
+
+### 6. Partial Migration Failure
+**Q: What if a migration partially applies or rollback fails?**
+
+**Answer: Transaction atomicity handles this**
+- Within transaction: Automatic rollback on ANY error
+- Rollback failure: Extremely rare (corrupt database)
+- Strategy:
+  ```python
+  except Exception as e:
+      try:
+          conn.rollback()
+      except Exception as rollback_error:
+          logger.critical(f"FATAL: Rollback failed: {rollback_error}")
+          # Database potentially corrupt - fail hard
+          raise SystemExit(1)
+      raise MigrationError(e)
+  ```
+
+### 7. Migration File Consistency
+**Q: What if migration files change during deployment?**
+
+**Answer: Not a concern with proper deployment**
+- Container deployments: Files are immutable in image
+- Traditional deployment: Use atomic directory swap
+- If concerned, add checksum validation:
+  ```python
+  # Store in schema_migrations: (name, checksum, applied_at)
+  # Verify checksum matches before applying
+  ```
+
+### 8. Retry Exhaustion Error Messages
+**Q: What error message when retries exhausted?**
+
+**Answer: Be specific and actionable**
+```python
+raise MigrationError(
+    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
+    f"Possible causes:\n"
+    f"1. Another process is stuck in migration (check logs)\n"
+    f"2. Database file permissions issue\n"
+    f"3. Disk I/O problems\n"
+    f"Action: Restart container with single worker to diagnose"
+)
+```
+
+### 9. Logging Levels
+**Q: What log level for lock waits?**
+
+**Answer: Graduated approach**
+- Retry 1-3: DEBUG (normal operation)
+- Retry 4-7: INFO (getting concerning)
+- Retry 8+: WARNING (abnormal)
+- Exhausted: ERROR (operation failed)
+- Pattern:
+  ```python
+  if retry_count <= 3:
+      level = logging.DEBUG
+  elif retry_count <= 7:
+      level = logging.INFO
+  else:
+      level = logging.WARNING
+  logger.log(level, f"Retry {retry_count}/{max_retries}")
+  ```
+
+### 10. Index Creation Failure
+**Q: How to handle index creation failures in migration 002?**
+
+**Answer: Fail fast with clear context**
+```python
+for index_name, index_sql in indexes_to_create:
+    try:
+        conn.execute(index_sql)
+    except sqlite3.OperationalError as e:
+        if "already exists" in str(e):
+            logger.debug(f"Index {index_name} already exists")
+        else:
+            raise MigrationError(
+                f"Failed to create index {index_name}: {e}\n"
+                f"SQL: {index_sql}"
+            )
+```
+
+---
+
+## Testing Strategy
+
+### 11. Concurrent Testing Simulation
+**Q: How to properly simulate concurrent worker startup?**
+
+**Answer: Multiple approaches**
+```python
+# Approach 1: Barrier synchronization
+def test_concurrent_migrations():
+    barrier = multiprocessing.Barrier(4)
+
+    def worker():
+        barrier.wait()  # All start together
+        return run_migrations(db_path)
+
+    with multiprocessing.Pool(4) as pool:
+        results = pool.map(worker, range(4))
+
+# Approach 2: Process start
+processes = []
+for i in range(4):
+    p = Process(target=run_migrations, args=(db_path,))
+    processes.append(p)
+for p in processes:
+    p.start()  # Near-simultaneous
+```
+
+### 12. Lock Contention Testing
+**Q: How to test lock contention scenarios?**
+
+**Answer: Inject delays**
+```python
+# Test helper to force contention
+def slow_migration_for_testing(conn):
+    conn.execute("BEGIN IMMEDIATE")
+    time.sleep(2)  # Force other workers to wait
+    # Apply migration
+    conn.commit()
+
+# Test timeout handling
+@patch('sqlite3.connect')
+def test_lock_timeout(mock_connect):
+    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
+    # Verify retry logic
+```
+
+### 13. Performance Tests
+**Q: What timing is acceptable?**
+
+**Answer: Performance targets**
+- Single worker: < 100ms for all migrations
+- 4 workers with contention: < 500ms total
+- 10 workers stress test: < 2s total
+- Lock acquisition per retry: < 50ms
+- Test with:
+  ```python
+  import timeit
+  setup_time = timeit.timeit(lambda: create_app(), number=1)
+  assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
+  ```
+
+### 14. Retry Logic Unit Tests
+**Q: How to unit test retry logic?**
+
+**Answer: Mock the lock failures**
+```python
+class TestRetryLogic:
+    def test_retry_on_lock(self):
+        with patch('sqlite3.connect') as mock:
+            # First 2 attempts fail, 3rd succeeds
+            mock.side_effect = [
+                sqlite3.OperationalError("database is locked"),
+                sqlite3.OperationalError("database is locked"),
+                MagicMock()  # Success
+            ]
+            run_migrations(db_path)
+            assert mock.call_count == 3
+```
+
+---
+
+## SQLite-Specific Concerns
+
+### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
+**Q: Deep dive on lock choice?**
+
+**Answer: Lock escalation path**
+```
+BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
+BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
+BEGIN EXCLUSIVE → EXCLUSIVE
+
+For migrations:
+- IMMEDIATE starts at RESERVED (blocks other writers immediately)
+- Escalates to EXCLUSIVE only during actual writes
+- Optimal for our use case
+```
+
+### 16. WAL Mode Interaction
+**Q: How does this work with WAL mode?**
+
+**Answer: Works correctly with both modes**
+- Journal mode: BEGIN IMMEDIATE works as described
+- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
+- No code changes needed
+- Add mode detection for logging:
+  ```python
+  cursor = conn.execute("PRAGMA journal_mode")
+  mode = cursor.fetchone()[0]
+  logger.debug(f"Database in {mode} mode")
+  ```
+
+### 17. Database File Permissions
+**Q: How to handle permission issues?**
+
+**Answer: Fail fast with helpful diagnostics**
+```python
+import os
+import stat
+
+db_path = Path(db_path)
+if not db_path.exists():
+    # Will be created - check parent dir
+    parent = db_path.parent
+    if not os.access(parent, os.W_OK):
+        raise MigrationError(f"Cannot write to directory: {parent}")
+else:
+    # Check existing file
+    if not os.access(db_path, os.W_OK):
+        stats = os.stat(db_path)
+        mode = stat.filemode(stats.st_mode)
+        raise MigrationError(
+            f"Database not writable: {db_path}\n"
+            f"Permissions: {mode}\n"
+            f"Owner: {stats.st_uid}:{stats.st_gid}"
+        )
+```
+
+---
+
+## Deployment/Operations
+
+### 18. Container Startup and Health Checks
+**Q: How to handle health checks during migration?**
+
+**Answer: Return 503 during migration**
+```python
+# In app.py
+MIGRATION_IN_PROGRESS = False
+
+def create_app():
+    global MIGRATION_IN_PROGRESS
+    MIGRATION_IN_PROGRESS = True
+    try:
+        init_db()
+    finally:
+        MIGRATION_IN_PROGRESS = False
+
+@app.route('/health')
+def health():
+    if MIGRATION_IN_PROGRESS:
+        return {'status': 'migrating'}, 503
+    return {'status': 'healthy'}, 200
+```
+
+### 19. Monitoring and Alerting
+**Q: What metrics/alerts are needed?**
+
+**Answer: Key metrics to track**
+```python
+# Add metrics collection
+metrics = {
+    'migration_duration_ms': 0,
+    'migration_retries': 0,
+    'migration_lock_wait_ms': 0,
+    'migrations_applied': 0
+}
+
+# Alert thresholds
+ALERTS = {
+    'migration_duration_ms': 5000,  # Alert if > 5s
+    'migration_retries': 5,         # Alert if > 5 retries
+    'worker_failures': 1             # Alert on any failure
+}
+
+# Log in structured format
+logger.info(json.dumps({
+    'event': 'migration_complete',
+    'metrics': metrics
+}))
+```
+
+---
+
+## Alternative Approaches
+
+### 20. Version Compatibility
+**Q: How to handle version mismatches?**
+
+**Answer: Strict version checking**
+```python
+# In migrations.py
+MIGRATION_VERSION = "1.0.0"
+
+def check_version_compatibility(conn):
+    cursor = conn.execute(
+        "SELECT value FROM app_config WHERE key = 'migration_version'"
+    )
+    row = cursor.fetchone()
+    if row and row[0] != MIGRATION_VERSION:
+        raise MigrationError(
+            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
+            f"Action: Run migration tool separately"
+        )
+```
+
+### 21. File-Based Locking
+**Q: Should we consider flock() as backup?**
+
+**Answer: NO - Adds complexity without benefit**
+- SQLite locking is sufficient and portable
+- flock() not available on all systems
+- Would require additional cleanup logic
+- Database-level locking is the correct approach
+
+### 22. Gunicorn Preload
+**Q: Would --preload flag help?**
+
+**Answer: NO - Makes problem WORSE**
+- --preload runs app initialization ONCE in master
+- Workers fork from master AFTER migrations complete
+- BUT: Doesn't work with lazy-loaded resources
+- Current architecture expects per-worker initialization
+- Keep current approach
+
+### 23. Application-Level Locks
+**Q: Should we add Redis/memcached for coordination?**
+
+**Answer: NO - Violates simplicity principle**
+- Adds external dependency
+- More complex deployment
+- SQLite locking is sufficient
+- Would require Redis/memcached to be running before app starts
+- Solving a solved problem
+
+---
+
+## Final Implementation Checklist
+
+### Required Changes
+
+1. ✅ Add imports: `time`, `random`
+2. ✅ Implement retry loop with exponential backoff
+3. ✅ Use BEGIN IMMEDIATE for lock acquisition
+4. ✅ Add graduated logging levels
+5. ✅ Proper error messages with diagnostics
+6. ✅ Fresh connection per retry
+7. ✅ Total timeout check (2 minutes max)
+8. ✅ Preserve all existing migration logic
+
+### Test Coverage Required
+
+1. ✅ Unit test: Retry on lock
+2. ✅ Unit test: Exhaustion handling
+3. ✅ Integration test: 4 workers with multiprocessing
+4. ✅ System test: gunicorn with 4 workers
+5. ✅ Container test: Full deployment simulation
+6. ✅ Performance test: < 500ms with contention
+
+### Documentation Updates
+
+1. ✅ Update ADR-022 with final decision
+2. ✅ Add operational runbook for migration issues
+3. ✅ Document monitoring metrics
+4. ✅ Update deployment guide with health check info
+
+---
+
+## Go/No-Go Decision
+
+### ✅ GO FOR IMPLEMENTATION
+
+**Rationale:**
+- All 23 questions have concrete answers
+- Design is proven with SQLite's native capabilities
+- No external dependencies needed
+- Risk is low with clear rollback plan
+- Testing strategy is comprehensive
+
+**Implementation Priority: IMMEDIATE**
+- This is blocking v1.0.0-rc.4 release
+- Production systems affected
+- Fix is well-understood and low-risk
+
+**Next Steps:**
+1. Implement changes to migrations.py as specified
+2. Run test suite at all levels
+3. Deploy as hotfix v1.0.0-rc.3.1
+4. Monitor metrics in production
+5. Document lessons learned
+
+---
+
+*Document Version: 1.0*
+*Created: 2025-11-24*
+*Status: Approved for Implementation*
+*Author: StarPunk Architecture Team*
--- a/docs/decisions/ADR-022-migration-race-condition-fix.md
+++ b/docs/decisions/ADR-022-migration-race-condition-fix.md
@@ -0,0 +1,208 @@
+# ADR-022: Database Migration Race Condition Resolution
+
+## Status
+Accepted
+
+## Context
+
+In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.
+
+When the container starts fresh, all 4 workers start simultaneously and attempt to:
+1. Create the `schema_migrations` table
+2. Apply pending migrations
+3. Insert records into `schema_migrations`
+
+This causes a race condition where:
+- Worker 1 successfully applies migration and inserts record
+- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
+- Failed workers crash, causing container restarts
+- After restart, migrations are already applied so it works
+
+## Decision
+
+We will implement **database-level advisory locking** using SQLite's transaction mechanism with IMMEDIATE mode, combined with retry logic. This approach:
+
+1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
+2. Implements exponential backoff retry for workers that can't acquire the lock
+3. Ensures only one worker can run migrations at a time
+4. Other workers wait and verify migrations are complete
+
+This is the simplest, most robust solution that:
+- Requires minimal code changes
+- Uses SQLite's native capabilities
+- Doesn't require external dependencies
+- Works across all deployment scenarios
+
+## Rationale
+
+### Options Considered
+
+1. **File-based locking (fcntl)**
+   - Pro: Simple to implement
+   - Con: Doesn't work across containers/network filesystems
+   - Con: Lock files can be orphaned if process crashes
+
+2. **Run migrations before workers start**
+   - Pro: Cleanest separation of concerns
+   - Con: Requires container entrypoint script changes
+   - Con: Complicates development workflow
+   - Con: Doesn't fix the root cause for non-container deployments
+
+3. **Make migration insertion idempotent (INSERT OR IGNORE)**
+   - Pro: Simple SQL change
+   - Con: Doesn't prevent parallel migration execution
+   - Con: Could corrupt database if migrations partially apply
+   - Con: Masks the real problem
+
+4. **Database advisory locking (CHOSEN)**
+   - Pro: Uses SQLite's native transaction locking
+   - Pro: Guaranteed atomicity
+   - Pro: Works across all deployment scenarios
+   - Pro: Self-cleaning (no orphaned locks)
+   - Con: Requires retry logic
+
+### Why Database Locking?
+
+SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing. This provides:
+
+1. **Atomicity**: Either all migrations apply or none do
+2. **Isolation**: Only one worker can modify schema at a time
+3. **Automatic cleanup**: Locks released on connection close/crash
+4. **No external dependencies**: Uses SQLite's built-in features
+
+## Implementation
+
+The fix will be implemented in `/home/phil/Projects/starpunk/starpunk/migrations.py`:
+
+```python
+def run_migrations(db_path, logger=None):
+    """Run all pending database migrations with concurrency protection"""
+
+    max_retries = 10
+    retry_count = 0
+    base_delay = 0.1  # 100ms
+
+    while retry_count < max_retries:
+        try:
+            conn = sqlite3.connect(db_path, timeout=30.0)
+
+            # Acquire exclusive lock for migrations
+            conn.execute("BEGIN IMMEDIATE")
+
+            try:
+                # Create migrations table if needed
+                create_migrations_table(conn)
+
+                # Check if another worker already ran migrations
+                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
+                if cursor.fetchone()[0] > 0:
+                    # Migrations already run by another worker
+                    conn.commit()
+                    logger.info("Migrations already applied by another worker")
+                    return
+
+                # Run migration logic (existing code)
+                # ... rest of migration code ...
+
+                conn.commit()
+                return  # Success
+
+            except Exception:
+                conn.rollback()
+                raise
+
+        except sqlite3.OperationalError as e:
+            if "database is locked" in str(e):
+                retry_count += 1
+                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
+
+                if retry_count < max_retries:
+                    logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
+                    time.sleep(delay)
+                else:
+                    raise MigrationError(f"Failed to acquire migration lock after {max_retries} attempts")
+            else:
+                raise
+
+        finally:
+            if conn:
+                conn.close()
+```
+
+Additional changes needed:
+
+1. Add imports: `import time`, `import random`
+2. Modify connection timeout from default 5s to 30s
+3. Add early check for already-applied migrations
+4. Wrap entire migration process in IMMEDIATE transaction
+
+## Consequences
+
+### Positive
+- Eliminates race condition completely
+- No container configuration changes needed
+- Works in all deployment scenarios (container, systemd, manual)
+- Minimal code changes (~50 lines)
+- Self-healing (no manual lock cleanup needed)
+- Provides clear logging of what's happening
+
+### Negative
+- Slight startup delay for workers that wait (100ms-2s typical)
+- Adds complexity to migration runner
+- Requires careful testing of retry logic
+
+### Neutral
+- Workers start sequentially for migration phase, then run in parallel
+- First worker to acquire lock runs migrations for all
+- Log output will show retry attempts (useful for debugging)
+
+## Testing Strategy
+
+1. **Unit test with mock**: Test retry logic with simulated lock contention
+2. **Integration test**: Spawn multiple processes, verify only one runs migrations
+3. **Container test**: Build container, verify clean startup with 4 workers
+4. **Stress test**: Start 20 processes simultaneously, verify correctness
+
+## Migration Path
+
+1. Implement fix in `starpunk/migrations.py`
+2. Test locally with multiple workers
+3. Build and test container
+4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
+5. Monitor production logs for retry patterns
+
+## Implementation Notes (Post-Analysis)
+
+Based on comprehensive architectural review, the following clarifications have been established:
+
+### Critical Implementation Details
+
+1. **Connection Management**: Create NEW connection for each retry attempt (no reuse)
+2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
+3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
+4. **Logging Levels**: Graduated (DEBUG for retry 1-3, INFO for 4-7, WARNING for 8+)
+5. **Transaction Boundaries**: Separate transactions for schema/migrations/data
+
+### Test Requirements
+
+- Unit tests with multiprocessing.Pool
+- Integration tests with actual gunicorn
+- Container tests with full deployment
+- Performance target: <500ms with 4 workers
+
+### Documentation
+
+- Full Q&A: `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
+- Implementation Guide: `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md`
+- Quick Reference: `/home/phil/Projects/starpunk/docs/architecture/migration-fix-quick-reference.md`
+
+## References
+
+- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
+- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
+- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
+- Issue: Production migration race condition with gunicorn workers
+
+## Status Update
+
+**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.
--- a/docs/reports/migration-race-condition-fix-implementation.md
+++ b/docs/reports/migration-race-condition-fix-implementation.md
@@ -0,0 +1,431 @@
+# Migration Race Condition Fix - Implementation Guide
+
+## Executive Summary
+
+**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers racing to apply migrations causes container startup failures.
+
+**Solution**: Implement database-level advisory locking with retry logic in `migrations.py`.
+
+**Urgency**: HIGH - This is a blocker for v1.0.0-rc.4 release.
+
+## Root Cause Analysis
+
+### The Problem Flow
+
+1. Container starts with `gunicorn --workers 4`
+2. Each worker independently calls:
+   ```
+   app.py → create_app() → init_db() → run_migrations()
+   ```
+3. All 4 workers simultaneously try to:
+   - INSERT into schema_migrations table
+   - Apply the same migrations
+4. SQLite's UNIQUE constraint on migration_name causes workers 2-4 to crash
+5. Container restarts, works on second attempt (migrations already applied)
+
+### Why This Happens
+
+- **No synchronization**: Workers are independent processes
+- **No locking**: Migration code doesn't prevent concurrent execution
+- **Immediate failure**: UNIQUE constraint violation crashes the worker
+- **Gunicorn behavior**: Worker crash triggers container restart
+
+## Immediate Fix Implementation
+
+### Step 1: Update migrations.py
+
+Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:
+
+```python
+import time
+import random
+```
+
+### Step 2: Replace run_migrations function
+
+Replace the entire `run_migrations` function (lines 304-462) with:
+
+```python
+def run_migrations(db_path, logger=None):
+    """
+    Run all pending database migrations with concurrency protection
+
+    Uses database-level locking to prevent race conditions when multiple
+    workers start simultaneously. Only one worker will apply migrations;
+    others will wait and verify completion.
+
+    Args:
+        db_path: Path to SQLite database file
+        logger: Optional logger for output
+
+    Raises:
+        MigrationError: If any migration fails to apply or lock cannot be acquired
+    """
+    if logger is None:
+        logger = logging.getLogger(__name__)
+
+    # Determine migrations directory
+    migrations_dir = Path(__file__).parent.parent / "migrations"
+
+    if not migrations_dir.exists():
+        logger.warning(f"Migrations directory not found: {migrations_dir}")
+        return
+
+    # Retry configuration for lock acquisition
+    max_retries = 10
+    retry_count = 0
+    base_delay = 0.1  # 100ms
+
+    while retry_count < max_retries:
+        conn = None
+        try:
+            # Connect with longer timeout for lock contention
+            conn = sqlite3.connect(db_path, timeout=30.0)
+
+            # Attempt to acquire exclusive lock for migrations
+            # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
+            conn.execute("BEGIN IMMEDIATE")
+
+            try:
+                # Ensure migrations tracking table exists
+                create_migrations_table(conn)
+
+                # Quick check: have migrations already been applied by another worker?
+                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
+                migration_count = cursor.fetchone()[0]
+
+                # Discover migration files
+                migration_files = discover_migration_files(migrations_dir)
+
+                if not migration_files:
+                    conn.commit()
+                    logger.info("No migration files found")
+                    return
+
+                # If migrations exist and we're not the first worker, verify and exit
+                if migration_count > 0:
+                    # Check if all migrations are applied
+                    applied = get_applied_migrations(conn)
+                    pending = [m for m, _ in migration_files if m not in applied]
+
+                    if not pending:
+                        conn.commit()
+                        logger.debug("All migrations already applied by another worker")
+                        return
+                    # If there are pending migrations, we continue to apply them
+                    logger.info(f"Found {len(pending)} pending migrations to apply")
+
+                # Fresh database detection (original logic preserved)
+                if migration_count == 0:
+                    if is_schema_current(conn):
+                        # Schema is current - mark all migrations as applied
+                        for migration_name, _ in migration_files:
+                            conn.execute(
+                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
+                                (migration_name,)
+                            )
+                        conn.commit()
+                        logger.info(
+                            f"Fresh database detected: marked {len(migration_files)} "
+                            f"migrations as applied (schema already current)"
+                        )
+                        return
+                    else:
+                        logger.info("Fresh database with partial schema: applying needed migrations")
+
+                # Get already-applied migrations
+                applied = get_applied_migrations(conn)
+
+                # Apply pending migrations (original logic preserved)
+                pending_count = 0
+                skipped_count = 0
+                for migration_name, migration_path in migration_files:
+                    if migration_name not in applied:
+                        # Check if migration is actually needed
+                        should_check_needed = (
+                            migration_count == 0 or
+                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
+                        )
+
+                        if should_check_needed and not is_migration_needed(conn, migration_name):
+                            # Special handling for migration 002: if tables exist but indexes don't
+                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
+                                # Check if we need to create indexes
+                                indexes_to_create = []
+                                if not index_exists(conn, 'idx_tokens_hash'):
+                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
+                                if not index_exists(conn, 'idx_tokens_me'):
+                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
+                                if not index_exists(conn, 'idx_tokens_expires'):
+                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
+                                if not index_exists(conn, 'idx_auth_codes_hash'):
+                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
+                                if not index_exists(conn, 'idx_auth_codes_expires'):
+                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")
+
+                                if indexes_to_create:
+                                    for index_sql in indexes_to_create:
+                                        conn.execute(index_sql)
+                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")
+
+                            # Mark as applied without executing full migration
+                            conn.execute(
+                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
+                                (migration_name,)
+                            )
+                            skipped_count += 1
+                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
+                        else:
+                            # Apply the migration (within our transaction)
+                            try:
+                                # Read migration SQL
+                                migration_sql = migration_path.read_text()
+
+                                logger.debug(f"Applying migration: {migration_name}")
+
+                                # Execute migration (already in transaction)
+                                conn.executescript(migration_sql)
+
+                                # Record migration as applied
+                                conn.execute(
+                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
+                                    (migration_name,)
+                                )
+
+                                logger.info(f"Applied migration: {migration_name}")
+                                pending_count += 1
+
+                            except Exception as e:
+                                # Roll back the transaction
+                                raise MigrationError(f"Migration {migration_name} failed: {e}")
+
+                # Commit all migrations atomically
+                conn.commit()
+
+                # Summary
+                total_count = len(migration_files)
+                if pending_count > 0 or skipped_count > 0:
+                    if skipped_count > 0:
+                        logger.info(
+                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
+                            f"(already in SCHEMA_SQL), {total_count} total"
+                        )
+                    else:
+                        logger.info(
+                            f"Migrations complete: {pending_count} applied, "
+                            f"{total_count} total"
+                        )
+                else:
+                    logger.info(f"All migrations up to date ({total_count} total)")
+
+                return  # Success!
+
+            except MigrationError:
+                conn.rollback()
+                raise
+
+            except Exception as e:
+                conn.rollback()
+                raise MigrationError(f"Migration system error: {e}")
+
+        except sqlite3.OperationalError as e:
+            if "database is locked" in str(e).lower():
+                # Another worker has the lock, retry with exponential backoff
+                retry_count += 1
+
+                if retry_count < max_retries:
+                    # Exponential backoff with jitter
+                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
+                    logger.debug(
+                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
+                        f"in {delay:.2f}s"
+                    )
+                    time.sleep(delay)
+                    continue
+                else:
+                    raise MigrationError(
+                        f"Failed to acquire migration lock after {max_retries} attempts. "
+                        f"This may indicate a hung migration process."
+                    )
+            else:
+                # Non-lock related database error
+                error_msg = f"Database error during migration: {e}"
+                logger.error(error_msg)
+                raise MigrationError(error_msg)
+
+        except Exception as e:
+            # Unexpected error
+            error_msg = f"Unexpected error during migration: {e}"
+            logger.error(error_msg)
+            raise MigrationError(error_msg)
+
+        finally:
+            if conn:
+                try:
+                    conn.close()
+                except:
+                    pass  # Ignore errors during cleanup
+
+    # Should never reach here, but just in case
+    raise MigrationError("Migration retry loop exited unexpectedly")
+```
+
+### Step 3: Testing the Fix
+
+Create a test script to verify the fix works:
+
+```python
+#!/usr/bin/env python3
+"""Test migration race condition fix"""
+
+import multiprocessing
+import time
+import sys
+from pathlib import Path
+
+# Add project to path
+sys.path.insert(0, str(Path(__file__).parent))
+
+def worker_init(worker_id):
+    """Simulate a gunicorn worker starting"""
+    print(f"Worker {worker_id}: Starting...")
+
+    try:
+        from starpunk import create_app
+        app = create_app()
+        print(f"Worker {worker_id}: Successfully initialized")
+        return True
+    except Exception as e:
+        print(f"Worker {worker_id}: FAILED - {e}")
+        return False
+
+if __name__ == "__main__":
+    # Test with 10 workers (more than production to stress test)
+    num_workers = 10
+
+    print(f"Starting {num_workers} workers simultaneously...")
+
+    with multiprocessing.Pool(num_workers) as pool:
+        results = pool.map(worker_init, range(num_workers))
+
+    success_count = sum(results)
+    print(f"\nResults: {success_count}/{num_workers} workers succeeded")
+
+    if success_count == num_workers:
+        print("SUCCESS: All workers initialized without race condition")
+        sys.exit(0)
+    else:
+        print("FAILURE: Race condition still present")
+        sys.exit(1)
+```
+
+## Verification Steps
+
+1. **Local Testing**:
+   ```bash
+   # Test with multiple workers
+   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
+
+   # Check logs for retry messages
+   # Should see "Database locked by another worker, retry..." messages
+   ```
+
+2. **Container Testing**:
+   ```bash
+   # Build container
+   podman build -t starpunk:test -f Containerfile .
+
+   # Run with fresh database
+   podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test
+
+   # Should start cleanly without restarts
+   ```
+
+3. **Log Verification**:
+   Look for these patterns:
+   - One worker: "Applied migration: XXX"
+   - Other workers: "Database locked by another worker, retry..."
+   - Final: "All migrations already applied by another worker"
+
+## Risk Assessment
+
+### Risk Level: LOW
+
+The fix is safe because:
+1. Uses SQLite's native transaction mechanism
+2. Preserves all existing migration logic
+3. Only adds retry wrapper around existing code
+4. Fails safely with clear error messages
+5. No data loss possible (transactions ensure atomicity)
+
+### Rollback Plan
+
+If issues occur:
+1. Revert to previous version
+2. Start container with single worker temporarily: `--workers 1`
+3. Once migrations apply, scale back to 4 workers
+
+## Release Strategy
+
+### Option 1: Hotfix (Recommended)
+- Release as v1.0.0-rc.3.1
+- Immediate deployment to fix production issue
+- Minimal testing required (focused fix)
+
+### Option 2: Include in rc.4
+- Bundle with other rc.4 changes
+- More testing time
+- Risk: Production remains broken until rc.4
+
+**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.
+
+## Alternative Workarounds (If Needed Urgently)
+
+While the proper fix is implemented, these temporary workarounds can be used:
+
+### Workaround 1: Single Worker Startup
+```bash
+# In Containerfile, temporarily change:
+CMD ["gunicorn", "--workers", "1", ...]
+
+# After first successful start, rebuild with 4 workers
+```
+
+### Workaround 2: Pre-migration Script
+```bash
+# Add entrypoint script that runs migrations before gunicorn
+#!/bin/bash
+python3 -c "from starpunk.database import init_db; init_db()"
+exec gunicorn --workers 4 ...
+```
+
+### Workaround 3: Delayed Worker Startup
+```bash
+# Stagger worker startup with --preload
+gunicorn --preload --workers 4 ...
+```
+
+## Summary
+
+- **Problem**: Race condition when multiple workers apply migrations
+- **Solution**: Database-level locking with retry logic
+- **Implementation**: ~150 lines of code changes in migrations.py
+- **Testing**: Verify with multi-worker startup
+- **Risk**: LOW - Safe, atomic changes
+- **Urgency**: HIGH - Blocks production deployment
+- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately
+
+## Developer Questions Answered
+
+All 23 architectural questions have been comprehensively answered in:
+`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
+
+**Key Decisions:**
+- NEW connection per retry (not reused)
+- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
+- Separate transactions for each operation
+- Both multiprocessing.Pool AND gunicorn testing needed
+- 30s timeout per attempt, 120s total maximum
+- Graduated logging levels based on retry count
+
+**Implementation Status: READY TO PROCEED**