StarPunk/docs/architecture/migration-race-condition-answers.md

# Migration Race Condition Fix - Architectural Answers

## Status: READY FOR IMPLEMENTATION

All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.

---

## Critical Questions

### 1. Connection Lifecycle Management
**Q: Should we create a new connection for each retry or reuse the same connection?**

**Answer: NEW CONNECTION per retry**
- Each retry MUST create a fresh connection
- Rationale: Failed lock acquisition may leave connection in inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
  ```python
  while retry_count < max_retries:
      conn = None  # Fresh connection each iteration
      try:
          conn = sqlite3.connect(db_path, timeout=30.0)
          # ... attempt migration ...
      finally:
          if conn:
              conn.close()
  ```

### 2. Transaction Boundaries
**Q: Should init_db() wrap everything in one transaction?**

**Answer: NO - Separate transactions for different operations**
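- Schema creation: Own transaction. Note that `executescript()` commits any pending transaction first but does not wrap the script in one, so the script needs its own `BEGIN`/`COMMIT` if it must be atomic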
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent
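
The separation above can be sketched as follows. This is a minimal illustration under stated assumptions, not StarPunk's actual `init_db()`: the name `init_db_sketch`, the `SCHEMA_SQL` script, the `notes` table, and the shape of `migrations` are all hypothetical.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT);
"""


def init_db_sketch(db_path, migrations):
    """Run schema creation, migrations, and seed data as three transactions.

    `migrations` is a list of (name, sql) pairs (assumed shape).
    """
    # isolation_level=None selects autocommit mode, so every BEGIN, COMMIT,
    # and ROLLBACK below is issued explicitly.
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    try:
        # 1. Schema creation: the script carries its own BEGIN/COMMIT.
        conn.executescript(f"BEGIN;\n{SCHEMA_SQL}\nCOMMIT;")

        # 2. Migrations: take the write lock up front.
        conn.execute("BEGIN IMMEDIATE")
        try:
            applied = {
                row[0] for row in conn.execute("SELECT name FROM schema_migrations")
            }
            for name, sql in migrations:
                if name not in applied:
                    conn.execute(sql)
                    conn.execute(
                        "INSERT INTO schema_migrations (name) VALUES (?)", (name,)
                    )
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise

        # 3. Initial data: a separate, short transaction.
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute(
                "INSERT OR IGNORE INTO notes (id, body) VALUES (1, 'welcome')"
            )
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```

Calling the function twice with the same migration list applies each migration only once, because the second call finds the name already recorded in `schema_migrations`.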

### 3. Lock Timeout vs Retry Timeout
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**

**Answer: This is BY DESIGN - No conflict**
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds the lock for 30s+, other workers time out and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
  ```python
  start_time = time.time()
  max_total_time = 120  # 2 minutes absolute maximum
  while retry_count < max_retries and (time.time() - start_time) < max_total_time:
  ```
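
Putting the per-attempt timeout, the retry cap, and the total time limit together, the loop might look like the sketch below. It is an illustration only: `run_with_lock_retry`, `base_delay`, and the injectable `connect` and `sleep` parameters are hypothetical, `MigrationError` is a local stand-in for the project's exception, and the backoff values are not the ones behind the ~102s figure. It uses `time.monotonic()` so the elapsed-time check is unaffected by wall-clock adjustments.

```python
import random
import sqlite3
import time


class MigrationError(Exception):
    """Stand-in for the project's migration exception."""


def run_with_lock_retry(db_path, apply_migrations, max_retries=10,
                        max_total_time=120.0, base_delay=0.1,
                        connect=sqlite3.connect, sleep=time.sleep):
    """Call apply_migrations(conn) under BEGIN IMMEDIATE, retrying on lock errors.

    Returns the number of retries that were needed.
    """
    start_time = time.monotonic()
    retry_count = 0
    while (retry_count < max_retries
           and (time.monotonic() - start_time) < max_total_time):
        conn = None  # Fresh connection each iteration
        try:
            conn = connect(db_path, timeout=30.0, isolation_level=None)
            conn.execute("BEGIN IMMEDIATE")
            try:
                apply_migrations(conn)
                conn.execute("COMMIT")
            except Exception:
                conn.execute("ROLLBACK")
                raise
            return retry_count
        except sqlite3.OperationalError as e:
            if "locked" not in str(e).lower():
                raise  # Not a lock problem, so retrying will not help
            retry_count += 1
            # Exponential backoff plus jitter so workers do not retry in lockstep
            sleep(base_delay * (2 ** retry_count) + random.uniform(0, base_delay))
        finally:
            if conn is not None:
                conn.close()
    elapsed = time.monotonic() - start_time
    raise MigrationError(
        f"Failed to acquire migration lock after {retry_count} attempts "
        f"over {elapsed:.1f}s"
    )
```

Passing `sleep` and `connect` as parameters is a testing convenience: a unit test can substitute a no-op sleep and a short-timeout connection factory instead of waiting on real delays.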

### 4. Testing Strategy
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**

**Answer: BOTH - Different test levels**
- Unit tests: Mocked lock failures (fast, isolated)
- Integration tests: multiprocessing.Pool with 4 workers
- System tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
  ```
  Level 1: Mock concurrent access (unit)
  Level 2: multiprocessing.Pool (integration)
  Level 3: gunicorn locally (system)
  Level 4: Container with gunicorn (e2e)
  ```

### 5. BEGIN IMMEDIATE vs EXCLUSIVE
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**

**Answer: BEGIN IMMEDIATE is CORRECT choice**
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
  - Migrations only need to prevent concurrent migrations (writes)
  - Other workers can still read schema while one migrates
  - Less contention, faster startup
  - Escalates to EXCLUSIVE only while changes are written to the database file (normally at commit)
- Keep BEGIN IMMEDIATE as specified
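
The locking behavior described above can be observed with two connections to the same file. The following is a small demonstration sketch, assuming the default rollback-journal mode. The function name `demo_immediate_lock` and the table `t` are made up for the example, and `timeout=0` is used only so that lock failures surface immediately.

```python
import os
import sqlite3
import tempfile


def demo_immediate_lock():
    """Return (rows seen by a reader, error seen by a second writer)."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # timeout=0: report lock conflicts immediately instead of waiting
    writer = sqlite3.connect(path, timeout=0, isolation_level=None)
    other = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        writer.execute("CREATE TABLE t (x INTEGER)")
        writer.execute("INSERT INTO t VALUES (1)")

        writer.execute("BEGIN IMMEDIATE")  # Takes the RESERVED lock

        # A read from another connection still succeeds...
        rows = other.execute("SELECT x FROM t").fetchall()

        # ...but a second BEGIN IMMEDIATE is refused.
        try:
            other.execute("BEGIN IMMEDIATE")
            error = None
        except sqlite3.OperationalError as e:
            error = str(e)

        writer.execute("ROLLBACK")
        return rows, error
    finally:
        writer.close()
        other.close()
```

The second connection can read the committed row while the first holds its lock, and its own `BEGIN IMMEDIATE` fails with a "database is locked" error, which is the condition the retry loop is meant to handle.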

---

## Edge Cases and Error Handling

### 6. Partial Migration Failure
**Q: What if a migration partially applies or rollback fails?**

**Answer: Transaction atomicity handles this**
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
  ```python
  except Exception as e:
      try:
          conn.rollback()
      except Exception as rollback_error:
          logger.critical(f"FATAL: Rollback failed: {rollback_error}")
          # Database potentially corrupt - fail hard
          raise SystemExit(1)
      # Chain the original exception so the traceback keeps the root cause
      raise MigrationError(f"Migration failed: {e}") from e
  ```

### 7. Migration File Consistency
**Q: What if migration files change during deployment?**

**Answer: Not a concern with proper deployment**
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
  ```python
  # Store in schema_migrations: (name, checksum, applied_at)
  # Verify checksum matches before applying
  ```
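
A checksum check along these lines could look like the sketch below. It assumes a `schema_migrations` table with exactly the three columns named in the comment above. The DDL, the helper names (`checksum_of`, `needs_apply`, `record_applied`), and the choice of SHA-256 are illustrative assumptions rather than the project's existing schema or API.

```python
import hashlib
import sqlite3

# Assumed table layout: (name, checksum, applied_at)
SCHEMA_MIGRATIONS_DDL = """
CREATE TABLE IF NOT EXISTS schema_migrations (
    name TEXT PRIMARY KEY,
    checksum TEXT NOT NULL,
    applied_at TEXT NOT NULL
)
"""


class MigrationError(Exception):
    """Stand-in for the project's migration exception."""


def checksum_of(sql_text):
    """SHA-256 hex digest of a migration's SQL text."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()


def needs_apply(conn, name, sql_text):
    """Return True if the migration has not run yet, False if it has.

    Raises MigrationError if an applied migration's contents have changed.
    """
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return True
    if row[0] != checksum_of(sql_text):
        raise MigrationError(f"Migration {name} was modified after being applied")
    return False


def record_applied(conn, name, sql_text):
    """Store the migration's name and checksum after it has been applied."""
    conn.execute(
        "INSERT INTO schema_migrations (name, checksum, applied_at) "
        "VALUES (?, ?, datetime('now'))",
        (name, checksum_of(sql_text)),
    )
```

Both helpers are meant to be called inside the same `BEGIN IMMEDIATE` transaction as the migration itself, so the check, the migration, and the recorded checksum commit or roll back together.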

### 8. Retry Exhaustion Error Messages
**Q: What error message when retries exhausted?**

**Answer: Be specific and actionable**
```python
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### 9. Logging Levels
**Q: What log level for lock waits?**

**Answer: Graduated approach**
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
  ```python
  if retry_count <= 3:
      level = logging.DEBUG
  elif retry_count <= 7:
      level = logging.INFO
  else:
      level = logging.WARNING
  logger.log(level, f"Retry {retry_count}/{max_retries}")
  ```

### 10. Index Creation Failure
**Q: How to handle index creation failures in migration 002?**

**Answer: Fail fast with clear context**
```python
for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            )
```

---

## Testing Strategy

### 11. Concurrent Testing Simulation
**Q: How to properly simulate concurrent worker startup?**

**Answer: Multiple approaches**
```python
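import multiprocessing

# Approach 1: Barrier synchronization
# The worker is module-level so Pool can pickle it. A Manager barrier is used
# because a plain multiprocessing.Barrier cannot be passed to Pool workers.
def _migrate_after_barrier(barrier, db_path):
    barrier.wait()  # All start together
    return run_migrations(db_path)

def test_concurrent_migrations():
    with multiprocessing.Manager() as manager:
        barrier = manager.Barrier(4)
        with multiprocessing.Pool(4) as pool:
            results = pool.starmap(
                _migrate_after_barrier, [(barrier, db_path)] * 4
            )
    assert len(results) == 4  # Every worker finished without raising

# Approach 2: Process start
processes = []
for i in range(4):
    p = multiprocessing.Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
for p in processes:
    p.join()  # Wait for all workers before inspecting the database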

### 12. Lock Contention Testing
**Q: How to test lock contention scenarios?**

**Answer: Inject delays**
```python
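import sqlite3
import time
from unittest.mock import patch

import pytest

# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('time.sleep')  # Skip real backoff delays
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect, mock_sleep):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Every attempt fails, so the retry loop should give up with MigrationError
    with pytest.raises(MigrationError):
        run_migrations(db_path)
    assert mock_connect.call_count > 1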
```

### 13. Performance Tests
**Q: What timing is acceptable?**

**Answer: Performance targets**
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
  ```python
  import timeit
  setup_time = timeit.timeit(lambda: create_app(), number=1)
  assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
  ```

### 14. Retry Logic Unit Tests
**Q: How to unit test retry logic?**

**Answer: Mock the lock failures**
```python
import sqlite3
from unittest.mock import MagicMock, patch

class TestRetryLogic:
    def test_retry_on_lock(self):
        # Patch time.sleep as well so backoff delays do not slow the test
        with patch('sqlite3.connect') as mock, patch('time.sleep'):
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3
```

---

## SQLite-Specific Concerns

### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
**Q: Deep dive on lock choice?**

**Answer: Lock escalation path**
```
BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE

For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only while changes are written to the database file (normally at commit)
- Optimal for our use case
```
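
One practical consequence of these escalation paths is where a lock conflict shows up. The sketch below is a hypothetical helper (`where_lock_failure_surfaces` is not part of the codebase) and assumes rollback-journal mode. It opens two connections and reports whether the second one is refused at `BEGIN` or only later, at its first write.

```python
import os
import sqlite3
import tempfile


def where_lock_failure_surfaces(begin_sql):
    """Return 'begin' or 'write': the step at which the second connection fails."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # timeout=0: report lock conflicts immediately instead of waiting
    first = sqlite3.connect(path, timeout=0, isolation_level=None)
    second = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE t (x INTEGER)")
        first.execute(begin_sql)
        first.execute("INSERT INTO t VALUES (1)")  # first now holds a write lock

        try:
            second.execute(begin_sql)
        except sqlite3.OperationalError:
            return "begin"
        try:
            second.execute("INSERT INTO t VALUES (2)")
        except sqlite3.OperationalError:
            return "write"
        return "none"
    finally:
        # Closing a connection rolls back any transaction it still has open
        first.close()
        second.close()
```

With `"BEGIN DEFERRED"` the second connection starts its transaction successfully and fails only at its first write. With `"BEGIN IMMEDIATE"` it fails at `BEGIN`, before any migration work has started, which makes the conflict simple to retry.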

### 16. WAL Mode Interaction
**Q: How does this work with WAL mode?**

**Answer: Works correctly with both modes**
- Rollback journal mode (the default): BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
  ```python
  cursor = conn.execute("PRAGMA journal_mode")
  mode = cursor.fetchone()[0]
  logger.debug(f"Database in {mode} mode")
  ```

### 17. Database File Permissions
**Q: How to handle permission issues?**

**Answer: Fail fast with helpful diagnostics**
```python
import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )
```

---

## Deployment/Operations

### 18. Container Startup and Health Checks
**Q: How to handle health checks during migration?**

**Answer: Return 503 during migration**
```python
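# In app.py (assumes a Flask app factory)
from flask import Flask

MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    app = Flask(__name__)

    # Register the route inside the factory, where `app` exists
    @app.route('/health')
    def health():
        if MIGRATION_IN_PROGRESS:
            return {'status': 'migrating'}, 503
        return {'status': 'healthy'}, 200

    # The flag is per worker process; a worker serves /health only after
    # create_app() returns
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False
    return app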

### 19. Monitoring and Alerting
**Q: What metrics/alerts are needed?**

**Answer: Key metrics to track**
```python
import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0,
    'worker_failures': 0  # Referenced by the alert thresholds below
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1             # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))
```

---

## Alternative Approaches

### 20. Version Compatibility
**Q: How to handle version mismatches?**

**Answer: Strict version checking**
```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )
```

### 21. File-Based Locking
**Q: Should we consider flock() as backup?**

**Answer: NO - Adds complexity without benefit**
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach

### 22. Gunicorn Preload
**Q: Would --preload flag help?**

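**Answer: NO - Conflicts with the current architecture**
- --preload runs app initialization ONCE in the master
- Workers fork from the master AFTER migrations complete, which would avoid the startup race
- BUT: It doesn't work with lazy-loaded resources, and SQLite connections opened in the master must not be reused in forked workers
- Current architecture expects per-worker initialization
- Keep current approach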

### 23. Application-Level Locks
**Q: Should we add Redis/memcached for coordination?**

**Answer: NO - Violates simplicity principle**
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem

---

## Final Implementation Checklist

### Required Changes

1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic

### Test Coverage Required

1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention

### Documentation Updates

1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info

---

## Go/No-Go Decision

### ✅ GO FOR IMPLEMENTATION

**Rationale:**
- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with clear rollback plan
- Testing strategy is comprehensive

**Implementation Priority: IMMEDIATE**
- This is blocking v1.0.0-rc.4 release
- Production systems affected
- Fix is well-understood and low-risk

**Next Steps:**
1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned

---

*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*