
Migration Race Condition Fix - Architectural Answers

Status: READY FOR IMPLEMENTATION

All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.


Critical Questions

1. Connection Lifecycle Management

Q: Should we create a new connection for each retry or reuse the same connection?

Answer: NEW CONNECTION per retry

  • Each retry MUST create a fresh connection
  • Rationale: Failed lock acquisition may leave connection in inconsistent state
  • SQLite connections are lightweight; overhead is minimal
  • Pattern:
    while retry_count < max_retries:
        conn = None  # Fresh connection each iteration
        try:
            conn = sqlite3.connect(db_path, timeout=30.0)
            # ... attempt migration ...
            break  # Success - stop retrying
        except sqlite3.OperationalError:
            retry_count += 1  # Lock not acquired - retry with a new connection
        finally:
            if conn:
                conn.close()
    

2. Transaction Boundaries

Q: Should init_db() wrap everything in one transaction?

Answer: NO - Separate transactions for different operations

  • Schema creation: Own transaction (already implicit in executescript)
  • Migrations: Own transaction with BEGIN IMMEDIATE
  • Initial data: Own transaction
  • Rationale: Minimizes lock duration and allows partial success visibility
  • Each operation is atomic but independent
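
A minimal sketch of this separation (SCHEMA_SQL, INITIAL_DATA_SQL, and run_migrations() are illustrative names, not the project's exact API):

import sqlite3

def init_db(db_path):
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        # 1. Schema creation: executescript() commits on its own
        conn.executescript(SCHEMA_SQL)

        # 2. Migrations: separate connection, own BEGIN IMMEDIATE transaction
        run_migrations(db_path)

        # 3. Initial data: own short transaction
        with conn:
            conn.execute(INITIAL_DATA_SQL)
    finally:
        conn.close()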

3. Lock Timeout vs Retry Timeout

Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?

Answer: This is BY DESIGN - No conflict

  • 30s timeout: Maximum wait for any single lock acquisition attempt
  • 102s total: Maximum cumulative retry duration across multiple attempts
  • If one worker holds the lock for 30s+, other workers time out and retry
  • Pattern ensures no single worker waits indefinitely
  • Recommendation: Add total timeout check:
    start_time = time.time()
    max_total_time = 120  # 2 minutes absolute maximum
    while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    

4. Testing Strategy

Q: Should we use multiprocessing.Pool or actual gunicorn for testing?

Answer: BOTH - Different test levels

  • Unit tests: multiprocessing.Pool (fast, isolated)
  • Integration tests: Actual gunicorn with --workers 4
  • Container tests: Full podman/docker run
  • Test matrix:
    Level 1: Mock concurrent access (unit)
    Level 2: multiprocessing.Pool (integration)
    Level 3: gunicorn locally (system)
    Level 4: Container with gunicorn (e2e)
    

5. BEGIN IMMEDIATE vs EXCLUSIVE

Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?

Answer: BEGIN IMMEDIATE is CORRECT choice

  • BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
  • BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
  • Rationale:
    • Migrations only need to prevent concurrent migrations (writes)
    • Other workers can still read schema while one migrates
    • Less contention, faster startup
    • Only escalates to EXCLUSIVE when actually writing
  • Keep BEGIN IMMEDIATE as specified
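
A minimal sketch of applying one migration under BEGIN IMMEDIATE (migration_statements, migration_name, and the schema_migrations columns are assumptions, not the project's exact code):

import sqlite3

conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
try:
    conn.execute("BEGIN IMMEDIATE")       # RESERVED lock: blocks other writers, readers continue
    for statement in migration_statements:
        conn.execute(statement)           # escalates to EXCLUSIVE only while writing pages
    conn.execute(
        "INSERT INTO schema_migrations (name) VALUES (?)", (migration_name,)
    )
    conn.execute("COMMIT")
except Exception:
    if conn.in_transaction:
        conn.execute("ROLLBACK")
    raise
finally:
    conn.close()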

Edge Cases and Error Handling

6. Partial Migration Failure

Q: What if a migration partially applies or rollback fails?

Answer: Transaction atomicity handles this

  • Within transaction: Automatic rollback on ANY error
  • Rollback failure: Extremely rare (corrupt database)
  • Strategy:
    except Exception as e:
        try:
            conn.rollback()
        except Exception as rollback_error:
            logger.critical(f"FATAL: Rollback failed: {rollback_error}")
            # Database potentially corrupt - fail hard
            raise SystemExit(1)
        raise MigrationError(e)
    

7. Migration File Consistency

Q: What if migration files change during deployment?

Answer: Not a concern with proper deployment

  • Container deployments: Files are immutable in image
  • Traditional deployment: Use atomic directory swap
  • If concerned, add checksum validation:
    # Store in schema_migrations: (name, checksum, applied_at)
    # Verify checksum matches before applying
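
A hedged sketch of such a checksum guard, assuming a schema_migrations(name, checksum, applied_at) table as suggested above and pathlib.Path inputs:

import hashlib

def migration_checksum(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_checksum(conn, name, path):
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE name = ?", (name,)
    ).fetchone()
    if row and row[0] != migration_checksum(path):
        raise MigrationError(
            f"Migration file {name} changed after it was applied "
            f"(stored checksum does not match the file on disk)"
        )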
    

8. Retry Exhaustion Error Messages

Q: What error message when retries exhausted?

Answer: Be specific and actionable

raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)

9. Logging Levels

Q: What log level for lock waits?

Answer: Graduated approach

  • Retry 1-3: DEBUG (normal operation)
  • Retry 4-7: INFO (getting concerning)
  • Retry 8+: WARNING (abnormal)
  • Exhausted: ERROR (operation failed)
  • Pattern:
    if retry_count <= 3:
        level = logging.DEBUG
    elif retry_count <= 7:
        level = logging.INFO
    else:
        level = logging.WARNING
    logger.log(level, f"Retry {retry_count}/{max_retries}")
    

10. Index Creation Failure

Q: How to handle index creation failures in migration 002?

Answer: Fail fast with clear context

for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            )

Testing Strategy

11. Concurrent Testing Simulation

Q: How to properly simulate concurrent worker startup?

Answer: Multiple approaches

# Approach 1: Barrier synchronization
def worker(barrier, db_path):
    barrier.wait()  # All workers start migrations together
    return run_migrations(db_path)

def test_concurrent_migrations():
    barrier = multiprocessing.Barrier(4)
    processes = [
        multiprocessing.Process(target=worker, args=(barrier, db_path))
        for _ in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

# Approach 2: Near-simultaneous process start
processes = []
for i in range(4):
    p = Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
for p in processes:
    p.join()

12. Lock Contention Testing

Q: How to test lock contention scenarios?

Answer: Inject delays

# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Verify retry logic

13. Performance Tests

Q: What timing is acceptable?

Answer: Performance targets

  • Single worker: < 100ms for all migrations
  • 4 workers with contention: < 500ms total
  • 10 workers stress test: < 2s total
  • Lock acquisition per retry: < 50ms
  • Test with:
    import timeit
    setup_time = timeit.timeit(lambda: create_app(), number=1)
    assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
    

14. Retry Logic Unit Tests

Q: How to unit test retry logic?

Answer: Mock the lock failures

class TestRetryLogic:
    def test_retry_on_lock(self):
        with patch('sqlite3.connect') as mock:
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3

SQLite-Specific Concerns

15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)

Q: Deep dive on lock choice?

Answer: Lock escalation path

BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE

For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
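
A small illustrative script (assuming a throwaway database in rollback-journal mode) showing the behaviour: a second BEGIN IMMEDIATE fails while the first holds RESERVED, but reads still succeed:

import sqlite3

writer = sqlite3.connect("scratch.db", timeout=0.1, isolation_level=None)
reader = sqlite3.connect("scratch.db", timeout=0.1, isolation_level=None)
writer.execute("CREATE TABLE IF NOT EXISTS t (x)")

writer.execute("BEGIN IMMEDIATE")            # holds RESERVED
try:
    reader.execute("BEGIN IMMEDIATE")        # cannot get RESERVED
except sqlite3.OperationalError as e:
    print(e)                                 # "database is locked"
print(reader.execute("SELECT count(*) FROM t").fetchone())  # reads still succeed
writer.execute("COMMIT")
writer.close()
reader.close()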

16. WAL Mode Interaction

Q: How does this work with WAL mode?

Answer: Works correctly with both modes

  • Journal mode: BEGIN IMMEDIATE works as described
  • WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
  • No code changes needed
  • Add mode detection for logging:
    cursor = conn.execute("PRAGMA journal_mode")
    mode = cursor.fetchone()[0]
    logger.debug(f"Database in {mode} mode")
    

17. Database File Permissions

Q: How to handle permission issues?

Answer: Fail fast with helpful diagnostics

import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )

Deployment/Operations

18. Container Startup and Health Checks

Q: How to handle health checks during migration?

Answer: Return 503 during migration

# In app.py
MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False

@app.route('/health')
def health():
    if MIGRATION_IN_PROGRESS:
        return {'status': 'migrating'}, 503
    return {'status': 'healthy'}, 200

19. Monitoring and Alerting

Q: What metrics/alerts are needed?

Answer: Key metrics to track

import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1             # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))

Alternative Approaches

20. Version Compatibility

Q: How to handle version mismatches?

Answer: Strict version checking

# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )

21. File-Based Locking

Q: Should we consider flock() as backup?

Answer: NO - Adds complexity without benefit

  • SQLite locking is sufficient and portable
  • flock() not available on all systems
  • Would require additional cleanup logic
  • Database-level locking is the correct approach

22. Gunicorn Preload

Q: Would --preload flag help?

Answer: NO - Makes problem WORSE

  • --preload runs app initialization ONCE in master
  • Workers fork from master AFTER migrations complete
  • BUT: Doesn't work with lazy-loaded resources
  • Current architecture expects per-worker initialization
  • Keep current approach

23. Application-Level Locks

Q: Should we add Redis/memcached for coordination?

Answer: NO - Violates simplicity principle

  • Adds external dependency
  • More complex deployment
  • SQLite locking is sufficient
  • Would require Redis/memcached to be running before app starts
  • Solving a solved problem

Final Implementation Checklist

Required Changes

  1. Add imports: time, random
  2. Implement retry loop with exponential backoff (see the sketch after this list)
  3. Use BEGIN IMMEDIATE for lock acquisition
  4. Add graduated logging levels
  5. Proper error messages with diagnostics
  6. Fresh connection per retry
  7. Total timeout check (2 minutes max)
  8. Preserve all existing migration logic
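
A minimal sketch pulling these items together (run_migrations_with_retry, apply_pending_migrations, and the constants are illustrative names, not the final migrations.py API):

import logging
import random
import sqlite3
import time

logger = logging.getLogger(__name__)

MAX_RETRIES = 10
MAX_TOTAL_TIME = 120  # seconds, absolute ceiling

def run_migrations_with_retry(db_path):
    start_time = time.time()
    retry_count = 0
    while retry_count < MAX_RETRIES and (time.time() - start_time) < MAX_TOTAL_TIME:
        conn = None  # Fresh connection each iteration
        try:
            conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
            conn.execute("BEGIN IMMEDIATE")
            apply_pending_migrations(conn)   # hypothetical helper wrapping existing logic
            conn.execute("COMMIT")
            return
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e):
                raise
            retry_count += 1
            level = logging.DEBUG if retry_count <= 3 else (
                logging.INFO if retry_count <= 7 else logging.WARNING)
            logger.log(level, f"Migration lock busy, retry {retry_count}/{MAX_RETRIES}")
            # Exponential backoff with jitter before the next attempt
            time.sleep(min(2 ** retry_count * 0.1, 10) + random.uniform(0, 0.1))
        finally:
            if conn:
                conn.close()
    elapsed = time.time() - start_time
    raise MigrationError(
        f"Failed to acquire migration lock after {retry_count} attempts over {elapsed:.1f}s"
    )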

Test Coverage Required

  1. Unit test: Retry on lock
  2. Unit test: Exhaustion handling
  3. Integration test: 4 workers with multiprocessing
  4. System test: gunicorn with 4 workers (sketched after this list)
  5. Container test: Full deployment simulation
  6. Performance test: < 500ms with contention

Documentation Updates

  1. Update ADR-022 with final decision
  2. Add operational runbook for migration issues
  3. Document monitoring metrics
  4. Update deployment guide with health check info

Go/No-Go Decision

GO FOR IMPLEMENTATION

Rationale:

  • All 23 questions have concrete answers
  • Design is proven with SQLite's native capabilities
  • No external dependencies needed
  • Risk is low with clear rollback plan
  • Testing strategy is comprehensive

Implementation Priority: IMMEDIATE

  • This is blocking v1.0.0-rc.4 release
  • Production systems affected
  • Fix is well-understood and low-risk

Next Steps:

  1. Implement changes to migrations.py as specified
  2. Run test suite at all levels
  3. Deploy as hotfix v1.0.0-rc.3.1
  4. Monitor metrics in production
  5. Document lessons learned

Document Version: 1.0
Created: 2025-11-24
Status: Approved for Implementation
Author: StarPunk Architecture Team