
Migration Race Condition Fix - Architectural Answers

Status: READY FOR IMPLEMENTATION

All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.


Critical Questions

1. Connection Lifecycle Management

Q: Should we create a new connection for each retry or reuse the same connection?

Answer: NEW CONNECTION per retry

  • Each retry MUST create a fresh connection
  • Rationale: Failed lock acquisition may leave connection in inconsistent state
  • SQLite connections are lightweight; overhead is minimal
  • Pattern:
    while retry_count < max_retries:
        conn = None  # Fresh connection each iteration
        try:
            conn = sqlite3.connect(db_path, timeout=30.0)
            # ... attempt migration ...
            break  # Success - stop retrying
        except sqlite3.OperationalError:
            retry_count += 1  # Lock not acquired - retry with a new connection
        finally:
            if conn:
                conn.close()
    

2. Transaction Boundaries

Q: Should init_db() wrap everything in one transaction?

Answer: NO - Separate transactions for different operations

  • Schema creation: Own transaction (already implicit in executescript)
  • Migrations: Own transaction with BEGIN IMMEDIATE
  • Initial data: Own transaction
  • Rationale: Minimizes lock duration and allows partial success visibility
  • Each operation is atomic but independent
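
A minimal sketch of this separation (SCHEMA_SQL, INITIAL_DATA_SQL, and run_migrations() are illustrative names, not the project's exact API):

import sqlite3

def init_db(db_path):
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        # 1. Schema creation: executescript() commits on its own
        conn.executescript(SCHEMA_SQL)

        # 2. Migrations: separate connection, own BEGIN IMMEDIATE transaction
        run_migrations(db_path)

        # 3. Initial data: own short transaction
        with conn:
            conn.execute(INITIAL_DATA_SQL)
    finally:
        conn.close()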

3. Lock Timeout vs Retry Timeout

Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?

Answer: This is BY DESIGN - No conflict

  • 30s timeout: Maximum wait for any single lock acquisition attempt
  • 102s total: Maximum cumulative retry duration across multiple attempts
  • If one worker holds the lock for 30s+, other workers time out and retry
  • Pattern ensures no single worker waits indefinitely
  • Recommendation: Add total timeout check:
    start_time = time.time()
    max_total_time = 120  # 2 minutes absolute maximum
    while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    

4. Testing Strategy

Q: Should we use multiprocessing.Pool or actual gunicorn for testing?

Answer: BOTH - Different test levels

  • Unit tests: multiprocessing.Pool (fast, isolated)
  • Integration tests: Actual gunicorn with --workers 4
  • Container tests: Full podman/docker run
  • Test matrix:
    Level 1: Mock concurrent access (unit)
    Level 2: multiprocessing.Pool (integration)
    Level 3: gunicorn locally (system)
    Level 4: Container with gunicorn (e2e)
    

5. BEGIN IMMEDIATE vs EXCLUSIVE

Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?

Answer: BEGIN IMMEDIATE is CORRECT choice

  • BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
  • BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
  • Rationale:
    • Migrations only need to prevent concurrent migrations (writes)
    • Other workers can still read schema while one migrates
    • Less contention, faster startup
    • Only escalates to EXCLUSIVE when actually writing
  • Keep BEGIN IMMEDIATE as specified
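
A minimal sketch of applying one migration under BEGIN IMMEDIATE (migration_statements, migration_name, and the schema_migrations columns are assumptions, not the project's exact code):

import sqlite3

conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
try:
    conn.execute("BEGIN IMMEDIATE")       # RESERVED lock: blocks other writers, readers continue
    for statement in migration_statements:
        conn.execute(statement)           # escalates to EXCLUSIVE only while writing pages
    conn.execute(
        "INSERT INTO schema_migrations (name) VALUES (?)", (migration_name,)
    )
    conn.execute("COMMIT")
except Exception:
    if conn.in_transaction:
        conn.execute("ROLLBACK")
    raise
finally:
    conn.close()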

Edge Cases and Error Handling

6. Partial Migration Failure

Q: What if a migration partially applies or rollback fails?

Answer: Transaction atomicity handles this

  • Within transaction: Automatic rollback on ANY error
  • Rollback failure: Extremely rare (corrupt database)
  • Strategy:
    except Exception as e:
        try:
            conn.rollback()
        except Exception as rollback_error:
            logger.critical(f"FATAL: Rollback failed: {rollback_error}")
            # Database potentially corrupt - fail hard
            raise SystemExit(1)
        raise MigrationError(e)
    

7. Migration File Consistency

Q: What if migration files change during deployment?

Answer: Not a concern with proper deployment

  • Container deployments: Files are immutable in image
  • Traditional deployment: Use atomic directory swap
  • If concerned, add checksum validation:
    # Store in schema_migrations: (name, checksum, applied_at)
    # Verify checksum matches before applying
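
A hedged sketch of such a checksum guard, assuming a schema_migrations(name, checksum, applied_at) table as suggested above and pathlib.Path inputs:

import hashlib

def migration_checksum(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_checksum(conn, name, path):
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE name = ?", (name,)
    ).fetchone()
    if row and row[0] != migration_checksum(path):
        raise MigrationError(
            f"Migration file {name} changed after it was applied "
            f"(stored checksum does not match the file on disk)"
        )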
    

8. Retry Exhaustion Error Messages

Q: What error message when retries exhausted?

Answer: Be specific and actionable

raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)

9. Logging Levels

Q: What log level for lock waits?

Answer: Graduated approach

  • Retry 1-3: DEBUG (normal operation)
  • Retry 4-7: INFO (getting concerning)
  • Retry 8+: WARNING (abnormal)
  • Exhausted: ERROR (operation failed)
  • Pattern:
    if retry_count <= 3:
        level = logging.DEBUG
    elif retry_count <= 7:
        level = logging.INFO
    else:
        level = logging.WARNING
    logger.log(level, f"Retry {retry_count}/{max_retries}")
    

10. Index Creation Failure

Q: How to handle index creation failures in migration 002?

Answer: Fail fast with clear context

for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            )

Testing Strategy

11. Concurrent Testing Simulation

Q: How to properly simulate concurrent worker startup?

Answer: Multiple approaches

# Approach 1: Barrier synchronization
def worker(barrier, db_path):
    barrier.wait()  # All workers start migrations together
    return run_migrations(db_path)

def test_concurrent_migrations():
    barrier = multiprocessing.Barrier(4)
    processes = [
        multiprocessing.Process(target=worker, args=(barrier, db_path))
        for _ in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

# Approach 2: Near-simultaneous process start
processes = []
for i in range(4):
    p = Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
for p in processes:
    p.join()

12. Lock Contention Testing

Q: How to test lock contention scenarios?

Answer: Inject delays

# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Verify retry logic

13. Performance Tests

Q: What timing is acceptable?

Answer: Performance targets

  • Single worker: < 100ms for all migrations
  • 4 workers with contention: < 500ms total
  • 10 workers stress test: < 2s total
  • Lock acquisition per retry: < 50ms
  • Test with:
    import timeit
    setup_time = timeit.timeit(lambda: create_app(), number=1)
    assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
    

14. Retry Logic Unit Tests

Q: How to unit test retry logic?

Answer: Mock the lock failures

class TestRetryLogic:
    def test_retry_on_lock(self):
        with patch('sqlite3.connect') as mock:
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3

SQLite-Specific Concerns

15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)

Q: Deep dive on lock choice?

Answer: Lock escalation path

BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE

For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
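
A small illustrative script (assuming a throwaway database in rollback-journal mode) showing the behaviour: a second BEGIN IMMEDIATE fails while the first holds RESERVED, but reads still succeed:

import sqlite3

writer = sqlite3.connect("scratch.db", timeout=0.1, isolation_level=None)
reader = sqlite3.connect("scratch.db", timeout=0.1, isolation_level=None)
writer.execute("CREATE TABLE IF NOT EXISTS t (x)")

writer.execute("BEGIN IMMEDIATE")            # holds RESERVED
try:
    reader.execute("BEGIN IMMEDIATE")        # cannot get RESERVED
except sqlite3.OperationalError as e:
    print(e)                                 # "database is locked"
print(reader.execute("SELECT count(*) FROM t").fetchone())  # reads still succeed
writer.execute("COMMIT")
writer.close()
reader.close()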

16. WAL Mode Interaction

Q: How does this work with WAL mode?

Answer: Works correctly with both modes

  • Journal mode: BEGIN IMMEDIATE works as described
  • WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
  • No code changes needed
  • Add mode detection for logging:
    cursor = conn.execute("PRAGMA journal_mode")
    mode = cursor.fetchone()[0]
    logger.debug(f"Database in {mode} mode")
    

17. Database File Permissions

Q: How to handle permission issues?

Answer: Fail fast with helpful diagnostics

import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )

Deployment/Operations

18. Container Startup and Health Checks

Q: How to handle health checks during migration?

Answer: Return 503 during migration

# In app.py
MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False

@app.route('/health')
def health():
    if MIGRATION_IN_PROGRESS:
        return {'status': 'migrating'}, 503
    return {'status': 'healthy'}, 200

19. Monitoring and Alerting

Q: What metrics/alerts are needed?

Answer: Key metrics to track

import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1             # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))

Alternative Approaches

20. Version Compatibility

Q: How to handle version mismatches?

Answer: Strict version checking

# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )

21. File-Based Locking

Q: Should we consider flock() as backup?

Answer: NO - Adds complexity without benefit

  • SQLite locking is sufficient and portable
  • flock() not available on all systems
  • Would require additional cleanup logic
  • Database-level locking is the correct approach

22. Gunicorn Preload

Q: Would --preload flag help?

Answer: NO - Makes problem WORSE

  • --preload runs app initialization ONCE in master
  • Workers fork from master AFTER migrations complete
  • BUT: Doesn't work with lazy-loaded resources
  • Current architecture expects per-worker initialization
  • Keep current approach

23. Application-Level Locks

Q: Should we add Redis/memcached for coordination?

Answer: NO - Violates simplicity principle

  • Adds external dependency
  • More complex deployment
  • SQLite locking is sufficient
  • Would require Redis/memcached to be running before app starts
  • Solving a solved problem

Final Implementation Checklist

Required Changes

  1. Add imports: time, random
  2. Implement retry loop with exponential backoff (see the sketch after this list)
  3. Use BEGIN IMMEDIATE for lock acquisition
  4. Add graduated logging levels
  5. Proper error messages with diagnostics
  6. Fresh connection per retry
  7. Total timeout check (2 minutes max)
  8. Preserve all existing migration logic
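
A minimal sketch pulling these items together (run_migrations_with_retry, apply_pending_migrations, and the constants are illustrative names, not the final migrations.py API):

import logging
import random
import sqlite3
import time

logger = logging.getLogger(__name__)

MAX_RETRIES = 10
MAX_TOTAL_TIME = 120  # seconds, absolute ceiling

def run_migrations_with_retry(db_path):
    start_time = time.time()
    retry_count = 0
    while retry_count < MAX_RETRIES and (time.time() - start_time) < MAX_TOTAL_TIME:
        conn = None  # Fresh connection each iteration
        try:
            conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
            conn.execute("BEGIN IMMEDIATE")
            apply_pending_migrations(conn)   # hypothetical helper wrapping existing logic
            conn.execute("COMMIT")
            return
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e):
                raise
            retry_count += 1
            level = logging.DEBUG if retry_count <= 3 else (
                logging.INFO if retry_count <= 7 else logging.WARNING)
            logger.log(level, f"Migration lock busy, retry {retry_count}/{MAX_RETRIES}")
            # Exponential backoff with jitter before the next attempt
            time.sleep(min(2 ** retry_count * 0.1, 10) + random.uniform(0, 0.1))
        finally:
            if conn:
                conn.close()
    elapsed = time.time() - start_time
    raise MigrationError(
        f"Failed to acquire migration lock after {retry_count} attempts over {elapsed:.1f}s"
    )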

Test Coverage Required

  1. Unit test: Retry on lock
  2. Unit test: Exhaustion handling
  3. Integration test: 4 workers with multiprocessing
  4. System test: gunicorn with 4 workers (sketched after this list)
  5. Container test: Full deployment simulation
  6. Performance test: < 500ms with contention

Documentation Updates

  1. Update ADR-022 with final decision
  2. Add operational runbook for migration issues
  3. Document monitoring metrics
  4. Update deployment guide with health check info

Go/No-Go Decision

GO FOR IMPLEMENTATION

Rationale:

  • All 23 questions have concrete answers
  • Design is proven with SQLite's native capabilities
  • No external dependencies needed
  • Risk is low with clear rollback plan
  • Testing strategy is comprehensive

Implementation Priority: IMMEDIATE

  • This is blocking v1.0.0-rc.4 release
  • Production systems affected
  • Fix is well-understood and low-risk

Next Steps:

  1. Implement changes to migrations.py as specified
  2. Run test suite at all levels
  3. Deploy as hotfix v1.0.0-rc.3.1
  4. Monitor metrics in production
  5. Document lessons learned

Document Version: 1.0
Created: 2025-11-24
Status: Approved for Implementation
Author: StarPunk Architecture Team