Migration Race Condition Fix - Architectural Answers
Status: READY FOR IMPLEMENTATION
All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.
Critical Questions
1. Connection Lifecycle Management
Q: Should we create a new connection for each retry or reuse the same connection?
Answer: NEW CONNECTION per retry
- Each retry MUST create a fresh connection
- Rationale: Failed lock acquisition may leave connection in inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
```python
while retry_count < max_retries:
    conn = None  # Fresh connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        # ... attempt migration ...
    finally:
        if conn:
            conn.close()
```
2. Transaction Boundaries
Q: Should init_db() wrap everything in one transaction?
Answer: NO - Separate transactions for different operations
- Schema creation: Own transaction (already implicit in executescript)
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent
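A minimal sketch of that separation, assuming placeholder names (SCHEMA_SQL, apply_migrations, seed_initial_data) rather than the actual identifiers in migrations.py:
```python
import sqlite3

def init_db(db_path):
    # isolation_level=None -> autocommit mode, so transactions are managed explicitly
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    try:
        # 1. Schema creation: its own unit of work
        conn.executescript(SCHEMA_SQL)

        # 2. Migrations: own transaction, serialized by BEGIN IMMEDIATE
        conn.execute("BEGIN IMMEDIATE")
        apply_migrations(conn)
        conn.execute("COMMIT")

        # 3. Initial data: own transaction, independent of the migrations
        conn.execute("BEGIN IMMEDIATE")
        seed_initial_data(conn)
        conn.execute("COMMIT")
    finally:
        conn.close()
```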
3. Lock Timeout vs Retry Timeout
Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?
Answer: This is BY DESIGN - No conflict
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds the lock for 30s+, other workers time out and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
```python
start_time = time.time()
max_total_time = 120  # 2 minutes absolute maximum

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    ...
```
4. Testing Strategy
Q: Should we use multiprocessing.Pool or actual gunicorn for testing?
Answer: BOTH - Different test levels
- Unit tests: multiprocessing.Pool (fast, isolated)
- Integration tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
```
Level 1: Mock concurrent access (unit)
Level 2: multiprocessing.Pool (integration)
Level 3: gunicorn locally (system)
Level 4: Container with gunicorn (e2e)
```
5. BEGIN IMMEDIATE vs EXCLUSIVE
Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?
Answer: BEGIN IMMEDIATE is CORRECT choice
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
- Migrations only need to prevent concurrent migrations (writes)
- Other workers can still read schema while one migrates
- Less contention, faster startup
- Only escalates to EXCLUSIVE when actually writing
- Keep BEGIN IMMEDIATE as specified
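A minimal sketch of a single lock attempt under this choice (apply_pending_migrations is a placeholder; the retry loop from Q1/Q3 wraps this call):
```python
import sqlite3

def migrate_once(db_path):
    """One attempt to run migrations under a RESERVED write lock."""
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    try:
        # RESERVED lock is taken up front: other writers block, readers continue.
        # A concurrent holder surfaces as sqlite3.OperationalError("database is locked"),
        # which the caller's retry loop handles.
        conn.execute("BEGIN IMMEDIATE")
        apply_pending_migrations(conn)  # placeholder for the existing migration runner
        conn.execute("COMMIT")          # escalates to EXCLUSIVE only while writing pages
    finally:
        conn.close()
```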
Edge Cases and Error Handling
6. Partial Migration Failure
Q: What if a migration partially applies or rollback fails?
Answer: Transaction atomicity handles this
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
```python
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        # Database potentially corrupt - fail hard
        raise SystemExit(1)
    raise MigrationError(e)
```
7. Migration File Consistency
Q: What if migration files change during deployment?
Answer: Not a concern with proper deployment
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
```python
# Store in schema_migrations: (name, checksum, applied_at)
# Verify checksum matches before applying
```
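A small sketch of that validation, assuming a pathlib.Path per migration file and a checksum column added to schema_migrations (neither exists today):
```python
import hashlib

def file_checksum(path):
    # SHA-256 of the migration file contents
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_migration_unchanged(conn, name, path):
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE name = ?", (name,)
    ).fetchone()
    if row and row[0] != file_checksum(path):
        raise MigrationError(
            f"Migration {name} changed after it was applied (checksum mismatch)"
        )
```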
8. Retry Exhaustion Error Messages
Q: What error message when retries exhausted?
Answer: Be specific and actionable
raise MigrationError(
f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
f"Possible causes:\n"
f"1. Another process is stuck in migration (check logs)\n"
f"2. Database file permissions issue\n"
f"3. Disk I/O problems\n"
f"Action: Restart container with single worker to diagnose"
)
9. Logging Levels
Q: What log level for lock waits?
Answer: Graduated approach
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
```python
if retry_count <= 3:
    level = logging.DEBUG
elif retry_count <= 7:
    level = logging.INFO
else:
    level = logging.WARNING

logger.log(level, f"Retry {retry_count}/{max_retries}")
```
10. Index Creation Failure
Q: How to handle index creation failures in migration 002?
Answer: Fail fast with clear context
for index_name, index_sql in indexes_to_create:
try:
conn.execute(index_sql)
except sqlite3.OperationalError as e:
if "already exists" in str(e):
logger.debug(f"Index {index_name} already exists")
else:
raise MigrationError(
f"Failed to create index {index_name}: {e}\n"
f"SQL: {index_sql}"
)
Testing Strategy
11. Concurrent Testing Simulation
Q: How to properly simulate concurrent worker startup?
Answer: Multiple approaches
```python
import multiprocessing

# Approach 1: Barrier synchronization with a worker pool
_barrier = None

def _init_pool(barrier):
    global _barrier  # each pool worker gets a handle to the shared barrier
    _barrier = barrier

def _migrate_worker(_):
    _barrier.wait()  # all four workers start migrations together
    return run_migrations(db_path)  # db_path defined at module level

def test_concurrent_migrations():
    barrier = multiprocessing.Barrier(4)
    with multiprocessing.Pool(4, initializer=_init_pool, initargs=(barrier,)) as pool:
        results = pool.map(_migrate_worker, range(4))

# Approach 2: Near-simultaneous process start
processes = []
for i in range(4):
    p = multiprocessing.Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
for p in processes:
    p.join()
```
12. Lock Contention Testing
Q: How to test lock contention scenarios?
Answer: Inject delays
# Test helper to force contention
def slow_migration_for_testing(conn):
conn.execute("BEGIN IMMEDIATE")
time.sleep(2) # Force other workers to wait
# Apply migration
conn.commit()
# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
mock_connect.side_effect = sqlite3.OperationalError("database is locked")
# Verify retry logic
13. Performance Tests
Q: What timing is acceptable?
Answer: Performance targets
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
```python
import timeit

setup_time = timeit.timeit(lambda: create_app(), number=1)
assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
```
14. Retry Logic Unit Tests
Q: How to unit test retry logic?
Answer: Mock the lock failures
class TestRetryLogic:
def test_retry_on_lock(self):
with patch('sqlite3.connect') as mock:
# First 2 attempts fail, 3rd succeeds
mock.side_effect = [
sqlite3.OperationalError("database is locked"),
sqlite3.OperationalError("database is locked"),
MagicMock() # Success
]
run_migrations(db_path)
assert mock.call_count == 3
SQLite-Specific Concerns
15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
Q: Deep dive on lock choice?
Answer: Lock escalation path
BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE
For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
16. WAL Mode Interaction
Q: How does this work with WAL mode?
Answer: Works correctly with both modes
- Journal mode: BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
```python
cursor = conn.execute("PRAGMA journal_mode")
mode = cursor.fetchone()[0]
logger.debug(f"Database in {mode} mode")
```
17. Database File Permissions
Q: How to handle permission issues?
Answer: Fail fast with helpful diagnostics
import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
# Will be created - check parent dir
parent = db_path.parent
if not os.access(parent, os.W_OK):
raise MigrationError(f"Cannot write to directory: {parent}")
else:
# Check existing file
if not os.access(db_path, os.W_OK):
stats = os.stat(db_path)
mode = stat.filemode(stats.st_mode)
raise MigrationError(
f"Database not writable: {db_path}\n"
f"Permissions: {mode}\n"
f"Owner: {stats.st_uid}:{stats.st_gid}"
)
Deployment/Operations
18. Container Startup and Health Checks
Q: How to handle health checks during migration?
Answer: Return 503 during migration
# In app.py
MIGRATION_IN_PROGRESS = False
def create_app():
global MIGRATION_IN_PROGRESS
MIGRATION_IN_PROGRESS = True
try:
init_db()
finally:
MIGRATION_IN_PROGRESS = False
@app.route('/health')
def health():
if MIGRATION_IN_PROGRESS:
return {'status': 'migrating'}, 503
return {'status': 'healthy'}, 200
19. Monitoring and Alerting
Q: What metrics/alerts are needed?
Answer: Key metrics to track
import json  # for the structured log output below

# Add metrics collection
metrics = {
'migration_duration_ms': 0,
'migration_retries': 0,
'migration_lock_wait_ms': 0,
'migrations_applied': 0
}
# Alert thresholds
ALERTS = {
'migration_duration_ms': 5000, # Alert if > 5s
'migration_retries': 5, # Alert if > 5 retries
'worker_failures': 1 # Alert on any failure
}
# Log in structured format
logger.info(json.dumps({
'event': 'migration_complete',
'metrics': metrics
}))
Alternative Approaches
20. Version Compatibility
Q: How to handle version mismatches?
Answer: Strict version checking
# In migrations.py
MIGRATION_VERSION = "1.0.0"
def check_version_compatibility(conn):
cursor = conn.execute(
"SELECT value FROM app_config WHERE key = 'migration_version'"
)
row = cursor.fetchone()
if row and row[0] != MIGRATION_VERSION:
raise MigrationError(
f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
f"Action: Run migration tool separately"
)
21. File-Based Locking
Q: Should we consider flock() as backup?
Answer: NO - Adds complexity without benefit
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach
22. Gunicorn Preload
Q: Would --preload flag help?
Answer: NO - Makes problem WORSE
- --preload runs app initialization ONCE in master
- Workers fork from master AFTER migrations complete
- BUT: Doesn't work with lazy-loaded resources
- Current architecture expects per-worker initialization
- Keep current approach
23. Application-Level Locks
Q: Should we add Redis/memcached for coordination?
Answer: NO - Violates simplicity principle
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem
Final Implementation Checklist
Required Changes
- ✅ Add imports: time, random
- ✅ Implement retry loop with exponential backoff (see the sketch after this checklist)
- ✅ Use BEGIN IMMEDIATE for lock acquisition
- ✅ Add graduated logging levels
- ✅ Proper error messages with diagnostics
- ✅ Fresh connection per retry
- ✅ Total timeout check (2 minutes max)
- ✅ Preserve all existing migration logic
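A sketch of the retry loop combining the answers above (Q1 fresh connections, Q3 total timeout, Q5 BEGIN IMMEDIATE, Q9 graduated logging); max_retries, the backoff constants, and apply_pending_migrations are illustrative, not the final values:
```python
import logging
import random
import sqlite3
import time

logger = logging.getLogger(__name__)

def run_migrations_with_retry(db_path, max_retries=10, max_total_time=120):
    start_time = time.time()
    retry_count = 0
    while retry_count < max_retries and (time.time() - start_time) < max_total_time:
        conn = None  # Fresh connection each attempt (Q1)
        try:
            conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
            conn.execute("BEGIN IMMEDIATE")  # serialize migrators (Q5)
            apply_pending_migrations(conn)   # placeholder for the existing migration logic
            conn.execute("COMMIT")
            return
        except sqlite3.OperationalError as e:
            if "locked" not in str(e):
                raise
            retry_count += 1
            # Graduated logging (Q9)
            if retry_count <= 3:
                level = logging.DEBUG
            elif retry_count <= 7:
                level = logging.INFO
            else:
                level = logging.WARNING
            logger.log(level, f"Migration lock busy, retry {retry_count}/{max_retries}")
            # Exponential backoff with jitter, capped at 10s
            time.sleep(min(10.0, 0.1 * (2 ** retry_count)) + random.uniform(0, 0.1))
        finally:
            if conn:
                conn.close()
    elapsed = time.time() - start_time
    raise MigrationError(
        f"Failed to acquire migration lock after {retry_count} attempts over {elapsed:.1f}s"
    )
```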
Test Coverage Required
- ✅ Unit test: Retry on lock
- ✅ Unit test: Exhaustion handling
- ✅ Integration test: 4 workers with multiprocessing
- ✅ System test: gunicorn with 4 workers
- ✅ Container test: Full deployment simulation
- ✅ Performance test: < 500ms with contention
Documentation Updates
- ✅ Update ADR-022 with final decision
- ✅ Add operational runbook for migration issues
- ✅ Document monitoring metrics
- ✅ Update deployment guide with health check info
Go/No-Go Decision
✅ GO FOR IMPLEMENTATION
Rationale:
- All 23 questions have concrete answers
- Design relies on SQLite's proven native locking capabilities
- No external dependencies needed
- Risk is low with clear rollback plan
- Testing strategy is comprehensive
Implementation Priority: IMMEDIATE
- This is blocking v1.0.0-rc.4 release
- Production systems affected
- Fix is well-understood and low-risk
Next Steps:
- Implement changes to migrations.py as specified
- Run test suite at all levels
- Deploy as hotfix v1.0.0-rc.3.1
- Monitor metrics in production
- Document lessons learned
Document Version: 1.0
Created: 2025-11-24
Status: Approved for Implementation
Author: StarPunk Architecture Team