Files
StarPunk/docs/architecture/migration-race-condition-answers.md
Phil Skentelbery 2240414f22 docs: Add architect documentation for migration race condition fix
Add comprehensive architectural documentation for the migration race
condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:53:55 -07:00

477 lines
13 KiB
Markdown

# Migration Race Condition Fix - Architectural Answers
## Status: READY FOR IMPLEMENTATION
All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.
---
## Critical Questions
### 1. Connection Lifecycle Management
**Q: Should we create a new connection for each retry or reuse the same connection?**
**Answer: NEW CONNECTION per retry**
- Each retry MUST create a fresh connection
- Rationale: Failed lock acquisition may leave connection in inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
```python
while retry_count < max_retries:
conn = None # Fresh connection each iteration
try:
conn = sqlite3.connect(db_path, timeout=30.0)
# ... attempt migration ...
finally:
if conn:
conn.close()
```
### 2. Transaction Boundaries
**Q: Should init_db() wrap everything in one transaction?**
**Answer: NO - Separate transactions for different operations**
- Schema creation: Own transaction (already implicit in executescript)
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent
### 3. Lock Timeout vs Retry Timeout
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**
**Answer: This is BY DESIGN - No conflict**
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds lock for 30s+, other workers timeout and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
```python
start_time = time.time()
max_total_time = 120 # 2 minutes absolute maximum
while retry_count < max_retries and (time.time() - start_time) < max_total_time:
```
### 4. Testing Strategy
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**
**Answer: BOTH - Different test levels**
- Unit tests: multiprocessing.Pool (fast, isolated)
- Integration tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
```
Level 1: Mock concurrent access (unit)
Level 2: multiprocessing.Pool (integration)
Level 3: gunicorn locally (system)
Level 4: Container with gunicorn (e2e)
```
### 5. BEGIN IMMEDIATE vs EXCLUSIVE
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**
**Answer: BEGIN IMMEDIATE is CORRECT choice**
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
- Migrations only need to prevent concurrent migrations (writes)
- Other workers can still read schema while one migrates
- Less contention, faster startup
- Only escalates to EXCLUSIVE when actually writing
- Keep BEGIN IMMEDIATE as specified
---
## Edge Cases and Error Handling
### 6. Partial Migration Failure
**Q: What if a migration partially applies or rollback fails?**
**Answer: Transaction atomicity handles this**
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
```python
except Exception as e:
try:
conn.rollback()
except Exception as rollback_error:
logger.critical(f"FATAL: Rollback failed: {rollback_error}")
# Database potentially corrupt - fail hard
raise SystemExit(1)
raise MigrationError(e)
```
### 7. Migration File Consistency
**Q: What if migration files change during deployment?**
**Answer: Not a concern with proper deployment**
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
```python
# Store in schema_migrations: (name, checksum, applied_at)
# Verify checksum matches before applying
```
### 8. Retry Exhaustion Error Messages
**Q: What error message when retries exhausted?**
**Answer: Be specific and actionable**
```python
raise MigrationError(
f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
f"Possible causes:\n"
f"1. Another process is stuck in migration (check logs)\n"
f"2. Database file permissions issue\n"
f"3. Disk I/O problems\n"
f"Action: Restart container with single worker to diagnose"
)
```
### 9. Logging Levels
**Q: What log level for lock waits?**
**Answer: Graduated approach**
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
```python
if retry_count <= 3:
level = logging.DEBUG
elif retry_count <= 7:
level = logging.INFO
else:
level = logging.WARNING
logger.log(level, f"Retry {retry_count}/{max_retries}")
```
### 10. Index Creation Failure
**Q: How to handle index creation failures in migration 002?**
**Answer: Fail fast with clear context**
```python
for index_name, index_sql in indexes_to_create:
try:
conn.execute(index_sql)
except sqlite3.OperationalError as e:
if "already exists" in str(e):
logger.debug(f"Index {index_name} already exists")
else:
raise MigrationError(
f"Failed to create index {index_name}: {e}\n"
f"SQL: {index_sql}"
)
```
---
## Testing Strategy
### 11. Concurrent Testing Simulation
**Q: How to properly simulate concurrent worker startup?**
**Answer: Multiple approaches**
```python
# Approach 1: Barrier synchronization
def test_concurrent_migrations():
barrier = multiprocessing.Barrier(4)
def worker():
barrier.wait() # All start together
return run_migrations(db_path)
with multiprocessing.Pool(4) as pool:
results = pool.map(worker, range(4))
# Approach 2: Process start
processes = []
for i in range(4):
p = Process(target=run_migrations, args=(db_path,))
processes.append(p)
for p in processes:
p.start() # Near-simultaneous
```
### 12. Lock Contention Testing
**Q: How to test lock contention scenarios?**
**Answer: Inject delays**
```python
# Test helper to force contention
def slow_migration_for_testing(conn):
conn.execute("BEGIN IMMEDIATE")
time.sleep(2) # Force other workers to wait
# Apply migration
conn.commit()
# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
mock_connect.side_effect = sqlite3.OperationalError("database is locked")
# Verify retry logic
```
### 13. Performance Tests
**Q: What timing is acceptable?**
**Answer: Performance targets**
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
```python
import timeit
setup_time = timeit.timeit(lambda: create_app(), number=1)
assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
```
### 14. Retry Logic Unit Tests
**Q: How to unit test retry logic?**
**Answer: Mock the lock failures**
```python
class TestRetryLogic:
def test_retry_on_lock(self):
with patch('sqlite3.connect') as mock:
# First 2 attempts fail, 3rd succeeds
mock.side_effect = [
sqlite3.OperationalError("database is locked"),
sqlite3.OperationalError("database is locked"),
MagicMock() # Success
]
run_migrations(db_path)
assert mock.call_count == 3
```
---
## SQLite-Specific Concerns
### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
**Q: Deep dive on lock choice?**
**Answer: Lock escalation path**
```
BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE
For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
```
### 16. WAL Mode Interaction
**Q: How does this work with WAL mode?**
**Answer: Works correctly with both modes**
- Journal mode: BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
```python
cursor = conn.execute("PRAGMA journal_mode")
mode = cursor.fetchone()[0]
logger.debug(f"Database in {mode} mode")
```
### 17. Database File Permissions
**Q: How to handle permission issues?**
**Answer: Fail fast with helpful diagnostics**
```python
import os
import stat
db_path = Path(db_path)
if not db_path.exists():
# Will be created - check parent dir
parent = db_path.parent
if not os.access(parent, os.W_OK):
raise MigrationError(f"Cannot write to directory: {parent}")
else:
# Check existing file
if not os.access(db_path, os.W_OK):
stats = os.stat(db_path)
mode = stat.filemode(stats.st_mode)
raise MigrationError(
f"Database not writable: {db_path}\n"
f"Permissions: {mode}\n"
f"Owner: {stats.st_uid}:{stats.st_gid}"
)
```
---
## Deployment/Operations
### 18. Container Startup and Health Checks
**Q: How to handle health checks during migration?**
**Answer: Return 503 during migration**
```python
# In app.py
MIGRATION_IN_PROGRESS = False
def create_app():
global MIGRATION_IN_PROGRESS
MIGRATION_IN_PROGRESS = True
try:
init_db()
finally:
MIGRATION_IN_PROGRESS = False
@app.route('/health')
def health():
if MIGRATION_IN_PROGRESS:
return {'status': 'migrating'}, 503
return {'status': 'healthy'}, 200
```
### 19. Monitoring and Alerting
**Q: What metrics/alerts are needed?**
**Answer: Key metrics to track**
```python
# Add metrics collection
metrics = {
'migration_duration_ms': 0,
'migration_retries': 0,
'migration_lock_wait_ms': 0,
'migrations_applied': 0
}
# Alert thresholds
ALERTS = {
'migration_duration_ms': 5000, # Alert if > 5s
'migration_retries': 5, # Alert if > 5 retries
'worker_failures': 1 # Alert on any failure
}
# Log in structured format
logger.info(json.dumps({
'event': 'migration_complete',
'metrics': metrics
}))
```
---
## Alternative Approaches
### 20. Version Compatibility
**Q: How to handle version mismatches?**
**Answer: Strict version checking**
```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"
def check_version_compatibility(conn):
cursor = conn.execute(
"SELECT value FROM app_config WHERE key = 'migration_version'"
)
row = cursor.fetchone()
if row and row[0] != MIGRATION_VERSION:
raise MigrationError(
f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
f"Action: Run migration tool separately"
)
```
### 21. File-Based Locking
**Q: Should we consider flock() as backup?**
**Answer: NO - Adds complexity without benefit**
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach
### 22. Gunicorn Preload
**Q: Would --preload flag help?**
**Answer: NO - Makes problem WORSE**
- --preload runs app initialization ONCE in master
- Workers fork from master AFTER migrations complete
- BUT: Doesn't work with lazy-loaded resources
- Current architecture expects per-worker initialization
- Keep current approach
### 23. Application-Level Locks
**Q: Should we add Redis/memcached for coordination?**
**Answer: NO - Violates simplicity principle**
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem
---
## Final Implementation Checklist
### Required Changes
1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic
### Test Coverage Required
1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention
### Documentation Updates
1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info
---
## Go/No-Go Decision
### ✅ GO FOR IMPLEMENTATION
**Rationale:**
- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with clear rollback plan
- Testing strategy is comprehensive
**Implementation Priority: IMMEDIATE**
- This is blocking v1.0.0-rc.4 release
- Production systems affected
- Fix is well-understood and low-risk
**Next Steps:**
1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned
---
*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*