Add comprehensive architectural documentation for the migration race condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Migration Race Condition Fix - Architectural Answers

## Status: READY FOR IMPLEMENTATION

All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.

---

## Critical Questions

### 1. Connection Lifecycle Management
**Q: Should we create a new connection for each retry or reuse the same connection?**

**Answer: NEW CONNECTION per retry**
- Each retry MUST create a fresh connection
- Rationale: A failed lock acquisition may leave the connection in an inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
  ```python
  while retry_count < max_retries:
      conn = None  # Fresh connection each iteration
      try:
          conn = sqlite3.connect(db_path, timeout=30.0)
          # ... attempt migration; return on success ...
      finally:
          if conn is not None:
              conn.close()
      retry_count += 1  # Count the failed attempt before looping again
  ```

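The pattern above can be fleshed out into a minimal sketch. The helper name `run_with_retry`, its `apply` callback, and the backoff values are assumptions made for illustration; they do not come from the actual `migrations.py`.

```python
import random
import sqlite3
import time


def run_with_retry(db_path, apply, max_retries=10, base_delay=0.1, timeout=30.0):
    """Run apply(conn) under BEGIN IMMEDIATE, opening a new connection per attempt.

    Hypothetical helper: name, signature, and delay values are assumptions.
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_retries + 1):
        conn = None  # Fresh connection each iteration
        try:
            # isolation_level=None turns off implicit transactions, so the
            # explicit BEGIN IMMEDIATE below is the only transaction in play.
            conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
            conn.execute("BEGIN IMMEDIATE")
            try:
                apply(conn)
                conn.execute("COMMIT")
            except Exception:
                conn.execute("ROLLBACK")
                raise
            return attempt
        except sqlite3.OperationalError as exc:
            # Only lock contention is retried; anything else propagates.
            if "locked" not in str(exc).lower() or attempt == max_retries:
                raise
            # Exponential backoff plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
        finally:
            if conn is not None:
                conn.close()
        # The failed connection is already closed before we wait.
        time.sleep(delay)
```

Closing the connection in `finally` before sleeping means a waiting worker never holds a half-initialized connection across retries.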
### 2. Transaction Boundaries
**Q: Should init_db() wrap everything in one transaction?**

**Answer: NO - Separate transactions for different operations**
- Schema creation: Own transaction (already implicit in executescript)
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent

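As a rough illustration of the three separate transactions listed above, the following sketch uses a made-up two-table schema. The table names, the `001_example` migration name, and the seed row are placeholders, not the project's real schema.

```python
import sqlite3

# Placeholder schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT);
"""


def init_db(db_path):
    # isolation_level=None: every transaction below is opened explicitly.
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    try:
        # 1. Schema creation: executescript runs outside our explicit transactions.
        conn.executescript(SCHEMA)

        # 2. Migrations: own transaction, write lock taken up front.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT OR IGNORE INTO schema_migrations VALUES ('001_example')")
        conn.execute("COMMIT")

        # 3. Initial data: separate transaction, committed independently.
        conn.execute("BEGIN")
        conn.execute("INSERT OR IGNORE INTO settings VALUES ('site_name', 'example')")
        conn.execute("COMMIT")
    finally:
        conn.close()
```

If step 3 fails, the work committed in steps 1 and 2 stays visible, which is the "partial success visibility" the rationale refers to.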
### 3. Lock Timeout vs Retry Timeout
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**

**Answer: This is BY DESIGN - No conflict**
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds the lock for 30s+, other workers time out and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
  ```python
  import time

  start_time = time.time()
  max_total_time = 120  # 2 minutes absolute maximum
  while retry_count < max_retries and (time.time() - start_time) < max_total_time:
      ...  # attempt migration; back off and retry on lock failure
  ```

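The ~102s figure is not derived in the text. One schedule that reproduces it, assumed here purely for illustration, is ten retries with a 0.1s base delay that doubles each time. The hypothetical `backoff_schedule` helper below sums that schedule.

```python
def backoff_schedule(max_retries=10, base_delay=0.1):
    """Sleep before each retry, doubling every time (jitter omitted).

    The retry count and base delay are assumptions, not values from the source.
    """
    return [base_delay * (2 ** n) for n in range(max_retries)]


schedule = backoff_schedule()
total_sleep = sum(schedule)

print(f"longest single sleep: {schedule[-1]:.1f}s")        # longest single sleep: 51.2s
print(f"total sleep across retries: {total_sleep:.1f}s")   # total sleep across retries: 102.3s
```

Each attempt can additionally block for up to the 30s connection timeout, so wall-clock time may exceed the summed sleeps. That is the case the absolute 120s check guards against.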
### 4. Testing Strategy
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**

**Answer: BOTH - Different test levels**
- Unit tests: multiprocessing.Pool (fast, isolated)
- Integration tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
  ```
  Level 1: Mock concurrent access (unit)
  Level 2: multiprocessing.Pool (integration)
  Level 3: gunicorn locally (system)
  Level 4: Container with gunicorn (e2e)
  ```

### 5. BEGIN IMMEDIATE vs EXCLUSIVE
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**

**Answer: BEGIN IMMEDIATE is the CORRECT choice**
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
  - Migrations only need to prevent concurrent migrations (writes)
  - Other workers can still read schema while one migrates
  - Less contention, faster startup
  - Escalates to EXCLUSIVE only when changes are actually written to the database file
- Keep BEGIN IMMEDIATE as specified

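The behavior described above can be observed directly with two connections to the same file. This is a small demonstration in the default rollback-journal mode; the file name and table are throwaway placeholders.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "locks.db")

# isolation_level=None: we issue BEGIN ourselves.
# timeout=0: fail at once instead of waiting for the lock.
writer = sqlite3.connect(db_path, timeout=0, isolation_level=None)
other = sqlite3.connect(db_path, timeout=0, isolation_level=None)

writer.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY)")
writer.execute("BEGIN IMMEDIATE")  # writer now holds the RESERVED lock

# A second would-be writer is rejected immediately.
try:
    other.execute("BEGIN IMMEDIATE")
    second_writer_blocked = False
except sqlite3.OperationalError:  # "database is locked"
    second_writer_blocked = True

# A plain read on the other connection still succeeds.
row_count = other.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

writer.execute("COMMIT")
writer.close()
other.close()

print(second_writer_blocked, row_count)  # True 0
```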
---

## Edge Cases and Error Handling

### 6. Partial Migration Failure
**Q: What if a migration partially applies or rollback fails?**

**Answer: Transaction atomicity handles this**
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
  ```python
  except Exception as e:
      try:
          conn.rollback()
      except Exception as rollback_error:
          logger.critical(f"FATAL: Rollback failed: {rollback_error}")
          # Database potentially corrupt - fail hard
          raise SystemExit(1)
      # Rollback succeeded: surface the original error, keeping its traceback
      raise MigrationError(f"Migration failed: {e}") from e
  ```

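To see the atomicity claim in action, the sketch below simulates a migration that fails halfway and checks that nothing it created survives the rollback. The `failing_migration` function and table name are invented for the demonstration.

```python
import sqlite3

# Explicit transaction control, in-memory database for the demo.
conn = sqlite3.connect(":memory:", isolation_level=None)


def failing_migration(conn):
    """Hypothetical migration that dies after doing part of its work."""
    conn.execute("CREATE TABLE half_done (id INTEGER)")
    conn.execute("INSERT INTO half_done VALUES (1)")
    raise RuntimeError("simulated failure mid-migration")


conn.execute("BEGIN IMMEDIATE")
try:
    failing_migration(conn)
    conn.execute("COMMIT")
except RuntimeError:
    conn.execute("ROLLBACK")  # undoes both the CREATE TABLE and the INSERT

tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # []
```

One caveat worth checking against the real code: Python's `executescript()` commits any pending transaction before it runs, so statements applied through it would not be covered by a surrounding `BEGIN IMMEDIATE` in this way.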
### 7. Migration File Consistency
**Q: What if migration files change during deployment?**

**Answer: Not a concern with proper deployment**
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
  ```python
  # Store in schema_migrations: (name, checksum, applied_at)
  # Verify checksum matches before applying
  ```

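The two comments above could be turned into something like the following sketch. The `schema_migrations (name, checksum, applied_at)` layout comes from the comment. The helper names, the choice of SHA-256, and the error type are assumptions.

```python
import hashlib
import sqlite3


def file_checksum(sql_text):
    """SHA-256 of the migration text (hash choice is an assumption)."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()


def verify_or_record(conn, name, sql_text):
    """Return True if the migration is new; raise if an applied one changed.

    Hypothetical helper, not part of the existing migrations module.
    """
    checksum = file_checksum(sql_text)
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO schema_migrations (name, checksum, applied_at) "
            "VALUES (?, ?, datetime('now'))",
            (name, checksum),
        )
        return True
    if row[0] != checksum:
        raise RuntimeError(f"Migration {name} changed after it was applied")
    return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schema_migrations "
    "(name TEXT PRIMARY KEY, checksum TEXT, applied_at TEXT)"
)
```

A caller would run the migration SQL only when `verify_or_record` returns `True`, inside the same transaction as the insert.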
### 8. Retry Exhaustion Error Messages
**Q: What error message when retries exhausted?**

**Answer: Be specific and actionable**
```python
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    "Possible causes:\n"
    "1. Another process is stuck in migration (check logs)\n"
    "2. Database file permissions issue\n"
    "3. Disk I/O problems\n"
    "Action: Restart container with single worker to diagnose"
)
```

### 9. Logging Levels
**Q: What log level for lock waits?**

**Answer: Graduated approach**
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
  ```python
  if retry_count <= 3:
      level = logging.DEBUG
  elif retry_count <= 7:
      level = logging.INFO
  else:
      level = logging.WARNING
  logger.log(level, f"Retry {retry_count}/{max_retries}")
  ```

### 10. Index Creation Failure
**Q: How to handle index creation failures in migration 002?**

**Answer: Fail fast with clear context**
```python
for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            )
```

---

## Testing Strategy

### 11. Concurrent Testing Simulation
**Q: How to properly simulate concurrent worker startup?**

**Answer: Multiple approaches**
```python
import multiprocessing

# db_path and run_migrations are provided by the test module.

# Approach 1: Barrier synchronization
# The worker must be a module-level function so Pool can pickle it;
# the barrier reaches each pool process through the initializer.
_barrier = None

def _init_worker(barrier):
    global _barrier
    _barrier = barrier

def _worker(path):
    _barrier.wait()  # All start together
    return run_migrations(path)

def test_concurrent_migrations():
    barrier = multiprocessing.Barrier(4)
    with multiprocessing.Pool(4, initializer=_init_worker, initargs=(barrier,)) as pool:
        results = pool.map(_worker, [db_path] * 4)

# Approach 2: Process start
processes = []
for i in range(4):
    p = multiprocessing.Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
for p in processes:
    p.join()  # Wait for every worker to finish
```

### 12. Lock Contention Testing
**Q: How to test lock contention scenarios?**

**Answer: Inject delays**
```python
# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Verify retry logic
```

### 13. Performance Tests
**Q: What timing is acceptable?**

**Answer: Performance targets**
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
  ```python
  import timeit
  setup_time = timeit.timeit(lambda: create_app(), number=1)
  assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
  ```

### 14. Retry Logic Unit Tests
**Q: How to unit test retry logic?**

**Answer: Mock the lock failures**
```python
import sqlite3
from unittest.mock import MagicMock, patch

class TestRetryLogic:
    def test_retry_on_lock(self):
        with patch('sqlite3.connect') as mock:
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3
```

---

## SQLite-Specific Concerns

### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
**Q: Deep dive on lock choice?**

**Answer: Lock escalation path**
```
BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE

For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
```

### 16. WAL Mode Interaction
**Q: How does this work with WAL mode?**

**Answer: Works correctly with both modes**
- Rollback journal mode: BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
  ```python
  cursor = conn.execute("PRAGMA journal_mode")
  mode = cursor.fetchone()[0]
  logger.debug(f"Database in {mode} mode")
  ```

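The claim that WAL mode still admits only one writer can be checked with a short experiment. This is a sketch against a throwaway temp file; whether WAL can actually be enabled depends on the filesystem, so the reported mode is printed rather than assumed.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "wal.db")

# timeout=0 makes a blocked BEGIN fail immediately instead of waiting.
first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
second = sqlite3.connect(db_path, timeout=0, isolation_level=None)

# The PRAGMA returns the mode that is actually in effect afterwards.
mode = first.execute("PRAGMA journal_mode=WAL").fetchone()[0]
first.execute("CREATE TABLE t (id INTEGER)")

first.execute("BEGIN IMMEDIATE")  # takes the single write lock
try:
    second.execute("BEGIN IMMEDIATE")
    blocked_in_wal = False
except sqlite3.OperationalError:  # "database is locked"
    blocked_in_wal = True
first.execute("COMMIT")

first.close()
second.close()
print(mode, blocked_in_wal)
```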
### 17. Database File Permissions
**Q: How to handle permission issues?**

**Answer: Fail fast with helpful diagnostics**
```python
import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )
```

---

## Deployment/Operations

### 18. Container Startup and Health Checks
**Q: How to handle health checks during migration?**

**Answer: Return 503 during migration**
```python
# In app.py
MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False

@app.route('/health')
def health():
    if MIGRATION_IN_PROGRESS:
        return {'status': 'migrating'}, 503
    return {'status': 'healthy'}, 200
```

### 19. Monitoring and Alerting
**Q: What metrics/alerts are needed?**

**Answer: Key metrics to track**
```python
import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1            # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))
```

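One way the thresholds might be applied to collected metrics is sketched below. The `breached_thresholds` helper is hypothetical. It covers only the two "alert if greater than" thresholds, since `worker_failures` alerts on any failure and is not among the collected metrics listed above.

```python
# Subset of the thresholds above that use a strict "greater than" rule.
ALERT_THRESHOLDS = {
    "migration_duration_ms": 5000,  # Alert if > 5s
    "migration_retries": 5,         # Alert if > 5 retries
}


def breached_thresholds(metrics, thresholds=ALERT_THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


# Sample values invented for the example.
sample = {
    "migration_duration_ms": 6200,
    "migration_retries": 3,
    "migration_lock_wait_ms": 4100,
    "migrations_applied": 2,
}
print(breached_thresholds(sample))  # ['migration_duration_ms']
```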
---

## Alternative Approaches

### 20. Version Compatibility
**Q: How to handle version mismatches?**

**Answer: Strict version checking**
```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )
```

### 21. File-Based Locking
**Q: Should we consider flock() as backup?**

**Answer: NO - Adds complexity without benefit**
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach

### 22. Gunicorn Preload
**Q: Would --preload flag help?**

**Answer: NO - Makes problem WORSE**
- --preload runs app initialization ONCE in master
- Workers fork from master AFTER migrations complete
- BUT: Doesn't work with lazy-loaded resources
- Current architecture expects per-worker initialization
- Keep current approach

### 23. Application-Level Locks
**Q: Should we add Redis/memcached for coordination?**

**Answer: NO - Violates simplicity principle**
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem

---

## Final Implementation Checklist

### Required Changes

1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic

### Test Coverage Required

1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention

### Documentation Updates

1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info

---

## Go/No-Go Decision

### ✅ GO FOR IMPLEMENTATION

**Rationale:**
- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with clear rollback plan
- Testing strategy is comprehensive

**Implementation Priority: IMMEDIATE**
- This is blocking v1.0.0-rc.4 release
- Production systems affected
- Fix is well-understood and low-risk

**Next Steps:**
1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned

---

*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*