docs: Add architect documentation for migration race condition fix

Add comprehensive architectural documentation for the migration race condition fix, including:
- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
docs/architecture/migration-fix-quick-reference.md (new file, 238 lines)
# Migration Race Condition Fix - Quick Implementation Reference

## Implementation Checklist

### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`

```python
# 1. Add imports at top
import time
import random

# 2. Replace entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass

# B. Error handling pattern
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message
elapsed = time.time() - start_time
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### Testing Requirements

#### 1. Unit Test File: `test_migration_race_condition.py`
```python
import multiprocessing
from multiprocessing import Barrier, Process
import sqlite3
import time
from unittest.mock import patch, MagicMock

from starpunk.migrations import run_migrations


def _worker(barrier, results, worker_id):
    """Worker body (module-level so Process can spawn it)"""
    barrier.wait()  # Synchronize start
    from starpunk import create_app
    app = create_app()
    results[worker_id] = True


def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    barrier = Barrier(4)
    results = multiprocessing.Manager().dict()

    # Use Process rather than Pool: Pool cannot pickle locally-defined workers
    workers = [Process(target=_worker, args=(barrier, results, i)) for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

    assert all(results.get(i) for i in range(4)), "Some workers failed"


def test_lock_retry(db_path):  # db_path supplied by a pytest fixture
    """Test retry logic with mock"""
    with patch('sqlite3.connect') as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock(),  # Success on 3rd try
        ]
        run_migrations(db_path)
        assert mock.call_count == 3
```

#### 2. Integration Test: `test_integration.sh`
```bash
#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID

echo "SUCCESS: All workers started without race condition"
```

#### 3. Container Test: `test_container.sh`
```bash
#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .

# Run with fresh database
podman run --rm -d --name race-test \
    -v $(pwd)/test-data:/data \
    starpunk:race-test

# Check logs for success patterns
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)"

# Cleanup
podman stop race-test
```

### Verification Patterns in Logs

#### Successful Migration (One Worker Wins)
```
Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
```

#### Performance Metrics to Check
- Single worker: < 100ms total
- 4 workers: < 500ms total
- 10 workers (stress): < 2000ms total
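
A minimal timing harness for the single-worker target, as a sketch (assumes `run_migrations` is importable from `starpunk.migrations` and a scratch `test.db`):

```python
import time
from starpunk.migrations import run_migrations

start = time.perf_counter()
run_migrations("test.db")  # fresh scratch database
elapsed_ms = (time.perf_counter() - start) * 1000
assert elapsed_ms < 100, f"Single-worker migrations too slow: {elapsed_ms:.0f}ms"
```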

### Rollback Plan if Issues

1. **Immediate Workaround**
   ```bash
   # Change to single worker temporarily
   gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
   ```

2. **Revert Code**
   ```bash
   git revert HEAD
   ```

3. **Emergency Patch**
   ```python
   # In app.py temporarily
   # NOTE: gunicorn does not set GUNICORN_WORKER_ID itself; it must be
   # exported for one worker (e.g. from a post_fork hook) for this to work
   import os
   if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
       init_db()  # Only first worker runs migrations
   ```

### Deployment Commands

```bash
# 1. Run tests
python -m pytest test_migration_race_condition.py -v

# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .

# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1

# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1

# 5. Deploy
kubectl rollout restart deployment/starpunk
```

---

## Critical Points to Remember

1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED
3. **30s per attempt, 120s total max** - Two different timeouts
4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
5. **Test at multiple levels** - Unit, integration, container
6. **Fresh database state** between tests

## Support

If issues arise, check:
1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
3. SQLite lock states: `PRAGMA lock_status` while the issue occurs (available only in SQLITE_DEBUG builds)

---

*Quick Reference v1.0 - 2025-11-24*
docs/architecture/migration-race-condition-answers.md (new file, 477 lines)

# Migration Race Condition Fix - Architectural Answers

## Status: READY FOR IMPLEMENTATION

All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.

---

## Critical Questions

### 1. Connection Lifecycle Management
**Q: Should we create a new connection for each retry or reuse the same connection?**

**Answer: NEW CONNECTION per retry**
- Each retry MUST create a fresh connection
- Rationale: Failed lock acquisition may leave the connection in an inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
```python
while retry_count < max_retries:
    conn = None  # Fresh connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        # ... attempt migration ...
    finally:
        if conn:
            conn.close()
```

### 2. Transaction Boundaries
**Q: Should init_db() wrap everything in one transaction?**

**Answer: NO - Separate transactions for different operations**
- Schema creation: Own transaction (already implicit in executescript)
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent (see the sketch below)
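
A sketch of that shape, where `apply_schema` and `seed_initial_data` are hypothetical helpers standing in for the existing schema and seed steps:

```python
import sqlite3

def init_db(db_path):
    # Phase 1: schema creation (executescript runs under its own implicit transaction)
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        apply_schema(conn)  # hypothetical helper
    finally:
        conn.close()

    # Phase 2: migrations, inside their own BEGIN IMMEDIATE transaction
    run_migrations(db_path)

    # Phase 3: initial data, in a separate short transaction
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        with conn:  # commits on success, rolls back on error
            seed_initial_data(conn)  # hypothetical helper
    finally:
        conn.close()
```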

### 3. Lock Timeout vs Retry Timeout
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**

**Answer: This is BY DESIGN - No conflict**
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds the lock for 30s+, other workers time out and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
```python
start_time = time.time()
max_total_time = 120  # 2 minutes absolute maximum
while retry_count < max_retries and (time.time() - start_time) < max_total_time:
```
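
For reference, the ~102s figure falls out of the backoff formula used throughout these documents (base_delay = 0.1; sleeps happen after retries 1 through 9, and the 10th failure raises):

```python
base_delay = 0.1
delays = [base_delay * (2 ** k) for k in range(1, 10)]
print([round(d, 1) for d in delays])  # [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2]
print(f"total ≈ {sum(delays):.1f}s")  # total ≈ 102.2s, plus up to 0.1s jitter per retry
```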

### 4. Testing Strategy
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**

**Answer: BOTH - Different test levels**
- Unit tests: multiprocessing.Pool (fast, isolated)
- Integration tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
```
Level 1: Mock concurrent access (unit)
Level 2: multiprocessing.Pool (integration)
Level 3: gunicorn locally (system)
Level 4: Container with gunicorn (e2e)
```

### 5. BEGIN IMMEDIATE vs EXCLUSIVE
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**

**Answer: BEGIN IMMEDIATE is the CORRECT choice**
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
  - Migrations only need to prevent concurrent migrations (writes)
  - Other workers can still read the schema while one migrates
  - Less contention, faster startup
  - Only escalates to EXCLUSIVE when actually writing
- Keep BEGIN IMMEDIATE as specified (demonstrated below)
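
A small self-contained demonstration of the difference, as a sketch against a throwaway database (the path and timeouts are illustrative):

```python
import sqlite3

DB = "/tmp/lock-demo.db"

writer = sqlite3.connect(DB)
writer.execute("CREATE TABLE IF NOT EXISTS t (x)")
writer.commit()
writer.execute("BEGIN IMMEDIATE")  # takes the RESERVED lock

# Reads are unaffected while RESERVED is held
reader = sqlite3.connect(DB, timeout=0.1)
print(reader.execute("SELECT COUNT(*) FROM t").fetchone())  # works

# A second writer cannot start its own write transaction
rival = sqlite3.connect(DB, timeout=0.1)
try:
    rival.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError as e:
    print(e)  # database is locked

writer.rollback()
```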

---

## Edge Cases and Error Handling

### 6. Partial Migration Failure
**Q: What if a migration partially applies or rollback fails?**

**Answer: Transaction atomicity handles this**
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
```python
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        # Database potentially corrupt - fail hard
        raise SystemExit(1)
    raise MigrationError(e)
```

### 7. Migration File Consistency
**Q: What if migration files change during deployment?**

**Answer: Not a concern with proper deployment**
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
```python
# Store in schema_migrations: (name, checksum, applied_at)
# Verify checksum matches before applying
```
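
A sketch of what that validation could look like; the `checksum` column is hypothetical (it is not in the current `schema_migrations` table) and `MigrationError` is the project's existing exception:

```python
import hashlib
from pathlib import Path

def migration_checksum(path: Path) -> str:
    """SHA-256 of the migration file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_checksum(conn, name: str, path: Path) -> None:
    # Assumes a hypothetical checksum column on schema_migrations
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE migration_name = ?",
        (name,),
    ).fetchone()
    if row and row[0] != migration_checksum(path):
        raise MigrationError(f"Migration {name} changed after it was applied")
```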

### 8. Retry Exhaustion Error Messages
**Q: What error message when retries exhausted?**

**Answer: Be specific and actionable**
```python
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### 9. Logging Levels
**Q: What log level for lock waits?**

**Answer: Graduated approach**
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
```python
if retry_count <= 3:
    level = logging.DEBUG
elif retry_count <= 7:
    level = logging.INFO
else:
    level = logging.WARNING
logger.log(level, f"Retry {retry_count}/{max_retries}")
```

### 10. Index Creation Failure
**Q: How to handle index creation failures in migration 002?**

**Answer: Fail fast with clear context**
```python
for index_name, index_sql in indexes_to_create:
    try:
        conn.execute(index_sql)
    except sqlite3.OperationalError as e:
        if "already exists" in str(e):
            logger.debug(f"Index {index_name} already exists")
        else:
            raise MigrationError(
                f"Failed to create index {index_name}: {e}\n"
                f"SQL: {index_sql}"
            )
```

---

## Testing Strategy

### 11. Concurrent Testing Simulation
**Q: How to properly simulate concurrent worker startup?**

**Answer: Multiple approaches**
```python
import multiprocessing
from multiprocessing import Process

# Approach 1: Barrier synchronization
# (in practice Pool.map needs a picklable, module-level worker function)
def test_concurrent_migrations():
    barrier = multiprocessing.Barrier(4)

    def worker(_):
        barrier.wait()  # All start together
        return run_migrations(db_path)

    with multiprocessing.Pool(4) as pool:
        results = pool.map(worker, range(4))


# Approach 2: Process start
processes = []
for i in range(4):
    p = Process(target=run_migrations, args=(db_path,))
    processes.append(p)
for p in processes:
    p.start()  # Near-simultaneous
```

### 12. Lock Contention Testing
**Q: How to test lock contention scenarios?**

**Answer: Inject delays**
```python
import time
import sqlite3
from unittest.mock import patch

import pytest

# Test helper to force contention
def slow_migration_for_testing(conn):
    conn.execute("BEGIN IMMEDIATE")
    time.sleep(2)  # Force other workers to wait
    # Apply migration
    conn.commit()

# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
    mock_connect.side_effect = sqlite3.OperationalError("database is locked")
    # Verify retry logic: every attempt fails, so retries must exhaust
    # (also patch time.sleep in practice to keep the test fast)
    with pytest.raises(MigrationError):
        run_migrations(db_path)
```

### 13. Performance Tests
**Q: What timing is acceptable?**

**Answer: Performance targets**
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
```python
import timeit
from starpunk import create_app

setup_time = timeit.timeit(lambda: create_app(), number=1)
assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
```

### 14. Retry Logic Unit Tests
**Q: How to unit test retry logic?**

**Answer: Mock the lock failures**
```python
import sqlite3
from unittest.mock import patch, MagicMock

class TestRetryLogic:
    def test_retry_on_lock(self):
        with patch('sqlite3.connect') as mock:
            # First 2 attempts fail, 3rd succeeds
            mock.side_effect = [
                sqlite3.OperationalError("database is locked"),
                sqlite3.OperationalError("database is locked"),
                MagicMock()  # Success
            ]
            run_migrations(db_path)
            assert mock.call_count == 3
```

---

## SQLite-Specific Concerns

### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
**Q: Deep dive on lock choice?**

**Answer: Lock escalation path**
```
BEGIN DEFERRED  → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE

For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
```

### 16. WAL Mode Interaction
**Q: How does this work with WAL mode?**

**Answer: Works correctly with both modes**
- Journal mode: BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
```python
cursor = conn.execute("PRAGMA journal_mode")
mode = cursor.fetchone()[0]
logger.debug(f"Database in {mode} mode")
```

### 17. Database File Permissions
**Q: How to handle permission issues?**

**Answer: Fail fast with helpful diagnostics**
```python
import os
import stat
from pathlib import Path

db_path = Path(db_path)
if not db_path.exists():
    # Will be created - check parent dir
    parent = db_path.parent
    if not os.access(parent, os.W_OK):
        raise MigrationError(f"Cannot write to directory: {parent}")
else:
    # Check existing file
    if not os.access(db_path, os.W_OK):
        stats = os.stat(db_path)
        mode = stat.filemode(stats.st_mode)
        raise MigrationError(
            f"Database not writable: {db_path}\n"
            f"Permissions: {mode}\n"
            f"Owner: {stats.st_uid}:{stats.st_gid}"
        )
```

---

## Deployment/Operations

### 18. Container Startup and Health Checks
**Q: How to handle health checks during migration?**

**Answer: Return 503 during migration**
```python
# In app.py
MIGRATION_IN_PROGRESS = False

def create_app():
    global MIGRATION_IN_PROGRESS
    MIGRATION_IN_PROGRESS = True
    try:
        init_db()
    finally:
        MIGRATION_IN_PROGRESS = False

@app.route('/health')
def health():
    if MIGRATION_IN_PROGRESS:
        return {'status': 'migrating'}, 503
    return {'status': 'healthy'}, 200
```

### 19. Monitoring and Alerting
**Q: What metrics/alerts are needed?**

**Answer: Key metrics to track**
```python
import json

# Add metrics collection
metrics = {
    'migration_duration_ms': 0,
    'migration_retries': 0,
    'migration_lock_wait_ms': 0,
    'migrations_applied': 0
}

# Alert thresholds
ALERTS = {
    'migration_duration_ms': 5000,  # Alert if > 5s
    'migration_retries': 5,         # Alert if > 5 retries
    'worker_failures': 1            # Alert on any failure
}

# Log in structured format
logger.info(json.dumps({
    'event': 'migration_complete',
    'metrics': metrics
}))
```

---

## Alternative Approaches

### 20. Version Compatibility
**Q: How to handle version mismatches?**

**Answer: Strict version checking**
```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"

def check_version_compatibility(conn):
    cursor = conn.execute(
        "SELECT value FROM app_config WHERE key = 'migration_version'"
    )
    row = cursor.fetchone()
    if row and row[0] != MIGRATION_VERSION:
        raise MigrationError(
            f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
            f"Action: Run migration tool separately"
        )
```

### 21. File-Based Locking
**Q: Should we consider flock() as backup?**

**Answer: NO - Adds complexity without benefit**
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach

### 22. Gunicorn Preload
**Q: Would --preload flag help?**

**Answer: NO - Does not fit this architecture**
- --preload runs app initialization ONCE in the master
- Workers fork from the master AFTER migrations complete
- BUT: Doesn't work with lazy-loaded resources
- Current architecture expects per-worker initialization
- Keep current approach

### 23. Application-Level Locks
**Q: Should we add Redis/memcached for coordination?**

**Answer: NO - Violates simplicity principle**
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem

---

## Final Implementation Checklist

### Required Changes

1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic

### Test Coverage Required

1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention

### Documentation Updates

1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info

---

## Go/No-Go Decision

### ✅ GO FOR IMPLEMENTATION

**Rationale:**
- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with a clear rollback plan
- Testing strategy is comprehensive

**Implementation Priority: IMMEDIATE**
- This is blocking the v1.0.0-rc.4 release
- Production systems are affected
- Fix is well-understood and low-risk

**Next Steps:**
1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned

---

*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*
docs/decisions/ADR-022-migration-race-condition-fix.md (new file, 208 lines)

# ADR-022: Database Migration Race Condition Resolution

## Status
Accepted

## Context

In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.

When the container starts fresh, all 4 workers start simultaneously and attempt to:
1. Create the `schema_migrations` table
2. Apply pending migrations
3. Insert records into `schema_migrations`

This causes a race condition where:
- Worker 1 successfully applies the migration and inserts its record
- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
- Failed workers crash, causing container restarts
- After restart, migrations are already applied so it works

## Decision

We will implement **database-level advisory locking** using SQLite's transaction mechanism with IMMEDIATE mode, combined with retry logic. This approach:

1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
2. Implements exponential backoff retry for workers that can't acquire the lock
3. Ensures only one worker can run migrations at a time
4. Other workers wait and verify migrations are complete

This is the simplest, most robust solution that:
- Requires minimal code changes
- Uses SQLite's native capabilities
- Doesn't require external dependencies
- Works across all deployment scenarios

## Rationale

### Options Considered

1. **File-based locking (fcntl)**
   - Pro: Simple to implement
   - Con: Doesn't work across containers/network filesystems
   - Con: Lock files can be orphaned if process crashes

2. **Run migrations before workers start**
   - Pro: Cleanest separation of concerns
   - Con: Requires container entrypoint script changes
   - Con: Complicates development workflow
   - Con: Doesn't fix the root cause for non-container deployments

3. **Make migration insertion idempotent (INSERT OR IGNORE)**
   - Pro: Simple SQL change
   - Con: Doesn't prevent parallel migration execution
   - Con: Could corrupt database if migrations partially apply
   - Con: Masks the real problem

4. **Database advisory locking (CHOSEN)**
   - Pro: Uses SQLite's native transaction locking
   - Pro: Guaranteed atomicity
   - Pro: Works across all deployment scenarios
   - Pro: Self-cleaning (no orphaned locks)
   - Con: Requires retry logic

### Why Database Locking?

SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing. This provides:

1. **Atomicity**: Either all migrations apply or none do
2. **Isolation**: Only one worker can modify schema at a time
3. **Automatic cleanup**: Locks released on connection close/crash
4. **No external dependencies**: Uses SQLite's built-in features

## Implementation

The fix will be implemented in `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
def run_migrations(db_path, logger=None):
    """Run all pending database migrations with concurrency protection"""

    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None  # ensure the name is bound for the finally block
        try:
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Acquire write lock for migrations (RESERVED)
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Create migrations table if needed
                create_migrations_table(conn)

                # Check if another worker already ran migrations
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                if cursor.fetchone()[0] > 0:
                    # Migrations already run by another worker
                    conn.commit()
                    logger.info("Migrations already applied by another worker")
                    return

                # Run migration logic (existing code)
                # ... rest of migration code ...

                conn.commit()
                return  # Success

            except Exception:
                conn.rollback()
                raise

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e):
                retry_count += 1
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)

                if retry_count < max_retries:
                    logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
                    time.sleep(delay)
                else:
                    raise MigrationError(f"Failed to acquire migration lock after {max_retries} attempts")
            else:
                raise

        finally:
            if conn:
                conn.close()
```

Additional changes needed:

1. Add imports: `import time`, `import random`
2. Modify connection timeout from default 5s to 30s
3. Add early check for already-applied migrations
4. Wrap entire migration process in IMMEDIATE transaction

## Consequences

### Positive
- Eliminates race condition completely
- No container configuration changes needed
- Works in all deployment scenarios (container, systemd, manual)
- Minimal code changes (~50 lines)
- Self-healing (no manual lock cleanup needed)
- Provides clear logging of what's happening

### Negative
- Slight startup delay for workers that wait (100ms-2s typical)
- Adds complexity to migration runner
- Requires careful testing of retry logic

### Neutral
- Workers start sequentially for migration phase, then run in parallel
- First worker to acquire lock runs migrations for all
- Log output will show retry attempts (useful for debugging)

## Testing Strategy

1. **Unit test with mock**: Test retry logic with simulated lock contention
2. **Integration test**: Spawn multiple processes, verify only one runs migrations
3. **Container test**: Build container, verify clean startup with 4 workers
4. **Stress test**: Start 20 processes simultaneously, verify correctness

## Migration Path

1. Implement fix in `starpunk/migrations.py`
2. Test locally with multiple workers
3. Build and test container
4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
5. Monitor production logs for retry patterns

## Implementation Notes (Post-Analysis)

Based on comprehensive architectural review, the following clarifications have been established:

### Critical Implementation Details

1. **Connection Management**: Create NEW connection for each retry attempt (no reuse)
2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
4. **Logging Levels**: Graduated (DEBUG for retries 1-3, INFO for 4-7, WARNING for 8+)
5. **Transaction Boundaries**: Separate transactions for schema/migrations/data

### Test Requirements

- Unit tests with multiprocessing.Pool
- Integration tests with actual gunicorn
- Container tests with full deployment
- Performance target: <500ms with 4 workers

### Documentation

- Full Q&A: `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
- Implementation Guide: `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md`
- Quick Reference: `/home/phil/Projects/starpunk/docs/architecture/migration-fix-quick-reference.md`

## References

- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
- Issue: Production migration race condition with gunicorn workers

## Status Update

**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.
docs/reports/migration-race-condition-fix-implementation.md (new file, 431 lines)

# Migration Race Condition Fix - Implementation Guide

## Executive Summary

**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers racing to apply migrations causes container startup failures.

**Solution**: Implement database-level advisory locking with retry logic in `migrations.py`.

**Urgency**: HIGH - This is a blocker for the v1.0.0-rc.4 release.

## Root Cause Analysis

### The Problem Flow

1. Container starts with `gunicorn --workers 4`
2. Each worker independently calls:
   ```
   app.py → create_app() → init_db() → run_migrations()
   ```
3. All 4 workers simultaneously try to:
   - INSERT into the schema_migrations table
   - Apply the same migrations
4. SQLite's UNIQUE constraint on migration_name causes workers 2-4 to crash
5. Container restarts, works on the second attempt (migrations already applied)

### Why This Happens

- **No synchronization**: Workers are independent processes
- **No locking**: Migration code doesn't prevent concurrent execution
- **Immediate failure**: UNIQUE constraint violation crashes the worker
- **Gunicorn behavior**: Worker crash triggers container restart
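
The failure is easy to reproduce in isolation. A minimal sketch, assuming only the UNIQUE constraint described above (the column layout here is illustrative, not the project's full schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schema_migrations ("
    "  migration_name TEXT UNIQUE,"
    "  applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

# "Worker 1" records the migration
conn.execute("INSERT INTO schema_migrations (migration_name) VALUES ('001_initial_schema.sql')")

# "Worker 2" attempts the same insert and crashes
try:
    conn.execute("INSERT INTO schema_migrations (migration_name) VALUES ('001_initial_schema.sql')")
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: schema_migrations.migration_name
```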

## Immediate Fix Implementation

### Step 1: Update migrations.py

Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
import time
import random
```

### Step 2: Replace run_migrations function

Replace the entire `run_migrations` function (lines 304-462) with:

```python
def run_migrations(db_path, logger=None):
    """
    Run all pending database migrations with concurrency protection

    Uses database-level locking to prevent race conditions when multiple
    workers start simultaneously. Only one worker will apply migrations;
    others will wait and verify completion.

    Args:
        db_path: Path to SQLite database file
        logger: Optional logger for output

    Raises:
        MigrationError: If any migration fails to apply or lock cannot be acquired
    """
    if logger is None:
        logger = logging.getLogger(__name__)

    # Determine migrations directory
    migrations_dir = Path(__file__).parent.parent / "migrations"

    if not migrations_dir.exists():
        logger.warning(f"Migrations directory not found: {migrations_dir}")
        return

    # Retry configuration for lock acquisition
    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None
        try:
            # Connect with longer timeout for lock contention
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Attempt to acquire the migration lock
            # BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Ensure migrations tracking table exists
                create_migrations_table(conn)

                # Quick check: have migrations already been applied by another worker?
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                migration_count = cursor.fetchone()[0]

                # Discover migration files
                migration_files = discover_migration_files(migrations_dir)

                if not migration_files:
                    conn.commit()
                    logger.info("No migration files found")
                    return

                # If migrations exist and we're not the first worker, verify and exit
                if migration_count > 0:
                    # Check if all migrations are applied
                    applied = get_applied_migrations(conn)
                    pending = [m for m, _ in migration_files if m not in applied]

                    if not pending:
                        conn.commit()
                        logger.debug("All migrations already applied by another worker")
                        return
                    # If there are pending migrations, we continue to apply them
                    logger.info(f"Found {len(pending)} pending migrations to apply")

                # Fresh database detection (original logic preserved)
                if migration_count == 0:
                    if is_schema_current(conn):
                        # Schema is current - mark all migrations as applied
                        for migration_name, _ in migration_files:
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                        conn.commit()
                        logger.info(
                            f"Fresh database detected: marked {len(migration_files)} "
                            f"migrations as applied (schema already current)"
                        )
                        return
                    else:
                        logger.info("Fresh database with partial schema: applying needed migrations")

                # Get already-applied migrations
                applied = get_applied_migrations(conn)

                # Apply pending migrations (original logic preserved)
                pending_count = 0
                skipped_count = 0
                for migration_name, migration_path in migration_files:
                    if migration_name not in applied:
                        # Check if migration is actually needed
                        should_check_needed = (
                            migration_count == 0 or
                            migration_name == "002_secure_tokens_and_authorization_codes.sql"
                        )

                        if should_check_needed and not is_migration_needed(conn, migration_name):
                            # Special handling for migration 002: if tables exist but indexes don't
                            if migration_name == "002_secure_tokens_and_authorization_codes.sql":
                                # Check if we need to create indexes
                                indexes_to_create = []
                                if not index_exists(conn, 'idx_tokens_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
                                if not index_exists(conn, 'idx_tokens_me'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
                                if not index_exists(conn, 'idx_tokens_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
                                if not index_exists(conn, 'idx_auth_codes_hash'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
                                if not index_exists(conn, 'idx_auth_codes_expires'):
                                    indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")

                                if indexes_to_create:
                                    for index_sql in indexes_to_create:
                                        conn.execute(index_sql)
                                    logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")

                            # Mark as applied without executing full migration
                            conn.execute(
                                "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                (migration_name,)
                            )
                            skipped_count += 1
                            logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
                        else:
                            # Apply the migration (within our transaction)
                            try:
                                # Read migration SQL
                                migration_sql = migration_path.read_text()

                                logger.debug(f"Applying migration: {migration_name}")

                                # Execute migration (already in transaction)
                                conn.executescript(migration_sql)

                                # Record migration as applied
                                conn.execute(
                                    "INSERT INTO schema_migrations (migration_name) VALUES (?)",
                                    (migration_name,)
                                )

                                logger.info(f"Applied migration: {migration_name}")
                                pending_count += 1

                            except Exception as e:
                                # Roll back the transaction
                                raise MigrationError(f"Migration {migration_name} failed: {e}")

                # Commit all migrations atomically
                conn.commit()

                # Summary
                total_count = len(migration_files)
                if pending_count > 0 or skipped_count > 0:
                    if skipped_count > 0:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
                            f"(already in SCHEMA_SQL), {total_count} total"
                        )
                    else:
                        logger.info(
                            f"Migrations complete: {pending_count} applied, "
                            f"{total_count} total"
                        )
                else:
                    logger.info(f"All migrations up to date ({total_count} total)")

                return  # Success!

            except MigrationError:
                conn.rollback()
                raise

            except Exception as e:
                conn.rollback()
                raise MigrationError(f"Migration system error: {e}")

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e).lower():
                # Another worker has the lock, retry with exponential backoff
                retry_count += 1

                if retry_count < max_retries:
                    # Exponential backoff with jitter
                    delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                    logger.debug(
                        f"Database locked by another worker, retry {retry_count}/{max_retries} "
                        f"in {delay:.2f}s"
                    )
                    time.sleep(delay)
                    continue
                else:
                    raise MigrationError(
                        f"Failed to acquire migration lock after {max_retries} attempts. "
                        f"This may indicate a hung migration process."
                    )
            else:
                # Non-lock related database error
                error_msg = f"Database error during migration: {e}"
                logger.error(error_msg)
                raise MigrationError(error_msg)

        except MigrationError:
            # Already carries context; propagate without re-wrapping
            raise

        except Exception as e:
            # Unexpected error
            error_msg = f"Unexpected error during migration: {e}"
            logger.error(error_msg)
            raise MigrationError(error_msg)

        finally:
            if conn:
                try:
                    conn.close()
                except Exception:
                    pass  # Ignore errors during cleanup

    # Should never reach here, but just in case
    raise MigrationError("Migration retry loop exited unexpectedly")
```

### Step 3: Testing the Fix

Create a test script to verify the fix works:

```python
#!/usr/bin/env python3
"""Test migration race condition fix"""

import multiprocessing
import time
import sys
from pathlib import Path

# Add project to path
sys.path.insert(0, str(Path(__file__).parent))


def worker_init(worker_id):
    """Simulate a gunicorn worker starting"""
    print(f"Worker {worker_id}: Starting...")

    try:
        from starpunk import create_app
        app = create_app()
        print(f"Worker {worker_id}: Successfully initialized")
        return True
    except Exception as e:
        print(f"Worker {worker_id}: FAILED - {e}")
        return False


if __name__ == "__main__":
    # Test with 10 workers (more than production to stress test)
    num_workers = 10

    print(f"Starting {num_workers} workers simultaneously...")

    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(worker_init, range(num_workers))

    success_count = sum(results)
    print(f"\nResults: {success_count}/{num_workers} workers succeeded")

    if success_count == num_workers:
        print("SUCCESS: All workers initialized without race condition")
        sys.exit(0)
    else:
        print("FAILURE: Race condition still present")
        sys.exit(1)
```

## Verification Steps

1. **Local Testing**:
   ```bash
   # Test with multiple workers
   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app

   # Check logs for retry messages
   # Should see "Database locked by another worker, retry..." messages
   ```

2. **Container Testing**:
   ```bash
   # Build container
   podman build -t starpunk:test -f Containerfile .

   # Run with fresh database
   podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test

   # Should start cleanly without restarts
   ```

3. **Log Verification**:
   Look for these patterns:
   - One worker: "Applied migration: XXX"
   - Other workers: "Database locked by another worker, retry..."
   - Final: "All migrations already applied by another worker"

## Risk Assessment

### Risk Level: LOW

The fix is safe because:
1. Uses SQLite's native transaction mechanism
2. Preserves all existing migration logic
3. Only adds retry wrapper around existing code
4. Fails safely with clear error messages
5. No data loss possible (transactions ensure atomicity)

### Rollback Plan

If issues occur:
1. Revert to previous version
2. Start container with single worker temporarily: `--workers 1`
3. Once migrations apply, scale back to 4 workers

## Release Strategy

### Option 1: Hotfix (Recommended)
- Release as v1.0.0-rc.3.1
- Immediate deployment to fix production issue
- Minimal testing required (focused fix)

### Option 2: Include in rc.4
- Bundle with other rc.4 changes
- More testing time
- Risk: Production remains broken until rc.4

**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.

## Alternative Workarounds (If Needed Urgently)

While the proper fix is implemented, these temporary workarounds can be used:

### Workaround 1: Single Worker Startup
```bash
# In Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]

# After first successful start, rebuild with 4 workers
```

### Workaround 2: Pre-migration Script
```bash
#!/bin/bash
# Entrypoint script that runs migrations before gunicorn
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...
```

### Workaround 3: Preload in Master
```bash
# Initialize the app once in the master before forking workers
# (see Q22 in the answers doc for caveats with per-worker initialization)
gunicorn --preload --workers 4 ...
```

## Summary

- **Problem**: Race condition when multiple workers apply migrations
- **Solution**: Database-level locking with retry logic
- **Implementation**: ~150 lines of code changes in migrations.py
- **Testing**: Verify with multi-worker startup
- **Risk**: LOW - Safe, atomic changes
- **Urgency**: HIGH - Blocks production deployment
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately

## Developer Questions Answered

All 23 architectural questions have been comprehensively answered in:
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`

**Key Decisions:**
- NEW connection per retry (not reused)
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
- Separate transactions for each operation
- Both multiprocessing.Pool AND gunicorn testing needed
- 30s timeout per attempt, 120s total maximum
- Graduated logging levels based on retry count

**Implementation Status: READY TO PROCEED**