docs: Add architect documentation for migration race condition fix
Add comprehensive architectural documentation for the migration race condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

docs/architecture/migration-fix-quick-reference.md (new file, 238 lines)

# Migration Race Condition Fix - Quick Implementation Reference

## Implementation Checklist

### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`

```python
# 1. Add imports at top
import time
import random

# 2. Replace entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md

# Key patterns to implement:

# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120  # 2 minute absolute max

while retry_count < max_retries and (time.time() - start_time) < max_total_time:
    conn = None  # NEW connection each iteration
    try:
        conn = sqlite3.connect(db_path, timeout=30.0)
        conn.execute("BEGIN IMMEDIATE")  # Lock acquisition
        # ... migration logic ...
        conn.commit()
        return  # Success
    except sqlite3.OperationalError as e:
        if "database is locked" in str(e).lower():
            retry_count += 1
            if retry_count < max_retries:
                # Exponential backoff with jitter
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
                # Graduated logging
                if retry_count <= 3:
                    logger.debug(f"Retry {retry_count}/{max_retries}")
                elif retry_count <= 7:
                    logger.info(f"Retry {retry_count}/{max_retries}")
                else:
                    logger.warning(f"Retry {retry_count}/{max_retries}")
                time.sleep(delay)
                continue
    finally:
        if conn:
            try:
                conn.close()
            except Exception:
                pass

# B. Error handling pattern
except Exception as e:
    try:
        conn.rollback()
    except Exception as rollback_error:
        logger.critical(f"FATAL: Rollback failed: {rollback_error}")
        raise SystemExit(1)
    raise MigrationError(f"Migration failed: {e}")

# C. Final error message (with elapsed = time.time() - start_time)
raise MigrationError(
    f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
    f"Possible causes:\n"
    f"1. Another process is stuck in migration (check logs)\n"
    f"2. Database file permissions issue\n"
    f"3. Disk I/O problems\n"
    f"Action: Restart container with single worker to diagnose"
)
```

### Testing Requirements

#### 1. Unit Test File: `test_migration_race_condition.py`

```python
import multiprocessing
import sqlite3
import time
from multiprocessing import Barrier, Process
from unittest.mock import MagicMock, patch

from starpunk.migrations import run_migrations  # module under test


def test_concurrent_migrations():
    """Test 4 workers starting simultaneously"""
    barrier = Barrier(4)

    def worker(worker_id):
        # Note: with multiprocessing.Pool the worker must be picklable
        # (module-level) in practice; shown inline here for brevity.
        barrier.wait()  # Synchronize start
        from starpunk import create_app
        app = create_app()
        return True

    with multiprocessing.Pool(4) as pool:
        results = pool.map(worker, range(4))

    assert all(results), "Some workers failed"


def test_lock_retry():
    """Test retry logic with mock"""
    with patch('sqlite3.connect') as mock:
        mock.side_effect = [
            sqlite3.OperationalError("database is locked"),
            sqlite3.OperationalError("database is locked"),
            MagicMock()  # Success on 3rd try
        ]
        run_migrations(db_path)  # db_path: path to a temporary test database
        assert mock.call_count == 3
```

#### 2. Integration Test: `test_integration.sh`

```bash
#!/bin/bash
# Test with actual gunicorn

# Clean start
rm -f test.db

# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!

# Wait for startup
sleep 3

# Check if running
if ! kill -0 $PID 2>/dev/null; then
    echo "FAILED: Gunicorn crashed"
    exit 1
fi

# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1

# Cleanup
kill $PID

echo "SUCCESS: All workers started without race condition"
```

#### 3. Container Test: `test_container.sh`

```bash
#!/bin/bash
# Test in container environment

# Build
podman build -t starpunk:race-test -f Containerfile .

# Run with fresh database
podman run --rm -d --name race-test \
    -v $(pwd)/test-data:/data \
    starpunk:race-test

# Check logs for success patterns
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)"

# Cleanup
podman stop race-test
```

### Verification Patterns in Logs

#### Successful Migration (One Worker Wins)
```
Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
```

#### Performance Metrics to Check
- Single worker: < 100ms total
- 4 workers: < 500ms total
- 10 workers (stress): < 2000ms total (see the timing sketch below)
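
A minimal timing check for the single-worker target, sketched under the assumption that `run_migrations(db_path)` is importable from `starpunk.migrations` and that pytest's `tmp_path` fixture is available; adjust the import path and threshold to the real code.

```python
# Sketch only: measures a single-worker migration run against an empty database.
import time

from starpunk.migrations import run_migrations  # assumed import path


def test_single_worker_migration_time(tmp_path):
    db_path = str(tmp_path / "perf.db")
    start = time.perf_counter()
    run_migrations(db_path)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 100, f"Single-worker migration too slow: {elapsed_ms:.1f}ms"
```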

### Rollback Plan if Issues

1. **Immediate Workaround**
   ```bash
   # Change to single worker temporarily
   gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
   ```

2. **Revert Code**
   ```bash
   git revert HEAD
   ```

3. **Emergency Patch**
   ```python
   # In app.py temporarily
   import os
   if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
       init_db()  # Only first worker runs migrations
   ```

### Deployment Commands

```bash
# 1. Run tests
python -m pytest test_migration_race_condition.py -v

# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .

# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1

# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1

# 5. Deploy
kubectl rollout restart deployment/starpunk
```

---

## Critical Points to Remember

1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED
3. **30s per attempt, 120s total max** - Two different timeouts
4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
5. **Test at multiple levels** - Unit, integration, container
6. **Fresh database state** between tests (see the fixture sketch below)
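
A minimal pytest fixture sketch for point 6; the fixture name `fresh_db` is illustrative and assumes tests take the database path as an argument.

```python
import pytest


@pytest.fixture
def fresh_db(tmp_path):
    """Give each test its own, not-yet-created database file."""
    db_path = tmp_path / "starpunk-test.db"
    yield str(db_path)
    # tmp_path is removed by pytest, so no explicit teardown is required
```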

## Support

If issues arise, check:
1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
3. SQLite lock states: `PRAGMA lock_status` during issue

---

*Quick Reference v1.0 - 2025-11-24*

docs/architecture/migration-race-condition-answers.md (new file, 477 lines)

|
|||||||
|
# Migration Race Condition Fix - Architectural Answers
|
||||||
|
|
||||||
|
## Status: READY FOR IMPLEMENTATION
|
||||||
|
|
||||||
|
All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Critical Questions
|
||||||
|
|
||||||
|
### 1. Connection Lifecycle Management
|
||||||
|
**Q: Should we create a new connection for each retry or reuse the same connection?**
|
||||||
|
|
||||||
|
**Answer: NEW CONNECTION per retry**
|
||||||
|
- Each retry MUST create a fresh connection
|
||||||
|
- Rationale: Failed lock acquisition may leave connection in inconsistent state
|
||||||
|
- SQLite connections are lightweight; overhead is minimal
|
||||||
|
- Pattern:
|
||||||
|
```python
|
||||||
|
while retry_count < max_retries:
|
||||||
|
conn = None # Fresh connection each iteration
|
||||||
|
try:
|
||||||
|
conn = sqlite3.connect(db_path, timeout=30.0)
|
||||||
|
# ... attempt migration ...
|
||||||
|
finally:
|
||||||
|
if conn:
|
||||||
|
conn.close()
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Transaction Boundaries
|
||||||
|
**Q: Should init_db() wrap everything in one transaction?**
|
||||||
|
|
||||||
|
**Answer: NO - Separate transactions for different operations**
|
||||||
|
- Schema creation: Own transaction (already implicit in executescript)
|
||||||
|
- Migrations: Own transaction with BEGIN IMMEDIATE
|
||||||
|
- Initial data: Own transaction
|
||||||
|
- Rationale: Minimizes lock duration and allows partial success visibility
|
||||||
|
- Each operation is atomic but independent
|
||||||
|
|
||||||
|
### 3. Lock Timeout vs Retry Timeout
|
||||||
|
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**
|
||||||
|
|
||||||
|
**Answer: This is BY DESIGN - No conflict**
|
||||||
|
- 30s timeout: Maximum wait for any single lock acquisition attempt
|
||||||
|
- 102s total: Maximum cumulative retry duration across multiple attempts
|
||||||
|
- If one worker holds lock for 30s+, other workers timeout and retry
|
||||||
|
- Pattern ensures no single worker waits indefinitely
|
||||||
|
- Recommendation: Add total timeout check:
|
||||||
|
```python
|
||||||
|
start_time = time.time()
|
||||||
|
max_total_time = 120 # 2 minutes absolute maximum
|
||||||
|
while retry_count < max_retries and (time.time() - start_time) < max_total_time:
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Testing Strategy
|
||||||
|
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**
|
||||||
|
|
||||||
|
**Answer: BOTH - Different test levels**
|
||||||
|
- Unit tests: multiprocessing.Pool (fast, isolated)
|
||||||
|
- Integration tests: Actual gunicorn with --workers 4
|
||||||
|
- Container tests: Full podman/docker run
|
||||||
|
- Test matrix:
|
||||||
|
```
|
||||||
|
Level 1: Mock concurrent access (unit)
|
||||||
|
Level 2: multiprocessing.Pool (integration)
|
||||||
|
Level 3: gunicorn locally (system)
|
||||||
|
Level 4: Container with gunicorn (e2e)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5. BEGIN IMMEDIATE vs EXCLUSIVE
|
||||||
|
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**
|
||||||
|
|
||||||
|
**Answer: BEGIN IMMEDIATE is CORRECT choice**
|
||||||
|
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
|
||||||
|
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
|
||||||
|
- Rationale:
|
||||||
|
- Migrations only need to prevent concurrent migrations (writes)
|
||||||
|
- Other workers can still read schema while one migrates
|
||||||
|
- Less contention, faster startup
|
||||||
|
- Only escalates to EXCLUSIVE when actually writing
|
||||||
|
- Keep BEGIN IMMEDIATE as specified
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Edge Cases and Error Handling
|
||||||
|
|
||||||
|
### 6. Partial Migration Failure
|
||||||
|
**Q: What if a migration partially applies or rollback fails?**
|
||||||
|
|
||||||
|
**Answer: Transaction atomicity handles this**
|
||||||
|
- Within transaction: Automatic rollback on ANY error
|
||||||
|
- Rollback failure: Extremely rare (corrupt database)
|
||||||
|
- Strategy:
|
||||||
|
```python
|
||||||
|
except Exception as e:
|
||||||
|
try:
|
||||||
|
conn.rollback()
|
||||||
|
except Exception as rollback_error:
|
||||||
|
logger.critical(f"FATAL: Rollback failed: {rollback_error}")
|
||||||
|
# Database potentially corrupt - fail hard
|
||||||
|
raise SystemExit(1)
|
||||||
|
raise MigrationError(e)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 7. Migration File Consistency
|
||||||
|
**Q: What if migration files change during deployment?**
|
||||||
|
|
||||||
|
**Answer: Not a concern with proper deployment**
|
||||||
|
- Container deployments: Files are immutable in image
|
||||||
|
- Traditional deployment: Use atomic directory swap
|
||||||
|
- If concerned, add checksum validation:
|
||||||
|
```python
|
||||||
|
# Store in schema_migrations: (name, checksum, applied_at)
|
||||||
|
# Verify checksum matches before applying
|
||||||
|
```
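
A hedged sketch of that optional validation, assuming a `checksum` column added to `schema_migrations`; the column and helper names are illustrative, not part of the current schema.

```python
import hashlib


def migration_checksum(migration_path):
    """SHA-256 of the migration file contents."""
    return hashlib.sha256(migration_path.read_bytes()).hexdigest()


def verify_migration_unchanged(conn, migration_name, migration_path):
    row = conn.execute(
        "SELECT checksum FROM schema_migrations WHERE migration_name = ?",
        (migration_name,),
    ).fetchone()
    if row and row[0] != migration_checksum(migration_path):
        # MigrationError is the exception type used throughout these documents
        raise MigrationError(f"Migration {migration_name} changed after it was applied")
```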
|
||||||
|
|
||||||
|
### 8. Retry Exhaustion Error Messages
|
||||||
|
**Q: What error message when retries exhausted?**
|
||||||
|
|
||||||
|
**Answer: Be specific and actionable**
|
||||||
|
```python
|
||||||
|
raise MigrationError(
|
||||||
|
f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
|
||||||
|
f"Possible causes:\n"
|
||||||
|
f"1. Another process is stuck in migration (check logs)\n"
|
||||||
|
f"2. Database file permissions issue\n"
|
||||||
|
f"3. Disk I/O problems\n"
|
||||||
|
f"Action: Restart container with single worker to diagnose"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 9. Logging Levels
|
||||||
|
**Q: What log level for lock waits?**
|
||||||
|
|
||||||
|
**Answer: Graduated approach**
|
||||||
|
- Retry 1-3: DEBUG (normal operation)
|
||||||
|
- Retry 4-7: INFO (getting concerning)
|
||||||
|
- Retry 8+: WARNING (abnormal)
|
||||||
|
- Exhausted: ERROR (operation failed)
|
||||||
|
- Pattern:
|
||||||
|
```python
|
||||||
|
if retry_count <= 3:
|
||||||
|
level = logging.DEBUG
|
||||||
|
elif retry_count <= 7:
|
||||||
|
level = logging.INFO
|
||||||
|
else:
|
||||||
|
level = logging.WARNING
|
||||||
|
logger.log(level, f"Retry {retry_count}/{max_retries}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### 10. Index Creation Failure
|
||||||
|
**Q: How to handle index creation failures in migration 002?**
|
||||||
|
|
||||||
|
**Answer: Fail fast with clear context**
|
||||||
|
```python
|
||||||
|
for index_name, index_sql in indexes_to_create:
|
||||||
|
try:
|
||||||
|
conn.execute(index_sql)
|
||||||
|
except sqlite3.OperationalError as e:
|
||||||
|
if "already exists" in str(e):
|
||||||
|
logger.debug(f"Index {index_name} already exists")
|
||||||
|
else:
|
||||||
|
raise MigrationError(
|
||||||
|
f"Failed to create index {index_name}: {e}\n"
|
||||||
|
f"SQL: {index_sql}"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing Strategy
|
||||||
|
|
||||||
|
### 11. Concurrent Testing Simulation
|
||||||
|
**Q: How to properly simulate concurrent worker startup?**
|
||||||
|
|
||||||
|
**Answer: Multiple approaches**
|
||||||
|
```python
|
||||||
|
# Approach 1: Barrier synchronization
|
||||||
|
def test_concurrent_migrations():
|
||||||
|
barrier = multiprocessing.Barrier(4)
|
||||||
|
|
||||||
|
def worker():
|
||||||
|
barrier.wait() # All start together
|
||||||
|
return run_migrations(db_path)
|
||||||
|
|
||||||
|
with multiprocessing.Pool(4) as pool:
|
||||||
|
results = pool.map(worker, range(4))
|
||||||
|
|
||||||
|
# Approach 2: Process start
|
||||||
|
processes = []
|
||||||
|
for i in range(4):
|
||||||
|
p = Process(target=run_migrations, args=(db_path,))
|
||||||
|
processes.append(p)
|
||||||
|
for p in processes:
|
||||||
|
p.start() # Near-simultaneous
|
||||||
|
```
|
||||||
|
|
||||||
|
### 12. Lock Contention Testing
|
||||||
|
**Q: How to test lock contention scenarios?**
|
||||||
|
|
||||||
|
**Answer: Inject delays**
|
||||||
|
```python
|
||||||
|
# Test helper to force contention
|
||||||
|
def slow_migration_for_testing(conn):
|
||||||
|
conn.execute("BEGIN IMMEDIATE")
|
||||||
|
time.sleep(2) # Force other workers to wait
|
||||||
|
# Apply migration
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
|
# Test timeout handling
|
||||||
|
@patch('sqlite3.connect')
|
||||||
|
def test_lock_timeout(mock_connect):
|
||||||
|
mock_connect.side_effect = sqlite3.OperationalError("database is locked")
|
||||||
|
# Verify retry logic
|
||||||
|
```
|
||||||
|
|
||||||
|
### 13. Performance Tests
|
||||||
|
**Q: What timing is acceptable?**
|
||||||
|
|
||||||
|
**Answer: Performance targets**
|
||||||
|
- Single worker: < 100ms for all migrations
|
||||||
|
- 4 workers with contention: < 500ms total
|
||||||
|
- 10 workers stress test: < 2s total
|
||||||
|
- Lock acquisition per retry: < 50ms
|
||||||
|
- Test with:
|
||||||
|
```python
|
||||||
|
import timeit
|
||||||
|
setup_time = timeit.timeit(lambda: create_app(), number=1)
|
||||||
|
assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 14. Retry Logic Unit Tests
|
||||||
|
**Q: How to unit test retry logic?**
|
||||||
|
|
||||||
|
**Answer: Mock the lock failures**
|
||||||
|
```python
|
||||||
|
class TestRetryLogic:
|
||||||
|
def test_retry_on_lock(self):
|
||||||
|
with patch('sqlite3.connect') as mock:
|
||||||
|
# First 2 attempts fail, 3rd succeeds
|
||||||
|
mock.side_effect = [
|
||||||
|
sqlite3.OperationalError("database is locked"),
|
||||||
|
sqlite3.OperationalError("database is locked"),
|
||||||
|
MagicMock() # Success
|
||||||
|
]
|
||||||
|
run_migrations(db_path)
|
||||||
|
assert mock.call_count == 3
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## SQLite-Specific Concerns
|
||||||
|
|
||||||
|
### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
|
||||||
|
**Q: Deep dive on lock choice?**
|
||||||
|
|
||||||
|
**Answer: Lock escalation path**
|
||||||
|
```
|
||||||
|
BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
|
||||||
|
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
|
||||||
|
BEGIN EXCLUSIVE → EXCLUSIVE
|
||||||
|
|
||||||
|
For migrations:
|
||||||
|
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
|
||||||
|
- Escalates to EXCLUSIVE only during actual writes
|
||||||
|
- Optimal for our use case
|
||||||
|
```
|
||||||
|
|
||||||
|
### 16. WAL Mode Interaction
|
||||||
|
**Q: How does this work with WAL mode?**
|
||||||
|
|
||||||
|
**Answer: Works correctly with both modes**
|
||||||
|
- Journal mode: BEGIN IMMEDIATE works as described
|
||||||
|
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
|
||||||
|
- No code changes needed
|
||||||
|
- Add mode detection for logging:
|
||||||
|
```python
|
||||||
|
cursor = conn.execute("PRAGMA journal_mode")
|
||||||
|
mode = cursor.fetchone()[0]
|
||||||
|
logger.debug(f"Database in {mode} mode")
|
||||||
|
```
|
||||||
|
|
||||||
|
### 17. Database File Permissions
|
||||||
|
**Q: How to handle permission issues?**
|
||||||
|
|
||||||
|
**Answer: Fail fast with helpful diagnostics**
|
||||||
|
```python
|
||||||
|
import os
|
||||||
|
import stat
|
||||||
|
|
||||||
|
db_path = Path(db_path)
|
||||||
|
if not db_path.exists():
|
||||||
|
# Will be created - check parent dir
|
||||||
|
parent = db_path.parent
|
||||||
|
if not os.access(parent, os.W_OK):
|
||||||
|
raise MigrationError(f"Cannot write to directory: {parent}")
|
||||||
|
else:
|
||||||
|
# Check existing file
|
||||||
|
if not os.access(db_path, os.W_OK):
|
||||||
|
stats = os.stat(db_path)
|
||||||
|
mode = stat.filemode(stats.st_mode)
|
||||||
|
raise MigrationError(
|
||||||
|
f"Database not writable: {db_path}\n"
|
||||||
|
f"Permissions: {mode}\n"
|
||||||
|
f"Owner: {stats.st_uid}:{stats.st_gid}"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deployment/Operations
|
||||||
|
|
||||||
|
### 18. Container Startup and Health Checks
|
||||||
|
**Q: How to handle health checks during migration?**
|
||||||
|
|
||||||
|
**Answer: Return 503 during migration**
|
||||||
|
```python
|
||||||
|
# In app.py
|
||||||
|
MIGRATION_IN_PROGRESS = False
|
||||||
|
|
||||||
|
def create_app():
|
||||||
|
global MIGRATION_IN_PROGRESS
|
||||||
|
MIGRATION_IN_PROGRESS = True
|
||||||
|
try:
|
||||||
|
init_db()
|
||||||
|
finally:
|
||||||
|
MIGRATION_IN_PROGRESS = False
|
||||||
|
|
||||||
|
@app.route('/health')
|
||||||
|
def health():
|
||||||
|
if MIGRATION_IN_PROGRESS:
|
||||||
|
return {'status': 'migrating'}, 503
|
||||||
|
return {'status': 'healthy'}, 200
|
||||||
|
```
|
||||||
|
|
||||||
|
### 19. Monitoring and Alerting
|
||||||
|
**Q: What metrics/alerts are needed?**
|
||||||
|
|
||||||
|
**Answer: Key metrics to track**
|
||||||
|
```python
|
||||||
|
# Add metrics collection
|
||||||
|
metrics = {
|
||||||
|
'migration_duration_ms': 0,
|
||||||
|
'migration_retries': 0,
|
||||||
|
'migration_lock_wait_ms': 0,
|
||||||
|
'migrations_applied': 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Alert thresholds
|
||||||
|
ALERTS = {
|
||||||
|
'migration_duration_ms': 5000, # Alert if > 5s
|
||||||
|
'migration_retries': 5, # Alert if > 5 retries
|
||||||
|
'worker_failures': 1 # Alert on any failure
|
||||||
|
}
|
||||||
|
|
||||||
|
# Log in structured format
|
||||||
|
logger.info(json.dumps({
|
||||||
|
'event': 'migration_complete',
|
||||||
|
'metrics': metrics
|
||||||
|
}))
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Alternative Approaches
|
||||||
|
|
||||||
|
### 20. Version Compatibility
|
||||||
|
**Q: How to handle version mismatches?**
|
||||||
|
|
||||||
|
**Answer: Strict version checking**
|
||||||
|
```python
|
||||||
|
# In migrations.py
|
||||||
|
MIGRATION_VERSION = "1.0.0"
|
||||||
|
|
||||||
|
def check_version_compatibility(conn):
|
||||||
|
cursor = conn.execute(
|
||||||
|
"SELECT value FROM app_config WHERE key = 'migration_version'"
|
||||||
|
)
|
||||||
|
row = cursor.fetchone()
|
||||||
|
if row and row[0] != MIGRATION_VERSION:
|
||||||
|
raise MigrationError(
|
||||||
|
f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
|
||||||
|
f"Action: Run migration tool separately"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 21. File-Based Locking
|
||||||
|
**Q: Should we consider flock() as backup?**
|
||||||
|
|
||||||
|
**Answer: NO - Adds complexity without benefit**
|
||||||
|
- SQLite locking is sufficient and portable
|
||||||
|
- flock() not available on all systems
|
||||||
|
- Would require additional cleanup logic
|
||||||
|
- Database-level locking is the correct approach
|
||||||
|
|
||||||
|
### 22. Gunicorn Preload
|
||||||
|
**Q: Would --preload flag help?**
|
||||||
|
|
||||||
|
**Answer: NO - Makes problem WORSE**
|
||||||
|
- --preload runs app initialization ONCE in master
|
||||||
|
- Workers fork from master AFTER migrations complete
|
||||||
|
- BUT: Doesn't work with lazy-loaded resources
|
||||||
|
- Current architecture expects per-worker initialization
|
||||||
|
- Keep current approach
|
||||||
|
|
||||||
|
### 23. Application-Level Locks
|
||||||
|
**Q: Should we add Redis/memcached for coordination?**
|
||||||
|
|
||||||
|
**Answer: NO - Violates simplicity principle**
|
||||||
|
- Adds external dependency
|
||||||
|
- More complex deployment
|
||||||
|
- SQLite locking is sufficient
|
||||||
|
- Would require Redis/memcached to be running before app starts
|
||||||
|
- Solving a solved problem
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Final Implementation Checklist
|
||||||
|
|
||||||
|
### Required Changes
|
||||||
|
|
||||||
|
1. ✅ Add imports: `time`, `random`
|
||||||
|
2. ✅ Implement retry loop with exponential backoff
|
||||||
|
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
|
||||||
|
4. ✅ Add graduated logging levels
|
||||||
|
5. ✅ Proper error messages with diagnostics
|
||||||
|
6. ✅ Fresh connection per retry
|
||||||
|
7. ✅ Total timeout check (2 minutes max)
|
||||||
|
8. ✅ Preserve all existing migration logic
|
||||||
|
|
||||||
|
### Test Coverage Required
|
||||||
|
|
||||||
|
1. ✅ Unit test: Retry on lock
|
||||||
|
2. ✅ Unit test: Exhaustion handling
|
||||||
|
3. ✅ Integration test: 4 workers with multiprocessing
|
||||||
|
4. ✅ System test: gunicorn with 4 workers
|
||||||
|
5. ✅ Container test: Full deployment simulation
|
||||||
|
6. ✅ Performance test: < 500ms with contention
|
||||||
|
|
||||||
|
### Documentation Updates
|
||||||
|
|
||||||
|
1. ✅ Update ADR-022 with final decision
|
||||||
|
2. ✅ Add operational runbook for migration issues
|
||||||
|
3. ✅ Document monitoring metrics
|
||||||
|
4. ✅ Update deployment guide with health check info
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Go/No-Go Decision
|
||||||
|
|
||||||
|
### ✅ GO FOR IMPLEMENTATION
|
||||||
|
|
||||||
|
**Rationale:**
|
||||||
|
- All 23 questions have concrete answers
|
||||||
|
- Design is proven with SQLite's native capabilities
|
||||||
|
- No external dependencies needed
|
||||||
|
- Risk is low with clear rollback plan
|
||||||
|
- Testing strategy is comprehensive
|
||||||
|
|
||||||
|
**Implementation Priority: IMMEDIATE**
|
||||||
|
- This is blocking v1.0.0-rc.4 release
|
||||||
|
- Production systems affected
|
||||||
|
- Fix is well-understood and low-risk
|
||||||
|
|
||||||
|
**Next Steps:**
|
||||||
|
1. Implement changes to migrations.py as specified
|
||||||
|
2. Run test suite at all levels
|
||||||
|
3. Deploy as hotfix v1.0.0-rc.3.1
|
||||||
|
4. Monitor metrics in production
|
||||||
|
5. Document lessons learned
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Document Version: 1.0*
|
||||||
|
*Created: 2025-11-24*
|
||||||
|
*Status: Approved for Implementation*
|
||||||
|
*Author: StarPunk Architecture Team*
|
||||||

docs/decisions/ADR-022-migration-race-condition-fix.md (new file, 208 lines)

# ADR-022: Database Migration Race Condition Resolution

## Status
Accepted

## Context

In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.

When the container starts fresh, all 4 workers start simultaneously and attempt to:
1. Create the `schema_migrations` table
2. Apply pending migrations
3. Insert records into `schema_migrations`

This causes a race condition where:
- Worker 1 successfully applies the migrations and inserts the tracking records
- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
- Failed workers crash, causing container restarts
- After the restart the migrations are already applied, so the second attempt succeeds (a minimal reproduction sketch follows this list)
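
The failure mode can be reproduced outside gunicorn with two plain SQLite connections. This is an illustrative sketch (the table DDL is simplified) and is not part of the fix.

```python
import sqlite3

DB = "race-demo.db"

setup = sqlite3.connect(DB)
setup.execute(
    "CREATE TABLE IF NOT EXISTS schema_migrations (migration_name TEXT UNIQUE)"
)
setup.commit()

worker1 = sqlite3.connect(DB)
worker2 = sqlite3.connect(DB)

# Worker 1 wins the race and records the migration
worker1.execute(
    "INSERT INTO schema_migrations (migration_name) VALUES ('001_initial_schema.sql')"
)
worker1.commit()

# Worker 2 attempts the same insert and fails, exactly as seen at container startup
try:
    worker2.execute(
        "INSERT INTO schema_migrations (migration_name) VALUES ('001_initial_schema.sql')"
    )
    worker2.commit()
except sqlite3.IntegrityError as exc:
    print(f"Worker 2 crashed: {exc}")  # UNIQUE constraint failed: schema_migrations.migration_name
```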

## Decision

We will implement **database-level advisory locking** using SQLite's transaction mechanism with IMMEDIATE mode, combined with retry logic. This approach:

1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
2. Implements exponential backoff retry for workers that can't acquire the lock
3. Ensures only one worker can run migrations at a time
4. Other workers wait and verify migrations are complete

This is the simplest, most robust solution that:
- Requires minimal code changes
- Uses SQLite's native capabilities
- Doesn't require external dependencies
- Works across all deployment scenarios

## Rationale

### Options Considered

1. **File-based locking (fcntl)**
   - Pro: Simple to implement
   - Con: Doesn't work across containers/network filesystems
   - Con: Lock files can be orphaned if process crashes

2. **Run migrations before workers start**
   - Pro: Cleanest separation of concerns
   - Con: Requires container entrypoint script changes
   - Con: Complicates development workflow
   - Con: Doesn't fix the root cause for non-container deployments

3. **Make migration insertion idempotent (INSERT OR IGNORE)**
   - Pro: Simple SQL change
   - Con: Doesn't prevent parallel migration execution
   - Con: Could corrupt database if migrations partially apply
   - Con: Masks the real problem

4. **Database advisory locking (CHOSEN)**
   - Pro: Uses SQLite's native transaction locking
   - Pro: Guaranteed atomicity
   - Pro: Works across all deployment scenarios
   - Pro: Self-cleaning (no orphaned locks)
   - Con: Requires retry logic

### Why Database Locking?

SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing. This provides:

1. **Atomicity**: Either all migrations apply or none do
2. **Isolation**: Only one worker can modify schema at a time
3. **Automatic cleanup**: Locks released on connection close/crash
4. **No external dependencies**: Uses SQLite's built-in features

## Implementation

The fix will be implemented in `/home/phil/Projects/starpunk/starpunk/migrations.py`:

```python
def run_migrations(db_path, logger=None):
    """Run all pending database migrations with concurrency protection"""

    max_retries = 10
    retry_count = 0
    base_delay = 0.1  # 100ms

    while retry_count < max_retries:
        conn = None  # ensure the name exists for the finally block
        try:
            conn = sqlite3.connect(db_path, timeout=30.0)

            # Acquire write lock for migrations
            conn.execute("BEGIN IMMEDIATE")

            try:
                # Create migrations table if needed
                create_migrations_table(conn)

                # Check if another worker already ran migrations
                cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
                if cursor.fetchone()[0] > 0:
                    # Migrations already run by another worker
                    conn.commit()
                    logger.info("Migrations already applied by another worker")
                    return

                # Run migration logic (existing code)
                # ... rest of migration code ...

                conn.commit()
                return  # Success

            except Exception:
                conn.rollback()
                raise

        except sqlite3.OperationalError as e:
            if "database is locked" in str(e):
                retry_count += 1
                delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)

                if retry_count < max_retries:
                    logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
                    time.sleep(delay)
                else:
                    raise MigrationError(f"Failed to acquire migration lock after {max_retries} attempts")
            else:
                raise

        finally:
            if conn:
                conn.close()
```

Additional changes needed:

1. Add imports: `import time`, `import random`
2. Increase the connection timeout from the default 5s to 30s
3. Add early check for already-applied migrations
4. Wrap entire migration process in IMMEDIATE transaction

## Consequences

### Positive
- Eliminates race condition completely
- No container configuration changes needed
- Works in all deployment scenarios (container, systemd, manual)
- Minimal code changes (~50 lines)
- Self-healing (no manual lock cleanup needed)
- Provides clear logging of what's happening

### Negative
- Slight startup delay for workers that wait (100ms-2s typical)
- Adds complexity to migration runner
- Requires careful testing of retry logic

### Neutral
- Workers start sequentially for the migration phase, then run in parallel
- First worker to acquire the lock runs migrations for all
- Log output will show retry attempts (useful for debugging)

## Testing Strategy

1. **Unit test with mock**: Test retry logic with simulated lock contention
2. **Integration test**: Spawn multiple processes, verify only one runs migrations (see the sketch after this list)
3. **Container test**: Build container, verify clean startup with 4 workers
4. **Stress test**: Start 20 processes simultaneously, verify correctness
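
A sketch of the integration test, under the assumption that `run_migrations(db_path)` can be imported from `starpunk.migrations` and called against a shared temporary database; the import path and test names are illustrative.

```python
import multiprocessing
import sqlite3

from starpunk.migrations import run_migrations  # assumed import path


def _worker(db_path):
    run_migrations(db_path)


def test_only_one_worker_applies_each_migration(tmp_path):
    db_path = str(tmp_path / "concurrent.db")

    procs = [multiprocessing.Process(target=_worker, args=(db_path,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
        assert p.exitcode == 0, "a worker crashed during migration"

    # Every migration must be recorded exactly once, regardless of which worker won
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT migration_name, COUNT(*) FROM schema_migrations GROUP BY migration_name"
    ).fetchall()
    assert rows, "no migrations were recorded"
    assert all(count == 1 for _, count in rows), "a migration was recorded more than once"
```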

## Migration Path

1. Implement fix in `starpunk/migrations.py`
2. Test locally with multiple workers
3. Build and test container
4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
5. Monitor production logs for retry patterns

## Implementation Notes (Post-Analysis)

Based on comprehensive architectural review, the following clarifications have been established:

### Critical Implementation Details

1. **Connection Management**: Create a NEW connection for each retry attempt (no reuse)
2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
4. **Logging Levels**: Graduated (DEBUG for retries 1-3, INFO for 4-7, WARNING for 8+)
5. **Transaction Boundaries**: Separate transactions for schema/migrations/data

### Test Requirements

- Unit tests with multiprocessing.Pool
- Integration tests with actual gunicorn
- Container tests with full deployment
- Performance target: <500ms with 4 workers

### Documentation

- Full Q&A: `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
- Implementation Guide: `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md`
- Quick Reference: `/home/phil/Projects/starpunk/docs/architecture/migration-fix-quick-reference.md`

## References

- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
- Issue: Production migration race condition with gunicorn workers

## Status Update

**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.

docs/reports/migration-race-condition-fix-implementation.md (new file, 431 lines)

|
|||||||
|
# Migration Race Condition Fix - Implementation Guide
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers racing to apply migrations causes container startup failures.
|
||||||
|
|
||||||
|
**Solution**: Implement database-level advisory locking with retry logic in `migrations.py`.
|
||||||
|
|
||||||
|
**Urgency**: HIGH - This is a blocker for v1.0.0-rc.4 release.
|
||||||
|
|
||||||
|
## Root Cause Analysis
|
||||||
|
|
||||||
|
### The Problem Flow
|
||||||
|
|
||||||
|
1. Container starts with `gunicorn --workers 4`
|
||||||
|
2. Each worker independently calls:
|
||||||
|
```
|
||||||
|
app.py → create_app() → init_db() → run_migrations()
|
||||||
|
```
|
||||||
|
3. All 4 workers simultaneously try to:
|
||||||
|
- INSERT into schema_migrations table
|
||||||
|
- Apply the same migrations
|
||||||
|
4. SQLite's UNIQUE constraint on migration_name causes workers 2-4 to crash
|
||||||
|
5. Container restarts, works on second attempt (migrations already applied)
|
||||||
|
|
||||||
|
### Why This Happens
|
||||||
|
|
||||||
|
- **No synchronization**: Workers are independent processes
|
||||||
|
- **No locking**: Migration code doesn't prevent concurrent execution
|
||||||
|
- **Immediate failure**: UNIQUE constraint violation crashes the worker
|
||||||
|
- **Gunicorn behavior**: Worker crash triggers container restart
|
||||||
|
|
||||||
|
## Immediate Fix Implementation
|
||||||
|
|
||||||
|
### Step 1: Update migrations.py
|
||||||
|
|
||||||
|
Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import time
|
||||||
|
import random
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Replace run_migrations function
|
||||||
|
|
||||||
|
Replace the entire `run_migrations` function (lines 304-462) with:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def run_migrations(db_path, logger=None):
|
||||||
|
"""
|
||||||
|
Run all pending database migrations with concurrency protection
|
||||||
|
|
||||||
|
Uses database-level locking to prevent race conditions when multiple
|
||||||
|
workers start simultaneously. Only one worker will apply migrations;
|
||||||
|
others will wait and verify completion.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
db_path: Path to SQLite database file
|
||||||
|
logger: Optional logger for output
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
MigrationError: If any migration fails to apply or lock cannot be acquired
|
||||||
|
"""
|
||||||
|
if logger is None:
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Determine migrations directory
|
||||||
|
migrations_dir = Path(__file__).parent.parent / "migrations"
|
||||||
|
|
||||||
|
if not migrations_dir.exists():
|
||||||
|
logger.warning(f"Migrations directory not found: {migrations_dir}")
|
||||||
|
return
|
||||||
|
|
||||||
|
# Retry configuration for lock acquisition
|
||||||
|
max_retries = 10
|
||||||
|
retry_count = 0
|
||||||
|
base_delay = 0.1 # 100ms
|
||||||
|
|
||||||
|
while retry_count < max_retries:
|
||||||
|
conn = None
|
||||||
|
try:
|
||||||
|
# Connect with longer timeout for lock contention
|
||||||
|
conn = sqlite3.connect(db_path, timeout=30.0)
|
||||||
|
|
||||||
|
# Attempt to acquire exclusive lock for migrations
|
||||||
|
# BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
|
||||||
|
conn.execute("BEGIN IMMEDIATE")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Ensure migrations tracking table exists
|
||||||
|
create_migrations_table(conn)
|
||||||
|
|
||||||
|
# Quick check: have migrations already been applied by another worker?
|
||||||
|
cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
|
||||||
|
migration_count = cursor.fetchone()[0]
|
||||||
|
|
||||||
|
# Discover migration files
|
||||||
|
migration_files = discover_migration_files(migrations_dir)
|
||||||
|
|
||||||
|
if not migration_files:
|
||||||
|
conn.commit()
|
||||||
|
logger.info("No migration files found")
|
||||||
|
return
|
||||||
|
|
||||||
|
# If migrations exist and we're not the first worker, verify and exit
|
||||||
|
if migration_count > 0:
|
||||||
|
# Check if all migrations are applied
|
||||||
|
applied = get_applied_migrations(conn)
|
||||||
|
pending = [m for m, _ in migration_files if m not in applied]
|
||||||
|
|
||||||
|
if not pending:
|
||||||
|
conn.commit()
|
||||||
|
logger.debug("All migrations already applied by another worker")
|
||||||
|
return
|
||||||
|
# If there are pending migrations, we continue to apply them
|
||||||
|
logger.info(f"Found {len(pending)} pending migrations to apply")
|
||||||
|
|
||||||
|
# Fresh database detection (original logic preserved)
|
||||||
|
if migration_count == 0:
|
||||||
|
if is_schema_current(conn):
|
||||||
|
# Schema is current - mark all migrations as applied
|
||||||
|
for migration_name, _ in migration_files:
|
||||||
|
conn.execute(
|
||||||
|
"INSERT INTO schema_migrations (migration_name) VALUES (?)",
|
||||||
|
(migration_name,)
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
logger.info(
|
||||||
|
f"Fresh database detected: marked {len(migration_files)} "
|
||||||
|
f"migrations as applied (schema already current)"
|
||||||
|
)
|
||||||
|
return
|
||||||
|
else:
|
||||||
|
logger.info("Fresh database with partial schema: applying needed migrations")
|
||||||
|
|
||||||
|
# Get already-applied migrations
|
||||||
|
applied = get_applied_migrations(conn)
|
||||||
|
|
||||||
|
# Apply pending migrations (original logic preserved)
|
||||||
|
pending_count = 0
|
||||||
|
skipped_count = 0
|
||||||
|
for migration_name, migration_path in migration_files:
|
||||||
|
if migration_name not in applied:
|
||||||
|
# Check if migration is actually needed
|
||||||
|
should_check_needed = (
|
||||||
|
migration_count == 0 or
|
||||||
|
migration_name == "002_secure_tokens_and_authorization_codes.sql"
|
||||||
|
)
|
||||||
|
|
||||||
|
if should_check_needed and not is_migration_needed(conn, migration_name):
|
||||||
|
# Special handling for migration 002: if tables exist but indexes don't
|
||||||
|
if migration_name == "002_secure_tokens_and_authorization_codes.sql":
|
||||||
|
# Check if we need to create indexes
|
||||||
|
indexes_to_create = []
|
||||||
|
if not index_exists(conn, 'idx_tokens_hash'):
|
||||||
|
indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
|
||||||
|
if not index_exists(conn, 'idx_tokens_me'):
|
||||||
|
indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
|
||||||
|
if not index_exists(conn, 'idx_tokens_expires'):
|
||||||
|
indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
|
||||||
|
if not index_exists(conn, 'idx_auth_codes_hash'):
|
||||||
|
indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
|
||||||
|
if not index_exists(conn, 'idx_auth_codes_expires'):
|
||||||
|
indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")
|
||||||
|
|
||||||
|
if indexes_to_create:
|
||||||
|
for index_sql in indexes_to_create:
|
||||||
|
conn.execute(index_sql)
|
||||||
|
logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")
|
||||||
|
|
||||||
|
# Mark as applied without executing full migration
|
||||||
|
conn.execute(
|
||||||
|
"INSERT INTO schema_migrations (migration_name) VALUES (?)",
|
||||||
|
(migration_name,)
|
||||||
|
)
|
||||||
|
skipped_count += 1
|
||||||
|
logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
|
||||||
|
else:
|
||||||
|
# Apply the migration (within our transaction)
|
||||||
|
try:
|
||||||
|
# Read migration SQL
|
||||||
|
migration_sql = migration_path.read_text()
|
||||||
|
|
||||||
|
logger.debug(f"Applying migration: {migration_name}")
|
||||||
|
|
||||||
|
# Execute migration (already in transaction)
|
||||||
|
conn.executescript(migration_sql)
|
||||||
|
|
||||||
|
# Record migration as applied
|
||||||
|
conn.execute(
|
||||||
|
"INSERT INTO schema_migrations (migration_name) VALUES (?)",
|
||||||
|
(migration_name,)
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.info(f"Applied migration: {migration_name}")
|
||||||
|
pending_count += 1
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
# Roll back the transaction
|
||||||
|
raise MigrationError(f"Migration {migration_name} failed: {e}")
|
||||||
|
|
||||||
|
# Commit all migrations atomically
|
||||||
|
conn.commit()
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
total_count = len(migration_files)
|
||||||
|
if pending_count > 0 or skipped_count > 0:
|
||||||
|
if skipped_count > 0:
|
||||||
|
logger.info(
|
||||||
|
f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
|
||||||
|
f"(already in SCHEMA_SQL), {total_count} total"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
logger.info(
|
||||||
|
f"Migrations complete: {pending_count} applied, "
|
||||||
|
f"{total_count} total"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
logger.info(f"All migrations up to date ({total_count} total)")
|
||||||
|
|
||||||
|
return # Success!
|
||||||
|
|
||||||
|
except MigrationError:
|
||||||
|
conn.rollback()
|
||||||
|
raise
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
conn.rollback()
|
||||||
|
raise MigrationError(f"Migration system error: {e}")
|
||||||
|
|
||||||
|
except sqlite3.OperationalError as e:
|
||||||
|
if "database is locked" in str(e).lower():
|
||||||
|
# Another worker has the lock, retry with exponential backoff
|
||||||
|
retry_count += 1
|
||||||
|
|
||||||
|
if retry_count < max_retries:
|
||||||
|
# Exponential backoff with jitter
|
||||||
|
delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
|
||||||
|
logger.debug(
|
||||||
|
f"Database locked by another worker, retry {retry_count}/{max_retries} "
|
||||||
|
f"in {delay:.2f}s"
|
||||||
|
)
|
||||||
|
time.sleep(delay)
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
raise MigrationError(
|
||||||
|
f"Failed to acquire migration lock after {max_retries} attempts. "
|
||||||
|
f"This may indicate a hung migration process."
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
# Non-lock related database error
|
||||||
|
error_msg = f"Database error during migration: {e}"
|
||||||
|
logger.error(error_msg)
|
||||||
|
raise MigrationError(error_msg)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
# Unexpected error
|
||||||
|
error_msg = f"Unexpected error during migration: {e}"
|
||||||
|
logger.error(error_msg)
|
||||||
|
raise MigrationError(error_msg)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if conn:
|
||||||
|
try:
|
||||||
|
conn.close()
|
||||||
|
except:
|
||||||
|
pass # Ignore errors during cleanup
|
||||||
|
|
||||||
|
# Should never reach here, but just in case
|
||||||
|
raise MigrationError("Migration retry loop exited unexpectedly")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Testing the Fix
|
||||||
|
|
||||||
|
Create a test script to verify the fix works:
|
||||||
|
|
||||||
|
```python
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Test migration race condition fix"""
|
||||||
|
|
||||||
|
import multiprocessing
|
||||||
|
import time
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Add project to path
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
|
||||||
|
def worker_init(worker_id):
|
||||||
|
"""Simulate a gunicorn worker starting"""
|
||||||
|
print(f"Worker {worker_id}: Starting...")
|
||||||
|
|
||||||
|
try:
|
||||||
|
from starpunk import create_app
|
||||||
|
app = create_app()
|
||||||
|
print(f"Worker {worker_id}: Successfully initialized")
|
||||||
|
return True
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Worker {worker_id}: FAILED - {e}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Test with 10 workers (more than production to stress test)
|
||||||
|
num_workers = 10
|
||||||
|
|
||||||
|
print(f"Starting {num_workers} workers simultaneously...")
|
||||||
|
|
||||||
|
with multiprocessing.Pool(num_workers) as pool:
|
||||||
|
results = pool.map(worker_init, range(num_workers))
|
||||||
|
|
||||||
|
success_count = sum(results)
|
||||||
|
print(f"\nResults: {success_count}/{num_workers} workers succeeded")
|
||||||
|
|
||||||
|
if success_count == num_workers:
|
||||||
|
print("SUCCESS: All workers initialized without race condition")
|
||||||
|
sys.exit(0)
|
||||||
|
else:
|
||||||
|
print("FAILURE: Race condition still present")
|
||||||
|
sys.exit(1)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Verification Steps
|
||||||
|
|
||||||
|
1. **Local Testing**:
|
||||||
|
```bash
|
||||||
|
# Test with multiple workers
|
||||||
|
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
|
||||||
|
|
||||||
|
# Check logs for retry messages
|
||||||
|
# Should see "Database locked by another worker, retry..." messages
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Container Testing**:
|
||||||
|
```bash
|
||||||
|
# Build container
|
||||||
|
podman build -t starpunk:test -f Containerfile .
|
||||||
|
|
||||||
|
# Run with fresh database
|
||||||
|
podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test
|
||||||
|
|
||||||
|
# Should start cleanly without restarts
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Log Verification**:
|
||||||
|
Look for these patterns:
|
||||||
|
- One worker: "Applied migration: XXX"
|
||||||
|
- Other workers: "Database locked by another worker, retry..."
|
||||||
|
- Final: "All migrations already applied by another worker"
|
||||||
|
|
||||||
|
## Risk Assessment
|
||||||
|
|
||||||
|
### Risk Level: LOW
|
||||||
|
|
||||||
|
The fix is safe because:
|
||||||
|
1. Uses SQLite's native transaction mechanism
|
||||||
|
2. Preserves all existing migration logic
|
||||||
|
3. Only adds retry wrapper around existing code
|
||||||
|
4. Fails safely with clear error messages
|
||||||
|
5. No data loss possible (transactions ensure atomicity)
|
||||||
|
|
||||||
|
### Rollback Plan
|
||||||
|
|
||||||
|
If issues occur:
|
||||||
|
1. Revert to previous version
|
||||||
|
2. Start container with single worker temporarily: `--workers 1`
|
||||||
|
3. Once migrations apply, scale back to 4 workers
|
||||||
|
|
||||||
|
## Release Strategy
|
||||||
|
|
||||||
|
### Option 1: Hotfix (Recommended)
|
||||||
|
- Release as v1.0.0-rc.3.1
|
||||||
|
- Immediate deployment to fix production issue
|
||||||
|
- Minimal testing required (focused fix)
|
||||||
|
|
||||||
|
### Option 2: Include in rc.4
|
||||||
|
- Bundle with other rc.4 changes
|
||||||
|
- More testing time
|
||||||
|
- Risk: Production remains broken until rc.4
|
||||||
|
|
||||||
|
**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.
|
||||||
|
|
||||||
|
## Alternative Workarounds (If Needed Urgently)
|
||||||
|
|
||||||
|
While the proper fix is implemented, these temporary workarounds can be used:
|
||||||
|
|
||||||
|
### Workaround 1: Single Worker Startup
|
||||||
|
```bash
|
||||||
|
# In Containerfile, temporarily change:
|
||||||
|
CMD ["gunicorn", "--workers", "1", ...]
|
||||||
|
|
||||||
|
# After first successful start, rebuild with 4 workers
|
||||||
|
```
|
||||||
|
|
||||||
|
### Workaround 2: Pre-migration Script
|
||||||
|
```bash
|
||||||
|
# Add entrypoint script that runs migrations before gunicorn
|
||||||
|
#!/bin/bash
|
||||||
|
python3 -c "from starpunk.database import init_db; init_db()"
|
||||||
|
exec gunicorn --workers 4 ...
|
||||||
|
```
|
||||||
|
|
||||||
|
### Workaround 3: Delayed Worker Startup
|
||||||
|
```bash
|
||||||
|
# Stagger worker startup with --preload
|
||||||
|
gunicorn --preload --workers 4 ...
|
||||||
|
```
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
- **Problem**: Race condition when multiple workers apply migrations
|
||||||
|
- **Solution**: Database-level locking with retry logic
|
||||||
|
- **Implementation**: ~150 lines of code changes in migrations.py
|
||||||
|
- **Testing**: Verify with multi-worker startup
|
||||||
|
- **Risk**: LOW - Safe, atomic changes
|
||||||
|
- **Urgency**: HIGH - Blocks production deployment
|
||||||
|
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately
|
||||||
|
|
||||||
|
## Developer Questions Answered
|
||||||
|
|
||||||
|
All 23 architectural questions have been comprehensively answered in:
|
||||||
|
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
|
||||||
|
|
||||||
|
**Key Decisions:**
|
||||||
|
- NEW connection per retry (not reused)
|
||||||
|
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
|
||||||
|
- Separate transactions for each operation
|
||||||
|
- Both multiprocessing.Pool AND gunicorn testing needed
|
||||||
|
- 30s timeout per attempt, 120s total maximum
|
||||||
|
- Graduated logging levels based on retry count
|
||||||
|
|
||||||
|
**Implementation Status: READY TO PROCEED**
|
||||||