docs: Add architect documentation for migration race condition fix

Add comprehensive architectural documentation for the migration race
condition fix, including:

- ADR-022: Architectural decision record for the fix
- migration-race-condition-answers.md: All 23 Q&A answered
- migration-fix-quick-reference.md: Implementation checklist
- migration-race-condition-fix-implementation.md: Detailed guide

These documents guided the implementation in v1.0.0-rc.5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
2025-11-24 18:53:55 -07:00
parent 686d753fb9
commit 2240414f22
4 changed files with 1354 additions and 0 deletions

View File

@@ -0,0 +1,238 @@
# Migration Race Condition Fix - Quick Implementation Reference
## Implementation Checklist
### Code Changes - `/home/phil/Projects/starpunk/starpunk/migrations.py`
```python
# 1. Add imports at top
import time
import random
# 2. Replace entire run_migrations function (lines 304-462)
# See full implementation in migration-race-condition-fix-implementation.md
# Key patterns to implement:
# A. Retry loop structure
max_retries = 10
retry_count = 0
base_delay = 0.1
start_time = time.time()
max_total_time = 120 # 2 minute absolute max
while retry_count < max_retries and (time.time() - start_time) < max_total_time:
conn = None # NEW connection each iteration
try:
conn = sqlite3.connect(db_path, timeout=30.0)
conn.execute("BEGIN IMMEDIATE") # Lock acquisition
# ... migration logic ...
conn.commit()
return # Success
except sqlite3.OperationalError as e:
if "database is locked" in str(e).lower():
retry_count += 1
if retry_count < max_retries:
# Exponential backoff with jitter
delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
# Graduated logging
if retry_count <= 3:
logger.debug(f"Retry {retry_count}/{max_retries}")
elif retry_count <= 7:
logger.info(f"Retry {retry_count}/{max_retries}")
else:
logger.warning(f"Retry {retry_count}/{max_retries}")
time.sleep(delay)
continue
finally:
if conn:
try:
conn.close()
except:
pass
# B. Error handling pattern
except Exception as e:
try:
conn.rollback()
except Exception as rollback_error:
logger.critical(f"FATAL: Rollback failed: {rollback_error}")
raise SystemExit(1)
raise MigrationError(f"Migration failed: {e}")
# C. Final error message
raise MigrationError(
f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
f"Possible causes:\n"
f"1. Another process is stuck in migration (check logs)\n"
f"2. Database file permissions issue\n"
f"3. Disk I/O problems\n"
f"Action: Restart container with single worker to diagnose"
)
```
### Testing Requirements
#### 1. Unit Test File: `test_migration_race_condition.py`
```python
import multiprocessing
from multiprocessing import Barrier, Process
import time
def test_concurrent_migrations():
"""Test 4 workers starting simultaneously"""
barrier = Barrier(4)
def worker(worker_id):
barrier.wait() # Synchronize start
from starpunk import create_app
app = create_app()
return True
with multiprocessing.Pool(4) as pool:
results = pool.map(worker, range(4))
assert all(results), "Some workers failed"
def test_lock_retry():
"""Test retry logic with mock"""
with patch('sqlite3.connect') as mock:
mock.side_effect = [
sqlite3.OperationalError("database is locked"),
sqlite3.OperationalError("database is locked"),
MagicMock() # Success on 3rd try
]
run_migrations(db_path)
assert mock.call_count == 3
```
#### 2. Integration Test: `test_integration.sh`
```bash
#!/bin/bash
# Test with actual gunicorn
# Clean start
rm -f test.db
# Start gunicorn with 4 workers
timeout 10 gunicorn --workers 4 --bind 127.0.0.1:8001 app:app &
PID=$!
# Wait for startup
sleep 3
# Check if running
if ! kill -0 $PID 2>/dev/null; then
echo "FAILED: Gunicorn crashed"
exit 1
fi
# Check health endpoint
curl -f http://127.0.0.1:8001/health || exit 1
# Cleanup
kill $PID
echo "SUCCESS: All workers started without race condition"
```
#### 3. Container Test: `test_container.sh`
```bash
#!/bin/bash
# Test in container environment
# Build
podman build -t starpunk:race-test -f Containerfile .
# Run with fresh database
podman run --rm -d --name race-test \
-v $(pwd)/test-data:/data \
starpunk:race-test
# Check logs for success patterns
sleep 5
podman logs race-test | grep -E "(Applied migration|already applied by another worker)"
# Cleanup
podman stop race-test
```
### Verification Patterns in Logs
#### Successful Migration (One Worker Wins)
```
Worker 0: Applying migration: 001_initial_schema.sql
Worker 1: Database locked by another worker, retry 1/10 in 0.21s
Worker 2: Database locked by another worker, retry 1/10 in 0.23s
Worker 3: Database locked by another worker, retry 1/10 in 0.19s
Worker 0: Applied migration: 001_initial_schema.sql
Worker 1: All migrations already applied by another worker
Worker 2: All migrations already applied by another worker
Worker 3: All migrations already applied by another worker
```
#### Performance Metrics to Check
- Single worker: < 100ms total
- 4 workers: < 500ms total
- 10 workers (stress): < 2000ms total
### Rollback Plan if Issues
1. **Immediate Workaround**
```bash
# Change to single worker temporarily
gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
```
2. **Revert Code**
```bash
git revert HEAD
```
3. **Emergency Patch**
```python
# In app.py temporarily
import os
if os.getenv('GUNICORN_WORKER_ID', '1') == '1':
init_db() # Only first worker runs migrations
```
### Deployment Commands
```bash
# 1. Run tests
python -m pytest test_migration_race_condition.py -v
# 2. Build container
podman build -t starpunk:v1.0.0-rc.3.1 -f Containerfile .
# 3. Tag for release
podman tag starpunk:v1.0.0-rc.3.1 git.philmade.com/starpunk:v1.0.0-rc.3.1
# 4. Push
podman push git.philmade.com/starpunk:v1.0.0-rc.3.1
# 5. Deploy
kubectl rollout restart deployment/starpunk
```
---
## Critical Points to Remember
1. **NEW CONNECTION EACH RETRY** - Don't reuse connections
2. **BEGIN IMMEDIATE** - Not EXCLUSIVE, not DEFERRED
3. **30s per attempt, 120s total max** - Two different timeouts
4. **Graduated logging** - DEBUG → INFO → WARNING based on retry count
5. **Test at multiple levels** - Unit, integration, container
6. **Fresh database state** between tests
## Support
If issues arise, check:
1. `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md` - Full Q&A
2. `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md` - Detailed implementation
3. SQLite lock states: `PRAGMA lock_status` during issue
---
*Quick Reference v1.0 - 2025-11-24*

View File

@@ -0,0 +1,477 @@
# Migration Race Condition Fix - Architectural Answers
## Status: READY FOR IMPLEMENTATION
All 23 questions have been answered with concrete guidance. The developer can proceed with implementation.
---
## Critical Questions
### 1. Connection Lifecycle Management
**Q: Should we create a new connection for each retry or reuse the same connection?**
**Answer: NEW CONNECTION per retry**
- Each retry MUST create a fresh connection
- Rationale: Failed lock acquisition may leave connection in inconsistent state
- SQLite connections are lightweight; overhead is minimal
- Pattern:
```python
while retry_count < max_retries:
conn = None # Fresh connection each iteration
try:
conn = sqlite3.connect(db_path, timeout=30.0)
# ... attempt migration ...
finally:
if conn:
conn.close()
```
### 2. Transaction Boundaries
**Q: Should init_db() wrap everything in one transaction?**
**Answer: NO - Separate transactions for different operations**
- Schema creation: Own transaction (already implicit in executescript)
- Migrations: Own transaction with BEGIN IMMEDIATE
- Initial data: Own transaction
- Rationale: Minimizes lock duration and allows partial success visibility
- Each operation is atomic but independent
### 3. Lock Timeout vs Retry Timeout
**Q: Connection timeout is 30s but retry logic could take ~102s. Conflict?**
**Answer: This is BY DESIGN - No conflict**
- 30s timeout: Maximum wait for any single lock acquisition attempt
- 102s total: Maximum cumulative retry duration across multiple attempts
- If one worker holds lock for 30s+, other workers timeout and retry
- Pattern ensures no single worker waits indefinitely
- Recommendation: Add total timeout check:
```python
start_time = time.time()
max_total_time = 120 # 2 minutes absolute maximum
while retry_count < max_retries and (time.time() - start_time) < max_total_time:
```
### 4. Testing Strategy
**Q: Should we use multiprocessing.Pool or actual gunicorn for testing?**
**Answer: BOTH - Different test levels**
- Unit tests: multiprocessing.Pool (fast, isolated)
- Integration tests: Actual gunicorn with --workers 4
- Container tests: Full podman/docker run
- Test matrix:
```
Level 1: Mock concurrent access (unit)
Level 2: multiprocessing.Pool (integration)
Level 3: gunicorn locally (system)
Level 4: Container with gunicorn (e2e)
```
### 5. BEGIN IMMEDIATE vs EXCLUSIVE
**Q: Why use BEGIN IMMEDIATE instead of BEGIN EXCLUSIVE?**
**Answer: BEGIN IMMEDIATE is CORRECT choice**
- BEGIN IMMEDIATE: Acquires RESERVED lock (prevents other writes, allows reads)
- BEGIN EXCLUSIVE: Acquires EXCLUSIVE lock (prevents all access)
- Rationale:
- Migrations only need to prevent concurrent migrations (writes)
- Other workers can still read schema while one migrates
- Less contention, faster startup
- Only escalates to EXCLUSIVE when actually writing
- Keep BEGIN IMMEDIATE as specified
---
## Edge Cases and Error Handling
### 6. Partial Migration Failure
**Q: What if a migration partially applies or rollback fails?**
**Answer: Transaction atomicity handles this**
- Within transaction: Automatic rollback on ANY error
- Rollback failure: Extremely rare (corrupt database)
- Strategy:
```python
except Exception as e:
try:
conn.rollback()
except Exception as rollback_error:
logger.critical(f"FATAL: Rollback failed: {rollback_error}")
# Database potentially corrupt - fail hard
raise SystemExit(1)
raise MigrationError(e)
```
### 7. Migration File Consistency
**Q: What if migration files change during deployment?**
**Answer: Not a concern with proper deployment**
- Container deployments: Files are immutable in image
- Traditional deployment: Use atomic directory swap
- If concerned, add checksum validation:
```python
# Store in schema_migrations: (name, checksum, applied_at)
# Verify checksum matches before applying
```
### 8. Retry Exhaustion Error Messages
**Q: What error message when retries exhausted?**
**Answer: Be specific and actionable**
```python
raise MigrationError(
f"Failed to acquire migration lock after {max_retries} attempts over {elapsed:.1f}s. "
f"Possible causes:\n"
f"1. Another process is stuck in migration (check logs)\n"
f"2. Database file permissions issue\n"
f"3. Disk I/O problems\n"
f"Action: Restart container with single worker to diagnose"
)
```
### 9. Logging Levels
**Q: What log level for lock waits?**
**Answer: Graduated approach**
- Retry 1-3: DEBUG (normal operation)
- Retry 4-7: INFO (getting concerning)
- Retry 8+: WARNING (abnormal)
- Exhausted: ERROR (operation failed)
- Pattern:
```python
if retry_count <= 3:
level = logging.DEBUG
elif retry_count <= 7:
level = logging.INFO
else:
level = logging.WARNING
logger.log(level, f"Retry {retry_count}/{max_retries}")
```
### 10. Index Creation Failure
**Q: How to handle index creation failures in migration 002?**
**Answer: Fail fast with clear context**
```python
for index_name, index_sql in indexes_to_create:
try:
conn.execute(index_sql)
except sqlite3.OperationalError as e:
if "already exists" in str(e):
logger.debug(f"Index {index_name} already exists")
else:
raise MigrationError(
f"Failed to create index {index_name}: {e}\n"
f"SQL: {index_sql}"
)
```
---
## Testing Strategy
### 11. Concurrent Testing Simulation
**Q: How to properly simulate concurrent worker startup?**
**Answer: Multiple approaches**
```python
# Approach 1: Barrier synchronization
def test_concurrent_migrations():
barrier = multiprocessing.Barrier(4)
def worker():
barrier.wait() # All start together
return run_migrations(db_path)
with multiprocessing.Pool(4) as pool:
results = pool.map(worker, range(4))
# Approach 2: Process start
processes = []
for i in range(4):
p = Process(target=run_migrations, args=(db_path,))
processes.append(p)
for p in processes:
p.start() # Near-simultaneous
```
### 12. Lock Contention Testing
**Q: How to test lock contention scenarios?**
**Answer: Inject delays**
```python
# Test helper to force contention
def slow_migration_for_testing(conn):
conn.execute("BEGIN IMMEDIATE")
time.sleep(2) # Force other workers to wait
# Apply migration
conn.commit()
# Test timeout handling
@patch('sqlite3.connect')
def test_lock_timeout(mock_connect):
mock_connect.side_effect = sqlite3.OperationalError("database is locked")
# Verify retry logic
```
### 13. Performance Tests
**Q: What timing is acceptable?**
**Answer: Performance targets**
- Single worker: < 100ms for all migrations
- 4 workers with contention: < 500ms total
- 10 workers stress test: < 2s total
- Lock acquisition per retry: < 50ms
- Test with:
```python
import timeit
setup_time = timeit.timeit(lambda: create_app(), number=1)
assert setup_time < 0.5, f"Startup too slow: {setup_time}s"
```
### 14. Retry Logic Unit Tests
**Q: How to unit test retry logic?**
**Answer: Mock the lock failures**
```python
class TestRetryLogic:
def test_retry_on_lock(self):
with patch('sqlite3.connect') as mock:
# First 2 attempts fail, 3rd succeeds
mock.side_effect = [
sqlite3.OperationalError("database is locked"),
sqlite3.OperationalError("database is locked"),
MagicMock() # Success
]
run_migrations(db_path)
assert mock.call_count == 3
```
---
## SQLite-Specific Concerns
### 15. BEGIN IMMEDIATE vs EXCLUSIVE (Detailed)
**Q: Deep dive on lock choice?**
**Answer: Lock escalation path**
```
BEGIN DEFERRED → SHARED → RESERVED → EXCLUSIVE
BEGIN IMMEDIATE → RESERVED → EXCLUSIVE
BEGIN EXCLUSIVE → EXCLUSIVE
For migrations:
- IMMEDIATE starts at RESERVED (blocks other writers immediately)
- Escalates to EXCLUSIVE only during actual writes
- Optimal for our use case
```
### 16. WAL Mode Interaction
**Q: How does this work with WAL mode?**
**Answer: Works correctly with both modes**
- Journal mode: BEGIN IMMEDIATE works as described
- WAL mode: BEGIN IMMEDIATE still prevents concurrent writers
- No code changes needed
- Add mode detection for logging:
```python
cursor = conn.execute("PRAGMA journal_mode")
mode = cursor.fetchone()[0]
logger.debug(f"Database in {mode} mode")
```
### 17. Database File Permissions
**Q: How to handle permission issues?**
**Answer: Fail fast with helpful diagnostics**
```python
import os
import stat
db_path = Path(db_path)
if not db_path.exists():
# Will be created - check parent dir
parent = db_path.parent
if not os.access(parent, os.W_OK):
raise MigrationError(f"Cannot write to directory: {parent}")
else:
# Check existing file
if not os.access(db_path, os.W_OK):
stats = os.stat(db_path)
mode = stat.filemode(stats.st_mode)
raise MigrationError(
f"Database not writable: {db_path}\n"
f"Permissions: {mode}\n"
f"Owner: {stats.st_uid}:{stats.st_gid}"
)
```
---
## Deployment/Operations
### 18. Container Startup and Health Checks
**Q: How to handle health checks during migration?**
**Answer: Return 503 during migration**
```python
# In app.py
MIGRATION_IN_PROGRESS = False
def create_app():
global MIGRATION_IN_PROGRESS
MIGRATION_IN_PROGRESS = True
try:
init_db()
finally:
MIGRATION_IN_PROGRESS = False
@app.route('/health')
def health():
if MIGRATION_IN_PROGRESS:
return {'status': 'migrating'}, 503
return {'status': 'healthy'}, 200
```
### 19. Monitoring and Alerting
**Q: What metrics/alerts are needed?**
**Answer: Key metrics to track**
```python
# Add metrics collection
metrics = {
'migration_duration_ms': 0,
'migration_retries': 0,
'migration_lock_wait_ms': 0,
'migrations_applied': 0
}
# Alert thresholds
ALERTS = {
'migration_duration_ms': 5000, # Alert if > 5s
'migration_retries': 5, # Alert if > 5 retries
'worker_failures': 1 # Alert on any failure
}
# Log in structured format
logger.info(json.dumps({
'event': 'migration_complete',
'metrics': metrics
}))
```
---
## Alternative Approaches
### 20. Version Compatibility
**Q: How to handle version mismatches?**
**Answer: Strict version checking**
```python
# In migrations.py
MIGRATION_VERSION = "1.0.0"
def check_version_compatibility(conn):
cursor = conn.execute(
"SELECT value FROM app_config WHERE key = 'migration_version'"
)
row = cursor.fetchone()
if row and row[0] != MIGRATION_VERSION:
raise MigrationError(
f"Version mismatch: Database={row[0]}, Code={MIGRATION_VERSION}\n"
f"Action: Run migration tool separately"
)
```
### 21. File-Based Locking
**Q: Should we consider flock() as backup?**
**Answer: NO - Adds complexity without benefit**
- SQLite locking is sufficient and portable
- flock() not available on all systems
- Would require additional cleanup logic
- Database-level locking is the correct approach
### 22. Gunicorn Preload
**Q: Would --preload flag help?**
**Answer: NO - Makes problem WORSE**
- --preload runs app initialization ONCE in master
- Workers fork from master AFTER migrations complete
- BUT: Doesn't work with lazy-loaded resources
- Current architecture expects per-worker initialization
- Keep current approach
### 23. Application-Level Locks
**Q: Should we add Redis/memcached for coordination?**
**Answer: NO - Violates simplicity principle**
- Adds external dependency
- More complex deployment
- SQLite locking is sufficient
- Would require Redis/memcached to be running before app starts
- Solving a solved problem
---
## Final Implementation Checklist
### Required Changes
1. ✅ Add imports: `time`, `random`
2. ✅ Implement retry loop with exponential backoff
3. ✅ Use BEGIN IMMEDIATE for lock acquisition
4. ✅ Add graduated logging levels
5. ✅ Proper error messages with diagnostics
6. ✅ Fresh connection per retry
7. ✅ Total timeout check (2 minutes max)
8. ✅ Preserve all existing migration logic
### Test Coverage Required
1. ✅ Unit test: Retry on lock
2. ✅ Unit test: Exhaustion handling
3. ✅ Integration test: 4 workers with multiprocessing
4. ✅ System test: gunicorn with 4 workers
5. ✅ Container test: Full deployment simulation
6. ✅ Performance test: < 500ms with contention
### Documentation Updates
1. ✅ Update ADR-022 with final decision
2. ✅ Add operational runbook for migration issues
3. ✅ Document monitoring metrics
4. ✅ Update deployment guide with health check info
---
## Go/No-Go Decision
### ✅ GO FOR IMPLEMENTATION
**Rationale:**
- All 23 questions have concrete answers
- Design is proven with SQLite's native capabilities
- No external dependencies needed
- Risk is low with clear rollback plan
- Testing strategy is comprehensive
**Implementation Priority: IMMEDIATE**
- This is blocking v1.0.0-rc.4 release
- Production systems affected
- Fix is well-understood and low-risk
**Next Steps:**
1. Implement changes to migrations.py as specified
2. Run test suite at all levels
3. Deploy as hotfix v1.0.0-rc.3.1
4. Monitor metrics in production
5. Document lessons learned
---
*Document Version: 1.0*
*Created: 2025-11-24*
*Status: Approved for Implementation*
*Author: StarPunk Architecture Team*

View File

@@ -0,0 +1,208 @@
# ADR-022: Database Migration Race Condition Resolution
## Status
Accepted
## Context
In production, StarPunk runs with multiple gunicorn workers (currently 4). Each worker process independently initializes the Flask application through `create_app()`, which calls `init_db()`, which in turn runs database migrations via `run_migrations()`.
When the container starts fresh, all 4 workers start simultaneously and attempt to:
1. Create the `schema_migrations` table
2. Apply pending migrations
3. Insert records into `schema_migrations`
This causes a race condition where:
- Worker 1 successfully applies migration and inserts record
- Workers 2-4 fail with "UNIQUE constraint failed: schema_migrations.migration_name"
- Failed workers crash, causing container restarts
- After restart, migrations are already applied so it works
## Decision
We will implement **database-level advisory locking** using SQLite's transaction mechanism with IMMEDIATE mode, combined with retry logic. This approach:
1. Uses SQLite's built-in `BEGIN IMMEDIATE` transaction to acquire a write lock
2. Implements exponential backoff retry for workers that can't acquire the lock
3. Ensures only one worker can run migrations at a time
4. Other workers wait and verify migrations are complete
This is the simplest, most robust solution that:
- Requires minimal code changes
- Uses SQLite's native capabilities
- Doesn't require external dependencies
- Works across all deployment scenarios
## Rationale
### Options Considered
1. **File-based locking (fcntl)**
- Pro: Simple to implement
- Con: Doesn't work across containers/network filesystems
- Con: Lock files can be orphaned if process crashes
2. **Run migrations before workers start**
- Pro: Cleanest separation of concerns
- Con: Requires container entrypoint script changes
- Con: Complicates development workflow
- Con: Doesn't fix the root cause for non-container deployments
3. **Make migration insertion idempotent (INSERT OR IGNORE)**
- Pro: Simple SQL change
- Con: Doesn't prevent parallel migration execution
- Con: Could corrupt database if migrations partially apply
- Con: Masks the real problem
4. **Database advisory locking (CHOSEN)**
- Pro: Uses SQLite's native transaction locking
- Pro: Guaranteed atomicity
- Pro: Works across all deployment scenarios
- Pro: Self-cleaning (no orphaned locks)
- Con: Requires retry logic
### Why Database Locking?
SQLite's `BEGIN IMMEDIATE` transaction mode acquires a RESERVED lock immediately, preventing other connections from writing. This provides:
1. **Atomicity**: Either all migrations apply or none do
2. **Isolation**: Only one worker can modify schema at a time
3. **Automatic cleanup**: Locks released on connection close/crash
4. **No external dependencies**: Uses SQLite's built-in features
## Implementation
The fix will be implemented in `/home/phil/Projects/starpunk/starpunk/migrations.py`:
```python
def run_migrations(db_path, logger=None):
"""Run all pending database migrations with concurrency protection"""
max_retries = 10
retry_count = 0
base_delay = 0.1 # 100ms
while retry_count < max_retries:
try:
conn = sqlite3.connect(db_path, timeout=30.0)
# Acquire exclusive lock for migrations
conn.execute("BEGIN IMMEDIATE")
try:
# Create migrations table if needed
create_migrations_table(conn)
# Check if another worker already ran migrations
cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
if cursor.fetchone()[0] > 0:
# Migrations already run by another worker
conn.commit()
logger.info("Migrations already applied by another worker")
return
# Run migration logic (existing code)
# ... rest of migration code ...
conn.commit()
return # Success
except Exception:
conn.rollback()
raise
except sqlite3.OperationalError as e:
if "database is locked" in str(e):
retry_count += 1
delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
if retry_count < max_retries:
logger.debug(f"Database locked, retry {retry_count}/{max_retries} in {delay:.2f}s")
time.sleep(delay)
else:
raise MigrationError(f"Failed to acquire migration lock after {max_retries} attempts")
else:
raise
finally:
if conn:
conn.close()
```
Additional changes needed:
1. Add imports: `import time`, `import random`
2. Modify connection timeout from default 5s to 30s
3. Add early check for already-applied migrations
4. Wrap entire migration process in IMMEDIATE transaction
## Consequences
### Positive
- Eliminates race condition completely
- No container configuration changes needed
- Works in all deployment scenarios (container, systemd, manual)
- Minimal code changes (~50 lines)
- Self-healing (no manual lock cleanup needed)
- Provides clear logging of what's happening
### Negative
- Slight startup delay for workers that wait (100ms-2s typical)
- Adds complexity to migration runner
- Requires careful testing of retry logic
### Neutral
- Workers start sequentially for migration phase, then run in parallel
- First worker to acquire lock runs migrations for all
- Log output will show retry attempts (useful for debugging)
## Testing Strategy
1. **Unit test with mock**: Test retry logic with simulated lock contention
2. **Integration test**: Spawn multiple processes, verify only one runs migrations
3. **Container test**: Build container, verify clean startup with 4 workers
4. **Stress test**: Start 20 processes simultaneously, verify correctness
## Migration Path
1. Implement fix in `starpunk/migrations.py`
2. Test locally with multiple workers
3. Build and test container
4. Deploy as v1.0.0-rc.4 or hotfix v1.0.0-rc.3.1
5. Monitor production logs for retry patterns
## Implementation Notes (Post-Analysis)
Based on comprehensive architectural review, the following clarifications have been established:
### Critical Implementation Details
1. **Connection Management**: Create NEW connection for each retry attempt (no reuse)
2. **Lock Mode**: Use BEGIN IMMEDIATE (not EXCLUSIVE) for optimal concurrency
3. **Timeout Strategy**: 30s per connection attempt, 120s total maximum duration
4. **Logging Levels**: Graduated (DEBUG for retry 1-3, INFO for 4-7, WARNING for 8+)
5. **Transaction Boundaries**: Separate transactions for schema/migrations/data
### Test Requirements
- Unit tests with multiprocessing.Pool
- Integration tests with actual gunicorn
- Container tests with full deployment
- Performance target: <500ms with 4 workers
### Documentation
- Full Q&A: `/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
- Implementation Guide: `/home/phil/Projects/starpunk/docs/reports/migration-race-condition-fix-implementation.md`
- Quick Reference: `/home/phil/Projects/starpunk/docs/architecture/migration-fix-quick-reference.md`
## References
- [SQLite Transaction Documentation](https://www.sqlite.org/lang_transaction.html)
- [SQLite Locking Documentation](https://www.sqlite.org/lockingv3.html)
- [SQLite BEGIN IMMEDIATE](https://www.sqlite.org/lang_transaction.html#immediate)
- Issue: Production migration race condition with gunicorn workers
## Status Update
**2025-11-24**: All 23 architectural questions answered. Implementation approved. Ready for development.

View File

@@ -0,0 +1,431 @@
# Migration Race Condition Fix - Implementation Guide
## Executive Summary
**CRITICAL PRODUCTION ISSUE**: Multiple gunicorn workers racing to apply migrations causes container startup failures.
**Solution**: Implement database-level advisory locking with retry logic in `migrations.py`.
**Urgency**: HIGH - This is a blocker for v1.0.0-rc.4 release.
## Root Cause Analysis
### The Problem Flow
1. Container starts with `gunicorn --workers 4`
2. Each worker independently calls:
```
app.py → create_app() → init_db() → run_migrations()
```
3. All 4 workers simultaneously try to:
- INSERT into schema_migrations table
- Apply the same migrations
4. SQLite's UNIQUE constraint on migration_name causes workers 2-4 to crash
5. Container restarts, works on second attempt (migrations already applied)
### Why This Happens
- **No synchronization**: Workers are independent processes
- **No locking**: Migration code doesn't prevent concurrent execution
- **Immediate failure**: UNIQUE constraint violation crashes the worker
- **Gunicorn behavior**: Worker crash triggers container restart
## Immediate Fix Implementation
### Step 1: Update migrations.py
Add these imports at the top of `/home/phil/Projects/starpunk/starpunk/migrations.py`:
```python
import time
import random
```
### Step 2: Replace run_migrations function
Replace the entire `run_migrations` function (lines 304-462) with:
```python
def run_migrations(db_path, logger=None):
"""
Run all pending database migrations with concurrency protection
Uses database-level locking to prevent race conditions when multiple
workers start simultaneously. Only one worker will apply migrations;
others will wait and verify completion.
Args:
db_path: Path to SQLite database file
logger: Optional logger for output
Raises:
MigrationError: If any migration fails to apply or lock cannot be acquired
"""
if logger is None:
logger = logging.getLogger(__name__)
# Determine migrations directory
migrations_dir = Path(__file__).parent.parent / "migrations"
if not migrations_dir.exists():
logger.warning(f"Migrations directory not found: {migrations_dir}")
return
# Retry configuration for lock acquisition
max_retries = 10
retry_count = 0
base_delay = 0.1 # 100ms
while retry_count < max_retries:
conn = None
try:
# Connect with longer timeout for lock contention
conn = sqlite3.connect(db_path, timeout=30.0)
# Attempt to acquire exclusive lock for migrations
# BEGIN IMMEDIATE acquires RESERVED lock, preventing other writes
conn.execute("BEGIN IMMEDIATE")
try:
# Ensure migrations tracking table exists
create_migrations_table(conn)
# Quick check: have migrations already been applied by another worker?
cursor = conn.execute("SELECT COUNT(*) FROM schema_migrations")
migration_count = cursor.fetchone()[0]
# Discover migration files
migration_files = discover_migration_files(migrations_dir)
if not migration_files:
conn.commit()
logger.info("No migration files found")
return
# If migrations exist and we're not the first worker, verify and exit
if migration_count > 0:
# Check if all migrations are applied
applied = get_applied_migrations(conn)
pending = [m for m, _ in migration_files if m not in applied]
if not pending:
conn.commit()
logger.debug("All migrations already applied by another worker")
return
# If there are pending migrations, we continue to apply them
logger.info(f"Found {len(pending)} pending migrations to apply")
# Fresh database detection (original logic preserved)
if migration_count == 0:
if is_schema_current(conn):
# Schema is current - mark all migrations as applied
for migration_name, _ in migration_files:
conn.execute(
"INSERT INTO schema_migrations (migration_name) VALUES (?)",
(migration_name,)
)
conn.commit()
logger.info(
f"Fresh database detected: marked {len(migration_files)} "
f"migrations as applied (schema already current)"
)
return
else:
logger.info("Fresh database with partial schema: applying needed migrations")
# Get already-applied migrations
applied = get_applied_migrations(conn)
# Apply pending migrations (original logic preserved)
pending_count = 0
skipped_count = 0
for migration_name, migration_path in migration_files:
if migration_name not in applied:
# Check if migration is actually needed
should_check_needed = (
migration_count == 0 or
migration_name == "002_secure_tokens_and_authorization_codes.sql"
)
if should_check_needed and not is_migration_needed(conn, migration_name):
# Special handling for migration 002: if tables exist but indexes don't
if migration_name == "002_secure_tokens_and_authorization_codes.sql":
# Check if we need to create indexes
indexes_to_create = []
if not index_exists(conn, 'idx_tokens_hash'):
indexes_to_create.append("CREATE INDEX idx_tokens_hash ON tokens(token_hash)")
if not index_exists(conn, 'idx_tokens_me'):
indexes_to_create.append("CREATE INDEX idx_tokens_me ON tokens(me)")
if not index_exists(conn, 'idx_tokens_expires'):
indexes_to_create.append("CREATE INDEX idx_tokens_expires ON tokens(expires_at)")
if not index_exists(conn, 'idx_auth_codes_hash'):
indexes_to_create.append("CREATE INDEX idx_auth_codes_hash ON authorization_codes(code_hash)")
if not index_exists(conn, 'idx_auth_codes_expires'):
indexes_to_create.append("CREATE INDEX idx_auth_codes_expires ON authorization_codes(expires_at)")
if indexes_to_create:
for index_sql in indexes_to_create:
conn.execute(index_sql)
logger.info(f"Created {len(indexes_to_create)} missing indexes from migration 002")
# Mark as applied without executing full migration
conn.execute(
"INSERT INTO schema_migrations (migration_name) VALUES (?)",
(migration_name,)
)
skipped_count += 1
logger.debug(f"Skipped migration {migration_name} (already in SCHEMA_SQL)")
else:
# Apply the migration (within our transaction)
try:
# Read migration SQL
migration_sql = migration_path.read_text()
logger.debug(f"Applying migration: {migration_name}")
# Execute migration (already in transaction)
conn.executescript(migration_sql)
# Record migration as applied
conn.execute(
"INSERT INTO schema_migrations (migration_name) VALUES (?)",
(migration_name,)
)
logger.info(f"Applied migration: {migration_name}")
pending_count += 1
except Exception as e:
# Roll back the transaction
raise MigrationError(f"Migration {migration_name} failed: {e}")
# Commit all migrations atomically
conn.commit()
# Summary
total_count = len(migration_files)
if pending_count > 0 or skipped_count > 0:
if skipped_count > 0:
logger.info(
f"Migrations complete: {pending_count} applied, {skipped_count} skipped "
f"(already in SCHEMA_SQL), {total_count} total"
)
else:
logger.info(
f"Migrations complete: {pending_count} applied, "
f"{total_count} total"
)
else:
logger.info(f"All migrations up to date ({total_count} total)")
return # Success!
except MigrationError:
conn.rollback()
raise
except Exception as e:
conn.rollback()
raise MigrationError(f"Migration system error: {e}")
except sqlite3.OperationalError as e:
if "database is locked" in str(e).lower():
# Another worker has the lock, retry with exponential backoff
retry_count += 1
if retry_count < max_retries:
# Exponential backoff with jitter
delay = base_delay * (2 ** retry_count) + random.uniform(0, 0.1)
logger.debug(
f"Database locked by another worker, retry {retry_count}/{max_retries} "
f"in {delay:.2f}s"
)
time.sleep(delay)
continue
else:
raise MigrationError(
f"Failed to acquire migration lock after {max_retries} attempts. "
f"This may indicate a hung migration process."
)
else:
# Non-lock related database error
error_msg = f"Database error during migration: {e}"
logger.error(error_msg)
raise MigrationError(error_msg)
except Exception as e:
# Unexpected error
error_msg = f"Unexpected error during migration: {e}"
logger.error(error_msg)
raise MigrationError(error_msg)
finally:
if conn:
try:
conn.close()
except:
pass # Ignore errors during cleanup
# Should never reach here, but just in case
raise MigrationError("Migration retry loop exited unexpectedly")
```
### Step 3: Testing the Fix
Create a test script to verify the fix works:
```python
#!/usr/bin/env python3
"""Test migration race condition fix"""
import multiprocessing
import time
import sys
from pathlib import Path
# Add project to path
sys.path.insert(0, str(Path(__file__).parent))
def worker_init(worker_id):
"""Simulate a gunicorn worker starting"""
print(f"Worker {worker_id}: Starting...")
try:
from starpunk import create_app
app = create_app()
print(f"Worker {worker_id}: Successfully initialized")
return True
except Exception as e:
print(f"Worker {worker_id}: FAILED - {e}")
return False
if __name__ == "__main__":
# Test with 10 workers (more than production to stress test)
num_workers = 10
print(f"Starting {num_workers} workers simultaneously...")
with multiprocessing.Pool(num_workers) as pool:
results = pool.map(worker_init, range(num_workers))
success_count = sum(results)
print(f"\nResults: {success_count}/{num_workers} workers succeeded")
if success_count == num_workers:
print("SUCCESS: All workers initialized without race condition")
sys.exit(0)
else:
print("FAILURE: Race condition still present")
sys.exit(1)
```
## Verification Steps
1. **Local Testing**:
```bash
# Test with multiple workers
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
# Check logs for retry messages
# Should see "Database locked by another worker, retry..." messages
```
2. **Container Testing**:
```bash
# Build container
podman build -t starpunk:test -f Containerfile .
# Run with fresh database
podman run --rm -p 8000:8000 -v ./test-data:/data starpunk:test
# Should start cleanly without restarts
```
3. **Log Verification**:
Look for these patterns:
- One worker: "Applied migration: XXX"
- Other workers: "Database locked by another worker, retry..."
- Final: "All migrations already applied by another worker"
## Risk Assessment
### Risk Level: LOW
The fix is safe because:
1. Uses SQLite's native transaction mechanism
2. Preserves all existing migration logic
3. Only adds retry wrapper around existing code
4. Fails safely with clear error messages
5. No data loss possible (transactions ensure atomicity)
### Rollback Plan
If issues occur:
1. Revert to previous version
2. Start container with single worker temporarily: `--workers 1`
3. Once migrations apply, scale back to 4 workers
## Release Strategy
### Option 1: Hotfix (Recommended)
- Release as v1.0.0-rc.3.1
- Immediate deployment to fix production issue
- Minimal testing required (focused fix)
### Option 2: Include in rc.4
- Bundle with other rc.4 changes
- More testing time
- Risk: Production remains broken until rc.4
**Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately.
## Alternative Workarounds (If Needed Urgently)
While the proper fix is implemented, these temporary workarounds can be used:
### Workaround 1: Single Worker Startup
```bash
# In Containerfile, temporarily change:
CMD ["gunicorn", "--workers", "1", ...]
# After first successful start, rebuild with 4 workers
```
### Workaround 2: Pre-migration Script
```bash
# Add entrypoint script that runs migrations before gunicorn
#!/bin/bash
python3 -c "from starpunk.database import init_db; init_db()"
exec gunicorn --workers 4 ...
```
### Workaround 3: Delayed Worker Startup
```bash
# Stagger worker startup with --preload
gunicorn --preload --workers 4 ...
```
## Summary
- **Problem**: Race condition when multiple workers apply migrations
- **Solution**: Database-level locking with retry logic
- **Implementation**: ~150 lines of code changes in migrations.py
- **Testing**: Verify with multi-worker startup
- **Risk**: LOW - Safe, atomic changes
- **Urgency**: HIGH - Blocks production deployment
- **Recommendation**: Deploy as hotfix v1.0.0-rc.3.1 immediately
## Developer Questions Answered
All 23 architectural questions have been comprehensively answered in:
`/home/phil/Projects/starpunk/docs/architecture/migration-race-condition-answers.md`
**Key Decisions:**
- NEW connection per retry (not reused)
- BEGIN IMMEDIATE is correct (not EXCLUSIVE)
- Separate transactions for each operation
- Both multiprocessing.Pool AND gunicorn testing needed
- 30s timeout per attempt, 120s total maximum
- Graduated logging levels based on retry count
**Implementation Status: READY TO PROCEED**