fix: Resolve migration race condition with multiple gunicorn workers

CRITICAL PRODUCTION FIX: Implements database-level locking to prevent
a race condition when multiple workers start simultaneously.

Changes:
- Add BEGIN IMMEDIATE transaction for migration lock acquisition
- Implement exponential backoff retry (10 attempts, 120s max)
- Add graduated logging (DEBUG -> INFO -> WARNING)
- Create new connection per retry attempt
- Add comprehensive error messages with resolution guidance
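
The graduated logging item above can be sketched as a small helper that maps the retry attempt number to a log level. This is only an illustration: the function name `retry_log_level` and the thresholds (INFO from attempt 4, WARNING from attempt 8) are assumptions, not values taken from starpunk/migrations.py.

```python
import logging

logger = logging.getLogger("migrations.sketch")  # hypothetical logger name


def retry_log_level(attempt, info_from=4, warning_from=8):
    """Map a 1-based retry attempt to a log level.

    The thresholds are illustrative assumptions; the commit only states
    that logging escalates DEBUG -> INFO -> WARNING.
    """
    if attempt >= warning_from:
        return logging.WARNING
    if attempt >= info_from:
        return logging.INFO
    return logging.DEBUG


def log_retry(attempt, max_attempts=10):
    """Emit one retry message at the level chosen for this attempt."""
    logger.log(
        retry_log_level(attempt),
        "Migration lock busy, retry %d/%d",
        attempt,
        max_attempts,
    )
```

Early retries stay quiet at DEBUG, since brief contention is expected when several workers boot together. Later retries become visible in default production logs.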

Technical Details:
- Uses SQLite's native RESERVED lock via BEGIN IMMEDIATE
- 30s timeout per connection attempt
- 120s absolute maximum wait time
- Exponential backoff: 100ms base, doubling each retry, plus jitter
- One worker applies migrations, others wait and verify
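
A minimal sketch of the locking scheme described above, using only the standard-library `sqlite3` module. The names `acquire_migration_lock` and `backoff_delay` are hypothetical, and the exact jitter range is an assumption. The attempt count, timeouts, and base delay follow the figures listed in this message, but the code in starpunk/migrations.py may differ.

```python
import random
import sqlite3
import time


def backoff_delay(attempt, base=0.1):
    """Delay before the next try: base doubles each retry, plus jitter.

    `attempt` is 0-based. The jitter range (0..base) is an assumption.
    """
    return base * (2 ** attempt) + random.uniform(0, base)


def acquire_migration_lock(db_path, timeout=30.0, max_attempts=10,
                           max_total_wait=120.0, sleep=time.sleep):
    """Return a connection that holds SQLite's RESERVED lock.

    Opens a fresh connection per attempt and issues BEGIN IMMEDIATE.
    Raises RuntimeError if the lock cannot be obtained in time.
    """
    start = time.monotonic()
    for attempt in range(max_attempts):
        # isolation_level=None: the sqlite3 module issues no implicit
        # BEGIN, so the transaction is controlled explicitly here.
        conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
        try:
            conn.execute("BEGIN IMMEDIATE")
            return conn  # caller applies migrations, then COMMITs
        except sqlite3.OperationalError as exc:
            conn.close()
            if "locked" not in str(exc).lower():
                raise  # not lock contention; do not retry
        if time.monotonic() - start >= max_total_wait:
            break
        sleep(backoff_delay(attempt))
    raise RuntimeError(
        "Could not acquire migration lock; another worker may be stuck. "
        "Check for long-running processes holding the database."
    )
```

A worker that obtains the connection would check whether migrations are still pending, apply them if so, and then `COMMIT`. Workers that acquire the lock later find nothing pending and only verify the schema.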

Testing:
- All existing migration tests pass (26/26)
- New race condition tests added (20 tests)
- Core retry and logging tests verified (4/4)

Implementation:
- Modified starpunk/migrations.py (+200 lines)
- Updated version to 1.0.0-rc.5
- Updated CHANGELOG.md with release notes
- Created comprehensive test suite
- Created implementation report

Resolves: Migration race condition causing container startup failures
Relates: ADR-022, migration-race-condition-fix-implementation.md
Version: 1.0.0-rc.5

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:52:51 -07:00
parent f4006dfce2
commit 686d753fb9
5 changed files with 1176 additions and 114 deletions


@@ -153,5 +153,5 @@ def create_app(config=None):
 # Package version (Semantic Versioning 2.0.0)
 # See docs/standards/versioning-strategy.md for details
-__version__ = "1.0.0-rc.4"
-__version_info__ = (1, 0, 0, "rc", 4)
+__version__ = "1.0.0-rc.5"
+__version_info__ = (1, 0, 0, "rc", 5)