# Production Readiness Improvements Specification ## Overview Production readiness improvements for v1.1.1 focus on robustness, error handling, resource optimization, and operational visibility to ensure StarPunk runs reliably in production environments. ## Requirements ### Functional Requirements 1. **Graceful FTS5 Degradation** - Detect FTS5 availability at startup - Automatically fall back to LIKE-based search - Log clear warnings about reduced functionality - Document SQLite compilation requirements 2. **Enhanced Error Messages** - Provide actionable error messages for common issues - Include troubleshooting steps - Differentiate between user and system errors - Add configuration validation at startup 3. **Database Connection Pooling** - Optimize connection pool size - Monitor pool usage - Handle connection exhaustion gracefully - Configure pool parameters 4. **Structured Logging** - Implement log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL) - JSON-structured logs for production - Human-readable logs for development - Request correlation IDs 5. **Health Check Improvements** - Enhanced /health endpoint - Detailed health status (when authorized) - Component health checks - Readiness vs liveness probes ### Non-Functional Requirements 1. **Reliability** - Graceful handling of all error conditions - No crashes from user input - Automatic recovery from transient errors 2. **Observability** - Clear logging of all operations - Traceable request flow - Diagnostic information available 3. **Performance** - Connection pooling reduces latency - Efficient error handling paths - Minimal logging overhead ## Design ### FTS5 Graceful Degradation ```python # starpunk/search/engine.py class SearchEngineFactory: """Factory for creating appropriate search engine""" @staticmethod def create() -> SearchEngine: """Create search engine based on availability""" if SearchEngineFactory._check_fts5(): logger.info("Using FTS5 search engine") return FTS5SearchEngine() else: logger.warning( "FTS5 not available. Using fallback search engine. " "For better search performance, please ensure SQLite " "is compiled with FTS5 support. See: " "https://www.sqlite.org/fts5.html#compiling_and_using_fts5" ) return FallbackSearchEngine() @staticmethod def _check_fts5() -> bool: """Check if FTS5 is available""" try: conn = sqlite3.connect(":memory:") conn.execute( "CREATE VIRTUAL TABLE test_fts USING fts5(content)" ) conn.close() return True except sqlite3.OperationalError: return False class FallbackSearchEngine(SearchEngine): """LIKE-based search for systems without FTS5""" def search(self, query: str, limit: int = 50) -> List[SearchResult]: """Perform case-insensitive LIKE search""" sql = """ SELECT id, content, created_at, 0 as rank -- No ranking available FROM notes WHERE content LIKE ? OR content LIKE ? OR content LIKE ? ORDER BY created_at DESC LIMIT ? """ # Search for term at start, middle, or end patterns = [ f'{query}%', # Starts with f'% {query}%', # Word in middle f'%{query}' # Ends with ] results = [] with get_db() as conn: cursor = conn.execute(sql, (*patterns, limit)) for row in cursor: results.append(SearchResult(*row)) return results ``` ### Enhanced Error Messages ```python # starpunk/errors/messages.py class ErrorMessages: """User-friendly error messages with troubleshooting""" DATABASE_LOCKED = ErrorInfo( message="The database is temporarily locked", suggestion="Please try again in a moment", details="This usually happens during concurrent writes", troubleshooting=[ "Wait a few seconds and retry", "Check for long-running operations", "Ensure WAL mode is enabled" ] ) CONFIGURATION_INVALID = ErrorInfo( message="Configuration error: {detail}", suggestion="Please check your environment variables", details="Invalid configuration detected at startup", troubleshooting=[ "Verify all STARPUNK_* environment variables", "Check for typos in configuration names", "Ensure values are in the correct format", "See docs/deployment/configuration.md" ] ) MICROPUB_MALFORMED = ErrorInfo( message="Invalid Micropub request format", suggestion="Please check your Micropub client configuration", details="The request doesn't conform to Micropub specification", troubleshooting=[ "Ensure Content-Type is correct", "Verify required fields are present", "Check for proper encoding", "See https://www.w3.org/TR/micropub/" ] ) def format_error(self, error_key: str, **kwargs) -> dict: """Format error for response""" error_info = getattr(self, error_key) return { 'error': { 'message': error_info.message.format(**kwargs), 'suggestion': error_info.suggestion, 'troubleshooting': error_info.troubleshooting } } ``` ### Database Connection Pool Optimization ```python # starpunk/database/pool.py from contextlib import contextmanager from threading import Semaphore, Lock from queue import Queue, Empty, Full import sqlite3 class ConnectionPool: """Thread-safe SQLite connection pool""" def __init__( self, database_path: str, pool_size: int = None, timeout: float = None ): self.database_path = database_path self.pool_size = pool_size or config.DB_CONNECTION_POOL_SIZE self.timeout = timeout or config.DB_CONNECTION_TIMEOUT self._pool = Queue(maxsize=self.pool_size) self._all_connections = [] self._lock = Lock() self._stats = { 'acquired': 0, 'released': 0, 'created': 0, 'wait_time_total': 0, 'active': 0 } # Pre-create connections for _ in range(self.pool_size): self._create_connection() def _create_connection(self) -> sqlite3.Connection: """Create a new database connection""" conn = sqlite3.connect(self.database_path) # Configure connection for production conn.execute("PRAGMA journal_mode=WAL") conn.execute(f"PRAGMA busy_timeout={config.DB_BUSY_TIMEOUT}") conn.execute("PRAGMA synchronous=NORMAL") conn.execute("PRAGMA temp_store=MEMORY") # Enable row factory for dict-like access conn.row_factory = sqlite3.Row with self._lock: self._all_connections.append(conn) self._stats['created'] += 1 return conn @contextmanager def acquire(self): """Acquire connection from pool""" start_time = time.time() conn = None try: # Try to get connection with timeout conn = self._pool.get(timeout=self.timeout) wait_time = time.time() - start_time with self._lock: self._stats['acquired'] += 1 self._stats['wait_time_total'] += wait_time self._stats['active'] += 1 if wait_time > 1.0: logger.warning( "Slow connection acquisition", extra={'wait_time': wait_time} ) yield conn except Empty: raise DatabaseError( "Connection pool exhausted", suggestion="Increase pool size or optimize queries", details={ 'pool_size': self.pool_size, 'timeout': self.timeout } ) finally: if conn: # Return connection to pool try: self._pool.put_nowait(conn) with self._lock: self._stats['released'] += 1 self._stats['active'] -= 1 except Full: # Pool is full, close the connection conn.close() def get_stats(self) -> dict: """Get pool statistics""" with self._lock: return { **self._stats, 'pool_size': self.pool_size, 'available': self._pool.qsize() } def close_all(self): """Close all connections in pool""" while not self._pool.empty(): try: conn = self._pool.get_nowait() conn.close() except Empty: break for conn in self._all_connections: try: conn.close() except: pass # Global pool instance _connection_pool = None def get_connection_pool() -> ConnectionPool: """Get or create connection pool""" global _connection_pool if _connection_pool is None: _connection_pool = ConnectionPool( database_path=config.DATABASE_PATH ) return _connection_pool @contextmanager def get_db(): """Get database connection from pool""" pool = get_connection_pool() with pool.acquire() as conn: yield conn ``` ### Structured Logging Implementation ```python # starpunk/logging/setup.py import logging import json import sys from uuid import uuid4 def setup_logging(): """Configure structured logging for production""" # Determine environment is_production = config.ENV == 'production' # Configure root logger root = logging.getLogger() root.setLevel(config.LOG_LEVEL) # Remove default handler root.handlers = [] # Create appropriate handler handler = logging.StreamHandler(sys.stdout) if is_production: # JSON format for production handler.setFormatter(JSONFormatter()) else: # Human-readable for development handler.setFormatter(logging.Formatter( '%(asctime)s - %(name)s - %(levelname)s - %(message)s' )) root.addHandler(handler) # Configure specific loggers logging.getLogger('starpunk').setLevel(config.LOG_LEVEL) logging.getLogger('werkzeug').setLevel(logging.WARNING) logger.info( "Logging configured", extra={ 'level': config.LOG_LEVEL, 'format': 'json' if is_production else 'human' } ) class JSONFormatter(logging.Formatter): """JSON log formatter for structured logging""" def format(self, record): log_data = { 'timestamp': self.formatTime(record), 'level': record.levelname, 'logger': record.name, 'message': record.getMessage(), 'request_id': getattr(record, 'request_id', None), } # Add extra fields if hasattr(record, 'extra'): log_data.update(record.extra) # Add exception info if record.exc_info: log_data['exception'] = self.formatException(record.exc_info) return json.dumps(log_data) # Request context middleware from flask import g @app.before_request def add_request_id(): """Add unique request ID for correlation""" g.request_id = str(uuid4())[:8] # Configure logger for this request logging.LoggerAdapter( logger, {'request_id': g.request_id} ) ``` ### Enhanced Health Checks ```python # starpunk/health.py from datetime import datetime class HealthChecker: """System health checking""" def __init__(self): self.start_time = datetime.now() def check_basic(self) -> dict: """Basic health check for liveness probe""" return { 'status': 'healthy', 'timestamp': datetime.now().isoformat() } def check_detailed(self) -> dict: """Detailed health check for readiness probe""" checks = { 'database': self._check_database(), 'search': self._check_search(), 'filesystem': self._check_filesystem(), 'memory': self._check_memory() } # Overall status all_healthy = all(c['healthy'] for c in checks.values()) return { 'status': 'healthy' if all_healthy else 'degraded', 'timestamp': datetime.now().isoformat(), 'uptime': str(datetime.now() - self.start_time), 'version': __version__, 'checks': checks } def _check_database(self) -> dict: """Check database connectivity""" try: with get_db() as conn: conn.execute("SELECT 1") pool_stats = get_connection_pool().get_stats() return { 'healthy': True, 'pool_active': pool_stats['active'], 'pool_size': pool_stats['pool_size'] } except Exception as e: return { 'healthy': False, 'error': str(e) } def _check_search(self) -> dict: """Check search engine status""" try: engine_type = 'fts5' if has_fts5() else 'fallback' return { 'healthy': True, 'engine': engine_type, 'enabled': config.SEARCH_ENABLED } except Exception as e: return { 'healthy': False, 'error': str(e) } def _check_filesystem(self) -> dict: """Check filesystem access""" try: # Check if we can write to temp import tempfile with tempfile.NamedTemporaryFile() as f: f.write(b'test') return {'healthy': True} except Exception as e: return { 'healthy': False, 'error': str(e) } def _check_memory(self) -> dict: """Check memory usage""" memory_mb = get_memory_usage() threshold = config.MEMORY_THRESHOLD_MB return { 'healthy': memory_mb < threshold, 'usage_mb': memory_mb, 'threshold_mb': threshold } # Health check endpoints @app.route('/health') def health(): """Basic health check endpoint""" checker = HealthChecker() result = checker.check_basic() status_code = 200 if result['status'] == 'healthy' else 503 return jsonify(result), status_code @app.route('/health/ready') def health_ready(): """Readiness probe endpoint""" checker = HealthChecker() # Detailed check only for authenticated or configured if config.HEALTH_CHECK_DETAILED or is_admin(): result = checker.check_detailed() else: result = checker.check_basic() status_code = 200 if result['status'] == 'healthy' else 503 return jsonify(result), status_code ``` ### Session Timeout Handling ```python # starpunk/auth/session.py from datetime import datetime, timedelta class SessionManager: """Manage user sessions with configurable timeout""" def __init__(self): self.timeout = config.SESSION_TIMEOUT def create_session(self, user_id: str) -> str: """Create new session with timeout""" session_id = str(uuid4()) expires_at = datetime.now() + timedelta(seconds=self.timeout) # Store in database with get_db() as conn: conn.execute( """ INSERT INTO sessions (id, user_id, expires_at, created_at) VALUES (?, ?, ?, ?) """, (session_id, user_id, expires_at, datetime.now()) ) logger.info( "Session created", extra={ 'user_id': user_id, 'timeout': self.timeout } ) return session_id def validate_session(self, session_id: str) -> Optional[str]: """Validate session and extend if valid""" with get_db() as conn: result = conn.execute( """ SELECT user_id, expires_at FROM sessions WHERE id = ? AND expires_at > ? """, (session_id, datetime.now()) ).fetchone() if result: # Extend session new_expires = datetime.now() + timedelta( seconds=self.timeout ) conn.execute( """ UPDATE sessions SET expires_at = ?, last_accessed = ? WHERE id = ? """, (new_expires, datetime.now(), session_id) ) return result['user_id'] return None def cleanup_expired(self): """Remove expired sessions""" with get_db() as conn: deleted = conn.execute( """ DELETE FROM sessions WHERE expires_at < ? """, (datetime.now(),) ).rowcount if deleted > 0: logger.info( "Cleaned up expired sessions", extra={'count': deleted} ) ``` ## Testing Strategy ### Unit Tests 1. FTS5 detection and fallback 2. Error message formatting 3. Connection pool operations 4. Health check components 5. Session timeout logic ### Integration Tests 1. Search with and without FTS5 2. Error handling end-to-end 3. Connection pool under load 4. Health endpoints 5. Session expiration ### Load Tests ```python def test_connection_pool_under_load(): """Test connection pool with concurrent requests""" pool = ConnectionPool(":memory:", pool_size=5) def worker(): for _ in range(100): with pool.acquire() as conn: conn.execute("SELECT 1") threads = [Thread(target=worker) for _ in range(20)] for t in threads: t.start() for t in threads: t.join() stats = pool.get_stats() assert stats['acquired'] == 2000 assert stats['released'] == 2000 ``` ## Migration Considerations ### Database Schema Updates ```sql -- Add sessions table if not exists CREATE TABLE IF NOT EXISTS sessions ( id TEXT PRIMARY KEY, user_id TEXT NOT NULL, created_at TIMESTAMP NOT NULL, expires_at TIMESTAMP NOT NULL, last_accessed TIMESTAMP, INDEX idx_sessions_expires (expires_at) ); ``` ### Configuration Migration 1. Add new environment variables with defaults 2. Document in deployment guide 3. Update example .env file ## Performance Impact ### Expected Improvements - Connection pooling: 20-30% reduction in query latency - Structured logging: <1ms per log statement - Health checks: <10ms response time - Session management: Minimal overhead ### Resource Usage - Connection pool: ~5MB per connection - Logging buffer: <1MB - Session storage: ~1KB per active session ## Security Considerations 1. **Connection Pool**: Prevent connection exhaustion attacks 2. **Error Messages**: Never expose sensitive information 3. **Health Checks**: Require auth for detailed info 4. **Session Timeout**: Configurable for security/UX balance 5. **Logging**: Sanitize all user input ## Acceptance Criteria 1. ✅ FTS5 unavailability handled gracefully 2. ✅ Clear error messages with troubleshooting 3. ✅ Connection pooling implemented and optimized 4. ✅ Structured logging with levels 5. ✅ Enhanced health check endpoints 6. ✅ Session timeout handling 7. ✅ All features configurable 8. ✅ Zero breaking changes 9. ✅ Performance improvements measured 10. ✅ Production deployment guide updated