# Production Readiness Improvements Specification
## Overview
The v1.1.1 production readiness work focuses on robustness, error handling, resource optimization, and operational visibility, so that StarPunk runs reliably in production.
## Requirements
### Functional Requirements
1. **Graceful FTS5 Degradation**
   - Detect FTS5 availability at startup
   - Automatically fall back to LIKE-based search
   - Log clear warnings about reduced functionality
   - Document SQLite compilation requirements
2. **Enhanced Error Messages**
   - Provide actionable error messages for common issues
   - Include troubleshooting steps
   - Differentiate between user and system errors
   - Add configuration validation at startup
3. **Database Connection Pooling**
   - Optimize connection pool size
   - Monitor pool usage
   - Handle connection exhaustion gracefully
   - Configure pool parameters
4. **Structured Logging**
   - Implement log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
   - JSON-structured logs for production
   - Human-readable logs for development
   - Request correlation IDs
5. **Health Check Improvements**
   - Enhanced /health endpoint
   - Detailed health status (when authorized)
   - Component health checks
   - Readiness vs. liveness probes
### Non-Functional Requirements
1. **Reliability**
   - Graceful handling of all error conditions
   - No crashes from user input
   - Automatic recovery from transient errors
2. **Observability**
   - Clear logging of all operations
   - Traceable request flow
   - Diagnostic information available
3. **Performance**
   - Connection pooling reduces latency
   - Efficient error-handling paths
   - Minimal logging overhead
## Design
### FTS5 Graceful Degradation
```python
# starpunk/search/engine.py
import logging
import sqlite3
from typing import List

# `SearchEngine`, `FTS5SearchEngine`, and `SearchResult` are defined
# elsewhere in this module; `get_db` comes from starpunk.database.pool.
logger = logging.getLogger(__name__)


class SearchEngineFactory:
    """Factory for creating the appropriate search engine"""

    @staticmethod
    def create() -> SearchEngine:
        """Create search engine based on availability"""
        if SearchEngineFactory._check_fts5():
            logger.info("Using FTS5 search engine")
            return FTS5SearchEngine()
        else:
            logger.warning(
                "FTS5 not available. Using fallback search engine. "
                "For better search performance, please ensure SQLite "
                "is compiled with FTS5 support. See: "
                "https://www.sqlite.org/fts5.html#compiling_and_using_fts5"
            )
            return FallbackSearchEngine()

    @staticmethod
    def _check_fts5() -> bool:
        """Check if FTS5 is available"""
        try:
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE VIRTUAL TABLE test_fts USING fts5(content)"
            )
            conn.close()
            return True
        except sqlite3.OperationalError:
            return False


class FallbackSearchEngine(SearchEngine):
    """LIKE-based search for systems without FTS5"""

    def search(self, query: str, limit: int = 50) -> List[SearchResult]:
        """Perform case-insensitive LIKE search"""
        sql = """
            SELECT
                id,
                content,
                created_at,
                0 AS rank  -- No ranking available
            FROM notes
            WHERE
                content LIKE ? OR
                content LIKE ? OR
                content LIKE ?
            ORDER BY created_at DESC
            LIMIT ?
        """
        # Search for the term at the start, middle, or end of the content
        patterns = [
            f'{query}%',    # Starts with
            f'% {query}%',  # Word in middle
            f'%{query}'     # Ends with
        ]
        results = []
        with get_db() as conn:
            cursor = conn.execute(sql, (*patterns, limit))
            for row in cursor:
                results.append(SearchResult(*row))
        return results
```
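The factory is intended to run once at startup, with the resulting engine reused for all queries. A minimal wiring sketch (the `app.search_engine` attribute is an assumption for illustration, not an existing StarPunk API):

```python
from flask import Flask

from starpunk.search.engine import SearchEngineFactory


def create_app() -> Flask:
    app = Flask(__name__)
    # Selected once at startup: FTS5SearchEngine or FallbackSearchEngine
    app.search_engine = SearchEngineFactory.create()
    return app
```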
### Enhanced Error Messages
```python
# starpunk/errors/messages.py
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ErrorInfo:
    """Structured error description (sketch; the final shape may differ)"""
    message: str
    suggestion: str
    details: str
    troubleshooting: List[str] = field(default_factory=list)


class ErrorMessages:
    """User-friendly error messages with troubleshooting"""

    DATABASE_LOCKED = ErrorInfo(
        message="The database is temporarily locked",
        suggestion="Please try again in a moment",
        details="This usually happens during concurrent writes",
        troubleshooting=[
            "Wait a few seconds and retry",
            "Check for long-running operations",
            "Ensure WAL mode is enabled"
        ]
    )

    CONFIGURATION_INVALID = ErrorInfo(
        message="Configuration error: {detail}",
        suggestion="Please check your environment variables",
        details="Invalid configuration detected at startup",
        troubleshooting=[
            "Verify all STARPUNK_* environment variables",
            "Check for typos in configuration names",
            "Ensure values are in the correct format",
            "See docs/deployment/configuration.md"
        ]
    )

    MICROPUB_MALFORMED = ErrorInfo(
        message="Invalid Micropub request format",
        suggestion="Please check your Micropub client configuration",
        details="The request doesn't conform to the Micropub specification",
        troubleshooting=[
            "Ensure Content-Type is correct",
            "Verify required fields are present",
            "Check for proper encoding",
            "See https://www.w3.org/TR/micropub/"
        ]
    )

    def format_error(self, error_key: str, **kwargs) -> dict:
        """Format an error for a JSON response"""
        error_info = getattr(self, error_key)
        return {
            'error': {
                'message': error_info.message.format(**kwargs),
                'suggestion': error_info.suggestion,
                'troubleshooting': error_info.troubleshooting
            }
        }
```
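Requirement 2 also calls for configuration validation at startup, which has no dedicated design block yet. A minimal fail-fast sketch, assuming a `validate_config()` hook run before the app serves traffic (the module path and variable names checked here are illustrative):

```python
# starpunk/config/validate.py (hypothetical module path)
import os
import sys


def validate_config() -> None:
    """Fail fast on invalid configuration; names shown are examples."""
    problems = []

    if not os.environ.get('STARPUNK_DATABASE_PATH'):
        problems.append("STARPUNK_DATABASE_PATH is not set")

    pool_size = os.environ.get('STARPUNK_DB_POOL_SIZE', '5')
    if not pool_size.isdigit() or int(pool_size) < 1:
        problems.append(
            f"STARPUNK_DB_POOL_SIZE must be a positive integer, "
            f"got {pool_size!r}"
        )

    if problems:
        for p in problems:
            print(f"Configuration error: {p}", file=sys.stderr)
        print("See docs/deployment/configuration.md", file=sys.stderr)
        sys.exit(1)
```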
### Database Connection Pool Optimization
```python
# starpunk/database/pool.py
import logging
import sqlite3
import time
from contextlib import contextmanager
from queue import Queue, Empty, Full
from threading import Lock

# `config` and `DatabaseError` are assumed to come from the application's
# settings and error modules; imports omitted in this sketch.
logger = logging.getLogger(__name__)


class ConnectionPool:
    """Thread-safe SQLite connection pool"""

    def __init__(
        self,
        database_path: str,
        pool_size: int = None,
        timeout: float = None
    ):
        self.database_path = database_path
        self.pool_size = pool_size or config.DB_CONNECTION_POOL_SIZE
        self.timeout = timeout or config.DB_CONNECTION_TIMEOUT
        self._pool = Queue(maxsize=self.pool_size)
        self._all_connections = []
        self._lock = Lock()
        self._stats = {
            'acquired': 0,
            'released': 0,
            'created': 0,
            'wait_time_total': 0,
            'active': 0
        }
        # Pre-create connections and place them in the pool so they are
        # immediately available to acquirers
        for _ in range(self.pool_size):
            self._pool.put(self._create_connection())

    def _create_connection(self) -> sqlite3.Connection:
        """Create a new database connection"""
        # check_same_thread=False is required because pooled connections
        # are handed out across threads
        conn = sqlite3.connect(self.database_path, check_same_thread=False)
        # Configure connection for production
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(f"PRAGMA busy_timeout={config.DB_BUSY_TIMEOUT}")
        conn.execute("PRAGMA synchronous=NORMAL")
        conn.execute("PRAGMA temp_store=MEMORY")
        # Enable row factory for dict-like access
        conn.row_factory = sqlite3.Row
        with self._lock:
            self._all_connections.append(conn)
            self._stats['created'] += 1
        return conn

    @contextmanager
    def acquire(self):
        """Acquire a connection from the pool"""
        start_time = time.time()
        conn = None
        try:
            # Try to get a connection, waiting up to the configured timeout
            conn = self._pool.get(timeout=self.timeout)
            wait_time = time.time() - start_time
            with self._lock:
                self._stats['acquired'] += 1
                self._stats['wait_time_total'] += wait_time
                self._stats['active'] += 1
            if wait_time > 1.0:
                logger.warning(
                    "Slow connection acquisition",
                    extra={'wait_time': wait_time}
                )
            yield conn
        except Empty:
            raise DatabaseError(
                "Connection pool exhausted",
                suggestion="Increase pool size or optimize queries",
                details={
                    'pool_size': self.pool_size,
                    'timeout': self.timeout
                }
            )
        finally:
            if conn:
                # Return the connection to the pool
                try:
                    self._pool.put_nowait(conn)
                    with self._lock:
                        self._stats['released'] += 1
                        self._stats['active'] -= 1
                except Full:
                    # Pool is full, close the connection
                    conn.close()

    def get_stats(self) -> dict:
        """Get pool statistics"""
        with self._lock:
            return {
                **self._stats,
                'pool_size': self.pool_size,
                'available': self._pool.qsize()
            }

    def close_all(self):
        """Close all connections in the pool"""
        while not self._pool.empty():
            try:
                self._pool.get_nowait().close()
            except Empty:
                break
        for conn in self._all_connections:
            try:
                conn.close()
            except Exception:
                pass


# Global pool instance
_connection_pool = None


def get_connection_pool() -> ConnectionPool:
    """Get or create the connection pool"""
    global _connection_pool
    if _connection_pool is None:
        _connection_pool = ConnectionPool(
            database_path=config.DATABASE_PATH
        )
    return _connection_pool


@contextmanager
def get_db():
    """Get a database connection from the pool"""
    pool = get_connection_pool()
    with pool.acquire() as conn:
        yield conn
```
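Because `get_db()` keeps the same context-manager shape as before, existing call sites are unchanged; only the connection's origin differs. An illustrative call site (the query is an example):

```python
from starpunk.database.pool import get_db


def count_notes() -> int:
    # Pulls a pooled connection instead of opening a new one per call
    with get_db() as conn:
        return conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
```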
### Structured Logging Implementation
```python
# starpunk/logging/setup.py
import json
import logging
import sys
from uuid import uuid4

from flask import g

# `app` and `config` are the Flask application and settings objects;
# imports omitted in this sketch.
logger = logging.getLogger(__name__)

# Attributes present on every LogRecord; anything else on a record was
# passed in via the `extra=` keyword
_RESERVED_ATTRS = set(
    logging.LogRecord('', 0, '', 0, '', (), None).__dict__
) | {'message', 'asctime'}


def setup_logging():
    """Configure structured logging for production"""
    # Determine environment
    is_production = config.ENV == 'production'
    # Configure root logger
    root = logging.getLogger()
    root.setLevel(config.LOG_LEVEL)
    # Remove default handlers
    root.handlers = []
    # Create the appropriate handler
    handler = logging.StreamHandler(sys.stdout)
    if is_production:
        # JSON format for production
        handler.setFormatter(JSONFormatter())
    else:
        # Human-readable for development
        handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))
    root.addHandler(handler)
    # Configure specific loggers
    logging.getLogger('starpunk').setLevel(config.LOG_LEVEL)
    logging.getLogger('werkzeug').setLevel(logging.WARNING)
    logger.info(
        "Logging configured",
        extra={
            'level': config.LOG_LEVEL,
            'format': 'json' if is_production else 'human'
        }
    )


class JSONFormatter(logging.Formatter):
    """JSON log formatter for structured logging"""

    def format(self, record):
        log_data = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'request_id': getattr(record, 'request_id', None),
        }
        # Fields passed via `extra=` become attributes on the record, so
        # copy over anything that isn't a standard LogRecord field
        for key, value in record.__dict__.items():
            if key not in _RESERVED_ATTRS and key not in log_data:
                log_data[key] = value
        # Add exception info
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)
        return json.dumps(log_data)


# Request context middleware
@app.before_request
def add_request_id():
    """Attach a unique request ID for log correlation"""
    g.request_id = str(uuid4())[:8]
    # Bind a request-scoped adapter so log lines carry the request ID
    g.log = logging.LoggerAdapter(logger, {'request_id': g.request_id})
```
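For reference, a log call with extra fields and the JSON line the formatter above would produce (timestamp and request ID values are illustrative):

```python
logger.info("Note created", extra={'note_id': 42, 'content_length': 280})
# Emits one JSON object per line, e.g.:
# {"timestamp": "2025-11-25 13:28:56,123", "level": "INFO",
#  "logger": "starpunk.notes", "message": "Note created",
#  "request_id": "a1b2c3d4", "note_id": 42, "content_length": 280}
```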
### Enhanced Health Checks
```python
# starpunk/health.py
from datetime import datetime

from flask import jsonify

# `app`, `config`, `get_db`, `get_connection_pool`, `has_fts5`,
# `get_memory_usage`, `is_admin`, and `__version__` are assumed to be
# provided by the surrounding application; imports omitted in this sketch.


class HealthChecker:
    """System health checking"""

    def __init__(self):
        self.start_time = datetime.now()

    def check_basic(self) -> dict:
        """Basic health check for the liveness probe"""
        return {
            'status': 'healthy',
            'timestamp': datetime.now().isoformat()
        }

    def check_detailed(self) -> dict:
        """Detailed health check for the readiness probe"""
        checks = {
            'database': self._check_database(),
            'search': self._check_search(),
            'filesystem': self._check_filesystem(),
            'memory': self._check_memory()
        }
        # Overall status
        all_healthy = all(c['healthy'] for c in checks.values())
        return {
            'status': 'healthy' if all_healthy else 'degraded',
            'timestamp': datetime.now().isoformat(),
            'uptime': str(datetime.now() - self.start_time),
            'version': __version__,
            'checks': checks
        }

    def _check_database(self) -> dict:
        """Check database connectivity"""
        try:
            with get_db() as conn:
                conn.execute("SELECT 1")
            pool_stats = get_connection_pool().get_stats()
            return {
                'healthy': True,
                'pool_active': pool_stats['active'],
                'pool_size': pool_stats['pool_size']
            }
        except Exception as e:
            return {
                'healthy': False,
                'error': str(e)
            }

    def _check_search(self) -> dict:
        """Check search engine status"""
        try:
            engine_type = 'fts5' if has_fts5() else 'fallback'
            return {
                'healthy': True,
                'engine': engine_type,
                'enabled': config.SEARCH_ENABLED
            }
        except Exception as e:
            return {
                'healthy': False,
                'error': str(e)
            }

    def _check_filesystem(self) -> dict:
        """Check filesystem access"""
        try:
            # Check that we can write to the temp directory
            import tempfile
            with tempfile.NamedTemporaryFile() as f:
                f.write(b'test')
            return {'healthy': True}
        except Exception as e:
            return {
                'healthy': False,
                'error': str(e)
            }

    def _check_memory(self) -> dict:
        """Check memory usage"""
        memory_mb = get_memory_usage()
        threshold = config.MEMORY_THRESHOLD_MB
        return {
            'healthy': memory_mb < threshold,
            'usage_mb': memory_mb,
            'threshold_mb': threshold
        }


# A single module-level instance keeps start_time (and thus the reported
# uptime) stable across requests
_checker = HealthChecker()


# Health check endpoints
@app.route('/health')
def health():
    """Basic health check endpoint (liveness)"""
    result = _checker.check_basic()
    status_code = 200 if result['status'] == 'healthy' else 503
    return jsonify(result), status_code


@app.route('/health/ready')
def health_ready():
    """Readiness probe endpoint"""
    # Detailed check only when configured or the caller is authenticated
    if config.HEALTH_CHECK_DETAILED or is_admin():
        result = _checker.check_detailed()
    else:
        result = _checker.check_basic()
    status_code = 200 if result['status'] == 'healthy' else 503
    return jsonify(result), status_code
```
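A quick way to verify the liveness/readiness split is with Flask's test client. A sketch, assuming the Flask `app` above is importable from `starpunk` (the import path is an assumption):

```python
from starpunk import app  # assumed import path


def test_liveness_is_cheap():
    # /health never touches dependencies, so it should always return 200
    resp = app.test_client().get('/health')
    assert resp.status_code == 200


def test_readiness_reflects_component_health():
    # /health/ready returns 503 when any component check is unhealthy
    resp = app.test_client().get('/health/ready')
    assert resp.status_code in (200, 503)
```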
### Session Timeout Handling
```python
# starpunk/auth/session.py
import logging
from datetime import datetime, timedelta
from typing import Optional
from uuid import uuid4

# `config` and `get_db` are as in the modules above.
logger = logging.getLogger(__name__)


class SessionManager:
    """Manage user sessions with a configurable timeout"""

    def __init__(self):
        self.timeout = config.SESSION_TIMEOUT

    def create_session(self, user_id: str) -> str:
        """Create a new session with a timeout"""
        session_id = str(uuid4())
        expires_at = datetime.now() + timedelta(seconds=self.timeout)
        # Store in database
        with get_db() as conn:
            conn.execute(
                """
                INSERT INTO sessions (id, user_id, expires_at, created_at)
                VALUES (?, ?, ?, ?)
                """,
                (session_id, user_id, expires_at, datetime.now())
            )
        logger.info(
            "Session created",
            extra={
                'user_id': user_id,
                'timeout': self.timeout
            }
        )
        return session_id

    def validate_session(self, session_id: str) -> Optional[str]:
        """Validate a session and extend it if valid"""
        with get_db() as conn:
            result = conn.execute(
                """
                SELECT user_id, expires_at
                FROM sessions
                WHERE id = ? AND expires_at > ?
                """,
                (session_id, datetime.now())
            ).fetchone()
            if result:
                # Sliding expiration: extend the session on each valid use
                new_expires = datetime.now() + timedelta(
                    seconds=self.timeout
                )
                conn.execute(
                    """
                    UPDATE sessions
                    SET expires_at = ?, last_accessed = ?
                    WHERE id = ?
                    """,
                    (new_expires, datetime.now(), session_id)
                )
                return result['user_id']
            return None

    def cleanup_expired(self):
        """Remove expired sessions"""
        with get_db() as conn:
            deleted = conn.execute(
                """
                DELETE FROM sessions
                WHERE expires_at < ?
                """,
                (datetime.now(),)
            ).rowcount
        if deleted > 0:
            logger.info(
                "Cleaned up expired sessions",
                extra={'count': deleted}
            )
```
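`cleanup_expired()` needs something to call it periodically. A minimal background-timer sketch (the hourly interval and in-process approach are assumptions; a cron job or external scheduler would work equally well):

```python
import threading

from starpunk.auth.session import SessionManager


def start_session_cleanup(interval_seconds: int = 3600) -> None:
    """Run cleanup_expired() in the background, hourly by default."""
    manager = SessionManager()

    def loop():
        manager.cleanup_expired()
        # Re-arm the timer; daemon thread so it never blocks shutdown
        t = threading.Timer(interval_seconds, loop)
        t.daemon = True
        t.start()

    loop()
```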
## Testing Strategy
### Unit Tests
1. FTS5 detection and fallback (see the sketch after this list)
2. Error message formatting
3. Connection pool operations
4. Health check components
5. Session timeout logic
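For the first item, a pytest-style sketch of the fallback path (uses monkeypatching; the test name is illustrative):

```python
from starpunk.search.engine import (
    FallbackSearchEngine,
    SearchEngineFactory,
)


def test_falls_back_when_fts5_missing(monkeypatch):
    # Force the availability probe to report no FTS5 support
    monkeypatch.setattr(
        SearchEngineFactory, '_check_fts5', staticmethod(lambda: False)
    )
    engine = SearchEngineFactory.create()
    assert isinstance(engine, FallbackSearchEngine)
```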
### Integration Tests
1. Search with and without FTS5
2. Error handling end-to-end
3. Connection pool under load
4. Health endpoints
5. Session expiration
### Load Tests
```python
def test_connection_pool_under_load():
"""Test connection pool with concurrent requests"""
pool = ConnectionPool(":memory:", pool_size=5)
def worker():
for _ in range(100):
with pool.acquire() as conn:
conn.execute("SELECT 1")
threads = [Thread(target=worker) for _ in range(20)]
for t in threads:
t.start()
for t in threads:
t.join()
stats = pool.get_stats()
assert stats['acquired'] == 2000
assert stats['released'] == 2000
```
## Migration Considerations
### Database Schema Updates
```sql
-- Add sessions table if it does not exist
CREATE TABLE IF NOT EXISTS sessions (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NOT NULL,
    last_accessed TIMESTAMP
);

-- SQLite does not support inline INDEX clauses in CREATE TABLE;
-- create the expiry index separately
CREATE INDEX IF NOT EXISTS idx_sessions_expires ON sessions (expires_at);
```
### Configuration Migration
1. Add new environment variables with defaults (see the sketch after this list)
2. Document in deployment guide
3. Update example .env file
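A sketch of defaulted loading for the new variables (names and default values are illustrative; the deployment guide remains the source of truth):

```python
# starpunk/config.py (illustrative additions)
import os

DB_CONNECTION_POOL_SIZE = int(os.environ.get('STARPUNK_DB_POOL_SIZE', '5'))
DB_CONNECTION_TIMEOUT = float(os.environ.get('STARPUNK_DB_TIMEOUT', '5.0'))
DB_BUSY_TIMEOUT = int(os.environ.get('STARPUNK_DB_BUSY_TIMEOUT', '5000'))
SESSION_TIMEOUT = int(os.environ.get('STARPUNK_SESSION_TIMEOUT', '86400'))
LOG_LEVEL = os.environ.get('STARPUNK_LOG_LEVEL', 'INFO')
HEALTH_CHECK_DETAILED = (
    os.environ.get('STARPUNK_HEALTH_CHECK_DETAILED', 'false').lower()
    == 'true'
)
```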
## Performance Impact
### Expected Improvements
- Connection pooling: 20-30% reduction in query latency
- Structured logging: <1ms per log statement
- Health checks: <10ms response time
- Session management: Minimal overhead
### Resource Usage
- Connection pool: ~5MB per connection
- Logging buffer: <1MB
- Session storage: ~1KB per active session
## Security Considerations
1. **Connection Pool**: Prevent connection exhaustion attacks
2. **Error Messages**: Never expose sensitive information
3. **Health Checks**: Require auth for detailed info
4. **Session Timeout**: Configurable for security/UX balance
5. **Logging**: Sanitize all user input (see the sketch below)
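For the logging point, a minimal sanitization sketch (the helper name is an assumption): strip control characters and cap length so user-supplied values cannot forge log lines:

```python
import re

# Control characters (including newlines) enable log-injection attacks
_CONTROL_CHARS = re.compile(r'[\x00-\x1f\x7f]')


def sanitize_for_log(value: str, max_len: int = 200) -> str:
    """Make a user-supplied string safe to embed in a log line."""
    return _CONTROL_CHARS.sub(' ', value)[:max_len]
```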
## Acceptance Criteria
1. ✅ FTS5 unavailability handled gracefully
2. ✅ Clear error messages with troubleshooting
3. ✅ Connection pooling implemented and optimized
4. ✅ Structured logging with levels
5. ✅ Enhanced health check endpoints
6. ✅ Session timeout handling
7. ✅ All features configurable
8. ✅ Zero breaking changes
9. ✅ Performance improvements measured
10. ✅ Production deployment guide updated