Production Readiness Improvements Specification

Overview

Production readiness improvements for v1.1.1 focus on robustness, error handling, resource optimization, and operational visibility to ensure StarPunk runs reliably in production environments.

Requirements

Functional Requirements

  1. Graceful FTS5 Degradation

    • Detect FTS5 availability at startup
    • Automatically fall back to LIKE-based search
    • Log clear warnings about reduced functionality
    • Document SQLite compilation requirements
  2. Enhanced Error Messages

    • Provide actionable error messages for common issues
    • Include troubleshooting steps
    • Differentiate between user and system errors
    • Add configuration validation at startup
  3. Database Connection Pooling

    • Optimize connection pool size
    • Monitor pool usage
    • Handle connection exhaustion gracefully
    • Configure pool parameters
  4. Structured Logging

    • Implement log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
    • JSON-structured logs for production
    • Human-readable logs for development
    • Request correlation IDs
  5. Health Check Improvements

    • Enhanced /health endpoint
    • Detailed health status (when authorized)
    • Component health checks
    • Readiness vs liveness probes

Non-Functional Requirements

  1. Reliability

    • Graceful handling of all error conditions
    • No crashes from user input
    • Automatic recovery from transient errors
  2. Observability

    • Clear logging of all operations
    • Traceable request flow
    • Diagnostic information available
  3. Performance

    • Connection pooling reduces latency
    • Efficient error handling paths
    • Minimal logging overhead

Design

FTS5 Graceful Degradation

# starpunk/search/engine.py
import logging
import sqlite3
from typing import List

# SearchEngine, FTS5SearchEngine, SearchResult, and get_db are
# defined elsewhere in the package
logger = logging.getLogger(__name__)

class SearchEngineFactory:
    """Factory for creating the appropriate search engine"""

    @staticmethod
    def create() -> SearchEngine:
        """Create search engine based on availability"""
        if SearchEngineFactory._check_fts5():
            logger.info("Using FTS5 search engine")
            return FTS5SearchEngine()
        else:
            logger.warning(
                "FTS5 not available. Using fallback search engine. "
                "For better search performance, please ensure SQLite "
                "is compiled with FTS5 support. See: "
                "https://www.sqlite.org/fts5.html#compiling_and_using_fts5"
            )
            return FallbackSearchEngine()

    @staticmethod
    def _check_fts5() -> bool:
        """Check if FTS5 is available"""
        try:
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE VIRTUAL TABLE test_fts USING fts5(content)"
            )
            conn.close()
            return True
        except sqlite3.OperationalError:
            return False

class FallbackSearchEngine(SearchEngine):
    """LIKE-based search for systems without FTS5"""

    def search(self, query: str, limit: int = 50) -> List[SearchResult]:
        """Perform case-insensitive LIKE search (substring match)"""
        sql = """
            SELECT
                id,
                content,
                created_at,
                0 AS rank  -- LIKE search provides no relevance ranking
            FROM notes
            WHERE content LIKE ? ESCAPE '\\'
            ORDER BY created_at DESC
            LIMIT ?
        """

        # Escape LIKE wildcards in the user's query so % and _ are
        # matched literally, then match the term anywhere in the content
        escaped = (
            query.replace('\\', '\\\\')
                 .replace('%', '\\%')
                 .replace('_', '\\_')
        )
        pattern = f'%{escaped}%'

        results = []
        with get_db() as conn:
            cursor = conn.execute(sql, (pattern, limit))
            for row in cursor:
                results.append(SearchResult(*row))

        return results
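
The FTS5-backed engine returned by the factory is not shown above. A minimal sketch follows, assuming an FTS5 table named notes_fts kept in sync with notes; the table name and join are illustrative, not part of this spec:

# Sketch only: assumes a `notes_fts` FTS5 table synchronized with `notes`
class FTS5SearchEngine(SearchEngine):
    """FTS5-based search with relevance ranking"""

    def search(self, query: str, limit: int = 50) -> List[SearchResult]:
        sql = """
            SELECT
                notes.id,
                notes.content,
                notes.created_at,
                rank
            FROM notes_fts
            JOIN notes ON notes.id = notes_fts.rowid
            WHERE notes_fts MATCH ?
            ORDER BY rank
            LIMIT ?
        """
        # Note: FTS5 query syntax characters in `query` may need escaping
        with get_db() as conn:
            return [
                SearchResult(*row)
                for row in conn.execute(sql, (query, limit))
            ]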

Enhanced Error Messages

# starpunk/errors/messages.py
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class ErrorInfo:
    """Structured error description used by ErrorMessages below"""
    message: str
    suggestion: str
    details: str
    troubleshooting: List[str] = field(default_factory=list)

class ErrorMessages:
    """User-friendly error messages with troubleshooting"""

    DATABASE_LOCKED = ErrorInfo(
        message="The database is temporarily locked",
        suggestion="Please try again in a moment",
        details="This usually happens during concurrent writes",
        troubleshooting=[
            "Wait a few seconds and retry",
            "Check for long-running operations",
            "Ensure WAL mode is enabled"
        ]
    )

    CONFIGURATION_INVALID = ErrorInfo(
        message="Configuration error: {detail}",
        suggestion="Please check your environment variables",
        details="Invalid configuration detected at startup",
        troubleshooting=[
            "Verify all STARPUNK_* environment variables",
            "Check for typos in configuration names",
            "Ensure values are in the correct format",
            "See docs/deployment/configuration.md"
        ]
    )

    MICROPUB_MALFORMED = ErrorInfo(
        message="Invalid Micropub request format",
        suggestion="Please check your Micropub client configuration",
        details="The request doesn't conform to Micropub specification",
        troubleshooting=[
            "Ensure Content-Type is correct",
            "Verify required fields are present",
            "Check for proper encoding",
            "See https://www.w3.org/TR/micropub/"
        ]
    )

    def format_error(self, error_key: str, **kwargs) -> dict:
        """Format error for response"""
        error_info = getattr(self, error_key)
        return {
            'error': {
                'message': error_info.message.format(**kwargs),
                'suggestion': error_info.suggestion,
                'troubleshooting': error_info.troubleshooting
            }
        }
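
Requirement 2 also calls for configuration validation at startup, which is not shown above. A minimal sketch, assuming the setting names used elsewhere in this spec and a ConfigurationError exception defined alongside ErrorMessages:

# Sketch: fail fast at startup on invalid configuration
def validate_config(config) -> None:
    """Collect configuration problems and raise once"""
    problems = []

    if not getattr(config, 'DATABASE_PATH', None):
        problems.append("DATABASE_PATH must be set")

    pool_size = getattr(config, 'DB_CONNECTION_POOL_SIZE', 0)
    if not isinstance(pool_size, int) or pool_size < 1:
        problems.append("DB_CONNECTION_POOL_SIZE must be a positive integer")

    if getattr(config, 'LOG_LEVEL', 'INFO') not in (
        'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'
    ):
        problems.append("LOG_LEVEL must be a standard logging level")

    if problems:
        # Surfaces to the user via CONFIGURATION_INVALID above
        raise ConfigurationError(detail="; ".join(problems))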

Database Connection Pool Optimization

# starpunk/database/pool.py
import logging
import sqlite3
import time
from contextlib import contextmanager
from queue import Queue, Empty, Full
from threading import Lock

# config and DatabaseError are provided elsewhere in the package
logger = logging.getLogger(__name__)

class ConnectionPool:
    """Thread-safe SQLite connection pool"""

    def __init__(
        self,
        database_path: str,
        pool_size: int = None,
        timeout: float = None
    ):
        self.database_path = database_path
        self.pool_size = pool_size or config.DB_CONNECTION_POOL_SIZE
        self.timeout = timeout or config.DB_CONNECTION_TIMEOUT
        self._pool = Queue(maxsize=self.pool_size)
        self._all_connections = []
        self._lock = Lock()
        self._stats = {
            'acquired': 0,
            'released': 0,
            'created': 0,
            'wait_time_total': 0,
            'active': 0
        }

        # Pre-create connections and place them in the pool so
        # acquire() has something to hand out
        for _ in range(self.pool_size):
            self._pool.put(self._create_connection())

    def _create_connection(self) -> sqlite3.Connection:
        """Create a new database connection"""
        # check_same_thread=False allows pooled connections to cross
        # threads; the pool serializes access to each connection
        conn = sqlite3.connect(self.database_path, check_same_thread=False)

        # Configure connection for production
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(f"PRAGMA busy_timeout={config.DB_BUSY_TIMEOUT}")
        conn.execute("PRAGMA synchronous=NORMAL")
        conn.execute("PRAGMA temp_store=MEMORY")

        # Enable row factory for dict-like access
        conn.row_factory = sqlite3.Row

        with self._lock:
            self._all_connections.append(conn)
            self._stats['created'] += 1

        return conn

    @contextmanager
    def acquire(self):
        """Acquire connection from pool"""
        start_time = time.time()
        conn = None

        try:
            # Try to get connection with timeout
            conn = self._pool.get(timeout=self.timeout)
            wait_time = time.time() - start_time

            with self._lock:
                self._stats['acquired'] += 1
                self._stats['wait_time_total'] += wait_time
                self._stats['active'] += 1

            if wait_time > 1.0:
                logger.warning(
                    "Slow connection acquisition",
                    extra={'wait_time': wait_time}
                )

            yield conn

        except Empty:
            raise DatabaseError(
                "Connection pool exhausted",
                suggestion="Increase pool size or optimize queries",
                details={
                    'pool_size': self.pool_size,
                    'timeout': self.timeout
                }
            )
        finally:
            if conn:
                # Update stats whether or not the connection fits back
                # into the pool
                with self._lock:
                    self._stats['released'] += 1
                    self._stats['active'] -= 1
                try:
                    self._pool.put_nowait(conn)
                except Full:
                    # Pool is full, close the surplus connection
                    conn.close()

    def get_stats(self) -> dict:
        """Get pool statistics"""
        with self._lock:
            return {
                **self._stats,
                'pool_size': self.pool_size,
                'available': self._pool.qsize()
            }

    def close_all(self):
        """Close all connections in pool"""
        while not self._pool.empty():
            try:
                conn = self._pool.get_nowait()
                conn.close()
            except Empty:
                break

        for conn in self._all_connections:
            try:
                conn.close()
            except Exception:
                pass  # Connection may already be closed

# Global pool instance
_connection_pool = None

def get_connection_pool() -> ConnectionPool:
    """Get or create connection pool"""
    global _connection_pool
    if _connection_pool is None:
        _connection_pool = ConnectionPool(
            database_path=config.DATABASE_PATH
        )
    return _connection_pool

@contextmanager
def get_db():
    """Get database connection from pool"""
    pool = get_connection_pool()
    with pool.acquire() as conn:
        yield conn
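
A usage sketch: application code goes through get_db() and never touches the pool directly, while the statistics support the monitoring requirement (count_notes and log_pool_stats are illustrative helpers, not part of the spec):

# Usage sketch
def count_notes() -> int:
    with get_db() as conn:
        return conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

def log_pool_stats() -> None:
    """Emit pool statistics for operational monitoring"""
    stats = get_connection_pool().get_stats()
    logger.info("Connection pool stats", extra={'pool': stats})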

Structured Logging Implementation

# starpunk/logging/setup.py
import json
import logging
import sys
from uuid import uuid4

logger = logging.getLogger(__name__)

def setup_logging():
    """Configure structured logging for production"""

    # Determine environment
    is_production = config.ENV == 'production'

    # Configure root logger
    root = logging.getLogger()
    root.setLevel(config.LOG_LEVEL)

    # Remove default handler
    root.handlers = []

    # Create appropriate handler
    handler = logging.StreamHandler(sys.stdout)

    if is_production:
        # JSON format for production
        handler.setFormatter(JSONFormatter())
    else:
        # Human-readable for development
        handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))

    # Attach request IDs to every record (RequestIdFilter is defined below)
    handler.addFilter(RequestIdFilter())

    root.addHandler(handler)

    # Configure specific loggers
    logging.getLogger('starpunk').setLevel(config.LOG_LEVEL)
    logging.getLogger('werkzeug').setLevel(logging.WARNING)

    logger.info(
        "Logging configured",
        extra={
            'level': config.LOG_LEVEL,
            'format': 'json' if is_production else 'human'
        }
    )

class JSONFormatter(logging.Formatter):
    """JSON log formatter for structured logging"""

    # Attributes present on every LogRecord; anything else arrived via
    # the `extra` argument and should be included in the output
    _RESERVED = set(
        logging.LogRecord('', 0, '', 0, '', (), None).__dict__
    ) | {'message', 'asctime'}

    def format(self, record):
        log_data = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'request_id': getattr(record, 'request_id', None),
        }

        # Fields passed via `extra` land directly on the record, not
        # under a nested `extra` attribute
        for key, value in record.__dict__.items():
            if key not in self._RESERVED:
                log_data[key] = value

        # Add exception info
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_data, default=str)

# Request context integration: a logging filter reads the request ID
# from flask.g and stamps it onto every record
from flask import g, has_request_context

class RequestIdFilter(logging.Filter):
    """Attach the current request ID to log records"""

    def filter(self, record):
        record.request_id = (
            getattr(g, 'request_id', None) if has_request_context() else None
        )
        return True

@app.before_request
def add_request_id():
    """Assign a unique request ID for log correlation"""
    g.request_id = str(uuid4())[:8]
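
With the formatter and filter above, a production log line would look roughly like this (all field values are illustrative):

{"timestamp": "2025-11-25 13:28:56,123", "level": "INFO", "logger": "starpunk.notes", "message": "Note created", "request_id": "a1b2c3d4", "note_id": 42}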

Enhanced Health Checks

# starpunk/health.py
from datetime import datetime

from flask import jsonify

# app, config, __version__, get_db, get_connection_pool, has_fts5,
# is_admin, and get_memory_usage are provided elsewhere in the package

class HealthChecker:
    """System health checking"""

    def __init__(self):
        self.start_time = datetime.now()

    def check_basic(self) -> dict:
        """Basic health check for liveness probe"""
        return {
            'status': 'healthy',
            'timestamp': datetime.now().isoformat()
        }

    def check_detailed(self) -> dict:
        """Detailed health check for readiness probe"""
        checks = {
            'database': self._check_database(),
            'search': self._check_search(),
            'filesystem': self._check_filesystem(),
            'memory': self._check_memory()
        }

        # Overall status
        all_healthy = all(c['healthy'] for c in checks.values())

        return {
            'status': 'healthy' if all_healthy else 'degraded',
            'timestamp': datetime.now().isoformat(),
            'uptime': str(datetime.now() - self.start_time),
            'version': __version__,
            'checks': checks
        }

    def _check_database(self) -> dict:
        """Check database connectivity"""
        try:
            with get_db() as conn:
                conn.execute("SELECT 1")

            pool_stats = get_connection_pool().get_stats()
            return {
                'healthy': True,
                'pool_active': pool_stats['active'],
                'pool_size': pool_stats['pool_size']
            }
        except Exception as e:
            return {
                'healthy': False,
                'error': str(e)
            }

    def _check_search(self) -> dict:
        """Check search engine status"""
        try:
            engine_type = 'fts5' if has_fts5() else 'fallback'
            return {
                'healthy': True,
                'engine': engine_type,
                'enabled': config.SEARCH_ENABLED
            }
        except Exception as e:
            return {
                'healthy': False,
                'error': str(e)
            }

    def _check_filesystem(self) -> dict:
        """Check filesystem access"""
        try:
            # Check if we can write to temp
            import tempfile
            with tempfile.NamedTemporaryFile() as f:
                f.write(b'test')

            return {'healthy': True}
        except Exception as e:
            return {
                'healthy': False,
                'error': str(e)
            }

    def _check_memory(self) -> dict:
        """Check memory usage"""
        memory_mb = get_memory_usage()
        threshold = config.MEMORY_THRESHOLD_MB

        return {
            'healthy': memory_mb < threshold,
            'usage_mb': memory_mb,
            'threshold_mb': threshold
        }

# Health check endpoints; a single module-level checker preserves
# start_time so uptime is measured from startup, not per request
health_checker = HealthChecker()

@app.route('/health')
def health():
    """Basic health check endpoint (liveness probe)"""
    result = health_checker.check_basic()
    status_code = 200 if result['status'] == 'healthy' else 503
    return jsonify(result), status_code

@app.route('/health/ready')
def health_ready():
    """Readiness probe endpoint"""
    # Detailed checks only when configured or authenticated
    if config.HEALTH_CHECK_DETAILED or is_admin():
        result = health_checker.check_detailed()
    else:
        result = health_checker.check_basic()

    status_code = 200 if result['status'] == 'healthy' else 503
    return jsonify(result), status_code
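
An illustrative /health/ready response with detailed checks enabled (values are examples only):

{
  "status": "healthy",
  "timestamp": "2025-11-25T13:28:56.123456",
  "uptime": "1:23:45.678901",
  "version": "1.1.1",
  "checks": {
    "database": {"healthy": true, "pool_active": 1, "pool_size": 5},
    "search": {"healthy": true, "engine": "fts5", "enabled": true},
    "filesystem": {"healthy": true},
    "memory": {"healthy": true, "usage_mb": 48, "threshold_mb": 256}
  }
}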

Session Timeout Handling

# starpunk/auth/session.py
import logging
from datetime import datetime, timedelta
from typing import Optional
from uuid import uuid4

logger = logging.getLogger(__name__)

class SessionManager:
    """Manage user sessions with configurable timeout"""

    def __init__(self):
        self.timeout = config.SESSION_TIMEOUT

    def create_session(self, user_id: str) -> str:
        """Create new session with timeout"""
        session_id = str(uuid4())
        expires_at = datetime.now() + timedelta(seconds=self.timeout)

        # Store in database
        with get_db() as conn:
            conn.execute(
                """
                INSERT INTO sessions (id, user_id, expires_at, created_at)
                VALUES (?, ?, ?, ?)
                """,
                (session_id, user_id, expires_at, datetime.now())
            )
            conn.commit()  # Persist before the connection returns to the pool

        logger.info(
            "Session created",
            extra={
                'user_id': user_id,
                'timeout': self.timeout
            }
        )

        return session_id

    def validate_session(self, session_id: str) -> Optional[str]:
        """Validate session and extend if valid"""
        with get_db() as conn:
            result = conn.execute(
                """
                SELECT user_id, expires_at
                FROM sessions
                WHERE id = ? AND expires_at > ?
                """,
                (session_id, datetime.now())
            ).fetchone()

            if result:
                # Extend session
                new_expires = datetime.now() + timedelta(
                    seconds=self.timeout
                )
                conn.execute(
                    """
                    UPDATE sessions
                    SET expires_at = ?, last_accessed = ?
                    WHERE id = ?
                    """,
                    (new_expires, datetime.now(), session_id)
                )
                conn.commit()

                return result['user_id']

        return None

    def cleanup_expired(self):
        """Remove expired sessions"""
        with get_db() as conn:
            deleted = conn.execute(
                """
                DELETE FROM sessions
                WHERE expires_at < ?
                """,
                (datetime.now(),)
            ).rowcount
            conn.commit()

            if deleted > 0:
                logger.info(
                    "Cleaned up expired sessions",
                    extra={'count': deleted}
                )
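
The spec does not prescribe how cleanup_expired is invoked. One minimal approach is a daemon thread that runs it periodically; the helper and interval below are assumptions, not part of the spec:

# Sketch: periodic cleanup on a daemon thread (interval is illustrative)
import threading
import time

def start_session_cleanup(manager: SessionManager, interval: int = 3600) -> None:
    """Run cleanup_expired() every `interval` seconds in the background"""
    def loop():
        while True:
            time.sleep(interval)
            try:
                manager.cleanup_expired()
            except Exception:
                logger.exception("Session cleanup failed")

    threading.Thread(target=loop, daemon=True).start()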

Testing Strategy

Unit Tests

  1. FTS5 detection and fallback (see the sketch after this list)
  2. Error message formatting
  3. Connection pool operations
  4. Health check components
  5. Session timeout logic
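
A minimal sketch of the first test, assuming pytest and monkeypatching of the FTS5 availability check:

# Sketch: pytest unit test for FTS5 detection and fallback
def test_factory_falls_back_without_fts5(monkeypatch):
    monkeypatch.setattr(
        SearchEngineFactory, '_check_fts5', staticmethod(lambda: False)
    )
    engine = SearchEngineFactory.create()
    assert isinstance(engine, FallbackSearchEngine)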

Integration Tests

  1. Search with and without FTS5
  2. Error handling end-to-end
  3. Connection pool under load
  4. Health endpoints
  5. Session expiration

Load Tests

from threading import Thread

def test_connection_pool_under_load():
    """Test connection pool with concurrent requests"""
    # Generous timeout so contention alone doesn't fail the test
    pool = ConnectionPool(":memory:", pool_size=5, timeout=30)

    def worker():
        for _ in range(100):
            with pool.acquire() as conn:
                conn.execute("SELECT 1")

    threads = [Thread(target=worker) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    stats = pool.get_stats()
    assert stats['acquired'] == 2000
    assert stats['released'] == 2000

Migration Considerations

Database Schema Updates

-- Add sessions table if not exists
CREATE TABLE IF NOT EXISTS sessions (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NOT NULL,
    last_accessed TIMESTAMP
);

-- SQLite does not support inline INDEX clauses in CREATE TABLE;
-- the index is created separately
CREATE INDEX IF NOT EXISTS idx_sessions_expires ON sessions (expires_at);

Configuration Migration

  1. Add new environment variables with defaults
  2. Document in deployment guide
  3. Update the example .env file (see the sketch below)
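
A sketch of the additions to the example .env file; the STARPUNK_ prefix follows the error-message guidance above, and the names and defaults are illustrative:

# Production readiness settings (names and defaults are illustrative)
STARPUNK_DB_CONNECTION_POOL_SIZE=5
STARPUNK_DB_CONNECTION_TIMEOUT=30
STARPUNK_DB_BUSY_TIMEOUT=5000
STARPUNK_LOG_LEVEL=INFO
STARPUNK_SESSION_TIMEOUT=86400
STARPUNK_HEALTH_CHECK_DETAILED=false
STARPUNK_MEMORY_THRESHOLD_MB=256
STARPUNK_SEARCH_ENABLED=true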

Performance Impact

Expected Improvements

  • Connection pooling: 20-30% reduction in query latency
  • Structured logging: <1ms per log statement
  • Health checks: <10ms response time
  • Session management: Minimal overhead

Resource Usage

  • Connection pool: ~5MB per connection
  • Logging buffer: <1MB
  • Session storage: ~1KB per active session

Security Considerations

  1. Connection Pool: Prevent connection exhaustion attacks
  2. Error Messages: Never expose sensitive information
  3. Health Checks: Require auth for detailed info
  4. Session Timeout: Configurable for security/UX balance
  5. Logging: Sanitize all user input (see the sketch below)
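
For item 5, a minimal sanitizer sketch that strips control characters (including newlines) from user-supplied values before they reach the logs, preventing log injection; the helper name is illustrative:

# Sketch: sanitize user-supplied values before logging
import re

_CONTROL_CHARS = re.compile(r'[\x00-\x1f\x7f]')

def sanitize_for_log(value: str, max_length: int = 200) -> str:
    """Replace control characters and truncate overly long values"""
    return _CONTROL_CHARS.sub(' ', value)[:max_length]

# Usage: logger.info("Search query", extra={'query': sanitize_for_log(q)})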

Acceptance Criteria

  1. FTS5 unavailability handled gracefully
  2. Clear error messages with troubleshooting
  3. Connection pooling implemented and optimized
  4. Structured logging with levels
  5. Enhanced health check endpoints
  6. Session timeout handling
  7. All features configurable
  8. Zero breaking changes
  9. Performance improvements measured
  10. Production deployment guide updated