StarPunk/docs/decisions/ADR-055-error-handling-philosophy.md
Phil Skentelbery e589f5bd6c docs: Fix ADR numbering conflicts and create comprehensive documentation indices
2025-11-25 13:28:56 -07:00


ADR-055: Error Handling Philosophy

Status

Accepted

Context

StarPunk v1.1.1 focuses on production readiness, including graceful error handling. Currently, error handling is inconsistent:

  • Some errors crash the application
  • Error messages vary in helpfulness
  • No distinction between user and system errors
  • Insufficient context for debugging

We need a consistent philosophy for handling errors that balances user experience, security, and debuggability.

Decision

Adopt a layered error handling strategy that provides graceful degradation, helpful user messages, and detailed logging for operators.

Error Handling Principles

  1. Fail Gracefully: Never crash when recovery is possible
  2. Be Helpful: Provide actionable error messages
  3. Log Everything: Detailed context for debugging
  4. Secure by Default: Don't leak sensitive information
  5. User vs System: Different handling for different audiences

Error Categories

1. User Errors (4xx class)

Errors caused by user action or client issues.

Examples:

  • Invalid Micropub request
  • Authentication failure
  • Missing required fields
  • Invalid slug format

Handling:

  • Return helpful error message
  • Suggest corrective action
  • Log at INFO level
  • Don't expose internals

2. System Errors (5xx class)

Errors in system operation.

Examples:

  • Database connection failure
  • File system errors
  • Memory exhaustion
  • Template rendering errors

Handling:

  • Generic user message
  • Detailed logging at ERROR level
  • Attempt recovery if possible
  • Alert operators (future)

3. Configuration Errors

Errors due to misconfiguration.

Examples:

  • Missing required config
  • Invalid configuration values
  • Incompatible settings
  • Permission issues

Handling:

  • Fail fast at startup
  • Clear error messages
  • Suggest fixes
  • Document requirements
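As a minimal sketch of failing fast at startup, the check below collects every configuration problem and raises once, so the operator sees all of them together. The setting names (`SITE_URL`, `DATA_PATH`) and the `validate_config` helper are hypothetical, and `ConfigurationError` here is a stand-in for the class defined under Implementation Pattern.

```python
from typing import Any, Dict, List


class ConfigurationError(Exception):
    """Stand-in for the ConfigurationError class under Implementation Pattern."""


# Hypothetical required settings, used only for illustration.
REQUIRED_SETTINGS = ("SITE_URL", "DATA_PATH")


def validate_config(config: Dict[str, Any]) -> None:
    """Collect every problem, then raise once with a fix suggestion for each."""
    problems: List[str] = []

    for name in REQUIRED_SETTINGS:
        if not config.get(name):
            problems.append(f"{name} is required but not set; add it to your environment")

    site_url = config.get("SITE_URL")
    if site_url and not str(site_url).startswith(("http://", "https://")):
        problems.append("SITE_URL must start with http:// or https://")

    if problems:
        raise ConfigurationError("Invalid configuration:\n  - " + "\n  - ".join(problems))
```

Calling such a check before the application starts serving requests keeps configuration errors out of the request path entirely.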

4. Transient Errors

Temporary errors that may succeed on retry.

Examples:

  • Database lock
  • Network timeout
  • Resource temporarily unavailable

Handling:

  • Automatic retry with backoff
  • Log at WARNING level
  • Fail gracefully after retries
  • Track frequency
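The retry behaviour described above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's implementation: `retry_with_backoff` and its parameters are hypothetical names, and the default delays simply mirror the values used in the Database Lock example later in this document.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

logger = logging.getLogger("starpunk.errors")
T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == attempts:
                # Out of retries: re-raise so the caller can degrade gracefully.
                raise
            # Delay doubles each attempt, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            logger.warning(
                "Transient error (attempt %d/%d), retrying in %.1fs: %s",
                attempt, attempts, delay, exc,
            )
            sleep(delay)

    raise AssertionError("unreachable")
```

The `sleep` parameter exists so tests can inject a fake and avoid real waiting; the WARNING log line on each retry also gives a simple way to track how often transient errors occur.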

Error Response Format

Development Mode

{
  "error": {
    "type": "ValidationError",
    "message": "Invalid slug format",
    "details": {
      "field": "slug",
      "value": "my/bad/slug",
      "pattern": "^[a-z0-9-]+$"
    },
    "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
    "documentation": "/docs/api/micropub#slugs",
    "trace_id": "abc123"
  }
}

Production Mode

{
  "error": {
    "message": "Invalid request format",
    "suggestion": "Please check your request and try again",
    "documentation": "/docs/api/micropub",
    "trace_id": "abc123"
  }
}
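One way to derive the production shape from the development shape is an allowlist of fields, sketched below. The `PRODUCTION_FIELDS` tuple and `to_production_response` helper are hypothetical names. Note that this sketch only drops the debugging fields; it does not replace the specific message with the more generic wording shown in the production example.

```python
from typing import Any, Dict

# Fields treated as safe for production, taken from the Production Mode example.
PRODUCTION_FIELDS = ("message", "suggestion", "documentation", "trace_id")


def to_production_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted fields from a development-mode error payload."""
    error = payload.get("error", {})
    return {"error": {key: error[key] for key in PRODUCTION_FIELDS if key in error}}


development = {
    "error": {
        "type": "ValidationError",
        "message": "Invalid slug format",
        "details": {"field": "slug", "value": "my/bad/slug"},
        "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
        "documentation": "/docs/api/micropub#slugs",
        "trace_id": "abc123",
    }
}
production = to_production_response(development)
```

An allowlist is used rather than a blocklist so that any debugging field added later stays out of production responses unless it is explicitly permitted.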

Implementation Pattern

# starpunk/errors.py
from enum import Enum
from typing import Optional, Dict, Any
import logging
import uuid

logger = logging.getLogger('starpunk.errors')

class ErrorCategory(Enum):
    USER = "user"
    SYSTEM = "system"
    CONFIG = "config"
    TRANSIENT = "transient"

class StarPunkError(Exception):
    """Base exception for all StarPunk errors"""

    def __init__(
        self,
        message: str,
        category: ErrorCategory = ErrorCategory.SYSTEM,
        suggestion: Optional[str] = None,
        details: Optional[Dict[str, Any]] = None,
        status_code: int = 500,
        recoverable: bool = False,
        trace_id: Optional[str] = None
    ):
        self.message = message
        self.category = category
        self.suggestion = suggestion
        self.details = details or {}
        self.status_code = status_code
        self.recoverable = recoverable
        # Generated per error so the response and the log entry share one ID;
        # to_user_dict() and log() both read this attribute
        self.trace_id = trace_id or uuid.uuid4().hex
        super().__init__(message)

    def to_user_dict(self, debug: bool = False) -> dict:
        """Format error for user response"""
        result = {
            'error': {
                'message': self.message,
                'trace_id': self.trace_id
            }
        }

        if self.suggestion:
            result['error']['suggestion'] = self.suggestion

        # Details and the exception type are only exposed in debug mode
        if debug and self.details:
            result['error']['details'] = self.details
            result['error']['type'] = self.__class__.__name__

        return result

    def log(self):
        """Log error with appropriate level"""
        # The trace ID is attached to every record for correlation
        extra = {'context': self.details, 'trace_id': self.trace_id}
        if self.category == ErrorCategory.USER:
            logger.info(
                "User error: %s",
                self.message,
                extra=extra
            )
        elif self.category == ErrorCategory.TRANSIENT:
            logger.warning(
                "Transient error: %s",
                self.message,
                extra=extra
            )
        else:
            # exc_info=True captures the traceback when called from an except block
            logger.error(
                "System error: %s",
                self.message,
                extra=extra,
                exc_info=True
            )

# Specific error classes
class ValidationError(StarPunkError):
    """User input validation failed"""
    def __init__(self, message: str, field: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=400,
            **kwargs
        )
        if field:
            self.details['field'] = field

class AuthenticationError(StarPunkError):
    """Authentication failed"""
    def __init__(self, message: str = "Authentication required", **kwargs):
        # setdefault lets callers override the suggestion without a
        # duplicate-keyword TypeError
        kwargs.setdefault('suggestion', "Please authenticate and try again")
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=401,
            **kwargs
        )

class DatabaseError(StarPunkError):
    """Database operation failed"""
    def __init__(self, message: str, **kwargs):
        kwargs.setdefault('suggestion', "Please try again later")
        super().__init__(
            message,
            category=ErrorCategory.SYSTEM,
            status_code=500,
            **kwargs
        )

class ConfigurationError(StarPunkError):
    """Configuration is invalid"""
    def __init__(self, message: str, setting: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.CONFIG,
            status_code=500,
            **kwargs
        )
        if setting:
            self.details['setting'] = setting

Error Handling Middleware

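# starpunk/middleware/errors.py
from functools import wraps

from starpunk.errors import ErrorCategory, StarPunkError

# is_debug_mode() is assumed to be provided elsewhere in the application

def error_handler(func):
    """Decorator for consistent error handling"""
    @wraps(func)  # Preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except StarPunkError as e:
            e.log()
            # Return the status code with the body (assumes a Flask-style
            # (body, status) return value) so errors are not sent as 200
            return e.to_user_dict(debug=is_debug_mode()), e.status_code
        except Exception as e:
            # Unexpected error: wrap it so the user sees a generic message
            error = StarPunkError(
                message="An unexpected error occurred",
                category=ErrorCategory.SYSTEM,
                details={'original': str(e)}
            )
            error.log()
            return error.to_user_dict(debug=is_debug_mode()), error.status_code
    return wrapper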

Graceful Degradation Examples

FTS5 Unavailable

try:
    # Attempt FTS5 search
    results = search_with_fts5(query)
except FTS5UnavailableError:
    logger.warning("FTS5 unavailable, falling back to LIKE")
    results = search_with_like(query)
    flash("Search is running in compatibility mode")

Database Lock

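import sqlite3

# The decorator names below match the tenacity library
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=2),
    # Note: OperationalError also covers non-transient failures such as SQL errors
    retry=retry_if_exception_type(sqlite3.OperationalError)
)
def execute_query(query):
    """Execute with retry for transient errors"""
    return db.execute(query)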

Missing Optional Feature

if not config.SEARCH_ENABLED:
    # Return empty results instead of error
    return {
        'results': [],
        'message': 'Search is disabled on this instance'
    }

Rationale

Why Graceful Degradation?

  1. User Experience: Don't break the whole app
  2. Reliability: Partial functionality better than none
  3. Operations: Easier to diagnose in production
  4. Recovery: System can self-heal from transients

Why Different Error Categories?

  1. Appropriate Response: Different errors need different handling
  2. Security: Don't expose internals for system errors
  3. Debugging: Operators need full context
  4. User Experience: Users need actionable messages

Why Structured Errors?

  1. Consistency: Predictable error format
  2. Parsing: Tools can process errors
  3. Correlation: Trace IDs link logs to responses
  4. Documentation: Self-documenting error details
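To illustrate the correlation point, the sketch below generates one trace ID per error and attaches it to both the log record and the user-facing response. The helper names (`new_trace_id`, `report_error`) and the 12-character ID length are assumptions made for this example only.

```python
import logging
import uuid
from typing import Any, Dict

logger = logging.getLogger("starpunk.errors")


def new_trace_id() -> str:
    """Return a short random identifier; 12 hex characters is an arbitrary choice."""
    return uuid.uuid4().hex[:12]


def report_error(internal_message: str, public_message: str) -> Dict[str, Any]:
    """Log the detailed message and return a response carrying the same trace ID."""
    trace_id = new_trace_id()
    # `extra` sets a trace_id attribute on the LogRecord for handlers to use.
    logger.error("%s", internal_message, extra={"trace_id": trace_id})
    return {"error": {"message": public_message, "trace_id": trace_id}}
```

A user can then quote the `trace_id` from the response, and an operator can search the logs for that value to find the detailed record without any internals having been exposed.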

Consequences

Positive

  1. Better UX: Helpful error messages
  2. Easier Debugging: Rich context in logs
  3. More Reliable: Graceful degradation
  4. Secure: No information leakage
  5. Consistent: Predictable error handling

Negative

  1. More Code: Error handling adds complexity
  2. Testing Burden: Many error paths to test
  3. Performance: Error handling overhead
  4. Maintenance: Error messages need updates

Mitigations

  1. Use error hierarchy to reduce duplication
  2. Generate tests for error paths
  3. Cache error messages
  4. Document error codes clearly

Alternatives Considered

1. Let Exceptions Bubble

  • Pros: Simple, Python default
  • Cons: Poor UX, crashes, no context
  • Decision: Not production-ready

2. Generic Error Pages

  • Pros: Simple to implement
  • Cons: Not helpful, poor API experience
  • Decision: Insufficient for Micropub API

3. Error Codes System

  • Pros: Precise, machine-readable
  • Cons: Complex, needs documentation
  • Decision: Over-engineered for our scale

4. Sentry/Error Tracking Service

  • Pros: Rich features, alerting
  • Cons: External dependency, privacy
  • Decision: Conflicts with self-hosted philosophy

Implementation Notes

Critical Path Protection

Always protect critical paths:

# Never let note creation completely fail
try:
    create_search_index(note)
except Exception as e:
    logger.error("Search indexing failed: %s", e)
    # Continue without search - note still created

Error Budget

Track error rates for SLO monitoring:

  • User errors: Unlimited (not our fault)
  • System errors: <0.1% of requests
  • Configuration errors: 0 after startup
  • Transient errors: <1% of requests
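The budget above can be checked with simple arithmetic over per-category counts. The following is a sketch only: how counts are collected is not specified in this ADR, and the `ERROR_BUDGET` table and `budget_violations` helper are hypothetical names.

```python
from typing import Dict, List, Optional

# Maximum share of requests allowed per category; None means no limit.
ERROR_BUDGET: Dict[str, Optional[float]] = {
    "user": None,        # Unlimited
    "system": 0.001,     # <0.1% of requests
    "config": 0.0,       # 0 after startup
    "transient": 0.01,   # <1% of requests
}


def budget_violations(error_counts: Dict[str, int], total_requests: int) -> List[str]:
    """Return the categories whose error rate is outside the budget."""
    if total_requests <= 0:
        return []

    violations = []
    for category, limit in ERROR_BUDGET.items():
        if limit is None:
            continue
        rate = error_counts.get(category, 0) / total_requests
        # Budgets are strict upper bounds, except a zero budget, which
        # means no errors of that category are allowed at all.
        within_budget = rate == 0 if limit == 0 else rate < limit
        if not within_budget:
            violations.append(category)
    return violations
```

For example, 5 system errors in 10,000 requests is a rate of 0.05%, which is inside the 0.1% budget, while a single configuration error after startup is already a violation.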

Testing Strategy

  1. Unit tests for each error class
  2. Integration tests for error paths
  3. Chaos testing for transient errors
  4. User journey tests with errors

Security Considerations

  1. Never expose stack traces to users
  2. Sanitize error messages
  3. Rate limit error endpoints
  4. Don't reveal whether a resource exists through differing error responses
  5. Log security errors specially
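As a rough sketch of sanitizing error messages, the function below redacts a few obviously sensitive patterns from raw exception text before it is shown to a user. The patterns and the `sanitize_message` name are illustrative assumptions; they are not exhaustive, and this approach suits unreviewed exception text rather than curated fields such as the `documentation` link.

```python
import re

# Illustrative patterns only; a real deployment would need its own list.
_REDACTIONS = [
    # key=value or key: value pairs for common secret names
    (re.compile(r"(?i)\b(token|password|secret)\b\s*[=:]\s*\S+"), r"\1=[redacted]"),
    # Anything that looks like a filesystem path with two or more segments
    (re.compile(r"(?:/[\w.-]+){2,}"), "[path]"),
]


def sanitize_message(message: str, max_length: int = 200) -> str:
    """Redact secrets and paths, then truncate to a bounded length."""
    for pattern, replacement in _REDACTIONS:
        message = pattern.sub(replacement, message)
    return message[:max_length]
```

Redaction of this kind is a backstop; the primary control remains returning a generic message for system errors and keeping the detail in the logs.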

Migration Path

  1. Phase 1: Add error classes
  2. Phase 2: Wrap existing code
  3. Phase 3: Add graceful degradation
  4. Phase 4: Improve error messages

Document History

  • 2025-11-25: Initial draft for v1.1.1 release planning