# ADR-055: Error Handling Philosophy

## Status
Accepted

## Context
StarPunk v1.1.1 focuses on production readiness, including graceful error handling. Currently, error handling is inconsistent:
- Some errors crash the application
- Error messages vary in helpfulness
- No distinction between user and system errors
- Insufficient context for debugging

We need a consistent philosophy for handling errors that balances user experience, security, and debuggability.

## Decision
Adopt a layered error handling strategy that provides graceful degradation, helpful user messages, and detailed logging for operators.

### Error Handling Principles

1. **Fail Gracefully**: Never crash when recovery is possible
2. **Be Helpful**: Provide actionable error messages
3. **Log Everything**: Detailed context for debugging
4. **Secure by Default**: Don't leak sensitive information
5. **User vs System**: Different handling for different audiences

### Error Categories

#### 1. User Errors (4xx class)
Errors caused by user action or client issues.

Examples:
- Invalid Micropub request
- Authentication failure
- Missing required fields
- Invalid slug format

Handling:
- Return helpful error message
- Suggest corrective action
- Log at INFO level
- Don't expose internals

#### 2. System Errors (5xx class)
Errors in system operation.

Examples:
- Database connection failure
- File system errors
- Memory exhaustion
- Template rendering errors

Handling:
- Generic user message
- Detailed logging at ERROR level
- Attempt recovery if possible
- Alert operators (future)

#### 3. Configuration Errors
Errors due to misconfiguration.

Examples:
- Missing required config
- Invalid configuration values
- Incompatible settings
- Permission issues

Handling:
- Fail fast at startup
- Clear error messages
- Suggest fixes
- Document requirements

#### 4. Transient Errors
Temporary errors that may succeed on retry.

Examples:
- Database lock
- Network timeout
- Resource temporarily unavailable

Handling:
- Automatic retry with backoff
- Log at WARNING level
- Fail gracefully after retries
- Track frequency

### Error Response Format

#### Development Mode
```json
{
  "error": {
    "type": "ValidationError",
    "message": "Invalid slug format",
    "details": {
      "field": "slug",
      "value": "my/bad/slug",
      "pattern": "^[a-z0-9-]+$"
    },
    "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
    "documentation": "/docs/api/micropub#slugs",
    "trace_id": "abc123"
  }
}
```

#### Production Mode
```json
{
  "error": {
    "message": "Invalid request format",
    "suggestion": "Please check your request and try again",
    "documentation": "/docs/api/micropub",
    "trace_id": "abc123"
  }
}
```

### Implementation Pattern

```python
# starpunk/errors.py

from enum import Enum
from typing import Optional, Dict, Any
import logging
import uuid

logger = logging.getLogger('starpunk.errors')

class ErrorCategory(Enum):
    USER = "user"
    SYSTEM = "system"
    CONFIG = "config"
    TRANSIENT = "transient"

class StarPunkError(Exception):
    """Base exception for all StarPunk errors"""

    def __init__(
        self,
        message: str,
        category: ErrorCategory = ErrorCategory.SYSTEM,
        suggestion: Optional[str] = None,
        details: Optional[Dict[str, Any]] = None,
        status_code: int = 500,
        recoverable: bool = False
    ):
        self.message = message
        self.category = category
        self.suggestion = suggestion
        self.details = details or {}
        self.status_code = status_code
        self.recoverable = recoverable
        # to_user_dict() reports a trace ID, so one must exist on the instance.
        # Generating it here with uuid4 is an assumption; a request-scoped ID
        # would serve the same purpose.
        self.trace_id = uuid.uuid4().hex[:8]
        super().__init__(message)

    def to_user_dict(self, debug: bool = False) -> dict:
        """Format error for user response"""
        result = {
            'error': {
                'message': self.message,
                'trace_id': self.trace_id
            }
        }

        if self.suggestion:
            result['error']['suggestion'] = self.suggestion

        # Details and the exception type are only exposed in debug mode
        if debug and self.details:
            result['error']['details'] = self.details
            result['error']['type'] = self.__class__.__name__

        return result

    def log(self):
        """Log error with appropriate level"""
        # Include the trace ID so log entries can be matched to responses
        extra = {'context': self.details, 'trace_id': self.trace_id}
        if self.category == ErrorCategory.USER:
            logger.info("User error: %s", self.message, extra=extra)
        elif self.category == ErrorCategory.TRANSIENT:
            logger.warning("Transient error: %s", self.message, extra=extra)
        else:
            logger.error(
                "System error: %s",
                self.message,
                extra=extra,
                exc_info=True
            )

# Specific error classes
class ValidationError(StarPunkError):
    """User input validation failed"""
    def __init__(self, message: str, field: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=400,
            **kwargs
        )
        if field:
            self.details['field'] = field

class AuthenticationError(StarPunkError):
    """Authentication failed"""
    def __init__(self, message: str = "Authentication required", **kwargs):
        # setdefault lets callers override the suggestion without a TypeError
        kwargs.setdefault('suggestion', "Please authenticate and try again")
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=401,
            **kwargs
        )

class DatabaseError(StarPunkError):
    """Database operation failed"""
    def __init__(self, message: str, **kwargs):
        kwargs.setdefault('suggestion', "Please try again later")
        super().__init__(
            message,
            category=ErrorCategory.SYSTEM,
            status_code=500,
            **kwargs
        )

class ConfigurationError(StarPunkError):
    """Configuration is invalid"""
    def __init__(self, message: str, setting: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.CONFIG,
            status_code=500,
            **kwargs
        )
        if setting:
            self.details['setting'] = setting
```

### Error Handling Middleware

```python
# starpunk/middleware/errors.py

from functools import wraps

from starpunk.errors import ErrorCategory, StarPunkError

# is_debug_mode() is assumed to be provided elsewhere in the application.

def error_handler(func):
    """Decorator for consistent error handling"""
    @wraps(func)  # Preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except StarPunkError as e:
            e.log()
            # Return the status code with the body (Flask-style tuple)
            # so that errors are not sent with a 200 status
            return e.to_user_dict(debug=is_debug_mode()), e.status_code
        except Exception as e:
            # Unexpected error
            error = StarPunkError(
                message="An unexpected error occurred",
                category=ErrorCategory.SYSTEM,
                details={'original': str(e)}
            )
            error.log()
            return error.to_user_dict(debug=is_debug_mode()), error.status_code
    return wrapper
```

### Graceful Degradation Examples

#### FTS5 Unavailable
```python
try:
    # Attempt FTS5 search
    results = search_with_fts5(query)
except FTS5UnavailableError:
    logger.warning("FTS5 unavailable, falling back to LIKE")
    results = search_with_like(query)
    flash("Search is running in compatibility mode")
```

#### Database Lock
```python
import sqlite3

from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=2),
    retry=retry_if_exception_type(sqlite3.OperationalError)
)
def execute_query(query):
    """Execute with retry for transient errors"""
    return db.execute(query)
```

#### Missing Optional Feature
```python
if not config.SEARCH_ENABLED:
    # Return empty results instead of error
    return {
        'results': [],
        'message': 'Search is disabled on this instance'
    }
```

## Rationale

### Why Graceful Degradation?

1. **User Experience**: Don't break the whole app
2. **Reliability**: Partial functionality better than none
3. **Operations**: Easier to diagnose in production
4. **Recovery**: System can self-heal from transients

### Why Different Error Categories?

1. **Appropriate Response**: Different errors need different handling
2. **Security**: Don't expose internals for system errors
3. **Debugging**: Operators need full context
4. **User Experience**: Users need actionable messages

### Why Structured Errors?

1. **Consistency**: Predictable error format
2. **Parsing**: Tools can process errors
3. **Correlation**: Trace IDs link logs to responses
4. **Documentation**: Self-documenting error details

## Consequences

### Positive
1. **Better UX**: Helpful error messages
2. **Easier Debugging**: Rich context in logs
3. **More Reliable**: Graceful degradation
4. **Secure**: No information leakage
5. **Consistent**: Predictable error handling

### Negative
1. **More Code**: Error handling adds complexity
2. **Testing Burden**: Many error paths to test
3. **Performance**: Error handling overhead
4. **Maintenance**: Error messages need updates

### Mitigations
1. Use error hierarchy to reduce duplication
2. Generate tests for error paths
3. Cache error messages
4. Document error codes clearly

## Alternatives Considered

### 1. Let Exceptions Bubble
**Pros**: Simple, Python default
**Cons**: Poor UX, crashes, no context
**Decision**: Not production-ready

### 2. Generic Error Pages
**Pros**: Simple to implement
**Cons**: Not helpful, poor API experience
**Decision**: Insufficient for Micropub API

### 3. Error Codes System
**Pros**: Precise, machine-readable
**Cons**: Complex, needs documentation
**Decision**: Over-engineered for our scale

### 4. Sentry/Error Tracking Service
**Pros**: Rich features, alerting
**Cons**: External dependency, privacy
**Decision**: Conflicts with self-hosted philosophy

## Implementation Notes

### Critical Path Protection
Always protect critical paths:
```python
# Never let note creation completely fail
try:
    create_search_index(note)
except Exception as e:
    logger.error("Search indexing failed: %s", e)
    # Continue without search - note still created
```

### Error Budget
Track error rates for SLO monitoring:
- User errors: Unlimited (not our fault)
- System errors: <0.1% of requests
- Configuration errors: 0 after startup
- Transient errors: <1% of requests

### Testing Strategy
1. Unit tests for each error class
2. Integration tests for error paths
3. Chaos testing for transient errors
4. User journey tests with errors

## Security Considerations

1. Never expose stack traces to users
2. Sanitize error messages
3. Rate limit error endpoints
4. Don't reveal whether a resource exists through error responses
5. Log security-related errors separately

## Migration Path

1. Phase 1: Add error classes
2. Phase 2: Wrap existing code
3. Phase 3: Add graceful degradation
4. Phase 4: Improve error messages

## References

- [Error Handling Best Practices](https://www.python.org/dev/peps/pep-0008/#programming-recommendations)
- [HTTP Status Codes](https://httpstatuses.com/)
- [OWASP Error Handling](https://owasp.org/www-community/Improper_Error_Handling)
- [Google SRE Book - Handling Overload](https://sre.google/sre-book/handling-overload/)

## Document History

- 2025-11-25: Initial draft for v1.1.1 release planning
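
## Appendix: Error Budget Check (Sketch)

The error budget thresholds listed under Implementation Notes could be checked mechanically. The following is a minimal sketch, not part of the decision itself. `ERROR_BUDGET` and `budget_violations` are hypothetical names, and the per-category counts are assumed to come from whatever request metrics the deployment already collects.

```python
from typing import Dict

# Thresholds copied from the Error Budget list. The dict layout is an
# assumption. User errors are unlimited, so they have no entry.
ERROR_BUDGET: Dict[str, float] = {
    "system": 0.001,    # <0.1% of requests
    "transient": 0.01,  # <1% of requests
    "config": 0.0,      # 0 after startup
}


def budget_violations(
    error_counts: Dict[str, int], total_requests: int
) -> Dict[str, float]:
    """Return the observed rate for each category that breaks its budget."""
    if total_requests <= 0:
        return {}

    violations: Dict[str, float] = {}
    for category, limit in ERROR_BUDGET.items():
        rate = error_counts.get(category, 0) / total_requests
        if limit == 0:
            # A zero budget is violated by any occurrence at all.
            exceeded = rate > 0
        else:
            # Budgets are stated as strict upper bounds ("<0.1%").
            exceeded = rate >= limit
        if exceeded:
            violations[category] = rate
    return violations


healthy = budget_violations({"user": 500, "system": 2, "transient": 5}, 10_000)
unhealthy = budget_violations({"system": 20, "config": 1}, 10_000)
```

Categories without a budget entry, such as user errors, are ignored, which matches the "Unlimited" line in the budget.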