# ADR-055: Error Handling Philosophy

## Status
Accepted

## Context
StarPunk v1.1.1 focuses on production readiness, including graceful error handling. Currently, error handling is inconsistent:
- Some errors crash the application
- Error messages vary in helpfulness
- No distinction between user and system errors
- Insufficient context for debugging

We need a consistent philosophy for handling errors that balances user experience, security, and debuggability.

## Decision
Adopt a layered error handling strategy that provides graceful degradation, helpful user messages, and detailed logging for operators.

### Error Handling Principles

1. **Fail Gracefully**: Never crash when recovery is possible
2. **Be Helpful**: Provide actionable error messages
3. **Log Everything**: Detailed context for debugging
4. **Secure by Default**: Don't leak sensitive information
5. **User vs System**: Different handling for different audiences

### Error Categories

#### 1. User Errors (4xx class)
Errors caused by user action or client issues.

Examples:
- Invalid Micropub request
- Authentication failure
- Missing required fields
- Invalid slug format

Handling:
- Return helpful error message
- Suggest corrective action
- Log at INFO level
- Don't expose internals
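
A minimal sketch of this handling for the "Invalid slug format" case follows. It assumes the slug pattern shown in the development-mode response example; `SlugValidationError` and `validate_slug` are hypothetical names used only for illustration.

```python
import logging
import re

logger = logging.getLogger("starpunk.validation")

# Pattern taken from the development-mode response example in this ADR.
SLUG_PATTERN = re.compile(r"^[a-z0-9-]+$")


class SlugValidationError(ValueError):
    """Hypothetical user-facing error carrying a message and a suggestion."""

    status_code = 400

    def __init__(self, message: str, suggestion: str):
        super().__init__(message)
        self.message = message
        self.suggestion = suggestion


def validate_slug(slug: str) -> str:
    """Return the slug unchanged, or raise a user error with a corrective hint."""
    if not SLUG_PATTERN.fullmatch(slug):
        # User errors are expected traffic, so they are logged at INFO.
        # Only the length is logged here, not the raw user input.
        logger.info("User error: invalid slug format (length=%d)", len(slug))
        raise SlugValidationError(
            "Invalid slug format",
            "Slugs can only contain lowercase letters, numbers, and hyphens",
        )
    return slug
```

The message and suggestion describe the rule the user broke without mentioning how or where the slug is stored.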

#### 2. System Errors (5xx class)
Errors in system operation.

Examples:
- Database connection failure
- File system errors
- Memory exhaustion
- Template rendering errors

Handling:
- Generic user message
- Detailed logging at ERROR level
- Attempt recovery if possible
- Alert operators (future)

#### 3. Configuration Errors
Errors due to misconfiguration.

Examples:
- Missing required config
- Invalid configuration values
- Incompatible settings
- Permission issues

Handling:
- Fail fast at startup
- Clear error messages
- Suggest fixes
- Document requirements
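
Failing fast could look like the sketch below: validate once at startup, collect every problem, and raise a single error that lists the fixes. The setting names (`SITE_URL`, `DATA_PATH`) and the `ConfigError` class are assumptions for illustration, not StarPunk's actual configuration keys.

```python
from typing import Any, List, Mapping

# Hypothetical setting names, for illustration only.
REQUIRED_SETTINGS = ("SITE_URL", "DATA_PATH")


class ConfigError(RuntimeError):
    """Raised at startup when configuration is unusable."""


def validate_config(config: Mapping[str, Any]) -> None:
    """Collect every problem, then fail once with all suggested fixes listed."""
    problems: List[str] = []
    for name in REQUIRED_SETTINGS:
        if not config.get(name):
            problems.append(f"{name} is missing; set it in the environment or config file")
    site_url = str(config.get("SITE_URL") or "")
    if site_url and not site_url.startswith(("http://", "https://")):
        problems.append("SITE_URL must start with http:// or https://")
    if problems:
        raise ConfigError("Invalid configuration:\n- " + "\n- ".join(problems))
```

Reporting all problems at once saves operators from fixing one setting per restart.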

#### 4. Transient Errors
Temporary errors that may succeed on retry.

Examples:
- Database lock
- Network timeout
- Resource temporarily unavailable

Handling:
- Automatic retry with backoff
- Log at WARNING level
- Fail gracefully after retries
- Track frequency
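
The four points above can be sketched with the standard library alone. `call_with_retry` and the `transient_failures` counter are hypothetical names, and the default delays are assumptions chosen to mirror a short exponential backoff.

```python
import logging
import time
from collections import Counter
from typing import Callable, Tuple, Type, TypeVar

logger = logging.getLogger("starpunk.retry")
T = TypeVar("T")

# In-process tally of transient failures, keyed by operation name.
transient_failures: Counter = Counter()


def call_with_retry(
    operation: Callable[[], T],
    name: str,
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            transient_failures[name] += 1
            logger.warning(
                "Transient error in %s (attempt %d/%d): %s",
                name, attempt, attempts, exc,
            )
            if attempt == attempts:
                raise  # Retries exhausted: the caller decides how to degrade.
            # Delay doubles each attempt, capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so tests can run without real delays.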

### Error Response Format

#### Development Mode
```json
{
  "error": {
    "type": "ValidationError",
    "message": "Invalid slug format",
    "details": {
      "field": "slug",
      "value": "my/bad/slug",
      "pattern": "^[a-z0-9-]+$"
    },
    "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
    "documentation": "/docs/api/micropub#slugs",
    "trace_id": "abc123"
  }
}
```

#### Production Mode
```json
{
  "error": {
    "message": "Invalid request format",
    "suggestion": "Please check your request and try again",
    "documentation": "/docs/api/micropub",
    "trace_id": "abc123"
  }
}
```
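
One way to keep the two modes consistent is to build both bodies from the same inputs and gate the sensitive fields on a debug flag. This is a sketch, not the implementation: `render_error` and its parameters are hypothetical, and it assumes the caller supplies a separate generic `public_message` for production.

```python
from typing import Any, Dict, Optional


def render_error(
    *,
    error_type: str,
    message: str,
    public_message: str,
    suggestion: str,
    documentation: str,
    trace_id: str,
    details: Optional[Dict[str, Any]] = None,
    debug: bool = False,
) -> Dict[str, Any]:
    """Build the response body; type, details, and the specific message are debug-only."""
    body: Dict[str, Any] = {
        "message": message if debug else public_message,
        "suggestion": suggestion,
        "documentation": documentation,
        "trace_id": trace_id,
    }
    if debug:
        body["type"] = error_type
        if details:
            body["details"] = details
    return {"error": body}
```

The `trace_id` is present in both modes, so an operator can look up the detailed log entry for a production response.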

### Implementation Pattern

```python
# starpunk/errors.py
from enum import Enum
from typing import Optional, Dict, Any
import logging
import uuid

logger = logging.getLogger('starpunk.errors')

class ErrorCategory(Enum):
    USER = "user"
    SYSTEM = "system"
    CONFIG = "config"
    TRANSIENT = "transient"

class StarPunkError(Exception):
    """Base exception for all StarPunk errors"""

    def __init__(
        self,
        message: str,
        category: ErrorCategory = ErrorCategory.SYSTEM,
        suggestion: Optional[str] = None,
        details: Optional[Dict[str, Any]] = None,
        status_code: int = 500,
        recoverable: bool = False
    ):
        self.message = message
        self.category = category
        self.suggestion = suggestion
        self.details = details or {}
        self.status_code = status_code
        self.recoverable = recoverable
        # Assumption: this ADR does not say how trace IDs are generated;
        # a random UUID per error is used here
        self.trace_id = uuid.uuid4().hex
        super().__init__(message)

    def to_user_dict(self, debug: bool = False) -> dict:
        """Format error for user response"""
        result = {
            'error': {
                'message': self.message,
                'trace_id': self.trace_id
            }
        }

        if self.suggestion:
            result['error']['suggestion'] = self.suggestion

        # Type and details are internals: include them in debug mode only
        if debug:
            result['error']['type'] = self.__class__.__name__
            if self.details:
                result['error']['details'] = self.details

        return result

    def log(self):
        """Log error with appropriate level"""
        # trace_id is logged too, so a response can be matched to its log entry
        extra = {'context': self.details, 'trace_id': self.trace_id}
        if self.category == ErrorCategory.USER:
            logger.info(
                "User error: %s",
                self.message,
                extra=extra
            )
        elif self.category == ErrorCategory.TRANSIENT:
            logger.warning(
                "Transient error: %s",
                self.message,
                extra=extra
            )
        else:
            # exc_info=True attaches a traceback only when called inside
            # an except block
            logger.error(
                "System error: %s",
                self.message,
                extra=extra,
                exc_info=True
            )

# Specific error classes
class ValidationError(StarPunkError):
    """User input validation failed"""
    def __init__(self, message: str, field: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=400,
            **kwargs
        )
        if field:
            self.details['field'] = field

class AuthenticationError(StarPunkError):
    """Authentication failed"""
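    def __init__(self, message: str = "Authentication required", **kwargs):
        # setdefault lets callers override the suggestion without a
        # duplicate-keyword TypeError
        kwargs.setdefault("suggestion", "Please authenticate and try again")
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=401,
            **kwargs
        )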

class DatabaseError(StarPunkError):
    """Database operation failed"""
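    def __init__(self, message: str, **kwargs):
        # setdefault lets callers override the suggestion without a
        # duplicate-keyword TypeError
        kwargs.setdefault("suggestion", "Please try again later")
        super().__init__(
            message,
            category=ErrorCategory.SYSTEM,
            status_code=500,
            **kwargs
        )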

class ConfigurationError(StarPunkError):
    """Configuration is invalid"""
    def __init__(self, message: str, setting: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.CONFIG,
            status_code=500,
            **kwargs
        )
        if setting:
            self.details['setting'] = setting
```

### Error Handling Middleware

```python
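# starpunk/middleware/errors.py
import functools

from starpunk.errors import ErrorCategory, StarPunkError

# is_debug_mode() is assumed to be provided elsewhere in the application


def error_handler(func):
    """Decorator for consistent error handling"""
    @functools.wraps(func)  # Preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except StarPunkError as e:
            e.log()
            # Return the status code with the body so clients do not see a 200.
            # Assumes a framework that accepts (body, status) tuples, such as Flask.
            return e.to_user_dict(debug=is_debug_mode()), e.status_code
        except Exception as e:
            # Unexpected error: the original text goes into details,
            # which reach the response only in debug mode
            error = StarPunkError(
                message="An unexpected error occurred",
                category=ErrorCategory.SYSTEM,
                details={'original': str(e)}
            )
            error.log()
            return error.to_user_dict(debug=is_debug_mode()), error.status_code
    return wrapper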
```

### Graceful Degradation Examples

#### FTS5 Unavailable
```python
try:
    # Attempt FTS5 search
    results = search_with_fts5(query)
except FTS5UnavailableError:
    logger.warning("FTS5 unavailable, falling back to LIKE")
    results = search_with_like(query)
    flash("Search is running in compatibility mode")
```

#### Database Lock
```python
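import sqlite3

# Assumes the tenacity library, whose API these decorator arguments match
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Note: sqlite3.OperationalError covers more than lock contention
# (for example, a missing table), so such failures are retried too
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=2),
    retry=retry_if_exception_type(sqlite3.OperationalError)
)
def execute_query(query):
    """Execute with retry for transient errors"""
    return db.execute(query)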
```

#### Missing Optional Feature
```python
if not config.SEARCH_ENABLED:
    # Return empty results instead of error
    return {
        'results': [],
        'message': 'Search is disabled on this instance'
    }
```

## Rationale

### Why Graceful Degradation?
1. **User Experience**: Don't break the whole app
2. **Reliability**: Partial functionality better than none
3. **Operations**: Easier to diagnose in production
4. **Recovery**: System can self-heal from transients

### Why Different Error Categories?
1. **Appropriate Response**: Different errors need different handling
2. **Security**: Don't expose internals for system errors
3. **Debugging**: Operators need full context
4. **User Experience**: Users need actionable messages

### Why Structured Errors?
1. **Consistency**: Predictable error format
2. **Parsing**: Tools can process errors
3. **Correlation**: Trace IDs link logs to responses
4. **Documentation**: Self-documenting error details

## Consequences

### Positive
1. **Better UX**: Helpful error messages
2. **Easier Debugging**: Rich context in logs
3. **More Reliable**: Graceful degradation
4. **Secure**: No information leakage
5. **Consistent**: Predictable error handling

### Negative
1. **More Code**: Error handling adds complexity
2. **Testing Burden**: Many error paths to test
3. **Performance**: Error handling overhead
4. **Maintenance**: Error messages need updates

### Mitigations
1. Use error hierarchy to reduce duplication
2. Generate tests for error paths
3. Cache error messages
4. Document error codes clearly

## Alternatives Considered

### 1. Let Exceptions Bubble
**Pros**: Simple, Python default
**Cons**: Poor UX, crashes, no context
**Decision**: Not production-ready

### 2. Generic Error Pages
**Pros**: Simple to implement
**Cons**: Not helpful, poor API experience
**Decision**: Insufficient for Micropub API

### 3. Error Codes System
**Pros**: Precise, machine-readable
**Cons**: Complex, needs documentation
**Decision**: Over-engineered for our scale

### 4. Sentry/Error Tracking Service
**Pros**: Rich features, alerting
**Cons**: External dependency, privacy
**Decision**: Conflicts with self-hosted philosophy

## Implementation Notes

### Critical Path Protection
Always protect critical paths:
```python
# Never let note creation completely fail
try:
    create_search_index(note)
except Exception as e:
    logger.error("Search indexing failed: %s", e)
    # Continue without search - note still created
```

### Error Budget
Track error rates for SLO monitoring:
- User errors: Unlimited (not our fault)
- System errors: <0.1% of requests
- Configuration errors: 0 after startup
- Transient errors: <1% of requests
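
As a rough sketch, the budgets above could be checked against per-category counters. `ERROR_BUDGETS` and `budget_violations` are hypothetical names; how requests and errors are actually counted is not specified here and is left to the caller.

```python
from collections import Counter
from typing import Dict

# Budgets from this ADR, as a maximum fraction of total requests.
# User errors are deliberately absent: they are unlimited.
ERROR_BUDGETS: Dict[str, float] = {"system": 0.001, "transient": 0.01}


def budget_violations(total_requests: int, errors: Counter) -> Dict[str, float]:
    """Return the observed rate for each category that reaches or exceeds its budget."""
    if total_requests <= 0:
        return {}
    violations: Dict[str, float] = {}
    for category, limit in ERROR_BUDGETS.items():
        rate = errors[category] / total_requests
        if rate >= limit:
            violations[category] = rate
    return violations
```

For example, 20 system errors in 10,000 requests is a rate of 0.2%, which exceeds the 0.1% budget, while 50 transient errors (0.5%) stays within the 1% budget.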

### Testing Strategy
1. Unit tests for each error class
2. Integration tests for error paths
3. Chaos testing for transient errors
4. User journey tests with errors
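
A test for a degradation path might inject the failure and assert on both the result and the log output. The sketch below uses only `unittest`; `IndexUnavailable` and `search` are hypothetical stand-ins, not StarPunk's real search code.

```python
import logging
import unittest

logger = logging.getLogger("starpunk.search")


class IndexUnavailable(Exception):
    """Hypothetical stand-in for a failing primary search backend."""


def search(query, primary, fallback):
    """Try the primary backend; degrade to the fallback on failure."""
    try:
        return primary(query)
    except IndexUnavailable:
        logger.warning("Primary search unavailable, using fallback")
        return fallback(query)


class SearchFallbackTest(unittest.TestCase):
    def test_falls_back_and_logs_warning(self):
        def broken(_query):
            raise IndexUnavailable("index missing")

        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("starpunk.search", level="WARNING") as captured:
            results = search("punk", broken, lambda q: [f"like:{q}"])

        self.assertEqual(results, ["like:punk"])
        self.assertEqual(len(captured.records), 1)
```

Passing the backends in as arguments keeps the failure injectable without patching.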

## Security Considerations

1. Never expose stack traces to users
2. Sanitize error messages
3. Rate limit error endpoints
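4. Don't reveal whether a resource or account exists through differing error responses
5. Log security-relevant errors (such as authentication failures) distinctly so they can be audited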

## Migration Path

1. Phase 1: Add error classes
2. Phase 2: Wrap existing code
3. Phase 3: Add graceful degradation
4. Phase 4: Improve error messages

## References

- [PEP 8: Programming Recommendations](https://www.python.org/dev/peps/pep-0008/#programming-recommendations)
- [HTTP Status Codes](https://httpstatuses.com/)
- [OWASP Error Handling](https://owasp.org/www-community/Improper_Error_Handling)
- [Google SRE Book - Handling Overload](https://sre.google/sre-book/handling-overload/)

## Document History

- 2025-11-25: Initial draft for v1.1.1 release planning