StarPunk/docs/decisions/ADR-055-error-handling-philosophy.md

# ADR-055: Error Handling Philosophy
## Status
Accepted
## Context
StarPunk v1.1.1 focuses on production readiness, including graceful error handling. Currently, error handling is inconsistent:
- Some errors crash the application
- Error messages vary in helpfulness
- No distinction between user and system errors
- Insufficient context for debugging
We need a consistent philosophy for handling errors that balances user experience, security, and debuggability.
## Decision
Adopt a layered error handling strategy that provides graceful degradation, helpful user messages, and detailed logging for operators.
### Error Handling Principles
1. **Fail Gracefully**: Never crash when recovery is possible
2. **Be Helpful**: Provide actionable error messages
3. **Log Everything**: Detailed context for debugging
4. **Secure by Default**: Don't leak sensitive information
5. **User vs System**: Different handling for different audiences
### Error Categories
#### 1. User Errors (4xx class)
Errors caused by user action or client issues.
Examples:
- Invalid Micropub request
- Authentication failure
- Missing required fields
- Invalid slug format
Handling:
- Return helpful error message
- Suggest corrective action
- Log at INFO level
- Don't expose internals
#### 2. System Errors (5xx class)
Errors in system operation.
Examples:
- Database connection failure
- File system errors
- Memory exhaustion
- Template rendering errors
Handling:
- Generic user message
- Detailed logging at ERROR level
- Attempt recovery if possible
- Alert operators (future)
#### 3. Configuration Errors
Errors due to misconfiguration.
Examples:
- Missing required config
- Invalid configuration values
- Incompatible settings
- Permission issues
Handling:
- Fail fast at startup
- Clear error messages
- Suggest fixes
- Document requirements
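
The fail-fast handling above might look roughly like the following sketch. The setting names (`SITE_URL`, `DATA_PATH`), the `ConfigError` class, and the `validate_config` helper are hypothetical and chosen only for illustration; this ADR does not specify them.

```python
from typing import Any, Dict, List


class ConfigError(Exception):
    """Hypothetical stand-in for a configuration error raised at startup."""


def validate_config(config: Dict[str, Any]) -> None:
    """Collect every configuration problem, then fail once with all of them."""
    problems: List[str] = []

    # Hypothetical required settings; a real deployment would list its own.
    for key in ("SITE_URL", "DATA_PATH"):
        if not config.get(key):
            problems.append(f"{key} is required but not set. Add it to your configuration.")

    site_url = config.get("SITE_URL")
    if site_url and not str(site_url).startswith(("http://", "https://")):
        problems.append("SITE_URL must start with http:// or https://.")

    if problems:
        # Raising before the app serves any request keeps misconfiguration visible.
        raise ConfigError("Invalid configuration:\n- " + "\n- ".join(problems))
```

Reporting all problems in one exception, rather than stopping at the first, saves the operator from fixing settings one restart at a time.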
#### 4. Transient Errors
Temporary errors that may succeed on retry.
Examples:
- Database lock
- Network timeout
- Resource temporarily unavailable
Handling:
- Automatic retry with backoff
- Log at WARNING level
- Fail gracefully after retries
- Track frequency
### Error Response Format
#### Development Mode
```json
{
  "error": {
    "type": "ValidationError",
    "message": "Invalid slug format",
    "details": {
      "field": "slug",
      "value": "my/bad/slug",
      "pattern": "^[a-z0-9-]+$"
    },
    "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
    "documentation": "/docs/api/micropub#slugs",
    "trace_id": "abc123"
  }
}
```
#### Production Mode
```json
{
  "error": {
    "message": "Invalid request format",
    "suggestion": "Please check your request and try again",
    "documentation": "/docs/api/micropub",
    "trace_id": "abc123"
  }
}
```
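
As a rough illustration, both response formats can be derived from the same validation result by switching on a debug flag. The helper names `check_slug` and `format_error` are hypothetical; the slug pattern and message strings are copied from the examples above.

```python
import re
from typing import Any, Dict, Optional

SLUG_PATTERN = r"^[a-z0-9-]+$"  # pattern taken from the development-mode example


def check_slug(slug: str) -> Optional[Dict[str, Any]]:
    """Return error details for an invalid slug, or None when the slug is valid."""
    if re.fullmatch(SLUG_PATTERN, slug):
        return None
    return {"field": "slug", "value": slug, "pattern": SLUG_PATTERN}


def format_error(details: Dict[str, Any], trace_id: str, debug: bool) -> Dict[str, Any]:
    """Build the development or production response body for a slug error."""
    if debug:
        # Development mode: include the error type and the offending input.
        return {"error": {
            "type": "ValidationError",
            "message": "Invalid slug format",
            "details": details,
            "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
            "documentation": "/docs/api/micropub#slugs",
            "trace_id": trace_id,
        }}
    # Production mode: generic wording, no internals, same trace ID.
    return {"error": {
        "message": "Invalid request format",
        "suggestion": "Please check your request and try again",
        "documentation": "/docs/api/micropub",
        "trace_id": trace_id,
    }}
```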
### Implementation Pattern
```python
# starpunk/errors.py
import logging
import uuid
from enum import Enum
from typing import Any, Dict, Optional

logger = logging.getLogger('starpunk.errors')


class ErrorCategory(Enum):
    USER = "user"
    SYSTEM = "system"
    CONFIG = "config"
    TRANSIENT = "transient"


class StarPunkError(Exception):
    """Base exception for all StarPunk errors"""

    def __init__(
        self,
        message: str,
        category: ErrorCategory = ErrorCategory.SYSTEM,
        suggestion: Optional[str] = None,
        details: Optional[Dict[str, Any]] = None,
        status_code: int = 500,
        recoverable: bool = False
    ):
        self.message = message
        self.category = category
        self.suggestion = suggestion
        self.details = details or {}
        self.status_code = status_code
        self.recoverable = recoverable
        # Generated per error so the response can be matched to its log entry.
        # (The ID scheme is an assumption; any unique token would work.)
        self.trace_id = uuid.uuid4().hex
        super().__init__(message)

    def to_user_dict(self, debug: bool = False) -> dict:
        """Format error for user response"""
        result = {
            'error': {
                'message': self.message,
                'trace_id': self.trace_id
            }
        }
        if self.suggestion:
            result['error']['suggestion'] = self.suggestion
        if debug and self.details:
            result['error']['details'] = self.details
            result['error']['type'] = self.__class__.__name__
        return result

    def log(self):
        """Log error with appropriate level"""
        extra = {'context': self.details, 'trace_id': self.trace_id}
        if self.category == ErrorCategory.USER:
            logger.info("User error: %s", self.message, extra=extra)
        elif self.category == ErrorCategory.TRANSIENT:
            logger.warning("Transient error: %s", self.message, extra=extra)
        else:
            logger.error(
                "System error: %s",
                self.message,
                extra=extra,
                exc_info=True
            )


# Specific error classes
class ValidationError(StarPunkError):
    """User input validation failed"""

    def __init__(self, message: str, field: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=400,
            **kwargs
        )
        if field:
            self.details['field'] = field


class AuthenticationError(StarPunkError):
    """Authentication failed"""

    def __init__(self, message: str = "Authentication required", **kwargs):
        # setdefault avoids a duplicate-keyword TypeError if the caller
        # passes its own suggestion
        kwargs.setdefault('suggestion', "Please authenticate and try again")
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=401,
            **kwargs
        )


class DatabaseError(StarPunkError):
    """Database operation failed"""

    def __init__(self, message: str, **kwargs):
        kwargs.setdefault('suggestion', "Please try again later")
        super().__init__(
            message,
            category=ErrorCategory.SYSTEM,
            status_code=500,
            **kwargs
        )


class ConfigurationError(StarPunkError):
    """Configuration is invalid"""

    def __init__(self, message: str, setting: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.CONFIG,
            status_code=500,
            **kwargs
        )
        if setting:
            self.details['setting'] = setting
```
### Error Handling Middleware
```python
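# starpunk/middleware/errors.py
import functools

from starpunk.errors import ErrorCategory, StarPunkError

# NOTE: is_debug_mode() is assumed to be provided elsewhere in the application.


def error_handler(func):
    """Decorator for consistent error handling"""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except StarPunkError as e:
            e.log()
            # Return the body together with the error's HTTP status code
            return e.to_user_dict(debug=is_debug_mode()), e.status_code
        except Exception as e:
            # Unexpected error: wrap it so the response stays generic
            error = StarPunkError(
                message="An unexpected error occurred",
                category=ErrorCategory.SYSTEM,
                details={'original': str(e)}
            )
            error.log()
            return error.to_user_dict(debug=is_debug_mode()), error.status_code
    return wrapper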
```
### Graceful Degradation Examples
#### FTS5 Unavailable
```python
try:
    # Attempt FTS5 search
    results = search_with_fts5(query)
except FTS5UnavailableError:
    logger.warning("FTS5 unavailable, falling back to LIKE")
    results = search_with_like(query)
    flash("Search is running in compatibility mode")
```
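
For a self-contained view of the same fallback, the sketch below runs against plain `sqlite3`. The table names `notes` and `notes_fts` and the `search_notes` helper are assumptions made for illustration, not StarPunk's actual schema or API.

```python
import sqlite3
from typing import List


def search_notes(conn: sqlite3.Connection, query: str) -> List[str]:
    """Search with FTS5 when possible, otherwise degrade to a LIKE scan."""
    try:
        # Hypothetical FTS5 virtual table; raises OperationalError when the
        # table is missing or FTS5 is not compiled into SQLite.
        rows = conn.execute(
            "SELECT content FROM notes_fts WHERE notes_fts MATCH ?", (query,)
        ).fetchall()
    except sqlite3.OperationalError:
        # Slower substring match; LIKE wildcards in the query are not escaped here.
        rows = conn.execute(
            "SELECT content FROM notes WHERE content LIKE ?", (f"%{query}%",)
        ).fetchall()
    return [row[0] for row in rows]
```

The caller gets results either way; only ranking and speed differ between the two paths.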
#### Database Lock
```python
import sqlite3

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=2),
    retry=retry_if_exception_type(sqlite3.OperationalError)
)
def execute_query(query):
    """Execute with retry for transient errors"""
    return db.execute(query)
```
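
If adding a retry library is undesirable, roughly the same behaviour can be sketched with the standard library alone. The `retry_on_lock` helper and its parameters are hypothetical. Unlike the decorator above, it retries only when the error message mentions a lock (SQLite reports "database is locked"), so other operational errors such as SQL syntax errors surface immediately.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_lock(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying lock errors with capped exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            # Give up on non-lock errors and on the final attempt.
            if "locked" not in str(exc).lower() or attempt == attempts:
                raise
            # Delays grow 0.5s, 1s, 2s, ... and never exceed max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.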
#### Missing Optional Feature
```python
if not config.SEARCH_ENABLED:
    # Return empty results instead of error
    return {
        'results': [],
        'message': 'Search is disabled on this instance'
    }
```
## Rationale
### Why Graceful Degradation?
1. **User Experience**: Don't break the whole app
2. **Reliability**: Partial functionality better than none
3. **Operations**: Easier to diagnose in production
4. **Recovery**: System can self-heal from transients
### Why Different Error Categories?
1. **Appropriate Response**: Different errors need different handling
2. **Security**: Don't expose internals for system errors
3. **Debugging**: Operators need full context
4. **User Experience**: Users need actionable messages
### Why Structured Errors?
1. **Consistency**: Predictable error format
2. **Parsing**: Tools can process errors
3. **Correlation**: Trace IDs link logs to responses
4. **Documentation**: Self-documenting error details
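
The correlation point can be made concrete with a small sketch: generate one ID, attach it to the log record, and return the same ID to the client. The `report_error` helper is hypothetical, and using a UUID as the trace ID is an assumption rather than something this ADR prescribes.

```python
import logging
import uuid
from typing import Any, Dict


def report_error(logger: logging.Logger, message: str) -> Dict[str, Any]:
    """Log the detailed message and return a generic response sharing its trace ID."""
    trace_id = uuid.uuid4().hex
    # `extra` adds trace_id as an attribute on the log record, so formatters
    # and handlers can include it in operator-facing output.
    logger.error("%s", message, extra={"trace_id": trace_id})
    return {
        "error": {
            "message": "An unexpected error occurred",
            "trace_id": trace_id,
        }
    }
```

An operator who receives the trace ID from a user report can then search the logs for that exact value.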
## Consequences
### Positive
1. **Better UX**: Helpful error messages
2. **Easier Debugging**: Rich context in logs
3. **More Reliable**: Graceful degradation
4. **Secure**: No information leakage
5. **Consistent**: Predictable error handling
### Negative
1. **More Code**: Error handling adds complexity
2. **Testing Burden**: Many error paths to test
3. **Performance**: Error handling overhead
4. **Maintenance**: Error messages need updates
### Mitigations
1. Use error hierarchy to reduce duplication
2. Generate tests for error paths
3. Cache error messages
4. Document error codes clearly
## Alternatives Considered
### 1. Let Exceptions Bubble
**Pros**: Simple, Python default
**Cons**: Poor UX, crashes, no context
**Decision**: Not production-ready
### 2. Generic Error Pages
**Pros**: Simple to implement
**Cons**: Not helpful, poor API experience
**Decision**: Insufficient for Micropub API
### 3. Error Codes System
**Pros**: Precise, machine-readable
**Cons**: Complex, needs documentation
**Decision**: Over-engineered for our scale
### 4. Sentry/Error Tracking Service
**Pros**: Rich features, alerting
**Cons**: External dependency, privacy
**Decision**: Conflicts with self-hosted philosophy
## Implementation Notes
### Critical Path Protection
Always protect critical paths:
```python
# Never let note creation completely fail
try:
    create_search_index(note)
except Exception as e:
    logger.error("Search indexing failed: %s", e)
    # Continue without search - note still created
```
### Error Budget
Track error rates for SLO monitoring:
- User errors: Unlimited (not our fault)
- System errors: <0.1% of requests
- Configuration errors: 0 after startup
- Transient errors: <1% of requests
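
One possible way to check these budgets is an in-memory counter, sketched below. The `ErrorBudget` class is hypothetical and keeps no history or time windows; a real deployment would more likely derive these rates from logs or metrics.

```python
from collections import Counter
from typing import List, Optional

# Budgets as fractions of all requests, taken from the list above.
# User errors are unlimited, so they are counted but never checked.
BUDGETS = {"system": 0.001, "transient": 0.01, "config": 0.0}


class ErrorBudget:
    """Counts requests and errors per category and reports exceeded budgets."""

    def __init__(self) -> None:
        self.requests = 0
        self.errors: Counter = Counter()

    def record_request(self, error_category: Optional[str] = None) -> None:
        """Record one request, optionally tagged with the error category it hit."""
        self.requests += 1
        if error_category:
            self.errors[error_category] += 1

    def exceeded(self) -> List[str]:
        """Return the categories whose error rate is outside its budget."""
        if self.requests == 0:
            return []
        over = []
        for category, limit in BUDGETS.items():
            rate = self.errors[category] / self.requests
            # A zero budget is violated by any error; otherwise the rate
            # must stay strictly below the limit.
            if (rate > 0) if limit == 0 else (rate >= limit):
                over.append(category)
        return over
```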
### Testing Strategy
1. Unit tests for each error class
2. Integration tests for error paths
3. Chaos testing for transient errors
4. User journey tests with errors
## Security Considerations
1. Never expose stack traces to users
2. Sanitize error messages
3. Rate limit error endpoints
4. Don't reveal whether a resource or account exists through differing error responses
5. Log security errors specially
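
As one illustration of message sanitization, the sketch below redacts filesystem paths and credential-like `name=value` pairs before a message reaches a user. The `sanitize_message` helper and both patterns are assumptions for demonstration: they are deliberately coarse, will also redact URL paths, and are not a complete defence.

```python
import re

# Two or more consecutive /segment parts, e.g. /var/lib/app/data.db
_PATH = re.compile(r"(?:/[\w.-]+){2,}")
# name=value pairs whose name suggests a credential
_SECRET = re.compile(r"(?i)\b(token|password|secret|key)=\S+")


def sanitize_message(message: str) -> str:
    """Redact paths and credential-like values from a user-facing message."""
    message = _SECRET.sub(r"\1=[redacted]", message)
    return _PATH.sub("[path]", message)
```

The unsanitized message should still go to the operator log, where the full context is needed for debugging.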
## Migration Path
1. Phase 1: Add error classes
2. Phase 2: Wrap existing code
3. Phase 3: Add graceful degradation
4. Phase 4: Improve error messages
## References
- [PEP 8: Programming Recommendations](https://www.python.org/dev/peps/pep-0008/#programming-recommendations)
- [HTTP Status Codes](https://httpstatuses.com/)
- [OWASP Error Handling](https://owasp.org/www-community/Improper_Error_Handling)
- [Google SRE Book - Handling Overload](https://sre.google/sre-book/handling-overload/)
## Document History
- 2025-11-25: Initial draft for v1.1.1 release planning