# ADR-055: Error Handling Philosophy

## Status
Accepted

## Context
StarPunk v1.1.1 focuses on production readiness, including graceful error handling. Currently, error handling is inconsistent:
- Some errors crash the application
- Error messages vary in helpfulness
- No distinction between user and system errors
- Insufficient context for debugging

We need a consistent philosophy for handling errors that balances user experience, security, and debuggability.

## Decision
Adopt a layered error handling strategy that provides graceful degradation, helpful user messages, and detailed logging for operators.

### Error Handling Principles

1. **Fail Gracefully**: Never crash when recovery is possible
2. **Be Helpful**: Provide actionable error messages
3. **Log Everything**: Detailed context for debugging
4. **Secure by Default**: Don't leak sensitive information
5. **User vs System**: Different handling for different audiences

### Error Categories

#### 1. User Errors (4xx class)
Errors caused by user action or client issues.

Examples:
- Invalid Micropub request
- Authentication failure
- Missing required fields
- Invalid slug format

Handling:
- Return helpful error message
- Suggest corrective action
- Log at INFO level
- Don't expose internals

#### 2. System Errors (5xx class)
Errors in system operation.

Examples:
- Database connection failure
- File system errors
- Memory exhaustion
- Template rendering errors

Handling:
- Generic user message
- Detailed logging at ERROR level
- Attempt recovery if possible
- Alert operators (future)

#### 3. Configuration Errors
Errors due to misconfiguration.

Examples:
- Missing required config
- Invalid configuration values
- Incompatible settings
- Permission issues

Handling:
- Fail fast at startup
- Clear error messages
- Suggest fixes
- Document requirements

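The fail-fast handling for configuration errors could look roughly like the sketch below. It is illustrative only: the setting names `SITE_URL` and `DATA_PATH` and both function names are assumptions made for the example, not StarPunk's actual configuration API.

```python
import sys
from typing import List, Mapping

# Hypothetical setting names, chosen only for illustration.
REQUIRED_SETTINGS = ("SITE_URL", "DATA_PATH")


def find_config_problems(config: Mapping[str, str]) -> List[str]:
    """Return one message per misconfigured setting, each with a suggested fix."""
    problems = []
    for name in REQUIRED_SETTINGS:
        if not config.get(name):
            problems.append(
                f"{name} is required but not set. Fix: define {name} in the environment."
            )
    site_url = config.get("SITE_URL", "")
    if site_url and not site_url.startswith(("http://", "https://")):
        problems.append(
            "SITE_URL must start with http:// or https://. Fix: include the scheme."
        )
    return problems


def validate_or_exit(config: Mapping[str, str]) -> None:
    """Fail fast: report every problem at once, then stop before serving requests."""
    problems = find_config_problems(config)
    if problems:
        for problem in problems:
            print(f"Configuration error: {problem}", file=sys.stderr)
        raise SystemExit(1)
```

Calling `validate_or_exit(os.environ)` once during startup would surface all problems together, so an operator does not have to fix them one restart at a time.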
#### 4. Transient Errors
Temporary errors that may succeed on retry.

Examples:
- Database lock
- Network timeout
- Resource temporarily unavailable

Handling:
- Automatic retry with backoff
- Log at WARNING level
- Fail gracefully after retries
- Track frequency

### Error Response Format

#### Development Mode
```json
{
  "error": {
    "type": "ValidationError",
    "message": "Invalid slug format",
    "details": {
      "field": "slug",
      "value": "my/bad/slug",
      "pattern": "^[a-z0-9-]+$"
    },
    "suggestion": "Slugs can only contain lowercase letters, numbers, and hyphens",
    "documentation": "/docs/api/micropub#slugs",
    "trace_id": "abc123"
  }
}
```

#### Production Mode
```json
{
  "error": {
    "message": "Invalid request format",
    "suggestion": "Please check your request and try again",
    "documentation": "/docs/api/micropub",
    "trace_id": "abc123"
  }
}
```

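As a rough sketch of how the two response shapes above might be produced from the same error, the helper below switches on a `debug` flag. The names `new_trace_id` and `build_error_response`, and the choice of a truncated UUID for the trace ID, are assumptions for illustration; the ADR does not specify them.

```python
import uuid
from typing import Any, Dict, Optional


def new_trace_id() -> str:
    """Short random identifier intended to appear in both the log and the response."""
    return uuid.uuid4().hex[:12]


def build_error_response(
    error_type: str,
    message: str,
    trace_id: str,
    *,
    debug: bool,
    generic_message: str = "Invalid request format",
    suggestion: Optional[str] = None,
    details: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the development or production payload for one error."""
    body: Dict[str, Any] = {"trace_id": trace_id}
    if debug:
        # Development mode: expose the error type, exact message, and details.
        body["type"] = error_type
        body["message"] = message
        if details:
            body["details"] = details
    else:
        # Production mode: replace specifics with a generic message.
        body["message"] = generic_message
    if suggestion:
        body["suggestion"] = suggestion
    return {"error": body}
```

The trace ID is kept in both modes, so an operator can still find the detailed log entry for a response that shows the user only a generic message.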
### Implementation Pattern

```python
# starpunk/errors.py
import logging
import uuid
from enum import Enum
from typing import Any, Dict, Optional

logger = logging.getLogger('starpunk.errors')

class ErrorCategory(Enum):
    USER = "user"
    SYSTEM = "system"
    CONFIG = "config"
    TRANSIENT = "transient"

class StarPunkError(Exception):
    """Base exception for all StarPunk errors"""

    def __init__(
        self,
        message: str,
        category: ErrorCategory = ErrorCategory.SYSTEM,
        suggestion: Optional[str] = None,
        details: Optional[Dict[str, Any]] = None,
        status_code: int = 500,
        recoverable: bool = False
    ):
        self.message = message
        self.category = category
        self.suggestion = suggestion
        self.details = details or {}
        self.status_code = status_code
        self.recoverable = recoverable
        # Assumption: a random per-error ID; this ADR does not specify
        # how trace IDs are generated.
        self.trace_id = uuid.uuid4().hex
        super().__init__(message)

    def to_user_dict(self, debug: bool = False) -> dict:
        """Format error for user response"""
        result = {
            'error': {
                'message': self.message,
                'trace_id': self.trace_id
            }
        }

        if self.suggestion:
            result['error']['suggestion'] = self.suggestion

        # Type and details are only exposed in debug (development) mode
        if debug:
            result['error']['type'] = self.__class__.__name__
            if self.details:
                result['error']['details'] = self.details

        return result

    def log(self):
        """Log error with appropriate level"""
        # Include the trace ID so log entries can be matched to responses
        extra = {'context': self.details, 'trace_id': self.trace_id}
        if self.category == ErrorCategory.USER:
            logger.info("User error: %s", self.message, extra=extra)
        elif self.category == ErrorCategory.TRANSIENT:
            logger.warning("Transient error: %s", self.message, extra=extra)
        else:
            # exc_info=True only captures a traceback when called
            # from inside an except block
            logger.error(
                "System error: %s",
                self.message,
                extra=extra,
                exc_info=True
            )

# Specific error classes
class ValidationError(StarPunkError):
    """User input validation failed"""
    def __init__(self, message: str, field: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=400,
            **kwargs
        )
        if field:
            self.details['field'] = field

class AuthenticationError(StarPunkError):
    """Authentication failed"""
    def __init__(self, message: str = "Authentication required", **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.USER,
            status_code=401,
            suggestion="Please authenticate and try again",
            **kwargs
        )

class DatabaseError(StarPunkError):
    """Database operation failed"""
    def __init__(self, message: str, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.SYSTEM,
            status_code=500,
            suggestion="Please try again later",
            **kwargs
        )

class ConfigurationError(StarPunkError):
    """Configuration is invalid"""
    def __init__(self, message: str, setting: Optional[str] = None, **kwargs):
        super().__init__(
            message,
            category=ErrorCategory.CONFIG,
            status_code=500,
            **kwargs
        )
        if setting:
            self.details['setting'] = setting
```

### Error Handling Middleware

```python
# starpunk/middleware/errors.py
import functools

from starpunk.errors import ErrorCategory, StarPunkError

def error_handler(func):
    """Decorator for consistent error handling"""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except StarPunkError as e:
            e.log()
            # is_debug_mode() is assumed to be defined elsewhere in the application.
            # Returning (body, status) assumes a Flask-style view contract.
            return e.to_user_dict(debug=is_debug_mode()), e.status_code
        except Exception as e:
            # Unexpected error: wrap it so the user sees a generic message
            error = StarPunkError(
                message="An unexpected error occurred",
                category=ErrorCategory.SYSTEM,
                details={'original': str(e)}
            )
            error.log()
            return error.to_user_dict(debug=is_debug_mode()), error.status_code
    return wrapper
```

### Graceful Degradation Examples

#### FTS5 Unavailable
```python
try:
    # Attempt FTS5 search
    results = search_with_fts5(query)
except FTS5UnavailableError:
    logger.warning("FTS5 unavailable, falling back to LIKE")
    results = search_with_like(query)
    flash("Search is running in compatibility mode")
```

#### Database Lock
```python
import sqlite3

# The decorator and helper names below match the tenacity library
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=2),
    retry=retry_if_exception_type(sqlite3.OperationalError)
)
def execute_query(query):
    """Execute with retry for transient errors"""
    return db.execute(query)
```

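For readers who want to avoid an extra dependency, the same retry-with-backoff idea can be sketched with the standard library alone. This is an illustrative alternative, not the project's implementation: `retry_on_lock` and its parameters are hypothetical names, the delays only roughly mirror the decorator above, and treating only "locked" messages as transient is an assumption.

```python
import functools
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_lock(
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry a function when SQLite reports a locked database."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    # Only lock contention is treated as transient here;
                    # other operational errors are re-raised immediately.
                    if "locked" not in str(exc).lower() or attempt == attempts:
                        raise
                    # Exponential backoff, capped at max_delay seconds.
                    sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
            raise ValueError("attempts must be at least 1")
        return wrapper
    return decorator
```

Narrowing the retry to lock errors is a deliberate choice in this sketch: `sqlite3.OperationalError` also covers problems such as a missing table, which will not succeed on retry.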
#### Missing Optional Feature
```python
if not config.SEARCH_ENABLED:
    # Return empty results instead of error
    return {
        'results': [],
        'message': 'Search is disabled on this instance'
    }
```

## Rationale

### Why Graceful Degradation?
1. **User Experience**: Don't break the whole app
2. **Reliability**: Partial functionality better than none
3. **Operations**: Easier to diagnose in production
4. **Recovery**: System can self-heal from transients

### Why Different Error Categories?
1. **Appropriate Response**: Different errors need different handling
2. **Security**: Don't expose internals for system errors
3. **Debugging**: Operators need full context
4. **User Experience**: Users need actionable messages

### Why Structured Errors?
1. **Consistency**: Predictable error format
2. **Parsing**: Tools can process errors
3. **Correlation**: Trace IDs link logs to responses
4. **Documentation**: Self-documenting error details

## Consequences

### Positive
1. **Better UX**: Helpful error messages
2. **Easier Debugging**: Rich context in logs
3. **More Reliable**: Graceful degradation
4. **Secure**: No information leakage
5. **Consistent**: Predictable error handling

### Negative
1. **More Code**: Error handling adds complexity
2. **Testing Burden**: Many error paths to test
3. **Performance**: Error handling overhead
4. **Maintenance**: Error messages need updates

### Mitigations
1. Use error hierarchy to reduce duplication
2. Generate tests for error paths
3. Cache error messages
4. Document error codes clearly

## Alternatives Considered

### 1. Let Exceptions Bubble
**Pros**: Simple, Python default
**Cons**: Poor UX, crashes, no context
**Decision**: Not production-ready

### 2. Generic Error Pages
**Pros**: Simple to implement
**Cons**: Not helpful, poor API experience
**Decision**: Insufficient for Micropub API

### 3. Error Codes System
**Pros**: Precise, machine-readable
**Cons**: Complex, needs documentation
**Decision**: Over-engineered for our scale

### 4. Sentry/Error Tracking Service
**Pros**: Rich features, alerting
**Cons**: External dependency, privacy
**Decision**: Conflicts with self-hosted philosophy

## Implementation Notes

### Critical Path Protection
Always protect critical paths:
```python
# Never let note creation completely fail
try:
    create_search_index(note)
except Exception as e:
    logger.error("Search indexing failed: %s", e)
    # Continue without search - note still created
```

### Error Budget
Track error rates for SLO monitoring:
- User errors: Unlimited (not our fault)
- System errors: <0.1% of requests
- Configuration errors: 0 after startup
- Transient errors: <1% of requests

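A minimal sketch of checking these budgets against observed counts follows. The thresholds come from the list above; the `ERROR_BUDGETS` table, the `budget_violations` function, and the idea of passing in per-category counts are assumptions for illustration, since the ADR does not say how error rates are collected.

```python
from typing import Dict, List, Optional

# Maximum share of requests allowed per category; None means no limit.
ERROR_BUDGETS: Dict[str, Optional[float]] = {
    "user": None,       # unlimited
    "system": 0.001,    # 0.1% of requests
    "config": 0.0,      # none after startup
    "transient": 0.01,  # 1% of requests
}


def budget_violations(error_counts: Dict[str, int], total_requests: int) -> List[str]:
    """Return the categories whose error rate exceeds its budget."""
    if total_requests <= 0:
        return []
    violations = []
    for category, budget in ERROR_BUDGETS.items():
        if budget is None:
            continue  # no limit for this category
        rate = error_counts.get(category, 0) / total_requests
        if rate > budget:
            violations.append(category)
    return violations
```

For example, 20 system errors in 10,000 requests is a 0.2% rate, which this check would report as over budget.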
### Testing Strategy
1. Unit tests for each error class
2. Integration tests for error paths
3. Chaos testing for transient errors
4. User journey tests with errors

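A unit test for one error class might look like the sketch below. To keep it self-contained, it defines a simplified stand-in for `ValidationError` rather than importing the real class, so the stand-in and the `run_tests` helper are assumptions for illustration only.

```python
import unittest


class ValidationError(Exception):
    """Simplified stand-in for starpunk.errors.ValidationError (hypothetical)."""

    def __init__(self, message, field=None):
        super().__init__(message)
        self.message = message
        self.status_code = 400
        self.details = {"field": field} if field else {}


class ValidationErrorTest(unittest.TestCase):
    def test_status_code_is_400(self):
        self.assertEqual(ValidationError("Invalid slug format").status_code, 400)

    def test_field_is_recorded_in_details(self):
        error = ValidationError("Invalid slug format", field="slug")
        self.assertEqual(error.details, {"field": "slug"})

    def test_details_empty_without_field(self):
        self.assertEqual(ValidationError("Invalid slug format").details, {})


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidationErrorTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```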
## Security Considerations

1. Never expose stack traces to users
2. Sanitize error messages before returning them
3. Rate limit error endpoints
4. Don't reveal whether a resource exists through differing error responses
5. Log security-related errors distinctly from other errors

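Message sanitization could be sketched as follows. The patterns are illustrative and far from exhaustive, and `sanitize_error_message` is a hypothetical helper name; a real implementation would need patterns matched to what the application can actually leak.

```python
import re

# Illustrative patterns only: absolute-looking filesystem paths with at
# least two segments, and key=value pairs with sensitive-sounding keys.
_PATH = re.compile(r"(?:/[\w.\-]+){2,}")
_SECRET = re.compile(r"(?i)\b(password|token|secret|api_key)=\S+")


def sanitize_error_message(message: str) -> str:
    """Redact filesystem paths and key=value secrets from a message."""
    # Redact secrets first so a secret value containing a path is fully hidden.
    message = _SECRET.sub(lambda m: f"{m.group(1)}=[redacted]", message)
    return _PATH.sub("[path]", message)
```

A sanitizer like this is a backstop, not a substitute for returning generic messages for system errors in production.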
## Migration Path

1. Phase 1: Add error classes
2. Phase 2: Wrap existing code
3. Phase 3: Add graceful degradation
4. Phase 4: Improve error messages

## References

- [Error Handling Best Practices](https://www.python.org/dev/peps/pep-0008/#programming-recommendations)
- [HTTP Status Codes](https://httpstatuses.com/)
- [OWASP Error Handling](https://owasp.org/www-community/Improper_Error_Handling)
- [Google SRE Book - Handling Overload](https://sre.google/sre-book/handling-overload/)

## Document History

- 2025-11-25: Initial draft for v1.1.1 release planning