feat(tags): Add database schema and tags module (v1.3.0 Phase 1)

Implements tag/category system backend following microformats2 p-category specification.

Database changes:
- Migration 008: Add tags and note_tags tables
- Normalized tag storage (case-insensitive lookup, display name preserved)
- Indexes for performance

New module:
- starpunk/tags.py: Tag management functions
  - normalize_tag: Normalize tag strings
  - get_or_create_tag: Get or create tag records
  - add_tags_to_note: Associate tags with notes (replaces existing)
  - get_note_tags: Retrieve note tags (alphabetically ordered)
  - get_tag_by_name: Lookup tag by normalized name
  - get_notes_by_tag: Get all notes with specific tag
  - parse_tag_input: Parse comma-separated tag input

Model updates:
- Note.tags property (lazy-loaded, prefer pre-loading in routes)
- Note.to_dict() add include_tags parameter

CRUD updates:
- create_note() accepts tags parameter
- update_note() accepts tags parameter (None = no change, [] = remove all)

Micropub integration:
- Pass tags to create_note() (tags already extracted by extract_tags())
- Return tags in q=source response

Per design doc: docs/design/v1.3.0/microformats-tags-design.md

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-10 11:24:23 -07:00
parent 927db4aea0
commit f10d0679da
188 changed files with 601 additions and 945 deletions
--- a/docs/design/v1.0.0/indieauth-endpoint-discovery-security.md
+++ b/docs/design/v1.0.0/indieauth-endpoint-discovery-security.md
@@ -0,0 +1,397 @@
+# IndieAuth Endpoint Discovery Security Analysis
+
+## Executive Summary
+
+This document analyzes the security implications of implementing IndieAuth endpoint discovery correctly, contrasting it with the fundamentally flawed approach of hardcoding endpoints.
+
+## The Critical Error: Hardcoded Endpoints
+
+### What Was Wrong
+
+```ini
+# FATALLY FLAWED - Breaks IndieAuth completely
+TOKEN_ENDPOINT=https://tokens.indieauth.com/token
+```
+
+### Why It's a Security Disaster
+
+1. **Single Point of Failure**: If the hardcoded endpoint is compromised, ALL users are affected
+2. **No User Control**: Users cannot change providers if security issues arise
+3. **Trust Concentration**: Forces all users to trust a single provider
+4. **Not IndieAuth**: This isn't IndieAuth at all - it's just OAuth with extra steps
+5. **Violates User Sovereignty**: Users don't control their own authentication
+
+## The Correct Approach: Dynamic Discovery
+
+### Security Model
+
+```
+User Identity URL → Endpoint Discovery → Provider Verification
+     (User Controls)     (Dynamic)        (User's Choice)
+```
+
+### Security Benefits
+
+1. **Distributed Trust**: No single provider compromise affects all users
+2. **User Control**: Users can switch providers instantly if needed
+3. **Provider Independence**: Each user's security is independent
+4. **Immediate Revocation**: Users can revoke by changing profile links
+5. **True Decentralization**: No central authority
+
+## Threat Analysis
+
+### Threat 1: Profile URL Hijacking
+
+**Attack Vector**: Attacker gains control of user's profile URL
+
+**Impact**: Can redirect authentication to attacker's endpoints
+
+**Mitigations**:
+- Profile URL must use HTTPS
+- Verify SSL certificates
+- Monitor for unexpected endpoint changes
+- Cache endpoints with reasonable TTL
+
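The "monitor for unexpected endpoint changes" mitigation can be sketched as a comparison between the endpoints cached earlier and the ones just discovered. This is a minimal illustration, not part of the design above: the function name `detect_endpoint_changes` and the example URLs are hypothetical.

```python
def detect_endpoint_changes(previous: dict, current: dict) -> dict:
    """Return {rel: (old_url, new_url)} for every endpoint that differs.

    A missing endpoint is reported with None on the side where it is absent.
    """
    changes = {}
    for rel in sorted(set(previous) | set(current)):
        old_url = previous.get(rel)
        new_url = current.get(rel)
        if old_url != new_url:
            changes[rel] = (old_url, new_url)
    return changes


# Hypothetical values: what was cached earlier vs. what discovery returned now
cached = {"token_endpoint": "https://auth.example.com/token"}
fresh = {"token_endpoint": "https://other.example.net/token"}
changes = detect_endpoint_changes(cached, fresh)
```

A non-empty result is not necessarily an attack, since users may legitimately switch providers, so it is probably best treated as a signal to log and review rather than a reason to reject the login.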
+### Threat 2: Endpoint Discovery Manipulation
+
+**Attack Vector**: MITM attack during endpoint discovery
+
+**Impact**: Could redirect to malicious endpoints
+
+**Mitigations**:
+```python
+def discover_endpoints(profile_url: str) -> dict:
+    # CRITICAL: Enforce HTTPS
+    if not profile_url.startswith('https://'):
+        raise SecurityError("Profile URL must use HTTPS")
+
+    # Verify SSL certificates
+    response = requests.get(
+        profile_url,
+        verify=True,  # Enforce certificate validation
+        timeout=5
+    )
+
+    # Validate discovered endpoints
+    endpoints = extract_endpoints(response)
+    for endpoint_url in endpoints.values():
+        if not endpoint_url.startswith('https://'):
+            raise SecurityError(f"Endpoint must use HTTPS: {endpoint_url}")
+
+    return endpoints
+```
+
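The `extract_endpoints` helper used above is not defined in this document. As a rough sketch of what the HTML half of it might look like, the following parses `<link>` elements for the IndieAuth rel values and resolves relative `href`s against the profile URL. The name `extract_endpoints_from_html` and its signature are assumptions, and a fuller implementation would also need to consult HTTP `Link` headers, which this sketch ignores.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# rel values used for IndieAuth discovery
ENDPOINT_RELS = {"authorization_endpoint", "token_endpoint", "indieauth-metadata"}


class _LinkRelParser(HTMLParser):
    """Collects the first href seen for each IndieAuth rel value."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.endpoints = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated values
        for rel in (attr_map.get("rel") or "").split():
            if rel in ENDPOINT_RELS and rel not in self.endpoints:
                # Resolve relative hrefs against the profile URL
                self.endpoints[rel] = urljoin(self.base_url, href)


def extract_endpoints_from_html(html: str, base_url: str) -> dict:
    """Hypothetical helper: map rel value -> absolute endpoint URL."""
    parser = _LinkRelParser(base_url)
    parser.feed(html)
    return parser.endpoints
```

Because relative links are resolved against an HTTPS profile URL, the HTTPS check in `discover_endpoints` still applies to the resolved values.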
+### Threat 3: Cache Poisoning
+
+**Attack Vector**: Attacker poisons endpoint cache with malicious URLs
+
+**Impact**: Subsequent requests use attacker's endpoints
+
+**Mitigations**:
+```python
+import time
+from typing import Optional
+
+class SecureEndpointCache:
+    def store_endpoints(self, profile_url: str, endpoints: dict):
+        # Validate before caching
+        self._validate_profile_url(profile_url)
+        self._validate_endpoints(endpoints)
+
+        # Store with integrity check
+        cache_entry = {
+            'endpoints': endpoints,
+            'stored_at': time.time(),
+            'checksum': self._calculate_checksum(endpoints)
+        }
+        self.cache[profile_url] = cache_entry
+
+    def get_endpoints(self, profile_url: str) -> Optional[dict]:
+        entry = self.cache.get(profile_url)
+        if entry is None:
+            # Cache miss: the caller should run discovery again
+            return None
+
+        # Verify integrity before trusting the cached endpoints
+        if self._calculate_checksum(entry['endpoints']) != entry['checksum']:
+            # Cache corruption detected
+            del self.cache[profile_url]
+            raise SecurityError("Cache integrity check failed")
+        return entry['endpoints']
+```
+
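The `_calculate_checksum` method is left undefined above. One possible sketch, under the assumption that the application holds a secret key, is a keyed HMAC over a canonical JSON encoding of the endpoints. A plain unkeyed hash would only catch accidental corruption, because anyone able to rewrite a cache entry could recompute the hash as well. The names `calculate_checksum`, `checksum_matches`, and the `key` parameter are hypothetical.

```python
import hashlib
import hmac
import json


def calculate_checksum(endpoints: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical (sorted-key) JSON encoding."""
    canonical = json.dumps(endpoints, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def checksum_matches(endpoints: dict, key: bytes, expected: str) -> bool:
    """Compare in constant time to avoid leaking how many characters match."""
    return hmac.compare_digest(calculate_checksum(endpoints, key), expected)
```

Sorting the keys makes the checksum independent of dictionary insertion order, so two equal endpoint mappings always produce the same value.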
+### Threat 4: Redirect Attacks
+
+**Attack Vector**: Malicious redirects during discovery
+
+**Impact**: Could redirect to attacker-controlled endpoints
+
+**Mitigations**:
+```python
+import requests
+from urllib.parse import urljoin
+
+def fetch_with_redirect_limit(url: str, max_redirects: int = 5):
+    redirect_count = 0
+    visited = set()
+
+    # Follow up to max_redirects hops; one more than that is an error
+    while redirect_count <= max_redirects:
+        if url in visited:
+            raise SecurityError("Redirect loop detected")
+        visited.add(url)
+
+        response = requests.get(url, allow_redirects=False, timeout=5)
+
+        if response.status_code in (301, 302, 303, 307, 308):
+            location = response.headers.get('Location')
+            if not location:
+                raise SecurityError("Redirect response without Location header")
+
+            # Location may be relative; resolve it against the current URL
+            redirect_url = urljoin(url, location)
+
+            # Validate redirect target
+            if not redirect_url.startswith('https://'):
+                raise SecurityError("Redirect to non-HTTPS URL blocked")
+
+            url = redirect_url
+            redirect_count += 1
+        else:
+            return response
+
+    raise SecurityError("Too many redirects")
+```
+
+### Threat 5: Token Replay Attacks
+
+**Attack Vector**: Intercepted token reused
+
+**Impact**: Unauthorized access
+
+**Mitigations**:
+- Always use HTTPS for token transmission
+- Implement token expiration
+- Cache token verification results only briefly, so revoked tokens stop working quickly
+- Use nonce/timestamp validation
+
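The expiration and timestamp mitigations can be illustrated with a small freshness check. This is a sketch under assumptions: `issued_at` is a hypothetical Unix timestamp recorded when the token was issued or first verified, `expires_in` follows the OAuth convention of a lifetime in seconds and may be absent, and the `max_age` and `max_clock_skew` defaults are arbitrary.

```python
import time
from typing import Optional


def is_token_fresh(
    issued_at: float,
    expires_in: Optional[float],
    now: Optional[float] = None,
    max_age: float = 3600.0,
    max_clock_skew: float = 30.0,
) -> bool:
    """Return True if the token is neither expired nor dated in the future."""
    now = time.time() if now is None else now

    # Reject timestamps from the future beyond a small clock-skew allowance
    if issued_at > now + max_clock_skew:
        return False

    # Cap the lifetime at max_age, even if the provider advertises none
    lifetime = max_age if expires_in is None else min(float(expires_in), max_age)
    return now - issued_at < lifetime
```

Capping the lifetime locally means a stolen token stops being accepted after `max_age` even when the provider does not advertise an expiry.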
+## Security Requirements
+
+### 1. HTTPS Enforcement
+
+```python
+class HTTPSEnforcer:
+    def validate_url(self, url: str, context: str):
+        """Enforce HTTPS for all security-critical URLs"""
+
+        parsed = urlparse(url)
+
+        # Development exception (with warning): plain HTTP to loopback only
+        if (self.development_mode and parsed.scheme == 'http'
+                and parsed.hostname in ('localhost', '127.0.0.1')):
+            logger.warning(f"Allowing HTTP in development for {context}: {url}")
+            return
+
+        # Production: HTTPS required
+        if parsed.scheme != 'https':
+            raise SecurityError(f"HTTPS required for {context}: {url}")
+```
+
+### 2. Certificate Validation
+
+```python
+def create_secure_http_client():
+    """Create HTTP client with proper security settings"""
+
+    return httpx.Client(
+        verify=True,  # Always verify SSL certificates
+        follow_redirects=False,  # Handle redirects manually
+        timeout=httpx.Timeout(
+            connect=5.0,
+            read=10.0,
+            write=10.0,
+            pool=10.0
+        ),
+        limits=httpx.Limits(
+            max_connections=100,
+            max_keepalive_connections=20
+        ),
+        headers={
+            'User-Agent': 'StarPunk/1.0 (+https://starpunk.example.com/)'
+        }
+    )
+```
+
+### 3. Input Validation
+
+```python
+def validate_endpoint_response(response: dict, expected_me: str):
+    """Validate token verification response"""
+
+    # Required fields
+    if 'me' not in response:
+        raise ValidationError("Missing 'me' field in response")
+
+    # URL normalization and comparison
+    normalized_me = normalize_url(response['me'])
+    normalized_expected = normalize_url(expected_me)
+
+    if normalized_me != normalized_expected:
+        raise ValidationError(
+            f"Token 'me' mismatch: expected {normalized_expected}, "
+            f"got {normalized_me}"
+        )
+
+    # Scope validation
+    scopes = response.get('scope', '').split()
+    if 'create' not in scopes:
+        raise ValidationError("Token missing required 'create' scope")
+
+    return True
+```
+
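`normalize_url` is called above but not defined in this document. A minimal sketch, assuming the goal is only the basic canonicalization used for profile URLs (lowercase scheme and host, `/` as the path when none is given, fragment dropped), could look like this. Real profile URLs carry further restrictions that this sketch does not enforce.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical helper: canonicalize a profile URL for comparison."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Host names are case-insensitive; assumes no userinfo in the URL
    netloc = parts.netloc.lower()
    # An empty path is equivalent to "/"
    path = parts.path or "/"
    # Keep the query, drop the fragment
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With this normalization, `https://Example.com` and `https://example.com/` compare equal, while differences in path case are preserved because paths are case-sensitive.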
+### 4. Rate Limiting
+
+```python
+class DiscoveryRateLimiter:
+    """Prevent discovery abuse"""
+
+    def __init__(self, max_per_minute: int = 60):
+        self.requests = defaultdict(list)
+        self.max_per_minute = max_per_minute
+
+    def check_rate_limit(self, profile_url: str):
+        now = time.time()
+        minute_ago = now - 60
+
+        # Clean old entries
+        self.requests[profile_url] = [
+            t for t in self.requests[profile_url]
+            if t > minute_ago
+        ]
+
+        # Check limit
+        if len(self.requests[profile_url]) >= self.max_per_minute:
+            raise RateLimitError(f"Too many discovery requests for {profile_url}")
+
+        # Record request
+        self.requests[profile_url].append(now)
+```
+
+## Implementation Checklist
+
+### Discovery Security
+
+- [ ] Enforce HTTPS for profile URLs
+- [ ] Validate SSL certificates
+- [ ] Limit redirect chains to 5
+- [ ] Detect redirect loops
+- [ ] Validate discovered endpoint URLs
+- [ ] Implement discovery rate limiting
+- [ ] Log all discovery attempts
+- [ ] Handle timeouts gracefully
+
+### Token Verification Security
+
+- [ ] Use HTTPS for all token endpoints
+- [ ] Validate token endpoint responses
+- [ ] Check 'me' field matches expected
+- [ ] Verify required scopes present
+- [ ] Hash tokens before caching
+- [ ] Implement cache expiration
+- [ ] Use constant-time comparisons
+- [ ] Log verification failures
+
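Two of the items above, hashing tokens before caching and constant-time comparison, fit in a few lines of standard-library Python. The function names are hypothetical; `hashlib.sha256` and `hmac.compare_digest` are the real standard-library calls.

```python
import hashlib
import hmac


def hash_token(token: str) -> str:
    """Return a SHA-256 hex digest so the raw token is never stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def token_matches(presented_token: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_token(presented_token), stored_hash)


# Example: only the digest would be kept as the cache key
stored = hash_token("example-bearer-token")
```

Storing only the digest means that a leaked cache does not directly expose usable bearer tokens.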
+### Cache Security
+
+- [ ] Validate data before caching
+- [ ] Implement cache size limits
+- [ ] Use TTL for all cache entries
+- [ ] Clear cache on configuration changes
+- [ ] Protect against cache poisoning
+- [ ] Monitor cache hit/miss rates
+- [ ] Implement cache integrity checks
+
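The size-limit and TTL items can be combined in one small structure. The following is an illustrative sketch only: the class name `BoundedTTLCache`, its defaults, and the injectable `clock` parameter (used to make expiry testable) are assumptions rather than part of the design.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Cache with a maximum entry count and a per-entry time to live."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 3600.0,
                 clock=time.time):
        self._entries = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def set(self, key, value):
        # Re-inserting moves the key to the newest position
        self._entries.pop(key, None)
        self._entries[key] = (value, self._clock() + self._ttl)
        # Evict the oldest entries once the size limit is exceeded
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss
            del self._entries[key]
            return None
        return value

    def clear(self):
        """Drop everything, e.g. after a configuration change."""
        self._entries.clear()
```

Eviction here is by insertion order, which is simpler than true LRU and may be sufficient when entries expire after a fixed TTL anyway.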
+### Error Handling
+
+- [ ] Never expose internal errors
+- [ ] Log security events
+- [ ] Rate limit error responses
+- [ ] Implement proper timeouts
+- [ ] Handle network failures gracefully
+- [ ] Provide clear user messages
+
+## Security Testing
+
+### Test Scenarios
+
+1. **HTTPS Downgrade Attack**
+   - Try to use HTTP endpoints
+   - Verify rejection
+
+2. **Invalid Certificates**
+   - Test with self-signed certs
+   - Test with expired certs
+   - Verify rejection
+
+3. **Redirect Attacks**
+   - Test redirect loops
+   - Test excessive redirects
+   - Test HTTP redirects
+   - Verify proper handling
+
+4. **Cache Poisoning**
+   - Attempt to inject invalid data
+   - Verify cache validation
+
+5. **Token Manipulation**
+   - Modify token before verification
+   - Test expired tokens
+   - Test tokens with wrong 'me'
+   - Verify proper rejection
+
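Scenario 3 can be exercised without network access if the redirect rules are expressed as a pure function over a list of hops. The sketch below is one way to do that; `check_redirect_chain`, `RedirectError`, and the `rejected` helper are hypothetical names, and the limit of 5 mirrors the checklist above.

```python
from urllib.parse import urlparse


class RedirectError(Exception):
    """Raised when a redirect chain violates the security rules."""


def check_redirect_chain(hops, max_redirects: int = 5) -> None:
    """Validate a chain: hops[0] is the original URL, the rest are redirect targets."""
    if len(hops) - 1 > max_redirects:
        raise RedirectError("Too many redirects")
    seen = set()
    for url in hops:
        if urlparse(url).scheme != "https":
            raise RedirectError(f"Non-HTTPS URL in chain: {url}")
        if url in seen:
            raise RedirectError(f"Redirect loop at {url}")
        seen.add(url)


def rejected(hops) -> bool:
    """Test helper: True if the chain is refused."""
    try:
        check_redirect_chain(hops)
    except RedirectError:
        return True
    return False


# Example chains for the three redirect test cases
loop = ["https://a.example/", "https://b.example/", "https://a.example/"]
excessive = [f"https://a.example/{i}" for i in range(7)]  # 6 redirects
downgrade = ["https://a.example/", "http://a.example/login"]
valid = ["https://a.example/", "https://a.example/profile"]
```

Each example chain maps to one bullet of the scenario, so a test suite could simply assert that the first three are rejected and the last is accepted.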
+## Monitoring and Alerting
+
+### Security Metrics
+
+```python
+# Track these metrics
+security_metrics = {
+    'discovery_failures': Counter(),
+    'https_violations': Counter(),
+    'certificate_errors': Counter(),
+    'redirect_limit_exceeded': Counter(),
+    'cache_poisoning_attempts': Counter(),
+    'token_verification_failures': Counter(),
+    'rate_limit_violations': Counter()
+}
+```
+
+### Alert Conditions
+
+- Multiple discovery failures for same profile
+- Sudden increase in HTTPS violations
+- Certificate validation failures
+- Cache poisoning attempts detected
+- Unusual token verification patterns
+
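The first alert condition, multiple discovery failures for the same profile, could be detected with a sliding-window counter. This is a sketch under assumptions: the class name `DiscoveryFailureMonitor`, the threshold of 3, and the 5-minute window are illustrative values, not settings from this design.

```python
import time
from collections import defaultdict, deque


class DiscoveryFailureMonitor:
    """Flags a profile once it fails discovery too often within a time window."""

    def __init__(self, threshold: int = 3, window_seconds: float = 300.0,
                 clock=time.time):
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        self._failures = defaultdict(deque)

    def record_failure(self, profile_url: str) -> bool:
        """Record one failure; return True if an alert should fire."""
        now = self._clock()
        failures = self._failures[profile_url]
        failures.append(now)
        # Forget failures that have fallen out of the window
        while failures and failures[0] <= now - self._window:
            failures.popleft()
        return len(failures) >= self._threshold
```

The return value could be used to increment `discovery_failures` and to trigger whatever alerting channel the deployment uses.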
+## Incident Response
+
+### If Endpoint Compromise Suspected
+
+1. Clear endpoint cache immediately
+2. Force re-discovery of all endpoints
+3. Alert affected users
+4. Review logs for suspicious patterns
+5. Document incident
+
+### If Cache Poisoning Detected
+
+1. Clear entire cache
+2. Review cache validation logic
+3. Identify attack vector
+4. Implement additional validation
+5. Monitor for recurrence
+
+## Conclusion
+
+Dynamic endpoint discovery is not just correct according to the IndieAuth specification - it's also more secure than hardcoded endpoints. By allowing users to control their authentication infrastructure, we:
+
+1. Eliminate single points of failure
+2. Enable immediate provider switching
+3. Distribute security responsibility
+4. Maintain true decentralization
+5. Respect user sovereignty
+
+The complexity of proper implementation is justified by the security and flexibility benefits. This is what IndieAuth is designed to provide, and we must implement it correctly.
+
+---
+
+**Document Version**: 1.0
+**Created**: 2024-11-24
+**Classification**: Security Architecture
+**Review Schedule**: Quarterly