IndieAuth Endpoint Discovery Implementation Analysis
Date: 2025-11-24
Developer: StarPunk Fullstack Developer
Status: Ready for Architect Review
Target Version: 1.0.0-rc.5
Executive Summary
I have reviewed the architect's corrected IndieAuth endpoint discovery design and the W3C IndieAuth specification. The design is fundamentally sound and correctly implements the IndieAuth specification. However, I have critical questions about implementation details, particularly around the "chicken-and-egg" problem of determining which endpoint to verify a token with when we don't know the user's identity beforehand.
Overall Assessment: The design is architecturally correct, but needs clarification on practical implementation details before coding can begin.
What I Understand
1. The Core Problem Fixed
The architect correctly identified that hardcoding TOKEN_ENDPOINT=https://tokens.indieauth.com/token is fundamentally wrong. This violates IndieAuth's core principle of user sovereignty.
Correct Approach:
- Store only ADMIN_ME=https://admin.example.com/ in configuration
- Discover endpoints dynamically from the user's profile URL at runtime
- Each user can use their own IndieAuth provider
2. Endpoint Discovery Flow
Per W3C IndieAuth Section 4.2, I understand the discovery process:
1. Fetch user's profile URL (e.g., https://admin.example.com/)
2. Check in priority order:
a. HTTP Link headers (highest priority)
b. HTML <link> elements (document order)
c. IndieAuth metadata endpoint (optional)
3. Parse rel="authorization_endpoint" and rel="token_endpoint"
4. Resolve relative URLs against profile URL base
5. Cache discovered endpoints (with TTL)
Example Discovery:
GET https://admin.example.com/ HTTP/1.1
HTTP/1.1 200 OK
Link: <https://auth.example.com/token>; rel="token_endpoint"
Content-Type: text/html
<html>
<head>
<link rel="authorization_endpoint" href="https://auth.example.com/authorize">
<link rel="token_endpoint" href="https://auth.example.com/token">
</head>
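To make the flow above concrete, here is a minimal sketch of the discovery helper I have in mind. The function name, the use of httpx and BeautifulSoup, and the error handling are my assumptions pending the questions below, not a settled API:

```python
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup


def discover_endpoints(profile_url: str) -> dict:
    """Sketch: discover IndieAuth endpoints from a profile URL."""
    response = httpx.get(profile_url, follow_redirects=True, timeout=5.0)
    response.raise_for_status()

    endpoints = {}

    # Step 2b: HTML <link> elements, in document order
    soup = BeautifulSoup(response.text, "html.parser")
    for rel in ("authorization_endpoint", "token_endpoint"):
        link = soup.find("link", rel=rel)
        if link and link.get("href"):
            endpoints[rel] = urljoin(profile_url, link["href"])

    # Step 2a: HTTP Link headers take precedence, so apply them last
    for rel in ("authorization_endpoint", "token_endpoint"):
        if rel in response.links and response.links[rel].get("url"):
            endpoints[rel] = urljoin(profile_url, response.links[rel]["url"])

    return endpoints
```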
3. Token Verification Flow
Per W3C IndieAuth Section 6, I understand token verification:
1. Receive Bearer token in Authorization header
2. Make GET request to token endpoint with Bearer token
3. Token endpoint returns: {me, client_id, scope}
4. Validate 'me' matches expected identity
5. Check required scopes present
Example Verification:
GET https://auth.example.com/token HTTP/1.1
Authorization: Bearer xyz123
Accept: application/json
HTTP/1.1 200 OK
Content-Type: application/json
{
"me": "https://admin.example.com/",
"client_id": "https://quill.p3k.io/",
"scope": "create update delete"
}
4. Security Considerations
I understand the security model from the architect's docs:
- HTTPS Required: Profile URLs and endpoints MUST use HTTPS in production
- Redirect Limits: Maximum 5 redirects to prevent loops
- Cache Integrity: Validate endpoints before caching
- URL Validation: Ensure discovered URLs are well-formed
- Token Hashing: Hash tokens before caching (SHA-256)
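As an illustration of how I would apply these limits to the profile fetch, a minimal httpx client configuration might look like this (the numbers mirror this section; the client setup itself is my assumption):

```python
import httpx

# Sketch: client settings enforcing the limits listed above
client = httpx.Client(
    follow_redirects=True,
    max_redirects=5,             # redirect limit to prevent loops
    timeout=httpx.Timeout(5.0),  # bound the profile fetch
    verify=True,                 # keep TLS certificate verification on
)
```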
5. Implementation Components
I understand these modules need to be created:
- endpoint_discovery.py: Discover endpoints from profile URLs
  - HTTP Link header parsing
  - HTML link element extraction
  - URL resolution (relative to absolute)
  - Error handling
- Updated auth_external.py: Token verification with discovery
  - Integrate endpoint discovery
  - Cache discovered endpoints
  - Verify tokens with discovered endpoints
  - Validate responses
- endpoint_cache.py (or part of auth_external): Caching layer
  - Endpoint caching (TTL: 3600s)
  - Token verification caching (TTL: 300s)
  - Cache invalidation
6. Current Broken Code
From starpunk/auth_external.py line 49:
token_endpoint = current_app.config.get("TOKEN_ENDPOINT")
This hardcoded approach is the problem we're fixing.
Critical Questions for the Architect
Question 1: The "Which Endpoint?" Problem ⚠️
The Problem: When Micropub receives a token, we need to verify it. But which endpoint do we use to verify it?
The W3C spec says:
"GET request to the token endpoint containing an HTTP Authorization header with the Bearer Token according to RFC6750"
But it doesn't say how we know which token endpoint to use when we receive a token from an unknown source.
Current Micropub Flow:
# micropub.py line 74
token_info = verify_external_token(token)
The token is an opaque string like "abc123xyz". We have no idea:
- Which user it belongs to
- Which provider issued it
- Which endpoint to verify it with
ADR-030-CORRECTED suggests (line 204-258):
- Option A: If we have cached token info, use cached 'me' URL
- Option B: Try verification with last known endpoint for similar tokens
- Option C: Require 'me' parameter in Micropub request
My Questions:
1a) Which option should I implement? The ADR presents three options but doesn't specify which one.
1b) For Option A (cached token): How does the first request work? We need to verify a token to cache its 'me' URL, but we need the 'me' URL to know which endpoint to verify with. This is circular.
1c) For Option B (last known endpoint): How do we handle the first token ever received? What is the "last known endpoint" when the cache is empty?
1d) For Option C (require 'me' parameter): Does this violate the Micropub spec? The W3C Micropub specification doesn't include a 'me' parameter in requests. Is this a StarPunk-specific extension?
1e) Proposed Solution (awaiting architect approval):
Since StarPunk is a single-user CMS, we KNOW the only valid tokens are for ADMIN_ME. Therefore:
from typing import Any, Dict, Optional

import httpx
from flask import current_app

def verify_external_token(token: str) -> Optional[Dict[str, Any]]:
    """Verify token for the admin user"""
    admin_me = current_app.config.get("ADMIN_ME")

    # Discover endpoints from ADMIN_ME
    # (discover_endpoints, normalize_url, TokenVerificationError are helpers
    # to be provided by the new modules)
    endpoints = discover_endpoints(admin_me)
    token_endpoint = endpoints['token_endpoint']

    # Verify token with discovered endpoint
    response = httpx.get(
        token_endpoint,
        headers={'Authorization': f'Bearer {token}'}
    )
    response.raise_for_status()
    token_info = response.json()

    # Validate token belongs to admin
    if normalize_url(token_info['me']) != normalize_url(admin_me):
        raise TokenVerificationError("Token not for admin user")
    return token_info
Is this the correct approach? This assumes:
- StarPunk only accepts tokens for ADMIN_ME
- We always discover from the ADMIN_ME profile URL
- Multi-user support is explicitly out of scope for V1
Please confirm this is correct or provide the proper approach.
Question 2: Caching Strategy Details
ADR-030-CORRECTED suggests (line 131-160):
- Endpoint cache TTL: 3600s (1 hour)
- Token verification cache TTL: 300s (5 minutes)
My Questions:
2a) Cache Key for Endpoints: Should the cache key be the profile URL (admin_me) or should we maintain a global cache?
For single-user StarPunk, we only have one profile URL (ADMIN_ME), so a simple cache like:
self.cached_endpoints = None
self.cached_until = 0
Would suffice. Is this acceptable, or should I implement a full profile_url -> endpoints dict for future multi-user support?
2b) Cache Key for Tokens: The migration guide (line 259) suggests hashing tokens:
token_hash = hashlib.sha256(token.encode()).hexdigest()
But if tokens are opaque and unpredictable, why hash them? Is this:
- To prevent tokens appearing in logs/debug output?
- To prevent tokens being extracted from memory dumps?
- Because cache keys should be fixed-length?
If it's for security, should I also:
- Use a constant-time comparison for token hash lookups?
- Add HMAC with a secret key instead of plain SHA-256?
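For reference, a minimal sketch covering both variants; whether the HMAC option (and where its secret comes from) is wanted is exactly what I am asking, so treat the names as placeholders:

```python
import hashlib
import hmac
from typing import Optional


def token_cache_key(token: str, secret: Optional[bytes] = None) -> str:
    """Build a fixed-length cache key so the raw token is never stored."""
    if secret is not None:
        # HMAC variant, keyed with an application secret (placeholder)
        return hmac.new(secret, token.encode(), hashlib.sha256).hexdigest()
    # Plain SHA-256 variant, as shown in the migration guide
    return hashlib.sha256(token.encode()).hexdigest()
```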
2c) Cache Invalidation: When should I clear the cache?
- On application startup? (cache is in-memory, so yes?)
- On configuration changes? (how do I detect these?)
- On token verification failures? (what if it's a network issue, not a provider change?)
- Manual admin endpoint /admin/clear-cache? (should I implement this?)
2d) Cache Storage: The ADR shows in-memory caching. Should I:
- Use a simple dict with tuples: cache[key] = (value, expiry)
- Use functools.lru_cache decorator?
- Use cachetools library for TTL support?
- Implement custom EndpointCache class as shown in ADR?
For V1 simplicity, I propose a custom class with a simple dict, but please confirm.
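If the custom-class route is approved, I have something like this minimal TTL cache in mind (a sketch only; the TTL default follows the ADR numbers quoted above):

```python
import time
from typing import Any, Dict, Optional, Tuple


class EndpointCache:
    """Sketch: minimal in-memory TTL cache for the V1 single-user case."""

    def __init__(self, ttl: int = 3600):
        self.ttl = ttl
        self._store: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # expired entry; drop it
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (value, time.time() + self.ttl)
```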
Question 3: HTML Parsing Implementation
From docs/migration/fix-hardcoded-endpoints.md line 139-159:
from bs4 import BeautifulSoup

def _extract_from_html(self, html: str, base_url: str) -> Dict[str, str]:
    soup = BeautifulSoup(html, 'html.parser')
    auth_link = soup.find('link', rel='authorization_endpoint')
    if auth_link and auth_link.get('href'):
        endpoints['authorization_endpoint'] = urljoin(base_url, auth_link['href'])
My Questions:
3a) Dependency: Do we want to add BeautifulSoup4 as a dependency? Current dependencies (from quick check):
- Flask
- httpx
- Other core libs
BeautifulSoup4 is a new dependency. Alternatives:
- Use Python's built-in html.parser (more fragile)
- Use regex (bad for HTML, but endpoints are simple)
- Use lxml (faster, but C extension dependency)
Recommendation: Add BeautifulSoup4 with html.parser backend (pure Python). Confirm?
3b) HTML Validation: Should I validate HTML before parsing?
- Malformed HTML could cause parsing errors
- Should I catch and handle ParserError?
- What if there's no <head> section?
- What if <link> elements are in <body> (technically invalid but might exist)?
3c) Case Sensitivity: HTML rel attributes are case-insensitive per spec. Should I:
soup.find('link', rel='token_endpoint') # Exact match
# vs
soup.find('link', rel=lambda x: x.lower() == 'token_endpoint' if x else False)
I don't believe BeautifulSoup's find() lowercases attribute values when matching (and html.parser returns rel as a list of tokens), so the lambda approach may need adjusting. Please confirm; a sketch follows below.
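A sketch of a more tolerant matcher that handles case differences and BeautifulSoup returning rel as a list of tokens (the helper name is illustrative):

```python
from typing import Optional

from bs4 import BeautifulSoup


def find_rel_href(html: str, rel_value: str) -> Optional[str]:
    """Sketch: href of the first <link> whose rel tokens include rel_value."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("link"):
        rel = link.get("rel") or []
        # html.parser returns rel as a list of tokens; fall back to splitting a string
        tokens = rel if isinstance(rel, list) else str(rel).split()
        if any(token.lower() == rel_value.lower() for token in tokens):
            return link.get("href")
    return None
```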
Question 4: HTTP Link Header Parsing
From docs/migration/fix-hardcoded-endpoints.md line 126-136:
def _parse_link_header(self, header: str, base_url: str) -> Dict[str, str]:
    pattern = r'<([^>]+)>;\s*rel="([^"]+)"'
    matches = re.findall(pattern, header)
My Questions:
4a) Regex Robustness: This regex assumes:
- Double quotes around rel value
- Semicolon separator
- No spaces in weird places
But HTTP Link header format (RFC 8288) is more complex:
Link: <url>; rel="value"; param="other"
Link: <url>; rel=value (unquoted value, allowed per spec)
Link: <url>;rel="value" (no space after semicolon)
Should I:
- Use a more robust regex?
- Use a proper Link header parser library (e.g., httpx has built-in parsing)?
- Stick with simple regex and document limitations?
Recommendation: Use httpx's built-in Link header parsing (the Response.links property) if it covers our needs, otherwise a simple regex. Confirm?
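If we stay with a regex, a slightly more tolerant version could look like this; it is a sketch, still not a full RFC 8288 parser, and it assumes rel is the first parameter after the URL:

```python
import re
from urllib.parse import urljoin

# Accepts quoted or unquoted rel values and optional whitespace around ';' and '='
LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel\s*=\s*"?([^",;]+)"?')


def parse_link_header(header: str, base_url: str) -> dict:
    endpoints = {}
    for url, rel in LINK_RE.findall(header):
        if rel in ("authorization_endpoint", "token_endpoint"):
            endpoints[rel] = urljoin(base_url, url)
    return endpoints
```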
4b) Multiple Headers: RFC 8288 allows multiple Link headers:
Link: <https://auth.example.com/authorize>; rel="authorization_endpoint"
Link: <https://auth.example.com/token>; rel="token_endpoint"
Or comma-separated in single header:
Link: <https://auth.example.com/authorize>; rel="authorization_endpoint", <https://auth.example.com/token>; rel="token_endpoint"
My regex with re.findall() should handle both. Confirm this is correct?
4c) Priority Order: ADR says "HTTP Link headers take precedence over HTML". But what if:
- Link header has authorization_endpoint but not token_endpoint
- HTML has both
Should I:
# Option A: Once we find in Link header, stop looking
if 'token_endpoint' in link_header_endpoints:
    return link_header_endpoints
else:
    check_html()

# Option B: Merge Link header and HTML, Link header wins for conflicts
endpoints = html_endpoints.copy()
endpoints.update(link_header_endpoints)  # Link header overwrites
The W3C spec says "first HTTP Link header takes precedence", which suggests Option B (merge and overwrite). Confirm?
Question 5: URL Resolution and Validation
From ADR-030-CORRECTED line 217:
from urllib.parse import urljoin
endpoints['token_endpoint'] = urljoin(profile_url, href)
My Questions:
5a) URL Validation: Should I validate discovered URLs? Checks:
- Must be absolute after resolution
- Must use HTTPS (in production)
- Must be valid URL format
- Hostname must be valid
- No localhost/127.0.0.1 in production (allow in dev?)
Example validation:
def validate_endpoint_url(url: str, is_production: bool) -> bool:
    parsed = urlparse(url)
    if is_production and parsed.scheme != 'https':
        raise DiscoveryError("HTTPS required in production")
    if is_production and parsed.hostname in ['localhost', '127.0.0.1', '::1']:
        raise DiscoveryError("localhost not allowed in production")
    if not parsed.scheme or not parsed.netloc:
        raise DiscoveryError("Invalid URL format")
    return True
Is this overkill, or necessary? What validation do you want?
5b) URL Normalization: Should I normalize URLs before comparing?
def normalize_url(url: str) -> str:
    # Add trailing slash?
    # Convert to lowercase?
    # Remove default ports?
    # Sort query params?
The current code does:
# auth_external.py line 96
token_me = token_info["me"].rstrip("/")
expected_me = admin_me.rstrip("/")
Should endpoint URLs also be normalized? Or left as-is?
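One possible normalization, limited to differences that should never change identity (scheme/host case, default ports, empty path); whether we want to go even this far is the question above:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Sketch: normalize scheme/host case, drop default ports, ensure a path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    is_default = (scheme == "https" and port == 443) or (scheme == "http" and port == 80)
    netloc = host if (port is None or is_default) else f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))
```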
5c) Relative URL Edge Cases: What should happen with these?
<!-- Relative path -->
<link rel="token_endpoint" href="/auth/token">
Result: https://admin.example.com/auth/token
<!-- Protocol-relative -->
<link rel="token_endpoint" href="//other-domain.com/token">
Result: https://other-domain.com/token (if profile was HTTPS)
<!-- No protocol -->
<link rel="token_endpoint" href="other-domain.com/token">
Result: https://admin.example.com/other-domain.com/token (broken!)
Python's urljoin() handles first two correctly. Third is ambiguous. Should I:
- Reject URLs without :// or a leading /?
- Try to detect and fix common mistakes?
- Document expected format and let it fail?
Question 6: Error Handling and Retry Logic
My Questions:
6a) Discovery Failures: When endpoint discovery fails, what should happen?
Scenarios:
- Profile URL unreachable (DNS failure, network timeout)
- Profile URL returns 404/500
- Profile HTML malformed (parsing fails)
- No endpoints found in profile
- Endpoints found but invalid URLs
For each scenario, should I:
- Return error immediately?
- Retry with backoff?
- Use cached endpoints if available (even if expired)?
- Fail open (allow access) or fail closed (deny access)?
Recommendation: Fail closed (deny access), use cached endpoints if available, no retries for discovery (but retries for token verification?). Confirm?
6b) Token Verification Failures: When token verification fails, what should happen?
Scenarios:
- Token endpoint unreachable (timeout)
- Token endpoint returns 400/401/403 (token invalid)
- Token endpoint returns 500 (server error)
- Token response missing required fields
- Token 'me' doesn't match expected
For scenarios 1 and 3 (network/server errors), should I:
- Retry with backoff?
- Use cached token info if available?
- Fail immediately?
Recommendation: Retry up to 3 times with exponential backoff for network and server errors (scenarios 1 and 3). For invalid tokens (scenarios 2, 4, and 5), fail immediately. Confirm?
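A sketch of that policy; the exception names and the transient/permanent split are illustrative, not final:

```python
import time

import httpx


class TransientVerificationError(Exception):
    """Placeholder for server-side errors worth retrying."""


def verify_with_retries(token_endpoint: str, token: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            response = httpx.get(
                token_endpoint,
                headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
                timeout=3.0,
            )
            if response.status_code >= 500:
                # Scenario 3: server error, treat like a transient failure
                raise TransientVerificationError(f"token endpoint returned {response.status_code}")
            if response.status_code >= 400:
                # Scenarios 2/4/5: invalid token, fail immediately (placeholder exception)
                raise ValueError("token rejected by the token endpoint")
            return response.json()
        except (httpx.TransportError, TransientVerificationError):
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # backoff: 1s, then 2s
```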
6c) Timeout Configuration: What timeouts should I use?
Suggested:
- Profile URL fetch: 5s (discovery is cached, so can be slow)
- Token verification: 3s (happens on every request, must be fast)
- Cache lookup: <1ms (in-memory)
Are these acceptable? Should they be configurable?
Question 7: Testing Strategy
My Questions:
7a) Mock vs Real: Should tests:
- Mock all HTTP requests (faster, isolated)
- Hit real IndieAuth providers (slow, integration test)
- Both (unit tests mock, integration tests real)?
Recommendation: Unit tests mock everything, add one integration test for real IndieAuth.com. Confirm?
7b) Test Fixtures: Should I create test fixtures like:
# tests/fixtures/profiles.py
PROFILE_WITH_LINK_HEADERS = {
    'url': 'https://user.example.com/',
    'headers': {
        'Link': '<https://auth.example.com/token>; rel="token_endpoint"'
    },
    'expected': {'token_endpoint': 'https://auth.example.com/token'}
}

PROFILE_WITH_HTML_LINKS = {
    'url': 'https://user.example.com/',
    'html': '<link rel="token_endpoint" href="https://auth.example.com/token">',
    'expected': {'token_endpoint': 'https://auth.example.com/token'}
}

# ... more fixtures
Or inline test data in test functions? Fixtures would be reusable across tests.
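For comparison, a sketch of what one mocked unit test could look like; the module path starpunk.endpoint_discovery and the discover_endpoints function are placeholders for whatever we end up building:

```python
from unittest.mock import MagicMock, patch

# Hypothetical import: the module planned for Phase 1
from starpunk.endpoint_discovery import discover_endpoints


def test_discovery_from_html_link():
    fake_response = MagicMock(
        status_code=200,
        headers={},
        links={},
        text='<link rel="token_endpoint" href="https://auth.example.com/token">',
    )
    with patch("starpunk.endpoint_discovery.httpx.get", return_value=fake_response):
        endpoints = discover_endpoints("https://user.example.com/")
    assert endpoints["token_endpoint"] == "https://auth.example.com/token"
```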
7c) Test Coverage: What coverage % is acceptable? Current test suite has 501 passing tests. I should aim for:
- 100% coverage of new endpoint discovery code?
- Edge cases covered (malformed HTML, network errors, etc.)?
- Integration tests for full flow?
Question 8: Performance Implications
My Questions:
8a) First Request Latency: Without cached endpoints, first Micropub request will:
- Fetch profile URL (HTTP GET): ~100-500ms
- Parse HTML/headers: ~10-50ms
- Verify token with endpoint: ~100-300ms
- Total: ~200-850ms
Is this acceptable? User will notice delay on first post. Should I:
- Pre-warm cache on application startup?
- Show "Authenticating..." message to user?
- Accept the delay (only happens once per TTL)?
8b) Cache Hit Rate: With TTL of 3600s for endpoints and 300s for tokens:
- Endpoints discovered once per hour
- Tokens verified every 5 minutes
For active user posting frequently:
- First post: 850ms (discovery + verification)
- Posts within 5 min: <1ms (cached token)
- Posts after 5 min but within 1 hour: ~150ms (cached endpoint, verify token)
- Posts after 1 hour: 850ms again
Is this acceptable? Or should I increase token cache TTL?
8c) Concurrent Requests: If two Micropub requests arrive simultaneously with uncached token:
- Both will trigger endpoint discovery
- Race condition in cache update
Should I:
- Add locking around cache updates?
- Accept duplicate discoveries (harmless, just wasteful)?
- Use thread-safe cache implementation?
Recommendation: For V1 single-user CMS with low traffic, accept duplicates. Add locking in V2+ if needed.
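For the record, the V2+ locking variant would only be a few lines; a sketch, assuming the cache and discover_endpoints sketches above:

```python
import threading

_discovery_lock = threading.Lock()


def get_endpoints_cached(cache, profile_url: str) -> dict:
    endpoints = cache.get(profile_url)
    if endpoints is not None:
        return endpoints
    with _discovery_lock:
        # Re-check inside the lock: another request may have populated the cache
        endpoints = cache.get(profile_url)
        if endpoints is None:
            endpoints = discover_endpoints(profile_url)  # from the earlier sketch
            cache.set(profile_url, endpoints)
    return endpoints
```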
Question 9: Configuration and Deployment
My Questions:
9a) Configuration Changes: Current config has:
# .env (WRONG - to be removed)
TOKEN_ENDPOINT=https://tokens.indieauth.com/token
# .env (CORRECT - to be kept)
ADMIN_ME=https://admin.example.com/
Should I:
- Remove TOKEN_ENDPOINT from config.py immediately?
- Add deprecation warning if TOKEN_ENDPOINT is set?
- Provide migration instructions in CHANGELOG?
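If the deprecation-warning option is preferred, it could be as small as this (a sketch; it assumes config values still come from environment variables as today):

```python
import os
import warnings

if os.environ.get("TOKEN_ENDPOINT"):
    warnings.warn(
        "TOKEN_ENDPOINT is no longer used; StarPunk discovers endpoints from "
        "ADMIN_ME. Remove TOKEN_ENDPOINT from your .env.",
        DeprecationWarning,
    )
```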
9b) Backward Compatibility: RC.4 was just released with TOKEN_ENDPOINT configuration. RC.5 will remove it. Should I:
- Provide migration script?
- Automatic migration (detect and convert)?
- Just document breaking change in CHANGELOG?
Since we're in RC phase, breaking changes are acceptable, but users might be testing. Recommendation?
9c) Health Check: Should the /health endpoint also check:
- Endpoint discovery working (fetch ADMIN_ME profile)?
- Token endpoint reachable?
Or is this too expensive for health checks?
Question 10: Development and Testing Workflow
My Questions:
10a) Local Development: Developers typically use http://localhost:5000 for SITE_URL. But IndieAuth requires HTTPS. How should developers test?
Options:
- Allow HTTP in development mode (detect DEV_MODE=true)
- Require ngrok/localhost.run for HTTPS tunneling
- Use mock endpoints in dev mode
- Accept that IndieAuth won't work locally without setup
Current auth_external.py doesn't have HTTPS check. Should I add it with dev mode exception?
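A sketch of such a check; the DEV_MODE flag is one of the options listed above and is an assumption, not an existing config key:

```python
from urllib.parse import urlparse

from flask import current_app


def require_https(url: str) -> None:
    """Sketch: reject non-HTTPS URLs unless running in development mode."""
    if current_app.config.get("DEV_MODE"):
        return  # allow http://localhost:5000 during local development
    if urlparse(url).scheme != "https":
        raise ValueError(f"HTTPS is required outside development mode: {url}")
```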
10b) Testing with Real Providers: To test against real IndieAuth providers, I need:
- A real profile URL with IndieAuth links
- Valid tokens from that provider
Should I:
- Create test profile for integration tests?
- Document how developers can test?
- Skip real provider tests in CI (only run locally)?
Implementation Readiness Assessment
What's Clear and Ready to Implement
✅ HTTP Link Header Parsing: Clear algorithm, standard format
✅ HTML Link Element Extraction: Clear approach with BeautifulSoup4
✅ URL Resolution: Standard urljoin() from urllib.parse
✅ Basic Caching: In-memory dict with TTL expiry
✅ Token Verification HTTP Request: Standard GET with Bearer token
✅ Response Validation: Check for required fields (me, client_id, scope)
What Needs Architect Clarification
⚠️ Critical (blocks implementation):
- Q1: Which endpoint to verify tokens with (the "chicken-and-egg" problem)
- Q2a: Cache structure for single-user vs future multi-user
- Q3a: Add BeautifulSoup4 dependency?
⚠️ Important (affects quality):
- Q5a: URL validation requirements
- Q6a: Error handling strategy (fail open vs closed)
- Q6b: Retry logic for network failures
- Q9a: Remove TOKEN_ENDPOINT config or deprecate?
⚠️ Nice to have (can implement sensibly):
- Q2c: Cache invalidation triggers
- Q7a: Test strategy (mock vs real)
- Q8a: First request latency acceptable?
Proposed Implementation Plan
Once questions are answered, here's my implementation approach:
Phase 1: Core Discovery (Days 1-2)
- Create endpoint_discovery.py module
  - EndpointDiscovery class
  - HTTP Link header parsing
  - HTML link element extraction
  - URL resolution and validation
  - Error handling
- Unit tests for discovery
  - Test Link header parsing
  - Test HTML parsing
  - Test URL resolution
  - Test error cases
Phase 2: Token Verification Update (Day 3)
- Update auth_external.py
  - Integrate endpoint discovery
  - Add caching layer
  - Update verify_external_token()
  - Remove hardcoded TOKEN_ENDPOINT usage
- Unit tests for updated verification
  - Test with discovered endpoints
  - Test caching behavior
  - Test error handling
Phase 3: Integration and Testing (Day 4)
- Integration tests
  - Full Micropub request flow
  - Cache behavior across requests
  - Error scenarios
- Update existing tests
  - Fix any broken tests
  - Update mocks to use discovery
Phase 4: Configuration and Documentation (Day 5)
- Update configuration
  - Remove TOKEN_ENDPOINT from config.py
  - Add deprecation warning if still set
  - Update .env.example
- Update documentation
  - CHANGELOG entry for rc.5
  - Migration guide if needed
  - API documentation
Phase 5: Manual Testing and Refinement (Day 6)
- Test with real IndieAuth provider
- Performance testing (cache effectiveness)
- Error handling verification
- Final refinements
Estimated Total Time: 5-7 days
Dependencies to Add
Based on migration guide, I'll need to add:
# pyproject.toml or requirements.txt
beautifulsoup4>=4.12.0 # HTML parsing for link extraction
httpx is already a dependency (used in current auth_external.py).
Risks and Concerns
Risk 1: Breaking Change Timing
- Issue: RC.4 just shipped with TOKEN_ENDPOINT config
- Impact: Users testing RC.4 will need to reconfigure for RC.5
- Mitigation: Clear migration notes in CHANGELOG, consider grace period
Risk 2: Performance Degradation
- Issue: First request will be slower (800ms vs <100ms cached)
- Impact: User experience on first post after restart/cache expiry
- Mitigation: Document expected behavior, consider pre-warming cache
Risk 3: External Dependency
- Issue: StarPunk now depends on external profile URL availability
- Impact: If profile URL is down, Micropub stops working
- Mitigation: Cache endpoints for longer TTL, fail gracefully with clear errors
Risk 4: Testing Complexity
- Issue: More moving parts to test (HTTP, HTML parsing, caching)
- Impact: More test code, more mocking, more edge cases
- Mitigation: Good test fixtures, clear test organization
Recommended Next Steps
- Architect reviews this report and answers questions
- I create test fixtures based on ADR examples
- I implement Phase 1 (core discovery) with tests
- Checkpoint review - verify discovery working correctly
- I implement Phase 2 (integration with token verification)
- Checkpoint review - verify end-to-end flow
- I implement Phase 3-5 (tests, config, docs)
- Final review before merge
Questions Summary (Quick Reference)
Critical (must answer before coding):
- Q1: Which endpoint to verify tokens with? Proposed: Use ADMIN_ME profile for single-user StarPunk
- Q2a: Cache structure for single-user vs multi-user?
- Q3a: Add BeautifulSoup4 dependency?
Important (affects implementation quality):
4. Q5a: URL validation requirements?
5. Q6a: Error handling strategy (fail open/closed)?
6. Q6b: Retry logic for network failures?
7. Q9a: Remove or deprecate TOKEN_ENDPOINT config?
Can implement sensibly (but prefer guidance):
8. Q2c: Cache invalidation triggers?
9. Q7a: Test strategy (mock vs real)?
10. Q8a: First request latency acceptable?
Conclusion
The architect's corrected design is sound and properly implements IndieAuth endpoint discovery per the W3C specification. The primary blocker is clarifying the "which endpoint?" question for token verification in a single-user CMS context.
My proposed solution (always use ADMIN_ME profile for endpoint discovery) seems correct for StarPunk's single-user model, but I need architect confirmation before proceeding.
Once questions are answered, I'm ready to implement with high confidence. The code will be clean, tested, and follow the specifications exactly.
Status: ⏸️ Waiting for Architect Review
Document Version: 1.0
Created: 2025-11-24
Author: StarPunk Fullstack Developer
Next Review: After architect responds to questions