StarPunk/docs/reports/indieauth-removal-analysis.md
Phil Skentelbery a3bac86647 feat: Complete IndieAuth server removal (Phases 2-4)
Completed all remaining phases of ADR-030 IndieAuth provider removal.
StarPunk no longer acts as an authorization server - all IndieAuth
operations delegated to external providers.

Phase 2 - Remove Token Issuance:
- Deleted /auth/token endpoint
- Removed token_endpoint() function from routes/auth.py
- Deleted tests/test_routes_token.py

Phase 3 - Remove Token Storage:
- Deleted starpunk/tokens.py module entirely
- Created migration 004 to drop tokens and authorization_codes tables
- Deleted tests/test_tokens.py
- Removed all internal token CRUD operations

Phase 4 - External Token Verification:
- Created starpunk/auth_external.py module
- Implemented verify_external_token() for external IndieAuth providers
- Updated Micropub endpoint to use external verification
- Added TOKEN_ENDPOINT configuration
- Updated all Micropub tests to mock external verification
- HTTP timeout protection (5s) for external requests

Additional Changes:
- Created migration 003 to remove code_verifier from auth_state
- Fixed 5 migration tests that referenced obsolete code_verifier column
- Updated 11 Micropub tests for external verification
- Fixed test fixture and app context issues
- All 501 tests passing

Breaking Changes:
- Micropub clients must use external IndieAuth providers
- TOKEN_ENDPOINT configuration now required
- Existing internal tokens invalid (tables dropped)

Migration Impact:
- Simpler codebase: -500 lines of code
- Fewer database tables: -2 tables (tokens, authorization_codes)
- More secure: External providers handle token security
- More maintainable: Less authentication code to maintain

Standards Compliance:
- W3C IndieAuth specification
- OAuth 2.0 Bearer token authentication
- IndieWeb principle: delegate to external services

Related:
- ADR-030: IndieAuth Provider Removal Strategy
- ADR-050: Remove Custom IndieAuth Server
- Migration 003: Remove code_verifier from auth_state
- Migration 004: Drop tokens and authorization_codes tables

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 17:23:46 -07:00


IndieAuth Removal Implementation Analysis

Date: 2025-11-24
Developer: Fullstack Developer Agent
Status: Pre-Implementation Review

Executive Summary

I have thoroughly reviewed the architect's plan to remove the custom IndieAuth authorization server from StarPunk. This document presents my understanding, identifies concerns, and lists questions that need clarification before implementation begins.

What I Understand

Current Architecture

The system currently implements BOTH roles:

  1. Authorization Server (to be removed):

    • /auth/authorization endpoint with consent UI
    • /auth/token endpoint for token issuance
    • starpunk/tokens.py module (~413 lines)
    • PKCE implementation in starpunk/auth.py
    • Two database tables: authorization_codes and tokens
    • Migration 002 that creates these tables
  2. Resource Server (to be kept and modified):

    • /micropub endpoint
    • Admin authentication via IndieLogin.com
    • Session management
    • Token verification (currently local, will become external)

Proposed Changes

  • Remove 500+ lines of authorization server code
  • Delete 2 database tables
  • Replace local token verification with external API calls
  • Add token caching (5-minute TTL) for performance
  • Update HTML discovery headers
  • Bump version from 0.4.0 → 0.5.0

Implementation Phases

The plan breaks the work into 5 phases over 3 days:

  1. Remove authorization endpoint (Day 1)
  2. Remove token issuance (Day 1)
  3. Database schema simplification (Day 2)
  4. External token verification (Day 2)
  5. Documentation and discovery (Day 3)

Critical Questions for the Architect

1. Admin Authentication Clarification

Question: How exactly does admin authentication work after removal?

Context: I see two authentication flows in the current code:

  • Admin login: Uses IndieLogin.com → creates session cookie
  • Micropub auth: Uses local tokens → will use external verification

The plan says "admin login still works" but I need to confirm:

  • Does admin login continue using IndieLogin.com ONLY for session creation?
  • The admin never needs Micropub tokens for the web UI, correct?
  • Sessions are completely separate from Micropub tokens?

Why this matters: I need to ensure Phases 1-2 don't break admin access.

2. Token Verification Implementation Details

Question: What exactly should the external token verification return?

Current local implementation (starpunk/tokens.py:116-164):

def verify_token(token: str) -> Optional[Dict[str, Any]]:
    # Returns: {me, client_id, scope}
    # Updates last_used_at timestamp

Proposed external implementation (ADR-050 lines 156-191):

def verify_token(bearer_token: str) -> Optional[Dict[str, Any]]:
    response = httpx.get(
        token_endpoint,
        headers={'Authorization': f'Bearer {bearer_token}'}
    )
    # Returns response.json()

Concerns:

  • Does tokens.indieauth.com return the same fields (me, client_id, scope)?
  • What if the endpoint returns different field names?
  • How do we handle token endpoint errors vs invalid tokens?
  • Should we distinguish between "token invalid" and "endpoint unreachable"?

Request: Provide exact expected response format from tokens.indieauth.com or document what fields we should expect.
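
Until the response format is confirmed, one defensive option is to normalize whatever the endpoint returns into the shape the Micropub code already expects. The following is only a sketch: it assumes the external response uses the same three field names as the local implementation (me, client_id, scope), which is exactly the open question above, and normalize_token_response is a hypothetical helper name, not something taken from the plan.

```python
from typing import Any, Dict, Optional


def normalize_token_response(data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return {me, client_id, scope}, or None if the response is unusable.

    Assumption: the external endpoint uses the field names "me",
    "client_id" and "scope". This has not been confirmed yet.
    """
    me = data.get("me")
    if not isinstance(me, str) or not me:
        # Without "me" we cannot tell whose token this is.
        return None

    scope = data.get("scope", "")
    if not isinstance(scope, str):
        return None

    return {
        "me": me,
        "client_id": data.get("client_id"),  # treated as optional in this sketch
        "scope": scope,
    }
```

If the real endpoint turns out to use different field names, only this one function would need to change.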

3. Scope Validation Strategy

Question: Where does scope validation happen after removal?

Current flow:

  1. Client requests scope during authorization
  2. We validate scope → only "create" supported
  3. We store validated scope in authorization code
  4. We issue token with validated scope
  5. Micropub endpoint checks token has "create" scope

After removal:

  • External provider issues tokens with scopes
  • What if external provider issues a token with unsupported scopes?
  • Should we validate scope is "create" in our verify_token()?
  • Or trust the external provider completely?

From ADR-050 lines 180-185:

# Check scope
if 'create' not in data.get('scope', ''):
    return None

This suggests we validate, but I want to confirm this is the right approach.
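
One detail worth raising about the ADR-050 check: `'create' not in data.get('scope', '')` is a substring test on the raw string, so a scope value such as "recreate" would also pass. OAuth 2.0 scopes are space-separated, so a stricter check would split first. A minimal sketch, with has_scope as a hypothetical helper name:

```python
def has_scope(scope_value: str, required: str = "create") -> bool:
    """Check for an exact scope match in a space-separated scope string."""
    # split() with no arguments handles repeated or leading/trailing whitespace.
    return required in scope_value.split()
```

With this helper, `has_scope("create update")` is true while `has_scope("recreate")` is false, whereas the substring test accepts both.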

4. Migration Backwards Compatibility

Question: What happens to existing StarPunk installations?

Scenario 1: Fresh install after 0.5.0

  • No problem - migration 002 never runs
  • But wait... other code might expect migration 002 to exist?

Scenario 2: Existing 0.4.0 installation upgrading to 0.5.0

  • Has migration 002 already run
  • Has tokens and authorization_codes tables
  • May have active tokens in database

The plan says (indieauth-removal-phases.md lines 168-189):

-- 003_remove_indieauth_tables.sql
DROP TABLE IF EXISTS tokens CASCADE;
DROP TABLE IF EXISTS authorization_codes CASCADE;

Concerns:

  • Should we archive migration 002 or delete it?
  • If we delete it, fresh installs won't have the migration number continuity

  • If we archive it, where? The plan shows /migrations/archive/
  • Do we need a "down migration" for rollback?

Request: Clarify migration strategy:

  • Keep 002 but add 003 that drops tables? (staged approach)
  • Delete 002 and renumber everything? (breaking approach)
  • Archive 002 to different directory? (git history approach)

5. Token Caching Security

Question: Is in-memory token caching secure?

Proposed cache (indieauth-removal-phases.md lines 266-280):

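import hashlib
from time import time

_token_cache = {}  # {token_hash: (data, expiry)}

def cache_token(token: str, data: dict, ttl: int = 300):
    # Key the cache by the SHA-256 hash so the raw token is never stored
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    _token_cache[token_hash] = (data, time() + ttl)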

Concerns:

  1. Cache invalidation: If a token is revoked externally, we'll continue accepting it for up to 5 minutes
  2. Memory growth: No cache cleanup of expired entries - they just accumulate
  3. Multi-process: If running with multiple workers (gunicorn/uwsgi), each process has separate cache
  4. Token exposure: Are we caching the full token or just the hash?

Questions:

  • Is 5-minute window for revocation acceptable?
  • Should we implement cache cleanup (LRU or TTL-based)?
  • Should we document that caching makes revocation non-immediate?
  • For production, should we recommend Redis instead?

The plan shows we cache the hash, not the token, which is good. But should we document the revocation delay?

6. Error Handling and User Experience

Question: How should we handle external endpoint failures?

Scenarios:

  1. tokens.indieauth.com is down (network error)
  2. tokens.indieauth.com returns 500 (server error)
  3. tokens.indieauth.com returns 429 (rate limit)
  4. Token is invalid (returns 401/404)
  5. Request times out (> 5 seconds)

Current plan (indieauth-removal-plan.md lines 169-173):

if response.status_code != 200:
    return None

This treats ALL failures the same: the user gets a "forbidden" error regardless of the cause.

Questions:

  • Should we differentiate between "invalid token" and "verification service down"?
  • Should we fail open (allow request) or fail closed (deny request) on timeout?
  • Should we log different error types differently?
  • Should we have a fallback mechanism?

Recommendation: Return different error messages:

  • 401/404 from endpoint → "Invalid or expired token"
  • Network/timeout error → "Authentication service temporarily unavailable"
  • This gives users better feedback
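
The recommendation above could be sketched as a small classification step between the HTTP call and the Micropub response. The names here (VerificationOutcome, classify_verification, USER_MESSAGES) are hypothetical, and the sketch assumes we fail closed: anything other than a 200 denies the request, and only the message differs. Whether to fail open or closed is still an open question for the architect.

```python
from enum import Enum
from typing import Optional


class VerificationOutcome(Enum):
    VALID = "valid"
    INVALID_TOKEN = "invalid_token"
    SERVICE_UNAVAILABLE = "service_unavailable"


# Messages taken from the recommendation above.
USER_MESSAGES = {
    VerificationOutcome.INVALID_TOKEN: "Invalid or expired token",
    VerificationOutcome.SERVICE_UNAVAILABLE:
        "Authentication service temporarily unavailable",
}


def classify_verification(status_code: Optional[int]) -> VerificationOutcome:
    """Map the token endpoint result to an outcome.

    status_code is None when the request never completed
    (network error or timeout).
    """
    if status_code is None:
        return VerificationOutcome.SERVICE_UNAVAILABLE
    if status_code == 200:
        return VerificationOutcome.VALID
    if status_code in (401, 404):
        return VerificationOutcome.INVALID_TOKEN
    # 429, 5xx, and anything unexpected: the service is the problem, not the token.
    return VerificationOutcome.SERVICE_UNAVAILABLE
```

Keeping the outcome separate from the HTTP details also makes it easy to log the two failure types at different levels.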

7. Configuration Changes

Question: Should TOKEN_ENDPOINT be configurable or hardcoded?

Current plan:

TOKEN_ENDPOINT = os.getenv('TOKEN_ENDPOINT', 'https://tokens.indieauth.com/token')

Questions:

  • Is there ever a reason to use a different token endpoint?
  • Should we support per-user token endpoints (discovery from user's domain)?
  • Or should we hardcode tokens.indieauth.com and simplify?

From the HTML discovery (simplified-auth-architecture.md lines 193-211):

<link rel="token_endpoint" href="{{ config.TOKEN_ENDPOINT }}">

This advertises OUR token endpoint to clients. But we're using an external one. Should this link point to:

  • tokens.indieauth.com (external provider)?
  • Or should we remove this link entirely since we're not issuing tokens?

This seems like a spec compliance issue that needs clarification.
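
If TOKEN_ENDPOINT stays configurable, it may be worth validating it at startup rather than discovering a bad value on the first Micropub request. A rough sketch follows; load_token_endpoint is a hypothetical helper, and requiring https is my assumption rather than something the plan states.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

DEFAULT_TOKEN_ENDPOINT = "https://tokens.indieauth.com/token"


def load_token_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Read TOKEN_ENDPOINT from the environment and sanity-check it."""
    env = os.environ if env is None else env
    url = env.get("TOKEN_ENDPOINT", DEFAULT_TOKEN_ENDPOINT)
    parsed = urlparse(url)
    # Assumption: bearer tokens should only ever be sent over HTTPS.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"TOKEN_ENDPOINT must be an https URL, got: {url!r}")
    return url
```

Passing the environment as a parameter keeps the function easy to test without touching os.environ.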

8. Testing Strategy

Question: How do we test external token verification?

Proposed test (indieauth-removal-phases.md lines 332-348):

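from unittest.mock import patch

@patch('starpunk.micropub.httpx.get')
def test_external_token_verification(mock_get):
    # The excerpt used mock_response without defining it
    mock_response = mock_get.return_value
    mock_response.status_code = 200
    mock_response.json.return_value = {
        'me': 'https://example.com',
        'scope': 'create update'
    }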

Concerns:

  1. All tests will be mocked - we never test real integration
  2. If tokens.indieauth.com changes response format, we won't know
  3. We're mocking at the wrong level (httpx) - should mock at verify_token level?

Questions:

  • Should we have integration tests with real tokens.indieauth.com?
  • Should we test in CI with actual test tokens?
  • How do we get test tokens for CI? Manual process?
  • Should we implement a "test mode" that uses mock verification?

Recommendation: Create integration test suite that:

  1. Uses real tokens.indieauth.com in CI
  2. Requires CI environment variable with test token
  3. Skips integration tests in local development
  4. Keeps unit tests mocked as planned
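
The integration suite recommended above could be gated on an environment variable so it runs in CI and is skipped locally. This is a sketch using only the standard library; STARPUNK_TEST_TOKEN is a hypothetical variable name, and the asserted fields assume the response format that still needs to be confirmed.

```python
import json
import os
import unittest
import urllib.request

TOKEN_ENDPOINT = os.getenv("TOKEN_ENDPOINT", "https://tokens.indieauth.com/token")
TEST_TOKEN = os.getenv("STARPUNK_TEST_TOKEN")  # hypothetical CI secret


@unittest.skipUnless(TEST_TOKEN, "STARPUNK_TEST_TOKEN not set; skipping integration test")
class ExternalVerificationIntegrationTest(unittest.TestCase):
    def test_real_endpoint_returns_expected_fields(self):
        request = urllib.request.Request(
            TOKEN_ENDPOINT,
            headers={
                "Authorization": f"Bearer {TEST_TOKEN}",
                "Accept": "application/json",  # ask for JSON explicitly
            },
        )
        # Same 5-second budget as the planned production timeout.
        with urllib.request.urlopen(request, timeout=5) as response:
            data = json.loads(response.read().decode("utf-8"))

        # Assumed fields; adjust once the real format is confirmed.
        self.assertIn("me", data)
        self.assertIn("scope", data)
```

The mocked unit tests stay as planned; this class only exists to catch a change in the real endpoint's behavior.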

9. Rollback Procedure

Question: What's the actual rollback procedure?

The plan mentions (ADR-050 lines 224-240):

git revert HEAD~5..HEAD
pg_dump restoration

Concerns:

  1. This assumes PostgreSQL but StarPunk uses SQLite
  2. HEAD~5 is fragile - depends on exactly 5 commits
  3. No clear step-by-step rollback instructions
  4. What if we're in the middle of Phase 3?

Questions:

  • Should we create backup before starting?
  • Should each phase be a separate commit for easier rollback?
  • How do we handle database rollback with SQLite?
  • Should we test the rollback procedure before starting?

Request: Create clear rollback procedure for each phase.
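
Since StarPunk uses SQLite, the database half of a rollback could be as simple as taking a file-level backup before Phase 3 and copying it back if needed. A minimal sketch using the standard library's sqlite3 backup API; backup_database and the paths passed to it are hypothetical.

```python
import sqlite3


def backup_database(db_path: str, backup_path: str) -> None:
    """Copy a SQLite database to backup_path using the online backup API."""
    source = sqlite3.connect(db_path)
    try:
        target = sqlite3.connect(backup_path)
        try:
            # Copies every page of the source database into the target.
            source.backup(target)
        finally:
            target.close()
    finally:
        source.close()
```

Restoring would then mean stopping the application and replacing the live database file with the backup, paired with reverting the code to the matching commit.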

10. Performance Impact

Question: What's the expected performance impact?

Current: Local token verification

  • Database query: ~1-5ms
  • No network calls

Proposed: External verification

  • HTTP request to tokens.indieauth.com: 200-500ms
  • Cached requests: <1ms (cache hit)

Concerns:

  1. First request to Micropub will be 200-500ms slower
  2. If cache is cold, every request is 200-500ms slower
  3. What if user makes batch requests (multiple posts)?
  4. Does this make the UI feel slow?

Questions:

  • Is 200-500ms acceptable for Micropub clients?
  • Should we pre-warm the cache somehow?
  • Should cache TTL be configurable?
  • Should we implement request coalescing (multiple concurrent verifications for same token)?

Note: The plan mentions 90% cache hit rate, but this assumes:

  • Clients reuse tokens across requests
  • Multiple requests within 5-minute window
  • Single-process deployment

With multiple gunicorn workers, cache hit rate will be lower.
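
To put numbers on this, the average verification latency is a weighted mix of hits and misses. The helper below is a back-of-the-envelope sketch; it takes 1ms as an upper bound for a cache hit and uses the 200-500ms range quoted above for a miss.

```python
def expected_latency_ms(hit_rate: float, miss_ms: float, hit_ms: float = 1.0) -> float:
    """Average verification latency for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
```

At the plan's 90% hit rate this gives roughly 21ms to 51ms on average. At a 50% hit rate, which is plausible if requests are spread across several workers, it rises to roughly 100ms to 250ms.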

11. Database Schema Question

Question: Why does migration 003 update schema_version?

From indieauth-removal-plan.md lines 246-248:

UPDATE schema_version SET version = 3 WHERE id = 1;

But I don't see a schema_version table in the current migrations.

Questions:

  • Does this table exist?
  • Is this part of a migration tracking system?
  • Should migration 003 check for this table first?
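
If the answer is that the table may or may not exist, the migration runner could check before touching it. A small sketch against SQLite's catalog table; table_exists is a hypothetical helper and not part of the current migration code.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if a table with this name exists in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None
```

The UPDATE on schema_version would then only run when `table_exists(conn, "schema_version")` is true.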

12. HTML Discovery Headers

Question: What should the HTML discovery headers be?

Current (implied by removal):

<link rel="authorization_endpoint" href="/auth/authorization">
<link rel="token_endpoint" href="/auth/token">

Proposed (simplified-auth-architecture.md lines 207-210):

<link rel="authorization_endpoint" href="https://indieauth.com/auth">
<link rel="token_endpoint" href="https://tokens.indieauth.com/token">
<link rel="micropub" href="https://starpunk.example.com/micropub">

Questions:

  1. Should these be in base.html (every page) or just the homepage?
  2. Are we advertising that WE use indieauth.com, or that CLIENTS should?
  3. Shouldn't these come from the user's own domain (ADMIN_ME)?
  4. What if the user wants to use a different provider?

My understanding from IndieAuth spec:

  • These links tell clients WHERE to authenticate
  • They should point to the provider the USER wants to use
  • Not the provider StarPunk uses internally

This seems like it might be architecturally wrong. Need clarification.

Risks Identified

High-Risk Areas

  1. Breaking Admin Access (Phase 1-2)

    • Risk: Accidentally remove code needed for admin login
    • Mitigation: Test admin login after each commit
    • Severity: Critical (blocks all access)
  2. Data Loss (Phase 3)

    • Risk: Drop tables with no backup
    • Mitigation: Backup database before migration
    • Severity: High (no recovery path)
  3. External Dependency (Phase 4)

    • Risk: tokens.indieauth.com becomes required for operation
    • Mitigation: Good error handling, caching
    • Severity: High (service becomes unusable)
  4. Token Format Mismatch (Phase 4)

    • Risk: External endpoint returns different format than expected
    • Mitigation: Thorough testing, error handling
    • Severity: High (all Micropub requests fail)

Medium-Risk Areas

  1. Cache Memory Leak (Phase 4)

    • Risk: Token cache grows unbounded
    • Mitigation: Implement cache cleanup
    • Severity: Medium (performance degradation)
  2. Multi-Worker Cache Misses (Phase 4)

    • Risk: Poor cache hit rate with multiple processes
    • Mitigation: Document limitation, consider Redis
    • Severity: Medium (performance impact)
  3. Migration Continuity (Phase 3)

    • Risk: Migration numbering confusion
    • Mitigation: Clear documentation
    • Severity: Low (documentation issue)

Recommendations

Before Starting Implementation

  1. Create Integration Test Suite

    • Get test token from tokens.indieauth.com
    • Write tests that verify actual response format
    • Ensure we handle all error cases
  2. Document Rollback Procedure

    • Create step-by-step rollback for each phase
    • Test rollback procedure before starting
    • Create database backup script
  3. Clarify Architecture Questions

    • Resolve HTML discovery header confusion
    • Confirm token verification response format
    • Define error handling strategy
  4. Implement Cache Cleanup

    • Add LRU or TTL-based cache eviction
    • Add cache size limit
    • Add monitoring/logging

During Implementation

  1. One Phase at a Time

    • Complete each phase fully before moving to next
    • Test thoroughly after each phase
    • Create checkpoint commits for rollback
  2. Comprehensive Testing

    • Test admin login after Phases 1-2
    • Test database migration on test database first
    • Test external verification with real tokens
  3. Monitor Performance

    • Log token verification times
    • Monitor cache hit rates
    • Check for memory leaks

After Implementation

  1. Production Migration Guide

    • Document exact upgrade steps
    • Include backup procedures
    • Provide user communication template
  2. Performance Monitoring

    • Track external API latency
    • Monitor cache effectiveness
    • Alert on verification failures
  3. User Documentation

    • Update README with new setup instructions
    • Create troubleshooting guide
    • Document rollback procedure

Questions Summary

Here are all my questions organized by priority:

Must Answer Before Implementation

  1. What is the exact response format from tokens.indieauth.com?
  2. How should HTML discovery headers work (user's domain vs our provider)?
  3. What's the migration strategy (keep 002, delete 002, or archive)?
  4. How should we differentiate between token invalid vs service down?
  5. Is 5-minute revocation delay acceptable?

Should Answer Before Implementation

  1. Should we implement cache cleanup or just document the issue?
  2. Should we have integration tests with real tokens?
  3. What's the detailed rollback procedure for each phase?
  4. Should TOKEN_ENDPOINT be configurable or hardcoded?
  5. Does schema_version table exist?

Nice to Answer

  1. Should we support multiple providers?
  2. Should we implement request coalescing for concurrent verifications?
  3. Should cache TTL be configurable?

My Recommendation to Proceed

I recommend we get answers to the "Must Answer" questions before implementing. The plan is solid overall, but these architectural details will affect how we implement Phase 4 (external verification), which is the core of this change.

Once we have clarity on:

  1. External endpoint response format
  2. HTML discovery strategy
  3. Migration approach
  4. Error handling strategy

...then I can implement confidently following the phased approach.

The plan is well-structured and thoughtfully designed. I appreciate the clear separation of phases and the detailed acceptance criteria. My questions are primarily about clarifying implementation details and edge cases.


Ready to implement: No
Blocking issues: 5 architectural questions
Estimated time after clarification: 2-3 days per plan