# 0009. Phase 3 Token Endpoint Implementation Clarifications
Date: 2025-11-20

## Status

Accepted
## Context

The Developer has reviewed the Phase 3 Token Endpoint design and identified eight clarification questions that require architectural decisions. These range from critical (CodeStore value-type compatibility) to minor (logging levels), but all require clear answers before implementation can proceed.

## Decision

We make the following architectural decisions for Phase 3 implementation:

### 1. Authorization Code Storage Format (CRITICAL)

**Decision**: Modify CodeStore to accept dict values directly, with JSON serialization handled internally.

**Implementation**:

```python
import json
import time
from typing import Optional, Union

# In CodeStore class
def store(self, key: str, value: Union[str, dict], ttl: int = 600) -> None:
    """Store key-value pair with TTL. Value can be a string or a dict."""
    if isinstance(value, dict):
        value_to_store = json.dumps(value)
    else:
        value_to_store = value

    expiry = time.time() + ttl
    self._data[key] = {
        'value': value_to_store,
        'expires': expiry
    }

def get(self, key: str) -> Optional[Union[str, dict]]:
    """Get value by key. Returns a dict if the stored value is JSON, a string otherwise."""
    if key not in self._data:
        return None

    entry = self._data[key]
    if time.time() > entry['expires']:
        del self._data[key]
        return None

    value = entry['value']
    # Try to parse as JSON; fall back to the raw string
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        return value
```

**Rationale**: This is the simplest approach that maintains backward compatibility with Phase 1 (string values) while supporting Phase 2/3 needs (dict metadata). The CodeStore handles serialization internally, keeping the interface clean.

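A quick round-trip check of the behavior above, using a minimal in-memory stand-in for CodeStore (the class is condensed from the sketch above for illustration, not the production implementation):

```python
import json
import time
from typing import Optional, Union

class CodeStore:
    """Minimal in-memory stand-in matching the interface above."""

    def __init__(self) -> None:
        self._data: dict = {}

    def store(self, key: str, value: Union[str, dict], ttl: int = 600) -> None:
        value_to_store = json.dumps(value) if isinstance(value, dict) else value
        self._data[key] = {'value': value_to_store, 'expires': time.time() + ttl}

    def get(self, key: str) -> Optional[Union[str, dict]]:
        entry = self._data.get(key)
        if entry is None or time.time() > entry['expires']:
            self._data.pop(key, None)
            return None
        try:
            return json.loads(entry['value'])
        except (json.JSONDecodeError, TypeError):
            return entry['value']

store = CodeStore()
store.store("code123", {"client_id": "https://app.example/", "used": False})
store.store("legacy", "plain-string")  # Phase 1 callers still work
print(store.get("code123"))  # dict round-trips
print(store.get("legacy"))   # string passes through unchanged
```

One caveat worth noting: a Phase 1 string that happens to be valid JSON (e.g. `"123"`) would be deserialized on read. This is acceptable here because Phase 1 values are opaque random codes, not JSON-shaped text.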
### 2. Authorization Code Single-Use Marking

**Decision**: Simplify to an atomic check-and-delete operation. Do NOT mark-then-delete.

**Implementation**:

```python
# In token endpoint handler

# STEP 5: Check if code already used
if metadata.get('used'):
    logger.error(f"Authorization code replay detected: {code[:8]}...")
    raise HTTPException(
        400,
        {"error": "invalid_grant", "error_description": "Authorization code has already been used"}
    )

# STEP 6-8: Extract user data, validate PKCE if needed, generate token...

# STEP 9: Delete authorization code immediately after successful token generation
code_storage.delete(code)
logger.info(f"Authorization code exchanged and deleted: {code[:8]}...")
```

**Rationale**: The simpler approach avoids the race-condition complexity of calculating the remaining TTL and re-storing. Since we control both the authorization and token endpoints, we can ensure codes are generated with the 'used' field set to False initially, then simply delete them after use.

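The PKCE validation in STEP 7 is elided above. A minimal S256 check per RFC 7636 could look like this (the function name is illustrative; the test vector is from RFC 7636 Appendix B):

```python
import base64
import hashlib
import hmac

def verify_pkce_s256(code_verifier: str, code_challenge: str) -> bool:
    """Return True if BASE64URL(SHA256(verifier)) matches the stored challenge."""
    digest = hashlib.sha256(code_verifier.encode('ascii')).digest()
    computed = base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')
    # Constant-time comparison, consistent with the security goals above
    return hmac.compare_digest(computed, code_challenge)

# RFC 7636 Appendix B test vector
verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
challenge = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"
print(verify_pkce_s256(verifier, challenge))  # True
```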
### 3. Token Endpoint Error Response Format

**Decision**: FastAPI automatically handles dict detail correctly for JSON responses. No custom handler needed.

**Verification**: FastAPI's HTTPException with a dict detail automatically:

- Sets Content-Type: application/json
- Serializes the dict to JSON
- Returns a proper OAuth error response

**Additional Headers**: Add the OAuth-required cache headers explicitly:

```python
from fastapi import Response

@router.post("/token")
async def token_exchange(response: Response, ...):
    # Add OAuth cache headers
    response.headers["Cache-Control"] = "no-store"
    response.headers["Pragma"] = "no-cache"

    # ... rest of implementation
```

**Rationale**: Use FastAPI's built-in capabilities. Explicit headers ensure OAuth compliance.

### 4. Phase 2/3 Authorization Code Structure

**Decision**: Phase 2 must include the PKCE fields with default values. Phase 3 does NOT need to handle missing keys.

**Phase 2 Authorization Code Structure** (UPDATE REQUIRED):

```python
import time

# Phase 2 authorization endpoint must store:
code_data = {
    'client_id': client_id,
    'redirect_uri': redirect_uri,
    'state': state,
    'me': verified_email,  # or domain
    'scope': scope,
    'code_challenge': code_challenge or "",                # Empty string if not provided
    'code_challenge_method': code_challenge_method or "",  # Empty string if not provided
    'created_at': int(time.time()),
    'expires_at': int(time.time() + 600),
    'used': False  # Always False when created
}
```

**Rationale**: Consistency is more important than backward compatibility within a single version. Since we're building v1.0.0, all components should use the same data structure.

### 5. Database Connection Pattern

**Decision**: The Phase 1 database connection context manager does NOT auto-commit. An explicit commit is required.

**Confirmation from the Phase 1 implementation**:

```python
# Phase 1 uses the SQLite connection directly
with self.database.get_connection() as conn:
    conn.execute(query, params)
    conn.commit()  # Explicit commit required
```

**Rationale**: Explicit commits give us transaction control and match SQLite's default behavior.

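The explicit-commit requirement can be demonstrated directly with sqlite3: closing a connection without committing discards the pending insert (the table here is a trimmed version of the tokens schema from decision 6):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE tokens (token_hash TEXT PRIMARY KEY)")
conn.commit()

# Insert WITHOUT committing, then close the connection
conn.execute("INSERT INTO tokens VALUES ('abc')")
conn.close()  # close() rolls back the open implicit transaction

check = sqlite3.connect(path)
count = check.execute("SELECT COUNT(*) FROM tokens").fetchone()[0]
print(count)  # 0 -- the uncommitted insert was lost
```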
### 6. Token Hash Collision Handling

**Decision**: Do NOT handle UNIQUE constraint violations. Let them fail catastrophically.

**Implementation**:

```python
import hashlib
import secrets
import time

def generate_token(self, me: str, client_id: str, scope: str = "") -> str:
    # Generate token (2^256 entropy)
    token = secrets.token_urlsafe(self.token_length)
    token_hash = hashlib.sha256(token.encode('utf-8')).hexdigest()

    issued_at = int(time.time())
    expires_at = issued_at + self.token_ttl  # token_ttl assumed configured on the service

    # Store in database - if this fails, let it propagate
    with self.database.get_connection() as conn:
        conn.execute(
            """INSERT INTO tokens (token_hash, me, client_id, scope, issued_at, expires_at, revoked)
               VALUES (?, ?, ?, ?, ?, ?, 0)""",
            (token_hash, me, client_id, scope, issued_at, expires_at)
        )
        conn.commit()

    return token
```

**Rationale**: With 2^256 possible values, a collision is so astronomically unlikely that if it occurs, it indicates a fundamental problem (bad RNG, cosmic rays, etc.). Retrying won't help. The UNIQUE constraint violation will be logged as an ERROR and returned to the client as a 500, which is appropriate for this "impossible" scenario.

### 7. Logging Token Validation

**Decision**: Use the Developer's suggested logging levels:

- DEBUG for successful validations (high volume, not interesting)
- INFO for token generation (important events)
- WARNING for validation failures (potential attacks or misconfiguration)

**Implementation**:

```python
# In validate_token
if valid:
    logger.debug(f"Token validated successfully (me: {token_data['me']})")
else:
    logger.warning(f"Token validation failed: {reason}")

# In generate_token
logger.info(f"Token generated for {me} (client: {client_id})")
```

**Rationale**: This provides appropriate visibility without flooding logs during normal operation.

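For reference, a self-contained sketch of what the lookup side of validate_token might look like against the tokens table (the function shape, column order, and demo setup are assumptions; the real service wraps this in TokenService):

```python
import hashlib
import sqlite3
import time

def validate_token(conn: sqlite3.Connection, token: str):
    """Hash the presented token and look it up; no plaintext ever touches the DB."""
    token_hash = hashlib.sha256(token.encode('utf-8')).hexdigest()
    row = conn.execute(
        "SELECT me, client_id, scope, expires_at, revoked FROM tokens WHERE token_hash = ?",
        (token_hash,),
    ).fetchone()
    if row is None:
        return None  # unknown token
    me, client_id, scope, expires_at, revoked = row
    if revoked or time.time() > expires_at:
        return None  # revoked or expired
    return {'me': me, 'client_id': client_id, 'scope': scope}

# Demo setup against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tokens (token_hash TEXT PRIMARY KEY, me TEXT,
                client_id TEXT, scope TEXT, issued_at INTEGER,
                expires_at INTEGER, revoked INTEGER)""")
now = int(time.time())
token = "example-token"
conn.execute("INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?, 0)",
             (hashlib.sha256(token.encode()).hexdigest(),
              "user@example.com", "https://app.example/", "profile", now, now + 3600))
conn.commit()
print(validate_token(conn, token))        # token metadata dict
print(validate_token(conn, "bad-token"))  # None
```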
### 8. Token Cleanup Configuration

**Decision**: Implement as a utility method only for v1.0.0. No automatic calling.

**Implementation**:

```python
# In TokenService
def cleanup_expired_tokens(self) -> int:
    """Delete expired tokens. Call manually or via cron/scheduled task."""
    # Implementation as designed

# Not called automatically in v1.0.0
# Future v1.1.0 can add a background task if needed
```

**Configuration**: Keep TOKEN_CLEANUP_ENABLED and TOKEN_CLEANUP_INTERVAL in the config for future use, but don't act on them in v1.0.0.

**Rationale**: Simplicity for v1.0.0. At small scale (tens of users), manual or cron-based cleanup is sufficient. Automatic background tasks add complexity we don't need yet.

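One plausible shape for the utility, standalone here for illustration (the real method lives on TokenService and uses self.database.get_connection(); the DELETE statement is an assumption about "as designed"):

```python
import sqlite3
import time

def cleanup_expired_tokens(conn: sqlite3.Connection) -> int:
    """Delete expired tokens; returns the number of rows removed."""
    cur = conn.execute("DELETE FROM tokens WHERE expires_at < ?", (int(time.time()),))
    conn.commit()
    return cur.rowcount

# Demo against an in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (token_hash TEXT PRIMARY KEY, expires_at INTEGER)")
now = int(time.time())
conn.execute("INSERT INTO tokens VALUES ('stale', ?)", (now - 10,))
conn.execute("INSERT INTO tokens VALUES ('fresh', ?)", (now + 3600,))
conn.commit()
print(cleanup_expired_tokens(conn))  # 1 -- only the stale token is removed
```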
## Consequences

### Positive

- All decisions prioritize simplicity over complexity
- No unnecessary defensive programming for "impossible" scenarios
- Clear, consistent data structures across phases
- Minimal changes to existing Phase 1/2 code
- Appropriate logging levels for operational visibility

### Negative

- Phase 2 needs a minor update to include the PKCE fields and 'used' flag
- No automatic token cleanup in v1.0.0 (acceptable for small scale)
- Token hash collisions cause hard failures (acceptable given the probability)

### Technical Debt Created

- TOKEN_CLEANUP automation deferred to v1.1.0
- CodeStore dict handling could be more elegant (but works fine)

## Implementation Actions Required

1. **Update Phase 2** authorization endpoint to include all fields in code metadata (code_challenge, code_challenge_method, used)
2. **Modify CodeStore** in Phase 1 to handle dict values with JSON serialization
3. **Implement Phase 3** with these clarifications
4. **Document** the manual token cleanup process for operators

## Sign-off

**Architect**: Claude (Architect Agent)
**Date**: 2025-11-20
**Status**: Approved for implementation