# Client ID Validation Compliance ## Purpose This design addresses critical non-compliance issues in Gondulf's client_id validation that violate the W3C IndieAuth specification Section 3.2. These issues must be fixed before v1.0.0 release to ensure any compliant IndieAuth client can successfully authenticate. ## CLARIFICATIONS (2025-11-24) Based on Developer questions, the following clarifications have been added: 1. **IPv6 Bracket Handling**: Python's `urlparse` returns `hostname` WITHOUT brackets for IPv6 addresses. The brackets are only in `netloc`. Therefore, the check should be against '::1' without brackets. 2. **Normalization of IPv6 with Port**: When reconstructing URLs with IPv6 addresses and ports, brackets MUST be added back (e.g., `[::1]:8080`). 3. **Empty Path Normalization**: Confirmed - `https://example.com` should normalize to `https://example.com/` (with trailing slash). 4. **Validation Rule Ordering**: Implementation should follow the logical flow shown in the example implementation (lines 87-138), not the numbered list order. The try/except for URL parsing serves as the "Basic URL Structure" check. 5. **Endpoint Updates**: These are SEPARATE tasks and should NOT be implemented as part of the validation.py update task. 6. **Test File Location**: Tests should go in the existing `/home/phil/Projects/Gondulf/tests/unit/test_validation.py` file. 7. **Import Location**: The `ipaddress` import should be at module level (Python convention), not inside the function. ## Specification References - **Primary**: [W3C IndieAuth Section 3.2 - Client Identifier](https://www.w3.org/TR/indieauth/#client-identifier) - **OAuth 2.0**: [RFC 6749 Section 2.2](https://datatracker.ietf.org/doc/html/rfc6749#section-2.2) - **Reference Implementation**: IndieLogin.com `/app/Authenticate.php` ## Design Overview Replace the current incomplete `normalize_client_id()` function with two distinct functions: 1. `validate_client_id()` - Validates client_id against all specification requirements 2. `normalize_client_id()` - Normalizes a valid client_id to canonical form This separation ensures clear validation logic and proper error reporting while maintaining backward compatibility with existing code that expects normalization. ## Component Details ### New Function: validate_client_id() **Location**: `/home/phil/Projects/Gondulf/src/gondulf/utils/validation.py` **Purpose**: Validate a client_id URL against all W3C IndieAuth specification requirements. **Function Signature**: ```python def validate_client_id(client_id: str) -> tuple[bool, str]: """ Validate client_id against W3C IndieAuth specification Section 3.2. Args: client_id: The client identifier URL to validate Returns: Tuple of (is_valid, error_message) - is_valid: True if client_id is valid, False otherwise - error_message: Empty string if valid, specific error message if invalid """ ``` **Validation Rules** (in order): 1. **Basic URL Structure** - Must be a parseable URL with urlparse() - Error: "client_id must be a valid URL" 2. **Scheme Validation** - Must be 'https' OR 'http' - Error: "client_id must use https or http scheme" 3. **HTTP Scheme Restriction** - If scheme is 'http', hostname MUST be one of: 'localhost', '127.0.0.1', '::1' (note: hostname from urlparse has no brackets) - Error: "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]" 4. **Fragment Rejection** - Must NOT contain a fragment component (# part) - Error: "client_id must not contain a fragment (#)" 5. **User Info Rejection** - Must NOT contain username or password components - Error: "client_id must not contain username or password" 6. **IP Address Validation** - Check if hostname is an IP address using ipaddress.ip_address() - If it's an IP: - Must be loopback (127.0.0.1 or ::1) - Error: "client_id must not use IP address (except 127.0.0.1 or [::1])" - If not an IP (ValueError), it's a domain name (valid) 7. **Path Component Requirement** - Path must exist (at minimum "/") - If empty path, it's still valid (will be normalized to "/" later) 8. **Path Segment Validation** - Split path by '/' and check segments - Must NOT contain single dot ('.') as a complete segment - Must NOT contain double dot ('..') as a complete segment - Note: './file' or '../file' as part of a segment is allowed, only standalone '.' or '..' segments are rejected - Error: "client_id must not contain single-dot (.) or double-dot (..) path segments" **Implementation**: ```python import ipaddress # At module level with other imports def validate_client_id(client_id: str) -> tuple[bool, str]: """ Validate client_id against W3C IndieAuth specification Section 3.2. Args: client_id: The client identifier URL to validate Returns: Tuple of (is_valid, error_message) """ try: parsed = urlparse(client_id) # 1. Check scheme if parsed.scheme not in ['https', 'http']: return False, "client_id must use https or http scheme" # 2. HTTP only for localhost/loopback if parsed.scheme == 'http': # Note: parsed.hostname returns '::1' without brackets for IPv6 if parsed.hostname not in ['localhost', '127.0.0.1', '::1']: return False, "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]" # 3. No fragments allowed if parsed.fragment: return False, "client_id must not contain a fragment (#)" # 4. No username/password allowed if parsed.username or parsed.password: return False, "client_id must not contain username or password" # 5. Check for non-loopback IP addresses if parsed.hostname: try: # parsed.hostname already has no brackets for IPv6 ip = ipaddress.ip_address(parsed.hostname) if not ip.is_loopback: return False, f"client_id must not use IP address (except 127.0.0.1 or [::1])" except ValueError: # Not an IP address, it's a domain (valid) pass # 6. Check for . or .. path segments if parsed.path: segments = parsed.path.split('/') for segment in segments: if segment == '.' or segment == '..': return False, "client_id must not contain single-dot (.) or double-dot (..) path segments" return True, "" except Exception as e: return False, f"client_id must be a valid URL: {e}" ``` ### Updated Function: normalize_client_id() **Purpose**: Normalize a valid client_id to canonical form. Must validate first. **Function Signature**: ```python def normalize_client_id(client_id: str) -> str: """ Normalize client_id URL to canonical form per IndieAuth spec. Normalization rules: - Validate against specification first - Convert hostname to lowercase - Remove default ports (80 for http, 443 for https) - Ensure path exists (default to "/" if empty) - Preserve query string if present - Never include fragments (already validated out) Args: client_id: Client ID URL to normalize Returns: Normalized client_id Raises: ValueError: If client_id is not valid per specification """ ``` **Normalization Rules**: 1. **Validation First** - Call validate_client_id() - If invalid, raise ValueError with the error message 2. **Hostname Normalization** - Convert hostname to lowercase - Preserve IPv6 brackets if present 3. **Port Normalization** - Remove port 80 for http URLs - Remove port 443 for https URLs - Preserve any other ports 4. **Path Normalization** - If path is empty, set to "/" - Do NOT remove trailing slashes (spec doesn't require this) - Do NOT normalize . or .. (already validated out) 5. **Component Assembly** - Reconstruct URL with normalized components - Include query string if present - Never include fragment (already validated out) **Implementation**: ```python def normalize_client_id(client_id: str) -> str: """ Normalize client_id URL to canonical form per IndieAuth spec. Args: client_id: Client ID URL to normalize Returns: Normalized client_id Raises: ValueError: If client_id is not valid per specification """ # First validate is_valid, error = validate_client_id(client_id) if not is_valid: raise ValueError(error) parsed = urlparse(client_id) # Normalize hostname to lowercase hostname = parsed.hostname.lower() if parsed.hostname else '' # Determine if this is an IPv6 address (for bracket handling) is_ipv6 = ':' in hostname # Simple check since hostname has no brackets # Handle port normalization port = parsed.port if (parsed.scheme == 'http' and port == 80) or \ (parsed.scheme == 'https' and port == 443): # Default port, omit it if is_ipv6: netloc = f"[{hostname}]" # IPv6 needs brackets in URL else: netloc = hostname elif port: # Non-default port, include it if is_ipv6: netloc = f"[{hostname}]:{port}" # IPv6 with port needs brackets else: netloc = f"{hostname}:{port}" else: # No port if is_ipv6: netloc = f"[{hostname}]" # IPv6 needs brackets in URL else: netloc = hostname # Ensure path exists path = parsed.path if parsed.path else '/' # Reconstruct URL normalized = f"{parsed.scheme}://{netloc}{path}" # Add query if present if parsed.query: normalized += f"?{parsed.query}" # Never add fragment (validated out) return normalized ``` ### Authorization Endpoint Updates (SEPARATE TASK) **NOTE**: This is a SEPARATE task and should NOT be implemented as part of the validation.py update task. **Location**: `/home/phil/Projects/Gondulf/src/gondulf/endpoints/authorization.py` When this separate task is implemented, update the authorization endpoint to use the new validation: ```python # In the authorize() function, when validating client_id: # Validate and normalize client_id is_valid, error = validate_client_id(client_id) if not is_valid: # Return error to client return authorization_error_response( redirect_uri=redirect_uri, error="invalid_request", error_description=f"Invalid client_id: {error}", state=state ) # Normalize for consistent storage/comparison try: normalized_client_id = normalize_client_id(client_id) except ValueError as e: # This shouldn't happen if validate_client_id passed, but handle it return authorization_error_response( redirect_uri=redirect_uri, error="invalid_request", error_description=str(e), state=state ) ``` ### Token Endpoint Updates (SEPARATE TASK) **NOTE**: This is a SEPARATE task and should NOT be implemented as part of the validation.py update task. **Location**: `/home/phil/Projects/Gondulf/src/gondulf/endpoints/token.py` When this separate task is implemented, update token endpoint validation similarly: ```python # In the token() function, when validating client_id: # Validate and normalize client_id is_valid, error = validate_client_id(client_id) if not is_valid: return JSONResponse( status_code=400, content={ "error": "invalid_client", "error_description": f"Invalid client_id: {error}" } ) # Normalize for comparison with stored value normalized_client_id = normalize_client_id(client_id) ``` ## Data Models No database schema changes required. The validation happens at the API layer before storage. ## API Contracts ### Error Responses When client_id validation fails, return appropriate OAuth 2.0 error responses: **Authorization Endpoint** (if redirect_uri is valid): ``` HTTP/1.1 302 Found Location: {redirect_uri}?error=invalid_request&error_description=Invalid+client_id%3A+{specific_error}&state={state} ``` **Authorization Endpoint** (if redirect_uri is also invalid): ``` HTTP/1.1 400 Bad Request Content-Type: text/html

Invalid Request

Invalid client_id: {specific_error}

``` **Token Endpoint**: ``` HTTP/1.1 400 Bad Request Content-Type: application/json { "error": "invalid_client", "error_description": "Invalid client_id: {specific_error}" } ``` ## Error Handling ### Validation Error Messages Each validation rule has a specific, user-friendly error message: | Validation Rule | Error Message | |-----------------|---------------| | Invalid URL | "client_id must be a valid URL: {parse_error}" | | Wrong scheme | "client_id must use https or http scheme" | | HTTP not localhost | "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]" | | Has fragment | "client_id must not contain a fragment (#)" | | Has credentials | "client_id must not contain username or password" | | Non-loopback IP | "client_id must not use IP address (except 127.0.0.1 or [::1])" | | Path traversal | "client_id must not contain single-dot (.) or double-dot (..) path segments" | ### Exception Handling - `validate_client_id()` never raises exceptions, returns (False, error_message) - `normalize_client_id()` raises ValueError if validation fails - URL parsing exceptions are caught and converted to validation errors ## Security Considerations ### Fragment Rejection Fragments in client_ids could cause confusion about the actual client identity. By rejecting them, we ensure clear client identification. ### Credential Rejection Username/password in URLs could leak into logs or be displayed to users. Rejecting them prevents credential exposure. ### IP Address Restriction Allowing arbitrary IP addresses could bypass domain-based security controls. Only loopback addresses are permitted for local development. ### Path Traversal Prevention Single-dot and double-dot segments could potentially be used for path traversal attacks or cause confusion about the client's identity. ### HTTP Localhost Support HTTP is only allowed for localhost/loopback addresses to support local development while maintaining security in production. ## Testing Strategy ### Unit Tests Required Create comprehensive tests in `/home/phil/Projects/Gondulf/tests/unit/test_validation.py`: #### Valid Client IDs ```python valid_client_ids = [ "https://example.com", "https://example.com/", "https://example.com/app", "https://example.com/app/client", "https://example.com?foo=bar", "https://example.com/app?foo=bar&baz=qux", "https://sub.example.com", "https://example.com:8080", "https://example.com:8080/app", "http://localhost", "http://localhost:3000", "http://127.0.0.1", "http://127.0.0.1:8080", "http://[::1]", "http://[::1]:8080", ] ``` #### Invalid Client IDs ```python invalid_client_ids = [ ("ftp://example.com", "must use https or http scheme"), ("https://example.com#fragment", "must not contain a fragment"), ("https://user:pass@example.com", "must not contain username or password"), ("https://example.com/./invalid", "must not contain single-dot"), ("https://example.com/../invalid", "must not contain double-dot"), ("http://example.com", "http scheme is only allowed for localhost"), ("https://192.168.1.1", "must not use IP address"), ("https://10.0.0.1", "must not use IP address"), ("https://[2001:db8::1]", "must not use IP address"), ("not-a-url", "must be a valid URL"), ("", "must be a valid URL"), ] ``` #### Normalization Tests ```python normalization_cases = [ ("HTTPS://EXAMPLE.COM", "https://example.com/"), ("https://example.com", "https://example.com/"), ("https://example.com:443", "https://example.com/"), ("http://localhost:80", "http://localhost/"), ("https://EXAMPLE.COM:443/app", "https://example.com/app"), ("https://Example.Com/APP", "https://example.com/APP"), # Path case preserved ("https://example.com?foo=bar", "https://example.com/?foo=bar"), ] ``` ### Integration Tests 1. Test authorization endpoint with various client_ids 2. Test token endpoint with various client_ids 3. Test that normalized client_ids match correctly between endpoints 4. Test error responses for invalid client_ids ### Security Tests 1. Test that fragments are always rejected 2. Test that credentials are always rejected 3. Test that non-loopback IPs are rejected 4. Test that path traversal segments are rejected 5. Test that HTTP is only allowed for localhost ## Acceptance Criteria 1. ✅ All valid client_ids per W3C specification are accepted 2. ✅ All invalid client_ids per W3C specification are rejected with specific error messages 3. ✅ HTTP scheme is accepted for localhost, 127.0.0.1, and [::1] 4. ✅ HTTPS scheme is accepted for all valid domain names 5. ✅ Fragments are always rejected 6. ✅ Username/password components are always rejected 7. ✅ Non-loopback IP addresses are rejected 8. ✅ Single-dot and double-dot path segments are rejected 9. ✅ Hostnames are normalized to lowercase 10. ✅ Default ports (80 for HTTP, 443 for HTTPS) are removed 11. ✅ Empty paths are normalized to "/" 12. ✅ Query strings are preserved 13. ✅ Authorization endpoint uses new validation 14. ✅ Token endpoint uses new validation 15. ✅ All tests pass with 100% coverage of validation logic 16. ✅ Error messages are specific and helpful ## Implementation Order ### Current Task (validation.py update): 1. Implement `validate_client_id()` function in validation.py 2. Update `normalize_client_id()` to use validation in validation.py 3. Write comprehensive unit tests in tests/unit/test_validation.py ### Separate Future Tasks: 4. Update authorization endpoint (SEPARATE TASK) 5. Update token endpoint (SEPARATE TASK) 6. Write integration tests (SEPARATE TASK) 7. Test with real IndieAuth clients (SEPARATE TASK) ## Migration Notes - No database migration needed - Existing stored client_ids remain valid (they were normalized on storage) - New validation is stricter but backward compatible with valid client_ids ## References - [W3C IndieAuth Section 3.2](https://www.w3.org/TR/indieauth/#client-identifier) - [RFC 3986 - URI Generic Syntax](https://datatracker.ietf.org/doc/html/rfc3986) - [OAuth 2.0 RFC 6749](https://datatracker.ietf.org/doc/html/rfc6749) - [IndieLogin Implementation](https://github.com/aaronpk/indielogin.com)