Files
Gondulf/docs/designs/client-id-validation-compliance.md
Phil Skentelbery 526a21d3fb fix(validation): implement W3C IndieAuth compliant client_id validation
Implements complete W3C IndieAuth Section 3.2 client identifier
validation including:
- Fragment rejection
- HTTP scheme support for localhost/loopback only
- Username/password component rejection
- Non-loopback IP address rejection
- Path traversal prevention (.. and . segments)
- Hostname case normalization
- Default port removal (80/443)
- Path component enforcement

All 75 validation tests passing with 99% coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:14:55 -07:00

536 lines
18 KiB
Markdown

# Client ID Validation Compliance
## Purpose
This design addresses critical non-compliance issues in Gondulf's client_id validation that violate the W3C IndieAuth specification Section 3.2. These issues must be fixed before v1.0.0 release to ensure any compliant IndieAuth client can successfully authenticate.
## CLARIFICATIONS (2025-11-24)
Based on Developer questions, the following clarifications have been added:
1. **IPv6 Bracket Handling**: Python's `urlparse` returns `hostname` WITHOUT brackets for IPv6 addresses. The brackets are only in `netloc`. Therefore, the check should be against '::1' without brackets.
2. **Normalization of IPv6 with Port**: When reconstructing URLs with IPv6 addresses and ports, brackets MUST be added back (e.g., `[::1]:8080`).
3. **Empty Path Normalization**: Confirmed - `https://example.com` should normalize to `https://example.com/` (with trailing slash).
4. **Validation Rule Ordering**: Implementation should follow the logical flow shown in the example implementation (lines 87-138), not the numbered list order. The try/except for URL parsing serves as the "Basic URL Structure" check.
5. **Endpoint Updates**: These are SEPARATE tasks and should NOT be implemented as part of the validation.py update task.
6. **Test File Location**: Tests should go in the existing `/home/phil/Projects/Gondulf/tests/unit/test_validation.py` file.
7. **Import Location**: The `ipaddress` import should be at module level (Python convention), not inside the function.
## Specification References
- **Primary**: [W3C IndieAuth Section 3.2 - Client Identifier](https://www.w3.org/TR/indieauth/#client-identifier)
- **OAuth 2.0**: [RFC 6749 Section 2.2](https://datatracker.ietf.org/doc/html/rfc6749#section-2.2)
- **Reference Implementation**: IndieLogin.com `/app/Authenticate.php`
## Design Overview
Replace the current incomplete `normalize_client_id()` function with two distinct functions:
1. `validate_client_id()` - Validates client_id against all specification requirements
2. `normalize_client_id()` - Normalizes a valid client_id to canonical form
This separation ensures clear validation logic and proper error reporting while maintaining backward compatibility with existing code that expects normalization.
## Component Details
### New Function: validate_client_id()
**Location**: `/home/phil/Projects/Gondulf/src/gondulf/utils/validation.py`
**Purpose**: Validate a client_id URL against all W3C IndieAuth specification requirements.
**Function Signature**:
```python
def validate_client_id(client_id: str) -> tuple[bool, str]:
"""
Validate client_id against W3C IndieAuth specification Section 3.2.
Args:
client_id: The client identifier URL to validate
Returns:
Tuple of (is_valid, error_message)
- is_valid: True if client_id is valid, False otherwise
- error_message: Empty string if valid, specific error message if invalid
"""
```
**Validation Rules** (in order):
1. **Basic URL Structure**
- Must be a parseable URL with urlparse()
- Error: "client_id must be a valid URL"
2. **Scheme Validation**
- Must be 'https' OR 'http'
- Error: "client_id must use https or http scheme"
3. **HTTP Scheme Restriction**
- If scheme is 'http', hostname MUST be one of: 'localhost', '127.0.0.1', '::1' (note: hostname from urlparse has no brackets)
- Error: "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]"
4. **Fragment Rejection**
- Must NOT contain a fragment component (# part)
- Error: "client_id must not contain a fragment (#)"
5. **User Info Rejection**
- Must NOT contain username or password components
- Error: "client_id must not contain username or password"
6. **IP Address Validation**
- Check if hostname is an IP address using ipaddress.ip_address()
- If it's an IP:
- Must be loopback (127.0.0.1 or ::1)
- Error: "client_id must not use IP address (except 127.0.0.1 or [::1])"
- If not an IP (ValueError), it's a domain name (valid)
7. **Path Component Requirement**
- Path must exist (at minimum "/")
- If empty path, it's still valid (will be normalized to "/" later)
8. **Path Segment Validation**
- Split path by '/' and check segments
- Must NOT contain single dot ('.') as a complete segment
- Must NOT contain double dot ('..') as a complete segment
- Note: './file' or '../file' as part of a segment is allowed, only standalone '.' or '..' segments are rejected
- Error: "client_id must not contain single-dot (.) or double-dot (..) path segments"
**Implementation**:
```python
import ipaddress # At module level with other imports
def validate_client_id(client_id: str) -> tuple[bool, str]:
"""
Validate client_id against W3C IndieAuth specification Section 3.2.
Args:
client_id: The client identifier URL to validate
Returns:
Tuple of (is_valid, error_message)
"""
try:
parsed = urlparse(client_id)
# 1. Check scheme
if parsed.scheme not in ['https', 'http']:
return False, "client_id must use https or http scheme"
# 2. HTTP only for localhost/loopback
if parsed.scheme == 'http':
# Note: parsed.hostname returns '::1' without brackets for IPv6
if parsed.hostname not in ['localhost', '127.0.0.1', '::1']:
return False, "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]"
# 3. No fragments allowed
if parsed.fragment:
return False, "client_id must not contain a fragment (#)"
# 4. No username/password allowed
if parsed.username or parsed.password:
return False, "client_id must not contain username or password"
# 5. Check for non-loopback IP addresses
if parsed.hostname:
try:
# parsed.hostname already has no brackets for IPv6
ip = ipaddress.ip_address(parsed.hostname)
if not ip.is_loopback:
return False, f"client_id must not use IP address (except 127.0.0.1 or [::1])"
except ValueError:
# Not an IP address, it's a domain (valid)
pass
# 6. Check for . or .. path segments
if parsed.path:
segments = parsed.path.split('/')
for segment in segments:
if segment == '.' or segment == '..':
return False, "client_id must not contain single-dot (.) or double-dot (..) path segments"
return True, ""
except Exception as e:
return False, f"client_id must be a valid URL: {e}"
```
### Updated Function: normalize_client_id()
**Purpose**: Normalize a valid client_id to canonical form. Must validate first.
**Function Signature**:
```python
def normalize_client_id(client_id: str) -> str:
"""
Normalize client_id URL to canonical form per IndieAuth spec.
Normalization rules:
- Validate against specification first
- Convert hostname to lowercase
- Remove default ports (80 for http, 443 for https)
- Ensure path exists (default to "/" if empty)
- Preserve query string if present
- Never include fragments (already validated out)
Args:
client_id: Client ID URL to normalize
Returns:
Normalized client_id
Raises:
ValueError: If client_id is not valid per specification
"""
```
**Normalization Rules**:
1. **Validation First**
- Call validate_client_id()
- If invalid, raise ValueError with the error message
2. **Hostname Normalization**
- Convert hostname to lowercase
- Preserve IPv6 brackets if present
3. **Port Normalization**
- Remove port 80 for http URLs
- Remove port 443 for https URLs
- Preserve any other ports
4. **Path Normalization**
- If path is empty, set to "/"
- Do NOT remove trailing slashes (spec doesn't require this)
- Do NOT normalize . or .. (already validated out)
5. **Component Assembly**
- Reconstruct URL with normalized components
- Include query string if present
- Never include fragment (already validated out)
**Implementation**:
```python
def normalize_client_id(client_id: str) -> str:
"""
Normalize client_id URL to canonical form per IndieAuth spec.
Args:
client_id: Client ID URL to normalize
Returns:
Normalized client_id
Raises:
ValueError: If client_id is not valid per specification
"""
# First validate
is_valid, error = validate_client_id(client_id)
if not is_valid:
raise ValueError(error)
parsed = urlparse(client_id)
# Normalize hostname to lowercase
hostname = parsed.hostname.lower() if parsed.hostname else ''
# Determine if this is an IPv6 address (for bracket handling)
is_ipv6 = ':' in hostname # Simple check since hostname has no brackets
# Handle port normalization
port = parsed.port
if (parsed.scheme == 'http' and port == 80) or \
(parsed.scheme == 'https' and port == 443):
# Default port, omit it
if is_ipv6:
netloc = f"[{hostname}]" # IPv6 needs brackets in URL
else:
netloc = hostname
elif port:
# Non-default port, include it
if is_ipv6:
netloc = f"[{hostname}]:{port}" # IPv6 with port needs brackets
else:
netloc = f"{hostname}:{port}"
else:
# No port
if is_ipv6:
netloc = f"[{hostname}]" # IPv6 needs brackets in URL
else:
netloc = hostname
# Ensure path exists
path = parsed.path if parsed.path else '/'
# Reconstruct URL
normalized = f"{parsed.scheme}://{netloc}{path}"
# Add query if present
if parsed.query:
normalized += f"?{parsed.query}"
# Never add fragment (validated out)
return normalized
```
### Authorization Endpoint Updates (SEPARATE TASK)
**NOTE**: This is a SEPARATE task and should NOT be implemented as part of the validation.py update task.
**Location**: `/home/phil/Projects/Gondulf/src/gondulf/endpoints/authorization.py`
When this separate task is implemented, update the authorization endpoint to use the new validation:
```python
# In the authorize() function, when validating client_id:
# Validate and normalize client_id
is_valid, error = validate_client_id(client_id)
if not is_valid:
# Return error to client
return authorization_error_response(
redirect_uri=redirect_uri,
error="invalid_request",
error_description=f"Invalid client_id: {error}",
state=state
)
# Normalize for consistent storage/comparison
try:
normalized_client_id = normalize_client_id(client_id)
except ValueError as e:
# This shouldn't happen if validate_client_id passed, but handle it
return authorization_error_response(
redirect_uri=redirect_uri,
error="invalid_request",
error_description=str(e),
state=state
)
```
### Token Endpoint Updates (SEPARATE TASK)
**NOTE**: This is a SEPARATE task and should NOT be implemented as part of the validation.py update task.
**Location**: `/home/phil/Projects/Gondulf/src/gondulf/endpoints/token.py`
When this separate task is implemented, update token endpoint validation similarly:
```python
# In the token() function, when validating client_id:
# Validate and normalize client_id
is_valid, error = validate_client_id(client_id)
if not is_valid:
return JSONResponse(
status_code=400,
content={
"error": "invalid_client",
"error_description": f"Invalid client_id: {error}"
}
)
# Normalize for comparison with stored value
normalized_client_id = normalize_client_id(client_id)
```
## Data Models
No database schema changes required. The validation happens at the API layer before storage.
## API Contracts
### Error Responses
When client_id validation fails, return appropriate OAuth 2.0 error responses:
**Authorization Endpoint** (if redirect_uri is valid):
```
HTTP/1.1 302 Found
Location: {redirect_uri}?error=invalid_request&error_description=Invalid+client_id%3A+{specific_error}&state={state}
```
**Authorization Endpoint** (if redirect_uri is also invalid):
```
HTTP/1.1 400 Bad Request
Content-Type: text/html
<html>
<body>
<h1>Invalid Request</h1>
<p>Invalid client_id: {specific_error}</p>
</body>
</html>
```
**Token Endpoint**:
```
HTTP/1.1 400 Bad Request
Content-Type: application/json
{
"error": "invalid_client",
"error_description": "Invalid client_id: {specific_error}"
}
```
## Error Handling
### Validation Error Messages
Each validation rule has a specific, user-friendly error message:
| Validation Rule | Error Message |
|-----------------|---------------|
| Invalid URL | "client_id must be a valid URL: {parse_error}" |
| Wrong scheme | "client_id must use https or http scheme" |
| HTTP not localhost | "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]" |
| Has fragment | "client_id must not contain a fragment (#)" |
| Has credentials | "client_id must not contain username or password" |
| Non-loopback IP | "client_id must not use IP address (except 127.0.0.1 or [::1])" |
| Path traversal | "client_id must not contain single-dot (.) or double-dot (..) path segments" |
### Exception Handling
- `validate_client_id()` never raises exceptions, returns (False, error_message)
- `normalize_client_id()` raises ValueError if validation fails
- URL parsing exceptions are caught and converted to validation errors
## Security Considerations
### Fragment Rejection
Fragments in client_ids could cause confusion about the actual client identity. By rejecting them, we ensure clear client identification.
### Credential Rejection
Username/password in URLs could leak into logs or be displayed to users. Rejecting them prevents credential exposure.
### IP Address Restriction
Allowing arbitrary IP addresses could bypass domain-based security controls. Only loopback addresses are permitted for local development.
### Path Traversal Prevention
Single-dot and double-dot segments could potentially be used for path traversal attacks or cause confusion about the client's identity.
### HTTP Localhost Support
HTTP is only allowed for localhost/loopback addresses to support local development while maintaining security in production.
## Testing Strategy
### Unit Tests Required
Create comprehensive tests in `/home/phil/Projects/Gondulf/tests/unit/test_validation.py`:
#### Valid Client IDs
```python
valid_client_ids = [
"https://example.com",
"https://example.com/",
"https://example.com/app",
"https://example.com/app/client",
"https://example.com?foo=bar",
"https://example.com/app?foo=bar&baz=qux",
"https://sub.example.com",
"https://example.com:8080",
"https://example.com:8080/app",
"http://localhost",
"http://localhost:3000",
"http://127.0.0.1",
"http://127.0.0.1:8080",
"http://[::1]",
"http://[::1]:8080",
]
```
#### Invalid Client IDs
```python
invalid_client_ids = [
("ftp://example.com", "must use https or http scheme"),
("https://example.com#fragment", "must not contain a fragment"),
("https://user:pass@example.com", "must not contain username or password"),
("https://example.com/./invalid", "must not contain single-dot"),
("https://example.com/../invalid", "must not contain double-dot"),
("http://example.com", "http scheme is only allowed for localhost"),
("https://192.168.1.1", "must not use IP address"),
("https://10.0.0.1", "must not use IP address"),
("https://[2001:db8::1]", "must not use IP address"),
("not-a-url", "must be a valid URL"),
("", "must be a valid URL"),
]
```
#### Normalization Tests
```python
normalization_cases = [
("HTTPS://EXAMPLE.COM", "https://example.com/"),
("https://example.com", "https://example.com/"),
("https://example.com:443", "https://example.com/"),
("http://localhost:80", "http://localhost/"),
("https://EXAMPLE.COM:443/app", "https://example.com/app"),
("https://Example.Com/APP", "https://example.com/APP"), # Path case preserved
("https://example.com?foo=bar", "https://example.com/?foo=bar"),
]
```
### Integration Tests
1. Test authorization endpoint with various client_ids
2. Test token endpoint with various client_ids
3. Test that normalized client_ids match correctly between endpoints
4. Test error responses for invalid client_ids
### Security Tests
1. Test that fragments are always rejected
2. Test that credentials are always rejected
3. Test that non-loopback IPs are rejected
4. Test that path traversal segments are rejected
5. Test that HTTP is only allowed for localhost
## Acceptance Criteria
1. ✅ All valid client_ids per W3C specification are accepted
2. ✅ All invalid client_ids per W3C specification are rejected with specific error messages
3. ✅ HTTP scheme is accepted for localhost, 127.0.0.1, and [::1]
4. ✅ HTTPS scheme is accepted for all valid domain names
5. ✅ Fragments are always rejected
6. ✅ Username/password components are always rejected
7. ✅ Non-loopback IP addresses are rejected
8. ✅ Single-dot and double-dot path segments are rejected
9. ✅ Hostnames are normalized to lowercase
10. ✅ Default ports (80 for HTTP, 443 for HTTPS) are removed
11. ✅ Empty paths are normalized to "/"
12. ✅ Query strings are preserved
13. ✅ Authorization endpoint uses new validation
14. ✅ Token endpoint uses new validation
15. ✅ All tests pass with 100% coverage of validation logic
16. ✅ Error messages are specific and helpful
## Implementation Order
### Current Task (validation.py update):
1. Implement `validate_client_id()` function in validation.py
2. Update `normalize_client_id()` to use validation in validation.py
3. Write comprehensive unit tests in tests/unit/test_validation.py
### Separate Future Tasks:
4. Update authorization endpoint (SEPARATE TASK)
5. Update token endpoint (SEPARATE TASK)
6. Write integration tests (SEPARATE TASK)
7. Test with real IndieAuth clients (SEPARATE TASK)
## Migration Notes
- No database migration needed
- Existing stored client_ids remain valid (they were normalized on storage)
- New validation is stricter but backward compatible with valid client_ids
## References
- [W3C IndieAuth Section 3.2](https://www.w3.org/TR/indieauth/#client-identifier)
- [RFC 3986 - URI Generic Syntax](https://datatracker.ietf.org/doc/html/rfc3986)
- [OAuth 2.0 RFC 6749](https://datatracker.ietf.org/doc/html/rfc6749)
- [IndieLogin Implementation](https://github.com/aaronpk/indielogin.com)