Files
Gondulf/docs/designs/client-id-validation-compliance.md
Phil Skentelbery 526a21d3fb fix(validation): implement W3C IndieAuth compliant client_id validation
Implements complete W3C IndieAuth Section 3.2 client identifier
validation including:
- Fragment rejection
- HTTP scheme support for localhost/loopback only
- Username/password component rejection
- Non-loopback IP address rejection
- Path traversal prevention (.. and . segments)
- Hostname case normalization
- Default port removal (80/443)
- Path component enforcement

All 75 validation tests passing with 99% coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 18:14:55 -07:00

18 KiB

Client ID Validation Compliance

Purpose

This design addresses critical non-compliance issues in Gondulf's client_id validation that violate the W3C IndieAuth specification Section 3.2. These issues must be fixed before v1.0.0 release to ensure any compliant IndieAuth client can successfully authenticate.

CLARIFICATIONS (2025-11-24)

Based on Developer questions, the following clarifications have been added:

  1. IPv6 Bracket Handling: Python's urlparse returns hostname WITHOUT brackets for IPv6 addresses. The brackets are only in netloc. Therefore, the check should be against '::1' without brackets.

  2. Normalization of IPv6 with Port: When reconstructing URLs with IPv6 addresses and ports, brackets MUST be added back (e.g., [::1]:8080).

  3. Empty Path Normalization: Confirmed - https://example.com should normalize to https://example.com/ (with trailing slash).

  4. Validation Rule Ordering: Implementation should follow the logical flow shown in the example implementation (lines 87-138), not the numbered list order. The try/except for URL parsing serves as the "Basic URL Structure" check.

  5. Endpoint Updates: These are SEPARATE tasks and should NOT be implemented as part of the validation.py update task.

  6. Test File Location: Tests should go in the existing /home/phil/Projects/Gondulf/tests/unit/test_validation.py file.

  7. Import Location: The ipaddress import should be at module level (Python convention), not inside the function.

Specification References

Design Overview

Replace the current incomplete normalize_client_id() function with two distinct functions:

  1. validate_client_id() - Validates client_id against all specification requirements
  2. normalize_client_id() - Normalizes a valid client_id to canonical form

This separation ensures clear validation logic and proper error reporting while maintaining backward compatibility with existing code that expects normalization.

Component Details

New Function: validate_client_id()

Location: /home/phil/Projects/Gondulf/src/gondulf/utils/validation.py

Purpose: Validate a client_id URL against all W3C IndieAuth specification requirements.

Function Signature:

def validate_client_id(client_id: str) -> tuple[bool, str]:
    """
    Validate client_id against W3C IndieAuth specification Section 3.2.

    Args:
        client_id: The client identifier URL to validate

    Returns:
        Tuple of (is_valid, error_message)
        - is_valid: True if client_id is valid, False otherwise
        - error_message: Empty string if valid, specific error message if invalid
    """

Validation Rules (in order):

  1. Basic URL Structure

    • Must be a parseable URL with urlparse()
    • Error: "client_id must be a valid URL"
  2. Scheme Validation

    • Must be 'https' OR 'http'
    • Error: "client_id must use https or http scheme"
  3. HTTP Scheme Restriction

    • If scheme is 'http', hostname MUST be one of: 'localhost', '127.0.0.1', '::1' (note: hostname from urlparse has no brackets)
    • Error: "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]"
  4. Fragment Rejection

    • Must NOT contain a fragment component (# part)
    • Error: "client_id must not contain a fragment (#)"
  5. User Info Rejection

    • Must NOT contain username or password components
    • Error: "client_id must not contain username or password"
  6. IP Address Validation

    • Check if hostname is an IP address using ipaddress.ip_address()
    • If it's an IP:
      • Must be loopback (127.0.0.1 or ::1)
      • Error: "client_id must not use IP address (except 127.0.0.1 or [::1])"
    • If not an IP (ValueError), it's a domain name (valid)
  7. Path Component Requirement

    • Path must exist (at minimum "/")
    • If empty path, it's still valid (will be normalized to "/" later)
  8. Path Segment Validation

    • Split path by '/' and check segments
    • Must NOT contain single dot ('.') as a complete segment
    • Must NOT contain double dot ('..') as a complete segment
    • Note: './file' or '../file' as part of a segment is allowed, only standalone '.' or '..' segments are rejected
    • Error: "client_id must not contain single-dot (.) or double-dot (..) path segments"

Implementation:

import ipaddress  # At module level with other imports

def validate_client_id(client_id: str) -> tuple[bool, str]:
    """
    Validate client_id against W3C IndieAuth specification Section 3.2.

    Args:
        client_id: The client identifier URL to validate

    Returns:
        Tuple of (is_valid, error_message)
    """
    try:
        parsed = urlparse(client_id)

        # 1. Check scheme
        if parsed.scheme not in ['https', 'http']:
            return False, "client_id must use https or http scheme"

        # 2. HTTP only for localhost/loopback
        if parsed.scheme == 'http':
            # Note: parsed.hostname returns '::1' without brackets for IPv6
            if parsed.hostname not in ['localhost', '127.0.0.1', '::1']:
                return False, "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]"

        # 3. No fragments allowed
        if parsed.fragment:
            return False, "client_id must not contain a fragment (#)"

        # 4. No username/password allowed
        if parsed.username or parsed.password:
            return False, "client_id must not contain username or password"

        # 5. Check for non-loopback IP addresses
        if parsed.hostname:
            try:
                # parsed.hostname already has no brackets for IPv6
                ip = ipaddress.ip_address(parsed.hostname)
                if not ip.is_loopback:
                    return False, f"client_id must not use IP address (except 127.0.0.1 or [::1])"
            except ValueError:
                # Not an IP address, it's a domain (valid)
                pass

        # 6. Check for . or .. path segments
        if parsed.path:
            segments = parsed.path.split('/')
            for segment in segments:
                if segment == '.' or segment == '..':
                    return False, "client_id must not contain single-dot (.) or double-dot (..) path segments"

        return True, ""

    except Exception as e:
        return False, f"client_id must be a valid URL: {e}"

Updated Function: normalize_client_id()

Purpose: Normalize a valid client_id to canonical form. Must validate first.

Function Signature:

def normalize_client_id(client_id: str) -> str:
    """
    Normalize client_id URL to canonical form per IndieAuth spec.

    Normalization rules:
    - Validate against specification first
    - Convert hostname to lowercase
    - Remove default ports (80 for http, 443 for https)
    - Ensure path exists (default to "/" if empty)
    - Preserve query string if present
    - Never include fragments (already validated out)

    Args:
        client_id: Client ID URL to normalize

    Returns:
        Normalized client_id

    Raises:
        ValueError: If client_id is not valid per specification
    """

Normalization Rules:

  1. Validation First

    • Call validate_client_id()
    • If invalid, raise ValueError with the error message
  2. Hostname Normalization

    • Convert hostname to lowercase
    • Preserve IPv6 brackets if present
  3. Port Normalization

    • Remove port 80 for http URLs
    • Remove port 443 for https URLs
    • Preserve any other ports
  4. Path Normalization

    • If path is empty, set to "/"
    • Do NOT remove trailing slashes (spec doesn't require this)
    • Do NOT normalize . or .. (already validated out)
  5. Component Assembly

    • Reconstruct URL with normalized components
    • Include query string if present
    • Never include fragment (already validated out)

Implementation:

def normalize_client_id(client_id: str) -> str:
    """
    Normalize client_id URL to canonical form per IndieAuth spec.

    Args:
        client_id: Client ID URL to normalize

    Returns:
        Normalized client_id

    Raises:
        ValueError: If client_id is not valid per specification
    """
    # First validate
    is_valid, error = validate_client_id(client_id)
    if not is_valid:
        raise ValueError(error)

    parsed = urlparse(client_id)

    # Normalize hostname to lowercase
    hostname = parsed.hostname.lower() if parsed.hostname else ''

    # Determine if this is an IPv6 address (for bracket handling)
    is_ipv6 = ':' in hostname  # Simple check since hostname has no brackets

    # Handle port normalization
    port = parsed.port
    if (parsed.scheme == 'http' and port == 80) or \
       (parsed.scheme == 'https' and port == 443):
        # Default port, omit it
        if is_ipv6:
            netloc = f"[{hostname}]"  # IPv6 needs brackets in URL
        else:
            netloc = hostname
    elif port:
        # Non-default port, include it
        if is_ipv6:
            netloc = f"[{hostname}]:{port}"  # IPv6 with port needs brackets
        else:
            netloc = f"{hostname}:{port}"
    else:
        # No port
        if is_ipv6:
            netloc = f"[{hostname}]"  # IPv6 needs brackets in URL
        else:
            netloc = hostname

    # Ensure path exists
    path = parsed.path if parsed.path else '/'

    # Reconstruct URL
    normalized = f"{parsed.scheme}://{netloc}{path}"

    # Add query if present
    if parsed.query:
        normalized += f"?{parsed.query}"

    # Never add fragment (validated out)

    return normalized

Authorization Endpoint Updates (SEPARATE TASK)

NOTE: This is a SEPARATE task and should NOT be implemented as part of the validation.py update task.

Location: /home/phil/Projects/Gondulf/src/gondulf/endpoints/authorization.py

When this separate task is implemented, update the authorization endpoint to use the new validation:

# In the authorize() function, when validating client_id:

# Validate and normalize client_id
is_valid, error = validate_client_id(client_id)
if not is_valid:
    # Return error to client
    return authorization_error_response(
        redirect_uri=redirect_uri,
        error="invalid_request",
        error_description=f"Invalid client_id: {error}",
        state=state
    )

# Normalize for consistent storage/comparison
try:
    normalized_client_id = normalize_client_id(client_id)
except ValueError as e:
    # This shouldn't happen if validate_client_id passed, but handle it
    return authorization_error_response(
        redirect_uri=redirect_uri,
        error="invalid_request",
        error_description=str(e),
        state=state
    )

Token Endpoint Updates (SEPARATE TASK)

NOTE: This is a SEPARATE task and should NOT be implemented as part of the validation.py update task.

Location: /home/phil/Projects/Gondulf/src/gondulf/endpoints/token.py

When this separate task is implemented, update token endpoint validation similarly:

# In the token() function, when validating client_id:

# Validate and normalize client_id
is_valid, error = validate_client_id(client_id)
if not is_valid:
    return JSONResponse(
        status_code=400,
        content={
            "error": "invalid_client",
            "error_description": f"Invalid client_id: {error}"
        }
    )

# Normalize for comparison with stored value
normalized_client_id = normalize_client_id(client_id)

Data Models

No database schema changes required. The validation happens at the API layer before storage.

API Contracts

Error Responses

When client_id validation fails, return appropriate OAuth 2.0 error responses:

Authorization Endpoint (if redirect_uri is valid):

HTTP/1.1 302 Found
Location: {redirect_uri}?error=invalid_request&error_description=Invalid+client_id%3A+{specific_error}&state={state}

Authorization Endpoint (if redirect_uri is also invalid):

HTTP/1.1 400 Bad Request
Content-Type: text/html

<html>
  <body>
    <h1>Invalid Request</h1>
    <p>Invalid client_id: {specific_error}</p>
  </body>
</html>

Token Endpoint:

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "error": "invalid_client",
  "error_description": "Invalid client_id: {specific_error}"
}

Error Handling

Validation Error Messages

Each validation rule has a specific, user-friendly error message:

Validation Rule Error Message
Invalid URL "client_id must be a valid URL: {parse_error}"
Wrong scheme "client_id must use https or http scheme"
HTTP not localhost "client_id with http scheme is only allowed for localhost, 127.0.0.1, or [::1]"
Has fragment "client_id must not contain a fragment (#)"
Has credentials "client_id must not contain username or password"
Non-loopback IP "client_id must not use IP address (except 127.0.0.1 or [::1])"
Path traversal "client_id must not contain single-dot (.) or double-dot (..) path segments"

Exception Handling

  • validate_client_id() never raises exceptions, returns (False, error_message)
  • normalize_client_id() raises ValueError if validation fails
  • URL parsing exceptions are caught and converted to validation errors

Security Considerations

Fragment Rejection

Fragments in client_ids could cause confusion about the actual client identity. By rejecting them, we ensure clear client identification.

Credential Rejection

Username/password in URLs could leak into logs or be displayed to users. Rejecting them prevents credential exposure.

IP Address Restriction

Allowing arbitrary IP addresses could bypass domain-based security controls. Only loopback addresses are permitted for local development.

Path Traversal Prevention

Single-dot and double-dot segments could potentially be used for path traversal attacks or cause confusion about the client's identity.

HTTP Localhost Support

HTTP is only allowed for localhost/loopback addresses to support local development while maintaining security in production.

Testing Strategy

Unit Tests Required

Create comprehensive tests in /home/phil/Projects/Gondulf/tests/unit/test_validation.py:

Valid Client IDs

valid_client_ids = [
    "https://example.com",
    "https://example.com/",
    "https://example.com/app",
    "https://example.com/app/client",
    "https://example.com?foo=bar",
    "https://example.com/app?foo=bar&baz=qux",
    "https://sub.example.com",
    "https://example.com:8080",
    "https://example.com:8080/app",
    "http://localhost",
    "http://localhost:3000",
    "http://127.0.0.1",
    "http://127.0.0.1:8080",
    "http://[::1]",
    "http://[::1]:8080",
]

Invalid Client IDs

invalid_client_ids = [
    ("ftp://example.com", "must use https or http scheme"),
    ("https://example.com#fragment", "must not contain a fragment"),
    ("https://user:pass@example.com", "must not contain username or password"),
    ("https://example.com/./invalid", "must not contain single-dot"),
    ("https://example.com/../invalid", "must not contain double-dot"),
    ("http://example.com", "http scheme is only allowed for localhost"),
    ("https://192.168.1.1", "must not use IP address"),
    ("https://10.0.0.1", "must not use IP address"),
    ("https://[2001:db8::1]", "must not use IP address"),
    ("not-a-url", "must be a valid URL"),
    ("", "must be a valid URL"),
]

Normalization Tests

normalization_cases = [
    ("HTTPS://EXAMPLE.COM", "https://example.com/"),
    ("https://example.com", "https://example.com/"),
    ("https://example.com:443", "https://example.com/"),
    ("http://localhost:80", "http://localhost/"),
    ("https://EXAMPLE.COM:443/app", "https://example.com/app"),
    ("https://Example.Com/APP", "https://example.com/APP"),  # Path case preserved
    ("https://example.com?foo=bar", "https://example.com/?foo=bar"),
]

Integration Tests

  1. Test authorization endpoint with various client_ids
  2. Test token endpoint with various client_ids
  3. Test that normalized client_ids match correctly between endpoints
  4. Test error responses for invalid client_ids

Security Tests

  1. Test that fragments are always rejected
  2. Test that credentials are always rejected
  3. Test that non-loopback IPs are rejected
  4. Test that path traversal segments are rejected
  5. Test that HTTP is only allowed for localhost

Acceptance Criteria

  1. All valid client_ids per W3C specification are accepted
  2. All invalid client_ids per W3C specification are rejected with specific error messages
  3. HTTP scheme is accepted for localhost, 127.0.0.1, and [::1]
  4. HTTPS scheme is accepted for all valid domain names
  5. Fragments are always rejected
  6. Username/password components are always rejected
  7. Non-loopback IP addresses are rejected
  8. Single-dot and double-dot path segments are rejected
  9. Hostnames are normalized to lowercase
  10. Default ports (80 for HTTP, 443 for HTTPS) are removed
  11. Empty paths are normalized to "/"
  12. Query strings are preserved
  13. Authorization endpoint uses new validation
  14. Token endpoint uses new validation
  15. All tests pass with 100% coverage of validation logic
  16. Error messages are specific and helpful

Implementation Order

Current Task (validation.py update):

  1. Implement validate_client_id() function in validation.py
  2. Update normalize_client_id() to use validation in validation.py
  3. Write comprehensive unit tests in tests/unit/test_validation.py

Separate Future Tasks:

  1. Update authorization endpoint (SEPARATE TASK)
  2. Update token endpoint (SEPARATE TASK)
  3. Write integration tests (SEPARATE TASK)
  4. Test with real IndieAuth clients (SEPARATE TASK)

Migration Notes

  • No database migration needed
  • Existing stored client_ids remain valid (they were normalized on storage)
  • New validation is stricter but backward compatible with valid client_ids

References