docs: add Phase 2 domain verification design and clarifications

Add comprehensive Phase 2 documentation:
- Complete design document for two-factor domain verification
- Implementation guide with code examples
- ADR for implementation decisions (ADR-0004)
- ADR for rel="me" email discovery (ADR-008)
- Phase 1 impact assessment
- All 23 clarification questions answered
- Updated architecture docs (indieauth-protocol, security)
- Updated ADR-005 with rel="me" approach
- Updated backlog with technical debt items

Design ready for Phase 2 implementation.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 13:05:09 -07:00
parent bebd47955f
commit 6f06aebf40
10 changed files with 5605 additions and 410 deletions
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -58,108 +58,174 @@ Gondulf follows a defense-in-depth security model with these core principles:

 ## Authentication Security

-### Email-Based Verification (v1.0.0)
+### Two-Factor Domain Verification (v1.0.0)

-**Mechanism**: Users prove domain ownership by receiving verification code at email address on that domain.
+**Mechanism**: Users prove domain ownership through TWO independent factors:
+1. **DNS TXT Record**: Proves DNS control (`_gondulf.{domain}` = `verified`)
+2. **Email via rel="me"**: Proves email control (discovered from site's rel="me" link)
+
+**Security Model**: An attacker must compromise BOTH factors to authenticate fraudulently. This is significantly stronger than single-factor verification.

 #### Threat: Email Interception

 **Risk**: Attacker intercepts email containing verification code.

 **Mitigations**:
-1. **Short Code Lifetime**: 15-minute expiration
-2. **Single Use**: Code invalidated after verification
-3. **Rate Limiting**: Max 3 code requests per email per hour
-4. **TLS Email Delivery**: Require STARTTLS for SMTP
-5. **Display Warning**: "Only request code if you initiated this login"
+1. **Two-Factor Requirement**: Email alone is insufficient (DNS also required)
+2. **Short Code Lifetime**: 15-minute expiration
+3. **Single Use**: Code invalidated after verification
+4. **Rate Limiting**: Max 3 code requests per domain per hour
+5. **TLS Email Delivery**: Require STARTTLS for SMTP
+6. **Display Warning**: "Only request code if you initiated this login"

-**Residual Risk**: Acceptable for v1.0.0 given short lifetime and single-use.
+**Residual Risk**: Low. Even with email interception, attacker still needs DNS control.
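
The "max 3 code requests per domain per hour" mitigation could be enforced with a sliding-window counter. The following is a minimal sketch, not the project's implementation: `CodeRequestLimiter` and its in-memory storage are hypothetical names chosen for illustration, and the `now` parameter exists only to make the behaviour easy to test.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

MAX_REQUESTS = 3          # limit taken from the mitigation list
WINDOW_SECONDS = 60 * 60  # one hour


class CodeRequestLimiter:
    """Hypothetical in-memory sliding-window limiter keyed by domain."""

    def __init__(self, max_requests: int = MAX_REQUESTS,
                 window_seconds: int = WINDOW_SECONDS) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._requests: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, domain: str, now: Optional[float] = None) -> bool:
        """Return True and record the request if the domain is under its limit."""
        now = time.monotonic() if now is None else now
        timestamps = self._requests[domain.lower()]
        # Drop requests that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```

A caller would check `limiter.allow(domain)` before sending a code and return the same generic response whether or not the request was allowed. State kept this way is lost on restart, which may or may not be acceptable for a given deployment.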

 #### Threat: Code Brute Force

 **Risk**: Attacker guesses 6-digit verification code.

 **Mitigations**:
-1. **Sufficient Entropy**: 1,000,000 possible codes (6 digits)
-2. **Attempt Limiting**: Max 3 attempts per email
-3. **Short Lifetime**: 15-minute window
-4. **Rate Limiting**: Max 10 attempts per IP per hour
-5. **Exponential Backoff**: 5-second delay after each failed attempt
+1. **Two-Factor Requirement**: Code alone is insufficient (DNS also required)
+2. **Sufficient Entropy**: 1,000,000 possible codes (6 digits)
+3. **Attempt Limiting**: Max 3 attempts per email
+4. **Short Lifetime**: 15-minute window
+5. **Rate Limiting**: Max 3 codes per domain per hour
+6. **Single-Use**: Code invalidated after use

 **Math**:
 - 3 attempts ÷ 1,000,000 codes = 0.0003% success probability
 - 15-minute window limits attack time
- Rate limiting prevents distributed guessing
+- Even if guessed, attacker still needs DNS control

-**Residual Risk**: Very low, acceptable for v1.0.0.
+**Residual Risk**: Very low. Two-factor requirement makes brute force insufficient.
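
The code-handling mitigations (6-digit code, 15-minute lifetime, 3 attempts, single use) can be combined in one small lifecycle. This is a sketch under stated assumptions: `PendingCode`, `pending_codes`, `issue_code`, and `check_code` are hypothetical names, the store is a plain in-memory dict keyed by domain, and `now` is injectable purely for testing.

```python
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional

CODE_TTL_SECONDS = 15 * 60  # 15-minute lifetime
MAX_ATTEMPTS = 3            # attempt limit from the mitigation list


@dataclass
class PendingCode:
    code: str
    expires_at: float
    attempts: int = 0


# Hypothetical in-memory store keyed by domain.
pending_codes: Dict[str, PendingCode] = {}


def issue_code(domain: str, now: Optional[float] = None) -> str:
    """Create a 6-digit code from a CSPRNG and store it with an expiry."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(1_000_000):06d}"
    pending_codes[domain] = PendingCode(code=code, expires_at=now + CODE_TTL_SECONDS)
    return code


def check_code(domain: str, submitted: str, now: Optional[float] = None) -> bool:
    """Return True only for a correct, unexpired code within the attempt limit."""
    now = time.time() if now is None else now
    entry = pending_codes.get(domain)
    if entry is None:
        return False
    if now >= entry.expires_at or entry.attempts >= MAX_ATTEMPTS:
        del pending_codes[domain]  # expired or exhausted: invalidate
        return False
    entry.attempts += 1
    # Constant-time comparison avoids leaking how many digits matched.
    if hmac.compare_digest(entry.code.encode(), submitted.encode()):
        del pending_codes[domain]  # single use
        return True
    if entry.attempts >= MAX_ATTEMPTS:
        del pending_codes[domain]  # third failure: invalidate the code
    return False
```

Using `secrets` rather than `random` matters here because the entropy estimate of 1,000,000 equally likely codes only holds for an unpredictable generator.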
+
+#### Threat: DNS TXT Record Spoofing
+
+**Risk**: Attacker attempts to spoof DNS responses.
+
+**Mitigations**:
+1. **Multiple Resolvers**: Query 2+ independent DNS servers (Google, Cloudflare)
+2. **Consensus Required**: Require agreement from at least 2 resolvers
+3. **DNSSEC Support**: Validate DNSSEC signatures when available (future)
+4. **Timeout Handling**: Fail securely if DNS unavailable
+5. **Logging**: Log all DNS verification attempts
+
+**Residual Risk**: Low. Spoofing multiple independent resolvers is difficult.
+
+#### Threat: rel="me" Link Spoofing
+
+**Risk**: Attacker compromises user's website to add malicious rel="me" link.
+
+**Mitigations**:
+1. **Two-Factor Requirement**: Website compromise alone insufficient (DNS also required)
+2. **HTTPS Required**: Fetch site over TLS (prevents MITM)
+3. **Certificate Validation**: Verify SSL certificate
+4. **Email Domain Matching**: Email should match site domain (warning if not)
+5. **User Education**: Inform users to secure their website
+
+**Residual Risk**: Moderate. If attacker compromises both DNS and website, they can authenticate. This is acceptable as it represents full domain compromise.

 #### Threat: Email Address Enumeration

-**Risk**: Attacker discovers which domains are registered by requesting codes.
+**Risk**: Attacker discovers email addresses by triggering rel="me" discovery.

 **Mitigations**:
-1. **Consistent Response**: Always say "If email exists, code sent"
-2. **No Error Differentiation**: Same message for valid/invalid emails
-3. **Rate Limiting**: Prevent bulk enumeration
+1. **Public Information**: rel="me" links are intentionally public
+2. **User Awareness**: Users know they're publishing email on their site
+3. **Rate Limiting**: Prevent bulk scanning
+4. **Robots.txt**: Users can restrict crawler access if desired

-**Residual Risk**: Minimal, domain names are public anyway (DNS).
+**Residual Risk**: Minimal. Email addresses are intentionally published by users on their own sites.

-### Domain Ownership Verification
+### Domain Ownership Verification (Two-Factor)

-#### TXT Record Validation (Preferred)
+**Mechanism**: v1.0.0 requires BOTH verification methods:

-**Mechanism**: Admin adds DNS TXT record `_gondulf.example.com` = `verified`.
+#### 1. TXT Record Validation (Required)
+
+**Mechanism**: Admin adds DNS TXT record `_gondulf.{domain}` = `verified`.

 **Security Properties**:
- Requires DNS control (stronger than email)
+- Proves DNS control (first factor)
 - Verifiable without user interaction
 - Cacheable for performance
 - Re-verifiable periodically

-**Threat: DNS Spoofing**
-
-**Mitigations**:
-1. **DNSSEC**: Validate DNSSEC signatures if available
-2. **Multiple Resolvers**: Query 2+ DNS servers, require consensus
-3. **Caching**: Cache valid results, re-verify daily
-4. **Logging**: Log all DNS verification attempts
-
 **Implementation**:
 ```python
 import dns.resolver
-import dns.dnssec

 def verify_txt_record(domain: str) -> bool:
    """
    Verify _gondulf.{domain} TXT record exists with value 'verified'.
+    Requires consensus from multiple independent resolvers.
    """
    try:
        # Use Google and Cloudflare DNS for redundancy
        resolvers = ['8.8.8.8', '1.1.1.1']
-        results = []
+        verified_count = 0

        for resolver_ip in resolvers:
            resolver = dns.resolver.Resolver()
            resolver.nameservers = [resolver_ip]
            resolver.timeout = 5
-            resolver.lifetime = 5

            answers = resolver.resolve(f'_gondulf.{domain}', 'TXT')
            for rdata in answers:
                txt_value = rdata.to_text().strip('"')
                if txt_value == 'verified':
-                    results.append(True)
+                    verified_count += 1
                    break

-        # Require consensus from both resolvers
-        return len(results) >= 2
+        # Require consensus from at least 2 resolvers
+        return verified_count >= 2

    except Exception as e:
        logger.warning(f"DNS verification failed for {domain}: {e}")
        return False
 ```

-**Residual Risk**: Low, DNS is foundational internet infrastructure.
+#### 2. Email Verification via rel="me" (Required)
+
+**Mechanism**: Email discovered from site's `<link rel="me" href="mailto:...">`, then verified with code.
+
+**Security Properties**:
+- Proves website control (can modify HTML)
+- Proves email control (receives and enters code)
+- Follows IndieWeb standards (rel="me")
+- Self-documenting (user declares email publicly)
+
+**Implementation**:
+```python
+from typing import Optional
+
+from bs4 import BeautifulSoup
+import requests
+
+def discover_email_from_site(domain: str) -> Optional[str]:
+    """
+    Fetch site and discover email from rel="me" link.
+    """
+    try:
+        response = requests.get(f"https://{domain}", timeout=10, allow_redirects=True)
+        response.raise_for_status()
+
+        soup = BeautifulSoup(response.content, 'html.parser')
+        me_links = soup.find_all('link', rel='me') + soup.find_all('a', rel='me')
+
+        for link in me_links:
+            href = link.get('href', '')
+            if href.startswith('mailto:'):
+                email = href.replace('mailto:', '').strip()
+                if validate_email_format(email):
+                    return email
+
+        return None
+
+    except Exception as e:
+        logger.error(f"Failed to discover email for {domain}: {e}")
+        return None
+```
+
+**Combined Residual Risk**: Low. Attacker must control DNS and also either the website (to publish their own rel="me" address) or the published email account to authenticate fraudulently.

 ## Authorization Security

@@ -431,15 +497,80 @@ class AuthorizeRequest(BaseModel):

 **Residual Risk**: Minimal, Pydantic provides strong validation.

+### HTML Parsing Security (rel="me" Discovery)
+
+#### Threat: Malicious HTML Injection
+
+**Risk**: Attacker's site contains malicious HTML to exploit parser.
+
+**Mitigations**:
+1. **Robust Parser**: Use BeautifulSoup (handles malformed HTML safely)
+2. **Link Extraction Only**: Only extract href attributes, no script execution
+3. **Timeout**: 10-second timeout for HTTP requests
+4. **Size Limit**: Limit response size (prevent memory exhaustion)
+5. **HTTPS Required**: Fetch over TLS only
+6. **Certificate Validation**: Verify SSL certificates
+
+**Implementation**:
+```python
+from typing import Optional
+
+from bs4 import BeautifulSoup
+import requests
+
+def discover_email_from_site(domain: str) -> Optional[str]:
+    """
+    Safely discover email from rel="me" link.
+    """
+    try:
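+        # Fetch with safety limits; the redirect cap is a Session setting,
+        # not a requests.get() keyword argument
+        session = requests.Session()
+        session.max_redirects = 5
+        response = session.get(
+            f"https://{domain}",
+            timeout=10,
+            allow_redirects=True,
+            stream=True  # Don't load entire response into memory
+        )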
+        response.raise_for_status()
+
+        # Limit response size (prevent memory exhaustion)
+        MAX_SIZE = 5 * 1024 * 1024  # 5MB
+        # decode_content=True undoes gzip/deflate content encoding before parsing
+        content = response.raw.read(MAX_SIZE, decode_content=True)
+
+        # Parse HTML (BeautifulSoup handles malformed HTML safely)
+        soup = BeautifulSoup(content, 'html.parser')
+
+        # Find rel="me" links (no script execution)
+        me_links = soup.find_all('link', rel='me') + soup.find_all('a', rel='me')
+
+        # Extract mailto: links only
+        for link in me_links:
+            href = link.get('href', '')
+            if href.startswith('mailto:'):
+                email = href.replace('mailto:', '').strip()
+                # Validate email format before returning
+                if validate_email_format(email):
+                    return email
+
+        return None
+
+    except requests.exceptions.SSLError as e:
+        logger.error(f"SSL certificate validation failed for {domain}: {e}")
+        return None
+    except Exception as e:
+        logger.error(f"Failed to discover email for {domain}: {e}")
+        return None
+```
+
+**Residual Risk**: Very low. BeautifulSoup tolerates malformed HTML, and only `href` attribute values are extracted.
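
One detail the discovery code above does not cover: a `mailto:` URI may carry a query string (for example `?subject=...`), percent-encoding, or several comma-separated addresses, and a plain prefix strip would pass all of that on to format validation. A hedged sketch of a stricter extraction step, using only the standard library, follows. `email_from_mailto` is a hypothetical helper, and taking the first address of a list is an assumption rather than documented behaviour.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def email_from_mailto(href: str) -> Optional[str]:
    """Return the first address in a mailto: URI, or None if there is none."""
    parts = urlsplit(href.strip())
    if parts.scheme != "mailto":  # urlsplit lowercases the scheme
        return None
    # parts.path holds the address list; parts.query holds ?subject=... etc.
    first = parts.path.split(",")[0]
    address = unquote(first).strip()
    return address or None
```

The returned value would still need to pass `validate_email_format` before any code is sent to it.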
+
 ### Email Validation

 #### Threat: Email Injection Attacks

-**Risk**: Attacker injects SMTP commands via email address field.
+**Risk**: Attacker crafts malicious email address in rel="me" link.

 **Mitigations**:
 1. **Format Validation**: Conservative email regex (simplified subset of RFC 5322)
-2. **Domain Matching**: Require email domain match `me` domain
+2. **No User Input**: Email discovered from site (not user-provided)
 3. **SMTP Library**: Use well-tested library (smtplib)
 4. **Content Encoding**: Encode email content properly
 5. **Rate Limiting**: Prevent abuse
@@ -447,31 +578,27 @@ class AuthorizeRequest(BaseModel):
 **Validation**:
 ```python
 import re
-from email.utils import parseaddr

-def validate_email(email: str, required_domain: str) -> tuple[bool, str]:
+def validate_email_format(email: str) -> bool:
    """
-    Validate email address and domain match.
+    Validate email address format.
    """
-    # Parse email (RFC 5322 compliant)
-    name, addr = parseaddr(email)
-
-    # Basic format check
+    # Basic format check (RFC 5322 simplified)
    email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
-    if not re.match(email_regex, addr):
-        return False, "Invalid email format"
+    # fullmatch: with re.match, `$` would still accept a trailing newline
+    if not re.fullmatch(email_regex, email):
+        return False

-    # Extract domain
-    email_domain = addr.split('@')[1].lower()
-    required_domain = required_domain.lower()
+    # Sanity checks
+    if len(email) > 254:  # RFC 5321 maximum
+        return False
+    if email.count('@') != 1:
+        return False

-    # Domain must match
-    if email_domain != required_domain:
-        return False, f"Email must be at {required_domain}"
-
-    return True, ""
+    return True
 ```

+**Note**: Domain matching is NOT enforced in v1.0.0. User may have email at different domain than their identity site (e.g., phil@gmail.com for phil.example.com). This is acceptable as user explicitly publishes the email on their site.
+
 **Residual Risk**: Low, standard validation patterns.
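
Since domain matching is advisory rather than enforced, the check reduces to a comparison plus a log line. The sketch below assumes an exact, case-insensitive comparison (the same rule the removed `validate_email` applied); `email_matches_site_domain` and `warn_on_domain_mismatch` are hypothetical names, and whether subdomains should count as a match is left open.

```python
import logging

logger = logging.getLogger(__name__)


def email_matches_site_domain(email: str, site_domain: str) -> bool:
    """Case-insensitive exact comparison of the email's domain and the site domain."""
    _, _, email_domain = email.rpartition("@")
    return email_domain.lower().rstrip(".") == site_domain.lower().rstrip(".")


def warn_on_domain_mismatch(email: str, site_domain: str) -> None:
    """Log a warning (never reject) when the domains differ."""
    if not email_matches_site_domain(email, site_domain):
        # The address itself is deliberately left out of the log message.
        logger.warning("rel=me email domain differs from site domain %s", site_domain)
```

With the example from the note, `phil@gmail.com` on `phil.example.com` produces a warning and verification continues.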

 ## Network Security
@@ -567,21 +694,29 @@ async def add_security_headers(request: Request, call_next):

 **Email Handling**:
 ```python
-# Email stored ONLY during verification (in-memory, 15-min TTL)
+# Email discovered from rel="me" link (not user-provided)
+# Stored ONLY during verification (in-memory, 15-min TTL)
 verification_codes[code_id] = {
-    "email": email,  # ← Exists ONLY here, NEVER in database
+    "email": email,  # ← Discovered from site, exists ONLY here, NEVER in database
    "code": code,
+    "domain": domain,
    "expires_at": datetime.utcnow() + timedelta(minutes=15)
 }

-# After verification: email is deleted, only domain stored
+# After verification: email is deleted, only domain + timestamp stored
 db.execute('''
-    INSERT INTO domains (domain, verification_method, verified_at)
-    VALUES (?, 'email', ?)
-''', (domain, datetime.utcnow()))
-# Note: NO email address in database
+    INSERT INTO domains (domain, verification_method, verified_at, last_email_check)
+    VALUES (?, 'two_factor', ?, ?)
+''', (domain, datetime.utcnow(), datetime.utcnow()))
+# Note: NO email address in database, only verification timestamp
 ```

+**rel="me" Discovery**:
+- Email addresses are public (user publishes on their site)
+- Server fetches email from user's site (not user input)
+- Reduces social engineering risk (can't claim arbitrary email)
+- Follows IndieWeb standards for identity
+
 ### Database Security

 **SQLite Security**:
@@ -829,13 +964,15 @@ security:
 ## Security Roadmap

 ### v1.0.0 (MVP)
- ✅ Email-based authentication
+- ✅ Two-factor domain verification (DNS TXT + Email via rel="me")
+- ✅ rel="me" email discovery (IndieWeb standard)
+- ✅ HTML parsing security (BeautifulSoup)
 - ✅ TLS/HTTPS enforcement
 - ✅ Secure token generation (opaque, hashed)
 - ✅ URL validation (open redirect prevention)
 - ✅ Input validation (Pydantic)
 - ✅ Security headers
- ✅ Minimal data collection
+- ✅ Minimal data collection (no email storage)

 ### v1.1.0
 - PKCE support (code challenge/verifier)