# ADR-007: Slug Generation Algorithm

## Status
Accepted

## Context

Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:

1. **User experience**: Slugs appear in URLs and should be readable/meaningful
2. **SEO**: Descriptive slugs improve search engine optimization
3. **File system**: Slugs become filenames and must be filesystem-safe
4. **Uniqueness**: Slugs must be unique across all notes
5. **Portability**: Slugs should work across different systems and browsers

The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.

## Decision

### Content-Based Slug Generation with Timestamp Fallback

- **Primary Algorithm**: Extract the first N words from the content and normalize them
- **Fallback**: Timestamp-based slug when normalization leaves nothing usable
- **Uniqueness**: Random suffix when a collision is detected

### Algorithm Specification

#### Step 1: Extract Words
```python
# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)
```

#### Step 2: Normalize
```python
import re

# Convert to lowercase
text = text.lower()

# Replace spaces with hyphens
text = text.replace(" ", "-")

# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)

# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)

# Strip leading/trailing hyphens
text = text.strip('-')
```

#### Step 3: Validate Length
```python
# If the normalized slug is empty (shorter than the 1-character minimum), use the timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")
```

#### Step 4: Truncate
```python
# Limit to 100 characters, then drop any hyphen left dangling at the cut
text = text[:100].rstrip('-')
```

#### Step 5: Check Uniqueness
```python
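# If slug exists, append a random 4-character suffix and retry until unique.
# The base is trimmed to 95 characters so the suffixed slug stays within 100.
base = text[:95].rstrip('-')
while slug_exists(text):
    text = f"{base}-{random_alphanumeric(4)}"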
```
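
Taken together, Steps 1-5 might look like the following sketch. It is one possible arrangement rather than the project's actual implementation. Passing `slug_exists` in as a callable, the `created_at` and `max_words` parameters, and the up-front `ValueError` check (taken from the Error Cases section) are assumptions made so the example is self-contained.

```python
import re
import secrets
import string
from datetime import datetime
from typing import Callable

MAX_SLUG_LENGTH = 100
SUFFIX_LENGTH = 4
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9


def generate_slug(
    content: str,
    created_at: datetime,
    slug_exists: Callable[[str], bool],  # hypothetical lookup, e.g. a DB query
    max_words: int = 5,
) -> str:
    """Apply Steps 1-5 in order and return a unique slug."""
    if not content or not content.strip():
        raise ValueError("Content cannot be empty or whitespace-only")

    # Step 1: first N whitespace-separated words
    text = " ".join(content.split()[:max_words])

    # Step 2: lowercase, hyphenate, strip disallowed characters
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")

    # Step 3: timestamp fallback when nothing usable remains
    if not text:
        text = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate, then drop any hyphen left dangling at the cut
    text = text[:MAX_SLUG_LENGTH].rstrip("-")

    # Step 5: on collision, retry with a fresh suffix until the slug is free.
    # The base is trimmed so that base + "-" + suffix still fits the limit.
    base = text[: MAX_SLUG_LENGTH - SUFFIX_LENGTH - 1].rstrip("-")
    while slug_exists(text):
        suffix = "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(SUFFIX_LENGTH))
        text = f"{base}-{suffix}"
    return text
```

For example, `generate_slug("Hello World! This is my first note.", datetime(2024, 11, 18, 14, 30, 22), lambda s: False)` returns `"hello-world-this-is-my"`.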

### Character Set

**Allowed characters**: `a-z`, `0-9`, `-` (hyphen)

**Rationale**:
- URL-safe without encoding
- Filesystem-safe on all platforms (Windows, Linux, macOS)
- Human-readable
- No escaping required in HTML
- Valid in DNS hostname labels, if ever used that way (labels are limited to 63 characters)

### Examples

| Input Content | Generated Slug |
|--------------|----------------|
| "Hello World! This is my first note." | `hello-world-this-is-my` |
| "Testing... with special chars!@#" | `testing-with-special-chars` |
| "2024-11-18 Daily Journal Entry" | `2024-11-18-daily-journal-entry` |
| "!@#$%" (no letters or digits) | `20241118-143022` (timestamp) |
| "   " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | `hello-world-a7c9` (random suffix) |

### Slug Uniqueness Strategy

**Collision Detection**: Check database for existing slug before use

**Resolution**: Append random 4-character suffix
- Character set: `a-z0-9` (36 characters)
- Combinations: 36^4 = 1,679,616 possible suffixes
- Collision probability: Negligible for reasonable note counts

**Example**:
```
Original:  hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
```
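
The suffix helper is referenced but not defined in this ADR. A minimal sketch, assuming it is built on the `secrets` module mentioned in the rationale, is shown here together with the arithmetic behind the collision estimate. The name `retry_probability` is hypothetical and exists only for illustration.

```python
import secrets
import string

SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # 36 characters


def random_alphanumeric(length: int = 4) -> str:
    """Draw each character independently from a cryptographically secure source."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def retry_probability(existing: int, length: int = 4) -> float:
    """Chance that one fresh suffix matches any of `existing` suffixes already taken."""
    return existing / len(SUFFIX_ALPHABET) ** length


print(len(SUFFIX_ALPHABET) ** 4)  # 1679616
```

Even if 1,000 notes already shared the same base slug, a new suffix would need a retry only about 0.06% of the time (1,000 / 1,679,616).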

### Timestamp Fallback Format

**Pattern**: `YYYYMMDD-HHMMSS`
**Example**: `20241118-143022`

**When Used**:
- Normalized slug is empty, i.e. shorter than the 1-character minimum (for example, content made up entirely of special or non-ASCII characters)

Empty or whitespace-only content does not use the fallback; it raises `ValueError` instead.

**Rationale**:
- Unique at one-second resolution (two notes created in the same second collide and fall through to the random-suffix step)
- Sortable chronologically
- Still readable and meaningful
- No special characters required
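
As a small illustration of the format and its sortability, the sketch below uses the `strftime` pattern from Step 3. The helper name `timestamp_slug` is an assumption made for this example.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"  # YYYYMMDD-HHMMSS


def timestamp_slug(created_at: datetime) -> str:
    """Format a creation time as a fallback slug."""
    return created_at.strftime(TIMESTAMP_FORMAT)


earlier = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
later = timestamp_slug(datetime(2024, 11, 19, 9, 5, 0))
print(earlier)          # 20241118-143022
print(earlier < later)  # True: string order matches chronological order
```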

## Rationale

### Content-Based Generation (Score: 9/10)

**Pros**:
- **Readability**: Users can understand URL meaning
- **SEO**: Search engines prefer descriptive URLs
- **Memorability**: Easier to remember and share
- **Meaningful**: Reflects note content

**Cons**:
- **Collisions**: Multiple notes might have similar titles
- **Changes**: Editing note doesn't update slug (by design)

### First 5 Words (Score: 8/10)

**Pros**:
- **Sufficient**: 5 words usually capture note topic
- **Concise**: Keeps URLs short and readable
- **Consistent**: Predictable slug length

**Cons**:
- **Arbitrary**: 5 is somewhat arbitrary (could be 3-7)
- **Language**: Assumes space-separated words (English-centric)

**Alternatives Considered**:
- First 3 words: Too short, often not descriptive
- First 10 words: Too long, URLs become unwieldy
- First line: Could be very long, harder to normalize
- First sentence: Variable length, complex to parse

**Decision**: 5 words is a good balance (configurable constant)

### Lowercase with Hyphens (Score: 10/10)

**Pros**:
- **URL Standard**: Common pattern (github.com, stackoverflow.com)
- **Readability**: Easier to read than underscores or camelCase
- **Compatibility**: Works everywhere
- **Simplicity**: One separator type only

**Cons**:
- None significant

### Alphanumeric Only (Score: 10/10)

**Pros**:
- **Safety**: No escaping required in URLs or filenames
- **Portability**: Works on all filesystems (FAT32, NTFS, ext4, APFS)
- **Predictability**: No ambiguity about character handling

**Cons**:
- **Unicode Loss**: Non-ASCII characters stripped (acceptable trade-off)

### Random Suffix for Uniqueness (Score: 9/10)

**Pros**:
- **Simplicity**: No complex conflict resolution
- **Security**: Cryptographically secure random (secrets module)
- **Scalability**: About 1.68M (36^4) possible suffixes per base slug

**Cons**:
- **Ugliness**: Suffix looks less clean (but rare occurrence)
- **Unpredictability**: User can't control suffix

**Alternatives Considered**:
- Incrementing numbers (`hello-world-2`, `hello-world-3`): More predictable but reveals how many notes share the base slug
- Longer random suffix: More secure but uglier URLs
- User-specified slug: More complex, deferred to V2

**Decision**: 4-character random suffix is good balance

## Consequences

### Positive

1. **Automatic**: No user input required for slug
2. **Readable**: Slugs are human-readable and meaningful
3. **Safe**: Works on all platforms and browsers
4. **Unique**: Collision resolution ensures uniqueness
5. **SEO-friendly**: Descriptive URLs help search ranking
6. **Predictable**: User can anticipate what slug will be
7. **Simple**: Single, consistent algorithm

### Negative

1. **Not editable**: User can't customize slug in V1
2. **English-biased**: Assumes space-separated words
3. **Unicode stripped**: Non-ASCII content loses characters
4. **Content-dependent**: Similar content = similar slugs
5. **Timestamp fallback**: Notes with no ASCII letters or digits get less readable timestamp slugs

### Mitigations

**Non-editable slugs**:
- V1 trade-off for simplicity
- V2 can add custom slug support
- Users can still reference notes by slug once created

**English-bias**:
- Acceptable for V1 (English-first IndieWeb)
- V2 can add Unicode slug support (requires more complex normalization)

**Unicode stripping**:
- Markdown content can still contain Unicode (only slug is ASCII)
- Timestamp fallback ensures note is still creatable
- V2 can use Unicode normalization (transliteration)

**Timestamp fallback**:
- Rare occurrence (most notes contain at least one ASCII letter or digit in their first 5 words)
- Still functional and unique
- V2 can improve (use first word if exists + timestamp)

## Standards Compliance

### URL Standards (RFC 3986)

Slugs comply with URL path segment requirements:
- No percent-encoding required
- No reserved characters (`/`, `?`, `#`, etc.)
- Case-insensitive safe (always lowercase)

### Filesystem Standards

Slugs work on all major filesystems:
- **FAT32**: Yes (no special chars, length OK)
- **NTFS**: Yes
- **ext4**: Yes
- **APFS**: Yes
- **HFS+**: Yes

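**Reserved names**: Windows reserves device names such as `CON`, `PRN`, `AUX`, and `NUL` case-insensitively, so a one-word slug like `con` may conflict with one. If notes are ever stored on a Windows filesystem, such slugs should be handled like the reserved slugs listed under Implementation Requirements.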
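
A guard along the following lines could check a slug against the OS reserved names. This is a sketch only: the name set covers the device names commonly documented as reserved on Windows, and `is_windows_reserved` is a hypothetical helper, not part of the current design.

```python
# Device names commonly documented as reserved on Windows (matched case-insensitively)
WINDOWS_RESERVED_NAMES = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)


def is_windows_reserved(slug: str) -> bool:
    """Return True if the slug matches a reserved Windows device name."""
    return slug.lower() in WINDOWS_RESERVED_NAMES


print(is_windows_reserved("con"))          # True
print(is_windows_reserved("hello-world"))  # False
```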

### IndieWeb Recommendations

Aligns with IndieWeb permalink best practices:
- Descriptive URLs
- No query parameters
- Short and memorable
- Permanent (don't change after creation)

## Implementation Requirements

### Validation Rules

```python
# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'

# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
```
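
These rules could be enforced with a small validator such as the sketch below. The function name `is_valid_slug` is an assumption. `fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100


def is_valid_slug(slug: str) -> bool:
    """Check length limits first, then the allowed character pattern."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    return SLUG_PATTERN.fullmatch(slug) is not None


print(is_valid_slug("hello-world"))       # True
print(is_valid_slug("20241118-143022"))   # True
print(is_valid_slug("hello--world"))      # False: consecutive hyphens
print(is_valid_slug("-hello"))            # False: leading hyphen
```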

### Reserved Slugs

Certain slugs should be reserved for system routes:

**Reserved List** (reject these slugs):
- `admin`
- `api`
- `static`
- `auth`
- `feed`
- `login`
- `logout`

Implementation:
```python
RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS
```

### Error Cases

```python
# Empty content
generate_slug("")  # Raises ValueError

# Whitespace only
generate_slug("   ")  # Raises ValueError

# Short but valid (meets the 1-character minimum)
generate_slug("Hi")  # Returns "hi"

# Special characters only
generate_slug("!@#$%")  # Returns timestamp: "20241118-143022"
```

## Alternatives Considered

### UUID-based Slugs (Rejected)

```python
slug = str(uuid.uuid4())  # "550e8400-e29b-41d4-a716-446655440000"
```

**Pros**: Guaranteed unique, no collision checking
**Cons**: Not human-readable, poor SEO, not memorable

**Verdict**: Violates principle of readable URLs

### Hash-based Slugs (Rejected)

```python
slug = hashlib.sha256(content.encode()).hexdigest()[:12]  # "a591a6d40bf4"
```

**Pros**: Deterministic, collision-resistant for distinct content
**Cons**: Not human-readable, changes if content is edited, identical content produces identical slugs

**Verdict**: Not meaningful to users

### Title Extraction (Rejected for V1)

```python
# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)
```

**Pros**: More semantic, uses actual title
**Cons**: Requires markdown parsing, more complex, title might not exist

**Verdict**: Deferred to V2 (V1 uses first N words which is simpler)

### User-Specified Slugs (Rejected for V1)

```python
def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)
```

**Pros**: Maximum user control, no surprises
**Cons**: Requires UI input, validation complexity, user burden

**Verdict**: Deferred to V2 (V1 auto-generates for simplicity)

### Incrementing Numbers (Rejected)

```python
# If collision, increment
slug = "hello-world"
slug = "hello-world-2"  # Collision
slug = "hello-world-3"  # Collision
```

**Pros**: Predictable, simple
**Cons**: Reveals how many notes share a base slug, makes URLs easy to enumerate

**Verdict**: Random suffix is more secure and scales better

## Performance Considerations

### Generation Speed

- Extract words: O(n) where n = content length (negligible, content is small)
- Normalize: O(m) where m = extracted text length (< 100 chars)
- Uniqueness check: O(log n) database lookup with index (n = number of notes)
- Random suffix: O(1) generation

**Target**: < 1ms per slug generation (easily achieved)

### Database Impact

- Index on `slug` column: O(log n) lookup
- Collision rate: expected to be low (an estimated < 1%, since most notes have unique first 5 words)
- Random suffix retries: almost never needed (36^4 ≈ 1.68M combinations per base slug)
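
A sketch of the indexed lookup follows. It assumes SQLite and a hypothetical `notes` table; neither the schema nor the index name is specified in this ADR. The unique index also acts as a last line of defense if two writers race past the existence check.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical notes table with a unique index on slug."""
    conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, slug TEXT NOT NULL)")
    conn.execute("CREATE UNIQUE INDEX idx_notes_slug ON notes (slug)")


def slug_exists(conn: sqlite3.Connection, slug: str) -> bool:
    """Indexed existence check used before accepting a slug."""
    row = conn.execute(
        "SELECT 1 FROM notes WHERE slug = ? LIMIT 1", (slug,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO notes (slug) VALUES (?)", ("hello-world",))
print(slug_exists(conn, "hello-world"))  # True
print(slug_exists(conn, "other-note"))   # False
```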

## Testing Requirements

### Test Cases

**Normal Cases**:
- Standard English content → descriptive slug
- Content with punctuation → punctuation removed
- Content with numbers → numbers preserved
- Content with hyphens → hyphens preserved

**Edge Cases**:
- Very short content (e.g. a single letter) → short slug, no fallback
- Empty content → ValueError
- Special characters only → timestamp fallback
- Very long words → truncated to max length
- Unicode content → stripped to ASCII

**Collision Cases**:
- Duplicate slug → random suffix added
- Multiple collisions → different random suffixes
- Reserved slug → rejected

**Security Cases**:
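- Path traversal attempt (`../../../etc/passwd`) → dots and slashes stripped (`etcpasswd`)
- Special characters (`<script>`, `%00`, etc.) → only `a-z`, `0-9`, and hyphens survive
- Very long input (>10,000 characters) → slug truncated to 100 characters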
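
The security cases can be expressed as plain assertions. The sketch below repeats the Step 2 normalization in a hypothetical `normalize` helper so the checks run on their own; a real test suite would import the project's function instead.

```python
import re


def normalize(text: str) -> str:
    """Step 2 normalization, repeated here so the checks are self-contained."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    return re.sub(r"-+", "-", text).strip("-")


def test_security_inputs() -> None:
    # Path separators and dots never reach the filename
    assert normalize("../../../etc/passwd") == "etcpasswd"
    # Markup and percent-encoding characters are stripped
    assert normalize("<script>alert(1)</script>") == "scriptalert1script"
    assert normalize("%00") == "00"
    # Oversized input is bounded by the Step 4 truncation
    assert len(normalize("a" * 10_001)[:100]) == 100
```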

## Migration Path (V2)

Future enhancements that build on this foundation:

### Custom Slugs
```python
def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)
```

### Unicode Support
```python
def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    raise NotImplementedError("Planned for V2")
```
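
As a rough illustration of the NFKD step, the standard library alone can strip accents from Latin text. This sketch does not use `unidecode` and does not transliterate CJK text; characters with no ASCII decomposition are simply dropped. The helper name `to_ascii` is an assumption.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented characters, then discard anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Café Résumé"))  # Cafe Resume
print(repr(to_ascii("日本語")))   # '' (no ASCII decomposition, so nothing remains)
```

CJK support would therefore still need a transliteration library or a Unicode-aware slug character set.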

### Title Extraction
```python
def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    raise NotImplementedError("Planned for V2")
```
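
One possible reading of those three steps is sketched below. It assumes only level-1 ATX headings (`# Title`) count as titles and that the first-line fallback is capped at N words. Both choices, and the name `extract_title`, are illustrative rather than decided.

```python
import re

H1_PATTERN = re.compile(r"#\s+(.*\S)")  # "# Title", but not "## Sub" or "#tag"


def extract_title(content: str, max_words: int = 5) -> str:
    """Return the first H1 heading, else the first N words of the first line."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    for line in lines:
        match = H1_PATTERN.match(line)
        if match:
            return match.group(1)
    if not lines:
        return ""
    return " ".join(lines[0].split()[:max_words])


print(extract_title("# My Title\n\nBody text"))             # My Title
print(extract_title("Just a plain first line of text\n"))   # Just a plain first line
```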

### Slug Editing
```python
def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    raise NotImplementedError("Planned for V2")
```

## References

- [RFC 3986 - URI Generic Syntax](https://www.rfc-editor.org/rfc/rfc3986)
- [IndieWeb Permalink Design](https://indieweb.org/permalink)
- [URL Slug Best Practices](https://moz.com/learn/seo/url)
- [Python secrets Module](https://docs.python.org/3/library/secrets.html)
- [ADR-004: File-Based Note Storage](ADR-004-file-based-note-storage.md)

## Acceptance Criteria

- [ ] Slug generation creates valid, URL-safe slugs
- [ ] Slugs are descriptive (use first 5 words)
- [ ] Slugs are unique (collision detection + random suffix)
- [ ] Slugs meet length constraints (1-100 characters)
- [ ] Timestamp fallback works when normalization leaves an empty slug
- [ ] Reserved slugs are rejected
- [ ] Unicode content is handled gracefully
- [ ] All edge cases tested
- [ ] Performance meets target (<1ms)
- [ ] Code follows Python coding standards

---

**Approved**: 2024-11-18
**Architect**: StarPunk Architect Agent