ADR-007: Slug Generation Algorithm

Status

Accepted

Context

Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:

  1. User experience: Slugs appear in URLs and should be readable/meaningful
  2. SEO: Descriptive slugs improve search engine optimization
  3. File system: Slugs become filenames, must be filesystem-safe
  4. Uniqueness: Slugs must be unique across all notes
  5. Portability: Slugs should work across different systems and browsers

The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.

Decision

Content-Based Slug Generation with Timestamp Fallback

  • Primary Algorithm: Extract first N words from content and normalize
  • Fallback: Timestamp-based slug when content is insufficient
  • Uniqueness: Random suffix when collision detected

Algorithm Specification

Step 1: Extract Words

# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)

Step 2: Normalize

import re

# Convert to lowercase
text = text.lower()

# Replace spaces with hyphens
text = text.replace(" ", "-")

# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)

# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)

# Strip leading/trailing hyphens
text = text.strip('-')

Step 3: Validate Length

# If nothing usable remains after normalization, use timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")

Step 4: Truncate

# Limit to 100 characters; drop a hyphen left dangling at the cut point
text = text[:100].rstrip('-')

Step 5: Check Uniqueness

# If slug exists, add random 4-character suffix (retry until unique)
base = text[:95].rstrip('-')  # keep slug + "-xxxx" within 100 characters
while slug_exists(text):
    text = f"{base}-{random_alphanumeric(4)}"
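The five steps can be combined into one function. The following is a minimal sketch, not the project's implementation: `existing_slugs` is a hypothetical in-memory stand-in for the database check, `created_at` is passed in explicitly, and the `ValueError` for empty content follows the Error Cases section of this ADR. The retry loop and the trimming that keeps suffixed slugs within 100 characters are choices made for this sketch.

```python
import re
import secrets
import string
from datetime import datetime

MAX_SLUG_LENGTH = 100
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def generate_slug(content, created_at, existing_slugs):
    """Sketch of Steps 1-5; existing_slugs is a hypothetical stand-in for the database."""
    if not content or not content.strip():
        raise ValueError("Content cannot be empty or whitespace-only")

    # Steps 1-2: take the first five words, lowercase, and normalize
    text = "-".join(content.split()[:5]).lower()
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")

    # Step 3: timestamp fallback when nothing usable remains
    if len(text) < 1:
        text = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate, dropping a hyphen left at the cut point
    text = text[:MAX_SLUG_LENGTH].rstrip("-")

    # Step 5: append a random suffix until the slug is unused
    base = text[: MAX_SLUG_LENGTH - 5].rstrip("-")
    while text in existing_slugs:
        suffix = "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(4))
        text = f"{base}-{suffix}"
    return text
```

Calling `generate_slug("Hello World! This is my first note.", datetime(2024, 11, 18, 14, 30, 22), set())` returns `hello-world-this-is-my`.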

Character Set

Allowed characters: a-z, 0-9, - (hyphen)

Rationale:

  • URL-safe without encoding
  • Filesystem-safe on all platforms (Windows, Linux, macOS)
  • Human-readable
  • No escaping required in HTML
  • Compatible with DNS hostname labels (if ever used), provided the slug is at most 63 characters

Examples

| Input Content | Generated Slug |
|---|---|
| "Hello World! This is my first note." | hello-world-this-is-my |
| "Testing... with special chars!@#" | testing-with-special-chars |
| "2024-11-18 Daily Journal Entry" | 2024-11-18-daily-journal-entry |
| "A" (single character) | a (meets the 1-character minimum) |
| "!@#$%" (special characters only) | 20241118-143022 (timestamp) |
| " " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | hello-world-a7c9 (random suffix) |

Slug Uniqueness Strategy

Collision Detection: Check database for existing slug before use

Resolution: Append random 4-character suffix

  • Character set: a-z0-9 (36 characters)
  • Combinations: 36^4 = 1,679,616 possible suffixes
  • Collision probability: Negligible for reasonable note counts

Example:

Original:  hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
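The suffix step can be sketched with the standard-library `secrets` module, which this ADR names as the randomness source. `random_alphanumeric` matches the helper name used in Step 5; `resolve_collision` and its `slug_exists` callback parameter are hypothetical names chosen for illustration.

```python
import secrets
import string

SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # 36 characters


def random_alphanumeric(length=4):
    """Return a random lowercase alphanumeric string drawn from a CSPRNG."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def resolve_collision(base, slug_exists):
    """Append fresh suffixes to base until slug_exists reports no match."""
    candidate = base
    while slug_exists(candidate):
        candidate = f"{base}-{random_alphanumeric(4)}"
    return candidate
```

Each retry draws a new suffix from the original base slug, so repeated collisions never stack suffixes on top of one another.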

Timestamp Fallback Format

Pattern: YYYYMMDD-HHMMSS
Example: 20241118-143022

When Used:

  • Normalized slug is empty (after removing special characters)
  • Normalized slug is shorter than the minimum length (currently 1 character, so this coincides with the empty case)

Not Used:

  • Content is empty or whitespace-only (raises ValueError instead)

Rationale:

  • Unique unless two notes are created in the same second (the Step 5 uniqueness check covers that case)
  • Sortable chronologically
  • Still readable and meaningful
  • No special characters required
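A short sketch of the fallback format, showing why the slugs sort chronologically: every field is zero-padded and ordered from most to least significant, so plain string comparison matches time order. The function name `timestamp_slug` is hypothetical.

```python
from datetime import datetime


def timestamp_slug(created_at):
    """Format a datetime as YYYYMMDD-HHMMSS."""
    return created_at.strftime("%Y%m%d-%H%M%S")


earlier = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
later = timestamp_slug(datetime(2024, 11, 19, 9, 5, 0))

print(earlier)          # 20241118-143022
print(earlier < later)  # True
```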

Rationale

Content-Based Generation (Score: 9/10)

Pros:

  • Readability: Users can understand URL meaning
  • SEO: Search engines prefer descriptive URLs
  • Memorability: Easier to remember and share
  • Meaningful: Reflects note content

Cons:

  • Collisions: Multiple notes might have similar titles
  • Changes: Editing note doesn't update slug (by design)

First 5 Words (Score: 8/10)

Pros:

  • Sufficient: 5 words usually capture note topic
  • Concise: Keeps URLs short and readable
  • Consistent: Predictable slug length

Cons:

  • Arbitrary: 5 is somewhat arbitrary (could be 3-7)
  • Language: Assumes space-separated words (English-centric)

Alternatives Considered:

  • First 3 words: Too short, often not descriptive
  • First 10 words: Too long, URLs become unwieldy
  • First line: Could be very long, harder to normalize
  • First sentence: Variable length, complex to parse

Decision: 5 words is a good balance (configurable constant)

Lowercase with Hyphens (Score: 10/10)

Pros:

  • URL Standard: Common pattern (github.com, stackoverflow.com)
  • Readability: Easier to read than underscores or camelCase
  • Compatibility: Works everywhere
  • Simplicity: One separator type only

Cons:

  • None significant

Alphanumeric Only (Score: 10/10)

Pros:

  • Safety: No escaping required in URLs or filenames
  • Portability: Works on all filesystems (FAT32, NTFS, ext4, APFS)
  • Predictability: No ambiguity about character handling

Cons:

  • Unicode Loss: Non-ASCII characters stripped (acceptable trade-off)

Random Suffix for Uniqueness (Score: 9/10)

Pros:

  • Simplicity: No complex conflict resolution
  • Security: Cryptographically secure random (secrets module)
  • Scalability: 1.6M possible suffixes per base slug

Cons:

  • Ugliness: Suffix looks less clean (but rare occurrence)
  • Unpredictability: User can't control suffix

Alternatives Considered:

  • Incrementing numbers (hello-world-2, hello-world-3): More predictable but reveals note count
  • Longer random suffix: More secure but uglier URLs
  • User-specified slug: More complex, deferred to V2

Decision: 4-character random suffix is good balance

Consequences

Positive

  1. Automatic: No user input required for slug
  2. Readable: Slugs are human-readable and meaningful
  3. Safe: Works on all platforms and browsers
  4. Unique: Collision resolution ensures uniqueness
  5. SEO-friendly: Descriptive URLs help search ranking
  6. Predictable: User can anticipate what slug will be
  7. Simple: Single, consistent algorithm

Negative

  1. Not editable: User can't customize slug in V1
  2. English-biased: Assumes space-separated words
  3. Unicode stripped: Non-ASCII content loses characters
  4. Content-dependent: Similar content = similar slugs
  5. Timestamp fallback: Notes whose first words contain no ASCII letters or digits get ugly timestamp slugs

Mitigations

Non-editable slugs:

  • V1 trade-off for simplicity
  • V2 can add custom slug support
  • Users can still reference notes by slug once created

English-bias:

  • Acceptable for V1 (English-first IndieWeb)
  • V2 can add Unicode slug support (requires more complex normalization)

Unicode stripping:

  • Markdown content can still contain Unicode (only slug is ASCII)
  • Timestamp fallback ensures note is still creatable
  • V2 can use Unicode normalization (transliteration)

Timestamp fallback:

  • Rare occurrence (most notes contain at least one ASCII letter or digit in their first 5 words)
  • Still functional and unique
  • V2 can improve (use first word if exists + timestamp)

Standards Compliance

URL Standards (RFC 3986)

Slugs comply with URL path segment requirements:

  • No percent-encoding required
  • No reserved characters (/, ?, #, etc.)
  • Case-insensitive safe (always lowercase)
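The percent-encoding claim can be checked with the standard library. This sketch uses `urllib.parse.quote` with `safe=""`, which encodes every character outside RFC 3986's unreserved set; the sample strings are illustrative.

```python
from urllib.parse import quote

slug = "hello-world-a7c9"

# With safe="", quote() leaves only unreserved characters untouched.
encoded_slug = quote(slug, safe="")
encoded_raw = quote("Hello World!", safe="")

print(encoded_slug == slug)  # True: the slug needs no encoding
print(encoded_raw)           # Hello%20World%21
```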

Filesystem Standards

Slugs work on all major filesystems:

  • FAT32: Yes (no special chars, length OK)
  • NTFS: Yes
  • ext4: Yes
  • APFS: Yes
  • HFS+: Yes

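Reserved names: Content-derived slugs can coincide with Windows reserved device names (con, prn, aux, nul, com1 through com9, lpt1 through lpt9). Windows matches these names case-insensitively and also reserves them when followed by an extension, so they should be rejected like the reserved slugs if notes are ever stored on Windows filesystems.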

IndieWeb Recommendations

Aligns with IndieWeb permalink best practices:

  • Descriptive URLs
  • No query parameters
  • Short and memorable
  • Permanent (don't change after creation)

Implementation Requirements

Validation Rules

# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'

# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
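A minimal validation sketch built on the constants above. The function name `validate_slug` is an assumption. It uses `re.fullmatch` rather than `re.match` because `$` in `re.match` also matches just before a trailing newline, which would let `"hello\n"` through.

```python
import re

SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100


def validate_slug(slug):
    """Return True if slug satisfies the length limits and the character pattern."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    return re.fullmatch(SLUG_PATTERN, slug) is not None
```

The pattern itself rules out leading, trailing, and doubled hyphens, so `-hello`, `hello-`, and `hello--world` are all rejected.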

Reserved Slugs

Certain slugs should be reserved for system routes:

Reserved List (reject these slugs):

  • admin
  • api
  • static
  • auth
  • feed
  • login
  • logout

Implementation:

RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS

Error Cases

# Empty content
generate_slug("")  # Raises ValueError

# Whitespace only
generate_slug("   ")  # Raises ValueError

# Valid but short
generate_slug("Hi")  # Returns "hi" (2 characters meets the 1-character minimum)

# Special characters only
generate_slug("!@#$%")  # Returns timestamp: "20241118-143022"

Alternatives Considered

UUID-based Slugs (Rejected)

slug = str(uuid.uuid4())  # "550e8400-e29b-41d4-a716-446655440000"

Pros: Guaranteed unique, no collision checking
Cons: Not human-readable, poor SEO, not memorable

Verdict: Violates principle of readable URLs

Hash-based Slugs (Rejected)

slug = hashlib.sha256(content.encode()).hexdigest()[:12]  # "a591a6d40bf4"

Pros: Deterministic, unique
Cons: Not human-readable, changes if content edited

Verdict: Not meaningful to users

Title Extraction (Rejected for V1)

# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)

Pros: More semantic, uses actual title
Cons: Requires markdown parsing, more complex, title might not exist

Verdict: Deferred to V2 (V1 uses first N words which is simpler)

User-Specified Slugs (Rejected for V1)

def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)

Pros: Maximum user control, no surprises
Cons: Requires UI input, validation complexity, user burden

Verdict: Deferred to V2 (V1 auto-generates for simplicity)

Incrementing Numbers (Rejected)

# If collision, increment
slug = "hello-world"
slug = "hello-world-2"  # Collision
slug = "hello-world-3"  # Collision

Pros: Predictable, simple
Cons: Reveals note count, enumeration attack vector, less random

Verdict: Random suffix is more secure and scales better

Performance Considerations

Generation Speed

  • Extract words: O(n) where n = content length (negligible, content is small)
  • Normalize: O(m) where m = extracted text length (< 100 chars)
  • Uniqueness check: O(log n) database lookup with index
  • Random suffix: O(1) generation

Target: < 1ms per slug generation (easily achieved)

Database Impact

  • Index on slug column: O(log n) lookup
  • Collision rate: Expected to be < 1% (most notes should have unique first 5 words)
  • Random suffix retries: Nearly never (1.6M combinations)
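The "nearly never" estimate for retries can be made concrete. This sketch assumes suffixes are drawn uniformly and independently from the 36^4 space; both function names are hypothetical.

```python
SUFFIX_SPACE = 36 ** 4  # 1,679,616 possible suffixes


def retry_probability(taken):
    """Chance that one new suffix hits one of `taken` suffixes already used for a base slug."""
    return taken / SUFFIX_SPACE


def any_collision_probability(n):
    """Chance that n suffixes drawn for one base slug contain a repeat (birthday bound)."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (SUFFIX_SPACE - i) / SUFFIX_SPACE
    return 1.0 - p_unique
```

Under these assumptions, even with 100 notes sharing one base slug, a single new suffix needs a retry with probability 100 / 1,679,616 (about 0.006%), and the chance of any repeat among all 100 draws is roughly 0.3%.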

Testing Requirements

Test Cases

Normal Cases:

  • Standard English content → descriptive slug
  • Content with punctuation → punctuation removed
  • Content with numbers → numbers preserved
  • Content with hyphens → hyphens preserved

Edge Cases:

  • Very short content (single word or character) → short slug, no fallback
  • Empty content → ValueError
  • Special characters only → timestamp fallback
  • Very long words → truncated to max length
  • Unicode content → stripped to ASCII

Collision Cases:

  • Duplicate slug → random suffix added
  • Multiple collisions → different random suffixes
  • Reserved slug → rejected

Security Cases:

  • Path traversal attempt (../../../etc/passwd)
  • Special characters (<script>, %00, etc.)
  • Very long input (>10,000 characters)
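The security cases can be exercised against the normalization steps alone. The sketch below repeats Steps 1, 2, and 4 in a hypothetical `normalize` helper and shows that traversal sequences, markup, and encoded bytes are reduced to plain alphanumerics before they reach a URL or filename.

```python
import re


def normalize(text):
    """Steps 1, 2, and 4 of the algorithm: extract, normalize, truncate."""
    text = "-".join(text.split()[:5]).lower()
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:100].rstrip("-")


print(normalize("../../../etc/passwd"))        # etcpasswd
print(normalize("<script>alert(1)</script>"))  # scriptalert1script
print(normalize("%00"))                        # 00
print(len(normalize("a" * 10001)))             # 100
```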

Migration Path (V2)

Future enhancements that build on this foundation:

Custom Slugs

def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)

Unicode Support

def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    raise NotImplementedError("Planned for V2")
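As a rough illustration of the NFKD step only, the standard-library `unicodedata` module can fold accented Latin characters to ASCII. This is a sketch under that narrow assumption: it is not a substitute for the transliteration library mentioned above, and it drops CJK text entirely. The name `ascii_fold` is hypothetical.

```python
import unicodedata


def ascii_fold(text):
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Café résumé"))  # Cafe resume
print(repr(ascii_fold("日本語")))  # ''
```

Accented letters survive as their base letters, while scripts without an ASCII decomposition vanish, which is why full Unicode support needs real transliteration.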

Title Extraction

def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    raise NotImplementedError("Planned for V2")
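One possible reading of that outline, as a sketch only. It assumes that just the first non-blank line is inspected, that only a level-1 `# ` heading counts, and that the first line is used as the title when it has at most `max_words` words. None of these choices is specified by this ADR.

```python
import re


def extract_title_from_content(content, max_words=5):
    """Heading, else a short first line, else the first max_words words."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if not lines:
        return ""
    first = lines[0]

    # Level-1 markdown heading on the first non-blank line
    match = re.match(r"#\s+(.+)", first)
    if match:
        return match.group(1).strip()

    # A short first line is treated as the title
    if len(first.split()) <= max_words:
        return first

    # Otherwise fall back to the first N words, as in V1
    return " ".join(content.split()[:max_words])
```

The returned title would still pass through the V1 normalization steps before being used as a slug.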

Slug Editing

def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    raise NotImplementedError("Planned for V2")

Acceptance Criteria

  • Slug generation creates valid, URL-safe slugs
  • Slugs are descriptive (use first 5 words)
  • Slugs are unique (collision detection + random suffix)
  • Slugs meet length constraints (1-100 characters)
  • Timestamp fallback works when normalization leaves no usable characters
  • Reserved slugs are rejected
  • Unicode content is handled gracefully
  • All edge cases tested
  • Performance meets target (<1ms)
  • Code follows Python coding standards

Approved: 2024-11-18
Architect: StarPunk Architect Agent