Phil Skentelbery a68fd570c7 that initial commit

2025-11-18 19:21:31 -07:00

ADR-007: Slug Generation Algorithm

Status

Accepted

Context

Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:

User experience: Slugs appear in URLs and should be readable/meaningful
SEO: Descriptive slugs improve search engine optimization
File system: Slugs become filenames, must be filesystem-safe
Uniqueness: Slugs must be unique across all notes
Portability: Slugs should work across different systems and browsers

The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.

Decision

Content-Based Slug Generation with Timestamp Fallback

Primary Algorithm: Extract first N words from content and normalize
Fallback: Timestamp-based slug when content is insufficient
Uniqueness: Random suffix when collision detected

Algorithm Specification

Step 1: Extract Words

# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)

Step 2: Normalize

import re

# Convert to lowercase
text = text.lower()

# Replace spaces with hyphens
text = text.replace(" ", "-")

# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)

# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)

# Strip leading/trailing hyphens
text = text.strip('-')

Step 3: Validate Length

# If slug too short or empty, use timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")

Step 4: Truncate

# Limit to 100 characters; strip again so a cut never leaves a trailing hyphen
text = text[:100].rstrip('-')

Step 5: Check Uniqueness

# If slug exists, add random 4-character suffix
# (trim the base first so the result stays within 100 characters)
if slug_exists(text):
    text = f"{text[:95].rstrip('-')}-{random_alphanumeric(4)}"
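Steps 1 to 4 can be combined into a single function, shown here as a minimal sketch rather than the project's actual implementation. The names `base_slug`, `MAX_WORDS`, and the `created_at` parameter are assumptions made for illustration. The `ValueError` for empty content follows the error cases listed later in this ADR, and the final `rstrip` is an addition that keeps a truncated slug from ending on a hyphen.

```python
import re
from datetime import datetime

MAX_WORDS = 5          # "first N words" from Step 1 (hypothetical constant name)
MAX_SLUG_LENGTH = 100  # Step 4 limit


def base_slug(content: str, created_at: datetime) -> str:
    """Steps 1-4: turn note content into a candidate slug (no uniqueness check)."""
    if not content or not content.strip():
        # Empty or whitespace-only content is an error, not a fallback case.
        raise ValueError("content must not be empty")

    # Step 1: first N whitespace-separated words.
    text = " ".join(content.split()[:MAX_WORDS])

    # Step 2: lowercase, hyphenate, drop everything outside a-z, 0-9, '-'.
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")

    # Step 3: nothing usable left, so fall back to the timestamp format.
    if len(text) < 1:
        text = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate; rstrip is an addition so a cut never ends on a hyphen.
    return text[:MAX_SLUG_LENGTH].rstrip("-")
```

Step 5 is left out on purpose, because the uniqueness check depends on how existing slugs are stored.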

Character Set

Allowed characters: a-z, 0-9, - (hyphen)

Rationale:

URL-safe without encoding
Filesystem-safe on all platforms (Windows, Linux, macOS)
Human-readable
No escaping required in HTML
Compatible with DNS hostnames (if ever used)

Examples

Input Content	Generated Slug
"Hello World! This is my first note."	`hello-world-this-is-my`
"Testing... with special chars!@#"	`testing-with-special-chars`
"2024-11-18 Daily Journal Entry"	`2024-11-18-daily-journal-entry`
"A" (single character, meets the 1-character minimum)	`a`
"!@#$%" (no alphanumeric characters)	`20241118-143022` (timestamp)
" " (whitespace only)	Error: ValueError
"Hello World" (duplicate)	`hello-world-a7c9` (random suffix)

Slug Uniqueness Strategy

Collision Detection: Check database for existing slug before use

Resolution: Append random 4-character suffix

Character set: a-z0-9 (36 characters)
Combinations: 36^4 = 1,679,616 possible suffixes
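Collision probability: With k suffixed variants of the same base slug already in use, a new suffix collides with probability k / 1,679,616, which is negligible for realistic note counts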

Example:

Original:  hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
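The resolution strategy above could look roughly like the following sketch. It assumes the existing slugs are available as an in-memory set, which stands in for the database check. `make_unique` and `max_attempts` are hypothetical names, and the retry loop is an addition that covers the rare case where the suffixed slug is also taken.

```python
import secrets
import string

SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9, 36 characters
SUFFIX_LENGTH = 4


def random_alphanumeric(length: int = SUFFIX_LENGTH) -> str:
    """Return a random a-z0-9 string drawn with the secrets module."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def make_unique(slug: str, existing: set[str], max_attempts: int = 10) -> str:
    """Return slug unchanged if it is free, otherwise slug plus a random suffix."""
    if slug not in existing:
        return slug
    for _ in range(max_attempts):
        candidate = f"{slug}-{random_alphanumeric()}"
        if candidate not in existing:  # re-check: the suffixed slug could collide too
            return candidate
    raise RuntimeError(f"no free slug found for {slug!r}")
```

Calling `make_unique("hello-world", {"hello-world"})` returns something of the form `hello-world-a7c9`; the exact suffix differs on every call.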

Timestamp Fallback Format

Pattern: YYYYMMDD-HHMMSS
Example: 20241118-143022

When Used:

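Normalized slug is empty (for example, content made up only of special characters)
Normalized slug is shorter than the minimum length (< 1 character, so in practice the same condition)

Not used for empty or whitespace-only content, which raises ValueError instead.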

Rationale:

Unique unless two notes are created in the same second (the Step 5 collision check then adds a suffix)
Sortable chronologically
Still readable and meaningful
No special characters required
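As a small illustration of the format and of the "sortable chronologically" point, the sketch below formats two creation times one second apart. The helper name `timestamp_slug` is an assumption; the format string is the one from Step 3.

```python
from datetime import datetime


def timestamp_slug(created_at: datetime) -> str:
    """Format a creation time as YYYYMMDD-HHMMSS."""
    return created_at.strftime("%Y%m%d-%H%M%S")


earlier = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
later = timestamp_slug(datetime(2024, 11, 18, 14, 30, 23))
print(earlier)          # 20241118-143022
print(earlier < later)  # True: string order matches chronological order
```

Because every field is zero-padded to a fixed width, plain string comparison orders the slugs by creation time.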

Rationale

Content-Based Generation (Score: 9/10)

Pros:

Readability: Users can understand URL meaning
SEO: Search engines prefer descriptive URLs
Memorability: Easier to remember and share
Meaningful: Reflects note content

Cons:

Collisions: Multiple notes might have similar titles
Changes: Editing note doesn't update slug (by design)

First 5 Words (Score: 8/10)

Pros:

Sufficient: 5 words usually capture note topic
Concise: Keeps URLs short and readable
Consistent: Predictable slug length

Cons:

Arbitrary: 5 is somewhat arbitrary (could be 3-7)
Language: Assumes space-separated words (English-centric)

Alternatives Considered:

First 3 words: Too short, often not descriptive
First 10 words: Too long, URLs become unwieldy
First line: Could be very long, harder to normalize
First sentence: Variable length, complex to parse

Decision: 5 words is a good balance (configurable constant)

Lowercase with Hyphens (Score: 10/10)

Pros:

URL Convention: Common pattern (github.com, stackoverflow.com)
Readability: Easier to read than underscores or camelCase
Compatibility: Works everywhere
Simplicity: One separator type only

Cons:

None significant

Alphanumeric Only (Score: 10/10)

Pros:

Safety: No escaping required in URLs or filenames
Portability: Works on all filesystems (FAT32, NTFS, ext4, APFS)
Predictability: No ambiguity about character handling

Cons:

Unicode Loss: Non-ASCII characters stripped (acceptable trade-off)

Random Suffix for Uniqueness (Score: 9/10)

Pros:

Simplicity: No complex conflict resolution
Security: Cryptographically secure random (secrets module)
Scalability: About 1.68M (36^4) possible suffixes per base slug

Cons:

Ugliness: Suffix looks less clean (but rare occurrence)
Unpredictability: User can't control suffix

Alternatives Considered:

Incrementing numbers (hello-world-2, hello-world-3): More predictable but reveals note count
Longer random suffix: More secure but uglier URLs
User-specified slug: More complex, deferred to V2

Decision: 4-character random suffix is a good balance

Consequences

Positive

Automatic: No user input required for slug
Readable: Slugs are human-readable and meaningful
Safe: Works on all platforms and browsers
Unique: Collision resolution ensures uniqueness
SEO-friendly: Descriptive URLs help search ranking
Predictable: User can anticipate what slug will be
Simple: Single, consistent algorithm

Negative

Not editable: User can't customize slug in V1
English-biased: Assumes space-separated words
Unicode stripped: Non-ASCII content loses characters
Content-dependent: Similar content = similar slugs
Timestamp fallback: Notes whose first words contain no a-z or 0-9 characters get less readable timestamp slugs

Mitigations

Non-editable slugs:

V1 trade-off for simplicity
V2 can add custom slug support
Users can still reference notes by slug once created

English-bias:

Acceptable for V1 (English-first IndieWeb)
V2 can add Unicode slug support (requires more complex normalization)

Unicode stripping:

Markdown content can still contain Unicode (only slug is ASCII)
Timestamp fallback ensures note is still creatable
V2 can use Unicode normalization (transliteration)

Timestamp fallback:

Rare occurrence (only when no a-z or 0-9 characters survive normalization)
Still functional and unique
V2 can improve (use first word if exists + timestamp)

Standards Compliance

URL Standards (RFC 3986)

Slugs comply with URL path segment requirements:

No percent-encoding required
No reserved characters (/, ?, #, etc.)
Case-insensitive safe (always lowercase)

Filesystem Standards

Slugs work on all major filesystems:

FAT32: Yes (no special chars, length OK)
NTFS: Yes
ext4: Yes
APFS: Yes
HFS+: Yes

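Reserved names: The character set alone does not rule out Windows reserved device names. A note whose content is just "con", "prn", "aux" or "nul" yields that slug, and Windows treats those names as reserved even with a file extension, so they need the same handling as the reserved slugs if Windows is a target.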

IndieWeb Recommendations

Aligns with IndieWeb permalink best practices:

Descriptive URLs
No query parameters
Short and memorable
Permanent (don't change after creation)

Implementation Requirements

Validation Rules

# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'

# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
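A validation helper built on these rules might look like the sketch below. The function name `is_valid_slug` is an assumption; the pattern and limits are the ones given above.

```python
import re

SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100


def is_valid_slug(slug: str) -> bool:
    """True if slug matches the pattern and the length constraints."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    # fullmatch also rejects a trailing newline, which '$' alone would allow.
    return SLUG_PATTERN.fullmatch(slug) is not None
```

The pattern rejects leading, trailing, and doubled hyphens, so a slug that passes it has already been through the collapse and strip operations of Step 2.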

Reserved Slugs

Certain slugs should be reserved for system routes:

Reserved List (reject these slugs):

admin
api
static
auth
feed
login
logout

Implementation:

RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS

Error Cases

# Empty content
generate_slug("")  # Raises ValueError

# Whitespace only
generate_slug("   ")  # Raises ValueError

# Valid but short (meets the 1-character minimum, so no fallback)
generate_slug("Hi")  # Returns "hi"

# Special characters only
generate_slug("!@#$%")  # Returns timestamp: "20241118-143022"

Alternatives Considered

UUID-based Slugs (Rejected)

slug = str(uuid.uuid4())  # "550e8400-e29b-41d4-a716-446655440000"

Pros: Guaranteed unique, no collision checking
Cons: Not human-readable, poor SEO, not memorable

Verdict: Violates principle of readable URLs

Hash-based Slugs (Rejected)

slug = hashlib.sha256(content.encode()).hexdigest()[:12]  # "a591a6d40bf4"

Pros: Deterministic, unique in practice (a 12-character prefix can still collide)
Cons: Not human-readable, changes if content edited

Verdict: Not meaningful to users

Title Extraction (Rejected for V1)

# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)

Pros: More semantic, uses actual title
Cons: Requires markdown parsing, more complex, title might not exist

Verdict: Deferred to V2 (V1 uses first N words which is simpler)

User-Specified Slugs (Rejected for V1)

def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)

Pros: Maximum user control, no surprises
Cons: Requires UI input, validation complexity, user burden

Verdict: Deferred to V2 (V1 auto-generates for simplicity)

Incrementing Numbers (Rejected)

# If collision, increment
slug = "hello-world"
slug = "hello-world-2"  # Collision
slug = "hello-world-3"  # Collision

Pros: Predictable, simple
Cons: Reveals how many notes share a base slug, makes sibling URLs easy to enumerate

Verdict: Random suffix is more secure and scales better

Performance Considerations

Generation Speed

Extract words: O(n) where n = content length (negligible, content is small)
Normalize: O(m) where m = extracted text length (< 100 chars)
Uniqueness check: O(log n) database lookup through the slug index (effectively constant at V1 note counts)
Random suffix: O(1) generation

Target: < 1ms per slug generation (easily achieved)

Database Impact

Index on slug column: O(log n) lookup
Collision rate: Expected to be < 1% (an estimate; most notes have unique first 5 words)
Random suffix retries: Nearly never (about 1.68M combinations per base slug)
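To put a number on how rarely a suffix has to be redrawn, the sketch below works through the arithmetic. It assumes suffixes are drawn uniformly from the 36^4 space and that `taken` suffixed variants of the same base slug already exist; the function names are illustrative.

```python
SUFFIX_SPACE = 36 ** 4  # 1,679,616 possible 4-character suffixes


def retry_probability(taken: int) -> float:
    """Chance that one random suffix hits a suffix already taken for this base slug."""
    return taken / SUFFIX_SPACE


def expected_draws(taken: int) -> float:
    """Average number of suffix draws until a free one is found (geometric distribution)."""
    return 1 / (1 - retry_probability(taken))


for taken in (1, 100, 10_000):
    print(taken, f"{retry_probability(taken):.6%}", f"{expected_draws(taken):.4f}")
```

Even with 10,000 notes sharing one base slug, the chance of a redraw stays below 1%, since 10,000 / 1,679,616 is roughly 0.6%.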

Testing Requirements

Test Cases

Normal Cases:

Standard English content → descriptive slug
Content with punctuation → punctuation removed
Content with numbers → numbers preserved
Content with hyphens → hyphens preserved

Edge Cases:

Very short content (one or two words) → short slug, no fallback needed
Empty content → ValueError
Special characters only → timestamp fallback
Very long words → truncated to max length
Unicode content → stripped to ASCII

Collision Cases:

Duplicate slug → random suffix added
Multiple collisions → different random suffixes
Reserved slug → rejected

Security Cases:

Path traversal attempt (../../../etc/passwd)
Special characters (<script>, %00, etc.)
Very long input (>10,000 characters)
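The security cases can be checked directly against the Step 2 normalization, as in this sketch. The `normalize` helper restates Step 2 and the input strings are illustrative; a real test suite would call the project's own slug function instead.

```python
import re


def normalize(text: str) -> str:
    """Step 2 normalization only: lowercase, hyphenate, keep a-z, 0-9 and '-'."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    return re.sub(r"-+", "-", text).strip("-")


hostile_inputs = [
    "../../../etc/passwd",
    "<script>alert(1)</script>",
    "null%00byte",
    "a" * 10_001,
]

for raw in hostile_inputs:
    slug = normalize(raw)[:100]
    # Only characters from the allowed set survive, so no '/', '.', '<' or '%'.
    assert re.fullmatch(r"[a-z0-9-]*", slug)
    assert len(slug) <= 100
```

The path traversal input collapses to `etcpasswd`, because every dot and slash falls outside the allowed character set.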

Migration Path (V2)

Future enhancements that build on this foundation:

Custom Slugs

def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)

Unicode Support

def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    raise NotImplementedError  # V2 placeholder
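The NFKD step can be tried with the standard library alone, as in the sketch below. This is one possible starting point, not the planned V2 design, and `ascii_fold` is a hypothetical name. It handles accented Latin letters but drops scripts with no ASCII decomposition, which is where a transliteration library such as unidecode would come in.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose with NFKD, then drop whatever has no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Café déjà vu"))  # Cafe deja vu
```

CJK text passes through NFKD unchanged and is then removed entirely by the ASCII encoding step, so this approach alone does not cover the CJK requirement.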

Title Extraction

def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    raise NotImplementedError  # V2 placeholder

Slug Editing

def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    raise NotImplementedError  # V2 placeholder

Acceptance Criteria

Slug generation creates valid, URL-safe slugs
Slugs are descriptive (use first 5 words)
Slugs are unique (collision detection + random suffix)
Slugs meet length constraints (1-100 characters)
Timestamp fallback works when normalization leaves no usable characters
Reserved slugs are rejected
Unicode content is handled gracefully
All edge cases tested
Performance meets target (<1ms)
Code follows Python coding standards

Approved: 2024-11-18
Architect: StarPunk Architect Agent
