ADR-007: Slug Generation Algorithm
Status
Accepted
Context
Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:
- User experience: Slugs appear in URLs and should be readable/meaningful
- SEO: Descriptive slugs improve search engine optimization
- File system: Slugs become filenames, must be filesystem-safe
- Uniqueness: Slugs must be unique across all notes
- Portability: Slugs should work across different systems and browsers
The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.
Decision
Content-Based Slug Generation with Timestamp Fallback
Primary Algorithm: Extract first N words from content and normalize
Fallback: Timestamp-based slug when content is insufficient
Uniqueness: Random suffix when collision detected
Algorithm Specification
Step 1: Extract Words
# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)
Step 2: Normalize
# Convert to lowercase
text = text.lower()
# Replace spaces with hyphens
text = text.replace(" ", "-")
# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)
# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)
# Strip leading/trailing hyphens
text = text.strip('-')
Step 3: Validate Length
# If slug too short or empty, use timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")
Step 4: Truncate
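# Limit to 100 characters, then drop any hyphen left dangling by the cut
# so the result still matches the valid slug pattern
text = text[:100].rstrip('-')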
Step 5: Check Uniqueness
# If slug exists, add random 4-character suffix
if slug_exists(text):
    text = f"{text}-{random_alphanumeric(4)}"
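The five steps can be combined into a single function. The following is a minimal sketch, not the definitive implementation: the `slug_exists` callback, the `created_at` default, and the body of `random_alphanumeric` are assumptions made so the example runs on its own, and the `rstrip("-")` after truncation is a small addition to keep a cut-off slug from ending in a hyphen. With the threshold as written (length < 1), only an empty normalized result triggers the timestamp fallback.

```python
import re
import secrets
import string
from datetime import datetime
from typing import Callable, Optional

MAX_WORDS = 5            # leading words taken from the content
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9


def random_alphanumeric(length: int) -> str:
    """Return `length` characters from a-z0-9 using a secure random source."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def generate_slug(
    content: str,
    created_at: Optional[datetime] = None,
    slug_exists: Callable[[str], bool] = lambda slug: False,  # assumed lookup hook
) -> str:
    if not content or not content.strip():
        raise ValueError("content must not be empty or whitespace-only")
    created_at = created_at or datetime.now()

    # Step 1: extract the first N words
    text = " ".join(content.split()[:MAX_WORDS])

    # Step 2: normalize to a-z, 0-9 and single hyphens
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")

    # Step 3: timestamp fallback when nothing usable is left
    if len(text) < MIN_SLUG_LENGTH:
        text = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate, without leaving a trailing hyphen
    text = text[:MAX_SLUG_LENGTH].rstrip("-")

    # Step 5: append a random suffix on collision
    if slug_exists(text):
        text = f"{text}-{random_alphanumeric(4)}"
    return text
```

Passing `slug_exists` as a parameter keeps the sketch independent of any storage layer; a real implementation would presumably query the notes database instead.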
Character Set
Allowed characters: a-z, 0-9, - (hyphen)
Rationale:
- URL-safe without encoding
- Filesystem-safe on all platforms (Windows, Linux, macOS)
- Human-readable
- No escaping required in HTML
- Compatible with DNS hostnames (if ever used)
Examples
| Input Content | Generated Slug |
|---|---|
| "Hello World! This is my first note." | hello-world-this-is-my |
| "Testing... with special chars!@#" | testing-with-special-chars |
| "2024-11-18 Daily Journal Entry" | 2024-11-18-daily-journal-entry |
| "A" (too short) | 20241118-143022 (timestamp) |
| " " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | hello-world-a7c9 (random suffix) |
Slug Uniqueness Strategy
Collision Detection: Check database for existing slug before use
Resolution: Append random 4-character suffix
- Character set: a-z0-9 (36 characters)
- Combinations: 36^4 = 1,679,616 possible suffixes
- Collision probability: Negligible for reasonable note counts
Example:
Original: hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
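A minimal sketch of this resolution step follows. It assumes the set of existing slugs is available in memory as `existing`; the `make_unique` helper and its `max_attempts` retry limit are hypothetical and not part of this ADR. If k suffixed variants of a base slug already exist, the chance that one new suffix collides is k / 1,679,616, so a retry should almost never be needed.

```python
import secrets
import string

SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9, 36 characters
SUFFIX_LENGTH = 4


def random_alphanumeric(length: int = SUFFIX_LENGTH) -> str:
    """Draw each character independently with the secrets module."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def make_unique(base: str, existing: set, max_attempts: int = 10) -> str:
    """Return `base`, or `base` plus a random suffix if `base` is taken."""
    if base not in existing:
        return base
    for _ in range(max_attempts):
        candidate = f"{base}-{random_alphanumeric()}"
        if candidate not in existing:
            return candidate
    raise RuntimeError(f"no free suffix found for {base!r}")
```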
Timestamp Fallback Format
Pattern: YYYYMMDD-HHMMSS
Example: 20241118-143022
When Used:
- Normalized slug is empty (after removing special characters)
- Normalized slug is too short (< 1 character)
Not Used:
- Content is empty or whitespace-only (raises ValueError instead of falling back)
Rationale:
- Unique unless two notes are created in the same second, in which case the Step 5 collision check adds a suffix
- Sortable chronologically
- Still readable and meaningful
- No special characters required
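The format can be sketched with `datetime.strftime`. The `timestamp_slug` helper name is an assumption for illustration. Because every field is zero-padded and ordered from year down to second, plain string sorting matches chronological order.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"  # YYYYMMDD-HHMMSS


def timestamp_slug(created_at: datetime) -> str:
    """Format a creation time as a fallback slug."""
    return created_at.strftime(TIMESTAMP_FORMAT)


earlier = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
later = timestamp_slug(datetime(2024, 11, 19, 9, 5, 0))
print(earlier)          # 20241118-143022
print(earlier < later)  # True: lexicographic order follows time order
```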
Rationale
Content-Based Generation (Score: 9/10)
Pros:
- Readability: Users can understand URL meaning
- SEO: Search engines prefer descriptive URLs
- Memorability: Easier to remember and share
- Meaningful: Reflects note content
Cons:
- Collisions: Multiple notes might have similar titles
- Changes: Editing note doesn't update slug (by design)
First 5 Words (Score: 8/10)
Pros:
- Sufficient: 5 words usually capture note topic
- Concise: Keeps URLs short and readable
- Consistent: Predictable slug length
Cons:
- Arbitrary: 5 is somewhat arbitrary (could be 3-7)
- Language: Assumes space-separated words (English-centric)
Alternatives Considered:
- First 3 words: Too short, often not descriptive
- First 10 words: Too long, URLs become unwieldy
- First line: Could be very long, harder to normalize
- First sentence: Variable length, complex to parse
Decision: 5 words is a good balance (configurable constant)
Lowercase with Hyphens (Score: 10/10)
Pros:
- URL Standard: Common pattern (github.com, stackoverflow.com)
- Readability: Easier to read than underscores or camelCase
- Compatibility: Works everywhere
- Simplicity: One separator type only
Cons:
- None significant
Alphanumeric Only (Score: 10/10)
Pros:
- Safety: No escaping required in URLs or filenames
- Portability: Works on all filesystems (FAT32, NTFS, ext4, APFS)
- Predictability: No ambiguity about character handling
Cons:
- Unicode Loss: Non-ASCII characters stripped (acceptable trade-off)
Random Suffix for Uniqueness (Score: 9/10)
Pros:
- Simplicity: No complex conflict resolution
- Security: Cryptographically secure random (secrets module)
- Scalability: 1.6M possible suffixes per base slug
Cons:
- Ugliness: Suffix looks less clean (but rare occurrence)
- Unpredictability: User can't control suffix
Alternatives Considered:
- Incrementing numbers (hello-world-2, hello-world-3): More predictable but reveals note count
- Longer random suffix: More secure but uglier URLs
- User-specified slug: More complex, deferred to V2
Decision: 4-character random suffix is a good balance
Consequences
Positive
- Automatic: No user input required for slug
- Readable: Slugs are human-readable and meaningful
- Safe: Works on all platforms and browsers
- Unique: Collision resolution ensures uniqueness
- SEO-friendly: Descriptive URLs help search ranking
- Predictable: User can anticipate what slug will be
- Simple: Single, consistent algorithm
Negative
- Not editable: User can't customize slug in V1
- English-biased: Assumes space-separated words
- Unicode stripped: Non-ASCII content loses characters
- Content-dependent: Similar content = similar slugs
- Timestamp fallback: Short notes get ugly timestamp slugs
Mitigations
Non-editable slugs:
- V1 trade-off for simplicity
- V2 can add custom slug support
- Users can still reference notes by slug once created
English-bias:
- Acceptable for V1 (English-first IndieWeb)
- V2 can add Unicode slug support (requires more complex normalization)
Unicode stripping:
- Markdown content can still contain Unicode (only slug is ASCII)
- Timestamp fallback ensures note is still creatable
- V2 can use Unicode normalization (transliteration)
Timestamp fallback:
- Rare occurrence (most notes have >5 words)
- Still functional and unique
- V2 can improve (use first word if exists + timestamp)
Standards Compliance
URL Standards (RFC 3986)
Slugs comply with URL path segment requirements:
- No percent-encoding required
- No reserved characters (/, ?, #, etc.)
- Case-insensitive safe (always lowercase)
Filesystem Standards
Slugs work on all major filesystems:
- FAT32: Yes (no special chars, length OK)
- NTFS: Yes
- ext4: Yes
- APFS: Yes
- HFS+: Yes
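Reserved names: Windows reserves device names (CON, PRN, AUX, NUL, etc.) regardless of case, so a slug that normalizes to exactly one of them (for example con) is not safe as a filename there and should be rejected like a reserved slug.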
IndieWeb Recommendations
Aligns with IndieWeb permalink best practices:
- Descriptive URLs
- No query parameters
- Short and memorable
- Permanent (don't change after creation)
Implementation Requirements
Validation Rules
# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
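A minimal sketch of how these rules might be applied. The `is_valid_slug` helper is hypothetical. It uses `fullmatch` rather than `match` because `$` in Python also matches just before a trailing newline, which would let a slug ending in a newline through.

```python
import re

SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100


def is_valid_slug(slug: str) -> bool:
    """Check the length constraints first, then the character pattern."""
    if not (MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH):
        return False
    return SLUG_PATTERN.fullmatch(slug) is not None
```

The pattern rejects leading, trailing, and doubled hyphens, so a generated slug only passes if normalization has already collapsed and stripped them.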
Reserved Slugs
Certain slugs should be reserved for system routes:
Reserved List (reject these slugs):
- admin
- api
- static
- auth
- feed
- login
- logout
Implementation:
RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS
Error Cases
# Empty content
generate_slug("") # Raises ValueError
# Whitespace only
generate_slug(" ") # Raises ValueError
# Valid but short
generate_slug("Hi") # Returns timestamp: "20241118-143022"
# Special characters only
generate_slug("!@#$%") # Returns timestamp: "20241118-143022"
Alternatives Considered
UUID-based Slugs (Rejected)
slug = str(uuid.uuid4()) # "550e8400-e29b-41d4-a716-446655440000"
Pros: Guaranteed unique, no collision checking
Cons: Not human-readable, poor SEO, not memorable
Verdict: Violates principle of readable URLs
Hash-based Slugs (Rejected)
slug = hashlib.sha256(content.encode()).hexdigest()[:12] # "a591a6d40bf4"
Pros: Deterministic, unique
Cons: Not human-readable, changes if content edited
Verdict: Not meaningful to users
Title Extraction (Rejected for V1)
# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)
Pros: More semantic, uses actual title
Cons: Requires markdown parsing, more complex, title might not exist
Verdict: Deferred to V2 (V1 uses first N words which is simpler)
User-Specified Slugs (Rejected for V1)
def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)
Pros: Maximum user control, no surprises
Cons: Requires UI input, validation complexity, user burden
Verdict: Deferred to V2 (V1 auto-generates for simplicity)
Incrementing Numbers (Rejected)
# If collision, increment
slug = "hello-world"
slug = "hello-world-2" # Collision
slug = "hello-world-3" # Collision
Pros: Predictable, simple
Cons: Reveals note count, enumeration attack vector, less random
Verdict: Random suffix is more secure and scales better
Performance Considerations
Generation Speed
- Extract words: O(n) where n = content length (negligible, content is small)
- Normalize: O(m) where m = extracted text length (< 100 chars)
- Uniqueness check: O(log n) indexed database lookup (effectively constant at this scale)
- Random suffix: O(1) generation
Target: < 1ms per slug generation (easily achieved)
Database Impact
- Index on slug column: O(log n) lookup
- Collision rate: < 1% (most notes have unique first 5 words)
- Random suffix retries: Nearly never (1.6M combinations)
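A minimal sketch of the indexed lookup, assuming SQLite and a `notes` table with a `slug` column. The table, column, and index names are illustrative assumptions, not taken from this ADR. Making the index unique also lets the database reject a duplicate slug if two writers race past the application-level check.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create an assumed notes table with a unique index on slug."""
    conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, slug TEXT NOT NULL)")
    conn.execute("CREATE UNIQUE INDEX idx_notes_slug ON notes (slug)")


def slug_exists(conn: sqlite3.Connection, slug: str) -> bool:
    """Indexed existence check used for collision detection."""
    row = conn.execute(
        "SELECT 1 FROM notes WHERE slug = ? LIMIT 1", (slug,)
    ).fetchone()
    return row is not None
```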
Testing Requirements
Test Cases
Normal Cases:
- Standard English content → descriptive slug
- Content with punctuation → punctuation removed
- Content with numbers → numbers preserved
- Content with hyphens → hyphens preserved
Edge Cases:
- Very short content → timestamp fallback
- Empty content → ValueError
- Special characters only → timestamp fallback
- Very long words → truncated to max length
- Unicode content → stripped to ASCII
Collision Cases:
- Duplicate slug → random suffix added
- Multiple collisions → different random suffixes
- Reserved slug → rejected
Security Cases:
- Path traversal attempt (../../../etc/passwd)
- Special characters (<script>, %00, etc.)
- Very long input (>10,000 characters)
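The security cases can be exercised against the normalization steps alone. The sketch below is one possible shape for such a check, not the project's test suite: `normalize` is a hypothetical helper that condenses Steps 1, 2, and 4, and the hostile inputs are the ones listed above.

```python
import re


def normalize(text: str) -> str:
    """Steps 1, 2 and 4 of the algorithm, condensed for testing."""
    text = " ".join(text.split()[:5]).lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)   # drops '/', '.', '<', '>', '%'
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:100].rstrip("-")


hostile_inputs = [
    "../../../etc/passwd",
    "<script>alert(1)</script>",
    "note%00.md",
    "a" * 10_001,
]

for raw in hostile_inputs:
    slug = normalize(raw)
    # Only a-z, 0-9 and hyphens survive, and length is capped at 100
    assert re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", slug) and len(slug) <= 100
```

Path separators and dots are removed rather than escaped, so the traversal input collapses to `etcpasswd` and cannot leave the notes directory.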
Migration Path (V2)
Future enhancements that build on this foundation:
Custom Slugs
def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)
Unicode Support
def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    raise NotImplementedError  # planned for V2
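The NFKD step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the planned V2 design: `ascii_fold` is a hypothetical helper, and it only covers characters that decompose into an ASCII base plus combining marks. Scripts such as CJK have no such decomposition and are dropped entirely, which is why the plan above lists a transliteration library separately.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose accented characters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Café Über naïve"))  # Cafe Uber naive
print(repr(ascii_fold("日本語")))      # '' (nothing survives without transliteration)
```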
Title Extraction
def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    raise NotImplementedError  # planned for V2
Slug Editing
def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    raise NotImplementedError  # planned for V2
References
- RFC 3986 - URI Generic Syntax
- IndieWeb Permalink Design
- URL Slug Best Practices
- Python secrets Module
- ADR-004: File-Based Note Storage
Acceptance Criteria
- Slug generation creates valid, URL-safe slugs
- Slugs are descriptive (use first 5 words)
- Slugs are unique (collision detection + random suffix)
- Slugs meet length constraints (1-100 characters)
- Timestamp fallback works for short content
- Reserved slugs are rejected
- Unicode content is handled gracefully
- All edge cases tested
- Performance meets target (<1ms)
- Code follows Python coding standards
Approved: 2024-11-18
Architect: StarPunk Architect Agent