# ADR-007: Slug Generation Algorithm

## Status

Accepted

## Context

Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:

1. **User experience**: Slugs appear in URLs and should be readable/meaningful
2. **SEO**: Descriptive slugs improve search engine optimization
3. **File system**: Slugs become filenames, must be filesystem-safe
4. **Uniqueness**: Slugs must be unique across all notes
5. **Portability**: Slugs should work across different systems and browsers

The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.

## Decision

### Content-Based Slug Generation with Timestamp Fallback

**Primary Algorithm**: Extract first N words from content and normalize
**Fallback**: Timestamp-based slug when content is insufficient
**Uniqueness**: Random suffix when collision detected

### Algorithm Specification

#### Step 1: Extract Words

```python
# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)
```

#### Step 2: Normalize

```python
import re

# Convert to lowercase
text = text.lower()

# Replace spaces with hyphens
text = text.replace(" ", "-")

# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)

# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)

# Strip leading/trailing hyphens
text = text.strip('-')
```

#### Step 3: Validate Length

```python
# If slug too short or empty, use timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")
```

#### Step 4: Truncate

```python
# Limit to 100 characters; drop a hyphen left dangling by the cut
text = text[:100].rstrip('-')
```

#### Step 5: Check Uniqueness

```python
# If slug exists, add random 4-character suffix (retry until unique)
base = text
while slug_exists(text):
    text = f"{base}-{random_alphanumeric(4)}"
```

### Character Set

**Allowed characters**: `a-z`, `0-9`, `-` (hyphen)

**Rationale**:
- URL-safe without encoding
- Filesystem-safe on all platforms (Windows, Linux, macOS)
- 
  Human-readable
- No escaping required in HTML
- Compatible with DNS hostnames (if ever used)

### Examples

| Input Content | Generated Slug |
|--------------|----------------|
| "Hello World! This is my first note." | `hello-world-this-is-my` |
| "Testing... with special chars!@#" | `testing-with-special-chars` |
| "2024-11-18 Daily Journal Entry" | `2024-11-18-daily-journal-entry` |
| "A" (too short) | `20241118-143022` (timestamp) |
| " " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | `hello-world-a7c9` (random suffix) |

### Slug Uniqueness Strategy

**Collision Detection**: Check database for existing slug before use

**Resolution**: Append random 4-character suffix
- Character set: `a-z0-9` (36 characters)
- Combinations: 36^4 = 1,679,616 possible suffixes
- Collision probability: Negligible for reasonable note counts (with k notes already sharing a base slug, a new suffix collides with probability k / 1,679,616)

**Example**:
```
Original:  hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
```

### Timestamp Fallback Format

**Pattern**: `YYYYMMDD-HHMMSS`
**Example**: `20241118-143022`

**When Used**:
- Normalized slug is empty (after removing special characters)
- Normalized slug is too short (< 1 character)
- Not used for empty or whitespace-only content, which raises `ValueError` instead

**Rationale**:
- Unique unless two notes are created in the same second
- Sortable chronologically
- Still readable and meaningful
- No special characters required

## Rationale

### Content-Based Generation (Score: 9/10)

**Pros**:
- **Readability**: Users can understand URL meaning
- **SEO**: Search engines prefer descriptive URLs
- **Memorability**: Easier to remember and share
- **Meaningful**: Reflects note content

**Cons**:
- **Collisions**: Multiple notes might have similar titles
- **Changes**: Editing note doesn't update slug (by design)

### First 5 Words (Score: 8/10)

**Pros**:
- **Sufficient**: 5 words usually capture note topic
- **Concise**: Keeps URLs short and readable
- **Consistent**: Predictable slug length

**Cons**:
- **Arbitrary**: 5 is
  somewhat arbitrary (could be 3-7)
- **Language**: Assumes space-separated words (English-centric)

**Alternatives Considered**:
- First 3 words: Too short, often not descriptive
- First 10 words: Too long, URLs become unwieldy
- First line: Could be very long, harder to normalize
- First sentence: Variable length, complex to parse

**Decision**: 5 words is a good balance (configurable constant)

### Lowercase with Hyphens (Score: 10/10)

**Pros**:
- **URL Standard**: Common pattern (github.com, stackoverflow.com)
- **Readability**: Easier to read than underscores or camelCase
- **Compatibility**: Works everywhere
- **Simplicity**: One separator type only

**Cons**:
- None significant

### Alphanumeric Only (Score: 10/10)

**Pros**:
- **Safety**: No escaping required in URLs or filenames
- **Portability**: Works on all filesystems (FAT32, NTFS, ext4, APFS)
- **Predictability**: No ambiguity about character handling

**Cons**:
- **Unicode Loss**: Non-ASCII characters stripped (acceptable trade-off)

### Random Suffix for Uniqueness (Score: 9/10)

**Pros**:
- **Simplicity**: No complex conflict resolution
- **Security**: Cryptographically secure random (secrets module)
- **Scalability**: 1.6M possible suffixes per base slug

**Cons**:
- **Ugliness**: Suffix looks less clean (but rare occurrence)
- **Unpredictability**: User can't control suffix

**Alternatives Considered**:
- Incrementing numbers (`hello-world-2`, `hello-world-3`): More predictable but reveals note count
- Longer random suffix: More secure but uglier URLs
- User-specified slug: More complex, deferred to V2

**Decision**: 4-character random suffix is good balance

## Consequences

### Positive

1. **Automatic**: No user input required for slug
2. **Readable**: Slugs are human-readable and meaningful
3. **Safe**: Works on all platforms and browsers
4. **Unique**: Collision resolution ensures uniqueness
5. **SEO-friendly**: Descriptive URLs help search ranking
6. **Predictable**: User can anticipate what slug will be
7.
   **Simple**: Single, consistent algorithm

### Negative

1. **Not editable**: User can't customize slug in V1
2. **English-biased**: Assumes space-separated words
3. **Unicode stripped**: Non-ASCII content loses characters
4. **Content-dependent**: Similar content = similar slugs
5. **Timestamp fallback**: Short notes get ugly timestamp slugs

### Mitigations

**Non-editable slugs**:
- V1 trade-off for simplicity
- V2 can add custom slug support
- Users can still reference notes by slug once created

**English-bias**:
- Acceptable for V1 (English-first IndieWeb)
- V2 can add Unicode slug support (requires more complex normalization)

**Unicode stripping**:
- Markdown content can still contain Unicode (only slug is ASCII)
- Timestamp fallback ensures note is still creatable
- V2 can use Unicode normalization (transliteration)

**Timestamp fallback**:
- Rare occurrence (most notes have >5 words)
- Still functional and unique
- V2 can improve (use first word if exists + timestamp)

## Standards Compliance

### URL Standards (RFC 3986)

Slugs comply with URL path segment requirements:
- No percent-encoding required
- No reserved characters (`/`, `?`, `#`, etc.)
- Case-insensitive safe (always lowercase)

### Filesystem Standards

Slugs work on all major filesystems:
- **FAT32**: Yes (no special chars, length OK)
- **NTFS**: Yes
- **ext4**: Yes
- **APFS**: Yes
- **HFS+**: Yes

**Reserved names**: Multi-word slugs cannot conflict with OS reserved names (CON, PRN, etc.). A single-word slug such as `con` or `nul` can, because Windows matches these names case-insensitively and regardless of file extension, so such slugs need an explicit check.
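As a rough way to spot-check the compliance claims above, the sketch below tests a slug against the allowed character set, the 100-character limit, and the Windows reserved device names. The function name `check_slug_compliance` and the `WINDOWS_RESERVED` set are illustrative assumptions, not part of StarPunk; the pattern mirrors the validation rule given under Implementation Requirements.

```python
import re

# Same shape as the ADR's validation pattern: runs of a-z/0-9 joined by single hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MAX_SLUG_LENGTH = 100

# Windows reserved device names (assumed list for this sketch; matched
# case-insensitively by Windows, so lowercase is enough for slugs).
WINDOWS_RESERVED = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)


def check_slug_compliance(slug: str) -> list[str]:
    """Return a list of problems found; an empty list means the slug passed."""
    problems = []
    if not SLUG_RE.fullmatch(slug):
        problems.append("not in the a-z, 0-9, hyphen format")
    if len(slug) > MAX_SLUG_LENGTH:
        problems.append("longer than 100 characters")
    if slug in WINDOWS_RESERVED:
        problems.append("matches a Windows reserved device name")
    return problems


if __name__ == "__main__":
    for candidate in ("hello-world-this-is-my", "20241118-143022", "con", "Hello World"):
        print(candidate, check_slug_compliance(candidate))
```

Multi-word slugs pass the reserved-name check trivially; it only matters for single-word slugs such as `con` or `nul`.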
### IndieWeb Recommendations

Aligns with IndieWeb permalink best practices:
- Descriptive URLs
- No query parameters
- Short and memorable
- Permanent (don't change after creation)

## Implementation Requirements

### Validation Rules

```python
# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'

# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
```

### Reserved Slugs

Certain slugs should be reserved for system routes:

**Reserved List** (reject these slugs):
- `admin`
- `api`
- `static`
- `auth`
- `feed`
- `login`
- `logout`

Implementation:
```python
RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS
```

### Error Cases

```python
# Empty content
generate_slug("")  # Raises ValueError

# Whitespace only
generate_slug(" ")  # Raises ValueError

# Valid but short
generate_slug("Hi")  # Returns timestamp: "20241118-143022"

# Special characters only
generate_slug("!@#$%")  # Returns timestamp: "20241118-143022"
```

## Alternatives Considered

### UUID-based Slugs (Rejected)

```python
import uuid

slug = str(uuid.uuid4())  # "550e8400-e29b-41d4-a716-446655440000"
```

**Pros**: Guaranteed unique, no collision checking
**Cons**: Not human-readable, poor SEO, not memorable
**Verdict**: Violates principle of readable URLs

### Hash-based Slugs (Rejected)

```python
import hashlib

slug = hashlib.sha256(content.encode()).hexdigest()[:12]  # "a591a6d40bf4"
```

**Pros**: Deterministic, unique
**Cons**: Not human-readable, changes if content edited
**Verdict**: Not meaningful to users

### Title Extraction (Rejected for V1)

```python
# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)
```

**Pros**: More semantic, uses actual title
**Cons**: Requires markdown parsing, more complex, title might not exist
**Verdict**: Deferred to V2 (V1 uses first N words which is simpler)

### User-Specified Slugs (Rejected for V1)

```python
def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)
```

**Pros**: Maximum user control, no surprises
**Cons**: Requires UI input, validation complexity, user burden
**Verdict**: Deferred to V2 (V1 auto-generates for simplicity)

### Incrementing Numbers (Rejected)

```python
# If collision, increment
slug = "hello-world"
slug = "hello-world-2"  # Collision
slug = "hello-world-3"  # Collision
```

**Pros**: Predictable, simple
**Cons**: Reveals note count, enumeration attack vector, less random
**Verdict**: Random suffix is more secure and scales better

## Performance Considerations

### Generation Speed

- Extract words: O(n) where n = content length (negligible, content is small)
- Normalize: O(m) where m = extracted text length (< 100 chars)
- Uniqueness check: O(log n) indexed database lookup (effectively constant at this scale)
- Random suffix: O(1) generation

**Target**: < 1ms per slug generation (easily achieved)

### Database Impact

- Index on `slug` column: O(log n) lookup
- Collision rate: < 1% (most notes have unique first 5 words)
- Random suffix retries: Nearly never (1.6M combinations)

## Testing Requirements

### Test Cases

**Normal Cases**:
- Standard English content → descriptive slug
- Content with punctuation → punctuation removed
- Content with numbers → numbers preserved
- Content with hyphens → hyphens preserved

**Edge Cases**:
- Very short content → timestamp fallback
- Empty content → ValueError
- Special characters only → timestamp fallback
- Very long words → truncated to max length
- Unicode content → stripped to ASCII

**Collision Cases**:
- Duplicate slug → random suffix added
- Multiple collisions → different random suffixes
- Reserved slug → rejected

**Security Cases**:
- Path traversal attempt (`../../../etc/passwd`)
- Special characters (`