# ADR-007: Slug Generation Algorithm

## Status

Accepted

## Context

Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:

1. **User experience**: Slugs appear in URLs and should be readable/meaningful
2. **SEO**: Descriptive slugs improve search engine optimization
3. **File system**: Slugs become filenames, must be filesystem-safe
4. **Uniqueness**: Slugs must be unique across all notes
5. **Portability**: Slugs should work across different systems and browsers

The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.

## Decision

### Content-Based Slug Generation with Timestamp Fallback

**Primary Algorithm**: Extract first N words from content and normalize
**Fallback**: Timestamp-based slug when content is insufficient
**Uniqueness**: Random suffix when collision detected

### Algorithm Specification

#### Step 1: Extract Words
```python
# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)
```

#### Step 2: Normalize
```python
import re

# Convert to lowercase
text = text.lower()

# Replace spaces with hyphens
text = text.replace(" ", "-")

# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)

# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)

# Strip leading/trailing hyphens
text = text.strip('-')
```

#### Step 3: Validate Length
```python
# If slug too short or empty, use timestamp fallback
# (created_at is the note's creation datetime)
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")
```

#### Step 4: Truncate
```python
# Limit to 100 characters, then drop a hyphen left dangling by the cut
text = text[:100].rstrip('-')
```

#### Step 5: Check Uniqueness
```python
# If slug exists, add random 4-character suffix
if slug_exists(text):
    text = f"{text}-{random_alphanumeric(4)}"
```

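
The five steps above can be combined into a single function. The following is a minimal sketch under stated assumptions, not the project's actual implementation: the `slug_exists` callback, the `word_count` and `min_length` parameters, and the body of `random_alphanumeric` are hypothetical. With `min_length=1` (the Step 3 threshold as written) a one-word note such as `"Hi"` keeps the slug `hi`; a larger value sends short content to the timestamp fallback.

```python
import re
import secrets
import string
from datetime import datetime
from typing import Callable

MAX_SLUG_LENGTH = 100
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9


def random_alphanumeric(length: int = 4) -> str:
    """Return a random lowercase alphanumeric string (assumed helper)."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def generate_slug(
    content: str,
    created_at: datetime,
    slug_exists: Callable[[str], bool] = lambda slug: False,
    word_count: int = 5,
    min_length: int = 1,
) -> str:
    """Sketch of Steps 1-5; parameter names are illustrative."""
    if not content or not content.strip():
        raise ValueError("content must not be empty or whitespace-only")

    # Step 1: first N whitespace-separated words
    text = " ".join(content.split()[:word_count])

    # Step 2: lowercase, hyphenate, drop disallowed characters, tidy hyphens
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")

    # Step 3: timestamp fallback when too little usable text remains
    if len(text) < min_length:
        text = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate, then re-strip in case the cut landed on a hyphen
    text = text[:MAX_SLUG_LENGTH].rstrip("-")

    # Step 5: try random suffixes until the candidate is unused
    candidate = text
    while slug_exists(candidate):
        candidate = f"{text}-{random_alphanumeric(4)}"
    return candidate
```

As in Step 5, the suffix is appended after truncation, so a suffixed slug can reach 105 characters; trimming the base to 95 characters before adding the suffix would keep it within the 100-character limit.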
### Character Set

**Allowed characters**: `a-z`, `0-9`, `-` (hyphen)

**Rationale**:
- URL-safe without encoding
- Filesystem-safe on all platforms (Windows, Linux, macOS)
- Human-readable
- No escaping required in HTML
- Compatible with DNS hostnames (if ever used)

### Examples

| Input Content | Generated Slug |
|--------------|----------------|
| "Hello World! This is my first note." | `hello-world-this-is-my` |
| "Testing... with special chars!@#" | `testing-with-special-chars` |
| "2024-11-18 Daily Journal Entry" | `2024-11-18-daily-journal-entry` |
| "A" (too short) | `20241118-143022` (timestamp) |
| " " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | `hello-world-a7c9` (random suffix) |

### Slug Uniqueness Strategy

**Collision Detection**: Check database for existing slug before use

**Resolution**: Append random 4-character suffix
- Character set: `a-z0-9` (36 characters)
- Combinations: 36^4 = 1,679,616 possible suffixes
- Collision probability: A suffixed slug collides again only if the same 4 characters were already drawn for the same base slug, which is negligible for reasonable note counts

**Example**:
```
Original: hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
```

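
The resolution strategy can be sketched as follows. This is an illustration only: `unique_slug`, `retry_probability`, the in-memory `existing` set (standing in for the database check), and the `max_attempts` bound are all assumptions, not part of the decision above. It uses `secrets.choice` so the suffix comes from a cryptographically secure source.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 characters
SUFFIX_LENGTH = 4


def unique_slug(base: str, existing: set[str], max_attempts: int = 10) -> str:
    """Return base, or base plus a random suffix that is not in existing."""
    if base not in existing:
        return base
    for _ in range(max_attempts):
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(SUFFIX_LENGTH))
        candidate = f"{base}-{suffix}"
        if candidate not in existing:
            return candidate
    raise RuntimeError(f"no free suffix found for {base!r}")


def retry_probability(taken_suffixes: int) -> float:
    """Chance that a single random suffix hits one that is already taken."""
    return taken_suffixes / len(ALPHABET) ** SUFFIX_LENGTH
```

Even with 1,000 suffixed variants of the same base slug, `retry_probability(1000)` is about 0.0006, so a second draw is rarely needed.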
### Timestamp Fallback Format

**Pattern**: `YYYYMMDD-HHMMSS`
**Example**: `20241118-143022`

**When Used**:
- Normalized slug is empty (after removing special characters)
- Normalized slug is too short (< 1 character)

**Not Used**:
- Content is empty or whitespace-only (raises `ValueError` instead)

**Rationale**:
- Unique unless two notes are created in the same second, in which case the uniqueness check in Step 5 adds a random suffix
- Sortable chronologically
- Still readable and meaningful
- No special characters required

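
A short sketch of the fallback format, showing the two properties claimed in the rationale: string order matches chronological order, and two notes created within the same second receive the same value. The helper name `timestamp_slug` is an assumption for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"  # YYYYMMDD-HHMMSS


def timestamp_slug(created_at: datetime) -> str:
    """Format a creation time as a fallback slug."""
    return created_at.strftime(TIMESTAMP_FORMAT)


first = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
# Half a second later: sub-second precision is dropped, so the slug is identical
same_second = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22, 500000))
later = timestamp_slug(datetime(2024, 11, 19, 9, 0, 0))

print(first)                  # 20241118-143022
print(first == same_second)   # True
print(sorted([later, first])) # ['20241118-143022', '20241119-090000']
```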
## Rationale

### Content-Based Generation (Score: 9/10)

**Pros**:
- **Readability**: Users can understand URL meaning
- **SEO**: Search engines prefer descriptive URLs
- **Memorability**: Easier to remember and share
- **Meaningful**: Reflects note content

**Cons**:
- **Collisions**: Multiple notes might start with the same words
- **Changes**: Editing a note doesn't update its slug (by design)

### First 5 Words (Score: 8/10)

**Pros**:
- **Sufficient**: 5 words usually capture note topic
- **Concise**: Keeps URLs short and readable
- **Consistent**: Predictable slug length

**Cons**:
- **Arbitrary**: 5 is somewhat arbitrary (could be 3-7)
- **Language**: Assumes space-separated words (English-centric)

**Alternatives Considered**:
- First 3 words: Too short, often not descriptive
- First 10 words: Too long, URLs become unwieldy
- First line: Could be very long, harder to normalize
- First sentence: Variable length, complex to parse

**Decision**: 5 words is a good balance (configurable constant)

### Lowercase with Hyphens (Score: 10/10)

**Pros**:
- **URL Standard**: Common pattern (github.com, stackoverflow.com)
- **Readability**: Easier to read than underscores or camelCase
- **Compatibility**: Works everywhere
- **Simplicity**: One separator type only

**Cons**:
- None significant

### Alphanumeric Only (Score: 10/10)

**Pros**:
- **Safety**: No escaping required in URLs or filenames
- **Portability**: Works on all filesystems (FAT32, NTFS, ext4, APFS)
- **Predictability**: No ambiguity about character handling

**Cons**:
- **Unicode Loss**: Non-ASCII characters stripped (acceptable trade-off)

### Random Suffix for Uniqueness (Score: 9/10)

**Pros**:
- **Simplicity**: No complex conflict resolution
- **Security**: Cryptographically secure random (secrets module)
- **Scalability**: 1,679,616 (about 1.7M) possible suffixes per base slug

**Cons**:
- **Ugliness**: Suffix looks less clean (but rare occurrence)
- **Unpredictability**: User can't control suffix

**Alternatives Considered**:
- Incrementing numbers (`hello-world-2`, `hello-world-3`): More predictable but reveals note count
- Longer random suffix: More secure but uglier URLs
- User-specified slug: More complex, deferred to V2

**Decision**: A 4-character random suffix is a good balance

## Consequences

### Positive

1. **Automatic**: No user input required for slug
2. **Readable**: Slugs are human-readable and meaningful
3. **Safe**: Works on all platforms and browsers
4. **Unique**: Collision resolution ensures uniqueness
5. **SEO-friendly**: Descriptive URLs help search ranking
6. **Predictable**: User can anticipate what the slug will be
7. **Simple**: Single, consistent algorithm

### Negative

1. **Not editable**: User can't customize slug in V1
2. **English-biased**: Assumes space-separated words
3. **Unicode stripped**: Non-ASCII content loses characters
4. **Content-dependent**: Similar content = similar slugs
5. **Timestamp fallback**: Short notes get ugly timestamp slugs

### Mitigations

**Non-editable slugs**:
- V1 trade-off for simplicity
- V2 can add custom slug support
- Users can still reference notes by slug once created

**English-bias**:
- Acceptable for V1 (English-first IndieWeb)
- V2 can add Unicode slug support (requires more complex normalization)

**Unicode stripping**:
- Markdown content can still contain Unicode (only slug is ASCII)
- Timestamp fallback ensures note is still creatable
- V2 can use Unicode normalization (transliteration)

**Timestamp fallback**:
- Rare occurrence (most notes have >5 words)
- Still functional and unique
- V2 can improve (use first word if exists + timestamp)

## Standards Compliance

### URL Standards (RFC 3986)

Slugs comply with URL path segment requirements:
- No percent-encoding required
- No reserved characters (`/`, `?`, `#`, etc.)
- Case-insensitive safe (always lowercase)

### Filesystem Standards

Slugs work on all major filesystems:
- **FAT32**: Yes (no special chars, length OK)
- **NTFS**: Yes
- **ext4**: Yes
- **APFS**: Yes
- **HFS+**: Yes

**Reserved names**: Multi-word slugs do not conflict with OS reserved names (CON, PRN, etc.). Windows matches those device names case-insensitively, so a single-word slug such as `con` or `prn` would conflict and should be rejected or suffixed if notes may be stored on Windows.

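
One way to test the reserved-name question is a check like the one below. It is a sketch, not part of the accepted design: the function name is hypothetical, and the set lists the Windows device names CON, PRN, AUX, NUL, COM1 to COM9, and LPT1 to LPT9. Windows compares these names case-insensitively, so a lowercase single-word slug is worth checking explicitly.

```python
# Windows reserved device names, lowercased to match the slug character set
WINDOWS_RESERVED_NAMES = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)


def conflicts_with_windows_device(slug: str) -> bool:
    """Return True if the slug, used as a file stem, matches a device name."""
    return slug.lower() in WINDOWS_RESERVED_NAMES


print(conflicts_with_windows_device("con"))          # True
print(conflicts_with_windows_device("hello-world"))  # False
```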
### IndieWeb Recommendations

Aligns with IndieWeb permalink best practices:
- Descriptive URLs
- No query parameters
- Short and memorable
- Permanent (don't change after creation)

## Implementation Requirements

### Validation Rules

```python
# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'

# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
```

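
The constants above could be applied in a validator along these lines. The function name `is_valid_slug` is an assumption. The sketch uses `re.fullmatch` because in Python `$` also matches just before a trailing newline, so `re.match` alone would accept `"hello\n"`.

```python
import re

SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100

_SLUG_RE = re.compile(SLUG_PATTERN)


def is_valid_slug(slug: str) -> bool:
    """Check length bounds first, then the character pattern."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    # fullmatch requires the whole string to match, including any newline
    return _SLUG_RE.fullmatch(slug) is not None


print(is_valid_slug("hello-world"))      # True
print(is_valid_slug("20241118-143022"))  # True
print(is_valid_slug("a--b"))             # False
```

The pattern rejects leading, trailing, and doubled hyphens, which is why truncation has to avoid leaving a hyphen at the end of the slug.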
### Reserved Slugs

Certain slugs should be reserved for system routes:

**Reserved List** (reject these slugs):
- `admin`
- `api`
- `static`
- `auth`
- `feed`
- `login`
- `logout`

Implementation:
```python
RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS
```

### Error Cases

```python
# Empty content
generate_slug("") # Raises ValueError

# Whitespace only
generate_slug(" ") # Raises ValueError

# Valid but short
generate_slug("Hi") # Returns timestamp: "20241118-143022"

# Special characters only
generate_slug("!@#$%") # Returns timestamp: "20241118-143022"
```

## Alternatives Considered

### UUID-based Slugs (Rejected)

```python
import uuid

slug = str(uuid.uuid4()) # "550e8400-e29b-41d4-a716-446655440000"
```

**Pros**: Guaranteed unique, no collision checking
**Cons**: Not human-readable, poor SEO, not memorable

**Verdict**: Violates principle of readable URLs

### Hash-based Slugs (Rejected)

```python
import hashlib

slug = hashlib.sha256(content.encode()).hexdigest()[:12] # "a591a6d40bf4"
```

**Pros**: Deterministic, unique
**Cons**: Not human-readable, changes if content edited

**Verdict**: Not meaningful to users

### Title Extraction (Rejected for V1)

```python
# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)
```

**Pros**: More semantic, uses actual title
**Cons**: Requires markdown parsing, more complex, title might not exist

**Verdict**: Deferred to V2 (V1 uses first N words which is simpler)

### User-Specified Slugs (Rejected for V1)

```python
def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)
```

**Pros**: Maximum user control, no surprises
**Cons**: Requires UI input, validation complexity, user burden

**Verdict**: Deferred to V2 (V1 auto-generates for simplicity)

### Incrementing Numbers (Rejected)

```python
# If collision, increment
slug = "hello-world"
slug = "hello-world-2" # Collision
slug = "hello-world-3" # Collision
```

**Pros**: Predictable, simple
**Cons**: Reveals note count, enumeration attack vector, less random

**Verdict**: Random suffix is more secure and scales better

## Performance Considerations

### Generation Speed

- Extract words: O(n) where n = content length (negligible, content is small)
- Normalize: O(m) where m = extracted text length (5 words, typically < 100 chars)
- Uniqueness check: O(log n) indexed database lookup, where n = number of notes
- Random suffix: O(1) generation

**Target**: < 1ms per slug generation (easily achieved)

### Database Impact

- Index on `slug` column: O(log n) lookup
- Collision rate: expected < 1% (most notes have unique first 5 words)
- Random suffix retries: Nearly never (about 1.7M combinations)

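
The indexed lookup can be sketched with Python's built-in `sqlite3` module. Everything named here is an assumption for illustration: the `notes` table, the `idx_notes_slug` index, and the two helper functions. Making the index `UNIQUE` also lets the database reject a duplicate that slips in between the existence check and the insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, slug TEXT NOT NULL)")
# Unique index: fast lookups, and duplicates fail at insert time
conn.execute("CREATE UNIQUE INDEX idx_notes_slug ON notes (slug)")


def slug_exists(conn: sqlite3.Connection, slug: str) -> bool:
    """Indexed existence check for a slug."""
    row = conn.execute(
        "SELECT 1 FROM notes WHERE slug = ? LIMIT 1", (slug,)
    ).fetchone()
    return row is not None


def insert_note(conn: sqlite3.Connection, slug: str) -> bool:
    """Insert a note; return False if the slug is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO notes (slug) VALUES (?)", (slug,))
        return True
    except sqlite3.IntegrityError:
        return False
```

A caller that gets `False` from `insert_note` can generate a new random suffix and try again.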
## Testing Requirements

### Test Cases

**Normal Cases**:
- Standard English content → descriptive slug
- Content with punctuation → punctuation removed
- Content with numbers → numbers preserved
- Content with hyphens → hyphens preserved

**Edge Cases**:
- Very short content → timestamp fallback
- Empty content → ValueError
- Special characters only → timestamp fallback
- Very long words → truncated to max length
- Unicode content → stripped to ASCII

**Collision Cases**:
- Duplicate slug → random suffix added
- Multiple collisions → different random suffixes
- Reserved slug → rejected

**Security Cases**:
- Path traversal attempt (`../../../etc/passwd`)
- Special characters (`<script>`, `%00`, etc.)
- Very long input (>10,000 characters)

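
The security cases could be exercised with pytest-style tests like these. This is a sketch: `normalize` is a stand-in that reproduces only the Step 2 normalization and Step 4 truncation, and the test names are illustrative rather than the project's actual test suite.

```python
import re

MAX_SLUG_LENGTH = 100


def normalize(text: str) -> str:
    """Stand-in for Step 2 normalization followed by Step 4 truncation."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:MAX_SLUG_LENGTH].rstrip("-")


def test_path_traversal_is_neutralized():
    # Dots and slashes are outside the allowed set, so no path survives
    assert normalize("../../../etc/passwd") == "etcpasswd"


def test_markup_and_encoded_bytes_are_stripped():
    assert normalize("<script>alert(1)</script>") == "scriptalert1script"
    assert normalize("%00") == "00"


def test_very_long_input_is_bounded():
    slug = normalize("word " * 5000)  # 25,000 characters of input
    assert len(slug) <= MAX_SLUG_LENGTH
    assert re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", slug)
```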
## Migration Path (V2)

Future enhancements that build on this foundation:

### Custom Slugs
```python
def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)
```

### Unicode Support
```python
def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    raise NotImplementedError("planned for V2")
```

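
The NFKD step alone can be sketched with the standard library, with no third-party transliteration. The helper name `ascii_fold` is an assumption. It recovers base letters from accented Latin characters, but anything without an ASCII decomposition (CJK text, or letters such as `ß`) is dropped, which is why the stub above also lists a transliteration library.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose accented characters and keep only their ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks and other non-ASCII code points are discarded here
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Café Münchén"))  # Cafe Munchen
print(repr(ascii_fold("日本語")))   # ''
```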
### Title Extraction
```python
def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    raise NotImplementedError("planned for V2")
```

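
A minimal sketch of the first two steps in the stub, assuming a title is either the first level-one heading or the first non-empty line. The function name `extract_title` is hypothetical, and returning `None` is an illustrative way to tell the caller to fall back to the first N words. It does not parse Markdown, so a `# ` line inside a fenced code block would be misread as a heading.

```python
from typing import Optional


def extract_title(content: str) -> Optional[str]:
    """Return the first '# ' heading, else the first non-empty line."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    for line in lines:
        # Only level-one headings count; '## ' does not start with '# '
        if line.startswith("# "):
            return line[2:].strip()
    return lines[0] if lines else None


print(extract_title("intro\n# My Title\nbody"))  # My Title
print(extract_title("Just a line\nmore text"))   # Just a line
```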
### Slug Editing
```python
def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    raise NotImplementedError("planned for V2")
```

## References

- [RFC 3986 - URI Generic Syntax](https://www.rfc-editor.org/rfc/rfc3986)
- [IndieWeb Permalink Design](https://indieweb.org/permalink)
- [URL Slug Best Practices](https://moz.com/learn/seo/url)
- [Python secrets Module](https://docs.python.org/3/library/secrets.html)
- [ADR-004: File-Based Note Storage](/home/phil/Projects/starpunk/docs/decisions/ADR-004-file-based-note-storage.md)

## Acceptance Criteria

- [ ] Slug generation creates valid, URL-safe slugs
- [ ] Slugs are descriptive (use first 5 words)
- [ ] Slugs are unique (collision detection + random suffix)
- [ ] Slugs meet length constraints (1-100 characters)
- [ ] Timestamp fallback works for short content
- [ ] Reserved slugs are rejected
- [ ] Unicode content is handled gracefully
- [ ] All edge cases tested
- [ ] Performance meets target (<1ms)
- [ ] Code follows Python coding standards

---

**Approved**: 2024-11-18
**Architect**: StarPunk Architect Agent