# ADR-007: Slug Generation Algorithm

## Status

Accepted

## Context

Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:

1. **User experience**: Slugs appear in URLs and should be readable and meaningful
2. **SEO**: Descriptive slugs improve search engine optimization
3. **File system**: Slugs become filenames and must be filesystem-safe
4. **Uniqueness**: Slugs must be unique across all notes
5. **Portability**: Slugs should work across different systems and browsers

The challenge is designing an algorithm that automatically creates readable, unique, safe slugs from note content.

## Decision

### Content-Based Slug Generation with Timestamp Fallback

**Primary Algorithm**: Extract first N words from content and normalize
**Fallback**: Timestamp-based slug when content is insufficient
**Uniqueness**: Random suffix when collision detected

### Algorithm Specification

#### Step 1: Extract Words
```python
# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)
```

#### Step 2: Normalize
```python
import re

# Convert to lowercase
text = text.lower()

# Replace spaces with hyphens
text = text.replace(" ", "-")

# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)

# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)

# Strip leading/trailing hyphens
text = text.strip('-')
```

#### Step 3: Validate Length
```python
# If slug is empty (shorter than 1 character), use timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")
```

#### Step 4: Truncate
```python
# Limit to 100 characters and drop any hyphen left at the cut point
text = text[:100].rstrip('-')
```

#### Step 5: Check Uniqueness
```python
# If slug exists, add random 4-character suffix
if slug_exists(text):
    text = f"{text}-{random_alphanumeric(4)}"
```
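
The five steps above can be combined into one function. The following is a minimal sketch, not the project's implementation: the `slug_exists` callback signature, the `MAX_RETRIES` limit, and the `RuntimeError` raised when retries run out are assumptions made for illustration. The suffix is drawn with the `secrets` module, which the Rationale section names.

```python
import re
import secrets
import string
from datetime import datetime
from typing import Callable

MAX_WORDS = 5                # "first N words" constant from this ADR
MAX_SLUG_LENGTH = 100
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9
MAX_RETRIES = 10             # hypothetical limit, not specified by the ADR


def normalize(text: str) -> str:
    """Step 2: lowercase, hyphenate, strip disallowed characters."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text)
    return text.strip("-")


def generate_slug(
    content: str,
    created_at: datetime,
    slug_exists: Callable[[str], bool],
) -> str:
    """Run Steps 1-5 and return a slug that slug_exists() reports as free."""
    if not content or not content.strip():
        raise ValueError("content must not be empty or whitespace-only")

    # Steps 1-2: first five words, normalized
    base = normalize(" ".join(content.split()[:MAX_WORDS]))

    # Step 3: timestamp fallback when nothing survives normalization
    if len(base) < 1:
        base = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate, then drop a hyphen left at the cut point
    base = base[:MAX_SLUG_LENGTH].rstrip("-")

    # Step 5: draw random suffixes until the candidate is unused
    candidate = base
    attempts = 0
    while slug_exists(candidate):
        if attempts == MAX_RETRIES:
            raise RuntimeError(f"no free slug found for {base!r}")
        suffix = "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(4))
        candidate = f"{base}-{suffix}"
        attempts += 1
    return candidate


existing = {"hello-world"}
now = datetime(2024, 11, 18, 14, 30, 22)
print(generate_slug("Hello World! This is my first note.", now, existing.__contains__))
# hello-world-this-is-my
print(generate_slug("!@#$%", now, existing.__contains__))
# 20241118-143022
```

Note that a suffixed slug can exceed the 100-character limit by up to five characters when the base is already at the limit. This sketch leaves that case as it is in the steps above.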

### Character Set

**Allowed characters**: `a-z`, `0-9`, `-` (hyphen)

**Rationale**:
- URL-safe without encoding
- Filesystem-safe on all platforms (Windows, Linux, macOS)
- Human-readable
- No escaping required in HTML
- Character set matches DNS hostname labels (if ever used; a label is limited to 63 characters)
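
The "URL-safe without encoding" claim can be checked with the standard library. This is a small illustrative sketch. The helper name `needs_percent_encoding` is hypothetical, and `urllib.parse.quote` is used only as a convenient reference for which characters require percent-encoding in a path segment.

```python
from urllib.parse import quote


def needs_percent_encoding(segment: str) -> bool:
    """Return True if percent-encoding the path segment would change it."""
    # safe="" makes quote() encode everything except unreserved characters
    return quote(segment, safe="") != segment


print(needs_percent_encoding("hello-world-this-is-my"))  # False
print(needs_percent_encoding("Hello World!"))            # True
```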

### Examples

| Input Content | Generated Slug |
|--------------|----------------|
| "Hello World! This is my first note." | `hello-world-this-is-my` |
| "Testing... with special chars!@#" | `testing-with-special-chars` |
| "2024-11-18 Daily Journal Entry" | `2024-11-18-daily-journal-entry` |
| "!@#$%" (nothing survives normalization) | `20241118-143022` (timestamp) |
| " " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | `hello-world-a7c9` (random suffix) |

### Slug Uniqueness Strategy

**Collision Detection**: Check database for existing slug before use

**Resolution**: Append random 4-character suffix, then check the suffixed slug as well
- Character set: `a-z0-9` (36 characters)
- Combinations: 36^4 = 1,679,616 possible suffixes
- Collision probability: A fresh suffix matches one of k existing suffixed slugs with probability k / 1,679,616, which is negligible for reasonable note counts

**Example**:
```
Original: hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
```
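
The collision arithmetic can be made concrete. The sketch below assumes suffixes are drawn uniformly at random. The function name `retry_probability` is illustrative, and the sample counts are arbitrary.

```python
SUFFIX_ALPHABET_SIZE = 36          # a-z plus 0-9
SUFFIX_LENGTH = 4
SUFFIX_SPACE = SUFFIX_ALPHABET_SIZE ** SUFFIX_LENGTH  # 1,679,616


def retry_probability(existing_suffixed: int) -> float:
    """Chance that one uniformly drawn suffix is already taken for a base slug."""
    return existing_suffixed / SUFFIX_SPACE


# Even with many notes sharing the same first five words, a retry stays unlikely.
for count in (1, 100, 10_000):
    print(f"{count:>6} existing suffixed slugs: {retry_probability(count):.4%} chance of a retry")
```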

### Timestamp Fallback Format

**Pattern**: `YYYYMMDD-HHMMSS`
**Example**: `20241118-143022`

**When Used**:
- Normalized slug is empty after removing special characters (shorter than the 1-character minimum)

Empty or whitespace-only content does not trigger the fallback. It raises `ValueError` instead.

**Rationale**:
- Unique unless two notes are created in the same second (that collision is then resolved by the random suffix)
- Sortable chronologically
- Still readable and meaningful
- No special characters required
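
A brief sketch of the fallback format, showing the chronological sorting property. The names `TIMESTAMP_FORMAT` and `timestamp_slug` are illustrative, not taken from the codebase.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"  # YYYYMMDD-HHMMSS


def timestamp_slug(created_at: datetime) -> str:
    """Format a creation time as a fallback slug."""
    return created_at.strftime(TIMESTAMP_FORMAT)


earlier = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
later = timestamp_slug(datetime(2024, 11, 19, 9, 5, 0))
print(earlier)          # 20241118-143022
print(earlier < later)  # True: string order matches chronological order
print(datetime.strptime(earlier, TIMESTAMP_FORMAT))  # 2024-11-18 14:30:22
```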

## Rationale

### Content-Based Generation (Score: 9/10)

**Pros**:
- **Readability**: Users can understand URL meaning
- **SEO**: Search engines prefer descriptive URLs
- **Memorability**: Easier to remember and share
- **Meaningful**: Reflects note content

**Cons**:
- **Collisions**: Multiple notes might start with the same words
- **Changes**: Editing a note doesn't update its slug (by design)

### First 5 Words (Score: 8/10)

**Pros**:
- **Sufficient**: 5 words usually capture note topic
- **Concise**: Keeps URLs short and readable
- **Consistent**: Predictable slug length

**Cons**:
- **Arbitrary**: 5 is somewhat arbitrary (could be 3-7)
- **Language**: Assumes space-separated words (English-centric)

**Alternatives Considered**:
- First 3 words: Too short, often not descriptive
- First 10 words: Too long, URLs become unwieldy
- First line: Could be very long, harder to normalize
- First sentence: Variable length, complex to parse

**Decision**: 5 words is a good balance (configurable constant)

### Lowercase with Hyphens (Score: 10/10)

**Pros**:
- **URL Standard**: Common pattern (github.com, stackoverflow.com)
- **Readability**: Easier to read than underscores or camelCase
- **Compatibility**: Works everywhere
- **Simplicity**: One separator type only

**Cons**:
- None significant

### Alphanumeric Only (Score: 10/10)

**Pros**:
- **Safety**: No escaping required in URLs or filenames
- **Portability**: Works on all filesystems (FAT32, NTFS, ext4, APFS)
- **Predictability**: No ambiguity about character handling

**Cons**:
- **Unicode Loss**: Non-ASCII characters stripped (acceptable trade-off)

### Random Suffix for Uniqueness (Score: 9/10)

**Pros**:
- **Simplicity**: No complex conflict resolution
- **Security**: Cryptographically secure random (secrets module)
- **Scalability**: About 1.68M possible suffixes per base slug

**Cons**:
- **Ugliness**: Suffix looks less clean (but rare occurrence)
- **Unpredictability**: User can't control suffix

**Alternatives Considered**:
- Incrementing numbers (`hello-world-2`, `hello-world-3`): More predictable but reveals note count
- Longer random suffix: More secure but uglier URLs
- User-specified slug: More complex, deferred to V2

**Decision**: 4-character random suffix is a good balance

## Consequences

### Positive

1. **Automatic**: No user input required for slug
2. **Readable**: Slugs are human-readable and meaningful
3. **Safe**: Works on all platforms and browsers
4. **Unique**: Collision resolution ensures uniqueness
5. **SEO-friendly**: Descriptive URLs help search ranking
6. **Predictable**: User can anticipate what slug will be
7. **Simple**: Single, consistent algorithm

### Negative

1. **Not editable**: User can't customize slug in V1
2. **English-biased**: Assumes space-separated words
3. **Unicode stripped**: Non-ASCII content loses characters
4. **Content-dependent**: Similar content = similar slugs
5. **Timestamp fallback**: Notes with no usable ASCII letters or digits get less readable timestamp slugs

### Mitigations

**Non-editable slugs**:
- V1 trade-off for simplicity
- V2 can add custom slug support
- Users can still reference notes by slug once created

**English bias**:
- Acceptable for V1 (English-first IndieWeb)
- V2 can add Unicode slug support (requires more complex normalization)

**Unicode stripping**:
- Markdown content can still contain Unicode (only slug is ASCII)
- Timestamp fallback ensures note is still creatable
- V2 can use Unicode normalization (transliteration)

**Timestamp fallback**:
- Rare occurrence (only when no ASCII letter or digit survives normalization)
- Still functional and unique
- V2 can improve (use first word if exists + timestamp)

## Standards Compliance

### URL Standards (RFC 3986)

Slugs comply with URL path segment requirements:
- No percent-encoding required
- No reserved characters (`/`, `?`, `#`, etc.)
- Case-insensitive safe (always lowercase)

### Filesystem Standards

Slugs work on all major filesystems:
- **FAT32**: Yes (no special chars, length OK)
- **NTFS**: Yes
- **ext4**: Yes
- **APFS**: Yes
- **HFS+**: Yes

**Reserved names**: Windows reserves device names such as CON, PRN, AUX, and NUL, matched case-insensitively. Content consisting only of such a word normalizes to a matching slug (for example `con`), so these slugs must be rejected or suffixed if Windows compatibility is required.

### IndieWeb Recommendations

Aligns with IndieWeb permalink best practices:
- Descriptive URLs
- No query parameters
- Short and memorable
- Permanent (don't change after creation)

## Implementation Requirements

### Validation Rules

```python
# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'

# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
```
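
One way these rules might be applied is sketched below. The function name `is_valid_slug` is an assumption. `re.fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100

_SLUG_RE = re.compile(SLUG_PATTERN)


def is_valid_slug(slug: str) -> bool:
    """Check length limits first, then the character pattern."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    return _SLUG_RE.fullmatch(slug) is not None


print(is_valid_slug("hello-world"))      # True
print(is_valid_slug("20241118-143022"))  # True
print(is_valid_slug("hello--world"))     # False (consecutive hyphens)
print(is_valid_slug("-hello"))           # False (leading hyphen)
print(is_valid_slug("Hello"))            # False (uppercase)
```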

### Reserved Slugs

Certain slugs should be reserved for system routes:

**Reserved List** (reject these slugs):
- `admin`
- `api`
- `static`
- `auth`
- `feed`
- `login`
- `logout`

Implementation:
```python
RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}

def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS
```

### Error Cases

```python
# Empty content
generate_slug("")  # Raises ValueError

# Whitespace only
generate_slug(" ")  # Raises ValueError

# Valid but short
generate_slug("Hi")  # Returns "hi" (meets the 1-character minimum)

# Special characters only
generate_slug("!@#$%")  # Returns timestamp: "20241118-143022"
```

## Alternatives Considered

### UUID-based Slugs (Rejected)

```python
import uuid

slug = str(uuid.uuid4())  # "550e8400-e29b-41d4-a716-446655440000"
```

**Pros**: Guaranteed unique, no collision checking
**Cons**: Not human-readable, poor SEO, not memorable

**Verdict**: Violates principle of readable URLs

### Hash-based Slugs (Rejected)

```python
import hashlib

slug = hashlib.sha256(content.encode()).hexdigest()[:12]  # "a591a6d40bf4"
```

**Pros**: Deterministic, collisions unlikely
**Cons**: Not human-readable, changes if content edited

**Verdict**: Not meaningful to users

### Title Extraction (Rejected for V1)

```python
# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)
```

**Pros**: More semantic, uses actual title
**Cons**: Requires markdown parsing, more complex, title might not exist

**Verdict**: Deferred to V2 (V1 uses first N words which is simpler)

### User-Specified Slugs (Rejected for V1)

```python
def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)
```

**Pros**: Maximum user control, no surprises
**Cons**: Requires UI input, validation complexity, user burden

**Verdict**: Deferred to V2 (V1 auto-generates for simplicity)

### Incrementing Numbers (Rejected)

```python
# If collision, increment
slug = "hello-world"
slug = "hello-world-2"  # Collision
slug = "hello-world-3"  # Collision
```

**Pros**: Predictable, simple
**Cons**: Reveals note count, enumeration attack vector, less random

**Verdict**: Random suffix is more secure and scales better

## Performance Considerations

### Generation Speed

- Extract words: O(n) where n = content length (negligible, content is small)
- Normalize: O(m) where m = extracted text length (< 100 chars)
- Uniqueness check: One indexed database lookup, O(log n) in the number of notes
- Random suffix: O(1) generation

**Target**: < 1ms per slug generation (easily achieved)

### Database Impact

- Index on `slug` column: O(log n) lookup
- Collision rate: Expected to be < 1% (most notes have unique first 5 words)
- Random suffix retries: Nearly never (about 1.68M combinations)
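
A rough way to check the in-memory part of the target is a `timeit` micro-benchmark. This sketch times only word extraction, normalization, and truncation. The database lookup is excluded, the helper name `first_words_slug` is hypothetical, and the sample text and run count are arbitrary.

```python
import re
import timeit


def first_words_slug(content: str) -> str:
    """Steps 1, 2 and 4 of the algorithm, without fallback or uniqueness check."""
    text = " ".join(content.split()[:5]).lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:100].rstrip("-")


sample = "Hello World! This is my first note. " * 200  # about 7 KB of content
runs = 1_000
seconds = timeit.timeit(lambda: first_words_slug(sample), number=runs)
print(f"{seconds / runs * 1e3:.4f} ms per slug (in-memory steps only)")
```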

## Testing Requirements

### Test Cases

**Normal Cases**:
- Standard English content → descriptive slug
- Content with punctuation → punctuation removed
- Content with numbers → numbers preserved
- Content with hyphens → hyphens preserved

**Edge Cases**:
- Very short content (a single word or character) → used as-is when it meets the 1-character minimum
- Empty content → ValueError
- Special characters only → timestamp fallback
- Very long words → truncated to max length
- Unicode content → stripped to ASCII

**Collision Cases**:
- Duplicate slug → random suffix added
- Multiple collisions → different random suffixes
- Reserved slug → rejected

**Security Cases**:
- Path traversal attempt (`../../../etc/passwd`) → dots and slashes removed
- Special characters (`<script>`, `%00`, etc.) → removed
- Very long input (>10,000 characters) → slug truncated to 100 characters
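
The security cases can be expressed as plain assertions against the normalization steps. The sketch below reimplements those steps in a hypothetical helper, `slug_from`, rather than importing the project's function, so it only shows the expected behavior of the specified algorithm.

```python
import re

ALLOWED = re.compile(r"[a-z0-9-]*")


def slug_from(content: str) -> str:
    """Steps 1, 2 and 4: first five words, normalized, truncated to 100 chars."""
    text = " ".join(content.split()[:5]).lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:100].rstrip("-")


hostile_inputs = {
    "../../../etc/passwd": "etcpasswd",
    "<script>alert(1)</script>": "scriptalert1script",
    "%00": "00",
    "a" * 10_001: "a" * 100,
}

for raw, expected in hostile_inputs.items():
    slug = slug_from(raw)
    assert slug == expected
    assert ALLOWED.fullmatch(slug)   # only a-z, 0-9 and hyphens remain
    assert len(slug) <= 100
print("all security cases pass")
```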

## Migration Path (V2)

Future enhancements that build on this foundation:

### Custom Slugs
```python
def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)
```

### Unicode Support
```python
def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    ...
```
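
The NFKD step could look like the following sketch, which uses only the standard library `unicodedata` module. It is an illustration of the idea, not the planned V2 design. It handles accented Latin letters only. Letters without a decomposition (such as `ß` or `ø`) and CJK text are dropped, which is why the outline above also lists a transliteration library.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose accented characters (NFKD) and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def generate_unicode_slug(content: str) -> str:
    """Fold to ASCII first, then apply the V1 normalization steps."""
    text = " ".join(ascii_fold(content).split()[:5]).lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    return re.sub(r"-+", "-", text).strip("-")


print(generate_unicode_slug("Café Über naïve"))  # cafe-uber-naive
print(repr(generate_unicode_slug("日本語")))      # '' (would still need the fallback)
```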

### Title Extraction
```python
def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    ...
```
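
One possible reading of that outline is sketched below. The `MAX_TITLE_LENGTH` threshold that decides when the first line is "too long" is an assumption, as is matching only level-1 `#` headings. The regex does not skip `#` lines inside fenced code blocks, which a real Markdown parser would.

```python
import re

MAX_WORDS = 5
MAX_TITLE_LENGTH = 80  # hypothetical cutoff for using the first line as a title

# A level-1 ATX heading: a single '#', whitespace, then the title text
HEADING_RE = re.compile(r"^#[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_title_from_content(content: str) -> str:
    """Prefer a '# heading', then a short first line, then the first N words."""
    match = HEADING_RE.search(content)
    if match:
        return match.group(1)
    first_line = next((ln.strip() for ln in content.splitlines() if ln.strip()), "")
    if len(first_line) <= MAX_TITLE_LENGTH:
        return first_line
    return " ".join(first_line.split()[:MAX_WORDS])


print(extract_title_from_content("# My Title\n\nBody text"))  # My Title
print(extract_title_from_content("Just a first line\nMore"))   # Just a first line
```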

### Slug Editing
```python
def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    ...
```

## References

- [RFC 3986 - URI Generic Syntax](https://www.rfc-editor.org/rfc/rfc3986)
- [IndieWeb Permalink Design](https://indieweb.org/permalink)
- [URL Slug Best Practices](https://moz.com/learn/seo/url)
- [Python secrets Module](https://docs.python.org/3/library/secrets.html)
- [ADR-004: File-Based Note Storage](ADR-004-file-based-note-storage.md)

## Acceptance Criteria

- [ ] Slug generation creates valid, URL-safe slugs
- [ ] Slugs are descriptive (use first 5 words)
- [ ] Slugs are unique (collision detection + random suffix)
- [ ] Slugs meet length constraints (1-100 characters)
- [ ] Timestamp fallback works when the normalized slug is empty
- [ ] Reserved slugs are rejected
- [ ] Unicode content is handled gracefully
- [ ] All edge cases tested
- [ ] Performance meets target (<1ms)
- [ ] Code follows Python coding standards

---

**Approved**: 2024-11-18
**Architect**: StarPunk Architect Agent