that initial commit

2025-11-18 19:21:31 -07:00
commit a68fd570c7
69 changed files with 31070 additions and 0 deletions


@@ -0,0 +1,487 @@
# ADR-007: Slug Generation Algorithm
## Status
Accepted
## Context
Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:
1. **User experience**: Slugs appear in URLs and should be readable/meaningful
2. **SEO**: Descriptive slugs improve search engine optimization
3. **File system**: Slugs become filenames, must be filesystem-safe
4. **Uniqueness**: Slugs must be unique across all notes
5. **Portability**: Slugs should work across different systems and browsers
The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.
## Decision
### Content-Based Slug Generation with Timestamp Fallback
**Primary Algorithm**: Extract first N words from content and normalize
**Fallback**: Timestamp-based slug when content is insufficient
**Uniqueness**: Random suffix when collision detected
### Algorithm Specification
#### Step 1: Extract Words
```python
# Extract first 5 words from content
words = content.split()[:5]
text = " ".join(words)
```
#### Step 2: Normalize
```python
import re
# Convert to lowercase
text = text.lower()
# Replace spaces with hyphens
text = text.replace(" ", "-")
# Remove all characters except a-z, 0-9, and hyphens
text = re.sub(r'[^a-z0-9-]', '', text)
# Collapse multiple hyphens
text = re.sub(r'-+', '-', text)
# Strip leading/trailing hyphens
text = text.strip('-')
```
#### Step 3: Validate Length
```python
# If the normalized slug is empty, use the timestamp fallback
if len(text) < 1:
    text = created_at.strftime("%Y%m%d-%H%M%S")
```
#### Step 4: Truncate
```python
# Limit to 100 characters; strip again in case the cut lands on a hyphen
text = text[:100].rstrip('-')
```
#### Step 5: Check Uniqueness
```python
# If slug exists, add random 4-character suffix
# (the suffixed slug can itself collide, so it should be re-checked)
if slug_exists(text):
    text = f"{text}-{random_alphanumeric(4)}"
```
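Steps 1 through 5 can be combined into one function. The following is a minimal sketch, not the project's implementation: the `existing` set stands in for the database lookup behind `slug_exists`, the `created_at` parameter and the constant names are assumptions, and `random_alphanumeric` is written here with the `secrets` module. The extra hyphen strip after truncation is an addition of this sketch.

```python
import re
import secrets
import string
from datetime import datetime

WORD_COUNT = 5          # assumed name for the "first N words" constant
MAX_SLUG_LENGTH = 100
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def random_alphanumeric(length: int) -> str:
    """Return a random string drawn from a-z0-9 using the secrets module."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def generate_slug(content: str, created_at: datetime, existing: set[str]) -> str:
    """Sketch of Steps 1-5; `existing` is a stand-in for the database check."""
    if not content or not content.strip():
        raise ValueError("content must not be empty or whitespace-only")

    # Step 1: extract the first N words
    text = " ".join(content.split()[:WORD_COUNT])
    # Step 2: normalize to a-z, 0-9, and single hyphens
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    # Step 3: timestamp fallback when nothing survives normalization
    if len(text) < 1:
        text = created_at.strftime("%Y%m%d-%H%M%S")
    # Step 4: truncate, then drop a hyphen left at the cut point
    text = text[:MAX_SLUG_LENGTH].rstrip("-")
    # Step 5: append a random suffix on collision
    if text in existing:
        text = f"{text}-{random_alphanumeric(4)}"
    return text
```

Calling `generate_slug("Hello World! This is my first note.", datetime(2024, 11, 18, 14, 30, 22), set())` returns `hello-world-this-is-my`, which matches the first row of the examples table.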
### Character Set
**Allowed characters**: `a-z`, `0-9`, `-` (hyphen)
**Rationale**:
- URL-safe without encoding
- Filesystem-safe on all platforms (Windows, Linux, macOS)
- Human-readable
- No escaping required in HTML
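- Uses only characters permitted in DNS hostname labels (if ever used; note that DNS labels are limited to 63 characters, shorter than the 100-character slug limit)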
### Examples
| Input Content | Generated Slug |
|--------------|----------------|
| "Hello World! This is my first note." | `hello-world-this-is-my` |
| "Testing... with special chars!@#" | `testing-with-special-chars` |
| "2024-11-18 Daily Journal Entry" | `2024-11-18-daily-journal-entry` |
| "!@#$%" (no letters or digits) | `20241118-143022` (timestamp) |
| " " (whitespace only) | Error: ValueError |
| "Hello World" (duplicate) | `hello-world-a7c9` (random suffix) |
### Slug Uniqueness Strategy
**Collision Detection**: Check database for existing slug before use
**Resolution**: Append random 4-character suffix
- Character set: `a-z0-9` (36 characters)
- Combinations: 36^4 = 1,679,616 possible suffixes
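- Collision probability: with k suffixed slugs already sharing a base slug, a new suffix collides with probability k / 1,679,616, which is negligible for reasonable note counts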
**Example**:
```
Original: hello-world
Collision: hello-world-a7c9
Collision: hello-world-x3k2
```
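A suffixed slug can, in rare cases, collide as well, so the resolution step is safest as a bounded retry loop. The sketch below is one possible shape, under stated assumptions: `make_unique`, `MAX_ATTEMPTS`, and the `existing` set (a stand-in for the database check) are hypothetical names, and trimming the base so the suffixed result stays within 100 characters is an addition not spelled out in this ADR.

```python
import secrets
import string

SUFFIX_ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9, 36 characters
SUFFIX_LENGTH = 4
MAX_SLUG_LENGTH = 100
MAX_ATTEMPTS = 10  # hypothetical retry limit, not part of the ADR


def random_alphanumeric(length: int) -> str:
    """Return a random string drawn from a-z0-9 using the secrets module."""
    return "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(length))


def make_unique(base: str, existing: set[str]) -> str:
    """Return `base` if unused, otherwise `base` plus a random unused suffix."""
    if base not in existing:
        return base
    # Leave room for "-" plus the suffix so the result fits MAX_SLUG_LENGTH.
    trimmed = base[: MAX_SLUG_LENGTH - SUFFIX_LENGTH - 1].rstrip("-")
    for _ in range(MAX_ATTEMPTS):
        candidate = f"{trimmed}-{random_alphanumeric(SUFFIX_LENGTH)}"
        if candidate not in existing:
            return candidate
    raise RuntimeError(f"no free slug for {base!r} after {MAX_ATTEMPTS} attempts")
```

With 36^4 possible suffixes, the loop is expected to succeed on its first iteration in practice; the retry limit only guards against a pathological case.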
### Timestamp Fallback Format
**Pattern**: `YYYYMMDD-HHMMSS`
**Example**: `20241118-143022`
**When Used**:
- Normalized slug is empty, that is, shorter than 1 character (for example, content made up only of special or non-ASCII characters)
**When Not Used**:
- Content is empty or whitespace-only (a `ValueError` is raised instead)
**Rationale**:
- Unique at one-second resolution (two notes created in the same second fall through to the uniqueness check)
- Sortable chronologically
- Still readable and meaningful
- No special characters required
## Rationale
### Content-Based Generation (Score: 9/10)
**Pros**:
- **Readability**: Users can understand URL meaning
- **SEO**: Search engines prefer descriptive URLs
- **Memorability**: Easier to remember and share
- **Meaningful**: Reflects note content
**Cons**:
- **Collisions**: Multiple notes might start with the same words
- **Changes**: Editing note doesn't update slug (by design)
### First 5 Words (Score: 8/10)
**Pros**:
- **Sufficient**: 5 words usually capture note topic
- **Concise**: Keeps URLs short and readable
- **Consistent**: Predictable slug length
**Cons**:
- **Arbitrary**: 5 is somewhat arbitrary (could be 3-7)
- **Language**: Assumes space-separated words (English-centric)
**Alternatives Considered**:
- First 3 words: Too short, often not descriptive
- First 10 words: Too long, URLs become unwieldy
- First line: Could be very long, harder to normalize
- First sentence: Variable length, complex to parse
**Decision**: 5 words is a good balance (configurable constant)
### Lowercase with Hyphens (Score: 10/10)
**Pros**:
- **URL Standard**: Common pattern (github.com, stackoverflow.com)
- **Readability**: Easier to read than underscores or camelCase
- **Compatibility**: Works everywhere
- **Simplicity**: One separator type only
**Cons**:
- None significant
### Alphanumeric Only (Score: 10/10)
**Pros**:
- **Safety**: No escaping required in URLs or filenames
- **Portability**: Works on all filesystems (FAT32, NTFS, ext4, APFS)
- **Predictability**: No ambiguity about character handling
**Cons**:
- **Unicode Loss**: Non-ASCII characters stripped (acceptable trade-off)
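The effect of stripping non-ASCII characters is easiest to see on concrete input. The sketch below applies only the Step 2 normalization; the helper name `normalize` is an assumption made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Apply the Step 2 rules: lowercase, hyphenate, keep only a-z0-9 and '-'."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text)
    return text.strip("-")


print(normalize("Caf\u00e9 au lait"))  # caf-au-lait  (the accented letter is dropped, not transliterated)
print(normalize("\u00dcber uns"))      # ber-uns
print(repr(normalize("日本語 ノート")))  # ''  (nothing survives, so the timestamp fallback applies)
```

Accented letters are removed rather than mapped to their base letter, and content in non-Latin scripts normalizes to an empty string.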
### Random Suffix for Uniqueness (Score: 9/10)
**Pros**:
- **Simplicity**: No complex conflict resolution
- **Security**: Cryptographically secure random (secrets module)
- **Scalability**: About 1.68M possible suffixes per base slug
**Cons**:
- **Ugliness**: Suffix looks less clean (but rare occurrence)
- **Unpredictability**: User can't control suffix
**Alternatives Considered**:
- Incrementing numbers (`hello-world-2`, `hello-world-3`): More predictable but reveals note count
- Longer random suffix: More secure but uglier URLs
- User-specified slug: More complex, deferred to V2
**Decision**: 4-character random suffix is good balance
## Consequences
### Positive
1. **Automatic**: No user input required for slug
2. **Readable**: Slugs are human-readable and meaningful
3. **Safe**: Works on all platforms and browsers
4. **Unique**: Collision resolution ensures uniqueness
5. **SEO-friendly**: Descriptive URLs help search ranking
6. **Predictable**: User can anticipate what slug will be
7. **Simple**: Single, consistent algorithm
### Negative
1. **Not editable**: User can't customize slug in V1
2. **English-biased**: Assumes space-separated words
3. **Unicode stripped**: Non-ASCII content loses characters
4. **Content-dependent**: Similar content = similar slugs
5. **Timestamp fallback**: Notes whose first 5 words contain no ASCII letters or digits get less readable timestamp slugs
### Mitigations
**Non-editable slugs**:
- V1 trade-off for simplicity
- V2 can add custom slug support
- Users can still reference notes by slug once created
**English-bias**:
- Acceptable for V1 (English-first IndieWeb)
- V2 can add Unicode slug support (requires more complex normalization)
**Unicode stripping**:
- Markdown content can still contain Unicode (only slug is ASCII)
- Timestamp fallback ensures note is still creatable
- V2 can use Unicode normalization (transliteration)
**Timestamp fallback**:
- Rare occurrence (only triggered when the first 5 words contain no ASCII letters or digits)
- Still functional and unique
- V2 can improve (use first word if exists + timestamp)
## Standards Compliance
### URL Standards (RFC 3986)
Slugs comply with URL path segment requirements:
- No percent-encoding required
- No reserved characters (`/`, `?`, `#`, etc.)
- Case-insensitive safe (always lowercase)
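The "no percent-encoding required" property can be checked with the standard library. This is a small sketch; the helper name `needs_percent_encoding` is hypothetical.

```python
from urllib.parse import quote


def needs_percent_encoding(segment: str) -> bool:
    """Return True if quoting the path segment would change it."""
    return quote(segment, safe="") != segment


print(needs_percent_encoding("hello-world-a7c9"))  # False
print(needs_percent_encoding("Hello World!"))      # True
print(quote("Hello World!", safe=""))              # Hello%20World%21
```

Letters, digits, and hyphens are all unreserved characters in RFC 3986, so a slug built from them passes through `quote` unchanged.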
### Filesystem Standards
Slugs work on all major filesystems:
- **FAT32**: Yes (no special chars, length OK)
- **NTFS**: Yes
- **ext4**: Yes
- **APFS**: Yes
- **HFS+**: Yes
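**Reserved names**: Windows reserves device names such as CON, PRN, AUX, and NUL case-insensitively, so a note whose content normalizes to `con`, `prn`, `aux`, or `nul` would conflict with them. If Windows hosts are in scope, these names should be added to the reserved slug list.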
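Whether a generated slug matches a Windows reserved device name can be checked mechanically. The sketch below covers the commonly documented set (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9); the names `WINDOWS_RESERVED_NAMES` and `is_windows_reserved` are assumptions for illustration, not part of this ADR.

```python
# Slugs are always lowercase, so the names are stored in lowercase.
WINDOWS_RESERVED_NAMES = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)


def is_windows_reserved(slug: str) -> bool:
    """Return True if the slug equals a Windows reserved device name."""
    return slug in WINDOWS_RESERVED_NAMES


print(is_windows_reserved("con"))          # True
print(is_windows_reserved("hello-world"))  # False
```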
### IndieWeb Recommendations
Aligns with IndieWeb permalink best practices:
- Descriptive URLs
- No query parameters
- Short and memorable
- Permanent (don't change after creation)
## Implementation Requirements
### Validation Rules
```python
# Valid slug pattern
SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
# Constraints
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100
```
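The pattern and length constraints can be combined into a single check. This is a minimal sketch; the function name `is_valid_slug` is an assumption, and `fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100


def is_valid_slug(slug: str) -> bool:
    """Return True if the slug satisfies both the length and pattern rules."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    return SLUG_PATTERN.fullmatch(slug) is not None


print(is_valid_slug("hello-world-a7c9"))  # True
print(is_valid_slug("20241118-143022"))   # True
print(is_valid_slug("hello--world"))      # False (consecutive hyphens)
print(is_valid_slug("-hello"))            # False (leading hyphen)
```

The pattern rejects leading, trailing, and doubled hyphens, which is why a slug truncated at a hyphen has to be stripped again before validation.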
### Reserved Slugs
Certain slugs should be reserved for system routes:
**Reserved List** (reject these slugs):
- `admin`
- `api`
- `static`
- `auth`
- `feed`
- `login`
- `logout`
Implementation:
```python
RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}
def is_slug_reserved(slug: str) -> bool:
    return slug in RESERVED_SLUGS
```
### Error Cases
```python
# Empty content
generate_slug("") # Raises ValueError
# Whitespace only
generate_slug(" ") # Raises ValueError
# Valid but short: meets MIN_SLUG_LENGTH, so no fallback is needed
generate_slug("Hi") # Returns "hi"
# Special characters only
generate_slug("!@#$%") # Returns timestamp: "20241118-143022"
```
## Alternatives Considered
### UUID-based Slugs (Rejected)
```python
import uuid
slug = str(uuid.uuid4()) # "550e8400-e29b-41d4-a716-446655440000"
```
**Pros**: Guaranteed unique, no collision checking
**Cons**: Not human-readable, poor SEO, not memorable
**Verdict**: Violates principle of readable URLs
### Hash-based Slugs (Rejected)
```python
import hashlib
slug = hashlib.sha256(content.encode()).hexdigest()[:12] # "a591a6d40bf4"
```
**Pros**: Deterministic, unique
**Cons**: Not human-readable, changes if content edited
**Verdict**: Not meaningful to users
### Title Extraction (Rejected for V1)
```python
# Extract from # heading or first line
title = extract_title_from_markdown(content)
slug = normalize(title)
```
**Pros**: More semantic, uses actual title
**Cons**: Requires markdown parsing, more complex, title might not exist
**Verdict**: Deferred to V2 (V1 uses first N words which is simpler)
### User-Specified Slugs (Rejected for V1)
```python
def create_note(content, custom_slug=None):
    if custom_slug:
        slug = validate_and_use(custom_slug)
    else:
        slug = generate_slug(content)
```
**Pros**: Maximum user control, no surprises
**Cons**: Requires UI input, validation complexity, user burden
**Verdict**: Deferred to V2 (V1 auto-generates for simplicity)
### Incrementing Numbers (Rejected)
```python
# If collision, increment
slug = "hello-world"
slug = "hello-world-2" # Collision
slug = "hello-world-3" # Collision
```
**Pros**: Predictable, simple
**Cons**: Reveals how many notes share a base slug and makes URLs easy to enumerate
**Verdict**: Random suffix is more secure and scales better
## Performance Considerations
### Generation Speed
- Extract words: O(n) where n = content length (negligible, content is small)
- Normalize: O(m) where m = extracted text length (< 100 chars)
- Uniqueness check: O(log n) indexed database lookup (effectively constant at V1 note counts)
- Random suffix: O(1) generation
**Target**: < 1ms per slug generation (easily achieved)
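The target can be spot-checked with `timeit`. The sketch below times only the in-memory extract-and-normalize work and leaves out the database lookup, so it measures a lower bound; the `normalize` helper and the sample text are assumptions for illustration.

```python
import re
import timeit


def normalize(content: str) -> str:
    """Steps 1, 2, and 4: first 5 words, normalized and truncated."""
    text = " ".join(content.split()[:5]).lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:100].rstrip("-")


SAMPLE = "Hello World! This is my first note."
RUNS = 10_000

# Average wall-clock time per call, in seconds.
seconds_per_call = timeit.timeit(lambda: normalize(SAMPLE), number=RUNS) / RUNS
print(f"{seconds_per_call * 1000:.4f} ms per call")
```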
### Database Impact
- Index on `slug` column: O(log n) lookup
- Collision rate: expected to be below 1% (an estimate, assuming most notes have distinct first 5 words)
- Random suffix retries: Nearly never (about 1.68M combinations)
## Testing Requirements
### Test Cases
**Normal Cases**:
- Standard English content → descriptive slug
- Content with punctuation → punctuation removed
- Content with numbers → numbers preserved
- Content with hyphens → hyphens preserved
**Edge Cases**:
- Very short content (1-4 words) → slug built from the available words
- Empty content → ValueError
- Special characters only → timestamp fallback
- Very long words → truncated to max length
- Unicode content → stripped to ASCII
**Collision Cases**:
- Duplicate slug → random suffix added
- Multiple collisions → different random suffixes
- Reserved slug → rejected
**Security Cases**:
- Path traversal attempt (`../../../etc/passwd`)
- Special characters (`<script>`, `%00`, etc.)
- Very long input (>10,000 characters)
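The security cases can be expressed as plain assertions against the normalization rules. This is a sketch under stated assumptions: `slug_from` is a hypothetical helper that covers Steps 1, 2, and 4 only, without the fallback or uniqueness steps.

```python
import re


def slug_from(content: str) -> str:
    """Steps 1, 2, and 4: first 5 words, normalized and truncated."""
    text = " ".join(content.split()[:5]).lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text).strip("-")
    return text[:100].rstrip("-")


# Path traversal: dots and slashes are stripped, so no path separator survives.
assert slug_from("../../../etc/passwd") == "etcpasswd"
# Markup and encoded bytes: angle brackets, parentheses, and '%' are removed.
assert slug_from("<script>alert(1)</script>") == "scriptalert1script"
assert slug_from("file%00.md") == "file00md"
# Very long input: the result never exceeds the 100-character limit.
assert len(slug_from("a" * 10_000)) == 100
```

Because the allowed character set excludes `.`, `/`, `<`, `>`, and `%`, these inputs cannot produce a slug that escapes the notes directory or injects markup.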
## Migration Path (V2)
Future enhancements that build on this foundation:
### Custom Slugs
```python
def create_note(content, custom_slug=None):
    slug = custom_slug or generate_slug(content)
```
### Unicode Support
```python
def generate_unicode_slug(content):
    # Use Unicode normalization (NFKD)
    # Transliterate to ASCII (unidecode library)
    # Support CJK languages
    raise NotImplementedError  # planned for V2
```
### Title Extraction
```python
def extract_title_from_content(content):
    # Check for # heading
    # Use first line if no heading
    # Fall back to first N words
    raise NotImplementedError  # planned for V2
```
### Slug Editing
```python
def update_note_slug(note_id, new_slug):
    # Validate new slug
    # Update database
    # Rename file
    # Create redirect from old slug
    raise NotImplementedError  # planned for V2
```
## References
- [RFC 3986 - URI Generic Syntax](https://www.rfc-editor.org/rfc/rfc3986)
- [IndieWeb Permalink Design](https://indieweb.org/permalink)
- [URL Slug Best Practices](https://moz.com/learn/seo/url)
- [Python secrets Module](https://docs.python.org/3/library/secrets.html)
- [ADR-004: File-Based Note Storage](/home/phil/Projects/starpunk/docs/decisions/ADR-004-file-based-note-storage.md)
## Acceptance Criteria
- [ ] Slug generation creates valid, URL-safe slugs
- [ ] Slugs are descriptive (use first 5 words)
- [ ] Slugs are unique (collision detection + random suffix)
- [ ] Slugs meet length constraints (1-100 characters)
- [ ] Timestamp fallback works when the normalized slug is empty
- [ ] Reserved slugs are rejected
- [ ] Unicode content is handled gracefully
- [ ] All edge cases tested
- [ ] Performance meets target (<1ms)
- [ ] Code follows Python coding standards
---
**Approved**: 2024-11-18
**Architect**: StarPunk Architect Agent