that initial commit

2025-11-18 19:21:31 -07:00
commit a68fd570c7
69 changed files with 31070 additions and 0 deletions
--- a/docs/decisions/ADR-007-slug-generation-algorithm.md
+++ b/docs/decisions/ADR-007-slug-generation-algorithm.md
@@ -0,0 +1,487 @@
+# ADR-007: Slug Generation Algorithm
+
+## Status
+Accepted
+
+## Context
+
+Notes in StarPunk require URL-safe identifiers (slugs) for permalinks and file naming. The slug generation algorithm is critical because:
+
+1. **User experience**: Slugs appear in URLs and should be readable/meaningful
+2. **SEO**: Descriptive slugs improve search engine optimization
+3. **File system**: Slugs become filenames and must be filesystem-safe
+4. **Uniqueness**: Slugs must be unique across all notes
+5. **Portability**: Slugs should work across different systems and browsers
+
+The challenge is designing an algorithm that creates readable, unique, safe slugs automatically from note content.
+
+## Decision
+
+### Content-Based Slug Generation with Timestamp Fallback
+
+**Primary Algorithm**: Extract first N words from content and normalize
+**Fallback**: Timestamp-based slug when content is insufficient
+**Uniqueness**: Random suffix when collision detected
+
+### Algorithm Specification
+
+#### Step 1: Extract Words
+```python
+# Extract first 5 words from content
+words = content.split()[:5]
+text = " ".join(words)
+```
+
+#### Step 2: Normalize
+```python
+# Convert to lowercase
+text = text.lower()
+
+# Replace spaces with hyphens
+text = text.replace(" ", "-")
+
+# Remove all characters except a-z, 0-9, and hyphens
+text = re.sub(r'[^a-z0-9-]', '', text)
+
+# Collapse multiple hyphens
+text = re.sub(r'-+', '-', text)
+
+# Strip leading/trailing hyphens
+text = text.strip('-')
+```
+
+#### Step 3: Validate Length
+```python
+# If the normalized slug is empty (shorter than MIN_SLUG_LENGTH = 1), use timestamp fallback
+if len(text) < 1:
+    text = created_at.strftime("%Y%m%d-%H%M%S")
+```
+
+#### Step 4: Truncate
+```python
+# Limit to 100 characters; drop a hyphen left dangling at the cut
+text = text[:100].rstrip('-')
+```
+
+#### Step 5: Check Uniqueness
+```python
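+# If slug exists, add random 4-character suffix (retry until unique)
+base = text[:95].rstrip('-')  # keep "-xxxx" within the 100-character limit
+while slug_exists(text):
+    text = f"{base}-{random_alphanumeric(4)}"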
+```
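The five steps can be combined into a single function. The following is a minimal sketch, not the project's implementation: the `normalize` helper, the constant names, and the `slug_exists` callable passed in as a parameter are assumptions made for illustration. The sketch retries until the suffixed slug is unique and shortens the base so the result stays within 100 characters; both are choices made here.

```python
import re
import secrets
import string
from datetime import datetime
from typing import Callable

MAX_WORDS = 5            # leading words used for the slug
MAX_SLUG_LENGTH = 100
SUFFIX_LENGTH = 4
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def normalize(text: str) -> str:
    """Lowercase, hyphenate, and drop everything outside a-z, 0-9, '-'."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text)
    return text.strip("-")


def generate_slug(content: str, created_at: datetime,
                  slug_exists: Callable[[str], bool]) -> str:
    if not content.strip():
        raise ValueError("content is empty or whitespace-only")

    # Steps 1-2: take the first words and normalize them
    slug = normalize(" ".join(content.split()[:MAX_WORDS]))

    # Step 3: timestamp fallback when nothing survives normalization
    if not slug:
        slug = created_at.strftime("%Y%m%d-%H%M%S")

    # Step 4: truncate, dropping a hyphen left at the cut
    slug = slug[:MAX_SLUG_LENGTH].rstrip("-")

    # Step 5: random suffix on collision, staying within the length limit
    base = slug[:MAX_SLUG_LENGTH - SUFFIX_LENGTH - 1].rstrip("-")
    while slug_exists(slug):
        suffix = "".join(secrets.choice(SUFFIX_ALPHABET)
                         for _ in range(SUFFIX_LENGTH))
        slug = f"{base}-{suffix}"
    return slug
```

Passing `slug_exists` as a callable keeps the sketch independent of any particular database layer; a test can supply `lambda s: s in taken` with a plain set.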
+
+### Character Set
+
+**Allowed characters**: `a-z`, `0-9`, `-` (hyphen)
+
+**Rationale**:
+- URL-safe without encoding
+- Filesystem-safe on all platforms (Windows, Linux, macOS)
+- Human-readable
+- No escaping required in HTML
+- Uses only characters valid in DNS hostname labels (if ever used; labels are limited to 63 characters)
+
+### Examples
+
+| Input Content | Generated Slug |
+|--------------|----------------|
+| "Hello World! This is my first note." | `hello-world-this-is-my` |
+| "Testing... with special chars!@#" | `testing-with-special-chars` |
+| "2024-11-18 Daily Journal Entry" | `2024-11-18-daily-journal-entry` |
+| "A" | `a` (meets the 1-character minimum) |
+| "!@#$%" (no letters or digits) | `20241118-143022` (timestamp) |
+| "   " (whitespace only) | Error: ValueError |
+| "Hello World" (duplicate) | `hello-world-a7c9` (random suffix) |
+
+### Slug Uniqueness Strategy
+
+**Collision Detection**: Check database for existing slug before use
+
+**Resolution**: Append random 4-character suffix
+- Character set: `a-z0-9` (36 characters)
+- Combinations: 36^4 = 1,679,616 possible suffixes
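+- Collision probability: A new suffix collides only if it matches one already used for the same base slug (k / 1,679,616 with k suffixes taken), which is negligible for reasonable note counts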
+
+**Example**:
+```
+Original:  hello-world
+Collision: hello-world-a7c9
+Collision: hello-world-x3k2
+```
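A short sketch of the suffix generation and the arithmetic behind the collision estimate. `random_alphanumeric` is the helper name used in Step 5; its body here is an assumption built on the standard `secrets` module, and `retry_probability` is a hypothetical helper added only to show the calculation.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits   # 36 characters
SUFFIX_LENGTH = 4
SUFFIX_SPACE = len(ALPHABET) ** SUFFIX_LENGTH        # 36**4


def random_alphanumeric(length: int = SUFFIX_LENGTH) -> str:
    """Return `length` characters drawn uniformly from a-z0-9."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def retry_probability(existing: int) -> float:
    """Chance that one fresh suffix matches any of `existing` suffixes
    already taken for the same base slug."""
    return existing / SUFFIX_SPACE


print(SUFFIX_SPACE)              # 1679616
print(random_alphanumeric())     # four random characters, different each run
print(retry_probability(100))    # chance of a retry with 100 suffixes taken
```

Even with 100 notes sharing one base slug, the chance that a new suffix needs a retry is about 0.006%.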
+
+### Timestamp Fallback Format
+
+**Pattern**: `YYYYMMDD-HHMMSS`
+**Example**: `20241118-143022`
+
+**When Used**:
+- Normalized slug is empty after removing special characters (shorter than the 1-character minimum)
+
+**Not Used When**:
+- Content is empty or whitespace-only (raises `ValueError` instead)
+
+**Rationale**:
+- Unique unless two notes are created in the same second (Step 5 then adds a suffix)
+- Sortable chronologically
+- Still readable and meaningful
+- No special characters required
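The format and the sortability claim can be checked with a few lines. This is an illustrative sketch; `timestamp_slug` and `TIMESTAMP_FORMAT` are hypothetical names.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d-%H%M%S"   # YYYYMMDD-HHMMSS


def timestamp_slug(created_at: datetime) -> str:
    """Format a creation time as a fallback slug."""
    return created_at.strftime(TIMESTAMP_FORMAT)


earlier = timestamp_slug(datetime(2024, 11, 18, 14, 30, 22))
later = timestamp_slug(datetime(2024, 11, 18, 14, 30, 23))

print(earlier)           # 20241118-143022
print(earlier < later)   # True: string order matches chronological order
```

Because every field is zero-padded and ordered from most to least significant, plain string comparison sorts these slugs chronologically.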
+
+## Rationale
+
+### Content-Based Generation (Score: 9/10)
+
+**Pros**:
+- **Readability**: Users can understand URL meaning
+- **SEO**: Search engines prefer descriptive URLs
+- **Memorability**: Easier to remember and share
+- **Meaningful**: Reflects note content
+
+**Cons**:
+- **Collisions**: Multiple notes might start with the same words
+- **Changes**: Editing note doesn't update slug (by design)
+
+### First 5 Words (Score: 8/10)
+
+**Pros**:
+- **Sufficient**: 5 words usually capture note topic
+- **Concise**: Keeps URLs short and readable
+- **Consistent**: Predictable slug length
+
+**Cons**:
+- **Arbitrary**: 5 is somewhat arbitrary (could be 3-7)
+- **Language**: Assumes space-separated words (English-centric)
+
+**Alternatives Considered**:
+- First 3 words: Too short, often not descriptive
+- First 10 words: Too long, URLs become unwieldy
+- First line: Could be very long, harder to normalize
+- First sentence: Variable length, complex to parse
+
+**Decision**: 5 words is a good balance (configurable constant)
+
+### Lowercase with Hyphens (Score: 10/10)
+
+**Pros**:
+- **URL Standard**: Common pattern (github.com, stackoverflow.com)
+- **Readability**: Easier to read than underscores or camelCase
+- **Compatibility**: Works everywhere
+- **Simplicity**: One separator type only
+
+**Cons**:
+- None significant
+
+### Alphanumeric Only (Score: 10/10)
+
+**Pros**:
+- **Safety**: No escaping required in URLs or filenames
+- **Portability**: Works on all filesystems (FAT32, NTFS, ext4, APFS)
+- **Predictability**: No ambiguity about character handling
+
+**Cons**:
+- **Unicode Loss**: Non-ASCII characters stripped (acceptable trade-off)
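To make the trade-off concrete, here is a small sketch that applies the Step 2 normalization to non-ASCII input. `normalize` is a hypothetical helper name wrapping those steps.

```python
import re


def normalize(text: str) -> str:
    """Step 2 normalization: lowercase, hyphenate, keep only a-z, 0-9, '-'."""
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text)
    return text.strip("-")


print(normalize("Café au lait"))          # caf-au-lait
print(repr(normalize("こんにちは世界")))     # '' (the timestamp fallback applies)
```

Accented letters are dropped rather than transliterated, and content written entirely in a non-Latin script normalizes to an empty string, which is the case the timestamp fallback covers.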
+
+### Random Suffix for Uniqueness (Score: 9/10)
+
+**Pros**:
+- **Simplicity**: No complex conflict resolution
+- **Security**: Cryptographically secure random (secrets module)
+- **Scalability**: About 1.68M possible suffixes per base slug
+
+**Cons**:
+- **Ugliness**: Suffix looks less clean (but rare occurrence)
+- **Unpredictability**: User can't control suffix
+
+**Alternatives Considered**:
+- Incrementing numbers (`hello-world-2`, `hello-world-3`): More predictable but reveals note count
+- Longer random suffix: More secure but uglier URLs
+- User-specified slug: More complex, deferred to V2
+
+**Decision**: 4-character random suffix is good balance
+
+## Consequences
+
+### Positive
+
+1. **Automatic**: No user input required for slug
+2. **Readable**: Slugs are human-readable and meaningful
+3. **Safe**: Works on all platforms and browsers
+4. **Unique**: Collision resolution ensures uniqueness
+5. **SEO-friendly**: Descriptive URLs help search ranking
+6. **Predictable**: User can anticipate what slug will be
+7. **Simple**: Single, consistent algorithm
+
+### Negative
+
+1. **Not editable**: User can't customize slug in V1
+2. **English-biased**: Assumes space-separated words
+3. **Unicode stripped**: Non-ASCII content loses characters
+4. **Content-dependent**: Similar content = similar slugs
+5. **Timestamp fallback**: Notes whose first words contain no ASCII letters or digits get less readable timestamp slugs
+
+### Mitigations
+
+**Non-editable slugs**:
+- V1 trade-off for simplicity
+- V2 can add custom slug support
+- Users can still reference notes by slug once created
+
+**English-bias**:
+- Acceptable for V1 (English-first IndieWeb)
+- V2 can add Unicode slug support (requires more complex normalization)
+
+**Unicode stripping**:
+- Markdown content can still contain Unicode (only slug is ASCII)
+- Timestamp fallback ensures note is still creatable
+- V2 can use Unicode normalization (transliteration)
+
+**Timestamp fallback**:
+- Rare occurrence (only when the first 5 words contain no ASCII letters or digits)
+- Still functional and unique
+- V2 can improve (use first word if exists + timestamp)
+
+## Standards Compliance
+
+### URL Standards (RFC 3986)
+
+Slugs comply with URL path segment requirements:
+- No percent-encoding required
+- No reserved characters (`/`, `?`, `#`, etc.)
+- Case-insensitive safe (always lowercase)
+
+### Filesystem Standards
+
+Slugs work on all major filesystems:
+- **FAT32**: Yes (no special chars, length OK)
+- **NTFS**: Yes
+- **ext4**: Yes
+- **APFS**: Yes
+- **HFS+**: Yes
+
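+**Reserved names**: Windows reserves device names such as CON, PRN, AUX, and NUL and matches them case-insensitively, even when a file extension is present. A slug such as `con` or `nul` can therefore conflict when used as a filename on Windows and would need to be rejected or given a suffix.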
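A guard that checks a slug against the Windows device names could look like the sketch below. Windows compares these names case-insensitively, so they are listed in lowercase to match the slug character set. The set and function names are hypothetical, and the list covers the commonly documented names rather than every variant.

```python
# Commonly documented Windows device names, lowercased to match slugs
WINDOWS_RESERVED_NAMES = (
    {"con", "prn", "aux", "nul"}
    | {f"com{n}" for n in range(1, 10)}
    | {f"lpt{n}" for n in range(1, 10)}
)


def is_windows_reserved(slug: str) -> bool:
    """True if the slug, used as a file stem, matches a Windows device name."""
    return slug.lower() in WINDOWS_RESERVED_NAMES


print(is_windows_reserved("con"))           # True
print(is_windows_reserved("hello-world"))   # False
```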
+
+### IndieWeb Recommendations
+
+Aligns with IndieWeb permalink best practices:
+- Descriptive URLs
+- No query parameters
+- Short and memorable
+- Permanent (don't change after creation)
+
+## Implementation Requirements
+
+### Validation Rules
+
+```python
+# Valid slug pattern
+SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
+
+# Constraints
+MIN_SLUG_LENGTH = 1
+MAX_SLUG_LENGTH = 100
+```
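A validator built on these constants might look like this minimal sketch. `is_valid_slug` is a hypothetical name; `re.fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

SLUG_PATTERN = r'^[a-z0-9]+(?:-[a-z0-9]+)*$'
MIN_SLUG_LENGTH = 1
MAX_SLUG_LENGTH = 100

_SLUG_RE = re.compile(SLUG_PATTERN)


def is_valid_slug(slug: str) -> bool:
    """Check length bounds first, then the character pattern."""
    if not MIN_SLUG_LENGTH <= len(slug) <= MAX_SLUG_LENGTH:
        return False
    return _SLUG_RE.fullmatch(slug) is not None


print(is_valid_slug("hello-world"))       # True
print(is_valid_slug("20241118-143022"))   # True
print(is_valid_slug("hello--world"))      # False: consecutive hyphens
print(is_valid_slug("-hello"))            # False: leading hyphen
```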
+
+### Reserved Slugs
+
+Certain slugs should be reserved for system routes:
+
+**Reserved List** (reject these slugs):
+- `admin`
+- `api`
+- `static`
+- `auth`
+- `feed`
+- `login`
+- `logout`
+
+Implementation:
+```python
+RESERVED_SLUGS = {'admin', 'api', 'static', 'auth', 'feed', 'login', 'logout'}
+
+def is_slug_reserved(slug: str) -> bool:
+    return slug in RESERVED_SLUGS
+```
+
+### Error Cases
+
+```python
+# Empty content
+generate_slug("")  # Raises ValueError
+
+# Whitespace only
+generate_slug("   ")  # Raises ValueError
+
+# Valid but short (meets MIN_SLUG_LENGTH, so no fallback)
+generate_slug("Hi")  # Returns "hi"
+
+# Special characters only
+generate_slug("!@#$%")  # Returns timestamp: "20241118-143022"
+```
+
+## Alternatives Considered
+
+### UUID-based Slugs (Rejected)
+
+```python
+slug = str(uuid.uuid4())  # "550e8400-e29b-41d4-a716-446655440000"
+```
+
+**Pros**: Guaranteed unique, no collision checking
+**Cons**: Not human-readable, poor SEO, not memorable
+
+**Verdict**: Violates principle of readable URLs
+
+### Hash-based Slugs (Rejected)
+
+```python
+slug = hashlib.sha256(content.encode()).hexdigest()[:12]  # "a591a6d40bf4"
+```
+
+**Pros**: Deterministic, unique
+**Cons**: Not human-readable, changes if content edited
+
+**Verdict**: Not meaningful to users
+
+### Title Extraction (Rejected for V1)
+
+```python
+# Extract from # heading or first line
+title = extract_title_from_markdown(content)
+slug = normalize(title)
+```
+
+**Pros**: More semantic, uses actual title
+**Cons**: Requires markdown parsing, more complex, title might not exist
+
+**Verdict**: Deferred to V2 (V1 uses first N words which is simpler)
+
+### User-Specified Slugs (Rejected for V1)
+
+```python
+def create_note(content, custom_slug=None):
+    if custom_slug:
+        slug = validate_and_use(custom_slug)
+    else:
+        slug = generate_slug(content)
+```
+
+**Pros**: Maximum user control, no surprises
+**Cons**: Requires UI input, validation complexity, user burden
+
+**Verdict**: Deferred to V2 (V1 auto-generates for simplicity)
+
+### Incrementing Numbers (Rejected)
+
+```python
+# If collision, increment
+slug = "hello-world"
+slug = "hello-world-2"  # Collision
+slug = "hello-world-3"  # Collision
+```
+
+**Pros**: Predictable, simple
+**Cons**: Reveals how many notes share a base slug, makes URLs easy to enumerate
+
+**Verdict**: Random suffix is more secure and scales better
+
+## Performance Considerations
+
+### Generation Speed
+
+- Extract words: O(n) where n = content length (negligible, content is small)
+- Normalize: O(m) where m = extracted text length (< 100 chars)
+- Uniqueness check: O(log n) indexed database lookup
+- Random suffix: O(1) generation
+
+**Target**: < 1ms per slug generation (easily achieved)
+
+### Database Impact
+
+- Index on `slug` column: O(log n) lookup
+- Collision rate: Expected to be low (most notes have unique first 5 words)
+- Random suffix retries: Very rare (1,679,616 combinations per base slug)
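The indexed lookup can be sketched with the standard `sqlite3` module. This assumes SQLite and a hypothetical `notes` table; the ADR itself only says "database". A unique index serves the lookup and also rejects a duplicate that slips in between the check and the insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, slug TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX idx_notes_slug ON notes (slug)")


def slug_exists(slug: str) -> bool:
    """Indexed lookup on the slug column."""
    row = conn.execute(
        "SELECT 1 FROM notes WHERE slug = ? LIMIT 1", (slug,)
    ).fetchone()
    return row is not None


conn.execute("INSERT INTO notes (slug) VALUES (?)", ("hello-world",))
print(slug_exists("hello-world"))   # True
print(slug_exists("other-note"))    # False

try:
    conn.execute("INSERT INTO notes (slug) VALUES (?)", ("hello-world",))
except sqlite3.IntegrityError:
    print("duplicate rejected by the unique index")
```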
+
+## Testing Requirements
+
+### Test Cases
+
+**Normal Cases**:
+- Standard English content → descriptive slug
+- Content with punctuation → punctuation removed
+- Content with numbers → numbers preserved
+- Content with hyphens → hyphens preserved
+
+**Edge Cases**:
+- Very short content (e.g. a single letter) → short slug, no fallback
+- Empty content → ValueError
+- Special characters only → timestamp fallback
+- Very long words → truncated to max length
+- Unicode content → stripped to ASCII
+
+**Collision Cases**:
+- Duplicate slug → random suffix added
+- Multiple collisions → different random suffixes
+- Reserved slug → rejected
+
+**Security Cases**:
+- Path traversal attempt (`../../../etc/passwd`)
+- Special characters (`<script>`, `%00`, etc.)
+- Very long input (>10,000 characters)
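The security cases can be exercised with plain assertions. The sketch below assumes a hypothetical `slug_candidate` helper that applies Steps 1, 2, and 4 without the database check.

```python
import re

SAFE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def normalize(text: str) -> str:
    text = text.lower().replace(" ", "-")
    text = re.sub(r"[^a-z0-9-]", "", text)
    text = re.sub(r"-+", "-", text)
    return text.strip("-")


def slug_candidate(raw: str) -> str:
    """First 5 words, normalized, truncated to 100 characters."""
    return normalize(" ".join(raw.split()[:5]))[:100].rstrip("-")


hostile_inputs = [
    "../../../etc/passwd",          # path traversal
    "<script>alert(1)</script>",    # markup injection
    "file%00name",                  # encoded NUL byte
    "a" * 10_000,                   # very long input
]

for raw in hostile_inputs:
    slug = slug_candidate(raw)
    assert SAFE.fullmatch(slug), raw
    assert len(slug) <= 100

print(slug_candidate("../../../etc/passwd"))   # etcpasswd
```

Because normalization is an allow-list, path separators, dots, angle brackets, and percent signs never reach the filename or URL.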
+
+## Migration Path (V2)
+
+Future enhancements that build on this foundation:
+
+### Custom Slugs
+```python
+def create_note(content, custom_slug=None):
+    slug = custom_slug or generate_slug(content)
+```
+
+### Unicode Support
+```python
+def generate_unicode_slug(content):
+    # Use Unicode normalization (NFKD)
+    # Transliterate to ASCII (unidecode library)
+    # Support CJK languages
+    raise NotImplementedError("planned for V2")
+```
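One possible starting point, using only the standard library, is to decompose accented characters with NFKD and drop the combining marks before applying the V1 normalization. This is a sketch under that assumption, and `ascii_fold` is a hypothetical helper. It does not cover CJK text or letters without a decomposition (such as `ß`), which would still need a transliteration library like the `unidecode` package mentioned above.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose accented characters (NFKD) and drop non-ASCII remnants."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def generate_unicode_slug(content: str, max_words: int = 5) -> str:
    text = ascii_fold(" ".join(content.split()[:max_words])).lower()
    # Same normalization as V1: hyphenate, filter, collapse, strip
    text = re.sub(r"[^a-z0-9-]", "", text.replace(" ", "-"))
    return re.sub(r"-+", "-", text).strip("-")


print(generate_unicode_slug("Café au lait"))   # cafe-au-lait
print(generate_unicode_slug("Crème brûlée"))   # creme-brulee
```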
+
+### Title Extraction
+```python
+def extract_title_from_content(content):
+    # Check for # heading
+    # Use first line if no heading
+    # Fall back to first N words
+    raise NotImplementedError("planned for V2")
+```
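A rough sketch of how the three steps might fit together. The interpretation is an assumption: a Markdown `#` heading wins, otherwise the first non-empty line is used and capped at the first N words.

```python
import re

MAX_WORDS = 5


def extract_title_from_content(content: str, max_words: int = MAX_WORDS) -> str:
    lines = content.splitlines()
    # Check for a "# Heading" line anywhere in the note
    for line in lines:
        match = re.match(r"#{1,6}\s+(.*\S)", line)
        if match:
            return match.group(1)
    # Otherwise use the first non-empty line, capped at the first N words
    for line in lines:
        if line.strip():
            return " ".join(line.split()[:max_words])
    return ""


print(extract_title_from_content("# My Title\n\nBody text"))   # My Title
print(extract_title_from_content("Just a first line\nmore"))   # Just a first line
```

The returned title would still go through the normal slug normalization before use.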
+
+### Slug Editing
+```python
+def update_note_slug(note_id, new_slug):
+    # Validate new slug
+    # Update database
+    # Rename file
+    # Create redirect from old slug
+    raise NotImplementedError("planned for V2")
+```
+
+## References
+
+- [RFC 3986 - URI Generic Syntax](https://www.rfc-editor.org/rfc/rfc3986)
+- [IndieWeb Permalink Design](https://indieweb.org/permalink)
+- [URL Slug Best Practices](https://moz.com/learn/seo/url)
+- [Python secrets Module](https://docs.python.org/3/library/secrets.html)
+- [ADR-004: File-Based Note Storage](ADR-004-file-based-note-storage.md)
+
+## Acceptance Criteria
+
+- [ ] Slug generation creates valid, URL-safe slugs
+- [ ] Slugs are descriptive (use first 5 words)
+- [ ] Slugs are unique (collision detection + random suffix)
+- [ ] Slugs meet length constraints (1-100 characters)
+- [ ] Timestamp fallback works when normalization leaves an empty slug
+- [ ] Reserved slugs are rejected
+- [ ] Unicode content is handled gracefully
+- [ ] All edge cases tested
+- [ ] Performance meets target (<1ms)
+- [ ] Code follows Python coding standards
+
+---
+
+**Approved**: 2024-11-18
+**Architect**: StarPunk Architect Agent