Files
StarPunk/docs/design/phase-1.1-core-utilities.md
2025-11-18 19:21:31 -07:00

1401 lines
39 KiB
Markdown

# Phase 1.1: Core Utilities Design
## Overview
This document provides a complete, implementation-ready design for Phase 1.1 of the StarPunk V1 implementation plan: Core Utilities. The utilities module (`starpunk/utils.py`) provides foundational functions used by all other components of the system.
**Priority**: CRITICAL - Required by all other features
**Estimated Effort**: 2-3 hours
**Dependencies**: None
**File**: `starpunk/utils.py`
## Design Principles
1. **Pure functions** - No side effects where possible
2. **Type safety** - Full type hints on all functions
3. **Error handling** - Specific exceptions with clear messages
4. **Security first** - Validate all inputs, prevent path traversal
5. **Testable** - Small, focused functions that are easy to test
6. **Well documented** - Comprehensive docstrings with examples
## Module Structure
```python
"""
Core utility functions for StarPunk
This module provides essential utilities for slug generation, file operations,
hashing, and date/time handling. These utilities are used throughout the
application and have no external dependencies beyond standard library and
Flask configuration.
"""
# Standard library imports
import hashlib
import re
import secrets
from datetime import datetime
from pathlib import Path
from typing import Optional
# Third-party imports
from flask import current_app
# Constants
MAX_SLUG_LENGTH = 100
MIN_SLUG_LENGTH = 1
SLUG_WORDS_COUNT = 5
RANDOM_SUFFIX_LENGTH = 4
CONTENT_HASH_ALGORITHM = "sha256"
SAFE_SLUG_PATTERN = re.compile(r'[^a-z0-9-]')
MULTIPLE_HYPHENS_PATTERN = re.compile(r'-+')
# Utility functions (defined below)
```
## Function Specifications
### 1. Slug Generation
#### `generate_slug(content: str, created_at: Optional[datetime] = None) -> str`
**Purpose**: Generate a URL-safe slug from note content or timestamp.
**Algorithm**:
1. Extract first N words from content (configurable, default 5)
2. Convert to lowercase
3. Replace spaces with hyphens
4. Remove all characters except a-z, 0-9, and hyphens
5. Remove leading/trailing hyphens
6. Collapse multiple consecutive hyphens to single hyphen
7. Truncate to maximum length
8. If result is empty or too short, fall back to timestamp-based slug
9. Return slug (uniqueness check handled by caller)
**Type Signature**:
```python
def generate_slug(content: str, created_at: Optional[datetime] = None) -> str:
"""
Generate URL-safe slug from note content
Creates a slug by extracting the first few words from the content and
normalizing them to lowercase with hyphens. If content is insufficient,
falls back to timestamp-based slug.
Args:
content: The note content (markdown text)
created_at: Optional timestamp for fallback slug (defaults to now)
Returns:
URL-safe slug string (lowercase, alphanumeric + hyphens only)
Raises:
ValueError: If content is empty or contains only whitespace
Examples:
>>> generate_slug("Hello World! This is my first note.")
'hello-world-this-is-my'
>>> generate_slug("Testing... with special chars!@#")
'testing-with-special-chars'
>>> generate_slug("A") # Too short, uses timestamp
'20241118-143022'
>>> generate_slug(" ") # Empty content
ValueError: Content cannot be empty
Notes:
- This function does NOT check for uniqueness
- Caller must verify slug doesn't exist in database
- Use make_slug_unique() to add random suffix if needed
"""
```
**Implementation Details**:
- Use regex to remove non-alphanumeric characters (except hyphens)
- Handle Unicode: normalize to ASCII first, remove non-ASCII characters
- Maximum slug length: 100 characters (configurable constant)
- Minimum slug length: 1 character (if shorter, use timestamp fallback)
- Timestamp format: `YYYYMMDD-HHMMSS` (e.g., `20241118-143022`)
- Extract words: split on whitespace, take first N non-empty tokens
**Edge Cases**:
- Empty content → ValueError
- Content with only special characters → timestamp fallback
- Content with only 1-2 characters → timestamp fallback
- Very long first words → truncate to max length
- Unicode/emoji content → strip to ASCII alphanumeric
#### `make_slug_unique(base_slug: str, existing_slugs: set[str]) -> str`
**Purpose**: Add random suffix to slug if it already exists.
**Type Signature**:
```python
def make_slug_unique(base_slug: str, existing_slugs: set[str]) -> str:
"""
Make a slug unique by adding random suffix if needed
If the base_slug already exists in the provided set, appends a random
alphanumeric suffix until a unique slug is found.
Args:
base_slug: The base slug to make unique
existing_slugs: Set of existing slugs to check against
Returns:
Unique slug (base_slug or base_slug-{random})
Examples:
>>> make_slug_unique("test-note", set())
'test-note'
>>> make_slug_unique("test-note", {"test-note"})
'test-note-a7c9' # Random suffix
>>> make_slug_unique("test-note", {"test-note", "test-note-a7c9"})
'test-note-x3k2' # Different random suffix
Notes:
- Random suffix is 4 lowercase alphanumeric characters
- Extremely low collision probability (36^4 = 1.6M combinations)
- Will retry up to 100 times if collision occurs (should never happen)
"""
```
**Implementation Details**:
- Check if base_slug exists in set
- If not, return base_slug unchanged
- If exists, append `-{random}` where random is 4 alphanumeric chars
- Use `secrets.token_urlsafe()` for random generation (secure)
- Retry if collision occurs (max 100 attempts, raise error if all fail)
#### `validate_slug(slug: str) -> bool`
**Purpose**: Validate that a slug meets all requirements.
**Type Signature**:
```python
def validate_slug(slug: str) -> bool:
"""
Validate that a slug meets all requirements
Checks that slug contains only allowed characters and is within
length limits.
Args:
slug: The slug to validate
Returns:
True if slug is valid, False otherwise
Rules:
- Must contain only: a-z, 0-9, hyphen (-)
- Must be between 1 and 100 characters
- Cannot start or end with hyphen
- Cannot contain consecutive hyphens
Examples:
>>> validate_slug("hello-world")
True
>>> validate_slug("Hello-World") # Uppercase
False
>>> validate_slug("-hello") # Leading hyphen
False
>>> validate_slug("hello--world") # Double hyphen
False
"""
```
**Implementation Details**:
- Regex pattern: `^[a-z0-9]+(?:-[a-z0-9]+)*$`
- Length check: 1 <= len(slug) <= 100
- No additional normalization (validation only)
---
### 2. Content Hashing
#### `calculate_content_hash(content: str) -> str`
**Purpose**: Calculate SHA-256 hash of note content for change detection.
**Type Signature**:
```python
def calculate_content_hash(content: str) -> str:
"""
Calculate SHA-256 hash of content
Generates a cryptographic hash of the content for change detection
and cache invalidation. Uses UTF-8 encoding.
Args:
content: The content to hash (markdown text)
Returns:
Hexadecimal hash string (64 characters)
Examples:
>>> calculate_content_hash("Hello World")
'a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e'
>>> calculate_content_hash("")
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
Notes:
- Same content always produces same hash
- Hash is deterministic across systems
- Useful for detecting external file modifications
- SHA-256 chosen for security and wide support
"""
```
**Implementation Details**:
- Use `hashlib.sha256()`
- Encode content as UTF-8 bytes
- Return hexadecimal digest (lowercase)
- Handle empty content (valid use case)
---
### 3. File Path Operations
#### `generate_note_path(slug: str, created_at: datetime, data_dir: Path) -> Path`
**Purpose**: Generate file path for note using year/month directory structure.
**Type Signature**:
```python
def generate_note_path(slug: str, created_at: datetime, data_dir: Path) -> Path:
"""
Generate file path for a note
Creates path following pattern: data/notes/YYYY/MM/slug.md
Args:
slug: URL-safe slug for the note
created_at: Creation timestamp (determines YYYY/MM)
data_dir: Base data directory path
Returns:
Full Path object for the note file
Raises:
ValueError: If slug is invalid
Examples:
>>> from datetime import datetime
>>> from pathlib import Path
>>> dt = datetime(2024, 11, 18, 14, 30)
>>> generate_note_path("test-note", dt, Path("data"))
PosixPath('data/notes/2024/11/test-note.md')
Notes:
- Does NOT create directories (use ensure_note_directory)
- Does NOT check if file exists
- Validates slug before generating path
"""
```
**Implementation Details**:
- Validate slug using `validate_slug()`
- Extract year and month from created_at: `created_at.strftime("%Y")`, `"%m"`
- Build path: `data_dir / "notes" / year / month / f"{slug}.md"`
- Return Path object (not string)
#### `ensure_note_directory(note_path: Path) -> Path`
**Purpose**: Create directory structure for note if it doesn't exist.
**Type Signature**:
```python
def ensure_note_directory(note_path: Path) -> Path:
"""
Ensure directory exists for note file
Creates parent directories if they don't exist. Safe to call
even if directories already exist.
Args:
note_path: Full path to note file
Returns:
Parent directory path
Raises:
OSError: If directory cannot be created (permissions, etc.)
Examples:
>>> note_path = Path("data/notes/2024/11/test-note.md")
>>> ensure_note_directory(note_path)
PosixPath('data/notes/2024/11')
"""
```
**Implementation Details**:
- Use `note_path.parent.mkdir(parents=True, exist_ok=True)`
- Return parent directory path
- Let OSError propagate if creation fails (permissions, disk full, etc.)
#### `validate_note_path(file_path: Path, data_dir: Path) -> bool`
**Purpose**: Validate that file path is within data directory (prevent path traversal).
**Type Signature**:
```python
def validate_note_path(file_path: Path, data_dir: Path) -> bool:
"""
Validate that file path is within data directory
Security check to prevent path traversal attacks. Ensures the
resolved path is within the allowed data directory.
Args:
file_path: Path to validate
data_dir: Base data directory that must contain file_path
Returns:
True if path is safe, False otherwise
Examples:
>>> validate_note_path(
... Path("data/notes/2024/11/note.md"),
... Path("data")
... )
True
>>> validate_note_path(
... Path("data/notes/../../etc/passwd"),
... Path("data")
... )
False
Security:
- Resolves symlinks and relative paths
- Checks if resolved path is child of data_dir
- Prevents directory traversal attacks
"""
```
**Implementation Details**:
- Resolve both paths to absolute: `file_path.resolve()`, `data_dir.resolve()`
- Check if file_path is relative to data_dir: use `file_path.is_relative_to(data_dir)`
- Return boolean (no exceptions)
- Critical for security - MUST be called before any file operations
---
### 4. Atomic File Operations
#### `write_note_file(file_path: Path, content: str) -> None`
**Purpose**: Write note content to file atomically.
**Type Signature**:
```python
def write_note_file(file_path: Path, content: str) -> None:
"""
Write note content to file atomically
Writes to temporary file first, then atomically renames to final path.
This prevents corruption if write is interrupted.
Args:
file_path: Destination file path
content: Content to write (markdown text)
Raises:
OSError: If file cannot be written
ValueError: If file_path is invalid
Examples:
>>> write_note_file(Path("data/notes/2024/11/test.md"), "# Test")
Implementation:
1. Create temp file: {file_path}.tmp
2. Write content to temp file
3. Atomically rename temp to final path
4. If any step fails, clean up temp file
Notes:
- Atomic rename is guaranteed on POSIX systems
- Temp file created in same directory as target
- UTF-8 encoding used for all text
"""
```
**Implementation Details**:
- Create temp path: `temp_path = file_path.with_suffix(file_path.suffix + '.tmp')`
- Write to temp file: `temp_path.write_text(content, encoding='utf-8')`
- Atomic rename: `temp_path.replace(file_path)`
- On exception: delete temp file if exists, re-raise
- Use context manager or try/finally for cleanup
#### `read_note_file(file_path: Path) -> str`
**Purpose**: Read note content from file.
**Type Signature**:
```python
def read_note_file(file_path: Path) -> str:
"""
Read note content from file
Args:
file_path: Path to note file
Returns:
File content as string
Raises:
FileNotFoundError: If file doesn't exist
OSError: If file cannot be read
Examples:
>>> content = read_note_file(Path("data/notes/2024/11/test.md"))
>>> print(content)
# Test Note
"""
```
**Implementation Details**:
- Use `file_path.read_text(encoding='utf-8')`
- Let exceptions propagate (FileNotFoundError, OSError, etc.)
- No additional error handling needed
#### `delete_note_file(file_path: Path, soft: bool = False, data_dir: Optional[Path] = None) -> None`
**Purpose**: Delete note file (soft or hard delete).
**Type Signature**:
```python
def delete_note_file(file_path: Path, soft: bool = False, data_dir: Optional[Path] = None) -> None:
"""
Delete note file from filesystem
Supports soft delete (move to trash) or hard delete (permanent removal).
Args:
file_path: Path to note file
soft: If True, move to .trash/ directory; if False, delete permanently
data_dir: Required if soft=True, base data directory
Raises:
FileNotFoundError: If file doesn't exist
ValueError: If soft=True but data_dir not provided
OSError: If file cannot be deleted or moved
Examples:
>>> # Hard delete
>>> delete_note_file(Path("data/notes/2024/11/test.md"))
>>> # Soft delete (move to trash)
>>> delete_note_file(
... Path("data/notes/2024/11/test.md"),
... soft=True,
... data_dir=Path("data")
... )
"""
```
**Implementation Details**:
- If soft delete:
- Require data_dir parameter
- Create trash directory: `data_dir / ".trash" / year / month`
- Move file to trash with same filename
- Use `shutil.move()` or `file_path.replace()`
- If hard delete:
- Use `file_path.unlink()`
- Handle FileNotFoundError gracefully (file already gone is success)
---
### 5. Date/Time Utilities
#### `format_rfc822(dt: datetime) -> str`
**Purpose**: Format datetime as RFC-822 string for RSS feeds.
**Type Signature**:
```python
def format_rfc822(dt: datetime) -> str:
"""
Format datetime as RFC-822 string
Converts datetime to RFC-822 format required by RSS 2.0 specification.
Assumes UTC timezone.
Args:
dt: Datetime to format (assumed UTC)
Returns:
RFC-822 formatted string
Examples:
>>> from datetime import datetime
>>> dt = datetime(2024, 11, 18, 14, 30, 45)
>>> format_rfc822(dt)
'Mon, 18 Nov 2024 14:30:45 +0000'
References:
- RSS 2.0 spec: https://www.rssboard.org/rss-specification
- RFC-822 date format
"""
```
**Implementation Details**:
- Use `dt.strftime()` with format: `"%a, %d %b %Y %H:%M:%S +0000"`
- Assume UTC timezone (all timestamps stored as UTC)
- Format pattern: "DDD, DD MMM YYYY HH:MM:SS +0000"
#### `format_iso8601(dt: datetime) -> str`
**Purpose**: Format datetime as ISO 8601 string.
**Type Signature**:
```python
def format_iso8601(dt: datetime) -> str:
"""
Format datetime as ISO 8601 string
Converts datetime to ISO 8601 format for timestamps and APIs.
Args:
dt: Datetime to format
Returns:
ISO 8601 formatted string
Examples:
>>> from datetime import datetime
>>> dt = datetime(2024, 11, 18, 14, 30, 45)
>>> format_iso8601(dt)
'2024-11-18T14:30:45Z'
"""
```
**Implementation Details**:
- Use `dt.isoformat()` with 'Z' suffix for UTC
- Format: `YYYY-MM-DDTHH:MM:SSZ`
- Or use `dt.strftime("%Y-%m-%dT%H:%M:%SZ")`
#### `parse_iso8601(date_string: str) -> datetime`
**Purpose**: Parse ISO 8601 string to datetime.
**Type Signature**:
```python
def parse_iso8601(date_string: str) -> datetime:
"""
Parse ISO 8601 string to datetime
Args:
date_string: ISO 8601 formatted string
Returns:
Datetime object (UTC)
Raises:
ValueError: If string is not valid ISO 8601 format
Examples:
>>> parse_iso8601("2024-11-18T14:30:45Z")
datetime.datetime(2024, 11, 18, 14, 30, 45)
"""
```
**Implementation Details**:
- Use `datetime.fromisoformat()` (remove 'Z' suffix first if present)
- Or use `datetime.strptime()` with format string
- Let ValueError propagate if invalid format
---
### 6. Helper Functions
#### `extract_first_words(text: str, max_words: int = 5) -> str`
**Purpose**: Extract first N words from text (helper for slug generation).
**Type Signature**:
```python
def extract_first_words(text: str, max_words: int = 5) -> str:
"""
Extract first N words from text
Helper function for slug generation. Splits text on whitespace
and returns first N non-empty words.
Args:
text: Text to extract words from
max_words: Maximum number of words to extract
Returns:
Space-separated string of first N words
Examples:
>>> extract_first_words("Hello world this is a test", 3)
'Hello world this'
>>> extract_first_words(" Multiple spaces ", 2)
'Multiple spaces'
"""
```
**Implementation Details**:
- Strip whitespace from text
- Split on whitespace: `text.split()`
- Filter empty strings
- Take first max_words: `words[:max_words]`
- Join with single space: `" ".join(words)`
#### `normalize_slug_text(text: str) -> str`
**Purpose**: Normalize text for slug (lowercase, hyphens, alphanumeric only).
**Type Signature**:
```python
def normalize_slug_text(text: str) -> str:
"""
Normalize text for use in slug
Converts to lowercase, replaces spaces with hyphens, removes
special characters, and collapses multiple hyphens.
Args:
text: Text to normalize
Returns:
Normalized slug-safe text
Examples:
>>> normalize_slug_text("Hello World!")
'hello-world'
>>> normalize_slug_text("Testing... with -- special chars!")
'testing-with-special-chars'
"""
```
**Implementation Details**:
- Convert to lowercase: `text.lower()`
- Replace spaces with hyphens: `text.replace(' ', '-')`
- Remove non-alphanumeric (except hyphens): regex `[^a-z0-9-]``''`
- Collapse multiple hyphens: regex `-+``-`
- Strip leading/trailing hyphens: `text.strip('-')`
#### `generate_random_suffix(length: int = 4) -> str`
**Purpose**: Generate random alphanumeric suffix for slug uniqueness.
**Type Signature**:
```python
def generate_random_suffix(length: int = 4) -> str:
"""
Generate random alphanumeric suffix
Creates a secure random string for making slugs unique.
Uses lowercase letters and numbers only.
Args:
length: Length of suffix (default 4)
Returns:
Random alphanumeric string
Examples:
>>> suffix = generate_random_suffix()
>>> len(suffix)
4
>>> suffix.isalnum()
True
"""
```
**Implementation Details**:
- Use `secrets.token_urlsafe()` or `secrets.choice()`
- Character set: `a-z0-9` (36 characters)
- Generate random string of specified length
- Example: `'a7c9'`, `'x3k2'`, `'p9m1'`
---
## Constants
```python
# Slug configuration
MAX_SLUG_LENGTH = 100 # Maximum slug length in characters
MIN_SLUG_LENGTH = 1 # Minimum slug length in characters
SLUG_WORDS_COUNT = 5 # Number of words to extract for slug
RANDOM_SUFFIX_LENGTH = 4 # Length of random suffix for uniqueness
# File operations
TEMP_FILE_SUFFIX = '.tmp' # Suffix for temporary files
TRASH_DIR_NAME = '.trash' # Directory name for soft-deleted files
# Hashing
CONTENT_HASH_ALGORITHM = 'sha256' # Hash algorithm for content
# Regex patterns
SLUG_PATTERN = re.compile(r'^[a-z0-9]+(?:-[a-z0-9]+)*$')
SAFE_SLUG_PATTERN = re.compile(r'[^a-z0-9-]')
MULTIPLE_HYPHENS_PATTERN = re.compile(r'-+')
WHITESPACE_PATTERN = re.compile(r'\s+')
# Character sets
RANDOM_CHARS = 'abcdefghijklmnopqrstuvwxyz0123456789'
```
---
## Error Handling
### Exception Types
The module uses standard Python exceptions:
- **ValueError**: Invalid input (empty content, invalid slug, etc.)
- **FileNotFoundError**: File doesn't exist when expected
- **FileExistsError**: File already exists (not used in utils, but caller may use)
- **OSError**: General I/O errors (permissions, disk full, etc.)
### Error Messages
Error messages should be specific and actionable:
**Good**:
```python
raise ValueError(
f"Invalid slug '{slug}': must contain only lowercase letters, "
f"numbers, and hyphens"
)
raise ValueError("Content cannot be empty or whitespace-only")
raise OSError(
f"Failed to write note file: {file_path}. "
f"Check permissions and disk space."
)
```
**Bad**:
```python
raise ValueError("Invalid slug")
raise ValueError("Bad input")
raise OSError("Write failed")
```
---
## Testing Strategy
### Test Coverage Requirements
- Minimum 90% code coverage for utils.py
- Test all functions with multiple scenarios
- Test edge cases and error conditions
- Test security (path traversal, etc.)
### Test Organization
File: `tests/test_utils.py`
```python
"""
Tests for utility functions
Organized by function category:
- Slug generation tests
- Content hashing tests
- File path tests
- Atomic file operations tests
- Date/time formatting tests
"""
import pytest
from pathlib import Path
from datetime import datetime
from starpunk.utils import (
generate_slug,
make_slug_unique,
validate_slug,
calculate_content_hash,
generate_note_path,
# ... etc
)
class TestSlugGeneration:
"""Test slug generation functions"""
def test_generate_slug_from_content(self):
"""Test basic slug generation from content"""
slug = generate_slug("Hello World This Is My Note")
assert slug == "hello-world-this-is-my"
def test_generate_slug_special_characters(self):
"""Test slug generation removes special characters"""
slug = generate_slug("Testing... with!@# special chars")
assert slug == "testing-with-special-chars"
def test_generate_slug_empty_content(self):
"""Test slug generation raises error on empty content"""
with pytest.raises(ValueError):
generate_slug("")
def test_generate_slug_whitespace_only(self):
"""Test slug generation raises error on whitespace-only content"""
with pytest.raises(ValueError):
generate_slug(" \n\t ")
def test_generate_slug_short_content_fallback(self):
"""Test slug generation falls back to timestamp for short content"""
slug = generate_slug("A", created_at=datetime(2024, 11, 18, 14, 30))
assert slug == "20241118-143000" # Timestamp format
def test_generate_slug_unicode_content(self):
"""Test slug generation handles unicode"""
slug = generate_slug("Hello 世界 World")
# Should strip non-ASCII and keep ASCII words
assert "hello" in slug
assert "world" in slug
def test_make_slug_unique_no_collision(self):
"""Test make_slug_unique returns original if no collision"""
slug = make_slug_unique("test-note", set())
assert slug == "test-note"
def test_make_slug_unique_with_collision(self):
"""Test make_slug_unique adds suffix on collision"""
slug = make_slug_unique("test-note", {"test-note"})
assert slug.startswith("test-note-")
assert len(slug) == len("test-note-") + 4 # 4 char suffix
def test_validate_slug_valid(self):
"""Test validate_slug accepts valid slugs"""
assert validate_slug("hello-world") is True
assert validate_slug("test-note-123") is True
assert validate_slug("a") is True
def test_validate_slug_invalid(self):
"""Test validate_slug rejects invalid slugs"""
assert validate_slug("Hello-World") is False # Uppercase
assert validate_slug("-hello") is False # Leading hyphen
assert validate_slug("hello-") is False # Trailing hyphen
assert validate_slug("hello--world") is False # Double hyphen
assert validate_slug("hello_world") is False # Underscore
assert validate_slug("") is False # Empty
class TestContentHashing:
"""Test content hashing functions"""
def test_calculate_content_hash_consistency(self):
"""Test hash is consistent for same content"""
hash1 = calculate_content_hash("Test content")
hash2 = calculate_content_hash("Test content")
assert hash1 == hash2
def test_calculate_content_hash_different(self):
"""Test different content produces different hash"""
hash1 = calculate_content_hash("Test content 1")
hash2 = calculate_content_hash("Test content 2")
assert hash1 != hash2
def test_calculate_content_hash_empty(self):
"""Test hash of empty string"""
hash_empty = calculate_content_hash("")
assert len(hash_empty) == 64 # SHA-256 produces 64 hex chars
def test_calculate_content_hash_unicode(self):
"""Test hash handles unicode correctly"""
hash_val = calculate_content_hash("Hello 世界")
assert len(hash_val) == 64
class TestFilePathOperations:
"""Test file path generation and validation"""
def test_generate_note_path(self):
"""Test note path generation"""
dt = datetime(2024, 11, 18, 14, 30)
path = generate_note_path("test-note", dt, Path("data"))
assert path == Path("data/notes/2024/11/test-note.md")
def test_generate_note_path_invalid_slug(self):
"""Test note path generation rejects invalid slug"""
dt = datetime(2024, 11, 18)
with pytest.raises(ValueError):
generate_note_path("Invalid Slug!", dt, Path("data"))
def test_validate_note_path_safe(self):
"""Test path validation accepts safe paths"""
assert validate_note_path(
Path("data/notes/2024/11/note.md"),
Path("data")
) is True
def test_validate_note_path_traversal(self):
"""Test path validation rejects traversal attempts"""
assert validate_note_path(
Path("data/notes/../../etc/passwd"),
Path("data")
) is False
def test_validate_note_path_absolute_outside(self):
"""Test path validation rejects paths outside data dir"""
assert validate_note_path(
Path("/etc/passwd"),
Path("data")
) is False
class TestAtomicFileOperations:
"""Test atomic file write/read/delete operations"""
def test_write_and_read_note_file(self, tmp_path):
"""Test writing and reading note file"""
file_path = tmp_path / "test.md"
content = "# Test Note\n\nThis is a test."
write_note_file(file_path, content)
assert file_path.exists()
read_content = read_note_file(file_path)
assert read_content == content
def test_write_note_file_atomic(self, tmp_path):
"""Test write is atomic (temp file used)"""
file_path = tmp_path / "test.md"
temp_path = Path(str(file_path) + '.tmp')
# Temp file should not exist after write
write_note_file(file_path, "Test")
assert not temp_path.exists()
assert file_path.exists()
def test_read_note_file_not_found(self, tmp_path):
"""Test reading non-existent file raises error"""
file_path = tmp_path / "nonexistent.md"
with pytest.raises(FileNotFoundError):
read_note_file(file_path)
def test_delete_note_file_hard(self, tmp_path):
"""Test hard delete removes file"""
file_path = tmp_path / "test.md"
file_path.write_text("Test")
delete_note_file(file_path, soft=False)
assert not file_path.exists()
def test_delete_note_file_soft(self, tmp_path):
"""Test soft delete moves file to trash"""
# Create note file
notes_dir = tmp_path / "notes" / "2024" / "11"
notes_dir.mkdir(parents=True)
file_path = notes_dir / "test.md"
file_path.write_text("Test")
# Soft delete
delete_note_file(file_path, soft=True, data_dir=tmp_path)
# Original should be gone
assert not file_path.exists()
# Should be in trash
trash_path = tmp_path / ".trash" / "2024" / "11" / "test.md"
assert trash_path.exists()
class TestDateTimeFormatting:
"""Test date/time formatting functions"""
def test_format_rfc822(self):
"""Test RFC-822 date formatting"""
dt = datetime(2024, 11, 18, 14, 30, 45)
formatted = format_rfc822(dt)
assert formatted == "Mon, 18 Nov 2024 14:30:45 +0000"
def test_format_iso8601(self):
"""Test ISO 8601 date formatting"""
dt = datetime(2024, 11, 18, 14, 30, 45)
formatted = format_iso8601(dt)
assert formatted == "2024-11-18T14:30:45Z"
def test_parse_iso8601(self):
"""Test ISO 8601 date parsing"""
dt = parse_iso8601("2024-11-18T14:30:45Z")
assert dt.year == 2024
assert dt.month == 11
assert dt.day == 18
assert dt.hour == 14
assert dt.minute == 30
assert dt.second == 45
def test_parse_iso8601_invalid(self):
"""Test ISO 8601 parsing rejects invalid format"""
with pytest.raises(ValueError):
parse_iso8601("not-a-date")
```
### Additional Test Cases
**Security Tests**:
- Path traversal attempts with various patterns
- Symlink handling
- Unicode/emoji in slugs
- Very long input strings
- Malicious input patterns
**Edge Cases**:
- Empty content
- Whitespace-only content
- Very long content (>1MB)
- Unicode normalization issues
- Concurrent file operations (if applicable)
**Performance Tests**:
- Slug generation with 10,000 character input
- Hash calculation with large content
- Path validation with deep directory structures
---
## Usage Examples
### Creating a Note
```python
from datetime import datetime
from pathlib import Path
from starpunk.utils import (
generate_slug,
make_slug_unique,
generate_note_path,
ensure_note_directory,
write_note_file,
calculate_content_hash
)
# Note content
content = "This is my first note about Python programming"
created_at = datetime.utcnow()
data_dir = Path("data")
# Generate slug
base_slug = generate_slug(content, created_at) # "this-is-my-first-note"
# Check uniqueness (pseudo-code, actual check via database)
existing_slugs = get_existing_slugs_from_db() # Returns set of slugs
slug = make_slug_unique(base_slug, existing_slugs)
# Generate file path
note_path = generate_note_path(slug, created_at, data_dir)
# Result: data/notes/2024/11/this-is-my-first-note.md
# Create directory
ensure_note_directory(note_path)
# Write file
write_note_file(note_path, content)
# Calculate hash for database
content_hash = calculate_content_hash(content)
# Save metadata to database (pseudo-code)
save_note_metadata(
slug=slug,
file_path=str(note_path.relative_to(data_dir)),
created_at=created_at,
content_hash=content_hash
)
```
### Reading a Note
```python
from pathlib import Path
from starpunk.utils import read_note_file, validate_note_path
# Get file path from database (pseudo-code)
note = get_note_from_db(slug="my-first-note")
file_path = Path("data") / note.file_path
# Validate path (security check)
if not validate_note_path(file_path, Path("data")):
raise ValueError("Invalid file path")
# Read content
content = read_note_file(file_path)
```
### Deleting a Note
```python
from pathlib import Path
from starpunk.utils import delete_note_file
# Get note path from database
note = get_note_from_db(slug="old-note")
file_path = Path("data") / note.file_path
# Soft delete (move to trash)
delete_note_file(file_path, soft=True, data_dir=Path("data"))
# Update database to mark as deleted (pseudo-code)
mark_note_deleted(note.id)
```
---
## Integration with Other Modules
### Database Integration
The database module will use utilities like this:
```python
# In starpunk/notes.py
from starpunk.utils import (
generate_slug, make_slug_unique, generate_note_path,
ensure_note_directory, write_note_file, calculate_content_hash
)
from starpunk.database import get_db
def create_note(content: str, published: bool = False) -> Note:
"""Create a new note"""
db = get_db()
created_at = datetime.utcnow()
# Generate unique slug
base_slug = generate_slug(content, created_at)
existing_slugs = set(row[0] for row in db.execute("SELECT slug FROM notes").fetchall())
slug = make_slug_unique(base_slug, existing_slugs)
# Generate path and write file
note_path = generate_note_path(slug, created_at, Path(current_app.config['DATA_PATH']))
ensure_note_directory(note_path)
write_note_file(note_path, content)
# Calculate hash
content_hash = calculate_content_hash(content)
# Insert into database
db.execute(
"INSERT INTO notes (slug, file_path, published, created_at, updated_at, content_hash) "
"VALUES (?, ?, ?, ?, ?, ?)",
(slug, str(note_path.relative_to(Path(current_app.config['DATA_PATH']))),
published, created_at, created_at, content_hash)
)
db.commit()
return Note(slug=slug, content=content, created_at=created_at, ...)
```
### RSS Feed Integration
The feed module will use date formatting:
```python
# In starpunk/feed.py
from starpunk.utils import format_rfc822
def generate_rss_feed():
"""Generate RSS feed"""
fg = FeedGenerator()
for note in get_published_notes():
fe = fg.add_entry()
fe.pubDate(format_rfc822(note.created_at))
```
---
## Performance Considerations
### Slug Generation Performance
- Expected input: 50-1000 characters
- Target: <1ms per slug generation
- Bottleneck: None (string operations are fast)
### File Operations Performance
- Write operation: <10ms for typical note (1-10KB)
- Read operation: <5ms for typical note
- Atomic rename: ~1ms (OS-level operation)
### Path Validation Performance
- Target: <1ms per validation
- Use cached absolute paths where possible
---
## Security Considerations
### Path Traversal Prevention
CRITICAL: All file paths MUST be validated before use.
```python
# Always validate before file operations
if not validate_note_path(file_path, data_dir):
raise ValueError("Invalid file path: potential directory traversal")
```
### Slug Validation
Prevent injection attacks through slugs:
```python
# Validate slug before using in paths
if not validate_slug(slug):
raise ValueError(f"Invalid slug: {slug}")
```
### Random Suffix Security
Use cryptographically secure random:
```python
# Use secrets module, not random module
import secrets
suffix = secrets.token_urlsafe(4)[:4].lower()
```
---
## Dependencies
### Standard Library
- `hashlib` - SHA-256 hashing
- `re` - Regular expressions
- `secrets` - Secure random generation
- `pathlib` - Path operations
- `datetime` - Date/time handling
- `shutil` - File operations (move)
### Third-Party
- `flask` - For `current_app` context (configuration access)
### Internal
- None (this is the foundation module)
---
## Future Enhancements (V2+)
### Slug Customization
Allow user to specify custom slug:
```python
def generate_slug(content: str, custom_slug: Optional[str] = None, ...) -> str:
if custom_slug:
if validate_slug(custom_slug):
return custom_slug
raise ValueError("Invalid custom slug")
# ... existing logic
```
### Unicode Slug Support
Support non-ASCII characters in slugs:
```python
def generate_unicode_slug(content: str) -> str:
"""Generate slug with unicode support"""
# Use unicode normalization
# Support CJK characters
# Support diacritics
```
### Content Preview Extraction
Extract preview/excerpt from content:
```python
def extract_preview(content: str, max_length: int = 200) -> str:
"""Extract preview text from content"""
# Strip markdown formatting
# Truncate to max_length
# Add ellipsis if truncated
```
### Title Extraction
Extract title from first line:
```python
def extract_title(content: str) -> Optional[str]:
"""Extract title from content (first line or # heading)"""
# Check for # heading
# Or use first line
# Return None if empty
```
---
## Acceptance Criteria
- [ ] All functions implemented with type hints
- [ ] All functions have comprehensive docstrings
- [ ] Slug generation creates valid, URL-safe slugs
- [ ] Slug uniqueness enforcement works correctly
- [ ] Content hashing is consistent and correct
- [ ] Path validation prevents directory traversal
- [ ] Atomic file writes prevent corruption
- [ ] Date formatting matches RFC-822 and ISO 8601 specs
- [ ] All security checks are in place
- [ ] Test coverage >90%
- [ ] All tests pass
- [ ] Code formatted with Black
- [ ] Code passes flake8 linting
- [ ] No hardcoded values (use constants)
- [ ] Error messages are clear and actionable
---
## References
- [ADR-004: File-Based Note Storage](/home/phil/Projects/starpunk/docs/decisions/ADR-004-file-based-note-storage.md)
- [Python Coding Standards](/home/phil/Projects/starpunk/docs/standards/python-coding-standards.md)
- [Project Structure](/home/phil/Projects/starpunk/docs/design/project-structure.md)
- [CommonMark Specification](https://spec.commonmark.org/)
- [RSS 2.0 Specification](https://www.rssboard.org/rss-specification)
- [ISO 8601 Date Format](https://en.wikipedia.org/wiki/ISO_8601)
- [RFC-822 Date Format](https://www.rfc-editor.org/rfc/rfc822)
- [Python pathlib Documentation](https://docs.python.org/3/library/pathlib.html)
- [Python secrets Documentation](https://docs.python.org/3/library/secrets.html)
---
## Implementation Checklist
When implementing `starpunk/utils.py`, complete in this order:
1. [ ] Create file with module docstring
2. [ ] Add imports and constants
3. [ ] Implement helper functions (extract_first_words, normalize_slug_text, generate_random_suffix)
4. [ ] Implement slug functions (generate_slug, make_slug_unique, validate_slug)
5. [ ] Implement content hashing (calculate_content_hash)
6. [ ] Implement path functions (generate_note_path, ensure_note_directory, validate_note_path)
7. [ ] Implement file operations (write_note_file, read_note_file, delete_note_file)
8. [ ] Implement date/time functions (format_rfc822, format_iso8601, parse_iso8601)
9. [ ] Create tests/test_utils.py
10. [ ] Write tests for all functions
11. [ ] Run tests and achieve >90% coverage
12. [ ] Format with Black
13. [ ] Lint with flake8
14. [ ] Review all docstrings
15. [ ] Review all error messages
**Estimated Time**: 2-3 hours for implementation + tests