Files
StarPunk/docs/design/phase-1.1-core-utilities.md
2025-11-18 19:21:31 -07:00

39 KiB

Phase 1.1: Core Utilities Design

Overview

This document provides a complete, implementation-ready design for Phase 1.1 of the StarPunk V1 implementation plan: Core Utilities. The utilities module (starpunk/utils.py) provides foundational functions used by all other components of the system.

Priority: CRITICAL - Required by all other features Estimated Effort: 2-3 hours Dependencies: None File: starpunk/utils.py

Design Principles

  1. Pure functions - No side effects where possible
  2. Type safety - Full type hints on all functions
  3. Error handling - Specific exceptions with clear messages
  4. Security first - Validate all inputs, prevent path traversal
  5. Testable - Small, focused functions that are easy to test
  6. Well documented - Comprehensive docstrings with examples

Module Structure

"""
Core utility functions for StarPunk

This module provides essential utilities for slug generation, file operations,
hashing, and date/time handling. These utilities are used throughout the
application and have no external dependencies beyond standard library and
Flask configuration.
"""

# Standard library imports
import hashlib
import re
import secrets
from datetime import datetime
from pathlib import Path
from typing import Optional

# Third-party imports
from flask import current_app

# Constants
MAX_SLUG_LENGTH = 100
MIN_SLUG_LENGTH = 1
SLUG_WORDS_COUNT = 5
RANDOM_SUFFIX_LENGTH = 4
CONTENT_HASH_ALGORITHM = "sha256"
SAFE_SLUG_PATTERN = re.compile(r'[^a-z0-9-]')
MULTIPLE_HYPHENS_PATTERN = re.compile(r'-+')

# Utility functions (defined below)

Function Specifications

1. Slug Generation

generate_slug(content: str, created_at: Optional[datetime] = None) -> str

Purpose: Generate a URL-safe slug from note content or timestamp.

Algorithm:

  1. Extract first N words from content (configurable, default 5)
  2. Convert to lowercase
  3. Replace spaces with hyphens
  4. Remove all characters except a-z, 0-9, and hyphens
  5. Remove leading/trailing hyphens
  6. Collapse multiple consecutive hyphens to single hyphen
  7. Truncate to maximum length
  8. If result is empty or too short, fall back to timestamp-based slug
  9. Return slug (uniqueness check handled by caller)

Type Signature:

def generate_slug(content: str, created_at: Optional[datetime] = None) -> str:
    """
    Generate URL-safe slug from note content

    Creates a slug by extracting the first few words from the content and
    normalizing them to lowercase with hyphens. If content is insufficient,
    falls back to timestamp-based slug.

    Args:
        content: The note content (markdown text)
        created_at: Optional timestamp for fallback slug (defaults to now)

    Returns:
        URL-safe slug string (lowercase, alphanumeric + hyphens only)

    Raises:
        ValueError: If content is empty or contains only whitespace

    Examples:
        >>> generate_slug("Hello World! This is my first note.")
        'hello-world-this-is-my'

        >>> generate_slug("Testing... with special chars!@#")
        'testing-with-special-chars'

        >>> generate_slug("A")  # Too short, uses timestamp
        '20241118-143022'

        >>> generate_slug("   ")  # Empty content
        ValueError: Content cannot be empty

    Notes:
        - This function does NOT check for uniqueness
        - Caller must verify slug doesn't exist in database
        - Use make_slug_unique() to add random suffix if needed
    """

Implementation Details:

  • Use regex to remove non-alphanumeric characters (except hyphens)
  • Handle Unicode: normalize to ASCII first, remove non-ASCII characters
  • Maximum slug length: 100 characters (configurable constant)
  • Minimum slug length: 1 character (if shorter, use timestamp fallback)
  • Timestamp format: YYYYMMDD-HHMMSS (e.g., 20241118-143022)
  • Extract words: split on whitespace, take first N non-empty tokens

Edge Cases:

  • Empty content → ValueError
  • Content with only special characters → timestamp fallback
  • Content with only 1-2 characters → timestamp fallback
  • Very long first words → truncate to max length
  • Unicode/emoji content → strip to ASCII alphanumeric

make_slug_unique(base_slug: str, existing_slugs: set[str]) -> str

Purpose: Add random suffix to slug if it already exists.

Type Signature:

def make_slug_unique(base_slug: str, existing_slugs: set[str]) -> str:
    """
    Make a slug unique by adding random suffix if needed

    If the base_slug already exists in the provided set, appends a random
    alphanumeric suffix until a unique slug is found.

    Args:
        base_slug: The base slug to make unique
        existing_slugs: Set of existing slugs to check against

    Returns:
        Unique slug (base_slug or base_slug-{random})

    Examples:
        >>> make_slug_unique("test-note", set())
        'test-note'

        >>> make_slug_unique("test-note", {"test-note"})
        'test-note-a7c9'  # Random suffix

        >>> make_slug_unique("test-note", {"test-note", "test-note-a7c9"})
        'test-note-x3k2'  # Different random suffix

    Notes:
        - Random suffix is 4 lowercase alphanumeric characters
        - Extremely low collision probability (36^4 = 1.6M combinations)
        - Will retry up to 100 times if collision occurs (should never happen)
    """

Implementation Details:

  • Check if base_slug exists in set
  • If not, return base_slug unchanged
  • If exists, append -{random} where random is 4 alphanumeric chars
  • Use secrets.token_urlsafe() for random generation (secure)
  • Retry if collision occurs (max 100 attempts, raise error if all fail)

validate_slug(slug: str) -> bool

Purpose: Validate that a slug meets all requirements.

Type Signature:

def validate_slug(slug: str) -> bool:
    """
    Validate that a slug meets all requirements

    Checks that slug contains only allowed characters and is within
    length limits.

    Args:
        slug: The slug to validate

    Returns:
        True if slug is valid, False otherwise

    Rules:
        - Must contain only: a-z, 0-9, hyphen (-)
        - Must be between 1 and 100 characters
        - Cannot start or end with hyphen
        - Cannot contain consecutive hyphens

    Examples:
        >>> validate_slug("hello-world")
        True

        >>> validate_slug("Hello-World")  # Uppercase
        False

        >>> validate_slug("-hello")  # Leading hyphen
        False

        >>> validate_slug("hello--world")  # Double hyphen
        False
    """

Implementation Details:

  • Regex pattern: ^[a-z0-9]+(?:-[a-z0-9]+)*$
  • Length check: 1 <= len(slug) <= 100
  • No additional normalization (validation only)

2. Content Hashing

calculate_content_hash(content: str) -> str

Purpose: Calculate SHA-256 hash of note content for change detection.

Type Signature:

def calculate_content_hash(content: str) -> str:
    """
    Calculate SHA-256 hash of content

    Generates a cryptographic hash of the content for change detection
    and cache invalidation. Uses UTF-8 encoding.

    Args:
        content: The content to hash (markdown text)

    Returns:
        Hexadecimal hash string (64 characters)

    Examples:
        >>> calculate_content_hash("Hello World")
        'a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e'

        >>> calculate_content_hash("")
        'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

    Notes:
        - Same content always produces same hash
        - Hash is deterministic across systems
        - Useful for detecting external file modifications
        - SHA-256 chosen for security and wide support
    """

Implementation Details:

  • Use hashlib.sha256()
  • Encode content as UTF-8 bytes
  • Return hexadecimal digest (lowercase)
  • Handle empty content (valid use case)

3. File Path Operations

generate_note_path(slug: str, created_at: datetime, data_dir: Path) -> Path

Purpose: Generate file path for note using year/month directory structure.

Type Signature:

def generate_note_path(slug: str, created_at: datetime, data_dir: Path) -> Path:
    """
    Generate file path for a note

    Creates path following pattern: data/notes/YYYY/MM/slug.md

    Args:
        slug: URL-safe slug for the note
        created_at: Creation timestamp (determines YYYY/MM)
        data_dir: Base data directory path

    Returns:
        Full Path object for the note file

    Raises:
        ValueError: If slug is invalid

    Examples:
        >>> from datetime import datetime
        >>> from pathlib import Path
        >>> dt = datetime(2024, 11, 18, 14, 30)
        >>> generate_note_path("test-note", dt, Path("data"))
        PosixPath('data/notes/2024/11/test-note.md')

    Notes:
        - Does NOT create directories (use ensure_note_directory)
        - Does NOT check if file exists
        - Validates slug before generating path
    """

Implementation Details:

  • Validate slug using validate_slug()
  • Extract year and month from created_at: created_at.strftime("%Y"), "%m"
  • Build path: data_dir / "notes" / year / month / f"{slug}.md"
  • Return Path object (not string)

ensure_note_directory(note_path: Path) -> Path

Purpose: Create directory structure for note if it doesn't exist.

Type Signature:

def ensure_note_directory(note_path: Path) -> Path:
    """
    Ensure directory exists for note file

    Creates parent directories if they don't exist. Safe to call
    even if directories already exist.

    Args:
        note_path: Full path to note file

    Returns:
        Parent directory path

    Raises:
        OSError: If directory cannot be created (permissions, etc.)

    Examples:
        >>> note_path = Path("data/notes/2024/11/test-note.md")
        >>> ensure_note_directory(note_path)
        PosixPath('data/notes/2024/11')
    """

Implementation Details:

  • Use note_path.parent.mkdir(parents=True, exist_ok=True)
  • Return parent directory path
  • Let OSError propagate if creation fails (permissions, disk full, etc.)

validate_note_path(file_path: Path, data_dir: Path) -> bool

Purpose: Validate that file path is within data directory (prevent path traversal).

Type Signature:

def validate_note_path(file_path: Path, data_dir: Path) -> bool:
    """
    Validate that file path is within data directory

    Security check to prevent path traversal attacks. Ensures the
    resolved path is within the allowed data directory.

    Args:
        file_path: Path to validate
        data_dir: Base data directory that must contain file_path

    Returns:
        True if path is safe, False otherwise

    Examples:
        >>> validate_note_path(
        ...     Path("data/notes/2024/11/note.md"),
        ...     Path("data")
        ... )
        True

        >>> validate_note_path(
        ...     Path("data/notes/../../etc/passwd"),
        ...     Path("data")
        ... )
        False

    Security:
        - Resolves symlinks and relative paths
        - Checks if resolved path is child of data_dir
        - Prevents directory traversal attacks
    """

Implementation Details:

  • Resolve both paths to absolute: file_path.resolve(), data_dir.resolve()
  • Check if file_path is relative to data_dir: use file_path.is_relative_to(data_dir)
  • Return boolean (no exceptions)
  • Critical for security - MUST be called before any file operations

4. Atomic File Operations

write_note_file(file_path: Path, content: str) -> None

Purpose: Write note content to file atomically.

Type Signature:

def write_note_file(file_path: Path, content: str) -> None:
    """
    Write note content to file atomically

    Writes to temporary file first, then atomically renames to final path.
    This prevents corruption if write is interrupted.

    Args:
        file_path: Destination file path
        content: Content to write (markdown text)

    Raises:
        OSError: If file cannot be written
        ValueError: If file_path is invalid

    Examples:
        >>> write_note_file(Path("data/notes/2024/11/test.md"), "# Test")

    Implementation:
        1. Create temp file: {file_path}.tmp
        2. Write content to temp file
        3. Atomically rename temp to final path
        4. If any step fails, clean up temp file

    Notes:
        - Atomic rename is guaranteed on POSIX systems
        - Temp file created in same directory as target
        - UTF-8 encoding used for all text
    """

Implementation Details:

  • Create temp path: temp_path = file_path.with_suffix(file_path.suffix + '.tmp')
  • Write to temp file: temp_path.write_text(content, encoding='utf-8')
  • Atomic rename: temp_path.replace(file_path)
  • On exception: delete temp file if exists, re-raise
  • Use context manager or try/finally for cleanup

read_note_file(file_path: Path) -> str

Purpose: Read note content from file.

Type Signature:

def read_note_file(file_path: Path) -> str:
    """
    Read note content from file

    Args:
        file_path: Path to note file

    Returns:
        File content as string

    Raises:
        FileNotFoundError: If file doesn't exist
        OSError: If file cannot be read

    Examples:
        >>> content = read_note_file(Path("data/notes/2024/11/test.md"))
        >>> print(content)
        # Test Note
    """

Implementation Details:

  • Use file_path.read_text(encoding='utf-8')
  • Let exceptions propagate (FileNotFoundError, OSError, etc.)
  • No additional error handling needed

delete_note_file(file_path: Path, soft: bool = False, data_dir: Optional[Path] = None) -> None

Purpose: Delete note file (soft or hard delete).

Type Signature:

def delete_note_file(file_path: Path, soft: bool = False, data_dir: Optional[Path] = None) -> None:
    """
    Delete note file from filesystem

    Supports soft delete (move to trash) or hard delete (permanent removal).

    Args:
        file_path: Path to note file
        soft: If True, move to .trash/ directory; if False, delete permanently
        data_dir: Required if soft=True, base data directory

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If soft=True but data_dir not provided
        OSError: If file cannot be deleted or moved

    Examples:
        >>> # Hard delete
        >>> delete_note_file(Path("data/notes/2024/11/test.md"))

        >>> # Soft delete (move to trash)
        >>> delete_note_file(
        ...     Path("data/notes/2024/11/test.md"),
        ...     soft=True,
        ...     data_dir=Path("data")
        ... )
    """

Implementation Details:

  • If soft delete:
    • Require data_dir parameter
    • Create trash directory: data_dir / ".trash" / year / month
    • Move file to trash with same filename
    • Use shutil.move() or file_path.replace()
  • If hard delete:
    • Use file_path.unlink()
  • Handle FileNotFoundError gracefully (file already gone is success)

5. Date/Time Utilities

format_rfc822(dt: datetime) -> str

Purpose: Format datetime as RFC-822 string for RSS feeds.

Type Signature:

def format_rfc822(dt: datetime) -> str:
    """
    Format datetime as RFC-822 string

    Converts datetime to RFC-822 format required by RSS 2.0 specification.
    Assumes UTC timezone.

    Args:
        dt: Datetime to format (assumed UTC)

    Returns:
        RFC-822 formatted string

    Examples:
        >>> from datetime import datetime
        >>> dt = datetime(2024, 11, 18, 14, 30, 45)
        >>> format_rfc822(dt)
        'Mon, 18 Nov 2024 14:30:45 +0000'

    References:
        - RSS 2.0 spec: https://www.rssboard.org/rss-specification
        - RFC-822 date format
    """

Implementation Details:

  • Use dt.strftime() with format: "%a, %d %b %Y %H:%M:%S +0000"
  • Assume UTC timezone (all timestamps stored as UTC)
  • Format pattern: "DDD, DD MMM YYYY HH:MM:SS +0000"

format_iso8601(dt: datetime) -> str

Purpose: Format datetime as ISO 8601 string.

Type Signature:

def format_iso8601(dt: datetime) -> str:
    """
    Format datetime as ISO 8601 string

    Converts datetime to ISO 8601 format for timestamps and APIs.

    Args:
        dt: Datetime to format

    Returns:
        ISO 8601 formatted string

    Examples:
        >>> from datetime import datetime
        >>> dt = datetime(2024, 11, 18, 14, 30, 45)
        >>> format_iso8601(dt)
        '2024-11-18T14:30:45Z'
    """

Implementation Details:

  • Use dt.isoformat() with 'Z' suffix for UTC
  • Format: YYYY-MM-DDTHH:MM:SSZ
  • Or use dt.strftime("%Y-%m-%dT%H:%M:%SZ")

parse_iso8601(date_string: str) -> datetime

Purpose: Parse ISO 8601 string to datetime.

Type Signature:

def parse_iso8601(date_string: str) -> datetime:
    """
    Parse ISO 8601 string to datetime

    Args:
        date_string: ISO 8601 formatted string

    Returns:
        Datetime object (UTC)

    Raises:
        ValueError: If string is not valid ISO 8601 format

    Examples:
        >>> parse_iso8601("2024-11-18T14:30:45Z")
        datetime.datetime(2024, 11, 18, 14, 30, 45)
    """

Implementation Details:

  • Use datetime.fromisoformat() (remove 'Z' suffix first if present)
  • Or use datetime.strptime() with format string
  • Let ValueError propagate if invalid format

6. Helper Functions

extract_first_words(text: str, max_words: int = 5) -> str

Purpose: Extract first N words from text (helper for slug generation).

Type Signature:

def extract_first_words(text: str, max_words: int = 5) -> str:
    """
    Extract first N words from text

    Helper function for slug generation. Splits text on whitespace
    and returns first N non-empty words.

    Args:
        text: Text to extract words from
        max_words: Maximum number of words to extract

    Returns:
        Space-separated string of first N words

    Examples:
        >>> extract_first_words("Hello world this is a test", 3)
        'Hello world this'

        >>> extract_first_words("  Multiple   spaces  ", 2)
        'Multiple spaces'
    """

Implementation Details:

  • Strip whitespace from text
  • Split on whitespace: text.split()
  • Filter empty strings
  • Take first max_words: words[:max_words]
  • Join with single space: " ".join(words)

normalize_slug_text(text: str) -> str

Purpose: Normalize text for slug (lowercase, hyphens, alphanumeric only).

Type Signature:

def normalize_slug_text(text: str) -> str:
    """
    Normalize text for use in slug

    Converts to lowercase, replaces spaces with hyphens, removes
    special characters, and collapses multiple hyphens.

    Args:
        text: Text to normalize

    Returns:
        Normalized slug-safe text

    Examples:
        >>> normalize_slug_text("Hello World!")
        'hello-world'

        >>> normalize_slug_text("Testing... with -- special chars!")
        'testing-with-special-chars'
    """

Implementation Details:

  • Convert to lowercase: text.lower()
  • Replace spaces with hyphens: text.replace(' ', '-')
  • Remove non-alphanumeric (except hyphens): regex [^a-z0-9-]''
  • Collapse multiple hyphens: regex -+-
  • Strip leading/trailing hyphens: text.strip('-')

generate_random_suffix(length: int = 4) -> str

Purpose: Generate random alphanumeric suffix for slug uniqueness.

Type Signature:

def generate_random_suffix(length: int = 4) -> str:
    """
    Generate random alphanumeric suffix

    Creates a secure random string for making slugs unique.
    Uses lowercase letters and numbers only.

    Args:
        length: Length of suffix (default 4)

    Returns:
        Random alphanumeric string

    Examples:
        >>> suffix = generate_random_suffix()
        >>> len(suffix)
        4
        >>> suffix.isalnum()
        True
    """

Implementation Details:

  • Use secrets.token_urlsafe() or secrets.choice()
  • Character set: a-z0-9 (36 characters)
  • Generate random string of specified length
  • Example: 'a7c9', 'x3k2', 'p9m1'

Constants

# Slug configuration
MAX_SLUG_LENGTH = 100           # Maximum slug length in characters
MIN_SLUG_LENGTH = 1             # Minimum slug length in characters
SLUG_WORDS_COUNT = 5            # Number of words to extract for slug
RANDOM_SUFFIX_LENGTH = 4        # Length of random suffix for uniqueness

# File operations
TEMP_FILE_SUFFIX = '.tmp'       # Suffix for temporary files
TRASH_DIR_NAME = '.trash'       # Directory name for soft-deleted files

# Hashing
CONTENT_HASH_ALGORITHM = 'sha256'  # Hash algorithm for content

# Regex patterns
SLUG_PATTERN = re.compile(r'^[a-z0-9]+(?:-[a-z0-9]+)*$')
SAFE_SLUG_PATTERN = re.compile(r'[^a-z0-9-]')
MULTIPLE_HYPHENS_PATTERN = re.compile(r'-+')
WHITESPACE_PATTERN = re.compile(r'\s+')

# Character sets
RANDOM_CHARS = 'abcdefghijklmnopqrstuvwxyz0123456789'

Error Handling

Exception Types

The module uses standard Python exceptions:

  • ValueError: Invalid input (empty content, invalid slug, etc.)
  • FileNotFoundError: File doesn't exist when expected
  • FileExistsError: File already exists (not used in utils, but caller may use)
  • OSError: General I/O errors (permissions, disk full, etc.)

Error Messages

Error messages should be specific and actionable:

Good:

raise ValueError(
    f"Invalid slug '{slug}': must contain only lowercase letters, "
    f"numbers, and hyphens"
)

raise ValueError("Content cannot be empty or whitespace-only")

raise OSError(
    f"Failed to write note file: {file_path}. "
    f"Check permissions and disk space."
)

Bad:

raise ValueError("Invalid slug")
raise ValueError("Bad input")
raise OSError("Write failed")

Testing Strategy

Test Coverage Requirements

  • Minimum 90% code coverage for utils.py
  • Test all functions with multiple scenarios
  • Test edge cases and error conditions
  • Test security (path traversal, etc.)

Test Organization

File: tests/test_utils.py

"""
Tests for utility functions

Organized by function category:
- Slug generation tests
- Content hashing tests
- File path tests
- Atomic file operations tests
- Date/time formatting tests
"""

import pytest
from pathlib import Path
from datetime import datetime
from starpunk.utils import (
    generate_slug,
    make_slug_unique,
    validate_slug,
    calculate_content_hash,
    generate_note_path,
    # ... etc
)

class TestSlugGeneration:
    """Test slug generation functions"""

    def test_generate_slug_from_content(self):
        """Test basic slug generation from content"""
        slug = generate_slug("Hello World This Is My Note")
        assert slug == "hello-world-this-is-my"

    def test_generate_slug_special_characters(self):
        """Test slug generation removes special characters"""
        slug = generate_slug("Testing... with!@# special chars")
        assert slug == "testing-with-special-chars"

    def test_generate_slug_empty_content(self):
        """Test slug generation raises error on empty content"""
        with pytest.raises(ValueError):
            generate_slug("")

    def test_generate_slug_whitespace_only(self):
        """Test slug generation raises error on whitespace-only content"""
        with pytest.raises(ValueError):
            generate_slug("   \n\t   ")

    def test_generate_slug_short_content_fallback(self):
        """Test slug generation falls back to timestamp for short content"""
        slug = generate_slug("A", created_at=datetime(2024, 11, 18, 14, 30))
        assert slug == "20241118-143000"  # Timestamp format

    def test_generate_slug_unicode_content(self):
        """Test slug generation handles unicode"""
        slug = generate_slug("Hello 世界 World")
        # Should strip non-ASCII and keep ASCII words
        assert "hello" in slug
        assert "world" in slug

    def test_make_slug_unique_no_collision(self):
        """Test make_slug_unique returns original if no collision"""
        slug = make_slug_unique("test-note", set())
        assert slug == "test-note"

    def test_make_slug_unique_with_collision(self):
        """Test make_slug_unique adds suffix on collision"""
        slug = make_slug_unique("test-note", {"test-note"})
        assert slug.startswith("test-note-")
        assert len(slug) == len("test-note-") + 4  # 4 char suffix

    def test_validate_slug_valid(self):
        """Test validate_slug accepts valid slugs"""
        assert validate_slug("hello-world") is True
        assert validate_slug("test-note-123") is True
        assert validate_slug("a") is True

    def test_validate_slug_invalid(self):
        """Test validate_slug rejects invalid slugs"""
        assert validate_slug("Hello-World") is False  # Uppercase
        assert validate_slug("-hello") is False  # Leading hyphen
        assert validate_slug("hello-") is False  # Trailing hyphen
        assert validate_slug("hello--world") is False  # Double hyphen
        assert validate_slug("hello_world") is False  # Underscore
        assert validate_slug("") is False  # Empty


class TestContentHashing:
    """Test content hashing functions"""

    def test_calculate_content_hash_consistency(self):
        """Test hash is consistent for same content"""
        hash1 = calculate_content_hash("Test content")
        hash2 = calculate_content_hash("Test content")
        assert hash1 == hash2

    def test_calculate_content_hash_different(self):
        """Test different content produces different hash"""
        hash1 = calculate_content_hash("Test content 1")
        hash2 = calculate_content_hash("Test content 2")
        assert hash1 != hash2

    def test_calculate_content_hash_empty(self):
        """Test hash of empty string"""
        hash_empty = calculate_content_hash("")
        assert len(hash_empty) == 64  # SHA-256 produces 64 hex chars

    def test_calculate_content_hash_unicode(self):
        """Test hash handles unicode correctly"""
        hash_val = calculate_content_hash("Hello 世界")
        assert len(hash_val) == 64


class TestFilePathOperations:
    """Test file path generation and validation"""

    def test_generate_note_path(self):
        """Test note path generation"""
        dt = datetime(2024, 11, 18, 14, 30)
        path = generate_note_path("test-note", dt, Path("data"))
        assert path == Path("data/notes/2024/11/test-note.md")

    def test_generate_note_path_invalid_slug(self):
        """Test note path generation rejects invalid slug"""
        dt = datetime(2024, 11, 18)
        with pytest.raises(ValueError):
            generate_note_path("Invalid Slug!", dt, Path("data"))

    def test_validate_note_path_safe(self):
        """Test path validation accepts safe paths"""
        assert validate_note_path(
            Path("data/notes/2024/11/note.md"),
            Path("data")
        ) is True

    def test_validate_note_path_traversal(self):
        """Test path validation rejects traversal attempts"""
        assert validate_note_path(
            Path("data/notes/../../etc/passwd"),
            Path("data")
        ) is False

    def test_validate_note_path_absolute_outside(self):
        """Test path validation rejects paths outside data dir"""
        assert validate_note_path(
            Path("/etc/passwd"),
            Path("data")
        ) is False


class TestAtomicFileOperations:
    """Test atomic file write/read/delete operations"""

    def test_write_and_read_note_file(self, tmp_path):
        """Test writing and reading note file"""
        file_path = tmp_path / "test.md"
        content = "# Test Note\n\nThis is a test."

        write_note_file(file_path, content)
        assert file_path.exists()

        read_content = read_note_file(file_path)
        assert read_content == content

    def test_write_note_file_atomic(self, tmp_path):
        """Test write is atomic (temp file used)"""
        file_path = tmp_path / "test.md"
        temp_path = Path(str(file_path) + '.tmp')

        # Temp file should not exist after write
        write_note_file(file_path, "Test")
        assert not temp_path.exists()
        assert file_path.exists()

    def test_read_note_file_not_found(self, tmp_path):
        """Test reading non-existent file raises error"""
        file_path = tmp_path / "nonexistent.md"
        with pytest.raises(FileNotFoundError):
            read_note_file(file_path)

    def test_delete_note_file_hard(self, tmp_path):
        """Test hard delete removes file"""
        file_path = tmp_path / "test.md"
        file_path.write_text("Test")

        delete_note_file(file_path, soft=False)
        assert not file_path.exists()

    def test_delete_note_file_soft(self, tmp_path):
        """Test soft delete moves file to trash"""
        # Create note file
        notes_dir = tmp_path / "notes" / "2024" / "11"
        notes_dir.mkdir(parents=True)
        file_path = notes_dir / "test.md"
        file_path.write_text("Test")

        # Soft delete
        delete_note_file(file_path, soft=True, data_dir=tmp_path)

        # Original should be gone
        assert not file_path.exists()

        # Should be in trash
        trash_path = tmp_path / ".trash" / "2024" / "11" / "test.md"
        assert trash_path.exists()


class TestDateTimeFormatting:
    """Test date/time formatting functions"""

    def test_format_rfc822(self):
        """Test RFC-822 date formatting"""
        dt = datetime(2024, 11, 18, 14, 30, 45)
        formatted = format_rfc822(dt)
        assert formatted == "Mon, 18 Nov 2024 14:30:45 +0000"

    def test_format_iso8601(self):
        """Test ISO 8601 date formatting"""
        dt = datetime(2024, 11, 18, 14, 30, 45)
        formatted = format_iso8601(dt)
        assert formatted == "2024-11-18T14:30:45Z"

    def test_parse_iso8601(self):
        """Test ISO 8601 date parsing"""
        dt = parse_iso8601("2024-11-18T14:30:45Z")
        assert dt.year == 2024
        assert dt.month == 11
        assert dt.day == 18
        assert dt.hour == 14
        assert dt.minute == 30
        assert dt.second == 45

    def test_parse_iso8601_invalid(self):
        """Test ISO 8601 parsing rejects invalid format"""
        with pytest.raises(ValueError):
            parse_iso8601("not-a-date")

Additional Test Cases

Security Tests:

  • Path traversal attempts with various patterns
  • Symlink handling
  • Unicode/emoji in slugs
  • Very long input strings
  • Malicious input patterns

Edge Cases:

  • Empty content
  • Whitespace-only content
  • Very long content (>1MB)
  • Unicode normalization issues
  • Concurrent file operations (if applicable)

Performance Tests:

  • Slug generation with 10,000 character input
  • Hash calculation with large content
  • Path validation with deep directory structures

Usage Examples

Creating a Note

from datetime import datetime
from pathlib import Path
from starpunk.utils import (
    generate_slug,
    make_slug_unique,
    generate_note_path,
    ensure_note_directory,
    write_note_file,
    calculate_content_hash
)

# Note content
content = "This is my first note about Python programming"
created_at = datetime.utcnow()
data_dir = Path("data")

# Generate slug
base_slug = generate_slug(content, created_at)  # "this-is-my-first-note"

# Check uniqueness (pseudo-code, actual check via database)
existing_slugs = get_existing_slugs_from_db()  # Returns set of slugs
slug = make_slug_unique(base_slug, existing_slugs)

# Generate file path
note_path = generate_note_path(slug, created_at, data_dir)
# Result: data/notes/2024/11/this-is-my-first-note.md

# Create directory
ensure_note_directory(note_path)

# Write file
write_note_file(note_path, content)

# Calculate hash for database
content_hash = calculate_content_hash(content)

# Save metadata to database (pseudo-code)
save_note_metadata(
    slug=slug,
    file_path=str(note_path.relative_to(data_dir)),
    created_at=created_at,
    content_hash=content_hash
)

Reading a Note

from pathlib import Path
from starpunk.utils import read_note_file, validate_note_path

# Get file path from database (pseudo-code)
note = get_note_from_db(slug="my-first-note")
file_path = Path("data") / note.file_path

# Validate path (security check)
if not validate_note_path(file_path, Path("data")):
    raise ValueError("Invalid file path")

# Read content
content = read_note_file(file_path)

Deleting a Note

from pathlib import Path
from starpunk.utils import delete_note_file

# Get note path from database
note = get_note_from_db(slug="old-note")
file_path = Path("data") / note.file_path

# Soft delete (move to trash)
delete_note_file(file_path, soft=True, data_dir=Path("data"))

# Update database to mark as deleted (pseudo-code)
mark_note_deleted(note.id)

Integration with Other Modules

Database Integration

The database module will use utilities like this:

# In starpunk/notes.py
from starpunk.utils import (
    generate_slug, make_slug_unique, generate_note_path,
    ensure_note_directory, write_note_file, calculate_content_hash
)
from starpunk.database import get_db

def create_note(content: str, published: bool = False) -> Note:
    """Create a new note"""
    db = get_db()
    created_at = datetime.utcnow()

    # Generate unique slug
    base_slug = generate_slug(content, created_at)
    existing_slugs = set(row[0] for row in db.execute("SELECT slug FROM notes").fetchall())
    slug = make_slug_unique(base_slug, existing_slugs)

    # Generate path and write file
    note_path = generate_note_path(slug, created_at, Path(current_app.config['DATA_PATH']))
    ensure_note_directory(note_path)
    write_note_file(note_path, content)

    # Calculate hash
    content_hash = calculate_content_hash(content)

    # Insert into database
    db.execute(
        "INSERT INTO notes (slug, file_path, published, created_at, updated_at, content_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (slug, str(note_path.relative_to(Path(current_app.config['DATA_PATH']))),
         published, created_at, created_at, content_hash)
    )
    db.commit()

    return Note(slug=slug, content=content, created_at=created_at, ...)

RSS Feed Integration

The feed module will use date formatting:

# In starpunk/feed.py
from starpunk.utils import format_rfc822

def generate_rss_feed():
    """Generate RSS feed"""
    fg = FeedGenerator()

    for note in get_published_notes():
        fe = fg.add_entry()
        fe.pubDate(format_rfc822(note.created_at))

Performance Considerations

Slug Generation Performance

  • Expected input: 50-1000 characters
  • Target: <1ms per slug generation
  • Bottleneck: None (string operations are fast)

File Operations Performance

  • Write operation: <10ms for typical note (1-10KB)
  • Read operation: <5ms for typical note
  • Atomic rename: ~1ms (OS-level operation)

Path Validation Performance

  • Target: <1ms per validation
  • Use cached absolute paths where possible

Security Considerations

Path Traversal Prevention

CRITICAL: All file paths MUST be validated before use.

# Always validate before file operations
if not validate_note_path(file_path, data_dir):
    raise ValueError("Invalid file path: potential directory traversal")

Slug Validation

Prevent injection attacks through slugs:

# Validate slug before using in paths
if not validate_slug(slug):
    raise ValueError(f"Invalid slug: {slug}")

Random Suffix Security

Use cryptographically secure random:

# Use secrets module, not random module
import secrets
suffix = secrets.token_urlsafe(4)[:4].lower()

Dependencies

Standard Library

  • hashlib - SHA-256 hashing
  • re - Regular expressions
  • secrets - Secure random generation
  • pathlib - Path operations
  • datetime - Date/time handling
  • shutil - File operations (move)

Third-Party

  • flask - For current_app context (configuration access)

Internal

  • None (this is the foundation module)

Future Enhancements (V2+)

Slug Customization

Allow user to specify custom slug:

def generate_slug(content: str, custom_slug: Optional[str] = None, ...) -> str:
    if custom_slug:
        if validate_slug(custom_slug):
            return custom_slug
        raise ValueError("Invalid custom slug")
    # ... existing logic

Unicode Slug Support

Support non-ASCII characters in slugs:

def generate_unicode_slug(content: str) -> str:
    """Generate slug with unicode support"""
    # Use unicode normalization
    # Support CJK characters
    # Support diacritics

Content Preview Extraction

Extract preview/excerpt from content:

def extract_preview(content: str, max_length: int = 200) -> str:
    """Extract preview text from content"""
    # Strip markdown formatting
    # Truncate to max_length
    # Add ellipsis if truncated

Title Extraction

Extract title from first line:

def extract_title(content: str) -> Optional[str]:
    """Extract title from content (first line or # heading)"""
    # Check for # heading
    # Or use first line
    # Return None if empty

Acceptance Criteria

  • All functions implemented with type hints
  • All functions have comprehensive docstrings
  • Slug generation creates valid, URL-safe slugs
  • Slug uniqueness enforcement works correctly
  • Content hashing is consistent and correct
  • Path validation prevents directory traversal
  • Atomic file writes prevent corruption
  • Date formatting matches RFC-822 and ISO 8601 specs
  • All security checks are in place
  • Test coverage >90%
  • All tests pass
  • Code formatted with Black
  • Code passes flake8 linting
  • No hardcoded values (use constants)
  • Error messages are clear and actionable

References


Implementation Checklist

When implementing starpunk/utils.py, complete in this order:

  1. Create file with module docstring
  2. Add imports and constants
  3. Implement helper functions (extract_first_words, normalize_slug_text, generate_random_suffix)
  4. Implement slug functions (generate_slug, make_slug_unique, validate_slug)
  5. Implement content hashing (calculate_content_hash)
  6. Implement path functions (generate_note_path, ensure_note_directory, validate_note_path)
  7. Implement file operations (write_note_file, read_note_file, delete_note_file)
  8. Implement date/time functions (format_rfc822, format_iso8601, parse_iso8601)
  9. Create tests/test_utils.py
  10. Write tests for all functions
  11. Run tests and achieve >90% coverage
  12. Format with Black
  13. Lint with flake8
  14. Review all docstrings
  15. Review all error messages

Estimated Time: 2-3 hours for implementation + tests