StarPunk/docs/decisions/ADR-004-file-based-note-storage.md

# ADR-004: File-Based Note Storage Architecture

## Status
Accepted

## Context
The user explicitly requires notes to be stored as files on disk rather than as database records. This is critical for:
1. Data portability - notes can be backed up, moved, and read without the application
2. User ownership - direct access to content in human-readable format
3. Simplicity - text files are the simplest storage mechanism
4. Future-proofing - markdown files will be readable forever

However, we also need SQLite for:
- Metadata (timestamps, slugs, published status)
- Authentication tokens
- Fast querying and indexing
- Relational data

The challenge is designing how file-based storage and database metadata work together efficiently.

## Decision

### Hybrid Architecture: Files + Database Metadata

- **Notes Content**: Stored as markdown files on disk
- **Notes Metadata**: Stored in SQLite database
- **Source of Truth**: Files are authoritative for content; database is authoritative for metadata

### File Storage Strategy

#### Directory Structure
```
data/
├── notes/
│   ├── 2024/
│   │   ├── 11/
│   │   │   ├── my-first-note.md
│   │   │   └── another-note.md
│   │   └── 12/
│   │       └── december-note.md
│   └── 2025/
│       └── 01/
│           └── new-year-note.md
├── starpunk.db          # SQLite database
└── .backups/            # Optional backup directory
```

#### File Naming Convention
- **Format**: `{slug}.md`
- **Slug rules**: lowercase, alphanumeric, hyphens only, no spaces
- **Example**: `my-first-note.md`
- **Uniqueness**: The filesystem prevents duplicate names within a single directory; the `UNIQUE` constraint on `slug` in the database enforces uniqueness across all directories
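
The slug rules above can be checked with a short validator. This is a minimal sketch, not the project's implementation: `SLUG_PATTERN`, `is_valid_slug`, and `filename_for` are hypothetical names, and the pattern is slightly stricter than the stated rules in that it also rejects leading, trailing, and doubled hyphens.

```python
import re

# Hypothetical pattern: lowercase alphanumeric runs separated by single hyphens.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(slug: str) -> bool:
    """Return True if the slug uses only lowercase letters, digits, and hyphens."""
    return SLUG_PATTERN.fullmatch(slug) is not None


def filename_for(slug: str) -> str:
    """Build the `{slug}.md` filename, rejecting slugs that break the rules."""
    if not is_valid_slug(slug):
        raise ValueError(f"invalid slug: {slug!r}")
    return f"{slug}.md"
```

Validating before building the filename keeps spaces, uppercase letters, and path separators out of the notes directory.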

#### File Organization
- **Pattern**: Year/Month subdirectories (`YYYY/MM/`)
- **Rationale**:
  - Keeps directories manageable (about 30 files per month at one note per day)
  - Easy chronological browsing
  - Matches natural mental model
  - Scalable to thousands of notes
- **Example path**: `data/notes/2024/11/my-first-note.md`
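
As an illustration of the `YYYY/MM/` pattern, the path for a note could be derived from its creation date as sketched below. `NOTES_ROOT` and `note_path` are hypothetical names, and the sketch assumes the creation timestamp (not the update time) selects the directory.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Assumed root, taken from the directory structure shown earlier.
NOTES_ROOT = PurePosixPath("data/notes")


def note_path(slug: str, created_at: datetime) -> PurePosixPath:
    """Return data/notes/YYYY/MM/{slug}.md with a zero-padded month."""
    return NOTES_ROOT / f"{created_at:%Y}" / f"{created_at:%m}" / f"{slug}.md"
```

Formatting the month with `%m` zero-pads it, so January becomes `01` and directories sort chronologically.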

### Database Schema

```sql
CREATE TABLE notes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    slug TEXT UNIQUE NOT NULL,           -- URL identifier
    file_path TEXT UNIQUE NOT NULL,      -- Relative path from data/notes/
    published BOOLEAN DEFAULT 0,         -- Publication status
    created_at TIMESTAMP NOT NULL,       -- Creation timestamp
    updated_at TIMESTAMP NOT NULL,       -- Last modification timestamp
    content_hash TEXT,                   -- SHA-256 of file content for change detection
    deleted_at TIMESTAMP                 -- Soft-delete timestamp; NULL while the note is active
);

CREATE INDEX idx_notes_created_at ON notes(created_at DESC);
CREATE INDEX idx_notes_published ON notes(published);
CREATE INDEX idx_notes_slug ON notes(slug);
```
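
To show how this schema might be used from application code, here is a minimal sketch with Python's built-in `sqlite3` module. `open_database` and `list_published` are hypothetical helpers, the `SCHEMA` string is a condensed copy of the columns used here, and storing timestamps as ISO 8601 strings is an assumption.

```python
import sqlite3

# Condensed copy of the notes table so the example is self-contained.
SCHEMA = """
CREATE TABLE notes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    slug TEXT UNIQUE NOT NULL,
    file_path TEXT UNIQUE NOT NULL,
    published BOOLEAN DEFAULT 0,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL,
    content_hash TEXT
);
CREATE INDEX idx_notes_created_at ON notes(created_at DESC);
CREATE INDEX idx_notes_published ON notes(published);
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and apply the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def list_published(conn: sqlite3.Connection) -> list:
    """Return slugs of published notes, newest first."""
    rows = conn.execute(
        "SELECT slug FROM notes WHERE published = 1 ORDER BY created_at DESC"
    )
    return [row[0] for row in rows]
```

Inserting a second row with an existing slug raises `sqlite3.IntegrityError`, which is how the database backs up the per-directory uniqueness the filesystem provides.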

### File Format

#### Markdown File Structure
```markdown
[Content of the note in markdown format]
```

**That's it.** No frontmatter, no metadata in file. Keep it pure.

**Rationale**:
- Maximum portability
- Readable by any markdown editor
- No custom parsing required
- Metadata belongs in database (timestamps, slugs, etc.)
- User sees just their content when opening file

#### Optional Future Enhancement (V2+)
If frontmatter becomes necessary, use standard YAML:
```markdown
---
title: Optional Title
tags: [tag1, tag2]
---
[Content here]
```

But for V1: **NO frontmatter**.

## Rationale

### File Storage Benefits
**Simplicity Score: 10/10**
- Text files are the simplest storage
- No binary formats
- Human-readable
- Easy to backup (rsync, git, Dropbox, etc.)

**Portability Score: 10/10**
- Standard markdown format
- Readable without application
- Can be edited in any text editor
- Easy to migrate to other systems

**Ownership Score: 10/10**
- User has direct access to their content
- No vendor lock-in
- Can grep their own notes
- Backup is simple file copy

### Hybrid Approach Benefits
- **Performance**: Database indexes enable fast queries
- **Flexibility**: Rich metadata without cluttering files
- **Integrity**: Database enforces uniqueness and relationships
- **Simplicity**: Each system does what it's best at

## Consequences

### Positive
- Notes are portable markdown files
- User can edit notes directly in filesystem if desired
- Easy backup (just copy data/ directory)
- Database provides fast metadata queries
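- Can rebuild database from files if needed (slugs and paths are recoverable from the directory layout; published status and original timestamps are not stored in files)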
- Git-friendly (can version control notes)
- Maximum data ownership

### Negative
- Must keep file and database in sync
- Potential for orphaned database records
- Potential for orphaned files
- Reading note content requires a file read per note, which is slower than an indexed database query
- Must handle file system errors

### Mitigation Strategies

#### Sync Strategy
1. **On note creation**: Write file FIRST, then database record
2. **On note update**: Update file FIRST, then database record (update timestamp, content_hash)
3. **On note delete**: Mark as deleted in database, optionally move file to .trash/
4. **On startup**: Optional integrity check to detect orphans

#### Orphan Detection
```python
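import logging
from pathlib import Path

def check_integrity(conn, notes_dir="data/notes"):
    """Log notes present in only one of the database and the filesystem.

    `conn` is an open sqlite3 connection; file_path is relative to notes_dir.
    """
    root = Path(notes_dir)
    in_db = {row[0] for row in conn.execute("SELECT file_path FROM notes")}
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*.md")}

    # Find database records without files
    for path in sorted(in_db - on_disk):
        logging.error("Orphaned database record: %s", path)

    # Find files without database records
    for path in sorted(on_disk - in_db):
        logging.error("Orphaned file: %s", path)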
```

#### Content Hash Strategy
- Calculate SHA-256 hash of file content on write
- Store hash in database
- On read, compare the stored hash with a fresh hash of the file to detect external modification
- Enables change detection and cache invalidation
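
The hash strategy above could look like the following sketch, which uses `hashlib` from the standard library. `content_hash` and `file_changed_externally` are hypothetical names, and hashing the UTF-8 bytes (not the decoded text) is an assumption made so that the stored hash matches the bytes on disk.

```python
import hashlib
from pathlib import Path


def content_hash(content: str) -> str:
    """Return the SHA-256 hex digest of the note's UTF-8 encoded content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def file_changed_externally(path: Path, stored_hash: str) -> bool:
    """Compare the file's current hash with the hash stored in the database."""
    current = hashlib.sha256(path.read_bytes()).hexdigest()
    return current != stored_hash
```

When `file_changed_externally` returns `True`, the application can refresh `content_hash` and `updated_at`, or invalidate any cached rendering of the note.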

## Data Flow Patterns

### Creating a Note

1. Generate slug from content or timestamp
2. Determine file path: `data/notes/{YYYY}/{MM}/{slug}.md`
3. Create directories if needed
4. Write markdown content to file
5. Calculate content hash
6. Insert record into database
7. Return success

**Transaction Safety**: If database insert fails, delete file and raise error
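
The creation steps above can be sketched as one function. This is an illustration under stated assumptions, not the final implementation: `create_note` is a hypothetical name, the slug is passed in already generated, timestamps are stored as ISO 8601 strings, and `file_path` is stored relative to the notes directory as the schema comment describes.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def create_note(conn, notes_dir: Path, slug: str, content: str, now=None) -> Path:
    """Write the file first, then insert the database record."""
    now = now or datetime.now(timezone.utc)
    rel_path = f"{now:%Y}/{now:%m}/{slug}.md"
    full_path = notes_dir / rel_path
    full_path.parent.mkdir(parents=True, exist_ok=True)

    data = content.encode("utf-8")
    # Mode "xb" fails if the file already exists, so nothing is overwritten.
    with open(full_path, "xb") as f:
        f.write(data)

    digest = hashlib.sha256(data).hexdigest()
    stamp = now.isoformat()
    try:
        with conn:  # Commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notes (slug, file_path, created_at, updated_at, content_hash)"
                " VALUES (?, ?, ?, ?, ?)",
                (slug, rel_path, stamp, stamp, digest),
            )
    except sqlite3.Error:
        # Transaction safety: remove the file if the insert fails.
        full_path.unlink(missing_ok=True)
        raise
    return full_path
```

A slug that already exists in another month's directory passes the filesystem check but fails the database insert, and the cleanup branch then removes the newly written file.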

### Reading a Note

**By Slug**:
1. Query database for file_path by slug
2. Read file content from disk
3. Return content + metadata

**For List**:
1. Query database for metadata (sorted, filtered)
2. Optionally read file content for each note
3. Return list with metadata (and content, if it was read)
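
Both read paths can be sketched as follows. `read_note`, `list_notes`, and the dictionary return shape are hypothetical choices for illustration; the sketch assumes `file_path` is relative to the notes directory.

```python
import sqlite3
from pathlib import Path

FIELDS = ("slug", "file_path", "published", "created_at", "updated_at")
COLUMNS = ", ".join(FIELDS)


def _to_dict(row, notes_dir: Path, include_content: bool) -> dict:
    """Combine a metadata row with the file content when requested."""
    note = dict(zip(FIELDS, row))
    if include_content:
        note["content"] = (notes_dir / note["file_path"]).read_text(encoding="utf-8")
    return note


def read_note(conn, notes_dir: Path, slug: str):
    """Look up the path by slug, then read the content from disk."""
    row = conn.execute(
        f"SELECT {COLUMNS} FROM notes WHERE slug = ?", (slug,)
    ).fetchone()
    return None if row is None else _to_dict(row, notes_dir, True)


def list_notes(conn, notes_dir: Path, published_only=True, include_content=False) -> list:
    """Query metadata only by default; reading files is opt-in."""
    sql = f"SELECT {COLUMNS} FROM notes"
    if published_only:
        sql += " WHERE published = 1"
    sql += " ORDER BY created_at DESC"
    return [_to_dict(r, notes_dir, include_content) for r in conn.execute(sql)]
```

Leaving `include_content` off for list views means a listing touches only the database, and files are read only when a note is actually displayed.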

### Updating a Note

1. Query database for existing file_path
2. Write new content to file (atomic write to temp, then rename)
3. Calculate new content hash
4. Update database record (timestamp, content_hash)
5. Return success

**Transaction Safety**: Keep backup of original file until database update succeeds
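
One way to combine the atomic write with the backup rule is sketched below. `update_note` and the `.bak` and `.tmp` suffixes are hypothetical; the sketch assumes the temp and backup files live next to the note so every rename stays on one filesystem.

```python
import hashlib
import os
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def update_note(conn, notes_dir: Path, slug: str, new_content: str, now=None) -> None:
    """Replace the file atomically, then update timestamp and hash."""
    now = now or datetime.now(timezone.utc)
    row = conn.execute("SELECT file_path FROM notes WHERE slug = ?", (slug,)).fetchone()
    if row is None:
        raise KeyError(slug)

    path = notes_dir / row[0]
    backup = path.with_name(path.name + ".bak")
    temp = path.with_name(path.name + ".tmp")
    shutil.copy2(path, backup)  # Keep the original until the database agrees

    data = new_content.encode("utf-8")
    temp.write_bytes(data)
    os.replace(temp, path)  # Atomic rename over the existing note

    try:
        with conn:
            conn.execute(
                "UPDATE notes SET updated_at = ?, content_hash = ? WHERE slug = ?",
                (now.isoformat(), hashlib.sha256(data).hexdigest(), slug),
            )
    except sqlite3.Error:
        os.replace(backup, path)  # Restore the original content
        raise
    backup.unlink()
```

Because the suffixes are appended after `.md`, a scan for `*.md` files does not pick up leftover temp or backup files.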

### Deleting a Note

**Soft Delete (Recommended)**:
1. Update database: set `deleted_at` timestamp
2. Optionally move file to `.trash/` subdirectory
3. Return success

**Hard Delete**:
1. Delete database record
2. Delete file from filesystem
3. Return success
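
The soft-delete path could be sketched as follows, assuming the `notes` table has a nullable `deleted_at` column. `soft_delete_note` is a hypothetical name, and the trash location is passed in as a parameter because this document does not fix where `.trash/` lives.

```python
import os
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def soft_delete_note(conn, notes_dir: Path, slug: str, trash_dir=None, now=None) -> bool:
    """Mark a note as deleted; optionally move its file into trash_dir."""
    now = now or datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT file_path FROM notes WHERE slug = ? AND deleted_at IS NULL", (slug,)
    ).fetchone()
    if row is None:
        return False  # Unknown slug or already deleted

    with conn:
        conn.execute(
            "UPDATE notes SET deleted_at = ? WHERE slug = ?", (now.isoformat(), slug)
        )

    if trash_dir is not None:
        # Mirror the YYYY/MM layout inside the trash directory.
        dest = trash_dir / row[0]
        dest.parent.mkdir(parents=True, exist_ok=True)
        os.replace(notes_dir / row[0], dest)
    return True
```

If files are moved to trash, an integrity check needs to skip rows where `deleted_at` is set, otherwise they are reported as orphaned records.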

## File System Operations

### Atomic Writes
```python
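import os

def write_note_safely(path, content):
    temp_path = f"{path}.tmp"  # Same directory, so the rename stays on one filesystem
    with open(temp_path, "w", encoding="utf-8") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())  # Push data to disk before the rename
    os.replace(temp_path, path)  # Atomic on POSIX systems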
```

### Directory Creation
```python
import os

# Ensure directory exists before writing
def ensure_note_directory(year: int, month: int) -> str:
    path = f"data/notes/{year:04d}/{month:02d}"  # Zero-pad to match YYYY/MM
    os.makedirs(path, exist_ok=True)
    return path
```

### Slug Generation
```python
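import re
import secrets
from datetime import datetime, timezone

# Generate URL-safe slug
def generate_slug(slug_exists, content=None, timestamp=None, max_words=5):
    """`slug_exists` is a callable (slug -> bool), e.g. a database lookup."""
    slug = ""
    if content:
        # Take the first few words; keep only lowercase letters, digits, hyphens
        words = content.lower().split()[:max_words]
        slug = re.sub(r"[^a-z0-9]+", "-", " ".join(words)).strip("-")
    if not slug:
        # Fallback: timestamp-based (also used when content has no usable characters)
        timestamp = timestamp or datetime.now(timezone.utc)
        slug = timestamp.strftime("%Y%m%d-%H%M%S")

    # Ensure uniqueness; retry until the suffixed slug is free
    candidate = slug
    while slug_exists(candidate):
        candidate = f"{slug}-{secrets.token_hex(2)}"
    return candidate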
```

## Backup Strategy

### Simple Backup
```bash
# User can backup with simple copy
cp -r data/ backup/

# Or with rsync
rsync -av data/ backup/

# Or with git (assumes data/ is already a git repository)
cd data/ && git add . && git commit -m "Backup"
```

### Restore Strategy
1. Copy data/ directory to new location
2. Application reads database
3. If database missing or corrupt, rebuild from files:
   ```python
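   import hashlib
   from datetime import datetime, timezone
   from pathlib import Path

   def rebuild_database_from_files(database, notes_dir="data/notes"):
       # `database` is any object exposing insert_note(); published status is not recoverable
       root = Path(notes_dir)
       for path in sorted(root.rglob("*.md")):
           # mtime stands in for both timestamps; creation time is not portable
           modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
           database.insert_note(
               slug=path.stem,
               file_path=path.relative_to(root).as_posix(),  # Relative to data/notes/
               created_at=modified,
               updated_at=modified,
               content_hash=hashlib.sha256(path.read_bytes()).hexdigest(),
           )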
   ```

## Standards Compliance

### Markdown Standard
- CommonMark specification
- No custom extensions in V1
- Standard markdown processors can read files

### File System Compatibility
- ASCII-safe filenames
- No special characters in paths
- Total path length kept under 255 characters (a conservative limit that also fits within the 260-character `MAX_PATH` on Windows)
- POSIX-compatible directory structure

## Alternatives Considered

### All-Database Storage (Rejected)
- **Simplicity**: 8/10 - Simpler code, single source of truth
- **Portability**: 2/10 - Requires database export
- **Ownership**: 3/10 - User doesn't have direct access
- **Verdict**: Violates user requirement for file-based storage

### Flat File Directory (Rejected)
```
data/notes/
├── note-1.md
├── note-2.md
├── note-3.md
...
├── note-9999.md
```
- **Simplicity**: 10/10 - Simplest possible structure
- **Scalability**: 3/10 - Thousands of files in one directory slow down listings and are hard to browse by hand
- **Verdict**: Not scalable, poor performance with many notes

### Git-Based Storage (Rejected for V1)
- **Simplicity**: 6/10 - Requires git integration
- **Portability**: 9/10 - Excellent versioning
- **Performance**: 7/10 - Git operations have overhead
- **Verdict**: Interesting for V2, but adds complexity to V1

### Frontmatter in Files (Rejected for V1)
```markdown
---
slug: my-note
created: 2024-11-18
published: true
---
Note content here
```
- **Simplicity**: 7/10 - Requires YAML parsing
- **Portability**: 8/10 - Common pattern, but not pure markdown
- **Single Source**: 10/10 - All data in one place
- **Verdict**: Deferred to V2; V1 keeps files pure

### JSON Metadata Sidecar (Rejected)
```
notes/
├── my-note.md
└── my-note.json  # Metadata
```
- **Simplicity**: 6/10 - Doubles number of files
- **Portability**: 7/10 - Markdown still clean, but extra files
- **Sync Issues**: 5/10 - Must keep two files in sync
- **Verdict**: Database metadata is cleaner

## Implementation Checklist

- [ ] Create data/notes directory structure on initialization
- [ ] Implement slug generation algorithm
- [ ] Implement atomic file write operations
- [ ] Implement content hash calculation
- [ ] Create database schema with indexes
- [ ] Implement sync between files and database
- [ ] Implement orphan detection (optional for V1)
- [ ] Add file system error handling
- [ ] Create backup documentation for users
- [ ] Test with thousands of notes for performance

## References
- CommonMark Spec: https://spec.commonmark.org/
- POSIX File Operations: https://pubs.opengroup.org/onlinepubs/9699919799/
- File System Best Practices: https://www.pathname.com/fhs/
- Atomic File Operations: https://lwn.net/Articles/457667/