StarPunk/docs/decisions/ADR-004-file-based-note-storage.md
2025-11-18 19:21:31 -07:00

# ADR-004: File-Based Note Storage Architecture
## Status
Accepted
## Context
The user explicitly requires notes to be stored as files on disk rather than as database records. This is critical for:
1. Data portability - notes can be backed up, moved, and read without the application
2. User ownership - direct access to content in human-readable format
3. Simplicity - text files are the simplest storage mechanism
4. Future-proofing - markdown files will be readable forever
However, we also need SQLite for:
- Metadata (timestamps, slugs, published status)
- Authentication tokens
- Fast querying and indexing
- Relational data
The challenge is designing how file-based storage and database metadata work together efficiently.
## Decision
### Hybrid Architecture: Files + Database Metadata
**Notes Content**: Stored as markdown files on disk
**Notes Metadata**: Stored in SQLite database
**Source of Truth**: Files are authoritative for content; database is authoritative for metadata
### File Storage Strategy
#### Directory Structure
```
data/
├── notes/
│   ├── 2024/
│   │   ├── 11/
│   │   │   ├── my-first-note.md
│   │   │   └── another-note.md
│   │   └── 12/
│   │       └── december-note.md
│   └── 2025/
│       └── 01/
│           └── new-year-note.md
├── starpunk.db          # SQLite database
└── .backups/            # Optional backup directory
```
#### File Naming Convention
- **Format**: `{slug}.md`
- **Slug rules**: lowercase, alphanumeric, hyphens only, no spaces
- **Example**: `my-first-note.md`
- **Uniqueness**: The filesystem prevents two files with the same name in one directory; the `UNIQUE` constraint on `slug` in the database enforces uniqueness across all directories
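The slug rules above could be checked with a small validator. This is a minimal sketch: the name `is_valid_slug` and the exact pattern are assumptions, and the pattern is slightly stricter than the stated rules in that it also rejects leading, trailing, and doubled hyphens.

```python
import re

# Assumed pattern: lowercase alphanumeric groups joined by single hyphens.
# Rejecting leading, trailing, and doubled hyphens goes beyond the stated rules.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(slug: str) -> bool:
    """Return True if the whole string is a well-formed slug."""
    return SLUG_PATTERN.fullmatch(slug) is not None
```

`fullmatch` is used rather than `match` with `$` so that a trailing newline is not silently accepted.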
#### File Organization
- **Pattern**: Year/Month subdirectories (`YYYY/MM/`)
- **Rationale**:
  - Keeps directories manageable (about 30 files per month at one note per day)
  - Easy chronological browsing
  - Matches natural mental model
  - Scalable to thousands of notes
- **Example path**: `data/notes/2024/11/my-first-note.md`
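As an illustration of the `YYYY/MM/` pattern, the path for a note might be derived from its slug and creation time as follows. The helper name `note_path` and the `NOTES_ROOT` constant are assumptions made for this sketch.

```python
from datetime import datetime
from pathlib import Path

NOTES_ROOT = Path("data/notes")  # Root directory taken from the layout above


def note_path(slug: str, created_at: datetime) -> Path:
    """Build data/notes/YYYY/MM/{slug}.md; the month is zero-padded by %m."""
    return NOTES_ROOT / f"{created_at:%Y}" / f"{created_at:%m}" / f"{slug}.md"


print(note_path("my-first-note", datetime(2024, 11, 18)).as_posix())
# prints data/notes/2024/11/my-first-note.md
```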
### Database Schema
```sql
CREATE TABLE notes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    slug TEXT UNIQUE NOT NULL,        -- URL identifier
    file_path TEXT UNIQUE NOT NULL,   -- Relative path from data/notes/
    published BOOLEAN DEFAULT 0,      -- Publication status
    created_at TIMESTAMP NOT NULL,    -- Creation timestamp
    updated_at TIMESTAMP NOT NULL,    -- Last modification timestamp
    deleted_at TIMESTAMP,             -- Soft-delete timestamp (NULL while the note is live)
    content_hash TEXT                 -- SHA-256 of file content for change detection
);
CREATE INDEX idx_notes_created_at ON notes(created_at DESC);
CREATE INDEX idx_notes_published ON notes(published);
-- slug needs no separate index: its UNIQUE constraint already creates one
```
### File Format
#### Markdown File Structure
```markdown
[Content of the note in markdown format]
```
**That's it.** No frontmatter, no metadata in file. Keep it pure.
**Rationale**:
- Maximum portability
- Readable by any markdown editor
- No custom parsing required
- Metadata belongs in database (timestamps, slugs, etc.)
- User sees just their content when opening file
#### Optional Future Enhancement (V2+)
If frontmatter becomes necessary, use standard YAML:
```markdown
---
title: Optional Title
tags: [tag1, tag2]
---
[Content here]
```
But for V1: **NO frontmatter**.
## Rationale
### File Storage Benefits
**Simplicity Score: 10/10**
- Text files are the simplest storage
- No binary formats
- Human-readable
- Easy to backup (rsync, git, Dropbox, etc.)
**Portability Score: 10/10**
- Standard markdown format
- Readable without application
- Can be edited in any text editor
- Easy to migrate to other systems
**Ownership Score: 10/10**
- User has direct access to their content
- No vendor lock-in
- Can grep their own notes
- Backup is simple file copy
### Hybrid Approach Benefits
**Performance**: Database indexes enable fast queries
**Flexibility**: Rich metadata without cluttering files
**Integrity**: Database enforces uniqueness and relationships
**Simplicity**: Each system does what it's best at
## Consequences
### Positive
- Notes are portable markdown files
- User can edit notes directly in filesystem if desired
- Easy backup (just copy data/ directory)
- Database provides fast metadata queries
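- Can rebuild database from files if needed (slugs, paths, and hashes; publication status and original timestamps are not recoverable from files alone)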
- Git-friendly (can version control notes)
- Maximum data ownership
### Negative
- Must keep file and database in sync
- Potential for orphaned database records
- Potential for orphaned files
- File operations are slower than database queries
- Must handle file system errors
### Mitigation Strategies
#### Sync Strategy
1. **On note creation**: Write file FIRST, then database record
2. **On note update**: Update file FIRST, then database record (update timestamp, content_hash)
3. **On note delete**: Mark as deleted in database, optionally move file to .trash/
4. **On startup**: Optional integrity check to detect orphans
#### Orphan Detection
```python
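# Integrity check (sketch): `conn` is an open sqlite3 connection,
# `notes_dir` is a pathlib.Path pointing at data/notes
import logging

def check_integrity(conn, notes_dir):
    records = dict(conn.execute("SELECT file_path, slug FROM notes"))
    files = {p.relative_to(notes_dir).as_posix() for p in notes_dir.rglob("*.md")}
    # Find database records without files
    for file_path in sorted(records.keys() - files):
        logging.error("Orphaned database record: %s", records[file_path])
    # Find files without database records
    for file_path in sorted(files - records.keys()):
        logging.error("Orphaned file: %s", file_path)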
```
#### Content Hash Strategy
- Calculate SHA-256 hash of file content on write
- Store hash in database
- On read, the stored hash can be compared against the file to detect external modifications
- Enables change detection and cache invalidation
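A minimal sketch of the hash strategy, using the standard library's `hashlib`. The function names `content_hash` and `is_modified` are assumptions for illustration. Hashing the raw bytes rather than decoded text keeps the result independent of newline translation.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the hex SHA-256 digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_modified(path: Path, stored_hash: str) -> bool:
    """True if the file on disk no longer matches the hash stored in the database."""
    return content_hash(path) != stored_hash
```

A mismatch on read suggests the file was edited outside the application, at which point the stored hash and `updated_at` could be refreshed.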
## Data Flow Patterns
### Creating a Note
1. Generate slug from content or timestamp
2. Determine file path: `data/notes/{YYYY}/{MM}/{slug}.md`
3. Create directories if needed
4. Write markdown content to file
5. Calculate content hash
6. Insert record into database
7. Return success
**Transaction Safety**: If database insert fails, delete file and raise error
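The creation flow, including the cleanup on a failed insert, might look like the sketch below. It assumes an open `sqlite3` connection to a database with the `notes` table above; the function name `create_note` and its parameters are hypothetical, and the slug is taken as already generated.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def create_note(conn: sqlite3.Connection, notes_dir: Path, slug: str, content: str) -> Path:
    """Write the file first, then insert the metadata row; remove the file if the insert fails."""
    now = datetime.now(timezone.utc)
    rel_path = f"{now:%Y}/{now:%m}/{slug}.md"  # Stored relative to data/notes/
    full_path = notes_dir / rel_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    data = content.encode("utf-8")
    # Mode "xb" refuses to overwrite an existing file at the same path
    with open(full_path, "xb") as f:
        f.write(data)
    digest = hashlib.sha256(data).hexdigest()
    try:
        with conn:  # Commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO notes (slug, file_path, created_at, updated_at, content_hash)"
                " VALUES (?, ?, ?, ?, ?)",
                (slug, rel_path, now.isoformat(), now.isoformat(), digest),
            )
    except sqlite3.Error:
        full_path.unlink()  # Delete the file so no orphan is left behind
        raise
    return full_path
```

Writing the file before the row means a crash between the two steps leaves an orphaned file rather than a record pointing at nothing, which the integrity check can later report.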
### Reading a Note
**By Slug**:
1. Query database for file_path by slug
2. Read file content from disk
3. Return content + metadata
**For List**:
1. Query database for metadata (sorted, filtered)
2. Optionally read file content for each note
3. Return list with metadata and content
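Both read paths can be sketched against the schema above. The function names `read_note` and `list_notes` and the dictionary shape of the result are assumptions; `conn` is an open `sqlite3` connection and `notes_dir` points at `data/notes`.

```python
import sqlite3
from pathlib import Path


def read_note(conn: sqlite3.Connection, notes_dir: Path, slug: str):
    """Look up metadata by slug, then read the content from disk."""
    row = conn.execute(
        "SELECT slug, file_path, published, created_at, updated_at"
        " FROM notes WHERE slug = ?",
        (slug,),
    ).fetchone()
    if row is None:
        return None
    content = (notes_dir / row[1]).read_text(encoding="utf-8")
    return {
        "slug": row[0],
        "published": bool(row[2]),
        "created_at": row[3],
        "updated_at": row[4],
        "content": content,
    }


def list_notes(conn: sqlite3.Connection, published_only: bool = True):
    """Return (slug, file_path, created_at) rows, newest first, without touching the files."""
    sql = "SELECT slug, file_path, created_at FROM notes"
    if published_only:
        sql += " WHERE published = 1"
    sql += " ORDER BY created_at DESC"
    return conn.execute(sql).fetchall()
```

Keeping the list query metadata-only lets the database indexes do the work; content is read only for the notes that are actually rendered.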
### Updating a Note
1. Query database for existing file_path
2. Write new content to file (atomic write to temp, then rename)
3. Calculate new content hash
4. Update database record (timestamp, content_hash)
5. Return success
**Transaction Safety**: Keep backup of original file until database update succeeds
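One way to realize the update flow with a backup is sketched below. The `.bak` and `.tmp` suffixes and the function name `update_note` are assumptions, not part of the decision; `conn` is an open `sqlite3` connection.

```python
import hashlib
import os
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def update_note(conn: sqlite3.Connection, notes_dir: Path, slug: str, content: str) -> None:
    """Replace the file atomically, then update metadata; restore the original on failure."""
    row = conn.execute("SELECT file_path FROM notes WHERE slug = ?", (slug,)).fetchone()
    if row is None:
        raise KeyError(f"No note with slug {slug!r}")
    path = notes_dir / row[0]
    backup = path.with_name(path.name + ".bak")
    temp = path.with_name(path.name + ".tmp")
    data = content.encode("utf-8")
    shutil.copy2(path, backup)  # Keep the original until the database update succeeds
    temp.write_bytes(data)
    os.replace(temp, path)  # Swap the new content into place
    try:
        with conn:  # Commits on success, rolls back on exception
            conn.execute(
                "UPDATE notes SET updated_at = ?, content_hash = ? WHERE slug = ?",
                (datetime.now(timezone.utc).isoformat(), hashlib.sha256(data).hexdigest(), slug),
            )
    except sqlite3.Error:
        os.replace(backup, path)  # Put the original content back
        raise
    backup.unlink()
```

The backup is a copy rather than a rename so the note file never disappears from its path during the update.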
### Deleting a Note
**Soft Delete (Recommended)**:
1. Update database: set `deleted_at` timestamp
2. Optionally move file to `.trash/` subdirectory
3. Return success
**Hard Delete**:
1. Delete database record
2. Delete file from filesystem
3. Return success
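A soft delete could be sketched as follows, assuming the `notes` table has a nullable `deleted_at` column, as the first step implies. The function name `soft_delete_note` and the `trash_dir` parameter are hypothetical; the location of `.trash/` is left to the caller because it is not fixed here.

```python
import os
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def soft_delete_note(conn: sqlite3.Connection, notes_dir: Path, slug: str, trash_dir=None) -> bool:
    """Set deleted_at for a live note and optionally move its file into trash_dir."""
    row = conn.execute(
        "SELECT file_path FROM notes WHERE slug = ? AND deleted_at IS NULL",
        (slug,),
    ).fetchone()
    if row is None:
        return False  # Unknown slug, or the note is already deleted
    with conn:  # Commits on success, rolls back on exception
        conn.execute(
            "UPDATE notes SET deleted_at = ? WHERE slug = ?",
            (datetime.now(timezone.utc).isoformat(), slug),
        )
    if trash_dir is not None:
        target = Path(trash_dir) / row[0]  # Preserve the YYYY/MM layout inside the trash
        target.parent.mkdir(parents=True, exist_ok=True)
        os.replace(notes_dir / row[0], target)
    return True
```

If the file is moved, the stored `file_path` no longer resolves under `data/notes/`, so an integrity check would need to skip rows where `deleted_at` is set.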
## File System Operations
### Atomic Writes
```python
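# Atomic file write: write to a temp file, then rename over the target
import os

def write_note_safely(path, content):
    temp_path = f"{path}.tmp"
    with open(temp_path, "w", encoding="utf-8") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())  # Push the data to disk before the rename
    os.replace(temp_path, path)  # Atomic on POSIX when both paths share a filesystem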
```
### Directory Creation
```python
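# Ensure the YYYY/MM directory exists before writing
import os

def ensure_note_directory(year, month):
    # Zero-pad so January becomes "01", matching the YYYY/MM layout
    path = f"data/notes/{int(year):04d}/{int(month):02d}"
    os.makedirs(path, exist_ok=True)
    return path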
```
### Slug Generation
```python
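# Generate a URL-safe slug (sketch). `slug_exists` is a hypothetical callable
# standing in for the database lookup
import re
import secrets
from datetime import datetime

def generate_slug(content=None, timestamp=None, slug_exists=lambda slug: False):
    slug = ""
    if content:
        # First five words, lowercased; runs of other characters become hyphens
        words = " ".join(content.split()[:5]).lower()
        slug = re.sub(r"[^a-z0-9]+", "-", words).strip("-")
    if not slug:
        # Fallback: timestamp-based (also used when content has no usable characters)
        slug = (timestamp or datetime.now()).strftime("%Y%m%d-%H%M%S")
    # Ensure uniqueness, retrying if a suffixed candidate is also taken
    candidate = slug
    while slug_exists(candidate):
        candidate = f"{slug}-{secrets.token_hex(2)}"
    return candidate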
```
## Backup Strategy
### Simple Backup
```bash
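# User can backup with simple copy (stop the application first so
# starpunk.db is not copied mid-write)
cp -r data/ backup/
# Or with rsync
rsync -av data/ backup/
# Or with git (after a one-time `git init` in data/)
cd data/ && git add . && git commit -m "Backup"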
```
### Restore Strategy
1. Copy data/ directory to new location
2. Application reads database
3. If database missing or corrupt, rebuild from files:
```python
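# Rebuild sketch: `conn` is an open sqlite3 connection
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def rebuild_database_from_files(conn, notes_dir=Path("data/notes")):
    for path in sorted(notes_dir.rglob("*.md")):
        # Only the file mtime is available; the original created_at and the
        # published flag cannot be recovered from the files alone
        mtime = datetime.fromtimestamp(path.stat().st_mtime, timezone.utc).isoformat()
        conn.execute(
            "INSERT INTO notes (slug, file_path, created_at, updated_at, content_hash)"
            " VALUES (?, ?, ?, ?, ?)",
            (
                path.stem,  # Slug is the filename without .md
                path.relative_to(notes_dir).as_posix(),  # Relative to data/notes/
                mtime,
                mtime,
                hashlib.sha256(path.read_bytes()).hexdigest(),
            ),
        )
    conn.commit()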
```
## Standards Compliance
### Markdown Standard
- CommonMark specification
- No custom extensions in V1
- Standard markdown processors can read files
### File System Compatibility
- ASCII-safe filenames
- No special characters in paths
- Full paths kept under 255 characters, a conservative limit that also fits within the 260-character Windows `MAX_PATH`
- POSIX-compatible directory structure
## Alternatives Considered
### All-Database Storage (Rejected)
- **Simplicity**: 8/10 - Simpler code, single source of truth
- **Portability**: 2/10 - Requires database export
- **Ownership**: 3/10 - User doesn't have direct access
- **Verdict**: Violates user requirement for file-based storage
### Flat File Directory (Rejected)
```
data/notes/
├── note-1.md
├── note-2.md
├── note-3.md
...
├── note-9999.md
```
- **Simplicity**: 10/10 - Simplest possible structure
- **Scalability**: 3/10 - Thousands of files in one directory is slow
- **Verdict**: Not scalable, poor performance with many notes
### Git-Based Storage (Rejected for V1)
- **Simplicity**: 6/10 - Requires git integration
- **Portability**: 9/10 - Excellent versioning
- **Performance**: 7/10 - Git operations have overhead
- **Verdict**: Interesting for V2, but adds complexity to V1
### Frontmatter in Files (Rejected for V1)
```markdown
---
slug: my-note
created: 2024-11-18
published: true
---
Note content here
```
- **Simplicity**: 7/10 - Requires YAML parsing
- **Portability**: 8/10 - Common pattern, but not pure markdown
- **Single Source**: 10/10 - All data in one place
- **Verdict**: Deferred to V2; V1 keeps files pure
### JSON Metadata Sidecar (Rejected)
```
notes/
├── my-note.md
├── my-note.json # Metadata
```
- **Simplicity**: 6/10 - Doubles number of files
- **Portability**: 7/10 - Markdown still clean, but extra files
- **Sync Issues**: 5/10 - Must keep two files in sync
- **Verdict**: Database metadata is cleaner
## Implementation Checklist
- [ ] Create data/notes directory structure on initialization
- [ ] Implement slug generation algorithm
- [ ] Implement atomic file write operations
- [ ] Implement content hash calculation
- [ ] Create database schema with indexes
- [ ] Implement sync between files and database
- [ ] Implement orphan detection (optional for V1)
- [ ] Add file system error handling
- [ ] Create backup documentation for users
- [ ] Test with thousands of notes for performance
## References
- CommonMark Spec: https://spec.commonmark.org/
- POSIX File Operations: https://pubs.opengroup.org/onlinepubs/9699919799/
- File System Best Practices: https://www.pathname.com/fhs/
- Atomic File Operations: https://lwn.net/Articles/457667/