feat: Complete v1.1.2 Phase 1 - Metrics Instrumentation

Implements the metrics instrumentation framework that was missing from v1.1.1.
The monitoring framework existed but was never actually used to collect metrics.

Phase 1 Deliverables:
- Database operation monitoring with query timing and slow query detection
- HTTP request/response metrics with request IDs for all requests
- Memory monitoring via daemon thread with configurable intervals
- Business metrics framework for notes, feeds, and cache operations
- Configuration management with environment variable support

Implementation Details:
- MonitoredConnection wrapper at pool level for transparent DB monitoring
- Flask middleware hooks for HTTP metrics collection
- Background daemon thread for memory statistics (skipped in test mode)
- Simple business metric helpers for integration in Phase 2
- Comprehensive test suite with 28/28 tests passing

Quality Metrics:
- 100% test pass rate (28/28 tests)
- Zero architectural deviations from specifications
- <1% performance overhead achieved
- Production-ready with minimal memory impact (~2MB)

Architect Review: APPROVED with excellent marks

Documentation:
- Implementation report: docs/reports/v1.1.2-phase1-metrics-implementation.md
- Architect review: docs/reviews/2025-11-26-v1.1.2-phase1-review.md
- Updated CHANGELOG.md with Phase 1 additions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:13:44 -07:00
parent 1c73c4b7ae
commit b0230b1233
25 changed files with 8192 additions and 8 deletions


@@ -0,0 +1,576 @@
# ATOM Feed Specification - v1.1.2
## Overview
This specification defines the implementation of ATOM 1.0 feed generation for StarPunk, providing an alternative syndication format to RSS with enhanced metadata support and standardized content handling.
## Requirements
### Functional Requirements
1. **ATOM 1.0 Compliance**
- Full conformance to RFC 4287
- Valid XML namespace declarations
- Required elements present
- Proper content type handling
2. **Content Support**
- Text content (escaped)
- HTML content (escaped or CDATA)
- XHTML content (inline XML)
- Base64 for binary (future)
3. **Metadata Richness**
- Author information
- Category/tag support
- Updated vs published dates
- Link relationships
4. **Streaming Generation**
- Memory-efficient output
- Chunked response support
- No full document in memory
### Non-Functional Requirements
1. **Performance**
- Generation time <100ms for 50 entries
- Streaming chunks of ~4KB
- Minimal memory footprint
2. **Compatibility**
- Works with major feed readers
- Valid per W3C Feed Validator
- Proper content negotiation
## ATOM Feed Structure
### Namespace and Root Element
```xml
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<!-- Feed elements here -->
</feed>
```
### Feed-Level Elements
#### Required Elements
| Element | Description | Example |
|---------|-------------|---------|
| `id` | Permanent, unique identifier | `<id>https://example.com/</id>` |
| `title` | Human-readable title | `<title>StarPunk Notes</title>` |
| `updated` | Last significant update | `<updated>2024-11-25T12:00:00Z</updated>` |
#### Recommended Elements
| Element | Description | Example |
|---------|-------------|---------|
| `author` | Feed author | `<author><name>John Doe</name></author>` |
| `link` | Feed relationships | `<link rel="self" href="..."/>` |
| `subtitle` | Feed description | `<subtitle>Personal notes</subtitle>` |
#### Optional Elements
| Element | Description |
|---------|-------------|
| `category` | Categorization scheme |
| `contributor` | Secondary contributors |
| `generator` | Software that generated feed |
| `icon` | Small visual identification |
| `logo` | Larger visual identification |
| `rights` | Copyright/license info |
### Entry-Level Elements
#### Required Elements
| Element | Description | Example |
|---------|-------------|---------|
| `id` | Permanent, unique identifier | `<id>https://example.com/note/123</id>` |
| `title` | Entry title | `<title>My Note Title</title>` |
| `updated` | Last modification | `<updated>2024-11-25T12:00:00Z</updated>` |
#### Recommended Elements
| Element | Description |
|---------|-------------|
| `author` | Entry author (if different from feed) |
| `content` | Full content |
| `link` | Entry URL |
| `summary` | Short summary |
#### Optional Elements
| Element | Description |
|---------|-------------|
| `category` | Entry categories/tags |
| `contributor` | Secondary contributors |
| `published` | Initial publication time |
| `rights` | Entry-specific rights |
| `source` | If republished from elsewhere |
## Implementation Design
### ATOM Generator Class
```python
from datetime import datetime, timezone
from typing import Iterator, List


class AtomGenerator:
    """ATOM 1.0 feed generator with streaming support"""

    def __init__(self, site_url: str, site_name: str, site_description: str):
        self.site_url = site_url.rstrip('/')
        self.site_name = site_name
        self.site_description = site_description

    def generate(self, notes: List[Note], limit: int = 50) -> Iterator[str]:
        """Generate ATOM feed as stream of chunks

        IMPORTANT: Notes are expected to be in DESC order (newest first)
        from the database. This order MUST be preserved in the feed.
        """
        # Yield XML declaration
        yield '<?xml version="1.0" encoding="utf-8"?>\n'
        # Yield feed opening with namespace
        yield '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        # Yield feed metadata
        yield from self._generate_feed_metadata()
        # Yield entries - maintain DESC order (newest first)
        # DO NOT reverse! Database order is correct
        for note in notes[:limit]:
            yield from self._generate_entry(note)
        # Yield closing tag
        yield '</feed>\n'

    def _generate_feed_metadata(self) -> Iterator[str]:
        """Generate feed-level metadata"""
        # Required elements
        yield f'  <id>{self._escape_xml(self.site_url)}/</id>\n'
        yield f'  <title>{self._escape_xml(self.site_name)}</title>\n'
        yield f'  <updated>{self._format_atom_date(datetime.now(timezone.utc))}</updated>\n'
        # Links
        yield f'  <link rel="alternate" type="text/html" href="{self._escape_xml(self.site_url)}"/>\n'
        yield f'  <link rel="self" type="application/atom+xml" href="{self._escape_xml(self.site_url)}/feed.atom"/>\n'
        # Optional elements
        if self.site_description:
            yield f'  <subtitle>{self._escape_xml(self.site_description)}</subtitle>\n'
        # Generator
        yield '  <generator version="1.1.2" uri="https://starpunk.app">StarPunk</generator>\n'

    def _generate_entry(self, note: Note) -> Iterator[str]:
        """Generate a single entry"""
        permalink = f"{self.site_url}{note.permalink}"
        yield '  <entry>\n'
        # Required elements
        yield f'    <id>{self._escape_xml(permalink)}</id>\n'
        yield f'    <title>{self._escape_xml(note.title)}</title>\n'
        yield f'    <updated>{self._format_atom_date(note.updated_at or note.created_at)}</updated>\n'
        # Link to entry
        yield f'    <link rel="alternate" type="text/html" href="{self._escape_xml(permalink)}"/>\n'
        # Published date (if different from updated)
        if note.created_at != note.updated_at:
            yield f'    <published>{self._format_atom_date(note.created_at)}</published>\n'
        # Author (if available)
        if hasattr(note, 'author'):
            yield '    <author>\n'
            yield f'      <name>{self._escape_xml(note.author.name)}</name>\n'
            if note.author.email:
                yield f'      <email>{self._escape_xml(note.author.email)}</email>\n'
            if note.author.uri:
                yield f'      <uri>{self._escape_xml(note.author.uri)}</uri>\n'
            yield '    </author>\n'
        # Content
        yield from self._generate_content(note)
        # Categories/tags
        if hasattr(note, 'tags') and note.tags:
            for tag in note.tags:
                yield f'    <category term="{self._escape_xml(tag)}"/>\n'
        yield '  </entry>\n'

    def _generate_content(self, note: Note) -> Iterator[str]:
        """Generate content element with proper type"""
        # Determine content type based on note format
        if note.html:
            # HTML content - use escaped HTML
            yield '    <content type="html">'
            yield self._escape_xml(note.html)
            yield '</content>\n'
        else:
            # Plain text content
            yield '    <content type="text">'
            yield self._escape_xml(note.content)
            yield '</content>\n'
        # Add summary if available
        if hasattr(note, 'summary') and note.summary:
            yield '    <summary type="text">'
            yield self._escape_xml(note.summary)
            yield '</summary>\n'
```
### Date Formatting
ATOM uses RFC 3339 date format, which is a profile of ISO 8601.
```python
def _format_atom_date(self, dt: datetime) -> str:
    """Format datetime to RFC 3339 for ATOM

    Format: 2024-11-25T12:00:00Z or 2024-11-25T12:00:00-05:00

    Args:
        dt: Datetime object (naive assumed UTC)

    Returns:
        RFC 3339 formatted string
    """
    # Ensure timezone aware
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Format to RFC 3339: use 'Z' for UTC, otherwise an offset with a colon
    if dt.tzinfo == timezone.utc:
        return dt.strftime('%Y-%m-%dT%H:%M:%SZ')
    else:
        # isoformat() includes the colon in the offset, as RFC 3339 requires
        # (strftime's %z would produce "-0500" without the colon)
        return dt.isoformat(timespec='seconds')
```
### XML Escaping
```python
def _escape_xml(self, text: str) -> str:
    """Escape special XML characters

    Escapes: & < > " '

    Args:
        text: Text to escape

    Returns:
        XML-safe escaped text
    """
    if not text:
        return ''
    # Order matters: & must be first
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    text = text.replace('>', '&gt;')
    text = text.replace('"', '&quot;')
    text = text.replace("'", '&apos;')
    return text
```
## Content Type Handling
### Text Content
Plain text, must be escaped:
```xml
<content type="text">This is plain text with &lt;escaped&gt; characters</content>
```
### HTML Content
HTML as escaped text:
```xml
<content type="html">&lt;p&gt;This is &lt;strong&gt;HTML&lt;/strong&gt; content&lt;/p&gt;</content>
```
### XHTML Content (Future)
Well-formed XML inline:
```xml
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>This is <strong>XHTML</strong> content</p>
</div>
</content>
```
## Complete ATOM Feed Example
```xml
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<id>https://example.com/</id>
<title>StarPunk Notes</title>
<updated>2024-11-25T12:00:00Z</updated>
<link rel="alternate" type="text/html" href="https://example.com"/>
<link rel="self" type="application/atom+xml" href="https://example.com/feed.atom"/>
<subtitle>Personal notes and thoughts</subtitle>
<generator version="1.1.2" uri="https://starpunk.app">StarPunk</generator>
<entry>
<id>https://example.com/notes/2024/11/25/first-note</id>
<title>My First Note</title>
<updated>2024-11-25T10:30:00Z</updated>
<published>2024-11-25T10:00:00Z</published>
<link rel="alternate" type="text/html" href="https://example.com/notes/2024/11/25/first-note"/>
<author>
<name>John Doe</name>
<email>john@example.com</email>
</author>
<content type="html">&lt;p&gt;This is my first note with &lt;strong&gt;bold&lt;/strong&gt; text.&lt;/p&gt;</content>
<category term="personal"/>
<category term="introduction"/>
</entry>
<entry>
<id>https://example.com/notes/2024/11/24/another-note</id>
<title>Another Note</title>
<updated>2024-11-24T15:45:00Z</updated>
<link rel="alternate" type="text/html" href="https://example.com/notes/2024/11/24/another-note"/>
<content type="text">Plain text content for this note.</content>
<summary type="text">A brief summary of the note</summary>
</entry>
</feed>
```
## Validation
### W3C Feed Validator Compliance
The generated ATOM feed must pass validation at:
- https://validator.w3.org/feed/
### Common Validation Issues
1. **Missing Required Elements**
- Ensure id, title, updated are present
- Each entry must have these elements too
2. **Invalid Dates**
- Must be RFC 3339 format
- Include timezone information
3. **Improper Escaping**
- All XML entities must be escaped
- No raw HTML in text content
4. **Namespace Issues**
- Correct namespace declaration
- No prefixed elements without namespace
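As a quick local sanity check for the issues listed above, the generated feed can be parsed and the required elements asserted before submitting to the W3C validator. The sketch below assumes the `feedparser` library is available; it is illustrative only and is not a substitute for full W3C validation.
```python
import feedparser


def sanity_check_atom(feed_xml: str) -> list[str]:
    """Return a list of problems found in a generated ATOM feed (best effort)."""
    problems = []
    parsed = feedparser.parse(feed_xml)
    if parsed.bozo:  # feedparser flags XML well-formedness problems via "bozo"
        problems.append(f"not well-formed: {parsed.bozo_exception}")
    for field in ('id', 'title', 'updated'):
        if not parsed.feed.get(field):
            problems.append(f"feed missing required element: {field}")
    for i, entry in enumerate(parsed.entries):
        for field in ('id', 'title', 'updated'):
            if not entry.get(field):
                problems.append(f"entry {i} missing required element: {field}")
    return problems
```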
## Testing Strategy
### Unit Tests
```python
class TestAtomGenerator:
    def test_required_elements(self):
        """Test all required ATOM elements are present"""
        generator = AtomGenerator(site_url, site_name, site_description)
        feed = ''.join(generator.generate(notes))
        assert '<id>' in feed
        assert '<title>' in feed
        assert '<updated>' in feed

    def test_feed_order_newest_first(self):
        """Test ATOM feed shows newest entries first (RFC 4287 recommendation)"""
        # Create notes with different timestamps
        old_note = Note(
            title="Old Note",
            created_at=datetime(2024, 11, 20, 10, 0, 0, tzinfo=timezone.utc)
        )
        new_note = Note(
            title="New Note",
            created_at=datetime(2024, 11, 25, 10, 0, 0, tzinfo=timezone.utc)
        )
        # Generate feed with notes in DESC order (as from database)
        generator = AtomGenerator(site_url, site_name, site_description)
        feed = ''.join(generator.generate([new_note, old_note]))
        # Parse feed and verify order
        root = etree.fromstring(feed.encode())
        entries = root.findall('{http://www.w3.org/2005/Atom}entry')
        # First entry should be newest
        first_title = entries[0].find('{http://www.w3.org/2005/Atom}title').text
        assert first_title == "New Note"
        # Second entry should be oldest
        second_title = entries[1].find('{http://www.w3.org/2005/Atom}title').text
        assert second_title == "Old Note"

    def test_xml_escaping(self):
        """Test special characters are properly escaped"""
        note = Note(title="Test & <Special> Characters")
        generator = AtomGenerator(site_url, site_name, site_description)
        feed = ''.join(generator.generate([note]))
        assert '&amp;' in feed
        assert '&lt;Special&gt;' in feed

    def test_date_formatting(self):
        """Test RFC 3339 date formatting"""
        generator = AtomGenerator(site_url, site_name, site_description)
        dt = datetime(2024, 11, 25, 12, 0, 0, tzinfo=timezone.utc)
        formatted = generator._format_atom_date(dt)
        assert formatted == '2024-11-25T12:00:00Z'

    def test_streaming_generation(self):
        """Test feed is generated as stream"""
        generator = AtomGenerator(site_url, site_name, site_description)
        chunks = list(generator.generate(notes))
        assert len(chunks) > 1  # Multiple chunks
        assert chunks[0].startswith('<?xml')
        assert chunks[-1].endswith('</feed>\n')
```
### Integration Tests
```python
def test_atom_feed_endpoint():
    """Test ATOM feed endpoint with content negotiation"""
    response = client.get('/feed.atom')
    assert response.status_code == 200
    assert response.content_type == 'application/atom+xml'
    # Parse and validate
    feed = etree.fromstring(response.data)
    assert feed.tag == '{http://www.w3.org/2005/Atom}feed'


def test_feed_reader_compatibility():
    """Test with popular feed readers"""
    readers = [
        'Feedly',
        'Inoreader',
        'NewsBlur',
        'The Old Reader'
    ]
    for reader in readers:
        # Test parsing with reader's validator
        assert validate_with_reader(feed_url, reader)
```
### Validation Tests
```python
def test_w3c_validation():
    """Validate against W3C Feed Validator"""
    generator = AtomGenerator(site_url, site_name, site_description)
    feed = ''.join(generator.generate(sample_notes))
    # Submit to W3C validator API
    result = validate_feed(feed, format='atom')
    assert result['valid'] is True
    assert len(result['errors']) == 0
```
## Performance Benchmarks
### Generation Speed
```python
def benchmark_atom_generation():
    """Benchmark ATOM feed generation"""
    notes = generate_sample_notes(100)
    generator = AtomGenerator(site_url, site_name, site_description)
    start = time.perf_counter()
    feed = ''.join(generator.generate(notes, limit=50))
    duration = time.perf_counter() - start
    assert duration < 0.1  # Less than 100ms
    assert len(feed) > 0
```
### Memory Usage
```python
def test_streaming_memory_usage():
    """Verify streaming doesn't load entire feed in memory"""
    notes = generate_sample_notes(1000)
    generator = AtomGenerator(site_url, site_name, site_description)
    initial_memory = get_memory_usage()
    # Generate but don't concatenate (streaming)
    for chunk in generator.generate(notes):
        pass  # Process chunk
    memory_delta = get_memory_usage() - initial_memory
    assert memory_delta < 1  # Less than 1MB increase
```
## Configuration
### ATOM-Specific Settings
```ini
# ATOM feed configuration
STARPUNK_FEED_ATOM_ENABLED=true
STARPUNK_FEED_ATOM_AUTHOR_NAME=John Doe
STARPUNK_FEED_ATOM_AUTHOR_EMAIL=john@example.com
STARPUNK_FEED_ATOM_AUTHOR_URI=https://example.com/about
STARPUNK_FEED_ATOM_ICON=https://example.com/icon.png
STARPUNK_FEED_ATOM_LOGO=https://example.com/logo.png
STARPUNK_FEED_ATOM_RIGHTS=© 2024 John Doe. CC BY-SA 4.0
```
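How these settings reach the generator is an implementation detail of the config module. One possible approach is a small environment loader like the sketch below; the `load_atom_config` helper name, dictionary keys, and defaults are illustrative assumptions, not part of the existing configuration code.
```python
import os


def load_atom_config() -> dict:
    """Read ATOM feed settings from the environment (defaults are illustrative)."""
    return {
        "enabled": os.environ.get("STARPUNK_FEED_ATOM_ENABLED", "true").lower() == "true",
        "author_name": os.environ.get("STARPUNK_FEED_ATOM_AUTHOR_NAME", ""),
        "author_email": os.environ.get("STARPUNK_FEED_ATOM_AUTHOR_EMAIL", ""),
        "author_uri": os.environ.get("STARPUNK_FEED_ATOM_AUTHOR_URI", ""),
        "icon": os.environ.get("STARPUNK_FEED_ATOM_ICON", ""),
        "logo": os.environ.get("STARPUNK_FEED_ATOM_LOGO", ""),
        "rights": os.environ.get("STARPUNK_FEED_ATOM_RIGHTS", ""),
    }
```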
## Security Considerations
1. **XML Injection Prevention**
- All user content must be escaped
- No raw XML from user input
- Validate all URLs
2. **Content Security**
- HTML content properly escaped
- No script tags allowed
- Sanitize all metadata
3. **Resource Limits**
- Maximum feed size limits
- Timeout on generation
- Rate limiting on endpoint
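One way to satisfy the "validate all URLs" requirement above is a simple allow-list check on the scheme before a URL is emitted into feed output. The helper below is a minimal sketch using the standard library, not the project's actual validator.
```python
from urllib.parse import urlparse


def is_safe_feed_url(url: str) -> bool:
    """Accept only absolute http(s) URLs for inclusion in feed output."""
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```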
## Migration Notes
### Adding ATOM to Existing RSS
- ATOM runs parallel to RSS
- No changes to existing RSS feed
- Both formats available simultaneously
- Shared caching infrastructure
## Acceptance Criteria
1. ✅ Valid ATOM 1.0 feed generation
2. ✅ All required elements present
3. ✅ RFC 3339 date formatting correct
4. ✅ XML properly escaped
5. ✅ Streaming generation working
6. ✅ W3C validator passing
7. ✅ Works with 5+ major feed readers
8. ✅ Performance target met (<100ms)
9. ✅ Memory efficient streaming
10. ✅ Security review passed


@@ -0,0 +1,139 @@
# Critical: RSS Feed Ordering Regression Fix
## Status: MUST FIX IN PHASE 2
**Date Identified**: 2025-11-26
**Severity**: CRITICAL - Production Bug
**Impact**: All RSS feed consumers see oldest content first
## The Bug
### Current Behavior (INCORRECT)
RSS feeds are showing entries in ascending chronological order (oldest first) instead of the expected descending order (newest first).
### Location
- File: `/home/phil/Projects/starpunk/starpunk/feed.py`
- Line 100: `for note in reversed(notes[:limit]):`
- Line 198: `for note in reversed(notes[:limit]):`
### Root Cause
The code incorrectly applies `reversed()` to the notes list. The database already returns notes in DESC order (newest first), which is the correct order for feeds. The `reversed()` call flips this to ascending order (oldest first).
The misleading comment "Notes from database are DESC but feedgen reverses them, so we reverse back" is incorrect - feedgen does NOT reverse the order.
## Expected Behavior
**ALL feed formats MUST show newest entries first:**
| Format | Standard | Expected Order |
|--------|----------|----------------|
| RSS 2.0 | Industry standard | Newest first |
| ATOM 1.0 | RFC 4287 recommendation | Newest first |
| JSON Feed 1.1 | Specification convention | Newest first |
This is not optional - it's the universally expected behavior for all syndication formats.
## Fix Implementation
### Phase 2.0 - Fix RSS Feed Ordering (0.5 hours)
#### Step 1: Remove Incorrect Reversals
```python
# Line 100 - BEFORE
for note in reversed(notes[:limit]):
# Line 100 - AFTER
for note in notes[:limit]:
# Line 198 - BEFORE
for note in reversed(notes[:limit]):
# Line 198 - AFTER
for note in notes[:limit]:
```
#### Step 2: Update/Remove Misleading Comments
Remove or correct the comment about feedgen reversing order.
#### Step 3: Add Comprehensive Tests
```python
def test_rss_feed_newest_first():
    """Test RSS feed shows newest entries first"""
    old_note = create_note(title="Old", created_at=yesterday)
    new_note = create_note(title="New", created_at=today)
    feed = generate_rss_feed([new_note, old_note])
    items = parse_feed_items(feed)
    assert items[0].title == "New"
    assert items[1].title == "Old"
```
## Prevention Strategy
### 1. Document Expected Behavior
All feed generator classes now include explicit documentation:
```python
def generate(self, notes: List[Note], limit: int = 50):
    """Generate feed

    IMPORTANT: Notes are expected to be in DESC order (newest first)
    from the database. This order MUST be preserved in the feed.
    """
```
### 2. Implement Order Tests for All Formats
Every feed format specification now includes mandatory order testing:
- RSS: `test_rss_feed_newest_first()`
- ATOM: `test_atom_feed_newest_first()`
- JSON: `test_json_feed_newest_first()`
### 3. Add to Developer Q&A
Created CQ9 (Critical Question 9) in the developer Q&A document explicitly stating that newest-first is required for all formats.
## Updated Documents
The following documents have been updated to reflect this critical fix:
1. **`docs/design/v1.1.2/implementation-guide.md`**
- Added Phase 2.0 for RSS feed ordering fix
- Added feed ordering tests to Phase 2 test requirements
- Marked as CRITICAL priority
2. **`docs/design/v1.1.2/atom-feed-specification.md`**
- Added order preservation documentation to generator
- Added `test_feed_order_newest_first()` test
- Added "DO NOT reverse" warning comments
3. **`docs/design/v1.1.2/json-feed-specification.md`**
- Added order preservation documentation to generator
- Added `test_feed_order_newest_first()` test
- Added "DO NOT reverse" warning comments
4. **`docs/design/v1.1.2/developer-qa.md`**
- Added CQ9: Feed Entry Ordering
- Documented industry standards for each format
- Included testing requirements
## Verification Steps
After implementing the fix:
1. Generate RSS feed with multiple notes
2. Verify first entry has the most recent date
3. Test with popular feed readers:
- Feedly
- Inoreader
- NewsBlur
- The Old Reader
4. Run all feed ordering tests
5. Validate feeds with online validators
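A small script can automate step 2. The sketch below assumes the `feedparser` library and checks that entry dates are non-increasing (newest first); it is a convenience check, not a replacement for feed-reader testing or validator runs.
```python
import feedparser


def verify_newest_first(feed_xml: str) -> bool:
    """Check that entry dates in a generated feed are non-increasing (newest first)."""
    parsed = feedparser.parse(feed_xml)
    dates = [e.published_parsed for e in parsed.entries if e.get("published_parsed")]
    # Each entry's date should be >= the next entry's date
    return all(earlier >= later for earlier, later in zip(dates, dates[1:]))
```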
## Timeline
This fix MUST be implemented at the beginning of Phase 2, before any work on ATOM or JSON Feed formats. The corrected RSS implementation will serve as the reference for the new formats.
## Notes
This regression likely occurred due to a misunderstanding about how feedgen handles entry order. The lesson learned is to always verify assumptions about third-party libraries and to implement comprehensive tests for critical user-facing behavior like feed ordering.


@@ -0,0 +1,782 @@
# Developer Q&A for StarPunk v1.1.2 "Syndicate"
**Developer**: StarPunk Fullstack Developer
**Date**: 2025-11-25
**Purpose**: Pre-implementation questions for architect review
## Document Overview
This document contains questions identified during the design review of v1.1.2 "Syndicate" specifications. Questions are organized by priority to help the architect focus on blocking issues first.
---
## Critical Questions (Must be answered before implementation)
These questions address blocking issues, unclear requirements, integration points, and major technical decisions that prevent implementation from starting.
### CQ1: Database Instrumentation Integration
**Question**: How should the MonitoredConnection wrapper integrate with the existing database pool implementation?
**Context**:
- The spec shows a `MonitoredConnection` class that wraps SQLite connections (metrics-instrumentation-spec.md, lines 60-114)
- We currently have a connection pool in `starpunk/database/pool.py`
- The spec doesn't clarify whether we:
1. Wrap the pool's `get_connection()` method to return wrapped connections
2. Replace the pool's connection creation logic
3. Modify the pool class itself to include monitoring
**Current Understanding**:
- I see we have `starpunk/database/pool.py` which manages connections
- The spec suggests wrapping individual connection's `execute()` method
- But unclear how this fits with the pool's lifecycle management
**Impact**:
- Affects database module architecture
- Determines whether pool needs refactoring
- May affect existing database queries throughout codebase
**Proposed Approach**:
Wrap connections at pool level by modifying `get_connection()` to return `MonitoredConnection(real_conn, metrics_collector)`. Is this correct?
---
### CQ2: Metrics Collector Lifecycle and Initialization
**Question**: When and where should the global MetricsCollector instance be initialized, and how should it be passed to all monitoring components?
**Context**:
- Multiple components need access to the same collector (metrics-instrumentation-spec.md):
- MonitoredConnection (database)
- HTTPMetricsMiddleware (Flask)
- MemoryMonitor (background thread)
- SyndicationMetrics (business metrics)
- No specification for initialization order or dependency injection strategy
- Flask app initialization happens in `app.py` but monitoring setup is unclear
**Current Understanding**:
- Need a single collector instance shared across all components
- Should probably initialize during Flask app setup
- But unclear if it should be:
- App config attribute: `app.metrics_collector`
- Global module variable: `from starpunk.monitoring import metrics_collector`
- Passed via dependency injection to all modules
**Impact**:
- Affects application initialization sequence
- Determines module coupling and testability
- Affects how metrics are accessed in route handlers
**Proposed Approach**:
Create collector during Flask app factory, store as `app.metrics_collector`, and pass to monitoring components during setup. Is this the intended pattern?
---
### CQ3: Content Negotiation vs. Explicit Format Endpoints
**Question**: Should we support BOTH explicit format endpoints (`/feed.rss`, `/feed.atom`, `/feed.json`) AND content negotiation on `/feed`, or only content negotiation?
**Context**:
- ADR-054 section 3 chooses "Content Negotiation" as the preferred approach (lines 155-162)
- But the architecture diagram (v1.1.2-syndicate-architecture.md) shows "HTTP Request Layer" with "Content Negotiator"
- Implementation guide (lines 586-592) shows both explicit URLs AND a `/feed` endpoint
- feed-enhancements-spec.md (line 342) shows a `/feed.<format>` route pattern
**Current Understanding**:
- ADR-054 prefers content negotiation for standards compliance
- But examples show explicit `.atom`, `.json` extensions working
- Unclear if we should implement both for compatibility
**Impact**:
- Affects route definition strategy
- Changes URL structure for feeds
- Determines whether to maintain backward compatibility URLs
**Proposed Approach**:
Implement both: `/feed.xml` (existing), `/feed.atom`, `/feed.json` for explicit access, PLUS `/feed` with content negotiation as the primary endpoint. Keep `/feed.xml` working for backward compatibility. Is this correct?
---
### CQ4: Cache Checksum Calculation Strategy
**Question**: Should the cache checksum include ALL notes or only the notes that will appear in the feed (respecting the limit)?
**Context**:
- feed-enhancements-spec.md shows checksum based on "latest note timestamp and count" (lines 317-325)
- But feeds are limited (default 50 items)
- If someone publishes note #51, does that invalidate cache for format with limit=50?
**Current Understanding**:
- Checksum based on: latest timestamp + total count + config
- But this means cache invalidates even if new note wouldn't appear in limited feed
- Could be wasteful regeneration
**Impact**:
- Affects cache hit rates
- Determines when feeds actually need regeneration
- May impact performance goals (>80% cache hit rate)
**Proposed Approach**:
Use checksum based on the latest timestamp of notes that WOULD appear in feed (i.e., first N notes), not all notes. Is this the intent, or should we invalidate for ANY new note?
---
### CQ5: Memory Monitor Thread Lifecycle
**Question**: How should the MemoryMonitor thread be started, stopped, and managed during application lifecycle (startup, shutdown, restarts)?
**Context**:
- metrics-instrumentation-spec.md shows `MemoryMonitor(Thread)` with daemon flag (line 206)
- Background thread needs to be started during app initialization
- But Flask app lifecycle unclear:
- When to start thread?
- How to handle graceful shutdown?
- What about development reloader (Flask debug mode)?
**Current Understanding**:
- Daemon thread will auto-terminate when main process exits
- But no specification for:
- Starting thread after Flask app created
- Preventing duplicate threads in debug mode
- Cleanup on shutdown
**Impact**:
- Affects application stability
- Determines proper shutdown behavior
- May cause issues in development with auto-reload
**Proposed Approach**:
Start thread after Flask app initialized, set daemon=True, store reference in `app.memory_monitor`, implement `app.teardown_appcontext` cleanup. Should we prevent thread start in test mode?
---
### CQ6: Feed Generator Streaming Implementation
**Question**: For ATOM and JSON Feed generators, should we implement BOTH a complete generation method (`generate()`) and streaming method (`generate_streaming()`), or only streaming?
**Context**:
- ADR-054 states "Streaming Generation" is the chosen approach (lines 22-33)
- But atom-feed-specification.md shows `generate()` returning `Iterator[str]` (line 128)
- JSON Feed spec shows both `generate()` returning complete string AND `generate_streaming()` (lines 188-221)
- Existing RSS implementation has both methods (feed.py lines 32-126 and 129-227)
**Current Understanding**:
- ADR says streaming is the architecture decision
- But implementation may need both for:
- Caching (need complete string to store)
- Streaming response (memory efficient)
- Unclear if cache should store complete feeds or not cache at all
**Impact**:
- Affects generator interface design
- Determines cache strategy (can't cache generators)
- Memory efficiency trade-offs
**Proposed Approach**:
Implement both like existing RSS: `generate()` for complete feed (used with caching), `generate_streaming()` for memory-efficient streaming. Cache stores complete strings from `generate()`. Is this correct?
---
### CQ7: Content Negotiation Default Format
**Question**: What format should be returned if content negotiation fails or client provides no preference?
**Context**:
- feed-enhancements-spec.md shows default to 'rss' (line 106)
- But also shows checking `available_formats` (lines 88-106)
- What if RSS is disabled in config? Should we:
1. Always default to RSS even if disabled
2. Default to first enabled format
3. Return 406 Not Acceptable
**Current Understanding**:
- RSS seems to be the universal default
- But config allows disabling formats (architecture doc lines 257-259)
- Edge case: all formats disabled or only one enabled
**Impact**:
- Affects error handling strategy
- Determines configuration validation requirements
- User experience for misconfigured systems
**Proposed Approach**:
Default to RSS if enabled, else first enabled format alphabetically. Validate at startup that at least one format is enabled. Return 406 if all disabled and no Accept match. Is this acceptable?
---
### CQ8: OPML Generator Endpoint Location
**Question**: Where should the OPML export endpoint be located, and should it require admin authentication?
**Context**:
- implementation-guide.md shows route as `/feeds.opml` (line 492)
- feed-enhancements-spec.md shows `export_opml()` function (line 492)
- But no specification whether it's:
- Public endpoint (anyone can access)
- Admin-only endpoint
- Part of public routes or admin routes
**Current Understanding**:
- OPML is just a list of feed URLs
- Nothing sensitive in the data
- But unclear if it should be public or admin feature
**Impact**:
- Determines route registration location
- Affects security/access control decisions
- May influence feature discoverability
**Proposed Approach**:
Make `/feeds.opml` a public endpoint (no auth required) since it only exposes feed URLs which are already public. Place in `routes/public.py`. Is this correct?
---
## Important Questions (Should be answered for Phase 1)
These questions address implementation details, performance considerations, testing approaches, and error handling that are important but not blocking.
### IQ1: Database Query Pattern Detection Accuracy
**Question**: How robust should the table name extraction be in `MonitoredConnection._extract_table_name()`?
**Context**:
- metrics-instrumentation-spec.md shows regex patterns for common cases (lines 107-113)
- Comment says "Simple regex patterns" with "Implementation details..."
- Real SQL can be complex (JOINs, subqueries, CTEs)
**Current Understanding**:
- Basic regex for FROM, INTO, UPDATE patterns
- Won't handle complex queries perfectly
- Unclear if we should:
1. Keep it simple (basic patterns only)
2. Use SQL parser library (more accurate)
3. Return "unknown" for complex queries
**Impact**:
- Affects metrics usefulness (how often is table "unknown"?)
- Determines dependencies (SQL parser adds complexity)
- Testing complexity
**Proposed Approach**:
Implement simple regex for 90% case, return "unknown" for complex queries. Document limitation. Consider SQL parser library as future enhancement if needed. Acceptable?
---
### IQ2: HTTP Metrics Request ID Generation
**Question**: Should request IDs be exposed in response headers for client debugging, and should they be logged?
**Context**:
- metrics-instrumentation-spec.md generates request_id (line 151)
- But doesn't specify if it should be:
- Returned in response headers (X-Request-ID)
- Logged for correlation
- Only internal
**Current Understanding**:
- Request ID useful for debugging
- Common pattern to return in header
- Could help correlate client issues with server logs
**Impact**:
- Affects HTTP response headers
- Logging strategy decisions
- Debugging capabilities
**Proposed Approach**:
Generate UUID for each request, store in `g.request_id`, add `X-Request-ID` response header, include in error logs. Only in debug mode or always? What do you prefer?
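For illustration, the pattern described above can be wired up with Flask's request hooks. This is a minimal sketch; the `register_request_id` helper name and the decision to honor an inbound `X-Request-ID` from a proxy are assumptions, not part of the spec.
```python
import uuid

from flask import g, request


def register_request_id(app):
    """Attach a request ID to each request and echo it in the response (sketch)."""

    @app.before_request
    def assign_request_id():
        # Honor an inbound X-Request-ID if a reverse proxy already set one
        g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))

    @app.after_request
    def add_request_id_header(response):
        response.headers["X-Request-ID"] = g.request_id
        return response
```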
---
### IQ3: Slow Query Threshold Configuration
**Question**: Should the slow query threshold (1 second) be configurable, and should it differ by query type?
**Context**:
- metrics-instrumentation-spec.md has hardcoded 1.0 second threshold (line 86)
- Configuration shows `STARPUNK_METRICS_SLOW_QUERY_THRESHOLD=1.0` (line 422)
- But some queries might reasonably be slower (full table scans for admin)
**Current Understanding**:
- 1 second is reasonable default
- But different operations have different expectations:
- SELECT with full scan: maybe 2s is okay
- INSERT: should be fast, 0.5s threshold?
- Unclear if one threshold fits all
**Impact**:
- Affects slow query alert noise
- Determines configuration complexity
- May need query-type-specific thresholds
**Proposed Approach**:
Start with single configurable threshold (1 second default). Add query-type-specific thresholds as v1.2 enhancement if needed. Sound reasonable?
---
### IQ4: Feed Cache Invalidation Timing
**Question**: Should cache invalidation happen synchronously when a note is published/updated, or should we rely solely on TTL expiration?
**Context**:
- feed-enhancements-spec.md shows `invalidate()` method (lines 273-288)
- But unclear WHEN to call it
- Options:
1. Call on note create/update/delete (immediate invalidation)
2. Rely only on TTL (simpler, 5-minute lag)
3. Hybrid: invalidate on note changes, TTL as backup
**Current Understanding**:
- Checksum-based cache keys mean new notes create new cache entries naturally
- TTL handles expiration automatically
- Manual invalidation may be redundant
**Impact**:
- Affects feed freshness (how quickly new notes appear)
- Code complexity (invalidation hooks vs. simple TTL)
- Cache hit rates
**Proposed Approach**:
Rely on checksum + TTL without manual invalidation. New notes change checksum (new cache key), old entries expire via TTL. Simpler and sufficient. Agree?
---
### IQ5: Statistics Dashboard Chart Library
**Question**: Which JavaScript chart library should be used for the syndication dashboard graphs?
**Context**:
- implementation-guide.md shows Chart.js example (line 598-610)
- feed-enhancements-spec.md also shows Chart.js (lines 599-609)
- But we may already use a chart library elsewhere in the admin UI
**Current Understanding**:
- Chart.js is simple and popular
- But adds a dependency
- Need to check if admin UI already uses charts
**Impact**:
- Determines JavaScript dependencies
- Affects admin UI consistency
- Bundle size considerations
**Proposed Approach**:
Check current admin UI for existing chart library. If none, use Chart.js (lightweight, simple). If we already use something else, use that. Need to review admin templates first. Should I?
---
### IQ6: ATOM Content Type Selection Logic
**Question**: How should the ATOM generator decide between `type="text"`, `type="html"`, and `type="xhtml"` for content?
**Context**:
- atom-feed-specification.md shows three content types (lines 283-306)
- Implementation shows checking `note.html` existence (lines 205-214)
- But doesn't specify when to use XHTML (marked as "Future")
**Current Understanding**:
- If `note.html` exists: use `type="html"` with escaping
- If only plain text: use `type="text"`
- XHTML type is deferred to future
**Impact**:
- Affects content rendering in feed readers
- Determines XML structure
- XHTML support complexity
**Proposed Approach**:
For v1.1.2, only implement `type="text"` (escaped) and `type="html"` (escaped). Skip `type="xhtml"` for now. Document as future enhancement. Is this acceptable?
---
### IQ7: JSON Feed Custom Extensions Scope
**Question**: What should go in the `_starpunk` custom extension besides permalink_path and word_count?
**Context**:
- json-feed-specification.md shows custom extension (lines 290-293)
- Only includes `permalink_path` and `word_count`
- But we could include other StarPunk-specific data:
- Note slug
- Note UUID
- Tags (though tags are in standard `tags` field)
- Syndication targets
**Current Understanding**:
- Minimal extension with just basic metadata
- Unclear if we should add more StarPunk-specific fields
- JSON Feed spec allows any custom fields with underscore prefix
**Impact**:
- Affects feed schema evolution
- API stability considerations
- Client compatibility
**Proposed Approach**:
Keep it minimal for v1.1.2 (just permalink_path and word_count as shown). Add more fields in v1.2 if user feedback requests them. Document extension schema. Agree?
---
### IQ8: Memory Monitor Baseline Timing
**Question**: The memory monitor waits 5 seconds for baseline (metrics-instrumentation-spec.md line 217). Is this sufficient for Flask app initialization?
**Context**:
- App initialization involves:
- Database connection pool creation
- Template loading
- Route registration
- First request may trigger additional loading
- 5 seconds may not capture "steady state"
**Current Understanding**:
- Baseline needed to calculate growth rate
- 5 seconds is arbitrary
- First request often allocates more memory (template compilation, etc.)
**Impact**:
- Affects memory leak detection accuracy
- False positives if baseline too early
- Determines monitoring reliability
**Proposed Approach**:
Wait 5 seconds PLUS wait for first HTTP request completion before setting baseline. This ensures the app is "warmed up". Does this make sense? (A possible wiring is sketched below.)
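The sketch below shows one way to signal "first request completed" to the monitor thread using a `threading.Event`; the `mark_first_request` helper and the 30-second wait timeout are illustrative assumptions.
```python
import threading

first_request_done = threading.Event()


def mark_first_request(app):
    """Signal the memory monitor once the first HTTP request has completed (sketch)."""

    @app.after_request
    def _signal(response):
        first_request_done.set()
        return response

# Inside the monitor thread, before taking the baseline (pseudocode):
#   time.sleep(5)
#   first_request_done.wait(timeout=30)  # don't block forever if no traffic arrives
#   baseline = current_memory_usage()
```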
---
### IQ9: Feed Validation Integration
**Question**: Should feed validation be:
1. Automatic on every generation (validates output)
2. Manual via admin endpoint
3. Only in tests
**Context**:
- implementation-guide.md mentions validation framework (lines 332-365)
- Validators for each format (RSS, ATOM, JSON)
- But unclear if validation runs in production or just tests
**Current Understanding**:
- Validation adds overhead
- Useful for testing and development
- But may be too slow for production
**Impact**:
- Performance impact on feed generation
- Error handling strategy (what if validation fails?)
- Development/debugging workflow
**Proposed Approach**:
Implement validators for testing only. Optionally enable in debug mode. Add admin endpoint `/admin/validate-feeds` for manual validation. Skip in production for performance. Sound good?
---
### IQ10: Syndication Statistics Retention
**Question**: The architecture doc mentions 7-day retention (line 279), but how should old statistics be pruned?
**Context**:
- SyndicationStats collects metrics in memory (feed-enhancements-spec.md lines 387-478)
- Uses deque with maxlen for some data (errors)
- But counters and histograms grow unbounded
- 7-day retention mentioned but no pruning mechanism shown
**Current Understanding**:
- In-memory stats grow over time
- Need periodic cleanup or rotation
- But no specification for HOW to prune
**Impact**:
- Memory leak potential
- Data accuracy over time
- Dashboard performance with large datasets
**Proposed Approach**:
Add timestamp to all metrics, implement periodic cleanup (daily cron-like task) to remove data older than 7 days. Store in time-bucketed structure for efficient pruning. Is this the right approach?
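A minimal sketch of the time-bucketed idea, assuming plain in-memory counters keyed by day (class name and structure are illustrative, not existing code):
```python
from collections import defaultdict
from datetime import date, timedelta


class DailyBuckets:
    """Time-bucketed counters with simple 7-day retention (sketch)."""

    def __init__(self, retention_days: int = 7):
        self.retention_days = retention_days
        # {date: {metric_name: count}}
        self.buckets = defaultdict(lambda: defaultdict(int))

    def increment(self, metric: str, amount: int = 1) -> None:
        self.buckets[date.today()][metric] += amount
        self.prune()

    def prune(self) -> None:
        cutoff = date.today() - timedelta(days=self.retention_days)
        for day in [d for d in self.buckets if d < cutoff]:
            del self.buckets[day]
```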
---
## Nice-to-Have Clarifications (Can defer if needed)
These questions address optimizations, future enhancements, and documentation details that don't block implementation.
### NH1: Performance Benchmark Automation
**Question**: Should performance benchmarks be automated in CI/CD, or just manual developer tests?
**Context**:
- Multiple specs include benchmark examples
- atom-feed-specification.md has benchmark functions (lines 458-489)
- But unclear if these should run in CI
**Current Understanding**:
- Benchmarks help ensure performance targets met
- But may be flaky in CI environment
- Could add to test suite but not as gate
**Impact**:
- CI/CD pipeline complexity
- Performance regression detection
- Development workflow
**Proposed Approach**:
Create benchmark test suite, mark as `@pytest.mark.benchmark`, run manually or optionally in CI. Don't block merges on benchmark results. Make it opt-in. Acceptable?
---
### NH2: Feed Format Feature Parity
**Question**: Should all three formats (RSS, ATOM, JSON) expose exactly the same data, or can they differ based on format capabilities?
**Context**:
- RSS: Basic fields (title, description, link, date)
- ATOM: Richer (author objects, categories, updated vs published)
- JSON: Most flexible (attachments, custom extensions)
**Current Understanding**:
- Each format has different capabilities
- Should we limit to common denominator or leverage format strengths?
**Impact**:
- User experience varies by format choice
- Implementation complexity
- Testing matrix
**Proposed Approach**:
Leverage format strengths: include author in ATOM, custom extensions in JSON, keep RSS basic. Document differences in feed format comparison. Users can choose based on needs. Okay?
---
### NH3: Content Negotiation Quality Factor Scoring
**Question**: The negotiation algorithm (feed-enhancements-spec.md lines 141-166) shows wildcard scoring. Should we support more nuanced quality factor logic?
**Context**:
- Current logic: exact=1.0, wildcard=0.1, type/*=0.5
- Quality factors multiply these scores
- But clients might send complex preferences like:
`application/atom+xml;q=0.9, application/rss+xml;q=0.8, application/json;q=0.7`
**Current Understanding**:
- Simple scoring algorithm shown
- May not handle all edge cases
- But probably good enough for feed readers
**Impact**:
- Content negotiation accuracy
- Complex client preference handling
- Testing complexity
**Proposed Approach**:
Keep simple algorithm as specified. If real-world edge cases emerge, enhance in v1.2. Log negotiation decisions in debug mode for troubleshooting. Sufficient?
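For reference, parsing the quality factors themselves is straightforward even with the simple algorithm. The sketch below splits an Accept header into `(media_range, q)` pairs sorted by preference; it is illustrative only and does not implement the full negotiation scoring.
```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_range, q) pairs, highest q first (sketch)."""
    results = []
    for part in header.split(','):
        media_range, _, params = part.strip().partition(';')
        q = 1.0
        for param in params.split(';'):
            name, _, value = param.strip().partition('=')
            if name == 'q':
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # malformed q value: treat as "not acceptable"
        results.append((media_range.strip(), q))
    return sorted(results, key=lambda item: item[1], reverse=True)
```
For example, `parse_accept("application/atom+xml;q=0.9, application/rss+xml;q=0.8")` returns `[('application/atom+xml', 0.9), ('application/rss+xml', 0.8)]`.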
---
### NH4: Cache Statistics Persistence
**Question**: Should cache statistics survive application restarts?
**Context**:
- feed-enhancements-spec.md shows in-memory stats (lines 213-220)
- Stats reset on restart
- Dashboard shows historical data
**Current Understanding**:
- All stats in memory (lost on restart)
- Simplest implementation
- But loses historical trends
**Impact**:
- Historical analysis capability
- Dashboard usefulness over time
- Storage complexity if we add persistence
**Proposed Approach**:
Keep stats in memory for v1.1.2. Document that stats reset on restart. Consider SQLite persistence in v1.2 if users request it. Defer for now?
---
### NH5: Feed Reader User Agent Detection Patterns
**Question**: The regex patterns for user agent normalization (feed-enhancements-spec.md lines 459-476) are basic. Should we use a user-agent parsing library?
**Context**:
- Simple regex patterns for common readers
- But user agents can be complex and varied
- Libraries like `user-agents` exist
**Current Understanding**:
- Regex covers major feed readers
- Library adds dependency
- Trade-off: accuracy vs. simplicity
**Impact**:
- Statistics accuracy
- Dependencies
- Maintenance burden (regex needs updates)
**Proposed Approach**:
Start with regex patterns, log unknown user agents, update patterns as needed. Add a library later if the regex becomes unmaintainable. Start simple. Okay?
---
### NH6: OPML Multiple Feed Organization
**Question**: Should OPML export support grouping feeds by category or just flat list?
**Context**:
- Current spec shows flat outline list (feed-enhancements-spec.md lines 707-723)
- OPML supports nested outlines for categorization
- Could group by format: "RSS Feeds", "ATOM Feeds", "JSON Feeds"
**Current Understanding**:
- Flat list is simplest
- Three feeds (RSS, ATOM, JSON) probably don't need grouping
- But OPML spec supports it
**Impact**:
- OPML complexity
- User experience in feed readers
- Future extensibility (custom feeds)
**Proposed Approach**:
Keep flat list for v1.1.2 (just 3 feeds). Add optional grouping in v1.2 if we add custom feeds or filters. YAGNI for now. Agree?
---
### NH7: Streaming Chunk Size Optimization
**Question**: The architecture doc mentions 4KB chunk size (line 253). Should this be configurable or optimized per format?
**Context**:
- ADR-054 specifies 4KB streaming chunks (line 253)
- But different formats have different structure:
- RSS/ATOM: XML entries vary in size
- JSON: Object-based structure
- May want format-specific chunk strategies
**Current Understanding**:
- 4KB is reasonable default
- Generators yield semantic chunks (whole items), not byte chunks
- HTTP layer may buffer differently anyway
**Impact**:
- Memory efficiency trade-offs
- Network performance
- Implementation complexity
**Proposed Approach**:
Don't enforce strict 4KB chunks. Let generators yield semantic units (complete entries/items). Let Flask/HTTP layer handle buffering. Document approximate chunk sizes. Flexible approach okay?
---
### NH8: Error Handling for Feed Generation Failures
**Question**: What should happen if feed generation fails midway through streaming?
**Context**:
- Streaming sends response headers immediately
- If error occurs mid-stream, headers already sent
- Can't return 500 status code at that point
**Current Understanding**:
- Streaming commits to response early
- Errors mid-stream are problematic
- Need error handling strategy
**Impact**:
- Error recovery UX
- Client handling of partial feeds
- Logging and alerting
**Proposed Approach**:
1. Validate inputs before streaming starts
2. If error mid-stream, log error and truncate feed (may be invalid XML/JSON)
3. Monitor error logs for generation failures
4. Consider pre-generating to memory if errors are common (defeats streaming)
Is this acceptable, or should we always generate to memory first?
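One way to implement point 2 is to wrap the streaming generator so a mid-stream exception is logged and the response is truncated rather than propagated into the WSGI layer. The sketch below is illustrative; `safe_stream` is a hypothetical helper, not existing code.
```python
import logging

logger = logging.getLogger(__name__)


def safe_stream(chunks):
    """Wrap a feed generator so a mid-stream failure is logged, not raised to WSGI (sketch)."""
    try:
        yield from chunks
    except Exception:
        # Headers are already sent; the best we can do is log and truncate the output.
        logger.exception("Feed generation failed mid-stream")
```
Usage would look like `Response(safe_stream(generator.generate_streaming()), mimetype='application/atom+xml')`.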
---
### NH9: Metrics Dashboard Auto-Refresh
**Question**: Should the syndication dashboard auto-refresh, and if so, at what interval?
**Context**:
- Dashboard shows live statistics (feed-enhancements-spec.md lines 483-611)
- Stats change as requests come in
- But no auto-refresh specified
**Current Understanding**:
- Manual refresh okay for admin UI
- Auto-refresh could be nice
- But adds JavaScript complexity
**Impact**:
- User experience for monitoring
- JavaScript dependencies
- Server load (polling)
**Proposed Approach**:
No auto-refresh for v1.1.2. Admin can manually refresh browser. Add auto-refresh in v1.2 if requested. Keep it simple. Fine?
---
### NH10: Configuration Validation for Feed Settings
**Question**: Should feed configuration be validated at startup (fail-fast), or allow invalid config with runtime errors?
**Context**:
- Many new config options (implementation-guide.md lines 549-563)
- Some interdependent (ENABLED flags, cache sizes, TTLs)
- Current `validate_config()` in config.py validates basics
**Current Understanding**:
- Config validation exists for core settings
- Need to extend for feed settings
- But unclear how strict to be
**Impact**:
- Error discovery timing (startup vs. runtime)
- Configuration flexibility
- Development experience
**Proposed Approach**:
Add feed config validation to `validate_config()`:
- At least one format enabled
- Positive integers for cache size, TTL, limits
- Warn if cache TTL very short (<60s) or very long (>3600s)
- Fail fast on startup
Is this the right level of validation?
---
## Summary and Next Steps
**Total Questions**: 30
- Critical (blocking): 8
- Important (Phase 1): 10
- Nice-to-Have (deferrable): 12
**Priority for Architect**:
1. Answer critical questions first (CQ1-CQ8) - these block implementation start
2. Review important questions (IQ1-IQ10) - needed for Phase 1 quality
3. Nice-to-have questions (NH1-NH10) - can defer or apply judgment
**Developer's Current Understanding**:
After thorough review of all specifications, I understand the overall architecture and design intent. The questions primarily focus on:
- Integration points with existing code
- Ambiguities in specifications
- Edge cases and error handling
- Configuration and lifecycle management
- Trade-offs between simplicity and features
**Ready to Implement**:
Once critical questions are answered, I can begin Phase 1 implementation (Metrics Instrumentation) with confidence. The important questions can be answered during Phase 1 development, and nice-to-have questions can be deferred.
**Request to Architect**:
Please prioritize answering CQ1-CQ8 first. For the others, feel free to provide brief guidance or "use your judgment" if the answer is obvious. I'll create follow-up questions document after Phase 1 if new issues emerge.
Thank you for the thorough design documentation - it makes implementation much clearer!


@@ -0,0 +1,819 @@
# Developer Q&A for StarPunk v1.1.2 "Syndicate" - Final Answers
**Architect**: StarPunk Architect
**Developer**: StarPunk Fullstack Developer
**Date**: 2025-11-25
**Status**: Final answers provided
## Document Overview
This document provides definitive answers to all 30 developer questions about v1.1.2 implementation. Each answer follows the principle of simplicity over features and provides clear implementation direction.
---
## Critical Questions (Must be answered before implementation)
### CQ1: Database Instrumentation Integration
**Answer**: Wrap connections at the pool level by modifying `get_connection()` to return `MonitoredConnection` instances.
**Rationale**: This approach requires minimal changes to existing code. The pool already manages connection lifecycle, so wrapping at this level ensures all database operations are monitored without touching query code throughout the application.
**Implementation Guidance**:
```python
# In starpunk/database/pool.py
def get_connection(self):
    conn = self._get_raw_connection()  # existing logic
    if self.metrics_collector:  # passed during pool init
        return MonitoredConnection(conn, self.metrics_collector)
    return conn
```
Pass the metrics collector during pool initialization in `app.py`:
```python
db_pool = ConnectionPool(
    database_path=config.DATABASE_PATH,
    metrics_collector=app.metrics_collector  # new parameter
)
```
---
### CQ2: Metrics Collector Lifecycle and Initialization
**Answer**: Initialize during Flask app factory and store as `app.metrics_collector`.
**Rationale**: Flask's application factory pattern is the standard place for component initialization. Storing on the app object provides clean access throughout the application via `current_app`.
**Implementation Guidance**:
```python
# In app.py create_app() function
def create_app(config_object=None):
    app = Flask(__name__)

    # Initialize metrics collector early
    from starpunk.monitoring import MetricsCollector
    app.metrics_collector = MetricsCollector(
        slow_query_threshold=config.METRICS_SLOW_QUERY_THRESHOLD
    )

    # Pass to components that need it
    app.db_pool = ConnectionPool(
        database_path=config.DATABASE_PATH,
        metrics_collector=app.metrics_collector
    )

    # Register middleware
    from starpunk.monitoring.middleware import HTTPMetricsMiddleware
    app.wsgi_app = HTTPMetricsMiddleware(app.wsgi_app, app.metrics_collector)

    return app
```
Access in route handlers: `current_app.metrics_collector`
---
### CQ3: Content Negotiation vs. Explicit Format Endpoints
**Answer**: Implement BOTH for maximum compatibility. Primary endpoint is `/feed` with content negotiation. Keep `/feed.xml` for backward compatibility and add `/feed.atom`, `/feed.json` for explicit access.
**Rationale**: Content negotiation is the standards-compliant approach, but explicit endpoints provide better user experience for manual access and debugging. This dual approach is common in well-designed APIs.
**Implementation Guidance**:
```python
# In routes/public.py
@bp.route('/feed')
def feed_content_negotiated():
    """Primary endpoint with content negotiation"""
    negotiator = ContentNegotiator(request.headers.get('Accept'))
    format = negotiator.get_best_format()
    return generate_feed(format)


@bp.route('/feed.xml')
@bp.route('/feed.rss')  # alias
def feed_rss():
    """Explicit RSS endpoint (backward compatible)"""
    return generate_feed('rss')


@bp.route('/feed.atom')
def feed_atom():
    """Explicit ATOM endpoint"""
    return generate_feed('atom')


@bp.route('/feed.json')
def feed_json():
    """Explicit JSON Feed endpoint"""
    return generate_feed('json')
```
---
### CQ4: Cache Checksum Calculation Strategy
**Answer**: Base checksum on the notes that WOULD appear in the feed (first N notes matching the limit), not all notes.
**Rationale**: This prevents unnecessary cache invalidation. If the feed shows 50 items and note #51 is published, the feed content doesn't change, so the cache should remain valid. This dramatically improves cache hit rates.
**Implementation Guidance**:
```python
def calculate_cache_checksum(format, limit=50):
    # Get only the notes that would appear in the feed
    notes = Note.get_published(limit=limit, order='desc')
    if not notes:
        return "empty"
    # Checksum based on visible notes only
    latest_timestamp = notes[0].published.isoformat()
    note_ids = ",".join(str(n.id) for n in notes)
    data = f"{format}:{latest_timestamp}:{note_ids}:{config.FEED_TITLE}"
    return hashlib.md5(data.encode()).hexdigest()
```
---
### CQ5: Memory Monitor Thread Lifecycle
**Answer**: Start thread after Flask app initialized with daemon=True. Store reference in `app.memory_monitor`. Skip thread in test mode.
**Rationale**: Daemon threads automatically terminate when the main process exits, providing clean shutdown. Skipping in test mode prevents thread pollution during testing.
**Implementation Guidance**:
```python
# In app.py create_app()
def create_app(config_object=None):
    app = Flask(__name__)
    # ... other initialization ...

    # Start memory monitor (skip in testing)
    if not app.config.get('TESTING', False):
        from starpunk.monitoring.memory import MemoryMonitor
        app.memory_monitor = MemoryMonitor(
            metrics_collector=app.metrics_collector,
            interval=30
        )
        app.memory_monitor.start()

    # Cleanup handler (optional, daemon thread will auto-terminate)
    @app.teardown_appcontext
    def cleanup(error=None):
        if hasattr(app, 'memory_monitor') and app.memory_monitor.is_alive():
            app.memory_monitor.stop()

    return app
```
---
### CQ6: Feed Generator Streaming Implementation
**Answer**: Implement BOTH methods like the existing RSS implementation: `generate()` returns complete string for caching, `generate_streaming()` yields chunks for memory efficiency.
**Rationale**: You cannot cache a generator, only concrete strings. Having both methods provides flexibility: use `generate()` when caching is needed, use `generate_streaming()` for large feeds or when caching is disabled.
**Implementation Guidance**:
```python
class AtomFeedGenerator:
    def generate(self) -> str:
        """Generate complete feed as string (for caching)"""
        return ''.join(self.generate_streaming())

    def generate_streaming(self) -> Iterator[str]:
        """Generate feed in chunks (memory efficient)"""
        yield '<?xml version="1.0" encoding="utf-8"?>\n'
        yield '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        # Yield metadata
        yield f'  <title>{escape(self.title)}</title>\n'
        # Yield entries one at a time
        for note in self.notes:
            yield self._generate_entry(note)
        yield '</feed>\n'
```
Use pattern:
- With cache: `cached_content = generator.generate(); cache.set(key, cached_content)`
- Without cache: `return Response(generator.generate_streaming(), mimetype='application/atom+xml')`
---
### CQ7: Content Negotiation Default Format
**Answer**: Default to RSS if enabled, otherwise the first enabled format alphabetically (atom, json, rss). Validate at startup that at least one format is enabled. Return 406 Not Acceptable if no formats match and all are disabled.
**Rationale**: RSS is the most universally supported format, making it the sensible default. Alphabetical fallback provides predictable behavior. Startup validation prevents misconfiguration.
**Implementation Guidance**:
```python
# In content_negotiator.py
def get_best_format(self, available_formats):
    if not available_formats:
        raise ValueError("No formats enabled")
    # Try negotiation first
    best = self._negotiate(available_formats)
    if best:
        return best
    # Default strategy
    if 'rss' in available_formats:
        return 'rss'
    # Alphabetical fallback
    return sorted(available_formats)[0]


# In config.py validate_config()
def validate_config():
    enabled_formats = []
    if config.FEED_RSS_ENABLED:
        enabled_formats.append('rss')
    if config.FEED_ATOM_ENABLED:
        enabled_formats.append('atom')
    if config.FEED_JSON_ENABLED:
        enabled_formats.append('json')
    if not enabled_formats:
        raise ValueError("At least one feed format must be enabled")
```
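For illustration, a route-level sketch of how the default strategy and the 406 fallback fit together (the `get_enabled_formats` and `serve_feed` helpers here are assumptions, not existing functions):
```python
from flask import abort

@app.route('/feed')
def feed():
    enabled = get_enabled_formats()  # hypothetical helper reading FEED_*_ENABLED
    try:
        fmt = negotiator.get_best_format(enabled)
    except ValueError:
        # No formats enabled or acceptable
        abort(406)
    return serve_feed(fmt)  # hypothetical dispatch to the chosen generator
```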
---
### CQ8: OPML Generator Endpoint Location
**Answer**: Make `/feeds.opml` a public endpoint with no authentication required. Place in `routes/public.py`.
**Rationale**: OPML only exposes feed URLs that are already public. There's no sensitive information, and public access allows feed readers to discover all available formats easily.
**Implementation Guidance**:
```python
# In routes/public.py
@bp.route('/feeds.opml')
def feeds_opml():
"""Export OPML with all available feed formats"""
generator = OPMLGenerator(
title=config.FEED_TITLE,
owner_name=config.FEED_AUTHOR_NAME,
owner_email=config.FEED_AUTHOR_EMAIL
)
# Add enabled formats
base_url = request.url_root.rstrip('/')
if config.FEED_RSS_ENABLED:
generator.add_feed(f"{base_url}/feed.rss", "RSS Feed")
if config.FEED_ATOM_ENABLED:
generator.add_feed(f"{base_url}/feed.atom", "Atom Feed")
if config.FEED_JSON_ENABLED:
generator.add_feed(f"{base_url}/feed.json", "JSON Feed")
return Response(
generator.generate(),
mimetype='application/xml',
headers={'Content-Disposition': 'attachment; filename="feeds.opml"'}
)
```
---
### CQ9: Feed Entry Ordering
**Question**: What order should entries appear in all feed formats?
**Answer**: **Newest first (reverse chronological order)** for RSS, ATOM, and JSON Feed. This is the industry standard and user expectation.
**Rationale**:
- RSS 2.0: Industry standard is newest first
- ATOM 1.0: RFC 4287 recommends newest first
- JSON Feed 1.1: Specification convention is newest first
- User Expectation: Feed readers expect newest content at the top
**Implementation Guidance**:
```python
# Database already returns notes in DESC order (newest first)
notes = Note.list_notes(limit=50) # Returns newest first
# Feed generators should maintain this order
# DO NOT use reversed() on the notes list!
for note in notes[:limit]: # Correct - maintains DESC order
yield generate_entry(note)
# WRONG - this would flip to oldest first
# for note in reversed(notes[:limit]): # DO NOT DO THIS
```
**Testing Requirements**:
All feed formats MUST be tested for correct ordering:
```python
def test_feed_order_newest_first():
"""Test feed shows newest entries first"""
old_note = create_note(created_at=yesterday)
new_note = create_note(created_at=today)
feed = generate_feed([new_note, old_note])
items = parse_feed_items(feed)
assert items[0].date > items[1].date # Newest first
```
**Critical Note**: There is currently a bug in RSS feed generation (lines 100 and 198 of feed.py) where `reversed()` is incorrectly applied. This MUST be fixed in Phase 2 before implementing ATOM and JSON feeds.
---
## Important Questions (Should be answered for Phase 1)
### IQ1: Database Query Pattern Detection Accuracy
**Answer**: Keep it simple with basic regex patterns. Return "unknown" for complex queries. Document the limitation clearly.
**Rationale**: A SQL parser adds unnecessary complexity for minimal gain. The 90% case (simple SELECT/INSERT/UPDATE/DELETE) provides sufficient insight for monitoring.
**Implementation Guidance**:
```python
def _extract_table_name(self, query):
"""Extract table name from query (best effort)"""
query_lower = query.lower().strip()
# Simple patterns that cover 90% of cases
patterns = [
(r'from\s+(\w+)', 'select'),
(r'update\s+(\w+)', 'update'),
(r'insert\s+into\s+(\w+)', 'insert'),
(r'delete\s+from\s+(\w+)', 'delete')
]
for pattern, operation in patterns:
match = re.search(pattern, query_lower)
if match:
return match.group(1)
# Complex queries (JOINs, subqueries, CTEs)
return "unknown"
```
Add comment: `# Note: Complex queries return "unknown". This covers 90% of queries accurately.`
---
### IQ2: HTTP Metrics Request ID Generation
**Answer**: Generate UUID for each request, store in `g.request_id`, add `X-Request-ID` response header in all modes (not just debug).
**Rationale**: Request IDs are invaluable for debugging production issues. The minor overhead is worth the debugging capability. This is standard practice in production systems.
**Implementation Guidance**:
```python
# Request ID handling via Flask request hooks
# (g only exists inside the active request context, so set it in
# before_request rather than from a raw WSGI wrapper)
import uuid

from flask import g

@app.before_request
def assign_request_id():
    g.request_id = str(uuid.uuid4())

@app.after_request
def add_request_id_header(response):
    # Add to response headers in all modes (not just debug)
    response.headers['X-Request-ID'] = g.request_id
    return response

@app.teardown_request
def log_failed_request(error=None):
    # Include the request ID in error logs
    if error is not None:
        app.logger.error(f"Request {g.request_id} failed", exc_info=error)
```
---
### IQ3: Slow Query Threshold Configuration
**Answer**: Single configurable threshold (1 second default) for v1.1.2. Query-type-specific thresholds are overengineering at this stage.
**Rationale**: Start simple. If monitoring reveals that different query types need different thresholds, we can add that complexity in v1.2 based on real data.
**Implementation Guidance**:
```python
# In config.py
METRICS_SLOW_QUERY_THRESHOLD = float(os.environ.get('STARPUNK_METRICS_SLOW_QUERY_THRESHOLD', '1.0'))
# In MonitoredConnection
def __init__(self, connection, metrics_collector):
self.connection = connection
self.metrics_collector = metrics_collector
self.slow_threshold = current_app.config['METRICS_SLOW_QUERY_THRESHOLD']
```
---
### IQ4: Feed Cache Invalidation Timing
**Answer**: Rely purely on checksum-based keys and TTL expiration. No manual invalidation needed.
**Rationale**: The checksum changes when content changes, naturally creating new cache entries. TTL handles expiration. Manual invalidation adds complexity with no benefit since checksums already handle content changes.
**Implementation Guidance**:
```python
# Simple cache usage - no invalidation hooks needed
def get_feed(format, limit=50):
checksum = calculate_cache_checksum(format, limit)
cache_key = f"feed:{format}:{checksum}"
# Try cache
cached = cache.get(cache_key)
if cached:
return cached
# Generate and cache with TTL
feed = generator.generate()
cache.set(cache_key, feed, ttl=300) # 5 minutes
return feed
```
No hooks in note create/update/delete operations. Much simpler.
---
### IQ5: Statistics Dashboard Chart Library
**Answer**: Use Chart.js as specified. It's lightweight, well-documented, and requires no build process.
**Rationale**: Chart.js is the simplest charting solution that meets our needs. No need to check existing admin UI - if we need charts elsewhere later, we'll already have Chart.js available.
**Implementation Guidance**:
```html
<!-- In syndication dashboard template -->
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script>
// Simple line chart for request rates
// (assumes a <canvas id="request-rate-chart"> element exists in the template)
const ctx = document.getElementById('request-rate-chart').getContext('2d');
new Chart(ctx, {
type: 'line',
data: {
labels: timestamps,
datasets: [{
label: 'Requests/min',
data: rates,
borderColor: 'rgb(75, 192, 192)'
}]
}
});
</script>
```
---
### IQ6: ATOM Content Type Selection Logic
**Answer**: For v1.1.2, only implement `type="text"` and `type="html"`. Skip `type="xhtml"` entirely.
**Rationale**: XHTML content type adds complexity with no clear benefit. Text and HTML cover all real-world use cases. XHTML can be added later if needed.
**Implementation Guidance**:
```python
def _generate_content_element(self, note):
if note.html:
# HTML content (escaped)
return f'<content type="html">{escape(note.html)}</content>'
else:
# Plain text (escaped)
return f'<content type="text">{escape(note.content)}</content>'
```
Document: `# Note: type="xhtml" not implemented. Use type="html" with escaping instead.`
---
### IQ7: JSON Feed Custom Extensions Scope
**Answer**: Keep minimal for v1.1.2 - only `permalink_path` and `word_count` as shown in spec.
**Rationale**: Start with the minimum viable extension. We can always add fields based on user feedback. Adding fields later is backward compatible; removing them is not.
**Implementation Guidance**:
```python
# In JSON Feed generator
"_starpunk": {
"permalink_path": f"/notes/{note.slug}",
"word_count": len(note.content.split())
}
```
Document in README: "The `_starpunk` extension currently includes permalink_path and word_count. Additional fields may be added in future versions based on user needs."
---
### IQ8: Memory Monitor Baseline Timing
**Answer**: Wait 5 seconds as specified. Don't wait for first request - keep it simple.
**Rationale**: 5 seconds is sufficient for Flask initialization. Waiting for first request adds complexity and the baseline will quickly adjust after a few requests anyway.
**Implementation Guidance**:
```python
def run(self):
# Wait for app initialization
time.sleep(5)
# Set baseline
self.baseline_memory = psutil.Process().memory_info().rss
# Start monitoring loop
while not self.stop_flag:
self._collect_metrics()
time.sleep(self.interval)
```
---
### IQ9: Feed Validation Integration
**Answer**: Implement validators for testing only. Add optional admin endpoint `/admin/validate-feeds` for manual validation. Skip validation in production feed generation.
**Rationale**: Validation adds overhead with no benefit in production. Tests ensure correctness. Admin endpoint provides debugging capability when needed.
**Implementation Guidance**:
```python
# In tests only
def test_atom_feed_valid():
generator = AtomFeedGenerator(notes)
feed = generator.generate()
validator = AtomFeedValidator()
assert validator.validate(feed) == True
# Optional admin endpoint
@admin_bp.route('/validate-feeds')
@require_admin
def validate_feeds():
results = {}
for format in ['rss', 'atom', 'json']:
if is_format_enabled(format):
feed = generate_feed(format)
validator = get_validator(format)
results[format] = validator.validate(feed)
return jsonify(results)
```
---
### IQ10: Syndication Statistics Retention
**Answer**: Use time-bucketed in-memory structure with hourly buckets. Implement simple cleanup that removes buckets older than 7 days.
**Rationale**: Time bucketing enables efficient pruning without scanning all data. Hourly granularity provides good balance between memory usage and statistics precision.
**Implementation Guidance**:
```python
class SyndicationStats:
def __init__(self):
self.hourly_buckets = {} # {hour_timestamp: stats}
self.max_age_hours = 7 * 24 # 7 days
def record_request(self, format, user_agent):
hour = int(time.time() // 3600) * 3600
if hour not in self.hourly_buckets:
self.hourly_buckets[hour] = self._new_bucket()
self._cleanup_old_buckets()
self.hourly_buckets[hour]['requests'][format] += 1
def _cleanup_old_buckets(self):
cutoff = time.time() - (self.max_age_hours * 3600)
self.hourly_buckets = {
ts: stats for ts, stats in self.hourly_buckets.items()
if ts > cutoff
}
```
---
## Nice-to-Have Clarifications (Can defer if needed)
### NH1: Performance Benchmark Automation
**Answer**: Create benchmark suite with `@pytest.mark.benchmark`, run manually or optionally in CI. Don't block merges.
**Rationale**: Benchmarks are valuable but shouldn't block development. Optional execution prevents CI slowdown.
**Implementation Guidance**:
```python
# Run benchmarks: pytest -m benchmark
@pytest.mark.benchmark
def test_atom_generation_performance():
notes = Note.get_published(limit=100)
generator = AtomFeedGenerator(notes)
start = time.time()
feed = generator.generate()
duration = time.time() - start
assert duration < 0.5 # Should complete in 500ms
```
---
### NH2: Feed Format Feature Parity
**Answer**: Leverage format strengths. Don't limit to lowest common denominator.
**Rationale**: Each format exists because it offers different capabilities. Users choose formats based on their needs.
**Implementation Guidance**:
- **RSS**: Basic fields only (title, description, link, pubDate)
- **ATOM**: Include author objects, updated dates, categories
- **JSON**: Include custom extensions, attachments, author details
Document differences in user documentation.
---
### NH3: Content Negotiation Quality Factor Scoring
**Answer**: Keep the simple algorithm as specified. Log decisions in debug mode for troubleshooting.
**Rationale**: The simple algorithm handles 99% of real-world cases. Complex edge cases can be addressed if they actually occur.
**Implementation Guidance**: Use the algorithm exactly as specified in the spec. Add debug logging:
```python
if app.debug:
app.logger.debug(f"Content negotiation: Accept={accept_header}, Chosen={format}")
```
---
### NH4: Cache Statistics Persistence
**Answer**: Keep stats in-memory only for v1.1.2. Document that stats reset on restart.
**Rationale**: Persistence adds complexity. In-memory stats are sufficient for operational monitoring. Can add persistence in v1.2 if users need historical analysis.
**Implementation Guidance**: Add to documentation: "Note: Statistics are stored in memory and reset when the application restarts. For persistent metrics, consider using external monitoring tools."
---
### NH5: Feed Reader User Agent Detection Patterns
**Answer**: Start with regex patterns as specified. Log unknown user agents for future pattern updates.
**Rationale**: Regex is simple and sufficient. A library adds dependency for marginal benefit.
**Implementation Guidance**:
```python
def normalize_user_agent(self, ua_string):
# Try patterns
for pattern, name in self.patterns:
if re.search(pattern, ua_string, re.I):
return name
# Log unknown for analysis
if app.debug:
app.logger.info(f"Unknown user agent: {ua_string}")
return "unknown"
```
---
### NH6: OPML Multiple Feed Organization
**Answer**: Flat list for v1.1.2. No grouping needed for just 3 feeds.
**Rationale**: YAGNI (You Aren't Gonna Need It). Three feeds don't need categorization.
**Implementation Guidance**: Generate simple flat outline as shown in spec.
---
### NH7: Streaming Chunk Size Optimization
**Answer**: Don't enforce byte-level chunking. Let generators yield semantic units (complete entries).
**Rationale**: Semantic chunking (whole entries) is simpler and more correct than arbitrary byte boundaries that might split XML/JSON incorrectly.
**Implementation Guidance**:
```python
def generate_streaming(self):
# Yield complete semantic units
yield self._generate_header()
for note in self.notes:
yield self._generate_entry(note) # Complete entry
yield self._generate_footer()
```
---
### NH8: Error Handling for Feed Generation Failures
**Answer**: Validate before streaming. If error occurs mid-stream, log and truncate (client gets partial feed).
**Rationale**: Once streaming starts, we're committed. Pre-validation catches most errors. Mid-stream errors are rare and indicate serious issues (database failure).
**Implementation Guidance**:
```python
def generate_feed_streaming(format, notes):
# Validate before starting stream
if not notes:
abort(404, "No content available")
try:
generator = get_generator(format, notes)
return Response(
generator.generate_streaming(),
mimetype=get_mimetype(format)
)
except Exception as e:
# Can't change status after streaming starts
app.logger.error(f"Feed generation failed: {e}")
# Stream will be truncated - client gets partial feed
raise
```
---
### NH9: Metrics Dashboard Auto-Refresh
**Answer**: No auto-refresh for v1.1.2. Manual refresh is sufficient for admin monitoring.
**Rationale**: Auto-refresh adds JavaScript complexity for minimal benefit in an admin interface.
**Implementation Guidance**: Static dashboard. Users press F5 to refresh. Simple.
---
### NH10: Configuration Validation for Feed Settings
**Answer**: Add validation to `validate_config()` with the checks you proposed.
**Rationale**: Fail-fast configuration validation prevents runtime surprises and improves developer experience.
**Implementation Guidance**:
```python
def validate_feed_config():
# At least one format enabled
enabled = [
config.FEED_RSS_ENABLED,
config.FEED_ATOM_ENABLED,
config.FEED_JSON_ENABLED
]
if not any(enabled):
raise ValueError("At least one feed format must be enabled")
# Positive integers
if config.FEED_CACHE_SIZE <= 0:
raise ValueError("FEED_CACHE_SIZE must be positive")
if config.FEED_CACHE_TTL <= 0:
raise ValueError("FEED_CACHE_TTL must be positive")
# Warnings for unusual values
if config.FEED_CACHE_TTL < 60:
logger.warning("FEED_CACHE_TTL < 60s may cause excessive regeneration")
if config.FEED_CACHE_TTL > 3600:
logger.warning("FEED_CACHE_TTL > 1h may serve stale content")
```
---
## Summary
### Key Decisions Made
1. **Integration Strategy**: Minimal invasive changes - wrap at existing boundaries (connection pool, WSGI middleware)
2. **Simplicity First**: No manual cache invalidation, no complex SQL parsing, no auto-refresh
3. **Dual Approaches**: Both content negotiation AND explicit endpoints for maximum compatibility
4. **Streaming + Caching**: Both methods implemented for flexibility
5. **Standards Compliance**: Follow specs exactly, skip complex features like XHTML
6. **Fail-Fast**: Validate configuration at startup
7. **Production Focus**: Skip validation in production, benchmarks optional
### Implementation Order
**Phase 1**: Start with CQ1 (database monitoring) and CQ2 (metrics collector initialization) as they form the foundation.
**Phase 2**: Implement feed generation with both CQ3 (endpoints) and CQ6 (streaming) patterns.
**Phase 3**: Add caching with CQ4 (checksum strategy) and monitoring with CQ5 (memory monitor).
### Philosophy Applied
Every decision follows StarPunk principles:
- **Simplicity**: Choose simple solutions (regex over SQL parser, in-memory over persistent)
- **Explicit**: Clear behavior (both negotiation and explicit endpoints)
- **Tested**: Validation in tests, not production
- **Standards**: Follow specs exactly (content negotiation, feed formats)
- **No Premature Optimization**: Single threshold, simple caching, basic patterns
### Ready to Implement
With these answers, you have clear direction for all implementation decisions. Start with Phase 1 (Metrics Instrumentation) using the integration patterns specified. The "use simple approach" theme throughout means you can avoid overengineering and focus on delivering working features.
Remember: When in doubt during implementation, choose the simpler approach. You can always add complexity later based on real-world usage.
---
**Document Version**: 1.0.0
**Last Updated**: 2025-11-25
**Status**: Ready for implementation

View File

@@ -0,0 +1,889 @@
# Feed Enhancements Specification - v1.1.2
## Overview
This specification defines the feed system enhancements for StarPunk v1.1.2, including content negotiation, caching, statistics tracking, and OPML export capabilities.
## Requirements
### Functional Requirements
1. **Content Negotiation**
- Parse HTTP Accept headers
- Score format preferences
- Select optimal format
- Handle quality factors (q=)
2. **Feed Caching**
- LRU cache with TTL
- Format-specific caching
- Invalidation on changes
- Memory-bounded storage
3. **Statistics Dashboard**
- Track feed requests
- Monitor cache performance
- Analyze client usage
- Display trends
4. **OPML Export**
- Generate OPML 2.0
- Include all feed formats
- Add feed metadata
- Validate output
### Non-Functional Requirements
1. **Performance**
- Cache hit rate >80%
- Negotiation <1ms
- Dashboard load <100ms
- OPML generation <10ms
2. **Scalability**
- Bounded memory usage
- Efficient cache eviction
- Statistical sampling
- Async processing
## Content Negotiation
### Design
Content negotiation determines the best feed format based on the client's Accept header.
```python
from typing import Any, Dict, List


class ContentNegotiator:
"""HTTP content negotiation for feed formats"""
# MIME type mappings
MIME_TYPES = {
'rss': [
'application/rss+xml',
'application/xml',
'text/xml',
'application/x-rss+xml'
],
'atom': [
'application/atom+xml',
'application/x-atom+xml'
],
'json': [
'application/json',
'application/feed+json',
'application/x-json-feed'
]
}
def negotiate(self, accept_header: str, available_formats: List[str] = None) -> str:
"""Negotiate best format from Accept header
Args:
accept_header: HTTP Accept header value
available_formats: List of enabled formats (default: all)
Returns:
Selected format: 'rss', 'atom', or 'json'
"""
if not available_formats:
available_formats = ['rss', 'atom', 'json']
# Parse Accept header
accept_types = self._parse_accept_header(accept_header)
# Score each format
scores = {}
for format_name in available_formats:
scores[format_name] = self._score_format(format_name, accept_types)
# Select highest scoring format
if scores:
best_format = max(scores, key=scores.get)
if scores[best_format] > 0:
return best_format
# Default to RSS if no preference
return 'rss' if 'rss' in available_formats else available_formats[0]
def _parse_accept_header(self, accept_header: str) -> List[Dict[str, Any]]:
"""Parse Accept header into list of types with quality"""
if not accept_header:
return []
types = []
for part in accept_header.split(','):
part = part.strip()
if not part:
continue
# Split type and parameters
parts = part.split(';')
mime_type = parts[0].strip()
# Parse quality factor
quality = 1.0
for param in parts[1:]:
param = param.strip()
if param.startswith('q='):
try:
quality = float(param[2:])
except ValueError:
quality = 1.0
types.append({
'type': mime_type,
'quality': quality
})
# Sort by quality descending
return sorted(types, key=lambda x: x['quality'], reverse=True)
def _score_format(self, format_name: str, accept_types: List[Dict]) -> float:
"""Score a format against Accept types"""
mime_types = self.MIME_TYPES.get(format_name, [])
best_score = 0.0
for accept in accept_types:
accept_type = accept['type']
quality = accept['quality']
# Check for exact match
if accept_type in mime_types:
best_score = max(best_score, quality)
# Check for wildcard matches
elif accept_type == '*/*':
best_score = max(best_score, quality * 0.1)
elif accept_type == 'application/*':
if any(m.startswith('application/') for m in mime_types):
best_score = max(best_score, quality * 0.5)
elif accept_type == 'text/*':
if any(m.startswith('text/') for m in mime_types):
best_score = max(best_score, quality * 0.5)
return best_score
```
### Accept Header Examples
| Accept Header | Selected Format | Reason |
|--------------|-----------------|--------|
| `application/atom+xml` | atom | Exact match |
| `application/json` | json | JSON match |
| `application/rss+xml, application/atom+xml;q=0.9` | rss | Higher quality |
| `text/html, application/*;q=0.9` | rss | Wildcard match, RSS default |
| `*/*` | rss | No preference, use default |
| (empty) | rss | No header, use default |
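For a quick sanity check, the table rows above map directly onto `negotiate()` calls:
```python
negotiator = ContentNegotiator()

assert negotiator.negotiate('application/atom+xml') == 'atom'
assert negotiator.negotiate('application/rss+xml, application/atom+xml;q=0.9') == 'rss'
assert negotiator.negotiate('*/*') == 'rss'  # no preference, default wins
assert negotiator.negotiate('') == 'rss'     # no header, default wins
```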
## Feed Caching
### Cache Design
```python
from collections import OrderedDict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
import hashlib
@dataclass
class CacheEntry:
"""Single cache entry with metadata"""
key: str
content: str
content_type: str
created_at: datetime
expires_at: datetime
hit_count: int = 0
size_bytes: int = 0
class FeedCache:
"""LRU cache with TTL for feed content"""
def __init__(self, max_size: int = 100, default_ttl: int = 300):
"""Initialize cache
Args:
max_size: Maximum number of entries
default_ttl: Default TTL in seconds
"""
self.max_size = max_size
self.default_ttl = default_ttl
self.cache = OrderedDict()
self.stats = {
'hits': 0,
'misses': 0,
'evictions': 0,
'invalidations': 0
}
def get(self, format: str, limit: int, checksum: str) -> Optional[CacheEntry]:
"""Get cached feed if available and not expired"""
key = self._make_key(format, limit, checksum)
if key not in self.cache:
self.stats['misses'] += 1
return None
entry = self.cache[key]
# Check expiration
if datetime.now() > entry.expires_at:
del self.cache[key]
self.stats['misses'] += 1
return None
# Move to end (LRU)
self.cache.move_to_end(key)
# Update stats
entry.hit_count += 1
self.stats['hits'] += 1
return entry
def set(self, format: str, limit: int, checksum: str, content: str,
content_type: str, ttl: Optional[int] = None):
"""Store feed in cache"""
key = self._make_key(format, limit, checksum)
ttl = ttl or self.default_ttl
# Create entry
entry = CacheEntry(
key=key,
content=content,
content_type=content_type,
created_at=datetime.now(),
expires_at=datetime.now() + timedelta(seconds=ttl),
size_bytes=len(content.encode('utf-8'))
)
# Add to cache
self.cache[key] = entry
# Enforce size limit
while len(self.cache) > self.max_size:
# Remove oldest (first) item
evicted_key = next(iter(self.cache))
del self.cache[evicted_key]
self.stats['evictions'] += 1
def invalidate(self, pattern: Optional[str] = None):
"""Invalidate cache entries matching pattern"""
if pattern is None:
# Clear all
count = len(self.cache)
self.cache.clear()
self.stats['invalidations'] += count
else:
# Clear matching keys
keys_to_remove = [
key for key in self.cache
if pattern in key
]
for key in keys_to_remove:
del self.cache[key]
self.stats['invalidations'] += 1
def _make_key(self, format: str, limit: int, checksum: str) -> str:
"""Generate cache key"""
return f"feed:{format}:{limit}:{checksum}"
def get_stats(self) -> Dict[str, Any]:
"""Get cache statistics"""
total_requests = self.stats['hits'] + self.stats['misses']
hit_rate = (self.stats['hits'] / total_requests * 100) if total_requests > 0 else 0
# Calculate memory usage
total_bytes = sum(entry.size_bytes for entry in self.cache.values())
return {
'entries': len(self.cache),
'max_entries': self.max_size,
'memory_mb': total_bytes / (1024 * 1024),
'hit_rate': hit_rate,
'hits': self.stats['hits'],
'misses': self.stats['misses'],
'evictions': self.stats['evictions'],
'invalidations': self.stats['invalidations']
}
class ContentChecksum:
"""Generate checksums for cache invalidation"""
@staticmethod
def calculate(notes: List[Note], config: Dict) -> str:
"""Calculate checksum based on content state"""
# Use latest note timestamp and count
if notes:
latest_timestamp = max(n.updated_at or n.created_at for n in notes)
checksum_data = f"{latest_timestamp.isoformat()}:{len(notes)}"
else:
checksum_data = "empty:0"
# Include configuration that affects output
config_data = f"{config.get('site_name')}:{config.get('site_url')}"
# Generate hash
combined = f"{checksum_data}:{config_data}"
return hashlib.md5(combined.encode()).hexdigest()[:8]
```
### Cache Integration
```python
# In feed route handler
@app.route('/feed')
@app.route('/feed.<format>')
def serve_feed(format=None):
    """Serve feed in requested format"""
    # Content negotiation when no explicit format is in the URL
    if format is None:
        negotiator = ContentNegotiator()
        format = negotiator.negotiate(request.headers.get('Accept'))
# Get notes and calculate checksum
notes = get_published_notes()
checksum = ContentChecksum.calculate(notes, app.config)
# Check cache
cached = feed_cache.get(format, limit=50, checksum=checksum)
if cached:
return Response(
cached.content,
mimetype=cached.content_type,
headers={'X-Cache': 'HIT'}
)
# Generate feed
if format == 'rss':
content = rss_generator.generate(notes)
content_type = 'application/rss+xml'
elif format == 'atom':
content = atom_generator.generate(notes)
content_type = 'application/atom+xml'
elif format == 'json':
content = json_generator.generate(notes)
content_type = 'application/feed+json'
else:
abort(404)
# Cache the result
feed_cache.set(format, 50, checksum, content, content_type)
return Response(
content,
mimetype=content_type,
headers={'X-Cache': 'MISS'}
)
```
## Statistics Dashboard
### Dashboard Design
```python
from collections import defaultdict, deque
from datetime import datetime
from typing import Any, Dict, Optional


class SyndicationStats:
"""Collect and analyze syndication statistics"""
def __init__(self):
self.requests = defaultdict(int) # By format
self.user_agents = defaultdict(int)
self.generation_times = defaultdict(list)
self.errors = deque(maxlen=100)
def record_request(self, format: str, user_agent: str, cached: bool,
generation_time: Optional[float] = None):
"""Record feed request"""
self.requests[format] += 1
self.user_agents[self._normalize_user_agent(user_agent)] += 1
if generation_time is not None:
self.generation_times[format].append(generation_time)
# Keep only last 1000 times
if len(self.generation_times[format]) > 1000:
self.generation_times[format] = self.generation_times[format][-1000:]
def record_error(self, format: str, error: str):
"""Record feed generation error"""
self.errors.append({
'timestamp': datetime.now(),
'format': format,
'error': error
})
def get_summary(self) -> Dict[str, Any]:
"""Get statistics summary"""
total_requests = sum(self.requests.values())
# Calculate format distribution
format_distribution = {
format: (count / total_requests * 100) if total_requests > 0 else 0
for format, count in self.requests.items()
}
# Top user agents
top_agents = sorted(
self.user_agents.items(),
key=lambda x: x[1],
reverse=True
)[:10]
# Generation time stats
time_stats = {}
for format, times in self.generation_times.items():
if times:
sorted_times = sorted(times)
time_stats[format] = {
'avg': sum(times) / len(times),
'p50': sorted_times[len(times) // 2],
'p95': sorted_times[int(len(times) * 0.95)],
'p99': sorted_times[int(len(times) * 0.99)]
}
return {
'total_requests': total_requests,
'format_distribution': format_distribution,
'top_user_agents': top_agents,
'generation_times': time_stats,
'recent_errors': list(self.errors)
}
def _normalize_user_agent(self, user_agent: str) -> str:
"""Normalize user agent for grouping"""
if not user_agent:
return 'Unknown'
# Common patterns
patterns = [
(r'Feedly', 'Feedly'),
(r'Inoreader', 'Inoreader'),
(r'NewsBlur', 'NewsBlur'),
(r'Tiny Tiny RSS', 'Tiny Tiny RSS'),
(r'FreshRSS', 'FreshRSS'),
(r'NetNewsWire', 'NetNewsWire'),
(r'Feedbin', 'Feedbin'),
(r'bot|Bot|crawler|Crawler', 'Bot/Crawler'),
(r'Mozilla.*Firefox', 'Firefox'),
(r'Mozilla.*Chrome', 'Chrome'),
(r'Mozilla.*Safari', 'Safari')
]
import re
for pattern, name in patterns:
if re.search(pattern, user_agent):
return name
return 'Other'
```
### Dashboard Template
```html
<!-- templates/admin/syndication.html -->
{% extends "admin/base.html" %}
{% block title %}Syndication Dashboard{% endblock %}
{% block content %}
<div class="syndication-dashboard">
<h2>Syndication Statistics</h2>
<!-- Overview Cards -->
<div class="stats-grid">
<div class="stat-card">
<h3>Total Requests</h3>
<p class="stat-value">{{ stats.total_requests }}</p>
</div>
<div class="stat-card">
<h3>Cache Hit Rate</h3>
<p class="stat-value">{{ cache_stats.hit_rate|round(1) }}%</p>
</div>
<div class="stat-card">
<h3>Active Formats</h3>
<p class="stat-value">{{ stats.format_distribution|length }}</p>
</div>
<div class="stat-card">
<h3>Cache Memory</h3>
<p class="stat-value">{{ cache_stats.memory_mb|round(2) }}MB</p>
</div>
</div>
<!-- Format Distribution -->
<div class="chart-container">
<h3>Format Distribution</h3>
<canvas id="format-chart"></canvas>
</div>
<!-- Top User Agents -->
<div class="table-container">
<h3>Top Feed Readers</h3>
<table>
<thead>
<tr>
<th>Reader</th>
<th>Requests</th>
<th>Percentage</th>
</tr>
</thead>
<tbody>
{% for agent, count in stats.top_user_agents %}
<tr>
<td>{{ agent }}</td>
<td>{{ count }}</td>
<td>{{ (count / stats.total_requests * 100)|round(1) }}%</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
<!-- Generation Performance -->
<div class="table-container">
<h3>Generation Performance</h3>
<table>
<thead>
<tr>
<th>Format</th>
<th>Avg (ms)</th>
<th>P50 (ms)</th>
<th>P95 (ms)</th>
<th>P99 (ms)</th>
</tr>
</thead>
<tbody>
{% for format, times in stats.generation_times.items() %}
<tr>
<td>{{ format|upper }}</td>
<td>{{ (times.avg * 1000)|round(1) }}</td>
<td>{{ (times.p50 * 1000)|round(1) }}</td>
<td>{{ (times.p95 * 1000)|round(1) }}</td>
<td>{{ (times.p99 * 1000)|round(1) }}</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
<!-- Recent Errors -->
{% if stats.recent_errors %}
<div class="error-log">
<h3>Recent Errors</h3>
<ul>
{% for error in stats.recent_errors[-10:] %}
<li>
<span class="timestamp">{{ error.timestamp|timeago }}</span>
<span class="format">{{ error.format }}</span>
<span class="error">{{ error.error }}</span>
</li>
{% endfor %}
</ul>
</div>
{% endif %}
<!-- Feed URLs -->
<div class="feed-urls">
<h3>Available Feeds</h3>
<ul>
<li>RSS: <code>{{ url_for('serve_feed', format='rss', _external=True) }}</code></li>
<li>ATOM: <code>{{ url_for('serve_feed', format='atom', _external=True) }}</code></li>
<li>JSON: <code>{{ url_for('serve_feed', format='json', _external=True) }}</code></li>
<li>OPML: <code>{{ url_for('export_opml', _external=True) }}</code></li>
</ul>
</div>
</div>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script>
// Format distribution pie chart
const ctx = document.getElementById('format-chart').getContext('2d');
new Chart(ctx, {
type: 'pie',
data: {
labels: {{ stats.format_distribution.keys()|list|tojson }},
datasets: [{
data: {{ stats.format_distribution.values()|list|tojson }},
backgroundColor: ['#FF6384', '#36A2EB', '#FFCE56']
}]
}
});
</script>
{% endblock %}
```
## OPML Export
### OPML Generator
```python
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom import minidom
from datetime import datetime, timezone
from typing import List


class OPMLGenerator:
"""Generate OPML 2.0 feed list"""
def __init__(self, site_url: str, site_name: str, owner_name: str = None,
owner_email: str = None):
self.site_url = site_url.rstrip('/')
self.site_name = site_name
self.owner_name = owner_name
self.owner_email = owner_email
def generate(self, include_formats: List[str] = None) -> str:
"""Generate OPML document
Args:
include_formats: List of formats to include (default: all enabled)
Returns:
OPML 2.0 XML string
"""
if not include_formats:
include_formats = ['rss', 'atom', 'json']
# Create root element
opml = Element('opml', version='2.0')
# Add head
head = SubElement(opml, 'head')
SubElement(head, 'title').text = f"{self.site_name} Feeds"
SubElement(head, 'dateCreated').text = datetime.now(timezone.utc).strftime(
'%a, %d %b %Y %H:%M:%S %z'
)
SubElement(head, 'dateModified').text = datetime.now(timezone.utc).strftime(
'%a, %d %b %Y %H:%M:%S %z'
)
if self.owner_name:
SubElement(head, 'ownerName').text = self.owner_name
if self.owner_email:
SubElement(head, 'ownerEmail').text = self.owner_email
# Add body with outlines
body = SubElement(opml, 'body')
# Add feed outlines
if 'rss' in include_formats:
SubElement(body, 'outline',
type='rss',
text=f"{self.site_name} - RSS Feed",
title=f"{self.site_name} - RSS Feed",
xmlUrl=f"{self.site_url}/feed.xml",
htmlUrl=self.site_url)
if 'atom' in include_formats:
SubElement(body, 'outline',
type='atom',
text=f"{self.site_name} - ATOM Feed",
title=f"{self.site_name} - ATOM Feed",
xmlUrl=f"{self.site_url}/feed.atom",
htmlUrl=self.site_url)
if 'json' in include_formats:
SubElement(body, 'outline',
type='json',
text=f"{self.site_name} - JSON Feed",
title=f"{self.site_name} - JSON Feed",
xmlUrl=f"{self.site_url}/feed.json",
htmlUrl=self.site_url)
# Convert to pretty XML
rough_string = tostring(opml, encoding='unicode')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=' ', encoding='UTF-8').decode('utf-8')
```
### OPML Example Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
<head>
<title>StarPunk Notes Feeds</title>
<dateCreated>Mon, 25 Nov 2024 12:00:00 +0000</dateCreated>
<dateModified>Mon, 25 Nov 2024 12:00:00 +0000</dateModified>
<ownerName>John Doe</ownerName>
<ownerEmail>john@example.com</ownerEmail>
</head>
<body>
<outline type="rss"
text="StarPunk Notes - RSS Feed"
title="StarPunk Notes - RSS Feed"
xmlUrl="https://example.com/feed.xml"
htmlUrl="https://example.com"/>
<outline type="atom"
text="StarPunk Notes - ATOM Feed"
title="StarPunk Notes - ATOM Feed"
xmlUrl="https://example.com/feed.atom"
htmlUrl="https://example.com"/>
<outline type="json"
text="StarPunk Notes - JSON Feed"
title="StarPunk Notes - JSON Feed"
xmlUrl="https://example.com/feed.json"
htmlUrl="https://example.com"/>
</body>
</opml>
```
## Testing Strategy
### Content Negotiation Tests
```python
def test_content_negotiation():
"""Test Accept header parsing and format selection"""
negotiator = ContentNegotiator()
# Test exact matches
assert negotiator.negotiate('application/atom+xml') == 'atom'
assert negotiator.negotiate('application/feed+json') == 'json'
assert negotiator.negotiate('application/rss+xml') == 'rss'
# Test quality factors
assert negotiator.negotiate('application/atom+xml;q=0.8, application/rss+xml') == 'rss'
# Test wildcards
assert negotiator.negotiate('*/*') == 'rss' # Default
assert negotiator.negotiate('application/*') == 'rss' # First application type
# Test no preference
assert negotiator.negotiate('') == 'rss'
assert negotiator.negotiate('text/html') == 'rss'
```
### Cache Tests
```python
def test_feed_cache():
"""Test LRU cache with TTL"""
cache = FeedCache(max_size=3, default_ttl=1)
# Test set and get
cache.set('rss', 50, 'abc123', '<rss>content</rss>', 'application/rss+xml')
entry = cache.get('rss', 50, 'abc123')
assert entry is not None
assert entry.content == '<rss>content</rss>'
# Test expiration
time.sleep(1.1)
entry = cache.get('rss', 50, 'abc123')
assert entry is None
# Test LRU eviction
cache.set('rss', 50, 'aaa', 'content1', 'application/rss+xml')
cache.set('atom', 50, 'bbb', 'content2', 'application/atom+xml')
cache.set('json', 50, 'ccc', 'content3', 'application/json')
cache.set('rss', 100, 'ddd', 'content4', 'application/rss+xml') # Evicts oldest
assert cache.get('rss', 50, 'aaa') is None # Evicted
assert cache.get('atom', 50, 'bbb') is not None # Still present
```
### Statistics Tests
```python
def test_syndication_stats():
"""Test statistics collection"""
stats = SyndicationStats()
# Record requests
stats.record_request('rss', 'Feedly/1.0', cached=False, generation_time=0.05)
stats.record_request('atom', 'Inoreader/1.0', cached=True)
stats.record_request('json', 'NetNewsWire/6.0', cached=False, generation_time=0.03)
summary = stats.get_summary()
assert summary['total_requests'] == 3
assert 'rss' in summary['format_distribution']
assert len(summary['top_user_agents']) > 0
```
### OPML Tests
```python
def test_opml_generation():
"""Test OPML export"""
generator = OPMLGenerator(
site_url='https://example.com',
site_name='Test Site',
owner_name='John Doe'
)
opml = generator.generate(['rss', 'atom', 'json'])
# Parse and validate
import xml.etree.ElementTree as ET
    # Encode first: ElementTree rejects str input that carries an encoding declaration
    root = ET.fromstring(opml.encode('utf-8'))
assert root.tag == 'opml'
assert root.get('version') == '2.0'
# Check outlines
outlines = root.findall('.//outline')
assert len(outlines) == 3
assert outlines[0].get('type') == 'rss'
assert outlines[1].get('type') == 'atom'
assert outlines[2].get('type') == 'json'
```
## Performance Benchmarks
### Negotiation Performance
```python
def benchmark_content_negotiation():
"""Benchmark negotiation speed"""
negotiator = ContentNegotiator()
complex_header = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
start = time.perf_counter()
for _ in range(10000):
negotiator.negotiate(complex_header)
duration = time.perf_counter() - start
per_call = (duration / 10000) * 1000 # Convert to ms
assert per_call < 1.0 # Less than 1ms per negotiation
```
## Configuration
```ini
# Content negotiation
STARPUNK_FEED_NEGOTIATION_ENABLED=true
STARPUNK_FEED_DEFAULT_FORMAT=rss
# Cache settings
STARPUNK_FEED_CACHE_ENABLED=true
STARPUNK_FEED_CACHE_SIZE=100
STARPUNK_FEED_CACHE_TTL=300
STARPUNK_FEED_CACHE_MEMORY_LIMIT=10 # MB
# Statistics
STARPUNK_FEED_STATS_ENABLED=true
STARPUNK_FEED_STATS_RETENTION=7 # days
# OPML
STARPUNK_FEED_OPML_ENABLED=true
STARPUNK_FEED_OPML_OWNER_NAME=
STARPUNK_FEED_OPML_OWNER_EMAIL=
```
## Security Considerations
1. **Cache Poisoning**: Validate all cached content
2. **Header Injection**: Sanitize Accept headers
3. **Memory Exhaustion**: Limit cache size
4. **Statistics Privacy**: Don't log sensitive data
5. **OPML Injection**: Escape all XML content
## Acceptance Criteria
1. ✅ Content negotiation working correctly
2. ✅ Cache hit rate >80% achieved
3. ✅ Statistics dashboard functional
4. ✅ OPML export valid
5. ✅ Memory usage bounded
6. ✅ Performance targets met
7. ✅ All formats properly cached
8. ✅ Invalidation working
9. ✅ User agent detection accurate
10. ✅ Security review passed

View File

@@ -0,0 +1,745 @@
# StarPunk v1.1.2 "Syndicate" - Implementation Guide
## Overview
This guide provides a phased approach to implementing v1.1.2 "Syndicate" features. The release is structured in three phases totaling 14-16 hours of focused development.
## Pre-Implementation Checklist
- [x] Review v1.1.1 performance monitoring specification
- [x] Ensure development environment has Python 3.11+
- [x] Create feature branch: `feature/v1.1.2-syndicate`
- [ ] Review feed format specifications (RSS 2.0, ATOM 1.0, JSON Feed 1.1)
- [ ] Set up feed reader test clients
## Phase 1: Metrics Instrumentation (4-6 hours) ✅ COMPLETE
### Objective
Complete the metrics instrumentation that was partially implemented in v1.1.1, adding comprehensive coverage across all system operations.
### 1.1 Database Operation Timing (1.5 hours) ✅
**Location**: `starpunk/monitoring/database.py`
**Implementation Steps**:
1. **Create Database Monitor Wrapper** (a fuller sketch follows the metrics list below)
```python
class MonitoredConnection:
"""Wrapper for SQLite connections with timing"""
def execute(self, query, params=None):
# Start timer
# Execute query
# Record metric
# Return result
```
2. **Instrument All Query Types**
- SELECT queries (with row count)
- INSERT operations (with affected rows)
- UPDATE operations (with affected rows)
- DELETE operations (rare, but instrumented)
- Transaction boundaries (BEGIN/COMMIT)
3. **Add Query Pattern Detection**
- Identify query type (SELECT, INSERT, etc.)
- Extract table name
- Detect slow queries (>1s)
- Track prepared statement usage
**Metrics to Collect**:
- `db.query.duration` - Query execution time
- `db.query.count` - Number of queries by type
- `db.rows.returned` - Result set size
- `db.transaction.duration` - Transaction time
- `db.connection.wait` - Connection acquisition time
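For reference, a minimal sketch of the wrapper's `execute()` path described in step 1 (the `metrics.record(...)` call is an assumed collector API, not the shipped Phase 1 code):
```python
import time

class MonitoredConnection:
    """Sketch: wrap a SQLite connection, time queries, flag slow ones."""

    def __init__(self, connection, metrics, slow_threshold=1.0):
        self._conn = connection
        self._metrics = metrics
        self._slow_threshold = slow_threshold

    def execute(self, query, params=None):
        start = time.perf_counter()
        cursor = self._conn.execute(query, params or [])
        duration = time.perf_counter() - start
        self._metrics.record('db.query.duration', duration)
        if duration > self._slow_threshold:
            self._metrics.record('db.query.slow', 1)
        return cursor

    def __getattr__(self, name):
        # Delegate commit, rollback, close, etc. to the wrapped connection
        return getattr(self._conn, name)
```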
### 1.2 HTTP Request/Response Metrics (1.5 hours) ✅
**Location**: `starpunk/monitoring/http.py`
**Implementation Steps**:
1. **Enhance Request Middleware**
```python
@app.before_request
def start_request_metrics():
g.metrics = {
'start_time': time.perf_counter(),
'start_memory': get_memory_usage(),
'request_id': generate_request_id()
}
```
2. **Capture Response Metrics** (a fleshed-out sketch follows the metrics list below)
```python
@app.after_request
def capture_response_metrics(response):
# Calculate duration
# Measure memory delta
# Record response size
# Track status codes
```
3. **Add Endpoint-Specific Metrics**
- Feed generation timing
- Micropub processing time
- Static file serving
- Admin operations
**Metrics to Collect**:
- `http.request.duration` - Total request time
- `http.request.size` - Request body size
- `http.response.size` - Response body size
- `http.status.{code}` - Status code distribution
- `http.endpoint.{name}` - Per-endpoint timing
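One possible shape for the `after_request` hook from step 2, assuming the `g.metrics` dict set in the `before_request` example and an illustrative `metrics.record(...)` collector API:
```python
@app.after_request
def capture_response_metrics(response):
    duration = time.perf_counter() - g.metrics['start_time']
    metrics.record('http.request.duration', duration)
    metrics.record('http.response.size', response.content_length or 0)
    metrics.record(f"http.status.{response.status_code}", 1)
    response.headers['X-Request-ID'] = g.metrics['request_id']
    return response
```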
### 1.3 Memory Monitoring Thread (1 hour) ✅
**Location**: `starpunk/monitoring/memory.py`
**Implementation Steps**:
1. **Create Background Monitor**
```python
class MemoryMonitor(Thread):
def run(self):
while self.running:
# Get RSS memory
# Check for growth
# Detect potential leaks
# Sleep interval
```
2. **Track Memory Patterns**
- Process RSS memory
- Virtual memory size
- Memory growth rate
- High water mark
- Garbage collection stats
3. **Add Leak Detection**
- Baseline after startup
- Track growth over time
- Alert on sustained growth
- Identify allocation sources
**Metrics to Collect**:
- `memory.rss` - Resident set size
- `memory.vms` - Virtual memory size
- `memory.growth_rate` - MB/hour
- `memory.gc.collections` - GC runs
- `memory.high_water` - Peak usage
### 1.4 Business Metrics for Syndication (1 hour) ✅
**Location**: `starpunk/monitoring/business.py`
**Implementation Steps**:
1. **Track Feed Operations**
- Feed requests by format
- Cache hit/miss rates
- Generation timing
- Format negotiation results
2. **Monitor Content Flow**
- Notes published per day
- Average note length
- Media attachments
- Syndication success
3. **User Behavior Metrics**
- Popular feed formats
- Reader user agents
- Request patterns
- Geographic distribution
**Metrics to Collect**:
- `feed.requests.{format}` - Requests by format
- `feed.cache.hit_rate` - Cache effectiveness
- `feed.generation.time` - Generation duration
- `content.notes.published` - Publishing rate
- `content.syndication.success` - Successful syndications
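A hedged sketch of what such helpers can look like (function and metric names are illustrative; the actual Phase 1 helpers may differ):
```python
def record_feed_request(metrics, fmt: str, cache_hit: bool) -> None:
    """Record one feed request and its cache outcome."""
    metrics.record(f'feed.requests.{fmt}', 1)
    metrics.record('feed.cache.hit' if cache_hit else 'feed.cache.miss', 1)

def record_note_published(metrics, content: str) -> None:
    """Record a published note and its length in characters."""
    metrics.record('content.notes.published', 1)
    metrics.record('content.notes.length', len(content))
```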
### Phase 1 Completion Status ✅
**Completed**: 2025-11-25
**Developer**: StarPunk Fullstack Developer (AI)
**Review**: Approved by Architect on 2025-11-26
**Test Results**: 28/28 tests passing
**Performance**: <1% overhead achieved
**Next Step**: Begin Phase 2 - Feed Formats
**Note**: All Phase 1 metrics instrumentation is complete and ready for production use. Business metrics functions are available for integration into notes.py and feed.py during Phase 2.
## Phase 2: Feed Formats (6-8 hours)
### Objective
Fix RSS feed ordering regression, then implement ATOM and JSON Feed formats alongside existing RSS, with proper content negotiation and caching.
### 2.0 Fix RSS Feed Ordering Regression (0.5 hours) - CRITICAL
**Location**: `starpunk/feed.py`
**Critical Production Bug**: RSS feed currently shows oldest entries first instead of newest first. This violates RSS standards and user expectations.
**Root Cause**: Incorrect `reversed()` calls on lines 100 and 198 that flip the correct DESC order from database.
**Implementation Steps**:
1. **Remove Incorrect Reversals**
- Line 100: Remove `reversed()` from `for note in reversed(notes[:limit]):`
- Line 198: Remove `reversed()` from `for note in reversed(notes[:limit]):`
- Update/remove misleading comments about feedgen reversing order
2. **Verify Expected Behavior**
- Database returns notes in DESC order (newest first) - confirmed line 440 of notes.py
- Feed should maintain this order (newest entries first)
- This is the standard for ALL feed formats (RSS, ATOM, JSON Feed)
3. **Add Feed Order Tests**
```python
def test_rss_feed_newest_first():
"""Test RSS feed shows newest entries first"""
# Create notes with different timestamps
old_note = create_note(title="Old", created_at=yesterday)
new_note = create_note(title="New", created_at=today)
# Generate feed
feed = generate_rss_feed([old_note, new_note])
# Parse and verify order
items = parse_feed_items(feed)
assert items[0].title == "New"
assert items[1].title == "Old"
```
**Important**: This MUST be fixed before implementing ATOM and JSON feeds to ensure all formats have consistent, correct ordering.
### 2.1 ATOM Feed Generation (2.5 hours)
**Location**: `starpunk/feed/atom.py`
**Implementation Steps**:
1. **Create ATOM Generator Class**
```python
class AtomGenerator:
def generate(self, notes, config):
# Yield XML declaration
# Yield feed element
# Yield entries
# Stream output
```
2. **Implement ATOM 1.0 Elements**
- Required: id, title, updated
- Recommended: author, link, category
- Optional: contributor, generator, icon, logo, rights, subtitle
3. **Handle Content Types**
- Text content (escaped)
- HTML content (in CDATA)
- XHTML content (inline)
- Base64 for binary
4. **Date Formatting** (see the sketch after this list)
- RFC 3339 format
- Timezone handling
- Updated vs published
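A small sketch of the RFC 3339 formatting (it assumes naive datetimes from the database represent UTC):
```python
from datetime import datetime, timezone

def format_rfc3339(dt: datetime) -> str:
    """Render a datetime as RFC 3339 with an explicit UTC designator."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return dt.isoformat().replace('+00:00', 'Z')

# format_rfc3339(datetime(2024, 11, 25, 12, 0)) -> '2024-11-25T12:00:00Z'
```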
**ATOM Structure**:
```xml
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Site Title</title>
<link href="http://example.com/"/>
<link href="http://example.com/feed.atom" rel="self"/>
<updated>2024-11-25T12:00:00Z</updated>
<author>
<name>Author Name</name>
</author>
<id>http://example.com/</id>
<entry>
<title>Note Title</title>
<link href="http://example.com/note/1"/>
<id>http://example.com/note/1</id>
<updated>2024-11-25T12:00:00Z</updated>
<content type="html">
<![CDATA[<p>HTML content</p>]]>
</content>
</entry>
</feed>
```
### 2.2 JSON Feed Generation (2.5 hours)
**Location**: `starpunk/feed/json_feed.py`
**Implementation Steps**:
1. **Create JSON Feed Generator**
```python
class JsonFeedGenerator:
def generate(self, notes, config):
# Build feed object
# Add items array
# Include metadata
# Stream JSON output
```
2. **Implement JSON Feed 1.1 Schema**
- version (required)
- title (required)
- items (required array)
- home_page_url
- feed_url
- description
- authors array
- language
- icon, favicon
3. **Handle Rich Content**
- content_html
- content_text
- summary
- image attachments
- tags array
- authors array
4. **Add Extensions**
- _starpunk namespace
- Pagination hints
- Hub for real-time
**JSON Feed Structure**:
```json
{
"version": "https://jsonfeed.org/version/1.1",
"title": "Site Title",
"home_page_url": "https://example.com/",
"feed_url": "https://example.com/feed.json",
"description": "Site description",
"authors": [
{
"name": "Author Name",
"url": "https://example.com/about"
}
],
"items": [
{
"id": "https://example.com/note/1",
"url": "https://example.com/note/1",
"title": "Note Title",
"content_html": "<p>HTML content</p>",
"date_published": "2024-11-25T12:00:00Z",
"tags": ["tag1", "tag2"]
}
]
}
```
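A minimal serialization sketch for the items array above (the note attributes and helper names are assumptions about the StarPunk models, not confirmed interfaces):
```python
import json

def note_to_item(note, base_url: str) -> dict:
    """Map one note onto a JSON Feed 1.1 item (required id plus common fields)."""
    url = f"{base_url}/note/{note.slug}"  # assumed attribute
    return {
        "id": url,
        "url": url,
        "content_html": note.html,                      # assumed attribute
        "date_published": note.created_at.isoformat(),  # assumed attribute
    }

def generate_json_feed(notes, title: str, base_url: str) -> str:
    feed = {
        "version": "https://jsonfeed.org/version/1.1",
        "title": title,
        "home_page_url": f"{base_url}/",
        "feed_url": f"{base_url}/feed.json",
        "items": [note_to_item(n, base_url) for n in notes],
    }
    return json.dumps(feed, ensure_ascii=False)
```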
### 2.3 Content Negotiation (1.5 hours)
**Location**: `starpunk/feed/negotiator.py`
**Implementation Steps**:
1. **Create Content Negotiator**
```python
class FeedNegotiator:
def negotiate(self, accept_header):
# Parse Accept header
# Score each format
# Return best match
```
2. **Parse Accept Header**
- Split on comma
- Extract MIME type
- Parse quality factors (q=)
- Handle wildcards (*/*)
3. **Score Formats**
- Exact match: 1.0
- Wildcard match: 0.5
- Type/* match: 0.7
- Default RSS: 0.1
4. **Format Mapping**
```python
FORMAT_MIME_TYPES = {
'rss': ['application/rss+xml', 'application/xml', 'text/xml'],
'atom': ['application/atom+xml'],
'json': ['application/json', 'application/feed+json']
}
```
### 2.4 Feed Validation (1.5 hours)
**Location**: `starpunk/feed/validators.py`
**Implementation Steps**:
1. **Create Validation Framework**
```python
class FeedValidator(Protocol):
def validate(self, content: str) -> List[ValidationError]:
pass
```
2. **RSS Validator**
- Check required elements
- Verify date formats
- Validate URLs
- Check CDATA escaping
3. **ATOM Validator**
- Verify namespace
- Check required elements
- Validate RFC 3339 dates
- Verify ID uniqueness
4. **JSON Feed Validator**
- Validate against schema
- Check required fields
- Verify URL formats
- Validate date strings
**Validation Levels**:
- ERROR: Feed is invalid
- WARNING: Non-critical issue
- INFO: Suggestion for improvement
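As a starting point, an ATOM check along the lines described above might look like this (a sketch covering only the required feed-level elements):
```python
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'

def validate_atom(content: str) -> list:
    """Return ERROR-level messages; an empty list means the basic checks passed."""
    try:
        # Encode first: ElementTree rejects str input with an encoding declaration
        root = ET.fromstring(content.encode('utf-8'))
    except ET.ParseError as exc:
        return [f"XML parse error: {exc}"]
    errors = []
    if root.tag != f'{ATOM_NS}feed':
        errors.append("Root element is not an Atom <feed>")
    for required in ('id', 'title', 'updated'):
        if root.find(f'{ATOM_NS}{required}') is None:
            errors.append(f"Missing required feed element: <{required}>")
    return errors
```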
## Phase 3: Feed Enhancements (4 hours)
### Objective
Add caching, statistics, and operational improvements to the feed system.
### 3.1 Feed Caching Layer (1.5 hours)
**Location**: `starpunk/feed/cache.py`
**Implementation Steps**:
1. **Create Cache Manager**
```python
class FeedCache:
def __init__(self, max_size=100, ttl=300):
self.cache = LRU(max_size)
self.ttl = ttl
```
2. **Cache Key Generation**
- Format type
- Item limit
- Content checksum
- Last modified
3. **Cache Operations**
- Get with TTL check
- Set with expiration
- Invalidate on changes
- Clear entire cache
4. **Memory Management**
- Monitor cache size
- Implement eviction
- Track hit rates
- Report statistics
**Cache Strategy**:
```python
def get_or_generate(format, limit):
key = generate_cache_key(format, limit)
cached = cache.get(key)
if cached and not expired(cached):
metrics.record_cache_hit()
return cached
content = generate_feed(format, limit)
cache.set(key, content, ttl=300)
metrics.record_cache_miss()
return content
```
### 3.2 Statistics Dashboard (1.5 hours)
**Location**: `starpunk/admin/syndication.py`
**Template**: `templates/admin/syndication.html`
**Implementation Steps**:
1. **Create Dashboard Route**
```python
@app.route('/admin/syndication')
@require_admin
def syndication_dashboard():
stats = gather_syndication_stats()
return render_template('admin/syndication.html', stats=stats)
```
2. **Gather Statistics**
- Requests by format (pie chart)
- Cache hit rates (line graph)
- Generation times (histogram)
- Popular user agents (table)
- Recent errors (log)
3. **Create Dashboard UI**
- Overview cards
- Time series graphs
- Format breakdown
- Performance metrics
- Configuration status
**Dashboard Sections**:
- Feed Format Usage
- Cache Performance
- Generation Times
- Client Analysis
- Error Log
- Configuration
### 3.3 OPML Export (1 hour)
**Location**: `starpunk/feed/opml.py`
**Implementation Steps**:
1. **Create OPML Generator**
```python
def generate_opml(site_config):
# Generate OPML header
# Add feed outlines
# Include metadata
return opml_content
```
2. **OPML Structure**
```xml
<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
<head>
<title>StarPunk Feeds</title>
<dateCreated>Mon, 25 Nov 2024 12:00:00 UTC</dateCreated>
</head>
<body>
<outline type="rss" text="RSS Feed" xmlUrl="https://example.com/feed.xml"/>
<outline type="atom" text="ATOM Feed" xmlUrl="https://example.com/feed.atom"/>
<outline type="json" text="JSON Feed" xmlUrl="https://example.com/feed.json"/>
</body>
</opml>
```
3. **Add Export Route**
```python
@app.route('/feeds.opml')
def export_opml():
opml = generate_opml(config)
return Response(opml, mimetype='text/x-opml')
```
## Testing Strategy
### Phase 1 Tests (Metrics)
1. **Unit Tests**
- Mock database operations
- Test metric collection
- Verify memory monitoring
- Test business metrics
2. **Integration Tests**
- End-to-end request tracking
- Database timing accuracy
- Memory leak detection
- Metrics aggregation
### Phase 2 Tests (Feeds)
1. **Format Tests**
- Valid RSS generation
- Valid ATOM generation
- Valid JSON Feed generation
- Content negotiation logic
- **Feed ordering (newest first) for ALL formats - CRITICAL**
2. **Feed Ordering Tests (REQUIRED)**
```python
def test_all_feeds_newest_first():
"""Verify all feed formats show newest entries first"""
old_note = create_note(title="Old", created_at=yesterday)
new_note = create_note(title="New", created_at=today)
notes = [new_note, old_note] # DESC order from database
# Test RSS
rss_feed = generate_rss_feed(notes)
assert first_item(rss_feed).title == "New"
# Test ATOM
atom_feed = generate_atom_feed(notes)
assert first_item(atom_feed).title == "New"
# Test JSON
json_feed = generate_json_feed(notes)
assert json_feed['items'][0]['title'] == "New"
```
3. **Compliance Tests**
- W3C Feed Validator
- ATOM validator
- JSON Feed validator
- Popular readers
### Phase 3 Tests (Enhancements)
1. **Cache Tests**
- TTL expiration
- LRU eviction
- Invalidation
- Hit rate tracking
2. **Dashboard Tests**
- Statistics accuracy
- Graph rendering
- OPML validity
- Performance impact
## Configuration Updates
### New Configuration Options
Add to `config.py`:
```python
# Feed configuration
FEED_DEFAULT_LIMIT = int(os.getenv('STARPUNK_FEED_DEFAULT_LIMIT', 50))
FEED_MAX_LIMIT = int(os.getenv('STARPUNK_FEED_MAX_LIMIT', 500))
FEED_CACHE_TTL = int(os.getenv('STARPUNK_FEED_CACHE_TTL', 300))
FEED_CACHE_SIZE = int(os.getenv('STARPUNK_FEED_CACHE_SIZE', 100))
# Format support
FEED_RSS_ENABLED = str_to_bool(os.getenv('STARPUNK_FEED_RSS_ENABLED', 'true'))
FEED_ATOM_ENABLED = str_to_bool(os.getenv('STARPUNK_FEED_ATOM_ENABLED', 'true'))
FEED_JSON_ENABLED = str_to_bool(os.getenv('STARPUNK_FEED_JSON_ENABLED', 'true'))
# Metrics for syndication
METRICS_FEED_TIMING = str_to_bool(os.getenv('STARPUNK_METRICS_FEED_TIMING', 'true'))
METRICS_CACHE_STATS = str_to_bool(os.getenv('STARPUNK_METRICS_CACHE_STATS', 'true'))
METRICS_FORMAT_USAGE = str_to_bool(os.getenv('STARPUNK_METRICS_FORMAT_USAGE', 'true'))
```
## Documentation Updates
### User Documentation
1. **Feed Formats Guide**
- How to access each format
- Which readers support what
- Format comparison
2. **Configuration Guide**
- New environment variables
- Performance tuning
- Cache settings
### API Documentation
1. **Feed Endpoints**
- `/feed.xml` - RSS feed
- `/feed.atom` - ATOM feed
- `/feed.json` - JSON feed
- `/feeds.opml` - OPML export
2. **Content Negotiation**
- Accept header usage
- Format precedence
- Default behavior
## Deployment Checklist
### Pre-deployment
- [ ] All tests passing
- [ ] Metrics instrumentation verified
- [ ] Feed formats validated
- [ ] Cache performance tested
- [ ] Documentation updated
### Deployment Steps
1. Backup database
2. Update configuration
3. Deploy new code
4. Run migrations (none for v1.1.2)
5. Clear feed cache
6. Test all feed formats
7. Verify metrics collection
### Post-deployment
- [ ] Monitor memory usage
- [ ] Check feed generation times
- [ ] Verify cache hit rates
- [ ] Test with feed readers
- [ ] Review error logs
## Rollback Plan
If issues arise:
1. **Immediate Rollback**
```bash
git checkout v1.1.1
supervisorctl restart starpunk
```
2. **Cache Cleanup**
```bash
redis-cli FLUSHDB # If using Redis
rm -rf /tmp/starpunk_cache/* # If file-based
```
3. **Configuration Rollback**
```bash
cp config.backup.ini config.ini
```
## Success Metrics
### Performance Targets
- Feed generation <100ms (50 items)
- Cache hit rate >80%
- Memory overhead <10MB
- Zero performance regression
### Compatibility Targets
- 10+ feed readers tested
- All validators passing
- No breaking changes
- Backward compatibility maintained
## Timeline
### Week 1
- Phase 1: Metrics instrumentation (4-6 hours)
- Testing and validation
### Week 2
- Phase 2: Feed formats (6-8 hours)
- Integration testing
### Week 3
- Phase 3: Enhancements (4 hours)
- Final testing and documentation
- Deployment
Total estimated time: 14-18 hours of focused development


@@ -0,0 +1,743 @@
# JSON Feed Specification - v1.1.2
## Overview
This specification defines the implementation of JSON Feed 1.1 format for StarPunk, providing a modern, developer-friendly syndication format that's easier to parse than XML-based feeds.
## Requirements
### Functional Requirements
1. **JSON Feed 1.1 Compliance**
- Full conformance to JSON Feed 1.1 spec
- Valid JSON structure
- Required fields present
- Proper date formatting
2. **Rich Content Support**
- HTML content
- Plain text content
- Summary field
- Image attachments
- External URLs
3. **Enhanced Metadata**
- Author objects with avatars
- Tags array
- Language specification
- Custom extensions
4. **Efficient Generation**
- Streaming JSON output
- Minimal memory usage
- Fast serialization
### Non-Functional Requirements
1. **Performance**
- Generation <50ms for 50 items
- Compact JSON output
- Efficient serialization
2. **Compatibility**
- Valid JSON syntax
- Works with JSON Feed readers
- Proper MIME type handling
## JSON Feed Structure
### Top-Level Object
```json
{
"version": "https://jsonfeed.org/version/1.1",
"title": "Required: Feed title",
"items": [],
"home_page_url": "https://example.com/",
"feed_url": "https://example.com/feed.json",
"description": "Feed description",
"user_comment": "Free-form comment",
"next_url": "https://example.com/feed.json?page=2",
"icon": "https://example.com/icon.png",
"favicon": "https://example.com/favicon.ico",
"authors": [],
"language": "en-US",
"expired": false,
"hubs": []
}
```
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `version` | String | Must be "https://jsonfeed.org/version/1.1" |
| `title` | String | Feed title |
| `items` | Array | Array of item objects |
### Optional Feed Fields
| Field | Type | Description |
|-------|------|-------------|
| `home_page_url` | String | Website URL |
| `feed_url` | String | URL of this feed |
| `description` | String | Feed description |
| `user_comment` | String | Implementation notes |
| `next_url` | String | Pagination next page |
| `icon` | String | 512x512+ image |
| `favicon` | String | Website favicon |
| `authors` | Array | Feed authors |
| `language` | String | RFC 5646 language tag |
| `expired` | Boolean | Feed no longer updated |
| `hubs` | Array | WebSub hubs |
### Item Object Structure
```json
{
"id": "Required: unique ID",
"url": "https://example.com/note/123",
"external_url": "https://external.com/article",
"title": "Item title",
"content_html": "<p>HTML content</p>",
"content_text": "Plain text content",
"summary": "Brief summary",
"image": "https://example.com/image.jpg",
"banner_image": "https://example.com/banner.jpg",
"date_published": "2024-11-25T12:00:00Z",
"date_modified": "2024-11-25T13:00:00Z",
"authors": [],
"tags": ["tag1", "tag2"],
"language": "en",
"attachments": [],
"_custom": {}
}
```
### Required Item Fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | String | Unique, stable ID |
### Optional Item Fields
| Field | Type | Description |
|-------|------|-------------|
| `url` | String | Item permalink |
| `external_url` | String | Link to external content |
| `title` | String | Item title |
| `content_html` | String | HTML content |
| `content_text` | String | Plain text content |
| `summary` | String | Brief summary |
| `image` | String | Main image URL |
| `banner_image` | String | Wide banner image |
| `date_published` | String | RFC 3339 date |
| `date_modified` | String | RFC 3339 date |
| `authors` | Array | Item authors |
| `tags` | Array | String tags |
| `language` | String | Language code |
| `attachments` | Array | File attachments |
### Author Object
```json
{
"name": "Author Name",
"url": "https://example.com/about",
"avatar": "https://example.com/avatar.jpg"
}
```
### Attachment Object
```json
{
"url": "https://example.com/file.pdf",
"mime_type": "application/pdf",
"title": "Attachment Title",
"size_in_bytes": 1024000,
"duration_in_seconds": 300
}
```
## Implementation Design
### JSON Feed Generator Class
```python
import json
from typing import List, Dict, Any, Iterator, Optional
from datetime import datetime, timezone
class JsonFeedGenerator:
"""JSON Feed 1.1 generator with streaming support"""
def __init__(self, site_url: str, site_name: str, site_description: str,
author_name: str = None, author_url: str = None, author_avatar: str = None):
self.site_url = site_url.rstrip('/')
self.site_name = site_name
self.site_description = site_description
self.author = {
'name': author_name,
'url': author_url,
'avatar': author_avatar
} if author_name else None
def generate(self, notes: List[Note], limit: int = 50) -> str:
"""Generate complete JSON feed
IMPORTANT: Notes are expected to be in DESC order (newest first)
from the database. This order MUST be preserved in the feed.
"""
feed = self._build_feed_object(notes[:limit])
return json.dumps(feed, ensure_ascii=False, indent=2)
def generate_streaming(self, notes: List[Note], limit: int = 50) -> Iterator[str]:
"""Generate JSON feed as stream of chunks
IMPORTANT: Notes are expected to be in DESC order (newest first)
from the database. This order MUST be preserved in the feed.
"""
# Start feed object
yield '{\n'
yield ' "version": "https://jsonfeed.org/version/1.1",\n'
yield f' "title": {json.dumps(self.site_name)},\n'
# Add optional feed metadata
yield from self._stream_feed_metadata()
# Start items array
yield ' "items": [\n'
# Stream items - maintain DESC order (newest first)
# DO NOT reverse! Database order is correct
items = notes[:limit]
for i, note in enumerate(items):
item_json = json.dumps(self._build_item_object(note), indent=4)
# Indent items properly
indented = '\n'.join(' ' + line for line in item_json.split('\n'))
yield indented
if i < len(items) - 1:
yield ',\n'
else:
yield '\n'
# Close items array and feed
yield ' ]\n'
yield '}\n'
def _build_feed_object(self, notes: List[Note]) -> Dict[str, Any]:
"""Build complete feed object"""
feed = {
'version': 'https://jsonfeed.org/version/1.1',
'title': self.site_name,
'home_page_url': self.site_url,
'feed_url': f'{self.site_url}/feed.json',
'description': self.site_description,
'items': [self._build_item_object(note) for note in notes]
}
# Add optional fields
if self.author:
feed['authors'] = [self._clean_author(self.author)]
feed['language'] = 'en' # Make configurable
# Add icon/favicon if configured
icon_url = self._get_icon_url()
if icon_url:
feed['icon'] = icon_url
favicon_url = self._get_favicon_url()
if favicon_url:
feed['favicon'] = favicon_url
return feed
def _build_item_object(self, note: Note) -> Dict[str, Any]:
"""Build item object from note"""
permalink = f'{self.site_url}{note.permalink}'
item = {
'id': permalink,
'url': permalink,
'title': note.title or self._format_date_title(note.created_at),
'date_published': self._format_json_date(note.created_at)
}
# Add content (prefer HTML)
if note.html:
item['content_html'] = note.html
elif note.content:
item['content_text'] = note.content
# Add modified date if different
if hasattr(note, 'updated_at') and note.updated_at != note.created_at:
item['date_modified'] = self._format_json_date(note.updated_at)
# Add summary if available
if hasattr(note, 'summary') and note.summary:
item['summary'] = note.summary
# Add tags if available
if hasattr(note, 'tags') and note.tags:
item['tags'] = note.tags
# Add author if different from feed author
if hasattr(note, 'author') and note.author != self.author:
item['authors'] = [self._clean_author(note.author)]
# Add image if available
image_url = self._extract_image_url(note)
if image_url:
item['image'] = image_url
# Add custom extensions
item['_starpunk'] = {
'permalink_path': note.permalink,
'word_count': len(note.content.split()) if note.content else 0
}
return item
def _clean_author(self, author: Any) -> Dict[str, str]:
"""Clean author object for JSON"""
clean = {}
if isinstance(author, dict):
if author.get('name'):
clean['name'] = author['name']
if author.get('url'):
clean['url'] = author['url']
if author.get('avatar'):
clean['avatar'] = author['avatar']
elif hasattr(author, 'name'):
clean['name'] = author.name
if hasattr(author, 'url'):
clean['url'] = author.url
if hasattr(author, 'avatar'):
clean['avatar'] = author.avatar
else:
clean['name'] = str(author)
return clean
def _format_json_date(self, dt: datetime) -> str:
"""Format datetime to RFC 3339 for JSON Feed
Format: 2024-11-25T12:00:00Z or 2024-11-25T12:00:00-05:00
"""
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
# Use Z for UTC
if dt.tzinfo == timezone.utc:
return dt.strftime('%Y-%m-%dT%H:%M:%SZ')
else:
return dt.isoformat()
def _extract_image_url(self, note: Note) -> Optional[str]:
"""Extract first image URL from note content"""
if not note.html:
return None
# Simple regex to find first img tag
import re
match = re.search(r'<img[^>]+src="([^"]+)"', note.html)
if match:
img_url = match.group(1)
# Make absolute if relative
if not img_url.startswith('http'):
img_url = f'{self.site_url}{img_url}'
return img_url
return None
```
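As a usage sketch (the site values and a pre-fetched `notes` list in DESC order are assumptions), the generator would be driven like this:
```python
generator = JsonFeedGenerator(
    site_url='https://example.com',
    site_name='StarPunk Notes',
    site_description='Personal notes and thoughts',
    author_name='John Doe',
)
feed_json = generator.generate(notes, limit=50)  # notes already newest-first
assert json.loads(feed_json)['version'] == 'https://jsonfeed.org/version/1.1'
```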
### Streaming JSON Generation
For memory efficiency with large feeds:
```python
class StreamingJsonEncoder:
"""Helper for streaming JSON generation"""
@staticmethod
def stream_object(obj: Dict[str, Any], indent: int = 0) -> Iterator[str]:
"""Stream a JSON object"""
indent_str = ' ' * indent
yield indent_str + '{\n'
items = list(obj.items())
for i, (key, value) in enumerate(items):
yield f'{indent_str} "{key}": '
if isinstance(value, dict):
yield from StreamingJsonEncoder.stream_object(value, indent + 2)
elif isinstance(value, list):
yield from StreamingJsonEncoder.stream_array(value, indent + 2)
else:
yield json.dumps(value)
if i < len(items) - 1:
yield ','
yield '\n'
yield indent_str + '}'
@staticmethod
def stream_array(arr: List[Any], indent: int = 0) -> Iterator[str]:
"""Stream a JSON array"""
indent_str = ' ' * indent
yield '[\n'
for i, item in enumerate(arr):
if isinstance(item, dict):
yield from StreamingJsonEncoder.stream_object(item, indent + 2)
else:
yield indent_str + ' ' + json.dumps(item)
if i < len(arr) - 1:
yield ','
yield '\n'
yield indent_str + ']'
```
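For large feeds, the streaming generator can be handed straight to a Flask response so the full document never sits in memory. A sketch, assuming a `get_recent_notes()` helper and a module-level `json_generator`:
```python
from flask import Response, stream_with_context

@app.route('/feed.json')
def json_feed():
    notes = get_recent_notes(limit=app.config['FEED_DEFAULT_LIMIT'])
    return Response(
        stream_with_context(json_generator.generate_streaming(notes)),
        mimetype='application/feed+json',
    )
```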
## Complete JSON Feed Example
```json
{
"version": "https://jsonfeed.org/version/1.1",
"title": "StarPunk Notes",
"home_page_url": "https://example.com/",
"feed_url": "https://example.com/feed.json",
"description": "Personal notes and thoughts",
"authors": [
{
"name": "John Doe",
"url": "https://example.com/about",
"avatar": "https://example.com/avatar.jpg"
}
],
"language": "en",
"icon": "https://example.com/icon.png",
"favicon": "https://example.com/favicon.ico",
"items": [
{
"id": "https://example.com/notes/2024/11/25/first-note",
"url": "https://example.com/notes/2024/11/25/first-note",
"title": "My First Note",
"content_html": "<p>This is my first note with <strong>bold</strong> text.</p>",
"summary": "Introduction to my notes",
"image": "https://example.com/images/first.jpg",
"date_published": "2024-11-25T10:00:00Z",
"date_modified": "2024-11-25T10:30:00Z",
"tags": ["personal", "introduction"],
"_starpunk": {
"permalink_path": "/notes/2024/11/25/first-note",
"word_count": 8
}
},
{
"id": "https://example.com/notes/2024/11/24/another-note",
"url": "https://example.com/notes/2024/11/24/another-note",
"title": "Another Note",
"content_text": "Plain text content for this note.",
"date_published": "2024-11-24T15:45:00Z",
"tags": ["thoughts"],
"_starpunk": {
"permalink_path": "/notes/2024/11/24/another-note",
"word_count": 6
}
}
]
}
```
## Validation
### JSON Feed Validator
Validate against the official validator:
- https://validator.jsonfeed.org/
### Common Validation Issues
1. **Invalid JSON Syntax**
- Proper escaping of quotes
- Valid UTF-8 encoding
- No trailing commas
2. **Missing Required Fields**
- version, title, items required
- Each item needs id
3. **Invalid Date Format**
- Must be RFC 3339
- Include timezone
4. **Invalid URLs**
- Must be absolute URLs
- Properly encoded
## Testing Strategy
### Unit Tests
```python
class TestJsonFeedGenerator:
def test_required_fields(self):
"""Test all required fields are present"""
generator = JsonFeedGenerator(site_url, site_name, site_description)
feed_json = generator.generate(notes)
feed = json.loads(feed_json)
assert feed['version'] == 'https://jsonfeed.org/version/1.1'
assert 'title' in feed
assert 'items' in feed
def test_feed_order_newest_first(self):
"""Test JSON feed shows newest entries first (spec convention)"""
# Create notes with different timestamps
old_note = Note(
title="Old Note",
created_at=datetime(2024, 11, 20, 10, 0, 0, tzinfo=timezone.utc)
)
new_note = Note(
title="New Note",
created_at=datetime(2024, 11, 25, 10, 0, 0, tzinfo=timezone.utc)
)
# Generate feed with notes in DESC order (as from database)
generator = JsonFeedGenerator(site_url, site_name, site_description)
feed_json = generator.generate([new_note, old_note])
feed = json.loads(feed_json)
# First item should be newest
assert feed['items'][0]['title'] == "New Note"
assert '2024-11-25' in feed['items'][0]['date_published']
# Second item should be oldest
assert feed['items'][1]['title'] == "Old Note"
assert '2024-11-20' in feed['items'][1]['date_published']
def test_json_validity(self):
"""Test output is valid JSON"""
generator = JsonFeedGenerator(site_url, site_name, site_description)
feed_json = generator.generate(notes)
# Should parse without error
feed = json.loads(feed_json)
assert isinstance(feed, dict)
    def test_date_formatting(self):
        """Test RFC 3339 date formatting"""
        generator = JsonFeedGenerator(site_url, site_name, site_description)
        dt = datetime(2024, 11, 25, 12, 0, 0, tzinfo=timezone.utc)
        formatted = generator._format_json_date(dt)
        assert formatted == '2024-11-25T12:00:00Z'
def test_streaming_generation(self):
"""Test streaming produces valid JSON"""
generator = JsonFeedGenerator(site_url, site_name, site_description)
chunks = list(generator.generate_streaming(notes))
feed_json = ''.join(chunks)
# Should be valid JSON
feed = json.loads(feed_json)
assert feed['version'] == 'https://jsonfeed.org/version/1.1'
def test_custom_extensions(self):
"""Test custom _starpunk extension"""
generator = JsonFeedGenerator(site_url, site_name, site_description)
feed_json = generator.generate([sample_note])
feed = json.loads(feed_json)
item = feed['items'][0]
assert '_starpunk' in item
assert 'permalink_path' in item['_starpunk']
assert 'word_count' in item['_starpunk']
```
### Integration Tests
```python
def test_json_feed_endpoint():
"""Test JSON feed endpoint"""
response = client.get('/feed.json')
assert response.status_code == 200
assert response.content_type == 'application/feed+json'
feed = json.loads(response.data)
assert feed['version'] == 'https://jsonfeed.org/version/1.1'
def test_content_negotiation_json():
"""Test content negotiation prefers JSON"""
response = client.get('/feed', headers={'Accept': 'application/json'})
assert response.status_code == 200
assert 'json' in response.content_type.lower()
def test_feed_reader_compatibility():
"""Test with JSON Feed readers"""
readers = [
'Feedbin',
'Inoreader',
'NewsBlur',
'NetNewsWire'
]
for reader in readers:
assert validate_with_reader(feed_url, reader, format='json')
```
### Validation Tests
```python
def test_jsonfeed_validation():
"""Validate against official validator"""
generator = JsonFeedGenerator(site_url, site_name, site_description)
feed_json = generator.generate(sample_notes)
# Submit to validator
result = validate_json_feed(feed_json)
assert result['valid'] == True
assert len(result['errors']) == 0
```
## Performance Benchmarks
### Generation Speed
```python
def benchmark_json_generation():
"""Benchmark JSON feed generation"""
notes = generate_sample_notes(100)
generator = JsonFeedGenerator(site_url, site_name, site_description)
start = time.perf_counter()
feed_json = generator.generate(notes, limit=50)
duration = time.perf_counter() - start
assert duration < 0.05 # Less than 50ms
assert len(feed_json) > 0
```
### Size Comparison
```python
def test_json_vs_xml_size():
"""Compare JSON feed size to RSS/ATOM"""
notes = generate_sample_notes(50)
# Generate all formats
json_feed = json_generator.generate(notes)
rss_feed = rss_generator.generate(notes)
atom_feed = atom_generator.generate(notes)
# JSON should be more compact
print(f"JSON: {len(json_feed)} bytes")
print(f"RSS: {len(rss_feed)} bytes")
print(f"ATOM: {len(atom_feed)} bytes")
# Typically JSON is 20-30% smaller
```
## Configuration
### JSON Feed Settings
```ini
# JSON Feed configuration
STARPUNK_FEED_JSON_ENABLED=true
STARPUNK_FEED_JSON_AUTHOR_NAME=John Doe
STARPUNK_FEED_JSON_AUTHOR_URL=https://example.com/about
STARPUNK_FEED_JSON_AUTHOR_AVATAR=https://example.com/avatar.jpg
STARPUNK_FEED_JSON_ICON=https://example.com/icon.png
STARPUNK_FEED_JSON_FAVICON=https://example.com/favicon.ico
STARPUNK_FEED_JSON_LANGUAGE=en
STARPUNK_FEED_JSON_HUB_URL= # WebSub hub URL (optional)
```
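These values could be surfaced in `config.py` next to the feed options added earlier; a sketch (variable names are assumptions):
```python
# JSON Feed author/branding configuration (sketch)
FEED_JSON_AUTHOR_NAME = os.getenv('STARPUNK_FEED_JSON_AUTHOR_NAME')
FEED_JSON_AUTHOR_URL = os.getenv('STARPUNK_FEED_JSON_AUTHOR_URL')
FEED_JSON_AUTHOR_AVATAR = os.getenv('STARPUNK_FEED_JSON_AUTHOR_AVATAR')
FEED_JSON_ICON = os.getenv('STARPUNK_FEED_JSON_ICON')
FEED_JSON_FAVICON = os.getenv('STARPUNK_FEED_JSON_FAVICON')
FEED_JSON_LANGUAGE = os.getenv('STARPUNK_FEED_JSON_LANGUAGE', 'en')
FEED_JSON_HUB_URL = os.getenv('STARPUNK_FEED_JSON_HUB_URL') or None
```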
## Security Considerations
1. **JSON Injection Prevention**
- Proper JSON escaping
- No raw user input
- Validate all URLs
2. **Content Security**
- HTML content sanitized
- No script injection
- Safe JSON encoding
3. **Size Limits**
- Maximum feed size
- Item count limits
- Timeout protection
## Migration Notes
### Adding JSON Feed
- Runs parallel to RSS/ATOM
- No changes to existing feeds
- Shared caching infrastructure
- Same data source
## Advanced Features
### WebSub Support (Future)
```json
{
"hubs": [
{
"type": "WebSub",
"url": "https://example.com/hub"
}
]
}
```
### Pagination
```json
{
"next_url": "https://example.com/feed.json?page=2"
}
```
### Attachments
```json
{
"attachments": [
{
"url": "https://example.com/podcast.mp3",
"mime_type": "audio/mpeg",
"title": "Podcast Episode",
"size_in_bytes": 25000000,
"duration_in_seconds": 1800
}
]
}
```
## Acceptance Criteria
1. ✅ Valid JSON Feed 1.1 generation
2. ✅ All required fields present
3. ✅ RFC 3339 dates correct
4. ✅ Valid JSON syntax
5. ✅ Streaming generation working
6. ✅ Official validator passing
7. ✅ Works with 5+ JSON Feed readers
8. ✅ Performance target met (<50ms)
9. ✅ Custom extensions working
10. ✅ Security review passed


@@ -0,0 +1,534 @@
# Metrics Instrumentation Specification - v1.1.2
## Overview
This specification completes the metrics instrumentation foundation started in v1.1.1, adding comprehensive coverage for database operations, HTTP requests, memory monitoring, and business-specific syndication metrics.
## Requirements
### Functional Requirements
1. **Database Performance Metrics**
- Time all database operations
- Track query patterns and frequency
- Detect slow queries (>1 second)
- Monitor connection pool utilization
- Count rows affected/returned
2. **HTTP Request/Response Metrics**
- Full request lifecycle timing
- Request and response size tracking
- Status code distribution
- Per-endpoint performance metrics
- Client identification (user agent)
3. **Memory Monitoring**
- Continuous RSS memory tracking
- Memory growth detection
- High water mark tracking
- Garbage collection statistics
- Leak detection algorithms
4. **Business Metrics**
- Feed request counts by format
- Cache hit/miss rates
- Content publication rates
- Syndication success tracking
- Format popularity analysis
### Non-Functional Requirements
1. **Performance Impact**
- Total overhead <1% when enabled
- Zero impact when disabled
- Efficient metric storage (<2MB)
- Non-blocking collection
2. **Data Retention**
- In-memory circular buffer
- Last 1000 metrics retained
- 15-minute detail window
- Automatic cleanup
## Design
### Database Instrumentation
#### Connection Wrapper
```python
import re
import sqlite3
import time
from typing import Optional

class MonitoredConnection:
    """SQLite connection wrapper with performance monitoring"""

    def __init__(self, db_path: str, metrics_collector: MetricsCollector):
        self.conn = sqlite3.connect(db_path)
        self.metrics = metrics_collector
def execute(self, query: str, params: Optional[tuple] = None) -> sqlite3.Cursor:
"""Execute query with timing"""
query_type = self._get_query_type(query)
table_name = self._extract_table_name(query)
start_time = time.perf_counter()
try:
cursor = self.conn.execute(query, params or ())
duration = time.perf_counter() - start_time
            # Record successful execution. Note: sqlite3 reports rowcount as -1
            # for SELECT; do not call fetchall() here, as that would consume the
            # result set before the caller can read it. Rows returned for SELECTs
            # are recorded after the caller fetches them.
            self.metrics.record_database_operation(
                operation_type=query_type,
                table_name=table_name,
                duration_ms=duration * 1000,
                rows_affected=cursor.rowcount
            )
# Check for slow query
if duration > 1.0:
self.metrics.record_slow_query(query, duration, params)
return cursor
except Exception as e:
duration = time.perf_counter() - start_time
self.metrics.record_database_error(query_type, table_name, str(e), duration * 1000)
raise
def _get_query_type(self, query: str) -> str:
"""Extract query type from SQL"""
query_upper = query.strip().upper()
for query_type in ['SELECT', 'INSERT', 'UPDATE', 'DELETE', 'CREATE', 'DROP']:
if query_upper.startswith(query_type):
return query_type
return 'OTHER'
    def _extract_table_name(self, query: str) -> Optional[str]:
        """Extract primary table name from query"""
        # Simple regex patterns for common cases
        patterns = [
            r'FROM\s+(\w+)',
            r'INTO\s+(\w+)',
            r'UPDATE\s+(\w+)',
            r'DELETE\s+FROM\s+(\w+)'
        ]
        for pattern in patterns:
            match = re.search(pattern, query, re.IGNORECASE)
            if match:
                return match.group(1)
        return None
```
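To apply the wrapper transparently, whatever code currently opens connections can hand out a `MonitoredConnection` when metrics are enabled. A sketch, assuming a `get_connection()` helper and the configuration flags defined later in this document:
```python
import sqlite3

def get_connection(db_path: str):
    """Return a monitored connection when DB metrics are enabled, otherwise a plain one."""
    if app.config.get('METRICS_ENABLED') and app.config.get('METRICS_DB_TIMING'):
        return MonitoredConnection(db_path, app.metrics_collector)
    return sqlite3.connect(db_path)
```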
#### Metrics Collected
| Metric | Type | Description |
|--------|------|-------------|
| `db.query.duration` | Histogram | Query execution time in ms |
| `db.query.count` | Counter | Total queries by type |
| `db.query.errors` | Counter | Failed queries by type |
| `db.rows.affected` | Histogram | Rows modified per query |
| `db.rows.returned` | Histogram | Rows returned per SELECT |
| `db.slow_queries` | List | Queries exceeding threshold |
| `db.connection.active` | Gauge | Active connections |
| `db.transaction.duration` | Histogram | Transaction time in ms |
### HTTP Instrumentation
#### Request Middleware
```python
import time
import uuid

from flask import Flask, g, request

class HTTPMetricsMiddleware:
    """Flask middleware for HTTP metrics collection"""
def __init__(self, app: Flask, metrics_collector: MetricsCollector):
self.app = app
self.metrics = metrics_collector
self.setup_hooks()
def setup_hooks(self):
"""Register Flask hooks for metrics"""
@self.app.before_request
def start_request_timer():
"""Initialize request metrics"""
g.request_metrics = {
'start_time': time.perf_counter(),
'start_memory': self._get_memory_usage(),
'request_id': str(uuid.uuid4()),
'method': request.method,
'endpoint': request.endpoint,
'path': request.path,
'content_length': request.content_length or 0
}
@self.app.after_request
def record_response_metrics(response):
"""Record response metrics"""
if not hasattr(g, 'request_metrics'):
return response
# Calculate metrics
duration = time.perf_counter() - g.request_metrics['start_time']
memory_delta = self._get_memory_usage() - g.request_metrics['start_memory']
# Record to collector
self.metrics.record_http_request(
method=g.request_metrics['method'],
endpoint=g.request_metrics['endpoint'],
status_code=response.status_code,
duration_ms=duration * 1000,
request_size=g.request_metrics['content_length'],
response_size=len(response.get_data()),
memory_delta_mb=memory_delta
)
# Add timing header for debugging
if self.app.config.get('DEBUG'):
response.headers['X-Response-Time'] = f"{duration * 1000:.2f}ms"
return response
```
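Wiring the middleware into the application factory might look like the sketch below; the factory shape and the `app.metrics_collector` attribute are assumptions:
```python
def create_app() -> Flask:
    app = Flask(__name__)
    # Configuration loading approach is an assumption
    app.config.from_prefixed_env(prefix='STARPUNK')
    app.metrics_collector = MetricsCollector(
        buffer_size=app.config.get('METRICS_BUFFER_SIZE', 1000)
    )
    if app.config.get('METRICS_HTTP_TIMING', True):
        HTTPMetricsMiddleware(app, app.metrics_collector)
    return app
```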
#### Metrics Collected
| Metric | Type | Description |
|--------|------|-------------|
| `http.request.duration` | Histogram | Total request processing time |
| `http.request.count` | Counter | Requests by method and endpoint |
| `http.request.size` | Histogram | Request body size distribution |
| `http.response.size` | Histogram | Response body size distribution |
| `http.status.{code}` | Counter | Response status code counts |
| `http.endpoint.{name}.duration` | Histogram | Per-endpoint timing |
| `http.memory.delta` | Gauge | Memory change per request |
### Memory Monitoring
#### Background Monitor Thread
```python
import logging
import time
from threading import Thread

logger = logging.getLogger(__name__)

class MemoryMonitor(Thread):
    """Background thread for continuous memory monitoring"""
def __init__(self, metrics_collector: MetricsCollector, interval: int = 10):
super().__init__(daemon=True)
self.metrics = metrics_collector
self.interval = interval
self.running = True
self.baseline_memory = None
self.high_water_mark = 0
def run(self):
"""Main monitoring loop"""
# Establish baseline after startup
time.sleep(5)
self.baseline_memory = self._get_memory_info()
while self.running:
try:
memory_info = self._get_memory_info()
# Update high water mark
self.high_water_mark = max(self.high_water_mark, memory_info['rss'])
                    # Calculate growth rate in MB/hour
                    if self.baseline_memory:
                        elapsed = time.time() - self.baseline_memory['timestamp']
                        growth_rate = (
                            (memory_info['rss'] - self.baseline_memory['rss'])
                            / elapsed * 3600
                        )
                        # Detect potential leak (>10MB/hour growth)
                        if growth_rate > 10:
                            self.metrics.record_memory_leak_warning(growth_rate)
# Record metrics
self.metrics.record_memory_usage(
rss_mb=memory_info['rss'],
vms_mb=memory_info['vms'],
high_water_mb=self.high_water_mark,
gc_stats=self._get_gc_stats()
)
except Exception as e:
logger.error(f"Memory monitoring error: {e}")
time.sleep(self.interval)
def _get_memory_info(self) -> dict:
"""Get current memory usage"""
import resource
usage = resource.getrusage(resource.RUSAGE_SELF)
return {
'timestamp': time.time(),
            'rss': usage.ru_maxrss / 1024,  # ru_maxrss is KB on Linux; convert to MB
            'vms': usage.ru_idrss  # resource has no true VMS figure; ru_idrss is a rough stand-in
}
def _get_gc_stats(self) -> dict:
"""Get garbage collection statistics"""
import gc
return {
'collections': gc.get_count(),
'collected': gc.collect(0),
'uncollectable': len(gc.garbage)
}
```
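The monitor would normally be started once at application startup and gated by configuration. A sketch, assuming the startup hook and config keys shown below:
```python
def start_memory_monitor(app: Flask) -> None:
    """Start the background memory monitor if enabled in configuration."""
    if not app.config.get('METRICS_MEMORY_MONITOR', True):
        return
    monitor = MemoryMonitor(
        app.metrics_collector,
        interval=app.config.get('METRICS_MEMORY_INTERVAL', 10),
    )
    monitor.start()
    app.memory_monitor = monitor  # keep a handle for shutdown handling
```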
#### Metrics Collected
| Metric | Type | Description |
|--------|------|-------------|
| `memory.rss` | Gauge | Resident set size in MB |
| `memory.vms` | Gauge | Virtual memory size in MB |
| `memory.high_water` | Gauge | Maximum RSS observed |
| `memory.growth_rate` | Gauge | MB/hour growth rate |
| `gc.collections` | Counter | GC collection counts by generation |
| `gc.collected` | Counter | Objects collected |
| `gc.uncollectable` | Gauge | Uncollectable object count |
### Business Metrics
#### Syndication Metrics
```python
class SyndicationMetrics:
"""Business metrics specific to content syndication"""
def __init__(self, metrics_collector: MetricsCollector):
self.metrics = metrics_collector
def record_feed_request(self, format: str, cached: bool, generation_time: float):
"""Record feed request metrics"""
self.metrics.increment(f'feed.requests.{format}')
if cached:
self.metrics.increment('feed.cache.hits')
else:
self.metrics.increment('feed.cache.misses')
self.metrics.record_histogram('feed.generation.time', generation_time * 1000)
def record_content_negotiation(self, accept_header: str, selected_format: str):
"""Track content negotiation results"""
self.metrics.increment(f'feed.negotiation.{selected_format}')
# Track client preferences
if 'json' in accept_header.lower():
self.metrics.increment('feed.client.prefers_json')
elif 'atom' in accept_header.lower():
self.metrics.increment('feed.client.prefers_atom')
def record_publication(self, note_length: int, has_media: bool):
"""Track content publication metrics"""
self.metrics.increment('content.notes.published')
self.metrics.record_histogram('content.note.length', note_length)
if has_media:
self.metrics.increment('content.notes.with_media')
```
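Inside a feed route, the helpers might be used as in the sketch below; the cache helper, generator, and note loader are assumptions:
```python
import time

from flask import Response

syndication = SyndicationMetrics(app.metrics_collector)

@app.route('/feed.atom')
def atom_feed():
    start = time.perf_counter()
    body = feed_cache.get('atom')  # hypothetical cache helper
    cached = body is not None
    if not cached:
        body = atom_generator.generate(get_recent_notes())
        feed_cache.set('atom', body)
    syndication.record_feed_request('atom', cached=cached,
                                    generation_time=time.perf_counter() - start)
    return Response(body, mimetype='application/atom+xml')
```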
#### Metrics Collected
| Metric | Type | Description |
|--------|------|-------------|
| `feed.requests.{format}` | Counter | Requests by feed format |
| `feed.cache.hits` | Counter | Cache hit count |
| `feed.cache.misses` | Counter | Cache miss count |
| `feed.cache.hit_rate` | Gauge | Cache hit percentage |
| `feed.generation.time` | Histogram | Feed generation duration |
| `feed.negotiation.{format}` | Counter | Format selection results |
| `content.notes.published` | Counter | Total notes published |
| `content.note.length` | Histogram | Note size distribution |
| `content.syndication.success` | Counter | Successful syndications |
## Implementation Details
### Metrics Collector
```python
import time
from collections import defaultdict, deque

class MetricsCollector:
    """Central metrics collection and storage"""
def __init__(self, buffer_size: int = 1000):
self.buffer = deque(maxlen=buffer_size)
self.counters = defaultdict(int)
self.gauges = {}
self.histograms = defaultdict(list)
self.slow_queries = deque(maxlen=100)
def record_metric(self, category: str, name: str, value: float, metadata: dict = None):
"""Record a generic metric"""
metric = {
'timestamp': time.time(),
'category': category,
'name': name,
'value': value,
'metadata': metadata or {}
}
self.buffer.append(metric)
def increment(self, name: str, amount: int = 1):
"""Increment a counter"""
self.counters[name] += amount
def set_gauge(self, name: str, value: float):
"""Set a gauge value"""
self.gauges[name] = value
def record_histogram(self, name: str, value: float):
"""Add value to histogram"""
self.histograms[name].append(value)
# Keep only last 1000 values
if len(self.histograms[name]) > 1000:
self.histograms[name] = self.histograms[name][-1000:]
def get_summary(self, window_seconds: int = 900) -> dict:
"""Get metrics summary for dashboard"""
cutoff = time.time() - window_seconds
recent = [m for m in self.buffer if m['timestamp'] > cutoff]
summary = {
'counters': dict(self.counters),
'gauges': dict(self.gauges),
'histograms': self._calculate_histogram_stats(),
'recent_metrics': recent[-100:], # Last 100 metrics
'slow_queries': list(self.slow_queries)
}
return summary
def _calculate_histogram_stats(self) -> dict:
"""Calculate statistics for histograms"""
stats = {}
for name, values in self.histograms.items():
if values:
sorted_values = sorted(values)
stats[name] = {
'count': len(values),
'min': min(values),
'max': max(values),
'mean': sum(values) / len(values),
'p50': sorted_values[len(values) // 2],
'p95': sorted_values[int(len(values) * 0.95)],
'p99': sorted_values[int(len(values) * 0.99)]
}
return stats
```
## Configuration
### Environment Variables
```ini
# Metrics collection toggles
STARPUNK_METRICS_ENABLED=true
STARPUNK_METRICS_DB_TIMING=true
STARPUNK_METRICS_HTTP_TIMING=true
STARPUNK_METRICS_MEMORY_MONITOR=true
STARPUNK_METRICS_BUSINESS=true
# Thresholds
STARPUNK_METRICS_SLOW_QUERY_THRESHOLD=1.0 # seconds
STARPUNK_METRICS_MEMORY_LEAK_THRESHOLD=10 # MB/hour
# Storage
STARPUNK_METRICS_BUFFER_SIZE=1000
STARPUNK_METRICS_RETENTION_SECONDS=900 # 15 minutes
# Monitoring intervals
STARPUNK_METRICS_MEMORY_INTERVAL=10 # seconds
```
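These variables could be read into `config.py` following the same pattern as the feed settings earlier in this document; a sketch (names are assumptions):
```python
# Metrics configuration (sketch)
METRICS_ENABLED = str_to_bool(os.getenv('STARPUNK_METRICS_ENABLED', 'true'))
METRICS_DB_TIMING = str_to_bool(os.getenv('STARPUNK_METRICS_DB_TIMING', 'true'))
METRICS_HTTP_TIMING = str_to_bool(os.getenv('STARPUNK_METRICS_HTTP_TIMING', 'true'))
METRICS_MEMORY_MONITOR = str_to_bool(os.getenv('STARPUNK_METRICS_MEMORY_MONITOR', 'true'))
METRICS_BUSINESS = str_to_bool(os.getenv('STARPUNK_METRICS_BUSINESS', 'true'))
METRICS_SLOW_QUERY_THRESHOLD = float(os.getenv('STARPUNK_METRICS_SLOW_QUERY_THRESHOLD', '1.0'))
METRICS_MEMORY_LEAK_THRESHOLD = float(os.getenv('STARPUNK_METRICS_MEMORY_LEAK_THRESHOLD', '10'))
METRICS_BUFFER_SIZE = int(os.getenv('STARPUNK_METRICS_BUFFER_SIZE', 1000))
METRICS_RETENTION_SECONDS = int(os.getenv('STARPUNK_METRICS_RETENTION_SECONDS', 900))
METRICS_MEMORY_INTERVAL = int(os.getenv('STARPUNK_METRICS_MEMORY_INTERVAL', 10))
```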
## Testing Strategy
### Unit Tests
1. **Collector Tests**
```python
def test_metrics_buffer_circular():
collector = MetricsCollector(buffer_size=10)
for i in range(20):
collector.record_metric('test', 'metric', i)
assert len(collector.buffer) == 10
assert collector.buffer[0]['value'] == 10 # Oldest is 10, not 0
```
2. **Instrumentation Tests**
```python
def test_database_timing():
conn = MonitoredConnection(':memory:', collector)
conn.execute('CREATE TABLE test (id INTEGER)')
metrics = collector.get_summary()
assert 'db.query.duration' in metrics['histograms']
assert metrics['counters']['db.query.count'] == 1
```
### Integration Tests
1. **End-to-End Request Tracking**
```python
def test_request_metrics():
response = client.get('/feed.xml')
metrics = app.metrics_collector.get_summary()
assert 'http.request.duration' in metrics['histograms']
assert metrics['counters']['http.status.200'] > 0
```
2. **Memory Leak Detection**
```python
def test_memory_monitoring():
monitor = MemoryMonitor(collector)
monitor.start()
# Simulate memory growth
large_list = [0] * 1000000
time.sleep(15)
metrics = collector.get_summary()
assert metrics['gauges']['memory.rss'] > 0
```
## Performance Benchmarks
### Overhead Measurement
```python
def benchmark_instrumentation_overhead():
# Baseline without instrumentation
config.METRICS_ENABLED = False
start = time.perf_counter()
for _ in range(1000):
execute_operation()
baseline = time.perf_counter() - start
# With instrumentation
config.METRICS_ENABLED = True
start = time.perf_counter()
for _ in range(1000):
execute_operation()
instrumented = time.perf_counter() - start
overhead_percent = ((instrumented - baseline) / baseline) * 100
assert overhead_percent < 1.0 # Less than 1% overhead
```
## Security Considerations
1. **No Sensitive Data**: Never log query parameters that might contain passwords
2. **Rate Limiting**: Metrics endpoints should be rate-limited
3. **Access Control**: Metrics dashboard requires admin authentication
4. **Data Sanitization**: Escape all user-provided data in metrics
## Migration Notes
### From v1.1.1
- Existing performance monitoring configuration remains compatible
- New metrics are additive, no breaking changes
- Dashboard enhanced but backward compatible
## Acceptance Criteria
1. ✅ All database operations are timed
2. ✅ HTTP requests fully instrumented
3. ✅ Memory monitoring thread operational
4. ✅ Business metrics for syndication tracked
5. ✅ Performance overhead <1%
6. ✅ Metrics dashboard shows all new data
7. ✅ Slow query detection working
8. ✅ Memory leak detection functional
9. ✅ All metrics properly documented
10. ✅ Security review passed