docs: Add v1.1.1 developer Q&A session
Create developer-qa.md with architect's answers to all 20 implementation questions from the developer's design review. This is the proper format for Q&A between developer and architect during design review, not an ADR (which is for architectural decisions with lasting impact).

Content includes:
- 6 critical questions with answers (config, db pool, logging, etc.)
- 8 important questions (session migration, Unicode, health checks)
- 6 nice-to-have clarifications (testing, monitoring, dashboard)
- Implementation phases (3 weeks)
- Integration guidance

Developer now has clear guidance to proceed with v1.1.1 implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
docs/design/v1.1.1/developer-qa.md
# StarPunk v1.1.1 "Polish" - Developer Q&A

**Date**: 2025-11-25
**Developer**: Developer Agent
**Architect**: Architect Agent

This document records the Q&A session between the developer and architect during the v1.1.1 design review.

## Purpose

The developer reviewed all v1.1.1 design documentation and prepared questions about implementation details, integration points, and edge cases. This document contains the architect's answers to guide implementation.

## Critical Questions (Must be answered before implementation)

### Q1: Configuration System Integration
**Developer Question**: The design calls for centralized configuration. I see we have `config.py` at the root for Flask app config. Should the new `starpunk/config.py` module replace this, wrap it, or co-exist as a separate configuration layer? How do we avoid breaking existing code that directly imports from `config`?

**Architect Answer**: Keep both files with clear separation of concerns. The existing `config.py` remains for Flask app configuration, while the new `starpunk/config.py` becomes a configuration helper module that wraps Flask's `app.config` for runtime access.

**Rationale**: This maintains backward compatibility, separates Flask-specific config from application logic, and allows gradual migration without breaking changes.

**Implementation Guidance**:
- Create `starpunk/config.py` as a helper that uses `current_app.config`
- Provide methods like `get_database_path()`, `get_upload_folder()`, etc.
- Gradually replace direct config access with helper methods
- Document both in the configuration guide
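
As a reference point, a minimal sketch of what the helper module could look like. The accessor names come from the guidance above; the `DATABASE_PATH` and `UPLOAD_FOLDER` config keys are illustrative assumptions, not confirmed key names.

```python
# starpunk/config.py -- illustrative sketch only
from pathlib import Path

from flask import current_app


def get_database_path() -> Path:
    """Read the SQLite database path from Flask's app.config (key name assumed)."""
    return Path(current_app.config["DATABASE_PATH"])


def get_upload_folder() -> Path:
    """Read the upload folder from Flask's app.config (key name assumed)."""
    return Path(current_app.config["UPLOAD_FOLDER"])
```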

---

### Q2: Database Connection Pool Scope
**Developer Question**: The connection pool will replace the current `get_db()` context manager used throughout routes. Should it also replace direct `sqlite3.connect()` calls in migrations and utilities? How do we ensure proper connection lifecycle in Flask's request context?

**Architect Answer**: Connection pool replaces `get_db()` but NOT migrations. The pool replaces all runtime `sqlite3.connect()` calls but migrations must use direct connections for isolation. Integrate the pool with Flask's `g` object for request-scoped connections.

**Rationale**: Migrations need isolated transactions without pool interference. The pool improves runtime performance while request-scoped connections via `g` maintain Flask patterns.

**Implementation Guidance**:
- Implement pool in `starpunk/database/pool.py`
- Use `g.db` for request-scoped connections
- Replace `get_db()` in all route files
- Keep direct connections for migrations only
- Add pool statistics to metrics
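
A minimal sketch of the shape this could take, assuming a fixed-size pool and illustrative helper names; pool size, pragmas, and thread-safety settings would need real review. In the app factory it would be wired with something like `app.teardown_appcontext(lambda exc: close_db(pool, exc))`, while migrations keep calling `sqlite3.connect()` directly, per the answer above.

```python
# starpunk/database/pool.py -- illustrative sketch only
import sqlite3
from queue import Queue

from flask import g


class ConnectionPool:
    """Small fixed-size SQLite connection pool (size is an assumed default)."""

    def __init__(self, database: str, size: int = 5):
        self._pool = Queue(maxsize=size)
        for _ in range(size):
            conn = sqlite3.connect(database, check_same_thread=False)
            conn.row_factory = sqlite3.Row
            self._pool.put(conn)

    def acquire(self) -> sqlite3.Connection:
        return self._pool.get()

    def release(self, conn: sqlite3.Connection) -> None:
        self._pool.put(conn)


def get_db(pool: ConnectionPool) -> sqlite3.Connection:
    """Request-scoped connection stored on Flask's g, as in the guidance above."""
    if "db" not in g:
        g.db = pool.acquire()
    return g.db


def close_db(pool: ConnectionPool, exc=None) -> None:
    """Teardown handler: return the request's connection to the pool."""
    conn = g.pop("db", None)
    if conn is not None:
        pool.release(conn)
```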

---

### Q3: Logging vs. Print Statements Migration
**Developer Question**: Current code has many print statements for debugging. Should we phase these out gradually or remove all at once? Should we use Python's logging module directly or Flask's `app.logger`? For CLI commands, should they use logging or `click.echo()`?

**Architect Answer**: Remove ALL print statements in this release; no gradual phase-out. Use Flask's `app.logger` as the base, enhanced with structured logging. CLI commands use `click.echo()` for user output and the logger for diagnostics.

**Rationale**: A clean break prevents confusion. Flask's logger integrates with the framework, and `click.echo()` is the proper CLI output method.

**Implementation Guidance**:
- Set up `RotatingFileHandler` in the app factory
- Configure structured logging with correlation IDs
- Replace all `print()` calls with appropriate logging calls
- Use `click.echo()` for CLI user feedback
- Use logger for CLI diagnostic output
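
A sketch of the app-factory wiring under these assumptions: the log path and format string are illustrative, the 10 MB / 10 file limits are taken from Q17, and the correlation ID comes from the request context described in Q11.

```python
# Logging setup in the app factory -- illustrative sketch only
import logging
from logging.handlers import RotatingFileHandler

from flask import Flask, g, has_request_context


class CorrelationIdFilter(logging.Filter):
    """Attach the per-request correlation ID (see Q11) to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = (
            g.get("correlation_id", "-") if has_request_context() else "cli"
        )
        return True


def configure_logging(app: Flask) -> None:
    handler = RotatingFileHandler(
        "starpunk.log", maxBytes=10 * 1024 * 1024, backupCount=10
    )
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s [%(correlation_id)s] %(name)s: %(message)s"
    ))
    handler.addFilter(CorrelationIdFilter())
    app.logger.addHandler(handler)
    app.logger.setLevel(logging.INFO)
```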

---

### Q4: Error Handling Middleware Integration
**Developer Question**: For consistent error handling, should we use Flask's `@app.errorhandler` decorator or implement custom middleware? How do we ensure Micropub endpoints return spec-compliant error responses while other endpoints return HTML error pages?

**Architect Answer**: Use Flask's `@app.errorhandler` for all error handling. Register error handlers in the app factory. Micropub endpoints get specialized error handlers for spec compliance. No decorators on individual routes.

**Rationale**: Flask's error handler is the idiomatic approach. Centralized error handling reduces code duplication, and Micropub spec requires specific error formats.

**Implementation Guidance**:
- Create `starpunk/errors.py` with `register_error_handlers(app)`
- Check request path to determine response format
- Return JSON for `/micropub` endpoints
- Return HTML templates for other endpoints
- Log all errors with correlation IDs
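
One possible shape for the module, assuming an `error.html` template and a simple 404/500 pair; the path check mirrors the guidance above, and the JSON error codes shown are placeholders to be aligned with the exact spec codes (see Q20).

```python
# starpunk/errors.py -- illustrative sketch only
from flask import Flask, current_app, g, jsonify, render_template, request


def register_error_handlers(app: Flask) -> None:
    @app.errorhandler(404)
    def not_found(error):
        return _error_response(404, "invalid_request", "Resource not found")

    @app.errorhandler(500)
    def server_error(error):
        current_app.logger.error(
            "Unhandled error [%s]: %s", g.get("correlation_id", "-"), error
        )
        return _error_response(500, "server_error", "Internal server error")


def _error_response(status: int, code: str, description: str):
    # Micropub endpoints get spec-style JSON; everything else gets an HTML page.
    if request.path.startswith("/micropub"):
        return jsonify({"error": code, "error_description": description}), status
    return render_template("error.html", status=status, message=description), status
```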

---

### Q5: FTS5 Fallback Search Implementation
**Developer Question**: If FTS5 isn't available, should fallback search be in the same module or separate? Should it have the same function signature? How do we detect FTS5 support - at startup or runtime?

**Architect Answer**: Same module, detection at startup, implementation selected via a function reference. Keep both implementations in the `search.py` module with the same function signature. Determine FTS5 support once at startup and cache the result for performance.

**Rationale**: A single module maintains cohesion. Same signature allows transparent switching. Startup detection avoids runtime overhead.

**Implementation Guidance**:
- Detect FTS5 support at startup using a test table
- Cache the result in a module-level variable
- Use function pointer to select implementation
- Both implementations use identical signatures
- Log which implementation is active
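
A sketch of the detection-and-dispatch pattern; the `notes`/`notes_fts` table names and column layout are assumptions for illustration.

```python
# starpunk/search.py -- illustrative sketch of startup detection + dispatch
import logging
import sqlite3

logger = logging.getLogger(__name__)


def _detect_fts5() -> bool:
    """Probe once at startup whether this SQLite build supports FTS5."""
    try:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE VIRTUAL TABLE fts_probe USING fts5(content)")
        conn.close()
        return True
    except sqlite3.OperationalError:
        return False


FTS5_AVAILABLE = _detect_fts5()  # cached module-level flag


def _search_fts5(db, query: str, limit: int = 20):
    return db.execute(
        "SELECT rowid, content FROM notes_fts WHERE notes_fts MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()


def _search_like(db, query: str, limit: int = 20):
    return db.execute(
        "SELECT id, content FROM notes WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    ).fetchall()


# Identical signatures; callers only ever import search_notes.
search_notes = _search_fts5 if FTS5_AVAILABLE else _search_like
logger.info("Search backend: %s", "FTS5" if FTS5_AVAILABLE else "LIKE fallback")
```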

---

### Q6: Performance Monitoring Circular Buffer
**Developer Question**: For the circular buffer storing performance metrics - in a multi-process deployment (like gunicorn), should each process have its own buffer or should we use shared memory? How do we aggregate metrics across processes?

**Architect Answer**: Per-process buffer with aggregation endpoint. Each process maintains its own circular buffer. `/admin/metrics` aggregates across all workers. Use `multiprocessing.Manager` for shared state if needed.

**Rationale**: Per-process avoids locking overhead. Aggregation provides a complete picture. This is a standard pattern for multi-process Flask apps.

**Implementation Guidance**:
- Create `MetricsBuffer` class with deque
- Include process ID in all metrics
- Aggregate in `/admin/metrics` endpoint
- Consider shared memory for future enhancement
- Default to 1000 entries per buffer
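
A sketch of the buffer itself; the field names and the lock are illustrative choices, and the aggregation endpoint is assumed to merge `snapshot()` output from each worker.

```python
# starpunk/metrics.py -- illustrative sketch of the per-process buffer
import os
import time
from collections import deque
from threading import Lock


class MetricsBuffer:
    """Fixed-size, per-process ring buffer of recent timing samples."""

    def __init__(self, maxlen: int = 1000):
        self._samples = deque(maxlen=maxlen)
        self._lock = Lock()

    def record(self, operation: str, duration_ms: float) -> None:
        with self._lock:
            self._samples.append({
                "pid": os.getpid(),  # lets the aggregation endpoint group by worker
                "operation": operation,
                "duration_ms": duration_ms,
                "timestamp": time.time(),
            })

    def snapshot(self) -> list:
        with self._lock:
            return list(self._samples)


# One instance per worker process; /admin/metrics merges the snapshots.
metrics_buffer = MetricsBuffer()
```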

---

## Important Questions

### Q7: Session Table Migration
**Developer Question**: The session management enhancement requires a new database table. Should this be added to an existing migration file, or should we create a new one? What happens to existing sessions during upgrade?

**Architect Answer**: New migration file `008_add_session_table.sql`. This is a separate migration that maintains clarity. Drop existing sessions (document in upgrade guide). Use RETURNING clause with version check where supported.

**Rationale**: Clean migration history is important. Sessions are ephemeral and safe to drop. RETURNING improves performance where available.

**Implementation Guidance**:
- Create new migration file
- Drop table if exists before creation
- Add proper indexes for `user_id` and `expires_at`
- Document session reset in upgrade guide
- Test migration rollback procedure

---

### Q8: Unicode Slug Generation
**Developer Question**: When slug generation from title fails (e.g., all emoji title), what should the fallback be? Should we return an error to the Micropub client or generate a default slug? What pattern for auto-generated slugs?

**Architect Answer**: Timestamp-based fallback with warning. Use `YYYYMMDD-HHMMSS` pattern when normalization fails. Log warning with original text for debugging. Return 201 Created to Micropub client (not an error).

**Rationale**: Timestamp ensures uniqueness. Warning helps identify encoding issues. Micropub spec doesn't define this as an error condition.

**Implementation Guidance**:
- Try Unicode normalization first
- Fall back to timestamp if result is empty
- Log warnings for debugging
- Include original text in logs
- Never fail the Micropub request
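
A sketch of the fallback logic; the function name and the exact character filter are assumptions, but the timestamp pattern is the one specified above.

```python
# Slug generation with timestamp fallback -- illustrative sketch only
import logging
import re
import unicodedata
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


def generate_slug(title: str) -> str:
    # Normalize to ASCII, lowercase, and collapse everything else into hyphens.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    if slug:
        return slug
    # All-emoji or otherwise empty result: fall back to a timestamp slug.
    fallback = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    logger.warning("Slug normalization failed for %r; using %s", title, fallback)
    return fallback
```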

---

### Q9: RSS Memory Optimization
**Developer Question**: The current RSS generator builds the entire feed in memory. For optimization, should we stream the XML directly to the response or use a generator? How do we handle large feeds (1000+ items)?

**Architect Answer**: Use generator with `yield` for streaming. Implement as generator function. Use Flask's `Response(generate(), mimetype='application/rss+xml')`. Stream directly to client.

**Rationale**: Generators minimize memory footprint. Flask handles streaming automatically. This scales to any feed size.

**Implementation Guidance**:
- Convert RSS generation to generator function
- Yield XML chunks, not individual characters
- Query notes in batches if needed
- Set appropriate response headers
- Test with large feed counts
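
A sketch of the streaming shape; the note fields (`title`, `url`) and channel metadata are assumptions, and the real feed would include the remaining RSS elements.

```python
# Streaming RSS generation -- illustrative sketch only
from xml.sax.saxutils import escape

from flask import Response


def generate_rss(notes):
    yield '<?xml version="1.0" encoding="UTF-8"?>\n'
    yield '<rss version="2.0"><channel><title>StarPunk</title>\n'
    for note in notes:  # notes may itself be a lazy, batched query iterator
        yield (
            "<item>"
            f"<title>{escape(note['title'])}</title>"
            f"<link>{escape(note['url'])}</link>"
            "</item>\n"
        )
    yield "</channel></rss>\n"


def feed_response(notes) -> Response:
    """Flask streams the generator chunk by chunk instead of buffering the feed."""
    return Response(generate_rss(notes), mimetype="application/rss+xml")
```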

---

### Q10: Health Check Authentication
**Developer Question**: Should health check endpoints require authentication? Load balancers need to access them, but detailed health info might be sensitive. How do we balance security with operational needs?

**Architect Answer**: Basic check public, detailed check requires auth. `/health` returns 200 OK (no auth, for load balancers). `/health?detailed=true` requires authentication. Separate `/admin/health` for full diagnostics (always auth).

**Rationale**: Load balancers need unauthenticated access. Detailed info could leak sensitive data. This follows industry standard patterns.

**Implementation Guidance**:
- Basic health: just return 200 if app responds
- Detailed health: check database, disk space, etc.
- Admin health: full diagnostics with metrics
- Use query parameter to trigger detailed mode
- Document endpoints in operations guide
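
A sketch of the split between the public liveness check and the authenticated detailed check; the `DATABASE_PATH` key and the `authenticated` session flag are assumptions standing in for the real config keys and auth helpers.

```python
# Health check endpoints -- illustrative sketch only
import shutil
import sqlite3

from flask import Blueprint, current_app, jsonify, request, session

health_bp = Blueprint("health", __name__)


def _database_ok() -> bool:
    try:
        conn = sqlite3.connect(current_app.config["DATABASE_PATH"])  # key assumed
        conn.execute("SELECT 1")
        conn.close()
        return True
    except sqlite3.Error:
        return False


@health_bp.route("/health")
def health():
    if request.args.get("detailed") != "true":
        # Unauthenticated liveness check for load balancers.
        return jsonify({"status": "ok"}), 200
    if not session.get("authenticated"):  # stand-in for the real auth check
        return jsonify({"error": "unauthorized"}), 401
    return jsonify({
        "status": "ok",
        "database": _database_ok(),
        "disk_free_mb": shutil.disk_usage("/").free // (1024 * 1024),
    }), 200
```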

---

### Q11: Request Correlation ID Scope
**Developer Question**: Should the correlation ID be per-request or per-session? If a request triggers background tasks, should they inherit the correlation ID? What about CLI commands?

**Architect Answer**: New ID for each HTTP request, inherit in background tasks. Each HTTP request gets a unique ID. Background tasks spawned from requests inherit the parent ID. CLI commands generate their own root ID.

**Rationale**: This maintains request tracing through async operations. CLI commands are independent operations. It's a standard distributed tracing pattern.

**Implementation Guidance**:
- Generate UUID for each request
- Store in Flask's `g` object
- Pass to background tasks as parameter
- Include in all log messages
- Add to response headers
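
A sketch of the request wiring; the header name is an assumption, and background tasks would receive `g.correlation_id` explicitly as an argument.

```python
# Correlation ID wiring in the app factory -- illustrative sketch only
import uuid

from flask import Flask, g


def init_correlation_ids(app: Flask) -> None:
    @app.before_request
    def assign_correlation_id():
        g.correlation_id = str(uuid.uuid4())

    @app.after_request
    def expose_correlation_id(response):
        # Header name is an assumption; pick one and document it.
        response.headers["X-Correlation-ID"] = g.get("correlation_id", "-")
        return response
```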

---

### Q12: Performance Monitoring Sampling
**Developer Question**: To reduce overhead, should we sample performance metrics (e.g., only track 10% of requests)? Should sampling be configurable? Apply to all metrics or just specific types?

**Architect Answer**: Configuration-based sampling with per-operation-type rates. Default to a 10% sampling rate, with different rates configurable per operation type. Sampling is applied at the collection point; the slow query log is never sampled.

**Rationale**: Reduces overhead in production. Operation-specific rates allow focused monitoring. Slow query log should capture everything for debugging.

**Implementation Guidance**:
- Define sampling rates in config
- Different rates for database/http/render
- Use random sampling at collection point
- Always log slow queries regardless
- Make rates runtime configurable
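
A sketch of the sampling decision at the collection point; the `METRICS_SAMPLE_RATES` key, the slow-query threshold, and the buffer parameter are assumptions.

```python
# Sampling at the metrics collection point -- illustrative sketch only
import random

from flask import current_app

DEFAULT_SAMPLE_RATES = {"database": 0.1, "http": 0.1, "render": 0.1}


def should_sample(operation_type: str) -> bool:
    rates = current_app.config.get("METRICS_SAMPLE_RATES", DEFAULT_SAMPLE_RATES)
    return random.random() < rates.get(operation_type, 0.1)


def record_timing(buffer, operation_type: str, duration_ms: float,
                  slow_threshold_ms: float = 100.0) -> None:
    # Slow operations are always logged, regardless of sampling.
    if duration_ms >= slow_threshold_ms:
        current_app.logger.warning(
            "Slow %s operation: %.1f ms", operation_type, duration_ms
        )
    if should_sample(operation_type):
        buffer.record(operation_type, duration_ms)  # e.g. the MetricsBuffer from Q6
```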

---

### Q13: Search Highlighting XSS Prevention
**Developer Question**: When highlighting search terms in results, how do we prevent XSS if the search term contains HTML? Should we use a library like bleach or implement our own escaping?

**Architect Answer**: Use Flask's standard `markupsafe.escape()` with a whitelist: allow only `<mark>` tags for highlighting and validate the class attribute against a whitelist.

**Rationale**: markupsafe is Flask's security standard. Whitelist approach is most secure. Prevents class-based XSS attacks.

**Implementation Guidance**:
- Escape all text first
- Then add safe mark tags
- Use `Markup()` for safe strings
- Limit to single highlight class
- Test with malicious input
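
A sketch of escape-then-mark highlighting; the `search-hit` class name is an assumption, and the single allowed class satisfies the whitelist rule above.

```python
# Safe search-term highlighting -- illustrative sketch only
import re

from markupsafe import Markup, escape


def highlight(text: str, term: str) -> Markup:
    """Escape every text segment, then wrap matches of the term in <mark> tags."""
    if not term:
        return escape(text)
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        parts.append(
            Markup('<mark class="search-hit">')
            + escape(match.group(0))
            + Markup("</mark>")
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return Markup("").join(parts)
```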

---

### Q14: Configuration Validation Timing
**Developer Question**: When should configuration validation run - at startup, on first use, or both? Should invalid config crash the app or fall back to defaults? Should we validate before or after migrations?

**Architect Answer**: Validate at startup, fail fast with clear errors. Validate immediately after loading config. Invalid config crashes app with descriptive error. Validate both presence and type. Run BEFORE migrations.

**Rationale**: Fail fast prevents subtle runtime errors. Clear errors help operators fix issues. Type validation catches common mistakes.

**Implementation Guidance**:
- Create validation schema
- Check required fields exist
- Validate types and ranges
- Provide clear error messages
- Exit with non-zero status on failure
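
A sketch of a schema-driven check that runs before migrations; the keys, types, and the exit mechanism are assumptions to be replaced with the real configuration surface.

```python
# Startup configuration validation -- illustrative sketch only
CONFIG_SCHEMA = [
    # (key, expected type, required) -- entries here are placeholders
    ("DATABASE_PATH", str, True),
    ("SECRET_KEY", str, True),
    ("METRICS_BUFFER_SIZE", int, False),
]


def validate_config(config: dict) -> None:
    errors = []
    for key, expected_type, required in CONFIG_SCHEMA:
        if key not in config:
            if required:
                errors.append(f"{key} is required but missing")
            continue
        if not isinstance(config[key], expected_type):
            errors.append(
                f"{key} must be {expected_type.__name__}, "
                f"got {type(config[key]).__name__}"
            )
    if errors:
        # Fail fast, before migrations run, with a clear message and non-zero exit.
        raise SystemExit("Invalid configuration:\n  " + "\n  ".join(errors))
```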

---

## Nice-to-Have Clarifications

### Q15: Test Race Condition Fix Priority
**Developer Question**: Some tests have intermittent failures due to race conditions. Should fixing these block v1.1.1 release, or can we defer to v1.1.2?

**Architect Answer**: Fix in Phase 2, after core features. Not blocking for v1.1.1 release. Fix after performance monitoring is in place. Add to technical debt backlog.

**Rationale**: Race conditions are intermittent, not blocking. Focus on user-visible improvements first. Can be addressed in v1.1.2.

---

### Q16: Memory Monitoring Thread
**Developer Question**: The memory monitoring thread needs to record metrics periodically. How should it handle database unavailability? Should it stop gracefully on shutdown?

**Architect Answer**: Use `threading.Event` for graceful shutdown. Log a warning if the database is unavailable rather than crashing, and reconnect automatically when the database recovers.

**Rationale**: Graceful shutdown prevents data corruption. Monitoring shouldn't crash the app. Self-healing improves reliability.

**Implementation Guidance**:
- Use daemon thread with Event
- Check stop event in loop
- Handle database errors gracefully
- Retry with exponential backoff
- Log issues but don't propagate
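
A sketch of the thread lifecycle; the sampling interval, backoff cap, and the injected `record_sample` callable are assumptions.

```python
# Memory monitoring thread with graceful shutdown -- illustrative sketch only
import logging
import threading

logger = logging.getLogger(__name__)


class MemoryMonitor:
    def __init__(self, record_sample, interval_seconds: float = 60.0):
        self._record_sample = record_sample  # callable that persists one sample
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        """Signal the loop to exit and wait briefly for it to finish."""
        self._stop.set()
        self._thread.join(timeout=5)

    def _run(self) -> None:
        delay = self._interval
        while not self._stop.wait(delay):
            try:
                self._record_sample()
                delay = self._interval  # reset backoff after a success
            except Exception:  # e.g. database temporarily unavailable
                logger.warning("Memory sample failed; will retry", exc_info=True)
                delay = min(delay * 2, 10 * self._interval)
```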

---

### Q17: Log Rotation Strategy
**Developer Question**: For log rotation, should we use Python's `RotatingFileHandler`, Linux logrotate, or a custom solution? What size/count limits are appropriate?

**Architect Answer**: Use `RotatingFileHandler` with 10MB files. Python's built-in `RotatingFileHandler`. 10MB per file, keep 10 files. No compression for simplicity.

**Rationale**: Built-in solution requires no dependencies. 100MB total is reasonable for small deployment. Compression adds complexity for minimal benefit.

---

### Q18: Error Budget Tracking
**Developer Question**: How should we track error budgets - as a percentage, count, or rate? Over what time window? Should exceeding budget trigger any automatic actions?

**Architect Answer**: Simple counter-based tracking. Track in metrics buffer. Display in dashboard as percentage. No auto-alerting in v1.1.1 (future enhancement).

**Rationale**: Simple to implement and understand. Provides visibility without complexity. Alerting can be added later.

**Implementation Guidance**:
- Track last 1000 requests
- Calculate success rate
- Display remaining budget
- Log when budget low
- Manual monitoring for now
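
A sketch of the window-based calculation; the 99% target is an illustrative assumption, not a decided SLO.

```python
# Error-budget tracking over the last N requests -- illustrative sketch only
from collections import deque


class ErrorBudget:
    def __init__(self, window: int = 1000, target_success_rate: float = 0.99):
        self._outcomes = deque(maxlen=window)  # True = success, False = error
        self.target = target_success_rate

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def success_rate(self) -> float:
        if not self._outcomes:
            return 1.0
        return sum(self._outcomes) / len(self._outcomes)

    def remaining_budget(self) -> float:
        """Fraction of the allowed error budget still unused (0.0 when exhausted)."""
        allowed_errors = 1.0 - self.target
        observed_errors = 1.0 - self.success_rate()
        if allowed_errors == 0:
            return 0.0
        return max(0.0, 1.0 - observed_errors / allowed_errors)
```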

---

### Q19: Dashboard UI Framework
**Developer Question**: For the admin dashboard, should we use a JavaScript framework (React/Vue), server-side rendering, or a hybrid approach? Any CSS framework preferences?

**Architect Answer**: Server-side rendering with htmx for updates. No JavaScript framework for simplicity. Use htmx for real-time updates. Chart.js for graphs via CDN. Existing CSS, no new framework.

**Rationale**: Maintains "works without JavaScript" principle. htmx provides reactivity without complexity. Chart.js is simple and sufficient.

**Implementation Guidance**:
- Use Jinja2 templates
- Add htmx for auto-refresh
- Include Chart.js from CDN
- Keep existing CSS styles
- Progressive enhancement approach

---

### Q20: Micropub Error Response Format
**Developer Question**: The Micropub spec defines error responses, but should we add additional debugging info in development mode? How much detail in the `error_description` field?

**Architect Answer**: Maintain strict Micropub spec compliance. Use spec-defined error format exactly. Add `error_description` for clarity. Log additional details server-side only.

**Rationale**: Spec compliance is non-negotiable. `error_description` is allowed by spec. Server logs provide debugging info.

**Implementation Guidance**:
- Use exact error codes from spec
- Include helpful `error_description`
- Never expose internal details
- Log full context server-side
- Keep development/production responses identical
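
A sketch of a helper that keeps the response body spec-shaped while pushing detail into the server log; the helper name and log fields are assumptions.

```python
# Micropub error response helper -- illustrative sketch only
from flask import current_app, g, jsonify


def micropub_error(status: int, error: str, description: str, detail: str = ""):
    """Return only the spec-defined fields; keep internal detail server-side."""
    current_app.logger.warning(
        "Micropub error [%s] %s: %s %s",
        g.get("correlation_id", "-"), error, description, detail,
    )
    return jsonify({"error": error, "error_description": description}), status


# Example: return micropub_error(400, "invalid_request", "Missing 'h' parameter")
```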

---

## Implementation Priorities

The architect recommends implementing v1.1.1 in three phases:

### Phase 1: Core Infrastructure (Week 1)
Focus on foundational improvements that other features depend on:
1. Logging system replacement - Remove all print statements
2. Configuration validation - Fail fast on invalid config
3. Database connection pool - Improve performance
4. Error handling middleware - Consistent error responses

### Phase 2: Enhancements (Week 2)
Add the user-facing improvements:
5. Session management - Secure session handling
6. Performance monitoring - Track system health
7. Health checks - Enable monitoring
8. Search improvements - Better search experience

### Phase 3: Polish (Week 3)
Complete the release with final touches:
9. Admin dashboard - Visualize metrics
10. Memory optimization - RSS streaming
11. Documentation - Update all guides
12. Testing improvements - Fix flaky tests

## Additional Architectural Guidance

### Configuration Integration Strategy
The developer should implement configuration in layers:
1. Keep existing `config.py` for Flask settings
2. Add `starpunk/config.py` as helper module
3. Migrate gradually by replacing direct config access
4. Document both systems in configuration guide

### Connection Pool Implementation Notes
The pool should be transparent to calling code:
1. Same interface as `get_db()`
2. Automatic cleanup on request end
3. Connection recycling for performance
4. Statistics collection for monitoring

### Validation Specifications
Create centralized validation schemas for:
- Configuration values (types, ranges, requirements)
- Micropub requests (required fields, formats)
- Input data (lengths, patterns, encoding)

### Migration Ordering
The developer must run migrations in this specific order:
1. `008_add_session_table.sql`
2. `009_add_performance_indexes.sql`
3. `010_add_metrics_table.sql`

### Testing Gaps to Address
While not blocking v1.1.1, these should be noted for v1.1.2:
1. Connection pool stress tests
2. Unicode edge cases
3. Memory leak detection
4. Error recovery scenarios

### Required Documentation
Before release, create these operational guides:
1. `/docs/operations/upgrade-to-v1.1.1.md` - Step-by-step upgrade process
2. `/docs/operations/troubleshooting.md` - Common issues and solutions
3. `/docs/operations/performance-tuning.md` - Optimization guidelines

## Final Architectural Notes

These answers prioritize:
- **Simplicity** over features - Every addition must justify its complexity
- **Compatibility** over clean breaks - Don't break existing deployments
- **Gradual migration** over big bang - Incremental improvements reduce risk
- **Flask patterns** over custom solutions - Use idiomatic Flask approaches

The developer should implement in the phase order specified, testing thoroughly between phases. Any blockers or uncertainties should be escalated immediately for architectural review.

Remember: v1.1.1 is about polish, not new features. Focus on making existing functionality more robust, observable, and maintainable.