Add comprehensive RSS scraper implementation with security and testing

- Modular architecture with separate modules for scraping, parsing, security, validation, and caching
- Security measures: HTML sanitization, rate limiting, and input validation
- Robust error handling with custom exceptions and retry logic
- HTTP caching with ETags and Last-Modified headers for efficiency
- Pre-compiled regex patterns for improved performance
- Test suite with 66 tests covering all major functionality
- Docker support for containerized deployment
- Configuration management with environment variable support
- Working parser that successfully extracts 32 articles from Warhammer Community
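
The HTTP caching item above (ETags and Last-Modified) can be illustrated with a minimal sketch of conditional requests. The names `CacheEntry`, `conditional_headers`, and `apply_response` are hypothetical and are not taken from this commit's `src/` modules. The sketch only assumes the standard behavior of the `ETag`, `Last-Modified`, `If-None-Match`, and `If-Modified-Since` headers and the 304 status code.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class CacheEntry:
    """Body and validators remembered from the last full response (hypothetical type)."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CacheEntry]) -> Dict[str, str]:
    """Build request headers that allow the server to reply 304 Not Modified."""
    headers: Dict[str, str] = {}
    if entry is None:
        return headers  # nothing cached yet, so send an unconditional request
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def apply_response(
    entry: Optional[CacheEntry],
    status: int,
    headers: Mapping[str, str],
    body: bytes,
) -> CacheEntry:
    """Return the cache entry to keep after a response arrives.

    Assumes `headers` uses the canonical header spellings; a plain dict
    lookup is case-sensitive, unlike the header mappings of most HTTP clients.
    """
    if status == 304 and entry is not None:
        return entry  # content unchanged: reuse the cached body
    return CacheEntry(
        body=body,
        etag=headers.get("ETag"),
        last_modified=headers.get("Last-Modified"),
    )
```

In use, the scraper would send `conditional_headers(entry)` with each request and pass the response status, headers, and body to `apply_response`, so an unchanged page costs only a 304 round trip with no body download.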

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-06 09:15:06 -06:00
parent e0647325ff
commit 25086fc01b
26 changed files with 15226 additions and 280 deletions

@@ -59,9 +59,10 @@ RUN useradd -m -u 1001 scraper && \
 chown -R scraper:scraper /app && \
 chmod 755 /app/output
-# Copy the Python script to the container
+# Copy the application code to the container
 COPY main.py .
-RUN chown scraper:scraper main.py
+COPY src/ src/
+RUN chown -R scraper:scraper main.py src/
 # Set environment variables
 ENV PYTHONUNBUFFERED=1 \