Add comprehensive RSS scraper implementation with security and testing

- Modular architecture with separate modules for scraping, parsing, security, validation, and caching - Comprehensive security measures including HTML sanitization, rate limiting, and input validation - Robust error handling with custom exceptions and retry logic - HTTP caching with ETags and Last-Modified headers for efficiency - Pre-compiled regex patterns for improved performance - Comprehensive test suite with 66 tests covering all major functionality - Docker support for containerized deployment - Configuration management with environment variable support - Working parser that successfully extracts 32 articles from Warhammer Community 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-06 09:15:06 -06:00
parent e0647325ff
commit 25086fc01b
26 changed files with 15226 additions and 280 deletions
--- a/README.md
+++ b/README.md
@ -1,31 +1,43 @@
 # Warhammer Community RSS Scraper

-A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
+A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

 ## Overview

-This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.
+This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.

 ## Features

+### Core Functionality
 - Scrapes articles from Warhammer Community website
- Generates RSS feed with proper formatting
+- Generates properly formatted RSS feeds
 - Handles duplicate article detection
 - Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both RSS feed and raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: Runs as non-root user in container
+- Saves both RSS feed and debug HTML
+
+### Production Features
+- **Modular Architecture**: Clean separation of concerns with dedicated modules
+- **Comprehensive Logging**: Structured logging with configurable levels
+- **Configuration Management**: Environment-based configuration
+- **Caching**: Intelligent content caching with ETags and conditional requests
+- **Rate Limiting**: Respectful scraping with configurable delays
+- **Retry Logic**: Exponential backoff for network failures
+- **Type Safety**: Full type hints throughout codebase
+- **Comprehensive Tests**: Unit tests with pytest framework
+
+### Security Features
+- **URL Validation**: Whitelist-based domain validation
+- **Content Sanitization**: HTML sanitization using bleach library
+- **Path Validation**: Prevention of directory traversal attacks
+- **Resource Limits**: Memory and execution time constraints
+- **Input Validation**: Comprehensive argument and data validation
+- **Non-root Execution**: Secure container execution
+- **File Sanitization**: Safe filename handling

 ## Requirements

 - Python 3.12+
- Dependencies listed in `requirements.txt`:
-  - playwright
-  - beautifulsoup4
-  - feedgen
-  - pytz
-  - requests
+- Dependencies listed in `requirements.txt`

 ## Installation

@ -41,16 +53,19 @@ pip install -r requirements.txt
 playwright install
 ```

-3. Run the script:
+3. Run the scraper:
 ```bash
-# Default: saves to current directory
+# Basic usage
 python main.py

-# Or specify output directory
-python main.py /path/to/output
+# With custom options
+python main.py --url https://www.warhammer-community.com/en-gb/ \
+               --output-dir ./output \
+               --log-level DEBUG \
+               --max-scroll 3

-# Or use environment variable
-OUTPUT_DIR=/path/to/output python main.py
+# View all options
+python main.py --help
 ```

 ### Docker Setup
@ -60,58 +75,275 @@ OUTPUT_DIR=/path/to/output python main.py
 docker build -t warhammer-rss .
 ```

-2. Run the container (multiple options to avoid permission issues):
-
-**Option A: Save to current directory (simplest)**
+2. Run the container:
 ```bash
-docker run -v $(pwd):/app/output warhammer-rss
+# Basic usage
+docker run -v $(pwd)/output:/app/output warhammer-rss
+
+# With custom configuration
+docker run -e MAX_SCROLL_ITERATIONS=3 \
+           -e LOG_LEVEL=DEBUG \
+           -v $(pwd)/output:/app/output \
+           warhammer-rss --no-cache
+
+# With resource limits
+docker run --memory=512m --cpu-quota=50000 \
+           -v $(pwd)/output:/app/output \
+           warhammer-rss
 ```

-**Option B: Use environment variable for output directory**
+## Command Line Options
+
 ```bash
-docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
+Usage: main.py [OPTIONS]
+
+Options:
+  --url URL              URL to scrape (default: Warhammer Community)
+  --output-dir PATH      Output directory for files
+  --max-scroll INT       Maximum scroll iterations (default: 5)
+  --log-level LEVEL      Logging level: DEBUG, INFO, WARNING, ERROR
+  --log-file PATH        Log file path (default: scraper.log)
+  --no-cache             Disable content caching
+  --clear-cache          Clear cache before running
+  --cache-info           Show cache information and exit
+  -h, --help             Show help message
 ```

-**Option C: With resource limits for additional security**
-```bash
-docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
-```
-
-## Output
-
-The application generates:
- `warhammer_rss_feed.xml` - RSS feed file
- `page.html` - Raw HTML content for debugging
-
-Both files are saved to the specified output directory (current directory by default).
-
-## Security Features
-
-This application implements several security measures:
-
- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
- **Path Validation**: Prevents directory traversal attacks by validating output paths
- **Resource Limits**: Caps content size (10MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: Docker container runs as user `scraper` (UID 1001) for reduced privilege
- **Input Sanitization**: All URLs and file paths are validated before use
-
-## How It Works
-
-1. **Validates** the target URL against whitelist of allowed domains
-2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
-3. Scrolls through the page to load additional content (limited to 5 iterations)
-4. **Validates content size** and parses the rendered HTML with BeautifulSoup
-5. **Sanitizes** and extracts article titles, links, and publication dates
-6. **Validates all links** against allowed domains
-7. Removes duplicates and sorts by date
-8. Generates RSS feed using feedgen library
-9. **Validates output paths** before saving files
-
 ## Configuration

-The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
+### Environment Variables
+
+The application supports extensive configuration via environment variables:
+
+```bash
+# Scraping Configuration
+MAX_SCROLL_ITERATIONS=5      # Number of scroll iterations
+MAX_CONTENT_SIZE=10485760    # Maximum content size (10MB)
+SCROLL_DELAY_SECONDS=2.0     # Delay between scrolls
+PAGE_TIMEOUT_MS=120000       # Page load timeout
+
+# Security Configuration
+ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
+MAX_TITLE_LENGTH=500         # Maximum title length
+
+# Output Configuration
+DEFAULT_OUTPUT_DIR="."       # Default output directory
+RSS_FILENAME="warhammer_rss_feed.xml"
+DEBUG_HTML_FILENAME="page.html"
+
+# Feed Metadata
+FEED_TITLE="Warhammer Community RSS Feed"
+FEED_DESCRIPTION="Latest Warhammer Community Articles"
+```
+
+### Cache Management
+
+```bash
+# View cache status
+python main.py --cache-info
+
+# Clear cache
+python main.py --clear-cache
+
+# Disable caching for a run
+python main.py --no-cache
+```
+
+## Project Structure
+
+```
+rss_warhammer/
+├── main.py                 # CLI entry point
+├── src/rss_scraper/        # Main package
+│   ├── __init__.py
+│   ├── config.py           # Configuration management
+│   ├── exceptions.py       # Custom exceptions
+│   ├── validation.py       # URL and path validation
+│   ├── scraper.py          # Web scraping with Playwright
+│   ├── parser.py           # HTML parsing and article extraction
+│   ├── rss_generator.py    # RSS feed generation
+│   ├── cache.py            # Content caching system
+│   ├── security.py         # Security utilities
+│   └── retry_utils.py      # Retry logic with backoff
+├── tests/                  # Comprehensive test suite
+├── cache/                  # Cache directory (auto-created)
+├── requirements.txt        # Python dependencies
+├── pytest.ini            # Test configuration
+├── Dockerfile             # Container configuration
+└── README.md              # This file
+```
+
+## Output Files
+
+The application generates:
+- `warhammer_rss_feed.xml` - RSS feed with extracted articles
+- `page.html` - Raw HTML for debugging (optional)
+- `scraper.log` - Application logs
+- `cache/` - Cached content and ETags
+
+## Testing
+
+Run the comprehensive test suite:
+
+```bash
+# Run all tests
+pytest
+
+# Run with coverage
+pytest --cov=src/rss_scraper
+
+# Run specific test categories
+pytest -m unit              # Unit tests only
+pytest tests/test_parser.py  # Specific module
+```
+
+## Error Handling
+
+The application uses specific exit codes for different error types:
+
+- `0` - Success
+- `1` - Configuration/Validation error
+- `2` - Network error
+- `3` - Page loading error
+- `4` - Content parsing error
+- `5` - File operation error
+- `6` - Content size exceeded
+- `99` - Unexpected error
+
+## Security Considerations
+
+### Allowed Domains
+The scraper only operates on whitelisted domains:
 - `warhammer-community.com`
 - `www.warhammer-community.com`

-To modify allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
+### Rate Limiting
+- Default: 30 requests per minute
+- Minimum delay: 2 seconds between requests
+- Configurable via environment variables
+
+### Content Sanitization
+- HTML content sanitized using bleach
+- Dangerous scripts and patterns removed
+- File paths validated against directory traversal
+- URL validation against malicious patterns
+
+## Deployment
+
+### Production Deployment
+
+1. **Environment Setup**:
+```bash
+# Create production environment file
+cat > .env << EOF
+MAX_SCROLL_ITERATIONS=3
+SCROLL_DELAY_SECONDS=3.0
+DEFAULT_OUTPUT_DIR=/app/data
+LOG_LEVEL=INFO
+EOF
+```
+
+2. **Docker Compose** (recommended):
+```yaml
+version: '3.8'
+services:
+  rss-scraper:
+    build: .
+    environment:
+      - MAX_SCROLL_ITERATIONS=3
+      - LOG_LEVEL=INFO
+    volumes:
+      - ./output:/app/output
+      - ./logs:/app/logs
+    restart: unless-stopped
+    memory: 512m
+    cpus: 0.5
+```
+
+3. **Cron Schedule**:
+```bash
+# Add to crontab for regular updates
+0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
+```
+
+## Development
+
+### Setup Development Environment
+
+```bash
+# Install development dependencies
+pip install -r requirements.txt
+pip install pytest pytest-cov black isort
+
+# Install pre-commit hooks (optional)
+pre-commit install
+
+# Run tests
+pytest
+
+# Format code
+black src/ tests/
+isort src/ tests/
+```
+
+### Adding New Features
+
+1. Follow the modular architecture
+2. Add type hints to all functions
+3. Include comprehensive error handling
+4. Write tests for new functionality
+5. Update configuration if needed
+6. Document changes in README
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Permission Errors**:
+   - Ensure output directory is writable
+   - Use proper Docker volume mounting
+
+2. **Memory Issues**:
+   - Reduce `MAX_SCROLL_ITERATIONS`
+   - Increase Docker memory limits
+
+3. **Rate Limiting**:
+   - Increase `SCROLL_DELAY_SECONDS`
+   - Check network connectivity
+
+4. **Cache Issues**:
+   - Clear cache with `--clear-cache`
+   - Check cache directory permissions
+
+### Debug Mode
+
+```bash
+# Enable debug logging
+python main.py --log-level DEBUG
+
+# Disable caching for testing
+python main.py --no-cache --log-level DEBUG
+```
+
+## License
+
+This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.
+
+## Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Add tests for new functionality
+4. Ensure all tests pass
+5. Submit a pull request
+
+## Changelog
+
+### Version 1.0.0
+- Complete rewrite with modular architecture
+- Added comprehensive caching system
+- Implemented rate limiting and security hardening
+- Full test coverage with pytest
+- Production-ready Docker container
+- Extensive configuration management
+- Structured logging and error handling