Add comprehensive RSS scraper implementation with security and testing

- Modular architecture with separate modules for scraping, parsing, security, validation, and caching
- Comprehensive security measures including HTML sanitization, rate limiting, and input validation
- Robust error handling with custom exceptions and retry logic
- HTTP caching with ETags and Last-Modified headers for efficiency
- Pre-compiled regex patterns for improved performance
- Comprehensive test suite with 66 tests covering all major functionality
- Docker support for containerized deployment
- Configuration management with environment variable support
- Working parser that successfully extracts 32 articles from Warhammer Community

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
2025-06-06 09:15:06 -06:00
parent e0647325ff
commit 25086fc01b
26 changed files with 15226 additions and 280 deletions

358
README.md
View File

@ -1,31 +1,43 @@
# Warhammer Community RSS Scraper
A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
## Overview
This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.
This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.
## Features
### Core Functionality
- Scrapes articles from Warhammer Community website
- Generates RSS feed with proper formatting
- Generates properly formatted RSS feeds
- Handles duplicate article detection
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both RSS feed and raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: Runs as non-root user in container
- Saves both RSS feed and debug HTML
### Production Features
- **Modular Architecture**: Clean separation of concerns with dedicated modules
- **Comprehensive Logging**: Structured logging with configurable levels
- **Configuration Management**: Environment-based configuration
- **Caching**: Intelligent content caching with ETags and conditional requests
- **Rate Limiting**: Respectful scraping with configurable delays
- **Retry Logic**: Exponential backoff for network failures
- **Type Safety**: Full type hints throughout codebase
- **Comprehensive Tests**: Unit tests with pytest framework
### Security Features
- **URL Validation**: Whitelist-based domain validation
- **Content Sanitization**: HTML sanitization using bleach library
- **Path Validation**: Prevention of directory traversal attacks
- **Resource Limits**: Memory and execution time constraints
- **Input Validation**: Comprehensive argument and data validation
- **Non-root Execution**: Secure container execution
- **File Sanitization**: Safe filename handling
## Requirements
- Python 3.12+
- Dependencies listed in `requirements.txt`:
- playwright
- beautifulsoup4
- feedgen
- pytz
- requests
- Dependencies listed in `requirements.txt`
## Installation
@ -41,16 +53,19 @@ pip install -r requirements.txt
playwright install
```
3. Run the script:
3. Run the scraper:
```bash
# Default: saves to current directory
# Basic usage
python main.py
# Or specify output directory
python main.py /path/to/output
# With custom options
python main.py --url https://www.warhammer-community.com/en-gb/ \
--output-dir ./output \
--log-level DEBUG \
--max-scroll 3
# Or use environment variable
OUTPUT_DIR=/path/to/output python main.py
# View all options
python main.py --help
```
### Docker Setup
@ -60,58 +75,275 @@ OUTPUT_DIR=/path/to/output python main.py
docker build -t warhammer-rss .
```
2. Run the container (multiple options to avoid permission issues):
**Option A: Save to current directory (simplest)**
2. Run the container:
```bash
docker run -v $(pwd):/app/output warhammer-rss
# Basic usage
docker run -v $(pwd)/output:/app/output warhammer-rss
# With custom configuration
docker run -e MAX_SCROLL_ITERATIONS=3 \
-e LOG_LEVEL=DEBUG \
-v $(pwd)/output:/app/output \
warhammer-rss --no-cache
# With resource limits
docker run --memory=512m --cpu-quota=50000 \
-v $(pwd)/output:/app/output \
warhammer-rss
```
**Option B: Use environment variable for output directory**
## Command Line Options
```bash
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
Usage: main.py [OPTIONS]
Options:
--url URL URL to scrape (default: Warhammer Community)
--output-dir PATH Output directory for files
--max-scroll INT Maximum scroll iterations (default: 5)
--log-level LEVEL Logging level: DEBUG, INFO, WARNING, ERROR
--log-file PATH Log file path (default: scraper.log)
--no-cache Disable content caching
--clear-cache Clear cache before running
--cache-info Show cache information and exit
-h, --help Show help message
```
**Option C: With resource limits for additional security**
```bash
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
```
## Output
The application generates:
- `warhammer_rss_feed.xml` - RSS feed file
- `page.html` - Raw HTML content for debugging
Both files are saved to the specified output directory (current directory by default).
## Security Features
This application implements several security measures:
- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
- **Path Validation**: Prevents directory traversal attacks by validating output paths
- **Resource Limits**: Caps content size (10MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: Docker container runs as user `scraper` (UID 1001) for reduced privilege
- **Input Sanitization**: All URLs and file paths are validated before use
## How It Works
1. **Validates** the target URL against whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
5. **Sanitizes** and extracts article titles, links, and publication dates
6. **Validates all links** against allowed domains
7. Removes duplicates and sorts by date
8. Generates RSS feed using feedgen library
9. **Validates output paths** before saving files
## Configuration
The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
### Environment Variables
The application supports extensive configuration via environment variables:
```bash
# Scraping Configuration
MAX_SCROLL_ITERATIONS=5 # Number of scroll iterations
MAX_CONTENT_SIZE=10485760 # Maximum content size (10MB)
SCROLL_DELAY_SECONDS=2.0 # Delay between scrolls
PAGE_TIMEOUT_MS=120000 # Page load timeout
# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500 # Maximum title length
# Output Configuration
DEFAULT_OUTPUT_DIR="." # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"
# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
```
### Cache Management
```bash
# View cache status
python main.py --cache-info
# Clear cache
python main.py --clear-cache
# Disable caching for a run
python main.py --no-cache
```
## Project Structure
```
rss_warhammer/
├── main.py # CLI entry point
├── src/rss_scraper/ # Main package
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── exceptions.py # Custom exceptions
│ ├── validation.py # URL and path validation
│ ├── scraper.py # Web scraping with Playwright
│ ├── parser.py # HTML parsing and article extraction
│ ├── rss_generator.py # RSS feed generation
│ ├── cache.py # Content caching system
│ ├── security.py # Security utilities
│ └── retry_utils.py # Retry logic with backoff
├── tests/ # Comprehensive test suite
├── cache/ # Cache directory (auto-created)
├── requirements.txt # Python dependencies
├── pytest.ini # Test configuration
├── Dockerfile # Container configuration
└── README.md # This file
```
## Output Files
The application generates:
- `warhammer_rss_feed.xml` - RSS feed with extracted articles
- `page.html` - Raw HTML for debugging (optional)
- `scraper.log` - Application logs
- `cache/` - Cached content and ETags
## Testing
Run the comprehensive test suite:
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=src/rss_scraper
# Run specific test categories
pytest -m unit # Unit tests only
pytest tests/test_parser.py # Specific module
```
## Error Handling
The application uses specific exit codes for different error types:
- `0` - Success
- `1` - Configuration/Validation error
- `2` - Network error
- `3` - Page loading error
- `4` - Content parsing error
- `5` - File operation error
- `6` - Content size exceeded
- `99` - Unexpected error
## Security Considerations
### Allowed Domains
The scraper only operates on whitelisted domains:
- `warhammer-community.com`
- `www.warhammer-community.com`
To modify allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
### Rate Limiting
- Default: 30 requests per minute
- Minimum delay: 2 seconds between requests
- Configurable via environment variables
### Content Sanitization
- HTML content sanitized using bleach
- Dangerous scripts and patterns removed
- File paths validated against directory traversal
- URL validation against malicious patterns
## Deployment
### Production Deployment
1. **Environment Setup**:
```bash
# Create production environment file
cat > .env << EOF
MAX_SCROLL_ITERATIONS=3
SCROLL_DELAY_SECONDS=3.0
DEFAULT_OUTPUT_DIR=/app/data
LOG_LEVEL=INFO
EOF
```
2. **Docker Compose** (recommended):
```yaml
version: '3.8'
services:
rss-scraper:
build: .
environment:
- MAX_SCROLL_ITERATIONS=3
- LOG_LEVEL=INFO
volumes:
- ./output:/app/output
- ./logs:/app/logs
restart: unless-stopped
memory: 512m
cpus: 0.5
```
3. **Cron Schedule**:
```bash
# Add to crontab for regular updates
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
```
## Development
### Setup Development Environment
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort
# Install pre-commit hooks (optional)
pre-commit install
# Run tests
pytest
# Format code
black src/ tests/
isort src/ tests/
```
### Adding New Features
1. Follow the modular architecture
2. Add type hints to all functions
3. Include comprehensive error handling
4. Write tests for new functionality
5. Update configuration if needed
6. Document changes in README
## Troubleshooting
### Common Issues
1. **Permission Errors**:
- Ensure output directory is writable
- Use proper Docker volume mounting
2. **Memory Issues**:
- Reduce `MAX_SCROLL_ITERATIONS`
- Increase Docker memory limits
3. **Rate Limiting**:
- Increase `SCROLL_DELAY_SECONDS`
- Check network connectivity
4. **Cache Issues**:
- Clear cache with `--clear-cache`
- Check cache directory permissions
### Debug Mode
```bash
# Enable debug logging
python main.py --log-level DEBUG
# Disable caching for testing
python main.py --no-cache --log-level DEBUG
```
## License
This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request
## Changelog
### Version 1.0.0
- Complete rewrite with modular architecture
- Added comprehensive caching system
- Implemented rate limiting and security hardening
- Full test coverage with pytest
- Production-ready Docker container
- Extensive configuration management
- Structured logging and error handling