Warhammer Community RSS Scraper
A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
Overview
This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.
Features
Core Functionality
- Scrapes articles from Warhammer Community website
- Generates properly formatted RSS feeds
- Detects and skips duplicate articles
- Sorts articles by publication date, newest first (both sketched after this list)
- Saves both RSS feed and debug HTML
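Deduplication and sorting come down to a few lines. A minimal sketch of the idea follows; the Article fields here are illustrative assumptions, not the project's actual data model:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Article:
    # Illustrative fields only - not the project's actual model
    title: str
    link: str
    published: datetime

def dedupe_and_sort(articles: list[Article]) -> list[Article]:
    seen: set[str] = set()
    unique = []
    for article in articles:
        if article.link not in seen:  # first occurrence wins
            seen.add(article.link)
            unique.append(article)
    # Newest first, matching the generated feed order
    return sorted(unique, key=lambda a: a.published, reverse=True)
```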
Production Features
- Modular Architecture: Clean separation of concerns with dedicated modules
- Comprehensive Logging: Structured logging with configurable levels
- Configuration Management: Environment-based configuration
- Caching: Intelligent content caching with ETags and conditional requests
- Rate Limiting: Respectful scraping with configurable delays
- Retry Logic: Exponential backoff for network failures (sketched after this list)
- Type Safety: Full type hints throughout the codebase
- Comprehensive Tests: Unit tests with the pytest framework
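The retry behavior follows the standard exponential-backoff pattern. Here is a minimal sketch; the function name and defaults are illustrative, and the actual retry_utils.py may differ:

```python
import random
import time

def retry_with_backoff(func, max_attempts: int = 3, base_delay: float = 1.0):
    """Call func(), retrying on network errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts - propagate the error
            # 1s, 2s, 4s, ... plus jitter so parallel runs don't retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```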
Security Features
- URL Validation: Whitelist-based domain validation (see the sketch after this list)
- Content Sanitization: HTML sanitization using the bleach library
- Path Validation: Prevention of directory traversal attacks
- Resource Limits: Memory and execution time constraints
- Input Validation: Comprehensive argument and data validation
- Non-root Execution: Secure container execution
- File Sanitization: Safe filename handling
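Whitelist validation reduces to a hostname check before any request is made. Roughly (a sketch, not the exact code in validation.py):

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs whose hostname is on the whitelist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_DOMAINS
```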
Requirements
- Python 3.12+
- Dependencies listed in requirements.txt
Installation
Local Setup
- Install dependencies:
pip install -r requirements.txt
- Install Playwright browsers:
playwright install
- Run the scraper:
# Basic usage
python main.py
# With custom options
python main.py --url https://www.warhammer-community.com/en-gb/ \
--output-dir ./output \
--log-level DEBUG \
--max-scroll 3
# View all options
python main.py --help
Docker Setup
- Build the Docker image:
docker build -t warhammer-rss .
- Run the container:
# Basic usage
docker run -v $(pwd)/output:/app/output warhammer-rss
# With custom configuration
docker run -e MAX_SCROLL_ITERATIONS=3 \
-e LOG_LEVEL=DEBUG \
-v $(pwd)/output:/app/output \
warhammer-rss --no-cache
# With resource limits
docker run --memory=512m --cpu-quota=50000 \
-v $(pwd)/output:/app/output \
warhammer-rss
Command Line Options
Usage: main.py [OPTIONS]
Options:
--url URL URL to scrape (default: Warhammer Community)
--output-dir PATH Output directory for files
--max-scroll INT Maximum scroll iterations (default: 5)
--log-level LEVEL Logging level: DEBUG, INFO, WARNING, ERROR
--log-file PATH Log file path (default: scraper.log)
--no-cache Disable content caching
--clear-cache Clear cache before running
--cache-info Show cache information and exit
-h, --help Show help message
Configuration
Environment Variables
The application supports extensive configuration via environment variables:
# Scraping Configuration
MAX_SCROLL_ITERATIONS=5 # Number of scroll iterations
MAX_CONTENT_SIZE=10485760 # Maximum content size (10MB)
SCROLL_DELAY_SECONDS=2.0 # Delay between scrolls
PAGE_TIMEOUT_MS=120000 # Page load timeout
# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500 # Maximum title length
# Output Configuration
DEFAULT_OUTPUT_DIR="." # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"
# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
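Each variable falls back to a built-in default when unset. A sketch of the usual reading pattern (the actual config.py may parse and validate differently):

```python
import os

# Defaults mirror the values documented above; unset variables fall back.
MAX_SCROLL_ITERATIONS = int(os.environ.get("MAX_SCROLL_ITERATIONS", "5"))
SCROLL_DELAY_SECONDS = float(os.environ.get("SCROLL_DELAY_SECONDS", "2.0"))
ALLOWED_DOMAINS = os.environ.get(
    "ALLOWED_DOMAINS",
    "warhammer-community.com,www.warhammer-community.com",
).split(",")
```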
Cache Management
# View cache status
python main.py --cache-info
# Clear cache
python main.py --clear-cache
# Disable caching for a run
python main.py --no-cache
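Under the hood, the cache relies on standard HTTP conditional requests: store the ETag from each response, send it back as If-None-Match, and skip re-downloading on a 304. A rough sketch, using the requests library purely for illustration (the scraper itself drives Playwright, and cache.py may store validators differently):

```python
import json
import pathlib

import requests  # illustration only - the scraper itself uses Playwright

CACHE_FILE = pathlib.Path("cache/etags.json")

def fetch_if_changed(url: str) -> str | None:
    """Return the page body, or None if the server reports it unchanged (304)."""
    etags = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    headers = {"If-None-Match": etags[url]} if url in etags else {}
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # cached copy is still current
    if "ETag" in response.headers:
        etags[url] = response.headers["ETag"]
        CACHE_FILE.parent.mkdir(exist_ok=True)
        CACHE_FILE.write_text(json.dumps(etags))
    return response.text
```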
Project Structure
rss_warhammer/
├── main.py # CLI entry point
├── src/rss_scraper/ # Main package
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── exceptions.py # Custom exceptions
│ ├── validation.py # URL and path validation
│ ├── scraper.py # Web scraping with Playwright
│ ├── parser.py # HTML parsing and article extraction
│ ├── rss_generator.py # RSS feed generation
│ ├── cache.py # Content caching system
│ ├── security.py # Security utilities
│ └── retry_utils.py # Retry logic with backoff
├── tests/ # Comprehensive test suite
├── cache/ # Cache directory (auto-created)
├── requirements.txt # Python dependencies
├── pytest.ini # Test configuration
├── Dockerfile # Container configuration
└── README.md # This file
Output Files
The application generates:
- warhammer_rss_feed.xml - RSS feed with extracted articles
- page.html - Raw HTML for debugging (optional)
- scraper.log - Application logs
- cache/ - Cached content and ETags
Testing
Run the comprehensive test suite:
# Run all tests
pytest
# Run with coverage
pytest --cov=src/rss_scraper
# Run specific test categories
pytest -m unit # Unit tests only
pytest tests/test_parser.py # Specific module
Error Handling
The application uses specific exit codes for different error types (a usage example follows the list):
- 0 - Success
- 1 - Configuration/Validation error
- 2 - Network error
- 3 - Page loading error
- 4 - Content parsing error
- 5 - File operation error
- 6 - Content size exceeded
- 99 - Unexpected error
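In automation, these codes let a wrapper react to specific failures, for example:

```python
import subprocess

# Run the scraper and branch on its documented exit codes.
result = subprocess.run(["python", "main.py", "--no-cache"])
if result.returncode == 0:
    print("feed updated")
elif result.returncode == 2:
    print("network error - worth retrying later")
else:
    print(f"scraper failed with exit code {result.returncode}")
```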
Security Considerations
Allowed Domains
The scraper only operates on whitelisted domains:
- warhammer-community.com
- www.warhammer-community.com
Rate Limiting
- Default: 30 requests per minute
- Minimum delay: 2 seconds between requests
- Configurable via environment variables (a minimal sketch follows)
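A rate limiter of this kind only needs to remember the time of the last request. The class name and interface below are illustrative, not the project's actual code:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between requests (2 s is roughly 30 requests/minute)."""

    def __init__(self, min_delay: float = 2.0):
        self.min_delay = min_delay
        self._last_request = 0.0

    def wait(self) -> None:
        """Sleep just long enough to honor the minimum delay, then record the time."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()
```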
Content Sanitization
- HTML content sanitized using bleach (example after this list)
- Dangerous scripts and patterns removed
- File paths validated against directory traversal
- URL validation against malicious patterns
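With bleach, sanitization is a single call against an explicit allowlist of tags and attributes. For example (the allowlists here are illustrative, not the project's exact configuration):

```python
import bleach  # pip install bleach

ALLOWED_TAGS = ["a", "em", "li", "ol", "p", "strong", "ul"]
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}

dirty = '<p onclick="steal()">News<script>alert(1)</script></p>'
clean = bleach.clean(
    dirty,
    tags=ALLOWED_TAGS,
    attributes=ALLOWED_ATTRIBUTES,
    strip=True,  # drop disallowed tags instead of escaping them
)
print(clean)  # onclick and <script> are gone; only whitelisted markup survives
```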
Deployment
Production Deployment
- Environment Setup:
# Create production environment file
cat > .env << EOF
MAX_SCROLL_ITERATIONS=3
SCROLL_DELAY_SECONDS=3.0
DEFAULT_OUTPUT_DIR=/app/data
LOG_LEVEL=INFO
EOF
- Docker Compose (recommended):
version: '3.8'
services:
  rss-scraper:
    build: .
    environment:
      - MAX_SCROLL_ITERATIONS=3
      - LOG_LEVEL=INFO
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    restart: unless-stopped
    mem_limit: 512m
    cpus: 0.5
- Cron Schedule:
# Add to crontab for regular updates
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
Development
Setup Development Environment
# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort
# Install pre-commit hooks (optional)
pre-commit install
# Run tests
pytest
# Format code
black src/ tests/
isort src/ tests/
Adding New Features
- Follow the modular architecture
- Add type hints to all functions
- Include comprehensive error handling
- Write tests for new functionality
- Update configuration if needed
- Document changes in README
Troubleshooting
Common Issues
- Permission Errors:
  - Ensure the output directory is writable
  - Use proper Docker volume mounting
- Memory Issues:
  - Reduce MAX_SCROLL_ITERATIONS
  - Increase Docker memory limits
- Rate Limiting:
  - Increase SCROLL_DELAY_SECONDS
  - Check network connectivity
- Cache Issues:
  - Clear the cache with --clear-cache
  - Check cache directory permissions
Debug Mode
# Enable debug logging
python main.py --log-level DEBUG
# Disable caching for testing
python main.py --no-cache --log-level DEBUG
License
This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Changelog
Version 1.0.0
- Complete rewrite with modular architecture
- Added comprehensive caching system
- Implemented rate limiting and security hardening
- Full test coverage with pytest
- Production-ready Docker container
- Extensive configuration management
- Structured logging and error handling