# Warhammer Community RSS Scraper

A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

## Overview

This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.

## Features

### Core Functionality

- Scrapes articles from the Warhammer Community website
- Generates properly formatted RSS feeds
- Handles duplicate article detection
- Sorts articles by publication date (newest first)
- Saves both the RSS feed and debug HTML

### Production Features

- **Modular Architecture**: Clean separation of concerns with dedicated modules
- **Comprehensive Logging**: Structured logging with configurable levels
- **Configuration Management**: Environment-based configuration
- **Caching**: Intelligent content caching with ETags and conditional requests (sketched below)
- **Rate Limiting**: Respectful scraping with configurable delays
- **Retry Logic**: Exponential backoff for network failures (sketched below)
- **Type Safety**: Full type hints throughout the codebase
- **Comprehensive Tests**: Unit tests with the pytest framework

### Security Features

- **URL Validation**: Whitelist-based domain validation
- **Content Sanitization**: HTML sanitization using the bleach library
- **Path Validation**: Prevention of directory traversal attacks
- **Resource Limits**: Memory and execution time constraints
- **Input Validation**: Comprehensive argument and data validation
- **Non-root Execution**: Secure container execution
- **File Sanitization**: Safe filename handling
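To picture the caching feature, here is a minimal sketch of the conditional-request idea. It is illustrative only, not the project's actual `cache.py`: the `fetch_with_etag` helper, the `cache/etags.json` layout, and the use of the `requests` library are all assumptions (the scraper itself drives pages through Playwright, but the ETag pattern is the same).

```python
import json
import pathlib

import requests

CACHE_FILE = pathlib.Path("cache/etags.json")  # hypothetical cache layout


def fetch_with_etag(url: str) -> str | None:
    """Fetch url, sending any stored ETag so an unchanged page costs only a 304."""
    etags = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    headers = {"If-None-Match": etags[url]} if url in etags else {}
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # unchanged since the last run; reuse the cached copy
    if etag := response.headers.get("ETag"):
        etags[url] = etag
        CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
        CACHE_FILE.write_text(json.dumps(etags))
    return response.text
```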
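Likewise, the retry logic can be pictured as a decorator that doubles its delay after each failed attempt. The name `retry_with_backoff` and the retryable exception type are illustrative assumptions, not the actual `retry_utils.py` API:

```python
import logging
import time
from functools import wraps

logger = logging.getLogger(__name__)


def retry_with_backoff(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry decorator with exponential backoff: waits 1s, 2s, 4s, ... between tries."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionError as exc:  # assumed retryable error type
                    if attempt == max_attempts - 1:
                        raise  # attempts exhausted; let the caller handle it
                    delay = base_delay * 2 ** attempt
                    logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                                   attempt + 1, exc, delay)
                    time.sleep(delay)
        return wrapper
    return decorator


@retry_with_backoff(max_attempts=3)
def fetch_page(url: str) -> str:
    ...  # network call that may raise ConnectionError
```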
## Requirements

- Python 3.12+
- Dependencies listed in `requirements.txt`

## Installation

### Local Setup

1. Install dependencies:

```bash
pip install -r requirements.txt
```

2. Install Playwright browsers:

```bash
playwright install
```

3. Run the scraper:

```bash
# Basic usage
python main.py

# With custom options
python main.py --url https://www.warhammer-community.com/en-gb/ \
  --output-dir ./output \
  --log-level DEBUG \
  --max-scroll 3

# View all options
python main.py --help
```

### Docker Setup

1. Build the Docker image:

```bash
docker build -t warhammer-rss .
```

2. Run the container:

```bash
# Basic usage
docker run -v $(pwd)/output:/app/output warhammer-rss

# With custom configuration
docker run -e MAX_SCROLL_ITERATIONS=3 \
  -e LOG_LEVEL=DEBUG \
  -v $(pwd)/output:/app/output \
  warhammer-rss --no-cache

# With resource limits
docker run --memory=512m --cpu-quota=50000 \
  -v $(pwd)/output:/app/output \
  warhammer-rss
```

## Command Line Options

```bash
Usage: main.py [OPTIONS]

Options:
  --url URL           URL to scrape (default: Warhammer Community)
  --output-dir PATH   Output directory for files
  --max-scroll INT    Maximum scroll iterations (default: 5)
  --log-level LEVEL   Logging level: DEBUG, INFO, WARNING, ERROR
  --log-file PATH     Log file path (default: scraper.log)
  --no-cache          Disable content caching
  --clear-cache       Clear cache before running
  --cache-info        Show cache information and exit
  -h, --help          Show help message
```

## Configuration

### Environment Variables

The application supports extensive configuration via environment variables:

```bash
# Scraping Configuration
MAX_SCROLL_ITERATIONS=5      # Number of scroll iterations
MAX_CONTENT_SIZE=10485760    # Maximum content size (10MB)
SCROLL_DELAY_SECONDS=2.0     # Delay between scrolls
PAGE_TIMEOUT_MS=120000       # Page load timeout

# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500         # Maximum title length

# Output Configuration
DEFAULT_OUTPUT_DIR="."       # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"

# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
```

### Cache Management

```bash
# View cache status
python main.py --cache-info

# Clear cache
python main.py --clear-cache

# Disable caching for a run
python main.py --no-cache
```

## Project Structure

```
rss_warhammer/
├── main.py                 # CLI entry point
├── src/rss_scraper/        # Main package
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── validation.py       # URL and path validation
│   ├── scraper.py          # Web scraping with Playwright
│   ├── parser.py           # HTML parsing and article extraction
│   ├── rss_generator.py    # RSS feed generation
│   ├── cache.py            # Content caching system
│   ├── security.py         # Security utilities
│   └── retry_utils.py      # Retry logic with backoff
├── tests/                  # Comprehensive test suite
├── cache/                  # Cache directory (auto-created)
├── requirements.txt        # Python dependencies
├── pytest.ini              # Test configuration
├── Dockerfile              # Container configuration
└── README.md               # This file
```

## Output Files

The application generates:

- `warhammer_rss_feed.xml` - RSS feed with extracted articles
- `page.html` - Raw HTML for debugging (optional)
- `scraper.log` - Application logs
- `cache/` - Cached content and ETags

## Testing

Run the comprehensive test suite:

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src/rss_scraper

# Run specific test categories
pytest -m unit                # Unit tests only
pytest tests/test_parser.py   # Specific module
```

## Error Handling

The application uses specific exit codes for different error types:

- `0` - Success
- `1` - Configuration/Validation error
- `2` - Network error
- `3` - Page loading error
- `4` - Content parsing error
- `5` - File operation error
- `6` - Content size exceeded
- `99` - Unexpected error

## Security Considerations

### Allowed Domains

The scraper only operates on whitelisted domains (checked as sketched below):

- `warhammer-community.com`
- `www.warhammer-community.com`

### Rate Limiting

- Default: 30 requests per minute
- Minimum delay: 2 seconds between requests
- Configurable via environment variables (sketched below)

### Content Sanitization

- HTML content sanitized using bleach (sketched below)
- Dangerous scripts and patterns removed
- File paths validated against directory traversal
- URL validation against malicious patterns
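A domain whitelist check can be as small as the following sketch. The function name and error type are illustrative assumptions; the project's real logic lives in `validation.py`:

```python
from urllib.parse import urlparse

# Mirrors the ALLOWED_DOMAINS default documented above
ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}


def validate_url(url: str) -> str:
    """Reject URLs whose scheme or host falls outside the whitelist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"Domain not whitelisted: {parsed.hostname!r}")
    return url


validate_url("https://www.warhammer-community.com/en-gb/")  # passes
```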
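The rate limiting above can be pictured as a minimum-delay gate between requests; 30 requests per minute works out to one request every 2 seconds. This is a minimal sketch with an assumed `RateLimiter` name, not the project's actual implementation:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_delay_seconds: float = 2.0):
        self.min_delay = min_delay_seconds
        self._last_request = 0.0

    def wait(self) -> None:
        """Sleep just long enough to honor the configured delay, then record the time."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()


# Usage: call limiter.wait() before each page fetch
limiter = RateLimiter(min_delay_seconds=2.0)
```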
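Sanitization builds on bleach's `clean()`; a minimal sketch of the approach follows. The allowed-tag and attribute lists here are illustrative assumptions, and the project's actual policy lives in `security.py`:

```python
import bleach

# Illustrative whitelist; the real policy is defined in security.py
ALLOWED_TAGS = ["p", "a", "strong", "em", "ul", "ol", "li"]
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}


def sanitize_html(raw_html: str) -> str:
    """Strip scripts and disallowed markup while keeping basic formatting."""
    return bleach.clean(
        raw_html,
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        strip=True,  # drop disallowed tags instead of escaping them
    )


# The <script> tag is stripped and unsafe attributes like onclick are dropped
print(sanitize_html('<script>alert(1)</script><a href="/x" onclick="y()">ok</a>'))
```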
## Deployment

### Production Deployment

1. **Environment Setup**:

```bash
# Create production environment file
cat > .env << EOF
MAX_SCROLL_ITERATIONS=3
SCROLL_DELAY_SECONDS=3.0
DEFAULT_OUTPUT_DIR=/app/data
LOG_LEVEL=INFO
EOF
```

2. **Docker Compose** (recommended):

```yaml
version: '3.8'
services:
  rss-scraper:
    build: .
    environment:
      - MAX_SCROLL_ITERATIONS=3
      - LOG_LEVEL=INFO
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    restart: unless-stopped
    mem_limit: 512m
    cpus: 0.5
```

3. **Cron Schedule**:

```bash
# Add to crontab for regular updates
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
```

## Development

### Setup Development Environment

```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort

# Install pre-commit hooks (optional)
pre-commit install

# Run tests
pytest

# Format code
black src/ tests/
isort src/ tests/
```

### Adding New Features

1. Follow the modular architecture
2. Add type hints to all functions
3. Include comprehensive error handling
4. Write tests for new functionality
5. Update configuration if needed
6. Document changes in the README

## Troubleshooting

### Common Issues

1. **Permission Errors**:
   - Ensure the output directory is writable
   - Use proper Docker volume mounting

2. **Memory Issues**:
   - Reduce `MAX_SCROLL_ITERATIONS`
   - Increase Docker memory limits

3. **Rate Limiting**:
   - Increase `SCROLL_DELAY_SECONDS`
   - Check network connectivity

4. **Cache Issues**:
   - Clear the cache with `--clear-cache`
   - Check cache directory permissions

### Debug Mode

```bash
# Enable debug logging
python main.py --log-level DEBUG

# Disable caching for testing
python main.py --no-cache --log-level DEBUG
```

## License

This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## Changelog

### Version 1.0.0

- Complete rewrite with modular architecture
- Added comprehensive caching system
- Implemented rate limiting and security hardening
- Full test coverage with pytest
- Production-ready Docker container
- Extensive configuration management
- Structured logging and error handling