# Warhammer Community RSS Scraper

A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

## Overview

This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.

## Features

### Core Functionality
- Scrapes articles from the Warhammer Community website
- Generates properly formatted RSS feeds
- Detects and skips duplicate articles
- Sorts articles by publication date (newest first)
- Saves both the RSS feed and debug HTML

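The duplicate detection and newest-first ordering listed above could be sketched as follows. This is a minimal illustration, not the project's parser: the `Article` dataclass, its fields, and `dedupe_and_sort` are hypothetical names, and using the article link as the identity key is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Article:
    """Hypothetical article record; the real parser's fields may differ."""
    title: str
    link: str
    published: datetime


def dedupe_and_sort(articles: list[Article]) -> list[Article]:
    """Keep the first article seen for each link, then sort newest first."""
    seen: set[str] = set()
    unique: list[Article] = []
    for article in articles:
        if article.link in seen:
            continue  # assumption: two entries with the same link are duplicates
        seen.add(article.link)
        unique.append(article)
    return sorted(unique, key=lambda a: a.published, reverse=True)
```

Keying on the link rather than the title avoids dropping distinct articles that happen to share a headline.
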
### Production Features
- **Modular Architecture**: Clean separation of concerns with dedicated modules
- **Comprehensive Logging**: Structured logging with configurable levels
- **Configuration Management**: Environment-based configuration
- **Caching**: Intelligent content caching with ETags and conditional requests
- **Rate Limiting**: Respectful scraping with configurable delays
- **Retry Logic**: Exponential backoff for network failures
- **Type Safety**: Full type hints throughout codebase
- **Comprehensive Tests**: Unit tests with pytest framework

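To illustrate the retry behaviour named above, here is a minimal sketch of exponential backoff. It is not the code in `retry_utils.py`: the function name, its parameters, the default values, and the choice of which exceptions count as retryable are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,      # hypothetical default
    base_delay: float = 1.0,    # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real delays.
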
### Security Features
- **URL Validation**: Whitelist-based domain validation
- **Content Sanitization**: HTML sanitization using the bleach library
- **Path Validation**: Prevention of directory traversal attacks
- **Resource Limits**: Memory and execution time constraints
- **Input Validation**: Comprehensive argument and data validation
- **Non-root Execution**: Secure container execution
- **File Sanitization**: Safe filename handling

## Requirements

- Python 3.12+
- Dependencies listed in `requirements.txt`

## Installation

### Local Setup

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

2. Install Playwright browsers:
   ```bash
   playwright install
   ```

3. Run the scraper:
   ```bash
   # Basic usage
   python main.py

   # With custom options
   python main.py --url https://www.warhammer-community.com/en-gb/ \
     --output-dir ./output \
     --log-level DEBUG \
     --max-scroll 3

   # View all options
   python main.py --help
   ```

### Docker Setup

1. Build the Docker image:
   ```bash
   docker build -t warhammer-rss .
   ```

2. Run the container:
   ```bash
   # Basic usage
   docker run -v "$(pwd)/output:/app/output" warhammer-rss

   # With custom configuration
   docker run -e MAX_SCROLL_ITERATIONS=3 \
     -e LOG_LEVEL=DEBUG \
     -v "$(pwd)/output:/app/output" \
     warhammer-rss --no-cache

   # With resource limits
   docker run --memory=512m --cpu-quota=50000 \
     -v "$(pwd)/output:/app/output" \
     warhammer-rss
   ```

## Command Line Options

```bash
Usage: main.py [OPTIONS]

Options:
  --url URL            URL to scrape (default: Warhammer Community)
  --output-dir PATH    Output directory for files
  --max-scroll INT     Maximum scroll iterations (default: 5)
  --log-level LEVEL    Logging level: DEBUG, INFO, WARNING, ERROR
  --log-file PATH      Log file path (default: scraper.log)
  --no-cache           Disable content caching
  --clear-cache        Clear cache before running
  --cache-info         Show cache information and exit
  -h, --help           Show help message
```

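For readers who want to see how such an interface is typically declared, the options above map naturally onto `argparse`. The sketch below is an illustration only and may not match `main.py`: `build_parser` is a hypothetical name, and the defaults for `--url`, `--output-dir`, and `--log-level` are assumptions (the usage text only states defaults for `--max-scroll` and `--log-file`).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; defaults marked below are assumed."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument(
        "--url",
        default="https://www.warhammer-community.com/en-gb/",  # assumed default
        help="URL to scrape",
    )
    parser.add_argument("--output-dir", default=".",  # assumed default
                        help="Output directory for files")
    parser.add_argument("--max-scroll", type=int, default=5,
                        help="Maximum scroll iterations")
    parser.add_argument("--log-level", default="INFO",  # assumed default
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="Logging level")
    parser.add_argument("--log-file", default="scraper.log",
                        help="Log file path")
    parser.add_argument("--no-cache", action="store_true",
                        help="Disable content caching")
    parser.add_argument("--clear-cache", action="store_true",
                        help="Clear cache before running")
    parser.add_argument("--cache-info", action="store_true",
                        help="Show cache information and exit")
    return parser
```

`argparse` adds `-h, --help` automatically and converts `--max-scroll` to the attribute `max_scroll`.
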
## Configuration

### Environment Variables

The application supports extensive configuration via environment variables:

```bash
# Scraping Configuration
MAX_SCROLL_ITERATIONS=5      # Number of scroll iterations
MAX_CONTENT_SIZE=10485760    # Maximum content size in bytes (10 MiB)
SCROLL_DELAY_SECONDS=2.0     # Delay between scrolls
PAGE_TIMEOUT_MS=120000       # Page load timeout in milliseconds

# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500         # Maximum title length

# Output Configuration
DEFAULT_OUTPUT_DIR="."       # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"

# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
```

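As a rough sketch of how environment-based configuration like this can be read in Python, the example below loads a few of the variables listed above with the same defaults. `ScraperConfig` and `load_config` are hypothetical names and are not taken from `config.py`; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical container for a subset of the documented settings."""
    max_scroll_iterations: int
    max_content_size: int
    scroll_delay_seconds: float
    page_timeout_ms: int
    allowed_domains: tuple[str, ...]


def load_config(env: Mapping[str, str] = os.environ) -> ScraperConfig:
    """Read settings from env, falling back to the documented defaults."""
    domains = env.get(
        "ALLOWED_DOMAINS",
        "warhammer-community.com,www.warhammer-community.com",
    )
    return ScraperConfig(
        max_scroll_iterations=int(env.get("MAX_SCROLL_ITERATIONS", "5")),
        max_content_size=int(env.get("MAX_CONTENT_SIZE", "10485760")),
        scroll_delay_seconds=float(env.get("SCROLL_DELAY_SECONDS", "2.0")),
        page_timeout_ms=int(env.get("PAGE_TIMEOUT_MS", "120000")),
        # Split the comma-separated list and normalise case and whitespace.
        allowed_domains=tuple(
            d.strip().lower() for d in domains.split(",") if d.strip()
        ),
    )
```

Accepting any mapping, rather than reading `os.environ` directly inside the function, makes it easy to test with a plain dictionary.
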
### Cache Management

```bash
# View cache status
python main.py --cache-info

# Clear cache
python main.py --clear-cache

# Disable caching for a run
python main.py --no-cache
```

## Project Structure

```
rss_warhammer/
├── main.py                    # CLI entry point
├── src/rss_scraper/           # Main package
│   ├── __init__.py
│   ├── config.py              # Configuration management
│   ├── exceptions.py          # Custom exceptions
│   ├── validation.py          # URL and path validation
│   ├── scraper.py             # Web scraping with Playwright
│   ├── parser.py              # HTML parsing and article extraction
│   ├── rss_generator.py       # RSS feed generation
│   ├── cache.py               # Content caching system
│   ├── security.py            # Security utilities
│   └── retry_utils.py         # Retry logic with backoff
├── tests/                     # Comprehensive test suite
├── cache/                     # Cache directory (auto-created)
├── requirements.txt           # Python dependencies
├── pytest.ini                 # Test configuration
├── Dockerfile                 # Container configuration
└── README.md                  # This file
```

## Output Files

The application generates:
- `warhammer_rss_feed.xml` - RSS feed with extracted articles
- `page.html` - Raw HTML for debugging (optional)
- `scraper.log` - Application logs
- `cache/` - Cached content and ETags

## Testing

Run the comprehensive test suite:

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src/rss_scraper

# Run specific test categories
pytest -m unit               # Unit tests only
pytest tests/test_parser.py  # Specific module
```

## Error Handling

The application uses specific exit codes for different error types:

- `0` - Success
- `1` - Configuration/Validation error
- `2` - Network error
- `3` - Page loading error
- `4` - Content parsing error
- `5` - File operation error
- `6` - Content size exceeded
- `99` - Unexpected error

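One common way to implement an exit-code scheme like the one above is to attach a code to each custom exception and translate it at the entry point. The sketch below assumes that approach; the class names and the `run` helper are hypothetical and are not taken from `exceptions.py` or `main.py`. Only the numeric codes come from the list above.

```python
from enum import IntEnum
from typing import Callable


class ExitCode(IntEnum):
    """Numeric codes as documented; the member names are illustrative."""
    SUCCESS = 0
    CONFIG_ERROR = 1
    NETWORK_ERROR = 2
    PAGE_LOAD_ERROR = 3
    PARSE_ERROR = 4
    FILE_ERROR = 5
    CONTENT_SIZE_ERROR = 6
    UNEXPECTED_ERROR = 99


class ScraperError(Exception):
    """Hypothetical base class; subclasses override exit_code."""
    exit_code = ExitCode.UNEXPECTED_ERROR


class NetworkError(ScraperError):
    exit_code = ExitCode.NETWORK_ERROR


class ParseError(ScraperError):
    exit_code = ExitCode.PARSE_ERROR


def run(task: Callable[[], None]) -> int:
    """Run task and translate any failure into a process exit code."""
    try:
        task()
    except ScraperError as exc:
        return int(exc.exit_code)
    except Exception:
        # Anything not anticipated maps to the catch-all code.
        return int(ExitCode.UNEXPECTED_ERROR)
    return int(ExitCode.SUCCESS)
```

A caller would typically finish with `sys.exit(run(main))`, which lets cron jobs or container orchestrators distinguish network failures from parsing failures.
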
## Security Considerations

### Allowed Domains
The scraper only operates on whitelisted domains:
- `warhammer-community.com`
- `www.warhammer-community.com`

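A whitelist check of this kind can be sketched with the standard library alone. The example below is an assumption about how such validation might look, not the code in `validation.py`; `is_allowed_url` is a hypothetical name, and restricting schemes to `http` and `https` is an added assumption.

```python
from urllib.parse import urlparse

# The two domains listed above.
ALLOWED_DOMAINS = frozenset(
    {"warhammer-community.com", "www.warhammer-community.com"}
)


def is_allowed_url(url: str, allowed: frozenset[str] = ALLOWED_DOMAINS) -> bool:
    """Return True only for http(s) URLs whose host is exactly whitelisted."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # hostname strips any port and userinfo, and is already lowercased.
    host = (parsed.hostname or "").lower()
    return host in allowed
```

Comparing the parsed hostname for exact membership, rather than testing whether the URL string contains the domain, rejects look-alikes such as `warhammer-community.com.evil.example`.
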
### Rate Limiting
- Default: 30 requests per minute
- Minimum delay: 2 seconds between requests
- Configurable via environment variables

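A 2-second minimum delay is equivalent to the 30 requests per minute stated above (60 / 30 = 2). A minimal sketch of enforcing that interval follows; `MinIntervalLimiter` is a hypothetical class written for illustration and is not the project's rate limiter.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(
        self,
        min_interval: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last request was allowed

    def wait(self) -> float:
        """Sleep until the interval has elapsed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `limiter.wait()` immediately before each request is enough; the first call never sleeps.
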
### Content Sanitization
- HTML content sanitized using bleach
- Dangerous scripts and patterns removed
- File paths validated against directory traversal
- URLs validated against malicious patterns

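The directory traversal check mentioned above can be illustrated with `pathlib`. This is a sketch under assumptions, not the project's implementation: `resolve_inside` is a hypothetical helper, and it relies on `Path.is_relative_to`, which is available from Python 3.9.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, or raise ValueError if it escapes."""
    base = Path(base_dir).resolve()
    # Resolving collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate
```

Because the comparison happens after resolution, inputs such as `../outside.xml` or an absolute path like `/etc/passwd` are rejected, while `feed.xml` or `sub/feed.xml` are accepted.
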
## Deployment

### Production Deployment

1. **Environment Setup**:
   ```bash
   # Create production environment file
   cat > .env << EOF
   MAX_SCROLL_ITERATIONS=3
   SCROLL_DELAY_SECONDS=3.0
   DEFAULT_OUTPUT_DIR=/app/data
   LOG_LEVEL=INFO
   EOF
   ```

2. **Docker Compose** (recommended):
   ```yaml
   version: '3.8'
   services:
     rss-scraper:
       build: .
       environment:
         - MAX_SCROLL_ITERATIONS=3
         - LOG_LEVEL=INFO
       volumes:
         - ./output:/app/output
         - ./logs:/app/logs
       restart: unless-stopped
       deploy:
         resources:
           limits:
             memory: 512M
             cpus: '0.5'
   ```

3. **Cron Schedule**:
   ```bash
   # Add to crontab for regular updates (every 6 hours)
   0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
   ```

## Development

### Setup Development Environment

```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort

# Install pre-commit hooks (optional; requires the pre-commit package)
pre-commit install

# Run tests
pytest

# Format code
black src/ tests/
isort src/ tests/
```

### Adding New Features

1. Follow the modular architecture
2. Add type hints to all functions
3. Include comprehensive error handling
4. Write tests for new functionality
5. Update configuration if needed
6. Document changes in README

## Troubleshooting

### Common Issues

1. **Permission Errors**:
   - Ensure the output directory is writable
   - Use proper Docker volume mounting

2. **Memory Issues**:
   - Reduce `MAX_SCROLL_ITERATIONS`
   - Increase Docker memory limits

3. **Rate Limiting**:
   - Increase `SCROLL_DELAY_SECONDS`
   - Check network connectivity

4. **Cache Issues**:
   - Clear cache with `--clear-cache`
   - Check cache directory permissions

### Debug Mode

```bash
# Enable debug logging
python main.py --log-level DEBUG

# Disable caching for testing
python main.py --no-cache --log-level DEBUG
```

## License

This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## Changelog

### Version 1.0.0
- Complete rewrite with modular architecture
- Added comprehensive caching system
- Implemented rate limiting and security hardening
- Full test coverage with pytest
- Production-ready Docker container
- Extensive configuration management
- Structured logging and error handling