# Warhammer Community RSS Scraper
A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
## Overview
This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.
## Features
### Core Functionality
- Scrapes articles from Warhammer Community website
- Generates properly formatted RSS feeds
- Handles duplicate article detection
- Sorts articles by publication date, newest first (both steps are sketched after this list)
- Saves both RSS feed and debug HTML
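
As a rough illustration, the dedupe-and-sort step boils down to something like the following sketch (the `Article` shape and field names are illustrative, not the project's actual types):
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Article:
    title: str
    url: str
    published: datetime

def dedupe_and_sort(articles: list[Article]) -> list[Article]:
    """Drop repeated URLs, keeping the first occurrence, then sort newest-first."""
    seen: set[str] = set()
    unique: list[Article] = []
    for article in articles:
        if article.url not in seen:
            seen.add(article.url)
            unique.append(article)
    return sorted(unique, key=lambda a: a.published, reverse=True)
```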
### Production Features
- **Modular Architecture**: Clean separation of concerns with dedicated modules
- **Comprehensive Logging**: Structured logging with configurable levels
- **Configuration Management**: Environment-based configuration
- **Caching**: Intelligent content caching with ETags and conditional requests
- **Rate Limiting**: Respectful scraping with configurable delays
- **Retry Logic**: Exponential backoff for network failures (sketched after this list)
- **Type Safety**: Full type hints throughout codebase
- **Comprehensive Tests**: Unit tests with pytest framework
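
The retry behaviour is easiest to see in miniature. A minimal sketch of exponential backoff, with example delays rather than the actual defaults from `retry_utils.py`:
```python
import time

def retry_with_backoff(func, max_attempts: int = 3, base_delay: float = 2.0):
    """Call func(), sleeping 2s, 4s, 8s, ... between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))
```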
### Security Features
- **URL Validation**: Whitelist-based domain validation (sketched after this list)
- **Content Sanitization**: HTML sanitization using bleach library
- **Path Validation**: Prevention of directory traversal attacks
- **Resource Limits**: Memory and execution time constraints
- **Input Validation**: Comprehensive argument and data validation
- **Non-root Execution**: Secure container execution
- **File Sanitization**: Safe filename handling
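
As an example, whitelist-based URL validation can be as small as the sketch below (the function name and error type are assumptions; the real checks live in `validation.py`):
```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def validate_url(url: str) -> str:
    """Reject URLs that are not HTTPS or whose host is not whitelisted."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"domain not allowed: {parsed.hostname!r}")
    return url
```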
## Requirements
- Python 3.12+
- Dependencies listed in `requirements.txt`
## Installation
### Local Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Install Playwright browsers:
```bash
playwright install
```
3. Run the scraper:
```bash
# Basic usage
python main.py
# With custom options
python main.py --url https://www.warhammer-community.com/en-gb/ \
    --output-dir ./output \
    --log-level DEBUG \
    --max-scroll 3
# View all options
python main.py --help
```
### Docker Setup
1. Build the Docker image:
```bash
docker build -t warhammer-rss .
```
2. Run the container:
```bash
# Basic usage
docker run -v $(pwd)/output:/app/output warhammer-rss
# With custom configuration
docker run -e MAX_SCROLL_ITERATIONS=3 \
    -e LOG_LEVEL=DEBUG \
    -v $(pwd)/output:/app/output \
    warhammer-rss --no-cache
# With resource limits
docker run --memory=512m --cpu-quota=50000 \
    -v $(pwd)/output:/app/output \
    warhammer-rss
```
## Command Line Options
```bash
Usage: main.py [OPTIONS]

Options:
  --url URL            URL to scrape (default: Warhammer Community)
  --output-dir PATH    Output directory for files
  --max-scroll INT     Maximum scroll iterations (default: 5)
  --log-level LEVEL    Logging level: DEBUG, INFO, WARNING, ERROR
  --log-file PATH      Log file path (default: scraper.log)
  --no-cache           Disable content caching
  --clear-cache        Clear cache before running
  --cache-info         Show cache information and exit
  -h, --help           Show help message
```
## Configuration
### Environment Variables
The application supports extensive configuration via environment variables:
```bash
# Scraping Configuration
MAX_SCROLL_ITERATIONS=5 # Number of scroll iterations
MAX_CONTENT_SIZE=10485760 # Maximum content size (10MB)
SCROLL_DELAY_SECONDS=2.0 # Delay between scrolls
PAGE_TIMEOUT_MS=120000 # Page load timeout
# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500 # Maximum title length
# Output Configuration
DEFAULT_OUTPUT_DIR="." # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"
# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
```
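A sketch of how `config.py` might read these values (the variable names match those above; the defaults and parsing are assumptions):
```python
import os

MAX_SCROLL_ITERATIONS = int(os.getenv("MAX_SCROLL_ITERATIONS", "5"))
MAX_CONTENT_SIZE = int(os.getenv("MAX_CONTENT_SIZE", str(10 * 1024 * 1024)))
SCROLL_DELAY_SECONDS = float(os.getenv("SCROLL_DELAY_SECONDS", "2.0"))
PAGE_TIMEOUT_MS = int(os.getenv("PAGE_TIMEOUT_MS", "120000"))
ALLOWED_DOMAINS = [
    domain.strip()
    for domain in os.getenv(
        "ALLOWED_DOMAINS",
        "warhammer-community.com,www.warhammer-community.com",
    ).split(",")
]
```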
### Cache Management
```bash
# View cache status
python main.py --cache-info
# Clear cache
python main.py --clear-cache
# Disable caching for a run
python main.py --no-cache
```
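Conceptually, the ETag cache turns each fetch into a conditional request. A standalone sketch using `requests` for clarity (the scraper itself drives Playwright, so the real code differs):
```python
import requests

def fetch_if_changed(url: str, etag: str | None) -> str | None:
    """Return fresh content, or None when the server replies 304 Not Modified."""
    headers = {"If-None-Match": etag} if etag else {}
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # cached copy is still current
    response.raise_for_status()
    return response.text
```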
## Project Structure
```
rss_warhammer/
├── main.py                 # CLI entry point
├── src/rss_scraper/        # Main package
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── validation.py       # URL and path validation
│   ├── scraper.py          # Web scraping with Playwright
│   ├── parser.py           # HTML parsing and article extraction
│   ├── rss_generator.py    # RSS feed generation
│   ├── cache.py            # Content caching system
│   ├── security.py         # Security utilities
│   └── retry_utils.py      # Retry logic with backoff
├── tests/                  # Comprehensive test suite
├── cache/                  # Cache directory (auto-created)
├── requirements.txt        # Python dependencies
├── pytest.ini              # Test configuration
├── Dockerfile              # Container configuration
└── README.md               # This file
```
## Output Files
The application generates:
- `warhammer_rss_feed.xml` - RSS feed with extracted articles
- `page.html` - Raw HTML for debugging (optional)
- `scraper.log` - Application logs
- `cache/` - Cached content and ETags
## Testing
Run the comprehensive test suite:
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=src/rss_scraper
# Run specific test categories
pytest -m unit # Unit tests only
pytest tests/test_parser.py # Specific module
```
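Tests follow the standard pytest style; for example (the import path and expected exception are assumptions about the package layout):
```python
import pytest

from rss_scraper.validation import validate_url

def test_validate_url_rejects_unknown_domain():
    with pytest.raises(ValueError):
        validate_url("https://example.com/")

def test_validate_url_accepts_whitelisted_domain():
    url = "https://www.warhammer-community.com/en-gb/"
    assert validate_url(url) == url
```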
## Error Handling
The application uses specific exit codes for different error types:
- `0` - Success
- `1` - Configuration/Validation error
- `2` - Network error
- `3` - Page loading error
- `4` - Content parsing error
- `5` - File operation error
- `6` - Content size exceeded
- `99` - Unexpected error
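
One plausible shape for the mapping in `main.py` (the exception class names follow the modular design described above, but they are assumptions rather than the project's exact API):
```python
class ScraperError(Exception): ...
class ConfigurationError(ScraperError): ...
class NetworkError(ScraperError): ...
class PageLoadError(ScraperError): ...
class ParseError(ScraperError): ...
class FileOperationError(ScraperError): ...
class ContentSizeError(ScraperError): ...

EXIT_CODES = {
    ConfigurationError: 1,
    NetworkError: 2,
    PageLoadError: 3,
    ParseError: 4,
    FileOperationError: 5,
    ContentSizeError: 6,
}

def exit_code_for(exc: Exception) -> int:
    """Map a raised exception to one of the exit codes listed above."""
    for exc_type, code in EXIT_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return 99  # unexpected error
```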
## Security Considerations
### Allowed Domains
The scraper only operates on whitelisted domains:
- `warhammer-community.com`
- `www.warhammer-community.com`
### Rate Limiting
- Default: 30 requests per minute
- Minimum delay: 2 seconds between requests
- Configurable via environment variables
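
A minimal limiter consistent with those numbers might look like this (a sketch only; the project's implementation may differ):
```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds: float = 2.0):
        self.min_delay = min_delay_seconds
        self._last_request = 0.0

    def wait(self) -> None:
        # Sleep just long enough to honour the configured minimum delay.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()
```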
### Content Sanitization
- HTML content sanitized using bleach
- Dangerous scripts and patterns removed
- File paths validated against directory traversal
- URL validation against malicious patterns
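
With bleach, the core of that sanitization reduces to roughly the following (the tag and attribute whitelists are examples, not the project's exact lists):
```python
import bleach

ALLOWED_TAGS = ["a", "p", "br", "strong", "em", "ul", "ol", "li"]
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}

def sanitize_html(raw_html: str) -> str:
    # strip=True removes disallowed tags entirely instead of escaping them
    return bleach.clean(
        raw_html,
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        strip=True,
    )
```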
## Deployment
### Production Deployment
1. **Environment Setup**:
```bash
# Create production environment file
cat > .env << EOF
MAX_SCROLL_ITERATIONS=3
SCROLL_DELAY_SECONDS=3.0
DEFAULT_OUTPUT_DIR=/app/data
LOG_LEVEL=INFO
EOF
```
2. **Docker Compose** (recommended):
```yaml
version: '3.8'
services:
  rss-scraper:
    build: .
    environment:
      - MAX_SCROLL_ITERATIONS=3
      - LOG_LEVEL=INFO
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    restart: unless-stopped
    mem_limit: 512m
    cpus: 0.5
```
3. **Cron Schedule**:
```bash
# Add to crontab for regular updates
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
```
## Development
### Setup Development Environment
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort
# Install pre-commit hooks (optional)
pre-commit install
# Run tests
pytest
# Format code
black src/ tests/
isort src/ tests/
```
### Adding New Features
1. Follow the modular architecture
2. Add type hints to all functions
3. Include comprehensive error handling
4. Write tests for new functionality
5. Update configuration if needed
6. Document changes in README
## Troubleshooting
### Common Issues
1. **Permission Errors**:
   - Ensure output directory is writable
   - Use proper Docker volume mounting
2. **Memory Issues**:
   - Reduce `MAX_SCROLL_ITERATIONS`
   - Increase Docker memory limits
3. **Rate Limiting**:
   - Increase `SCROLL_DELAY_SECONDS`
   - Check network connectivity
4. **Cache Issues**:
   - Clear cache with `--clear-cache`
   - Check cache directory permissions
### Debug Mode
```bash
# Enable debug logging
python main.py --log-level DEBUG
# Disable caching for testing
python main.py --no-cache --log-level DEBUG
```
## License
This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request
## Changelog
### Version 1.0.0
- Complete rewrite with modular architecture
- Added comprehensive caching system
- Implemented rate limiting and security hardening
- Full test coverage with pytest
- Production-ready Docker container
- Extensive configuration management
- Structured logging and error handling