Add comprehensive RSS scraper implementation with security and testing
- Modular architecture with separate modules for scraping, parsing, security, validation, and caching - Comprehensive security measures including HTML sanitization, rate limiting, and input validation - Robust error handling with custom exceptions and retry logic - HTTP caching with ETags and Last-Modified headers for efficiency - Pre-compiled regex patterns for improved performance - Comprehensive test suite with 66 tests covering all major functionality - Docker support for containerized deployment - Configuration management with environment variable support - Working parser that successfully extracts 32 articles from Warhammer Community 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
358
README.md
358
README.md
@ -1,31 +1,43 @@
|
||||
# Warhammer Community RSS Scraper
|
||||
|
||||
A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
|
||||
A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
|
||||
|
||||
## Overview
|
||||
|
||||
This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.
|
||||
This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.
|
||||
|
||||
## Features
|
||||
|
||||
### Core Functionality
|
||||
- Scrapes articles from Warhammer Community website
|
||||
- Generates RSS feed with proper formatting
|
||||
- Generates properly formatted RSS feeds
|
||||
- Handles duplicate article detection
|
||||
- Sorts articles by publication date (newest first)
|
||||
- Dockerized for easy deployment
|
||||
- Saves both RSS feed and raw HTML for debugging
|
||||
- **Security-focused**: URL validation, content filtering, and resource limits
|
||||
- **Safe execution**: Runs as non-root user in container
|
||||
- Saves both RSS feed and debug HTML
|
||||
|
||||
### Production Features
|
||||
- **Modular Architecture**: Clean separation of concerns with dedicated modules
|
||||
- **Comprehensive Logging**: Structured logging with configurable levels
|
||||
- **Configuration Management**: Environment-based configuration
|
||||
- **Caching**: Intelligent content caching with ETags and conditional requests
|
||||
- **Rate Limiting**: Respectful scraping with configurable delays
|
||||
- **Retry Logic**: Exponential backoff for network failures
|
||||
- **Type Safety**: Full type hints throughout codebase
|
||||
- **Comprehensive Tests**: Unit tests with pytest framework
|
||||
|
||||
### Security Features
|
||||
- **URL Validation**: Whitelist-based domain validation
|
||||
- **Content Sanitization**: HTML sanitization using bleach library
|
||||
- **Path Validation**: Prevention of directory traversal attacks
|
||||
- **Resource Limits**: Memory and execution time constraints
|
||||
- **Input Validation**: Comprehensive argument and data validation
|
||||
- **Non-root Execution**: Secure container execution
|
||||
- **File Sanitization**: Safe filename handling
|
||||
|
||||
## Requirements
|
||||
|
||||
- Python 3.12+
|
||||
- Dependencies listed in `requirements.txt`:
|
||||
- playwright
|
||||
- beautifulsoup4
|
||||
- feedgen
|
||||
- pytz
|
||||
- requests
|
||||
- Dependencies listed in `requirements.txt`
|
||||
|
||||
## Installation
|
||||
|
||||
@ -41,16 +53,19 @@ pip install -r requirements.txt
|
||||
playwright install
|
||||
```
|
||||
|
||||
3. Run the script:
|
||||
3. Run the scraper:
|
||||
```bash
|
||||
# Default: saves to current directory
|
||||
# Basic usage
|
||||
python main.py
|
||||
|
||||
# Or specify output directory
|
||||
python main.py /path/to/output
|
||||
# With custom options
|
||||
python main.py --url https://www.warhammer-community.com/en-gb/ \
|
||||
--output-dir ./output \
|
||||
--log-level DEBUG \
|
||||
--max-scroll 3
|
||||
|
||||
# Or use environment variable
|
||||
OUTPUT_DIR=/path/to/output python main.py
|
||||
# View all options
|
||||
python main.py --help
|
||||
```
|
||||
|
||||
### Docker Setup
|
||||
@ -60,58 +75,275 @@ OUTPUT_DIR=/path/to/output python main.py
|
||||
docker build -t warhammer-rss .
|
||||
```
|
||||
|
||||
2. Run the container (multiple options to avoid permission issues):
|
||||
|
||||
**Option A: Save to current directory (simplest)**
|
||||
2. Run the container:
|
||||
```bash
|
||||
docker run -v $(pwd):/app/output warhammer-rss
|
||||
# Basic usage
|
||||
docker run -v $(pwd)/output:/app/output warhammer-rss
|
||||
|
||||
# With custom configuration
|
||||
docker run -e MAX_SCROLL_ITERATIONS=3 \
|
||||
-e LOG_LEVEL=DEBUG \
|
||||
-v $(pwd)/output:/app/output \
|
||||
warhammer-rss --no-cache
|
||||
|
||||
# With resource limits
|
||||
docker run --memory=512m --cpu-quota=50000 \
|
||||
-v $(pwd)/output:/app/output \
|
||||
warhammer-rss
|
||||
```
|
||||
|
||||
**Option B: Use environment variable for output directory**
|
||||
## Command Line Options
|
||||
|
||||
```bash
|
||||
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
|
||||
Usage: main.py [OPTIONS]
|
||||
|
||||
Options:
|
||||
--url URL URL to scrape (default: Warhammer Community)
|
||||
--output-dir PATH Output directory for files
|
||||
--max-scroll INT Maximum scroll iterations (default: 5)
|
||||
--log-level LEVEL Logging level: DEBUG, INFO, WARNING, ERROR
|
||||
--log-file PATH Log file path (default: scraper.log)
|
||||
--no-cache Disable content caching
|
||||
--clear-cache Clear cache before running
|
||||
--cache-info Show cache information and exit
|
||||
-h, --help Show help message
|
||||
```
|
||||
|
||||
**Option C: With resource limits for additional security**
|
||||
```bash
|
||||
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
|
||||
```
|
||||
|
||||
## Output
|
||||
|
||||
The application generates:
|
||||
- `warhammer_rss_feed.xml` - RSS feed file
|
||||
- `page.html` - Raw HTML content for debugging
|
||||
|
||||
Both files are saved to the specified output directory (current directory by default).
|
||||
|
||||
## Security Features
|
||||
|
||||
This application implements several security measures:
|
||||
|
||||
- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
|
||||
- **Path Validation**: Prevents directory traversal attacks by validating output paths
|
||||
- **Resource Limits**: Caps content size (10MB) and scroll iterations (5) to prevent DoS
|
||||
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
|
||||
- **Non-root Execution**: Docker container runs as user `scraper` (UID 1001) for reduced privilege
|
||||
- **Input Sanitization**: All URLs and file paths are validated before use
|
||||
|
||||
## How It Works
|
||||
|
||||
1. **Validates** the target URL against whitelist of allowed domains
|
||||
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
|
||||
3. Scrolls through the page to load additional content (limited to 5 iterations)
|
||||
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
|
||||
5. **Sanitizes** and extracts article titles, links, and publication dates
|
||||
6. **Validates all links** against allowed domains
|
||||
7. Removes duplicates and sorts by date
|
||||
8. Generates RSS feed using feedgen library
|
||||
9. **Validates output paths** before saving files
|
||||
|
||||
## Configuration
|
||||
|
||||
The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
|
||||
### Environment Variables
|
||||
|
||||
The application supports extensive configuration via environment variables:
|
||||
|
||||
```bash
|
||||
# Scraping Configuration
|
||||
MAX_SCROLL_ITERATIONS=5 # Number of scroll iterations
|
||||
MAX_CONTENT_SIZE=10485760 # Maximum content size (10MB)
|
||||
SCROLL_DELAY_SECONDS=2.0 # Delay between scrolls
|
||||
PAGE_TIMEOUT_MS=120000 # Page load timeout
|
||||
|
||||
# Security Configuration
|
||||
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
|
||||
MAX_TITLE_LENGTH=500 # Maximum title length
|
||||
|
||||
# Output Configuration
|
||||
DEFAULT_OUTPUT_DIR="." # Default output directory
|
||||
RSS_FILENAME="warhammer_rss_feed.xml"
|
||||
DEBUG_HTML_FILENAME="page.html"
|
||||
|
||||
# Feed Metadata
|
||||
FEED_TITLE="Warhammer Community RSS Feed"
|
||||
FEED_DESCRIPTION="Latest Warhammer Community Articles"
|
||||
```
|
||||
|
||||
### Cache Management
|
||||
|
||||
```bash
|
||||
# View cache status
|
||||
python main.py --cache-info
|
||||
|
||||
# Clear cache
|
||||
python main.py --clear-cache
|
||||
|
||||
# Disable caching for a run
|
||||
python main.py --no-cache
|
||||
```
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
rss_warhammer/
|
||||
├── main.py # CLI entry point
|
||||
├── src/rss_scraper/ # Main package
|
||||
│ ├── __init__.py
|
||||
│ ├── config.py # Configuration management
|
||||
│ ├── exceptions.py # Custom exceptions
|
||||
│ ├── validation.py # URL and path validation
|
||||
│ ├── scraper.py # Web scraping with Playwright
|
||||
│ ├── parser.py # HTML parsing and article extraction
|
||||
│ ├── rss_generator.py # RSS feed generation
|
||||
│ ├── cache.py # Content caching system
|
||||
│ ├── security.py # Security utilities
|
||||
│ └── retry_utils.py # Retry logic with backoff
|
||||
├── tests/ # Comprehensive test suite
|
||||
├── cache/ # Cache directory (auto-created)
|
||||
├── requirements.txt # Python dependencies
|
||||
├── pytest.ini # Test configuration
|
||||
├── Dockerfile # Container configuration
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Output Files
|
||||
|
||||
The application generates:
|
||||
- `warhammer_rss_feed.xml` - RSS feed with extracted articles
|
||||
- `page.html` - Raw HTML for debugging (optional)
|
||||
- `scraper.log` - Application logs
|
||||
- `cache/` - Cached content and ETags
|
||||
|
||||
## Testing
|
||||
|
||||
Run the comprehensive test suite:
|
||||
|
||||
```bash
|
||||
# Run all tests
|
||||
pytest
|
||||
|
||||
# Run with coverage
|
||||
pytest --cov=src/rss_scraper
|
||||
|
||||
# Run specific test categories
|
||||
pytest -m unit # Unit tests only
|
||||
pytest tests/test_parser.py # Specific module
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
The application uses specific exit codes for different error types:
|
||||
|
||||
- `0` - Success
|
||||
- `1` - Configuration/Validation error
|
||||
- `2` - Network error
|
||||
- `3` - Page loading error
|
||||
- `4` - Content parsing error
|
||||
- `5` - File operation error
|
||||
- `6` - Content size exceeded
|
||||
- `99` - Unexpected error
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Allowed Domains
|
||||
The scraper only operates on whitelisted domains:
|
||||
- `warhammer-community.com`
|
||||
- `www.warhammer-community.com`
|
||||
|
||||
To modify allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
|
||||
### Rate Limiting
|
||||
- Default: 30 requests per minute
|
||||
- Minimum delay: 2 seconds between requests
|
||||
- Configurable via environment variables
|
||||
|
||||
### Content Sanitization
|
||||
- HTML content sanitized using bleach
|
||||
- Dangerous scripts and patterns removed
|
||||
- File paths validated against directory traversal
|
||||
- URL validation against malicious patterns
|
||||
|
||||
## Deployment
|
||||
|
||||
### Production Deployment
|
||||
|
||||
1. **Environment Setup**:
|
||||
```bash
|
||||
# Create production environment file
|
||||
cat > .env << EOF
|
||||
MAX_SCROLL_ITERATIONS=3
|
||||
SCROLL_DELAY_SECONDS=3.0
|
||||
DEFAULT_OUTPUT_DIR=/app/data
|
||||
LOG_LEVEL=INFO
|
||||
EOF
|
||||
```
|
||||
|
||||
2. **Docker Compose** (recommended):
|
||||
```yaml
|
||||
version: '3.8'
|
||||
services:
|
||||
rss-scraper:
|
||||
build: .
|
||||
environment:
|
||||
- MAX_SCROLL_ITERATIONS=3
|
||||
- LOG_LEVEL=INFO
|
||||
volumes:
|
||||
- ./output:/app/output
|
||||
- ./logs:/app/logs
|
||||
restart: unless-stopped
|
||||
memory: 512m
|
||||
cpus: 0.5
|
||||
```
|
||||
|
||||
3. **Cron Schedule**:
|
||||
```bash
|
||||
# Add to crontab for regular updates
|
||||
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss
|
||||
```
|
||||
|
||||
## Development
|
||||
|
||||
### Setup Development Environment
|
||||
|
||||
```bash
|
||||
# Install development dependencies
|
||||
pip install -r requirements.txt
|
||||
pip install pytest pytest-cov black isort
|
||||
|
||||
# Install pre-commit hooks (optional)
|
||||
pre-commit install
|
||||
|
||||
# Run tests
|
||||
pytest
|
||||
|
||||
# Format code
|
||||
black src/ tests/
|
||||
isort src/ tests/
|
||||
```
|
||||
|
||||
### Adding New Features
|
||||
|
||||
1. Follow the modular architecture
|
||||
2. Add type hints to all functions
|
||||
3. Include comprehensive error handling
|
||||
4. Write tests for new functionality
|
||||
5. Update configuration if needed
|
||||
6. Document changes in README
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **Permission Errors**:
|
||||
- Ensure output directory is writable
|
||||
- Use proper Docker volume mounting
|
||||
|
||||
2. **Memory Issues**:
|
||||
- Reduce `MAX_SCROLL_ITERATIONS`
|
||||
- Increase Docker memory limits
|
||||
|
||||
3. **Rate Limiting**:
|
||||
- Increase `SCROLL_DELAY_SECONDS`
|
||||
- Check network connectivity
|
||||
|
||||
4. **Cache Issues**:
|
||||
- Clear cache with `--clear-cache`
|
||||
- Check cache directory permissions
|
||||
|
||||
### Debug Mode
|
||||
|
||||
```bash
|
||||
# Enable debug logging
|
||||
python main.py --log-level DEBUG
|
||||
|
||||
# Disable caching for testing
|
||||
python main.py --no-cache --log-level DEBUG
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.
|
||||
|
||||
## Contributing
|
||||
|
||||
1. Fork the repository
|
||||
2. Create a feature branch
|
||||
3. Add tests for new functionality
|
||||
4. Ensure all tests pass
|
||||
5. Submit a pull request
|
||||
|
||||
## Changelog
|
||||
|
||||
### Version 1.0.0
|
||||
- Complete rewrite with modular architecture
|
||||
- Added comprehensive caching system
|
||||
- Implemented rate limiting and security hardening
|
||||
- Full test coverage with pytest
|
||||
- Production-ready Docker container
|
||||
- Extensive configuration management
|
||||
- Structured logging and error handling
|
Reference in New Issue
Block a user