
Warhammer Community RSS Scraper

A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

Overview

This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.

Features

Core Functionality

  • Scrapes articles from Warhammer Community website
  • Generates properly formatted RSS feeds
  • Handles duplicate article detection
  • Sorts articles by publication date (newest first)
  • Saves both RSS feed and debug HTML
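
Duplicate detection and newest-first ordering can be sketched as below; the article shape (`link`, `published` keys) is illustrative, not the project's actual data model:

```python
from datetime import datetime

def dedupe_and_sort(articles: list[dict]) -> list[dict]:
    """Drop articles with duplicate links, then sort newest first."""
    seen = set()
    unique = []
    for article in articles:
        if article["link"] not in seen:
            seen.add(article["link"])
            unique.append(article)
    # Newest publication date first
    return sorted(unique, key=lambda a: a["published"], reverse=True)
```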

Production Features

  • Modular Architecture: Clean separation of concerns with dedicated modules
  • Comprehensive Logging: Structured logging with configurable levels
  • Configuration Management: Environment-based configuration
  • Caching: Intelligent content caching with ETags and conditional requests
  • Rate Limiting: Respectful scraping with configurable delays
  • Retry Logic: Exponential backoff for network failures
  • Type Safety: Full type hints throughout codebase
  • Comprehensive Tests: Unit tests with pytest framework
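
Retry with exponential backoff, as listed above, can be sketched as a small decorator; `max_attempts` and `base_delay` are illustrative names, not the project's actual API:

```python
import time
from functools import wraps

def retry_with_backoff(max_attempts=3, base_delay=1.0, exceptions=(OSError,)):
    """Retry a function on failure, doubling the delay each attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```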

Security Features

  • URL Validation: Whitelist-based domain validation
  • Content Sanitization: HTML sanitization using bleach library
  • Path Validation: Prevention of directory traversal attacks
  • Resource Limits: Memory and execution time constraints
  • Input Validation: Comprehensive argument and data validation
  • Non-root Execution: Secure container execution
  • File Sanitization: Safe filename handling
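
Directory-traversal prevention along the lines described can be sketched like this (the helper name is hypothetical):

```python
from pathlib import Path

def safe_output_path(output_dir: str, filename: str) -> Path:
    """Resolve filename inside output_dir, rejecting traversal attempts."""
    base = Path(output_dir).resolve()
    candidate = (base / filename).resolve()
    # After resolving symlinks and "..", the result must stay under base
    if base not in candidate.parents and candidate != base:
        raise ValueError(f"Path escapes output directory: {filename}")
    return candidate
```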

Requirements

  • Python 3.12+
  • Dependencies listed in requirements.txt

Installation

Local Setup

  1. Install dependencies:
pip install -r requirements.txt
  2. Install Playwright browsers:
playwright install
  3. Run the scraper:
# Basic usage
python main.py

# With custom options
python main.py --url https://www.warhammer-community.com/en-gb/ \
               --output-dir ./output \
               --log-level DEBUG \
               --max-scroll 3

# View all options
python main.py --help

Docker Setup

  1. Build the Docker image:
docker build -t warhammer-rss .
  2. Run the container:
# Basic usage
docker run -v $(pwd)/output:/app/output warhammer-rss

# With custom configuration
docker run -e MAX_SCROLL_ITERATIONS=3 \
           -e LOG_LEVEL=DEBUG \
           -v $(pwd)/output:/app/output \
           warhammer-rss --no-cache

# With resource limits
docker run --memory=512m --cpu-quota=50000 \
           -v $(pwd)/output:/app/output \
           warhammer-rss

Command Line Options

Usage: main.py [OPTIONS]

Options:
  --url URL              URL to scrape (default: Warhammer Community)
  --output-dir PATH      Output directory for files
  --max-scroll INT       Maximum scroll iterations (default: 5)
  --log-level LEVEL      Logging level: DEBUG, INFO, WARNING, ERROR
  --log-file PATH        Log file path (default: scraper.log)
  --no-cache             Disable content caching
  --clear-cache          Clear cache before running
  --cache-info           Show cache information and exit
  -h, --help             Show help message

Configuration

Environment Variables

The application supports extensive configuration via environment variables:

# Scraping Configuration
MAX_SCROLL_ITERATIONS=5      # Number of scroll iterations
MAX_CONTENT_SIZE=10485760    # Maximum content size (10MB)
SCROLL_DELAY_SECONDS=2.0     # Delay between scrolls
PAGE_TIMEOUT_MS=120000       # Page load timeout

# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500         # Maximum title length

# Output Configuration
DEFAULT_OUTPUT_DIR="."       # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"

# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
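
Configuration of this kind is typically loaded once at startup; a minimal sketch, with field names mirroring the variables above and defaults taken from the documented values (the class name is illustrative):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ScraperConfig:
    max_scroll_iterations: int = 5
    max_content_size: int = 10 * 1024 * 1024  # 10MB
    scroll_delay_seconds: float = 2.0
    rss_filename: str = "warhammer_rss_feed.xml"

    @classmethod
    def from_env(cls) -> "ScraperConfig":
        """Read settings from the environment, falling back to defaults."""
        return cls(
            max_scroll_iterations=int(os.environ.get("MAX_SCROLL_ITERATIONS", 5)),
            max_content_size=int(os.environ.get("MAX_CONTENT_SIZE", 10 * 1024 * 1024)),
            scroll_delay_seconds=float(os.environ.get("SCROLL_DELAY_SECONDS", 2.0)),
            rss_filename=os.environ.get("RSS_FILENAME", "warhammer_rss_feed.xml"),
        )
```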

Cache Management

# View cache status
python main.py --cache-info

# Clear cache
python main.py --clear-cache

# Disable caching for a run
python main.py --no-cache

Project Structure

rss_warhammer/
├── main.py                 # CLI entry point
├── src/rss_scraper/        # Main package
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── validation.py       # URL and path validation
│   ├── scraper.py          # Web scraping with Playwright
│   ├── parser.py           # HTML parsing and article extraction
│   ├── rss_generator.py    # RSS feed generation
│   ├── cache.py            # Content caching system
│   ├── security.py         # Security utilities
│   └── retry_utils.py      # Retry logic with backoff
├── tests/                  # Comprehensive test suite
├── cache/                  # Cache directory (auto-created)
├── requirements.txt        # Python dependencies
├── pytest.ini            # Test configuration
├── Dockerfile             # Container configuration
└── README.md              # This file

Output Files

The application generates:

  • warhammer_rss_feed.xml - RSS feed with extracted articles
  • page.html - Raw HTML for debugging (optional)
  • scraper.log - Application logs
  • cache/ - Cached content and ETags

Testing

Run the comprehensive test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=src/rss_scraper

# Run specific test categories
pytest -m unit              # Unit tests only
pytest tests/test_parser.py  # Specific module

Error Handling

The application uses specific exit codes for different error types:

  • 0 - Success
  • 1 - Configuration/Validation error
  • 2 - Network error
  • 3 - Page loading error
  • 4 - Content parsing error
  • 5 - File operation error
  • 6 - Content size exceeded
  • 99 - Unexpected error
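
A mapping like the one above is typically implemented by catching custom exceptions at the CLI entry point; the exception names below are hypothetical stand-ins for those in src/rss_scraper/exceptions.py:

```python
# Hypothetical custom exception hierarchy for illustration only
class ConfigurationError(Exception): pass
class NetworkError(Exception): pass
class ParseError(Exception): pass

EXIT_CODES = {
    ConfigurationError: 1,
    NetworkError: 2,
    ParseError: 4,
}

def run_with_exit_codes(main_func) -> int:
    """Run main_func and translate known exceptions into exit codes."""
    try:
        main_func()
        return 0
    except tuple(EXIT_CODES) as exc:
        return EXIT_CODES[type(exc)]
    except Exception:
        return 99  # unexpected error
```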

Security Considerations

Allowed Domains

The scraper only operates on whitelisted domains:

  • warhammer-community.com
  • www.warhammer-community.com
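
Whitelist validation of this sort can be sketched with the standard library (`validate_url` is an illustrative name):

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def validate_url(url: str) -> str:
    """Accept only https URLs whose host is on the whitelist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"Refusing non-https URL: {url}")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"Domain not whitelisted: {parsed.hostname}")
    return url
```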

Rate Limiting

  • Default: 30 requests per minute
  • Minimum delay: 2 seconds between requests
  • Configurable via environment variables
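
The minimum-delay behavior can be sketched as a small limiter; the injectable `clock`/`sleep` parameters are an illustrative testing convenience, not the project's actual interface:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Block until at least min_delay has passed since the last call."""
        now = self.clock()
        if self.last_request is not None and now - self.last_request < self.min_delay:
            self.sleep(self.min_delay - (now - self.last_request))
            now = self.last_request + self.min_delay
        self.last_request = now
```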

Content Sanitization

  • HTML content sanitized using bleach
  • Dangerous scripts and patterns removed
  • File paths validated against directory traversal
  • URL validation against malicious patterns

Deployment

Production Deployment

  1. Environment Setup:
# Create production environment file
cat > .env << EOF
MAX_SCROLL_ITERATIONS=3
SCROLL_DELAY_SECONDS=3.0
DEFAULT_OUTPUT_DIR=/app/data
LOG_LEVEL=INFO
EOF
  2. Docker Compose (recommended):
version: '3.8'
services:
  rss-scraper:
    build: .
    environment:
      - MAX_SCROLL_ITERATIONS=3
      - LOG_LEVEL=INFO
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    restart: unless-stopped
    mem_limit: 512m
    cpus: 0.5
  3. Cron Schedule:
# Add to crontab for regular updates
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss

Development

Setup Development Environment

# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort

# Install pre-commit hooks (optional)
pre-commit install

# Run tests
pytest

# Format code
black src/ tests/
isort src/ tests/

Adding New Features

  1. Follow the modular architecture
  2. Add type hints to all functions
  3. Include comprehensive error handling
  4. Write tests for new functionality
  5. Update configuration if needed
  6. Document changes in README

Troubleshooting

Common Issues

  1. Permission Errors:

    • Ensure output directory is writable
    • Use proper Docker volume mounting
  2. Memory Issues:

    • Reduce MAX_SCROLL_ITERATIONS
    • Increase Docker memory limits
  3. Rate Limiting:

    • Increase SCROLL_DELAY_SECONDS
    • Check network connectivity
  4. Cache Issues:

    • Clear cache with --clear-cache
    • Check cache directory permissions

Debug Mode

# Enable debug logging
python main.py --log-level DEBUG

# Disable caching for testing
python main.py --no-cache --log-level DEBUG

License

This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Changelog

Version 1.0.0

  • Complete rewrite with modular architecture
  • Added comprehensive caching system
  • Implemented rate limiting and security hardening
  • Full test coverage with pytest
  • Production-ready Docker container
  • Extensive configuration management
  • Structured logging and error handling