
Warhammer Community RSS Scraper

A production-ready Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

Overview

This project provides a robust, secure, and scalable RSS scraper for the Warhammer Community website. It features comprehensive error handling, caching, rate limiting, and security measures suitable for production deployment.

Features

Core Functionality

  • Scrapes articles from Warhammer Community website
  • Generates properly formatted RSS feeds
  • Handles duplicate article detection
  • Sorts articles by publication date (newest first)
  • Saves both RSS feed and debug HTML
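
Duplicate detection and newest-first ordering can be sketched as below; the article shape (`link`, `published` keys) is illustrative, not the project's actual data model:

```python
from datetime import datetime

def dedupe_and_sort(articles: list[dict]) -> list[dict]:
    """Drop articles with duplicate links, then sort newest first."""
    seen = set()
    unique = []
    for article in articles:
        if article["link"] not in seen:
            seen.add(article["link"])
            unique.append(article)
    # Newest publication date first
    return sorted(unique, key=lambda a: a["published"], reverse=True)
```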

Production Features

  • Modular Architecture: Clean separation of concerns with dedicated modules
  • Comprehensive Logging: Structured logging with configurable levels
  • Configuration Management: Environment-based configuration
  • Caching: Intelligent content caching with ETags and conditional requests
  • Rate Limiting: Respectful scraping with configurable delays
  • Retry Logic: Exponential backoff for network failures
  • Type Safety: Full type hints throughout codebase
  • Comprehensive Tests: Unit tests with pytest framework
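
Retry with exponential backoff, as listed above, can be sketched as a small decorator; `max_attempts` and `base_delay` are illustrative names, not the project's actual API:

```python
import time
from functools import wraps

def retry_with_backoff(max_attempts=3, base_delay=1.0, exceptions=(OSError,)):
    """Retry a function on failure, doubling the delay each attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```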

Security Features

  • URL Validation: Whitelist-based domain validation
  • Content Sanitization: HTML sanitization using bleach library
  • Path Validation: Prevention of directory traversal attacks
  • Resource Limits: Memory and execution time constraints
  • Input Validation: Comprehensive argument and data validation
  • Non-root Execution: Secure container execution
  • File Sanitization: Safe filename handling
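
Directory-traversal prevention along the lines described can be sketched like this (the helper name is hypothetical):

```python
from pathlib import Path

def safe_output_path(output_dir: str, filename: str) -> Path:
    """Resolve filename inside output_dir, rejecting traversal attempts."""
    base = Path(output_dir).resolve()
    candidate = (base / filename).resolve()
    # After resolving symlinks and "..", the result must stay under base
    if base not in candidate.parents and candidate != base:
        raise ValueError(f"Path escapes output directory: {filename}")
    return candidate
```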

Requirements

  • Python 3.12+
  • Dependencies listed in requirements.txt

Installation

Local Setup

  1. Install dependencies:
pip install -r requirements.txt
  2. Install Playwright browsers:
playwright install
  3. Run the scraper:
# Basic usage
python main.py

# With custom options
python main.py --url https://www.warhammer-community.com/en-gb/ \
               --output-dir ./output \
               --log-level DEBUG \
               --max-scroll 3

# View all options
python main.py --help

Docker Setup

  1. Build the Docker image:
docker build -t warhammer-rss .
  2. Run the container:
# Basic usage
docker run -v $(pwd)/output:/app/output warhammer-rss

# With custom configuration
docker run -e MAX_SCROLL_ITERATIONS=3 \
           -e LOG_LEVEL=DEBUG \
           -v $(pwd)/output:/app/output \
           warhammer-rss --no-cache

# With resource limits
docker run --memory=512m --cpu-quota=50000 \
           -v $(pwd)/output:/app/output \
           warhammer-rss

Command Line Options

Usage: main.py [OPTIONS]

Options:
  --url URL              URL to scrape (default: Warhammer Community)
  --output-dir PATH      Output directory for files
  --max-scroll INT       Maximum scroll iterations (default: 5)
  --log-level LEVEL      Logging level: DEBUG, INFO, WARNING, ERROR
  --log-file PATH        Log file path (default: scraper.log)
  --no-cache             Disable content caching
  --clear-cache          Clear cache before running
  --cache-info           Show cache information and exit
  -h, --help             Show help message

Configuration

Environment Variables

The application supports extensive configuration via environment variables:

# Scraping Configuration
MAX_SCROLL_ITERATIONS=5      # Number of scroll iterations
MAX_CONTENT_SIZE=10485760    # Maximum content size (10MB)
SCROLL_DELAY_SECONDS=2.0     # Delay between scrolls
PAGE_TIMEOUT_MS=120000       # Page load timeout

# Security Configuration
ALLOWED_DOMAINS="warhammer-community.com,www.warhammer-community.com"
MAX_TITLE_LENGTH=500         # Maximum title length

# Output Configuration
DEFAULT_OUTPUT_DIR="."       # Default output directory
RSS_FILENAME="warhammer_rss_feed.xml"
DEBUG_HTML_FILENAME="page.html"

# Feed Metadata
FEED_TITLE="Warhammer Community RSS Feed"
FEED_DESCRIPTION="Latest Warhammer Community Articles"
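
Configuration of this kind is typically loaded once at startup; a minimal sketch, with field names mirroring the variables above and defaults taken from the documented values (the class name is illustrative):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ScraperConfig:
    max_scroll_iterations: int = 5
    max_content_size: int = 10 * 1024 * 1024  # 10MB
    scroll_delay_seconds: float = 2.0
    rss_filename: str = "warhammer_rss_feed.xml"

    @classmethod
    def from_env(cls) -> "ScraperConfig":
        """Read settings from the environment, falling back to defaults."""
        return cls(
            max_scroll_iterations=int(os.environ.get("MAX_SCROLL_ITERATIONS", 5)),
            max_content_size=int(os.environ.get("MAX_CONTENT_SIZE", 10 * 1024 * 1024)),
            scroll_delay_seconds=float(os.environ.get("SCROLL_DELAY_SECONDS", 2.0)),
            rss_filename=os.environ.get("RSS_FILENAME", "warhammer_rss_feed.xml"),
        )
```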

Cache Management

# View cache status
python main.py --cache-info

# Clear cache
python main.py --clear-cache

# Disable caching for a run
python main.py --no-cache

Project Structure

rss_warhammer/
├── main.py                 # CLI entry point
├── src/rss_scraper/        # Main package
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── validation.py       # URL and path validation
│   ├── scraper.py          # Web scraping with Playwright
│   ├── parser.py           # HTML parsing and article extraction
│   ├── rss_generator.py    # RSS feed generation
│   ├── cache.py            # Content caching system
│   ├── security.py         # Security utilities
│   └── retry_utils.py      # Retry logic with backoff
├── tests/                  # Comprehensive test suite
├── cache/                  # Cache directory (auto-created)
├── requirements.txt        # Python dependencies
├── pytest.ini            # Test configuration
├── Dockerfile             # Container configuration
└── README.md              # This file

Output Files

The application generates:

  • warhammer_rss_feed.xml - RSS feed with extracted articles
  • page.html - Raw HTML for debugging (optional)
  • scraper.log - Application logs
  • cache/ - Cached content and ETags

Testing

Run the comprehensive test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=src/rss_scraper

# Run specific test categories
pytest -m unit              # Unit tests only
pytest tests/test_parser.py  # Specific module

Error Handling

The application uses specific exit codes for different error types:

  • 0 - Success
  • 1 - Configuration/Validation error
  • 2 - Network error
  • 3 - Page loading error
  • 4 - Content parsing error
  • 5 - File operation error
  • 6 - Content size exceeded
  • 99 - Unexpected error
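
A mapping like the one above is typically implemented by catching custom exceptions at the CLI entry point; the exception names below are hypothetical stand-ins for those in src/rss_scraper/exceptions.py:

```python
# Hypothetical custom exception hierarchy for illustration only
class ConfigurationError(Exception): pass
class NetworkError(Exception): pass
class ParseError(Exception): pass

EXIT_CODES = {
    ConfigurationError: 1,
    NetworkError: 2,
    ParseError: 4,
}

def run_with_exit_codes(main_func) -> int:
    """Run main_func and translate known exceptions into exit codes."""
    try:
        main_func()
        return 0
    except tuple(EXIT_CODES) as exc:
        return EXIT_CODES[type(exc)]
    except Exception:
        return 99  # unexpected error
```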

Security Considerations

Allowed Domains

The scraper only operates on whitelisted domains:

  • warhammer-community.com
  • www.warhammer-community.com
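
Whitelist validation of this sort can be sketched with the standard library (`validate_url` is an illustrative name):

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def validate_url(url: str) -> str:
    """Accept only https URLs whose host is on the whitelist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"Refusing non-https URL: {url}")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"Domain not whitelisted: {parsed.hostname}")
    return url
```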

Rate Limiting

  • Default: 30 requests per minute
  • Minimum delay: 2 seconds between requests
  • Configurable via environment variables
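
The minimum-delay behavior can be sketched as a small limiter; the injectable `clock`/`sleep` parameters are an illustrative testing convenience, not the project's actual interface:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Block until at least min_delay has passed since the last call."""
        now = self.clock()
        if self.last_request is not None and now - self.last_request < self.min_delay:
            self.sleep(self.min_delay - (now - self.last_request))
            now = self.last_request + self.min_delay
        self.last_request = now
```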

Content Sanitization

  • HTML content sanitized using bleach
  • Dangerous scripts and patterns removed
  • File paths validated against directory traversal
  • URL validation against malicious patterns

Deployment

Production Deployment

  1. Environment Setup:
# Create production environment file
cat > .env << EOF
MAX_SCROLL_ITERATIONS=3
SCROLL_DELAY_SECONDS=3.0
DEFAULT_OUTPUT_DIR=/app/data
LOG_LEVEL=INFO
EOF
  2. Docker Compose (recommended):
version: '3.8'
services:
  rss-scraper:
    build: .
    environment:
      - MAX_SCROLL_ITERATIONS=3
      - LOG_LEVEL=INFO
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    restart: unless-stopped
    mem_limit: 512m
    cpus: 0.5
  3. Cron Schedule:
# Add to crontab for regular updates
0 */6 * * * docker run --rm -v /path/to/output:/app/output warhammer-rss

Development

Setup Development Environment

# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-cov black isort

# Install pre-commit hooks (optional)
pre-commit install

# Run tests
pytest

# Format code
black src/ tests/
isort src/ tests/

Adding New Features

  1. Follow the modular architecture
  2. Add type hints to all functions
  3. Include comprehensive error handling
  4. Write tests for new functionality
  5. Update configuration if needed
  6. Document changes in README

Troubleshooting

Common Issues

  1. Permission Errors:

    • Ensure output directory is writable
    • Use proper Docker volume mounting
  2. Memory Issues:

    • Reduce MAX_SCROLL_ITERATIONS
    • Increase Docker memory limits
  3. Rate Limiting:

    • Increase SCROLL_DELAY_SECONDS
    • Check network connectivity
  4. Cache Issues:

    • Clear cache with --clear-cache
    • Check cache directory permissions

Debug Mode

# Enable debug logging
python main.py --log-level DEBUG

# Disable caching for testing
python main.py --no-cache --log-level DEBUG

License

This project is provided as-is for educational purposes. Please respect the Warhammer Community website's robots.txt and terms of service.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Changelog

Version 1.0.0

  • Complete rewrite with modular architecture
  • Added comprehensive caching system
  • Implemented rate limiting and security hardening
  • Full test coverage with pytest
  • Production-ready Docker container
  • Extensive configuration management
  • Structured logging and error handling