Warhammer Community RSS Scraper

A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

Overview

This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.

Features

  • Scrapes articles from the Warhammer Community website
  • Generates RSS feed with proper formatting
  • Handles duplicate article detection
  • Sorts articles by publication date (newest first)
  • Dockerized for easy deployment
  • Saves both RSS feed and raw HTML for debugging
  • Security-focused: URL validation, content filtering, and resource limits
  • Safe execution: Runs as non-root user in container

Requirements

  • Python 3.12+
  • Dependencies listed in requirements.txt:
    • playwright
    • beautifulsoup4
    • feedgen
    • pytz
    • requests

Installation

Local Setup

  1. Install dependencies:
pip install -r requirements.txt
  2. Install Playwright browsers:
playwright install
  3. Run the script:
# Default: saves to current directory
python main.py

# Or specify output directory
python main.py /path/to/output

# Or use environment variable
OUTPUT_DIR=/path/to/output python main.py

Docker Setup

  1. Build the Docker image:
docker build -t warhammer-rss .
  2. Run the container (multiple options to avoid permission issues):

Option A: Save to current directory (simplest)

docker run -v $(pwd):/app/output warhammer-rss

Option B: Use environment variable for output directory

docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss

Option C: With resource limits for additional security

docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss

Output

The application generates:

  • warhammer_rss_feed.xml - RSS feed file
  • page.html - Raw HTML content for debugging

Both files are saved to the specified output directory (current directory by default).
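
The precedence implied by the usage examples above (CLI argument, then the OUTPUT_DIR environment variable, then the current directory) can be sketched as follows. This is a minimal illustration, not the repository's actual code; the function name resolve_output_dir is hypothetical.

```python
import os

def resolve_output_dir(argv):
    """Resolve the output directory: CLI argument first, then the
    OUTPUT_DIR environment variable, then the current directory."""
    if len(argv) > 1:
        return argv[1]
    return os.environ.get("OUTPUT_DIR", ".")
```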

Security Features

This application implements several security measures:

  • URL Validation: Only allows scraping from trusted Warhammer Community domains
  • Path Validation: Prevents directory traversal attacks by validating output paths
  • Resource Limits: Caps content size (10MB) and scroll iterations (5) to prevent DoS
  • Content Filtering: Sanitizes extracted text to prevent XSS and injection attacks
  • Non-root Execution: Docker container runs as user scraper (UID 1001) for reduced privilege
  • Input Sanitization: All URLs and file paths are validated before use
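
The path-validation measure above can be sketched with the standard library alone. This is a hedged illustration of the general technique (resolve the path, then check it stays inside the base directory), not the project's actual implementation; validate_output_path is a hypothetical name.

```python
import os

def validate_output_path(path, base_dir):
    """Reject output paths that escape base_dir (directory traversal)."""
    resolved = os.path.realpath(os.path.join(base_dir, path))
    base = os.path.realpath(base_dir)
    # A safe path either equals the base or sits strictly below it.
    if resolved != base and not resolved.startswith(base + os.sep):
        raise ValueError(f"output path escapes {base_dir!r}: {path!r}")
    return resolved
```

Note that realpath also resolves symlinks, so a link pointing outside the base directory is rejected as well.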

How It Works

  1. Validates the target URL against a whitelist of allowed domains
  2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
  3. Scrolls through the page to load additional content (limited to 5 iterations)
  4. Validates content size and parses the rendered HTML with BeautifulSoup
  5. Sanitizes and extracts article titles, links, and publication dates
  6. Validates all links against allowed domains
  7. Removes duplicates and sorts by date
  8. Generates RSS feed using feedgen library
  9. Validates output paths before saving files
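
Steps 7 and 8 (deduplication and newest-first sorting) can be sketched as below. This is an assumption-laden illustration, not the repository's code; the article dict keys "link" and "date" are hypothetical.

```python
from datetime import datetime, timezone

def dedupe_and_sort(articles):
    """Drop articles with duplicate links, then sort newest-first
    by publication date."""
    seen = set()
    unique = []
    for art in articles:
        if art["link"] not in seen:
            seen.add(art["link"])
            unique.append(art)
    return sorted(unique, key=lambda a: a["date"], reverse=True)
```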

Configuration

The scraper targets https://www.warhammer-community.com/en-gb/ by default and only allows URLs from:

  • warhammer-community.com
  • www.warhammer-community.com

To modify allowed domains, update the ALLOWED_DOMAINS list in main.py:11-14.
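
A domain whitelist check of this kind might look like the following sketch, which mirrors the two domains listed above. The function name is_allowed_url is hypothetical and the actual check in main.py may differ.

```python
from urllib.parse import urlparse

# Mirrors the ALLOWED_DOMAINS list described in main.py (illustrative copy)
ALLOWED_DOMAINS = ["warhammer-community.com", "www.warhammer-community.com"]

def is_allowed_url(url):
    """Accept only http(s) URLs whose host is on the whitelist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_DOMAINS
```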
