# Warhammer Community RSS Scraper
A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
## Overview
This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.
## Features
- Scrapes articles from Warhammer Community website
- Generates RSS feed with proper formatting
- Handles duplicate article detection
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both RSS feed and raw HTML for debugging
- Security-focused: URL validation, content filtering, and resource limits
- Safe execution: Runs as non-root user in container
## Requirements

- Python 3.12+
- Dependencies listed in `requirements.txt`:
  - playwright
  - beautifulsoup4
  - feedgen
  - pytz
  - requests
## Installation

### Local Setup

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Install Playwright browsers:

   ```bash
   playwright install
   ```

3. Run the script:

   ```bash
   # Default: saves to current directory
   python main.py

   # Or specify output directory
   python main.py /path/to/output

   # Or use environment variable
   OUTPUT_DIR=/path/to/output python main.py
   ```
### Docker Setup

1. Build the Docker image:

   ```bash
   docker build -t warhammer-rss .
   ```

2. Run the container (multiple options to avoid permission issues):

   Option A: Save to current directory (simplest)

   ```bash
   docker run -v $(pwd):/app/output warhammer-rss
   ```

   Option B: Use an environment variable for the output directory

   ```bash
   docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
   ```

   Option C: With resource limits for additional security

   ```bash
   docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
   ```
## Output

The application generates:

- `warhammer_rss_feed.xml` - RSS feed file
- `page.html` - Raw HTML content for debugging

Both files are saved to the specified output directory (current directory by default).
## Security Features

This application implements several security measures:

- URL Validation: Only allows scraping from trusted Warhammer Community domains
- Path Validation: Prevents directory traversal attacks by validating output paths
- Resource Limits: Caps content size (10 MB) and scroll iterations (5) to prevent DoS
- Content Filtering: Sanitizes extracted text to prevent XSS and injection attacks
- Non-root Execution: Docker container runs as user `scraper` (UID 1001) for reduced privileges
- Input Sanitization: All URLs and file paths are validated before use
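The path-validation measure can be sketched as follows. This is an illustrative sketch, not the project's actual code: the helper name `validate_output_dir` and the `/app/output` base directory are assumptions.

```python
from pathlib import Path

def validate_output_dir(candidate: str, base: str = "/app/output") -> Path:
    """Resolve the requested directory and ensure it stays inside `base`.

    Raises ValueError if the resolved path escapes the allowed base,
    which blocks '../'-style directory traversal. (Illustrative sketch;
    the base path is an assumption, not the project's real default.)
    """
    base_path = Path(base).resolve()
    target = (base_path / candidate).resolve()
    # A safe target is either the base itself or a descendant of it.
    if target != base_path and base_path not in target.parents:
        raise ValueError(f"Output path escapes {base_path}: {target}")
    return target
```

Resolving both paths before comparing them is what defeats `..` segments and symlink tricks that a plain string-prefix check would miss.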
## How It Works

1. Validates the target URL against a whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. Validates content size and parses the rendered HTML with BeautifulSoup
5. Sanitizes and extracts article titles, links, and publication dates
6. Validates all links against the allowed domains
7. Removes duplicates and sorts articles by date
8. Generates the RSS feed using the feedgen library
9. Validates output paths before saving files
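The duplicate-removal and sorting steps can be sketched with the standard library alone. The article shape (dicts with `title`, `link`, and a timezone-aware `date`) is an assumption for illustration; the real scraper may represent articles differently.

```python
from datetime import datetime, timezone

def dedupe_and_sort(articles: list[dict]) -> list[dict]:
    """Drop articles with duplicate links (keeping the first occurrence),
    then sort newest-first by publication date.

    Assumes each article is a dict with 'title', 'link', and 'date'
    (a timezone-aware datetime) — an illustrative shape, not the
    project's guaranteed data model.
    """
    seen: set[str] = set()
    unique = []
    for article in articles:
        if article["link"] not in seen:
            seen.add(article["link"])
            unique.append(article)
    # reverse=True puts the most recent publication date first.
    return sorted(unique, key=lambda a: a["date"], reverse=True)
```

Deduplicating by link rather than by title avoids dropping distinct articles that happen to share a headline.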
## Configuration

The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:

- `warhammer-community.com`
- `www.warhammer-community.com`

To modify the allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
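A minimal sketch of how such a whitelist check might look, assuming a helper named `is_allowed_url` (the project's actual function name and logic may differ):

```python
from urllib.parse import urlparse

# Mirrors the whitelist described above; the authoritative list lives in main.py.
ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs whose hostname is exactly in the whitelist."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_DOMAINS
```

Comparing the parsed hostname for exact membership (rather than substring matching on the raw URL) prevents bypasses like `https://warhammer-community.com.evil.example/`.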