Warhammer Community RSS Scraper

A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

Overview

This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.

Features

  • Scrapes articles from the Warhammer Community website
  • Generates RSS feed with proper formatting
  • Handles duplicate article detection
  • Sorts articles by publication date (newest first)
  • Dockerized for easy deployment
  • Saves both RSS feed and raw HTML for debugging
  • Security-focused: URL validation, content filtering, and resource limits
  • Safe execution: Runs as non-root user in container

Requirements

  • Python 3.12+
  • Dependencies listed in requirements.txt:
    • playwright
    • beautifulsoup4
    • feedgen
    • pytz
    • requests

Installation

Local Setup

  1. Install dependencies:
pip install -r requirements.txt
  2. Install Playwright browsers:
playwright install
  3. Run the script:
# Default: saves to current directory
python main.py

# Or specify output directory
python main.py /path/to/output

# Or use environment variable
OUTPUT_DIR=/path/to/output python main.py

Docker Setup

  1. Build the Docker image:
docker build -t warhammer-rss .
  2. Run the container (multiple options to avoid permission issues):

Option A: Save to current directory (simplest)

docker run -v $(pwd):/app/output warhammer-rss

Option B: Use environment variable for output directory

docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss

Option C: With resource limits for additional security

docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss

Output

The application generates:

  • warhammer_rss_feed.xml - RSS feed file
  • page.html - Raw HTML content for debugging

Both files are saved to the specified output directory (current directory by default).
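
The precedence implied by the usage examples above (CLI argument, then the OUTPUT_DIR environment variable, then the current directory) can be sketched as follows. This is a minimal illustration, not the repository's actual code; the function name resolve_output_dir is hypothetical.

```python
import os

def resolve_output_dir(argv):
    """Resolve the output directory: CLI argument first, then the
    OUTPUT_DIR environment variable, then the current directory."""
    if len(argv) > 1:
        return argv[1]
    return os.environ.get("OUTPUT_DIR", ".")
```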

Security Features

This application implements several security measures:

  • URL Validation: Only allows scraping from trusted Warhammer Community domains
  • Path Validation: Prevents directory traversal attacks by validating output paths
  • Resource Limits: Caps content size (10MB) and scroll iterations (5) to prevent DoS
  • Content Filtering: Sanitizes extracted text to prevent XSS and injection attacks
  • Non-root Execution: Docker container runs as user scraper (UID 1001) for reduced privilege
  • Input Sanitization: All URLs and file paths are validated before use
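
The path-validation measure above can be sketched with the standard library alone. This is a hedged illustration of the general technique (resolve the path, then check it stays inside the base directory), not the project's actual implementation; validate_output_path is a hypothetical name.

```python
import os

def validate_output_path(path, base_dir):
    """Reject output paths that escape base_dir (directory traversal)."""
    resolved = os.path.realpath(os.path.join(base_dir, path))
    base = os.path.realpath(base_dir)
    # A safe path either equals the base or sits strictly below it.
    if resolved != base and not resolved.startswith(base + os.sep):
        raise ValueError(f"output path escapes {base_dir!r}: {path!r}")
    return resolved
```

Note that realpath also resolves symlinks, so a link pointing outside the base directory is rejected as well.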

How It Works

  1. Validates the target URL against a whitelist of allowed domains
  2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
  3. Scrolls through the page to load additional content (limited to 5 iterations)
  4. Validates content size and parses the rendered HTML with BeautifulSoup
  5. Sanitizes and extracts article titles, links, and publication dates
  6. Validates all links against allowed domains
  7. Removes duplicates and sorts by date
  8. Generates RSS feed using feedgen library
  9. Validates output paths before saving files
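
Steps 7 and 8 (deduplication and newest-first sorting) can be sketched as below. This is an assumption-laden illustration, not the repository's code; the article dict keys "link" and "date" are hypothetical.

```python
from datetime import datetime, timezone

def dedupe_and_sort(articles):
    """Drop articles with duplicate links, then sort newest-first
    by publication date."""
    seen = set()
    unique = []
    for art in articles:
        if art["link"] not in seen:
            seen.add(art["link"])
            unique.append(art)
    return sorted(unique, key=lambda a: a["date"], reverse=True)
```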

Configuration

The scraper targets https://www.warhammer-community.com/en-gb/ by default and only allows URLs from:

  • warhammer-community.com
  • www.warhammer-community.com

To modify allowed domains, update the ALLOWED_DOMAINS list in main.py:11-14.
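
A domain whitelist check of this kind might look like the following sketch, which mirrors the two domains listed above. The function name is_allowed_url is hypothetical and the actual check in main.py may differ.

```python
from urllib.parse import urlparse

# Mirrors the ALLOWED_DOMAINS list described in main.py (illustrative copy)
ALLOWED_DOMAINS = ["warhammer-community.com", "www.warhammer-community.com"]

def is_allowed_url(url):
    """Accept only http(s) URLs whose host is on the whitelist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_DOMAINS
```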
