Add comprehensive security improvements
- URL validation with domain whitelist
- Path validation to prevent directory traversal
- Resource limits (content size, scroll iterations)
- Content filtering and sanitization
- Non-root Docker execution with gosu
- Configurable output directory via CLI/env vars
- Fixed Docker volume permission issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

New file: README.md (117 lines)

# Warhammer Community RSS Scraper

A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

## Overview

This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed. It uses Playwright to render JavaScript-heavy content and BeautifulSoup to parse the resulting HTML.
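
As a rough illustration of that approach, a minimal fetch-and-parse sketch might look like the following (the function name and wait strategy are assumptions; the actual `main.py` adds validation, bounded scrolling, and size limits around these calls):

```python
# Minimal sketch: render the page with Playwright, then hand the DOM to
# BeautifulSoup. Illustrative only, not the project's exact code.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    # Headless Chromium executes the site's JavaScript before we read the DOM.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

html = fetch_rendered_html("https://www.warhammer-community.com/en-gb/")
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text(strip=True))
```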

## Features

- Scrapes articles from the Warhammer Community website
- Generates an RSS feed with proper formatting
- Detects and removes duplicate articles
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both the RSS feed and the raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: runs as a non-root user in the container

## Requirements

- Python 3.12+
- Dependencies listed in `requirements.txt`:
  - playwright
  - beautifulsoup4
  - feedgen
  - pytz
  - requests

## Installation

### Local Setup

1. Install dependencies:

```bash
pip install -r requirements.txt
```

2. Install Playwright browsers:

```bash
playwright install
```

3. Run the script:

```bash
# Default: saves to the current directory
python main.py

# Or specify an output directory
python main.py /path/to/output

# Or use an environment variable
OUTPUT_DIR=/path/to/output python main.py
```

### Docker Setup

1. Build the Docker image:

```bash
docker build -t warhammer-rss .
```

2. Run the container (several options are available to avoid volume permission issues):

**Option A: Save to the current directory (simplest)**
```bash
docker run -v $(pwd):/app/output warhammer-rss
```

**Option B: Use an environment variable for the output directory**
```bash
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
```

**Option C: With resource limits for additional security**
```bash
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
```

## Output

The application generates:
- `warhammer_rss_feed.xml` - the RSS feed
- `page.html` - the raw HTML content, kept for debugging

Both files are saved to the specified output directory (the current directory by default).
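
The precedence between the CLI argument and the environment variable presumably reduces to something like this (a sketch; the helper name `resolve_output_dir` is illustrative, not necessarily what `main.py` uses):

```python
# Sketch of output-directory resolution: a CLI argument wins over the
# OUTPUT_DIR environment variable, which wins over the current directory.
import os
import sys
from pathlib import Path

def resolve_output_dir() -> Path:
    if len(sys.argv) > 1:
        return Path(sys.argv[1]).resolve()
    return Path(os.environ.get("OUTPUT_DIR", ".")).resolve()
```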

## Security Features

This application implements several security measures (the first two are sketched after this list):

- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
- **Path Validation**: Prevents directory traversal attacks by validating output paths
- **Resource Limits**: Caps content size (10 MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: The Docker container runs as user `scraper` (UID 1001) with reduced privileges
- **Input Sanitization**: All URLs and file paths are validated before use
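
For illustration, the URL and path checks could look roughly like this (identifiers such as `validate_url` and `validate_output_path` are assumptions, not necessarily the names used in `main.py`):

```python
# Illustrative sketch of the URL whitelist and directory-traversal checks.
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def validate_url(url: str) -> bool:
    # Accept only http(s) URLs whose hostname is explicitly whitelisted.
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_DOMAINS

def validate_output_path(path: str, base_dir: str) -> Path:
    # Resolve ".." segments and symlinks, then require the result to stay
    # inside base_dir, which blocks directory-traversal attempts.
    resolved = Path(base_dir, path).resolve()
    base = Path(base_dir).resolve()
    if not resolved.is_relative_to(base):
        raise ValueError(f"unsafe output path: {resolved}")
    return resolved
```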

## How It Works

1. **Validates** the target URL against a whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
5. **Sanitizes** and extracts article titles, links, and publication dates
6. **Validates all links** against the allowed domains
7. Removes duplicates and sorts articles by date
8. Generates the RSS feed using the feedgen library (sketched below)
9. **Validates output paths** before saving files
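
Step 8 in miniature, using feedgen (the feed metadata and the single entry below are placeholders, not the real values from `main.py`):

```python
# Minimal feedgen usage: build a feed with one placeholder entry and
# write it out as RSS.
from datetime import datetime

import pytz
from feedgen.feed import FeedGenerator

fg = FeedGenerator()
fg.title("Warhammer Community")
fg.link(href="https://www.warhammer-community.com/en-gb/")
fg.description("Latest articles from the Warhammer Community website")

fe = fg.add_entry()
fe.title("Example article")  # placeholder entry
fe.link(href="https://www.warhammer-community.com/en-gb/example/")
fe.pubDate(datetime(2024, 1, 1, tzinfo=pytz.UTC))  # feedgen needs tz-aware dates

fg.rss_file("warhammer_rss_feed.xml")
```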

## Configuration

The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
- `warhammer-community.com`
- `www.warhammer-community.com`

To modify allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
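
For example, adding an extra host (the third entry below is hypothetical) would look like:

```python
# main.py, lines 11-14 (shape presumed from the README; the third entry
# is a hypothetical addition, not a real allowed domain)
ALLOWED_DOMAINS = [
    "warhammer-community.com",
    "www.warhammer-community.com",
    "forum.warhammer-community.com",  # hypothetical addition
]
```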