# Warhammer Community RSS Scraper
A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
## Overview
This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.
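As a rough illustration of that render-then-parse flow (the function names here are placeholders, not the actual `main.py` code):
```python
# Minimal sketch of the Playwright + BeautifulSoup approach; names are illustrative.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Load a JavaScript-heavy page with Playwright and return the rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
    return html

def extract_article_titles(html: str) -> list[str]:
    """Parse the rendered HTML with BeautifulSoup and collect heading text."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("h3")]
```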
## Features
- Scrapes articles from Warhammer Community website
- Generates RSS feed with proper formatting
- Handles duplicate article detection
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both RSS feed and raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: Runs as non-root user in container
## Requirements
- Python 3.12+
- Dependencies listed in `requirements.txt`:
  - playwright
  - beautifulsoup4
  - feedgen
  - pytz
  - requests
## Installation
### Local Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Install Playwright browsers:
```bash
playwright install
```
3. Run the script:
```bash
# Default: saves to current directory
python main.py
# Or specify output directory
python main.py /path/to/output
# Or use environment variable
OUTPUT_DIR=/path/to/output python main.py
```
### Docker Setup
1. Build the Docker image:
```bash
docker build -t warhammer-rss .
```
2. Run the container (multiple options to avoid permission issues):
**Option A: Save to current directory (simplest)**
```bash
docker run -v $(pwd):/app/output warhammer-rss
```
**Option B: Use environment variable for output directory**
```bash
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
```
**Option C: With resource limits for additional security**
```bash
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
```
## Output
The application generates:
- `warhammer_rss_feed.xml` - RSS feed file
- `page.html` - Raw HTML content for debugging
Both files are saved to the specified output directory (current directory by default).
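A minimal sketch of how that precedence (CLI argument, then `OUTPUT_DIR`, then the current directory) can be resolved; the helper name is an assumption, not the literal `main.py` code:
```python
# Illustrative output-directory resolution: CLI argument, then OUTPUT_DIR,
# then the current directory (helper name is an assumption).
import os
import sys
from pathlib import Path

def resolve_output_dir() -> Path:
    """Pick the output directory and make sure it exists."""
    raw = sys.argv[1] if len(sys.argv) > 1 else os.environ.get("OUTPUT_DIR", ".")
    out_dir = Path(raw).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir

# warhammer_rss_feed.xml and page.html are then written into this directory.
```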
## Security Features
This application implements several security measures:
- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
- **Path Validation**: Prevents directory traversal attacks by validating output paths (see the sketch below)
- **Resource Limits**: Caps content size (10MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: Docker container runs as user `scraper` (UID 1001) with reduced privileges
- **Input Sanitization**: All URLs and file paths are validated before use
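As a hedged sketch of the path and size checks listed above (function names and structure are assumptions; the real checks live in `main.py`):
```python
# Illustrative path and size checks; the 10MB cap mirrors the resource limit
# described above but this is not the literal main.py code.
from pathlib import Path

MAX_CONTENT_BYTES = 10 * 1024 * 1024  # 10MB content cap

def validate_output_path(candidate: str, base_dir: str) -> Path:
    """Reject output paths that resolve outside the allowed base directory."""
    resolved = Path(candidate).resolve()
    base = Path(base_dir).resolve()
    if not resolved.is_relative_to(base):
        raise ValueError(f"{resolved} escapes {base}")
    return resolved

def check_content_size(html: str) -> None:
    """Refuse to parse pages larger than the configured cap."""
    if len(html.encode("utf-8")) > MAX_CONTENT_BYTES:
        raise ValueError("Rendered page exceeds the 10MB content limit")
```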
## How It Works
1. **Validates** the target URL against a whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
5. **Sanitizes** and extracts article titles, links, and publication dates
6. **Validates all links** against allowed domains
7. Removes duplicates and sorts by date
8. Generates the RSS feed using the feedgen library (see the sketch after this list)
9. **Validates output paths** before saving files
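For step 8, feed construction with feedgen looks roughly like this (the entry title, link, and date are placeholders, not scraped data):
```python
# Minimal feedgen sketch with placeholder entry data.
from datetime import datetime, timezone
from feedgen.feed import FeedGenerator

fg = FeedGenerator()
fg.title("Warhammer Community - Latest Articles")
fg.link(href="https://www.warhammer-community.com/en-gb/", rel="alternate")
fg.description("Latest articles scraped from the Warhammer Community website")

fe = fg.add_entry()
fe.title("Example article title")
fe.link(href="https://www.warhammer-community.com/en-gb/example-article/")
fe.pubDate(datetime(2025, 6, 5, tzinfo=timezone.utc))

fg.rss_file("warhammer_rss_feed.xml", pretty=True)
```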
## Configuration
The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
- `warhammer-community.com`
- `www.warhammer-community.com`
To modify allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
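For reference, an assumed shape of that list and the kind of check it feeds (the actual definition lives in `main.py`):
```python
# Assumed shape of the allow-list and its check; the real list is in main.py.
from urllib.parse import urlparse

ALLOWED_DOMAINS = [
    "warhammer-community.com",
    "www.warhammer-community.com",
]

def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs whose hostname is on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_DOMAINS
```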