Add comprehensive security improvements
- URL validation with domain whitelist
- Path validation to prevent directory traversal
- Resource limits (content size, scroll iterations)
- Content filtering and sanitization
- Non-root Docker execution with gosu
- Configurable output directory via CLI/env vars
- Fixed Docker volume permission issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

New file: README.md (117 lines)

# Warhammer Community RSS Scraper

A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

## Overview

This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed. It uses Playwright to render JavaScript-heavy content and BeautifulSoup to parse the resulting HTML.
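
As a rough illustration of that approach, a minimal fetch-and-parse sketch might look like the following (the function name and wait strategy are assumptions; the actual `main.py` adds validation, bounded scrolling, and size limits around these calls):

```python
# Minimal sketch: render the page with Playwright, then hand the DOM to
# BeautifulSoup. Illustrative only, not the project's exact code.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    # Headless Chromium executes the site's JavaScript before we read the DOM.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

html = fetch_rendered_html("https://www.warhammer-community.com/en-gb/")
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text(strip=True))
```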

## Features

- Scrapes articles from the Warhammer Community website
- Generates an RSS feed with proper formatting
- Detects and removes duplicate articles
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both the RSS feed and the raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: runs as a non-root user in the container

## Requirements

- Python 3.12+
- Dependencies listed in `requirements.txt`:
  - playwright
  - beautifulsoup4
  - feedgen
  - pytz
  - requests

## Installation

### Local Setup

1. Install dependencies:

```bash
pip install -r requirements.txt
```

2. Install Playwright browsers:

```bash
playwright install
```

3. Run the script:

```bash
# Default: saves to the current directory
python main.py

# Or specify an output directory
python main.py /path/to/output

# Or use an environment variable
OUTPUT_DIR=/path/to/output python main.py
```

### Docker Setup

1. Build the Docker image:

```bash
docker build -t warhammer-rss .
```

2. Run the container (several options are available to avoid volume permission issues):

**Option A: Save to the current directory (simplest)**
```bash
docker run -v $(pwd):/app/output warhammer-rss
```

**Option B: Use an environment variable for the output directory**
```bash
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
```

**Option C: With resource limits for additional security**
```bash
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
```

## Output

The application generates:
- `warhammer_rss_feed.xml` - the RSS feed
- `page.html` - the raw HTML content, kept for debugging

Both files are saved to the specified output directory (the current directory by default).
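
The precedence between the CLI argument and the environment variable presumably reduces to something like this (a sketch; the helper name `resolve_output_dir` is illustrative, not necessarily what `main.py` uses):

```python
# Sketch of output-directory resolution: a CLI argument wins over the
# OUTPUT_DIR environment variable, which wins over the current directory.
import os
import sys
from pathlib import Path

def resolve_output_dir() -> Path:
    if len(sys.argv) > 1:
        return Path(sys.argv[1]).resolve()
    return Path(os.environ.get("OUTPUT_DIR", ".")).resolve()
```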

## Security Features

This application implements several security measures (the first two are sketched after this list):

- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
- **Path Validation**: Prevents directory traversal attacks by validating output paths
- **Resource Limits**: Caps content size (10 MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: The Docker container runs as user `scraper` (UID 1001) with reduced privileges
- **Input Sanitization**: All URLs and file paths are validated before use
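
For illustration, the URL and path checks could look roughly like this (identifiers such as `validate_url` and `validate_output_path` are assumptions, not necessarily the names used in `main.py`):

```python
# Illustrative sketch of the URL whitelist and directory-traversal checks.
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"warhammer-community.com", "www.warhammer-community.com"}

def validate_url(url: str) -> bool:
    # Accept only http(s) URLs whose hostname is explicitly whitelisted.
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_DOMAINS

def validate_output_path(path: str, base_dir: str) -> Path:
    # Resolve ".." segments and symlinks, then require the result to stay
    # inside base_dir, which blocks directory-traversal attempts.
    resolved = Path(base_dir, path).resolve()
    base = Path(base_dir).resolve()
    if not resolved.is_relative_to(base):
        raise ValueError(f"unsafe output path: {resolved}")
    return resolved
```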

## How It Works

1. **Validates** the target URL against a whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
5. **Sanitizes** and extracts article titles, links, and publication dates
6. **Validates all links** against the allowed domains
7. Removes duplicates and sorts articles by date
8. Generates the RSS feed using the feedgen library (sketched below)
9. **Validates output paths** before saving files
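
Step 8 in miniature, using feedgen (the feed metadata and the single entry below are placeholders, not the real values from `main.py`):

```python
# Minimal feedgen usage: build a feed with one placeholder entry and
# write it out as RSS.
from datetime import datetime

import pytz
from feedgen.feed import FeedGenerator

fg = FeedGenerator()
fg.title("Warhammer Community")
fg.link(href="https://www.warhammer-community.com/en-gb/")
fg.description("Latest articles from the Warhammer Community website")

fe = fg.add_entry()
fe.title("Example article")  # placeholder entry
fe.link(href="https://www.warhammer-community.com/en-gb/example/")
fe.pubDate(datetime(2024, 1, 1, tzinfo=pytz.UTC))  # feedgen needs tz-aware dates

fg.rss_file("warhammer_rss_feed.xml")
```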

## Configuration

The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
- `warhammer-community.com`
- `www.warhammer-community.com`

To modify allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
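
For example, adding an extra host (the third entry below is hypothetical) would look like:

```python
# main.py, lines 11-14 (shape presumed from the README; the third entry
# is a hypothetical addition, not a real allowed domain)
ALLOWED_DOMAINS = [
    "warhammer-community.com",
    "www.warhammer-community.com",
    "forum.warhammer-community.com",  # hypothetical addition
]
```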