Add comprehensive security improvements

- URL validation with domain whitelist
- Path validation to prevent directory traversal
- Resource limits (content size, scroll iterations)
- Content filtering and sanitization
- Non-root Docker execution with gosu
- Configurable output directory via CLI/env vars
- Fixed Docker volume permission issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Phil 2025-06-05 18:19:23 -06:00
parent eecee074e2
commit b9b3ece3cb
4 changed files with 6636 additions and 16 deletions

Dockerfile

@@ -44,14 +44,41 @@ RUN pip install --upgrade pip && \
     feedgen \
     pytz
 
-# Install Playwright browser binaries
-RUN playwright install
+# Install only Chromium (faster than all browsers)
+RUN playwright install chromium
+
+# Create an entrypoint script to handle permissions (as root)
+RUN echo '#!/bin/bash\n\
+# Fix permissions for mounted volumes\n\
+if [ -d "/app/output" ]; then\n\
+    chmod 777 /app/output 2>/dev/null || true\n\
+fi\n\
+# Run as scraper user\n\
+exec gosu scraper "$@"' > /entrypoint.sh && chmod +x /entrypoint.sh
+
+# Install gosu for user switching
+RUN apt-get update && apt-get install -y gosu && rm -rf /var/lib/apt/lists/*
+
+# Create non-root user for security
+RUN useradd -m -u 1001 scraper && \
+    mkdir -p /app/output && \
+    chown -R scraper:scraper /app && \
+    chmod 755 /app/output
 
 # Copy the Python script to the container
 COPY main.py .
+RUN chown scraper:scraper main.py
 
 # Set the environment variable to ensure Playwright works in the container
-ENV PLAYWRIGHT_BROWSERS_PATH=/root/.cache/ms-playwright
+ENV PLAYWRIGHT_BROWSERS_PATH=/home/scraper/.cache/ms-playwright
 
-# Command to run the Python script
+# Don't switch user here - entrypoint will handle it
+# USER scraper
+
+# Install Chromium for the scraper user (only what we need)
+USER scraper
+RUN playwright install chromium
+USER root
+
+ENTRYPOINT ["/entrypoint.sh"]
 CMD ["python", "main.py"]

README.md (new file)

@@ -0,0 +1,117 @@
# Warhammer Community RSS Scraper
A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.
## Overview
This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.
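In outline, the render-then-parse flow looks roughly like the sketch below. This is a condensed illustration of the approach, not the full scraper; it omits the validation, RSS generation, and output handling described later.
```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Render the JavaScript-heavy page with a real browser engine...
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://www.warhammer-community.com/en-gb/', wait_until='networkidle')
    html = page.content()
    browser.close()

# ...then hand the rendered HTML to BeautifulSoup for parsing.
soup = BeautifulSoup(html, 'html.parser')
print(f"{len(soup.find_all('article'))} article elements found")
```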
## Features
- Scrapes articles from the Warhammer Community website
- Generates RSS feed with proper formatting
- Handles duplicate article detection
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both RSS feed and raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: Runs as non-root user in container
## Requirements
- Python 3.12+
- Dependencies listed in `requirements.txt`:
- playwright
- beautifulsoup4
- feedgen
- pytz
- requests
## Installation
### Local Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Install Playwright browsers:
```bash
playwright install
```
3. Run the script:
```bash
# Default: saves to current directory
python main.py
# Or specify output directory
python main.py /path/to/output
# Or use environment variable
OUTPUT_DIR=/path/to/output python main.py
```
### Docker Setup
1. Build the Docker image:
```bash
docker build -t warhammer-rss .
```
2. Run the container (multiple options to avoid permission issues):
**Option A: Save to current directory (simplest)**
```bash
docker run -v $(pwd):/app/output warhammer-rss
```
**Option B: Use environment variable for output directory**
```bash
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
```
**Option C: With resource limits for additional security**
```bash
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
```
## Output
The application generates:
- `warhammer_rss_feed.xml` - RSS feed file
- `page.html` - Raw HTML content for debugging
Both files are saved to the specified output directory (current directory by default).
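To sanity-check the generated feed, you can parse it with the Python standard library; this is a hypothetical verification step, not part of the scraper itself:
```python
import xml.etree.ElementTree as ET

# RSS layout: <rss><channel><item><title/><link/>...</item></channel></rss>
tree = ET.parse('warhammer_rss_feed.xml')
for item in tree.getroot().iter('item'):
    print(item.findtext('title'), '->', item.findtext('link'))
```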
## Security Features
This application implements several security measures:
- **URL Validation**: Only allows scraping from trusted Warhammer Community domains (see the sketch after this list)
- **Path Validation**: Prevents directory traversal attacks by validating output paths
- **Resource Limits**: Caps content size (10MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: The Docker container runs as the unprivileged user `scraper` (UID 1001)
- **Input Sanitization**: All URLs and file paths are validated before use
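The domain check boils down to comparing the parsed hostname against the whitelist; the snippet below is a condensed version of `validate_url` from `main.py`:
```python
import urllib.parse

ALLOWED_DOMAINS = ['warhammer-community.com', 'www.warhammer-community.com']

def validate_url(url):
    """Raise ValueError unless the URL's host is on the whitelist."""
    parsed = urllib.parse.urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError("Invalid URL format")
    if parsed.netloc.lower() not in ALLOWED_DOMAINS:
        raise ValueError(f"Domain {parsed.netloc} not in allowed list")
    return True

validate_url('https://www.warhammer-community.com/en-gb/')  # passes
# validate_url('https://example.com/') would raise ValueError
```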
## How It Works
1. **Validates** the target URL against a whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
5. **Sanitizes** and extracts article titles, links, and publication dates
6. **Validates all links** against allowed domains
7. Removes duplicates and sorts by date
8. Generates RSS feed using feedgen library
9. **Validates output paths** before saving files (see the sketch below)
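The path check in step 9 is essentially a condensed form of `validate_output_path` in `main.py`: resolve both paths to absolute form and reject anything that escapes the output directory.
```python
import os

def validate_output_path(path, base_dir):
    """Reject file paths that resolve outside the intended output directory."""
    abs_path = os.path.abspath(path)
    abs_base = os.path.abspath(base_dir)
    if not abs_path.startswith(abs_base):
        raise ValueError(f"Output path {abs_path} is outside {abs_base}")
    os.makedirs(abs_base, exist_ok=True)  # make sure the output directory exists
    return abs_path

validate_output_path('output/warhammer_rss_feed.xml', 'output')  # ok
# validate_output_path('/etc/passwd', 'output') would raise ValueError
```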
## Configuration
The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
- `warhammer-community.com`
- `www.warhammer-community.com`
To modify the allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`, as in the example below.
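To trust an additional host (the extra entry here is purely illustrative), the list would become:
```python
ALLOWED_DOMAINS = [
    'warhammer-community.com',
    'www.warhammer-community.com',
    'another-trusted-host.example',  # hypothetical extra domain
]
```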

main.py

@@ -4,9 +4,100 @@ from feedgen.feed import FeedGenerator
 from datetime import datetime
 import pytz
 import time
+import urllib.parse
+import os
+import sys
+
+# Allowed domains for scraping - security whitelist
+ALLOWED_DOMAINS = [
+    'warhammer-community.com',
+    'www.warhammer-community.com'
+]
+
+# Resource limits
+MAX_SCROLL_ITERATIONS = 5
+MAX_CONTENT_SIZE = 10 * 1024 * 1024  # 10MB
+
+def validate_url(url):
+    """Validate URL against whitelist of allowed domains"""
+    try:
+        parsed = urllib.parse.urlparse(url)
+        if not parsed.scheme or not parsed.netloc:
+            raise ValueError("Invalid URL format")
+
+        # Check if domain is in allowed list
+        domain = parsed.netloc.lower()
+        if domain not in ALLOWED_DOMAINS:
+            raise ValueError(f"Domain {domain} not in allowed list: {ALLOWED_DOMAINS}")
+
+        return True
+    except Exception as e:
+        raise ValueError(f"URL validation failed: {e}")
+
+def validate_output_path(path, base_dir):
+    """Validate and sanitize output file path"""
+    # Resolve to absolute path and check if it's safe
+    abs_path = os.path.abspath(path)
+    abs_base = os.path.abspath(base_dir)
+
+    # Ensure path is within allowed directory
+    if not abs_path.startswith(abs_base):
+        raise ValueError(f"Output path {abs_path} is outside allowed directory {abs_base}")
+
+    # Ensure output directory exists
+    os.makedirs(abs_base, exist_ok=True)
+
+    return abs_path
+
+def sanitize_text(text):
+    """Sanitize text content to prevent injection attacks"""
+    if not text:
+        return "No title"
+
+    # Remove potential harmful characters and limit length
+    sanitized = text.strip()[:500]  # Limit title length
+
+    # Remove any script tags or potentially harmful content
+    dangerous_patterns = ['<script', '</script', 'javascript:', 'data:', 'vbscript:']
+    for pattern in dangerous_patterns:
+        sanitized = sanitized.replace(pattern.lower(), '').replace(pattern.upper(), '')
+
+    return sanitized if sanitized else "No title"
+
+def validate_link(link, base_url):
+    """Validate and sanitize article links"""
+    if not link:
+        return None
+
+    try:
+        # Handle relative URLs
+        if link.startswith('/'):
+            parsed_base = urllib.parse.urlparse(base_url)
+            link = f"{parsed_base.scheme}://{parsed_base.netloc}{link}"
+
+        # Validate the resulting URL
+        parsed = urllib.parse.urlparse(link)
+        if not parsed.scheme or not parsed.netloc:
+            return None
+
+        # Ensure it's from allowed domain
+        domain = parsed.netloc.lower()
+        if domain not in ALLOWED_DOMAINS:
+            return None
+
+        return link
+    except Exception:
+        return None
+
 # Function to scrape articles using Playwright and generate an RSS feed
-def scrape_and_generate_rss(url):
+def scrape_and_generate_rss(url, output_dir=None):
+    # Validate URL first
+    validate_url(url)
+
+    # Set default output directory if not provided
+    if output_dir is None:
+        output_dir = '.'  # Default to current directory
+
     articles = []
     seen_urls = set()  # Set to track seen URLs and avoid duplicates
@@ -21,13 +112,19 @@ def scrape_and_generate_rss(url):
         # Load the Warhammer Community page
         page.goto(url, wait_until="networkidle")
 
-        # Simulate scrolling to load more content if needed
-        for _ in range(10):
+        # Simulate scrolling to load more content if needed (limited for security)
+        for _ in range(MAX_SCROLL_ITERATIONS):
             page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
             time.sleep(2)
 
         # Get the fully rendered HTML content
         html = page.content()
 
+        # Check content size for security
+        if len(html) > MAX_CONTENT_SIZE:
+            browser.close()
+            raise ValueError(f"Content size {len(html)} exceeds maximum {MAX_CONTENT_SIZE}")
+
         browser.close()
 
     # Parse the HTML content with BeautifulSoup
@@ -38,13 +135,15 @@ def scrape_and_generate_rss(url):
     # Find all articles in the page
     for article in soup.find_all('article'):
-        # Extract the title
+        # Extract and sanitize the title
         title_tag = article.find('h3', class_='newsCard-title-sm') or article.find('h3', class_='newsCard-title-lg')
-        title = title_tag.text.strip() if title_tag else 'No title'
+        raw_title = title_tag.text.strip() if title_tag else 'No title'
+        title = sanitize_text(raw_title)
 
-        # Extract the link
+        # Extract and validate the link
         link_tag = article.find('a', href=True)
-        link = link_tag['href'] if link_tag else None
+        raw_link = link_tag['href'] if link_tag else None
+        link = validate_link(raw_link, url)
 
         # Skip this entry if the link is None or the URL has already been seen
         if not link or link in seen_urls:
@@ -97,13 +196,29 @@ def scrape_and_generate_rss(url):
     # Generate the RSS feed
     rss_feed = fg.rss_str(pretty=True)
 
-    # Save the RSS feed to a file
-    with open('/app/output/warhammer_rss_feed.xml', 'wb') as f:
+    # Validate and save the RSS feed to a file
+    rss_path = validate_output_path(os.path.join(output_dir, 'warhammer_rss_feed.xml'), output_dir)
+    with open(rss_path, 'wb') as f:
         f.write(rss_feed)
 
-    with open('/app/output/page.html','w', encoding='utf-8') as f:
+    # Validate and save HTML for debugging
+    html_path = validate_output_path(os.path.join(output_dir, 'page.html'), output_dir)
+    with open(html_path, 'w', encoding='utf-8') as f:
         f.write(soup.prettify())
 
     print('RSS feed generated and saved as warhammer_rss_feed.xml')
 
-# Run the function
-scrape_and_generate_rss('https://www.warhammer-community.com/en-gb/')
+if __name__ == "__main__":
+    # Get output directory from environment variable or command line argument
+    output_dir = os.getenv('OUTPUT_DIR')
+    if len(sys.argv) > 1:
+        output_dir = sys.argv[1]
+
+    # Default to current directory if no output specified (avoids permission issues)
+    if not output_dir:
+        output_dir = '.'
+
+    print(f"Using output directory: {output_dir}")
+
+    # Run the function
+    scrape_and_generate_rss('https://www.warhammer-community.com/en-gb/', output_dir)

output/page.html (new file, 6361 lines): diff suppressed because one or more lines are too long.