Add comprehensive security improvements
- URL validation with domain whitelist
- Path validation to prevent directory traversal
- Resource limits (content size, scroll iterations)
- Content filtering and sanitization
- Non-root Docker execution with gosu
- Configurable output directory via CLI/env vars
- Fixed Docker volume permission issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent eecee074e2
commit b9b3ece3cb
Dockerfile (35 changed lines)

@@ -44,14 +44,41 @@ RUN pip install --upgrade pip && \
     feedgen \
     pytz
 
-# Install Playwright browser binaries
-RUN playwright install
+# Install only Chromium (faster than all browsers)
+RUN playwright install chromium
+
+# Create an entrypoint script to handle permissions (as root)
+RUN echo '#!/bin/bash\n\
+# Fix permissions for mounted volumes\n\
+if [ -d "/app/output" ]; then\n\
+chmod 777 /app/output 2>/dev/null || true\n\
+fi\n\
+# Run as scraper user\n\
+exec gosu scraper "$@"' > /entrypoint.sh && chmod +x /entrypoint.sh
+
+# Install gosu for user switching
+RUN apt-get update && apt-get install -y gosu && rm -rf /var/lib/apt/lists/*
+
+# Create non-root user for security
+RUN useradd -m -u 1001 scraper && \
+    mkdir -p /app/output && \
+    chown -R scraper:scraper /app && \
+    chmod 755 /app/output
 
 # Copy the Python script to the container
 COPY main.py .
+RUN chown scraper:scraper main.py
 
 # Set the environment variable to ensure Playwright works in the container
-ENV PLAYWRIGHT_BROWSERS_PATH=/root/.cache/ms-playwright
+ENV PLAYWRIGHT_BROWSERS_PATH=/home/scraper/.cache/ms-playwright
 
-# Command to run the Python script
+# Don't switch user here - entrypoint will handle it
+# USER scraper
+
+# Install Chromium for the scraper user (only what we need)
+USER scraper
+RUN playwright install chromium
+USER root
+
+ENTRYPOINT ["/entrypoint.sh"]
 CMD ["python", "main.py"]
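The diff moves the Chromium install under the `scraper` user and points `PLAYWRIGHT_BROWSERS_PATH` at that user's cache. A minimal smoke test, not part of the commit and purely illustrative, can confirm that the browser still resolves from the new path when run inside the built image:

```python
# Illustrative smoke test: verify Chromium launches for the scraper user.
# Run it inside the container; the entrypoint drops to the scraper user
# before executing whatever command is given.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # fails if the browser binaries are not found
    print("Chromium version:", browser.version)
    browser.close()
```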
README.md (new file, 117 lines)

@@ -0,0 +1,117 @@
# Warhammer Community RSS Scraper

A Python application that scrapes the Warhammer Community website and generates an RSS feed from the latest articles.

## Overview

This project uses web scraping to extract articles from the Warhammer Community website and converts them into an RSS feed format. It uses Playwright for JavaScript-heavy content rendering and BeautifulSoup for HTML parsing.

## Features

- Scrapes articles from the Warhammer Community website
- Generates an RSS feed with proper formatting
- Handles duplicate article detection
- Sorts articles by publication date (newest first)
- Dockerized for easy deployment
- Saves both the RSS feed and raw HTML for debugging
- **Security-focused**: URL validation, content filtering, and resource limits
- **Safe execution**: Runs as a non-root user in the container

## Requirements

- Python 3.12+
- Dependencies listed in `requirements.txt`:
  - playwright
  - beautifulsoup4
  - feedgen
  - pytz
  - requests

## Installation

### Local Setup

1. Install dependencies:

```bash
pip install -r requirements.txt
```

2. Install Playwright browsers:

```bash
playwright install
```

3. Run the script:

```bash
# Default: saves to current directory
python main.py

# Or specify output directory
python main.py /path/to/output

# Or use environment variable
OUTPUT_DIR=/path/to/output python main.py
```

### Docker Setup

1. Build the Docker image:

```bash
docker build -t warhammer-rss .
```

2. Run the container (multiple options to avoid permission issues):

**Option A: Save to current directory (simplest)**
```bash
docker run -v $(pwd):/app/output warhammer-rss
```

**Option B: Use environment variable for output directory**
```bash
docker run -e OUTPUT_DIR=/app/output -v $(pwd)/output:/app/output warhammer-rss
```

**Option C: With resource limits for additional security**
```bash
docker run --memory=512m --cpu-quota=50000 -v $(pwd):/app/output warhammer-rss
```

## Output

The application generates:
- `warhammer_rss_feed.xml` - RSS feed file
- `page.html` - raw HTML content for debugging

Both files are saved to the specified output directory (current directory by default).

## Security Features

This application implements several security measures:

- **URL Validation**: Only allows scraping from trusted Warhammer Community domains
- **Path Validation**: Prevents directory traversal attacks by validating output paths
- **Resource Limits**: Caps content size (10 MB) and scroll iterations (5) to prevent DoS
- **Content Filtering**: Sanitizes extracted text to prevent XSS and injection attacks
- **Non-root Execution**: The Docker container runs as user `scraper` (UID 1001) with reduced privileges
- **Input Sanitization**: All URLs and file paths are validated before use

## How It Works

1. **Validates** the target URL against a whitelist of allowed domains
2. Uses Playwright to load the Warhammer Community homepage with full JavaScript rendering
3. Scrolls through the page to load additional content (limited to 5 iterations)
4. **Validates content size** and parses the rendered HTML with BeautifulSoup
5. **Sanitizes** and extracts article titles, links, and publication dates
6. **Validates all links** against the allowed domains
7. Removes duplicates and sorts articles by date
8. Generates the RSS feed using the feedgen library
9. **Validates output paths** before saving files

## Configuration

The scraper targets `https://www.warhammer-community.com/en-gb/` by default and only allows URLs from:
- `warhammer-community.com`
- `www.warhammer-community.com`

To modify the allowed domains, update the `ALLOWED_DOMAINS` list in `main.py:11-14`.
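As a minimal sketch of the whitelist behaviour described under Configuration (the helper name `is_allowed` is illustrative; the commit's own check lives in `validate_url` in `main.py` below):

```python
import urllib.parse

ALLOWED_DOMAINS = ['warhammer-community.com', 'www.warhammer-community.com']

def is_allowed(url):
    """Mirror the commit's check: a scheme must be present and the host whitelisted."""
    parsed = urllib.parse.urlparse(url)
    return bool(parsed.scheme) and parsed.netloc.lower() in ALLOWED_DOMAINS

print(is_allowed('https://www.warhammer-community.com/en-gb/'))  # True
print(is_allowed('https://evil.example.com/en-gb/'))             # False
```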
main.py (137 changed lines)

@@ -4,9 +4,100 @@ from feedgen.feed import FeedGenerator
 from datetime import datetime
 import pytz
 import time
+import urllib.parse
+import os
+import sys
+
+# Allowed domains for scraping - security whitelist
+ALLOWED_DOMAINS = [
+    'warhammer-community.com',
+    'www.warhammer-community.com'
+]
+
+# Resource limits
+MAX_SCROLL_ITERATIONS = 5
+MAX_CONTENT_SIZE = 10 * 1024 * 1024  # 10MB
+
+def validate_url(url):
+    """Validate URL against whitelist of allowed domains"""
+    try:
+        parsed = urllib.parse.urlparse(url)
+        if not parsed.scheme or not parsed.netloc:
+            raise ValueError("Invalid URL format")
+
+        # Check if domain is in allowed list
+        domain = parsed.netloc.lower()
+        if domain not in ALLOWED_DOMAINS:
+            raise ValueError(f"Domain {domain} not in allowed list: {ALLOWED_DOMAINS}")
+
+        return True
+    except Exception as e:
+        raise ValueError(f"URL validation failed: {e}")
+
+def validate_output_path(path, base_dir):
+    """Validate and sanitize output file path"""
+    # Resolve to absolute path and check if it's safe
+    abs_path = os.path.abspath(path)
+    abs_base = os.path.abspath(base_dir)
+
+    # Ensure path is within allowed directory
+    if not abs_path.startswith(abs_base):
+        raise ValueError(f"Output path {abs_path} is outside allowed directory {abs_base}")
+
+    # Ensure output directory exists
+    os.makedirs(abs_base, exist_ok=True)
+
+    return abs_path
+
+def sanitize_text(text):
+    """Sanitize text content to prevent injection attacks"""
+    if not text:
+        return "No title"
+
+    # Remove potential harmful characters and limit length
+    sanitized = text.strip()[:500]  # Limit title length
+
+    # Remove any script tags or potentially harmful content
+    dangerous_patterns = ['<script', '</script', 'javascript:', 'data:', 'vbscript:']
+    for pattern in dangerous_patterns:
+        sanitized = sanitized.replace(pattern.lower(), '').replace(pattern.upper(), '')
+
+    return sanitized if sanitized else "No title"
+
+def validate_link(link, base_url):
+    """Validate and sanitize article links"""
+    if not link:
+        return None
+
+    try:
+        # Handle relative URLs
+        if link.startswith('/'):
+            parsed_base = urllib.parse.urlparse(base_url)
+            link = f"{parsed_base.scheme}://{parsed_base.netloc}{link}"
+
+        # Validate the resulting URL
+        parsed = urllib.parse.urlparse(link)
+        if not parsed.scheme or not parsed.netloc:
+            return None
+
+        # Ensure it's from allowed domain
+        domain = parsed.netloc.lower()
+        if domain not in ALLOWED_DOMAINS:
+            return None
+
+        return link
+    except Exception:
+        return None
 
 # Function to scrape articles using Playwright and generate an RSS feed
-def scrape_and_generate_rss(url):
+def scrape_and_generate_rss(url, output_dir=None):
+    # Validate URL first
+    validate_url(url)
+
+    # Set default output directory if not provided
+    if output_dir is None:
+        output_dir = '.'  # Default to current directory
+
     articles = []
     seen_urls = set()  # Set to track seen URLs and avoid duplicates
 
@@ -21,13 +112,19 @@ def scrape_and_generate_rss(url):
         # Load the Warhammer Community page
         page.goto(url, wait_until="networkidle")
 
-        # Simulate scrolling to load more content if needed
-        for _ in range(10):
+        # Simulate scrolling to load more content if needed (limited for security)
+        for _ in range(MAX_SCROLL_ITERATIONS):
             page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
             time.sleep(2)
 
         # Get the fully rendered HTML content
         html = page.content()
+
+        # Check content size for security
+        if len(html) > MAX_CONTENT_SIZE:
+            browser.close()
+            raise ValueError(f"Content size {len(html)} exceeds maximum {MAX_CONTENT_SIZE}")
+
         browser.close()
 
     # Parse the HTML content with BeautifulSoup
@@ -38,13 +135,15 @@ def scrape_and_generate_rss(url):
 
     # Find all articles in the page
     for article in soup.find_all('article'):
-        # Extract the title
+        # Extract and sanitize the title
         title_tag = article.find('h3', class_='newsCard-title-sm') or article.find('h3', class_='newsCard-title-lg')
-        title = title_tag.text.strip() if title_tag else 'No title'
+        raw_title = title_tag.text.strip() if title_tag else 'No title'
+        title = sanitize_text(raw_title)
 
-        # Extract the link
+        # Extract and validate the link
         link_tag = article.find('a', href=True)
-        link = link_tag['href'] if link_tag else None
+        raw_link = link_tag['href'] if link_tag else None
+        link = validate_link(raw_link, url)
 
         # Skip this entry if the link is None or the URL has already been seen
         if not link or link in seen_urls:
@@ -97,13 +196,29 @@ def scrape_and_generate_rss(url):
     # Generate the RSS feed
     rss_feed = fg.rss_str(pretty=True)
 
-    # Save the RSS feed to a file
-    with open('/app/output/warhammer_rss_feed.xml', 'wb') as f:
+    # Validate and save the RSS feed to a file
+    rss_path = validate_output_path(os.path.join(output_dir, 'warhammer_rss_feed.xml'), output_dir)
+    with open(rss_path, 'wb') as f:
         f.write(rss_feed)
 
-    with open('/app/output/page.html','w', encoding='utf-8') as f:
+    # Validate and save HTML for debugging
+    html_path = validate_output_path(os.path.join(output_dir, 'page.html'), output_dir)
+    with open(html_path, 'w', encoding='utf-8') as f:
         f.write(soup.prettify())
     print('RSS feed generated and saved as warhammer_rss_feed.xml')
 
+if __name__ == "__main__":
+    # Get output directory from environment variable or command line argument
+    output_dir = os.getenv('OUTPUT_DIR')
+
+    if len(sys.argv) > 1:
+        output_dir = sys.argv[1]
+
+    # Default to current directory if no output specified (avoids permission issues)
+    if not output_dir:
+        output_dir = '.'
+
+    print(f"Using output directory: {output_dir}")
+
 # Run the function
-scrape_and_generate_rss('https://www.warhammer-community.com/en-gb/')
+scrape_and_generate_rss('https://www.warhammer-community.com/en-gb/', output_dir)
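To make the intended behaviour of the new helpers concrete, here is a short sketch. The functions are re-implemented inline (mirroring the committed code) because importing `main.py` would trigger its module-level scrape call, and the sample inputs are invented:

```python
import os

def sanitize_text(text):
    """Mirror of the committed helper: trim, cap at 500 chars, strip risky fragments."""
    if not text:
        return "No title"
    sanitized = text.strip()[:500]
    for pattern in ['<script', '</script', 'javascript:', 'data:', 'vbscript:']:
        sanitized = sanitized.replace(pattern.lower(), '').replace(pattern.upper(), '')
    return sanitized if sanitized else "No title"

def validate_output_path(path, base_dir):
    """Mirror of the committed helper: the resolved path must stay inside base_dir."""
    abs_path = os.path.abspath(path)
    abs_base = os.path.abspath(base_dir)
    if not abs_path.startswith(abs_base):
        raise ValueError(f"Output path {abs_path} is outside allowed directory {abs_base}")
    os.makedirs(abs_base, exist_ok=True)
    return abs_path

print(sanitize_text('<script>alert(1)</script>New Codex Revealed'))
# -> '>alert(1)>New Codex Revealed'  (the flagged fragments are removed, the rest is kept)

try:
    validate_output_path('../../etc/passwd', 'output')
except ValueError as err:
    print(err)  # rejected: the path resolves outside the output directory
```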
output/page.html (new file, 6361 lines)

File diff suppressed because one or more lines are too long