diff --git a/pytest.ini b/pytest.ini
new file mode 100644
index 0000000..b111982
--- /dev/null
+++ b/pytest.ini
@@ -0,0 +1,14 @@
+[pytest]
+testpaths = tests
+python_files = test_*.py
+python_classes = Test*
+python_functions = test_*
+addopts =
+ -v
+ --tb=short
+ --strict-markers
+ --disable-warnings
+markers =
+ unit: Unit tests
+ integration: Integration tests
+ slow: Slow running tests
\ No newline at end of file
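For reference, a minimal test module consistent with the discovery rules and markers registered above. This is an illustrative sketch, not part of the diff: the file name, test names, and URL are assumptions.

```python
# tests/test_markers_example.py -- illustrative only; not part of this PR.
# Demonstrates the unit/integration/slow markers declared in pytest.ini.
import hashlib

import pytest


@pytest.mark.unit
def test_sha256_cache_key_is_stable():
    # Mirrors the ContentCache._get_cache_key scheme: same URL, same key.
    url = "https://www.warhammer-community.com/en-gb/"
    key1 = hashlib.sha256(url.encode()).hexdigest()
    key2 = hashlib.sha256(url.encode()).hexdigest()
    assert key1 == key2
    assert len(key1) == 64  # SHA-256 hex digest length


@pytest.mark.integration
@pytest.mark.slow
def test_live_scrape_marker_demo():
    pytest.skip("illustrative placeholder; live scraping not run in CI")
```

Because `--strict-markers` is set, a typo in a marker name fails collection instead of silently passing, and tiers can be selected with `pytest -m unit` or excluded with `pytest -m "not slow"`.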
diff --git a/requirements.txt b/requirements.txt
index 995be38..7d762e0 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,4 +2,8 @@ requests
beautifulsoup4
feedgen
playwright
-pytz
\ No newline at end of file
+pytz
+pytest
+pytest-mock
+pytest-asyncio
+bleach
\ No newline at end of file
diff --git a/src/__init__.py b/src/__init__.py
new file mode 100644
index 0000000..8e95836
--- /dev/null
+++ b/src/__init__.py
@@ -0,0 +1 @@
+# RSS Scraper package
\ No newline at end of file
diff --git a/src/rss_scraper/__init__.py b/src/rss_scraper/__init__.py
new file mode 100644
index 0000000..22ccae2
--- /dev/null
+++ b/src/rss_scraper/__init__.py
@@ -0,0 +1,5 @@
+"""RSS Scraper for Warhammer Community website."""
+
+__version__ = "1.0.0"
+__author__ = "RSS Scraper"
+__description__ = "A production-ready RSS scraper for Warhammer Community website"
\ No newline at end of file
diff --git a/src/rss_scraper/cache.py b/src/rss_scraper/cache.py
new file mode 100644
index 0000000..c1a0faa
--- /dev/null
+++ b/src/rss_scraper/cache.py
@@ -0,0 +1,216 @@
+"""Caching utilities for avoiding redundant scraping."""
+
+import os
+import json
+import hashlib
+import logging
+from datetime import datetime, timedelta
+from typing import Optional, Dict, Any, List
+import requests
+
+from .exceptions import FileOperationError
+
+logger = logging.getLogger(__name__)
+
+
+class ContentCache:
+ """Cache for storing and retrieving scraped content."""
+
+ def __init__(self, cache_dir: str = "cache"):
+ self.cache_dir = cache_dir
+ self.cache_file = os.path.join(cache_dir, "content_cache.json")
+ self.etag_file = os.path.join(cache_dir, "etags.json")
+ self.max_cache_age_hours = 24 # Cache expires after 24 hours
+
+ # Ensure cache directory exists
+ os.makedirs(cache_dir, exist_ok=True)
+
+ def _get_cache_key(self, url: str) -> str:
+ """Generate cache key from URL."""
+ return hashlib.sha256(url.encode()).hexdigest()
+
+ def _load_cache(self) -> Dict[str, Any]:
+ """Load cache from file."""
+ try:
+ if os.path.exists(self.cache_file):
+ with open(self.cache_file, 'r') as f:
+ return json.load(f)
+ except Exception as e:
+ logger.warning(f"Failed to load cache: {e}")
+ return {}
+
+ def _save_cache(self, cache_data: Dict[str, Any]) -> None:
+ """Save cache to file."""
+ try:
+ with open(self.cache_file, 'w') as f:
+ json.dump(cache_data, f, indent=2, default=str)
+ except Exception as e:
+ logger.error(f"Failed to save cache: {e}")
+ raise FileOperationError(f"Failed to save cache: {e}")
+
+ def _load_etags(self) -> Dict[str, str]:
+ """Load ETags from file."""
+ try:
+ if os.path.exists(self.etag_file):
+ with open(self.etag_file, 'r') as f:
+ return json.load(f)
+ except Exception as e:
+ logger.warning(f"Failed to load ETags: {e}")
+ return {}
+
+ def _save_etags(self, etag_data: Dict[str, str]) -> None:
+ """Save ETags to file."""
+ try:
+ with open(self.etag_file, 'w') as f:
+ json.dump(etag_data, f, indent=2)
+ except Exception as e:
+ logger.warning(f"Failed to save ETags: {e}")
+
+ def _is_cache_valid(self, cached_entry: Dict[str, Any]) -> bool:
+ """Check if cached entry is still valid."""
+ try:
+ cached_time = datetime.fromisoformat(cached_entry['timestamp'])
+ expiry_time = cached_time + timedelta(hours=self.max_cache_age_hours)
+ return datetime.now() < expiry_time
+ except (KeyError, ValueError):
+ return False
+
+ def check_if_content_changed(self, url: str) -> Optional[Dict[str, str]]:
+ """Check if content has changed using conditional requests."""
+ etags = self._load_etags()
+ cache_key = self._get_cache_key(url)
+
+ headers = {}
+ if cache_key in etags:
+ headers['If-None-Match'] = etags[cache_key]
+
+ try:
+ logger.debug(f"Checking if content changed for {url}")
+ response = requests.head(url, headers=headers, timeout=10)
+
+ # 304 means not modified
+ if response.status_code == 304:
+ logger.info(f"Content not modified for {url}")
+ return {'status': 'not_modified'}
+
+ # Update ETag if available
+ if 'etag' in response.headers:
+ etags[cache_key] = response.headers['etag']
+ self._save_etags(etags)
+ logger.debug(f"Updated ETag for {url}")
+
+ return {'status': 'modified', 'etag': response.headers.get('etag')}
+
+ except requests.RequestException as e:
+ logger.warning(f"Failed to check content modification for {url}: {e}")
+ # If we can't check, assume it's modified
+ return {'status': 'modified'}
+
+ def get_cached_content(self, url: str) -> Optional[str]:
+ """Get cached HTML content if available and valid."""
+ cache_data = self._load_cache()
+ cache_key = self._get_cache_key(url)
+
+ if cache_key not in cache_data:
+ logger.debug(f"No cached content for {url}")
+ return None
+
+ cached_entry = cache_data[cache_key]
+
+ if not self._is_cache_valid(cached_entry):
+ logger.debug(f"Cached content for {url} has expired")
+ # Remove expired entry
+ del cache_data[cache_key]
+ self._save_cache(cache_data)
+ return None
+
+ logger.info(f"Using cached content for {url}")
+ return cached_entry['content']
+
+ def cache_content(self, url: str, content: str) -> None:
+ """Cache HTML content with timestamp."""
+ cache_data = self._load_cache()
+ cache_key = self._get_cache_key(url)
+
+ cache_data[cache_key] = {
+ 'url': url,
+ 'content': content,
+ 'timestamp': datetime.now().isoformat(),
+ 'size': len(content)
+ }
+
+ self._save_cache(cache_data)
+ logger.info(f"Cached content for {url} ({len(content)} bytes)")
+
+ def get_cached_articles(self, url: str) -> Optional[List[Dict[str, Any]]]:
+ """Get cached articles if available and valid."""
+ cache_data = self._load_cache()
+ cache_key = self._get_cache_key(url) + "_articles"
+
+ if cache_key not in cache_data:
+ return None
+
+ cached_entry = cache_data[cache_key]
+
+ if not self._is_cache_valid(cached_entry):
+ # Remove expired entry
+ del cache_data[cache_key]
+ self._save_cache(cache_data)
+ return None
+
+ logger.info(f"Using cached articles for {url}")
+ return cached_entry['articles']
+
+ def cache_articles(self, url: str, articles: List[Dict[str, Any]]) -> None:
+ """Cache extracted articles."""
+ cache_data = self._load_cache()
+ cache_key = self._get_cache_key(url) + "_articles"
+
+ # Convert datetime objects to strings for JSON serialization
+ serializable_articles = []
+ for article in articles:
+ serializable_article = article.copy()
+ if 'date' in serializable_article and hasattr(serializable_article['date'], 'isoformat'):
+ serializable_article['date'] = serializable_article['date'].isoformat()
+ serializable_articles.append(serializable_article)
+
+ cache_data[cache_key] = {
+ 'url': url,
+ 'articles': serializable_articles,
+ 'timestamp': datetime.now().isoformat(),
+ 'count': len(articles)
+ }
+
+ self._save_cache(cache_data)
+ logger.info(f"Cached {len(articles)} articles for {url}")
+
+ def clear_cache(self) -> None:
+ """Clear all cached content."""
+ try:
+ if os.path.exists(self.cache_file):
+ os.remove(self.cache_file)
+ if os.path.exists(self.etag_file):
+ os.remove(self.etag_file)
+ logger.info("Cache cleared successfully")
+ except Exception as e:
+ logger.error(f"Failed to clear cache: {e}")
+ raise FileOperationError(f"Failed to clear cache: {e}")
+
+ def get_cache_info(self) -> Dict[str, Any]:
+ """Get information about cached content."""
+ cache_data = self._load_cache()
+ etags = self._load_etags()
+
+ info = {
+ 'cache_file': self.cache_file,
+ 'etag_file': self.etag_file,
+ 'cache_entries': len(cache_data),
+ 'etag_entries': len(etags),
+ 'cache_size_bytes': 0
+ }
+
+ if os.path.exists(self.cache_file):
+ info['cache_size_bytes'] = os.path.getsize(self.cache_file)
+
+ return info
\ No newline at end of file
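A minimal usage sketch of the cache layer above. The import path assumes the `src/` layout introduced in this PR, and `fetch_page` is a hypothetical helper standing in for the real scraper:

```python
# Illustrative usage of ContentCache; fetch_page is hypothetical.
from src.rss_scraper.cache import ContentCache

cache = ContentCache(cache_dir="cache")
url = "https://www.warhammer-community.com/en-gb/"

# Conditional-request check: a 304 via the stored ETag means the page is unchanged.
status = cache.check_if_content_changed(url)
if status and status["status"] == "not_modified":
    # May still return None if the 24-hour TTL has lapsed.
    html = cache.get_cached_content(url)
else:
    html = fetch_page(url)  # hypothetical fetch helper, not defined in this PR
    cache.cache_content(url, html)

print(cache.get_cache_info())  # entry counts and on-disk cache size
```

Note the two-level design: ETags avoid re-downloading unchanged pages, while the 24-hour timestamp cap bounds staleness even when a server never returns 304.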
diff --git a/src/rss_scraper/config.py b/src/rss_scraper/config.py
new file mode 100644
index 0000000..8c55338
--- /dev/null
+++ b/src/rss_scraper/config.py
@@ -0,0 +1,77 @@
+"""Configuration management for RSS Warhammer scraper."""
+
+import os
+import re
+import logging
+from typing import List, Optional
+
+import bleach
+
+logger = logging.getLogger(__name__)
+
+
+class Config:
+ """Configuration class for RSS scraper settings."""
+
+ # Security settings
+ ALLOWED_DOMAINS: List[str] = [
+ 'warhammer-community.com',
+ 'www.warhammer-community.com'
+ ]
+
+ # Scraping limits
+ MAX_SCROLL_ITERATIONS: int = int(os.getenv('MAX_SCROLL_ITERATIONS', '5'))
+ MAX_CONTENT_SIZE: int = int(os.getenv('MAX_CONTENT_SIZE', str(10 * 1024 * 1024))) # 10MB
+ MAX_TITLE_LENGTH: int = int(os.getenv('MAX_TITLE_LENGTH', '500'))
+
+ # Timing settings
+ SCROLL_DELAY_SECONDS: float = float(os.getenv('SCROLL_DELAY_SECONDS', '2.0'))
+ PAGE_TIMEOUT_MS: int = int(os.getenv('PAGE_TIMEOUT_MS', '120000'))
+
+ # Default URLs and paths
+ DEFAULT_URL: str = os.getenv('DEFAULT_URL', 'https://www.warhammer-community.com/en-gb/')
+ DEFAULT_OUTPUT_DIR: str = os.getenv('DEFAULT_OUTPUT_DIR', '.')
+
+ # File names
+ RSS_FILENAME: str = os.getenv('RSS_FILENAME', 'warhammer_rss_feed.xml')
+ DEBUG_HTML_FILENAME: str = os.getenv('DEBUG_HTML_FILENAME', 'page.html')
+
+ # Feed metadata
+ FEED_TITLE: str = os.getenv('FEED_TITLE', 'Warhammer Community RSS Feed')
+ FEED_DESCRIPTION: str = os.getenv('FEED_DESCRIPTION', 'Latest Warhammer Community Articles')
+
+    # Security patterns to remove from content (compiled regexes used by sanitize_html)
+    DANGEROUS_PATTERNS: List[re.Pattern] = [
+        re.compile(r'<script[^>]*>.*?</script>', re.IGNORECASE | re.DOTALL),
+        re.compile(r'<iframe[^>]*>.*?</iframe>', re.IGNORECASE | re.DOTALL),
+        re.compile(r'<object[^>]*>.*?</object>', re.IGNORECASE | re.DOTALL),
+        re.compile(r'<embed[^>]*>.*?</embed>', re.IGNORECASE | re.DOTALL),
+        re.compile(r'<style[^>]*>.*?</style>', re.IGNORECASE | re.DOTALL),
+        re.compile(r'<form[^>]*>.*?</form>', re.IGNORECASE | re.DOTALL),
+        re.compile(r'javascript:', re.IGNORECASE),
+        re.compile(r'vbscript:', re.IGNORECASE),
+        re.compile(r'data:', re.IGNORECASE),
+        re.compile(r'on\w+\s*=', re.IGNORECASE),  # event handlers like onclick, onload, etc.
+    ]
+
+    # Conservative allow-lists consumed by bleach in sanitize_html()
+    ALLOWED_TAGS: List[str] = ['p', 'br', 'strong', 'em', 'ul', 'ol', 'li', 'a']
+    ALLOWED_ATTRIBUTES: dict = {'a': ['href', 'title']}
+    ALLOWED_PROTOCOLS: List[str] = ['http', 'https']
+
+ def sanitize_html(self, html_content: str) -> str:
+ """Sanitize HTML content using bleach library."""
+ if not html_content:
+ return ""
+
+ try:
+ # First pass: remove obviously dangerous patterns
+ cleaned = html_content
+            for pattern in self.DANGEROUS_PATTERNS:
+                cleaned = pattern.sub('', cleaned)
+
+            # Second pass: use bleach for comprehensive sanitization
+            sanitized = bleach.clean(
+                cleaned,
+                tags=self.ALLOWED_TAGS,
+                attributes=self.ALLOWED_ATTRIBUTES,
+                protocols=self.ALLOWED_PROTOCOLS,
+ strip=True,
+ strip_comments=True
+ )
+
+ return sanitized
+
+ except Exception as e:
+ logger.error(f"Error sanitizing HTML: {e}")
+ # If sanitization fails, return empty string for safety
+ return ""
+
+ def sanitize_text(self, text: Optional[str]) -> str:
+ """Enhanced text sanitization with better security."""
+ if not text:
+ return "No title"
+
+ # Basic cleaning
+ sanitized = text.strip()
+
+ # Remove null bytes and other control characters
+ sanitized = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '', sanitized)
+
+        # Remove dangerous patterns (already compiled as case-insensitive)
+        for pattern in Config.DANGEROUS_PATTERNS:
+            sanitized = pattern.sub('', sanitized)
+
+ # Limit length
+ sanitized = sanitized[:Config.MAX_TITLE_LENGTH]
+
+ # Remove excessive whitespace
+ sanitized = re.sub(r'\s+', ' ', sanitized).strip()
+
+ return sanitized if sanitized else "No title"
+
+ def validate_url_security(self, url: str) -> bool:
+ """Enhanced URL validation for security."""
+ if not url:
+ return False
+
+ # Check for dangerous protocols
+ dangerous_protocols = ['javascript:', 'vbscript:', 'data:', 'file:', 'ftp:']
+ url_lower = url.lower()
+
+ for protocol in dangerous_protocols:
+ if url_lower.startswith(protocol):
+ logger.warning(f"Blocked dangerous protocol in URL: {url}")
+ return False
+
+ # Check for suspicious patterns
+        suspicious_patterns = [
+            r'\.\./',      # Path traversal
+            r'%2e%2e%2f',  # Encoded path traversal
+        ]
+
+        for pattern in suspicious_patterns:
+            if re.search(pattern, url_lower):
+                logger.warning(f"Blocked suspicious pattern in URL: {url}")
+                return False
+
+        return True