Why Incremental?
Imagine scraping 100,000 products daily. A full crawl might take 8 hours and make 100,000 requests. But what if only 1% of products change daily?
Full Crawl Problems
- Time: Hours to complete
- Cost: Bandwidth, compute, proxy costs
- Risk: More requests = more likely to get blocked
- Freshness: Data at end of crawl may be stale
Incremental Benefits
- Speed: Process only what changed (maybe 1-10%)
- Cost: 90%+ reduction in requests
- Stealth: Fewer requests, less likely to trigger blocks
- History: Track changes over time
Change Detection Strategies
1. Hash-Based Detection
The simplest approach: hash the content and compare to the previous hash.
import hashlib
import json

def content_hash(data):
    """Generate hash of relevant fields only"""
    relevant = {
        'price': data.get('price'),
        'stock': data.get('stock_status'),
        'title': data.get('title')
    }
    content = json.dumps(relevant, sort_keys=True)
    return hashlib.md5(content.encode()).hexdigest()

def has_changed(item, previous_hash):
    current_hash = content_hash(item)
    return current_hash != previous_hash
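These helpers need a store of previous hashes to compare against. A minimal sketch of the run loop, using an in-memory dict as a stand-in for Redis or a database table (`content_hash` is repeated here so the sketch is self-contained):

```python
import hashlib
import json

def content_hash(data):
    """Hash only the fields whose changes we care about."""
    relevant = {
        'price': data.get('price'),
        'stock': data.get('stock_status'),
        'title': data.get('title')
    }
    return hashlib.md5(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

def detect_changes(items, previous_hashes):
    """Return only new or changed items, updating the hash store in place.
    `previous_hashes` maps item id -> last known content hash."""
    changed = []
    for item in items:
        h = content_hash(item)
        if previous_hashes.get(item['id']) != h:
            previous_hashes[item['id']] = h
            changed.append(item)
    return changed
```

On the second run with identical data, `detect_changes` returns an empty list, so downstream processing is skipped entirely.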
2. HTTP Headers
Use Last-Modified and ETag headers when available:
import requests

def fetch_if_changed(url, last_etag=None, last_modified=None):
    headers = {}
    if last_etag:
        headers['If-None-Match'] = last_etag
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        return None  # Not modified
    return {
        'content': response.text,
        'etag': response.headers.get('ETag'),
        'last_modified': response.headers.get('Last-Modified')
    }
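Conditional requests only pay off if the validators survive between runs. A minimal sketch that persists them in a JSON file (`http_cache.json` is a hypothetical location; any key-value store works):

```python
import json
import os

CACHE_PATH = 'http_cache.json'  # hypothetical location for cached validators

def load_cache(path=CACHE_PATH):
    """Load {url: {'etag': ..., 'last_modified': ...}} from the previous run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, 'w') as f:
        json.dump(cache, f)

def conditional_headers(cache, url):
    """Build If-None-Match / If-Modified-Since headers from cached validators."""
    entry = cache.get(url, {})
    headers = {}
    if entry.get('etag'):
        headers['If-None-Match'] = entry['etag']
    if entry.get('last_modified'):
        headers['If-Modified-Since'] = entry['last_modified']
    return headers
```

Each run loads the cache, passes `conditional_headers(cache, url)` into the request, stores the new `ETag`/`Last-Modified` from any 200 response, and saves the cache at the end.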
3. Sitemap Monitoring
Many sites include lastmod dates in sitemaps:
from xml.etree import ElementTree
import requests

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def get_changed_urls(sitemap_url, since_date):
    response = requests.get(sitemap_url)
    root = ElementTree.fromstring(response.content)
    changed = []
    for url in root.findall('.//sm:url', NS):
        loc = url.find('sm:loc', NS).text
        lastmod = url.find('sm:lastmod', NS)
        # ISO 8601 dates compare correctly as strings
        if lastmod is not None and lastmod.text > since_date:
            changed.append(loc)
    return changed
4. API-Based Updates
If the source has an API with timestamps or change feeds:
# Many APIs support filtering by updated_at
params = {
    'updated_after': last_sync_timestamp,
    'per_page': 100
}
response = requests.get(api_url, params=params)
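Change feeds are usually paginated, so a full sync loops until a short page comes back. A sketch of that loop, with the page fetcher injected so the same logic works against `requests` or a test stub (the `updated_after`/`page`/`per_page` parameter names are assumptions; check the API's docs):

```python
def sync_changes(fetch_page, last_sync_timestamp, per_page=100):
    """Pull all records updated since last_sync_timestamp from a paginated API.
    `fetch_page(params) -> list` performs one request and returns one page."""
    page, results = 1, []
    while True:
        batch = fetch_page({
            'updated_after': last_sync_timestamp,
            'page': page,
            'per_page': per_page,
        })
        results.extend(batch)
        if len(batch) < per_page:
            break  # short page means we've reached the end
        page += 1
    return results
```

In production, `fetch_page` would wrap `requests.get(api_url, params=params).json()`; remember to record the new sync timestamp only after the whole loop succeeds.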
Architecture Patterns
Basic Architecture
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Scheduler  │────▶│   Scraper    │────▶│   Storage   │
│  (Airflow)  │     │   (Scrapy)   │     │  (Postgres) │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Change Store │
                    │  (Redis/DB)  │
                    └──────────────┘
Components
- Scheduler: Triggers crawls on schedule (Airflow, cron, Prefect)
- Scraper: Fetches and parses data (Scrapy, custom Python)
- Change Store: Tracks hashes/timestamps for comparison
- Storage: Stores current data and change history
Database Schema
CREATE TABLE items (
    id SERIAL PRIMARY KEY,
    source_id VARCHAR(255) UNIQUE,
    data JSONB,
    content_hash VARCHAR(32),
    first_seen_at TIMESTAMP,
    last_seen_at TIMESTAMP,
    last_changed_at TIMESTAMP
);

CREATE TABLE item_history (
    id SERIAL PRIMARY KEY,
    item_id INTEGER REFERENCES items(id),
    data JSONB,
    changed_at TIMESTAMP,
    change_type VARCHAR(50)  -- 'created', 'updated', 'deleted'
);
Monitoring & Alerting
What to Monitor
- Success rate: % of URLs successfully scraped
- Change rate: % of items that changed (sudden drops = possible issue)
- Duration: Is the job taking longer than usual?
- Error types: 403s, 429s, timeouts, parse errors
- Data freshness: When was data last updated?
Alert Conditions
# Alert if success rate drops below threshold
if success_rate < 0.95:
    alert(f"Scraper success rate dropped to {success_rate}")

# Alert if no changes detected (might indicate broken scraper)
if change_rate == 0 and expected_changes > 0:
    alert("No changes detected - verify scraper is working")

# Alert if duration exceeds normal + 50%
if duration > normal_duration * 1.5:
    alert(f"Scraper running slow: {duration}s vs normal {normal_duration}s")
Health Dashboard
Build a simple dashboard showing:
- Last successful run time
- Items processed / changed / errors
- Historical trends
- Current job status
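Most of these numbers can come straight from the `items` table in the schema above. A sketch of the queries behind such a dashboard, assuming a DB-API style Postgres connection (e.g. psycopg2):

```python
# Queries run against the `items` table from the schema above
DASHBOARD_QUERIES = {
    'items_total': "SELECT COUNT(*) FROM items",
    'changed_24h': ("SELECT COUNT(*) FROM items "
                    "WHERE last_changed_at > NOW() - INTERVAL '24 hours'"),
    'stale_items': ("SELECT COUNT(*) FROM items "
                    "WHERE last_seen_at < NOW() - INTERVAL '48 hours'"),
    'last_update': "SELECT MAX(last_seen_at) FROM items",
}

def dashboard_stats(conn):
    """Run each query and return a {metric: value} dict for rendering."""
    stats = {}
    with conn.cursor() as cur:
        for name, sql in DASHBOARD_QUERIES.items():
            cur.execute(sql)
            stats[name] = cur.fetchone()[0]
    return stats
```

Rendering the dict as a static HTML page or pushing it into Grafana is usually enough; a sudden jump in `stale_items` is often the first sign of a broken selector.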
Implementation Examples
Scrapy with Incremental Logic
import scrapy
import hashlib
import json
import redis

class IncrementalSpider(scrapy.Spider):
    name = 'incremental'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redis = redis.Redis()

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'id': product.css('::attr(data-id)').get(),
                'title': product.css('.title::text').get(),
                'price': product.css('.price::text').get(),
            }
            # Check if changed
            current_hash = self.hash_item(item)
            previous_hash = self.redis.get(f"hash:{item['id']}")
            if previous_hash is None or current_hash != previous_hash.decode():
                # New or changed - yield for processing
                self.redis.set(f"hash:{item['id']}", current_hash)
                yield item

    def hash_item(self, item):
        content = json.dumps(item, sort_keys=True)
        return hashlib.md5(content.encode()).hexdigest()
Change History Recording
import json

def save_with_history(item, db):
    # `db` is a thin wrapper around a DB-API connection; `hash_item`
    # is the content-hash helper defined earlier
    existing = db.query(
        "SELECT * FROM items WHERE source_id = %s",
        (item['id'],)
    )
    if existing is None:
        # New item
        db.execute("""
            INSERT INTO items (source_id, data, content_hash,
                               first_seen_at, last_seen_at, last_changed_at)
            VALUES (%s, %s, %s, NOW(), NOW(), NOW())
        """, (item['id'], json.dumps(item), hash_item(item)))
        record_history(item, 'created', db)
    elif hash_item(item) != existing['content_hash']:
        # Changed
        db.execute("""
            UPDATE items
            SET data = %s, content_hash = %s,
                last_seen_at = NOW(), last_changed_at = NOW()
            WHERE source_id = %s
        """, (json.dumps(item), hash_item(item), item['id']))
        record_history(item, 'updated', db)
    else:
        # Unchanged - just update last_seen
        db.execute("""
            UPDATE items SET last_seen_at = NOW() WHERE source_id = %s
        """, (item['id'],))
Frequently Asked Questions
How do I detect deleted items?
Track items that weren't seen in the latest crawl. If an item's last_seen_at is older than the last crawl time, it may have been deleted. Implement a grace period before marking as deleted.
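A minimal sketch of that check in pure Python, given rows from the `items` table (the 24-hour grace period is an arbitrary choice to tune per source):

```python
from datetime import datetime, timedelta

def find_deleted(rows, crawl_started_at, grace=timedelta(hours=24)):
    """Return source_ids of items not seen since before the grace window,
    i.e. items likely deleted from the source site."""
    cutoff = crawl_started_at - grace
    return [row['source_id'] for row in rows if row['last_seen_at'] < cutoff]
```

Items returned here would then get a `'deleted'` row in `item_history` rather than being removed outright, so the record survives if the item reappears.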
What about sites without stable IDs?
Create synthetic IDs from stable attributes (URL, SKU, combination of title + other unique fields). Hash these to create a consistent identifier.
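A sketch of such a synthetic ID (which fields are actually stable depends on the site; URL plus SKU is a common starting point):

```python
import hashlib

def synthetic_id(item):
    """Derive a stable identifier from attributes that shouldn't change
    between crawls. Volatile fields like price must stay out of the key."""
    key = '|'.join([
        item.get('url', ''),
        item.get('sku', ''),
        item.get('title', ''),
    ])
    return hashlib.sha1(key.encode()).hexdigest()
```

The same item then hashes to the same ID on every crawl even when volatile fields like price change, which is exactly what the change-detection logic needs.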
How often should I do full crawls?
Even with incremental updates, periodic full crawls catch anything missed. Weekly or monthly full crawls are common, depending on data criticality.
Need help building data feeds?
I build production-grade scrapers with incremental updates and monitoring. Get in touch to discuss your project.
Book a Call