Why Incremental?
Imagine scraping 100,000 products daily. A full crawl might take 8 hours and make 100,000 requests. But what if only 1% of products change daily?
Full Crawl Problems
- Time: Hours to complete
- Cost: Bandwidth, compute, proxy costs
- Risk: More requests = more likely to get blocked
- Freshness: Data at end of crawl may be stale
Incremental Benefits
- Speed: Process only what changed (maybe 1-10%)
- Cost: 90%+ reduction in requests
- Stealth: Fewer requests, less likely to trigger blocks
- History: Track changes over time
Change Detection Strategies
1. Hash-Based Detection
The simplest approach: hash the content and compare to the previous hash.
import hashlib
import json

def content_hash(data):
    """Generate hash of relevant fields only"""
    relevant = {
        'price': data.get('price'),
        'stock': data.get('stock_status'),
        'title': data.get('title')
    }
    content = json.dumps(relevant, sort_keys=True)
    return hashlib.md5(content.encode()).hexdigest()

def has_changed(item, previous_hash):
    current_hash = content_hash(item)
    return current_hash != previous_hash
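These helpers need a store of previous hashes to compare against. A minimal sketch of the run loop, using an in-memory dict as a stand-in for Redis or a database table (`content_hash` is repeated here so the sketch is self-contained):

```python
import hashlib
import json

def content_hash(data):
    """Hash only the fields whose changes we care about."""
    relevant = {
        'price': data.get('price'),
        'stock': data.get('stock_status'),
        'title': data.get('title')
    }
    return hashlib.md5(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

def detect_changes(items, previous_hashes):
    """Return only new or changed items, updating the hash store in place.
    `previous_hashes` maps item id -> last known content hash."""
    changed = []
    for item in items:
        h = content_hash(item)
        if previous_hashes.get(item['id']) != h:
            previous_hashes[item['id']] = h
            changed.append(item)
    return changed
```

On the second run with identical data, `detect_changes` returns an empty list, so downstream processing is skipped entirely.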
2. HTTP Headers
Use Last-Modified and ETag headers when available:
import requests

def fetch_if_changed(url, last_etag=None, last_modified=None):
    headers = {}
    if last_etag:
        headers['If-None-Match'] = last_etag
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        return None  # Not modified
    return {
        'content': response.text,
        'etag': response.headers.get('ETag'),
        'last_modified': response.headers.get('Last-Modified')
    }
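Conditional requests only pay off if the validators survive between runs. A minimal sketch that persists them in a JSON file (`http_cache.json` is a hypothetical location; any key-value store works):

```python
import json
import os

CACHE_PATH = 'http_cache.json'  # hypothetical location for cached validators

def load_cache(path=CACHE_PATH):
    """Load {url: {'etag': ..., 'last_modified': ...}} from the previous run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, 'w') as f:
        json.dump(cache, f)

def conditional_headers(cache, url):
    """Build If-None-Match / If-Modified-Since headers from cached validators."""
    entry = cache.get(url, {})
    headers = {}
    if entry.get('etag'):
        headers['If-None-Match'] = entry['etag']
    if entry.get('last_modified'):
        headers['If-Modified-Since'] = entry['last_modified']
    return headers
```

Each run loads the cache, passes `conditional_headers(cache, url)` into the request, stores the new `ETag`/`Last-Modified` from any 200 response, and saves the cache at the end.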
3. Sitemap Monitoring
Many sites include lastmod dates in sitemaps:
from xml.etree import ElementTree
import requests

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def get_changed_urls(sitemap_url, since_date):
    response = requests.get(sitemap_url)
    root = ElementTree.fromstring(response.content)
    changed = []
    for url in root.findall('.//sm:url', NS):
        loc = url.find('sm:loc', NS).text
        lastmod = url.find('sm:lastmod', NS)
        # ISO 8601 dates compare correctly as strings
        if lastmod is not None and lastmod.text > since_date:
            changed.append(loc)
    return changed
4. API-Based Updates
If the source has an API with timestamps or change feeds:
# Many APIs support filtering by updated_at
params = {
    'updated_after': last_sync_timestamp,
    'per_page': 100
}
response = requests.get(api_url, params=params)
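Change feeds are usually paginated, so a full sync loops until a short page comes back. A sketch of that loop, with the page fetcher injected so the same logic works against `requests` or a test stub (the `updated_after`/`page`/`per_page` parameter names are assumptions; check the API's docs):

```python
def sync_changes(fetch_page, last_sync_timestamp, per_page=100):
    """Pull all records updated since last_sync_timestamp from a paginated API.
    `fetch_page(params) -> list` performs one request and returns one page."""
    page, results = 1, []
    while True:
        batch = fetch_page({
            'updated_after': last_sync_timestamp,
            'page': page,
            'per_page': per_page,
        })
        results.extend(batch)
        if len(batch) < per_page:
            break  # short page means we've reached the end
        page += 1
    return results
```

In production, `fetch_page` would wrap `requests.get(api_url, params=params).json()`; remember to record the new sync timestamp only after the whole loop succeeds.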
Architecture Patterns
Basic Architecture
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Scheduler  │────▶│   Scraper    │────▶│   Storage   │
│  (Airflow)  │     │   (Scrapy)   │     │  (Postgres) │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Change Store │
                    │  (Redis/DB)  │
                    └──────────────┘
Components
- Scheduler: Triggers crawls on schedule (Airflow, cron, Prefect)
- Scraper: Fetches and parses data (Scrapy, custom Python)
- Change Store: Tracks hashes/timestamps for comparison
- Storage: Stores current data and change history
Database Schema
CREATE TABLE items (
    id SERIAL PRIMARY KEY,
    source_id VARCHAR(255) UNIQUE,
    data JSONB,
    content_hash VARCHAR(32),
    first_seen_at TIMESTAMP,
    last_seen_at TIMESTAMP,
    last_changed_at TIMESTAMP
);

CREATE TABLE item_history (
    id SERIAL PRIMARY KEY,
    item_id INTEGER REFERENCES items(id),
    data JSONB,
    changed_at TIMESTAMP,
    change_type VARCHAR(50)  -- 'created', 'updated', 'deleted'
);
Monitoring & Alerting
What to Monitor
- Success rate: % of URLs successfully scraped
- Change rate: % of items that changed (sudden drops = possible issue)
- Duration: Is the job taking longer than usual?
- Error types: 403s, 429s, timeouts, parse errors
- Data freshness: When was data last updated?
Alert Conditions
# Alert if success rate drops below threshold
if success_rate < 0.95:
    alert(f"Scraper success rate dropped to {success_rate}")

# Alert if no changes detected (might indicate broken scraper)
if change_rate == 0 and expected_changes > 0:
    alert("No changes detected - verify scraper is working")

# Alert if duration exceeds normal + 50%
if duration > normal_duration * 1.5:
    alert(f"Scraper running slow: {duration}s vs normal {normal_duration}s")
Health Dashboard
Build a simple dashboard showing:
- Last successful run time
- Items processed / changed / errors
- Historical trends
- Current job status
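Most of these numbers can come straight from the `items` table in the schema above. A sketch of the queries behind such a dashboard, assuming a DB-API style Postgres connection (e.g. psycopg2):

```python
# Queries run against the `items` table from the schema above
DASHBOARD_QUERIES = {
    'items_total': "SELECT COUNT(*) FROM items",
    'changed_24h': ("SELECT COUNT(*) FROM items "
                    "WHERE last_changed_at > NOW() - INTERVAL '24 hours'"),
    'stale_items': ("SELECT COUNT(*) FROM items "
                    "WHERE last_seen_at < NOW() - INTERVAL '48 hours'"),
    'last_update': "SELECT MAX(last_seen_at) FROM items",
}

def dashboard_stats(conn):
    """Run each query and return a {metric: value} dict for rendering."""
    stats = {}
    with conn.cursor() as cur:
        for name, sql in DASHBOARD_QUERIES.items():
            cur.execute(sql)
            stats[name] = cur.fetchone()[0]
    return stats
```

Rendering the dict as a static HTML page or pushing it into Grafana is usually enough; a sudden jump in `stale_items` is often the first sign of a broken selector.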
Implementation Examples
Scrapy with Incremental Logic
import scrapy
import hashlib
import json
import redis

class IncrementalSpider(scrapy.Spider):
    name = 'incremental'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redis = redis.Redis()

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'id': product.css('::attr(data-id)').get(),
                'title': product.css('.title::text').get(),
                'price': product.css('.price::text').get(),
            }
            # Check if changed
            current_hash = self.hash_item(item)
            previous_hash = self.redis.get(f"hash:{item['id']}")
            if previous_hash is None or current_hash != previous_hash.decode():
                # New or changed - yield for processing
                self.redis.set(f"hash:{item['id']}", current_hash)
                yield item

    def hash_item(self, item):
        content = json.dumps(item, sort_keys=True)
        return hashlib.md5(content.encode()).hexdigest()
Change History Recording
import json

def save_with_history(item, db):
    # `db` is a thin wrapper around a DB-API connection; `hash_item`
    # is the content-hash helper defined earlier
    existing = db.query(
        "SELECT * FROM items WHERE source_id = %s",
        (item['id'],)
    )
    if existing is None:
        # New item
        db.execute("""
            INSERT INTO items (source_id, data, content_hash,
                               first_seen_at, last_seen_at, last_changed_at)
            VALUES (%s, %s, %s, NOW(), NOW(), NOW())
        """, (item['id'], json.dumps(item), hash_item(item)))
        record_history(item, 'created', db)
    elif hash_item(item) != existing['content_hash']:
        # Changed
        db.execute("""
            UPDATE items
            SET data = %s, content_hash = %s,
                last_seen_at = NOW(), last_changed_at = NOW()
            WHERE source_id = %s
        """, (json.dumps(item), hash_item(item), item['id']))
        record_history(item, 'updated', db)
    else:
        # Unchanged - just update last_seen
        db.execute("""
            UPDATE items SET last_seen_at = NOW() WHERE source_id = %s
        """, (item['id'],))
Frequently Asked Questions
How do I detect deleted items?
Track items that weren't seen in the latest crawl. If an item's last_seen_at is older than the last crawl time, it may have been deleted. Implement a grace period before marking as deleted.
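A minimal sketch of that check in pure Python, given rows from the `items` table (the 24-hour grace period is an arbitrary choice to tune per source):

```python
from datetime import datetime, timedelta

def find_deleted(rows, crawl_started_at, grace=timedelta(hours=24)):
    """Return source_ids of items not seen since before the grace window,
    i.e. items likely deleted from the source site."""
    cutoff = crawl_started_at - grace
    return [row['source_id'] for row in rows if row['last_seen_at'] < cutoff]
```

Items returned here would then get a `'deleted'` row in `item_history` rather than being removed outright, so the record survives if the item reappears.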
What about sites without stable IDs?
Create synthetic IDs from stable attributes (URL, SKU, combination of title + other unique fields). Hash these to create a consistent identifier.
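A sketch of such a synthetic ID (which fields are actually stable depends on the site; URL plus SKU is a common starting point):

```python
import hashlib

def synthetic_id(item):
    """Derive a stable identifier from attributes that shouldn't change
    between crawls. Volatile fields like price must stay out of the key."""
    key = '|'.join([
        item.get('url', ''),
        item.get('sku', ''),
        item.get('title', ''),
    ])
    return hashlib.sha1(key.encode()).hexdigest()
```

The same item then hashes to the same ID on every crawl even when volatile fields like price change, which is exactly what the change-detection logic needs.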
How often should I do full crawls?
Even with incremental updates, periodic full crawls catch anything missed. Weekly or monthly full crawls are common, depending on data criticality.
Need help building data feeds?
I build production-grade scrapers with incremental updates and monitoring. Get in touch to discuss your project.
Book a Call