Core Principles of Reliable Pipelines
Before diving into specifics, let's establish the principles that guide reliable pipeline design:
- Fail fast, fail loud: Don't silently swallow errors. If something breaks, make noise immediately.
- Expect failure: Networks fail. APIs time out. Disks fill up. Design for these scenarios.
- Be idempotent: Running the same job twice should produce the same result.
- Validate data: Check data quality at every stage, not just at the end.
- Log everything: When debugging at 3am, you'll thank yourself for comprehensive logging.
Idempotency: The Foundation of Reliability
An idempotent pipeline produces the same result whether you run it once or ten times. This is crucial because:
- You can safely retry failed jobs
- Backfills don't corrupt data
- Debugging is easier when you can rerun jobs
Strategies for Idempotency
Partition-based overwrite: Write to date-partitioned tables and overwrite the entire partition:
-- Instead of INSERT, use INSERT OVERWRITE
INSERT OVERWRITE TABLE orders PARTITION (order_date = '2024-01-15')
SELECT * FROM staging_orders WHERE order_date = '2024-01-15';
Merge/Upsert: Use MERGE statements for slowly changing dimensions:
MERGE INTO customers AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
Delete-then-insert: For simpler cases, delete existing data first. Wrap both statements in a transaction so a failure between them can't leave the partition empty:
BEGIN;
DELETE FROM orders WHERE order_date = '2024-01-15';
INSERT INTO orders SELECT * FROM staging WHERE order_date = '2024-01-15';
COMMIT;
Error Handling That Actually Works
Categorize Your Errors
Not all errors are equal. Categorize them to handle appropriately:
- Transient errors: Network timeouts, rate limits, temporary unavailability. Retry with exponential backoff.
- Data errors: Invalid records, schema mismatches. Log to dead letter queue, continue processing.
- Fatal errors: Configuration issues, permission problems. Fail immediately and alert.
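The three categories above can be encoded as an exception hierarchy so handlers can branch on type. A minimal sketch (the class names are illustrative, not from any library):

```python
class PipelineError(Exception):
    """Base class for all pipeline failures."""

class TransientError(PipelineError):
    """Retryable: network timeouts, rate limits, temporary unavailability."""

class DataError(PipelineError):
    """Bad record: route to the dead letter queue and keep processing."""

class FatalError(PipelineError):
    """Misconfiguration or permissions: fail immediately and alert."""
```

With distinct types, a handler can `except TransientError` to retry while letting `FatalError` propagate and kill the job, which is exactly the behavior each category calls for.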
Implement Retry Logic
import time
from functools import wraps

class TransientError(Exception):
    """Placeholder for your retryable error type (timeouts, rate limits)."""

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_retries - 1:
                        raise  # out of retries: fail loud
                    delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
                    time.sleep(delay)
        return wrapper
    return decorator
Dead Letter Queues
Don't let bad records kill your pipeline. Route them to a dead letter queue for later investigation:
- Log the full record and error message
- Include timestamp and job context
- Alert if dead letter queue grows beyond threshold
- Build tooling to replay records after fixing issues
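A dead letter queue can start as simple as an append-only file or table. A minimal sketch that writes each failed record as a JSON line with the context listed above (the file-based storage and field names here are illustrative; production systems often use a Kafka topic or object storage instead):

```python
import json
import time

def send_to_dlq(dlq_path, record, error, job_name):
    """Append a failed record with full context for later replay."""
    entry = {
        "record": record,        # the full offending record
        "error": str(error),     # why it failed
        "job": job_name,         # which job produced it
        "ts": time.time(),       # when it failed
    }
    with open(dlq_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Because each line is self-describing, replay tooling only needs to read the file, fix or filter records, and feed them back through the normal ingestion path.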
Data Validation at Every Stage
Data quality issues compound. Catch them early.
Source Validation
Before processing, validate what you received:
- File exists and has expected size/row count
- Schema matches expectations
- Required fields are present
- Date ranges are reasonable
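These checks are straightforward to script before any processing starts. A sketch for a CSV source, failing fast with a specific message on the first broken check (the thresholds and column names are examples, not prescriptions):

```python
import csv
import os

def validate_source(path, required_columns, min_rows=1):
    """Raise ValueError with a specific message on the first failed check."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        raise ValueError(f"{path}: missing or empty")
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = set(required_columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path}: missing columns {sorted(missing)}")
        row_count = sum(1 for _ in reader)
    if row_count < min_rows:
        raise ValueError(f"{path}: only {row_count} rows, expected >= {min_rows}")
    return row_count
```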
Transform Validation
After transformations, verify your logic:
- Row counts match expectations (input rows = output rows + filtered rows)
- Aggregations are reasonable (totals, averages within expected ranges)
- No unexpected NULLs introduced
- Referential integrity maintained
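The row-count identity above (input rows = output rows + filtered rows) is cheap to assert after every transform, and catches silent drops or accidental fan-out from bad joins. A sketch:

```python
def check_row_counts(input_rows, output_rows, filtered_rows):
    """Verify no rows were silently dropped or duplicated by a transform."""
    if input_rows != output_rows + filtered_rows:
        raise AssertionError(
            f"row count mismatch: {input_rows} in != "
            f"{output_rows} out + {filtered_rows} filtered"
        )
```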
Load Validation
After loading, confirm data landed correctly:
- Row counts in target match source
- Checksums match for critical columns
- No duplicate primary keys
- Freshness checks (latest timestamp is recent)
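Freshness checks catch the silent failure mode where a job "succeeds" but loads nothing new. A sketch (the 24-hour default is an assumption; tune it to the table's load schedule):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_ts, max_age_hours=24):
    """Raise if the most recent record is older than the allowed window."""
    age = datetime.now(timezone.utc) - latest_ts
    if age > timedelta(hours=max_age_hours):
        raise ValueError(f"stale data: latest record is {age} old")
```

Feed it the `MAX(updated_at)` from the target table after each load.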
Use dbt Tests or Great Expectations
Don't reinvent the wheel. Use established frameworks:
# dbt schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
Monitoring & Alerting
A pipeline without monitoring is a pipeline waiting to fail silently.
What to Monitor
- Job success/failure: Did the job complete?
- Duration: Is it taking longer than usual?
- Row counts: Did we process the expected volume?
- Data freshness: When was data last updated?
- Error rates: How many records went to dead letter queue?
Alerting Best Practices
- Alert on actionable issues only (no alert fatigue)
- Include context in alerts (job name, error message, runbook link)
- Route to the right channel (Slack, PagerDuty, email)
- Escalate if not acknowledged within threshold
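Context is what turns an alert from noise into action. A sketch of a structured alert payload covering the bullets above (the field names and runbook URL scheme are illustrative):

```python
def build_alert(job_name, error, runbook_url, severity="warning"):
    """Assemble an alert with enough context to act on without digging."""
    return {
        "title": f"[{severity.upper()}] {job_name} failed",
        "error": str(error),
        "runbook": runbook_url,  # link the responder straight to the fix
        "severity": severity,    # drives routing: e.g. Slack vs PagerDuty
    }
```

Routing can then key off `severity`: warnings to a Slack channel, critical alerts to the pager.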
Dashboards
Build dashboards that show:
- Current status of all pipelines
- Historical success rates
- Data freshness per table
- Processing latency trends
Testing Strategies
Unit Tests
Test individual transformation functions with known inputs and expected outputs.
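For example, a pure transformation function paired with a plain-assert test (the normalization logic here is a made-up example; the point is that pure functions with known inputs and outputs need no pipeline infrastructure to test):

```python
def normalize_amount(raw):
    """Parse a currency string like '$1,234.50' into a float."""
    return float(raw.replace("$", "").replace(",", ""))

def test_normalize_amount():
    assert normalize_amount("$1,234.50") == 1234.50
    assert normalize_amount("99") == 99.0
```

Keeping transformations as pure functions, separate from I/O, is what makes this kind of test possible.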
Integration Tests
Test the full pipeline with sample data:
- Use a subset of production data (anonymized if needed)
- Verify end-to-end data flow
- Check that all validation passes
Data Quality Tests in Production
Run validation queries after each production load:
- Fail the job if critical checks don't pass
- Prevent downstream jobs from running on bad data
- Alert immediately on failures
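The three bullets above amount to a quality gate between the load step and downstream jobs. A sketch that runs a list of checks and raises so the orchestrator marks the job failed (the convention that each check returns an error string or None is invented for this example):

```python
def quality_gate(checks):
    """Run every check; fail the job if any critical check fails."""
    failures = [msg for check in checks if (msg := check())]
    if failures:
        # Raising here stops downstream jobs from consuming bad data.
        raise RuntimeError("quality gate failed: " + "; ".join(failures))
```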
Frequently Asked Questions
What makes an ETL pipeline reliable?
Reliable ETL pipelines are idempotent (can be rerun safely), have comprehensive error handling, include data validation checks, and have monitoring and alerting built in from day one. They're designed to fail gracefully and recover automatically when possible.
Should I use ETL or ELT?
Modern data stacks typically use ELT (Extract, Load, Transform) because cloud warehouses like Snowflake and Databricks can handle transformations efficiently. ETL is still useful when you need to filter or clean data before loading to reduce storage costs, or when transformations require specialized processing.
How do I handle ETL pipeline failures?
Design pipelines to be idempotent so they can be safely rerun. Implement checkpointing for long-running jobs so you can resume from where you left off. Use dead letter queues for bad records so one bad row doesn't stop the entire job. Set up alerts that notify the right people immediately with enough context to diagnose the issue.
What tools should I use for ETL?
Popular choices include:
- Orchestration: Airflow, Prefect, Dagster
- Transformation: dbt, Spark, SQL
- Validation: Great Expectations, dbt tests
- Monitoring: Datadog, Monte Carlo, custom dashboards
Need help building reliable pipelines?
I build ETL systems with proper error handling, monitoring, and validation. Get in touch to discuss your project.
Book a Call