Building a Real-Time Analytics Pipeline: From Zero to Production in One Day
How I replaced expensive analytics SaaS with a privacy-first, streaming data pipeline I actually own. Browser events to queryable data in under 5 minutes. Total cost: $12/month.
Results at a Glance
- ~$12/month vs $500+ for Mixpanel/Amplitude
- Browser → S3 → Dashboard
- No vendor lock-in
- Privacy-first by design
The Problem: Flying Blind
I was posting content on LinkedIn and Twitter, sharing case studies, writing technical guides. Traffic was coming in. But I had no idea:
- Which posts actually drove traffic?
- Were people reading the full case study or bouncing after 5 seconds?
- What percentage clicked "Book a Call"?
- Did that LinkedIn thread convert better than the Twitter one?
"I could have used Google Analytics. But I wanted to understand the data pipeline deeply. And I didn't want to send my visitors' data to Google."
The Requirements
Must Have
- Track page views, clicks, scroll depth
- UTM attribution (know which campaigns work)
- Session tracking (see the full journey)
- Real-time or near-real-time data
- SQL-queryable for custom analysis
Constraints
- < $20/month budget
- No cookies (privacy-first)
- Own my data (no vendor lock-in)
- Infrastructure as code (reproducible)
- Build it in a day (portfolio piece)
The Architecture
A streaming pipeline that handles burst traffic, buffers efficiently, and lands data in a query-ready format.
BROWSER: analytics.js                                    $0 (runs in browser)
  Emits page_view, cta_click, scroll_depth, time_on_page
        │ POST
        ▼
API GATEWAY: /v1/events                                  $0 (free tier)
  HTTPS + CORS
        │
        ▼
LAMBDA: Validate & Enrich                                $0 (free tier)
  + server_ts, + validation
        │
        ▼
KINESIS: Data Stream (1 shard)                           $11/mo
  24hr retention, replay capable
        │
        ▼
FIREHOSE: Batch + GZIP                                   ~$0.50/mo
  Flushes every 5 min or at 128MB, whichever comes first
        │
        ▼
S3: Partitioned by date/hour                             ~$0.10/mo
  raw/year=2026/month=02/...
        │
        ▼
DUCKDB: SQL queries on DataFrames                        $0 (local)
  "SELECT * FROM events WHERE ..."
        │
        ▼
DASHBOARD: Streamlit                                     $0 (local)
  Charts, metrics, SQL playground

TOTAL: ~$12/month (the Kinesis shard is $11 of it)
The Data Flow: Step by Step
Browser: Event Capture
When you loaded this page, JavaScript captured the event and built a payload:
// analytics.js builds this payload
{
  "event_type": "page_view",
  "event_id": "550e8400-e29b-41d4-a716-446655440000",
  "session_id": "abc-123-def-456",
  "page_path": "/case-studies/real-time-analytics-pipeline.html",
  "utm_source": "linkedin",
  "utm_campaign": "analytics_case_study",
  "screen_width": 1920,
  "event_ts": "2026-02-01T12:00:00.000Z"
}
API Gateway: HTTPS Endpoint
The payload is POSTed to a REST API endpoint. API Gateway handles SSL termination and CORS headers automatically.
POST https://jdsbdf3t47.execute-api.us-east-1.amazonaws.com/v1/events
Content-Type: application/json
Origin: https://jamesjlaurieiii.com
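If you want to smoke-test the endpoint without a browser, you can build the same POST from the command line. A minimal stdlib sketch (build_request is a hypothetical helper; the endpoint URL is the one above, and actually sending the request is left commented out):

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

ENDPOINT = "https://jdsbdf3t47.execute-api.us-east-1.amazonaws.com/v1/events"

def build_request(endpoint: str, event: dict) -> urllib.request.Request:
    """Build the same POST the browser sends: JSON body plus CORS origin."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Origin": "https://jamesjlaurieiii.com",
        },
        method="POST",
    )

# A synthetic event with the same shape as the real payload
event = {
    "event_type": "page_view",
    "event_id": str(uuid.uuid4()),
    "session_id": str(uuid.uuid4()),
    "page_path": "/case-studies/real-time-analytics-pipeline.html",
    "event_ts": datetime.now(timezone.utc).isoformat(),
}
req = build_request(ENDPOINT, event)
# urllib.request.urlopen(req)  # uncomment to actually send one event
```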
Lambda: Validation & Enrichment
Python code validates required fields and adds server-side data before forwarding to Kinesis.
import json
from datetime import datetime, timezone

import boto3

kinesis = boto3.client('kinesis')

def lambda_handler(event, context):
    body = json.loads(event['body'])

    # Validate required fields
    if 'session_id' not in body:
        return {'statusCode': 400,
                'body': json.dumps({'error': 'Missing session_id'})}

    # Enrich with server-side timestamp (UTC, so it sorts cleanly)
    body['ingest_ts'] = datetime.now(timezone.utc).isoformat()

    # Forward to Kinesis; partitioning by session keeps each session's events ordered
    kinesis.put_record(
        StreamName='jl3-analytics-stream',
        Data=json.dumps(body),
        PartitionKey=body['session_id']
    )
    return {'statusCode': 200, 'body': json.dumps({'ok': True})}
Kinesis: Streaming Buffer
Kinesis Data Streams acts as a buffer. It absorbs traffic bursts and retains data for 24 hours (enabling replay if something breaks downstream).
Why not write directly to S3? Writing thousands of tiny files is slow and expensive. Kinesis + Firehose batch events efficiently.
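To put rough numbers on that, here's a back-of-envelope sketch. The 10 events/sec rate is an illustrative assumption, and the S3 PUT price (~$0.005 per 1,000 requests) is us-east-1 pricing at the time of writing:

```python
# Back-of-envelope for the tiny-files problem
events_per_day = 10 * 60 * 60 * 24        # 864,000 events/day at 10 events/sec

# Direct Lambda -> S3: one tiny object (and one PUT request) per event
direct_objects = events_per_day
put_cost_direct = direct_objects / 1000 * 0.005   # dollars/day in PUTs alone

# Kinesis -> Firehose: one object per 5-minute flush
batched_objects = (24 * 60) // 5          # 288 objects/day

print(direct_objects, batched_objects, round(put_cost_direct, 2))
# 864000 288 4.32  -- batching also makes queries scan 288 files, not 864k
```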
Firehose: Batching & Compression
Firehose reads from Kinesis and batches events together. Every 5 minutes (or 128MB, whichever comes first), it compresses to GZIP and writes to S3.
# S3 path with Hive-style partitioning
s3://jl3-static-site-analytics-data/
  raw/
    year=2026/
      month=02/
        day=01/
          hour=12/
            firehose-data-abc123.gz   # Contains ~100 events
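Each of those .gz objects is GZIP-compressed, newline-delimited JSON. (Firehose concatenates records byte-for-byte, so the newline between events has to come from the producer.) A small sketch of writing and reading one such object, using toy events:

```python
import gzip
import io
import json

# Simulate one Firehose output object: GZIP over newline-delimited JSON
events = [
    {"event_type": "page_view", "session_id": "abc-123"},
    {"event_type": "cta_click", "session_id": "abc-123"},
]
blob = gzip.compress("".join(json.dumps(e) + "\n" for e in events).encode())

# Reading it back, e.g. after downloading firehose-data-abc123.gz from S3
with gzip.open(io.BytesIO(blob), "rt") as f:
    parsed = [json.loads(line) for line in f]

print(len(parsed))  # 2
```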
DuckDB + Streamlit: Analysis
DuckDB queries the data directly. No need to load into a database first - it reads S3/JSON/Parquet files natively.
-- Which traffic sources convert best?
SELECT
    COALESCE(utm_source, '(direct)') AS source,
    COUNT(DISTINCT session_id) AS sessions,
    COUNT(CASE WHEN event_type = 'cta_click' THEN 1 END) AS conversions,
    ROUND(100.0 * conversions / sessions, 1) AS conversion_rate
FROM events
GROUP BY 1
ORDER BY conversion_rate DESC
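That query leans on DuckDB's support for reusing SELECT aliases (conversions / sessions). As a self-contained illustration of the same aggregation, here's a stdlib sqlite3 sketch over toy data, with the aliases inlined since SQLite doesn't allow that shorthand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id TEXT, event_type TEXT, utm_source TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("s1", "page_view", "linkedin"),
        ("s1", "cta_click", "linkedin"),
        ("s2", "page_view", "linkedin"),
        ("s3", "page_view", None),
    ],
)

# Same shape as the DuckDB query, ratio expression inlined for SQLite
rows = conn.execute("""
    SELECT COALESCE(utm_source, '(direct)') AS source,
           COUNT(DISTINCT session_id) AS sessions,
           COUNT(CASE WHEN event_type = 'cta_click' THEN 1 END) AS conversions,
           ROUND(100.0 * COUNT(CASE WHEN event_type = 'cta_click' THEN 1 END)
                 / COUNT(DISTINCT session_id), 1) AS conversion_rate
    FROM events
    GROUP BY 1
    ORDER BY conversion_rate DESC
""").fetchall()

print(rows)  # [('linkedin', 2, 1, 50.0), ('(direct)', 1, 0, 0.0)]
```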
What Gets Tracked
A comprehensive event schema that captures behavior without collecting PII.
page_view
Fires when a page loads. Captures path, referrer, UTM params, screen size.
cta_click
Fires when someone clicks a call-to-action button (like "Book a Call").
scroll_depth
Fires at 25%, 50%, 75%, 100% scroll. Tells you if people actually read.
time_on_page
Fires at 30s, 60s, 180s, 300s. Distinguishes engaged readers from bouncers.
form_start
Fires when someone starts filling out a form. Reveals abandonment rate.
exit_intent
Fires when mouse moves toward browser chrome. Captures exit behavior.
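A schema like this is easy to enforce server-side. A minimal validator sketch, using the field names from the payload above and the scroll/time thresholds just described (validate_event and the depth/seconds fields are hypothetical, not code from the pipeline):

```python
REQUIRED = {"event_type", "event_id", "session_id", "page_path", "event_ts"}
EVENT_TYPES = {"page_view", "cta_click", "scroll_depth",
               "time_on_page", "form_start", "exit_intent"}
SCROLL_MARKS = {25, 50, 75, 100}   # scroll_depth fires only at these
TIME_MARKS = {30, 60, 180, 300}    # time_on_page fires only at these

def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is acceptable."""
    problems = [f"missing field: {f}" for f in REQUIRED - event.keys()]
    if event.get("event_type") not in EVENT_TYPES:
        problems.append(f"unknown event_type: {event.get('event_type')}")
    if event.get("event_type") == "scroll_depth" and event.get("depth") not in SCROLL_MARKS:
        problems.append("scroll_depth must fire at 25/50/75/100")
    if event.get("event_type") == "time_on_page" and event.get("seconds") not in TIME_MARKS:
        problems.append("time_on_page must fire at 30/60/180/300")
    return problems
```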
Privacy by Design
What We Collect
- Random session ID (not tied to identity)
- Page path and referrer
- UTM parameters (campaign attribution)
- Screen size and browser info
- Behavioral events (scroll, time, clicks)
What We DON'T Collect
- Names, emails, or any PII
- IP addresses (not even hashed)
- Cookies (uses localStorage only)
- Cross-site tracking
- Third-party data sharing
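Concretely, the only per-visitor identifier in the whole pipeline is a random UUID, minted client-side and kept in localStorage. A tiny sketch of the idea (shown in Python for illustration; in the real pipeline this happens in analytics.js):

```python
import uuid

# A v4 UUID is purely random: nothing in it derives from the user,
# so it cannot be reversed into PII or linked across sites.
session_id = str(uuid.uuid4())
assert uuid.UUID(session_id).version == 4
```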
Infrastructure as Code (Terraform)
The entire pipeline is defined in Terraform. One command to create, one command to destroy. No clicking around in AWS console.
# analytics.tf - Core infrastructure
resource "aws_kinesis_stream" "analytics" {
  name             = "jl3-analytics-stream"
  shard_count      = 1
  retention_period = 24  # hours - enables replay
}

resource "aws_kinesis_firehose_delivery_stream" "analytics" {
  name        = "jl3-analytics-firehose"
  destination = "extended_s3"

  extended_s3_configuration {
    bucket_arn         = aws_s3_bucket.analytics.arn
    buffering_interval = 300   # 5 minutes
    buffering_size     = 128   # MB
    compression_format = "GZIP"

    # Hive-style partitioning for efficient queries
    prefix = "raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/"
  }
}
Deployment Commands
# Deploy everything
cd infra/envs/prod
terraform apply -var="kinesis_stream_enabled=true"

# Tear down (saves $11/month when not needed)
terraform apply -var="kinesis_stream_enabled=false"

# Complete destroy
terraform destroy
Lessons Learned
Streaming > Batch for Events
Originally tried Lambda → S3 direct. Thousands of tiny files. Query performance was terrible. Kinesis + Firehose batches events into efficient chunks.
Hive Partitioning Matters
Organizing data as year=YYYY/month=MM/day=DD lets query engines skip irrelevant folders. Query "last 7 days" reads 7 folders, not all history.
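You can see the pruning directly by enumerating the prefixes a query needs. A small sketch (day_prefixes is a hypothetical helper, not part of the pipeline):

```python
from datetime import date, timedelta

def day_prefixes(end: date, days: int) -> list[str]:
    """The S3 prefixes a 'last N days' query has to read: one folder per day."""
    return [
        f"raw/year={d:%Y}/month={d:%m}/day={d:%d}/"
        for d in (end - timedelta(n) for n in range(days))
    ]

prefixes = day_prefixes(date(2026, 2, 1), 7)
print(prefixes[0])    # raw/year=2026/month=02/day=01/
print(len(prefixes))  # 7 folders scanned, regardless of total history size
```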
DuckDB is Ridiculously Fast
Expected to need Athena or Snowflake. DuckDB queries millions of rows locally in milliseconds. For this scale, it's perfect.
Cost Toggle is Essential
Kinesis costs $11/month even when idle, so I built a kinesis_stream_enabled toggle. Turn it off when the stream isn't needed; events still land in S3 via a direct Lambda write.
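The ~$11 figure falls straight out of shard-hour billing (the $0.015/shard-hour rate is us-east-1 pricing at the time of writing):

```python
# A provisioned Kinesis shard bills per shard-hour, idle or not
shard_hour_price = 0.015
hours_per_month = 730            # average hours in a month
monthly = shard_hour_price * hours_per_month
print(round(monthly, 2))         # 10.95
```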
What's Next
The foundation is solid. Here's what could be added:
Snowflake Integration
Snowpipe to auto-ingest from S3. Enables joins with other data sources and more complex analytics.
Real-Time Dashboard
Query Kinesis directly instead of S3. See events the moment they happen, not 5 minutes later.
Anomaly Detection
Alert when traffic spikes or drops. Catch issues before users report them.
A/B Testing
Add variant tracking. Measure which headlines, CTAs, or layouts perform better.
Funnel Visualization
See the path: page_view → scroll_50 → cta_click → form_start → form_submit. Where do people drop off?
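The core computation is simple once events are grouped by session. A hypothetical sketch over toy data (funnel_counts is illustrative, not pipeline code):

```python
FUNNEL = ["page_view", "scroll_50", "cta_click", "form_start", "form_submit"]

# session_id -> set of event types seen in that session (toy data)
sessions = {
    "s1": {"page_view", "scroll_50", "cta_click"},
    "s2": {"page_view", "scroll_50"},
    "s3": {"page_view"},
}

def funnel_counts(sessions: dict[str, set], steps: list[str]) -> list[int]:
    """A session counts for step N only if it reached every step up to N."""
    counts = []
    for i in range(len(steps)):
        prefix = set(steps[: i + 1])
        counts.append(sum(1 for evts in sessions.values() if prefix <= evts))
    return counts

print(funnel_counts(sessions, FUNNEL))  # [3, 2, 1, 0, 0] -> drop-off at each step
```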
Cohort Analysis
Group users by first visit date. Track retention over time. Who comes back?
Want Your Own Analytics Pipeline?
I build streaming data infrastructure for companies that want to own their data. No vendor lock-in, no monthly SaaS fees that scale with your traffic, no sending user data to third parties.
Whether it's product analytics, event tracking, or real-time dashboards - if it involves streaming data, I can help.
Typical project: 1-2 weeks from requirements to production. You own the code and infrastructure.
Related Resources
SaaS MRR Analytics
How I built automated subscription analytics for a fast-growing SaaS startup using dbt and SQL.
View Case Study →
Guide: Reliable ETL Pipelines
The principles that make data pipelines production-ready: idempotency, testing, monitoring.
Read Guide →
Guide: Data Quality Testing
How to ensure your analytics are accurate. Validation techniques and reconciliation testing.
Read Guide →