Building a Real-Time Analytics Pipeline: From Zero to Production in One Day
How I replaced expensive analytics SaaS with a privacy-first, streaming data pipeline I actually own. Browser events to queryable data in under 5 minutes. Total cost: $12/month.
Results at a Glance
- ~$12/month vs $500+ for Mixpanel/Amplitude
- Browser → S3 → Dashboard
- No vendor lock-in
- Privacy-first by design
The Problem: Flying Blind
I was posting content on LinkedIn and Twitter, sharing case studies, writing technical guides. Traffic was coming in. But I had no idea:
- Which posts actually drove traffic?
- Were people reading the full case study or bouncing after 5 seconds?
- What percentage clicked "Book a Call"?
- Did that LinkedIn thread convert better than the Twitter one?
"I could have used Google Analytics. But I wanted to understand the data pipeline deeply. And I didn't want to send my visitors' data to Google."
The Requirements
Must Have
- Track page views, clicks, scroll depth
- UTM attribution (know which campaigns work)
- Session tracking (see the full journey)
- Real-time or near-real-time data
- SQL-queryable for custom analysis
Constraints
- < $20/month budget
- No cookies (privacy-first)
- Own my data (no vendor lock-in)
- Infrastructure as code (reproducible)
- Build it in a day (portfolio piece)
The Architecture
A streaming pipeline that handles burst traffic, buffers efficiently, and lands data in a query-ready format.
BROWSER: analytics.js                                    $0 (runs in browser)
  Emits page_view, cta_click, scroll_depth, time_on_page
        │ POST
        ▼
API GATEWAY: /v1/events                                  $0 (free tier)
  HTTPS + CORS
        │
        ▼
LAMBDA: Validate & Enrich                                $0 (free tier)
  + server_ts, + validation
        │
        ▼
KINESIS: Data Stream (1 shard)                           $11/mo
  24hr retention, replay capable
        │
        ▼
FIREHOSE: Batch + GZIP                                   ~$0.50/mo
  Flushes every 5 min or at 128MB, whichever comes first
        │
        ▼
S3: Partitioned by date/hour                             ~$0.10/mo
  raw/year=2026/month=02/...
        │
        ▼
DUCKDB: SQL queries on DataFrames                        $0 (local)
  "SELECT * FROM events WHERE ..."
        │
        ▼
DASHBOARD: Streamlit                                     $0 (local)
  Charts, metrics, SQL playground

TOTAL: ~$12/month (the Kinesis shard is $11 of it)
The Data Flow: Step by Step
Browser: Event Capture
When you loaded this page, JavaScript captured the event and built a payload:
// analytics.js builds this payload
{
  "event_type": "page_view",
  "event_id": "550e8400-e29b-41d4-a716-446655440000",
  "session_id": "abc-123-def-456",
  "page_path": "/case-studies/real-time-analytics-pipeline.html",
  "utm_source": "linkedin",
  "utm_campaign": "analytics_case_study",
  "screen_width": 1920,
  "event_ts": "2026-02-01T12:00:00.000Z"
}
API Gateway: HTTPS Endpoint
The payload is POSTed to a REST API endpoint. API Gateway handles SSL termination and CORS headers automatically.
POST https://jdsbdf3t47.execute-api.us-east-1.amazonaws.com/v1/events
Content-Type: application/json
Origin: https://jamesjlaurieiii.com
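If you want to smoke-test the endpoint without a browser, you can build the same POST from the command line. A minimal stdlib sketch (build_request is a hypothetical helper; the endpoint URL is the one above, and actually sending the request is left commented out):

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

ENDPOINT = "https://jdsbdf3t47.execute-api.us-east-1.amazonaws.com/v1/events"

def build_request(endpoint: str, event: dict) -> urllib.request.Request:
    """Build the same POST the browser sends: JSON body plus CORS origin."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Origin": "https://jamesjlaurieiii.com",
        },
        method="POST",
    )

# A synthetic event with the same shape as the real payload
event = {
    "event_type": "page_view",
    "event_id": str(uuid.uuid4()),
    "session_id": str(uuid.uuid4()),
    "page_path": "/case-studies/real-time-analytics-pipeline.html",
    "event_ts": datetime.now(timezone.utc).isoformat(),
}
req = build_request(ENDPOINT, event)
# urllib.request.urlopen(req)  # uncomment to actually send one event
```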
Lambda: Validation & Enrichment
Python code validates required fields and adds server-side data before forwarding to Kinesis.
import json
from datetime import datetime, timezone

import boto3

kinesis = boto3.client('kinesis')

def lambda_handler(event, context):
    body = json.loads(event['body'])

    # Validate required fields
    if 'session_id' not in body:
        return {'statusCode': 400,
                'body': json.dumps({'error': 'Missing session_id'})}

    # Enrich with server-side timestamp (UTC, so it sorts cleanly)
    body['ingest_ts'] = datetime.now(timezone.utc).isoformat()

    # Forward to Kinesis; partitioning by session keeps each session's events ordered
    kinesis.put_record(
        StreamName='jl3-analytics-stream',
        Data=json.dumps(body),
        PartitionKey=body['session_id']
    )
    return {'statusCode': 200, 'body': json.dumps({'ok': True})}
Kinesis: Streaming Buffer
Kinesis Data Streams acts as a buffer. It absorbs traffic bursts and retains data for 24 hours (enabling replay if something breaks downstream).
Why not write directly to S3? Writing thousands of tiny files is slow and expensive. Kinesis + Firehose batch events efficiently.
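To put rough numbers on that, here's a back-of-envelope sketch. The 10 events/sec rate is an illustrative assumption, and the S3 PUT price (~$0.005 per 1,000 requests) is us-east-1 pricing at the time of writing:

```python
# Back-of-envelope for the tiny-files problem
events_per_day = 10 * 60 * 60 * 24        # 864,000 events/day at 10 events/sec

# Direct Lambda -> S3: one tiny object (and one PUT request) per event
direct_objects = events_per_day
put_cost_direct = direct_objects / 1000 * 0.005   # dollars/day in PUTs alone

# Kinesis -> Firehose: one object per 5-minute flush
batched_objects = (24 * 60) // 5          # 288 objects/day

print(direct_objects, batched_objects, round(put_cost_direct, 2))
# 864000 288 4.32  -- batching also makes queries scan 288 files, not 864k
```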
Firehose: Batching & Compression
Firehose reads from Kinesis and batches events together. Every 5 minutes (or 128MB, whichever comes first), it compresses to GZIP and writes to S3.
# S3 path with Hive-style partitioning
s3://jl3-static-site-analytics-data/
  raw/
    year=2026/
      month=02/
        day=01/
          hour=12/
            firehose-data-abc123.gz   # Contains ~100 events
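Each of those .gz objects is GZIP-compressed, newline-delimited JSON. (Firehose concatenates records byte-for-byte, so the newline between events has to come from the producer.) A small sketch of writing and reading one such object, using toy events:

```python
import gzip
import io
import json

# Simulate one Firehose output object: GZIP over newline-delimited JSON
events = [
    {"event_type": "page_view", "session_id": "abc-123"},
    {"event_type": "cta_click", "session_id": "abc-123"},
]
blob = gzip.compress("".join(json.dumps(e) + "\n" for e in events).encode())

# Reading it back, e.g. after downloading firehose-data-abc123.gz from S3
with gzip.open(io.BytesIO(blob), "rt") as f:
    parsed = [json.loads(line) for line in f]

print(len(parsed))  # 2
```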
DuckDB + Streamlit: Analysis
DuckDB queries the data directly. No need to load into a database first - it reads S3/JSON/Parquet files natively.
-- Which traffic sources convert best?
SELECT
    COALESCE(utm_source, '(direct)') AS source,
    COUNT(DISTINCT session_id) AS sessions,
    COUNT(CASE WHEN event_type = 'cta_click' THEN 1 END) AS conversions,
    ROUND(100.0 * conversions / sessions, 1) AS conversion_rate
FROM events
GROUP BY 1
ORDER BY conversion_rate DESC
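That query leans on DuckDB's support for reusing SELECT aliases (conversions / sessions). As a self-contained illustration of the same aggregation, here's a stdlib sqlite3 sketch over toy data, with the aliases inlined since SQLite doesn't allow that shorthand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id TEXT, event_type TEXT, utm_source TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("s1", "page_view", "linkedin"),
        ("s1", "cta_click", "linkedin"),
        ("s2", "page_view", "linkedin"),
        ("s3", "page_view", None),
    ],
)

# Same shape as the DuckDB query, ratio expression inlined for SQLite
rows = conn.execute("""
    SELECT COALESCE(utm_source, '(direct)') AS source,
           COUNT(DISTINCT session_id) AS sessions,
           COUNT(CASE WHEN event_type = 'cta_click' THEN 1 END) AS conversions,
           ROUND(100.0 * COUNT(CASE WHEN event_type = 'cta_click' THEN 1 END)
                 / COUNT(DISTINCT session_id), 1) AS conversion_rate
    FROM events
    GROUP BY 1
    ORDER BY conversion_rate DESC
""").fetchall()

print(rows)  # [('linkedin', 2, 1, 50.0), ('(direct)', 1, 0, 0.0)]
```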
What Gets Tracked
A comprehensive event schema that captures behavior without collecting PII.
page_view
Fires when a page loads. Captures path, referrer, UTM params, screen size.
cta_click
Fires when someone clicks a call-to-action button (like "Book a Call").
scroll_depth
Fires at 25%, 50%, 75%, 100% scroll. Tells you if people actually read.
time_on_page
Fires at 30s, 60s, 180s, 300s. Distinguishes engaged readers from bouncers.
form_start
Fires when someone starts filling out a form. Reveals abandonment rate.
exit_intent
Fires when mouse moves toward browser chrome. Captures exit behavior.
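A schema like this is easy to enforce server-side. A minimal validator sketch, using the field names from the payload above and the scroll/time thresholds just described (validate_event and the depth/seconds fields are hypothetical, not code from the pipeline):

```python
REQUIRED = {"event_type", "event_id", "session_id", "page_path", "event_ts"}
EVENT_TYPES = {"page_view", "cta_click", "scroll_depth",
               "time_on_page", "form_start", "exit_intent"}
SCROLL_MARKS = {25, 50, 75, 100}   # scroll_depth fires only at these
TIME_MARKS = {30, 60, 180, 300}    # time_on_page fires only at these

def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is acceptable."""
    problems = [f"missing field: {f}" for f in REQUIRED - event.keys()]
    if event.get("event_type") not in EVENT_TYPES:
        problems.append(f"unknown event_type: {event.get('event_type')}")
    if event.get("event_type") == "scroll_depth" and event.get("depth") not in SCROLL_MARKS:
        problems.append("scroll_depth must fire at 25/50/75/100")
    if event.get("event_type") == "time_on_page" and event.get("seconds") not in TIME_MARKS:
        problems.append("time_on_page must fire at 30/60/180/300")
    return problems
```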
Privacy by Design
What We Collect
- Random session ID (not tied to identity)
- Page path and referrer
- UTM parameters (campaign attribution)
- Screen size and browser info
- Behavioral events (scroll, time, clicks)
What We DON'T Collect
- Names, emails, or any PII
- IP addresses (not even hashed)
- Cookies (uses localStorage only)
- Cross-site tracking
- Third-party data sharing
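Concretely, the only per-visitor identifier in the whole pipeline is a random UUID, minted client-side and kept in localStorage. A tiny sketch of the idea (shown in Python for illustration; in the real pipeline this happens in analytics.js):

```python
import uuid

# A v4 UUID is purely random: nothing in it derives from the user,
# so it cannot be reversed into PII or linked across sites.
session_id = str(uuid.uuid4())
assert uuid.UUID(session_id).version == 4
```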
Infrastructure as Code (Terraform)
The entire pipeline is defined in Terraform. One command to create, one command to destroy. No clicking around in AWS console.
# analytics.tf - Core infrastructure
resource "aws_kinesis_stream" "analytics" {
  name             = "jl3-analytics-stream"
  shard_count      = 1
  retention_period = 24  # hours - enables replay
}

resource "aws_kinesis_firehose_delivery_stream" "analytics" {
  name        = "jl3-analytics-firehose"
  destination = "extended_s3"

  extended_s3_configuration {
    bucket_arn         = aws_s3_bucket.analytics.arn
    buffering_interval = 300   # 5 minutes
    buffering_size     = 128   # MB
    compression_format = "GZIP"

    # Hive-style partitioning for efficient queries
    prefix = "raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/"
  }
}
Deployment Commands
# Deploy everything
cd infra/envs/prod
terraform apply -var="kinesis_stream_enabled=true"

# Tear down (saves $11/month when not needed)
terraform apply -var="kinesis_stream_enabled=false"

# Complete destroy
terraform destroy
Lessons Learned
Streaming > Batch for Events
Originally tried Lambda → S3 direct. Thousands of tiny files. Query performance was terrible. Kinesis + Firehose batches events into efficient chunks.
Hive Partitioning Matters
Organizing data as year=YYYY/month=MM/day=DD lets query engines skip irrelevant folders. Query "last 7 days" reads 7 folders, not all history.
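You can see the pruning directly by enumerating the prefixes a query needs. A small sketch (day_prefixes is a hypothetical helper, not part of the pipeline):

```python
from datetime import date, timedelta

def day_prefixes(end: date, days: int) -> list[str]:
    """The S3 prefixes a 'last N days' query has to read: one folder per day."""
    return [
        f"raw/year={d:%Y}/month={d:%m}/day={d:%d}/"
        for d in (end - timedelta(n) for n in range(days))
    ]

prefixes = day_prefixes(date(2026, 2, 1), 7)
print(prefixes[0])    # raw/year=2026/month=02/day=01/
print(len(prefixes))  # 7 folders scanned, regardless of total history size
```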
DuckDB is Ridiculously Fast
Expected to need Athena or Snowflake. DuckDB queries millions of rows locally in milliseconds. For this scale, it's perfect.
Cost Toggle is Essential
Kinesis costs $11/month even when idle, so I built a kinesis_stream_enabled toggle. Turn it off when the stream isn't needed; events still land in S3 via a direct Lambda write.
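The ~$11 figure falls straight out of shard-hour billing (the $0.015/shard-hour rate is us-east-1 pricing at the time of writing):

```python
# A provisioned Kinesis shard bills per shard-hour, idle or not
shard_hour_price = 0.015
hours_per_month = 730            # average hours in a month
monthly = shard_hour_price * hours_per_month
print(round(monthly, 2))         # 10.95
```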
What's Next
The foundation is solid. Here's what could be added:
Snowflake Integration
Snowpipe to auto-ingest from S3. Enables joins with other data sources and more complex analytics.
Real-Time Dashboard
Query Kinesis directly instead of S3. See events the moment they happen, not 5 minutes later.
Anomaly Detection
Alert when traffic spikes or drops. Catch issues before users report them.
A/B Testing
Add variant tracking. Measure which headlines, CTAs, or layouts perform better.
Funnel Visualization
See the path: page_view → scroll_50 → cta_click → form_start → form_submit. Where do people drop off?
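The core computation is simple once events are grouped by session. A hypothetical sketch over toy data (funnel_counts is illustrative, not pipeline code):

```python
FUNNEL = ["page_view", "scroll_50", "cta_click", "form_start", "form_submit"]

# session_id -> set of event types seen in that session (toy data)
sessions = {
    "s1": {"page_view", "scroll_50", "cta_click"},
    "s2": {"page_view", "scroll_50"},
    "s3": {"page_view"},
}

def funnel_counts(sessions: dict[str, set], steps: list[str]) -> list[int]:
    """A session counts for step N only if it reached every step up to N."""
    counts = []
    for i in range(len(steps)):
        prefix = set(steps[: i + 1])
        counts.append(sum(1 for evts in sessions.values() if prefix <= evts))
    return counts

print(funnel_counts(sessions, FUNNEL))  # [3, 2, 1, 0, 0] -> drop-off at each step
```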
Cohort Analysis
Group users by first visit date. Track retention over time. Who comes back?
Want Your Own Analytics Pipeline?
I build streaming data infrastructure for companies that want to own their data. No vendor lock-in, no monthly SaaS fees that scale with your traffic, no sending user data to third parties.
Whether it's product analytics, event tracking, or real-time dashboards - if it involves streaming data, I can help.
Typical project: 1-2 weeks from requirements to production. You own the code and infrastructure.
Related Resources
SaaS MRR Analytics
How I built automated subscription analytics for a fast-growing SaaS startup using dbt and SQL.
View Case Study →
Guide: Reliable ETL Pipelines
The principles that make data pipelines production-ready: idempotency, testing, monitoring.
Read Guide →
Guide: Data Quality Testing
How to ensure your analytics are accurate. Validation techniques and reconciliation testing.
Read Guide →