Why Migrate from Snowflake to Databricks?
Organizations migrate from Snowflake to Databricks for several reasons:
- Unified analytics: Databricks combines data engineering, data science, and ML workloads on a single platform
- Cost optimization: Depending on workload patterns, Databricks can offer significant cost savings
- Lakehouse architecture: Delta Lake provides ACID transactions on object storage with open formats
- ML/AI capabilities: Native MLflow integration and GPU support for advanced analytics
That said, migration isn't right for everyone. If your team is primarily SQL-focused and happy with Snowflake's performance and pricing, the migration cost may not be justified.
Pre-Migration Assessment
Before writing any migration code, you need a complete inventory of what you're migrating:
Data Inventory
- List all databases, schemas, and tables with row counts and sizes
- Identify partitioning strategies and clustering keys
- Document semi-structured data (VARIANT columns)
- Map data types that need translation
Query Inventory
- Export query history to identify most-used queries
- Catalog stored procedures and UDFs
- Identify Snowflake-specific functions that need conversion
- Document views and their dependencies
Downstream Dependencies
- BI tools (Tableau, Looker, Power BI) and their connection methods
- ETL/ELT pipelines that write to or read from Snowflake
- Applications with direct database connections
- Scheduled jobs and their dependencies
Schema Translation
Most Snowflake data types have direct Databricks equivalents, but there are important differences:
Data Type Mapping
- `VARIANT` → use `STRING` with JSON functions, or define an explicit schema
- `GEOGRAPHY` → use H3 or a similar geospatial library
- `TIME` → `STRING`, or derive from a `TIMESTAMP`
- `NUMBER(38,0)` → `DECIMAL(38,0)` or `BIGINT`
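As a starting point, the mapping above can be captured in a small helper when generating target DDL. This is a sketch, not an exhaustive mapping, and the `map_snowflake_type` name is my own:

```python
import re

# Illustrative (not exhaustive) Snowflake -> Databricks type mapping.
# VARIANT maps to STRING here; in practice you may prefer an explicit schema.
SIMPLE_TYPES = {
    "VARIANT": "STRING",
    "TIME": "STRING",       # or derive from a TIMESTAMP
    "GEOGRAPHY": "STRING",  # then handle with H3 or a geospatial library
}

def map_snowflake_type(sf_type):
    """Translate a Snowflake column type to a Databricks/Delta type."""
    sf_type = sf_type.strip().upper()
    # NUMBER(p,s) -> DECIMAL(p,s); NUMBER(38,0) could also become BIGINT
    m = re.fullmatch(r"NUMBER\((\d+),\s*(\d+)\)", sf_type)
    if m:
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    return SIMPLE_TYPES.get(sf_type, sf_type)
```

For example, `map_snowflake_type("NUMBER(38,0)")` returns `"DECIMAL(38,0)"`; unmapped types pass through unchanged so you can review them manually.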
Schema Design Considerations
Databricks with Delta Lake has different optimization patterns than Snowflake:
- Partitioning: Choose partition columns carefully (typically date columns with low cardinality)
- Z-Ordering: Replaces Snowflake's clustering for frequently filtered columns
- Liquid Clustering: Newer alternative to traditional partitioning (Databricks 13.3+)
Query Conversion
While both platforms use SQL, there are dialect differences that require conversion:
Common Differences
- `FLATTEN` → `EXPLODE` or `LATERAL VIEW`
- `QUALIFY` → subquery with `WHERE` on the window function
- `TRY_CAST` → `TRY_CAST` (same in newer Databricks)
- `GET_PATH` → `get_json_object` or dot notation
- `PARSE_JSON` → `from_json` with a schema
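Because several of these rewrites change arity or structure (e.g. `from_json` needs a schema), blind find-and-replace is risky. A safer first pass is to scan queries and flag what needs manual attention. A minimal sketch, with a hand-picked (not complete) list of functions:

```python
import re

# Snowflake constructs that need manual conversion, with their
# Databricks counterparts. Illustrative, not a complete list.
CONVERSIONS = {
    "FLATTEN": "EXPLODE or LATERAL VIEW",
    "QUALIFY": "subquery with WHERE on the window function",
    "GET_PATH": "get_json_object or dot notation",
    "PARSE_JSON": "from_json (requires a schema)",
}

def flag_snowflake_functions(sql):
    """Scan a SQL string and report Snowflake-specific constructs found in it."""
    found = {}
    for name, replacement in CONVERSIONS.items():
        if re.search(rf"\b{name}\b", sql, re.IGNORECASE):
            found[name] = replacement
    return found
```

Running this over your exported query history gives a per-query conversion worklist before anyone touches the SQL itself.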
Stored Procedures
Snowflake stored procedures (JavaScript or Snowflake Scripting) need to be rewritten. Options include:
- Databricks SQL stored procedures (if simple)
- Python notebooks or scripts
- Databricks Workflows for orchestration
Data Migration Strategies
Option 1: Direct Export/Import
For smaller datasets, export from Snowflake to cloud storage, then import to Databricks:
- Use `COPY INTO` to export Snowflake tables to Parquet in S3/Azure/GCS
- Create Delta tables in Databricks from the Parquet files
- Run `OPTIMIZE` and `VACUUM` to clean up
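For more than a handful of tables, it helps to generate the export statements rather than write them by hand. A sketch that builds `COPY INTO` unload statements; the stage name and table list are placeholders for your own:

```python
# Sketch: generate Snowflake COPY INTO statements that unload tables to
# Parquet in a cloud stage. "@migration_stage" is a hypothetical stage name.
def export_statement(table, stage="@migration_stage"):
    return (
        f"COPY INTO {stage}/{table}/ "
        f"FROM {table} "
        f"FILE_FORMAT = (TYPE = PARQUET) "
        f"HEADER = TRUE"
    )

tables = ["orders", "customers"]
statements = [export_statement(t) for t in tables]
```

Each table lands in its own stage prefix, which maps cleanly onto one Delta table per Parquet directory on the Databricks side.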
Option 2: Change Data Capture
For larger datasets or zero-downtime migrations:
- Initial bulk load of historical data
- Set up CDC stream from Snowflake (using Snowflake Streams or third-party tools)
- Apply changes incrementally to Databricks
- Validate and cut over when in sync
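The incremental-apply step boils down to replaying ordered change events against the target, which on Databricks you would implement as a `MERGE INTO` on the Delta table. A toy sketch using a plain dict as a stand-in for the target table; the event shape (`op`/`key`/`row`) is a simplifying assumption:

```python
# Sketch: apply a batch of CDC events to a keyed target, in order.
# Later events win, mirroring how a Delta MERGE would resolve them.
def apply_cdc(target, events):
    """Apply insert/update/delete events to target; returns the target."""
    for e in events:
        if e["op"] in ("insert", "update"):
            target[e["key"]] = e["row"]
        elif e["op"] == "delete":
            target.pop(e["key"], None)
    return target
```

The important properties to preserve in the real implementation are ordering per key and idempotency, so a replayed batch does not corrupt the target.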
Option 3: Dual-Write
If you control the upstream pipelines:
- Modify pipelines to write to both Snowflake and Databricks
- Backfill historical data to Databricks
- Validate both systems match
- Switch downstream consumers to Databricks
- Retire Snowflake writes
Reconciliation Testing
This is the most critical phase. Data discrepancies after migration destroy trust and cause downstream issues.
Row Count Validation
Start with the basics - row counts should match exactly for each table:
```sql
-- Snowflake
SELECT 'orders' AS table_name, COUNT(*) AS row_count FROM orders
UNION ALL
SELECT 'customers', COUNT(*) FROM customers;
-- Compare with equivalent Databricks queries
```
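Once both queries have run, the comparison itself is trivial to script. A minimal sketch, assuming the two result sets have been loaded into `{table: count}` dicts:

```python
# Sketch: compare per-table row counts pulled from both systems.
def compare_row_counts(snowflake_counts, databricks_counts):
    """Return mismatch descriptions; an empty list means all counts match."""
    mismatches = []
    for table in sorted(set(snowflake_counts) | set(databricks_counts)):
        sf = snowflake_counts.get(table)  # None if the table is missing entirely
        db = databricks_counts.get(table)
        if sf != db:
            mismatches.append(f"{table}: snowflake={sf} databricks={db}")
    return mismatches
```

Missing tables show up as `None` on one side, so a table that was never migrated fails loudly instead of being silently skipped.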
Aggregate Validation
Sum, average, min, and max on numeric columns should match:
```sql
SELECT
  SUM(amount) AS sum_amount,
  AVG(amount) AS avg_amount,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= '2024-01-01';
```
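One subtlety: floating-point aggregates like `SUM` and `AVG` can differ in the last few digits between engines because summation order differs, so exact equality is too strict. A sketch that compares aggregate dicts with a relative tolerance (the `1e-9` default is an assumption; tune it per column):

```python
import math

# Sketch: compare aggregate results with a relative tolerance, since
# floating-point SUM/AVG can differ slightly across engines.
def aggregates_match(a, b, rel_tol=1e-9):
    """True if both dicts have the same keys and near-equal values."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)
```

Exact-valued checks like `COUNT(DISTINCT ...)` should still use strict equality; reserve the tolerance for genuinely floating-point columns.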
Sample-Based Row Comparison
For large tables, compare a random sample of rows column by column:
- Select 0.1-1% of rows using consistent sampling
- Export from both systems
- Compare row by row, flagging any differences
- Investigate and fix discrepancies
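"Consistent sampling" means both systems must select the same rows, so random sampling per engine won't work; hashing the primary key deterministically does. A sketch using MD5 buckets (the bucket scheme is one reasonable choice, not the only one):

```python
import hashlib

# Sketch: deterministic ~1% sample keyed on the primary key, so Snowflake
# and Databricks select the same rows. MD5 gives the same buckets everywhere.
def in_sample(key, rate_bp=100):
    """Include a row if its hashed key lands in the first rate_bp of 10,000 buckets."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % 10_000 < rate_bp
```

The same predicate can be expressed in SQL on both platforms (both expose an MD5 function), so each side can export only its sampled rows for the column-by-column diff.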
Automated Reconciliation Framework
Build reusable reconciliation scripts that:
- Run nightly during parallel operation
- Alert on any discrepancies above threshold
- Generate detailed reports for investigation
- Track trends over time
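The core of such a framework is small: register named checks, run them all even when some fail, and collect failures for alerting. A minimal sketch (the `Reconciler` class and check signature are my own conventions; checks return an error string on mismatch, `None` on success):

```python
from typing import Callable, Optional

# Sketch of a minimal reconciliation runner for the nightly job.
class Reconciler:
    def __init__(self):
        self.checks = {}

    def register(self, name, check):
        """Register a check: a zero-arg callable returning None or an error string."""
        self.checks[name] = check

    def run(self):
        """Run every check; return {check_name: error} for failures only."""
        failures = {}
        for name, check in self.checks.items():
            try:
                error = check()
            except Exception as exc:  # a crashed check is also a failure
                error = f"check raised {exc!r}"
            if error:
                failures[name] = error
        return failures
```

Row-count, aggregate, and sample-based checks all fit this shape; the nightly job runs `run()`, alerts if the failure dict is non-empty, and appends the result to a history table for trend tracking.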
Cutover Planning
Plan your cutover carefully to minimize risk:
Pre-Cutover Checklist
- All reconciliation tests passing for 7+ consecutive days
- All downstream consumers tested against Databricks
- Rollback plan documented and tested
- Team trained on Databricks operations
- Monitoring and alerting configured
Cutover Sequence
- Final reconciliation check
- Stop writes to Snowflake
- Apply any final changes to Databricks
- Switch downstream consumers
- Monitor for issues
- Keep Snowflake available for rollback (1-2 weeks)
Frequently Asked Questions
How long does a Snowflake to Databricks migration take?
A typical migration takes 4-12 weeks depending on data volume, query complexity, and team capacity. Simple migrations with fewer than 20 tables can be done in 4 weeks, while complex migrations with hundreds of tables and stored procedures may take 12+ weeks.
Can I run Snowflake and Databricks in parallel during migration?
Yes, and this is highly recommended. Running both platforms in parallel allows you to validate data accuracy before cutting over and provides a rollback option if issues arise. Plan for 2-4 weeks of parallel operation.
What are the biggest challenges in Snowflake to Databricks migration?
The main challenges are:
- SQL dialect differences (especially window functions and semi-structured data handling)
- Stored procedure conversion (JavaScript/Snowflake Scripting to Python/SQL)
- Ensuring data accuracy through reconciliation testing
- Retraining teams on Databricks operations and debugging
Should I use Unity Catalog?
Yes, for new Databricks deployments, Unity Catalog is recommended. It provides centralized governance, fine-grained access control, and data lineage tracking that matches or exceeds Snowflake's governance capabilities.
Need help with your migration?
I've helped companies migrate from Snowflake to Databricks with zero data discrepancies. Get in touch to discuss your project, or check out my data engineering services.
Book a Call