Why Migrate from Snowflake to Databricks?

Organizations migrate from Snowflake to Databricks for several reasons:

  • Unified analytics: Databricks combines data engineering, data science, and ML workloads on a single platform
  • Cost optimization: Depending on workload patterns, Databricks can offer significant cost savings
  • Lakehouse architecture: Delta Lake provides ACID transactions on object storage with open formats
  • ML/AI capabilities: Native MLflow integration and GPU support for advanced analytics

That said, migration isn't right for everyone. If your team is primarily SQL-focused and happy with Snowflake's performance and pricing, the migration cost may not be justified.

Pre-Migration Assessment

Before writing any migration code, you need a complete inventory of what you're migrating:

Data Inventory

  • List all databases, schemas, and tables with row counts and sizes
  • Identify partitioning strategies and clustering keys
  • Document semi-structured data (VARIANT columns)
  • Map data types that need translation

Query Inventory

  • Export query history to identify most-used queries
  • Catalog stored procedures and UDFs
  • Identify Snowflake-specific functions that need conversion
  • Document views and their dependencies
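Query inventory can be sketched from Snowflake's own usage views. A minimal example, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share (latency on this view can be up to 45 minutes):

```sql
-- Identify the most frequently run queries over the last 90 days
SELECT
  query_text,
  COUNT(*)                        AS run_count,
  AVG(total_elapsed_time) / 1000  AS avg_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -90, CURRENT_TIMESTAMP())
GROUP BY query_text
ORDER BY run_count DESC
LIMIT 100;
```

The top of this list is where conversion effort pays off first; long-tail queries can often be converted on demand.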

Downstream Dependencies

  • BI tools (Tableau, Looker, Power BI) and their connection methods
  • ETL/ELT pipelines that write to or read from Snowflake
  • Applications with direct database connections
  • Scheduled jobs and their dependencies

Schema Translation

Most Snowflake data types have direct Databricks equivalents, but there are important differences:

Data Type Mapping

  • VARIANT → Use STRING with JSON functions or define explicit schema
  • GEOGRAPHY → Use H3 or similar geospatial libraries
  • TIME → STRING, or calculate from a TIMESTAMP column
  • NUMBER(38,0) → DECIMAL(38,0) or BIGINT
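To make the mapping concrete, here is a side-by-side DDL translation for a hypothetical `orders` table (table and column names are illustrative):

```sql
-- Snowflake DDL (source)
CREATE TABLE orders (
  order_id   NUMBER(38,0),
  amount     NUMBER(10,2),
  payload    VARIANT,
  created_at TIMESTAMP_NTZ
);

-- Databricks DDL (translated)
CREATE TABLE orders (
  order_id   BIGINT,
  amount     DECIMAL(10,2),
  payload    STRING,      -- parse with from_json / get_json_object as needed
  created_at TIMESTAMP
) USING DELTA;
```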

Schema Design Considerations

Databricks with Delta Lake has different optimization patterns than Snowflake:

  • Partitioning: Choose partition columns carefully (typically date columns with low cardinality)
  • Z-Ordering: Replaces Snowflake's clustering for frequently filtered columns
  • Liquid Clustering: Newer alternative to traditional partitioning (Databricks 13.3+)

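The layout choices above can be sketched in DDL. Table and column names are illustrative; `CLUSTER BY` for Liquid Clustering requires Databricks Runtime 13.3+:

```sql
-- Traditional partitioning plus Z-Ordering
CREATE TABLE events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_date DATE
) USING DELTA
PARTITIONED BY (event_date);

-- Z-Order on frequently filtered columns (run periodically)
OPTIMIZE events ZORDER BY (user_id);

-- Liquid Clustering alternative (Databricks 13.3+)
CREATE TABLE events_lc (
  event_id   BIGINT,
  user_id    BIGINT,
  event_date DATE
) USING DELTA
CLUSTER BY (event_date, user_id);
```

A useful rule of thumb: partition only on low-cardinality columns such as dates, and let Z-Ordering or Liquid Clustering handle high-cardinality filter columns.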
Query Conversion

While both platforms use SQL, there are dialect differences that require conversion:

Common Differences

  • FLATTEN → EXPLODE or LATERAL VIEW
  • QUALIFY → Subquery with WHERE on window function (recent Databricks runtimes support QUALIFY natively)
  • TRY_CAST → TRY_CAST (same in newer Databricks)
  • GET_PATH → get_json_object or dot notation
  • PARSE_JSON → from_json with schema
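Two of the most common conversions sketched side by side (table and column names are illustrative):

```sql
-- Snowflake: latest order per customer using QUALIFY
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;

-- Databricks equivalent via subquery on the window function
SELECT * FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM orders o
) WHERE rn = 1;

-- Snowflake: FLATTEN a VARIANT array
--   SELECT f.value FROM orders, LATERAL FLATTEN(input => payload:items) f;
-- Databricks equivalent: parse the JSON, then EXPLODE
SELECT item
FROM orders
LATERAL VIEW explode(from_json(get_json_object(payload, '$.items'),
                               'array<string>')) items AS item;
```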

Stored Procedures

Snowflake stored procedures (JavaScript or Snowflake Scripting) need to be rewritten. Options include:

  • Databricks SQL stored procedures (if simple)
  • Python notebooks or scripts
  • Databricks Workflows for orchestration

Data Migration Strategies

Option 1: Direct Export/Import

For smaller datasets, export from Snowflake to cloud storage, then import to Databricks:

  1. Use COPY INTO to export Snowflake tables to Parquet in S3/Azure/GCS
  2. Create Delta tables in Databricks from the Parquet files
  3. Run OPTIMIZE and VACUUM to clean up
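The three steps above can be sketched as follows, assuming an external stage named `my_s3_stage` and an illustrative bucket path:

```sql
-- 1. Snowflake: unload a table to Parquet on the external stage
COPY INTO @my_s3_stage/orders/
FROM orders
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;   -- preserve column names in the Parquet output

-- 2. Databricks: create a Delta table from the exported Parquet files
CREATE TABLE orders
USING DELTA
AS SELECT * FROM parquet.`s3://my-bucket/exports/orders/`;

-- 3. Compact and clean up
OPTIMIZE orders;
VACUUM orders;
```

Note that `VACUUM` retains seven days of history by default; avoid shortening the retention window during the validation period in case you need to time travel.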

Option 2: Change Data Capture

For larger datasets or zero-downtime migrations:

  1. Initial bulk load of historical data
  2. Set up CDC stream from Snowflake (using Snowflake Streams or third-party tools)
  3. Apply changes incrementally to Databricks
  4. Validate and cut over when in sync
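Step 3 (applying changes incrementally) typically lands each CDC batch in a staging table and merges it into the Delta target. A minimal sketch, assuming a staging table `orders_changes` with an `op` column indicating the change type:

```sql
-- Apply a batch of captured changes to the Delta target
MERGE INTO orders AS t
USING orders_changes AS s          -- staged CDC batch (assumed name/schema)
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND s.op != 'DELETE' THEN INSERT *;
```

If a batch can contain multiple changes for the same key, deduplicate to the latest change per key before merging, since Delta's MERGE rejects multiple source matches for one target row.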

Option 3: Dual-Write

If you control the upstream pipelines:

  1. Modify pipelines to write to both Snowflake and Databricks
  2. Backfill historical data to Databricks
  3. Validate both systems match
  4. Switch downstream consumers to Databricks
  5. Retire Snowflake writes

Reconciliation Testing

This is the most critical phase. Data discrepancies after migration destroy trust and cause downstream issues.

Row Count Validation

Start with the basics: row counts should match exactly for each table:

-- Snowflake
SELECT 'orders' as table_name, COUNT(*) as row_count FROM orders
UNION ALL
SELECT 'customers', COUNT(*) FROM customers;

-- Compare with equivalent Databricks queries

Aggregate Validation

Sum, average, min, and max on numeric columns should match:

SELECT
  SUM(amount) as sum_amount,
  AVG(amount) as avg_amount,
  COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE order_date >= '2024-01-01';

Sample-Based Row Comparison

For large tables, compare a random sample of rows column by column:

  1. Select 0.1-1% of rows using consistent sampling
  2. Export from both systems
  3. Compare row by row, flagging any differences
  4. Investigate and fix discrepancies
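For step 1, the sample must be deterministic so both systems select the same rows. A sketch using an illustrative numeric key (note that built-in HASH functions differ between Snowflake and Databricks, so keying off the id directly, or hashing in the comparison tool, is safer than hashing in each warehouse):

```sql
-- Deterministic ~0.1% sample, identical on both platforms
SELECT *
FROM orders
WHERE MOD(order_id, 1000) = 0;
```

Export both result sets sorted by the key, then diff them column by column in your comparison tool.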

Automated Reconciliation Framework

Build reusable reconciliation scripts that:

  • Run nightly during parallel operation
  • Alert on any discrepancies above threshold
  • Generate detailed reports for investigation
  • Track trends over time
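A minimal sketch of such a framework on the Databricks side: a Delta log table that each nightly run appends to, plus an alerting query. All table and column names here are illustrative, and the 0.01% threshold is an example:

```sql
-- Reconciliation log, appended to by the nightly job
CREATE TABLE IF NOT EXISTS recon_results (
  run_date         DATE,
  table_name       STRING,
  snowflake_count  BIGINT,
  databricks_count BIGINT
) USING DELTA;

-- Alerting query: tables whose counts diverge beyond 0.01%
SELECT table_name,
       snowflake_count,
       databricks_count,
       ABS(snowflake_count - databricks_count) AS diff
FROM recon_results
WHERE run_date = CURRENT_DATE()
  AND ABS(snowflake_count - databricks_count) > snowflake_count * 0.0001;
```

Keeping the log table around after cutover also gives you the trend history the last bullet asks for.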

Cutover Planning

Plan your cutover carefully to minimize risk:

Pre-Cutover Checklist

  • All reconciliation tests passing for 7+ consecutive days
  • All downstream consumers tested against Databricks
  • Rollback plan documented and tested
  • Team trained on Databricks operations
  • Monitoring and alerting configured

Cutover Sequence

  1. Final reconciliation check
  2. Stop writes to Snowflake
  3. Apply any final changes to Databricks
  4. Switch downstream consumers
  5. Monitor for issues
  6. Keep Snowflake available for rollback (1-2 weeks)

Frequently Asked Questions

How long does a Snowflake to Databricks migration take?

A typical migration takes 4-12 weeks depending on data volume, query complexity, and team capacity. Simple migrations with fewer than 20 tables can be done in 4 weeks, while complex migrations with hundreds of tables and stored procedures may take 12+ weeks.

Can I run Snowflake and Databricks in parallel during migration?

Yes, and this is highly recommended. Running both platforms in parallel allows you to validate data accuracy before cutting over and provides a rollback option if issues arise. Plan for 2-4 weeks of parallel operation.

What are the biggest challenges in Snowflake to Databricks migration?

The main challenges are:

  • SQL dialect differences (especially window functions and semi-structured data handling)
  • Stored procedure conversion (JavaScript/Snowflake Scripting to Python/SQL)
  • Ensuring data accuracy through reconciliation testing
  • Retraining teams on Databricks operations and debugging

Should I use Unity Catalog?

Yes, for new Databricks deployments, Unity Catalog is recommended. It provides centralized governance, fine-grained access control, and data lineage tracking that matches or exceeds Snowflake's governance capabilities.