Why Migrate from Snowflake to Databricks?
Organizations migrate from Snowflake to Databricks for several reasons:
- Unified analytics: Databricks combines data engineering, data science, and ML workloads on a single platform
- Cost optimization: Depending on workload patterns, Databricks can offer significant cost savings
- Lakehouse architecture: Delta Lake provides ACID transactions on object storage with open formats
- ML/AI capabilities: Native MLflow integration and GPU support for advanced analytics
That said, migration isn't right for everyone. If your team is primarily SQL-focused and happy with Snowflake's performance and pricing, the migration cost may not be justified.
Pre-Migration Assessment
Before writing any migration code, you need a complete inventory of what you're migrating:
Data Inventory
- List all databases, schemas, and tables with row counts and sizes
- Identify partitioning strategies and clustering keys
- Document semi-structured data (VARIANT columns)
- Map data types that need translation
Query Inventory
- Export query history to identify most-used queries
- Catalog stored procedures and UDFs
- Identify Snowflake-specific functions that need conversion
- Document views and their dependencies
Downstream Dependencies
- BI tools (Tableau, Looker, Power BI) and their connection methods
- ETL/ELT pipelines that write to or read from Snowflake
- Applications with direct database connections
- Scheduled jobs and their dependencies
Schema Translation
Most Snowflake data types have direct Databricks equivalents, but there are important differences:
Data Type Mapping
- `VARIANT` → use `STRING` with JSON functions, or define an explicit schema
- `GEOGRAPHY` → use H3 or a similar geospatial library
- `TIME` → `STRING`, or derive from a `TIMESTAMP`
- `NUMBER(38,0)` → `DECIMAL(38,0)` or `BIGINT`
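As a starting point, the mapping above can be captured in a small helper when generating target DDL. This is a sketch, not an exhaustive mapping, and the `map_snowflake_type` name is my own:

```python
import re

# Illustrative (not exhaustive) Snowflake -> Databricks type mapping.
# VARIANT maps to STRING here; in practice you may prefer an explicit schema.
SIMPLE_TYPES = {
    "VARIANT": "STRING",
    "TIME": "STRING",       # or derive from a TIMESTAMP
    "GEOGRAPHY": "STRING",  # then handle with H3 or a geospatial library
}

def map_snowflake_type(sf_type):
    """Translate a Snowflake column type to a Databricks/Delta type."""
    sf_type = sf_type.strip().upper()
    # NUMBER(p,s) -> DECIMAL(p,s); NUMBER(38,0) could also become BIGINT
    m = re.fullmatch(r"NUMBER\((\d+),\s*(\d+)\)", sf_type)
    if m:
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    return SIMPLE_TYPES.get(sf_type, sf_type)
```

For example, `map_snowflake_type("NUMBER(38,0)")` returns `"DECIMAL(38,0)"`; unmapped types pass through unchanged so you can review them manually.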
Schema Design Considerations
Databricks with Delta Lake has different optimization patterns than Snowflake:
- Partitioning: Choose partition columns carefully (typically date columns with low cardinality)
- Z-Ordering: Replaces Snowflake's clustering for frequently filtered columns
- Liquid Clustering: Newer alternative to traditional partitioning (Databricks 13.3+)
Query Conversion
While both platforms use SQL, there are dialect differences that require conversion:
Common Differences
- `FLATTEN` → `EXPLODE` or `LATERAL VIEW`
- `QUALIFY` → subquery with `WHERE` on the window function
- `TRY_CAST` → `TRY_CAST` (same in newer Databricks)
- `GET_PATH` → `get_json_object` or dot notation
- `PARSE_JSON` → `from_json` with a schema
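Because several of these rewrites change arity or structure (e.g. `from_json` needs a schema), blind find-and-replace is risky. A safer first pass is to scan queries and flag what needs manual attention. A minimal sketch, with a hand-picked (not complete) list of functions:

```python
import re

# Snowflake constructs that need manual conversion, with their
# Databricks counterparts. Illustrative, not a complete list.
CONVERSIONS = {
    "FLATTEN": "EXPLODE or LATERAL VIEW",
    "QUALIFY": "subquery with WHERE on the window function",
    "GET_PATH": "get_json_object or dot notation",
    "PARSE_JSON": "from_json (requires a schema)",
}

def flag_snowflake_functions(sql):
    """Scan a SQL string and report Snowflake-specific constructs found in it."""
    found = {}
    for name, replacement in CONVERSIONS.items():
        if re.search(rf"\b{name}\b", sql, re.IGNORECASE):
            found[name] = replacement
    return found
```

Running this over your exported query history gives a per-query conversion worklist before anyone touches the SQL itself.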
Stored Procedures
Snowflake stored procedures (JavaScript or Snowflake Scripting) need to be rewritten. Options include:
- Databricks SQL stored procedures (if simple)
- Python notebooks or scripts
- Databricks Workflows for orchestration
Data Migration Strategies
Option 1: Direct Export/Import
For smaller datasets, export from Snowflake to cloud storage, then import to Databricks:
- Use `COPY INTO` to export Snowflake tables to Parquet in S3/Azure/GCS
- Create Delta tables in Databricks from the Parquet files
- Run `OPTIMIZE` and `VACUUM` to clean up
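For more than a handful of tables, it helps to generate the export statements rather than write them by hand. A sketch that builds `COPY INTO` unload statements; the stage name and table list are placeholders for your own:

```python
# Sketch: generate Snowflake COPY INTO statements that unload tables to
# Parquet in a cloud stage. "@migration_stage" is a hypothetical stage name.
def export_statement(table, stage="@migration_stage"):
    return (
        f"COPY INTO {stage}/{table}/ "
        f"FROM {table} "
        f"FILE_FORMAT = (TYPE = PARQUET) "
        f"HEADER = TRUE"
    )

tables = ["orders", "customers"]
statements = [export_statement(t) for t in tables]
```

Each table lands in its own stage prefix, which maps cleanly onto one Delta table per Parquet directory on the Databricks side.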
Option 2: Change Data Capture
For larger datasets or zero-downtime migrations:
- Initial bulk load of historical data
- Set up CDC stream from Snowflake (using Snowflake Streams or third-party tools)
- Apply changes incrementally to Databricks
- Validate and cut over when in sync
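The incremental-apply step boils down to replaying ordered change events against the target, which on Databricks you would implement as a `MERGE INTO` on the Delta table. A toy sketch using a plain dict as a stand-in for the target table; the event shape (`op`/`key`/`row`) is a simplifying assumption:

```python
# Sketch: apply a batch of CDC events to a keyed target, in order.
# Later events win, mirroring how a Delta MERGE would resolve them.
def apply_cdc(target, events):
    """Apply insert/update/delete events to target; returns the target."""
    for e in events:
        if e["op"] in ("insert", "update"):
            target[e["key"]] = e["row"]
        elif e["op"] == "delete":
            target.pop(e["key"], None)
    return target
```

The important properties to preserve in the real implementation are ordering per key and idempotency, so a replayed batch does not corrupt the target.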
Option 3: Dual-Write
If you control the upstream pipelines:
- Modify pipelines to write to both Snowflake and Databricks
- Backfill historical data to Databricks
- Validate both systems match
- Switch downstream consumers to Databricks
- Retire Snowflake writes
Reconciliation Testing
This is the most critical phase. Data discrepancies after migration destroy trust and cause downstream issues.
Row Count Validation
Start with the basics - row counts should match exactly for each table:
```sql
-- Snowflake
SELECT 'orders' AS table_name, COUNT(*) AS row_count FROM orders
UNION ALL
SELECT 'customers', COUNT(*) FROM customers;
-- Compare with equivalent Databricks queries
```
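Once both queries have run, the comparison itself is trivial to script. A minimal sketch, assuming the two result sets have been loaded into `{table: count}` dicts:

```python
# Sketch: compare per-table row counts pulled from both systems.
def compare_row_counts(snowflake_counts, databricks_counts):
    """Return mismatch descriptions; an empty list means all counts match."""
    mismatches = []
    for table in sorted(set(snowflake_counts) | set(databricks_counts)):
        sf = snowflake_counts.get(table)  # None if the table is missing entirely
        db = databricks_counts.get(table)
        if sf != db:
            mismatches.append(f"{table}: snowflake={sf} databricks={db}")
    return mismatches
```

Missing tables show up as `None` on one side, so a table that was never migrated fails loudly instead of being silently skipped.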
Aggregate Validation
Sum, average, min, and max on numeric columns should match:
```sql
SELECT
  SUM(amount) AS sum_amount,
  AVG(amount) AS avg_amount,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= '2024-01-01';
```
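One subtlety: floating-point aggregates like `SUM` and `AVG` can differ in the last few digits between engines because summation order differs, so exact equality is too strict. A sketch that compares aggregate dicts with a relative tolerance (the `1e-9` default is an assumption; tune it per column):

```python
import math

# Sketch: compare aggregate results with a relative tolerance, since
# floating-point SUM/AVG can differ slightly across engines.
def aggregates_match(a, b, rel_tol=1e-9):
    """True if both dicts have the same keys and near-equal values."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)
```

Exact-valued checks like `COUNT(DISTINCT ...)` should still use strict equality; reserve the tolerance for genuinely floating-point columns.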
Sample-Based Row Comparison
For large tables, compare a random sample of rows column by column:
- Select 0.1-1% of rows using consistent sampling
- Export from both systems
- Compare row by row, flagging any differences
- Investigate and fix discrepancies
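"Consistent sampling" means both systems must select the same rows, so random sampling per engine won't work; hashing the primary key deterministically does. A sketch using MD5 buckets (the bucket scheme is one reasonable choice, not the only one):

```python
import hashlib

# Sketch: deterministic ~1% sample keyed on the primary key, so Snowflake
# and Databricks select the same rows. MD5 gives the same buckets everywhere.
def in_sample(key, rate_bp=100):
    """Include a row if its hashed key lands in the first rate_bp of 10,000 buckets."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % 10_000 < rate_bp
```

The same predicate can be expressed in SQL on both platforms (both expose an MD5 function), so each side can export only its sampled rows for the column-by-column diff.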
Automated Reconciliation Framework
Build reusable reconciliation scripts that:
- Run nightly during parallel operation
- Alert on any discrepancies above threshold
- Generate detailed reports for investigation
- Track trends over time
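The core of such a framework is small: register named checks, run them all even when some fail, and collect failures for alerting. A minimal sketch (the `Reconciler` class and check signature are my own conventions; checks return an error string on mismatch, `None` on success):

```python
from typing import Callable, Optional

# Sketch of a minimal reconciliation runner for the nightly job.
class Reconciler:
    def __init__(self):
        self.checks = {}

    def register(self, name, check):
        """Register a check: a zero-arg callable returning None or an error string."""
        self.checks[name] = check

    def run(self):
        """Run every check; return {check_name: error} for failures only."""
        failures = {}
        for name, check in self.checks.items():
            try:
                error = check()
            except Exception as exc:  # a crashed check is also a failure
                error = f"check raised {exc!r}"
            if error:
                failures[name] = error
        return failures
```

Row-count, aggregate, and sample-based checks all fit this shape; the nightly job runs `run()`, alerts if the failure dict is non-empty, and appends the result to a history table for trend tracking.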
Cutover Planning
Plan your cutover carefully to minimize risk:
Pre-Cutover Checklist
- All reconciliation tests passing for 7+ consecutive days
- All downstream consumers tested against Databricks
- Rollback plan documented and tested
- Team trained on Databricks operations
- Monitoring and alerting configured
Cutover Sequence
- Final reconciliation check
- Stop writes to Snowflake
- Apply any final changes to Databricks
- Switch downstream consumers
- Monitor for issues
- Keep Snowflake available for rollback (1-2 weeks)
Frequently Asked Questions
How long does a Snowflake to Databricks migration take?
A typical migration takes 4-12 weeks depending on data volume, query complexity, and team capacity. Simple migrations with fewer than 20 tables can be done in 4 weeks, while complex migrations with hundreds of tables and stored procedures may take 12+ weeks.
Can I run Snowflake and Databricks in parallel during migration?
Yes, and this is highly recommended. Running both platforms in parallel allows you to validate data accuracy before cutting over and provides a rollback option if issues arise. Plan for 2-4 weeks of parallel operation.
What are the biggest challenges in Snowflake to Databricks migration?
The main challenges are:
- SQL dialect differences (especially window functions and semi-structured data handling)
- Stored procedure conversion (JavaScript/Snowflake Scripting to Python/SQL)
- Ensuring data accuracy through reconciliation testing
- Retraining teams on Databricks operations and debugging
Should I use Unity Catalog?
Yes, for new Databricks deployments, Unity Catalog is recommended. It provides centralized governance, fine-grained access control, and data lineage tracking that matches or exceeds Snowflake's governance capabilities.
Need help with your migration?
I've helped companies migrate from Snowflake to Databricks with zero data discrepancies. Get in touch to discuss your project, or check out my data engineering services.
Book a Call