Autonomous in the loop. Accountable at the gates. An agentic offering by Plainsight

Migrate Azure Synapse to Databricks.

Four Synapse workloads onto the lakehouse (Spark notebooks fit naturally, the SQL pools get re-derived as Delta) with three human gates and a loop that iterates until green.

Azure Synapse logo Databricks logo

Why Azure Synapse to Databricks is four migrations, not one

An Azure Synapse Analytics workspace looks like one product, but it’s a collaboration boundary that bundles four distinct workloads, each with its own engine: Synapse pipelines, Spark notebooks, dedicated SQL pools, and serverless SQL. Treating them as a single thing is the first mistake a naive lift-and-shift makes. The pipelines are the same JSON model as Azure Data Factory; the notebooks run on serverless Apache Spark pools and write Delta; the dedicated SQL pool is a massively parallel T-SQL warehouse spread across 60 distributions; and serverless SQL is an on-demand query engine that stores nothing and reads files in place. Moving the estate onto Databricks means handling all four, and the difficulty is wildly uneven across them. The agent fleet documents each workload on its own terms first, then rebuilds: autonomous in the loop, accountable at the three human gates of assessment, design, and promotion.

The good news for this pair is that one workload is a genuinely natural fit. Synapse Spark notebooks already run Apache Spark and already write Delta, so they map to Databricks notebooks more closely than to almost any other target. A Spark-notebook-heavy Synapse estate lands on the lakehouse with the least friction of any source we migrate. The hard parts are the SQL pools, where Synapse’s physical design has no Databricks equivalent and must be re-derived rather than copied.

What breaks in a naive lift-and-shift

The dedicated SQL pool is where the copy-paste instinct does the most damage. Its performance model is built entirely around the 60-distribution MPP architecture: you choose a HASH distribution column so co-located joins avoid data movement, you make staging tables ROUND_ROBIN for fast loads, you make small dimensions REPLICATED so a full copy is cached on every compute node, and you partition on a single column (usually a date) so you can do fast partition switching and partition elimination. On Databricks none of these concepts exist. Delta Lake on the lakehouse has no distribution layer; physical layout is handled by liquid clustering and OPTIMIZE, and the Spark optimizer decides broadcast joins on its own, so the REPLICATED choice simply evaporates. Porting DISTRIBUTION = HASH(...) as if it were meaningful Delta DDL produces tables that are at best ignored and at worst misleading. The fleet reads why each distribution and partition choice was made (the access patterns behind it) and re-derives a Delta layout from those patterns.

The dialect breaks too. Even though dedicated-pool T-SQL is already a reduced subset (an 8-level nesting cap instead of 32, no @@NESTLEVEL, no INSERT...EXECUTE, constrained scalar UDFs, blob-type limits that force dynamic SQL to be chunked and run with EXEC()), it is still T-SQL, not Spark SQL. ISNULL becomes coalesce/nvl, GETDATE() becomes current_timestamp(), TOP n becomes LIMIT n, and [bracketed] identifiers become backticks. IDENTITY(1,1) surrogate keys do not behave like Delta’s GENERATED ... AS IDENTITY under parallel writes. Every T-SQL function gets checked against the Spark SQL built-in set; some have no direct equivalent and need rewriting, and the fleet catches these in the loop, not in a hand-off document.

Serverless SQL has its own trap. Because it is compatible with Delta reader version 1 only, a Synapse logical data warehouse may have been built around that ceiling: no deletion vectors, no column renames, no v2 checkpoints. The lakehouse imposes no such limit, so the rebuild as Unity Catalog external tables and views over the same Delta usually gets simpler, but the fleet still has to recognise where the old constraint shaped the design. And the deepest risk is the coupling between workloads: in Synapse the real logic lives across the seams, where a Spark notebook lands Delta, serverless SQL exposes views over it, the dedicated pool holds the modeled warehouse, and pipelines orchestrate the lot. Type mismatches, Delta-version compatibility, and schema-sync delays all hide at those Spark → serverless → dedicated boundaries.

How the fleet handles it

The Documenter reads the logic, not the labels, across all four workloads and follows data across the boundaries rather than documenting each pool in isolation. That inventory (the knowledge base) is what a human signs off at the assessment gate. For Synapse pipelines, the activity graph and its dependsOn conditions (Succeeded, Failed, Skipped, Completed) are reconstructed and re-expressed as Lakeflow Jobs tasks, dependencies, and triggers. For notebooks, the close-fit Spark logic is ported while %run versus mssparkutils.notebook.run chaining and mssparkutils utility calls are reworked for Databricks. For the dedicated pool, layout is re-derived for Delta and SCD2 history that was hand-rolled in T-SQL becomes either dbt snapshots (with dbt_valid_from / dbt_valid_to) on the dbt flavor or an explicit Delta MERGE INTO on the your-framework flavor. For serverless SQL, query-in-place views become Unity Catalog tables and views.

Whichever flavor you choose, the builders generate, the Test agent runs the suite, and the build-test-run loop iterates until green: row counts, SCD2 valid-from/valid-to spans, and aggregate parity all have to match source before anything reaches the promotion gate. Counts come from config, and you get to inventory in minutes. If your destination is Microsoft’s own platform instead, Migrate Azure Synapse to Microsoft Fabric follows the same fleet and the same gates onto Fabric.

How the fleet runs it

Azure Synapse to Databricks, stage by stage.

For Azure Synapse → Databricks, the same five stages run with the agents that handle this pair named and linked below. Inside the loop they iterate on their own; your experts hold the three gates.

dbt or your framework

Built your way, on Databricks.

Take Azure Synapse to Databricks as an open-source dbt project (free to run), or built into the framework your team has trusted for years: same migration, your standards either way.

What gets generated

dbt
  • A dbt project on a Databricks SQL warehouse via dbt-databricks (+file_format: delta), with ref()-built DAG, sources.yml, tests, macros and generated docs
  • Dimension and fact models as Delta tables, with incremental models using incremental_strategy='merge' (compiles to MERGE INTO) on unique_key
  • dbt **snapshots** (timestamp or check strategy) for SCD Type 2, producing dbt_valid_from / dbt_valid_to history in place of dedicated-pool history tables
  • Serverless-SQL views re-expressed as dbt view models over Unity Catalog tables; liquid_clustered_by / partition_by chosen for Delta layout

Pattern mapping

Each Azure Synapse pattern, mapped to a deliberate Databricks target.

Source pattern
Databricks · dbt
Databricks · your framework
Dedicated SQL pool, hash distribution on a chosen column
dbt table/incremental model on Delta; liquid_clustered_by or partition_by chosen for layout (distribution discarded)
Delta table in Unity Catalog; liquid clustering / OPTIMIZE; the 60-distribution key is re-derived, not copied
Dedicated SQL pool, round-robin staging table
Ephemeral / staging dbt model (CTE or table); no distribution concept on Delta
Staging Delta table or temporary view; Auto Loader / COPY INTO for the landing step
Replicated small dimension (cached on every node)
Plain dbt table model; broadcast handled by the Spark optimizer, not declared
Delta table; Spark decides broadcast joins automatically, no replication choice to make
Single-column date partitioning + partition switching
incremental_strategy='insert_overwrite' (partition-aware) or replace_where for selective rewrites
Delta replaceWhere / partition overwrite; OPTIMIZE and liquid clustering replace partition switching
Serverless SQL OPENROWSET / external table over Delta in the lake
dbt view/table model over a Unity Catalog table; raw files declared in sources.yml
Unity Catalog external table + view over Delta; cloudFiles (Auto Loader) for incremental file landing
Serverless partitioned view over a Delta root folder
dbt view model selecting from the registered Delta table (partitions resolved by the engine)
Spark SQL view over the Delta table; no Delta reader-version v1 ceiling on the lakehouse
Synapse Spark notebook (PySpark / Spark SQL) writing Delta
(n/a for dbt SQL), pure-SQL logic becomes models; programmatic logic becomes a dbt Python model
Databricks notebook, close fit; mssparkutilsdbutils, %run include logic reworked as Lakeflow Jobs task dependencies
Synapse pipeline orchestrating Spark + SQL with dependsOn
dbt DAG (ref()) for transform order; the schedule runs dbt build as an orchestrated task
**Lakeflow Jobs**: tasks, dependencies and triggers replace the pipeline activity graph
Dedicated-pool stored procedure (procedural T-SQL, dynamic SQL)
Decomposed into ordered, idempotent models + pre/post-hooks; complex logic to a dbt Python model
SQL scripting (BEGIN...END, IF, WHILE) and CREATE PROCEDURE in Unity Catalog, or a PySpark notebook (confirm Runtime gates)

Before & after

A real Azure Synapse snippet and what the builders generate on Databricks.

dedicated_pool/dim_customer_scd.sql legacy
-- Synapse dedicated SQL pool: hash-distributed, date-partitioned SCD2 dimension
CREATE TABLE dbo.DimCustomer
(
    CustomerKey   BIGINT IDENTITY(1,1) NOT NULL,
    CustomerId    INT          NOT NULL,
    CustomerName  NVARCHAR(200) NOT NULL,
    Segment       NVARCHAR(50)  NOT NULL,
    ValidFrom     DATE         NOT NULL,
    ValidTo       DATE         NULL,
    IsCurrent     BIT          NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( ValidFrom RANGE RIGHT FOR VALUES ('2024-01-01','2025-01-01') )
);

-- close changed rows, then insert new versions (hand-rolled SCD Type 2)
UPDATE tgt
    SET tgt.ValidTo = stg.LoadDate, tgt.IsCurrent = 0
FROM dbo.DimCustomer tgt
INNER JOIN stg.Customer stg
    ON tgt.CustomerId = stg.CustomerId AND tgt.IsCurrent = 1
WHERE tgt.Segment <> stg.Segment OR tgt.CustomerName <> stg.CustomerName;
dbt
snapshots/dim_customer.sql dbt
{% snapshot dim_customer %}
{{
  config(
    target_schema='marts',
    unique_key='customer_id',
    strategy='check',
    check_cols=['customer_name', 'segment'],
    file_format='delta'
  )
}}
-- Delta snapshot on Databricks: dbt maintains dbt_valid_from / dbt_valid_to.
-- The dedicated-pool HASH distribution and RANGE partition are dropped;
-- physical layout is a Delta concern (liquid_clustered_by), not a key choice.
select
    customer_id,
    customer_name,
    segment
from {{ source('crm', 'customer') }}
{% endsnapshot %}
The loop that closes itself

Iterated until green, and logged.

A failure on this pair doesn't go on a list. It goes back to the builders, and the loop re-runs until the suite passes.

synapse-to-databricks · iteration log build-test-operate
iter 1 Iteration 1, FAIL: `GETDATE()` and `ISNULL()` from the dedicated-pool proc have no Spark SQL equivalent; rewrite to `current_timestamp()` / `coalesce()`.
iter 2 Iteration 2, FAIL: SCD2 row count drifts above source because the `IsCurrent` close was double-applied on re-run; the `MERGE` was not idempotent.
iter 3 Iteration 3, FAIL: `MERGE` idempotency fixed, but `IDENTITY(1,1)` surrogate keys collided on parallel writes; switch to Delta `GENERATED ... AS IDENTITY`.
iter 4 Iteration 4, GREEN: row counts, SCD2 valid-from/valid-to spans and aggregate parity match source; delta 0 across all marts.

Green on iteration 4: full parity on the dedicated-pool warehouse and the Spark-notebook hand-offs, surrogate keys and SCD2 spans included.

Questions for this migration

Azure Synapse → Databricks, answered.

Do the dedicated SQL pool's distribution and partition choices carry over to Databricks?

No, and trying to copy them is the classic naive-migration mistake. A dedicated SQL pool spreads every table across 60 distributions, and you tune it by picking a `HASH` distribution column, a `ROUND_ROBIN` staging table, or a `REPLICATED` small dimension, plus single-column partitioning for partition switching. None of that maps to Delta. The lakehouse has no 60-distribution MPP layer: physical layout is a Delta concern handled by liquid clustering, `OPTIMIZE`, and partition pruning, and the Spark optimizer decides broadcast joins on its own (so `REPLICATED` has no analogue). The fleet re-derives the layout for Delta from the access patterns it documented; it never ports the distribution key as if it were portable.

Synapse Spark notebooks are already Spark, is that part basically free?

It is the closest fit of the four workloads, but not a no-op. Synapse notebooks run on serverless Apache Spark pools and write Delta, so the DataFrame and Spark SQL logic ports cleanly to Databricks notebooks. The work is in the seams: `mssparkutils` / `notebookutils` calls become `dbutils`; `%run` (which copies cells in and shares variable context) and `mssparkutils.notebook.run` (isolated, import-style) behave differently from Databricks notebook chaining, so the fleet re-expresses cross-notebook flow as task dependencies in Lakeflow Jobs; and `%%configure` session settings map to cluster / job configuration. A Spark-notebook-heavy estate genuinely lands on Databricks naturally, the fleet just reads the chaining semantics rather than assuming them.

How does serverless SQL pool's query-in-place model translate to the lakehouse?

Serverless SQL pool stores nothing: it queries files in place with `OPENROWSET`, external tables, and views, often built into a logical data warehouse over Delta in the lake. On Databricks the equivalent is Unity Catalog external tables and views over the same Delta, queried through Databricks SQL or Spark. One real gotcha the fleet flags: serverless SQL is compatible with Delta reader version 1 only, so deletion vectors, column renames, and v2 checkpoints were off the table; the lakehouse has no such ceiling, which usually simplifies the rebuild rather than complicating it.

Can dedicated-pool stored procedures move, given Synapse's reduced T-SQL dialect?

They move, but they are re-expressed, not ported verbatim. Dedicated-pool procedures already run a reduced dialect: 8-level nesting cap, no `@@NESTLEVEL`, no `INSERT...EXECUTE`, constrained scalar UDFs, so the imperative logic has to be read and rebuilt. On the dbt flavor it decomposes into ordered, idempotent models with pre/post-hooks, and anything genuinely procedural becomes a dbt Python model. On the your-framework flavor Databricks now offers SQL scripting (`BEGIN...END`, `IF`, `WHILE`, condition handlers) and `CREATE PROCEDURE` in Unity Catalog, but several of these are gated behind recent Databricks Runtime versions, so the fleet confirms the customer's Runtime before relying on them rather than assuming availability.

How does the fleet keep the Spark-to-SQL hand-offs from breaking on migration?

In Synapse the end-to-end logic only exists across the seams: a Spark notebook lands Delta, serverless SQL exposes views over it, the dedicated pool holds the modeled warehouse, and pipelines orchestrate all three. The Documenter follows the data across those Spark → serverless → dedicated boundaries instead of documenting each pool in isolation, so type mismatches, Delta-version compatibility, and schema-sync delays (the usual breakage points) are caught in the knowledge base, not in production. The build-test-run loop then proves the hand-offs hold: the Test agent checks parity end to end, and the loop iterates until green before anything reaches the promotion gate.

Let's talk

Ready to migrate Azure Synapse to Databricks?

Tell us about your Azure Synapse estate and we'll run the assessment, the Surveyor scores it before you commit to Databricks.

Plan my migration

A short form, no spam. We usually reply within one business day.

Plan my migration