Migrate SSIS to Databricks.
Every `.dtsx` package read by the Documenter, rebuilt on Spark and Delta as dbt or in your framework: autonomous in the loop, accountable at the gates.
Why moving SSIS onto Databricks is a rebuild, not a port
An SSIS package is a .dtsx XML document describing a control flow, one or more
data flows, connection managers, variables, parameters and configurations. What
makes SSIS hard to move isn’t the task names. It’s the execution
model underneath them. A Data Flow only exists inside a Data Flow Task, and each
one spins up a separate instance of a buffer-based engine: rows stream
through PipelineBuffer objects, a new buffer is allocated whenever an
asynchronous transformation (Sort, Aggregate, Merge Join) is hit, and
synchronous transformations reuse the buffer row by row. Databricks has no
row-buffer model. Spark is a set-based, distributed engine over Delta Lake. So
there is no general-availability lift-and-shift that runs the SSIS runtime on
Databricks, and anyone who tells you otherwise is selling a runtime that does
not exist. The work is to reproduce the package’s semantics (sorted inputs,
error outputs, lineage-tracked columns, effective-dating) as Spark SQL or
PySpark, not to emulate its buffers.
That is exactly what a naive lift-and-shift gets wrong. It treats the data flow
as a black box that “moves data,” ports the visible SQL, and loses the parts
that were never visible: the expression-driven ConnectionString on a
connection manager, the variable whose value is computed at runtime because
EvaluateAsExpression = True, the SSIS expression that casts to (DT_BOOL)
where True is represented as -1 rather than Spark’s true. Each of those is a
silent behaviour change waiting to surface as a wrong row count three weeks after
go-live.
What the Documenter reads, and what the fleet builds
The SSIS Documenter parses the package XML, not the canvas. It reconstructs the
control flow from precedence constraints, capturing the Value
(Success / Failure / Completion), the EvalOp, and the LogicalAnd flag
that decides whether multiple incoming constraints are AND-ed or OR-ed, so a
branch like “run the load only if extract succeeded and validation did not
fail” survives the move. It maps every Data Flow component, resolves SSIS
expressions, captures variable and parameter scopes (variable names are
case-sensitive, and parameters resolve through environment references that the
.dtsx alone does not show), and isolates embedded T-SQL. All of it is written
to the knowledge base as a functional spec, and the Librarian keeps that spec
coherent across packages.
From that spec the builders generate one of two flavors. In dbt, each Data
Flow becomes staging and mart models, plain SELECT statements that dbt
materializes to Delta tables or views on a Databricks SQL warehouse via
dbt-databricks. The per-row upserts that an OLE DB Command used to do become an
incremental model with incremental_strategy='merge', the default on Delta,
which compiles to MERGE INTO. The SSIS Slowly Changing Dimension transform
becomes a dbt snapshot, adding dbt_valid_from and dbt_valid_to columns
with the timestamp or check strategy. Run order comes free from the ref()
DAG, which replaces precedence-constraint ordering.
In your framework, the same logic lands as native Databricks objects.
Data Flows become PySpark or Spark SQL notebooks, or Lakeflow Spark Declarative
Pipelines with streaming tables and materialized views. SCD2 is a Delta
MERGE INTO that closes the open version and opens a new one. File landing
zones that SSIS read with a Flat File source become Auto Loader (the
cloudFiles source) for incremental ingestion, or COPY INTO for simpler SQL
loads. Connection managers become Unity Catalog connections, external locations
and volumes; precedence constraints become Lakeflow Jobs task dependencies with
conditional control flow. Whatever lived inside a Script Task (opaque .NET that
cannot be auto-converted) is rebuilt as a PySpark notebook and explicitly
flagged for a human to read at the design gate.
Autonomous in the loop, accountable at the gates
Once the build exists, the Databricks Operator deploys it to DEV, runs the
Workflows and pipelines for real, reads each failure and routes the fix back to
the builders. That build-test-run loop re-runs on its own until the Databricks
Test Agent’s suite passes and the Reconciler’s delta against the legacy
dimension is zero or explained. The early failures are the predictable ones,
GETDATE() left as T-SQL instead of current_timestamp(), an ISNULL() that
should be coalesce(), a (DT_BOOL) cast that broke the active-flag because
SSIS True is -1, an SCD2 valid_to boundary off by one. The fleet iterates
through them; it does not hand you a list. What stays human are the three gates:
your experts sign off the assessment, the target design, and the promotion. The
agents are autonomous inside the loop; you are accountable at the gates.
If your target is Microsoft Fabric instead, the same Documenter output drives a different build, Migrate SSIS to Microsoft Fabric covers that pair.
SSIS to Databricks, stage by stage.
For SSIS → Databricks, the same five stages run with the agents that handle this pair named and linked below. Inside the loop they iterate on their own; your experts hold the three gates.
Orchestrated end-to-end by The Conductor , The Librarian and The Chronicler across every SSIS → Databricks asset.
Built your way, on Databricks.
Take SSIS to Databricks as an open-source dbt project (free to run), or built into the framework your team has trusted for years: same migration, your standards either way.
What gets generated
- Staging and mart models as
SELECTstatements on Databricks SQL warehouses, materialized to Delta tables and views - Incremental models with
incremental_strategy='merge'(the default on Delta) for the per-row upserts SSIS did with OLE DB Command - dbt snapshots (timestamp or check strategy) for every SCD Type 2 the SSIS Slowly Changing Dimension transform used to maintain
- Generic and singular tests for the parity the Test Agent asserts;
sources.ymlstanding in for connection managers - Python models where a Script Task's .NET logic has no pure-SQL equivalent (flagged for review)
- PySpark and Spark SQL notebooks that reproduce each Data Flow source-transform-destination pipeline
MERGE INTOagainst Delta dimension tables, or Lakeflow Spark Declarative Pipelines with change processing, for SCD2- Auto Loader (
cloudFiles) ingestion for the file landing zones SSIS read with Flat File sources - Lakeflow Jobs with task dependencies replacing precedence constraints; Unity Catalog connections replacing connection managers
- PySpark notebooks (flagged) carrying the logic that lived inside Script Task .NET code
Pattern mapping
Each SSIS pattern, mapped to a deliberate Databricks target.
materialized='table'/'view'/'incremental' on Deltatimestamp or check strategy, adding dbt_valid_from/dbt_valid_toMERGE INTO on a Delta dimension closing and opening versions, or Lakeflow SDP change processingincremental_strategy='merge' on unique_key (compiles to MERGE INTO)MERGE INTO in Spark SQL, or DeltaTable.merge in PySparkBEGIN...END block / stored procedureref() edges determines run ordersources.yml plus the warehouse connection profilecloudFiles) incremental file ingestion, or COPY INTOBefore & after
A real SSIS snippet and what the builders generate on Databricks.
<!-- Data Flow Task: extract -> Slowly Changing Dimension (Type 2) -> insert/update -->
<component name="OLE DB Source" componentClassID="Microsoft.OLEDBSource">
<property name="SqlCommand">SELECT CustomerKey, Name, City, Segment, UpdatedAt
FROM staging.Customer</property>
</component>
<component name="SCD Customer" componentClassID="Microsoft.SCD">
<!-- business key -->
<inputColumn name="CustomerKey" scdType="Key" />
<!-- historical attributes -> new row version on change -->
<inputColumn name="City" scdType="Historical" />
<inputColumn name="Segment" scdType="Historical" />
<!-- changing attribute -> overwrite in place -->
<inputColumn name="Name" scdType="Changing" />
</component>
<!-- OLE DB Command issues a per-row UPDATE to close the prior version;
precedence constraint (Success) then runs the dim insert. -->
{% snapshot dim_customer_snapshot %}
{{
config(
target_schema='marts',
unique_key='customer_key',
strategy='timestamp',
updated_at='updated_at',
file_format='delta'
)
}}
-- The SCD2 transform becomes a dbt snapshot. dbt closes the old
-- row and opens a new one, maintaining dbt_valid_from / dbt_valid_to.
select
customer_key,
name,
city,
segment,
updated_at
from {{ ref('stg_customer') }}
{% endsnapshot %}
# SCD Type 2 on Delta via MERGE INTO: the SCD transform + OLE DB Command,
# rebuilt as one set-based statement. No row-buffer engine on Spark.
src = spark.read.table("staging.customer")
src.createOrReplaceTempView("stg_customer")
spark.sql("""
MERGE INTO marts.dim_customer AS tgt
USING (
SELECT customer_key, name, city, segment, updated_at
FROM stg_customer
) AS src
ON tgt.customer_key = src.customer_key AND tgt.is_current = true
-- historical attribute changed: close the open version
WHEN MATCHED AND (tgt.city <> src.city OR tgt.segment <> src.segment)
THEN UPDATE SET tgt.valid_to = src.updated_at, tgt.is_current = false
-- brand-new business key: open the first version
WHEN NOT MATCHED
THEN INSERT (customer_key, name, city, segment, valid_from, valid_to, is_current)
VALUES (src.customer_key, src.name, src.city, src.segment, src.updated_at, NULL, true)
""")
# A second MERGE inserts the new version row for the closed keys.
Iterated until green, and logged.
A failure on this pair doesn't go on a list. It goes back to the builders, and the loop re-runs until the suite passes.
Green on iteration 4: full SCD2 parity, Reconciler delta 0 against the legacy dimension.
SSIS → Databricks, answered.
What happens to a Script Task that runs custom .NET code?
Script Tasks are opaque .NET (C#/VB) authored in VSTA, with logic that has no declarative equivalent. They are one of the two highest-risk constructs in an SSIS estate. The Documenter reads the `ScriptMain` / `Main` entry point and records what it does; the builders re-implement it as a PySpark or Python notebook on Databricks, and the result is flagged for human review at the design gate. We never claim a Script Task is auto-converted: .NET Core/Standard references aren't even supported in VSTA, so the logic is rebuilt, not lifted.
Can I just lift-and-shift the SSIS runtime onto Databricks?
No, there is no GA lift-and-shift that runs native SSIS on Databricks. SSIS executes a buffer-based, row-by-row data-flow engine that simply does not exist on Spark. The fleet rebuilds each Data Flow as set-based Spark SQL or PySpark transformations; the legacy semantics (sorted inputs, error outputs, lineage-tracked columns) are reproduced, not the runtime. Treat anyone promising a runtime port with suspicion.
How does an SSIS Slowly Changing Dimension (Type 2) transform map to Databricks?
On the dbt flavor it becomes a dbt snapshot (`timestamp` strategy when there's a reliable `updated_at`, `check` strategy otherwise) which maintains `dbt_valid_from` / `dbt_valid_to` on a Delta table. On your-framework it becomes a Delta `MERGE INTO` that closes the open version and opens a new one, or a Lakeflow Spark Declarative Pipeline using change processing. A `merge` upsert alone is SCD Type 1 (overwrite); SCD2 history needs the snapshot or the explicit close/open MERGE.
Where do precedence constraints and connection managers go?
Precedence constraints (the Success/Failure/Completion edges with their LogicalAnd merge logic) become Lakeflow Jobs task dependencies and conditional control flow in your-framework, or are inferred from the `ref()` DAG in dbt, which runs models in topological order. Connection managers become Unity Catalog connections, external locations and volumes (with secrets in secret scopes), or `sources.yml` entries on the dbt side. The Documenter resolves expression-driven `ConnectionString` properties first, so dynamic connections aren't lost.
What about dynamic SQL built in an Execute SQL Task with `SQLSourceType = Variable`?
Dynamic SQL is the other top-risk construct: when the statement is assembled in a variable or a Script Task, the actual SQL isn't present statically in the package. The Documenter traces the variable construction to recover what runs, writes it to the knowledge base, and the builders re-express it, as Jinja templating at compile time on the dbt flavor, or `EXECUTE IMMEDIATE` / `spark.sql(...)` on your-framework. The parameterization and security model differs, so each case is reviewed rather than ported verbatim.
Keep exploring
SSIS migration
How the Documenter reads SSIS, with both destinations beyond the Databricks build on this page.
Migrate to Databricks
What the fleet builds when you take SSIS to Databricks, in dbt or your framework.
Migrate SSIS to Fabric
The same source, the other destination, compare the fit.
Migrate ADF to Databricks
Another source onto Databricks, run by the same fleet.
Migrate Synapse to Databricks
Another source onto Databricks, run by the same fleet.
Ready to migrate SSIS to Databricks?
Tell us about your SSIS estate and we'll run the assessment, the Surveyor scores it before you commit to Databricks.
Plan my migrationA short form, no spam. We usually reply within one business day.