
Data lineage you can follow

dbt derives lineage from the code itself, from raw source to dashboard and down to the column, so you can trace where any number came from and see what a change will break before you make it.


Example lineage, generated from ref() and source(): trace, do not guess

Selected column: customer_revenue

Upstream, where it comes from

raw source: orders.amount → staging: stg_orders.net_amount → mart: customer_revenue

Downstream, what depends on it

Board revenue dashboard: shows it to leadership weekly
Churn ML feature: feeds the prediction model

Impact analysis before a change

Rename or drop customer_revenue and the lineage shows that the two consumers above depend on it and will break, so you can warn their owners and test against them first.

Trace any number to its source, and see what a change breaks before you make it.

Every wrong number ends in the same question: where did this come from? In a hand-built data estate, nobody can answer it quickly. The figure on the slide passed through a dozen joins, a handful of scripts and someone’s spreadsheet, and tracing it back means interviewing whoever happened to write each step. Worse, when a change is proposed, no one can say what it will break, so either nothing changes or something breaks in production.

The problem

Without a reliable map, data flows are invisible. A column gets renamed in a source and three dashboards quietly go blank. A definition is tweaked in one model and a downstream report starts disagreeing with finance. The dependencies are real, but they live only in people’s heads, so the only way to find them is to ship the change and see who shouts. That is a slow, frightening way to run anything the business depends on.

What dbt does about it

dbt knows the shape of your whole pipeline because you tell it as you write it. Every model references its inputs: ref() for other models and source() for raw tables. From those references dbt assembles the full dependency graph between models, and its column-level lineage tooling extends that graph down to the individual field. Consumers that live outside the project, such as dashboards and ML features, join the graph once they are declared as exposures. You can then click any field, say customer_revenue, and trace it upstream through stg_orders.net_amount back to the raw orders.amount, and downstream into the exact dashboard and ML feature that read it. The full picture lives in the dbt documentation and lineage tooling.
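To make the idea concrete, here is a minimal Python sketch of how a dependency graph can be read out of ref() and source() calls. It is an illustration only, not how dbt itself works: dbt renders the Jinja properly rather than using regular expressions, and this sketch ignores the two-argument form of ref(). The source name raw, the refund_amount column and the exact SQL are assumptions made for the example.

```python
import re

# Hypothetical model SQL keyed by model name. A real dbt project keeps each
# model in its own .sql file; the source name "raw" and the column
# "refund_amount" are assumptions for this illustration.
MODELS = {
    "stg_orders": (
        "select order_id, customer_id, amount - refund_amount as net_amount "
        "from {{ source('raw', 'orders') }}"
    ),
    "customer_revenue": (
        "select customer_id, sum(net_amount) as customer_revenue "
        "from {{ ref('stg_orders') }} group by customer_id"
    ),
}

# Simplified patterns for single-argument ref() and two-argument source().
REF = re.compile(r"\bref\(\s*['\"]([^'\"]+)['\"]\s*\)")
SOURCE = re.compile(
    r"\bsource\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)"
)


def build_graph(models):
    """Map each model name to the set of nodes it reads from."""
    graph = {}
    for name, sql in models.items():
        parents = set(REF.findall(sql))
        parents |= {f"{src}.{table}" for src, table in SOURCE.findall(sql)}
        graph[name] = parents
    return graph


graph = build_graph(MODELS)
for model, parents in graph.items():
    print(model, "reads from", sorted(parents))
```

The point of the sketch is that nobody writes the graph down separately: it falls out of the references the models already contain.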

What it looks like

An analyst questions a revenue figure. Instead of a half-day investigation, they open the lineage for customer_revenue and read it end to end in seconds: raw orders, into a staging model that nets out refunds, into the mart, into the board dashboard. A few weeks later an engineer needs to change that staging logic. Before touching it they look downstream, see the dashboard and a churn feature both depend on it, warn both owners and test against both. Nothing breaks by surprise.

How we think about it

Lineage is only trustworthy when it is generated, never drawn by hand, so we lean on the graph dbt already builds rather than a diagram that rots. The same map underpins the data catalog everyone trusts: the catalog tells you what a table means, the lineage tells you where it came from and what depends on it. Together they turn every number into something you can stand behind, and every change into something you can make with your eyes open.

Questions

Data lineage you can follow, in short.

Where does the lineage come from?

It is generated from the project itself. Every model declares its inputs with ref() for other models and source() for raw tables, so dbt already knows the full dependency graph. The lineage you see is read straight out of that graph, which means it is always an accurate picture of what actually runs, not a diagram someone drew once and forgot to update.

What is column-level lineage, and why does it matter?

Model-level lineage tells you that one table feeds another. Column-level lineage tells you that one specific field, say net_amount, flows from a raw column through staging into the revenue mart and then into a named dashboard. That precision is what lets you answer where exactly a number came from, and what exactly breaks if you rename or drop a field.
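As a rough illustration of what a column-level trace involves, the Python sketch below walks one field back to its raw source. The edge table is written by hand here purely for the example (lineage tooling would derive it from the SQL), it assumes each column has a single parent, and it assumes the mart model and its column are both named customer_revenue.

```python
# Hypothetical column-level edges: (model, column) -> the (model, column)
# it is derived from. Real columns can have several parents; this sketch
# keeps one parent per column for readability.
COLUMN_PARENTS = {
    ("customer_revenue", "customer_revenue"): ("stg_orders", "net_amount"),
    ("stg_orders", "net_amount"): ("orders", "amount"),
}


def trace_upstream(column, parents):
    """Follow a column back through its parents until a raw column is reached."""
    path = [column]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


path = trace_upstream(("customer_revenue", "customer_revenue"), COLUMN_PARENTS)
print(" <- ".join(f"{model}.{col}" for model, col in path))
# prints: customer_revenue.customer_revenue <- stg_orders.net_amount <- orders.amount
```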

How does this help before we make a change?

You run impact analysis. Before touching a shared column you look downstream and see every model, dashboard and feature that depends on it. Instead of shipping a change and waiting for someone to complain, you know in advance who to warn and what to test. The guesswork and the 3 a.m. surprise both go away.
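Under the hood, impact analysis is a walk down the dependency graph. The Python sketch below shows the idea with a breadth-first traversal; the node names board_revenue_dashboard and churn_ml_feature and the hand-written edge list are assumptions for the example, standing in for the graph dbt would generate.

```python
from collections import deque

# Hypothetical parent -> children edges. In a real project these come from
# the generated graph; the dashboard and feature names are made up here.
CHILDREN = {
    "raw.orders": ["stg_orders"],
    "stg_orders": ["customer_revenue"],
    "customer_revenue": ["board_revenue_dashboard", "churn_ml_feature"],
}


def downstream(node, children):
    """Return every node reachable from `node`, nearest dependents first."""
    visited = {node}
    impacted = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in visited:
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted = downstream("stg_orders", CHILDREN)
print("Changing stg_orders affects:", ", ".join(impacted))
```

In a dbt project you would not write this yourself. The graph selector syntax covers the same question: for example, `dbt ls --select stg_orders+` lists a model together with everything downstream of it.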

Do we have to maintain the lineage by hand?

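No, and that is the point. Because it is derived from ref() and source() in the code, it is regenerated every time the project is parsed or built. There is no separate diagram to keep in sync. Add a model and it appears in the graph; remove one and it disappears. The map of your models cannot drift from the pipeline it describes. The only entries you declare yourself are consumers that live outside dbt, such as dashboards, which are registered once as exposures.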

Want this for your data?

If this is how you want your team to work, we should talk.

Talk to us

