A data catalog everyone trusts
dbt generates a searchable docs site with descriptions, owners, types and tests, built from the code that runs so it never drifts.
Three mystery tables. Pick one and hope.
- revenue_final
- rev_v2
- bkp_old
One owned, defined, tested answer.
Revenue per customer, net of refunds, in reporting currency.
One search, one trusted answer, generated straight from the code and never drifting from it.
The problem
Ask where the trusted customer revenue figure lives and you will often get a pause, then three different answers. People open the warehouse, find revenue_final, rev_v2 and bkp_old, and pick one on a hunch. There is no description, no owner, and no way to tell which is current or whether anyone still maintains it. So analysts spend more time hunting for the right table than analysing, and too often they pick the wrong one. The cost is not just wasted hours: it is decisions made on numbers nobody can vouch for.
What dbt does about it
dbt makes the catalog a byproduct of building rather than a separate project. Documentation lives in YAML beside each model and changes in the same commit as the logic, so a description has little room to drift from the code it describes. One command, dbt docs generate, turns your project into a searchable website: every table and column with its plain-language description, its owner, its data type and the tests that guard it, all drawn from the code that actually runs and the warehouse it builds. Run that command as part of every build and the site reflects today’s pipeline, not last quarter’s intentions. The dbt documentation guide covers how the descriptions and the catalog are produced.
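As a rough sketch of what that YAML can look like, the properties file below documents the mart_customer_revenue model used as the example on this page. The file path, the column names (customer_id, net_revenue) and the owner key under meta are illustrative assumptions, not taken from a real project. dbt treats meta as free-form metadata, so owner here is a team convention rather than a built-in field.

```yaml
# models/marts/mart_customer_revenue.yml (hypothetical path)
version: 2

models:
  - name: mart_customer_revenue
    description: "Revenue per customer, net of refunds, in reporting currency."
    config:
      meta:
        owner: "Finance data team"   # free-form metadata; the key name is a convention
    columns:
      - name: customer_id            # assumed column name
        description: "Unique identifier of the customer; one row per customer."
        data_tests:                  # spelled `tests:` before dbt 1.8
          - unique
          - not_null
      - name: net_revenue            # assumed column name
        description: "Revenue minus refunds, converted to the reporting currency."
        data_tests:
          - not_null
```

With a file like this in place, dbt docs generate writes the site's files into the project's target directory, and dbt docs serve previews the result locally.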
What it looks like
An analyst types “customer revenue” into the catalog and gets one confident result: mart_customer_revenue, owned by the Finance data team, with a one-line definition of exactly how it is calculated and a green tick showing its tests are passing. Beside it, the old mystery tables, revenue_final, rev_v2 and bkp_old, are clearly what they are: unowned, undescribed, not the source of truth. The search that used to return a shrug now returns a single answer the whole business can stand behind.
How we think about it
A catalog only earns trust if it cannot lie, and the only catalog that cannot lie is one generated from the code that runs. We treat descriptions, owners and tests as part of the model, reviewed in the same pull request as the logic, never as documentation bolted on afterwards. The catalog tells you what a number means and who owns it; data lineage you can follow then shows you where it came from and what a change would touch. Together they turn the warehouse from a maze into a map your whole business can navigate by.
A data catalog everyone trusts, in short.
Where does the catalog come from?
dbt generates it out of the box. One command builds a searchable website from your models: every table and column with its description, owner, data type and the tests that guard it. There is no separate catalog tool to buy or wire up, and nothing extra to maintain. It is a byproduct of building, not another chore.
Who writes the descriptions and assigns owners?
The people who write the models. Descriptions, owners and column notes live in YAML right next to each model, reviewed in the same pull request as the logic. The person changing a number is the person describing it, so the catalog reflects how the pipeline actually behaves today.
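One way to record ownership in that YAML is dbt's groups feature (dbt 1.5 and later), which attaches a named owner to a set of models. This is a minimal sketch, assuming a hypothetical group called finance and a placeholder email address.

```yaml
groups:
  - name: finance                      # hypothetical group name
    owner:
      name: Finance data team
      email: finance-data@example.com  # placeholder address

models:
  - name: mart_customer_revenue
    config:
      group: finance                   # ties the model to the group's owner
```

Because the group is assigned in the same file that describes the model, a change of ownership goes through the same pull request review as a change of logic.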
Will it go stale like the old wiki did?
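No, as long as the catalog is regenerated as part of every build. Columns, data types, tests and lineage are read from the project and the warehouse on each run, and descriptions change in the same pull request as the logic, so the catalog does not quietly drift from reality the way a hand-kept wiki does. If a description is wrong, you fix it where the model lives and the next build corrects it everywhere at once.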
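One way to keep the catalog regenerated with every build is to run the docs command in the same CI job as the build itself. The sketch below assumes GitHub Actions, a Postgres warehouse (hence the dbt-postgres adapter) and connection details supplied through a profiles.yml and secrets that are not shown. None of these choices come from the text above.

```yaml
# .github/workflows/dbt.yml (hypothetical workflow)
name: dbt build and docs

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-postgres   # adapter is an assumption
      - run: dbt build          # runs models and tests; a failure stops the job here
      - run: dbt docs generate  # rebuilds the catalog only after a passing build
```

Publishing the generated target directory to wherever the site is hosted is left out, since it depends on your infrastructure.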
Can business users actually use it, or is it just for engineers?
It is an interactive site anyone can browse. People search for a table or a metric, read the plain-language description, see who owns it and check that its tests are passing. No SQL required to find the trusted source.
Keep exploring
Catch bad data before the business sees it
Tests run on every build and fail loudly, so wrong records are caught at the source, not discovered in a board report.
Data lineage you can follow
dbt maps every model from raw source to dashboard using the references in your code, so you can trace where any number came from and see what a change will break before you make it.
Reliability you can prove
Every run leaves results, logs and freshness behind, and CI surfaces failures early, so you can show the pipeline is healthy.
Want this for your data?
If this is how you want your team to work, we should talk.