A perspective from Plainsight
For business

A data catalog everyone trusts

dbt generates a searchable docs site with descriptions, owners, types and tests, built from the code that runs so it never drifts.

The problem

Ask where the trusted customer revenue figure lives and you will often get a pause, then three different answers. People open the warehouse, find revenue_final, rev_v2 and bkp_old, and pick one on a hunch. No description, no owner, no way to tell which is current or whether anyone still maintains it. So analysts spend more time hunting for the right table than actually analysing, and a worrying share of the time they pick the wrong one. The cost is not just wasted hours: it is decisions made on numbers nobody can vouch for.

What dbt does about it

dbt makes the catalog a byproduct of building rather than a separate project. Documentation lives in YAML beside each model, so a description is versioned and reviewed together with the logic it describes. One command, dbt docs generate, turns your models into a searchable website: every table and column with its plain-language description, its owner, its data type and the tests that guard it, all generated from the code that actually runs. Run that command as part of every build and the catalog reflects today’s pipeline, not last quarter’s intentions. The dbt documentation guide covers how the descriptions and the catalog are produced.
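As a rough illustration of what "YAML beside each model" can look like, here is a minimal sketch of a dbt properties file. The file path, the column names, the description wording and the choice to record the owner under meta are all assumptions made for the example, not a prescribed layout. On dbt versions before 1.8 the data_tests key is written as tests.

```yaml
# models/marts/finance/_finance__models.yml  (hypothetical path)
version: 2

models:
  - name: mart_customer_revenue          # illustrative model name
    description: >
      One row per customer with recognised revenue.
      (Placeholder text: state the real calculation rule here.)
    config:
      meta:
        owner: "Finance data team"       # assumption: owner kept as a meta key
    columns:
      - name: customer_id                # hypothetical column
        description: "Unique identifier of the customer."
        data_tests:
          - unique
          - not_null
      - name: revenue                    # hypothetical column
        description: "Revenue attributed to the customer. (Placeholder definition.)"
        data_tests:
          - not_null
```

With a file like this in place, dbt docs generate picks up the descriptions, the meta values and the tests, and dbt docs serve opens the resulting site locally for browsing.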

What it looks like

An analyst types “customer revenue” into the catalog and gets one confident result: mart_customer_revenue, owned by the Finance data team, with a one-line definition of exactly how it is calculated and a green tick showing its tests are passing. Beside it, the old mystery tables, revenue_final, rev_v2 and bkp_old, are clearly what they are: unowned, undescribed, not the source of truth. The search that used to return a shrug now returns a single answer the whole business can stand behind.

How we think about it

A catalog only earns trust if it cannot lie, and the only catalog that cannot lie is one generated from the code that runs. We treat descriptions, owners and tests as part of the model, reviewed in the same pull request as the logic, never as documentation bolted on afterwards. The catalog tells you what a number means and who owns it; data lineage you can follow then shows you where it came from and what a change would touch. Together they turn the warehouse from a maze into a map your whole business can navigate by.

Questions

A data catalog everyone trusts, in short.

Where does the catalog come from?

dbt generates it out of the box. One command builds a searchable website from your models: every table and column with its description, owner, data type and the tests that guard it. There is no separate catalog tool to buy or wire up, and nothing extra to maintain. It is a byproduct of building, not another chore.

Who writes the descriptions and assigns owners?

The people who write the models. Descriptions, owners and column notes live in YAML right next to each model, reviewed in the same pull request as the logic. The person changing a number is the person describing it, so the catalog reflects how the pipeline actually behaves today.

Will it go stale like the old wiki did?

No, because it is generated from the code that runs. When docs generation is part of every build, the catalog is rebuilt from your models each time, so it cannot quietly drift from reality the way a hand-kept wiki does. If a description is wrong, you fix it where the model lives and it is correct everywhere at once.
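One way to make "rebuilt on every build" concrete is to run docs generation in the same automated job as the build. The sketch below assumes GitHub Actions and a Postgres warehouse purely for illustration. The workflow file name, branch, adapter and artifact name are hypothetical, and connection settings (a profiles.yml and credentials) are deliberately left out.

```yaml
# .github/workflows/dbt.yml  (hypothetical file name)
name: dbt build and docs

on:
  push:
    branches: [main]                     # assumption: deploys run from main

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Assumption: Postgres adapter; install the adapter for your warehouse.
      - run: pip install dbt-core dbt-postgres
      # Connection profile and secrets are omitted from this sketch.
      - run: dbt build                   # runs models and their tests
      - run: dbt docs generate           # rebuilds the catalog from the same code
      - uses: actions/upload-artifact@v4
        with:
          name: dbt-docs                 # hypothetical artifact name
          path: target/                  # dbt writes the docs artifacts here
```

Any scheduler or CI system would do; the point is only that the catalog is produced by the same run that builds the models.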

Can business users actually use it, or is it just for engineers?

It is an interactive site anyone can browse. People search for a table or a metric, read the plain-language description, see who owns it and check that its tests are passing. No SQL required to find the trusted source.

Want this for your data?

If this is how you want your team to work, we should talk.

Talk to us