dbt catalog.json: Generated by dbt docs (2026)

catalog.json, generated by dbt, is the artifact that aggregates physical information about relations and columns in the actual database. Combined with the logical information from schema.yml (descriptions, tests, etc.) captured in manifest.json, it lets you build a trustworthy data catalog.

This article covers where catalog.json fits, how to generate and operate it, integration patterns with external catalogs, and the points most often tested on the Analytics Engineer exam, all from a practitioner's perspective.

What catalog.json Is and How It Is Generated

catalog.json is an artifact written to the target directory when you run dbt docs generate. It uses metadata APIs provided by the adapter (Snowflake, BigQuery, Databricks, etc.) to collect relation and column information from the real schema. Typical contents include database/schema/table names, column names and types, comments, and (depending on the adapter) some row counts and statistics. Fields vary by version and adapter, but the role of aggregating physical metadata is stable.

Generating the docs site uses both manifest.json (logical information) and catalog.json (physical information). On its own, catalog.json lacks lineage and descriptions — it only becomes a complete catalog once joined with manifest.json.

Output location: target/catalog.json (manifest.json is written alongside it)
Prerequisites: the profile's credentials are valid; the target adapter supports metadata retrieval
Caveat: with some adapters, statistics such as row_count can be null
Docs site: dbt docs serve provides it locally. Storing the artifact from CI/CD makes external integration easier
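As a quick sanity check on a generated file, the sketch below counts the relations and columns recorded in a parsed catalog.json. It assumes the layout where each entry under nodes and sources carries a metadata object and a columns object; the function name summarize_catalog is illustrative, and field names may differ by dbt version and adapter.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_catalog(catalog: dict) -> dict:
    """Count relations and columns in a parsed catalog.json."""
    # Models, seeds, and snapshots are listed under "nodes"; sources under "sources".
    relations = {**catalog.get("nodes", {}), **catalog.get("sources", {})}
    by_type = Counter(
        (rel.get("metadata") or {}).get("type", "unknown")
        for rel in relations.values()
    )
    column_count = sum(len(rel.get("columns") or {}) for rel in relations.values())
    return {
        "relations": len(relations),
        "columns": column_count,
        "by_type": dict(by_type),
    }


if __name__ == "__main__":
    path = Path("target/catalog.json")  # default output location
    if path.exists():
        print(summarize_catalog(json.loads(path.read_text(encoding="utf-8"))))
```

A sudden drop in the relation count between two runs is often a sign of missing permissions rather than dropped tables.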

Artifact	Main contents	Primary use
catalog.json	Physical metadata (DB/schema/table/column, types, comments, some statistics)	Syncing to external catalogs; physical side of the docs
manifest.json	Node definitions, dependencies, properties (description, tests, sources, exposures)	Lineage, documentation, dependency analysis
run_results.json	Most recent execution results (status, runtime, messages)	Pipeline health; SLA/quality monitoring

The flow of artifact generation via docs generate

Generating and inspecting catalog.json (CLI)

dbt deps
# Run models as needed (catalog queries real tables; uncreated ones may not be retrievable)
dbt run --select my_model
# Generate docs (outputs manifest.json and catalog.json)
dbt docs generate --target-path target
# Browse locally
dbt docs serve --port 8080 --target-path target
# Verify outputs
ls -1 target | grep -E 'manifest|catalog'

How Documentation Connects to catalog.json

Model and column descriptions are written in schema.yml. The descriptions, tests, and meta land in manifest.json, while catalog.json holds column types, comments, and similar values obtained by querying the real schema. The docs site merges the two and shows logical column descriptions and physical column types on the same screen.

Docs blocks (defined in .md and referenced via doc()) also belong to manifest.json. The result is a clear split: catalog.json is "automatic collection of facts," while manifest.json is "declaration of intent."

schema.yml: defines description, columns[].description, tests, meta, etc.
catalog.json: holds each column's type and comment (columns.<name>.type, columns.<name>.comment) plus table-level metadata (adapter-dependent)
The docs site cross-references both files for display
Treat schema.yml as the source of truth for descriptions and the database as the source of truth for types — this mindset keeps operations stable
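The cross-reference described above can be sketched as a small join for a single model. This is a minimal illustration rather than how the docs site itself is implemented; merge_columns is a hypothetical helper, and it assumes both node objects expose a columns mapping keyed by column name.

```python
def merge_columns(manifest_node: dict, catalog_node: dict) -> list:
    """Join declared descriptions (manifest) with physical types (catalog)."""
    # Column names can differ in case between schema.yml and the warehouse
    # (for example, Snowflake upper-cases identifiers), so match case-insensitively.
    declared = {
        name.lower(): col
        for name, col in (manifest_node.get("columns") or {}).items()
    }
    merged = []
    for name, physical in (catalog_node.get("columns") or {}).items():
        logical = declared.get(name.lower(), {})
        merged.append({
            "column": name,
            "type": physical.get("type"),        # fact from the database
            "comment": physical.get("comment"),  # fact from the database
            "description": logical.get("description", ""),  # intent from schema.yml
        })
    return merged
```

Columns that exist in the database but are not declared in schema.yml come back with an empty description, which makes documentation gaps easy to list.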

Information	Storage	Display example
Model description	manifest.json (properties)	Shown in the table overview section
Column type / comment	catalog.json (columns)	Shown in the type/comment fields of the column detail view
Tests (unique, not_null, etc.)	manifest.json (tests)	Reflected as Quality information on the relevant column in docs

Merging logical and physical information

 schema.yml (description/tests)      database (types/comments)
               |                                |
               v                                v
         manifest.json                    catalog.json
                    \                      /
                     \                    /
                      v                  v
                           docs site

Example of schema.yml and docs blocks

# models/orders/schema.yml
version: 2
models:
  - name: fct_orders
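    description: "Order fact table; aggregated at daily grain"
    # To pull in the docs block defined below instead, use:
    # description: '{{ doc("fct_orders_notes") }}'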
    columns:
      - name: order_id
        description: "Order ID (unique)"
        tests:
          - unique
          - not_null
      - name: order_total
        description: "Order amount (pre-tax)"
        meta:
          pii: false
    docs:
      node_color: blue

# docs/blocks.md
{% docs fct_orders_notes %}
Reference: revenue-recognition logic has been signed off by the finance team.
{% enddocs %}

# models/orders/fct_orders.sql
select * from {{ ref('stg_orders') }}
-- description goes to manifest.json; types go to catalog.json

Choosing Between catalog.json and manifest.json

When integrating with external data catalogs, you typically ingest both manifest.json and catalog.json. Pull physical names and column types from catalog.json, and pull descriptions, owners, lineage, tests, and exposures from manifest.json, then register the merged result.

Note that ephemeral models do not create physical tables, so they never appear in catalog.json — although they do exist as nodes in manifest.json. Keep in mind that a node can be important for lineage yet absent from the physical catalog.

Source of truth for physical: catalog.json (limited by what the adapter retrieves)
Source of truth for logical: manifest.json (properties, lineage, exposures)
Ephemeral models do not appear in catalog.json but do exist in manifest.json
External integration is fundamentally a "join" of the two files
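Because the two files cover different sets of nodes, it can help to list which manifest entries have no physical counterpart before joining them. The sketch below is one way to do that; the function name is illustrative, and it assumes manifest nodes carry resource_type and config.materialized.

```python
def nodes_missing_from_catalog(manifest: dict, catalog: dict) -> dict:
    """Return {unique_id: materialization} for relation-building nodes absent from the catalog."""
    physical_ids = set(catalog.get("nodes", {}))
    missing = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        # Only models, seeds, and snapshots are expected to build relations.
        if node.get("resource_type") not in {"model", "seed", "snapshot"}:
            continue
        if unique_id not in physical_ids:
            config = node.get("config") or {}
            missing[unique_id] = config.get("materialized", "unknown")
    return missing
```

Entries reported as ephemeral are expected. Anything else usually means the relation has not been built yet or the adapter could not read it.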

Aspect	catalog.json	manifest.json
Relation presence	Only existing physical entities (whatever the adapter returns)	All resources (models, sources, seeds, exposures, etc.)
Granularity	Centered on tables/views/columns (types, comments, some stats)	Properties, dependencies, documentation, tests
Main gaps	No lineage, descriptions, or tests	No physical information such as column types or statistics

The three-in-one structure of docs

Example catalog.json node (excerpt)

{
  "nodes": {
    "model.my_proj.fct_orders": {
      "metadata": {
        "type": "BASE TABLE",
        "database": "ANALYTICS",
        "schema": "MART",
        "name": "FCT_ORDERS",
        "comment": null
      },
      "columns": {
        "ORDER_ID": {"type": "NUMBER", "index": 1, "name": "ORDER_ID", "comment": "Order ID"},
        "ORDER_TOTAL": {"type": "NUMBER", "index": 2, "name": "ORDER_TOTAL", "comment": "Order amount"}
      },
      "stats": {},
      "unique_id": "model.my_proj.fct_orders"
    }
  }
}

Integration Patterns with External Data Catalogs

Many data catalogs (e.g., DataHub, Amundsen, Collibra, Alation) either natively support ingesting dbt artifacts or have community implementations. They typically upload or reference the manifest.json + catalog.json pair and update entities by combining model descriptions and owners (manifest) with column types and comments (catalog).

There are three broad integration styles: polling (periodically fetching from a file location), push (calling an API from CI/CD), and reconciliation (merging dbt information into existing schema metadata). Push is the easiest to start with; implement incremental updates by keying off each node's checksum in manifest.json, the artifact's generated_at timestamp, or file modification times.

Push: send to an API after docs generate in CI (easy to reproduce and detect failures)
Polling: periodically write to storage (S3/GCS/ADLS) for the external system to fetch
Reconciliation: treat the existing catalog's schema info as the baseline and merge in dbt's descriptions and lineage
Mark sensitive information explicitly via meta and filter before exposing externally

Style	Benefits	Considerations
Push (API)	Immediate reflection; easy failure detection	Requires network and authorization configuration
Polling (storage)	Loosely coupled; easy to extend	Need to control latency and duplicate processing
Reconciliation (merge)	Maintains consistency with existing operations	Requires key design and deduplication logic
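For incremental updates, one option is to compare the previous and current manifest.json and push only the nodes that changed. The sketch below assumes each node exposes a checksum object with a checksum field; changed_node_ids is an illustrative name. The checksum is understood to cover the node's source file, so description-only edits in schema.yml may need a separate comparison.

```python
def changed_node_ids(previous: dict, current: dict) -> list:
    """List nodes that are new or whose checksum differs between two manifests."""

    def checksums(manifest: dict) -> dict:
        return {
            uid: (node.get("checksum") or {}).get("checksum")
            for uid, node in manifest.get("nodes", {}).items()
        }

    old, new = checksums(previous), checksums(current)
    # A node counts as changed when it is new or its digest no longer matches.
    return sorted(
        uid for uid, digest in new.items()
        if uid not in old or old[uid] != digest
    )
```

Deleted nodes are not reported here; if the external catalog needs removals, compare the key sets in the other direction as well.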

Data flow for external catalog integration

Reading catalog.json and pushing to an API (minimal example)

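import json
import os

import requests  # third-party: pip install requests

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", "target")
CAT_PATH = os.path.join(ARTIFACT_DIR, "catalog.json")
MANI_PATH = os.path.join(ARTIFACT_DIR, "manifest.json")

with open(CAT_PATH, "r", encoding="utf-8") as f:
    catalog = json.load(f)
with open(MANI_PATH, "r", encoding="utf-8") as f:
    manifest = json.load(f)

# Example: extract column definitions and send (adjust the payload schema as needed)
payload = []
for node_id, node in catalog.get("nodes", {}).items():
    # catalog.json keeps the physical names under "metadata"
    meta = node.get("metadata", {})
    parts = (meta.get("database"), meta.get("schema"), meta.get("name"))
    relation = ".".join(p for p in parts if p)
    # Descriptions come from manifest.json; match column names case-insensitively
    declared = {
        name.lower(): col
        for name, col in manifest.get("nodes", {}).get(node_id, {}).get("columns", {}).items()
    }
    for c in node.get("columns", {}).values():
        col_name = c.get("name") or ""
        payload.append({
            "node_id": node_id,
            "relation": relation,
            "column": col_name,
            "type": c.get("type"),
            "comment": c.get("comment"),
            "description": declared.get(col_name.lower(), {}).get("description"),
        })

# Placeholder endpoint; replace with your catalog's ingestion API
resp = requests.post("https://catalog.example.com/api/dbt/columns", json=payload, timeout=30)
resp.raise_for_status()
print("uploaded:", len(payload))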

Operations and Governance: Update Frequency, Quality, and Security

catalog.json reflects the current state of the database. Match the update frequency to how often models change and what consumers need; regenerating at least daily is a reasonable baseline. In CI, run docs generate on pull requests to review diffs, then regenerate the production catalog.json after each production release and store it separately per environment.

On the security side, decide whether comments on PII columns can be made public and filter them out before external integration. Because statistics are adapter-dependent and row_count and similar fields can be missing, pair catalog.json with other sources such as run_results.json for SLA monitoring.

Save target separately per environment (dev, staging, prod)
Check docs diffs on PRs and finalize the production catalog.json after each production release
Strip or mask PII and sensitive comments before external integration
Statistic availability depends on the adapter; missing values may be by design, not an error
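To enforce an update-frequency target such as the daily baseline mentioned above, a scheduled job can check the artifact's own timestamp. This sketch assumes catalog.json records metadata.generated_at as an ISO 8601 UTC string; catalog_age and is_stale are illustrative names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def catalog_age(catalog: dict, now: Optional[datetime] = None) -> timedelta:
    """Return how long ago the catalog was generated."""
    raw = catalog["metadata"]["generated_at"]
    # Normalize a trailing "Z" so fromisoformat also works before Python 3.11.
    generated = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if generated.tzinfo is None:
        generated = generated.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return (now or datetime.now(timezone.utc)) - generated


def is_stale(catalog: dict, max_age: timedelta = timedelta(days=1),
             now: Optional[datetime] = None) -> bool:
    """True when the catalog is older than the allowed age."""
    return catalog_age(catalog, now) > max_age
```

Passing now explicitly keeps the check deterministic in tests; in production it defaults to the current UTC time.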

Operational item	Recommendation	Notes
Update frequency	Daily (or as needed when changes are frequent)	Keep CI minimal; store the finalized version for production
Storage location	Versioned in object storage	Organize by environment / date / commit ID
Disclosure policy	Strip or mask PII comments	Control via meta and tags; filter before integration
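The disclosure policy can be applied as a filter step before anything leaves your environment. The sketch below blanks the physical comment on columns flagged in manifest.json; it assumes a boolean meta key named pii declared per column in schema.yml, and mask_pii_comments is a hypothetical helper.

```python
import copy


def mask_pii_comments(catalog: dict, manifest: dict, flag: str = "pii") -> dict:
    """Return a copy of the catalog with comments removed from flagged columns."""
    masked = copy.deepcopy(catalog)  # never mutate the stored artifact
    for unique_id, cat_node in masked.get("nodes", {}).items():
        manifest_node = manifest.get("nodes", {}).get(unique_id) or {}
        declared = manifest_node.get("columns") or {}
        # Collect columns whose meta explicitly marks them as sensitive.
        flagged = {
            name.lower()
            for name, col in declared.items()
            if (col.get("meta") or {}).get(flag) is True
        }
        for name, col in (cat_node.get("columns") or {}).items():
            if name.lower() in flagged:
                col["comment"] = None
    return masked
```

Columns without an explicit flag pass through unchanged, so a stricter policy would invert the check and mask anything not marked safe.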

Artifact handling in CI/CD

Generating docs and storing artifacts in CI (Bash example)

# Fail fast on errors and unset variables
set -euo pipefail

# Switch connection target via environment variable
export DBT_TARGET=prod
export DBT_PROFILES_DIR=.

# Compute the date once so generate and upload use the same directory
RUN_DATE="$(date +%F)"
OUT_DIR="artifacts/${DBT_TARGET}/${RUN_DATE}"

# Fetch dependencies and run (as needed)
dbt deps
# Generate (explicitly specify output path)
dbt docs generate --target "${DBT_TARGET}" --target-path "${OUT_DIR}"

# Upload artifacts (bucket name is a placeholder)
aws s3 sync "${OUT_DIR}" "s3://my-bucket/dbt-artifacts/${DBT_TARGET}/${RUN_DATE}"

Exam Key Points and Pitfalls

The Analytics Engineer exam often tests the roles of dbt documentation and artifacts. Be clear on the differences between catalog.json and manifest.json, the behavior of docs generate and docs serve, how exposures describe dashboards, and the scope of test definitions.

Common pitfalls include: ephemeral models not appearing in catalog.json, confusing run_results.json with catalog.json, and mixing up docs blocks vs. description. Be ready to instantly say which piece of information lives in which file.

catalog.json = physical, manifest.json = logical, run_results.json = execution results
Exposures live in manifest.json and express BI integration metadata
Ephemeral does not appear physically (and so does not show up in catalog.json)
docs generate queries the database for physical information; gaps occur when authentication or permissions are insufficient

Frequently asked theme	Key terms to remember	Commonly confused contrast
Roles of artifacts	catalog = physical, manifest = logical	catalog vs run_results
BI integration metadata	exposures	sources vs exposures
Documentation	schema.yml description / docs blocks	Comment (physical) vs description (logical)

Which question is answered by which file

 Physical type? ----> catalog.json
 Description / owner? -> manifest.json
 Did the run succeed? -> run_results.json
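For the third question, a short script can answer "did the run succeed?" from run_results.json. This is a minimal sketch assuming the file holds a results list whose items carry unique_id, status, and execution_time; summarize_run_results is an illustrative name.

```python
from collections import Counter


def summarize_run_results(run_results: dict) -> dict:
    """Tally result statuses and find the slowest node in a parsed run_results.json."""
    results = run_results.get("results", [])
    statuses = Counter(r.get("status", "unknown") for r in results)
    # max() with default=None handles an empty results list.
    slowest = max(results, key=lambda r: r.get("execution_time") or 0.0, default=None)
    return {
        "statuses": dict(statuses),
        "slowest": slowest.get("unique_id") if slowest else None,
    }
```

None of this information is available in catalog.json, which is why the two files are easy to confuse on the exam but not interchangeable in practice.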

Extract column types of a specific table from catalog.json with jq

TABLE_ID="model.my_proj.fct_orders"
jq -r --arg id "$TABLE_ID" \
  '.nodes[$id].columns | to_entries[] | "\(.value.name),\(.value.type)"' \
  target/catalog.json

Check Your Understanding

Analytics Engineer

Question 1

You wrote model and column descriptions in schema.yml in dbt. You want to integrate column types and comments into an external data catalog. Which artifact is the most appropriate source?

A. catalog.json
B. run_results.json
C. packages.yml
D. profiles.yml

Correct answer: A

catalog.json aggregates the database's physical metadata (tables/columns, types, comments, etc.). run_results.json holds the most recent execution results, packages.yml declares dependent packages, and profiles.yml stores connection settings, so none of them fit this purpose.

Frequently Asked Questions

When does catalog.json get updated? Does it stay current without running the models?

It is refreshed whenever you run dbt docs generate. As long as the adapter can fetch metadata, the latest physical information is collected for relations that already exist in the database — even if you did not run the models immediately beforehand. Relations that have not been created yet are not in scope.

Do ephemeral models and CTEs show up in catalog.json?

Ephemeral models do not create a physical relation, so they do not appear in catalog.json (they are still recorded as nodes in manifest.json). CTEs likewise never exist as physical objects, so they are out of scope.

Why are some of the statistics in catalog.json (such as row_count) null?

Statistic coverage is adapter-dependent, and fields that cannot be retrieved are returned as null. That is by design, not a bug. For SLA monitoring or volume checks, complement catalog.json with run_results.json or separate queries.
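When reading statistics, it is safer to treat a missing value as "unknown" than as zero. The sketch below assumes each node has a stats mapping whose entries carry a value field; the statistic id row_count is an assumption that may differ by adapter, so it is exposed as a parameter.

```python
from typing import Optional


def row_count(catalog_node: dict, stat_id: str = "row_count") -> Optional[int]:
    """Return the row count for a catalog node, or None when it is unavailable."""
    stat = (catalog_node.get("stats") or {}).get(stat_id) or {}
    value = stat.get("value")
    # Booleans (such as a has_stats flag) are not counts.
    if value is None or isinstance(value, bool):
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

Callers can then distinguish "no statistic available" (None) from an empty table (0) before raising a volume alert.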


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
