Using dbt catalog.json: Mastering Documentation and Catalog Integration

2026-04-19
NicheeLab Editorial Team

catalog.json, generated by dbt, is the artifact that aggregates physical information about relations and columns in the actual database. Combined with the logical information from schema.yml (descriptions, tests, etc.) captured in manifest.json, it lets you build a trustworthy data catalog.

This article covers where catalog.json fits, how to generate and operate it, how to integrate it with external catalogs, and the points most often tested on the Analytics Engineer exam, all from a practitioner's perspective.

What catalog.json Is and How It Is Generated

catalog.json is an artifact written to the target directory when you run dbt docs generate. It uses metadata APIs provided by the adapter (Snowflake, BigQuery, Databricks, etc.) to collect relation and column information from the real schema. Typical contents include database/schema/table names, column names and types, comments, and (depending on the adapter) some row counts and statistics. Fields vary by version and adapter, but the role of aggregating physical metadata is stable.

Generating the docs site uses both manifest.json (logical information) and catalog.json (physical information). On its own, catalog.json lacks lineage and descriptions — it only becomes a complete catalog once joined with manifest.json.

  • Output location: target/catalog.json (manifest.json is written alongside it)
  • Prerequisites: profile authentication is active; the target adapter supports metadata retrieval
  • Caveat: with some adapters, statistics such as row_count can be null
  • Docs site: dbt docs serve provides it locally. Storing the artifact from CI/CD makes external integration easier

Artifact | Main contents | Primary use
catalog.json | Physical metadata (DB/schema/table/column, types, comments, some statistics) | Syncing to external catalogs; physical side of the docs
manifest.json | Node definitions, dependencies, properties (description, tests, sources, exposures) | Lineage, documentation, dependency analysis
run_results.json | Most recent execution results (status, runtime, messages) | Pipeline health; SLA/quality monitoring

The flow of artifact generation via docs generate

[Figure: dbt docs generate takes the Project (models / schema.yml) and the Profile (credentials) as inputs and emits target/manifest.json and target/catalog.json; the docs site merges the two when served.]

Generating and inspecting catalog.json (CLI)

dbt deps
# Run models as needed (catalog queries real tables; uncreated ones may not be retrievable)
dbt run --select my_model
# Generate docs (outputs manifest.json and catalog.json)
dbt docs generate --target-path target
# Browse locally
dbt docs serve --port 8080 --target-path target
# Verify outputs
ls -1 target | grep -E 'manifest|catalog'
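
Once the files exist, a short script can confirm what the catalog actually captured. The following is a minimal sketch rather than anything shipped with dbt: the helper names load_artifact and summarize_catalog and the inline sample are hypothetical, and it assumes the catalog layout in which each entry keeps its relation type, database, schema, and name under a metadata key.

```python
import json
from pathlib import Path


def load_artifact(path: str) -> dict:
    """Read a dbt artifact such as target/catalog.json into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_catalog(catalog: dict) -> list[tuple[str, str, int]]:
    """Return (unique_id, relation type, column count) for every cataloged node and source."""
    rows = []
    for section in ("nodes", "sources"):
        for unique_id, node in catalog.get(section, {}).items():
            meta = node.get("metadata", {})
            rows.append((unique_id, meta.get("type") or "unknown", len(node.get("columns", {}))))
    return sorted(rows)


# Hypothetical sample shaped like a catalog.json excerpt (assumed layout)
sample = {
    "nodes": {
        "model.my_proj.fct_orders": {
            "metadata": {"type": "BASE TABLE", "database": "ANALYTICS", "schema": "MART", "name": "FCT_ORDERS"},
            "columns": {
                "ORDER_ID": {"type": "NUMBER", "index": 1, "name": "ORDER_ID", "comment": None},
                "ORDER_TOTAL": {"type": "NUMBER", "index": 2, "name": "ORDER_TOTAL", "comment": None},
            },
        }
    },
    "sources": {},
}

summary = summarize_catalog(sample)
for unique_id, relation_type, column_count in summary:
    print(f"{unique_id}\t{relation_type}\t{column_count} columns")
```

To inspect a real artifact instead of the sample, pass load_artifact("target/catalog.json") to summarize_catalog. A model you expected but do not see in the output usually means the relation had not been built when docs generate ran.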

Model and column descriptions are written in schema.yml. The descriptions, tests, and meta land in manifest.json, while catalog.json holds column types, comments, and similar values obtained by querying the real schema. The docs site merges the two and shows logical column descriptions and physical column types on the same screen.

Docs blocks (defined in .md and referenced via doc()) also belong to manifest.json. The result is a clear split: catalog.json is "automatic collection of facts," while manifest.json is "declaration of intent."

  • schema.yml: defines description, columns[].description, tests, meta, etc.
  • catalog.json: holds columns[].type, comment, and table-level metadata (adapter-dependent)
  • The docs site cross-references both files for display
  • Treat schema.yml as the source of truth for descriptions and the database as the source of truth for types — this mindset keeps operations stable

Information | Storage | Display example
Model description | manifest.json (properties) | Shown in the table overview section
Column type / comment | catalog.json (columns) | Shown in the type/comment fields of the column detail view
Tests (unique, not_null, etc.) | manifest.json (tests) | Reflected as Quality information on the relevant column in docs

Merging logical and physical information

 schema.yml (description/tests)     database (types/comments)
               |                              |
               v                              v
         manifest.json                  catalog.json
                    \                    /
                     v                  v
                          docs site
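
The merge in the diagram can be sketched as a per-column join. This is an illustrative example only: merge_columns is a hypothetical helper, the sample nodes are invented, and the case-insensitive match is an assumption that suits warehouses which upper-case identifiers. Adjust it if your adapter preserves case.

```python
def merge_columns(manifest_node: dict, catalog_node: dict) -> list[dict]:
    """Join logical descriptions (manifest) with physical types and comments (catalog)."""
    # Names in schema.yml and in the warehouse may differ in case, so index by lower-case name.
    described = {name.lower(): col for name, col in manifest_node.get("columns", {}).items()}
    merged = []
    # Keep the physical column order using the "index" field of each catalog column.
    for col in sorted(catalog_node.get("columns", {}).values(), key=lambda c: c.get("index", 0)):
        name = col.get("name", "")
        logical = described.get(name.lower(), {})
        merged.append({
            "name": name,
            "type": col.get("type"),
            "comment": col.get("comment"),
            "description": logical.get("description", ""),
        })
    return merged


# Hypothetical inputs: one node from manifest.json and the matching node from catalog.json
manifest_node = {
    "columns": {
        "order_id": {"name": "order_id", "description": "Order ID (unique)"},
        "order_total": {"name": "order_total", "description": "Order amount (pre-tax)"},
    }
}
catalog_node = {
    "columns": {
        "ORDER_TOTAL": {"type": "NUMBER", "index": 2, "name": "ORDER_TOTAL", "comment": None},
        "ORDER_ID": {"type": "NUMBER", "index": 1, "name": "ORDER_ID", "comment": None},
        "LOADED_AT": {"type": "TIMESTAMP_NTZ", "index": 3, "name": "LOADED_AT", "comment": None},
    }
}

merged = merge_columns(manifest_node, catalog_node)
for row in merged:
    print(row["name"], row["type"], row["description"] or "(undocumented)")
```

Columns that exist physically but have no entry in schema.yml, such as the hypothetical LOADED_AT here, come back with an empty description, which makes documentation gaps easy to spot.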

Example of schema.yml and docs blocks

# models/orders/schema.yml
version: 2
models:
  - name: fct_orders
    description: "Order fact table; aggregated at daily grain"
    columns:
      - name: order_id
        description: "Order ID (unique)"
        tests:
          - unique
          - not_null
      - name: order_total
        description: "Order amount (pre-tax)"
        meta:
          pii: false
    docs:
      node_color: blue

# docs/blocks.md
{% docs fct_orders_notes %}
Reference: revenue-recognition logic has been signed off by the finance team.
{% enddocs %}

# models/orders/fct_orders.sql
select * from {{ ref('stg_orders') }}
-- description goes to manifest.json; types go to catalog.json

Choosing Between catalog.json and manifest.json

When integrating with external data catalogs, you typically ingest both manifest.json and catalog.json. Pull physical names and column types from catalog.json, and pull descriptions, owners, lineage, tests, and exposures from manifest.json, then register the merged result.

Note that ephemeral models do not create physical tables, so they never appear in catalog.json — although they do exist as nodes in manifest.json. Keep in mind that a node can be important for lineage yet absent from the physical catalog.

  • Source of truth for physical: catalog.json (limited by what the adapter retrieves)
  • Source of truth for logical: manifest.json (properties, lineage, exposures)
  • Ephemeral models do not appear in catalog.json but do exist in manifest.json
  • External integration is fundamentally a "join" of the two files

Aspect | catalog.json | manifest.json
Relation presence | Only existing physical entities (whatever the adapter returns) | All resources (models, sources, seeds, exposures, etc.)
Granularity | Centered on tables/views/columns (types, comments, some stats) | Properties, dependencies, documentation, tests
Main gaps | No lineage, descriptions, or tests | No physical information such as column types or statistics
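
Before syncing, it can help to list the models that exist in manifest.json but have no physical counterpart in catalog.json. The sketch below is a hedged illustration: missing_from_catalog is a hypothetical helper, the sample IDs are invented, and it assumes each manifest node exposes resource_type and config.materialized.

```python
def missing_from_catalog(manifest: dict, catalog: dict) -> dict[str, str]:
    """Map unique_id -> materialization for models in manifest.json that catalog.json lacks."""
    cataloged = set(catalog.get("nodes", {}))
    missing = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # tests, seeds, snapshots, etc. are ignored in this sketch
        if unique_id not in cataloged:
            missing[unique_id] = node.get("config", {}).get("materialized", "unknown")
    return missing


# Hypothetical artifacts: one table model, one ephemeral model, one test node
manifest = {
    "nodes": {
        "model.my_proj.fct_orders": {"resource_type": "model", "config": {"materialized": "table"}},
        "model.my_proj.int_order_items": {"resource_type": "model", "config": {"materialized": "ephemeral"}},
        "test.my_proj.unique_fct_orders_order_id": {"resource_type": "test", "config": {}},
    }
}
catalog = {"nodes": {"model.my_proj.fct_orders": {"metadata": {}, "columns": {}}}}

gaps = missing_from_catalog(manifest, catalog)
for unique_id, materialization in gaps.items():
    print(f"{unique_id}: not in catalog ({materialization})")
```

An ephemeral entry in the result is expected. A table or view appearing there suggests the relation was not built, or that the credentials used for docs generate could not see it.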

The three-in-one structure of docs

[Figure: manifest.json (Logic) + catalog.json (Physical) + run_results.json (Runtime) compose the docs site.]

Example catalog.json node (excerpt)

{
  "nodes": {
    "model.my_proj.fct_orders": {
      "metadata": {
        "type": "BASE TABLE",
        "database": "ANALYTICS",
        "schema": "MART",
        "name": "FCT_ORDERS"
      },
      "columns": {
        "ORDER_ID": {"type": "NUMBER", "index": 1, "name": "ORDER_ID", "comment": "Order ID"},
        "ORDER_TOTAL": {"type": "NUMBER", "index": 2, "name": "ORDER_TOTAL", "comment": "Order amount"}
      },
      "unique_id": "model.my_proj.fct_orders"
    }
  }
}

Integration Patterns with External Data Catalogs

Many data catalogs (e.g., DataHub, Amundsen, Collibra, Alation) either natively support ingesting dbt artifacts or have community implementations. They typically upload or reference the manifest.json + catalog.json pair and update entities by combining model descriptions and owners (manifest) with column types and comments (catalog).

There are three broad integration styles: polling (periodically fetching from a file location), push (calling an API from CI/CD), and reconciliation (merging dbt information into existing schema metadata). Push is the easiest to start with; implement incremental updates by keying off each manifest node's checksum, the artifact's metadata.generated_at timestamp, or file modification times.

  • Push: send to an API after docs generate in CI (easy to reproduce and detect failures)
  • Polling: periodically write to storage (S3/GCS/ADLS) for the external system to fetch
  • Reconciliation: treat the existing catalog's schema info as the baseline and merge in dbt's descriptions and lineage
  • Mark sensitive information explicitly via meta and filter before exposing externally

Style | Benefits | Considerations
Push (API) | Immediate reflection; easy failure detection | Requires network and authorization configuration
Polling (storage) | Loosely coupled; easy to extend | Need to control latency and duplicate processing
Reconciliation (merge) | Maintains consistency with existing operations | Requires key design and deduplication logic
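
For incremental push, one option is to compare node checksums between the previously stored manifest.json and the new one and send only what changed. This is a sketch under assumptions: changed_node_ids is a hypothetical helper, the digests are made up, and it assumes each node carries a checksum object with a checksum field. Because that checksum tracks the model file, description-only edits in schema.yml may need a separate comparison.

```python
def changed_node_ids(previous: dict, current: dict) -> list[str]:
    """Return unique_ids that are new, or whose checksum differs from the previous manifest."""

    def checksums(manifest: dict) -> dict:
        return {
            unique_id: (node.get("checksum") or {}).get("checksum")
            for unique_id, node in manifest.get("nodes", {}).items()
        }

    before, after = checksums(previous), checksums(current)
    return sorted(
        unique_id
        for unique_id, digest in after.items()
        if unique_id not in before or before[unique_id] != digest
    )


# Hypothetical manifests from two consecutive CI runs
previous_manifest = {
    "nodes": {
        "model.my_proj.fct_orders": {"checksum": {"name": "sha256", "checksum": "aaa111"}},
        "model.my_proj.stg_orders": {"checksum": {"name": "sha256", "checksum": "bbb222"}},
    }
}
current_manifest = {
    "nodes": {
        "model.my_proj.fct_orders": {"checksum": {"name": "sha256", "checksum": "ccc333"}},  # edited
        "model.my_proj.stg_orders": {"checksum": {"name": "sha256", "checksum": "bbb222"}},  # unchanged
        "model.my_proj.dim_customers": {"checksum": {"name": "sha256", "checksum": "ddd444"}},  # new
    }
}

to_push = changed_node_ids(previous_manifest, current_manifest)
print("nodes to push:", to_push)
```

Nodes that were deleted between runs are not reported here. Handling removals in the external catalog would need the reverse comparison.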

Data flow for external catalog integration

[Figure: dbt run → dbt docs generate produces target/{manifest,catalog}.json, which is pushed (API) or polled (storage) into the external catalog.]

Reading catalog.json and pushing to an API (minimal example)

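import json
import os

import requests

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", "target")
CAT_PATH = os.path.join(ARTIFACT_DIR, "catalog.json")
MANI_PATH = os.path.join(ARTIFACT_DIR, "manifest.json")

with open(CAT_PATH, "r", encoding="utf-8") as f:
    catalog = json.load(f)
with open(MANI_PATH, "r", encoding="utf-8") as f:
    manifest = json.load(f)

# Example: extract column definitions and send (adjust the payload schema to your catalog's API)
payload = []
for node_id, node in catalog.get("nodes", {}).items():
    # catalog.json keeps database/schema/name under "metadata"; build a dotted relation name
    meta = node.get("metadata", {})
    relation = ".".join(p for p in (meta.get("database"), meta.get("schema"), meta.get("name")) if p)
    # Logical descriptions come from manifest.json; match column names case-insensitively
    mani_cols = manifest.get("nodes", {}).get(node_id, {}).get("columns", {})
    descriptions = {k.lower(): v.get("description") for k, v in mani_cols.items()}
    for c in node.get("columns", {}).values():
        payload.append({
            "node_id": node_id,
            "relation": relation,
            "column": c.get("name"),
            "type": c.get("type"),
            "comment": c.get("comment"),
            "description": descriptions.get((c.get("name") or "").lower()),
        })

# The endpoint is a placeholder; replace it with your catalog's ingestion API
resp = requests.post("https://catalog.example.com/api/dbt/columns", json=payload, timeout=30)
resp.raise_for_status()
print("uploaded:", len(payload))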

Operations and Governance: Update Frequency, Quality, and Security

catalog.json reflects the current state of the database. Align the update frequency with how often models change and consumer needs — at minimum, daily regeneration is recommended. In CI, run docs generate on pull requests to visualize diffs and finalize the production catalog.json after each production release (store it separately per environment).

On the security side, decide whether comments on PII columns can be made public and filter them out before external integration. Because statistics are adapter-dependent and row_count and similar fields can be missing, pair catalog.json with other sources such as run_results.json for SLA monitoring.

  • Save target separately per environment (dev, staging, prod)
  • Check docs diffs on PRs and finalize the production catalog.json after each production release
  • Strip or mask PII and sensitive comments before external integration
  • Statistic availability depends on the adapter; missing values may be by design, not an error

Operational item | Recommendation | Notes
Update frequency | Daily (or as needed when changes are frequent) | Keep CI minimal; store the finalized version for production
Storage location | Versioned in object storage | Organize by environment / date / commit ID
Disclosure policy | Strip or mask PII comments | Control via meta and tags; filter before integration
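
The disclosure policy can be enforced with a small filter that runs before anything leaves CI. The sketch below is one possible shape, not a prescribed method: redact_sensitive is a hypothetical helper, the sample columns are invented, and it assumes sensitive columns are flagged with meta: {pii: true} in schema.yml, which lands under each column's meta in manifest.json.

```python
def redact_sensitive(manifest_node: dict, columns: list[dict]) -> list[dict]:
    """Blank out comment and description for columns whose meta marks them as PII."""
    flagged = {
        name.lower()
        for name, col in manifest_node.get("columns", {}).items()
        if col.get("meta", {}).get("pii") is True
    }
    cleaned = []
    for col in columns:
        row = dict(col)  # copy so the caller's data is left untouched
        if (row.get("name") or "").lower() in flagged:
            row["comment"] = None
            row["description"] = None
        cleaned.append(row)
    return cleaned


# Hypothetical manifest node and column rows prepared for an external catalog
manifest_node = {
    "columns": {
        "order_total": {"name": "order_total", "meta": {"pii": False}},
        "customer_email": {"name": "customer_email", "meta": {"pii": True}},
    }
}
rows = [
    {"name": "ORDER_TOTAL", "type": "NUMBER", "comment": "Order amount", "description": "Order amount (pre-tax)"},
    {"name": "CUSTOMER_EMAIL", "type": "VARCHAR", "comment": "Contact address", "description": "Customer email"},
]

safe_rows = redact_sensitive(manifest_node, rows)
for row in safe_rows:
    print(row["name"], row["type"], row["comment"])
```

The column name and type are kept so the external catalog still reflects the physical schema. Only the free-text fields, which are the likeliest place for sensitive detail, are cleared.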

Artifact handling in CI/CD

[Figure: dbt docs generate runs on every pull request (dev), main merge (stg), and prod deploy (prod); each run writes its target/* files to a shared artifact storage, which feeds the external catalog.]

Generating docs and storing artifacts in CI (Bash example)

# Switch connection target via environment variable
export DBT_TARGET=prod
export DBT_PROFILES_DIR=.

# Compute the date once so the generate and upload steps use the same directory
RUN_DATE="$(date +%F)"
ARTIFACT_PATH="artifacts/${DBT_TARGET}/${RUN_DATE}"

# Fetch dependencies (run models first if the relations do not exist yet)
dbt deps
# Generate (explicitly specify output path)
dbt docs generate --target "${DBT_TARGET}" --target-path "${ARTIFACT_PATH}"

# Upload artifacts (bucket name is a placeholder)
aws s3 sync "${ARTIFACT_PATH}" "s3://my-bucket/dbt-artifacts/${DBT_TARGET}/${RUN_DATE}"

Exam Key Points and Pitfalls

On the Analytics Engineer exam, questions about the roles of dbt documentation and artifacts come up often. Make sure you can distinguish catalog.json from manifest.json, explain the behavior of docs generate and docs serve, describe dashboards via exposures, and state the scope of test definitions.

Common pitfalls include: ephemeral models not appearing in catalog.json, confusing run_results.json with catalog.json, and mixing up docs blocks vs. description. Be ready to instantly say which piece of information lives in which file.

  • catalog.json = physical, manifest.json = logical, run_results.json = execution results
  • Exposures live in manifest.json and express BI integration metadata
  • Ephemeral does not appear physically (and so does not show up in catalog.json)
  • docs generate queries the database for physical information; gaps occur when authentication or permissions are insufficient

Frequently asked theme | Key terms to remember | Commonly confused contrast
Roles of artifacts | catalog = physical, manifest = logical | catalog vs run_results
BI integration metadata | exposures | sources vs exposures
Documentation | schema.yml description / docs blocks | Comment (physical) vs description (logical)

Which question is answered by which file

 Physical type?        -> catalog.json
 Description / owner?  -> manifest.json
 Did the run succeed?  -> run_results.json
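
The third question, whether the run succeeded, can be answered from run_results.json with a few lines. This is a minimal sketch with assumptions: unhealthy_results is a hypothetical helper, the sample results and the 60-second threshold are invented, and it assumes each entry under results carries unique_id, status, and execution_time.

```python
OK_STATUSES = {"success", "pass"}


def unhealthy_results(run_results: dict, max_seconds: float) -> list[tuple[str, str]]:
    """Return (unique_id, reason) for results that did not succeed or exceeded max_seconds."""
    problems = []
    for result in run_results.get("results", []):
        status = result.get("status")
        if status not in OK_STATUSES:
            problems.append((result.get("unique_id"), f"status={status}"))
        elif result.get("execution_time", 0.0) > max_seconds:
            problems.append((result.get("unique_id"), "slow"))
    return problems


# Hypothetical excerpt of a run_results.json file
run_results = {
    "results": [
        {"unique_id": "model.my_proj.stg_orders", "status": "success", "execution_time": 2.5},
        {"unique_id": "model.my_proj.fct_orders", "status": "error", "execution_time": 0.4},
        {"unique_id": "model.my_proj.dim_customers", "status": "success", "execution_time": 95.0},
    ]
}

issues = unhealthy_results(run_results, max_seconds=60.0)
for unique_id, reason in issues:
    print(f"{unique_id}: {reason}")
```

Nothing in this check touches catalog.json, which is the point of the contrast: physical shape and run outcome live in different files.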

Extract column types of a specific table from catalog.json with jq

TABLE_ID="model.my_proj.fct_orders"
jq -r --arg id "$TABLE_ID" \
  '.nodes[$id].columns | to_entries[] | "\(.value.name),\(.value.type)"' \
  target/catalog.json

Check Your Understanding

Analytics Engineer

Question 1

You wrote model and column descriptions in schema.yml in dbt. You want to integrate column types and comments into an external data catalog. Which artifact is the most appropriate source?

  1. catalog.json
  2. run_results.json
  3. packages.yml
  4. profiles.yml

Correct answer: 1 (catalog.json)

catalog.json aggregates the database's physical metadata (tables/columns, types, comments, etc.). run_results.json holds the most recent execution results, packages.yml declares dependent packages, and profiles.yml stores connection settings, so none of them fit this purpose.

Frequently Asked Questions

When does catalog.json get updated? Does it stay current without running the models?

It is refreshed whenever you run dbt docs generate. As long as the adapter can fetch metadata, the latest physical information is collected for relations that already exist in the database — even if you did not run the models immediately beforehand. Relations that have not been created yet are not in scope.

Do ephemeral models and CTEs show up in catalog.json?

Ephemeral models do not create a physical relation, so they do not appear in catalog.json (they are still recorded as nodes in manifest.json). CTEs likewise never materialize as physical objects, so they are out of scope.

Why are some of the statistics in catalog.json (such as row_count) null?

Statistic coverage is adapter-dependent, and fields that cannot be retrieved are returned as null. That is by design, not a bug. For SLA monitoring or volume checks, complement catalog.json with run_results.json or separate queries.
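
When reading statistics programmatically, it is safer to treat every stat as optional. The sketch below is illustrative only: read_stat and first_available are hypothetical helpers, the sample node is invented, and the stat IDs (row_count, num_rows) are assumptions that vary by adapter, so check the IDs present in your own catalog.json.

```python
def read_stat(catalog_node: dict, stat_id: str):
    """Return a statistic's value, or None when the adapter did not report it."""
    stat = catalog_node.get("stats", {}).get(stat_id)
    if not stat:
        return None
    return stat.get("value")


def first_available(catalog_node: dict, stat_ids: tuple[str, ...]):
    """Try several adapter-specific stat IDs and return the first value that is not None."""
    for stat_id in stat_ids:
        value = read_stat(catalog_node, stat_id)
        if value is not None:
            return value
    return None


# Hypothetical catalog.json nodes: one with a row count, one without any stats
with_stats = {
    "stats": {
        "has_stats": {"id": "has_stats", "label": "Has Stats?", "value": True, "include": False},
        "row_count": {"id": "row_count", "label": "Row Count", "value": 1200, "include": True},
    }
}
without_stats = {"stats": {"has_stats": {"id": "has_stats", "label": "Has Stats?", "value": False, "include": False}}}

rows_a = first_available(with_stats, ("row_count", "num_rows"))
rows_b = first_available(without_stats, ("row_count", "num_rows"))
print("node A rows:", rows_a)
print("node B rows:", rows_b if rows_b is not None else "not reported")
```

Returning None instead of raising keeps a sync job from failing on adapters that simply do not report the figure.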
