catalog.json, generated by dbt, is the artifact that aggregates physical information about relations and columns in the actual database. Combined with the logical information from schema.yml (descriptions, tests, etc.) captured in manifest.json, it lets you build a trustworthy data catalog.
This article covers where catalog.json fits among dbt artifacts, how to generate and operate it, integration patterns with external catalogs, and the points most often tested on the Analytics Engineer exam, all from a practitioner's perspective.
catalog.json is an artifact written to the target directory when you run dbt docs generate. It uses metadata APIs provided by the adapter (Snowflake, BigQuery, Databricks, etc.) to collect relation and column information from the real schema. Typical contents include database/schema/table names, column names and types, comments, and (depending on the adapter) some row counts and statistics. Fields vary by version and adapter, but the role of aggregating physical metadata is stable.
Generating the docs site uses both manifest.json (logical information) and catalog.json (physical information). On its own, catalog.json lacks lineage and descriptions — it only becomes a complete catalog once joined with manifest.json.
| Artifact | Main contents | Primary use |
|---|---|---|
| catalog.json | Physical metadata (DB/schema/table/column, types, comments, some statistics) | Syncing to external catalogs; physical side of the docs |
| manifest.json | Node definitions, dependencies, properties (description, tests, sources, exposures) | Lineage, documentation, dependency analysis |
| run_results.json | Most recent execution results (status, runtime, messages) | Pipeline health; SLA/quality monitoring |
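
The join between the two artifacts can be sketched as follows. This is a minimal illustration, not dbt's own merge logic: the function name `merge_columns` and the in-memory sample dictionaries are hypothetical, and it assumes both artifacts key their nodes by the same unique_id. Column names are compared case-insensitively because the manifest follows schema.yml casing while the catalog follows the database.

```python
from typing import Any


def merge_columns(manifest: dict[str, Any], catalog: dict[str, Any]) -> list[dict[str, Any]]:
    """Join logical (manifest) and physical (catalog) column info by unique_id."""
    rows = []
    for unique_id, cat_node in catalog.get("nodes", {}).items():
        man_node = manifest.get("nodes", {}).get(unique_id, {})
        # Manifest keys follow schema.yml casing; catalog keys follow the database
        # (for example upper case on Snowflake), so match on lower-cased names.
        man_cols = {k.lower(): v for k, v in man_node.get("columns", {}).items()}
        for col_name, cat_col in cat_node.get("columns", {}).items():
            man_col = man_cols.get(col_name.lower(), {})
            rows.append({
                "unique_id": unique_id,
                "column": col_name,
                "type": cat_col.get("type"),                     # physical, from catalog
                "description": man_col.get("description", ""),  # logical, from manifest
            })
    return rows


# Tiny in-memory stand-ins for the two artifacts (hypothetical values).
sample_manifest = {"nodes": {"model.my_proj.fct_orders": {
    "columns": {"order_id": {"name": "order_id", "description": "Order ID (unique)"}}}}}
sample_catalog = {"nodes": {"model.my_proj.fct_orders": {
    "columns": {"ORDER_ID": {"name": "ORDER_ID", "type": "NUMBER"},
                "ORDER_TOTAL": {"name": "ORDER_TOTAL", "type": "NUMBER"}}}}}

merged = merge_columns(sample_manifest, sample_catalog)
for row in merged:
    print(row)
```

Columns that exist physically but are undocumented in schema.yml come back with an empty description, which is a convenient way to spot documentation gaps.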
The flow of artifact generation via docs generate
Generating and inspecting catalog.json (CLI)
dbt deps
# Run models as needed (catalog queries real tables; uncreated ones may not be retrievable)
dbt run --select my_model
# Generate docs (outputs manifest.json and catalog.json)
dbt docs generate --target-path target
# Browse locally
dbt docs serve --port 8080 --target-path target
# Verify outputs
ls -1 target | grep -E 'manifest|catalog'
Model and column descriptions are written in schema.yml. The descriptions, tests, and meta land in manifest.json, while catalog.json holds column types, comments, and similar values obtained by querying the real schema. The docs site merges the two and shows logical column descriptions and physical column types on the same screen.
Docs blocks (defined in .md and referenced via doc()) also belong to manifest.json. The result is a clear split: catalog.json is "automatic collection of facts," while manifest.json is "declaration of intent."
| Information | Storage | Display example |
|---|---|---|
| Model description | manifest.json (properties) | Shown in the table overview section |
| Column type / comment | catalog.json (columns) | Shown in the type/comment fields of the column detail view |
| Tests (unique, not_null, etc.) | manifest.json (tests) | Reflected as Quality information on the relevant column in docs |
Merging logical and physical information
schema.yml (description/tests)      database schema (types/comments)
              |                                   |
              v                                   v
        manifest.json                       catalog.json
               \                             /
                v                           v
                         docs site
Example of schema.yml and docs blocks
# models/orders/schema.yml
version: 2
models:
  - name: fct_orders
    description: "Order fact table; aggregated at daily grain"
    columns:
      - name: order_id
        description: "Order ID (unique)"
        tests:
          - unique
          - not_null
      - name: order_total
        description: "Order amount (pre-tax)"
        meta:
          pii: false
    docs:
      node_color: blue

# docs/blocks.md
{% docs fct_orders_notes %}
Reference: revenue-recognition logic has been signed off by the finance team.
{% enddocs %}

# models/orders/fct_orders.sql
select * from {{ ref('stg_orders') }}
-- description goes to manifest.json; types go to catalog.json

When integrating with external data catalogs, you typically ingest both manifest.json and catalog.json. Pull physical names and column types from catalog.json, and pull descriptions, owners, lineage, tests, and exposures from manifest.json, then register the merged result.
Note that ephemeral models do not create physical tables, so they never appear in catalog.json — although they do exist as nodes in manifest.json. Keep in mind that a node can be important for lineage yet absent from the physical catalog.
| Aspect | catalog.json | manifest.json |
|---|---|---|
| Relation presence | Only existing physical entities (whatever the adapter returns) | All resources (models, sources, seeds, exposures, etc.) |
| Granularity | Centered on tables/views/columns (types, comments, some stats) | Properties, dependencies, documentation, tests |
| Main gaps | No lineage, descriptions, or tests | No physical information such as column types or statistics |
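
The gap between the two artifacts can be listed with a simple set difference. The sketch below is illustrative only: `missing_from_catalog` is a hypothetical helper, the sample data is made up, and it assumes manifest nodes carry `resource_type` and `config.materialized` as they do in recent dbt versions.

```python
def missing_from_catalog(manifest: dict, catalog: dict) -> dict:
    """Return {unique_id: materialization} for relation-like nodes absent from the catalog."""
    relation_types = {"model", "seed", "snapshot"}
    cataloged = set(catalog.get("nodes", {}))
    missing = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") not in relation_types:
            continue  # tests, analyses, and similar nodes never produce a relation
        if unique_id not in cataloged:
            missing[unique_id] = node.get("config", {}).get("materialized", "unknown")
    return missing


# Hypothetical sample: one table, one ephemeral model, one test.
gap_manifest = {"nodes": {
    "model.my_proj.fct_orders": {"resource_type": "model", "config": {"materialized": "table"}},
    "model.my_proj.int_orders": {"resource_type": "model", "config": {"materialized": "ephemeral"}},
    "test.my_proj.unique_fct_orders_order_id": {"resource_type": "test", "config": {}},
}}
gap_catalog = {"nodes": {"model.my_proj.fct_orders": {"columns": {}}}}

gaps = missing_from_catalog(gap_manifest, gap_catalog)
print(gaps)
```

An entry whose materialization is `ephemeral` is expected to be missing; anything else in the result usually means the relation has not been built yet in the target environment.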
The three-in-one structure of docs
Example catalog.json node (excerpt)
{
  "nodes": {
    "model.my_proj.fct_orders": {
      "unique_id": "model.my_proj.fct_orders",
      "metadata": {
        "type": "BASE TABLE",
        "database": "ANALYTICS",
        "schema": "MART",
        "name": "FCT_ORDERS",
        "comment": null
      },
      "columns": {
        "ORDER_ID": {"name": "ORDER_ID", "index": 1, "type": "NUMBER", "comment": "Order ID"},
        "ORDER_TOTAL": {"name": "ORDER_TOTAL", "index": 2, "type": "NUMBER", "comment": "Order amount"}
      }
    }
  }
}
Note that fields such as resource_type and relation_name belong to manifest.json; in catalog.json the relation's database, schema, and name sit under metadata.
Many data catalogs (e.g., DataHub, Amundsen, Collibra, Alation) either natively support ingesting dbt artifacts or have community implementations. They typically upload or reference the manifest.json + catalog.json pair and update entities by combining model descriptions and owners (manifest) with column types and comments (catalog).
There are three broad integration styles: polling (periodically fetching from a file location), push (calling an API from CI/CD), and reconciliation (merging dbt information into existing schema metadata). Push is the easiest to start with; implement incremental updates by keying off each node's checksum in manifest.json, the artifacts' metadata.generated_at timestamp, or file modification times.
| Style | Benefits | Considerations |
|---|---|---|
| Push (API) | Immediate reflection; easy failure detection | Requires network and authorization configuration |
| Polling (storage) | Loosely coupled; easy to extend | Need to control latency and duplicate processing |
| Reconciliation (merge) | Maintains consistency with existing operations | Requires key design and deduplication logic |
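
Checksum-based incremental updates can be sketched as a diff between two manifest snapshots. The helper name `changed_nodes` and the sample data are hypothetical; the sketch assumes each manifest node exposes `checksum.checksum`. One caveat worth verifying in your own project: the checksum is derived from the node's source file, so a description-only change in schema.yml may not alter it.

```python
def changed_nodes(previous: dict, current: dict) -> dict:
    """Compare two manifest snapshots and classify nodes as added, removed, or modified."""
    def checksums(manifest: dict) -> dict:
        return {
            uid: node.get("checksum", {}).get("checksum")
            for uid, node in manifest.get("nodes", {}).items()
        }

    old, new = checksums(previous), checksums(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(uid for uid in new.keys() & old.keys() if new[uid] != old[uid]),
    }


# Hypothetical snapshots: one node edited, one dropped, one newly created.
prev_manifest = {"nodes": {
    "model.my_proj.fct_orders": {"checksum": {"name": "sha256", "checksum": "aaa"}},
    "model.my_proj.dim_old": {"checksum": {"name": "sha256", "checksum": "bbb"}},
}}
curr_manifest = {"nodes": {
    "model.my_proj.fct_orders": {"checksum": {"name": "sha256", "checksum": "ccc"}},
    "model.my_proj.dim_new": {"checksum": {"name": "sha256", "checksum": "ddd"}},
}}

diff = changed_nodes(prev_manifest, curr_manifest)
print(diff)
```

A push job can then send only the `added` and `modified` nodes and issue deletions for the `removed` ones.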
Data flow for external catalog integration
Reading catalog.json and pushing to an API (minimal example)
import json
import os

import requests

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", "target")
CAT_PATH = os.path.join(ARTIFACT_DIR, "catalog.json")
MANI_PATH = os.path.join(ARTIFACT_DIR, "manifest.json")

with open(CAT_PATH, "r", encoding="utf-8") as f:
    catalog = json.load(f)
with open(MANI_PATH, "r", encoding="utf-8") as f:
    manifest = json.load(f)

# Example: extract column definitions and send (adjust schema as needed)
payload = []
for node_id, node in catalog.get("nodes", {}).items():
    # relation_name is recorded in manifest.json, not catalog.json
    relation = manifest.get("nodes", {}).get(node_id, {}).get("relation_name")
    for c in node.get("columns", {}).values():
        payload.append({
            "node_id": node_id,
            "relation": relation,
            "column": c.get("name"),
            "type": c.get("type"),
            "comment": c.get("comment"),
        })

# Placeholder endpoint; replace with your catalog's API
resp = requests.post("https://catalog.example.com/api/dbt/columns", json=payload, timeout=30)
resp.raise_for_status()
print("uploaded:", len(payload))
catalog.json reflects the state of the database at the time docs generate ran. Align the update frequency with how often models change and what consumers need; at minimum, daily regeneration is recommended. In CI, run docs generate on pull requests to surface diffs, and finalize the production catalog.json after each production release, storing it separately per environment.
On the security side, decide whether comments on PII columns can be made public and filter them out before external integration. Because statistics are adapter-dependent and row_count and similar fields can be missing, pair catalog.json with other sources such as run_results.json for SLA monitoring.
| Operational item | Recommendation | Notes |
|---|---|---|
| Update frequency | Daily (or as needed when changes are frequent) | Keep CI minimal; store the finalized version for production |
| Storage location | Versioned in object storage | Organize by environment / date / commit ID |
| Disclosure policy | Strip or mask PII comments | Control via meta and tags; filter before integration |
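
Filtering before integration might look like the sketch below. It assumes PII columns are flagged with `meta.pii: true` in schema.yml (the same key used in the schema.yml example in this article) and that this flag appears under each column's `meta` in manifest.json. The helper name `strip_pii_comments` and the sample data are hypothetical.

```python
import copy


def strip_pii_comments(manifest: dict, catalog: dict) -> dict:
    """Return a copy of the catalog with comments removed for columns flagged meta.pii: true."""
    cleaned = copy.deepcopy(catalog)  # never mutate the artifact that was loaded from disk
    for unique_id, cat_node in cleaned.get("nodes", {}).items():
        man_cols = manifest.get("nodes", {}).get(unique_id, {}).get("columns", {})
        pii_columns = {
            name.lower()
            for name, col in man_cols.items()
            if col.get("meta", {}).get("pii") is True
        }
        for col_name, col in cat_node.get("columns", {}).items():
            if col_name.lower() in pii_columns:
                col["comment"] = None
    return cleaned


# Hypothetical sample: EMAIL is flagged as PII, ORDER_ID is not.
pii_manifest = {"nodes": {"model.my_proj.dim_customers": {"columns": {
    "email": {"meta": {"pii": True}},
    "order_id": {"meta": {"pii": False}},
}}}}
pii_catalog = {"nodes": {"model.my_proj.dim_customers": {"columns": {
    "EMAIL": {"name": "EMAIL", "type": "TEXT", "comment": "Customer e-mail address"},
    "ORDER_ID": {"name": "ORDER_ID", "type": "NUMBER", "comment": "Order ID"},
}}}}

safe_catalog = strip_pii_comments(pii_manifest, pii_catalog)
print(safe_catalog["nodes"]["model.my_proj.dim_customers"]["columns"])
```

Working on a deep copy keeps the stored artifact intact, so the unfiltered version can still be archived internally while only the cleaned copy is pushed outward.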
Artifact handling in CI/CD
Generating docs and storing artifacts in CI (Bash example)
# Switch connection target via environment variable
export DBT_TARGET=prod
export DBT_PROFILES_DIR=.
# Compute the date once so both commands use the same path
RUN_DATE="$(date +%F)"
OUT_DIR="artifacts/${DBT_TARGET}/${RUN_DATE}"
# Fetch dependencies and run (as needed)
dbt deps
# Generate (explicitly specify output path)
dbt docs generate --target "${DBT_TARGET}" --target-path "${OUT_DIR}"
# Upload artifacts (example command; adjust bucket and credentials)
aws s3 sync "${OUT_DIR}" "s3://my-bucket/dbt-artifacts/${DBT_TARGET}/${RUN_DATE}"
The Analytics Engineer exam frequently tests the roles of dbt documentation and artifacts. Make sure you can clearly explain the differences between catalog.json and manifest.json, the behavior of docs generate/serve, how dashboards are described via exposures, and the scope of test definitions.
Common pitfalls include: ephemeral models not appearing in catalog.json, confusing run_results.json with catalog.json, and mixing up docs blocks vs. description. Be ready to instantly say which piece of information lives in which file.
| Frequently asked theme | Key terms to remember | Commonly confused contrast |
|---|---|---|
| Roles of artifacts | catalog = physical, manifest = logical | catalog vs run_results |
| BI integration metadata | exposures | sources vs exposures |
| Documentation | schema.yml description / docs blocks | Comment (physical) vs description (logical) |
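
To make the catalog vs run_results contrast concrete, the sketch below reads execution status from a run_results-shaped dictionary. It assumes the artifact has a top-level `results` list whose entries carry `unique_id` and `status`; the helper name `summarize_run` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_run(run_results: dict) -> tuple[Counter, list[str]]:
    """Count result statuses and list the nodes that errored or failed."""
    results = run_results.get("results", [])
    counts = Counter(r.get("status") for r in results)
    failed = [r.get("unique_id") for r in results if r.get("status") in {"error", "fail"}]
    return counts, failed


# Hypothetical excerpt of run_results.json.
sample_run_results = {"results": [
    {"unique_id": "model.my_proj.fct_orders", "status": "success", "execution_time": 3.2},
    {"unique_id": "model.my_proj.dim_customers", "status": "error", "execution_time": 0.4},
    {"unique_id": "test.my_proj.not_null_fct_orders_order_id", "status": "pass", "execution_time": 0.1},
]}

status_counts, failed_nodes = summarize_run(sample_run_results)
print(dict(status_counts), failed_nodes)
```

Nothing in this summary says anything about column types or comments, which is exactly why run_results.json cannot stand in for catalog.json.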
Which question is answered by which file
Physical type? ----> catalog.json
Description / owner? -> manifest.json
Did the run succeed? -> run_results.json
Extract column types of a specific table from catalog.json with jq
TABLE_ID="model.my_proj.fct_orders"
jq -r --arg id "$TABLE_ID" \
  '.nodes[$id].columns | to_entries[] | "\(.value.name),\(.value.type)"' \
  target/catalog.json
Analytics Engineer
Question 1
You wrote model and column descriptions in schema.yml in dbt. You want to integrate column types and comments into an external data catalog. Which artifact is the most appropriate source?
Correct answer: A (catalog.json)
catalog.json aggregates the database's physical metadata (tables/columns, types, comments, etc.). run_results.json holds the most recent execution results, packages.yml declares dependent packages, and profiles.yml stores connection settings, so none of them fit this purpose.
When does catalog.json get updated? Does it stay current without running the models?
It is refreshed whenever you run dbt docs generate. As long as the adapter can fetch metadata, the latest physical information is collected for relations that already exist in the database — even if you did not run the models immediately beforehand. Relations that have not been created yet are not in scope.
Do ephemeral models and CTEs show up in catalog.json?
Ephemeral models do not create a physical relation, so they do not appear in catalog.json (they are still recorded as nodes in manifest.json). CTEs are likewise not intermediate physical objects and are out of scope.
Why are some of the statistics in catalog.json (such as row_count) null?
Statistic coverage is adapter-dependent, and fields that cannot be retrieved may be missing or null. That is by design, not a bug. For SLA monitoring or volume checks, complement catalog.json with run_results.json or separate queries.
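Reading a statistic defensively can be sketched as below. The sketch assumes each entry under a node's `stats` is an object with a `value` field, and that the stat key is `row_count`; key names differ by adapter, so treat both as assumptions. `get_stat` and the sample nodes are hypothetical.

```python
from typing import Any, Optional


def get_stat(catalog_node: dict, stat_id: str) -> Optional[Any]:
    """Return the stat's value, or None when the adapter did not report it."""
    stat = catalog_node.get("stats", {}).get(stat_id)
    if not stat:
        return None
    return stat.get("value")


# Hypothetical nodes: one adapter reports a row count, the other does not.
node_with_stats = {"stats": {"row_count": {"id": "row_count", "label": "Row Count", "value": 1200}}}
node_without_stats = {"stats": {"has_stats": {"id": "has_stats", "label": "Has Stats?", "value": False}}}

for label, node in [("with", node_with_stats), ("without", node_without_stats)]:
    rows = get_stat(node, "row_count")
    print(label, "unknown" if rows is None else rows)
```

Treating a missing value as "unknown" rather than zero avoids false volume alerts when the adapter simply does not expose the statistic.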
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.