dbt

dbt Layered Architecture: How to Correctly Split staging / intermediate / marts

2026-04-19
NicheeLab Editorial Team

dbt's layered architecture is a foundational principle for building a model graph that is resilient to change and easy to test, by separating source shaping, business logic consolidation, and analytical schemas.

This article balances the Analytics Engineer exam perspective with concrete implementation patterns that pay off in day-to-day operations, so you can draw the boundaries between staging / intermediate / marts without hesitation.

The Big Picture of Layered Architecture

dbt models have their execution order determined by a dependency graph based on ref. Clearly separating layers aligns the responsibilities of each model and makes test placement and the scope of change impact easier to reason about. In particular, adopting staging / intermediate / marts lets you decouple type casting and naming standardization, business logic consolidation, and analytics consumption optimization.

The Analytics Engineer exam asks about each layer's purpose, naming, grain, materialization choices, and the placement of tests and documentation. Keeping the following principles in mind helps you balance design consistency with execution efficiency.

  • One responsibility per model: do not mix source shaping, joining/aggregation, and analytical serving
  • Explicit grain: be able to articulate row uniqueness and the primary key at each layer
  • Stable API: marts focuses on being a contract surface for downstream consumers
  • Localized change: intermediate aggregations absorb upstream changes so they do not propagate downstream
  • Test focus: uniqueness/not-null upstream, business validation in intermediate and marts

The flow of dbt layered architecture

Raw Sources (apps, logs, etc.)staging (light shaping / type casting / naming)intermediate (joins / dedup / business grain)dims (marts)facts (marts)Consumers (BI, ML, QA)

Project structure example (excerpt)

# Directory
models/
  staging/
    sources.yml
    stg_customers.sql
    stg_orders.sql
  intermediate/
    int_orders_enriched.sql
    int_customers_current.sql
  marts/
    core/
      dim_customers.sql
      fct_orders.sql

# dbt_project.yml (unify per-layer materialization policy)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: table

Staging Layer: Normalizing Sources

Staging is the layer closest to raw data. Keep it to column-name normalization, type casting, and light filters or column expansion. Do not bring in business logic or aggregation. Its main purpose is to provide a "stable column shape and naming that downstream consumers can work with easily."

Views are the safe default for materialization. Consider tables only when the computation is heavy or the upstream source is slow. Defining basic tests like source uniqueness, not-null, and referential integrity here helps you catch anomalies early.

  • Naming convention: snake_case, prefix with stg_
  • Reference external sources via source() and standardize column names
  • Explicit type casting and unified time zone policy
  • Place uniqueness and not-null tests here
  • Default to views; only switch to tables when heavy

Example of stg_orders.sql and sources.yml

-- models/staging/stg_orders.sql
with src as (
  select * from {{ source('raw', 'orders') }}
)
select
  cast(order_id as varchar)       as order_id,
  cast(customer_id as varchar)    as customer_id,
  cast(order_ts as timestamp)     as order_timestamp,
  upper(status)                   as status,
  cast(total_amount as numeric(38,2)) as total_amount,
  updated_at
from src
where order_id is not null
;

# models/staging/sources.yml
version: 2
sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        description: Raw orders table
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 90, period: minute}
          error_after: {count: 180, period: minute}
        columns:
          - name: order_id
            tests: [not_null, unique]
          - name: customer_id
            tests: [not_null]
          - name: updated_at
            tests: [not_null]

Intermediate Layer: Consolidating Business Logic

Intermediate is the layer that joins multiple staging models and consolidates business logic such as deduplication, grain unification, and derived columns. Sitting just before marts, it functions as a buffer that absorbs the impact of changes.

For frequently updated tables, incremental materialization is effective. With an is_incremental() guard, you only ingest new and updated rows, shortening execution time. For deduplication, selecting the latest version via window functions is the stable approach.

  • Join multiple sources and lock in the grain
  • Deduplicate and pick the latest record
  • Add derived metrics and flags (e.g., shipping-inclusive amount, cancellation flag)
  • Use incremental to keep compute costs down
  • Provide an interface that easily absorbs schema changes on the marts side

int_orders_enriched.sql (incremental and deduplication)

-- models/intermediate/int_orders_enriched.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

with orders as (
  select * from {{ ref('stg_orders') }}
),
-- Reference the latest customer attributes from another model (e.g., int_customers_current)
customers as (
  select * from {{ ref('int_customers_current') }}
),
-- Take the latest version of duplicate orders
latest_orders as (
  select o.*
  from (
    select *,
           row_number() over(partition by order_id order by updated_at desc) as rn
    from orders
  ) o
  where o.rn = 1
),
joined as (
  select
    l.order_id,
    l.customer_id,
    c.customer_tier,
    l.order_timestamp,
    l.status,
    l.total_amount,
    case when l.status in ('CANCELLED','VOID') then 1 else 0 end as is_cancelled
  from latest_orders l
  left join customers c using (customer_id)
)
select *
from joined
{% if is_incremental() %}
  -- Ingest only recent updates (assumes updated_at exists)
  where joined.order_timestamp > (
    select coalesce(max(order_timestamp), '1900-01-01') from {{ this }}
  )
{% endif %}
;

Marts Layer: Providing the Analytics Schema

Marts is the layer that downstream dashboards and analytical queries reference directly. Split schemas by business domain (e.g., core, marketing, finance) and design dimensions (dim_) and facts (fct_) following the star schema principle. Prioritize schema stability and readability above all.

Facts have a clear grain (e.g., one row = one order), and dimensions provide the current form of attributes. To avoid double counting in aggregation, keep row-level facts in the fact tables as much as possible and clearly separate the roles between dbt and dashboard-side aggregation.

  • Naming convention: dim_*, fct_*
  • Marts plays an API-like role. Lock columns and types as a contract
  • Facts provide measurements, dimensions provide descriptive attributes
  • Split into sub-folders by domain (core, marketing, etc.)
  • Visualize downstream assets and track dependencies via exposures

dim_customers / fct_orders (excerpt)

-- models/marts/core/dim_customers.sql
select
  customer_id,
  customer_name,
  customer_tier,
  signup_date
from {{ ref('int_customers_current') }}
;

-- models/marts/core/fct_orders.sql
select
  order_id,
  customer_id,
  order_timestamp,
  status,
  total_amount,
  is_cancelled
from {{ ref('int_orders_enriched') }}
where status not in ('TEST')
;

Placement of Tests, Contracts, and Documentation

dbt tests come in two flavors: generic (not_null, unique, relationships, etc.) and singular (arbitrary SQL). Guarantee key integrity at staging, and validate business rules between intermediate and marts. Data contracts (model contracts) lock columns, types, and required flags to prevent downstream breakage.

Build documentation with descriptions and docs blocks, and make it browsable via docs generate. Defining exposures lets you visualize the dependencies between dashboards or reports and your models.

  • staging: enforce not_null, unique, and relationships
  • intermediate: verify dedup and filter consistency with singular tests
  • marts: lock columns and types with contracts (contract: true)
  • Make actual usage and ownership clear with docs and exposures
  • On failures, work backward from upstream to isolate the cause

schema.yml (example with tests, contracts, exposures)

version: 2
models:
  - name: fct_orders
    description: Orders fact. One row = one order
    config:
      contract: true
    columns:
      - name: order_id
        description: Order ID (primary key)
        data_type: varchar
        tests: [not_null, unique]
      - name: customer_id
        data_type: varchar
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: total_amount
        data_type: numeric
        tests:
          - not_null
          - accepted_values:
              values:
                - "{{ 0 }}:inf"  # Non-negative can be enforced via a custom test

exposures:
  - name: sales_dashboard
    type: dashboard
    maturity: high
    url: https://bi.example.com/dashboards/sales
    owners:
      - name: Analytics Team
        email: [email protected]
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')

Execution Strategy and Performance Optimization

dbt executes in parallel based on a DAG. Using selectors (--select, --state) re-runs only the scope affected by changes, shortening the dev-to-prod cycle. Decide a default materialization per layer to avoid unnecessary recomputation.

For incremental builds, clarify how primary keys and update timestamps are handled, and share full-refresh operational guidelines with the team. Warehouse-specific optimizations like clustering and partitioning are safer when managed explicitly via dbt model configuration.

  • Minimal execution with --select + standardized tags
  • Differential builds via state:modified
  • Schedule full refreshes deliberately (night-time or low-load windows)
  • Lock per-layer default materialization in dbt_project.yml
  • Split folders by domain to increase parallelism
LayerMain purposeGrainNaming convention
stagingSource shaping, type casting, naming standardizationSame as source (primary key = natural key)stg_*
intermediateJoins, deduplication, business logic consolidationRow grain at the business levelint_*
martsStable API for downstream (analytics schema)dim: current entity / fct: one fact per rowdim_* / fct_*

Execution commands and selector examples

# Only changed models and their downstream
$ dbt build --select state:modified+ --state target/previous_run

# Per-layer execution (folder selector)
$ dbt build --select models/staging
$ dbt build --select models/intermediate
$ dbt build --select models/marts

# Re-run only a specific domain
$ dbt build --select marts.core+

# Full refresh (planned)
$ dbt build --full-refresh --select int_orders_enriched fct_orders

Check Your Understanding

Analytics Engineer

問題 1

Which of the following is the most appropriate division of responsibilities among dbt's staging / intermediate / marts layers?

  1. A. staging handles type casting and naming standardization, intermediate handles joins and deduplication, and marts provides a stable analytics schema
  2. B. staging aggregates facts, intermediate ingests sources, and marts normalizes column names
  3. C. staging serves dashboards, intermediate forbids table creation, and marts uses views only
  4. D. staging and intermediate are identical, and only marts needs to be separated

正解: A

staging handles minimal shaping (types and naming), intermediate consolidates business logic (joins, deduplication, grain), and marts provides a stable schema that downstream queries reference directly. This matches dbt's recommended division of responsibilities.

Frequently Asked Questions

Is it okay to perform aggregations or joins in the staging layer?

Generally, avoid it. Keep staging limited to type casting and naming standardization, and consolidate joins, deduplication, and derived metrics in the intermediate layer. This achieves separation of concerns and localizes the impact of changes.

Should I materialize as views or tables?

The standard defaults are staging=view, intermediate=table (or incremental for frequently updated tables), and marts=table — this combination works well in practice. Even if you make exceptions based on cost, execution time, or source size, locking in per-layer defaults in dbt_project.yml keeps operations stable.

Where should I handle dimension history (SCD)?

Manage change history itself with dbt snapshots (a separate feature), shape it into the current form in the intermediate layer, and finish with a consumer-friendly column layout in marts dim_ tables. If you expose history tables directly in marts, use clear naming and descriptions to prevent misuse by downstream consumers.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
dbt

dbt Models: SQL-Defined Transformation Units (2026)

Model fundamentals — SELECT-based definitions, naming, refs,...

dbt

dbt Analytics Engineering Exam: Complete Guide (2026)

Pass the AE Certification — scope, weighting, sample questio...

dbt

dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)

Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...

dbt

dbt Project Structure: models/seeds/macros Layout (2026)

Recommended dbt project layout — models, seeds, macros, snap...

dbt

dbt_project.yml Explained: Every Config (2026)

Every dbt_project.yml setting that matters — paths, vars, ma...

Browse all dbt articles (101)
© 2026 NicheeLab All rights reserved.