dbt

Introduction to dbt Incremental Materialization: Basics and Exam Prep

2026-04-19
NicheeLab Editorial Team

dbt's incremental materialization is a mechanism that reduces processing time and cost by writing only new or updated records without reprocessing existing data. The concept is simple, but failure to design unique keys, change detection logic, strategy selection (merge/append/insert_overwrite), and schema change handling consistently can lead to incidents.

Based on stable features in the official dbt documentation, this article covers the basics of incrementals, implementation differences across major warehouses (Snowflake/BigQuery/Databricks/Spark/Redshift), and exam-prone points all at once.

Incremental Materialization Principles and Prerequisites

Incremental materializes a model as a table and reflects only new or updated data to the target table in subsequent runs. The default behavior is a full build on the first run and incremental processing thereafter.

Change detection is performed by conditions within the is_incremental() block in the model SQL or by adapter-provided strategies (merge or insert_overwrite). Defining unique keys and change detection columns is key to safe operation.

  • First run: always full load. Table is created per the specified strategy
  • Subsequent runs: only the delta is applied per is_incremental() conditions and strategy
  • To rebuild: explicitly use dbt run --full-refresh
MaterializationStorage usageRebuild time estimate
viewZero (computed at query time)Fast (always query-evaluated)
tableLargeRequires full recomputation; heavy
incrementalMediumShort, delta only
ephemeralZeroInlined upstream (depends on downstream load)

Conceptual flow of incremental builds

SourcesStaging(model A)Incremental(model B)Target TableMerge / Append / Insert_OverwriteSources → Staging → Incremental → Target Table (Merge/Append/Insert_Overwrite)

Minimal example: incremental model template

{{ config(
    materialized='incremental'
) }}

with src as (
  select * from {{ ref('stg_orders') }}
)

select * from src
{% if is_incremental() %}
  -- 2回目以降にだけ適用される差分条件をここに記述
  -- 例: updated_atがターゲットの最大値以降
  where updated_at >= (
    select coalesce(max(updated_at), '1970-01-01') from {{ this }}
  )
{% endif %}

Implementation Patterns: Designing is_incremental and unique_key

There are two common approaches to change detection. One uses a timestamp watermark (max updated_at) to filter; the other uses a unique key to merge (UPSERT) conflicting rows. The former is simple and fast; the latter is robust against duplicates and late-arriving data.

By setting unique_key and choosing incremental_strategy='merge' in dbt, supported adapters auto-generate a MERGE-equivalent UPSERT. Combining the where condition of is_incremental() with the MERGE optimizes the balance between read volume and consistency.

  • Watermark approach: reference the max value of updated_at or ingested_at from the target
  • Key approach: specify unique_key and UPSERT via MERGE
  • Hybrid: read only recent data via where, finalize consistency via MERGE
ApproachProsCaveats
Watermark (timestamp)Simple, fast, reduced scanRisk of missing data from clock drift or late arrivals
Unique key + MERGERobust against duplicates and update overwritesCannot work with undefined or unstable keys
HybridBalances I/O reduction with consistencyfull-refresh is the safety net if where misses data

is_incremental branching diagram

is_incremental()?Apply delta filterYesInitial full scanNoWrite

UPSERT template using MERGE

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

with src as (
  select * from {{ ref('stg_orders') }}
  {% if is_incremental() %}
    where updated_at >= (
      select coalesce(max(updated_at), '1970-01-01') from {{ this }}
    )
  {% endif %}
)

select * from src

Strategy-Specific Behavior: merge / append / insert_overwrite

Choose incremental_strategy based on the capabilities of your adapter (warehouse). Support varies by version, so check your environment's adapter documentation. The following are stable, well-known trends.

merge requires a unique key and handles updates and inserts simultaneously. append is always append-only, and deduplication must be handled by upstream SQL. insert_overwrite can replace data per partition and is commonly used with BigQuery and Spark.

  • Snowflake/Redshift: primarily merge and append
  • BigQuery: in addition to merge, insert_overwrite can be combined with partitioning
  • Databricks/Spark: merge is mainstream; insert_overwrite (partitions) has also become an option recently
StrategyMain supported enginesRequirements/Caveats
mergeSnowflake, BigQuery, Databricks/Spark, Redshiftunique_key required. Handles updates and inserts together
appendAllYou must deduplicate yourself. No updates (no overwrite)
insert_overwriteBigQuery, Databricks/Sparkpartition_by is a prerequisite. Replaces the target partition

Rough flow for choosing a strategy

Have a unique key?YesmergeNoAppend-only OK?Yes: appendNo / daily replacement: insert_overwrite

Configuration examples per strategy

-- Snowflake: MERGE
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}

-- BigQuery: INSERT_OVERWRITE(日次分のみ置換)
{{ config(
  materialized='incremental',
  incremental_strategy='insert_overwrite',
  partition_by={'field': 'ds', 'data_type': 'date'},
  cluster_by=['customer_id']
) }}

-- Spark/Databricks: MERGE
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}

Schema Changes and Re-runs: on_schema_change and full-refresh

In production, column additions and type changes can occur. dbt lets you control the default behavior via on_schema_change. Support depends on the adapter, so defaulting to the safer option is the safe bet.

For breaking changes (type narrowing, column drops, etc.) or major logic changes, rebuilding with dbt run --full-refresh is the reliable choice.

  • ignore: Ignore schema diffs (columns are not added)
  • append_new_columns: Add compatible new columns (effective on supported adapters)
  • sync_all_columns: Sync existing column renames and more (watch for support)
  • fail: Fail on any diff so you notice
SettingBehavior summaryUse case
ignoreDo nothingWhen you want to be conservative and prioritize data quality
append_new_columnsAdd new columnsWhen column additions are frequent and backward-compatible
sync_all_columnsSync diffsWhen you want to automate schema tracking (requires adapter support)
failFail on diffWhen you want CI to detect and force a fix

Branching during schema evolution

Schema diff?YesApply on_schema_change ruleRun / failNoNormal incremental processing

Examples of on_schema_change and full-refresh

{{ config(
  materialized='incremental',
  incremental_strategy='merge',
  unique_key='id',
  on_schema_change='append_new_columns'
) }}

-- 破壊的変更時はコマンドで再作成
-- dbt run --select my_model --full-refresh

Performance Design: Partitioning, Clustering, Watermark Management

The main goal of incrementals is I/O reduction. Combined with the warehouse's partitioning/clustering features, you can further narrow the read range. insert_overwrite in particular should be considered together with partition design.

A common practice is to read with a watermark that has a slight buffer to account for late arrivals. Tune to your environment, for example by rolling back a few hours to a day from max(updated_at).

  • Reduce scan with partition_by/cluster_by
  • Safety margin: roll back slightly from max(updated_at)
  • Hybrid (delta where + MERGE) balances I/O and consistency
EngineRecommended companion settingsNotes
BigQuerypartition_by(ds), cluster_by(user_id)Pairs well with insert_overwrite
SnowflakeClustering keys (as needed)Configure based on cost tradeoffs
Databricks/SparkpartitionBy, Z-Ordering (environment-dependent)Narrow the target range of MERGE

Watermark management diagram

Time ----|----|----|---->
          ^max_in_target
       <---- safety window ----
読み取り範囲 = [max_in_target - safety, now]

Example of a watermark with safety margin (using a macro)

{% macro watermark(model, col, safety_hours=6) %}
select datetime_sub(max({{ col }}), interval {{ safety_hours }} hour) as wm
from {{ model }}
{% endmacro %}

-- モデル内
with w as (
  {{ watermark(this, 'updated_at', 6) }}
), src as (
  select * from {{ ref('stg_orders') }} s
  {% if is_incremental() %}
  where s.updated_at >= (select coalesce(wm, '1970-01-01') from w)
  {% endif %}
)
select * from src

Exam Prep Checklist and Pitfalls

The Analytics Engineer exam often targets incremental basics, the necessity of unique_key, on_schema_change, full-refresh, and the choice of strategy. Written questions ask which strategy to choose and what settings are missing.

In practice, unintended duplicates, missing data, and silent schema evolution are typical failure modes. It is safest to combine CI verification (including full-refresh) with data profile monitoring.

  • Are you choosing merge without a unique_key?
  • Did you forget partition_by with insert_overwrite?
  • Are diffs being silently dropped by the default on_schema_change?
  • Are you forcing through changes that need full-refresh as incremental?
KeywordCommon wrong answer / pitfallCorrect guideline
mergeAssuming unique_key is not requiredmerge requires a stable unique_key
insert_overwriteMisunderstanding that it always works on any tablePartition columns are a prerequisite. Replaces only the target partition
on_schema_changeAssuming sync_all_columns is always safeVerify adapter support. Default to the safer side

Mini decision tree for strategy selection

Need to overwrite updates?YesHave a unique_key?Yes: mergeNo: consider introducing a keyNoAppend-only: appendDaily replacement: insert_overwrite

Example verification and operation commands

# 差分のみ
 dbt run --select tag:incremental
# フル再作成(影響小さい単位で)
 dbt run --select my_model --full-refresh
# テストと組み合わせ
 dbt test --select state:modified+

Check Your Understanding

Analytics Engineer

問題 1

You want to keep past data immutable and replace only today's data in a BigQuery fact table with daily partitions (ds column, DATE). Which recommended setting minimizes processing time while avoiding duplicates and missing data?

  1. materialized='incremental', incremental_strategy='insert_overwrite', partition_by={'field': 'ds', 'data_type': 'date'}
  2. materialized='incremental', incremental_strategy='append' only
  3. materialized='incremental', incremental_strategy='merge', unique_key not set
  4. materialized='table' with full rebuild every time

正解: A

For the requirement of replacing only today's data in daily partitions, BigQuery's insert_overwrite fits. A partition_by definition is required. append is weak against duplicates and same-day updates, merge without unique_key is inappropriate, and table full rebuild is unnecessarily heavy.

Frequently Asked Questions

Is unique_key always required?

No. It is required for the merge strategy, but not for designs that only need append or insert_overwrite (immutable/append-only data or partition replacement). However, if you need to overwrite updates, you must define a stable unique_key.

How can I avoid missing late-arriving data?

Pull the watermark back a few hours to a day from max(updated_at), or use a hybrid approach that reads only recent data with a where clause and reconciles via merge. If misses are suspected, temporarily widen the range or run a full-refresh to restore consistency.

What is the safest default for on_schema_change?

It depends on your operations policy, but starting conservatively with fail or ignore and explicitly switching to append_new_columns when adding columns is the safe bet. sync_all_columns is convenient but should only be enabled after verifying adapter support and the nature of the changes.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
dbt

dbt Models: SQL-Defined Transformation Units (2026)

Model fundamentals — SELECT-based definitions, naming, refs,...

dbt

dbt Analytics Engineering Exam: Complete Guide (2026)

Pass the AE Certification — scope, weighting, sample questio...

dbt

dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)

Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...

dbt

dbt Project Structure: models/seeds/macros Layout (2026)

Recommended dbt project layout — models, seeds, macros, snap...

dbt

dbt_project.yml Explained: Every Config (2026)

Every dbt_project.yml setting that matters — paths, vars, ma...

Browse all dbt articles (101)
© 2026 NicheeLab All rights reserved.