dbt's incremental materialization is a mechanism that reduces processing time and cost by writing only new or updated records without reprocessing existing data. The concept is simple, but failure to design unique keys, change detection logic, strategy selection (merge/append/insert_overwrite), and schema change handling consistently can lead to incidents.
Based on stable features in the official dbt documentation, this article covers the basics of incrementals, implementation differences across major warehouses (Snowflake/BigQuery/Databricks/Spark/Redshift), and exam-prone points all at once.
Incremental materializes a model as a table and reflects only new or updated data to the target table in subsequent runs. The default behavior is a full build on the first run and incremental processing thereafter.
Change detection is performed by conditions within the is_incremental() block in the model SQL or by adapter-provided strategies (merge or insert_overwrite). Defining unique keys and change detection columns is key to safe operation.
| Materialization | Storage usage | Rebuild time estimate |
|---|---|---|
| view | Zero (computed at query time) | Fast (always query-evaluated) |
| table | Large | Requires full recomputation; heavy |
| incremental | Medium | Short, delta only |
| ephemeral | Zero | Inlined upstream (depends on downstream load) |
Conceptual flow of incremental builds
Minimal example: incremental model template
{{ config(
materialized='incremental'
) }}
with src as (
select * from {{ ref('stg_orders') }}
)
select * from src
{% if is_incremental() %}
-- 2回目以降にだけ適用される差分条件をここに記述
-- 例: updated_atがターゲットの最大値以降
where updated_at >= (
select coalesce(max(updated_at), '1970-01-01') from {{ this }}
)
{% endif %}There are two common approaches to change detection. One uses a timestamp watermark (max updated_at) to filter; the other uses a unique key to merge (UPSERT) conflicting rows. The former is simple and fast; the latter is robust against duplicates and late-arriving data.
By setting unique_key and choosing incremental_strategy='merge' in dbt, supported adapters auto-generate a MERGE-equivalent UPSERT. Combining the where condition of is_incremental() with the MERGE optimizes the balance between read volume and consistency.
| Approach | Pros | Caveats |
|---|---|---|
| Watermark (timestamp) | Simple, fast, reduced scan | Risk of missing data from clock drift or late arrivals |
| Unique key + MERGE | Robust against duplicates and update overwrites | Cannot work with undefined or unstable keys |
| Hybrid | Balances I/O reduction with consistency | full-refresh is the safety net if where misses data |
is_incremental branching diagram
UPSERT template using MERGE
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='order_id'
) }}
with src as (
select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at >= (
select coalesce(max(updated_at), '1970-01-01') from {{ this }}
)
{% endif %}
)
select * from srcChoose incremental_strategy based on the capabilities of your adapter (warehouse). Support varies by version, so check your environment's adapter documentation. The following are stable, well-known trends.
merge requires a unique key and handles updates and inserts simultaneously. append is always append-only, and deduplication must be handled by upstream SQL. insert_overwrite can replace data per partition and is commonly used with BigQuery and Spark.
| Strategy | Main supported engines | Requirements/Caveats |
|---|---|---|
| merge | Snowflake, BigQuery, Databricks/Spark, Redshift | unique_key required. Handles updates and inserts together |
| append | All | You must deduplicate yourself. No updates (no overwrite) |
| insert_overwrite | BigQuery, Databricks/Spark | partition_by is a prerequisite. Replaces the target partition |
Rough flow for choosing a strategy
Configuration examples per strategy
-- Snowflake: MERGE
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}
-- BigQuery: INSERT_OVERWRITE(日次分のみ置換)
{{ config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={'field': 'ds', 'data_type': 'date'},
cluster_by=['customer_id']
) }}
-- Spark/Databricks: MERGE
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}In production, column additions and type changes can occur. dbt lets you control the default behavior via on_schema_change. Support depends on the adapter, so defaulting to the safer option is the safe bet.
For breaking changes (type narrowing, column drops, etc.) or major logic changes, rebuilding with dbt run --full-refresh is the reliable choice.
| Setting | Behavior summary | Use case |
|---|---|---|
| ignore | Do nothing | When you want to be conservative and prioritize data quality |
| append_new_columns | Add new columns | When column additions are frequent and backward-compatible |
| sync_all_columns | Sync diffs | When you want to automate schema tracking (requires adapter support) |
| fail | Fail on diff | When you want CI to detect and force a fix |
Branching during schema evolution
Examples of on_schema_change and full-refresh
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='id',
on_schema_change='append_new_columns'
) }}
-- 破壊的変更時はコマンドで再作成
-- dbt run --select my_model --full-refreshThe main goal of incrementals is I/O reduction. Combined with the warehouse's partitioning/clustering features, you can further narrow the read range. insert_overwrite in particular should be considered together with partition design.
A common practice is to read with a watermark that has a slight buffer to account for late arrivals. Tune to your environment, for example by rolling back a few hours to a day from max(updated_at).
| Engine | Recommended companion settings | Notes |
|---|---|---|
| BigQuery | partition_by(ds), cluster_by(user_id) | Pairs well with insert_overwrite |
| Snowflake | Clustering keys (as needed) | Configure based on cost tradeoffs |
| Databricks/Spark | partitionBy, Z-Ordering (environment-dependent) | Narrow the target range of MERGE |
Watermark management diagram
Time ----|----|----|---->
^max_in_target
<---- safety window ----
読み取り範囲 = [max_in_target - safety, now]Example of a watermark with safety margin (using a macro)
{% macro watermark(model, col, safety_hours=6) %}
select datetime_sub(max({{ col }}), interval {{ safety_hours }} hour) as wm
from {{ model }}
{% endmacro %}
-- モデル内
with w as (
{{ watermark(this, 'updated_at', 6) }}
), src as (
select * from {{ ref('stg_orders') }} s
{% if is_incremental() %}
where s.updated_at >= (select coalesce(wm, '1970-01-01') from w)
{% endif %}
)
select * from srcThe Analytics Engineer exam often targets incremental basics, the necessity of unique_key, on_schema_change, full-refresh, and the choice of strategy. Written questions ask which strategy to choose and what settings are missing.
In practice, unintended duplicates, missing data, and silent schema evolution are typical failure modes. It is safest to combine CI verification (including full-refresh) with data profile monitoring.
| Keyword | Common wrong answer / pitfall | Correct guideline |
|---|---|---|
| merge | Assuming unique_key is not required | merge requires a stable unique_key |
| insert_overwrite | Misunderstanding that it always works on any table | Partition columns are a prerequisite. Replaces only the target partition |
| on_schema_change | Assuming sync_all_columns is always safe | Verify adapter support. Default to the safer side |
Mini decision tree for strategy selection
Example verification and operation commands
# 差分のみ
dbt run --select tag:incremental
# フル再作成(影響小さい単位で)
dbt run --select my_model --full-refresh
# テストと組み合わせ
dbt test --select state:modified+Analytics Engineer
問題 1
You want to keep past data immutable and replace only today's data in a BigQuery fact table with daily partitions (ds column, DATE). Which recommended setting minimizes processing time while avoiding duplicates and missing data?
正解: A
For the requirement of replacing only today's data in daily partitions, BigQuery's insert_overwrite fits. A partition_by definition is required. append is weak against duplicates and same-day updates, merge without unique_key is inappropriate, and table full rebuild is unnecessarily heavy.
Is unique_key always required?
No. It is required for the merge strategy, but not for designs that only need append or insert_overwrite (immutable/append-only data or partition replacement). However, if you need to overwrite updates, you must define a stable unique_key.
How can I avoid missing late-arriving data?
Pull the watermark back a few hours to a day from max(updated_at), or use a hybrid approach that reads only recent data with a where clause and reconciles via merge. If misses are suspected, temporarily widen the range or run a full-refresh to restore consistency.
What is the safest default for on_schema_change?
It depends on your operations policy, but starting conservatively with fail or ignore and explicitly switching to append_new_columns when adding columns is the safe bet. sync_all_columns is convenient but should only be enabled after verifying adapter support and the nature of the changes.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
dbt Models: SQL-Defined Transformation Units (2026)
Model fundamentals — SELECT-based definitions, naming, refs,...
dbt Analytics Engineering Exam: Complete Guide (2026)
Pass the AE Certification — scope, weighting, sample questio...
dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)
Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...
dbt Project Structure: models/seeds/macros Layout (2026)
Recommended dbt project layout — models, seeds, macros, snap...
dbt_project.yml Explained: Every Config (2026)
Every dbt_project.yml setting that matters — paths, vars, ma...