incremental + merge is the core dbt feature for delivering the standard upsert (update + insert) pattern on a data warehouse.
This article covers unique_key design, dedup, safe watermarking, performance, and schema-change handling — the exam-favorite topics together with the practical know-how you need on the job.
Setting incremental_strategy=merge on a dbt incremental model runs an upsert against the target table: update on key match, insert on no match. This avoids full reloads and lets you efficiently apply only the newly arrived data.
A working merge requires a properly specified unique key and a correct incremental selection. From an exam perspective, you should be able to articulate the risks of a missing or duplicated unique_key (non-deterministic updates, MERGE errors).
| Strategy | Applies updates | Handles deletes | unique_key required |
|---|---|---|---|
| merge | Yes (update on key match) | Generally no (prefer soft delete) | Required |
| append | No (always appends) | No | Not required |
| insert_overwrite (partition) | Per-partition replace | Yes (targeted partitions are replaced) | Not required |
| delete+insert | Yes (delete target rows, then reinsert) | Yes, within the targeted range | Required |
UPSERT flow (incremental + merge)
Minimal merge incremental configuration
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='id'
) }}
with src as (
select * from {{ source('app', 'orders') }}
{% if is_incremental() %}
where updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
{% endif %}
)
select * from srcA clean unique_key is the key to a successful merge. If business keys are unstable, consider introducing a surrogate key, and assume the source delivers duplicates so that upstream dedup is non-negotiable.
On the exam, when asked what happens if the unique_key duplicates, the expected answer is to use a window function upstream to deterministically pick a single row (e.g., most recent wins).
| Key type | Strengths | Watch-outs |
|---|---|---|
| Business key (natural key) | Easy for the business to understand | Fragile when values change or get reused |
| Surrogate key | Stable with few collisions | Requires generation and sync machinery |
| Composite key | Preserves the existing grain | Adding columns complicates queries and tests |
Dedup flow
Use a window function to collapse duplicates to one row
with staged as (
select *,
row_number() over (
partition by id
order by updated_at desc
) as rn
from {{ source('app', 'orders') }}
), dedup as (
select * from staged where rn = 1
)
select * from dedupTo absorb late-arriving data and timestamp jitter, a safer choice than a plain max(updated_at) is a lookback threshold that rewinds slightly. Rewind a fixed window from the target's max updated_at and re-evaluate source rows in that range.
This approach can pull in some duplicates, so upstream dedup plus a unique_key-driven merge is required.
| Approach | Benefits | Risks / costs |
|---|---|---|
| Simple MAX approach | Fastest with the smallest scan | May miss late-arriving rows |
| Combined with lookback | Resilient to lateness and robust | Adds some reprocessing and duplicate evaluation |
| Source-side change history table | Fully reproducible with clear deltas | Requires history production and storage cost |
Lookback concept
Complete example: incremental + merge with lookback
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='id'
) }}
{% set lookback_hours = 24 %}
with base as (
select * from {{ source('app', 'orders') }}
), window as (
select coalesce(
dateadd('hour', -{{ lookback_hours }}, max(updated_at)),
cast('1970-01-01' as timestamp)
) as watermark
from {{ this }}
), filtered as (
select b.*
from base b
{% if is_incremental() %}
where b.updated_at >= (select watermark from window)
{% endif %}
), dedup as (
select *, row_number() over (partition by id order by updated_at desc) as rn
from filtered
)
select * from dedup where rn = 1On large tables, the physical design (partitions and clusters) and incremental filtering directly drive query cost. Use dbt's config to apply the right settings per adapter and minimize the scanned range.
incremental_predicates lets you specify extra predicates applied to both target and source, on adapters that support it. Check your adapter's documentation for support.
| Optimization | Benefit | Watch-out |
|---|---|---|
| partition_by (date) | Reduces scans and pairs well with insert_overwrite | Watch out for adapter-specific syntax differences |
| cluster_by / sort key | Speeds up join and MERGE match lookup | Too many columns increases write cost |
| incremental_predicates | Narrows the target scope further | Verify adapter support |
Pruning illustration
Optimization example (adapter-dependent)
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='id',
partition_by={'field': 'order_date'}, -- e.g. BigQuery/Spark family
cluster_by=['id'], -- e.g. BigQuery/Snowflake clustering equivalents
incremental_predicates=["order_date >= dateadd('day', -7, current_date)"]
) }}
select * from {{ ref('stg_orders_clean') }}
{% if is_incremental() %}
where order_date >= dateadd('day', -7, current_date)
{% endif %}In incremental models, on_schema_change controls behavior when the schema changes. append_new_columns adds new columns, fail aborts on mismatch, and sync_all_columns synchronizes column differences on supported adapters.
Handling actual deletes is not always covered by the default merge behavior. In practice, a soft delete with an is_deleted flag is safer and easier to keep historically consistent. If you must hard-delete, make precise impact scoping and tests mandatory.
| on_schema_change | Behavior | Use case |
|---|---|---|
| ignore | Ignore (continue executing despite differences) | Short-term workaround; fix it later |
| append_new_columns | Add new columns | Operations where only column additions are expected |
| fail | Fail when differences are detected | Environments that require strict schema control |
| sync_all_columns | Synchronize column differences (supported adapters only) | When you want to absorb column changes in both directions |
Soft delete flow
Upsert logic that includes soft delete (flag propagation)
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='id',
on_schema_change='append_new_columns'
) }}
with src as (
select id,
updated_at,
status,
case when status = 'deleted' then true else false end as is_deleted
from {{ ref('stg_orders_clean') }}
{% if is_incremental() %}
where updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
{% endif %}
)
select * from srcThe exam frequently asks about incremental fundamentals (unique_key, the role of is_incremental, and what full_refresh means) and when merge is the right choice. Be ready to explain the on_schema_change options, how to write dedup logic, and safe watermarking strategies.
Attaching tests (not_null/unique and generic tests) to models to guarantee data quality is another commonly evaluated point.
| Common topic | Typical wrong answer | Key to the right answer |
|---|---|---|
| Why unique_key matters | Updates work with append too | merge updates on key match; append only appends |
| Watermark | Use max(created_at) | Use updated_at plus a lookback to absorb lateness |
| Schema-change handling | Always ignore | Choose append_new_columns / fail / sync_all_columns based on requirements |
Exam imagery
Minimal quality-test example (schema.yml excerpt)
version: 2
models:
- name: dim_orders
columns:
- name: id
tests:
- not_null
- unique
- name: updated_at
tests:
- not_nullAnalytics Engineer
問題 1
When using incremental_strategy=merge in a dbt incremental model, which design most correctly reflects both updates and new inserts?
正解: C
merge requires a unique_key, and multiple candidates for the same key must be deterministically reduced to one upstream. Combining updated_at with a lookback prevents missed late arrivals. append does not apply updates, and on_schema_change governs schema-difference behavior — it does not resolve key duplicates.
Does the merge strategy automatically propagate deletes (rows missing from the source)?
A standard upsert does not propagate deletes automatically. When auditability is required, the common approach is a soft delete reflected via a flag such as is_deleted. If you truly need hard deletes, consider a tightly scoped delete+insert or partition-level insert_overwrite (supported adapters only).
How should configuration differences across BigQuery, Spark, Snowflake, and Databricks be handled?
Options like partition_by and insert_overwrite are adapter-specific, so follow the documentation for the adapter you use. In dbt models, lock in the shared parts (unique_key, is_incremental, dedup) and branch the physical design via config or manage it per environment for safety.
When should you run a full_refresh?
Run it for major schema changes, fixing inconsistent watermarks, or rebuilding history. Adding --full-refresh to a build run rebuilds the target incremental table from scratch.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
dbt Models: SQL-Defined Transformation Units (2026)
Model fundamentals — SELECT-based definitions, naming, refs,...
dbt Analytics Engineering Exam: Complete Guide (2026)
Pass the AE Certification — scope, weighting, sample questio...
dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)
Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...
dbt Project Structure: models/seeds/macros Layout (2026)
Recommended dbt project layout — models, seeds, macros, snap...
dbt_project.yml Explained: Every Config (2026)
Every dbt_project.yml setting that matters — paths, vars, ma...