Incremental merge Strategy: Upsert Pattern (2026)

incremental + merge is the core dbt feature for delivering the standard upsert (update + insert) pattern on a data warehouse.

This article covers unique_key design, dedup, safe watermarking, performance, and schema-change handling — the exam-favorite topics together with the practical know-how you need on the job.

Big picture of the merge strategy and when to use it

Setting incremental_strategy=merge on a dbt incremental model runs an upsert against the target table: update on key match, insert on no match. This avoids full reloads and lets you efficiently apply only the newly arrived data.

A working merge requires a properly specified unique key and a correct incremental selection. From an exam perspective, you should be able to articulate the risks of a missing or duplicated unique_key (non-deterministic updates, MERGE errors).

The unique_key must be logically unique. Always dedupe upstream.
Use is_incremental() to narrow the load. In practice, add a lookback to absorb late-arriving data.
On full_refresh the table is rebuilt. Declare index and cluster settings explicitly in the model config.

Strategy	Applies updates	Handles deletes	unique_key required
merge	Yes (update on key match)	Generally no (prefer soft delete)	Required
append	No (always appends)	No	Not required
insert_overwrite (partition)	Per-partition replace	Yes (targeted partitions are replaced)	Not required
delete+insert	Yes (delete target rows, then reinsert)	Yes, within the targeted range	Required

UPSERT flow (incremental + merge)

Minimal merge incremental configuration

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

with src as (
  select * from {{ source('app', 'orders') }}
  {% if is_incremental() %}
    where updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
  {% endif %}
)

select * from src

unique_key design and dedup in practice

A clean unique_key is the key to a successful merge. If business keys are unstable, consider introducing a surrogate key, and assume the source delivers duplicates so that upstream dedup is non-negotiable.

On the exam, when asked what happens if the unique_key duplicates, the expected answer is to use a window function upstream to deterministically pick a single row (e.g., most recent wins).

When business keys can change, introduce a stable surrogate key.
Pick exactly one row per key deterministically using update timestamps or versions.
Null keys are not unique, so always attach a not_null test.

Key type	Strengths	Watch-outs
Business key (natural key)	Easy for the business to understand	Fragile when values change or get reused
Surrogate key	Stable with few collisions	Requires generation and sync machinery
Composite key	Preserves the existing grain	Adding columns complicates queries and tests

Dedup flow

Use a window function to collapse duplicates to one row

with staged as (
  select *,
         row_number() over (
           partition by id
           order by updated_at desc
         ) as rn
  from {{ source('app', 'orders') }}
), dedup as (
  select * from staged where rn = 1
)
select * from dedup

Practical upsert: watermark with a lookback

To absorb late-arriving data and timestamp jitter, a safer choice than a plain max(updated_at) is a lookback threshold that rewinds slightly. Rewind a fixed window from the target's max updated_at and re-evaluate source rows in that range.

This approach can pull in some duplicates, so upstream dedup plus a unique_key-driven merge is required.

Tune the lookback window from hours to days based on your data-arrival SLA.
Always handle the first run where the target's max timestamp is null.
Prefer an update-driven column (updated_at). Avoid relying on created_at alone.

Approach	Benefits	Risks / costs
Simple MAX approach	Fastest with the smallest scan	May miss late-arriving rows
Combined with lookback	Resilient to lateness and robust	Adds some reprocessing and duplicate evaluation
Source-side change history table	Fully reproducible with clear deltas	Requires history production and storage cost

Lookback concept

Complete example: incremental + merge with lookback

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

{% set lookback_hours = 24 %}

with base as (
  select * from {{ source('app', 'orders') }}
), window as (
  select coalesce(
           dateadd('hour', -{{ lookback_hours }}, max(updated_at)),
           cast('1970-01-01' as timestamp)
         ) as watermark
  from {{ this }}
), filtered as (
  select b.*
  from base b
  {% if is_incremental() %}
  where b.updated_at >= (select watermark from window)
  {% endif %}
), dedup as (
  select *, row_number() over (partition by id order by updated_at desc) as rn
  from filtered
)
select * from dedup where rn = 1

Partitioning, clustering, and incremental pruning

On large tables, the physical design (partitions and clusters) and incremental filtering directly drive query cost. Use dbt's config to apply the right settings per adapter and minimize the scanned range.

incremental_predicates lets you specify extra predicates applied to both target and source, on adapters that support it. Check your adapter's documentation for support.

Use partition_by on a time column on warehouses that support it.
Speed up merge match-finding with cluster or sort keys.
Keep incremental predicates simple and choose columns the statistics can actually use.

Optimization	Benefit	Watch-out
partition_by (date)	Reduces scans and pairs well with insert_overwrite	Watch out for adapter-specific syntax differences
cluster_by / sort key	Speeds up join and MERGE match lookup	Too many columns increases write cost
incremental_predicates	Narrows the target scope further	Verify adapter support

Pruning illustration

Optimization example (adapter-dependent)

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    partition_by={'field': 'order_date'},          -- e.g. BigQuery/Spark family
    cluster_by=['id'],                              -- e.g. BigQuery/Snowflake clustering equivalents
    incremental_predicates=["order_date >= dateadd('day', -7, current_date)"]
) }}

select * from {{ ref('stg_orders_clean') }}
{% if is_incremental() %}
where order_date >= dateadd('day', -7, current_date)
{% endif %}

Schema changes and deletes (on_schema_change and soft delete)

In incremental models, on_schema_change controls behavior when the schema changes. append_new_columns adds new columns, fail aborts on mismatch, and sync_all_columns synchronizes column differences on supported adapters.

Handling actual deletes is not always covered by the default merge behavior. In practice, a soft delete with an is_deleted flag is safer and easier to keep historically consistent. If you must hard-delete, make precise impact scoping and tests mandatory.

on_schema_change depends on adapter support. For the exam, be able to explain each mode and when to use it.
Prefer soft deletes. Align any hard deletes with your audit requirements.
After a schema change, also consider running full_refresh.

on_schema_change	Behavior	Use case
ignore	Ignore (continue executing despite differences)	Short-term workaround; fix it later
append_new_columns	Add new columns	Operations where only column additions are expected
fail	Fail when differences are detected	Environments that require strict schema control
sync_all_columns	Synchronize column differences (supported adapters only)	When you want to absorb column changes in both directions

Soft delete flow

Upsert logic that includes soft delete (flag propagation)

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    on_schema_change='append_new_columns'
) }}

with src as (
  select id,
         updated_at,
         status,
         case when status = 'deleted' then true else false end as is_deleted
  from {{ ref('stg_orders_clean') }}
  {% if is_incremental() %}
    where updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
  {% endif %}
)

select * from src

Exam-prep checklist (for the Analytics Engineer exam)

The exam frequently asks about incremental fundamentals (unique_key, the role of is_incremental, and what full_refresh means) and when merge is the right choice. Be ready to explain the on_schema_change options, how to write dedup logic, and safe watermarking strategies.

Attaching tests (not_null/unique and generic tests) to models to guarantee data quality is another commonly evaluated point.

unique_key is required. Dedup deterministically inside the model.
Control the load range with is_incremental(); use a lookback for late arrivals.
Be able to articulate each on_schema_change mode and when to use which.

Common topic	Typical wrong answer	Key to the right answer
Why unique_key matters	Updates work with append too	merge updates on key match; append only appends
Watermark	Use max(created_at)	Use updated_at plus a lookback to absorb lateness
Schema-change handling	Always ignore	Choose append_new_columns / fail / sync_all_columns based on requirements

Exam imagery

Minimal quality-test example (schema.yml excerpt)

version: 2
models:
  - name: dim_orders
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: updated_at
        tests:
          - not_null

Check your understanding

Analytics Engineer

問題 1

When using incremental_strategy=merge in a dbt incremental model, which design most correctly reflects both updates and new inserts?

unique_key is unnecessary; selecting only new rows with is_incremental() will also apply updates.
Switching to the append strategy automatically overwrites updates.
Set a unique_key, dedupe upstream in the source, and combine it with an updated_at-based lookback.
Specifying on_schema_change=fail makes MERGE safe even when keys duplicate.

正解: C

merge requires a unique_key, and multiple candidates for the same key must be deterministically reduced to one upstream. Combining updated_at with a lookback prevents missed late arrivals. append does not apply updates, and on_schema_change governs schema-difference behavior — it does not resolve key duplicates.

Frequently asked questions

Does the merge strategy automatically propagate deletes (rows missing from the source)?

A standard upsert does not propagate deletes automatically. When auditability is required, the common approach is a soft delete reflected via a flag such as is_deleted. If you truly need hard deletes, consider a tightly scoped delete+insert or partition-level insert_overwrite (supported adapters only).

How should configuration differences across BigQuery, Spark, Snowflake, and Databricks be handled?

Options like partition_by and insert_overwrite are adapter-specific, so follow the documentation for the adapter you use. In dbt models, lock in the shared parts (unique_key, is_incremental, dedup) and branch the physical design via config or manage it per environment for safety.

When should you run a full_refresh?

Run it for major schema changes, fixing inconsistent watermarks, or rebuilding history. Adding --full-refresh to a build run rebuilds the target incremental table from scratch.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Robust upsert patterns with dbt incremental merge strategy