dbt

Robust upsert patterns with dbt incremental merge strategy

2026-04-19
NicheeLab Editorial Team

incremental + merge is the core dbt feature for delivering the standard upsert (update + insert) pattern on a data warehouse.

This article covers unique_key design, dedup, safe watermarking, performance, and schema-change handling — the exam-favorite topics together with the practical know-how you need on the job.

Big picture of the merge strategy and when to use it

Setting incremental_strategy=merge on a dbt incremental model runs an upsert against the target table: update on key match, insert on no match. This avoids full reloads and lets you efficiently apply only the newly arrived data.

A working merge requires a properly specified unique key and a correct incremental selection. From an exam perspective, you should be able to articulate the risks of a missing or duplicated unique_key (non-deterministic updates, MERGE errors).

  • The unique_key must be logically unique. Always dedupe upstream.
  • Use is_incremental() to narrow the load. In practice, add a lookback to absorb late-arriving data.
  • On full_refresh the table is rebuilt. Declare index and cluster settings explicitly in the model config.
StrategyApplies updatesHandles deletesunique_key required
mergeYes (update on key match)Generally no (prefer soft delete)Required
appendNo (always appends)NoNot required
insert_overwrite (partition)Per-partition replaceYes (targeted partitions are replaced)Not required
delete+insertYes (delete target rows, then reinsert)Yes, within the targeted rangeRequired

UPSERT flow (incremental + merge)

Sourceraw/feedIncremental SQLdedup & filterTarget (Dimension)MERGE on unique_keyUPSERT flow (incremental + merge)

Minimal merge incremental configuration

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

with src as (
  select * from {{ source('app', 'orders') }}
  {% if is_incremental() %}
    where updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
  {% endif %}
)

select * from src

unique_key design and dedup in practice

A clean unique_key is the key to a successful merge. If business keys are unstable, consider introducing a surrogate key, and assume the source delivers duplicates so that upstream dedup is non-negotiable.

On the exam, when asked what happens if the unique_key duplicates, the expected answer is to use a window function upstream to deterministically pick a single row (e.g., most recent wins).

  • When business keys can change, introduce a stable surrogate key.
  • Pick exactly one row per key deterministically using update timestamps or versions.
  • Null keys are not unique, so always attach a not_null test.
Key typeStrengthsWatch-outs
Business key (natural key)Easy for the business to understandFragile when values change or get reused
Surrogate keyStable with few collisionsRequires generation and sync machinery
Composite keyPreserves the existing grainAdding columns complicates queries and tests

Dedup flow

SourceDedup (ROW_NUMBER)Clean feedDedup flow

Use a window function to collapse duplicates to one row

with staged as (
  select *,
         row_number() over (
           partition by id
           order by updated_at desc
         ) as rn
  from {{ source('app', 'orders') }}
), dedup as (
  select * from staged where rn = 1
)
select * from dedup

Practical upsert: watermark with a lookback

To absorb late-arriving data and timestamp jitter, a safer choice than a plain max(updated_at) is a lookback threshold that rewinds slightly. Rewind a fixed window from the target's max updated_at and re-evaluate source rows in that range.

This approach can pull in some duplicates, so upstream dedup plus a unique_key-driven merge is required.

  • Tune the lookback window from hours to days based on your data-arrival SLA.
  • Always handle the first run where the target's max timestamp is null.
  • Prefer an update-driven column (updated_at). Avoid relying on created_at alone.
ApproachBenefitsRisks / costs
Simple MAX approachFastest with the smallest scanMay miss late-arriving rows
Combined with lookbackResilient to lateness and robustAdds some reprocessing and duplicate evaluation
Source-side change history tableFully reproducible with clear deltasRequires history production and storage cost

Lookback concept

processedalready loadedlookbackre-evaluation rangenew windownew loadtime →: processed / lookback (re-evaluate) / new window (boundary at max(updated_at))

Complete example: incremental + merge with lookback

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

{% set lookback_hours = 24 %}

with base as (
  select * from {{ source('app', 'orders') }}
), window as (
  select coalesce(
           dateadd('hour', -{{ lookback_hours }}, max(updated_at)),
           cast('1970-01-01' as timestamp)
         ) as watermark
  from {{ this }}
), filtered as (
  select b.*
  from base b
  {% if is_incremental() %}
  where b.updated_at >= (select watermark from window)
  {% endif %}
), dedup as (
  select *, row_number() over (partition by id order by updated_at desc) as rn
  from filtered
)
select * from dedup where rn = 1

Partitioning, clustering, and incremental pruning

On large tables, the physical design (partitions and clusters) and incremental filtering directly drive query cost. Use dbt's config to apply the right settings per adapter and minimize the scanned range.

incremental_predicates lets you specify extra predicates applied to both target and source, on adapters that support it. Check your adapter's documentation for support.

  • Use partition_by on a time column on warehouses that support it.
  • Speed up merge match-finding with cluster or sort keys.
  • Keep incremental predicates simple and choose columns the statistics can actually use.
OptimizationBenefitWatch-out
partition_by (date)Reduces scans and pairs well with insert_overwriteWatch out for adapter-specific syntax differences
cluster_by / sort keySpeeds up join and MERGE match lookupToo many columns increases write cost
incremental_predicatesNarrows the target scope furtherVerify adapter support

Pruning illustration

scanskippartitionsrecent daysscanolder daysskipPruning: scan only recent date partitions and skip older ones

Optimization example (adapter-dependent)

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    partition_by={'field': 'order_date'},          -- e.g. BigQuery/Spark family
    cluster_by=['id'],                              -- e.g. BigQuery/Snowflake clustering equivalents
    incremental_predicates=["order_date >= dateadd('day', -7, current_date)"]
) }}

select * from {{ ref('stg_orders_clean') }}
{% if is_incremental() %}
where order_date >= dateadd('day', -7, current_date)
{% endif %}

Schema changes and deletes (on_schema_change and soft delete)

In incremental models, on_schema_change controls behavior when the schema changes. append_new_columns adds new columns, fail aborts on mismatch, and sync_all_columns synchronizes column differences on supported adapters.

Handling actual deletes is not always covered by the default merge behavior. In practice, a soft delete with an is_deleted flag is safer and easier to keep historically consistent. If you must hard-delete, make precise impact scoping and tests mandatory.

  • on_schema_change depends on adapter support. For the exam, be able to explain each mode and when to use it.
  • Prefer soft deletes. Align any hard deletes with your audit requirements.
  • After a schema change, also consider running full_refresh.
on_schema_changeBehaviorUse case
ignoreIgnore (continue executing despite differences)Short-term workaround; fix it later
append_new_columnsAdd new columnsOperations where only column additions are expected
failFail when differences are detectedEnvironments that require strict schema control
sync_all_columnsSynchronize column differences (supported adapters only)When you want to absorb column changes in both directions

Soft delete flow

SourcedeletedTargetis_deleted=1Soft delete flow (MERGE on key / set is_deleted=1)

Upsert logic that includes soft delete (flag propagation)

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    on_schema_change='append_new_columns'
) }}

with src as (
  select id,
         updated_at,
         status,
         case when status = 'deleted' then true else false end as is_deleted
  from {{ ref('stg_orders_clean') }}
  {% if is_incremental() %}
    where updated_at >= (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
  {% endif %}
)

select * from src

Exam-prep checklist (for the Analytics Engineer exam)

The exam frequently asks about incremental fundamentals (unique_key, the role of is_incremental, and what full_refresh means) and when merge is the right choice. Be ready to explain the on_schema_change options, how to write dedup logic, and safe watermarking strategies.

Attaching tests (not_null/unique and generic tests) to models to guarantee data quality is another commonly evaluated point.

  • unique_key is required. Dedup deterministically inside the model.
  • Control the load range with is_incremental(); use a lookback for late arrivals.
  • Be able to articulate each on_schema_change mode and when to use which.
Common topicTypical wrong answerKey to the right answer
Why unique_key mattersUpdates work with append toomerge updates on key match; append only appends
WatermarkUse max(created_at)Use updated_at plus a lookback to absorb lateness
Schema-change handlingAlways ignoreChoose append_new_columns / fail / sync_all_columns based on requirements

Exam imagery

ConfigModel SQLQuality testsBuildExam imagery

Minimal quality-test example (schema.yml excerpt)

version: 2
models:
  - name: dim_orders
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: updated_at
        tests:
          - not_null

Check your understanding

Analytics Engineer

問題 1

When using incremental_strategy=merge in a dbt incremental model, which design most correctly reflects both updates and new inserts?

  1. unique_key is unnecessary; selecting only new rows with is_incremental() will also apply updates.
  2. Switching to the append strategy automatically overwrites updates.
  3. Set a unique_key, dedupe upstream in the source, and combine it with an updated_at-based lookback.
  4. Specifying on_schema_change=fail makes MERGE safe even when keys duplicate.

正解: C

merge requires a unique_key, and multiple candidates for the same key must be deterministically reduced to one upstream. Combining updated_at with a lookback prevents missed late arrivals. append does not apply updates, and on_schema_change governs schema-difference behavior — it does not resolve key duplicates.

Frequently asked questions

Does the merge strategy automatically propagate deletes (rows missing from the source)?

A standard upsert does not propagate deletes automatically. When auditability is required, the common approach is a soft delete reflected via a flag such as is_deleted. If you truly need hard deletes, consider a tightly scoped delete+insert or partition-level insert_overwrite (supported adapters only).

How should configuration differences across BigQuery, Spark, Snowflake, and Databricks be handled?

Options like partition_by and insert_overwrite are adapter-specific, so follow the documentation for the adapter you use. In dbt models, lock in the shared parts (unique_key, is_incremental, dedup) and branch the physical design via config or manage it per environment for safety.

When should you run a full_refresh?

Run it for major schema changes, fixing inconsistent watermarks, or rebuilding history. Adding --full-refresh to a build run rebuilds the target incremental table from scratch.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
dbt

dbt Models: SQL-Defined Transformation Units (2026)

Model fundamentals — SELECT-based definitions, naming, refs,...

dbt

dbt Analytics Engineering Exam: Complete Guide (2026)

Pass the AE Certification — scope, weighting, sample questio...

dbt

dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)

Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...

dbt

dbt Project Structure: models/seeds/macros Layout (2026)

Recommended dbt project layout — models, seeds, macros, snap...

dbt

dbt_project.yml Explained: Every Config (2026)

Every dbt_project.yml setting that matters — paths, vars, ma...

Browse all dbt articles (101)
© 2026 NicheeLab All rights reserved.