dbt

Optimizing dbt Incremental Models: Designing Fast Incremental Jobs

2026-04-19
NicheeLab Editorial Team

Incremental updates can process only the smallest amount of data, but a bad design ends up doing full scans or producing many small files, slowing the job down.

This article follows the official docs and optimizes end-to-end: strategy selection, watermarks, MERGE/INSERT_OVERWRITE, late-arriving and deleted data, and operational tuning.

Incremental Model Basics and Strategy Selection

dbt incremental models use materialized='incremental' to keep the existing table and append only the delta. There are roughly three strategies — append, merge, and insert_overwrite — and adapter support differs. On Snowflake, merge is the mainstream choice; on Databricks (Delta), both merge and partition-targeted insert_overwrite are commonly used in practice.

The selection criteria are: do you need to overwrite rows with the same key? Is partition-level regeneration advantageous? How much late-arriving data does the source produce? If you can update matching rows via unique_key, merge is the choice; if there are clear partition boundaries such as daily ones and recomputation is cheap, insert_overwrite is effective.

  • append: simple append-only. Useful only for logs where deduplication is unnecessary.
  • merge: control updates/inserts via unique_key. The standard on Snowflake and Delta.
  • insert_overwrite: swap only the specified partitions. Strong for date-grained recomputation.
StrategyMain Supported EnginesTypical Use CaseRisks / Caveats
appendSnowflake, Databricks (Delta), etc.Append-only event logsCannot dedupe or overwrite. Weak against late arrivals or recomputation
mergeSnowflake, Databricks (Delta)Key-level upserts, mild late arrivalsDesigning unique_key and join conditions is critical. Excessive stage input is slow
insert_overwriteSpark-family engines such as Databricks (Delta)Recreating or backfilling date partitionsCoarse partition selection causes write amplification. Watch for small-file growth

Conceptual incremental pipeline

Sourceraw/CDCStagingfilteredIncremental Modelmerge/overwriteAnalytics Tablespartitioned/clusteredraw/CDC → filter in staging → merge/overwrite in incremental model → partitioned/clustered analytics tables

Skeleton of a minimal incremental model

{{ config(
  materialized='incremental',
  incremental_strategy='merge',  -- common on Snowflake/Delta
  unique_key='id',               -- key for upsert
  on_schema_change='append_new_columns'  -- depends on adapter support
) }}

with src as (
  select *
  from {{ source('app', 'events') }}
  {% if is_incremental() %}
    -- narrow input via watermark (e.g., updated_at)
    where updated_at >= dateadd(day, -1, (select coalesce(max(updated_at), '1900-01-01') from {{ this }}))
  {% endif %}
)

select
  id,
  event_type,
  payload,
  updated_at
from src;

is_incremental and Watermark Design

When switching the delta condition with is_incremental(), relying only on a max() lookup against the target table can become a bottleneck. Ideally, drive the condition from the source's update timestamp or incremental key directly, and add a "fence width" as needed to absorb late arrivals.

Watermark columns must be monotonically increasing, non-NULL, and timezone-consistent. Parameterize the ingestion range so it can be overridden during a backfill.

  • Basic: control with source updated_at >= run_started_at - lookback
  • Late arrivals: read a wider window (e.g., 24-48 hours) and deduplicate downstream
  • Optimize max() lookups: switch the lookup target to a small summary table

Variable watermark example (Jinja + variables)

{% set lookback_hours = var('wm_lookback_hours', 24) %}
{% set lower_bound_expr %}dateadd(hour, -{{ lookback_hours }}, {{ run_started_at }}){% endset %}

{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}

with src as (
  select *
  from {{ source('crm', 'customers') }}
  {% if is_incremental() %}
    where updated_at >= {{ lower_bound_expr }}
  {% endif %}
)
,
dedup as (
  -- dedupe: pick the latest row per key from the source
  select asof.*
  from (
    select *, row_number() over(partition by id order by updated_at desc) as rn
    from src
  ) asof
  where rn = 1
)
select * from dedup;

Optimizing the MERGE Strategy (Snowflake / Databricks)

MERGE offers great upsert flexibility, but performance falls off a cliff when the join touches many rows. The first optimization step is to narrow the input enough via is_incremental() and to make the merge join key (unique_key) explicit. Pre-aggregating each key to its latest row on the stage side before MERGE alone has a big impact.

On Snowflake, clustering key design influences scan efficiency. Set the key with ALTER TABLE ... CLUSTER BY to improve search efficiency. On Databricks (Delta), OPTIMIZE and ZORDER BY improve data skipping and tame post-MERGE fragmentation. These operations are costly, so controlling frequency via post-hook keeps things stable.

  • Aggregate MERGE input to "the latest record per required key" before feeding it in
  • Snowflake: prioritize columns used in join/filter as CLUSTER BY keys
  • Databricks (Delta): run OPTIMIZE/ZORDER only after large updates

MERGE optimization example (clustering/OPTIMIZE via post-hook)

{{ config(
  materialized='incremental',
  incremental_strategy='merge',
  unique_key=['account_id', 'effective_date'],
  post_hook=[
    {% raw %}
    "{% if target.type == 'snowflake' %}"
    "alter table {{ this }} cluster by (account_id, effective_date);",
    "{% endif %}"
    ,
    "{% if target.type in ['databricks','spark'] %}"
    "optimize {{ this }} zorder by (account_id);",
    "{% endif %}"
    {% endraw %}
  ]
) }}

with staged as (
  select *, row_number() over(partition by account_id, effective_date order by updated_at desc) as rn
  from {{ ref('stg_accounts_daily') }}
  {% if is_incremental() %}
    where updated_at >= dateadd(day, -2, {{ run_started_at }})
  {% endif %}
)
select account_id, effective_date, balance, updated_at
from staged
where rn = 1;

INSERT_OVERWRITE and Partition Operations (Databricks Delta)

On Spark/Databricks, you can use incremental_strategy='insert_overwrite' with partition_by. This swaps only the affected partitions, making daily or monthly recomputation fast. Pick partition keys that are frequently used as filters and have moderate cardinality (such as dates).

Conversely, partitions that are too fine increase small files and slow down reads and metadata management. Run OPTIMIZE periodically and tune batch granularity so that you settle on appropriate file sizes. Because insert_overwrite is uncommon on Snowflake, prefer merge there.

  • Daily partitions + insert_overwrite is strong for backfills
  • Small-file mitigation: control write parallelism and OPTIMIZE frequency
  • On Snowflake, default to merge; handle recomputation by splitting models

INSERT_OVERWRITE example on Delta (daily partitions)

{{ config(
  materialized='incremental',
  incremental_strategy='insert_overwrite',
  partition_by=['dt']
) }}

with base as (
  select *, to_date(event_time) as dt
  from {{ source('app', 'clicks') }}
  {% if is_incremental() %}
    where event_time >= dateadd(day, -3, {{ run_started_at }})
  {% endif %}
)

select
  user_id,
  url,
  event_time,
  to_date(event_time) as dt
from base
-- insert_overwrite swaps only the affected partitions based on partition_by=['dt'] above

Handling Late Arrivals and Deletes, plus Test Strategy

Absorb late arrivals with "read a wider window + normalize to the latest row downstream." For row deletes, if the source has a delete flag (tombstone), you can handle it with MERGE's WHEN MATCHED AND deleted=1 THEN DELETE. If the source physically deletes rows, ensure consistency via periodic full-comparison models or snapshots.

For quality assurance, in addition to basic unique and not_null tests, selective tests that inspect only the incremental-specific region (such as the last N days) are efficient. When you embed lookback into the model, share vars so that tests can reference the same window.

  • When delete flags exist: leverage MERGE's WHEN MATCHED ... THEN DELETE
  • Physical deletes only: consider consistency checks via snapshots or a separate model
  • Focus tests on recent partitions and run them at high frequency

Delete-flag-aware MERGE conditions and test definitions

{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}

with src as (
  select * from {{ source('erp', 'orders_cdc') }}
  {% if is_incremental() %}
    where updated_at >= dateadd(day, -2, {{ run_started_at }})
  {% endif %}
), latest as (
  select *, row_number() over(partition by id order by updated_at desc) rn
  from src
)
select id, status, amount, updated_at, deleted
from latest
where rn = 1

-- dbt generates the default MERGE statement for the adapter. Customizing to include WHEN MATCHED AND deleted=1 THEN DELETE for deleted=1 rows is also possible (depending on adapter/version).

-- Example tests/schema.yml
-- version: 2
-- models:
--   - name: fct_orders
--     tests:
--       - unique:
--           column_name: id
--       - not_null:
--           column_name: id
--       - relationships:
--           to: ref('dim_customers')
--           field: customer_id

Execution Plans and Operational Tuning

Speed comes not only from query optimization but also from deciding which models to run and when. Use state:modified+ and tags to select only models whose changes truly propagate, and tune parallelism (threads) to your environment's limits. The trick to stable operations is not running multiple large MERGEs simultaneously.

On Databricks (Delta), running OPTIMIZE/VACUUM only after large backfills — lowering maintenance frequency — saves cost and time. On Snowflake, review clustering key settings and avoid unnecessary reclustering.

  • Model selection: rerun only impacted scope with dbt build -s state:modified+
  • Parallelism: serialize heavy MERGEs on the same warehouse to avoid contention
  • Maintenance: on Delta, OPTIMIZE/VACUUM only when needed; on Snowflake, set appropriate CLUSTER BY

Execution selection and scheduling example

# Build only the changed delta (validate in a test env that defers to prod)
dbt build -s state:modified+ --defer --state target/artifacts --target prod

# Limit heavy incremental tag to a nightly batch
dbt run -s tag:heavy_incremental --threads 2

# Example selectors.yml
# selectors:
#   - name: modified_plus_heavy
#     definition:
#       union:
#         - method: state
#           value: modified+
#         - method: tag
#           value: heavy_incremental

Check Your Understanding

Analytics Engineer

問題 1

On Databricks (Delta), you frequently recompute a daily-partitioned sales table. Late arrivals are at most 48 hours, and overwriting rows with the same key is not required. Which strategy is the most efficient?

  1. Use incremental_strategy='insert_overwrite' with partition_by=['sale_date'], and use is_incremental() to target only the last 2-3 days
  2. Use incremental_strategy='merge' and upsert against the entire history every run
  3. Set materialized='table' and fully rebuild every run
  4. Just append everything with the append strategy and filter errors on the dashboard side

正解: A

Daily partitions are clear and there is no overwrite requirement, so insert_overwrite — swapping only the affected days — is the most efficient. Read a 2-3 day window to account for late arrivals and replace only those partitions. Full-history merge or full rebuild does unnecessary work, and append cannot preserve integrity.

Frequently Asked Questions

Is unique_key always required?

It is effectively required for the merge strategy. Without a key, you cannot match rows, which leads to duplicates or missed updates. It is unnecessary for append and certain insert_overwrite cases, but you should still plan key management to meet your integrity requirements.

Can I use insert_overwrite on Snowflake?

merge is generally recommended on Snowflake. Spark-style partition-targeted insert_overwrite is not common there; for recomputation, split the model or regenerate via a separate table.

Should OPTIMIZE and ZORDER run every time?

No. OPTIMIZE/ZORDER on Databricks (Delta) are costly. Run them conditionally via post-hook only when needed, such as right after a large backfill or when many small files have accumulated.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
dbt

dbt Models: SQL-Defined Transformation Units (2026)

Model fundamentals — SELECT-based definitions, naming, refs,...

dbt

dbt Analytics Engineering Exam: Complete Guide (2026)

Pass the AE Certification — scope, weighting, sample questio...

dbt

dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)

Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...

dbt

dbt Project Structure: models/seeds/macros Layout (2026)

Recommended dbt project layout — models, seeds, macros, snap...

dbt

dbt_project.yml Explained: Every Config (2026)

Every dbt_project.yml setting that matters — paths, vars, ma...

Browse all dbt articles (101)
© 2026 NicheeLab All rights reserved.