Incremental updates can process only the smallest amount of data, but a bad design ends up doing full scans or producing many small files, slowing the job down.
This article follows the official docs and optimizes end-to-end: strategy selection, watermarks, MERGE/INSERT_OVERWRITE, late-arriving and deleted data, and operational tuning.
dbt incremental models use materialized='incremental' to keep the existing table and append only the delta. There are roughly three strategies — append, merge, and insert_overwrite — and adapter support differs. On Snowflake, merge is the mainstream choice; on Databricks (Delta), both merge and partition-targeted insert_overwrite are commonly used in practice.
The selection criteria are: do you need to overwrite rows with the same key? Is partition-level regeneration advantageous? How much late-arriving data does the source produce? If you can update matching rows via unique_key, merge is the choice; if there are clear partition boundaries such as daily ones and recomputation is cheap, insert_overwrite is effective.
| Strategy | Main Supported Engines | Typical Use Case | Risks / Caveats |
|---|---|---|---|
| append | Snowflake, Databricks (Delta), etc. | Append-only event logs | Cannot dedupe or overwrite. Weak against late arrivals or recomputation |
| merge | Snowflake, Databricks (Delta) | Key-level upserts, mild late arrivals | Designing unique_key and join conditions is critical. Excessive stage input is slow |
| insert_overwrite | Spark-family engines such as Databricks (Delta) | Recreating or backfilling date partitions | Coarse partition selection causes write amplification. Watch for small-file growth |
Conceptual incremental pipeline
Skeleton of a minimal incremental model
{{ config(
materialized='incremental',
incremental_strategy='merge', -- common on Snowflake/Delta
unique_key='id', -- key for upsert
on_schema_change='append_new_columns' -- depends on adapter support
) }}
with src as (
select *
from {{ source('app', 'events') }}
{% if is_incremental() %}
-- narrow input via watermark (e.g., updated_at)
where updated_at >= dateadd(day, -1, (select coalesce(max(updated_at), '1900-01-01') from {{ this }}))
{% endif %}
)
select
id,
event_type,
payload,
updated_at
from src;When switching the delta condition with is_incremental(), relying only on a max() lookup against the target table can become a bottleneck. Ideally, drive the condition from the source's update timestamp or incremental key directly, and add a "fence width" as needed to absorb late arrivals.
Watermark columns must be monotonically increasing, non-NULL, and timezone-consistent. Parameterize the ingestion range so it can be overridden during a backfill.
Variable watermark example (Jinja + variables)
{% set lookback_hours = var('wm_lookback_hours', 24) %}
{% set lower_bound_expr %}dateadd(hour, -{{ lookback_hours }}, {{ run_started_at }}){% endset %}
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}
with src as (
select *
from {{ source('crm', 'customers') }}
{% if is_incremental() %}
where updated_at >= {{ lower_bound_expr }}
{% endif %}
)
,
dedup as (
-- dedupe: pick the latest row per key from the source
select asof.*
from (
select *, row_number() over(partition by id order by updated_at desc) as rn
from src
) asof
where rn = 1
)
select * from dedup;MERGE offers great upsert flexibility, but performance falls off a cliff when the join touches many rows. The first optimization step is to narrow the input enough via is_incremental() and to make the merge join key (unique_key) explicit. Pre-aggregating each key to its latest row on the stage side before MERGE alone has a big impact.
On Snowflake, clustering key design influences scan efficiency. Set the key with ALTER TABLE ... CLUSTER BY to improve search efficiency. On Databricks (Delta), OPTIMIZE and ZORDER BY improve data skipping and tame post-MERGE fragmentation. These operations are costly, so controlling frequency via post-hook keeps things stable.
MERGE optimization example (clustering/OPTIMIZE via post-hook)
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key=['account_id', 'effective_date'],
post_hook=[
{% raw %}
"{% if target.type == 'snowflake' %}"
"alter table {{ this }} cluster by (account_id, effective_date);",
"{% endif %}"
,
"{% if target.type in ['databricks','spark'] %}"
"optimize {{ this }} zorder by (account_id);",
"{% endif %}"
{% endraw %}
]
) }}
with staged as (
select *, row_number() over(partition by account_id, effective_date order by updated_at desc) as rn
from {{ ref('stg_accounts_daily') }}
{% if is_incremental() %}
where updated_at >= dateadd(day, -2, {{ run_started_at }})
{% endif %}
)
select account_id, effective_date, balance, updated_at
from staged
where rn = 1;On Spark/Databricks, you can use incremental_strategy='insert_overwrite' with partition_by. This swaps only the affected partitions, making daily or monthly recomputation fast. Pick partition keys that are frequently used as filters and have moderate cardinality (such as dates).
Conversely, partitions that are too fine increase small files and slow down reads and metadata management. Run OPTIMIZE periodically and tune batch granularity so that you settle on appropriate file sizes. Because insert_overwrite is uncommon on Snowflake, prefer merge there.
INSERT_OVERWRITE example on Delta (daily partitions)
{{ config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by=['dt']
) }}
with base as (
select *, to_date(event_time) as dt
from {{ source('app', 'clicks') }}
{% if is_incremental() %}
where event_time >= dateadd(day, -3, {{ run_started_at }})
{% endif %}
)
select
user_id,
url,
event_time,
to_date(event_time) as dt
from base
-- insert_overwrite swaps only the affected partitions based on partition_by=['dt'] aboveAbsorb late arrivals with "read a wider window + normalize to the latest row downstream." For row deletes, if the source has a delete flag (tombstone), you can handle it with MERGE's WHEN MATCHED AND deleted=1 THEN DELETE. If the source physically deletes rows, ensure consistency via periodic full-comparison models or snapshots.
For quality assurance, in addition to basic unique and not_null tests, selective tests that inspect only the incremental-specific region (such as the last N days) are efficient. When you embed lookback into the model, share vars so that tests can reference the same window.
Delete-flag-aware MERGE conditions and test definitions
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}
with src as (
select * from {{ source('erp', 'orders_cdc') }}
{% if is_incremental() %}
where updated_at >= dateadd(day, -2, {{ run_started_at }})
{% endif %}
), latest as (
select *, row_number() over(partition by id order by updated_at desc) rn
from src
)
select id, status, amount, updated_at, deleted
from latest
where rn = 1
-- dbt generates the default MERGE statement for the adapter. Customizing to include WHEN MATCHED AND deleted=1 THEN DELETE for deleted=1 rows is also possible (depending on adapter/version).
-- Example tests/schema.yml
-- version: 2
-- models:
-- - name: fct_orders
-- tests:
-- - unique:
-- column_name: id
-- - not_null:
-- column_name: id
-- - relationships:
-- to: ref('dim_customers')
-- field: customer_idSpeed comes not only from query optimization but also from deciding which models to run and when. Use state:modified+ and tags to select only models whose changes truly propagate, and tune parallelism (threads) to your environment's limits. The trick to stable operations is not running multiple large MERGEs simultaneously.
On Databricks (Delta), running OPTIMIZE/VACUUM only after large backfills — lowering maintenance frequency — saves cost and time. On Snowflake, review clustering key settings and avoid unnecessary reclustering.
Execution selection and scheduling example
# Build only the changed delta (validate in a test env that defers to prod)
dbt build -s state:modified+ --defer --state target/artifacts --target prod
# Limit heavy incremental tag to a nightly batch
dbt run -s tag:heavy_incremental --threads 2
# Example selectors.yml
# selectors:
# - name: modified_plus_heavy
# definition:
# union:
# - method: state
# value: modified+
# - method: tag
# value: heavy_incrementalAnalytics Engineer
問題 1
On Databricks (Delta), you frequently recompute a daily-partitioned sales table. Late arrivals are at most 48 hours, and overwriting rows with the same key is not required. Which strategy is the most efficient?
正解: A
Daily partitions are clear and there is no overwrite requirement, so insert_overwrite — swapping only the affected days — is the most efficient. Read a 2-3 day window to account for late arrivals and replace only those partitions. Full-history merge or full rebuild does unnecessary work, and append cannot preserve integrity.
Is unique_key always required?
It is effectively required for the merge strategy. Without a key, you cannot match rows, which leads to duplicates or missed updates. It is unnecessary for append and certain insert_overwrite cases, but you should still plan key management to meet your integrity requirements.
Can I use insert_overwrite on Snowflake?
merge is generally recommended on Snowflake. Spark-style partition-targeted insert_overwrite is not common there; for recomputation, split the model or regenerate via a separate table.
Should OPTIMIZE and ZORDER run every time?
No. OPTIMIZE/ZORDER on Databricks (Delta) are costly. Run them conditionally via post-hook only when needed, such as right after a large backfill or when many small files have accumulated.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
dbt Models: SQL-Defined Transformation Units (2026)
Model fundamentals — SELECT-based definitions, naming, refs,...
dbt Analytics Engineering Exam: Complete Guide (2026)
Pass the AE Certification — scope, weighting, sample questio...
dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)
Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...
dbt Project Structure: models/seeds/macros Layout (2026)
Recommended dbt project layout — models, seeds, macros, snap...
dbt_project.yml Explained: Every Config (2026)
Every dbt_project.yml setting that matters — paths, vars, ma...