dbt

Designing unique_key in dbt: Deduplication and Primary Key Management

2026-04-19
NicheeLab Editorial Team

unique_key is the cornerstone setting that guarantees update identity in dbt incremental models. You need to carefully design how source-side duplicates are safely removed while still merging (row-level upsert) correctly.

This article distills the concepts most likely to appear on the dbt Analytics Engineer exam, along with practical guidelines for avoiding real-world pitfalls, all grounded in stable official specifications.

Fundamentals of unique_key and What It Means in dbt

In dbt, unique_key refers to the "identity column(s)" used primarily in incremental models to match existing rows in the target table against new rows. It does not create a database-level primary key constraint per se; it functions as the MERGE condition during incremental loads.

Because unique_key is meant to guarantee row uniqueness, it can be a natural key, a surrogate key, or a composite key. What matters is that the identity definition is stable from a business perspective and that you have decided the tiebreaker (which row counts as the latest) for updates.

When using incremental_strategy=merge for dbt incremental models, specifying unique_key correctly is a prerequisite. This ensures that reingestion and corrected data converge to the same idempotent result.

  • unique_key is a "merge key" and does not necessarily come with a database constraint
  • To build models resilient to resends, corrections, and CDC, you need both the unique_key and a tiebreaker column
  • Composite keys are also supported. Audit cardinality and NULL tolerance up front

Deduplication and incremental merge flow (conceptual)

Sourcesraw_orders, etc.stg_* (dedup/standardize)Collapse to 1 row via ROW_NUMBERcore/fct_* (incremental)MERGE using unique_keymarts/*BI / output layer

Safe dedup + MERGE using unique_key in an incremental model (with minimal dialect dependence)

{{ config(
  materialized='incremental',
  incremental_strategy='merge',
  unique_key='order_id',            # use ['order_id','line_no'] etc. for a composite key
  on_schema_change='append_new_columns'
) }}

with src as (
  select * from {{ source('raw', 'orders') }}
  {% if is_incremental() %}
    -- Example: only ingest recent rows. Use a proper boundary column (e.g. updated_at)
    where updated_at > (select coalesce(max(updated_at), '1970-01-01') from {{ this }})
  {% endif %}
),

-- Tiebreaker (decide which row is latest)
ranked as (
  select
    *,
    row_number() over (partition by order_id order by updated_at desc) as rn
  from src
),

dedup as (
  select * from ranked where rn = 1
)

select
  order_id,
  customer_id,
  status,
  total_amount,
  updated_at
from dedup;

Where to Dedupe: Splitting Responsibilities Between Staging and Fact

Deduplication is generally most effective at the staging layer, where it stabilizes the entire downstream pipeline. If stg_* guarantees "one business key = one row," then core/fact incremental models can stay simple.

That said, in systems where CDC lag and resends are frequent, defense in depth, where the fact side also reinforces tiebreaking, is effective. Make sure that uniqueness for order_id and similar keys is consistent across both stg dedup and the fact MERGE.

  • Dedup at stg: high impact because it is reused across many downstream models
  • Additional dedup on the fact side: ensures robustness at the final boundary
  • Use deterministic, reproducible columns for tiebreaking (update timestamp, version, is_deleted, etc.)

Dedup at stg (a common, reusable pattern)

{{ config(materialized='view') }}

with base as (
  select * from {{ source('raw', 'orders') }}
), ranked as (
  select
    *,
    row_number() over (
      partition by order_id
      order by coalesce(updated_at, created_at) desc
    ) as rn
  from base
)
select
  order_id,
  customer_id,
  status,
  total_amount,
  coalesce(updated_at, created_at) as last_modified_at
from ranked
where rn = 1;

How Incremental Models and unique_key Interact (Strategy-by-Strategy)

incremental_strategy=merge requires unique_key on most adapters and updates or inserts rows that share the same key. Its strength is resilience against corrections and late-arriving data.

For strategies that recreate per partition (e.g. insert_overwrite), the design of partition_by matters more than unique_key. This is effective when daily recreation is realistic on large tables.

The append strategy is a simple, fast insert-only approach, but you must control duplicates and apply corrections yourself. Without a solid dedup SELECT, duplicates will accumulate.

  • merge: resilient to corrections and resends, assuming unique_key is logically stable
  • insert_overwrite: partition design and upstream boundary extraction are critical
  • append: fast but raises the risk of duplicates. Limit it to genuinely append-only cases

Representative examples per strategy (including a BigQuery example)

-- 1) MERGE strategy (common across many adapters)
{{ config(materialized='incremental', incremental_strategy='merge', unique_key=['order_id','line_no']) }}
select * from {{ ref('stg_orders_dedup') }}

-- 2) Per-partition recreation (BigQuery insert_overwrite)
--   Use a stable column such as a date/timestamp as the partition key
--   Not a MERGE, so unique_key is not required
{{ config(
  materialized='incremental',
  incremental_strategy='insert_overwrite',
  partition_by={'field': 'order_date', 'data_type': 'date'}
) }}
select * from {{ ref('stg_orders_dedup') }}
where order_date >= date_sub(current_date(), interval 7 day);

Primary Key Management: Choosing Between Natural, Surrogate, and Composite Keys

A natural key uses a business identifier (e.g. order_id) directly. It is highly readable, but it can be fragile against operational reuse or renumbering changes.

A surrogate key is a stable technical key (hash or sequence) and handles multi-column combinations and NULLs well. It is effective when you want models that can absorb change.

A composite key guarantees uniqueness across multiple columns. dbt's unique_key supports this via array notation. Watch out for NULLs and future column changes, and protect it firmly with tests in practice.

  • Natural key: usable as-is, but sensitive to upstream changes
  • Surrogate key: has generation cost but is robust. Pairs well with history management
  • Composite key: a practical compromise. Hash generation is easy with dbt_utils
Key TypeSource of Uniqueness / StrengthsChange Impact / Implementation Tips
Natural key (single column)Intuitive business identifierVulnerable to upstream renumbering and spec changes. not_null/unique tests are required
Surrogate key (hash / sequence)Absorbs schema changes and composite uniquenessStabilize the generation logic and avoid collisions. NULL normalization is the key
Composite key (multiple columns)Easy to express real-world uniquenessStrengthen tests to prepare for future column additions/removals. Hashing via utilities is also effective

Generating a surrogate key with dbt_utils (composite key support)

-- Add dbt-labs/dbt_utils in packages.yml first
-- Usage example inside a model
select
  {{ dbt_utils.generate_surrogate_key([
    'order_id',
    "cast(line_no as string)",
    "coalesce(customer_id, 'UNKNOWN')"
  ]) }} as sk_order_line,
  *
from {{ ref('stg_orders_dedup') }};

Quality Management: Combining dbt Tests with (Optional) Constraints

Applying not_null and unique via dbt schema tests is the baseline. For composite uniqueness, dbt_utils.unique_combination_of_columns provides a robust solution.

In some data warehouses, primary key and unique constraints are informational and not enforced at runtime. The safe approach is to guarantee effectiveness through dbt tests at build time. Combine with query-based monitoring (anomaly detection) where appropriate.

On large tables, full-table-scan tests can become expensive. Use operational patterns such as incremental tests or checks limited to the daily partition.

  • not_null + unique is the minimum set. Use utilities for composite uniqueness
  • Constraints are environment-dependent. Backfill with tests when not enforced
  • Spread the load of heavy tests through scheduling separation or sampling

schema.yml example (single and composite uniqueness tests)

version: 2
models:
  - name: fct_orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: ['order_id', 'line_no']
    columns:
      - name: order_id
        tests:
          - not_null
      - name: line_no
        tests:
          - not_null
      - name: sk_order_line
        tests:
          - unique

Design Patterns and Pitfalls: Late Arrival, CDC, and Snapshots

In systems with late arrival or resends, always carry a tiebreaker column for dedup (such as updated_at) and choose MERGE for incremental models. When is_deleted flags or version columns exist, they are also good tiebreaker inputs.

In CDC, duplicate keys do not guarantee that the after-image arrives last. Make the row_number ORDER BY strict; if needed, take the max version for the same key, or use op_ts (operation timestamp).

dbt snapshots require their own unique_key design, separate from the model's unique_key. When building SCD Type 2 history, keep uniqueness on the business key while correctly specifying change detection (check_cols, etc.).

  • Make the tiebreaker an explicit deterministic order (updated_at, op_ts, version, is_deleted)
  • Design CDC under the assumption that order guarantees are weak
  • A snapshot's unique_key is the axis of history. Do not conflate it with the model's unique_key

dbt snapshot example (SCD2)

{% snapshot dim_customer_scd %}
{{ config(
  target_schema='snapshots',
  unique_key='customer_id',
  strategy='check',
  check_cols=['name','email','status']
) }}

select * from {{ ref('stg_customers_dedup') }}

{% endsnapshot %}

Check Your Understanding

Analytics Engineer

問題 1

You are building fact_orders on BigQuery as a dbt incremental model. Upstream performs resends and corrections, so for any given order_id you want to keep only the latest update while correctly overwriting existing rows. Which configuration/implementation combination is most appropriate?

  1. Set materialized=incremental and configure only partition_by to append (unique_key is not needed)
  2. Set materialized=incremental, incremental_strategy=merge, unique_key=order_id, and implement dedup with row_number plus a deterministic tiebreaker on the SELECT side
  3. Set materialized=table and always do a full refresh
  4. Set materialized=incremental, incremental_strategy=append, and unique_key=order_id

正解: B

Safely applying corrections and resends requires the MERGE strategy with unique_key, and you also need dedup on the SELECT side (e.g. row_number) to keep only the latest row per key. Append-style partition_by (A) and append (D) do not overwrite and tend to leave duplicates, while constant full refresh (C) is far too costly for what is needed.

Frequently Asked Questions

Is dbt's unique_key the same as a database primary key constraint?

They are different. dbt's unique_key is a logical identity definition used as the MERGE condition for incremental models, and it does not necessarily create or enforce a database constraint. Constraint behavior varies by warehouse, so it is safer to rely on dbt tests to guarantee uniqueness in practice.

How do I specify a composite key?

Most adapters let you pass an array to unique_key (e.g. ['order_id','line_no']). Pairing this with dbt_utils.generate_surrogate_key to build a hash makes it easier to work with, and dbt_utils.unique_combination_of_columns in schema tests adds a robust safety net.

How do I handle late-arriving data when 'latest row' determination is unstable?

Make the ordering deterministic in your dedup query. Use row_number ordered by updated_at (or op_ts / version if absent) and keep rn=1, and design priorities for flags like is_deleted. If needed, combine multiple columns in the ORDER BY so reruns always produce the same result.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
dbt

dbt Models: SQL-Defined Transformation Units (2026)

Model fundamentals — SELECT-based definitions, naming, refs,...

dbt

dbt Analytics Engineering Exam: Complete Guide (2026)

Pass the AE Certification — scope, weighting, sample questio...

dbt

dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)

Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...

dbt

dbt Project Structure: models/seeds/macros Layout (2026)

Recommended dbt project layout — models, seeds, macros, snap...

dbt

dbt_project.yml Explained: Every Config (2026)

Every dbt_project.yml setting that matters — paths, vars, ma...

Browse all dbt articles (101)
© 2026 NicheeLab All rights reserved.