dbt Snapshots: SCD Type 2 in SQL (2026)

dbt Snapshots implement SCD Type 2 by appending a history row whenever a source row changes and managing the validity window. The two main strategies are timestamp and check, and both require a unique_key.

This article covers how to choose between the two strategies, common pitfalls, warehouse-specific tuning in practice, and the points most likely to appear on the exam, all aligned with the official documentation.

Snapshot Basics and Generated Columns

dbt Snapshots detect row-level changes against the result of a SELECT statement and update the history table. When a change is detected, the existing current row's validity window is closed and a new current row is appended. As a result, multiple versions accumulate for the same unique_key.

Snapshot tables include control columns added by dbt. These are common across both strategies and can be used for history-integrity checks and point-in-time queries.

Required: unique_key and strategy (timestamp or check)
Shared control columns: dbt_scd_id, dbt_valid_from, dbt_valid_to, dbt_updated_at
invalidate_hard_deletes: true also expires rows that have disappeared from the source (hard-delete invalidation)
Snapshots are defined as snapshot blocks rather than models and are refreshed with the dbt snapshot command

Column	Role	Notes
dbt_scd_id	Unique identifier for the SCD row	Computed by dbt based on unique_key and strategy
dbt_valid_from	Start of the row's validity	Time the change was detected (or updated_at)
dbt_valid_to	End of the row's validity	NULL for the current row; a timestamp is written on expiration
dbt_updated_at	Update reference assigned by dbt at processing time	Computation differs depending on the strategy

Minimal Snapshot configuration (check strategy)

{% snapshot customers_snapshot %}
{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='check',
    check_cols=['email', 'address', 'status'],
    invalidate_hard_deletes=true
  )
}}
select customer_id, email, address, status
from {{ source('app', 'customers') }}
{% endsnapshot %}

Timestamp vs Check Strategy Comparison

The timestamp strategy detects row changes using a single updated_at column. It is best suited for datasets where updated_at increases monotonically and is reliably updated on every change.

The check strategy detects changes based on value differences in a specified set of columns. It is useful when the update timestamp is unreliable or when you want to base change detection only on a few business-critical attributes.

If you have a reliable updated_at, timestamp is simpler and tends to be faster
If the update timestamp is untrustworthy, missing, or delayed, use check and monitor only the business-critical columns
Exclude nondeterministic or noisy columns (the update timestamp itself, random values, floating-point micro-deltas) from check

Aspect	Timestamp strategy	Check strategy
Detection logic	Detects when updated_at advances	Detects when check_cols values change
Required data	A trustworthy updated_at	Stable, deterministic columns (one or more)
Misdetection risk	Misses changes if the timestamp is delayed or missing	Includes too many columns causes false positives (noise)
Performance	Relatively light (single-column lookup)	Depends on the number of columns and comparison cost
Use case	DWH tables with a CDC-sourced updated_at	App-sourced data where updated_at is inconsistent or you only want to track specific attributes

How Snapshots track history (SCD2)

Skeletons for both strategies

{% snapshot orders_ts %}
{{
  config(
    target_schema='snapshots',
    unique_key='order_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}
select order_id, updated_at, amount, status
from {{ ref('stg_orders') }}
{% endsnapshot %}

{% snapshot users_ck %}
{{
  config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='check',
    check_cols=['email', 'is_active', 'plan']
  )
}}
select user_id, email, is_active, plan
from {{ ref('stg_users') }}
{% endsnapshot %}

Practical Tips for the Timestamp Strategy

updated_at must be rewritten on every update at the source. If updated_at points to the past due to late arrivals or backfills, changes will be missed. Ensure a trustworthy, monotonically increasing timestamp.

Time zone handling depends on the warehouse. Mismatched zones between source and DWH shift window boundaries. Normalize to UTC in the ETL layer or explicitly convert inside the Snapshot SELECT.

During backfills, do not overwrite updated_at with the load time instead of the actual update time
Rounding the timestamp to align with daily rollups can collapse changes
Consider invalidate_hard_deletes=true to handle hard deletes

Symptom	Cause	Mitigation
Missed detection	updated_at is delayed or goes backwards	Guarantee monotonic growth in the ETL; reprocess via a late-arrival flag
Boundary drift	Time zone mismatch	Normalize to UTC and convert explicitly in the SELECT
Wasteful history rows	Mechanical periodic updates change updated_at	Only update at the application layer on semantically meaningful changes

Conceptual example of normalizing time inside the SELECT

{% snapshot payments_ts %}
{{
  config(
    target_schema='snapshots',
    unique_key='payment_id',
    strategy='timestamp',
    updated_at='updated_at_utc'
  )
}}
select
  payment_id,
  -- ここで UTC に正規化（倉庫に応じて適切な関数を使用）
  cast(updated_at as timestamp) as updated_at_utc,
  amount, status
from {{ ref('stg_payments_raw') }}
{% endsnapshot %}

Practical Tips for the Check Strategy

Restrict check_cols to columns that reflect business-meaningful changes. Exclude log-derived timestamps, sort orders, and outputs of nondeterministic functions. Floating-point values often cause false positives via rounding errors, so preprocessing with rounding or stringification before comparison is effective.

If you want to check a derived column, compute it deterministically in the Snapshot SELECT before including it in check_cols. This also minimizes the impact of source schema changes.

check_cols may need NULL-safe comparisons; use coalesce inside the SELECT
Comparing many columns increases cost; keep it to the bare minimum
Standardize monetary rounding per business rules (e.g., fixed to two decimal places)

Symptom	Cause	Mitigation
Over-detection	Included noisy columns	Narrow check_cols and apply preprocessing
Detection from tiny deltas	Floating-point rounding errors	Stabilize via round, cast, or stringification
Missed detection	Missing columns that should be compared	Review your diff definition against requirements and revisit the column list

Example of stabilizing check_cols via preprocessing

{% snapshot subscription_ck %}
{{
  config(
    target_schema='snapshots',
    unique_key='account_id',
    strategy='check',
    check_cols=['plan', 'is_active', 'mrr_str']
  )
}}
with src as (
  select account_id, plan, is_active, mrr
  from {{ ref('stg_subscription') }}
)
select
  account_id,
  plan,
  is_active,
  -- 丸めと型統一により比較を安定化
  cast(round(mrr, 2) as string) as mrr_str
from src
{% endsnapshot %}

Operations and Performance: Scheduling and Growth Control

Snapshots evaluate diffs on every run. Aligning the schedule with the source update frequency is critical; too frequent and you incur unnecessary comparison cost, too infrequent and detection lag grows.

History tables grow over time. If you only query the current rows, expose a view that filters dbt_valid_to IS NULL and optimize reads with partitioning or clustering.

Aligning the schedule right after source updates reduces waste
On large tables, use time-series columns like dbt_valid_from for partition and clustering keys
Enable invalidate_hard_deletes=true if you need hard-delete detection
Isolate a dedicated target_schema for snapshots to manage permissions and lifecycle

Warehouse / aspect	Recommended setting	Goal
Common (scheduling)	Once right after updates; restrict the time window if needed	Reduce wasted diff comparisons
Common (storage)	Partition and cluster by dbt_valid_from/valid_to	Reduce I/O for time-ordered queries
Common (views)	Current-row view: where dbt_valid_to is null	Keep the bulk of reads lightweight

Example view that returns only the current rows

create or replace view analytics.v_customers_current as
select *
from snapshots.customers_snapshot
where dbt_valid_to is null;

Exam Tips: Common Traps

The Analytics Engineer exam frequently tests the conditions for choosing timestamp vs check, the requirement for unique_key, hard-delete handling, and the choice of comparison columns. It rewards semantically correct diff definitions and stable comparisons over operator trivia or warehouse-specific functions.

Snapshots and incremental models have different purposes. Incremental is a means to efficiently update the latest state; Snapshot is a means to preserve history. Do not conflate the two.

unique_key is required; an unstable one breaks history
Timestamp depends on updated_at; check depends on business columns
invalidate_hard_deletes lets you expire rows removed from the source
Do not include nondeterministic columns or micro-deltas in check

Item	What to remember	Common trap
unique_key	Required for every snapshot	Designs where the natural key is unstable
timestamp	Assumes a trustworthy updated_at	Late arrivals cause missed detection
check	Compare only business-critical columns	Including log columns or the update timestamp

Example tests in schema.yml (unique and not_null)

version: 2
snapshots:
  - name: customers_snapshot
    tests:
      - not_null:
          column_name: customer_id
      - unique:
          column_name: customer_id

Check Your Understanding

Analytics Engineer

問題 1

You want to snapshot an online orders table. The app updates amount and status when the amount changes, but some backfills overwrite updated_at with a past timestamp. You want to suppress false positives without missing real changes. Which configuration is most appropriate?

Use the check strategy with check_cols=['amount','status'] and round amount before comparing
Use the timestamp strategy with updated_at, but compare a rounded updated_at
Use the check strategy with check_cols set to every column for an exact-match comparison
Use the timestamp strategy and compare the greater of updated_at and the current time

正解: A

Because updated_at is unreliable, timestamp is a poor fit. The correct approach is check restricted to business-meaningful columns (amount, status), with rounding to avoid floating-point error on amount. Comparing every column amplifies noise, and mixing in the current time invites false positives.

Frequently Asked Questions

What is the difference between Snapshot and incremental models?

Incremental models efficiently update a table to its latest state without keeping history. Snapshots close the validity window on the previous row and append a new row on each change, preserving full history. They serve different purposes.

If a row is hard-deleted from the source, does Snapshot automatically expire it?

Not by default. Enabling invalidate_hard_deletes=true makes the snapshot run detect rows that no longer exist in the source and update dbt_valid_to to expire them.

Is it safe to include computed columns in check_cols?

It is safe as long as the computation is deterministic and stable. Avoid nondeterministic functions (current time, random) and floating-point values prone to rounding errors. Preprocess with round, cast, or coalesce where needed.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

dbt Snapshots for History Tracking: Choosing Between Timestamp and Check Strategies