In a migration, quantitatively proving that the new system produces the same results is critical. The dbt-audit-helper package compares two relations and returns a summary of counts, missing rows, and mismatched rows, making it easy to automate migration acceptance criteria.
This article distills diff-validation patterns you can safely use in production, framed around concepts commonly tested on the Analytics Engineer certification (testing vs diagnostics, reproducibility, CI integration). The details below stick to the stable parts of the official dbt documentation, with notes added where version differences may matter.
During migrations you frequently need to quickly visualize, for legacy (old) vs new tables, which primary keys are missing, where column values differ, and where duplicates or type mismatches occur. dbt-audit-helper provides macros that compare two relations and return a summary of counts, matches/mismatches, and duplicates. This lets you wire up reproducible validation faster than handwritten full-outer-join SQL.
From an exam perspective, a recurring theme is whether you can express production-grade QA declaratively and reproducibly in CI. By incorporating dbt-audit-helper as a model (or a dedicated diagnostic model), diffs are summarized automatically with every dbt run, and the result can drive pass/fail criteria.
| Aspect | dbt-audit-helper | dbt_utils.equality | Handwritten SQL |
|---|---|---|---|
| Purpose | Diff summary and diagnostics (observability-focused) | Set-equality testing (pass/fail-focused) | Maximum flexibility, high implementation and maintenance cost |
| Primary key handling | Easy full-outer-join comparison via the primary key | No primary key required (set equality), but hard to interpret with many duplicates | Freely designable but prone to bugs |
| Output | Summary table of counts and mismatched rows (ideal for visualization and investigation) | Test pass/fail plus extraction of failing rows | Depends on design |
| CI suitability | High (gate on summary thresholds) | High (blocks immediately on failure) | Medium (requires extra scripting) |
| Learning curve | Low to medium (calling standard macros) | Low (applying a generic test) | High (queries and operations both) |
Abstract diff-validation flow during migration
packages.yml (minimal)
packages:
- package: dbt-labs/dbt-audit-helper
version: ">=0.9.0,<1.0.0"The basic flow is to add dbt-audit-helper to packages.yml and fetch it with dbt deps. Run it with a profile that has SELECT permission on both relations (old/new). Because the comparison may become a large full outer join, also watch the warehouse size and the quota for temporary/intermediate space.
A stable dbt best practice is to create a dedicated diff-summary model (e.g., audit_orders), build it in CI, and gate on zero rows (or a row count below an accepted threshold). Macro argument names and detailed output may vary by package version, so consult the docs for your version. The stable concept is 'compare and summarize two relations'.
Diff summary model skeleton (audit_orders.sql)
{{ config(materialized='table') }}
-- Reference the old and new relations
{% set old_rel = ref('orders_legacy') %}
{% set new_rel = ref('orders') %}
-- Representative call. Available arguments depend on the package version.
-- primary_key: the primary key used to match rows for the diff (single or multiple columns)
-- exclude_columns: columns to exclude from the comparison (e.g., load timestamp, tracking columns)
{{ audit_helper.compare_relations(
a_relation=old_rel,
b_relation=new_rel,
primary_key=['order_id'],
exclude_columns=['_ingested_at', '_updated_at']
) }}
Rather than chasing an exact match from day one, it's more realistic to converge while controlling noise. Combining the following patterns lets diffs shrink steadily.
Rounding floats, normalizing time zones, trimming and case-normalizing strings, and excluding audit columns (load timestamps, metadata) are basic moves that work reliably on any DWH.
Comparison through rounding and normalization (via a preprocessing model)
-- Preprocessing: prepare normalized views for old/new and have the comparison reference those views
-- models/stg_orders_legacy_norm.sql
select
order_id,
upper(trim(customer_name)) as customer_name,
round(amount, 2) as amount,
to_timestamp_ntz(order_ts) as order_ts
from {{ ref('orders_legacy') }}
-- models/stg_orders_norm.sql
select
order_id,
upper(trim(customer_name)) as customer_name,
round(amount, 2) as amount,
to_timestamp_ntz(order_ts) as order_ts
from {{ ref('orders') }}
-- models/audit_orders.sql (comparison)
{{ config(materialized='table') }}
{{ audit_helper.compare_relations(
a_relation=ref('stg_orders_legacy_norm'),
b_relation=ref('stg_orders_norm'),
primary_key=['order_id'],
exclude_columns=[]
) }}
Build the diff summary model in CI and gate the pipeline on the result. If you can't enforce zero-row exact match, define an accepted threshold (e.g., 0 missing rows, <=10 value mismatches) so the pass/fail condition is explicit and avoids disputes in real-world operation.
In dbt, you can express arbitrary pass/fail logic just by creating a singular test that expects zero rows from a single query. Reference the diff summary model and write a SELECT that returns rows for the NG (failure) condition.
Singular test example (require 0 mismatches and 0 missing rows)
-- tests/no_diff_orders.sql
-- audit_orders (summary output) is expected to surface rows when there are diffs
select *
from {{ ref('audit_orders') }}
where (mismatch_count > 0 or only_in_a > 0 or only_in_b > 0)
-- If the SELECT above returns 1 or more rows, the test fails (CI blocks)
# Reference: GitHub Actions (excerpt)
# - run: dbt deps
# - run: dbt run -s models/audit
# - run: dbt test -s tests/no_diff_orders.sql
For large tables, it's more practical to advance step by step using partition windows or sampling rather than comparing everything at once. First, compare only the last N days to confirm the pipeline is healthy, then widen the backfill window over time. This controls cost and time while still ensuring quality.
Clustering or sort keys on the primary key also reduce full-outer-join cost. Before comparing, consider refreshing statistics or optimizing the table (e.g., Snowflake clustering, Databricks OPTIMIZE).
Comparison narrowed by date window (configurable via vars)
{{ config(materialized='table') }}
{% set days = var('compare_days', 7) %}
with legacy as (
select * from {{ ref('orders_legacy') }}
where order_date >= dateadd(day, -{{ days }}, current_date)
),
newer as (
select * from {{ ref('orders') }}
where order_date >= dateadd(day, -{{ days }}, current_date)
)
select * from (
{{ audit_helper.compare_relations(
a_relation=adapter.get_relation(database=target.database, schema=this.schema, identifier='legacy'),
b_relation=adapter.get_relation(database=target.database, schema=this.schema, identifier='newer'),
primary_key=['order_id']
) }}
)
Float rounding errors, string collation differences, and gaps in time zones or sub-second precision are classic reasons diffs refuse to shrink. Normalize before comparing, and explicitly exclude columns that simply can't be stabilized.
If the primary key isn't unique, duplicate detection comes first. Building a surrogate key (a hash of multiple columns) before comparing works well. The surrogate_key macro in dbt-utils is widely used and stable.
Preprocessing example with a surrogate key and rounding
-- Build a surrogate key with dbt-utils and unify rounding
with base as (
select
{{ dbt_utils.surrogate_key(['order_id', 'warehouse_id']) }} as sk,
round(amount, 2) as amount,
to_timestamp_ntz(order_ts) as order_ts,
*
from {{ ref('orders') }}
)
select * from base
Analytics Engineer
問題 1
You want to compare the legacy orders_legacy table with the new orders table and capture missing rows, duplicates, and value mismatches as a summary keyed by order_id. Audit columns (_ingested_at, _updated_at) should be excluded. You want to use this summary as a pass/fail gate in CI. Which approach is best?
正解: A
When the requirement is to obtain a diff summary and use it for a pass/fail gate, dbt-audit-helper's compare_relations is the right tool. Specifying the primary key to gather row-level missing rows and mismatches, excluding audit columns to avoid spurious diffs, and using a singular test to gate the NG condition (e.g., mismatch_count > 0) to zero rows is optimal for reproducibility and maintainability. The equality test is good for set-equality pass/fail but isn't suitable for absorbing audit-column noise or producing summary metrics.
When should I use dbt-audit-helper vs dbt_utils.equality?
equality is best for pass/fail testing of whether two relations match as sets. dbt-audit-helper is better for diagnostic use — getting a summary of missing rows and mismatch counts — and works well for migration progress tracking and reporting. A stable real-world pattern is to use audit-helper to understand the breakdown of differences during the early migration phase, then gate on exact equality once the diff converges.
Can I use it without a primary key?
Without a primary key the matching condition for mismatches becomes ambiguous. The recommended approach is to first define a unique key, or build a multi-column hash key with dbt-utils' surrogate_key and use it as the comparison key. If duplicates are common, add a deduplication preprocessing model before comparing.
How can I reduce time and cost when comparing large tables?
Start with date-partitioned comparisons over the last N days and gradually widen the window. Excluding audit columns and reducing spurious diffs via rounding and time-zone normalization also helps. Warehouse-side optimization (clustering, statistics refresh) is effective as well.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
dbt Models: SQL-Defined Transformation Units (2026)
Model fundamentals — SELECT-based definitions, naming, refs,...
dbt Analytics Engineering Exam: Complete Guide (2026)
Pass the AE Certification — scope, weighting, sample questio...
dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)
Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...
dbt Project Structure: models/seeds/macros Layout (2026)
Recommended dbt project layout — models, seeds, macros, snap...
dbt_project.yml Explained: Every Config (2026)
Every dbt_project.yml setting that matters — paths, vars, ma...