BigQuery costs are tied directly to bytes scanned. Whether you set partition_by and cluster_by correctly in your dbt models is decisive for both performance and cost. This article uses the stable, baseline features of the dbt BigQuery adapter to walk through date/timestamp and integer range partitioning, practical cluster design, and how to combine them with incremental strategies.
The Analytics Engineer exam tends to ask which configuration produces which behavior — when to use MERGE vs INSERT OVERWRITE, the effect of require_partition_filter, WHERE clauses that don't break pruning, and so on. Forget rote memorization; this guide locks in the minimal implementation patterns you'll actually use in production.
The dbt BigQuery adapter lets you declare partition_by and cluster_by per model. Partitioning physically narrows the scan footprint at a coarse level; clustering then co-locates rows inside each partition for an additional efficiency win. The top priority is to partition correctly first — clustering is a second-stage optimization.
Cluster columns max out at 4 and order matters (most selective column first). Pick the partition granularity for date/timestamp columns based on real workload, and turn on require_partition_filter for tables exposed to ad-hoc analytics so you can prevent accidental full scans.
In dbt, you declare these in the config block at the top of the model. For date/timestamp partitioning, layer on require_partition_filter where appropriate. For clustering, lead with the columns that show up most often in filters and join keys.
Timestamp/datetime partitions accept a granularity (day is the easiest to work with). Integer range partitions are also supported, but if you adopt them, mind both adapter version compatibility and range width design.
Conceptual view of partitioning and clustering
Baseline model config example (date/timestamp partitioning + clustering)
{{ config(
materialized='table',
partition_by={
'field': 'created_at',
'data_type': 'timestamp',
'granularity': 'day'
},
cluster_by=['user_id', 'event_type'],
require_partition_filter=true
) }}
select
user_id,
event_type,
created_at,
payload
from {{ ref('stg_events') }}
For incremental models on BigQuery, MERGE is the default strategy. It updates rows by unique key and uses a range condition on the partition column to narrow what gets ingested. When a full per-partition replacement is the right fit, INSERT OVERWRITE shines — it swaps out only the targeted partitions, which is handy for things like window aggregations that get recomputed frequently.
Either strategy benefits from an explicit partition-column range condition in the source query to avoid unnecessary recomputation. With INSERT OVERWRITE, since every touched partition is replaced wholesale, the key is to use a WHERE clause that clearly bounds the affected window.
| Strategy | Primary use case | Prereqs / constraints | Cost & characteristics |
|---|---|---|---|
| table (full rebuild) | Small total volume batch jobs; initial builds | None to speak of — but slow | Always full scan, always full write |
| incremental (MERGE) | Late-arriving fixes; row-level upserts | unique_key required; conflict-resolution rules needed | Reads/writes only changed rows. Partition conditions optimize further |
| incremental (INSERT OVERWRITE) | Recomputing specific windows; window aggregations | Partition definition required; touched partitions are replaced wholesale | Writes are limited to target partitions. Finer grain = faster |
Incremental model examples (MERGE and INSERT OVERWRITE)
-- MERGE strategy (assume late arrivals; read only the last N days)
{{ config(
materialized='incremental',
unique_key='event_id',
partition_by={'field': 'created_at', 'data_type': 'timestamp', 'granularity': 'day'},
cluster_by=['user_id'],
require_partition_filter=true
) }}
with src as (
select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where created_at >= timestamp_sub(current_timestamp(), interval 3 day)
{% endif %}
)
select * from src;
-- INSERT OVERWRITE strategy (regenerate the target partition wholesale)
{{ config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={'field': 'created_at', 'data_type': 'timestamp', 'granularity': 'day'},
cluster_by=['user_id'],
require_partition_filter=true
) }}
select *
from {{ ref('agg_daily_metrics_source') }}
{% if is_incremental() %}
where created_at >= timestamp_sub(current_timestamp(), interval 7 day)
{% endif %};
The key to effective pruning is writing range conditions directly against the partition column in the WHERE clause. Wrapping the column in a function tends to defeat pruning. For tables that get a lot of ad-hoc traffic, enable require_partition_filter to block full scans that lack an explicit partition condition.
Clustering further shrinks scan volume within a single partition when you lead with selective columns. But clustering isn't a silver bullet — always make sure the partition-level coarse filtering is in place first.
Filter examples that don't break pruning
-- Recommended (don't wrap the column in a function)
where created_at >= timestamp('2026-04-10 00:00:00 UTC')
and created_at < timestamp('2026-04-17 00:00:00 UTC')
-- Not recommended (pruning often won't kick in)
where date(created_at) between date('2026-04-10') and date('2026-04-16')
Changing partition_by or cluster_by alters the physical definition of the existing table. The safe dbt practice is to full-refresh the affected model to guarantee the new definition takes hold. If the dependency graph is large, you'll need a planned rollout that includes downstream models.
Some cluster column swaps can be applied via DDL, but for keeping consistency with the dbt model definition, a clean rebuild with --full-refresh is much clearer.
Example commands for a migration
# Full-refresh just the high-impact model
$ dbt run --full-refresh -s fact_events
# Apply in stages including downstream dependencies
$ dbt run --full-refresh -s +fact_events
Use dbt tests to enforce NOT NULL on partition columns and uniqueness on incremental keys. For consistency between model definitions and physical metadata, periodically query INFORMATION_SCHEMA. In particular, monitor for drift in require_partition_filter being enabled and in cluster column ordering.
When distributing sample queries to ad-hoc users, include the partition condition in those samples — it cuts down on accidental full scans.
Example of checking BigQuery metadata (optional)
-- Table options (require_partition_filter etc.)
select table_name, option_name, option_value
from `<PROJECT>.<REGION>.INFORMATION_SCHEMA.TABLE_OPTIONS`
where table_catalog = '<PROJECT>'
and table_schema = '<DATASET>'
and table_name = 'fact_events'
and option_name in ('require_partition_filter', 'partitioning_type', 'clustering_columns');
Analytics Engineer
問題 1
Using the dbt BigQuery adapter, you want to minimize scan volume on an event fact table that gets a lot of ad-hoc analyst traffic. Which combination is most appropriate?
正解: A
Trimming the coarse scan with partitioning, blocking unconditional full scans with require_partition_filter, and writing the explicit created_at range in SQL is the sturdiest approach. Clustering alone has limited impact, and wrapping the column in a function weakens pruning. Incremental builds with no time bound and full-history recomputation both drive up cost.
Can I use INT64 range partitioning?
Yes — the dbt BigQuery adapter supports integer range partitioning. Example: partition_by={'field': 'bucket', 'data_type': 'int64', 'range': {'start': 0, 'end': 1000, 'interval': 10}}. Watch out for version compatibility and range design (avoid hotspots) when you adopt it.
How many cluster columns can I have, and does order matter?
Up to 4 columns. Order matters: put the most selective columns (the ones that filter data hardest) first to get the biggest payoff. Adding columns just for the sake of it drives up write cost and hurts maintainability.
How do I change partition_by/cluster_by on an existing model?
Because the physical definition changes, the safest dbt workflow is to rebuild the model with --full-refresh. If you have downstream dependencies, either widen the full-refresh scope deliberately, or backfill in stages with INSERT OVERWRITE.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
dbt Models: SQL-Defined Transformation Units (2026)
Model fundamentals — SELECT-based definitions, naming, refs,...
dbt Analytics Engineering Exam: Complete Guide (2026)
Pass the AE Certification — scope, weighting, sample questio...
dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)
Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...
dbt Project Structure: models/seeds/macros Layout (2026)
Recommended dbt project layout — models, seeds, macros, snap...
dbt_project.yml Explained: Every Config (2026)
Every dbt_project.yml setting that matters — paths, vars, ma...