Redshift is a data warehouse built on massively parallel processing, where table distribution (DISTSTYLE/DISTKEY) and sort order (SORTKEY) directly influence the execution plan and I/O. Because dbt-redshift lets you declare these as model settings, you can manage optimization as if it were embedded in the schema.
While Redshift's automatic optimization has improved in recent years, manually designing DISTKEY / SORTKEY is still effective for workloads with clear patterns such as star schemas. This article compactly covers design principles, how to express them in dbt, how to operate changes, and the key points commonly tested on the exam.
Redshift distributes data across nodes, and the distribution style (DISTSTYLE) and distribution key (DISTKEY) determine where rows are placed. The SORTKEY then controls the physical row order, reducing I/O for range filters and merge joins. These decisions directly affect data movement during joins (redistribute/broadcast), scan volume, and slot consumption under concurrency.
In dbt-redshift, you declare diststyle, dist, sort, and sort_type as model settings. Because the configuration lives in code, you can safely switch settings between branches and environments, and regressions become easier to catch. In practice, the smoothest path is to lock down your primary join keys and access patterns first, then choose dbt model settings to match.
The main distribution styles are AUTO, KEY, ALL, and EVEN. KEY hashes the specified column to assign rows to nodes, co-locating rows with the same key on the same node. ALL replicates a small table to every node, avoiding data movement when joining against large facts. EVEN distributes uniformly without depending on a key. AUTO lets the cluster pick the style it judges appropriate — well suited to initial builds or uncertain workloads.
When you choose KEY, the rule of thumb is to pick a column with high cardinality, low skew, and heavy use in joins. Keys with significant skew or monotonically increasing time-series columns tend to cause data imbalance and hotspots across nodes.
Conceptual diagram of hash distribution by DISTKEY
SORTKEY defines the physical row order of a table. COMPOUND prioritizes the leading column, dramatically reducing I/O for range filters, sorts, and grouping on that column. INTERLEAVED weights multiple columns equally, but it is sensitive to updates and skew, increasing maintenance complexity. For most analytical workloads, COMPOUND is easier to operate, and the basic rule is to place the most frequently filtered column first.
The sort key does not need to match the distribution key. Place the time column you use most often first, followed by columns with high selectivity. Without a sort key, the table follows insertion order, and range filters tend to lose scan efficiency.
| Type | Best-fit queries | Key characteristics and caveats |
|---|---|---|
| COMPOUND | Range filters, ORDER BY, GROUP BY on the leading column | Prioritizes the leading column's order. Fits common analytical patterns well. Column order matters in design |
| INTERLEAVED | Frequent equality filters using multiple columns equally | Weights each column equally. Sensitive to updates and skew, with higher operational complexity |
| None (no sort) | Small to mid-sized tables fast enough on full scans | Insertion order. Hard to gain I/O reduction from range conditions |
In a dbt-redshift model, declare diststyle, dist, sort, and sort_type in config. These are written into the CREATE TABLE DDL, so re-runs of the same model reliably reproduce the same physical design. They do not apply to views.
Because the column order of a sort key carries meaning, decide the order based on query frequency and selectivity. The conventional choice for distribution is to align with the primary join key of the fact table. When you change settings — especially SORTKEY — you may need to recreate the table, so handle this together with a full-refresh plan.
dbt model example: dist/sort for a typical fact table
{{
config(
materialized='incremental',
unique_key='order_id',
diststyle='key', # choose from KEY/ALL/EVEN/AUTO
dist='customer_id', # DISTKEY column
sort_type='compound', # compound or interleaved
sort=['order_ts', 'customer_id'] # column order matters
)
}}
with src as (
select * from {{ ref('stg_orders') }}
)
select
order_id,
customer_id,
order_ts,
total_amount
from src
{% if is_incremental() %}
-- Example: incremental condition. Range filters on the leading sort column are most effective
where order_ts > (select coalesce(max(order_ts), '1970-01-01') from {{ this }})
{% endif %}
Changing DISTSTYLE may be possible with ALTER TABLE depending on your environment, but changing SORTKEY columns or type generally requires recreating the table and reloading data. In dbt, the reliable flow is to use --full-refresh, build a new table via CTAS, and rename it.
To minimize production risk, split the work into models with small blast radii and switch them over in stages. For large tables that take a long time to recreate, consider running during the nightly window or temporarily scaling up the cluster.
In the context of the dbt Analytics Engineer exam, you'll be asked which dist/sort to choose for a given workload, and whether you can express it correctly in dbt settings. Be prepared to explain collocated joins and range-filter optimization for the combination of fact and dimension tables in a star schema.
Analytics Engineer
問題 1
A star schema centered on fact_orders with hundreds of millions of rows. Dashboards mostly aggregate over date ranges, and fact_orders is frequently joined to dim_customers on customer_id. Which dbt configuration is most appropriate?
正解: A
To guarantee collocation on the primary join key, the standard approach is DISTKEY=customer_id on the fact and a COMPOUND SORTKEY led by the date/time column for range-filter efficiency. Small dimensions use ALL to avoid broadcast. The other options are worse in terms of data movement or scan efficiency.
Which takes precedence: the dist/sort settings in dbt or Redshift's automatic optimization?
The diststyle/dist/sort values you specify explicitly in dbt are written into the CREATE TABLE DDL. When you specify them explicitly, those settings are the baseline. The exact behavior of automatic optimization depends on cluster configuration and version, so choose between AUTO and explicit settings based on your workload, and validate with real measurements.
I want different dist/sort settings per environment. How do I manage that in dbt?
You can switch settings by using target.name conditions in the models section of dbt_project.yml — for example, diststyle='auto' in dev and explicit settings in prod. Branching settings inside the model with Jinja variables or env_var references is another practical approach.
Can dist/sort be used on views or materialized views?
They do not apply to views. The dist/sort settings in dbt only take effect when a table is created. Handling for materialized views depends on the implementation and version, but it's generally safer not to design around explicit dist/sort. Apply the optimizations you need on the final table.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
dbt Models: SQL-Defined Transformation Units (2026)
Model fundamentals — SELECT-based definitions, naming, refs,...
dbt Analytics Engineering Exam: Complete Guide (2026)
Pass the AE Certification — scope, weighting, sample questio...
dbt Cloud vs dbt Core: Feature & Cost Comparison (2026)
Honest comparison of dbt Cloud vs. dbt Core — IDE, scheduler...
dbt Project Structure: models/seeds/macros Layout (2026)
Recommended dbt project layout — models, seeds, macros, snap...
dbt_project.yml Explained: Every Config (2026)
Every dbt_project.yml setting that matters — paths, vars, ma...