Snowflake

Snowflake Clustering Keys: Micro-partition Optimization Guide

2026-03-21
更新: 2026-03-27
NicheeLab Editorial Team

Clustering Keys are a performance-tuning feature in Snowflake that maximize the accuracy of partition pruning on large tables. By optimizing the physical data layout (the sort order inside micro-partitions) based on the columns you specify, they dramatically reduce the amount of data scanned per query. The feature is available on Enterprise Edition and above.

Micro-partitions and Partition Pruning

Snowflake automatically splits every table into compressed micro-partitions of 50-500 MB. Each micro-partition is stored in a columnar format, and metadata such as the min, max, and NULL count of every column is tracked in the Cloud Services Layer.

When a query has a WHERE clause, Snowflake consults this metadata and excludes any micro-partition that cannot possibly match. That mechanism is partition pruning. The more physically sorted the data is, the more precise the pruning becomes and the less data is scanned.

Natural Clustering and Clustering Depth

If data is naturally sorted in the order it was inserted — for example, when rows are appended chronologically by a timestamp column — the table is already naturally clustered. In that case, you do not need to define Clustering Keys.

Clustering Depth is a metric that indicates how well-organized the physical data layout of a table is. The smaller the value, the better organized the data and the more efficient the pruning.

-- Check the clustering depth for specific columns
SELECT SYSTEM$CLUSTERING_DEPTH(
  'analytics.orders',
  '(order_date, region)'
);

-- Retrieve detailed clustering information
SELECT SYSTEM$CLUSTERING_INFORMATION(
  'analytics.orders',
  '(order_date, region)'
);

system$clustering_information() Output Fields

FieldDescription
cluster_by_keysThe Clustering Keys being evaluated
total_partition_countTotal number of micro-partitions in the table
total_constant_partition_countNumber of partitions where the specified column has only one value (the most prunable state)
average_overlapsAverage value-range overlap between partitions (smaller is better)
average_depthAverage clustering depth (smaller is better)

Defining and Modifying Clustering Keys

-- Specify Clustering Keys when creating a table
CREATE TABLE analytics.orders (
  order_id    INT,
  order_date  DATE,
  region      VARCHAR(50),
  customer_id INT,
  amount      NUMBER(12,2)
)
CLUSTER BY (order_date, region);

-- Add Clustering Keys to an existing table
ALTER TABLE analytics.orders
  CLUSTER BY (order_date, region);

-- Change (overwrite) the Clustering Keys
ALTER TABLE analytics.orders
  CLUSTER BY (order_date, customer_id);

-- Drop the Clustering Keys (also stops Automatic Clustering)
ALTER TABLE analytics.orders DROP CLUSTERING KEY;

Guidelines for Choosing Columns

CriterionRecommendedAvoid
Filter frequencyColumns used frequently in WHERE clausesColumns that are rarely filtered
CardinalityMedium (dates, categories, regions)Extremely high (UUID) or extremely low (BOOLEAN)
Number of columns2 to 4 columns5 or more (maintenance cost grows quickly)
Column orderPlace lower-cardinality columns firstPlacing higher-cardinality columns first

Automatic Clustering

Once Clustering Keys are defined, Snowflake's Automatic Clustering service reclusters the data automatically on the Cloud Services Layer. It does not consume your Warehouses and is billed as Serverless Credits instead.

  • Automatically monitors and maintains the clustering state after DML operations
  • No manual reclustering required
  • Automatic Clustering can be suspended and resumed per table
-- Suspend Automatic Clustering
ALTER TABLE analytics.orders SUSPEND RECLUSTER;

-- Resume Automatic Clustering
ALTER TABLE analytics.orders RESUME RECLUSTER;

Monitoring and Cost Management

-- Automatic Clustering credit consumption (last 7 days)
SELECT *
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
  DATE_RANGE_START => DATEADD(DAY, -7, CURRENT_DATE()),
  DATE_RANGE_END => CURRENT_DATE(),
  TABLE_NAME => 'ANALYTICS.ORDERS'
));

-- Long-term cost trend via the Account Usage view
SELECT
  table_name,
  SUM(credits_used) AS total_credits,
  SUM(num_bytes_reclustered) AS total_bytes_reclustered
FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE start_time >= DATEADD(MONTH, -1, CURRENT_TIMESTAMP())
GROUP BY table_name
ORDER BY total_credits DESC;

Decision Flowchart for Adoption

Is the table billions of rows or larger?
  ├─ NO  → Clustering Keys are unnecessary (natural pruning is enough)
  └─ YES
       ↓
  Check partition pruning efficiency in the Query Profile
       ↓
  Is the number of scanned partitions large relative to the total?
  ├─ NO  → Natural clustering already works → Clustering Keys unnecessary
  └─ YES
       ↓
  Pick 2-4 columns that are filtered frequently in WHERE clauses
       ↓
  Define them with ALTER TABLE ... CLUSTER BY
       ↓
  Monitor cost via AUTOMATIC_CLUSTERING_HISTORY

Best Practices

  • Start by diagnosing with the Query Profile: review the partition pruning status and only consider Clustering Keys when there is real room for improvement
  • Lean on natural clustering for time-series data: when rows are inserted in chronological order, clustering by the date column is maintained naturally and no extra configuration is needed
  • Monitor cost regularly: track credit consumption via AUTOMATIC_CLUSTERING_HISTORY and verify the cost/benefit ratio
  • Expression-based Clustering Keys: expressions like TO_DATE(timestamp_col) or MONTH(order_date) are a useful trick to lower the cardinality of a high-cardinality column before clustering on it

Check Your Understanding

Performance Optimization

問題 1

On Enterprise Edition, a query against a 5-billion-row transaction table is slow. The Query Profile shows that partition pruning is barely working for the order_date range filter in the WHERE clause. Which is the most appropriate action?

  1. Scale the Virtual Warehouse up to 6XL to increase scan throughput
  2. Run ALTER TABLE ... CLUSTER BY (order_date) and let Automatic Clustering optimize the physical layout
  3. Create a materialized view to cache the query result
  4. Set DATA_RETENTION_TIME_IN_DAYS to 90 to reserve space for Time Travel

正解: B

Pruning is failing because order_date values are scattered across micro-partitions. Setting CLUSTER BY (order_date) tells Automatic Clustering to physically reorganize the data, which dramatically improves pruning accuracy for date-range filters. Scaling the Warehouse increases raw scan speed but doesn't fix the root cause. Time Travel settings are a data-protection feature and have no effect on query performance.

Frequently Asked Questions

How much does Automatic Clustering cost once I set Clustering Keys?

Automatic Clustering is billed as Serverless Credits and scales with the frequency and volume of DML operations (INSERT/UPDATE/DELETE/MERGE) on the table. Tables that are updated frequently trigger more reclustering and therefore cost more. Use the AUTOMATIC_CLUSTERING_HISTORY table function to inspect credit consumption; if costs run higher than expected, reduce the number of columns in the Clustering Keys or drop them entirely (ALTER TABLE t DROP CLUSTERING KEY).

What is the difference between Clustering Keys and the Search Optimization Service?

Clustering Keys control the physical data layout to improve partition pruning, and are effective for range filters (BETWEEN, >=, <=) and equality filters. The Search Optimization Service, in contrast, builds a complementary access structure that accelerates equality lookups (= / IN), VARIANT path searches, and GEOGRAPHY function searches. The two features can be used together, and you choose between them based on the workload.

Why shouldn't I pick a very high-cardinality column for Clustering Keys?

If you pick an extremely high-cardinality column such as a UUID or random ID as a Clustering Key, the min/max range within each micro-partition becomes very narrow and partitions barely overlap. That looks appealing on paper, but filters on those columns already prune well naturally because they are so pinpoint, so the improvement does not justify the maintenance cost of Clustering Keys. Medium-cardinality columns such as dates or categories are the most effective choice.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Snowflake

Snowflake Certifications: All 11 Exams Explained (2026)

Every SnowPro certification — Associate, Core, Specialty, Ad...

Snowflake

Snowflake Exam Difficulty Ranking: All 11 Certs Compared (2026)

All 11 SnowPro exams ranked by difficulty with study-time es...

Snowflake

Snowflake Study Guide: Fastest Pass Route by Exam (2026)

How to pass SnowPro certifications efficiently — official ma...

Snowflake

SnowPro Core (COF-C03): Complete Exam Guide (2026)

Pass the SnowPro Core exam — six domains, scope, sample ques...

Snowflake

SnowPro Associate Platform (SOL-C01): Complete Guide (2026)

The entry-level SnowPro Associate exam — scope, weighting, s...

Browse all Snowflake articles (103)
© 2026 NicheeLab All rights reserved.