Optimizing Query Performance with Clustering Keys in Snowflake
This blog explores how Clustering Keys, Data Pruning, and the Search Optimization Service (SOS) enhance query efficiency in Snowflake. It explains how clustering keys physically organize data into micro-partitions, enabling faster queries.

Clustering Keys in Snowflake
Clustering keys are a way to organize data within tables to optimize query performance. They determine how data is physically stored in micro-partitions by grouping similar values together.
What Are Clustering Keys?
Clustering keys are specific columns or expressions defined on a table that dictate how Snowflake organizes data into micro-partitions. By clustering, Snowflake ensures that rows with similar values in the clustering key columns are stored together, enabling efficient query execution.
How Clustering Works: Visualizing Micro-Partitions
Without Clustering Keys (Default Behavior)
When no clustering key is defined, data is stored in the order it is loaded. This can lead to scattered organization, making queries less efficient.
**Example: Table with Random Data Distribution**
{{< code-block>}}
Micro-Partition 1: [Jan, Mar, Dec, Aug]
Micro-Partition 2: [Feb, Sep, Nov, Apr]
Micro-Partition 3: [May, Oct, Jul, Jun]
{{< /code-block >}}
Querying for data from "Jan" requires scanning multiple micro-partitions because the data is not grouped together.
With Clustering Keys
When a clustering key is defined (e.g., `CLUSTER BY (date)`), Snowflake organizes the data so that rows with similar clustering key values are stored together.
**Example: Table Clustered by Date**
{{< code-block>}}
Micro-Partition 1: [Jan, Feb, Mar]
Micro-Partition 2: [Apr, May, Jun]
Micro-Partition 3: [Jul, Aug, Sep]
{{< /code-block >}}
Now, querying for "Jan" only scans one micro-partition instead of multiple, significantly improving performance.
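To make the pruning effect concrete, here is a toy Python sketch (an illustration of the idea, not Snowflake internals) that models each micro-partition by the min/max metadata Snowflake keeps, using month numbers in place of dates:

```python
# Toy model: each micro-partition is a set of month numbers (1-12).
# Snowflake keeps min/max metadata per partition, so a partition must be
# scanned whenever the filter value falls inside its min/max range -- even
# if the exact value is absent from that partition.

unclustered = [{1, 3, 12, 8}, {2, 9, 11, 4}, {5, 10, 7, 6}]  # load order
clustered = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}]  # grouped by month

def partitions_scanned(partitions, value):
    # A partition can be skipped only if the value lies outside its min/max range.
    return sum(1 for p in partitions if min(p) <= value <= max(p))

print(partitions_scanned(unclustered, 6))  # 3 -- every range covers June
print(partitions_scanned(clustered, 6))    # 1 -- only one range covers June
```

With scattered data the min/max ranges of the partitions overlap heavily, so none can be skipped; clustering makes the ranges narrow and disjoint, which is what lets pruning work.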
Why Use Clustering Keys?
Clustering keys improve query performance by:
- Data Pruning: Snowflake can skip irrelevant micro-partitions during queries (e.g., filtering by date).
- Reduced I/O: Only relevant partitions are read, minimizing disk scans.
- Better Compression: Grouped data often leads to higher compression ratios.
- Optimized Query Execution: Queries on clustered columns execute faster due to co-located data.
Example of Defining a Clustering Key
Create a Table with Clustering Key
CREATE TABLE sales_data (
transaction_date DATE,
region_id INT,
amount DECIMAL
) CLUSTER BY (transaction_date);

Alter an Existing Table to Add Clustering Key

ALTER TABLE sales_data CLUSTER BY (transaction_date);

Query Performance Comparison
Before Clustering
Query:
SELECT * FROM sales_data WHERE transaction_date = '2025-01-01';

- Without clustering, Snowflake scans all micro-partitions because the data is scattered.
- Query time: High (e.g., scans hundreds of partitions).
After Clustering
Query:
SELECT * FROM sales_data WHERE transaction_date = '2025-01-01';

- With clustering by `transaction_date`, Snowflake prunes irrelevant partitions and only scans those containing "2025-01-01".
- Query time: Low (e.g., scans only a few partitions).
Best Practices for Clustering Keys
- Choose Columns Frequently Used in Filters: For example, `date` or `region_id` if these are common in WHERE clauses.
- Limit the Number of Keys: Use 3–4 columns at most to avoid high maintenance costs.
- Order Columns by Cardinality: Start with low-cardinality columns (e.g., `region`) and then higher-cardinality ones (e.g., `date`).
- Avoid High-Cardinality Columns Directly: For example, instead of using raw timestamps (`2025-01-25 10:00:00`), truncate them to dates (`2025-01-25`).
When to Use Clustering Keys
Clustering keys are most effective when:
- The table is large (multi-terabyte scale).
- Queries frequently filter or join on specific columns.
- The default organization of data (natural clustering) does not align with query patterns.
By defining appropriate clustering keys based on query patterns and table size, you can significantly reduce query times and optimize resource usage in Snowflake.
Data Pruning in Snowflake
Data pruning is Snowflake’s way of skipping micro-partitions that have no relevant data for a given query. Each micro-partition in Snowflake stores metadata about the range of values it contains. When a query includes filters on clustering keys or other columns, Snowflake checks this metadata to determine which micro-partitions likely hold the required data and reads only those.
How Data Pruning Works
Micro-Partition Metadata
- Each micro-partition has min/max statistics for the columns it contains. If a filter value lies outside the min/max range of a micro-partition, Snowflake can safely skip scanning it.
Clustering Keys
- By defining clustering keys, you physically group related data together. This organization improves pruning because micro-partitions become more uniform, making it easier for Snowflake to discard irrelevant segments.
Reduced I/O
- Data pruning significantly reduces the volume of data read from disk. Fewer partitions scanned translates to faster queries and lower compute costs.
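The min/max check behind pruning reduces to a simple range-overlap test. The sketch below is illustrative only, with assumed per-partition date statistics rather than Snowflake's actual metadata format:

```python
# Zone-map-style pruning sketch: each partition carries (min_date, max_date)
# statistics; a range filter skips any partition whose range doesn't overlap it.
from datetime import date

partition_stats = [
    (date(2025, 1, 1), date(2025, 1, 31)),   # partition 0: January
    (date(2025, 2, 1), date(2025, 2, 28)),   # partition 1: February
    (date(2025, 3, 1), date(2025, 3, 31)),   # partition 2: March
]

def overlapping_partitions(stats, lo, hi):
    # Keep partition i only if its [mn, mx] range intersects the filter [lo, hi].
    return [i for i, (mn, mx) in enumerate(stats) if mx >= lo and mn <= hi]

# A mid-February filter touches only partition 1; the others are pruned.
print(overlapping_partitions(partition_stats, date(2025, 2, 10), date(2025, 2, 20)))  # [1]
```

Because pruning decisions use only this small metadata, Snowflake can discard partitions before reading any of their data.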
Search Optimization Service (SOS) in Snowflake
The Search Optimization Service (SOS) in Snowflake is a feature designed to improve the performance of highly selective queries, such as point lookups or queries with extensive filtering on large tables. It achieves this by creating a specialized data structure called a search access path, which helps Snowflake quickly locate relevant micro-partitions while skipping irrelevant ones.
How Search Optimization Works
1. Search Access Path Creation
- When enabled, Snowflake builds a search access path for the table in the background.
- This path contains metadata about which values are present in each micro-partition.
- Instead of scanning all micro-partitions, Snowflake uses this metadata to prune irrelevant partitions.
2. Bloom Filters
- SOS uses Bloom filters, which help Snowflake identify where data isn't located, further reducing unnecessary scans.
- This mechanism is particularly effective for queries that return only a small subset of rows.
3. Automatic Updates
- The search access path is maintained automatically whenever the table is updated (e.g., new data loads or DML operations).
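The Bloom-filter mechanism in step 2 can be sketched in a few lines of Python. This is a minimal, illustrative implementation, not Snowflake's internal structure; the key property it demonstrates is that a Bloom filter can say "definitely not here" with no false negatives, which is exactly what is needed to rule out micro-partitions:

```python
# Minimal Bloom filter sketch (illustrative, not Snowflake's implementation).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions per item from salted SHA-256 hashes.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means certainly absent; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(item))

# One filter per micro-partition: any partition whose filter answers False
# for the lookup value can be skipped entirely.
partition = BloomFilter()
for customer_id in ("1001", "2002", "3003"):
    partition.add(customer_id)

print(partition.might_contain("1001"))     # True
partition.might_contain("4722123")         # False in almost all cases (rare false positives possible)
```

A "maybe" answer still requires scanning the partition, but every "no" answer is guaranteed correct, so scans are never missed, only avoided.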
When to Use Search Optimization
SOS is most beneficial for:
- Point Lookup Queries: Queries that filter on specific values (e.g., `WHERE id = 123`).
- Highly Selective Filters: Queries that retrieve only a small subset of rows from large tables.
- Substring and Regular Expression Searches: Queries using `LIKE`, `ILIKE`, `RLIKE`, etc.
- Semi-Structured Data: Queries on fields in `VARIANT`, `OBJECT`, or `ARRAY` columns with filters like `ARRAY_CONTAINS` or equality predicates.
- Geospatial Queries: Queries using geospatial functions like `ST_INTERSECTS` or `ST_CONTAINS`.
Performance Example
Without Search Optimization
A query like:
SELECT * FROM CUSTOMER WHERE C_CUSTOMER_SK = '4722123';

- Scans all micro-partitions because Snowflake has no metadata to prune irrelevant ones.
- Query time: 1.3 seconds (example).
With Search Optimization
After enabling SOS:
ALTER TABLE CUSTOMER ADD SEARCH OPTIMIZATION;

Running the same query:

SELECT * FROM CUSTOMER WHERE C_CUSTOMER_SK = '4722123';

- Only relevant micro-partitions are scanned using the search access path.
- Query time: Reduced significantly (e.g., from 1.3 seconds to milliseconds).
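Conceptually, the search access path behaves like an inverted index from values to the partitions that contain them. The sketch below assumes a simplified dictionary-based index with made-up partition contents; the real structure is internal to Snowflake:

```python
# Toy "search access path": map each value to the set of partitions holding it.
from collections import defaultdict

partitions = {
    0: ["111", "222", "333"],
    1: ["444", "4722123", "555"],
    2: ["666", "777", "888"],
}

# Built once in the background, conceptually like SOS maintaining its path.
access_path = defaultdict(set)
for pid, keys in partitions.items():
    for key in keys:
        access_path[key].add(pid)

def lookup(value):
    # A point lookup touches only the partitions listed for the value,
    # instead of scanning all of them.
    return sorted(access_path.get(value, set()))

print(lookup("4722123"))  # [1]
```

The cost trade-off is visible even in this toy: the index must be kept in sync with the data, which is why frequently updated tables pay higher SOS maintenance costs.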
How to Enable Search Optimization
- Enable SOS on a table:
ALTER TABLE my_table ADD SEARCH OPTIMIZATION;

- Estimate costs before enabling SOS:

SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('my_table');

- Disable SOS if needed:

ALTER TABLE my_table DROP SEARCH OPTIMIZATION;

- Check enabled columns:

DESCRIBE SEARCH OPTIMIZATION ON my_table;

Best Practices
- Use SOS for tables with over 1,000 micro-partitions or large datasets.
- Apply it only to queries with high selectivity (returning few rows).
- Avoid enabling SOS on tables with frequent updates or deletes, as it increases maintenance costs.
- Combine SOS with clustering keys for optimal performance when queries involve both selective lookups and range filters.
Comparison: Clustering vs Search Optimization
| Feature | Clustering Keys | Search Optimization Service |
|---|---|---|
| Focus | Range-based queries (e.g., date ranges) | Point lookups and highly selective queries |
| Data Organization | Physically reorganizes data | Adds metadata without reorganizing data |
| Maintenance Overhead | Higher for frequent updates | Moderate; automatically maintained |
| Use Case | Range filters, joins | Equality filters, substring searches |
By enabling SOS strategically, you can achieve significant performance gains for specific query patterns while avoiding unnecessary costs.