Clustering Tips: Points to be consider about Clustering

Read Time:3 Minute, 1 Second

To implement the Clustering, a Cluster Key needs to be selected, And while choosing a cluster key consider the below recommendations by snowflake:

A clustering key is a subset of columns in a table or an expression that used to co-locate the data in the table in the same micro-partition. This is useful for very large tables where the ordering was not ideal at the time of data insertion.

Some general indicators that can help determine whether to define a clustering key for a table to improve the performance:

Pointers:

a) Queries on the table are running slower than expected or have noticeably degraded over time.

b) The clustering depth for the table is large.

Cluster columns which are using effectively in selective filters. For many fact tables involved in date-based queries (for example “WHERE invoice_date > x AND invoice date <= y”), choosing the date column is a good idea

If there is room for additional cluster keys, then consider columns frequently used in join predicates, for example “FROM table1 JOIN table2 ON table2.column_A = table1.column_B”.

In general, if a column (or expression) has higher cardinality, then maintaining clustering on that column is more expensive. If you want to use a column with very high cardinality as a clustering key, Snowflake recommends defining the key as an expression on the column, rather than on the column directly

When to add Cluster key: When would you usually consider to add clustering key to a table.

Clustering keys, Not suggested on all tables. Depends on the size of a table, as well as the query performance for the table.
Performance of a table will deteriorate over a period of time. Snowflake physically stores data in 16MB micro-partitions which are immutable. So, when you are constantly inserting/updating records in the tables, those micro-partitions gets recreate.

When they get recreated, it is not possible for Snowflake to ensure that the records are clustered together. Hence, the clustering deteriorates over a period of time. If you create clustering key, auto clustering is turned on and Snowflake automatically recluster the records based on an algorithm. It does not cluster the entire table at the same time, it does it gradually.

Scenario:

You have a large warehouse with auto suspend configured for 10 minutes. You submitted a query and it is going to run for more than 10 minutes. What will happen to the query?

By default, auto-suspend is enabled. Snowflake automatically suspends the warehouse if it is inactive for the specified period of time.
By default, auto-resume is enabled. Snowflake automatically resumes the warehouse when any statement that requires a warehouse is submitted.

So in this case Auto-resume property would comes into the existence and query will continue to run.

One of your query is taking a long time to finish, when you open the query profiler you see that lot of data is spilling to the remote disk(Bytes spilled to remote storage). What may be the cause of this?

The amount of memory available for the servers used to execute the operation might not be sufficient to hold intermediate results. As a result, the query processing engine will start spilling the data to local disk.
If the local disk space is not sufficient, the spilled data is then saved to remote disks.

This spilling can have a profound effect on query performance (especially if remote disk is used for
spilling).
To alleviate this, we recommend: