Snowflake has become a phenomenon across industries with its focus on analytical data storage and impressive performance. With compute capacity charged on a consumption basis and most legacy performance tuning now handled by the service, performance optimization is often as simple as spending a bit more money.
Performance tuning falls into two categories:
1. Tuning handled automatically by Snowflake:
- Micro-partitioning: columns are stored independently within micro-partitions, an approach referred to as columnar storage. This enables precise column and partition pruning: the optimizer analyzes the FROM and WHERE clauses of SQL statements to eliminate micro-partitions that are not needed.
- Columns are compressed individually inside micro-partitions. Snowflake automatically determines the most efficient compression algorithm for each column.
- Re-clustering: re-clustering a table reduces the number of micro-partitions that must be scanned, which improves query performance.
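As a quick illustration of partition pruning, a filter on a column lets the optimizer skip micro-partitions whose min/max metadata shows they cannot contain matching rows. The `orders` table and its columns below are hypothetical:

```sql
-- Only micro-partitions whose metadata range for order_date covers
-- this value are scanned; the rest are pruned before execution.
SELECT order_id, amount
FROM orders
WHERE order_date = '2023-06-15';

-- Snowflake's built-in function reports how well a table is clustered
-- on a given set of columns (useful before deciding to re-cluster).
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');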
2. Tuning by the developer/administrator:
- Always use a LIMIT clause with SELECT * FROM table. The virtual warehouse pulls the complete data set from the storage layer, executes the query, and stores all the results in the result cache, so you are charged for that computation even when you run SELECT * only to see the type or structure of the data.
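A sketch of the difference, using a hypothetical `orders` table:

```sql
-- Scans (and bills for) the whole table:
SELECT * FROM orders;

-- Bounds the work when you only want a sample of rows:
SELECT * FROM orders LIMIT 10;

-- Metadata-only alternative when you just need the structure:
DESCRIBE TABLE orders;
```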
- Dedicated warehouse for querying: Snowflake automatically caches data on the virtual warehouse's local disk (the local disk cache). Place users who query the same data/tables on the same virtual warehouse. This maximizes the chance that data pulled into the cache by one user is reused by others instead of being fetched again from the storage layer.
- Never disable the cloud services result cache: this is one of the most important performance optimizations. The result cache holds the output of your query, and if another user executes the same query, the output is returned from the result cache itself instead of being routed to the virtual warehouse and storage layers; no compute cost is involved.
- What happens if the underlying table is updated? Assume a result is available in the result cache and you then update or delete data in the table. When the SELECT query is executed again, the result cache is no longer relevant: the request is sent to the virtual warehouse layer, which pulls the data from the storage layer, caches it on local disk, executes the query, and stores the latest output in the result cache. It is therefore advisable to perform all DML operations first and then run the queries, to avoid repeatedly invalidating the cache.
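The invalidation sequence above can be sketched as follows (hypothetical table):

```sql
SELECT region, SUM(amount) FROM orders GROUP BY region;  -- computed on the warehouse
SELECT region, SUM(amount) FROM orders GROUP BY region;  -- served from the result cache, no compute

UPDATE orders SET amount = amount * 1.1 WHERE region = 'EU';  -- DML invalidates the cached result

SELECT region, SUM(amount) FROM orders GROUP BY region;  -- recomputed on the warehouse, result re-cached
```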
- Clustering the table: partition large tables. A clustering key is a subset of columns in a table, or an expression, that is explicitly used to co-locate the table's data in the same micro-partitions.
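Defining a clustering key uses standard DDL; the table and columns below are hypothetical:

```sql
-- At creation time:
CREATE TABLE orders (
  order_id   NUMBER,
  order_date DATE,
  region     VARCHAR,
  amount     NUMBER(12,2)
)
CLUSTER BY (order_date, region);

-- Or on an existing table:
ALTER TABLE orders CLUSTER BY (order_date, region);
```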
- Split large files before loading: split data into multiple small files rather than one large file to make use of all the nodes in the cluster. Loading a single large file puts only one node to work while the other nodes sit idle, even on a larger warehouse.
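For example, staging many compressed files of roughly 100–250 MB each (Snowflake's general sizing recommendation) lets a COPY spread the load across all nodes. The stage and table names here are hypothetical:

```sql
-- The stage holds orders_000.csv.gz ... orders_099.csv.gz rather than
-- one huge file, so every node in the warehouse loads files in parallel.
COPY INTO orders
FROM @my_stage/orders/
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);
```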
- Date/time data types for columns: when defining columns that will contain dates or timestamps, choose a date or timestamp data type rather than a character data type. Snowflake stores DATE and TIMESTAMP data more efficiently than VARCHAR, resulting in better query performance.
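For instance (hypothetical table):

```sql
CREATE TABLE events (
  event_id   NUMBER,
  event_ts   TIMESTAMP_NTZ,  -- not VARCHAR: stored compactly, enables pruning and date arithmetic
  event_date DATE
);
```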
- Querying data stored external to the database is likely to be slower than querying native database tables; however, materialized views based on external tables can improve query performance.
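A sketch of that pattern, assuming an existing external table `ext_orders` (materialized views are an Enterprise Edition feature):

```sql
-- The materialized view stores the aggregated results natively,
-- so queries against it avoid re-reading the external files.
CREATE MATERIALIZED VIEW mv_daily_totals AS
SELECT order_date, SUM(amount) AS total_amount
FROM ext_orders
GROUP BY order_date;
```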
- Scale up by resizing a warehouse. Resizing a warehouse generally improves query performance, particularly for larger, more complex queries.
- Scale out by adding clusters to a warehouse: a multi-cluster warehouse automatically spawns additional clusters based on the workload.
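Both operations are single ALTER statements; the warehouse name below is hypothetical, and multi-cluster warehouses require Enterprise Edition:

```sql
-- Scale up: a bigger cluster for large, complex queries.
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: let Snowflake add clusters (up to 3 here) as concurrency grows.
ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;
```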