Sampling:

Read Time:1 Minute, 57 Second

Returns a subset of rows sampled randomly from the specified table. The following sampling methods are supported:

Sample a fraction of a table, with a specified probability for including a given row. Therefore, the number of rows returned depends on the size of the table and the requested probability.
- However, A seed can be specified to make the sampling deterministic.
Sample a fixed, specified number of rows. The exact number of specified rows is returned unless the table contains fewer rows.

SAMPLE and TABLESAMPLE are synonymous and can be used interchangeably.

Sample method is optional. If no method is specified, In conclusion the default is BERNOULLI.

SELECT ...
FROM ...
  { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( { <probability> | <num> ROWS } ) [ { REPEATABLE | SEED } ( <seed> ) ]
[ ... ]

samplingMethod ::= { { BERNOULLI | ROW } |
                     { SYSTEM | BLOCK } }

Where:

samplingMethod ::= { { BERNOULLI | ROW } |
                     { SYSTEM | BLOCK } }

BERNOULLI | ROWorSYSTEM | BLOCK

Specifies the method to use:

BERNOULLI (or ROW): Includes each row with a probability of p/100. Similar to flipping a weighted coin for each row. The resulting data size is approxmilatelyof p/100 * number of Rows.
SYSTEM (or BLOCK): Includes each block of rows with a probability of p/100. Similar to flipping a weighted coin for each block of rows. This method does not support fixed-size .

REPEATABLE | SEED ( seed )

Specifies a seed value to make this deterministic.

Which to use: Use SYSTEM/BLOCK in case of higher volume of data and BERNOULI/ROW in case of smaller tables:

CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_BLOCK_ALG
AS
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER sample system (3) seed (82);

CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_ROW_ALG
AS
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER sample row (3) seed (82);

Note: Here (3) is the percentage not the number of rows:

Performance Considerations

SYSTEM | BLOCK sampling is often faster than BERNOULLI | ROW sampling.
Sampling without a seed is often faster than sampling with a seed.
Fixed-size sampling can be slower than equivalent fraction-based because fixed-size prevents some query optimization.

Fixed-size Row Sample:

Return a fixed-size sample of 10 rows in which each row has a max(1, 10/n) probability of being included in the sample, where n is the number of rows in the table:

select * from testtable sample (10 rows);

About Post Author

Sachin Mittal

sachin.mittal04@gmail.com

https://cloudyard.in

Happy

0 %

Sad

0 %

Excited

0 %

Sleepy

0 %

Angry

0 %

Surprise

0 %

Performance Considerations

Fixed-size Row Sample:

About Post Author

Sachin Mittal

Average Rating

Leave a Reply Cancel reply

Smart Insurance Claim Photo Triage System with Cortex AI

Cortex AI + Custom .whl Files: An AI-Powered Contract Risk Analyzer

Custom Python Packages in Snowflake Notebooks Using .whl Files

MCP Connects GitHub Copilot in VS Code to Snowflake Cortex Agents

Real-Time Alerts from Snowflake to Microsoft Teams