1 0
Read Time:1 Minute, 57 Second

Returns a subset of rows sampled randomly from the specified table. The following sampling methods are supported:

  • Sample a fraction of a table, with a specified probability for including a given row. Therefore, the number of rows returned depends on the size of the table and the requested probability.
    • However, A seed can be specified to make the sampling deterministic.
  • Sample a fixed, specified number of rows. The exact number of specified rows is returned unless the table contains fewer rows.

SAMPLE and TABLESAMPLE are synonymous and can be used interchangeably.

Sample method is optional. If no method is specified, In conclusion the default is BERNOULLI.

SELECT ...
FROM ...
  { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( { <probability> | <num> ROWS } ) [ { REPEATABLE | SEED } ( <seed> ) ]
[ ... ]

samplingMethod ::= { { BERNOULLI | ROW } |
                     { SYSTEM | BLOCK } }

Where:

samplingMethod ::= { { BERNOULLI | ROW } |
                     { SYSTEM | BLOCK } }

BERNOULLI | ROWorSYSTEM | BLOCK

Specifies the method to use:

  • BERNOULLI (or ROW): Includes each row with a probability of p/100. Similar to flipping a weighted coin for each row. The resulting data size is approxmilatelyof p/100 * number of Rows.
  • SYSTEM (or BLOCK): Includes each block of rows with a probability of p/100. Similar to flipping a weighted coin for each block of rows. This method does not support fixed-size .

REPEATABLE | SEED ( seed )

Specifies a seed value to make this deterministic.

Which to use: Use SYSTEM/BLOCK in case of higher volume of data and BERNOULI/ROW in case of smaller tables:

CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_BLOCK_ALG
AS
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER sample system (3) seed (82);

CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_ROW_ALG
AS
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER sample row (3) seed (82);

Note: Here (3) is the percentage not the number of rows:
Performance Considerations
  • SYSTEM | BLOCK sampling is often faster than BERNOULLI | ROW sampling.
  • Sampling without a seed is often faster than sampling with a seed.
  • Fixed-size sampling can be slower than equivalent fraction-based because fixed-size prevents some query optimization.
Fixed-size Row Sample:

Return a fixed-size sample of 10 rows in which each row has a max(1, 10/n) probability of being included in the sample, where n is the number of rows in the table:

select * from testtable sample (10 rows);

Average Rating

5 Star
0%
4 Star
0%
3 Star
0%
2 Star
0%
1 Star
0%

Leave a Reply

Your email address will not be published. Required fields are marked *