
In this post, we'll explore Schema Evolution in Snowflake, a feature that allows tables to adapt seamlessly to changing data structures. While the feature initially supported only Parquet files, Snowflake has extended it to CSV and JSON files as well. Modern source systems often append new columns to accommodate additional information, and downstream tables need to adjust accordingly. A data pipeline should be robust enough to read multiple file structures at run time and ingest them into the same table.

Imagine a scenario where a source system continuously feeds Snowflake with evolving CSV data. The file's structure changes frequently, with new columns or alterations introduced. Meeting this challenge requires robust data pipelines capable of modifying table columns to align with the evolving source schema. Snowflake's Schema Detection feature comes to the rescue: it automatically detects the schema of a set of staged structured data files and retrieves the column definitions.
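For example, you can query INFER_SCHEMA directly to inspect the detected column definitions before creating any table. A minimal sketch, assuming the stage and file format we create in the next section:

-- Inspect the detected schema of a staged file without creating a table.
-- Returns one row per detected column: COLUMN_NAME, TYPE, NULLABLE, ORDER_ID, etc.
SELECT *
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@ext_csv_stage/Schema_Evol_day1.csv',
        FILE_FORMAT => 'schema_evol_fmt'
    )
);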

Technical implementation:

Below is the structure of the CSV file we receive from the source system on day 1 in the S3 bucket.

CSV File Structure – Day1

As we can see, there are currently 7 columns in the feed file. We will create the corresponding stage and file format to consume this file.

create or replace file format schema_evol_fmt
type = 'csv'
compression = 'auto'
field_delimiter = ','
parse_header = True
record_delimiter = '\n' ;

create or replace stage demo_db.public.ext_csv_stage
URL = 's3://srcbucketsachin/'
STORAGE_INTEGRATION = s3_int
file_format = demo_db.public.schema_evol_fmt;

CREATE OR REPLACE TABLE CUST_INVC
USING TEMPLATE (
    SELECT ARRAY_AGG(object_construct(*))
    FROM TABLE(
        INFER_SCHEMA(
            LOCATION => '@ext_csv_stage/Schema_Evol_day1.csv',
            FILE_FORMAT => 'schema_evol_fmt'
        )
    )
);

The CUST_INVC table gets created with the following structure.

Table Structure
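You can also confirm the inferred column definitions with a quick check:

-- Verify the columns and data types that INFER_SCHEMA produced.
DESC TABLE CUST_INVC;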

We will load the data into the table using the below command:

copy into CUST_INVC
from @demo_db.public.ext_csv_stage/Schema_Evol_day1.csv
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Verify the data in Snowflake:

Table Output Day1
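A simple query is enough to spot-check the day-1 load:

-- Spot-check the loaded rows.
SELECT * FROM CUST_INVC LIMIT 10;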

Say the next day we receive a file from the source with a different structure.

CSV File Structure – Day2

We can clearly see there are 9 columns in the feed file. Our task now is to load this data into the same table without dropping the existing table.

If we try to load the data using the COPY command, we get the below error.

Load Error

To resolve the issue, we modify the file format as below:

create or replace file format schema_evol_fmt
type = 'csv'
compression = 'auto'
field_delimiter = ','
parse_header = True
record_delimiter = '\n'
ERROR_ON_COLUMN_COUNT_MISMATCH = false;

Now execute the COPY command again, and this time it completes successfully.

COPY Day2
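For reference, this is the same COPY as before pointed at the day-2 file; the file name Schema_Evol_day2.csv is an assumption based on the day-1 naming:

copy into CUST_INVC
from @demo_db.public.ext_csv_stage/Schema_Evol_day2.csv  -- assumed day-2 file name
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;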

Verify the data in the table.

Though the load completed successfully, it has not created or populated the two additional columns. With ERROR_ON_COLUMN_COUNT_MISMATCH = false, COPY simply ignores the extra columns in the file.

Table Data Day2

Question: how do we consume the feed file with all its columns, without dropping or recreating the existing table?

We need to enable the Schema Evolution property on the table.

ALTER TABLE CUST_INVC SET ENABLE_SCHEMA_EVOLUTION = TRUE;
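To confirm the setting took effect, recent Snowflake versions expose an enable_schema_evolution column in the SHOW TABLES output (worth verifying on your account):

-- Check that schema evolution is now enabled for the table.
SHOW TABLES LIKE 'CUST_INVC';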

Now try to load the file with 9 columns and verify the data. (Schema evolution through COPY relies on the MATCH_BY_COLUMN_NAME option, which our command already uses.)

COPY Day2 with Enable Evolution

Verify the data: we can clearly see the table has been altered to include the two new columns. The existing previous-day records have NULL values for these two columns, while today's records have values in all columns.

Data with Evolution
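A query like the following makes the difference visible; the two new column names used here are hypothetical, since they depend on the day-2 file's header:

-- Day-1 rows should show NULL in the evolved columns,
-- while day-2 rows are fully populated.
-- DISCOUNT and REGION are hypothetical names for the two new columns.
SELECT COUNT(*) AS rows_missing_new_cols
FROM CUST_INVC
WHERE DISCOUNT IS NULL AND REGION IS NULL;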
