
In modern data pipelines, handling data in various formats such as CSV, Parquet, and JSON is essential to ensure smooth data processing. However, one of the most common challenges faced by data engineers is the evolution of schemas as new data comes in. Schema evolution refers to the ability of a system to adapt to changes in the structure of incoming data without breaking existing workflows.
In this blog, we’ll explore the significance of schema evolution using real-world examples with CSV, Parquet, and JSON data formats. Schema evolution allows for the automatic adjustment of the schema in the data warehouse as new data is ingested, ensuring data integrity and avoiding pipeline failures.
Technical Implementation:
1. Handling CSV Data with Schema Evolution
Consider the example of customer invoices stored in CSV files. On day 1, our data might contain columns like CUST_NUM, CUST_STAT, and CUST_BAL. As new data comes in on day 2, we may have additional columns like PHONE and EMAIL.
We first create a table CUST_INVC using the INFER_SCHEMA function, which infers the schema of the CSV file and creates the table:
CREATE OR REPLACE DATABASE schema_evol;
CREATE STAGE src_stg;

CREATE OR REPLACE FILE FORMAT csv_fmt
  TYPE = 'csv'
  COMPRESSION = 'auto'
  FIELD_DELIMITER = ','
  PARSE_HEADER = TRUE
  RECORD_DELIMITER = '\n'
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;

CREATE OR REPLACE TABLE CUST_INVC
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@src_stg/Customer_Invc.csv',
      FILE_FORMAT => 'csv_fmt'
    )
  )
);
To allow for schema evolution, we enable the ENABLE_SCHEMA_EVOLUTION table property:
ALTER TABLE CUST_INVC SET ENABLE_SCHEMA_EVOLUTION = TRUE;
Now, when we load new data from the Customer_Invc_day2.csv file, which includes the new columns (PHONE, EMAIL), Snowflake automatically adjusts the schema to accommodate the new data:
COPY INTO CUST_INVC
  FROM @src_stg/Customer_Invc.csv
  FILE_FORMAT = (FORMAT_NAME = 'csv_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Day 2: the feed file now includes PHONE and EMAIL
COPY INTO CUST_INVC
  FROM @src_stg/Customer_Invc_day2.csv
  FILE_FORMAT = (FORMAT_NAME = 'csv_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
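A quick check (our suggestion, not part of the original walkthrough) confirms the evolution: after the day-2 load the table definition includes the new columns, and rows ingested on day 1 carry NULL in them.

```sql
-- The new columns appear in the table definition
DESC TABLE CUST_INVC;

-- Day-1 rows show NULL in the newly added PHONE and EMAIL columns
SELECT * FROM CUST_INVC;
```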

2. Handling Parquet Data with Schema Evolution
Let’s now look at how schema evolution works with Parquet files. Parquet is a columnar storage format, often used for its efficient data storage and retrieval.
We create a table Accessory_parquet and load data from the Parquet file Accessory_day1.parquet:
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET;

CREATE OR REPLACE TABLE Accessory_parquet
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@src_stg/Accessory_day1.parquet',
      FILE_FORMAT => 'my_parquet_format'
    )
  )
);
ALTER TABLE Accessory_parquet SET ENABLE_SCHEMA_EVOLUTION = TRUE;
When new data is loaded on day 2 with additional columns like AMOUNT and RENEWAL, Snowflake will automatically adjust the schema to include the new columns:
COPY INTO Accessory_parquet
  FROM @src_stg/Accessory_day1.parquet
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Day 2: the feed file now includes AMOUNT and RENEWAL
COPY INTO Accessory_parquet
  FROM @src_stg/Accessory_day2.parquet
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

3. Handling JSON Data with Schema Evolution
Lastly, let’s look at how JSON data can benefit from schema evolution. We create a table Cust_orders_json to store customer orders:
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE;

CREATE OR REPLACE TABLE Cust_orders_json
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@src_stg/Cust_orders.json',
      FILE_FORMAT => 'my_json_format'
    )
  )
);
ALTER TABLE Cust_orders_json SET ENABLE_SCHEMA_EVOLUTION = TRUE;
As new JSON data arrives with additional fields (like AMOUNT), Snowflake automatically handles the changes:
COPY INTO Cust_orders_json
  FROM @src_stg/Cust_orders.json
  FILE_FORMAT = (FORMAT_NAME = 'my_json_format')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Day 2: the feed file now includes AMOUNT
COPY INTO Cust_orders_json
  FROM @src_stg/Cust_orders_day2.json
  FILE_FORMAT = (FORMAT_NAME = 'my_json_format')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Managing Column Name Case Sensitivity: The Need for a Stored Procedure
Now, here’s where a key challenge arose: Snowflake stores column names exactly as they were created, and resolves unquoted identifiers as uppercase. When we attempted to create a view (CUSTOMER_DATA_VIEW) that joined the three tables, CUST_INVC, Accessory_parquet, and Cust_orders_json, the query failed. The reason? The column names inferred from the files were lowercase, but our unquoted column references were folded to uppercase, so they did not match.
For example, in our Customer_Invc.csv file, the column names might have been crid, ssn, etc., but when we referenced these columns in the view, we needed to enclose them in double quotes ("crid", "ssn") to match the case exactly.
This case sensitivity caused an issue in the view creation, even though the columns were present in the table. The columns were being treated as uppercase by Snowflake’s SQL parser, leading to a mismatch.
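To make the mismatch concrete, here is an illustrative snippet (not from the original post) showing how the two identifier forms are resolved:

```sql
-- An unquoted identifier is folded to uppercase, so this looks up CRID
-- and fails if the column was stored as lowercase "crid":
SELECT crid FROM CUST_INVC;

-- A quoted identifier preserves case and matches the stored name exactly:
SELECT "crid" FROM CUST_INVC;
```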

Verify the DDL of the table:
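One way to inspect the stored column names (the exact statement is our suggestion) is Snowflake's GET_DDL function, which returns the CREATE TABLE statement with each column name in its stored case:

```sql
SELECT GET_DDL('TABLE', 'CUST_INVC');
```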

To address this, we created a stored procedure that automatically renames columns to ensure case-insensitive matching. The stored procedure scans through the schema, identifies columns with lowercase names, and renames them to uppercase, ensuring consistency across tables and preventing errors due to case sensitivity.
Here’s the stored procedure we created to handle this issue:
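The original procedure is not reproduced here, so the following is a minimal Snowflake Scripting sketch of the same idea; the procedure name, return message, and details are illustrative:

```sql
CREATE OR REPLACE PROCEDURE RENAME_COLS_TO_UPPER(TBL_NAME VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    rs RESULTSET;
BEGIN
    -- Find columns whose stored names are not already uppercase
    rs := (SELECT column_name
           FROM information_schema.columns
           WHERE table_name = :TBL_NAME
             AND column_name <> UPPER(column_name));
    FOR rec IN rs DO
        -- Quote the old case-sensitive name; the new unquoted name folds to uppercase
        EXECUTE IMMEDIATE 'ALTER TABLE ' || TBL_NAME ||
                          ' RENAME COLUMN "' || rec.column_name ||
                          '" TO ' || UPPER(rec.column_name);
    END FOR;
    RETURN 'Renamed lowercase columns in ' || TBL_NAME;
END;
$$;

CALL RENAME_COLS_TO_UPPER('CUST_INVC');
```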


The procedure generates ALTER TABLE ... RENAME COLUMN statements that convert the case-sensitive (lowercase) column names to their case-insensitive uppercase equivalents across multiple tables.
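For the CUST_INVC table, the generated statements would look like this (illustrative; the actual statements depend on the lowercase columns found):

```sql
ALTER TABLE CUST_INVC RENAME COLUMN "crid" TO CRID;
ALTER TABLE CUST_INVC RENAME COLUMN "ssn" TO SSN;
```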

Creating a Unified View with Schema Evolution
Now that we’ve set up our tables with schema evolution for CSV, Parquet, and JSON data, we can create a unified view that combines data from all three tables. Here’s how we create the CUSTOMER_DATA_VIEW:
CREATE OR REPLACE VIEW CUSTOMER_DATA_VIEW AS
SELECT
  A.CUST_NUM, A.CUST_STAT, A.CUST_BAL, A.INV_NO, A.CRID, A.SSN,
  B.ACCESSORY, B.STATUS AS ACCESS_STATUS,
  C.ORDERID
FROM CUST_INVC A
LEFT JOIN Accessory_parquet B ON A.CUST_NUM = B.CUST_ID
LEFT JOIN Cust_orders_json C ON A.CUST_NUM = C.CUSTOMERID;
This view consolidates data from the three different sources, showcasing how schema evolution seamlessly integrates data with different structures into a unified view.
Conclusion
Schema evolution is a vital feature that allows data pipelines to remain flexible and resilient as data structures change over time. Whether dealing with CSV, Parquet, or JSON data, schema evolution ensures that your data processing workflows continue to function smoothly, even when new columns are added or removed. However, challenges such as case sensitivity can arise when creating views or queries. In such cases, a stored procedure like the one we created can help resolve these issues by renaming columns to ensure consistent matching.