
In modern data pipelines, handling data in various formats such as CSV, Parquet, and JSON is essential to ensure smooth data processing. However, one of the most common challenges faced by data engineers is the evolution of schemas as new data comes in. Schema evolution refers to the ability of a system to adapt to changes in the structure of incoming data without breaking existing workflows.
In this blog, we’ll explore the significance of schema evolution using real-world examples with CSV, Parquet, and JSON data formats. Schema evolution allows for the automatic adjustment of the schema in the data warehouse as new data is ingested, ensuring data integrity and avoiding pipeline failures.
Technical Implementation:
1. Handling CSV Data with Schema Evolution
Consider the example of customer invoices stored in CSV files. On day 1, our data might contain columns like CUST_NUM, CUST_STAT, and CUST_BAL. As new data comes in on day 2, we may have additional columns like PHONE and EMAIL.
We first create a table CUST_INVC using the INFER_SCHEMA function, which infers the schema of the CSV file and creates the table:
CREATE OR REPLACE DATABASE schema_evol;
CREATE STAGE src_stg;

CREATE OR REPLACE FILE FORMAT csv_fmt
  TYPE = 'csv'
  COMPRESSION = 'auto'
  FIELD_DELIMITER = ','
  PARSE_HEADER = TRUE
  RECORD_DELIMITER = '\n'
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;

CREATE OR REPLACE TABLE CUST_INVC
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@src_stg/Customer_Invc.csv',
      FILE_FORMAT => 'csv_fmt'
    )
  )
);
To allow for schema evolution, we enable the ENABLE_SCHEMA_EVOLUTION table property:
ALTER TABLE CUST_INVC SET ENABLE_SCHEMA_EVOLUTION = TRUE;
Now, when we load new data from the Customer_Invc_day2.csv file, which includes the new columns (PHONE, EMAIL), Snowflake automatically adjusts the schema to accommodate the new data:
COPY INTO CUST_INVC
  FROM @src_stg/Customer_Invc.csv
  FILE_FORMAT = (FORMAT_NAME = 'csv_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Day 2: the feed file now includes PHONE and EMAIL
COPY INTO CUST_INVC
  FROM @src_stg/Customer_Invc_day2.csv
  FILE_FORMAT = (FORMAT_NAME = 'csv_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
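A quick check (our suggestion, not part of the original walkthrough) confirms the evolution: after the day-2 load the table definition includes the new columns, and rows ingested on day 1 carry NULL in them.

```sql
-- The new columns appear in the table definition
DESC TABLE CUST_INVC;

-- Day-1 rows show NULL in the newly added PHONE and EMAIL columns
SELECT * FROM CUST_INVC;
```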

2. Handling Parquet Data with Schema Evolution
Let’s now look at how schema evolution works with Parquet files. Parquet is a columnar storage format, often used for its efficient data storage and retrieval.
We create a table Accessory_parquet and load data from the Parquet file Accessory_day1.parquet:
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET;

CREATE OR REPLACE TABLE Accessory_parquet
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@src_stg/Accessory_day1.parquet',
      FILE_FORMAT => 'my_parquet_format'
    )
  )
);
ALTER TABLE Accessory_parquet SET ENABLE_SCHEMA_EVOLUTION = TRUE;
When new data is loaded on day 2 with additional columns like AMOUNT and RENEWAL, Snowflake will automatically adjust the schema to include the new columns:
COPY INTO Accessory_parquet
  FROM @src_stg/Accessory_day1.parquet
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Day 2: the feed file now includes AMOUNT and RENEWAL
COPY INTO Accessory_parquet
  FROM @src_stg/Accessory_day2.parquet
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

3. Handling JSON Data with Schema Evolution
Lastly, let’s look at how JSON data can benefit from schema evolution. We create a table Cust_orders_json to store customer orders:
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE;

CREATE OR REPLACE TABLE Cust_orders_json
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@src_stg/Cust_orders.json',
      FILE_FORMAT => 'my_json_format'
    )
  )
);
ALTER TABLE Cust_orders_json SET ENABLE_SCHEMA_EVOLUTION = TRUE;
As new JSON data arrives with additional fields (like AMOUNT), Snowflake automatically handles the changes:
COPY INTO Cust_orders_json
  FROM @src_stg/Cust_orders.json
  FILE_FORMAT = (FORMAT_NAME = 'my_json_format')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Day 2: the feed file now includes AMOUNT
COPY INTO Cust_orders_json
  FROM @src_stg/Cust_orders_day2.json
  FILE_FORMAT = (FORMAT_NAME = 'my_json_format')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Managing Column Name Case Sensitivity: The Need for a Stored Procedure
Now, here’s where a key challenge arose: Snowflake stores column names exactly as they were created, and resolves unquoted identifiers as uppercase. When we attempted to create a view (CUSTOMER_DATA_VIEW) that joined the three tables, CUST_INVC, Accessory_parquet, and Cust_orders_json, the query failed. The reason? The column names inferred from the files were lowercase, but our unquoted column references were folded to uppercase, so they did not match.
For example, in our Customer_Invc.csv file, the column names might have been crid, ssn, etc., but when we referenced these columns in the view, we needed to enclose them in double quotes ("crid", "ssn") to match the case exactly.
This case sensitivity caused an issue in the view creation, even though the columns were present in the table. The columns were being treated as uppercase by Snowflake’s SQL parser, leading to a mismatch.
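To make the mismatch concrete, here is an illustrative snippet (not from the original post) showing how the two identifier forms are resolved:

```sql
-- An unquoted identifier is folded to uppercase, so this looks up CRID
-- and fails if the column was stored as lowercase "crid":
SELECT crid FROM CUST_INVC;

-- A quoted identifier preserves case and matches the stored name exactly:
SELECT "crid" FROM CUST_INVC;
```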

Verify the DDL of the table:
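One way to inspect the stored column names (the exact statement is our suggestion) is Snowflake's GET_DDL function, which returns the CREATE TABLE statement with each column name in its stored case:

```sql
SELECT GET_DDL('TABLE', 'CUST_INVC');
```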

To address this, we created a stored procedure that automatically renames columns to ensure case-insensitive matching. The stored procedure scans through the schema, identifies columns with lowercase names, and renames them to uppercase, ensuring consistency across tables and preventing errors due to case sensitivity.
Here’s the stored procedure we created to handle this issue:
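The original procedure is not reproduced here, so the following is a minimal Snowflake Scripting sketch of the same idea; the procedure name, return message, and details are illustrative:

```sql
CREATE OR REPLACE PROCEDURE RENAME_COLS_TO_UPPER(TBL_NAME VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    rs RESULTSET;
BEGIN
    -- Find columns whose stored names are not already uppercase
    rs := (SELECT column_name
           FROM information_schema.columns
           WHERE table_name = :TBL_NAME
             AND column_name <> UPPER(column_name));
    FOR rec IN rs DO
        -- Quote the old case-sensitive name; the new unquoted name folds to uppercase
        EXECUTE IMMEDIATE 'ALTER TABLE ' || TBL_NAME ||
                          ' RENAME COLUMN "' || rec.column_name ||
                          '" TO ' || UPPER(rec.column_name);
    END FOR;
    RETURN 'Renamed lowercase columns in ' || TBL_NAME;
END;
$$;

CALL RENAME_COLS_TO_UPPER('CUST_INVC');
```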


The procedure generates ALTER TABLE ... RENAME COLUMN statements that convert the case-sensitive (lowercase) column names to their case-insensitive uppercase equivalents across multiple tables.
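For the CUST_INVC table, the generated statements would look like this (illustrative; the actual statements depend on the lowercase columns found):

```sql
ALTER TABLE CUST_INVC RENAME COLUMN "crid" TO CRID;
ALTER TABLE CUST_INVC RENAME COLUMN "ssn" TO SSN;
```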

Creating a Unified View with Schema Evolution
Now that we’ve set up our tables with schema evolution for CSV, Parquet, and JSON data, we can create a unified view that combines data from all three tables. Here’s how we create the CUSTOMER_DATA_VIEW:
CREATE OR REPLACE VIEW CUSTOMER_DATA_VIEW AS
SELECT
  A.CUST_NUM, A.CUST_STAT, A.CUST_BAL, A.INV_NO, A.CRID, A.SSN,
  B.ACCESSORY, B.STATUS AS ACCESS_STATUS,
  C.ORDERID
FROM CUST_INVC A
LEFT JOIN Accessory_parquet B ON A.CUST_NUM = B.CUST_ID
LEFT JOIN Cust_orders_json C ON A.CUST_NUM = C.CUSTOMERID;
This view consolidates data from the three different sources, showcasing how schema evolution seamlessly integrates data with different structures into a unified view.
Conclusion
Schema evolution is a vital feature that allows data pipelines to remain flexible and resilient as data structures change over time. Whether dealing with CSV, Parquet, or JSON data, schema evolution ensures that your data processing workflows continue to function smoothly, even when new columns are added or removed. However, challenges such as case sensitivity can arise when creating views or queries. In such cases, a stored procedure like the one we created can help resolve these issues by renaming columns to ensure consistent matching.