
Introduction

Managing schema changes is a critical aspect of maintaining data integrity and consistency in dynamic data environments. When using Iceberg tables, every Data Definition Language (DDL) operation triggers the generation of a new metadata JSON file that captures the updated structure. Tracking these schema modifications, such as the addition, deletion, or alteration of columns, is essential for audit purposes.

This article outlines a process for efficiently tracking schema changes in Iceberg tables by leveraging Snowflake’s powerful metadata storage capabilities. We will demonstrate how to automatically detect and log schema modifications in an audit table every time the table structure changes, using metadata files stored in a Snowflake stage.

Problem Statement

Consider a scenario where you create an Iceberg table with a specific number of columns. Over time, you perform various DDL operations, such as adding or deleting columns. Each operation generates a new metadata JSON file in the respective Snowflake stage. Our goal is to track these changes by identifying newly added or deleted columns and storing these changes in an audit table for historical tracking.

Solution Overview

The solution involves three main steps:

  1. Extracting the latest metadata file: Every time a DDL operation occurs, a new metadata JSON file is generated in the Snowflake stage. We retrieve the most recent file to access the updated schema.
  2. Inserting the current schema: We parse the JSON metadata and store the latest DDL information in a schema_tracking table.
  3. Tracking schema changes: By comparing the current schema with the previously stored version, we log any added or deleted columns in a schema_change_tracking table, recording each change along with its type (addition or deletion).

Let’s dive into the details of each step and see how to implement the solution.
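For context, an Iceberg metadata JSON file exposes exactly the pieces this solution reads: the current schema ID, the last sequence number, and the field list. Abridged, with illustrative column values, it looks roughly like this:

```json
{
  "format-version": 2,
  "last-sequence-number": 1,
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "fields": [
        { "id": 1, "name": "ID",    "type": "int",    "required": false },
        { "id": 2, "name": "NAME",  "type": "string", "required": false },
        { "id": 3, "name": "PHONE", "type": "string", "required": false }
      ]
    }
  ]
}
```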

Implementation:

Step 1: Schema Tracking Table

The first table, schema_tracking, is used to store the current structure of the table after each DDL change. The key columns are the schema ID, field ID, field name, field type, and the file name.

CREATE OR REPLACE TABLE schema_tracking (
schema_id INT,
last_seq_num INT,
field_id INT,
field_name STRING,
field_type STRING,
file_name STRING,
last_updated TIMESTAMP
);

Step 2: Schema Change Tracking Table

The second table, schema_change_tracking, logs any changes detected in the schema by recording whether a column was added or deleted, along with a timestamp and the file name.

CREATE OR REPLACE TABLE schema_change_tracking (
schema_id INT,
last_seq_num INT,
field_id INT,
field_name STRING,
field_type STRING,
change_type STRING,  -- Specifies 'ADDED' or 'DELETED'
change_timestamp TIMESTAMP,
file_name STRING  -- Store the file name
);

Step 3: Stored Procedure to Track Schema Changes

The heart of this solution is a Snowflake stored procedure that automates the process of detecting schema changes. This procedure performs the following tasks:

  1. Identify the latest metadata file: The procedure retrieves and parses the latest JSON metadata file from the specified Snowflake stage.
  2. Store the current schema: The current schema structure is stored in the schema_tracking table.
  3. Detect changes: By comparing the current schema with the previous version, the procedure identifies new and deleted fields and logs these changes in the schema_change_tracking table.

Here is the stored procedure that performs these tasks:

Iceberg Proc
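The exact procedure is shown in the screenshot above; a minimal Snowflake Scripting sketch of the same three tasks might look like the following. This is an illustrative sketch, not the author's code: it assumes the metadata files live in a stage named @iceberg_stage with a directory table enabled, that a JSON file format named json_fmt exists, and that last_seq_num increases with every new metadata file.

```sql
CREATE OR REPLACE PROCEDURE track_schema_changes()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  latest_file STRING;
  cur_seq INT;
  prev_seq INT;
BEGIN
  -- 1. Find the most recently modified metadata JSON file in the stage.
  SELECT relative_path INTO :latest_file
    FROM DIRECTORY(@iceberg_stage)
   WHERE relative_path ILIKE '%.metadata.json'
   ORDER BY last_modified DESC
   LIMIT 1;

  -- 2. Parse the file and snapshot the current schema into schema_tracking.
  --    The field list sits under schemas[current-schema-id].fields; the file
  --    name must be spliced into the query, hence EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE
    'INSERT INTO schema_tracking ' ||
    'SELECT $1:"current-schema-id"::INT, $1:"last-sequence-number"::INT, ' ||
    '  f.value:"id"::INT, f.value:"name"::STRING, f.value:"type"::STRING, ' ||
    '  ''' || :latest_file || ''', CURRENT_TIMESTAMP() ' ||
    'FROM @iceberg_stage/' || :latest_file || ' (FILE_FORMAT => ''json_fmt''), ' ||
    'LATERAL FLATTEN(input => GET($1:"schemas", $1:"current-schema-id"):"fields") f';

  SELECT MAX(last_seq_num) INTO :cur_seq FROM schema_tracking;
  SELECT MAX(last_seq_num) INTO :prev_seq
    FROM schema_tracking WHERE last_seq_num < :cur_seq;

  -- 3a. Fields in the current snapshot with no earlier occurrence -> ADDED.
  --     On the first run prev_seq is NULL, so every field is logged as ADDED.
  INSERT INTO schema_change_tracking
  SELECT cur.schema_id, cur.last_seq_num, cur.field_id, cur.field_name,
         cur.field_type, 'ADDED', CURRENT_TIMESTAMP(), cur.file_name
    FROM schema_tracking cur
    LEFT JOIN schema_tracking prev
      ON prev.field_id = cur.field_id AND prev.last_seq_num = :prev_seq
   WHERE cur.last_seq_num = :cur_seq
     AND prev.field_id IS NULL;

  -- 3b. Fields in the previous snapshot missing from the current one -> DELETED.
  INSERT INTO schema_change_tracking
  SELECT prev.schema_id, :cur_seq, prev.field_id, prev.field_name,
         prev.field_type, 'DELETED', CURRENT_TIMESTAMP(), :latest_file
    FROM schema_tracking prev
    LEFT JOIN schema_tracking cur
      ON cur.field_id = prev.field_id AND cur.last_seq_num = :cur_seq
   WHERE prev.last_seq_num = :prev_seq
     AND cur.field_id IS NULL;

  RETURN 'Tracked schema changes from ' || :latest_file;
END;
$$;
```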

Testing:

  1. Upload the initial metadata file to the stage. This file is generated when the ICEBERG table is created.
First metadata file
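To stage a metadata file manually during testing, a PUT command run from a client such as SnowSQL would look something like this; the local path and stage name are placeholders:

```sql
-- Upload the metadata file without compressing it, so the JSON stays readable.
PUT file:///tmp/iceberg/v1.metadata.json @iceberg_stage AUTO_COMPRESS = FALSE;

-- If the stage has a directory table, refresh it so the new file is visible.
ALTER STAGE iceberg_stage REFRESH;
```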

Execute the Procedure:

call track_schema_changes();

Audit table populated:

select * from schema_change_tracking;

First Execution

2. Upload the second metadata file to the stage. This file is generated when we alter the ICEBERG table to add a new column.

Second metadata file

Execute the Procedure:

call track_schema_changes();

Audit table populated:

select * from schema_change_tracking;

Second Execution

3. Upload the third metadata file to the stage. This file is generated when we alter the ICEBERG table to add a new column FAX and drop the existing PHONE column.

Third Metadata file

Execute the Procedure:

call track_schema_changes();

Audit table populated:

select * from schema_change_tracking;

Third Execution

Files in Stage:

Iceberg Stage

Step 4: Implementing the Process

  • Execution Frequency: This procedure can be executed periodically using Snowflake tasks to monitor schema changes in near real time.
  • Audit Reporting: The schema_change_tracking table will serve as an audit log, providing a complete history of schema modifications, including the type of change (add or delete), the field affected, and the corresponding metadata file.
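For example, a Snowflake task could run the procedure on a schedule; the warehouse name and interval below are illustrative:

```sql
CREATE OR REPLACE TASK track_schema_changes_task
  WAREHOUSE = my_wh            -- placeholder warehouse name
  SCHEDULE = '5 MINUTE'        -- illustrative interval
AS
  CALL track_schema_changes();

-- Tasks are created suspended; resume the task to start the schedule.
ALTER TASK track_schema_changes_task RESUME;
```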

Conclusion

Tracking changes in Iceberg tables is crucial for maintaining accurate data governance and auditing. By leveraging Snowflake’s robust capabilities, we can automate the detection and logging of schema changes efficiently. This approach ensures that every change in your Iceberg table’s structure is captured and stored in an audit table, giving you a complete history of schema modifications over time.

 
