
Introduction

In modern data pipelines, especially on cloud data platforms like Snowflake, ingesting data from external systems such as AWS S3 is common. A critical question often arises: how do we ensure the data we receive from the source matches the data we load into Snowflake tables? This is where “Snowpark Magic: Auto-Validate Your S3 to Snowflake Data Loads” comes in: a practical approach to automating row-count validation between staged source files and their corresponding Snowflake tables, ensuring trust and integrity across your data pipelines.

In this blog, we introduce a Snowpark-powered Data Validation Framework that:

  • Dynamically reads data files (CSV) from an S3 stage.

  • Validates row counts between the source files and target Snowflake tables.

  • Skips already processed files (avoiding duplicates).

  • Maintains an audit trail through log tables for both file-level processing and validation results.

Why is This Framework Important?

  • Manually validating hundreds of files and tables is error-prone.

  • Without proper tracking, files risk being loaded more than once.

  • Automated checks leave a full audit trail in log tables.

  • The solution uses only Snowpark and SQL: native to Snowflake, with no extra infrastructure needed.

Validation Flow

Current Scenario:

  • We have an internal stage named DATALOAD (an external stage works just as well).
  • The stage contains different folders, each holding multiple files.

[Image: Folders in the stage]

  • Data from each folder is loaded into its respective table. Note that each table is created with the folder name for its domain; a loading sketch follows below.

[Image: Data loaded into a Snowflake table]
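To make the later snippets concrete, here is a minimal sketch of that load step with Snowpark. The connection details are placeholders, and the CSV file format CSV_FF (with SKIP_HEADER = 1), the folder name CUSTOMER, and the pre-existing target table are all assumptions for illustration:

from snowflake.snowpark import Session

# Placeholder connection details; substitute your own account settings.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Load one folder into its same-named table (folder name = table name).
folder = "CUSTOMER"  # hypothetical folder/table name
session.sql(f"""
    COPY INTO {folder}
    FROM @DATALOAD/{folder}/
    FILE_FORMAT = (FORMAT_NAME = 'CSV_FF')
""").collect()

The snippets that follow reuse this session object.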

Key Components

1️⃣ FILE_LOAD_LOG

Tracks every file that has been processed to avoid duplication.

CREATE OR REPLACE TABLE FILE_LOAD_LOG (
    FILE_NAME STRING,
    LOAD_DATE TIMESTAMP_NTZ
);
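Before loading, the framework can consult this log to skip files that were already processed. A minimal sketch, reusing the session from above and assuming the log stores file names in the same path format that LIST returns:

# Collect the set of file names already recorded in the log.
processed = {
    row["FILE_NAME"]
    for row in session.table("FILE_LOAD_LOG").select("FILE_NAME").collect()
}

staged = session.sql("LIST @DATALOAD").collect()  # one row per staged file
new_files = [row["name"] for row in staged if row["name"] not in processed]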

2️⃣ LOAD_VALIDATION_LOG

Captures the validation result for each table.

CREATE OR REPLACE TABLE LOAD_VALIDATION_LOG (
    TABLE_NAME STRING,
    FILE_COUNT NUMBER,
    FILE_ROW_COUNT NUMBER,
    SNOWFLAKE_ROW_COUNT NUMBER,
    RECORD_MISMATCH NUMBER,
    LOAD_DATE TIMESTAMP_NTZ
);

How It Works

  1. Reads the list of files from the defined stage (external S3 or internal).

  2. Skips already processed files using the FILE_LOAD_LOG.

  3. Groups files by folder name (each folder represents a target table).

  4. Counts rows in the source files and compares them with the target Snowflake table.

  5. Logs mismatches, matches, and skipped files into the LOAD_VALIDATION_LOG.

Core Snowpark Validation Code (Python Script):

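Below is a minimal sketch of the validation logic, wiring the five steps above together. It reuses the session created earlier and the hypothetical CSV_FF file format; it is an illustrative sketch, not a production-hardened script:

from collections import defaultdict

# 1. Read all files currently sitting in the stage.
staged_files = session.sql("LIST @DATALOAD").collect()

# 2. Skip files already recorded in FILE_LOAD_LOG.
processed = {r["FILE_NAME"] for r in session.table("FILE_LOAD_LOG").collect()}
pending = [r["name"] for r in staged_files if r["name"] not in processed]

# 3. Group files by folder name; each folder maps to a target table.
#    LIST paths typically look like 'dataload/<folder>/<file>.csv'.
by_table = defaultdict(list)
for path in pending:
    by_table[path.split("/")[-2].upper()].append(path)

# 4. Count rows in the source files and compare with the target table.
for table, files in by_table.items():
    file_row_count = 0
    for path in files:
        relative = "/".join(path.split("/")[1:])  # strip the stage prefix
        cnt = session.sql(
            f"SELECT COUNT(*) AS C FROM @DATALOAD/{relative} "
            f"(FILE_FORMAT => 'CSV_FF')"
        ).collect()[0]["C"]
        file_row_count += cnt
    sf_row_count = session.table(table).count()
    mismatch = file_row_count - sf_row_count

    # 5. Log the validation result and mark each file as processed.
    session.sql(
        f"INSERT INTO LOAD_VALIDATION_LOG VALUES ('{table}', {len(files)}, "
        f"{file_row_count}, {sf_row_count}, {mismatch}, CURRENT_TIMESTAMP())"
    ).collect()
    for path in files:
        session.sql(
            f"INSERT INTO FILE_LOAD_LOG VALUES ('{path}', CURRENT_TIMESTAMP())"
        ).collect()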

Execute the code and observe the output:

[Image: Snowpark output]

Validate the data in the FILE_LOAD_LOG table:

[Image: FILE_LOAD_LOG contents]

Validate the data in the LOAD_VALIDATION_LOG table:

[Image: LOAD_VALIDATION_LOG contents]
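Mismatched loads can also be pulled straight from the log with a quick Snowpark query (a sketch reusing the same session):

# Show only tables where source and target row counts differ.
session.table("LOAD_VALIDATION_LOG") \
    .filter("RECORD_MISMATCH != 0") \
    .select("TABLE_NAME", "FILE_ROW_COUNT", "SNOWFLAKE_ROW_COUNT", "RECORD_MISMATCH") \
    .show()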

Potential Enhancements

  • Support for Parquet and JSON formats.

  • An alerting mechanism using Snowflake Alerts or notifications; a sketch follows below.
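For example, a Snowflake Alert could watch the validation log and record a flag whenever a mismatch appears. A sketch only: the warehouse name, the schedule, and the LOAD_VALIDATION_ALERTS table are all hypothetical:

# Hourly alert that fires when a mismatch was logged in the last hour.
session.sql("""
    CREATE OR REPLACE ALERT MISMATCH_ALERT
      WAREHOUSE = COMPUTE_WH
      SCHEDULE = '60 MINUTE'
      IF (EXISTS (
          SELECT 1 FROM LOAD_VALIDATION_LOG
          WHERE RECORD_MISMATCH != 0
            AND LOAD_DATE > DATEADD('hour', -1, CURRENT_TIMESTAMP())
      ))
      THEN INSERT INTO LOAD_VALIDATION_ALERTS
           SELECT TABLE_NAME, RECORD_MISMATCH, CURRENT_TIMESTAMP()
           FROM LOAD_VALIDATION_LOG
           WHERE RECORD_MISMATCH != 0
""").collect()
session.sql("ALTER ALERT MISMATCH_ALERT RESUME").collect()  # alerts start suspended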

This Snowpark-based validation framework delivers automated data validation between staged files (external S3 or internal stages) and Snowflake tables. It’s not just about loading data; it’s about loading accurate data that you can trust.
