
Introduction

In modern data pipelines, especially on cloud data platforms like Snowflake, ingesting data from external systems such as AWS S3 is common. A critical question often arises: how do we ensure the data we receive from the source matches the data we load into Snowflake tables? This is where “Snowpark Magic: Auto-Validate Your S3 to Snowflake Data Loads” comes in: a practical approach to automating row-count validation between staged source files and their corresponding Snowflake tables, ensuring trust and integrity across your data pipelines.

In this blog, we introduce a Snowpark-powered Data Validation Framework that:

  • Dynamically reads data files (CSV) from an S3 stage.

  • Validates row counts between the source files and target Snowflake tables.

  • Skips already processed files (avoiding duplicates).

  • Maintains an audit trail through log tables for both file-level processing and validation results.

Why is This Framework Important?

  • Manually validating hundreds of files and tables is error-prone.

  • Without proper tracking, files risk being loaded more than once.

  • Automated checks leave a full audit trail in log tables.

  • The solution uses only Snowpark and SQL: native to Snowflake, with no extra infrastructure needed.

Validation Flow

Current Scenario:

  • We have an internal stage named DATALOAD (an external stage works just as well).
  • The stage contains different folders, each holding multiple files.

[Image: Folders in the stage]

  • Data from each folder is loaded into its respective table. Note that each table is created with the folder name for its domain; a loading sketch follows below.

[Image: Data loaded into a Snowflake table]
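To make the later snippets concrete, here is a minimal sketch of that load step with Snowpark. The connection details are placeholders, and the CSV file format CSV_FF (with SKIP_HEADER = 1), the folder name CUSTOMER, and the pre-existing target table are all assumptions for illustration:

from snowflake.snowpark import Session

# Placeholder connection details; substitute your own account settings.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Load one folder into its same-named table (folder name = table name).
folder = "CUSTOMER"  # hypothetical folder/table name
session.sql(f"""
    COPY INTO {folder}
    FROM @DATALOAD/{folder}/
    FILE_FORMAT = (FORMAT_NAME = 'CSV_FF')
""").collect()

The snippets that follow reuse this session object.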

Key Components

1️⃣ FILE_LOAD_LOG

Tracks every file that has been processed to avoid duplication.

CREATE OR REPLACE TABLE FILE_LOAD_LOG (
    FILE_NAME STRING,
    LOAD_DATE TIMESTAMP_NTZ
);
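Before loading, the framework can consult this log to skip files that were already processed. A minimal sketch, reusing the session from above and assuming the log stores file names in the same path format that LIST returns:

# Collect the set of file names already recorded in the log.
processed = {
    row["FILE_NAME"]
    for row in session.table("FILE_LOAD_LOG").select("FILE_NAME").collect()
}

staged = session.sql("LIST @DATALOAD").collect()  # one row per staged file
new_files = [row["name"] for row in staged if row["name"] not in processed]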

2️⃣ LOAD_VALIDATION_LOG

Captures the validation result for each table.

CREATE OR REPLACE TABLE LOAD_VALIDATION_LOG (
    TABLE_NAME STRING,
    FILE_COUNT NUMBER,
    FILE_ROW_COUNT NUMBER,
    SNOWFLAKE_ROW_COUNT NUMBER,
    RECORD_MISMATCH NUMBER,
    LOAD_DATE TIMESTAMP_NTZ
);

How It Works

  1. Reads the list of files from the defined stage (external S3 or internal).

  2. Skips already processed files using the FILE_LOAD_LOG.

  3. Groups files by folder name (each folder represents a target table).

  4. Counts rows in the source files and compares them with the target Snowflake table.

  5. Logs mismatches, matches, and skipped files into the LOAD_VALIDATION_LOG.

Core Snowpark Validation Code (Python Script):

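Below is a minimal sketch of the validation logic, wiring the five steps above together. It reuses the session created earlier and the hypothetical CSV_FF file format; it is an illustrative sketch, not a production-hardened script:

from collections import defaultdict

# 1. Read all files currently sitting in the stage.
staged_files = session.sql("LIST @DATALOAD").collect()

# 2. Skip files already recorded in FILE_LOAD_LOG.
processed = {r["FILE_NAME"] for r in session.table("FILE_LOAD_LOG").collect()}
pending = [r["name"] for r in staged_files if r["name"] not in processed]

# 3. Group files by folder name; each folder maps to a target table.
#    LIST paths typically look like 'dataload/<folder>/<file>.csv'.
by_table = defaultdict(list)
for path in pending:
    by_table[path.split("/")[-2].upper()].append(path)

# 4. Count rows in the source files and compare with the target table.
for table, files in by_table.items():
    file_row_count = 0
    for path in files:
        relative = "/".join(path.split("/")[1:])  # strip the stage prefix
        cnt = session.sql(
            f"SELECT COUNT(*) AS C FROM @DATALOAD/{relative} "
            f"(FILE_FORMAT => 'CSV_FF')"
        ).collect()[0]["C"]
        file_row_count += cnt
    sf_row_count = session.table(table).count()
    mismatch = file_row_count - sf_row_count

    # 5. Log the validation result and mark each file as processed.
    session.sql(
        f"INSERT INTO LOAD_VALIDATION_LOG VALUES ('{table}', {len(files)}, "
        f"{file_row_count}, {sf_row_count}, {mismatch}, CURRENT_TIMESTAMP())"
    ).collect()
    for path in files:
        session.sql(
            f"INSERT INTO FILE_LOAD_LOG VALUES ('{path}', CURRENT_TIMESTAMP())"
        ).collect()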

Execute the code and observe the output:

[Image: Snowpark output]

Validate the data in the FILE_LOAD_LOG table:

[Image: FILE_LOAD_LOG contents]

Validate the data in the LOAD_VALIDATION_LOG table:

[Image: LOAD_VALIDATION_LOG contents]
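Mismatched loads can also be pulled straight from the log with a quick Snowpark query (a sketch reusing the same session):

# Show only tables where source and target row counts differ.
session.table("LOAD_VALIDATION_LOG") \
    .filter("RECORD_MISMATCH != 0") \
    .select("TABLE_NAME", "FILE_ROW_COUNT", "SNOWFLAKE_ROW_COUNT", "RECORD_MISMATCH") \
    .show()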

Potential Enhancements

  • Support for Parquet and JSON formats.

  • An alerting mechanism using Snowflake Alerts or notifications; a sketch follows below.
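For example, a Snowflake Alert could watch the validation log and record a flag whenever a mismatch appears. A sketch only: the warehouse name, the schedule, and the LOAD_VALIDATION_ALERTS table are all hypothetical:

# Hourly alert that fires when a mismatch was logged in the last hour.
session.sql("""
    CREATE OR REPLACE ALERT MISMATCH_ALERT
      WAREHOUSE = COMPUTE_WH
      SCHEDULE = '60 MINUTE'
      IF (EXISTS (
          SELECT 1 FROM LOAD_VALIDATION_LOG
          WHERE RECORD_MISMATCH != 0
            AND LOAD_DATE > DATEADD('hour', -1, CURRENT_TIMESTAMP())
      ))
      THEN INSERT INTO LOAD_VALIDATION_ALERTS
           SELECT TABLE_NAME, RECORD_MISMATCH, CURRENT_TIMESTAMP()
           FROM LOAD_VALIDATION_LOG
           WHERE RECORD_MISMATCH != 0
""").collect()
session.sql("ALTER ALERT MISMATCH_ALERT RESUME").collect()  # alerts start suspended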

This Snowpark-based validation framework delivers automated data validation between staged files (external S3 or internal stages) and Snowflake tables. It’s not just about loading data; it’s about loading accurate data that you can trust.
