In data-driven organizations, File Archival in Snowflake: A Snowpark-Powered Solution has become a game-changer. Handling feed files in data pipelines is a critical task for many organizations. These files, often stored in stages such as Amazon S3 or Snowflake internal stages, are the backbone of data ingestion workflows. Once processed, these files need to be archived to ensure cleanliness in the staging area and facilitate historical tracking. Without a proper archival strategy, these files can clutter staging areas, leading to operational challenges.
Snowflake, with its robust Snowpark API and features like COPY FILES and REMOVE, allows us to develop fully managed solutions without relying on external automation like AWS Lambda or lifecycle policies.
This blog explores a real-world use case where a Snowpark stored procedure automates the movement of processed feed files from one stage to another (either internal or external). This approach simplifies workflows, reduces manual intervention, and ensures a clean staging area.
Use Case:
Imagine a retail company that processes daily feed files containing transactional data. These files are staged in an internal staging location (@S3_STAGE/inbound/). After processing, the business needs to archive these files to:
- An internal archive stage (@S3_STAGE/archive/) .
- An external stage (@ext_csv_stage) for sharing with third-party vendors.
Solution Overview
Using a Snowpark stored procedure, we can:
- Dynamically choose between internal or external archiving.
- Move processed files using Snowflake’s COPY FILES command.
- Keep the staging area clean by generating REMOVE commands for execution.
Technical Implementation:
Technical Details:
Below is the Python code for the stored procedure, which automates file movement based on user preference:
Execution:
Case1: Files in Internal staging and snowpark will move to S3 external staging area.
No Files in External stage.
Execute the Procedure:
Files Archived in External stage.
Files Archived in Internal stage:
Case2: Files in Internal staging and snowpark will move to Internal staging area in archive folder.
Execute the Procedure:
Internal stage Archive Folder:
Enhance User Interaction:
Integrate this procedure with a Streamlit application to allow users to dynamically select archive types (internal or external).
Also to make it more dynamic we can have below code in our snowpark:
def main(session: snowpark.Session ,archive_type: str):
and deploy the stored procedure with execute as CALLER and can call the procedure dynamically values.
CALL ARCHIVE_FILE(‘internal’);
CALL ARCHIVE_FILE(‘external’);