Introduction
In today’s data-driven world, organizations rely on automated pipelines to handle unstructured and semi-structured data formats like PDFs. Snowflake provides powerful tools such as directory tables, streams, and Python UDFs to seamlessly process these files, making it easy to extract actionable insights.
This blog walks through a practical use case in which directory tables track file-level metadata, streams detect new file uploads, and a Python UDF extracts specific details from PDF files—all within Snowflake’s unified platform.
Pipeline Overview
The pipeline consists of the following components:
- Stage: Stores PDF files and tracks their metadata using directory tables.
- Stream: Monitors changes in the stage (e.g., newly added files).
- Python UDF: Parses the PDF files and extracts key details.
- Task: Automates the process of inserting the extracted data into a Snowflake table.
Use Case: Insurance Policy Document Processing
Scenario
A global insurance provider receives thousands of car insurance policy documents in PDF format. Each document contains critical details such as:
- Policy Number
- Intermediary Code
- Intermediary Name
- Contact
- Gross Premium
- Invoice Details
The organization needs to:
- Automatically detect and process newly uploaded PDFs.
- Extract relevant details from the PDFs.
- Load the extracted data into a structured Snowflake table for analytics and reporting.
PDF Snapshot:
Technical Implementation:
1. Directory Tables for File Metadata Tracking: Enable a directory table on the stage to track file-level metadata.
CREATE OR REPLACE STAGE Insurance_stage
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = TRUE);
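Once PDFs are uploaded and the directory table is refreshed, file metadata can be queried directly from the directory table:

ALTER STAGE Insurance_stage REFRESH;

SELECT relative_path, size, last_modified
FROM DIRECTORY(@Insurance_stage);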
A stream monitors changes in the stage and captures metadata for new files.
CREATE OR REPLACE STREAM insurance_stream ON STAGE Insurance_stage;
2. Python UDF to Extract PDF Details
Using Snowflake’s Python integration, we create a function to parse PDF files and extract relevant fields. This function reads the PDF content and extracts details such as Policy Number, Intermediary Code, and Gross Premium.
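A minimal sketch of such a UDF is shown below. It assumes the PDFs are text-based (not scanned images) and that the snowflake-snowpark-python and pypdf2 packages are available to the UDF; the function name extract_policy_details and the regular-expression patterns are illustrative and must be adapted to the actual document layout.

CREATE OR REPLACE FUNCTION extract_policy_details(file_path STRING)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pypdf2')
HANDLER = 'extract_details'
AS
$$
import re
from PyPDF2 import PdfReader
from snowflake.snowpark.files import SnowflakeFile

def extract_details(file_path):
    # Open the staged file via its scoped URL and collect the text of every page
    with SnowflakeFile.open(file_path, 'rb') as f:
        reader = PdfReader(f)
        text = ''.join(page.extract_text() or '' for page in reader.pages)

    # Illustrative patterns -- adjust to the real policy document layout
    patterns = {
        'policy_number':     r'Policy Number[:\s]+(\S+)',
        'intermediary_code': r'Intermediary Code[:\s]+(\S+)',
        'intermediary_name': r'Intermediary Name[:\s]+(.+)',
        'contact':           r'Contact[:\s]+(\S+)',
        'gross_premium':     r'Gross Premium[:\s]+([\d.,]+)',
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        result[key] = match.group(1).strip() if match else None
    return result
$$;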
3. Automating the Pipeline Using a Task
A task automates the pipeline by running periodically and inserting the parsed PDF data into a structured table.
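A minimal sketch of such a task is shown below, assuming a warehouse named compute_wh, a target table car_insurance, and the extract_policy_details UDF from the previous step; the schedule and column list are illustrative.

CREATE OR REPLACE TASK insurance_task
  WAREHOUSE = compute_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('insurance_stream')
AS
  INSERT INTO car_insurance (policy_number, intermediary_code, intermediary_name, contact, gross_premium)
  SELECT
    details:policy_number::STRING,
    details:intermediary_code::STRING,
    details:intermediary_name::STRING,
    details:contact::STRING,
    TRY_TO_NUMBER(REPLACE(details:gross_premium::STRING, ',', ''), 12, 2)
  FROM (
    -- Parse only the files newly captured by the stream
    SELECT extract_policy_details(
             BUILD_SCOPED_FILE_URL(@Insurance_stage, relative_path)
           ) AS details
    FROM insurance_stream
    WHERE METADATA$ACTION = 'INSERT'
  );

-- Tasks are created suspended; resume it to start the schedule
ALTER TASK insurance_task RESUME;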
4. Testing the Pipeline
a. Upload a PDF file to the stage and refresh the directory table.
b. Verify that the stream captured the new file.
c. Verify the extracted data.
d. Confirm that the task loaded the parsed data into the car_insurance table.
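The statements below sketch these verification steps; the local file path is a placeholder, and the PUT command must be run from a client that supports it (e.g., SnowSQL).

-- a. Upload a PDF and refresh the directory table so the stream sees it
PUT file:///tmp/policy_1234.pdf @Insurance_stage AUTO_COMPRESS = FALSE;
ALTER STAGE Insurance_stage REFRESH;

-- b. Verify the stream captured the new file
SELECT relative_path, size, last_modified FROM insurance_stream;

-- c./d. After the task runs, verify the extracted data in the target table
SELECT * FROM car_insurance;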