
Snowflake’s PARSE_DOCUMENT function revolutionizes how unstructured data, such as PDF files, is processed within the Snowflake ecosystem. Traditionally, this function is used within SQL to extract structured content from documents. However, I’ve taken this a step further, leveraging Snowpark to extend its capabilities and build a complete data extraction process.

This blog explores how you can leverage the power of PARSE_DOCUMENT with Snowpark, showcasing a use case to extract, clean, and process data from PDF documents.

Why Use PARSE_DOCUMENT?

The PARSE_DOCUMENT function extracts text from supported file types such as PDFs and converts it into readable formats within Snowflake. Its SQL usage is straightforward, making it easy to integrate into existing pipelines.

For example, a SQL query might look like this:

SELECT relative_path AS file_name, TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@mystage, relative_path, OBJECT_CONSTRUCT('mode', 'LAYOUT'))) AS raw_text
FROM DIRECTORY(@mystage);


While SQL-based processing is effective, it often lacks flexibility when applying complex transformations or validations. This is where Snowpark shines, offering Python’s versatility while operating seamlessly within Snowflake.

Extending PARSE_DOCUMENT with Snowpark

Using Snowpark, we can:

  1. Process and validate extracted content dynamically.
  2. Apply advanced data cleansing and transformation logic using Python.
  3. Automate structured data insertion into Snowflake tables for downstream analytics.

Use Case: Extracting Insurance Data from PDFs

Imagine a scenario where an insurance company receives thousands of policy documents daily. These documents contain key details such as policy numbers, holder names, and amounts. Our goal is to:

  1. Extract the raw text using PARSE_DOCUMENT.
  2. Process and validate key fields such as policy numbers, holder names, and financial amounts.
  3. Store the cleaned data in a structured format for analysis.

Step 1: Extract Raw Data Using PARSE_DOCUMENT

First, PDFs are uploaded to a Snowflake stage. Using the PARSE_DOCUMENT function, we extract the raw text into a staging table called pdf_feedback:

INSERT INTO pdf_feedback (file_name, raw_text)
SELECT
relative_path AS file_name,
TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@mystage, relative_path, OBJECT_CONSTRUCT('mode', 'LAYOUT'))) AS raw_text
FROM DIRECTORY(@mystage);


Step 2: Process Data with Snowpark

The raw text often contains noise and extraneous details alongside the fields we need. Using Snowpark, we:

  • Parse key fields (e.g., policy holder name, policy number).
  • Apply regex-based validation and cleansing.
  • Handle inconsistencies like missing or malformed data.

Here’s the Snowpark code:

[Screenshot: Snowpark PDF processing code]
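Since the code above appears only as a screenshot, here is a minimal sketch of the kind of regex-based parsing and cleansing logic it describes. The field names and patterns are illustrative assumptions, not the exact layout of any real policy document:

```python
import re

# Illustrative patterns -- the actual document layout may differ.
POLICY_RE = re.compile(r"Policy\s*(?:Number|No\.?)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE)
HOLDER_RE = re.compile(r"Policy\s*Holder\s*:?\s*([A-Za-z .'-]+)", re.IGNORECASE)
AMOUNT_RE = re.compile(r"Amount\s*:?\s*\$?([\d,]+(?:\.\d{2})?)", re.IGNORECASE)

def parse_policy_fields(raw_text: str) -> dict:
    """Extract policy number, holder name, and amount from raw PDF text.

    Missing or malformed fields come back as None so that downstream
    validation can handle them gracefully.
    """
    policy = POLICY_RE.search(raw_text)
    holder = HOLDER_RE.search(raw_text)
    amount = AMOUNT_RE.search(raw_text)
    return {
        "policy_number": policy.group(1) if policy else None,
        "holder_name": holder.group(1).strip() if holder else None,
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
    }
```

In Snowpark, a function like this can be applied to each row of the pdf_feedback staging table, for example by registering it as a UDF or by iterating over the collected rows in the stored procedure.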

Step 3: Validate and Store Extracted Data

The cleaned data is now stored in the pdf_extracted_data table, ready for analytics and reporting.
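A simple validation pass before the final write might look like the sketch below. The rule (require a policy number) is an illustrative assumption, and the commented line shows where a Snowpark save_as_table call would persist the result:

```python
def is_valid_record(rec: dict) -> bool:
    """A record without a policy number cannot be keyed, so drop it."""
    return bool(rec.get("policy_number"))

def keep_valid(records: list[dict]) -> list[dict]:
    """Filter out invalid records before loading pdf_extracted_data."""
    return [r for r in records if is_valid_record(r)]

# In Snowpark, the cleaned rows would then be persisted, e.g.:
# session.create_dataframe(cleaned).write.save_as_table("pdf_extracted_data", mode="append")
```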

[Screenshot: extracted data in pdf_extracted_data]

Benefits:

  1. Advanced Parsing with Regex: Using regex within Snowpark, we accurately extract key fields like the policy holder’s name while eliminating irrelevant text.
  2. Automated Data Validation: The logic ensures invalid entries, such as missing policy numbers, are handled gracefully.
