
Parallel Snowflake Jobs with Snowpark

The Challenge: Sequential Bottlenecks

Imagine you’re loading sales data from multiple regions into Snowflake. Your pipeline needs to:

  • Load North American sales data (2 minutes)
  • Load European sales data (2 minutes)
  • Load Asian sales data (2 minutes)
  • Update high-value customer segments (1 minute)
  • Update medium-value customer segments (1 minute)
  • Update low-value customer segments (1 minute)

Traditional sequential execution would take 9 minutes. But here's the key insight: these operations are completely independent. None of them depends on another's results. So why wait?

The Solution: Asynchronous Parallel Execution

Snowflake’s Snowpark API provides powerful async capabilities that let you run multiple independent queries simultaneously. With async execution, those same operations can complete in approximately 2 minutes – the time of your longest-running query.

Understanding the Async Pattern

The Core Components

collect() vs collect_nowait() vs create_async_job() (contrasted in the sketch after this list)

  • collect()
    Submits the statement and waits for it to finish; if it's a SELECT, it also fetches the result rows into the client.
  • collect_nowait()
    Submits the statement to Snowflake without waiting and returns immediately with an object that carries a query_id.
    Think "fire-and-forget (for now)".
  • session.create_async_job(query_id)
    This does not start a new query. It creates a client-side handle for an already-running query (identified by query_id).
    You use this handle to:
    • is_done() → check whether the query has completed
    • result() → fetch the outcome (result rows for a SELECT)
    • cancel() → stop the query if needed
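
To make the distinction concrete, here is a minimal sketch (it reuses the CUSTOMER_DATA sample table built below; any Snowpark Session works):

# Blocking: submits the query and waits; returns the rows for a SELECT.
rows = session.sql("SELECT COUNT(*) FROM DEMO_DB.PUBLIC.CUSTOMER_DATA").collect()

# Non-blocking: submits and returns immediately with an AsyncJob handle.
async_job = session.sql("SELECT COUNT(*) FROM DEMO_DB.PUBLIC.CUSTOMER_DATA").collect_nowait()
print(async_job.query_id)  # the "receipt" for the running query

# Re-attach to that same running query from its query_id.
handle = session.create_async_job(async_job.query_id)
if handle.is_done():
    print(handle.result())  # result rows for a SELECT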

Real-World Implementation


[Screenshot: Stages]

Sample table:

CREATE OR REPLACE TRANSIENT TABLE DEMO_DB.PUBLIC.CUSTOMER_DATA (
    CUSTOMER_ID     INT,
    SEGMENT         VARCHAR(50),
    LOYALTY_SCORE   INT,
    PURCHASE_TRENDS VARCHAR(100),
    UPDATED_TS      TIMESTAMP
);

INSERT INTO DEMO_DB.PUBLIC.CUSTOMER_DATA
SELECT
    SEQ4() AS CUSTOMER_ID,
    CASE
        WHEN MOD(SEQ4(), 3) = 0 THEN 'High Value'
        WHEN MOD(SEQ4(), 3) = 1 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS SEGMENT,
    NULL AS LOYALTY_SCORE,
    NULL AS PURCHASE_TRENDS,
    CURRENT_TIMESTAMP AS UPDATED_TS
FROM TABLE(GENERATOR(ROWCOUNT => 100000));

Here’s a practical example that loads regional sales data and updates customer segments in parallel:

[Screenshot: stored procedure code]
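
The full procedure was shared as an image; here is a minimal sketch of the handler logic it describes. The three UPDATE statements (and the loyalty scores in them) are illustrative assumptions, and the regional INSERT/COPY load statements would sit in the same dictionary:

from snowflake.snowpark import Session
import time

def run_parallel_jobs(session: Session) -> str:
    # Independent statements keyed by a label. The exact SQL is an
    # assumption; the regional load statements would be added alongside.
    region_queries = {
        "High Value": """UPDATE DEMO_DB.PUBLIC.CUSTOMER_DATA
            SET LOYALTY_SCORE = 90, UPDATED_TS = CURRENT_TIMESTAMP
            WHERE SEGMENT = 'High Value'""",
        "Medium Value": """UPDATE DEMO_DB.PUBLIC.CUSTOMER_DATA
            SET LOYALTY_SCORE = 60, UPDATED_TS = CURRENT_TIMESTAMP
            WHERE SEGMENT = 'Medium Value'""",
        "Low Value": """UPDATE DEMO_DB.PUBLIC.CUSTOMER_DATA
            SET LOYALTY_SCORE = 30, UPDATED_TS = CURRENT_TIMESTAMP
            WHERE SEGMENT = 'Low Value'""",
    }

    # Submit every statement without waiting, keeping a handle per job.
    async_jobs = []
    for region, query in region_queries.items():
        query_id = session.sql(query).collect_nowait().query_id
        async_jobs.append((region, session.create_async_job(query_id)))

    # Poll until all jobs report completion.
    while any(not job.is_done() for _, job in async_jobs):
        time.sleep(2)

    # Gather each job's outcome into a summary string.
    return "; ".join(f"{region}: {job.result()}" for region, job in async_jobs)

Registered as a Python stored procedure (for example with session.sproc.register), this runs entirely inside Snowflake.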

Let’s Break Down What We Just Built

Launching Jobs in Parallel

async_jobs = []
for region, query in region_queries.items():
    print(f"Starting data load for {region}...")
    print(query)
    query_id = session.sql(query).collect_nowait().query_id
    print(f"Submitted. Query ID = {query_id}")
    async_job = session.create_async_job(query_id)
    async_jobs.append((region, async_job))

What we did:

  1. Looped through all the independent operations
  2. Started each query with collect_nowait() (non-blocking)
  3. Captured the query_id (the receipt number for that running query inside Snowflake)
  4. Created an async job handle for each query
  5. Stored everything in a list for tracking

The Polling Loop

while any(not job.is_done() for _, job in async_jobs):
    print(f"Waiting for {sum(not job.is_done() for _, job in async_jobs)} jobs to complete...")

  • any(not job.is_done() …) → “Is any job still running?” If yes, keep waiting.
  • sum(not job.is_done() …) → “How many are still running?” (nice progress print)
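
In a production procedure you would also pause between checks and cap the total wait; a sketch (the 2-second interval and 10-minute ceiling are arbitrary choices, not from the original):

import time

waited, timeout_s = 0, 600
while any(not job.is_done() for _, job in async_jobs):
    print(f"Waiting for {sum(not job.is_done() for _, job in async_jobs)} jobs to complete...")
    time.sleep(2)   # avoid hammering Snowflake with status checks
    waited += 2
    if waited >= timeout_s:
        for _, job in async_jobs:
            if not job.is_done():
                job.cancel()   # stop stragglers past the deadline
        break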

Collecting Results (what you really get)

if job.is_done():
    result = job.result()

  • is_done() confirms completion.
  • result() gives the final outcome for that statement: the result rows for a SELECT, or a status row (such as the number of rows inserted or updated) for DML.
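
Gathering everything per job, with a try/except so one failed statement doesn't go unnoticed (a sketch):

for region, job in async_jobs:
    try:
        result = job.result()   # rows for a SELECT; status rows (e.g., rows updated) for DML
        print(f"{region}: {result}")
    except Exception as e:      # a query that errored out surfaces here
        print(f"{region} failed: {e}")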

Execute the procedure and check the output:

[Screenshot: procedure output]

Verify the query history:

[Screenshot: query history]

All 6 statements started within ~0.13 seconds of each other:

  • First start: 20:32:05.424
  • Last start: 20:32:05.552

That <200 ms spread shows they were submitted together and ran in parallel. The batch completed when the longest-running statement finished at 20:32:07.249.
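
You can reproduce this check yourself with the INFORMATION_SCHEMA.QUERY_HISTORY table function; a sketch (assumes DEMO_DB is the current database):

# List the most recent queries with their start and end times.
history = session.sql("""
    SELECT QUERY_ID, QUERY_TEXT, START_TIME, END_TIME
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 20))
    ORDER BY START_TIME DESC
""").collect()
for row in history:
    print(row["START_TIME"], row["QUERY_TEXT"][:60])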

Key Benefits

Dramatic Performance Improvements

  • Run independent operations simultaneously
  • Reduce pipeline execution time
  • Maximize Snowflake compute utilization
  • No idle time waiting for sequential operations
  • Efficient use of Snowflake credits
  • Native Snowflake integration

