Snowflake:Split Huge File via aws s3 cp

Sachin MittalJune 21, 2021 Snowflake Scenario

2

0

Read Time:1 Minute, 57 Second

In one of the project, we were required to process large volume of data. In typical scenario we split huge file into the chunks at our on-premise before uploading them to the S3 bucket. Later on use the AWS CLI to copy these files of data from on-premise infrastructure to S3 buckets .

But consider the scenario, If the data is already present on the S3 bucket (say complete huge file )then we need to download it of course at our local premise and then split this huge file and upload these chunks to the S3 location. But this is not a viable solution and not ideal when clients deliver to your S3 buckets a few GBs.

The cp command initiates a copy operation to or from Amazon S3. aws s3 cp will copy all files, even if they already exist in the destination area. It also will not delete files from your destination if they are deleted from the source.

We have tested the functionality on Linux version where we executed a shell script which has splitted the large files into the smaller chunks based on the number of lines in file.

Please see the below screenshot from S3 location where we have CSV file: Here we will be trying to split the final.csv file into smaller chunks of 1000 lines each.

hugefile

EC2 and Shell Script:

Login to the EC2 machine and create the shell script (Split_huge_file.sh) with below content.

Split_huge_file.sh:

S3_BUCKET=ec2s3
FILENAME=final.csv
INFILE=s3://”${S3_BUCKET}”/”${FILENAME}”

OUTFILE=s3://”${S3_BUCKET}”/”${FILENAME%%.*}”

echo $S3_BUCKET, $FILENAME, $INFILE, $OUTFILE

echo “Starting S3 File Splitter using line count of 10000 lines”

FILES=($(aws s3 cp “${INFILE}” – | split -d -l 10000 –filter “aws s3 cp – \”${OUTFILE}_\$FILE.csv\” | echo \”\$FILE.csv\””))

ec2login

Run the Shell Script:

Run Shell

After running the script, File got splitted into the multiple files with 10000 lines per file.

File Split on AWS

Later on we can use the COPY command

copy into final_table from @<<S3 External Area>>

pattern=’.*[.]csv’;

About Post Author

Sachin Mittal

sachin.mittal04@gmail.com

https://cloudyard.in

Happy

0 %

Sad

0 %

Excited

0 %

Sleepy

0 %

Angry

0 %

Surprise

0 %

Average Rating

5 Star

0%

4 Star

0%

3 Star

0%

2 Star

0%

1 Star

0%

(Add your review)

Leave a Reply Cancel reply

Snowflake → Slack: real-time alerts in 15 minutes

Snowflake Cortex Search Named Scoring Profiles: Ranking the Right Document First

Flow

Parse, Redact, Extract, Search, Ask: Conversational AI Agent

Persona-Driven Partial Decryption in Snowflake

Flow

Smart Insurance Claim Photo Triage System with Cortex AI

End To End Flow

Cortex AI + Custom .whl Files: An AI-Powered Contract Risk Analyzer

Custom Python Packages in Snowflake Notebooks Using .whl Files

MCP Arch

MCP Connects GitHub Copilot in VS Code to Snowflake Cortex Agents