
If you have a csv file that has ‘N’ number of columns and we are interested in load few columns. Following post will describe the steps to load Non Sequential Columns.
Though, Data insertion performance into Snowflake table with a SELECT statement from the .csv file is similar in performance to the COPY. But as recommended by Snowflake:
Snowflake recommended COPY approach to loading data into Snowflake with below reasons:
- Using the COPY command ,Snowflake automatically tracks, through MD5 file signature. The files that have already been loaded into a given table to prevent loading a specific file more than once.
- Also per my understanding in terms of Oracle using sqlloader function, COPY does not search for the free blocks in data base which takes time as Blocks are scattered inside the Database. Instead it allocates the new block at run time and stores the data instead of searching the free block.
Non-sequential columns
- Firstly, Create a Temp HEALTHCARE_CSV table with all columns to load the entire .csv file into it.
- Create the table you would like to load with the specific columns.
- Load your csv file into Temp table using COPY.
- Load target table using temp table.
create or replace storage integration s3_int type = external_stage storage_provider = s3 enabled = true storage_aws_role_arn = 'arn:aws:iam::xxxxxxxxxx:role/testsnowflake'
create or replace file format demo_db.public.csv_format_seq
type = csv
FIELD_DELIMITER = ','
skip_header = 1
null_if = ('NULL', 'null')
empty_field_as_null = true;
create or replace stage demo_db.public.ext_csv_stage
URL = 's3://testsnowflakepro/snowflake/csv'
STORAGE_INTEGRATION = s3_int
file_format = demo_db.public.csv_format_seq;
CREATE or replace temp TABLE HEALTHCARE_CSV(
AVERAGE_COVERED_CHARGES NUMBER(38,6)
,AVERAGE_TOTAL_PAYMENTS NUMBER(38,6)
,TOTAL_DISCHARGES NUMBER(38,0)
,BACHELORORHIGHER NUMBER(38,1)
,HSGRADORHIGHER NUMBER(38,1)
,TOTALPAYMENTS VARCHAR(128)
,REIMBURSEMENT VARCHAR(128)
,TOTAL_COVERED_CHARGES VARCHAR(128)
,REFERRALREGION_PROVIDER_NAME VARCHAR(256)
,REIMBURSEMENTPERCENTAGE NUMBER(38,9)
,DRG_DEFINITION VARCHAR(256)
,REFERRAL_REGION VARCHAR(26)
,INCOME_PER_CAPITA NUMBER(38,0)
,MEDIAN_EARNINGSBACHELORS NUMBER(38,0)
,MEDIAN_EARNINGS_GRADUATE NUMBER(38,0)
,MEDIAN_EARNINGS_HS_GRAD NUMBER(38,0)
,MEDIAN_EARNINGSLESS_THAN_HS NUMBER(38,0)
,MEDIAN_FAMILY_INCOME NUMBER(38,0)
,NUMBER_OF_RECORDS NUMBER(38,0)
,POP_25_OVER NUMBER(38,0)
,PROVIDER_CITY VARCHAR(128)
,PROVIDER_ID NUMBER(38,0)
,PROVIDER_NAME VARCHAR(256)
,PROVIDER_STATE VARCHAR(128)
,PROVIDER_STREET_ADDRESS VARCHAR(256)
,PROVIDER_ZIP_CODE NUMBER(38,0)
);
Create or replace table healthcare_target( AVERAGE_COVERED_CHARGES NUMBER(38,6)
,AVERAGE_TOTAL_PAYMENTS NUMBER(38,6)
,TOTAL_DISCHARGES NUMBER(38,0)
,NUMBER_OF_RECORDS NUMBER(38,0)
,POP_25_OVER NUMBER(38,0)
,PROVIDER_CITY VARCHAR(128)
,PROVIDER_ID NUMBER(38,0)
,PROVIDER_NAME VARCHAR(256)
,PROVIDER_STATE VARCHAR(128)
,PROVIDER_STREET_ADDRESS VARCHAR(256)
,PROVIDER_ZIP_CODE NUMBER(38,0) )
copy into healthcare_csv from @demo_db.public.ext_csv_stage
on_error = CONTINUE;
Insert into healthcare_target (AVERAGE_COVERED_CHARGES, AVERAGE_TOTAL_PAYMENTS
,TOTAL_DISCHARGES
,NUMBER_OF_RECORDS
,POP_25_OVER
,PROVIDER_CITY
,PROVIDER_ID
,PROVIDER_NAME
,PROVIDER_STATE
,PROVIDER_STREET_ADDRESS
,PROVIDER_ZIP_CODE)
select
AVERAGE_COVERED_CHARGES,
AVERAGE_TOTAL_PAYMENTS
,TOTAL_DISCHARGES
,NUMBER_OF_RECORDS
,POP_25_OVER
,PROVIDER_CITY
,PROVIDER_ID
,PROVIDER_NAME
,PROVIDER_STATE
,PROVIDER_STREET_ADDRESS
,PROVIDER_ZIP_CODE
from HEALTHCARE_CSV
Note : In above approach Snowflake would not enable to track the metadata (MD5 Signatures) of data files loaded into the Target table. If you execute the COPY command again then it will be executed successfully but no records would be copied again means no duplicate records:
Copy executed with 0 files processed.
To load the few Sequential columns from csv file, Click here.