2 0
Read Time:3 Minute, 8 Second
CSV-to-Snowflake

If you have a csv file that has ‘N’ number of columns and we are interested in load few columns. Following post will describe the steps to load Non Sequential Columns.

Though, Data insertion performance into  Snowflake table with a SELECT statement from the .csv file is similar in performance to the COPY. But as recommended by Snowflake:

Snowflake recommended COPY approach to loading data into Snowflake with below reasons:

  • Using the COPY command ,Snowflake automatically tracks, through MD5 file signature. The files that have already been loaded into a given table to prevent loading a specific file more than once.
  • Also per my understanding in terms of Oracle using sqlloader function, COPY does not search for the free blocks in data base which takes time as Blocks are scattered inside the Database. Instead it allocates the new block at run time and stores the data instead of searching the free block.

Non-sequential columns

  • Firstly, Create a Temp HEALTHCARE_CSV table with all columns to load the entire .csv file into it.
  • Create the table you would like to load with the specific columns.
  • Load your csv file into Temp table using COPY.
  • Load target table using temp table.
create or replace storage integration s3_int
  type = external_stage
  storage_provider = s3
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::xxxxxxxxxx:role/testsnowflake'
create or replace file format demo_db.public.csv_format_seq
                    type = csv
                     FIELD_DELIMITER = ','
                    skip_header = 1
                    null_if = ('NULL', 'null')
                    empty_field_as_null = true;
                               
create or replace stage demo_db.public.ext_csv_stage
  URL = 's3://testsnowflakepro/snowflake/csv'
  STORAGE_INTEGRATION = s3_int
  file_format = demo_db.public.csv_format_seq;
CREATE or replace temp TABLE HEALTHCARE_CSV(
    AVERAGE_COVERED_CHARGES    NUMBER(38,6)  
   ,AVERAGE_TOTAL_PAYMENTS    NUMBER(38,6)  
   ,TOTAL_DISCHARGES    NUMBER(38,0)  
   ,BACHELORORHIGHER    NUMBER(38,1)  
   ,HSGRADORHIGHER    NUMBER(38,1)  
   ,TOTALPAYMENTS    VARCHAR(128)  
   ,REIMBURSEMENT    VARCHAR(128)  
   ,TOTAL_COVERED_CHARGES    VARCHAR(128) 
   ,REFERRALREGION_PROVIDER_NAME    VARCHAR(256)  
   ,REIMBURSEMENTPERCENTAGE    NUMBER(38,9)  
   ,DRG_DEFINITION    VARCHAR(256)  
   ,REFERRAL_REGION    VARCHAR(26)  
   ,INCOME_PER_CAPITA    NUMBER(38,0)  
   ,MEDIAN_EARNINGSBACHELORS    NUMBER(38,0)  
   ,MEDIAN_EARNINGS_GRADUATE    NUMBER(38,0)  
   ,MEDIAN_EARNINGS_HS_GRAD    NUMBER(38,0)  
   ,MEDIAN_EARNINGSLESS_THAN_HS    NUMBER(38,0)  
   ,MEDIAN_FAMILY_INCOME    NUMBER(38,0)  
   ,NUMBER_OF_RECORDS    NUMBER(38,0)  
   ,POP_25_OVER    NUMBER(38,0)  
   ,PROVIDER_CITY    VARCHAR(128)  
   ,PROVIDER_ID    NUMBER(38,0)  
   ,PROVIDER_NAME    VARCHAR(256)  
   ,PROVIDER_STATE    VARCHAR(128)  
   ,PROVIDER_STREET_ADDRESS    VARCHAR(256)  
   ,PROVIDER_ZIP_CODE    NUMBER(38,0)  
);
Create or replace table healthcare_target( AVERAGE_COVERED_CHARGES    NUMBER(38,6)  
   ,AVERAGE_TOTAL_PAYMENTS    NUMBER(38,6)  
   ,TOTAL_DISCHARGES    NUMBER(38,0)  
   ,NUMBER_OF_RECORDS    NUMBER(38,0)  
   ,POP_25_OVER    NUMBER(38,0)  
   ,PROVIDER_CITY    VARCHAR(128)  
   ,PROVIDER_ID    NUMBER(38,0)  
   ,PROVIDER_NAME    VARCHAR(256)  
   ,PROVIDER_STATE    VARCHAR(128)  
   ,PROVIDER_STREET_ADDRESS    VARCHAR(256)  
   ,PROVIDER_ZIP_CODE    NUMBER(38,0)  )
copy into healthcare_csv from @demo_db.public.ext_csv_stage
on_error = CONTINUE;
Insert into healthcare_target (AVERAGE_COVERED_CHARGES, AVERAGE_TOTAL_PAYMENTS 
   ,TOTAL_DISCHARGES
   ,NUMBER_OF_RECORDS 
   ,POP_25_OVER
   ,PROVIDER_CITY 
   ,PROVIDER_ID  
   ,PROVIDER_NAME  
   ,PROVIDER_STATE 
   ,PROVIDER_STREET_ADDRESS  
   ,PROVIDER_ZIP_CODE)
   select 
   AVERAGE_COVERED_CHARGES, 
   AVERAGE_TOTAL_PAYMENTS 
   ,TOTAL_DISCHARGES
   ,NUMBER_OF_RECORDS 
   ,POP_25_OVER
   ,PROVIDER_CITY 
   ,PROVIDER_ID  
   ,PROVIDER_NAME  
   ,PROVIDER_STATE 
   ,PROVIDER_STREET_ADDRESS  
   ,PROVIDER_ZIP_CODE 
   from HEALTHCARE_CSV

Note : In above approach Snowflake would not enable to track the metadata (MD5 Signatures) of data files loaded into the Target table. If you execute the COPY command again then it will be executed successfully but no records would be copied again means no duplicate records:

Copy executed with 0 files processed.

To load the few Sequential columns from csv file, Click here.

Average Rating

5 Star
0%
4 Star
0%
3 Star
0%
2 Star
0%
1 Star
0%

Leave a Reply

Your email address will not be published. Required fields are marked *