Snowflake’s Data Metric Functions (DMFs) are powerful tools designed to ensure data quality and governance. By enabling automated checks and validations, DMFs allow organizations to monitor their data continuously and enforce business rules. With both built-in and custom metrics, DMFs simplify the process of validating large datasets and identifying anomalies.
In this blog, we will explore a practical use case of DMFs to monitor the quality of transactional data. Using Snowflake’s flexibility, we will create custom DMFs, integrate them with data tables, and leverage alerting mechanisms to ensure high-quality standards.
Use Case: Ensuring Data Quality for Transactional Data
Scenario:
You are responsible for maintaining the quality of a DATA_QUALITY_TEST table, which records transaction details. Ensuring data accuracy and validity is critical for downstream analytics and decision-making processes.
Step 1: The table structure includes the following fields:
CREATE OR REPLACE TABLE DATA_QUALITY_TEST (
ID INT,
CUSTOMER_ID INT,
TRANSACTION_AMOUNT NUMERIC,
DATE TIMESTAMP,
SHIPMENT_QUANTITY INT,
MAX_TEMP_F NUMBER,
DELIVERY_TIME TIMESTAMP
);
To ensure data quality, we will implement checks for:
- High-value transactions exceeding a threshold.
- Temperature values outside the valid Fahrenheit range (-40°F to 140°F).
- Shipment delivery anomalies, including invalid shipment quantities and delivery times.
- Null values and duplicate entries for customer IDs.
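To make these checks concrete, a few sample rows might look like the following (the values are purely illustrative: the second row deliberately violates the transaction, temperature, quantity, and delivery-window checks, and the third has a NULL customer ID):

```sql
-- Illustrative test data (hypothetical values)
INSERT INTO DATA_QUALITY_TEST VALUES
  (1, 101,  250.00,    '2024-05-01 10:00:00', 10, 72,  '2024-05-01 14:30:00'),
  (2, 102,  750000.00, '2024-05-02 11:00:00', -5, 200, '2024-05-02 02:15:00'),
  (3, NULL, 1200.00,   '2024-05-03 09:30:00', 8,  65,  '2024-05-03 16:45:00');
```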
Step 2: Creating Custom Data Metric Functions
We will define custom DMFs to validate specific data quality aspects:
a) High Transaction Amount Check:
CREATE OR REPLACE DATA METRIC FUNCTION CHECK_HIGH_TRANSACTION(
TABLE_NAME TABLE(TRANSACTION_AMOUNT NUMBER)
)
RETURNS NUMBER
AS $$
SELECT COUNT(*)
FROM TABLE_NAME
WHERE TRANSACTION_AMOUNT > 500000
$$;
b) Temperature Range Check:
CREATE OR REPLACE DATA METRIC FUNCTION CHECK_FAHRENHEIT_RANGE(
TABLE_NAME TABLE(MAX_TEMP_F NUMBER)
)
RETURNS NUMBER
AS $$
SELECT COUNT(*)
FROM TABLE_NAME
WHERE MAX_TEMP_F IS NOT NULL AND MAX_TEMP_F NOT BETWEEN -40 AND 140
$$;
c) Shipment and Delivery Validation:
CREATE OR REPLACE DATA METRIC FUNCTION CHECK_SHIPMENT_DELIVERY(
TABLE_NAME TABLE(SHIPMENT_QUANTITY NUMBER, DELIVERY_TIME TIMESTAMP)
)
RETURNS NUMBER
AS $$
SELECT COUNT(*)
FROM TABLE_NAME
WHERE SHIPMENT_QUANTITY <= 0
OR EXTRACT(HOUR FROM DELIVERY_TIME) NOT BETWEEN 8 AND 18
$$;
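Before attaching these DMFs to a table, it can be useful to run them ad hoc to verify the logic. Snowflake allows a DMF to be called directly in a SELECT, passing a query for its TABLE argument; each call returns the count of offending rows:

```sql
-- Ad-hoc DMF invocations for testing, no schedule required
SELECT CHECK_HIGH_TRANSACTION(SELECT TRANSACTION_AMOUNT FROM DATA_QUALITY_TEST);
SELECT CHECK_FAHRENHEIT_RANGE(SELECT MAX_TEMP_F FROM DATA_QUALITY_TEST);
SELECT CHECK_SHIPMENT_DELIVERY(SELECT SHIPMENT_QUANTITY, DELIVERY_TIME FROM DATA_QUALITY_TEST);
```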
Step 3: Integrating DMFs with the Data Table
To enable automated data validation, attach the DMFs to the DATA_QUALITY_TEST table:
-- Set the schedule for the table
ALTER TABLE DATA_QUALITY_TEST SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
ALTER TABLE DATA_QUALITY_TEST ADD DATA METRIC FUNCTION CHECK_HIGH_TRANSACTION ON (TRANSACTION_AMOUNT);
ALTER TABLE DATA_QUALITY_TEST ADD DATA METRIC FUNCTION CHECK_FAHRENHEIT_RANGE ON (MAX_TEMP_F);
ALTER TABLE DATA_QUALITY_TEST ADD DATA METRIC FUNCTION CHECK_SHIPMENT_DELIVERY ON (SHIPMENT_QUANTITY, DELIVERY_TIME);
-- Built-in Snowflake DMFs for null and duplicate checks
ALTER TABLE DATA_QUALITY_TEST ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (CUSTOMER_ID);
ALTER TABLE DATA_QUALITY_TEST ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (CUSTOMER_ID);
Step 4: Monitoring and Alerting
To review data quality metrics, use the following queries:
a) View DMF Details:
SELECT *
FROM TABLE(
INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
REF_ENTITY_NAME => 'YOUR_DATABASE.PUBLIC.DATA_QUALITY_TEST',
REF_ENTITY_DOMAIN => 'table'
)
);
b) View the Latest DMF Execution Status:
WITH Latest_DMF_Run AS (
SELECT MAX(DATE_TRUNC('MINUTE', measurement_time)) AS latest_time
FROM TABLE(SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS(
REF_ENTITY_NAME => 'ORDER_DB.PUBLIC.DATA_QUALITY_TEST',
REF_ENTITY_DOMAIN => 'table'))
)
SELECT scheduled_time, measurement_time, table_name, metric_name, value
FROM TABLE(SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS(
REF_ENTITY_NAME => 'ORDER_DB.PUBLIC.DATA_QUALITY_TEST',
REF_ENTITY_DOMAIN => 'table'))
WHERE DATE_TRUNC('MINUTE', measurement_time) = (SELECT latest_time FROM Latest_DMF_Run);
Set Up Alerts
To notify stakeholders when data quality issues arise:
- Create a Notification Integration:
CREATE OR REPLACE NOTIFICATION INTEGRATION DMF_FAILURE_ALERT
TYPE = EMAIL
ENABLED = TRUE
ALLOWED_RECIPIENTS = ('sachin.mittal04@gmail.com')
COMMENT = 'Notification for DMF failures';
- Configure an Alert:
CREATE OR REPLACE ALERT DMF_ALERTS
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
IF (
EXISTS (
SELECT 1
FROM TABLE(SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS(
REF_ENTITY_NAME => 'YOUR_DATABASE.PUBLIC.DATA_QUALITY_TEST',
REF_ENTITY_DOMAIN => 'table'))
-- Only fire when a metric actually reports failing rows
-- in the most recent scheduling window
WHERE value > 0
AND measurement_time > DATEADD('MINUTE', -1, CURRENT_TIMESTAMP())
)
)
)
THEN CALL SYSTEM$SEND_EMAIL(
'DMF_FAILURE_ALERT',
'sachin.mittal04@gmail.com',
'Data Quality Alert',
'One or more records failed data quality checks.'
);
ALTER ALERT DMF_ALERTS RESUME;
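During maintenance windows, the monitoring pieces can be paused or detached and re-enabled later; a sketch, using the same object names as above:

```sql
-- Pause the alert while performing maintenance
ALTER ALERT DMF_ALERTS SUSPEND;
-- Detach a DMF from the table (re-attach later with ADD DATA METRIC FUNCTION)
ALTER TABLE DATA_QUALITY_TEST DROP DATA METRIC FUNCTION CHECK_HIGH_TRANSACTION ON (TRANSACTION_AMOUNT);
-- Stop scheduled DMF evaluation entirely
ALTER TABLE DATA_QUALITY_TEST UNSET DATA_METRIC_SCHEDULE;
```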
Step 5: Benefits of Using DMFs
- Automation: Continuous monitoring with minimal manual intervention.
- Scalability: Validation scales to large datasets without compromising performance.
- Proactive Notification: Real-time alerts prevent downstream impact.
- Customizability: Tailor validation logic to meet specific requirements.
Conclusion
By integrating DMFs into your data quality strategy, you can establish robust validation mechanisms to ensure data consistency, compliance, and accuracy. Snowflake’s flexibility allows you to extend these functions with custom logic and combine them with alerting systems for real-time monitoring.
This approach is ideal for organizations looking to enhance their data governance practices and ensure high-quality standards.