Classification of expensive SQL: In a system where queries are stable and homogeneous with respect to time, long-running or expensive queries can be identified through data analytics. This helps build an effective alerting system for the performance issues that long-running queries cause: they consume disproportionate system and database resources and can degrade performance to the point where all DB resources are exhausted and an outage occurs.
Here we describe a classification model based on learning correlations within these metrics. We can also add or automate remediation steps to avoid a resource crunch when an expensive query runs.
However, different approaches can be taken based on the problem at hand: with heterogeneous queries, with SQL workload management for a data center, or in capacity planning where you need to know how many resources such queries consume, we may need to apply different ML techniques to get relevant results.
The main steps for training our classification machine learning model are:
- Identify data requirements
- Perform data collection and cleaning of raw datasets to build historical data and a continuous data pipeline for inference.
- Use vectorization techniques to extract features from SQL statements.
- Train the models with classification algorithms.
Important features are selected from database- and system-level details (CPU, memory, etc.), the query statement, the query cost, and its runtimes.
This data can be stored in recent cloud-based warehousing solutions like Snowflake, or in any other data store (e.g. MongoDB).
After normalization, standardization, and visualization of the metrics data, we can carefully discard the metrics that show no relevance for the query types we want to classify. One-hot encoding is used for categorical data that comes in text form. For the query statement, rather than using traditional SQL parsers or other means to derive weights for a SQL statement, recent advances in text vectorization can be used to generate query embeddings, which convert text-based data to fixed-length vectors. These vectors can then be added to the feature set and used to train the model.
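As an illustration of turning SQL text into fixed-length vectors, here is a minimal sketch using scikit-learn's `TfidfVectorizer` as a simple stand-in for a learned embedding model; the sample queries are hypothetical and any embedding technique that yields fixed-length vectors could be substituted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

queries = [
    "SELECT * FROM orders WHERE order_date > '2021-01-01'",
    "SELECT customer_id, SUM(amount) FROM payments GROUP BY customer_id",
    "UPDATE inventory SET qty = qty - 1 WHERE item_id = 42",
]

# Character n-grams tolerate varied identifiers and literals better
# than word-level tokens when vectorizing raw SQL text.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=256)
embeddings = vectorizer.fit_transform(queries).toarray()

# One fixed-length vector per query, ready to concatenate with
# the numeric system/DB metric features.
print(embeddings.shape)
```

The fixed vector width (capped here at 256 dimensions) is what allows the query text to be concatenated column-wise with the numeric metric features.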
To keep our approach simple, we did not use table metadata such as indexes, bad indexes, or tablespace size.
Based on experience, we opted for tree-based machine learning algorithms. XGBoost provides parallel tree boosting that solves many data science problems quickly and accurately, and it allows easy interpretation of feature importance graphs.
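To sketch the tree-based approach, the snippet below uses scikit-learn's `GradientBoostingClassifier` as an illustrative stand-in (XGBoost's `XGBClassifier` exposes a very similar fit/predict interface and the same `feature_importances_` attribute). The data is synthetic and the feature names are hypothetical; in the real pipeline they would be the collected metrics and query embeddings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Hypothetical feature matrix: CPU, memory, query cost, runtime,
# plus two embedding dimensions from the query vectorizer.
X = rng.normal(size=(500, 6))
# Synthetic label: queries with high combined cost + runtime are "expensive".
y = (X[:, 2] + X[:, 3] > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X, y)

# Feature importances support the same interpretation as
# XGBoost's feature importance graphs.
names = ["cpu", "mem", "cost", "runtime", "emb0", "emb1"]
for name, imp in zip(names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On this synthetic data the booster concentrates its importance on the cost and runtime columns, which is exactly the kind of signal the feature importance graphs make visible.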
Modeling & Performance
After preparing the feature set extracted from the system and DB metrics, we use it to train the classifier, taking care of the imbalanced classes in the training set. Cross-validation is used to find the optimal hyperparameters; then we test the trained classifier on the test set. The model achieves 98% accuracy and almost the same precision and recall for resource-consuming long-running queries.
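The imbalance handling and hyperparameter search described above can be sketched as follows. This is a minimal illustration on synthetic data, using `class_weight="balanced"` to counter the skew (expensive queries are typically the minority class) and `GridSearchCV` with stratified folds to pick hyperparameters; the model, grid, and threshold are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

rng = np.random.default_rng(1)
# Imbalanced synthetic data: roughly 10% of queries are "expensive".
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 1.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=1
)

# class_weight="balanced" reweights the minority class;
# stratified CV keeps the class ratio in every fold.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
)
grid.fit(X_tr, y_tr)

pred = grid.predict(X_te)
print("precision:", precision_score(y_te, pred))
print("recall:", recall_score(y_te, pred))
```

Reporting precision and recall separately, as in the passage above, matters here: with a 10% minority class, a model could reach 90% accuracy while missing every expensive query.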