The Impact of Balanced Data Techniques on Classification Model Performance
DOI:
DOI: https://doi.org/10.15294/sji.v11i2.3649

Keywords: Imbalanced Data, SMOTE, B-SMOTE, SMOTE-ENN, Performance

Abstract
Purpose: The aim of this study is to examine the impact of balanced data techniques on the performance of classification models.
Methods: To balance the imbalanced dataset, several resampling techniques are employed: the Synthetic Minority Oversampling Technique (SMOTE), Borderline-SMOTE (B-SMOTE), and SMOTE combined with Edited Nearest Neighbors (SMOTE-ENN). Classification is then performed on both the balanced and the imbalanced datasets to evaluate the impact of each resampling technique on classification model performance.
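To make the core oversampling idea concrete, the following is a minimal, self-contained sketch of SMOTE in NumPy: each synthetic sample is a random interpolation between a minority-class point and one of its k nearest minority-class neighbors. This is an illustration of the technique only, not the study's implementation (which would typically use a library such as imbalanced-learn); the function name and parameters are chosen here for clarity.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=None):
    """Minimal SMOTE sketch (illustrative, not the paper's code).

    Generates n_synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest neighbors
    synth = []
    for _ in range(n_synthetic):
        i = rng.integers(n)              # random minority sample
        j = knn[i, rng.integers(k)]      # random neighbor of that sample
        gap = rng.random()               # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

# Toy minority class: 10 points in 2-D
X_min = np.random.default_rng(0).normal(size=(10, 2))
X_new = smote(X_min, n_synthetic=20, k=3, seed=1)
print(X_new.shape)  # 20 synthetic 2-D samples
```

B-SMOTE restricts this interpolation to minority samples near the class boundary, and SMOTE-ENN additionally removes noisy samples afterwards with Edited Nearest Neighbors; both are refinements of the loop above.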
Result: This study applies the SMOTE, B-SMOTE, and SMOTE-ENN techniques to generate synthetic data. Experimental results show that resampling can improve model performance for KNN, Naive Bayes, and Decision Tree (DT). The best balancing technique is SMOTE-ENN, followed by B-SMOTE, with SMOTE last. Compared to the imbalanced dataset, the SMOTE technique improves Accuracy, Precision, Recall, F1-Score, G-mean, and ROC-AUC on the DT method by 4.79%, 35.89%, 35.32%, 35.63%, 46.94%, and 34.89%, respectively. The B-SMOTE technique on the DT method improves Accuracy, Precision, Recall, F1-Score, G-mean, and ROC-AUC by 5.62%, 36.45%, 35.88%, 36.19%, 47.40%, and 35.46%, respectively. The SMOTE-ENN technique improves Accuracy, Precision, Recall, F1-Score, G-mean, and ROC-AUC by 8.11%, 34.53%, 43.25%, 41.63%, 62.85%, and 42.91%, respectively.
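The metrics reported above can be computed directly from a binary confusion matrix. The sketch below assumes the common definition of G-mean as the geometric mean of sensitivity and specificity (the abstract does not spell out its definition), with label 1 as the minority/positive class.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Evaluation metrics used in the study, computed from a binary
    confusion matrix. G-mean = sqrt(sensitivity * specificity) is an
    assumed (standard) definition."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity (true positive rate)
    specificity = tn / (tn + fp)         # true negative rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": np.sqrt(recall * specificity),
    }

# Toy example: 6 labels, 4 correct predictions
m = binary_metrics([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1])
print(m)
```

On imbalanced data, accuracy alone is misleading (a majority-class-only classifier scores high), which is why the study also reports recall, F1-Score, G-mean, and ROC-AUC.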
Novelty: Based on the experimental results, the best balancing technique is SMOTE-ENN, which yields the largest improvements in Accuracy, Recall, F1-Score, G-mean, and ROC-AUC.