The Impact of Balanced Data Techniques on Classification Model Performance

Authors

  • Jasman Pardede, Institut Teknologi Nasional (Itenas) Bandung
  • Dika Prasetia Pamungkas, Institut Teknologi Nasional (Itenas) Bandung

DOI:

https://doi.org/10.15294/sji.v11i2.3649

Keywords:

Imbalanced Data, SMOTE, B-SMOTE, SMOTE-ENN, Performance

Abstract

Purpose: This study examines the impact of data-balancing techniques on the performance of classification models.

Methods: To balance the imbalanced dataset, several resampling techniques are employed: the Synthetic Minority Oversampling Technique (SMOTE), Borderline-SMOTE (B-SMOTE), and SMOTE combined with Edited Nearest Neighbors (SMOTE-ENN). Classification is then performed on both the balanced and unbalanced datasets to evaluate the impact of the resampling techniques on classification model performance.
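The core idea behind SMOTE, on which the other two techniques build, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, the toy data, and the parameter values are assumptions chosen for illustration, and only the basic interpolation step is shown (the borderline selection of B-SMOTE and the ENN cleaning of SMOTE-ENN are omitted).

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (basic SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 20 majority samples, 4 minority samples.
majority = [(float(i), float(i % 3)) for i in range(20)]
minority = [(0.0, 10.0), (1.0, 11.0), (2.0, 10.5), (1.5, 12.0)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Because each synthetic point lies on a line segment between two existing minority samples, the new data stays inside the region the minority class already occupies rather than duplicating samples exactly.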

Result: This study applies the SMOTE, B-SMOTE, and SMOTE-ENN techniques to generate synthetic data. Experimental results show that resampling can improve model performance for the KNN, Naive Bayes, and Decision Tree (DT) classifiers. The best data-balancing technique is SMOTE-ENN, followed by B-SMOTE and then SMOTE. Compared with the unbalanced dataset, SMOTE with the DT method improves Accuracy, Precision, Recall, F1-Score, G-mean, and ROC curve by 4.79%, 35.89%, 35.32%, 35.63%, 46.94%, and 34.89%, respectively. B-SMOTE with the DT method improves the same metrics by 5.62%, 36.45%, 35.88%, 36.19%, 47.40%, and 35.46%, respectively. SMOTE-ENN improves them by 8.11%, 34.53%, 43.25%, 41.63%, 62.85%, and 42.91%, respectively.
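To show how the reported metrics relate to one another, the following sketch computes Accuracy, Precision, Recall, F1-Score, and G-mean from confusion-matrix counts. It assumes a binary setting with G-mean defined as the square root of recall times specificity; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return common metrics from binary confusion-matrix counts.
    Ratios with a zero denominator are reported as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # sensitivity on the minority class
    specificity = ratio(tn, tn + fp)   # recall on the majority class
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical counts for a classifier trained on imbalanced data:
# 20 minority (positive) samples and 180 majority (negative) samples.
metrics = binary_metrics(tp=5, fp=5, fn=15, tn=175)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With counts like these, accuracy stays high because the majority class dominates, while recall and G-mean expose the weak minority-class performance. This is one plausible reason why balancing can move G-mean far more than accuracy.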

Novelty: Based on the experimental results, the best data-balancing technique is SMOTE-ENN, which improves Accuracy, Precision, Recall, F1-Score, G-mean, and ROC curve.

Article ID

3649

Published

31-05-2024

Section

Articles

How to Cite

Pardede, J., & Pamungkas, D. P. (2024). The Impact of Balanced Data Techniques on Classification Model Performance. Scientific Journal of Informatics, 11(2), 401-412. https://doi.org/10.15294/sji.v11i2.3649