Evaluating Ensemble Learning Techniques for Class Imbalance in Machine Learning: A Comparative Analysis of Balanced Random Forest, SMOTE-RF, SMOTEBoost, and RUSBoost
DOI: https://doi.org/10.15294/sji.v11i4.15937

Keywords: Machine Learning, Balanced Random Forest, SMOTE-RF, SMOTEBoost, RUSBoost, Random Forest, AdaBoost, Imbalanced Data, Ensemble Learning

Abstract
Purpose: This research aims to identify the optimal ensemble learning method for mitigating class imbalance, comparing four advanced techniques: balanced random forest (BRF), SMOTE-random forest (SMOTE-RF), RUSBoost, and SMOTEBoost. These methods were systematically evaluated against conventional algorithms, random forest and AdaBoost, across heterogeneous datasets with varying class imbalance ratios.
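As a concrete illustration, the sketch below instantiates the six compared classifiers using scikit-learn and imbalanced-learn (an assumption; the paper does not name its software stack). SMOTEBoost has no off-the-shelf imbalanced-learn implementation, so a SMOTE-then-AdaBoost pipeline stands in for it here, a simplification since true SMOTEBoost resamples within every boosting round.

```python
# Hypothetical setup of the six compared classifiers; the library choice
# (scikit-learn + imbalanced-learn) is an assumption, not stated in the paper.
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

models = {
    "BRF": BalancedRandomForestClassifier(random_state=42),
    "SMOTE-RF": Pipeline([("smote", SMOTE(random_state=42)),
                          ("rf", RandomForestClassifier(random_state=42))]),
    "RUSBoost": RUSBoostClassifier(random_state=42),
    # Approximation: SMOTE once up front, then AdaBoost (not per-round SMOTEBoost).
    "SMOTEBoost (approx.)": Pipeline([("smote", SMOTE(random_state=42)),
                                      ("ada", AdaBoostClassifier(random_state=42))]),
    "RF": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}
```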
Methods: This study used 13 secondary datasets from diverse sources, each with a binary class output. The datasets exhibited varying degrees of class imbalance, providing a range of scenarios for assessing how well ensemble learning techniques and traditional machine learning approaches handle class imbalance. Each dataset was split into training (80%) and testing (20%) sets, with stratified sampling applied to maintain consistent class proportions across both sets. Each method underwent hyperparameter optimization over its own search space, repeated across 10 iterations, and the methods were compared on balanced accuracy, recall, and computation time; a sketch of this protocol follows.
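A minimal sketch of the evaluation protocol, reusing the `models` dictionary above: a stratified 80/20 split, then balanced accuracy, recall, and wall-clock training time per model. A synthetic imbalanced dataset stands in for the paper's 13 datasets, and the hyperparameter search is omitted.

```python
# Hypothetical evaluation loop; synthetic data replaces the paper's 13 datasets.
import time

from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced binary dataset as a stand-in.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Stratified 80/20 split keeps class proportions consistent in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)          # time only the training step
    elapsed = time.perf_counter() - start
    y_pred = model.predict(X_test)
    print(f"{name}: balanced_acc={balanced_accuracy_score(y_test, y_pred):.3f} "
          f"recall={recall_score(y_test, y_pred):.3f} train_time={elapsed:.2f}s")
```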
Result: In this evaluation, BRF achieved the highest balanced accuracy and recall of the six methods, outperforming SMOTE-RF, RUSBoost, SMOTEBoost, random forest, and AdaBoost, while the classical random forest was the most computationally efficient.
Novelty: This study presents a comparative analysis of advanced ensemble learning techniques, including BRF, SMOTE-RF, SMOTEBoost, and RUSBoost, which prove markedly effective at addressing class imbalance across diverse datasets. By systematically optimizing hyperparameters and applying stratified sampling, the research establishes benchmark results for balanced accuracy, recall, and computational efficiency in imbalanced machine learning.