A Hybrid Sampling Approach for Handling Data Imbalance in Ensemble Learning Algorithms

Authors

  • Reka Agustia Astari, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
  • I Made Sumertajaya, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
  • Agus Mohamad Soleh, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia

DOI:

https://doi.org/10.15294/sji.v12i2.19163

Keywords:

Data imbalance, Double random forest, Extra trees, Hybrid sampling, Poor household

Abstract

Purpose: This research addresses the methodological challenges posed by imbalanced data in classification tasks, where minority classes are severely underrepresented and model performance is often biased toward the majority class. It evaluates how effectively two hybrid sampling techniques, the Synthetic Minority Oversampling Technique combined with the Neighborhood Cleaning Rule (SMOTE-NCL) and with Edited Nearest Neighbors (SMOTE-ENN), improve the predictive performance of two ensemble classifiers, Double Random Forest (DRF) and Extremely Randomized Trees (ET), with a focus on minority class detection.
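To make the idea of hybrid sampling concrete, the following is a minimal sketch of SMOTE followed by Edited Nearest Neighbors, written in plain Python for a binary problem. It is not the authors' implementation: the function names (`k_nearest`, `smote`, `edited_nearest_neighbours`, `smote_enn`), the choice of k = 3, and the rule of cleaning both classes by majority vote are assumptions made for illustration. SMOTE-NCL would replace the cleaning step with the Neighborhood Cleaning Rule, which is not shown here.

```python
import random
from math import dist


def k_nearest(point_index, points, candidates, k):
    """Indices of the k candidates closest to points[point_index], excluding itself."""
    others = [j for j in candidates if j != point_index]
    others.sort(key=lambda j: dist(points[point_index], points[j]))
    return others[:k]


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority points by interpolation until both classes are equal in size.

    Assumes a binary problem with at least two minority samples.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(minority_idx)) - len(minority_idx)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.choice(minority_idx)                      # random minority seed point
        j = rng.choice(k_nearest(i, X, minority_idx, k))  # one of its minority neighbours
        gap = rng.random()                                # position on the segment i -> j
        X_new.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Drop each sample whose label disagrees with the majority of its k nearest neighbours."""
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        neighbours = k_nearest(i, X, all_idx, k)
        agree = sum(1 for j in neighbours if y[j] == y[i])
        if agree * 2 > len(neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Hybrid sampling: oversample with SMOTE, then clean the result with ENN."""
    X_balanced, y_balanced = smote(X, y, minority, k, seed)
    return edited_nearest_neighbours(X_balanced, y_balanced, k)
```

Calling `smote_enn(X, y, minority=1)` on a list of feature tuples `X` and labels `y` returns the resampled features and labels. Only the training partition would normally be resampled, so that the test set keeps its original class proportions.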

Methods: A total of eighteen simulated scenarios were developed by varying class imbalance ratios, sample sizes, and feature correlation levels. In addition, empirical data from the 2023 National Socioeconomic Survey (SUSENAS) in Riau Province were employed. The data were partitioned using stratified random sampling (80% training, 20% testing). Models were trained with and without hybrid sampling and optimized through grid search. Their performance was evaluated over 100 iterations using balanced accuracy, sensitivity, and G-mean. Feature importance was interpreted using Shapley Additive Explanations (SHAP).
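The split and the three evaluation metrics named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the study's code: it uses the usual binary definitions (sensitivity is the recall of the positive class, balanced accuracy is the mean of sensitivity and specificity, and G-mean is the square root of their product), and the helper names `stratified_split` and `imbalance_metrics` are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) that preserve each class's share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class test size
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def imbalance_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, balanced accuracy, and G-mean for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": sqrt(sensitivity * specificity),
    }
```

Unlike plain accuracy, both balanced accuracy and G-mean fall sharply when a model ignores the minority class, which is why they suit imbalanced problems. Repeating the split with a different `seed` on each run is one way to obtain the kind of repeated evaluation described above.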

Results: DRF combined with SMOTE-NCL consistently outperformed all other models, achieving 87.56% balanced accuracy, 82.17% sensitivity, and 86.75% G-mean in the most extreme simulation scenario. On the empirical dataset, the model achieved 76.37% balanced accuracy and 75.49% G-mean.

Novelty: This study introduces a novel integration of hybrid sampling techniques and ensemble learning within an interpretable machine learning framework, providing a robust solution for poverty classification in imbalanced datasets.

Published

25-06-2025

Article ID

19163

Section

Articles

How to Cite

A Hybrid Sampling Approach for Handling Data Imbalance in Ensemble Learning Algorithms. (2025). Scientific Journal of Informatics, 12(2), 247-258. https://doi.org/10.15294/sji.v12i2.19163