A Hybrid Sampling Approach for Handling Data Imbalance in Ensemble Learning Algorithms

Reka Agustia Astari; I Made Sumertajaya; Agus Mohamad Soleh

doi:10.15294/sji.v12i2.19163

Authors

Reka Agustia Astari, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
I Made Sumertajaya, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
Agus Mohamad Soleh, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia

DOI:

https://doi.org/10.15294/sji.v12i2.19163

Keywords:

Data imbalance, Double random forest, Extra trees, Hybrid sampling, Poor household

Abstract

Purpose: This research addresses the methodological challenges posed by imbalanced data in classification tasks, where minority classes are severely underrepresented and model performance is often biased toward the majority class. It evaluates the effectiveness of two hybrid sampling techniques, the Synthetic Minority Oversampling Technique combined with the Neighborhood Cleaning Rule (SMOTE-NCL) and with Edited Nearest Neighbors (SMOTE-ENN), in improving the predictive performance of two ensemble classifiers, Double Random Forest (DRF) and Extremely Randomized Trees (ET), with a focus on enhancing minority class detection.

Methods: A total of eighteen simulated scenarios were developed by varying class imbalance ratios, sample sizes, and feature correlation levels. In addition, empirical data from the 2023 National Socioeconomic Survey (SUSENAS) in Riau Province were employed. The data were partitioned using stratified random sampling (80% training, 20% testing). Models were trained with and without hybrid sampling and optimized through grid search. Their performance was evaluated over 100 iterations using balanced accuracy, sensitivity, and G-mean. Feature importance was interpreted using Shapley Additive Explanations (SHAP).
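The three evaluation metrics named above have standard definitions in terms of confusion-matrix counts. The sketch below is illustrative only and is not the authors' code: the function name `imbalance_metrics` and the example counts are hypothetical, and it assumes the minority class is treated as the positive class and that both classes appear in the test set.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, balanced accuracy, and G-mean.

    tp/fn: minority-class (positive) cases predicted correctly/incorrectly.
    tn/fp: majority-class (negative) cases predicted correctly/incorrectly.
    Assumes each class has at least one case, so no denominator is zero.
    """
    sensitivity = tp / (tp + fn)  # recall on the minority class
    specificity = tn / (tn + fp)  # recall on the majority class
    balanced_accuracy = (sensitivity + specificity) / 2  # arithmetic mean
    g_mean = math.sqrt(sensitivity * specificity)  # geometric mean
    return sensitivity, specificity, balanced_accuracy, g_mean


# Hypothetical counts for a test set of 50 minority and 950 majority cases.
sens, spec, bal_acc, g_mean = imbalance_metrics(tp=40, fn=10, tn=855, fp=95)
print(f"sensitivity={sens:.3f} balanced_accuracy={bal_acc:.3f} g_mean={g_mean:.3f}")
```

Unlike plain accuracy, both balanced accuracy and G-mean fall sharply when the minority class is poorly detected, which is why they suit imbalanced problems such as the one studied here.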

Results: DRF combined with SMOTE-NCL consistently outperformed all other models, achieving 87.56% balanced accuracy, 82.17% sensitivity, and 86.75% G-mean in the most extreme simulation scenario. On the empirical dataset, the model achieved 76.37% balanced accuracy and 75.49% G-mean.

Novelty: This study introduces a novel integration of hybrid sampling techniques and ensemble learning within an interpretable machine learning framework, providing a robust solution for poverty classification in imbalanced datasets.
