Performance Analysis of Machine Learning Models using RFE Feature Selection and Bayesian Optimization in Imbalanced Data Classification with Shap-Based Explanations

Nurzatil Aqmar; Hari Wijayanto; Farit Mochamad Afendi

doi:10.15294/sji.v12i3.31459

Authors

Nurzatil Aqmar, Statistics and Data Science Study Program, IPB University, Indonesia
Hari Wijayanto, Statistics and Data Science Study Program, IPB University, Indonesia
Farit Mochamad Afendi, Statistics and Data Science Study Program, IPB University, Indonesia

DOI:

https://doi.org/10.15294/sji.v12i3.31459

Keywords:

Random forest, Light gradient boosting machine, Recursive feature elimination, Bayesian optimization, Shapley additive explanations

Abstract

Purpose: This research aims to evaluate the performance of Random Forest (RF) and Light Gradient Boosting Machine (LightGBM) models integrated with Recursive Feature Elimination (RFE) for feature selection, Bayesian Optimization (BO) for hyperparameter tuning, and three imbalanced data handling techniques: Random Undersampling (RUS), Random Oversampling (ROS), and SMOTENC. It also identifies key determinants of household food insecurity in Papua, using SHAP for transparent feature interpretation.

Methods: The research used 2022 SUSENAS data from Papua Province. Data composition and variable characteristics were explored, and individual-level data were aggregated to the household level. Data were split by random sampling (80% training, 20% testing). Eighteen experimental scenarios were created by combining feature selection or no feature selection, three imbalance handling methods, and default or tuned hyperparameters. RF and LightGBM were evaluated over 50 iterations using accuracy, sensitivity, specificity, and G-Mean, and SHAP was applied to the best-performing models for interpretability.
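The abstract does not define G-Mean. A minimal sketch, assuming the usual definition (the geometric mean of sensitivity and specificity), shows why the metric is preferred over accuracy on imbalanced data. The function name and the toy labels below are illustrative assumptions, not taken from the study.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumes the common definition sqrt(sensitivity * specificity);
    the paper's exact formula is not given in the abstract.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy labels: 1 = food insecure (minority), 0 = food secure.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts the majority class is 80% accurate
# here, yet its G-Mean is 0 because sensitivity is 0.
majority_only = [0] * 10
print(g_mean(y_true, majority_only))

# A classifier that finds one of the two minority cases at the cost of
# two false positives: sensitivity 0.5, specificity 0.75.
balanced_guess = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(g_mean(y_true, balanced_guess))
```

Because the product collapses to zero whenever either class is entirely missed, G-Mean rewards models that perform reasonably on both the minority and majority classes.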

Result: LightGBM achieved the highest accuracy and stability, particularly when combined with SMOTENC and RFE+BO. RF maintained G-Mean better when paired with RUS, and the highest G-Mean (0.756) was obtained by RF + BO + RUS. Three-way ANOVA showed that model type, imbalance handling, feature selection, and their interaction significantly affected G-Mean. SHAP analysis indicated that health, financial, and educational limitations can increase the risk of food insecurity.

Novelty: This research offers a new integration of feature selection, hyperparameter tuning, and imbalanced data handling within an interpretable machine learning framework, providing a robust solution for food vulnerability classification on imbalanced datasets.
