Comparative Study of Imbalanced Data Oversampling Techniques for Peer-to-Peer Lending Loan Prediction

Rini Muzayanah(1), Apri Dwi Lestari(2), Jumanto Jumanto(3), Budi Prasetiyo(4), Dwika Ananda Agustina Pertiwi(5), Much Aziz Muslim(6)


(1) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(2) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(3) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(4) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(5) Faculty of Technology Management, Universiti Tun Hussein Onn Malaysia, Malaysia
(6) Department of Computer Science, Universitas Negeri Semarang, Indonesia

Abstract

Purpose: Class imbalance, which often occurs when classifying loan data on peer-to-peer (P2P) lending platforms, can
degrade the performance of classification algorithms and reduce the resulting accuracy. To overcome this problem, an
appropriate resampling technique is needed so that the classification algorithm can work optimally. This research aims
to find the right resampling technique to overcome the problem of data imbalance in loan data on peer-to-peer lending
platforms.
Methods: This study uses the XGBoost classification algorithm to evaluate and compare the resampling techniques.
The techniques compared are SMOTE, ADASYN, Borderline-SMOTE, and Random Oversampling.
Results: The highest training accuracy was achieved by XGBoost combined with the Borderline-SMOTE resampling
technique (training accuracy of 0.99988), together with XGBoost combined with the SMOTE resampling technique. On
test accuracy, the highest score was achieved by XGBoost combined with the SMOTE resampling technique.
Novelty: This research is expected to identify the resampling technique that best complements the XGBoost
classification algorithm for imbalanced loan data on peer-to-peer lending platforms, so that the classifier can work
optimally and produce optimal accuracy.
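
To make the difference between two of the compared techniques concrete, the following is a minimal sketch of Random Oversampling and SMOTE-style interpolation in plain Python. It is not the authors' implementation: the toy data, the function names, the binary-label assumption, and the choice of k are all hypothetical and chosen only for illustration. In practice a library such as imbalanced-learn would probably be used together with XGBoost.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, minority, rng):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    # Assumes binary labels: every non-minority row belongs to the majority class.
    deficit = (len(y) - len(minority_rows)) - len(minority_rows)
    extra = [list(rng.choice(minority_rows)) for _ in range(max(deficit, 0))]
    return X + extra, y + [minority] * len(extra)


def smote_oversample(X, y, minority, rng, k=3):
    """Create synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    deficit = (len(y) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(max(deficit, 0)):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return X + synthetic, y + [minority] * len(synthetic)


rng = random.Random(0)
# Hypothetical toy data: 8 majority rows (label 0) and 3 minority rows (label 1).
X = [[float(i), float(i % 3)] for i in range(8)] + [[10.0, 10.0], [11.0, 12.0], [12.0, 11.0]]
y = [0] * 8 + [1] * 3

X_ros, y_ros = random_oversample(X, y, minority=1, rng=rng)
X_sm, y_sm = smote_oversample(X, y, minority=1, rng=rng)
print(Counter(y_ros), Counter(y_sm))
```

Random Oversampling only repeats existing minority rows, whereas the SMOTE-style variant produces new points on the line segments between minority neighbours. Either way, resampling is normally applied to the training split only, so that test accuracy is measured on the original class distribution.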

Keywords

P2P lending; Resampling data; Imbalanced data; Machine learning

Scientific Journal of Informatics, Vol. 11, No. 1, Feb 2024



Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.