Comparative Study of Imbalanced Data Oversampling Techniques for Peer-to-Peer Lending Loan Prediction
(1) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(2) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(3) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(4) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(5) Faculty of Technology Management, Universiti Tun Hussein Onn Malaysia, Malaysia
(6) Department of Computer Science, Universitas Negeri Semarang, Indonesia
Abstract
Purpose: Imbalanced loan data can cause classification algorithms to perform below their potential, reducing the resulting accuracy. To overcome this problem, an appropriate resampling technique is needed so that the classification algorithm can work optimally and deliver optimal accuracy. This research aims to find the right resampling technique to overcome the problem of data imbalance in loan data on peer-to-peer lending platforms.
Methods: This study uses the XGBoost classification algorithm to evaluate and compare the resampling techniques. The techniques compared in this research are SMOTE, ADASYN, Borderline SMOTE, and Random Oversampling.
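A minimal sketch of this comparison pipeline is given below, assuming the commonly used imbalanced-learn and xgboost Python libraries. The synthetic dataset, split ratio, and model settings are illustrative assumptions only and are not the loan data or hyperparameters used in this study.

# Sketch of the oversampling comparison, assuming imbalanced-learn and xgboost.
# The synthetic data below stands in for the P2P lending loan dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, RandomOverSampler
from xgboost import XGBClassifier

# Imbalanced binary data: roughly 90% majority (repaid) vs 10% minority (default).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

samplers = {
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "Borderline SMOTE": BorderlineSMOTE(random_state=42),
    "Random Oversampling": RandomOverSampler(random_state=42),
}

for name, sampler in samplers.items():
    # Oversample only the training split, then fit XGBoost on the balanced data.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    model = XGBClassifier(eval_metric="logloss", random_state=42)
    model.fit(X_res, y_res)
    train_acc = accuracy_score(y_res, model.predict(X_res))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train accuracy = {train_acc:.5f}, test accuracy = {test_acc:.5f}")

In this sketch, oversampling is applied only to the training split so that the test set keeps the original class distribution, which is why training and testing accuracy can be reported separately as in the results below.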
Results: The highest training accuracy, 0.99988, was achieved by the combination of the XGBoost model with the Borderline SMOTE resampling technique and by the combination of the XGBoost model with the SMOTE resampling technique. In testing, the highest accuracy score was achieved by the combination of the XGBoost model with the SMOTE resampling technique.
Novelty: This research is expected to identify the resampling technique best suited to combination with the XGBoost classification algorithm for overcoming the problem of imbalanced loan data on peer-to-peer lending platforms, so that the classification algorithm can work optimally and produce optimal accuracy.