Application of C4.5 Algorithm Using Synthetic Minority Oversampling Technique (SMOTE) and Particle Swarm Optimization (PSO) for Diabetes Prediction
Abstract
Diabetes is the fourth or fifth leading cause of death in most developed countries and an epidemic in many developing countries. Early detection can serve as a preventive measure, using data mining to classify existing patient data.
Purpose: This study investigates whether integrating the C4.5 algorithm with the Synthetic Minority Oversampling Technique (SMOTE) and Particle Swarm Optimization (PSO) improves the accuracy of diabetes prediction models. SMOTE is used to address the class imbalance common in diabetes datasets, which often contain significantly fewer positive cases (diabetes) than negative cases (non-diabetes). PSO is used to optimize the decision tree construction process within the C4.5 algorithm, enhancing its ability to discern complex patterns and relationships in the data.
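The oversampling idea described above can be illustrated with a minimal sketch of SMOTE-style interpolation. This is not the authors' implementation: the function name `smote_sketch`, the neighbour count `k`, and the toy minority points are assumptions made for illustration only. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority-class samples (made-up values, not taken from PIDD).
toy_minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]]
toy_synthetic = smote_sketch(toy_minority, n_synthetic=6, k=2)
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region already occupied by the minority class rather than duplicating existing rows.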
Methods/Study design/approach: This study applies the C4.5 classification algorithm together with the synthetic minority oversampling technique (SMOTE) and particle swarm optimization (PSO) to overcome problems in the diabetes dataset, namely the Pima Indian Diabetes Database (PIDD).
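The abstract does not state what the PSO particles encode (for example feature weights, feature subsets, or tree parameters), so the following is only a generic sketch of the global-best PSO update on a toy objective. The function name `pso_minimize`, the inertia and acceleration values, the bounds, and the sphere objective are all assumptions chosen for illustration, not settings reported by the study.

```python
import random


def pso_minimize(objective, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize objective over a box with a basic global-best PSO (illustrative)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=2)
```

In a classification setting, the toy objective would be replaced by a function that trains and scores the classifier for the candidate encoded by a particle, for instance returning the validation error rate.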
Result/Findings: The C4.5 algorithm without preprocessing achieves an accuracy of 75.97%, while C4.5 with SMOTE achieves 80%. C4.5 with both SMOTE and PSO produces the highest accuracy, 82.5%, an increase of 6.53 percentage points over the C4.5 algorithm alone.
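The reported gain is an absolute difference in accuracy, which a short calculation makes explicit. The sketch below uses only the three accuracies stated in the results; the variable names are illustrative, and the relative improvement is a derived figure that does not appear in the abstract.

```python
# Accuracies reported in the abstract, in percent.
accuracy = {
    "C4.5": 75.97,
    "C4.5 + SMOTE": 80.0,
    "C4.5 + SMOTE + PSO": 82.5,
}

baseline = accuracy["C4.5"]

# Absolute gain over plain C4.5, in percentage points.
gain_points = {name: round(value - baseline, 2) for name, value in accuracy.items()}

# Relative gain of the full pipeline over plain C4.5, in percent of the baseline.
relative_gain = round((accuracy["C4.5 + SMOTE + PSO"] - baseline) / baseline * 100, 1)

print(gain_points)
print(relative_gain)
```

The absolute gain of the full pipeline is 6.53 percentage points, matching the reported figure, and SMOTE alone accounts for 4.03 of those points.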
Novelty/Originality/Value: This research proposes a hybrid approach that combines the C4.5 decision tree algorithm with the Synthetic Minority Oversampling Technique (SMOTE) and Particle Swarm Optimization (PSO) for diabetes prediction. Previous studies have applied machine learning algorithms to diabetes prediction, but few have examined the combined effect of integrating both SMOTE and PSO with the C4.5 algorithm specifically.