Optimizing Random Forest for Predicting Thoracic Surgery Success in Lung Cancer Using Recursive Feature Elimination and GridSearchCV

  • Deonisius Germandy Cahaya Putra Universitas Negeri Semarang
  • Anggyi Trisnawan Putra Universitas Negeri Semarang
Keywords: Random Forest, Recursive Feature Elimination, Lung Cancer, Thoracic Surgery, GridSearchCV

Abstract

Lung cancer is one of the deadliest forms of cancer, claiming numerous lives annually. Thoracic surgery is one strategy for managing lung cancer patients; however, it carries high risks, including nerve damage and fatal complications. The success of thoracic surgery for lung cancer patients can be predicted using classification-based data mining techniques. Medical data mining draws on mathematical, statistical, and computational methods. In this study, thoracic surgery success is predicted with the random forest algorithm, using recursive feature elimination for feature selection. The feature selection process yields the top 8 features: 'DGN', 'PRE4', 'PRE5', 'PRE6', 'PRE10', 'PRE14', 'PRE30', and 'AGE'. Hyperparameter tuning with GridSearchCV is then applied to improve classification accuracy. The resulting model achieves a predictive accuracy of 91.41%.

Purpose: The study aims to develop and evaluate a Random Forest model that uses Recursive Feature Elimination for feature selection and GridSearchCV for hyperparameter tuning to predict the success of thoracic surgery.

Methods: This study uses the thoracic surgery dataset and applies various data preprocessing techniques. The preprocessed data is classified with the Random Forest algorithm, with Recursive Feature Elimination used to select the best features. GridSearchCV is used for hyperparameter tuning.
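The feature selection step described in the Methods can be illustrated with a minimal sketch using scikit-learn's `RFE` wrapped around a `RandomForestClassifier`. The data here is a synthetic stand-in generated with `make_classification`, and the feature names, sample count, and `random_state` values are assumptions for illustration only; they are not the study's dataset or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption): 300 samples, 16 numeric features.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

# Random forest used as the estimator whose importances rank the features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Recursively drop the least important feature until 8 remain.
selector = RFE(estimator=forest, n_features_to_select=8, step=1)
selector.fit(X, y)

# Names of the retained features and the reduced feature matrix.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
X_selected = selector.transform(X)
print(selected)
```

With the real dataset, `selected` would hold the retained column names, and `X_selected` would be passed on to the classifier.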

Result: The Random Forest algorithm with Recursive Feature Elimination feature selection and GridSearchCV hyperparameter tuning achieved an accuracy of 91.41%. This accuracy was obtained with the following parameter values: bootstrap set to False, criterion set to gini, n_estimators equal to 100, max_depth set to None, min_samples_split equal to 4, min_samples_leaf equal to 1, max_features set to auto, n_jobs set to -1, and verbose set to 2, with 10-fold cross-validation.
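A grid search of this kind can be sketched as follows. This is a hedged illustration, not the study's code: the data is synthetic, the candidate grid is an assumption chosen to include the reported values, and `max_features="sqrt"` is used because recent scikit-learn releases no longer accept `"auto"` (for classifiers, `"auto"` meant `"sqrt"`). The sketch also runs with `n_jobs=1` for portability, whereas the reported setting is -1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the 8 selected features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Fixed settings mirror the reported values; "sqrt" replaces "auto".
base = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",
    max_depth=None,
    min_samples_leaf=1,
    max_features="sqrt",
    random_state=0,
)

# Hypothetical candidate grid that contains the reported best values.
param_grid = {
    "bootstrap": [True, False],
    "min_samples_split": [2, 4],
}

# 10-fold stratified cross-validation, scored by accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(base, param_grid, scoring="accuracy", cv=cv, n_jobs=1)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 4))
```

`best_score` is the mean cross-validated accuracy of the best parameter combination. On synthetic data it will not match the reported 91.41%.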

Novelty: This study compares and analyzes various dataset preprocessing methods and model configurations to find the best model for predicting the success rate of thoracic surgery. It also employs feature selection to choose the best features for the classification process, and hyperparameter tuning to find the hyperparameter values that yield the best accuracy.

Published
2024-09-30
How to Cite
Putra, D., & Putra, A. (2024). Optimizing Random Forest for Predicting Thoracic Surgery Success in Lung Cancer Using Recursive Feature Elimination and GridSearchCV. Recursive Journal of Informatics, 2(2), 97 - 105. https://doi.org/10.15294/rji.v2i2.73154