Implementation of Feature Selection Strategies to Enhance Classification Using XGBoost and Decision Tree

Fhara Elvina Pingky Nadya(1), M. Firdaus Ibadi Ferdiansyah(2), Vinna Rahmayanti Setyaning Nastiti(3), Christian Sri Kusuma Aditya(4)


(1) Faculty of Engineering, Universitas Muhammadiyah Malang, Indonesia
(2) Faculty of Engineering, Universitas Muhammadiyah Malang, Indonesia
(3) Faculty of Engineering, Universitas Muhammadiyah Malang, Indonesia
(4) Faculty of Engineering, Universitas Muhammadiyah Malang, Indonesia

Abstract

Purpose: In education, grades are often the benchmark by which students are judged successful or not during a period of study. Even when schools provide the same facilities and teaching staff to every student, grades do not come out the same; an achievement gap is still found in every school. The purpose of this research is to achieve higher classification accuracy by applying Information Gain (IG), Recursive Feature Elimination (RFE), Lasso, and a hybrid (RFE + Mutual Information) feature selection with XGBoost and Decision Tree models.

Methods: This research used records of 649 Portuguese-course students. The data were first pre-processed to meet the models' requirements; feature selection was then applied to retain the features that influence the target; the selected data were classified using XGBoost and Decision Tree; and finally the results were evaluated and reported.
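The pipeline described above (pre-process, select features, classify, evaluate) can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn, not the authors' exact code: the dataset, feature counts, and hyperparameters are assumptions, Lasso is approximated via an L1-penalized logistic model, and XGBoost (via `xgboost.XGBClassifier`) would slot in where the Decision Tree is used.

```python
# Illustrative sketch of the paper's pipeline (synthetic data; all
# settings below are assumptions, not the authors' reported setup).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the 649 pre-processed student records.
X, y = make_classification(n_samples=649, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Information Gain: rank features by mutual information with the target.
ig = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)

# RFE: recursively drop the weakest feature using the classifier itself.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=10).fit(X_tr, y_tr)

# Lasso-style selection: keep features with nonzero L1-penalized weights.
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
).fit(X_tr, y_tr)

# Classify on each reduced feature set and evaluate accuracy.
for name, sel in {"IG": ig, "RFE": rfe, "Lasso": lasso}.items():
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(sel.transform(X_tr), y_tr)
    acc = accuracy_score(y_te, clf.predict(sel.transform(X_te)))
    print(f"{name}: accuracy = {acc:.3f}")
```

Swapping `DecisionTreeClassifier` for an XGBoost model, or combining RFE's mask with the mutual-information ranking, reproduces the paper's other configurations.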

Results: Information Gain feature selection combined with the XGBoost algorithm achieved the best accuracy of all the combinations tested, at 81.53%.

Novelty: The contribution of this research is improving on the classification accuracy of previous work by combining two traditional machine learning algorithms with several feature selection methods.

Keywords

Grade; Decision tree; Feature selection; XGBoost







Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji
Email: [email protected]


This work is licensed under a Creative Commons Attribution 4.0 International License.