Hybrid Top-K Feature Selection to Improve High-Dimensional Data Classification Using Naïve Bayes Algorithm

Riska Wibowo(1), M. Arief Soeleman(2), Affandy Affandy(3)


(1) Department of Informatics, Universitas Dian Nuswantoro, Indonesia
(2) Department of Informatics, Universitas Dian Nuswantoro, Indonesia
(3) Department of Informatics, Universitas Dian Nuswantoro, Indonesia

Abstract

Purpose: The Naïve Bayes algorithm is one of the most popular machine learning algorithms because it is simple, computationally efficient, and reasonably accurate. However, Naïve Bayes assumes that each attribute contributes independently to the classification result, ignoring dependencies that may exist between attributes; this assumption can degrade its classification performance. The algorithm is also sensitive to large numbers of features, which can further reduce its performance. This study aims to improve the performance of the Naïve Bayes algorithm with a hybrid top-k feature selection method that handles high-dimensional data and thereby produces better accuracy.
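To make the independence assumption concrete, the following is a minimal sketch (not the authors' implementation) of a categorical Naïve Bayes classifier: the posterior for each class is the prior multiplied by one likelihood term per attribute, as if the attributes were independent given the class. All function names and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(class) and per-attribute value counts per class."""
    priors = Counter(labels)                 # class frequencies
    cond = defaultdict(Counter)              # (attr_index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    # distinct values per attribute, used for Laplace smoothing
    values = [set(r[i] for r in rows) for i in range(len(rows[0]))]
    return priors, cond, values, len(labels)

def predict_nb(model, row):
    """Pick the class maximizing prior * product of smoothed likelihoods."""
    priors, cond, values, n = model
    best, best_p = None, -1.0
    for c, pc in priors.items():
        p = pc / n
        for i, v in enumerate(row):
            # independence assumption: one factor per attribute
            p *= (cond[(i, c)][v] + 1) / (pc + len(values[i]))  # Laplace smoothing
        if p > best_p:
            best, best_p = c, p
    return best
```

For example, on a four-row toy weather dataset the classifier simply multiplies the per-attribute likelihoods, so a strongly class-correlated attribute dominates the prediction.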

Methods: This research proposes a hybrid top-k feature selection method with the following stages: (1) prepare the dataset; (2) replace missing values with the mean of each attribute; (3) calculate attribute weights using the weight information gain method; (4) select attributes using the top-k feature selection method; (5) apply backward elimination with the Naïve Bayes algorithm; (6) validate the dataset with the newly selected attributes using 10-fold cross-validation, in which the data are split into training and testing sets; (7) calculate the accuracy value from the confusion matrix.
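Stages 2 to 4 of the pipeline above can be sketched in a few stdlib-only helper functions. This is a hedged illustration of mean imputation and information-gain-based top-k ranking, not the paper's code; the weight information gain variant and backward elimination steps are omitted, and all names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Information gain of one attribute: H(Y) - H(Y | attribute)."""
    n = len(labels)
    remainder = 0.0
    for v in set(column):
        sub = [y for x, y in zip(column, labels) if x == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def impute_mean(rows):
    """Stage 2: replace None with the attribute's mean (numeric data assumed)."""
    cols = list(zip(*rows))
    means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
             for col in cols]
    return [[means[i] if v is None else v for i, v in enumerate(row)]
            for row in rows]

def top_k_indices(rows, labels, k):
    """Stages 3-4: weight attributes by information gain, keep the k best."""
    cols = list(zip(*rows))
    gains = [(info_gain(col, labels), i) for i, col in enumerate(cols)]
    return [i for _, i in sorted(gains, reverse=True)[:k]]
```

A backward elimination step would then repeatedly drop the selected attribute whose removal most improves (or least harms) cross-validated Naïve Bayes accuracy, stopping when no removal helps.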

Result: The performance of the proposed method was compared with several existing methods: Naïve Bayes, deep feature weighting Naïve Bayes, and top-k feature selection. On the five UCI Repository datasets tested, the hybrid top-k feature selection method achieved higher accuracy than the previous methods, and in the accuracy comparison it ranked as the best-performing method.

Novelty: It can thus be concluded that the hybrid top-k feature selection method can be used to handle high-dimensional data with the Naïve Bayes algorithm.

 

Keywords

classification, naïve bayes, weight information gain, top-k feature selection, backward elimination, hybrid feature selection







Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji
Email: [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.