Comparative Study of Machine Learning Algorithms for Performing Ham or Spam Classification in SMS

Erna Zuni Astuti(1), Christy Atika Sari(2), Eko Hari Rachmawanto(3), Rabei Raad Ali(4)


(1) Department of Informatics Engineering, Universitas Dian Nuswantoro, Indonesia
(2) Department of Informatics Engineering, Universitas Dian Nuswantoro, Indonesia
(3) Department of Informatics Engineering, Universitas Dian Nuswantoro, Indonesia
(4) Department of Computer Engineering, Northern Technical University, Iraq

Abstract

Purpose: Fraud is widespread in the current technological era, in which large amounts of information are easy to access. People therefore need to be able to tell whether a message they receive is legitimate or fraudulent. This research classifies SMS messages as ham or spam. It aims to build a model that helps classify messages and to determine which machine learning method performs the ham or spam classification most accurately and efficiently.

Methods: The ham or spam classification is carried out with machine learning methods. The algorithms used are Random Forest, Logistic Regression, Support Vector Classification, Gradient Boosting, and XGBoost Classifier.
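
A comparison of this kind can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the TF-IDF features, the train/test split, the hyperparameters, and the toy messages are all assumptions, since the abstract does not specify them. The sketch uses only scikit-learn, so the XGBoost Classifier is left out; it could be added to the `models` dictionary if the `xgboost` package is installed.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages; the dataset used in the study is not given here.
HAM = [
    "Are we still meeting for lunch today",
    "Please call me when you get home",
    "I will be late for the meeting",
    "Thanks for the notes from class",
    "Can you pick up some milk on the way",
    "Happy birthday, see you tonight",
    "The report is attached, let me know",
    "See you at the station at six",
]
SPAM = [
    "You have won a free prize, claim now",
    "Urgent, your account is locked, click the link",
    "Free entry to win cash, text WIN now",
    "Congratulations, you won a free phone",
    "Claim your free cash reward today",
    "Winner, call now to claim your prize",
    "Cheap loans approved, click the link now",
    "Text STOP to claim your free voucher",
]


def compare_classifiers(texts, labels, seed=0):
    """Fit each classifier on TF-IDF features and return its test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    # Assumed default-style settings; the study's hyperparameters are unknown.
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Classification": SVC(kernel="linear"),
        "Gradient Boosting": GradientBoostingClassifier(random_state=seed),
    }
    scores = {}
    for name, clf in models.items():
        # Each model gets its own vectorizer, fitted on the training split only.
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        pipeline.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
    return scores


texts = HAM + SPAM
labels = [0] * len(HAM) + [1] * len(SPAM)  # 0 = ham, 1 = spam
scores = compare_classifiers(texts, labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2%}")
```

Accuracies on a toy set this small carry no meaning; the point is only the structure, in which every algorithm is trained and scored on the same split so the results are comparable.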

Results: In testing, the Random Forest algorithm reached an accuracy of 97.28%, Logistic Regression 94.67%, Support Vector Classification (SVC) 97.93%, and XGBoost Classifier 96.47%. The best precision was 98%, obtained with Random Forest. The best recall was 94% and the best F1-score was 95%, both obtained with SVC.
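
For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, precision, recall, and F1-score follow from the counts of correct and incorrect predictions. It assumes spam is treated as the positive class, which the abstract does not state, and the labels in the example are made up for illustration rather than taken from the study.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 spam and 5 ham messages, with one error of each kind.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because ham usually far outnumbers spam in SMS data, a high accuracy can coexist with a lower recall, which is why the precision, recall, and F1 values are reported alongside accuracy.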

Novelty: This research compares several algorithms. XGBoost has rarely been used in previous research to classify messages as ham or spam. We give brief information based on a comparison of the algorithms and show the best algorithm for classifying messages as ham or spam. The machine learning model built in this research also achieves better accuracy than previous research.

Keywords

Random forest; Logistic regression; Support vector classification; XGBoost classifier; Machine learning

Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

This work is licensed under a Creative Commons Attribution 4.0 International License.