Voting Classifier Technique and Count Vectorizer with N-gram to Identifying Hate Speech and Abusive Tweets in Indonesian

Riza Arifudin; Dandi Indra Wijaya; Budi Warsito; Adi Wibowo

doi:10.15294/sji.v10i4.46633

Riza Arifudin⁽¹⁾, Dandi Indra Wijaya⁽²⁾, Budi Warsito⁽³⁾, Adi Wibowo⁽⁴⁾

(1) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(2) Department of Computer Science, Universitas Negeri Semarang, Indonesia
(3) Department of Statistics, Diponegoro University, Indonesia
(4) Department of Informatics, Diponegoro University, Indonesia

Abstract

Purpose: This study aims to identify hate speech and abusive tweets in Indonesian using a Voting Classifier technique and Count Vectorizer with N-grams. The Voting Classifier technique combines multiple classifiers, such as Random Forest and Support Vector Machine, to improve classification accuracy.
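
As a rough illustration of the voting idea, the sketch below implements hard (majority) voting in plain Python. It is not the authors' implementation: the three rule-based classifiers, the labels "abusive" and "neutral", and the token "badword" are hypothetical stand-ins for trained models such as a Support Vector Machine and a Random Forest. The tie-break rule (the earliest voter among the tied labels wins) is also an assumption; it matters when the ensemble has an even number of estimators.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier here is any function mapping a tweet to a label.
Classifier = Callable[[str], str]


def hard_vote(classifiers: Sequence[Classifier], text: str) -> str:
    """Return the label predicted by the most classifiers.

    Ties are broken in favor of the label that appears first in
    voting order (an assumption made for this sketch).
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    votes = [clf(text) for clf in classifiers]
    counts = Counter(votes)
    top = max(counts.values())
    for label in votes:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label always has the top count")


# Hypothetical toy classifiers standing in for trained estimators.
def keyword_rule(text: str) -> str:
    return "abusive" if "badword" in text.lower().split() else "neutral"


def caps_rule(text: str) -> str:
    return "abusive" if text.isupper() else "neutral"


def exclaim_rule(text: str) -> str:
    return "abusive" if text.count("!") >= 2 else "neutral"


if __name__ == "__main__":
    ensemble = [keyword_rule, caps_rule, exclaim_rule]
    print(hard_vote(ensemble, "BADWORD YOU"))    # two of three vote "abusive"
    print(hard_vote(ensemble, "hello there!!"))  # two of three vote "neutral"
```

With only two estimators, as in this study, hard voting ties whenever the models disagree, so a practical setup would need either a tie-break rule like the one above or soft voting over predicted probabilities. The abstract does not say which variant was used.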

Methods: The research begins by preprocessing the data. The voting classifier uses the Support Vector Machine and Random Forest algorithms as its estimators. Count Vectorizer with N-grams is used for feature extraction. The goal is to measure the effectiveness of the proposed approach.
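
The feature extraction step can be sketched as follows: each tweet becomes a vector of N-gram counts over a vocabulary learned from the training tweets. This is a minimal plain-Python approximation, not the authors' code. The whitespace tokenizer, the function names (tokenize, ngrams, fit_vocabulary, transform), and the two example sentences are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def ngrams(tokens: List[str], n_min: int, n_max: int) -> List[str]:
    """Return all contiguous n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_vocabulary(docs: List[str], n_min: int = 1, n_max: int = 1) -> Dict[str, int]:
    """Map every n-gram seen in docs to a column index (sorted order)."""
    terms = sorted({g for d in docs for g in ngrams(tokenize(d), n_min, n_max)})
    return {term: i for i, term in enumerate(terms)}


def transform(docs: List[str], vocab: Dict[str, int],
              n_min: int = 1, n_max: int = 1) -> List[List[int]]:
    """Turn each document into a row of n-gram counts; unseen n-grams are ignored."""
    rows = []
    for d in docs:
        counts = Counter(ngrams(tokenize(d), n_min, n_max))
        row = [0] * len(vocab)
        for term, c in counts.items():
            if term in vocab:
                row[vocab[term]] = c
        rows.append(row)
    return rows


if __name__ == "__main__":
    # "I like coffee" / "I like like tea" (made-up examples)
    docs = ["saya suka kopi", "saya suka suka teh"]
    vocab = fit_vocabulary(docs)   # unigrams only (n = 1)
    print(vocab)                   # {'kopi': 0, 'saya': 1, 'suka': 2, 'teh': 3}
    print(transform(docs, vocab))  # [[1, 1, 1, 0], [0, 1, 2, 1]]
```

Raising n_max to 2 adds bigram columns such as "saya suka", which captures some word order at the cost of a larger and sparser feature matrix. The resulting count rows are what estimators like a Support Vector Machine or a Random Forest would be trained on.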

Result: The best accuracy, 82.50%, was obtained by combining the Voting Classifier with Count Vectorizer feature extraction using unigrams (N-grams with n = 1). This suggests that the approach is practical for identifying hate speech and abusive tweets.

Novelty: Combining multiple classifiers with feature extraction techniques such as Count Vectorizer and N-grams allows machine learning algorithms to be applied to sentiment analysis that differentiates between hate speech and abusive tweets.

Keywords

Voting classifier; Count vectorizer; Support vector machine; Random forest; Sentiment analysis

Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

This work is licensed under a Creative Commons Attribution 4.0 International License.
