A Comparative Analysis of Classification Algorithms for Cyberbullying Crime Detection: An Experimental Study of Twitter Social Media in Indonesia

Ari Muzakir(1), Hadi Syaputra(2), Febriyanti Panjaitan(3)


(1) Department of Information System, Universitas Bina Darma, Indonesia
(2) Department of Computer Science, Universitas Bina Darma, Indonesia
(3) Department of Computer Science, Universitas Bina Darma, Indonesia

Abstract

Purpose: This research aims to identify cyberbullying content on Twitter and compares four classification algorithms: Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM). The dataset consists of Twitter data that was manually labeled and validated by language experts. It contains 1,065 tweets, of which 638 are labeled non-bullying and 427 are labeled bullying.
Methods: Each word is weighted using the bag-of-words (BoW) method with three word-vector features: unigram, bigram, and trigram. The experiment covers two scenarios. The first tests which of the three features yields the best accuracy. The second compares the overall performance of the algorithms across all the features used.
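The unigram, bigram, and trigram bag-of-words features described above can be illustrated with a minimal Python sketch. The function name `ngram_bow`, the whitespace tokenizer, and the sample sentence are assumptions made for illustration only; they are not the authors' implementation, and the study's actual pre-processing is not described in this abstract.

```python
from collections import Counter


def ngram_bow(text, n):
    """Return a bag-of-words Counter of word n-grams for one document.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is simpler than real tweet pre-processing.
    """
    tokens = text.lower().split()
    # Slide a window of n tokens over the text and join each window.
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


# Hypothetical sample text, not taken from the study's dataset.
sample = "you are so so annoying"

unigrams = ngram_bow(sample, 1)  # counts of single words
bigrams = ngram_bow(sample, 2)   # counts of adjacent word pairs
trigrams = ngram_bow(sample, 3)  # counts of adjacent word triples

print(unigrams)
print(bigrams)
print(trigrams)
```

Each Counter maps an n-gram to its frequency in the document. Stacking these counts over a shared vocabulary would give the feature vectors that a classifier such as NB, DT, LR, or SVM takes as input.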
Result: The experimental results show that, for accuracy measured across features and algorithms, the SVM classification algorithm outperformed the other algorithms with 76%. For average recall, the DT classification algorithm outperformed the other algorithms with an average of 76%. For overall performance (F-measure), which combines precision and recall, the SVM classification algorithm outperformed the other algorithms with an F-measure of 82%.
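For readers unfamiliar with the reported metrics, the following Python sketch shows how accuracy, precision, recall, and F-measure are conventionally computed from confusion-matrix counts for the positive (bullying) class. The function name `classification_metrics` and the counts used in the example are hypothetical; they do not reproduce the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-measure (F1).

    tp, fp, fn, tn are true-positive, false-positive, false-negative,
    and true-negative counts for the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because the four metrics weigh errors differently, one algorithm can lead on recall while another leads on accuracy or F-measure, as in the results reported above.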
Value: Based on the experiments conducted, the SVM classification algorithm can detect words containing cyberbullying on social media.

Keywords

Cyberbullying, Model Comparison, Machine Learning, Bag of Words, Classification








Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji


This work is licensed under a Creative Commons Attribution 4.0 International License.