Indonesian Hate Speech Text Classification Using Improved K-Nearest Neighbor with TF-IDF-ICSρF

Nova Adi Saputra(1), Khurotul Aeni(2), Nurul Mega Saraswati(3)


(1) Department of Informatics, Faculty of Science and Technology, Universitas Peradaban, Indonesia
(2) Department of Informatics, Faculty of Science and Technology, Universitas Peradaban, Indonesia
(3) Department of Informatics, Faculty of Science and Technology, Universitas Peradaban, Indonesia

Abstract

Purpose: Freedom on social media makes it possible for users to disturb others through the sentences they post, and this freedom is limited by the Electronic Information and Transactions Law (UU ITE). This research aims to find an effective method for classifying hate speech text, especially in Indonesian and across many categories, with the expectation of reducing such cases.

Methods: This study used 1,000 data items from Twitter with five labels: religion, race, physical, gender, and other (invective or slander). The process started with several preprocessing steps, followed by data transformation using TF-IDF-ICSρF term weighting and data mining using an Improved KNN algorithm. The results were then compared with those of the TF-IDF and KNN methods to evaluate the differences.
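The two stages named in the Methods paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the weight below uses one formulation from the class-indexing term-weighting literature, tf × (1 + log(D/df)) × (1 + log(C/density)), and the Improved KNN step uses the class-proportional neighbor count commonly associated with that name. The function names, the cosine similarity measure, and the toy tokens are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_weights(docs, labels):
    """Per-term global weight idf(t) * icsrf(t) (assumed formulation)."""
    n_docs = len(docs)
    class_sizes = Counter(labels)
    n_classes = len(class_sizes)
    doc_freq = Counter()
    class_doc_freq = defaultdict(Counter)  # term -> class -> docs containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[term][label] += 1
    weights = {}
    for term, df in doc_freq.items():
        idf = 1.0 + math.log(n_docs / df)
        # Class space density: sum over classes of the fraction of that
        # class's documents that contain the term.
        density = sum(n / class_sizes[c] for c, n in class_doc_freq[term].items())
        icsrf = 1.0 + math.log(n_classes / density)
        weights[term] = idf * icsrf
    return weights


def vectorize(tokens, weights):
    """Sparse vector: raw term frequency times the global term weight."""
    tf = Counter(t for t in tokens if t in weights)
    return {t: count * weights[t] for t, count in tf.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def improved_knn_predict(query, train_vecs, labels, k):
    """Score each class with a neighbor count scaled to its training size."""
    class_sizes = Counter(labels)
    largest = max(class_sizes.values())
    ranked = sorted(
        ((cosine(query, vec), label) for vec, label in zip(train_vecs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_label, best_score = None, -1.0
    for label, size in class_sizes.items():
        n = max(1, math.ceil(k * size / largest))  # class-specific k
        top = ranked[:n]
        total = sum(sim for sim, _ in top)
        score = sum(sim for sim, y in top if y == label) / total if total else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, already-preprocessed documents (hypothetical tokens and labels).
train_docs = [["faith", "insult"], ["faith", "mock"],
              ["woman", "insult"], ["woman", "mock"]]
train_labels = ["religion", "religion", "gender", "gender"]

weights = fit_weights(train_docs, train_labels)
train_vecs = [vectorize(doc, weights) for doc in train_docs]
print(improved_knn_predict(vectorize(["faith"], weights), train_vecs, train_labels, k=3))
```

Scaling the neighbor count by class size is meant to keep large classes from dominating the vote when the labels are imbalanced; with equally sized classes, as in the toy data, every class uses the same k.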

Result: The combination of TF-IDF-ICSρF and Improved KNN achieved an average accuracy of 88.11%, which is 17.81 percentage points higher than the 70.30% obtained by TF-IDF and K-Nearest Neighbor with the same data and parameters.

Novelty: Based on the comparison results, the TF-IDF-ICSρF and Improved KNN methods can classify hate speech sentences across many labels effectively and with fairly good accuracy.

Keywords

Hate speech; Text classification; Improved KNN; Term weighting; TF-IDF-ICSρF



Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.