Performance Comparison of SVM, Naïve Bayes, and KNN Algorithms for Analysis of Public Opinion Sentiment Against COVID-19 Vaccination on Twitter

ABSTRACT


Introduction
The World Health Organization (WHO) declared on March 11, 2020, that COVID-19 was a global pandemic (Sohrabi et al., 2020). According to WHO, more than fifty-two million people had been confirmed positive for COVID-19, and by the second week of November 2020 a reported 1.2 million people had died (Alamsyah et al., 2021). Given the rapid spread of COVID-19 and the consequences that arise if the problem is not addressed immediately, one way to slow the spread of the virus is to develop a vaccine (Fitriana et al., 2021).
The emergence of the COVID-19 vaccine has been controversial, which has led some people to turn to social media to express their opinions and views. According to the global digital statistics report "Digital, Social & Mobile in 2019", social media users reached one hundred and fifty million in 2019. Twitter is the social media platform with the most active users in Indonesia, accounting for fifty-two percent of total social media users (Fitriana et al., 2021). Twitter is a source of information that can be used for sentiment analysis, here by dividing public opinion about the COVID-19 vaccine into two sentiment classes, negative and positive. Sentiment analysis is a way to classify people's emotional levels as neutral, positive, or negative (Mubarok et al., 2017). Tweets are retrieved automatically by the system, which then classifies them according to whether they contain neutral, positive, or negative sentences (Tripathy et al., 2016).
In previous work using the Support Vector Machine (SVM) method, Windasari, Uzzi, and Satoto studied public sentiment towards Gojek's online transportation service on Twitter. In that conference article, 1000 tweets were collected and divided into positive and negative tweets, and SVM was used to classify the data. The analysis yielded an accuracy of 86% and a prediction error of 14% for positive and negative tweets (Windasari et al., 2017).
Based on the problems above, this research compares classification algorithms for tweet sentiment using TF-IDF feature extraction and the TextBlob library, limited to the topic of the COVID-19 vaccine. SVM, Naïve Bayes (NB), and k-Nearest Neighbor (KNN) are the classification algorithms used. SVM has the advantage of identifying a separating hyperplane that maximizes the margin between two different classes. Naïve Bayes is simple and fast, and can produce high accuracy with little training data. KNN was chosen because it is robust to noisy data. The performance of the three classification algorithms is compared to see which is better at text classification. The resulting accuracy serves as the benchmark for finding the best model for sentiment classification (Ashari et al., 2016). The purpose of this research is to help the government and the public understand public responses to and concerns about the COVID-19 vaccine, and to provide material for the government's evaluation when determining further strategies for education and outreach about COVID-19 vaccination.

Sentiment Analysis
Sentiment analysis is the process of determining the sentiment in a document or sentence and classifying the polarity of the text as positive, negative, or neutral. In sentiment analysis, mining is carried out to examine, process, and extract textual data about an entity such as a service, product, person, phenomenon, or subject. The analysis includes evaluating texts, forums, tweets, or blogs using data preprocessing, which includes tokenization, stopword removal, stemming (root detection), sentiment identification, and sentiment classification (Rasenda et al., 2020). Sentiment analysis is generally carried out at three levels: sentence level, document level, and aspect level. Document-level analysis groups whole documents into positive or negative sentiment classes. Sentence-level analysis is based on the polarity of each sentence (Tripathy et al., 2015). Aspect-level analysis classifies the opinions in a document as good or bad sentiment, based on several large documents on the same topic. At the sentence level, the sentiment of each sentence is classified by identifying whether the sentence expresses a positive or negative opinion (Medhat et al., 2014).

Support Vector Machine
Support Vector Machine (SVM) is an established method that is still used by many researchers working with big data (Larasati et al., 2019). SVM uses a hyperplane to separate two classes in the input space (Athoillah, 2018). It aims to predict future data from existing data based on their characteristics, by finding a hyperplane that separates the data into two possible categories, namely positive and negative (Cortes & Vapnik, 1995).
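The separating-hyperplane idea can be illustrated with a minimal sketch. The weight vector `w`, the bias `b`, and the sample points are hypothetical values chosen for illustration; they are not parameters learned in this study, and the sketch shows only how a trained linear SVM would assign a class, not how it is trained.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 (positive class) or -1 (negative class)."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two supporting hyperplanes w.x + b = +1 and -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 - x2 = 0 in a two-dimensional feature space.
w, b = [1.0, -1.0], 0.0
print(predict(w, b, [2.0, 0.5]))
print(predict(w, b, [0.5, 2.0]))
print(margin_width(w))
```

Training an SVM amounts to choosing `w` and `b` so that `margin_width(w)` is as large as possible while the training points stay on the correct side.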

Naïve Bayes
Naïve Bayes is a simple probabilistic classification algorithm that applies Bayes' theorem with a strong independence assumption. Its performance depends on the amount of data used, so a method with high classification performance and high accuracy is needed. An advantage of Naïve Bayes is that it does not require much training data to estimate the parameters needed for classification (Kao & Poteet, 2007). For the sentiment analysis classifier, multinomial Naïve Bayes is used because it is commonly applied to text classification. It rests on two simplifying assumptions (Krishnaiah et al., 2013):
• Bag-of-words assumption: the position of a word in the text is not important.
• Conditional independence assumption: the probability of each feature is independent of the other features given the class.
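The two assumptions can be made concrete with a small multinomial Naïve Bayes sketch. The toy documents and labels are hypothetical, and add-one (Laplace) smoothing is an assumption introduced here to avoid zero probabilities; the study does not state which smoothing was used.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)   # bag-of-words: only counts, no positions
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Conditional independence: each word contributes its own factor.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["good", "vaccine"], ["great", "news"],
        ["bad", "side", "effects"], ["terrible", "news"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_multinomial_nb(docs, labels)
print(predict_nb(model, ["good", "news"]))
print(predict_nb(model, ["bad", "effects"]))
```

Because only word counts are stored, reordering the tokens of a tweet does not change the prediction.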

k-Nearest Neighbor
k-Nearest Neighbor (KNN) is one of several classification methods in machine learning. KNN assigns objects to one of the predefined classes based on labeled sample groups. It is a supervised algorithm that classifies data according to how close they are to other data. KNN is a lazy learning algorithm, meaning that classification is done by finding the k training samples whose features are closest (most similar) to those of the new or test data (Mustakim & Oktaviani F, 2016).
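The nearest-neighbor vote can be sketched as follows. Euclidean distance and k = 3 are assumptions made for illustration, since the study does not state the distance measure or the value of k, and the two-dimensional training points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k closest training samples."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]
print(knn_predict(train_X, train_y, [0.5, 0.5]))
print(knn_predict(train_X, train_y, [5.5, 5.5]))
```

No model is built in advance: all the work happens at prediction time, which is what "lazy learning" refers to.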

Feature Extraction TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a word weighting method used to find information in text mining. The TF-IDF value of a word is based on the number of times it appears in a document, balanced by its frequency across the whole document collection, which accounts for words that occur frequently in general. The TF-IDF calculation assigns each word a weight determined from the number of times it appears in a document (Dadgar et al., 2016). Term Frequency (TF) is the number of times a word is repeated in one sentence. Inverse Document Frequency (IDF) measures the amount of information a word carries, that is, whether the word occurs often or rarely across all documents (Hakim et al., 2014).
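A minimal sketch of the weighting is given here. It assumes one common variant, TF = count / document length and IDF = log(N / document frequency); the study does not state which variant was applied, and library implementations often add smoothing terms. The three token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (each document must be non-empty)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # document frequency: count each term once per doc
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["vaccine", "safe"],
        ["vaccine", "risk"],
        ["vaccine", "safe", "effective"]]
weights = tf_idf(docs)
for row in weights:
    print(row)
```

In this toy collection "vaccine" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight, while the rarer word "risk" receives the largest weight in its document.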

Library TextBlob
TextBlob is a Python library for processing textual data, with features such as tokenization, sentiment analysis, and translation between many common languages (Pedregosa et al., 2011). The library provides a simple API for common Natural Language Processing (NLP) tasks (Bose et al., 2020). This study uses TextBlob for sentiment analysis. Sentiment analysis in TextBlob is only available for English, so text in other languages must first be translated into English.

Method
This study compares three classification algorithms for sentiment analysis of the COVID-19 vaccine on Twitter; the algorithm with the highest accuracy is used as the reference for the sentiment analysis process. Before the algorithms were designed, a research design was drawn up so that the research objectives could be achieved and scientifically accounted for.
The research design is described in the flowchart in Figure 1.

Figure 1. Flowchart Research Design
Three algorithms, SVM, Naïve Bayes, and KNN, are applied to sentiment analysis of tweets about the COVID-19 vaccine, with TF-IDF feature extraction and labeling using the TextBlob library. The comparison was carried out in six stages: preprocessing, labeling using the TextBlob library, TF-IDF feature extraction, splitting into training and testing data, creation of the machine learning models, and classification and accuracy testing.

Results and Discussion
This section is divided into two parts, results and discussion. The results are a description of the data and findings obtained using the methods and procedures described in the data collection method. The discussion is an explanation of the results that answer research questions more comprehensively.

The results
This study compares the accuracy of three classification algorithms, namely SVM, Naïve Bayes, and KNN, in analyzing COVID-19 vaccine sentiment on Twitter. The results are as follows.

Results of the Crawling Process
The data used in this study were collected by crawling the Twitter API with Python, using the keyword #vaccineCOVID-19 for the period 10-13 February 2022. The crawl produced 35,644 records, and only English-language tweets were retrieved. Sample data from the crawl can be seen in Figure 2.

Results Preprocessing
This stage aims to normalize the words, remove characters such as numbers, symbols, and punctuation marks, and remove unnecessary words so that the data become more structured. The preprocessing stages carried out in this research are case folding, cleansing, tokenization, and stopword removal.

Case Folding
Case folding converts uppercase letters to lowercase. This is done so that uppercase and lowercase forms of the same word are not treated as having different meanings. Tweets before and after case folding can be seen in Table 1.

Table 1. Results case folding
Before: b"@zerocovidzoe Yepp it's great to to see a government fighting its own citizens who just want their freedom back and\xe2\x80\xa6 https://t.co/ZlWbg2J2me"
After: b"@zerocovidzoe yepp it's great to to see a government fighting its own citizens who just want their freedom back and\xe2\x80\xa6 https://t.co/zlwbg2j2me"
Before: b'@JanBenninkCom Nice vaccin!'
After: b'@janbenninkcom nice vaccin!'
Before: b'This is stunning. Sad but stunning.
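As a minimal sketch, case folding in Python is a single call to `str.lower()`. The helper name `case_fold` is hypothetical, and the input here is the decoded text of a tweet from Table 1 rather than its raw byte representation.

```python
def case_fold(text):
    """Convert every uppercase letter in the tweet to lowercase."""
    return text.lower()

print(case_fold("@JanBenninkCom Nice vaccin!"))
# prints: @janbenninkcom nice vaccin!
```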

Cleansing
The cleansing stage removes punctuation marks, numbers, symbols, and other characters so that later analysis is easier and is not mixed with non-text characters. Tweets before and after cleansing can be seen in Table 2.

Before: b"@zerocovidzoe Yepp it's great to to see a government fighting its own citizens who just want their freedom back and\xe2\x80\xa6 https://t.co/ZlWbg2J2me"
After: yepp its great to to see a government fighting its own citizens who just want their freedom ack and
Before: b"@1Think4yourself @TropicalVertic1 That's not true at all. Furthermore, myocarditis is a much more common complicati\xe2\x80\xa6 https://t.co/Xaf8tlhs2v"
After: thats not true at all furthermore myocarditis is a much more common complicati
Before: b'@BelovedAmanda0 @jm131995 @iruntoyouj right? take a flight, vaccination card, a mask, 10 days of quarantine, go to\xe2\x80\xa6 https://t.co/MppGL9BHd0'
After: right take a flight vaccination card a mask days of quarantine go to
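The cleansing step can be sketched with regular expressions. The patterns and their order are assumptions inferred from the before and after samples in Table 2 (mentions, URLs, digits, and punctuation disappear); they are not the authors' actual code, and the function takes decoded text rather than raw bytes.

```python
import re

def cleanse(text):
    """Strip URLs, @mentions, digits, punctuation, and symbols from a tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove user mentions
    text = re.sub(r"[^a-z\s]", "", text)        # keep only lowercase letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

tweet = ("@BelovedAmanda0 @jm131995 @iruntoyouj right? take a flight, "
         "vaccination card, a mask, 10 days of quarantine, go to\u2026 "
         "https://t.co/MppGL9BHd0")
print(cleanse(tweet))
# prints: right take a flight vaccination card a mask days of quarantine go to
```

Removing the apostrophe without inserting a space is what turns "it's" into "its" and "That's" into "thats", as in the table.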

Tokenization and Stopword Removal
Tokenization splits the text into word tokens that are used to build the document matrix in the next process. Stopword removal then removes words that appear often in tweets but have no effect on the analysis. The NLTK package is used for both tokenization and stopword removal. Tweets before and after tokenization and stopword removal can be seen in Table 3.

['yet', 'tried', 'times', 'latest', 'vaccination']
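The two steps can be sketched without external dependencies. The study uses NLTK; this sketch replaces it with whitespace splitting and a small hypothetical stopword set, so its output will differ from NLTK's on real tweets. The input sentence is also hypothetical.

```python
# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "i", "have", "not", "to", "of", "and", "is", "it", "my", "but"}

def tokenize(text):
    """Split cleansed text into word tokens on whitespace."""
    return text.split()

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, keeping the original order."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = tokenize("i have not yet tried the latest vaccination")
print(remove_stopwords(tokens))
# prints: ['yet', 'tried', 'latest', 'vaccination']
```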

TextBlob Library Labeling Results Sentiment
Class labeling is usually divided into three classes: positive, negative, and neutral. In this study, however, only two sentiment classes were used, positive and negative. In the TextBlob library, labeling is done by determining the subjectivity and polarity of each tweet.
The labeling indicator is the polarity of a tweet as computed by TextBlob: polarity < 0 is the negative class, polarity = 0 is the neutral class, and polarity > 0 is the positive class. Because this study does not use the neutral class, the labeled data are normalized by taking 2500 samples from each of the positive and negative classes, giving a target of 5000 samples in total. The results of labeling sentiment classes using the TextBlob library can be seen in Table 4.
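The labeling rule and the class balancing can be sketched as follows. The polarity values would come from TextBlob (for example `TextBlob(text).sentiment.polarity`), which is left out so the sketch runs without the library. Random sampling with a fixed seed is an assumption, since the study does not say how the 2500 samples per class were chosen, and the function names and rows are hypothetical.

```python
import random

def label_from_polarity(polarity):
    """Map a polarity score to a sentiment class using the thresholds described above."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

def balance_binary(rows, per_class, seed=0):
    """Keep at most per_class (text, label) rows for each of the two classes; drop neutral."""
    rng = random.Random(seed)
    balanced = []
    for label in ("positive", "negative"):
        pool = [row for row in rows if row[1] == label]
        balanced.extend(rng.sample(pool, min(per_class, len(pool))))
    return balanced

# Hypothetical (text, polarity) pairs standing in for TextBlob output.
scored = [("great news", 0.8), ("nice", 0.6), ("good", 0.7),
          ("awful", -1.0), ("bad idea", -0.7), ("a flight", 0.0)]
rows = [(text, label_from_polarity(p)) for text, p in scored]
sample = balance_binary(rows, per_class=2)
print(sample)
```

With `per_class=2500` on the full labeled set, the same function would yield the 5000-sample target.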

Results Feature Extraction TF-IDF
Tweets that have gone through preprocessing are still in text form and are then converted into vectors using the TF-IDF technique. The numerical data obtained from this word weighting can be used for classification. The principle of the TF-IDF feature extraction pseudocode is to assign a weight to each word in the dataset. The results of word weighting using TF-IDF feature extraction can be seen in Table 5.

The classification
After TF-IDF feature extraction, the tweet data are split into training and testing sets at a ratio of 80:20. Classification is then carried out using three algorithms: SVM, Naïve Bayes, and KNN. As an example of the classification calculation, the confusion matrix for the SVM algorithm on the testing data can be seen in Table 6. According to this confusion matrix, SVM correctly predicted 481 positive tweets and 475 negative tweets, wrongly predicted 19 tweets as negative, and wrongly predicted 25 tweets as positive. SVM therefore classified 956 of the 1000 test tweets correctly. Based on Table 6, the accuracy of the SVM algorithm is (481 + 475) / 1000 = 95.6%. The precision, which measures how many of the tweets predicted as positive are actually positive, is 481 / (481 + 25) = 95.1%. The recall, which measures how many of the actually positive tweets were found, is 481 / (481 + 19) = 96.2%.
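The reported figures can be checked with a short calculation. The assignment of the 19 errors to false negatives and the 25 errors to false positives is inferred from the reported precision and recall, since Table 6 itself is not reproduced here; the function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, and recall for the positive class of a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are truly positive
    recall = tp / (tp + fn)      # share of true positives that were found
    return accuracy, precision, recall

acc, prec, rec = classification_metrics(tp=481, fn=19, fp=25, tn=475)
print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%}")
# prints: accuracy=95.6% precision=95.1% recall=96.2%
```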
In this classification process, 10 tests were carried out. Of the 10 tests, the highest accuracy value is taken as the final result of this research. The results of the 10 tests of each classification algorithm can be seen in Table 7 and Figure 3.
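The repeated testing can be sketched as ten 80:20 splits followed by taking the best score. How the ten tests differ is not stated in the study; varying the shuffle seed is an assumption, and `evaluate` is a hypothetical callback that would train a classifier on the training part and return its accuracy on the testing part.

```python
import random

def split_80_20(rows, seed):
    """Shuffle a copy of the rows and cut it into 80% training and 20% testing data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def best_of_runs(rows, evaluate, runs=10):
    """Run evaluate(train, test) once per seed and return (best score, all scores)."""
    scores = []
    for seed in range(runs):
        train, test = split_80_20(rows, seed)
        scores.append(evaluate(train, test))
    return max(scores), scores

# Placeholder data and evaluator, used only to show the call shape.
rows = list(range(100))
best, scores = best_of_runs(rows, evaluate=lambda train, test: len(test) / len(rows))
print(best, len(scores))
```

Reporting the maximum over runs gives an optimistic figure; the mean over the same runs could be reported alongside it.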

Discussion
This study compares the accuracy obtained with the SVM, Naïve Bayes, and KNN algorithms after preprocessing, labeling using the TextBlob library, normalization, and TF-IDF. The dataset consists of 35,644 tweets collected from Twitter with the keyword #vaccineCOVID-19, stored as a *.csv file with one main column, text, and two supporting columns, created_at and id.
Based on the research conducted, the SVM algorithm achieved the best accuracy and performance after 10 tests. The highest accuracy value of each algorithm is taken as its final result. The results of the 10 tests of each classification algorithm can be seen in Table 8 and Figure 4.

Figure 4. Graph of Accuracy Comparison Results
Based on the research conducted, the comparison of the three algorithms shows that SVM achieves the highest final accuracy, 96.3%, while Naïve Bayes and KNN achieve 94% and 91%, respectively. To assess whether the method used in this study improves on previous methods, its accuracy was compared with previous studies that used datasets with the same keywords. The comparison results are shown in Table 9 and Figure 5 (Wibowo & Musdholifah, 2021). The accuracy in this study is quite high, above 90%, supported by TF-IDF feature extraction and labeling using the TextBlob library. A drawback of this study is that, after word weighting with TF-IDF, the words were not sorted from highest to lowest weight so that words meeting a certain weight threshold could be selected. Future researchers are encouraged to use feature selection to select words by weight, which may produce better accuracy, and to use different classification algorithms so that the accuracy obtained in sentiment analysis can be compared.

Conclusion
Based on the research conducted on the COVID-19 vaccine on Twitter, public sentiment about the vaccine can be analyzed with three classification algorithms, SVM, Naïve Bayes, and KNN, in several stages. The first stage is crawling, followed by preprocessing, labeling using the TextBlob library, normalization, word weighting using TF-IDF, and finally classification using the SVM, Naïve Bayes, and KNN algorithms.
This study compared the accuracy of three classification algorithms, namely SVM, Naïve Bayes, and KNN, in sentiment analysis of the COVID-19 vaccine on Twitter. After testing each algorithm 10 times, the highest accuracy values were 96.3% for SVM, 94% for Naïve Bayes, and 91% for KNN. The highest accuracy was obtained with SVM, so this algorithm is suitable for sentiment analysis of the COVID-19 vaccine on Twitter.