Implementation of Stacking Ensemble Classifier for Multi-class Classification of COVID-19 Vaccines Topics on Twitter

. Purpose: However, from the variety of uses of these algorithms, in general, accuracy problems are still a concern today, even accuracy problems related to multi-class classification still require further research. Methods: This study proposes a stacking ensemble classifier method to produce better accuracy by combining Logistic Regression, Random Forest, and Support Vector Machine (SVM) algorithms as first-level learners and using Logistic Regression as a meta-learner for the multi-class classification of COVID-19 vaccine topics on Twitter. Result: Based on the evaluation, the proposed Stacking Ensemble Classifier model shows 86% accuracy, 85% precision, 86% recall, and 85% f1-score. Novelty: The novelty is produce better accuracy by combining Logistic Regression, Random Forest, and Support Vector Machine (SVM) algorithms as first-level learners and using Logistic Regression as a meta-learner.


INTRODUCTION
Vaccines are a further effort to tackle the COVID-19 virus that has spread to various regions in Indonesia. The government's move to hold a COVID-19 vaccination program is the right thing. However, sundry public opinions about this program emerged, ranging from positive to negative opinions. One of the most talked about is the first vaccination against President Joko Widodo on January 13, 2021, at the State Palace [1]. Social media has become the most popular place for a person to express his opinion on various things. That matter happens because social media is something that everyone should have [2]. One of the most widely used social media in Indonesia is Twitter. The use of social media Twitter in Indonesia reaches 63.6% of internet users aged 16 to 64 years [3]. That matter opens up many sources of data that can use to generate knowledge. Twitter is also one of the social media that is often used by researchers in collecting data to support their research. One area of research that often uses data from Twitter is sentiment analysis. Sentiment analysis or also called opinion mining is one of the studies on text mining that aims to determine public perception or subjectivity to a topic, event, or problem [4]. Several studies analyze public opinion regarding the topic of COVID-19 on Twitter. Such as research [5] which classifies sentiments related to the PSBB topic on Twitter social media by comparing the Support Vector Machine (SVM) algorithm with Random Forest. The results show an accuracy of 58% for the Random Forest algorithm and show an accuracy of 56% for the Support Vector Machine (SVM) algorithm. The results of this study are very low due to the use of very little data, which is only 499 data.
Research [6] conducted a classification of Twitter social media users' perceptions of COVID-19 cases using the Logistic Regression algorithm with "L2" and "None" hyperparameters using 44955 tweet data. The results of this study indicate that the Logistic Regression algorithm with "L2" parameters produces an accuracy value of 77% with an f1-score of 74%, while the None parameter produces an accuracy value of 74% with an f1-score of 70%. The use of large data in this study can affect the accuracy of the Logistic In other topics, research [7] conducted a web content analysis for fake news using the Max Voting and Stacking Ensemble Classifier methods. As a result, the combination of the Logistic Regression, Support Vector Machine, and Random Forest algorithms in the Stacking Ensemble Classifier method got the best performance with an accuracy value of 98.37%, precision 0.97, recall 0.99, f1-score 0.98, and AUC 0.9888. Meanwhile, in dataset 2, the accuracy is 94.32%, precision is 0.95, recall is 0.93, f1-score is 0.94, and AUC is 0.9853. This performance is better than using the Max Voting ensemble which got an accuracy of 97.89%, precision 0.97, recall 0.99, f1-score 0.98, and AUC 0.9683 for dataset 1, as well as the accuracy of 93.85%, precision 0.94, recall 0.90, f1-score 0.92, and AUC 0.9625 for dataset 2. Research [8] uses a stacked ensemble method with Naive Bayes algorithm and Support Vector Machine to detect sentiment polarity in Bengali tweets with negative, neutral, and positive sentiment. As a result, the proposed stacked ensemble model achieves an accuracy of 56.2% which exceeds the LSTM algorithm which obtains an accuracy of 55.27%.
Research [9] optimized sentiment analysis for the 2019 presidential election with the Naive Bayes algorithm and Particle Swarm Optimization (PSO). The result of the accuracy with the 10-fold cross-validation for Naive Bayes algorithm of 86.62% and Naive Bayes with Particle Swarm Optimization (PSO) of 90.74%. The accuracy increases by 4.12% for Naive Bayes with Particle Swarm Optimization (PSO) algorithm. Research [10] proposed a new method with Weighted Classifier Selection and Stacked Ensemble for multilabel classification, resulting in an accuracy value of 0.909 for the MLWSE-L1 model and 0.91 for the MLWSE-L2 model. Research [11] conducted a sentiment analysis on the COVID-19 vaccine in India, the results concluded that the Covaxin vaccine had 36% positive tweets and 31% negative tweets, while the Covishield vaccine had 71% positive tweets and 29% negative tweets.
Despite achieving an accuracy of up to 98%, the study [7] only dealt with binary-class classification. Meanwhile, studies that perform multi-class classifications such as research [5], [6], and [8] get a low accuracy value. In this study, we combined the Logistic Regression algorithm, Support Vector Machine (SVM), and Random Forest as first-level learners, then used Logistic Regression as a meta-learner in the Stacking Ensemble Classifier method to improve the accuracy of multi-class classification of the COVID-19 vaccine topic on Twitter. Figure 1 is a process of our system. There as six main stages of the process, i.e., preprocessing, dataset labeling, feature extraction using TF-IDF, splitting the data, modeling, and evaluation.

Data
The data used in this study is tweet data related to the topic of the COVID-19 vaccine written by Indonesian people on Twitter. The data was collected using the twint library [12] in the python programming language. Data was collected using the keyword "vaksin covid19". The tweet data is tweet data available from May 1, 2021, to July 10, 2021. The data that has been collected is 13011 tweets. Table 1 shows a sample of tweet data that has been collected.

Preprocessing
In text mining, the preprocessing stage is carried out to process various data into regular data that can be applied to machine learning algorithms [13]. Figure 2 shows the preprocessing steps in this study: cleansing, case-folding, normalization, tokenizing, removing stop words, and stemming.

Figure 2. Preprocessing steps
The first step in the preprocessing stage is cleansing which functions to clean tweet data from unnecessary elements, such as hashtags starting with the # symbol, mentions starting with the @ symbol, links, retweets starting with RT, punctuation marks, numbers, images, and whitespace. The case-folding stage is used to convert uppercase letters to lowercase letters. The normalization stage is used to change short and "alay" words, into normal words by using "kamus alay" [14]. The tokenizing stage uses the nltk library [15] to break the sentence into a collection of words that compose the sentence. The stopword removal step is used to remove prepositions, pronouns, and conjunctions. The stemming stage uses the PySastrawi library [16] to change words into basic words to reduce word dimensions [17].

Dataset Labeling
The data is labeled positive, negative, or neutral at this stage. In this study, data labeling uses the InSet (Indonesian Sentiment Lexicon) dictionary [18] to calculate the polarity of tweet data. In the InSet dictionary, there are different polarities for each word, both positive and negative words. A negative number represents the polarity for negative words, while a positive number describes the polarity for positive words. The polarity of the tweet is calculated for each word in the sentence. The total number of polarities will determine the polarity of the sentence. If the total polarity is 0, then the tweet is labeled neutral. If the total polarity is less than 0, then the tweet is labeled negative. If the total polarity is more than 0, then the tweet is labeled positive.

Feature Extraction Using TF-IDF
This research uses the TF -IDF method with the sklearn library [19]. Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic to indicate the importance of a word in a document from a list of documents or corpus [20]. Term Frequency (TF) is the number of times a word occurs in a document which is defined as [20]: where , is the number of occurrences of the word in document . Inverse Document Frequency (IDF) is a measure of how much information a word provides [19]. Unlike the traditional IDF, the IDF on sklearn uses a unitary constant in the denominator and numerator which is defined as [19]: where is the number of documents in the corpus and ( ) the number of occurrences of the word in all documents in the corpus.

Split Data
Based on the feature extraction results using TF -IDF, the data is divided into two parts with 80:20 division. Data with 80% is used as training data, and 20% is used as test data.

Modeling
In this study, we propose a model using the stacking ensemble classifier method by combining the Logistic Regression algorithm, Support Vector Machine (SVM), and Random Forest as first-level learners and using the Logistic Regression algorithm for meta-learners. Figure 3 shows the process flow in the proposed model.

RESULT AND DISCUSSION Implementation
The proposed system is implemented in the python programming language with the nltk and PySastrawi libraries in the preprocessing stage, as well as the StackingClassifier module in the sklearn library for the implementation of the proposed model.

Testing
Testing is done by training the model using data as much as 80% of the total data or about 9120 data. The time required for this model training process is about 58 seconds. Then predictions were made using test data as much as 20% of the total data or about 2280 data. The prediction results are compared with the actual data and loaded into the confusion matrix. Table 3 is the result of the confusion matrix on the proposed model. In Table 3, the prediction results for the negative class are 1137 data classified as a negative class, 25 data classified as a neutral class, and 69 data classified as the positive class. The prediction result for the neutral class is 144 data classified as a neutral class, 79 data classified as a negative class, and 53 data classified as the positive class. The prediction result for the positive class is 676 data classified as a positive class, 15 data classified as a neutral class, and 70 data classified as negative class. Based on the confusion matrix in Table 3, the accuracy, precision, recall, and f1-score values can be calculated, the results are shown in Figure 4. The results of the classification report in Figure 4 show that the stacking ensemble classifier model in this study has a good performance. That matter is because the accuracy value obtained in this stacking ensemble classifier model is 0.86. This is high enough for a classification model that handles three classes. The precision value for the negative class got 0.88, the neutral class got 0.73, and the positive class got 0.85, with the average precision of each class using the weighted method got 0.85. The recall value for the negative class got 0.92, the neutral class got 0.52, and the positive class got 0.87, with the average recall value of each class using the weighted method got 0.86. The f1-score value for the negative class got 0.9, the neutral class got 0.61, and the positive class got 0.86, with an average f1-score value for each class using the weighted method got 0.86.

Evaluation
Based on the results of the tests carried out, the proposed stacking ensemble classifier model has a better performance compared to other models with a similar case study, namely multi-class classification, the comparison is shown in Table 4. Table 4. Model performance comparison Model Accuracy (%) F1-Score (%) LR + L2 [6] 77 77 Random Forest [5] 58 44 SVM [5] 56 44 Stacking (LR + SVM + RF) 86 85 In Table 4, it can be seen that the model proposed in this study has a better level of accuracy in handling multi-class classification. Compared to the Logistic Regression + L2 model which uses more data, our proposed model uses fewer data. Our model is 9% better on accuracy and 8% better on the f1-score. Our proposed model is also better than the Random Forest and Support Vector Machine models. With the use of more data, our proposed model is 28% better on the accuracy and 41% on the f1-score against the Random Forest model. Compared with Support Vector Machine model, our model is also 30% better on the accuracy and 41% on the f1-score.

Threats to Validity
We limit the text features generated from the TF-IDF process to 5000 features. The goal is to minimize unnecessary text features. As a result, the accuracy obtained is quite good, reaching 86%. We realize that under other situations if the feature is not limited or reduced, it will affect the accuracy results obtained.
The results of the accuracy can be greater or less.
The composition of the data used in this study is also unbalanced data. That matter can also affect the accuracy of results obtained when using balanced data. In addition, we also use the help of a dictionary in data labeling so that there are labels that do not match the sentences in the tweet data. For example, a positive sentence should be labeled negative. That matter will be very different if create labels manually, which will create labels according to the sentences in the tweet data.

CONCLUSION
The proposed stacking ensemble classifier model is a model that implements the stacking ensemble method using the Logistic Regression algorithm, Support Vector Machine (SVM), and Random Forest as first-level learners. Then we uses the Logistic Regression algorithm as a meta-learner which aims to handle multilevel classification. class. This model produces better accuracy and f1-score values than other similar case studies. Compared to the Random Forest + L2 model which uses 44955 tweets, although it uses less data, the proposed model has a 9% better accuracy value and an 8% better f1-score value. This is because the proposed model combines algorithms to reduce bias, more precisely, the predictions of each estimator are stacked together and used as input for the final estimator to calculate predictions. In addition, the proposed model is also compared with the model using the Random Forest and Support Vector Machine (SVM) algorithm for multi-class classification. The proposed model is 28% better on the accuracy value and 41% more prolific on the f1-score value compared to the Random Forest model. Meanwhile, compared to the SVM model, the proposed model excels by 30% on the accuracy value and 41% on the f1-score value. This is because more data was used in this study, so the model gained more knowledge during the training process. This study limits the text features generated from the TF-IDF process to 5000 features. The dataset used in this study is also unbalanced. Apart from that, we also use the help of a dictionary to label the datasets. These things can affect the accuracy results obtained by the proposed model if it is different from what was done in this study.