Improved Accuracy of Naïve Bayes Algorithm and Support Vector Machine Using Particle Swarm Optimization for Menstrual Cup Sentiment Analysis on Twitter

ABSTRACT


Introduction
Today, social media plays an important role in providing feedback (MacReadie et al., 2011). One of the social media platforms most often used to express opinions is Twitter, which lets users send messages in real time. This feedback reaches both individuals and groups. Social media also provides a space for expressing thoughts and ideas and for conveying opinions. According to Databoks.katadata.co.id, the website of an online media and research company covering economics and business (accessed on February 1, 2022), Twitter is used by 59% of users in Indonesia, which made it the fifth most used social media platform in Indonesia in 2020. This indicates that Twitter is quite influential among social media users in Indonesia.
Among the many public opinions that have entered Twitter's trending topics, the use of menstrual cups as menstrual sanitation products for women's reproductive health has not escaped discussion. A menstrual cup is a silicone device used as a substitute for disposable sanitary napkins. Menstrual cups are also considered environmentally friendly because they can be used repeatedly. However, in Indonesia, the use of menstrual cups is still considered taboo and draws many arguments for and against. Given these arguments, it is necessary to analyze the public's views on the use of menstrual cups, a task known as sentiment analysis. Sentiment analysis is part of text mining. Text mining is the activity of analyzing documents or data against one another to find new information, and includes tasks such as information categorization and text grouping (Betesda, 2020). Sentiment analysis is an exploration of opinions that aims to analyze and evaluate a topic, product, or service, and it belongs to the broad field of natural language processing (Kristanto et al., 2019). It is a process of classifying the opinions in a text into positive or negative sentiments (Larasati et al., 2019). Public opinion, especially on social media, is very important for making decisions that benefit individuals and organizations. Currently, the sentiment of product reviews has become important information for producers and potential customers (Hamzah, 2021). The purpose of sentiment analysis here is to determine the extent of public understanding of the use of menstrual cups.
Several data classification algorithms are used in sentiment analysis, including Naïve Bayes and the Support Vector Machine. Both algorithms are considered able to analyze public sentiment well. The Naïve Bayes algorithm calculates the probability of each factor and then chooses the result with the highest probability (Wisnu et al., 2020). It is considered suitable for sentiment classification because it can produce a fairly high level of accuracy. The Support Vector Machine, in turn, is a classification algorithm that can produce a good classification model even when trained with little data and only simple parameters (Hasan & Wahyudi, 2018). However, the Support Vector Machine has several weaknesses, one of which is the problem of selecting appropriate features (Ratino et al., 2020).
Because of these weaknesses, Particle Swarm Optimization feature selection is added to improve performance. Particle Swarm Optimization is an optimization algorithm that aims to produce an optimum response value by determining process parameters (Sateria et al., 2019). It is considered quite easy to use because it does not require many lines of code or complicated mathematical operators, which keeps its memory and running-time requirements low. Based on the problem described above, this research focuses on increasing the accuracy of two classification algorithms, namely Naïve Bayes and the Support Vector Machine, using Particle Swarm Optimization feature selection for sentiment analysis of menstrual cup usage on Twitter.

Methods
The process of increasing the accuracy consists of several main stages, including the preprocessing stage, labeling stage, normalization stage, feature extraction stage, feature selection stage, data splitting stage, classification stage, and model testing stage. Each of these stages has a different result. The stages of the process can be seen in Figure 1.

Dataset
The dataset used is based on research by Fauziah (2020), who obtained it by crawling tweets on Twitter. The tweets were collected from April 26, 2020 to May 25, 2020 with the keyword "menstrual cup", giving a total of 1,108 tweets in English. The dataset has two attributes: created_at, which is the time the tweet was created, and text, which is the content of the tweet. The dataset used in this study can be seen in Table 1.

Preprocessing
The preprocessing stage aims to eliminate noise so that the sentiment analysis process becomes more accurate and can be used in general. The preprocessing stage is also carried out in order to produce more structured data for further processing in the next stage (Jumeilah, 2017). Data preprocessing consists of the following processes:
• Cleansing is a process used to remove all unused characters and serves to reduce noise in the data (Bayhaqy et al., 2018).
• Case Folding is a process used to change all letters in a sentence into lower case.
• Stemming is the process of converting tokens into basic words. This word conversion is carried out to ensure that every word that is the same but has a different suffix can be recognized as the same value to avoid bias in the transformation stage (Mariel et al., 2018).
• Tokenization is the stage for splitting text data into tokens.
After that, clean data is obtained which is ready to be processed in the next stage.
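The four steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the regular expressions, the suffix list, and the function names are assumptions, and the toy suffix stripper merely stands in for a real stemmer such as Porter's. Tokenization is applied before stemming here, since stemming operates on tokens.

```python
import re

# Toy suffix list standing in for a real stemmer; illustrative only.
_SUFFIXES = ("ing", "ed", "s")

def cleanse(text: str) -> str:
    """Remove URLs, user mentions, and every non-letter character."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # user mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()

def case_fold(text: str) -> str:
    """Convert all letters to lower case."""
    return text.lower()

def tokenize(text: str) -> list:
    """Split the text into whitespace-separated tokens."""
    return text.split()

def stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(tweet: str) -> list:
    """Cleansing, case folding, tokenization, then stemming of each token."""
    return [stem(tok) for tok in tokenize(case_fold(cleanse(tweet)))]

tokens = preprocess("Loving my new #MenstrualCup!! https://t.co/xyz @friend")
```

A production pipeline would normally replace `stem` with an established stemmer, because naive suffix stripping produces stems such as "lov" that a dictionary-aware stemmer handles more consistently.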

Labelling
This stage labels the sentiment of each text using a lexicon-based approach, a method for classifying a sentence as having positive or negative sentiment. In this study, text labeling uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based library used for automatic labeling in text analysis. The lexicon-based approach has several stages, such as determining word polarity, handling negation, and scoring each tweet (Mustofa & Prasetiyo, 2021). Labeling begins once the words that carry positive and negative sentiment are known; the opinion value of the text is then calculated from those words (Mahendrajaya et al., 2019). VADER is considered to work well for sentiment, especially on social media, and is available in the NLTK package, so it can be applied directly to unlabeled datasets.
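The three lexicon-based stages named above (word polarity, negation handling, scoring) can be illustrated with a toy scorer. This is not VADER: the six-word lexicon, the negation list, and the tie-breaking rule are all hypothetical, and VADER's own lexicon and heuristics are considerably richer.

```python
# Tiny hypothetical lexicon; a real lexicon holds thousands of graded entries.
LEXICON = {"love": 2.0, "comfortable": 1.5, "good": 1.0,
           "leak": -1.5, "painful": -2.0, "bad": -1.0}
NEGATIONS = {"not", "no", "never"}

def score_tweet(tokens):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    total = 0.0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0.0)          # stage 1: word polarity
        if i > 0 and tokens[i - 1] in NEGATIONS:  # stage 2: negation handling
            polarity = -polarity
        total += polarity                         # stage 3: scoring
    return total

def label(tokens):
    """Map the score to a binary label; a score of zero counts as positive (assumption)."""
    return "positive" if score_tweet(tokens) >= 0 else "negative"

example_label = label(["not", "comfortable", "and", "painful"])
```

Binary labels are used because the study reports only positive and negative classes; how neutral scores were handled is not stated in the text.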

Normalization
The normalization stage applies a linear transformation to the original data (Nurjanah et al., 2017). Normalization is used to balance the data, so that the values keep the same relative proportions before and after the process while all falling within the same range.
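The text does not name the normalization formula. One common linear transformation that maps values onto a shared range is min-max scaling, sketched below purely as an assumption about what such a step could look like; the function name and default range are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

scaled = min_max_normalize([2, 4, 6])
```

Because the mapping is linear, the ordering and relative spacing of the original values are preserved.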

Feature Extraction
The feature extraction stage is the process of converting tokens into numeric vectors. The method used in this stage is TF-IDF (Term Frequency-Inverse Document Frequency), a technique that weights each word by how often it appears in a document. Term Frequency (TF) is the number of times a word appears in a document or text, while Inverse Document Frequency (IDF) reflects the importance of a word, giving less weight to words that appear in many documents. The TF weight is calculated with Equation 1, the IDF weight with Equation 2, and the TF-IDF weight with Equation 3.
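For reference, a widely used textbook form of the three weights is given below. The symbols are assumptions rather than the authors' notation: f_{t,d} is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents containing t. The paper's own equations may use a different normalization or logarithm base.

```latex
\begin{align}
\mathrm{tf}(t, d) &= f_{t,d} \\
\mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \times \mathrm{idf}(t)
\end{align}
```

As a worked example under these assumptions with a base-10 logarithm, a term that occurs 3 times in a tweet and appears in 10 of 1,000 tweets has idf = log(1000/10) = 2 and therefore a TF-IDF weight of 3 × 2 = 6.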

Feature Selection
In this study, Particle Swarm Optimization is used as the feature selection algorithm to increase the accuracy of the classification algorithms, namely Naïve Bayes and the Support Vector Machine. The search is based on a population of particles. The velocity of each particle is updated repeatedly to find a new best solution, and the search stops when a termination condition is reached.
Particle Swarm Optimization is likened to the behavior of a flock of birds in a habitat, where each particle is like a bird. Each bird acts on its own intelligence while also being influenced by the collective behavior of the flock. When a bird finds a suitable or shorter path to a food source, the rest of the flock follows that path even if they are far apart. Each bird or particle is treated as a point in a space of a certain dimension. The steps taken by Particle Swarm Optimization in selecting features according to Sabrila et al. (2022) can be seen in Figure 2.
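The search described above can be sketched as a binary Particle Swarm Optimization over 0/1 feature masks. This is an illustrative sketch under stated assumptions, not the procedure of Figure 2: the sigmoid transfer rule, the function names, and the toy fitness are assumptions, and in the study the fitness would be the classifier's accuracy on the selected features. The defaults mirror the parameter values reported in the Results section (w = 0.72, c1 = c2 = 1.49, 10 particles, 10 iterations).

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=10, n_iter=10,
               w=0.72, c1=1.49, c2=1.49, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    # Random initial masks and zero velocities.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                # Sigmoid transfer turns the velocity into a bit probability.
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Toy fitness standing in for classifier accuracy: the fraction of bits
# that match a hypothetical set of "useful" features.
USEFUL = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_fitness(mask):
    return sum(m == u for m, u in zip(mask, USEFUL)) / len(USEFUL)

best_mask, best_fit = binary_pso(toy_fitness, n_features=len(USEFUL))
```

Each fitness call would retrain and evaluate a classifier in a real run, so the cost grows with both the number of particles and the number of iterations.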

Splitting Data
The data splitting stage divides the data into training data and testing data. In this study, the ratio of training data to testing data is 80:20. This is based on research by Gholamy et al. (2018), which states that an 80:20 split is empirically the best ratio of training data to testing data. The 80:20 split is also based on the Pareto Principle, named after Pareto's observation that 20% of the factors determine 80% of success (Harvey & Sotardi, 2018).
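A minimal sketch of an 80:20 split is shown below. The shuffling, the seed, and the function name are assumptions, and the 1,108 items simply reuse the tweet count reported in the Dataset section as placeholder data.

```python
import random

def train_test_split(items, train_ratio=0.8, seed=42):
    """Shuffle a copy of the items and cut it at the training ratio."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

tweets = list(range(1108))          # placeholder IDs, one per tweet
train, test = train_test_split(tweets)
```

With 1,108 items this yields 886 training and 222 testing examples, since the cut index is rounded down.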

Classification
The classification stage consists of two classification processes, one using the Naïve Bayes algorithm and one using the Support Vector Machine. Classification is a technique that can be used to predict data or describe data classes (Alamsyah & Fadila, 2021). The Naïve Bayes method has three stages: computing the prior probability, finding the probability value of each feature, and determining the final posterior probability (Insani et al., 2018). The Support Vector Machine, in turn, assigns a value to the number of occurrences of each word and can classify sentences into positive or negative labels (Giovani et al., 2020). The stages of Naïve Bayes and Support Vector Machine classification with Particle Swarm Optimization in this study can be seen in Figure 3.
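The three Naïve Bayes stages (prior, per-word probability, posterior) can be sketched with a small multinomial model. This is an illustration rather than the authors' code: the add-one (Laplace) smoothing, the log-space arithmetic, the function names, and the four toy documents are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Return log-priors, per-class word counts, per-class totals, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
    vocab = {tok for tokens in docs for tok in tokens}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    totals = {c: sum(word_counts[c].values()) for c in class_docs}
    return log_prior, word_counts, totals, vocab

def predict_nb(model, tokens):
    """Return the class with the highest unnormalized log-posterior."""
    log_prior, word_counts, totals, vocab = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]                      # stage 1: prior
        for tok in tokens:
            if tok in vocab:                      # unseen words are skipped
                # stage 2: word likelihood with add-one smoothing
                score += math.log((word_counts[c][tok] + 1)
                                  / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)            # stage 3: posterior

docs = [["love", "cup"], ["comfortable", "cup"],
        ["painful", "leak"], ["bad", "leak"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)
prediction = predict_nb(model, ["love", "comfortable"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and smoothing prevents a single unseen word-class pair from zeroing the whole product.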

Figure 3. Classification flowchart with PSO
The first stage is to initialize the Particle Swarm Optimization parameters. The fitness value is then evaluated with the Support Vector Machine classification algorithm, based on the parameters of each selected particle, to obtain the classification accuracy. If the optimization process has not reached its maximum iteration, the velocity and position of each particle continue to be updated until the maximum iteration is reached.

Model Testing
In this study, model testing was carried out using a confusion matrix and three experiments using k-fold cross-validation with k = 5, so each experiment produces five accuracy results. The confusion matrix is considered able to evaluate the performance of the system that has been built (Sabrila et al., 2022). Accuracy is calculated from the confusion matrix with Equation 4.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (4)$$
Accuracy is used to measure the level of agreement between the predicted value and the actual value. Here TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
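The evaluation described above can be sketched as follows: counting the four confusion-matrix cells, computing accuracy from them, and generating the train and test indices for five folds. The function names and the contiguous, unshuffled fold layout are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
acc = accuracy(*confusion_counts(y_true, y_pred))
folds = list(k_fold_indices(12, k=5))
```

In each fold the model is trained on `train_idx` and scored on `test_idx`, which gives the five accuracy values per experiment mentioned in the text.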

Results and Discussion
This study applies feature selection with Particle Swarm Optimization for menstrual cup sentiment analysis on Twitter using the Naïve Bayes and Support Vector Machine classification algorithms. The higher the accuracy, the better the algorithm predicts sentiment. Three experiments were conducted using k-fold cross-validation with k = 5, meaning that each experiment produces five accuracy results.
After the data is collected, the next processes are preprocessing, labeling, and normalization. The normalization stage produces a dataset that has been labeled positive or negative and has the same scale range; the results can be seen in Table 2. The dataset is then processed at the feature extraction stage with TF-IDF. To increase accuracy with Particle Swarm Optimization, several parameters are used: the cognitive learning factor (c1) and the social learning factor (c2) are both 1.49, the inertia weight (w) is 0.72, the number of iterations is 10, and the number of particles is 10. The cost results from Particle Swarm Optimization can be seen in Table 3. The accuracy of Naïve Bayes and the Support Vector Machine before applying Particle Swarm Optimization can be seen in Table 4, and the accuracy after applying Particle Swarm Optimization can be seen in Table 5. From these results, a graph of the increase in accuracy of each algorithm is obtained. The graph of the accuracy of Naïve Bayes with and without Particle Swarm Optimization can be seen in Figure 4.
[...] Particle Swarm Optimization is applied. Thus, Particle Swarm Optimization has proven able to improve the accuracy of the Naïve Bayes and Support Vector Machine classification algorithms for menstrual cup sentiment analysis on Twitter. The comparison of this study with previous studies can be seen in Table 6. The advantage of this research is that the accuracy of the classification algorithms increases relative to the accuracy before Particle Swarm Optimization is applied, so feature selection with Particle Swarm Optimization is considered able to improve the accuracy of the Support Vector Machine and Naïve Bayes classification algorithms when compared with previous research.

The weakness of this research is that Particle Swarm Optimization depends on the number of iterations and the number of particles used during execution. The larger the number of iterations and particles, the longer the execution takes.

Conclusion
This paper examines the Naïve Bayes and Support Vector Machine classification algorithms with Particle Swarm Optimization applied for menstrual cup sentiment analysis on Twitter. The aim is to find out the extent of public understanding of menstrual cups. Particle Swarm Optimization is used to improve the accuracy of the classification algorithms. The results indicate that both Naïve Bayes and the Support Vector Machine achieve higher accuracy when Particle Swarm Optimization is applied than when the classification algorithm is used alone. The accuracy obtained is 92.72% for Naïve Bayes, 95.87% for the Support Vector Machine, 96.13% for Naïve Bayes with Particle Swarm Optimization, and 96.68% for the Support Vector Machine with Particle Swarm Optimization.

References
Alamsyah, A., & Fadila, T. (2021, July). Increased accuracy of prediction hepatitis disease using the application of principal component analysis on a support vector machine. In Journal of Physics: Conference Series (Vol. 1968, No. 1, p. 012016