Classification of Movie Review Sentiment Analysis Using Chi-Square and Multinomial Naïve Bayes with Adaptive Boosting

ABSTRACT

Sentiment analysis problems have attracted the attention of researchers. Sentiment analysis is a process that aims to determine the sentiment polarity of a text. Nowadays, sentiment from product reviews has become an important piece of information for producers and potential customers. This paper presents a sentiment analysis classification of movie reviews from the IMDb site. The classification uses the multinomial naïve Bayes algorithm, with AdaBoost applied to boost its accuracy. Feature selection is used to reduce the number of features and remove irrelevant ones; chi-square feature selection was employed in the current study. The accuracy obtained in movie review sentiment classification using the multinomial naïve Bayes algorithm alone is 81.39%. The accuracy of the multinomial naïve Bayes algorithm with chi-square feature selection is 85.37%. The final accuracy of the multinomial naïve Bayes algorithm with both AdaBoost and chi-square feature selection is 87.74%.

Introduction

Naïve Bayes was chosen in this study because it is fast, easy to implement, and effective on high-dimensional data, since the probability of each feature is estimated independently (Taheri & Mammadov, 2013). Jagdale et al. compared the naïve Bayes algorithm and SVM on different datasets. Their results show that the naïve Bayes algorithm achieves better accuracy than the SVM algorithm on several datasets.
Similar research was conducted by Baik, Gupta, and Chaplot (2017), who compared the naïve Bayes, k-nearest neighbour, and random forest algorithms on sentiment analysis of movie reviews. The results show that the naïve Bayes algorithm has better accuracy than the k-nearest neighbour and random forest algorithms.
One problem that often occurs in sentiment analysis is the large number of features. A large number of features can reduce classification performance; therefore, a feature selection process is needed. One of the most popular feature selection methods is chi-square, which has been shown to be highly effective (Madasu & Elango, 2020). Madasu and Elango evaluated several feature selection methods (odds ratio, chi-square, the GSS coefficient, and bi-normal separation) on the Amazon Review Dataset, IMDb Review Dataset, and Yelp Review Dataset with the logistic regression, SVM-RBF, SVM-linear, decision tree, multinomial naïve Bayes, and Bernoulli naïve Bayes algorithms. Their study shows that multinomial naïve Bayes with chi-square feature selection achieves the best accuracy on two of the datasets: the Amazon Review Dataset and the IMDb Review Dataset.
In sentiment analysis, additional methods are needed to increase accuracy and reduce bias. One such ensemble method is adaptive boosting (AdaBoost). In AdaBoost, each classifier focuses on the errors of the previous classifier, and the classifiers' predictions are combined by majority voting to increase accuracy (Zhang & Ma, 2012). Silva et al. (2015) compared the multinomial naïve Bayes algorithm, SVM, SVM with AdaBoost, and multinomial naïve Bayes with AdaBoost. Their study shows that multinomial naïve Bayes with AdaBoost has the highest accuracy, at 57.35%.
Based on the problems described above, this study classifies the movie review dataset using the multinomial naïve Bayes algorithm with AdaBoost and chi-square feature selection.

Methods
This study classified the movie review dataset using a combined approach to obtain better accuracy. The sentiment analysis classification process was carried out according to Figure 1 and consists of several main stages: preprocessing, data transformation, feature selection, and classification.

Dataset
The dataset used in this study is the Large Movie Review v1.0 dataset (Maas et al., 2011), obtained from https://ai.stanford.edu/~amaas/data/sentiment. The dataset contains 50,000 reviews, consisting of 25,000 positive and 25,000 negative reviews.

Data Preprocessing
Data preprocessing is the stage of converting unstructured data into structured or semi-structured data [16]. Data preprocessing comprises the following processes:
1. Case folding is the process of changing capital letters into lowercase letters (Kowsari et al., 2019).
2. Tokenization is the process of breaking a document or text into single words (Susilowati, Sabariah, & Gozali, 2015).
3. Stopword filtering removes words that do not contribute to the text classification (Handayani & Pribadi, 2015).
4. Stemming is the process of reducing words to their base forms (Ipmawati, Kusrini, & Taufiq, 2017).
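The four preprocessing steps can be sketched in pure Python. The stopword list and the suffix-stripping stemmer below are simplified stand-ins for the resources a real pipeline would use (e.g. a full stopword list and the Porter stemmer):

```python
import re

# A tiny illustrative stopword set; a real system would use a fuller list.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "this", "was"}

def case_fold(text):
    # 1. Case folding: change capital letters into lowercase letters.
    return text.lower()

def tokenize(text):
    # 2. Tokenization: break the text into single words.
    return re.findall(r"[a-z]+", text)

def remove_stopwords(tokens):
    # 3. Stopword filtering: drop words that carry little class signal.
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # 4. Stemming: a naive suffix-stripping rule standing in for a real stemmer.
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stopwords(tokenize(case_fold(text)))]
```

For example, `preprocess("This movie was surprisingly good")` lowercases the text, tokenizes it, drops "this" and "was", and stems "surprisingly" to "surprising".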

Data Transformation
Data transformation is the process of converting tokens into numeric vectors. This is necessary because the classification algorithm cannot process the raw text directly. The method used in this process is TF-IDF. TF-IDF calculates the term frequency (TF) and inverse document frequency (IDF) values for each word or term in each document in a class (Jindal, Malhotra, & Jain, 2015). TF is computed by counting how often each word appears in a document, while IDF weights terms by their rarity across documents:

idf(t) = log(N / df(t))

where N is the total number of documents and df(t) is the number of documents that contain term t.
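As a sketch of this transformation, the following pure-Python function computes TF-IDF weights using raw term counts for TF and idf(t) = log(N / df(t)); a production system would typically rely on a library implementation such as scikit-learn's TfidfVectorizer:

```python
import math

def tf_idf(docs):
    """Compute a TF-IDF weight for every term in every document.

    docs is a list of token lists. tf(t, d) is the raw count of t in d,
    and idf(t) = log(N / df(t)).
    """
    n = len(docs)
    df = {}                                   # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc:
            vec[term] = vec.get(term, 0) + 1  # raw term frequency
        for term in vec:
            vec[term] *= math.log(n / df[term])  # multiply by idf
        vectors.append(vec)
    return vectors

vecs = tf_idf([["good", "movie"], ["bad", "movie"], ["good", "acting"]])
```

Here "bad" appears in only one of three documents, so it receives the largest IDF weight (log 3), while "movie" and "good" appear in two documents each and are weighted by log(3/2).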

Splitting Data
Splitting data is the process of dividing the dataset into training data and test data. The data is divided into 80% training data and 20% test data or 40,000 training data and 10,000 test data. The ratio of the division of 80:20 is chosen based on the Pareto principle (Dunford, Su, Tamang, & Wintour, 2014).
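A minimal sketch of the 80:20 split described above; the fixed random seed is an assumption added here for reproducibility:

```python
import random

def split_80_20(data, seed=42):
    # Shuffle, then take the first 80% as training data and the rest as test data.
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * 0.8)
    return data[:cut], data[cut:]

# With 50,000 reviews this yields 40,000 training and 10,000 test samples.
train, test = split_80_20(range(50000))
```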

Chi-Square Feature Selection
Chi-square is the method chosen in this study to reduce the number of features. The selected features are those with a strong correlation to the class in the classification process. The chi-square statistic for a term t and class c is computed from a 2x2 contingency table, as given in equation 3:

χ²(t, c) = N(AD − BC)² / ((A + B)(A + C)(B + D)(C + D))   (3)

where A is the number of documents in class c that contain t, B the number of documents outside c that contain t, C the number of documents in c that do not contain t, D the number of documents outside c that do not contain t, and N = A + B + C + D.
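As an illustration, the following sketch scores each term from its 2x2 contingency table using the standard chi-square formula for term-class association and keeps the k highest-scoring terms (mirroring the paper's choice of the 2,000 best features):

```python
def chi_square(a, b, c, d):
    """chi^2 for one (term, class) pair from a 2x2 contingency table.

    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (a + c) * (b + d) * (c + d)
    return num / den if den else 0.0

def select_top_k(term_tables, k):
    # term_tables maps each term to its (a, b, c, d) counts;
    # keep the k terms with the highest chi-square scores.
    scored = sorted(term_tables,
                    key=lambda t: chi_square(*term_tables[t]),
                    reverse=True)
    return scored[:k]

# Hypothetical counts: "great" is concentrated in one class,
# "the" is spread evenly and carries no class signal.
tables = {"great": (40, 5, 10, 45), "the": (50, 50, 0, 0)}
```

A term that occurs independently of the class (like "the" above) scores near zero, while a class-correlated term scores high and survives the selection.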

Multinomial Naïve Bayes
At this stage, the multinomial naïve Bayes algorithm is used to classify the movie review dataset that has previously gone through preprocessing, data transformation, and feature selection. First, calculate the smoothed maximum likelihood estimate for each term in each class, as given in equation 4:

P(t|c) = (T_ct + 1) / (Σ_t′ T_ct′ + |V|)   (4)

where T_ct is the number of occurrences of term t in training documents of class c and |V| is the vocabulary size; the added ones are Laplace (add-one) smoothing, which avoids zero probabilities.

After obtaining the likelihood values, the next step calculates the prior probability for each class, as given in equation 5:

P(c) = N_c / N   (5)

where N_c is the number of training documents in class c and N is the total number of training documents.

Then calculate the maximum a posteriori value for each class, and select the class with the highest value as the predicted class, as given in equation 6:

c_map = argmax_c [ log P(c) + Σ_{1≤k≤n_d} log P(t_k|c) ]   (6)

where t_1, …, t_{n_d} are the terms of the document being classified.
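These steps can be sketched as a minimal pure-Python classifier, with add-one smoothing in the likelihood estimate to avoid zero probabilities; the toy reviews below are hypothetical:

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal multinomial naive Bayes: smoothed likelihoods,
    class priors, and a log-space maximum a posteriori decision."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {}                  # log P(c)
        self.cond = defaultdict(dict)     # log P(t|c)
        counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            counts[c].update(doc)
        self.vocab = set(t for cnt in counts.values() for t in cnt)
        for c in self.classes:
            self.priors[c] = math.log(labels.count(c) / len(labels))
            total = sum(counts[c].values())
            for t in self.vocab:
                # add-one (Laplace) smoothed likelihood
                self.cond[c][t] = math.log(
                    (counts[c][t] + 1) / (total + len(self.vocab)))
        return self

    def predict(self, doc):
        # argmax over classes of log P(c) + sum of log P(t|c)
        def score(c):
            return self.priors[c] + sum(
                self.cond[c][t] for t in doc if t in self.vocab)
        return max(self.classes, key=score)

clf = MultinomialNB().fit(
    [["good", "great"], ["good", "fun"], ["bad", "awful"], ["bad", "boring"]],
    ["pos", "pos", "neg", "neg"],
)
```

Terms unseen in training are skipped at prediction time; the smoothing handles terms seen in one class but not the other.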

Adaboost
AdaBoost is an ensemble algorithm that serves to increase the accuracy of classifiers. AdaBoost can consistently reduce the errors of base learners to produce better classifications than random guessing (Freund & Schapire, 1996). The steps of the AdaBoost algorithm include the following:
1. Initialize the weight of every training sample to 1/N.
2. In each iteration, train a weak classifier on the weighted data and compute its weighted error rate.
3. Assign the classifier a weight (alpha) that grows as its error shrinks.
4. Increase the weights of misclassified samples and decrease the weights of correctly classified samples, then renormalize.
5. Combine all weak classifiers by a weighted majority vote to form the final prediction.
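These steps can be sketched in pure Python using single-token decision stumps as the weak learners; this is a simplified stand-in for the paper's setup, which boosts multinomial naïve Bayes. Labels are encoded as +1/-1 and the toy data is hypothetical:

```python
import math

def stump_predict(token, x):
    # Weak learner: predict +1 if the token is present in the document, else -1.
    return 1 if token in x else -1

def train_stump(X, y, w, vocab):
    # Pick the (token, polarity) pair with the lowest weighted error.
    best = None
    for token in vocab:
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if pol * stump_predict(token, xi) != yi)
            if best is None or err < best[0]:
                best = (err, token, pol)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1 / n] * n                                     # step 1: uniform weights
    vocab = set(t for x in X for t in x)
    ensemble = []
    for _ in range(rounds):
        err, token, pol = train_stump(X, y, w, vocab)   # step 2: weak learner
        err = max(err, 1e-10)
        if err >= 0.5:
            break                                       # no better than chance
        alpha = 0.5 * math.log((1 - err) / err)         # step 3: classifier weight
        # step 4: re-weight samples, emphasising the mistakes, then renormalize
        w = [wi * math.exp(-alpha * yi * pol * stump_predict(token, xi))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
        ensemble.append((alpha, token, pol))

    def predict(x):                                     # step 5: weighted vote
        vote = sum(a * p * stump_predict(t, x) for a, t, p in ensemble)
        return 1 if vote >= 0 else -1
    return predict

X = [{"good"}, {"good", "bad"}, {"bad"}, {"boring"}]
y = [1, 1, -1, -1]
predict = adaboost(X, y, rounds=5)
```

Each round the stump with the smallest weighted error is added, so documents the current ensemble misclassifies dominate the next round's training objective.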

Results and Discussion
This study applied chi-square feature selection and the multinomial naïve Bayes algorithm with AdaBoost to improve sentiment analysis of movie reviews. The higher the accuracy value, the better the algorithm performs.
In the application of the multinomial naïve Bayes algorithm with chi-square feature selection, the number of features is chosen to maximise accuracy. The results of the feature selection are shown in Figure 2: the optimal number of features for multinomial naïve Bayes with chi-square feature selection is the 2,000 best features.

This number of features was then combined with AdaBoost. A fine-tuning process applied the multinomial naïve Bayes algorithm with AdaBoost and chi-square feature selection, using the 2,000 selected features, to find the best learning rate and iteration count. The results of fine-tuning can be seen in Table 1. The fine-tuning process obtained the highest accuracy at 100 iterations and a learning rate of 0.1, with an accuracy value of 87.74%.

A comparison of the three methods can be seen in Figure 3. Based on the accuracy values obtained for each algorithm, applying multinomial naïve Bayes with AdaBoost and chi-square feature selection improves accuracy by 6.35 percentage points over multinomial naïve Bayes alone (from 81.39% to 87.74%). The multinomial naïve Bayes algorithm with AdaBoost and chi-square feature selection is thus shown to increase accuracy on the Large Movie Review v1.0 dataset.

Conclusion
This paper examined the multinomial naïve Bayes algorithm with AdaBoost for classifying the sentiment of movie reviews using chi-square feature selection. Chi-square is used to select the best features for the classification process, and AdaBoost is used to improve the accuracy of the multinomial naïve Bayes algorithm. The evaluation of accuracy shows that the combined approach of multinomial naïve Bayes with AdaBoost and chi-square feature selection produces higher accuracy than using the multinomial naïve Bayes algorithm alone.