Improve the Accuracy of Support Vector Machine Using Chi Square Statistic and Term Frequency Inverse Document Frequency on Movie Review Sentiment Analysis

Text data can be processed with text mining techniques. Processing large volumes of text requires an automated method for exploring opinions and deciding whether they are positive or negative. Sentiment analysis applies text mining methods to determine whether the content of a text dataset is positive or negative. Support vector machine (SVM) is one of the classification algorithms that can be used for sentiment analysis. However, SVM works less well on large data. In addition, text mining is constrained by the number of attributes used: a large number of attributes reduces the performance of the classifier and leads to low accuracy. The purpose of this research is to increase the accuracy of SVM by applying feature selection and feature weighting. Feature selection removes a large number of irrelevant attributes. In this study, features are selected based on a top-K value of K = 500. The selected attributes are then weighted to calculate the weight of each one. The feature selection method is the chi square statistic, and the feature weighting method is Term Frequency Inverse Document Frequency (TFIDF). In experiments using Matlab R2017b with 10-fold cross validation, integrating SVM with the chi square statistic and TFIDF increased accuracy by 11.5 percentage points: SVM without chi square statistic and TFIDF reached an accuracy of 68.7%, and SVM with chi square statistic and TFIDF reached 80.2%.


INTRODUCTION
The distribution of information, supported by technological developments, makes it easier for the public to obtain information freely and in large quantities, including textual information. Textual information can be categorized into two kinds, namely facts and opinions. A fact is an objective expression about an entity, event, or the nature of an object, while an opinion is a subjective expression that describes a person's sentiments, views, or feelings about an entity, event, or its nature. Textual information can be processed using text mining.
Problems in data mining can be grouped into classification, regression, association analysis, anomaly detection, time series, and text mining [1]. Text mining is the application of data mining to text input, which can be documents, messages, e-mails, or web pages [1]. According to [2], text mining can be broadly defined as a knowledge-intensive process in which users interact with datasets using analytical tools. Text mining is also known as text data mining [3]. Text mining is similar to data mining, except that data mining tools are designed for structured data from a database, whereas text mining is designed for unstructured or semi-structured datasets such as word documents, e-mails, and so on.
One of the problems associated with text mining is sentiment analysis. According to [4], sentiment analysis is a process that aims to determine whether the content of a text dataset (documents, sentences, paragraphs, etc.) is positive or negative. Sentiment analysis is usually implemented at three levels: sentence level, document level, and aspect level. The main purpose of the document level is to classify whole documents or topics into positive or negative classes, while the sentence level is based on the polarity of each individual sentence [5]. As described in more detail in [6], document-level sentiment analysis classifies the opinion of a document as positive or negative based on several large documents on the same topic. Sentence-level sentiment analysis classifies the sentiment of each sentence by first identifying whether the sentence is subjective or objective. If the sentence is subjective, sentence-level sentiment analysis then determines whether it expresses a positive or negative opinion.
Public opinion is very important for industry players. According to [7], sentiment analysis is used by industry players to learn opinions about their products in order to predict future sales. It is also mentioned in [8] that the movie industry applies sentiment analysis to find out public opinion. According to [9], other people's opinions can influence how people think about certain products, so that people can infer the quality of a particular product. The public can review a particular product through a website. Reviews are given as opinion text, for example reviews of cosmetic products, electronics, books [10], food [11], [12], movies [10], and so on. Movie production is one of the growing industries. One example of a site that provides movie reviews is the Internet Movie Database (IMDB). IMDB is a website about movies and movie production. It provides complete information about a movie's production, its cast, a brief synopsis, a trailer link, the release date, and reviews from other users. People use IMDB to judge the quality of a movie before buying or watching it, because other people's comments and movie ratings typically influence the level of interest in buying or watching the movie.
Data mining methods can be distinguished into statistical approaches, known as statistical methods, and machine learning, which is based on supervised and unsupervised learning techniques [13]. Several classification algorithms have been applied in sentiment analysis. For example, [11] used Naïve Bayes (NB) and Support Vector Machine (SVM) to classify restaurant review sentiments, and [14] used four machine learning methods, namely NB, ME, Stochastic Gradient Descent, and SVM. SVM is a widely used method in text classification [13], and it is fast and effective for text classification [2]. According to [15], one of the problems in text classification or text data processing is that a large number of features/attributes in a dataset degrades the performance of the classifier. To optimize the classifier, relevant features need to be chosen using feature selection. Feature selection reduces the feature/attribute dimension by removing irrelevant words so as to improve classification accuracy. As explained in [16], feature selection reduces the dimension of the dataset by removing features/attributes that are irrelevant for classification. Feature selection provides several advantages, such as smaller dataset sizes and lower computing requirements for text classification algorithms. As explained in [17], feature selection can be divided into two types, namely filter and wrapper. Examples of the filter type are chi square, information gain, and log likelihood ratio. Examples of the wrapper type are forward selection and backward elimination.
The chi square statistic gives good results when combined with the SVM algorithm [18]. The chi square statistic is used to test the independence of two events. For feature selection, the two events are the term (ti) and the class (Ck) [19]. The research in [19], which used chi square as feature selection for the support vector machine algorithm, obtained effective results in Arabic dataset classification with an F-measure of 88.11.
After the features are selected, they are weighted (feature weighting). Feature weighting assigns a weight to each feature. One method of feature weighting is Term Frequency Inverse Document Frequency (TFIDF). TFIDF combines the term frequency and the inverse document frequency to generate a weight for each term in each document [20]. The research in [21] applied TFIDF to calculate the weight of the relation between a term and a document.
The purpose of this study is to improve the accuracy of SVM by applying chi square and TFIDF. Based on the description above, feature selection is needed to choose the right attributes and reduce the large number of attributes in order to improve accuracy. To improve the accuracy of existing models, this study proposes the chi square statistic as feature selection and TFIDF as feature weighting. Feature selection and feature weighting are integrated with the SVM classification method.

METHODS
This research was conducted in several stages, following the text classification process in Figure 1. Classification is one technique in data mining [22]. Data analysis is an effort to work with data, organize data, sort it into manageable units, synthesize it, and search for and find what is important and what must be studied and decided. This study consists of several stages: preprocessing, feature selection, feature weighting, and sentiment analysis classification.

Preprocessing
The preprocessing stage aims to prepare unstructured text documents for further processing. The preprocessing steps in this research are tokenizing, case folding, stopword filtering, and stemming. Stopword filtering uses an English stoplist, and the stemmer used is the Porter stemmer.
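The preprocessing steps can be illustrated with a minimal Python sketch. It is not the Matlab implementation used in this study: the stoplist is a tiny hypothetical subset of an English stoplist, and `crude_stem` is a rough suffix stripper used only as a stand-in for the Porter stemmer. Case folding is applied before tokenizing here for brevity.

```python
import re

# Tiny illustrative stoplist (assumption); the study uses a full English stoplist.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "this", "it", "to"}


def crude_stem(token):
    """Rough suffix stripping; a stand-in for the Porter stemmer, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Case folding, tokenizing, stopword filtering, and stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())         # case folding + tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword filtering
    return [crude_stem(t) for t in tokens]               # stemming (stand-in)


tokens = preprocess("The acting was AMAZING and the plot twists worked")
# tokens == ['act', 'amaz', 'plot', 'twist', 'work']
```

The output of this stage, one token list per document, is what the bag of words and the term document matrix are built from.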

Feature Selection
Attributes/features in this study are selected using the chi square statistic. Features are selected based on the top-K value, that is, the K words with the highest chi square values. The top-K value is determined by the researcher, K = 500. Classification is then repeated up to K times until the highest classification accuracy is obtained. The calculation of the chi square statistic for selecting relevant attributes begins with a dataset that has been labeled by class (positive, negative). From this dataset, a bag of words is built that contains the number of occurrences of each term/word in the documents of each class (positive, negative), as in Table 1.

Chi Square Statistic
The chi square statistic is used to test the independence of two events. For feature selection, the two events are the term ($t_i$) and the class ($C_k$) [19]. Here "a" is the number of records/instances in category $C_k$ that contain term $t_i$, "b" is the number of records/instances not in category $C_k$ that contain term $t_i$, "c" is the number of records/instances in category $C_k$ that do not contain term $t_i$, and "d" is the number of records/instances not in category $C_k$ that do not contain term $t_i$. N is the total number of documents used. The chi square value for each term is calculated by Equation 1.

$$\chi^2(t_i, C_k) = \frac{N\,(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)} \qquad (1)$$

If $\chi^2(t_i, C_k) = 0$, term $t_i$ and category $C_k$ are independent; therefore term $t_i$ has no effect on the category. The greater the value of $\chi^2(t_i, C_k)$, the more term $t_i$ affects the category. Features are selected in this study using the chi square statistic with the following steps.
1. Prepare the bag-of-word results from the preprocessing stage, as shown in Table 1, which displays some of the terms whose chi square values are to be calculated.
2. Calculate the chi square value of each term in the bag of words obtained from the preprocessing stage using Equation 1.

3. From Equation 1, this study uses the following values. N = 1000, p = 500, and n = 500, where p is the number of documents labeled positive and n is the number of documents labeled negative. $C_k$ is the data class (positive and negative). The value of "a" is the number of positive documents containing the term, given in the pos column of Table 1. The value of "b" is the number of non-positive documents containing the term, given in the neg column of Table 1. The value of "c" is the number of positive documents that do not contain the term, that is, $c = p - a$. The value of "d" is the number of documents that are not positive and do not contain the term, that is, $d = n - b$.
4. The chi square value of each term can then be calculated by Equation 1. For example, the chi square value obtained for the term "good" was 25.45.
5. Each term in the bag of words is then sorted by chi square value from highest to lowest.
6. Features are selected based on the top-K value, that is, the K words in the bag of words with the highest chi square values. The study sets K = 500, so the selected features are the 500 terms with the highest chi square values.
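The selection steps above can be sketched in Python as follows. This is an illustrative sketch rather than the study's Matlab code; the function names and the document counts in the usage example are hypothetical and are not taken from Table 1.

```python
def chi_square(a, b, c, d):
    """Chi square value of a term from its 2x2 contingency counts (Equation 1).

    a: documents of the class containing the term
    b: documents outside the class containing the term
    c: documents of the class without the term
    d: documents outside the class without the term
    """
    n_docs = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: treat the term as uninformative
    return n_docs * (a * d - b * c) ** 2 / denom


def select_top_k(doc_freq, p, n, k):
    """Rank terms by chi square value and keep the k highest.

    doc_freq maps term -> (a, b): the number of positive and negative
    documents containing the term; p and n are the class sizes.
    """
    scores = {
        term: chi_square(a, b, p - a, n - b)  # c = p - a, d = n - b
        for term, (a, b) in doc_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical counts for three terms, with p = n = 500 as in this study.
example_counts = {"good": (100, 50), "movi": (250, 250), "bad": (30, 120)}
selected = select_top_k(example_counts, p=500, n=500, k=2)
```

A term that occurs equally often in both classes, such as the hypothetical "movi" above, gets a chi square value of 0 and is ranked last.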

Feature Weighting
The weight of each feature in each document is calculated using TFIDF. After the preprocessing phase is completed and the relevant features have been selected, there are N features, which can be represented in the order t1, t2, ..., tN. The ith document can be represented by an N-dimensional vector (Xi1, Xi2, ..., XiN), where Xij is the weight that measures the importance of the jth term in the ith document. The vector space model is the result of weighting each word, where the words are the selected features. One weighting method is TFIDF. This method calculates the Term Frequency (TF) and Inverse Document Frequency (IDF) values of each selected feature over the documents. The TF value is defined by TF = tij, that is, the number of times term j appears in document i. The DF value is the number of documents in which term j appears; this value is used to calculate the IDF. The IDF value is defined as in Equation 2.

$$\mathrm{IDF}_j = \log\left(\frac{N}{\mathrm{DF}_j}\right) \qquad (2)$$

where N in this equation is the total number of documents. TFIDF is calculated by multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF) as in Equation 3.

$$X_{ij} = \mathrm{TF}_{ij} \times \mathrm{IDF}_j \qquad (3)$$
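The weighting can be sketched in a few lines of Python. This is an illustration only: the base of the logarithm is not stated in the text, so base 10 is assumed here, and the function name and toy documents are hypothetical.

```python
import math


def tfidf_matrix(docs, features):
    """Return one row of TFIDF weights per document (Equations 2 and 3).

    docs is a list of token lists; features is the list of selected terms.
    """
    n_docs = len(docs)
    # DF: number of documents in which each feature appears.
    df = {f: sum(1 for doc in docs if f in doc) for f in features}
    # IDF = log(N / DF); base 10 is an assumption. Unseen features get weight 0.
    idf = {f: math.log10(n_docs / df[f]) if df[f] else 0.0 for f in features}
    # TF is the raw count of the feature in the document.
    return [[doc.count(f) * idf[f] for f in features] for doc in docs]


toy_docs = [["good", "movi", "good"], ["bad", "movi"], ["good", "plot"]]
weights = tfidf_matrix(toy_docs, ["good", "bad", "movi"])
```

Each row of `weights` corresponds to the vector (Xi1, Xi2, ..., XiN) of one document, and the rows together form the weighted term document matrix passed to the classifier.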

Mining Process
In the classification phase, sentiment analysis is performed with SVM under 10-fold cross validation: the training data and test data are divided over the 10 cross-validation iterations [23], so that learning and testing are carried out in each iteration as follows.
1) Prepare the dataset. The data used is the new term document matrix that results from the feature weighting stage.
Go back to step 2 until the values converge (no significant change).

4) Test the model on the test data using the SVM decision function.
5) Repeat steps 2) to 4) for k iterations.
6) The final output is the level of accuracy, obtained as the average of the accuracies over all iterations.
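The cross-validation loop in the steps above can be sketched as follows. This is a generic Python harness under stated assumptions, not the study's Matlab code: `fit` and `predict` are hypothetical callables into which an SVM trainer and its decision function would be plugged, and a simple nearest-centroid classifier is used in the example only as a stand-in so that the sketch stays self-contained.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Average accuracy over k folds; each fold serves once as the test data."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Stand-in classifier (NOT an SVM): assign the label of the nearest class centroid.
def fit_centroids(X, y):
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_centroid(model, x):
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))


# Toy, well-separated one-dimensional data with labels -1 and +1.
X_toy = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
y_toy = [-1] * 10 + [1] * 10
mean_accuracy = cross_val_accuracy(X_toy, y_toy, fit_centroids, predict_centroid)
```

Passing the classifier in as two callables keeps the splitting and averaging logic independent of the model, so the same loop can be reused with and without feature selection when comparing accuracies.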

RESULT AND DISCUSSION
In this section, the experimental results are analyzed to evaluate the performance of the proposed data mining algorithm. The data used are the movie reviews in the Sentiment Labelled Sentences dataset [24], taken from the UCI machine learning repository. The first step is preprocessing, which consists of case folding, tokenizing, stopword filtering, and stemming. The preprocessing stage aims to prepare unstructured text documents for use in the next process. This stage produces the term document matrix and a bag of words with 2477 candidate feature terms from the 1000 documents used, which are used in the later stages.
The next step is to choose relevant features with the chi square statistic feature selection method. Using Equation 1, the chi square value is calculated for each candidate feature. The chi square values are then sorted from highest to lowest. Chi square is applied to reduce the large number of attributes by taking the K = 500 highest-ranking attributes. The weight of each selected attribute is then calculated with TFIDF feature weighting.
Once the features have been selected and their weights calculated with TFIDF, sentiment analysis with the support vector machine algorithm can be carried out. At this stage, the highest accuracy of the support vector machine is calculated based on 10-fold cross validation for movie review sentiment classification with the chi square statistic and TFIDF applied. The accuracy of the support vector machine algorithm on movie review classification without the chi square statistic and TFIDF was 68.7%. With the chi square statistic and TFIDF applied, the support vector machine algorithm achieved its highest accuracy, 80.2%, at a top-K value of K = 212, an increase of 11.5 percentage points. This study used the chi square statistic as feature selection based on a top-K value of K = 500. The novelty of this research is that, starting from the predetermined top-K value, classification is looped up to K times to find the optimum accuracy. Features are selected based on the chi square value: the higher the chi square value, the more relevant the feature. The preprocessing stage also plays a role in improving accuracy in this study, since most of the irrelevant features are removed at that stage. Table 1 shows the accuracy of the support vector machine in text classification for movie review sentiment analysis without chi square and TFIDF, and the accuracy of the support vector machine with chi square and TFIDF applied.
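The loop over the top-K value described above can be sketched as a simple search. This is one possible reading of the procedure, not the study's code: `evaluate` is a hypothetical callable that would return the cross-validated accuracy of the classifier trained on the given features, and the scoring function in the example is artificial.

```python
def best_top_k(ranked_terms, evaluate, max_k=500):
    """Try the top 1, 2, ..., max_k ranked terms and keep the best-scoring k.

    ranked_terms must already be sorted by chi square value, highest first.
    """
    best_k, best_acc = 0, float("-inf")
    for k in range(1, min(max_k, len(ranked_terms)) + 1):
        acc = evaluate(ranked_terms[:k])
        if acc > best_acc:  # ties keep the smaller k
            best_k, best_acc = k, acc
    return best_k, best_acc


# Artificial scoring function that peaks when three features are used.
toy_terms = ["t1", "t2", "t3", "t4", "t5", "t6"]
k_opt, acc_opt = best_top_k(toy_terms, lambda feats: 1 - abs(len(feats) - 3) / 10)
# k_opt == 3, acc_opt == 1.0
```

In this reading, K = 500 is the upper bound of the search and K = 212 is the value at which the accuracy reported in this study was highest.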

CONCLUSION
This research uses the Sentiment Labelled dataset taken from the UCI Repository, which consists of 500 documents labeled positive and 500 documents labeled negative. In the experiment using chi square with a top-K value of K = 500, the highest accuracy was obtained at top K = 212; with chi square and TFIDF, the support vector machine showed an accuracy increase of 11.5%, from 68.7% to 80.2%. It can be concluded that applying the chi square statistic and TFIDF increases the accuracy of the support vector machine in movie review sentiment classification. A limitation is that sentiment analysis classification depends highly on the data being tested. This research can therefore serve as a reference for further research that makes fuller use of the available data in order to obtain higher accuracy. This research uses the bag-of-words concept, so the features/attributes formed are individual words; analysis is done word by word on the given data and does not apply emoticon detection or negation detection. Beyond the data used, further research is needed to find out how the results (performance) of this study can support decision makers in movie production.

Figure 1. SVM algorithm with chi square statistic and TFIDF

2) Divide the data. The data is divided into 10 equal parts. For the first iteration, one part is used as test data and the other parts as training data.
3) SVM process. At this stage, training is done to obtain the classification model from the training data, based on the division in step 2). The modeling stages of the SVM algorithm are as follows.
a) Specify the data points: $\{x_1, x_2, \ldots, x_m\}$ with $x_i \in \mathbb{R}^n$, where $\mathbb{R}^n$ is a feature space with n features.
b) Specify the class data: $y_i \in \{-1, +1\}$.
c) Pair the data and classes: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$.
d) Maximize the margin to determine the values of $w$ and $b$ by minimizing $\tfrac{1}{2}\|w\|^2$ subject to $y_i((w \cdot x_i) + b) \geq 1$.
e) Specify the separating hyperplane, written as $(w \cdot x) + b = 0$. Perform steps (a), (b), and (c) below for … (a) $\sum$ …

Table 1. Bag-of-word (example)

Table 1. Research result
Applying the chi square statistic and TFIDF to the support vector machine algorithm proved to be a good enough model to improve the accuracy of the support vector machine in movie review sentiment analysis on the Sentiment Labelled dataset. With the accuracy obtained, this model is expected to be able to analyze the sentiment of other review data accurately. For further research, this model can therefore be used for sentiment classification of other movie review data. This finding is in line with [25], i.e., accuracy (performance) increases with the application of feature selection. The study [25] applied a modification of the conventional IG feature selection, SAIG (Sparsity Adjusted Information Gain), which showed improved classifier accuracy. On the two datasets used (the Amazon dataset and the movie dataset (Sentiment Labelled dataset)), SAIG gave better results than IG on SVM and KNN. The accuracy rates were 67.9% for SVM + SAIG (60 features), 66.6% for SVM + IG (100 features), 68.2% for KNN + SAIG (50 and 60 features), and 57.4% for KNN + IG (80 features). However, on the same data this study shows better performance by applying a conventional feature selection method, the chi square statistic, which increases accuracy by 11.5%, from an SVM accuracy of 68.7% without the chi square statistic and TFIDF to 80.2% with them. A comparison of performance (accuracy) with previous research is presented in Table 2.