Implementation of Support Vector Machine Algorithm with Correlation-Based Feature Selection and Term Frequency Inverse Document Frequency for Sentiment Analysis Review Hotel

Purpose: The study aims to reduce the number of irrelevant features in sentiment analysis with a large number of features. Methods/Study design/approach: The Support Vector Machine (SVM) algorithm is used to classify hotel review sentiment because it has advantages in processing large datasets. Term Frequency-Inverse Document Frequency (TF-IDF) is used to assign weight values to features in the dataset. Result/Findings: The SVM method with TF-IDF achieves an accuracy of 93.14%, and applying both TF-IDF and CFS in the classification of hotel reviews raises the accuracy by 1.18 percentage points, from 93.14% to 94.32%. Novelty/Originality/Value: Use of Correlation-Based Feature Selection (CFS) for the feature selection process, which reduces the number of irrelevant features by ranking the feature subset based on the strong correlation value in each feature.


INTRODUCTION
The development of the Internet in the growing tourism sector has made extensive reviews of hospitality and tourism services widely available on websites and social networks. These reviews are a useful source of information for people planning a tour or booking a hotel. Some people tend to seek information from other visitors before making a choice and booking a hotel. Many platforms and websites serve as a place for internet users to share opinions and experiences online. Public opinion can be grouped or classified using sentiment analysis techniques. Sentiment analysis is a process that aims to determine whether the contents of a text dataset (documents, sentences, paragraphs) are positive, negative, or neutral [1]. Sentiment analysis can also address text classification by automatically grouping user reviews into positive or negative opinions [2]. Opinions are subjective expressions in the form of textual information that can describe a person's sentiments, opinions, or feelings about an event and its nature. Textual information can be processed using a text mining process [3]. Text mining is defined as the process of obtaining information from a set of text data [4]. The text mining process is divided into four categories: classification, association analysis, information extraction, and grouping.
Several classification algorithms are applied in sentiment analysis, such as Naïve Bayes (NB), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). Several studies have classified the sentiment of available reviews, and the SVM algorithm has become a widely used classification and regression method for linear and nonlinear problems [5].
Before entering the classification stage, the words contained in the dataset are weighted using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. TF-IDF combines term frequency and inverse document frequency to produce a weight for each feature in the document [1]. When data have many features, some of them may be irrelevant to the mining task, and including irrelevant features can harm and confuse the classification algorithm [6].
Therefore, feature selection is necessary to identify and eliminate features with irrelevant or redundant values. The correlation-based feature selection (CFS) method is a simple filter algorithm for feature selection that ranks feature subsets and evaluates the merit of features or feature subsets based on correlation [7]. CFS selects the best feature subset, containing features that strongly correlate with the target class but are not correlated with each other. However, if the data sample is limited, the attributes chosen by CFS are not necessarily the attributes that give the best accuracy [8]. This study uses a collection of hotel reviews taken from https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe. The purpose of this study was to determine the accuracy of the SVM algorithm after implementing CFS as feature selection and TF-IDF as feature weighting for hotel review sentiment analysis.

METHODS
This study went through several stages, namely text pre-processing, application of TF-IDF as word weighting, application of CFS as feature selection, and classification using the SVM method. The study began by loading the Hotel Reviews dataset. The data were then processed through the text pre-processing stage with tokenization, transform case, stopword removal, and stemming. Each word in the dataset was weighted using the TF-IDF technique. After word weighting, the feature selection stage was conducted with CFS, which removes irrelevant and low-value features that would otherwise reduce the accuracy of the classification algorithm. Based on the selected features, the classification process was carried out using the SVM algorithm. The classification model was then tested using test data and evaluated using a confusion matrix to produce an accuracy value. The research method flow chart can be seen in Figure 1.
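The evaluation step can be illustrated with a minimal sketch of how accuracy follows from a binary confusion matrix. The function names and the toy labels are hypothetical and are not taken from the study's code; the sketch only assumes two classes, positive and negative.

```python
def confusion_matrix(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary labelling."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Accuracy = correctly classified reviews / all reviews."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Toy labels, invented for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print(tp, fp, fn, tn, accuracy(tp, fp, fn, tn))
```

Accuracy is the only metric reported in this study; precision and recall could be derived from the same four counts if needed.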

Dataset
The data used in this study were 2000 hotel reviews labeled as positive or negative, which can be downloaded from https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe.

Pre-processing
The text pre-processing stages carried out in this study were tokenization, transform case, stopword removal, and stemming. Text pre-processing is the initial stage that prepares the text as data to be processed further. 1) Tokenization is the process of separating a set of words into pieces of words, phrases, symbols, or other meaningful elements, called tokens or terms [9]. 2) The transform case stage helps normalize text: the text is converted into a single letter case, either uppercase or lowercase [10]. 3) Stopword removal eliminates stopwords, defined as words that appear very often in text documents and do not give significant meaning to the document's contents [11]. 4) The stemming stage changes each word into its basic form. The goal is to reduce the number of distinct word forms that share the same basic form, saving time and memory space [12].
The TF-IDF method was used in the current study to determine the weight value of each word contained in each document [13]. TF-IDF is one of the most well-known algorithms used in text mining research. The first step in calculating TF-IDF was to count the appearances of each word in each document, the Term Frequency (TF). Next, the Inverse Document Frequency (IDF) value was calculated. The IDF value is the inverse of the Document Frequency (DF), where DF is the number of documents in which the word/term i appears. The IDF value is defined as equation 1, where N is the total number of documents in the collection, and the formula for calculating TF-IDF can be stated as equation 2.
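Because equations 1 and 2 are not reproduced in this text, the following is only a minimal sketch of the weighting under a common assumption: IDF is taken as log10(N / DF) and the weight as TF multiplied by IDF. The log base, the absence of smoothing, the function name, and the toy documents are all assumptions and may differ from the formulas used in the study.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not confirmed by the source):
        idf_i = log10(N / df_i)
        w_ij  = tf_ij * idf_i
    """
    n_docs = len(documents)

    # DF: number of documents in which each term appears at least once.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    idf = {term: math.log10(n_docs / count) for term, count in df.items()}

    weights = []
    for doc in documents:
        tf = Counter(doc)  # TF: raw count of each term in this document
        weights.append({term: freq * idf[term] for term, freq in tf.items()})
    return weights


# Toy, already pre-processed reviews (invented for illustration).
docs = [
    ["comfi", "bed", "good", "locat"],
    ["bad", "bed", "noisi", "room"],
    ["good", "room", "good", "staff"],
]

for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Under this form, a term that occurs in every document receives a weight of zero, while a term that is frequent in one document but rare across the collection receives a high weight.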

Correlation-Based Feature Selection (CFS)
CFS is a simple filter algorithm for feature selection that ranks feature subsets and evaluates the merit of features or feature subsets based on correlation [7]. The CFS method selects the best feature subset, containing features that strongly correlate with the target class but are not correlated with each other. However, if the data sample is limited, the attributes chosen by CFS are not necessarily the attributes that give the best accuracy [8]. The equation for calculating the correlation value can be formulated with equation 3.
CFS is also an automatic algorithm that does not require the user to specify the number of features to be selected [14].
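Equation 3 is not reproduced in this text. For orientation, the subset merit commonly associated with CFS can be written as below; this is the usual textbook form and is given here as an assumption, not as the exact formula used in the study. Here k is the number of features in the subset S, \overline{r_{cf}} is the mean feature-class correlation, and \overline{r_{ff}} is the mean feature-feature correlation.

```latex
\mathrm{Merit}_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}
```

The numerator grows when the features correlate strongly with the class, and the denominator grows when the features are redundant with each other, which matches the selection criterion described above.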

Classification
Text classification is the process of assigning documents to predetermined text categories. Given a document from a document set and the set of all categories {C_1, C_2, C_3, ..., C_n}, text classification assigns one category to the document. Depending on its characteristics, a document can be labeled with one or more classes [15].

RESULT AND DISCUSSION
Results
This study used the Python programming language. Classification of the hotel review dataset with the SVM algorithm and TF-IDF word weighting results in an accuracy of 93.14%. Classification with the SVM algorithm, TF-IDF word weighting, and CFS feature selection produces an accuracy of 94.32%.

Pre-processing Result
The pre-processing stage serves to prepare the dataset so that it is ready to be processed into information. This pre-processing stage can provide optimal results in the classification process that will be used. The pre-processing stages used in this study are tokenization, transform case, stopword removal, and stemming. After pre-processing, the result is shown in Table 1.
Table 1
Before pre-processing: Comfy bed good location
After pre-processing: 'comfi' 'bed' 'good' 'locat'
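The four pre-processing stages can be sketched as follows. This is an illustration only: the stopword list is a tiny hypothetical sample, and `naive_stem` is a crude stand-in with a few suffix rules, not the stemmer used in the study (which is not named in this text). The rules were chosen so that the sample review above is reduced to the same tokens.

```python
import re

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "in", "of", "to", "very"}

# Crude suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ion", "ed", "es", "s")


def naive_stem(token):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    # Map a trailing 'y' to 'i' (e.g. 'comfy' -> 'comfi').
    if token.endswith("y") and len(token) > 3:
        return token[:-1] + "i"
    return token


def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)              # 1) tokenization
    tokens = [t.lower() for t in tokens]                 # 2) transform case
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3) stopword removal
    return [naive_stem(t) for t in tokens]               # 4) stemming


print(preprocess("Comfy bed good location"))
```

A production pipeline would replace both the stopword list and the stemmer with library implementations; the order of the stages is what the sketch is meant to show.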

TF-IDF Result
The TF-IDF method is used to determine the weight value of each word contained in each document. The first step in calculating TF-IDF is to count the appearances of each word in each document, the Term Frequency (TF). Next, the Inverse Document Frequency (IDF) value is calculated. The result of the TF-IDF calculation is presented in Table 2.

CFS Result
After the TF-IDF process is carried out, numeric features are generated, which are then passed to the feature selection process using CFS. The CFS method selects the features with the highest correlation weight values, with the Top K value set to 2000. Sample features with the highest CFS weights can be seen in Table 3.
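The Top K selection can be illustrated with a simplified sketch that ranks each feature by the absolute Pearson correlation between its TF-IDF column and the class label and keeps the K best. This is a univariate simplification, assumed here for illustration: it ignores the feature-to-feature correlation that CFS also considers, and the function names and toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(columns, labels, k):
    """Rank features by |correlation with the class| and keep the k best."""
    scores = {name: abs(pearson(values, labels))
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy TF-IDF columns for four reviews; 1 = positive, 0 = negative.
columns = {
    "good":  [0.9, 0.8, 0.0, 0.1],
    "noisi": [0.0, 0.0, 0.7, 0.6],
    "room":  [0.5, 0.5, 0.5, 0.5],
}
labels = [1, 1, 0, 0]

selected, scores = top_k_features(columns, labels, k=2)
print(selected)
```

In this toy example the constant feature "room" carries no information about the class and is dropped, while the two features that track the label are kept.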

Discussion
In this research, the classification process has two stages. The first stage carries out sentiment classification with the SVM algorithm on the hotel review dataset after text pre-processing and TF-IDF weighting. The second stage adds CFS feature selection before the SVM algorithm.

Classification SVM+TF-IDF
The first classification uses the SVM algorithm for classification and TF-IDF as feature weighting. The SVM algorithm is applied to the hotel review dataset of 2000 reviews that have gone through text pre-processing. SVM with TF-IDF obtains an accuracy of 93.14%. With this accuracy, the SVM algorithm is judged to classify the hotel review dataset well, because the accuracy is greater than 60% and greater than the classification error rate. However, this accuracy can still be improved by applying CFS before SVM.

Classification SVM+TF-IDF+CFS
This classification applies a combination of TF-IDF as feature weighting and CFS as feature selection in the SVM classification algorithm for sentiment analysis. The features weighted with TF-IDF are then scored by the CFS algorithm, which selects those with good correlation values; this selection later affects the accuracy of the classification process. The choice of the 2000 features with the highest correlation values is based on experiments carried out with the SVM classifier. Using the 2000 features with the highest CFS ranking gives the highest accuracy. It can therefore be concluded that these 2000 features have the strongest correlation weight values compared with the other features. The accuracy results obtained from this classification can be seen in Table 4. In this classification process, testing was carried out 10 times, with the results shown in Table 5.
Table 5. Average results of the SVM classification experiment
Applying the SVM + TF-IDF + CFS classification model increases the accuracy by 1.18 percentage points, from 93.14% to 94.32%.

CONCLUSION
This paper tested the SVM method for classifying hotel review sentiment using TF-IDF and CFS. Applying CFS feature selection in this study eliminates some features that have irrelevant values by ranking the feature subset based on the strong correlation value in each feature. The features used in this classification are selected according to the feature ranking, with the selection criterion being the better classification result. The accuracy evaluation shows that the SVM algorithm with TF-IDF and CFS feature selection achieves a higher classification accuracy than the SVM algorithm with TF-IDF alone.