Improve the Accuracy of Support Vector Machine Using Chi Square Statistic and Term Frequency Inverse Document Frequency on Movie Review Sentiment Analysis

Ukhti Ikhsani Larasati(1), Much Aziz Muslim(2), Riza Arifudin(3), Alamsyah Alamsyah(4)


(1) Universitas Negeri Semarang
(2) Universitas Negeri Semarang
(3) Universitas Negeri Semarang
(4) Universitas Negeri Semarang

Abstract

Text mining techniques can be used to process text data. Processing large volumes of text requires an automated method for exploring opinions and determining whether they are positive or negative. Sentiment analysis applies text mining methods to determine whether the content of a text dataset is positive or negative. The support vector machine (SVM) is one of the classification algorithms that can be used for sentiment analysis. However, SVM performs less well on large datasets. In addition, text mining is constrained by the number of attributes used: a large number of attributes reduces classifier performance and results in low accuracy. The purpose of this research is to increase SVM accuracy by applying feature selection and feature weighting. Feature selection removes a large number of irrelevant attributes. In this study, the top K = 500 features are selected. Feature weighting is then applied to calculate the weight of each selected attribute. The feature selection method used is the chi-square statistic, and the feature weighting method is Term Frequency Inverse Document Frequency (TFIDF). Experiments in Matlab R2017b with 10-fold cross-validation show that integrating SVM with the chi-square statistic and TFIDF increases accuracy by 11.5 percentage points: SVM without the chi-square statistic and TFIDF achieved an accuracy of 68.7%, while SVM with them achieved 80.2%.
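
The feature selection and weighting steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the study used Matlab R2017b and K = 500); it is a plain-Python illustration on a hypothetical four-document toy corpus, and it assumes the common 2x2 contingency-table form of the chi-square statistic and the common TF-IDF form of raw term count times log(N / df). The function names `chi_square`, `select_top_k`, and `tfidf_vector` are made up for this example, and the SVM training and cross-validation steps are omitted.

```python
import math

def chi_square(docs, labels, term, target_label):
    """Chi-square score of a term for one class, from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target_label
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document not in class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def select_top_k(docs, labels, target_label, k):
    """Keep the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocabulary = {term for tokens in docs for term in tokens}
    ranked = sorted(
        vocabulary,
        key=lambda t: (-chi_square(docs, labels, t, target_label), t),
    )
    return ranked[:k]

def tfidf_vector(tokens, docs, selected_terms):
    """TF-IDF weights of one document over the selected terms only."""
    n_docs = len(docs)
    vector = []
    for term in selected_terms:
        tf = tokens.count(term)                        # raw term frequency
        df = sum(1 for doc in docs if term in doc)     # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(tf * idf)
    return vector

# Hypothetical toy corpus: tokenized reviews with sentiment labels.
docs = [
    ["great", "movie", "great", "acting"],
    ["great", "plot"],
    ["boring", "movie"],
    ["boring", "plot", "boring"],
]
labels = ["pos", "pos", "neg", "neg"]

selected = select_top_k(docs, labels, "pos", k=2)
features = [tfidf_vector(tokens, docs, selected) for tokens in docs]
```

In this toy corpus, "great" and "boring" each appear in only one class, so they receive the highest chi-square scores, while "movie" and "plot" are spread evenly across both classes and score zero. The resulting `features` rows are what would be passed to a classifier such as an SVM.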

Keywords

SVM, Chi square statistic, TFIDF, Sentiment Analysis, Text Classification

Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

This work is licensed under a Creative Commons Attribution 4.0 International License.