Weakly Supervised Sentiment Analysis of Indonesian Rural Tourism Reviews: A TF-IDF Baseline for Melung Tourism Village

Authors

  • Zanuar Rifa’i, Universitas Amikom Purwokerto
  • Bayu Priya Mukti, Universitas Amikom Purwokerto

DOI:

https://doi.org/10.15294/edukom.v12i1.31893

Keywords:

Indonesian Tourist Reviews, Machine Learning Classification, Random Oversampling, Sentiment Analysis, TF-IDF Feature Extraction

Abstract

This study investigates sentiment classification of Indonesian-language tourist reviews from the rural destination of Melung Tourism Village. A total of 724 user-generated reviews from 546 unique users are preprocessed using Indonesian-specific text cleaning, stopword filtering, and stemming, then weakly labeled through a stemmed positive–negative lexicon. TF-IDF unigram–bigram features are extracted from the preprocessed texts and used to train three classical classifiers: Naive Bayes, linear Support Vector Machine (SVM), and Logistic Regression. To address class imbalance, RandomOverSampler is applied only to the training data, and model evaluation combines stratified 5-fold cross-validation with a held-out test set, using weighted F1-score as the primary metric. Logistic Regression achieves the best performance on the test set (weighted F1 = 0.8799, accuracy = 0.8828), closely followed by SVM, while Naive Bayes lags behind. The results show that, even with a modest, weakly supervised dataset, a carefully designed classical pipeline can yield reliable sentiment indicators to support data-driven management of rural tourism destinations.
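
Three steps named in the abstract can be illustrated with a minimal, dependency-free sketch: lexicon-based weak labeling, random oversampling of minority classes, and the weighted F1-score. This is not the authors' implementation. The lexicon entries, the tie-handling rule that yields a "neutral" label, and all function names are assumptions made for illustration only; the study itself uses RandomOverSampler, for which `random_oversample` below is a simplified stand-in.

```python
import random
from collections import Counter

# Hypothetical stemmed lexicon entries; the study's actual lexicon is not reproduced here.
POSITIVE = {"bagus", "indah", "sejuk", "ramah"}
NEGATIVE = {"kotor", "rusak", "mahal", "jelek"}


def weak_label(tokens):
    """Label a preprocessed review by counting lexicon hits (assumed rule)."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumption: ties and no-hit reviews fall into a third class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)
```

To stay consistent with the abstract, `random_oversample` would be called on the training portion only, after the train/test split and separately inside each cross-validation fold, so that duplicated reviews never leak into the data used for evaluation.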

Published

2025-08-30

Article ID

31893

How to Cite

Rifa’i, Z., & Mukti, B. P. (2025). Weakly Supervised Sentiment Analysis of Indonesian Rural Tourism Reviews: A TF-IDF Baseline for Melung Tourism Village. Edu Komputika Journal, 12(1), 48–60. https://doi.org/10.15294/edukom.v12i1.31893