Performance Comparison of Similarity Measure Algorithm as Data Preprocessing Stage: Text Normalization in Bahasa

Achmad Yohni Wahyu Finansyah(1), FNU Afiahayati(2), Vincent Michael Sutanto(3)


(1) Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
(2) Dept. of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada
(3) Institute for Research Initiatives / Division of Information Science, Nara Institute of Science and Technology, Nara, Japan

Abstract

Purpose: Technological developments mean that more and more data are stored as text, which makes text data harder to process. This also causes problems for text preprocessing algorithms, one of which is that two texts that should be treated as identical are considered distinct by the algorithm. It is therefore necessary to normalize the text so that each word appears in its standard form for the language in question. Spelling correction is often used to normalize text, but there has been little research on spelling correction algorithms for Bahasa Indonesia. A comparison is needed to identify which spelling correction algorithm makes the normalization process most effective.

Methods: In this study, we compared three algorithms: Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. The algorithms were evaluated on questionnaire data and tweet data, both of which are in Bahasa Indonesia.
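
As an illustration of how a similarity measure can drive normalization, the following is a minimal Python sketch of one of the three compared measures, Levenshtein Distance, used to map a non-standard word to its closest entry in a word list. It is not the implementation evaluated in this study: the `normalize` helper, the tie-breaking rule (first closest entry wins), and the three-word `lexicon` are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalize(word: str, lexicon: list[str]) -> str:
    """Hypothetical helper: return the lexicon entry closest to word.
    Ties go to the entry that appears first in the lexicon."""
    return min(lexicon, key=lambda candidate: levenshtein(word, candidate))


# Illustrative word list only; not the data used in the study.
lexicon = ["tidak", "bisa", "sudah"]
print(normalize("tdak", lexicon))  # tidak
```

A linear scan over the whole lexicon for every input word keeps the sketch short; its cost grows with the size of the lexicon, which is one reason normalization time is a relevant criterion alongside accuracy.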

Result: Jaro-Winkler Distance gave the fastest normalization time, averaging 31.01 seconds for the questionnaire data and 59.27 seconds for the tweet data. Levenshtein Distance gave the best accuracy, at 44.90% for the questionnaire data and 60.04% for the tweet data.

Novelty: This research compares similarity measure algorithms for text normalization in Bahasa Indonesia, in order to identify the algorithm best suited to the language.

Keywords

Preprocessing, Text Normalization, Spell-correction, Levenshtein Distance, Jaro-Winkler Distance, Smith-Waterman

Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

This work is licensed under a Creative Commons Attribution 4.0 International License.