Performance Comparison of Similarity Measure Algorithm as Data Preprocessing Stage: Text Normalization in Bahasa Indonesia

Purpose: More and more data are stored as text as technology develops, which makes text data harder to process. It also causes problems for text preprocessing algorithms, for example when two texts that are effectively identical are treated as distinct by the algorithm. Text therefore needs to be normalized to obtain the standard form of words in a particular language. Spelling correction is often used to normalize text, but there has not been much research on spelling correction algorithms for Bahasa Indonesia. A comparison is thus needed to find the most appropriate spelling correction algorithm for an effective normalization process. Methods: In this study, we compared three algorithms, namely Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. These algorithms were evaluated using questionnaire data and tweet data, both in Bahasa Indonesia. Result: The fastest normalization time is obtained by Jaro-Winkler Distance, taking an average of 31.01 seconds for the questionnaire data and 59.27 seconds for the tweet data. The best accuracy is obtained by Levenshtein Distance, with 44.90% for the questionnaire data and 60.04% for the tweet data. Novelty: The novelty of this research is the comparison of similarity measure algorithms on Bahasa Indonesia text, from which the most suitable similarity measure algorithm for Bahasa Indonesia is obtained.


INTRODUCTION
The growth of the technology industry has increased the amount of stored data and made it harder to process. This keeps research in the field of text mining developing [1]. In line with this, new problems in text mining have also emerged, such as large amounts of data, high dimensionality, varied data structures, and noise in data [2]. These problems, unstructured data in particular, make the data inconsistent and cause the results of natural language processing to be inaccurate [3]. One cause of unstructured data is Non-Standard Words (NSWs), which are words that cannot be found in a dictionary. Such a word cannot be looked up in the word list, and its morphological meaning cannot be derived from the dictionary [4]. The solution to this problem is text normalization. Text normalization is used at the data preprocessing stage to remove informal writing or NSWs, or to replace them with their standard form in the language [4]. Text normalization can be done in various ways, including removing punctuation marks, changing capitalization, correcting spelling, and adding, deleting, or rearranging words [5].
Spelling correction usually replaces a word with the dictionary word that has the smallest distance to it or the highest similarity with it. Similarity measure algorithms fall into several categories: edit-based (Levenshtein Distance, Jaro-Winkler Distance, Hamming Distance), token-based (n-gram), domain-dependent, and hybrid (TF-IDF) [3]. These algorithms have been compared before. Nugraha [6] compared the Longest Common Subsequence (LCS) algorithm with Levenshtein Distance and Jaro-Winkler Distance. The results showed that normalization using Levenshtein Distance and Jaro-Winkler Distance had better accuracy than LCS. Another study [3] compared Jaro-Winkler Distance and Smith-Waterman for detecting duplicate data in an English dataset. That experiment showed that Smith-Waterman matches strings more accurately than Jaro-Winkler Distance. Research on normalization has also been carried out for English, Arabic, French, and Burmese [7], [8], [9].
These previous works show that text normalization usually relies on Levenshtein Distance or Jaro-Winkler Distance for spelling correction. In addition, other work shows that Smith-Waterman provides higher accuracy than Jaro-Winkler Distance in detecting duplicate data in British medical datasets [3]. However, no research has compared the performance of these algorithms on text normalization for Bahasa Indonesia datasets. Such a comparison is therefore needed.
Based on this background, this research aims to compare Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman as similarity measure algorithms for text normalization, specifically in Bahasa Indonesia. The parameters compared are accuracy and the time needed for normalization. Choosing an algorithm that suits the structure of Bahasa Indonesia is expected to reduce execution time and increase accuracy when the normalized text is used in natural language processing. This study is an extended version of the authors' thesis [10].

METHODS
This section explains all the steps used in this study. The dataset consists of two types of data: a formal-writing-style dataset and an informal-writing-style dataset. The dataset was preprocessed using several techniques in the following order: data cleansing, case folding, and tokenization. Afterwards, the preprocessed data were spell-corrected using three distinct methods: Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. During the spell-correction process, the running time and accuracy of each method were recorded. The performance of the methods was then compared, followed by a discussion of which method is best suited to formal and informal Bahasa Indonesia. The research stages are visualized in Figure 1.

Dataset
This research uses two kinds of datasets in Bahasa Indonesia. The first dataset was taken from a questionnaire of students' comments on their teachers, so its language is more formal. The second dataset was acquired by web-scraping Indonesian tweets. This data has a freer writing style and more NSWs than the first dataset. Each dataset consists of 1500 data rows written in Bahasa Indonesia. Subsets of the questionnaire and tweet datasets are presented in Table 1 and Table 2.

Table 1. Subset of the questionnaire dataset
No | Comments
1 | Guru idaman sepanjang masa
2 | Tetaapp baik, ramah, dan sabarr ya bu
3 | Ibu terbaikkk!!
4 | Semoga pembelajaran menjadi semakin baik dsan efektif.
5 | Lebih dijelaskan lagi bagian yang sulit

Table 2. Subset of the tweet dataset
No | Tweets
1 | b'@p_genedi Wah Mantep nih Ruangguru semakin Berinovasi dengan menu baru \xf0\x9f\xa5\xb0'
2 | "b'@edcfess2 Mulai kelas 8 nyicil2.. buat un smp aku ga bimbel, Cuma ngandelin pt di sekolah sama zenius aja wkwk. Alhamdulillah dapet 36 \xf0\x9f\x98\xab"
3 | b'@anakcangtip terimakasih @ruangguru_ berkatnya aku beres ujian pertama xixi'
4 | "b'zenius error yaa? Aku bukaa di web kok ""hljsj lite: network error"" teruss… helpp"
5 | "b'mau lo quipper, zenius, ruangguru, brainly adalah jalan terbaique. https://t.co.2aBBJgcuXk"

Data Cleansing, Case Folding, and Tokenization
Before the normalization process, the data must be cleaned to reduce data variance. The data cleansing stage uses regular expressions to remove punctuation marks, symbols, and URLs, which occur especially often in the tweet dataset. This is necessary because URLs and symbols do not contribute to the later stages. Afterwards, the case folding and tokenization processes are carried out. At the case folding stage, every letter is converted to lowercase, which makes the dataset uniform. At the tokenization stage, each sentence is split into a list of its words. This facilitates the spell-correction stage, which takes a single word rather than a sentence as input.
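The three preprocessing stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the regular expressions, the function name `preprocess`, and the handling of the `b'` byte-string prefix and literal `\x..` escape sequences visible in the raw tweets are all choices made for this example.

```python
import re

# Assumed patterns for illustration; the paper does not list its exact regular expressions.
PREFIX_RE = re.compile("^\"?b['\"]")          # leading b' or "b' left by a byte-string dump
URL_RE = re.compile(r"https?://\S+")          # URLs
ESCAPE_RE = re.compile(r"\\x[0-9a-fA-F]{2}")  # literal \xNN sequences (encoded emoji bytes)
SYMBOL_RE = re.compile(r"[^\w\s]|_")          # punctuation and symbols


def preprocess(text: str) -> list[str]:
    """Data cleansing, then case folding, then tokenization."""
    # Data cleansing: drop the dump prefix, URLs, escape sequences, and symbols.
    text = PREFIX_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = ESCAPE_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    # Case folding: lowercase every letter.
    text = text.lower()
    # Tokenization: split the sentence into a list of words.
    return text.split()


if __name__ == "__main__":
    print(preprocess("Ibu terbaikkk!!"))
    print(preprocess("Tetaapp baik, ramah, dan sabarr ya bu"))
```

Each token in the returned list can then be passed on its own to the spell-correction stage.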

Spell Correction
Spell correction is important when dealing with large volumes of user input, in this case users' tweets. Several similarity measure algorithms are therefore used and compared: Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. The process of spell correction can be seen in Figure 2. The spelling correction scheme starts by checking the input word by word against the words in the dictionaries. The dictionaries used are the slang dictionary and the formal dictionary. The slang dictionary is a collection of informal words. Like other languages, Indonesian has several forms of everyday language [11]. Examples of slang dictionary entries can be seen in Table 3. The formal dictionary is a collection of formal words. This dictionary determines which words need to be normalized and supplies the standard words that replace them. Examples of formal dictionary entries can be seen in Table 4. If a word cannot be found in either dictionary, its spelling needs to be corrected. Spelling correction is done by calculating the similarity of the word to each word in the formal dictionary. The most similar dictionary word replaces the word in the dataset. The similarity is calculated with each of the three algorithms in turn.
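The lookup-then-correct scheme described above can be sketched as follows. Everything concrete here is an assumption for illustration: the two miniature dictionaries are hypothetical, the slang dictionary is assumed to map each informal word to a formal equivalent, and the default similarity function is `difflib.SequenceMatcher` from the Python standard library, used only as a stand-in for whichever of the three compared algorithms is plugged in.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical miniature dictionaries; the dictionaries used in the study are not reproduced here.
FORMAL_WORDS = {"baik", "guru", "sabar", "terbaik", "tetap"}
SLANG_TO_FORMAL = {"ga": "tidak", "aja": "saja"}  # assumption: slang entries map to a formal word

Similarity = Callable[[str, str], float]


def ratio_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the three compared algorithms."""
    return SequenceMatcher(None, a, b).ratio()


def normalize_token(token: str, similarity: Similarity = ratio_similarity) -> str:
    """Return the token unchanged if it is formal, mapped if it is slang, else spell-corrected."""
    if token in FORMAL_WORDS:
        return token
    if token in SLANG_TO_FORMAL:
        return SLANG_TO_FORMAL[token]
    # Not in either dictionary: replace with the most similar formal word.
    return max(sorted(FORMAL_WORDS), key=lambda word: similarity(token, word))


def normalize(tokens: list[str], similarity: Similarity = ratio_similarity) -> list[str]:
    return [normalize_token(token, similarity) for token in tokens]


if __name__ == "__main__":
    print(normalize(["tetaapp", "baik", "sabarr", "ga"]))
```

Passing the similarity function as a parameter keeps the dictionary logic identical across algorithms, so only the similarity computation differs when running times and accuracies are compared. A distance such as Levenshtein would need to be negated or converted to a similarity before being passed in.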

Levenshtein Distance
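Levenshtein Distance, proposed by Vladimir Levenshtein in 1965, is an algorithm for calculating the distance between two texts or groups of characters [12]. This distance is calculated as the minimum number of transformations needed to turn one string into another, where the transformations are deletion, insertion, and replacement [13]. The Levenshtein distance between two strings $a$, $b$ (which have lengths $|a|$ and $|b|$) is $\operatorname{lev}_{a,b}(|a|,|b|)$, which can be defined as (1):

$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise,} \end{cases} \tag{1}$$

where $1_{(a_i \neq b_j)}$ is equal to 0 when $a_i = b_j$ and equal to 1 otherwise.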
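A minimal dynamic-programming sketch of this distance is given below. It is an illustrative implementation rather than the one used in the experiments; the function name `levenshtein` and the two-row memory layout are choices made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, and replacements turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1  # indicator: 0 on equal characters, 1 otherwise
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # replacement (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("sabarr", "sabar"))  # one deletion
    print(levenshtein("dsan", "dan"))      # one deletion
```

For spelling correction, the dictionary word with the smallest distance to the input word is taken as the replacement.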

Jaro-Winkler Distance
The Jaro-Winkler Distance proposed by William E. Winkler is a refinement of the Jaro Distance algorithm [14]. The Jaro Distance algorithm determines the similarity of two words by counting the characters the two words share at positions that are not too far apart, then discounting that count by half the number of matching characters that are transposed. The similarity value of the Jaro Distance algorithm can be calculated by (2):

$$sim_j = \begin{cases} 0 & \text{if } m = 0, \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,} \end{cases} \tag{2}$$

where $|s_i|$ is the length of the string $s_i$, $m$ is the number of matching characters, and $t$ is half the number of transpositions [15].
The characters in $s_1$ and $s_2$ are declared a match if they are equal and the difference between their positions is not more than:

$$\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1 \tag{3}$$

Winkler extended the Jaro Distance algorithm by raising the score of strings that agree in their first characters, considering at most the first four. The Jaro-Winkler similarity value is defined as follows:

$$sim_w = sim_j + l\,p\,(1 - sim_j) \tag{4}$$

where $sim_j$ is the Jaro similarity, $l$ is the length of the common prefix (capped at 4), and $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes.
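The Jaro and Jaro-Winkler similarities can be sketched as below. This is an illustrative implementation under assumptions: the function names are chosen for this example, and the scaling factor defaults to 0.1, a commonly used value that the text above does not specify.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # half the number of transposed matching characters
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a common prefix of up to four characters."""
    sim_j = jaro(s1, s2)
    prefix = 0
    for char1, char2 in zip(s1, s2):
        if char1 != char2 or prefix == 4:
            break
        prefix += 1
    return sim_j + prefix * p * (1 - sim_j)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("sabarr", "sabar"), 4))
```

Unlike Levenshtein Distance, this measure is a similarity, so the dictionary word with the highest value is taken as the replacement.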

Smith-Waterman
The Smith-Waterman algorithm is used in the field of bioinformatics to compare two nucleotide sequences or protein structures. By applying the sequence alignment function of the Smith-Waterman algorithm, the text similarity can be calculated based on the word order [16]-[18].
The string-matching process between two strings produces an identical or similar alignment (hit), with or without sequence changes such as deletion, insertion, and replacement [16]. String matching between two strings $A = a_1, a_2, a_3, \ldots, a_n$ and $B = b_1, b_2, b_3, \ldots, b_m$ can be applied by the following steps. First, create a substitution matrix using equation (5) and set the gap penalty as in (6): a match is assigned +1, a mismatch is assigned -1, and a gap is assigned -2, that is, each gap position costs 2.

$$s(a_i, b_j) = \begin{cases} +1 & \text{if } a_i = b_j, \\ -1 & \text{if } a_i \neq b_j \end{cases} \tag{5}$$

$$W_1 = 2 \tag{6}$$

Second, initialize the scoring matrix $H$ with $H_{k0} = H_{0l} = 0$ for $0 \le k \le n$ and $0 \le l \le m$. The size of the scoring matrix is $(\operatorname{length}(A) + 1) \times (\operatorname{length}(B) + 1)$. Third, score each element of the scoring matrix using (7):

$$H_{ij} = \max \begin{cases} H_{i-1,j-1} + s(a_i, b_j), \\ \max_{k \ge 1} \{ H_{i-k,j} - W_k \}, \\ \max_{l \ge 1} \{ H_{i,j-l} - W_l \}, \\ 0 \end{cases} \qquad 1 \le i \le n,\ 1 \le j \le m \tag{7}$$

Here the entry $H_{ij}$ is the best similarity value over the prefixes $a_1 \ldots a_i$ and $b_1 \ldots b_j$ of the two strings, and $W_k$ is the cost of a gap, expressed as a linear gap penalty $W_k = k W_1$, where $k$ is the length of the gap. Fourth, trace back from the element with the highest score to an element with score 0, from the bottom to the top.
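The four steps can be sketched as follows, using the scores stated above (+1 for a match, -1 for a mismatch, and a cost of 2 per gap position). This is an illustrative implementation, not the one used in the experiments; the function name and return format are choices made for this example. Because the gap penalty is linear, a gap of length k costs the same as k single-position gaps, so the code only needs to look at the immediately neighbouring cells.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1,
                   gap: int = 2) -> tuple[int, str, str]:
    """Best local alignment score and the two aligned substrings ('-' marks a gap)."""
    n, m = len(a), len(b)
    # Step 2: scoring matrix of size (n + 1) x (m + 1), first row and column set to 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    # Step 3: fill the matrix; step 1 (substitution scores) is applied inline.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] - gap,    # gap in b
                          H[i][j - 1] - gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Step 4: trace back from the highest score until a cell with score 0 is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    print(smith_waterman("sabarr", "sabar"))
```

The nested loops visit every cell of the scoring matrix for each dictionary word, and the traceback adds further work on top.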

RESULT AND DISCUSSION
The results of this study are divided into two parts, namely the questionnaire data and the tweet data. The results were obtained from experiments with five repetitions. Table 5 below shows the experimental results of normalization on the questionnaire data using Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. The Jaro-Winkler Distance algorithm took 31.01 seconds, the fastest running time of the three algorithms. The Levenshtein Distance algorithm took 47.36 seconds, and the Smith-Waterman algorithm took the longest to process the questionnaire data, 437.72 seconds. In terms of accuracy, the Levenshtein Distance algorithm scored the highest, achieving 44.90%. The Smith-Waterman and Jaro-Winkler Distance algorithms followed, scoring 37.31% and 33.62%, respectively. Taking running time and accuracy together, the Levenshtein Distance algorithm performs better than the other two. Although it runs longer than the Jaro-Winkler Distance algorithm, the difference is small, at about 16 seconds. On the other hand, even though the Smith-Waterman algorithm scored higher accuracy than the Jaro-Winkler Distance algorithm, it does not appear to be a better option, as its running time was roughly 9 times that of Levenshtein Distance and 14 times that of Jaro-Winkler Distance.
Table 6 below shows the experimental results of normalization on the tweet data using Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. As with the questionnaire data, Jaro-Winkler Distance is the fastest (59.27 seconds on average) and Levenshtein Distance is the most accurate (60.04%), but the gaps in time and accuracy between the two differ from the previous section. Firstly, the gap in time between Levenshtein Distance and Jaro-Winkler Distance widens to around 38 seconds (around 16 seconds on the questionnaire dataset). Secondly, the gap in accuracy between these two algorithms narrows to approximately 5.35 percentage points (11.28 percentage points on the questionnaire dataset). Also, in this experiment, the Smith-Waterman algorithm did not score better than the others in terms of either time or accuracy.
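The gaps and ratios discussed for the questionnaire data can be recomputed directly from the reported averages. The short sketch below uses only the figures quoted in the text; the variable names are chosen for this example.

```python
# Average running time (seconds) and accuracy (%) reported for the questionnaire data.
times = {"Levenshtein": 47.36, "Jaro-Winkler": 31.01, "Smith-Waterman": 437.72}
accuracy = {"Levenshtein": 44.90, "Jaro-Winkler": 33.62, "Smith-Waterman": 37.31}

# Gap between Levenshtein Distance and Jaro-Winkler Distance.
time_gap = times["Levenshtein"] - times["Jaro-Winkler"]
accuracy_gap = accuracy["Levenshtein"] - accuracy["Jaro-Winkler"]

# How many times longer Smith-Waterman runs than each of the other two.
sw_vs_jw = times["Smith-Waterman"] / times["Jaro-Winkler"]
sw_vs_lev = times["Smith-Waterman"] / times["Levenshtein"]

if __name__ == "__main__":
    print(f"time gap: {time_gap:.2f} s, accuracy gap: {accuracy_gap:.2f} points")
    print(f"Smith-Waterman vs Jaro-Winkler: {sw_vs_jw:.1f}x")
    print(f"Smith-Waterman vs Levenshtein: {sw_vs_lev:.1f}x")
```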

CONCLUSION
Based on the experimental results, it can be concluded that the Jaro-Winkler Distance algorithm normalizes the two Indonesian datasets the fastest. In terms of accuracy, however, the Levenshtein Distance algorithm provides better results than Jaro-Winkler Distance and Smith-Waterman. The Smith-Waterman algorithm is not a better choice than either, because it takes far longer to run and its accuracy is only mediocre.