Identifying The Common Type of Spelling Error by Leveraging Levenshtein Distance and N-gram

Margareta Hardiyanti(1),


(1) Universitas Telkom

Abstract

Purpose: This study aims to identify the common type of spelling error and it uses the list of common misspelling words submitted by Wikipedia contributors. Methods: Levenshtein and N-gram distance are utilized to predict the correct word of misspelling from English dictionary. Then, the result of both algorithms is observed and evaluated using recall metrics to determine which technique works more effectively. Result: The result of this study shows that Levenshtein works well to correct substitution single letter and transposition two sequenced letters, while N-gram operates effectively to fix the word with letter omission. The overall result is then evaluated by recall measurement to see which technique that works well on correcting the misspellings. Since the recall of Levenshtein is higher than N-gram, it is concluded that the frequency of misspelling words that are correctly fixed by Levenshtein occurs more often. Novelty: This is the first study that compares two spelling correction algorithms on identifying the common type of spelling error.

Keywords

Spelling Error Type, Levenshtein, N-gram

Full Text:

PDF

References

J. T. Grudin, “Error Patterns in Novice and Skilled Transcription Typing,†Cogn. Asp. Ski. Typewriting, pp. 121–143, 1983.

A. A. Lunsford and K. J. Lunsford, “‘Mistakes are a fact of life’: A national comparative study,†Coll. Compos. Commun., vol. 59, no. 4, pp. 781–806, 2008.

R. Nagata, H. Takamura, and G. Neubig, “Adaptive Spelling Error Correction Models for Learner English,†Procedia Comput. Sci., vol. 112, pp. 474–483, 2017.

K. Kukich, “Techniques for Automatically Correcting Words in Text,†ACM Comput. Surv., vol. 24, no. 4, pp. 377–439, 1992.

F. J. Damerau, “A technique for computer detection and correction of spelling errors,†Commun. ACM, vol. 7, no. 3, pp. 171–176, 1964.

J. H. Lee, M. Kim, and H. C. Kwon, “Deep Learning-Based Context-Sensitive Spelling Typing Error Correction,†IEEE Access, vol. 8, pp. 152565–152578, 2020.

A. I. Fahma, “Identifikasi Kesalahan Penulisan Kata ( Typographical Error ) pada Dokumen Berbahasa Indonesia Menggunakan Metode N-gram dan Levenshtein Distance,†J. Pengemb. Teknol. Inf. dan Ilmu Komput., vol. 2, no. 1, pp. 53–62, 2018.

P. F. M. Hameed, “A Study of the Spelling Errors committed by Students of English in Saudi Arabia: Exploration and Remedial Measures,†Adv. Lang. Lit. Stud., vol. 7, no. 1, 2015.

M. A. Murdilan, A. Abdiansyah, and N. Yusliani, “Sistem Koreksi Kesalahan Tulisan Pada Teks Karangan Siswa Sekolah Dasar Mengunakan Metode N-Gram Dan Levenshtein Distance,†Universitas Sriwijaya, 2020.

T. Okuda, E. Tanaka, and T. Kasai, “A Method for the Correction of Garbled Words Based on the Levenshtein Metric,†IEEE Trans. Comput., vol. C–25, no. 2, pp. 172–178, 1976.

H. N. Abdulkhudhur, I. Q. Habeeb, Y. Yusof, and S. A. M. Yusof, “Implementation of improved Levenshtein algorithm for spelling correction word candidate list generation,†J. Theor. Appl. Inf. Technol., vol. 88, no. 3, pp. 449–455, 2016.

F. Ahmed, E. W. De Luca, and A. Nürnberger, “Revised N-Gram based Automatic Spelling Correction Tool to Improve Retrieval Effectiveness,†Polibits, vol. 40, no. 40, pp. 39–48, 2009.

J. R. Ullmann, “A Binary n-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion and Reversal Errors in Words,†Comput. J., 1975.

Q. Huang et al., “Chinese Spelling Check System Based on Tri-gram Model,†in Proc. Eighth SIGHAN Workshop Chin. Lang. Process., 2015, pp. 128–136.

“Wikipedia:Lists of common misspellings/For Machines,†2020. [Online]. Available: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines.

V. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,†Soviet physics doklady, vol. 10, no. 8. pp. 707–710, 1966.

R. Haldar and D. Mukhopadhyay, “Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach,†ArXiv, 2011.

J. L. Peterson, “A note on undetected typing errors,†Commun. ACM, vol. 29, no. 7, pp. 633–637, 1986.

M. Buckland and F. Gey, “The Relationship between Recall and Precision,†J. Am. Soc. Inf. Sci., vol. 45, no. 1, pp. 12–19, 1994.

G. Navarro, “A guided tour to approximate string matching,†ACM Comput. Surv., vol. 33, no. 1, pp. 31–88, 2001.

V. Christanti Mawardi, N. Susanto, and D. Santun Naga, “Spelling Correction for Text Documents in Bahasa Indonesia Using Finite State Automata and Levinshtein Distance Method,†MATEC Web Conf., vol. 164, 2018.

W. J. Wilbur, W. Kim, and N. Xie, “Spelling correction in the PubMed search engine,†Inf. Retr. Boston., vol. 9, no. 5, pp. 543–564, 2006.

Refbacks

  • There are currently no refbacks.




Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji
Email: [email protected]

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.