Damerau-Levenshtein Distance Algorithm Based on Abstract Syntax Tree to Detect Code Plagiarism

Ahlijati Nuraminah(1), Abdullah Ammar(2)


(1) Department of Computer Science, Sekolah Tinggi Ilmu Manajemen dan Ilmu Komputer ESQ, Indonesia
(2) Department of Computer Science, Sekolah Tinggi Ilmu Manajemen dan Ilmu Komputer ESQ, Indonesia

Abstract

Purpose: This research aimed to detect source code plagiarism based on the Abstract Syntax Tree (AST) using the Damerau-Levenshtein Distance algorithm, with the expectation of reducing the inaccuracy and time consumption of manual checking.

Methods: The Damerau-Levenshtein Distance algorithm was used to determine the similarity between source code files, and F-Measure was then calculated. The dataset, which consisted of 178 source code files from 20 coursework assignments, was obtained from a GitHub repository by Lawton Nichols (2019). The algorithm computes the minimum cost required to transform one line of code into another. The AST was generated with ANTLR and then preprocessed through node pruning, function and variable sorting, and log output removal.
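
The distance computation described in the Methods can be illustrated with a minimal sketch. It assumes the restricted variant of Damerau-Levenshtein (optimal string alignment), which may differ from the variant the authors implemented. The function names `osa_distance` and `similarity`, and the normalization by the longer sequence length, are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def osa_distance(a: Sequence, b: Sequence) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Works on any sequences with comparable items, e.g. characters of a
    line of code or a list of AST node labels.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = minimum cost of turning the first i items of a
    # into the first j items of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i items
    for j in range(cols):
        d[0][j] = j  # insert all j items
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent items counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Assumed normalization: 1.0 for identical input, 0.0 for maximal distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest
```

Because the function accepts any sequence, the same code can compare two strings character by character or two lists of tokens, for example `osa_distance(["int", "x", "=", "1"], ["int", "=", "x", "1"])`, where the swapped neighbours cost one edit instead of two.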

Result: The two methods took 5.704 seconds and 0.996 seconds to complete. The lowest and highest F-Measure values obtained were 0.16 and 0.8, respectively. The system therefore performed detection quickly and detected common forms of code plagiarism effectively, but had difficulty with more complex forms.
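
For reference, F-Measure is the weighted harmonic mean of precision and recall. The sketch below shows the standard formula. The abstract does not state how true and false positives were counted, so the counts in the usage note are hypothetical and are not the paper's data.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-Measure from true positives, false positives and false negatives.

    beta = 1.0 gives the usual F1 score. Returns 0.0 when precision and
    recall are both zero or undefined.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With hypothetical counts of 8 correctly flagged pairs, 2 wrongly flagged pairs, and 2 missed pairs, precision and recall are both 0.8, so `f_measure(8, 2, 2)` is approximately 0.8.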

Novelty: This research used the AST and the Damerau-Levenshtein Distance algorithm to calculate 5 levels of similarity in Java source code. For further development, additional preprocessing steps are needed to prune unnecessary nodes and to detect code that is equivalent but written with different syntax.
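
The idea of pruning and sorting AST nodes before comparison can be sketched on a toy tree. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class stands in for the parse tree that ANTLR would produce, and the labels in `PRUNED_LABELS` and `SORTED_PARENTS` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy AST node (a stand-in for an ANTLR parse tree node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical label sets; real names depend on the grammar in use.
PRUNED_LABELS = {"LogCall", "Comment"}  # nodes dropped before comparison
SORTED_PARENTS = {"ClassBody"}          # nodes whose children get a canonical order


def flatten(node: Node) -> List[str]:
    """Pre-order list of node labels, usable as input to a sequence distance."""
    labels = [node.label]
    for child in node.children:
        labels.extend(flatten(child))
    return labels


def preprocess(node: Node) -> Node:
    """Return a copy with pruned nodes removed and members sorted."""
    kept = [preprocess(c) for c in node.children if c.label not in PRUNED_LABELS]
    if node.label in SORTED_PARENTS:
        # Sort siblings by their flattened form so that reordering
        # declarations does not change the resulting sequence.
        kept.sort(key=lambda n: " ".join(flatten(n)))
    return Node(node.label, kept)
```

Two class bodies that differ only in member order and in an added log statement flatten to the same label sequence after `preprocess`, so a sequence distance computed on them would be zero.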

 

Keywords

Code plagiarism; Code similarity; Abstract syntax tree; Damerau-Levenshtein distance algorithm

References

M. Krokoscz, “Plagiarism in articles published in journals,” Int. J. Educ. Integr., vol. 17, no. 1, pp. 1–22, 2021, [Online]. Available: https://doi.org/10.1007/s40979-020-00063-5

G. Cosma and M. Joy, “Towards a definition of source-code plagiarism,” IEEE Trans. Educ., vol. 51, no. 2, pp. 195–200, 2008, doi: 10.1109/TE.2007.906776.

J. Pierce and C. Zilles, “Investigating student plagiarism patterns and correlations to grades,” Proc. Conf. Integr. Technol. into Comput. Sci. Educ. ITiCSE, pp. 471–476, 2017, doi: 10.1145/3017680.3017797.

M. N. Tran, S. Marshall, and L. Hogg, “Development of doctoral student perceptions of plagiarism and academic integrity: the roles of agency and aspirational identity,” Acad. Qual. Integr. New High. Educ. Digit. Environ., pp. 143–162, 2023, doi: 10.1016/B978-0-323-95423-5.00006-5.

L. Sun, L. Hu, and D. Zhou, “Programming attitudes predict computational thinking: Analysis of differences in gender and programming experience,” Comput. Educ., vol. 181, no. 27, p. 104457, 2022, doi: 10.1016/j.compedu.2022.104457.

A. A. Pandit and G. Toksha, “Review of Plagiarism Detection Technique in Source Code,” pp. 393–405, 2020, doi: 10.1007/978-981-15-0633-8_38.

D. Gitchell and N. Tran, “Sim,” ACM SIGCSE Bull., vol. 31, no. 1, pp. 266–270, 1999, doi: 10.1145/384266.299783.

A. Ahtiainen, S. Surakka, and M. Rahikainen, “Plaggie,” Proc. 6th Balt. Sea Conf. Comput. Educ. Res. Koli Call. 2006, pp. 141–142, 2006, doi: 10.1145/1315803.1315831.

V. T. Martins, D. Fonte, P. R. Henriques, and D. Da Cruz, “Plagiarism detection: A tool survey and comparison,” OpenAccess Ser. Informatics, vol. 38, pp. 143–158, 2014, doi: 10.4230/OASIcs.SLATE.2014.143.

A. Ahadi and L. Mathieson, “A Comparison of Three Popular Source code Similarity Tools for Detecting Student Plagiarism,” Proc. Twenty-First Australas. Comput. Educ. Conf. ACM (Association Comput. Mach., pp. 112–117, 2019, doi: 10.1145/3286960.3286974.

M. J. Mišić, J. Protić, and M. V. Tomašević, “Improving source code plagiarism detection: Lessons learned,” 2017 25th Telecommun. Forum, TELFOR 2017 - Proc., vol. 2017-January, pp. 1–8, 2018, doi: 10.1109/TELFOR.2017.8249481.

Y. Golubev, V. Poletansky, N. Povarov, and T. Bryksin, “Multi-threshold token-based code clone detection,” Proc. - 2021 IEEE Int. Conf. Softw. Anal. Evol. Reengineering, SANER 2021, pp. 496–500, 2021, doi: 10.1109/SANER50967.2021.00053.

N. Kumar, “A graph based automatic plagiarism detection technique to handle artificial word reordering and paraphrasing,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8404 LNCS, no. PART 2, pp. 481–494, 2014, doi: 10.1007/978-3-642-54903-8_40.

O. Karnalim, “IR-based technique for linearizing abstract method invocation in plagiarism-suspected source code pair,” J. King Saud Univ. - Comput. Inf. Sci., vol. 31, no. 3, pp. 327–334, 2019, doi: 10.1016/j.jksuci.2018.01.012.

O. Karnalim, “A Low-Level Structure-based Approach for Detecting Source Code Plagiarism,” 2019, [Online]. Available: http://orcid.org/0000-0003-4930-6249

M. Duracik, E. Krsak, and P. Hrkut, “Scalable Source Code Plagiarism Detection Using Source Code Vectors Clustering,” Proc. IEEE Int. Conf. Softw. Eng. Serv. Sci. ICSESS, vol. 2018-November, pp. 499–502, 2018, doi: 10.1109/ICSESS.2018.8663708.

L. Nichols, K. Dewey, M. Emre, S. Chen, and B. Hardekopf, “Syntax-based Improvements to Plagiarism Detectors and their Evaluations,” Annu. Conf. Innov. Technol. Comput. Sci. Educ. ITiCSE, pp. 555–561, 2019, doi: 10.1145/3304221.3319789.

Y. Chaabi and F. Ataa Allah, “Amazigh spell checker using Damerau-Levenshtein algorithm and N-gram,” J. King Saud Univ. - Comput. Inf. Sci., vol. 34, no. 8, pp. 6116–6124, 2022, doi: 10.1016/j.jksuci.2021.07.015.

F. J. Damerau, “A technique for computer detection and correction of spelling errors,” Commun. ACM, vol. 7, no. 3, pp. 171–176, 1964, doi: 10.1145/363958.363994.

A. M. Bejarano, L. E. García, and E. E. Zurek, “Detection of source code similitude in academic environments,” Comput. Appl. Eng. Educ., vol. 23, no. 1, pp. 13–22, 2015, doi: 10.1002/cae.21571.

N. Tahaei and D. C. Noelle, “Automated plagiarism detection for computer programming exercises based on patterns of resubmission,” ICER 2018 - Proc. 2018 ACM Conf. Int. Comput. Educ. Res., pp. 178–186, 2018, doi: 10.1145/3230977.3231006.

T. Saǧlam, S. Hahner, J. W. Wittler, and T. Kühn, “Token-based plagiarism detection for metamodels,” Proc. - ACM/IEEE 25th Int. Conf. Model Driven Eng. Lang. Syst. Model. 2022 Companion Proc., pp. 138–141, 2022, doi: 10.1145/3550356.3556508.

Y. Yustikasari, H. Mubarok, and R. Rianto, “Comparative Analysis Performance of K-Nearest Neighbor Algorithm and Adaptive Boosting on the Prediction of Non-Cash Food Aid Recipients,” Sci. J. Informatics, vol. 9, no. 2, pp. 205–217, 2022, doi: 10.15294/sji.v9i2.32369.

N. Hazimah, S. Harahap, A. Amirullah, M. B. Saputro, and I. A. Tamaroh, “Classification of potential customers using C4.5 and k-means algorithms to determine customer service priorities to maintain loyalty,” J. Soft Comput. Explor., vol. 3, no. 2, pp. 123–130, 2022, doi: 10.52465/joscex.v3i2.89.

W. F. Abror and M. Aziz, “Journal of Information System Bankruptcy Prediction Using Genetic Algorithm-Support Vector Machine (GA-SVM) Feature Selection and Stacking,” vol. 1, no. 2, pp. 103–108, 2023.

M. B. Miles, A. M. Huberman, and J. Saldana, Qualitative Data Analysis: A Methods Sourcebook, 3rd ed. SAGE Publications, Inc, 2014.

W. Wen, X. Xue, Y. Li, P. Gu, and J. Xu, “Code Similarity Detection using AST and Textual Information,” Int. J. Performability Eng., vol. 15, no. 10, pp. 2683–2691, 2019, doi: 10.23940/ijpe.19.10.p14.26832691.

T. Parr, The Definitive ANTLR 4 Reference. Dallas, Texas: The Pragmatic Bookshelf, 2014.

Y. Y. Wang, R. K. Shen, G. J. Chiou, C. Y. Yang, V. R. L. Shen, and F. P. Putri, “Novel code plagiarism detection based on abstract syntax tree and fuzzy petri nets,” Int. J. Eng. Educ., vol. 1, no. 1, pp. 46–56, 2019, doi: 10.14710/IJEE.1.1.46-56.

K. Kredpattanakul and Y. Limpiyakorn, “Transforming javascript-based web application to cross-platform desktop with electron,” Lect. Notes Electr. Eng., vol. 514, pp. 571–579, 2019, doi: 10.1007/978-981-13-1056-0_56.

R. Ollila, N. Mäkitalo, and T. Mikkonen, “Modern Web Frameworks: A Comparison of Rendering Performance,” J. Web Eng., 2022, doi: 10.13052/jwe1540-9589.21311.

G. Manduchi, A. Luchetta, G. Moro, A. Rigoni, and C. Taliercio, “Web-based streamed waveform display using MDSplus events and Node.js,” Fusion Eng. Des., vol. 157, no. January, p. 111625, 2020, doi: 10.1016/j.fusengdes.2020.111625.

X. Geng, X. Zeng, L. Hu, and Z. Guo, “An Novel Architecture and Inter-process Communication Scheme to Adapt Chromium Based on Docker Container,” Procedia Comput. Sci., vol. 107, no. Icict, pp. 691–696, 2017, doi: 10.1016/j.procs.2017.03.149.

Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji
Email: [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.