DCU@FIRE-2014: an information retrieval approach for source code plagiarism detection
Ganguly, Debasis (ORCID: 0000-0003-0050-7138) and Jones, Gareth J.F. (ORCID: 0000-0002-4033-9135)
(2014)
DCU@FIRE-2014: an information retrieval approach for source code plagiarism detection.
In: Forum for Information Retrieval Evaluation (FIRE 2014) workshop, 5-7 Dec 2014, Bangalore, India.
This paper investigates an information retrieval (IR) based approach for source code plagiarism detection. Exhaustively checking pairwise similarities between documents does not scale to large collections of source code documents. To make source code plagiarism detection fast and scalable in practice, we propose an IR-based approach in which each document is treated as a pseudo-query in order to retrieve a list of candidate documents in decreasing order of their similarity values. A threshold is then applied to the relative similarity decrement ratios to report a set of documents as potential cases of source code reuse. Instead of treating source code as an unstructured text document, we explore term extraction from the annotated parse tree of the source code and also make use of a field-based language model for indexing and retrieval of source code documents. Results confirm that source code parsing plays a vital role in improving plagiarism prediction accuracy.
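The thresholding step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function name `candidates_before_drop`, the `min_drop` parameter, and the definition of the relative decrement between consecutive ranks as (s_prev - s_cur) / s_prev. The ranked list is taken as given, standing in for the output of retrieving with one document as a pseudo-query.

```python
from typing import List, Tuple


def candidates_before_drop(
    ranked: List[Tuple[str, float]], min_drop: float = 0.5
) -> List[str]:
    """Return the ids ranked above the first sharp relative drop in similarity.

    `ranked` is a list of (doc_id, similarity) pairs sorted by decreasing
    similarity, e.g. the result list for one pseudo-query document.
    The relative decrement used here, (prev - cur) / prev, and the
    `min_drop` cut-off are illustrative assumptions, not the paper's
    stated formula or tuned value.
    """
    for i in range(1, len(ranked)):
        prev_score = ranked[i - 1][1]
        cur_score = ranked[i][1]
        # Guard against division by zero for non-positive scores.
        if prev_score > 0 and (prev_score - cur_score) / prev_score >= min_drop:
            # Everything above the sharp drop is reported as potential reuse.
            return [doc_id for doc_id, _ in ranked[:i]]
    # No sharp drop: similarities decay smoothly, so nothing is reported.
    return []


if __name__ == "__main__":
    # Hypothetical ranked list for one pseudo-query document.
    results = [("b.java", 0.92), ("c.java", 0.88), ("d.java", 0.30), ("e.java", 0.28)]
    print(candidates_before_drop(results))
```

In this sketch a ranked list whose scores decay smoothly yields no reported documents, on the assumption that a small group of reused files stands out from the rest of the collection by a sharp fall in similarity. Other policies, such as always reporting the top-ranked document, are equally possible readings of the abstract.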