thesis
An approach to source-code plagiarism detection investigation using latent semantic analysis
- Publication date
- Publisher
Abstract
This thesis looks at three aspects of source-code plagiarism. The first aspect of the
thesis is concerned with creating a definition of source-code plagiarism; the second aspect
is concerned with describing the findings gathered from investigating the Latent Semantic
Analysis information retrieval algorithm for source-code similarity detection; and the final
aspect of the thesis is concerned with the proposal and evaluation of a new algorithm that
combines Latent Semantic Analysis with plagiarism detection tools.
A recent review of the literature revealed that there is no commonly agreed definition of
what constitutes source-code plagiarism in the context of student assignments. This thesis
first analyses the findings from a survey carried out to gather an insight into the perspectives
of UK Higher Education academics who teach programming on computing courses. Based
on the survey findings, a detailed definition of source-code plagiarism is proposed.
Secondly, the thesis investigates the application of an information retrieval technique,
Latent Semantic Analysis, to derive semantic information from source-code files. Various
parameters drive the effectiveness of Latent Semantic Analysis. The performance of Latent
Semantic Analysis using various parameter settings and its effectiveness in retrieving
similar source-code files when optimising those parameters are evaluated.
Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection
tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is
a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism
detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility
for investigating the importance of source-code fragments with regards to their contribution
towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of
suspicious files and source-code fragments