A Simplified Latent Semantic Indexing Approach for Multi-Linguistic Information Retrieval
The Latent Semantic Indexing (LSI) approach offers a promising way to overcome the language barrier between queries and documents, but the high dimensionality of the training matrix makes its key step, Singular Value Decomposition (SVD), computationally prohibitive. Based on the semantic parallelism of the multi-linguistic training corpus, we prove in this paper that, in theory, if the training term-by-document matrix appears in either of two symmetry forms, strong or weak, the matrix under decomposition can be reduced to the size of a monolingual matrix. Retrieval accuracy does not deteriorate under this simplification. We also discuss what these two forms of symmetry mean in the context of multi-linguistic information retrieval. Although the term-by-document matrices of real-world data are not naturally in either symmetry form, we suggest a way to bring them closer to the strong form by means of word clustering and term weighting. An experiment on real data is also given to support our method of simplification.
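For orientation, the baseline that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of conventional cross-language LSI on a stacked parallel training matrix, not the simplified decomposition proved in the paper. The toy vocabulary, the counts, the choice of k and the helper names (train_lsi, fold_in, cosine) are all assumptions made for the example.

```python
import numpy as np

def train_lsi(term_doc, k):
    """Truncated SVD of a term-by-document matrix; returns (U_k, s_k)."""
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k]

def fold_in(vec, u_k, s_k):
    """Project a term-space vector into the latent space: S_k^-1 U_k^T vec."""
    return (u_k.T @ vec) / s_k

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical parallel corpus: 3 training documents, each available in
# English and French. Rows are term counts, columns are documents.
# English terms: cat, pet, car. French terms: chat, animal, voiture.
english = np.array([[2.0, 1.0, 0.0],
                    [1.0, 2.0, 0.0],
                    [0.0, 0.0, 2.0]])
french = english.copy()  # assumed perfectly parallel, for illustration only

# The multilingual training matrix stacks both vocabularies, so the SVD
# runs on a 6 x 3 matrix instead of a 3 x 3 monolingual one.
training = np.vstack([english, french])
u_k, s_k = train_lsi(training, k=2)

# An English-only query ("cat") and two French-only documents.
query_cat = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
doc_chat = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
doc_voiture = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

q = fold_in(query_cat, u_k, s_k)
sim_chat = cosine(q, fold_in(doc_chat, u_k, s_k))
sim_voiture = cosine(q, fold_in(doc_voiture, u_k, s_k))
print(f"cat vs chat: {sim_chat:.3f}, cat vs voiture: {sim_voiture:.3f}")
```

The query and the documents share no terms, yet the query lands next to the French document on the same topic in the latent space. The cost the abstract refers to comes from the stacked matrix: its row count grows with every added language, which is what the proposed reduction to a monolingual-sized matrix is meant to avoid.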
An approach to source-code plagiarism detection investigation using latent semantic analysis
This thesis looks at three aspects of source-code plagiarism. The first aspect of the
thesis is concerned with creating a definition of source-code plagiarism; the second aspect
is concerned with describing the findings gathered from investigating the Latent Semantic
Analysis information retrieval algorithm for source-code similarity detection; and the final
aspect of the thesis is concerned with the proposal and evaluation of a new algorithm that
combines Latent Semantic Analysis with plagiarism detection tools.
A recent review of the literature revealed that there is no commonly agreed definition of
what constitutes source-code plagiarism in the context of student assignments. This thesis
first analyses the findings from a survey carried out to gather an insight into the perspectives
of UK Higher Education academics who teach programming on computing courses. Based
on the survey findings, a detailed definition of source-code plagiarism is proposed.
Secondly, the thesis investigates the application of an information retrieval technique,
Latent Semantic Analysis, to derive semantic information from source-code files. Various
parameters drive the effectiveness of Latent Semantic Analysis. The performance of Latent
Semantic Analysis using various parameter settings and its effectiveness in retrieving
similar source-code files when optimising those parameters are evaluated.
Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection
tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is
a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism
detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility
for investigating the importance of source-code fragments with regard to their contribution
towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of
suspicious files and source-code fragments.
EThOS - Electronic Theses Online Service, United Kingdom