Plagiarism detection system for Armenian language
In the academic context, it is very important to evaluate the uniqueness of reports, scientific papers and other documents that are disseminated on the web every day. There are already several tools for this purpose, but none for Armenian texts. In this paper, a system to analyze the similarity of Armenian documents is presented. The idea is to collect a set of documents from the same domain in order to identify keywords. Based on that information, the system then receives two documents and compares them, calculating the probability of plagiarism. For that, an approach based on several levels of analysis is implemented, and some of those steps allow user interaction, such as choosing options or adding more information.
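The pipeline the abstract describes, keyword extraction from a domain corpus followed by pairwise comparison, might be sketched as below. The frequency-based keyword selection and the Jaccard score are illustrative assumptions, not the paper's actual model:

```python
from collections import Counter

def domain_keywords(corpus, top_k=100):
    """Collect the most frequent terms of a domain corpus as keywords
    (a stand-in for whatever keyword extraction the system uses)."""
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    return {w for w, _ in counts.most_common(top_k)}

def similarity(doc_a, doc_b, keywords):
    """Compare two documents restricted to the domain keywords;
    the Jaccard overlap serves as a crude plagiarism score."""
    a = set(doc_a.lower().split()) & keywords
    b = set(doc_b.lower().split()) & keywords
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)
```

A threshold on this score would then decide whether a pair of documents is flagged for the interactive steps the paper mentions.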
TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion
Several computing courses allow students to choose which programming language they want to use for completing a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response, this paper proposes a detection technique which is able to accurately compare code files written in various programming languages, with only limited effort needed to accommodate those languages at the development stage. The only language-dependent component used in the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting, in which rare matches are prioritised. Our evaluation shows that the technique outperforms techniques common in academia at handling language conversion disguises. Further, it is comparable to those techniques when dealing with conventional disguises.
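The weighting idea can be illustrated with a minimal sketch: every token shared between two submissions contributes its inverse document frequency, so matches on ubiquitous tokens (common keywords, loop variables) add almost nothing, while rare matches dominate the score. The token lists are assumed to come from a language-specific tokeniser, the technique's only language-dependent step; the exact formula below is an assumption:

```python
import math
from collections import Counter

def idf_weights(token_docs):
    """token_docs: one token list per submission in the corpus.
    Returns log(N / document frequency) for every token seen."""
    n = len(token_docs)
    df = Counter(tok for doc in token_docs for tok in set(doc))
    return {tok: math.log(n / df[tok]) for tok in df}

def weighted_match_score(tokens_a, tokens_b, idf):
    """Sum IDF weights over shared tokens: rare matches are prioritised,
    coincidental matches on ubiquitous tokens contribute roughly zero."""
    return sum(idf.get(t, 0.0) for t in set(tokens_a) & set(tokens_b))
```

A token appearing in every submission gets weight log(1) = 0, so boilerplate shared by the whole class cannot inflate the score.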
An approach to source-code plagiarism detection investigation using latent semantic analysis
This thesis looks at three aspects of source-code plagiarism. The first aspect of the
thesis is concerned with creating a definition of source-code plagiarism; the second aspect
is concerned with describing the findings gathered from investigating the Latent Semantic
Analysis information retrieval algorithm for source-code similarity detection; and the final
aspect of the thesis is concerned with the proposal and evaluation of a new algorithm that
combines Latent Semantic Analysis with plagiarism detection tools.
A recent review of the literature revealed that there is no commonly agreed definition of
what constitutes source-code plagiarism in the context of student assignments. This thesis
first analyses the findings from a survey carried out to gain insight into the perspectives
of UK Higher Education academics who teach programming on computing courses. Based
on the survey findings, a detailed definition of source-code plagiarism is proposed.
Secondly, the thesis investigates the application of an information retrieval technique,
Latent Semantic Analysis, to derive semantic information from source-code files. Various
parameters drive the effectiveness of Latent Semantic Analysis. The thesis evaluates the
performance of Latent Semantic Analysis under various parameter settings, and its
effectiveness in retrieving similar source-code files when those parameters are optimised.
Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection
tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is
a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism
detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility
for investigating the importance of source-code fragments with regards to their contribution
towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of
suspicious files and source-code fragments.
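As a rough illustration of the Latent Semantic Analysis core described above (not PlaGate's actual implementation), a truncated SVD projects a term-by-document matrix built from source-code terms into a low-dimensional latent space, where documents are compared by cosine similarity; the dimensionality k is one of the parameters the thesis tunes:

```python
import numpy as np

def lsa_similarity(term_doc, k=2):
    """term_doc: terms x documents matrix of (e.g. tf-idf) weights.
    Returns pairwise cosine similarities in the k-dim latent space."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = (np.diag(s[:k]) @ vt[:k]).T          # one row per document
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.clip(norms, 1e-12, None)   # guard against zero rows
    return docs @ docs.T
```

Document pairs whose latent similarity exceeds a chosen threshold would correspond to the clusters of suspicious files that the tool visualises.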
Software Plagiarism Detection Using N-grams
Plagiarism is an act of copying in which one does not rightfully credit the original source.
The motivations behind plagiarism range from completing academic courses to gaining an
economic advantage. Plagiarism exists in various domains where credit is sought for
creative work. These areas include, for example, literature, art and software, all of which
carry a notion of authorship.
In this thesis we conduct a systematic literature review on the topic of source code
plagiarism detection methods; based on the literature, we then propose a new approach to
detecting plagiarism which combines similarity detection and authorship identification,
introduce our tokenization method for source code, and lastly evaluate the model using
real-life data sets. The goal of our model is to point out possible plagiarism in a collection
of documents, which in this thesis means a collection of source code files written by various
authors. The data used in our statistical methods consists of three datasets: (1) a collection
of documents from the University of Helsinki's first programming course, (2) a collection
of documents from the University of Helsinki's advanced programming course, and (3)
submissions to a source code re-use competition. The statistical methods in this thesis are
inspired by the theory of search engines: they relate to data mining when detecting
similarity between documents, and to machine learning when classifying a document with
its most likely author in authorship identification.
Results show that our similarity detection model can successfully be used to retrieve
documents for further plagiarism inspection, but false positives appear quickly even when
using a high threshold to control the minimum allowed level of similarity between
documents. We were unable to use the results of authorship identification in our study, as
the results of our machine learning model were not good enough to be used sensibly. This
was possibly caused by the high similarity between documents, which stems from the
restricted tasks and from a course setting that teaches a specific programming style over
the timespan of the course.
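A minimal form of the n-gram similarity detection discussed in the thesis might look as follows; the plain token stream and the simple Jaccard measure are simplifications of the thesis's tokenization method and statistics:

```python
def token_ngrams(tokens, n=3):
    """All contiguous token n-grams of a source file."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(tokens_a, tokens_b, n=3):
    """Jaccard similarity over token n-grams; files whose score exceeds
    a chosen threshold are retrieved for further plagiarism inspection."""
    a, b = token_ngrams(tokens_a, n), token_ngrams(tokens_b, n)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)
```

As the thesis observes, even a high threshold still admits false positives when restricted course tasks force very similar solutions.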
A Plagiarism Detection Algorithm based on Extended Winnowing
Plagiarism is a common problem faced by academia and education. Mature commercial plagiarism detection systems are comprehensive and highly accurate, but their high detection costs make them unsuitable for real-time, lightweight application environments such as the plagiarism detection of student assignments. This paper introduces a method that extends the classic Winnowing plagiarism detection algorithm, expanding its functionality. While extracting the fingerprints of a document, the extended algorithm retains the location and length information of text in the original document, so that locating and marking plagiarised text fragments becomes much easier to achieve. Experimental results and several years of running practice show that the extension has little effect on performance, and a PC with a normal hardware configuration is able to meet the requirements of small and medium-sized applications. Building on the lightweight, efficient, reliable and flexible character of Winnowing, the extended algorithm further enhances its adaptability and extends its application areas.
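The classic Winnowing scheme that the paper extends can be sketched in a few lines: hash every k-gram of the text, slide a window over w consecutive hashes, and keep the minimum of each window as a fingerprint. Storing the character offset next to each hash, which is the essence of the extension described above, lets a matching fingerprint be mapped back to its position in the original document. Python's built-in hash stands in here for the rolling Karp-Rabin hash normally used, so fingerprints are only comparable within one process:

```python
def winnow(text, k=5, w=4):
    """Return a set of (hash, offset) fingerprints for text.
    Each fingerprint is the minimum hash within a window of w
    consecutive k-gram hashes; the offset locates it in the text."""
    grams = [(hash(text[i:i + k]), i) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for j in range(len(grams) - w + 1):
        fingerprints.add(min(grams[j:j + w], key=lambda hp: hp[0]))
    return fingerprints
```

Two documents are then compared by intersecting their fingerprint hashes; the stored offsets allow the matching fragments to be located and marked, as the extended algorithm requires.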
Plagiarism detection for Indonesian texts
As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for an automatic plagiarism checker is becoming more real. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not yet well developed: most of it deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity only globally. As a result, the systems produced by this research are incapable of pointing to the exact locations of ``similar passage'' pairs. Besides, no public, standard corpora have been available for evaluating PDS on Indonesian texts.
To address the weaknesses of this former research, this thesis develops a plagiarism detection system which executes various methods for the stages of plagiarism detection in a workflow system. In the retrieval stage, a novel document feature coined phraseword is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are copied partially, or in obfuscated form, into a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed at the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally-weighted significant terms, so as to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers and by algorithmic random generation.
Using this corpus, the performance of the proposed methods was evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features were able to achieve the optimum recall rate of 1. In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated at 4 levels of measure: character, passage, document, and case. The experimental results showed that methods using tokens as seeds score higher than Alvi's algorithm at all 4 levels, both on artificial and on simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaked, and paraphrased passages; however, Alvi's recognition rate on summarized passages is insignificantly higher than our system's. The same tendency was demonstrated in the third experiment scenario, where only the precision rates of Alvi's algorithm at the character and paragraph levels are higher than our system's. The higher Plagdet scores produced by some methods in our system compared to Alvi's show that this study has fulfilled its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts.
When run on our test document corpus, Alvi's highest scores of recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when it was tested on the PAN'14 corpus. Thus, this study has contributed a standard evaluation corpus for assessing PDS on Indonesian documents. Besides, this study contributes a source retrieval algorithm which introduces phrasewords as document features, and a paragraph-based text alignment algorithm which relies on two different strategies. One of them is to apply the local word weighting used in the text summarization field to select seeds both for discriminating paragraph pair candidates and for the matching process. The proposed detection algorithm results in almost no multiple detections, which contributes to the strength of this algorithm.
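The seed selection step, pairing up paragraphs by locally-weighted significant terms before finer alignment, might be sketched as follows. The abstract does not give the weighting formula, so the local-to-global frequency ratio below is an assumption:

```python
from collections import Counter

def seed_terms(paragraph, global_counts, top_k=5):
    """Rank terms by local frequency relative to corpus frequency, so
    terms characteristic of this paragraph become its seeds."""
    local = Counter(paragraph.lower().split())
    scored = {t: c / global_counts[t] for t, c in local.items()}
    return {t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]}

def candidate_pairs(src_pars, susp_pars, min_shared=2):
    """First step of a two-step paragraph-based comparison: keep only
    paragraph pairs sharing enough seed terms for detailed matching."""
    words = Counter(w for p in src_pars + susp_pars for w in p.lower().split())
    pairs = []
    for i, sp in enumerate(src_pars):
        seeds = seed_terms(sp, words)
        for j, tp in enumerate(susp_pars):
            if len(seeds & seed_terms(tp, words)) >= min_shared:
                pairs.append((i, j))
    return pairs
```

Only the surviving pairs would then proceed to the second, finer-grained matching step, which keeps the alignment stage tractable.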
A text uniqueness checking system for Armenian language
The goal of this dissertation is to develop a tool to analyze the similarity of Armenian
texts. The idea is to compare two texts or to compare a text with a set of texts and
detect the possibility of plagiarism. This system will be used in academic contexts but
can also be useful in other situations. In the academic context it is very important to
evaluate the uniqueness of reports, scientific papers and other documents that are
disseminated on the web every day. There are already several tools with this purpose but not for
Armenian texts.
With the development of information technologies, cases of plagiarism have also increased. Considering that a number of plagiarism-checking systems exist, but no system is designed for analysing the uniqueness of Armenian texts, the task was set to develop a system that provides uniqueness analysis of texts in information systems and also allows comparing texts and detecting the presence of plagiarism. The aim of the work is the application of uniqueness-checking systems in the educational process, since it is very important to assess the degree of uniqueness of dissertations, papers, coursework and other texts. This project will make it possible to develop and substantiate the computer analysis of the uniqueness of Armenian texts and to prevent plagiarism in Armenian.