1,335 research outputs found

    Plagiarism detection system for Armenian language

    In the academic context, it is very important to evaluate the uniqueness of reports, scientific papers and other documents that are disseminated on the web every day. Several tools already serve this purpose, but none handles Armenian texts. This paper presents a system for analyzing the similarity of Armenian documents. The idea is to collect a set of documents from the same domain in order to identify keywords. Based on that information, the system receives two documents and compares them, calculating the probability of plagiarism. To do so, it implements an approach based on several levels of analysis, and some of those steps allow user interaction, such as choosing options or adding more information.

    WASTK: A Weighted Abstract Syntax Tree Kernel Method for Source Code Plagiarism Detection

    TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion

    Several computing courses allow students to choose which programming language to use for completing a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that can accurately compare code files written in various programming languages while requiring limited effort to accommodate each language at the development stage. The only language-dependent component of the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting in which rare matches are prioritised. Our evaluation shows that the technique outperforms techniques commonly used in academia at handling language conversion disguises, and that it is comparable to those techniques when dealing with conventional disguises.
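The idea of prioritising rare matches can be illustrated with a small sketch. This is not the paper's implementation: it assumes each file has already been reduced by a language-specific tokeniser to a list of language-neutral token categories (the names `KW`, `ID`, `LOOP` and so on are hypothetical), and it uses a plain TF-IDF cosine over token n-grams as a stand-in for the paper's weighting.

```python
import math
from collections import Counter


def ngrams(tokens, n=3):
    """Return all contiguous token n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(docs, n=3):
    """Weight each n-gram by log(N / df): n-grams found in every file get 0."""
    df = Counter()
    for tokens in docs:
        df.update(set(ngrams(tokens, n)))
    total = len(docs)
    return {gram: math.log(total / count) for gram, count in df.items()}


def weighted_similarity(a, b, idf, n=3):
    """Cosine similarity between the TF-IDF weighted n-gram vectors of a and b."""
    va = {g: c * idf.get(g, 0.0) for g, c in Counter(ngrams(a, n)).items()}
    vb = {g: c * idf.get(g, 0.0) for g, c in Counter(ngrams(b, n)).items()}
    dot = sum(w * vb.get(g, 0.0) for g, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With this weighting, two files that share only n-grams common to the whole submission set (for example, boilerplate from the assignment template) score zero, while files that share rarer n-grams score higher.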

    An approach to source-code plagiarism detection investigation using latent semantic analysis

    This thesis looks at three aspects of source-code plagiarism. The first is the creation of a definition of source-code plagiarism; the second is a description of the findings from investigating the Latent Semantic Analysis information retrieval algorithm for source-code similarity detection; and the third is the proposal and evaluation of a new algorithm that combines Latent Semantic Analysis with plagiarism detection tools. A recent review of the literature revealed that there is no commonly agreed definition of what constitutes source-code plagiarism in the context of student assignments. This thesis first analyses the findings from a survey carried out to gain insight into the perspectives of UK Higher Education academics who teach programming on computing courses. Based on the survey findings, a detailed definition of source-code plagiarism is proposed. Secondly, the thesis investigates the application of an information retrieval technique, Latent Semantic Analysis, to derive semantic information from source-code files. Various parameters drive the effectiveness of Latent Semantic Analysis. The thesis evaluates the performance of Latent Semantic Analysis under various parameter settings and its effectiveness in retrieving similar source-code files when those parameters are optimised. Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection tools is proposed, and a tool is created and evaluated. The proposed tool, PlaGate, is a hybrid model that integrates Latent Semantic Analysis with plagiarism detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility for investigating the importance of source-code fragments with regard to their contribution towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of suspicious files and source-code fragments.

    Software Plagiarism Detection Using N-grams

    Plagiarism is the act of copying without rightfully crediting the original source. The motivations behind plagiarism vary from completing academic courses to gaining economic advantage. Plagiarism exists in various domains where people want to take credit for something they have worked on, such as literature, art or software, all of which carry a notion of authorship. In this thesis we conduct a systematic literature review of source code plagiarism detection methods; propose, based on the literature, a new approach to detecting plagiarism that combines similarity detection and authorship identification; introduce our tokenization method for source code; and lastly evaluate the model using real-life data sets. The goal of our model is to point out possible plagiarism in a collection of documents, which in this thesis is a collection of source code files written by various authors. We apply our statistical methods to three datasets: (1) documents from the University of Helsinki's first programming course, (2) documents from the University of Helsinki's advanced programming course and (3) submissions to a source code re-use competition. The statistical methods in this thesis are inspired by the theory of search engines: they draw on data mining for detecting similarity between documents and on machine learning for classifying a document with its most likely author in authorship identification. Results show that our similarity detection model can be used successfully to retrieve documents for further plagiarism inspection, but false positives appear quickly even with a high threshold on the minimum allowed level of similarity between documents. We were unable to use the results of authorship identification in our study, as the accuracy of our machine learning model was not high enough to be used sensibly. This was possibly caused by the high similarity between documents, which is due to the restricted tasks and the course setting, which teaches a specific programming style over the duration of the course.
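The retrieval step described above, in which document pairs whose similarity exceeds a threshold are flagged for inspection, can be sketched as follows. The abstract does not specify the similarity measure, so this sketch assumes Jaccard similarity over sets of token n-grams; the function names and the default `threshold` are illustrative only.

```python
from itertools import combinations


def token_ngrams(tokens, n=4):
    """Return the set of contiguous token n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suspicious_pairs(submissions, n=4, threshold=0.8):
    """Return (name, name, score) for every pair at or above the threshold.

    `submissions` maps an author name to that author's token list.
    Pairs are sorted from most to least similar.
    """
    profiles = {name: token_ngrams(toks, n) for name, toks in submissions.items()}
    pairs = []
    for x, y in combinations(sorted(profiles), 2):
        score = jaccard(profiles[x], profiles[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return sorted(pairs, key=lambda p: -p[2])
```

In a course setting with restricted tasks, honest submissions can share many n-grams, which is one way the false positives reported in the abstract may arise even at a high threshold.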

    A Plagiarism Detection Algorithm based on Extended Winnowing

    Plagiarism is a common problem in academia and education. Mature commercial plagiarism detection systems offer comprehensive coverage and high accuracy, but their high detection costs make them unsuitable for real-time, lightweight application environments such as plagiarism detection for student assignments. This paper extends the functionality of the classic Winnowing plagiarism detection algorithm. The extended algorithm retains the location and length of the text in the original document while extracting the document's fingerprints, so that locating and marking plagiarized text fragments becomes much easier. Experimental results and several years of running in practice show that the extension has little effect on the algorithm's performance; a PC with a normal hardware configuration can meet the requirements of small and medium-sized applications. Building on the lightweight, efficient, reliable and flexible character of Winnowing, the extended algorithm further enhances its adaptability and extends its application areas.
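A minimal sketch of Winnowing that keeps position and length alongside each fingerprint may help make the idea concrete. It is an illustration under stated assumptions, not the paper's algorithm: it hashes character k-grams with CRC32 (an arbitrary choice), picks the rightmost minimum hash in each window of `w` hashes, and stores `(hash, start, length)` so that matches can be mapped back to text spans. The parameter defaults are hypothetical.

```python
import zlib


def winnow(text, k=5, w=4):
    """Return fingerprints as (hash, start, length) tuples.

    Each fingerprint is the rightmost minimal k-gram hash of one window of
    w consecutive k-gram hashes; a position is recorded only once.
    """
    hashes = [zlib.crc32(text[i:i + k].encode("utf-8"))
              for i in range(len(text) - k + 1)]
    if not hashes:
        return []
    fingerprints = []
    last_pos = -1
    # A text shorter than one full window is treated as a single window.
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        pos = start + len(window) - 1 - window[::-1].index(smallest)
        if pos != last_pos:
            fingerprints.append((smallest, pos, k))
            last_pos = pos
    return fingerprints


def shared_spans(doc_a, doc_b, k=5, w=4):
    """Return (start_in_a, start_in_b, length) for every shared fingerprint."""
    index_b = {}
    for h, pos, _length in winnow(doc_b, k, w):
        index_b.setdefault(h, []).append(pos)
    return [(pos_a, pos_b, length)
            for h, pos_a, length in winnow(doc_a, k, w)
            for pos_b in index_b.get(h, [])]
```

Because the positions are kept, a caller can highlight `doc_a[start:start + length]` directly instead of re-searching the text for each matching fingerprint. Any shared substring of at least `w + k - 1` characters is guaranteed to produce at least one shared fingerprint.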

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most studies deal with detecting duplicate or near-duplicate documents, do not address the problem of retrieving source documents, or tend to measure document similarity globally. The resulting systems are therefore incapable of pointing to the exact locations of "similar passage" pairs. Moreover, no public, standard corpus has been available for evaluating PDS on Indonesian texts. To address these weaknesses, this thesis develops a plagiarism detection system that executes various methods for the plagiarism detection stages in a workflow system. In the retrieval stage, a novel document feature called the phraseword is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, aims to detect and locate source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally weighted significant terms, in order to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers and through algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall rate of 1. In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated at four levels of measurement: character, passage, document, and case. The experimental results showed that methods using tokens as seeds score higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaken, and paraphrased passages. However, Alvi's recognition rate on summarized passages is insignificantly higher than that of our system. The third scenario showed the same tendency, except that the precision rates of Alvi's algorithm at the character and paragraph levels are higher than those of our system. The higher Plagdet scores produced by some methods in our system, compared with Alvi's scores, show that this study has fulfilled its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus. This study has thus contributed a standard evaluation corpus for assessing PDS for Indonesian documents. It also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies. One of them is to apply the local word weighting used in the text summarization field to select seeds, both for discriminating paragraph pair candidates and for the matching process. The proposed detection algorithm produces almost no multiple detections, which contributes to its strength.
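For readers unfamiliar with the Plagdet score used above: in the PAN evaluation framework it combines precision, recall and granularity, where granularity is the average number of detections reported per detected case and equals 1 when there are no multiple detections. The sketch below assumes the three inputs have already been computed; it does not reproduce the thesis's evaluation code, and the function name is illustrative.

```python
import math


def plagdet(precision, recall, granularity):
    """Plagdet = F1(precision, recall) / log2(1 + granularity).

    granularity is at least 1; a value of exactly 1 leaves F1 unchanged.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

This is why an algorithm with almost no multiple detections is rewarded: any granularity above 1 increases the denominator and lowers the score.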

    A text uniqueness checking system for Armenian language

    The goal of this dissertation is to develop a tool to analyze the similarity of Armenian texts. The idea is to compare two texts, or to compare a text with a set of texts, and detect the possibility of plagiarism. This system will be used in academic contexts but can also be useful in other situations. In the academic context it is very important to evaluate the uniqueness of reports, scientific papers and other documents that are disseminated on the web every day. Several tools already serve this purpose, but none handles Armenian texts. Along with the development of information technologies, cases of plagiarism have also increased. Considering that a number of plagiarism-checking systems exist but none is designed for analyzing the uniqueness of Armenian texts, the task was set to develop a system that provides uniqueness analysis of texts in information systems and also makes it possible to compare texts and detect the presence of plagiarism. The aim of the work is the application of uniqueness-checking systems in the educational process, since it is very important to assess the degree of uniqueness of dissertations, essays, term papers and other texts. This project will make it possible to develop and substantiate the computer analysis of the uniqueness of Armenian texts and to prevent plagiarism in Armenian.