2,959 research outputs found

    Plagiarism meets paraphrasing: insights for the new generation in automatic plagiarism detection

    Get PDF
    Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyse the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource which uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analysed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarising, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analysed, providing critical insights for the improvement of automatic plagiarism detection systems

    Revisiting the challenges and surveys in text similarity matching and detection methods

    Get PDF
    The massive amount of information from the internet has revolutionized the field of natural language processing. One of the challenges was estimating the similarity between texts. This has been an open research problem although various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods of text similarity research. This paper identified four issues and several applications of text similarity matching. It classified current studies based on intrinsic, extrinsic, and hybrid approaches. Then, we identified the methods and classified them into lexical-similarity, syntactic-similarity, semantic-similarity, structural-similarity, and hybrid. Furthermore, this study also analyzed and discussed method improvement, current limitations, and open challenges on this topic for future research directions

    An improved extrinsic monolingual plagiarism detection approach of the Bengali text

    Get PDF
    Plagiarism is an act of literature fraud, which is presenting others’ work or ideas without giving credit to the original work. All published and unpublished written documents are under the cover of this definition. Plagiarism, which increased significantly over the last few years, is a concerning issue for students, academicians, and professionals. Due to this, there are several plagiarism detection tools or software available to detect plagiarism in different languages. Unfortunately, negligible work has been done and no plagiarism detection software available in the Bengali language where Bengali is one of the most spoken languages in the world. In this paper, we have proposed a plagiarism detection tool for the Bengali language that mainly focuses on the educational and newspaper domain. We have collected 82 textbooks from the National Curriculum of Textbooks (NCTB), Bangladesh, scrapped all articles from 12 reputed newspapers and compiled our corpus with more than 10 million sentences. The proposed method on Bengali text corpus shows an accuracy rate of 97.31

    TAKSONOMIJA METODA AKADEMSKOG PLAGIRANJA

    Get PDF
    The article gives an overview of the plagiarism domain, with focus on academic plagiarism. The article defines plagiarism, explains the origin of the term, as well as plagiarism related terms. It identifies the extent of the plagiarism domain and then focuses on the plagiarism subdomain of text documents, for which it gives an overview of current classifications and taxonomies and then proposes a more comprehensive classification according to several criteria: their origin and purpose, technical implementation, consequence, complexity of detection and according to the number of linguistic sources. The article suggests the new classification of academic plagiarism, describes sorts and methods of plagiarism, types and categories, approaches and phases of plagiarism detection, the classification of methods and algorithms for plagiarism detection. The title of the article explicitly targets the academic community, but it is sufficiently general and interdisciplinary, so it can be useful for many other professionals like software developers, linguists and librarians.Rad daje pregled domene plagiranja tekstnih dokumenata. Opisuje porijeklo pojma plagijata, daje prikaz definicija te objašnjava plagijatu srodne pojmove. Ukazuje na širinu domene plagiranja, a za tekstne dokumenate daje pregled dosadašnjih taksonomija i predlaže sveobuhvatniju taksonomiju prema više kriterija: porijeklu i namjeni, tehničkoj provedbi plagiranja, posljedicama plagiranja, složenosti otkrivanja i (više)jezičnom porijeklu. Rad predlaže novu klasifikaciju akademskog plagiranja, prikazuje vrste i metode plagiranja, tipove i kategorije plagijata, pristupe i faze otkrivanja plagiranja. Potom opisuje klasifikaciju metoda i algoritama otkrivanja plagijata. Iako cilja na akademskog čitatelja, može biti od koristi u interdisciplinarnim područjima te razvijateljima softvera, lingvistima i knjižničarima

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012Palanci

    A Machine Learning Approach for Plagiarism Detection

    Get PDF
    Plagiarism detection is gaining increasing importance due to requirements for integrity in education. The existing research has investigated the problem of plagrarim detection with a varying degree of success. The literature revealed that there are two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis has developed two novel approaches to address both of these methods. Firstly a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in the stylometric features (most common words) in order to characterise the document authorship as described in chapter 4. The results revealed that LSA based stylometry has outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure in order to predict which author has written a particular book being tested. The proposed method has successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method has relied on the use of the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The Intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship. The model’s generation procedure focuses on just one author as an attempt to summarise aspects of an author’s style in a definitive and clear-cut manner. The thesis has also proposed a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) training dataset, but divides that dataset up into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out-cross-validation method. Results indicated that by integrating deep analysis (LSA) and Stylometric analysis, hidden changes can be identified whether or not a reference collection exists

    A computational academic integrity framework

    Get PDF
    L'abast creixent i la naturalesa canviant dels programes acadèmics constitueixen un repte per a la integritat dels protocols tradicionals de proves i exàmens. L'objectiu d¿aquesta tesi és introduir una alternativa als enfocaments tradicionals d'integritat acadèmica, per a cobrir la bretxa del buit de l'anonimat i donar la possibilitat als instructors i administradors acadèmics de fer servir nous mitjans que permetin mantenir la integritat acadèmica i promoguin la responsabilitat, accessibilitat i eficiència, a més de preservar la privadesa i minimitzin la interrupció en el procés d'aprenentatge. Aquest treball té com a objectiu començar un canvi de paradigma en les pràctiques d'integritat acadèmica. La recerca en l'àrea de la identitat de l'estudiant i la garantia de l'autoria són importants perquè la concessió de crèdits d'estudi a entitats no verificades és perjudicial per a la credibilitat institucional i la seguretat pública. Aquesta tesi es basa en la noció que la identitat de l'alumne es compon de dues capes diferents, física i de comportament, en les quals tant els criteris d'identitat com els d'autoria han de ser confirmats per a mantenir un nivell raonable d'integritat acadèmica. Per a això, aquesta tesi s'organitza en tres seccions, cadascuna de les quals aborda el problema des d'una de les perspectives següents: (a) teòrica, (b) empírica i (c) pragmàtica.El creciente alcance y la naturaleza cambiante de los programas académicos constituyen un reto para la integridad de los protocolos tradicionales de pruebas y exámenes. El objetivo de esta tesis es introducir una alternativa a los enfoques tradicionales de integridad académica, para cubrir la brecha del vacío anonimato y dar la posibilidad a los instructores y administradores académicos de usar nuevos medios que permitan mantener la integridad académica y promuevan la responsabilidad, accesibilidad y eficiencia, además de preservar la privacidad y minimizar la interrupción en el proceso de aprendizaje. Este trabajo tiene como objetivo iniciar un cambio de paradigma en las prácticas de integridad académica. La investigación en el área de la identidad del estudiante y la garantía de la autoría son importantes porque la concesión de créditos de estudio a entidades no verificadas es perjudicial para la credibilidad institucional y la seguridad pública. Esta tesis se basa en la noción de que la identidad del alumno se compone de dos capas distintas, física y de comportamiento, en las que tanto los criterios de identidad como los de autoría deben ser confirmados para mantener un nivel razonable de integridad académica. Para ello, esta tesis se organiza en tres secciones, cada una de las cuales aborda el problema desde una de las siguientes perspectivas: (a) teórica, (b) empírica y (c) pragmática.The growing scope and changing nature of academic programmes provide a challenge to the integrity of traditional testing and examination protocols. The aim of this thesis is to introduce an alternative to the traditional approaches to academic integrity, bridging the anonymity gap and empowering instructors and academic administrators with new ways of maintaining academic integrity that preserve privacy, minimize disruption to the learning process, and promote accountability, accessibility and efficiency. This work aims to initiate a paradigm shift in academic integrity practices. Research in the area of learner identity and authorship assurance is important because the award of course credits to unverified entities is detrimental to institutional credibility and public safety. This thesis builds upon the notion of learner identity consisting of two distinct layers (a physical layer and a behavioural layer), where the criteria of identity and authorship must both be confirmed to maintain a reasonable level of academic integrity. To pursue this goal in organized fashion, this thesis has the following three sections: (a) theoretical, (b) empirical, and (c) pragmatic

    A Computational Academic Integrity Framework

    Get PDF
    L'abast creixent i la naturalesa canviant dels programes acadèmics constitueixen un repte per a la integritat dels protocols tradicionals de proves i exàmens. L'objectiu d'aquesta tesi és introduir una alternativa als enfocaments tradicionals d'integritat acadèmica, per a cobrir la bretxa del buit de l'anonimat i donar la possibilitat als instructors i administradors acadèmics de fer servir nous mitjans que permetin mantenir la integritat acadèmica i promoguin la responsabilitat, accessibilitat i eficiència, a més de preservar la privadesa i minimitzin la interrupció en el procés d'aprenentatge. Aquest treball té com a objectiu començar un canvi de paradigma en les pràctiques d'integritat acadèmica. La recerca en l'àrea de la identitat de l'estudiant i la garantia de l'autoria són importants perquè la concessió de crèdits d'estudi a entitats no verificades és perjudicial per a la credibilitat institucional i la seguretat pública. Aquesta tesi es basa en la noció que la identitat de l'alumne es compon de dues capes diferents, física i de comportament, en les quals tant els criteris d'identitat com els d'autoria han de ser confirmats per a mantenir un nivell raonable d'integritat acadèmica. Per a això, aquesta tesi s'organitza en tres seccions, cadascuna de les quals aborda el problema des d'una de les perspectives següents: (a) teòrica, (b) empírica i (c) pragmàtica.El creciente alcance y la naturaleza cambiante de los programas académicos constituyen un reto para la integridad de los protocolos tradicionales de pruebas y exámenes. El objetivo de esta tesis es introducir una alternativa a los enfoques tradicionales de integridad académica, para cubrir la brecha del vacío anonimato y dar la posibilidad a los instructores y administradores académicos de usar nuevos medios que permitan mantener la integridad académica y promuevan la responsabilidad, accesibilidad y eficiencia, además de preservar la privacidad y minimizar la interrupción en el proceso de aprendizaje. Este trabajo tiene como objetivo iniciar un cambio de paradigma en las prácticas de integridad académica. La investigación en el área de la identidad del estudiante y la garantía de la autoría son importantes porque la concesión de créditos de estudio a entidades no verificadas es perjudicial para la credibilidad institucional y la seguridad pública. Esta tesis se basa en la noción de que la identidad del alumno se compone de dos capas distintas, física y de comportamiento, en las que tanto los criterios de identidad como los de autoría deben ser confirmados para mantener un nivel razonable de integridad académica. Para ello, esta tesis se organiza en tres secciones, cada una de las cuales aborda el problema desde una de las siguientes perspectivas: (a) teórica, (b) empírica y (c) pragmática.The growing scope and changing nature of academic programmes provide a challenge to the integrity of traditional testing and examination protocols. The aim of this thesis is to introduce an alternative to the traditional approaches to academic integrity, bridging the anonymity gap and empowering instructors and academic administrators with new ways of maintaining academic integrity that preserve privacy, minimize disruption to the learning process, and promote accountability, accessibility and efficiency. This work aims to initiate a paradigm shift in academic integrity practices. Research in the area of learner identity and authorship assurance is important because the award of course credits to unverified entities is detrimental to institutional credibility and public safety. This thesis builds upon the notion of learner identity consisting of two distinct layers (a physical layer and a behavioural layer), where the criteria of identity and authorship must both be confirmed to maintain a reasonable level of academic integrity. To pursue this goal in organized fashion, this thesis has the following three sections: (a) theoretical, (b) empirical, and (c) pragmatic
    corecore