1,012 research outputs found

The Influence of Text Pre-processing on Plagiarism Detection

Author: Ceska Z
Fox C
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2011

This paper explores the influence of text pre-processing techniques on plagiarism detection. We examine stop-word removal, lemmatization, number replacement, synonymy recognition, and word generalization. We also look into the influence of punctuation and word order within N-grams. All these techniques are evaluated according to their impact on F1-measure and speed of execution. Our experiments were performed on a Czech corpus of plagiarized documents about politics. At the end of this paper, we propose what we consider to be the best combination of text pre-processing techniques.

University of Essex Research Repository
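
The pre-processing steps named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tiny English stop-word list, the `<num>` placeholder, and the choice of Jaccard overlap as the similarity score are all assumptions made for illustration (the paper itself works on a Czech corpus and reports F1-measure). The tokenizer only keeps ASCII letters and digits, which is a further simplification.

```python
import re

# Hypothetical, tiny English stop-word list for illustration only;
# the paper's experiments used a Czech corpus and its own resources.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "to", "is", "are"}


def preprocess(text, remove_stop_words=True, replace_numbers=True):
    """Lowercase, drop punctuation, then optionally apply two listed techniques."""
    # Keep runs of ASCII letters or runs of digits; everything else is discarded.
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    if replace_numbers:
        # Number replacement: every numeral maps to one placeholder token.
        tokens = ["<num>" if t.isdigit() else t for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n=3, ignore_order=False):
    """Return the set of word N-grams; optionally ignore word order inside each one."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        grams.add(tuple(sorted(gram)) if ignore_order else tuple(gram))
    return grams


def jaccard(a, b):
    """Jaccard overlap of two N-gram sets (assumed metric, not taken from the paper)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A comparison of two documents would then call `jaccard(ngrams(preprocess(doc_a)), ngrams(preprocess(doc_b)))`, toggling the keyword arguments to see how each pre-processing step changes the score.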

Plagiarism Detection in arXiv

Author: Gehrke Johannes
Ginsparg Paul
Sorokina Daria
Warner Simeon
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger. Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200

arXiv.org e-Print Archive

CiteSeerX

eCommons@Cornell

Text-Based Plagiarism Detection System

Author: Hussain Hazliyana
Publication venue: Universiti Teknologi Petronas
Publication date: 01/12/2005

With increasing internet usage, students attempt to pass off digital documents as their own work without acknowledging the sources as references. As this phenomenon becomes very common among students, a system that can detect plagiarism is most welcome to overcome the problem. The system is able to map out the words from the body of text files and then compare the strings between the text files. The system is also able to compare lines in the text files. It is built on the concept of the Word Frequency Model, which counts the number of word occurrences in the text files.

UTPedia
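
The word-counting and line-comparison idea in the abstract above can be sketched as follows. This is an assumption-laden illustration, not the thesis's system: the function names are hypothetical, and cosine similarity over word counts is one common way to compare frequency profiles that the abstract does not itself specify.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in the text (case-insensitive)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two word-count profiles (assumed comparison metric)."""
    # Counter returns 0 for words missing from freq_b, so no KeyError occurs.
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_lines(text_a, text_b):
    """Return the non-empty lines that appear verbatim in both texts."""
    lines_a = {line.strip() for line in text_a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in text_b.splitlines() if line.strip()}
    return lines_a & lines_b
```

A high cosine score or a large set of shared lines would only flag a pair of files for manual review; any threshold would have to be chosen empirically.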

Distributed similarity and plagiarism search

Author: Pataki Máté
Publication venue: 'Webmed Limited'
Publication date: 01/06/2006

This paper describes the different approaches to plagiarism search and the methods used by the KOPI Online Plagiarism Search and Information Portal, and shows a distributed approach for building a plagiarism search system. This architecture adds scalability to the system by allowing an arbitrary number of identical components to be placed into it. To reduce network traffic and enable secure transfer of the documents between the portal and the document servers, a new method of communication is introduced.

SZTAKI Publication Repository

A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

Author: Ekhtiarzadeh Masoud
Parsa Saeed
Ramezani Mohammad
Roy Chanchal
Zakeri-Nasrabadi Morteza
Publication date: 28/06/2023

Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focus on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance. Comment: 49 pages, 10 figures, 6 tables

arXiv.org e-Print Archive

“Every Writer is Checked for Plagiarism”: Occluded Authorship in Academic Writing

Author: Bromwich William John
Publication venue: place:Boca Raton
Publication date: 01/01/2014

This paper takes as its starting point the insights provided by Bhatia (2004), Bhatia / Gotti (2006) and Hyland (2000, 2002, 2005) to investigate the generic features of academic writing in connection with “essay writing services”. These services appear to be playing an ever-expanding role not only in undergraduate but also in postgraduate writing, with serious implications for the quality of higher education and the authenticity of the qualifications awarded by universities. An admixture of far-reaching technological innovation, wide-ranging social changes associated with globalization, and the rapid expansion of higher education appears to have led to the expansion of this phenomenon in academic writing. The paper highlights the discordance between the definition of various forms of plagiarism in academic writing in institutional discourse, and the description of these practices by online “essay writing services” that attempt to present them as legitimate and desirable. An analysis of the generic norms of this occluded discourse community provides evidence that practices once on the margins of the academic world appear to be gaining ground and making increasingly strident claims to legitimacy. From a sociolinguistic perspective, reference is made to Daniel Patrick Moynihan’s 1993 essay on “Defining Deviancy Down”, in which he argues that as social pathologies become more common, they tend to be reclassified and no longer seen as a form of deviancy; this concept may also be applied to academic malpractice.
The paper also attempts to cast light on “secondary plagiarism”, in which the “essay writing services” that are paid to produce “original work” draw from an existing repertoire of material, thus infringing not only the norms laid down in the official academic discourse, but also the internal “code of conduct” that is part of this occluded genre.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia