1,012 research outputs found
The Influence of Text Pre-processing on Plagiarism Detection
This paper explores the influence of text pre-processing techniques on plagiarism detection. We examine stop-word removal, lemmatization, number replacement, synonymy recognition, and word generalization. We also look into the influence of punctuation and word order within N-grams. All these techniques are evaluated according to their impact on F1-measure and speed of execution. Our experiments were performed on a Czech corpus of plagiarized documents about politics. At the end of this paper, we propose what we consider to be the best combination of text pre-processing techniques.
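A minimal sketch of how such pre-processing choices can be switched on and off before comparing word N-grams. Everything here is an assumption for illustration, not the paper's method: the tiny English stop-word list, the function names, and the use of Jaccard overlap as the similarity score (the abstract does not name a measure, and the paper's corpus is Czech).

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "on", "to"}


def preprocess(text, remove_stop_words=True):
    """Lowercase, drop punctuation, and optionally drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n=3, ordered=True):
    """Return the set of word N-grams; with ordered=False, word order
    inside each N-gram is ignored by sorting its words."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        grams.add(gram if ordered else tuple(sorted(gram)))
    return grams


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, n=3, remove_stop_words=True, ordered=True):
    """Score two documents by the overlap of their word N-gram sets."""
    grams_a = ngrams(preprocess(doc_a, remove_stop_words), n, ordered)
    grams_b = ngrams(preprocess(doc_b, remove_stop_words), n, ordered)
    return jaccard(grams_a, grams_b)
```

Calling `similarity` on the same document pair with different flag settings gives a rough feel for how much each pre-processing step changes the score.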
Plagiarism Detection in arXiv
We describe a large-scale application of methods for finding plagiarism in
research document collections. The methods are applied to a collection of
284,834 documents collected by arXiv.org over a 14-year period, covering a few
different research disciplines. The methodology efficiently detects a variety
of problematic author behaviors, and heuristics are developed to reduce the
number of false positives. The methods are also efficient enough to implement
as a real-time submission screen for a collection many times larger.
Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200
Text-Based Plagiarism Detection System
Due to the increase in internet usage, students attempt to pass off digital
documents as their own work without acknowledging the sources as references. As
this phenomenon has become very common among students, a system that can detect
plagiarism is a welcome way to address the problem. The system maps out
the words in the body of the text files and then compares the strings between the
files. In addition, the system can compare lines in the text files. The system is
developed with reference to the Word Frequency Model, which counts the
number of word occurrences in the text files.
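The idea of counting word occurrences and comparing files can be sketched roughly as follows. This is an illustrative assumption, not the described system: the function names are hypothetical, and cosine similarity is swapped in as the comparison measure because the abstract does not say how the frequency counts are compared.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in the text (case-insensitive)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two word-frequency tables (assumed measure)."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_lines(text_a, text_b):
    """Return the non-empty lines that appear verbatim in both texts."""
    lines_a = {line.strip() for line in text_a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in text_b.splitlines() if line.strip()}
    return lines_a & lines_b
```

A score near 1 from `cosine_similarity` suggests very similar vocabulary use, while `shared_lines` surfaces lines copied verbatim.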
Distributed similarity and plagiarism search
This paper describes the different approaches to plagiarism search and the methods used by the KOPI Online Plagiarism Search and Information Portal, and shows a distributed approach for building a plagiarism search system. This architecture adds scalability to the system by allowing an arbitrary number of identical components to be placed into it. To reduce network traffic and enable secure transfer of the documents between the portal and the document servers, a new method of communication is introduced.
A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges
Measuring and evaluating source code similarity is a fundamental software
engineering activity that embraces a broad range of applications, including but
not limited to code recommendation, duplicate code, plagiarism, malware, and
smell detection. This paper presents a systematic literature review and
meta-analysis on code similarity measurement and evaluation techniques to shed
light on the existing approaches and their characteristics in different
applications. We initially found over 10,000 articles by querying four digital
libraries and ended up with 136 primary studies in the field. The studies were
classified according to their methodology, programming languages, datasets,
tools, and applications. A deep investigation reveals 80 software tools,
working with eight different techniques on five application domains. Nearly 49%
of the tools work on Java programs and 37% support C and C++, while there is no
support for many programming languages. A noteworthy point was the existence of
12 datasets related to source code similarity measurement and duplicate codes,
of which only eight were publicly accessible. The lack of reliable
datasets, empirical evaluations, hybrid methods, and focus on multi-paradigm
languages are the main challenges in the field. Emerging applications of code
similarity measurement concentrate on the development phase in addition to the
maintenance phase.
Comment: 49 pages, 10 figures, 6 tables
“Every Writer is Checked for Plagiarism”: Occluded Authorship in Academic Writing
This paper takes as its starting point the insights provided by Bhatia (2004), Bhatia / Gotti (2006) and Hyland (2000, 2002, 2005) to investigate the generic features of academic writing in connection with “essay writing services”. These services appear to be playing an ever-expanding role not only in undergraduate but also in postgraduate writing, with serious implications for the quality of higher education and the authenticity of the qualifications awarded by universities. An admixture of far-reaching technological innovation, wide-ranging social changes associated with globalization, and the rapid expansion of higher education appears to have led to the expansion of this phenomenon in academic writing. The paper highlights the discordance between the definition of various forms of plagiarism in academic writing in institutional discourse, and the description of these practices by online “essay writing services” that attempt to present them as legitimate and desirable. An analysis of the generic norms of this occluded discourse community provides evidence that practices once on the margins of the academic world appear to be gaining ground and making increasingly strident claims to legitimacy. From a sociolinguistic perspective, reference is made to Daniel Patrick Moynihan’s 1993 essay on “Defining Deviancy Down”, in which he argues that as social pathologies become more common, they tend to be reclassified and no longer seen as a form of deviancy; this concept may also be applied to academic malpractice. The paper also attempts to cast light on “secondary plagiarism”, in which the “essay writing services” that are paid to produce “original work” draw from an existing repertoire of material, thus infringing not only the norms laid down in the official academic discourse, but also the internal “code of conduct” that is part of this occluded genre.