Search CORE

1,653 research outputs found

Plagiarism Detection in arXiv

Author: Gehrke Johannes
Ginsparg Paul
Sorokina Daria
Warner Simeon
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger. Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200

arXiv.org e-Print Archive

CiteSeerX

eCommons@Cornell

Efficient Clustering-based Plagiarism Detection using IPPDC

Author: Ohmann Anthony
Publication venue: DigitalCommons@CSB/SJU
Publication date: 01/01/2013

The volume of source code available on the Internet is astronomical. When seeking to detect cases of plagiarism, one must maintain a large database of known documents. This can lead to unacceptably slow runtimes for systems designed to detect cases of source code plagiarism. We seek to use partitional and density-based clustering as well as intelligent parallelism to improve VOCS, a plagiarism detection system. In addition, we will attempt to increase the system’s usability and usefulness by expanding its programming language support and building an intuitive web interface. Finally, we propose utilizing Program Dependence Graphs to construct a hybrid approach in order to more accurately and precisely detect well-disguised plagiarism

College of Saint Benedict and Saint John’s University: DigitalCommons@CSB/SJU

On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

Author: Barrón Cedeño Luis Alberto
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 08/06/2012

Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

RiuNet

Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase

Author: Gharavi Erfaneh
Rosso Paolo
Veisi Hadi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/07/2020

The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to different types of obfuscation. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are considered as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, sets a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their scale of obfuscation. With the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian and Arabic on the text alignment task, to evaluate the robustness of the proposed methods from the language perspective as well.
As our experimental results confirm, our efficient approach can achieve considerable performance on the different datasets in various languages. Our online threshold tuning approach, without any training datasets, works as well as, or in some cases even better than, the training-based method.
The work of Paolo Rosso was partially funded by the Spanish MICINN under the research Project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).
Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y
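
The seed-finding idea in this abstract (aggregate word vectors into a sentence representation, pair each suspicious sentence with its most similar source sentence, then filter with Jaccard similarity) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say which aggregation function is used, so a plain mean is swapped in; the toy vectors, the function names, and the fixed thresholds `min_cosine` and `min_jaccard` are hypothetical and do not reproduce the paper's offline or online tuning.

```python
import math

# Toy 4-dimensional word vectors, invented for illustration only;
# a real system would load vectors from a trained embedding model.
TOY_VECTORS = {
    "cats": [1.0, 0.0, 0.0, 0.0],
    "chase": [0.0, 1.0, 0.0, 0.0],
    "pursue": [0.1, 0.9, 0.0, 0.0],
    "mice": [0.0, 0.0, 1.0, 0.0],
    "stocks": [0.0, 0.0, 0.0, 1.0],
}


def embed(sentence, vectors):
    """Average the vectors of known words (assumed aggregation: the mean)."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None  # no known words, so no representation
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(s1, s2):
    """Jaccard similarity between the word sets of two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def find_seeds(suspicious, source, vectors, min_cosine=0.9, min_jaccard=0.2):
    """Return (suspicious_index, source_index) pairs that pass both filters.

    The thresholds are hypothetical fixed values, not tuned parameters.
    """
    seeds = []
    for i, sus in enumerate(suspicious):
        sus_vec = embed(sus, vectors)
        if sus_vec is None:
            continue
        best = None  # (source_index, cosine_score) of the closest source sentence
        for j, src in enumerate(source):
            src_vec = embed(src, vectors)
            if src_vec is None:
                continue
            score = cosine(sus_vec, src_vec)
            if best is None or score > best[1]:
                best = (j, score)
        if (best is not None and best[1] >= min_cosine
                and jaccard(sus, source[best[0]]) >= min_jaccard):
            seeds.append((i, best[0]))
    return seeds


suspicious_sentences = ["cats pursue mice", "stocks chase"]
source_sentences = ["stocks rose", "cats chase mice"]
seeds = find_seeds(suspicious_sentences, source_sentences, TOY_VECTORS)
print(seeds)
```

In this toy run, only the paraphrased sentence pair survives both filters; the second suspicious sentence has a best match below the cosine threshold and is dropped. A fuller system would also merge adjacent seeds into passages, which this sketch leaves out.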

RiuNet

Issues Related to the Detection of Source Code Plagiarism in Students Assignments

Author: AlHami I.
Alsmadi Izzat M.
Kazakzeh S.
Publication venue: Digital Commons @ Texas A&M University-San Antonio
Publication date: 01/01/2014

Detecting similarity or plagiarism in academic research publications, source code, etc. has long been a complex and time-consuming task. Several algorithms, tools and websites exist that try to find plagiarism or possible plagiarism in those human creative products. In this paper we used source code plagiarism detection tools to assess the level of plagiarism in source code. We also investigated issues related to accuracy and challenges in detecting possible plagiarism in students' assignments. In a second study, we evaluated some tools for detecting possible plagiarism in research papers. Results showed that such a decision is not a binary one and that subjectivity is high. In addition, there is a need to tune plagiarism detection tools so that their users can assign criticality or weights to categorize and classify different levels of seriousness of plagiarism

Digital Commons @ Texas A&M University-San Antonio