
2,400 research outputs found

Plagiarism Detection in arXiv

Author: Gehrke Johannes
Ginsparg Paul
Sorokina Daria
Warner Simeon
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.

Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200

arXiv.org e-Print Archive

CiteSeerX

eCommons@Cornell

Semantically-informed distance and similarity measures for paraphrase plagiarism identification

Author: Álvarez Carmona M.
Franco-Salvador M.
Villatoro-Tello E.
Montes Gomez M.
Rosso P.
Villaseñor Pineda L.
Publication venue: 'IOS Press'
Publication date: 24/05/2018

Paraphrase plagiarism identification is a very complex task, given that plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures can extract semantic information from either an external resource or a distributed representation of words, yielding informative features for training a supervised classifier that detects paraphrase plagiarism. The results indicate that the proposed metrics are consistently good at detecting different types of paraphrase plagiarism. In addition, the results are very competitive with state-of-the-art methods, with the advantage of being a much simpler but equally effective solution.

This work was partially supported by CONACYT under scholarship 401887, project grants 257383, 258588 and 2016-01-2410 and under the Thematic Networks program (Language Technologies Thematic Network project 281795). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (Prometeo II/2014/030).

Álvarez Carmona, M.; Franco-Salvador, M.; Villatoro-Tello, E.; Montes Gomez, M.; Rosso, P.; Villaseñor Pineda, L. (2018). Semantically-informed distance and similarity measures for paraphrase plagiarism identification. Journal of Intelligent & Fuzzy Systems. 34(5):2983-2990. https://doi.org/10.3233/JIFS-169483
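
To illustrate the general idea behind a semantically-informed edit distance mentioned in the abstract above, here is a minimal sketch. It is not the authors' formulation: it assumes a word-level Levenshtein distance in which substituting one word for another costs 1 minus their similarity, and the names `semantic_edit_distance`, `toy_similarity`, and `TOY_SYNONYMS` are hypothetical. In practice the similarity function would come from an external resource or from word embeddings; here a tiny hand-made table stands in for it.

```python
from typing import Callable, Sequence

def semantic_edit_distance(
    a: Sequence[str],
    b: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """Word-level edit distance with similarity-weighted substitutions.

    Insertions and deletions cost 1. Substituting x with y costs
    1 - sim(x, y), where sim is assumed to return a value in [0, 1].
    """
    m, n = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = [float(j) for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [float(i)] + [0.0] * n
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (1.0 - sim(a[i - 1], b[j - 1]))
            delete = prev[j] + 1.0
            insert = curr[j - 1] + 1.0
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[n]

# Hypothetical stand-in for a lexical resource or embedding similarity.
TOY_SYNONYMS = {frozenset(("car", "automobile")): 0.9}

def toy_similarity(x: str, y: str) -> float:
    """Return 1.0 for identical words, a table value for known pairs, else 0.0."""
    if x == y:
        return 1.0
    return TOY_SYNONYMS.get(frozenset((x, y)), 0.0)

if __name__ == "__main__":
    original = "the car stopped".split()
    reworded = "the automobile stopped".split()
    print(semantic_edit_distance(original, reworded, toy_similarity))
```

With a similarity function that returns only 0 or 1, this reduces to the ordinary Levenshtein distance over words, so a reworded sentence that swaps in near-synonyms scores as much closer to its source than plain string matching would suggest.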

arXiv.org e-Print Archive

Crossref

RiuNet

How Large Language Models are Transforming Machine-Paraphrased Plagiarism

Author: Gipp Bela
Kirstein Frederic
Ruas Terry
Wahle Jan Philip
Publication date: 10/11/2022

The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases

arXiv.org e-Print Archive

Automatic generation of benchmarks for plagiarism detection tools using grammatical evolution

Author: Alfonseca Manuel
Cebrián Ramos Manuel
Ortega Alfonso
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2007

This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in {Source Publication}, http://dx.doi.org/10.1145/1276958.1277388

An extended version of this poster is available at arXiv. See: http://arxiv.org/abs/cs/0703134v4

Student plagiarism is a major problem in universities worldwide. In this paper, we focus on plagiarism in answers to computer programming assignments, where students mix and/or modify one or more original solutions to obtain counterfeits. Although several software tools have been developed to help with the tedious and time-consuming task of detecting plagiarism, little has been done to assess their quality, because determining the real authorship of the whole submission corpus is practically impossible for graders. In this article we present a Grammatical Evolution technique which generates benchmarks for testing plagiarism detection tools. Given a programming language, our technique generates a set of original solutions to an assignment, together with a set of plagiarisms of the former set which mimic the basic plagiarism techniques performed by students. The authorship of the submission corpus is predefined by the user, providing a base for the assessment and further comparison of copy-catching tools. We give empirical evidence of the suitability of our approach by studying the behavior of one state-of-the-art detection tool (AC) on four benchmarks coded in APL2, generated with our technique.

Work supported by grant TSI2005-08255-C07-06 of the Spanish Ministry of Education and Science

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science

Author: Bakshi Nikhil Angad
Goyal Pawan
Gupta Divyansh
Mukherjee Animesh
Niranjan Abhishek
Singh Mayank
Publication date: 06/05/2017

Our current knowledge of scholarly plagiarism is largely based on the similarity between full-text research articles. In this paper, we propose a novel conceptualization of scholarly plagiarism in the form of reuse of explicit citation sentences in scientific research articles. Note that while full-text plagiarism is an indicator of a gross-level behavior, copying of citation sentences is a more nuanced micro-scale phenomenon observed even for well-known researchers. The current work poses several interesting questions and attempts to answer them by empirically investigating a large bibliographic text dataset from computer science containing millions of lines of citation sentences. In particular, we report evidence of massive copying behavior. We also present several striking real examples throughout the paper to showcase the widespread adoption of this undesirable practice. Contrary to popular perception, we find that the tendency to copy increases as an author matures. The copying behavior is reported to exist in all fields of computer science; however, the theoretical fields show more copying than the applied fields

arXiv.org e-Print Archive

Crossref