2,400 research outputs found

    Plagiarism Detection in arXiv

    We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger. Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200

    Semantically-informed distance and similarity measures for paraphrase plagiarism identification

    Paraphrase plagiarism identification is a very complex task, given that plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures can extract semantic information from either an external resource or a distributed representation of words, yielding informative features for training a supervised classifier that detects paraphrase plagiarism. The results indicate that the proposed metrics are consistently good at detecting different types of paraphrase plagiarism. In addition, the results are very competitive with state-of-the-art methods, with the advantage of being a much simpler but equally effective solution. This work was partially supported by CONACYT under scholarship 401887, project grants 257383, 258588 and 2016-01-2410 and under the Thematic Networks program (Language Technologies Thematic Network project 281795). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (Prometeo II/2014/030). Álvarez Carmona, M.; Franco-Salvador, M.; Villatoro-Tello, E.; Montes Gomez, M.; Rosso, P.; Villaseñor Pineda, L. (2018). Semantically-informed distance and similarity measures for paraphrase plagiarism identification. Journal of Intelligent & Fuzzy Systems. 34(5):2983-2990. https://doi.org/10.3233/JIFS-169483
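
    The abstract above does not spell out how the semantically-informed edit distance is defined. As a rough illustration only, and not the authors' formulation, one common way to make an edit distance semantics-aware is to work at the word level and charge less for substituting similar words. In the sketch below, the function names, the toy similarity table, and its values are all hypothetical; in practice the similarity could come from a lexical resource or from word embeddings.

```python
from typing import Callable, Sequence


def semantic_edit_distance(
    a: Sequence[str],
    b: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Word-level edit distance with substitution cost 1 - similarity(u, v).

    Assumption: similarity returns a value in [0, 1], with 1 for identical words.
    Insertions and deletions cost 1 each.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = cheapest way to turn the first i words of a into the first j words of b
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = float(i)  # delete every word of a
    for j in range(1, cols):
        dp[0][j] = float(j)  # insert every word of b
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 1.0 - similarity(a[i - 1], b[j - 1])
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,               # deletion
                dp[i][j - 1] + 1.0,               # insertion
                dp[i - 1][j - 1] + substitution,  # (semantic) substitution
            )
    return dp[-1][-1]


# Hypothetical similarity scores, for illustration only.
TOY_SIMILARITY = {frozenset(("car", "automobile")): 0.9}


def toy_similarity(u: str, v: str) -> float:
    """Return 1.0 for identical words, a table value if known, else 0.0."""
    if u == v:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((u, v)), 0.0)


if __name__ == "__main__":
    original = "the car stopped".split()
    paraphrase = "the automobile stopped".split()
    print(semantic_edit_distance(original, paraphrase, toy_similarity))
```

    With an exact-match similarity this reduces to the ordinary word-level Levenshtein distance, so a reworded sentence scores as closer to its source only when the similarity function recognizes the swapped words as related.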

    How Large Language Models are Transforming Machine-Paraphrased Plagiarism

    The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text that humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.

    Automatic generation of benchmarks for plagiarism detection tools using grammatical evolution

    This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in {Source Publication}, http://dx.doi.org/10.1145/1276958.1277388. An extended version of this poster is available at arXiv; see http://arxiv.org/abs/cs/0703134v4. Student plagiarism is a major problem in universities worldwide. In this paper, we focus on plagiarism in answers to computer programming assignments, where students mix and/or modify one or more original solutions to obtain counterfeits. Although several software tools have been developed to help with the tedious and time-consuming task of detecting plagiarism, little has been done to assess their quality, because determining the real authorship of the whole submission corpus is practically impossible for graders. In this article we present a Grammatical Evolution technique which generates benchmarks for testing plagiarism detection tools. Given a programming language, our technique generates a set of original solutions to an assignment, together with a set of plagiarisms of the former set which mimic the basic plagiarism techniques performed by students. The authorship of the submission corpus is predefined by the user, providing a base for the assessment and further comparison of copy-catching tools. We give empirical evidence of the suitability of our approach by studying the behavior of one state-of-the-art detection tool (AC) on four benchmarks coded in APL2, generated with our technique. Work supported by grant TSI2005-08255-C07-06 of the Spanish Ministry of Education and Science.

    Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science

    Our current knowledge of scholarly plagiarism is largely based on the similarity between full-text research articles. In this paper, we propose a novel conceptualization of scholarly plagiarism in the form of reuse of explicit citation sentences in scientific research articles. Note that while full-text plagiarism is an indicator of gross-level behavior, copying of citation sentences is a more nuanced micro-scale phenomenon observed even for well-known researchers. The current work poses several interesting questions and attempts to answer them by empirically investigating a large bibliographic text dataset from computer science containing millions of lines of citation sentences. In particular, we report evidence of massive copying behavior. We also present several striking real examples throughout the paper to showcase widespread adoption of this undesirable practice. Contrary to popular perception, we find that the copying tendency increases as an author matures. The copying behavior is reported to exist in all fields of computer science; however, the theoretical fields indicate more copying than the applied fields.