Semantically-informed distance and similarity measures for paraphrase plagiarism identification

Abstract

Paraphrase plagiarism identification is a complex task because plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures can extract semantic information from either an external resource or a distributed representation of words, yielding informative features for training a supervised classifier that detects paraphrase plagiarism. The results indicate that the proposed measures perform consistently well in detecting different types of paraphrase plagiarism. In addition, the results are competitive with state-of-the-art methods, with the advantage that the proposed approach is a much simpler but equally effective solution.

This work was partially supported by CONACYT under scholarship 401887, project grants 257383, 258588 and 2016-01-2410, and under the Thematic Networks program (Language Technologies Thematic Network project 281795). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (Prometeo II/2014/030).

Álvarez Carmona, M.; Franco-Salvador, M.; Villatoro-Tello, E.; Montes Gomez, M.; Rosso, P.; Villaseñor Pineda, L. (2018). Semantically-informed distance and similarity measures for paraphrase plagiarism identification. Journal of Intelligent & Fuzzy Systems. 34(5):2983-2990. https://doi.org/10.3233/JIFS-169483
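The abstract does not give the formulation of the semantically-informed edit distance, so the following is only an illustrative sketch of the general idea, not the authors' method: a word-level Levenshtein distance in which the substitution cost shrinks as two words become more semantically similar. The function names and the toy similarity table are hypothetical; in practice the similarity could come from an external resource such as WordNet or from word embeddings.

```python
from typing import Callable, Sequence


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical word similarity in [0, 1].

    Stands in for a score from an external resource (e.g. WordNet)
    or from a distributed representation of words.
    """
    if a == b:
        return 1.0
    synonyms = {
        frozenset(("car", "automobile")),
        frozenset(("buy", "purchase")),
    }
    return 0.9 if frozenset((a, b)) in synonyms else 0.0


def semantic_edit_distance(
    source: Sequence[str],
    target: Sequence[str],
    sim: Callable[[str, str], float] = toy_similarity,
) -> float:
    """Word-level edit distance with similarity-weighted substitutions.

    Assumption of this sketch: insertions and deletions cost 1,
    and substituting word a by word b costs 1 - sim(a, b).
    """
    # prev[j] is the distance between the processed prefix of source
    # and the first j words of target.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, src_word in enumerate(source, start=1):
        curr = [float(i)]
        for j, tgt_word in enumerate(target, start=1):
            substitute = prev[j - 1] + (1.0 - sim(src_word, tgt_word))
            delete = prev[j] + 1.0
            insert = curr[j - 1] + 1.0
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    original = "i buy a car".split()
    paraphrase = "i purchase an automobile".split()
    print(semantic_edit_distance(original, paraphrase))
```

With the toy table, the paraphrased pair above has a distance of about 1.2, whereas a plain word-level edit distance would count three full substitutions. A score of this kind could then serve as one feature for a supervised classifier, as the abstract describes.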
