6 research outputs found

    On the use of word embedding for cross language plagiarism detection

    Full text link
    [EN] Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs if a passage of text is translated from source language to target language and no proper citation is provided. Although various methods have been developed for detection of cross language plagiarism, less attention has been paid to measure and compare their performance, especially when tackling with different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection using word embedding methods and explore its performance against other state-of-the-art plagiarism detection algorithms. In order to evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as HAMTA-CL) comprised of seven types of obfuscation. The results show that the word embedding approach outperforms the other approaches with respect to recall when encountering heavily paraphrased passages. On the other hand, translation based approach performs well when the precision is the main consideration of the cross language plagiarism detection system.Asghari, H.; Fatemi, O.; Mohtaj, S.; Faili, H.; Rosso, P. (2019). On the use of word embedding for cross language plagiarism detection. Intelligent Data Analysis. 23(3):661-680. https://doi.org/10.3233/IDA-183985S661680233H. Asghari, K. Khoshnava, O. Fatemi and H. Faili, Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus: Notebook for {PAN} at {CLEF} 2015, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.A. Barrón-Cede no, M. Potthast, P. Rosso and B. Stein, Corpus and evaluation measures for automatic plagiarism detection, In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias, editors, Proceedings of the International Conference on Language Resources and Evaluation, {LREC} 2010, 17–23 May 2010, Valletta, Malta. European Language Resources Association, 2010.A. Barrón-Cede no, P. Rosso, D. Pinto and A. Juan, On cross-lingual plagiarism analysis using a statistical model, In B. Stein, E. Stamatatos and M. Koppel, editors, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of {CEUR} Workshop Proceedings. CEUR-WS.org, 2008.Farghaly, A., & Shaalan, K. (2009). Arabic Natural Language Processing. ACM Transactions on Asian Language Information Processing, 8(4), 1-22. doi:10.1145/1644879.1644881J. Ferrero, F. Agnès, L. Besacier and D. Schwab, A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection, In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk and S. Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation {LREC} 2016, Portorož, Slovenia, May 23–28, 2016, European Language Resources Association {(ELRA)}, 2016.Franco-Salvador, M., Gupta, P., Rosso, P., & Banchs, R. E. (2016). Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowledge-Based Systems, 111, 87-99. doi:10.1016/j.knosys.2016.08.004Franco-Salvador, M., Rosso, P., & Montes-y-Gómez, M. (2016). A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing & Management, 52(4), 550-570. doi:10.1016/j.ipm.2015.12.004C.K. Kent and N. Salim, Web based cross language plagiarism detection, CoRR, abs/0912.3, 2009.McNamee, P., & Mayfield, J. (2004). Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, 7(1/2), 73-97. doi:10.1023/b:inrt.0000009441.78971.beT. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, CoRR, abs/1301.3, 2013.S. Mohtaj, B. Roshanfekr, A. Zafarian and H. Asghari, Parsivar: A language processing toolkit for persian, In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis and T. Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7–12, 2018, European Language Resources Association ELRA, 2018.R.M.A. Nawab, M. Stevenson and P.D. Clough, University of Sheffield – Lab Report for {PAN} at {CLEF} 2010, In M. Braschler, D. Harman and E. Pianta, editors, {CLEF} 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2010.G. Oberreuter, G. L’Huillier, S.A. Rios and J.D. Velásquez, Approaches for intrinsic and external plagiarism detection – Notebook for {PAN} at {CLEF} 2011, In V. Petras, P. Forner and P.D. Clough, editors, {CLEF} 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2011.Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51-60. doi:10.1016/j.jalgor.2009.02.005M. Potthast, A. Barrón-Cede no, A. Eiselt, B. Stein and P. Rosso, Overview of the 2nd international competition on plagiarism detection, In M. Braschler, D. Harman and E. Pianta, editors, {CLEF} 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2010.Potthast, M., Barrón-Cedeño, A., Stein, B., & Rosso, P. (2010). Cross-language plagiarism detection. Language Resources and Evaluation, 45(1), 45-62. doi:10.1007/s10579-009-9114-zM. Potthast, A. Eiselt, A. Barrón-Cede no, B. Stein and P. Rosso, Overview of the 3rd international competition on plagiarism detection, In V. Petras, P. Forner and P.D. Clough, editors, {CLEF} 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of {CEUR} Workshop Proceedings. CEUR-WS.org, 2011.M. Potthast, S. Goering, P. Rosso and B. Stein, Towards data submissions for shared tasks: First experiences for the task of text alignment, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.Potthast, M., Stein, B., & Anderka, M. (s. f.). A Wikipedia-Based Multilingual Retrieval Model. Advances in Information Retrieval, 522-530. doi:10.1007/978-3-540-78646-7_51B. Pouliquen, R. Steinberger and C. Ignat, Automatic identification of document translations in large multilingual document collections, CoRR, abs/cs/060, 2006.B. Stein, E. Stamatatos and M. Koppel, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2008.J. Wieting, M. Bansal, K. Gimpel and K. Livescu, Towards universal paraphrastic sentence embeddings, CoRR, abs/1511.0, 2015.V. Zarrabi, J. Rafiei, K. Khoshnava, H. Asghari and S. Mohtaj, Evaluation of text reuse corpora for text alignment task of plagiarism detection, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.Barrón-Cedeño, A., Gupta, P., & Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems, 50, 211-217. doi:10.1016/j.knosys.2013.06.01

    Detecting translingual plagiarism and the backlash against translation plagiarists

    Get PDF
    Os métodos de detecção de plágio registaram melhorias significativas ao longo das últimas décadas e, decorrente da investigação avançada realizada por linguistas computacionais e, sobretudo, por linguistas forenses, é, agora, maisfácil identiVcar estratégias de reutilização de texto simples e soVsticadas. Especificamente, simples algoritmos de comparação de texto criados por linguistas computacionais permitem detectar fácil e (semi-)automaticamente plágio literal,ipsis verbis (i.e. que consiste na reutilização de trechos de texto idênticos em diferentes documentos) como é o caso do Turnitin ou o SafeAssign , embora o desempenho destes métodos tenha tendência a piorar quando a reutilizaçãoé disfarçada através da introdução de alterações ao texto original. Neste caso, são necessárias técnicas linguísticas mais soVsticadas, como a análise de sobreposição lexical (Johnson, 1997), para detectar a reutilização. Contudo, estastécnicas são de aplicação muito limitada em casos de plágio translingue, em que determinado texto é traduzido e reutilizado sem atribuição da autoria ao texto original, proveniente de outra língua. Considerando que (a) normalmente,a tradução amadora (e.g. tradução literal ou tradução automática gratuita) é ométodo utilizado para plagiar; (b) é comum os plagiadores fazerem alterações aotexto, nomeadamente gramaticais e sintácticas, sobretudo após a tradução automática;e (c) os elementos lexicais são aqueles que a tradução automática processamais correctamente, antes da sua reutilização no texto derivado, este artigopropõe um método de detecção de plágio translingue informado pelas teorias datradução e da interlíngua (Selinker, 1972; Bassnett and Lefevere, 1998), bem comopelo princípio de singularidade linguística (Coulthard, 2004). Recorrendo a dadosempíricos do corpus CorRUPT (Corpus of Reused and Plagiarised Texts),um corpus de textos académicos e não académicos reais, que foram investigadose acusados de plagiar textos originais noutras línguas, demonstra-se a utilidadeda metodologia proposta para a detecção de plágio translingue. Finalmente,discute-se possíveis aplicações deste método como ferramenta de investigação emcontextos forenses.Plagiarism detection methods have improved signiVcantly over thelast decades, and as a result of the advanced research conducted by computationaland mostly forensic linguists, simple and sophisticated textual borrowingstrategies can now be identiVed more easily. In particular, simple text comparisonalgorithms developed by computational linguists allow literal, word-for-wordplagiarism (i.e. where identical strings of text are reused across diUerent documents)to be easily detected (semi-)automatically (e.g. Turnitin or SafeAssign),although these methods tend to perform less well when the borrowing is obfuscatedby introducing edits to the original text. In this case, more sophisticatedlinguistic techniques, such as an analysis of lexical overlap (Johnson, 1997), arerequired to detect the borrowing. However, these have limited applicability incases of translingual plagiarism, where a text is translated and borrowed withoutacknowledgment from an original in another language. Considering that(a) traditionally non-professional translation (e.g. literal or free machine translation)is the method used to plagiarise; (b) the plagiarist usually edits the textfor grammar and syntax, especially when machine-translated; and (c) lexicalitems are those that tend to be translated more correctly, and carried over to thederivative text, this paper proposes a method for translingual plagiarism detectionthat is grounded on translation and interlanguage theories (Selinker, 1972;Bassnett and Lefevere, 1998), as well as on the principle of linguistic uniqueness(Coulthard, 2004). Empirical evidence from the CorRUPT corpus (Corpus ofReused and Plagiarised Texts), a corpus of real academic and non-academic textsthat were investigated and accused of plagiarising originals in other languages, isused to illustrate the applicability of the methodology proposed for translingualplagiarism detection. Finally, applications of the method as an investigative toolin forensic contexts are discussed

    English-Persian Plagiarism Detection based on a Semantic Approach

    Get PDF
    Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-lingual plagiarism. In cross-lingual translation, writers meld a translation with their own words and ideas. Based on monolingual plagiarism detection methods, this paper ultimately intends to find a way to detect cross-lingual plagiarism. A framework called Multi-Lingual Plagiarism Detection (MLPD) has been presented for cross-lingual plagiarism analysis with ultimate objective of detection of plagiarism cases. English is the reference language and Persian materials are back translated using translation tools. The data for assessment of MLPD were obtained from English-Persian Mizan parallel corpus. Apache’s Solr was also applied to record the creep of the documents and their indexation. The accuracy mean of the proposed method revealed to be 98.82% when employing highly accurate translation tools which indicate the high accuracy of the proposed method. Also, Google translation service showed the accuracy mean to be 56.9%. These tests demonstrate that improved translation tools enhance the accuracy of the proposed method

    iPlag: Intelligent Plagiarism Reasoner in scientific publications

    Full text link

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012Palanci

    Detecting plagiarism in the forensic linguistics turn

    Get PDF
    This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese Universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Data in the form of survey were obtained from two higher education institutions in Portugal, and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups Univariate Analysis of Variance (ANOVA), to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism, or, conversely, reject the punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources, and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive, and, in this case, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection, and identifies strategies that these systems fail to detect. Specifically, a method is proposed to translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues for the plagiarist’s intentionality. Furthermore, the findings show that translingual plagiarism can be detected by using the method proposed, and that plagiarism detection software can be improved using existing computer tools
    corecore