    On the use of word embedding for cross language plagiarism detection

    Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs if a passage of text is translated from a source language to a target language and no proper citation is provided. Although various methods have been developed for the detection of cross language plagiarism, less attention has been paid to measuring and comparing their performance, especially when tackling different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection using word embedding methods and explore its performance against other state-of-the-art plagiarism detection algorithms. In order to evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as HAMTA-CL) comprising seven types of obfuscation.

    Detecting translingual plagiarism and the backlash against translation plagiarists

    Plagiarism detection methods have improved significantly over the last decades, and as a result of the advanced research conducted by computational and mostly forensic linguists, simple and sophisticated textual borrowing strategies can now be identified more easily. In particular, simple text comparison algorithms developed by computational linguists allow literal, word-for-word plagiarism (i.e. where identical strings of text are reused across different documents) to be easily detected (semi-)automatically (e.g. Turnitin or SafeAssign), although these methods tend to perform less well when the borrowing is obfuscated by introducing edits to the original text. In this case, more sophisticated linguistic techniques, such as an analysis of lexical overlap (Johnson, 1997), are required to detect the borrowing. However, these have limited applicability in cases of translingual plagiarism, where a text is translated and borrowed without acknowledgment from an original in another language. Considering that (a) traditionally non-professional translation (e.g. literal or free machine translation) is the method used to plagiarise; (b) the plagiarist usually edits the text for grammar and syntax, especially when machine-translated; and (c) lexical items are those that tend to be translated more correctly, and carried over to the derivative text, this paper proposes a method for translingual plagiarism detection that is grounded on translation and interlanguage theories (Selinker, 1972; Bassnett and Lefevere, 1998), as well as on the principle of linguistic uniqueness (Coulthard, 2004). Empirical evidence from the CorRUPT corpus (Corpus of Reused and Plagiarised Texts), a corpus of real academic and non-academic texts that were investigated and accused of plagiarising originals in other languages, is used to illustrate the applicability of the methodology proposed for translingual plagiarism detection. Finally, applications of the method as an investigative tool in forensic contexts are discussed.
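
    The role of lexical overlap in the abstract above can be illustrated with a minimal sketch. It assumes the suspect text has already been translated into the language of the candidate source (the translation step is not shown), and it uses a tiny hypothetical stop-word list and Jaccard overlap as a stand-in measure; the abstract does not say which overlap measure the author actually applies.

```python
import re

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are",
              "that", "by", "on", "for"}


def content_words(text):
    """Lowercase the text and return its set of non-stop-word tokens.

    Only ASCII letters are matched, so this sketch suits English text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def lexical_overlap(translated_suspect, candidate_source):
    """Jaccard overlap between the content-word sets of two texts."""
    a = content_words(translated_suspect)
    b = content_words(candidate_source)
    if not a and not b:
        return 0.0  # no content words on either side
    return len(a & b) / len(a | b)
```

    A high score only flags a pair of texts for closer manual inspection; it is not evidence of plagiarism on its own.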

    English-Persian Plagiarism Detection based on a Semantic Approach

    Plagiarism, defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to the dissemination and publication of knowledge. Plagiarism falls into four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism, sometimes referred to as cross-lingual plagiarism. In cross-lingual translation, writers meld a translation with their own words and ideas. Building on monolingual plagiarism detection methods, this paper aims to find a way to detect cross-lingual plagiarism. A framework called Multi-Lingual Plagiarism Detection (MLPD) is presented for cross-lingual plagiarism analysis, with the ultimate objective of detecting plagiarism cases. English is the reference language, and Persian materials are back-translated using translation tools. The data for assessing MLPD were obtained from the English-Persian Mizan parallel corpus. Apache Solr was used to crawl and index the documents. The mean accuracy of the proposed method was 98.82% when highly accurate translation tools were employed, which indicates the high accuracy of the proposed method. With the Google translation service, the mean accuracy was 56.9%. These tests demonstrate that improved translation tools enhance the accuracy of the proposed method.
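
    The translate-then-compare idea behind the abstract above can be sketched as follows. This is not the MLPD implementation: the `translate` callable, the `detect` function, the 0.5 threshold, and the use of cosine similarity over raw term counts are all assumptions made for illustration, and an in-memory dictionary stands in for the Solr index.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a lowercased text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def detect(suspect_text, english_sources, translate, threshold=0.5):
    """Translate the suspect text into English, then rank matching sources.

    `translate` is a hypothetical callable (str -> str) wrapping whatever
    translation tool is available; `english_sources` maps document ids to
    English text. Returns (doc_id, score) pairs at or above the threshold,
    best match first.
    """
    vec = term_vector(translate(suspect_text))
    hits = []
    for doc_id, source in english_sources.items():
        score = cosine(vec, term_vector(source))
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

    Because every score is computed on the translated text, the quality of `translate` bounds what the comparison can find, which is consistent with the gap between the two accuracy figures reported above.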

    iPlag: Intelligent Plagiarism Reasoner in scientific publications

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    Detecting plagiarism in the forensic linguistics turn

    This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Survey data were obtained from two higher education institutions in Portugal and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups univariate Analysis of Variance (ANOVA) to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources, and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive, and, in this case, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection, and identifies strategies that these systems fail to detect. Specifically, a method is proposed to detect translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist’s intentionality. Furthermore, the findings show that translingual plagiarism can be detected by using the method proposed, and that plagiarism detection software can be improved using existing computer tools.