A scalable framework for cross-lingual authorship identification

Authors: Li Q, Nutanong S, Rakthanmanon T, Sarwar R
Publication venue: 'Elsevier BV'
Publication date: 07/07/2018

This is an accepted manuscript of an article published by Elsevier in Information Sciences on 10/07/2018, available online: https://doi.org/10.1016/j.ins.2018.07.009 The accepted version of the publication may differ from the final published version. © 2018 Elsevier Inc. Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low-resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments, where each fragment is further decomposed into fixed-size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.
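
The partitioning step mentioned in the abstract above (documents into fragments, fragments into fixed-size chunks) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: whitespace tokenization, the default sizes, and the function names `chunk_tokens` and `partition_document` are all hypothetical choices that do not come from the paper.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split tokens into consecutive chunks of exactly chunk_size tokens.

    A trailing remainder shorter than chunk_size is dropped (an assumption).
    """
    last_start = len(tokens) - chunk_size
    return [tokens[i:i + chunk_size] for i in range(0, last_start + 1, chunk_size)]


def partition_document(
    text: str, fragment_size: int = 1000, chunk_size: int = 100
) -> List[List[List[str]]]:
    """Return a list of fragments, each fragment being a list of fixed-size chunks.

    Tokenization is plain whitespace splitting (an assumption, chosen because it
    needs no language-specific resources).
    """
    tokens = text.split()
    fragments = []
    for start in range(0, len(tokens), fragment_size):
        chunks = chunk_tokens(tokens[start:start + fragment_size], chunk_size)
        if chunks:  # skip fragments too short to yield a single full chunk
            fragments.append(chunks)
    return fragments


# Usage: a 250-token toy document, fragments of 100 tokens, chunks of 40 tokens.
toy_text = " ".join(f"w{i}" for i in range(250))
toy_fragments = partition_document(toy_text, fragment_size=100, chunk_size=40)
```

Features would then be computed per chunk, so that each fragment yields several comparable observations of an author's style.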

Wolverhampton Intellectual Repository and E-theses

Urdu AI: writeprints for Urdu authorship identification

Authors: Hassan Saeed-Ul, Sarwar Raheem
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/07/2021

This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing on 31/10/2021, available online at: https://doi.org/10.1145/3476467 The accepted version of the publication may differ from the final published version. The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval. These application domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than those of existing studies and conduct several experimental studies, which show that our solution can overcome the limitations of existing studies and report an accuracy level of 94.03%, which is higher than all previous authorship identification works.
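
The pipeline outlined in the abstract above (text sample to point set, then nearest-neighbour prediction) might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the two toy features (average word length and type-token ratio per chunk), the directed average-minimum distance between point sets, and the names `to_point_set`, `set_distance` and `predict_author` are hypothetical and are not the feature space or distance measure used in the paper.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def to_point_set(text: str, chunk_size: int = 50) -> List[Point]:
    """Map a text to one 2-D point per full chunk of chunk_size tokens.

    The features (average word length, type-token ratio) are toy stand-ins.
    """
    tokens = text.split()
    points = []
    for i in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[i:i + chunk_size]
        avg_len = sum(len(t) for t in chunk) / chunk_size
        type_token_ratio = len(set(chunk)) / chunk_size
        points.append((avg_len, type_token_ratio))
    return points


def set_distance(a: List[Point], b: List[Point]) -> float:
    """Directed distance: mean over points in a of the distance to the closest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def predict_author(query: str, labeled: List[Tuple[str, str]]) -> Optional[str]:
    """Return the author of the labeled sample whose point set is nearest to the query."""
    query_points = to_point_set(query)
    if not query_points:
        raise ValueError("query text is shorter than one chunk")
    best_author, best_distance = None, math.inf
    for author, text in labeled:
        sample_points = to_point_set(text)
        if not sample_points:
            continue  # sample too short to compare
        d = set_distance(query_points, sample_points)
        if d < best_distance:
            best_author, best_distance = author, d
    return best_author


# Usage with two synthetic "authors": one writes short words, one long words.
labeled_samples = [
    ("A", " ".join(["cat dog sun"] * 40)),
    ("B", " ".join(["extraordinary magnificent"] * 60)),
]
predicted = predict_author(" ".join(["hat pig run"] * 20), labeled_samples)
```

Comparing point sets rather than single document vectors lets a short anonymous sample be matched against individual chunks of a longer known sample.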

E-space: Manchester Metropolitan University's Research Repository

Wolverhampton Intellectual Repository and E-theses

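Una nueva visión de la supuesta influencia de Madame Bovary en La Regenta a través de la estilometría y el análisis de sentimientos basados en lenguaje R [A new view of the supposed influence of Madame Bovary on La Regenta through stylometry and sentiment analysis based on the R language]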

Author: Lozano Zahonero
Publication date: 01/07/2020

Madame Bovary's supposed influence on La Regenta has been the subject of numerous critical studies, although it has been surrounded by controversy and debate from the beginning. The approach traditionally adopted has been qualitative and based on partial, and not always objective, data. Furthermore, merely anecdotal impressions have sometimes been the basis of different hypotheses and, consequently, the results obtained have been discordant. The main goal of this work is to provide quantitative data that help answer this still open question. To this end, a computational analysis of both the stylistic patterns and the emotional dimension that underlie both novels will be carried out using the programming language R. In addition, the original version of Madame Bovary will be compared with its translation into Spanish in order to test a new model for identifying equivalence in translation. Despite the limitations due to its novelty, this approach can be a first step towards exploring new ways of investigating phenomena such as assimilation, imitation, intertextuality or plagiarism in literary texts, as well as equivalence in translation.