3 research outputs found

    A scalable framework for cross-lingual authorship identification

    This is an accepted manuscript of an article published by Elsevier in Information Sciences on 10/07/2018, available online: https://doi.org/10.1016/j.ins.2018.07.009 The accepted version of the publication may differ from the final published version. © 2018 Elsevier Inc. Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low-resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments, where each fragment is further decomposed into fixed-size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.
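
    The fragment-and-chunk partitioning mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not state the fragment size, the chunk size, or the unit of splitting, so the whitespace tokenization and the names `partition_document`, `fragment_size`, and `chunk_size` are assumptions made for illustration only, not the authors' implementation.

```python
from typing import List


def partition_document(
    text: str, fragment_size: int = 12, chunk_size: int = 4
) -> List[List[List[str]]]:
    """Split a document into fragments, then each fragment into fixed-size chunks.

    Assumptions (not from the abstract): units are whitespace-separated
    tokens, and both sizes are hypothetical defaults.
    """
    if fragment_size <= 0 or chunk_size <= 0:
        raise ValueError("fragment_size and chunk_size must be positive")

    tokens = text.split()

    # First level: consecutive fragments of at most fragment_size tokens.
    fragments = [
        tokens[i:i + fragment_size]
        for i in range(0, len(tokens), fragment_size)
    ]

    # Second level: each fragment becomes chunks of at most chunk_size tokens.
    # The last chunk of a fragment may be shorter than chunk_size.
    return [
        [fragment[j:j + chunk_size] for j in range(0, len(fragment), chunk_size)]
        for fragment in fragments
    ]
```

    In a pipeline of this kind, stylometric features would presumably be computed per chunk, so that one document yields many small samples rather than a single one.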

    Urdu AI: writeprints for Urdu authorship identification

    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing on 31/10/2021, available online at: https://doi.org/10.1145/3476467 The accepted version of the publication may differ from the final published version. The authorship identification task aims at identifying the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbour classifier to predict the original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus than those used in existing studies and conduct several experimental studies, which show that our solution overcomes the limitations of existing studies and reaches an accuracy level of 94.03%, higher than all previous authorship identification works.
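
    The idea of turning each text sample into a point set and predicting the author with a nearest neighbour classifier can be sketched as follows. The abstract does not specify the features, the distance between point sets, or the retrieval step, so everything here is an assumption for illustration: the two toy features (average token length and type-token ratio), the averaged closest-point distance, and the names `to_point_set`, `set_distance`, and `predict_author`. The candidate retrieval step is omitted entirely.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_point_set(text: str, chunk_size: int = 5) -> List[Point]:
    """Map a text sample to one 2-D point per chunk of tokens.

    Assumed toy features: average token length and type-token ratio.
    """
    tokens = text.split()
    if not tokens:
        raise ValueError("text sample must contain at least one token")
    points = []
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        avg_len = sum(len(token) for token in chunk) / len(chunk)
        type_token_ratio = len(set(chunk)) / len(chunk)
        points.append((avg_len, type_token_ratio))
    return points


def set_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average Euclidean distance from each point in a to its closest point in b.

    This is an assumed measure, chosen only to keep the sketch short.
    """
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def predict_author(anonymous: str, labeled: Sequence[Tuple[str, str]]) -> str:
    """Return the author of the labeled sample nearest to the anonymous one.

    `labeled` holds (author, text) pairs; this is a plain 1-nearest-neighbour rule.
    """
    if not labeled:
        raise ValueError("at least one labeled sample is required")
    query = to_point_set(anonymous)
    best_author, _ = min(
        ((author, set_distance(query, to_point_set(text))) for author, text in labeled),
        key=lambda pair: pair[1],
    )
    return best_author
```

    A real feature space for Urdu would need language-appropriate tokenization and far richer stylometric features than the two used here.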

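    Una nueva visión de la supuesta influencia de Madame Bovary en La Regenta a través de la estilometría y el análisis de sentimientos basados en lenguaje R [A new view of the supposed influence of Madame Bovary on La Regenta through stylometry and sentiment analysis based on the R language]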

    Madame Bovary's supposed influence on La Regenta has been the subject of numerous critical studies, although it has been surrounded by controversy and debate from the beginning. The traditionally adopted approach has been qualitative and based on partial, and not always objective, data. Furthermore, the different hypotheses have sometimes rested on merely anecdotal impressions and, consequently, the results obtained have been discordant. The main goal of this work is to provide quantitative data that help answer this still open question. To this end, a computational analysis of both the stylistic patterns and the emotional dimension that underlie both novels will be carried out using the programming language R. In addition to this primary goal, the comparison between the original version of Madame Bovary and its translation into Spanish will also be addressed, in order to test a new model for identifying equivalence in translation. Despite the limitations that stem from its novelty, this approach can be a first step towards exploring new ways of investigating phenomena such as assimilation, imitation, intertextuality or plagiarism in literary texts, as well as equivalence in translation.