3,544 research outputs found

    Modeling Global Syntactic Variation in English Using Dialect Classification

    This paper evaluates global-scale dialect identification for 14 national varieties of English as a means of studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.

    A scalable framework for cross-lingual authorship identification

    This is an accepted manuscript of an article published by Elsevier in Information Sciences on 10/07/2018, available online: https://doi.org/10.1016/j.ins.2018.07.009 The accepted version of the publication may differ from the final published version. © 2018 Elsevier Inc. Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low-resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments, where each fragment is further decomposed into fixed-size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.
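The fragment-and-chunk partitioning mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the unit (words or characters) or the sizes used, so the token list, `fragment_size`, and `chunk_size` below are hypothetical choices for illustration only, not the authors' settings.

```python
def partition(tokens, fragment_size, chunk_size):
    """Split a token list into fragments, then each fragment into fixed-size chunks.

    fragment_size and chunk_size are hypothetical parameters; the abstract
    does not give the values or the unit the authors used.
    """
    if fragment_size <= 0 or chunk_size <= 0:
        raise ValueError("sizes must be positive")
    # First level: consecutive fragments of at most fragment_size tokens.
    fragments = [tokens[i:i + fragment_size]
                 for i in range(0, len(tokens), fragment_size)]
    # Second level: each fragment becomes a list of chunks of at most chunk_size tokens.
    return [[frag[j:j + chunk_size] for j in range(0, len(frag), chunk_size)]
            for frag in fragments]


if __name__ == "__main__":
    words = "one two three four five six seven eight nine ten".split()
    for fragment in partition(words, fragment_size=6, chunk_size=2):
        print(fragment)
```

The last fragment and the last chunk may be shorter than the requested size; whether such remainders are kept or discarded is a design choice the abstract does not address.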

    A Likelihood Ratio Based Forensic Text Comparison with Multiple Types of Features

    This study aims at further improving forensic text comparison (FTC) under the likelihood ratio (LR) framework. While the use of the LR framework to express the strength of evidence is well recognised in forensic science, studies on forensic text evidence within the LR framework are limited, and this study is an attempt to alleviate this situation. There have already been initiatives to obtain LRs for textual evidence by adopting various approaches and using different sets of stylometric features (Carne & Ishihara, 2020; Ishihara, 2014, 2017a, 2017b, 2021). However, only a few features have been tested in the similarity-only score-based approach (Ishihara, 2021), and many features remain to be investigated. To achieve the aim of the study, we investigate some of the features in LR-based FTC and demonstrate how they contribute to the further improvement of the LR-based FTC system. Statistical, word n-gram (n=1,2,3), character n-gram (n=1,2,3,4), and part-of-speech (POS) n-gram (n=1,2,3) features were first tested separately in this study, and the separately estimated LRs were then fused into overall LRs. The database used was prepared by Ishihara (2021), and the documents under comparison were modelled as feature vectors using a bag-of-words model. Two groups of documents, which both contained documents of 700, 1,400, and 2,100 words, were concatenated for each author, resulting in a total of 719 same-author comparisons and 516,242 different-author comparisons. Cosine similarity was used to measure the similarity of texts, and the similarity-only score-based approach was used to estimate the LRs from the similarity scores (Helper et al., 2012; Bolck et al., 2015). The log-likelihood ratio cost (Cllr) and its composites, Cllrmin and Cllrcal, were used as assessment metrics.
    Findings indicate that (a) when the LRs of all the feature types are fused, the fused Cllr values are 0.56, 0.30, and 0.19 for 700, 1,400, and 2,100 words, respectively, and (b) feature selection depending on the nature of an FTC task matters to the performance of the FTC system and can contribute to the improvement of LR-based FTC.
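Two of the quantities named in this abstract, cosine similarity over n-gram count vectors and the log-likelihood ratio cost, can be sketched as follows. This is a minimal illustration, not the study's pipeline: the whitespace tokenisation and all function names are assumptions, and the step that converts similarity scores into LRs (the score-based model) is not shown. The Cllr function uses the standard definition, the average of the mean of log2(1 + 1/LR) over same-author comparisons and the mean of log2(1 + LR) over different-author comparisons.

```python
import math
from collections import Counter


def word_ngram_counts(text, n):
    """Bag of word n-grams; lowercased whitespace tokenisation is an assumption."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cllr(same_author_lrs, diff_author_lrs):
    """Log-likelihood ratio cost for LRs from same- and different-author comparisons."""
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs) / len(diff_author_lrs)
    return 0.5 * (same_term + diff_term)


if __name__ == "__main__":
    doc_a = word_ngram_counts("the cat sat on the mat", 2)
    doc_b = word_ngram_counts("the cat sat on the sofa", 2)
    print(cosine_similarity(doc_a, doc_b))
    # A system that always outputs LR = 1 carries no information and scores Cllr = 1.
    print(cllr([1.0, 1.0], [1.0, 1.0]))
```

Lower Cllr values indicate better performance, and values below 1 indicate that the LRs carry useful information, which is the sense in which the reported values of 0.56, 0.30, and 0.19 improve as document length grows.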