1,064 research outputs found

    Syllabic quantity patterns as rhythmic features for Latin authorship attribution

    It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions but also in many prose works. Such metric patterns were based on so-called syllabic quantity, that is, on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility of employing syllabic quantity as a basis for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets using support vector machines (SVMs), show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
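
    A minimal sketch of how such a pipeline might look (illustrative only, not the paper's code; the scansions and author labels below are invented): each text is assumed to be pre-scanned into a string over {'L', 'S'} for long and short syllables, character n-grams over that alphabet play the role of rhythmic features, and an SVM performs the attribution.

        # Hypothetical example: SVM attribution from syllabic-quantity n-grams.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Invented scansions of prose passages ('L' = long, 'S' = short syllable).
        quantity_strings = [
            "LSSLSLLSSLLS", "LSLLSSLSLSLL",   # author A
            "SSLSLSSLLSSL", "SLSSLLSSLSSL",   # author B
        ]
        authors = ["A", "A", "B", "B"]

        # Character n-grams over the quantity alphabet serve as rhythmic features.
        clf = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
            LinearSVC(),
        )
        clf.fit(quantity_strings, authors)
        print(clf.predict(["LSSLSLLSSLSS"]))  # attribute an unseen passage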

    Investigating features and techniques for Arabic authorship attribution

    Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history, there have been many examples of this problem concerned with revealing the genuine authors of works of literature that were published anonymously, and in some cases where more than one author claimed authorship of the disputed text. There has been considerable research effort into trying to solve this problem. Initially these efforts were based on statistical patterns, and more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of ‘function words’ – typically pronouns, conjunctions and prepositions – as the features on which to base the discovery of patterns of usage relevant to specific authors. The authorship attribution problem has been tackled in many languages, but predominantly in English. In this thesis the problem is addressed for the first time in the Arabic language. We therefore investigate whether the concept of function words in English can be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learning a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data. The hybrid algorithm also aims to do this with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, representing a collection of six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach. The combination of the successful list of Arabic function words and the hybrid algorithm for classification led to satisfactory levels of accuracy in determining the author of portions of the texts in test data. The work described here is the first (to our knowledge) that investigates authorship attribution in the Arabic language using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character-level features, leading in some cases to more accurate results.
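
    A rough sketch of the kind of hybrid described above (synthetic data and invented parameters; not the thesis implementation): a tiny evolutionary search looks for a small subset of function-word frequency features, scoring each candidate subset by the cross-validated accuracy of a linear discriminant classifier, with a mild penalty favouring smaller subsets.

        # Hypothetical evolutionary feature selection wrapped around LDA.
        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_texts, n_features = 120, 40          # texts x function-word frequencies
        X = rng.random((n_texts, n_features))  # synthetic stand-in data
        y = rng.integers(0, 6, n_texts)        # six authors, as in the dataset

        def fitness(mask):
            # Cross-validated LDA accuracy on the selected columns,
            # minus a small cost per selected feature.
            if mask.sum() == 0:
                return 0.0
            acc = cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, mask.astype(bool)], y, cv=3).mean()
            return acc - 0.005 * mask.sum()

        pop = rng.integers(0, 2, (20, n_features))     # random initial population
        for _ in range(15):
            scores = np.array([fitness(ind) for ind in pop])
            parents = pop[np.argsort(scores)[-10:]]    # keep the fittest half
            children = parents[rng.integers(0, 10, size=10)].copy()
            flips = rng.random(children.shape) < 0.05  # point mutation
            children[flips] ^= 1
            pop = np.vstack([parents, children])

        best = max(pop, key=fitness)
        print("selected function-word features:", np.flatnonzero(best))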

    European Approaches to Japanese Language and Linguistics

    In this volume, European specialists in Japanese language present new and original research into Japanese across a wide spectrum of topics, including descriptive, sociolinguistic, pragmatic and didactic accounts. The articles share a focus on contemporary issues and adopt new approaches to the study of Japanese that are often specific to European traditions of language study. The articles address an audience in both Japanese Studies and Linguistics. They are representative of the wide range of topics currently studied in European universities, and they address scholars and students alike.

    Unravelling Interlanguage Facts via Explainable Machine Learning

    Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e., that of analysing the internals of an NLI classifier trained by an explainable machine learning algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena "give a speaker's native language away". We use this perspective in order to tackle both NLI and a (much less researched) companion task, i.e., guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, i.e., are most indicative of a speaker's L1. We also present two case studies, one on Spanish and one on Italian learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s. Overall, our study shows that the use of explainable machine learning can be a valuable tool for these tasks.
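
    As a toy illustration of the explainable-ML angle (the texts, labels, and features below are invented; the study's actual traits are far richer): train an interpretable linear classifier to predict the writer's L1, then read its largest coefficients as the cues that give the native language away.

        # Hypothetical example: inspect per-class weights of a linear NLI model.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression

        texts = [                                        # invented learner English
            "I am agree with this opinion since many years",
            "He explained me the problem very detailed",
            "The informations are depending of the context",
            "I look forward to hearing from you soon",
        ]
        l1 = ["es", "de", "fr", "native"]                # toy L1 labels

        vec = CountVectorizer(ngram_range=(1, 2))
        X = vec.fit_transform(texts)
        clf = LogisticRegression(max_iter=1000).fit(X, l1)

        # The strongest positive coefficients per class act as explanations.
        feature_names = vec.get_feature_names_out()
        for cls, coefs in zip(clf.classes_, clf.coef_):
            top = np.argsort(coefs)[-3:][::-1]
            print(cls, "->", [feature_names[i] for i in top])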

    Versification and Authorship Attribution

    Contemporary stylometry uses a variety of methods, including machine learning, to discover a poem’s author based on features such as the frequencies of words and character n-grams. However, there is one potential textual fingerprint stylometry tends to ignore: versification, or the very making of language into verse. Using poetic texts in three different languages (Czech, German, and Spanish), Petr Plecháč asks whether versification features such as rhythm patterns and types of rhyme can help determine authorship. He then tests his findings on two unsolved literary mysteries. In the first, Plecháč distinguishes the parts of the Elizabethan verse play The Two Noble Kinsmen written by William Shakespeare from those written by his coauthor, John Fletcher. In the second, he seeks to solve a case of suspected forgery: how authentic was a group of poems first published as the work of the nineteenth-century Russian author Gavriil Stepanovich Batenkov? This book of poetic investigation should appeal to literary sleuths the world over.
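
    A small illustration of the underlying idea (invented stress patterns, not Plecháč's method): reduce each verse line to its stress pattern, build frequency profiles of short rhythm n-grams per author, and compare a disputed text against each profile.

        # Invented example of rhythm-profile comparison for attribution.
        from collections import Counter
        from math import sqrt

        def rhythm_profile(stress_lines, n=3):
            """Relative frequencies of stress-pattern n-grams across verse lines."""
            counts = Counter()
            for line in stress_lines:
                counts.update(line[i:i + n] for i in range(len(line) - n + 1))
            total = sum(counts.values())
            return {gram: c / total for gram, c in counts.items()}

        def cosine(p, q):
            dot = sum(v * q.get(g, 0.0) for g, v in p.items())
            norm_p = sqrt(sum(v * v for v in p.values()))
            norm_q = sqrt(sum(v * v for v in q.values()))
            return dot / (norm_p * norm_q)

        # '1' = stressed syllable, '0' = unstressed (all patterns invented).
        author_a = rhythm_profile(["0101010101", "0101010100"])   # iambic-leaning
        author_b = rhythm_profile(["1001001001", "1001001000"])   # dactylic-leaning
        disputed = rhythm_profile(["0101010001", "0101010101"])

        print("closer to A" if cosine(disputed, author_a) > cosine(disputed, author_b)
              else "closer to B")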

    As good as it gets? Unrepresented litigant and courtroom dynamics: a case study

    This paper examines the pragmatic competence of unrepresented litigants in court. In doing so, it engages larger themes, including lay understanding of the law and the major challenge, for judges in an adversarial legal system, of balancing offers of assistance with maintaining judicial neutrality. The research reported in the paper involved 72 hours of observation during a 14-day trial in a Hong Kong appellate court. The litigant in the case had represented herself previously in at least three lawsuits over a period of ten years. She had initiated each action, and two cases had gone to appeal. As well as having extensive litigation experience in this way, the litigant was also a highly educated professional, capable of speaking fluently in the professional language of the proceedings; seemingly she had also devoted a lot of time to researching and preparing her cases. These characteristics mark her out as among the most prepared of unrepresented litigants to deal with the obstacles presented by legal procedures and courtroom requirements. The study contributes to the field in two main respects: 1) Perspective. Previous studies of unrepresented litigants have tended to take a top-down approach, looking at litigant behaviour from the perspective of a judge or lawyer. This study uses discourse data obtained from courtroom observation in an attempt to understand the trial from the litigant’s perspective. 2) Access to justice. The stereotypical unrepresented litigant has low income and literacy, and makes obvious mistakes in court. The litigant in this case study represents the other end of the litigant-in-person spectrum. Her courtroom struggles expose obstacles that all unrepresented litigants face, and which are not easily overcome even after repeated experience of the legal system. The data show persistent misconceptions regarding the law, and reveal tensions between lay understandings of justice and the institutional delivery of legal outcomes by the courts. In recent years there has been a rapid rise in the number of unrepresented litigants in Hong Kong, as in many other jurisdictions. Better understanding of the courtroom dynamics created by the behaviour of such litigants may help prevent interruptions in courtroom proceedings and improve overall public access to justice.

    Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis

    This research focuses on Native Language Identification (NLID), and in particular on the linguistic identifiers of L1 Persian speakers writing in English. The project comprises three sub-studies. The first study devises a coding system to account for interlingual features present in a corpus of L1 Persian speakers blogging in English and a corpus of L1 English blogs; Study One then demonstrates that it is possible to use interlingual identifiers to distinguish authorship by L1 Persian speakers. Study Two examines the coding system in relation to the L1 Persian corpus and a corpus of L1 Azeri and L1 Pashto speakers. The findings of this section indicate that the NLID method and the features devised can discriminate between L1 influences from different languages. Study Three focuses on elicited data, in which participants were tasked with disguising their language to appear as L1 Persian speakers writing in English. This study indicated that there was a significant difference between the features in the L1 Persian corpus and those in the corpus of disguised texts. The findings of this research indicate that NLID and the coding system devised have a very strong potential to aid forensic authorship analysis in investigative situations. Unlike existing research, this project focuses predominantly on blogs, as opposed to student data, making the findings more appropriate to forensic casework data.
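
    As a toy illustration of what automated feature coding can look like (the markers below are generic invented examples; the study's coding system for Persian interlingual features is far richer and partly qualitative):

        # Hypothetical marker counts per text; real interlingual features
        # would be derived from corpus analysis, not hard-coded like this.
        import re

        MARKERS = {
            "double_comparative": re.compile(r"\bmore \w+er\b", re.IGNORECASE),
            "pluralised_mass_noun": re.compile(r"\b(?:informations|furnitures|advices)\b",
                                               re.IGNORECASE),
        }

        def code_text(text):
            """Return per-marker counts for one blog post."""
            return {name: len(pattern.findall(text)) for name, pattern in MARKERS.items()}

        print(code_text("These informations are more better than the advices we got."))
        # {'double_comparative': 1, 'pluralised_mass_noun': 2}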

    Design of a Controlled Language for Critical Infrastructures Protection

    We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates from the need to coordinate and categorize communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science, known as the Euratom Thesaurus.
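
    A minimal sketch of the library-science machinery involved (all terms invented, not drawn from the Euratom Thesaurus or the CIP project): a controlled vocabulary maps free-text keywords onto preferred terms and records broader-term relations for categorisation.

        # Hypothetical USE (non-preferred -> preferred) and broader-term tables.
        USE = {
            "power grid attack": "electricity grid incident",
            "blackout report": "electricity grid incident",
            "pipeline leak": "gas network incident",
        }
        BROADER = {
            "electricity grid incident": "energy infrastructure",
            "gas network incident": "energy infrastructure",
        }

        def normalise(keyword):
            """Map a free-text keyword to its preferred term and broader category."""
            term = USE.get(keyword.lower(), keyword.lower())
            return term, BROADER.get(term)

        print(normalise("Power grid attack"))
        # ('electricity grid incident', 'energy infrastructure')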

    Affordances and limitations of algorithmic criticism

    Humanities scholars currently have access to unprecedented quantities of machine-readable texts, and, at the same time, the tools and methods with which we can analyse and visualise these texts are becoming increasingly sophisticated. As numerous studies have shown, many of the new technical possibilities that emerge from fields such as text mining and natural language processing have useful applications within literary research. Computational methods can help literary scholars discover interesting trends and correlations within massive text collections, and they can enable a thoroughly systematic examination of the stylistic properties of literary works. While such computer-assisted forms of reading have proven invaluable for research in the field of literary history, relatively few studies have applied these technologies to expand or transform the ways in which we can interpret literary texts. Based on a comparative analysis of digital scholarship and traditional scholarship, this thesis critically examines the possibilities and limitations of computer-based literary criticism. It argues that quantitative analyses of data about literary techniques can often reveal surprising qualities of works of literature, which can, in turn, lead to new interpretative readings.
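
    One tiny example of the kind of quantitative attention to literary technique discussed here (illustrative only, not drawn from the thesis): measuring alliteration as the rate of adjacent words sharing an initial letter.

        # Hypothetical metric for one literary technique: alliteration density.
        import re

        def alliteration_rate(text):
            """Fraction of adjacent word pairs sharing an initial letter."""
            words = re.findall(r"[a-z']+", text.lower())
            pairs = zip(words, words[1:])
            hits = sum(1 for a, b in pairs if a[0] == b[0])
            return hits / max(len(words) - 1, 1)

        print(alliteration_rate("Full fathom five thy father lies"))  # prints 0.4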