    Dictionary writing system (DWS) plus corpus query package (CQP): the case of TshwaneLex

    In this article the integrated corpus query functionality of the dictionary compilation software TshwaneLex is analysed. Attention is given to the handling of both raw corpus data and annotated corpus data. With regard to the latter it is shown how, with a minimum of human effort, machine learning techniques can be employed to obtain part-of-speech tagged corpora that can be used for lexicographic purposes. All points are illustrated with data drawn from English and Northern Sotho. The tools and techniques themselves, however, are language-independent, and as such the encouraging outcomes of this study are far-reaching.

    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult for existing natural language analysis tools to process, since they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.

    An automatic part-of-speech tagger for Middle Low German

    Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalised spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.

    Event-based Access to Historical Italian War Memoirs

    The progressive digitization of historical archives provides new, often domain-specific, textual resources that report on facts and events which have happened in the past; among these, memoirs are a very common type of primary source. In this paper, we present an approach for extracting information from Italian historical war memoirs and turning it into structured knowledge. This is based on the semantic notions of events, participants and roles. We quantitatively evaluate each of the key steps of our approach and provide a graph-based representation of the extracted knowledge, which allows the reader to move between a Close and a Distant Reading of the collection. Comment: 23 pages, 6 figures

    The Theoretical Argument for Disproving Asymptotic Upper-Bounds on the Accuracy of Part-of-Speech Tagging Algorithms: Adopting a Linguistics, Rule-Based Approach

    This paper takes a deep dive into a particular area of the interdisciplinary domain of Computational Linguistics: Part-of-Speech Tagging algorithms. The author relies primarily on scholarly Computer Science and Linguistics papers to describe previous approaches to this task and the often-hypothesized existence of an asymptotic accuracy rate of around 98%, by which this task is allegedly bound. However, after doing more research into why the accuracy of previous algorithms has behaved in this asymptotic manner, the author identifies valid and empirically backed reasons why the accuracy of previous approaches does not necessarily reflect any sort of general asymptotic bound on the task of automated Part-of-Speech Tagging. In response, a theoretical argument is proposed to circumvent the shortcomings of previous approaches to this task. The argument involves abandoning the flawed status quo of training machine learning algorithms and predictive models on outdated corpora, and instead walks the reader from conception through implementation of a rule-based algorithm with roots in both practical and theoretical Linguistics. While the resulting algorithm is simply a prototype which cannot currently be verified as achieving a tagging accuracy rate of over 98%, its multi-tiered methodology, meant to mirror aspects of human cognition in Natural Language Understanding, is intended to serve as a theoretical blueprint for a new and inevitably more reliable way to deal with the challenges in Part-of-Speech Tagging, and to provide much-needed advances in the popular area of Natural Language Processing.