8,792 research outputs found

    Vagueness and referential ambiguity in a large-scale annotated corpus

    Get PDF
    In this paper, we argue that difficulties in the definition of coreference itself contribute to lower inter-annotator agreement in certain cases. Data from a large referentially annotated corpus serves to corroborate this point, using a quantitative investigation to assess which effects or problems are likely to be the most prominent. Several examples where such problems occur are discussed in more detail, and we then propose a generalisation of Poesio, Reyle and Stevenson’s Justified Sloppiness Hypothesis to provide a unified model for these cases of disagreement and argue that a deeper understanding of the phenomena involved allows to tackle problematic cases in a more principled fashion than would be possible using only pre-theoretic intuitions

    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    Get PDF
    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning

    Building a semantically annotated corpus of clinical texts

    Get PDF
    In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, whose value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains

    Temporal expression normalisation in natural language texts

    Get PDF
    Automatic annotation of temporal expressions is a research challenge of great interest in the field of information extraction. In this report, I describe a novel rule-based architecture, built on top of a pre-existing system, which is able to normalise temporal expressions detected in English texts. Gold standard temporally-annotated resources are limited in size and this makes research difficult. The proposed system outperforms the state-of-the-art systems with respect to TempEval-2 Shared Task (value attribute) and achieves substantially better results with respect to the pre-existing system on top of which it has been developed. I will also introduce a new free corpus consisting of 2822 unique annotated temporal expressions. Both the corpus and the system are freely available on-line.Comment: 7 pages, 1 figure, 5 table

    Customizing BPMN Diagrams Using Timelines

    Get PDF
    BPMN (Business Process Model and Notation) is widely used standard modeling technique for representing Business Processes by using diagrams, but lacks in some aspects. Representing execution-dependent and time-dependent decisions in BPMN Diagrams may be a daunting challenge [Carlo Combi et al., 2017]. In many cases such constraints are omitted in order to preserve the simplicity and the readability of the process model. However, for purposes such as compliance checking, process mining, and verification, formalizing such constraints could be very useful. In this paper, we propose a novel approach for annotating BPMN Diagrams with Temporal Synchronization Rules borrowed from the timeline-based planning field. We discuss the expressivity of the proposed approach and show that it is able to capture a lot of complex temporally-related constraints without affecting the structure of BPMN diagrams. Finally, we provide a mapping from annotated BPMN diagrams to timeline-based planning problems that allows one to take advantage of the last twenty years of theoretical and practical developments in the field

    Information structure

    Get PDF
    The guidelines for Information Structure include instructions for the annotation of Information Status (or ‘givenness’), Topic, and Focus, building upon a basic syntactic annotation of nominal phrases and sentences. A procedure for the annotation of these features is proposed

    Active learning in annotating micro-blogs dealing with e-reputation

    Full text link
    Elections unleash strong political views on Twitter, but what do people really think about politics? Opinion and trend mining on micro blogs dealing with politics has recently attracted researchers in several fields including Information Retrieval and Machine Learning (ML). Since the performance of ML and Natural Language Processing (NLP) approaches are limited by the amount and quality of data available, one promising alternative for some tasks is the automatic propagation of expert annotations. This paper intends to develop a so-called active learning process for automatically annotating French language tweets that deal with the image (i.e., representation, web reputation) of politicians. Our main focus is on the methodology followed to build an original annotated dataset expressing opinion from two French politicians over time. We therefore review state of the art NLP-based ML algorithms to automatically annotate tweets using a manual initiation step as bootstrap. This paper focuses on key issues about active learning while building a large annotated data set from noise. This will be introduced by human annotators, abundance of data and the label distribution across data and entities. In turn, we show that Twitter characteristics such as the author's name or hashtags can be considered as the bearing point to not only improve automatic systems for Opinion Mining (OM) and Topic Classification but also to reduce noise in human annotations. However, a later thorough analysis shows that reducing noise might induce the loss of crucial information.Comment: Journal of Interdisciplinary Methodologies and Issues in Science - Vol 3 - Contextualisation digitale - 201
    • …
    corecore