    Corpus annotation as a scientific task

    Annotation studies in computational linguistics (CL) are generally unscientific: they are mostly not reproducible, use too few (and often non-independent) annotators, and rely on guidelines that are often something of a moving target. Additionally, the notion of ‘expert annotators’ invariably means only that the annotators have linguistic training. While this can be acceptable in some special contexts, it is often far from ideal. This is particularly the case when subtle judgements are required or when, as is increasingly common, one makes use of corpora originating from technical texts that have been produced by, and are intended to be consumed by, an audience of technical experts in the field. We outline a more rigorous approach to collecting human annotations, using as our example a study designed to capture judgements on the meaning of hedge words in medical records.
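
    As a concrete illustration of the reliability measurement such a methodology calls for, the sketch below computes Fleiss' kappa, a standard chance-corrected agreement coefficient for more than two annotators. The hedge labels and judgements are invented for illustration and are not taken from the study.

    from collections import Counter

    def fleiss_kappa(ratings):
        """Fleiss' kappa for items each rated by the same number of
        annotators; `ratings` is a list of label lists, one per item."""
        n = len(ratings[0])  # annotators per item
        # Observed agreement: proportion of agreeing annotator pairs per item.
        p_items = []
        for item in ratings:
            agree = sum(c * (c - 1) for c in Counter(item).values())
            p_items.append(agree / (n * (n - 1)))
        p_bar = sum(p_items) / len(ratings)
        # Chance agreement from the marginal label distribution.
        totals = Counter(label for item in ratings for label in item)
        total = sum(totals.values())
        p_e = sum((v / total) ** 2 for v in totals.values())
        return (p_bar - p_e) / (1 - p_e)

    # Three annotators judging four hedge instances (hypothetical data).
    print(fleiss_kappa([["certain", "certain", "certain"],
                        ["hedged", "hedged", "certain"],
                        ["hedged", "hedged", "hedged"],
                        ["certain", "hedged", "certain"]]))

    Values near 1 indicate agreement well above chance; independent annotators and frozen guidelines are preconditions for the statistic to be meaningful.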

    Finding predominant word senses in untagged text

    In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with using the predominant, or first-sense, heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data. Whilst there are a few hand-tagged corpora available for some languages, one would expect the frequency distribution of the senses of words, particularly topical words, to depend on the genre and domain of the text under consideration. We present work on the use of a thesaurus acquired from raw textual corpora and the WordNet similarity package to find predominant noun senses automatically. The acquired predominant senses give a precision of 64% on the nouns of the SENSEVAL-2 English all-words task. This is a very promising result given that our method does not require any hand-tagged text, such as SemCor. Furthermore, we demonstrate that our method discovers appropriate predominant senses for words from two domain-specific corpora.
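
    The following is a simplified sketch of this style of predominant-sense ranking using NLTK's WordNet interface: each noun sense of a target word is scored by its WordNet similarity to the word's distributional neighbours, weighted by the thesaurus similarity scores. The neighbour list and weights are hypothetical stand-ins for an automatically acquired thesaurus, and the exact weighting and normalisation in the paper may differ.

    # Assumes the WordNet data is installed, e.g. nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    def wn_sim(sense, neighbour):
        """Max WordNet path similarity between `sense` and any noun
        sense of the neighbour word."""
        sims = [sense.path_similarity(ns) or 0.0
                for ns in wn.synsets(neighbour, pos=wn.NOUN)]
        return max(sims, default=0.0)

    def predominant_sense(word, neighbours):
        """Rank the noun senses of `word` by weighted similarity to its
        distributional neighbours; `neighbours` maps word -> score."""
        return max(wn.synsets(word, pos=wn.NOUN),
                   key=lambda s: sum(score * wn_sim(s, n)
                                     for n, score in neighbours.items()))

    # Hypothetical thesaurus neighbours for "bank" in a financial domain.
    print(predominant_sense("bank", {"institution": 0.25,
                                     "firm": 0.20,
                                     "account": 0.15}))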

    What does validation of cases in electronic record databases mean? The potential contribution of free text

    Electronic health records are increasingly used for research. The definition of cases or endpoints often relies on coded diagnostic data, using a pre-selected group of codes. Validation of these cases, as ‘true’ cases of the disease, is crucial. There are, however, ambiguities in what is meant by validation in the context of electronic records. Validation usually implies comparison of a definition against a gold standard of diagnosis and the ability to identify false negatives (‘true’ cases which were not detected) as well as false positives (detected cases which did not have the condition). We argue that two separate concepts of validation are often conflated in existing studies: firstly, whether the GP thought the patient was suffering from a particular condition (which we term confirmation, or internal validation), and secondly, whether the patient really had the condition (external validation). Few studies have the ability to detect false negatives, who have not received a diagnostic code. Natural language processing is likely to open up the use of free text within the electronic record, which will facilitate both the validation of the coded diagnosis and the search for false negatives.
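
    To make the two error types concrete, here is a minimal sketch of the metrics a true gold standard would support, with hypothetical patient IDs. The second argument, the set of patients who really have the condition, is exactly what most studies lack, which is why false negatives usually go undetected.

    def validation_metrics(coded_cases, true_cases):
        """Both arguments are sets of patient IDs."""
        tp = len(coded_cases & true_cases)  # coded and truly a case
        fp = len(coded_cases - true_cases)  # coded but not a true case
        fn = len(true_cases - coded_cases)  # true case, never coded
        ppv = tp / (tp + fp)                # positive predictive value
        sensitivity = tp / (tp + fn)        # share of true cases found
        return ppv, sensitivity

    # Hypothetical cohort: the codes miss two true cases entirely.
    print(validation_metrics({"p1", "p2", "p3"}, {"p1", "p2", "p4", "p5"}))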

    Robust Grammatical Analysis for Spoken Dialogue Systems

    We argue that grammatical analysis is a viable alternative to concept spotting for processing spoken input in a practical spoken dialogue system. We discuss the structure of the grammar and a model for robust parsing that combines linguistic and statistical sources of information. We present test results suggesting that grammatical processing allows fast and accurate processing of spoken input.

    Annotating a corpus of clinical text records for learning to recognize symptoms automatically

    We report on a research effort to create a corpus of clinical free text records enriched with annotation for symptoms of a particular disease (ovarian cancer). We describe the original data, the annotation procedure and the resulting corpus. The data (approximately 192K words) was annotated by three clinicians, and a procedure was devised to resolve disagreements. We are using the corpus to investigate the amount of symptom-related information in clinical records that is not coded, and to develop techniques for recognizing these symptoms automatically in unseen text.
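
    The abstract does not spell out the resolution procedure, but a minimal sketch of one common approach, token-level majority voting with full disagreements flagged for adjudication, might look as follows; the symptom label set is hypothetical.

    from collections import Counter

    def resolve(annotations):
        """`annotations` is a list of three equal-length label sequences,
        one per annotator; returns majority labels and disputed indices."""
        resolved, disputed = [], []
        for i, labels in enumerate(zip(*annotations)):
            (label, votes), = Counter(labels).most_common(1)
            if votes >= 2:
                resolved.append(label)  # at least two annotators agree
            else:
                resolved.append(None)   # all three disagree
                disputed.append(i)      # send to manual adjudication
        return resolved, disputed

    a1 = ["O", "SYMPTOM", "SYMPTOM", "O"]
    a2 = ["O", "SYMPTOM", "O", "TEST"]
    a3 = ["O", "SYMPTOM", "SYMPTOM", "EXAM"]
    print(resolve([a1, a2, a3]))  # majority labels plus disputed index 3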

    What evidence is there for a delay in diagnostic coding of rheumatoid arthritis in UK general practice records? An observational study of free text

    Objectives: Much research with electronic health records (EHRs) uses coded or structured data only; important information captured in the free text remains unused. One dimension of EHR data quality assessment is ‘currency’ or timeliness, i.e. that data are representative of the patient state at the time of measurement. We explored the utility of free text in UK general practice patient records to evaluate delays in the recording of rheumatoid arthritis (RA) diagnoses. We also aimed to locate and quantify disease and diagnostic information recorded only in text. Setting: UK general practice patient records from the Clinical Practice Research Datalink. Participants: 294 individuals with an incident diagnosis of RA between 2005 and 2008; 204 women and 85 men, median age 63 years. Primary and Secondary Outcome Measures: Assessment of (1) quantity and timing of text entries for disease-modifying anti-rheumatic drugs (DMARDs) as a proxy for the RA disease code, and (2) quantity, location and timing of free text information relating to RA onset and diagnosis. Results: Inflammatory markers, pain and DMARDs were the most common categories of disease information in text prior to the RA diagnostic code; 10-37% of patients had such information only in text. Read codes associated with RA-related text included correspondence, general consultation and arthritis codes. 64 patients (22%) had DMARD text entries more than 14 days prior to the RA code; these patients had more and earlier referrals to rheumatology, tests, swelling, pain and DMARD prescriptions, suggestive of an earlier implicit diagnosis than was recorded by the diagnostic code. Conclusions: RA-related symptoms, tests, referrals and prescriptions were recorded in free text, with 22% of patients showing strong evidence of delay in coding of the diagnosis. Researchers using EHRs may need to compensate for delayed codes by incorporating text into their case-ascertainment strategies. Natural language processing techniques have the capability to do this at scale.
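
    As a sketch of the delay measure reported above: given per-patient dates for the first DMARD mention in text and the RA diagnostic code (both hypothetical here), the flagging rule reduces to a simple date difference against the 14-day threshold.

    from datetime import date

    def coding_delay(first_dmard_text: date, ra_code: date) -> int:
        """Days from the first DMARD text entry to the RA code;
        positive values mean the text preceded the code."""
        return (ra_code - first_dmard_text).days

    # Hypothetical patient: a DMARD mentioned in free text 40 days
    # before the RA diagnostic code was entered.
    delay = coding_delay(date(2006, 3, 1), date(2006, 4, 10))
    print(delay, "days:", "flagged" if delay > 14 else "not flagged")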

    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process with existing natural language analysis tools since they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
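
    As a rough illustration of the statistical chunking setup (not the authors' system), the sketch below trains a per-token BIO classifier over simple lexical features; the toy clinical tokens, the feature set and the noun-phrase labels are all invented, and a real chunker would add richer features and sequence-level inference.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def features(tokens, i):
        """Minimal per-token features: the token, its predecessor,
        and a crude telegraphic-text cue (all-caps)."""
        return {"word": tokens[i].lower(),
                "prev": tokens[i - 1].lower() if i else "<s>",
                "is_upper": tokens[i].isupper()}

    # Toy training data in BIO format (hypothetical telegraphic notes).
    train = [(["pt", "c/o", "chest", "pain"], ["O", "O", "B-NP", "I-NP"]),
             (["severe", "headache", "today"], ["B-NP", "I-NP", "O"])]

    X = [features(toks, i) for toks, tags in train for i in range(len(toks))]
    y = [tag for _, tags in train for tag in tags]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

    test = ["severe", "pain"]
    print(list(model.predict([features(test, i) for i in range(len(test))])))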