    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free-text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult for existing natural language analysis tools to process because they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free-text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
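As a rough illustration of what a shallow syntactic (chunk) annotation can look like to a statistical chunker, the sketch below converts annotated chunk spans into per-token BIO tags. The BIO encoding, the function name `spans_to_bio`, the `NP` label and the example note are all assumptions made for illustration; the abstract does not say which encoding or label set the Harvey corpus uses.

```python
def spans_to_bio(tokens, spans):
    """Convert chunk spans into one BIO tag per token.

    `spans` is a list of (start, end, label) tuples with `end` exclusive.
    Spans are assumed not to overlap (hypothetical input format).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return tags


# Invented telegraphic note, not taken from the corpus.
tokens = ["pt", "c/o", "sore", "throat", "2", "days"]
spans = [(0, 1, "NP"), (2, 4, "NP"), (4, 6, "NP")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token-level tags of this kind are a common input format for sequence learners, which is one reason a consistent manual annotation matters for the reported learning behaviour.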

    Automatic Population of Structured Reports from Narrative Pathology Reports

    Structured pathology reports have a number of advantages: they can ensure the accuracy and completeness of pathology reporting, and they make it easier for referring doctors to glean pertinent information. The goal of this thesis is to extract pertinent information from free-text pathology reports, automatically populate structured reports for cancer diseases, and identify the commonalities and differences in processing principles needed to obtain maximum accuracy. Three pathology corpora were annotated with entities and the relationships between them: the melanoma corpus, the colorectal cancer corpus and the lymphoma corpus. A supervised machine-learning-based approach, utilising conditional random fields learners, was developed to recognise medical entities in the corpora. Feature engineering yielded the best feature configurations, which boosted the F-scores significantly from 4.2% to 6.8% on the training sets. Without proper negation and uncertainty detection, the quality of the structured reports would be diminished, so negation and uncertainty detection modules were built to handle this problem. The modules obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was presented to extract four relations from the lymphoma corpus. The system achieved very good performance on the training set, with a 100% F-score obtained by the rule-based module and a 97.2% F-score attained by the support vector machines classifier. Rule-based approaches were used to generate the structured outputs and populate predefined templates with them. The rule-based system attained F-scores of over 97% on the training sets. A pipeline system was implemented by assembling all the components described above. It achieved promising results in the end-to-end evaluations, with F-scores of 86.5%, 84.2% and 78.9% on the melanoma, colorectal cancer and lymphoma test sets respectively.
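To make the role of negation and uncertainty detection concrete, here is a minimal trigger-phrase sketch in the spirit of rule-based assertion detection. It is not the thesis's module: the trigger lists, the token window and the function name `classify_assertion` are hypothetical choices made for illustration.

```python
import re

# Hypothetical trigger phrases for illustration only; not taken from the thesis.
NEGATION_TRIGGERS = ("no evidence of", "negative for", "without", "no")
UNCERTAINTY_TRIGGERS = ("suspicious for", "possible", "cannot exclude")


def classify_assertion(sentence, entity, window=5):
    """Label an entity mention as 'negated', 'uncertain' or 'affirmed'.

    A trigger counts only if it appears in the `window` whitespace-separated
    tokens immediately before the first occurrence of the entity.
    """
    text = sentence.lower()
    idx = text.find(entity.lower())
    if idx < 0:
        raise ValueError("entity not found in sentence")
    context = " ".join(text[:idx].split()[-window:])
    # Negation is checked first, so it wins if both kinds of trigger occur.
    for label, triggers in (("negated", NEGATION_TRIGGERS),
                            ("uncertain", UNCERTAINTY_TRIGGERS)):
        for trigger in triggers:
            if re.search(r"\b" + re.escape(trigger) + r"\b", context):
                return label
    return "affirmed"


print(classify_assertion("There is no evidence of lymphovascular invasion.",
                         "lymphovascular invasion"))
print(classify_assertion("Findings are suspicious for melanoma.", "melanoma"))
print(classify_assertion("Invasive melanoma is present.", "melanoma"))
```

A structured report populated without such a step would record "lymphovascular invasion" as present in the first sentence, which is the quality problem the abstract describes.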

    Real-time classifiers from free-text for continuous surveillance of small animal disease

    A wealth of information of epidemiological importance is held within unstructured narrative clinical records. Text mining provides computational techniques for extracting usable information from the language used to communicate between humans, including the spoken and written word. The aim of this work was to develop text-mining methodologies capable of rendering the large volume of information within veterinary clinical narratives accessible for research and surveillance purposes. The free-text records collated within the dataset of the Small Animal Veterinary Surveillance Network formed the development material and target of this work. The efficacy of pre-existing clinician-assigned coding applied to the dataset was evaluated, and the nature of the notation and vocabulary used in documenting consultations was explored and described. Consultation records were pre-processed to improve human and software readability, and software was developed to redact incidental identifiers present within the free text. An automated system able to classify records for the presence of clinical signs, utilising only information present within the free-text record, was developed with the aim of facilitating timely detection of spatio-temporal trends in clinical signs. Clinician-assigned main-reason-for-visit coding provided a poor summary of the large quantity of information exchanged during a veterinary consultation, and the nature of the coding and questionnaire triggering further obfuscated information. Delineation of the previously undocumented veterinary clinical sublanguage identified common themes and their manner of documentation; this was key to the development of programmatic methods. A rule-based classifier using logically chosen dictionaries, sequential processing and data masking redacted identifiers while maintaining the research usability of records.
    Highly sensitive and specific free-text classification was achieved by applying classifiers for individual clinical signs within a context-sensitive scaffold, which permitted or prohibited matching depending on the clinical context in which a clinical sign was documented. The mean sensitivity achieved within an unseen test dataset was 98.17% (74.47, 99.9) and the mean specificity was 99.94% (77.1, 100.0). When the classifiers were used in combination to identify animals with any of a combination of gastrointestinal clinical signs, the sensitivity achieved was 99.44% (95% CI: 98.57, 99.78) and the specificity 99.74% (95% CI: 99.62, 99.83). This work illustrates the importance, utility and promise of free-text classification of clinical records and provides a framework within which this is possible whilst respecting the confidentiality of client and clinician.
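For readers less familiar with the reported metrics, the sketch below shows how sensitivity and specificity are computed from a confusion matrix, together with a 95% Wilson score interval. The Wilson method is an assumption (the abstract does not say how its confidence intervals were derived), and the counts are invented for illustration rather than taken from the study.

```python
import math


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true positives that were detected
    specificity = tn / (tn + fp)   # share of true negatives correctly rejected
    return sensitivity, specificity


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts, not the study's data.
tp, fn, tn, fp = 90, 10, 80, 20
sens, spec = sensitivity_specificity(tp, fn, tn, fp)
low, high = wilson_interval(tp, tp + fn)
print(f"sensitivity {sens:.2%} (95% CI: {low:.2%}, {high:.2%})")
print(f"specificity {spec:.2%}")
```

The Wilson interval is often preferred over the simple normal approximation when a proportion is close to 0 or 1, as the sensitivities and specificities reported here are.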