
    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning

    Using Neural Networks for Relation Extraction from Biomedical Literature

    Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results. Comment: Artificial Neural Networks book (Springer) - Chapter 1

    Cognition-based approaches for high-precision text mining

    This research improves the precision of information extraction from free-form text via the use of cognitive-based approaches to natural language processing (NLP). Cognitive-based approaches are an important, and relatively new, area of research in NLP and search, as well as linguistics. Cognitive approaches enable significant improvements in both the breadth and depth of knowledge extracted from text. This research has made contributions in the areas of a cognitive approach to automated concept recognition in. Cognitive approaches to search, also called concept-based search, have been shown to improve search precision. Given the tremendous amount of electronic text generated in our digital and connected world, cognitive approaches enable substantial opportunities in knowledge discovery. The generation and storage of electronic text is ubiquitous, hence opportunities for improved knowledge discovery span virtually all knowledge domains. While cognition-based search offers superior approaches, challenges exist due to the need to mimic, even in the most rudimentary way, the extraordinary powers of human cognition. This research addresses these challenges in the key area of a cognition-based approach to automated concept recognition. In addition, it resulted in a semantic processing system framework for use in applications in any knowledge domain. Confabulation theory was applied to the problem of automated concept recognition. This is a relatively new theory of cognition using a non-Bayesian measure, called cogency, for predicting the results of human cognition. An innovative distance measure, derived from cogent confabulation and called inverse cogency, was used to rank-order candidate concepts during the recognition process. When used with a multilayer perceptron, it improved the precision of concept recognition by 5% over published benchmarks. Additional precision improvements are anticipated.
These research steps build a foundation for cognition-based, high-precision text mining. In the long term, it is anticipated that this foundation will enable a cognitive-based approach to automated ontology learning. Such automated ontology learning will mimic human language cognition, and will, in turn, enable the practical use of cognitive-based approaches in virtually any knowledge domain. (Abstract, page iii)

    Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches

    While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontology from free-text documents. In this dissertation, I extended these methodologies into biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This previously unconducted review was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches that include three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method revealed an average of 21% and 11% new concept acceptance rates, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%. 
As a result of this study (published in the Journal of Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough that it can analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships would be missed as domain knowledge evolves
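The symbolic approach named in this abstract, the Hearst method, relies on lexico-syntactic patterns such as "X such as A, B and C" to propose "A is a kind of X" candidates. The snippet below is only a minimal sketch of that idea: the single regular expression, the function name `hearst_such_as`, and the example sentence are illustrative assumptions, not the patterns or data used in the dissertation, and terms are limited to single words for simplicity.

```python
import re

# Hypothetical, simplified pattern: one-word hypernym, then "such as",
# then a comma/"and"/"or"-separated list of one-word hyponyms.
SUCH_AS = re.compile(r"(\w+) such as ([\w ,]+)")

# Separators inside the hyponym list: ", ", ", and ", " and ", " or ".
LIST_SEP = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def hearst_such_as(sentence):
    """Return (hyponym, hypernym) candidate pairs found in the sentence."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in LIST_SEP.split(match.group(2)):
            hyponym = hyponym.strip()
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


if __name__ == "__main__":
    text = "Patients presented with tumors such as glioma, melanoma and lymphoma."
    for hyponym, hypernym in hearst_such_as(text):
        print(f"{hyponym} -> {hypernym}")
```

In a real enrichment pipeline the extracted pairs would only be candidates; as the abstract notes, they were reviewed for acceptance before being added to an ontology.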

    Doctor of Philosophy

    Dissertation. Manual annotation of clinical texts is often used as a method of generating reference standards that provide data for training and evaluation of Natural Language Processing (NLP) systems. Manually annotating clinical texts is time consuming, expensive, and requires considerable cognitive effort on the part of human reviewers. Furthermore, reference standards must be generated in ways that produce consistent and reliable data but must also be valid in order to adequately evaluate the performance of those systems. The amount of labeled data necessary varies depending on the level of analysis, the complexity of the clinical use case, and the methods that will be used to develop automated machine systems for information extraction and classification. Evaluating methods that potentially reduce cost and manual human workload, introduce task efficiencies, and reduce the amount of labeled data necessary to train NLP tools for specific clinical use cases is an active area of research inquiry in the clinical NLP domain. This dissertation integrates a mixed methods approach using methodologies from cognitive science and artificial intelligence with manual annotation of clinical texts. Aim 1 of this dissertation identifies factors that affect manual annotation of clinical texts. These factors are further explored by evaluating approaches that may introduce efficiencies into manual review tasks applied to two different NLP development areas: semantic annotation of clinical concepts and identification of information representing Protected Health Information (PHI) as defined by HIPAA. Both experiments integrate different priming mechanisms using noninteractive and machine-assisted methods. The main hypothesis for this research is that integrating pre-annotation or other machine-assisted methods within manual annotation workflows will improve efficiency of manual annotation tasks without diminishing the quality of generated reference standards.

    Sentence Simplification for Text Processing

    A thesis submitted in partial fulfilment of the requirement of the University of Wolverhampton for the degree of Doctor of Philosophy. Propositional density and syntactic complexity are two features of sentences which affect the ability of humans and machines to process them effectively. In this thesis, I present a new approach to automatic sentence simplification which processes sentences containing compound clauses and complex noun phrases (NPs) and converts them into sequences of simple sentences which contain fewer of these constituents and have reduced per sentence propositional density and syntactic complexity. My overall approach is iterative and relies on both machine learning and handcrafted rules. It implements a small set of sentence transformation schemes, each of which takes one sentence containing compound clauses or complex NPs and converts it into one or two simplified sentences containing fewer of these constituents (Chapter 5). The iterative algorithm applies the schemes repeatedly and is able to simplify sentences which contain arbitrary numbers of compound clauses and complex NPs. The transformation schemes rely on automatic detection of these constituents, which may take a variety of forms in input sentences. In the thesis, I present two new shallow syntactic analysis methods which facilitate the detection process. The first of these identifies various explicit signs of syntactic complexity in input sentences and classifies them according to their specific syntactic linking and bounding functions. I present the annotated resources used to train and evaluate this sign tagger (Chapter 2) and the machine learning method used to implement it (Chapter 3). The second syntactic analysis method exploits the sign tagger and identifies the spans of compound clauses and complex NPs in input sentences. In Chapter 4 of the thesis, I describe the development and evaluation of a machine learning approach performing this task.
This chapter also presents a new annotated dataset supporting this activity. In the thesis, I present two implementations of my approach to sentence simplification. One of these exploits handcrafted rule activation patterns to detect different parts of input sentences which are relevant to the simplification process. The other implementation uses my machine learning method to identify compound clauses and complex NPs for this purpose. Intrinsic evaluation of the two implementations is presented in Chapter 6 together with a comparison of their performance with several baseline systems. The evaluation includes comparisons of system output with human-produced simplifications, automated estimations of the readability of system output, and surveys of human opinions on the grammaticality, accessibility, and meaning of automatically produced simplifications. Chapter 7 presents extrinsic evaluation of the sentence simplification method exploiting handcrafted rule activation patterns. The extrinsic evaluation involves three NLP tasks: multidocument summarisation, semantic role labelling, and information extraction. Finally, in Chapter 8, conclusions are drawn and directions for future research considered
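The iterative scheme described in this abstract (a transformation takes one sentence and returns one or two simpler ones, and is reapplied until nothing more can be simplified) can be sketched as follows. This is a toy illustration under stated assumptions: the only transformation rule here splits on the literal strings ", and " and ", but ", which stands in for the thesis's sign tagger and learned span detection; the names `split_compound` and `simplify` are hypothetical.

```python
from collections import deque

# Hypothetical stand-in for the thesis's transformation schemes:
# split a sentence at the first clause-level ", and " or ", but ".
CONJUNCTIONS = (", and ", ", but ")


def split_compound(sentence):
    """Return one sentence (unchanged) or two simpler sentences."""
    for conj in CONJUNCTIONS:
        left, sep, right = sentence.partition(conj)
        if sep and left and right:
            first = left.rstrip(".") + "."
            second = right[0].upper() + right[1:]
            return [first, second]
    return [sentence]


def simplify(sentence):
    """Apply the transformation repeatedly until no rule fires."""
    queue = deque([sentence])
    finished = []
    while queue:
        current = queue.popleft()
        parts = split_compound(current)
        if len(parts) == 1:
            finished.append(current)          # nothing left to split
        else:
            queue.extendleft(reversed(parts))  # re-examine both halves, in order
    return finished


if __name__ == "__main__":
    text = "The drug was given, and the fever fell, but the rash stayed."
    for simple in simplify(text):
        print(simple)
```

Because each output sentence is pushed back onto the queue, a sentence with any number of such conjunctions is reduced step by step, which mirrors the "arbitrary numbers of compound clauses" property the abstract mentions, though with a far cruder detector.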

    Getting More out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics.

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK’s largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors’ own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

    Information extraction from medication leaflets

    Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia. Universidade do Porto. 201