
    Extraction of chemical-induced diseases using prior knowledge and textual information

    We describe our approach to the chemical-disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: automatic disease named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID).
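    The abstract above describes a feature-based classifier for the CID subtask but does not spell out RELigator's feature set or learning algorithm. Below is a minimal sketch, in Python with scikit-learn, of how a classifier over prior-knowledge and textual features could be assembled; the feature extractor, the toy data and the logistic regression learner are illustrative stand-ins, not the published system.

        # Minimal sketch of a feature-based chemical-induced disease (CID) classifier.
        # The features and learner are illustrative placeholders, not RELigator itself.
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        def extract_features(chemical, disease, abstract_text, prior_knowledge):
            """Hypothetical mix of prior-knowledge and textual features for one pair."""
            return [
                float(prior_knowledge.get((chemical, disease), 0.0)),     # known association
                float(abstract_text.count(chemical)),                     # chemical mentions
                float(abstract_text.count(disease)),                      # disease mentions
                float(f"{chemical}-induced {disease}" in abstract_text),  # trigger pattern
            ]

        # Toy training data: one positive and one negative candidate pair.
        prior = {("aspirin", "gastric ulcer"): 1.0}
        X = [extract_features("aspirin", "gastric ulcer",
                              "aspirin-induced gastric ulcer was observed", prior),
             extract_features("aspirin", "headache",
                              "aspirin relieved the patient's headache", prior)]
        y = [1, 0]  # 1 = chemical-induced disease relation present, 0 = absent

        clf = make_pipeline(StandardScaler(), LogisticRegression())
        clf.fit(X, y)
        print(clf.predict(X))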

    Doctor of Philosophy

    Biomedical data are a rich source of information and knowledge. Not only are they useful for direct patient care, but they may also offer answers to important population-based questions. Creating an environment where advanced analytics can be performed against biomedical data is nontrivial, however. Biomedical data are currently scattered across multiple systems with heterogeneous data, and integrating these data is a bigger task than humans can realistically do by hand; therefore, automatic biomedical data integration is highly desirable but has never been fully achieved. This dissertation introduces new algorithms that were devised to support automatic and semiautomatic integration of heterogeneous biomedical data. The new algorithms incorporate both data mining and biomedical informatics techniques to create "concept bags" that are used to compute similarity between data elements in the same way that "word bags" are compared in data mining. Concept bags are composed of controlled medical vocabulary concept codes that are extracted from text using named-entity recognition software. To test the new algorithm, three biomedical text similarity use cases were examined: automatically aligning data elements between heterogeneous data sets, determining degrees of similarity between medical terms using a published benchmark, and determining similarity between ICU discharge summaries. The method is highly configurable, and five different versions were tested. The concept bag method performed particularly well at aligning data elements and outperformed the compared algorithms by more than 5%. Another configuration that included hierarchical semantics performed particularly well at matching medical terms, meeting or exceeding 30 of 31 other published results on the same benchmark. Results for the third scenario, computing ICU discharge summary similarity, were less successful: correlations between the methods were low, including between terminologists. The concept bag algorithms performed consistently and comparatively well and appear to be viable options for multiple scenarios. New applications of the method and ideas for improving the algorithm are discussed as future work, including several performance enhancements, configuration-based enhancements, and concept vector weighting using TF-IDF.
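    The "concept bag" comparison described above can be made concrete with a short sketch: each text is reduced to the bag of controlled-vocabulary concept codes that a named-entity recognition tool finds in it, and two bags are compared the way word bags are. The Python below is a minimal illustration; the toy recogniser and the concept code are stand-ins for a real NER tool and its UMLS output, and the count-based cosine could be swapped for the TF-IDF weighting mentioned as future work.

        # Minimal sketch of concept-bag similarity: texts become bags of concept
        # codes produced by NER, and bags are compared with cosine similarity.
        from collections import Counter
        import math

        def concept_bag(text, recognizer):
            """Multiset of concept codes recognised in the text (e.g. UMLS CUIs)."""
            return Counter(recognizer(text))

        def cosine_similarity(bag_a, bag_b):
            """Cosine over raw code counts; TF-IDF weights could be used instead."""
            dot = sum(bag_a[c] * bag_b[c] for c in set(bag_a) & set(bag_b))
            norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
            norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
            return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

        # Toy recogniser standing in for real NER software; the code C0020538
        # (hypertensive disease) is used purely as an illustration.
        toy_ner = lambda text: ["C0020538"] if "hypertension" in text.lower() else []
        a = concept_bag("History of hypertension.", toy_ner)
        b = concept_bag("Blood pressure readings consistent with hypertension.", toy_ner)
        print(cosine_similarity(a, b))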

    Combining Lexical and Semantic Methods of Inter-terminology Mapping Using the UMLS

    The need for inter-terminology mapping is constantly increasing with the growth in the volume of electronically captured biomedical data and the demand to re-use the same data for secondary purposes. Using the UMLS as a knowledge base, semantically-based and lexically-based mappings were generated from SNOMED CT to ICD-9-CM terms and compared to a gold standard. Semantic mapping performed better than lexical mapping in terms of coverage, recall and precision. As the two mapping methods are orthogonal, the two sets of mappings can be used to validate and enhance each other. A method of combining the mappings based on the precision level of sub-categories in each method was derived. The combined method outperformed both methods, achieving coverage of 91%, recall of 43% and precision of 27%. It is also possible to customize the method of combination to optimize performance according to the task at hand.
    Keywords: Unified Medical Language System, controlled terminology, inter-terminology mapping
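    The combination step described above, choosing between candidate mappings according to the precision of the sub-category each method assigned them to, can be sketched as follows. The Python below is a minimal illustration; the sub-category names, precision figures and example codes are invented for the sketch, not taken from the paper.

        # Minimal sketch of combining semantic and lexical SNOMED CT -> ICD-9-CM
        # mappings by the estimated precision of each mapping's sub-category.
        # All labels, precision values and codes below are illustrative only.
        PRECISION = {
            ("semantic", "exact_match"): 0.60,
            ("semantic", "broader_than"): 0.30,
            ("lexical", "exact_string"): 0.45,
            ("lexical", "normalized_string"): 0.20,
        }

        def combine(semantic_maps, lexical_maps):
            """Keep, for each SNOMED CT code, the candidate ICD-9-CM mapping whose
            (method, sub-category) has the highest estimated precision."""
            best = {}
            for method, maps in (("semantic", semantic_maps), ("lexical", lexical_maps)):
                for snomed, (icd9, subcat) in maps.items():
                    score = PRECISION.get((method, subcat), 0.0)
                    if snomed not in best or score > best[snomed][1]:
                        best[snomed] = (icd9, score)
            return {snomed: icd9 for snomed, (icd9, _) in best.items()}

        # Example: one term mapped by both methods; the semantic exact match wins.
        print(combine({"22298006": ("410.9", "exact_match")},
                      {"22298006": ("410.0", "normalized_string")}))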

    Exploring the relationship between age and health conditions using electronic health records: from single diseases to multimorbidities

    Background: Two enormous challenges facing healthcare systems are ageing and multimorbidity. Clinicians, policymakers, healthcare providers and researchers need to know “who gets which diseases when” in order to effectively prevent, detect and manage multiple conditions. Identification of ageing-related diseases (ARDs) is a starting point for research into common biological pathways in ageing. Examining multimorbidity clusters can facilitate a shift from the single-disease paradigm that pervades medical research and practice to models which reflect the reality of the patient population.
    Aim: To examine how age influences an individual’s likelihood of developing single and multiple health conditions over the lifecourse.
    Methods and Outputs: I used primary care and hospital admission electronic health records (EHRs) of 3,872,451 individuals from the Clinical Practice Research Datalink (CPRD) linked to the Hospital Episode Statistics admitted patient care (HES-APC) dataset in England from 1 April 2010 to 31 March 2015. In collaboration with Professor Aroon Hingorani, Dr Osman Bhatti, Dr Shanaz Husain, Dr Shailen Sutaria, Professor Dorothea Nitsch, Mrs Melanie Hingorani, Dr Constantinos Parisinos, Dr Tom Lumbers and Dr Reecha Sofat, I derived the case definitions for 308 clinically important health conditions by harmonising Read, ICD-10 and OPCS-4 codes across primary and secondary care records in England. I calculated the age-specific incidence rate, period prevalence and median age at first recorded diagnosis for these conditions and described the 50 most common diseases in each decade of life. I developed a protocol for identifying ARDs using machine-learning and actuarial techniques. Finally, I identified highly correlated multimorbidity clusters and created a tool to visualise comorbidity clusters using a network approach.
    Conclusions: I have developed case definitions (with a panel of clinicians) and calculated disease frequency estimates for 308 clinically important health conditions in the NHS in England. I have described patterns of ageing and multimorbidity using these case definitions, and produced an online app for interrogating comorbidities for an index condition. This work facilitates future research into ageing pathways and multimorbidity.
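    As a small illustration of the disease frequency estimates mentioned above, the sketch below computes age-specific incidence rates per 1,000 person-years from ages at first recorded diagnosis and person-time per age band. It is a generic, minimal calculation in Python; the age bands, data structures and numbers are made up for the example rather than taken from the thesis.

        # Minimal sketch of an age-specific incidence calculation: first recorded
        # diagnoses are counted per age band and divided by person-years at risk.
        from collections import defaultdict

        AGE_BANDS = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
                     (50, 60), (60, 70), (70, 80), (80, 120)]

        def band(age):
            """Return the age band containing the given age."""
            return next(b for b in AGE_BANDS if b[0] <= age < b[1])

        def incidence_per_1000(first_diagnosis_ages, person_years_by_band):
            """Incidence rate per 1,000 person-years in each age band."""
            cases = defaultdict(int)
            for age in first_diagnosis_ages:       # age at first recorded diagnosis
                cases[band(age)] += 1
            return {b: 1000.0 * cases[b] / person_years_by_band[b]
                    for b in person_years_by_band if person_years_by_band[b] > 0}

        # Toy example: three incident cases and illustrative person-time per band.
        print(incidence_per_1000([45, 52, 67],
                                 {(40, 50): 800.0, (50, 60): 950.0, (60, 70): 700.0}))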

    Large-scale alignment and validation of terminologies in the medical domain

    This work presents a semi-automated method for mapping terminologies at a large scale, together with subsequent validation of the resulting alignments. The method combines different techniques to increase the degree of automation.

    Ascertaining Pain in Mental Health Records: Combining Empirical and Knowledge-Based Methods for Clinical Modelling of Electronic Health Record Text

    In recent years, state-of-the-art clinical Natural Language Processing (NLP), as in other domains, has been dominated by neural networks and other statistical models. In contrast to the unstructured nature of Electronic Health Record (EHR) text, biomedical knowledge is increasingly available in structured and codified forms, underpinned by curated databases, machine-readable clinical guidelines, and logically defined terminologies. This thesis examines the incorporation of external medical knowledge into clinical NLP and tests these methods on a use case of ascertaining physical pain in clinical notes of mental health records.
    Pain is a common reason for accessing healthcare resources and has been a growing area of research, especially its impact on mental health. Pain also presents a unique NLP problem due to its ambiguous nature and the varying circumstances in which it can be used. For these reasons, pain was chosen as the use case, making it a good case study for the application of the methods explored in this thesis. Models are built by assimilating both structured medical knowledge and clinical NLP, leveraging the inherent relations that exist within medical ontologies. The data source used in this project is a mental health EHR database called CRIS, which contains de-identified patient records from the South London and Maudsley NHS Foundation Trust, one of the largest mental health providers in Western Europe.
    A lexicon of pain terms was developed to identify documents within CRIS mentioning pain-related terms. Gold standard annotations were created by manually annotating these documents. These gold standard annotations were used to build models for a binary classification task, with the objective of classifying sentences from the clinical text as “relevant”, indicating that the sentence contains relevant mentions of pain, i.e., physical pain affecting the patient, or “not relevant”, indicating that the sentence does not contain mentions of physical pain, or that the mention does not relate to the patient (e.g., someone else in physical pain). Two models incorporating structured medical knowledge were built: (1) a transformer-based model, SapBERT, that utilises a knowledge graph of the UMLS ontology, and (2) a knowledge graph embedding model that utilises embeddings from SNOMED CT, which was then used to build a random forest classifier. This was achieved by modelling the clinical pain terms and their relations from SNOMED CT as knowledge graph embeddings, thus combining the data-driven view of clinical language with the logical view of medical knowledge. These models were compared with NLP models (binary classifiers) that do not incorporate such structured medical knowledge: (1) a transformer-based model, BERT_base, and (2) a random forest classifier model. Amongst the two transformer-based models, SapBERT performed better at the classification task (F1-score: 0.98), and amongst the random forest models, the one incorporating knowledge graph embeddings performed better (F1-score: 0.94).
    The SapBERT model was run on sentences from a cohort of patients within CRIS, with the objective of conducting a prevalence study to understand the distribution of pain based on sociodemographic and diagnostic factors.
    The contribution of this research is both methodological and practical, showing the difference between a conventional NLP approach of binary classification and one that incorporates external knowledge, and further utilising the models obtained from both these approaches in a prevalence study designed based on input from clinicians and a patient and public involvement group. The results emphasise the significance of going beyond the conventional approach to NLP when addressing complex issues such as pain.
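    As an illustration of the second knowledge-aware model described above (SNOMED CT knowledge graph embeddings feeding a random forest), the sketch below represents each sentence by the average of pre-trained concept embeddings for the concepts recognised in it and classifies it as relevant or not relevant. The embedding lookup, the concept recogniser and the example SNOMED CT code are hypothetical placeholders; this is a minimal sketch, not the thesis implementation.

        # Minimal sketch: sentence vectors from averaged SNOMED CT concept
        # embeddings, classified with a random forest. Embeddings are random
        # stand-ins for pre-trained knowledge graph embeddings.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        EMBED_DIM = 64

        def sentence_vector(sentence, recognize, concept_embeddings):
            """Average the graph embeddings of concepts recognised in the sentence."""
            vectors = [concept_embeddings[c] for c in recognize(sentence)
                       if c in concept_embeddings]
            return np.mean(vectors, axis=0) if vectors else np.zeros(EMBED_DIM)

        # Toy setup: a random vector for one illustrative code and a keyword recogniser.
        rng = np.random.default_rng(0)
        kg_embeddings = {"22253000": rng.normal(size=EMBED_DIM)}  # pain (finding), illustrative
        recognize = lambda s: ["22253000"] if "pain" in s.lower() else []

        # 1 = relevant mention of physical pain affecting the patient, 0 = not relevant.
        sentences = ["Complains of severe chest pain.", "No physical concerns reported."]
        y = [1, 0]
        X = np.stack([sentence_vector(s, recognize, kg_embeddings) for s in sentences])

        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
        print(clf.predict(X))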