Extraction of chemical-induced diseases using prior knowledge and textual information
We describe our approach to the chemical-disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: automatic disease named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set, comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID).
Doctor of Philosophy dissertation
Biomedical data are a rich source of information and knowledge. Not only are they useful for direct patient care, but they may also offer answers to important population-based questions. Creating an environment where advanced analytics can be performed against biomedical data is nontrivial, however. Biomedical data are currently scattered across multiple systems with heterogeneous data, and integrating these data is a bigger task than humans can realistically do by hand; therefore, automatic biomedical data integration is highly desirable but has never been fully achieved. This dissertation introduces new algorithms that were devised to support automatic and semiautomatic integration of heterogeneous biomedical data. The new algorithms incorporate both data mining and biomedical informatics techniques to create "concept bags" that are used to compute similarity between data elements in the same way that "word bags" are compared in data mining. Concept bags are composed of controlled medical vocabulary concept codes that are extracted from text using named-entity recognition software. To test the new algorithm, three biomedical text similarity use cases were examined: automatically aligning data elements between heterogeneous data sets, determining degrees of similarity between medical terms using a published benchmark, and determining similarity between ICU discharge summaries. The method is highly configurable and 5 different versions were tested. The concept bag method performed particularly well aligning data elements and outperformed the compared algorithms by more than 5%. Another configuration that included hierarchical semantics performed particularly well at matching medical terms, meeting or exceeding 30 of 31 other published results using the same benchmark. Results for the third scenario of computing ICU discharge summary similarity were less successful. Correlations between multiple methods were low, including between terminologists.
The concept bag algorithms performed consistently and comparatively well and appear to be viable options for multiple scenarios. New applications of the method and ideas for improving the algorithm are discussed for future work, including several performance enhancements, configuration-based enhancements, and concept vector weighting using the TF-IDF formulas.
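As a loose illustration of comparing concept bags "in the same way that word bags are compared in data mining", the sketch below computes a Jaccard overlap and a TF-IDF-weighted cosine between two bags of concept codes. The UMLS-style codes, document frequencies, and corpus size are hypothetical placeholders; the named-entity extraction step described above is assumed to have already produced the bags.

```python
import math
from collections import Counter

def jaccard(bag_a, bag_b):
    """Set-overlap similarity between two concept bags."""
    a, b = set(bag_a), set(bag_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def tfidf_cosine(bag_a, bag_b, doc_freq, n_docs):
    """Cosine similarity of TF-IDF-weighted concept vectors."""
    def weights(bag):
        tf = Counter(bag)
        return {c: tf[c] * math.log(n_docs / doc_freq.get(c, 1)) for c in tf}
    wa, wb = weights(bag_a), weights(bag_b)
    dot = sum(wa[c] * wb.get(c, 0.0) for c in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical concept codes extracted by an NER tool
bag1 = ["C0020538", "C0011849", "C0027051"]
bag2 = ["C0020538", "C0011849", "C0004096"]
doc_freq = {"C0020538": 50, "C0011849": 40, "C0027051": 5, "C0004096": 8}
print(jaccard(bag1, bag2))                          # 0.5
print(tfidf_cosine(bag1, bag2, doc_freq, 100))
```

The TF-IDF variant down-weights concepts that appear in many documents, so rare shared concepts contribute more to similarity than ubiquitous ones.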
Combining Lexical and Semantic Methods of Inter-terminology Mapping Using the UMLS
The need for inter-terminology mapping is constantly increasing with the growth in the volume of electronically captured biomedical data and the demand to re-use the same data for secondary purposes. Using the UMLS as a knowledge base, semantically-based and lexically-based mappings were generated from SNOMED CT to ICD9CM terms and compared to a gold standard. Semantic mapping performed better than lexical mapping in terms of coverage, recall and precision. As the two mapping methods are orthogonal, the two sets of mappings can be used to validate and enhance each other. A method of combining the mappings based on the precision level of sub-categories in each method was derived. The combined method outperformed both methods, achieving coverage of 91%, recall of 43% and precision of 27%. It is also possible to customize the method of combination to optimize performance according to the task at hand.
Keywords: Unified Medical Language System, controlled terminology, inter-terminology mapping
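The evaluation described above can be sketched as set operations over mapping pairs scored against a gold standard. The SNOMED CT and ICD-9-CM identifiers below are invented placeholders, and the naive union stands in for the paper's precision-weighted sub-category combination.

```python
def mapping_metrics(predicted, gold, source_terms):
    """Evaluate (source, target) mapping pairs against a gold standard.

    coverage:  fraction of source terms receiving at least one mapping
    recall:    fraction of gold pairs that were predicted
    precision: fraction of predicted pairs present in the gold standard
    """
    mapped_sources = {src for src, _ in predicted}
    coverage = len(mapped_sources & set(source_terms)) / len(source_terms)
    tp = len(predicted & gold)
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(predicted) if predicted else 0.0
    return coverage, recall, precision

# Hypothetical SNOMED CT -> ICD-9-CM mapping pairs
semantic = {("S1", "I1"), ("S2", "I2"), ("S3", "I9")}
lexical  = {("S1", "I1"), ("S4", "I4")}
combined = semantic | lexical   # naive union of the two orthogonal methods
gold     = {("S1", "I1"), ("S2", "I2"), ("S4", "I4"), ("S5", "I5")}
sources  = ["S1", "S2", "S3", "S4", "S5"]
print(mapping_metrics(combined, gold, sources))  # (0.8, 0.75, 0.75)
```

Even this toy union shows the combination effect: each method alone misses gold pairs that the other contributes.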
Exploring the relationship between age and health conditions using electronic health records: from single diseases to multimorbidities
Background: Two enormous challenges facing healthcare systems are ageing and multimorbidity. Clinicians, policymakers, healthcare providers and researchers need to know "who gets which diseases when" in order to effectively prevent, detect and manage multiple conditions. Identification of ageing-related diseases (ARDs) is a starting point for research into common biological pathways in ageing. Examining multimorbidity clusters can facilitate a shift from the single-disease paradigm that pervades medical research and practice to models which reflect the reality of the patient population.
Aim: To examine how age influences an individual's likelihood of developing single and multiple health conditions over the lifecourse.
Methods and Outputs: I used primary care and hospital admission electronic health records (EHRs) of 3,872,451 individuals from the Clinical Practice Research Datalink (CPRD) linked to the Hospital Episode Statistics admitted patient care (HES-APC) dataset in England from 1 April 2010 to 31 March 2015. In collaboration with Professor Aroon Hingorani, Dr Osman Bhatti, Dr Shanaz Husain, Dr Shailen Sutaria, Professor Dorothea Nitsch, Mrs Melanie Hingorani, Dr Constantinos Parisinos, Dr Tom Lumbers and Dr Reecha Sofat, I derived the case definitions for 308 clinically important health conditions, by harmonising Read, ICD-10 and OPCS-4 codes across primary and secondary care records in England. I calculated the age-specific incidence rate, period prevalence and median age at first recorded diagnosis for these conditions and described the 50 most common diseases in each decade of life. I developed a protocol for identifying ARDs using machine-learning and actuarial techniques. Finally, I identified highly correlated multimorbidity clusters and created a tool to visualise comorbidity clusters using a network approach.
Conclusions: I have developed case definitions (with a panel of clinicians) and calculated disease frequency estimates for 308 clinically important health conditions in the NHS in England. I have described patterns of ageing and multimorbidity using these case definitions, and produced an online app for interrogating comorbidities for an index condition. This work facilitates future research into ageing pathways and multimorbidity.
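The frequency measures named above follow standard epidemiological definitions; a minimal sketch (with invented numbers, not CPRD/HES figures) might look like:

```python
def period_prevalence(cases_in_period, population):
    """Individuals with the condition recorded at any point in the
    period, per head of the population at risk."""
    return cases_in_period / population

def incidence_rate(new_cases, person_years):
    """New diagnoses per person-year of follow-up."""
    return new_cases / person_years

def median_age_at_first_diagnosis(ages):
    """Median of the ages at which each individual was first diagnosed."""
    s = sorted(ages)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# Illustrative numbers only
print(period_prevalence(1200, 100_000))                      # 0.012
print(incidence_rate(240, 480_000))                          # 0.0005
print(median_age_at_first_diagnosis([55, 61, 67, 70, 74]))   # 67
```

Age-specific rates simply repeat the incidence calculation within each age band, dividing new cases in the band by person-years contributed by that band.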
Large-scale alignment and validation of terminologies in the medical domain
This work presents a semi-automated method to map terminologies on a large scale, followed by validation of the resulting alignments. The method combines different techniques to increase the automation.
Ascertaining Pain in Mental Health Records: Combining Empirical and Knowledge-Based Methods for Clinical Modelling of Electronic Health Record Text
In recent years, state-of-the-art clinical Natural Language Processing (NLP), as in other domains, has been dominated by neural networks and other statistical models. In contrast to the unstructured nature of Electronic Health Record (EHR) text, biomedical knowledge is increasingly available in structured and codified forms, underpinned by curated databases, machine-readable clinical guidelines, and logically defined terminologies. This thesis examines the incorporation of external medical knowledge into clinical NLP and tests these methods on a use case of ascertaining physical pain in clinical notes of mental health records. Pain is a common reason for accessing healthcare resources and has been a growing area of research, especially its impact on mental health. Pain also presents a unique NLP problem due to its ambiguous nature and the varying circumstances in which it can be used. For these reasons, pain was chosen as the use case for the methods explored in this thesis. Models are built by assimilating both structured medical knowledge and clinical NLP and leveraging the inherent relations that exist within medical ontologies. The data source used in this project is a mental health EHR database called CRIS, which contains de-identified patient records from the South London and Maudsley NHS Foundation Trust, one of the largest mental health providers in Western Europe. A lexicon of pain terms was developed to identify documents within CRIS mentioning pain-related terms. Gold standard annotations were created by conducting manual annotations on these documents.
These gold standard annotations were used to build models for a binary classification task, with the objective of classifying sentences from the clinical text as 'relevant', indicating the sentence contains relevant mentions of pain (i.e., physical pain affecting the patient), or 'not relevant', indicating the sentence does not contain mentions of physical pain, or the mention does not relate to the patient (e.g., someone else in physical pain). Two models incorporating structured medical knowledge were built:
1. a transformer-based model, SapBERT, that utilises a knowledge graph of the UMLS ontology, and
2. a knowledge graph embedding model that utilises embeddings from SNOMED CT, which was then used to build a random forest classifier.
This was achieved by modelling the clinical pain terms and their relations from SNOMED CT into knowledge graph embeddings, thus combining the data-driven view of clinical language with the logical view of medical knowledge. These models were compared with NLP models (binary classifiers) that do not incorporate such structured medical knowledge:
1. a transformer-based model, BERT_base, and
2. a random forest classifier model.
Amongst the two transformer-based models, SapBERT performed better at the classification task (F1-score: 0.98), and amongst the random forest models, the one incorporating knowledge graph embeddings performed better (F1-score: 0.94).
The SapBERT model was run on sentences from a cohort of patients within CRIS, with the objective of conducting a prevalence study to understand the distribution of pain based on sociodemographic and diagnostic factors. The contribution of this research is both methodological and practical, showing the difference between a conventional NLP approach of binary classification and one that incorporates external knowledge, and further utilising the models obtained from both these approaches in a prevalence study which was designed based on inputs from clinicians and a patient and public involvement group. The results emphasise the significance of going beyond the conventional approach to NLP when addressing complex issues such as pain.
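The F1-scores reported for the classifier comparison follow the usual definition as the harmonic mean of precision and recall; a minimal sketch with illustrative confusion counts (not the thesis's actual counts):

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts for a 'relevant pain mention' classifier
print(round(f1_score(tp=98, fp=2, fn=2), 2))  # 0.98
```

Note that true negatives do not enter the F1 calculation, which is why sensitivity/specificity or likelihood ratios can rank models differently.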
Secondary use of electronic medical records for early identification of raised condition likelihoods in individuals: a machine learning approach
With many symptoms being common to multiple diseases, there is a challenge in producing an initial diagnosis or recommendation for diagnostic tests from a set of symptoms that could have been produced by a number of diseases. Often the initial choice of diagnosis or testing is based on a clinician's impression of the likelihood of that condition in a general population; however the opportunity may exist for modification of these likelihoods based on individuals' recorded medical histories. This data-driven approach utilises existing data and is thus cheap and non-invasive. A method is proposed by which an individual's likelihoods of having specified medical conditions are modified by the similarity of that individual's medical history to the medical histories of other individuals, comparing the prevalence of conditions in those other individuals' records who are similar to the individual of interest versus the prevalence of the conditions in those individuals who are dissimilar. In order to maximise the number of records available for analysis, a process was developed for the merging of data from disparate sources that used different clinical coding systems, including extensive development of a technique for semi-automatically mapping clinical events coded in ICD9-CM to Clinical Terms Version 3 (CTV3), for which no existing mapping table was found. Semantically similar fields in the source code sets were identified and retained in the combined data set. 'Codelists' comprising multiple CTV3 codes for a variety of conditions were built that defined the presence of those conditions within individual records. The hierarchical structure of the CTV3 code table was utilised as a method of identifying codes that differed in structure but had clinically similar or related meaning. The optimum degree of granularity of the coded data to use in identifying similar records was investigated and used in subsequent analysis.
Two methods were used for discovering groups of similar and dissimilar individuals: the 'nearest neighbours' method and the grouping of records using a clustering process. Altered likelihoods for a range of conditions were investigated and results for the nearest-neighbours approach compared to the clustering approach. Results for adjusted condition likelihoods for 18 conditions are reported, together with a discussion of possible reasons for a change, or otherwise, in the condition likelihood, and a discussion of the clinical significance and potential use of information about such a change. Logistic regressions were also performed on a selection of conditions for comparison. KNN performed better than logistic regression when judged by F-score (or by sensitivity and specificity separately); however, the situation was more nuanced when looking at likelihood ratios: logistic regression produced higher (better) positive likelihood ratios, while KNN produced lower (better) negative likelihood ratios. Logistic regression also produced higher odds ratios.
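The likelihood-ratio comparison can be made concrete with a small sketch deriving positive and negative likelihood ratios from sensitivity and specificity. The confusion counts below are invented purely to illustrate how one model can win on LR+ while the other wins on LR-.

```python
def likelihood_ratios(tp, fp, fn, tn):
    """Positive and negative likelihood ratios from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)   # higher is better
    lr_neg = (1 - sensitivity) / specificity   # lower is better
    return lr_pos, lr_neg

# Illustrative counts only, not the study's results
knn = likelihood_ratios(tp=90, fp=20, fn=10, tn=80)      # high sensitivity
logreg = likelihood_ratios(tp=80, fp=10, fn=20, tn=90)   # high specificity
print(tuple(round(x, 3) for x in knn))     # (4.5, 0.125)  -> better LR-
print(tuple(round(x, 3) for x in logreg))  # (8.0, 0.222)  -> better LR+
```

Because LR+ rewards specificity and LR- rewards sensitivity, a sensitivity-leaning model (here the KNN stand-in) and a specificity-leaning model (the logistic regression stand-in) can each look better on one of the two ratios, matching the nuance described above.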