150 research outputs found

    Language modelling for clinical natural language understanding and generation

    One of the long-standing objectives of Artificial Intelligence (AI) is to design and develop algorithms for social good, including tackling public health challenges. In the era of digitisation, with an unprecedented amount of healthcare data captured in digital form, analysing healthcare data at scale can lead to better research into diseases, better monitoring of patient conditions and, most importantly, improved patient outcomes. However, many AI-based analytic algorithms rely solely on structured healthcare data, such as bedside measurements and test results, which account for only 20% of all healthcare data; the remaining 80% is unstructured, including textual data such as clinical notes and discharge summaries, and remains underexplored. Conventional Natural Language Processing (NLP) algorithms designed for clinical applications rely on shallow matching, templates and non-contextualised word embeddings, which leads to a limited understanding of contextual semantics. Although recent contextualised language models have demonstrated promising performance on a variety of NLP tasks in the general domain, most of these generic NLP algorithms struggle with clinical NLP tasks that require biomedical knowledge and reasoning. In addition, there is limited research on generative NLP algorithms that automatically produce clinical reports and summaries from salient clinical information. This thesis aims to design and develop novel NLP algorithms, especially clinically driven contextualised language models, to understand textual healthcare data and generate clinical narratives that can support clinicians, medical scientists and patients. The first contribution of this thesis focuses on capturing phenotypic information of patients from clinical notes, which is important for profiling a patient's situation and improving patient outcomes.
The thesis proposes a novel self-supervised language model, named Phenotypic Intelligence Extraction (PIE), to annotate phenotypes in clinical notes, with detection of contextual synonyms and enhanced reasoning over numerical values. The second contribution demonstrates the utility and benefits of patients' phenotypic features in clinical use cases, predicting patient outcomes in Intensive Care Units (ICU) and identifying patients at risk of specific diseases with better accuracy and model interpretability. The third contribution proposes generative models that produce clinical narratives, automating and accelerating report writing and summarisation by clinicians. The thesis first proposes a novel summarisation language model named PEGASUS, which surpasses or is on par with the state of the art on 12 downstream datasets, including biomedical literature from PubMed. PEGASUS is further extended to generate medical scientific documents from input tabular data.
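PEGASUS-style summarisation pretraining rests on gap-sentence generation: the sentences that best summarise a document are masked out and become the target the model must regenerate. The following is a minimal sketch of that selection step, assuming a simple unigram-F1 stand-in for the ROUGE scoring used in practice; function names are illustrative.

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """F1 overlap between the unigram sets of two texts."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    precision, recall = overlap / len(c), overlap / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences: list[str], k: int) -> list[int]:
    """Indices of the k sentences most representative of the rest of the document."""
    scores = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((unigram_f1(s, rest), i))
    return sorted(i for _, i in sorted(scores, reverse=True)[:k])

def mask_gap_sentences(sentences: list[str], k: int):
    """Return (masked source document, target pseudo-summary) for pretraining."""
    gap = set(select_gap_sentences(sentences, k))
    source = [("[MASK1]" if i in gap else s) for i, s in enumerate(sentences)]
    target = [sentences[i] for i in sorted(gap)]
    return " ".join(source), " ".join(target)
```

The model is then trained to reconstruct the target from the masked source, which makes the pretraining objective resemble abstractive summarisation itself.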

    Deep learning for causal inference on electronic health records

    Cardiovascular diseases (CVD) are the leading cause of mortality around the world, and disentangling cause and effect is central to better understanding and treating these diseases. While randomised clinical trials are the “gold standard” for assessing the effect of an intervention, some hypotheses cannot feasibly be tested in the randomised setting. In these cases, observational studies with appropriate methods of confounding adjustment can deliver reliable evidence concerning the association between an exposure and an outcome. Indeed, trusted conventional statistical models, with subject-area experts guiding confounder selection, have been used to estimate associations in many observational studies; however, when confounding is unknown or the population suffers from complex illness, conventional approaches yield insufficiently adjusted estimates. In parallel, there has recently been unprecedented access to nationally representative multimodal electronic health record (EHR) datasets, alongside advances in statistical learning, including “deep” machine learning, a form of machine learning that captures features automatically and removes the need for expert-driven feature engineering. The aim of this doctoral research was to develop a deep learning approach for causal inference on EHR. To do so in a structured way, the research was split into three investigations: 1) the development of a deep learning model for EHR data and assessment of its risk prediction performance; 2) given the “black box” nature of deep learning, the development of methods to explain the proposed model; and 3) the derivation of a model for causal inference, and application of the models for association estimation in elderly and at-risk patient subgroups. The model, Bidirectional EHR Transformer (BEHRT), was created for EHR representation learning and risk prediction.
The model outperformed several benchmarks for risk prediction on a variety of tasks, including incident heart failure prediction. In the second investigation, explainability analyses showed that the model captured validated risk factors (e.g., hypertension, diabetes and other diseases) and surfaced several further factors that could potentially be preventative of incident heart failure. Lastly, a derivative of BEHRT, Targeted-BEHRT, was developed for association estimation, fusing advances in deep learning and semi-parametric statistics. The model demonstrated superior estimation abilities in several simulated-data experiments and was applied to better understand the effects of antihypertensives, blood pressure and paracetamol on cardiovascular endpoints, mortality and other outcomes in at-risk patients. Overall, the doctoral research has made advances in both methodological and clinical cardiovascular research. While the research focuses on developing methods for the study of cardiovascular diseases, the methods developed and tested have several important implications for epidemiological research in the observational setting at large. Especially in patient groups with pre-existing health issues, the causal models developed can be a more appropriate approach for association analysis than conventional statistical ones. In terms of clinical impact, the research has progressed our understanding of risk and protection in the context of CVD.
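BEHRT borrows BERT's input design for EHR data: a patient's visit history becomes parallel sequences of diagnosis tokens, ages and visit segments, analogous to BERT's token, position and segment inputs. The sketch below shows only this input-construction step; the special tokens and exact layout are illustrative assumptions, not the published implementation.

```python
def encode_patient(visits):
    """Encode a patient's record for a BEHRT-like transformer.

    visits: list of (age_at_visit, [diagnosis codes]).
    Returns parallel token / age / segment sequences: diagnosis codes are
    interleaved with [SEP] between visits, with a leading [CLS], and
    segment ids alternate per visit like BERT's A/B segments.
    """
    tokens, ages, segments = ["[CLS]"], [visits[0][0]], [0]
    for seg, (age, codes) in enumerate(visits):
        for code in codes:
            tokens.append(code)
            ages.append(age)
            segments.append(seg % 2)
        tokens.append("[SEP]")
        ages.append(age)
        segments.append(seg % 2)
    return tokens, ages, segments
```

Each of the three sequences is then embedded and summed before entering the transformer, so the model sees what was diagnosed, at what age, and in which visit.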

    Automating the Annotation of Data through Machine Learning and Semantic Technologies

    The ever-increasing scale and complexity of scientific research is surpassing our means to assimilate newly produced knowledge. Computer tools are necessary for the organisation, retrieval and interpretation of new scientific knowledge and data. The efficacy of such tools requires that research outputs are described by rich machine-readable metadata. Ontologies provide the framework to describe the meaning of knowledge and data unambiguously, so that they may be re-used or combined to synthesise new knowledge. However, manually annotating research with ontology terms, a process called semantic annotation, is also infeasible at this scale. This thesis describes research to develop deep learning-based tools for semantic annotation. The approaches described explore different methods of exploiting the domain knowledge encoded in ontologies to avoid the need for manually curated training corpora. They also take advantage of the inherent integrative capabilities of ontologies, leveraging combinations of heterogeneous knowledge to improve annotation performance and model interpretability. Several models exceeded previous benchmarks for semantic annotation in the biomedical domain. The thesis concludes with a discussion of the strengths and limitations of the methods, and the implications for multi-domain ontology semantic annotation and for explainable artificial intelligence.
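One common way to exploit ontology-encoded knowledge without manual curation is distant supervision: project ontology class labels and synonyms onto raw text to produce silver-standard annotations that can train a model. A minimal longest-match sketch, with a hypothetical two-class ontology fragment (the HPO-style ids are illustrative):

```python
ONTOLOGY = {  # class id -> surface forms (label + synonyms); illustrative
    "HP:0001635": ["heart failure", "cardiac failure"],
    "HP:0000822": ["hypertension", "high blood pressure"],
}

def silver_annotate(text):
    """Return (start, end, class_id) spans found by longest-first matching."""
    lowered = text.lower()
    # Try longer surface forms first so "high blood pressure" wins over
    # any shorter form that overlaps it.
    forms = sorted(
        ((form, cid) for cid, fs in ONTOLOGY.items() for form in fs),
        key=lambda x: -len(x[0]),
    )
    spans, taken = [], set()
    for form, cid in forms:
        start = lowered.find(form)
        while start != -1:
            end = start + len(form)
            if not any(i in taken for i in range(start, end)):
                spans.append((start, end, cid))
                taken.update(range(start, end))
            start = lowered.find(form, end)
    return sorted(spans)
```

Such silver annotations are noisy, notably missing contextual synonyms absent from the ontology, which is precisely the gap the learned approaches in this thesis aim to close.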

    Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology

    Chemical exposures affect the environment and may lead to adverse outcomes in the organisms living there. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology, drawing on systems biology and data integration. Still, the complexity of environmental and biological systems reflected in such data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level, considering sources of complex environmental data. The first study employed data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses in their suitability to disentangle mixture effects of chemical exposures on biological endpoints, and in their reliability in attributing potentially adverse outcomes to chemical drivers using toxicological databases at the gene and pathway levels. Differential gene expression analysis and a network inference approach produced toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples containing mixtures of compounds at low concentrations. The applied approaches allowed the hazard of chemicals to be assessed more systematically with correlation-based compound groups. This dissertation presents a further achievement toward data-driven hypothesis generation for molecular exposure effects. The approach combined text mining and deep learning.
The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word-embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks proved beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% on unseen test data from the employed knowledge base. However, we could not reliably confirm known chemical-gene interactions across the selected data sources. Still, the predictive models may derive unknown information from toxicological knowledge sources such as literature, databases or omics-based exposure studies, and thus allow the generation of hypotheses about exposure-related molecular effects. Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.
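Before a word-embedding network can be trained on chemical-biomolecule text, fragments must be mapped to fixed-length index sequences. The sketch below shows that preparation step with tail-padding; the vocabulary construction and special tokens are illustrative assumptions, not the dissertation's pipeline.

```python
PAD, UNK = 0, 1  # reserved indices for padding and out-of-vocabulary tokens

def build_vocab(corpus):
    """Assign an index to every token seen in the corpus, after 0/1 are reserved."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for sentence in corpus:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Token indices for one sentence, truncated or tail-padded to max_len."""
    ids = [vocab.get(t, UNK) for t in sentence.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

The resulting index sequences feed the embedding layer, whose output the subsequent feed-forward network classifies into interaction types.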

    Deep learning for clinical texts in low-data regimes

    Electronic health records contain a wealth of valuable information for improving healthcare. There are, however, challenges associated with clinical text that prevent computers from maximising the utility of such information. While deep learning (DL) has emerged as a practical paradigm for dealing with the complexities of natural language, applying this class of machine learning algorithms to clinical text raises several research questions. First, we tackled the problem of data sparsity by looking into the task of adverse event detection. As these events are rare, examples of them are scarce. To compensate for data scarcity, we leveraged large pre-trained language models (LMs) in combination with formally represented medical knowledge. We demonstrated that such a combination exhibits remarkable generalisation abilities despite the low availability of data. Second, we focused on the omnipresence of short forms in clinical texts. These typically lead to out-of-vocabulary problems, which motivates recovering the underlying words. The novelty of our approach lies in its capacity to learn to expand short forms automatically, without resorting to external resources. Third, we investigated data augmentation to address the issue of data scarcity at its core. To the best of our knowledge, we were among the first to investigate population-based augmentation for scheduling text data augmentation. Interestingly, little improvement was seen when fine-tuning large pre-trained LMs with the augmented data. We suggest that, as LMs proved able to cope well with small datasets, data augmentation became redundant. We conclude that DL approaches to clinical text mining should be developed by fine-tuning large LMs. One area where such models may struggle is the use of clinical short forms; our method for automating their expansion fixes this issue.
Together, these two approaches provide a blueprint for successfully developing DL approaches to clinical text mining in low-data regimes.
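To make the short-form problem concrete, a naive dictionary baseline can be contrasted with the learned expansion described above: restrict candidates to strings the abbreviation could plausibly shorten, then pick the sense whose words co-occur with the surrounding context. The sense inventory and scoring below are hypothetical illustrations, not the thesis's method.

```python
def is_abbreviation_of(short, full):
    """True if the letters of `short` appear, in order, somewhere in `full`."""
    it = iter(full.lower())
    return all(ch in it for ch in short.lower())

SENSES = {  # hypothetical sense inventory for an ambiguous clinical short form
    "pt": ["patient", "physical therapy", "prothrombin time"],
}

def expand(short, context, senses=SENSES):
    """Pick the candidate expansion whose words overlap most with the context."""
    candidates = [c for c in senses.get(short, []) if is_abbreviation_of(short, c)]
    ctx = set(context.lower().split())

    def score(cand):
        return len(set(cand.lower().split()) & ctx)

    return max(candidates, key=score) if candidates else short
```

The baseline's reliance on an external sense inventory, and its failure whenever no context word overlaps, is exactly what motivates learning expansions directly from the clinical text itself.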

    Structuring the Unstructured: Unlocking pharmacokinetic data from journals with Natural Language Processing

    The development of a new drug is an increasingly expensive and inefficient process. Many drug candidates are discarded due to pharmacokinetic (PK) complications detected in clinical phases. It is critical to estimate the PK parameters of new drugs accurately before they are tested in humans, since these parameters determine efficacy and safety outcomes. Preclinical predictions of PK parameters are largely based on prior knowledge from other compounds, but much of this potentially valuable data is currently locked in the format of scientific papers. With an ever-increasing amount of scientific literature, automated systems are essential to exploit this resource efficiently. Developing text mining systems that can structure the PK literature is critical to improving the drug development pipeline. This thesis studied the development and application of text mining resources to accelerate the curation of PK databases. Specifically, it addressed the development of novel corpora and suitable natural language processing architectures for the PK domain. The work focused on machine learning approaches that can model the high diversity of PK studies, parameter mentions, numerical measurements, units and contextual information reported across the literature. Additionally, architectures and training approaches that can deal efficiently with the scarcity of annotated examples were explored. The chapters of this thesis tackle the development of suitable models and corpora to (1) retrieve PK documents, (2) recognise PK parameter mentions, (3) link PK entities to a knowledge base and (4) extract relations between parameter mentions, estimated measurements, units and other contextual information. Finally, the last chapter studied the feasibility of the whole extraction pipeline for accelerating tasks in drug development research.
The results of this thesis demonstrate the potential of text mining approaches to automatically generate PK databases that can aid researchers in the field and ultimately accelerate the drug development pipeline. Additionally, the thesis contributes to biomedical natural language processing by developing suitable architectures and corpora for multiple tasks, tackling novel entities and relations within the PK domain.
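As an illustration of the kind of extraction targeted by steps (2) and (4), a rule-based sketch can recognise a PK parameter mention together with its numeric estimate and unit. The parameter names and unit patterns below are illustrative assumptions, not the thesis's corpora or trained models.

```python
import re

PK_PATTERN = re.compile(
    r"(?P<param>clearance|half-life|AUC|Cmax|volume of distribution)"
    r"\D{0,30}?"                          # short non-numeric gap: "of", "was", "=" ...
    r"(?P<value>\d+(?:\.\d+)?)\s*"        # the numeric estimate
    r"(?P<unit>L/h|mL/min|h|ng/mL|L/kg)\b",
    re.IGNORECASE,
)

def extract_pk(text):
    """Return (parameter, value, unit) triples found in a sentence."""
    return [
        (m.group("param").lower(), float(m.group("value")), m.group("unit"))
        for m in PK_PATTERN.finditer(text)
    ]
```

Such brittle patterns miss the variability of real reporting (ranges, confidence intervals, covariate-dependent estimates), which is why the thesis replaces them with learned named entity recognition and relation extraction models.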

    Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning

    Full text available in preprint and publisher's versions (Open Access). BNAIC/BeneLearn 202

    Proceedings of the 15th ISWC workshop on Ontology Matching (OM 2020)

    15th International Workshop on Ontology Matching, co-located with the 19th International Semantic Web Conference (ISWC 2020). International audience.

    Nudging lifestyles for better health outcomes: crowdsourced data and persuasive technologies for behavioural change

    For at least three decades, a tsunami of preventable poor health has continued to threaten the future prosperity of our nations. Despite its destructive power, our collective predictive and preventive capacity remains remarkably under-developed. This tsunami is almost entirely mediated through the passive and unintended consequences of modernisation. The malignant spread of obesity in genetically stable populations dictates that genetic predisposition is not a significant contributor, as populations, crowds or cohorts cannot experience a wholesale change in their genes in only two to three decades. The authors elaborate on why a supply-side approach, advancing health care delivery, cannot be expected to impact health outcomes effectively: better care sets the stage for more care, yet remains largely impotent in returning individuals to disease-free states. The authors urge an expedited paradigm shift in policy selection criteria towards data-intensive, crowd-based evidence that integrates insights from systems thinking, networks and nudging. Collectively, these will support the emerging potential of ICT in proactive policy modelling. Against this background, the authors propose a solution that, stated in its most compact form, consists of: the provision of mundane yet high-yield data through light instrumentation of crowds enabling participative sensing; real-time living epidemiology separating the per-unit co-occurrences that are health-promoting from those that are not; nudging through persuasive technologies; serious gaming to sustain individual health behaviour change; and intuitive visualisation with reliable simulation to evaluate and direct public health investments and policies in evidence-based ways. JRC.DDG.J.4 - Information Society