163 research outputs found
Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Diseases
The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well suited to discovering new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs.
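The ABC-principle described in this abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a set of concept names, and it ranks candidate C concepts simply by the number of shared B-intermediates. That ranking is a simplification chosen for illustration, not the statistical scoring used by CoPub Discovery, and all concept names and function names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_index(documents):
    """Map each concept to the set of concepts it co-occurs with in any document."""
    neighbours = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def hidden_links(documents, start):
    """Return (C, shared B set) pairs for concepts C that never co-occur
    with `start` (A) but share at least one B-intermediate with it."""
    neighbours = cooccurrence_index(documents)
    direct = neighbours.get(start, set())
    shared = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            if c != start and c not in direct:
                shared[c].add(b)
    # Most shared intermediates first; ties broken alphabetically.
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))


# Toy corpus (hypothetical concept names): each set is one abstract.
docs = [
    {"drug_X", "gene_1"},
    {"drug_X", "gene_2"},
    {"gene_1", "disease_Y"},
    {"gene_2", "disease_Y"},
    {"gene_2", "disease_Z"},
]

for concept, intermediates in hidden_links(docs, "drug_X"):
    print(concept, sorted(intermediates))
```

In this toy corpus, drug_X never appears together with either disease, so both are candidates; disease_Y ranks first because it is reachable through two intermediates rather than one.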
Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts
The recently proposed concept of molecular connectivity maps enables researchers to integrate experimental measurements of genes, proteins, metabolites, and drug compounds under similar biological conditions. The study of these maps provides opportunities for future toxicogenomics and drug discovery applications. We developed a computational framework to build disease-specific drug-protein connectivity maps. We integrated gene/protein and drug connectivity information based on protein interaction networks and literature mining, without requiring gene expression profile information derived from drug perturbation experiments on disease samples. We described the development and application of this computational framework using Alzheimer's Disease (AD) as a primary example in three steps. First, molecular interaction networks were incorporated to reduce bias and improve relevance of AD seed proteins. Second, PubMed abstracts were used to retrieve enriched drug terms that are indirectly associated with AD through molecular mechanistic studies. Third and lastly, a comprehensive AD connectivity map was created by relating enriched drugs and related proteins in literature. We showed that this molecular connectivity map development approach outperformed both curated drug target databases and conventional information retrieval systems. Our initial explorations of the AD connectivity map yielded a new hypothesis that diltiazem and quinidine may be investigated as candidate drugs for AD treatment. Molecular connectivity maps derived computationally can help study molecular signature differences between different classes of drugs in specific disease contexts. To achieve overall good data coverage and quality, a series of statistical methods have been developed to overcome high levels of data noise in biological networks and literature mining results. Further development of computational molecular connectivity maps to cover major disease areas will likely set up a new model for drug development, in which therapeutic/toxicological profiles of candidate drugs can be checked computationally before costly clinical trials begin.
Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology
Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology considering systems biology and data integration. Still, the complexity of environmental and biological systems reflected in these data makes it difficult to investigate exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level considering sources of complex environmental data.
The first study employed data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses for their suitability to disentangle the biological effects of chemical mixture exposures and for their reliability in attributing potentially adverse outcomes to chemical drivers, using toxicological databases on gene and pathway levels. Differential gene expression analysis and a network inference approach resulted in toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples considering mixtures of compounds at low concentrations. The applied approaches allowed assessing the hazard of chemicals more systematically with correlation-based compound groups.
This dissertation presents another achievement toward a data-driven hypothesis generation for molecular exposure effects. The approach combined text-mining and deep learning. The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% for unseen test data of the employed knowledge base.
However, we could not reliably confirm known chemical-gene interactions across selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources, like literature, databases or omics-based exposure studies. Thus, the deep learning models might allow predicting hypotheses of exposure-related molecular effects.
Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.
Table of Contents
Abstract
Acknowledgements
Prelude
1 Introduction
1.1 An overview of environmental toxicology
1.1.1 Environmental toxicology
1.1.2 Chemicals in the environment
1.1.3 Systems biological perspectives in environmental toxicology
1.2 Computational toxicology
1.2.1 Omics-based approaches
1.2.2 Linking chemical exposure to transcriptional effects
1.2.3 Up-scaling from the gene level to higher biological organisation levels
1.2.4 Biomedical literature-based discovery
1.2.5 Deep learning with knowledge representation
1.3 Research question and approaches
2 Methods and Data
2.1 Linking environmentally relevant mixture exposures to transcriptional effects
2.1.1 Exposure and microarray data
2.1.2 Preprocessing
2.1.3 Differential gene expression
2.1.4 Association rule mining
2.1.5 Weighted gene correlation network analysis
2.1.6 Method comparison
2.2 Predicting exposure-related effects on a molecular level
2.2.1 Input
2.2.2 Input preparation
2.2.3 Deep learning models
2.2.4 Toxicogenomic application
3 Method comparison to link complex stream water exposures to effects on the transcriptional level
3.1 Background and motivation
3.1.1 Workflow
3.2 Results
3.2.1 Data preprocessing
3.2.2 Differential gene expression analysis
3.2.3 Association rule mining
3.2.4 Network inference
3.2.5 Method comparison
3.2.6 Application case of method integration
3.3 Discussion
3.4 Conclusion
4 Deep learning prediction of chemical-biomolecule interactions
4.1 Motivation
4.1.1 Workflow
4.2 Results
4.2.1 Input preparation
4.2.2 Model selection
4.2.3 Model comparison
4.2.4 Toxicogenomic application
4.2.5 Horizontal augmentation without tail-padding
4.2.6 Four-class problem formulation
4.2.7 Training with CTD data
4.3 Discussion
4.3.1 Transferring biomedical knowledge towards toxicology
4.3.2 Deep learning with biomedical knowledge representation
4.3.3 Data integration
4.4 Conclusion
5 Conclusion and Future perspectives
5.1 Conclusion
5.1.1 Investigating complex mixtures in the environment
5.1.2 Complex knowledge from literature and curated databases predict chemical-biomolecule interactions
5.1.3 Linking chemical exposure to biological effects by integrating CTD
5.2 Future perspectives
S1 Supplement Chapter 1
S1.1 Example of an estrogen bioassay
S1.2 Types of mode of action
S1.3 The dogma of molecular biology
S1.4 Transcriptomics
S2 Supplement Chapter 3
S3 Supplement Chapter 4
S3.1 Hyperparameter tuning results
S3.2 Functional enrichment with predicted chemical-gene interactions and CTD reference pathway genesets
S3.3 Reduction of learning rate in a model with large word embedding vectors
S3.4 Horizontal augmentation without tail-padding
S3.5 Four-relationship classification
S3.6 Interpreting loss observations for SemMedDB trained models
List of Abbreviations
List of Figures
List of Tables
Bibliography
Curriculum scientiae
Declaration of Authorship
Mining Relational Paths in Integrated Biomedical Data
Much life science and biology research requires an understanding of complex relationships between biological entities (genes, compounds, pathways, diseases, and so on). There is a wealth of data on such relationships in publicly available datasets and publications, but these sources overlap and are distributed, so that finding pertinent relational data is increasingly difficult. Whilst most public datasets have associated tools for searching, there is a lack of searching methods that can cross data sources and that search not only on the biological entities themselves but also on the relationships between them. In this paper, we demonstrate how graph-theoretic algorithms for mining relational paths can be used together with a previous integrative data resource we developed called Chem2Bio2RDF to extract new biological insights about the relationships between such entities. In particular, we use these methods to investigate the genetic basis of side-effects of thiazolidinedione drugs, proposing a hypothesis for the recently discovered cardiac side-effects of Rosiglitazone (Avandia) and a prediction for Pioglitazone which is backed up by recent clinical studies.
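The idea of mining relational paths can be sketched as a bounded search over a graph of labelled edges. The sketch below is an illustration under stated assumptions, not the algorithms or data model of Chem2Bio2RDF: edges are treated as undirected (head, relation, tail) triples, only simple paths are returned, and every entity, relation and function name is hypothetical.

```python
def relational_paths(edges, source, target, max_hops=3):
    """Enumerate simple paths from source to target, keeping relation labels.

    Each path alternates node, relation, node, ..., so a path with k hops
    has 2k + 1 elements.
    """
    graph = {}
    for head, relation, tail in edges:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, []).append((relation, head))  # assumption: undirected

    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        if (len(path) - 1) // 2 >= max_hops:
            continue  # hop budget used up
        visited = path[::2]  # even positions hold the nodes
        for relation, nxt in graph.get(node, []):
            if nxt not in visited:
                stack.append((nxt, path + [relation, nxt]))
    return sorted(paths, key=len)  # shortest paths first


# Toy graph with hypothetical entities and relations.
edges = [
    ("drug_A", "binds", "protein_P"),
    ("protein_P", "encoded_by", "gene_G"),
    ("gene_G", "associated_with", "side_effect_S"),
    ("drug_A", "similar_to", "drug_B"),
    ("drug_B", "causes", "side_effect_S"),
]

for path in relational_paths(edges, "drug_A", "side_effect_S"):
    print(" -> ".join(path))
```

Keeping the relation labels in each path is the point of the exercise: two entities may be connected in several ways, and the sequence of relations is what suggests a mechanistic hypothesis rather than a bare association.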
Contextual Analysis of Large-Scale Biomedical Associations for the Elucidation and Prioritization of Genes and their Roles in Complex Disease
Vast amounts of biomedical associations are easily accessible in public resources, spanning gene-disease associations, tissue-specific gene expression, gene function and pathway annotations, and many other data types. Despite this mass of data, information most relevant to the study of a particular disease remains loosely coupled and difficult to incorporate into ongoing research. Current public databases are difficult to navigate and do not interoperate well due to the plethora of interfaces and varying biomedical concept identifiers used. Because no coherent display of data within a specific problem domain is available, finding the latent relationships associated with a disease of interest is impractical.
This research describes a method for extracting the contextual relationships embedded within associations relevant to a disease of interest. After applying the method to a small test data set, a large-scale integrated association network is constructed for application of a network propagation technique that helps uncover more distant latent relationships. Together these methods are adept at uncovering highly relevant relationships without any a priori knowledge of the disease of interest.
The combined contextual search and relevance methods power a tool which makes pertinent biomedical associations easier to find, easier to assimilate into ongoing work, and more prominent than in currently available databases. Increasing the accessibility of current information is an important component of understanding high-throughput experimental results and surviving the data deluge.
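The abstract does not say which network propagation technique is used. As one common choice, the sketch below assumes a random walk with restart over an undirected, unweighted association network: score mass starts on the seed nodes, spreads evenly to neighbours at each step, and is partly returned to the seeds. The function name, parameters and node names are hypothetical, and every seed is assumed to be present in the network.

```python
def propagate(edges, seeds, restart=0.3, iterations=50):
    """Random walk with restart; returns a relevance score per node."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    nodes = sorted(neighbours)

    # Assumption: every seed occurs in the network, so scores sum to 1.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # Each node passes its remaining mass evenly to its neighbours.
            share = (1.0 - restart) * score[n] / len(neighbours[n])
            for m in neighbours[n]:
                updated[m] += share
        score = updated
    return score


# Toy chain (hypothetical names): disease - gene_1 - gene_2 - gene_3
edges = [("disease", "gene_1"), ("gene_1", "gene_2"), ("gene_2", "gene_3")]
scores = propagate(edges, seeds={"disease"})

for node in sorted(scores, key=scores.get, reverse=True):
    print(f"{node}: {scores[node]:.3f}")
```

Nodes that are not directly linked to the seed still receive a score, which decreases with distance in this toy chain; that is how propagation can surface more distant latent relationships that a direct lookup would miss.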
Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA
The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between them. In this paper, we describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms. Relationships identified using those approaches are combined with existing data in life science datasets to provide additional insight. Three case studies demonstrate the utility of the Bio-LDA model, including association prediction, association search and connectivity map generation. This combined approach offers new opportunities for knowledge discovery in many areas of biology including target identification, lead hopping and drug repurposing.
Comment: 14 pages, 8 figures, 10 tables