    From data towards knowledge: Revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data

    Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining approaches towards this goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal with respect to a functional module; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can readily be translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed".
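
    The rule quoted at the end of this abstract can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the `Rule` class, the `predicted_responses` function, and the gene-to-module assignments are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """If genes in `cause` are perturbed, genes in `effect` should respond."""
    cause: str   # name of the perturbed functional module
    effect: str  # name of the module expected to be differentially expressed


# Hypothetical module memberships, for illustration only (not from the paper).
MODULES = {
    "MAPK signaling": {"STE11", "STE7", "FUS3"},
    "pheromone response": {"FUS1", "FIG1"},
}

RULES = [Rule(cause="MAPK signaling", effect="pheromone response")]


def predicted_responses(perturbed_genes, rules=RULES, modules=MODULES):
    """Return the sorted names of modules expected to respond."""
    perturbed = set(perturbed_genes)
    # A rule fires when at least one perturbed gene belongs to its cause module.
    hits = {r.effect for r in rules if modules[r.cause] & perturbed}
    return sorted(hits)


print(predicted_responses(["STE7"]))
```

    Keeping rules at the module level rather than the gene level mirrors the abstract's move from gene-level findings to conceptual-level knowledge.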

    DRASTIC–INSIGHTS: querying information in a plant gene expression database

    DRASTIC, the Database Resource for the Analysis of Signal Transduction In Cells (http://www.drastic.org.uk/), has been created as a first step towards a data-based approach for constructing signal transduction pathways. DRASTIC is a relational database of plant expressed sequence tags and genes up- or down-regulated in response to various pathogens, chemical exposure or other treatments such as drought, salt and low temperature. More than 17700 records have been obtained from 306 treatments affecting 73 plant species from 512 peer-reviewed publications, with most emphasis being placed on data from Arabidopsis thaliana. DRASTIC has been developed by the Scottish Crop Research Institute and Abertay University and allows rapid identification of plant genes that are up- or down-regulated by multiple treatments and those that are regulated by a very limited number of treatments (or perhaps a single one). The INSIGHTS (INference of cell SIGnaling HypoTheseS) suite of web-based tools allows intelligent data mining and extraction of information from the DRASTIC database. Potential response pathways can be visualized and comparisons made between gene expression patterns in response to various treatments. The knowledge gained informs plant signalling pathways and systems biology investigations.
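
    The kind of query described in this abstract (genes regulated by many treatments versus genes regulated by only one) can be sketched over in-memory records. This is only an illustration under stated assumptions: the record layout, the gene names, and the function names are hypothetical and do not reflect the actual DRASTIC schema or the INSIGHTS tools.

```python
from collections import defaultdict

# Hypothetical records (gene, treatment, direction); not taken from DRASTIC.
RECORDS = [
    ("geneA", "drought", "up"),
    ("geneA", "salt", "up"),
    ("geneA", "low temperature", "up"),
    ("geneB", "drought", "down"),
    ("geneC", "salt", "up"),
    ("geneC", "drought", "up"),
]


def treatments_by_gene(records, direction):
    """Map each gene to the set of treatments that regulate it in `direction`."""
    by_gene = defaultdict(set)
    for gene, treatment, d in records:
        if d == direction:
            by_gene[gene].add(treatment)
    return dict(by_gene)


def genes_by_treatment_count(records, direction, min_treatments=1, max_treatments=None):
    """Return sorted genes regulated by a number of treatments within the given bounds."""
    selected = []
    for gene, treatments in treatments_by_gene(records, direction).items():
        n = len(treatments)
        if n >= min_treatments and (max_treatments is None or n <= max_treatments):
            selected.append(gene)
    return sorted(selected)


# Genes up-regulated by at least two treatments.
print(genes_by_treatment_count(RECORDS, "up", min_treatments=2))
# Genes down-regulated by exactly one treatment.
print(genes_by_treatment_count(RECORDS, "down", max_treatments=1))
```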

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments benefits from knowledge of the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the databases used (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to specify relation types further, e.g., as activating, direct physical, or gene regulatory relation. Part II deals with gene expression data analysis: Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment together with literature information that supports interpretation. Finally, in Chapter 11, ideas on how the described methods can contribute to current research and possible future directions are presented.
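
    The two processing steps named in this abstract, normalization and fold change determination, can be illustrated with a generic sketch. It assumes median centering in log2 space as the normalization, which is one common choice and not necessarily the method selected in the thesis; the function names and intensity values are made up, and significance testing (p-values) is omitted.

```python
import math
from statistics import median


def median_center(intensities):
    """Log2-transform raw intensities and subtract the array-wide median."""
    logs = {gene: math.log2(value) for gene, value in intensities.items()}
    centre = median(logs.values())
    return {gene: value - centre for gene, value in logs.items()}


def log2_fold_changes(treated, control):
    """Return the log2 fold change (treated vs. control) for genes on both arrays."""
    t = median_center(treated)
    c = median_center(control)
    return {gene: t[gene] - c[gene] for gene in t if gene in c}


# Made-up raw intensities for one control and one treated array.
CONTROL = {"g1": 100.0, "g2": 200.0, "g3": 400.0}
TREATED = {"g1": 200.0, "g2": 400.0, "g3": 3200.0}

for gene, lfc in sorted(log2_fold_changes(TREATED, CONTROL).items()):
    print(gene, round(lfc, 3))
```

    In this example the treated array is twice as bright overall, so median centering removes that systematic difference and only g3 retains a non-zero log2 fold change.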

    An Overview of the Use of Neural Networks for Data Mining Tasks

    In recent years, the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as active research, in developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights, from a data mining perspective, the implementation of NN using supervised and unsupervised learning for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.

    Infectious Disease Ontology

    Technological developments have resulted in tremendous increases in the volume and diversity of the data and information that must be processed in the course of biomedical and clinical research and practice. Researchers are at the same time under ever greater pressure to share data and to take steps to ensure that data resources are interoperable. The use of ontologies to annotate data has proven successful in supporting these goals and in providing new possibilities for the automated processing of data and information. In this chapter, we describe different types of vocabulary resources and emphasize those features of formal ontologies that make them most useful for computational applications. We describe current uses of ontologies and discuss future goals for ontology-based computing, focusing on its use in the field of infectious diseases. We review the largest and most widely used vocabulary resources relevant to the study of infectious diseases and conclude with a description of the Infectious Disease Ontology (IDO) suite of interoperable ontology modules that together cover the entire infectious disease domain.

    Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology

    Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology considering systems biology and data integration. Still, the complexity of environmental and biological systems reflected in the data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level considering sources of complex environmental data. The first study used data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses for their suitability to disentangle the mixture effects of chemical exposures on biological effects and for their reliability in attributing potentially adverse outcomes to chemical drivers using toxicological databases at the gene and pathway levels. Differential gene expression analysis and a network inference approach resulted in toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples considering mixtures of low-concentration compounds. The applied approaches allowed assessing the hazard of chemicals more systematically with correlation-based compound groups. This dissertation presents another achievement toward a data-driven hypothesis generation for molecular exposure effects. The approach combined text-mining and deep learning.
The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% for unseen test data of the employed knowledge base. However, we could not reliably confirm known chemical-gene interactions across selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources, like literature, databases or omics-based exposure studies. Thus, the deep learning models might allow predicting hypotheses of exposure-related molecular effects. Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.

    A Constraint Programming Approach for Mining Sequential Patterns in a Sequence Database

    Constraint-based pattern discovery is at the core of numerous data mining tasks. Patterns are extracted with respect to a given set of constraints (frequency, closedness, size, etc.). In the context of sequential pattern mining, a large number of dedicated techniques have been developed for solving particular classes of constraints. The aim of this paper is to investigate the use of Constraint Programming (CP) to model and mine sequential patterns in a sequence database. Our CP approach offers a natural way to combine, within a single framework, a large set of constraints coming from various origins. Experiments show the feasibility and the interest of our approach.
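
    To make the frequency and size constraints in this abstract concrete, the sketch below enumerates sequential patterns by brute force. It is explicitly not a Constraint Programming model and not the paper's method; it only illustrates what it means for a pattern to satisfy a minimum support and a maximum size. The toy database and all function names are assumptions.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so the order of items is respected.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_patterns(database, min_support, max_size):
    """Return {pattern: support} for patterns meeting both constraints."""
    items = sorted({item for seq in database for item in seq})
    result = {}
    for size in range(1, max_size + 1):           # size constraint
        for pattern in product(items, repeat=size):
            s = support(pattern, database)
            if s >= min_support:                  # frequency constraint
                result[pattern] = s
    return result


# Toy sequence database, made up for illustration.
DB = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]

for pattern, s in frequent_patterns(DB, min_support=2, max_size=2).items():
    print(pattern, s)
```

    Exhaustive enumeration grows exponentially with pattern size, which is one reason dedicated algorithms and declarative approaches such as CP are studied for this task.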