1,387 research outputs found

    Computational Methods for Gene Expression and Genomic Sequence Analysis

    Advances in technology are producing ever more cost-effective, high-throughput, large-scale biological data. As a result, there is an urgent need for efficient computational methods to analyze these massive data sets. In this dissertation, we introduce methods that address several important issues in gene expression and genomic sequence analysis, two of the most important areas in bioinformatics. First, we introduce a novel approach to predicting patterns of gene response to multiple treatments when the sample size is small. Researchers are increasingly interested in experiments with many treatments, such as chemical compounds or drug doses. However, because of cost, many experiments do not have large enough samples, which makes it difficult for conventional methods to predict patterns of gene response. Our approach exploits dependencies among the outcomes of pairwise comparisons, together with resampling techniques, to predict true patterns of gene response when samples are insufficient. It deduced more and better functionally enriched gene clusters than conventional methods, and is therefore useful for multiple-treatment studies that have small sample sizes or contain genes with highly variable expression. Second, we introduce a novel method for aligning short reads, which are DNA fragments extracted across the genomes of individuals, to reference genomes. Results from short read alignment can be used in many studies, such as measuring gene expression or detecting genetic variants. Our method employs an iterated randomized algorithm based on the FM-index, an efficient data structure for full-text indexing, to align reads to the reference. It improved alignment performance across a wide range of read lengths and error rates compared to several popular methods, making it a good choice for the community to perform short read alignment. Finally, we introduce a novel approach to detecting genetic variants such as SNPs (single nucleotide polymorphisms) and INDELs (insertions/deletions). Variant detection is significant in a wide range of areas, from bioinformatics and genetic research to medicine. For example, one can predict how genomic changes relate to phenotype in an organism of interest, or associate genetic changes with disease risk or the efficacy of medical treatment. Our method leverages known genetic variants from well-established databases to improve the accuracy of variant detection. It had higher accuracy than several state-of-the-art methods in many cases, especially for detecting INDELs, and therefore has the potential to be useful in research and clinical applications that rely on identifying genetic variants accurately.
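
    The abstract above names the FM-index as the data structure behind the read aligner but does not describe the algorithm in detail. As a rough illustration only, the following sketch shows plain exact-match backward search on a toy FM-index. It is not the dissertation's iterated randomized algorithm and it does not handle sequencing errors; the function names `build_fm_index` and `backward_search` are hypothetical, and the naive suffix-array construction is an assumption made for brevity.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence counts)."""
    text += "$"  # sentinel, lexicographically smaller than A, C, G, T
    # Naive suffix array: acceptable for a sketch, far too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch] = number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval in the suffix array
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        # Narrow the interval to suffixes that are preceded by ch.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

    With the reference ACGTACGTGA, searching for the read ACG reports positions 0 and 4. A practical aligner would add inexact matching on top of this step and would use a compressed, sampled index rather than full occurrence tables.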

    Expression data analysis and regulatory network inference by means of correlation patterns

    With the advance of high-throughput techniques, the amount of available data in the bio-molecular field is growing rapidly. It is now possible to measure genome-wide aspects of an entire biological system as a whole. Correlations that emerge from the internal dependency structures of these systems give rise to characteristic patterns in the corresponding data. The extraction of these patterns has become an integral part of computational biology. By triggering perturbations and interventions it is possible to alter these patterns, which may help to derive the dependency structures present in the system. In particular, differential expression experiments may yield altered patterns that we can use to approximate the actual interplay of regulatory proteins and genetic elements, namely, the regulatory network of a cell. In this work, we examine the detection of correlation patterns in bio-molecular data and evaluate their applicability to protein contact prediction, experimental artifact removal, the discovery of unexpected expression patterns, and genome-scale inference of regulatory networks. Correlation patterns are not limited to expression data. Their analysis in the context of conserved interfaces among proteins is useful for estimating whether these proteins may have co-evolved. Patterns that hint at correlated mutations would then occur in the associated protein sequences as well. We employ a conceptually simple sampling strategy to decide whether two pathway elements share a conserved interface and are thus likely to be in physical contact. We successfully apply our method to a system of ABC-transporters and two-component systems from the bacterial phylum Firmicutes. For spatially resolved gene expression data such as microarrays, the detection of artifacts, as opposed to noise, corresponds to the extraction of localized patterns that resemble outliers in a given region. We develop a method to detect and remove such artifacts using a sliding-window approach. Our method is very accurate and is shown to adapt to other platforms, such as custom arrays, as well. Further, we develop Padesco as a way to reveal unexpected expression patterns. We extract frequent and recurring patterns that are conserved across many experiments. For a specific experiment, we predict whether a gene deviates from its expected behaviour. We show that Padesco is an effective approach for selecting promising candidates from differential expression experiments. In Chapter 5, we then focus on the inference of genome-scale regulatory networks from expression data. Here, correlation patterns have proven useful for the data-driven estimation of regulatory interactions. We show that, for reliable eukaryotic network inference, the integration of prior networks is essential. We reveal that this integration leads to an over-estimation of network-wide quality measures and suggest a corrective procedure, CoRe, to counterbalance this effect. CoRe drastically improves the false discovery rate of the originally predicted networks. We further suggest a consensus approach in combination with an extended set of topological features to obtain a more accurate estimate of the eukaryotic regulatory network for yeast. In the course of this work we show how correlation patterns can be detected and how they can be applied to various problem settings in computational molecular biology. We develop and discuss competitive approaches for the prediction of protein contacts, artifact repair, differential expression analysis, and network inference, and show their applicability in practical setups.
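
    The abstract above mentions a sliding-window approach for detecting localized artifacts on spatially resolved arrays but gives no algorithmic detail. The sketch below is therefore only one plausible reading, not the thesis's method: it flags a spot when the median of its neighbourhood deviates strongly from the array-wide median, measured in robust (MAD-based) units. The function name `flag_local_artifacts`, the window size, and the threshold are all assumptions chosen for illustration.

```python
from statistics import median


def flag_local_artifacts(grid, window=5, threshold=3.0):
    """Flag spots whose neighbourhood median deviates from the array level.

    grid is a list of rows of intensities; the result is a boolean mask
    of the same shape (True = spot lies in a suspected artifact region).
    """
    rows, cols = len(grid), len(grid[0])
    half = window // 2
    values = [v for row in grid for v in row]
    global_median = median(values)
    # Median absolute deviation, scaled to be comparable to a std. deviation.
    mad = 1.4826 * median(abs(v - global_median) for v in values)
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the window around (r, c), clipped at the array border.
            neighbourhood = [
                grid[i][j]
                for i in range(max(0, r - half), min(rows, r + half + 1))
                for j in range(max(0, c - half), min(cols, c + half + 1))
            ]
            deviation = abs(median(neighbourhood) - global_median)
            mask[r][c] = deviation > threshold * mad
    return mask
```

    Single noisy spots barely move a window median, whereas a smear or scratch covering most of a window shifts it clearly, which is the distinction between noise and artifacts drawn in the abstract. Flagged spots could then be masked or re-estimated from unaffected neighbours.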

    Model guided trait-specific co-expression network estimation as a new perspective for identifying molecular interactions and pathways

    Author summary: Here we built a mathematically justified bridge between 1) parametric approaches and 2) co-expression networks for identifying molecular interactions underlying complex traits. We first raised the concern that methodological improvements to these schemes, which adjust only their power and scalability, are bounded by more fundamental scheme-specific limitations. We then exploited our theoretical results to overcome these limitations and find gene-by-gene interactions that neither scheme can capture alone. We also aimed to illustrate how this framework enables the interpretation of co-expression networks in a more parametric sense, so that systematic insights into complex biological processes can be obtained more reliably. The main procedure was designed to suit various types of biological applications and high-dimensional data, covering the area of systems biology as broadly as possible. In particular, we chose to illustrate the method's applicability to gene-profile-based risk stratification in cancer research using public acute myeloid leukemia datasets. A wide variety of 1) parametric regression models and 2) co-expression networks have been developed for finding gene-by-gene interactions underlying complex traits from expression data. While both methodological schemes have their own well-known benefits, little is known about their synergistic potential. Our study introduces a methodological fusion that cross-exploits the strengths of the individual approaches via a built-in information-sharing mechanism. This fusion is theoretically based on certain trait-conditioned dependency patterns between two genes, which depend on their roles in the underlying parametric model. The resulting trait-specific co-expression network estimation method 1) serves to enhance the interpretation of biological networks in a parametric sense, and 2) exploits the underlying parametric model itself in the estimation process. To account for the substantial intrinsic noise and collinearity often present in expression data, a tailored co-expression measure is introduced along with this framework to alleviate the related computational problems. A remarkable advance over the reference methods in simulated scenarios substantiates the method's high efficiency. As a proof of concept, this synergistic approach is successfully applied in survival analysis with acute myeloid leukemia data, further highlighting the framework's versatility and broad practical relevance. Peer reviewed

    Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology

    Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox for monitoring the distribution of chemicals and assessing the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches involving systems biology and data integration have been applied in environmental toxicology. Still, the complexity of environmental and biological systems reflected in the data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level, considering sources of complex environmental data. The first study employed data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses with respect to their suitability for disentangling the biological effects of chemical mixture exposures and their reliability in attributing potentially adverse outcomes to chemical drivers, using toxicological databases on the gene and pathway levels. Differential gene expression analysis and a network inference approach yielded toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples containing mixtures of compounds at low concentrations. The applied approaches allowed the hazard of chemicals to be assessed more systematically with correlation-based compound groups. This dissertation presents a second achievement toward data-driven hypothesis generation for molecular exposure effects. The approach combined text-mining and deep learning. 
    The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% for unseen test data of the employed knowledge base. However, we could not reliably confirm known chemical-gene interactions across selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources, like literature, databases or omics-based exposure studies. Thus, the deep learning models might allow predicting hypotheses of exposure-related molecular effects. Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.

    Table of Contents
        Abstract
        Acknowledgements
        Prelude
        1 Introduction
            1.1 An overview of environmental toxicology
                1.1.1 Environmental toxicology
                1.1.2 Chemicals in the environment
                1.1.3 Systems biological perspectives in environmental toxicology
            1.2 Computational toxicology
                1.2.1 Omics-based approaches
                1.2.2 Linking chemical exposure to transcriptional effects
                1.2.3 Up-scaling from the gene level to higher biological organisation levels
                1.2.4 Biomedical literature-based discovery
                1.2.5 Deep learning with knowledge representation
            1.3 Research question and approaches
        2 Methods and Data
            2.1 Linking environmental relevant mixture exposures to transcriptional effects
                2.1.1 Exposure and microarray data
                2.1.2 Preprocessing
                2.1.3 Differential gene expression
                2.1.4 Association rule mining
                2.1.5 Weighted gene correlation network analysis
                2.1.6 Method comparison
            2.2 Predicting exposure-related effects on a molecular level
                2.2.1 Input
                2.2.2 Input preparation
                2.2.3 Deep learning models
                2.2.4 Toxicogenomic application
        3 Method comparison to link complex stream water exposures to effects on the transcriptional level
            3.1 Background and motivation
                3.1.1 Workflow
            3.2 Results
                3.2.1 Data preprocessing
                3.2.2 Differential gene expression analysis
                3.2.3 Association rule mining
                3.2.4 Network inference
                3.2.5 Method comparison
                3.2.6 Application case of method integration
            3.3 Discussion
            3.4 Conclusion
        4 Deep learning prediction of chemical-biomolecule interactions
            4.1 Motivation
                4.1.1 Workflow
            4.2 Results
                4.2.1 Input preparation
                4.2.2 Model selection
                4.2.3 Model comparison
                4.2.4 Toxicogenomic application
                4.2.5 Horizontal augmentation without tail-padding
                4.2.6 Four-class problem formulation
                4.2.7 Training with CTD data
            4.3 Discussion
                4.3.1 Transferring biomedical knowledge towards toxicology
                4.3.2 Deep learning with biomedical knowledge representation
                4.3.3 Data integration
            4.4 Conclusion
        5 Conclusion and Future perspectives
            5.1 Conclusion
                5.1.1 Investigating complex mixtures in the environment
                5.1.2 Complex knowledge from literature and curated databases predict chemical-biomolecule interactions
                5.1.3 Linking chemical exposure to biological effects by integrating CTD
            5.2 Future perspectives
        S1 Supplement Chapter 1
            S1.1 Example of an estrogen bioassay
            S1.2 Types of mode of action
            S1.3 The dogma of molecular biology
            S1.4 Transcriptomics
        S2 Supplement Chapter 3
        S3 Supplement Chapter 4
            S3.1 Hyperparameter tuning results
            S3.2 Functional enrichment with predicted chemical-gene interactions and CTD reference pathway genesets
            S3.3 Reduction of learning rate in a model with large word embedding vectors
            S3.4 Horizontal augmentation without tail-padding
            S3.5 Four-relationship classification
            S3.6 Interpreting loss observations for SemMedDB trained models
        List of Abbreviations
        List of Figures
        List of Tables
        Bibliography
        Curriculum scientiae
        Declaration of authorship