374 research outputs found

    Refining transcriptional regulatory networks using network evolutionary models and gene histories

    Background: Computational inference of transcriptional regulatory networks remains a challenging problem, in part due to the lack of strong network models. In this paper we present evolutionary approaches to improve the inference of regulatory networks for a family of organisms by developing an evolutionary model for these networks and taking advantage of established phylogenetic relationships among these organisms. In previous work, we used a simple evolutionary model and provided extensive simulation results showing that phylogenetic information, combined with such a model, could be used to gain significant improvements on the performance of current inference algorithms. Results: In this paper, we extend the evolutionary model so as to take into account gene duplications and losses, which are viewed as major drivers in the evolution of regulatory networks. We show how to adapt our evolutionary approach to this new model and provide detailed simulation results, which show significant improvement on the reference network inference algorithms. Different evolutionary histories for gene duplications and losses are studied, showing that our adapted approach is feasible under a broad range of conditions. We also provide results on biological data (cis-regulatory modules for 12 species of Drosophila), confirming our simulation results.

    Transcriptional Regulatory Networks across Species: Evolution, Inference, and Refinement

    The determination of transcriptional regulatory networks is key to the understanding of biological systems. However, the experimental determination of transcriptional regulatory networks in the laboratory remains difficult and time-consuming, while current computational methods to infer these networks (typically from gene-expression data) achieve only modest accuracy. The latter can be attributed in part to the limitations of a single-organism approach. Computational biology has long used comparative and, more generally, evolutionary approaches to extend the reach and accuracy of its analyses. We therefore use an evolutionary approach to the inference of regulatory networks, which enables us to study evolutionary models for these networks as well as to improve the accuracy of inferred networks. Since the regulatory networks evolve along with the genomes, we consider that the regulatory networks for a family of organisms are related to each other through the same phylogenetic tree. These relationships contain information that can be used to improve the accuracy of inferred networks. Advances in the study of evolution of regulatory networks provide evidence to establish evolutionary models for regulatory networks, which is an important component of our evolutionary approach. We use two network evolutionary models, a basic model that considers only the gains and losses of regulatory connections during evolution, and an extended model that also takes into account the duplications and losses of genes. With the network evolutionary models, we design refinement algorithms to make use of the phylogenetic relationships to refine noisy regulatory networks for a family of organisms. These refinement algorithms include: RefineFast and RefineML, which are two-step iterative algorithms, and ProPhyC and ProPhyCC, which are based on a probabilistic phylogenetic model. 
For each algorithm we first design it with the basic network evolutionary model and then generalize it to the extended evolutionary model. All these algorithms are computationally efficient and are supported by extensive experimental results showing that they yield substantial improvement in the quality of the input noisy networks. In particular, ProPhyC and ProPhyCC further improve the performance of RefineFast and RefineML. Besides the four refinement algorithms mentioned above, we also design an algorithm based on transfer learning theory called tree transfer learning (TTL). TTL differs from the previous four refinement algorithms in that it takes the gene-expression data for the family of organisms as input, instead of their inferred noisy networks. TTL then learns the network structures for all the organisms at once, while taking advantage of the phylogenetic relationships. Although this approach outperforms an inference algorithm used alone, it does not perform better than ProPhyC, which indicates that the ProPhyC framework makes good use of the phylogenetic information.

    Algorithms for pre-microRNA classification and a GPU program for whole genome comparison

    MicroRNAs (miRNAs) are non-coding RNAs of approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules, or pre-miRNAs, often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpins can be found in genomes. It is a challenge to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID comprises three steps. First, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, from which a classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach and its superiority over existing methods. In the second part of this dissertation, a GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal of the research is to identify the commonalities and differences of two genomes from closely related organisms via multiple sequence alignments, using a seed-and-extend technique to choose reliable subsets of exact or near-exact matches, which are called anchors. A rigorous method, Smith-Waterman search, is applied for the anchor seeking, but it takes days to months to map millions of bases for mammalian genome sequences. With GPU programming, which is designed to run hundreds of short functions called threads in parallel, up to a 100x speedup is achieved over similar CPU executions.
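    The Smith-Waterman search named in this abstract is a standard dynamic-programming local alignment. The following is a minimal CPU sketch of the textbook algorithm, not the dissertation's GPU program; the function name `smith_waterman` and the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not taken from the abstract.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                          # start a new local alignment
                H[i - 1][j - 1] + sub(i, j),  # match or mismatch
                H[i - 1][j] + gap,          # gap in b
                H[i][j - 1] + gap,          # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

    The table fill costs time proportional to the product of the two sequence lengths, which is consistent with the abstract's remark that the rigorous search becomes slow on mammalian-scale genomes and motivates parallel execution.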

    Phylogenetic transfer of knowledge for biological networks


    Prokaryote growth temperature prediction with machine learning

    Archaea and bacteria can be divided into four groups based on their growth temperature adaptation: mesophiles, thermophiles, hyperthermophiles, and psychrophiles. The thermostability of proteins is a sum of multiple different physical forces such as van der Waals interactions, chemical polarity, and ionic interactions. Genes causing the adaptation have not been identified and this thesis aims to identify temperature adaptation linked genes and predict temperature adaptation based on the absence or presence of genes. A dataset of 4361 genes from 711 prokaryotes was analyzed with four different machine learning algorithms: neural network, random forest, gradient boosting machine, and logistic regression. Logistic regression was chosen to be an explanatory and predictive model based on micro-averaged AUC and the Occam's razor principle. Logistic regression was able to predict temperature adaptation with good performance. Machine learning is a powerful predictor for temperature adaptation and fewer than 200 genes were needed for the prediction of each adaptation. This technique can be used to predict the adaptation of uncultivated prokaryotes. However, the statistical importance of genes connected to temperature adaptation was not verified and this thesis did not provide much additional support for previously proposed temperature adaptation linked genes.

    Computational identification of transcriptional regulatory elements in DNA sequence

    Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges

    Transcription Factor-DNA Binding Via Machine Learning Ensembles

    We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWMs) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over (a) the motif scanning method and (b) the coexpression-based association method. The top motif outperformed the 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross-species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (Problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.
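    For readers unfamiliar with position weight matrices, the sketch below shows plain PWM motif scanning, the kind of baseline this abstract compares against, and not the ensemble method itself. The helper names (`build_log_odds_pwm`, `scan`, `best_hit`), the pseudocount of 1, and the uniform 0.25 background are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length binding-site strings.

    Uses a uniform background frequency (an assumption for this sketch).
    Returns one dict per motif position mapping base -> log2 odds score.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # Start every base at the pseudocount so no probability is zero
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background) for base in BASES}
        )
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; returns a list of (start, score)."""
    width = len(pwm)
    return [
        (start, sum(pwm[k][sequence[start + k]] for k in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


if __name__ == "__main__":
    # Toy binding sites, invented purely for the example
    pwm = build_log_odds_pwm(["TATA", "TATA", "TATA", "TATT"])
    print(best_hit("GGGTATAGGG", pwm))
```

    A window scores highly when each of its bases is over-represented at the corresponding motif position relative to the background, so thresholding these scores gives a simple candidate-site predictor.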

    Expression data analysis and regulatory network inference by means of correlation patterns

    With the advance of high-throughput techniques, the amount of available data in the bio-molecular field is rapidly growing. It is now possible to measure genome-wide aspects of an entire biological system as a whole. Correlations that emerge due to internal dependency structures of these systems entail the formation of characteristic patterns in the corresponding data. The extraction of these patterns has become an integral part of computational biology. By triggering perturbations and interventions it is possible to induce an alteration of patterns, which may help to derive the dependency structures present in the system. In particular, differential expression experiments may yield alternate patterns that we can use to approximate the actual interplay of regulatory proteins and genetic elements, namely, the regulatory network of a cell. In this work, we examine the detection of correlation patterns from bio-molecular data and we evaluate their applicability in terms of protein contact prediction, experimental artifact removal, the discovery of unexpected expression patterns and genome-scale inference of regulatory networks. Correlation patterns are not limited to expression data. Their analysis in the context of conserved interfaces among proteins is useful to estimate whether these may have co-evolved. Patterns that hint at correlated mutations would then occur in the associated protein sequences as well. We employ a conceptually simple sampling strategy to decide whether or not two pathway elements share a conserved interface and are thus likely to be in physical contact. We successfully apply our method to a system of ABC-transporters and two-component systems from the bacterial phylum Firmicutes. For spatially resolved gene expression data like microarrays, the detection of artifacts, as opposed to noise, corresponds to the extraction of localized patterns that resemble outliers in a given region. 
We develop a method to detect and remove such artifacts using a sliding-window approach. Our method is very accurate and it is shown to adapt to other platforms like custom arrays as well. Further, we developed Padesco as a way to reveal unexpected expression patterns. We extract frequent and recurring patterns that are conserved across many experiments. For a specific experiment, we predict whether a gene deviates from its expected behaviour. We show that Padesco is an effective approach for selecting promising candidates from differential expression experiments. In Chapter 5, we then focus on the inference of genome-scale regulatory networks from expression data. Here, correlation patterns have proven useful for the data-driven estimation of regulatory interactions. We show that, for reliable eukaryotic network inference, the integration of prior networks is essential. We reveal that this integration leads to an overestimation of network-wide quality estimates and suggest a corrective procedure, CoRe, to counterbalance this effect. CoRe drastically improves the false discovery rate of the originally predicted networks. We further suggest a consensus approach in combination with an extended set of topological features to obtain a more accurate estimate of the eukaryotic regulatory network for yeast. In the course of this work we show how correlation patterns can be detected and how they can be applied for various problem settings in computational molecular biology. We develop and discuss competitive approaches for the prediction of protein contacts, artifact repair, differential expression analysis, and network inference and show their applicability in practical setups.