195 research outputs found

    Inferring regulation programs in a transcription regulatory module network

    Cells have a complex mechanism to control the expression of genes so that they are capable of adapting to environmental changes or genetic perturbations. A major part of the mechanism is fulfilled by transcription factors, which can regulate the expression of other genes. Transcriptional regulatory relationships between genes and their transcription factors can be represented by a network, called a transcription regulatory network. Many algorithms have been proposed to learn transcription regulatory networks from gene expression data. In particular, the module network method, a special type of Bayesian network, has shown promising results. In a module network, a regulatory module is a set of genes that show similar expression profiles and are regulated by a shared set of transcription factors (i.e., the regulation program of the module). This method significantly decreases the number of parameters to be learned. Module network learning consists of two tasks: clustering genes into modules and inferring the regulation program for each module. This thesis concentrates on designing algorithms for the latter task. First, we introduce a regression tree-based Gibbs sampling algorithm for learning regulation programs in module networks. The novelty of this method is that a set of tree operations is defined for generating new regression trees from a given tree. We show that the set of tree operations is sufficient to generate a well-mixing Gibbs sampler even for large datasets. Second, we apply linear models to infer regulation programs. Given a gene module, this method partitions all experimental conditions into two condition clusters, between which the module's genes are most differentially expressed. Consequently, the process of learning the regulation program for the module becomes one of identifying transcription factors that are also differentially expressed between these two condition clusters. Third, we explore the possibility of integrating results from different algorithms. The integration methods we select are union, intersection, and weighted rank aggregation. Experiments on a yeast dataset show that the union and weighted rank aggregation methods produce more accurate predictions than those given by individual algorithms, whereas the intersection method does not yield any improvement in the accuracy of predictions. In addition, somewhat surprisingly, the union method, which has a much lower computational cost than rank aggregation, achieves results comparable to those of rank aggregation.
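
The linear-model idea in the abstract above (split the conditions into two clusters for a module, then look for transcription factors that differ between the same two clusters) can be illustrated with a small sketch. This is not the thesis's algorithm: the split rule (a 1-D two-cluster cut on the module's mean profile), the Welch t-statistic, the function names `split_conditions` and `rank_regulators`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def split_conditions(module_expr):
    """Split conditions into two clusters from a (genes x conditions) matrix.

    Assumption: cluster on the module's mean profile, choosing the cut that
    minimizes within-cluster sum of squares and leaves >= 2 conditions per side.
    Returns a boolean mask; True marks the high-expression cluster.
    """
    profile = module_expr.mean(axis=0)
    order = np.argsort(profile)
    vals = profile[order]
    best_cut, best_cost = 2, np.inf
    for cut in range(2, len(vals) - 1):
        low, high = vals[:cut], vals[cut:]
        cost = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    mask = np.zeros(len(profile), dtype=bool)
    mask[order[best_cut:]] = True
    return mask

def rank_regulators(tf_expr, tf_names, mask):
    """Rank candidate TFs by absolute Welch t-statistic between the two clusters."""
    a, b = tf_expr[:, mask], tf_expr[:, ~mask]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1]) + 1e-12  # avoid division by zero
    t = (a.mean(axis=1) - b.mean(axis=1)) / se
    ranking = np.argsort(-np.abs(t))
    return [(tf_names[i], float(t[i])) for i in ranking]

# Toy data (hypothetical): 2 module genes and 2 candidate TFs over 6 conditions.
module = np.array([[2.0, 2.1, 1.9, -2.0, -2.1, -1.9],
                   [1.8, 2.2, 2.0, -1.8, -2.2, -2.0]])
tfs = np.array([[1.5, 1.4, 1.6, -1.5, -1.4, -1.6],   # TF_A tracks the module
                [0.1, -0.1, 0.0, 0.1, -0.1, 0.0]])   # TF_B is flat across clusters
mask = split_conditions(module)
ranked = rank_regulators(tfs, ["TF_A", "TF_B"], mask)
```

In this toy case the first three conditions form the high-expression cluster and `TF_A` ranks ahead of `TF_B`. A real analysis would need multiple-testing control and at least four conditions for the split to be defined.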

    Detection of regulator genes and eQTLs in gene networks

    Genetic differences between individuals associated with quantitative phenotypic traits, including disease states, are usually found in non-coding genomic regions. These genetic variants are often also associated with differences in expression levels of nearby genes (they are "expression quantitative trait loci", or eQTLs for short) and presumably play a gene regulatory role, affecting the status of molecular networks of interacting genes, proteins and metabolites. Computational systems biology approaches to reconstruct causal gene networks from large-scale omics data have therefore become essential to understand the structure of networks controlled by eQTLs together with other regulatory genes, and to generate detailed hypotheses about the molecular mechanisms that lead from genotype to phenotype. Here we review the main analytical methods and software to identify eQTLs and their associated genes, to reconstruct co-expression networks and modules, to reconstruct causal Bayesian gene and module networks, and to validate predicted networks in silico. Comment: minor revision with typos corrected; review article; 24 pages, 2 figures

    Motif Discovery through Predictive Modeling of Gene Regulation

    We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature. Comment: RECOMB 200

    An Integrative Approach to Infer Regulation Programs in a Transcription Regulatory Module Network

    The module network method, a special type of Bayesian network algorithm, has been proposed to infer transcription regulatory networks from gene expression data. In this method, a module represents a set of genes that have similar expression profiles and are regulated by the same transcription factors. The process of learning module networks consists of two steps: first clustering genes into modules and then inferring the regulation program (transcription factors) of each module. Many algorithms have been designed to infer the regulation program of a given gene module, and these algorithms show very different biases in detecting regulatory relationships. In this work, we explore the possibility of integrating results from different algorithms. The integration methods we select are union, intersection, and weighted rank aggregation. Experiments on a yeast dataset show that the union and weighted rank aggregation methods produce more accurate predictions than those given by individual algorithms, whereas the intersection method does not yield any improvement in the accuracy of predictions. In addition, somewhat surprisingly, the union method, which has a lower computational cost than rank aggregation, achieves results comparable to those of rank aggregation.
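
The three integration strategies named in the abstract above can be sketched as follows, assuming each algorithm returns a list of candidate transcription factors ordered best first. The abstract does not specify the aggregation scheme, so the weighted Borda-style scoring here, the top-k cutoff, the function names, and the TF labels are illustrative assumptions rather than the authors' method.

```python
def union_topk(rankings, k):
    """Union of each algorithm's top-k TFs, in first-seen order."""
    seen = []
    for ranking in rankings:
        for tf in ranking[:k]:
            if tf not in seen:
                seen.append(tf)
    return seen

def intersection_topk(rankings, k):
    """TFs that appear in every algorithm's top-k, in the first list's order."""
    common = set(rankings[0][:k])
    for ranking in rankings[1:]:
        common &= set(ranking[:k])
    return [tf for tf in rankings[0][:k] if tf in common]

def weighted_rank_aggregation(rankings, weights):
    """Weighted Borda-style aggregation (an assumed scheme).

    A TF at position pos in a list of length n earns weight * (n - pos);
    TFs missing from a list earn nothing from it. Ties break alphabetically.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        n = len(ranking)
        for pos, tf in enumerate(ranking):
            scores[tf] = scores.get(tf, 0.0) + weight * (n - pos)
    return sorted(scores, key=lambda tf: (-scores[tf], tf))

# Hypothetical outputs of two algorithms for one module.
rank_a = ["TF1", "TF2", "TF3", "TF4"]
rank_b = ["TF2", "TF5", "TF1", "TF3"]

union_result = union_topk([rank_a, rank_b], k=2)
intersection_result = intersection_topk([rank_a, rank_b], k=2)
aggregated = weighted_rank_aggregation([rank_a, rank_b], weights=[1.0, 2.0])
```

The union and intersection are single passes over the top-k lists, while the aggregation has to score and sort every candidate, which is in line with the abstract's remark that union is the cheaper option.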

    Modeling gene regulatory networks through data integration

    Modeling gene regulatory networks has become a problem of great interest in biology and medical research. Most common methods for learning regulatory dependencies rely on observations in the form of gene expression data. In this dissertation, computational models for gene regulation have been developed based on constrained regression by integrating comprehensive gene expression data for M. tuberculosis with genome-scale ChIP-Seq interaction data. The resulting models showed predictive power for expression in independent stress conditions and identified mechanisms driving hypoxic adaptation and lipid metabolism in M. tuberculosis. I then used the regulatory network model for M. tuberculosis to identify factors responding to stress conditions and drug treatments, revealing drug synergies and conditions that potentiate drug treatments. These results can guide and optimize design of drug treatments for this pathogen. I took the next step in this direction by proposing a new probabilistic framework for learning modular structures in gene regulatory networks from gene expression and protein-DNA interaction data, combining the ideas of module networks and stochastic blockmodels. These models also capture combinatorial interactions between regulators. Comparisons with other network modeling methods that rely solely on expression data showed that integrating ChIP-Seq data is essential for identifying direct regulatory links in M. tuberculosis. Moreover, this work demonstrates the theoretical advantages of integrating ChIP-Seq data for the class of widely used module network models. The systems approach and statistical modeling presented in this dissertation can also be applied to problems in other organisms. A similar approach was taken to model the regulatory network controlling genes with circadian gene expression in Neurospora crassa, through integrating time-course expression data with ChIP-Seq data. The models explained combinatorial regulations leading to different phase differences in circadian rhythms. The Neurospora crassa network model also works as a tool to manipulate the phases of target genes.

    Modelling Transcriptional Regulation with a Mixture of Factor Analyzers and Variational Bayesian Expectation Maximization

    Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression; and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters, and estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference.

    Using machine learning to predict gene expression and discover sequence motifs

    Recently, large amounts of experimental data for complex biological systems have become available. We use tools and algorithms from machine learning to build data-driven predictive models. We first present a novel algorithm to discover gene sequence motifs associated with temporal expression patterns of genes. Our algorithm, which is based on partial least squares (PLS) regression, is able to directly model the flow of information, from gene sequence to gene expression, to learn cis-regulatory motifs and characterize associated gene expression patterns. Our algorithm outperforms traditional computational methods, e.g., clustering, in motif discovery. We then present a study extending a machine learning model of transcriptional regulation, predictive of genetic regulatory response, to Caenorhabditis elegans. We show meaningful results both in terms of prediction accuracy on the test experiments and biological information extracted from the regulatory program. The model discovers DNA binding sites ab initio. We also present a case study where we detect a signal of lineage-specific regulation. Finally, we present a comparative study on learning predictive models for motif discovery, based on different boosting algorithms: Adaptive Boosting (AdaBoost), Linear Programming Boosting (LPBoost) and Totally Corrective Boosting (TotalBoost). We evaluate and compare the performance of the three boosting algorithms via both statistical and biological validation, for hypoxia response in Saccharomyces cerevisiae.