    Motif Discovery through Predictive Modeling of Gene Regulation

    We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature. Comment: RECOMB 200
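
The boosting idea in this abstract can be illustrated with a minimal sketch. It is plain AdaBoost over decision stumps, not the MEDUSA implementation, and it assumes a hypothetical binary encoding in which each feature is "motif present in the promoter AND regulator active in the experiment". The function names and the toy data are invented for illustration.

```python
import math

def pair_features(motifs, regulators):
    # Hypothetical encoding: one binary feature per (motif, regulator) pair,
    # set to 1 only when the motif is present AND the regulator is active.
    return tuple(int(m and r) for m in motifs for r in regulators)

def stump(x, j, sign):
    # Weak learner: predicts `sign` when feature j is on, `-sign` otherwise.
    return sign if x[j] else -sign

def train_adaboost(X, y, rounds=5):
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n                      # uniform example weights
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for sign in (1, -1):
                # Weighted error of this stump on the training examples.
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump(xi, j, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, sign)
        err, j, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, sign))
        # Up-weight misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump(xi, j, sign))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(alpha * stump(x, j, sign) for alpha, j, sign in model)
    return 1 if score >= 0 else -1

# Toy data (invented): (motif presence, regulator activity, label),
# where +1 means up-regulated and -1 means down-regulated.
examples = [
    ((1, 0), (1,), +1),
    ((1, 0), (0,), -1),
    ((0, 1), (1,), -1),
    ((1, 1), (1,), +1),
    ((0, 0), (1,), -1),
    ((0, 1), (0,), -1),
]
X = [pair_features(m, r) for m, r, _ in examples]
y = [label for _, _, label in examples]
model = train_adaboost(X, y)
```

In this toy set the first selected stump uses feature 0 (motif 0 with the single regulator), which mirrors the abstract's point that boosting selects features that are predictive of expression rather than merely frequent.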

    Discovering Conserved cis-Regulatory Elements That Regulate Expression in Caenorhabditis elegans

    The aim of this dissertation is two-fold: 1) to catalog all cis-regulatory elements within the intergenic and intronic regions surrounding every gene in C. elegans (i.e. the regulome) and 2) to determine which cis-regulatory elements are associated with expression under specific conditions. We initially use PhyloNet to predict conserved motifs with instances in about half of the protein-coding genes. This first step was valuable as it recovered some known elements and cis-regulatory modules. Yet the results contained many redundant motifs and sites, and the approach did not scale efficiently to the entire regulome of C. elegans or other higher-order eukaryotes. Magma (Multiple Aligner of Genomic Multiple Alignments) overcomes these shortcomings by using efficient clustering and memory management algorithms. Additionally, it implements a fast greedy set-cover solution to significantly reduce redundant motifs. These differences make Magma ~70 times faster than PhyloNet, and Magma-based predictions occur near ~99% of all C. elegans protein-coding genes. Furthermore, we show tractable scaling for higher-order eukaryotes with larger regulomes. Finally, we demonstrate that a Magma-predicted motif, which represents the binding specificity for HLH-30, plays a critical role in the host defense against pathogenic infections. This novel finding shows that hlh-30(-) animals are more susceptible to S. aureus and P. aeruginosa than their wild-type counterparts.
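
The abstract mentions a greedy set-cover step for reducing redundant motifs but does not describe it further. The sketch below shows the standard greedy approximation under an assumed data layout (a mapping from motif name to the set of site identifiers it covers); it is not Magma's actual code, and the motif names and sites are invented.

```python
def greedy_set_cover(motif_sites):
    """Pick motifs until every site is covered, largest new coverage first.

    motif_sites: dict mapping a motif name to the set of site ids it covers
    (an assumed layout, chosen for illustration).
    """
    uncovered = set().union(*motif_sites.values())
    chosen = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(motif_sites),
                   key=lambda m: len(motif_sites[m] & uncovered))
        gained = motif_sites[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Toy input (invented): motifs B and D only repeat sites already covered by A.
motifs = {
    "A": {1, 2, 3, 4},
    "B": {3, 4},
    "C": {5, 6},
    "D": {1, 2},
}
print(greedy_set_cover(motifs))  # ['A', 'C']
```

Redundant motifs such as B and D are never selected because they add no uncovered sites once A has been chosen.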

    The Escherichia coli transcriptome mostly consists of independently regulated modules

    Underlying cellular responses is a transcriptional regulatory network (TRN) that modulates gene expression. A useful description of the TRN would decompose the transcriptome into targeted effects of individual transcriptional regulators. Here, we apply unsupervised machine learning to a diverse compendium of over 250 high-quality Escherichia coli RNA-seq datasets to identify 92 statistically independent signals that modulate the expression of specific gene sets. We show that 61 of these transcriptomic signals represent the effects of currently characterized transcriptional regulators. Condition-specific activation of signals is validated by exposure of E. coli to new environmental conditions. The resulting decomposition of the transcriptome provides: a mechanistic, systems-level, network-based explanation of responses to environmental and genetic perturbations; a guide to gene and regulator function discovery; and a basis for characterizing transcriptomic differences in multiple strains. Taken together, our results show that signal summation describes the composition of a model prokaryotic transcriptome

    An ensemble learning approach to reverse-engineering transcriptional regulatory networks from time-series gene expression data

    Background: One of the most challenging tasks in the post-genomic era is to reconstruct the transcriptional regulatory networks. The goal is to reveal, for each gene that responds to a certain biological event, which transcription factors affect its expression, and how a set of transcription factors coordinate to accomplish temporally and spatially specific regulation. Results: Here we propose a supervised machine learning approach to address these questions. We focus our study on the gene transcriptional regulation of the cell cycle in the budding yeast, thanks to the large amount of data available and relatively well-understood biology, although the main ideas of our method can be applied to other data as well. Our method starts with building an ensemble of decision trees for each microarray dataset to capture the association between the expression levels of yeast genes and the binding of transcription factors to gene promoter regions, as determined by chromatin immunoprecipitation microarray (ChIP-chip) experiments. Cross-validation experiments show that the method is more accurate and reliable than the naive decision tree algorithm and several other ensemble learning methods. From the decision tree ensembles, we extract logical rules that explain how a set of transcription factors act in concert to regulate the expression of their targets. We further compute a profile for each rule to show its regulation strengths at different time points. We also propose a spline interpolation method to integrate the rule profiles learned from several time series expression data sets that measure the same biological process. We then combine these rule profiles to build a transcriptional regulatory network for the yeast cell cycle. Compared to the results in the literature, our method correctly identifies all major known yeast cell cycle transcription factors, and assigns them to appropriate cell cycle phases. Our method also identifies many interesting synergetic relationships among these transcription factors, most of which are well known, while many of the rest can also be supported by other evidence. Conclusion: The high accuracy of our method indicates that our method is valid and robust. As more gene expression and transcription factor binding data become available, we believe that our method is useful for reconstructing large-scale transcriptional regulatory networks in other species as well.

    Discovering Transcriptional Regulatory Rules from Gene Expression and TF-DNA Binding Data by Decision Tree Learning

    Background: One of the most promising but challenging tasks in the post-genomic era is to reconstruct the transcriptional regulatory networks. The goal is to reveal, for each gene that responds to a certain biological event, which transcription factors affect its transcription, and how several transcription factors coordinate to accomplish specific regulations. Results: Here we propose a supervised machine learning approach to address these questions. We build decision trees to associate the expression level of a gene with the transcription factor binding data of its promoter. From the decision trees, we extract regulatory rules that specify how the binding of a combination of several transcription factors affects the expression of a gene. Such rules are easy to interpret, and represent experimentally testable hypotheses. We use a decision tree ensemble approach to increase modeling accuracy and robustness. We also propose a novel method to integrate rules learned from several time series that measure the same biological processes. We apply our method to publicly available cell cycle expression data and transcription factor binding data for the budding yeast. Cross-validation experiments show that our method is highly accurate and reliable. The method correctly identifies all major known yeast cell cycle transcription factors, and assigns them to appropriate cell cycle phases. It also explicitly reveals synergetic relationships of transcription factors, most of which agree well with the existing literature, while the rest provide testable biological hypotheses. Conclusions: The high accuracy of our method indicates that our method is valid and that the learned regulatory rules can be used as the basic building elements of a transcriptional regulatory network. As more gene expression and TF binding data become available, we believe that our method will be useful for reconstructing large-scale transcriptional regulatory networks.

    High resolution mapping of Twist to DNA in Drosophila embryos: Efficient functional analysis and evolutionary conservation

    Cis-regulatory modules (CRMs) function by binding sequence specific transcription factors, but the relationship between in vivo physical binding and the regulatory capacity of factor-bound DNA elements remains uncertain. We investigate this relationship for the well-studied Twist factor in Drosophila melanogaster embryos by analyzing genome-wide factor occupancy and testing the functional significance of Twist occupied regions and motifs within regions. Twist ChIP-seq data efficiently identified previously studied Twist-dependent CRMs and robustly predicted new CRM activity in transgenesis, with newly identified Twist-occupied regions supporting diverse spatiotemporal patterns (>74% positive, n = 31). Some, but not all, candidate CRMs require Twist for proper expression in the embryo. The Twist motifs most favored in genome ChIP data (in vivo) differed from those most favored by Systematic Evolution of Ligands by EXponential enrichment (SELEX) (in vitro). Furthermore, the majority of ChIP-seq signals could be parsimoniously explained by a CABVTG motif located within 50 bp of the ChIP summit and, of these, CACATG was most prevalent. Mutagenesis experiments demonstrated that different Twist E-box motif types are not fully interchangeable, suggesting that the ChIP-derived consensus (CABVTG) includes sites having distinct regulatory outputs. Further analysis of position, frequency of occurrence, and sequence conservation revealed significant enrichment and conservation of CABVTG E-box motifs near Twist ChIP-seq signal summits, preferential conservation of ±150 bp surrounding Twist occupied summits, and enrichment of GA- and CA-repeat sequences near Twist occupied summits. Our results show that high resolution in vivo occupancy data can be used to drive efficient discovery and dissection of global and local cis-regulatory logic
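
The CABVTG consensus uses IUPAC ambiguity codes (B is C, G, or T; V is A, C, or G). A minimal sketch of how such a motif could be located near a ChIP summit follows. The function names, the toy sequence, and the choice to measure distance from the motif start to the summit are assumptions made for illustration, not details taken from the study.

```python
import re

# IUPAC codes needed here; the table is deliberately partial.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "[CGT]",   # not A
    "V": "[ACG]",   # not T
    "N": "[ACGT]",
}

def iupac_to_regex(motif):
    # A lookahead with a capture group lets finditer report overlapping hits.
    body = "".join(IUPAC[ch] for ch in motif)
    return re.compile("(?=(" + body + "))")

def motif_hits_near_summit(seq, summit, motif="CABVTG", window=50):
    """Return (start, matched_sequence) for hits within `window` bp of summit.

    Assumption: distance is measured from the motif start position.
    CABVTG equals its own reverse complement as a pattern, so scanning
    one strand is sufficient for this particular motif.
    """
    pattern = iupac_to_regex(motif)
    hits = []
    for m in pattern.finditer(seq.upper()):
        if abs(m.start() - summit) <= window:
            hits.append((m.start(), m.group(1)))
    return hits

# Toy sequence (invented): one E-box near the summit, one far from it.
seq = "T" * 40 + "CACATG" + "T" * 60 + "CAGCTG" + "T" * 10
print(motif_hits_near_summit(seq, summit=45))  # [(40, 'CACATG')]
```

The second E-box (CAGCTG at position 106) also matches the consensus but is excluded because it lies more than 50 bp from the assumed summit.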