7,448 research outputs found

    Motif Discovery through Predictive Modeling of Gene Regulation

    Full text link
    We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a kk-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature.Comment: RECOMB 200

    Reverse-engineering transcriptional modules from gene expression data

    Full text link
    "Module networks" are a framework to learn gene regulatory networks from expression data using a probabilistic model in which coregulated genes share the same parameters and conditional distributions. We present a method to infer ensembles of such networks and an averaging procedure to extract the statistically most significant modules and their regulators. We show that the inferred probabilistic models extend beyond the data set used to learn the models.Comment: 5 pages REVTeX, 4 figure

    Validating module network learning algorithms using simulated data

    Get PDF
    In recent years, several authors have used probabilistic graphical models to learn expression modules and their regulatory programs from gene expression data. Here, we demonstrate the use of the synthetic data generator SynTReN for the purpose of testing and comparing module network learning algorithms. We introduce a software package for learning module networks, called LeMoNe, which incorporates a novel strategy for learning regulatory programs. Novelties include the use of a bottom-up Bayesian hierarchical clustering to construct the regulatory programs, and the use of a conditional entropy measure to assign regulators to the regulation program nodes. Using SynTReN data, we test the performance of LeMoNe in a completely controlled situation and assess the effect of the methodological changes we made with respect to an existing software package, namely Genomica. Additionally, we assess the effect of various parameters, such as the size of the data set and the amount of noise, on the inference performance. Overall, application of Genomica and LeMoNe to simulated data sets gave comparable results. However, LeMoNe offers some advantages, one of them being that the learning process is considerably faster for larger data sets. Additionally, we show that the location of the regulators in the LeMoNe regulation programs and their conditional entropy may be used to prioritize regulators for functional validation, and that the combination of the bottom-up clustering strategy with the conditional entropy-based assignment of regulators improves the handling of missing or hidden regulators.Comment: 13 pages, 6 figures + 2 pages, 2 figures supplementary informatio

    Inferring a Transcriptional Regulatory Network from Gene Expression Data Using Nonlinear Manifold Embedding

    Get PDF
    Transcriptional networks consist of multiple regulatory layers corresponding to the activity of global regulators, specialized repressors and activators of transcription as well as proteins and enzymes shaping the DNA template. Such intrinsic multi-dimensionality makes uncovering connectivity patterns difficult and unreliable and it calls for adoption of methodologies commensurate with the underlying organization of the data source. Here we present a new computational method that predicts interactions between transcription factors and target genes using a compendium of microarray gene expression data and the knowledge of known interactions between genes and transcription factors. The proposed method called Kernel Embedding of REgulatory Networks (KEREN) is based on the concept of gene-regulon association and it captures hidden geometric patterns of the network via manifold embedding. We applied KEREN to reconstruct gene regulatory interactions in the model bacteria E.coli on a genome-wide scale. Our method not only yields accurate prediction of verifiable interactions, which outperforms on certain metrics comparable methodologies, but also demonstrates the utility of a geometric approach to the analysis of high-dimensional biological data. We also describe the general application of kernel embedding techniques to some other function and network discovery algorithms

    Detection of regulator genes and eQTLs in gene networks

    Full text link
    Genetic differences between individuals associated to quantitative phenotypic traits, including disease states, are usually found in non-coding genomic regions. These genetic variants are often also associated to differences in expression levels of nearby genes (they are "expression quantitative trait loci" or eQTLs for short) and presumably play a gene regulatory role, affecting the status of molecular networks of interacting genes, proteins and metabolites. Computational systems biology approaches to reconstruct causal gene networks from large-scale omics data have therefore become essential to understand the structure of networks controlled by eQTLs together with other regulatory genes, and to generate detailed hypotheses about the molecular mechanisms that lead from genotype to phenotype. Here we review the main analytical methods and softwares to identify eQTLs and their associated genes, to reconstruct co-expression networks and modules, to reconstruct causal Bayesian gene and module networks, and to validate predicted networks in silico.Comment: minor revision with typos corrected; review article; 24 pages, 2 figure

    Community detection for networks with unipartite and bipartite structure

    Full text link
    Finding community structures in networks is important in network science, technology, and applications. To date, most algorithms that aim to find community structures only focus either on unipartite or bipartite networks. A unipartite network consists of one set of nodes and a bipartite network consists of two nonoverlapping sets of nodes with only links joining the nodes in different sets. However, a third type of network exists, defined here as the mixture network. Just like a bipartite network, a mixture network also consists of two sets of nodes, but some nodes may simultaneously belong to two sets, which breaks the nonoverlapping restriction of a bipartite network. The mixture network can be considered as a general case, with unipartite and bipartite networks viewed as its limiting cases. A mixture network can represent not only all the unipartite and bipartite networks, but also a wide range of real-world networks that cannot be properly represented as either unipartite or bipartite networks in fields such as biology and social science. Based on this observation, we first propose a probabilistic model that can find modules in unipartite, bipartite, and mixture networks in a unified framework based on the link community model for a unipartite undirected network [B Ball et al (2011 Phys. Rev. E 84 036103)]. We test our algorithm on synthetic networks (both overlapping and nonoverlapping communities) and apply it to two real-world networks: a southern women bipartite network and a human transcriptional regulatory mixture network. The results suggest that our model performs well for all three types of networks, is competitive with other algorithms for unipartite or bipartite networks, and is applicable to real-world networks.Comment: 27 pages, 8 figures. (http://iopscience.iop.org/1367-2630/16/9/093001

    Systematic identification of functional plant modules through the integration of complementary data sources

    Get PDF
    A major challenge is to unravel how genes interact and are regulated to exert specific biological functions. The integration of genome-wide functional genomics data, followed by the construction of gene networks, provides a powerful approach to identify functional gene modules. Large-scale expression data, functional gene annotations, experimental protein-protein interactions, and transcription factor-target interactions were integrated to delineate modules in Arabidopsis (Arabidopsis thaliana). The different experimental input data sets showed little overlap, demonstrating the advantage of combining multiple data types to study gene function and regulation. In the set of 1,563 modules covering 13,142 genes, most modules displayed strong coexpression, but functional and cis-regulatory coherence was less prevalent. Highly connected hub genes showed a significant enrichment toward embryo lethality and evidence for cross talk between different biological processes. Comparative analysis revealed that 58% of the modules showed conserved coexpression across multiple plants. Using module-based functional predictions, 5,562 genes were annotated, and an evaluation experiment disclosed that, based on 197 recently experimentally characterized genes, 38.1% of these functions could be inferred through the module context. Examples of confirmed genes of unknown function related to cell wall biogenesis, xylem and phloem pattern formation, cell cycle, hormone stimulus, and circadian rhythm highlight the potential to identify new gene functions. The module-based predictions offer new biological hypotheses for functionally unknown genes in Arabidopsis (1,701 genes) and six other plant species (43,621 genes). Furthermore, the inferred modules provide new insights into the conservation of coexpression and coregulation as well as a starting point for comparative functional annotation

    Spectral analysis of gene expression profiles using gene networks

    Full text link
    Microarrays have become extremely useful for analysing genetic phenomena, but establishing a relation between microarray analysis results (typically a list of genes) and their biological significance is often difficult. Currently, the standard approach is to map a posteriori the results onto gene networks to elucidate the functions perturbed at the level of pathways. However, integrating a priori knowledge of the gene networks could help in the statistical analysis of gene expression data and in their biological interpretation. Here we propose a method to integrate a priori the knowledge of a gene network in the analysis of gene expression data. The approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph, resulting in an attenuation of the high-frequency components of the expression profiles with respect to the topology of the graph. We show how to derive unsupervised and supervised classification algorithms of expression profiles, resulting in classifiers with biological relevance. We applied the method to the analysis of a set of expression profiles from irradiated and non-irradiated yeast strains. It performed at least as well as the usual classification but provides much more biologically relevant results and allows a direct biological interpretation
    • …
    corecore