241 research outputs found

    The Iterative Signature Algorithm for the analysis of large scale gene expression data

    Full text link
    We present a new approach for the analysis of genome-wide expression data. Our method is designed to overcome the limitations of traditional techniques, when applied to large-scale data. Rather than alloting each gene to a single cluster, we assign both genes and conditions to context-dependent and potentially overlapping transcription modules. We provide a rigorous definition of a transcription module as the object to be retrieved from the expression data. An efficient algorithm, that searches for the modules encoded in the data by iteratively refining sets of genes and conditions until they match this definition, is established. Each iteration involves a linear map, induced by the normalized expression matrix, followed by the application of a threshold function. We argue that our method is in fact a generalization of Singular Value Decomposition, which corresponds to the special case where no threshold is applied. We show analytically that for noisy expression data our approach leads to better classification due to the implementation of the threshold. This result is confirmed by numerical analyses based on in-silico expression data. We discuss briefly results obtained by applying our algorithm to expression data from the yeast S. cerevisiae.Comment: Latex, 36 pages, 8 figure

    An optimization model for metabolic pathways

    Get PDF
    This article is available open access through the publisher’s website through the link below. Copyright @ The Author 2009.Motivation: Different mathematical methods have emerged in the post-genomic era to determine metabolic pathways. These methods can be divided into stoichiometric methods and path finding methods. In this paper we detail a novel optimization model, based upon integer linear programming, to determine metabolic pathways. Our model links reaction stoichiometry with path finding in a single approach. We test the ability of our model to determine 40 annotated Escherichia coli metabolic pathways. We show that our model is able to determine 36 of these 40 pathways in a computationally effective manner. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online (http://bioinformatics.oxfordjournals.org/cgi/content/full/btp441/DC1)

    Discovering transcriptional modules by Bayesian data integration

    Get PDF
    Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets. Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs

    Characterization and Comparison of the Tissue-Related Modules in Human and Mouse

    Get PDF
    BACKGROUND: Due to the advances of high throughput technology and data-collection approaches, we are now in an unprecedented position to understand the evolution of organisms. Great efforts have characterized many individual genes responsible for the interspecies divergence, yet little is known about the genome-wide divergence at a higher level. Modules, serving as the building blocks and operational units of biological systems, provide more information than individual genes. Hence, the comparative analysis between species at the module level would shed more light on the mechanisms underlying the evolution of organisms than the traditional comparative genomics approaches. RESULTS: We systematically identified the tissue-related modules using the iterative signature algorithm (ISA), and we detected 52 and 65 modules in the human and mouse genomes, respectively. The gene expression patterns indicate that all of these predicted modules have a high possibility of serving as real biological modules. In addition, we defined a novel quantity, "total constraint intensity," a proxy of multiple constraints (of co-regulated genes and tissues where the co-regulation occurs) on the evolution of genes in module context. We demonstrate that the evolutionary rate of a gene is negatively correlated with its total constraint intensity. Furthermore, there are modules coding the same essential biological processes, while their gene contents have diverged extensively between human and mouse. CONCLUSIONS: Our results suggest that unlike the composition of module, which exhibits a great difference between human and mouse, the functional organization of the corresponding modules may evolve in a more conservative manner. Most importantly, our findings imply that similar biological processes can be carried out by different sets of genes from human and mouse, therefore, the functional data of individual genes from mouse may not apply to human in certain occasions

    Growth landscape formed by perception and import of glucose in yeast

    Get PDF
    An important challenge in systems biology is to quantitatively describe microbial growth using a few measurable parameters that capture the essence of this complex phenomenon. Two key events at the cell membrane—extracellular glucose sensing and uptake—initiate the budding yeast’s growth on glucose. However, conventional growth models focus almost exclusively on glucose uptake. Here we present results from growth-rate experiments that cannot be explained by focusing on glucose uptake alone. By imposing a glucose uptake rate independent of the sensed extracellular glucose level, we show that despite increasing both the sensed glucose concentration and uptake rate, the cell’s growth rate can decrease or even approach zero. We resolve this puzzle by showing that the interaction between glucose perception and import, not their individual actions, determines the central features of growth, and characterize this interaction using a quantitative model. Disrupting this interaction by knocking out two key glucose sensors significantly changes the cell’s growth rate, yet uptake rates are unchanged. This is due to a decrease in burden that glucose perception places on the cells. Our work shows that glucose perception and import are separate and pivotal modules of yeast growth, the interaction of which can be precisely tuned and measured.National Institutes of Health (U.S.). Pioneer AwardNatural Sciences and Engineering Research Council of Canada (NSERC). Graduate Fellowshi

    Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics

    Get PDF
    Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers the number of clusters that is often close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites. google.com/site/gaussianbhc

    Using Pre-existing Microarray Datasets to Increase Experimental Power: Application to Insulin Resistance

    Get PDF
    Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study

    Repression of Mitochondrial Translation, Respiration and a Metabolic Cycle-Regulated Gene, SLF1, by the Yeast Pumilio-Family Protein Puf3p

    Get PDF
    Synthesis and assembly of the mitochondrial oxidative phosphorylation (OXPHOS) system requires genes located both in the nuclear and mitochondrial genomes, but how gene expression is coordinated between these two compartments is not fully understood. One level of control is through regulated expression mitochondrial ribosomal proteins and other factors required for mitochondrial translation and OXPHOS assembly, which are all products of nuclear genes that are subsequently imported into mitochondria. Interestingly, this cadre of genes in budding yeast has in common a 3′-UTR element that is bound by the Pumilio family protein, Puf3p, and is coordinately regulated under many conditions, including during the yeast metabolic cycle. Multiple functions have been assigned to Puf3p, including promoting mRNA degradation, localizing nucleus-encoded mitochondrial transcripts to the outer mitochondrial membrane, and facilitating mitochondria-cytoskeletal interactions and motility. Here we show that Puf3p has a general repressive effect on mitochondrial OXPHOS abundance, translation, and respiration that does not involve changes in overall mitochondrial biogenesis and largely independent of TORC1-mitochondrial signaling. We also identified the cytoplasmic translation factor Slf1p as yeast metabolic cycle-regulated gene that is repressed by Puf3p at the post-transcriptional level and promotes respiration and extension of yeast chronological life span when over-expressed. Altogether, these results should facilitate future studies on which of the many functions of Puf3p is most relevant for regulating mitochondrial gene expression and the role of nuclear-mitochondrial communication in aging and longevity

    A classification-based framework for predicting and analyzing gene regulatory response

    Get PDF
    BACKGROUND: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem — predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. METHODS: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a more shallow and interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. RESULTS: Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast — the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors — and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code is available for download from

    Extracting expression modules from perturbational gene expression compendia

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Compendia of gene expression profiles under chemical and genetic perturbations constitute an invaluable resource from a systems biology perspective. However, the perturbational nature of such data imposes specific challenges on the computational methods used to analyze them. In particular, traditional clustering algorithms have difficulties in handling one of the prominent features of perturbational compendia, namely partial coexpression relationships between genes. Biclustering methods on the other hand are specifically designed to capture such partial coexpression patterns, but they show a variety of other drawbacks. For instance, some biclustering methods are less suited to identify overlapping biclusters, while others generate highly redundant biclusters. Also, none of the existing biclustering tools takes advantage of the staple of perturbational expression data analysis: the identification of differentially expressed genes.</p> <p>Results</p> <p>We introduce a novel method, called ENIGMA, that addresses some of these issues. ENIGMA leverages differential expression analysis results to extract expression modules from perturbational gene expression data. The core parameters of the ENIGMA clustering procedure are automatically optimized to reduce the redundancy between modules. In contrast to the biclusters produced by most other methods, ENIGMA modules may show internal substructure, i.e. subsets of genes with distinct but significantly related expression patterns. The grouping of these (often functionally) related patterns in one module greatly aids in the biological interpretation of the data. We show that ENIGMA outperforms other methods on artificial datasets, using a quality criterion that, unlike other criteria, can be used for algorithms that generate overlapping clusters and that can be modified to take redundancy between clusters into account. Finally, we apply ENIGMA to the Rosetta compendium of expression profiles for <it>Saccharomyces cerevisiae </it>and we analyze one pheromone response-related module in more detail, demonstrating the potential of ENIGMA to generate detailed predictions.</p> <p>Conclusion</p> <p>It is increasingly recognized that perturbational expression compendia are essential to identify the gene networks underlying cellular function, and efforts to build these for different organisms are currently underway. We show that ENIGMA constitutes a valuable addition to the repertoire of methods to analyze such data.</p
    corecore