792 research outputs found

    Partial mixture model for tight clustering of gene expression time-course

    Get PDF
    Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored. Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms. Conclusion: For the first time partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the ombination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion

    Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm

    Get PDF
    We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from the following URL. https://sites.google.com/site/randomisedbhc/

    High-resolution temporal profiling of transcripts during Arabidopsis leaf senescence reveals a distinct chronology of processes and regulation

    Get PDF
    Leaf senescence is an essential developmental process that impacts dramatically on crop yields and involves altered regulation of thousands of genes and many metabolic and signaling pathways, resulting in major changes in the leaf. The regulation of senescence is complex, and although senescence regulatory genes have been characterized, there is little information on how these function in the global control of the process. We used microarray analysis to obtain a highresolution time-course profile of gene expression during development of a single leaf over a 3-week period to senescence. A complex experimental design approach and a combination of methods were used to extract high-quality replicated data and to identify differentially expressed genes. The multiple time points enable the use of highly informative clustering to reveal distinct time points at which signaling and metabolic pathways change. Analysis of motif enrichment, as well as comparison of transcription factor (TF) families showing altered expression over the time course, identify clear groups of TFs active at different stages of leaf development and senescence. These data enable connection of metabolic processes, signaling pathways, and specific TF activity, which will underpin the development of network models to elucidate the process of senescence

    Bayesian correlated clustering to integrate multiple datasets

    Get PDF
    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods

    Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics

    Get PDF
    Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers the number of clusters that is often close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites. google.com/site/gaussianbhc

    Model-based clustering with data correction for removing artifacts in gene expression data

    Full text link
    The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1,000 landmark genes measured, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving results from subsequent analysis.Comment: 28 page

    Reconstructing regulatory networks from high-throughput post-genomic data using MCMC methods

    Get PDF
    Modern biological research aims to understand when genes are expressed and how certain genes in uence the expression of other genes. For organizing and visualizing gene expression activity gene regulatory networks are used. The architecture of these networks holds great importance, as they enable us to identify inconsistencies between hypotheses and observations, and to predict the behavior of biological processes in yet untested conditions. Data from gene expression measurements are used to construct gene regulatory networks. Along with the advance of high-throughput technologies for measuring gene expression statistical methods to predict regulatory networks have also been evolving. This thesis presents a computational framework based on a Bayesian modeling technique using state space models (SSM) for the inference of gene regulatory networks from time-series measurements. A linear SSM consists of observation and hidden state equations. The hidden variables can unfold effects that cannot be directly measured in an experiment, such as missing gene expression. We have used a Bayesian MCMC approach based on Gibbs sampling for the inference of parameters. However the task of determining the dimension of the hidden state space variables remains crucial for the accuracy of network inference. For this we have used the Bayesian evidence (or marginal likelihood) as a yardstick. In addition, the Bayesian approach also provides the possibility of incorporating prior information, based on literature knowledge. We compare marginal likelihoods calculated from the Gibbs sampler output to the lower bound calculated by a variational approximation. Before using the algorithm for the analysis of real biological experimental datasets we perform validation tests using numerical experiments based on simulated time series datasets generated by in-silico networks. The robustness of our algorithm can be measured by its ability to recapture the input data and generating networks using the inferred parameters. Our developed algorithm, GBSSM, was used to infer a gene network using E. coli data sets from the different stress conditions of temperature shift and acid stress. The resulting model for the gene expression response under temperature shift captures the effects of global transcription factors, such as fnr that control the regulation of hundreds of other genes. Interestingly, we also observe the stress-inducible membrane protein OsmC regulating transcriptional activity involved in the adaptation mechanism under both temperature shift and acid stress conditions. In the case of acid stress, integration of metabolomic and transcriptome data suggests that the observed rapid decrease in the concentration of glycine betaine is the result of the activation of osmoregulators which may play a key role in acid stress adaptation

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Get PDF
    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects its underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins\u27 function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsurpervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution
    • …
    corecore