2,010 research outputs found

    Integrative Model-based clustering of microarray methylation and expression data

    Full text link
    In many fields, researchers are interested in large and complex biological processes. Two important examples are gene expression and DNA methylation in genetics. One key problem is to identify aberrant patterns of these processes and discover biologically distinct groups. In this article we develop a model-based method for clustering such data. The basis of our method involves the construction of a likelihood for any given partition of the subjects. We introduce cluster specific latent indicators that, along with some standard assumptions, impose a specific mixture distribution on each cluster. Estimation is carried out using the EM algorithm. The methods extend naturally to multiple data types of a similar nature, which leads to an integrated analysis over multiple data platforms, resulting in higher discriminating power.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS533 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Mixture model with multiple allocations for clustering spatially correlated observations in the analysis of ChIP-Seq data

    Get PDF
    Model-based clustering is a technique widely used to group a collection of units into mutually exclusive groups. There are, however, situations in which an observation could in principle belong to more than one cluster. In the context of Next-Generation Sequencing (NGS) experiments, for example, the signal observed in the data might be produced by two (or more) different biological processes operating together and a gene could participate in both (or all) of them. We propose a novel approach to cluster NGS discrete data, coming from a ChIP-Seq experiment, with a mixture model, allowing each unit to belong potentially to more than one group: these multiple allocation clusters can be flexibly defined via a function combining the features of the original groups without introducing new parameters. The formulation naturally gives rise to a `zero-inflation group' in which values close to zero can be allocated, acting as a correction for the abundance of zeros that manifest in this type of data. We take into account the spatial dependency between observations, which is described through a latent Conditional Auto-Regressive process that can reflect different dependency patterns. We assess the performance of our model within a simulation environment and then we apply it to ChIP-seq real data.Comment: 25 pages; 3 tables, 6 figure

    Sparse covariance estimation in heterogeneous samples

    Full text link
    Standard Gaussian graphical models (GGMs) implicitly assume that the conditional independence among variables is common to all observations in the sample. However, in practice, observations are usually collected form heterogeneous populations where such assumption is not satisfied, leading in turn to nonlinear relationships among variables. To tackle these problems we explore mixtures of GGMs; in particular, we consider both infinite mixture models of GGMs and infinite hidden Markov models with GGM emission distributions. Such models allow us to divide a heterogeneous population into homogenous groups, with each cluster having its own conditional independence structure. The main advantage of considering infinite mixtures is that they allow us easily to estimate the number of number of subpopulations in the sample. As an illustration, we study the trends in exchange rate fluctuations in the pre-Euro era. This example demonstrates that the models are very flexible while providing extremely interesting interesting insights into real-life applications

    Unsupervised Bayesian linear unmixing of gene expression microarrays

    Get PDF
    Background: This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. The particularity of the proposed method is that uBLU constrains the factor loadings to be non-negative and the factor scores to be probability distributions over the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted here to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters. Results: Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), non negative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals have been inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real data sets considered here. Conclusions: The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GB-GMF). The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor

    Gamma-based clustering via ordered means with application to gene-expression analysis

    Full text link
    Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study.Comment: Published in at http://dx.doi.org/10.1214/10-AOS805 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org
    corecore