
    Bayesian mixture modeling for multivariate conditional distributions

    We present a Bayesian mixture model for estimating the joint distribution of mixed ordinal, nominal, and continuous data conditional on a set of fixed variables. The model uses multivariate normal and categorical mixture kernels for the random variables. It induces dependence between the random and fixed variables through the means of the multivariate normal mixture kernels and via a truncated local Dirichlet process; the latter encourages observations with similar values of the fixed variables to share mixture components. Using a simulation of data fusion, we illustrate that the model can estimate underlying relationships in the data and the distributions of the missing values more accurately than a mixture model applied to the random and fixed variables jointly. We use the model to analyze consumers' reading behaviors using a quota sample conducted by the book publisher HarperCollins, i.e., a sample in which the empirical distribution of some variables is fixed by design and so should not be modeled as random.
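
    A minimal sketch of the truncated stick-breaking construction that underlies this kind of Dirichlet process mixture is below. The truncation level, concentration parameter, and Gaussian kernels are illustrative assumptions, not the paper's local-DP specification, which additionally ties mixture weights to the fixed variables.

```python
# Truncated stick-breaking draw from a DP(alpha) mixture of Gaussians.
# K, alpha, and the kernel parameters are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, K, rng):
    """Draw K truncated stick-breaking weights from a DP(alpha) prior."""
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

K, alpha = 20, 1.0
w = stick_breaking_weights(alpha, K, rng)    # mixture weights (sum < 1 under truncation)
mu = rng.normal(0.0, 3.0, size=K)            # means of the Gaussian kernels
z = rng.choice(K, size=500, p=w / w.sum())   # component assignments
x = rng.normal(mu[z], 1.0)                   # continuous draws from the mixture
```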

    A Bayesian nonparametric mixture model for selecting genes and gene subnetworks

    It is very challenging to select informative features from the tens of thousands of measured features in high-throughput data analysis. Recently, several parametric/regression models have been developed that utilize gene network information to select genes or pathways strongly associated with a clinical or biological outcome. Alternatively, in this paper we propose a nonparametric Bayesian model for gene selection that incorporates network information. In addition to identifying genes that have a strong association with a clinical outcome, our model can select genes with particular expressional behavior, in which case regression models are not directly applicable. We show that our proposed model is equivalent to an infinite mixture model, for which we develop a posterior computation algorithm based on Markov chain Monte Carlo (MCMC) methods. We also propose two fast computing algorithms that approximate the posterior simulation with good accuracy but relatively low computational cost. We illustrate our methods on simulation studies and the analysis of the Spellman yeast cell cycle microarray data. Comment: Published at http://dx.doi.org/10.1214/14-AOAS719 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
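
    As a hedged illustration of posterior computation for such an infinite mixture, the sketch below fits a truncated Dirichlet process Gaussian mixture with variational inference via scikit-learn; this stands in for the paper's MCMC algorithm and network prior, and the toy "expression" data are invented.

```python
# Variational truncated DP mixture as a stand-in for the paper's MCMC sampler.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
expr = np.concatenate([rng.normal(-2, 1, (100, 1)),   # toy "null" genes
                       rng.normal(3, 1, (50, 1))])    # toy "outcome-associated" genes

dpm = BayesianGaussianMixture(
    n_components=10,                                  # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(expr)
labels = dpm.predict(expr)                            # cluster assignment per gene
```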

    Bayesian Nonparametric Graph Clustering

    We present clustering methods for multivariate data that exploit the underlying geometry of the graphical structure between variables. As opposed to standard approaches that assume known graph structures, we first estimate the edge structure of the unknown graph using Bayesian neighborhood selection approaches, wherein we account for the uncertainty of graphical structure learning through model-averaged estimates of the suitable parameters. Subsequently, we develop a nonparametric graph clustering model on lower-dimensional projections of the graph based on Laplacian embeddings, using Dirichlet process mixture models. In contrast to standard algorithmic approaches, this fully probabilistic approach allows incorporation of uncertainty in estimation and inference for both graph structure learning and clustering. More importantly, we formalize the arguments for Laplacian embeddings as suitable projections for graph clustering by providing theoretical support for the consistency of the eigenspace of the estimated graph Laplacians. We develop fast computational algorithms that allow our methods to scale to a large number of nodes. Through extensive simulations, we compare our clustering performance with standard clustering methods. We apply our methods to a novel pan-cancer proteomic data set and evaluate protein networks and clusters across multiple cancer types.
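
    The two-stage idea (Laplacian embedding, then DP mixture clustering) can be sketched as follows; the synthetic adjacency matrix replaces the paper's Bayesian neighborhood-selection estimate, so this shows only the shape of the pipeline, not the method itself.

```python
# Embed nodes via eigenvectors of the graph Laplacian, then cluster the
# embedding with a truncated DP mixture. Adjacency here is synthetic.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
A = (rng.random((30, 30)) < 0.15).astype(float)
A = np.triu(A, 1)
A = A + A.T                                     # symmetric 0/1 adjacency

L = laplacian(A, normed=True)                   # normalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:4]                     # low-dimensional spectral projection

bgm = BayesianGaussianMixture(
    n_components=8,                             # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(embedding)
clusters = bgm.predict(embedding)               # node cluster labels
```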

    Structure Learning in Graphical Modeling

    A graphical model is a statistical model associated with a graph whose nodes correspond to variables of interest. The edges of the graph reflect allowed conditional dependencies among the variables. Graphical models admit computationally convenient factorization properties and have long been a valuable tool for tractable modeling of multivariate distributions. More recently, applications such as reconstructing gene regulatory networks from gene expression data have driven major advances in structure learning, that is, estimating the graph underlying a model. We review some of these advances and discuss methods such as the graphical lasso and neighborhood selection for undirected graphical models (or Markov random fields), and the PC algorithm and score-based search methods for directed graphical models (or Bayesian networks). We further review extensions that account for the effects of latent variables and heterogeneous data sources.
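
    For one of the reviewed methods, the graphical lasso, a minimal runnable example is easy to give: it estimates a sparse precision matrix whose zero pattern encodes the missing edges of the undirected graph. The data and regularization strength below are illustrative.

```python
# Graphical lasso: zeros of the estimated precision matrix = absent edges.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(5), np.eye(5) + 0.4, size=200)

model = GraphicalLasso(alpha=0.1).fit(X)    # alpha controls sparsity
precision = model.precision_
edges = np.abs(precision) > 1e-4            # estimated edge indicator matrix
np.fill_diagonal(edges, False)
```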

    On sensing capacity of sensor networks for the class of linear observation, fixed SNR models

    In this paper we address the problem of finding the sensing capacity of sensor networks for a class of linear observation models in a fixed-SNR regime. Sensing capacity is defined as the maximum number of signal dimensions reliably identified per sensor observation. In this context, sparsity of the phenomenon is a key feature that determines sensing capacity. Setting aside the SNR of the environment, the effect of sparsity on the number of measurements required for accurate reconstruction of a sparse phenomenon has been widely studied under compressed sensing; nevertheless, the development there was motivated by an algorithmic perspective. In this paper our aim is to derive these bounds in an information-theoretic setup and thus provide algorithm-independent conditions for reliable reconstruction of sparse signals. To this end, we first generalize Fano's inequality and provide lower bounds on the probability of error in reconstruction subject to an arbitrary distortion criterion. Using these lower bounds on the probability of error, we derive upper bounds on sensing capacity and show that in the fixed-SNR regime sensing capacity goes to zero as sparsity goes to zero. This means that disproportionately more sensors are required to monitor very sparse events. Our next main contribution is to show the effect of sensing diversity on sensing capacity, an effect that has not been considered before. Sensing diversity is related to the effective coverage of a sensor with respect to the field. In this direction we show the following results: (a) sensing capacity goes down as sensing diversity per sensor goes down; (b) random sampling (coverage) of the field by sensors is better than contiguous location sampling (coverage). Comment: 37 pages, single column.
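
    To make the role of Fano's inequality concrete, the sketch below evaluates the classical (ungeneralized) bound P_err >= 1 - (I + 1)/log2(M) for M equiprobable hypotheses; the numbers are illustrative, and the paper's distortion-dependent generalization is not reproduced.

```python
# Classical Fano lower bound on the probability of error.
import numpy as np

def fano_lower_bound(mutual_info_bits, num_hypotheses):
    """P_err >= 1 - (I + 1) / log2(M) for M equiprobable hypotheses."""
    return max(0.0, 1.0 - (mutual_info_bits + 1.0) / np.log2(num_hypotheses))

# e.g. distinguishing one of 2**20 sparse support patterns from observations
# carrying 12 bits of mutual information:
print(fano_lower_bound(mutual_info_bits=12.0, num_hypotheses=2**20))  # 0.35
```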

    Variational Inference for Sparse and Undirected Models

    Undirected graphical models are applied in genomics, protein structure prediction, and neuroscience to identify sparse interactions that underlie discrete data. Although Bayesian methods for inference would be favorable in these contexts, they are rarely used because they require doubly intractable Monte Carlo sampling. Here, we develop a framework for scalable Bayesian inference of discrete undirected models based on two new methods. The first is Persistent VI, an algorithm for variational inference of discrete undirected models that avoids doubly intractable MCMC and approximations of the partition function. The second is Fadeout, a reparameterization approach for variational inference under sparsity-inducing priors that captures a posteriori correlations between parameters and hyperparameters with noncentered parameterizations. We find that, together, these methods for variational inference substantially improve learning of sparse undirected graphical models in simulated and real problems from physics and biology. Comment: 34th International Conference on Machine Learning (ICML 2017).
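
    The noncentered reparameterization behind Fadeout can be sketched in a few lines: rather than sampling a weight w with its prior scale s directly, one samples a standardized weight and a scale separately and multiplies them. The Gaussian variational family and all parameter values below are assumptions for illustration, not the paper's code.

```python
# Noncentered parameterization: w = s * w_raw decorrelates weights from scales.
import numpy as np

rng = np.random.default_rng(4)
d = 5

# Illustrative variational parameters: Gaussian q over raw weights and log-scales.
mu_w, sd_w = np.zeros(d), 0.3 * np.ones(d)
mu_logs, sd_logs = -1.0 * np.ones(d), 0.1 * np.ones(d)

w_raw = mu_w + sd_w * rng.standard_normal(d)             # standardized weights
s = np.exp(mu_logs + sd_logs * rng.standard_normal(d))   # sparsity-inducing scales
w = s * w_raw   # effective weights; w-s correlations arise through the product
```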

    Image Denoising with Kernels based on Natural Image Relations

    A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. However, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding their straightforward extension to other, arbitrary noise sources. In this paper, we propose an alternative, non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations in the estimated signal. Since relations among the coefficients are specific to the signal, the regularization property of SVR is exploited to remove the noise, which does not share this feature. The specific signal relations are encoded in an anisotropic kernel obtained from mutual information measures computed on a representative image database. Training minimizes the Kullback-Leibler divergence (KLD) between the estimated and actual probability functions of signal and noise in order to enforce similarity. Due to its non-parametric nature, the method can cope with different noise sources without the need for an explicit re-formulation, as is strictly necessary under parametric Bayesian formalisms. Results under several noise levels and noise sources show that: (1) the proposed method outperforms conventional wavelet methods that assume coefficient independence; (2) it is similar to state-of-the-art methods that do explicitly include these relations when the noise source is Gaussian; and (3) it gives better numerical and visual performance when more complex, realistic noise sources are considered. Therefore, the proposed machine learning approach can be seen as a more flexible (model-free) alternative to the explicit description of wavelet coefficient relations for image denoising.
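
    A much-simplified sketch of SVR denoising in the wavelet domain: each subband's noisy coefficients are smoothed by an eps-insensitive SVR and the signal is reconstructed. The generic RBF kernel and per-subband position regression are stand-ins for the paper's anisotropic, mutual-information-derived kernel, so this shows only the overall shape of the approach.

```python
# Per-subband SVR smoothing of wavelet coefficients (illustrative 1-D setup).
import numpy as np
import pywt
from sklearn.svm import SVR

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 256)
noisy = np.sin(8 * np.pi * t) + 0.3 * rng.standard_normal(t.size)

coeffs = pywt.wavedec(noisy, "db4", level=3)
denoised_coeffs = []
for c in coeffs:
    idx = np.arange(c.size, dtype=float).reshape(-1, 1)
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.2).fit(idx, c)
    denoised_coeffs.append(svr.predict(idx))   # regularized (smoothed) coefficients

denoised = pywt.waverec(denoised_coeffs, "db4")
```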

    Probabilistic Approach for Evaluating Metabolite Sample Integrity

    The success of metabolomics studies depends upon the "fitness" of each biological sample used for analysis: it is critical that the metabolite levels reported for a biological sample represent an accurate snapshot of the studied organism's metabolite profile at the time of sample collection. Numerous factors may compromise metabolite sample fitness, including chemical and biological factors that intervene during sample collection, handling, storage, and preparation for analysis. We propose a probabilistic model for the quantitative assessment of metabolite sample fitness. Collection and processing of nuclear magnetic resonance (NMR) and ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) metabolomics data are discussed. Feature selection methods utilized for multivariate data analysis are briefly reviewed, including feature clustering and computation of latent vectors using spectral methods. We propose that the time-course of metabolite changes in samples stored at different temperatures may be used to identify changing-metabolite-to-stable-metabolite ratios as markers of sample fitness. Tolerance intervals may be computed to characterize these ratios among fresh samples. To discover additional structure in the data relevant to sample fitness, we propose using data labeled according to these ratios to train a Dirichlet process mixture model (DPMM) for assessing sample fitness. DPMMs are highly intuitive since they model the metabolite levels in a sample as arising from a combination of processes, including, e.g., normal biological processes and degradation- or contamination-inducing processes. The outputs of a DPMM are the probabilities that a sample is associated with a given process, and these probabilities may be incorporated into a final classifier for sample fitness.
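
    The ratio-plus-tolerance-interval step can be made concrete. Below, an approximate two-sided normal tolerance interval (Howe's method) is computed for a changing-to-stable metabolite ratio among fresh samples; the metabolite values, coverage, and confidence levels are all illustrative assumptions.

```python
# Tolerance interval for a changing-to-stable metabolite ratio (Howe's method).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
changing = rng.normal(2.0, 0.2, size=25)   # toy degradation-prone metabolite levels
stable = rng.normal(5.0, 0.1, size=25)     # toy stable reference metabolite levels
ratio = changing / stable

n, p, conf = ratio.size, 0.95, 0.95        # coverage p with confidence conf
z = stats.norm.ppf((1 + p) / 2)
k = z * np.sqrt((n - 1) * (1 + 1 / n) / stats.chi2.ppf(1 - conf, n - 1))
lo = ratio.mean() - k * ratio.std(ddof=1)
hi = ratio.mean() + k * ratio.std(ddof=1)
# A stored sample whose ratio falls outside [lo, hi] would be flagged and
# could feed the proposed DPMM stage as a label.
```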

    Bayesian Structured Sparsity from Gaussian Fields

    Substantial research on structured sparsity has contributed to the analysis of many different applications. However, there have been few Bayesian procedures among this work. Here, we develop a Bayesian model for structured sparsity that uses a Gaussian process (GP) to share parameters of the sparsity-inducing prior in proportion to feature similarity as defined by an arbitrary positive definite kernel. For linear regression, this sparsity-inducing prior on regression coefficients is a relaxation of the canonical spike-and-slab prior that flattens the mixture model into a scale mixture of normals. This prior retains the explicit posterior probability on inclusion parameters, now with GP probit prior distributions, but enables tractable computation via elliptical slice sampling for the latent Gaussian field. We motivate the development of this prior using the genomic application of association mapping, that is, identifying genetic variants associated with a continuous trait. Our Bayesian structured sparsity model produced sparse results with substantially improved sensitivity and precision relative to comparable methods. Through simulations, we show that three properties are key to this improvement: (i) modeling structure in the covariates, (ii) significance testing using the posterior probabilities of inclusion, and (iii) model averaging. We present results from applying this model to a large genomic dataset to demonstrate computational tractability. Comment: 23 pages, 7 figures.
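
    Elliptical slice sampling, the workhorse here for the latent Gaussian field, is simple enough to state in full; the minimal implementation below follows Murray, Adams & MacKay (2010), with a placeholder likelihood and an identity prior covariance standing in for the paper's GP kernel.

```python
# One elliptical slice sampling update for latent f under a Gaussian prior.
import numpy as np

def elliptical_slice_step(f, prior_sample, log_lik, rng):
    nu = prior_sample                           # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.random())   # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:              # accept a point on the ellipse
            return f_new
        if theta < 0.0:                         # shrink the bracket and retry
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

rng = np.random.default_rng(7)
f = np.zeros(10)
log_lik = lambda g: -0.5 * np.sum((g - 1.0) ** 2)   # stand-in likelihood
for _ in range(100):
    f = elliptical_slice_step(f, rng.standard_normal(10), log_lik, rng)
```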

    Stagewise Learning for Sparse Clustering of Discretely-Valued Data

    The performance of EM in learning mixtures of product distributions often depends on the initialization. This can be problematic in crowdsourcing and other applications, e.g., when a small number of 'experts' are diluted by a large number of noisy, unreliable participants. We develop a new EM algorithm that is driven by these experts. In a manner that differs from other approaches, we start from a single mixture class. The algorithm then develops the set of 'experts' in a stagewise fashion based on a mutual information criterion. At each stage, EM operates on this subset of the players, effectively regularizing the E-step rather than the M-step. Experiments show that stagewise EM outperforms other initialization techniques for crowdsourcing and neuroscience applications, and can guide a full EM to results comparable to those obtained knowing the exact distribution. Comment: 9 pages.
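
    The mutual-information-driven staging can be illustrated with simulated crowdsourcing data: workers are ranked by the mutual information between their labels and a current consensus estimate, and only the top-ranked 'experts' drive the next EM stage. The worker accuracies, the majority-vote initialization, and the cutoff below are toy assumptions, not the paper's exact criterion.

```python
# Rank crowd workers by mutual information with the current label estimate.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(8)
truth = rng.integers(0, 2, size=200)            # latent item classes
accs = [0.95, 0.90, 0.55, 0.50, 0.50]           # two experts, three noisy workers
workers = np.array([np.where(rng.random(200) < a, truth, rng.integers(0, 2, 200))
                    for a in accs])

consensus = (workers.mean(axis=0) > 0.5).astype(int)   # crude initial estimate
mi = [mutual_info_score(consensus, w) for w in workers]
experts = np.argsort(mi)[::-1][:2]              # stage in the most informative workers
```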