Bayesian mixture modeling for multivariate conditional distributions
We present a Bayesian mixture model for estimating the joint distribution of
mixed ordinal, nominal, and continuous data conditional on a set of fixed
variables. The model uses multivariate normal and categorical mixture kernels
for the random variables. It induces dependence between the random and fixed
variables through the means of the multivariate normal mixture kernels and via
a truncated local Dirichlet process. The latter encourages observations with
similar values of the fixed variables to share mixture components. Using a
simulation of data fusion, we illustrate that the model can estimate underlying
relationships in the data and the distributions of the missing values more
accurately than a mixture model applied to the random and fixed variables
jointly. We use the model to analyze consumers' reading behaviors using a quota
sample conducted by the book publisher HarperCollins, i.e., a sample in which
the empirical distribution of some variables is fixed by design and so should
not be modeled as random.
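As a minimal sketch of the kind of prior involved, the weights of a Dirichlet process truncated at K components can be drawn by stick-breaking (generic NumPy, not the authors' code; the paper's local, covariate-dependent weighting is omitted):

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Draw mixture weights from a DP stick-breaking prior truncated at K components."""
    v = rng.beta(1.0, alpha, size=K)   # stick-breaking fractions
    v[-1] = 1.0                        # truncation: last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining               # weights sum to 1 exactly

rng = np.random.default_rng(0)
w = truncated_stick_breaking(alpha=1.0, K=20, rng=rng)
print(w.sum())  # 1.0
```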
A Bayesian nonparametric mixture model for selecting genes and gene subnetworks
It is very challenging to select informative features from tens of thousands
of measured features in high-throughput data analysis. Recently, several
parametric/regression models have been developed utilizing the gene network
information to select genes or pathways strongly associated with a
clinical/biological outcome. Alternatively, in this paper, we propose a
nonparametric Bayesian model for gene selection incorporating network
information. In addition to identifying genes that have a strong association
with a clinical outcome, our model can select genes with particular
expressional behavior, in which case the regression models are not directly
applicable. We show that our proposed model is equivalent to an infinite
mixture model for which we develop a posterior computation algorithm based on
Markov chain Monte Carlo (MCMC) methods. We also propose two fast computing
algorithms that approximate the posterior simulation with good accuracy but
relatively low computational cost. We illustrate our methods on simulation
studies and the analysis of Spellman yeast cell cycle microarray data.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS719 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
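The infinite-mixture view is what standard MCMC schemes exploit, for instance via the Chinese restaurant process representation of the Dirichlet process; a generic sketch of a prior draw of cluster assignments (the paper's network-informed prior is not reproduced here):

```python
import numpy as np

def crp_assignments(n, alpha, rng):
    """Sample cluster assignments from a Chinese restaurant process with concentration alpha."""
    counts = []                          # items per existing component
    z = np.empty(n, dtype=int)
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(0)             # open a new component
        counts[k] += 1
        z[i] = k
    return z

rng = np.random.default_rng(1)
print(crp_assignments(10, alpha=1.0, rng=rng))
```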
Bayesian Nonparametric Graph Clustering
We present clustering methods for multivariate data exploiting the underlying
geometry of the graphical structure between variables. As opposed to standard
approaches that assume known graph structures, we first estimate the edge
structure of the unknown graph using Bayesian neighborhood selection
approaches, wherein we account for the uncertainty of graphical structure
learning through model-averaged estimates of the suitable parameters.
Subsequently, we develop a nonparametric graph clustering model on the lower
dimensional projections of the graph based on Laplacian embeddings using
Dirichlet process mixture models. In contrast to standard algorithmic
approaches, this fully probabilistic approach allows incorporation of
uncertainty in estimation and inference for both graph structure learning and
clustering. More importantly, we formalize the arguments for Laplacian
embeddings as suitable projections for graph clustering by providing
theoretical support for the consistency of the eigenspace of the estimated
graph Laplacians. We develop fast computational algorithms that allow our
methods to scale to a large number of nodes. Through extensive simulations we
compare our clustering performance with standard clustering methods. We apply
our methods to a novel pan-cancer proteomic data set, and evaluate protein
networks and clusters across multiple cancer types.
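A minimal sketch of the embed-then-cluster step using off-the-shelf components (scikit-learn's truncated Dirichlet process Gaussian mixture stands in for the paper's model; the Bayesian graph estimation and model averaging are omitted, and the adjacency matrix is taken as given):

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.mixture import BayesianGaussianMixture

def laplacian_embedding_clusters(A, d=3, max_components=10, seed=0):
    """Cluster graph nodes via a Laplacian eigenmap of the adjacency A,
    followed by a truncated Dirichlet process Gaussian mixture."""
    L = laplacian(A, normed=True)          # normalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    X = eigvecs[:, 1:d + 1]                # skip the trivial eigenvector
    dpm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        random_state=seed,
    ).fit(X)
    return dpm.predict(X)

# Toy example: two 10-node cliques joined by a single edge.
A = np.zeros((20, 20))
A[:10, :10] = 1; A[10:, 10:] = 1; A[9, 10] = A[10, 9] = 1
np.fill_diagonal(A, 0)
print(laplacian_embedding_clusters(A))
```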
Structure Learning in Graphical Modeling
A graphical model is a statistical model that is associated with a graph whose
nodes correspond to variables of interest. The edges of the graph reflect
allowed conditional dependencies among the variables. Graphical models admit
computationally convenient factorization properties and have long been a
valuable tool for tractable modeling of multivariate distributions. More
recently, applications such as reconstructing gene regulatory networks from
gene expression data have driven major advances in structure learning, that is,
estimating the graph underlying a model. We review some of these advances and
discuss methods such as the graphical lasso and neighborhood selection for
undirected graphical models (or Markov random fields), and the PC algorithm and
score-based search methods for directed graphical models (or Bayesian
networks). We further review extensions that account for effects of latent
variables and heterogeneous data sources.
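As a concrete illustration of the first method reviewed, the graphical lasso estimates a sparse precision matrix by l1-penalized Gaussian maximum likelihood; a minimal scikit-learn example on a four-node chain:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Ground-truth sparse precision matrix: a chain X1 - X2 - X3 - X4.
P = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(P), size=2000)

model = GraphicalLasso(alpha=0.05).fit(X)
# Nonzero off-diagonal entries of the estimated precision are the learned edges.
print(np.round(model.precision_, 2))
```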
On sensing capacity of sensor networks for the class of linear observation, fixed SNR models
In this paper we address the problem of finding the sensing capacity of
sensor networks for a class of linear observation models and a fixed SNR
regime. Sensing capacity is defined as the maximum number of signal dimensions
reliably identified per sensor observation. In this context sparsity of the
phenomena is a key feature that determines sensing capacity. Setting aside the
SNR of the environment, the effect of sparsity on the number of measurements
required for accurate reconstruction of a sparse phenomenon has been widely
studied under compressed sensing. Nevertheless, the development there was
motivated from an algorithmic perspective. In this paper our aim is to derive
these bounds in an information-theoretic set-up and thus provide
algorithm-independent conditions for reliable reconstruction of sparse signals. In this
direction we first generalize Fano's inequality and provide lower bounds on the
probability of error in reconstruction subject to an arbitrary distortion
criterion. Using these lower bounds on the probability of error, we derive upper
bounds on sensing capacity and show that for a fixed SNR regime sensing capacity
goes down to zero as sparsity goes down to zero. This means that
disproportionately more sensors are required to monitor very sparse events. Our
next main contribution is that we show the effect of sensing diversity on
sensing capacity, an effect that has not been considered before. Sensing
diversity is related to the effective coverage of a sensor with respect
to the field. In this direction we show the following results: (a) Sensing
capacity goes down as sensing diversity per sensor goes down; (b) Random
sampling (coverage) of the field by sensors is better than contiguous location
sampling (coverage).
Comment: 37 pages, single column.
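For reference, the classical form of Fano's inequality that the paper generalizes to arbitrary distortion criteria: for a message X uniform over M hypotheses and an estimate computed from the observation Y,

```latex
H(X \mid Y) \;\le\; h(P_e) + P_e \log(M - 1)
\quad\Longrightarrow\quad
P_e \;\ge\; 1 - \frac{I(X;Y) + \log 2}{\log M},
```

where P_e is the probability that the estimate differs from X and h is the binary entropy; the paper's bounds replace the error event with an arbitrary distortion criterion.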
Variational Inference for Sparse and Undirected Models
Undirected graphical models are applied in genomics, protein structure
prediction, and neuroscience to identify sparse interactions that underlie
discrete data. Although Bayesian methods for inference would be favorable in
these contexts, they are rarely used because they require doubly intractable
Monte Carlo sampling. Here, we develop a framework for scalable Bayesian
inference of discrete undirected models based on two new methods. The first is
Persistent VI, an algorithm for variational inference of discrete undirected
models that avoids doubly intractable MCMC and approximations of the partition
function. The second is Fadeout, a reparameterization approach for variational
inference under sparsity-inducing priors that captures a posteriori
correlations between parameters and hyperparameters with noncentered
parameterizations. We find that, together, these methods for variational
inference substantially improve learning of sparse undirected graphical models
in simulated and real problems from physics and biology.
Comment: 34th International Conference on Machine Learning (ICML 2017).
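The noncentered reparameterization behind Fadeout can be illustrated generically (a sketch assuming a half-Cauchy scale prior, not the paper's code): rather than sampling w with w | tau ~ N(0, tau^2), one writes w = tau * w_tilde with w_tilde ~ N(0, 1), which removes the funnel-shaped a posteriori coupling between the parameter and its hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Centered parameterization: w | tau ~ N(0, tau^2).
tau_c = np.abs(rng.standard_cauchy(n))   # half-Cauchy scale (illustrative choice)
w_c = rng.normal(0.0, tau_c)

# Noncentered parameterization: w = tau * w_tilde, w_tilde ~ N(0, 1).
tau_nc = np.abs(rng.standard_cauchy(n))
w_tilde = rng.normal(0.0, 1.0, n)
w_nc = tau_nc * w_tilde                  # same marginal prior on w

# The sampler's view differs: (w, log tau) is funnel-shaped and correlated,
# while (w_tilde, log tau) factorizes.
print(np.corrcoef(np.abs(w_c), np.log(tau_c))[0, 1])       # strongly positive
print(np.corrcoef(np.abs(w_tilde), np.log(tau_nc))[0, 1])  # near zero
```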
Image Denoising with Kernels based on Natural Image Relations
A successful class of image denoising methods is based on Bayesian approaches
working in wavelet representations. However, analytical estimates can be
obtained only for particular combinations of analytical models of signal and
noise, thus precluding their straightforward extension to deal with other
arbitrary noise sources. In this paper, we propose an alternative non-explicit
way to take into account the relations among natural image wavelet coefficients
for denoising: we use support vector regression (SVR) in the wavelet domain to
enforce these relations in the estimated signal. Since relations among the
coefficients are specific to the signal, the regularization property of SVR is
exploited to remove the noise, which does not share this feature. The specific
signal relations are encoded in an anisotropic kernel obtained from mutual
information measures computed on a representative image database. Training
considers minimizing the Kullback-Leibler divergence (KLD) between the
estimated and actual probability functions of signal and noise in order to
enforce similarity. Due to its non-parametric nature, the method can cope with
different noise sources without the need for an explicit re-formulation, as is
strictly necessary under parametric Bayesian
formalisms. Results under several noise levels and noise sources show that: (1)
the proposed method outperforms conventional wavelet methods that assume
coefficient independence, (2) it is similar to state-of-the-art methods that do
explicitly include these relations when the noise source is Gaussian, and (3)
it gives better numerical and visual performance when more complex, realistic
noise sources are considered. Therefore, the proposed machine learning approach
can be seen as a more flexible (model-free) alternative to the explicit
description of wavelet coefficient relations for image denoising.
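A much-simplified sketch of wavelet-domain SVR denoising (an isotropic RBF kernel on pixel coordinates stands in for the paper's anisotropic, mutual-information-derived kernel; PyWavelets and scikit-learn are assumed):

```python
import numpy as np
import pywt
from sklearn.svm import SVR

def svr_wavelet_denoise(img, wavelet="db4", level=2, C=1.0, epsilon=0.5):
    """Denoise an image by smoothing each detail subband with support vector
    regression; the epsilon-insensitive loss ignores low-amplitude noise."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    new_coeffs = [coeffs[0]]                 # keep the approximation subband
    for details in coeffs[1:]:
        smoothed = []
        for band in details:
            h, w = band.shape
            yy, xx = np.mgrid[0:h, 0:w]
            X = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)
            svr = SVR(kernel="rbf", C=C, epsilon=epsilon).fit(X, band.ravel())
            smoothed.append(svr.predict(X).reshape(h, w))
        new_coeffs.append(tuple(smoothed))
    return pywt.waverec2(new_coeffs, wavelet)

noisy = np.random.default_rng(0).normal(0, 1, (64, 64))
clean = svr_wavelet_denoise(noisy)
```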
Probabilistic Approach for Evaluating Metabolite Sample Integrity
The success of metabolomics studies depends upon the "fitness" of each
biological sample used for analysis: it is critical that metabolite levels
reported for a biological sample represent an accurate snapshot of the studied
organism's metabolite profile at time of sample collection. Numerous factors
may compromise metabolite sample fitness, including chemical and biological
factors which intervene during sample collection, handling, storage, and
preparation for analysis. We propose a probabilistic model for the quantitative
assessment of metabolite sample fitness. Collection and processing of nuclear
magnetic resonance (NMR) and ultra-performance liquid chromatography-mass spectrometry (UPLC-MS)
metabolomics data is discussed. Feature selection methods utilized for
multivariate data analysis are briefly reviewed, including feature clustering
and computation of latent vectors using spectral methods. We propose that the
time-course of metabolite changes in samples stored at different temperatures
may be utilized to identify changing-metabolite-to-stable-metabolite ratios as
markers of sample fitness. Tolerance intervals may be computed to characterize
these ratios among fresh samples. In order to discover additional structure in
the data relevant to sample fitness, we propose using data labeled according to
these ratios to train a Dirichlet process mixture model (DPMM) for assessing
sample fitness. DPMMs are highly intuitive since they model the metabolite
levels in a sample as arising from a combination of processes including, e.g.,
normal biological processes and degradation- or contamination-inducing
processes. The outputs of a DPMM are probabilities that a sample is associated
with a given process, and these probabilities may be incorporated into a final
classifier for sample fitness.
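A schematic of the proposed pipeline on synthetic data (all intensities and column roles here are hypothetical; scikit-learn's truncated Dirichlet process mixture stands in for the DPMM):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Hypothetical intensities: column 0 is a degradation-prone metabolite,
# column 1 a stable reference; 'aged' samples have partly degraded.
fresh = rng.lognormal(mean=[0.0, 1.0], sigma=0.1, size=(50, 2))
aged = rng.lognormal(mean=[-0.5, 1.0], sigma=0.1, size=(50, 2))
X = np.vstack([fresh, aged])

# Changing-metabolite-to-stable-metabolite log ratio as the fitness marker.
ratios = (np.log(X[:, 0]) - np.log(X[:, 1])).reshape(-1, 1)

dpmm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(ratios)
# Per-sample probabilities of association with each inferred process.
print(np.round(dpmm.predict_proba(ratios[:3]), 2))
```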
Bayesian Structured Sparsity from Gaussian Fields
Substantial research on structured sparsity has contributed to analysis of
many different applications. However, there have been few Bayesian procedures
among this work. Here, we develop a Bayesian model for structured sparsity that
uses a Gaussian process (GP) to share parameters of the sparsity-inducing prior
in proportion to feature similarity as defined by an arbitrary positive
definite kernel. For linear regression, this sparsity-inducing prior on
regression coefficients is a relaxation of the canonical spike-and-slab prior
that flattens the mixture model into a scale mixture of normals. This prior
retains the explicit posterior probability on inclusion parameters---now with
GP probit prior distributions---but enables tractable computation via
elliptical slice sampling for the latent Gaussian field. We motivate
development of this prior using the genomic application of association mapping,
or identifying genetic variants associated with a continuous trait. Our
Bayesian structured sparsity model produced sparse results with substantially
improved sensitivity and precision relative to comparable methods. Through
simulations, we show that three properties are key to this improvement: i)
modeling structure in the covariates, ii) significance testing using the
posterior probabilities of inclusion, and iii) model averaging. We present
results from applying this model to a large genomic dataset to demonstrate
computational tractability.
Comment: 23 pages, 7 figures.
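The elliptical slice sampling update used for the latent Gaussian field is short enough to state in full; a generic implementation of the Murray, Adams and MacKay (2010) algorithm, not the authors' code:

```python
import numpy as np

def elliptical_slice(f, log_lik, chol_sigma, rng):
    """One elliptical slice sampling update for a latent Gaussian
    f ~ N(0, Sigma), with Sigma = chol_sigma @ chol_sigma.T."""
    nu = chol_sigma @ rng.standard_normal(f.shape)   # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())       # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse
        if log_lik(f_new) > log_y:
            return f_new                                # accepted
        if theta < 0.0:                                 # shrink bracket, retry
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy use: posterior for f ~ N(0, I) with likelihood N(data | f, 0.5^2 I).
rng = np.random.default_rng(0)
data = np.array([1.0, -2.0])
loglik = lambda f: -0.5 * np.sum((data - f) ** 2) / 0.25
f = np.zeros(2)
for _ in range(1000):
    f = elliptical_slice(f, loglik, np.eye(2), rng)
```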
Stagewise Learning for Sparse Clustering of Discretely-Valued Data
The performance of EM in learning mixtures of product distributions often
depends on the initialization. This can be problematic in crowdsourcing and
other applications, e.g. when a small number of 'experts' are diluted by a
large number of noisy, unreliable participants. We develop a new EM algorithm
that is driven by these experts. In a manner that differs from other
approaches, we start from a single mixture class. The algorithm then develops
the set of 'experts' in a stagewise fashion based on a mutual information
criterion. At each stage EM operates on this subset of the players, effectively
regularizing the E rather than the M step. Experiments show that stagewise EM
outperforms other initialization techniques for crowdsourcing and neuroscience
applications, and can guide a full EM to results comparable to those obtained
knowing the exact distribution.
Comment: 9 pages.
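A schematic of the stagewise selection idea (a sketch that differs from the paper in its details: mutual information is computed pairwise between players, and the per-stage EM fit on the selected subset is omitted):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def stagewise_experts(X, n_experts):
    """Greedily select columns of a discrete data matrix X by average mutual
    information with the columns already selected (seeded by the best pair)."""
    d = X.shape[1]
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            mi[i, j] = mi[j, i] = mutual_info_score(X[:, i], X[:, j])
    selected = list(np.unravel_index(np.argmax(mi), mi.shape))  # best pair first
    while len(selected) < n_experts:
        remaining = [j for j in range(d) if j not in selected]
        scores = [mi[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 200)                  # latent class
experts = np.column_stack(
    [np.where(rng.random(200) < 0.9, z, 1 - z) for _ in range(3)]
)
noise = rng.integers(0, 2, (200, 7))         # unreliable players
X = np.hstack([experts, noise])
print(stagewise_experts(X, n_experts=3))     # likely picks columns 0, 1, 2
```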