Bayesian mixture modeling for multivariate conditional distributions
We present a Bayesian mixture model for estimating the joint distribution of
mixed ordinal, nominal, and continuous data conditional on a set of fixed
variables. The model uses multivariate normal and categorical mixture kernels
for the random variables. It induces dependence between the random and fixed
variables through the means of the multivariate normal mixture kernels and via
a truncated local Dirichlet process. The latter encourages observations with
similar values of the fixed variables to share mixture components. Using a
simulation of data fusion, we illustrate that the model can estimate underlying
relationships in the data and the distributions of the missing values more
accurately than a mixture model applied to the random and fixed variables
jointly. We use the model to analyze consumers' reading behaviors using a quota
sample conducted by the book publisher HarperCollins, i.e., a sample in which
the empirical distribution of some variables is fixed by design and so should
not be modeled as random.
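As a minimal sketch of the kind of prior involved, the weights of a Dirichlet process truncated at K components can be drawn by stick-breaking (generic NumPy, not the authors' code; the paper's local, covariate-dependent weighting is omitted):

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Draw mixture weights from a DP stick-breaking prior truncated at K components."""
    v = rng.beta(1.0, alpha, size=K)   # stick-breaking fractions
    v[-1] = 1.0                        # truncation: last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining               # weights sum to 1 exactly

rng = np.random.default_rng(0)
w = truncated_stick_breaking(alpha=1.0, K=20, rng=rng)
print(w.sum())  # 1.0
```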
A Bayesian nonparametric mixture model for selecting genes and gene subnetworks
It is very challenging to select informative features from tens of thousands
of measured features in high-throughput data analysis. Recently, several
parametric/regression models have been developed utilizing the gene network
information to select genes or pathways strongly associated with a
clinical/biological outcome. Alternatively, in this paper, we propose a
nonparametric Bayesian model for gene selection incorporating network
information. In addition to identifying genes that have a strong association
with a clinical outcome, our model can select genes with particular
expressional behavior, in which case the regression models are not directly
applicable. We show that our proposed model is equivalent to an infinite
mixture model for which we develop a posterior computation algorithm based on
Markov chain Monte Carlo (MCMC) methods. We also propose two fast computing
algorithms that approximate the posterior simulation with good accuracy but
relatively low computational cost. We illustrate our methods on simulation
studies and the analysis of Spellman yeast cell cycle microarray data.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS719 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
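The infinite-mixture view is what standard MCMC schemes exploit, for instance via the Chinese restaurant process representation of the Dirichlet process; a generic sketch of a prior draw of cluster assignments (the paper's network-informed prior is not reproduced here):

```python
import numpy as np

def crp_assignments(n, alpha, rng):
    """Sample cluster assignments from a Chinese restaurant process with concentration alpha."""
    counts = []                          # items per existing component
    z = np.empty(n, dtype=int)
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(0)             # open a new component
        counts[k] += 1
        z[i] = k
    return z

rng = np.random.default_rng(1)
print(crp_assignments(10, alpha=1.0, rng=rng))
```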
Bayesian Nonparametric Graph Clustering
We present clustering methods for multivariate data exploiting the underlying
geometry of the graphical structure between variables. As opposed to standard
approaches that assume known graph structures, we first estimate the edge
structure of the unknown graph using Bayesian neighborhood selection
approaches, wherein we account for the uncertainty of graphical structure
learning through model-averaged estimates of the suitable parameters.
Subsequently, we develop a nonparametric graph clustering model on the lower
dimensional projections of the graph based on Laplacian embeddings using
Dirichlet process mixture models. In contrast to standard algorithmic
approaches, this fully probabilistic approach allows incorporation of
uncertainty in estimation and inference for both graph structure learning and
clustering. More importantly, we formalize the arguments for Laplacian
embeddings as suitable projections for graph clustering by providing
theoretical support for the consistency of the eigenspace of the estimated
graph Laplacians. We develop fast computational algorithms that allow our
methods to scale to a large number of nodes. Through extensive simulations we
compare our clustering performance with standard clustering methods. We apply
our methods to a novel pan-cancer proteomic data set, and evaluate protein
networks and clusters across multiple cancer types.
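A minimal sketch of the embed-then-cluster step using off-the-shelf components (scikit-learn's truncated Dirichlet process Gaussian mixture stands in for the paper's model; the Bayesian graph estimation and model averaging are omitted, and the adjacency matrix is taken as given):

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.mixture import BayesianGaussianMixture

def laplacian_embedding_clusters(A, d=3, max_components=10, seed=0):
    """Cluster graph nodes via a Laplacian eigenmap of the adjacency A,
    followed by a truncated Dirichlet process Gaussian mixture."""
    L = laplacian(A, normed=True)          # normalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    X = eigvecs[:, 1:d + 1]                # skip the trivial eigenvector
    dpm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        random_state=seed,
    ).fit(X)
    return dpm.predict(X)

# Toy example: two 10-node cliques joined by a single edge.
A = np.zeros((20, 20))
A[:10, :10] = 1; A[10:, 10:] = 1; A[9, 10] = A[10, 9] = 1
np.fill_diagonal(A, 0)
print(laplacian_embedding_clusters(A))
```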
Structure Learning in Graphical Modeling
A graphical model is a statistical model that is associated with a graph whose
nodes correspond to variables of interest. The edges of the graph reflect
allowed conditional dependencies among the variables. Graphical models admit
computationally convenient factorization properties and have long been a
valuable tool for tractable modeling of multivariate distributions. More
recently, applications such as reconstructing gene regulatory networks from
gene expression data have driven major advances in structure learning, that is,
estimating the graph underlying a model. We review some of these advances and
discuss methods such as the graphical lasso and neighborhood selection for
undirected graphical models (or Markov random fields), and the PC algorithm and
score-based search methods for directed graphical models (or Bayesian
networks). We further review extensions that account for effects of latent
variables and heterogeneous data sources.
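As a concrete illustration of the first method reviewed, the graphical lasso estimates a sparse precision matrix by l1-penalized Gaussian maximum likelihood; a minimal scikit-learn example on a four-node chain:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Ground-truth sparse precision matrix: a chain X1 - X2 - X3 - X4.
P = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(P), size=2000)

model = GraphicalLasso(alpha=0.05).fit(X)
# Nonzero off-diagonal entries of the estimated precision are the learned edges.
print(np.round(model.precision_, 2))
```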
On sensing capacity of sensor networks for the class of linear observation, fixed SNR models
In this paper we address the problem of finding the sensing capacity of
sensor networks for a class of linear observation models and a fixed SNR
regime. Sensing capacity is defined as the maximum number of signal dimensions
reliably identified per sensor observation. In this context sparsity of the
phenomena is a key feature that determines sensing capacity. Setting aside the
SNR of the environment, the effect of sparsity on the number of measurements
required for accurate reconstruction of a sparse phenomenon has been widely
studied under compressed sensing. Nevertheless, the development there was
motivated from an algorithmic perspective. In this paper our aim is to derive
these bounds in an information-theoretic set-up and thus provide
algorithm-independent conditions for reliable reconstruction of sparse signals. In this
direction we first generalize Fano's inequality and provide lower bounds on the
probability of error in reconstruction subject to an arbitrary distortion
criterion. Using these lower bounds on the probability of error, we derive upper
bounds on sensing capacity and show that for a fixed SNR regime sensing capacity
goes down to zero as sparsity goes down to zero. This means that
disproportionately more sensors are required to monitor very sparse events. Our
next main contribution is that we show the effect of sensing diversity on
sensing capacity, an effect that has not been considered before. Sensing
diversity is related to the effective coverage of a sensor with respect
to the field. In this direction we show the following results: (a) Sensing
capacity goes down as sensing diversity per sensor goes down; (b) Random
sampling (coverage) of the field by sensors is better than contiguous location
sampling (coverage).
Comment: 37 pages, single column.
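For reference, the classical form of Fano's inequality that the paper generalizes to arbitrary distortion criteria: for a message X uniform over M hypotheses and an estimate computed from the observation Y,

```latex
H(X \mid Y) \;\le\; h(P_e) + P_e \log(M - 1)
\quad\Longrightarrow\quad
P_e \;\ge\; 1 - \frac{I(X;Y) + \log 2}{\log M},
```

where P_e is the probability that the estimate differs from X and h is the binary entropy; the paper's bounds replace the error event with an arbitrary distortion criterion.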
Variational Inference for Sparse and Undirected Models
Undirected graphical models are applied in genomics, protein structure
prediction, and neuroscience to identify sparse interactions that underlie
discrete data. Although Bayesian methods for inference would be favorable in
these contexts, they are rarely used because they require doubly intractable
Monte Carlo sampling. Here, we develop a framework for scalable Bayesian
inference of discrete undirected models based on two new methods. The first is
Persistent VI, an algorithm for variational inference of discrete undirected
models that avoids doubly intractable MCMC and approximations of the partition
function. The second is Fadeout, a reparameterization approach for variational
inference under sparsity-inducing priors that captures a posteriori
correlations between parameters and hyperparameters with noncentered
parameterizations. We find that, together, these methods for variational
inference substantially improve learning of sparse undirected graphical models
in simulated and real problems from physics and biology.
Comment: 34th International Conference on Machine Learning (ICML 2017).
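The noncentered reparameterization behind Fadeout can be illustrated generically (a sketch assuming a half-Cauchy scale prior, not the paper's code): rather than sampling w with w | tau ~ N(0, tau^2), one writes w = tau * w_tilde with w_tilde ~ N(0, 1), which removes the funnel-shaped a posteriori coupling between the parameter and its hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Centered parameterization: w | tau ~ N(0, tau^2).
tau_c = np.abs(rng.standard_cauchy(n))   # half-Cauchy scale (illustrative choice)
w_c = rng.normal(0.0, tau_c)

# Noncentered parameterization: w = tau * w_tilde, w_tilde ~ N(0, 1).
tau_nc = np.abs(rng.standard_cauchy(n))
w_tilde = rng.normal(0.0, 1.0, n)
w_nc = tau_nc * w_tilde                  # same marginal prior on w

# The sampler's view differs: (w, log tau) is funnel-shaped and correlated,
# while (w_tilde, log tau) factorizes.
print(np.corrcoef(np.abs(w_c), np.log(tau_c))[0, 1])       # strongly positive
print(np.corrcoef(np.abs(w_tilde), np.log(tau_nc))[0, 1])  # near zero
```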
Image Denoising with Kernels based on Natural Image Relations
A successful class of image denoising methods is based on Bayesian approaches
working in wavelet representations. However, analytical estimates can be
obtained only for particular combinations of analytical models of signal and
noise, thus precluding their straightforward extension to deal with other
arbitrary noise sources. In this paper, we propose an alternative non-explicit
way to take into account the relations among natural image wavelet coefficients
for denoising: we use support vector regression (SVR) in the wavelet domain to
enforce these relations in the estimated signal. Since relations among the
coefficients are specific to the signal, the regularization property of SVR is
exploited to remove the noise, which does not share this feature. The specific
signal relations are encoded in an anisotropic kernel obtained from mutual
information measures computed on a representative image database. Training
considers minimizing the Kullback-Leibler divergence (KLD) between the
estimated and actual probability functions of signal and noise in order to
enforce similarity. Due to its non-parametric nature, the method can cope with
different noise sources without the need for an explicit re-formulation, as is
strictly necessary under parametric Bayesian
formalisms. Results under several noise levels and noise sources show that: (1)
the proposed method outperforms conventional wavelet methods that assume
coefficient independence, (2) it is similar to state-of-the-art methods that do
explicitly include these relations when the noise source is Gaussian, and (3)
it gives better numerical and visual performance when more complex, realistic
noise sources are considered. Therefore, the proposed machine learning approach
can be seen as a more flexible (model-free) alternative to the explicit
description of wavelet coefficient relations for image denoising.
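A much-simplified sketch of wavelet-domain SVR denoising (an isotropic RBF kernel on pixel coordinates stands in for the paper's anisotropic, mutual-information-derived kernel; PyWavelets and scikit-learn are assumed):

```python
import numpy as np
import pywt
from sklearn.svm import SVR

def svr_wavelet_denoise(img, wavelet="db4", level=2, C=1.0, epsilon=0.5):
    """Denoise an image by smoothing each detail subband with support vector
    regression; the epsilon-insensitive loss ignores low-amplitude noise."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    new_coeffs = [coeffs[0]]                 # keep the approximation subband
    for details in coeffs[1:]:
        smoothed = []
        for band in details:
            h, w = band.shape
            yy, xx = np.mgrid[0:h, 0:w]
            X = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)
            svr = SVR(kernel="rbf", C=C, epsilon=epsilon).fit(X, band.ravel())
            smoothed.append(svr.predict(X).reshape(h, w))
        new_coeffs.append(tuple(smoothed))
    return pywt.waverec2(new_coeffs, wavelet)

noisy = np.random.default_rng(0).normal(0, 1, (64, 64))
clean = svr_wavelet_denoise(noisy)
```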
Probabilistic Approach for Evaluating Metabolite Sample Integrity
The success of metabolomics studies depends upon the "fitness" of each
biological sample used for analysis: it is critical that metabolite levels
reported for a biological sample represent an accurate snapshot of the studied
organism's metabolite profile at time of sample collection. Numerous factors
may compromise metabolite sample fitness, including chemical and biological
factors which intervene during sample collection, handling, storage, and
preparation for analysis. We propose a probabilistic model for the quantitative
assessment of metabolite sample fitness. Collection and processing of nuclear
magnetic resonance (NMR) and ultra-performance liquid chromatography-mass spectrometry (UPLC-MS)
metabolomics data is discussed. Feature selection methods utilized for
multivariate data analysis are briefly reviewed, including feature clustering
and computation of latent vectors using spectral methods. We propose that the
time-course of metabolite changes in samples stored at different temperatures
may be utilized to identify changing-metabolite-to-stable-metabolite ratios as
markers of sample fitness. Tolerance intervals may be computed to characterize
these ratios among fresh samples. In order to discover additional structure in
the data relevant to sample fitness, we propose using data labeled according to
these ratios to train a Dirichlet process mixture model (DPMM) for assessing
sample fitness. DPMMs are highly intuitive since they model the metabolite
levels in a sample as arising from a combination of processes including, e.g.,
normal biological processes and degradation- or contamination-inducing
processes. The outputs of a DPMM are probabilities that a sample is associated
with a given process, and these probabilities may be incorporated into a final
classifier for sample fitness.
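A schematic of the proposed pipeline on synthetic data (all intensities and column roles here are hypothetical; scikit-learn's truncated Dirichlet process mixture stands in for the DPMM):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Hypothetical intensities: column 0 is a degradation-prone metabolite,
# column 1 a stable reference; 'aged' samples have partly degraded.
fresh = rng.lognormal(mean=[0.0, 1.0], sigma=0.1, size=(50, 2))
aged = rng.lognormal(mean=[-0.5, 1.0], sigma=0.1, size=(50, 2))
X = np.vstack([fresh, aged])

# Changing-metabolite-to-stable-metabolite log ratio as the fitness marker.
ratios = (np.log(X[:, 0]) - np.log(X[:, 1])).reshape(-1, 1)

dpmm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(ratios)
# Per-sample probabilities of association with each inferred process.
print(np.round(dpmm.predict_proba(ratios[:3]), 2))
```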
Bayesian Structured Sparsity from Gaussian Fields
Substantial research on structured sparsity has contributed to analysis of
many different applications. However, there have been few Bayesian procedures
among this work. Here, we develop a Bayesian model for structured sparsity that
uses a Gaussian process (GP) to share parameters of the sparsity-inducing prior
in proportion to feature similarity as defined by an arbitrary positive
definite kernel. For linear regression, this sparsity-inducing prior on
regression coefficients is a relaxation of the canonical spike-and-slab prior
that flattens the mixture model into a scale mixture of normals. This prior
retains the explicit posterior probability on inclusion parameters---now with
GP probit prior distributions---but enables tractable computation via
elliptical slice sampling for the latent Gaussian field. We motivate
development of this prior using the genomic application of association mapping,
or identifying genetic variants associated with a continuous trait. Our
Bayesian structured sparsity model produced sparse results with substantially
improved sensitivity and precision relative to comparable methods. Through
simulations, we show that three properties are key to this improvement: i)
modeling structure in the covariates, ii) significance testing using the
posterior probabilities of inclusion, and iii) model averaging. We present
results from applying this model to a large genomic dataset to demonstrate
computational tractability.
Comment: 23 pages, 7 figures.
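The elliptical slice sampling update used for the latent Gaussian field is short enough to state in full; a generic implementation of the Murray, Adams and MacKay (2010) algorithm, not the authors' code:

```python
import numpy as np

def elliptical_slice(f, log_lik, chol_sigma, rng):
    """One elliptical slice sampling update for a latent Gaussian
    f ~ N(0, Sigma), with Sigma = chol_sigma @ chol_sigma.T."""
    nu = chol_sigma @ rng.standard_normal(f.shape)   # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())       # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse
        if log_lik(f_new) > log_y:
            return f_new                                # accepted
        if theta < 0.0:                                 # shrink bracket, retry
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy use: posterior for f ~ N(0, I) with likelihood N(data | f, 0.5^2 I).
rng = np.random.default_rng(0)
data = np.array([1.0, -2.0])
loglik = lambda f: -0.5 * np.sum((data - f) ** 2) / 0.25
f = np.zeros(2)
for _ in range(1000):
    f = elliptical_slice(f, loglik, np.eye(2), rng)
```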
Stagewise Learning for Sparse Clustering of Discretely-Valued Data
The performance of EM in learning mixtures of product distributions often
depends on the initialization. This can be problematic in crowdsourcing and
other applications, e.g. when a small number of 'experts' are diluted by a
large number of noisy, unreliable participants. We develop a new EM algorithm
that is driven by these experts. In a manner that differs from other
approaches, we start from a single mixture class. The algorithm then develops
the set of 'experts' in a stagewise fashion based on a mutual information
criterion. At each stage EM operates on this subset of the players, effectively
regularizing the E rather than the M step. Experiments show that stagewise EM
outperforms other initialization techniques for crowdsourcing and neuroscience
applications, and can guide a full EM to results comparable to those obtained
knowing the exact distribution.
Comment: 9 pages.
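A schematic of the stagewise selection idea (a sketch that differs from the paper in its details: mutual information is computed pairwise between players, and the per-stage EM fit on the selected subset is omitted):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def stagewise_experts(X, n_experts):
    """Greedily select columns of a discrete data matrix X by average mutual
    information with the columns already selected (seeded by the best pair)."""
    d = X.shape[1]
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            mi[i, j] = mi[j, i] = mutual_info_score(X[:, i], X[:, j])
    selected = list(np.unravel_index(np.argmax(mi), mi.shape))  # best pair first
    while len(selected) < n_experts:
        remaining = [j for j in range(d) if j not in selected]
        scores = [mi[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 200)                  # latent class
experts = np.column_stack(
    [np.where(rng.random(200) < 0.9, z, 1 - z) for _ in range(3)]
)
noise = rng.integers(0, 2, (200, 7))         # unreliable players
X = np.hstack([experts, noise])
print(stagewise_experts(X, n_experts=3))     # likely picks columns 0, 1, 2
```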