2,507 research outputs found
Combinatorial clustering and the beta negative binomial process
We develop a Bayesian nonparametric approach to a general family of latent
class problems in which individuals can belong simultaneously to multiple
classes and where each class can be exhibited multiple times by an individual.
We introduce a combinatorial stochastic process known as the negative binomial
process (NBP) as an infinite-dimensional prior appropriate for such problems.
We show that the NBP is conjugate to the beta process, and we characterize the
posterior distribution under the beta-negative binomial process (BNBP) and
hierarchical models based on the BNBP (the HBNBP). We study the asymptotic
properties of the BNBP and develop a three-parameter extension of the BNBP that
exhibits power-law behavior. We derive MCMC algorithms for posterior inference
under the HBNBP, and we present experiments using these algorithms in the
domains of image segmentation, object recognition, and document analysis.
Comment: 56 pages, 4 figures, 6 tables
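As a rough illustration of the generative idea in this abstract, the sketch below draws from a finite-K truncation of a beta-negative binomial process: each of K candidate classes gets a beta-distributed weight, and each individual exhibits each class a negative-binomial number of times. The hyperparameters a, b, r and the truncation level are illustrative choices, not the paper's notation or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def bnbp_counts(K=1000, a=1.0, b=1.0, r=2.0, n_individuals=5):
    """Finite-K truncation of a beta-negative binomial process (sketch).

    Individuals can belong to several classes at once, and exhibit each
    class multiple times, because every entry is a nonnegative count.
    """
    p = rng.beta(a / K, b, size=K)  # beta-process-like atom weights
    # NumPy's negative_binomial(n, q) counts failures before n successes
    # with success probability q, so passing 1 - p_k makes larger weights
    # p_k correspond to larger expected counts r * p_k / (1 - p_k).
    counts = rng.negative_binomial(r, 1.0 - p, size=(n_individuals, K))
    return counts

counts = bnbp_counts()
print(counts.shape)                    # (5, 1000); most columns are all zero
print((counts.sum(axis=0) > 0).sum())  # classes actually exhibited by someone
```

With a/K small, only a handful of weights are non-negligible, so the realized matrix is sparse even though K is large, mimicking the infinite-dimensional prior.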
Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
We define a family of probability distributions for random count matrices
with a potentially unbounded number of rows and columns. The three
distributions we consider are derived from the gamma-Poisson, gamma-negative
binomial, and beta-negative binomial processes. Because the models lead to
closed-form Gibbs sampling update equations, they are natural candidates for
nonparametric Bayesian priors over count matrices. A key aspect of our analysis
is the recognition that, although the random count matrices within the family
are defined by a row-wise construction, their columns can be shown to be i.i.d.
This fact is used to derive explicit formulas for drawing all the columns at
once. Moreover, by analyzing these matrices' combinatorial structure, we
describe how to sequentially construct a column-i.i.d. random count matrix one
row at a time, and derive the predictive distribution of a new row count vector
with previously unseen features. We describe the similarities and differences
between the three priors, and argue that the greater flexibility of the gamma-
and beta-negative binomial processes, especially their ability to model
over-dispersed, heavy-tailed count data, makes them well suited to a wide
variety of real-world applications. As an example of our framework, we
construct a naive-Bayes text classifier to categorize a count vector to one of
several existing random count matrices of different categories. The classifier
supports an unbounded number of features, and unlike most existing methods, it
does not require a predefined finite vocabulary to be shared by all the
categories, and needs neither feature selection nor parameter tuning. Both the
gamma- and beta-negative binomial processes are shown to significantly
outperform the gamma-Poisson process for document categorization, with
comparable performance to other state-of-the-art supervised text classification
algorithms.
Comment: To appear in the Journal of the American Statistical Association (Theory and Methods). 31 pages + 11-page supplement, 5 figures
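The column-i.i.d. property highlighted in this abstract is easy to see in a finite-K sketch of the gamma-Poisson case: each column has its own gamma-distributed rate, and every row's count in that column is an independent Poisson draw given the rate. The hyperparameters gamma0 and c below are illustrative, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_poisson_count_matrix(n_rows=4, K=500, gamma0=5.0, c=1.0):
    """Finite-K sketch of a gamma-Poisson random count matrix.

    The K columns are i.i.d. by construction: each column's rate is an
    independent gamma draw, and entries are conditionally independent
    Poisson counts given that rate.
    """
    rates = rng.gamma(gamma0 / K, 1.0 / c, size=K)  # per-column rates
    X = rng.poisson(rates, size=(n_rows, K))        # rows of Poisson counts
    return X[:, X.sum(axis=0) > 0]                  # keep non-empty columns

X = gamma_poisson_count_matrix()
print(X.shape)  # 4 rows by a random number of active columns
```

Dropping the all-zero columns at the end mirrors how such matrices are reported with a potentially unbounded but finitely realized number of columns.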
Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling
The beta-negative binomial process (BNBP), an integer-valued stochastic
process, is employed to partition a count vector into a latent random count
matrix. As the marginal probability distribution of the BNBP that governs the
exchangeable random partitions of grouped data has not yet been developed,
current inference for the BNBP has to truncate the number of atoms of the beta
process. This paper introduces an exchangeable partition probability function
to explicitly describe how the BNBP clusters the data points of each group into
a random number of exchangeable partitions, which are shared across all the
groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a
novel nonparametric Bayesian topic model that is distinct from existing ones,
with simple implementation, fast convergence, good mixing, and state-of-the-art
predictive performance.
Comment: In Neural Information Processing Systems (NIPS) 2014. 9 pages + 3-page appendix
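To make the notion of an exchangeable partition probability function concrete, the classic example is the Chinese restaurant process, which is the Dirichlet-process special case; the BNBP's EPPF derived in this paper is different, but induces partitions through the same kind of sequential scheme.

```python
import numpy as np

rng = np.random.default_rng(2)

def crp_partition(n, alpha=1.0):
    """Chinese restaurant process: the textbook exchangeable random
    partition, shown only to illustrate what an EPPF induces.

    Customer i joins an existing cluster with probability proportional
    to its size, or opens a new cluster with probability prop. to alpha.
    """
    labels = [0]
    sizes = [1]
    for _ in range(1, n):
        probs = np.array(sizes + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join cluster k
        labels.append(k)
    return labels

labels = crp_partition(100)
print(len(set(labels)))  # random cluster count, on the order of alpha*log(n)
```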
Generalized Negative Binomial Processes and the Representation of Cluster Structures
The paper introduces the concept of a cluster structure to define a joint
distribution of the sample size and its exchangeable random partitions. The
cluster structure allows the probability distribution of the random partitions
of a subset of the sample to be dependent on the sample size, a feature not
present in a partition structure. A generalized negative binomial process
count-mixture model is proposed to generate a cluster structure, where in the
prior the number of clusters is finite and Poisson distributed and the cluster
sizes follow a truncated negative binomial distribution. The number and sizes
of clusters can be controlled to exhibit distinct asymptotic behaviors. Unique
model properties are illustrated with example clustering results using a
generalized Polya urn sampling scheme. The paper provides new methods to
generate exchangeable random partitions and to control both the cluster-number
and cluster-size distributions.
Comment: 30 pages, 8 figures
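The prior described in this abstract is straightforward to simulate in caricature: draw a Poisson number of clusters, then give each a zero-truncated negative binomial size. The sketch below uses simple rejection for the truncation; lam, r, and p are illustrative hyperparameters, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(3)

def cluster_structure(lam=5.0, r=2.0, p=0.5):
    """Sketch of the generative scheme: Poisson-many clusters, each with
    a zero-truncated negative binomial size, so the total sample size is
    random and jointly distributed with the partition."""
    n_clusters = rng.poisson(lam)
    sizes = []
    for _ in range(n_clusters):
        s = 0
        while s == 0:  # reject zeros to truncate the distribution at 1
            s = rng.negative_binomial(r, p)
        sizes.append(s)
    return sizes

sizes = cluster_structure()
print(len(sizes), sum(sizes))  # number of clusters and total sample size
```

Because the sample size is generated along with the partition, the distribution of the partition of a subsample can depend on the sample size, which is exactly the feature distinguishing a cluster structure from a partition structure.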
Cluster and Feature Modeling from Combinatorial Stochastic Processes
One of the focal points of the modern literature on Bayesian nonparametrics
has been the problem of clustering, or partitioning, where each data point is
modeled as being associated with one and only one of some collection of groups
called clusters or partition blocks. Underlying these Bayesian nonparametric
models are a set of interrelated stochastic processes, most notably the
Dirichlet process and the Chinese restaurant process. In this paper we provide
a formal development of an analogous problem, called feature modeling, for
associating data points with arbitrary nonnegative integer numbers of groups,
now called features or topics. We review the existing combinatorial stochastic
process representations for the clustering problem and develop analogous
representations for the feature modeling problem. These representations include
the beta process and the Indian buffet process as well as new representations
that provide insight into the connections between these processes. We thereby
bring the same level of completeness to the treatment of Bayesian nonparametric
feature modeling that has previously been achieved for Bayesian nonparametric
clustering.
Comment: Published at http://dx.doi.org/10.1214/13-STS434 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
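The feature-modeling counterpart to the Chinese restaurant process discussed in this abstract is the Indian buffet process, whose sequential customer/dish scheme can be sketched directly. The code below is the standard generative recipe, with alpha as the usual concentration parameter.

```python
import numpy as np

rng = np.random.default_rng(4)

def ibp(n_customers=10, alpha=2.0):
    """Indian buffet process: customer i takes each previously sampled
    dish k with probability m_k / i (m_k = times dish k was taken), then
    tries Poisson(alpha / i) new dishes. Returns a binary matrix whose
    rows are customers and columns are dishes (features)."""
    dish_counts = []  # dish_counts[k] = customers who have taken dish k
    rows = []
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < m / i) for m in dish_counts]
        for k, z in enumerate(row):
            dish_counts[k] += z
        n_new = rng.poisson(alpha / i)
        row.extend([1] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, : len(row)] = row
    return Z

Z = ibp()
print(Z.shape)  # 10 customers by a random number of dishes
```

Unlike a partition, each row of Z may select zero, one, or many columns, which is exactly the move from clustering to feature modeling that the paper formalizes.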
Poisson Latent Feature Calculus for Generalized Indian Buffet Processes
The purpose of this work is to describe a unified, and indeed simple,
mechanism for non-parametric Bayesian analysis, construction and generative
sampling of a large class of latent feature models which one can describe as
generalized notions of Indian Buffet Processes (IBP). This is done via the
Poisson Process Calculus as it now relates to latent feature models. The IBP
was ingeniously devised by Griffiths and Ghahramani (2005), and its
generative scheme is cast in terms of customers entering sequentially an Indian
Buffet restaurant and selecting previously sampled dishes as well as new
dishes. In this metaphor, dishes correspond to latent features, attributes, or
preferences shared by individuals. The IBP, and its generalizations, represent
an exciting class of models well suited to handle high dimensional statistical
problems now common in this information age. The IBP is based on the usage of
conditionally independent Bernoulli random variables, coupled with completely
random measures acting as Bayesian priors, that are used to create sparse
binary matrices. This Bayesian non-parametric view was a key insight due to
Thibaux and Jordan (2007). One way to think of generalizations is to use
more general random variables. Of note in the current literature are models
employing Poisson and negative binomial random variables. However, unlike for
their closely related counterparts, the generalized Chinese restaurant
processes, the ability to analyze IBP models in a systematic and general manner
is not yet available. The limitations are both in terms of knowledge about the effects of
different priors and in terms of models based on a wider choice of random
variables. This work will not only provide a thorough description of the
properties of existing models but also provide a simple template to devise and
analyze new models.
Comment: This version provides more details for the multivariate extensions in Section 5. We highlight the case of a simple multinomial distribution and showcase a multivariate Lévy process prior we call a stable-Beta Dirichlet process. Section 4.1.1 expanded.
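The beta-Bernoulli view credited above to Thibaux and Jordan (2007) has a simple finite approximation: draw a beta probability for each of K candidate features and fill the binary matrix with independent Bernoulli entries. As K grows this approaches the beta-process/IBP construction; the hyperparameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def beta_bernoulli_matrix(n=8, K=200, alpha=2.0):
    """Finite beta-Bernoulli approximation to the IBP (sketch).

    Feature probabilities pi_k ~ Beta(alpha/K, 1) act as the completely
    random measure; entries z_ik ~ Bernoulli(pi_k) give a sparse binary
    matrix because most pi_k are tiny when alpha/K is small."""
    pi = rng.beta(alpha / K, 1.0, size=K)
    Z = (rng.random((n, K)) < pi).astype(int)
    return Z

Z = beta_bernoulli_matrix()
print(Z.mean())  # small: the matrix is sparse
```

Swapping the Bernoulli entries for Poisson or negative binomial draws, as the abstract suggests, yields count-valued rather than binary feature matrices.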
Generalized Species Sampling Priors with Latent Beta reinforcements
Many popular Bayesian nonparametric priors can be characterized in terms of
exchangeable species sampling sequences. However, in some applications,
exchangeability may not be appropriate. We introduce a novel and
probabilistically coherent family of non-exchangeable species sampling
sequences characterized by a tractable predictive probability function with
weights driven by a sequence of independent Beta random variables. We compare
their theoretical clustering properties with those of the Dirichlet Process and
the two-parameter Poisson-Dirichlet process. The proposed construction
provides a complete characterization of the joint process, unlike
existing work. We then propose the use of this process as a prior distribution in
a hierarchical Bayes modeling framework, and we describe a Markov Chain Monte
Carlo sampler for posterior inference. We evaluate the performance of the prior
and the robustness of the resulting inference in a simulation study, providing
a comparison with popular Dirichlet process mixtures and hidden Markov
Models. Finally, we develop an application to the detection of chromosomal
aberrations in breast cancer by leveraging array CGH data.
Comment: For correspondence purposes, Edoardo M. Airoldi's email is [email protected]; Federico Bassetti's email is [email protected]; Michele Guindani's email is [email protected]; Fabrizio Leisen's email is [email protected]. To appear in the Journal of the American Statistical Association
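To give a feel for a species sampling sequence whose weights are driven by independent Beta random variables, here is a deliberately simplified, hypothetical scheme (not the paper's exact construction): before each draw, an independent Beta variable sets the probability of observing a new species; otherwise the next label copies a uniformly chosen earlier observation, reinforcing large species.

```python
import numpy as np

rng = np.random.default_rng(6)

def beta_reinforced_sequence(n=200, a=1.0, b=3.0):
    """Illustrative non-exchangeable species sampling sequence.

    Each step draws an independent Beta(a, b) weight w; with probability
    w a new species appears, otherwise an earlier observation is copied.
    Because w is redrawn every step, the sequence need not be
    exchangeable, unlike CRP-style predictive schemes."""
    labels = [0]
    next_label = 1
    for _ in range(n - 1):
        w = rng.beta(a, b)  # independent Beta-driven weight
        if rng.random() < w:
            labels.append(next_label)
            next_label += 1
        else:
            labels.append(labels[rng.integers(len(labels))])
    return labels

labels = beta_reinforced_sequence()
print(len(set(labels)))  # number of species observed among 200 draws
```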