647 research outputs found
Distance Dependent Chinese Restaurant Processes
We develop the distance dependent Chinese restaurant process (CRP), a
flexible class of distributions over partitions that allows for
non-exchangeability. This class can be used to model many kinds of dependencies
between data in infinite clustering models, including dependencies across time
or space. We examine the properties of the distance dependent CRP, discuss its
connections to Bayesian nonparametric mixture models, and derive a Gibbs
sampler for both observed and mixture settings. We study its performance with
three text corpora. We show that relaxing the assumption of exchangeability
with distance dependent CRPs can provide a better fit to sequential data. We
also show its alternative formulation of the traditional CRP leads to a
faster-mixing Gibbs sampling algorithm than the one based on the original
formulation
Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling
The beta-negative binomial process (BNBP), an integer-valued stochastic
process, is employed to partition a count vector into a latent random count
matrix. As the marginal probability distribution of the BNBP that governs the
exchangeable random partitions of grouped data has not yet been developed,
current inference for the BNBP has to truncate the number of atoms of the beta
process. This paper introduces an exchangeable partition probability function
to explicitly describe how the BNBP clusters the data points of each group into
a random number of exchangeable partitions, which are shared across all the
groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a
novel nonparametric Bayesian topic model that is distinct from existing ones,
with simple implementation, fast convergence, good mixing, and state-of-the-art
predictive performance.Comment: in Neural Information Processing Systems (NIPS) 2014. 9 pages + 3
page appendi
Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers the author's research areas, the citation
network and the paper content among other things. In this paper, we combine
these three in a topic model that produces a bibliographic model of authors,
topics and documents, using a nonparametric extension of a combination of the
Poisson mixed-topic link model and the author-topic model. This gives rise to
the Citation Network Topic Model (CNTM). We propose a novel and efficient
inference algorithm for the CNTM to explore subsets of research publications
from CiteSeerX. The publication datasets are organised into three corpora,
totalling to about 168k publications with about 62k authors. The queried
datasets are made available online. In three publicly available corpora in
addition to the queried datasets, our proposed model demonstrates an improved
performance in both model fitting and document clustering, compared to several
baselines. Moreover, our model allows extraction of additional useful knowledge
from the corpora, such as the visualisation of the author-topics network.
Additionally, we propose a simple method to incorporate supervision into topic
modelling to achieve further improvement on the clustering task.Comment: Preprint for Journal Machine Learnin
- …