Distributed, partially collapsed MCMC for Bayesian Nonparametrics
Bayesian nonparametric (BNP) models provide elegant methods for discovering
underlying latent features within a data set, but inference in such models can
be slow. We exploit the fact that completely random measures, in terms of which
commonly used models like the Dirichlet process and the beta-Bernoulli process
can be expressed, are decomposable into independent sub-measures. We use this
decomposition to partition the latent measure into a finite measure containing
only instantiated components, and an infinite measure containing all other
components. We then select different inference algorithms for the two
components: uncollapsed samplers mix well on the finite measure, while
collapsed samplers mix well on the infinite, sparsely occupied tail. The
resulting hybrid algorithm can be applied to a wide class of models, and can be
easily distributed to allow scalable inference without sacrificing asymptotic
convergence guarantees.
Comment: To appear in the 23rd International Conference on Artificial
Intelligence and Statistics
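As a rough illustration of the split between instantiated components and the
tail, consider the Dirichlet process case. A standard fact is that, given
cluster counts n_1, ..., n_K and concentration alpha, the weights of the K
occupied components together with the total mass of all unoccupied components
are jointly Dirichlet(n_1, ..., n_K, alpha). The sketch below draws those
weights using only the Python standard library. It is not the authors' code:
the function name, the counts, and the value of alpha are assumptions chosen
for illustration, and it shows only the weight draw, not the full hybrid
sampler.

```python
import random

def sample_instantiated_weights(counts, alpha, rng):
    """Draw (pi_1, ..., pi_K, pi_tail) ~ Dirichlet(n_1, ..., n_K, alpha).

    Hypothetical helper: a Dirichlet draw is obtained by normalizing
    independent Gamma(shape, 1) variables.
    """
    gammas = [rng.gammavariate(n, 1.0) for n in counts]
    gammas.append(rng.gammavariate(alpha, 1.0))  # mass of the unoccupied tail
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
# Assumed example: three occupied clusters with 5, 3, and 2 observations.
weights = sample_instantiated_weights([5, 3, 2], alpha=1.0, rng=rng)
instantiated, tail = weights[:-1], weights[-1]
```

Here `instantiated` holds explicit weights for the finite part, the kind of
quantity an uncollapsed sampler would work with, while `tail` is the single
scalar mass left for everything not yet instantiated.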
Scalable Hierarchical Agglomerative Clustering
The applicability of agglomerative clustering, for inferring both
hierarchical and flat clustering, is limited by its scalability. Existing
scalable hierarchical clustering methods sacrifice quality for speed and often
lead to over-merging of clusters. In this paper, we present a scalable,
agglomerative method for hierarchical clustering that does not sacrifice
quality and scales to billions of data points. We perform a detailed
theoretical analysis, showing that under mild separability conditions our
algorithm can not only recover the optimal flat partition, but also provide a
two-approximation to the non-parametric DP-Means objective. This introduces a novel
application of hierarchical clustering as an approximation algorithm for the
non-parametric clustering objective. We additionally relate our algorithm to
the classic hierarchical agglomerative clustering method. We perform extensive
empirical experiments in both hierarchical and flat clustering settings and
show that our proposed approach achieves state-of-the-art results on publicly
available clustering benchmarks. Finally, we demonstrate our method's
scalability by applying it to a dataset of 30 billion queries. Human evaluation
of the discovered clusters shows that our method finds higher-quality clusters
than the current state-of-the-art.
Comment: Appeared in KDD '21: Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining
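For readers unfamiliar with the DP-Means objective mentioned in the abstract,
it is usually stated as the k-means cost (sum of squared distances from each
point to its cluster mean) plus a penalty lambda for every cluster used. The
sketch below evaluates that cost for a given flat partition. It is an
illustrative helper only, not the paper's algorithm: the function name, the
toy points, and the value of `lam` are assumptions.

```python
def dp_means_cost(points, labels, lam):
    """Squared distance of each point to its cluster mean, plus lam per cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        cost += sum((m[d] - mean[d]) ** 2 for m in members for d in range(dim))
    return cost + lam * len(clusters)

# Assumed toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = dp_means_cost(points, [0, 0, 0, 0], lam=4.0)   # 104 + 4 = 108
two_clusters = dp_means_cost(points, [0, 0, 1, 1], lam=4.0)  # 4 + 8 = 12
```

On this toy data the two-cluster partition has the lower cost, which shows how
the penalty term trades off the number of clusters against within-cluster
spread.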