Center-based Clustering under Perturbation Stability
Clustering under most popular objective functions is NP-hard, even to
approximate well, and so unlikely to be efficiently solvable in the worst case.
Recently, Bilu and Linial \cite{Bilu09} suggested an approach aimed at
bypassing this computational barrier by using properties of instances one might
hope to hold in practice. In particular, they argue that instances in practice
should be stable to small perturbations in the metric space and give an
efficient algorithm for clustering instances of the Max-Cut problem that are
stable to perturbations of size $O(\sqrt{n})$. In addition, they conjecture that
instances stable to as little as O(1) perturbations should be solvable in
polynomial time. In this paper we prove that this conjecture is true for any
center-based clustering objective (such as $k$-median, $k$-means, and
$k$-center). Specifically, we show we can efficiently find the optimal
clustering assuming only stability to factor-3 perturbations of the underlying
metric in spaces without Steiner points, and stability to factor-$(2+\sqrt{3})$
perturbations for general metrics. In particular, we show for such instances
that the popular Single-Linkage algorithm combined with dynamic programming
will find the optimal clustering. We also present NP-hardness results under a
weaker but related condition.
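The algorithm this abstract names (Single-Linkage combined with dynamic programming) is concrete enough to sketch. The Python below is an illustrative reconstruction, not the authors' code: it assumes a $k$-median objective with centers restricted to data points (no Steiner points), and the helper names are ours. The dynamic program finds the minimum-cost way to cut the single-linkage hierarchy into $k$ clusters; the paper's guarantee is that, under the stated stability, the optimal clustering is such a pruning of the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def kmedian_cost(dist, members):
    # Cost of a single cluster: the best center is chosen among the cluster's
    # own points (no Steiner points), summing distances to that center.
    sub = dist[np.ix_(members, members)]
    return sub.sum(axis=0).min()

def stable_kmedian(points, k):
    dist = squareform(pdist(points))
    n = len(points)
    Z = linkage(pdist(points), method="single")  # single-linkage hierarchy

    # scipy convention: leaves are 0..n-1, merge step i creates node n+i.
    children = {n + i: (int(a), int(b)) for i, (a, b, _, _) in enumerate(Z)}

    def leaves(node):
        if node < n:
            return [node]
        left, right = children[node]
        return leaves(left) + leaves(right)

    # DP over the tree: best(node, j) = (cost, clusters) for partitioning the
    # points under `node` into exactly j clusters that are nodes of the tree.
    memo = {}
    def best(node, j):
        if (node, j) not in memo:
            if j == 1:
                mem = leaves(node)
                memo[(node, j)] = (kmedian_cost(dist, mem), [mem])
            elif node < n:
                memo[(node, j)] = (np.inf, None)  # a leaf cannot split further
            else:
                left, right = children[node]
                res = (np.inf, None)
                for jl in range(1, j):
                    cl, pl = best(left, jl)
                    cr, pr = best(right, j - jl)
                    if pl is not None and pr is not None and cl + cr < res[0]:
                        res = (cl + cr, pl + pr)
                memo[(node, j)] = res
        return memo[(node, j)]

    return best(2 * n - 2, k)  # root of the hierarchy is node 2n-2
```

On a stable instance, `stable_kmedian(points, k)` returns the optimal cost and partition; on an unstable instance it still returns the best tree-respecting clustering, which may be suboptimal.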
Measuring Cluster Stability for Bayesian Nonparametrics Using the Linear Bootstrap
Clustering procedures typically estimate which data points are clustered
together, a quantity of primary importance in many analyses. Often used as a
preliminary step for dimensionality reduction or to facilitate interpretation,
finding robust and stable clusters is often crucial for appropriate
downstream analysis. In the present work, we consider Bayesian nonparametric
(BNP) models, a particularly popular set of Bayesian models for clustering due
to their flexibility. Because of their complexity, the Bayesian posterior often
cannot be computed exactly, and approximations must be employed. Mean-field
variational Bayes (MFVB) forms a posterior approximation by solving an
optimization problem and is widely used due to its speed. An exact BNP
posterior might vary
dramatically when presented with different data. As such, stability and
robustness of the clustering should be assessed.
A popular means to assess stability is to apply the bootstrap: resample
the data and rerun the clustering for each simulated data set. The time cost
is thus often prohibitive, especially for the sort of exploratory analysis
where clustering is typically used. We propose to use a fast and automatic
approximation to the full bootstrap called the "linear bootstrap", which can be
derived from a local perturbation of the data. In this work, we demonstrate how to apply this
idea to a data analysis pipeline, consisting of an MFVB approximation to a BNP
clustering posterior of time course gene expression data. We show that using
auto-differentiation tools, the necessary calculations can be done
automatically, and that the linear bootstrap is a fast but approximate
alternative to the bootstrap.
Comment: 9 pages, NIPS 2017 Advances in Approximate Bayesian Inference
Workshop
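The "linear bootstrap" this abstract describes amounts to a first-order Taylor expansion of the optimum in the bootstrap weights, with the weight-sensitivity obtained by automatic differentiation through the implicit function theorem. The JAX sketch below is a minimal toy illustration under our own assumptions (a weighted squared-error loss with a scalar parameter; nothing here is the paper's BNP pipeline):

```python
import jax
import jax.numpy as jnp
import numpy as np

def loss(theta, w, x):
    # Weighted squared-error loss; its optimum theta_hat(w) is the weighted mean.
    return jnp.sum(w * (x - theta) ** 2)

def fit(w, x):
    # Closed-form optimum of this toy loss (a real model would call an optimizer).
    return jnp.sum(w * x) / jnp.sum(w)

x = jnp.asarray(np.random.default_rng(0).normal(size=200))
n = x.shape[0]
w1 = jnp.ones(n)        # the original data corresponds to unit weights
theta_hat = fit(w1, x)

# Implicit function theorem: d theta_hat / dw = -H^{-1} d^2 L / (d theta d w),
# with both derivatives computed by automatic differentiation.
H = jax.hessian(loss, argnums=0)(theta_hat, w1, x)
cross = jax.jacobian(jax.grad(loss, argnums=0), argnums=1)(theta_hat, w1, x)
dtheta_dw = -cross / H  # shape (n,)

# Linear bootstrap: the sensitivity is computed once and reused for every
# multinomial (bootstrap) weight vector, with no refitting.
rng = np.random.default_rng(1)
draws = rng.multinomial(n, np.ones(n) / n, size=1000).astype(float)
lin_boot = theta_hat + (jnp.asarray(draws) - w1) @ dtheta_dw

print("linear bootstrap SD:", float(jnp.std(lin_boot)))
print("exact refit SD:     ", np.std([float(fit(jnp.asarray(w), x)) for w in draws]))
```

For this toy loss the sensitivity has the closed form $(x_i - \hat{\theta})/n$, which the autodiff computation reproduces; in the paper's setting the saving comes from replacing each MFVB refit with the single matrix-vector product above.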