Center-based Clustering under Perturbation Stability
Clustering under most popular objective functions is NP-hard, even to
approximate well, and so unlikely to be efficiently solvable in the worst case.
Recently, Bilu and Linial \cite{Bilu09} suggested an approach aimed at
bypassing this computational barrier by using properties of instances one might
hope to hold in practice. In particular, they argue that instances in practice
should be stable to small perturbations in the metric space and give an
efficient algorithm for clustering instances of the Max-Cut problem that are
stable to perturbations of size $O(\sqrt{n})$. In addition, they conjecture that
instances stable to as little as O(1) perturbations should be solvable in
polynomial time. In this paper we prove that this conjecture is true for any
center-based clustering objective (such as $k$-median, $k$-means, and
$k$-center). Specifically, we show we can efficiently find the optimal
clustering assuming only stability to factor-3 perturbations of the underlying
metric in spaces without Steiner points, and stability to factor $2+\sqrt{3}$
perturbations for general metrics. In particular, we show for such instances
that the popular Single-Linkage algorithm combined with dynamic programming
will find the optimal clustering. We also present NP-hardness results under a
weaker but related condition.
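The Single-Linkage-plus-dynamic-programming approach can be illustrated concretely. The sketch below is mine, not the paper's (the point set, the $k$-means-style cost, and all function names are illustrative choices): it builds a single-linkage merge tree by repeatedly joining the two clusters at smallest minimum inter-point distance, then runs a DP over that tree to find the cheapest way to cut it into $k$ clusters.

```python
import itertools

def single_linkage_tree(points):
    """Merge tree from naive single-linkage agglomeration.
    Leaves are point indices; internal nodes are (left, right) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [(i, frozenset([i])) for i in range(len(points))]
    while len(clusters) > 1:
        # pick the pair of clusters with the smallest min-link distance
        a, b = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ab: min(dist(points[i], points[j])
                                      for i in clusters[ab[0]][1]
                                      for j in clusters[ab[1]][1]))
        na, nb = clusters[a], clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(((na[0], nb[0]), na[1] | nb[1]))
    return clusters[0][0]

def members(node):
    """Point indices under a tree node."""
    return [node] if isinstance(node, int) else members(node[0]) + members(node[1])

def cluster_cost(points, idxs):
    """k-means-style cost: sum of squared distances to the centroid."""
    d = len(points[0])
    c = [sum(points[i][t] for i in idxs) / len(idxs) for t in range(d)]
    return sum(sum((points[i][t] - c[t]) ** 2 for t in range(d)) for i in idxs)

def best_k_clustering(points, node, k, memo=None):
    """DP over the merge tree: cheapest split of node's points into k
    clusters, each a subtree pruning.  Returns (cost, list_of_clusters)."""
    memo = {} if memo is None else memo
    key = (node, k)
    if key in memo:
        return memo[key]
    if k == 1:
        idxs = members(node)
        res = (cluster_cost(points, idxs), [idxs])
    elif isinstance(node, int):
        res = (float('inf'), [])  # a single leaf cannot split further
    else:
        res = (float('inf'), [])
        left, right = node
        for kl in range(1, k):  # distribute k between the two subtrees
            cl, pl = best_k_clustering(points, left, kl, memo)
            cr, pr = best_k_clustering(points, right, k - kl, memo)
            if cl + cr < res[0]:
                res = (cl + cr, pl + pr)
    memo[key] = res
    return res
```

For well-separated stable instances, the optimal clustering appears as a pruning of the single-linkage tree, which is exactly the space the DP searches.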
Average Distance Queries through Weighted Samples in Graphs and Metric Spaces: High Scalability with Tight Statistical Guarantees
The average distance from a node to all other nodes in a graph, or from a
query point in a metric space to a set of points, is a fundamental quantity in
data analysis. The inverse of the average distance, known as the (classic)
closeness centrality of a node, is a popular importance measure in the study of
social networks. We develop novel structural insights on the sparsifiability of
the distance relation via weighted sampling. Based on that, we present highly
practical algorithms with strong statistical guarantees for fundamental
problems. We show that the average distance (and hence the centrality) for all
nodes in a graph can be estimated using $O(\epsilon^{-2})$ single-source
distance computations. For a set $V$ of $n$ points in a metric space, we show
that after preprocessing which uses $O(n)$ distance computations we can compute
a weighted sample $S \subseteq V$ of size $O(\epsilon^{-2})$ such that the average
distance from any query point $v$ to $V$ can be estimated from the distances
from $v$ to $S$. Finally, we show that for a set of $n$ points in a metric
space, we can estimate the average pairwise distance using $O(n + \epsilon^{-2})$
distance computations. The estimate is based on a weighted sample of
$O(\epsilon^{-2})$ pairs of points, which is computed using $O(n)$ distance
computations. Our estimates are unbiased with normalized root mean square error
(NRMSE) of at most $\epsilon$. Increasing the sample size by an $O(\log n)$
factor ensures that the probability that the relative error exceeds $\epsilon$
is polynomially small.
Comment: 21 pages, will appear in the Proceedings of RANDOM 201
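To make the flavor of sampling-based distance estimation concrete, here is a minimal sketch of the simplest variant: uniform sampling of pairs, which yields an unbiased estimate of the average pairwise distance. Note this is a simplification of the abstract's approach (its tight guarantees come from a *weighted* sample), and the function names are mine.

```python
import math
import random

def avg_pairwise_distance_estimate(points, m, dist, rng=random):
    """Unbiased estimate of the average pairwise distance from m uniformly
    sampled unordered pairs.  Uniform sampling is a stand-in for the weighted
    scheme; it needs more samples when the distance distribution is skewed."""
    n = len(points)
    total = 0.0
    for _ in range(m):
        i, j = rng.sample(range(n), 2)  # a uniformly random distinct pair
        total += dist(points[i], points[j])
    return total / m

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With $m$ samples the standard error shrinks as $1/\sqrt{m}$, so a few thousand pairs already pin down the average to within a few percent on benign inputs.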
Indexability, concentration, and VC theory
Degrading performance of indexing schemes for exact similarity search in high
dimensions has long since been linked to histograms of distributions of
distances and other 1-Lipschitz functions getting concentrated. We discuss this
observation in the framework of the phenomenon of concentration of measure on
the structures of high dimension and the Vapnik-Chervonenkis theory of
statistical learning.
Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded,
improved and corrected version of the SISAP'2010 invited paper, this e-print, v3)
Efficient Document Indexing Using Pivot Tree
We present a novel method for efficiently searching the top-k neighbors of
documents represented in a high-dimensional space of terms, based on cosine
similarity. Typically, documents are stored in a bag-of-words tf-idf
representation. One of the most common ways of computing the similarity between
a pair of documents is the cosine similarity between their vector
representations, but cosine similarity is not a metric distance measure, as it
does not satisfy the triangle inequality; therefore most metric search methods
cannot be applied directly. We propose
an efficient method for indexing documents using a pivot tree that leads to
efficient retrieval. We also study the relation between precision and
efficiency for the proposed method and compare it with the state of the art in
the area of document searching based on inner product.
Comment: 6 pages, 2 figures
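One standard way to recover a metric from cosine similarity, so that triangle-inequality pruning applies, is the angular distance $\arccos$ of the cosine between unit vectors. The sketch below is a toy single-pivot index, not the paper's pivot tree (class and function names are mine); it shows the pruning rule $|d(q,p) - d(x,p)| \le d(q,x)$ in action.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angular_dist(u, v):
    """arccos of the cosine similarity; a true metric on unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp for float safety

class OnePivotIndex:
    """Toy single-pivot index: precompute each document's distance to one
    pivot, then use the triangle inequality |d(q,p) - d(x,p)| <= d(q,x)
    to skip documents without computing their distance to the query."""
    def __init__(self, docs, pivot):
        self.docs = [normalize(d) for d in docs]
        self.pivot = normalize(pivot)
        self.to_pivot = [angular_dist(d, self.pivot) for d in self.docs]

    def range_search(self, query, radius):
        q = normalize(query)
        dq = angular_dist(q, self.pivot)
        hits, checked = [], 0
        for i, d in enumerate(self.docs):
            if abs(dq - self.to_pivot[i]) > radius:
                continue  # pruned: no distance computation needed
            checked += 1
            if angular_dist(q, d) <= radius:
                hits.append(i)
        return hits, checked
```

A pivot tree generalizes this by recursively partitioning documents around many pivots, so whole subtrees can be pruned at once.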
Computing medians and means in Hadamard spaces
The geometric median as well as the Frechet mean of points in an Hadamard
space are important in both theory and applications. Surprisingly, no
algorithms for their computation are hitherto known. To address this issue, we
use a split version of the proximal point algorithm for minimizing a sum of
convex functions and prove that this algorithm produces a sequence converging
to a minimizer of the objective function, which extends a recent result of D.
Bertsekas (2001) into Hadamard spaces. The method is quite robust and not only
does it yield algorithms for the median and the mean, but it also applies to
various other optimization problems. We moreover show that another algorithm
for computing the Frechet mean can be derived from the law of large numbers due
to K.-T. Sturm (2002). In applications, computing medians and means is probably
most needed in tree space, which is an instance of an Hadamard space, invented
by Billera, Holmes, and Vogtmann (2001) as a tool for averaging phylogenetic
trees. It turns out, however, that it can be also used to model numerous other
tree-like structures. Since there now exists a polynomial-time algorithm for
computing geodesics in tree space due to M. Owen and S. Provan (2011), we
obtain efficient algorithms for computing medians and means, which can be
directly used in practice.
Comment: Corrected version. Accepted in SIAM Journal on Optimization
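Sturm's law-of-large-numbers construction mentioned above admits a very short sketch: repeatedly draw a random sample point and move the current iterate a fraction $1/(k+1)$ of the way along the geodesic toward it. In the code below, which is my illustration (function names are mine), Euclidean straight-line geodesics stand in for tree-space geodesics, which would require the Owen-Provan algorithm.

```python
import random

def geodesic_point(x, y, t):
    """Point a fraction t of the way along the geodesic from x to y.
    In Euclidean space this is linear interpolation; in a general
    Hadamard space this map would be replaced by the space's geodesics."""
    return [a + t * (b - a) for a, b in zip(x, y)]

def inductive_frechet_mean(points, n_iters, rng=random):
    """Sturm-style stochastic approximation of the Frechet mean:
    step toward a uniformly random sample with step size 1/(k+1)."""
    x = list(rng.choice(points))
    for k in range(1, n_iters):
        y = rng.choice(points)
        x = geodesic_point(x, y, 1.0 / (k + 1))
    return x
```

In Euclidean space this recursion is exactly a running average of the drawn samples, which is why it converges to the mean; in a general Hadamard space only the geodesic map changes, and the iterates still converge to the Frechet mean by Sturm's theorem.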