Edit Distance: Sketching, Streaming and Document Exchange
We show that in the document exchange problem, where Alice holds one string and Bob holds another, Alice can send Bob a short message such that Bob can recover Alice's string using the message and his own input if the edit distance between the two strings is within a given threshold, and output "error" otherwise. Both the encoding and the decoding can be done efficiently. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob each hold one of the two strings, they can compute small sketches of their strings (the encoding) and send them to the referee, who can then compute the edit distance between the two strings together with all the edit operations if the edit distance is within the threshold, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first such sketching result for edit distance. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly in small space.
Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016).
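The quantity the referee ultimately recovers, the edit distance together with all edit operations, can be illustrated with the classical offline dynamic program. This is a baseline sketch, not the paper's sketching scheme; the function name is illustrative.

```python
def edit_distance_with_ops(x, y):
    """Classical O(|x|*|y|) DP: edit distance plus the edit operations."""
    n, m = len(x), len(y)
    # dp[i][j] = edit distance between prefixes x[:i] and y[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete x[i-1]
                           dp[i][j - 1] + 1,         # insert y[j-1]
                           dp[i - 1][j - 1] + cost)  # match / substitute
    # Backtrace to list the operations transforming x into y.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] and x[i - 1] == y[j - 1]:
            i, j = i - 1, j - 1                       # characters match
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1))
            i -= 1
        else:
            ops.append(("insert", i, y[j - 1]))       # insert before x[i]
            j -= 1
    return dp[n][m], list(reversed(ops))
```

For example, `edit_distance_with_ops("kitten", "sitting")` returns distance 3 with one operation per unit of cost. The point of the abstract is that sketches of size independent of the string lengths suffice when the distance is small, whereas this table costs quadratic time and space.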
Improved Outlier Robust Seeding for k-means
The $k$-means objective is popular for clustering, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization, called $k$-means++, uses $D^{2}$ sampling and comes with a provable approximation guarantee \cite{AV2007}. However, in the presence of adversarial noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantee \textit{w.r.t.}\ the optimal $k$-means solution on the inliers does not hold.
Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant of the sampling distribution, which makes it robust to the outliers. Our algorithm runs in time linear in the size of the data, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable approximation guarantee.
Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters, while keeping its running time linear in the number of points and the dimension. This is an improvement over previous results for robust $k$-means based on LP relaxation and rounding \cite{Charikar,KrishnaswamyLS18} and \textit{robust $k$-means++} \cite{DeshpandeKP20}. Our empirical results show the advantage of our algorithm over $k$-means++~\cite{AV2007}, uniform random seeding, greedy sampling for $k$-means~\cite{tkmeanspp}, and robust $k$-means++~\cite{DeshpandeKP20}, on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to the scalable, faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and is of independent interest for coreset constructions in the presence of outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
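The baseline the paper modifies is standard $D^{2}$ sampling: each new center is drawn with probability proportional to the squared distance to the nearest center chosen so far. The sketch below is plain $k$-means++ seeding \cite{AV2007}, not the robust variant (whose modified distribution the abstract does not specify); function names are illustrative.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeding(points, k, rng=random):
    """Standard k-means++ (D^2) seeding of Arthur & Vassilvitskii."""
    # First center: uniform at random.
    centers = [rng.choice(points)]
    # d2[i] = squared distance of points[i] to its nearest chosen center.
    d2 = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        # Sample the next center with probability proportional to d2.
        total = sum(d2)
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        # Update nearest-center distances for the new center.
        d2 = [min(w, dist2(p, centers[-1])) for p, w in zip(points, d2)]
    return centers
```

The failure mode the abstract describes is visible here: a far-away outlier has a huge `d2` value and therefore a large chance of being picked as a center, which is exactly what the robust variant's altered sampling distribution is designed to dampen.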
On Generalization Bounds for Projective Clustering
Given a set of points, clustering consists of finding a partition of the point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of samples drawn independently from some unknown but fixed distribution, how quickly does a solution computed on the sample converge to the optimal clustering of the distribution? We give several near-optimal results. In particular,
For center-based objectives, we show a convergence rate that matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median.
For subspace clustering with $j$-dimensional subspaces, we show a convergence rate that yields the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show a matching lower bound on the convergence rate, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.
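The quantity such learning bounds control can be seen in the simplest case, $k=1$ in one dimension, where the empirical risk minimizer is the sample mean. The Monte Carlo sketch below (distribution, trial counts, and function name all chosen for illustration) measures the excess population cost of the sample-optimal center as the sample size grows.

```python
import random

def excess_risk(n, trials=200, rng=None):
    """Average excess population cost of the sample-optimal 1-mean center.

    Population: uniform on [0, 1], so the optimal center is 1/2 and the
    optimal cost is Var = 1/12. For a candidate center c the population
    cost is E[(X - c)^2] = 1/12 + (c - 1/2)^2, so the excess risk of the
    sample mean c is exactly (c - 1/2)^2. Learning bounds of the kind in
    the abstract control this gap uniformly over all k-tuples of centers.
    """
    rng = rng or random.Random(0)
    gaps = []
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        c = sum(sample) / n              # sample-optimal center = mean
        gaps.append((c - 0.5) ** 2)      # excess population cost
    return sum(gaps) / trials
```

For fixed $k=1$ the expected gap here is the variance of the sample mean, $1/(12n)$, i.e., a $1/n$ rate; the $\sqrt{\cdot/n}$-type rates in the abstract arise from the harder requirement of uniform convergence over all candidate solutions.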
An Empirical Evaluation of k-Means Coresets
Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high performance coresets for clustering problems such as k-means in both theory and practice. Curiously, there exists no work on comparing the quality of available k-means coresets.
In this paper we perform such an evaluation. There is currently no algorithm known to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows for an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
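To make the object of study concrete, the sketch below builds the simplest possible coreset, a reweighted uniform sample, and estimates its distortion heuristically by probing a finite set of candidate centers; the abstract's point is that measuring the true distortion, over *all* centers, has no known algorithm. All names are illustrative, and real high-performance coresets use importance (sensitivity) sampling rather than uniform sampling.

```python
import random

def cost(points, center, weights=None):
    """(Weighted) 1-means cost of a candidate center, 1-D points."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * (p - center) ** 2 for p, w in zip(points, weights))

def uniform_coreset(points, m, rng):
    """Naive coreset: m uniform samples, each reweighted by n/m so the
    weighted cost is an unbiased estimate of the full cost."""
    sample = [rng.choice(points) for _ in range(m)]
    w = len(points) / m
    return sample, [w] * m

def empirical_distortion(points, sample, weights, centers):
    """Heuristic distortion: worst two-sided ratio between coreset cost
    and true cost over a finite set of probe centers."""
    ratios = [cost(sample, c, weights) / cost(points, c) for c in centers]
    return max(max(ratios), 1 / min(ratios))
```

A distortion close to 1 means the coreset's cost faithfully tracks the full data's cost at the probed centers; an adversarially chosen center could still reveal much larger distortion, which is why this evaluation is only heuristic.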
Size-constrained Weighted Ancestors with Applications
The weighted ancestor problem on a rooted node-weighted tree is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require $\Omega(\log\log n)$ time for queries provided near-linear space is available, where $n$ is the input size. The weighted ancestor problem has attracted a lot of attention from the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This attention has culminated in a data structure for weighted ancestors in suffix trees with $O(1)$ query time and an $O(n)$-time construction algorithm [Belazzougui et al., CPM 2021]. In
this paper, we consider a different version of the weighted ancestor problem, where the nodes of a rooted tree $T$ are weighted by a function $w$ that maps the nodes of $T$ to positive integers, such that $w(u)\le |T(u)|$ for any node $u$ and $w(v)\le w(u)$ if node $v$ is a descendant of node $u$, where $|T(u)|$ is the number of nodes in the subtree rooted at $u$. In the size-constrained weighted ancestor (SWAQ) problem, for any node $u$ of $T$ and any integer $k$, we are asked to return the lowest ancestor of $u$ with weight at least $k$. We show that for any rooted tree with $n$ nodes, we can answer any such query in $O(1)$ time after $O(n)$-time preprocessing. In particular, this implies a data structure for the SWAQ problem in suffix trees with $O(1)$ query time and $O(n)$-time preprocessing, when the nodes are weighted in this manner. We also show several string-processing applications of this result.
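The query itself is easy to state: because the weights are monotone non-decreasing toward the root, the first ancestor reached with weight at least $k$ on a walk toward the root is the answer. The sketch below is this naive $O(\mathrm{depth})$ walk, using subtree sizes as one valid weight function; the data structure in the abstract answers the same query after preprocessing, without walking.

```python
def subtree_sizes(parent):
    """weight(u) = number of nodes in u's subtree: one weight function
    that is monotone non-decreasing toward the root."""
    size = {u: 1 for u in parent}

    def depth(u):
        d = 0
        while parent[u] is not None:
            u = parent[u]
            d += 1
        return d

    # Process nodes from deepest to shallowest so children finish first.
    for u in sorted(parent, key=depth, reverse=True):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    return size

def swaq_naive(parent, weight, u, k):
    """Lowest proper ancestor of u with weight >= k, by walking up.

    Monotonicity of the weights guarantees that the first ancestor found
    with weight >= k is the lowest one. Naive O(depth) per query."""
    v = parent[u]
    while v is not None:
        if weight[v] >= k:
            return v
        v = parent[v]
    return None  # no ancestor is heavy enough
```

Here `parent` maps each node to its parent (`None` for the root). On a small tree this matches the problem statement directly; the contribution of the paper is removing the dependence on the depth of the tree.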
A Tight Bound for Shortest Augmenting Paths on Trees
The shortest augmenting path technique is one of the fundamental ideas used in maximum matching and maximum flow algorithms. Since being introduced by Edmonds and Karp in 1972, it has been widely applied in many different settings. Surprisingly, despite this extensive usage, it is still not well understood even in the simplest case: the online bipartite matching problem on trees. In this problem a bipartite tree is being revealed online, i.e., in each round one vertex from one side of the bipartition arrives with its incident edges. It was conjectured by Chaudhuri et al. [K. Chaudhuri, C. Daskalakis, R. D. Kleinberg, and H. Lin. Online bipartite perfect matching with augmentations. In INFOCOM 2009] that the total length of all shortest augmenting paths found is $O(n \log n)$. In this paper, we prove a tight $O(n \log n)$ upper bound for the total length of shortest augmenting paths for trees, improving over the $O(n \log^2 n)$ bound of [B. Bosek, D. Leniowski, P. Sankowski, and A. Zych. Shortest augmenting paths for online matchings on trees. In WAOA 2015].
Comment: 22 pages, 10 figures.
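The process being analyzed can be sketched directly: vertices of one side arrive online, and after each arrival we augment along a shortest alternating path found by BFS, accumulating the total path length, which is the quantity the bound concerns. This is the generic procedure for the setting (it works on any bipartite graph, not only trees); vertex ids and the function name are illustrative.

```python
from collections import deque

def online_matching_total_sap_length(arrivals):
    """arrivals: list of (w, neighbors) pairs revealed one at a time,
    where w is a new vertex of one side and neighbors lists its already
    visible partners on the other side. Returns (matching size, total
    length in edges of all shortest augmenting paths used)."""
    match_w, match_b = {}, {}   # current matching, both directions
    adj = {}                    # arrived vertex -> its neighbor list
    total_len = 0
    for w, nbrs in arrivals:
        adj[w] = list(nbrs)
        # BFS over alternating paths: w-side -> unmatched edge -> b-side,
        # b-side -> matched edge -> w-side.
        parent = {('w', w): None}
        q = deque([w])
        seen_b = set()
        end = None
        while q and end is None:
            u = q.popleft()
            for b in adj[u]:
                if b in seen_b:
                    continue
                seen_b.add(b)
                parent[('b', b)] = ('w', u)
                if b not in match_b:
                    end = b          # free vertex: augmenting path found
                    break
                w2 = match_b[b]
                parent[('w', w2)] = ('b', b)
                q.append(w2)
        if end is None:
            continue                 # w stays unmatched for now
        # Walk back along the path, flipping edges and counting length.
        node = ('b', end)
        length = 0
        while parent[node] is not None:
            prev = parent[node]
            length += 1
            if node[0] == 'b':       # unmatched edge becomes matched
                match_b[node[1]] = prev[1]
                match_w[prev[1]] = node[1]
            node = prev
        total_len += length
    return len(match_w), total_len
```

Because BFS explores alternating paths level by level, each augmenting path it returns is shortest; the theorem bounds the sum of these lengths by $O(n \log n)$ when the revealed graph is a tree.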