Edit Distance: Sketching, Streaming and Document Exchange

Author: Belazzougui Djamal
Zhang Qin
Publication date: 14/07/2016

We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.

Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)
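The decoding contract in this abstract (return the edit distance when it is at most $K$, report "error" otherwise) can be illustrated with a classical banded dynamic program. This is only a baseline sketch that sees both strings in full; it is not the sketching or document exchange protocol of the paper, and the function name `bounded_edit_distance` is a hypothetical choice for illustration.

```python
def bounded_edit_distance(x, y, K):
    """Return the edit distance of x and y if it is <= K, else None ("error").

    Baseline banded DP in O((len(x) + len(y)) * K) time; an assumption-level
    illustration, not the algorithm from the paper.
    """
    n, m = len(x), len(y)
    if abs(n - m) > K:
        return None  # the length difference alone already exceeds K
    INF = K + 1  # every value above K is capped at K + 1
    # Row 0: distance between the empty prefix of x and y[:j].
    prev = [j if j <= K else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo, hi = max(0, i - K), min(m, i + K)  # only cells with |i - j| <= K
        if lo == 0:
            cur[0] = i  # delete the first i characters of x
        for j in range(max(1, lo), hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            best = min(prev[j - 1] + cost,  # match or substitute
                       prev[j] + 1,         # delete x[i-1]
                       cur[j - 1] + 1)      # insert y[j-1]
            cur[j] = min(best, INF)
        prev = cur
    return prev[m] if prev[m] <= K else None
```

Cells outside the band stay at the cap `K + 1`, so any alignment that would need more than `K` edits is rejected without filling the full table.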

arXiv.org e-Print Archive

Crossref

Patching Colors with Tensors

Author: Brand Cornelius
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th Annual European Symposium on Algorithms (ESA 2019)
Publication date: 01/01/2019

Dagstuhl Research Online Publication Server

Improved Outlier Robust Seeding for k-means

Author: Deshpande Amit
Pratap Rameshwar
Publication date: 06/09/2023

$k$-means is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization called $k$-means++ uses $D^{2}$ sampling and comes with a provable $O(\log k)$ approximation guarantee \cite{AV2007}. However, in the presence of adversarial noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantee w.r.t. the $k$-means solution on inliers does not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant in the $D^{2}$ sampling distribution, which makes it robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable $O(1)$ approximation guarantee. Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is an improvement over previous results for robust $k$-means based on LP relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and robust $k$-means++ \cite{DeshpandeKP20}. Our empirical results show the advantage of our algorithm over $k$-means++ \cite{AV2007}, uniform random seeding, greedy sampling for $k$-means \cite{tkmeanspp}, and robust $k$-means++ \cite{DeshpandeKP20}, on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to scalable, faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and is of independent interest for coreset constructions in the presence of outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
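For readers unfamiliar with the $D^{2}$ sampling this abstract builds on, the following is a minimal sketch of the standard $k$-means++ seeding step: each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. It does not implement the robust variant proposed in the paper, whose modified distribution is not specified in the abstract; the names `sq_dist` and `d2_seeding` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seeding(points, k, rng=None):
    """Pick k initial centers from `points` by standard D^2 sampling.

    Plain k-means++ seeding only; NOT the outlier-robust variant.
    """
    rng = rng or random.Random(0)  # fixed seed is an assumption for repeatability
    centers = [rng.choice(points)]  # first center: uniform at random
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        newest = centers[-1]
        d2 = [min(old, sq_dist(p, newest)) for old, p in zip(d2, points)]
    return centers
```

Because far-away points carry large weights, a single distant outlier is very likely to be picked as a center, which is the sensitivity the abstract describes.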

arXiv.org e-Print Archive

On Generalization Bounds for Projective Clustering

Author: Bucarelli Maria Sofia
Larsen Matilde Fjeldsø
Schwiegelshohn Chris
Toftrup Mads Bech
Publication date: 13/10/2023

Given a set of points, clustering consists of finding a partition of a point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown, but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near optimal results. In particular, for center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{k/n}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends it to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{\frac{kj^2}{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega\left(\sqrt{\frac{kj}{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] are essentially optimal.

arXiv.org e-Print Archive

An Empirical Evaluation of k-Means Coresets

Author: Schwiegelshohn Chris
Sheikh-Omar Omar Ali
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual European Symposium on Algorithms (ESA 2022)
Publication date: 01/01/2022

Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high performance coresets for clustering problems such as k-means in both theory and practice. Curiously, there exists no work on comparing the quality of available k-means coresets. In this paper we perform such an evaluation. There currently is no algorithm known to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows us an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice

Dagstuhl Research Online Publication Server

Size-constrained Weighted Ancestors with Applications

Author: Bille Philip
Nekrich Yakov
Pissis Solon P.
Publication date: 27/11/2023

The weighted ancestor problem on a rooted node-weighted tree $T$ is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require $\Omega(\log\log n)$ time for queries provided $\mathcal{O}(n\,\text{poly}\log n)$ space is available, where $n$ is the input size. The weighted ancestor problem has attracted a lot of attention from the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This attention has culminated in a data structure for weighted ancestors in suffix trees with $\mathcal{O}(1)$ query time and an $\mathcal{O}(n)$-time construction algorithm [Belazzougui et al., CPM 2021]. In this paper, we consider a different version of the weighted ancestor problem, where the nodes are weighted by any function $\textsf{weight}$ that maps the nodes of $T$ to positive integers, such that $\textsf{weight}(u)\le \textsf{size}(u)$ for any node $u$ and $\textsf{weight}(u_1)\le \textsf{weight}(u_2)$ if node $u_1$ is a descendant of node $u_2$, where $\textsf{size}(u)$ is the number of nodes in the subtree rooted at $u$. In the size-constrained weighted ancestor (SWAQ) problem, for any node $u$ of $T$ and any integer $k$, we are asked to return the lowest ancestor $w$ of $u$ with weight at least $k$. We show that for any rooted tree with $n$ nodes, we can locate node $w$ in $\mathcal{O}(1)$ time after $\mathcal{O}(n)$-time preprocessing. In particular, this implies a data structure for the SWAQ problem in suffix trees with $\mathcal{O}(1)$ query time and $\mathcal{O}(n)$-time preprocessing, when the nodes are weighted by $\textsf{weight}$. We also show several string-processing applications of this result.
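To make the SWAQ query concrete, here is a naive reference sketch that simply walks up parent pointers until it meets a node of weight at least $k$. It takes time proportional to the depth of $u$, so it is not the $\mathcal{O}(1)$-time structure from the paper. The tree encoding (a `parent` dictionary with `None` for the root), the name `swaq_naive`, and the convention that a node counts as its own ancestor are all assumptions made for illustration.

```python
def swaq_naive(parent, weight, u, k):
    """Return the lowest ancestor of u (u included, by assumption) whose
    weight is at least k, or None if no such ancestor exists.

    parent: dict mapping each node to its parent (None for the root).
    weight: dict mapping each node to a positive integer that does not
            decrease when moving towards the root.
    """
    node = u
    while node is not None and weight[node] < k:
        node = parent[node]  # climb one level towards the root
    return node


# Example tree: 0 is the root with children 1 and 2; node 1 has children 3 and 4.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
# Weights here are subtree sizes, which satisfy weight(u) <= size(u).
weight = {0: 5, 1: 3, 2: 1, 3: 1, 4: 1}
```

With these hypothetical inputs, `swaq_naive(parent, weight, 3, 2)` climbs from node 3 to node 1, the first node on the root path whose weight reaches 2.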

arXiv.org e-Print Archive

A Tight Bound for Shortest Augmenting Paths on Trees

Author: Bosek Bartłomiej
Leniowski Dariusz
Sankowski Piotr
Zych-Pawlewicz Anna
Publication date: 20/12/2017

The shortest augmenting path technique is one of the fundamental ideas used in maximum matching and maximum flow algorithms. Since being introduced by Edmonds and Karp in 1972, it has been widely applied in many different settings. Surprisingly, despite this extensive usage, it is still not well understood even in the simplest case: the online bipartite matching problem on trees. In this problem a bipartite tree $T=(W \uplus B, E)$ is being revealed online, i.e., in each round one vertex from $B$ with its incident edges arrives. It was conjectured by Chaudhuri et al. [K. Chaudhuri, C. Daskalakis, R. D. Kleinberg, and H. Lin. Online bipartite perfect matching with augmentations. In INFOCOM 2009] that the total length of all shortest augmenting paths found is $O(n \log n)$. In this paper, we prove a tight $O(n \log n)$ upper bound for the total length of shortest augmenting paths for trees, improving over the $O(n \log^2 n)$ bound [B. Bosek, D. Leniowski, P. Sankowski, and A. Zych. Shortest augmenting paths for online matchings on trees. In WAOA 2015].

Comment: 22 pages, 10 figures

arXiv.org e-Print Archive

Jagiellonian University Repository