114 research outputs found

    Edit Distance: Sketching, Streaming and Document Exchange

    We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space. Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016).
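
    The output contract described above (return the edit distance when it is at most $K$, and "error" otherwise) can be illustrated with a classical banded dynamic program. This is only a sketch of the problem statement, not the sketching or document exchange protocol of the paper: it reads both strings in full and runs in roughly $O(nK)$ time. The function name `bounded_edit_distance` is hypothetical.

```python
def bounded_edit_distance(x: str, y: str, K: int):
    """Return the edit distance of x and y if it is at most K, else "error".

    Classical banded DP: only cells with |i - j| <= K are filled,
    and every value is capped at K + 1.
    """
    n, m = len(x), len(y)
    if abs(n - m) > K:
        return "error"
    INF = K + 1  # cap: any value above K is treated as "too far"
    # Row 0: distance from the empty prefix of x to y[:j]
    prev = [j if j <= K else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= K:
            cur[0] = i
        lo, hi = max(1, i - K), min(m, i + K)
        for j in range(lo, hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # substitute or match
                         prev[j] + 1,         # delete x[i-1]
                         cur[j - 1] + 1,      # insert y[j-1]
                         INF)
        prev = cur
    return prev[m] if prev[m] <= K else "error"


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 3))
    print(bounded_edit_distance("kitten", "sitting", 2))
```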

    Patching Colors with Tensors


    Improved Outlier Robust Seeding for k-means

    $k$-means is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization, called $k$-means++, uses $D^{2}$ sampling and comes with a provable $O(\log k)$ approximation guarantee \cite{AV2007}. However, in the presence of adversarial noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantee with respect to the $k$-means solution on the inliers does not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant of the $D^2$ sampling distribution, which makes it robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable $O(1)$ approximation guarantee. Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is an improvement over previous results for robust $k$-means based on LP relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and robust $k$-means++ \cite{DeshpandeKP20}. Our empirical results show the advantage of our algorithm over $k$-means++ \cite{AV2007}, uniform random seeding, greedy sampling for $k$-means \cite{tkmeanspp}, and robust $k$-means++ \cite{DeshpandeKP20}, on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to scalable, faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and is of independent interest for coreset constructions in the presence of outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
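
    For reference, the standard $D^2$ sampling that the abstract starts from can be sketched as follows. This is the plain $k$-means++ seeding, not the robust variant proposed in the paper, whose modified distribution is not specified in the abstract. The names `sq_dist` and `d2_seeding` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seeding(points, k, rng=None):
    """Standard k-means++ seeding: pick each new center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest center so far
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform choice
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (100.0, 100.0)]
    print(d2_seeding(data, 2, random.Random(0)))
```

    Because the weights are squared distances, a single far-away point such as `(100.0, 100.0)` above receives most of the probability mass after the first center is chosen, which is the outlier sensitivity the abstract refers to.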

    On Generalization Bounds for Projective Clustering

    Given a set of points, clustering consists of finding a partition of the point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown, but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular, for center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{k/n}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{\frac{kj^2}{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega\left(\sqrt{\frac{kj}{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] are essentially optimal.

    An Empirical Evaluation of k-Means Coresets

    Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means in both theory and practice. Curiously, there exists no work on comparing the quality of available k-means coresets. In this paper we perform such an evaluation. There is currently no algorithm known to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows us an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.

    Size-constrained Weighted Ancestors with Applications

    The weighted ancestor problem on a rooted node-weighted tree $T$ is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require $\Omega(\log\log n)$ time for queries provided $\mathcal{O}(n\text{ poly} \log n)$ space is available, where $n$ is the input size. The weighted ancestor problem has attracted a lot of attention by the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This attention has culminated in a data structure for weighted ancestors in suffix trees with $\mathcal{O}(1)$ query time and an $\mathcal{O}(n)$-time construction algorithm [Belazzougui et al., CPM 2021]. In this paper, we consider a different version of the weighted ancestor problem, where the nodes are weighted by any function $\textsf{weight}$ that maps the nodes of $T$ to positive integers, such that $\textsf{weight}(u)\le \textsf{size}(u)$ for any node $u$ and $\textsf{weight}(u_1)\le \textsf{weight}(u_2)$ if node $u_1$ is a descendant of node $u_2$, where $\textsf{size}(u)$ is the number of nodes in the subtree rooted at $u$. In the size-constrained weighted ancestor (SWAQ) problem, for any node $u$ of $T$ and any integer $k$, we are asked to return the lowest ancestor $w$ of $u$ with weight at least $k$. We show that for any rooted tree with $n$ nodes, we can locate node $w$ in $\mathcal{O}(1)$ time after $\mathcal{O}(n)$-time preprocessing. In particular, this implies a data structure for the SWAQ problem in suffix trees with $\mathcal{O}(1)$ query time and $\mathcal{O}(n)$-time preprocessing, when the nodes are weighted by $\textsf{weight}$. We also show several string-processing applications of this result.
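
    To make the SWAQ query concrete, here is a naive baseline that simply walks up the tree. It takes time proportional to the depth of $u$ and is not the $\mathcal{O}(1)$-time structure of the paper. It assumes that a node counts as its own ancestor, which the abstract does not state; the function name and the dictionary-based tree representation are hypothetical.

```python
def swaq_naive(parent, weight, u, k):
    """Return the lowest ancestor w of u with weight[w] >= k, or None.

    parent: dict mapping each node to its parent (the root maps to None).
    weight: dict mapping each node to a positive integer that does not
            decrease on the way from a node up to the root.
    Assumption: u counts as an ancestor of itself.
    """
    w = u
    # Weights are monotone along the root path, so the first node that
    # reaches weight k while walking upward is the lowest such ancestor.
    while w is not None and weight[w] < k:
        w = parent[w]
    return w


if __name__ == "__main__":
    # Tree: 0 is the root, 1 is a child of 0, and 2 and 3 are children of 1.
    parent = {0: None, 1: 0, 2: 1, 3: 1}
    # Using weight(u) = size(u) satisfies both constraints in the abstract.
    weight = {0: 4, 1: 3, 2: 1, 3: 1}
    print(swaq_naive(parent, weight, 2, 2))
```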

    A Tight Bound for Shortest Augmenting Paths on Trees

    The shortest augmenting path technique is one of the fundamental ideas used in maximum matching and maximum flow algorithms. Since being introduced by Edmonds and Karp in 1972, it has been widely applied in many different settings. Surprisingly, despite this extensive usage, it is still not well understood even in the simplest case: the online bipartite matching problem on trees. In this problem a bipartite tree $T=(W \uplus B, E)$ is revealed online, i.e., in each round one vertex from $B$ arrives with its incident edges. It was conjectured by Chaudhuri et al. [K. Chaudhuri, C. Daskalakis, R. D. Kleinberg, and H. Lin. Online bipartite perfect matching with augmentations. In INFOCOM 2009] that the total length of all shortest augmenting paths found is $O(n \log n)$. In this paper, we prove a tight $O(n \log n)$ upper bound for the total length of shortest augmenting paths for trees, improving over the $O(n \log^2 n)$ bound [B. Bosek, D. Leniowski, P. Sankowski, and A. Zych. Shortest augmenting paths for online matchings on trees. In WAOA 2015]. Comment: 22 pages, 10 figures
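
    The quantity being bounded, the total length of all shortest augmenting paths in the online setting, can be measured with a small simulation. The sketch below uses a textbook breadth-first search for a shortest augmenting path from each arriving vertex. It works on any bipartite graph and does not reproduce the paper's analysis; the function name and the input format are hypothetical.

```python
from collections import deque


def online_matching_total_path_length(arrivals):
    """arrivals: list of (b, neighbours) pairs, where each B-vertex b arrives
    together with its incident W-vertices.
    Returns (matching size, total number of edges on all augmenting paths)."""
    adj = {}      # b -> list of W-neighbours
    match_w = {}  # w -> b currently matched to w
    match_b = {}  # b -> w currently matched to b
    total = 0
    for b, nbrs in arrivals:
        adj[b] = list(nbrs)
        pred = {}  # w -> the B-vertex from which the search reached w
        queue, seen_b, end = deque([b]), {b}, None
        # BFS over alternating paths: any edge from B to W, matched edge back
        while queue and end is None:
            cur = queue.popleft()
            for w in adj[cur]:
                if w in pred:
                    continue
                pred[w] = cur
                if w not in match_w:  # free W-vertex: shortest path found
                    end = w
                    break
                nxt = match_w[w]
                if nxt not in seen_b:
                    seen_b.add(nxt)
                    queue.append(nxt)
        if end is None:
            continue  # no augmenting path exists; b stays unmatched
        # Flip the path from its free end back to b, counting its edges
        w = end
        while w is not None:
            bb = pred[w]
            old_w = match_b.get(bb)
            match_b[bb], match_w[w] = w, bb
            total += 1 if old_w is None else 2
            w = old_w
    return len(match_w), total


if __name__ == "__main__":
    # Path w2 - b1 - w1 - b2: b1 takes w1 first, then b2 forces a reroute.
    print(online_matching_total_path_length([("b1", ["w1", "w2"]), ("b2", ["w1"])]))
```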