
    Block Heavy Hitters

    Get PDF
    We study a natural generalization of the heavy hitters problem in the streaming context. We term this generalization *block heavy hitters* and define it as follows. We are to stream over a matrix $A$, and report all *rows* that are heavy, where a row is heavy if its $\ell_1$-norm is at least a $\phi$ fraction of the $\ell_1$-norm of the entire matrix $A$. In comparison, in the standard heavy hitters problem, we are required to report the matrix *entries* that are heavy. As is common in streaming, we solve the problem approximately: we return all rows with weight at least $\phi$, but also possibly some other rows that have weight no less than $(1-\varepsilon)\phi$. To solve the block heavy hitters problem, we show how to construct a linear sketch of $A$ from which we can recover the heavy rows of $A$. The block heavy hitters problem has already found applications for other streaming problems. In particular, it is a crucial building block in a streaming algorithm that constructs a small-size sketch for the Ulam metric, a metric on non-repetitive strings under the edit (Levenshtein) distance.
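
    For intuition only, here is a minimal Python sketch of the *idea* of recovering heavy rows from a small linear summary. It is not the paper's construction: it simply hashes row indices into a Count-Min table and accumulates each row's $\ell_1$ mass, assuming nonnegative entries, and it keeps the set of seen row indices purely as a reporting convenience.

```python
import random

class RowL1CountMin:
    """Toy linear sketch for block heavy hitters: estimates the l1 mass
    of each *row* of a streamed matrix A (entries assumed nonnegative),
    so rows with mass >= phi * total can be reported. Illustrative only;
    not the construction from the paper."""

    def __init__(self, width=1024, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # One salted hash function per table row.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]
        self.total = 0.0
        self.rows_seen = set()  # small-stream convenience for reporting

    def update(self, i, j, delta):
        """Stream update A[i, j] += delta (delta >= 0 assumed)."""
        self.total += delta
        self.rows_seen.add(i)
        for d, salt in enumerate(self.salts):
            self.table[d][hash((salt, i)) % self.width] += delta

    def row_mass(self, i):
        # Count-Min: every bucket overestimates, so take the minimum.
        return min(self.table[d][hash((self.salts[d], i)) % self.width]
                   for d in range(self.depth))

    def heavy_rows(self, phi):
        return [i for i in self.rows_seen
                if self.row_mass(i) >= phi * self.total]
```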

    Approximate nearest neighbor problem in high dimensions

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 47-49). We investigate the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text $T$. The exact version of this problem is defined as follows. Given a text $T$ of length $n$, we want to build a data structure that supports the following operation: given a pattern $P$, find the substring of $T$ that is the closest to $P$. Since the exact version of this problem is surprisingly difficult, we address the approximate version, in which we are allowed to return a substring of $T$ that is at most $c$ times further than the actual closest substring of $T$. This problem occurs, for example, in computational biology [4, 5]. In particular, we study the case where the length of the pattern $P$, denoted by $m$, is not known in advance, which is the most natural scenario. We present a data structure that uses $O(n^{1+1/c})$ space and has $O(n^{1/c} + m n^{o(1)})$ query time when the distance between two strings is the Hamming distance. These bounds essentially match the earlier bounds of [12], which assumed that the pattern length $m$ is fixed in advance. Furthermore, our data structure can be constructed in $O(n^{1+1/c} + n^{1+o(1)} M^{1/3})$ time, where $M$ is an upper bound for $m$. This time essentially matches the preprocessing time of [12] as long as the term $O(n^{1+1/c})$ dominates the running time, which is the case when, for example, $c < 3$. We also extend our results to the case where the distances are measured according to the $\ell_1$ distance. The query time and the space bound are essentially the same, while the preprocessing time becomes $\tilde{O}(n^{1+1/c} + n^{1+o(1)} M^{2/3})$. (We use the notation $f(n) = \tilde{O}(g(n))$ to denote $f(n) = g(n) \log^{O(1)} g(n)$.) by Alexandr Andoni. M.Eng.
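
    To make the problem statement concrete: the exact version can be solved naively in $O(nm)$ time by sliding the pattern over the text, and this is the baseline the thesis's data structure is built to beat. A minimal sketch of that baseline, assuming the Hamming distance and $m \le n$:

```python
def hamming(s, t):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))

def nearest_substring(T, P):
    """Exact substring nearest neighbor: the length-|P| substring of T
    closest to P under Hamming distance, found by brute force in
    O(|T| * |P|) time. Assumes |P| <= |T|."""
    m = len(P)
    best_pos, best_dist = 0, m + 1
    for i in range(len(T) - m + 1):
        d = hamming(T[i:i + m], P)
        if d < best_dist:
            best_pos, best_dist = i, d
    return T[best_pos:best_pos + m], best_dist

# Example: nearest_substring("abracadabra", "acad") -> ("acad", 0)
```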

    Practical and Optimal LSH for Angular Distance

    Get PDF
    We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions. Comment: 22 pages; an extended abstract is to appear in the proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015).
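
    For reference, the hyperplane LSH family [Charikar, 2002] that this paper improves on hashes a vector by the sign of a random Gaussian projection; two unit vectors at angle $\theta$ then collide with probability $1 - \theta/\pi$. A minimal sketch of that baseline (not the paper's cross-polytope family):

```python
import math
import random

def make_hyperplane_lsh(dim, seed=None):
    """One hash from the hyperplane LSH family [Charikar 2002]:
    h(x) = sign(<r, x>) for a random Gaussian vector r. For unit
    vectors at angle theta, Pr[h(x) = h(y)] = 1 - theta/pi."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: sum(ri * xi for ri, xi in zip(r, x)) >= 0.0

# Empirically check the collision probability at a 60-degree angle.
x, y = [1.0, 0.0], [math.cos(math.pi / 3), math.sin(math.pi / 3)]
trials, hits = 20000, 0
for s in range(trials):
    h = make_hyperplane_lsh(2, seed=s)
    hits += (h(x) == h(y))
print(hits / trials, 1 - (math.pi / 3) / math.pi)  # both ~ 0.667
```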

    Distance-Sensitive Hashing

    Get PDF
    Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point. In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space $(X, \text{dist})$ and a "collision probability function" (CPF) $f\colon \mathbb{R} \rightarrow [0,1]$ we seek a distribution over pairs of functions $(h,g)$ such that for every pair of points $x, y \in X$ the collision probability is $\Pr[h(x)=g(y)] = f(\text{dist}(x,y))$. Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, $f$ can be made exponentially decreasing even if we restrict attention to the symmetric case where $g=h$. We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management. Comment: Accepted at PODS'18. Abstract shortened due to character limit.
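
    A toy example of the asymmetry point (not a construction from the paper): take a hyperplane hash bit $h$ and let $g$ flip it. For unit vectors at angle $\theta$ this yields the CPF $\Pr[h(x)=g(y)] = \theta/\pi$, which is *increasing* in the distance, something no symmetric family can achieve since $g = h$ forces $f(0) = 1$.

```python
import math
import random

def make_dsh_pair(dim, seed=None):
    """A toy asymmetric (h, g) pair: h is a hyperplane hash bit and g
    flips it, so Pr[h(x) = g(y)] = theta/pi for unit vectors at angle
    theta -- an increasing CPF. Illustrative only; not from the paper."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    h = lambda x: int(sum(ri * xi for ri, xi in zip(r, x)) >= 0.0)
    g = lambda y: 1 - h(y)
    return h, g

# The CPF at a 60-degree angle should be about (pi/3)/pi = 1/3.
x, y = [1.0, 0.0], [math.cos(math.pi / 3), math.sin(math.pi / 3)]
trials, hits = 20000, 0
for s in range(trials):
    h, g = make_dsh_pair(2, seed=s)
    hits += (h(x) == g(y))
print(hits / trials)  # ~ 0.333
```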

    Approximate Near Neighbors for General Symmetric Norms

    Full text link
    We show that every symmetric normed space admits an efficient nearest neighbor search data structure with doubly-logarithmic approximation. Specifically, for every $n$, $d = n^{o(1)}$, and every $d$-dimensional symmetric norm $\|\cdot\|$, there exists a data structure for $\mathrm{poly}(\log \log n)$-approximate nearest neighbor search over $\|\cdot\|$ for $n$-point datasets achieving $n^{o(1)}$ query time and $n^{1+o(1)}$ space. The main technical ingredient of the algorithm is a low-distortion embedding of a symmetric norm into a low-dimensional iterated product of top-$k$ norms. We also show that our techniques cannot be extended to general norms. Comment: 27 pages, 1 figure.
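
    The top-$k$ norm used as the embedding target is the sum of the $k$ largest coordinates in absolute value, interpolating between $\ell_\infty$ at $k = 1$ and $\ell_1$ at $k = d$. Computing it is straightforward:

```python
import heapq

def top_k_norm(x, k):
    """Top-k norm: sum of the k largest coordinates of x in absolute
    value. k = 1 gives the l_infinity norm; k = len(x) gives l1."""
    return sum(heapq.nlargest(k, (abs(v) for v in x)))

# Example: top_k_norm([3, -5, 1, 2], 2) == 5 + 3 == 8
```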

    New LSH-based Algorithm for Approximate Nearest Neighbor

    No full text
    We present an algorithm for the $c$-approximate nearest neighbor problem in a $d$-dimensional Euclidean space, achieving query time of $O(d n^{1/c^2 + o(1)})$ and space $O(dn + n^{1 + 1/c^2 + o(1)})$.
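
    The abstract states only the bounds; the algorithm itself (ball-carving LSH) is involved. As a point of comparison, the classical and much simpler $p$-stable LSH family for Euclidean distance [Datar et al. 2004], which achieves the weaker exponent of roughly $1/c$, looks like this; it is not the scheme from this paper:

```python
import math
import random

def make_euclidean_lsh(dim, w=4.0, seed=None):
    """One hash from the p-stable LSH family for Euclidean distance
    [Datar et al. 2004]: h(x) = floor((<a, x> + b) / w) with Gaussian a
    and b uniform in [0, w). Gives query exponent roughly 1/c -- weaker
    than the 1/c^2 above, but a simple starting point."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return lambda x: math.floor(
        (sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

# Nearby points usually share a bucket; far points usually do not.
h = make_euclidean_lsh(3, seed=1)
print(h([0.0, 0.0, 0.0]), h([0.1, 0.0, 0.0]), h([9.0, 9.0, 9.0]))
```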