Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing
We prove a tight lower bound for the exponent $\rho$ of data-dependent
Locality-Sensitive Hashing schemes, recently used to design efficient solutions
for the $c$-approximate nearest neighbor search. In particular, our lower bound
matches the bound of $\rho \le \frac{1}{2c-1} + o(1)$ for the $\ell_1$ space,
obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15].
In recent years it emerged that data-dependent hashing is strictly superior
to the classical Locality-Sensitive Hashing, where the hash function is chosen
independently of the data. In the latter setting, the best exponent has long been
known: for the $\ell_1$ space, the tight bound is $\rho = 1/c$, with the upper
bound from [Indyk-Motwani, STOC'98] and the matching lower bound from
[O'Donnell-Wu-Zhou, ITCS'11].
We prove that, even if the hashing is data-dependent, it must hold that
$\rho \ge \frac{1}{2c-1} - o(1)$. To prove the result, we need to formalize the
exact notion of data-dependent hashing that also captures the complexity of the
hash functions (in addition to their collision properties). Without restricting
such complexity, we would allow for obviously infeasible solutions such as the
Voronoi diagram of a dataset. To preclude such solutions, we require our hash
functions to be succinct. This condition is satisfied by all the known
algorithmic results.
Comment: 16 pages, no figures
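For context, the exponent governs the query time $n^{\rho}$ of an LSH-based data structure and is determined by the collision probabilities $p_1$ (near pairs) and $p_2$ (far pairs) via $\rho = \log(1/p_1)/\log(1/p_2)$. A minimal sketch (ours, not the paper's; the function name is hypothetical) showing how the tight data-independent bound $\rho \to 1/c$ arises for bit-sampling LSH in Hamming space:

import math

def bit_sampling_rho(r: float, c: float) -> float:
    """Exponent rho = log(1/p1)/log(1/p2) for bit-sampling LSH,
    where h(x) = x[i] for a uniformly random coordinate i.
    With distances normalized by the dimension, a pair at distance r
    collides with probability 1 - r, so p1 = 1 - r for near pairs
    (distance <= r) and p2 = 1 - c*r for far pairs (distance >= c*r)."""
    p1 = 1.0 - r
    p2 = 1.0 - c * r
    return math.log(1.0 / p1) / math.log(1.0 / p2)

# As the normalized radius r -> 0, rho tends to 1/c; for c = 2 the
# values below approach 0.5, the tight data-independent bound.
for r in [0.1, 0.01, 0.001]:
    print(r, bit_sampling_rho(r, c=2.0))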
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a
general technique for constructing a data structure to answer approximate near
neighbor queries by using a distribution $\mathcal{H}$ over locality-sensitive
hash functions that partition space. For a collection of $n$ points, after
preprocessing, the query time is dominated by $O(n^{\rho} \log n)$ evaluations
of hash functions from $\mathcal{H}$ and $O(n^{\rho})$ hash table lookups and
distance computations, where $\rho \in (0,1)$ is determined by the
locality-sensitivity properties of $\mathcal{H}$. It follows from a recent
result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive
hash functions can be reduced to $O(\log^2 n)$, leaving the query time to be
dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho} \log n)$
additional word-RAM operations. We state this result as a general framework and
provide a simpler analysis showing that the number of lookups and distance
computations closely matches the Indyk-Motwani framework, making it a viable
replacement in practice. Using ideas from another locality-sensitive hashing
framework by Andoni and Indyk (SODA 2006), we are able to reduce the number of
additional word-RAM operations to $O(n^{\rho})$.
Comment: 15 pages, 3 figures
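To see what is being replaced, here is a minimal, hypothetical sketch (class and parameter names are ours) of the classical Indyk-Motwani construction in Hamming space: $L$ tables, each keyed by $k$ concatenated bit-sampling hashes, so a query performs $kL$ hash evaluations on top of the lookups and distance computations, whereas the frameworks above get by with $O(\log^2 n)$ distinct hash function evaluations.

import random
from collections import defaultdict

class IndykMotwaniLSH:
    """Textbook LSH for Hamming space using bit sampling:
    L tables, each keyed by the concatenation of k sampled bits."""

    def __init__(self, dim: int, k: int, L: int):
        # Each table stores the k coordinates it samples.
        self.tables = [(random.sample(range(dim), k), defaultdict(list))
                       for _ in range(L)]

    def _key(self, coords, x):
        return tuple(x[i] for i in coords)

    def insert(self, idx, x):
        for coords, table in self.tables:
            table[self._key(coords, x)].append(idx)

    def query(self, q, points, radius):
        # Gather candidates colliding with q in any table,
        # then filter them by exact Hamming distance.
        candidates = set()
        for coords, table in self.tables:
            candidates.update(table.get(self._key(coords, q), []))
        return [i for i in candidates
                if sum(a != b for a, b in zip(points[i], q)) <= radius]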
Practical and Optimal LSH for Angular Distance
We show the existence of a Locality-Sensitive Hashing (LSH) family for the
angular distance that yields an approximate Near Neighbor Search algorithm with
the asymptotically optimal running time exponent. Unlike earlier algorithms
with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn
2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving
upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also
introduce a multiprobe version of this algorithm, and conduct experimental
evaluation on real and synthetic data sets.
We complement the above positive results with a fine-grained lower bound for
the quality of any LSH family for angular distance. Our lower bound implies
that the above LSH family exhibits a trade-off between evaluation time and
quality that is close to optimal for a natural class of LSH functions.
Comment: 22 pages, an extended abstract is to appear in the proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015)
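The family shown to be optimal here is cross-polytope LSH, which hashes a point on the unit sphere to the nearest signed standard basis vector after a random rotation. A minimal sketch of that idea (ours; a plain Gaussian matrix stands in for the fast pseudo-random rotations that make the scheme practical):

import numpy as np

def cross_polytope_hash(x: np.ndarray, A: np.ndarray) -> int:
    """Hash a unit vector x to the signed basis vector closest to Ax,
    where A is a random Gaussian matrix standing in for a random
    rotation. Returns a bucket index in 0..2d-1."""
    y = A @ x
    i = int(np.argmax(np.abs(y)))
    return 2 * i + (1 if y[i] < 0 else 0)

rng = np.random.default_rng(0)
d = 128
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
y = x + 0.1 * rng.standard_normal(d)   # a nearby point on the sphere
y /= np.linalg.norm(y)
# Nearby points collide far more often than random pairs do.
print(cross_polytope_hash(x, A), cross_polytope_hash(y, A))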
Tradeoffs for nearest neighbors on the sphere
We consider tradeoffs between the query and update complexities for the
(approximate) nearest neighbor problem on the sphere, extending the recent
spherical filters to sparse regimes and generalizing the scheme and analysis to
account for different tradeoffs. In a nutshell, for the sparse regime the
tradeoff between the query complexity $n^{\rho_q}$ and update complexity
$n^{\rho_u}$ for data sets of size $n$ is given by the following equation in
terms of the approximation factor $c$ and the exponents $\rho_q$ and $\rho_u$:
$c^2 \sqrt{\rho_q} + (c^2 - 1) \sqrt{\rho_u} = \sqrt{2c^2 - 1}$.
For small $c = 1 + \epsilon$, minimizing the time for updates leads to a linear
space complexity at the cost of a query time complexity $n^{1 - 4\epsilon^2}$.
Balancing the query and update costs leads to optimal complexities
$n^{1/(2c^2-1)}$, matching bounds from [Andoni-Razenshteyn, 2015] and [Dubiner,
IEEE-TIT'10] and matching the asymptotic complexities of [Andoni-Razenshteyn,
STOC'15] and [Andoni-Indyk-Laarhoven-Razenshteyn-Schmidt, NIPS'15]. A
subpolynomial query time complexity $n^{o(1)}$ can be achieved at the cost of a
space complexity of the order $n^{1/(4\epsilon^2)}$, matching the bound
$n^{\Omega(1/\epsilon^2)}$ of [Andoni-Indyk-Patrascu, FOCS'06] and
[Panigrahy-Talwar-Wieder, FOCS'10] and improving upon results of
[Indyk-Motwani, STOC'98] and [Kushilevitz-Ostrovsky-Rabani, STOC'98].
For large $c$, minimizing the update complexity results in a query complexity
of $n^{2/c^2 + O(1/c^4)}$, improving upon the related exponent for large $c$ of
[Kapralov, PODS'15] by a factor $2$, and matching the bound $n^{\Theta(1/c^2)}$
of [Panigrahy-Talwar-Wieder, FOCS'08]. Balancing the costs leads to optimal
complexities $n^{1/(2c^2-1)}$, while a minimum query time complexity $n^{o(1)}$
can be achieved with update complexity $n^{2/c^2 + O(1/c^4)}$, improving upon
the previous best exponents of Kapralov by a factor $4$.
Comment: 16 pages, 1 table, 2 figures. Mostly subsumed by arXiv:1608.03580 [cs.DS] (along with arXiv:1605.02701 [cs.DS])
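As a small illustration (ours, not the paper's; the function name is hypothetical), the tradeoff equation above can be solved directly for the query exponent given a target update exponent; at the balanced point it recovers $\rho_q = \rho_u = 1/(2c^2 - 1)$:

import math

def query_exponent(c: float, rho_u: float) -> float:
    """Solve the sparse-regime tradeoff from the abstract,
    c^2 sqrt(rho_q) + (c^2 - 1) sqrt(rho_u) = sqrt(2c^2 - 1),
    for rho_q given the update exponent rho_u (valid while the
    right-hand side below stays nonnegative)."""
    rhs = math.sqrt(2 * c * c - 1) - (c * c - 1) * math.sqrt(rho_u)
    return (rhs / (c * c)) ** 2

c = 2.0
balanced = 1.0 / (2 * c * c - 1)              # = 1/7 for c = 2
print(query_exponent(c, 0.0))                 # update-optimal query exponent
print(query_exponent(c, balanced), balanced)  # both ~ 0.142857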
Graph-Based Time-Space Trade-Offs for Approximate Near Neighbors
We take a first step towards a rigorous asymptotic analysis of graph-based methods for finding (approximate) nearest neighbors in high-dimensional spaces, by analyzing the complexity of randomized greedy walks on the approximate nearest neighbor graph. For random data sets of size $n = 2^{o(d)}$ on the $d$-dimensional Euclidean unit sphere, using near neighbor graphs we can provably solve the approximate nearest neighbor problem with approximation factor $c > 1$ in query time $n^{\rho_q + o(1)}$ and space $n^{1 + \rho_s + o(1)}$, for arbitrary $\rho_q, \rho_s \ge 0$ satisfying $(2c^2 - 1)\rho_q + 2c^2(c^2 - 1)\sqrt{\rho_s(1 - \rho_s)} \ge c^4$. Graph-based near neighbor searching is especially competitive with hash-based methods for small $c$ and near-linear memory, and in this regime the asymptotic scaling of a greedy graph-based search matches the optimal hash-based trade-offs of Andoni-Laarhoven-Razenshteyn-Waingarten [Andoni et al., 2017]. We further study how the trade-offs scale when the data set is of size $n = 2^{\Theta(d)}$, and analyze the asymptotic complexities when applying these results to lattice sieving.
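A minimal, hypothetical sketch (names are ours) of the kind of randomized greedy walk analyzed here: start from a random vertex of a precomputed near neighbor graph and repeatedly move to a random neighbor that is closer to the query, restarting when stuck in a local minimum:

import random

def greedy_walk(graph, points, dist, query, restarts=10):
    """Randomized greedy descent on a near neighbor graph.
    graph maps each vertex to a list of its near neighbors;
    returns the best vertex found over several random restarts."""
    best = None
    for _ in range(restarts):
        node = random.choice(list(graph))
        while True:
            closer = [v for v in graph[node]
                      if dist(points[v], query) < dist(points[node], query)]
            if not closer:
                break                        # local minimum: no neighbor improves
            node = random.choice(closer)     # randomized greedy step
        if best is None or dist(points[node], query) < dist(points[best], query):
            best = node
    return best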
Explicit Correlation Amplifiers for Finding Outlier Correlations in Deterministic Subquadratic Time
We derandomize G. Valiant's [J.ACM 62(2015) Art.13] subquadratic-time algorithm for finding outlier correlations in binary data. Our derandomized algorithm gives deterministic subquadratic scaling essentially for the same parameter range as Valiant's randomized algorithm, but the precise constants we save over quadratic scaling are more modest. Our main technical tool for derandomization is an explicit family of correlation amplifiers built via a family of zigzag-product expanders in Reingold, Vadhan, and Wigderson [Ann. of Math 155(2002), 157-187]. We say that a function $f: \{-1,1\}^d \to \{-1,1\}^D$ is a correlation amplifier with threshold $0 \le \tau \le 1$, error $\gamma \ge 1$, and strength $p$ an even positive integer if for all pairs of vectors $x, y \in \{-1,1\}^d$ it holds that (i) $|\langle x, y \rangle| < \tau d$ implies $|\langle f(x), f(y) \rangle| \le (\tau\gamma)^p D$; and (ii) $|\langle x, y \rangle| \ge \tau d$ implies $(\langle x, y \rangle / (\gamma d))^p D \le \langle f(x), f(y) \rangle \le (\gamma \langle x, y \rangle / d)^p D$.
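For intuition about property (ii), here is a sketch of the simple randomized amplifier that the paper derandomizes (our illustration; the explicit construction uses expander walks rather than independently random index tuples): each output coordinate is a product of $p$ random input coordinates, so the normalized correlation is raised to the $p$-th power in expectation, which separates faint background correlations from outliers.

import random

def randomized_amplifier(d: int, D: int, p: int, seed: int = 0):
    """Return f: {-1,1}^d -> {-1,1}^D where each output coordinate is
    the product of p independently chosen input coordinates, so that
    E[<f(x), f(y)>] / D = (<x, y> / d)^p."""
    rng = random.Random(seed)
    tuples = [[rng.randrange(d) for _ in range(p)] for _ in range(D)]

    def f(x):
        out = []
        for t in tuples:
            prod = 1
            for i in t:
                prod *= x[i]
            out.append(prod)
        return out

    return f

d, D, p = 256, 4096, 4    # p even, as in the definition above
f = randomized_amplifier(d, D, p)
rng = random.Random(1)
x = [rng.choice([-1, 1]) for _ in range(d)]
y = [xi if rng.random() < 0.9 else -xi for xi in x]  # correlation ~ 0.8
fx, fy = f(x), f(y)
print(sum(a * b for a, b in zip(fx, fy)) / D)  # roughly 0.8**4 ~ 0.41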