1,095 research outputs found
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
Tradeoffs for nearest neighbors on the sphere
We consider tradeoffs between the query and update complexities for the
(approximate) nearest neighbor problem on the sphere, extending the recent
spherical filters to sparse regimes and generalizing the scheme and analysis to
account for different tradeoffs. In a nutshell, for the sparse regime the
tradeoff between the query complexity and update complexity
for data sets of size is given by the following equation in
terms of the approximation factor and the exponents and :
For small , minimizing the time for updates leads to a linear
space complexity at the cost of a query time complexity .
Balancing the query and update costs leads to optimal complexities
, matching bounds from [Andoni-Razenshteyn, 2015] and [Dubiner,
IEEE-TIT'10] and matching the asymptotic complexities of [Andoni-Razenshteyn,
STOC'15] and [Andoni-Indyk-Laarhoven-Razenshteyn-Schmidt, NIPS'15]. A
subpolynomial query time complexity can be achieved at the cost of a
space complexity of the order , matching the bound
of [Andoni-Indyk-Patrascu, FOCS'06] and
[Panigrahy-Talwar-Wieder, FOCS'10] and improving upon results of
[Indyk-Motwani, STOC'98] and [Kushilevitz-Ostrovsky-Rabani, STOC'98].
For large , minimizing the update complexity results in a query complexity
of , improving upon the related exponent for large of
[Kapralov, PODS'15] by a factor , and matching the bound
of [Panigrahy-Talwar-Wieder, FOCS'08]. Balancing the costs leads to optimal
complexities , while a minimum query time complexity can be
achieved with update complexity , improving upon the
previous best exponents of Kapralov by a factor .Comment: 16 pages, 1 table, 2 figures. Mostly subsumed by arXiv:1608.03580
[cs.DS] (along with arXiv:1605.02701 [cs.DS]
When Hashing Met Matching: Efficient Spatio-Temporal Search for Ridesharing
Carpooling, or sharing a ride with other passengers, holds immense potential
for urban transportation. Ridesharing platforms enable such sharing of rides
using real-time data. Finding ride matches in real-time at urban scale is a
difficult combinatorial optimization task and mostly heuristic approaches are
applied. In this work, we mathematically model the problem as that of finding
near-neighbors and devise a novel efficient spatio-temporal search algorithm
based on the theory of locality sensitive hashing for Maximum Inner Product
Search (MIPS). The proposed algorithm can find near-optimal potential
matches for every ride from a pool of rides in time and space for a small . Our
algorithm can be extended in several useful and interesting ways increasing its
practical appeal. Experiments with large NY yellow taxi trip datasets show that
our algorithm consistently outperforms state-of-the-art heuristic methods
thereby proving its practical applicability
Hybrid LSH: Faster Near Neighbors Reporting in High-dimensional Space
We study the -near neighbors reporting problem (-NN), i.e., reporting
\emph{all} points in a high-dimensional point set that lie within a radius
of a given query point . Our approach builds upon on the
locality-sensitive hashing (LSH) framework due to its appealing asymptotic
sublinear query time for near neighbor search problems in high-dimensional
space. A bottleneck of the traditional LSH scheme for solving -NN is that
its performance is sensitive to data and query-dependent parameters. On
datasets whose data distributions have diverse local density patterns, LSH with
inappropriate tuning parameters can sometimes be outperformed by a simple
linear search.
In this paper, we introduce a hybrid search strategy between LSH-based search
and linear search for -NN in high-dimensional space. By integrating an
auxiliary data structure into LSH hash tables, we can efficiently estimate the
computational cost of LSH-based search for a given query regardless of the data
distribution. This means that we are able to choose the appropriate search
strategy between LSH-based search and linear search to achieve better
performance. Moreover, the integrated data structure is time efficient and fits
well with many recent state-of-the-art LSH-based approaches. Our experiments on
real-world datasets show that the hybrid search approach outperforms (or is
comparable to) both LSH-based search and linear search for a wide range of
search radii and data distributions in high-dimensional space.Comment: Accepted as a short paper in EDBT 201
- …