1,095 research outputs found

    Hashing for Similarity Search: A Survey

    Full text link
    Similarity search (nearest neighbor search) is a problem of pursuing the data items whose distances to a query item are the smallest from a large database. Various methods have been developed to address this problem, and recently a lot of efforts have been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work locality sensitive hashing. We divide the hashing algorithms two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution and learning to hash, which learns hash functions according the data distribution, and review them from various aspects, including hash function design and distance measure and search scheme in the hash coding space

    Tradeoffs for nearest neighbors on the sphere

    Get PDF
    We consider tradeoffs between the query and update complexities for the (approximate) nearest neighbor problem on the sphere, extending the recent spherical filters to sparse regimes and generalizing the scheme and analysis to account for different tradeoffs. In a nutshell, for the sparse regime the tradeoff between the query complexity nρqn^{\rho_q} and update complexity nρun^{\rho_u} for data sets of size nn is given by the following equation in terms of the approximation factor cc and the exponents ρq\rho_q and ρu\rho_u: c2ρq+(c21)ρu=2c21.c^2\sqrt{\rho_q}+(c^2-1)\sqrt{\rho_u}=\sqrt{2c^2-1}. For small c=1+ϵc=1+\epsilon, minimizing the time for updates leads to a linear space complexity at the cost of a query time complexity n14ϵ2n^{1-4\epsilon^2}. Balancing the query and update costs leads to optimal complexities n1/(2c21)n^{1/(2c^2-1)}, matching bounds from [Andoni-Razenshteyn, 2015] and [Dubiner, IEEE-TIT'10] and matching the asymptotic complexities of [Andoni-Razenshteyn, STOC'15] and [Andoni-Indyk-Laarhoven-Razenshteyn-Schmidt, NIPS'15]. A subpolynomial query time complexity no(1)n^{o(1)} can be achieved at the cost of a space complexity of the order n1/(4ϵ2)n^{1/(4\epsilon^2)}, matching the bound nΩ(1/ϵ2)n^{\Omega(1/\epsilon^2)} of [Andoni-Indyk-Patrascu, FOCS'06] and [Panigrahy-Talwar-Wieder, FOCS'10] and improving upon results of [Indyk-Motwani, STOC'98] and [Kushilevitz-Ostrovsky-Rabani, STOC'98]. For large cc, minimizing the update complexity results in a query complexity of n2/c2+O(1/c4)n^{2/c^2+O(1/c^4)}, improving upon the related exponent for large cc of [Kapralov, PODS'15] by a factor 22, and matching the bound nΩ(1/c2)n^{\Omega(1/c^2)} of [Panigrahy-Talwar-Wieder, FOCS'08]. Balancing the costs leads to optimal complexities n1/(2c21)n^{1/(2c^2-1)}, while a minimum query time complexity can be achieved with update complexity n2/c2+O(1/c4)n^{2/c^2+O(1/c^4)}, improving upon the previous best exponents of Kapralov by a factor 22.Comment: 16 pages, 1 table, 2 figures. Mostly subsumed by arXiv:1608.03580 [cs.DS] (along with arXiv:1605.02701 [cs.DS]

    When Hashing Met Matching: Efficient Spatio-Temporal Search for Ridesharing

    Full text link
    Carpooling, or sharing a ride with other passengers, holds immense potential for urban transportation. Ridesharing platforms enable such sharing of rides using real-time data. Finding ride matches in real-time at urban scale is a difficult combinatorial optimization task and mostly heuristic approaches are applied. In this work, we mathematically model the problem as that of finding near-neighbors and devise a novel efficient spatio-temporal search algorithm based on the theory of locality sensitive hashing for Maximum Inner Product Search (MIPS). The proposed algorithm can find kk near-optimal potential matches for every ride from a pool of nn rides in time O(n1+ρ(k+logn)logk)O(n^{1 + \rho} (k + \log n) \log k) and space O(n1+ρlogk)O(n^{1 + \rho} \log k) for a small ρ<1\rho < 1. Our algorithm can be extended in several useful and interesting ways increasing its practical appeal. Experiments with large NY yellow taxi trip datasets show that our algorithm consistently outperforms state-of-the-art heuristic methods thereby proving its practical applicability

    Hybrid LSH: Faster Near Neighbors Reporting in High-dimensional Space

    Get PDF
    We study the rr-near neighbors reporting problem (rr-NN), i.e., reporting \emph{all} points in a high-dimensional point set SS that lie within a radius rr of a given query point qq. Our approach builds upon on the locality-sensitive hashing (LSH) framework due to its appealing asymptotic sublinear query time for near neighbor search problems in high-dimensional space. A bottleneck of the traditional LSH scheme for solving rr-NN is that its performance is sensitive to data and query-dependent parameters. On datasets whose data distributions have diverse local density patterns, LSH with inappropriate tuning parameters can sometimes be outperformed by a simple linear search. In this paper, we introduce a hybrid search strategy between LSH-based search and linear search for rr-NN in high-dimensional space. By integrating an auxiliary data structure into LSH hash tables, we can efficiently estimate the computational cost of LSH-based search for a given query regardless of the data distribution. This means that we are able to choose the appropriate search strategy between LSH-based search and linear search to achieve better performance. Moreover, the integrated data structure is time efficient and fits well with many recent state-of-the-art LSH-based approaches. Our experiments on real-world datasets show that the hybrid search approach outperforms (or is comparable to) both LSH-based search and linear search for a wide range of search radii and data distributions in high-dimensional space.Comment: Accepted as a short paper in EDBT 201
    corecore