43 research outputs found

    Optimal lower bounds for locality sensitive hashing (except when q is tiny)

    We study lower bounds for Locality Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}^d under the Hamming distance. Recall that here H is said to be an (r, cr, p, q)-sensitive hash family if all pairs x, y in {0,1}^d with dist(x,y) at most r have probability at least p of collision under a randomly chosen h in H, whereas all pairs x, y in {0,1}^d with dist(x,y) at least cr have probability at most q of collision. Typically, one considers d tending to infinity, with c fixed and q bounded away from 0. For its applications to approximate nearest neighbor search in high dimensions, the quality of an LSH family H is governed by how small its "rho parameter" rho = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani showed that for each c, the extremely simple family H = {x -> x_i : i in [d]} achieves rho at most 1/c. The only known lower bound, due to Motwani, Naor, and Panigrahy, is that rho must be at least .46/c (minus o_d(1)). In this paper we show an optimal lower bound: rho must be at least 1/c (minus o_d(1)). This lower bound for Hamming space yields a lower bound of 1/c^2 for Euclidean space (or the unit sphere) and 1/c for the Jaccard distance on sets; both of these match known upper bounds. Our proof is simple; the essence is that the noise stability of a boolean function at e^{-t} is a log-convex function of t.
    Comment: 9 pages + abstract and references
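As a small illustration of the rho parameter defined in this abstract, the sketch below evaluates rho = ln(1/p)/ln(1/q) for the bit-sampling family H = {x -> x_i}. It assumes the standard collision probabilities for a uniformly random coordinate, p = 1 - r/d and q = 1 - cr/d; the function name and the sample values of d, r, and c are hypothetical and not taken from the paper.

```python
import math


def bit_sampling_rho(d: int, r: float, c: float) -> float:
    """Return rho = ln(1/p) / ln(1/q) for the family H = {x -> x_i}.

    Assumes a uniformly random coordinate i, so two points at Hamming
    distance k collide with probability 1 - k/d.
    """
    if not (r > 0 and c > 1 and c * r < d):
        raise ValueError("need 0 < r < c*r < d")
    p = 1.0 - r / d        # collision probability at distance exactly r
    q = 1.0 - c * r / d    # collision probability at distance exactly c*r
    return math.log(1.0 / p) / math.log(1.0 / q)


# Illustrative parameters only: rho stays at or below 1/c in each case.
for c in (1.5, 2.0, 4.0):
    rho = bit_sampling_rho(d=1000, r=10, c=c)
    print(f"c={c}: rho={rho:.4f}, 1/c={1 / c:.4f}")
```

The bound rho <= 1/c follows from Bernoulli's inequality, (1 - r/d)^c >= 1 - cr/d for c >= 1, which is the upper bound that the abstract's lower bound matches.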

    Scalable Image Retrieval by Sparse Product Quantization

    Fast Approximate Nearest Neighbor (ANN) search for high-dimensional feature indexing and retrieval is the crux of large-scale image retrieval. A recent promising technique is Product Quantization, which attempts to index high-dimensional image features by decomposing the feature space into a Cartesian product of low-dimensional subspaces and quantizing each of them separately. Despite the promising results reported, this quantization approach follows the typical hard assignment of traditional quantization methods, which may result in large quantization errors and thus inferior search performance. Unlike the existing approaches, in this paper we propose a novel approach called Sparse Product Quantization (SPQ) to encode the high-dimensional feature vectors into sparse representations. We optimize the sparse representations of the feature vectors by minimizing their quantization errors, so that the resulting representation is essentially close to the original data in practice. Experiments show that the proposed SPQ technique is not only able to compress data, but is also an effective encoding technique. We obtain state-of-the-art results for ANN search on four public image datasets, and the promising results of content-based image retrieval further validate the efficacy of our proposed method.
    Comment: 12 pages
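To make the hard-assignment baseline mentioned in this abstract concrete, here is a minimal sketch of plain Product Quantization encoding: the vector is split into equal subvectors and each is replaced by the index of its nearest centroid. This is not the SPQ method itself, whose sparse encoding the abstract does not specify. The tiny codebooks are hand-picked, hypothetical values; in practice they would typically be learned, for example with k-means.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Hard-assignment product quantization.

    codebooks[j] is the list of centroids for the j-th subspace; all
    subspaces are assumed to have the same dimension.
    """
    m = len(codebooks)
    if len(vector) % m != 0:
        raise ValueError("vector length must be divisible by the number of subspaces")
    sub_dim = len(vector) // m
    codes = []
    for j, centroids in enumerate(codebooks):
        sub = vector[j * sub_dim:(j + 1) * sub_dim]
        # Hard assignment: keep only the single nearest centroid.
        best = min(range(len(centroids)),
                   key=lambda k: squared_distance(sub, centroids[k]))
        codes.append(best)
    return codes


def pq_decode(codes, codebooks):
    """Reconstruct an approximation by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx


# Hypothetical codebooks: 2 subspaces of dimension 2, 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
x = [0.9, 0.8, 0.2, 0.7]
codes = pq_encode(x, codebooks)
approx = pq_decode(codes, codebooks)
error = squared_distance(x, approx)
print(codes, approx, error)
```

The value `error` is the quantization error that the abstract attributes to hard assignment; an approach that represents each subvector with more than one centroid would aim to reduce it.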

    A reliable order-statistics-based approximate nearest neighbor search algorithm

    We propose a new algorithm for fast approximate nearest neighbor search based on the properties of ordered vectors. Data vectors are classified based on the index and sign of their largest components, thereby partitioning the space into a number of cones centered at the origin. The query is itself classified, and the search starts from the selected cone and proceeds to neighboring ones. Overall, the proposed algorithm corresponds to locality sensitive hashing in the space of directions, with hashing based on the order of components. Thanks to the statistical features emerging through ordering, it deals very well with the challenging case of unstructured data, and is a valuable building block for more complex techniques dealing with structured data. Experiments on both simulated and real-world data show that the proposed algorithm provides state-of-the-art performance.
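The classification step described in this abstract might be sketched as follows. The sketch assumes that "largest components" means largest in absolute value and that the hash key is the ordered tuple of (index, sign) pairs for the top k components. The function names, the parameter k, and the sample data are hypothetical, and the paper's step of proceeding to neighboring cones is omitted.

```python
def cone_key(vector, k=1):
    """Return (index, sign) pairs of the k largest-magnitude components,
    ordered by decreasing magnitude. Vectors sharing a key lie in the
    same cone centered at the origin."""
    if not 1 <= k <= len(vector):
        raise ValueError("k must be between 1 and the vector length")
    order = sorted(range(len(vector)),
                   key=lambda i: abs(vector[i]), reverse=True)
    return tuple((i, 1 if vector[i] >= 0 else -1) for i in order[:k])


def build_index(data, k=1):
    """Group the positions of the data vectors by their cone key."""
    buckets = {}
    for idx, vector in enumerate(data):
        buckets.setdefault(cone_key(vector, k), []).append(idx)
    return buckets


# Hypothetical toy data set and query.
data = [
    [0.9, 0.1, -0.2],
    [0.8, -0.3, 0.1],
    [-0.1, -0.7, 0.2],
]
index = build_index(data)
query = [1.0, 0.2, 0.0]

# Candidates are the data vectors that fall in the query's own cone.
candidates = index.get(cone_key(query), [])
print(candidates)
```

A larger k yields narrower cones and shorter candidate lists, at the cost of a higher chance that the true neighbor lies in a different cone, which is presumably why the described search also visits neighboring cones.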