Optimal lower bounds for locality sensitive hashing (except when q is tiny)
We study lower bounds for Locality Sensitive Hashing (LSH) in the strongest
setting: point sets in {0,1}^d under the Hamming distance. Recall that here H
is said to be an (r, cr, p, q)-sensitive hash family if all pairs x, y in
{0,1}^d with dist(x,y) at most r have probability at least p of collision under
a randomly chosen h in H, whereas all pairs x, y in {0,1}^d with dist(x,y) at
least cr have probability at most q of collision. Typically, one considers d
tending to infinity, with c fixed and q bounded away from 0.
For its applications to approximate nearest neighbor search in high
dimensions, the quality of an LSH family H is governed by how small its "rho
parameter" rho = ln(1/p)/ln(1/q) is as a function of the parameter c. The
seminal paper of Indyk and Motwani showed that for each c, the extremely simple
family H = {x -> x_i : i in [d]} achieves rho at most 1/c. The only known lower
bound, due to Motwani, Naor, and Panigrahy, is that rho must be at least .46/c
(minus o_d(1)).
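
As an illustration of the definitions above, the following minimal sketch computes the rho parameter of the bit-sampling family {x -> x_i}. It assumes a uniformly random coordinate i, so a pair at Hamming distance k collides with probability 1 - k/d. The function names and the example values of d, r, and c are hypothetical and chosen only for illustration.

```python
import math

def collision_probability(x, y):
    """Exact collision probability of x and y under h(x) = x_i, i uniform.

    This is simply the fraction of coordinates on which x and y agree.
    """
    agree = sum(1 for a, b in zip(x, y) if a == b)
    return agree / len(x)

def bit_sampling_rho(d, r, c):
    """rho = ln(1/p) / ln(1/q) for bit sampling on {0,1}^d (requires c*r < d)."""
    p = 1.0 - r / d          # worst-case collision probability at distance r
    q = 1.0 - c * r / d      # best-case collision probability at distance c*r
    return math.log(1.0 / p) / math.log(1.0 / q)

# Illustrative parameters (assumptions, not taken from the paper).
d, r, c = 1000, 10, 2.0
rho = bit_sampling_rho(d, r, c)

# Two points that differ in exactly r coordinates collide with probability 1 - r/d.
x = [0] * d
y = [1] * r + [0] * (d - r)
p_near = collision_probability(x, y)
```

By Bernoulli's inequality, (1 - r/d)^c >= 1 - cr/d, so the value returned by `bit_sampling_rho` never exceeds 1/c, consistent with the upper bound quoted in the abstract.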
In this paper we show an optimal lower bound: rho must be at least 1/c (minus
o_d(1)). This lower bound for Hamming space yields a lower bound of 1/c^2 for
Euclidean space (or the unit sphere) and 1/c for the Jaccard distance on sets;
both of these match known upper bounds. Our proof is simple; the essence is
that the noise stability of a boolean function at e^{-t} is a log-convex
function of t.
Comment: 9 pages + abstract and reference
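
As a hedged side note on the log-convexity claim in the abstract above, one standard way to see such a property uses the Fourier expansion of noise stability. The notation below (the Fourier coefficients \hat{f}(S) and the function K) is introduced here for illustration and is not taken from the abstract.

```latex
\mathrm{Stab}_{\rho}[f] \;=\; \sum_{S \subseteq [d]} \rho^{|S|}\, \hat{f}(S)^2,
\qquad
K(t) \;:=\; \mathrm{Stab}_{e^{-t}}[f] \;=\; \sum_{S \subseteq [d]} \hat{f}(S)^2\, e^{-t|S|}.
```

Each term \hat{f}(S)^2 e^{-t|S|} is a nonnegative multiple of an exponential in t and is therefore log-linear; a sum of log-convex functions is log-convex, so K is log-convex in t.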
Scalable Image Retrieval by Sparse Product Quantization
Fast Approximate Nearest Neighbor (ANN) search for high-dimensional
feature indexing and retrieval is the crux of large-scale image retrieval. A
recent promising technique is Product Quantization, which attempts to index
high-dimensional image features by decomposing the feature space into a
Cartesian product of low dimensional subspaces and quantizing each of them
separately. Despite the promising results reported, its quantization approach
follows the typical hard assignment of traditional quantization methods, which
may result in large quantization errors and thus inferior search performance.
Unlike the existing approaches, in this paper we propose a novel approach
called Sparse Product Quantization (SPQ) that encodes the high-dimensional
feature vectors into sparse representations. We optimize the sparse
representations of the feature vectors by minimizing their quantization errors,
so that the resulting representation stays close to the original data
in practice. Experiments show that the proposed SPQ technique not only
compresses data but is also an effective encoding technique. We obtain
state-of-the-art results for ANN search on four public image datasets and the
promising results of content-based image retrieval further validate the
efficacy of our proposed method.
Comment: 12 page
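
To make the baseline concrete, here is a minimal sketch of plain Product Quantization with hard assignment, the scheme the abstract above contrasts SPQ with. It is not the SPQ method itself. The tiny hand-written codebooks, the function names, and the example vector are all hypothetical; in practice codebooks would be learned, for example with k-means on each subspace.

```python
def split(vec, m):
    """Split vec into m contiguous subvectors of equal length."""
    if len(vec) % m != 0:
        raise ValueError("vector length must be divisible by m")
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vec, codebooks):
    """Hard assignment: each subvector maps to the index of its nearest codeword."""
    subs = split(vec, len(codebooks))
    return [
        min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
        for sub, cb in zip(subs, codebooks)
    ]

def pq_decode(code, codebooks):
    """Reconstruct an approximation by concatenating the selected codewords."""
    out = []
    for k, cb in zip(code, codebooks):
        out.extend(cb[k])
    return out

# Hypothetical codebooks: two subspaces, two 2-D codewords each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vec = [0.9, 0.8, 0.1, 0.7]
code = pq_encode(vec, codebooks)
approx = pq_decode(code, codebooks)
error = sq_dist(vec, approx)   # quantization error of the hard assignment
```

The quantity `error` is the quantization error that hard assignment incurs; the abstract describes SPQ as reducing this kind of error by using sparse representations rather than a single codeword per subspace.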
A reliable order-statistics-based approximate nearest neighbor search algorithm
We propose a new algorithm for fast approximate nearest neighbor search based
on the properties of ordered vectors. Data vectors are classified based on the
index and sign of their largest components, thereby partitioning the space into a
number of cones centered at the origin. The query is itself classified, and the
search starts from the selected cone and proceeds to neighboring ones. Overall,
the proposed algorithm corresponds to locality sensitive hashing in the space
of directions, with hashing based on the order of components. Thanks to the
statistical features emerging through ordering, it deals very well with the
challenging case of unstructured data, and is a valuable building block for
more complex techniques dealing with structured data. Experiments on both
simulated and real-world data show that the proposed algorithm provides
state-of-the-art performance.
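
The classification step described in the abstract above can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes "largest components" means largest in absolute value, uses a hypothetical parameter `k` for how many components form the key, and only probes the query's own cone (the search over neighboring cones is omitted).

```python
def cone_key(vec, k=1):
    """Key a vector by the (index, sign) pairs of its k largest-magnitude components."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return tuple((i, 1 if vec[i] >= 0 else -1) for i in order[:k])

def build_index(data, k=1):
    """Group the positions of data vectors by their cone key."""
    buckets = {}
    for n, v in enumerate(data):
        buckets.setdefault(cone_key(v, k), []).append(n)
    return buckets

# Hypothetical toy data: three 3-D vectors.
data = [
    [0.9, 0.1, -0.2],
    [0.8, -0.3, 0.1],
    [-0.1, 0.2, -0.95],
]
index = build_index(data)

# Classify the query the same way and fetch candidates from its cone only.
query = [1.0, 0.2, 0.0]
candidates = index.get(cone_key(query), [])
```

With `k=1` in dimension d there are 2d cones, one per coordinate and sign; larger `k` gives a finer partition at the cost of sparser buckets.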