340 research outputs found
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to triangular inequality, we also use Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.Comment: PVLDB 11(8):906-919, 201
Fast Cross-Polytope Locality-Sensitive Hashing
We provide a variant of cross-polytope locality sensitive hashing with
respect to angular distance which is provably optimal in asymptotic sensitivity
and enjoys hash computation time. Building on a recent
result (by Andoni, Indyk, Laarhoven, Razenshteyn, Schmidt, 2015), we show that
optimal asymptotic sensitivity for cross-polytope LSH is retained even when the
dense Gaussian matrix is replaced by a fast Johnson-Lindenstrauss transform
followed by discrete pseudo-rotation, reducing the hash computation time from
to . Moreover, our scheme achieves
the optimal rate of convergence for sensitivity. By incorporating a
low-randomness Johnson-Lindenstrauss transform, our scheme can be modified to
require only random bitsComment: 14 pages, 6 figure
Drawbacks and Proposed Solutions for Real-time Processing on Existing State-of-the-art Locality Sensitive Hashing Techniques
Nearest-neighbor query processing is a fundamental operation for many image
retrieval applications. Often, images are stored and represented by
high-dimensional vectors that are generated by feature-extraction algorithms.
Since tree-based index structures are shown to be ineffective for high
dimensional processing due to the well-known "Curse of Dimensionality",
approximate nearest neighbor techniques are used for faster query processing.
Locality Sensitive Hashing (LSH) is a very popular and efficient approximate
nearest neighbor technique that is known for its sublinear query processing
complexity and theoretical guarantees. Nowadays, with the emergence of
technology, several diverse application domains require real-time
high-dimensional data storing and processing capacity. Existing LSH techniques
are not suitable to handle real-time data and queries. In this paper, we
discuss the challenges and drawbacks of existing LSH techniques for processing
real-time high-dimensional image data. Additionally, through experimental
analysis, we propose improvements for existing state-of-the-art LSH techniques
for efficient processing of high-dimensional image data.Comment: Accepted and Presented at the 5th International Conference on Signal
and Image Processing (SIGI-2019), Dubai, UA
Constant Sequence Extension for Fast Search Using Weighted Hamming Distance
Representing visual data using compact binary codes is attracting increasing
attention as binary codes are used as direct indices into hash table(s) for
fast non-exhaustive search. Recent methods show that ranking binary codes using
weighted Hamming distance (WHD) rather than Hamming distance (HD) by generating
query-adaptive weights for each bit can better retrieve query-related items.
However, search using WHD is slower than that using HD. One main challenge is
that the complexity of extending a monotone increasing sequence using WHD to
probe buckets in hash table(s) for existing methods is at least proportional to
the square of the sequence length, while that using HD is proportional to the
sequence length. To overcome this challenge, we propose a novel fast
non-exhaustive search method using WHD. The key idea is to design a constant
sequence extension algorithm to perform each sequence extension in constant
computational complexity and the total complexity is proportional to the
sequence length, which is justified by theoretical analysis. Experimental
results show that our method is faster than other WHD-based search methods.
Also, compared with the HD-based non-exhaustive search method, our method has
comparable efficiency but retrieves more query-related items for the dataset of
up to one billion items
- …