2,556 research outputs found
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
Scalable Techniques for Similarity Search
Document similarity is similar to the nearest neighbour problem and has applications in various domains. In order to determine the similarity / dissimilarity of the documents first they need to be converted into sets containing shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using Jaccard distance between sets and output into a characteristic matrix, the complexity to parse this matrix is significantly high especially when the sets are large. In this project we explore various approaches such as Min hashing, LSH & Bloom Filter to decrease the matrix size and to improve the time complexity. Min hashing creates a signature matrix which significantly smaller compared to a characteristic matrix. In this project we will look into Min-Hashing implementation, pros and cons. Also we will explore Locality Sensitive Hashing, Bloom Filters and their advantages
Bloom Filters and Compact Hash Codes for Efficient and Distributed Image Retrieval
This paper presents a novel method for efficient image retrieval, based on a
simple and effective hashing of CNN features and the use of an indexing
structure based on Bloom filters. These filters are used as gatekeepers for the
database of image features, allowing to avoid to perform a query if the query
features are not stored in the database and speeding up the query process,
without affecting retrieval performance. Thanks to the limited memory
requirements the system is suitable for mobile applications and distributed
databases, associating each filter to a distributed portion of the database.
Experimental validation has been performed on three standard image retrieval
datasets, outperforming state-of-the-art hashing methods in terms of precision,
while the proposed indexing method obtains a speedup
- …