2,876 research outputs found
Scalable Techniques for Similarity Search
Document similarity is similar to the nearest neighbour problem and has applications in various domains. In order to determine the similarity / dissimilarity of the documents first they need to be converted into sets containing shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using Jaccard distance between sets and output into a characteristic matrix, the complexity to parse this matrix is significantly high especially when the sets are large. In this project we explore various approaches such as Min hashing, LSH & Bloom Filter to decrease the matrix size and to improve the time complexity. Min hashing creates a signature matrix which significantly smaller compared to a characteristic matrix. In this project we will look into Min-Hashing implementation, pros and cons. Also we will explore Locality Sensitive Hashing, Bloom Filters and their advantages
Hybrid LSH: Faster Near Neighbors Reporting in High-dimensional Space
We study the -near neighbors reporting problem (-NN), i.e., reporting
\emph{all} points in a high-dimensional point set that lie within a radius
of a given query point . Our approach builds upon on the
locality-sensitive hashing (LSH) framework due to its appealing asymptotic
sublinear query time for near neighbor search problems in high-dimensional
space. A bottleneck of the traditional LSH scheme for solving -NN is that
its performance is sensitive to data and query-dependent parameters. On
datasets whose data distributions have diverse local density patterns, LSH with
inappropriate tuning parameters can sometimes be outperformed by a simple
linear search.
In this paper, we introduce a hybrid search strategy between LSH-based search
and linear search for -NN in high-dimensional space. By integrating an
auxiliary data structure into LSH hash tables, we can efficiently estimate the
computational cost of LSH-based search for a given query regardless of the data
distribution. This means that we are able to choose the appropriate search
strategy between LSH-based search and linear search to achieve better
performance. Moreover, the integrated data structure is time efficient and fits
well with many recent state-of-the-art LSH-based approaches. Our experiments on
real-world datasets show that the hybrid search approach outperforms (or is
comparable to) both LSH-based search and linear search for a wide range of
search radii and data distributions in high-dimensional space.Comment: Accepted as a short paper in EDBT 201
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
Visual Search at eBay
In this paper, we propose a novel end-to-end approach for scalable visual
search infrastructure. We discuss the challenges we faced for a massive
volatile inventory like at eBay and present our solution to overcome those. We
harness the availability of large image collection of eBay listings and
state-of-the-art deep learning techniques to perform visual search at scale.
Supervised approach for optimized search limited to top predicted categories
and also for compact binary signature are key to scale up without compromising
accuracy and precision. Both use a common deep neural network requiring only a
single forward inference. The system architecture is presented with in-depth
discussions of its basic components and optimizations for a trade-off between
search relevance and latency. This solution is currently deployed in a
distributed cloud infrastructure and fuels visual search in eBay ShopBot and
Close5. We show benchmark on ImageNet dataset on which our approach is faster
and more accurate than several unsupervised baselines. We share our learnings
with the hope that visual search becomes a first class citizen for all large
scale search engines rather than an afterthought.Comment: To appear in 23rd SIGKDD Conference on Knowledge Discovery and Data
Mining (KDD), 2017. A demonstration video can be found at
https://youtu.be/iYtjs32vh4
Deep constrained siamese hash coding network and load-balanced locality-sensitive hashing for near duplicate image detection
We construct a new efficient near duplicate image detection method using a hierarchical hash code learning neural network and load-balanced Locality Sensitive Hashing (LSH) indexing. We propose a deep constrained siamese hash coding neural network combined with deep feature learning. Our neural network is able to extract effective features for near duplicate image detection. The extracted features are used to construct a LSH-based index. We propose a load-balanced LSH method to produce load-balanced buckets in the hashing process. The load-balanced LSH significantly reduces the query time. Based on the proposed load-balanced LSH, we design an effective and feasible algorithm for near duplicate image detection. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our deep siamese hash encoding network and load-balanced LSH
- …