Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding the data
items whose distances to a query item are the smallest in a large database.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey of one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs
hash functions without exploring the data distribution, and learning to hash,
which learns hash functions according to the data distribution. We review them
from various aspects, including hash function design, distance measure, and
search scheme in the hash coding space.
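To make the data-independent category concrete, here is a minimal sketch of one well-known locality sensitive hashing family, sign random projections (random hyperplanes) for angular similarity. It is an illustration only and is not taken from the survey: the function names, the 64-bit code length, and the toy vectors are all assumptions chosen for the example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes, independent of any data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Hypothetical toy data: `near` points in almost the same direction as
# `query`, while `far` points in the opposite direction.
planes = make_hyperplanes(num_bits=64, dim=3)
query = [1.0, 0.9, 0.1]
near = [0.9, 1.0, 0.1]
far = [-1.0, -0.9, -0.1]

query_code = hash_code(query, planes)
print("near:", hamming(query_code, hash_code(near, planes)))
print("far: ", hamming(query_code, hash_code(far, planes)))
```

Vectors separated by a small angle disagree on few bits, so ranking database items by Hamming distance to the query code approximates ranking by angle. A learning-to-hash method would instead fit the hyperplanes (or a nonlinear hash function) to the data.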
Latent Structure Preserving Hashing
Aiming at efficient similarity search, hash functions are designed to embed high-dimensional feature descriptors into low-dimensional binary codes such that similar descriptors lead to binary codes with a short distance in the Hamming space. It is critical for a hashing algorithm to maintain the intrinsic structure and preserve the original information of the data. In this paper, we propose a novel hashing algorithm called Latent Structure Preserving Hashing (LSPH), which aims to find a well-structured low-dimensional data representation from the original high-dimensional data through a novel objective function based on Nonnegative Matrix Factorization (NMF), with the corresponding Kullback-Leibler divergence of the data distributions as the regularization term. By exploiting the joint probabilistic distribution of the data, LSPH can automatically learn the latent information and successfully preserve the structure of high-dimensional data. To further achieve robust performance with complex and nonlinear data, we also contribute a more generalized multi-layer LSPH (ML-LSPH) framework, in which hierarchical representations can be effectively learned by a multiplicative up-propagation algorithm. Once the latent representations are obtained, the hash functions can be easily acquired through multi-variable logistic regression. Experimental results on three large-scale retrieval datasets, i.e., SIFT 1M, GIST 1M and 500K TinyImage, show that ML-LSPH achieves better performance than the single-layer LSPH and that both outperform existing hashing techniques on large-scale data.
Coding for Random Projections
The method of random projections has become very popular for large-scale
applications in statistical learning, information retrieval, bio-informatics,
and other areas. Using a well-designed coding scheme for the projected
data, which determines the number of bits needed for each projected value and
how to allocate these bits, can significantly improve the effectiveness of the
algorithm, in storage cost as well as computational speed. In this paper, we
study a number of simple coding schemes, focusing on the task of similarity
estimation and on an application to training linear classifiers. We demonstrate
that uniform quantization outperforms the standard, influential existing method
(Datar et al., 2004). Indeed, we argue that in many cases coding with just a
small number of bits suffices. Furthermore, we also develop a non-uniform 2-bit
coding scheme that generally performs well in practice, as confirmed by our
experiments on training linear support vector machines (SVM).
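As a rough illustration of what coding projected data by uniform quantization can look like, the sketch below projects vectors with a shared Gaussian matrix, maps each projected value to the integer index of a fixed-width bin, and compares two vectors by the fraction of matching codes. This is a sketch under stated assumptions, not the paper's exact scheme or estimator: the bin width of 1.0, the 256 projections, the function names, and the toy vectors are all hypothetical.

```python
import math
import random


def random_projection(vector, num_projections, seed=0):
    """Project onto Gaussian directions; a fixed seed reuses the same
    projection matrix for every vector of the same dimension."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) * x for x in vector)
        for _ in range(num_projections)
    ]


def uniform_quantize(values, bin_width):
    """Code each projected value by the index of its fixed-width bin."""
    return [math.floor(v / bin_width) for v in values]


def collision_rate(codes_a, codes_b):
    """Fraction of projections on which the two codes agree."""
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def encode(vector, num_projections=256, bin_width=1.0):
    return uniform_quantize(random_projection(vector, num_projections), bin_width)


# Hypothetical unit-length toy vectors: `close` is nearly parallel to
# `base`, `orthogonal` is perpendicular to it.
base = [1.0, 0.0]
close = [0.99, 0.14]
orthogonal = [0.0, 1.0]

print("close:     ", collision_rate(encode(base), encode(close)))
print("orthogonal:", collision_rate(encode(base), encode(orthogonal)))
```

The more similar two vectors are, the more often their projected values fall into the same bin, so the collision rate can serve as a proxy for similarity. A narrower bin needs more bits per projected value; the trade-off between bin width and accuracy is the kind of question the abstract refers to.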