Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)
We present the first provably sublinear time algorithm for approximate
\emph{Maximum Inner Product Search} (MIPS). Our proposal is also the first
hashing algorithm for searching with (un-normalized) inner product as the
underlying similarity measure. Finding hashing schemes for MIPS was considered
hard. We formally show that the existing Locality Sensitive Hashing (LSH)
framework is insufficient for solving MIPS, and then we extend the existing LSH
framework to allow asymmetric hashing schemes. Our proposal is based on an
interesting mathematical phenomenon in which inner products, after independent
asymmetric transformations, can be converted into the problem of approximate
near neighbor search. This key observation makes an efficient sublinear hashing
scheme for MIPS possible. Within the extended asymmetric LSH (ALSH) framework, we
provide an explicit construction of a provably fast hashing scheme for MIPS. The
proposed construction and the extended LSH framework could be of independent
theoretical interest. Our proposed algorithm is simple and easy to implement.
We evaluate the method for retrieving inner products in the collaborative
filtering task of item recommendation on the Netflix and MovieLens datasets.
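The asymmetric transformations can be made concrete. The sketch below follows the paper's construction (scale data into the unit ball, unit-normalize queries, append norm powers to data vectors and constants to queries); the particular values of U and m are illustrative choices.

```python
import numpy as np

def alsh_transforms(X, q, m=3, U=0.83):
    """ALSH preprocessing (a sketch).  Data vectors are scaled so all
    norms are at most U < 1; P appends the powers ||x||^2, ||x||^4,
    ..., ||x||^(2^m) to each data vector, while Q appends m constants
    of 1/2 to the unit-normalized query."""
    X = X * (U / np.linalg.norm(X, axis=1).max())   # scale data into the unit ball
    q = q / np.linalg.norm(q)                       # queries are unit-normalized
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    P = np.hstack([X] + [norms ** (2 ** i) for i in range(1, m + 1)])
    Q = np.concatenate([q, np.full(m, 0.5)])
    return P, Q

# Key identity: ||P(x) - Q(q)||^2 = 1 + m/4 - 2<q, x> + ||x||^(2^(m+1)).
# The last term vanishes as m grows, so an ordinary L2 near-neighbor
# search over P(X) approximately maximizes the inner product.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
q = rng.normal(size=8)
P, Q = alsh_transforms(X, q)
best = int(np.argmin(np.linalg.norm(P - Q, axis=1)))
```

After the transforms, any standard L2 LSH scheme applied to P(X) and Q(q) yields the sublinear MIPS guarantee.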
Collaborative Learning for Extremely Low Bit Asymmetric Hashing
Hashing techniques are in great demand for a wide range of real-world
applications such as image retrieval and network compression. Nevertheless,
existing approaches can hardly guarantee satisfactory performance with
extremely low-bit (e.g., 4-bit) hash codes due to severe information loss
and the shrinking of the discrete solution space. In this paper, we propose a
novel \textit{Collaborative Learning} strategy that is tailored for generating
high-quality low-bit hash codes. The core idea is to jointly distill
bit-specific and informative representations for a group of pre-defined code
lengths. The learning of short hash codes among the group can benefit from the
manifold shared with other long codes, where multiple views from different hash
codes provide the supplementary guidance and regularization, making the
convergence faster and more stable. To achieve this, an asymmetric hashing
framework with two variants of multi-head embedding structures is derived,
termed Multi-head Asymmetric Hashing (MAH), leading to great efficiency in
training and querying. Extensive experiments on three benchmark datasets have
been conducted to verify the superiority of the proposed MAH, showing
that the 8-bit hash codes generated by MAH achieve a Mean Average Precision
(MAP) score on the CIFAR-10 dataset which significantly surpasses the
performance of the 48-bit codes of state-of-the-art methods in image
retrieval tasks.
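The abstract does not spell out the embedding structures, but the multi-head idea, one binarizing head per pre-defined code length on top of shared features, can be illustrated with a purely hypothetical sketch (all shapes, the random projections, and the sign() binarization are assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a shared feature extractor feeds one projection
# head per pre-defined code length, so short codes are learned jointly
# with longer ones.  Random projections stand in for trained heads.
feat_dim, code_lengths = 32, (4, 8, 16)
heads = {L: rng.normal(size=(feat_dim, L)) for L in code_lengths}

def encode(features):
    """Return one {-1,+1} code per pre-defined length for each input row."""
    return {L: np.sign(features @ W) for L, W in heads.items()}

codes = encode(rng.normal(size=(10, feat_dim)))
```

In the actual method the heads would be trained jointly so the short codes benefit from the manifold shared with the longer ones; here they only illustrate the multi-head output structure.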
Compositional Coding for Collaborative Filtering
Efficiency is crucial to online recommender systems. Representing users
and items as binary vectors for Collaborative Filtering (CF) can achieve fast
user-item affinity computation in the Hamming space. In recent years, we have
witnessed an emerging research effort in exploiting binary hashing techniques
for CF methods. However, CF with binary codes naturally suffers from low
accuracy due to the limited representation capability of each bit, which impedes
it from modeling the complex structure of the data.
In this work, we attempt to improve the efficiency without hurting the model
performance by utilizing both the accuracy of real-valued vectors and the
efficiency of binary codes to represent users/items. In particular, we propose
the Compositional Coding for Collaborative Filtering (CCCF) framework, which
not only gains better recommendation efficiency than the state-of-the-art
binarized CF approaches but also achieves even higher accuracy than the
real-valued CF method. Specifically, CCCF innovatively represents each
user/item with a set of binary vectors, which are associated with a sparse
real-valued weight vector. Each value of the weight vector encodes the
importance of the corresponding binary vector to the user/item. The continuous
weight vectors greatly enhance the representation capability of binary codes,
and their sparsity guarantees the processing speed. Furthermore, an integer
weight approximation scheme is proposed to further accelerate the speed. Based
on the CCCF framework, we design an efficient discrete optimization algorithm
to learn its parameters. Extensive experiments on three real-world datasets
show that our method outperforms the state-of-the-art binarized CF methods
(even achieves better performance than the real-valued CF method) by a large
margin in terms of both recommendation accuracy and efficiency.
Comment: SIGIR 2019
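The compositional scoring CCCF describes, each user/item as a set of binary vectors tied to sparse real weights, can be sketched as follows; the use of {-1,+1} codes and the Hamming-distance identity are standard assumptions, not details taken from the abstract.

```python
import numpy as np

def cccf_affinity(user_codes, user_w, item_codes, item_w):
    """Weighted sum of pairwise binary inner products, as in CCCF.
    With {-1,+1} codes of length d, each inner product equals
    d - 2 * Hamming distance, which is what keeps this fast; zero
    weights are skipped, reflecting the sparsity of the weight vectors."""
    d = user_codes.shape[1]
    score = 0.0
    for wu, bu in zip(user_w, user_codes):
        if wu == 0.0:
            continue                      # sparsity: skip zero-weight components
        for wi, bi in zip(item_w, item_codes):
            if wi == 0.0:
                continue
            hamming = np.count_nonzero(bu != bi)
            score += wu * wi * (d - 2 * hamming)
    return score

user_codes = np.array([[1, -1] * 8, [-1, 1] * 8])
user_w = np.array([0.9, 0.0])
item_codes = np.array([[1, 1] * 8, [1, -1] * 8])
item_w = np.array([0.5, 0.25])
score = cccf_affinity(user_codes, user_w, item_codes, item_w)
```

Note the score equals the inner product of the weighted sums of the code vectors, so it can match a real-valued model while staying computable with popcount-style Hamming distances.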
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have proven that
quantization-based approaches perform generally better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, which is among the early attempts to leverage deep
neural networks for quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private
subspaces for individual modalities, and representations in the shared subspace
and the private subspaces are learned simultaneously by embedding them to a
reproducing kernel Hilbert space, where the mean embedding of different
modality distributions can be explicitly compared. In addition, in the shared
subspace, a quantizer is learned to produce semantics-preserving compact
codes with the help of label alignment. Thanks to this novel network
architecture in cooperation with supervised quantization training, SPDQ can
preserve intramodal and intermodal similarities as much as possible and greatly
reduce quantization error. Experiments on two popular benchmarks corroborate
that our approach outperforms state-of-the-art methods.
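SPDQ's quantizer is trained jointly with the deep network and label alignment, which the abstract does not detail; the generic codebook quantizer below (plain Lloyd iterations) only illustrates what "a quantizer producing compact codes" means, and is a stand-in rather than the paper's method.

```python
import numpy as np

def learn_quantizer(X, n_codewords=8, iters=20, seed=0):
    """Plain Lloyd-style codebook learning: each d-dimensional vector
    is replaced by the index of its nearest codeword, compressing it
    to a small integer code at the cost of some quantization error."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), n_codewords, replace=False)]    # init from data
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared distances
        assign = d2.argmin(1)                                # nearest codeword
        for k in range(n_codewords):
            if (assign == k).any():
                C[k] = X[assign == k].mean(0)                # recenter codeword
    return C, assign

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
C, assign = learn_quantizer(X)
```

Storing `assign` (3 bits per vector here) instead of the float vectors is the source of the compactness; search then compares queries against the small codebook.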
Efficient Similarity Search in Dynamic Data Streams
The Jaccard index is an important similarity measure for item sets and
Boolean data. On large datasets, an exact similarity computation is often
infeasible for all item pairs both due to time and space constraints, giving
rise to faster approximate methods. The algorithm of choice used to quickly
compute the Jaccard index of two item sets is usually a form of min-hashing.
Most min-hashing schemes are maintainable in data streams processing only
additions, but none
are known to work when facing item-wise deletions. In this paper, we
investigate scalable approximation algorithms for rational set similarities, a
broad class of similarity measures including Jaccard. Motivated by a result of
Chierichetti and Kumar [J. ACM 2015], who showed that any rational set
similarity admits a locality sensitive hashing (LSH) scheme if and only if the
corresponding distance is a metric, we show that there exists a space-efficient
summary maintaining a multiplicative approximation to that distance in dynamic
data streams. This in turn also yields an additive approximation of the
similarity. The existence of these approximations hints at, but does not
directly imply, an LSH scheme in dynamic data streams. Our second and main
contribution lies in the design of such an LSH scheme maintainable in dynamic
data streams. The scheme is space efficient, easy to implement, and, to the
best of our knowledge, the first of its kind able to process deletions.
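For reference, the insertion-only min-hashing the paper generalizes can be sketched in a few lines; the affine hash family and the parameter choices are standard illustrative defaults, not taken from the paper.

```python
import numpy as np

def minhash_signature(items, n_hashes=200, prime=2_147_483_647, seed=0):
    """Each of n_hashes random affine functions maps item ids into
    [0, prime); the signature keeps the minimum per function.  The
    fraction of agreeing signature positions between two sets is an
    estimate of their Jaccard index."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, n_hashes)
    b = rng.integers(0, prime, n_hashes)
    x = np.fromiter(items, dtype=np.int64)
    return ((a[:, None] * x[None, :] + b[:, None]) % prime).min(axis=1)

A = set(range(0, 80))
B = set(range(40, 120))                # true Jaccard index: 40/120 = 1/3
sa, sb = minhash_signature(A), minhash_signature(B)
estimate = float(np.mean(sa == sb))    # close to 1/3
```

Adding an item can only lower each signature entry, which is why additions are easy to maintain and item-wise deletions are not: a deletion may invalidate a stored minimum with no way to recover the runner-up.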
Privacy-preserving Targeted Advertising
Recommendation systems form the centerpiece of a rapidly growing
trillion-dollar online advertisement industry. Even with numerous optimizations and
approximations, collaborative filtering (CF) based approaches require real-time
computations involving very large vectors. Curating and storing such related
profile information vectors on web portals seriously breaches the user's
privacy. Modifying such systems to achieve private recommendations further
requires communication of long encrypted vectors, making the whole process
inefficient. We present a more efficient recommendation system alternative, in
which user profiles are maintained entirely on their device, and appropriate
recommendations are fetched from web portals in an efficient privacy-preserving
manner. We base this approach on association rules.
Comment: A preliminary version was presented at the 11th INFORMS Workshop on
Data Mining and Decision Analytics (2016).
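A minimal sketch of how association-rule matching can keep the profile on-device: the portal ships generic rules, and the device fires only those whose antecedents the local profile satisfies. The rule format and contents below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical rule format: (antecedent item set, recommended item, confidence).
rules = [
    ({"laptop"}, "mouse", 0.70),
    ({"laptop", "mouse"}, "keyboard", 0.55),
    ({"phone"}, "case", 0.80),
]

def recommend(profile, rules, top_k=2):
    """Fire only rules whose antecedents the locally stored profile
    already satisfies; nothing about the profile leaves the device."""
    hits = [(conf, item) for ante, item, conf in rules
            if ante <= profile and item not in profile]
    return [item for _, item in sorted(hits, reverse=True)[:top_k]]

suggestions = recommend({"laptop"}, rules)   # -> ["mouse"]
```

Because the rules are the same for every user, downloading them reveals nothing user-specific, in contrast to CF systems that must evaluate long profile vectors server-side.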
A Survey on Learning to Hash
Nearest neighbor search is the problem of finding the data points in a
database whose distances to the query point are the smallest.
Learning to hash is one of the major solutions to this problem and has been
widely studied recently. In this paper, we present a comprehensive survey of
the learning to hash algorithms, categorize them according to how they
preserve similarity into pairwise similarity preserving, multiwise
similarity preserving, implicit similarity preserving, and quantization,
and discuss their relations. We treat quantization separately from pairwise
similarity preserving because the objective functions are very different,
although quantization, as we show, can be derived from preserving the pairwise
similarities. In addition, we present the evaluation protocols and a general
performance analysis, and point out that the quantization algorithms perform
superiorly in terms of search accuracy, search time cost, and space cost.
Finally, we introduce a few
emerging topics.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI).
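As a contrast to the learned methods the survey categorizes, the classical data-independent baseline, random-hyperplane (SimHash) codes searched by Hamming distance, fits in a few lines; the dimensions and noise level below are illustrative.

```python
import numpy as np

def simhash_codes(X, n_bits=32, seed=0):
    """Each bit is the sign of a random projection; Hamming distance
    between codes then tracks the angle between the original vectors."""
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(X.shape[1], n_bits))   # random hyperplanes
    return (X @ H > 0).astype(np.uint8)

rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 64))
query = base[7] + 0.05 * rng.normal(size=64)    # near-duplicate of item 7
codes = simhash_codes(base)
qcode = simhash_codes(query[None, :])[0]
nearest = int(np.argmin((codes != qcode).sum(axis=1)))
```

Learning to hash replaces the random hyperplanes with data- or label-driven hash functions, which is where the accuracy gains the survey analyzes come from.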
Self-Taught Hashing for Fast Similarity Search
The ability of fast similarity search at large scale is of great importance
to many Information Retrieval (IR) applications. A promising way to accelerate
similarity search is semantic hashing which designs compact binary codes for a
large number of documents so that semantically similar documents are mapped to
similar codes (within a short Hamming distance). Although some recently
proposed techniques are able to generate high-quality codes for documents known
in advance, obtaining the codes for previously unseen documents remains a
very challenging problem. In this paper, we emphasise this issue and propose a
novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the
optimal l-bit binary codes for all documents in the given corpus via
unsupervised learning, and then train l classifiers via supervised learning
to predict the l-bit code for any query document unseen before. Our
experiments on three real-world text datasets show that the proposed approach
using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine
(SVM) significantly outperforms state-of-the-art techniques.
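The two STH stages can be sketched as follows. A least-squares linear predictor stands in for the paper's per-bit linear SVM to keep the sketch dependency-free, and the k-NN graph construction is a simplified assumption.

```python
import numpy as np

def self_taught_hashing(X, n_bits=4, n_neighbors=5):
    """Stage 1 (unsupervised): binarise the low eigenvectors of the
    k-NN graph Laplacian (LapEig) at their medians to code the corpus.
    Stage 2 (supervised): fit one linear predictor per bit so unseen
    queries can be coded (least squares stands in for the SVM)."""
    n = len(X)
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[1:n_neighbors + 1]] = 1.0   # k nearest, excluding self
    W = np.maximum(W, W.T)                                  # symmetrise the graph
    L = np.diag(W.sum(1)) - W                               # unnormalised Laplacian
    _, vecs = np.linalg.eigh(L)
    Y = vecs[:, 1:n_bits + 1]                               # skip the trivial eigenvector
    codes = (Y > np.median(Y, axis=0)).astype(int)          # median-threshold binarisation
    Xb = np.hstack([X, np.ones((n, 1))])                    # bias column
    Wout = np.linalg.lstsq(Xb, 2 * codes - 1, rcond=None)[0]
    predict = lambda q: (np.hstack([q, [1.0]]) @ Wout > 0).astype(int)
    return codes, predict

rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(size=(15, 3)) - 3, rng.normal(size=(15, 3)) + 3])
codes, predict = self_taught_hashing(docs)
query_code = predict(docs[0])
```

The median threshold keeps each bit roughly balanced across the corpus, which is what makes the learned bits informative for Hamming-distance ranking.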
Conditional Restricted Boltzmann Machines for Structured Output Prediction
Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic
models that have recently been applied to a wide range of problems, including
collaborative filtering, classification, and modeling motion capture data.
While much progress has been made in training non-conditional RBMs, these
algorithms are not applicable to conditional models and there has been almost
no work on training and generating predictions from conditional RBMs for
structured output problems. We first argue that standard Contrastive
Divergence-based learning may not be suitable for training CRBMs. We then
identify two distinct types of structured output prediction problems and
propose an improved learning algorithm for each. The first problem type is one
where the output space has arbitrary structure but the set of likely output
configurations is relatively small, such as in multi-label classification. The
second problem is one where the output space is arbitrarily structured but
where the output space variability is much greater, such as in image denoising
or pixel labeling. We show that the new learning algorithms can work much
better than Contrastive Divergence on both types of problems.
Data-Parallel Hashing Techniques for GPU Architectures
Hash tables are one of the most fundamental data structures for effectively
storing and accessing sparse data, with widespread usage in domains ranging
from computer graphics to machine learning. This study surveys the
state-of-the-art research on data-parallel hashing techniques for emerging
massively-parallel, many-core GPU architectures. Key factors affecting the
performance of different hashing schemes are discovered and used to suggest
best practices and pinpoint areas for further research.
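As one concrete example of the schemes such surveys cover, cuckoo hashing is popular in GPU hash tables because every lookup probes a constant number of slots. The sequential Python sketch below shows only the probing and eviction logic; a real data-parallel version would express insertion with atomic exchanges.

```python
import random

class CuckooHashTable:
    """Two-choice cuckoo hashing sketch: each key has exactly two
    candidate slots, so get() is worst-case constant time; insert()
    evicts occupants along a bounded chain when both slots are full."""

    def __init__(self, capacity=64, max_kicks=50):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.max_kicks = max_kicks
        self._rng = random.Random(0)   # deterministic kick choice for the sketch

    def _h(self, key, i):
        return hash((i, key)) % self.capacity

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(self.max_kicks):
            for i in (0, 1):
                j = self._h(entry[0], i)
                if self.slots[j] is None:
                    self.slots[j] = entry
                    return True
            # both candidate slots occupied: evict one occupant, retry with it
            j = self._h(entry[0], self._rng.choice((0, 1)))
            self.slots[j], entry = entry, self.slots[j]
        return False   # chain too long; a real table would rehash or grow

    def get(self, key):
        for i in (0, 1):
            e = self.slots[self._h(key, i)]
            if e is not None and e[0] == key:
                return e[1]
        return None
```

The fixed two-probe lookup is what maps well onto GPUs: threads in a warp issue the same bounded number of memory accesses with no data-dependent loops.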