7 research outputs found
Sign-Full Random Projections
The method of 1-bit ("sign-sign") random projections has been a popular tool
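for efficient search and machine learning on large datasets. Given two $D$-dim
data vectors $u$, $v \in \mathbb{R}^D$, one can generate $x = \sum_{i=1}^D u_i r_i$ and $y = \sum_{i=1}^D v_i r_i$, where $r_i \sim N(0,1)$ iid. The
"collision probability" is $\Pr(\mathrm{sgn}(x) = \mathrm{sgn}(y)) = 1 - \frac{\cos^{-1}\rho}{\pi}$, where $\rho = \rho(u,v)$ is the cosine
similarity.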
We develop "sign-full" random projections by estimating $\rho$ from (e.g.,)
the expectation $E(\mathrm{sgn}(x)\,y) = \sqrt{\frac{2}{\pi}}\,\rho$, which can be further
substantially improved by normalizing $y$. For nonnegative data, we recommend
an interesting estimator based on $E\left(y_- 1_{x \geq 0} + y_+ 1_{x < 0}\right)$
and its normalized version. The recommended estimator almost matches the
accuracy of the (computationally expensive) maximum likelihood estimator. At
high similarity ($\rho \to 1$), the asymptotic variance of the recommended
estimator is only $\frac{4}{3\pi} \approx 0.4$ of that of the estimator for sign-sign
projections. At small $k$ and high similarity, the improvement would be even
more substantial.
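As a rough illustration of the two basic estimators mentioned in this abstract, the following sketch assumes unit-norm data vectors, iid standard Gaussian projections, and the standard identities Pr(sgn(x) = sgn(y)) = 1 - arccos(rho)/pi and E[sgn(x) y] = sqrt(2/pi) rho. The function name `estimate_cosine` and its parameters are hypothetical, and the normalized and nonnegative-data estimators the abstract recommends are not implemented here.

```python
import numpy as np

def estimate_cosine(u, v, k, seed=0):
    """Estimate cosine similarity from k Gaussian random projections.

    Returns (sign-sign estimate, basic sign-full estimate).
    Illustrative only; not the paper's recommended estimator.
    """
    rng = np.random.default_rng(seed)
    # Assumption: normalize inputs so that x, y ~ N(0, 1) marginally.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    R = rng.standard_normal((k, u.size))   # r_i ~ N(0, 1) iid
    x, y = R @ u, R @ v

    # Sign-sign: invert the collision probability 1 - arccos(rho)/pi.
    p_hat = np.mean(np.sign(x) == np.sign(y))
    rho_sign_sign = np.cos(np.pi * (1.0 - p_hat))

    # Sign-full: keep only sgn(x) but the full value of y,
    # using E[sgn(x) y] = sqrt(2/pi) * rho for unit-norm vectors.
    rho_sign_full = np.sqrt(np.pi / 2.0) * np.mean(np.sign(x) * y)
    return rho_sign_sign, rho_sign_full

rng = np.random.default_rng(1)
u = rng.standard_normal(64)
v = u + 0.7 * rng.standard_normal(64)
true_rho = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(true_rho, estimate_cosine(u, v, k=20_000))
```

Both estimates should approach the true cosine as `k` grows; comparing their spread over several seeds gives a feel for the variance differences the abstract discusses.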
Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising
Advances in communication technology and the popularity of smartphones have
fueled the boom in video ads. Baidu, as one of the leading
search engine companies in the world, receives billions of search queries per
day. Pairing video ads with user searches is the core task of Baidu
video advertising. Due to the modality gap, query-to-video retrieval is
much more challenging than traditional query-to-document retrieval and
image-to-image search. Traditionally, query-to-video retrieval is tackled
by query-to-title retrieval, which is not reliable when the quality of
titles is not high. With the rapid progress achieved in computer vision and
natural language processing in recent years, content-based search methods
have become promising for query-to-video retrieval. Benefiting from pretraining
on large-scale datasets, some visionBERT methods based on cross-modal attention
have achieved excellent performance in many vision-language tasks not only in
academia but also in industry. Nevertheless, the expensive computation cost of
cross-modal attention makes it impractical for large-scale search in industrial
applications. In this work, we present a tree-based combo-attention network
(TCAN) which has been recently launched in Baidu's dynamic video advertising
platform. It provides a practical solution to deploy the heavy cross-modal
attention for large-scale query-to-video search. After launching the tree-based
combo-attention network, click-through rate improved by 2.29% and
conversion rate improved by 2.63%.
Comment: This revision is based on a manuscript submitted in October 2020, to
ICDE 2021. We thank the Program Committee for their valuable comments.
CoopHash: Cooperative Learning of Multipurpose Descriptor and Contrastive Pair Generator via Variational MCMC Teaching for Supervised Image Hashing
Leveraging supervised information can lead to superior retrieval performance
in the image hashing domain but the performance degrades significantly without
enough labeled data. One effective solution to boost the performance is to
employ generative models, such as Generative Adversarial Networks (GANs), to
generate synthetic data in an image hashing model. However, GAN-based methods
are difficult to train and suffer from mode collapse, which prevents the
hashing approaches from jointly training the generative models and the hash
functions. This limitation results in sub-optimal retrieval performance. To
overcome this limitation, we propose a novel framework, the generative
cooperative hashing network (CoopHash), which is based on energy-based
cooperative learning. CoopHash jointly learns a powerful generative
representation of the data and a robust hash function. CoopHash has two
components: a top-down contrastive pair generator that synthesizes contrastive
images and a bottom-up multipurpose descriptor that simultaneously represents
the images from multiple perspectives, including probability density, hash
code, latent code, and category. The two components are jointly learned via a
novel likelihood-based cooperative learning scheme. We conduct experiments on
several real-world datasets and show that the proposed method outperforms
competing supervised hashing methods, achieving up to 10% relative improvement
over the current state-of-the-art supervised hashing methods, and exhibits
significantly better performance in out-of-distribution retrieval.
Constrained Approximate Similarity Search on Proximity Graph
Search engines and recommendation systems are built to efficiently display
relevant information from massive numbers of candidates. Typically, a
three-stage mechanism is employed in those systems: (i) a small collection of
items is first retrieved by (e.g.,) approximate near neighbor search
algorithms; (ii) then a collection of constraints are applied on the retrieved
items; (iii) a fine-grained ranking neural network is employed to determine the
final recommendation. We observe a major defect of the original three-stage
pipeline: although we only aim to retrieve $k$ vectors in the final
recommendation, we have to preset a sufficiently large $s$ ($s > k$) for each
query, and "hope" the number of vectors surviving the filtering is not
smaller than $k$. That is, at least $k$ vectors in the $s$ similar candidates
satisfy the query constraints.
In this paper, we investigate this constrained similarity search problem and
attempt to merge the similarity search stage and the filtering stage into one
single search operation. We introduce AIRSHIP, a system that integrates a
user-defined function filtering into the similarity search framework. The
proposed system does not need to build extra indices nor require prior
knowledge of the query constraints. We propose three optimization strategies:
(1) starting point selection, (2) multi-direction search, and (3) biased
priority queue selection. Experimental evaluations on both synthetic and real
data confirm the effectiveness of the proposed AIRSHIP algorithm. We focus on
constrained graph-based approximate near neighbor (ANN) search in this study,
in part because graph-based ANN is known to achieve excellent performance. We
believe it is also possible to develop constrained hashing-based ANN or
constrained quantization-based ANN.
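The defect described in this abstract can be illustrated with a small brute-force sketch. This is not AIRSHIP, which works on a proximity graph; exhaustive distance sorting stands in for the ANN stage, and the function names, the label-based constraint, and the toy data are all hypothetical.

```python
import numpy as np

def retrieve_then_filter(data, labels, query, allowed, s, k):
    """Two separate stages: take the s nearest items, then filter.

    May return fewer than k items if too few of the s candidates
    satisfy the constraint.
    """
    dist = np.linalg.norm(data - query, axis=1)
    top_s = np.argsort(dist)[:s]
    survivors = [int(i) for i in top_s if labels[i] in allowed]
    return survivors[:k]

def constrained_search(data, labels, query, allowed, k):
    """Single merged stage: apply the constraint while scanning by distance."""
    dist = np.linalg.norm(data - query, axis=1)
    result = []
    for i in np.argsort(dist):
        if labels[i] in allowed:
            result.append(int(i))
            if len(result) == k:
                break
    return result

# Toy data: points 0..99 on a line; only points >= 50 satisfy the constraint.
data = np.arange(100, dtype=float).reshape(-1, 1)
labels = np.array([1 if i >= 50 else 0 for i in range(100)])
query = np.array([0.0])

print(retrieve_then_filter(data, labels, query, {1}, s=20, k=5))  # no survivors
print(constrained_search(data, labels, query, {1}, k=5))
```

With `s=20`, none of the retrieved candidates pass the filter, so the first approach returns nothing even though valid items exist; the merged search keeps going until `k` valid items are found.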
Breaking the waves: asymmetric random periodic features for low-bitrate kernel machines
Many signal processing and machine learning applications are built from
evaluating a kernel on pairs of signals, e.g. to assess the similarity of an
incoming query to a database of known signals. This nonlinear evaluation can be
simplified to a linear inner product of the random Fourier features of those
signals: random projections followed by a periodic map, the complex
exponential. It is known that a simple quantization of those features
(corresponding to replacing the complex exponential by a different periodic map
that takes binary values, which is appealing for their transmission and
storage), distorts the approximated kernel, which may be undesirable in
practice. Our take-home message is that when the features of only one of the
two signals are quantized, the original kernel is recovered without distortion;
its practical interest appears in several cases where the kernel evaluations
are asymmetric by nature, such as a client-server scheme. Concretely, we
introduce the general framework of asymmetric random periodic features, where
the two signals of interest are observed through random periodic features:
random projections followed by a general periodic map, which is allowed to
differ between the two signals. We derive the influence of those periodic maps on
the approximated kernel, and prove uniform probabilistic error bounds holding
for all signal pairs from an infinite low-complexity set. Interestingly, our
results allow the periodic maps to be discontinuous, thanks to a new
mathematical tool, i.e. the mean Lipschitz smoothness. We then apply this
generic framework to semi-quantized kernel machines (where only one signal has
quantized features and the other has classical random Fourier features), for
which we show theoretically that the approximated kernel remains unchanged
(with the associated error bound), and confirm the power of the approach with
numerical simulations.
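A small numerical sketch of the semi-quantized idea, under assumptions not stated in the abstract: a Gaussian kernel, cosine features with a uniform random dither, and a one-bit quantizer sign(cos(.)) as the binary periodic map. The rescaling constants (2 for the unquantized pair, pi/2 for the semi-quantized pair) follow from the Fourier series of the square wave; all names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_estimates(x, y, m=100_000, sigma=1.0, seed=0):
    """Approximate the Gaussian kernel with random periodic features.

    Returns (full, semi, both):
      full: both signals use cos features (classical random Fourier features)
      semi: only x is quantized to one bit, y keeps cos features
      both: both signals quantized (expected to distort the kernel)
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((m, x.size)) / sigma   # random projections
    xi = rng.uniform(0.0, 2.0 * np.pi, size=m)         # random dither
    cx = np.cos(omega @ x + xi)
    cy = np.cos(omega @ y + xi)
    qx, qy = np.sign(cx), np.sign(cy)                  # one-bit periodic map

    full = 2.0 * np.mean(cx * cy)
    # Only the first harmonic of the square wave correlates with cos,
    # with amplitude 4/pi, hence the pi/2 rescaling.
    semi = (np.pi / 2.0) * np.mean(qx * cy)
    both = np.mean(qx * qy)
    return full, semi, both

x = np.array([0.3, -0.2, 0.5])
y = np.array([0.1, 0.4, 0.0])
print(gaussian_kernel(x, y), kernel_estimates(x, y))
```

In this setup the `full` and `semi` values should both land close to the exact kernel, while `both` generally does not, which mirrors the abstract's take-home message.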
Differential Privacy with Random Projections and Sign Random Projections
In this paper, we develop a series of differential privacy (DP) algorithms
from a family of random projections (RP), for general applications in machine
learning, data mining, and information retrieval. Among the presented
algorithms, iDP-SignRP is remarkably effective under the setting of
"individual differential privacy" (iDP), based on sign random projections
(SignRP). Also, DP-SignOPORP considerably improves existing algorithms
in the literature under the standard DP setting, using "one permutation + one
random projection" (OPORP), where OPORP is a variant of the celebrated
count-sketch method with fixed-length binning and normalization. Without taking
signs, among the DP-RP family, DP-OPORP achieves the best performance.
The concept of iDP (individual differential privacy) is defined only on a
particular dataset of interest. While iDP is not strictly DP, iDP might be
useful in certain applications, such as releasing a dataset (including sharing
embeddings across companies or countries). In our study, we find that
iDP-SignRP is remarkably effective for search and machine learning
applications, in that the utilities are exceptionally good even at a very small
privacy parameter $\epsilon$ (e.g., $\epsilon < 0.5$).
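The description of OPORP as "one permutation + one random projection" with fixed-length binning and normalization can be sketched roughly as follows. This is an illustrative reading of that one sentence, not the authors' implementation: the random vector is assumed to have Rademacher (+1/-1) entries, the dimension is assumed divisible by the number of bins, and no privacy noise or sign-taking step is included. The function name `oporp` is hypothetical.

```python
import numpy as np

def oporp(x, k, perm, signs):
    """Permute, multiply by one random vector, sum within k equal-length bins,
    then normalize the k-dim sketch to unit length."""
    permuted = x[perm] * signs                 # one permutation, one random vector
    bins = permuted.reshape(k, -1).sum(axis=1) # fixed-length binning
    return bins / np.linalg.norm(bins)         # normalization

rng = np.random.default_rng(0)
D, k = 8192, 512                               # assumes D is divisible by k
perm = rng.permutation(D)                      # shared by all data vectors
signs = rng.choice([-1.0, 1.0], size=D)        # assumed Rademacher entries

u = rng.standard_normal(D)
v = u + 0.5 * rng.standard_normal(D)
true_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
est_cos = oporp(u, k, perm, signs) @ oporp(v, k, perm, signs)
print(true_cos, est_cos)
```

The inner product of two normalized sketches serves as a cosine estimate; a DP variant would add a privatization step on top of this, which is deliberately left out here.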