1,674 research outputs found
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.Comment: 20 pages, 11 figures, 9 table
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have proven that
quantization-based approaches perform generally better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, which is among the early attempts of leveraging deep
neural networks into quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private
subspaces for individual modalities, and representations in the shared subspace
and the private subspaces are learned simultaneously by embedding them to a
reproducing kernel Hilbert space, where the mean embedding of different
modality distributions can be explicitly compared. In addition, in the shared
subspace, a quantizer is learned to produce the semantics preserving compact
codes with the help of label alignment. Thanks to this novel network
architecture in cooperation with supervised quantization training, SPDQ can
preserve intramodal and intermodal similarities as much as possible and greatly
reduce quantization error. Experiments on two popular benchmarks corroborate
that our approach outperforms state-of-the-art methods
Learning Discriminative Representations for Semantic Cross Media Retrieval
Heterogeneous gap among different modalities emerges as one of the critical
issues in modern AI problems. Unlike traditional uni-modal cases, where raw
features are extracted and directly measured, the heterogeneous nature of cross
modal tasks requires the intrinsic semantic representation to be compared in a
unified framework. This paper studies the learning of different representations
that can be retrieved across different modality contents. A novel approach for
mining cross-modal representations is proposed by incorporating explicit linear
semantic projecting in Hilbert space. The insight is that the discriminative
structures of different modality data can be linearly represented in
appropriate high dimension Hilbert spaces, where linear operations can be used
to approximate nonlinear decisions in the original spaces. As a result, an
efficient linear semantic down mapping is jointly learned for multimodal data,
leading to a common space where they can be compared. The mechanism of "feature
up-lifting and down-projecting" works seamlessly as a whole, which accomplishes
crossmodal retrieval tasks very well. The proposed method, named as shared
discriminative semantic representation learning (\textbf{SDSRL}), is tested on
two public multimodal dataset for both within- and inter- modal retrieval. The
experiments demonstrate that it outperforms several state-of-the-art methods in
most scenarios
Triplet-Based Deep Hashing Network for Cross-Modal Retrieval
Given the benefits of its low storage requirements and high retrieval
efficiency, hashing has recently received increasing attention. In
particular,cross-modal hashing has been widely and successfully used in
multimedia similarity search applications. However, almost all existing methods
employing cross-modal hashing cannot obtain powerful hash codes due to their
ignoring the relative similarity between heterogeneous data that contains
richer semantic information, leading to unsatisfactory retrieval performance.
In this paper, we propose a triplet-based deep hashing (TDH) network for
cross-modal retrieval. First, we utilize the triplet labels, which describes
the relative relationships among three instances as supervision in order to
capture more general semantic correlations between cross-modal instances. We
then establish a loss function from the inter-modal view and the intra-modal
view to boost the discriminative abilities of the hash codes. Finally, graph
regularization is introduced into our proposed TDH method to preserve the
original semantic similarity between hash codes in Hamming space. Experimental
results show that our proposed method outperforms several state-of-the-art
approaches on two popular cross-modal datasets
Fusion-supervised Deep Cross-modal Hashing
Deep hashing has recently received attention in cross-modal retrieval for its
impressive advantages. However, existing hashing methods for cross-modal
retrieval cannot fully capture the heterogeneous multi-modal correlation and
exploit the semantic information. In this paper, we propose a novel
\emph{Fusion-supervised Deep Cross-modal Hashing} (FDCH) approach. Firstly,
FDCH learns unified binary codes through a fusion hash network with paired
samples as input, which effectively enhances the modeling of the correlation of
heterogeneous multi-modal data. Then, these high-quality unified hash codes
further supervise the training of the modality-specific hash networks for
encoding out-of-sample queries. Meanwhile, both pair-wise similarity
information and classification information are embedded in the hash networks
under one stream framework, which simultaneously preserves cross-modal
similarity and keeps semantic consistency. Experimental results on two
benchmark datasets demonstrate the state-of-the-art performance of FDCH
Discriminative Supervised Hashing for Cross-Modal similarity Search
With the advantage of low storage cost and high retrieval efficiency, hashing
techniques have recently been an emerging topic in cross-modal similarity
search. As multiple modal data reflect similar semantic content, many
researches aim at learning unified binary codes. However, discriminative
hashing features learned by these methods are not adequate. This results in
lower accuracy and robustness. We propose a novel hashing learning framework
which jointly performs classifier learning, subspace learning and matrix
factorization to preserve class-specific semantic content, termed
Discriminative Supervised Hashing (DSH), to learn the discrimative unified
binary codes for multi-modal data. Besides, reducing the loss of information
and preserving the non-linear structure of data, DSH non-linearly projects
different modalities into the common space in which the similarity among
heterogeneous data points can be measured. Extensive experiments conducted on
the three publicly available datasets demonstrate that the framework proposed
in this paper outperforms several state-of -the-art methods.Comment: 7 pages,3 figures,4 tables;The paper is under consideration at Image
and Vision Computin
HashGAN:Attention-aware Deep Adversarial Hashing for Cross Modal Retrieval
As the rapid growth of multi-modal data, hashing methods for cross-modal
retrieval have received considerable attention. Deep-networks-based cross-modal
hashing methods are appealing as they can integrate feature learning and hash
coding into end-to-end trainable frameworks. However, it is still challenging
to find content similarities between different modalities of data due to the
heterogeneity gap. To further address this problem, we propose an adversarial
hashing network with attention mechanism to enhance the measurement of content
similarities by selectively focusing on informative parts of multi-modal data.
The proposed new adversarial network, HashGAN, consists of three building
blocks: 1) the feature learning module to obtain feature representations, 2)
the generative attention module to generate an attention mask, which is used to
obtain the attended (foreground) and the unattended (background) feature
representations, 3) the discriminative hash coding module to learn hash
functions that preserve the similarities between different modalities. In our
framework, the generative module and the discriminative module are trained in
an adversarial way: the generator is learned to make the discriminator cannot
preserve the similarities of multi-modal data w.r.t. the background feature
representations, while the discriminator aims to preserve the similarities of
multi-modal data w.r.t. both the foreground and the background feature
representations. Extensive evaluations on several benchmark datasets
demonstrate that the proposed HashGAN brings substantial improvements over
other state-of-the-art cross-modal hashing methods.Comment: 10 pages, 8 figures, 3 table
Attribute-Guided Network for Cross-Modal Zero-Shot Hashing
Zero-Shot Hashing aims at learning a hashing model that is trained only by
instances from seen categories but can generate well to those of unseen
categories. Typically, it is achieved by utilizing a semantic embedding space
to transfer knowledge from seen domain to unseen domain. Existing efforts
mainly focus on single-modal retrieval task, especially Image-Based Image
Retrieval (IBIR). However, as a highlighted research topic in the field of
hashing, cross-modal retrieval is more common in real world applications. To
address the Cross-Modal Zero-Shot Hashing (CMZSH) retrieval task, we propose a
novel Attribute-Guided Network (AgNet), which can perform not only IBIR, but
also Text-Based Image Retrieval (TBIR). In particular, AgNet aligns different
modal data into a semantically rich attribute space, which bridges the gap
caused by modality heterogeneity and zero-shot setting. We also design an
effective strategy that exploits the attribute to guide the generation of hash
codes for image and text within the same network. Extensive experimental
results on three benchmark datasets (AwA, SUN, and ImageNet) demonstrate the
superiority of AgNet on both cross-modal and single-modal zero-shot image
retrieval tasks.Comment: 9 pages, 8 figure
Unsupervised Multi-modal Hashing for Cross-modal retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing learning method to cope with this open
problem to directly preserve the manifold structure by hashing. To address this
problem, both the semantic correlation in textual space and the locally
geometric structure in the visual space are explored simultaneously in our
framework. Besides, the `2;1-norm constraint is imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on the three publicly
available datasets and the experimental results show that our method can
achieve superior performance over the state-of-the-art methods.Comment: 4 pages, 4 figure
A New Evaluation Protocol and Benchmarking Results for Extendable Cross-media Retrieval
This paper proposes a new evaluation protocol for cross-media retrieval which
better fits the real-word applications. Both image-text and text-image
retrieval modes are considered. Traditionally, class labels in the training and
testing sets are identical. That is, it is usually assumed that the query falls
into some pre-defined classes. However, in practice, the content of a query
image/text may vary extensively, and the retrieval system does not necessarily
know in advance the class label of a query. Considering the inconsistency
between the real-world applications and laboratory assumptions, we think that
the existing protocol that works under identical train/test classes can be
modified and improved.
This work is dedicated to addressing this problem by considering the protocol
under an extendable scenario, \ie, the training and testing classes do not
overlap. We provide extensive benchmarking results obtained by the existing
protocol and the proposed new protocol on several commonly used datasets. We
demonstrate a noticeable performance drop when the testing classes are unseen
during training. Additionally, a trivial solution, \ie, directly using the
predicted class label for cross-media retrieval, is tested. We show that the
trivial solution is very competitive in traditional non-extendable retrieval,
but becomes less so under the new settings. The train/test split, evaluation
code, and benchmarking results are publicly available on our website.Comment: 10 pages, 9 figure
- …