An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, the
requirements of users are highly flexible, such as retrieving the relevant
audio clips with a single image query. Challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them focus on algorithm design rather than on the time-consuming
reproduction of compared methods and results. We have also constructed a new
dataset XMedia, which is the first publicly available dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.
Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology
Unsupervised Multi-modal Hashing for Cross-modal retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing method that addresses the open problem
of directly preserving the manifold structure through hashing. To address this
problem, both the semantic correlation in textual space and the locally
geometric structure in the visual space are explored simultaneously in our
framework. Besides, the $\ell_{2,1}$-norm constraint is imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on the three publicly
available datasets and the experimental results show that our method can
achieve superior performance over the state-of-the-art methods.
Comment: 4 pages, 4 figures
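As a small illustration of the $\ell_{2,1}$-norm named in the abstract above: under the common row-wise convention it sums the Euclidean norms of a matrix's rows, so penalizing it tends to drive whole rows to zero. The sketch below assumes that convention; the function name `l21_norm` and the sample matrix are hypothetical and not taken from the paper.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean (l2) norms of the rows of a matrix."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)

# Hypothetical 3x2 projection matrix; the all-zero row contributes nothing.
W = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
print(l21_norm(W))  # row norms are 5, 0 and 10
```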
CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Cross-modal retrieval has become a highlighted research topic for retrieval
across multimedia data such as image and text. A two-stage learning framework
is widely adopted by most existing methods based on Deep Neural Network (DNN):
The first learning stage is to generate separate representation for each
modality, and the second learning stage is to get the cross-modal common
representation. However, the existing methods have three limitations: (1) In
the first learning stage, they only model intra-modality correlation, but
ignore inter-modality correlation with rich complementary context. (2) In the
second learning stage, they only adopt shallow networks with single-loss
regularization, but ignore the intrinsic relevance of intra-modality and
inter-modality correlation. (3) Only original instances are considered while
the complementary fine-grained clues provided by their patches are ignored. For
addressing the above problems, this paper proposes a cross-modal correlation
learning (CCL) approach with multi-grained fusion by hierarchical network, and
the contributions are as follows: (1) In the first learning stage, CCL exploits
multi-level association with joint optimization to preserve the complementary
context from intra-modality and inter-modality correlation simultaneously. (2)
In the second learning stage, a multi-task learning strategy is designed to
adaptively balance the intra-modality semantic category constraints and
inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained
modeling, which fuses the coarse-grained instances and fine-grained patches to
make cross-modal correlation more precise. Compared with 13 state-of-the-art
methods on 6 widely-used cross-modal datasets, the experimental results show
our CCL approach achieves the best performance.
Comment: 16 pages, accepted by IEEE Transactions on Multimedia
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role in flexibly finding
information across different modalities of data. Effectively measuring the
similarity between different modalities of data is the key to cross-modal
retrieval. Different modalities such as image and text have imbalanced and
complementary relationships, which contain unequal amounts of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions and vice versa. Existing
works based on Deep Neural Network (DNN) mostly construct one common space for
different modalities to find the latent alignments between them, which lose
their exclusive modality-specific characteristics. Different from the existing
works, we propose a modality-specific cross-modal similarity measurement (MCSM)
approach that constructs an independent semantic space for each modality and
adopts an end-to-end framework to directly generate modality-specific cross-modal
similarity without explicit common representation. For each semantic space,
modality-specific characteristics within one modality are fully exploited by
recurrent attention network, while the data of another modality is projected
into this space with attention based joint embedding to utilize the learned
attention weights for guiding the fine-grained cross-modal correlation
learning, which can capture the imbalanced and complementary relationships
between different modalities. Finally, the complementarity between the semantic
spaces for different modalities is explored by adaptive fusion of the
modality-specific cross-modal similarities to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, outperforming 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
Discriminative Supervised Hashing for Cross-Modal similarity Search
With the advantage of low storage cost and high retrieval efficiency, hashing
techniques have recently been an emerging topic in cross-modal similarity
search. As multi-modal data reflect similar semantic content, many
studies aim at learning unified binary codes. However, discriminative
hashing features learned by these methods are not adequate. This results in
lower accuracy and robustness. We propose a novel hashing learning framework
which jointly performs classifier learning, subspace learning and matrix
factorization to preserve class-specific semantic content, termed
Discriminative Supervised Hashing (DSH), to learn the discriminative unified
binary codes for multi-modal data. In addition, to reduce the loss of information
and preserve the non-linear structure of the data, DSH non-linearly projects
different modalities into the common space in which the similarity among
heterogeneous data points can be measured. Extensive experiments conducted on
the three publicly available datasets demonstrate that the framework proposed
in this paper outperforms several state-of-the-art methods.
Comment: 7 pages, 3 figures, 4 tables; the paper is under consideration at Image and Vision Computing
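To make the retrieval step behind unified binary codes concrete: once items from both modalities are mapped to codes of the same length, cross-modal search reduces to ranking by Hamming distance. The sketch below is a generic illustration, not the DSH method itself; the codes, item names and helper functions are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two equal-length codes stored as ints."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query_code, database):
    """Return database keys sorted by Hamming distance to the query code."""
    return sorted(database, key=lambda k: hamming(query_code, database[k]))

# Hypothetical 8-bit codes for three images and one text query.
image_codes = {"img_a": 0b10110010, "img_b": 0b10110011, "img_c": 0b01001101}
text_query = 0b10110011
print(rank_by_hamming(text_query, image_codes))
```

Storing codes as integers keeps the distance computation to one XOR and a bit count, which is where the low storage cost and fast lookup of hashing come from.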
Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval
Sketch-based image retrieval (SBIR) is challenging due to the inherent
domain-gap between sketch and photo. Compared with pixel-perfect depictions of
photos, sketches are highly abstract, iconic renderings of the real world.
Therefore, matching sketch and photo directly using low-level visual cues is
insufficient, since a common low-level subspace that traverses semantically
across the two modalities is non-trivial to establish. Most existing SBIR
studies do not directly tackle this cross-modal problem. This naturally
motivates us to explore the effectiveness of cross-modal retrieval methods in
SBIR, which have been applied successfully to image-text matching. In this
paper, we introduce and compare a series of state-of-the-art cross-modal
subspace learning methods and benchmark them on two recently released
fine-grained SBIR datasets. Through thorough examination of the experimental
results, we have demonstrated that the subspace learning can effectively model
the sketch-photo domain-gap. In addition, we draw a few key insights to drive
future research.
Comment: Accepted by Neurocomputing
CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning
It is known that the inconsistent distribution and representation of
different modalities, such as image and text, cause the heterogeneity gap that
makes it challenging to correlate such heterogeneous data. Generative
adversarial networks (GANs) have shown a strong ability to model data
distributions and learn discriminative representations, but existing GAN-based
works mainly focus on the generative problem of producing new data. We have a
different goal: to correlate heterogeneous data by utilizing the power of
GANs to model the cross-modal joint distribution. Thus, we propose Cross-modal GANs
to learn a discriminative common representation for bridging the heterogeneity gap.
The main contributions are: (1) A cross-modal GANs architecture is proposed to
model the joint distribution over data of different modalities. The inter-modality
and intra-modality correlation can be explored simultaneously in generative and
discriminative models. The two compete with each other to promote cross-modal
correlation learning. (2) Cross-modal convolutional autoencoders with a
weight-sharing constraint are proposed to form the generative model. They can not
only exploit cross-modal correlation for learning common representation, but
also preserve reconstruction information for capturing semantic consistency
within each modality. (3) A cross-modal adversarial mechanism is proposed, which
utilizes two kinds of discriminative models to simultaneously conduct
intra-modality and inter-modality discrimination. They boost each other to
make the common representation more discriminative through the adversarial training process.
To the best of our knowledge, our proposed CM-GANs approach is the first to
utilize GANs to perform cross-modal common representation learning. Experiments
are conducted to verify the performance of our proposed approach on the cross-modal
retrieval paradigm, in comparison with 10 methods on 3 cross-modal datasets.
COBRA: Contrastive Bi-Modal Representation Algorithm
There is a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However,
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms which preserve both inter and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning across seven benchmark
cross-modal datasets.
Comment: 13 Pages, 6 Figures and 10 Tables
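For readers unfamiliar with the contrastive objectives that the abstract above cites as inspiration, the following is a minimal sketch of the generic InfoNCE loss for a single anchor. It is not COBRA's actual objective; the function names, the dot-product similarity, the toy embeddings and the temperature value are all assumptions made for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical 2-D embeddings: an image anchor and candidate text embeddings.
image = [1.0, 0.0]
aligned = info_nce(image, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(image, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when the matching pair scores higher than every negative, which is the sense in which such objectives pull paired image and text embeddings together while pushing unpaired ones apart.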
Modality-dependent Cross-media Retrieval
In this paper, we investigate the cross-media retrieval between images and
text, i.e., using image to search text (I2T) and using text to search images
(T2I). Existing cross-media retrieval methods usually learn one couple of
projections, by which the original features of images and text can be projected
into a common latent space to measure the content similarity. However, using
the same projections for the two different retrieval tasks (I2T and T2I) may
lead to a tradeoff between their respective performances, rather than their
best performances. Different from previous works, we propose a
modality-dependent cross-media retrieval (MDCR) model, where two couples of
projections are learned for different cross-media retrieval tasks instead of
one couple of projections. Specifically, by jointly optimizing the correlation
between images and text and the linear regression from one modal space (image
or text) to the semantic space, two couples of mappings are learned to project
images and text from their original feature spaces into two common latent
subspaces (one for I2T and the other for T2I). Extensive experiments show the
superiority of the proposed MDCR compared with other methods. In particular,
based on the 4,096-dimensional convolutional neural network (CNN) visual feature
and the 100-dimensional LDA textual feature, the mAP of the proposed method
reaches 41.5%, which is a new state-of-the-art performance on the Wikipedia
dataset.
Comment: in ACM Transactions on Intelligent Systems and Technology
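The mAP figure quoted above is mean average precision. A minimal sketch of how it is commonly computed from ranked relevance flags is given below; the function names and the toy rankings are hypothetical, and the paper's exact evaluation protocol (for example, any cutoff on the ranked list) is not specified in the abstract.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 flags in rank order."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two hypothetical queries (e.g. one I2T and one T2I), 1 = relevant result.
rankings = [[1, 0, 1, 0], [0, 1]]
print(round(mean_average_precision(rankings), 4))
```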
Cross-modal Deep Metric Learning with Multi-task Regularization
DNN-based cross-modal retrieval has become a research hotspot, by which users
can search results across various modalities like image and text. However,
existing methods mainly focus on the pairwise correlation and reconstruction
error of labeled data. They ignore the semantically similar and dissimilar
constraints between different modalities, and cannot take advantage of
unlabeled data. This paper proposes Cross-modal Deep Metric Learning with
Multi-task Regularization (CDMLMR), which integrates quadruplet ranking loss
and semi-supervised contrastive loss for modeling cross-modal semantic
similarity in a unified multi-task learning architecture. The quadruplet
ranking loss can model the semantically similar and dissimilar constraints to
preserve cross-modal relative similarity ranking information. The
semi-supervised contrastive loss is able to maximize the semantic similarity on
both labeled and unlabeled data. Compared to the existing methods, CDMLMR
exploits not only the similarity ranking information but also unlabeled
cross-modal data, and thus boosts cross-modal retrieval accuracy.
Comment: Revision: Added reference [7]. 6 pages, 1 figure, to appear in the proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Jul 10, 2017 - Jul 14, 2017, Hong Kong, Hong Kong
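As background for the contrastive term mentioned in the abstract above, here is a minimal sketch of the classic pairwise contrastive loss on two embeddings. It is the standard supervised form, not the semi-supervised variant or the quadruplet ranking loss proposed in the paper; the function names, margin and sample vectors are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = euclidean(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical image and text embeddings in a shared 2-D space.
image_vec, text_vec = [0.0, 0.0], [0.0, 0.5]
print(contrastive_loss(image_vec, text_vec, similar=True))
print(contrastive_loss(image_vec, text_vec, similar=False))
```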