Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval
Sketch-based image retrieval (SBIR) is challenging due to the inherent
domain-gap between sketch and photo. Compared with pixel-perfect depictions of
photos, sketches are iconic, highly abstract renderings of the real world.
Therefore, matching sketches and photos directly using low-level visual cues is
insufficient, since a common low-level subspace that traverses semantically
across the two modalities is non-trivial to establish. Most existing SBIR
studies do not directly tackle this cross-modal problem. This naturally
motivates us to explore the effectiveness of cross-modal retrieval methods in
SBIR, which have been applied successfully to image-text matching. In this
paper, we introduce and compare a series of state-of-the-art cross-modal
subspace learning methods and benchmark them on two recently released
fine-grained SBIR datasets. Through thorough examination of the experimental
results, we have demonstrated that the subspace learning can effectively model
the sketch-photo domain-gap. In addition we draw a few key insights to drive
future research.Comment: Accepted by Neurocomputin
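As a concrete illustration of the kind of cross-modal subspace learning benchmarked in this entry, the following is a minimal CCA-style sketch. Canonical Correlation Analysis is only one of the compared methods, and the feature dimensions, variable names and similarity measure below are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical pre-extracted features: one row per matched sketch-photo training pair.
sketch_feats = np.random.randn(500, 512)   # stand-in for sketch descriptors
photo_feats = np.random.randn(500, 512)    # stand-in for photo descriptors

# Learn a shared low-dimensional subspace that maximises correlation
# between the two modalities.
cca = CCA(n_components=64)
cca.fit(sketch_feats, photo_feats)

# Project a query sketch and the photo gallery into the common subspace,
# then rank photos by cosine similarity to the query.
q, gallery = cca.transform(sketch_feats[:1], photo_feats)
q = q / np.linalg.norm(q)
gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
ranking = np.argsort(-gallery @ q.ravel())
```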
Unsupervised Multi-modal Hashing for Cross-modal Retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing method that addresses the open problem of
directly preserving the manifold structure through hashing. To this end, both
the semantic correlation in the textual space and the local geometric structure
in the visual space are explored simultaneously in our framework. Besides, an
ℓ2,1-norm constraint is imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on the three publicly
available datasets and the experimental results show that our method can
achieve superior performance over state-of-the-art methods.
Comment: 4 pages, 4 figures
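For readers unfamiliar with the ℓ2,1-norm constraint mentioned above: it sums the Euclidean norms of the projection matrix's rows, driving whole rows to zero and thereby selecting discriminative features per modality. A small illustrative sketch (the objective form and names below are assumptions, not the paper's exact model):

```python
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W: encourages row-sparse projections."""
    return np.sum(np.linalg.norm(W, axis=1))

def objective(X, B, W, lam=0.1):
    """Hypothetical per-modality objective: fit binary codes B from features X
    through projection W, plus the l2,1 regulariser on W."""
    return np.linalg.norm(X @ W - B, 'fro') ** 2 + lam * l21_norm(W)
```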
Triplet-Based Deep Hashing Network for Cross-Modal Retrieval
Given the benefits of its low storage requirements and high retrieval
efficiency, hashing has recently received increasing attention. In
particular, cross-modal hashing has been widely and successfully used in
multimedia similarity search applications. However, almost all existing
cross-modal hashing methods fail to obtain powerful hash codes because they
ignore the relative similarity between heterogeneous data, which carries richer
semantic information, leading to unsatisfactory retrieval performance.
In this paper, we propose a triplet-based deep hashing (TDH) network for
cross-modal retrieval. First, we utilize triplet labels, which describe
the relative relationships among three instances, as supervision in order to
capture more general semantic correlations between cross-modal instances. We
then establish a loss function from the inter-modal view and the intra-modal
view to boost the discriminative abilities of the hash codes. Finally, graph
regularization is introduced into our proposed TDH method to preserve the
original semantic similarity between hash codes in Hamming space. Experimental
results show that our proposed method outperforms several state-of-the-art
approaches on two popular cross-modal datasets.
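A generic cross-modal triplet margin loss of the kind this family of methods builds on might look like the sketch below. It assumes continuous relaxations of the hash codes before binarisation and is not TDH's exact loss, which also includes inter-/intra-modal terms and graph regularization.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(anchor_img, pos_txt, neg_txt, margin=1.0):
    """Anchor from one modality (e.g. image), positive/negative from the other.
    Inputs are assumed to be relaxed codes in [-1, 1] before binarisation."""
    d_pos = (anchor_img - pos_txt).pow(2).sum(dim=1)
    d_neg = (anchor_img - neg_txt).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# At retrieval time the relaxed codes would be binarised, e.g. torch.sign(codes).
```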
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, the
requirements of users are highly flexible, such as retrieving relevant
audio clips with a single image query. So challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them to focus on algorithm design rather than the time-consuming
reproduction of compared methods and results. It is noted that we have
constructed a new dataset, XMedia, which is the first publicly available
dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.
Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology
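Benchmarks of this kind are commonly reported with mean average precision (mAP); the abstract does not spell out the metric, so the following is only a generic sketch of how such retrieval results are typically scored, for readers adopting the benchmarks.

```python
import numpy as np

def average_precision(relevance):
    """relevance: 1/0 array over a ranked retrieval list for one query."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    hits = np.cumsum(relevance)
    precisions = hits / (np.arange(len(relevance)) + 1)
    return float((precisions * relevance).sum() / relevance.sum())

# mAP is the mean of average_precision over all queries in the test set.
```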
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
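The binary-representation branch of methods surveyed here retrieves by Hamming distance between compact codes in the common Hamming space; a minimal, purely illustrative lookup looks like this:

```python
import numpy as np

def hamming_retrieval(query_code, gallery_codes):
    """query_code: (n_bits,) array in {0, 1}; gallery_codes: (N, n_bits) in {0, 1}.
    Returns gallery indices sorted by increasing Hamming distance."""
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)
    return np.argsort(dists)
```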
Discriminative Supervised Hashing for Cross-Modal Similarity Search
With the advantage of low storage cost and high retrieval efficiency, hashing
techniques have recently been an emerging topic in cross-modal similarity
search. As data from multiple modalities reflect similar semantic content, much
research aims at learning unified binary codes. However, the hashing features
learned by these methods are not sufficiently discriminative, which results in
lower accuracy and robustness. We propose a novel hashing learning framework,
termed Discriminative Supervised Hashing (DSH), which jointly performs
classifier learning, subspace learning and matrix factorization to preserve
class-specific semantic content and learn discriminative unified binary codes
for multi-modal data. Besides, to reduce information loss and preserve the
non-linear structure of the data, DSH non-linearly projects the different
modalities into a common space in which the similarity among heterogeneous data
points can be measured. Extensive experiments conducted on three publicly
available datasets demonstrate that the proposed framework outperforms several
state-of-the-art methods.
Comment: 7 pages, 3 figures, 4 tables; the paper is under consideration at Image and Vision Computing
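The non-linear projection into a common space mentioned above is commonly realised with a kernel mapping; one illustrative possibility (an RBF kernel against a set of anchor samples, which is an assumption here rather than DSH's specific choice) is sketched below.

```python
import numpy as np

def rbf_kernel_features(X, anchors, gamma=1e-3):
    """Map raw features X non-linearly via an RBF kernel against anchor samples."""
    sq_dists = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# Each modality gets its own kernelised features, which a learned matrix then
# projects into the shared space where cross-modal similarities are measured.
```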
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role to flexibly find
information across different modalities of data. Effectively measuring the
similarity between different modalities of data is the key of cross-modal
retrieval. Different modalities such as image and text have imbalanced and
complementary relationships, which contain unequal amount of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions and vice versa. Existing
works based on Deep Neural Networks (DNNs) mostly construct one common space for
different modalities to find the latent alignments between them, which loses
the exclusive modality-specific characteristics of each modality. Different from the existing
works, we propose a modality-specific cross-modal similarity measurement (MCSM)
approach that constructs an independent semantic space for each modality and
adopts an end-to-end framework to directly generate modality-specific
cross-modal similarities without an explicit common representation. For each
semantic space, the modality-specific characteristics within that modality are
fully exploited by a recurrent attention network, while the data of the other
modality is projected into this space with an attention-based joint embedding.
The learned attention weights guide fine-grained cross-modal correlation
learning, which can capture the imbalanced and complementary relationships
between different modalities. Finally, the complementarity between the semantic
spaces of the different modalities is explored by adaptively fusing the
modality-specific cross-modal similarities to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, which outperforms 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
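As a rough illustration of the attention-based joint embedding described above, a generic attention step can project one modality's query into the other modality's semantic space as a weighted sum of that modality's local features. This is a simplified sketch, not the paper's recurrent formulation, and all names and shapes are assumptions.

```python
import numpy as np

def attend(query_vec, keys, values):
    """Project a query from one modality into the other's semantic space as an
    attention-weighted combination of that modality's region/word features."""
    scores = keys @ query_vec / np.sqrt(keys.shape[1])   # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over local features
    return weights @ values
```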
CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning
It is known that the inconsistent distribution and representation of
different modalities, such as image and text, cause the heterogeneity gap that
makes it challenging to correlate such heterogeneous data. Generative
adversarial networks (GANs) have shown a strong ability to model data
distributions and learn discriminative representations, but existing GAN-based
works mainly focus on the generative problem of producing new data. Our goal is
different: we aim to correlate heterogeneous data by utilizing the power of
GANs to model the cross-modal joint distribution. Thus, we propose Cross-modal
GANs to learn discriminative common representations for bridging the
heterogeneity gap.
The main contributions are: (1) Cross-modal GANs architecture is proposed to
model the joint distribution over data of different modalities. Inter-modality
and intra-modality correlation can be explored simultaneously in the generative
and discriminative models, which compete with each other to promote cross-modal
correlation learning. (2) Cross-modal convolutional autoencoders with
a weight-sharing constraint are proposed to form the generative model. They can not
only exploit cross-modal correlation for learning common representation, but
also preserve reconstruction information for capturing semantic consistency
within each modality. (3) A cross-modal adversarial mechanism is proposed, which
utilizes two kinds of discriminative models to simultaneously conduct
intra-modality and inter-modality discrimination. The two mutually boost each
other, making the common representation more discriminative through the
adversarial training process.
To the best of our knowledge, our proposed CM-GANs approach is the first to
utilize GANs to perform cross-modal common representation learning. Experiments
are conducted to verify the performance of our proposed approach on cross-modal
retrieval paradigm, compared with 10 methods on 3 cross-modal datasets.
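A highly simplified sketch of the inter-modality adversarial idea follows: a modality discriminator tries to tell which modality a common representation came from, while the encoders are trained to fool it. All layer sizes and names are assumptions, and this is a generic adversarial alignment sketch rather than the CM-GANs architecture itself.

```python
import torch
import torch.nn as nn

img_enc = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 128))
txt_enc = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 128))
modality_disc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(img_feat, txt_feat):
    z_img, z_txt = img_enc(img_feat), txt_enc(txt_feat)
    # Discriminator: distinguish image representations (label 1) from text ones (label 0).
    d_loss = bce(modality_disc(z_img.detach()), torch.ones(len(z_img), 1)) + \
             bce(modality_disc(z_txt.detach()), torch.zeros(len(z_txt), 1))
    # Encoders: fool the discriminator so the common space becomes modality-invariant.
    g_loss = bce(modality_disc(z_txt), torch.ones(len(z_txt), 1))
    return d_loss, g_loss
```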
CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Cross-modal retrieval has become a highlighted research topic for retrieval
across multimedia data such as image and text. A two-stage learning framework
is widely adopted by most existing methods based on Deep Neural Network (DNN):
The first learning stage is to generate a separate representation for each
modality, and the second learning stage is to get the cross-modal common
representation. However, the existing methods have three limitations: (1) In
the first learning stage, they only model intra-modality correlation, but
ignore inter-modality correlation with rich complementary context. (2) In the
second learning stage, they only adopt shallow networks with single-loss
regularization, but ignore the intrinsic relevance of intra-modality and
inter-modality correlation. (3) Only original instances are considered while
the complementary fine-grained clues provided by their patches are ignored. For
addressing the above problems, this paper proposes a cross-modal correlation
learning (CCL) approach with multi-grained fusion by hierarchical network, and
the contributions are as follows: (1) In the first learning stage, CCL exploits
multi-level association with joint optimization to preserve the complementary
context from intra-modality and inter-modality correlation simultaneously. (2)
In the second learning stage, a multi-task learning strategy is designed to
adaptively balance the intra-modality semantic category constraints and
inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained
modeling, which fuses the coarse-grained instances and fine-grained patches to
make cross-modal correlation more precise. Compared with 13 state-of-the-art
methods on 6 widely-used cross-modal datasets, the experimental results show
that our CCL approach achieves the best performance.
Comment: 16 pages, accepted by IEEE Transactions on Multimedia
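The multi-task balance between intra-modality semantic category constraints and inter-modality pairwise similarity constraints can be pictured as a weighted combination of a classification loss and a pairwise matching loss. The sketch below is illustrative only; the fixed weight and loss forms are assumptions, whereas CCL balances the terms adaptively.

```python
import torch
import torch.nn.functional as F

def multitask_loss(img_repr, txt_repr, img_logits, txt_logits, labels, alpha=0.5):
    # Intra-modality semantic category constraint: classify each modality's representation.
    cls_loss = F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)
    # Inter-modality pairwise similarity constraint: matched image/text pairs stay close.
    pair_loss = F.mse_loss(img_repr, txt_repr)
    return alpha * cls_loss + (1 - alpha) * pair_loss
```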
Cross-modal Subspace Learning via Kernel Correlation Maximization and Discriminative Structure Preserving
Measuring the similarity between heterogeneous data is still an open problem.
Many research works have been developed to learn a common subspace where the
similarity between different modalities can be calculated directly. However,
most existing works focus on learning a latent subspace without well preserving
the semantically structural information, so these approaches cannot achieve the
desired results. In this paper, we propose a novel framework, termed
Cross-modal subspace learning via Kernel correlation maximization and
Discriminative structure-preserving (CKD), to solve this problem in two
aspects. Firstly, we construct a shared semantic graph to make each modality
data preserve the neighbor relationship semantically. Secondly, we introduce
the Hilbert-Schmidt Independence Criterion (HSIC) to ensure the consistency
between feature-similarity and semantic-similarity of samples. Our model not
only considers the inter-modality correlation by maximizing the kernel
correlation but also preserves the semantically structural information within
each modality. Extensive experiments are performed to evaluate the proposed
framework on three public datasets. The experimental results demonstrate
that the proposed CKD is competitive compared with classic subspace
learning methods.
Comment: The paper is under consideration at Multimedia Tools and Applications
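For reference, the empirical Hilbert-Schmidt Independence Criterion used to align feature similarity with semantic similarity can be estimated as trace(KHLH)/(n-1)^2, where K and L are kernel matrices and H is the centering matrix. A minimal sketch with linear kernels (the kernel choice is an assumption made for brevity):

```python
import numpy as np

def hsic(X, Y):
    """Empirical HSIC between paired samples X and Y (rows are samples), linear kernels."""
    n = X.shape[0]
    K, L = X @ X.T, Y @ Y.T
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```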