1,589 research outputs found
Pairwise Constraint Propagation on Multi-View Data
This paper presents a graph-based learning approach to pairwise constraint
propagation on multi-view data. Although pairwise constraint propagation has
been studied extensively, pairwise constraints are usually defined over pairs
of data points from a single view, i.e., only intra-view constraint propagation
is considered for multi-view tasks. In fact, very little attention has been
paid to inter-view constraint propagation, which is more challenging since
pairwise constraints are now defined over pairs of data points from different
views. In this paper, we propose to decompose the challenging inter-view
constraint propagation problem into semi-supervised learning subproblems so
that they can be efficiently solved based on graph-based label propagation. To
the best of our knowledge, this is the first attempt to give an efficient
solution to inter-view constraint propagation from a semi-supervised learning
viewpoint. Moreover, since graph-based label propagation has been adopted for
basic optimization, we develop two constrained graph construction methods for
interview constraint propagation, which only differ in how the intra-view
pairwise constraints are exploited. The experimental results in cross-view
retrieval have shown the promising performance of our inter-view constraint
propagation
Triplet-Based Deep Hashing Network for Cross-Modal Retrieval
Given the benefits of its low storage requirements and high retrieval
efficiency, hashing has recently received increasing attention. In
particular,cross-modal hashing has been widely and successfully used in
multimedia similarity search applications. However, almost all existing methods
employing cross-modal hashing cannot obtain powerful hash codes due to their
ignoring the relative similarity between heterogeneous data that contains
richer semantic information, leading to unsatisfactory retrieval performance.
In this paper, we propose a triplet-based deep hashing (TDH) network for
cross-modal retrieval. First, we utilize the triplet labels, which describes
the relative relationships among three instances as supervision in order to
capture more general semantic correlations between cross-modal instances. We
then establish a loss function from the inter-modal view and the intra-modal
view to boost the discriminative abilities of the hash codes. Finally, graph
regularization is introduced into our proposed TDH method to preserve the
original semantic similarity between hash codes in Hamming space. Experimental
results show that our proposed method outperforms several state-of-the-art
approaches on two popular cross-modal datasets
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, the
requirements of users are highly flexible, such as retrieving the relevant
audio clips with one query of image. So challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literatures. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them to focus on algorithm design, rather than the time-consuming
compared methods and results. It is noted that we have constructed a new
dataset XMedia, which is the first publicly available dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technolog
Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval
Sketch-based image retrieval (SBIR) is challenging due to the inherent
domain-gap between sketch and photo. Compared with pixel-perfect depictions of
photos, sketches are iconic renderings of the real world with highly abstract.
Therefore, matching sketch and photo directly using low-level visual clues are
unsufficient, since a common low-level subspace that traverses semantically
across the two modalities is non-trivial to establish. Most existing SBIR
studies do not directly tackle this cross-modal problem. This naturally
motivates us to explore the effectiveness of cross-modal retrieval methods in
SBIR, which have been applied in the image-text matching successfully. In this
paper, we introduce and compare a series of state-of-the-art cross-modal
subspace learning methods and benchmark them on two recently released
fine-grained SBIR datasets. Through thorough examination of the experimental
results, we have demonstrated that the subspace learning can effectively model
the sketch-photo domain-gap. In addition we draw a few key insights to drive
future research.Comment: Accepted by Neurocomputin
CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Cross-modal retrieval has become a highlighted research topic for retrieval
across multimedia data such as image and text. A two-stage learning framework
is widely adopted by most existing methods based on Deep Neural Network (DNN):
The first learning stage is to generate separate representation for each
modality, and the second learning stage is to get the cross-modal common
representation. However, the existing methods have three limitations: (1) In
the first learning stage, they only model intra-modality correlation, but
ignore inter-modality correlation with rich complementary context. (2) In the
second learning stage, they only adopt shallow networks with single-loss
regularization, but ignore the intrinsic relevance of intra-modality and
inter-modality correlation. (3) Only original instances are considered while
the complementary fine-grained clues provided by their patches are ignored. For
addressing the above problems, this paper proposes a cross-modal correlation
learning (CCL) approach with multi-grained fusion by hierarchical network, and
the contributions are as follows: (1) In the first learning stage, CCL exploits
multi-level association with joint optimization to preserve the complementary
context from intra-modality and inter-modality correlation simultaneously. (2)
In the second learning stage, a multi-task learning strategy is designed to
adaptively balance the intra-modality semantic category constraints and
inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained
modeling, which fuses the coarse-grained instances and fine-grained patches to
make cross-modal correlation more precise. Comparing with 13 state-of-the-art
methods on 6 widely-used cross-modal datasets, the experimental results show
our CCL approach achieves the best performance.Comment: 16 pages, accepted by IEEE Transactions on Multimedi
Unsupervised Multi-modal Hashing for Cross-modal retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing learning method to cope with this open
problem to directly preserve the manifold structure by hashing. To address this
problem, both the semantic correlation in textual space and the locally
geometric structure in the visual space are explored simultaneously in our
framework. Besides, the `2;1-norm constraint is imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on the three publicly
available datasets and the experimental results show that our method can
achieve superior performance over the state-of-the-art methods.Comment: 4 pages, 4 figure
CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning
It is known that the inconsistent distribution and representation of
different modalities, such as image and text, cause the heterogeneity gap that
makes it challenging to correlate such heterogeneous data. Generative
adversarial networks (GANs) have shown its strong ability of modeling data
distribution and learning discriminative representation, existing GANs-based
works mainly focus on generative problem to generate new data. We have
different goal, aim to correlate heterogeneous data, by utilizing the power of
GANs to model cross-modal joint distribution. Thus, we propose Cross-modal GANs
to learn discriminative common representation for bridging heterogeneity gap.
The main contributions are: (1) Cross-modal GANs architecture is proposed to
model joint distribution over data of different modalities. The inter-modality
and intra-modality correlation can be explored simultaneously in generative and
discriminative models. Both of them beat each other to promote cross-modal
correlation learning. (2) Cross-modal convolutional autoencoders with
weight-sharing constraint are proposed to form generative model. They can not
only exploit cross-modal correlation for learning common representation, but
also preserve reconstruction information for capturing semantic consistency
within each modality. (3) Cross-modal adversarial mechanism is proposed, which
utilizes two kinds of discriminative models to simultaneously conduct
intra-modality and inter-modality discrimination. They can mutually boost to
make common representation more discriminative by adversarial training process.
To the best of our knowledge, our proposed CM-GANs approach is the first to
utilize GANs to perform cross-modal common representation learning. Experiments
are conducted to verify the performance of our proposed approach on cross-modal
retrieval paradigm, compared with 10 methods on 3 cross-modal datasets
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.Comment: 20 pages, 11 figures, 9 table
Discriminative Supervised Hashing for Cross-Modal similarity Search
With the advantage of low storage cost and high retrieval efficiency, hashing
techniques have recently been an emerging topic in cross-modal similarity
search. As multiple modal data reflect similar semantic content, many
researches aim at learning unified binary codes. However, discriminative
hashing features learned by these methods are not adequate. This results in
lower accuracy and robustness. We propose a novel hashing learning framework
which jointly performs classifier learning, subspace learning and matrix
factorization to preserve class-specific semantic content, termed
Discriminative Supervised Hashing (DSH), to learn the discrimative unified
binary codes for multi-modal data. Besides, reducing the loss of information
and preserving the non-linear structure of data, DSH non-linearly projects
different modalities into the common space in which the similarity among
heterogeneous data points can be measured. Extensive experiments conducted on
the three publicly available datasets demonstrate that the framework proposed
in this paper outperforms several state-of -the-art methods.Comment: 7 pages,3 figures,4 tables;The paper is under consideration at Image
and Vision Computin
Unsupervised Cross-Media Hashing with Structure Preservation
Recent years have seen the exponential growth of heterogeneous multimedia
data. The need for effective and accurate data retrieval from heterogeneous
data sources has attracted much research interest in cross-media retrieval.
Here, given a query of any media type, cross-media retrieval seeks to find
relevant results of different media types from heterogeneous data sources. To
facilitate large-scale cross-media retrieval, we propose a novel unsupervised
cross-media hashing method. Our method incorporates local affinity and distance
repulsion constraints into a matrix factorization framework. Correspondingly,
the proposed method learns hash functions that generates unified hash codes
from different media types, while ensuring intrinsic geometric structure of the
data distribution is preserved. These hash codes empower the similarity between
data of different media types to be evaluated directly. Experimental results on
two large-scale multimedia datasets demonstrate the effectiveness of the
proposed method, where we outperform the state-of-the-art methods
- …