1,426 research outputs found
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.Comment: 20 pages, 11 figures, 9 table
Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities
Cross-modal retrieval aims to retrieve relevant data across different
modalities (e.g., texts vs. images). The common strategy is to apply
element-wise constraints between manually labeled pair-wise items to guide the
generators to learn the semantic relationships between the modalities, so that
the similar items can be projected close to each other in the common
representation subspace. However, such constraints often fail to preserve the
semantic structure between unpaired but semantically similar items (e.g. the
unpaired items with the same class label are more similar than items with
different labels). To address the above problem, we propose a novel cross-modal
similarity transferring (CMST) method to learn and preserve the semantic
relationships between unpaired items in an unsupervised way. The key idea is to
learn the quantitative similarities in single-modal representation subspace,
and then transfer them to the common representation subspace to establish the
semantic relationships between unpaired items across modalities. Experiments
show that our method outperforms the state-of-the-art approaches both in the
class-based and pair-based retrieval tasks
Cross-modal Subspace Learning via Kernel Correlation Maximization and Discriminative Structure Preserving
The measure between heterogeneous data is still an open problem. Many
research works have been developed to learn a common subspace where the
similarity between different modalities can be calculated directly. However,
most of existing works focus on learning a latent subspace but the semantically
structural information is not well preserved. Thus, these approaches cannot get
desired results. In this paper, we propose a novel framework, termed
Cross-modal subspace learning via Kernel correlation maximization and
Discriminative structure-preserving (CKD), to solve this problem in two
aspects. Firstly, we construct a shared semantic graph to make each modality
data preserve the neighbor relationship semantically. Secondly, we introduce
the Hilbert-Schmidt Independence Criteria (HSIC) to ensure the consistency
between feature-similarity and semantic-similarity of samples. Our model not
only considers the inter-modality correlation by maximizing the kernel
correlation but also preserves the semantically structural information within
each modality. The extensive experiments are performed to evaluate the proposed
framework on the three public datasets. The experimental results demonstrated
that the proposed CKD is competitive compared with the classic subspace
learning methods.Comment: The paper is under consideration at Multimedia Tools and Application
Unsupervised Multi-modal Hashing for Cross-modal retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing learning method to cope with this open
problem to directly preserve the manifold structure by hashing. To address this
problem, both the semantic correlation in textual space and the locally
geometric structure in the visual space are explored simultaneously in our
framework. Besides, the `2;1-norm constraint is imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on the three publicly
available datasets and the experimental results show that our method can
achieve superior performance over the state-of-the-art methods.Comment: 4 pages, 4 figure
COBRA: Contrastive Bi-Modal Representation Algorithm
There are a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms which preserve both inter and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning across seven benchmark
cross-modal datasets.Comment: 13 Pages, 6 Figures and 10 Table
Modality-dependent Cross-media Retrieval
In this paper, we investigate the cross-media retrieval between images and
text, i.e., using image to search text (I2T) and using text to search images
(T2I). Existing cross-media retrieval methods usually learn one couple of
projections, by which the original features of images and text can be projected
into a common latent space to measure the content similarity. However, using
the same projections for the two different retrieval tasks (I2T and T2I) may
lead to a tradeoff between their respective performances, rather than their
best performances. Different from previous works, we propose a
modality-dependent cross-media retrieval (MDCR) model, where two couples of
projections are learned for different cross-media retrieval tasks instead of
one couple of projections. Specifically, by jointly optimizing the correlation
between images and text and the linear regression from one modal space (image
or text) to the semantic space, two couples of mappings are learned to project
images and text from their original feature spaces into two common latent
subspaces (one for I2T and the other for T2I). Extensive experiments show the
superiority of the proposed MDCR compared with other methods. In particular,
based the 4,096 dimensional convolutional neural network (CNN) visual feature
and 100 dimensional LDA textual feature, the mAP of the proposed method
achieves 41.5\%, which is a new state-of-the-art performance on the Wikipedia
dataset.Comment: in ACM Transactions on Intelligent Systems and Technolog
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have proven that
quantization-based approaches perform generally better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, which is among the early attempts of leveraging deep
neural networks into quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private
subspaces for individual modalities, and representations in the shared subspace
and the private subspaces are learned simultaneously by embedding them to a
reproducing kernel Hilbert space, where the mean embedding of different
modality distributions can be explicitly compared. In addition, in the shared
subspace, a quantizer is learned to produce the semantics preserving compact
codes with the help of label alignment. Thanks to this novel network
architecture in cooperation with supervised quantization training, SPDQ can
preserve intramodal and intermodal similarities as much as possible and greatly
reduce quantization error. Experiments on two popular benchmarks corroborate
that our approach outperforms state-of-the-art methods
An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval
Due to the rapid development of mobile Internet techniques, cloud computation
and popularity of online social networking and location-based services, massive
amount of multimedia data with geographical information is generated and
uploaded to the Internet. In this paper, we propose a novel type of cross-modal
multimedia retrieval called geo-multimedia cross-modal retrieval which aims to
search out a set of geo-multimedia objects based on geographical distance
proximity and semantic similarity between different modalities. Previous
studies for cross-modal retrieval and spatial keyword search cannot address
this problem effectively because they do not consider multimedia data with
geo-tags and do not focus on this type of query. In order to address this
problem efficiently, we present the definition of NN geo-multimedia
cross-modal query at the first time and introduce relevant conceptions such as
cross-modal semantic representation space. To bridge the semantic gap between
different modalities, we propose a method named cross-modal semantic matching
which contains two important component, i.e., CorrProj and LogsTran, which aims
to construct a common semantic representation space for cross-modal semantic
similarity measurement. Besides, we designed a framework based on deep learning
techniques to implement common semantic representation space construction. In
addition, a novel hybrid indexing structure named GMR-Tree combining
geo-multimedia data and R-Tree is presented and a efficient NN search
algorithm called GMCMS is designed. Comprehensive experimental evaluation on
real and synthetic dataset clearly demonstrates that our solution outperforms
the-state-of-the-art methods.Comment: 27 page
Deep Multi-view Learning to Rank
We study the problem of learning to rank from multiple information sources.
Though multi-view learning and learning to rank have been studied extensively
leading to a wide range of applications, multi-view learning to rank as a
synergy of both topics has received little attention. The aim of the paper is
to propose a composite ranking method while keeping a close correlation with
the individual rankings simultaneously. We present a generic framework for
multi-view subspace learning to rank (MvSL2R), and two novel solutions are
introduced under the framework. The first solution captures information of
feature mappings from within each view as well as across views using
autoencoder-like networks. Novel feature embedding methods are formulated in
the optimization of multi-view unsupervised and discriminant autoencoders.
Moreover, we introduce an end-to-end solution to learning towards both the
joint ranking objective and the individual rankings. The proposed solution
enhances the joint ranking with minimum view-specific ranking loss, so that it
can achieve the maximum global view agreements in a single optimization
process. The proposed method is evaluated on three different ranking problems,
i.e. university ranking, multi-view lingual text ranking and image data
ranking, providing superior results compared to related methods.Comment: Published at IEEE TKD
Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval
Cross-modal information retrieval aims to find heterogeneous data of various
modalities from a given query of one modality. The main challenge is to map
different modalities into a common semantic space, in which distance between
concepts in different modalities can be well modeled. For cross-modal
information retrieval between images and texts, existing work mostly uses
off-the-shelf Convolutional Neural Network (CNN) for image feature extraction.
For texts, word-level features such as bag-of-words or word2vec are employed to
build deep learning models to represent texts. Besides word-level semantics,
the semantic relations between words are also informative but less explored. In
this paper, we model texts by graphs using similarity measure based on
word2vec. A dual-path neural network model is proposed for couple feature
learning in cross-modal information retrieval. One path utilizes Graph
Convolutional Network (GCN) for text modeling based on graph representations.
The other path uses a neural network with layers of nonlinearities for image
modeling based on off-the-shelf features. The model is trained by a pairwise
similarity loss function to maximize the similarity of relevant text-image
pairs and minimize the similarity of irrelevant pairs. Experimental results
show that the proposed model outperforms the state-of-the-art methods
significantly, with 17% improvement on accuracy for the best case.Comment: 7 pages, 11 figure
- …