CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning
It is known that the inconsistent distribution and representation of
different modalities, such as image and text, cause a heterogeneity gap that
makes it challenging to correlate such heterogeneous data. Generative
adversarial networks (GANs) have shown a strong ability to model data
distributions and learn discriminative representations, but existing GANs-based
works mainly focus on the generative problem of producing new data. Our goal is
different: we aim to correlate heterogeneous data by utilizing the power of
GANs to model the cross-modal joint distribution. Thus, we propose Cross-modal GANs
to learn discriminative common representations for bridging the heterogeneity gap.
The main contributions are: (1) A cross-modal GANs architecture is proposed to
model the joint distribution over data of different modalities. Inter-modality
and intra-modality correlation can be explored simultaneously in the generative and
discriminative models, which compete with each other to promote cross-modal
correlation learning. (2) Cross-modal convolutional autoencoders with a
weight-sharing constraint are proposed to form the generative model. They not
only exploit cross-modal correlation for learning common representations, but
also preserve reconstruction information for capturing semantic consistency
within each modality. (3) A cross-modal adversarial mechanism is proposed, which
utilizes two kinds of discriminative models to simultaneously conduct
intra-modality and inter-modality discrimination; these mutually boost each other to
make the common representation more discriminative through the adversarial training
process. To the best of our knowledge, our proposed CM-GANs approach is the first to
utilize GANs to perform cross-modal common representation learning. Experiments
on the cross-modal retrieval paradigm verify the performance of our proposed
approach against 10 methods on 3 cross-modal datasets.
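As a loose illustration of contribution (2), here is a minimal sketch of two autoencoders whose top encoder layer is shared so both modalities map into one common space. All layer sizes are assumptions, and plain MLP encoders stand in for the paper's convolutional ones; this is not the actual CM-GANs architecture:

```python
import torch
import torch.nn as nn

class WeightSharedAutoencoders(nn.Module):
    """Two modality-specific autoencoders whose top encoder layer is
    shared, so image and text both map into one common space while the
    decoders preserve reconstruction information per modality."""
    def __init__(self, img_dim=4096, txt_dim=300, hidden=1024, common=200):
        super().__init__()
        # modality-specific lower encoder layers
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # weight-sharing constraint: one top layer used by both branches
        self.shared = nn.Linear(hidden, common)
        # per-modality decoders for the reconstruction objective
        self.img_dec = nn.Sequential(nn.Linear(common, hidden), nn.ReLU(),
                                     nn.Linear(hidden, img_dim))
        self.txt_dec = nn.Sequential(nn.Linear(common, hidden), nn.ReLU(),
                                     nn.Linear(hidden, txt_dim))

    def forward(self, img, txt):
        z_img = self.shared(self.img_enc(img))  # common representation (image)
        z_txt = self.shared(self.txt_enc(txt))  # common representation (text)
        return z_img, z_txt, self.img_dec(z_img), self.txt_dec(z_txt)
```

The common representations z_img and z_txt would then be fed to the intra-modality and inter-modality discriminators during adversarial training.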
Cross-media Multi-level Alignment with Relation Attention Network
With the rapid growth of multimedia data, such as image and text, it is a
highly challenging problem to effectively correlate and retrieve the data of
different media types. Naturally, when correlating an image with a textual
description, people focus not only on the alignment between discriminative
image regions and key words, but also on the relations in the visual and
textual context. Relation understanding is essential for cross-media
correlation learning, yet it is ignored by prior cross-media retrieval works. To
address the above issue, we propose Cross-media Relation Attention Network
(CRAN) with multi-level alignment. First, we propose visual-language relation
attention model to explore both fine-grained patches and their relations of
different media types. We aim to not only exploit cross-media fine-grained
local information, but also capture the intrinsic relation information, which
can provide complementary hints for correlation learning. Second, we propose
cross-media multi-level alignment to explore global, local and relation
alignments across different media types, which can mutually boost to learn more
precise cross-media correlation. We conduct experiments on 2 cross-media
datasets, and compare with 10 state-of-the-art methods to verify the
effectiveness of the proposed approach.
Comment: 7 pages, accepted by International Joint Conference on Artificial Intelligence (IJCAI) 2018
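A toy sketch of the multi-level alignment idea follows. The function name, the cosine similarities, and the best-match heuristic for the local and relation levels are all illustrative assumptions, not CRAN's actual model:

```python
import torch
import torch.nn.functional as F

def multilevel_similarity(g_img, g_txt, l_img, l_txt, r_img, r_txt,
                          w=(1.0, 1.0, 1.0)):
    """Fuse global, local (patch/word), and relation-level similarities
    into one cross-media score. Global features are vectors (d,); local
    and relation features are sets (n, d), aligned by matching each item
    to its best counterpart in the other modality."""
    def set_sim(a, b):
        sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # (n_a, n_b)
        return sim.max(dim=1).values.mean()  # best match per item, averaged
    s_global = F.cosine_similarity(g_img, g_txt, dim=0)
    s_local = set_sim(l_img, l_txt)
    s_relation = set_sim(r_img, r_txt)
    return w[0] * s_global + w[1] * s_local + w[2] * s_relation
```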
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common
Hamming space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
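To make the binary-representation branch concrete, here is a minimal, self-contained sketch (independent of any surveyed method) of retrieval in a common Hamming space once both modalities have been mapped to binary codes:

```python
import numpy as np

def hamming_retrieve(query_code, db_codes, k=5):
    """Cross-modal retrieval in a common Hamming space: codes are {0,1}
    arrays, and the distance is the number of differing bits. In practice
    codes are packed into integers for fast bitwise comparison."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]  # indices of the k nearest items

# toy usage: a text query's 32-bit code against 1000 image codes
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 32))
query = rng.integers(0, 2, size=32)
print(hamming_retrieve(query, db))
```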
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Little research focuses on cross-modal correlation learning where the temporal
structures of different data modalities, such as audio and lyrics, are taken into
account. Motivated by the inherently temporal structure of music, we aim
to learn the deep sequential correlation between audio
and lyrics. In this work, we propose a deep cross-modal correlation learning
architecture involving two-branch deep neural networks for the audio modality and
the text modality (lyrics). Data of different modalities are converted to the same
canonical space, where inter-modal canonical correlation analysis is utilized as
an objective function to calculate the similarity of temporal structures. This
is the first study on understanding the correlation between language and music
audio through deep architectures for learning the paired temporal correlation
of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected
layers (a fully-connected deep neural network) is used to represent lyrics. Two
significant contributions are made in the audio branch: i) a
pre-trained CNN followed by fully-connected layers is investigated for
representing music audio; ii) we further suggest an end-to-end architecture
that simultaneously trains convolutional layers and fully-connected layers to
better learn the temporal structures of music audio. In particular, our end-to-end
deep architecture has two properties: it simultaneously performs feature
learning and cross-modal correlation learning, and it learns a joint
representation that takes temporal structures into account. Experimental results,
using audio to retrieve lyrics or using lyrics to retrieve audio, verify the
effectiveness of the proposed deep correlation learning architectures in
cross-modal music retrieval.
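For concreteness, here is a simplified sketch of a CCA-style objective between the two branch outputs, in the spirit of the standard deep-CCA loss; the regularization constant and the full-batch covariance estimates are assumptions:

```python
import torch

def cca_loss(H1, H2, eps=1e-4):
    """Negative total canonical correlation between branch outputs H1, H2
    of shape (batch, d): minimizing it maximizes the correlation between
    the audio and lyrics embeddings in the canonical space."""
    n = H1.size(0)
    H1 = H1 - H1.mean(dim=0)  # center each branch
    H2 = H2 - H2.mean(dim=0)
    S11 = H1.T @ H1 / (n - 1) + eps * torch.eye(H1.size(1))
    S22 = H2.T @ H2 / (n - 1) + eps * torch.eye(H2.size(1))
    S12 = H1.T @ H2 / (n - 1)
    def inv_sqrt(S):  # S^{-1/2} via eigendecomposition
        w, V = torch.linalg.eigh(S)
        return V @ torch.diag(w.clamp_min(eps).rsqrt()) @ V.T
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return -torch.linalg.svdvals(T).sum()  # negative sum of correlations
```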
SoDeep: a Sorting Deep net to learn ranking loss surrogates
Several tasks in machine learning are evaluated using non-differentiable
metrics such as mean average precision or Spearman correlation. However, their
non-differentiability prevents their use as objective functions in a
learning framework. Surrogate and relaxation methods exist but tend to be
specific to a given metric.
In the present work, we introduce a new method to learn approximations of
such non-differentiable objective functions. Our approach is based on a deep
architecture that approximates the sorting of arbitrary sets of scores. It is
trained virtually for free using synthetic data. This sorting deep (SoDeep) net
can then be combined in a plug-and-play manner with existing deep
architectures. We demonstrate the value of our approach on three different
tasks that require ranking: cross-modal text-image retrieval, multi-label image
classification, and visual memorability ranking. Our approach yields very
competitive results on these three tasks, which validates the merit and the
flexibility of SoDeep as a proxy for the sorting operation in ranking-based losses.
Comment: Accepted to CVPR 2019
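A rough sketch of the idea, with assumed layer sizes and a plain MLP standing in for the paper's sorter: a small network learns, from synthetic data where true ranks are known, to map score vectors to normalized ranks, and can then serve as a frozen differentiable stand-in for sorting inside a ranking-based loss:

```python
import torch
import torch.nn as nn

class DifferentiableSorter(nn.Module):
    """Maps a vector of n_scores raw scores to predicted normalized
    ranks in [0, 1]; a differentiable proxy for the sort operation."""
    def __init__(self, n_scores=100, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_scores, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_scores), nn.Sigmoid())

    def forward(self, scores):
        return self.net(scores)

# training "virtually for free" on synthetic data: true ranks are known
sorter = DifferentiableSorter()
opt = torch.optim.Adam(sorter.parameters(), lr=1e-3)
for _ in range(100):  # a few illustrative steps
    s = torch.rand(32, 100)                               # random score vectors
    ranks = s.argsort(dim=1).argsort(dim=1).float() / 99  # normalized ranks
    loss = nn.functional.l1_loss(sorter(s), ranks)
    opt.zero_grad()
    loss.backward()
    opt.step()
```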
Triplet-Based Deep Hashing Network for Cross-Modal Retrieval
Given its benefits of low storage requirements and high retrieval
efficiency, hashing has recently received increasing attention. In
particular, cross-modal hashing has been widely and successfully used in
multimedia similarity search applications. However, almost all existing
cross-modal hashing methods cannot obtain powerful hash codes because they
ignore the relative similarity between heterogeneous data, which contains
richer semantic information, leading to unsatisfactory retrieval performance.
In this paper, we propose a triplet-based deep hashing (TDH) network for
cross-modal retrieval. First, we utilize triplet labels, which describe
the relative relationships among three instances, as supervision in order to
capture more general semantic correlations between cross-modal instances. We
then establish a loss function from the inter-modal view and the intra-modal
view to boost the discriminative abilities of the hash codes. Finally, graph
regularization is introduced into our proposed TDH method to preserve the
original semantic similarity between hash codes in Hamming space. Experimental
results show that our proposed method outperforms several state-of-the-art
approaches on two popular cross-modal datasets.
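As a hedged illustration of a triplet objective on relaxed hash codes (the margin value and the tanh relaxation are common choices in deep hashing, not necessarily TDH's exact formulation):

```python
import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin=2.0):
    """Triplet loss on relaxed hash codes: tanh keeps continuous codes
    near {-1, +1} during training, and the hinge term pulls the positive
    closer to the anchor than the negative by a margin. The anchor may
    come from one modality, positive/negative from the other."""
    a, p, n = torch.tanh(anchor), torch.tanh(positive), torch.tanh(negative)
    d_ap = (a - p).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_an = (a - n).pow(2).sum(dim=1)  # squared distance anchor-negative
    return F.relu(d_ap - d_an + margin).mean()

# at retrieval time, codes are binarized: sign() maps to {-1, +1}
codes = torch.sign(torch.tanh(torch.randn(4, 16)))
```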
JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features
Learning social media content is the basis of many real-world applications,
including information retrieval and recommendation systems, among others. In
contrast with previous works that focus mainly on single-modal or bi-modal
learning, we propose to learn social media content by jointly fusing textual,
acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU
and DCRNN, are proposed to extract fine-grained features from each modality. We
also introduce cross-modal fusion and attentive pooling techniques to integrate
multi-modal information comprehensively. Extensive experimental evaluation
conducted on real-world datasets demonstrates that our proposed model outperforms
the state-of-the-art approaches by a large margin.
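A rough sketch of attentive pooling over per-step modality features (the single-layer scorer and all sizes are assumptions; the attBiGRU/DCRNN extractors are not reproduced here):

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pools a sequence of modality features (e.g., word-level text
    states) into one vector: a learned scorer weights each step and the
    output is the attention-weighted sum."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                        # feats: (batch, steps, dim)
        w = torch.softmax(self.score(feats), dim=1)  # attention over steps
        return (w * feats).sum(dim=1)                # (batch, dim) pooled vector

# fused multimodal vector: concatenate pooled text/audio/visual features
pool = AttentivePooling()
t, a, v = (torch.randn(2, 10, 256) for _ in range(3))
fused = torch.cat([pool(t), pool(a), pool(v)], dim=1)
```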
A Survey on Food Computing
Food is essential to human life and fundamental to the human
experience. Food-related studies may support multifarious applications and
services, such as guiding human behavior, improving human health, and
understanding the culinary culture. With the rapid development of social
networks, mobile networks, and Internet of Things (IoT), people commonly
upload, share, and record food images, recipes, cooking videos, and food
diaries, leading to large-scale food data. Large-scale food data offers rich
knowledge about food and can help tackle many central issues of human society.
Therefore, it is time to group several disparate food-related issues under the
umbrella of food computing. Food computing acquires and analyzes heterogeneous food data from
disparate sources for perception, recognition, retrieval, recommendation, and
monitoring of food. In food computing, computational approaches are applied to
address food-related issues in medicine, biology, gastronomy, and agronomy. Both
large-scale food data and recent breakthroughs in computer science are
transforming the way we analyze food data. As a result, a vast amount of work has
been conducted in the food area, targeting different food-oriented tasks and
applications. However, there are very few systematic reviews that shape this
area well and provide a comprehensive and in-depth summary of current efforts
or detail open problems in this area. In this paper, we formalize food
computing and present such a comprehensive overview of various emerging
concepts, methods, and tasks. We summarize key challenges and future directions
ahead for food computing. This is the first comprehensive survey that targets
the study of computing technology for the food area and also offers a
collection of research studies and technologies to benefit researchers and
practitioners working in different food-related fields.
Comment: Accepted by ACM Computing Surveys
Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions
In this paper we tackle the problem of image search when the query is a short
textual description of the image the user is looking for. We choose to
implement the actual search process as a similarity search in a visual feature
space, by learning to translate a textual query into a visual representation.
Searching in the visual feature space has the advantage that updating the
translation model does not require reprocessing the typically huge image
collection on which the search is performed. We propose Text2Vis, a neural
network that generates a visual representation, in the visual feature space of
the fc6-fc7 layers of ImageNet, from a short descriptive text. Text2Vis
optimizes two loss functions, using a stochastic loss-selection method. A
visual-focused loss is aimed at learning the actual text-to-visual feature
mapping, while a text-focused loss is aimed at modeling the higher-level
semantic concepts expressed in language and countering the visual loss's
tendency to overfit on non-relevant visual components. We report preliminary
results on the MS-COCO dataset.
Comment: Neu-IR '16 SIGIR Workshop on Neural Information Retrieval, July 21, 2016, Pisa, Italy
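A minimal sketch of the stochastic loss-selection step; the two-headed model, the loss forms, and the 0.5 selection probability are assumptions, not Text2Vis's exact losses:

```python
import random
import torch.nn.functional as F

def text2vis_step(model, text, target_visual, text_target, p_visual=0.5):
    """One training step with stochastic loss selection: each step draws
    one of the two losses at random, so the model alternates between
    fitting visual-feature targets and preserving textual semantics."""
    pred_visual, pred_text = model(text)  # hypothetical two-headed model
    if random.random() < p_visual:
        # visual-focused loss: regress the fc6/fc7-style feature target
        return F.mse_loss(pred_visual, target_visual)
    # text-focused loss: reconstruct the textual semantics
    return F.mse_loss(pred_text, text_target)
```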
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role in flexibly finding
information across different modalities of data. Effectively measuring the
similarity between different modalities of data is the key to cross-modal
retrieval. Different modalities, such as image and text, have imbalanced and
complementary relationships, and carry unequal amounts of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions, and vice versa. Existing
works based on Deep Neural Networks (DNNs) mostly construct one common space for
different modalities to find the latent alignments between them, which loses
the exclusive modality-specific characteristics. Different from the existing
works, we propose a modality-specific cross-modal similarity measurement (MCSM)
approach that constructs an independent semantic space for each modality and
adopts an end-to-end framework to directly generate modality-specific cross-modal
similarities without an explicit common representation. For each semantic space,
the modality-specific characteristics within one modality are fully exploited by a
recurrent attention network, while the data of the other modality is projected
into this space with attention-based joint embedding; the learned attention
weights guide fine-grained cross-modal correlation learning, which can capture
the imbalanced and complementary relationships between different modalities.
Finally, the complementarity between the semantic spaces of different modalities
is explored by adaptive fusion of the modality-specific cross-modal similarities
to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, outperforming 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
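A tiny sketch of adaptively fusing the two modality-specific similarities with a learnable weight; the sigmoid-gated convex combination is an assumption about what such an adaptive fusion could look like, not MCSM's exact mechanism:

```python
import torch
import torch.nn as nn

class AdaptiveSimilarityFusion(nn.Module):
    """Fuses two modality-specific cross-modal similarities (one computed
    in the image semantic space, one in the text semantic space) with a
    learnable weight instead of a hand-tuned constant."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # unconstrained weight

    def forward(self, sim_img_space, sim_txt_space):
        w = torch.sigmoid(self.alpha)  # keep the fusion weight in (0, 1)
        return w * sim_img_space + (1 - w) * sim_txt_space
```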