Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval
Sketch-based image retrieval (SBIR) is challenging due to the inherent
domain-gap between sketch and photo. Compared with pixel-perfect depictions of
photos, sketches are highly abstract, iconic renderings of the real world.
Therefore, matching sketch and photo directly using low-level visual cues is
insufficient, since a common low-level subspace that traverses semantically
across the two modalities is non-trivial to establish. Most existing SBIR
studies do not directly tackle this cross-modal problem. This naturally
motivates us to explore the effectiveness of cross-modal retrieval methods in
SBIR, which have been successfully applied to image-text matching. In this
paper, we introduce and compare a series of state-of-the-art cross-modal
subspace learning methods and benchmark them on two recently released
fine-grained SBIR datasets. Through thorough examination of the experimental
results, we demonstrate that subspace learning can effectively model
the sketch-photo domain-gap. In addition, we draw a few key insights to drive
future research. Comment: Accepted by Neurocomputing
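A representative subspace-learning baseline benchmarked in such comparisons is canonical correlation analysis (CCA). The following is a minimal, illustrative sketch rather than the paper's exact pipeline: CCA learns a pair of projections from placeholder sketch and photo features into a shared subspace, and retrieval ranks photos by cosine similarity to the query sketch.

```python
# Minimal sketch: CCA as a cross-modal subspace baseline for sketch-photo retrieval.
# Feature dimensions and data are placeholders, not the paper's actual setup.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
sketch_feats = rng.normal(size=(500, 512))   # e.g. CNN features of sketches
photo_feats  = rng.normal(size=(500, 512))   # paired CNN features of photos

# Learn two projections that maximise correlation between the modalities.
cca = CCA(n_components=64, max_iter=1000)
cca.fit(sketch_feats, photo_feats)
sketch_sub, photo_sub = cca.transform(sketch_feats, photo_feats)

def l2norm(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

# Retrieve: rank photos by cosine similarity to each query sketch in the subspace.
sims = l2norm(sketch_sub) @ l2norm(photo_sub).T   # (num_sketches, num_photos)
ranking = np.argsort(-sims, axis=1)               # best-matching photo indices per sketch
```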
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions. Comment: 20 pages, 11 figures, 9 tables
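For the binary (hashing) family discussed above, retrieval reduces to comparing compact codes in a common Hamming space. The sketch below is only illustrative: the hash projections are random placeholders, whereas real cross-modal hashing methods learn them so that paired images and text receive similar codes.

```python
# Illustrative sketch of retrieval in a common Hamming space.
import numpy as np

rng = np.random.default_rng(0)
n_bits = 64
W_img = rng.normal(size=(4096, n_bits))   # image-side projection (placeholder, normally learned)
W_txt = rng.normal(size=(300, n_bits))    # text-side projection (placeholder, normally learned)

def to_code(features, W):
    """Project features and binarise to a {0,1} code of n_bits."""
    return (features @ W > 0).astype(np.uint8)

img_codes = to_code(rng.normal(size=(1000, 4096)), W_img)
txt_query = to_code(rng.normal(size=(1, 300)), W_txt)

# Hamming distance = number of differing bits; smaller means more similar.
hamming = np.count_nonzero(img_codes != txt_query, axis=1)
top10 = np.argsort(hamming)[:10]   # indices of the 10 closest images
```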
COBRA: Contrastive Bi-Modal Representation Algorithm
There are a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However,
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms, which preserve both inter- and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task-agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning seven benchmark
cross-modal datasets. Comment: 13 pages, 6 figures and 10 tables
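For context, a generic CPC/NCE-style contrastive objective over paired image-text embeddings looks roughly like the following InfoNCE sketch with in-batch negatives; this is an illustrative approximation, not COBRA's exact loss.

```python
# Minimal InfoNCE-style contrastive loss over paired image/text embeddings.
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of paired samples."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs lie on the diagonal; all other entries act as negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
```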
Modality-dependent Cross-media Retrieval
In this paper, we investigate cross-media retrieval between images and
text, i.e., using an image to search for text (I2T) and using text to search for images
(T2I). Existing cross-media retrieval methods usually learn one couple of
projections, by which the original features of images and text can be projected
into a common latent space to measure the content similarity. However, using
the same projections for the two different retrieval tasks (I2T and T2I) may
lead to a tradeoff between their respective performances, rather than their
best performances. Different from previous works, we propose a
modality-dependent cross-media retrieval (MDCR) model, where two couples of
projections are learned for different cross-media retrieval tasks instead of
one couple of projections. Specifically, by jointly optimizing the correlation
between images and text and the linear regression from one modal space (image
or text) to the semantic space, two couples of mappings are learned to project
images and text from their original feature spaces into two common latent
subspaces (one for I2T and the other for T2I). Extensive experiments show the
superiority of the proposed MDCR compared with other methods. In particular,
based on the 4,096-dimensional convolutional neural network (CNN) visual feature
and the 100-dimensional LDA textual feature, the proposed method achieves an mAP
of 41.5%, which is a new state-of-the-art performance on the Wikipedia
dataset. Comment: in ACM Transactions on Intelligent Systems and Technology
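A rough sketch of the idea behind one of MDCR's two couples of projections (the I2T side) is given below; the weights, optimizer, and exact formulation are assumptions rather than the paper's equations.

```python
# Hedged sketch in the spirit of MDCR's I2T objective: learn one couple of
# projections (U for images, V for text) by jointly minimising a correlation
# term between the projected modalities and a linear-regression term that ties
# the image projection to the semantic (label) space.
import torch

def learn_i2t_projections(X, Y, S, alpha=1.0, lam=1e-3, steps=500, lr=1e-2):
    """X: (n, dx) image features, Y: (n, dy) text features, S: (n, c) class indicator matrix."""
    U = (0.01 * torch.randn(X.size(1), S.size(1))).requires_grad_()
    V = (0.01 * torch.randn(Y.size(1), S.size(1))).requires_grad_()
    opt = torch.optim.Adam([U, V], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        corr = ((X @ U - Y @ V) ** 2).mean()   # keep paired image/text projections close
        regr = ((X @ U - S) ** 2).mean()       # anchor the image side in the semantic space
        loss = corr + alpha * regr + lam * (U.pow(2).sum() + V.pow(2).sum())
        loss.backward()
        opt.step()
    return U.detach(), V.detach()

# The T2I couple of projections would be learned analogously,
# with the regression term applied to the text side instead.
U, V = learn_i2t_projections(torch.randn(200, 4096), torch.randn(200, 100),
                             torch.eye(10).repeat(20, 1))
```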
Learning Robust Visual-Semantic Embeddings
Many of the existing methods for learning a joint embedding of images and text
use only supervised information from paired images and their textual attributes.
Taking advantage of the recent success of unsupervised learning in deep neural
networks, we propose an end-to-end learning framework that is able to extract
more robust multi-modal representations across domains. The proposed method
combines representation learning models (i.e., auto-encoders) together with
cross-domain learning criteria (i.e., Maximum Mean Discrepancy loss) to learn
joint embeddings for semantic and visual features. A novel technique of
unsupervised-data adaptation inference is introduced to construct more
comprehensive embeddings for both labeled and unlabeled data. We evaluate our
method on the Animals with Attributes and Caltech-UCSD Birds 200-2011 datasets with
a wide range of applications, including zero- and few-shot image recognition and
retrieval, from inductive to transductive settings. Empirically, we show that
our framework improves over the current state of the art on many of the
considered tasks. Comment: 12 pages
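The cross-domain criterion mentioned above, the Maximum Mean Discrepancy (MMD), can be illustrated with a minimal RBF-kernel estimate between two embedding batches; the kernel choice and bandwidth here are assumptions, not the paper's configuration.

```python
# Minimal RBF-kernel MMD^2 estimate between two batches of embeddings,
# usable as a cross-domain alignment loss between visual and semantic features.
import torch

def mmd_rbf(x, y, sigma=1.0):
    """x: (n, d), y: (m, d). Returns a scalar MMD^2 estimate."""
    def kernel(a, b):
        dist2 = torch.cdist(a, b, p=2) ** 2
        return torch.exp(-dist2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

loss = mmd_rbf(torch.randn(64, 128), torch.randn(64, 128))
```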
Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks
There has been an explosion of multimodal content generated on social media
networks in the last few years, which has necessitated a deeper understanding
of social media content and user behavior. We present a novel
content-independent content-user-reaction model for social multimedia content
analysis. Compared to prior works that generally tackle semantic content
understanding and user behavior modeling in isolation, we propose a generalized
solution to these problems within a unified framework. We embed users, images
and text drawn from open social media in a common multimodal geometric space,
using a novel loss function designed to cope with distant and disparate
modalities, thereby enabling seamless three-way retrieval. Our model not only
outperforms unimodal embedding based methods on cross-modal retrieval tasks but
also shows improvements stemming from jointly solving the two tasks on Twitter
data. We also show that the user embeddings learned within our joint multimodal
embedding model are better at predicting user interests compared to those
learned with unimodal content on Instagram data. Our framework thus goes beyond
the prior practice of using explicit leader-follower link information to
establish affiliations by extracting implicit content-centric affiliations from
isolated users. We provide qualitative results to show that the user clusters
emerging from learned embeddings have consistent semantics and the ability of
our model to discover fine-grained semantics from noisy and unstructured data.
Our work reveals that social multimedia content is inherently multimodal and
possesses a consistent structure because, in social networks, meaning is created
through interactions between users and content. Comment: Preprint submitted to IJC
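Once users, images, and text share one geometric space, three-way retrieval reduces to nearest-neighbour search between any pair of modalities. The sketch below illustrates only this retrieval step; the random embeddings stand in for the learned encoders, which the abstract does not specify.

```python
# Sketch of three-way retrieval once users, images and text share one space.
import numpy as np

rng = np.random.default_rng(0)
dim = 256
user_emb  = rng.normal(size=(1000, dim))   # placeholder learned user embeddings
image_emb = rng.normal(size=(5000, dim))   # placeholder learned image embeddings
text_emb  = rng.normal(size=(5000, dim))   # placeholder learned text embeddings

def normalize(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def retrieve(query, gallery, k=10):
    """Cosine-similarity nearest neighbours; works for any modality pair."""
    sims = normalize(query) @ normalize(gallery).T
    return np.argsort(-sims, axis=1)[:, :k]

images_for_user = retrieve(user_emb[:1], image_emb)   # user  -> image
texts_for_image = retrieve(image_emb[:1], text_emb)   # image -> text
users_for_text  = retrieve(text_emb[:1], user_emb)    # text  -> user
```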
Recent Advances in Zero-shot Recognition
With the recent renaissance of deep convolutional neural networks, encouraging
breakthroughs have been achieved on supervised recognition tasks, where
each class has sufficient and fully annotated training data. However, scaling
recognition to a large number of classes with few or no
training samples for each class remains an unsolved problem. One approach to
scaling up the recognition is to develop models capable of recognizing unseen
categories without any training instances, i.e., zero-shot recognition/learning.
This article provides a comprehensive review of existing zero-shot recognition
techniques, covering various aspects ranging from model representations to
datasets and evaluation settings. We also overview related recognition
tasks including one-shot and open set recognition which can be used as natural
extensions of zero-shot recognition when a limited number of class samples
becomes available or when zero-shot recognition is implemented in a real-world
setting. Importantly, we highlight the limitations of existing approaches and
point out future research directions in this new research area. Comment: accepted by IEEE Signal Processing Magazine
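One widely used zero-shot strategy covered by such reviews maps an image into a semantic space (attributes or word vectors) and assigns the nearest unseen-class prototype. The sketch below is a generic illustration of that idea, not a specific method from the article; the projection and prototypes are placeholders.

```python
# Generic prototype-based zero-shot classification sketch: embed the image into
# the semantic space and pick the nearest unseen-class prototype.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 300))                 # visual->semantic projection (placeholder, normally learned)
class_prototypes = rng.normal(size=(50, 300))    # semantic vectors of 50 unseen classes (placeholder)

def zero_shot_predict(img_feat):
    """img_feat: (2048,) visual feature. Returns index of the predicted unseen class."""
    sem = img_feat @ W
    sem = sem / (np.linalg.norm(sem) + 1e-8)
    protos = class_prototypes / np.linalg.norm(class_prototypes, axis=1, keepdims=True)
    return int(np.argmax(protos @ sem))

pred = zero_shot_predict(rng.normal(size=2048))
```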
Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval
This paper contributes a new large-scale dataset for weakly supervised
cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia,
NUS Wide and Flickr30k, have two major limitations. First, these datasets are
lacking in content diversity, i.e., only some pre-defined classes are covered.
Second, texts in these datasets are written in well-organized language, leading
to inconsistency with realistic applications. To overcome these drawbacks, the
proposed Twitter100k dataset is characterized by two aspects: 1) it has 100,000
image-text pairs randomly crawled from Twitter and thus has no constraint on
the image categories; 2) text in Twitter100k is written in informal language by
the users.
Since strongly supervised methods leverage the class labels that may be
missing in practice, this paper focuses on weakly supervised learning for
cross-media retrieval, in which only text-image pairs are exploited during
training. We extensively benchmark the performance of four subspace learning
methods and three variants of the Correspondence AutoEncoder, along with
various text features on Wikipedia, Flickr30k and Twitter100k. Novel insights
are provided. As a minor contribution, inspired by the characteristic of
Twitter100k, we propose an OCR-based cross-media retrieval method. In
experiments, we show that the proposed OCR-based method improves the baseline
performance.
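The abstract does not detail the OCR-based method, so the following is only a guess at the general idea: extract text embedded in an image with OCR and match it against tweet text, here via TF-IDF cosine similarity. The libraries and the matching scheme are assumptions, not the paper's implementation.

```python
# Hedged sketch of an OCR-based cross-media retrieval step: OCR the images,
# then rank them by text similarity to the query tweet.
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ocr_texts(image_paths):
    """Run OCR on each image and return the extracted strings."""
    return [pytesseract.image_to_string(Image.open(p)) for p in image_paths]

def rank_images_for_tweet(tweet, image_paths):
    """Rank images by TF-IDF cosine similarity between the tweet and their OCR'd text."""
    docs = [tweet] + ocr_texts(image_paths)
    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return sims.argsort()[::-1]   # best-matching image indices first
```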
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
Cross-modal retrieval between visual data and natural language description
remains a long-standing challenge in multimedia. While recent image-text
retrieval methods offer great promise by learning deep representations aligned
across modalities, most of these methods are plagued by the issue of training
with small-scale datasets covering a limited number of images with ground-truth
sentences. Moreover, it is extremely expensive to create a larger dataset by
annotating millions of images with sentences, and doing so may lead to a biased model.
Inspired by the recent success of webly supervised learning in deep neural
networks, we capitalize on readily-available web images with noisy annotations
to learn robust image-text joint representation. Specifically, our main idea is
to leverage web images and corresponding tags, along with fully annotated
datasets, in training for learning the visual-semantic joint embedding. We
propose a two-stage approach for the task that can augment a typical supervised
pair-wise ranking loss based formulation with weakly-annotated web images to
learn a more robust visual-semantic embedding. Experiments on two standard
benchmark datasets demonstrate that our method achieves a significant
performance gain in image-text retrieval compared to state-of-the-art
approaches. Comment: ACM Multimedia 2018
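The pair-wise ranking loss such joint embeddings typically build on can be sketched as a bidirectional max-margin objective with in-batch negatives; this generic version is illustrative and omits the paper's two-stage, webly supervised extensions.

```python
# Minimal bidirectional max-margin ranking loss with in-batch negatives.
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, txt_emb, margin=0.2):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                                 # (batch, batch) cosine similarities
    pos = scores.diag().view(-1, 1)                        # matched pairs on the diagonal
    cost_i2t = (margin + scores - pos).clamp(min=0)        # image as query, text negatives
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)    # text as query, image negatives
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)               # ignore the positive pairs themselves
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()

loss = ranking_loss(torch.randn(32, 512), torch.randn(32, 512))
```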
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models have adopted an
encoder-decoder framework. The framework consists of a convolutional neural
network (CNN)-based image encoder that extracts region-based visual features
from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words based on the visual features
with the attention mechanism. Despite the success of existing studies, current
methods only model the co-attention that characterizes the inter-modal
interactions while neglecting the self-attention that characterizes the
intra-modal interactions. Inspired by the success of the Transformer model in
machine translation, here we extend it to a Multimodal Transformer (MT) model
for image captioning. Compared to existing image captioning approaches, the MT
model simultaneously captures intra- and inter-modal interactions in a unified
attention block. Due to the in-depth modular composition of such attention
blocks, the MT model can perform complex multimodal reasoning and output
accurate captions. Moreover, to further improve the image captioning
performance, multi-view visual features are seamlessly introduced into the MT
model. We quantitatively and qualitatively evaluate our approach using the
benchmark MSCOCO image captioning dataset and conduct extensive ablation
studies to investigate the reasons behind its effectiveness. The experimental
results show that our method significantly outperforms the previous
state-of-the-art methods. With an ensemble of seven models, our solution ranks
first on the real-time leaderboard of the MSCOCO image captioning
challenge at the time of writing this paper. Comment: submitted to a journal
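A hedged sketch of what a unified attention block can look like: concatenate region features and word features into one sequence and apply standard multi-head self-attention, so intra-modal and inter-modal interactions are handled in a single block. Dimensions and layer choices below are assumptions, not the MT model's exact design.

```python
# Illustrative unified attention block over concatenated visual and word tokens.
import torch
import torch.nn as nn

class UnifiedAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_tokens, word_tokens):
        x = torch.cat([visual_tokens, word_tokens], dim=1)   # one joint sequence
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ff(x))
        n_vis = visual_tokens.size(1)
        return x[:, :n_vis], x[:, n_vis:]                    # split back per modality

block = UnifiedAttentionBlock()
vis_out, txt_out = block(torch.randn(2, 36, 512), torch.randn(2, 20, 512))
```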