Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural
language processing, and have attracted the attention of many researchers.
Typical approaches encode an image into feature representations and decode
them into natural language sentences, but they neglect high-level semantic
concepts and the subtle relationships between image regions and natural
language elements. To make full use of this information, this paper exploits
text-guided attention and semantic-guided attention (SA) to find the most
relevant spatial information and reduce the semantic gap between vision and
language. Our method includes two attention networks: a text-guided attention
network that selects the text-related regions, and an SA network that
highlights the concept-related regions and the region-related concepts.
Finally, all of this information is incorporated to generate captions or
answers. Image captioning and visual question answering experiments were
carried out, and the results show the excellent performance of the proposed
approach.
Comment: 15 pages, 6 figures, 50 references
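The abstract describes the two attention networks only at a high level. As a rough illustration of the general idea behind a text-guided attention module (not the authors' exact architecture; the class name, layer sizes, and tensor shapes below are assumptions), a minimal additive-attention sketch in PyTorch could look like this:

```python
# Hypothetical sketch of text-guided spatial attention over image regions.
# Assumes region features of shape (batch, num_regions, dim) and a text
# query vector of shape (batch, dim); names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAttention(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(dim, hidden)   # project region features
        self.proj_t = nn.Linear(dim, hidden)   # project text query
        self.score = nn.Linear(hidden, 1)      # scalar relevance per region

    def forward(self, regions: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Additive attention: score each region against the text query.
        h = torch.tanh(self.proj_v(regions) + self.proj_t(text).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)        # (batch, num_regions, 1)
        return (alpha * regions).sum(dim=1)            # attended visual vector
```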
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence `template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art on both COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 201
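To make the slot-filling idea concrete, here is a purely illustrative post-processing sketch that replaces slot tokens in a generated template with detector labels. In the actual Neural Baby Talk model this step is learned jointly and end-to-end; the token format and function names here are assumptions.

```python
# Illustrative only: fill slot tokens in a generated sentence template with
# detector labels. Neural Baby Talk learns template generation and slot
# filling jointly; this sketch just shows the final substitution step.
from typing import Dict, List

def fill_template(template_tokens: List[str], region_labels: Dict[int, str]) -> str:
    """Replace tokens like '<region_3>' with the detected visual concept."""
    words = []
    for tok in template_tokens:
        if tok.startswith("<region_") and tok.endswith(">"):
            idx = int(tok[len("<region_"):-1])
            words.append(region_labels.get(idx, "object"))   # fallback word
        else:
            words.append(tok)
    return " ".join(words)

# Example: a template tied to two detected regions plus their labels.
print(fill_template("A <region_0> is sitting on a <region_1>".split(),
                    {0: "cat", 1: "couch"}))
# -> "A cat is sitting on a couch"
```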
ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
We propose a novel attention-based deep learning architecture for the visual
question answering (VQA) task. Given an image and an image-related natural
language question, VQA generates the natural language answer to the question.
Generating the correct answers requires the model's attention to focus on the
regions corresponding to the question, because different questions inquire
about the attributes of different image regions. We introduce an
attention-based configurable convolutional neural network (ABC-CNN) to learn
such question-guided attention. ABC-CNN determines an attention map for an
image-question pair by convolving the image feature map with configurable
convolutional kernels derived from the question's semantics. We evaluate the
ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR,
and the VQA dataset. The ABC-CNN model achieves significant improvements over
state-of-the-art methods on these datasets. The question-guided attention
generated by ABC-CNN is also shown to reflect the regions that are highly
relevant to the questions.
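As an illustration of question-configured convolution in the spirit of ABC-CNN (not the published implementation), the following sketch predicts a convolution kernel from the question embedding and convolves it with the image feature map to obtain a question-guided attention map; all shapes and layer names are assumptions.

```python
# Sketch of question-configured convolution for attention: a kernel is
# predicted from the question embedding and convolved with the image feature
# map, yielding a question-guided attention map over spatial locations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionConfiguredAttention(nn.Module):
    def __init__(self, q_dim: int, channels: int, k: int = 3):
        super().__init__()
        self.channels, self.k = channels, k
        # Predict one k x k x channels kernel per question.
        self.kernel_gen = nn.Linear(q_dim, channels * k * k)

    def forward(self, feat_map: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # feat_map: (batch, channels, H, W); q_emb: (batch, q_dim)
        b, c, h, w = feat_map.shape
        kernels = self.kernel_gen(q_emb).view(b, 1, c, self.k, self.k)
        maps = []
        for i in range(b):  # per-example convolution with its own kernel
            maps.append(F.conv2d(feat_map[i:i + 1], kernels[i], padding=self.k // 2))
        att = torch.cat(maps, dim=0)                               # (batch, 1, H, W)
        att = F.softmax(att.view(b, -1), dim=1).view(b, 1, h, w)   # spatial softmax
        return att * feat_map                                      # attention-weighted features
```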
Saliency-Guided Attention Network for Image-Sentence Matching
This paper studies the task of matching image and sentence, where learning
appropriate representations across the multi-modal data appears to be the main
challenge. Unlike previous approaches that predominantly deploy symmetrical
architecture to represent both modalities, we propose Saliency-guided Attention
Network (SAN) that asymmetrically employs visual and textual attention modules
to learn the fine-grained correlation intertwined between vision and language.
The proposed SAN mainly includes three components: saliency detector,
Saliency-weighted Visual Attention (SVA) module, and Saliency-guided Textual
Attention (STA) module. Concretely, the saliency detector provides the visual
saliency information as the guidance for the two attention modules. SVA is
designed to leverage the advantage of the saliency information to improve
discrimination of visual representations. By fusing the visual information from
SVA with textual information as multi-modal guidance, STA learns
discriminative textual representations that are highly sensitive to visual
cues. Extensive experiments demonstrate that SAN substantially improves the
state-of-the-art results on the benchmark Flickr30K and MSCOCO datasets.
Comment: 10 pages, 5 figures
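A minimal sketch of the saliency-weighting idea, assuming region features and a per-region saliency score from an off-the-shelf detector, might look as follows; this is a simplified reading of the abstract, and the module name and fusion choice are assumptions.

```python
# Sketch of saliency-weighted visual attention: an external saliency prior
# re-weights learned region relevance before pooling the region features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyWeightedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # learned per-region relevance

    def forward(self, regions: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim); saliency: (batch, num_regions), >= 0
        logits = self.score(regions).squeeze(-1) + torch.log(saliency + 1e-6)
        alpha = F.softmax(logits, dim=1)                           # (batch, num_regions)
        return torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)   # (batch, dim)
```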
Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text
Images and text in advertisements interact in complex, non-literal ways. The
two channels are usually complementary, with each channel telling a different
part of the story. Current approaches, such as image captioning methods, only
examine literal, redundant relationships, where image and text show exactly the
same content. To understand more complex relationships, we first collect a
dataset of advertisement interpretations for whether the image and slogan in
the same visual advertisement form a parallel (conveying the same message
without literally saying the same thing) or non-parallel relationship, with the
help of workers recruited on Amazon Mechanical Turk. We develop a variety of
features that capture the creativity of images and the specificity or ambiguity
of text, as well as methods that analyze the semantics within and across
channels. We show that our method outperforms standard image-text alignment
approaches on predicting the parallel/non-parallel relationship between image
and text.
Comment: To appear in BMVC201
Image captioning with weakly-supervised attention penalty
Stories are essential for genealogy research since they can help build
emotional connections with people. Many family stories are preserved in
historical photos and albums. Recent developments in image captioning make it
feasible to "tell stories" for photos automatically. The attention mechanism
has been widely adopted in state-of-the-art encoder-decoder image captioning
models, since it can bridge the gap between the visual part and the language
part. Most existing captioning models train their attention modules implicitly
with a word-likelihood loss. Meanwhile, many studies have investigated
intrinsic attention for visual models using gradient-based approaches.
Ideally, the attention maps predicted by a captioning model should be
consistent with the intrinsic attention of the visual model for any given
visual concept. However, no work has been done to align implicitly learned
attention maps with intrinsic visual attention. In this paper, we propose a
novel model that measures the consistency between the captioning model's
predicted attention and the intrinsic visual attention. This alignment loss
allows explicit attention correction without any expensive bounding box
annotations. We develop and evaluate our model on the COCO dataset as well as
a genealogical dataset from Ancestry.com Operations Inc., which contains
billions of historical photos. The proposed model achieves better performance
on all commonly used language evaluation metrics for both datasets.
Comment: 10 pages, 5 figures
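The consistency idea can be illustrated with a small penalty that pushes the decoder's predicted attention map toward a gradient-based (Grad-CAM-style) map from the visual backbone. The KL formulation below is an assumption and not necessarily the paper's exact loss.

```python
# Sketch of an attention-consistency penalty between the captioning model's
# predicted attention and an intrinsic, gradient-based visual attention map.
import torch
import torch.nn.functional as F

def attention_consistency_loss(pred_att: torch.Tensor,
                               intrinsic_att: torch.Tensor) -> torch.Tensor:
    """pred_att, intrinsic_att: (batch, num_regions), non-negative weights."""
    # Normalize both maps into probability distributions over regions.
    pred = pred_att / (pred_att.sum(dim=1, keepdim=True) + 1e-8)
    target = intrinsic_att / (intrinsic_att.sum(dim=1, keepdim=True) + 1e-8)
    # KL(target || pred), averaged over the batch; added to the word-likelihood loss.
    return F.kl_div(torch.log(pred + 1e-8), target, reduction="batchmean")
```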
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models have adopted an
encoder-decoder framework. The framework consists of a convolutional neural
network (CNN)-based image encoder that extracts region-based visual features
from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words from the visual features via
an attention mechanism. Despite the success of existing studies, current
methods only model the co-attention that characterizes the inter-modal
interactions while neglecting the self-attention that characterizes the
intra-modal interactions. Inspired by the success of the Transformer model in
machine translation, here we extend it to a Multimodal Transformer (MT) model
for image captioning. Compared to existing image captioning approaches, the MT
model simultaneously captures intra- and inter-modal interactions in a unified
attention block. Due to the in-depth modular composition of such attention
blocks, the MT model can perform complex multimodal reasoning and output
accurate captions. Moreover, to further improve the image captioning
performance, multi-view visual features are seamlessly introduced into the MT
model. We quantitatively and qualitatively evaluate our approach using the
benchmark MSCOCO image captioning dataset and conduct extensive ablation
studies to investigate the reasons behind its effectiveness. The experimental
results show that our method significantly outperforms the previous
state-of-the-art methods. With an ensemble of seven models, our solution
ranked first on the real-time leaderboard of the MSCOCO image captioning
challenge at the time of writing.
Comment: submitted to a journal
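As a sketch of how a single attention block can cover both cases, the module below performs self-attention when query and context come from the same modality and co-attention when they come from different modalities. The layer sizes and wiring are assumptions, not the MT model's published configuration.

```python
# Unified attention block: self-attention if query == context modality,
# co-attention if query and context come from different modalities.
import torch
import torch.nn as nn

class UnifiedAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query: (batch, len_q, dim); context: (batch, len_c, dim)
        att, _ = self.attn(query, context, context)
        x = self.norm1(query + att)          # residual + norm
        return self.norm2(x + self.ffn(x))   # position-wise feed-forward

# Self-attention over caption tokens:   block(words, words)
# Co-attention of words over regions:   block(words, regions)
```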
Translating Videos to Commands for Robotic Manipulation with Deep Recurrent Neural Networks
We present a new method to translate videos to commands for robotic
manipulation using Deep Recurrent Neural Networks (RNNs). Our framework first
extracts deep features from the input video frames with a deep Convolutional
Neural Network (CNN). Two RNN layers with an encoder-decoder architecture are
then used to encode the visual features and sequentially generate the output
words as the command. We demonstrate that the translation accuracy can be
improved by allowing a smooth transition between the two RNN layers and by
using a state-of-the-art feature extractor. The experimental results on our
new challenging dataset show that our approach outperforms recent methods by a
fair margin. Furthermore, we combine the proposed translation module with the
vision and planning system to let a robot perform various manipulation tasks.
Finally, we demonstrate the effectiveness of our framework on the full-size
humanoid robot WALK-MAN.
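A toy encoder-decoder sketch of the video-to-command idea, assuming per-frame CNN features are already extracted, is shown below; the GRU choice, sizes, and greedy decoding loop are assumptions rather than the paper's architecture.

```python
# Toy encoder-decoder: a sequence of per-frame CNN features is encoded by one
# RNN and decoded into a word sequence (the command) by a second RNN.
import torch
import torch.nn as nn

class Video2Command(nn.Module):
    def __init__(self, feat_dim: int, vocab: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats: torch.Tensor, max_len: int = 10) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from a pretrained CNN.
        _, state = self.encoder(frame_feats)                     # summarize the video
        tok = frame_feats.new_zeros(frame_feats.size(0), 1,
                                    dtype=torch.long)            # assume <BOS> id = 0
        words = []
        for _ in range(max_len):                                 # greedy decoding
            dec_out, state = self.decoder(self.embed(tok), state)
            tok = self.out(dec_out).argmax(dim=-1)               # next word id
            words.append(tok)
        return torch.cat(words, dim=1)                           # (batch, max_len) word ids
```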
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
Existing visual reasoning datasets, such as Visual Question Answering (VQA),
often suffer from biases conditioned on the question, image, or answer
distributions. The recently proposed CLEVR dataset addresses these limitations
and requires fine-grained reasoning but the dataset is synthetic and consists
of similar objects and sentence structures across the dataset.
In this paper, we introduce a new inference task, Visual Entailment (VE) -
consisting of image-sentence pairs whereby a premise is defined by an image,
rather than a natural language sentence as in traditional Textual Entailment
tasks. The goal of a trained VE model is to predict whether the image
semantically entails the text. To realize this task, we build a dataset SNLI-VE
based on the Stanford Natural Language Inference corpus and Flickr30k dataset.
We evaluate various existing VQA baselines and build an Explainable Visual
Entailment (EVE) model to address the VE task. EVE achieves up to 71%
accuracy and outperforms several other state-of-the-art VQA-based models.
Finally, we demonstrate the explainability of EVE through cross-modal attention
visualizations. The SNLI-VE dataset is publicly available at
https://github.com/necla-ml/SNLI-VE.
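For orientation, a baseline-style classifier for the three-way VE decision (entailment, neutral, contradiction) over pre-extracted image and hypothesis features might look like the sketch below; this is not the EVE architecture, and all names and dimensions are assumptions.

```python
# Minimal baseline sketch for Visual Entailment: fuse an image-premise feature
# with a text-hypothesis feature and predict one of three entailment labels.
import torch
import torch.nn as nn

class SimpleVEClassifier(nn.Module):
    def __init__(self, img_dim: int, txt_dim: int, hidden: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))   # entailment / neutral / contradiction

    def forward(self, img_feat: torch.Tensor, hyp_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (batch, img_dim); hyp_feat: (batch, txt_dim)
        return self.fuse(torch.cat([img_feat, hyp_feat], dim=-1))   # class logits
```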
Stacked Semantic-Guided Attention Model for Fine-Grained Zero-Shot Learning
Zero-Shot Learning (ZSL) is typically achieved by aligning the semantic
relationships between the global image feature vector and the corresponding
class semantic descriptions. However, using global features to represent
fine-grained images may lead to sub-optimal results since they neglect the
discriminative differences of local regions. Besides, different regions
contain distinct discriminative information, and the important regions should
contribute more to the prediction. To this end, we propose a novel stacked
semantics-guided attention (S2GA) model that obtains semantically relevant
features by using individual class semantic features to progressively guide
the visual features and generate an attention map that weights the importance
of different local regions. By feeding both the integrated visual features and
the class semantic features into a multi-class classification architecture,
the proposed framework can be trained end-to-end. Extensive experimental
results on the CUB and NABird datasets show that the proposed approach yields
consistent improvements on both fine-grained zero-shot classification and
retrieval tasks.
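One semantic-guided attention step can be sketched as follows: class semantic features score the local region features, and the weighted sum yields a semantics-aware visual representation. The stacking of multiple such steps and the exact fusion used in S2GA are not shown; all names here are assumptions.

```python
# Single semantic-guided attention step: class semantic features score local
# region features; the weighted sum is a semantics-aware visual representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedAttention(nn.Module):
    def __init__(self, v_dim: int, s_dim: int, hidden: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden)   # project region features
        self.proj_s = nn.Linear(s_dim, hidden)   # project class semantics

    def forward(self, regions: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, v_dim); semantics: (batch, s_dim)
        scores = (self.proj_v(regions) * self.proj_s(semantics).unsqueeze(1)).sum(-1)
        alpha = F.softmax(scores, dim=1).unsqueeze(-1)   # region weights
        return (alpha * regions).sum(dim=1)              # semantics-weighted feature
```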