Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence 'template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art on both COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 201
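The template-then-fill idea in the abstract above can be illustrated with a toy sketch. This is not the authors' model (which is neural and end-to-end differentiable); the slot token format `<region-N>`, the function name `fill_template`, and the example detections are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def fill_template(template: List[str], detections: Dict[str, str]) -> str:
    """Replace each slot token with the label a detector assigned to that region.

    Tokens that are not slots (ordinary words) are passed through unchanged.
    """
    words = [detections.get(token, token) for token in template]
    return " ".join(words)


# Hypothetical sentence template whose slots are tied to image regions.
template = ["A", "<region-1>", "is", "sitting", "on", "a", "<region-2>"]

# Hypothetical detector output: one visual concept per region.
detections = {"<region-1>": "puppy", "<region-2>": "couch"}

caption = fill_template(template, detections)
print(caption)
```

Because the words in the slots come from the detector rather than from the language model, swapping the detections changes the caption without touching the template.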
VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
Rich and dense human-labeled datasets are among the main enabling factors for
the recent advances in vision-language understanding. Many seemingly distant
annotations (e.g., semantic segmentation and visual question answering (VQA))
are inherently connected in that they reveal different levels and perspectives
of human understanding of the same visual scenes, and even the same set
of images (e.g., of COCO). The popularity of COCO correlates those annotations
and tasks. Explicitly linking them up may significantly benefit both individual
tasks and unified vision and language modeling. We present preliminary
work on linking the instance segmentations provided by COCO to the questions
and answers (QAs) in the VQA dataset, and name the collected links visual
questions and segmentation answers (VQS). They transfer human supervision
between the previously separate tasks, offer more effective leverage to
existing problems, and also open the door for new research problems and models.
We study two applications of the VQS data in this paper: supervised attention
for VQA and a novel question-focused semantic segmentation task. For the
former, we obtain state-of-the-art results on the VQA real multiple-choice task
by simply augmenting the multilayer perceptrons with some attention features
that are learned using the segmentation-QA links as explicit supervision. To
put the latter in perspective, we study two plausible methods and compare them
to an oracle method assuming that the instance segmentations are given at the
test stage.
Comment: To appear at ICCV 201
Excitation Backprop for RNNs
Deep models are state-of-the-art for many vision tasks including video action
recognition and video captioning. Models are trained to caption or classify
activity in videos, but little is known about the evidence used to make such
decisions. Grounding decisions made by deep networks has been studied in
spatial visual content, giving more insight into model predictions for images.
However, such studies are relatively lacking for models of spatiotemporal
visual content - videos. In this work, we devise a formulation that
simultaneously grounds evidence in space and time, in a single pass, using
top-down saliency. We visualize the spatiotemporal cues that contribute to a
deep model's classification/captioning output using the model's internal
representation. Based on these spatiotemporal cues, we are able to localize
segments within a video that correspond with a specific action, or phrase from
a caption, without explicitly optimizing/training for these tasks.
Comment: CVPR 2018 Camera Ready Versio
Grounding semantics in robots for Visual Question Answering
In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and I evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.
ViP-CNN: Visual Phrase Guided Convolutional Neural Network
As an intermediate-level task connecting image captioning and object
detection, visual relationship detection has started to attract researchers'
attention because of its descriptive power and clear structure. It detects the
objects and captures their pair-wise interactions with a
subject-predicate-object triplet, e.g. person-ride-horse. In this paper, each
visual relationship is considered as a phrase with three components. We
formulate visual relationship detection as three interconnected
recognition problems and propose a Visual Phrase guided Convolutional Neural
Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a
Phrase-guided Message Passing Structure (PMPS) to establish the connection
among relationship components and help the model consider the three problems
jointly. A corresponding non-maximum suppression method and model training
strategy are also proposed. Experimental results show that our ViP-CNN
outperforms the state-of-the-art method in both speed and accuracy. We further
pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which is
found to perform better than pretraining on ImageNet for this task.
Comment: 10 pages, 5 figures, accepted by CVPR 201
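For readers unfamiliar with non-maximum suppression, the following is a minimal sketch of the standard greedy, box-level procedure. It is a generic illustration only, not the relationship-level variant the abstract says the authors propose; the names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already-kept box by more than `threshold`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of 81/119, which exceeds the threshold, so only the first and third boxes survive.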