Visual Question Answering with Memory-Augmented Networks
In this paper, we exploit a memory-augmented neural network to predict
accurate answers to visual questions, even when those answers occur rarely in
the training set. The memory network incorporates both internal and external
memory blocks and selectively pays attention to each training exemplar. We show
that memory-augmented neural networks are able to maintain a relatively
long-term memory of scarce training exemplars, which is important for visual
question answering due to the heavy-tailed distribution of answers in a general
VQA setting. Experimental results on two large-scale benchmark datasets show
the favorable performance of the proposed algorithm with a comparison to state
of the art. Comment: CVPR 201
Question Type Guided Attention in Visual Question Answering
Visual Question Answering (VQA) requires integration of feature maps with
drastically different structures and a focus on the correct regions. Image
descriptors have structures at multiple spatial scales, while lexical inputs
inherently follow a temporal sequence and naturally cluster into semantically
different question types. Many previous works use complex models to extract
feature representations but neglect high-level summary information such
as question types in learning. In this work, we propose Question Type-guided
Attention (QTA). It utilizes the information of question type to dynamically
balance between bottom-up and top-down visual features, respectively extracted
from ResNet and Faster R-CNN networks. We experiment with multiple VQA
architectures with extensive input ablation studies over the TDIUC dataset and
show that QTA systematically improves the performance by more than 5% across
multiple question type categories such as "Activity Recognition", "Utility" and
"Counting". By adding QTA to the state-of-the-art model MCB, we
achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task
extension to predict question types, which generalizes QTA to applications that
lack question types, with minimal performance loss.
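The idea of letting the question type balance two visual feature sources can be illustrated with a minimal sketch. This is not the authors' implementation: the per-type gate logits, the sigmoid gating, and the names qta_fuse and gate_logits are assumptions made for illustration, and in a real model the gates would be learned jointly with the rest of the network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def qta_fuse(resnet_feat, rcnn_feat, question_type, gate_logits):
    """Scale two visual feature vectors by question-type-specific gates.

    gate_logits is a hypothetical lookup table mapping a question type to
    one logit per feature source; it stands in for learned parameters.
    """
    logit_resnet, logit_rcnn = gate_logits[question_type]
    g_resnet = sigmoid(logit_resnet)
    g_rcnn = sigmoid(logit_rcnn)
    # Weight each source by its gate, then concatenate the two vectors.
    return [g_resnet * v for v in resnet_feat] + [g_rcnn * v for v in rcnn_feat]


# Toy gate values (assumed, not taken from the paper).
gate_logits = {
    "Counting": (-2.0, 2.0),
    "Activity Recognition": (2.0, -2.0),
}
fused = qta_fuse([1.0, 1.0], [1.0, 1.0], "Counting", gate_logits)
```

With these toy gates, a "Counting" question down-weights the ResNet features and up-weights the Faster R-CNN features before they are passed to the rest of the VQA model.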
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models. Comment: 25 page
Analyzing the Behavior of Visual Question Answering Models
Recently, a number of deep-learning based models have been proposed for the
task of Visual Question Answering (VQA). The performance of most models is
clustered around 60-70%. In this paper we propose systematic methods to analyze
the behavior of these models as a first step towards recognizing their
strengths and weaknesses, and identifying the most fruitful directions for
progress. We analyze two models, one from each of two major classes of VQA
models (with-attention and without-attention), and show the similarities and
differences in the behavior of these models. We also analyze the winning entry
of the VQA Challenge 2016.
Our behavior analysis reveals that despite recent progress, today's VQA
models are "myopic" (tend to fail on sufficiently novel instances), often "jump
to conclusions" (converge on a predicted answer after 'listening' to just half
the question), and are "stubborn" (do not change their answers across images). Comment: 13 pages, 20 figures; To appear in EMNLP 201
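One way to probe the "jump to conclusions" behavior is to feed a model growing prefixes of a question and record when its answer stops changing. The sketch below is an illustration under stated assumptions, not the paper's exact protocol: convergence_fraction and toy_model are hypothetical names, the model is any callable taking an image and a question string, and the question is assumed to be non-empty with single-space word separation.

```python
def convergence_fraction(model, image, question):
    """Fraction of question words after which the answer equals the
    full-question answer and no longer changes."""
    words = question.split()
    final = model(image, question)
    # answers[i] is the prediction for the prefix made of the first i + 1 words.
    answers = [model(image, " ".join(words[:k])) for k in range(1, len(words) + 1)]
    # Walk backwards to find the shortest prefix from which answers stay final.
    k = len(words)
    while k > 1 and answers[k - 2] == final:
        k -= 1
    return k / len(words)


def toy_model(image, question):
    """Hypothetical stand-in for a VQA model; it ignores the image."""
    q = question.lower()
    if q.startswith("what color"):
        return "white"
    if q.startswith("what"):
        return "table"
    return "yes"


frac = convergence_fraction(toy_model, None, "What color is the cat on the sofa")
```

Here the toy model settles on its final answer after the first two of eight words, so frac is 0.25; a value at or below 0.5 corresponds to converging after hearing at most half of the question.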