Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
Comment: 25 pages
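The common approach the survey mentions, mapping a CNN image feature and an RNN question feature into a shared space, can be sketched as follows. This is a minimal illustrative fusion (elementwise product after linear projections, one popular choice), not the survey's own implementation; all dimensions and weight matrices are assumptions.

```python
import numpy as np

# Assumed, illustrative dimensions (e.g. a 2048-d CNN pooling feature,
# a 512-d LSTM hidden state, a 1024-d joint space).
IMG_DIM, Q_DIM, JOINT_DIM = 2048, 512, 1024

rng = np.random.default_rng(0)
W_img = rng.standard_normal((IMG_DIM, JOINT_DIM)) * 0.01  # image projection
W_q = rng.standard_normal((Q_DIM, JOINT_DIM)) * 0.01      # question projection

def joint_embedding(img_feat, q_feat):
    """Project both modalities into a common feature space and fuse
    them with an elementwise product (one common fusion choice)."""
    v = np.tanh(img_feat @ W_img)  # image -> joint space
    q = np.tanh(q_feat @ W_q)      # question -> joint space
    return v * q                   # fused representation for the classifier

img_feat = rng.standard_normal(IMG_DIM)  # stand-in for a CNN feature
q_feat = rng.standard_normal(Q_DIM)      # stand-in for an RNN question encoding
fused = joint_embedding(img_feat, q_feat)
print(fused.shape)  # (1024,)
```

In a real system the fused vector would feed an answer classifier; the fusion operator (concatenation, product, bilinear pooling) is one of the axes along which the surveyed methods differ.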
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources
We propose a method for visual question answering which combines an internal
representation of the content of an image with information extracted from a
general knowledge base to answer a broad range of image-based questions. This
allows more complex questions to be answered using the predominant neural
network-based approach than has previously been possible. It particularly
allows questions to be asked about the contents of an image, even when the
image itself does not contain the whole answer. The method constructs a textual
representation of the semantic content of an image, and merges it with textual
information sourced from a knowledge base, to develop a deeper understanding of
the scene viewed. Priming a recurrent neural network with this combined
information, and the submitted question, leads to a very flexible visual
question answering approach. We are specifically able to answer questions posed
in natural language, that refer to information not contained in the image. We
demonstrate the effectiveness of our model on two publicly available datasets,
Toronto COCO-QA and MS COCO-VQA and show that it produces the best reported
results in both cases.
Comment: Accepted to IEEE Conf. Computer Vision and Pattern Recognition
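The priming idea described in this abstract, feeding a recurrent network the combined image/knowledge-base text before the question, can be sketched with a toy vanilla RNN. Everything here (dimensions, the random stand-in embeddings, the example caption and facts) is an assumption for illustration, not the paper's actual model.

```python
import numpy as np

EMB, HID = 64, 64
rng = np.random.default_rng(2)
W_xh = rng.standard_normal((EMB, HID)) * 0.1
W_hh = rng.standard_normal((HID, HID)) * 0.1

def rnn_consume(h, vectors):
    """Run a minimal vanilla RNN over a list of input vectors and
    return the final hidden state."""
    for x in vectors:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

def embed(tokens):
    """Random stand-in word embeddings (a real model would look
    these up in a trained embedding table)."""
    return [rng.standard_normal(EMB) for _ in tokens]

# Hypothetical combined text: image caption plus a retrieved KB fact.
caption_and_kb = "a red double decker bus ; buses carry passengers".split()
question = "what does this vehicle carry ?".split()

h = np.zeros(HID)
h = rnn_consume(h, embed(caption_and_kb))  # prime with image + KB text
h = rnn_consume(h, embed(question))        # then feed the submitted question
# h now conditions answer generation
print(h.shape)  # (64,)
```

The point of the sketch is the ordering: because the RNN state carries everything it has read, the answer is conditioned jointly on image content, external knowledge, and the question.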
A Knowledge-Grounded Multimodal Search-Based Conversational Agent
Multimodal search-based dialogue is a challenging new task: It extends
visually grounded question answering systems into multi-turn conversations with
access to an external database. We address this new challenge by learning a
neural response generation system from the recently released Multimodal
Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded
multimodal conversational model where an encoded knowledge base (KB)
representation is appended to the decoder input. Our model substantially
outperforms strong baselines in terms of text-based similarity measures (over 9
BLEU points, 3 of which are solely due to the use of additional information
from the KB).
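The mechanism this abstract names, appending an encoded knowledge-base representation to the decoder input, can be sketched as a simple concatenation per time step. The dimensions and the idea of broadcasting one KB vector across all steps are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Illustrative dimensions: 300-d token embeddings, 128-d KB encoding.
EMB_DIM, KB_DIM = 300, 128

rng = np.random.default_rng(1)
token_embs = rng.standard_normal((7, EMB_DIM))  # decoder inputs, one row per step
kb_vec = rng.standard_normal(KB_DIM)            # encoded KB representation

# Append the same KB encoding to every decoder time step, so each
# generation step is conditioned on the external knowledge.
decoder_inputs = np.concatenate(
    [token_embs, np.tile(kb_vec, (token_embs.shape[0], 1))], axis=1
)
print(decoder_inputs.shape)  # (7, 428)
```

The decoder's input projection then simply grows from `EMB_DIM` to `EMB_DIM + KB_DIM`; the rest of the response-generation architecture is unchanged.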