54,429 research outputs found
FVQA: Fact-based Visual Question Answering
Visual Question Answering (VQA) has attracted a lot of attention in both
Computer Vision and Natural Language Processing communities, not least because
it offers insight into the relationships between two important sources of
information. Current datasets, and the models built upon them, have focused on
questions which are answerable by direct analysis of the question and image
alone. The set of such questions that require no external information to answer
is interesting, but very limited. It excludes questions which require common
sense, or basic factual knowledge to answer, for example. Here we introduce
FVQA, a VQA dataset which requires, and supports, much deeper reasoning. FVQA
only contains questions which require external information to answer.
We thus extend a conventional visual question answering dataset, which
contains image-question-answerg triplets, through additional
image-question-answer-supporting fact tuples. The supporting fact is
represented as a structural triplet, such as .
We evaluate several baseline models on the FVQA dataset, and describe a novel
model which is capable of reasoning about an image on the basis of supporting
facts.Comment: 16 page
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources
We propose a method for visual question answering which combines an internal
representation of the content of an image with information extracted from a
general knowledge base to answer a broad range of image-based questions. This
allows more complex questions to be answered using the predominant neural
network-based approach than has previously been possible. It particularly
allows questions to be asked about the contents of an image, even when the
image itself does not contain the whole answer. The method constructs a textual
representation of the semantic content of an image, and merges it with textual
information sourced from a knowledge base, to develop a deeper understanding of
the scene viewed. Priming a recurrent neural network with this combined
information, and the submitted question, leads to a very flexible visual
question answering approach. We are specifically able to answer questions posed
in natural language, that refer to information not contained in the image. We
demonstrate the effectiveness of our model on two publicly available datasets,
Toronto COCO-QA and MS COCO-VQA and show that it produces the best reported
results in both cases.Comment: Accepted to IEEE Conf. Computer Vision and Pattern Recognitio
- …