4 research outputs found
Learning Answer Embeddings for Visual Question Answering
We propose a novel probabilistic model for visual question answering (Visual
QA). The key idea is to infer two sets of embeddings: one for the image and the
question jointly and the other for the answers. The learning objective is to
learn the best parameterization of those embeddings such that the correct
answer has the highest likelihood among all possible answers. In contrast to several existing approaches that treat Visual QA as multi-way classification, the
proposed approach takes the semantic relationships (as characterized by the
embeddings) among answers into consideration, instead of viewing them as
independent ordinal numbers. Thus, the learned embedding function can be used to embed answers that are unseen in the training dataset. These properties make the
approach particularly appealing for transfer learning for open-ended Visual QA,
where the source dataset on which the model is learned has limited overlap
with the target dataset in the space of answers. We have also developed
large-scale optimization techniques for applying the model to datasets with a
large number of answers, where the challenge is to properly normalize the
proposed probabilistic models. We validate our approach on several Visual QA
datasets and investigate its utility for transferring models across datasets.
The empirical results show that the approach performs well not only on in-domain learning but also on transfer learning.
Comment: Accepted at CVPR 2018
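A minimal sketch (in PyTorch, with assumed module names and dimensions) of the two-embedding idea described above: one network f embeds the image-question pair, another network g embeds candidate answers, and a softmax over their similarity scores makes the correct answer the most likely one. This illustrates the general scheme rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerEmbeddingVQA(nn.Module):
    def __init__(self, iq_dim, ans_vocab_size, ans_word_dim, embed_dim=512):
        super().__init__()
        self.f = nn.Linear(iq_dim, embed_dim)              # image+question -> joint space
        self.g = nn.Sequential(                            # answers -> the same space
            nn.Embedding(ans_vocab_size, ans_word_dim),
            nn.Linear(ans_word_dim, embed_dim),
        )

    def forward(self, iq_features, answer_ids):
        q = self.f(iq_features)                            # (B, D)
        a = self.g(answer_ids)                             # (K, D) candidate answers
        return q @ a.t()                                   # (B, K) similarity logits

model = AnswerEmbeddingVQA(iq_dim=2048, ans_vocab_size=3000, ans_word_dim=300)
iq = torch.randn(8, 2048)                                  # fused image+question features
candidates = torch.arange(3000)                            # all candidate answer ids
targets = torch.randint(0, 3000, (8,))                     # index of the correct answer
loss = F.cross_entropy(model(iq, candidates), targets)
# Cross-entropy normalizes over all candidates; with very large answer sets the
# normalization can be approximated by sampling a subset of negatives per batch,
# which is the practical concern the abstract's large-scale techniques address.
```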
Transfer Learning via Unsupervised Task Discovery for Visual Question Answering
We study how to leverage off-the-shelf visual and linguistic data to cope
with out-of-vocabulary answers in the visual question answering task. Existing
large-scale visual datasets with annotations such as image class labels,
bounding boxes and region descriptions are good sources for learning rich and
diverse visual concepts. However, it is not straightforward to capture these visual concepts and transfer them to visual question answering models, because of the missing link between question-dependent answering models and visual data that come without questions. We tackle this problem in two steps: 1) learning a task
conditional visual classifier, which is capable of solving diverse
question-specific visual recognition tasks, based on unsupervised task
discovery and 2) transferring the task conditional visual classifier to visual
question answering models. Specifically, we employ linguistic knowledge sources
such as a structured lexical database (e.g., WordNet) and visual descriptions for
unsupervised task discovery, and transfer a learned task conditional visual
classifier as an answering unit in a visual question answering model. We
empirically show that the proposed algorithm generalizes to out-of-vocabulary
answers successfully using the knowledge transferred from the visual dataset.
Comment: CVPR 2019
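As a rough illustration of the first step, the snippet below groups visual labels into question-agnostic "tasks" by a shared WordNet hypernym, which is one plausible reading of the unsupervised task discovery step; the depth-based grouping heuristic and function names are assumptions, not the paper's procedure. It requires the NLTK WordNet corpus.

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet data is installed

def discover_tasks(labels, depth=4):
    """Group labels that share an ancestor hypernym at a fixed WordNet depth."""
    tasks = defaultdict(list)
    for label in labels:
        synsets = wn.synsets(label, pos=wn.NOUN)
        if not synsets:
            continue
        path = synsets[0].hypernym_paths()[0]              # path from root to synset
        ancestor = path[min(depth, len(path) - 1)]
        tasks[ancestor.name()].append(label)
    return dict(tasks)

print(discover_tasks(["dog", "cat", "red", "blue", "car", "bus"]))
# e.g. colors, animals and vehicles may fall under different hypernym "tasks"
```

A visual classifier conditioned on such a discovered task could then be plugged into a VQA model as its answering unit, which is the transfer step the abstract describes.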
Visual Question Answering with Prior Class Semantics
We present a novel mechanism to embed prior knowledge in a model for visual
question answering. The open-set nature of the task is at odds with the
ubiquitous approach of training a fixed classifier. We show how to exploit
additional information pertaining to the semantics of candidate answers. We
extend the answer prediction process with a regression objective in a semantic
space, in which we project candidate answers using prior knowledge derived from
word embeddings. We perform an extensive study of learned representations with
the GQA dataset, revealing that important semantic information is captured in
the relations between embeddings in the answer space. Our method brings
improvements in consistency and accuracy over a range of question types.
Experiments with novel answers, unseen during training, indicate the method's
potential for open-set prediction.
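A minimal sketch of the kind of hybrid objective the abstract describes: a standard answer classifier combined with a regression head that pulls the visual-question features toward the word embedding of the ground-truth answer. The projection head, cosine-based regression term, and 0.5 weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsAwareHead(nn.Module):
    def __init__(self, feat_dim, num_answers, answer_word_vectors):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_answers)
        self.project = nn.Linear(feat_dim, answer_word_vectors.size(1))
        # Fixed prior: one word vector (e.g. GloVe) per candidate answer.
        self.register_buffer("answer_vectors", answer_word_vectors)

    def forward(self, features, target):
        logits = self.classifier(features)                  # usual classification branch
        cls_loss = F.cross_entropy(logits, target)
        # Regression branch: project features into the semantic space and pull
        # them toward the correct answer's word vector, so semantically close
        # answers stay close in that space.
        pred_vec = self.project(features)
        reg_loss = 1.0 - F.cosine_similarity(pred_vec, self.answer_vectors[target]).mean()
        return logits, cls_loss + 0.5 * reg_loss            # assumed loss weighting

head = SemanticsAwareHead(feat_dim=1024, num_answers=100,
                          answer_word_vectors=torch.randn(100, 300))
logits, loss = head(torch.randn(4, 1024), torch.randint(0, 100, (4,)))
```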
An Empirical Study on Leveraging Scene Graphs for Visual Question Answering
Visual question answering (Visual QA) has attracted significant attention
in recent years. While a variety of algorithms have been proposed, most of them are
built upon different combinations of image and language features as well as
multi-modal attention and fusion. In this paper, we investigate an alternative
approach inspired by conventional QA systems that operate on knowledge graphs.
Specifically, we investigate the use of scene graphs derived from images for
Visual QA: an image is abstractly represented by a graph with nodes
corresponding to object entities and edges to object relationships. We adapt
the recently proposed graph network (GN) to encode the scene graph and perform
structured reasoning according to the input question. Our empirical studies
demonstrate that scene graphs can already capture essential information of
images and graph networks have the potential to outperform state-of-the-art
Visual QA algorithms while using a much cleaner architecture. By analyzing the features generated by GNs, we can further interpret the reasoning process, suggesting a promising direction towards explainable Visual QA.
Comment: Accepted as oral presentation at BMVC 2019
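A rough sketch of question-conditioned message passing over a scene graph, in the spirit of the graph-network encoder described above: nodes carry object features, edges carry relationship features, and the question vector modulates the messages. The GRU-based node update, mean readout, and all names are assumptions rather than the exact GN formulation used in the paper.

```python
import torch
import torch.nn as nn

class SceneGraphReasoner(nn.Module):
    def __init__(self, node_dim, edge_dim, q_dim, hidden=256):
        super().__init__()
        self.msg = nn.Linear(node_dim + edge_dim + q_dim, hidden)
        self.update = nn.GRUCell(hidden, node_dim)

    def forward(self, nodes, edges, edge_index, question, steps=2):
        # nodes: (N, node_dim), edges: (E, edge_dim),
        # edge_index: (2, E) long tensor of (source, target) node ids,
        # question: (q_dim,) encoding of the input question
        src, dst = edge_index
        for _ in range(steps):
            q = question.unsqueeze(0).expand(edges.size(0), -1)
            messages = torch.relu(self.msg(torch.cat([nodes[src], edges, q], dim=-1)))
            agg = torch.zeros(nodes.size(0), messages.size(1))
            agg.index_add_(0, dst, messages)        # sum incoming messages per node
            nodes = self.update(agg, nodes)         # question-aware node update
        return nodes.mean(dim=0)                    # graph-level readout

reasoner = SceneGraphReasoner(node_dim=128, edge_dim=64, q_dim=256)
nodes = torch.randn(5, 128)                         # 5 object entities
edges = torch.randn(7, 64)                          # 7 relationships
edge_index = torch.randint(0, 5, (2, 7))            # which objects each edge connects
graph_repr = reasoner(nodes, edges, edge_index, torch.randn(256))
```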