Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction
We tackle the image question answering (ImageQA) problem by learning a
convolutional neural network (CNN) with a dynamic parameter layer whose weights
are determined adaptively based on questions. For the adaptive parameter
prediction, we employ a separate parameter prediction network, which consists
of a gated recurrent unit (GRU) taking a question as its input and a
fully-connected layer generating a set of candidate weights as its output.
However, it is challenging to construct a parameter prediction network for a
large number of parameters in the fully-connected dynamic parameter layer of
the CNN. We reduce the complexity of this problem by incorporating a hashing
technique, where the candidate weights given by the parameter prediction
network are selected using a predefined hash function to determine individual
weights in the dynamic parameter layer. The proposed network---the joint
network of the CNN for ImageQA and the parameter prediction network---is
trained end-to-end through back-propagation, with its weights initialized from
a pre-trained CNN and GRU. The proposed algorithm achieves state-of-the-art
performance on all available public ImageQA benchmarks.
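To make the hashing scheme concrete, here is a minimal sketch, assuming a
PyTorch setting: a GRU encodes the question, a linear layer predicts a small
vector of candidate weights, and a fixed hash with random signs expands those
candidates into the full weight matrix of the dynamic layer. Class names,
dimensions, and the sign trick are illustrative assumptions, not the authors'
implementation.

import torch
import torch.nn as nn

class ParamPredictionNet(nn.Module):
    # GRU over the question followed by a linear layer that emits the
    # candidate weights; all dimensions here are hypothetical.
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512, n_candidates=1024):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.fc = nn.Linear(hid_dim, n_candidates)

    def forward(self, question_tokens):          # (batch, seq_len) token ids
        _, h = self.gru(self.emb(question_tokens))
        return self.fc(h.squeeze(0))             # (batch, n_candidates)

class HashedDynamicFC(nn.Module):
    # Dynamic fully-connected layer: each of its out_dim * in_dim weights is
    # looked up from the candidate vector via a predefined hash (with random
    # signs), so only n_candidates values need to be predicted per question.
    def __init__(self, in_dim, out_dim, n_candidates, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        idx = torch.randint(n_candidates, (out_dim, in_dim), generator=g)
        sign = torch.randint(0, 2, (out_dim, in_dim), generator=g).float() * 2 - 1
        self.register_buffer("hash_idx", idx)    # fixed, never trained
        self.register_buffer("sign", sign)

    def forward(self, x, candidates):            # x: (batch, in_dim)
        # Expand (batch, n_candidates) into per-example weight matrices of
        # shape (batch, out_dim, in_dim) through the fixed hash lookup.
        w = candidates[:, self.hash_idx] * self.sign
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1)

Because the hash indices and signs are fixed buffers, the prediction network
only has to output n_candidates values instead of out_dim * in_dim, which is
the complexity reduction the abstract describes.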
Structure Learning for Neural Module Networks
Neural Module Networks, originally proposed for the task of visual question
answering, are a class of neural network architectures that involve
human-specified neural modules, each designed for a specific form of reasoning.
In current formulations of such networks, only the parameters of the neural
modules and/or the order of their execution are learned. In this work, we
further expand this approach and also learn the underlying internal structure
of modules, in terms of the ordering and combination of elementary arithmetic
operators. Our results show that one can indeed learn both the internal module
structure and the module sequencing simultaneously, without extra supervisory
signals for module execution order. With this approach, we report performance
comparable to models using hand-designed modules.
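As a rough illustration of learning a module's internal structure from
elementary arithmetic operators, the sketch below mixes a small candidate set
of element-wise operations with a learned softmax, so that back-propagation
alone can select the combination; the operator set, names, and dimensions are
assumptions for illustration, not the paper's exact search space.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftOperatorModule(nn.Module):
    # A module whose internal computation is a learned soft mixture of
    # elementary element-wise operators applied to its two inputs.
    def __init__(self, dim):
        super().__init__()
        self.op_logits = nn.Parameter(torch.zeros(4))  # one logit per operator
        self.proj = nn.Linear(dim, dim)

    def forward(self, a, b):                     # a, b: (batch, dim)
        ops = torch.stack([
            a + b,                   # element-wise sum
            a * b,                   # element-wise product
            torch.maximum(a, b),     # element-wise max
            a - b,                   # element-wise difference
        ])                                       # (4, batch, dim)
        w = F.softmax(self.op_logits, dim=0)     # soft choice of operator
        mixed = (w.view(-1, 1, 1) * ops).sum(0)  # weighted combination
        return torch.relu(self.proj(mixed))

If training concentrates the softmax on a single operator, the discovered
internal structure can simply be read off; the paper's actual operator
inventory and training details differ.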
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained
from large language or visual datasets has been successfully explored in recent
years. However, tasks such as visual question answering require combining these
vector representations with each other. Approaches to multimodal pooling
include element-wise product or sum, as well as concatenation of the visual and
textual representations. We hypothesize that these methods are not as
expressive as an outer product of the visual and textual vectors. As the outer
product is typically infeasible due to its high dimensionality, we instead
propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and
expressively combine multimodal features. We extensively evaluate MCB on the
visual question answering and grounding tasks. We consistently show the benefit
of MCB over ablations without MCB. For visual question answering, we present an
architecture which uses MCB twice, once for predicting attention over spatial
features and again to combine the attended representation with the question
representation. This model outperforms the state-of-the-art on the Visual7W
dataset and the VQA challenge.
Comment: Accepted to EMNLP 2016.
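A minimal sketch of compact bilinear pooling, assuming the standard Count
Sketch construction: each modality's vector is sketched with fixed random
hashes and signs, and the circular convolution of the two sketches, computed
in the FFT domain, approximates their outer product. Function names, the
output dimension, and the seed handling are illustrative assumptions.

import torch

def count_sketch(x, h, s, d):
    # Project x of shape (batch, n) down to d dims: each input coordinate i
    # adds s[i] * x[:, i] to output bucket h[i].
    sketch = torch.zeros(x.size(0), d, device=x.device)
    sketch.index_add_(1, h, x * s)
    return sketch

def mcb_pool(v, q, d=16000, seed=0):
    # Count-sketch both vectors, then multiply their spectra: element-wise
    # product in the FFT domain equals circular convolution of the sketches,
    # which approximates the (otherwise infeasible) outer product.
    g = torch.Generator().manual_seed(seed)
    h_v = torch.randint(d, (v.size(1),), generator=g)
    h_q = torch.randint(d, (q.size(1),), generator=g)
    s_v = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    s_q = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    sv, sq = count_sketch(v, h_v, s_v, d), count_sketch(q, h_q, s_q, d)
    return torch.fft.irfft(torch.fft.rfft(sv) * torch.fft.rfft(sq), n=d)

In practice the hash indices and signs would be sampled once and stored as
buffers; the paper also applies signed square root and L2 normalization to
the pooled vector before classification.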