Knowing Where to Look? Analysis on Attention of Visual Question Answering System
Attention mechanisms have been widely used in Visual Question Answering (VQA)
solutions due to their capacity to model deep cross-domain interactions.
Analyzing attention maps offers a perspective for identifying the limitations of current VQA systems and an opportunity to further improve them. In this paper,
we select two state-of-the-art VQA approaches with attention mechanisms to
study their robustness and disadvantages by visualizing and analyzing their
estimated attention maps. We find that both methods are sensitive to the input features and, at the same time, perform poorly on counting and multi-object questions. We believe these findings and the analytical method will help researchers identify crucial challenges on the way to improving their own VQA systems.
Comment: ECCV SiVL Workshop paper
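As an illustration of the kind of attention-map inspection described above, the following minimal sketch overlays a coarse attention grid on an image; the grid size, the nearest-neighbour upsampling, and the function name are assumptions for illustration, not details from the paper.

    import numpy as np
    import matplotlib.pyplot as plt

    def overlay_attention(image, attn, alpha=0.5):
        # image: (H, W, 3) array in [0, 1]; attn: (h, w) attention weights
        # from the model, e.g. a 14x14 grid summing to 1 (an assumption).
        h_img, w_img = image.shape[:2]
        rows = np.linspace(0, attn.shape[0], h_img, endpoint=False).astype(int)
        cols = np.linspace(0, attn.shape[1], w_img, endpoint=False).astype(int)
        heat = attn[np.ix_(rows, cols)]            # nearest-neighbour upsample
        heat = heat / (heat.max() + 1e-8)          # normalise for display
        plt.imshow(image)
        plt.imshow(heat, cmap="jet", alpha=alpha)  # heat map over the image
        plt.axis("off")
        plt.show()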
Dual Recurrent Attention Units for Visual Question Answering
Visual Question Answering (VQA) requires AI models to comprehend data in two
domains, vision and text. Current state-of-the-art models use learned attention
mechanisms to extract relevant information from the input domains to answer a
certain question. Thus, robust attention mechanisms are essential for powerful
VQA models. In this paper, we propose a recurrent attention mechanism and show
its benefits compared to the traditional convolutional approach. We perform two
ablation studies to evaluate recurrent attention. First, we introduce a
baseline VQA model with visual attention and test the performance difference
between convolutional and recurrent attention on the VQA 2.0 dataset. Secondly,
we design an architecture for VQA which utilizes dual (textual and visual)
Recurrent Attention Units (RAUs). Using this model, we show the effect of all
possible combinations of recurrent and convolutional dual attention. Our single
model outperforms the first place winner on the VQA 2016 challenge and to the
best of our knowledge, it is the second best performing single model on the VQA
1.0 dataset. Furthermore, our model noticeably improves upon the winner of the
VQA 2017 challenge. Moreover, we experiment replacing attention mechanisms in
state-of-the-art models with our RAUs and show increased performance.Comment: 8 pages, 5 figure
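For readers unfamiliar with the idea, the sketch below shows one plausible form of a recurrent attention unit, scoring region features with a GRU rather than 1x1 convolutions; the wiring and dimensions are assumptions, not the paper's exact RAU design.

    import torch
    import torch.nn as nn

    class RecurrentAttention(nn.Module):
        # Scores image regions with a GRU instead of the usual convolutional
        # attention; an illustrative sketch, not the paper's exact design.
        def __init__(self, v_dim, q_dim, hid_dim):
            super().__init__()
            self.gru = nn.GRU(v_dim + q_dim, hid_dim, batch_first=True)
            self.score = nn.Linear(hid_dim, 1)

        def forward(self, v, q):
            # v: (B, K, v_dim) region features; q: (B, q_dim) question encoding
            q_tiled = q.unsqueeze(1).expand(-1, v.size(1), -1)
            h, _ = self.gru(torch.cat([v, q_tiled], dim=-1))  # scan the regions
            alpha = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (B, K)
            return (alpha.unsqueeze(-1) * v).sum(dim=1)  # attended visual feature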
Reciprocal Attention Fusion for Visual Question Answering
Existing attention mechanisms either attend to local image grid or object
level features for Visual Question Answering (VQA). Motivated by the
observation that questions can relate to both object instances and their parts,
we propose a novel attention mechanism that jointly considers reciprocal
relationships between the two levels of visual details. The bottom-up attention
thus generated is further coalesced with the top-down information to focus only on the scene elements that are most relevant to a given question. Our design hierarchically fuses multi-modal information, i.e., language, object-level, and grid-level features, through an efficient tensor decomposition scheme. The
proposed model improves the state-of-the-art single model performances from
67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, demonstrating a
significant boost.
Comment: To appear in the British Machine Vision Conference (BMVC), September 2018
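The abstract names only "an efficient tensor decomposition scheme"; one common instance is low-rank bilinear (MFB-style) pooling, sketched below under that assumption.

    import torch
    import torch.nn as nn

    class LowRankBilinearFusion(nn.Module):
        # Low-rank bilinear pooling of question and visual features; the
        # rank and dimensions are illustrative, not taken from the paper.
        def __init__(self, q_dim, v_dim, out_dim, rank=5):
            super().__init__()
            self.q_proj = nn.Linear(q_dim, out_dim * rank)
            self.v_proj = nn.Linear(v_dim, out_dim * rank)
            self.out_dim, self.rank = out_dim, rank

        def forward(self, q, v):
            joint = self.q_proj(q) * self.v_proj(v)      # (B, out_dim * rank)
            joint = joint.view(-1, self.out_dim, self.rank)
            return joint.sum(dim=-1)                     # sum-pool over the rank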
Learning Visual Knowledge Memory Networks for Visual Question Answering
Visual question answering (VQA) requires joint comprehension of images and
natural language questions, where many questions cannot be directly or clearly
answered from visual content but require reasoning from structured human
knowledge with confirmation from visual content. This paper proposes visual
knowledge memory network (VKMN) to address this issue, which seamlessly
incorporates structured human knowledge and deep visual features into memory
networks in an end-to-end learning framework. Compared to existing methods that leverage external knowledge to support VQA, this paper stresses two missing mechanisms. The first is a mechanism for integrating visual content
with knowledge facts. VKMN handles this issue by embedding knowledge triples
(subject, relation, target) and deep visual features jointly into the visual
knowledge features. The second is a mechanism for handling the multiple knowledge facts expanded from question-answer pairs. VKMN stores the joint embeddings in a key-value pair structure in the memory networks, which makes multiple facts easy to handle. Experiments show that the proposed method achieves promising results on both the VQA v1.0 and v2.0 benchmarks, while outperforming state-of-the-art methods on knowledge-reasoning questions.
Comment: Supplementary to the CVPR 2018 version
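A generic key-value memory read of the kind VKMN builds on can be sketched as follows; the scaled dot-product addressing is an assumption, not the paper's exact formulation.

    import torch

    def key_value_read(query, keys, values):
        # query: (B, d); keys, values: (B, M, d) for M stored fact embeddings.
        scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)
        scores = scores / keys.size(-1) ** 0.5        # scale for stability
        p = torch.softmax(scores, dim=1)              # address the memory
        return torch.bmm(p.unsqueeze(1), values).squeeze(1)  # weighted value read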
Progressive Attention Memory Network for Movie Story Question Answering
This paper proposes the progressive attention memory network (PAMN) for movie
story question answering (QA). Movie story QA is more challenging than VQA in two respects: (1) pinpointing the temporal parts relevant to answering the question is difficult because movies are typically longer than an hour, and (2) the input comprises both video and subtitles, and different questions require different modalities to infer the answer. To overcome these challenges, PAMN involves three main
features: (1) progressive attention mechanism that utilizes cues from both
question and answer to progressively prune out irrelevant temporal parts in
memory, (2) dynamic modality fusion that adaptively determines the contribution
of each modality for answering the current question, and (3) belief correction
answering scheme that successively corrects the prediction score on each
candidate answer. Experiments on publicly available benchmark datasets, MovieQA
and TVQA, demonstrate that each feature contributes to our movie story QA
architecture, PAMN, and improves performance to achieve the state-of-the-art
result. Qualitative analysis by visualizing the inference mechanism of PAMN is
also provided.
Comment: CVPR 2019, Accepted
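Of the three features, dynamic modality fusion is the easiest to sketch: a question-conditioned gate weights the per-modality evidence. The gating network below is an assumption, not PAMN's exact design.

    import torch
    import torch.nn as nn

    class DynamicModalityFusion(nn.Module):
        # Question-conditioned weighting of video and subtitle evidence;
        # an illustrative sketch of the idea described above.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(dim, 2)

        def forward(self, q, video_feat, sub_feat):
            # q, video_feat, sub_feat: (B, dim)
            w = torch.softmax(self.gate(q), dim=-1)   # (B, 2) modality weights
            return w[:, :1] * video_feat + w[:, 1:] * sub_feat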
Inverse Visual Question Answering with Multi-Level Attentions
In this paper, we propose a novel deep multi-level attention model to address
inverse visual question answering. The proposed model generates regional visual
and semantic features at the object level and then enhances them with the
answer cue by using attention mechanisms. Two levels of multiple attentions are
employed in the model, including the dual attention at the partial question
encoding step and the dynamic attention at the next question word generation
step. We evaluate the proposed model on the VQA V1 dataset. It demonstrates
state-of-the-art performance in terms of multiple commonly used metrics.
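One plausible reading of the dynamic attention at the next-word generation step is sketched below; injecting the answer cue by adding it to the decoder state is an assumption, not the paper's exact formulation.

    import torch

    def answer_cued_attention(dec_state, answer_emb, regions):
        # dec_state, answer_emb: (B, d); regions: (B, K, d) object features.
        query = dec_state + answer_emb                # inject the answer cue
        scores = torch.bmm(regions, query.unsqueeze(-1)).squeeze(-1)  # (B, K)
        alpha = torch.softmax(scores, dim=1)
        return (alpha.unsqueeze(-1) * regions).sum(dim=1)  # attended features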
An Improved Attention for Visual Question Answering
We consider the problem of Visual Question Answering (VQA). Given an image
and a free-form, open-ended question expressed in natural language, the goal of a VQA system is to provide an accurate answer to the question with respect to the image. The task is challenging because it requires simultaneous and
intricate understanding of both visual and textual information. Attention,
which captures intra- and inter-modal dependencies, has emerged as perhaps the
most widely used mechanism for addressing these challenges. In this paper, we
propose an improved attention-based architecture to solve VQA. We incorporate
an Attention on Attention (AoA) module within encoder-decoder framework, which
is able to determine the relation between attention results and queries.
A standard attention module generates a weighted average for each query. The AoA module, in contrast, first generates an information vector and an attention gate from the attention results and the current context, and then applies a second attention step, producing the final attended information by multiplying the two. We also propose a multimodal fusion module to combine the visual and textual information; its goal is to dynamically decide how much information to draw from each modality. Extensive experiments on the VQA-v2 benchmark dataset show that our method achieves state-of-the-art performance.
Comment: 8 pages
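The AoA recipe described above translates almost directly into code; the layer sizes below are assumptions, while the gating-by-multiplication is as described.

    import torch
    import torch.nn as nn

    class AttentionOnAttention(nn.Module):
        # An information vector and a sigmoid attention gate are computed
        # from the attention result and the query context, then multiplied.
        def __init__(self, dim):
            super().__init__()
            self.info = nn.Linear(2 * dim, dim)
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, attn_result, query):
            x = torch.cat([attn_result, query], dim=-1)
            i = self.info(x)                  # information vector
            g = torch.sigmoid(self.gate(x))   # attention gate
            return i * g                      # final attended information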
Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
Visual Question Answering (VQA) has emerged as a Visual Turing Test to
validate the reasoning ability of AI agents. The pivot of existing VQA models is the joint embedding learned by combining the visual features from an
image and the semantic features from a given question. Consequently, a large
body of literature has focused on developing complex joint embedding strategies
coupled with visual attention mechanisms to effectively capture the interplay
between these two modalities. However, modelling the visual and semantic
features in a high dimensional (joint embedding) space is computationally
expensive, and more complex models often result in trivial improvements in the
VQA accuracy. In this work, we systematically study the trade-off between the
model complexity and the performance on the VQA task. VQA models have a diverse architecture comprising pre-processing, feature extraction, multimodal fusion, attention, and final classification stages. We specifically focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline. Our thorough experimental evaluation leads us to two proposals, one optimized for minimal complexity and the other optimized for state-of-the-art VQA performance.
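A back-of-the-envelope calculation makes the cost of multimodal fusion concrete; the feature dimensions below are hypothetical, not the paper's.

    # Rough parameter counts for two fusion strategies at assumed dimensions.
    d_v, d_q, d_out = 2048, 1024, 1024
    elementwise = d_v * d_out + d_q * d_out   # project both, then Hadamard product
    full_bilinear = d_v * d_q * d_out         # full bilinear interaction tensor
    print(f"element-wise fusion: {elementwise:,} parameters")    # ~3.1M
    print(f"full bilinear fusion: {full_bilinear:,} parameters") # ~2.1B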
Multimodal Unified Attention Networks for Vision-and-Language Interactions
Learning an effective attention mechanism for multimodal data is important in
many vision-and-language tasks that require a synergic understanding of both
the visual and textual contents. Existing state-of-the-art approaches use
co-attention models to associate each visual object (e.g., image region) with
each textual object (e.g., query word). Despite the success of these
co-attention models, they only model inter-modal interactions while neglecting
intra-modal interactions. Here we propose a general 'unified attention' model
that simultaneously captures the intra- and inter-modal interactions of
multimodal features and outputs their corresponding attended representations.
By stacking such unified attention blocks in depth, we obtain the deep
Multimodal Unified Attention Network (MUAN), which can seamlessly be applied to
the visual question answering (VQA) and visual grounding tasks. We evaluate our
MUAN models on two VQA datasets and three visual grounding datasets, and the
results show that MUAN achieves top-level performance on both tasks without
bells and whistles.
Comment: 11 pages, 7 figures
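A unified attention block can be sketched as self-attention over the concatenated visual and textual tokens, so that a single block covers intra- and inter-modal interactions; the head count and residual wiring below are assumptions, not MUAN's exact design.

    import torch
    import torch.nn as nn

    class UnifiedAttentionBlock(nn.Module):
        # Self-attention over both modalities jointly: every token attends
        # to all visual and textual tokens in one pass.
        def __init__(self, dim, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, v, t):
            # v: (B, K, dim) visual tokens; t: (B, L, dim) textual tokens
            x = torch.cat([v, t], dim=1)            # (B, K + L, dim)
            out, _ = self.attn(x, x, x)             # intra- and inter-modal
            x = self.norm(x + out)                  # residual connection
            return x[:, : v.size(1)], x[:, v.size(1):]  # split per modality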
Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering
Learning effective fusion of multi-modality features is at the heart of
visual question answering. We propose a novel method of dynamically fusing
multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information within and across the visual and language modalities. This robustly captures the high-level interactions between the language and vision domains and thus significantly improves the performance of visual question answering. We also show that the proposed
dynamic intra-modality attention flow conditioned on the other modality can
dynamically modulate the intra-modality attention of the target modality, which
is vital for multimodality feature fusion. Experimental evaluations on the VQA
2.0 dataset show that the proposed method achieves state-of-the-art VQA
performance. Extensive ablation studies are carried out for the comprehensive
analysis of the proposed method.
Comment: CVPR 2019 Oral
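The conditioned intra-modality attention can be sketched as follows; the sigmoid gating by a pooled summary of the other modality is an assumption, not the paper's exact mechanism.

    import torch

    def conditioned_intra_attention(x, other_summary, gate_proj):
        # x: (B, N, d) tokens of one modality; other_summary: (B, d) pooled
        # features of the other modality; gate_proj: an nn.Linear(d, d).
        g = torch.sigmoid(gate_proj(other_summary)).unsqueeze(1)  # (B, 1, d)
        gated = x * g                              # modulate by the other modality
        scores = torch.bmm(gated, gated.transpose(1, 2)) / x.size(-1) ** 0.5
        return torch.bmm(torch.softmax(scores, dim=-1), x)  # intra-modal attention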