A Novel Framework for Robustness Analysis of Visual QA Models
Deep neural networks play an essential role in many computer vision tasks, including Visual Question Answering (VQA). Until recently, the accuracy of these models was the main focus of research, but there is now a trend toward assessing their robustness against adversarial attacks by evaluating their tolerance to varying noise levels. In VQA, adversarial attacks can target the image and/or the main question, yet the latter has so far lacked proper analysis. In this work, we propose a flexible framework that focuses on the language part of VQA, using semantically relevant questions, dubbed basic questions, as controllable noise with which to evaluate the robustness of VQA models. We hypothesize that the level of noise is positively correlated with the similarity of a basic question to the main question. Hence, to apply noise to any given main question, we rank a pool of basic questions by their similarity to it, casting this ranking task as a LASSO optimization problem. We then propose a novel robustness measure, R_score, and two large-scale basic question datasets (BQDs) in order to standardize robustness analysis for VQA models.
Comment: Accepted by the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) as an oral paper.
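The LASSO-based ranking step lends itself to a brief sketch. Below is a minimal illustration, assuming sentence embeddings are already available: the basic-question embeddings form the columns of a dictionary A, the main-question embedding is the target b, and a sparse nonnegative coefficient vector ranks the pool. All names, dimensions, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: rank "basic questions" by similarity to a main question
# by solving a LASSO problem. Embeddings here are random stand-ins.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, n = 300, 50                      # embedding dim, pool size (toy values)
A = rng.standard_normal((d, n))     # columns: basic-question embeddings
b = rng.standard_normal(d)          # main-question embedding

# Solve min_x (1/2d)||A x - b||^2 + alpha ||x||_1 with x >= 0;
# the sparse solution selects the basic questions most similar to b.
lasso = Lasso(alpha=0.05, positive=True, max_iter=10_000)
lasso.fit(A, b)

ranking = np.argsort(-lasso.coef_)  # most similar basic questions first
print(ranking[:3], lasso.coef_[ranking[:3]])
```

The L1 penalty is what makes this a ranking rather than a dense regression: only a few basic questions receive nonzero weight, and their coefficient magnitudes order them by relevance as noise sources.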
VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
Rich, densely human-labeled datasets are among the main enabling factors for recent advances in vision-language understanding. Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understanding about the same visual scenes, and even the same set of images (e.g., those of COCO). Because these annotations and tasks are built on COCO, they are naturally correlated, and explicitly linking them up may significantly benefit both the individual tasks and unified vision-language modeling. We present preliminary work on linking the instance segmentations provided by COCO to the questions and answers (QAs) in the VQA dataset, and name the collected links visual questions and segmentation answers (VQS). They transfer human supervision between the previously separate tasks, offer more effective leverage on existing problems, and open the door to new research problems and models. We study two applications of the VQS data in this paper: supervised attention for VQA and a novel question-focused semantic segmentation task. For the former, we obtain state-of-the-art results on the VQA real multiple-choice task by simply augmenting multilayer perceptrons with attention features learned using the segmentation-QA links as explicit supervision. To put the latter in perspective, we study two plausible methods and compare them to an oracle method that assumes the instance segmentations are given at test time.
Comment: To appear at ICCV 2017.
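As one way to picture how segmentation-QA links could act as explicit supervision for attention, here is a hedged sketch in PyTorch: the instance mask is normalized into a spatial distribution and the model's attention map is pulled toward it with a KL-divergence loss. The grid size, tensor shapes, and loss choice are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of supervising a VQA attention map with a segmentation mask,
# in the spirit of the VQS links. Shapes and the 14x14 grid are assumptions.
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn_logits, seg_mask, eps=1e-8):
    """attn_logits: (B, H*W) unnormalized attention scores.
    seg_mask: (B, H, W) binary instance mask from a segmentation-QA link."""
    target = seg_mask.flatten(1).float() + eps          # avoid exact zeros
    target = target / target.sum(dim=1, keepdim=True)   # mask -> distribution
    log_attn = F.log_softmax(attn_logits, dim=1)
    # KL(target || attention) pushes attention mass onto the answer region.
    return F.kl_div(log_attn, target, reduction="batchmean")

attn = torch.randn(2, 14 * 14)
mask = torch.rand(2, 14, 14) > 0.8
print(attention_supervision_loss(attn, mask))
```

In training, such a term would simply be added to the usual answer-classification loss, so the attention features are learned under explicit human supervision rather than only indirectly through the answer labels.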
Multimodal Attention in Recurrent Neural Networks for Visual Question Answering
Visual Question Answering (VQA) is a task for evaluating a system's scene-understanding abilities and shortcomings, and for measuring machine intelligence in the visual domain. Given an image and a natural-language question about the image, the system must ground the question into
Robust explanations for visual question answering
In this paper, we propose a method to obtain robust explanations for visual question answering (VQA) that correlate well with the answers. Our model explains the answers obtained through a VQA model by providing visual and textual explanations. The main challenges we address are that (i) answers and textual explanations obtained by current methods are not well correlated, and (ii) current methods for visual explanation do not focus on the right location for explaining the answer. We address both challenges with a collaborative correlated module which ensures that, even without training against noise-based attacks, the enhanced correlation allows the right explanation and answer to be generated. We further show that this also improves the generated visual and textual explanations. The correlated module can be thought of as a robust way to verify that the answer and explanations are coherent. We evaluate this model on the VQA-X dataset and observe that the proposed method yields better textual and visual justifications in support of its decisions. We showcase the robustness of the model against a noise-based perturbation attack using the corresponding visual and textual explanations, and present a detailed empirical analysis. Source code for our model is available at \url{https://github.com/DelTA-Lab-IITK/CCM-WACV}.
Comment: WACV-2020 (Accepted)
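To make the idea of correlating answers with explanations concrete, here is a small hedged sketch: a loss that rewards agreement between an answer embedding and a textual-explanation embedding via cosine similarity. The embedding shapes and the cosine-based formulation are illustrative assumptions; the abstract does not specify the collaborative correlated module at this level of detail.

```python
# Hypothetical sketch of an answer-explanation correlation term. This is an
# illustration under assumed shapes, not the paper's actual CCM implementation.
import torch
import torch.nn.functional as F

def correlation_loss(answer_emb, expl_emb):
    """answer_emb, expl_emb: (B, D) embeddings from the answer head and the
    explanation decoder. Loss is low when the two representations agree."""
    return 1.0 - F.cosine_similarity(answer_emb, expl_emb, dim=1).mean()

a = torch.randn(4, 256)
e = torch.randn(4, 256)
print(correlation_loss(a, e))
```

A term of this kind, trained jointly with the answer and explanation objectives, gives one plausible reading of how enhanced correlation could make the explanations serve as a coherence check on the answers.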
- …