Analyzing the Behavior of Visual Question Answering Models
Recently, a number of deep-learning based models have been proposed for the
task of Visual Question Answering (VQA). The performance of most models is
clustered around 60-70%. In this paper we propose systematic methods to analyze
the behavior of these models as a first step towards recognizing their
strengths and weaknesses, and identifying the most fruitful directions for
progress. We analyze two models, one each from two major classes of VQA models
-- with-attention and without-attention and show the similarities and
differences in the behavior of these models. We also analyze the winning entry
of the VQA Challenge 2016.
Our behavior analysis reveals that despite recent progress, today's VQA
models are "myopic" (tend to fail on sufficiently novel instances), often "jump
to conclusions" (converge on a predicted answer after 'listening' to just half
the question), and are "stubborn" (do not change their answers across images).Comment: 13 pages, 20 figures; To appear in EMNLP 201
STARC: Structured Annotations for Reading Comprehension
We present STARC (Structured Annotations for Reading Comprehension), a new
annotation framework for assessing reading comprehension with multiple choice
questions. Our framework introduces a principled structure for the answer
choices and ties them to textual span annotations. The framework is implemented
in OneStopQA, a new high-quality dataset for evaluation and analysis of reading
comprehension in English. We use this dataset to demonstrate that STARC can be
leveraged for a key new application for the development of SAT-like reading
comprehension materials: automatic annotation quality probing via span ablation
experiments. We further show that it enables in-depth analyses and comparisons
between machine and human reading comprehension behavior, including error
distributions and guessing ability. Our experiments also reveal that the
standard multiple choice dataset in NLP, RACE, is limited in its ability to
measure reading comprehension. 47% of its questions can be guessed by machines
without accessing the passage, and 18% are unanimously judged by humans as not
having a unique correct answer. OneStopQA provides an alternative test set for
reading comprehension which alleviates these shortcomings and has a
substantially higher human ceiling performance.Comment: ACL 2020. OneStopQA dataset, STARC guidelines and human experiments
data are available at https://github.com/berzak/onestop-q
A Novel Framework for Robustness Analysis of Visual QA Models
Deep neural networks have been playing an essential role in many computer
vision tasks including Visual Question Answering (VQA). Until recently, the
study of their accuracy was the main focus of research but now there is a trend
toward assessing the robustness of these models against adversarial attacks by
evaluating their tolerance to varying noise levels. In VQA, adversarial attacks
can target the image and/or the proposed main question and yet there is a lack
of proper analysis of the later. In this work, we propose a flexible framework
that focuses on the language part of VQA that uses semantically relevant
questions, dubbed basic questions, acting as controllable noise to evaluate the
robustness of VQA models. We hypothesize that the level of noise is positively
correlated to the similarity of a basic question to the main question. Hence,
to apply noise on any given main question, we rank a pool of basic questions
based on their similarity by casting this ranking task as a LASSO optimization
problem. Then, we propose a novel robustness measure, R_score, and two
large-scale basic question datasets (BQDs) in order to standardize robustness
analysis for VQA models.Comment: Accepted by the Thirty-Third AAAI Conference on Artificial
Intelligence, (AAAI-19), as an oral pape
Modeling Human Visual Search Performance on Realistic Webpages Using Analytical and Deep Learning Methods
Modeling visual search not only offers an opportunity to predict the
usability of an interface before actually testing it on real users, but also
advances scientific understanding about human behavior. In this work, we first
conduct a set of analyses on a large-scale dataset of visual search tasks on
realistic webpages. We then present a deep neural network that learns to
predict the scannability of webpage content, i.e., how easy it is for a user to
find a specific target. Our model leverages both heuristic-based features such
as target size and unstructured features such as raw image pixels. This
approach allows us to model complex interactions that might be involved in a
realistic visual search task, which can not be easily achieved by traditional
analytical models. We analyze the model behavior to offer our insights into how
the salience map learned by the model aligns with human intuition and how the
learned semantic representation of each target type relates to its visual
search performance.Comment: the 2020 CHI Conference on Human Factors in Computing System
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Visual question answering requires high-order reasoning about an image, which
is a fundamental capability needed by machine systems to follow complex
directives. Recently, modular networks have been shown to be an effective
framework for performing visual reasoning tasks. While modular networks were
initially designed with a degree of model transparency, their performance on
complex visual reasoning benchmarks was lacking. Current state-of-the-art
approaches do not provide an effective mechanism for understanding the
reasoning process. In this paper, we close the performance gap between
interpretable models and state-of-the-art visual reasoning methods. We propose
a set of visual-reasoning primitives which, when composed, manifest as a model
capable of performing complex reasoning tasks in an explicitly-interpretable
manner. The fidelity and interpretability of the primitives' outputs enable an
unparalleled ability to diagnose the strengths and weaknesses of the resulting
model. Critically, we show that these primitives are highly performant,
achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show
that our model is able to effectively learn generalized representations when
provided a small amount of data containing novel object attributes. Using the
CoGenT generalization task, we show more than a 20 percentage point improvement
over the current state of the art.Comment: CVPR 2018 pre-prin
- …