An Empirical Study on the Language Modal in Visual Question Answering
Generalization beyond in-domain experience to out-of-distribution data is of
paramount significance in the AI domain. Of late, state-of-the-art Visual
Question Answering (VQA) models have shown impressive performance on in-domain
data, partially due to the language priors bias which, however, hinders the
generalization ability in practice. This paper attempts to provide new insights
into the influence of language modality on VQA performance from an empirical
study perspective. To achieve this, we conducted a series of experiments on six
models. The results of these experiments revealed that: 1) apart from the
prior bias caused by question types, postfix-related bias also plays a notable
role in inducing language priors, and 2) training VQA models with
word-sequence-related variant questions improved performance on the
out-of-distribution benchmark, with LXMERT even achieving a 10-point gain
without adopting any debiasing methods. We delved into the underlying reasons
behind these
experimental results and put forward some simple proposals to reduce the
models' dependency on language priors. The experimental results demonstrated
the effectiveness of our proposed method in improving performance on the
out-of-distribution benchmark, VQA-CPv2. We hope this study can inspire novel
insights for future research on designing bias-reduction approaches.
Comment: Accepted by IJCAI202
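The "word-sequence-related variant questions" idea above can be illustrated with a small sketch. The function name and the uniform-shuffle strategy are assumptions for illustration; the paper's actual variant-generation procedure may differ.

```python
import random

def word_sequence_variants(question, n_variants=3, seed=0):
    """Generate word-order-shuffled variants of a question for augmentation.

    Hypothetical sketch of training with word-sequence-related variant
    questions; the study's exact augmentation strategy is not specified here.
    """
    rng = random.Random(seed)
    words = question.rstrip("?").split()
    variants = []
    for _ in range(n_variants):
        shuffled = words[:]      # copy so the original order is kept
        rng.shuffle(shuffled)
        variants.append(" ".join(shuffled) + "?")
    return variants

print(word_sequence_variants("What color is the cat?"))
```

Each variant keeps the question's words but scrambles their order, so a model that leans on word-sequence priors sees conflicting surface patterns for the same answer.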
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Out-of-distribution (OOD) testing is increasingly popular for evaluating a
machine learning system's ability to generalize beyond the biases of a training
set. OOD benchmarks are designed to present a different joint distribution of
data and labels between training and test time. VQA-CP has become the standard
OOD benchmark for visual question answering, but we discovered three troubling
practices in its current use. First, most published methods rely on explicit
knowledge of the construction of the OOD splits. They often rely on
``inverting'' the distribution of labels, e.g. answering mostly 'yes' when the
common training answer is 'no'. Second, the OOD test set is used for model
selection. Third, a model's in-domain performance is assessed after retraining
it on in-domain splits (VQA v2) that exhibit a more balanced distribution of
labels. These three practices defeat the objective of evaluating
generalization, and put into question the value of methods specifically
designed for this dataset. We show that embarrassingly-simple methods,
including one that generates answers at random, surpass the state of the art on
some question types. We provide short- and long-term solutions to avoid these
pitfalls and realize the benefits of OOD evaluation.
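The "embarrassingly simple" baseline class the abstract mentions can be sketched as follows. The function name and the per-question-type uniform sampling are illustrative assumptions; the paper defines its own baselines.

```python
import random
from collections import defaultdict

def build_random_baseline(train_data, seed=0):
    """Answer-at-random baseline: for each question type, sample uniformly
    from the answers observed with that type at training time.

    A hypothetical sketch of a random-answer baseline; on an OOD split with
    an inverted label distribution, even this can beat methods tuned to the
    split's construction.
    """
    rng = random.Random(seed)
    answers_by_type = defaultdict(set)
    for qtype, answer in train_data:
        answers_by_type[qtype].add(answer)

    def predict(qtype):
        # Sorted for deterministic ordering before sampling.
        return rng.choice(sorted(answers_by_type[qtype]))

    return predict

predict = build_random_baseline([("yes/no", "yes"), ("yes/no", "no"),
                                 ("color", "red")])
print(predict("color"))  # only one candidate answer for this type
```

Because the baseline ignores the image entirely, any benchmark it can do well on is measuring split construction rather than generalization.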
Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances
Despite the great progress of Visual Question Answering (VQA), current VQA
models heavily rely on the superficial correlation between the question type
and its corresponding frequent answers (i.e., language priors) to make
predictions, without really understanding the input. In this work, we define
the training instances with the same question type but different answers as
"superficially similar instances", and attribute the language priors to
the confusion of VQA model on such instances. To solve this problem, we propose
a novel training framework that explicitly encourages the VQA model to
distinguish between the superficially similar instances. Specifically, for each
training instance, we first construct a set that contains its superficially
similar counterparts. Then we exploit the proposed distinguishing module to
increase the distance between the instance and its counterparts in the answer
space. In this way, the VQA model is forced to further focus on the other parts
of the input beyond the question type, which helps to overcome the language
priors. Experimental results show that our method achieves the state-of-the-art
performance on VQA-CP v2. Codes are available at
https://github.com/wyk-nku/Distinguishing-VQA.git.
Comment: Published in COLING 202
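The distinguishing module described above can be sketched as a margin loss in the answer space. The function name, the Euclidean distance, and the margin value are assumptions for illustration, not the paper's exact formulation.

```python
import math

def distinguishing_loss(anchor_probs, counterpart_probs, margin=0.5):
    """Hinge loss pushing an instance's answer distribution away from its
    superficially similar counterparts (same question type, different answer).

    Illustrative sketch of a distance-increasing objective; the distance
    metric and margin here are assumed, not taken from the paper.
    """
    losses = []
    for cp in counterpart_probs:
        dist = math.dist(anchor_probs, cp)       # Euclidean distance
        losses.append(max(0.0, margin - dist))   # penalize only close pairs
    return sum(losses) / len(losses)

# Identical distributions incur the full margin penalty; well-separated
# distributions incur none.
print(distinguishing_loss([0.5, 0.5], [[0.5, 0.5]]))
print(distinguishing_loss([1.0, 0.0], [[0.0, 1.0]]))
```

Minimizing this term forces counterpart predictions apart, so the model cannot answer both instances from the question type alone.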
Estimating semantic structure for the VQA answer space
Since its appearance, Visual Question Answering (VQA, i.e., answering a
question posed over an image) has always been treated as a classification
problem over a set of predefined answers. Despite its convenience, this
classification approach poorly reflects the semantics of the problem, limiting
answers to a choice between independent proposals without taking into
account the similarity between them (e.g., equally penalizing for answering cat
or German shepherd instead of dog). We address this issue by proposing (1) two
measures of proximity between VQA classes, and (2) a corresponding loss which
takes into account the estimated proximity. This significantly improves the
generalization of VQA models by reducing their language bias. In particular, we
show that our approach is completely model-agnostic since it allows consistent
improvements with three different VQA models. Finally, by combining our method
with a language bias reduction approach, we report SOTA-level performance on
the challenging VQAv2-CP dataset.
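A proximity-aware loss of the kind described above can be sketched as cross-entropy against a soft target built from a class-proximity matrix. The function name and the row-normalization scheme are illustrative assumptions; the paper proposes its own proximity measures and loss.

```python
import math

def proximity_soft_ce(logits, target_idx, proximity):
    """Cross-entropy against a soft target derived from answer-class
    proximity, so near-miss answers (e.g. "German shepherd" for "dog")
    are penalized less than unrelated ones.

    Illustrative sketch; the proximity matrix is assumed given and its
    rows are normalized into soft labels here.
    """
    row = proximity[target_idx]
    total = sum(row)
    target = [p / total for p in row]            # soft label distribution
    m = max(logits)                              # stable log-sum-exp
    logz = m + math.log(sum(math.exp(l - m) for l in logits))
    logp = [l - logz for l in logits]            # log-softmax
    return -sum(t * lp for t, lp in zip(target, logp))
```

With an identity proximity matrix this reduces to ordinary cross-entropy; with graded proximities, predicting a semantically close class costs less than predicting a distant one.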
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Visual question answering requires a system to provide an accurate natural
language answer given an image and a natural language question. However, it is
widely recognized that previous generic VQA methods often exhibit a tendency to
memorize biases present in the training data rather than learning proper
behaviors, such as grounding images before predicting answers. Therefore, these
methods usually achieve high in-distribution but poor out-of-distribution
performance. In recent years, various datasets and debiasing methods have been
proposed to evaluate and enhance the VQA robustness, respectively. This paper
provides the first comprehensive survey focused on this emerging area.
Specifically, we first provide an overview of the development process of
datasets from in-distribution and out-of-distribution perspectives. Then, we
examine the evaluation metrics employed by these datasets. Thirdly, we propose
a typology that presents the development process, similarities and differences,
robustness comparison, and technical features of existing debiasing methods.
Furthermore, we analyze and discuss the robustness of representative
vision-and-language pre-training models on VQA. Finally, through a thorough
review of the available literature and experimental analysis, we discuss the
key areas for future research from various viewpoints.
Comment: IEEE TPAMI (Under Review)
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA
Visual Question Answering (VQA) models are prone to learn the shortcut
solution formed by dataset biases rather than the intended solution. To
evaluate the VQA models' reasoning ability beyond shortcut learning, the VQA-CP
v2 dataset introduces a distribution shift between the training and test set
given a question type. In this way, the model cannot use the training set
shortcut (from question type to answer) to perform well on the test set.
However, VQA-CP v2 only considers one type of shortcut and thus still cannot
guarantee that the model relies on the intended solution rather than a solution
specific to this shortcut. To overcome this limitation, we propose a new
dataset that considers varying types of shortcuts by constructing different
distribution shifts in multiple OOD test sets. In addition, we overcome the
three troubling practices in the use of VQA-CP v2, e.g., selecting models using
OOD test sets, and further standardize the OOD evaluation procedure. Our benchmark
provides a more rigorous and comprehensive testbed for shortcut learning in
VQA. We benchmark recent methods and find that methods specifically designed
for particular shortcuts fail to simultaneously generalize to our varying OOD
test sets. We also systematically study the varying shortcuts and provide
several valuable findings, which may promote the exploration of shortcut
learning in VQA.
Comment: Findings of EMNLP-202
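The core mechanism of a VQA-CP-style distribution shift can be sketched as follows. This deliberately simplified version sends each question type's most frequent answer to the training split and all other answers to the OOD test split; the benchmark's actual construction, with multiple OOD test sets over varying shortcuts, is more involved.

```python
from collections import Counter, defaultdict

def make_ood_split(data):
    """Construct a simple type-to-answer distribution shift.

    data: iterable of (question_type, answer, question_id) triples.
    Within each question type, the most frequent answer goes to the
    training split and all other answers to the OOD test split, so the
    type->answer shortcut learned at training time fails at test time.
    A hypothetical sketch, not the benchmark's exact procedure.
    """
    by_type = defaultdict(list)
    for qtype, answer, qid in data:
        by_type[qtype].append((answer, qid))

    train, test = [], []
    for qtype, items in by_type.items():
        top_answer, _ = Counter(a for a, _ in items).most_common(1)[0]
        for answer, qid in items:
            (train if answer == top_answer else test).append((qtype, answer, qid))
    return train, test
```

A model that memorizes "question type implies answer" scores perfectly on this training split and zero on the test split, which is exactly the behavior such benchmarks are built to expose.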