Generating Counterfactual Explanations with Natural Language
Natural language explanations of deep neural network decisions provide an
intuitive way for an AI agent to articulate its reasoning process. Current textual
explanations learn to discuss class-discriminative features in an image.
However, it is also helpful to understand which attributes might change a
classification decision if present in an image (e.g., "This is not a Scarlet
Tanager because it does not have black wings."). We call such textual
explanations counterfactual explanations, and propose an intuitive method to
generate counterfactual explanations by inspecting which evidence in an input
is missing, but might contribute to a different classification decision if
present in the image. To demonstrate our method we consider a fine-grained
image classification task in which we take as input an image and a
counterfactual class and output text which explains why the image does not
belong to a counterfactual class. We then analyze our generated counterfactual
explanations both qualitatively and quantitatively using proposed automatic
metrics.
Comment: presented at the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), Stockholm, Sweden
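The core idea of the abstract — explaining why an image does *not* belong to a counterfactual class by checking which of that class's discriminative attributes are missing — can be sketched as follows. This is a minimal illustrative sketch, not the paper's model: the class-attribute table, attribute sets, and sentence template are all invented for the example.

```python
# Hypothetical sketch: report the counterfactual class's discriminative
# attributes that are absent from the image. In the paper this evidence
# comes from a learned model; here it is a hand-written lookup table.
CLASS_ATTRIBUTES = {
    "Scarlet Tanager": {"red body", "black wings"},
    "Cardinal": {"red body", "crest"},
}

def counterfactual_explanation(image_attributes, counterfactual_class):
    """Explain why the image does NOT belong to counterfactual_class by
    listing the class's discriminative attributes missing from the image."""
    required = CLASS_ATTRIBUTES[counterfactual_class]
    missing = sorted(required - image_attributes)
    if not missing:
        return f"No missing evidence for {counterfactual_class}."
    return (f"This is not a {counterfactual_class} because it does not have "
            + " or ".join(missing) + ".")

print(counterfactual_explanation({"red body"}, "Scarlet Tanager"))
# -> This is not a Scarlet Tanager because it does not have black wings.
```

The set difference `required - image_attributes` plays the role of the "missing but class-relevant evidence" the abstract describes.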
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Recent work in vision-and-language pretraining has investigated supervised
signals from object detection data to learn better, fine-grained multimodal
representations. In this work, we take a step further and explore how we can
tap into supervision from small-scale visual relation data. In particular, we
propose two pretraining approaches to contextualise visual entities in a
multimodal setup. With verbalised scene graphs, we transform visual relation
triplets into structured captions, and treat them as additional image
descriptions. With masked relation prediction, we further encourage relating
entities from image regions with visually masked contexts. When applied to
strong baselines pretrained on large amounts of Web data, zero-shot evaluations
on both coarse-grained and fine-grained tasks show the efficacy of our methods
in learning multimodal representations from weakly-supervised relation data.
Comment: EMNLP 202
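The "verbalised scene graphs" approach described above turns visual relation triplets into structured captions that serve as additional image descriptions. A toy sketch of that transformation, with invented triplets and a simple template (not the paper's actual verbalisation scheme):

```python
# Illustrative sketch: render (subject, predicate, object) relation
# triplets as short caption-like strings that can be mixed in with
# ordinary image descriptions during pretraining.
def verbalise(triplets):
    """Turn visual relation triplets into structured captions."""
    return [f"{subj} {pred} {obj}" for subj, pred, obj in triplets]

triplets = [("a man", "is riding", "a horse"),
            ("a dog", "is next to", "a bench")]
print(verbalise(triplets))
# -> ['a man is riding a horse', 'a dog is next to a bench']
```

The second approach, masked relation prediction, would instead hide one element of such a triplet (or the corresponding image region) and train the model to recover it from context.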
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Deep models that are both effective and explainable are desirable in many
settings; prior explainable models have been unimodal, offering either
image-based visualization of attention weights or text-based generation of
post-hoc justifications. We propose a multimodal approach to explanation, and
argue that the two modalities provide complementary explanatory strengths. We
collect two new datasets to define and evaluate this task, and propose a novel
model which can provide joint textual rationale generation and attention
visualization. Our datasets define visual and textual justifications of a
classification decision for activity recognition tasks (ACT-X) and for visual
question answering tasks (VQA-X). We quantitatively show that training with the
textual explanations not only yields better textual justification models, but
also better localizes the evidence that supports the decision. We also
qualitatively show cases where visual explanation is more insightful than
textual explanation, and vice versa, supporting our thesis that multimodal
explanation models offer significant benefits over unimodal approaches.
Comment: arXiv admin note: text overlap with arXiv:1612.0475
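The multimodal explanation the abstract argues for pairs a textual rationale with a visual pointer to the evidence. A minimal, hypothetical sketch of such an output structure — the names, fields, and normalisation are assumptions for illustration, not the paper's model:

```python
# Hypothetical container for a multimodal explanation: a textual
# justification plus normalised attention weights over image regions
# that point to the supporting evidence.
from dataclasses import dataclass

@dataclass
class MultimodalExplanation:
    answer: str            # predicted label or answer
    rationale: str         # textual justification
    attention: list        # per-region attention weights (sum to 1)

def explain(answer, rationale, region_scores):
    """Package a prediction with both explanation modalities,
    normalising raw region scores into attention weights."""
    total = sum(region_scores)
    weights = [s / total for s in region_scores]
    return MultimodalExplanation(answer, rationale, weights)

exp = explain("playing baseball", "because he is holding a bat",
              [2.0, 1.0, 1.0])
print(exp.attention)  # -> [0.5, 0.25, 0.25]
```

The two fields correspond to the complementary strengths the abstract describes: the rationale says *why*, while the attention weights say *where*.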
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
Deep models are the de facto standard in visual decision problems due to their
impressive performance on a wide array of visual tasks. On the other hand,
their opaqueness has led to a surge of interest in explainable systems. In this
work, we emphasize the importance of model explanation in various forms such as
visual pointing and textual justification. The lack of data with justification
annotations is one of the bottlenecks of generating multimodal explanations.
Thus, we propose two large-scale datasets with annotations that visually and
textually justify a classification decision for various activities (ACT-X)
and for visual question answering (VQA-X). We also introduce a multimodal
methodology for generating visual and textual explanations simultaneously. We
quantitatively show that training with the textual explanations not only yields
better textual justification models, but also models that better localize the
evidence that supports their decision.
Comment: arXiv admin note: text overlap with arXiv:1612.0475