Neural Motifs: Scene Graph Parsing with Global Context
We investigate the problem of producing structured graph representations of
visual scenes. Our work analyzes the role of motifs: regularly appearing
substructures in scene graphs. We present new quantitative insights on such
repeated structures in the Visual Genome dataset. Our analysis shows that
object labels are highly predictive of relation labels but not vice versa. We
also find that there are recurring patterns even in larger subgraphs: more than
50% of graphs contain motifs involving at least two relations. Our analysis
motivates a new baseline: given object detections, predict the most frequent
relation between object pairs with the given labels, as seen in the training
set. This baseline improves on the previous state of the art by an average
relative gain of 3.6% across evaluation settings. We then introduce Stacked
Motif Networks, a new architecture designed to capture higher-order motifs in
scene graphs, which further improves over our strong baseline by an average
relative gain of 7.1%. Our code is available at github.com/rowanz/neural-motifs.
Comment: CVPR 2018 camera ready
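The frequency baseline described in this abstract is simple enough to sketch directly. Below is a minimal illustration under assumed inputs; the function names and data layout are ours, not the released code:

```python
# Minimal sketch of the frequency baseline (illustrative names, not the released code):
# for each ordered pair of object labels, remember the relation most frequently
# annotated between them in the training scene graphs, and predict it at test time.
from collections import Counter, defaultdict

def build_frequency_table(training_triples):
    """training_triples: iterable of (subject_label, relation_label, object_label)."""
    counts = defaultdict(Counter)
    for subj, rel, obj in training_triples:
        counts[(subj, obj)][rel] += 1
    # Keep only the most frequent relation per (subject, object) label pair.
    return {pair: rels.most_common(1)[0][0] for pair, rels in counts.items()}

def predict_relation(freq_table, subj_label, obj_label, default=None):
    """Predict a relation for a detected object pair from training statistics."""
    return freq_table.get((subj_label, obj_label), default)

# Toy usage:
table = build_frequency_table([
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "has", "shirt"),
])
print(predict_relation(table, "man", "shirt"))  # -> "wearing"
```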
Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
Language is increasingly being used to define rich visual recognition
problems with supporting image collections sourced from the web. Structured
prediction models are used in these tasks to take advantage of correlations
between co-occurring labels and visual input but risk inadvertently encoding
social biases found in web corpora. In this work, we study data and models
associated with multilabel object classification and visual semantic role
labeling. We find that (a) datasets for these tasks contain significant gender
bias and (b) models trained on these datasets further amplify existing bias.
For example, the activity cooking is over 33% more likely to involve females
than males in a training set, and a trained model further amplifies the
disparity to 68% at test time. We propose to inject corpus-level constraints
for calibrating existing structured prediction models and design an algorithm
based on Lagrangian relaxation for collective inference. Our method results in
almost no performance loss for the underlying recognition task but decreases
the magnitude of bias amplification by 47.5% and 40.5% for multilabel
classification and visual semantic role labeling, respectively.
Comment: 11 pages, published in EMNLP 2017
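As a rough illustration of the bias-amplification effect described above, the sketch below compares an activity's gender ratio in training labels against the ratio in model predictions. The function names and the averaging rule are assumptions for illustration, not necessarily the paper's exact metric, and the calibration algorithm itself (Lagrangian relaxation over corpus-level constraints) is not shown:

```python
# Rough sketch of measuring gender-bias amplification for activities
# (names and averaging rule are illustrative assumptions, not the paper's exact metric).
def gender_ratio(counts, activity):
    """Fraction of instances of `activity` involving females.
    counts: dict mapping (activity, gender) -> integer count."""
    f = counts.get((activity, "female"), 0)
    m = counts.get((activity, "male"), 0)
    return f / (f + m) if (f + m) > 0 else 0.5

def bias_amplification(train_counts, pred_counts, activities):
    """Mean increase, over female-skewed activities, of the female ratio
    from training labels to model predictions."""
    deltas = []
    for act in activities:
        train_ratio = gender_ratio(train_counts, act)
        if train_ratio > 0.5:  # activity is female-skewed in training
            deltas.append(gender_ratio(pred_counts, act) - train_ratio)
    return sum(deltas) / len(deltas) if deltas else 0.0

# Toy numbers echoing the abstract: cooking is female-skewed in training,
# and the model's predictions widen the gap further at test time.
train = {("cooking", "female"): 66, ("cooking", "male"): 34}
pred = {("cooking", "female"): 84, ("cooking", "male"): 16}
print(bias_amplification(train, pred, ["cooking"]))  # positive -> amplified bias
```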
Cascading Biases: Investigating the Effect of Heuristic Annotation Strategies on Data and Models
Cognitive psychologists have documented that humans use cognitive heuristics,
or mental shortcuts, to make quick decisions while expending less effort. We
hypothesize that such heuristic use among annotators on crowdsourcing platforms
cascades into data quality and model robustness. In this work, we study
cognitive heuristic use in the context of
annotating multiple-choice reading comprehension datasets. We propose tracking
annotator heuristic traces, where we tangibly measure low-effort annotation
strategies that could indicate usage of various cognitive heuristics. We find
evidence that annotators might be using multiple such heuristics, based on
correlations with a battery of psychological tests. Importantly, heuristic use
among annotators determines data quality along several dimensions: (1) known
biased models, such as partial-input models, more easily solve examples
authored by annotators who rate highly on heuristic use, (2) models trained on
data from annotators who score highly on heuristic use generalize less well, and (3)
heuristic-seeking annotators tend to create qualitatively less challenging
examples. Our findings suggest that tracking heuristic usage among annotators
can potentially help with collecting challenging datasets and diagnosing model
biases.
Comment: EMNLP 2022
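The partial-input analysis mentioned in this abstract can be pictured with a small sketch: score each annotator by how easily an answer-only (partial-input) model solves their examples, then correlate that with the annotator's heuristic-trace score. Everything below (data layout, function names, the choice of Spearman correlation) is an assumption for illustration, not the authors' pipeline:

```python
# Hypothetical sketch (not the authors' code) of relating per-annotator
# heuristic-trace scores to how easily a partial-input model solves their examples.
from scipy.stats import spearmanr

def per_annotator_solve_rate(examples, partial_input_model):
    """examples: list of dicts with keys 'annotator', 'options', 'label'.
    Returns {annotator: fraction of their examples the partial-input model solves}."""
    correct, total = {}, {}
    for ex in examples:
        ann = ex["annotator"]
        pred = partial_input_model(ex["options"])  # sees answer options only, no passage
        total[ann] = total.get(ann, 0) + 1
        correct[ann] = correct.get(ann, 0) + int(pred == ex["label"])
    return {ann: correct[ann] / total[ann] for ann in total}

def heuristic_vs_solvability(heuristic_scores, solve_rates):
    """Spearman correlation between annotators' heuristic-trace scores and the
    partial-input model's solve rate on the examples they authored."""
    common = sorted(set(heuristic_scores) & set(solve_rates))
    return spearmanr([heuristic_scores[a] for a in common],
                     [solve_rates[a] for a in common])

# Trivial stand-in partial-input model: always pick the longest answer option.
longest_option = lambda options: max(options, key=len)
```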
Interpretable by Design Visual Question Answering
Model interpretability has long been a hard problem for the AI community,
especially in the multimodal setting, where vision and language must be
aligned and reasoned over jointly. In this paper, we focus specifically on
the problem of Visual Question Answering (VQA). While previous work has tried
to probe the network structures of black-box multimodal models, we propose
to tackle the problem from a different angle -- treating interpretability as an
explicit additional goal.
Given an image and a question, we argue that an interpretable VQA model should
be able to tell which conclusions it can draw from which parts of the image, and
to show how each statement helps it arrive at an answer. We introduce InterVQA:
Interpretable-by-design VQA, in which we design an explicit intermediate dynamic
reasoning structure for VQA problems and enforce symbolic reasoning that uses
only this structure for final answer prediction. InterVQA produces high-quality
explicit intermediate reasoning steps while maintaining end-task performance
comparable to the state of the art (SOTA).
Comment: Multimodal, Vision and Language
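The abstract does not spell out the reasoning structure, so the toy sketch below only illustrates the general "interpretable by design" idea: the model commits to explicit intermediate statements about image regions, and the final answer is derived symbolically from those statements alone, so every answer can be traced back to its supporting evidence. All names and the structure itself are illustrative assumptions, not InterVQA's actual architecture:

```python
# Toy illustration of answering only from explicit intermediate statements
# (all names and structure are assumptions, not InterVQA's actual design).
from dataclasses import dataclass

@dataclass
class Statement:
    region: str      # which part of the image the statement refers to
    predicate: str   # e.g., "color", "exists"
    value: str       # e.g., "red", "dog"

def symbolic_answer(statements, question_type):
    """Derive the answer purely from the statements, returning the answer
    together with the statements (and image regions) that support it."""
    if question_type == "color":
        for s in statements:
            if s.predicate == "color":
                return s.value, [s]
    if question_type == "count":
        counted = [s for s in statements if s.predicate == "exists"]
        return str(len(counted)), counted
    return "unknown", []

stmts = [Statement("region_3", "color", "red"),
         Statement("region_1", "exists", "dog")]
answer, support = symbolic_answer(stmts, "color")
print(answer, [s.region for s in support])  # red ['region_3']
```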