42 research outputs found
Unshuffling Data for Improved Generalization
Generalization beyond the training distribution is a core challenge in
machine learning. The common practice of mixing and shuffling examples when
training neural networks may not be optimal in this regard. We show that
partitioning the data into well-chosen, non-i.i.d. subsets treated as multiple
training environments can guide the learning of models with better
out-of-distribution generalization. We describe a training procedure to capture
the patterns that are stable across environments while discarding spurious
ones. The method takes a step beyond correlation-based learning: the choice of
the partitioning allows injecting information about the task that cannot be
otherwise recovered from the joint distribution of the training data. We
demonstrate multiple use cases with the task of visual question answering,
which is notorious for dataset biases. We obtain significant improvements on
VQA-CP, using environments built from prior knowledge, existing metadata, or
unsupervised clustering. We also get improvements on GQA using annotations of
"equivalent questions", and on multi-dataset training (VQA v2 / Visual Genome)
by treating them as distinct environments.
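The abstract leaves the exact training objective unspecified, but treating well-chosen partitions as environments can be sketched with a variance-across-environments penalty in the spirit of invariant-risk-style methods; the model, loss, and data below are illustrative assumptions, not the paper's implementation:

```python
def env_losses(model, environments, loss_fn):
    """Average loss of the model in each training environment."""
    losses = []
    for env in environments:
        total = sum(loss_fn(model(x), y) for x, y in env)
        losses.append(total / len(env))
    return losses

def invariance_penalty(losses):
    """Variance of per-environment losses: zero exactly when the
    model performs identically across all environments."""
    mean = sum(losses) / len(losses)
    return sum((l - mean) ** 2 for l in losses) / len(losses)

# Toy setup: a fixed linear "model", squared error, and two
# hand-built non-i.i.d. environments (all names are illustrative).
model = lambda x: 2 * x
loss_fn = lambda pred, y: (pred - y) ** 2
env_a = [(1, 2), (2, 4)]   # the pattern x -> 2x holds here
env_b = [(1, 3), (2, 5)]   # a spurious offset appears here
losses = env_losses(model, [env_a, env_b], loss_fn)
objective = sum(losses) + 10.0 * invariance_penalty(losses)
```

Minimizing such a combined objective favors patterns whose loss is stable across the chosen partitions over patterns that only help in a single environment.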
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Out-of-distribution (OOD) testing is increasingly popular for evaluating a
machine learning system's ability to generalize beyond the biases of a training
set. OOD benchmarks are designed to present a different joint distribution of
data and labels between training and test time. VQA-CP has become the standard
OOD benchmark for visual question answering, but we discovered three troubling
practices in its current use. First, most published methods rely on explicit
knowledge of the construction of the OOD splits. They often rely on
"inverting" the distribution of labels, e.g. answering mostly 'yes' when the
common training answer is 'no'. Second, the OOD test set is used for model
selection. Third, a model's in-domain performance is assessed after retraining
it on in-domain splits (VQA v2) that exhibit a more balanced distribution of
labels. These three practices defeat the objective of evaluating
generalization, and put into question the value of methods specifically
designed for this dataset. We show that embarrassingly simple methods,
including one that generates answers at random, surpass the state of the art on
some question types. We provide short- and long-term solutions to avoid these
pitfalls and realize the benefits of OOD evaluation.
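The random-answer baseline mentioned above can be approximated as follows; the data layout and function name are hypothetical, and the paper's actual baseline may sample answers differently:

```python
import random
from collections import defaultdict

def random_answer_baseline(train_pairs, test_pairs, seed=0):
    """For each test question, sample an answer uniformly from the
    answers seen at training time for the same question type."""
    rng = random.Random(seed)
    answers_by_type = defaultdict(list)
    for qtype, answer in train_pairs:
        answers_by_type[qtype].append(answer)
    correct = sum(
        rng.choice(answers_by_type[qtype]) == answer
        for qtype, answer in test_pairs
    )
    return correct / len(test_pairs)

# An inverted label distribution between train and test, as in VQA-CP:
train_pairs = [("is there", "yes")] * 9 + [("is there", "no")]
test_pairs = [("is there", "no")] * 9 + [("is there", "yes")]
acc = random_answer_baseline(train_pairs, test_pairs)
```

When the test distribution inverts the training one, even this guesser scores non-trivially on answer types that dominate the test split, which is the Goodhart effect the paper warns about.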
On Efficient Training, Controllability and Compositional Generalization of Insertion-based Language Generators
Auto-regressive language models with the left-to-right generation order have
been a predominant paradigm for language generation. Recently, out-of-order
text generation beyond the traditional left-to-right paradigm has attracted
extensive attention, with insertion-based generation as a notable variant: a
model gradually extends a partial context into a complete sentence purely
through insertion operations. However, since insertion operations disturb
the position information of each token, it is often believed that each step of
the insertion-based likelihood estimation requires a bi-directional
\textit{re-encoding} of the whole generated sequence. This computational
overhead prohibits the model from scaling up to generate long, diverse texts
such as stories, news articles, and reports. To address this issue, we propose
InsNet, an insertion-based sequence model that can be trained as efficiently as
traditional transformer decoders while maintaining the same performance as that
with a bi-directional context encoder. We evaluate InsNet on story generation
and CLEVR-CoGenT captioning, showing the advantages of InsNet in several
dimensions, including computational costs, generation quality, the ability to
perfectly incorporate lexical controls, and better compositional
generalization.
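As a toy illustration of the insertion-based paradigm (not InsNet itself), a sequence can be grown purely with (position, token) insertion operations; note how each insertion shifts the positions of later tokens, which is the source of the re-encoding overhead discussed above:

```python
def apply_insertions(context, insertions):
    """Grow a sequence purely with (position, token) insertion
    operations. Positions index slots in the *current* sequence,
    so every insertion shifts the positions of all later tokens."""
    seq = list(context)
    for pos, tok in insertions:
        seq.insert(pos, tok)
    return seq

# Out-of-order generation from a two-token context:
steps = [(0, "the"), (2, "brown"), (3, "fox")]
result = apply_insertions(["quick", "jumps"], steps)
# result: ["the", "quick", "brown", "fox", "jumps"]
```

Because position information changes at every step, a naive likelihood estimator must re-encode the whole sequence after each insertion; InsNet's contribution is avoiding that cost during training.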
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA
Visual Question Answering (VQA) models are prone to learn the shortcut
solution formed by dataset biases rather than the intended solution. To
evaluate the VQA models' reasoning ability beyond shortcut learning, the VQA-CP
v2 dataset introduces a distribution shift between the training and test set
given a question type. In this way, the model cannot use the training set
shortcut (from question type to answer) to perform well on the test set.
However, VQA-CP v2 only considers one type of shortcut and thus still cannot
guarantee that the model relies on the intended solution rather than a solution
specific to this shortcut. To overcome this limitation, we propose a new
dataset that considers varying types of shortcuts by constructing different
distribution shifts in multiple OOD test sets. In addition, we overcome the
three troubling practices in the use of VQA-CP v2, e.g., selecting models using
OOD test sets, and further standardize the OOD evaluation procedure. Our benchmark
provides a more rigorous and comprehensive testbed for shortcut learning in
VQA. We benchmark recent methods and find that methods specifically designed
for particular shortcuts fail to simultaneously generalize to our varying OOD
test sets. We also systematically study the varying shortcuts and provide
several valuable findings, which may promote the exploration of shortcut
learning in VQA.
Comment: Findings of EMNLP-202
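A type-conditional distribution shift of the kind such benchmarks rely on can be sketched as follows; the real benchmark's construction is far more elaborate, and the majority-answer splitting rule here is a deliberately simplified assumption:

```python
from collections import defaultdict

def split_with_type_shift(examples):
    """Send the majority answer of each question type to the training
    split and all other answers to the OOD test split, so the shortcut
    from question type to answer no longer transfers."""
    by_type = defaultdict(list)
    for qtype, answer in examples:
        by_type[qtype].append(answer)
    train_split, test_split = [], []
    for qtype, answers in by_type.items():
        majority = max(set(answers), key=answers.count)
        for answer in answers:
            dest = train_split if answer == majority else test_split
            dest.append((qtype, answer))
    return train_split, test_split

examples = [("color", "red")] * 3 + [("color", "blue")] * 2
train_split, test_split = split_with_type_shift(examples)
```

Repeating this construction along different attributes (question type, visual object, answer length, and so on) yields the multiple OOD test sets the benchmark advocates.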
Exploiting Synthetic Data for Data Imbalance Problems: Baselines from a Data Perspective
Deep neural networks are trained on a vast ocean of data, and this data
exhibits an inherent imbalance. This imbalance poses a risk of deep neural
networks producing biased predictions, leading to potentially severe ethical
and social consequences. To address these challenges, we believe that
generative models are a promising approach, given the remarkable advancements
demonstrated by recent diffusion models in generating high-quality images. In
this work, we propose a simple yet effective baseline, SYNAuG, that utilizes
synthetic data as a preliminary step before employing task-specific algorithms
to address data imbalance problems. This straightforward approach yields
impressive performance on datasets such as CIFAR100-LT, ImageNet100-LT,
UTKFace, and Waterbird, surpassing the performance of existing task-specific
methods. While we do not claim that our approach serves as a complete solution
to the problem of data imbalance, we argue that supplementing the existing data
with synthetic data proves to be an effective and crucial preliminary step in
addressing data imbalance concerns.
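The preliminary step described above, topping up minority classes with synthetic samples before any task-specific algorithm runs, can be sketched as follows; `generate` is a placeholder assumption standing in for a class-conditional generative model:

```python
from collections import Counter

def uniformize_with_synthetic(labels, generate):
    """Top up every class with synthetic samples until all classes
    reach the size of the largest one, before any task-specific
    rebalancing algorithm is applied."""
    counts = Counter(labels)
    target = max(counts.values())
    synthetic = []
    for cls, n in counts.items():
        synthetic.extend(generate(cls) for _ in range(target - n))
    return synthetic

# 'generate' is a placeholder for a class-conditional generative
# model (e.g. a diffusion model prompted with the class name).
generate = lambda cls: ("synthetic", cls)
labels = ["cat"] * 5 + ["dog"] * 2
extra = uniformize_with_synthetic(labels, generate)
# extra holds 3 synthetic "dog" samples, evening out the classes
```

The point of the baseline is exactly this simplicity: equalize class counts first, then let any existing imbalance-aware method operate on the uniformized data.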
Maximum likelihood estimation and graph matching in errorfully observed networks
Given a pair of graphs with the same number of vertices, the inexact graph matching problem consists in finding a correspondence between the vertices of these graphs that minimizes the total number of induced edge disagreements. We study this problem from a statistical framework in which one of the graphs is an errorfully observed copy of the other. We introduce a corrupting channel model, and show that in this model framework, the solution to the graph matching problem is a maximum likelihood estimator (MLE). Necessary and sufficient conditions for consistency of this MLE are presented, as well as a relaxed notion of consistency in which a negligible fraction of the vertices need not be matched correctly. The results are used to study matchability in several families of random graphs, including edge independent models, random regular graphs, and small-world networks. We also use these results to introduce measures of matching feasibility, and experimentally validate the results on simulated and real-world networks. Supplemental files for this article are available online.
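On very small graphs, the objective can be checked directly by brute force: the matching that minimizes induced edge disagreements is found by scanning all permutations. This is an illustrative sketch only; the adjacency matrices are made-up examples, and realistic instances require approximate solvers:

```python
from itertools import permutations

def edge_disagreements(a, b, perm):
    """Count vertex pairs that form an edge in exactly one of the two
    graphs after relabelling graph b's vertices by perm."""
    n = len(a)
    return sum(a[i][j] != b[perm[i]][perm[j]]
               for i in range(n) for j in range(i + 1, n))

def match(a, b):
    """The permutation minimizing induced edge disagreements, which
    the paper shows is the MLE under its corrupting-channel model."""
    return min(permutations(range(len(a))),
               key=lambda p: edge_disagreements(a, b, p))

# A path 0-1-2 and a relabelled copy of it (a path 1-0-2):
a = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
b = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
best = match(a, b)   # a perfect match: zero disagreements
```

The errorful-observation setting corresponds to flipping some entries of `b` through the corrupting channel; the consistency results characterize when the minimizer above still recovers the true correspondence.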
Logical Implications for Visual Question Answering Consistency
Despite considerable recent progress in Visual Question Answering (VQA) models, inconsistent or contradictory answers continue to cast doubt on their true reasoning capabilities. However, most proposed methods use indirect strategies or strong assumptions on pairs of questions and answers to enforce model consistency. Instead, we propose a novel strategy intended to improve model performance by directly reducing logical inconsistencies. To do this, we introduce a new consistency loss term that can be used by a wide range of VQA models and which relies on knowing the logical relation between pairs of questions and answers. While such information is typically not available in VQA datasets, we propose to infer these logical relations using a dedicated language model and to use them in our proposed consistency loss function. We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models while being robust across different architectures and settings.
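The abstract does not specify the exact form of the consistency loss, but a hinge-style penalty on logically related question pairs is one plausible sketch; the relation names and probability inputs below are assumptions, not the paper's formulation:

```python
def consistency_loss(p_q1, p_q2, relation):
    """Hinge-style penalty on the 'yes' probabilities of a related
    question pair. 'implies': if q1 holds then q2 must, so p_q2
    should be at least p_q1. 'contradicts': both cannot hold, so
    the probabilities should not sum past 1."""
    if relation == "implies":
        return max(0.0, p_q1 - p_q2)
    if relation == "contradicts":
        return max(0.0, p_q1 + p_q2 - 1.0)
    return 0.0

# "Is the dog sleeping?" implies "Is there a dog?"; answering the
# first with 0.9 but the second with only 0.4 is penalized:
loss = consistency_loss(0.9, 0.4, "implies")   # penalty of about 0.5
```

A term like this added to the standard VQA objective directly discourages contradictory answer pairs, with the relation labels supplied by the dedicated language model the paper describes.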