Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance
Models for natural language understanding (NLU) tasks often rely on the
idiosyncratic biases of the dataset, which make them brittle against test cases
outside the training distribution. Recently, several debiasing methods have been
shown to be very effective in improving out-of-distribution performance.
However, their improvements come at the expense of a performance drop when
models are evaluated on the in-distribution data, which contain a more diverse
set of examples. This seemingly inevitable trade-off may not tell us much about the
changes in the reasoning and understanding capabilities of the resulting models
on broader types of examples beyond the small subset represented in the
out-of-distribution data. In this paper, we address this trade-off by
introducing a novel debiasing method, called confidence regularization, which
discourages models from exploiting biases while still giving them enough
incentive to learn from all the training examples. We evaluate our method on
three NLU tasks and show that, in contrast to its predecessors, it improves the
performance on out-of-distribution datasets (e.g., a 7 percentage point gain on
the HANS dataset) while maintaining the original in-distribution accuracy.
Comment: to appear at ACL 2020
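The abstract only names the mechanism, but the idea can be sketched: a bias-only model's confidence on the gold label is used to smooth the soft targets that the main model is distilled from, so examples the bias already solves contribute a flatter, less rewarding target. The PyTorch-style sketch below is a minimal illustration under that assumption; the function names and the exponent-based scaling are ours, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def scale_teacher_probs(teacher_probs, bias_prob_on_gold):
    """Smooth the teacher's distribution in proportion to how confidently a
    bias-only model already predicts the gold label (assumed scaling)."""
    # An exponent near 0 (bias model very confident) flattens the target,
    # so the student gets little incentive to rely on the bias.
    exponent = (1.0 - bias_prob_on_gold).unsqueeze(1)    # shape [batch, 1]
    scaled = teacher_probs ** exponent
    return scaled / scaled.sum(dim=1, keepdim=True)      # renormalize

def confidence_regularization_loss(student_logits, teacher_probs,
                                   bias_probs, gold):
    """Distillation loss against the bias-scaled teacher targets."""
    # Probability the bias-only model assigns to the gold label.
    bias_prob_on_gold = bias_probs.gather(1, gold.unsqueeze(1)).squeeze(1)
    soft_targets = scale_teacher_probs(teacher_probs, bias_prob_on_gold)
    log_probs = F.log_softmax(student_logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()
```

When the bias-only model is nearly certain of the gold label, the exponent approaches zero and the target becomes nearly uniform, so fitting that example yields little gradient; harder, unbiased examples keep a sharp target and still provide a full learning signal.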
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Out-of-distribution (OOD) testing is increasingly popular for evaluating a
machine learning system's ability to generalize beyond the biases of a training
set. OOD benchmarks are designed to present a different joint distribution of
data and labels between training and test time. VQA-CP has become the standard
OOD benchmark for visual question answering, but we discovered three troubling
practices in its current use. First, most published methods rely on explicit
knowledge of the construction of the OOD splits. They often rely on
"inverting" the distribution of labels, e.g., answering mostly 'yes' when the
common training answer is 'no'. Second, the OOD test set is used for model
selection. Third, a model's in-domain performance is assessed after retraining
it on in-domain splits (VQA v2) that exhibit a more balanced distribution of
labels. These three practices defeat the objective of evaluating
generalization, and put into question the value of methods specifically
designed for this dataset. We show that embarrassingly-simple methods,
including one that generates answers at random, surpass the state of the art on
some question types. We provide short- and long-term solutions to avoid these
pitfalls and realize the benefits of OOD evaluation.
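The abstract does not spell out these simple baselines, so the following is only a hypothetical sketch of one: predict a random answer drawn from the answers seen in training for a crude question-type prefix. Everything here (the prefix heuristic, the function names) is an assumption for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict

def build_answer_pools(train_examples):
    """Collect, for each crude question-type prefix, the answers seen in
    training. `train_examples` is an iterable of (question, answer) pairs."""
    pools = defaultdict(list)
    for question, answer in train_examples:
        prefix = " ".join(question.lower().split()[:2])  # e.g. "how many"
        pools[prefix].append(answer)
    return pools

def random_answer_baseline(question, pools, fallback="yes"):
    """Predict a uniformly random answer drawn from the training answers of
    the matching question prefix (hypothetical illustration only)."""
    prefix = " ".join(question.lower().split()[:2])
    candidates = pools.get(prefix)
    return random.choice(candidates) if candidates else fallback

# Tiny usage example with made-up data:
pools = build_answer_pools([("How many dogs are there?", "2"),
                            ("How many people are shown?", "3"),
                            ("Is the man smiling?", "yes")])
print(random_answer_baseline("How many cats are there?", pools))  # "2" or "3"
```

A baseline this trivial has no access to the image at all, which is exactly why its competitiveness on some question types signals a problem with how the benchmark is being used rather than genuine generalization.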
Syntactic Data Augmentation Increases Robustness to Inference Heuristics
Pretrained neural models such as BERT, when fine-tuned to perform natural
language inference (NLI), often show high accuracy on standard datasets, but
display a surprising lack of sensitivity to word order on controlled challenge
sets. We hypothesize that this issue is not primarily caused by the pretrained
model's limitations, but rather by the paucity of crowdsourced NLI examples
that might convey the importance of syntactic structure at the fine-tuning
stage. We explore several methods to augment standard training sets with
syntactically informative examples, generated by applying syntactic
transformations to sentences from the MNLI corpus. The best-performing
augmentation method, subject/object inversion, improved BERT's accuracy on
controlled examples that diagnose sensitivity to word order from 0.28 to 0.73,
without affecting performance on the MNLI test set. This improvement
generalized beyond the particular construction used for data augmentation,
suggesting that augmentation causes BERT to recruit abstract syntactic
representations.
Comment: ACL 2020
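The subject/object inversion transformation can be approximated with a dependency parse: swap the grammatical subject and direct object of a transitive sentence and pair the result with the original, a pair that a lexical-overlap heuristic would wrongly treat as entailment. The sketch below assumes spaCy's en_core_web_sm parser and swaps only the head nouns; it illustrates the transformation, not the paper's generation pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

def invert_subject_object(sentence):
    """Swap the subject and direct-object head nouns of a simple transitive
    sentence; returns None if either argument is missing. Real augmentation
    would also need to handle multi-word arguments, agreement, and casing."""
    doc = nlp(sentence)
    subj = next((t for t in doc if t.dep_ == "nsubj"), None)
    obj = next((t for t in doc if t.dep_ == "dobj"), None)
    if subj is None or obj is None:
        return None
    tokens = [t.text for t in doc]
    tokens[subj.i], tokens[obj.i] = tokens[obj.i], tokens[subj.i]
    return " ".join(tokens)

premise = "The lawyer saw the actor"
hypothesis = invert_subject_object(premise)  # "The actor saw the lawyer"
# The pair (premise, hypothesis) shares all of its words, so a word-overlap
# heuristic predicts entailment; the augmented example would carry a
# non-entailment label (an assumption about how such pairs are labeled).
```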