Annotation Artifacts in Natural Language Inference Data
Large-scale datasets for natural language inference are created by presenting
crowd workers with a sentence (premise), and asking them to generate three new
sentences (hypotheses) that it entails, contradicts, or is logically neutral
with respect to. We show that, in a significant portion of such data, this
protocol leaves clues that make it possible to identify the label by looking
only at the hypothesis, without observing the premise. Specifically, we show
that a simple text categorization model can correctly classify the hypothesis
alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams
et al., 2017). Our analysis reveals that specific linguistic phenomena such as
negation and vagueness are highly correlated with certain inference classes.
Our findings suggest that the success of natural language inference models to
date has been overestimated, and that the task remains a hard open problem.
Comment: 6 pages, 1 figure, NAACL 2018
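The hypothesis-only effect described above can be sketched with a toy experiment. Everything below is invented for illustration (six hand-written hypotheses, a tiny Naive Bayes classifier); the paper's actual model is a trained text-categorization system over the full datasets. The point survives miniaturization: lexical artifacts such as negation words let a classifier that never sees the premise beat chance.

```python
from collections import Counter, defaultdict
import math

# Hypothetical mini-dataset: (hypothesis, label) pairs only -- no premise.
# Note the negation cues clustered in the contradiction class.
train = [
    ("a man is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the dog is not running", "contradiction"),
    ("a person is outdoors", "entailment"),
    ("someone is moving", "entailment"),
    ("a woman is eating food", "entailment"),
]

def train_nb(data):
    """Unigram Naive Bayes over hypotheses alone."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Add-one-smoothed log-probability argmax over labels."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb(train)
print(predict(model, "the dog is not eating"))  # negation cue -> contradiction
print(predict(model, "a woman is moving"))      # -> entailment
```

A real hypothesis-only baseline replaces the toy model with a stronger classifier, but the mechanism is the same: the label leaks through the hypothesis distribution alone.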
Several Experiments on Investigating Pretraining and Knowledge-Enhanced Models for Natural Language Inference
Natural language inference (NLI) is among the most challenging tasks in
natural language understanding. Recent work on unsupervised pretraining that
leverages unsupervised signals such as language-model and sentence-prediction
objectives has been shown to be very effective on a wide range of NLP problems.
It would still be desirable to further understand how it helps NLI; e.g.,
whether it learns artifacts in data annotation or instead learns true inference
knowledge. In addition, external knowledge that does not exist in the limited
amount of NLI training data may be added to NLI models in two typical ways,
e.g., from human-created resources or an unsupervised pretraining paradigm. We
run several experiments here to investigate whether they help NLI in the same
way, and if not, how
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Given a partial description like "she opened the hood of the car," humans can
reason about the situation and anticipate what might come next ("then, she
examined the engine"). In this paper, we introduce the task of grounded
commonsense inference, unifying natural language inference and commonsense
reasoning.
We present SWAG, a new dataset with 113k multiple-choice questions about a
rich spectrum of grounded situations. To address the recurring challenges of
annotation artifacts and human biases found in many existing datasets, we
propose Adversarial Filtering (AF), a novel procedure that constructs a
de-biased dataset by iteratively training an ensemble of stylistic classifiers,
and using them to filter the data. To account for the aggressive adversarial
filtering, we use state-of-the-art language models to massively oversample a
diverse set of potential counterfactuals. Empirical results demonstrate that
while humans can solve the resulting inference problems with high accuracy
(88%), various competitive models struggle on our task. We provide
comprehensive analysis that indicates significant opportunities for future
research.
Comment: EMNLP 2018
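The Adversarial Filtering loop can be sketched in miniature. All numbers below are invented: each candidate wrong ending is reduced to a single "style score" (human endings sit near 0.0, easy machine-generated negatives drift high), and the adversary is a one-feature threshold classifier, whereas the paper trains an ensemble of stylistic classifiers over real text.

```python
# Toy Adversarial Filtering: keep k distractors; each round, any
# distractor the adversary confidently separates from the real endings
# is swapped for a fresh candidate from the oversampled pool.
def adversarial_filter(real, pool, k=4, rounds=5):
    kept = [pool.pop() for _ in range(k)]
    for _ in range(rounds):
        # "train" the adversary: a threshold halfway between the means
        thr = (sum(real) / len(real) + sum(kept) / len(kept)) / 2
        kept = [d if d <= thr else (pool.pop() if pool else d)
                for d in kept]
    return kept

real = [0.1, -0.2, 0.0, 0.2]                                  # human endings
pool = [0.05, -0.1, 0.0, 0.15, 1.9, 2.2, 2.4, 2.6, 2.8, 3.0]  # generated
hard = adversarial_filter(real, pool)
print(sorted(hard))  # only the hard, human-like distractors survive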
Misleading Failures of Partial-input Baselines
Recent work establishes dataset difficulty and removes annotation artifacts
via partial-input baselines (e.g., hypothesis-only models for SNLI or
question-only models for VQA). When a partial-input baseline gets high
accuracy, a dataset is cheatable. However, the converse is not necessarily
true: the failure of a partial-input baseline does not mean a dataset is free
of artifacts. To illustrate this, we first design artificial datasets which
contain trivial patterns in the full input that are undetectable by any
partial-input model. Next, we identify such artifacts in the SNLI dataset - a
hypothesis-only model augmented with trivial patterns in the premise can solve
15% of the examples that were previously considered "hard". Our work provides a
caveat for the use of partial-input baselines for dataset verification and
creation.
Comment: ACL 2019
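The paper's caveat is easy to reproduce on synthetic data. In this invented construction, a trivial artifact is planted in the premise (final punctuation encodes the label) while the hypothesis is pure noise: a hypothesis-only baseline is stuck at chance, yet the dataset is trivially cheatable with the full input, so the baseline's failure proves nothing about artifact-freeness.

```python
import random

random.seed(0)

# Synthetic dataset: the PREMISE's final punctuation leaks the label;
# the hypothesis carries no signal at all.
labels = ["entailment", "contradiction"]
data = []
for i in range(1000):
    label = random.choice(labels)
    premise = f"premise {i}" + ("." if label == "entailment" else "!")
    hypothesis = f"noise {random.randint(0, 9)}"
    data.append((premise, hypothesis, label))

def full_input_rule(premise, hypothesis):
    # reads only the planted artifact
    return "entailment" if premise.endswith(".") else "contradiction"

def hypothesis_only_rule(premise, hypothesis):
    return "entailment"  # a constant guess is the best a partial model can do

full_acc = sum(full_input_rule(p, h) == y for p, h, y in data) / len(data)
hyp_acc = sum(hypothesis_only_rule(p, h) == y for p, h, y in data) / len(data)
print(full_acc)  # perfect: the artifact is in the full input
print(hyp_acc)   # near chance: the partial-input baseline "fails"
```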
INFOTABS: Inference on Tables as Semi-structured Data
In this paper, we observe that semi-structured tabulated text is ubiquitous;
understanding it requires not only comprehending the meaning of text
fragments, but also the implicit relationships between them. We argue that such
data can serve as a testing ground for understanding how we reason about
information. To study this, we introduce a new dataset called INFOTABS,
comprising human-written textual hypotheses based on premises that are
tables extracted from Wikipedia info-boxes. Our analysis shows that the
semi-structured, multi-domain and heterogeneous nature of the premises admits
complex, multi-faceted reasoning. Experiments reveal that, while human
annotators agree on the relationships between a table-hypothesis pair, several
standard modeling strategies are unsuccessful at the task, suggesting that
reasoning about tables can pose a difficult modeling challenge.
Comment: 16 pages, 6 figures, 14 tables, ACL 2020, Project Page:
https://infotabs.github.io
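One natural baseline for feeding such table premises to a sentence-level NLI model is to linearize the key-value pairs into sentences. The sketch below is purely illustrative (the key names, example values, and template phrasing are invented, and the paper's own modeling strategies may represent tables differently):

```python
# Hypothetical sketch: flatten an infobox-style table premise into
# sentences a standard sentence-pair NLI model can consume.
def linearize(title, infobox):
    return " ".join(f"The {key} of {title} is {value}."
                    for key, value in infobox.items())

premise = linearize("Albert Einstein",
                    {"Born": "14 March 1879", "Fields": "Physics"})
print(premise)
```

Such templates lose the table's structure (and produce clumsy English for some keys), which is one reason reasoning over tables remains hard for sentence-level models.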
CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense
Commonsense reasoning is a critical AI capability, but it is difficult to
construct challenging datasets that test common sense. Recent neural question
answering systems, based on large pre-trained models of language, have already
achieved near-human-level performance on commonsense knowledge benchmarks.
These systems do not possess human-level common sense, but are able to exploit
limitations of the datasets to achieve human-level scores.
We introduce the CODAH dataset, an adversarially-constructed evaluation
dataset for testing common sense. CODAH forms a challenging extension to the
recently-proposed SWAG dataset, which tests commonsense knowledge using
sentence-completion questions that describe situations observed in video. To
produce a more difficult dataset, we introduce a novel procedure for question
acquisition in which workers author questions designed to target weaknesses of
state-of-the-art neural question answering systems. Workers are rewarded for
submissions that models fail to answer correctly both before and after
fine-tuning (in cross-validation). We create 2.8k questions via this procedure
and evaluate the performance of multiple state-of-the-art question answering
systems on our dataset. We observe a significant gap between human performance
(95.3%) and the best baseline, the BERT-Large model, at 67.5% accuracy.
Comment: 8 pages; appeared at RepEval 2019
SocialIQA: Commonsense Reasoning about Social Interactions
We introduce Social IQa, the first large-scale benchmark for commonsense
reasoning about social situations. Social IQa contains 38,000 multiple-choice
questions for probing emotional and social intelligence in a variety of
everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan
leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could
hear"). Through crowdsourcing, we collect commonsense questions along with
correct and incorrect answers about social interactions, using a new framework
that mitigates stylistic artifacts in incorrect answers by asking workers to
provide the right answer to a different but related question. Empirical results
show that our benchmark is challenging for existing question-answering models
based on pretrained language models, compared to human performance (>20% gap).
Notably, we further establish Social IQa as a resource for transfer learning of
commonsense knowledge, achieving state-of-the-art performance on multiple
commonsense reasoning tasks (Winograd Schemas, COPA).
Comment: the first two authors contributed equally; accepted to EMNLP 2019;
camera-ready version
Interpreting Neural Networks With Nearest Neighbors
Local model interpretation methods explain individual predictions by
assigning an importance value to each input feature. This value is often
determined by measuring the change in confidence when a feature is removed.
However, the confidence of neural networks is not a robust measure of model
uncertainty. This issue makes reliably judging the importance of the input
features difficult. We address this by changing the test-time behavior of
neural networks using Deep k-Nearest Neighbors. Without harming text
classification accuracy, this algorithm provides a more robust uncertainty
metric which we use to generate feature importance values. The resulting
interpretations better align with human perception than baseline methods.
Finally, we use our interpretation method to analyze model predictions on
dataset annotation artifacts.
Comment: EMNLP 2018 BlackboxNLP
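The core Deep k-NN substitution can be sketched on toy data. The two-dimensional "representations" below are invented stand-ins for a network's hidden activations: instead of softmax confidence, uncertainty is read off as conformity, the fraction of a test point's k nearest training representations that share its predicted label.

```python
# Toy sketch of k-NN conformity as an uncertainty measure.
def knn_conformity(train, x, k=3):
    """train: list of (feature_vector, label) pairs.
    Returns (predicted_label, conformity in [0, 1])."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    neighbors = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    pred = max(votes, key=votes.get)
    return pred, votes[pred] / k

train = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"), ((0.2, 0.1), "neg"),
         ((1.0, 1.0), "pos"), ((0.9, 1.1), "pos"), ((1.1, 0.9), "pos")]

near = knn_conformity(train, (0.1, 0.1))   # deep inside the "neg" cluster
mid = knn_conformity(train, (0.55, 0.55))  # between clusters
print(near)  # unanimous neighbors: conformity 1.0
print(mid)   # mixed neighbors: conformity drops below 1.0
```

Unlike softmax confidence, this score degrades gracefully near decision boundaries, which is what makes the resulting leave-one-word-out importance values more stable.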
Testing the Generalization Power of Neural Network Models Across NLI Benchmarks
Neural network models have been very successful in natural language
inference, with the best models reaching 90% accuracy in some benchmarks.
However, the success of these models turns out to be largely benchmark
specific. We show that models trained on a natural language inference dataset
drawn from one benchmark fail to perform well in others, even if the notion of
inference assumed in these benchmarks is the same or similar. We train six high
performing neural network models on different datasets and show that each one
of these has problems of generalizing when we replace the original test set
with a test set taken from another corpus designed for the same task. In light
of these results, we argue that most of the current neural network models are
not able to generalize well in the task of natural language inference. We find
that using large pre-trained language models helps with transfer learning when
the datasets are similar enough. Our results also highlight that the current
NLI datasets do not cover the different nuances of inference extensively
enough.
Comment: Accepted to the 2019 ACL Workshop BlackboxNLP: Analyzing and
interpreting neural networks for NLP
Transforming Question Answering Datasets Into Natural Language Inference Datasets
Existing datasets for natural language inference (NLI) have propelled
research on language understanding. We propose a new method for automatically
deriving NLI datasets from the growing abundance of large-scale question
answering datasets. Our approach hinges on learning a sentence transformation
model which converts question-answer pairs into their declarative forms.
Despite being primarily trained on a single QA dataset, we show that it can be
successfully applied to a variety of other QA resources. Using this system, we
automatically derive a new freely available dataset of over 500k NLI examples
(QA-NLI), and show that it exhibits a wide range of inference phenomena rarely
seen in previous NLI datasets.
Comment: 11 pages, 6 figures
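The shape of the declarative transformation can be shown with a hand-written rule. The paper learns this mapping with a trained sentence-transformation model; the single regex pattern below is purely illustrative, covers only one question shape, and only regular verbs:

```python
import re

# Illustrative (hand-written) QA -> declarative rule: the real system
# learns this transformation from data rather than using templates.
def qa_to_declarative(question, answer):
    m = re.match(r"What did (\w+) (\w+)\?", question)
    if m:
        subject, verb = m.group(1), m.group(2)
        # naive past tense: works for regular verbs only
        return f"{subject} {verb}ed {answer}."
    return None

hypothesis = qa_to_declarative("What did Sara open?", "the hood")
print(hypothesis)
```

Pairing the resulting declarative sentence with the passage the question was asked about yields a premise-hypothesis pair, which is how QA pairs become NLI examples at scale.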