Probing Natural Language Inference Models through Semantic Fragments
Do state-of-the-art models for language understanding already have, or can
they easily learn, abilities such as boolean coordination, quantification,
conditionals, comparatives, and monotonicity reasoning (i.e., reasoning about
word substitutions in sentential contexts)? While such phenomena are involved
in natural language inference (NLI) and go beyond basic linguistic
understanding, the extent to which they are captured in existing NLI
benchmarks and effectively learned by models remains unclear. To investigate this, we
propose the use of semantic fragments---systematically generated datasets that
each target a different semantic phenomenon---for probing, and efficiently
improving, such capabilities of linguistic models. This approach to creating
challenge datasets allows direct control over the semantic diversity and
complexity of the targeted linguistic phenomena, and results in a more precise
characterization of a model's linguistic behavior. Our experiments, using a
library of 8 such semantic fragments, reveal two remarkable findings: (a)
State-of-the-art models, including BERT, that are pre-trained on existing NLI
benchmark datasets perform poorly on these new fragments, even though the
phenomena probed here are central to the NLI task. (b) On the other hand, with
only a few minutes of additional fine-tuning---with a carefully selected
learning rate and a novel variation of "inoculation"---a BERT-based model can
master all of these logic and monotonicity fragments while retaining its
performance on established NLI benchmarks.
Comment: AAAI camera-ready version
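As a rough illustration of the "systematically generated" datasets described
above, the sketch below builds a toy quantifier fragment from a small lexicon
and templates. The lexicon, templates, and label scheme are hypothetical
stand-ins for illustration, not the paper's actual generation grammar.

```python
# Illustrative sketch only: a toy "quantifier" fragment in the spirit of
# semantic fragments. Lexicon and templates are hypothetical.
import itertools

NOUNS = ["dog", "student", "violinist"]
VERBS = ["runs", "sleeps", "sings"]

def quantifier_fragment():
    """Yield (premise, hypothesis, label) NLI triples built from the
    schema 'every N V' => 'some N V' (valid assuming a non-empty domain)."""
    for noun, verb in itertools.product(NOUNS, VERBS):
        every = f"Every {noun} {verb}."
        some = f"Some {noun} {verb}."
        no = f"No {noun} {verb}."
        yield every, some, "entailment"    # every => some
        yield some, every, "neutral"       # some =/=> every
        yield every, no, "contradiction"   # every vs. no (non-empty domain)

# Each triple targets exactly one controlled phenomenon, which is what
# makes such a fragment diagnostic rather than a broad benchmark.
for premise, hypothesis, label in list(quantifier_fragment())[:3]:
    print(f"{premise} | {hypothesis} -> {label}")
```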
Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition
Natural language inference (NLI) is an increasingly important task for
natural language understanding, which requires one to infer whether a sentence
entails another. However, the ability of NLI models to make pragmatic
inferences remains understudied. We create an IMPlicature and PRESupposition
diagnostic dataset (IMPPRES), consisting of >25k semi-automatically generated
sentence pairs illustrating well-studied pragmatic inference types. We use
IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on
MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although
MultiNLI appears to contain very few pairs illustrating these inference types,
we find that BERT learns to draw pragmatic inferences. It reliably treats
scalar implicatures triggered by "some" as entailments. For some presupposition
triggers like "only", BERT reliably recognizes the presupposition as an
entailment, even when the trigger is embedded under an entailment-canceling
operator like negation. BOW and InferSent show weaker evidence of pragmatic
reasoning. We conclude that NLI training encourages models to learn some, but
not all, pragmatic inferences.
Comment: to appear in Proceedings of ACL 2020
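For concreteness, a probing setup in this spirit can be sketched as below;
the checkpoint (roberta-large-mnli, a public MultiNLI-trained model) and the
example pairs are illustrative assumptions, not the paper's exact models or
data.

```python
# Hedged sketch: scoring IMPPRES-style pragmatic-inference pairs with an
# off-the-shelf MNLI-trained model. Checkpoint and pairs are illustrative.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

# Pragmatically, "some" implicates "not all" (scalar implicature), and
# "only Mary sang" presupposes "Mary sang".
pairs = [
    ("Some of the students passed.", "Not all of the students passed."),
    ("Only Mary sang.", "Mary sang."),
]

for premise, hypothesis in pairs:
    result = nli({"text": premise, "text_pair": hypothesis})[0]
    print(f"{premise!r} / {hypothesis!r} -> {result['label']} "
          f"({result['score']:.2f})")
```

A pragmatic reader treats the first pair as an entailment, while a strictly
logical reader would call it neutral; the gap between those two readings is
what a diagnostic like IMPPRES is designed to expose.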