Neural network models have shown great success at natural language inference
(NLI), the task of determining whether a premise entails a hypothesis. However,
recent studies suggest that these models may rely on fallible heuristics rather
than deep language understanding. We introduce a challenge set to test whether
NLI systems adopt one such heuristic: assuming that a sentence entails all of
its subsequences, such as assuming that "Alice believes Mary is lying" entails
"Alice believes Mary." We evaluate several competitive NLI models on this
challenge set and find strong evidence that they do rely on the subsequence
heuristic.

Comment: Accepted as an abstract for SCiL 2019; added acknowledgment
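The subsequence heuristic described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, tokenization is a simple lowercase word split, and "subsequence" is assumed here to mean a contiguous run of tokens (which covers every example given in the paper).

```python
import re


def tokenize(sentence):
    # Assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"[a-z']+", sentence.lower())


def is_contiguous_subsequence(short, long):
    # True if the token list `short` occurs as an unbroken run inside `long`.
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def subsequence_heuristic(premise, hypothesis):
    # Predict entailment whenever the hypothesis is a subsequence of the premise.
    if is_contiguous_subsequence(tokenize(hypothesis), tokenize(premise)):
        return "entailment"
    return "non-entailment"


# The heuristic gets the NP/S example from the abstract wrong:
print(subsequence_heuristic("Alice believes Mary is lying.", "Alice believes Mary."))
```

A system that behaves like this function would label every pair in the challenge set as entailment, which is the failure mode the challenge set is designed to expose.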

Linzen, Tal

McCoy, Richard T.

English

arXiv

ScholarWorks@UMass Amherst

Proceedings of the Society for Computation in Linguistics (SCiL) 2019, Volume 2, Article 46, pages 358-360.
New York City, New York, January 3-6, 2019.

Recommended Citation: McCoy, Richard T. and Linzen, Tal (2019) "Non-Entailed Subsequences as a Challenge for Natural Language Inference," Proceedings of the Society for Computation in Linguistics: Vol. 2, Article 46.
DOI: https://doi.org/10.7275/9hfp-2974
Available at: https://scholarworks.umass.edu/scil/vol2/iss1/46

Non-entailed subsequences as a challenge for natural language inference

R. Thomas McCoy
Department of Cognitive Science
Johns Hopkins University
tom.mccoy@jhu.edu

Tal Linzen
Department of Cognitive Science
Johns Hopkins University
tal.linzen@jhu.edu

Introduction: Natural language inference (NLI), the task of determining whether a premise entails a hypothesis, is a central challenge for natural language understanding systems (Condoravdi et al., 2003; Dagan et al., 2006; Bowman et al., 2015). The availability of large sets of premises and hypotheses generated through crowdsourcing has made it possible to train neural networks without explicit logical representations to perform this task; such systems have reached considerable accuracy on these data sets (Radford et al., 2018; Kim et al., 2018). Recent studies have identified biases in these data sets which complicate the interpretation of these successes; for instance, statistical regularities in crowdsourced hypotheses make it possible to reach substantial accuracy without even considering the premise (Gururangan et al., 2018; Poliak et al., 2018). Since neural networks excel at capturing such statistical regularities, success on biased data sets may reflect fallible heuristics rather than deep language understanding, underscoring the need for a controlled experimental approach for evaluating NLI systems. To this end, we introduce a challenge set that targets the following possible heuristic:

(1) The subsequence heuristic: Assume that a sentence entails all of its subsequences.

This heuristic is attractive to a statistical learner because it often yields the correct answer for NLI sentence pairs:

(2) a. John likes Baltimore a lot. → John likes Baltimore.
    b. Roses are red, and violets are blue → Violets are blue.

The subsequence heuristic is not a generally valid inference strategy, however; for example, it incorrectly predicts that the following sentence pairs are instances of entailment:

(3) Alice believes Mary is lying. ↛ Alice believes Mary.
(4) The book on the table is blue. ↛ The table is blue.
(5) The student sent the gift by Max yawned. ↛ The student sent the gift.

We conjecture that pairs such as (3)-(5), in which the hypothesis is a nonentailed, nonconstituent subsequence of the premise, are highly unlikely to be generated as potential contradictions by untrained annotators; consequently, they will not be available when training the model and will not be reflected in standard accuracy metrics.

We propose to create a challenge set that leverages the syntactic constructions illustrated in (3)-(5), as well as other constructions, to generate sentence pairs in which the hypothesis is a nonentailed nonconstituent subsequence of the premise. We demonstrate the viability of our approach with a set of sentences modeled after (3). These sentences are referred to in psycholinguistics as NP/S sentences (e.g., Pritchett 1988), because the verb (believe) can take either a direct object noun phrase (NP) or a sentence (S) as its complement; the hypothesis "Alice believes Mary" is the result of incorrectly assuming that the complement of the verb is the noun phrase "Mary" instead of the sentence "Mary is lying". We evaluate a number of competitive NLI models on this challenge set. To anticipate our results, the accuracy of these models was close to 0% (when chance performance is 50%), supporting the hypothesis that they rely on the subsequence heuristic.

Models: We assess the performance of five neural-network NLI models. All models consisted of bidirectional LSTMs trained in two stages, following Wang et al. (2018): first, on one of the pre-training tasks described below, and then on NLI (with a classifier predicting the labels entailment, contradiction and neutral), using the MNLI data set (Williams et al., 2018).
Our pre-training tasks were: NLI using the MNLI corpus, combinatory categorial grammar (CCG) supertagging using tags from CCGbank (derived from the Penn Treebank) (Hockenmaier and Steedman, 2007), image generation from captions using the MS COCO data set (Lin et al., 2014), and language modeling (LM) using the WikiText-103 corpus (Merity et al., 2016). We also tested a model without pre-training, in which the encoder had random weights but the classifier was still trained on MNLI.

Data set creation: We generated premises using the template NP1 V1 S1, where (i) NP1 appeared as the subject of V1 in the MNLI training corpus, (ii) the subject of S1 appeared as the direct object of V1 in the corpus, and (iii) S1 appeared in the corpus (not necessarily as a complement of V1). These conditions ensured that our examples were in the domain on which the models were trained, and that the models had been exposed to all words and dependencies in our examples. For example, based on the sentences in (6) from the MNLI training corpus, we generated the example in (7):

(6) a. The Knights believed that their goal was justified, however they would succumb to infighting.
    b. No one believed the story that Miss Howard has made up.
    c. San’doro said the story was awful.

(7) The Knights believed the story was awful. ↛ The Knights believed the story.

We built our examples around the verbs heard, believed, felt, and claimed. We generated 200 sentence pairs and had each one annotated by three workers on Amazon Mechanical Turk. We kept the 88 examples for which two of the annotators agreed that the example made sense and that the correct label was not entailment.
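The template and the annotation filter described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the names `NPSPair`, `make_np_s_pair`, and `keep_pair` are hypothetical, the corpus-based selection of NP1, V1, and S1 is left out, and the filter assumes that "two of the annotators agreed" means at least two of the three annotators each judged the example sensible and not entailment.

```python
from dataclasses import dataclass


@dataclass
class NPSPair:
    premise: str
    hypothesis: str


def make_np_s_pair(np1, v1, s1_subject, s1_rest):
    # Premise follows the template NP1 V1 S1; the hypothesis keeps only
    # NP1 V1 plus the subject of S1, i.e. a non-entailed subsequence.
    premise = f"{np1} {v1} {s1_subject} {s1_rest}."
    hypothesis = f"{np1} {v1} {s1_subject}."
    return NPSPair(premise, hypothesis)


def keep_pair(judgments, min_agree=2):
    # judgments: one (makes_sense, label) tuple per annotator.
    # Assumption: an annotator counts toward agreement only if they said the
    # example makes sense AND chose a label other than "entailment".
    votes = sum(
        1 for makes_sense, label in judgments
        if makes_sense and label != "entailment"
    )
    return votes >= min_agree


# Reproduces example (7) from the components of (6).
pair = make_np_s_pair("The Knights", "believed", "the story", "was awful")
print(pair.premise)
print(pair.hypothesis)
```

Keeping the agreement threshold as a parameter makes it easy to check how many pairs would survive a stricter criterion, such as unanimous agreement.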
Some premises from our data set are shown in (8)-(10), with the associated non-entailed hypotheses marked in square brackets:

(8) [They claimed the cinema] is in a steel sphere.
(9) [The committee felt the pressure] was applied by oversight entities.
(10) [They heard the miners] were prepared to fight.

             MNLI   NP/S   NP/S (no neg.)
MNLI         0.75   0.08   0.01
CCG          0.67   0.17   0.03
MSCOCO       0.61   0.24   0.03
LM           0.72   0.06   0.00
Random       0.73   0.03   0.01
Chance       0.33   0.50   0.50

Table 1: Accuracies on MNLI, our unmodified NP/S set, and our NP/S set with negation words removed.

Results: Table 1 reports accuracies on the MNLI development set and our NP/S set. All models performed reasonably well on MNLI but substantially below chance on the NP/S set. Closer inspection revealed that most examples that the models correctly labeled not entailment had a negation word in the premise but not the hypothesis:

(11) They heard the tapes are of no importance ↛ They heard the tapes.
(12) The young American believed the statistician is not involved. ↛ The young American believed the statistician.

This observation suggests that even when the models correctly labeled an NP/S example as not entailment they may have done so using a heuristic that relied heavily on irrelevant negation words. To test whether this was the case, we removed all negation words from the NP/S examples; as shown in Table 1, this caused the accuracy of all models to fall to nearly 0, suggesting that the models were indeed using a negation-word-based heuristic. Thus, even when the models provided the correct label on the NP/S evaluation set, they generally did so for the wrong reason.

Conclusions: All models perform poorly on the NP/S evaluation set, especially when irrelevant negation words are removed. These results indicate that standard neural models trained on crowdsourced NLI data sets are prone to heuristics based on subsequences and negation and suggest that there is substantial room for improving the sophistication of NLI models.
The clear and interpretable results of our evaluation strategy motivate expanding our data set to include additional constructions with similar properties, some of which are illustrated in (3)-(5), to create an ambitious standard for measuring progress in NLI. In future work, we will also expand this data set into a more general test suite for evaluating which heuristics a model has learned. This test suite will include the subsequence heuristic and the negation heuristic from the current work, as well as other heuristics based on properties such as lexical overlap between the premise and the hypothesis. We will also investigate other types of models trained on NLI, such as non-neural models and tree-based neural models, to test whether reliance on the subsequence heuristic arises from the NLI task or from the sequential nature of standard RNNs, or both.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Cleo Condoravdi, Dick Crouch, Valeria de Paiva, Reinhard Stolle, and Daniel G. Bobrow. 2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In Joaquin Quiñonero-Candela, Ido Dagan, Bernardo Magnini, and Florence d'Alché Buc, editors, Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer Berlin Heidelberg, Berlin, Heidelberg.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112. Association for Computational Linguistics.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191. Association for Computational Linguistics.

Bradley L. Pritchett. 1988. Garden path phenomena and the grammatical basis of language processing. Language, 64(3):539–576.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
