1,151 research outputs found

The Winograd Schema Challenge and Reasoning about Correlation

Author: Amelia Harrison
Daniel Bailey
Julian Michael
Vladimir Lifschitz
Yuliya Lierler
Publication date: 01/01/2015

Abstract: The Winograd Schema Challenge is an alternative to the Turing Test that may provide a more meaningful measure of machine intelligence. It poses a set of coreference resolution problems that cannot be solved without human-like reasoning. In this paper, we take the view that the solution to such problems lies in establishing discourse coherence. Specifically, we examine two types of rhetorical relations that can be used to establish discourse coherence: positive and negative correlation. We introduce a framework for reasoning about correlation between sentences, and show how this framework can be used to justify solutions to some Winograd Schema problems.

CiteSeerX

The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

Author: Abdou Mostafa
Barrett Maria
Belinkov Yonatan
Elliott Desmond
Ravishankar Vinit
Søgaard Anders
Publication date: 01/01/2020

Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones. Overall, humans are correct more often than out-of-the-box models, and the models are sometimes right for the wrong reasons. Finally, we show that fine-tuning on a large, task-specific dataset can offer a solution to these issues.

Comment: ACL 202

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System