Experience and Prediction: A Metric of Hardness for a Novel Litmus Test
In the last decade, the Winograd Schema Challenge (WSC) has become a central focus of the research community as a novel litmus test for machine intelligence. The WSC has spurred research interest in part because it can be seen as a means of understanding human behavior. In this regard, the development of new techniques has made it possible to use Winograd schemas in various fields, such as the design of novel forms of CAPTCHAs.
Work from the literature that established a baseline for adult human performance on the WSC has shown that not all schemas are alike: they can potentially be categorized according to their perceived hardness for humans. Such a \textit{hardness-metric} could be used in future challenges, or in the WSC CAPTCHA service, to differentiate between Winograd schemas.
Our recent work has shown that this could be achieved by designing an automated system able to output the hardness indexes of Winograd schemas, albeit with limitations on the number of schemas it could be applied to. This paper extends that research by presenting a new system, based on Machine Learning (ML), that can output the hardness of any Winograd schema faster and more accurately than previously used methods. Our system, which implements two different approaches, namely a random forest and a deep learning (LSTM-based) model, is ready to be used as an extension of any system that aims to differentiate between Winograd schemas according to their perceived hardness for humans. Alongside the system, we extend previous work by presenting the results of a large-scale experiment that shows how human performance varies across Winograd schemas.
Comment: 33 pages, 10 figures
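The random-forest approach the abstract mentions can be sketched minimally: a regressor mapping per-schema features to a hardness index. The feature set below (token count, number of candidate answers, relative pronoun position) and all numbers are invented for illustration and are not the paper's actual features or data.

```python
# Hypothetical sketch: predicting a hardness index for Winograd schemas
# with a random forest, one of the two approaches the paper describes.
# Features and targets are toy values, not the paper's dataset.
from sklearn.ensemble import RandomForestRegressor

# Each row: [token count, candidate count, pronoun position ratio].
# Each target: a measured human hardness index in [0, 1].
X_train = [
    [12, 2, 0.3],
    [25, 2, 0.7],
    [18, 3, 0.5],
    [30, 2, 0.9],
]
y_train = [0.2, 0.6, 0.4, 0.8]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the hardness index of an unseen schema.
prediction = model.predict([[20, 2, 0.6]])[0]
print(prediction)
```

Because a random-forest regressor averages training targets within its leaves, the predicted hardness stays inside the range seen during training, which suits a bounded hardness index.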
Attention Is (not) All You Need for Commonsense Reasoning
The recently introduced BERT model exhibits strong performance on several
language understanding benchmarks. In this paper, we describe a simple
re-implementation of BERT for commonsense reasoning. We show that the
attentions produced by BERT can be directly utilized for tasks such as the
Pronoun Disambiguation Problem and Winograd Schema Challenge. Our proposed
attention-guided commonsense reasoning method is conceptually simple yet
empirically powerful. Experimental analysis on multiple datasets demonstrates
that our proposed system performs remarkably well in all cases, outperforming the previously reported state of the art by a clear margin. While these results suggest that BERT implicitly learns to establish complex relationships between entities, solving commonsense reasoning tasks might require more than unsupervised models learned from huge text corpora.
Comment: to appear at ACL 201
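The core idea of attention-guided pronoun disambiguation can be illustrated with a toy example: given the attention mass a pronoun token assigns to each candidate referent, pick the candidate that receives the most attention. The attention values below are made up for illustration; in practice they would come from a BERT model's self-attention heads.

```python
# Illustrative sketch, not the paper's exact scoring function:
# resolve a pronoun by comparing attention from the pronoun token
# to the tokens of each candidate referent.
import numpy as np

candidates = ["trophy", "suitcase"]

# Rows: attention heads; columns: attention mass on each candidate.
attention = np.array([
    [0.12, 0.31],
    [0.05, 0.44],
    [0.20, 0.18],
])

# Aggregate over heads (here: max, as one simple choice), then argmax.
scores = attention.max(axis=0)
predicted = candidates[int(np.argmax(scores))]
print(predicted)  # → suitcase
```

Different aggregation choices (max, mean, or per-head voting) give variants of the same attention-as-evidence idea.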
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
We present a large-scale collection of diverse natural language inference
(NLI) datasets that help provide insight into how well a sentence
representation captures distinct types of reasoning. The collection results
from recasting 13 existing datasets from 7 semantic phenomena into a common NLI
structure, resulting in over half a million labeled context-hypothesis pairs in
total. We refer to our collection as the DNC: Diverse Natural Language
Inference Collection. The DNC is available online at https://www.decomp.net,
and will grow over time as additional resources are recast and added from novel
sources.
Comment: To be presented at EMNLP 2018. 15 pages
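The "recasting" operation the DNC abstract describes, turning an existing annotation into a context-hypothesis pair with an NLI label, can be sketched as follows. The example task (sentiment), the hypothesis template, and the label names are illustrative assumptions, not the DNC's actual recipe.

```python
# Hedged sketch of recasting a sentiment-labeled sentence into NLI form.
# Template and label vocabulary are hypothetical, for illustration only.

def recast_sentiment(sentence: str, label: str) -> dict:
    """Turn one sentiment example into a context-hypothesis pair."""
    hypothesis = "The author of this text expresses a positive sentiment."
    nli_label = "entailed" if label == "positive" else "not-entailed"
    return {"context": sentence, "hypothesis": hypothesis, "label": nli_label}

pair = recast_sentiment("The movie was wonderful.", "positive")
print(pair["label"])  # → entailed
```

Applying one such template per source dataset is how a heterogeneous collection of annotations can be folded into a single labeled context-hypothesis format.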