CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense
Commonsense reasoning is a critical AI capability, but it is difficult to
construct challenging datasets that test common sense. Recent neural question
answering systems, based on large pre-trained models of language, have already
achieved near-human-level performance on commonsense knowledge benchmarks.
These systems do not possess human-level common sense, but are able to exploit
limitations of the datasets to achieve human-level scores.
We introduce the CODAH dataset, an adversarially-constructed evaluation
dataset for testing common sense. CODAH forms a challenging extension to the
recently-proposed SWAG dataset, which tests commonsense knowledge using
sentence-completion questions that describe situations observed in video. To
produce a more difficult dataset, we introduce a novel procedure for question
acquisition in which workers author questions designed to target weaknesses of
state-of-the-art neural question answering systems. Workers are rewarded for
submissions that models fail to answer correctly both before and after
fine-tuning (in cross-validation). We create 2.8k questions via this procedure
and evaluate the performance of multiple state-of-the-art question answering
systems on our dataset. We observe a significant gap between human performance,
which is 95.3%, and the best baseline accuracy of 67.5%, achieved by
the BERT-Large model.
Comment: 8 pages, Appeared in RepEval 2019
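To illustrate the kind of baseline that adversarial question writers try to defeat, here is a minimal sketch (not the CODAH authors' system; the model choice and scoring rule are illustrative assumptions): each candidate sentence completion is scored by its summed token log-probability under a pre-trained causal language model, and the most likely completion is chosen.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative model choice; any causal LM scorer could play this role.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def completion_log_prob(prompt: str, completion: str) -> float:
    """Summed log-probability of `completion` given `prompt` under the LM."""
    enc = tokenizer(prompt + " " + completion, return_tensors="pt")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    ids = enc["input_ids"][0]
    with torch.no_grad():
        logits = model(**enc).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # The token at position `pos` is predicted from the logits at `pos - 1`.
    return sum(log_probs[0, pos - 1, ids[pos]].item()
               for pos in range(prompt_len, len(ids)))

def pick_completion(prompt: str, choices: list[str]) -> int:
    """Index of the candidate completion the model finds most likely."""
    scores = [completion_log_prob(prompt, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

Questions that such a scorer answers incorrectly, both before and after fine-tuning, are the ones the CODAH procedure rewards workers for submitting.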
RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms
Pre-trained language models (PTLMs) have achieved impressive performance on
commonsense inference benchmarks, but their ability to employ commonsense to
make robust inferences, which is crucial for effective communications with
humans, is debated. In the pursuit of advancing fluid human-AI communication,
we propose a new challenge, RICA: Robust Inference capability based on
Commonsense Axioms, that evaluates robust commonsense inference despite textual
perturbations. To generate data for this challenge, we develop a systematic and
scalable procedure using commonsense knowledge bases and probe PTLMs across two
different evaluation settings. Extensive experiments on our generated probe
sets with more than 10k statements show that PTLMs perform no better than
random guessing in the zero-shot setting, are heavily impacted by statistical
biases, and are not robust to perturbation attacks. We also find that
fine-tuning on similar statements offers limited gains, as PTLMs still fail to
generalize to unseen inferences. Our new large-scale benchmark exposes a
significant gap between PTLMs and human-level language understanding and offers
a new challenge for PTLMs to demonstrate commonsense.
Comment: 18 pages, 8 figures
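As an illustration of zero-shot masked-LM probing on an axiom-derived statement, here is a minimal sketch (not the RICA evaluation code; the template sentence and the candidate fillers are invented for illustration): the model's probability for each candidate word at the masked position is compared directly, with no fine-tuning.

import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def filler_probabilities(template: str, candidates: list[str]) -> dict[str, float]:
    """Probability the masked LM assigns to each single-token candidate at [MASK]."""
    enc = tokenizer(template, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return {w: probs[tokenizer.convert_tokens_to_ids(w)].item() for w in candidates}

# Invented axiom-style statement with a perturbable comparative filler.
print(filler_probabilities(
    "A rock is heavier than a leaf, so a rock is [MASK] likely to sink in water.",
    ["more", "less"]))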
ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
Given questions regarding some prototypical situation, such as "Name something
that people usually do before they leave the house for work?", a human can easily
answer them via acquired experiences. There can be multiple right answers for
such questions, with some more common for a situation than others. This paper
introduces a new question answering dataset for training and evaluating common
sense reasoning capabilities of artificial intelligence systems in such
prototypical situations. The training set is gathered from an existing set of
questions played in a long-running international game show, FAMILY FEUD. The
hidden evaluation set is created by gathering answers for each question from
100 crowd-workers. We also propose a generative evaluation task where a model
has to output a ranked list of answers, ideally covering all prototypical
answers for a question. After presenting multiple competitive baseline models,
we find that human performance still exceeds model scores on all evaluation
metrics with a meaningful gap, supporting the challenging nature of the task.
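The ranked-list evaluation described above can be illustrated with a simplified cluster-matching score (a sketch only; the official ProtoQA metrics additionally handle fuzzy answer matching and limits on the number of predictions, and the example clusters below are invented): each predicted answer can claim at most one answer cluster, and credit is proportional to the number of crowd-workers in the clusters covered.

from typing import Dict, List

def ranked_list_score(predictions: List[str], clusters: Dict[str, int]) -> float:
    """Fraction of crowd-worker answers covered by the ranked predictions.
    `clusters` maps a canonical answer string to the number of workers who gave it."""
    remaining = dict(clusters)
    covered = 0
    for pred in predictions:          # predictions are in ranked order
        if pred in remaining:         # exact match stands in for fuzzy matching
            covered += remaining.pop(pred)
    total = sum(clusters.values())
    return covered / total if total else 0.0

# Invented clusters for "Name something people usually do before leaving for work."
clusters = {"shower": 34, "eat breakfast": 28, "get dressed": 22, "brush teeth": 16}
print(ranked_list_score(["get dressed", "shower", "drink coffee"], clusters))  # 0.56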
Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches
In the NLP community, recent years have seen a surge of research activities
that address machines' ability to perform deep language understanding, which
goes beyond what is explicitly stated in the text and instead relies on reasoning
and knowledge of the world. Many benchmark tasks and datasets have been created to
support the development and evaluation of such natural language inference
ability. As these benchmarks become instrumental and a driving force for the
NLP research community, this paper aims to provide an overview of recent
benchmarks, relevant knowledge resources, and state-of-the-art learning and
inference approaches in order to support a better understanding of this growing
field.
SocialIQA: Commonsense Reasoning about Social Interactions
We introduce Social IQa, the first large-scale benchmark for commonsense
reasoning about social situations. Social IQa contains 38,000 multiple choice
questions for probing emotional and social intelligence in a variety of
everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan
leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could
hear"). Through crowdsourcing, we collect commonsense questions along with
correct and incorrect answers about social interactions, using a new framework
that mitigates stylistic artifacts in incorrect answers by asking workers to
provide the right answer to a different but related question. Empirical results
show that our benchmark is challenging for existing question-answering models
based on pretrained language models, compared to human performance (>20% gap).
Notably, we further establish Social IQa as a resource for transfer learning of
commonsense knowledge, achieving state-of-the-art performance on multiple
commonsense reasoning tasks (Winograd Schemas, COPA).
Comment: the first two authors contributed equally; accepted to EMNLP 2019; camera ready version
KaLM at SemEval-2020 Task 4: Knowledge-aware Language Models for Comprehension And Generation
This paper presents our strategies in SemEval 2020 Task 4: Commonsense
Validation and Explanation. We propose a novel way to search for evidence and
choose different large-scale pre-trained models as the backbone for the three
subtasks. The results show that our evidence-searching approach improves model
performance on the commonsense explanation task. Our team ranks 2nd in subtask C
according to the human evaluation score.
Comment: 6 pages, 1 figure
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
When answering a question, people often draw upon their rich world knowledge
in addition to the particular context. Recent work has focused primarily on
answering questions given some relevant document or context, and required very
little general background. To investigate question answering with prior
knowledge, we present CommonsenseQA: a challenging new dataset for commonsense
question answering. To capture common sense beyond associations, we extract
from ConceptNet (Speer et al., 2017) multiple target concepts that have the
same semantic relation to a single source concept. Crowd-workers are asked to
author multiple-choice questions that mention the source concept and
discriminate in turn between each of the target concepts. This encourages
workers to create questions with complex semantics that often require prior
knowledge. We create 12,247 questions through this procedure and demonstrate
the difficulty of our task with a large number of strong baselines. Our best
baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy,
well below human performance, which is 89%.
Comment: accepted as a long paper at NAACL 2019
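The seed-extraction step can be illustrated with a short sketch (not the authors' pipeline; the triples and the threshold are invented for illustration): ConceptNet-style triples are grouped by source concept and relation, and groups with several distinct targets become the answer candidates for one crowdsourced question.

from collections import defaultdict
from typing import Dict, List, Tuple

def question_seeds(triples: List[Tuple[str, str, str]],
                   min_targets: int = 3) -> Dict[Tuple[str, str], List[str]]:
    """Group targets by (source, relation); keep groups big enough to seed a question."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for source, relation, target in triples:
        grouped[(source, relation)].append(target)
    return {key: targets for key, targets in grouped.items()
            if len(set(targets)) >= min_targets}

# Invented triples; a real pipeline would read them from a ConceptNet dump.
triples = [
    ("river", "AtLocation", "valley"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "waterfall"),
    ("river", "UsedFor", "fishing"),
]
print(question_seeds(triples))
# {('river', 'AtLocation'): ['valley', 'bridge', 'waterfall']}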
WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge
In this paper, we present the first comprehensive categorization of essential
commonsense knowledge for answering the Winograd Schema Challenge (WSC). For
each of the questions, we invite annotators to first provide reasons for making
correct decisions and then categorize them into six major knowledge categories.
By doing so, we better understand the limitation of existing methods (i.e.,
what kind of knowledge cannot be effectively represented or inferred with
existing methods) and shed some light on the commonsense knowledge that we need
to acquire in the future for better commonsense reasoning. Moreover, to
investigate whether current WSC models genuinely understand the commonsense involved
or simply solve the WSC questions based on the statistical bias of the dataset, we
leverage the collected reasons to develop a new task called WinoWhy, which
requires models to distinguish plausible reasons from very similar but wrong
reasons for all WSC questions. Experimental results show that even though
pre-trained language representation models have achieved promising progress on
the original WSC dataset, they still struggle on WinoWhy. Further
experiments show that even though supervised models can achieve better
performance, they remain sensitive to the dataset
distribution. WinoWhy and all code are available at:
https://github.com/HKUST-KnowComp/WinoWhy.
Comment: Accepted by ACL 2020
A Review of Winograd Schema Challenge Datasets and Approaches
The Winograd Schema Challenge is both a commonsense reasoning and natural
language understanding challenge, introduced as an alternative to the Turing
test. A Winograd schema is a pair of sentences differing in one or two words
with a highly ambiguous pronoun, resolved differently in the two sentences,
that appears to require commonsense knowledge to be resolved correctly. The
examples were designed to be easily solvable by humans but difficult for
machines, in principle requiring a deep understanding of the content of the
text and the situation it describes. This paper reviews existing Winograd
Schema Challenge benchmark datasets and approaches that have been published
since its introduction.
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning
Fine-tuning of pre-trained transformer models has become the standard
approach for solving common NLP tasks. Most of the existing approaches rely on
a randomly initialized classifier on top of such networks. We argue that this
fine-tuning procedure is sub-optimal as the pre-trained model has no prior on
the specific classifier labels, while it might have already learned an
intrinsic textual representation of the task. In this paper, we introduce a new
scoring method that casts a plausibility ranking task in a full-text format and
leverages the masked language modeling head tuned during the pre-training
phase. We study commonsense reasoning tasks where the model must rank a set of
hypotheses given a premise, focusing on the COPA, Swag, HellaSwag and
CommonsenseQA datasets. By exploiting our scoring method without fine-tuning,
we are able to produce strong baselines (e.g. 80% test accuracy on COPA) that
are comparable to supervised approaches. Moreover, when fine-tuning directly on
the proposed scoring function, we show that our method provides a much more
stable training phase across random restarts (e.g., a reduced standard
deviation on COPA test accuracy) and requires less annotated data
than the standard classifier approach to reach equivalent performance.
Comment: Accepted at ACL 2020
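A minimal sketch in the spirit of this masked-LM scoring idea follows (the exact choice of masked tokens and any normalization differ from the paper's scoring function; the model choice is an assumption): premise and hypothesis are encoded as a pair, each premise token is masked in turn, the log-probability of recovering it from context is summed, and hypotheses are ranked by this sum.

import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def plausibility(premise: str, hypothesis: str) -> float:
    """Sum of masked-LM log-probs of recovering each premise token in context."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt")
    ids = enc["input_ids"][0]
    n_premise = len(tokenizer(premise)["input_ids"]) - 2  # drop [CLS] and [SEP]
    score = 0.0
    for pos in range(1, 1 + n_premise):   # premise tokens sit right after [CLS]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0),
                           token_type_ids=enc["token_type_ids"],
                           attention_mask=enc["attention_mask"]).logits
        score += torch.log_softmax(logits[0, pos], dim=-1)[ids[pos]].item()
    return score

def rank_hypotheses(premise: str, hypotheses: list[str]) -> list[str]:
    """Return hypotheses ordered from most to least plausible under the scorer."""
    return sorted(hypotheses, key=lambda h: plausibility(premise, h), reverse=True)

Because no task-specific head is introduced, a scorer of this kind can be applied zero-shot, which is the property the abstract highlights.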