Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension
To precisely evaluate a language model's capability for logical reading
comprehension, we present a dataset for testing the understanding of the
rationale behind critical reasoning. For questions taken from an existing
multiple-choice logical reading comprehension dataset, we crowdsource rationale
texts that explain why we should select or eliminate answer options, resulting
in 3,003 multiple-choice subquestions that are associated with 943 main
questions. Experiments on our dataset show that recent large language models
(e.g., InstructGPT) struggle to answer the subquestions even if they are able
to answer the main questions correctly. We find that the models perform
particularly poorly in answering subquestions written for the incorrect options
of the main questions, implying that the models have a limited capability for
explaining why incorrect alternatives should be eliminated. These results
suggest that our dataset encourages further investigation into the critical
reasoning ability of language models while focusing on the elimination process
of relevant alternatives.
Comment: Accepted to EMNLP 2023
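
As a rough illustration of the evaluation described above, the following Python sketch scores a model separately on main questions and on the rationale subquestions, splitting the latter by whether they target the correct option or an incorrect one. The record fields and the predict callable are illustrative assumptions, not the dataset's actual schema or the paper's evaluation code.

# Minimal sketch: compare main-question accuracy with accuracy on the
# associated rationale subquestions. Field names and `predict` are assumed.
from collections import defaultdict
from typing import Callable, Dict, List

def evaluate(records: List[dict],
             predict: Callable[[str, str, List[str]], int]) -> Dict[str, float]:
    """records: [{"context", "question", "options", "answer",
                  "subquestions": [{"question", "options", "answer",
                                    "targets_correct_option"}]}]"""
    main_correct = 0
    sub_total = defaultdict(int)    # split by which main option the subquestion targets
    sub_correct = defaultdict(int)
    for rec in records:
        pred = predict(rec["context"], rec["question"], rec["options"])
        main_correct += int(pred == rec["answer"])
        for sub in rec["subquestions"]:
            key = "correct_option" if sub["targets_correct_option"] else "incorrect_option"
            sub_total[key] += 1
            sub_pred = predict(rec["context"], sub["question"], sub["options"])
            sub_correct[key] += int(sub_pred == sub["answer"])
    return {"main_accuracy": main_correct / len(records),
            **{f"sub_accuracy_{k}": sub_correct[k] / sub_total[k] for k in sub_total}}

Comparing sub_accuracy_correct_option with sub_accuracy_incorrect_option makes the reported gap on elimination-oriented subquestions directly visible.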
Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios
The possible consequences for the same context may vary depending on the
situation we refer to. However, current studies in natural language processing
do not focus on situated commonsense reasoning under multiple possible
scenarios. This study frames this task by asking multiple questions with the
same set of possible endings as candidate answers, given a short story text.
Our resulting dataset, Possible Stories, consists of more than 4.5K questions
over 1.3K story texts in English. We discover that even current strong
pretrained language models struggle to answer the questions consistently,
highlighting that the highest accuracy in an unsupervised setting (60.2%) is
far behind human accuracy (92.5%). Through a comparison with existing datasets,
we observe that the questions in our dataset contain minimal annotation
artifacts in the answer options. In addition, our dataset includes examples
that require counterfactual reasoning, as well as those requiring readers'
reactions and fictional information, suggesting that our dataset can serve as a
challenging testbed for future studies on situated commonsense reasoning.
Comment: Accepted to COLING 2022
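
The consistency evaluation mentioned here can be made concrete with a small Python sketch: a model is counted as consistent on a story only if it answers every question over the shared set of endings correctly. The schema and the predict callable are assumptions for illustration.

# Minimal sketch: per-question accuracy and per-story consistency for
# questions that share one set of candidate endings. Field names are assumed.
from typing import Callable, Dict, List

def consistency(stories: List[dict],
                predict: Callable[[str, str, List[str]], int]) -> Dict[str, float]:
    """stories: [{"story", "endings", "questions": [{"question", "answer"}]}]"""
    question_hits, story_hits = 0, 0
    n_questions = sum(len(s["questions"]) for s in stories)
    for story in stories:
        correct = [predict(story["story"], q["question"], story["endings"]) == q["answer"]
                   for q in story["questions"]]
        question_hits += sum(correct)
        story_hits += int(all(correct))  # consistent only if every question is right
    return {"accuracy": question_hits / n_questions,
            "consistency": story_hits / len(stories)}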
On Degrees of Freedom in Defining and Testing Natural Language Understanding
Natural language understanding (NLU) studies often exaggerate or
underestimate the capabilities of systems, thereby limiting the reproducibility
of their findings. These erroneous evaluations can be attributed to the
difficulty of defining and testing NLU adequately. In this position paper, we
reconsider this challenge by identifying two types of researcher degrees of
freedom. We revisit Turing's original interpretation of the Turing test and
indicate that an NLU test does not provide an operational definition; it merely
provides inductive evidence that the test subject understands the language
sufficiently well to meet stakeholder objectives. In other words, stakeholders
are free to arbitrarily define NLU through their objectives. To use the test
results as inductive evidence, stakeholders must carefully assess if the
interpretation of test scores is valid or not. However, designing and using NLU
tests involve other degrees of freedom, such as specifying target skills and
defining evaluation metrics. As a result, achieving consensus among
stakeholders becomes difficult. To resolve this issue, we propose a validity
argument, which is a framework comprising a series of validation criteria
across test components. By demonstrating that current practices in NLU studies
can be associated with those criteria and organizing them into a comprehensive
checklist, we prove that the validity argument can serve as a coherent
guideline for designing credible test sets and facilitating scientific
communication.
Comment: Accepted to Findings of ACL 2023
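
One way to picture the proposed framework is as a machine-readable checklist of validation criteria attached to test components; the sketch below uses hypothetical component and criterion names, not the paper's actual checklist.

# Minimal sketch: a validity argument recorded as criteria per test component.
# Component and criterion names are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Criterion:
    description: str
    satisfied: bool = False
    evidence: str = ""  # e.g., a pointer to an annotation study or analysis

@dataclass
class ValidityArgument:
    components: Dict[str, List[Criterion]] = field(default_factory=dict)

    def unmet(self) -> Dict[str, List[str]]:
        """Criteria that still lack supporting evidence, grouped by component."""
        return {name: [c.description for c in crits if not c.satisfied]
                for name, crits in self.components.items()
                if any(not c.satisfied for c in crits)}

argument = ValidityArgument(components={
    "target skills": [Criterion("The skills to be measured are stated explicitly")],
    "test items": [Criterion("Items are checked for annotation artifacts")],
    "metrics": [Criterion("The interpretation of scores is justified to stakeholders")],
})
print(argument.unmet())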
Probing Physical Reasoning with Counter-Commonsense Context
In this study, we create a CConS (Counter-commonsense Contextual Size
comparison) dataset to investigate how physical commonsense affects the
contextualized size comparison task; the proposed dataset consists of both
contexts that fit physical commonsense and those that do not. This dataset
tests the ability of language models to predict the size relationship between
objects under various contexts generated from our curated noun list and
templates. We measure the ability of several masked language models and
generative models. The results show that while large language models can use
prepositions such as "in" and "into" in the provided context to infer size
relationships, they fail to use verbs and thus make incorrect judgments led by
their prior physical commonsense.
Comment: Accepted to ACL 2023 (Short Paper)
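
The probing setup can be approximated in a few lines of Python: generate contexts from templates and a noun list, then ask a masked language model to choose between "bigger" and "smaller". The templates, nouns, and target words below are assumptions in the spirit of the task, not the CConS generation pipeline itself.

# Minimal sketch: template-generated size-comparison probes scored with a
# masked LM. Templates, nouns, and target words are illustrative assumptions.
from transformers import pipeline  # pip install transformers

fill = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The {a} is [MASK] than the {b}.",                              # no context
    "I put the {a} into the {b}. The {a} is [MASK] than the {b}.",  # "into" implies {a} fits inside {b}
]
pairs = [("coin", "box"), ("box", "coin")]  # second pair is counter-commonsense without context

for template in templates:
    for a, b in pairs:
        text = template.format(a=a, b=b)
        # Restrict the mask predictions to the two comparison words and compare their scores.
        scores = {o["token_str"]: o["score"] for o in fill(text, targets=["bigger", "smaller"])}
        print(f"{text!r} -> {max(scores, key=scores.get)} ({scores})")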
Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
Existing analysis work in machine reading comprehension (MRC) is largely
concerned with evaluating the capabilities of systems. However, the
capabilities of datasets are not assessed for benchmarking language
understanding precisely. We propose a semi-automated, ablation-based
methodology for this challenge: by checking whether questions can be solved
even after removing features associated with a skill requisite for language
understanding, we evaluate to what degree the questions do not require the
skill. Experiments on 10 datasets (e.g., CoQA, SQuAD v2.0, and RACE) with a
strong baseline model show that, for example, the relative scores of a baseline
model provided with content words only and with shuffled sentence words in the
context are on average 89.2% and 78.5% of the original score, respectively.
These results suggest that most of the questions already answered correctly by
the model do not necessarily require grammatical and complex reasoning. For
precise benchmarking, MRC datasets will need to take extra care in their design
to ensure that questions can correctly evaluate the intended skills.
Comment: 11 pages, AAAI 2020, with extra examples; data:
https://github.com/Alab-NII/mrc-ablatio
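
The two ablations quoted above (content words only and shuffled sentence words) can be sketched as follows; the POS-based notion of a content word and the tokenization are assumptions, not the paper's exact preprocessing. The ablated contexts are then fed to the same model and its score is reported relative to the original.

# Minimal sketch of two context ablations: keep content words only, and
# shuffle words within each sentence. Requires the NLTK punkt and POS-tagger
# resources (nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")).
import random
import nltk  # pip install nltk

CONTENT_TAGS = ("NN", "VB", "JJ", "RB", "CD")  # nouns, verbs, adjectives, adverbs, numbers

def content_words_only(context: str) -> str:
    """Drop function words, keeping tokens whose POS tag marks a content word."""
    tokens = nltk.word_tokenize(context)
    return " ".join(tok for tok, tag in nltk.pos_tag(tokens) if tag.startswith(CONTENT_TAGS))

def shuffle_within_sentences(context: str, seed: int = 0) -> str:
    """Randomly reorder the words inside each sentence, keeping sentence order."""
    rng = random.Random(seed)
    shuffled = []
    for sent in nltk.sent_tokenize(context):
        words = nltk.word_tokenize(sent)
        rng.shuffle(words)
        shuffled.append(" ".join(words))
    return " ".join(shuffled)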
A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension
The issue of shortcut learning is widely known in NLP and has been an
important research focus in recent years. Unintended correlations in the data
enable models to easily solve tasks that were meant to exhibit advanced
language understanding and reasoning capabilities. In this survey paper, we
focus on the field of machine reading comprehension (MRC), an important task
for showcasing high-level language understanding that also suffers from a range
of shortcuts. We summarize the available techniques for measuring and
mitigating shortcuts and conclude with suggestions for further progress in
shortcut research. Importantly, we highlight two concerns for shortcut
mitigation in MRC: (1) the lack of public challenge sets, a necessary component
for effective and reusable evaluation, and (2) the lack of certain mitigation
techniques that are prominent in other areas.
Comment: 18 pages, 2 figures, 4 tables
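
As one concrete example of a measuring technique in this space, a partial-input baseline checks whether questions can be answered without the passage; high remaining accuracy suggests shortcut-prone items. The sketch below is a generic illustration with an assumed predict interface, not a method specific to this survey.

# Minimal sketch of a partial-input (question-only) baseline for measuring
# shortcuts. The record fields and the `predict` callable are assumed.
from typing import Callable, Dict, List

def partial_input_gap(records: List[dict],
                      predict: Callable[[str, str, List[str]], int]) -> Dict[str, float]:
    """records: [{"context", "question", "options", "answer"}]"""
    full = sum(int(predict(r["context"], r["question"], r["options"]) == r["answer"])
               for r in records)
    partial = sum(int(predict("", r["question"], r["options"]) == r["answer"])  # passage removed
                  for r in records)
    n = len(records)
    # A small gap suggests the questions can be solved without reading the passage.
    return {"full_accuracy": full / n, "question_only_accuracy": partial / n,
            "gap": (full - partial) / n}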
Benchmarking Machine Reading Comprehension: A Psychological Perspective
Machine reading comprehension (MRC) has received considerable attention as a
benchmark for natural language understanding. However, the conventional task
design of MRC lacks explainability beyond the model interpretation, i.e.,
reading comprehension by a model cannot be explained in human terms. To this
end, this position paper provides a theoretical basis for the design of MRC
datasets based on psychology as well as psychometrics, and summarizes it in
terms of the prerequisites for benchmarking MRC. We conclude that future
datasets should (i) evaluate the capability of the model for constructing a
coherent and grounded representation to understand context-dependent situations
and (ii) ensure substantive validity by shortcut-proof questions and
explanation as a part of the task design.
Comment: 21 pages, EACL 2021
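
Point (ii) can be made concrete by carrying an explanation alongside each answer and crediting a prediction only when its rationale is grounded in the reference one; the item format and the overlap-based check below are illustrative assumptions, not a schema or metric proposed by the paper.

# Minimal sketch: an MRC item that includes an explanation, and a strict score
# that requires both the answer and a grounded rationale. Names are assumed.
from dataclasses import dataclass
from typing import List

@dataclass
class MRCItem:
    context: str
    question: str
    options: List[str]
    answer: int        # index into options
    explanation: str   # human-written rationale grounding the answer in the context

def strict_score(item: MRCItem, pred_answer: int, pred_explanation: str) -> float:
    """Credit the answer only if the predicted rationale overlaps the reference."""
    ref = set(item.explanation.lower().split())
    overlap = len(ref & set(pred_explanation.lower().split())) / max(len(ref), 1)
    return float(pred_answer == item.answer and overlap >= 0.5)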