STREET: A Multi-Task Structured Reasoning and Explanation Benchmark
We introduce STREET, a unified multi-task and multi-domain natural language
reasoning and explanation benchmark. Unlike most existing question-answering
(QA) datasets, we expect models to not only answer questions, but also produce
step-by-step structured explanations describing how premises in the question
are used to produce intermediate conclusions that can prove the correctness of
a certain answer. We perform an extensive evaluation with popular language models
such as GPT-3 with few-shot prompting and fine-tuned T5. We find that these models
still lag behind human performance when producing such structured reasoning
steps. We believe this work will provide a way for the community to better
train and test systems on multi-step reasoning and explanations in natural
language.
Comment: Published in ICLR 2023.
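To make the notion of step-by-step structured explanations concrete, here is a minimal sketch of how a reasoning step (premises feeding an intermediate conclusion) might be represented; the class and field names are illustrative, not STREET's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningStep:
    """One step of a structured explanation: premises that support an
    intermediate conclusion (illustrative schema, not STREET's format)."""
    premises: List[str]   # sentences from the question/context used in this step
    conclusion: str       # intermediate conclusion derived from the premises

@dataclass
class StructuredExplanation:
    question: str
    answer: str
    steps: List[ReasoningStep] = field(default_factory=list)

# Toy example: a two-step derivation supporting the final answer.
explanation = StructuredExplanation(
    question="If Alice has 3 apples and buys 2 more, does she have more than 4?",
    answer="yes",
    steps=[
        ReasoningStep(premises=["Alice has 3 apples", "Alice buys 2 more"],
                      conclusion="Alice has 5 apples"),
        ReasoningStep(premises=["Alice has 5 apples"],
                      conclusion="5 is more than 4, so the answer is yes"),
    ],
)
print(len(explanation.steps), "reasoning steps")
```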
Revealing the structure of language model capabilities
Building a theoretical understanding of the capabilities of large language
models (LLMs) is vital for our ability to predict and explain the behavior of
these systems. Here, we investigate the structure of LLM capabilities by
extracting latent capabilities from patterns of individual differences across a
varied population of LLMs. Using a combination of Bayesian and frequentist
factor analysis, we analyzed data from 29 different LLMs across 27 cognitive
tasks. We found evidence that LLM capabilities are not monolithic. Instead,
they are better explained by three well-delineated factors that represent
reasoning, comprehension and core language modeling. Moreover, we found that
these three factors can explain a high proportion of the variance in model
performance. These results reveal a consistent structure in the capabilities of
different LLMs and demonstrate the multifaceted nature of these capabilities.
We also found that the three abilities show different relationships to model
properties such as model size and instruction tuning. These patterns help
refine our understanding of scaling laws and indicate that changes to a model
that improve one ability might simultaneously impair others. Based on these
findings, we suggest that benchmarks could be streamlined by focusing on tasks
that tap into each broad model ability.
Comment: 10 pages, 3 figures + references and appendices; for data and
analysis code see https://github.com/RyanBurnell/revealing-LLM-capabilitie
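The factor-analysis setup described above is easy to picture with a small sketch. The snippet below runs a frequentist factor analysis with three components on a placeholder 29 x 27 score matrix (models x tasks); it is not the authors' analysis code, which is available at the linked repository.

```python
# Illustrative frequentist factor analysis on a (models x tasks) score matrix.
# The data here are random placeholders, not the paper's benchmark results.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
scores = rng.random((29, 27))          # 29 LLMs x 27 cognitive tasks (placeholder)

fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
latent = fa.fit_transform(scores)      # per-model scores on the 3 latent factors
loadings = fa.components_              # how strongly each task loads on each factor

print("latent factor scores:", latent.shape)   # (29, 3)
print("task loadings:", loadings.shape)        # (3, 27)
```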
R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
Recent studies have revealed that reading comprehension (RC) systems learn to
exploit annotation artifacts and other biases in current datasets. This
prevents the community from reliably measuring the progress of RC systems. To
address this issue, we introduce R4C, a new task for evaluating RC systems'
internal reasoning. R4C requires giving not only answers but also derivations:
explanations that justify predicted answers. We present a reliable,
crowdsourced framework for scalably annotating RC datasets with derivations. We
create and publicly release the R4C dataset, the first quality-assured dataset of
this kind, consisting of 4.6k questions, each annotated with 3 reference
derivations (13.8k derivations in total). Experiments show that our automatic
evaluation metrics using multiple reference derivations are reliable, and that
R4C assesses different skills from an existing benchmark.
Comment: Accepted by ACL 2020. See https://naoya-i.github.io/r4c/ for more
information.
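The idea of scoring a predicted derivation against multiple reference derivations can be sketched as follows. The snippet uses exact matching of relational triples and keeps the best-matching reference; R4C's official metrics use softer alignment, so treat this only as an illustration of multi-reference evaluation.

```python
# Simplified multi-reference evaluation: score a predicted derivation (a set of
# relational triples) against each reference derivation and keep the best F1.
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def f1(predicted: Set[Triple], reference: Set[Triple]) -> float:
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(predicted: Set[Triple],
                          references: List[Set[Triple]]) -> float:
    # Each question has several reference derivations; reward the closest one.
    return max(f1(predicted, ref) for ref in references)

pred = {("Obama", "born in", "Honolulu"), ("Honolulu", "located in", "Hawaii")}
refs = [{("Obama", "born in", "Honolulu"), ("Honolulu", "is in", "Hawaii")},
        {("Obama", "birthplace", "Honolulu")}]
print(round(multi_reference_score(pred, refs), 2))
```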
Unification-based Reconstruction of Multi-hop Explanations for Science Questions
This paper presents a novel framework for reconstructing multi-hop
explanations in science Question Answering (QA). While existing approaches for
multi-hop reasoning build explanations considering each question in isolation,
we propose a method to leverage explanatory patterns emerging in a corpus of
scientific explanations. Specifically, the framework ranks a set of atomic
facts by integrating lexical relevance with the notion of unification power,
estimated by analysing explanations for similar questions in the corpus.
An extensive evaluation is performed on the Worldtree corpus, integrating
k-NN clustering and Information Retrieval (IR) techniques. We present the
following conclusions: (1) the proposed method achieves results competitive
with Transformers while being orders of magnitude faster, a feature that makes
it scalable to large explanatory corpora; (2) the unification-based mechanism
plays a key role in reducing semantic drift, contributing to the reconstruction
of explanations requiring many hops (6 or more facts) and to the ranking of
complex inference facts (+12.0 Mean Average Precision); (3) crucially, the constructed
explanations can support downstream QA models, improving the accuracy of BERT
by up to 10% overall.
Comment: Accepted at EACL 2021.
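The ranking idea, combining lexical relevance with unification power estimated from explanations of similar questions, can be sketched roughly as below. The toy corpus, TF-IDF similarity, k = 2 neighbourhood, and equal weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Rough sketch: rank candidate facts by lexical relevance to the question plus a
# "unification" score counting how often each fact is reused in explanations of
# the k most similar questions. All data and weights below are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

facts = ["friction produces heat",
         "rubbing two objects together causes friction",
         "the sun is a source of light"]
# Tiny explanation corpus: previously explained questions and the facts they used.
corpus_questions = ["Why do hands get warm when rubbed together?",
                    "What happens when two sticks are rubbed?"]
corpus_explanations = [{0, 1}, {1}]          # indices into `facts`

question = "Why does rubbing your hands together warm them up?"

vec = TfidfVectorizer().fit(facts + corpus_questions + [question])
q_vec = vec.transform([question])

# Lexical relevance: similarity between the question and each candidate fact.
lexical = cosine_similarity(q_vec, vec.transform(facts))[0]

# Unification power: reuse of each fact in explanations of the k nearest questions.
sim_to_corpus = cosine_similarity(q_vec, vec.transform(corpus_questions))[0]
neighbours = np.argsort(sim_to_corpus)[::-1][:2]
unification = np.array([sum(i in corpus_explanations[n] for n in neighbours)
                        for i in range(len(facts))], dtype=float)
unification /= max(unification.max(), 1.0)

scores = 0.5 * lexical + 0.5 * unification   # illustrative equal weighting
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.2f}  {facts[i]}")
```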
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
As transparency becomes key for robotics and AI, it will be necessary to
evaluate the methods through which transparency is provided, including
automatically generated natural language (NL) explanations. Here, we explore
parallels between the generation of such explanations and the much-studied
field of evaluation of Natural Language Generation (NLG). Specifically, we
investigate which of the NLG evaluation measures map well to explanations. We
present the ExBAN corpus: a crowd-sourced corpus of NL explanations for
Bayesian Networks. We run correlations comparing human subjective ratings with
NLG automatic measures. We find that embedding-based automatic NLG evaluation
methods, such as BERTScore and BLEURT, have a higher correlation with human
ratings, compared to word-overlap metrics, such as BLEU and ROUGE. This work
has implications for Explainable AI and transparent robotic and autonomous
systems.
Comment: Accepted at EACL 2021.
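The correlation analysis described above can be sketched in a few lines: score generated explanations with an automatic NLG metric and correlate the scores with human ratings. BERTScore is used here because the abstract names it; the example texts and ratings are placeholders, not ExBAN data.

```python
# Minimal sketch: Spearman correlation between human ratings of explanations and
# an embedding-based automatic metric (BERTScore). Texts/ratings are placeholders.
from scipy.stats import spearmanr
from bert_score import score  # pip install bert-score

explanations = ["The alarm is probably on because a burglary was reported.",
                "High probability since the evidence supports it.",
                "The node is true.",
                "This outcome is likely because the parent node was observed."]
references   = ["The alarm is likely on because a burglary was reported nearby.",
                "The probability is high because the evidence supports it.",
                "This node is likely to be true given its parents.",
                "This outcome is likely because its parent node was observed."]
human_ratings = [4.5, 4.0, 2.0, 4.5]   # placeholder subjective clarity ratings

_, _, f1 = score(explanations, references, lang="en")   # BERTScore F1 per pair
rho, p_value = spearmanr(human_ratings, f1.tolist())
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```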