STREET: A Multi-Task Structured Reasoning and Explanation Benchmark
We introduce STREET, a unified multi-task and multi-domain natural language
reasoning and explanation benchmark. Unlike most existing question-answering
(QA) datasets, we expect models to not only answer questions, but also produce
step-by-step structured explanations describing how premises in the question
are used to produce intermediate conclusions that can prove the correctness of
a certain answer. We perform an extensive evaluation with popular language models
such as GPT-3 with few-shot prompting and fine-tuned T5. We find that these models
still lag behind human performance when producing such structured reasoning
steps. We believe this work will provide a way for the community to better
train and test systems on multi-step reasoning and explanations in natural
language.
Comment: Published in ICLR 2023.
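To make the notion of step-by-step structured explanations concrete, here is a minimal sketch of how a reasoning step (premises feeding an intermediate conclusion) might be represented; the class and field names are illustrative, not STREET's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningStep:
    """One step of a structured explanation: premises that support an
    intermediate conclusion (illustrative schema, not STREET's format)."""
    premises: List[str]   # sentences from the question/context used in this step
    conclusion: str       # intermediate conclusion derived from the premises

@dataclass
class StructuredExplanation:
    question: str
    answer: str
    steps: List[ReasoningStep] = field(default_factory=list)

# Toy example: a two-step derivation supporting the final answer.
explanation = StructuredExplanation(
    question="If Alice has 3 apples and buys 2 more, does she have more than 4?",
    answer="yes",
    steps=[
        ReasoningStep(premises=["Alice has 3 apples", "Alice buys 2 more"],
                      conclusion="Alice has 5 apples"),
        ReasoningStep(premises=["Alice has 5 apples"],
                      conclusion="5 is more than 4, so the answer is yes"),
    ],
)
print(len(explanation.steps), "reasoning steps")
```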
Revealing the structure of language model capabilities
Building a theoretical understanding of the capabilities of large language
models (LLMs) is vital for our ability to predict and explain the behavior of
these systems. Here, we investigate the structure of LLM capabilities by
extracting latent capabilities from patterns of individual differences across a
varied population of LLMs. Using a combination of Bayesian and frequentist
factor analysis, we analyzed data from 29 different LLMs across 27 cognitive
tasks. We found evidence that LLM capabilities are not monolithic. Instead,
they are better explained by three well-delineated factors that represent
reasoning, comprehension and core language modeling. Moreover, we found that
these three factors can explain a high proportion of the variance in model
performance. These results reveal a consistent structure in the capabilities of
different LLMs and demonstrate the multifaceted nature of these capabilities.
We also found that the three abilities show different relationships to model
properties such as model size and instruction tuning. These patterns help
refine our understanding of scaling laws and indicate that changes to a model
that improve one ability might simultaneously impair others. Based on these
findings, we suggest that benchmarks could be streamlined by focusing on tasks
that tap into each broad model ability.
Comment: 10 pages, 3 figures + references and appendices; for data and
analysis code see https://github.com/RyanBurnell/revealing-LLM-capabilitie
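The factor-analysis setup described above is easy to picture with a small sketch. The snippet below runs a frequentist factor analysis with three components on a placeholder 29 x 27 score matrix (models x tasks); it is not the authors' analysis code, which is available at the linked repository.

```python
# Illustrative frequentist factor analysis on a (models x tasks) score matrix.
# The data here are random placeholders, not the paper's benchmark results.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
scores = rng.random((29, 27))          # 29 LLMs x 27 cognitive tasks (placeholder)

fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
latent = fa.fit_transform(scores)      # per-model scores on the 3 latent factors
loadings = fa.components_              # how strongly each task loads on each factor

print("latent factor scores:", latent.shape)   # (29, 3)
print("task loadings:", loadings.shape)        # (3, 27)
```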
R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
Recent studies have revealed that reading comprehension (RC) systems learn to
exploit annotation artifacts and other biases in current datasets. This
prevents the community from reliably measuring the progress of RC systems. To
address this issue, we introduce R4C, a new task for evaluating RC systems'
internal reasoning. R4C requires giving not only answers but also derivations:
explanations that justify predicted answers. We present a reliable,
crowdsourced framework for scalably annotating RC datasets with derivations. We
create and publicly release the R4C dataset, the first quality-assured dataset of
this kind, consisting of 4.6k questions, each annotated with 3 reference
derivations (13.8k derivations in total). Experiments show that our automatic
evaluation metrics using multiple reference derivations are reliable, and that
R4C assesses different skills from an existing benchmark.
Comment: Accepted by ACL 2020. See https://naoya-i.github.io/r4c/ for more
information.
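The idea of scoring a predicted derivation against multiple reference derivations can be sketched as follows. The snippet uses exact matching of relational triples and keeps the best-matching reference; R4C's official metrics use softer alignment, so treat this only as an illustration of multi-reference evaluation.

```python
# Simplified multi-reference evaluation: score a predicted derivation (a set of
# relational triples) against each reference derivation and keep the best F1.
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def f1(predicted: Set[Triple], reference: Set[Triple]) -> float:
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(predicted: Set[Triple],
                          references: List[Set[Triple]]) -> float:
    # Each question has several reference derivations; reward the closest one.
    return max(f1(predicted, ref) for ref in references)

pred = {("Obama", "born in", "Honolulu"), ("Honolulu", "located in", "Hawaii")}
refs = [{("Obama", "born in", "Honolulu"), ("Honolulu", "is in", "Hawaii")},
        {("Obama", "birthplace", "Honolulu")}]
print(round(multi_reference_score(pred, refs), 2))
```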
Unification-based Reconstruction of Multi-hop Explanations for Science Questions
This paper presents a novel framework for reconstructing multi-hop
explanations in science Question Answering (QA). While existing approaches for
multi-hop reasoning build explanations considering each question in isolation,
we propose a method to leverage explanatory patterns emerging in a corpus of
scientific explanations. Specifically, the framework ranks a set of atomic
facts by integrating lexical relevance with the notion of unification power,
estimated by analysing explanations for similar questions in the corpus.
An extensive evaluation is performed on the Worldtree corpus, integrating
k-NN clustering and Information Retrieval (IR) techniques. We present the
following conclusions: (1) the proposed method achieves results competitive
with Transformers while being orders of magnitude faster, a feature that makes
it scalable to large explanatory corpora; (2) the unification-based mechanism
plays a key role in reducing semantic drift, contributing to the reconstruction
of explanations requiring many hops (6 or more facts) and to the ranking of
complex inference facts (+12.0 Mean Average Precision); (3) crucially, the constructed
explanations can support downstream QA models, improving the accuracy of BERT
by up to 10% overall.
Comment: Accepted at EACL 2021.
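The ranking idea, combining lexical relevance with unification power estimated from explanations of similar questions, can be sketched roughly as below. The toy corpus, TF-IDF similarity, k = 2 neighbourhood, and equal weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Rough sketch: rank candidate facts by lexical relevance to the question plus a
# "unification" score counting how often each fact is reused in explanations of
# the k most similar questions. All data and weights below are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

facts = ["friction produces heat",
         "rubbing two objects together causes friction",
         "the sun is a source of light"]
# Tiny explanation corpus: previously explained questions and the facts they used.
corpus_questions = ["Why do hands get warm when rubbed together?",
                    "What happens when two sticks are rubbed?"]
corpus_explanations = [{0, 1}, {1}]          # indices into `facts`

question = "Why does rubbing your hands together warm them up?"

vec = TfidfVectorizer().fit(facts + corpus_questions + [question])
q_vec = vec.transform([question])

# Lexical relevance: similarity between the question and each candidate fact.
lexical = cosine_similarity(q_vec, vec.transform(facts))[0]

# Unification power: reuse of each fact in explanations of the k nearest questions.
sim_to_corpus = cosine_similarity(q_vec, vec.transform(corpus_questions))[0]
neighbours = np.argsort(sim_to_corpus)[::-1][:2]
unification = np.array([sum(i in corpus_explanations[n] for n in neighbours)
                        for i in range(len(facts))], dtype=float)
unification /= max(unification.max(), 1.0)

scores = 0.5 * lexical + 0.5 * unification   # illustrative equal weighting
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.2f}  {facts[i]}")
```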
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
As transparency becomes key for robotics and AI, it will be necessary to
evaluate the methods through which transparency is provided, including
automatically generated natural language (NL) explanations. Here, we explore
parallels between the generation of such explanations and the much-studied
field of evaluation of Natural Language Generation (NLG). Specifically, we
investigate which of the NLG evaluation measures map well to explanations. We
present the ExBAN corpus: a crowd-sourced corpus of NL explanations for
Bayesian Networks. We run correlations comparing human subjective ratings with
NLG automatic measures. We find that embedding-based automatic NLG evaluation
methods, such as BERTScore and BLEURT, have a higher correlation with human
ratings, compared to word-overlap metrics, such as BLEU and ROUGE. This work
has implications for Explainable AI and transparent robotic and autonomous
systems.
Comment: Accepted at EACL 2021.
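The correlation analysis described above can be sketched in a few lines: score generated explanations with an automatic NLG metric and correlate the scores with human ratings. BERTScore is used here because the abstract names it; the example texts and ratings are placeholders, not ExBAN data.

```python
# Minimal sketch: Spearman correlation between human ratings of explanations and
# an embedding-based automatic metric (BERTScore). Texts/ratings are placeholders.
from scipy.stats import spearmanr
from bert_score import score  # pip install bert-score

explanations = ["The alarm is probably on because a burglary was reported.",
                "High probability since the evidence supports it.",
                "The node is true.",
                "This outcome is likely because the parent node was observed."]
references   = ["The alarm is likely on because a burglary was reported nearby.",
                "The probability is high because the evidence supports it.",
                "This node is likely to be true given its parents.",
                "This outcome is likely because its parent node was observed."]
human_ratings = [4.5, 4.0, 2.0, 4.5]   # placeholder subjective clarity ratings

_, _, f1 = score(explanations, references, lang="en")   # BERTScore F1 per pair
rho, p_value = spearmanr(human_ratings, f1.tolist())
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```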