21 research outputs found
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
As transparency becomes key for robotics and AI, it will be necessary to
evaluate the methods through which transparency is provided, including
automatically generated natural language (NL) explanations. Here, we explore
parallels between the generation of such explanations and the much-studied
field of evaluation of Natural Language Generation (NLG). Specifically, we
investigate which of the NLG evaluation measures map well to explanations. We
present the ExBAN corpus: a crowd-sourced corpus of NL explanations for
Bayesian Networks. We run correlations comparing human subjective ratings with
NLG automatic measures. We find that embedding-based automatic NLG evaluation
methods, such as BERTScore and BLEURT, correlate more strongly with human
ratings than word-overlap metrics such as BLEU and ROUGE. This work
has implications for Explainable AI and transparent robotic and autonomous
systems.
Comment: Accepted at EACL 202
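The correlation analysis described in the abstract can be sketched as follows, using a rank correlation between automatic metric scores and human judgments. All numbers below are invented for illustration; they are not data from the ExBAN study.

```python
# Minimal sketch of correlating an automatic NLG metric with human ratings,
# in the spirit of the ExBAN analysis. All values are hypothetical.
from scipy.stats import spearmanr

# Hypothetical human subjective ratings (e.g. 1-5) for five explanations
human_ratings = [4.2, 3.1, 4.8, 2.5, 3.9]

# Hypothetical scores from an embedding-based metric (e.g. BERTScore F1)
# for the same five explanations
metric_scores = [0.91, 0.84, 0.88, 0.78, 0.95]

# Spearman's rho measures how well the metric preserves the human ranking
rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A rank correlation such as Spearman's rho is a common choice here because it asks only whether the metric orders explanations the same way humans do, not whether the raw scores are on comparable scales.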
Underreporting of errors in NLG output, and what to do about it
We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make. This is a problem, because mistakes are an important indicator of where systems should still be improved. If authors only report overall performance metrics, the research community is left in the dark about the specific weaknesses exhibited by 'state-of-the-art' research. Beyond quantifying the extent of error under-reporting, this position paper provides recommendations for error identification, analysis and reporting.
Peer reviewed
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.