Benchmarking Large Language Model Capabilities for Conditional Generation
Pre-trained large language models (PLMs) underlie most new developments in
natural language processing. They have shifted the field from
application-specific model pipelines to a single model that is adapted to a
wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside
techniques like few-shot learning, have additionally shifted the output
modality to generation instead of classification or regression. Despite their
ubiquitous use, the generation quality of language models is rarely evaluated
when these models are introduced. Additionally, it is unclear how existing
generation tasks--while they can be used to compare systems at a high
level--relate to the real-world use cases for which people have been adopting
them. In this work, we discuss how to adapt existing application-specific
generation benchmarks to PLMs and provide an in-depth, empirical study of the
limitations and capabilities of PLMs in natural language generation tasks along
dimensions such as scale, architecture, input and output language. Our results
show that PLMs differ in their applicability to different data regimes and
their generalization to multiple languages and inform which PLMs to use for a
given generation task setup. We share best practices to be taken into
consideration when benchmarking generation capabilities during the development
of upcoming PLMs.
mFACE: Multilingual Summarization with Factual Consistency Evaluation
Abstractive summarization has enjoyed renewed interest in recent years,
thanks to pre-trained language models and the availability of large-scale
datasets. Despite promising results, current models still suffer from
generating factually inconsistent summaries, reducing their utility for
real-world application. Several recent efforts attempt to address this by
devising models that automatically detect factual inconsistencies in machine
generated summaries. However, they focus exclusively on English, a language
with abundant resources. In this work, we leverage factual consistency
evaluation models to improve multilingual summarization. We explore two
intuitive approaches to mitigate hallucinations based on the signal provided by
a multilingual NLI model, namely data filtering and controlled generation.
Experimental results in the 45 languages from the XLSum dataset show gains over
strong baselines in both automatic and human evaluation.
Comment: 28 pages with links to released dat
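The data-filtering approach described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` type, the `filter_by_entailment` helper, the 0.5 threshold, and the toy scores are all assumptions made for illustration. In practice the scores would come from a multilingual NLI model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    document: str
    summary: str
    # Hypothetical NLI probability that the document entails the summary.
    entailment: float


def filter_by_entailment(examples: List[Example], threshold: float = 0.5) -> List[Example]:
    """Keep only training pairs whose summary scores at or above the threshold."""
    return [ex for ex in examples if ex.entailment >= threshold]


# Toy data with made-up scores (assumption: higher means more consistent).
train = [
    Example("doc a", "faithful summary", 0.92),
    Example("doc b", "hallucinated summary", 0.12),
    Example("doc c", "borderline summary", 0.50),
]
kept = filter_by_entailment(train, threshold=0.5)
print([ex.summary for ex in kept])
```

The threshold trades training-set size against consistency; a stricter value removes more hallucinated references but leaves less data, which matters most for low-resource languages.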
On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study
Modern deep models for summarization attain impressive benchmark
performance, but they are prone to generating miscalibrated predictive
uncertainty. This means that they assign high confidence to low-quality
predictions, leading to compromised reliability and trustworthiness in
real-world applications. Probabilistic deep learning methods are common
solutions to the miscalibration problem. However, their relative effectiveness
in complex autoregressive summarization tasks is not well understood. In this
work, we thoroughly investigate different state-of-the-art probabilistic
methods' effectiveness in improving the uncertainty quality of the neural
summarization models, across three large-scale benchmarks with varying
difficulty. We show that the probabilistic methods consistently improve the
model's generation and uncertainty quality, leading to improved selective
generation performance (i.e., abstaining from low-quality summaries) in
practice. We also reveal notable failure patterns of probabilistic methods
widely adopted in the NLP community (e.g., Deep Ensemble and Monte Carlo Dropout),
underscoring the importance of choosing an appropriate method for the data setting.
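Selective generation, as described in this abstract, can be sketched as a confidence threshold on a generated summary. The snippet below is only an illustration under stated assumptions: the length-normalized sequence probability used as the confidence score, the function names, and the 0.5 threshold are all hypothetical choices, not the methods benchmarked in the paper.

```python
import math
from typing import List, Optional


def sequence_confidence(token_logprobs: List[float]) -> float:
    """Length-normalized sequence probability (geometric mean of token probabilities)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def selective_generate(
    summary: str, token_logprobs: List[float], threshold: float = 0.5
) -> Optional[str]:
    """Return the summary, or None to abstain when confidence is below the threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return summary
    return None


# Made-up per-token log-probabilities for two generated summaries.
confident = selective_generate("summary A", [-0.1, -0.2, -0.1])
uncertain = selective_generate("summary B", [-1.5, -2.0, -1.0])
print(confident, uncertain)
```

Abstention of this kind only helps when confidence tracks quality, which is why the calibration of the underlying scores is the quantity of interest.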
Calibrating Likelihoods towards Consistency in Summarization Models
Despite the recent advances in abstractive text summarization, current
summarization models still suffer from generating factually inconsistent
summaries, reducing their utility for real-world application. We argue that the
main reason for such behavior is that summarization models trained with a
maximum likelihood objective assign high probability to plausible sequences
given the context, but they often do not accurately rank sequences by their
consistency. In this work, we solve this problem by calibrating the likelihood
of model-generated sequences to better align with a consistency metric measured
by natural language inference (NLI) models. The human evaluation study and
automatic metrics show that the calibrated models generate more consistent and
higher-quality summaries. We also show that the models trained using our method
return probabilities that are better aligned with the NLI scores, which
significantly increases the reliability of summarization models.
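One way to picture aligning sequence likelihoods with an NLI-based consistency score is a pairwise ranking penalty over candidate summaries. The abstract does not specify the loss, so the hinge formulation, the `pairwise_rank_loss` name, and the margin value below are assumptions for illustration only.

```python
from typing import List, Tuple


def pairwise_rank_loss(candidates: List[Tuple[float, float]], margin: float = 0.1) -> float:
    """Average hinge penalty over candidate pairs.

    Each candidate is (sequence log-likelihood, NLI consistency score).
    A pair is penalized when the more consistent candidate does not have a
    log-likelihood at least `margin` higher than the less consistent one.
    Pairs with equal NLI scores are skipped.
    """
    total, pairs = 0.0, 0
    for i, (lp_i, nli_i) in enumerate(candidates):
        for lp_j, nli_j in candidates[i + 1:]:
            if nli_i == nli_j:
                continue
            better, worse = (lp_i, lp_j) if nli_i > nli_j else (lp_j, lp_i)
            total += max(0.0, margin - (better - worse))
            pairs += 1
    return total / pairs if pairs else 0.0


# Made-up values: likelihood order agrees with NLI order in the first case only.
aligned = pairwise_rank_loss([(-1.0, 0.9), (-2.0, 0.5), (-3.0, 0.1)])
misaligned = pairwise_rank_loss([(-3.0, 0.9), (-1.0, 0.1)])
print(aligned, misaligned)
```

A loss of zero means the likelihood ranking already matches the consistency ranking with the required margin; larger values indicate the mismatch that calibration is meant to reduce.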
PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge
Cross-lingual summarization consists of generating a summary in one language
given an input document in a different language, allowing for the dissemination
of relevant content across speakers of other languages. The task is challenging
mainly due to the paucity of cross-lingual datasets and the compounded
difficulty of summarizing and translating. This work presents PLAN, an
approach to cross-lingual summarization that uses an intermediate planning step
as a cross-lingual bridge. We formulate the plan as a sequence of entities
capturing the summary's content and the order in which it should be
communicated. Importantly, our plans abstract from surface form: using a
multilingual knowledge base, we align entities to their canonical designation
across languages and generate the summary conditioned on this cross-lingual
bridge and the input. Automatic and human evaluation on the XWikis dataset
(across four language pairs) demonstrates that our planning objective achieves
state-of-the-art performance in terms of informativeness and faithfulness.
Moreover, PLAN models improve the zero-shot transfer to new cross-lingual
language pairs compared to baselines without a planning component.
Comment: EACL 202
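The idea of a plan that abstracts from surface form can be sketched as a mapping from entity mentions to canonical identifiers. Everything below is hypothetical: the toy `KB` dictionary, the `E1`/`E2` identifiers, and the `build_plan` helper merely stand in for a multilingual knowledge base and are not the PLAN implementation.

```python
from typing import Dict, List

# Hypothetical knowledge base: surface form (in any language) -> canonical entity ID.
KB: Dict[str, str] = {
    "Germany": "E1",
    "Deutschland": "E1",
    "Allemagne": "E1",
    "Berlin": "E2",
    "Berlín": "E2",
}


def build_plan(mentions: List[str], kb: Dict[str, str]) -> List[str]:
    """Map ordered entity mentions to canonical IDs, skipping mentions missing from the KB."""
    return [kb[m] for m in mentions if m in kb]


# The same content plan results regardless of the language of the mentions.
plan_de = build_plan(["Deutschland", "Berlin"], KB)
plan_es = build_plan(["Germany", "Berlín"], KB)
print(plan_de, plan_es)
```

Because both mention sequences resolve to the same identifiers in the same order, the plan can act as a language-independent intermediate on which summary generation is conditioned.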
SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
Reliable automatic evaluation of summarization systems is challenging due to
the multifaceted and subjective nature of the task. This is especially the case
for languages other than English, where human evaluations are scarce. In this
work, we introduce SEAHORSE, a dataset for multilingual, multifaceted
summarization evaluation. SEAHORSE consists of 96K summaries with human ratings
along 6 dimensions of text quality: comprehensibility, repetition, grammar,
attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4
datasets. As a result of its size and scope, SEAHORSE can serve both as a
benchmark to evaluate learnt metrics, as well as a large-scale resource for
training such metrics. We show that metrics trained with SEAHORSE achieve
strong performance on the out-of-domain meta-evaluation benchmarks TRUE
(Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE
dataset and metrics publicly available for future research on multilingual and
multifaceted summarization evaluation.
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance
across a variety of natural language tasks using few-shot learning, which
drastically reduces the number of task-specific training examples needed to
adapt the model to a particular application. To further our understanding of
the impact of scale on few-shot learning, we trained a 540-billion parameter,
densely activated, Transformer language model, which we call Pathways Language
Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML
system which enables highly efficient training across multiple TPU Pods. We
demonstrate continued benefits of scaling by achieving state-of-the-art
few-shot learning results on hundreds of language understanding and generation
benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough
performance, outperforming the finetuned state-of-the-art on a suite of
multi-step reasoning tasks, and outperforming average human performance on the
recently released BIG-bench benchmark. A significant number of BIG-bench tasks
showed discontinuous improvements from model scale, meaning that performance
steeply increased as we scaled to our largest model. PaLM also has strong
capabilities in multilingual tasks and source code generation, which we
demonstrate on a wide array of benchmarks. We additionally provide a
comprehensive analysis on bias and toxicity, and study the extent of training
data memorization with respect to model scale. Finally, we discuss the ethical
considerations related to large language models and discuss potential
mitigation strategies.