Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality
Improving factual consistency of abstractive summarization has been a widely
studied topic. However, most prior work on training factuality-aware
models has ignored the negative effect such training has on summary quality. We propose
EFACTSUM (i.e., Effective Factual Summarization), a candidate summary
generation and ranking technique to improve summary factuality without
sacrificing summary quality. We show that using a contrastive learning
framework with our refined candidate summaries leads to significant gains on
both factuality and similarity-based metrics. Specifically, we propose a
ranking strategy in which we effectively combine two metrics, thereby
preventing any conflict during training. Models trained using our approach show
up to 6 points of absolute improvement over the base model with respect to
FactCC on XSUM and 11 points on CNN/DM, without negatively affecting either
similarity-based metrics or abstractiveness.
Comment: ACL 202
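
A minimal sketch of what such a conflict-free two-metric candidate ranking could look like, in Python. The scorer interfaces and the lexicographic tie-handling are illustrative assumptions, not EFACTSUM's exact formulation:

from typing import Callable, List

def rank_candidates(
    candidates: List[str],
    source: str,
    reference: str,
    factuality_score: Callable[[str, str], float],  # e.g., a FactCC-style scorer
    similarity_score: Callable[[str, str], float],  # e.g., a ROUGE-style scorer
) -> List[str]:
    """Order candidates so higher-ranked summaries dominate on both
    metrics, giving a contrastive loss a consistent training signal."""
    scored = [
        (factuality_score(c, source), similarity_score(c, reference), c)
        for c in candidates
    ]
    # Sort by factuality first, then similarity, so the two signals
    # never pull a training pair in opposite directions.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [c for _, _, c in scored]
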
Prompted Opinion Summarization with GPT-3.5
Large language models have shown impressive performance across a wide variety
of tasks, including text summarization. In this paper, we show that this strong
performance extends to opinion summarization. We explore several pipeline
methods for applying GPT-3.5 to summarize a large collection of user reviews in
a prompted fashion. To handle arbitrarily large numbers of user reviews, we
explore recursive summarization as well as methods for selecting salient
content to summarize through supervised clustering or extraction. On two
datasets, an aspect-oriented summarization dataset of hotel reviews (SPACE) and
a generic summarization dataset of Amazon and Yelp reviews (FewSum), we show
that GPT-3.5 models achieve very strong performance in human evaluation. We
argue that standard evaluation metrics do not reflect this, and introduce three
new metrics targeting faithfulness, factuality, and genericity to contrast
these different methods.
Comment: Accepted to ACL (Findings) 202
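
As a rough illustration of the recursive pipeline, the sketch below reduces an arbitrarily large review set level by level; `llm_summarize` stands in for a prompted GPT-3.5 call, and the chunk size is an assumed placeholder:

from typing import Callable, List

def recursive_summarize(
    reviews: List[str],
    llm_summarize: Callable[[List[str]], str],  # prompted GPT-3.5 call (placeholder)
    max_per_call: int = 8,
) -> str:
    """Summarize fixed-size chunks of reviews, then summarize the
    resulting summaries, until one final summary remains."""
    texts = reviews
    while len(texts) > max_per_call:
        chunks = [texts[i:i + max_per_call] for i in range(0, len(texts), max_per_call)]
        texts = [llm_summarize(chunk) for chunk in chunks]
    return llm_summarize(texts)
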
Fidelity-Enriched Contrastive Search: Reconciling the Faithfulness-Diversity Trade-Off in Text Generation
In this paper, we address the hallucination problem commonly found in natural
language generation tasks. Language models often generate fluent and convincing
content but can lack consistency with the provided source, resulting in
potential inaccuracies. We propose a new decoding method called
Fidelity-Enriched Contrastive Search (FECS), which augments the contrastive
search framework with context-aware regularization terms. FECS promotes tokens
that are semantically similar to the provided source while penalizing
repetitiveness in the generated text. We demonstrate its effectiveness across
two tasks prone to hallucination: abstractive summarization and dialogue
generation. Results show that FECS consistently enhances faithfulness across
various language model sizes while maintaining output diversity comparable to
well-performing decoding algorithms.
Comment: Accepted as a short paper at EMNLP 202
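
The scoring rule below is a hedged reconstruction of a FECS-style decoding step: contrastive search's degeneration penalty plus an added fidelity reward toward source-token representations. The weights and the cosine-similarity choice are assumptions, not the paper's exact settings:

import torch
import torch.nn.functional as F

def fecs_score(
    cand_prob: torch.Tensor,    # (k,) model probability of each candidate token
    cand_hidden: torch.Tensor,  # (k, d) hidden state of each candidate token
    prev_hidden: torch.Tensor,  # (t, d) hidden states of tokens generated so far
    src_hidden: torch.Tensor,   # (s, d) hidden states of source tokens
    alpha: float = 0.6,
    beta: float = 0.4,
) -> torch.Tensor:
    """Return a (k,) score per candidate; decoding picks the argmax."""
    cand = F.normalize(cand_hidden, dim=-1)
    # Degeneration penalty: max similarity to any previously generated token.
    degen = (cand @ F.normalize(prev_hidden, dim=-1).T).max(dim=-1).values
    # Fidelity reward: max similarity to any source token.
    fidelity = (cand @ F.normalize(src_hidden, dim=-1).T).max(dim=-1).values
    return (1.0 - alpha) * cand_prob - alpha * degen + beta * fidelity
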
Improving Factuality of Abstractive Summarization via Contrastive Reward Learning
Modern abstractive summarization models often generate summaries that contain
hallucinated or contradictory information. In this paper, we propose a simple
but effective contrastive learning framework that incorporates recent
developments in reward learning and factuality metrics. Empirical studies
demonstrate that the proposed framework enables summarization models to learn
from feedback of factuality metrics using contrastive reward learning, leading
to more factual summaries by human evaluations. This suggests that further
advances in learning and evaluation algorithms can feed directly into providing
more factual summaries.
Comment: TrustNLP @ ACL 202
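
One plausible instantiation is a BRIO-style pairwise ranking loss over candidates ordered by a factuality metric; the margin scheme below is a common choice assumed for illustration, not taken verbatim from the paper:

import torch

def contrastive_reward_loss(
    logprobs: torch.Tensor,  # (n,) length-normalized log-prob per candidate,
                             # sorted from most to least factual by the metric
    margin: float = 0.01,
) -> torch.Tensor:
    """Push the model to assign higher likelihood to more factual candidates."""
    loss = logprobs.new_zeros(())
    n = logprobs.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # Candidate i outranks j on factuality: its log-prob should
            # win by a rank-dependent margin.
            loss = loss + torch.clamp(
                margin * (j - i) - (logprobs[i] - logprobs[j]), min=0.0
            )
    return loss
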
Improved Beam Search for Hallucination Mitigation in Abstractive Summarization
Advances in large pretrained language models have significantly improved
their performance on conditional language generation tasks, including
summarization, albeit with hallucinations. To reduce hallucinations,
conventional methods proposed improving beam search or using a fact checker as
a postprocessing step. In this paper, we investigate the use of the Natural
Language Inference (NLI) entailment metric to detect and prevent hallucinations
in summary generation. We propose an NLI-assisted beam re-ranking mechanism by
computing entailment probability scores between the input context and
summarization model-generated beams during saliency-enhanced greedy decoding.
Moreover, a diversity metric is introduced to compare its effectiveness against
vanilla beam search. Our proposed algorithm significantly outperforms vanilla
beam decoding on the XSum and CNN/DM datasets.
Comment: 8 pages, 2 figures
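
A simplified version of such NLI-based re-ranking can be written with an off-the-shelf entailment model; the model choice (roberta-large-mnli) and the pure-argmax selection rule are assumptions here, and the paper additionally uses saliency-enhanced decoding:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def rerank_beams(source: str, beams: List[str]) -> str:
    """Score each beam by P(source entails beam) and return the winner.
    Long sources are truncated to the model's maximum input length."""
    scores = []
    for beam in beams:
        inputs = tokenizer(source, beam, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = nli(**inputs).logits[0]
        # For roberta-large-mnli, index 2 is the ENTAILMENT class.
        scores.append(torch.softmax(logits, dim=-1)[2].item())
    return beams[max(range(len(beams)), key=lambda i: scores[i])]
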
Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization
The problems of unfaithful summaries have been widely discussed under the
context of abstractive summarization. Though extractive summarization is less
prone to the common unfaithfulness issues of abstractive summaries, does that
mean extractive is equal to faithful? It turns out that the answer is no. In this
work, we define a typology with five types of broad unfaithfulness problems
(including and beyond not-entailment) that can appear in extractive summaries,
including incorrect coreference, incomplete coreference, incorrect discourse,
incomplete discourse, as well as other misleading information. We ask humans to
label these problems out of 1500 English summaries produced by 15 diverse
extractive systems. We find that 33% of the summaries have at least one of the
five issues. To automatically detect these problems, we find that 5 existing
faithfulness evaluation metrics for summarization have poor correlations with
human judgment. To remedy this, we propose a new metric, ExtEval, that is
designed for detecting unfaithful extractive summaries and is shown to have the
best performance. We hope our work can increase the awareness of unfaithfulness
problems in extractive summarization and help future work to evaluate and
resolve these issues. Our data and code are publicly available at
https://github.com/ZhangShiyue/extractive_is_not_faithful
Comment: 19 pages
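
Purely as an illustration of how a typology-driven metric can be organized (the released implementation is in the repository above), one sub-detector per problem type can be aggregated into a report; the detector interfaces here are placeholders:

from typing import Callable, Dict

ProblemDetector = Callable[[str, str], bool]  # (summary, source) -> problem found?

def exteval_style_report(
    summary: str,
    source: str,
    detectors: Dict[str, ProblemDetector],
) -> Dict[str, bool]:
    """Run one detector per problem type (incorrect/incomplete coreference,
    incorrect/incomplete discourse, misleading information) and report
    which of them fire for this summary."""
    return {name: detect(summary, source) for name, detect in detectors.items()}
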
Learning to Revise References for Faithful Summarization
In many real-world scenarios with naturally occurring datasets, reference
summaries are noisy and contain information that cannot be inferred from the
source text. On large news corpora, removing low-quality samples has been shown
to reduce model hallucinations. Yet, this method is largely untested for
smaller, noisier corpora. To improve reference quality while retaining all
data, we propose a new approach: to revise--not remove--unsupported reference
content. Without ground-truth supervision, we construct synthetic unsupported
alternatives to supported sentences and use contrastive learning to
encourage faithful revisions and discourage unfaithful ones. At inference, we vary style codes
to over-generate revisions of unsupported reference sentences and select a
final revision which balances faithfulness and abstraction. We extract a small
corpus from a noisy source--the Electronic Health Record (EHR)--for the task of
summarizing a hospital admission from multiple notes. Training models on
original, filtered, and revised references, we find (1) learning from revised
references reduces the hallucination rate substantially more than filtering
(18.4% vs. 3.8%), (2) learning from abstractive (vs. extractive) revisions
improves coherence, relevance, and faithfulness, and (3) beyond redressing noisy
data, the revision task has standalone value: as a pre-training
objective and as a post-hoc editor.
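
A sketch of the inference-time selection step: over-generate revisions under different style codes, then keep the one that best balances faithfulness and abstraction. The generator interface, both scorers, and the weighted sum are illustrative assumptions:

from typing import Callable, List

def select_revision(
    sentence: str,
    source: str,
    style_codes: List[str],
    revise: Callable[[str, str], str],             # (sentence, style_code) -> revision
    faithfulness: Callable[[str, str], float],     # (revision, source): higher = better supported
    abstractiveness: Callable[[str, str], float],  # (revision, sentence): higher = less copying
    weight: float = 0.5,
) -> str:
    """Over-generate one revision per style code and pick the revision
    maximizing a weighted balance of faithfulness and abstraction."""
    revisions = [revise(sentence, code) for code in style_codes]
    return max(
        revisions,
        key=lambda r: weight * faithfulness(r, source)
        + (1 - weight) * abstractiveness(r, sentence),
    )
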
- …