Faithfulness in Abstractive Summarization: Progress and Challenges
The exponential increase in online text has created a pressing need for automatic summarization systems that can distill key information from lengthy documents. While neural abstractive summarizers have achieved gains in fluency and coherence, a critical challenge has emerged: ensuring faithfulness, i.e., accurately preserving the meaning of the original text. Modern neural abstractive summarizers can distort or fabricate facts, undermining their reliability in real-world applications. This thesis therefore tackles the issue of improving faithfulness in abstractive summarization. It comprises four parts.
The first part examines challenges in evaluating summarization faithfulness, including issues with reference-free metrics and human evaluation. We propose a novel approach for building automated evaluation metrics that are less reliant on spurious correlations and demonstrate significantly improved performance over existing faithfulness evaluation metrics. We further introduce a novel evaluation framework that enables a more holistic assessment of faithfulness by accounting for the abstractiveness of summarization systems. This framework enables more rigorous faithfulness evaluation, differentiating between gains from increased extraction versus improved abstraction.
The second part focuses on explaining the root causes of faithfulness issues in modern summarization systems. We introduce a novel contrastive approach for attributing errors that vastly outperforms prior work at tracing hallucinations in generated summaries back to training data deficiencies. Moreover, incorporating our method’s ideas into an existing technique substantially boosts its performance. Through a case study, we also analyze pre-training biases and demonstrate their propagation to summarization models, yielding biased hallucinations. We show that while mitigation strategies during finetuning can reduce overall hallucination rates, the remaining hallucinations still closely reflect intrinsic pre-training biases.
The third part applies insights from previous sections to develop impactful techniques for improving faithfulness in practice. We propose a novel approach for adaptively determining the appropriate level of abstractiveness for a given input to improve overall faithfulness. Our method yields systems that are both more faithful and more abstractive compared to baseline systems. We further leverage our error attribution approach to clean noisy training data, significantly reducing faithfulness errors in generated outputs. Models trained on datasets cleaned with our approach generate markedly fewer hallucinations than both baseline systems and models trained using other data cleaning techniques.
Finally, the fourth part examines the summarization capabilities of LLMs and assesses their faithfulness. We demonstrate that instruction-tuning and RLHF are key for enabling LLMs to achieve high-quality zero-shot summarization in the news domain, with state-of-the-art LLMs generating summaries comparable to human-written ones. However, this ability does not extend to narrative summarization, where even advanced LLMs struggle to produce consistently faithful summaries. We also highlight the difficulty of evaluating high-performing LLMs, showing that crowdsourced evaluations of LLM outputs may no longer be reliable as fluency and coherence improve. We observe a substantial gap between crowd workers and experts in identifying deficiencies in LLM-generated narrative summaries.
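The first part of the abstract above ties faithfulness evaluation to the abstractiveness of a system. As a rough illustration only, one common proxy for abstractiveness is the fraction of summary n-grams that never occur in the source. This proxy, the function name `novel_ngram_fraction`, and the whitespace tokenization are assumptions made for this sketch, not necessarily the measure used in the thesis.

```python
def novel_ngram_fraction(source, summary, n=2):
    """Fraction of summary n-grams absent from the source (assumed proxy).

    0.0 means the summary is fully extractive at the n-gram level;
    1.0 means every summary n-gram is new.
    """
    def ngrams(text):
        tokens = text.lower().split()  # naive whitespace tokenization
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    source_ngrams = set(ngrams(source))
    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)
```

Reporting a faithfulness score next to a value like this is one way to check whether a gain comes from copying more of the source or from better abstraction.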
Contrastive Error Attribution for Finetuned Language Models
Recent work has identified noisy and misannotated data as a core cause of
hallucinations and unfaithful outputs in Natural Language Generation (NLG)
tasks. Consequently, identifying and removing these examples is a key open
challenge in creating reliable NLG systems. In this work, we introduce a
framework to identify and remove low-quality training instances that lead to
undesirable outputs, such as faithfulness errors in text summarization. We show
that existing approaches for error tracing, such as gradient-based influence
measures, do not perform reliably for detecting faithfulness errors in NLG
datasets. We overcome the drawbacks of existing error tracing methods through a
new, contrast-based estimate that compares undesired generations to
human-corrected outputs. Our proposed method can achieve a mean average
precision of 0.93 at detecting known data errors across synthetic tasks with
known ground truth, substantially outperforming existing approaches. Using this
approach and re-training models on cleaned data leads to a 70% reduction in
entity hallucinations on the NYT dataset and a 55% reduction in semantic errors
on the E2E dataset.
Comment: ACL 202
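To make the reported metric concrete, the sketch below computes average precision for a ranked list of training examples, where `True` marks a known data error, and averages it over several rankings. The function names and the toy inputs are illustrative assumptions; this is the standard definition of the metric, not the paper's evaluation code.

```python
def average_precision(ranked_is_error):
    """Average of precision@k over the ranks k that hold a true error.

    `ranked_is_error` lists booleans in ranked order, most suspicious first.
    """
    hits = 0
    precisions = []
    for k, is_error in enumerate(ranked_is_error, start=1):
        if is_error:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over several rankings (e.g. several tasks)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

A ranking that places every known error ahead of every clean example scores 1.0, so a value near 0.93 indicates that most errors sit close to the top of the list.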
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Selecting the "right" amount of information to include in a summary is a
difficult task. A good summary should be detailed and entity-centric without
being overly dense and hard to follow. To better understand this tradeoff, we
solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain
of Density" (CoD) prompt. Specifically, GPT-4 generates an initial
entity-sparse summary before iteratively incorporating missing salient entities
without increasing the length. Summaries generated by CoD are more abstractive,
exhibit more fusion, and have less of a lead bias than GPT-4 summaries
generated by a vanilla prompt. We conduct a human preference study on 100
CNN / DailyMail articles and find that humans prefer GPT-4 summaries that are
denser than those generated by a vanilla prompt and almost as dense as
human-written summaries. Qualitative analysis supports the notion that there
exists a tradeoff between informativeness and readability. 500 annotated CoD
summaries, as well as an extra 5,000 unannotated summaries, are freely
available on HuggingFace
(https://huggingface.co/datasets/griffin/chain_of_density).
Comment: preprin
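The iterative procedure described above (an entity-sparse first summary, then repeated rewrites that add missing salient entities without adding length) can be sketched as a simple loop. Everything in this sketch is an assumption made for illustration: the helper name `chain_of_density`, the prompt wording, the default of five steps, the word-count budget, and the choice to call the model once per step. `llm` stands for any function that maps a prompt string to a completion string; no real API is used here.

```python
from typing import Callable, List


def chain_of_density(article: str, llm: Callable[[str], str], steps: int = 5) -> List[str]:
    """Return one summary per densification step (hypothetical helper).

    The prompt wording and the default `steps` are assumptions,
    not the prompt used in the paper.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")

    # Step 1: ask for an initial, entity-sparse summary.
    first_prompt = (
        f"Article:\n{article}\n\n"
        "Write a short summary that mentions only a few entities."
    )
    summaries = [llm(first_prompt)]

    # Steps 2..N: add missing salient entities without growing the summary.
    for _ in range(steps - 1):
        previous = summaries[-1]
        word_budget = len(previous.split())
        densify_prompt = (
            f"Article:\n{article}\n\n"
            f"Previous summary:\n{previous}\n\n"
            "Identify salient entities from the article that are missing from "
            "the previous summary, then rewrite the summary to include them "
            f"using no more than {word_budget} words."
        )
        summaries.append(llm(densify_prompt))

    return summaries
```

Keeping every intermediate summary, rather than only the last one, makes it possible to compare density levels in a preference study like the one described above.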
Generating EDU Extracts for Plan-Guided Summary Re-Ranking
Two-step approaches, in which summary candidates are generated-then-reranked
to return a single summary, can improve ROUGE scores over the standard
single-step approach. Yet, standard decoding methods (i.e., beam search,
nucleus sampling, and diverse beam search) produce candidates with redundant
and often low-quality content. In this paper, we design a novel method to
generate candidates for re-ranking that addresses these issues. We ground each
candidate abstract on its own unique content plan and generate distinct
plan-guided abstracts using a model's top beam. More concretely, a standard
language model (a BART LM) auto-regressively generates elemental discourse unit
(EDU) content plans with an extractive copy mechanism. The top K beams from the
content plan generator are then used to guide a separate LM, which produces a
single abstractive candidate for each distinct plan. We apply an existing
re-ranker (BRIO) to abstractive candidates generated from our method, as well
as baseline decoding methods. We show large relevance improvements over
previously published methods on widely used single-document news article
corpora, with ROUGE-2 F1 gains of 0.88, 2.01, and 0.38 on CNN / DailyMail, NYT,
and XSum, respectively. A human evaluation on CNN / DM validates these results.
Similarly, on 1k samples from CNN / DM, we show that prompting GPT-3 to follow
EDU plans outperforms sampling-based methods by 1.05 ROUGE-2 F1 points. Code to
generate and realize plans is available at
https://github.com/griff4692/edu-sum.
Comment: ACL 202
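Since the results above are reported as ROUGE-2 F1, a simplified version of that score may help readers interpret them. The sketch below counts clipped bigram overlap between a candidate and a reference summary. It assumes lowercasing and whitespace tokenization, with no stemming, so its values will differ from those of official ROUGE implementations; the function names are illustrative.

```python
from collections import Counter


def bigrams(text):
    """Multiset of adjacent token pairs, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1 in [0, 1] from clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # each bigram counted at most min(count) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With a reference available, a function like this can rank several candidate summaries offline; a learned re-ranker is needed at inference time because no reference exists then.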
Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection
Summarizing novel chapters is a difficult task due to the input length and
the fact that sentences that appear in the desired summaries draw content from
multiple places throughout the chapter. We present a pipelined
extractive-abstractive approach where the extractive step filters the content
that is passed to the abstractive component. Extremely lengthy input also
yields a dataset that is highly skewed towards negative instances for extractive
summarization; we thus adopt a margin ranking loss for extraction to encourage
separation between positive and negative examples. Our extraction component
operates at the constituent level; our approach enriches the text with spinal
tree information, which provides syntactic context (in the form of constituents)
to the extraction model. We show an improvement of 3.71 ROUGE-1 points over the
best results reported in prior work on an existing novel chapter dataset.
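The margin ranking loss mentioned above can be illustrated with a minimal sketch. It penalizes any pair in which a positive (summary-worthy) unit does not outscore a negative one by at least a margin. The margin of 1.0, the pairing of every positive with every negative, and the function name are assumptions for this example; the abstract does not specify them.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over all (positive, negative) score pairs.

    A pair contributes zero once the positive score exceeds the
    negative score by at least `margin` (an assumed value here).
    """
    losses = [
        max(0.0, margin - (pos - neg))
        for pos in pos_scores
        for neg in neg_scores
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

Because the loss depends only on score differences within pairs, it is less sensitive to the sheer number of negative instances than a per-instance classification loss, which is one plausible reason to prefer it on a heavily skewed dataset.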