ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization
The performance of abstractive text summarization has recently been greatly
boosted by pre-trained language models. The main concern with existing
abstractive summarization methods is the factual inconsistency of their
generated summaries. To alleviate this problem, many efforts have focused on
developing effective factuality evaluation metrics based on natural language
inference, question answering, and related techniques. However, these metrics
suffer from high computational complexity and reliance on annotated data. Most
recently, large language models
such as ChatGPT have shown strong ability in not only natural language
understanding but also natural language inference. In this paper, we study the
factual inconsistency evaluation ability of ChatGPT under the zero-shot setting
by evaluating it on the coarse-grained and fine-grained factuality evaluation
tasks including binary natural language inference (NLI), summary ranking, and
consistency rating. Experimental results show that ChatGPT outperforms previous
SOTA evaluation metrics on 6/9 datasets across three tasks, demonstrating its
great potential for assessing factual inconsistency in the zero-shot setting.
The results also highlight the importance of prompt design and the need for
future efforts to address ChatGPT's limitations on evaluation bias, wrong
reasoning, and hallucination.
Comment: ongoing work, 12 pages, 4 figures
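The binary NLI task described in this abstract amounts to prompting the model with a document–summary pair and parsing a yes/no verdict. A minimal sketch of that protocol is below; the prompt wording and the `parse_verdict` heuristic are illustrative assumptions, not the paper's exact prompts, and the actual ChatGPT API call is deliberately left out.

```python
# Sketch of zero-shot binary-NLI factual-consistency checking, in the
# spirit of the abstract above. Prompt wording and verdict parsing are
# assumptions for illustration; the real LLM call is stubbed out.

def build_nli_prompt(document: str, summary: str) -> str:
    """Compose a zero-shot prompt asking whether the summary is
    factually consistent with the source document."""
    return (
        "Decide if the following summary is factually consistent "
        "with the source document. Answer 'yes' or 'no'.\n\n"
        f"Source document:\n{document}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer:"
    )

def parse_verdict(model_reply: str) -> bool:
    """Map a free-form model reply onto a binary consistency label
    (a simple heuristic: does the reply start with 'yes'?)."""
    return model_reply.strip().lower().startswith("yes")

# Usage with a stubbed model reply instead of a live API call:
prompt = build_nli_prompt("The cat sat on the mat.", "A cat sat on a mat.")
verdict = parse_verdict("Yes, the summary is consistent.")
print(verdict)  # True
```

In a real evaluation loop, the prompt would be sent to the model once per document–summary pair and the parsed verdicts compared against human consistency labels.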
The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey
Recently, various neural encoder-decoder models, pioneered by the Seq2Seq
framework, have been proposed to generate more abstractive summaries by
learning to map input text to output text. At a high level, such neural models
can freely generate summaries without any constraint on the words or phrases
used. Moreover, their format is closer to human-edited summaries, and their
output is more readable and fluent. However, a neural model's abstraction
ability is a double-edged sword. A commonly observed problem with the generated
summaries is the distortion or fabrication of factual information in the
article. This inconsistency between the original text and the summary has
raised serious concerns about the applicability of such models, and existing
evaluation methods for text summarization are not suited to measuring this
issue. In response to these problems, current research is predominantly
divided into two categories: one designs fact-aware evaluation metrics to
select outputs free of factual inconsistency errors, and the other develops
new summarization systems that target factual consistency. In this survey, we focus on
presenting a comprehensive review of these fact-specific evaluation methods and
text summarization models.
Comment: 9 pages, 5 figures