7 research outputs found
Zero-shot Conversational Summarization Evaluations with small Large Language Models
Large Language Models (LLMs) exhibit powerful summarization abilities.
However, their capabilities on conversational summarization remains under
explored. In this work we evaluate LLMs (approx. 10 billion parameters) on
conversational summarization and showcase their performance on various prompts.
We show that the summaries generated by models depend on the instructions and
the performance of LLMs vary with different instructions sometimes resulting
steep drop in ROUGE scores if prompts are not selected carefully. We also
evaluate the models with human evaluations and discuss the limitations of the
models on conversational summarizationComment: Accepted at RoF0Mo workshop at Neurips 202
Evaluation of Automatic Text Summarization Using Synthetic Facts
Automatic text summarization has achieved remarkable success with the development of deep neural networks and the availability of standardized benchmark datasets. It can generate fluent, human-like summaries. However, the unreliability of the existing evaluation metrics hinders its practical usage and slows down its progress. To address this issue, we propose an automatic reference-less text summarization evaluation system with dynamically generated synthetic facts. We hypothesize that if a system guarantees a summary that has all the facts that are 100% known in the synthetic document, it can provide natural interpretability and high feasibility in measuring factual consistency and comprehensiveness. To our knowledge, our system is the first system that measures the overarching quality of the text summarization models with factual consistency, comprehensiveness, and compression rate. We validate our system by comparing its correlation with human judgment with existing N-gram overlap-based metrics such as ROUGE and BLEU and a BERT-based evaluation metric, BERTScore. Our system\u27s experimental evaluation of PEGASUS, BART, and T5 outperforms the current evaluation metrics in measuring factual consistency with a noticeable margin and demonstrates its statistical significance in measuring comprehensiveness and overall summary quality
Low- and high-resource opinion summarization
Customer reviews play a vital role in the online purchasing decisions we make. The reviews
express user opinions that are useful for setting realistic expectations and uncovering important
details about products. However, some products receive hundreds or even thousands of
reviews, making them time-consuming to read. Moreover, many reviews contain uninformative
content, such as irrelevant personal experiences. Automatic summarization offers an
alternative – short text summaries capturing the essential information expressed in reviews.
Automatically produced summaries can reflect overall or particular opinions and be tailored to
user preferences. Besides being presented on major e-commerce platforms, home assistants
can also vocalize them. This approach can improve user satisfaction by assisting in making
faster and better decisions.
Modern summarization approaches are based on neural networks, often requiring thousands of
annotated samples for training. However, human-written summaries for products are expensive
to produce because annotators need to read many reviews. This has led to annotated data
scarcity where only a few datasets are available. Data scarcity is the central theme of our
works, and we propose a number of approaches to alleviate the problem. The thesis consists
of two parts where we discuss low- and high-resource data settings.
In the first part, we propose self-supervised learning methods applied to customer reviews
and few-shot methods for learning from small annotated datasets. Customer reviews without
summaries are available in large quantities, contain a breadth of in-domain specifics, and
provide a powerful training signal. We show that reviews can be used for learning summarizers
via a self-supervised objective. Further, we address two main challenges associated with
learning from small annotated datasets. First, large models rapidly overfit on small datasets
leading to poor generalization. Second, it is not possible to learn a wide range of in-domain
specifics (e.g., product aspects and usage) from a handful of gold samples. This leads to
subtle semantic mistakes in generated summaries, such as ‘great dead on arrival battery.’ We
address the first challenge by explicitly modeling summary properties (e.g., content coverage
and sentiment alignment). Furthermore, we leverage small modules – adapters – that are
more robust to overfitting. As we show, despite their size, these modules can be used to
store in-domain knowledge to reduce semantic mistakes. Lastly, we propose a simple method
for learning personalized summarizers based on aspects, such as ‘price,’ ‘battery life,’ and
‘resolution.’ This task is harder to learn, and we present a few-shot method for training a
query-based summarizer on small annotated datasets.
In the second part, we focus on the high-resource setting and present a large dataset with
summaries collected from various online resources. The dataset has more than 33,000 humanwritten
summaries, where each is linked up to thousands of reviews. This, however, makes it
challenging to apply an ‘expensive’ deep encoder due to memory and computational costs. To
address this problem, we propose selecting small subsets of informative reviews. Only these
subsets are encoded by the deep encoder and subsequently summarized. We show that the
selector and summarizer can be trained end-to-end via amortized inference and policy gradient
methods