Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning
Evaluation of a document summarization system is a critical factor in the success of the summarization task. Previous approaches, such as ROUGE, mainly consider the informativeness of the assessed summary and require human-generated references for each test summary. In this work, we propose to evaluate summary quality without reference summaries via unsupervised contrastive learning. Specifically, we design a new BERT-based metric that covers both linguistic quality and semantic informativeness. To learn the metric, we construct, for each summary, different types of negative samples targeting different aspects of summary quality, and train our model with a ranking loss. Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries. Furthermore, we show that our method is general and transferable
across datasets.
Comment: Long Paper in EMNLP 2020
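For intuition, here is a minimal sketch (not the authors' implementation) of such a reference-free scorer: a BERT encoder with a scalar scoring head, trained with a pairwise ranking loss so that each original summary outscores its constructed negatives. The negative-construction strategies, margin, and model name are illustrative assumptions.

```python
# Minimal sketch of a reference-free summary scorer trained with a
# ranking loss over constructed negatives (illustrative, not official code).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
score_head = nn.Linear(encoder.config.hidden_size, 1)  # scalar quality score

def score(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return score_head(cls).squeeze(-1)

rank_loss = nn.MarginRankingLoss(margin=0.1)  # margin is an assumption

def training_step(positive, negatives):
    # The original (positive) summary should outscore every corrupted negative.
    pos = score([positive] * len(negatives))
    neg = score(negatives)
    target = torch.ones_like(pos)  # +1 means pos should rank above neg
    return rank_loss(pos, neg, target)

loss = training_step(
    "The court upheld the ruling on Tuesday.",
    ["The court the upheld ruling on Tuesday.",   # word-order negative (fluency)
     "The senate upheld the ruling on Tuesday."], # entity-swap negative (faithfulness)
)
loss.backward()
```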
TGSum: Build Tweet Guided Multi-Document Summarization Dataset
The development of summarization research has been significantly hampered by the costly acquisition of reference summaries. This paper proposes an effective way to automatically collect large-scale news-related multi-document summaries by exploiting reactions on social media. We utilize two types of social labels in tweets, i.e., hashtags and hyperlinks. Hashtags are used to cluster documents into different topic sets, while a tweet with a hyperlink often highlights key points of the corresponding document. For each linked document cluster, we synthesize a reference summary that covers most of these key points. To this end, we adopt the ROUGE metrics to measure the coverage ratio, and develop an Integer Linear Programming (ILP) solution to discover the sentence set reaching the upper bound of ROUGE. Since we allow summary sentences to be selected from both documents and high-quality tweets, the generated reference summaries can be abstractive. Both the informativeness and readability of the collected summaries are verified by manual judgment. In addition, we train a Support Vector Regression summarizer on the DUC generic multi-document summarization benchmarks. With the collected data as an extra training resource, the summarizer's performance improves substantially on all
test sets. We release this dataset for further research.
Comment: 7 pages, 1 figure, in AAAI 2016
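As a rough illustration of the ROUGE-maximizing selection step, the sketch below formulates a tiny ILP with PuLP: binary variables select sentences, coverage variables mark which target unigrams the selection captures, and the objective maximizes covered weight under a length budget. The exact objective, constraints, and budget in the paper differ, so everything here is an assumption for illustration.

```python
# Hedged sketch of ROUGE-oriented sentence selection via Integer Linear
# Programming (illustrative; not the paper's exact formulation).
import pulp

def select_sentences(sentences, target_unigrams, max_sents=3):
    # sentences: list of token lists; target_unigrams: {unigram: weight}
    prob = pulp.LpProblem("rouge_upper_bound", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(sentences))]
    c = {g: pulp.LpVariable(f"c_{j}", cat="Binary")
         for j, g in enumerate(target_unigrams)}
    # Objective: total weight of target unigrams covered by the selection.
    prob += pulp.lpSum(w * c[g] for g, w in target_unigrams.items())
    # A unigram counts as covered only if some selected sentence contains it.
    for g in target_unigrams:
        prob += c[g] <= pulp.lpSum(x[i] for i, s in enumerate(sentences)
                                   if g in s)
    prob += pulp.lpSum(x) <= max_sents  # length budget (assumed)
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(len(sentences)) if x[i].value() == 1]

# Toy usage with made-up sentences and weights:
docs = [["tweets", "guide", "summary", "construction"],
        ["hashtags", "cluster", "documents"]]
print(select_sentences(docs, {"tweets": 2, "cluster": 1}, max_sents=1))
```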
INEX Tweet Contextualization Task: Evaluation, Results and Lesson Learned
Microblogging platforms such as Twitter are increasingly used for online client and market analysis. This motivated the proposal of a new track at the CLEF INEX lab on Tweet Contextualization. The objective of this task was to help a user understand a tweet by providing a short explanatory summary (500 words). This summary had to be built automatically from resources such as Wikipedia, by extracting relevant passages and aggregating them into a coherent summary. Over the four years the task ran, results show that the best systems combine NLP techniques with more traditional methods. More precisely, the best performing systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, part-of-speech (POS) analysis, anaphora detection, a diversity content measure, and sentence reordering. This paper provides a full report on the four-year-long task. While the yearly overviews focused on system results, here we report in detail on the approaches proposed by the participants, which can be considered the state of the art for this task. As an important outcome of the four-year competition, we also describe the open-access resources that have been built and collected. The evaluation measures for automatic summarization designed in DUC or MUC were not appropriate for evaluating tweet contextualization; we explain why and describe in detail the LogSim measure used to evaluate the informativeness of the produced contexts or summaries. Finally, we discuss the lessons we learned, which are worth considering when designing such a task.
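The paper defines LogSim precisely; as a loose stand-in, the sketch below scores a produced summary against a pool of relevant passages with a smoothed, log-weighted comparison of token distributions. This illustrates divergence-style informativeness scoring in general and is an assumption, not the actual LogSim formula.

```python
# Divergence-style informativeness sketch (NOT the LogSim definition):
# compares the summary's token distribution with a pooled reference
# distribution using additive smoothing; scores closer to 0 are better.
import math
from collections import Counter

def informativeness(summary_tokens, pool_tokens, alpha=0.1):
    p = Counter(summary_tokens)   # summary distribution
    q = Counter(pool_tokens)      # pooled relevant-passage distribution
    vocab = set(p) | set(q)
    n_p, n_q = sum(p.values()), sum(q.values())
    score = 0.0
    for t in vocab:
        pt = (p[t] + alpha) / (n_p + alpha * len(vocab))
        qt = (q[t] + alpha) / (n_q + alpha * len(vocab))
        score += pt * math.log(qt / pt)  # equals -KL(P || Q)
    return score
```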
SMART: Sentences as Basic Units for Text Evaluation
Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate these limitations. Specifically, we treat sentences rather than tokens as the basic units of matching, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that, in system-level correlation, our proposed metric with a model-based matching function outperforms all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which is useful during model development phases where resources can be limited and fast evaluation is required. Finally, extensive analyses show that our proposed metrics work well with longer summaries and are less biased towards specific models.
Comment: code coming soon
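For a concrete picture of sentence-level soft matching, here is a hedged sketch in the spirit of SMART: each candidate sentence is credited with its best-matching reference sentence (precision side) and vice versa (recall side), combined into an F-measure. The token-overlap similarity used here is a stand-in assumption, not SMART's actual matching function.

```python
# Sentence-level soft matching sketch (illustrative, not SMART's definition).
def token_f1(a, b):
    # Simple string-based similarity between two sentences (assumed matcher).
    a, b = set(a.lower().split()), set(b.lower().split())
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(a), overlap / len(b)
    return 2 * p * r / (p + r)

def soft_match_f1(candidate_sents, reference_sents, sim=token_f1):
    # Precision: each candidate sentence matched to its best reference.
    precision = sum(max(sim(c, r) for r in reference_sents)
                    for c in candidate_sents) / len(candidate_sents)
    # Recall: each reference sentence matched to its best candidate.
    recall = sum(max(sim(r, c) for c in candidate_sents)
                 for r in reference_sents) / len(reference_sents)
    return 2 * precision * recall / (precision + recall + 1e-12)

print(soft_match_f1(["The court upheld the ruling."],
                    ["The ruling was upheld by the court.",
                     "The decision came on Tuesday."]))
```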