6,463 research outputs found
Multi-Document Summarization via Discriminative Summary Reranking
Existing multi-document summarization systems usually rely on a specific
summarization model (i.e., a summarization method with a specific parameter
setting) to extract summaries for different document sets with different
topics. However, according to our quantitative analysis, none of the existing
summarization models can always produce high-quality summaries for different
document sets, and even a summarization model with good overall performance may
produce low-quality summaries for some document sets. On the contrary, a
baseline summarization model may produce high-quality summaries for some
document sets. Based on the above observations, we treat the summaries produced
by different summarization models as candidate summaries, and then explore
discriminative reranking techniques to identify high-quality summaries from the
candidates for difference document sets. We propose to extract a set of
candidate summaries for each document set based on an ILP framework, and then
leverage Ranking SVM for summary reranking. Various useful features have been
developed for the reranking process, including word-level features,
sentence-level features and summary-level features. Evaluation results on the
benchmark DUC datasets validate the efficacy and robustness of our proposed
approach
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
We introduce a stochastic graph-based method for computing relative
importance of textual units for Natural Language Processing. We test the
technique on the problem of Text Summarization (TS). Extractive TS relies on
the concept of sentence salience to identify the most important sentences in a
document or set of documents. Salience is typically defined in terms of the
presence of particular important words or in terms of similarity to a centroid
pseudo-sentence. We consider a new approach, LexRank, for computing sentence
importance based on the concept of eigenvector centrality in a graph
representation of sentences. In this model, a connectivity matrix based on
intra-sentence cosine similarity is used as the adjacency matrix of the graph
representation of sentences. Our system, based on LexRank ranked in first place
in more than one task in the recent DUC 2004 evaluation. In this paper we
present a detailed analysis of our approach and apply it to a larger data set
including data from earlier DUC evaluations. We discuss several methods to
compute centrality using the similarity graph. The results show that
degree-based methods (including LexRank) outperform both centroid-based methods
and other systems participating in DUC in most of the cases. Furthermore, the
LexRank with threshold method outperforms the other degree-based techniques
including continuous LexRank. We also show that our approach is quite
insensitive to the noise in the data that may result from an imperfect topical
clustering of documents
Comprehensive Review of Opinion Summarization
The abundance of opinions on the web has kindled the study of opinion summarization over the last few years. People have introduced various techniques and paradigms to solving this special task. This survey attempts to systematically investigate the different techniques and approaches used in opinion summarization. We provide a multi-perspective classification of the approaches used and highlight some of the key weaknesses of these approaches. This survey also covers evaluation techniques and data sets used in studying the opinion summarization problem. Finally, we provide insights into some of the challenges that are left to be addressed as this will help set the trend for future research in this area.unpublishednot peer reviewe
Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data
Usage of online textual media is steadily increasing. Daily, more and more
news stories, blog posts and scientific articles are added to the online
volumes. These are all freely accessible and have been employed extensively in
multiple research areas, e.g. automatic text summarization, information
retrieval, information extraction, etc. Meanwhile, online debate forums have
recently become popular, but have remained largely unexplored. For this reason,
there are no sufficient resources of annotated debate data available for
conducting research in this genre. In this paper, we collected and annotated
debate data for an automatic summarization task. Similar to extractive gold
standard summary generation our data contains sentences worthy to include into
a summary. Five human annotators performed this task. Inter-annotator
agreement, based on semantic similarity, is 36% for Cohen's kappa and 48% for
Krippendorff's alpha. Moreover, we also implement an extractive summarization
system for online debates and discuss prominent features for the task of
summarizing online debate data automatically.Comment: accepted and presented at the CICLING 2017 - 18th International
Conference on Intelligent Text Processing and Computational Linguistic
- …