11,760 research outputs found
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
We introduce a stochastic graph-based method for computing relative
importance of textual units for Natural Language Processing. We test the
technique on the problem of Text Summarization (TS). Extractive TS relies on
the concept of sentence salience to identify the most important sentences in a
document or set of documents. Salience is typically defined in terms of the
presence of particular important words or in terms of similarity to a centroid
pseudo-sentence. We consider a new approach, LexRank, for computing sentence
importance based on the concept of eigenvector centrality in a graph
representation of sentences. In this model, a connectivity matrix based on
intra-sentence cosine similarity is used as the adjacency matrix of the graph
representation of sentences. Our system, based on LexRank ranked in first place
in more than one task in the recent DUC 2004 evaluation. In this paper we
present a detailed analysis of our approach and apply it to a larger data set
including data from earlier DUC evaluations. We discuss several methods to
compute centrality using the similarity graph. The results show that
degree-based methods (including LexRank) outperform both centroid-based methods
and other systems participating in DUC in most of the cases. Furthermore, the
LexRank with threshold method outperforms the other degree-based techniques
including continuous LexRank. We also show that our approach is quite
insensitive to the noise in the data that may result from an imperfect topical
clustering of documents
Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps
Concept maps can be used to concisely represent important information and
bring structure into large document collections. Therefore, we study a variant
of multi-document summarization that produces summaries in the form of concept
maps. However, suitable evaluation datasets for this task are currently
missing. To close this gap, we present a newly created corpus of concept maps
that summarize heterogeneous collections of web documents on educational
topics. It was created using a novel crowdsourcing approach that allows us to
efficiently determine important elements in large document collections. We
release the corpus along with a baseline system and proposed evaluation
protocol to enable further research on this variant of summarization.Comment: Published at EMNLP 201
Text Summarization Techniques: A Brief Survey
In recent years, there has been a explosion in the amount of text data from a
variety of sources. This volume of text is an invaluable source of information
and knowledge which needs to be effectively summarized to be useful. In this
review, the main approaches to automatic text summarization are described. We
review the different processes for summarization and describe the effectiveness
and shortcomings of the different methods.Comment: Some of references format have update
Some Reflections on the Task of Content Determination in the Context of Multi-Document Summarization of Evolving Events
Despite its importance, the task of summarizing evolving events has received
small attention by researchers in the field of multi-document summariztion. In
a previous paper (Afantenos et al. 2007) we have presented a methodology for
the automatic summarization of documents, emitted by multiple sources, which
describe the evolution of an event. At the heart of this methodology lies the
identification of similarities and differences between the various documents,
in two axes: the synchronic and the diachronic. This is achieved by the
introduction of the notion of Synchronic and Diachronic Relations. Those
relations connect the messages that are found in the documents, resulting thus
in a graph which we call grid. Although the creation of the grid completes the
Document Planning phase of a typical NLG architecture, it can be the case that
the number of messages contained in a grid is very large, exceeding thus the
required compression rate. In this paper we provide some initial thoughts on a
probabilistic model which can be applied at the Content Determination stage,
and which tries to alleviate this problem.Comment: 5 pages, 2 figure
- …