1,802 research outputs found
TGSum: Build Tweet Guided Multi-Document Summarization Dataset
The development of summarization research has been significantly hampered by
the costly acquisition of reference summaries. This paper proposes an effective
way to automatically collect large scales of news-related multi-document
summaries with reference to social media's reactions. We utilize two types of
social labels in tweets, i.e., hashtags and hyper-links. Hashtags are used to
cluster documents into different topic sets. Also, a tweet with a hyper-link
often highlights certain key points of the corresponding document. We
synthesize a linked document cluster to form a reference summary which can
cover most key points. To this aim, we adopt the ROUGE metrics to measure the
coverage ratio, and develop an Integer Linear Programming solution to discover
the sentence set reaching the upper bound of ROUGE. Since we allow summary
sentences to be selected from both documents and high-quality tweets, the
generated reference summaries could be abstractive. Both informativeness and
readability of the collected summaries are verified by manual judgment. In
addition, we train a Support Vector Regression summarizer on DUC generic
multi-document summarization benchmarks. With the collected data as extra
training resource, the performance of the summarizer improves a lot on all the
test sets. We release this dataset for further research.Comment: 7 pages, 1 figure in AAAI 201
Multi-Document Summarization via Discriminative Summary Reranking
Existing multi-document summarization systems usually rely on a specific
summarization model (i.e., a summarization method with a specific parameter
setting) to extract summaries for different document sets with different
topics. However, according to our quantitative analysis, none of the existing
summarization models can always produce high-quality summaries for different
document sets, and even a summarization model with good overall performance may
produce low-quality summaries for some document sets. On the contrary, a
baseline summarization model may produce high-quality summaries for some
document sets. Based on the above observations, we treat the summaries produced
by different summarization models as candidate summaries, and then explore
discriminative reranking techniques to identify high-quality summaries from the
candidates for difference document sets. We propose to extract a set of
candidate summaries for each document set based on an ILP framework, and then
leverage Ranking SVM for summary reranking. Various useful features have been
developed for the reranking process, including word-level features,
sentence-level features and summary-level features. Evaluation results on the
benchmark DUC datasets validate the efficacy and robustness of our proposed
approach
A Novel ILP Framework for Summarizing Content with High Lexical Variety
Summarizing content contributed by individuals can be challenging, because
people make different lexical choices even when describing the same events.
However, there remains a significant need to summarize such content. Examples
include the student responses to post-class reflective questions, product
reviews, and news articles published by different news agencies related to the
same events. High lexical diversity of these documents hinders the system's
ability to effectively identify salient content and reduce summary redundancy.
In this paper, we overcome this issue by introducing an integer linear
programming-based summarization framework. It incorporates a low-rank
approximation to the sentence-word co-occurrence matrix to intrinsically group
semantically-similar lexical items. We conduct extensive experiments on
datasets of student responses, product reviews, and news documents. Our
approach compares favorably to a number of extractive baselines as well as a
neural abstractive summarization system. The paper finally sheds light on when
and why the proposed framework is effective at summarizing content with high
lexical variety.Comment: Accepted for publication in the journal of Natural Language
Engineering, 201
Graph-based Neural Multi-Document Summarization
We propose a neural multi-document summarization (MDS) system that
incorporates sentence relation graphs. We employ a Graph Convolutional Network
(GCN) on the relation graphs, with sentence embeddings obtained from Recurrent
Neural Networks as input node features. Through multiple layer-wise
propagation, the GCN generates high-level hidden sentence features for salience
estimation. We then use a greedy heuristic to extract salient sentences while
avoiding redundancy. In our experiments on DUC 2004, we consider three types of
sentence relation graphs and demonstrate the advantage of combining sentence
relations in graphs with the representation power of deep neural networks. Our
model improves upon traditional graph-based extractive approaches and the
vanilla GRU sequence model with no graph, and it achieves competitive results
against other state-of-the-art multi-document summarization systems.Comment: In CoNLL 201
A Graph-Based Approach for the Summarization of Scientific Articles
Automatic text summarization is one of the eminent applications in the field of
Natural Language Processing. Text summarization is the process of generating
a gist from text documents. The task is to produce a summary which contains
important, diverse and coherent information, i.e., a summary should be self-contained.
The approaches for text summarization are conventionally extractive.
The extractive approaches select a subset of sentences from an input document
for a summary. In this thesis, we introduce a novel graph-based extractive summarization
approach.
With the progressive advancement of research in the various fields of science,
the summarization of scientific articles has become an essential requirement for
researchers. This is our prime motivation in selecting scientific articles as our
dataset. This newly formed dataset contains scientific articles from the PLOS
Medicine journal, which is a high impact journal in the field of biomedicine.
The summarization of scientific articles is a single-document summarization task.
It is a complex task due to various reasons, one of it being, the important information
in the scientific article is scattered all over it and another reason being, scientific
articles contain numerous redundant information. In our approach, we deal
with the three important factors of summarization: importance, non-redundancy
and coherence. To deal with these factors, we use graphs as they solve data sparsity
problems and are computationally less complex.
We employ bipartite graphical representation for the summarization task, exclusively.
We represent input documents through a bipartite graph that consists of
sentence nodes and entity nodes. This bipartite graph representation contains entity
transition information which is beneficial for selecting the relevant sentences
for a summary. We use a graph-based ranking algorithm to rank the sentences in
a document. The ranks are considered as relevance scores of the sentences which
are further used in our approach.
Scientific articles contain reasonable amount of redundant information, for example,
Introduction and Methodology sections contain similar information regarding
the motivation and approach. In our approach, we ensure that the summary contains
sentences which are non-redundant.
Though the summary should contain important and non-redundant information of
the input document, its sentences should be connected to one another such that
it becomes coherent, understandable and simple to read. If we do not ensure
that a summary is coherent, its sentences may not be properly connected. This
leads to an obscure summary. Until now, only few summarization approaches
take care of coherence. In our approach, we take care of coherence in two different
ways: by using the graph measure and by using the structural information. We
employ outdegree as the graph measure and coherence patterns for the structural
information, in our approach.
We use integer programming as an optimization technique, to select the best subset
of sentences for a summary. The sentences are selected on the basis of relevance,
diversity and coherence measure. The computation of these measures is
tightly integrated and taken care of simultaneously.
We use human judgements to evaluate coherence of summaries. We compare
ROUGE scores and human judgements of different systems on the PLOS Medicine
dataset. Our approach performs considerably better than other systems on this
dataset. Also, we apply our approach on the standard DUC 2002 dataset to compare
the results with the recent state-of-the-art systems. The results show that our
graph-based approach outperforms other systems on DUC 2002. In conclusion,
our approach is robust, i.e., it works on both scientific and news articles. Our
approach has the further advantage of being semi-supervised
A Multilingual Study of Compressive Cross-Language Text Summarization
Cross-Language Text Summarization (CLTS) generates summaries in a language
different from the language of the source documents. Recent methods use
information from both languages to generate summaries with the most informative
sentences. However, these methods have performance that can vary according to
languages, which can reduce the quality of summaries. In this paper, we propose
a compressive framework to generate cross-language summaries. In order to
analyze performance and especially stability, we tested our system and
extractive baselines on a dataset available in four languages (English, French,
Portuguese, and Spanish) to generate English and French summaries. An automatic
evaluation showed that our method outperformed extractive state-of-art CLTS
methods with better and more stable ROUGE scores for all languages
Regression model focused on query for multi documents summarization based on significance of the sentence position
Document summarization is needed to get the information effectively and efficiently. One method used to obtain the document summarization by applying machine learning techniques. This paper proposes the application of regression models to query-focused multi-document summarization based on the significance of the sentence position. The method used is the Support Vector Regression (SVR) which estimates the weight of the sentence on a set of documents to be made as a summary based on sentence feature which has been defined previously. A series of evaluations performed on a data set of DUC 2005. From the test results obtained summary which has an average precision and recall values of 0.0580 and 0.0590 for measurements using ROUGE-2, ROUGE 0.0997 and 0.1019 for measurements using the proposed regression-SU4. Model can perform measurements of the significance of the position of the sentence in the document well
- …