Automatic text summarization is one of the prominent applications in the field of
Natural Language Processing. Text summarization is the process of generating
a gist from text documents. The task is to produce a summary that contains
important, diverse and coherent information, i.e., a summary should be self-contained.
Approaches to text summarization are conventionally extractive.
Extractive approaches select a subset of sentences from an input document
for a summary. In this thesis, we introduce a novel graph-based extractive summarization
approach.
With the progressive advancement of research in the various fields of science,
the summarization of scientific articles has become an essential requirement for
researchers. This is our prime motivation for selecting scientific articles as our
dataset. This newly formed dataset contains scientific articles from the PLOS
Medicine journal, which is a high-impact journal in the field of biomedicine.
The summarization of scientific articles is a single-document summarization task.
It is a complex task for several reasons: for one, the important information
in a scientific article is scattered throughout it; for another, scientific
articles contain a great deal of redundant information. In our approach, we deal
with the three important factors of summarization: importance, non-redundancy
and coherence. To deal with these factors, we use graphs, as they alleviate data sparsity
problems and are computationally less complex.
We rely exclusively on a bipartite graph representation for the summarization task.
We represent input documents through a bipartite graph that consists of
sentence nodes and entity nodes. This bipartite graph representation contains entity
transition information which is beneficial for selecting the relevant sentences
for a summary. We use a graph-based ranking algorithm to rank the sentences in
a document. The ranks are treated as relevance scores of the sentences, which
are used in the subsequent steps of our approach.
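The sentence-entity representation and ranking step can be illustrated with a minimal sketch. The abstract does not name the ranking algorithm, so a HITS-style mutual reinforcement between sentence and entity nodes is swapped in here purely as one plausible choice. Entity extraction is assumed to have been done already, and the names `build_bipartite_graph` and `rank_sentences` are hypothetical.

```python
from collections import defaultdict


def build_bipartite_graph(sentence_entities):
    """Map each entity node to the indices of the sentence nodes mentioning it.

    `sentence_entities` is a list of sets, one set of entity strings per
    sentence (an assumed input format, not taken from the source).
    """
    entity_to_sentences = defaultdict(set)
    for idx, entities in enumerate(sentence_entities):
        for entity in entities:
            entity_to_sentences[entity].add(idx)
    return entity_to_sentences


def rank_sentences(sentence_entities, iterations=50):
    """Score sentences by HITS-style iteration on the bipartite graph.

    Assumption: sentences and entities reinforce each other, so a sentence
    is relevant if it mentions highly scored entities and vice versa.
    """
    entity_to_sentences = build_bipartite_graph(sentence_entities)
    sent_scores = [1.0] * len(sentence_entities)
    for _ in range(iterations):
        # Entity score: sum of the scores of the sentences that mention it.
        entity_scores = {
            e: sum(sent_scores[i] for i in sents)
            for e, sents in entity_to_sentences.items()
        }
        norm = sum(v * v for v in entity_scores.values()) ** 0.5 or 1.0
        entity_scores = {e: v / norm for e, v in entity_scores.items()}
        # Sentence score: sum of the scores of the entities it mentions.
        sent_scores = [
            sum(entity_scores[e] for e in ents) for ents in sentence_entities
        ]
        norm = sum(v * v for v in sent_scores) ** 0.5 or 1.0
        sent_scores = [v / norm for v in sent_scores]
    return sent_scores


sentences = [{"malaria", "vaccine"}, {"vaccine", "trial"}, {"weather"}]
scores = rank_sentences(sentences)
```

In this toy input, the two sentences that share the entity "vaccine" reinforce each other and end up with higher scores than the isolated third sentence.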
Scientific articles contain a fair amount of redundant information; for example,
the Introduction and Methodology sections contain similar information regarding
the motivation and approach. In our approach, we ensure that the summary contains
sentences that are non-redundant.
Although the summary should contain the important and non-redundant information of
the input document, its sentences should also be connected to one another so that
the summary is coherent, understandable and easy to read. If we do not ensure
that a summary is coherent, its sentences may not be properly connected, which
leads to an obscure summary. Until now, only a few summarization approaches
have taken coherence into account. In our approach, we address coherence in two different
ways: by using a graph measure and by using structural information. We
employ outdegree as the graph measure and coherence patterns as the structural
information.
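As a rough illustration of an outdegree-based coherence measure, consider the sketch below. The abstract does not specify the graph on which outdegree is computed. The sketch therefore assumes a directed sentence graph in which an edge runs from an earlier sentence to every later sentence that shares at least one entity with it. Both this construction and the name `sentence_outdegrees` are assumptions made for illustration.

```python
def sentence_outdegrees(sentence_entities):
    """Return the outdegree of each sentence in an assumed directed graph.

    Assumption: sentence i has an edge to a later sentence j (i < j)
    whenever the two share at least one entity. `sentence_entities` is a
    list of sets of entity strings, in document order.
    """
    n = len(sentence_entities)
    outdegree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if sentence_entities[i] & sentence_entities[j]:
                outdegree[i] += 1
    return outdegree


doc = [{"malaria", "vaccine"}, {"vaccine", "trial"}, {"trial"}, {"weather"}]
degrees = sentence_outdegrees(doc)
```

Under this assumption, a sentence with a high outdegree introduces entities that later sentences pick up again, which is one way of reading "connected to one another".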
We use integer programming as an optimization technique to select the best subset
of sentences for a summary. The sentences are selected on the basis of relevance,
diversity and coherence measures. The computation of these measures is
tightly integrated, and all three are handled simultaneously.
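The joint selection can be pictured as a 0/1 optimization problem with one binary decision per sentence. The sketch below is not the thesis formulation: it replaces an integer programming solver with exhaustive enumeration (feasible only for a handful of sentences), and its objective terms are hypothetical stand-ins. Relevance is a given score per sentence, diversity is the number of distinct entities covered, and coherence is the number of selected sentence pairs that share an entity. The weights `w_div` and `w_coh` and the length budget are likewise assumptions.

```python
from itertools import combinations, product


def select_sentences(relevance, entities, lengths, budget, w_div=1.0, w_coh=0.5):
    """Pick the subset of sentences maximizing a joint toy objective.

    Each candidate is a 0/1 vector x, as in an integer program; all
    vectors are enumerated instead of calling a solver.
    """
    n = len(relevance)
    best_score, best_subset = float("-inf"), ()
    for x in product((0, 1), repeat=n):
        chosen = tuple(i for i in range(n) if x[i])
        # Constraint: total summary length must not exceed the budget.
        if sum(lengths[i] for i in chosen) > budget:
            continue
        rel = sum(relevance[i] for i in chosen)
        # Diversity proxy: number of distinct entities covered.
        div = len(set().union(*(entities[i] for i in chosen)))
        # Coherence proxy: selected pairs sharing at least one entity.
        coh = sum(
            1 for i, j in combinations(chosen, 2) if entities[i] & entities[j]
        )
        score = rel + w_div * div + w_coh * coh
        if score > best_score:
            best_score, best_subset = score, chosen
    return best_subset


relevance = [0.9, 0.8, 0.1]
entities = [{"malaria", "vaccine"}, {"vaccine", "trial"}, {"weather"}]
lengths = [10, 12, 8]
summary = select_sentences(relevance, entities, lengths, budget=22)
```

Because all three terms sit in a single objective, a sentence is never chosen for relevance alone: its contribution to coverage and to the links with other selected sentences is weighed at the same time.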
We use human judgements to evaluate the coherence of summaries. We compare
ROUGE scores and human judgements of different systems on the PLOS Medicine
dataset. Our approach performs considerably better than other systems on this
dataset. We also apply our approach to the standard DUC 2002 dataset to compare
the results with recent state-of-the-art systems. The results show that our
graph-based approach outperforms other systems on DUC 2002. In conclusion,
our approach is robust, i.e., it works on both scientific and news articles. Our
approach has the further advantage of being semi-supervised.