Automatic summarization of Malayalam documents using clause identification method
Text summarization is an active research area in natural language processing. The huge amount of information on the internet necessitates the development of automatic summarization systems. There are two types of summarization techniques: extractive and abstractive. Extractive summarization selects important sentences from the text and reproduces them in the summary exactly as they appear in the original document. Abstractive summarization produces a summary of the input text similar to one a human would write, which requires semantic analysis of the text. Little work has been done on abstractive summarization in Indian languages, especially Malayalam, for which only extractive methods have been proposed. In this paper, an abstractive summarization system for Malayalam documents using a clause identification method is proposed. As part of this research work, a POS tagger and a morphological analyzer for Malayalam words in the cricket domain are also developed. The clauses in the input sentences are identified using a modified clause identification algorithm. The clauses are then semantically analyzed to identify semantic triples: subject, object and predicate. The score of each clause is calculated using feature extraction, and the important clauses to be included in the summary are selected based on this score. Finally, an algorithm generates sentences from the semantic triples of the selected clauses; these sentences form the abstractive summary of the input documents.
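The score-select-generate steps described in this abstract can be sketched as follows. This is a minimal illustration only, not the authors' system: the `Clause` structure, the two scoring features (average word frequency and a position bonus), and the English sample clauses are all assumptions, and the clause identification and triple extraction stages are taken as already done.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """A clause already reduced to a semantic triple (hypothetical structure)."""
    index: int       # position of the clause in the document
    subject: str
    predicate: str
    obj: str

    def words(self) -> list[str]:
        return f"{self.subject} {self.predicate} {self.obj}".lower().split()


def score_clauses(clauses: list[Clause]) -> dict[int, float]:
    """Score each clause with two assumed features:
    average word frequency and a bonus for appearing early."""
    freq = Counter(w for c in clauses for w in c.words())
    scores: dict[int, float] = {}
    for c in clauses:
        words = c.words()
        avg_freq = sum(freq[w] for w in words) / len(words) if words else 0.0
        position_bonus = 1.0 / (1 + c.index)
        scores[c.index] = avg_freq + position_bonus
    return scores


def summarize(clauses: list[Clause], k: int) -> list[str]:
    """Keep the k highest-scoring clauses, restore document order,
    and generate one sentence per triple with a fixed template."""
    scores = score_clauses(clauses)
    top = sorted(clauses, key=lambda c: scores[c.index], reverse=True)[:k]
    top.sort(key=lambda c: c.index)
    return [f"{c.subject} {c.predicate} {c.obj}." for c in top]


# Illustrative English triples; the paper works on Malayalam cricket text.
sample = [
    Clause(0, "India", "won", "the match"),
    Clause(1, "The crowd", "cheered", "loudly"),
    Clause(2, "India", "chased", "the target"),
]
summary = summarize(sample, k=2)
```

With these sample triples, the two clauses about India score highest because their words recur across the document. A real system would replace the fixed template with a language-aware sentence generator.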
Towards Personalized and Human-in-the-Loop Document Summarization
The ubiquitous availability of computing devices and the widespread use of
the internet continuously generate large amounts of data. As a result, the
amount of available information on any given topic is far beyond humans'
capacity to process properly, causing what is known as information
overload. To cope efficiently with large amounts of information and generate
content with significant value to users, we need to identify, merge and
summarise information. Data summaries can help gather related information and
collect it into a shorter format that enables answering complicated questions,
gaining new insight and discovering conceptual boundaries.
This thesis focuses on three main challenges to alleviate information
overload using novel summarisation techniques. It further intends to facilitate
the analysis of documents to support personalised information extraction. This
thesis separates the research issues into four areas, covering (i) feature
engineering in document summarisation, (ii) traditional static and inflexible
summaries, (iii) traditional generic summarisation approaches, and (iv) the
need for reference summaries. We propose novel approaches to tackle these
challenges by: (i) enabling automatic intelligent feature engineering, (ii)
enabling flexible and interactive summarisation, and (iii) utilising intelligent
and personalised summarisation approaches. The experimental results demonstrate
the efficiency of the proposed approaches compared to other state-of-the-art
models. We further propose solutions to the information overload problem in
different domains through summarisation, covering network traffic data, health
data and business process data.
Comment: PhD thesis
GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages
Document summarization aims to create a precise and coherent summary of a
text document. Many deep learning summarization models are developed mainly for
English, often requiring a large training corpus and efficient pre-trained
language models and tools. However, applying English summarization models to
low-resource Indian languages is often limited by their rich morphological
variation, syntax, and semantic differences. In this paper, we propose
GAE-ISumm, an unsupervised Indic summarization model that extracts summaries
from text documents. In particular, GAE-ISumm uses a Graph Autoencoder (GAE)
to jointly learn text representations and a document summary.
We also provide TELSUM, a manually annotated Telugu summarization dataset, to
experiment with GAE-ISumm. Further, we experiment with most of the
publicly available Indian language summarization datasets to investigate the
effectiveness of GAE-ISumm on other Indian languages. Our experiments with
GAE-ISumm in seven languages yield the following observations: (i) it is
competitive with or better than state-of-the-art results on all datasets, (ii) it
reports benchmark results on TELSUM, and (iii) the inclusion of positional and
cluster information in the proposed model improves the quality of the
summaries.
Comment: 9 pages, 7 figures