3,111 research outputs found
Automatic Summarization
It has now been 50 years since the publication of Luhnâs seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain and genre specific summarization and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field
Argumentation Mining in User-Generated Web Discourse
The goal of argumentation mining, an evolving research field in computational
linguistics, is to design methods capable of analyzing people's argumentation.
In this article, we go beyond the state of the art in several ways. (i) We deal
with actual Web data and take up the challenges given by the variety of
registers, multiple domains, and unrestricted noisy user-generated Web
discourse. (ii) We bridge the gap between normative argumentation theories and
argumentation phenomena encountered in actual data by adapting an argumentation
model tested in an extensive annotation study. (iii) We create a new gold
standard corpus (90k tokens in 340 documents) and experiment with several
machine learning methods to identify argument components. We offer the data,
source codes, and annotation guidelines to the community under free licenses.
Our findings show that argumentation mining in user-generated Web discourse is
a feasible but challenging task.Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in
User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17
A Graph-Based Approach for the Summarization of Scientific Articles
Automatic text summarization is one of the eminent applications in the field of
Natural Language Processing. Text summarization is the process of generating
a gist from text documents. The task is to produce a summary which contains
important, diverse and coherent information, i.e., a summary should be self-contained.
The approaches for text summarization are conventionally extractive.
The extractive approaches select a subset of sentences from an input document
for a summary. In this thesis, we introduce a novel graph-based extractive summarization
approach.
With the progressive advancement of research in the various fields of science,
the summarization of scientific articles has become an essential requirement for
researchers. This is our prime motivation in selecting scientific articles as our
dataset. This newly formed dataset contains scientific articles from the PLOS
Medicine journal, which is a high impact journal in the field of biomedicine.
The summarization of scientific articles is a single-document summarization task.
It is a complex task due to various reasons, one of it being, the important information
in the scientific article is scattered all over it and another reason being, scientific
articles contain numerous redundant information. In our approach, we deal
with the three important factors of summarization: importance, non-redundancy
and coherence. To deal with these factors, we use graphs as they solve data sparsity
problems and are computationally less complex.
We employ bipartite graphical representation for the summarization task, exclusively.
We represent input documents through a bipartite graph that consists of
sentence nodes and entity nodes. This bipartite graph representation contains entity
transition information which is beneficial for selecting the relevant sentences
for a summary. We use a graph-based ranking algorithm to rank the sentences in
a document. The ranks are considered as relevance scores of the sentences which
are further used in our approach.
Scientific articles contain reasonable amount of redundant information, for example,
Introduction and Methodology sections contain similar information regarding
the motivation and approach. In our approach, we ensure that the summary contains
sentences which are non-redundant.
Though the summary should contain important and non-redundant information of
the input document, its sentences should be connected to one another such that
it becomes coherent, understandable and simple to read. If we do not ensure
that a summary is coherent, its sentences may not be properly connected. This
leads to an obscure summary. Until now, only few summarization approaches
take care of coherence. In our approach, we take care of coherence in two different
ways: by using the graph measure and by using the structural information. We
employ outdegree as the graph measure and coherence patterns for the structural
information, in our approach.
We use integer programming as an optimization technique, to select the best subset
of sentences for a summary. The sentences are selected on the basis of relevance,
diversity and coherence measure. The computation of these measures is
tightly integrated and taken care of simultaneously.
We use human judgements to evaluate coherence of summaries. We compare
ROUGE scores and human judgements of different systems on the PLOS Medicine
dataset. Our approach performs considerably better than other systems on this
dataset. Also, we apply our approach on the standard DUC 2002 dataset to compare
the results with the recent state-of-the-art systems. The results show that our
graph-based approach outperforms other systems on DUC 2002. In conclusion,
our approach is robust, i.e., it works on both scientific and news articles. Our
approach has the further advantage of being semi-supervised
NLP Driven Models for Automatically Generating Survey Articles for Scientific Topics.
This thesis presents new methods that use natural language processing (NLP) driven models for summarizing research in scientific fields. Given a topic query in the form of a text string, we present methods for finding research articles relevant to the topic as well as summarization algorithms that use lexical and discourse information present in the text of these articles to generate coherent and readable extractive summaries of past research on the topic. In addition to summarizing prior research, good survey articles should also forecast future trends. With this motivation, we present work on forecasting future impact of scientific publications using NLP driven features.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113407/1/rahuljha_1.pd
- âŠ