Extractive Summary as Discrete Latent Variables
In this paper, we compare various methods to compress a text using a neural
model. We find that extracting tokens as latent variables significantly
outperforms the state-of-the-art discrete latent variable models such as
VQ-VAE. Furthermore, we compare various extractive compression schemes. Two
methods perform equally well: simply choosing the tokens with the highest
tf-idf scores, and training a bidirectional language model similar to ELMo and
choosing the tokens with the highest loss. If we consider any subsequence of a
text to be a text in a broader sense, we conclude that language is a strong
compression code for itself. Our finding justifies the high quality of
generation achieved with hierarchical methods, as their latent variables are
nothing but a natural-language summary. We also conclude that there is a
hierarchy in language such that an entire text can be predicted much more
easily from a small number of keywords, which can easily be found by classical
methods such as tf-idf. We speculate that this extraction process may be
useful for unsupervised hierarchical text generation.
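The tf-idf extraction scheme the abstract describes can be sketched in a few lines. The corpus, the smoothed-idf formula, and whitespace tokenization below are illustrative choices, not the authors' exact setup:

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, k=5):
    """Score each token of `doc` by tf-idf against `corpus` and keep the
    k highest-scoring tokens as an extractive 'summary' of the text."""
    docs = [d.lower().split() for d in corpus]
    tokens = doc.lower().split()
    tf = Counter(tokens)
    scores = {}
    for tok, count in tf.items():
        df = sum(1 for d in docs if tok in d)           # document frequency
        idf = math.log((1 + len(docs)) / (1 + df)) + 1  # smoothed idf
        scores[tok] = (count / len(tokens)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices rose as the market rallied on strong earnings",
]
keywords = tfidf_keywords(corpus[2], corpus, k=3)
```

Function words such as "the", which appear in every document, receive a low idf and drop out, so the surviving tokens behave like keywords.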
ESSMArT Way to Manage User Requests
Quality and market acceptance of software products is strongly influenced by
responsiveness to user requests. Once a request is received from a customer,
decisions need to be made if the request should be escalated to the development
team. Once escalated, the ticket must be formulated as a development task and
be assigned to a developer. To make the process more efficient and reduce the
time between receiving and escalating the user request, we aim to automate the
complete user request management process. We propose a holistic method
called ESSMArT. The method performs text summarization, predicts ticket
escalation, creates the title and content of the ticket used by developers, and
assigns the ticket to an available developer. We internally evaluated the
method on 4,114 user tickets from Brightsquid and their secure health care
communication platform Secure-Mail. We also performed an external evaluation of
the usefulness of the approach. We found that supervised learning based on
context-specific data performs best for extractive summarization. For
predicting ticket escalation, a Random Forest trained on a combination of the
conversation and its extractive summary performs best, with the highest
precision (0.9) and recall (0.55). From the external evaluation we found that ESSMArT
provides suggestions that are 71% aligned with human ones. Applying the
prototype implementation to 315 user requests resulted in an average time
reduction of 9.2 minutes per request. ESSMArT makes ticket management faster
and reduces the effort required from human experts. ESSMArT can help Brightsquid
to (i) minimize the impact of staff turnover and (ii) shorten the cycle from an
issue being reported to its assignment to a developer for fixing.
Comment: This is a preprint of the submission to Empirical Software Engineering journal, 201
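The end-to-end flow described above (summarize, decide escalation, draft the ticket, assign a developer) can be sketched as a chain of stages. Every stage below is a deliberately naive stub; the function name, the keyword trigger, and first-available assignment are all illustrative stand-ins for ESSMArT's trained models:

```python
def essmart_like_pipeline(conversation, developers):
    """Chain the four stages of an ESSMArT-style request pipeline.
    Each stage is a stub: the paper instead uses trained models
    (e.g. a Random Forest over conversation features plus an
    extractive summary for the escalation decision)."""
    summary = conversation[0]              # stub summarizer: lead message
    escalate = "error" in summary.lower()  # stub escalation predictor
    if not escalate:
        return None                        # request handled without a ticket
    return {
        "title": summary[:60],             # stub ticket title
        "body": " ".join(conversation),    # stub ticket content
        "assignee": developers[0],         # stub: first available developer
    }
```

The point of the sketch is the chaining: each stage consumes the previous stage's output, which is what lets automation shorten the receive-to-escalate cycle.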
Semantic WordRank: Generating Finer Single-Document Summarizations
We present Semantic WordRank (SWR), an unsupervised method for generating an
extractive summary of a single document. Built on a weighted word graph with
semantic and co-occurrence edges, SWR scores sentences using an
article-structure-biased PageRank algorithm with a Softplus function
adjustment, and promotes topic diversity using spectral subtopic clustering
under the Word Mover's Distance metric. We evaluate SWR on the DUC-02 and
SummBank datasets and show that SWR produces better summaries than the
state-of-the-art algorithms over DUC-02 under common ROUGE measures. We then
show that, under the same measures over SummBank, SWR outperforms each of the
three human annotators (a.k.a. judges) and compares favorably with the combined
performance of all judges.
Comment: 12 pages, accepted by IDEAL201
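A stripped-down version of graph-based sentence scoring can illustrate the PageRank step: build a word co-occurrence graph, rank the words, and score each sentence by the mean rank of its words. The co-occurrence-only graph and plain damping below are simplifications; SWR additionally uses semantic edges, an article-structure bias, and a Softplus adjustment:

```python
from collections import defaultdict
from itertools import combinations

def pagerank(graph, damping=0.85, iters=50):
    """Plain power-iteration PageRank over an undirected word graph."""
    rank = {n: 1.0 / len(graph) for n in graph}
    for _ in range(iters):
        rank = {
            n: (1 - damping) / len(graph)
               + damping * sum(rank[m] / len(graph[m]) for m in graph[n])
            for n in graph
        }
    return rank

def summarize(sentences, top_n=1):
    """Score each sentence by the mean PageRank of its words on a
    sentence-level co-occurrence graph, then keep the top sentences."""
    tokenized = [s.lower().split() for s in sentences]
    graph = defaultdict(set)
    for toks in tokenized:
        for a, b in combinations(set(toks), 2):  # words co-occurring in a sentence
            graph[a].add(b)
            graph[b].add(a)
    rank = pagerank(dict(graph))
    order = sorted(
        range(len(sentences)),
        key=lambda i: -sum(rank.get(t, 0.0) for t in tokenized[i]) / len(tokenized[i]),
    )
    return [sentences[i] for i in sorted(order[:top_n])]  # document order

sentences = [
    "neural models summarize text",
    "neural models compress text well",
    "summarize text with neural models",
    "bananas taste sweet",
]
top = summarize(sentences, top_n=1)
```

Words shared across many sentences accumulate rank, so the sentence densest in such words wins, while the off-topic sentence scores at the uniform baseline.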
Extractive Summarization using Deep Learning
This paper proposes a text summarization approach for factual reports using a
deep learning model. This approach consists of three phases: feature
extraction, feature enhancement, and summary generation, which work together to
assimilate core information and generate a coherent, understandable summary. We
explore various features to improve the set of sentences selected for the
summary, and use a Restricted Boltzmann Machine to enhance and abstract
those features to improve resultant accuracy without losing any important
information. The sentences are scored based on those enhanced features and an
extractive summary is constructed. Experimentation carried out on several
articles demonstrates the effectiveness of the proposed approach. Source code
available at: https://github.com/vagisha-nidhi/TextSummarizer
Comment: Accepted to the 18th International Conference on Computational Linguistics and Intelligent Text Processing
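The feature-extraction and scoring phases can be sketched as below. The RBM enhancement phase is omitted, and the three features and their weights are illustrative stand-ins for the richer feature set in the paper:

```python
from collections import Counter

def sentence_features(sent, position, n_sents, term_weight):
    """Three toy features per sentence: position, length, and term salience."""
    toks = sent.lower().split()
    return {
        "position": 1.0 - position / n_sents,   # earlier sentences favored
        "length": min(len(toks) / 20.0, 1.0),   # normalized sentence length
        "terms": sum(term_weight.get(t, 0.0) for t in toks) / len(toks),
    }

def extractive_summary(doc_sents, top_n=1, weights=None):
    weights = weights or {"position": 0.3, "length": 0.2, "terms": 0.5}
    counts = Counter(t for s in doc_sents for t in s.lower().split())
    total = sum(counts.values())
    term_weight = {t: c / total for t, c in counts.items()}
    scored = []
    for i, sent in enumerate(doc_sents):
        feats = sentence_features(sent, i, len(doc_sents), term_weight)
        scored.append((sum(weights[k] * v for k, v in feats.items()), i))
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:top_n])
    return [doc_sents[i] for i in keep]  # selected sentences in document order

doc = [
    "summarization scores each sentence with several features",
    "frequent terms raise a sentence score",
    "short filler text",
]
best = extractive_summary(doc, top_n=1)
```

In the paper, the raw feature vectors would pass through the RBM before scoring; here the weighted sum stands in for that enhanced representation.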
STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings
This paper introduces STRASS: Summarization by TRAnsformation Selection and
Scoring. It is an extractive text summarization method which leverages the
semantic information in existing sentence embedding spaces. Our method creates
an extractive summary by selecting the sentences with the closest embeddings to
the document embedding. The model learns a transformation of the document
embedding to maximize the similarity between the extractive summary and the
ground-truth summary. As the transformation is only composed of a dense layer,
training can be done on a CPU and is therefore inexpensive. Moreover, inference
time is short and linear in the number of sentences. As a second
contribution, we introduce the French CASS dataset, composed of judgments from
the French Court of cassation and their corresponding summaries. On this
dataset, our results show that our method performs similarly to
state-of-the-art extractive methods while offering fast training and inference times.
Comment: To appear in the 2019 ACL Student Research Workshop
Guiding Extractive Summarization with Question-Answering Rewards
Highlighting while reading is a natural behavior for people to track salient
content of a document. It would be desirable to teach an extractive summarizer
to do the same. However, a major obstacle to the development of a supervised
summarizer is the lack of ground truth. Manual annotation of extraction units
is cost-prohibitive, whereas acquiring labels by automatically aligning human
abstracts and source documents can yield inferior results. In this paper we
describe a novel framework to guide a supervised, extractive summarization
system with question-answering rewards. We argue that quality summaries should
serve as a document surrogate to answer important questions, and such
question-answer pairs can be conveniently obtained from human abstracts. The
system learns to promote summaries that are informative, fluent, and perform
competitively on question-answering. Our results compare favorably with those
reported by strong summarization baselines as evaluated by automatic metrics
and human assessors.
Comment: NAACL 201
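As a rough illustration of the reward idea, a candidate summary can be rewarded by the fraction of question-answer pairs (derived from the human abstract) whose answer it still contains. The string-containment check and the QA pairs below are invented for illustration; the paper trains with a learned question-answering component, not exact matching:

```python
def qa_reward(summary, qa_pairs):
    """Reward a candidate summary by the fraction of question-answer pairs
    whose answer string it still contains -- a crude proxy for 'the summary
    can serve as a document surrogate for answering questions'."""
    if not qa_pairs:
        return 0.0
    hits = sum(1 for _q, answer in qa_pairs if answer.lower() in summary.lower())
    return hits / len(qa_pairs)

# Hypothetical QA pairs, as might be derived from a human abstract.
qa_pairs = [
    ("Who proposed the framework?", "the authors"),
    ("What supervises the summarizer?", "question-answering rewards"),
]
full = qa_reward("The authors train a summarizer with question-answering rewards.", qa_pairs)
partial = qa_reward("The summarizer is trained with question-answering rewards.", qa_pairs)
```

A reward of this shape can supervise an extractive policy without any manually labeled extraction units, which is exactly the gap the framework targets.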
Question-Driven Summarization of Answers to Consumer Health Questions
Automatic summarization of natural language is a widely studied area in
computer science, one that is broadly applicable to anyone who routinely needs
to understand large quantities of information. For example, in the medical
domain, recent developments in deep learning approaches to automatic
summarization have the potential to make health information more easily
accessible to patients and consumers. However, to evaluate the quality of
automatically generated summaries of health information, gold-standard,
human-generated summaries are required. Using answers provided by the National
Library of Medicine's consumer health question answering system, we present the
MEDIQA Answer Summarization dataset, the first summarization collection
containing question-driven summaries of answers to consumer health questions.
This dataset can be used to evaluate single or multi-document summaries
generated by algorithms using extractive or abstractive approaches. In order to
benchmark the dataset, we include results of baseline and state-of-the-art deep
learning summarization models, demonstrating that this dataset can be used to
effectively evaluate question-driven machine-generated summaries and promote
further machine learning research in medical question answering.
What comes next? Extractive summarization by next-sentence prediction
Existing approaches to automatic summarization assume that a length limit for
the summary is given, and view content selection as an optimization problem to
maximize informativeness and minimize redundancy within this budget. This
framework ignores the fact that human-written summaries have rich internal
structure which can be exploited to train a summarization system. We present
NEXTSUM, a novel approach to summarization based on a model that predicts the
next sentence to include in the summary using not only the source article, but
also the summary produced so far. We show that such a model successfully
captures summary-specific discourse moves, and leads to better content
selection performance, in addition to automatically predicting how long the
target summary should be. We perform experiments on the New York Times
Annotated Corpus of summaries, where NEXTSUM outperforms lead and content-model
summarization baselines by significant margins. We also show that the lengths
of the summaries produced by our system correlate with the lengths of the
human-written gold standards.
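The next-sentence-prediction loop can be sketched with a stand-in scorer: at each step, score every remaining sentence given the summary built so far, and stop when nothing scores above a threshold, so summary length falls out of the model rather than a fixed budget. The frequency-based salience, redundancy penalty, and weights below are invented for illustration; NEXTSUM learns this decision from data:

```python
from collections import Counter

def nextsum_like(source_sents, stop_threshold=1.0):
    """Greedily predict the next summary sentence until the scorer says stop."""
    freq = Counter(t for s in source_sents for t in s.lower().split())
    summary, remaining, chosen = [], list(source_sents), set()
    while remaining:
        def score(sent):
            toks = set(sent.lower().split())
            salience = sum(freq[t] for t in toks) / len(toks)
            redundancy = len(toks & chosen) / len(toks)
            return salience - 2.0 * redundancy   # invented trade-off weights
        best = max(remaining, key=score)
        if summary and score(best) < stop_threshold:
            break                                # the model 'predicts' the end
        summary.append(best)
        chosen |= set(best.lower().split())
        remaining.remove(best)
    return summary

sents = [
    "the league announced the final schedule",
    "the final schedule was announced by the league",
    "ticket sales open next week",
]
result = nextsum_like(sents)
```

Because the scorer conditions on the summary produced so far, the near-duplicate second sentence is skipped and the loop terminates on its own, mimicking length prediction.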
Towards Supervised Extractive Text Summarization via RNN-based Sequence Classification
This article briefly explains our submitted approach to the DocEng'19
competition on extractive summarization. We implemented a recurrent neural
network based model that learns to classify whether an article's sentence
belongs to the corresponding extractive summary or not. We bypass the lack of
large annotated news corpora for extractive summarization by generating
extractive summaries from abstractive ones, which are available from the CNN
corpus.
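One simple way to manufacture such labels, sketched below, is to mark an article sentence as belonging to the extractive summary when its word overlap with the abstractive summary crosses a threshold. The 0.5 threshold and the set-overlap measure are illustrative, since the submission does not specify its exact alignment procedure:

```python
def extractive_labels(article_sents, abstract, threshold=0.5):
    """Label each article sentence 1/0 as belonging to the extractive
    summary, based on its word overlap with the abstractive summary."""
    abs_toks = set(abstract.lower().split())
    labels = []
    for sent in article_sents:
        toks = set(sent.lower().split())
        overlap = len(toks & abs_toks) / len(toks)
        labels.append(1 if overlap >= threshold else 0)
    return labels

article = [
    "the team won the championship game",
    "fans celebrated downtown all night",
    "advertising revenue was unaffected",
]
labels = extractive_labels(article, "the team won the championship")
```

The resulting 1/0 sequence is exactly the per-sentence classification target an RNN-based sequence classifier can be trained on.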
USUM: Update Summary Generation System
A huge amount of information is present on the World Wide Web, and a large
amount is being added to it frequently. A query-specific summary of multiple
documents is very helpful to the user in this context. Currently, few systems
have been proposed for query-specific, extractive multi-document summarization.
If a summary is available for a set of documents on a given query and a new
document is added to the corpus, generating an updated summary from scratch
is time-consuming, and often it is not practical or even possible. In this paper
we propose a solution to this problem. It is especially useful in a scenario
where the source documents are not accessible. We cleverly embed the sentences
of the current summary into the new document and then perform query-specific
summary generation on that document. Our experimental results show that the
performance of the proposed approach is good in terms of both quality and
efficiency.
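The update step can be sketched as pooling the sentences of the existing summary with the new document's sentences and re-running query-focused selection over the pool, so the original source documents are never needed. The token-overlap relevance score below is a stand-in for whichever query-specific scorer the system actually uses:

```python
def update_summary(current_summary, new_doc_sents, query, top_n=2):
    """Pool the existing summary's sentences with the new document's
    sentences and re-run query-focused selection over that pool -- no
    access to the original source documents is required."""
    pool = list(current_summary) + list(new_doc_sents)
    query_toks = set(query.lower().split())

    def relevance(sent):
        toks = set(sent.lower().split())
        return len(toks & query_toks) / len(query_toks)

    return sorted(pool, key=relevance, reverse=True)[:top_n]

current = ["solar capacity doubled last year"]
new_doc = ["wind capacity also grew", "celebrity gossip dominated headlines"]
updated = update_summary(current, new_doc, "solar wind capacity growth")
```

Since the pool is only the old summary plus one new document, the update costs far less than re-summarizing the whole corpus, which is where the claimed efficiency comes from.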