450 research outputs found

    Extractive Summary as Discrete Latent Variables

    Full text link
    In this paper, we compare various methods to compress a text using a neural model. We find that extracting tokens as latent variables significantly outperforms the state-of-the-art discrete latent variable models such as VQ-VAE. Furthermore, we compare various extractive compression schemes. There are two best-performing methods that perform equally. One method is to simply choose the tokens with the highest tf-idf scores. Another is to train a bidirectional language model similar to ELMo and choose the tokens with the highest loss. If we consider any subsequence of a text to be a text in a broader sense, we conclude that language is a strong compression code of itself. Our finding justifies the high quality of generation achieved with hierarchical method, as their latent variables are nothing but natural language summary. We also conclude that there is a hierarchy in language such that an entire text can be predicted much more easily based on a sequence of a small number of keywords, which can be easily found by classical methods as tf-idf. We speculate that this extraction process may be useful for unsupervised hierarchical text generation

    ESSMArT Way to Manage User Requests

    Full text link
    Quality and market acceptance of software products is strongly influenced by responsiveness to user requests. Once a request is received from a customer, decisions need to be made if the request should be escalated to the development team. Once escalated, the ticket must be formulated as a development task and be assigned to a developer. To make the process more efficient and reduce the time between receiving and escalating the user request, we aim to automate of the complete user request management process. We propose a holistic method called ESSMArT. The methods performs text summarization, predicts ticket escalation, creates the title and content of the ticket used by developers, and assigns the ticket to an available developer. We internally evaluated the method by 4,114 user tickets from Brightsquid and their secure health care communication plat- form Secure-Mail. We also perform an external evaluation on the usefulness of the approach. We found that supervised learning based on context specific data performs best for extractive summarization. For predicting escalation of tickets, Random Forest trained on a combination of conversation and extractive summarization is best with highest precision (of 0.9) and recall (of 0.55). From external evaluation we found that ESSMArT provides suggestions that are 71% aligned with human ones. Applying the prototype implementation to 315 user requests resulted in an average time reduction of 9.2 minutes per request. ESSMArT helps to make ticket management faster and with reduced effort for human experts. ESSMArT can help Brightsquid to (i) minimize the impact of staff turnover and (ii) shorten the cycle from an issue being reported to an assignment to a developer to fix it.Comment: This is a preprint of the submission to Empirical Software Engineering journal, 201

    Semantic WordRank: Generating Finer Single-Document Summarizations

    Full text link
    We present Semantic WordRank (SWR), an unsupervised method for generating an extractive summary of a single document. Built on a weighted word graph with semantic and co-occurrence edges, SWR scores sentences using an article-structure-biased PageRank algorithm with a Softplus function adjustment, and promotes topic diversity using spectral subtopic clustering under the Word-Movers-Distance metric. We evaluate SWR on the DUC-02 and SummBank datasets and show that SWR produces better summaries than the state-of-the-art algorithms over DUC-02 under common ROUGE measures. We then show that, under the same measures over SummBank, SWR outperforms each of the three human annotators (aka. judges) and compares favorably with the combined performance of all judges.Comment: 12 pages, accepted by IDEAL201

    Extractive Summarization using Deep Learning

    Full text link
    This paper proposes a text summarization approach for factual reports using a deep learning model. This approach consists of three phases: feature extraction, feature enhancement, and summary generation, which work together to assimilate core information and generate a coherent, understandable summary. We are exploring various features to improve the set of sentences selected for the summary, and are using a Restricted Boltzmann Machine to enhance and abstract those features to improve resultant accuracy without losing any important information. The sentences are scored based on those enhanced features and an extractive summary is constructed. Experimentation carried out on several articles demonstrates the effectiveness of the proposed approach. Source code available at: https://github.com/vagisha-nidhi/TextSummarizerComment: Accepted to 18th International Conference on Computational Linguistics and Intelligent Text Processin

    STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings

    Full text link
    This paper introduces STRASS: Summarization by TRAnsformation Selection and Scoring. It is an extractive text summarization method which leverages the semantic information in existing sentence embedding spaces. Our method creates an extractive summary by selecting the sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to minimize the similarity between the extractive summary and the ground truth summary. As the transformation is only composed of a dense layer, the training can be done on CPU, therefore, inexpensive. Moreover, inference time is short and linear according to the number of sentences. As a second contribution, we introduce the French CASS dataset, composed of judgments from the French Court of cassation and their corresponding summaries. On this dataset, our results show that our method performs similarly to the state of the art extractive methods with effective training and inferring time.Comment: To appear in 2019 ACL Student Research Worksho

    Guiding Extractive Summarization with Question-Answering Rewards

    Full text link
    Highlighting while reading is a natural behavior for people to track salient content of a document. It would be desirable to teach an extractive summarizer to do the same. However, a major obstacle to the development of a supervised summarizer is the lack of ground-truth. Manual annotation of extraction units is cost-prohibitive, whereas acquiring labels by automatically aligning human abstracts and source documents can yield inferior results. In this paper we describe a novel framework to guide a supervised, extractive summarization system with question-answering rewards. We argue that quality summaries should serve as a document surrogate to answer important questions, and such question-answer pairs can be conveniently obtained from human abstracts. The system learns to promote summaries that are informative, fluent, and perform competitively on question-answering. Our results compare favorably with those reported by strong summarization baselines as evaluated by automatic metrics and human assessors.Comment: NAACL 201

    Question-Driven Summarization of Answers to Consumer Health Questions

    Full text link
    Automatic summarization of natural language is a widely studied area in computer science, one that is broadly applicable to anyone who routinely needs to understand large quantities of information. For example, in the medical domain, recent developments in deep learning approaches to automatic summarization have the potential to make health information more easily accessible to patients and consumers. However, to evaluate the quality of automatically generated summaries of health information, gold-standard, human generated summaries are required. Using answers provided by the National Library of Medicine's consumer health question answering system, we present the MEDIQA Answer Summarization dataset, the first summarization collection containing question-driven summaries of answers to consumer health questions. This dataset can be used to evaluate single or multi-document summaries generated by algorithms using extractive or abstractive approaches. In order to benchmark the dataset, we include results of baseline and state-of-the-art deep learning summarization models, demonstrating that this dataset can be used to effectively evaluate question-driven machine-generated summaries and promote further machine learning research in medical question answering

    What comes next? Extractive summarization by next-sentence prediction

    Full text link
    Existing approaches to automatic summarization assume that a length limit for the summary is given, and view content selection as an optimization problem to maximize informativeness and minimize redundancy within this budget. This framework ignores the fact that human-written summaries have rich internal structure which can be exploited to train a summarization system. We present NEXTSUM, a novel approach to summarization based on a model that predicts the next sentence to include in the summary using not only the source article, but also the summary produced so far. We show that such a model successfully captures summary-specific discourse moves, and leads to better content selection performance, in addition to automatically predicting how long the target summary should be. We perform experiments on the New York Times Annotated Corpus of summaries, where NEXTSUM outperforms lead and content-model summarization baselines by significant margins. We also show that the lengths of summaries produced by our system correlates with the lengths of the human-written gold standards

    Towards Supervised Extractive Text Summarization via RNN-based Sequence Classification

    Full text link
    This article briefly explains our submitted approach to the DocEng'19 competition on extractive summarization. We implemented a recurrent neural network based model that learns to classify whether an article's sentence belongs to the corresponding extractive summary or not. We bypass the lack of large annotated news corpora for extractive summarization by generating extractive summaries from abstractive ones, which are available from the CNN corpus

    USUM: Update Summary Generation System

    Full text link
    Huge amount of information is present in the World Wide Web and a large amount is being added to it frequently. A query-specific summary of multiple documents is very helpful to the user in this context. Currently, few systems have been proposed for query-specific, extractive multi-document summarization. If a summary is available for a set of documents on a given query and if a new document is added to the corpus, generating an updated summary from the scratch is time consuming and many a times it is not practical/possible. In this paper we propose a solution to this problem. This is especially useful in a scenario where the source documents are not accessible. We cleverly embed the sentences of the current summary into the new document and then perform query-specific summary generation on that document. Our experimental results show that the performance of the proposed approach is good in terms of both quality and efficiency
    corecore