3,290 research outputs found

    A Novel ILP Framework for Summarizing Content with High Lexical Variety

    Full text link
    Summarizing content contributed by individuals can be challenging, because people make different lexical choices even when describing the same events. However, there remains a significant need to summarize such content. Examples include the student responses to post-class reflective questions, product reviews, and news articles published by different news agencies related to the same events. High lexical diversity of these documents hinders the system's ability to effectively identify salient content and reduce summary redundancy. In this paper, we overcome this issue by introducing an integer linear programming-based summarization framework. It incorporates a low-rank approximation to the sentence-word co-occurrence matrix to intrinsically group semantically-similar lexical items. We conduct extensive experiments on datasets of student responses, product reviews, and news documents. Our approach compares favorably to a number of extractive baselines as well as a neural abstractive summarization system. The paper finally sheds light on when and why the proposed framework is effective at summarizing content with high lexical variety.Comment: Accepted for publication in the journal of Natural Language Engineering, 201

    Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps

    Full text link
    Concept maps can be used to concisely represent important information and bring structure into large document collections. Therefore, we study a variant of multi-document summarization that produces summaries in the form of concept maps. However, suitable evaluation datasets for this task are currently missing. To close this gap, we present a newly created corpus of concept maps that summarize heterogeneous collections of web documents on educational topics. It was created using a novel crowdsourcing approach that allows us to efficiently determine important elements in large document collections. We release the corpus along with a baseline system and proposed evaluation protocol to enable further research on this variant of summarization.Comment: Published at EMNLP 201

    Some Reflections on the Task of Content Determination in the Context of Multi-Document Summarization of Evolving Events

    Full text link
    Despite its importance, the task of summarizing evolving events has received small attention by researchers in the field of multi-document summariztion. In a previous paper (Afantenos et al. 2007) we have presented a methodology for the automatic summarization of documents, emitted by multiple sources, which describe the evolution of an event. At the heart of this methodology lies the identification of similarities and differences between the various documents, in two axes: the synchronic and the diachronic. This is achieved by the introduction of the notion of Synchronic and Diachronic Relations. Those relations connect the messages that are found in the documents, resulting thus in a graph which we call grid. Although the creation of the grid completes the Document Planning phase of a typical NLG architecture, it can be the case that the number of messages contained in a grid is very large, exceeding thus the required compression rate. In this paper we provide some initial thoughts on a probabilistic model which can be applied at the Content Determination stage, and which tries to alleviate this problem.Comment: 5 pages, 2 figure

    Using the Annotated Bibliography as a Resource for Indicative Summarization

    Get PDF
    We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been well-covered by other summary corpora, and motivate why they constitute an important form to study for information retrieval. We detail our methodology for collecting the corpus, and overview our document feature markup that we introduced to facilitate summary analysis. We present the characteristics of the corpus, methods of collection, and show its use in finding the distribution of types of information included in indicative summaries and their relative ordering within the summaries.Comment: 8 pages, 3 figure

    Semi-Supervised Learning for Neural Keyphrase Generation

    Full text link
    We study the problem of generating keyphrases that summarize the key points for a given document. While sequence-to-sequence (seq2seq) models have achieved remarkable performance on this task (Meng et al., 2017), model training often relies on large amounts of labeled data, which is only applicable to resource-rich domains. In this paper, we propose semi-supervised keyphrase generation methods by leveraging both labeled data and large-scale unlabeled samples for learning. Two strategies are proposed. First, unlabeled documents are first tagged with synthetic keyphrases obtained from unsupervised keyphrase extraction methods or a selflearning algorithm, and then combined with labeled samples for training. Furthermore, we investigate a multi-task learning framework to jointly learn to generate keyphrases as well as the titles of the articles. Experimental results show that our semi-supervised learning-based methods outperform a state-of-the-art model trained with labeled data only.Comment: To appear in EMNLP 2018 (12 pages, 7 figures, 6 tables
    • …
    corecore