6,463 research outputs found

    Multi-Document Summarization via Discriminative Summary Reranking

    Full text link
    Existing multi-document summarization systems usually rely on a specific summarization model (i.e., a summarization method with a specific parameter setting) to extract summaries for different document sets with different topics. However, according to our quantitative analysis, none of the existing summarization models can always produce high-quality summaries for different document sets, and even a summarization model with good overall performance may produce low-quality summaries for some document sets. On the contrary, a baseline summarization model may produce high-quality summaries for some document sets. Based on the above observations, we treat the summaries produced by different summarization models as candidate summaries, and then explore discriminative reranking techniques to identify high-quality summaries from the candidates for difference document sets. We propose to extract a set of candidate summaries for each document set based on an ILP framework, and then leverage Ranking SVM for summary reranking. Various useful features have been developed for the reranking process, including word-level features, sentence-level features and summary-level features. Evaluation results on the benchmark DUC datasets validate the efficacy and robustness of our proposed approach

    LexRank: Graph-based Lexical Centrality as Salience in Text Summarization

    Full text link
    We introduce a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank ranked in first place in more than one task in the recent DUC 2004 evaluation. In this paper we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most of the cases. Furthermore, the LexRank with threshold method outperforms the other degree-based techniques including continuous LexRank. We also show that our approach is quite insensitive to the noise in the data that may result from an imperfect topical clustering of documents

    Comprehensive Review of Opinion Summarization

    Get PDF
    The abundance of opinions on the web has kindled the study of opinion summarization over the last few years. People have introduced various techniques and paradigms to solving this special task. This survey attempts to systematically investigate the different techniques and approaches used in opinion summarization. We provide a multi-perspective classification of the approaches used and highlight some of the key weaknesses of these approaches. This survey also covers evaluation techniques and data sets used in studying the opinion summarization problem. Finally, we provide insights into some of the challenges that are left to be addressed as this will help set the trend for future research in this area.unpublishednot peer reviewe

    Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data

    Full text link
    Usage of online textual media is steadily increasing. Daily, more and more news stories, blog posts and scientific articles are added to the online volumes. These are all freely accessible and have been employed extensively in multiple research areas, e.g. automatic text summarization, information retrieval, information extraction, etc. Meanwhile, online debate forums have recently become popular, but have remained largely unexplored. For this reason, there are no sufficient resources of annotated debate data available for conducting research in this genre. In this paper, we collected and annotated debate data for an automatic summarization task. Similar to extractive gold standard summary generation our data contains sentences worthy to include into a summary. Five human annotators performed this task. Inter-annotator agreement, based on semantic similarity, is 36% for Cohen's kappa and 48% for Krippendorff's alpha. Moreover, we also implement an extractive summarization system for online debates and discuss prominent features for the task of summarizing online debate data automatically.Comment: accepted and presented at the CICLING 2017 - 18th International Conference on Intelligent Text Processing and Computational Linguistic
    • …
    corecore