767 research outputs found

    Using tweets to help sentence compression for news highlights generation

    Get PDF
    We explore using relevant tweets of a given news article to help sentence com-pression for generating compressive news highlights. We extend an unsupervised dependency-tree based sentence compres-sion approach by incorporating tweet in-formation to weight the tree edge in terms of informativeness and syntactic impor-tance. The experimental results on a pub-lic corpus that contains both news arti-cles and relevant tweets show that our pro-posed tweets guided sentence compres-sion method can improve the summariza-tion performance significantly compared to the baseline generic sentence compres-sion method.

    TGSum: Build Tweet Guided Multi-Document Summarization Dataset

    Full text link
    The development of summarization research has been significantly hampered by the costly acquisition of reference summaries. This paper proposes an effective way to automatically collect large scales of news-related multi-document summaries with reference to social media's reactions. We utilize two types of social labels in tweets, i.e., hashtags and hyper-links. Hashtags are used to cluster documents into different topic sets. Also, a tweet with a hyper-link often highlights certain key points of the corresponding document. We synthesize a linked document cluster to form a reference summary which can cover most key points. To this aim, we adopt the ROUGE metrics to measure the coverage ratio, and develop an Integer Linear Programming solution to discover the sentence set reaching the upper bound of ROUGE. Since we allow summary sentences to be selected from both documents and high-quality tweets, the generated reference summaries could be abstractive. Both informativeness and readability of the collected summaries are verified by manual judgment. In addition, we train a Support Vector Regression summarizer on DUC generic multi-document summarization benchmarks. With the collected data as extra training resource, the performance of the summarizer improves a lot on all the test sets. We release this dataset for further research.Comment: 7 pages, 1 figure in AAAI 201

    Utilizing microblogs for improving automatic news high-lights extraction

    Get PDF

    Template-based Abstractive Microblog Opinion Summarisation

    Full text link
    We introduce the task of microblog opinion summarisation (MOS) and share a dataset of 3100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarisation dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarising news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favours extractive summarisation models. To showcase the dataset's utility and challenges, we benchmark a range of abstractive and extractive state-of-the-art summarisation models and achieve good performance, with the former outperforming the latter. We also show that fine-tuning is necessary to improve performance and investigate the benefits of using different sample sizes.Comment: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2022. Pre-MIT Press publication versio

    GLIMPSED:Improving natural language processing with gaze data

    Get PDF

    WHISK: Web Hosted Information into Summarized Knowledge

    Get PDF
    Today’s online content increases at an alarmingly rate which exceeds users’ ability to consume such content. Modern search techniques allow users to enter keyword queries to find content they wish to see. However, such techniques break down when users freely browse the internet without knowing exactly what they want. Users may have to invest an unnecessarily long time reading content to see if they are interested in it. Automatic text summarization helps relieve this problem by creating synopses that significantly reduce the text while preserving the key points. Steffen Lyngbaek created the SPORK summarization pipeline to solve the content overload in Reddit comment threads. Lyngbaek adapted the Opinosis graph model for extractive summarization and combined it with agglomerative hierarchical clustering and the Smith-Waterman algorithm to perform multi-document summarization on Reddit comments.This thesis presents WHISK as a pipeline for general multi-document text summarization based on SPORK. A generic data model in WHISK allows creating new drivers for different platforms to work with the pipeline. In addition to the existing Opinosis graph model adapted in SPORK, WHISK introduces two simplified graph models for the pipeline. The simplified models removes unnecessary restrictions inherited from Opinosis graph’s abstractive summarization origins. Performance measurements and a study with Digital Democracy compare the two new graph models against the Opinosis graph model. Additionally, the study evaluates WHISK’s ability to generate pull quotes from political discussions as summaries
    corecore