
    TEXT SUMMARIZATION UNDER LOW SUPERVISION

    Text summarization aims to create a concise and fluent summary that captures the most salient information from a given document or set of documents. However, most summarization methods require large-scale document-summary pairs as training data, which are laborious to acquire for many domains. This calls for the development of summarization algorithms that can work in a low-supervision setting, which remains a challenging problem. In this dissertation, we address the problem from three perspectives. We start by improving summarization methods using external information. Specifically, we focus on the task of product review summarization. We utilize the feature descriptions of the product as external information to better guide the model to identify aspect-related information from reviews and create corresponding summaries. Besides the use of external information, we also explore the use of external models, and propose a method that enables knowledge transfer from single-document summarization (SDS) to multi-document summarization (MDS). Our approach involves an efficient and effective technique of multiple document reordering, which facilitates both unsupervised and supervised MDS. In the third part, we present novel approaches to automatically construct high-quality paired training data for summarization. In particular, we introduce two large-scale datasets: Diana for dialogue summarization and NarraSum for narrative summarization. We experimentally demonstrate that pre-training on these datasets significantly improves summarization quality. Finally, given that the primary objective of summarization is to help users better grasp key information and understand the document, we investigate the potential of utilizing automatically constructed summarization datasets to enhance reading comprehension in a zero-shot manner. We propose Parrot, a zero-shot approach that leverages document-summary pairs for reading comprehension.
Our results demonstrate that Parrot outperforms previous zero-shot approaches and achieves performance comparable to fully supervised models, showcasing how text summarization can facilitate reading comprehension with minimal supervision.

    Doctor of Philosophy

    Statistical Sentence Extraction for Information Distillation

    Information distillation aims to extract the most useful pieces of information related to a given query from massive, possibly multilingual, audio and textual document sources. One critical component in a distillation engine is detecting the sentences to be extracted from each relevant document. In this paper, we present a statistical sentence extraction approach for distillation. Basically, we frame this task as a classification problem, where each candidate sentence in the documents is classified as relevant to the query or not. These documents may be in textual or audio format and in a number of languages. For audio documents, we use both manual and automatic transcriptions; for non-English documents, we use automatic translations. In this work, we use AdaBoost, a discriminative classification method, with both lexical and semantic features. The results indicate an 11%-13% relative improvement over a baseline keyword-spotting-based approach. We also show the robustness of our method on the audio subset of the document sources using manual and automatic transcriptions. Index Terms — information distillation, information extraction, language understanding, speech processing, natural language processing
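The core idea of the second abstract — treating query-relevant sentence extraction as binary classification with AdaBoost — can be sketched as follows. This is a minimal illustrative example, not the paper's actual system: the feature set (simple lexical-overlap statistics) and the toy training data are assumptions made here for demonstration; the paper's features and corpus are not reproduced.

```python
# Sketch: query-relevant sentence extraction framed as binary classification.
# Features and data below are illustrative assumptions, not the paper's.
from sklearn.ensemble import AdaBoostClassifier

def lexical_features(query, sentence):
    """Simple lexical-overlap features between a query and a candidate sentence."""
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    overlap = q & s
    return [
        len(overlap),                   # number of query terms in the sentence
        len(overlap) / max(len(q), 1),  # fraction of query terms covered
        len(s),                         # sentence length in tokens
    ]

# Toy labeled data: (query, candidate sentence, is_relevant)
train = [
    ("flood damage", "The flood caused severe damage to the bridge.", 1),
    ("flood damage", "The weather was pleasant all week.", 0),
    ("flood damage", "Damage estimates from the flood reached millions.", 1),
    ("flood damage", "The committee met on Tuesday.", 0),
]
X = [lexical_features(q, s) for q, s, _ in train]
y = [label for _, _, label in train]

# AdaBoost over decision stumps, as in the discriminative setup described.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Score a new candidate sentence from a relevant document.
candidate = "Flood waters damaged dozens of homes."
pred = clf.predict([lexical_features("flood damage", candidate)])[0]
```

In the actual system the classifier would also consume semantic features and handle transcribed or translated input, but the framing — one feature vector and one relevant/not-relevant decision per candidate sentence — is the same.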