361 research outputs found
Recommended from our members
Adapting Automatic Summarization to New Sources of Information
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs in within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages – documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create and train on large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disdisfluent translations of non-English documents, and it generalizes across languages
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages
Lack of encyclopedic text contributors, especially on Wikipedia, makes
automated text generation for low resource (LR) languages a critical problem.
Existing work on Wikipedia text generation has focused on English only where
English reference articles are summarized to generate English Wikipedia pages.
But, for low-resource languages, the scarcity of reference articles makes
monolingual summarization ineffective in solving this problem. Hence, in this
work, we propose XWikiGen, which is the task of cross-lingual multi-document
summarization of text from multiple reference articles, written in various
languages, to generate Wikipedia-style text. Accordingly, we contribute a
benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five
domains and eight languages. We harness this dataset to train a two-stage
system where the input is a set of citations and a section title and the output
is a section-specific LR summary. The proposed system is based on a novel idea
of neural unsupervised extractive summarization to coarsely identify salient
information followed by a neural abstractive model to generate the
section-specific text. Extensive experiments show that multi-domain training is
better than the multi-lingual setup on average
COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres
In this paper, we present a Text Summarisation tool, compendium, capable of generating the most common types of summaries. Regarding the input, single- and multi-document summaries can be produced; as the output, the summaries can be extractive or abstractive-oriented; and finally, concerning their purpose, the summaries can be generic, query-focused, or sentiment-based. The proposed architecture for compendium is divided in various stages, making a distinction between core and additional stages. The former constitute the backbone of the tool and are common for the generation of any type of summary, whereas the latter are used for enhancing the capabilities of the tool. The main contributions of compendium with respect to the state-of-the-art summarisation systems are that (i) it specifically deals with the problem of redundancy, by means of textual entailment; (ii) it combines statistical and cognitive-based techniques for determining relevant content; and (iii) it proposes an abstractive-oriented approach for facing the challenge of abstractive summarisation. The evaluation performed in different domains and textual genres, comprising traditional texts, as well as texts extracted from the Web 2.0, shows that compendium is very competitive and appropriate to be used as a tool for generating summaries.This research has been supported by the project “Desarrollo de TĂ©cnicas Inteligentes e Interactivas de MinerĂa de Textos” (PROMETEO/2009/119) and the project reference ACOMP/2011/001 from the Valencian Government, as well as by the Spanish Government (grant no. TIN2009-13391-C04-01)
- …