A Cascaded Broadcast News Highlighter
This paper presents a fully automatic news skimming system which takes a broadcast news audio stream and provides the user with a segmented, structured and highlighted transcript. The system comprises three cascading stages: converting the audio stream to text using an automatic speech recogniser, segmenting the text into utterances and stories, and finally determining which utterances should be highlighted using a saliency score. Each stage must operate on the erroneous output of the previous stage, an effect which is naturally amplified as the data progresses through the processing stages. We present a large corpus of transcribed broadcast news data that enables us to investigate to what degree information worth highlighting survives this cascade of processes. Both extrinsic and intrinsic experimental results indicate that mistakes in story boundary detection have a strong impact on the quality of the highlights, whereas erroneous utterance boundaries cause only minor problems. Further, differences in transcription quality do not greatly affect overall performance.
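As a rough illustration of the final stage, the saliency scoring over utterances within a story might be sketched as below. The tf-idf-style weighting and the `highlight` function are illustrative assumptions for the sketch, not the paper's actual scoring method.

```python
from collections import Counter
import math

def highlight(utterances, top_k=2):
    """Pick the top_k most salient utterances from one story.

    Saliency here is a toy tf-idf-style score: words that are frequent
    in the story but concentrated in few utterances score higher.
    """
    tokenized = [u.lower().split() for u in utterances]
    tf = Counter(w for toks in tokenized for w in toks)          # story-wide term frequency
    df = Counter(w for toks in tokenized for w in set(toks))     # utterance frequency
    n = len(utterances)

    def score(toks):
        if not toks:
            return 0.0
        return sum(tf[w] * math.log((n + 1) / df[w]) for w in toks) / len(toks)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return sorted(ranked[:top_k])  # return highlights in document order

story = [
    "The prime minister announced a new budget today.",
    "Weather was mild across the region.",
    "The budget increases spending on schools and the health service.",
]
print(highlight(story, top_k=1))
```

In a real cascade this function would receive ASR output with already-imperfect utterance and story boundaries, which is exactly the error propagation the paper measures.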
Comparing topiary-style approaches to headline generation
In this paper we compare a number of Topiary-style headline generation systems. The Topiary system, developed at the University of Maryland with BBN, was the top-performing headline generation system at DUC 2004. Topiary-style headlines consist of a number of general topic labels followed by a compressed version of the lead sentence of a news story. The Topiary system uses a statistical learning approach to find topic labels for headlines, while our approach, the LexTrim system, identifies key summary words by analysing the lexical cohesive structure of a text. The performance of these systems is evaluated using the ROUGE evaluation suite on the DUC 2004 news story collection. The results of these experiments show that a baseline system that identifies topic descriptors for headlines using term frequency counts outperforms both the LexTrim and Topiary systems. A manual evaluation of the headlines confirms this result.
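The term-frequency baseline described above can be sketched roughly as follows. The stopword list, function names, and the exact headline format are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Minimal illustrative stopword list (a real system would use a fuller one)
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "for", "is", "was"}

def topic_descriptors(story_text, k=3):
    """Pick k topic labels by raw term frequency (baseline sketch)."""
    words = [w.strip(".,;:!?").lower() for w in story_text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

def topiary_style_headline(story_text, lead_sentence, k=3):
    """Topic labels followed by the lead sentence (uncompressed here)."""
    labels = topic_descriptors(story_text, k)
    return " ".join(l.upper() for l in labels) + ": " + lead_sentence

text = ("floods hit the valley. floods damaged roads. "
        "rescue teams reached the valley.")
print(topic_descriptors(text, 2))  # prints: ['floods', 'valley']
```

That such a simple frequency count outperformed the learned approaches is the paper's headline finding.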
Multiple Alternative Sentence Compressions as a Tool for Automatic Summarization Tasks
Automatic summarization is the distillation of important information from a source into an abridged form for a particular user or task.
Many current systems summarize texts by selecting sentences with important content. The limitation of extraction at the sentence level
is that highly relevant sentences may also contain non-relevant and
redundant content.
This thesis presents a novel framework for text summarization that
addresses the limitations of sentence-level extraction. Under this
framework text summarization is performed by generating Multiple
Alternative Sentence Compressions (MASC) as candidate summary
components and using weighted features of the candidates to construct
summaries from them. Sentence compression is the rewriting of a
sentence in a shorter form. This framework provides an environment in
which hypotheses about summarization techniques can be tested.
Three approaches to sentence compression were developed under this
framework. The first approach, HMM Hedge, uses the Noisy Channel
Model to calculate the most likely compressions of a sentence. The
second approach, Trimmer, uses syntactic trimming rules that are
linguistically motivated by Headlinese, a form of compressed English
associated with newspaper headlines. The third approach, Topiary, is
a combination of fluent text with topic terms.
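Trimmer's rules operate on parse trees; a toy, flat approximation of the Headlinese effect is sketched below. The regex patterns and the length floor are made-up stand-ins for the thesis's linguistically motivated tree-trimming rules.

```python
import re

# Toy stand-ins for Trimmer-style rules: Headlinese drops determiners,
# forms of "be", and parenthetical material. The real system trims
# constituents of a parse tree; these regexes only mimic the effect.
RULES = [
    (re.compile(r"\b(a|an|the)\s+", re.I), ""),         # drop determiners
    (re.compile(r"\b(is|are|was|were)\s+", re.I), ""),  # drop forms of "be"
    (re.compile(r",[^,]*,"), ""),                       # drop parentheticals
]

def trim(sentence, max_words=8):
    """Apply each rule, keeping the result only if it stays long enough."""
    for pattern, repl in RULES:
        candidate = pattern.sub(repl, sentence)
        if len(candidate.split()) >= 4:  # crude floor against over-trimming
            sentence = candidate
    return " ".join(sentence.split()[:max_words])

print(trim("The mayor, a former teacher, is planning a new school."))
# prints: mayor planning new school.
```

Each rule application yields a shorter candidate, which is how a single source sentence gives rise to the multiple alternative compressions that the MASC framework selects among.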
The MASC framework for automatic text summarization has been applied
to the tasks of headline generation and multi-document summarization,
and has been used for initial work in summarization of novel genres
and applications, including broadcast news, email threads,
cross-language, and structured queries. The framework supports
combinations of component techniques, fostering collaboration between
development teams.
Three results will be demonstrated under the MASC framework. The first is
that an extractive summarization system can produce better summaries
by automatically selecting from a pool of compressed sentence
candidates than by automatically selecting from unaltered source
sentences. The second result is that sentence selectors can construct
better summaries from pools of compressed candidates when they make
use of larger candidate feature sets. The third result is that for
the task of Headline Generation, a combination of topic terms and
compressed sentences performs better than either approach alone.
Experimental evidence supports all three results.
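The second result, that sentence selectors benefit from richer candidate feature sets, suggests a selection loop like the sketch below. The feature names, weights, and greedy budget strategy are illustrative assumptions, not the thesis's actual selector.

```python
def select_summary(candidates, weights, budget=20):
    """Greedily build a summary from a pool of compressed candidates.

    Each candidate is a dict with a 'text' field plus numeric features;
    candidates are ranked by a weighted sum of their features and added
    while they fit within a word budget.
    """
    def score(c):
        return sum(weights.get(f, 0.0) * v for f, v in c.items() if f != "text")

    summary, used = [], 0
    for c in sorted(candidates, key=score, reverse=True):
        n = len(c["text"].split())
        if used + n <= budget:
            summary.append(c["text"])
            used += n
    return " ".join(summary)

# Hypothetical pool: two compressions of the same sentence plus another candidate
pool = [
    {"text": "Storm floods coastal towns.", "relevance": 0.9, "length_penalty": -0.1},
    {"text": "Officials urge residents to evacuate.", "relevance": 0.7, "length_penalty": -0.1},
    {"text": "The storm, which formed last week over warm waters, flooded towns.",
     "relevance": 0.9, "length_penalty": -0.5},
]
weights = {"relevance": 1.0, "length_penalty": 1.0}
print(select_summary(pool, weights, budget=10))
```

Because compressed candidates cost fewer budget words than their unaltered sources, a selector working over the compressed pool can pack in more relevant content, which is the intuition behind the first result.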