5 research outputs found
Automatically Learning Cognitive Status for Multi-Document Summarization of Newswire
Machine summaries can be improved by using knowledge about the cognitive status of news article referents. In this paper, we present an approach to automatically acquiring distinctions in cognitive status using machine learning over the forms of referring expressions appearing in the input. We focus on modeling references to people, both because news often revolves around people and because existing natural language tools for named entity identification are reliable. We examine two specific distinctions: whether a person in the news can be assumed to be known to a target audience (hearer-old vs hearer-new) and whether a person is a major character in the news story. We report on machine learning experiments that show that these distinctions can be learned with high accuracy, and validate our approach using human subjects.
Automatic text summarization with Maximal Frequent Sequences
In the last two decades, an exponential increase in the available electronic information
has created a pressing need to understand large volumes of information quickly.
This raises the importance of developing automatic methods for
detecting the most relevant content of a document in order to produce a shorter
text. Automatic Text Summarization (ATS) is an active research area dedicated to
generating abstractive and extractive summaries, not only for a single document but
also for a collection of documents. A further need is to find methods for
ATS that are independent of language and domain.
In this book we consider extractive text summarization for the single-document
task. We have identified that a typical extractive summarization method consists
of four steps. The first step is term selection, in which one decides what units
will count as individual terms. The process of estimating the usefulness of the
individual terms is called the term weighting step. The next step is sentence
weighting, in which every sentence receives a numerical measure according to
the usefulness of its terms. Finally, the process of selecting the most relevant sentences
is called sentence selection. Different extractive summarization methods can
be characterized by how they perform these steps.
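The four steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions that are not taken from the book: terms are single lowercase words, term weight is raw frequency, sentence weight is the sum of the weights of its distinct terms, and all function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption made for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_terms(sentence):
    """Step 1, term selection: here, single lowercase words."""
    return re.findall(r"[a-z]+", sentence.lower())


def weight_terms(sentences):
    """Step 2, term weighting: here, raw frequency in the document."""
    return Counter(t for s in sentences for t in select_terms(s))


def weight_sentence(sentence, term_weights):
    """Step 3, sentence weighting: sum of the weights of its distinct terms."""
    return sum(term_weights[t] for t in set(select_terms(sentence)))


def summarize(text, k=1):
    """Step 4, sentence selection: top-k sentences, kept in document order."""
    sentences = split_sentences(text)
    weights = weight_terms(sentences)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weight_sentence(sentences[i], weights),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Swapping any one function (for example, selecting multiword terms instead of single words) yields a different method within the same four-step frame.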
In this book, in the term selection step, we describe how to detect multiword
descriptions using Maximal Frequent Sequences (MFSs), which bear important
meaning, while non-maximal frequent sequences (FSs), those that are
parts of another FS, should not be considered. Our additional motivation was
a cost vs. benefit consideration: there are too many non-maximal FSs, while their
probability of bearing important meaning is lower. In any case, MFSs represent all FSs
in a compact way: all FSs can be obtained from all MFSs by expanding each MFS into
the set of all its subsequences.
New methods based on graph algorithms, genetic algorithms, and clustering
algorithms that facilitate the text summarization task are presented. We
have tested different combinations of term selection, term weighting, sentence
weighting, and sentence selection options for language- and domain-independent
extractive single-document text summarization on a news report collection. We
analyzed several options based on MFSs, considering them with graph, genetic,
and clustering algorithms. We obtained results superior to the existing
state-of-the-art methods.
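The compactness claim about maximal frequent sequences can be checked on a toy example. The sketch below assumes that a sequence is a contiguous run of words, that "frequent" means occurring at least `threshold` times in the collection, and that a subsequence is a contiguous part; the threshold value and all function names are hypothetical, and the brute-force enumeration is for illustration only.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous word sequences of length n."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def frequent_sequences(sentences, threshold=2):
    """All contiguous word sequences occurring at least `threshold` times."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, len(words) + 1):
            counts.update(ngrams(words, n))
    return {seq for seq, c in counts.items() if c >= threshold}


def contains(longer, shorter):
    """True if `shorter` appears as a contiguous part of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def maximal(fss):
    """Keep only frequent sequences that are not part of a longer one."""
    return {s for s in fss
            if not any(len(t) > len(s) and contains(t, s) for t in fss)}


def expand(mfss):
    """Recover all frequent sequences from the maximal ones."""
    return {m[i:j] for m in mfss
            for i in range(len(m)) for j in range(i + 1, len(m) + 1)}
```

Under these assumptions, `expand(maximal(fss)) == fss` holds for any input, because every contiguous part of a frequent sequence is itself at least as frequent.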
This book is addressed to students and scientists in the area of Computational
Linguistics, and also to anyone who wants to learn about recent developments in
the automatic generation of text summaries.
Automatic Generation of Text Summaries - Challenges, proposals and experiments
Students and researchers in natural language processing, artificial intelligence, computer science, and computational linguistics will perhaps be the first to take an interest in this book. However, it is also intended to introduce non-specialist readers to this promising research area; for this reason, we have translated into Spanish some technical terms and anglicisms specific to the discipline, while always mentioning the English term to avoid confusion and to allow interested readers to broaden their sources of knowledge. This book presents a computational method for the automatic generation of text summaries that is novel at the international level, as it surpasses the quality of the summaries that can currently be produced. That is, it is the result of research that sought computational methods and models that depend as little as possible on language and domain.
Linking named entities to Wikipedia
Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case, Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, return NIL. Entity linking systems can be complex, and we present a framework for analysing their different components. We use this framework to analyse three seminal systems, evaluated on a common dataset, and show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, we are also able to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.
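The input and output of the NEL task described above can be illustrated with a toy linker. This sketch is not the thesis's system: the two-entry knowledge base, the alias sets, the description words, and the word-overlap scoring are all hypothetical choices made only to show candidate search, disambiguation by context, and the NIL case.

```python
NIL = "NIL"

# Hypothetical toy knowledge base: article title -> (aliases, description words).
TOY_KB = {
    "Mercury (planet)": ({"mercury"}, {"planet", "orbit", "sun"}),
    "Mercury (element)": ({"mercury", "quicksilver"}, {"metal", "liquid", "chemical"}),
}


def link(mention, context, kb=TOY_KB):
    """Return the best-matching KB title for `mention`, or NIL if none matches."""
    name = mention.lower()
    context_words = set(context.lower().split())
    # Candidate search: every entry that lists the mention as an alias,
    # scored by how many description words occur in the context.
    candidates = [(len(desc & context_words), title)
                  for title, (aliases, desc) in kb.items() if name in aliases]
    if not candidates:
        return NIL  # the KB has no entry for this name
    return max(candidates)[1]
```

A mention with no matching alias yields `NIL`, while an ambiguous name such as "Mercury" is resolved by the words that surround it.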
Towards More Human-Like Text Summarization: Story Abstraction Using Discourse Structure and Semantic Information.
PhD Thesis
With the massive amount of textual data being produced every day,
the ability to effectively summarise text documents is becoming increasingly
important. Automatic text summarization entails the selection
and generalisation of the most salient points of a text in order
to produce a summary. Approaches to automatic text summarization
can fall into one of two categories: abstractive or extractive approaches.
Extractive approaches involve the selection and concatenation
of spans of text from a given document. Research in automatic
text summarization began with extractive approaches, scoring and
selecting sentences based on the frequency and proximity of words.
In contrast, abstractive approaches are based on a process of interpretation,
semantic representation, and generalisation. This is closer
to the processes that psycholinguistics tells us that humans perform
when reading, remembering and summarizing. However, in the sixty
years since its inception, the field has largely remained focused on
extractive approaches.
This thesis aims to answer the following questions. Does knowledge
about the discourse structure of a text aid the recognition of
summary-worthy content? If so, which specific aspects of discourse
structure provide the greatest benefit? Can this structural information
be used to produce abstractive summaries, and are these more
informative than extractive summaries? To thoroughly examine these
questions, they are each considered in isolation, and as a whole, on
the basis of both manual and automatic annotations of texts. Manual
annotations facilitate an investigation into the upper bounds of
what can be achieved by the approach described in this thesis. Results
based on automatic annotations show how this same approach
is impacted by the current performance of imperfect preprocessing
steps, and indicate its feasibility.
Extractive approaches to summarization are intrinsically limited
by the surface text of the input document, in terms of both content
selection and summary generation. Beginning with a motivation
for moving away from these commonly used methods of producing
summaries, I set out my methodology for a more human-like
approach to automatic summarization which examines the benefits of
using discourse-structural information. The potential benefit of this
is twofold: moving away from a reliance on the wording of a text
in order to detect important content, and generating concise summaries
that are independent of the input text. The importance of
discourse structure to signal key textual material has previously been
recognised; however, it has seen little applied use in the field of automatic
summarization. A consideration of evaluation metrics also features
significantly in the proposed methodology. These play a role in
both preprocessing steps and in the evaluation of the final summary
product. I provide evidence which indicates a disparity between the
performance of coreference resolution systems as indicated by their
standard evaluation metrics, and their performance in extrinsic tasks.
Additionally, I point out a range of problems for the most commonly
used metric, ROUGE, and suggest that at present summary evaluation
should not be automated.
To illustrate the general solutions proposed to the questions raised
in this thesis, I use Russian Folk Tales as an example domain. This
genre of text has been studied in depth and, most importantly, it has a
rich narrative structure that has been recorded in detail. The rules of
this formalism are suitable for the narrative structure reasoning system
presented as part of this thesis. The specific discourse-structural elements
considered cover the narrative structure of a text, coreference
information, and the story-roles fulfilled by different characters.
The proposed narrative structure reasoning system produces high-level
interpretations of a text according to the rules of a given formalism.
For the example domain of Russian Folk Tales, a system is implemented
which constructs such interpretations of a tale according to
an existing set of rules and restrictions. I discuss how this process of
detecting narrative structure can be transferred to other genres, and
a key factor in the success of this process: how constrained the
rules of the formalism are. The system enumerates all possible interpretations
according to a set of constraints, meaning a less restricted rule
set leads to a greater number of interpretations.
For the example domain, sentence-level discourse-structural annotations
are then used to predict summary-worthy content. The results
of this study are analysed in three parts. First, I examine the relative
utility of individual discourse features and provide a qualitative
discussion of these results. Second, the predictive abilities of these
features are compared when they are manually annotated to when
they are annotated with varying degrees of automation. Third, these
results are compared to the predictive capabilities of classic extractive
algorithms. I show that discourse features can be used to more
accurately predict summary-worthy content than classic extractive algorithms.
This holds true for automatically obtained annotations, but
with a much clearer difference when using manual annotations.
The classifiers learned in the prediction of summary-worthy sentences
are subsequently used to inform the production of both extractive
and abstractive summaries to a given length. A human-based
evaluation is used to compare these summaries, as well as the outputs
of a classic extractive summarizer. I analyse the impact of knowledge
about discourse structure, obtained both manually and automatically,
on summary production. This allows for some insight into the knock-on
effects on summary production that can occur from inaccurate discourse
information (narrative structure and coreference information).
My analyses show that even given inaccurate discourse information,
the resulting abstractive summaries are considered more informative
than their extractive counterparts. With human-level knowledge
about discourse structure, these results are even clearer.
In conclusion, this research provides a framework which can be
used to detect the narrative structure of a text, and shows its potential
to provide a more human-like approach to automatic summarization.
I show the limit of what is achievable with this approach both
when manual annotations are obtainable, and when only automatic
annotations are feasible. Nevertheless, this thesis supports the suggestion
that the future of summarization lies with abstractive and not
extractive techniques.