Global Inference for Sentence Compression: An Integer Linear Programming Approach
Institute for Communicating and Collaborative Systems

In this thesis we develop models for sentence compression. This text rewriting task
has recently attracted considerable attention due to its relevance for applications (e.g., summarisation)
and its simple formulation by means of word deletion. Previous models for
sentence compression have been inherently local and thus fail to capture the long-range
dependencies and complex interactions involved in text rewriting. We present a solution
by framing the task as an optimisation problem with local and global constraints
and recast existing compression models into this framework. Using the constraints we
instil syntactic, semantic and discourse knowledge the models otherwise fail to capture.
We show that the addition of constraints allows relatively simple local models to
reach state-of-the-art performance for sentence compression.
The thesis provides a detailed study of sentence compression and its models. The
differences between automatic and manually created compression corpora are assessed
along with how compression varies across written and spoken text. We also discuss
various techniques for automatically and manually evaluating compression output
against a gold standard. Models are reviewed based on their assumptions, training requirements,
and scalability.
We introduce a general method for extending previous approaches to allow for
more global models. This is achieved through the optimisation framework of Integer
Linear Programming (ILP). We reformulate three compression models (an unsupervised
model, a semi-supervised model and a fully supervised model) as ILP problems
and augment them with constraints. These constraints are intuitive for the compression
task and are both syntactically and semantically motivated. We demonstrate how they
improve compression quality and reduce the requirements on training material.
Finally, we delve into document compression where the task is to compress every
sentence of a document and use the resulting summary as a replacement for the
original document. For document-based compression we investigate discourse information
and its application to the compression task. Two discourse theories, Centering
and lexical chains, are used to automatically annotate documents. These annotations
are then used in our compression framework to impose additional constraints on the
resulting document. The goal is to preserve the discourse structure of the original document
and most of its content. We show how a discourse-informed compression model
can outperform a discourse-agnostic state-of-the-art model using a question answering
evaluation paradigm.
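As a concrete illustration of this framing, the sketch below encodes a single sentence as an ILP using the open-source PuLP solver: one binary variable per word, a local relevance objective, and global constraints that force dependents to follow their heads and bound the output length. The words, scores, dependencies, and constraint set are invented for illustration and are not the thesis's actual formulation.

```python
# A minimal sketch, assuming PuLP and toy scores/dependencies; not the
# thesis's actual objective or constraint set.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

words = ["The", "committee", "eventually", "approved", "the", "new", "budget"]
score = [0.2, 0.9, 0.1, 0.9, 0.2, 0.4, 0.8]      # toy local relevance scores
# Hypothetical (head, dependent) pairs from a dependency parse.
deps = [(3, 1), (3, 6), (3, 2), (1, 0), (6, 4), (6, 5)]

prob = LpProblem("sentence_compression", LpMaximize)
keep = [LpVariable(f"keep_{i}", cat="Binary") for i in range(len(words))]

# Local objective: prefer high-scoring words.
prob += lpSum(score[i] * keep[i] for i in range(len(words)))

# Global constraints inject syntactic knowledge a local model misses:
# a dependent survives only if its head does; the main verb is kept;
# the output respects a length budget (the compression rate).
for head, dep in deps:
    prob += keep[dep] <= keep[head]
prob += keep[3] == 1                  # index 3 = "approved", the root verb
prob += lpSum(keep) <= 4              # keep at most 4 of the 7 words

prob.solve()
print(" ".join(w for i, w in enumerate(words) if keep[i].value() == 1))
# -> "committee approved new budget"
```

In this framing the constraints stay declarative, so richer objectives, such as the unsupervised, semi-supervised and supervised models the thesis recasts, can be swapped in without touching the constraint machinery.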
Some Reflections on the Task of Content Determination in the Context of Multi-Document Summarization of Evolving Events
Despite its importance, the task of summarizing evolving events has received
little attention from researchers in the field of multi-document summarization. In
a previous paper (Afantenos et al. 2007) we have presented a methodology for
the automatic summarization of documents, emitted by multiple sources, which
describe the evolution of an event. At the heart of this methodology lies the
identification of similarities and differences between the various documents,
along two axes: the synchronic and the diachronic. This is achieved by the
introduction of the notion of Synchronic and Diachronic Relations. These
relations connect the messages found in the documents, thus resulting
in a graph which we call the grid. Although the creation of the grid completes the
Document Planning phase of a typical NLG architecture, the number of messages
contained in a grid may be very large, thus exceeding the
required compression rate. In this paper we provide some initial thoughts on a
probabilistic model which can be applied at the Content Determination stage
and which tries to alleviate this problem.
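As a rough illustration of the grid described above, the sketch below represents messages and the relations connecting them along the two axes; the record fields and relation labels are hypothetical stand-ins for the paper's richer definitions.

```python
# A rough illustration, assuming a minimal message record; field names and
# relation labels are hypothetical, not the paper's actual definitions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    msg_id: str
    source: str   # which source emitted the document
    time: int     # position on the diachronic (temporal) axis

messages = [
    Message("m1", "sourceA", 1),
    Message("m2", "sourceB", 1),
    Message("m3", "sourceA", 2),
]

# Edges of the grid: synchronic relations connect messages from different
# sources at the same time; diachronic relations connect messages across time.
relations = [
    ("m1", "m2", "synchronic", "agreement"),     # same time, two sources
    ("m1", "m3", "diachronic", "continuation"),  # same source, later time
]

for a, b, axis, label in relations:
    print(f"{a} -[{axis}/{label}]-> {b}")
```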
Rhetorical relations for information retrieval
Typically, every part of a coherent text has some plausible reason for its
presence, some function that it performs for the overall semantics of the text.
Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts
of a text are linked to each other. Knowledge about this so-called discourse
structure has been applied successfully to several natural language processing
tasks. This work studies the use of rhetorical relations for Information
Retrieval (IR): Is there a correlation between certain rhetorical relations and
retrieval performance? Can knowledge about a document's rhetorical relations be
useful to IR? We present a language model modification that considers
rhetorical relations when estimating the relevance of a document to a query.
Empirical evaluation of different versions of our model on TREC settings shows
that certain rhetorical relations can benefit retrieval effectiveness notably
(> 10% in mean average precision over a state-of-the-art baseline).
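One way such a modification could look is sketched below: a Dirichlet-smoothed query-likelihood model in which terms occurring inside selected rhetorical relations receive a boosted weight. The weighting scheme, relation labels, and collection statistics are assumptions for illustration, not the paper's actual estimator.

```python
# A sketch, assuming Dirichlet smoothing and an invented relation-based
# term weighting; not the paper's actual estimator.
import math
from collections import Counter

def relation_lm_score(query, doc_terms, term_relation, relation_weight,
                      mu=2000.0, collection_prob=lambda t: 1e-5):
    """Query likelihood where terms inside boosted relations count more."""
    counts = Counter(doc_terms)
    weight = lambda t: relation_weight.get(term_relation.get(t), 1.0)
    doc_len = sum(weight(t) * c for t, c in counts.items())
    score = 0.0
    for q in query:
        tf = weight(q) * counts.get(q, 0)            # relation-boosted tf
        score += math.log((tf + mu * collection_prob(q)) / (doc_len + mu))
    return score

doc = "the flood caused severe damage because heavy rain fell".split()
term_rel = {"caused": "cause", "because": "cause"}   # toy relation spans
print(relation_lm_score(["flood", "caused"], doc, term_rel, {"cause": 1.5}))
```

The design choice here is to fold the relation signal into the term-frequency estimate rather than the prior, so that documents whose query terms sit inside favoured relations (e.g. cause) score higher at matching strength.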
Generating Abstractive Summaries from Meeting Transcripts
Summaries of meetings are important as they convey the essential content
of discussions in a concise form. Reading and understanding whole meeting
transcripts is time consuming; summaries therefore play an important role for
readers interested only in the key content of the discussions. In
this work, we address the task of meeting document summarization. Automatic
summarization systems developed so far for meeting conversations have been
primarily extractive, resulting in unacceptable summaries that are hard to
read. The extracted utterances contain disfluencies that affect the quality of
the extractive summaries. To make summaries much more readable, we propose an
approach to generating abstractive summaries by fusing important content from
several utterances. We first separate meeting transcripts into various topic
segments, and then identify the important utterances in each segment using a
supervised learning approach. The important utterances are then combined
together to generate a one-sentence summary. In the text generation step, the
dependency parses of the utterances in each segment are combined together to
create a directed graph. The most informative and well-formed sub-graph
obtained by integer linear programming (ILP) is selected to generate a
one-sentence summary for each topic segment. The ILP formulation reduces
disfluencies by leveraging grammatical relations that are more prominent in
a non-conversational style of text, and therefore generates summaries that are
comparable to human-written abstractive summaries. Experimental results show
that our method can generate more informative summaries than the baselines. In
addition, readability assessments by human judges, as well as log-likelihood
estimates obtained from the dependency parser, show that our generated summaries
are readable and well-formed.
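The graph-and-ILP step can be sketched compactly with PuLP: merge dependency edges from a segment's utterances into one weighted directed graph, then select the most informative subtree under a length budget. The edge counts and the minimal constraint set below are illustrative assumptions; the paper's formulation includes richer grammatical constraints.

```python
# A compact sketch, assuming toy edge counts and only a tree constraint
# plus a length budget; the paper's ILP has richer grammatical constraints.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

# Fused dependency graph: (head, dependent) -> how often the edge occurred
# across the segment's utterances (a rough informativeness signal).
edges = {("approved", "board"): 2, ("approved", "budget"): 3,
         ("budget", "new"): 1, ("approved", "yesterday"): 1}
nodes = {w for edge in edges for w in edge}

prob = LpProblem("dependency_fusion", LpMaximize)
x = {e: LpVariable(f"edge_{e[0]}_{e[1]}", cat="Binary") for e in edges}
y = {n: LpVariable(f"node_{n}", cat="Binary") for n in nodes}

prob += lpSum(edges[e] * x[e] for e in edges)   # prefer frequent edges
for (h, d), var in x.items():                   # an edge needs both endpoints
    prob += var <= y[h]
    prob += var <= y[d]
for n in nodes - {"approved"}:                  # tree shape: one head per word
    prob += lpSum(x[e] for e in edges if e[1] == n) == y[n]
prob += y["approved"] == 1                      # keep the root predicate
prob += lpSum(y.values()) <= 3                  # summary length budget

prob.solve()
print(sorted(n for n in nodes if y[n].value() == 1))
# -> ['approved', 'board', 'budget']
```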
A Novel ILP Framework for Summarizing Content with High Lexical Variety
Summarizing content contributed by individuals can be challenging, because
people make different lexical choices even when describing the same events.
However, there remains a significant need to summarize such content. Examples
include student responses to post-class reflective questions, product
reviews, and news articles published by different news agencies related to the
same events. High lexical diversity of these documents hinders the system's
ability to effectively identify salient content and reduce summary redundancy.
In this paper, we overcome this issue by introducing an integer linear
programming-based summarization framework. It incorporates a low-rank
approximation to the sentence-word co-occurrence matrix to intrinsically group
semantically-similar lexical items. We conduct extensive experiments on
datasets of student responses, product reviews, and news documents. Our
approach compares favorably to a number of extractive baselines as well as a
neural abstractive summarization system. The paper finally sheds light on when
and why the proposed framework is effective at summarizing content with high
lexical variety.
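The low-rank step on its own can be illustrated with a truncated SVD of a toy sentence-word matrix; this is a sketch under invented data, and how the reconstructed weights enter the ILP objective is simplified away here.

```python
# A sketch of the low-rank step alone, on an invented toy corpus; how the
# reconstructed weights enter the ILP objective is simplified away here.
import numpy as np

sents = ["the phone screen cracked",
         "the display shattered",
         "battery life is great"]
vocab = sorted({w for s in sents for w in s.split()})
A = np.array([[1.0 if w in s.split() else 0.0 for w in vocab]
              for s in sents])                  # sentence-word co-occurrence

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                           # target rank
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]     # rank-k approximation of A

# After the approximation a sentence can receive soft credit for words it
# never used literally, so near-paraphrases start to overlap.
j = vocab.index("display")
for i, s in enumerate(sents):
    print(f"weight on 'display': {A_k[i, j]:+.3f}  |  {s}")
```

Even at rank 2, the two device-failure sentences begin to share weight on each other's vocabulary, which is the grouping effect a framework like this can exploit to detect redundancy across different lexical choices.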
Simulation Genres and Student Uptake: The Patient Health Record in Clinical Nursing Simulations
Drawing on fieldwork, this article examines nursing students' design and use of a patient health record during clinical simulations, where small teams of students provide nursing care for a robotic patient. The student-designed patient health record provides a compelling example of how simulation genres can both authentically coordinate action within a classroom simulation and support professional genre uptake. First, the range of rhetorical choices available to students in designing their simulation health records is discussed. Then, the article draws on an extended example of how student uptake of the patient health record within a clinical simulation emphasized its intertextual relationship to other genres, its role mediating social interactions with the patient and other providers, and its coordination of embodied actions. Connections to students' experiences with professional genres are addressed throughout. The article concludes by considering initial implications of this research for disciplinary and professional writing courses.