The Use of yig-cha and chos-kyi-rnam-grangs in Computing Lexical Cohesion for Tibetan Topic Boundary Detection
To implement even a simple Tibetan Information Retrieval (IR) system, segmentation of one form or another (n-gram, POS tagging, dictionary substring matching, etc.) must be performed (see Hackett (2000b)). To take Tibetan indexing to a more sophisticated level, however, some form of topic detection must be employed. This paper reports the results of a pilot study on the application to Tibetan of one technique for topic boundary detection: lexical cohesion. The resources developed and deployed, the theoretical model used, and its potential applications are discussed.
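A common way to operationalize lexical cohesion for boundary detection is the TextTiling-style comparison of adjacent windows: where word overlap between neighboring blocks drops, a topic boundary is likely. The sketch below illustrates that general idea only; the tokenization, window size, and scoring are illustrative assumptions, not the paper's actual method or resources.

```python
# Sketch of lexical-cohesion boundary detection in the TextTiling style:
# compare word overlap between adjacent windows of sentences; low-similarity
# valleys suggest topic boundaries.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundary_scores(sentences, window=2):
    """Cohesion score at each gap between sentence windows; low = likely boundary."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(window, len(bags) - window + 1):
        left = sum((bags[i] for i in range(gap - window, gap)), Counter())
        right = sum((bags[i] for i in range(gap, gap + window)), Counter())
        scores.append((gap, cosine(left, right)))
    return scores
```

A real Tibetan pipeline would first need the segmentation step the abstract mentions, since Tibetan text is not whitespace-delimited at the word level.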
FEMsum: A flexible eclectic multitask summarizer architecture evaluated in multidocument tasks
This article describes two types of summarization approaches integrated in a flexible architecture for multitask summarization. The first type is based on the use of lexical features, while the second is grounded in syntactic and semantic information. All approaches have been evaluated in experiments where, given a set of documents, they are expected to produce summaries answering a user need (expressed by a query) in a reduced set of relevant textual fragments. Their performance is analyzed in two different tasks: written news and scientific oral presentations.
Automatic Summarization
It has now been 50 years since the publication of Luhn’s seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent, and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain- and genre-specific summarization, and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and the deeper semantic understanding of language that would be necessary for future advances in the field.
Automatic text summarization using lexical chains : algorithms and experiments
Summarization is a complex task that requires understanding of the document content to determine the importance of the text. Lexical cohesion is a method for identifying connected portions of a text based on the relations between its words. Lexical cohesive relations can be represented using lexical chains: sequences of semantically related words spread over the entire text. Lexical chains are used in a variety of Natural Language Processing (NLP) and Information Retrieval (IR) applications. In this thesis, we propose a lexical chaining method that includes glossary relations in the chaining process. These relations enable us to identify topically related concepts (for instance, dormitory and student) and thereby enhance the identification of cohesive ties in the text. We then present methods that use the lexical chains to generate summaries by extracting sentences from the document(s). Headlines are generated by filtering out the portions of the extracted sentences that do not contribute to their meaning; the generated headlines can be used in real-world applications to skim through document collections in a digital library. Multi-document summarization is gaining demand with the explosive growth of online news sources. It requires identifying the several themes present in a collection to attain good compression and avoid redundancy. In this thesis, we propose methods to group portions of the texts of a document collection into meaningful clusters. Clustering enables us to extract the various themes of the collection. Sentences from the clusters can then be extracted to generate a summary for the multi-document collection; clusters can also be used to generate summaries with respect to a given query. We designed a system that computes lexical chains for a given text and uses them to extract the salient portions of the document. Specific tasks considered include headline generation, multi-document summarization, and query-based summarization. Our experimental evaluation shows that effective summaries can be extracted for these tasks.
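The chaining idea above can be sketched in a few lines. The hand-written relatedness table below stands in for the WordNet and glossary relations the thesis uses, and the greedy first-fit chaining and length-based chain score are simplifying assumptions, not the thesis's exact algorithm:

```python
# Minimal sketch of lexical chaining with a toy relatedness table standing in
# for WordNet/glossary relations (e.g., dormitory ~ student, per the abstract).
RELATED = {
    ("dormitory", "student"),
    ("student", "university"),
    ("dormitory", "university"),
}

def related(a: str, b: str) -> bool:
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def build_chains(words):
    """Greedily attach each word to the first chain containing a related word."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

def chain_score(chain):
    """One common scoring choice: chain length times distinct-word count."""
    return len(chain) * len(set(chain))
```

Sentences containing members of the highest-scoring chains would then be extracted into the summary.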
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
Automatic Text Summarization
Writing was one of the first methods humans ever used to represent their knowledge.
Text can be of different types and have different purposes.
Due to the evolution of information systems and the Internet, the amount of textual information available has increased exponentially on a worldwide scale, and many documents tend to contain a proportion of unnecessary information. As a result, most readers have difficulty digesting all the extensive information contained in the multiple documents produced on a daily basis.
A simple solution to the excess of irrelevant information in texts is to create summaries, in which we keep the parts related to the subject and remove the unnecessary ones.
In Natural Language Processing, the goal of automatic text summarization is to create systems that process text and keep only the most important content. Since the field's inception, several approaches have been designed to create better text summaries; they can be divided into two separate groups: extractive approaches and abstractive approaches.
In the first group, the summarizers decide which text elements should be in the summary. The criteria by which they are selected are diverse; after they are selected, they are combined into the summary. In the second group, the text elements are generated from scratch. Abstractive summarizers are much more complex, so they still require a lot of research in order to produce good results.
In this thesis, we investigated the state-of-the-art approaches, implemented our own versions, and tested them on conventional datasets, such as the DUC dataset.
Our first approach was frequency-based, since it analyzes the frequency with which the text's words and sentences appear. Higher-frequency words and sentences automatically receive higher scores, which are then filtered with a compression rate and combined into a summary.
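A minimal sketch of such a frequency-based extractive step is shown below; the tokenization, the normalized-frequency scoring formula, and the default compression rate are illustrative assumptions rather than the thesis's exact implementation:

```python
# Sketch of frequency-based extractive summarization: score each sentence by
# the normalized frequency of its words, then keep the top fraction of
# sentences given by a compression rate, preserving original order.
from collections import Counter

def summarize(sentences, compression=0.5):
    freq = Counter(w for s in sentences for w in s.lower().split())
    top = freq.most_common(1)[0][1]  # count of the most frequent word

    def score(s):
        toks = s.lower().split()
        return sum(freq[w] / top for w in toks) / len(toks)

    k = max(1, int(len(sentences) * compression))
    keep = set(sorted(sentences, key=score, reverse=True)[:k])
    return [s for s in sentences if s in keep]  # original document order
```

The compression rate directly trades summary length against coverage, which is why all three extractive approaches in the thesis use it as the final filter.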
In our second approach, we improved the original TextRank algorithm by combining it with word embedding vectors. The goal was to represent the text's sentences as nodes of a graph and, with the help of word embeddings, determine how similar pairs of sentences are and rank them by their similarity scores. The highest-ranking sentences were filtered with a compression rate and picked for the summary.
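The graph-ranking step can be sketched as below. For self-containment, toy bag-of-words vectors stand in for the word embeddings the thesis uses, and the damping factor and iteration count are conventional PageRank defaults, not values from the thesis:

```python
# Sketch of TextRank over sentence vectors: nodes are sentences, edge weights
# are cosine similarities, and PageRank-style power iteration ranks the nodes.
from math import sqrt

def sentence_vector(sentence, vocab):
    toks = sentence.lower().split()
    return [toks.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def textrank(sentences, damping=0.85, iters=50):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vecs = [sentence_vector(s, vocab) for s in sentences]
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            # Weighted PageRank update; skip nodes with no outgoing weight.
            incoming = sum(sim[j][i] / sum(sim[j]) * rank[j]
                           for j in range(n) if sum(sim[j]) > 0)
            new.append((1 - damping) / n + damping * incoming)
        rank = new
    return rank
```

Swapping `sentence_vector` for averaged pretrained word embeddings recovers the embedding-based variant described above.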
In the third approach, we combined feature analysis with deep learning. By analysing certain characteristics of the text's sentences, one can assign scores that represent the importance of a given sentence for the summary. With these computed values, we created a dataset for training a deep neural network capable of deciding whether a certain sentence should be in the summary.
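The feature-analysis step might look like the sketch below, which turns each sentence into a numeric row suitable for training a classifier. The particular feature set (position, relative length, average term frequency) is an illustrative assumption, not the thesis's actual features:

```python
# Sketch of per-sentence feature extraction for a sentence-selection classifier.
from collections import Counter

def sentence_features(sentences):
    freq = Counter(w for s in sentences for w in s.lower().split())
    max_len = max(len(s.split()) for s in sentences)
    rows = []
    for i, s in enumerate(sentences):
        toks = s.lower().split()
        rows.append({
            "position": 1.0 - i / len(sentences),   # earlier sentences score higher
            "rel_length": len(toks) / max_len,      # length relative to longest
            "avg_tf": sum(freq[w] for w in toks) / len(toks),
        })
    return rows
```

Each row, paired with a binary in-summary label, would form one training example for the neural network.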
An abstractive encoder-decoder summarizer was created with the purpose of generating words related to the document's subject and combining them into a summary. Finally, all the summarizers were combined into a full system.
Each of our approaches was evaluated with several evaluation metrics, such as ROUGE. We used the DUC dataset for this purpose, and the results were fairly similar to those reported in the scientific community. As for our encoder-decoder, we obtained promising results.
Text is one of the most important tools for transmitting ideas between human beings. It can be of various types, and its content can be more or less easy to interpret, depending on the amount of relevant information about the main subject.
To make processing easier for the reader, there is a mechanism purposely created to reduce the irrelevant information in a text, called text summarization. Through summarization, reduced versions of the original text are created that retain the information about the main subject.
Due to the creation and evolution of the Internet and other means of communication, there has been an exponential increase in textual documents, an event known as information overload; most of these documents contain unnecessary information about the subject they address.
To address this global problem, automatic text summarization emerged within the scientific field of Natural Language Processing; it makes it possible to create automatic summaries of any type of text in any language by means of computational algorithms.
Since its inception, countless text summarization techniques have been devised, which can be classified into two different types: extractive and abstractive. In extractive techniques, elements of the original text, such as words or whole sentences that best illustrate the subject of the text, are transcribed and combined into a document. In abstractive techniques, the algorithms generate new elements.
In this dissertation, some of the best-performing techniques were researched, implemented, and combined in order to create a complete summarization system.
Of the techniques implemented, the first three are extractive while the last is abstractive. The first focuses on computing the frequencies of the text's elements, assigning values to the most frequent sentences, which in turn are chosen for the summary through a compression rate. Another technique represents the textual elements as nodes of a graph, assigns similarity values between them, and then chooses the sentences with the highest values through a compression rate. A further approach combines a mechanism for analysing the characteristics of the text with methods based on artificial intelligence: each sentence has a set of characteristics that are used to train a neural network model, which evaluates and decides which sentences should belong to the summary and filters them through a compression rate. An abstractive summarizer was created to generate words about the subject of the text and combine them into a summary. Each of these summarizers was combined into a single system. Finally, each technique can be evaluated with several evaluation metrics, such as ROUGE. According to the evaluation results on the DUC dataset, our summarizers obtained results fairly similar to those present in the scientific community, with particular note for the encoder-decoder, which in certain cases produced promising results.
Automatic summarisation: 25 years On
This is an accepted manuscript of an article published by Cambridge University Press (CUP) in Natural Language Engineering on 19/09/2019, available online: https://doi.org/10.1017/S1351324919000524. The accepted version of the publication may differ from the final published version.
Automatic text summarisation is a topic that has received attention from the research community since the early days of computational linguistics, but it really took off around 25 years ago. This article presents the main developments of the last 25 years. It starts by defining what a summary is and how its definition has changed over time as a result of the interest in processing new types of documents. The article continues with a brief history of the field and highlights the main challenges posed by the evaluation of summaries. It finishes with some thoughts about the future of the field.
Toward summarization of communicative activities in spoken conversation
This thesis is an inquiry into the nature and structure of face-to-face conversation, with a
special focus on group meetings in the workplace. I argue that conversations are composed
of episodes, each of which corresponds to an identifiable communicative activity such as
giving instructions or telling a story. These activities are important because they are part
of participants’ commonsense understanding of what happens in a conversation. They
appear in natural summaries of conversations such as meeting minutes, and participants
talk about them within the conversation itself. Episodic communicative activities therefore
represent an essential component of practical, commonsense descriptions of conversations.
The thesis objective is to provide a deeper understanding of how such activities may be
recognized and differentiated from one another, and to develop a computational method
for doing so automatically. The experiments are thus intended as initial steps toward future
applications that will require analysis of such activities, such as an automatic minute-taker
for workplace meetings, a browser for broadcast news archives, or an automatic decision
mapper for planning interactions.
My main theoretical contribution is to propose a novel analytical framework called participant
relational analysis. The proposal argues that communicative activities are principally
indicated through participant-relational features, i.e., expressions of relationships between
participants and the dialogue. Participant-relational features, such as subjective language,
verbal reference to the participants, and the distribution of speech activity amongst
the participants, are therefore argued to be a principal means for analyzing the nature and
structure of communicative activities.
I then apply the proposed framework to two computational problems: automatic discourse
segmentation and automatic discourse segment labeling. The first set of experiments
test whether participant-relational features can serve as a basis for automatically
segmenting conversations into discourse segments, e.g., activity episodes. Results show
that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly used method of exploiting semantic links between content words, i.e., lexical cohesion. They also show that feature performance is
highly dependent on segment type, suggesting that human-annotated “topic segments” are
in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units.
Analysis of commonly used evaluation measures, performed in conjunction with the
segmentation experiments, reveals that they fail to penalize substantially defective results
due to inherent biases in the measures. I therefore preface the experiments with a comprehensive
analysis of these biases and a proposal for a novel evaluation measure. A reevaluation
of state-of-the-art segmentation algorithms using the novel measure produces
substantially different results from previous studies. This raises serious questions about the
effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate
ones to employ in the subsequent experiments.
I also preface the experiments with an investigation of participant reference, an important
type of participant-relational feature. I propose an annotation scheme with novel distinctions
for vagueness, discourse function, and addressing-based referent inclusion, each
of which are assessed for inter-coder reliability. The produced dataset includes annotations
of 11,000 occasions of person-referring.
The second set of experiments concern the use of participant-relational features to
automatically identify labels for discourse segments. In contrast to assigning semantic topic
labels, such as topical headlines, the proposed algorithm automatically labels segments
according to activity type, e.g., presentation, discussion, and evaluation. The method is
unsupervised and does not learn from annotated ground truth labels. Rather, it induces the
labels through correlations between discourse segment boundaries and the occurrence of
bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what
has just occurred or what is about to occur. Results show that bracketing meta-discourse
is an effective basis for identifying some labels automatically, but that its use is limited if
global correlations to segment features are not employed.
This thesis addresses important pre-requisites to the automatic summarization of conversation.
What I provide is a novel activity-oriented perspective on how summarization
should be approached, and a novel participant-relational approach to conversational analysis.
The experimental results show that analysis of participant-relational features is an effective basis for recognizing and differentiating such activities.
Investigating the Extractive Summarization of Literary Novels
Due to the vast amount of information we are faced with, summarization has become a critical necessity of everyday human life. Given that a large fraction of the electronic documents available online and elsewhere consist of short texts such as Web pages, news articles, scientific reports, and others, the focus of natural language processing techniques to date has been on the automation of methods targeting short documents. We are witnessing a change, however: an increasingly large number of books are becoming available in electronic format. This means that the need for language processing techniques able to handle very large documents such as books is becoming increasingly important. This thesis addresses the problem of summarization of novels, which are long and complex literary narratives. While there is a significant body of research on the task of automatic text summarization, most of this work has been concerned with the summarization of short documents, with a particular focus on news stories. However, novels are different in both length and genre, and consequently different summarization techniques are required. This thesis attempts to close this gap by analyzing a new domain for summarization, and by building unsupervised and supervised systems that effectively take into account the properties of long documents and outperform traditional extractive summarization systems, which typically address the news genre.
Semantics-driven Abstractive Document Summarization
The evolution of the Web over the last three decades has led to a deluge of scientific and news articles on the Internet. Harnessing these publications in different fields of study is critical to effective end-user information consumption. Similarly, in the domain of healthcare, one of the key challenges with the adoption of Electronic Health Records (EHRs) for clinical practice has been the tremendous number of clinical notes generated, which must be summarized; without summarization, clinical decision making and communication will be inefficient and costly. In spite of the rapid advances in information retrieval and deep learning techniques towards abstractive document summarization, the results of these efforts continue to resemble extractive summaries, achieving promising results predominantly on lexical metrics but performing poorly on semantic metrics. Thus, abstractive summarization driven by the intrinsic and extrinsic semantics of documents has not been adequately explored. Resources that can be used for generating semantics-driven abstractive summaries include:
• Abstracts of multiple scientific articles published in a given technical field of study, used to generate an abstractive summary for topically related abstracts within the field, thus reducing the load of having to read semantically duplicate abstracts on a given topic.
• Citation contexts from different authoritative papers citing a reference paper, used to generate a utility-oriented abstractive summary for a scientific article.
• Biomedical articles and the named entities characterizing them, along with background knowledge bases, used to generate entity- and fact-aware abstractive summaries.
• Clinical notes of patients and clinical knowledge bases, used for abstractive clinical text summarization via knowledge-driven multi-objective optimization.
In this dissertation, we develop semantics-driven abstractive models based on intra-document and inter-document semantic analyses, along with facts about named entities retrieved from domain-specific knowledge bases, to produce summaries. Concretely, we propose a sequence of frameworks leveraging semantics at various levels of granularity (e.g., word, sentence, document, topic, citations, and named entities) by utilizing external resources. The proposed frameworks have been applied to a range of tasks, including:
1. Abstractive summarization of topic-centric multi-document scientific articles and news articles.
2. Abstractive summarization of scientific articles using crowd-sourced citation contexts.
3. Abstractive summarization of biomedical articles clustered based on entity-relatedness.
4. Abstractive summarization of clinical notes of patients with heart failure and chest X-ray recordings.
The proposed approaches achieve impressive performance in terms of preserving semantics in abstractive summarization while paraphrasing. For summarization of topic-centric multiple scientific/news articles, we propose a three-stage approach in which abstracts of scientific or news articles are clustered based on their topical similarity, determined from topics generated using Latent Dirichlet Allocation (LDA), followed by an extractive phase and an abstractive phase. In the next stage, we focus on abstractive summarization of biomedical literature, where we leverage named entities in biomedical articles to (1) cluster related articles and (2) guide abstractive summarization.
Finally, in the last stage, we turn to external resources: citation contexts pointing to a scientific article are used to generate a comprehensive, utility-centric abstractive summary of that article; domain-specific knowledge bases fill gaps in information about entities in a biomedical article to be summarized; and clinical notes guide abstractive summarization of clinical text. Thus, the bottom-up progression of exploring semantics for abstractive summarization in this dissertation starts with (i) Semantic Analysis of Latent Topics; builds on (ii) Internal and External Knowledge-I (gleaned from abstracts and Citation Contexts); and extends it to make it comprehensive using (iii) Internal and External Knowledge-II (Named Entities and Knowledge Bases).
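The topical-clustering stage that precedes the extractive and abstractive phases can be sketched as grouping documents by their dominant LDA topic. The per-document topic distributions below are illustrative stand-ins for what an LDA model (e.g., gensim's `LdaModel`) would produce; the dominant-topic assignment rule is one simple choice, not necessarily the dissertation's:

```python
# Sketch of clustering documents by dominant topic, given per-document topic
# distributions of the kind LDA produces (one probability per topic).
from collections import defaultdict

def cluster_by_topic(doc_topic_dists):
    """Map dominant-topic index -> list of document indices."""
    clusters = defaultdict(list)
    for doc_id, dist in enumerate(doc_topic_dists):
        dominant = max(range(len(dist)), key=dist.__getitem__)
        clusters[dominant].append(doc_id)
    return dict(clusters)
```

Each resulting cluster of topically related documents would then be passed to the extractive phase, and its output to the abstractive phase.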