30 research outputs found
Question-driven text summarization with extractive-abstractive frameworks
Automatic Text Summarisation (ATS) is becoming increasingly important due to the exponential growth of textual content on the Internet. The primary goal of an ATS system is to generate a condensed version of the key aspects in the input document while minimizing redundancy. ATS approaches are extractive, abstractive, or hybrid. The extractive approach selects the most important sentences in the input document(s) and then concatenates them to form the summary. The abstractive approach represents the input document(s) in an intermediate form and then constructs the summary using different sentences than the originals. The hybrid approach combines both the extractive and abstractive approaches. Query-based ATS selects the information most relevant to an initial search query. Question-driven ATS is a technique to produce concise and informative answers to specific questions using a document collection.
In this thesis, a novel hybrid framework is proposed for question-driven ATS that takes advantage of both extractive and abstractive summarisation mechanisms. The framework consists of complementary modules that work together to generate an effective summary: (1) discovering appropriate non-redundant sentences as plausible answers using a multi-hop question answering system based on a Convolutional Neural Network (CNN), a multi-head attention mechanism, and a reasoning process; and (2) rewriting the extracted sentences in an abstractive setup using a novel transformer-based paraphrasing Generative Adversarial Network (GAN) model. In addition, a fusing mechanism is proposed for compressing the sentence pairs selected by a next-sentence-prediction model in the paraphrased summary. Extensive experiments on various datasets are performed, and the results show the model can outperform many question-driven and query-based baseline methods. The proposed model is adaptable to generating summaries for questions in both closed and open domains. An online summariser demo based on the proposed model is designed for industry use in processing technical text.
WiFi-Based Human Activity Recognition Using Attention-Based BiLSTM
Recently, significant efforts have been made to explore human activity recognition (HAR) techniques that use information gathered by existing indoor wireless infrastructures through WiFi signals, without requiring the monitored subject to carry a dedicated device. The key intuition is that different activities introduce different multi-paths in WiFi signals and generate different patterns in the time series of channel state information (CSI). In this paper, we propose and evaluate a full pipeline for a CSI-based human activity recognition framework for 12 activities in three different spatial environments using two deep learning models: ABiLSTM and CNN-ABiLSTM. Evaluation experiments have demonstrated that the proposed models outperform state-of-the-art models. Also, the experiments show that the proposed models can be applied to other environments with different configurations, albeit with some caveats. The proposed ABiLSTM model achieves an overall accuracy of 94.03%, 91.96%, and 92.59% across the three target environments, while the proposed CNN-ABiLSTM model reaches accuracies of 98.54%, 94.25%, and 95.09% across those same environments.
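The attention step at the heart of the ABiLSTM model can be illustrated in isolation. The following is a minimal NumPy sketch of attention-based pooling over a sequence of hidden states; the dimensions, random weights, and scoring function are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w):
    """Collapse a (T, d) sequence of hidden states into one d-vector.

    H : hidden states from a (Bi)LSTM, one row per time step.
    w : learned attention vector of size d (random here, for illustration).
    """
    scores = np.tanh(H) @ w          # one relevance score per time step
    alpha = softmax(scores)          # normalised attention weights
    return alpha @ H, alpha          # weighted sum over time steps

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))         # 20 time steps of CSI features, dim 8
w = rng.normal(size=8)
context, alpha = attention_pool(H, w)
print(context.shape, round(alpha.sum(), 6))   # (8,) 1.0
```

The pooled `context` vector would then feed a classifier over the 12 activity classes; the attention weights `alpha` indicate which time steps drive the decision.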
Automatic Text Summarization
Writing text was one of the first ever methods used by humans to represent their knowledge.
Text can be of different types and have different purposes.
Due to the evolution of information systems and the Internet, the amount of textual information available has increased exponentially on a worldwide scale, and many documents contain a large proportion of unnecessary information. As a result, most readers have difficulty digesting all the extensive information contained in the many documents produced on a daily basis.
A simple solution to the excess of irrelevant information in texts is to create summaries, in which we keep the parts related to the subject and remove the unnecessary ones.
In Natural Language Processing, the goal of automatic text summarization is to create systems that process text and keep only the most important data. Since its inception, several approaches have been designed to create better text summaries, and they can be divided into two separate groups: extractive approaches and abstractive approaches. In the first group, the summarizers decide which text elements should be in the summary; the criteria by which they are selected are diverse. After selection, the elements are combined into the summary. In the second group, the text elements are generated from scratch. Abstractive summarizers are much more complex, so they still require a lot of research in order to produce good results.
During this thesis, we investigated state-of-the-art approaches, implemented our own versions and tested them on conventional datasets, such as the DUC dataset.
Our first approach was a frequency-based approach, since it analyses the frequency with which the text's words/sentences appear in the text. Higher-frequency words/sentences automatically receive higher scores, which are then filtered with a compression rate and combined into a summary.
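The frequency-based scoring and compression-rate filtering described above can be sketched as follows; the scoring function (mean word frequency per sentence) and the toy document are illustrative assumptions, not the thesis implementation:

```python
from collections import Counter

def frequency_summary(sentences, compression=0.5):
    """Score sentences by the corpus frequency of their words and keep
    the top fraction given by the compression rate."""
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)
    # a sentence's score is the mean frequency of its words
    scores = [sum(freq[w.lower()] for w in s.split()) / len(s.split())
              for s in sentences]
    k = max(1, int(len(sentences) * compression))
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]  # keep original order

doc = ["The cat sat on the mat.",
       "Dogs are loyal animals.",
       "The cat chased the dog."]
print(frequency_summary(doc, compression=0.34))
# ['The cat chased the dog.']
```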
In our second approach, we improved the original TextRank algorithm by combining it with word embedding vectors. The goal was to represent the text's sentences as nodes of a graph and, with the help of word embeddings, determine how similar pairs of sentences are, ranking them by their similarity scores. The highest-ranking sentences were filtered with a compression rate and picked for the summary.
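The embedding-based TextRank variant can be sketched with a cosine-similarity graph and plain power iteration. The random vectors below stand in for real sentence embeddings, and the damping factor and update rule follow the standard PageRank formulation, not the thesis code:

```python
import numpy as np

def textrank(sent_vecs, d=0.85, iters=50):
    """Rank sentences by PageRank over a cosine-similarity graph.

    sent_vecs : (n, dim) array of sentence embeddings (averaged word
    vectors in practice; random stand-ins here).
    """
    V = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    S = V @ V.T                              # pairwise cosine similarity
    np.fill_diagonal(S, 0.0)                 # no self-edges
    S = np.clip(S, 0.0, None)                # keep only positive-similarity edges
    W = S / (S.sum(axis=0, keepdims=True) + 1e-12)  # column-normalised weights
    n = len(sent_vecs)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                   # PageRank power iteration
        r = (1 - d) / n + d * (W @ r)
    return r

rng = np.random.default_rng(1)
vecs = rng.normal(size=(5, 16))              # 5 sentences, 16-dim embeddings
ranks = textrank(vecs)
print(np.argsort(-ranks))                    # sentence indices, best first
```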
In the third approach, we combined feature analysis with deep learning. By analysing certain characteristics of the text's sentences, one can assign scores that represent the importance of a given sentence for the summary. With these computed values, we created a dataset for training a deep neural network that is capable of deciding whether a certain sentence should be in the summary or not.
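The feature-scoring idea can be illustrated with a single-neuron stand-in for the deep network. The three features and the hand-set weights below are hypothetical, chosen only to show how per-sentence feature vectors turn into importance scores:

```python
import math

def sentence_features(sentence, position, n_sentences, keywords):
    """Three illustrative features: position, length, keyword density."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return [
        1.0 - position / max(1, n_sentences - 1),       # earlier = better
        min(len(words) / 20.0, 1.0),                    # normalised length
        sum(w in keywords for w in words) / len(words)  # keyword density
    ]

def score(features, weights, bias=0.0):
    """A one-neuron 'network': logistic regression over the features.
    Weights are hand-set for illustration, not trained."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))                   # sigmoid

doc = ["Summarization condenses documents.",
       "It has many applications.",
       "The weather was nice that day."]
keywords = {"summarization", "documents", "applications"}
weights = [0.8, 0.3, 2.0]
scores = [score(sentence_features(s, i, len(doc), keywords), weights)
          for i, s in enumerate(doc)]
print([round(x, 3) for x in scores])  # decreasing: on-topic sentences first
```

In the thesis, such feature vectors form the training set for a real neural network; here the weights merely illustrate the mapping from features to a per-sentence score.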
An abstractive encoder-decoder summarizer was created with the purpose of generating words related to the document's subject and combining them into a summary. Finally, every single summarizer was combined into a full system.
Each of our approaches was evaluated with several evaluation metrics, such as ROUGE. We used the DUC dataset for this purpose, and the results were fairly similar to those reported in the scientific community. As for our encoder-decoder, we got promising results.
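The ROUGE evaluation mentioned above reduces, in its simplest ROUGE-1 form, to clipped unigram overlap. The sketch below is a simplified illustration, omitting the stemming and stopword handling of real implementations:

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1 recall/precision/F1 from unigram overlap (a simplified
    sketch of the metric, not a reference implementation)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    recall = overlap / max(1, sum(ref.values()))
    precision = overlap / max(1, sum(cand.values()))
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

r, p, f = rouge1("the cat sat on the mat",
                 "the cat lay on the mat")
print(round(r, 3), round(p, 3), round(f, 3))   # 0.833 0.833 0.833
```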
Ground Truth Spanish Automatic Extractive Text Summarization Bounds
Textual information is growing at an accelerated rate in the languages most spoken by native Internet users, such as Chinese, Spanish, English, Arabic, Hindi, Portuguese, Bengali, and Russian, among others. It is necessary to innovate the methods of Automatic Text Summarization (ATS) so that they can extract essential information without reading the entire text. The most competent methods are Extractive ATS (EATS) methods, which extract essential parts of the document (sentences, phrases, or paragraphs) to compose a summary. During the last 60 years of research on EATS, the creation of standard corpora with human-generated summaries, and of evaluation methods that are highly correlated with human judgments, has helped to increase the number of new state-of-the-art methods. However, these methods mainly support the English language, leaving aside other equally important languages such as Spanish, which is the second most spoken language by natives and the third most used on the Internet. A standard corpus for Spanish EATS (SAETS) is created to evaluate the state-of-the-art methods and systems for the Spanish language. The main contribution consists of a proposal for the configuration and evaluation of five state-of-the-art methods, five systems, and four heuristics using three evaluation methods (ROUGE, ROUGE-C, and Jensen-Shannon divergence). It is the first time that Jensen-Shannon divergence is used to evaluate EATS. In this paper the ground truth bounds for the Spanish language are presented, which are the heuristics baseline:first, baseline:random, topline, and concordance. In addition, the ranking of 30 evaluation tests of the state-of-the-art methods and systems is calculated, which forms a benchmark for SAETS.
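The Jensen-Shannon divergence used here as an evaluation method compares the word distributions of two texts. The following is a minimal sketch (log base 2, without the smoothing and preprocessing a real evaluation pipeline would add):

```python
import math
from collections import Counter

def js_divergence(text_a, text_b):
    """Jensen-Shannon divergence between the word distributions of two
    texts, e.g. a summary and its source document."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(ca) | set(cb)
    p = {w: ca[w] / sum(ca.values()) for w in vocab}
    q = {w: cb[w] / sum(cb.values()) for w in vocab}
    m = {w: (p[w] + q[w]) / 2 for w in vocab}   # mixture distribution

    def kl(x, y):
        # terms with x[w] == 0 contribute nothing
        return sum(x[w] * math.log2(x[w] / y[w]) for w in vocab if x[w] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence("the cat sat", "the cat sat")
diff = js_divergence("the cat sat", "dogs bark loudly")
print(round(same, 6), round(diff, 6))   # 0.0 for identical texts, 1.0 for disjoint
```

With base-2 logarithms the divergence is bounded in [0, 1], which makes it convenient as a normalised summary-quality signal.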
The Construct of Identity in Hellenistic Judaism
This volume assembles twenty-three essays by Erich S. Gruen, who has written extensively on the literature and history of early Judaism and the experience of the Jews in the Greco-Roman world. Twenty-two of the articles have previously been published, and one new one was composed for the volume
Towards More Human-Like Text Summarization: Story Abstraction Using Discourse Structure and Semantic Information.
PhD Thesis

With the massive amount of textual data being produced every day, the ability to effectively summarise text documents is becoming increasingly important. Automatic text summarization entails the selection and generalisation of the most salient points of a text in order to produce a summary. Approaches to automatic text summarization can fall into one of two categories: abstractive or extractive approaches. Extractive approaches involve the selection and concatenation of spans of text from a given document. Research in automatic text summarization began with extractive approaches, scoring and selecting sentences based on the frequency and proximity of words. In contrast, abstractive approaches are based on a process of interpretation, semantic representation, and generalisation. This is closer to the processes that psycholinguistics tells us that humans perform when reading, remembering and summarizing. However, in the sixty years since its inception, the field has largely remained focused on extractive approaches.

This thesis aims to answer the following questions. Does knowledge about the discourse structure of a text aid the recognition of summary-worthy content? If so, which specific aspects of discourse structure provide the greatest benefit? Can this structural information be used to produce abstractive summaries, and are these more informative than extractive summaries? To thoroughly examine these questions, they are each considered in isolation, and as a whole, on the basis of both manual and automatic annotations of texts. Manual annotations facilitate an investigation into the upper bounds of what can be achieved by the approach described in this thesis. Results based on automatic annotations show how this same approach is impacted by the current performance of imperfect preprocessing steps, and indicate its feasibility.

Extractive approaches to summarization are intrinsically limited by the surface text of the input document, in terms of both content selection and summary generation. Beginning with a motivation for moving away from these commonly used methods of producing summaries, I set out my methodology for a more human-like approach to automatic summarization which examines the benefits of using discourse-structural information. The potential benefit of this is twofold: moving away from a reliance on the wording of a text in order to detect important content, and generating concise summaries that are independent of the input text. The importance of discourse structure to signal key textual material has previously been recognised; however, it has seen little applied use in the field of automatic summarization. A consideration of evaluation metrics also features significantly in the proposed methodology. These play a role in both preprocessing steps and in the evaluation of the final summary product. I provide evidence which indicates a disparity between the performance of coreference resolution systems as indicated by their standard evaluation metrics, and their performance in extrinsic tasks. Additionally, I point out a range of problems for the most commonly used metric, ROUGE, and suggest that at present summary evaluation should not be automated.

To illustrate the general solutions proposed to the questions raised in this thesis, I use Russian Folk Tales as an example domain. This genre of text has been studied in depth and, most importantly, it has a rich narrative structure that has been recorded in detail. The rules of this formalism are suitable for the narrative structure reasoning system presented as part of this thesis. The specific discourse-structural elements considered cover the narrative structure of a text, coreference information, and the story-roles fulfilled by different characters.

The proposed narrative structure reasoning system produces high-level interpretations of a text according to the rules of a given formalism. For the example domain of Russian folktales, a system is implemented which constructs such interpretations of a tale according to an existing set of rules and restrictions. I discuss how this process of detecting narrative structure can be transferred to other genres, and a key factor in the success of this process: how constrained the rules of the formalism are. The system enumerates all possible interpretations according to a set of constraints, meaning a less restricted rule set leads to a greater number of interpretations.

For the example domain, sentence-level discourse-structural annotations are then used to predict summary-worthy content. The results of this study are analysed in three parts. First, I examine the relative utility of individual discourse features and provide a qualitative discussion of these results. Second, the predictive abilities of these features are compared when they are manually annotated to when they are annotated with varying degrees of automation. Third, these results are compared to the predictive capabilities of classic extractive algorithms. I show that discourse features can be used to more accurately predict summary-worthy content than classic extractive algorithms. This holds true for automatically obtained annotations, but with a much clearer difference when using manual annotations.

The classifiers learned in the prediction of summary-worthy sentences are subsequently used to inform the production of both extractive and abstractive summaries to a given length. A human-based evaluation is used to compare these summaries, as well as the outputs of a classic extractive summarizer. I analyse the impact of knowledge about discourse structure, obtained both manually and automatically, on summary production. This allows for some insight into the knock-on effects on summary production that can occur from inaccurate discourse information (narrative structure and coreference information). My analyses show that even given inaccurate discourse information, the resulting abstractive summaries are considered more informative than their extractive counterparts. With human-level knowledge about discourse structure, these results are even clearer.

In conclusion, this research provides a framework which can be used to detect the narrative structure of a text, and shows its potential to provide a more human-like approach to automatic summarization. I show the limit of what is achievable with this approach both when manual annotations are obtainable, and when only automatic annotations are feasible. Nevertheless, this thesis supports the suggestion that the future of summarization lies with abstractive and not extractive techniques.
Automatic Structured Text Summarization with Concept Maps
Efficiently exploring a collection of text documents in order to answer a complex question is a challenge that many people face. As abundant information on almost any topic is electronically available nowadays, supporting tools are needed to ensure that people can profit from the information's availability rather than suffer from the information overload. Structured summaries can help in this situation: They can be used to provide a concise overview of the contents of a document collection, they can reveal interesting relationships and they can be used as a navigation structure to further explore the documents. A concept map, which is a graph representing concepts and their relationships, is a specific form of a structured summary that offers these benefits. However, despite its appealing properties, only a limited amount of research has studied how concept maps can be automatically created to summarize documents. Automating that task is challenging and requires a variety of text processing techniques including information extraction, coreference resolution and summarization. The goal of this thesis is to better understand these challenges and to develop computational models that can address them. As a first contribution, this thesis lays the necessary ground for comparable research on computational models for concept map-based summarization. We propose a precise definition of the task together with suitable evaluation protocols and carry out experimental comparisons of previously proposed methods. As a result, we point out limitations of existing methods and gaps that have to be closed to successfully create summary concept maps. Towards that end, we also release a new benchmark corpus for the task that has been created with a novel, scalable crowdsourcing strategy. Furthermore, we propose new techniques for several subtasks of creating summary concept maps.
First, we introduce the usage of predicate-argument analysis for the extraction of concept and relation mentions, which greatly simplifies the development of extraction methods. Second, we demonstrate that a predicate-argument analysis tool can be ported from English to German with low effort, indicating that the extraction technique can also be applied to other languages. We further propose to group concept mentions using pairwise classifications and set partitioning, which significantly improves the quality of the created summary concept maps. We show similar improvements for a new supervised importance estimation model and an optimal subgraph selection procedure. By combining these techniques in a pipeline, we establish a new state-of-the-art for the summarization task. Additionally, we study the use of neural networks to model the summarization problem as a single end-to-end task. While such approaches are not yet competitive with pipeline-based approaches, we report several experiments that illustrate the challenges - mostly related to training data - that currently limit the performance of this technique. We conclude the thesis by presenting a prototype system that demonstrates the use of automatically generated summary concept maps in practice and by pointing out promising directions for future research on the topic of this thesis
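A summary concept map can be represented as a set of relation triples over concepts. The sketch below uses hypothetical importance scores and a simple greedy selection, a much cruder stand-in for the supervised importance estimation and optimal subgraph selection described above:

```python
# A concept map as relation triples, with a greedy budgeted selection
# of the most important concepts (all scores are invented for illustration).
triples = [
    ("summarization", "reduces", "documents"),
    ("concept map", "is a", "structured summary"),
    ("concept map", "contains", "concepts"),
    ("coreference resolution", "supports", "extraction"),
]
importance = {"concept map": 0.9, "structured summary": 0.8,
              "concepts": 0.6, "summarization": 0.5,
              "documents": 0.4, "coreference resolution": 0.2,
              "extraction": 0.1}

def select_map(triples, importance, budget=3):
    """Keep triples whose two concepts are among the `budget` most
    important ones, yielding a small summary concept map."""
    keep = set(sorted(importance, key=importance.get, reverse=True)[:budget])
    return [(s, r, o) for s, r, o in triples if s in keep and o in keep]

for s, r, o in select_map(triples, importance):
    print(f"{s} --{r}--> {o}")
```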
Actes de la conférence Traitement Automatique de la Langue Naturelle, TALN 2018: Volume 2 : Démonstrations, articles des Rencontres Jeunes Chercheurs, ateliers DeFT
International audience