38 research outputs found
Automatic Generation of Text Summaries - Challenges, proposals and experiments
Students and researchers in natural language processing, artificial intelligence, computer science, and computational linguistics will probably be the first readers interested in this book. It is also intended, however, to introduce non-specialist readers to this promising research area; for that reason, we have translated into Spanish some of the technical terms and anglicisms of the discipline, while always giving the English term as well, to avoid confusion and to help interested readers broaden their sources of knowledge. This book presents a computational method for the automatic generation of text summaries that is novel at the international level, since it surpasses the quality of the summaries that can currently be produced. That is, it is the result of research that sought computational methods and models that depend as little as possible on language and domain
Automatic text summarization with Maximal Frequent Sequences
In the last two decades, an exponential increase in the available electronic information
has created a pressing need to understand large volumes of information quickly.
This raises the importance of developing automatic methods that detect the
most relevant content of a document in order to produce a shorter text.
Automatic Text Summarization (ATS) is an active research area dedicated to
generating abstractive and extractive summaries, not only for a single document
but also for a collection of documents. A further need is to find ATS methods
that are independent of language and domain.
In this book we consider extractive summarization for the single-document
task. We have identified that a typical extractive summarization method consists
of four steps. The first step is term selection, in which one decides what units
count as individual terms. The process of estimating the usefulness of the
individual terms is called the term weighting step. The next step is sentence
weighting, in which every sentence receives a numerical measure according to
the usefulness of its terms. Finally, the process of selecting the most relevant
sentences is called sentence selection. Different extractive summarization
methods can be characterized by how they perform these steps.
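The four steps described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the book's method: single lowercase words as terms, raw term frequency as the term weight, a sum over distinct terms as the sentence weight, and a top-k cut as the selection rule. All function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_terms(sentence):
    # Step 1, term selection: here a term is a single lowercase word.
    return re.findall(r"[a-z]+", sentence.lower())


def weight_terms(sentences):
    # Step 2, term weighting: frequency of each term in the whole document.
    return Counter(t for s in sentences for t in select_terms(s))


def weight_sentence(sentence, term_weights):
    # Step 3, sentence weighting: sum of the weights of its distinct terms.
    return sum(term_weights[t] for t in set(select_terms(sentence)))


def select_sentences(sentences, scores, k):
    # Step 4, sentence selection: keep the k best, in document order.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in sorted(ranked[:k])]


def summarize(text, k=1):
    sentences = split_sentences(text)
    weights = weight_terms(sentences)
    scores = [weight_sentence(s, weights) for s in sentences]
    return select_sentences(sentences, scores, k)
```

Swapping any one function (for example, multiword terms in `select_terms`) changes a single step while leaving the others untouched, which is the sense in which methods can be characterized by how they perform each step.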
In this book, for the term selection step, we describe how to detect multiword
descriptions using Maximal Frequent Sequences (MFSs), which bear important
meaning, while non-maximal frequent sequences (FSs), those that are
parts of another FS, should not be considered. An additional motivation was
cost versus benefit: there are too many non-maximal FSs, while their
probability of bearing important meaning is lower. In any case, MFSs represent
all FSs in a compact way: all FSs can be obtained from the MFSs by bursting
each MFS into the set of all its subsequences.
New methods based on graph algorithms, genetic algorithms, and clustering
algorithms that facilitate the text summarization task are presented. We
have tested different combinations of term selection, term weighting, sentence
weighting, and sentence selection options for language- and domain-independent
extractive single-document text summarization on a collection of news reports.
We analyzed several options based on MFSs, considering them with graph, genetic,
and clustering algorithms. We obtained results superior to those of existing
state-of-the-art methods.
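The compact representation mentioned above, where every FS can be recovered by bursting each MFS into its subsequences, can be sketched as follows. The sketch assumes that a subsequence is a contiguous run of words (an n-gram); the abstract does not say whether gaps are allowed, so this is an assumption, and the function names are hypothetical.

```python
def contiguous_subsequences(sequence):
    # All non-empty contiguous runs of the sequence, as tuples of words.
    # Assumption: "subsequence" means a contiguous n-gram, without gaps.
    words = tuple(sequence)
    n = len(words)
    return {words[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def expand_mfss(mfss):
    # Burst every maximal sequence and merge the results into one set,
    # so shared subsequences are listed only once.
    result = set()
    for mfs in mfss:
        result |= contiguous_subsequences(mfs)
    return result
```

A sequence of n words has n(n+1)/2 contiguous subsequences, which gives a sense of how many non-maximal sequences a single maximal one stands for.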
This book is addressed to students and scientists in the area of Computational
Linguistics, and also to anyone who wants to learn about recent developments in
the automatic generation of text summaries
Gestión del conocimiento en la micro y pequeña empresa mexicana de la industria del software
Companies that develop software in Mexico are microenterprises with ten or fewer employees that lack knowledge management systems. The starting point is a model of knowledge transfer from expert developers to non-expert ones through a knowledge management system. The research focused on the question: how does knowledge management take place in the micro and small enterprises that develop software in Mexico? The problem in these organizations was found to be not only technological but also cultural. An instrument was therefore designed to measure the culture of knowledge management; it was applied and the results are presented. Developers were found to consider knowledge sharing very important and, in fact, to do it informally. The conclusion is that these organizations should include knowledge management in their software development processes
Reglas que describen la deserción y permanencia en los estudiantes de la UAP Tianguistenco de la UAEM
The aim is to find the set of knowledge rules that can be extracted from students who have dropped out of, or remained in, their university studies three years after enrolling. An initial database with 206 factors and 305 students from four undergraduate programs at the UAP Tianguistenco of the UAEM was used. Using decision trees, it was possible to determine that only 12 factors in 19 rules are enough to tell, with 82.6% support, whether or not a student is at risk of dropping out within the three years after enrollment
Calculating the Upper Bounds for Portuguese Automatic Text Summarization Using Genetic Algorithm
Over the last years, Automatic Text Summarization (ATS) has been considered one of the main tasks in Natural Language Processing (NLP), generating summaries in several languages (e.g., English, Portuguese, Spanish). One of the most significant advances in ATS has been made for Portuguese, as reflected in the various state-of-the-art methods proposed. It is essential to know the performance of the different state-of-the-art methods with respect to the upper bounds (Topline), the lower bounds (Baseline-random), and other heuristics (Baseline-first). In recent works, the significance and upper bounds for Single-Document Summarization (SDS) and Multi-Document Summarization (MDS) were calculated using corpora from the Document Understanding Conferences (DUC). In this paper, the upper bounds for SDS in Portuguese are calculated using Genetic Algorithms (GA). Moreover, we present a comparison of some state-of-the-art methods with respect to the upper bounds, lower bounds, and heuristics to determine their level of significance
Calculating the Upper Bounds for Multi-Document Summarization using Genetic Algorithms
Over the last years, several Multi-Document Summarization (MDS) methods have been presented at the Document Understanding Conference (DUC) workshops. Since DUC01, several methods have been presented in approximately 268 state-of-the-art publications, which have allowed the continuous improvement of MDS; however, in most works the upper bounds were unknown. Recently, some works have focused on calculating the best sentence combinations of a set of documents, and in previous works we calculated the significance for the single-document summarization task on the DUC01 and DUC02 datasets. However, no significance analysis has been performed for the MDS task to rank the best multi-document summarization methods. In this paper, we describe a Genetic Algorithm-based method for calculating the best sentence combinations of the DUC01 and DUC02 datasets in MDS through a Meta-document representation. Moreover, we have calculated three heuristics mentioned in several state-of-the-art works to rank the most recent MDS methods through the calculation of upper bounds and lower bounds
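The idea of searching for the best sentence combination with a genetic algorithm can be sketched as below. This is an illustrative sketch under stated assumptions, not the method of the paper: fitness is plain unigram recall against one reference summary (a stand-in for whatever evaluation measure is actually used), a candidate is a bit mask over sentences, and the operators (truncation selection, one-point crossover, single-bit mutation) are generic choices. All names and parameters are hypothetical.

```python
import random


def unigram_recall(candidate_words, reference_words):
    # Share of distinct reference words that the candidate covers.
    ref = set(reference_words)
    return len(ref & set(candidate_words)) / len(ref) if ref else 0.0


def fitness(mask, sentences, reference, max_words):
    # Score a bit mask; combinations over the word budget score zero.
    chosen = [w for bit, s in zip(mask, sentences) if bit for w in s.split()]
    if len(chosen) > max_words:
        return 0.0
    return unigram_recall(chosen, reference.split())


def genetic_upper_bound(sentences, reference, max_words,
                        generations=50, pop_size=20, seed=0):
    n = len(sentences)
    if n == 0:
        return [], 0.0
    rng = random.Random(seed)  # seeded so that runs are repeatable

    def score(mask):
        return fitness(mask, sentences, reference, max_words)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged (elitism).
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = a[:cut] + b[cut:]          # one-point crossover
            flip = rng.randrange(n)
            child[flip] = 1 - child[flip]      # single-bit mutation
            children.append(child)
        population = parents + children
    best = max(population, key=score)
    return best, score(best)
```

The returned score is only an estimate of the upper bound: a genetic search may stop short of the true optimum, so for small inputs an exhaustive search over all masks is a useful check.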
Evolutionary Automatic Text Summarization using Cluster Validation Indexes
The main problem in generating an extractive automatic text summary (EATS) is detecting the key themes of a text. For this task, unsupervised approaches cluster the sentences of the original text to find the key sentences that take part in an automatic summary. The quality of an automatic summary is evaluated using similarity metrics with human-made summaries. However, the relationship between the quality of the human-made summaries and the internal quality of the clustering is unclear. First, this paper compares how well the internal quality measured by different clustering validation indexes correlates with the quality of human-made summaries, in order to find the index with the best correlation. Second, it proposes an evolutionary method for the automatic text summarization task based on that best internal clustering validation index. Our proposed unsupervised method for EATS has the advantage of not requiring information regarding the specific classes or themes of a text, and is therefore domain- and language-independent. The strong results obtained by our method on the most competitive standard collection for EATS show that it maintains a high correlation with human-made summaries while meeting the specific features of the groups, for example, compaction, separation, distribution, and density
Extractive Automatic Text Summarization Based on Lexical-Semantic Keywords
The automatic text summarization (ATS) task consists of automatically synthesizing a document to provide a condensed version of it. Creating a summary requires not only selecting the main topics of the sentences but also identifying the key relationships between these topics. Related works rank text units (mainly sentences) to select those that could form the summary. However, the resulting summaries may not include all the topics covered in the source text because important information may have been discarded. In addition, the semantic structure of documents has barely been explored in this field. Thus, this study proposes a new method for the ATS task that takes advantage of semantic information to improve keyword detection. The proposed method increases not only the coverage, by clustering the sentences to identify the main topics in the source document, but also the precision, by detecting the keywords in the clusters. The experimental results of this work indicate that the proposed method outperformed previous methods on a standard collection
The Impact of Key Ideas on Automatic Deception Detection in Text
In recent years, with the rise of the Internet, automatic deception detection in text has become an important task for recognizing documents that try to make people believe something false. Current studies in this field assume that the entire document contains cues to identify deception; however, as demonstrated in this work, some irrelevant ideas in a text can affect the performance of the classification. Therefore, this research proposes an approach to deception detection in text that first identifies the key ideas in a document, based on a topic modeling algorithm and a proposed automatic extractive text summarization method, to produce a synthesized document that avoids secondary ideas. The experimental results of this study indicate that the proposed method outperforms previous methods on standard collections