6 research outputs found

    Keyphrase Based Evaluation of Automatic Text Summarization

    A major challenge for automatic summary evaluation systems that use fixed n-gram matching is handling the informative content of the text units in the matching process. This limitation causes inaccurate matching between units in peer and reference summaries. The present study introduces a new keyphrase-based summary evaluator, KpEval, for evaluating automatic summaries. KpEval relies on keyphrases because they convey the most important concepts of a text. In the evaluation process, the keyphrases are used in their lemma form as the matching text unit. The system was applied to evaluate different summaries of the Arabic multi-document data set presented at TAC2011. The results showed that the new evaluation technique correlates well with the known evaluation systems Rouge1, Rouge2, RougeSU4, and AutoSummENG MeMoG. KpEval has the strongest correlation with AutoSummENG MeMoG: the Pearson and Spearman correlation coefficients are 0.8840 and 0.9667, respectively. Comment: 4 pages, 1 figure, 3 tables
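The agreement between two evaluators that the abstract above reports can be sketched as follows. This is a minimal illustration of how Pearson and Spearman coefficients are computed over per-system scores, not the authors' code; the function names and the score lists are hypothetical and the numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up scores for five hypothetical systems (NOT data from the paper).
evaluator_a = [0.31, 0.42, 0.28, 0.55, 0.47]
evaluator_b = [0.30, 0.45, 0.25, 0.60, 0.44]
print(pearson(evaluator_a, evaluator_b), spearman(evaluator_a, evaluator_b))
```

Pearson measures linear agreement between the raw scores, while Spearman only compares the rankings the two evaluators induce. Spearman can therefore come out higher when two evaluators order the systems alike but scale their scores differently.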

    Método fuzzy para a sumarização automática de texto com base em um modelo extrativo (FSumm) [Fuzzy method for automatic text summarization based on an extractive model (FSumm)]

    Master's dissertation, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015. Automatic text summarization attempts to condense the content of a document by extracting the most relevant information. This process is usually performed by computational methods that incorporate statistical and linguistic approaches. The rapid development of emerging technologies and the increasing amount of available information pose new challenges for this research area. One of these challenges is identifying the most informative sentences when the summary is generated. Because the task of summarizing textual information carries the uncertainty inherent in natural language, fuzzy logic can be applied to it and contribute to the results. This dissertation therefore proposes a method of automatic text summarization that uses fuzzy logic to classify sentences. The method was developed using an extractive summarization technique combined with information retrieval (IR) and natural language processing (NLP) tasks. The evaluation uses a corpus of Brazilian Portuguese news texts and a tool that automates the evaluation of summaries. The tool analyzes the overlap of text units between the automatic summaries and the human model, producing precision, recall, and F-measure scores that express the informativeness of the summaries. Experiments are also presented that show the effectiveness of the method in classifying the informativeness of sentences.
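The overlap-based evaluation mentioned in the abstract above can be illustrated with a short sketch. It assumes the text units are whitespace-separated words and uses clipped counts; the dissertation's actual tool and unit definition are not given in the abstract, so `overlap_scores` and the two example summaries are hypothetical.

```python
from collections import Counter


def overlap_scores(candidate_units, reference_units):
    """Precision, recall and F-measure from clipped unit overlap."""
    cand = Counter(candidate_units)
    ref = Counter(reference_units)
    # Counter intersection keeps, per unit, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example summaries (assumption: units are lowercase words).
automatic = "the method ranks sentences with fuzzy logic".split()
human = "fuzzy logic ranks the most informative sentences".split()
p, r, f = overlap_scores(automatic, human)
print(p, r, f)
```

Precision is the share of the automatic summary's units that also occur in the human model, recall is the share of the human model's units that the automatic summary recovers, and the F-measure is their harmonic mean.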

    Genetic graph-based in clustering applied to static and streaming data analysis

    Unpublished thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: December 2014. Unsupervised learning techniques have been widely used in data mining over the last few years. These techniques try to identify patterns in a dataset blindly. Clustering is one of the most promising fields in unsupervised learning. It consists of grouping the data by similarity. This field has generated several research works that have tried to deal with different problems related to the pattern extraction and data grouping processes. One of the most innovative clustering methodologies is shape-based or continuity-based clustering, which tries to group data according to the form they define in the space. This dissertation focuses on how to apply genetic algorithms to continuity-based clustering problems. Genetic algorithms have traditionally been used in optimization problems. They are characterized by an encoding, which represents the solution space; a population, or set of chromosomes, which are the potential solutions; and some genetic operations, which are used to evolve the solutions in order to find the best chromosome or solution. The main idea is to take advantage of their potential by generating new algorithms that can improve on the performance of classical clustering algorithms, and to apply them to static and streaming data. In order to design these algorithms, this dissertation builds on the Spectral Clustering algorithm, which studies the spectrum of a similarity graph in order to define the clusters. The clusters defined by Spectral Clustering usually respect the continuity of the data. Using this idea as a starting point, different graph-based genetic algorithms have been designed to deal with the continuity-based clustering problem. The algorithms developed are divided into three generations. The first generation is based on genetic graph-based clustering algorithms. In this generation, graph-based clustering and genetic algorithms are combined to generate a graph topology among the data, in order to find the best way to cut the graph. This cutting process is used to discriminate the final clusters. The main idea is to use hybrid algorithms that combine different metrics extracted from graph theory. To evaluate their performance on real-world problems, these algorithms have also been applied to text summarization. The second generation is based on multi-objective genetic graph-based clustering algorithms. This generation introduces the Pareto front generated by the different fitness functions used in the genetic search. The Pareto front is used to study the solution space and provides more robust and accurate solutions. In this generation, co-evolutionary algorithms were also used to include the number of clusters in the search space. Finally, the last generation focuses on large and streaming data analysis. In this generation, the previous algorithms have been adapted to deal with large data, combining different methodologies such as online clustering and MapReduce. The main idea is to study their performance compared with other algorithms. The dissertation also includes a description of other graph-based bio-inspired algorithms, in this case Ant Colony Optimization clustering algorithms, which were designed during the dissertation in order to extend the range of study to other bio-inspired areas. Finally, to evaluate the algorithms of the different generations, they have been compared with relevant and well-known clustering algorithms using synthetic and real-world datasets extracted from the literature and the UCI Machine Learning Repository.
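The Pareto front used in the multi-objective generation can be illustrated with a generic non-dominated filter. This is a sketch under stated assumptions, not the thesis's algorithm: it assumes every fitness value is to be maximized, and the candidate fitness pairs are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (fitness_1, fitness_2) pairs for five candidate clusterings.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(candidates)
print(front)
```

Each point on the front is a different trade-off between the objectives, so the front describes the set of defensible solutions instead of collapsing the fitness functions into one weighted score. The pairwise filter is quadratic in the number of candidates, which is acceptable for small populations.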

    Generating automated meeting summaries

    This thesis introduces a novel approach to generating abstractive summaries of meetings. While the automatic generation of document summaries has been studied for some decades, the novelty of this thesis lies mainly in its application to the meeting domain (instead of text documents) and in its use of a lexicalized representation formalism based on Frame Semantics, which allows summaries to be generated abstractively (instead of extractively). We argue that abstractive approaches are better suited than extractive ones to summarizing spontaneous spoken interactions.