30 research outputs found

    Ground Truth Spanish Automatic Extractive Text Summarization Bounds

    Textual information is growing at an accelerating rate in the languages most spoken by native Internet users, such as Chinese, Spanish, English, Arabic, Hindi, Portuguese, Bengali, and Russian, among others. It is necessary to innovate in Automatic Text Summarization (ATS) methods that can extract essential information so that readers need not read the entire text. The most competent methods are Extractive ATS (EATS) methods, which extract essential parts of the document (sentences, phrases, or paragraphs) to compose a summary. Over the last 60 years of EATS research, the creation of standard corpora with human-generated summaries, together with evaluation methods that correlate highly with human judgments, has helped increase the number of new state-of-the-art methods. However, these methods mainly support the English language, leaving aside other equally important languages such as Spanish, which is the second most spoken language by native speakers and the third most used on the Internet. A standard corpus for Spanish EATS (SAETS) is created to evaluate the state-of-the-art methods and systems for the Spanish language. The main contribution is a proposal for the configuration and evaluation of five state-of-the-art methods, five systems, and four heuristics using three evaluation methods (ROUGE, ROUGE-C, and Jensen-Shannon divergence). It is the first time that Jensen-Shannon divergence has been used to evaluate EATS. This paper presents the ground truth bounds for the Spanish language, which are the heuristics baseline:first, baseline:random, topline, and concordance. In addition, a ranking of 30 evaluation tests of the state-of-the-art methods and systems is calculated, forming a benchmark for SAETS.
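The Jensen-Shannon divergence named in the abstract above can be illustrated with a minimal sketch that compares the word distribution of a source text with that of a candidate summary. The whitespace tokenization, the lack of smoothing, and the function names (`word_distribution`, `kl_divergence`, `js_divergence`) are assumptions made for illustration; the abstract does not state how the paper configures the measure.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary):
    """Relative frequency of each vocabulary word in text (assumed tokenizer: whitespace)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(source, summary):
    """Jensen-Shannon divergence between the word distributions of two texts.

    With base-2 logarithms the value lies in [0, 1]; lower means the summary's
    word distribution is closer to the source's.
    """
    vocabulary = sorted(set(source.lower().split()) | set(summary.lower().split()))
    p = word_distribution(source, vocabulary)
    q = word_distribution(summary, vocabulary)
    # The mixture m is positive wherever p or q is, so the KL terms are well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Under these assumptions, identical texts score 0 and texts with no words in common score 1, so a lower value indicates a summary whose vocabulary usage is closer to the source.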

    Génération de résumés par abstraction

    This Ph.D. thesis is the result of several years of research on automatic text summarization. Three major contributions are presented in the form of published and submitted papers. They follow a path that moves away from extractive summarization and toward abstractive summarization. The first article describes the HexTac experiment, which was conducted to evaluate the performance of humans summarizing text by extracting sentences. Results show a wide performance gap between human summaries written by sentence extraction and those written without restriction. This empirical performance ceiling of sentence extraction demonstrates the need for new approaches to text summarization. We then developed and implemented a system, which is the subject of the second article, using the Fully Abstractive Summarization approach. Though the name suggests otherwise, this approach is better categorized as semi-extractive, along with sentence compression and sentence fusion. Building and evaluating this system brought to light the great challenge associated with generating easily readable summaries without extracting sentences. In this approach, text understanding is not deep enough to provide help in the content selection process, as is the case in extractive summarization. As the third contribution, a knowledge-based approach to abstractive summarization called K-BABS was proposed. Relevant content is identified by pattern matching on an analysis of the source text, and rules are applied to directly generate sentences for the summary. This approach is implemented in a system called ABSUM, which generates very short and content-rich summaries. An evaluation was performed according to today's standards. The evaluation shows that hybrid summaries generated by adding extracted sentences to ABSUM's output have significantly more content than a state-of-the-art extractive summarizer.

    Data mining techniques for complex application domains

    The emergence of advanced communication techniques has increased the availability of large collections of data in electronic form in a number of application domains, including healthcare, e-business, and e-learning. Every day, a large number of records are stored electronically. However, finding useful information in such large data collections is a challenging issue. Data mining technology aims to automatically extract hidden knowledge from large data repositories by exploiting sophisticated algorithms. The hidden knowledge in the electronic data may potentially be used to improve the procedures, productivity, and reliability of several application domains. The PhD activity has focused on novel and effective data mining approaches to tackle the complex data coming from two main application domains: healthcare data analysis and textual data analysis. The research activity in the context of healthcare data addressed the application of different data mining techniques to discover valuable knowledge from real exam-log data of patients. In particular, efforts have been devoted to the extraction of medical pathways, which can be exploited to analyze the actual treatments followed by patients. The derived knowledge not only provides useful information for dealing with treatment procedures but may also play an important role in future predictions of potential patient risks associated with medical treatments. The research effort in textual data analysis is twofold. On the one hand, a novel approach to the discovery of succinct summaries of large document collections has been proposed. On the other hand, the suitability of an established descriptive data mining technique to support domain experts in making decisions has been investigated. Both research activities focus on adopting widely used exploratory data mining techniques for textual data analysis, which requires overcoming intrinsic limitations of traditional algorithms in handling textual documents efficiently and effectively.

    Automatic Text Summarization

    Writing text was one of the first methods ever used by humans to represent their knowledge. Text can be of different types and have different purposes. Due to the evolution of information systems and the Internet, the amount of textual information available has increased exponentially on a worldwide scale, and many documents tend to contain a percentage of unnecessary information. As a result, most readers have difficulty digesting all the extensive information contained in the many documents produced on a daily basis. A simple solution to the excessive irrelevant information in texts is to create summaries, in which we keep the parts related to the subject and remove the unnecessary ones. In Natural Language Processing, the goal of automatic text summarization is to create systems that process text and keep only the most important data. Since its creation, several approaches have been designed to create better text summaries, which can be divided into two separate groups: extractive approaches and abstractive approaches. In the first group, the summarizers decide which text elements should be in the summary. The criteria by which they are selected are diverse. After they are selected, they are combined into the summary. In the second group, the text elements are generated from scratch. Abstractive summarizers are much more complex, so they still need a lot of research in order to produce good results. During this thesis, we investigated the state-of-the-art approaches, implemented our own versions, and tested them on conventional datasets, such as the DUC dataset. Our first approach was a frequency-based approach, since it analyses the frequency with which the text's words/sentences appear in the text. Higher-frequency words/sentences automatically receive higher scores, which are then filtered with a compression rate and combined into a summary.
Moving on to our second approach, we improved the original TextRank algorithm by combining it with word embedding vectors. The goal was to represent the text's sentences as nodes of a graph and, with the help of word embeddings, determine how similar pairs of sentences are and rank them by their similarity scores. The highest-ranking sentences were filtered with a compression rate and picked for the summary. In the third approach, we combined feature analysis with deep learning. By analysing certain characteristics of the text's sentences, one can assign scores that represent the importance of a given sentence for the summary. With these computed values, we created a dataset for training a deep neural network that is capable of deciding whether or not a certain sentence must be in the summary. An abstractive encoder-decoder summarizer was created with the purpose of generating words related to the document subject and combining them into a summary. Finally, every single summarizer was combined into a full system. Each of our approaches was evaluated with several evaluation metrics, such as ROUGE. We used the DUC dataset for this purpose, and the results were fairly similar to those reported in the scientific community. As for our encoder-decoder, we got promising results.
Text is one of the most important tools for conveying ideas between human beings. It can be of various types, and its content can be more or less easy to interpret depending on the amount of relevant information about the main subject. To ease processing by the reader, there is a mechanism created specifically to reduce the irrelevant information in a text, called text summarization. Summarization produces reduced versions of the original text while keeping the information on the main subject. With the creation and evolution of the Internet and other communication media, the number of textual documents has grown exponentially, a phenomenon known as information overload, and most of these documents contain unnecessary information about the subject they cover. To address this global problem, automatic text summarization emerged within the scientific field of Natural Language Processing; it makes it possible to create automatic summaries of any type of text in any language by means of computational algorithms. Since its creation, numerous text summarization techniques have been devised, which can be classified into two types: extractive and abstractive. Extractive techniques transcribe elements of the original text, such as words or whole sentences that best illustrate the subject of the text, and combine them into a document. In abstractive techniques, the algorithms generate new elements. In this dissertation, some of the best-performing techniques were researched, implemented, and combined to build a complete system for creating summaries. Of the techniques implemented, the first three are extractive and the last is abstractive. The first computes the frequencies of the text's elements and assigns scores to the most frequent sentences, which are then chosen for the summary according to a compression rate. Another technique represents textual elements as nodes of a graph, assigns similarity values between them, and then selects the sentences with the highest values according to a compression rate. A further approach combines an analysis of the text's features with methods based on artificial intelligence. In it, each sentence has a set of features that are used to train a neural network model. The model evaluates and decides which sentences should belong to the summary and filters them according to a compression rate. An abstractive summarizer was created to generate words about the subject of the text and combine them into a summary. All of these summarizers were combined into a single system. Finally, each technique can be evaluated with several evaluation metrics, such as ROUGE. According to the evaluation results on the DUC dataset, our summarizers obtained results fairly similar to those reported in the scientific community, with particular attention to the encoder-decoder, which in some cases showed promising results.
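The frequency-based approach described in the abstract above (score sentences by word frequency, then filter with a compression rate) could be sketched roughly as follows. This is not the thesis's implementation: the regex sentence splitter, the averaging of word counts per sentence, the absence of stop-word removal, and the reading of `compression_rate` as the fraction of sentences to keep are all assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequency_summary(text, compression_rate=0.5):
    """Keep the highest-scoring sentences, where a sentence's score is the
    average document-level frequency of its words.

    compression_rate is assumed to be the fraction of sentences kept
    (at least one sentence is always returned for non-empty input).
    """
    sentences = split_sentences(text)
    if not sentences:
        return ""
    word_counts = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    keep = max(1, math.ceil(len(sentences) * compression_rate))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Restore the original order so the summary reads naturally.
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)
```

Averaging rather than summing the word counts is a design choice made here so that long sentences are not favoured simply for their length; a real system would likely also remove stop words before counting.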

    Generating automated meeting summaries

    This thesis introduces a novel approach for the generation of abstractive summaries of meetings. While the automatic generation of document summaries has been studied for some decades now, the novelty of this thesis lies mainly in its application to the meeting domain (instead of text documents) and in its use of a lexicalized representation formalism based on Frame Semantics. This allows us to generate summaries abstractively (instead of extractively). We argue that abstractive approaches are better suited than extractive ones to summarizing spontaneous spoken interactions.