195 research outputs found

    Automatic text summarization with Maximal Frequent Sequences

    Get PDF
    En las últimas dos décadas un aumento exponencial de la información electrónica ha provocado una gran necesidad de entender rápidamente grandes volúmenes de información. En este libro se desarrollan los métodos automáticos para producir un resumen. Un resumen es un texto corto que transmite la información más importante de un documento o de una colección de documentos. Los resúmenes utilizados en este libro son extractivos: una selección de las oraciones más importantes del texto. Otros retos consisten en generar resúmenes de manera independiente de lenguaje y dominio. Se describe la identificación de cuatro etapas para generación de resúmenes extractivos. La primera etapa es la selección de términos, en la que uno tiene que decidir qué unidades contarían como términos individuales. El proceso de estimación de la utilidad de los términos individuales se llama etapa de pesado de términos. El siguiente paso se denota como pesado de oraciones, donde todas las secuencias reciben alguna medida numérica de acuerdo con la utilidad de términos. Finalmente, el proceso de selección de las oraciones más importantes se llama selección de oraciones. Los diferentes métodos para generación de resúmenes extractivos pueden ser caracterizados como representan estas etapas. En este libro se describe la etapa de selección de términos, en la que la detección de descripciones multipalabra se realiza considerando Secuencias Frecuentes Maximales (sfms), las cuales adquieren un significado importante, mientras Secuencias Frecuentes (sf) no maximales, que son partes de otros sf, no deben de ser consideradas. En la motivación se consideró costo vs. beneficio: existen muchas sf no maximales, mientras que la probabilidad de adquirir un significado importante es baja. De todos modos, las sfms representan todas las sfs en el modo compacto: todas las sfs podrían ser obtenidas a partir de todas las sfms explotando cada sfm al conjunto de todas sus subsecuencias. Se presentan los nuevos métodos basados en grafos, algoritmos de agrupamiento y algoritmos genéticos, los cuales facilitan la tarea de generación de resúmenes de textos. Se ha experimentado diferentes combinaciones de las opciones de selección de términos, pesado de términos, pesado de oraciones y selección de oraciones para generar los resúmenes extractivos de textos independientes de lenguaje y dominio para una colección de noticias. Se ha analizado algunas opciones basadas en descripciones multipalabra considerándolas en los métodos de grafos, algoritmos de agrupamiento y algoritmos genéticos. Se han obtenido los resultados superiores al de estado de arte. Este libro está dirigido a los estudiantes y científicos del área de Lingüística Computacional, y también a quienes quieren saber sobre los recientes avances en las investigaciones de generación automática de resúmenes de textos.In the last two decades, an exponential increase in the available electronic information causes a big necessity to quickly understand large volumes of information. It raises the importance of the development of automatic methods for detecting the most relevant content of a document in order to produce a shorter text. Automatic Text Summarization (ats) is an active research area dedicated to generate abstractive and extractive summaries not only for a single document, but also for a collection of documents. Other necessity consists in finding method for ats in a language and domain independent way. In this book we consider extractive text summarization for single document task. We have identified that a typical extractive summarization method consists in four steps. First step is a term selection where one should decide what units will count as individual terms. The process of estimating the usefulness of the individual terms is called term weighting step. The next step denotes as sentence weighting where all the sentences receive some numerical measure according to the usefulness of its terms. Finally, the process of selecting the most relevant sentences calls sentence selection. Different extractive summarization methods can be characterized how they perform these steps. In this book, in the term selection step, we describe how to detect multiword descriptions considering Maximal Frequent Sequences (mfss), which bearing important meaning, while non-maximal frequent sequences (fss), those that are parts of another fs, should not be considered. Our additional motivation was cost vs. benefit considerations: there are too many non-maximal fss while their probability to bear important meaning is lower. In any case, mfss represent all fss in a compact way: all fss can be obtained from all mfss by bursting each mfs into a set of all its subsequences.New methods based on graph algorithms, genetic algorithms, and clustering algorithms which facilitate the text summarization task are presented. We have tested different combinations of term selection, term weighting, sentence weighting and sentence selection options for language-and domain-independent extractive single-document text summarization on a news report collection. We analyzed several options based on mfss, considering them with graph, genetic, and clustering algorithms. We obtained results superior to the existing state-ofthe- art methods. This book is addressed for students and scientists of the area of Computational Linguistics, and also who wants to know recent developments in the area of Automatic Text Generation of Summaries

    Extractive Automatic Text Summarization Based on Lexical-Semantic Keywords

    Get PDF
    The automatic text summarization (ATS) task consists in automatically synthesizing a document to provide a condensed version of it. Creating a summary requires not only selecting the main topics of the sentences but also identifying the key relationships between these topics. Related works rank text units (mainly sentences) to select those that could form the summary. However, the resulting summaries may not include all the topics covered in the source text because important information may have been discarded. In addition, the semantic structure of documents has been barely explored in this field. Thus, this study proposes a new method for the ATS task that takes advantage of semantic information to improve keyword detection. This proposed method increases not only the coverage by clustering the sentences to identify the main topics in the source document but also the precision by detecting the keywords in the clusters. The experimental results of this work indicate that the proposed method outperformed previous methods with a standard collection

    Automatic Text Summarization

    Get PDF
    Writing text was one of the first ever methods used by humans to represent their knowledge. Text can be of different types and have different purposes. Due to the evolution of information systems and the Internet, the amount of textual information available has increased exponentially in a worldwide scale, and many documents tend to have a percentage of unnecessary information. Due to this event, most readers have difficulty in digesting all the extensive information contained in multiple documents, produced on a daily basis. A simple solution to the excessive irrelevant information in texts is to create summaries, in which we keep the subject’s related parts and remove the unnecessary ones. In Natural Language Processing, the goal of automatic text summarization is to create systems that process text and keep only the most important data. Since its creation several approaches have been designed to create better text summaries, which can be divided in two separate groups: extractive approaches and abstractive approaches. In the first group, the summarizers decide what text elements should be in the summary. The criteria by which they are selected is diverse. After they are selected, they are combined into the summary. In the second group, the text elements are generated from scratch. Abstractive summarizers are much more complex so they still need a lot of research, in order to represent good results. During this thesis, we have investigated the state of the art approaches, implemented our own versions and tested them in conventional datasets, like the DUC dataset. Our first approach was a frequency­based approach, since it analyses the frequency in which the text’s words/sentences appear in the text. Higher frequency words/sentences automatically receive higher scores which are then filtered with a compression rate and combined in a summary. Moving on to our second approach, we have improved the original TextRank algorithm by combining it with word embedding vectors. The goal was to represent the text’s sentences as nodes from a graph and with the help of word embeddings, determine how similar are pairs of sentences and rank them by their similarity scores. The highest ranking sentences were filtered with a compression rate and picked for the summary. In the third approach, we combined feature analysis with deep learning. By analysing certain characteristics of the text sentences, one can assign scores that represent the importance of a given sentence for the summary. With these computed values, we have created a dataset for training a deep neural network that is capable of deciding if a certain sentence must be or not in the summary. An abstractive encoder­decoder summarizer was created with the purpose of generating words related to the document subject and combining them into a summary. Finally, every single summarizer was combined into a full system. Each one of our approaches was evaluated with several evaluation metrics, such as ROUGE. We used the DUC dataset for this purpose and the results were fairly similar to the ones in the scientific community. As for our encoder­decode, we got promising results.O texto é um dos utensílios mais importantes de transmissão de ideias entre os seres humanos. Pode ser de vários tipos e o seu conteúdo pode ser mais ou menos fácil de interpretar, conforme a quantidade de informação relevante sobre o assunto principal. De forma a facilitar o processamento pelo leitor existe um mecanismo propositadamente criado para reduzir a informação irrelevante num texto, chamado sumarização de texto. Através da sumarização criam­se versões reduzidas do text original e mantém­se a informação do assunto principal. Devido à criação e evolução da Internet e outros meios de comunicação, surgiu um aumento exponencial de documentos textuais, evento denominado de sobrecarga de informação, que têm na sua maioria informação desnecessária sobre o assunto que retratam. De forma a resolver este problema global, surgiu dentro da área científica de Processamento de Linguagem Natural, a sumarização automática de texto, que permite criar sumários automáticos de qualquer tipo de texto e de qualquer lingua, através de algoritmos computacionais. Desde a sua criação, inúmeras técnicas de sumarização de texto foram idealizadas, podendo ser classificadas em dois tipos diferentes: extractivas e abstractivas. Em técnicas extractivas, são transcritos elementos do texto original, como palavras ou frases inteiras que sejam as mais ilustrativas do assunto do texto e combinadas num documento. Em técnicas abstractivas, os algoritmos geram elementos novos. Nesta dissertação pesquisaram­se, implementaram­se e combinaram­se algumas das técnicas com melhores resultados de modo a criar um sistema completo para criar sumários. Relativamente às técnicas implementadas, as primeiras três são técnicas extractivas enquanto que a ultima é abstractiva. Desta forma, a primeira incide sobre o cálculo das frequências dos elementos do texto, atribuindo­se valores às frases que sejam mais frequentes, que por sua vez são escolhidas para o sumário através de uma taxa de compressão. Outra das técnicas incide na representação dos elementos textuais sob a forma de nodos de um grafo, sendo atribuidos valores de similaridade entre os mesmos e de seguida escolhidas as frases com maiores valores através de uma taxa de compressão. Uma outra abordagem foi criada de forma a combinar um mecanismo de análise das caracteristicas do texto com métodos baseados em inteligência artificial. Nela cada frase possui um conjunto de caracteristicas que são usadas para treinar um modelo de rede neuronal. O modelo avalia e decide quais as frases que devem pertencer ao sumário e filtra as mesmas através deu uma taxa de compressão. Um sumarizador abstractivo foi criado para para gerar palavras sobre o assunto do texto e combinar num sumário. Cada um destes sumarizadores foi combinado num só sistema. Por fim, cada uma das técnicas pode ser avaliada segundo várias métricas de avaliação, como por exemplo a ROUGE. Segundo os resultados de avaliação das técnicas, com o conjunto de dados DUC, os nossos sumarizadores obtiveram resultados relativamente parecidos com os presentes na comunidade cientifica, com especial atenção para o codificador­descodificador que em certos casos apresentou resultados promissores

    Automatic Generation of Text Summaries - Challenges, proposals and experiments

    Get PDF
    Los estudiantes e investigadores en el área de procesamiento deenguaje natural, inteligencia artificial, ciencias computacionales y lingüística computacional serán quizá los primeros interesados en este libro. No obstante, también se pretende introducir a público no especializado en esta prometedora área de investigación; por ello, hemos traducido al español algunos tecnicismos y anglicismos, propios de esta disciplina, pero sin dejar de mencionar, en todo momento, su término en inglés para evitar confusiones y lograr que aquellos lectores interesados puedan ampliar sus fuentes de conocimiento.Este libro presenta un método computacional novedoso, a nivel internacional, para la generación automática de resúmenes de texto, pues supera la calidad de los que actualmente se pueden crear. Es decir, es resultado de una investigación que buscó métodos y modelos computacionales lo menos dependientes del lenguaje y dominio

    Keyphrase Generation: A Multi-Aspect Survey

    Full text link
    Extractive keyphrase generation research has been around since the nineties, but the more advanced abstractive approach based on the encoder-decoder framework and sequence-to-sequence learning has been explored only recently. In fact, more than a dozen of abstractive methods have been proposed in the last three years, producing meaningful keyphrases and achieving state-of-the-art scores. In this survey, we examine various aspects of the extractive keyphrase generation methods and focus mostly on the more recent abstractive methods that are based on neural networks. We pay particular attention to the mechanisms that have driven the perfection of the later. A huge collection of scientific article metadata and the corresponding keyphrases is created and released for the research community. We also present various keyphrase generation and text summarization research patterns and trends of the last two decades.Comment: 10 pages, 5 tables. Published in proceedings of FRUCT 2019, the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finlan

    Optical tomography: Image improvement using mixed projection of parallel and fan beam modes

    Get PDF
    Mixed parallel and fan beam projection is a technique used to increase the quality images. This research focuses on enhancing the image quality in optical tomography. Image quality can be defined by measuring the Peak Signal to Noise Ratio (PSNR) and Normalized Mean Square Error (NMSE) parameters. The findings of this research prove that by combining parallel and fan beam projection, the image quality can be increased by more than 10%in terms of its PSNR value and more than 100% in terms of its NMSE value compared to a single parallel beam

    Evolutionary Automatic Text Summarization using Cluster Validation Indexes

    Get PDF
    The main problem for generating an extractive automatic text summary (EATS) is to detect the key themes of a text. For this task, unsupervised approaches cluster the sentences of the original text to find the key sentences that take part in an automatic summary. The quality of an automatic summary is evaluated using similarity metrics with human-made summaries. However, the relationship between the quality of the human-made summaries and the internal quality of the clustering is unclear. First, this paper proposes a comparison of the correlation of the quality of a human-made summary to the internal quality of the clustering validation index for finding the best correlation with a clustering validation index. Second, in this paper, an evolutionary method based on the best above internal clustering validation index for an automatic text summarization task is proposed. Our proposed unsupervised method for EATS has the advantage of not requiring information regarding the specific classes or themes of a text, and is therefore domain- and language-independent. The high results obtained by our method, using the most-competitive standard collection for EATS, prove that our method maintains a high correlation with human-made summaries, meeting the specific features of the groups, for example, compaction, separation, distribution, and density

    Summarizing Text for Indonesian Language by Using Latent Dirichlet Allocation and Genetic Algorithm

    Full text link
    The number of documents progressively increases especially for the electronic one. This degrades effectivity and efficiency in managing them. Therefore, it is a must to manage the documents. Automatic text summarization is able to solve by producing text document summaries. The goal of the research is to produce a tool to summarize documents in Bahasa: Indonesian Language. It is aimed to satisfy the user's need of relevant and consistent summaries. The algorithm is based on sentence features scoring by using Latent Dirichlet Allocation and Genetic Algorithm for determining sentence feature weights. It is evaluated by calculating summarization speed, precision, recall, F-measure, and some subjective evaluations. Extractive summaries from the original text documents can represent important information from a single document in Bahasa with faster summarization speed compared to manual process. Best F-measure value is 0,556926 (with precision of 0.53448 and recall of 0.58134) and summary ratio of 30%

    Design of optimal search engine using text summarization through artificial intelligence techniques

    Get PDF
    Natural language processing is the trending topic in the latest research areas, which allows the developers to create the human-computer interactions to come into existence. The natural language processing is an integration of artificial intelligence, computer science and computer linguistics. The research towards natural Language Processing is focused on creating innovations towards creating the devices or machines which operates basing on the single command of a human. It allows various Bot creations to innovate the instructions from the mobile devices to control the physical devices by allowing the speech-tagging. In our paper, we design a search engine which not only displays the data according to user query but also performs the detailed display of the content or topic user is interested for using the summarization concept. We find the designed search engine is having optimal response time for the user queries by analyzing with number of transactions as inputs. Also, the result findings in the performance analysis show that the text summarization method has been an efficient way for improving the response time in the search engine optimizations
    corecore