Document analysis by means of data mining techniques
The huge amount of textual data produced every day by scientists, journalists and Web users allows many different aspects of the information stored in published documents to be investigated. Data mining and information retrieval techniques are exploited to manage and extract information from huge amounts of unstructured textual data. Text mining, also known as text data mining, is the process of extracting high-quality information (in terms of relevance, novelty and interestingness) from text by identifying patterns. It typically involves structuring the input text, by parsing it and deriving linguistic features or sometimes by removing extraneous data, and then finding patterns in the structured data. The patterns are finally evaluated and the output is interpreted to accomplish the desired task. Recently, text mining has attracted attention in several fields, such as security (analysis of Internet news), commerce (search and indexing) and academia (query answering). Beyond retrieving the documents that contain the words given in a user query, text mining may provide a direct answer to the user by exploiting the Semantic Web to take content meaning and context into account. It can also act as an intelligence analyst and is used in some email spam filters to filter out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization.
In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document collection. Furthermore, the knowledge contained in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, extracting useful information according to domain knowledge related to the user interests is a challenging task.
The main topics have been the study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches that are firmly established for transactional data (e.g., frequent itemset mining) to textual documents. Frequent itemset mining is a widely used exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well established, the use of frequent itemsets in textual document summarization has never been investigated so far. This work exploits frequent itemsets for multi-document summarization and presents a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer), which is based on an itemset-based model, i.e., a model composed of frequent itemsets extracted from the document collection. Highly representative and non-redundant sentences are selected for the summary by considering both a sentence relevance score, based on tf-idf statistics, and sentence coverage with respect to a concise and highly informative itemset-based model. To evaluate the performance of ItemSum, a suite of experiments on a collection of news articles has been performed. The results show that ItemSum significantly outperforms widely used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC’04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance is investigated as well.
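The selection idea described above can be illustrated with a minimal sketch. It is not the ItemSum implementation: it assumes each sentence is treated as a transaction of terms, mines itemsets by brute-force counting up to a small size, scores relevance with a simplified idf sum, and picks sentences greedily by the number of still-uncovered itemsets they contain. The names `tokenize`, `frequent_itemsets`, `summarize`, `min_support` and `max_sentences` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(sentence):
    """Return the set of lowercase alphabetic terms in a sentence (no stop-word removal)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of term sets of size 1..max_size; keep those found in
    at least min_support transactions (an assumption, not an optimized miner)."""
    counts = Counter()
    for terms in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(terms), size):
                counts[frozenset(itemset)] += 1
    return {itemset for itemset, count in counts.items() if count >= min_support}


def summarize(sentences, min_support=2, max_sentences=2):
    """Greedily pick sentences that cover the most uncovered frequent itemsets,
    breaking ties with a simplified tf-idf-style relevance score."""
    transactions = [tokenize(s) for s in sentences]
    n = len(transactions)
    df = Counter(term for terms in transactions for term in terms)
    # Simplified relevance: sum of idf values of the terms in the sentence.
    relevance = [sum(math.log(n / df[t]) for t in terms) for terms in transactions]
    uncovered = frequent_itemsets(transactions, min_support)
    chosen = []
    while uncovered and len(chosen) < max_sentences:
        def gain(i):
            newly_covered = sum(1 for itemset in uncovered if itemset <= transactions[i])
            return (newly_covered, relevance[i])

        candidates = [i for i in range(n) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best)[0] == 0:
            break  # no remaining sentence covers any uncovered itemset
        chosen.append(best)
        uncovered = {s for s in uncovered if not s <= transactions[best]}
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

In this sketch, coverage drives the choice and relevance only breaks ties; the way the two criteria are actually combined is not specified in the abstract.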
In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not effectively cover all the semantically relevant data facets. A step forward towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad-hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries may not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named entity recognition based on the Yago ontology is used for the text summarization task. The Named Entity Recognition (NER) task is concerned with marking mentions of specific objects in the text. These mentions are then classified into a set of predefined categories. Standard categories include “person”, “location”, “geo-political organization”, “facility”, “organization”, and “time”. The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC’04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers.
A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named SociONewSum. The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary, which consists of the most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content (UGC) coming from Twitter has been exploited to discover current social trends and improve the appeal of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed.
Automated generation of movie tributes
The purpose of this thesis is to generate a movie tribute in the form of a video clip, given a movie and a coherent music segment as input. A tribute is considered to be a video containing the most meaningful clips from the movie, played sequentially while the music plays. In this work, we collect the clips by summarizing the movie subtitles with a generic summarization algorithm. It is important that the artifact is coherent and fluid, hence there is a need to balance the selection of important content against the selection of content that is in harmony with the music. To achieve this, clips are filtered so as to ensure that only those that contain the same emotion as the music appear in the final video. This is done by extracting vectors of emotion-related audio features from the scenes the clips belong to and from the music, and then comparing them with a distance measure. Finally, the filtered clips fill the length of the music in chronological order. Results were positive: on average, the produced tributes obtained scores of 7, on a scale from 0 to 10, on criteria such as content selection and emotional coherence, according to human evaluation.
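The filtering and filling steps described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the distance measure or how "same emotion" is decided, so the sketch assumes Euclidean distance with a hypothetical threshold `max_distance`, and a hypothetical clip representation with `start`, `duration` and `scene_features` fields.

```python
import math


def build_tribute(clips, music_features, music_length, max_distance):
    """Keep clips whose scene features are close to the music features,
    then fill the music length with them in chronological order."""
    # Emotion filter: Euclidean distance is an assumption of this sketch.
    matching = [
        clip for clip in clips
        if math.dist(clip["scene_features"], music_features) <= max_distance
    ]
    # Chronological order by position in the movie.
    matching.sort(key=lambda clip: clip["start"])
    timeline, used = [], 0.0
    for clip in matching:
        if used + clip["duration"] > music_length:
            continue  # skip clips that would overrun the music
        timeline.append(clip)
        used += clip["duration"]
    return timeline
```

A call such as `build_tribute(clips, (0.8, 0.2), 12.0, 0.3)` would return the ordered clips to play over a 12-second music segment; the feature values here are arbitrary placeholders.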
Generating update summaries: using an unsupervised clustering algorithm to detect novelty in news articles
In this article, we present a multi-document automatic summarization system dedicated to update (or novelty) summarization. We first present the method on which our system is based, CBSEAS, and its adaptation to the update summarization task. Generating update summaries is a more complicated task than generating "standard" summaries, and requires a specific evaluation. We then describe the "Update Summarization" task of TAC 2009, in which we took part in order to evaluate our system. This international evaluation campaign allowed us to compare our system with other automatic summarization systems. Finally, we present and discuss the interesting results obtained by our system.
Feature Extraction and Duplicate Detection for Text Mining: A Survey
Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers different text summarization, classification and clustering methods for discovering useful features, as well as the discovery of query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with a collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates and replacing duplicate data so as to provide fine-grained knowledge to the user.
Automatic text summarization with Maximal Frequent Sequences
In the last two decades, an exponential increase in the available electronic information has created a great need to understand large volumes of information quickly. This raises the importance of developing automatic methods for detecting the most relevant content of a document in order to produce a shorter text. Automatic Text Summarization (ATS) is an active research area dedicated to generating abstractive and extractive summaries not only for a single document, but also for a collection of documents. A further need is to find methods for ATS that are language and domain independent.
In this book we consider extractive text summarization for the single-document task. We have identified that a typical extractive summarization method consists of four steps. The first step is term selection, where one decides what units count as individual terms. The process of estimating the usefulness of the individual terms is called the term weighting step. The next step is sentence weighting, where every sentence receives a numerical measure according to the usefulness of its terms. Finally, the process of selecting the most relevant sentences is called sentence selection. Different extractive summarization methods can be characterized by how they perform these steps.
In this book, for the term selection step, we describe how to detect multiword descriptions using Maximal Frequent Sequences (MFSs), which bear important meaning, while non-maximal frequent sequences (FSs), those that are parts of another FS, should not be considered. Our additional motivation was a cost vs. benefit consideration: there are too many non-maximal FSs, while their probability of bearing important meaning is lower. In any case, MFSs represent all FSs in a compact way: all FSs can be obtained from all MFSs by bursting each MFS into the set of all its subsequences.
New methods based on graph algorithms, genetic algorithms, and clustering algorithms, which facilitate the text summarization task, are presented. We have tested different combinations of term selection, term weighting, sentence weighting and sentence selection options for language- and domain-independent extractive single-document text summarization on a news report collection. We analyzed several options based on MFSs, considering them with graph, genetic, and clustering algorithms. We obtained results superior to the existing state-of-the-art methods.
This book is addressed to students and scientists in the area of Computational Linguistics, and also to those who want to know about recent developments in the automatic generation of text summaries.
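The notion of a maximal frequent sequence can be made concrete with a small sketch. It is not the book's algorithm: it assumes that a sequence is a contiguous run of words, that frequency is the number of sentences containing the run, and that a frequent sequence is maximal when it is not part of a longer frequent one. The function names and the `min_freq` threshold are hypothetical, and the enumeration is brute force.

```python
from collections import Counter


def frequent_sequences(sentences, min_freq):
    """Return all contiguous word sequences found in at least min_freq sentences.
    Each sentence is a list of words; a sequence counts once per sentence."""
    counts = Counter()
    for words in sentences:
        seen = set()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                seen.add(tuple(words[i:j]))
        counts.update(seen)
    return {seq for seq, count in counts.items() if count >= min_freq}


def is_part_of(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_frequent_sequences(sentences, min_freq=2):
    """Keep only the frequent sequences that are not part of another frequent sequence."""
    frequent = frequent_sequences(sentences, min_freq)
    return {
        seq for seq in frequent
        if not any(seq != other and is_part_of(seq, other) for other in frequent)
    }
```

Under these assumptions, every frequent sequence is a contiguous part of some maximal one, which is the compactness property mentioned in the text.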