183 research outputs found

    Multi-document text summarization using text clustering for Arabic Language

    Get PDF
    The process of multi-document summarization is producing a single summary of a collection of related documents. In this work we focus on generic extractive Arabic multi-document summarizers. We also describe the cluster approach for multi-document summarization. The problem with multi-document text summarization is redundancy of sentences, and thus, redundancy must be eliminated to ensure coherence, and improve readability. Hence, we set out the main objective as to examine multi-document summarization salient information for text Arabic summarization task with noisy and redundancy information. In this research we used Essex Arabic Summaries Corpus (EASC) as data to test and achieve our main objective and of course its subsequent subobjectives. We used the token process to split the original text into words, and then removed all the stop words, and then we extract the root of each word, and then represented the text as bag of words by TFIDF without the noisy information. In the second step we applied the K-means algorithm with cosine similarity in our experimental to select the best cluster based on cluster ordering by distance performance. We applied SVM to order the sentences after selected the best cluster, then we selected the highest weight sentences for the final summary to reduce redundancy information. Finally, the final summary results for the ten categories of related documents are evaluated using Recall and Precision with the best Recall achieved is 0.6 and Precision is 0.6

    Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

    Get PDF
    Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advancing the research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented

    Novel perspectives and approaches to video summarization

    Get PDF
    The increasing volume of videos requires efficient and effective techniques to index and structure videos. Video summarization is such a technique that extracts the essential information from a video, so that tasks such as comprehension by users and video content analysis can be conducted more effectively and efficiently. The research presented in this thesis investigates three novel perspectives of the video summarization problem and provides approaches to such perspectives. Our first perspective is to employ local keypoint to perform keyframe selection. Two criteria, namely Coverage and Redundancy, are introduced to guide the keyframe selection process in order to identify those representing maximum video content and sharing minimum redundancy. To efficiently deal with long videos, a top-down strategy is proposed, which splits the summarization problem to two sub-problems: scene identification and scene summarization. Our second perspective is to formulate the task of video summarization to the problem of sparse dictionary reconstruction. Our method utilizes the true sparse constraint L0 norm, instead of the relaxed constraint L2,1 norm, such that keyframes are directly selected as a sparse dictionary that can reconstruct the video frames. In addition, a Percentage Of Reconstruction (POR) criterion is proposed to intuitively guide users in selecting an appropriate length of the summary. In addition, an L2,0 constrained sparse dictionary selection model is also proposed to further verify the effectiveness of sparse dictionary reconstruction for video summarization. Lastly, we further investigate the multi-modal perspective of multimedia content summarization and enrichment. There are abundant images and videos on the Web, so it is highly desirable to effectively organize such resources for textual content enrichment. With the support of web scale images, our proposed system, namely StoryImaging, is capable of enriching arbitrary textual stories with visual content

    Automatic text summarization with Maximal Frequent Sequences

    Get PDF
    En las últimas dos décadas un aumento exponencial de la información electrónica ha provocado una gran necesidad de entender rápidamente grandes volúmenes de información. En este libro se desarrollan los métodos automáticos para producir un resumen. Un resumen es un texto corto que transmite la información más importante de un documento o de una colección de documentos. Los resúmenes utilizados en este libro son extractivos: una selección de las oraciones más importantes del texto. Otros retos consisten en generar resúmenes de manera independiente de lenguaje y dominio. Se describe la identificación de cuatro etapas para generación de resúmenes extractivos. La primera etapa es la selección de términos, en la que uno tiene que decidir qué unidades contarían como términos individuales. El proceso de estimación de la utilidad de los términos individuales se llama etapa de pesado de términos. El siguiente paso se denota como pesado de oraciones, donde todas las secuencias reciben alguna medida numérica de acuerdo con la utilidad de términos. Finalmente, el proceso de selección de las oraciones más importantes se llama selección de oraciones. Los diferentes métodos para generación de resúmenes extractivos pueden ser caracterizados como representan estas etapas. En este libro se describe la etapa de selección de términos, en la que la detección de descripciones multipalabra se realiza considerando Secuencias Frecuentes Maximales (sfms), las cuales adquieren un significado importante, mientras Secuencias Frecuentes (sf) no maximales, que son partes de otros sf, no deben de ser consideradas. En la motivación se consideró costo vs. beneficio: existen muchas sf no maximales, mientras que la probabilidad de adquirir un significado importante es baja. De todos modos, las sfms representan todas las sfs en el modo compacto: todas las sfs podrían ser obtenidas a partir de todas las sfms explotando cada sfm al conjunto de todas sus subsecuencias. Se presentan los nuevos métodos basados en grafos, algoritmos de agrupamiento y algoritmos genéticos, los cuales facilitan la tarea de generación de resúmenes de textos. Se ha experimentado diferentes combinaciones de las opciones de selección de términos, pesado de términos, pesado de oraciones y selección de oraciones para generar los resúmenes extractivos de textos independientes de lenguaje y dominio para una colección de noticias. Se ha analizado algunas opciones basadas en descripciones multipalabra considerándolas en los métodos de grafos, algoritmos de agrupamiento y algoritmos genéticos. Se han obtenido los resultados superiores al de estado de arte. Este libro está dirigido a los estudiantes y científicos del área de Lingüística Computacional, y también a quienes quieren saber sobre los recientes avances en las investigaciones de generación automática de resúmenes de textos.In the last two decades, an exponential increase in the available electronic information causes a big necessity to quickly understand large volumes of information. It raises the importance of the development of automatic methods for detecting the most relevant content of a document in order to produce a shorter text. Automatic Text Summarization (ats) is an active research area dedicated to generate abstractive and extractive summaries not only for a single document, but also for a collection of documents. Other necessity consists in finding method for ats in a language and domain independent way. In this book we consider extractive text summarization for single document task. We have identified that a typical extractive summarization method consists in four steps. First step is a term selection where one should decide what units will count as individual terms. The process of estimating the usefulness of the individual terms is called term weighting step. The next step denotes as sentence weighting where all the sentences receive some numerical measure according to the usefulness of its terms. Finally, the process of selecting the most relevant sentences calls sentence selection. Different extractive summarization methods can be characterized how they perform these steps. In this book, in the term selection step, we describe how to detect multiword descriptions considering Maximal Frequent Sequences (mfss), which bearing important meaning, while non-maximal frequent sequences (fss), those that are parts of another fs, should not be considered. Our additional motivation was cost vs. benefit considerations: there are too many non-maximal fss while their probability to bear important meaning is lower. In any case, mfss represent all fss in a compact way: all fss can be obtained from all mfss by bursting each mfs into a set of all its subsequences.New methods based on graph algorithms, genetic algorithms, and clustering algorithms which facilitate the text summarization task are presented. We have tested different combinations of term selection, term weighting, sentence weighting and sentence selection options for language-and domain-independent extractive single-document text summarization on a news report collection. We analyzed several options based on mfss, considering them with graph, genetic, and clustering algorithms. We obtained results superior to the existing state-ofthe- art methods. This book is addressed for students and scientists of the area of Computational Linguistics, and also who wants to know recent developments in the area of Automatic Text Generation of Summaries

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Get PDF
    Text mining, also known as Intelligent Text Analysis is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Proce- ssing massive amount of data stored in a unstructured form is a challenging task. Several pre-processing methods and algo- rithms are needed to extract useful features from huge amount of data. The survey covers different text summarization, classi- fication, clustering methods to discover useful features and also discovering query facets which are multiple groups of words or phrases that explain and summarize the content covered by a query thereby reducing time taken by the user. Dealing with collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (remove and replace duplicates).The survey provides existing text mining techniques to extract relevant features, detect duplicates and to replace the duplicate data to get fine grained knowledge to the user

    Large-scale image collection cleansing, summarization and exploration

    Get PDF
    A perennially interesting topic in the research field of large scale image collection organization is how to effectively and efficiently conduct the tasks of image cleansing, summarization and exploration. The primary objective of such an image organization system is to enhance user exploration experience with redundancy removal and summarization operations on large-scale image collection. An ideal system is to discover and utilize the visual correlation among the images, to reduce the redundancy in large-scale image collection, to organize and visualize the structure of large-scale image collection, and to facilitate exploration and knowledge discovery. In this dissertation, a novel system is developed for exploiting and navigating large-scale image collection. Our system consists of the following key components: (a) junk image filtering by incorporating bilingual search results; (b) near duplicate image detection by using a coarse-to-fine framework; (c) concept network generation and visualization; (d) image collection summarization via dictionary learning for sparse representation; and (e) a multimedia practice of graffiti image retrieval and exploration. For junk image filtering, bilingual image search results, which are adopted for the same keyword-based query, are integrated to automatically identify the clusters for the junk images and the clusters for the relevant images. Within relevant image clusters, the results are further refined by removing the duplications under a coarse-to-fine structure. The duplicate pairs are detected with both global feature (partition based color histogram) and local feature (CPAM and SIFT Bag-of-Word model). The duplications are detected and removed from the data collection to facilitate further exploration and visual correlation analysis. After junk image filtering and duplication removal, the visual concepts are further organized and visualized by the proposed concept network. An automatic algorithm is developed to generate such visual concept network which characterizes the visual correlation between image concept pairs. Multiple kernels are combined and a kernel canonical correlation analysis algorithm is used to characterize the diverse visual similarity contexts between the image concepts. The FishEye visualization technique is implemented to facilitate the navigation of image concepts through our image concept network. To better assist the exploration of large scale data collection, we design an efficient summarization algorithm to extract representative examplars. For this collection summarization task, a sparse dictionary (a small set of the most representative images) is learned to represent all the images in the given set, e.g., such sparse dictionary is treated as the summary for the given image set. The simulated annealing algorithm is adopted to learn such sparse dictionary (image summary) by minimizing an explicit optimization function. In order to handle large scale image collection, we have evaluated both the accuracy performance of the proposed algorithms and their computation efficiency. For each of the above tasks, we have conducted experiments on multiple public available image collections, such as ImageNet, NUS-WIDE, LabelMe, etc. We have observed very promising results compared to existing frameworks. The computation performance is also satisfiable for large-scale image collection applications. The original intention to design such a large-scale image collection exploration and organization system is to better service the tasks of information retrieval and knowledge discovery. For this purpose, we utilize the proposed system to a graffiti retrieval and exploration application and receive positive feedback

    Recent Trends in Computational Intelligence

    Get PDF
    Traditional models struggle to cope with complexity, noise, and the existence of a changing environment, while Computational Intelligence (CI) offers solutions to complicated problems as well as reverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically-inspired technologies such as the intellect of swarm as part of evolutionary computation and encompassing wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the usage of CI for optimal solving of various applications proving its wide reach and relevance. Bounding of optimization methods and data mining strategies make a strong and reliable prediction tool for handling real-life applications

    Feature extraction and duplicate detection for text mining: A survey

    Get PDF
    Text mining, also known as Intelligent Text Analysis is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Proce- ssing massive amount of data stored in a unstructured form is a challenging task. Several pre-processing methods and algo- rithms are needed to extract useful features from huge amount of data. The survey covers different text summarization, classi- fication, clustering methods to discover useful features and also discovering query facets which are multiple groups of words or phrases that explain and summarize the content covered by a query thereby reducing time taken by the user
    • …
    corecore