20 research outputs found

    Multi-document text summarization using text clustering for Arabic Language

    Get PDF
    Multi-document summarization is the process of producing a single summary from a collection of related documents. In this work we focus on generic extractive Arabic multi-document summarization and describe a cluster-based approach to the task. A central problem in multi-document summarization is sentence redundancy, which must be eliminated to ensure coherence and improve readability. Our main objective is therefore to extract salient information for the Arabic multi-document summarization task in the presence of noisy and redundant information. We used the Essex Arabic Summaries Corpus (EASC) as the data for testing this objective and its sub-objectives. We first tokenized the original text into words, removed all stop words, and extracted the root of each word, then represented the text as a bag of words weighted by TF-IDF, with the noisy information removed. We then applied the K-means algorithm with cosine similarity to select the best cluster, ordering the clusters by distance. After selecting the best cluster, we applied an SVM to order its sentences and selected the highest-weight sentences for the final summary, reducing redundant information. Finally, the summaries for the ten categories of related documents were evaluated using Recall and Precision; the best Recall achieved is 0.6 and the best Precision is 0.6.
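
    A minimal sketch of this pipeline, assuming scikit-learn and NLTK's ISRI Arabic stemmer as stand-ins for the paper's own tools. The stop-word list is an illustrative subset, and centroid similarity stands in for the SVM ordering step, since the abstract does not specify the SVM's training labels.

```python
# A sketch only: scikit-learn and NLTK's ISRI stemmer stand in for the
# paper's own preprocessing, clustering, and sentence-ordering components.
import numpy as np
from nltk.stem.isri import ISRIStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ARABIC_STOPWORDS = {"في", "من", "على", "أن"}  # illustrative subset only
stemmer = ISRIStemmer()

def preprocess(sentence):
    """Tokenize, drop stop words, and reduce each word to its root."""
    return " ".join(stemmer.stem(w) for w in sentence.split()
                    if w not in ARABIC_STOPWORDS)

def summarize(sentences, k=3, n_select=5):
    # Bag-of-words TF-IDF representation of the preprocessed sentences.
    X = TfidfVectorizer().fit_transform([preprocess(s) for s in sentences])

    # Cluster the sentences, then pick the cluster whose members lie
    # closest (by cosine similarity) to their own centroid.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    def cohesion(c):
        idx = np.where(km.labels_ == c)[0]
        return cosine_similarity(X[idx], km.cluster_centers_[c:c + 1]).mean()
    best = max(range(k), key=cohesion)

    # Weight each sentence in the best cluster by centroid similarity
    # (standing in for the paper's SVM ordering) and keep the top ones.
    idx = np.where(km.labels_ == best)[0]
    weights = cosine_similarity(X[idx],
                                km.cluster_centers_[best:best + 1]).ravel()
    top = idx[np.argsort(weights)[::-1][:n_select]]
    return [sentences[i] for i in sorted(top)]
```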

    Update summary generation: Using an unsupervised clustering algorithm to detect novelty in news articles

    No full text
    In this article, we present a multi-document automatic summarization system dedicated to update (or novelty) summarization. We first present the method our system is based on, CBSEAS, and its adaptation to the update summarization task. Generating update summaries is a more difficult task than generating "standard" summaries and requires a specific evaluation. We then describe the "Update Summarization" task of TAC 2009, in which we participated in order to evaluate our system. This international evaluation campaign allowed us to compare our system against other automatic summarization systems. Finally, we present and discuss the interesting results obtained by our system.
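
    The abstract does not detail CBSEAS itself, but the general idea of detecting novelty with unsupervised clustering can be sketched as follows; the vectorizer, cluster count, and threshold are illustrative assumptions, not CBSEAS parameters.

```python
# A sketch of clustering-based novelty detection, not of CBSEAS itself:
# sentences from earlier articles are clustered, and a sentence from a
# new article counts as novel when it sits far from every known cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def novel_sentences(old_sents, new_sents, k=5, threshold=0.2):
    tfidf = TfidfVectorizer().fit(old_sents + new_sents)
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(tfidf.transform(old_sents))
    sims = cosine_similarity(tfidf.transform(new_sents), km.cluster_centers_)
    # A sentence is "new" if its best match among known clusters is weak.
    return [s for s, row in zip(new_sents, sims) if row.max() < threshold]
```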

    Textual Entailment Using Lexical And Syntactic Similarity

    Full text link

    Cross-Lingual Textual Entailment and Applications

    Get PDF
    Textual Entailment (TE) has been proposed as a generic framework for modeling language variability. The great potential of integrating (monolingual) TE recognition components into NLP architectures has been reported in several areas, such as question answering, information retrieval, information extraction and document summarization. Mainly due to the absence of cross-lingual TE (CLTE) recognition components, similar improvements have not yet been achieved in any corresponding cross-lingual application. In this thesis, we propose and investigate Cross-Lingual Textual Entailment (CLTE) as a semantic relation between two text portions in different languages. We present different practical solutions to approach this problem by i) bringing CLTE back to the monolingual scenario, translating the two texts into the same language; and ii) integrating machine translation and TE algorithms and techniques. We argue that CLTE can be a core technology for several cross-lingual NLP applications and tasks. Experiments on different datasets and two interesting cross-lingual NLP applications, namely content synchronization and machine translation evaluation, confirm the effectiveness of our approaches leading to successful results. As a complement to the research in the algorithmic side, we successfully explored the creation of cross-lingual textual entailment corpora by means of crowdsourcing, as a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators.
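
    As an illustration of approach i), the "pivot" strategy of bringing CLTE back to the monolingual scenario can be sketched as follows; `translate` and `entails` are hypothetical stand-ins for an MT system and a monolingual TE recognizer, not components from the thesis.

```python
# A sketch of the pivot strategy: translate both texts into a common
# language, then run a standard monolingual TE recognizer. The two
# stub functions below are hypothetical placeholders.
def translate(text: str, target_lang: str = "en") -> str:
    raise NotImplementedError("plug in any machine translation system")

def entails(premise: str, hypothesis: str) -> bool:
    raise NotImplementedError("plug in any monolingual TE recognizer")

def cross_lingual_entails(text: str, hypothesis: str) -> bool:
    """Decide CLTE by reducing it to the monolingual case."""
    return entails(translate(text), translate(hypothesis))
```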

    Incremental clustering and novelty detection methods: application to the intelligent analysis of information evolving over time

    Get PDF
    Learning algorithms have proved their ability to deal with large amounts of data. Most statistical approaches use learning sets of a fixed size and produce static models. However, in specific situations such as active or incremental learning, the learning task starts with only very few data. In that case, algorithms able to produce models from only a few examples become necessary. Classifiers in the literature are generally evaluated with criteria such as accuracy or the ability to order data (ranking), but this taxonomy of classifiers can change considerably if the focus is on the ability to learn from just a few examples. To our knowledge, few studies have addressed this problem. This work studies a large panel of both algorithms (9 different kinds) and data sets (17 UCI bases).
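
    A minimal sketch of this kind of comparison, assuming scikit-learn: a few classifier families evaluated on test accuracy as the training set grows from very few examples. The dataset and classifiers shown are illustrative, not the 9 algorithms and 17 UCI bases of the study.

```python
# Compare how quickly different classifier families learn from few
# examples, using learning curves over small training fractions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for clf in (GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier()):
    sizes, _, test_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.05, 0.5, 5),
        cv=5, shuffle=True, random_state=0)
    # Mean cross-validated accuracy at each (small) training-set size.
    print(type(clf).__name__, sizes, test_scores.mean(axis=1).round(3))
```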

    Livonian and Estonian Private Law. Part 1, containing the Introduction, the Law of Persons, Property, and Claims

    Get PDF
    http://tartu.ester.ee/record=b1790333~S1*es

    Arabic multi-document text summarisation

    Get PDF
    Multi-document summarisation is the process of producing a single summary of a collection of related documents. Much of the current work on multi-document text summarisation is concerned with the English language; relevant resources are numerous and readily available. These resources include human generated (gold-standard) and automatic summaries. Arabic multi-document summarisation is still in its infancy. One of the obstacles to progress is the limited availability of Arabic resources to support this research. When we started our research there were no publicly available Arabic multi-document gold-standard summaries, which are needed to automatically evaluate system generated summaries. The Document Understanding Conference (DUC) and Text Analysis Conference (TAC) at that time provided resources such as gold-standard extractive and abstractive summaries (both human and system generated) that were only available in English. Our aim was to push forward the state-of-the-art in Arabic multi-document summarisation. This required advancements in at least two areas. The first area was the creation of Arabic test collections. The second area was concerned with the actual summarisation process, to find methods that improve the quality of Arabic summaries. To address both points we created single- and multi-document Arabic test collections, both automatically, using a commonly used English dataset, and manually, with human participants. We developed extractive language-dependent and language-independent single- and multi-document summarisers, both for Arabic and English. In our work we provided state-of-the-art approaches for Arabic multi-document summarisation. We succeeded in including Arabic in one of the leading summarisation conferences, the Text Analysis Conference (TAC). Researchers on Arabic multi-document summarisation now have resources and tools that can be used to advance the research in this field.

    Automatic multi-document summarization using maximal frequent sequences and a graph-based method

    Get PDF
    The exponential growth of the Internet has produced a day-by-day bombardment of information that keeps increasing exponentially. This mass of information has become an information-overload problem when searching for specific information, which has driven computer science to look for a solution. Automatic Text Summarization (ATS) is a Natural Language Processing (NLP) task that seeks to counteract the negative effects of information overload. Current state-of-the-art methods for ATS are based on a three-stage architecture: 1. Topic identification. 2. Transformation or interpretation. 3. Synthesis or summary generation. Among the state-of-the-art methods, one method, unlike the others, proposes a fourth stage, which assigns a value to each term of the sentences. The method proposed by (Ledeneva and García-Hernández, 2017) was shown to give good results for single-document automatic text summarization. Building on those results, in this work we propose to tune the parameters of the different stages and to adapt the method to the task of multi-document automatic text summarization. In the proposed method, Maximal Frequent Sequences (MFSs) are extracted and used as the text model, and a graph-based method is used to weight the sentences. The corpus used was DUC-02, which consists of 59 collections of news documents. The summaries were evaluated with the ROUGE-N system, which makes it possible to compare the summaries generated by the method with human-generated summaries. The experiments were divided into three stages: the first sought the best configuration of the method; the second tested the importance of the length of the MFSs; and the third applied a new configuration for sentence selection. The results obtained by the proposed method were compared with other state-of-the-art methods and with the heuristics, and they outperform both.
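
    A minimal sketch of the graph-based sentence-weighting stage, assuming scikit-learn and networkx; plain n-gram TF-IDF stands in for the Maximal Frequent Sequence text model, which would require a dedicated sequence miner not described in the abstract.

```python
# TextRank-style sentence weighting: build a sentence graph weighted
# by pairwise cosine similarity and score nodes with PageRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, n_select=3):
    # n-gram TF-IDF as a rough stand-in for maximal frequent sequences.
    X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(sentences)
    sim = cosine_similarity(X)
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:n_select]
    return [sentences[i] for i in sorted(top)]
```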

    Supervised extractive summarisation of news events

    Get PDF
    This thesis investigates whether the summarisation of news-worthy events can be improved by using evidence about entities (i.e. people, places, and organisations) involved in the events. More effective event summaries, that better assist people with their news-based information access requirements, can help to reduce information overload in today's 24-hour news culture. Summaries are based on sentences extracted verbatim from news articles about the events. Within a supervised machine learning framework, we propose a series of entity-focused event summarisation features. Computed over multiple news articles discussing a given event, such entity-focused evidence estimates: the importance of entities within events; the significance of interactions between entities within events; and the topical relevance of entities to events. The central claim of this research is that augmenting supervised summarisation models, trained on discriminative multi-document newswire summarisation features, with entity-focused event summarisation features capturing the named entities involved in the events will yield more effective summaries of news-worthy events. The proposed entity-focused event summarisation features are thoroughly evaluated over two multi-document newswire summarisation scenarios. The first scenario is used to evaluate the retrospective event summarisation task, where the goal is to summarise an event to-date, based on a static set of news articles discussing the event. The second scenario is used to evaluate the temporal event summarisation task, where the goal is to summarise the changes in an ongoing event, based on a time-stamped stream of news articles discussing the event. The contributions of this thesis are two-fold. First, this thesis investigates the utility of entity-focused event evidence for identifying important and salient event summary sentences, and as a means to perform anti-redundancy filtering to control the volume of content emitted as a summary of an evolving event. Second, this thesis also investigates the validity of automatic summarisation evaluation metrics, the effectiveness of standard summarisation baselines, and the effective training of supervised machine learned summarisation models.
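
    A minimal sketch of the overall framework rather than the thesis's actual feature set: each candidate sentence receives simple entity-focused features (with a crude capitalisation heuristic standing in for real named-entity recognition), and a supervised model is trained to score sentences for inclusion in the summary.

```python
# Supervised extractive summarisation with toy entity-focused features.
# The entity heuristic and the two features are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

def entity_mentions(sentence):
    # Hypothetical stand-in for a proper named-entity recogniser.
    return [w for w in sentence.split() if w[:1].isupper()]

def features(sentence, event_entity_counts):
    ents = entity_mentions(sentence)
    # Feature 1: how many entities the sentence mentions.
    # Feature 2: how prominent those entities are across the event's
    # articles (approximated by event-wide mention counts).
    importance = sum(event_entity_counts.get(e, 0) for e in ents)
    return [float(len(ents)), float(importance)]

def train(sentences, labels, event_entity_counts):
    """Fit a model on sentences labelled summary-worthy (1) or not (0)."""
    X = np.array([features(s, event_entity_counts) for s in sentences])
    return LinearSVC().fit(X, np.array(labels))
```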