
    Joint semantic discourse models for automatic multi-document summarization

    Automatic multi-document summarization aims at selecting the essential content of related documents and presenting it in a summary. In this paper, we propose methods for automatic summarization based on Rhetorical Structure Theory and Cross-document Structure Theory. These theories were chosen in order to properly address the relevance of information, multi-document phenomena, and the distribution of subtopics in the source texts. The results show that using semantic discourse knowledge in content-selection strategies produces summaries that are more informative. [Funding: FAPESP, CAPES]
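The content-selection idea can be sketched in spirit: in Rhetorical Structure Theory, the nucleus of a relation carries the more central content, so text units that act as nuclei more often make better summary candidates. The sketch below is a hypothetical illustration under that assumption, not the authors' actual method; the units, relations, and function names are invented.

```python
# Hedged sketch: score text units by how often they appear as the
# nucleus of an RST relation, then select the top-scoring units.
# The sample units and relations are hand-invented for illustration.

def select_content(units, relations, k=2):
    """units: {id: text}; relations: (nucleus_id, satellite_id, name)."""
    score = {uid: 0 for uid in units}
    for nucleus, _satellite, _name in relations:
        score[nucleus] += 1          # nuclei carry the central content
    ranked = sorted(units, key=lambda u: score[u], reverse=True)
    return [units[u] for u in ranked[:k]]

units = {1: "The company reported record profits.",
         2: "Analysts had expected losses.",
         3: "Its stock rose 8%."}
relations = [(1, 2, "CONTRAST"), (1, 3, "ELABORATION")]
print(select_content(units, relations, k=1))
```

In a multi-document setting, Cross-document Structure Theory relations between documents could feed the same scoring loop.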

    A Latent-Dirichlet-Allocation Based Extension for Domain Ontology of Enterprise’s Technological Innovation

    This paper proposes a method for automatically building a domain ontology of enterprises' technological innovation from a plain-text corpus based on Latent Dirichlet Allocation (LDA). The proposed method consists of four modules: 1) introducing a seed ontology for the domain of enterprises' technological innovation; 2) preprocessing the collected textual data with Natural Language Processing (NLP) techniques; 3) mining domain-specific terms from the document collection with LDA; 4) obtaining relationships between the terms through defined relevance rules. Experiments carried out to demonstrate the effectiveness of the method indicate that many terms in the domain of enterprises' technological innovation, and the semantic relations between them, are discovered. The proposed method is a process of continuous cycles and iterations: the resulting ontology can be fed back in as the initial seed ontology, so that constant knowledge acquisition in the domain updates and refines it.
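The term-mining step (module 3) can be illustrated with a toy collapsed Gibbs sampler for LDA. This is a minimal sketch of the general technique, not the paper's implementation; the tiny corpus, hyperparameters, and function name are all invented for illustration.

```python
import random
from collections import defaultdict

def lda_top_terms(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the top 3 terms per topic."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # tokens per topic
    z = []                                      # topic assignment per token
    for di, doc in enumerate(docs):             # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                      # Gibbs sweeps
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]                   # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights)[0]   # resample topic
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [sorted(nkw[j], key=nkw[j].get, reverse=True)[:3] for j in range(k)]

docs = [["patent", "r&d", "innovation", "patent"],
        ["funding", "venture", "capital", "funding"],
        ["innovation", "patent", "r&d"],
        ["capital", "venture", "funding"]]
print(lda_top_terms(docs, k=2))
```

In the paper's pipeline, the high-probability terms per topic would then be filtered by the relevance rules to become ontology candidates.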

    COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres

    In this paper, we present a Text Summarisation tool, compendium, capable of generating the most common types of summaries. Regarding the input, single- and multi-document summaries can be produced; as for the output, the summaries can be extractive or abstractive-oriented; and finally, concerning their purpose, the summaries can be generic, query-focused, or sentiment-based. The proposed architecture for compendium is divided into various stages, making a distinction between core and additional stages. The former constitute the backbone of the tool and are common to the generation of any type of summary, whereas the latter are used to enhance the capabilities of the tool. The main contributions of compendium with respect to state-of-the-art summarisation systems are that (i) it specifically deals with the problem of redundancy by means of textual entailment; (ii) it combines statistical and cognitive-based techniques for determining relevant content; and (iii) it proposes an abstractive-oriented approach to the challenge of abstractive summarisation. The evaluation performed in different domains and textual genres, comprising traditional texts as well as texts extracted from the Web 2.0, shows that compendium is very competitive and appropriate for use as a tool for generating summaries. This research has been supported by the project “Desarrollo de Técnicas Inteligentes e Interactivas de Minería de Textos” (PROMETEO/2009/119) and project ACOMP/2011/001 from the Valencian Government, as well as by the Spanish Government (grant no. TIN2009-13391-C04-01).
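The redundancy stage can be illustrated in spirit. compendium uses textual entailment; as a rough stand-in, the sketch below treats a candidate sentence as redundant when most of its content words are already covered by the summary built so far. The function names, stopword list, and overlap threshold are assumptions for illustration, not the tool's actual API.

```python
def content_words(sentence):
    """Crude content-word extraction: lowercase tokens minus stopwords."""
    stop = {"the", "a", "an", "of", "in", "is", "are", "was", "by", "and", "to"}
    return {w.strip(".,").lower() for w in sentence.split()} - stop

def drop_redundant(ranked_sentences, threshold=0.8):
    """Greedy selection: skip a sentence when >= threshold of its content
    words already appear in the summary so far (a word-overlap proxy for
    the entailment check the paper describes)."""
    summary, covered = [], set()
    for s in ranked_sentences:
        words = content_words(s)
        if words and len(words & covered) / len(words) >= threshold:
            continue                 # treated as entailed, hence redundant
        summary.append(s)
        covered |= words
    return summary

sents = ["The river flooded the old town.",
         "The old town was flooded by the river.",
         "Rescue teams arrived overnight."]
print(drop_redundant(sents))
```

A real entailment component would also catch paraphrases with little lexical overlap, which this proxy misses.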

    A review of the extractive text summarization

    Research in the area of automatic text summarization has intensified in recent years due to the large amount of information available in electronic documents. This article presents the most relevant methods for automatic extractive text summarization that have been developed for both a single document and multiple documents, with special emphasis on methods based on algebraic reduction, clustering, and evolutionary models, which have attracted a great amount of research in recent years since they are language-independent and unsupervised.
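The clustering family of methods the article surveys can be sketched roughly: group sentences whose word overlap is high, then keep one representative per group as the extract. The code below is an invented minimal sketch of that idea, not any specific surveyed system; the similarity measure and threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_summarize(sentences, sim_threshold=0.5):
    """Greedy clustering: a sentence joins the first cluster whose
    representative is similar enough, otherwise it starts a new cluster.
    The summary keeps one representative sentence per cluster."""
    reps = []
    for s in sentences:
        if not any(jaccard(s, r) >= sim_threshold for r in reps):
            reps.append(s)
    return reps

sents = ["cats chase mice",
         "Cats chase mice quickly",
         "dogs bark at night"]
print(cluster_summarize(sents))
```

Algebraic-reduction methods such as LSA would instead factor the term-sentence matrix and pick sentences with the strongest latent components; the clustering sketch shares the same language-independent, unsupervised character.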

    Exploiting Wikipedia Semantics for Computing Word Associations

    Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexical and semantic relations, such as hyponymy (car and vehicle) and functional associations (bank and manager). Humans can infer implicit relationships between two textual units based on their knowledge about the world and their ability to reason about that knowledge. Automatically imitating this behavior is limited by restricted knowledge and a poor ability to infer hidden relations. Various factors affect the performance of automated approaches to computing semantic association strength. One critical factor is the selection of a suitable knowledge source for extracting knowledge about implicit semantic relations. In the past few years, semantic association computation approaches have started to exploit web-originated resources as substitutes for conventional lexical semantic resources such as thesauri, machine-readable dictionaries and lexical databases. These conventional knowledge sources suffer from limitations such as coverage issues, high construction and maintenance costs, and limited availability. One solution to these issues is to use the wisdom of crowds in the form of collaboratively constructed knowledge sources. An excellent example of such a knowledge source is Wikipedia, which stores detailed information not only about concepts themselves but also about various aspects of the relations among them. The overall goal of this thesis is to demonstrate that using Wikipedia for computing word association strength yields better estimates of humans' associations than approaches based on other structured and unstructured knowledge sources.
There are two key challenges in achieving this goal: first, to exploit various semantic association models based on different aspects of Wikipedia in developing new measures of semantic association; and second, to evaluate these measures against human performance in a range of tasks. The focus of the thesis is on exploring two aspects of Wikipedia: as a formal knowledge source, and as an informal text corpus. The first contribution of the work included in the thesis is that it effectively exploits the knowledge-source aspect of Wikipedia by developing new measures of semantic association based on the Wikipedia hyperlink structure, the informative content of articles, and combinations of both. It was found that Wikipedia can be effectively used for computing noun-noun similarity, that a model based on hybrid combinations of Wikipedia structure and informative-content features performs better than models based on individual features, and that the structure-based measures outperformed the informative-content-based measures on both semantic similarity and semantic relatedness computation tasks. The second contribution is that the thesis effectively exploits the corpus aspect of Wikipedia by developing a new measure of semantic association based on asymmetric word associations. The thesis introduces an asymmetric-association measure built on the idea of directional context, inspired by the free word association task; the underlying assumption is that association strength can change with changing context. It was found that the asymmetric measure performed better than symmetric measures on semantic association computation, relatedness-based word choice and causality detection tasks; however, asymmetric measures have no advantage for synonymy-based word choice tasks.
It was also found that Wikipedia is not a good knowledge source for capturing verb relations, due to its focus on encyclopedic concepts, especially nouns. It is hoped that future research will build on the experiments and discussions presented in this thesis to explore new avenues, using Wikipedia to find deeper and semantically more meaningful associations in a wide range of application areas based on humans' estimates of word associations.
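The directional-context idea can be sketched with conditional co-occurrence probabilities: the association from a to b, estimated from what follows a, generally differs from the association from b to a. The toy sentence, window size, and function names below are invented for illustration and are not the thesis's actual measure.

```python
from collections import Counter

def asymmetric_assoc(tokens, window=2):
    """Estimate P(b | a) from right-context co-occurrence: of all words
    seen within `window` positions after a, what fraction are b?
    By construction assoc(a, b) generally differs from assoc(b, a)."""
    right = {}  # word -> Counter of words appearing in its right context
    for i, a in enumerate(tokens):
        right.setdefault(a, Counter()).update(tokens[i + 1 : i + 1 + window])

    def assoc(a, b):
        ctx = right.get(a, Counter())
        total = sum(ctx.values())
        return ctx[b] / total if total else 0.0

    return assoc

tokens = "bank manager met the bank customer near the river bank".split()
assoc = asymmetric_assoc(tokens)
print(assoc("bank", "manager"), assoc("manager", "bank"))
```

Running on a Wikipedia-scale corpus rather than one sentence, the same directional counts would feed the relatedness and causality experiments described above.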