
    Coherence Identification of Business Documents: Towards an Automated Message Processing System

    This paper describes our recent efforts in developing a text segmentation technique for our business document management system. The document analysis rests upon a knowledge-based analysis of the documents' contents, automating the coherence identification process without requiring full semantic understanding. In this technique, document boundaries are identified by observing the shifts of segments from one cluster to another. Our experimental results show that the combination of heterogeneous knowledge sources is capable of addressing topic shifts. Given the increasing recognition of document structure in information retrieval as well as knowledge management, this approach provides a quantitative model and automatic classification of documents in a business document management system. This will benefit the distribution of documents and the automatic launching of business processes in a workflow management system.
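    The cluster-shift idea in the abstract can be illustrated with a toy sketch (hypothetical; the paper's actual knowledge-based analysis is not reproduced here): sentences are represented as bags of words, and a segment boundary is reported wherever the similarity between adjacent sentences collapses.

```python
# Hedged sketch: detect document/segment boundaries where adjacent
# sentences stop sharing vocabulary (a stand-in for "shifts of segments
# from one cluster to another").
from collections import Counter

def tokenize(sentence):
    return [w.lower().strip(".,") for w in sentence.split()]

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def boundaries(sentences, threshold=0.1):
    """Return index i whenever similarity between sentences i-1 and i dips below threshold."""
    vecs = [Counter(tokenize(s)) for s in sentences]
    return [i for i in range(1, len(vecs)) if cosine(vecs[i - 1], vecs[i]) < threshold]

doc = [
    "The invoice lists the total payment due.",
    "Payment is due within thirty days of the invoice date.",
    "Our new office opens in Berlin next spring.",
    "The Berlin office will host the sales team.",
]
print(boundaries(doc))  # → [2] (a boundary before the third sentence, where the topic shifts)
```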

    Generating indicative-informative summaries with SumUM

    We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative-informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step toward exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated the indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies.
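    The indicative/informative split can be illustrated with a minimal extractive sketch (a hypothetical stand-in; SumUM's actual shallow-analysis and text-regeneration pipeline is far richer): the indicative part lists dominant content words as topics, and the informative part returns the sentences elaborating on a topic the reader selects.

```python
# Hedged sketch of an indicative-informative summary: frequent content
# words serve as the indicative topic list; sentences mentioning a
# reader-chosen topic serve as the informative elaboration.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "for", "then"}

def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def topics(text, k=2):
    """Indicative part: the k most frequent non-stopword terms."""
    words = [w.lower() for w in text.replace(".", " ").split() if w.lower() not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def informative(text, topic):
    """Informative part: sentences that elaborate on the chosen topic."""
    return [s for s in split_sentences(text) if topic in s.lower()]

doc = ("The parser builds a tree. The tree is pruned for speed. "
       "Pruning removes weak branches. The parser then emits tokens.")
print(topics(doc))               # indicative: dominant topics of the text
print(informative(doc, "tree"))  # informative: sentences about the reader's chosen topic
```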

    Multimedia information technology and the annotation of video

    The state of the art in multimedia information technology has not progressed to the point where a single solution is available to meet all reasonable needs of documentalists and users of video archives. In general, we do not have an optimistic view of the usability of new technology in this domain, but digitization and digital power can be expected to cause a small revolution in the area of video archiving. The volume of data leads to two views of the future: on the pessimistic side, the overload of data will cause a lack of annotation capacity; on the optimistic side, there will be enough data from which to learn selected concepts that can be deployed to support automatic annotation. At the threshold of this interesting era, we make an attempt to describe the state of the art in technology. We sample the progress in text, sound, and image processing, as well as in machine learning.

    Ontology Population in Conversational Recommender System for Smartphone Domain

    The Conversational Recommender System (CRS) is a knowledge-based recommendation system that uses an ontology as its knowledge representation. The knowledge of a CRS is based on a real-world knowledge-base service, where information on the topic, such as product details and descriptions, must always be up to date. However, the process of gathering this information is still conducted manually. That process is very time-consuming and prone to error. Therefore, automatic or semi-automatic processes that can update, find, and insert information into the knowledge base to match a given ontology are needed. Hence, this study aims to design a framework for ontology population in Conversational Recommender Systems, based on the functional requirements as in [4], from tabular web documents, so that the resulting ontology instances can substitute for manual ontology updates in the CRS. The framework includes a clustering process that employs the Bi-Layer K-Means Clustering Algorithm as part of knowledge acquisition. To reach this objective, it is necessary to analyze and check the individual consistency of the resulting ontology. Another aim of this study is to analyze whether the resulting ontology is still suitable by checking it against the CRS ontology requirements. The experiment is conducted using data from www.gsmarena.com gathered through a crawler engine. There are four steps in the ontology population process: document crawling, page identification (individuals, attributes, and values), knowledge acquisition, and OWL ontology export. Using input from the tabular web documents and an OWL ontology export that maps the instances and relations, the results show that the specifications falling into the Weak, Reasonable, and Strong Clustering categories can be recommended for the Conversational Recommender System ontology. Analysis of consistency checking shows that the ontology remains consistent and suitable for the CRS ontology requirements.
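    A two-layer clustering pass in the spirit of the Bi-Layer K-Means step above can be sketched as follows (a hedged toy sketch: the paper's actual feature sets and cluster-quality categories are not given here, so price and RAM are assumed as stand-in attributes).

```python
# Hedged sketch: layer 1 clusters smartphone specs on one numeric
# attribute (price), layer 2 re-clusters each layer-1 group on a second
# attribute (RAM), yielding a (layer1, layer2) label pair per device.
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means; returns one cluster label per value."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    k = len(centers)  # guard against fewer initial centers than requested
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def bilayer(specs, k1=2, k2=2):
    """specs: list of (price, ram_gb) pairs. Returns a (layer1, layer2) label per spec."""
    layer1 = kmeans_1d([p for p, _ in specs], k1)
    result = [None] * len(specs)
    for c in set(layer1):
        idx = [i for i, l in enumerate(layer1) if l == c]
        sub = kmeans_1d([specs[i][1] for i in idx], min(k2, len(idx)))
        for i, l in zip(idx, sub):
            result[i] = (c, l)
    return result

phones = [(150, 2), (180, 3), (900, 8), (950, 12)]
print(bilayer(phones))  # budget vs. flagship tier, then split by RAM within each tier
```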

    Enriching very large ontologies using the WWW

    This paper explores the possibility of exploiting text on the World Wide Web in order to enrich the concepts in existing ontologies. First, a method to retrieve documents from the WWW related to a concept is described. These document collections are used 1) to construct topic signatures (lists of topically related words) for each concept in WordNet, and 2) to build hierarchical clusters of the concepts (the word senses) that lexicalize a given word. The overall goal is to overcome two shortcomings of WordNet: the lack of topical links among concepts, and the proliferation of senses. Topic signatures are validated on a word sense disambiguation task with good results, which are improved when the hierarchical clusters are used.
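    Topic-signature construction as described above can be sketched minimally (a hedged sketch: a simple frequency-ratio score stands in for whatever weighting the paper actually uses): words that occur in a concept's retrieved documents far more often than in a background collection form the signature.

```python
# Hedged sketch: score each word in the concept's documents by its
# frequency relative to a background collection; the top-scoring words
# form the concept's topic signature.
from collections import Counter

def signature(concept_docs, background_docs, top=3):
    fg = Counter(w for d in concept_docs for w in d.lower().split())
    bg = Counter(w for d in background_docs for w in d.lower().split())
    score = {w: fg[w] / (bg[w] + 1) for w in fg}  # +1 smooths words unseen in the background
    return [w for w, _ in sorted(score.items(), key=lambda x: -x[1])[:top]]

bird_docs = ["birds have feathers and wings", "wings let birds fly", "feathers keep birds warm"]
general_docs = ["cars have wheels", "people have houses and cars"]
print(signature(bird_docs, general_docs))  # topically related words for the "bird" concept
```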

    Some Reflections on the Task of Content Determination in the Context of Multi-Document Summarization of Evolving Events

    Despite its importance, the task of summarizing evolving events has received little attention from researchers in the field of multi-document summarization. In a previous paper (Afantenos et al. 2007) we presented a methodology for the automatic summarization of documents, emitted by multiple sources, which describe the evolution of an event. At the heart of this methodology lies the identification of similarities and differences between the various documents along two axes: the synchronic and the diachronic. This is achieved by introducing the notion of Synchronic and Diachronic Relations. Those relations connect the messages that are found in the documents, thus resulting in a graph which we call a grid. Although the creation of the grid completes the Document Planning phase of a typical NLG architecture, it can be the case that the number of messages contained in a grid is very large, thus exceeding the required compression rate. In this paper we provide some initial thoughts on a probabilistic model which can be applied at the Content Determination stage and which tries to alleviate this problem.
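    The grid structure described above can be illustrated with a toy sketch (hedged: here synchronic edges are assumed to connect messages from different sources at the same time point, and diachronic edges to connect messages from one source across time; the paper's actual relation-typing logic is not reproduced).

```python
# Hedged illustration of a message grid: nodes are (id, source, time)
# messages, edges are typed as synchronic or diachronic.
from itertools import combinations

def build_grid(messages):
    """messages: list of (msg_id, source, time). Returns a labeled edge list."""
    edges = []
    for (a, sa, ta), (b, sb, tb) in combinations(messages, 2):
        if ta == tb and sa != sb:
            edges.append((a, b, "synchronic"))   # same time, different sources
        elif sa == sb and ta != tb:
            edges.append((a, b, "diachronic"))   # same source, across time
    return edges

msgs = [("m1", "reuters", 1), ("m2", "afp", 1), ("m3", "reuters", 2), ("m4", "afp", 2)]
print(build_grid(msgs))
```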

    Cross Validation Of Neural Network Applications For Automatic New Topic Identification

    There are recent studies in the literature on automatic topic-shift identification in Web search engine user sessions; however, most of this work applied topic-shift identification algorithms to data logs from a single search engine. The purpose of this study is to provide cross-validation of an artificial neural network application that automatically identifies topic changes in a Web search engine user session, by using data logs of different search engines for training and testing the neural network. Sample data logs from the Norwegian search engine FAST (currently owned by Overture) and from Excite are used in this study. The findings of this study suggest that it is possible to identify topic shifts and continuations successfully in a particular search engine's user sessions using neural networks trained on a different search engine's data log.
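    The classification task above can be sketched with a single-unit perceptron (a hedged toy sketch: the study's actual network architecture and feature set are not reproduced; term overlap with the previous query and time gap are assumed as stand-in session features).

```python
# Hedged sketch: a tiny perceptron labels each query transition as a
# topic shift (1) or continuation (0) from two session features.
def train_perceptron(samples, epochs=50, lr=0.1):
    """samples: list of ((overlap, gap), label) pairs. Returns (weights, bias)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (overlap, gap), label in samples:
            pred = 1 if w[0] * overlap + w[1] * gap + b > 0 else 0
            err = label - pred                 # perceptron update on misclassification
            w[0] += lr * err * overlap
            w[1] += lr * err * gap
            b += lr * err
    return w, b

def predict(model, overlap, gap):
    w, b = model
    return 1 if w[0] * overlap + w[1] * gap + b > 0 else 0

# Features: term overlap with the previous query (0..1), time gap in hours.
# Label 1 = topic shift, 0 = continuation.
train = [((0.9, 0.02), 0), ((0.8, 0.03), 0), ((0.0, 0.5), 1), ((0.1, 0.4), 1)]
model = train_perceptron(train)
print(predict(model, 0.85, 0.02), predict(model, 0.05, 0.7))  # → 0 1
```

    Cross-validation in the study's sense would train this model on one engine's log and test it on another's; the perceptron here simply stands in for the neural network.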