
    Knowledge Discovery from Financial Text

    The abundance of on-line electronic financial news articles has opened up new possibilities for intelligent systems that can extract and organize relevant knowledge automatically in a usable format. Most typical information extraction systems require a hand-built dictionary of templates and, consequently, are subject to ceaseless modification to accommodate new patterns observed in the text. In this research, we propose a novel text-based decision support system (DSS) that (i) extracts event sequences from shallow text patterns and (ii) predicts the likelihood of the occurrence of events using a classifier-based inference engine. We investigated more than 2,000 financial reports comprising 28,000 sentences. Experiments show that the DSS outperforms other similar statistical models.
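
    As a rough illustration of the two-stage idea described in this abstract, the sketch below pairs (i) shallow pattern matching for event extraction with (ii) a simple frequency-based predictor of follow-up events. The event patterns, labels and toy data are assumptions made for illustration; the abstract does not specify the DSS's actual patterns or classifier.

```python
import re
from collections import Counter, defaultdict

# (i) Shallow text patterns: hypothetical regexes mapping sentences to event labels.
EVENT_PATTERNS = {
    "profit_warning": re.compile(r"\bprofit warning\b", re.I),
    "rate_cut":       re.compile(r"\bcut(s)? interest rates?\b", re.I),
    "share_drop":     re.compile(r"\bshares? (fell|dropped|slid)\b", re.I),
}

def extract_events(sentences):
    """Map each sentence to the first matching event label, if any."""
    events = []
    for sentence in sentences:
        for label, pattern in EVENT_PATTERNS.items():
            if pattern.search(sentence):
                events.append(label)
                break
    return events

# (ii) Classifier-based inference, reduced here to conditional frequencies
# P(next event | current event) estimated from the extracted sequences.
def train(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, current_event, top_k=3):
    """Return (event, probability) pairs for the most likely follow-up events."""
    total = sum(counts[current_event].values()) or 1
    return [(e, c / total) for e, c in counts[current_event].most_common(top_k)]

if __name__ == "__main__":
    docs = [["The firm issued a profit warning on Monday.",
             "Its shares fell sharply in afternoon trading."]]
    model = train([extract_events(doc) for doc in docs])
    print(predict_next(model, "profit_warning"))  # [('share_drop', 1.0)]
```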

    Coherence Identification of Business Documents: Towards an Automated Message Processing System

    This paper describes our recent efforts in developing a text segmentation technique for our business document management system. The document analysis is based upon a knowledge-based analysis of the documents’ contents, automating the coherence identification process without a full semantic understanding. In this technique, document boundaries are identified by observing the shifts of segments from one cluster to another. Our experimental results show that the combination of heterogeneous knowledge is capable of addressing topic shifts. Given the increasing recognition of document structure in the fields of information retrieval and knowledge management, this approach provides a quantitative model and automatic classification of documents in a business document management system. This will be beneficial to the distribution of documents or the automatic launching of business processes in a workflow management system.
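
    A minimal sketch of boundary detection by cluster shifts, in the spirit of the description above: sentences are clustered (here with k-means over TF-IDF vectors, which is an assumption, since the paper relies on a knowledge-based analysis) and a boundary is recorded wherever consecutive sentences fall into different clusters.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def segment_by_cluster_shifts(sentences, n_clusters=2):
    """Return the indices at which the cluster label of consecutive sentences changes."""
    vectors = TfidfVectorizer().fit_transform(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    # A boundary is placed wherever the text "shifts" from one cluster to another.
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

sentences = [
    "The invoice covers the first quarter of the fiscal year.",
    "Payment is due within thirty days of receipt.",
    "The shipment left the warehouse on Friday.",
    "Tracking information will follow by e-mail.",
]
print(segment_by_cluster_shifts(sentences))
```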

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents covering a variety of research fields, such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium, consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performance. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for downloading annotation data and for the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new, powerful techniques for title- and title/abstract-based search engines for relevant articles in biomedical research.
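
    As an illustration of the kind of baseline evaluated above, the sketch below ranks candidate articles against a seed abstract by TF-IDF cosine similarity; the texts are invented, and this is not the consortium's evaluation code or the exact baseline configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(seed_abstract, candidate_abstracts):
    """Rank candidate articles by TF-IDF cosine similarity to the seed article."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([seed_abstract] + candidate_abstracts)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    # Higher similarity first; each entry is (candidate index, similarity score).
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

seed = "Deep learning methods for protein structure prediction."
candidates = [
    "A transformer model predicts protein folding from sequence data.",
    "Survey of classical force fields in molecular dynamics.",
]
print(rank_candidates(seed, candidates))
```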

    Anaphora Resolution as Lexical Cohesion Identification

    Anaphora, an important indicator of lexical cohesion, is a discourse-level linguistic phenomenon. Most theoretical linguistic approaches to the interpretation of anaphoric expressions propose a treatment based purely on syntactic information. In this article, we propose to cast anaphora resolution as a semantic inference process in which a combination of multiple strategies, each exploiting a different source of linguistic knowledge, is employed to resolve anaphora coherently. We also show how to embed anaphora resolution in a framework that captures all the salient parameters and remedies, to a certain extent, the inadequacies found in monolithic resolution systems. The effectiveness of the anaphora resolution considered in this work is demonstrated through a set of simulations.
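
    A toy sketch of the multi-strategy idea described above: each strategy scores candidate antecedents independently, and a weighted combination selects the winner. The strategies, weights and candidate representation are assumptions made for illustration, not the authors' actual parameters.

```python
# Candidate antecedents for a pronoun, as (mention, sentence distance,
# grammatical role, number) tuples -- a deliberately simplified representation.
CANDIDATES = [
    ("the committee", 2, "subject", "singular"),
    ("the reports",   1, "object",  "plural"),
    ("the chairman",  1, "subject", "singular"),
]

# Each strategy exploits one kind of linguistic knowledge and returns a score in [0, 1].
STRATEGIES = {
    "recency":   lambda m, pron: 1.0 / (1 + m[1]),                   # closer mentions score higher
    "agreement": lambda m, pron: 1.0 if m[3] == pron["number"] else 0.0,
    "salience":  lambda m, pron: 1.0 if m[2] == "subject" else 0.4,  # subjects are more salient
}
WEIGHTS = {"recency": 0.3, "agreement": 0.5, "salience": 0.2}  # assumed weights

def resolve(pronoun, candidates=CANDIDATES):
    """Return the candidate antecedent with the highest combined score."""
    def combined_score(mention):
        return sum(WEIGHTS[name] * strategy(mention, pronoun)
                   for name, strategy in STRATEGIES.items())
    return max(candidates, key=combined_score)

print(resolve({"surface": "he", "number": "singular"}))  # picks "the chairman"
```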

    Textual Information Segmentation by Cohesive Ties

    This paper proposes a novel approach to clustering texts automatically into coherent segments. A set of mutual linguistic constraints that largely determines the similarity of meaning among lexical items is used, and a weight function is devised to incorporate the diversity of linguistic bonds within the text. A computational method of extracting the gist from a higher-order structure representing the tremendous diversity of interrelationships among items is presented, and topic boundaries between segments in a text are identified. Our text segmentation is regarded as a process of identifying the shifts from one segment cluster to another. The experimental results show that the combination of these constraints is capable of addressing the topic shifts of texts.
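
    A simplified sketch of one way to weight cohesive ties between adjacent sentences and place topic boundaries where the combined tie strength drops below a threshold; the tie types, weights and the naive stemmer are assumptions for illustration, not the paper's constraint set or weight function.

```python
TIE_WEIGHTS = {"repetition": 1.0, "shared_stem": 0.5}  # assumed weight function

def crude_stem(word):
    return word.rstrip("s")  # deliberately naive stemming for the sketch

def tie_strength(sent_a, sent_b):
    """Sum the weighted cohesive ties between the word sets of two sentences."""
    a = {w.lower().strip(".,") for w in sent_a.split()}
    b = {w.lower().strip(".,") for w in sent_b.split()}
    repetitions = a & b
    shared_stems = {crude_stem(w) for w in a} & {crude_stem(w) for w in b}
    stem_ties = shared_stems - {crude_stem(w) for w in repetitions}
    return (TIE_WEIGHTS["repetition"] * len(repetitions)
            + TIE_WEIGHTS["shared_stem"] * len(stem_ties))

def boundaries(sentences, threshold=1.5):
    """A topic boundary falls between sentences whose tie strength is below the threshold."""
    return [i for i in range(1, len(sentences))
            if tie_strength(sentences[i - 1], sentences[i]) < threshold]

text = [
    "The bank raised its deposit rates this week.",
    "Higher rates attracted new deposit customers.",
    "Meanwhile the museum opened a new exhibition.",
]
print(boundaries(text))  # [2]: the shift to the museum sentence
```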