4 research outputs found

    Automatically Finding Significant Topical Terms from Documents

    With the pervasiveness of digital textual data, text mining is becoming increasingly important for gaining competitive advantage. One factor in successful text mining applications is the ability to find significant topical terms for discovering interesting patterns or relationships. Document keyphrases are phrases carrying the most important topical concepts of a given document. In many applications, keyphrases as textual elements are better suited for text mining and can provide more discriminating power than single words. This paper describes an automatic keyphrase identification program (KIP). KIP’s algorithm examines the composition of noun phrases and calculates their scores by looking up a domain-specific glossary database; the phrases with higher scores are extracted as keyphrases. KIP’s learning function can enrich its glossary database by automatically adding newly identified keyphrases. KIP’s personalization feature allows the user to build a glossary database tailored to his or her area of interest.
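
    The glossary-lookup scoring described in this abstract can be illustrated with a minimal sketch. The abstract gives neither KIP's scoring formula nor its database schema, so everything here is an assumption: the glossary is a plain dictionary of hypothetical weights, a phrase's score is taken to be the weight of the whole phrase plus the weights of its component words, and the candidate noun phrases are supplied by hand rather than extracted by a parser. The names `score_phrase` and `extract_keyphrases` are illustrative and do not come from KIP.

```python
from typing import Dict, List, Tuple


def score_phrase(phrase: str, glossary: Dict[str, float]) -> float:
    """Assumed rule: weight of the full phrase plus the weights of its words."""
    phrase = phrase.lower()
    words = phrase.split()
    score = glossary.get(phrase, 0.0)
    if len(words) > 1:
        # Multi-word phrases also collect the weights of their component words.
        score += sum(glossary.get(word, 0.0) for word in words)
    return score


def extract_keyphrases(
    candidates: List[str],
    glossary: Dict[str, float],
    top_k: int = 2,
    learn: bool = False,
) -> List[Tuple[str, float]]:
    """Rank candidate noun phrases by score and keep the top_k non-zero ones."""
    scored = [(c.lower(), score_phrase(c, glossary)) for c in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    top = [pair for pair in scored[:top_k] if pair[1] > 0]
    if learn:
        # Sketch of the learning function: newly identified keyphrases are
        # added to the glossary; existing entries are left unchanged.
        for phrase, score in top:
            glossary.setdefault(phrase, score)
    return top


if __name__ == "__main__":
    # Hypothetical glossary weights, invented for illustration only.
    glossary = {"text mining": 9.0, "keyphrase": 8.0, "glossary": 5.0, "database": 3.0}
    candidates = ["glossary database", "Text Mining", "user interface"]
    print(extract_keyphrases(candidates, glossary, learn=True))
```

    A personalized glossary, in this reading, would simply be a different dictionary passed in for a different user or domain.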

    Text mining with exploitation of user's background knowledge: discovering novel association rules from text

    The goal of text mining is to find interesting and non-trivial patterns or knowledge in unstructured documents. Both objective and subjective measures have been proposed in the literature to evaluate the interestingness of discovered patterns. However, objective measures alone are insufficient because they do not consider the knowledge and interests of the users. Subjective measures require explicit input of user expectations, which is difficult or even impossible to obtain in text mining environments. This study proposes a user-oriented text-mining framework and applies it to the problem of discovering novel association rules from documents. The developed system, uMining, consists of two major components: a background knowledge developer and a novel association rules miner. The background knowledge developer learns a user's background knowledge by extracting keywords from documents already known to the user (background documents) and developing a concept hierarchy to organize popular keywords. The novel association rules miner discovers association rules among noun phrases extracted from relevant documents (target documents) and compares the rules with the background knowledge to predict the novelty of each rule to the particular user (user-oriented novelty). The user-oriented novelty measure is defined as the semantic distance between the antecedent and the consequent of a rule in the background knowledge. It consists of two components: occurrence distance and connection distance. The former considers the co-occurrences of two keywords in the background documents: the more co-occurrences, the shorter the distance. The latter considers the common connections of the two keywords with others in the concept hierarchy. It is defined as the length of the path connecting the two keywords in the concept hierarchy: the longer the path, the longer the distance. The user-oriented novelty measure is evaluated from two perspectives: novelty prediction accuracy and usefulness indication power. The results show that the user-oriented novelty measure outperforms the WordNet novelty measure and the compared objective measures in terms of predicting novel rules and identifying useful rules.
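
    The two distance components named in this abstract can be sketched as follows. The abstract states only the direction of each relationship (more co-occurrences give a shorter distance; a longer path gives a longer distance), so the concrete formulas are assumptions: occurrence distance is taken here as 1 / (1 + number of co-occurring background documents), and connection distance as the shortest path length between two keywords when the concept hierarchy is treated as an undirected graph. The function names are hypothetical and are not taken from uMining, and how the two components are combined is left out because the abstract does not say.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def occurrence_distance(a: str, b: str, background_docs: List[Set[str]]) -> float:
    """Assumed form: shrinks as the two keywords co-occur in more documents."""
    co_occurrences = sum(1 for doc in background_docs if a in doc and b in doc)
    return 1.0 / (1.0 + co_occurrences)


def build_hierarchy(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Store (parent, child) links as an undirected adjacency map."""
    graph: Dict[str, Set[str]] = {}
    for parent, child in edges:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    return graph


def connection_distance(a: str, b: str, hierarchy: Dict[str, Set[str]]) -> Optional[int]:
    """Shortest path length in edges (breadth-first search); None if unconnected."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in hierarchy.get(node, set()):
            if neighbor == b:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


if __name__ == "__main__":
    # Invented background documents (as keyword sets) and hierarchy edges.
    docs = [{"text", "mining"}, {"text", "mining", "rule"}, {"rule"}]
    tree = build_hierarchy([
        ("data mining", "text mining"),
        ("data mining", "rule"),
        ("text mining", "keyphrase"),
    ])
    print(occurrence_distance("text", "mining", docs))
    print(connection_distance("keyphrase", "rule", tree))
```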

    People-search: searching for people sharing similar interests from the web

    On the Web, there are limited ways of finding people who share similar interests or background with a given person. The current methods, such as using regular search engines, are either ineffective or time-consuming. In this work, a new approach for searching the Web for people sharing similar interests, called People-Search, is presented. Given a person, finding similar people on the Web raises two major research issues: person representation and person matching. In this study, a person representation method is proposed that uses a person's website to represent that person's interests and background. The design of the matching process takes person representation into consideration so that the same representation can be used when composing the query, which is also a personal website. Based on this person representation method, the main proposed algorithm integrates the textual content and hyperlink information of all the pages belonging to a personal website to represent a person and match persons. Other algorithms, based on different combinations of content, inlink, and outlink information of an entire personal website or only the main page, are also explored and compared to the main proposed algorithm. Two kinds of evaluations were conducted. In the automatic evaluation, precision, recall, F and Kruskal-Goodman Γ measures were used to compare these algorithms. In the human evaluation, the effectiveness of the main proposed algorithm and two other important ones was evaluated by human subjects. Results from both evaluations show that the People-Search algorithm integrating content and link information of all pages belonging to a personal website outperformed all other algorithms in finding similar people on the Web.
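
    One way to picture the integration of content and link information is the sketch below. The abstract does not specify how People-Search weights or combines the two signals, so this substitutes a generic technique: cosine similarity over bag-of-words term counts for the text, Jaccard overlap for a set of linked URLs, and a linear mix controlled by a hypothetical `alpha` parameter. Inlinks and outlinks are not distinguished here, and the function names are illustrative only.

```python
import math
from collections import Counter
from typing import Iterable, Set


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of links the two sites have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def site_similarity(
    text_a: str,
    links_a: Iterable[str],
    text_b: str,
    links_b: Iterable[str],
    alpha: float = 0.5,
) -> float:
    """Assumed combination: alpha * content similarity + (1 - alpha) * link overlap."""
    content = cosine(Counter(text_a.lower().split()), Counter(text_b.lower().split()))
    links = jaccard(set(links_a), set(links_b))
    return alpha * content + (1.0 - alpha) * links


if __name__ == "__main__":
    # Invented website text and link sets for two hypothetical people.
    print(site_similarity("text mining", ["a.org", "b.org"], "text retrieval", ["b.org"]))
```

    In this sketch the text of all pages of a personal website would be concatenated before the call, so the same representation serves for both the query site and the candidate sites.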

    Li et al., Automatically Finding Significant Topical Terms from Documents

    No full text
    With the pervasiveness of digital textual data, text mining is becoming increasingly important for gaining competitive advantage. One factor in successful text mining applications is the ability to find significant topical terms for discovering interesting patterns or relationships. Document keyphrases are phrases carrying the most important topical concepts of a given document. In many applications, keyphrases as textual elements are better suited for text mining and can provide more discriminating power than single words. This paper describes an automatic keyphrase identification program (KIP). KIP’s algorithm examines the composition of noun phrases and calculates their scores by looking up a domain-specific glossary database; the phrases with higher scores are extracted as keyphrases. KIP’s learning function can enrich its glossary database by automatically adding newly identified keyphrases. KIP’s personalization feature allows the user to build a glossary database tailored to his or her area of interest.