5 research outputs found

    Web Unit Mining: Finding and classifying subgraphs of web pages

    Get PDF
    In web classification, most researchers assume that the ob-jects to classify are individual web pages from one or more web sites. In practice, the assumption is too restrictive since a web page itself may not always correspond to a concept instance of some semantic concept (or category) given to the classification task. In this paper, we want to relax this as-sumption and allow a concept instance to be represented by a subgraph of web pages or a set of web pages. We identify several new issues to be addressed when the assumption is removed, and formulate theweb unit mining problem. We also propose an iterative web unit mining (iWUM) method that first finds subgraphs of web pages using some knowledge about web site structure. From these web subgraphs, web units are constructed and classified into semantic concepts (or categories) in an iterative manner. Our experiments us-ing the WebKB dataset showed that iWUM improves the overall classification performance and works very well on the more structured parts of a web site

    Improved Classification via Connectivity Information

    No full text
    The motivation for our work is the observation that Web pages on a particular topic are often linked to other pages on the same topic. We model and analyze the problem of how to improve the classification of Web pages (that is, determining the topic of the page) by using link information. In our setting, an initial classifier examines the text of a Web page and assigns to it some classification, possibly mistaken. We investigate how to reduce the error probability using the observation above, thus building an improved classifier. We present a theoretical framework for this problem based on a random graph model and suggest two linear time algorithms, based on similar methods that have been proven effective in the setting of error-correcting codes. We provide simulation results to verify our analysis and to compare the performance of our suggested algorithms
    corecore