5 research outputs found

    Cleaning Web pages for effective Web content mining.

    Web pages usually contain many noisy blocks, such as advertisements, navigation bars, and copyright notices. These noisy blocks can seriously affect web content mining because their contents are irrelevant to the main content of the page. Eliminating noisy blocks before performing web content mining is therefore important for improving mining accuracy and efficiency. A few existing approaches detect noisy blocks with exactly the same contents but are weak at detecting near-duplicate blocks, such as navigation bars. This thesis proposes a new system, WebPageCleaner, which, given a collection of web pages from a web site, eliminates noisy blocks from those pages so as to improve the accuracy and efficiency of web content mining. WebPageCleaner detects noisy blocks with exactly the same contents as well as those with near-duplicate contents. It is based on the observation that noisy blocks usually share common contents and appear frequently on a given web site. WebPageCleaner consists of three modules: block extraction, block importance retrieval, and cleaned file generation. A vision-based technique is employed to extract blocks from web pages. Each block is assigned an importance degree according to block features such as block position and the level of similarity of its contents to those of other blocks. Finally, a collection of cleaned files, made up of content with a high importance degree, is generated and used for web content mining. The proposed technique is evaluated using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more efficient and accurate web page classification than existing approaches.
    Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .L5. Source: Masters Abstracts International, Volume: 45-01, page: 0359. Thesis (M.Sc.)--University of Windsor (Canada), 2006.
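
The observation that noisy blocks share common contents and recur across a site can be illustrated with a minimal Python sketch. This is not WebPageCleaner itself: it assumes blocks have already been extracted as plain text (the thesis uses a vision-based technique for that step), it ignores block position, and the function names, the word-shingle Jaccard measure, and both thresholds are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a block's text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def noisy_blocks(
    pages: Dict[str, List[str]],
    sim_threshold: float = 0.6,   # assumed cut-off for "near-duplicate"
    freq_threshold: float = 0.5,  # assumed fraction of other pages for "frequent"
) -> Dict[str, List[int]]:
    """For each page, list the indices of blocks that have a near-duplicate
    on at least freq_threshold of the other pages of the site."""
    sets = {url: [shingles(b) for b in blocks] for url, blocks in pages.items()}
    result: Dict[str, List[int]] = {}
    for url, blocks in sets.items():
        others = [u for u in sets if u != url]
        noisy: List[int] = []
        for i, s in enumerate(blocks):
            # Count the other pages that contain a block similar to this one.
            hits = sum(
                1 for u in others
                if any(jaccard(s, t) >= sim_threshold for t in sets[u])
            )
            if others and hits / len(others) >= freq_threshold:
                noisy.append(i)
        result[url] = noisy
    return result


site = {
    "a": ["home products about contact us login",
          "a review of the new electric bicycle with a long battery life"],
    "b": ["home products about contact us login help",
          "recipe for a spicy lentil soup with fresh coriander leaves"],
    "c": ["home products about contact us login",
          "notes on training a naive bayes classifier for text"],
}
flagged = noisy_blocks(site)
```

Here the navigation bar on page "b" differs by one word from the others, so an exact-match comparison would miss it, while the similarity threshold still groups it with them.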

    Website summarization: a topic hierarchy based approach.

    Liu Nan. Thesis (M.Phil.)--Chinese University of Hong Kong, 2006. Includes bibliographical references (leaves 84-88). Abstracts in English and Chinese.

    Table of contents:
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
    Chapter 2: Related Work
        2.1 Web Structure Mining
            2.1.1 HITS Algorithm
            2.1.2 PageRank Algorithm
        2.2 Website Mining
            2.2.1 Website Classification
            2.2.2 Web Unit Mining
            2.2.3 Logical Domain Extraction
            2.2.4 Web Thesaurus Construction
    Chapter 3: Website Topic Hierarchy Generation
        3.1 Problem Definition
        3.2 Graph Based Algorithms
            3.2.1 Breadth First Search
            3.2.2 Shortest Path Search
            3.2.3 Minimum Directed Spanning Tree
            3.2.4 Discussion
        3.3 Edge Weight Function
            3.3.1 Relevance Method
            3.3.2 Machine Learning Method
        3.4 Experiments
            3.4.1 Data Preparation
            3.4.2 Performances of Breadth-first Search
            3.4.3 Performances of Shortest-path Search
            3.4.4 Performances of Directed Minimum Spanning Tree
            3.4.5 Comparison of Different Algorithms
    Chapter 4: Website Summarization Through Keyphrase Extraction
        4.1 Introduction
        4.2 Background
        4.3 Keyphrase Extraction
            4.3.1 Candidate Phrases Identification
            4.3.2 Feature Calculation without Topic Hierarchy
            4.3.3 Feature Calculation with Topic Hierarchy
            4.3.4 Extraction of Keyphrases
        4.4 Experiments
    Chapter 5: Conclusion and Future Work
    References

    Web page cleaning for web mining

    Ph.D. (Doctor of Philosophy)

    Entropy-based link analysis for mining web informative structures

    In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also referred to as TOC, i.e., table of contents, pages) and a set of article pages linked from the TOC pages through informative links. The Hyperlink-Induced Topic Search (HITS) algorithm has been employed to analyze the authorities and hubs among pages. However, most content sites contain extra hyperlinks, such as navigation panels, advertisements, and banners, to increase the add-on value of their Web pages. Because of the structure induced by these extra hyperlinks, HITS is found to be insufficient to provide good precision in solving the problem. To remedy this, we develop an algorithm that uses entropy-based link analysis to mine Web informative structures. This algorithm is referred to as LAMIS, standing for entropy-based Link Analysis on Mining web Informative Structures. The key idea of LAMIS is to use information entropy to represent, within the link analysis, the amount of information carried by a link or a page. Experiments on several real news Web sites show that the precision and recall of LAMIS are far superior to those obtained by heuristic methods, and that the derived link analysis techniques are very effective for mining the informative structures of news Web sites. On average, the augmented LAMIS leads to a prominent performance improvement and increases the precision by a factor ranging from 133% to 232% when the desired recall falls between 0.5 and 1.
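
The idea of using information entropy to weigh links can be illustrated with a small Python sketch. It is not the LAMIS algorithm: it only computes the normalized Shannon entropy of how each link target is distributed over the pages of a site, under the assumption (made here for illustration) that a target linked uniformly from every page, such as a navigation entry, carries little information, while a target linked from a single page is more likely an informative link. The function name and the input layout are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def link_entropy(pages: Dict[str, List[str]]) -> Dict[str, float]:
    """Normalized Shannon entropy of each link target's distribution over pages.

    pages maps a page identifier to the list of link targets found on it.
    A value near 1 means the target is spread evenly over all pages;
    a value of 0 means it appears on one page only.
    """
    n_pages = len(pages)
    counts: Dict[str, Counter] = {}
    for page, links in pages.items():
        for target in links:
            counts.setdefault(target, Counter())[page] += 1

    entropies: Dict[str, float] = {}
    for target, per_page in counts.items():
        total = sum(per_page.values())
        # H = sum_i p_i * log(1 / p_i), with p_i the share of occurrences on page i.
        h = sum((c / total) * math.log(total / c) for c in per_page.values())
        # Divide by log(n_pages) so that values fall in [0, 1].
        entropies[target] = h / math.log(n_pages) if n_pages > 1 else 0.0
    return entropies


site_links = {
    "toc1": ["/home", "/article-1", "/article-2"],
    "toc2": ["/home", "/article-3"],
    "article-1": ["/home"],
}
scores = link_entropy(site_links)
```

In this toy site, "/home" is linked once from every page and receives the maximum normalized entropy, whereas each article link appears on a single page and receives zero.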

    Entropy-based link analysis for mining web informative structures
