Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction

Abstract

Topic distillation is the process of finding authoritative Web pages and comprehensive “hubs” which reciprocally endorse each other and are relevant to a given query. Hyperlink-based topic distillation has been traditionally applied to a macroscopic Web model where documents are nodes in a directed graph and hyperlinks are edges. Macroscopic models miss valuable clues such as banners, navigation panels, and template-based inclusions, which are embedded in HTML pages using markup tags. Consequently, results of macroscopic distillation algorithms have been deteriorating in quality as Web pages are becoming more complex. We propose a uniform fine-grained model for the Web in which pages are represented by their tag trees (also called their Document Object Models or DOMs) and these DOM trees are interconnected by ordinary hyperlinks. Surprisingly, macroscopic distillation algorithms do not work in the finegrained scenario. We present a new algorithm suitable for the fine-grained model. It can dis-aggregate hubs into coherent regions by segmenting their DOM trees. Mutual endorsement between hubs and authorities involve these regions, rather than single nodes representing complete hubs. Anecdotes and measurements using a 28-query, 366000-document benchmark suite, used in earlier topic distillation research, reveal two benefits from the new algorithm: distillation quality improves and a by-product of distillation is the ability to extract relevant snippets from hubs which are only partially relevant to the query

    Similar works