
A Pattern Tree-based Approach to Learning URL Normalization

WWW 2010 • Full Paper • April 26-30 • Raleigh • NC • USA

By Rui Cai, Jiang-ming Yang, Yan Ke, Xiaodong Fan, Lei Zhang and Tao Lei

Abstract

Duplicate URLs cause serious problems across the whole pipeline of a search engine, from crawling and indexing to result serving. URL normalization transforms duplicate URLs into a canonical form using a set of rewrite rules. It has attracted significant attention because it is lightweight and can be flexibly integrated into both the online (e.g. crawling) and the offline (e.g. index compression) parts of a search engine. To handle a large number of websites, automatic approaches are highly desirable for learning rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rule
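
To make the idea of rewrite rules concrete, the following is a minimal sketch of rule-based URL normalization in Python. It is not the pattern tree-based method proposed in the paper, and it does not learn anything: the rules are written by hand for illustration. The parameter names in `IGNORED_PARAMS`, the helper names `normalize` and `group_duplicates`, and the `example.com` URLs are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule set: query parameters assumed not to change page content.
IGNORED_PARAMS = {"sessionid", "ref"}


def normalize(url: str) -> str:
    """Rewrite a URL into a canonical form using hand-written example rules."""
    parts = urlsplit(url)
    # Rule 1: lowercase the scheme and host, which are case-insensitive.
    # (Simplification: this also lowercases any userinfo in the netloc.)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Rule 2: drop ignored parameters. Rule 3: sort the remaining ones.
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Rule 4: drop the fragment, which is not sent to the server.
    return urlunsplit((scheme, netloc, parts.path, urlencode(params), ""))


def group_duplicates(urls):
    """Group URLs that share the same canonical form."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return groups
```

Under these assumed rules, `HTTP://Example.com/item?id=7&sessionid=abc` and `http://example.com/item?sessionid=xyz&id=7#top` both normalize to `http://example.com/item?id=7`, so a crawler or indexer could treat them as one page. An automatic approach of the kind the abstract calls for would aim to learn such rules per website instead of hard-coding them.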

Topics: Algorithms, Performance, Experimentation
Keywords: URL normalization, URL pattern
Year: 2011
OAI identifier: oai:CiteSeerX.psu:10.1.1.188.635
Provided by: CiteSeerX