Duplicate URLs cause serious trouble throughout the pipeline of a search engine, from crawling and indexing to result serving. URL normalization transforms duplicate URLs into a canonical form using a set of rewrite rules. URL normalization has attracted significant attention because it is lightweight and can be flexibly integrated into both the online (e.g., crawling) and the offline (e.g., index compression) parts of a search engine. To handle websites at a large scale, automatic approaches are highly desirable for learning rewrite rules for the various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rules
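To make the notion of a rewrite rule concrete, here is a minimal sketch of URL normalization in Python. It is not the paper's learned-rule method; the rule shown (dropping assumed content-irrelevant query parameters such as `sessionid`, sorting the rest, and lowercasing the host) and the example URLs are illustrative assumptions only.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rewrite rule: these query parameters are assumed not to
# affect the page content, so they can be dropped during normalization.
IRRELEVANT_PARAMS = {"sessionid", "sid", "ref"}

def normalize(url: str) -> str:
    """Map duplicate URLs to one canonical form via a simple rewrite rule."""
    parts = urlsplit(url)
    # Keep only content-relevant parameters and sort them, since
    # parameter order usually does not change the page served.
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in IRRELEVANT_PARAMS)
    return urlunsplit((parts.scheme.lower(),
                       parts.netloc.lower(),
                       parts.path,
                       urlencode(query),
                       ""))  # drop the fragment; it is client-side only

# Two duplicate URLs collapse to the same canonical form:
assert normalize("http://Example.com/story?sessionid=abc&id=7") == \
       normalize("http://example.com/story?id=7&sessionid=xyz")
```

A learned normalizer, as targeted by the paper, would induce such rules per website automatically rather than relying on a hand-written parameter list.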