In this paper, we study the problem of constructing and maintaining a large shared repository of web pages. We discuss the unique characteristics of such a repository, propose an architecture, and identify its functional modules. We focus on the storage manager module and illustrate how traditional techniques for storage and indexing can be tailored to meet the requirements of a web repository. To evaluate design alternatives, we also present experimental results from a prototype repository called WebBase, which is currently being developed at Stanford University.

Keywords: Repository, WebBase, Architecture, Storage management

1 Introduction

A number of important applications require local access to substantial portions of the web. Examples include traditional text search engines [Google] [Avista], related page services [Google] [Alexa], and topic-based search and categorization services [Yahoo]. Such applications typically access, mine, or index a local cache or repository of web..