We consider the acquisition and maintenance of XML data found on the web. More precisely, we study the problem of discovering XML data on the web, i.e., in a world still dominated by HTML, and keeping it as up to date with the web as possible under fixed resources. We present a distributed architecture that is designed to scale to the billions of pages of the web. In particular, the distributed management of metadata about HTML and XML pages turns out to be an interesting issue. The scheduling of page fetching is guided by the importance of pages, their expected change rate, and the subscriptions/publications of users. The importance of XML pages is defined in the standard manner, based on the link structure of the web graph, and is computed by a matrix fixpoint computation. HTML pages are of interest to us only in that they lead to XML pages. Their importance is therefore defined in a different manner, and its computation also involves a fixpoint, but on the transposed link matrix thi..
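As an illustration of the kind of matrix fixpoint computation mentioned above, the following is a minimal sketch of a PageRank-style power iteration that can run either on the link graph or on its transpose. It is not the authors' algorithm: the function name `importance`, the damping factor, the uniform handling of pages without out-links, and the toy graph are all assumptions made for the example.

```python
def importance(links, n, transpose=False, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the fixpoint of a PageRank-style importance equation.

    links     : list of (src, dst) page-index pairs, pages numbered 0..n-1
    transpose : if True, reverse every edge, i.e. iterate on the
                transposed link matrix (assumed reading of the abstract)
    damping   : assumed damping factor, not taken from the paper
    """
    if transpose:
        links = [(dst, src) for src, dst in links]

    # Adjacency lists of outgoing links.
    out = [[] for _ in range(n)]
    for src, dst in links:
        out[src].append(dst)

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n
        for src in range(n):
            if out[src]:
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += share
            else:
                # Page without out-links: spread its weight uniformly (assumption).
                share = damping * rank[src] / n
                for dst in range(n):
                    new[dst] += share
        delta = sum(abs(a - b) for a, b in zip(new, rank))
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: a cycle 0 -> 1 -> 2 -> 0 plus an extra link 3 -> 2.
toy_links = [(0, 1), (1, 2), (2, 0), (3, 2)]
forward = importance(toy_links, 4)
backward = importance(toy_links, 4, transpose=True)
```

Here `forward` scores pages by the links pointing to them, while `backward` scores them by the pages they point to, which is one plausible way to reward HTML pages for leading to other pages.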