Acquisition and Maintenance of XML Data from the Web

By Laurent Mignet, Mihai Preda, Serge Abiteboul, Bernd Amann and Amélie Marian

Abstract

We consider the acquisition and maintenance of XML data found on the web. More precisely, we study the problem of discovering XML data on the web, i.e., in a world still dominated by HTML, and keeping it up to date with the web as best as possible, under set resources. We present a distributed architecture that is designed to scale to the billions of pages of the web. In particular, the distributed management of metadata about HTML and XML pages turns out to be an interesting issue. The scheduling of the fetching of the pages is guided by the importance of pages, their expected change rate, and subscriptions/publications of users. The importance of XML pages is defined in the standard manner based on the link structure of the web graph. It is computed by a matrix fixpoint computation. HTML pages are of interest for us only in that they lead to XML pages. Thus their importance is defined in a different manner and their computation also involves a fixpoint but on the transposed link matrix thi..
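As a rough illustration of the fixpoint computation mentioned in the abstract, the sketch below runs a standard PageRank-style power iteration over a list of links, and then runs the same iteration on the reversed links (the transposed link matrix). The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: the function names `importance` and `transposed`, the damping factor of 0.85, the uniform handling of pages without outgoing links, and the convergence tolerance.

```python
def importance(links, n, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank-style importance by power iteration (illustrative assumption).

    links: list of (src, dst) pairs over pages numbered 0..n-1.
    Returns a list of n scores that sum to 1.
    """
    # Build the outgoing adjacency list for each page.
    out = [[] for _ in range(n)]
    for src, dst in links:
        out[src].append(dst)

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Every page receives a uniform base share.
        new = [(1.0 - damping) / n] * n
        for src in range(n):
            if out[src]:
                # Split the damped score evenly among the link targets.
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += share
            else:
                # Page without outgoing links: spread its score uniformly.
                share = damping * rank[src] / n
                for i in range(n):
                    new[i] += share
        # Stop once the vector no longer changes (the fixpoint).
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


def transposed(links):
    """Reverse every link, which corresponds to transposing the link matrix."""
    return [(dst, src) for src, dst in links]


# Hypothetical three-page chain: page 0 -> page 1 -> page 2.
links = [(0, 1), (1, 2)]
forward = importance(links, 3)               # favors pages that are linked to
backward = importance(transposed(links), 3)  # favors pages that lead elsewhere
```

On this chain, the forward scores rank page 2 highest because links lead to it, while the scores on the reversed links rank page 0 highest because it leads, through page 1, to page 2. That second ordering matches the intuition in the abstract that an HTML page matters to the extent that it leads to XML pages, although the paper's own definition may differ.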

Topics: HTML
Year: 2007
OAI identifier: oai:CiteSeerX.psu:10.1.1.32.7341
Provided by: CiteSeerX