Not So Creepy Crawler: Crawling the Web with XQuery

By Franziska Von Dem Bussche, Klara Weiand, Benedikt Linse, Tim Furche and François Bry

Abstract

Web crawlers are increasingly used for focused tasks such as the extraction of data from Wikipedia or the analysis of social networks like last.fm. In these cases, pages are far more uniformly structured than in the general Web, and thus crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis. In this demonstration, we present a focused, structure-based crawler generator, the “Not so Creepy Crawler” (NC²). What sets NC² apart is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine such as XQuery or Xcerpt. Customizing a crawler just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers.
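
To make the idea of delegating crawl decisions to declarative queries more concrete, here is a minimal sketch in Python. It is not NC² and does not reproduce its interface or the four query types mentioned in the abstract, which are not spelled out here. It assumes two illustrative query roles (one selecting links to follow, one selecting data to extract), uses the limited XPath subset of Python's `xml.etree.ElementTree` as a stand-in for an engine such as XQuery or Xcerpt, and replaces HTTP fetching with a hypothetical in-memory `PAGES` dictionary. The names `crawl`, `follow_query`, and `extract_query` are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory "web": URL -> well-formed XML page.
# It stands in for real HTTP fetches so the sketch runs offline.
PAGES = {
    "a": "<page><h1>Artist A</h1>"
         "<a class='artist' href='b'>B</a>"
         "<a class='ad' href='x'>Ad</a></page>",
    "b": "<page><h1>Artist B</h1><a class='artist' href='a'>A</a></page>",
    "x": "<page><h1>Advert</h1></page>",
}


def crawl(seed, follow_query, extract_query, fetch=PAGES.get):
    """Breadth-first crawl whose decisions are made by two queries.

    follow_query  selects the link elements whose href should be crawled.
    extract_query selects the elements whose text should be collected.
    The crawl loop itself knows nothing about the page structure.
    """
    seen = {seed}
    frontier = deque([seed])
    results = []
    while frontier:
        url = frontier.popleft()
        source = fetch(url)
        if source is None:  # unknown URL: nothing to parse
            continue
        doc = ET.fromstring(source)
        # Data extraction is delegated to the extraction query.
        for node in doc.findall(extract_query):
            results.append((url, node.text))
        # The decision which links to follow is delegated to the follow query.
        for link in doc.findall(follow_query):
            target = link.get("href")
            if target and target not in seen:
                seen.add(target)
                frontier.append(target)
    return results


# Customizing the crawler means changing only the two query strings.
results = crawl("a", ".//a[@class='artist']", ".//h1")
print(results)
```

In this sketch, the advert page `x` is never visited because the follow query only matches links with `class='artist'`. A different focused crawler over the same pages would need different query strings but the same loop, which is the separation the abstract describes.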

Year: 2010
OAI identifier: oai:CiteSeerX.psu:10.1.1.185.353
Provided by: CiteSeerX