The RoadRunner Project: Towards Automatic Extraction of Web Data

Giansalvatore Mecca; Paolo Merialdo; Valter Crescenzi

Publication date: 1 January 2001

Abstract

ROADRUNNER is a research project that aims to develop solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites that publish large amounts of data in a fairly complex structure. Ideally, we view the data extraction process for a data-intensive Web site as a black box that takes as input the URL of an entry point to the site (e.g., the home page) and returns as output the data extracted from the site's HTML pages in a structured, database-like format. This paper describes the top-level software architecture of the ROADRUNNER system, which has been specifically designed to automate the data extraction process. Several components of the system have already been implemented, and preliminary experiments show the feasibility of our ideas. Data-intensive Web sites usually share a number

Available Versions

Archivio della Ricerca - Università della Basilicata

oai:iris.unibas.it:11563/11138...

Last time updated on 12/11/2016