The RoadRunner Project: Towards Automatic Extraction of Web Data

Abstract

Introduction

ROADRUNNER is a research project that aims to develop solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites that publish large amounts of data in a fairly complex structure. Ideally, we view the data extraction process for a data-intensive Web site as a black box that takes as input the URL of an entry point to the site (e.g., the home page) and returns as output the data extracted from the site's HTML pages in a structured, database-like format. This paper describes the top-level software architecture of the ROADRUNNER system, which has been specifically designed to automate the data extraction process. Several components of the system have already been implemented, and preliminary experiments show the feasibility of our ideas. Data-intensive Web sites usually share a number
