Minimizing the Costs of the Training Data for Learning Web Wrappers

Authors: Valter Crescenzi, Paolo Merialdo, Disheng Qiu, Rolando Creo
Publication date: 01/01/2012

Data extraction from the Web is an important problem. Several approaches have been developed to bring the wrapper generation process to web scale. Although they rely on different techniques and formalisms, they all learn a wrapper from a set of sample pages. Unsupervised approaches require only the sample pages, whereas supervised ones also need training data. Unfortunately, the accuracy obtained by unsupervised techniques is not sufficient for many applications. On the other hand, obtaining training data is not cheap at web scale. This paper addresses the issue of minimizing the costs of collecting training data for learning web wrappers. We show that two interleaved problems affect this issue: the choice of the sample pages and the expressiveness of the wrapper language. We propose a solution that leverages contributions from the field of learning theory, and we discuss the promising results of an experimental evaluation of our approach.

Archivio della Ricerca - Università di Roma 3