79 research outputs found

Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future

Author: Crescenzi Valter
Gottron Thomas
Merialdo Paolo
Palacios Rodrigo
Weninger Tim
Publication date: 01/01/2015

In this paper, we present a meta-analysis of several Web content extraction algorithms and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors fail to consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.

Comment: Accepted for publication in SIGKDD Explorations

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Roma 3

Preface of the 31st Italian Symposium on Advanced Database Systems

Author: Amato Flora
Atzori Maurizio
Baralis Elena
Bartolini Ilaria
Bellomarini Luigi
Buccafurri Francesco
Cabibbo Luca
Calvanese Diego
Calí Andrea
Camporese Antonio
Caruccio Loredana
Castano Silvana
Catania Barbara
Ceci Michelangelo
Chiusano Silvia
Ciaccia Paolo
Corradini Enrico
Crescenzi Valter
De Antonellis Valeria
Di Noia Tommaso
Diamantini Claudia
Faggioli Guglielmo
Fazzinga Bettina
Ferrara Alfio
Ferrari Elena
Ferro Nicola
Firmani Donatella
Garza Paolo
Giachelle Fabio
Golfarelli Matteo
Greco Sergio
Guerrini Giovanna
Gullo Francesco
Guzzi Pietro Hiram
Irrera Ornella
Lanti Davide
Lembo Domenico
Leoncini Debora
Leotta Francesco
Manco Giuseppe
Mandreoli Federica
Marchesin Stefano
Masciari Elio
Maurino Andrea
Melchiori Michele
Menotti Laura
Mircoli Alex
Missier Paolo
Molinaro Cristian
Montanelli Stefano
Moscato Vincenzo
Papotti Paolo
Pasin Andrea
Pensa Ruggero G.
Piantella Davide
Pugliese Andrea
Quaggio Elisa
Quintarelli Elisa
Renso Chiara
Rinzivillo Salvatore
Sartiani Carlo
Savo Domenico Fabio
Silvello Gianmaria
Simonini Giovanni
Storti Emanuele
Tagarelli Andrea
Tanca Letizia
Publication date: 10/09/2023

This volume contains the proceedings of the 31st Italian Symposium on Advanced Database Systems (SEBD - Sistemi Evoluti per Basi di Dati), held in Galzignano Terme (Padua, Italy) from 2 to 5 July 2023.

University of Birmingham Research Portal

Wrapper Inference for Ambiguous Web Pages

Author: CRESCENZI VALTER
MERIALDO PAOLO
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2008

Several studies have concentrated on the generation of wrappers for web data sources. As wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstract the structures usually found in HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many real-life web pages do not fall into this class of languages. In this article we analyze the roots of the problem and propose a technique to transform pages in order to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.

Archivio della Ricerca - Università di Roma 3

Open Access Repository

Efficient Techniques for Effective Wrapper Induction

Author: CRESCENZI VALTER
MERIALDO PAOLO
Publication date: 01/01/2006

Archivio della Ricerca - Università di Roma 3

The RoadRunner Project: Towards Automatic Extraction of Web Data

Author: Giansalvatore Mecca
Paolo Merialdo
Valter Crescenzi
Publication date: 01/01/2001

ROADRUNNER is a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites that publish large amounts of data in a fairly complex structure. Ideally, we view the data extraction process for a data-intensive Web site as a black box that takes as input the URL of an entry point to the site (e.g., the home page) and returns as output the data extracted from the site's HTML pages in a structured, database-like format. This paper describes the top-level software architecture of the ROADRUNNER system, which has been specifically designed to automate the data extraction process. Several components of the system have already been implemented, and preliminary experiments show the feasibility of our ideas. Data-intensive Web sites usually share a number

CiteSeerX

Archivio della Ricerca - Università della Basilicata

ALFRED: crowd assisted data extraction

Author: CRESCENZI VALTER
MERIALDO PAOLO
QIU DISHENG
Publication venue: Hyderabad, India
Publication date: 01/01/2013

Archivio della Ricerca - Università di Roma 3

The RoadRunner Web Data Extraction System

Author: CRESCENZI VALTER
G. MECCA
MERIALDO PAOLO
Publication date: 01/01/2001

Archivio della Ricerca - Università di Roma 3

Efficiently Locating Collections of Web Pages to Wrap

Author: CRESCENZI VALTER
L. BLANCO
MERIALDO PAOLO
Publication date: 01/01/2005

Archivio della Ricerca - Università di Roma 3

Clustering Web pages based on their structure

Author: Crescenzi Valter
Merialdo Paolo
Missier Paolo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2005

Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.

Archivio della Ricerca - Università di Roma 3

The University of Manchester - Institutional Repository