
    Design of Automatically Adaptable Web Wrappers

    Nowadays, the huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data in an efficient and reliable way. Both academia and industry have developed several approaches to Web data extraction, for example using artificial intelligence or machine learning techniques. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, must prove robust so as not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoid the failure of data extraction tasks, and ensure the reliability of the extracted information. Our purpose is to evaluate the performance, advantages, and drawbacks of our novel system for automatic wrapper adaptation.
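
    The similarity-matching idea can be illustrated with a short sketch. This is only a rough illustration under our own assumptions (lxml as the HTML parser, a SequenceMatcher score over tag, class, id, and leading text, and an arbitrary 0.6 threshold), not the algorithms evaluated in the paper: given the element the old wrapper extracted, it looks for the most similar element in the new version of the page.

```python
# Rough sketch of relocating a wrapper's target element in a modified page
# (an illustration, not the paper's algorithm).  Assumes lxml is installed;
# the signature features, SequenceMatcher score and 0.6 threshold are
# arbitrary illustrative choices.
from difflib import SequenceMatcher

import lxml.html


def signature(el):
    """Serialize the features used to compare elements across page versions."""
    return " ".join([el.tag, el.get("class") or "", el.get("id") or "",
                     (el.text_content() or "").strip()[:80]])


def find_counterpart(old_el, new_root, threshold=0.6):
    """Return the element of the new page most similar to old_el, or None."""
    old_sig, best, best_score = signature(old_el), None, 0.0
    for cand in new_root.iter():
        if not isinstance(cand.tag, str):  # skip comments and processing instructions
            continue
        score = SequenceMatcher(None, old_sig, signature(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


old_root = lxml.html.fromstring("<div><span id='price'>19.99</span></div>")
new_root = lxml.html.fromstring("<section><b id='price-tag'>19.99</b></section>")
match = find_counterpart(old_root.get_element_by_id("price"), new_root)
if match is not None:
    # An adapted wrapper could now be rebuilt from the relocated element's path.
    print(new_root.getroottree().getpath(match))
```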

    Intelligent Self-Repairable Web Wrappers

    The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with possible malfunctions or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctions or the acquisition of corrupted data may be caused, for example, by structural modifications of the data sources made by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures able to extract data from Web sources -- the so-called Web wrappers -- which can cope with possible malfunctions caused by modifications of the structure of the data source and can automatically repair themselves.
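
    As a rough illustration of the self-repair idea, the sketch below shows a wrapper that detects a malfunction (its stored XPath no longer matches) and repairs itself by relocating the last value it successfully extracted. The class, the relocation heuristic, and the lxml dependency are our own assumptions, not the system proposed in the paper.

```python
# Rough sketch of a self-repairing wrapper (illustrative, not the system
# proposed in the paper).  When the stored XPath stops matching, the wrapper
# searches the modified page for the last value it extracted successfully and
# rebuilds its rule from the relocated element.  Assumes lxml is installed.
import lxml.html


class SelfRepairingWrapper:
    def __init__(self, xpath, last_value=None):
        self.xpath = xpath            # current extraction rule
        self.last_value = last_value  # sample value used as a repair anchor

    def extract(self, html):
        root = lxml.html.fromstring(html)
        hits = root.xpath(self.xpath)
        if not hits and self.last_value is not None:   # malfunction detected
            hits = [el for el in root.iter()
                    if isinstance(el.tag, str)
                    and (el.text or "").strip() == self.last_value]
            if hits:                                    # repair: store the new rule
                self.xpath = root.getroottree().getpath(hits[0])
        if not hits:
            return None
        self.last_value = (hits[0].text or "").strip()
        return self.last_value


wrapper = SelfRepairingWrapper("//span[@id='price']", last_value="19.99")
print(wrapper.extract("<div><em class='cost'>19.99</em></div>"))  # 19.99 (repaired)
print(wrapper.xpath)                                              # /div/em
```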

    Automatic Wrapper Adaptation by Tree Edit Distance Matching

    Information distributed through the Web keeps growing faster day by day, and for this reason several techniques for extracting Web data have been suggested in recent years. Often, extraction tasks are performed through so-called wrappers, procedures that extract information from Web pages, e.g. by implementing logic-based techniques. Many fields of application today require a strong degree of wrapper robustness, so as not to compromise information assets or the reliability of the extracted data. Unfortunately, wrappers may fail to extract data from a Web page if its structure changes, sometimes even slightly, which calls for new techniques that automatically adapt the wrapper to the new structure of the page in case of failure. In this work we present a novel approach to automatic wrapper adaptation based on measuring the similarity of trees through improved tree edit distance matching techniques.
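
    To make the tree-similarity idea concrete, here is a minimal sketch of the classic simple tree matching score between two DOM-like trees. The paper's improved tree edit distance matching techniques are more refined, so this is only an illustrative baseline on toy (tag, children) tuples.

```python
# Minimal sketch of the classic "simple tree matching" similarity between two
# DOM-like trees, given here only as an illustrative baseline; the paper's
# improved tree edit distance matching is more refined.  Trees are plain
# (tag, [children]) tuples for brevity.

def simple_tree_matching(a, b):
    """Return the size of the largest common subtree mapping between a and b."""
    if a[0] != b[0]:                       # roots with different tags never match
        return 0
    ka, kb = a[1], b[1]
    m, n = len(ka), len(kb)
    # M[i][j]: best matching between the first i children of a and first j of b
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i][j - 1], M[i - 1][j],
                          M[i - 1][j - 1] + simple_tree_matching(ka[i - 1], kb[j - 1]))
    return M[m][n] + 1                     # +1 for the matched roots


old_page = ("div", [("h1", []), ("ul", [("li", []), ("li", [])])])
new_page = ("div", [("ul", [("li", []), ("li", []), ("li", [])]), ("p", [])])
# 4 nodes align (div, ul and two li); normalizing by tree size gives a
# similarity score usable to decide whether adaptation should be attempted.
print(simple_tree_matching(old_page, new_page))
```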

    Fave: uma proposta para verificação de extratores de dados de páginas HTML

    The constant growth of online services, for example price and product comparison and content aggregators, among others, drives the demand for data extraction solutions. For information from the Internet to be compared or grouped, it is first necessary to extract the relevant data from web pages in a structured format. The techniques that provide data extraction are known as wrappers. Each wrapper is developed based on the HTML page and produces a set of structured information. However, when an HTML page is modified, the wrapper may stop working or work incorrectly. There are currently several studies on automatically adjusting the data extraction system, a procedure known as wrapper maintenance. This work presents some wrapper maintenance techniques and proposes an improvement to the extractor automation method based on the techniques presented.
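
    As a rough illustration of wrapper verification (not the method proposed in this work), the sketch below checks a wrapper's extracted records against simple per-field expectations, so that a page modification that silently breaks the extractor is flagged for maintenance; the field names and rules are hypothetical.

```python
# Rough sketch of wrapper output verification (illustrative only, not the
# method proposed in this work).  The field names and validation rules below
# are hypothetical, e.g. for a price-comparison wrapper.
import re

EXPECTATIONS = {
    "name":  lambda v: bool(v and v.strip()),
    "price": lambda v: bool(re.fullmatch(r"\d+[.,]\d{2}", v or "")),
}


def verify(records, min_valid_ratio=0.9):
    """Return True if enough extracted records look structurally sound."""
    if not records:
        return False
    valid = sum(1 for r in records
                if all(check(r.get(field)) for field, check in EXPECTATIONS.items()))
    return valid / len(records) >= min_valid_ratio


ok = [{"name": "Laptop", "price": "999.00"}, {"name": "Mouse", "price": "19,90"}]
broken = [{"name": "", "price": "Add to cart"}]   # symptom of a page redesign
print(verify(ok), verify(broken))                 # True False
```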

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.

    Acquisition des contenus intelligents dans l’archivage du Web

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in Web pages. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose an efficient unsupervised Web crawling system, ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that exploits the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make, respectively, 7 and 5 times fewer HTTP requests than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
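
    The two-phase, structure-driven idea can be sketched as follows. This is only an illustration under our own assumptions (URL-path navigation patterns, an externally supplied content score per sampled page, and an arbitrary threshold), not ACEBot itself: the offline phase groups sampled pages by navigation pattern and keeps the valuable patterns, and the online phase would then restrict downloading to URLs matching them.

```python
# Rough sketch of the two-phase, structure-driven idea (illustrative only, not
# ACEBot itself).  Navigation patterns are approximated by URL paths with
# numeric segments masked, and the content scores are assumed to come from a
# small sample crawl; the 0.5 threshold is arbitrary.
from collections import defaultdict
from urllib.parse import urlparse


def navigation_pattern(url):
    """Abstract a URL into a pattern by masking numeric path segments."""
    parts = urlparse(url).path.strip("/").split("/")
    return "/" + "/".join("<n>" if p.isdigit() else p for p in parts)


def offline_phase(sampled_pages, min_content_score=0.5):
    """sampled_pages maps URL -> content score in [0, 1] from a sample crawl."""
    by_pattern = defaultdict(list)
    for url, score in sampled_pages.items():
        by_pattern[navigation_pattern(url)].append(score)
    # Keep the patterns whose sampled pages carried valuable content on average.
    return {p for p, scores in by_pattern.items()
            if sum(scores) / len(scores) >= min_content_score}


sample = {
    "https://example.org/forum/topic/123": 0.9,
    "https://example.org/forum/topic/456": 0.8,
    "https://example.org/user/7/settings": 0.1,
}
print(offline_phase(sample))  # {'/forum/topic/<n>'}
# Online phase (not shown): download every newly discovered URL whose
# navigation_pattern() is in the retained set, and skip the rest.
```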