55 research outputs found

Web Data Extraction, Applications and Techniques: A Survey

Author: Abel
Amalfitano
Balduzzi
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Berger
Berthold
Bettencourt
Califf
Catanese
Chang
Chen
Chen
Chen
Collins
Conover
Crandall
Crescenzi
Crescenzi
Dalvi
Dalvi
De Meo
De Meo
Doan
Emilio Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Flesca
Freitag
Furche
Gatterbauer
Gatterbauer
Giacomo Fiumara
Gjoka
Gkotsis
Gottlob
Gottlob
Hammersley
Han
Hecht
Hsu
Irmak
Khare
Kim
Kinsella
Kleinberg
Kleinberg
Kohlschütter
Kokkoras
Kokkoras
Kokkoras
Krüpl
Kushmerick
Kwak
Laender
Liu
Manning
Masanès
Mathes
Meng
Mislove
Monge
Muslea
Oro
Pan
Pasquale De Meo
Perito
Phan
Plake
Rahm
Rahm
Reis
Robert Baumgartner
Sahuguet
Sarawagi
Schifanella
Selkow
Shi
Soderland
Szomszor
Turmo
Vosecky
Wang
Wang
Weikum
Wilson
Winograd
Yang
Ye
Zafarani
Zanasi
Zhai
Zhang
Zhang
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains.
Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

Self-supervised automated wrapper generation for weblog data extraction

Author: A. Laender
B. Adelberg
C. Kohlschütter
I. Muslea
N. Kushmerick
P. Geibel
R. Baumgartner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives

Crossref

UCL Discovery

Warwick Research Archives Portal Repository
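
The weblog wrapper-generation abstract above describes matching web-feed values against the HTML elements of several posts to derive extraction rules. The following is a minimal sketch of that idea only, not the paper's method: the names `PathCollector`, `induce_rules` and `apply_rule` are hypothetical, the tag/class path is an assumed stand-in for whatever element descriptor the authors use, and the simple vote fraction stands in for their probabilistic rule scoring.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record a tag/class path for every non-empty text node (illustrative only)."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["html", "body", "h1.title"]
        self.texts = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements never contain text, so they are not pushed
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed children).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def induce_rules(posts):
    """posts: list of (html, feed_fields); feed_fields maps a property to its feed value.

    Returns {property: (path, support)}, where support is the fraction of posts
    in which that path held exactly the feed value (an assumed, simplified score).
    """
    votes = {}
    for html, fields in posts:
        collector = PathCollector()
        collector.feed(html)
        for name, value in fields.items():
            matched = {p for p, text in collector.texts if text == value.strip()}
            votes.setdefault(name, Counter()).update(matched)
    rules = {}
    for name, counter in votes.items():
        if counter:
            path, count = counter.most_common(1)[0]
            rules[name] = (path, count / len(posts))
    return rules


def apply_rule(html, path):
    """Extract every text node found at the given path in a new page."""
    collector = PathCollector()
    collector.feed(html)
    return [text for p, text in collector.texts if p == path]
```

Voting across several posts, rather than trusting a single page, is what lets a coincidental match (for example a title repeated in the body text) lose out to the element that matches consistently.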

Designing a general deep web harvester by harvestability factor

Author: Hiemstra Djoerd
Keulen Maurice van
Khelghati Mohammadreza
Publication venue: CEUR Workshop Proceedings
Publication date: 01/01/2014

To make deep web data accessible, harvesters have a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester that can be applied to different settings and situations. To develop such a harvester, a large number of issues should be addressed. To have all influential elements in one big picture, a new concept, called harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_W) or a harvester (HF_H) representing the extent to which the website can be harvested or the harvester can harvest. The comprising elements of these factors are different websites' or harvesters' features. These elements are gathered from literature or introduced through the authors' experiments. In addition to enabling designers to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing harvesters. Designers can define the list of features and prioritize their implementations. To validate the effectiveness of HF in practice, it is shown how the HFs' elements can be applied in categorizing deep websites and how this is useful in designing a harvester. To validate the HF_H as an evaluation metric, it is shown how it can be calculated for the harvester implemented by the authors. The results show that the developed harvester works well for the targeted test set, with a score of 14.783 out of 15.

CiteSeerX

Radboud Repository

University of Twente Research Information

Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF)

Author: Hiemstra Djoerd
Keulen Maurice van
Khelghati Mohammadreza
Publication venue: University of Twente, Centre for Telematics and Information Technology (CTIT)
Publication date: 01/01/2014

The growing need to access more and more information draws attention to the huge amount of data hidden behind web forms, known as the deep web. To make this data accessible, harvesters have a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester that can be applied to different settings and situations. To develop such a harvester, a number of issues should be considered. Among these issues, business domain features, targeted websites' features, and the harvesting goals are the most influential ones. To consider all these elements in one big picture, a new concept, called harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_w) or a harvester (HF_h) representing the extent to which the website can be harvested or the harvester can harvest. The comprising elements of these factors are different websites' (for HF_w) or harvesters' (for HF_h) features. These features are presented in this paper by gathering a number of them from literature and introducing new ones through the authors' experiments. In addition to enabling designers of websites or harvesters to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing general-purpose deep web harvesters. This framework helps fill the gap in designing general-purpose harvesters by focusing on detailed features of deep websites that affect harvesting processes. The features presented in this paper provide a thorough list of requirements for designing deep web harvesters which, to the best of our knowledge, has not been compiled to this extent in the literature. To validate the effectiveness of HF in practice, it is shown how the HFs' elements can be applied in categorizing deep websites and how this is useful in designing a harvester. The harvester developed by the authors to run the experiments is also discussed in this paper.

Radboud Repository

University of Twente Research Information

Efficient Precise Dynamic Data Race Detection for CPU and GPU

Author: Peng Yuanfeng
Publication venue: ScholarlyCommons
Publication date: 01/01/2019

Data races are notorious bugs. They introduce non-determinism in program behavior and complicate program semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic for decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their use in real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which have a programming and execution model disparate from that of CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs.

ScholarlyCommons@Penn
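
For readers unfamiliar with dynamic data race detection as named in the abstract above, the following is a minimal sketch of the general happens-before approach using vector clocks, on which precise dynamic detectors are commonly built. It is not the dissertation's algorithm: the class `RaceDetector` and its event methods are hypothetical, the event trace is assumed to be fed in by hand, and none of the metadata-redundancy or GPU-specific techniques are reproduced.

```python
class RaceDetector:
    """Happens-before race detection over a hand-fed event trace (illustrative)."""

    def __init__(self, num_threads):
        # clocks[t][u]: latest event of thread u known to happen before thread t's present
        self.clocks = [[1 if u == t else 0 for u in range(num_threads)]
                       for t in range(num_threads)]
        self.lock_clocks = {}   # lock -> vector clock stored at its last release
        self.last_write = {}    # variable -> (thread, that thread's clock at the write)
        self.last_reads = {}    # variable -> {thread: that thread's clock at the read}
        self.races = []         # reported (kind, variable, earlier_thread, later_thread)

    def _ordered_before(self, tid, clock, current):
        # True if the earlier access by `tid` happens before thread `current` now.
        return clock <= self.clocks[current][tid]

    def acquire(self, t, lock):
        released = self.lock_clocks.get(lock)
        if released:
            # Join: everything before the last release is now ordered before t.
            self.clocks[t] = [max(a, b) for a, b in zip(self.clocks[t], released)]

    def release(self, t, lock):
        self.lock_clocks[lock] = list(self.clocks[t])
        self.clocks[t][t] += 1  # later events of t are not covered by this release

    def read(self, t, var):
        w = self.last_write.get(var)
        if w and w[0] != t and not self._ordered_before(w[0], w[1], t):
            self.races.append(("write-read", var, w[0], t))
        self.last_reads.setdefault(var, {})[t] = self.clocks[t][t]

    def write(self, t, var):
        w = self.last_write.get(var)
        if w and w[0] != t and not self._ordered_before(w[0], w[1], t):
            self.races.append(("write-write", var, w[0], t))
        for r, clock in self.last_reads.get(var, {}).items():
            if r != t and not self._ordered_before(r, clock, t):
                self.races.append(("read-write", var, r, t))
        self.last_write[var] = (t, self.clocks[t][t])
        self.last_reads[var] = {}
```

The per-variable records kept here (`last_write`, `last_reads`) are an example of the metadata such a detector maintains for every shared location, which is why bookkeeping cost dominates the overhead of this style of detection.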

BlogForever D2.6: Data Extraction Methodology

Author: Banos V.
Davis R.
Gkotsis G.
Pincent E.
Stepanyan K.
Publication date: 25/10/2013

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Preface of the Proceedings of WRAP 2004

Author: Thiran Philippe
Van den Heuvel Willem-Jan
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2004

Repository of the University of Namur

Proceedings of the Workshop on the Wrapper Techniques for Legacy Systems

Author: Thiran Philippe
van den Heuvel Willem-Jan
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2004

Repository of the University of Namur

Automatic construction and adaptation of wrappers for semi-structured web documents.

Author: Wong Tak Lam
Publication date: 01/01/2003

Wong Tak Lam. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 88-94). Abstracts in English and Chinese.

Contents:
Chapter 1. Introduction
1.1 Wrapper Induction for Semi-structured Web Documents
1.2 Adapting Wrappers to Unseen Web Sites
1.3 Thesis Contributions
1.4 Thesis Organization
Chapter 2. Related Work
2.1 Related Work on Wrapper Induction
2.2 Related Work on Wrapper Adaptation
Chapter 3. Automatic Construction of Hierarchical Wrappers
3.1 Hierarchical Record Structure Inference
3.2 Extraction Rule Induction
3.3 Applying Hierarchical Wrappers
Chapter 4. Experimental Results for Wrapper Induction
Chapter 5. Adaptation of Wrappers for Unseen Web Sites
5.1 Problem Definition
5.2 Overview of Wrapper Adaptation Framework
5.3 Potential Training Example Candidate Identification
5.3.1 Useful Text Fragments
5.3.2 Training Example Generation from the Unseen Web Site
5.3.3 Modified Nearest Neighbour Classification
5.4 Machine Annotated Training Example Discovery and New Wrapper Learning
5.4.1 Text Fragment Classification
5.4.2 New Wrapper Learning
Chapter 6. Case Study and Experimental Results for Wrapper Adaptation
6.1 Case Study on Wrapper Adaptation
6.2 Experimental Results
6.2.1 Book Domain
6.2.2 Consumer Electronic Appliance Domain
Chapter 7. Conclusions and Future Work
Bibliography
Chapter A. Detailed Performance of Wrapper Induction for Book Domain
Chapter B. Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain

CUHK Digital Repository

Experiments with property driven monitoring of C programs

Author: Vorobyov Kostyantyn
Publication date: 10/10/2015

Bond University Research Portal