55 research outputs found
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
Self-supervised automated wrapper generation for weblog data extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives
Designing a general deep web harvester by harvestability factor
To make deep web data accessible, harvesters have a crucial role. Targeting different domains and websites enhances the need of a general-purpose harvester which can be applied to different settings and situations. To develop such a harvester, a large number of issues should be addressed. To have all influential elements in one big picture, a new concept, called harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HFW) or a harvester (HF_H) representing the extent to which the website can be harvested or the harvester can harvest. The comprising elements of these factors are different websites’ or harvesters’ features. These elements are gathered from literature or introduced through the authors’ experiments. In addition to enabling designers of evaluating where they products stand from the harvesting perspective, the HF can act as a framework for designing harvesters. Designers can define the list of features and prioritize their implementations. To validate the effectiveness of HF in practice, it is shown how the HFs0\ud
websites and how this is useful in designing a harvester. To validate the HF H as an evaluation metric, it is shown how it can be calculated for the harvester implemented by the authors. The results show that the developed harvester works pretty well for the targeted test set by a score of 14.783 of 15
Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF)
The growing need of accessing more and more information draws attentions to huge amount of data hidden behind web forms defined as deep web. To make this data accessible, harvesters have a crucial role. Targeting different domains and websites enhances the need to have a general-purpose harvester which can be applied to different settings and situations. To develop such a harvester, a number of issues should be considered. Among these issues, business domain features, targeted websites' features, and the harvesting goals are the most influential ones. To consider all these elements in one big picture, a new concept, called harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_w) or a harvester (HF_h) representing the extent to which the website can be harvested or the harvester can harvest. The comprising elements of these factors are different websites' (for HF_w) or harvesters' (for HF_h) features. These features are presented in this paper by gathering a number of them from literature and introducing new ones through the authors' experiments. In addition to enabling websites' or harvesters' designers of evaluating where they products stand from the harvesting perspective, the HF can act as a framework for designing general purpose deep web harvesters. This framework allows filling in the gap in designing general purpose harvesters by focusing on detailed features of deep websites which have effects on harvesting processes. The represented features in this paper provide a thorough list of requirements for designing deep web harvesters which is not done to best of our knowledge in literature in this extent. To validate the effectiveness of HF in practice, it is shown how the HFs' elements can be applied in categorizing deep websites and how this is useful in designing a harvester. To run the experiments, the developed harvester by the authors, is also discussed in this paper
Efficient Precise Dynamic Data Race Detection For Cpu And Gpu
Data races are notorious bugs. They introduce non-determinism in programs behavior, complicate programs semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic in the last decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their application to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which has a disparate programming and execution model from CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of
data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
Automatic construction and adaptation of wrappers for semi-structured web documents.
Wong Tak Lam.Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.Includes bibliographical references (leaves 88-94).Abstracts in English and Chinese.Chapter 1 --- Introduction --- p.1Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents --- p.1Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites --- p.6Chapter 1.3 --- Thesis Contributions --- p.7Chapter 1.4 --- Thesis Organization --- p.8Chapter 2 --- Related Work --- p.10Chapter 2.1 --- Related Work on Wrapper Induction --- p.10Chapter 2.2 --- Related Work on Wrapper Adaptation --- p.16Chapter 3 --- Automatic Construction of Hierarchical Wrappers --- p.20Chapter 3.1 --- Hierarchical Record Structure Inference --- p.22Chapter 3.2 --- Extraction Rule Induction --- p.30Chapter 3.3 --- Applying Hierarchical Wrappers --- p.38Chapter 4 --- Experimental Results for Wrapper Induction --- p.40Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites --- p.52Chapter 5.1 --- Problem Definition --- p.52Chapter 5.2 --- Overview of Wrapper Adaptation Framework --- p.55Chapter 5.3 --- Potential Training Example Candidate Identification --- p.58Chapter 5.3.1 --- Useful Text Fragments --- p.58Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site --- p.60Chapter 5.3.3 --- Modified Nearest Neighbour Classification --- p.63Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrap- per Learning --- p.64Chapter 5.4.1 --- Text Fragment Classification --- p.64Chapter 5.4.2 --- New Wrapper Learning --- p.69Chapter 6 --- Case Study and Experimental Results for Wrapper Adapta- tion --- p.71Chapter 6.1 --- Case Study on Wrapper Adaptation --- p.71Chapter 6.2 --- Experimental Results --- p.73Chapter 6.2.1 --- Book Domain --- p.74Chapter 6.2.2 --- Consumer Electronic Appliance Domain --- p.79Chapter 7 --- Conclusions and Future Work --- p.83Bibliography --- p.88Chapter A --- Detailed Performance of Wrapper Induction for Book Do- main --- p.95Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain --- p.9
- …