55 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-based System

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.

    Designing a general deep web harvester by harvestability factor

    To make deep web data accessible, harvesters have a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester which can be applied to different settings and situations. To develop such a harvester, a large number of issues should be addressed. To have all influential elements in one big picture, a new concept, called harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_W) or a harvester (HF_H) representing the extent to which the website can be harvested or the harvester can harvest. The constituent elements of these factors are features of different websites or harvesters. These elements are gathered from the literature or introduced through the authors' experiments. In addition to enabling designers to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing harvesters. Designers can define the list of features and prioritize their implementations. To validate the effectiveness of HF in practice, it is shown how the HFs' elements can be applied in categorizing deep websites and how this is useful in designing a harvester. To validate the HF_H as an evaluation metric, it is shown how it can be calculated for the harvester implemented by the authors. The results show that the developed harvester performs well on the targeted test set, with a score of 14.783 out of 15.

    Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF)

    The growing need to access more and more information draws attention to the huge amount of data hidden behind web forms, known as the deep web. To make this data accessible, harvesters have a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester which can be applied to different settings and situations. To develop such a harvester, a number of issues should be considered. Among these issues, business domain features, targeted websites' features, and the harvesting goals are the most influential ones. To consider all these elements in one big picture, a new concept, called harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_w) or a harvester (HF_h) representing the extent to which the website can be harvested or the harvester can harvest. The constituent elements of these factors are features of different websites (for HF_w) or harvesters (for HF_h). These features are presented in this paper by gathering a number of them from the literature and introducing new ones through the authors' experiments. In addition to enabling designers of websites or harvesters to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing general-purpose deep web harvesters. This framework helps fill the gap in designing general-purpose harvesters by focusing on detailed features of deep websites which affect harvesting processes. The features presented in this paper provide a thorough list of requirements for designing deep web harvesters, which, to the best of our knowledge, has not been done to this extent in the literature. To validate the effectiveness of HF in practice, it is shown how the HFs' elements can be applied in categorizing deep websites and how this is useful in designing a harvester. The harvester developed by the authors to run the experiments is also discussed in this paper.

    Efficient Precise Dynamic Data Race Detection for CPU and GPU

    Data races are notorious bugs. They introduce non-determinism into program behavior and complicate program semantics, making parallel programs challenging to debug. To make parallel programming easier, efficient data race detection has been a research topic for decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their use in real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which have a programming and execution model that differs from that of CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    Preface of the Proceedings of WRAP 2004


    Automatic construction and adaptation of wrappers for semi-structured web documents.

    Wong Tak Lam. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 88-94). Abstracts in English and Chinese.
    Chapter 1 --- Introduction
    Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents
    Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites
    Chapter 1.3 --- Thesis Contributions
    Chapter 1.4 --- Thesis Organization
    Chapter 2 --- Related Work
    Chapter 2.1 --- Related Work on Wrapper Induction
    Chapter 2.2 --- Related Work on Wrapper Adaptation
    Chapter 3 --- Automatic Construction of Hierarchical Wrappers
    Chapter 3.1 --- Hierarchical Record Structure Inference
    Chapter 3.2 --- Extraction Rule Induction
    Chapter 3.3 --- Applying Hierarchical Wrappers
    Chapter 4 --- Experimental Results for Wrapper Induction
    Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites
    Chapter 5.1 --- Problem Definition
    Chapter 5.2 --- Overview of Wrapper Adaptation Framework
    Chapter 5.3 --- Potential Training Example Candidate Identification
    Chapter 5.3.1 --- Useful Text Fragments
    Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site
    Chapter 5.3.3 --- Modified Nearest Neighbour Classification
    Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrapper Learning
    Chapter 5.4.1 --- Text Fragment Classification
    Chapter 5.4.2 --- New Wrapper Learning
    Chapter 6 --- Case Study and Experimental Results for Wrapper Adaptation
    Chapter 6.1 --- Case Study on Wrapper Adaptation
    Chapter 6.2 --- Experimental Results
    Chapter 6.2.1 --- Book Domain
    Chapter 6.2.2 --- Consumer Electronic Appliance Domain
    Chapter 7 --- Conclusions and Future Work
    Bibliography
    Chapter A --- Detailed Performance of Wrapper Induction for Book Domain
    Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain