3,744 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Design of Automatically Adaptable Web Wrappers

    Get PDF
    Nowadays, the huge amount of information distributed through the Web motivates studying techniques to\ud be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises\ud developed several approaches of Web data extraction, for example using techniques of artificial intelligence or\ud machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision\ud of information extracted from Web pages, and, at the same time, have to prove robustness in order not to\ud compromise quality and reliability of data themselves.\ud In this paper we focus on some experimental aspects related to the robustness of the data extraction process\ud and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for\ud finding similarities between two different version of a Web page, in order to handle modifications, avoiding\ud the failure of data extraction tasks and ensuring reliability of information extracted. Our purpose is to evaluate\ud performances, advantages and draw-backs of our novel system of automatic wrapper adaptation

    Automatic Wrapper Adaptation by Tree Edit Distance Matching

    Get PDF
    Information distributed through the Web keeps growing faster day by day,\ud and for this reason, several techniques for extracting Web data have been suggested\ud during last years. Often, extraction tasks are performed through so called wrappers,\ud procedures extracting information from Web pages, e.g. implementing logic-based\ud techniques. Many fields of application today require a strong degree of robustness\ud of wrappers, in order not to compromise assets of information or reliability of data\ud extracted.\ud Unfortunately, wrappers may fail in the task of extracting data from a Web page, if\ud its structure changes, sometimes even slightly, thus requiring the exploiting of new\ud techniques to be automatically held so as to adapt the wrapper to the new structure\ud of the page, in case of failure. In this work we present a novel approach of automatic wrapper adaptation based on the measurement of similarity of trees through\ud improved tree edit distance matching techniques
    corecore