12,809 research outputs found

    Design of Automatically Adaptable Web Wrappers

    Nowadays, the huge amount of information distributed through the Web motivates studying techniques to be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques of artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of information extracted from Web pages, and, at the same time, have to prove robustness in order not to compromise quality and reliability of data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring reliability of information extracted. Our purpose is to evaluate performance, advantages and drawbacks of our novel system of automatic wrapper adaptation

    Virtue integrated platform : holistic support for distributed ship hydrodynamic design

    Ship hydrodynamic design today is often still done in a sequential approach. Tools used for the different aspects of CFD (Computational Fluid Dynamics) simulation (e.g. wave resistance, cavitation, seakeeping, and manoeuvring), and even for the different levels of detail within a single aspect, are often poorly integrated. The VIRTUE (VIRtual Tank Utility in Europe) project aims to develop a platform that will enable various distributed CFD and design applications to be integrated so that they may operate in a unified and holistic manner. This paper presents an overview of the VIRTUE Integrated Platform (VIP), including its research background, objectives, current work, user requirements, system architecture, implementation, evaluation, and current development and future work

    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task

    Context constraint integration and validation in dynamic web service compositions

    System architectures that cross organisational boundaries are usually implemented based on Web service technologies due to their inherent interoperability benefits. With increasing flexibility requirements, such as on-demand service provision, a dynamic approach to service architecture focussing on composition at runtime is needed. The possibility of technical faults, as well as violations of functional and semantic constraints, requires a comprehensive notion of context that captures composition-relevant aspects. Context-aware techniques are consequently required to support constraint validation for dynamic service composition. We present techniques to respond to problems occurring during the execution of dynamically composed Web services implemented in WS-BPEL. A notion of context (covering physical and contractual faults and violations) is used to safeguard composed service executions dynamically. Our aim is to present an architectural framework from an application-oriented perspective, addressing practical considerations of a technical framework

    Feature Selection via Coalitional Game Theory

    We present and study the contribution-selection algorithm (CSA), a novel algorithm for feature selection. The algorithm is based on the multiperturbation Shapley analysis (MSA), a framework that relies on game theory to estimate usefulness. The algorithm iteratively estimates the usefulness of features and selects them accordingly, using either forward selection or backward elimination. It can optimize various performance measures over unseen data, such as accuracy, balanced error rate, and area under the receiver-operator-characteristic curve. Empirical comparison with several other existing feature selection methods shows that the backward elimination variant of CSA leads to the most accurate classification results on an array of data sets
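
The forward-selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the permutation-sampling estimator, the stopping threshold, and the toy payoff are all assumptions made for illustration. In practice the payoff would be something like validation accuracy of a classifier trained on the given feature subset.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Sequence

Payoff = Callable[[FrozenSet[str]], float]


def estimate_contributions(candidates: Sequence[str], selected: Sequence[str],
                           payoff: Payoff, n_permutations: int,
                           rng: random.Random) -> Dict[str, float]:
    """Monte Carlo estimate of each candidate's average marginal contribution,
    given the features already selected (a Shapley-style estimate)."""
    totals = {f: 0.0 for f in candidates}
    for _ in range(n_permutations):
        order = list(candidates)
        rng.shuffle(order)
        coalition = set(selected)
        previous = payoff(frozenset(coalition))
        for feature in order:
            coalition.add(feature)
            current = payoff(frozenset(coalition))
            totals[feature] += current - previous
            previous = current
    return {f: total / n_permutations for f, total in totals.items()}


def forward_contribution_selection(features: Sequence[str], payoff: Payoff,
                                   threshold: float = 0.0,
                                   n_permutations: int = 200,
                                   seed: int = 0) -> List[str]:
    """Repeatedly add the feature with the highest estimated contribution
    until no remaining feature exceeds the (hypothetical) threshold."""
    rng = random.Random(seed)
    selected: List[str] = []
    remaining = list(features)
    while remaining:
        contrib = estimate_contributions(remaining, selected, payoff,
                                         n_permutations, rng)
        best = max(remaining, key=lambda f: contrib[f])
        if contrib[best] <= threshold:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy additive payoff (an assumption): each feature adds a fixed amount.
TOY_WEIGHTS = {"a": 3.0, "b": 2.0, "noise": 0.0}


def toy_payoff(subset: FrozenSet[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in subset)


chosen = forward_contribution_selection(["a", "b", "noise"], toy_payoff)
```

With the additive toy payoff, each feature's marginal contribution is the same in every ordering, so the sketch picks the two informative features and stops at the uninformative one.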

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives
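
As a rough illustration of matching web feed values against HTML elements across several posts, the sketch below votes for the element path whose text equals the feed value and reports the share of posts supporting it. It is a simplified assumption, not the paper's model: the path notation, the exact-match comparison, the `derive_rule` helper, and the sample posts are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Collects (path, text) pairs, with paths such as 'html/body/h1.title'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def derive_rule(posts, field):
    """Return (path, support) for the path that most often holds the feed
    value of `field`, where support is the fraction of posts that agree."""
    votes = Counter()
    for html, feed_entry in posts:
        collector = PathTextCollector()
        collector.feed(html)
        matching = {path for path, text in collector.pairs
                    if text == feed_entry[field]}
        votes.update(matching)
    if not votes:
        return None
    path, count = votes.most_common(1)[0]
    return path, count / len(posts)


# Hypothetical posts: (HTML page, values taken from the post's feed entry).
EXAMPLE_POSTS = [
    ('<html><body><h1 class="title">Hello</h1><p>Hello</p></body></html>',
     {"title": "Hello"}),
    ('<html><body><h1 class="title">World</h1><p>Other text</p></body></html>',
     {"title": "World"}),
]

rule = derive_rule(EXAMPLE_POSTS, "title")
```

Using several posts matters here: in the first sample post the title text also appears in a paragraph, and only the vote across posts separates the consistent path from the coincidental one.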

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform