
    Automatic Wrapper Induction from Hidden-Web Sources with Domain Knowledge

    ABSTRACT

    We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The approach relies heavily on domain knowledge, expressed in a predefined form, for a given domain of interest. Understanding a given hidden-Web service has two parts: understanding the structure of its input and understanding how its output is presented. This amounts to understanding the structure of a given form and relating its fields to concepts of the domain of interest, and to understanding where and how the resulting records are represented in an HTML result page. For the former problem, we use a combination of heuristics and probing with domain instances; for the latter, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation produced using the domain knowledge. Together, these two steps make it possible to automatically wrap a form as a standard Web service with a WSDL description. We implemented such a system and present experiments that demonstrate the validity and potential of this approach.
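
    The form-understanding step described in the abstract (heuristics combined with probing with domain instances) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `DOMAIN` dictionary, the function names, and the `submit` callback (assumed to return the number of records a probe yields) are all hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical domain knowledge: label keywords and sample instances per concept.
DOMAIN = {
    "Title": {"keywords": ["title"], "instances": ["wrapper induction"]},
    "Author": {"keywords": ["author", "writer"], "instances": ["Smith"]},
}

def guess_concept(label: str, domain=DOMAIN) -> Optional[str]:
    """Heuristic step: match a field's visible label against concept keywords."""
    text = label.lower()
    for concept, info in domain.items():
        if any(keyword in text for keyword in info["keywords"]):
            return concept
    return None

def probe_concept(field: str, submit: Callable[[str, str], int],
                  domain=DOMAIN) -> Optional[str]:
    """Probing step: submit instances of each concept through the field and
    keep the concept whose instances return the most records."""
    best, best_count = None, 0
    for concept, info in domain.items():
        count = sum(submit(field, value) for value in info["instances"])
        if count > best_count:
            best, best_count = concept, count
    return best

def relate_fields(fields: Dict[str, str],
                  submit: Callable[[str, str], int]) -> Dict[str, Optional[str]]:
    """Map each form field name to a domain concept, trying the label
    heuristic first and falling back to probing."""
    return {
        name: guess_concept(label) or probe_concept(name, submit)
        for name, label in fields.items()
    }
```

    Here `fields` maps each form field name to its visible label. Trying the label heuristic before probing is a design choice of this sketch, made to limit the number of form submissions; the abstract does not specify how the two are combined.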