190 research outputs found

    Schema-Guided Induction of Monadic Queries

    Get PDF
    International audienceThe induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML

    Query Induction with Schema-Guided Pruning Strategies

    Get PDF
    International audienceInference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schema-guided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    A Decidable Class of Nested Iterated Schemata (extended version)

    Full text link
    Many problems can be specified by patterns of propositional formulae depending on a parameter, e.g. the specification of a circuit usually depends on the number of bits of its input. We define a logic whose formulae, called "iterated schemata", allow to express such patterns. Schemata extend propositional logic with indexed propositions, e.g. P_i, P_i+1, P_1, and with generalized connectives, e.g. /\i=1..n or i=1..n (called "iterations") where n is an (unbound) integer variable called a "parameter". The expressive power of iterated schemata is strictly greater than propositional logic: it is even out of the scope of first-order logic. We define a proof procedure, called DPLL*, that can prove that a schema is satisfiable for at least one value of its parameter, in the spirit of the DPLL procedure. However the converse problem, i.e. proving that a schema is unsatisfiable for every value of the parameter, is undecidable so DPLL* does not terminate in general. Still, we prove that it terminates for schemata of a syntactic subclass called "regularly nested". This is the first non trivial class for which DPLL* is proved to terminate. Furthermore the class of regularly nested schemata is the first decidable class to allow nesting of iterations, i.e. to allow schemata of the form /\i=1..n (/\j=1..n ...).Comment: 43 pages, extended version of "A Decidable Class of Nested Iterated Schemata", submitted to IJCAR 200

    Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas

    Get PDF
    Special issue of LATA'08.International audienceWe present algorithms for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|) where B is deterministic (bottom-up or top-down). We extend our algorithms for testing inclusion of automata for unranked trees A in deterministic DTDs or deterministic EDTDs with restrained competition D in time O(|A| |Σ| |D|). Previous algorithms were less efficient or less general

    Proceedings of the 19th International Workshop on Unification

    Get PDF
    Proceedings of the 19th international workshop on Unification, held during RDP'2005 in Nara, Japan, on April 22, 2005.UNIF is the main international meeting on unification. Unification is concerned with the problem of identifying given terms, either syntactically or modulo a given logical theory. Syntactic unification is the basic operation of most automated reasoning systems, and unification modulo theories can be used, for instance, to build in special equational theories into theorem provers

    Optimization of object query languages

    Get PDF

    A Decision Procedure for XPath Containment

    Get PDF
    XPath is the standard language for addressing parts of an XML document. We present a sound and complete decision procedure for containment of XPath queries. The considered XPath fragment covers most of the language features used in practice. Specifically, we show how XPath queries can be translated into equivalent formulas in monadic second-order logic. Using this translation, we construct an optimized logical formulation of the containment problem, which is decided using tree automata. When the containment relation does not hold between two XPath expressions, a counter-example XML tree is generated. We provide a complexity analysis together with practical experiments that illustrate the efficiency of the decision procedure for realistic scenarios
    • …
    corecore