Search CORE

190 research outputs found

Schema-Guided Induction of Monadic Queries

Author: Champavère Jérôme
Gilleron Rémi
Lemay Aurélien
Niehren Joachim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2008
Field of study

International audienceThe induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML

HAL - Lille 3

INRIA a CCSD electronic archive server

Query Induction with Schema-Guided Pruning Strategies

Author: Champavère Jérôme
Gilleron Rémi
Lemay Aurélien
Niehren Joachim
Publication venue: Microtome Publishing
Publication date: 01/01/2013
Field of study

International audienceInference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schema-guided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction

HAL - Lille 3

CiteSeerX

INRIA a CCSD electronic archive server

Web Data Extraction, Applications and Techniques: A Survey

Author: Abel
Amalfitano
Balduzzi
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Berger
Berthold
Bettencourt
Califf
Catanese
Chang
Chen
Chen
Chen
Collins
Conover
Crandall
Crescenzi
Crescenzi
Dalvi
Dalvi
De Meo
De Meo
Doan
Emilio Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Flesca
Freitag
Furche
Gatterbauer
Gatterbauer
Giacomo Fiumara
Gjoka
Gkotsis
Gottlob
Gottlob
Hammersley
Han
Hecht
Hsu
Irmak
Khare
Kim
Kinsella
Kleinberg
Kleinberg
Kohlschütter
Kokkoras
Kokkoras
Kokkoras
Krüpl
Kushmerick
Kwak
Laender
Liu
Manning
Masanès
Mathes
Meng
Mislove
Monge
Muslea
Oro
Pan
Pasquale De Meo
Perito
Phan
Plake
Rahm
Rahm
Reis
Robert Baumgartner
Sahuguet
Sarawagi
Schifanella
Selkow
Shi
Soderland
Szomszor
Turmo
Vosecky
Wang
Wang
Weikum
Wilson
Winograd
Yang
Ye
Zafarani
Zanasi
Zhai
Zhang
Zhang
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014
Field of study

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

A Decidable Class of Nested Iterated Schemata (extended version)

Author: Aravantinos Vincent
Caferra Ricardo
Peltier Nicolas
Publication venue
Publication date: 24/01/2010
Field of study

Many problems can be specified by patterns of propositional formulae depending on a parameter, e.g. the specification of a circuit usually depends on the number of bits of its input. We define a logic whose formulae, called "iterated schemata", allow to express such patterns. Schemata extend propositional logic with indexed propositions, e.g. P_i, P_i+1, P_1, and with generalized connectives, e.g. /\i=1..n or i=1..n (called "iterations") where n is an (unbound) integer variable called a "parameter". The expressive power of iterated schemata is strictly greater than propositional logic: it is even out of the scope of first-order logic. We define a proof procedure, called DPLL*, that can prove that a schema is satisfiable for at least one value of its parameter, in the spirit of the DPLL procedure. However the converse problem, i.e. proving that a schema is unsatisfiable for every value of the parameter, is undecidable so DPLL* does not terminate in general. Still, we prove that it terminates for schemata of a syntactic subclass called "regularly nested". This is the first non trivial class for which DPLL* is proved to terminate. Furthermore the class of regularly nested schemata is the first decidable class to allow nesting of iterations, i.e. to allow schemata of the form /\i=1..n (/\j=1..n ...).Comment: 43 pages, extended version of "A Decidable Class of Nested Iterated Schemata", submitted to IJCAR 200

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas

Author: Champavère Jérôme
Gilleron Rémi
Lemay Aurélien
Niehren Joachim
Publication venue: 'Elsevier BV'
Publication date: 01/11/2009
Field of study

Special issue of LATA'08.International audienceWe present algorithms for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|) where B is deterministic (bottom-up or top-down). We extend our algorithms for testing inclusion of automata for unranked trees A in deterministic DTDs or deterministic EDTDs with restrained competition D in time O(|A| |Σ| |D|). Previous algorithms were less efficient or less general

HAL - Lille 3

Elsevier - Publisher Connector

INRIA a CCSD electronic archive server

Proceedings of the 19th International Workshop on Unification

Author: Vigneron Laurent
Publication venue: Faculdade Santa Maria da Gloria
Publication date: 01/04/2005
Field of study

Proceedings of the 19th international workshop on Unification, held during RDP'2005 in Nara, Japan, on April 22, 2005.UNIF is the main international meeting on unification. Unification is concerned with the problem of identifying given terms, either syntactically or modulo a given logical theory. Syntactic unification is the basic operation of most automated reasoning systems, and unification modulo theories can be used, for instance, to build in special equational theories into theorem provers

HAL - Université de Franche-Comté

INRIA a CCSD electronic archive server

The Lixto Data Extraction Project - Back and Forth between Theory and Practice

Author: Baumgartner Robert
Flesca Sergio
Gottlob Georg
Herzog Marcus
Koch Christoph
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/06/2011
Field of study

Infoscience - École polytechnique fédérale de Lausanne

Optimization of object query languages

Author: Steenhagen Hendrika Janna
Publication venue: University of Twente
Publication date: 19/10/1995
Field of study

University of Twente Research Information

A Decision Procedure for XPath Containment

Author: Genevès Pierre
Layaïda Nabil
Publication venue: HAL CCSD
Publication date: 30/09/2005
Field of study

XPath is the standard language for addressing parts of an XML document. We present a sound and complete decision procedure for containment of XPath queries. The considered XPath fragment covers most of the language features used in practice. Specifically, we show how XPath queries can be translated into equivalent formulas in monadic second-order logic. Using this translation, we construct an optimized logical formulation of the containment problem, which is decided using tree automata. When the containment relation does not hold between two XPath expressions, a counter-example XML tree is generated. We provide a complexity analysis together with practical experiments that illustrate the efficiency of the decision procedure for realistic scenarios

CiteSeerX

INRIA a CCSD electronic archive server