62 research outputs found
Characterizing XML Twig Queries with Examples
Typically, a (Boolean) query is a finite formula that defines a possibly infinite set of database instances that satisfy it (positive examples) and, implicitly, the set of instances that do not satisfy it (negative examples). We investigate the following natural question: for a given class of queries, is it possible to characterize every query with a finite set of positive and negative examples that no other query is consistent with? We study this question for twig queries and XML databases. We show that while twig queries are characterizable, they generally require exponential sets of examples. Consequently, we focus on a practical subclass of anchored twig queries and show that they are not only characterizable but characterizable with polynomially-sized sets of examples. This result is obtained with the use of generalization operations on twig queries, whose application to an anchored twig query yields a properly contained and minimally different query. Our results illustrate further interesting and strong connections between the structure and the semantics of anchored twig queries that the class of arbitrary twig queries does not enjoy. Finally, we show that the class of unions of twig queries is not characterizable.
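To make the notion of positive and negative examples concrete, a twig query can be represented as a tree pattern whose edges are child or descendant axes and whose nodes carry a label or a wildcard. The encoding below is an illustrative sketch, not the paper's formalism:

```python
# Assumed toy representation (not the paper's):
# a document node is (label, [children]); a twig query node is
# (label_or_None, axis, [children]) with axis 'child' or 'desc',
# and None standing for the wildcard label.

def matches(query, node):
    """True if the twig pattern `query` embeds at document node `node`."""
    qlabel, _, qchildren = query
    label, children = node
    if qlabel is not None and qlabel != label:
        return False
    return all(any_embed(qc, children) for qc in qchildren)

def any_embed(qchild, nodes):
    """Try to place pattern node `qchild` at one of `nodes`,
    descending further when its axis is 'desc' (the // axis)."""
    _, axis, _ = qchild
    for n in nodes:
        if matches(qchild, n):
            return True
        if axis == 'desc':
            _, kids = n
            if any_embed(qchild, kids):
                return True
    return False
```

A document matching the pattern is a positive example for the query; a non-matching one is a negative example.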
Méthodes d'Inférence Symbolique pour les Bases de Données (Symbolic Inference Methods for Databases)
This dissertation is a summary of a line of research, which I was actively involved in, on learning in databases from examples. This research focused on traditional as well as novel database models and languages for querying, transforming, and describing the schema of a database. In the case of schemas, our contributions involve proposing original languages for the emerging data models of unordered XML and RDF. We have studied learning from examples of schemas for unordered XML, schemas for RDF, twig queries for XML, join queries for relational databases, and XML transformations defined with a novel model of tree-to-word transducers. Investigating learnability of the proposed languages required us to examine closely a number of their fundamental properties, often of independent interest, including normal forms, minimization, containment and equivalence, consistency of a set of examples, and finite characterizability. A good understanding of these properties allowed us to devise learning algorithms that explore a possibly large search space with the help of a diligently designed set of generalization operations in search of an appropriate solution. Learning (or inference) is a problem that has two parameters: the precise class of languages we wish to infer and the type of input that the user can provide. We focused on the setting where the user input consists of positive examples, i.e., elements that belong to the goal language, and negative examples, i.e., elements that do not belong to the goal language. In general, using both negative and positive examples allows learning richer classes of goal languages than using positive examples alone.
However, using negative examples is often difficult because, together with positive examples, they may cause the search space to take a very complex shape, and its exploration may turn out to be computationally challenging.
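The generalization-driven exploration described above can be sketched domain-independently: start from a most-specific hypothesis built from the positive examples and greedily apply generalization operations as long as no negative example becomes accepted. This is an illustrative skeleton under assumed interfaces (`initial`, `generalizations`, `accepts`), not any of the dissertation's actual algorithms:

```python
def learn(positives, negatives, initial, generalizations, accepts):
    """Greedy sketch of example-driven inference: generalize a
    most-specific hypothesis while staying consistent with both
    positive and negative examples."""
    def consistent(h):
        return (all(accepts(h, p) for p in positives)
                and not any(accepts(h, n) for n in negatives))

    h = initial(positives)
    changed = True
    while changed:
        changed = False
        for g in generalizations(h):
            if consistent(g):   # take the first safe generalization step
                h = g
                changed = True
                break
    return h if consistent(h) else None
```

For instance, with hypotheses that are integer intervals, `accepts` as interval membership, and generalizations that widen an endpoint by one, the learner widens the interval around the positive examples until the negative examples block further growth.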
Schemas for Unordered XML on a DIME
We investigate schema languages for unordered XML having no relative order
among siblings. First, we propose unordered regular expressions (UREs),
essentially regular expressions with unordered concatenation instead of
standard concatenation, that define languages of unordered words to model the
allowed content of a node (i.e., collections of the labels of children).
However, unrestricted UREs are computationally too expensive as we show the
intractability of two fundamental decision problems for UREs: membership of an
unordered word to the language of a URE and containment of two UREs.
Consequently, we propose a practical and tractable restriction of UREs,
disjunctive interval multiplicity expressions (DIMEs).
Next, we employ DIMEs to define languages of unordered trees and propose two
schema languages: disjunctive interval multiplicity schema (DIMS), and its
restriction, disjunction-free interval multiplicity schema (IMS). We study the
complexity of the following static analysis problems: schema satisfiability,
membership of a tree to the language of a schema, schema containment, as well
as twig query satisfiability, implication, and containment in the presence of
schema. Finally, we study the expressive power of the proposed schema languages
and compare them with yardstick languages of unordered trees (FO, MSO, and
Presburger constraints) and DTDs under commutative closure. Our results show
that the proposed schema languages are capable of expressing many practical
languages of unordered trees and enjoy desirable computational properties.
Comment: Theory of Computing Systems
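The membership problem for the tractable restriction is easy to picture in a stripped-down form: if each allowed label carries one interval of permitted occurrence counts, membership of an unordered word reduces to counting labels. This sketch covers only that single-conjunct case; the DIMEs of the paper additionally allow restricted disjunction:

```python
from collections import Counter

def satisfies(word, ime):
    """Membership of an unordered word (a list of labels, order
    irrelevant) in a much-simplified interval multiplicity
    expression: `ime` maps each allowed label to an interval
    (lo, hi), with hi=None meaning unbounded."""
    counts = Counter(word)
    if any(label not in ime for label in counts):
        return False          # a label outside the expression
    return all(lo <= counts[label] and (hi is None or counts[label] <= hi)
               for label, (lo, hi) in ime.items())
```

For example, `{'a': (1, None), 'b': (0, 1)}` plays the role of the expression a⁺ b⁰ᐨ¹: at least one `a`, at most one `b`, nothing else.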
Learning Schemas for Unordered XML
We consider unordered XML, where the relative order among siblings is
ignored, and we investigate the problem of learning schemas from examples given
by the user. We focus on the schema formalisms proposed in [10]: disjunctive
multiplicity schemas (DMS) and its restriction, disjunction-free multiplicity
schemas (MS). A learning algorithm takes as input a set of XML documents which
must satisfy the schema (i.e., positive examples) and a set of XML documents
which must not satisfy the schema (i.e., negative examples), and returns a
schema consistent with the examples. We investigate a learning framework
inspired by Gold [18], where a learning algorithm should be sound i.e., always
return a schema consistent with the examples given by the user, and complete
i.e., able to produce every schema with a sufficiently rich set of examples.
Additionally, the algorithm should be efficient i.e., polynomial in the size of
the input. We prove that the DMS are learnable from positive examples only, but
they are not learnable when we also allow negative examples. Moreover, we show
that the MS are learnable in the presence of positive examples only, and also
in the presence of both positive and negative examples. Furthermore, for the
learnable cases, the proposed learning algorithms return minimal schemas
consistent with the examples.
Comment: Proceedings of the 14th International Symposium on Database
Programming Languages (DBPL 2013), August 30, 2013, Riva del Garda, Trento,
Italy
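A toy version of learning from positive examples only: for each label occurring among the children of the example nodes, pick the minimal multiplicity among 1, ?, +, * that covers every observed occurrence count. This is an illustrative simplification, not the paper's DMS/MS learning algorithm:

```python
from collections import Counter

MULTIPLICITIES = {(1, 1): '1', (0, 1): '?', (1, None): '+', (0, None): '*'}

def infer_multiplicities(examples):
    """From positive examples (each a list of child labels observed
    under some node), infer for every label the minimal multiplicity
    covering all observed counts -- a toy positive-example learner."""
    labels = {l for ex in examples for l in ex}
    schema = {}
    for label in labels:
        counts = [Counter(ex)[label] for ex in examples]
        lo = 1 if min(counts) >= 1 else 0          # label always present?
        hi = 1 if max(counts) <= 1 else None       # ever repeated?
        schema[label] = MULTIPLICITIES[(lo, hi)]
    return schema
```

Because the inferred interval is the tightest one covering the observations, the resulting schema is minimal among those consistent with the examples, mirroring the minimality property stated in the abstract.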
Characterising Modal Formulas with Examples
We study the existence of finite characterisations for modal formulas. A
finite characterisation of a modal formula φ is a finite collection of
positive and negative examples that distinguishes φ from every other,
non-equivalent modal formula, where an example is a finite pointed Kripke
structure. This definition can be restricted to specific frame classes and to
fragments of the modal language: a modal fragment L admits finite
characterisations with respect to a frame class F if every formula φ ∈ L
has a finite characterisation with respect to L consisting of
examples that are based on frames in F. Finite characterisations are useful
for illustration, interactive specification, and debugging of formal
specifications, and their existence is a precondition for exact learnability
with membership queries. We show that the full modal language admits finite
characterisations with respect to a frame class F only when the modal logic
of F is locally tabular. We then study which modal fragments, freely
generated by some set of connectives, admit finite characterisations. Our main
result is that the positive modal language without the truth-constants ⊤
and ⊥ admits finite characterisations w.r.t. the class of all frames. This
result is essentially optimal: finite characterizability fails when the
language is extended with a truth constant or with all but very
limited forms of negation.
Comment: Expanded version of material from Raoul Koudijs's MSc thesis (2022)
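An example in this setting is a finite pointed Kripke structure, and checking whether it is a positive or negative example for a formula is just standard modal evaluation. A minimal sketch, with formulas as nested tuples (an assumed encoding):

```python
def holds(formula, frame, world):
    """Standard modal semantics at a pointed Kripke structure:
    `frame` = (R, V), where R maps each world to its successor list
    and V maps each world to the set of proposition letters true
    there; `world` is the distinguished point."""
    R, V = frame
    op = formula[0]
    if op == 'prop':
        return formula[1] in V[world]
    if op == 'not':
        return not holds(formula[1], frame, world)
    if op == 'and':
        return holds(formula[1], frame, world) and holds(formula[2], frame, world)
    if op == 'or':
        return holds(formula[1], frame, world) or holds(formula[2], frame, world)
    if op == 'dia':   # <>f: some successor satisfies f
        return any(holds(formula[1], frame, v) for v in R[world])
    if op == 'box':   # []f: every successor satisfies f
        return all(holds(formula[1], frame, v) for v in R[world])
    raise ValueError(op)
```

For the formula ◊p, a two-world structure whose point sees a p-world is a positive example, while the pointed structure at a world with no successors is a negative one.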
Inference of Shape Graphs for Graph Databases
We investigate the problem of constructing a shape graph that describes the structure of a given graph database. We employ the framework of grammatical inference, where the objective is to find an inference algorithm that is both sound, i.e., always producing a schema that validates the input graph, and complete, i.e., able to produce any schema, within a given class of schemas, provided that a sufficiently informative input graph is presented. We identify a number of fundamental limitations that preclude feasible inference. We present inference algorithms based on natural approaches that allow us to infer schemas that we argue to be of practical importance.
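A crude illustration of what such an inference algorithm produces: group the nodes of the input graph by label and record, for each label, which outgoing predicates all of its nodes share (required) and which only some have (optional). This stand-in is far simpler than the sound-and-complete algorithms the abstract refers to:

```python
def infer_shapes(nodes, edges):
    """Toy shape inference. `nodes` maps node id -> label; `edges`
    is a list of (source, predicate, target) triples. Returns, per
    node label, each outgoing predicate marked 'required' (present
    on every node of that label) or 'optional' (present on some)."""
    out = {}
    for s, p, _ in edges:
        out.setdefault(s, set()).add(p)
    shapes = {}
    for label in set(nodes.values()):
        members = [n for n, l in nodes.items() if l == label]
        preds = set().union(*(out.get(n, set()) for n in members))
        shapes[label] = {
            p: ('required' if all(p in out.get(n, set()) for n in members)
                else 'optional')
            for p in preds
        }
    return shapes
```

Such an algorithm is sound in the abstract's sense (the inferred shapes validate the input graph), but a complete algorithm must also be able to reach every schema in the target class from some input graph, which is where the paper's fundamental limitations arise.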
- …