8,234 research outputs found
: Méthodes d'Inférence Symbolique pour les Bases de Données
This dissertation is a summary of a line of research, that I wasactively involved in, on learning in databases from examples. Thisresearch focused on traditional as well as novel database models andlanguages for querying, transforming, and describing the schema of adatabase. In case of schemas our contributions involve proposing anoriginal languages for the emerging data models of Unordered XML andRDF. We have studied learning from examples of schemas for UnorderedXML, schemas for RDF, twig queries for XML, join queries forrelational databases, and XML transformations defined with a novelmodel of tree-to-word transducers.Investigating learnability of the proposed languages required us toexamine closely a number of their fundamental properties, often ofindependent interest, including normal forms, minimization,containment and equivalence, consistency of a set of examples, andfinite characterizability. Good understanding of these propertiesallowed us to devise learning algorithms that explore a possibly largesearch space with the help of a diligently designed set ofgeneralization operations in search of an appropriate solution.Learning (or inference) is a problem that has two parameters: theprecise class of languages we wish to infer and the type of input thatthe user can provide. We focused on the setting where the user inputconsists of positive examples i.e., elements that belong to the goallanguage, and negative examples i.e., elements that do not belong tothe goal language. In general using both negative and positiveexamples allows to learn richer classes of goal languages than usingpositive examples alone. However, using negative examples is oftendifficult because together with positive examples they may cause thesearch space to take a very complex shape and its exploration may turnout to be computationally challenging.Ce mémoire est une courte présentation d’une direction de recherche, à laquelle j’ai activementparticipé, sur l’apprentissage pour les bases de données à partir d’exemples. Cette recherches’est concentrée sur les modèles et les langages, aussi bien traditionnels qu’émergents, pourl’interrogation, la transformation et la description du schéma d’une base de données. Concernantles schémas, nos contributions consistent en plusieurs langages de schémas pour les nouveaumodèles de bases de données que sont XML non-ordonné et RDF. Nous avons ainsi étudiél’apprentissage à partir d’exemples des schémas pour XML non-ordonné, des schémas pour RDF,des requêtes twig pour XML, les requêtes de jointure pour bases de données relationnelles et lestransformations XML définies par un nouveau modèle de transducteurs arbre-à -mot.Pour explorer si les langages proposés peuvent être appris, nous avons été obligés d’examinerde près un certain nombre de leurs propriétés fondamentales, souvent souvent intéressantespar elles-mêmes, y compris les formes normales, la minimisation, l’inclusion et l’équivalence, lacohérence d’un ensemble d’exemples et la caractérisation finie. Une bonne compréhension de cespropriétés nous a permis de concevoir des algorithmes d’apprentissage qui explorent un espace derecherche potentiellement très vaste grâce à un ensemble d’opérations de généralisation adapté à la recherche d’une solution appropriée.L’apprentissage (ou l’inférence) est un problème à deux paramètres : la classe précise delangage que nous souhaitons inférer et le type d’informations que l’utilisateur peut fournir. Nousnous sommes placés dans le cas où l’utilisateur fournit des exemples positifs, c’est-à -dire deséléments qui appartiennent au langage cible, ainsi que des exemples négatifs, c’est-à -dire qui n’enfont pas partie. En général l’utilisation à la fois d’exemples positifs et négatifs permet d’apprendredes classes de langages plus riches que l’utilisation uniquement d’exemples positifs. Toutefois,l’utilisation des exemples négatifs est souvent difficile parce que les exemples positifs et négatifspeuvent rendre la forme de l’espace de recherche très complexe, et par conséquent, son explorationinfaisable
Active Learning with Multiple Views
Active learners alleviate the burden of labeling large amounts of data by
detecting and asking the user to label only the most informative examples in
the domain. We focus here on active learning for multi-view domains, in which
there are several disjoint subsets of features (views), each of which is
sufficient to learn the target concept. In this paper we make several
contributions. First, we introduce Co-Testing, which is the first approach to
multi-view active learning. Second, we extend the multi-view learning framework
by also exploiting weak views, which are adequate only for learning a concept
that is more general/specific than the target concept. Finally, we empirically
show that Co-Testing outperforms existing active learners on a variety of real
world domains such as wrapper induction, Web page classification, advertisement
removal, and discourse tree parsing
Graph Pattern Matching in GQL and SQL/PGQ
As graph databases become widespread, JTC1 -- the committee in joint charge
of information technology standards for the International Organization for
Standardization (ISO), and International Electrotechnical Commission (IEC) --
has approved a project to create GQL, a standard property graph query language.
This complements a project to extend SQL with a new part, SQL/PGQ, which
specifies how to define graph views over an SQL tabular schema, and to run
read-only queries against them.
Both projects have been assigned to the ISO/IEC JTC1 SC32 working group for
Database Languages, WG3, which continues to maintain and enhance SQL as a
whole. This common responsibility helps enforce a policy that the identical
core of both PGQ and GQL is a graph pattern matching sub-language, here termed
GPML.
The WG3 design process is also analyzed by an academic working group, part of
the Linked Data Benchmark Council (LDBC), whose task is to produce a formal
semantics of these graph data languages, which complements their standard
specifications.
This paper, written by members of WG3 and LDBC, presents the key elements of
the GPML of SQL/PGQ and GQL in advance of the publication of these new
standards
Covariance and Controvariance: a fresh look at an old issue (a primer in advanced type systems for learning functional programmers)
Twenty years ago, in an article titled "Covariance and contravariance:
conflict without a cause", I argued that covariant and contravariant
specialization of method parameters in object-oriented programming had
different purposes and deduced that, not only they could, but actually they
should both coexist in the same language.
In this work I reexamine the result of that article in the light of recent
advances in (sub-)typing theory and programming languages, taking a fresh look
at this old issue.
Actually, the revamping of this problem is just an excuse for writing an
essay that aims at explaining sophisticated type-theoretic concepts, in simple
terms and by examples, to undergraduate computer science students and/or
willing functional programmers.
Finally, I took advantage of this opportunity to describe some undocumented
advanced techniques of type-systems implementation that are known only to few
insiders that dug in the code of some compilers: therefore, even expert
language designers and implementers may find this work worth of reading
CLICS² An Improved Database of Cross-Linguistic Colexifications : Assembling Lexical Data with the Help of Cross-Linguistic Data Formats
International audienceThe Database of Cross-Linguistic Colexifications (CLICS), has established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. In its current form, it has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations , ranging from studies on semantic change, patterns of conceptualization, and linguistic pale-ontology. But CLICS has also been criticized for obvious shortcomings, ranging from the underlying dataset, which still contains many errors, up to the limits of cross-linguistic colexification studies in general. Building on recent standardization efforts reflected in the Cross-Linguistic Data Formats initiative (CLDF) and novel approaches for fast, efficient, and reliable data aggregation, we have created a new database for cross-linguistic colexifications, which not only supersedes the original CLICS database in terms of coverage but also offers a much more principled procedure for the creation, curation and aggregation of datasets. The paper presents the new database and discusses its major features
- …