Symbolic Inference Methods for Databases (Méthodes d'Inférence Symbolique pour les Bases de Données)
This dissertation summarizes a line of research, in which I was actively involved, on learning in databases from examples. This research focused on traditional as well as novel database models and languages for querying, transforming, and describing the schema of a database. In the case of schemas, our contributions involve proposing original languages for the emerging data models of Unordered XML and RDF. We have studied learning from examples of schemas for Unordered XML, schemas for RDF, twig queries for XML, join queries for relational databases, and XML transformations defined with a novel model of tree-to-word transducers.
Investigating the learnability of the proposed languages required us to examine closely a number of their fundamental properties, often of independent interest, including normal forms, minimization, containment and equivalence, consistency of a set of examples, and finite characterizability. A good understanding of these properties allowed us to devise learning algorithms that explore a possibly large search space with the help of a diligently designed set of generalization operations in search of an appropriate solution.
Learning (or inference) is a problem that has two parameters: the precise class of languages we wish to infer and the type of input that the user can provide. We focused on the setting where the user input consists of positive examples, i.e., elements that belong to the goal language, and negative examples, i.e., elements that do not belong to the goal language. In general, using both negative and positive examples allows learning richer classes of goal languages than using positive examples alone. However, using negative examples is often difficult because, together with positive examples, they may cause the search space to take a very complex shape, and its exploration may turn out to be computationally challenging.
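As an illustration of the positive/negative example setting described in this abstract, the following is a minimal sketch for one simple case: equality join predicates over two relations. It is not the dissertation's algorithm. The function names (`most_specific_join`, `is_consistent`) and the dictionary encoding of tuples are assumptions made for the example. The idea is to take the most specific predicate satisfied by every positive pair and then check that no negative pair satisfies it; since every predicate consistent with the positives is a subset of that most specific one, a negative pair accepted by it is accepted by all of them.

```python
from itertools import product

def agreeing_pairs(left, right):
    """Attribute pairs (a, b) on which a left tuple and a right tuple agree."""
    return {(a, b) for a, b in product(left, right) if left[a] == right[b]}

def most_specific_join(positives):
    """Intersect the agreeing attribute pairs over all positive examples."""
    if not positives:
        raise ValueError("at least one positive example is required")
    pairs = None
    for left, right in positives:
        current = agreeing_pairs(left, right)
        pairs = current if pairs is None else pairs & current
    return pairs

def satisfies(pair, predicate):
    """True if the tuple pair fulfils every equality in the predicate."""
    left, right = pair
    return all(left[a] == right[b] for a, b in predicate)

def is_consistent(positives, negatives):
    """Return the most specific predicate and whether it rejects all negatives."""
    predicate = most_specific_join(positives)
    return predicate, not any(satisfies(n, predicate) for n in negatives)

# Hypothetical example data: tuples are dicts from attribute name to value.
positives = [
    ({"id": 1, "name": "a"}, {"pid": 1, "tag": "x"}),
    ({"id": 2, "name": "b"}, {"pid": 2, "tag": "b"}),
]
negatives = [({"id": 1, "name": "a"}, {"pid": 2, "tag": "x"})]
predicate, ok = is_consistent(positives, negatives)
```

In this toy case the second positive pair also agrees on `name` and `tag`, but the intersection over both positives keeps only the equality between `id` and `pid`.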
From the web of data to a world of action
This is the author’s version of a work that was accepted for publication in Web Semantics: Science, Services and Agents on the World Wide Web. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Web Semantics: Science, Services and Agents on the World Wide Web 8.4 (2010), DOI: 10.1016/j.websem.2010.04.007.
This paper takes as its premise that the web is a place of action, not just information, and that the purpose of
global data is to serve human needs. The paper presents several component technologies, which together work
towards a vision where many small micro-applications can be threaded together using automated assistance to
enable a unified and rich interaction. These technologies include data detector technology to enable any text to
become a start point of semantic interaction; annotations for web-based services so that they can link data to
potential actions; spreading activation over personal ontologies, to allow modelling of context; algorithms for
automatically inferring 'typing' of web-form input data based on previous user inputs; and early work on inferring
task structures from action traces. Some of these have already been integrated within an experimental web-based
(extended) bookmarking tool, Snip!t, and a prototype desktop application On Time, and the paper discusses how the
components could be more fully, yet more openly, linked in terms of both architecture and interaction. As well as
contributing to the goal of an action and activity-focused web, the work also exposes a number of broader issues,
theoretical, practical, social and economic, for the Semantic Web.
Parts of this work were supported by the Information Society Technologies (IST) Program of the European Commission as part of the DELOS Network of Excellence on Digital Libraries (Contract G038-507618). Thanks also to Emanuele Tracanna, Marco Piva, and Raffaele Giuliano for their work on On Time.
Complexity and Expressiveness of ShEx for RDF
We study the expressiveness and complexity of Shape Expression Schema (ShEx), a novel schema formalism for RDF currently under development by W3C. ShEx assigns types to the nodes of an RDF graph and constrains the admissible neighborhoods of nodes of a given type with regular bag expressions (RBEs). We formalize and investigate two alternative semantics, multi-type and single-type, depending on whether or not a node may have more than one type. We study the expressive power of ShEx and the complexity of the validation problem. We show that the single-type semantics is strictly more expressive than the multi-type semantics, that single-type validation is generally intractable, and that multi-type validation is feasible for a small (yet practical) subclass of RBEs. To curb the high computational complexity of validation, we propose a natural notion of determinism and show that multi-type validation for the class of deterministic schemas using single-occurrence regular bag expressions (SORBEs) is tractable.
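To make the idea of constraining a node's neighborhood with a bag expression concrete, here is a minimal sketch. It does not implement the ShEx semantics studied in the paper. It assumes a simplified, hypothetical encoding in which a shape lists each outgoing predicate at most once together with an occurrence interval, which is a restricted form of single-occurrence bag expression without disjunction, and it ignores the types of the target nodes. The names `Interval` and `bag_matches` are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# (min, max) occurrences; max=None means unbounded. Assumed encoding.
Interval = Tuple[int, Optional[int]]

def bag_matches(labels: Iterable[str], shape: Dict[str, Interval]) -> bool:
    """Check an unordered bag of edge labels against per-label intervals."""
    counts = Counter(labels)
    # A label that the shape does not mention is not allowed at all.
    if any(label not in shape for label in counts):
        return False
    # Every label in the shape must occur a permitted number of times.
    for label, (low, high) in shape.items():
        n = counts[label]  # Counter returns 0 for absent labels
        if n < low or (high is not None and n > high):
            return False
    return True

# Hypothetical shape: exactly one "name" edge, any number of "email" edges.
person = {"name": (1, 1), "email": (0, None)}
```

Because the input is treated as a bag, the order in which the edges are listed does not matter: `bag_matches(["email", "name", "email"], person)` and `bag_matches(["name", "email", "email"], person)` give the same answer.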
Distributed XML Query Processing
While centralized query processing over collections of XML data stored at a single site is a well understood problem,
centralized query evaluation techniques are inherently limited in their scalability when presented
with large collections (or a single, large document) and heavy query workloads.
In the context of relational query processing,
similar scalability challenges have been overcome by partitioning data collections,
distributing them across the sites of a distributed system, and then
evaluating queries in a distributed fashion, usually in a way that ensures locality between
(sub-)queries and their relevant data.
This thesis presents a suite of query evaluation techniques for XML data that follow a similar
approach to address the scalability problems encountered by XML query evaluation.
Due to the significant differences in data and query models between relational and XML query
processing, it is not possible to directly apply distributed query evaluation techniques designed
for relational data to the XML scenario.
Instead, new distributed query evaluation
techniques need to be developed.
Thus, in this thesis, an end-to-end solution to the scalability problems encountered by XML query
processing is proposed.
Based on a data partitioning model that supports both horizontal and vertical
fragmentation steps (or any combination of the two), XML collections are fragmented and distributed
across the sites of a distributed system.
Then, a suite of distributed query evaluation strategies is
proposed. These query evaluation techniques ensure locality between each fragment of the collection and
the parts of the query corresponding to the data in this fragment. Special attention is paid to
scalability and query performance, which is achieved by ensuring a high degree of parallelism
during distributed query evaluation and by avoiding access to irrelevant portions of the data.
For maximum flexibility, the suite of distributed query evaluation techniques proposed in this thesis provides
several alternative approaches
for evaluating a given query over a given distributed collection. Thus, to achieve the best performance, it is
necessary to predict and compare the expected performance of each of these alternatives. In this
work, this is accomplished through a query optimization technique based on a
distribution-aware cost model. The same cost model is also used to fine-tune the way a collection is
fragmented to the demands of the query workload evaluated over this collection.
To evaluate the performance impact of the distributed query evaluation techniques proposed in this
thesis, the techniques were implemented within
a production-quality XML database system. Based on this implementation, a
thorough experimental evaluation was performed. The results of this evaluation confirm that the distributed query evaluation
techniques introduced here lead to significant improvements in query performance and scalability
both when compared to centralized techniques and when compared to existing distributed query
evaluation techniques.
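The general pattern of evaluating a query locally on each fragment and merging the partial results can be sketched as follows. This is only an illustration under strong assumptions, not one of the thesis's techniques: it covers horizontal fragmentation only, uses threads in one process as stand-ins for the sites of a distributed system, restricts queries to the limited path syntax of Python's `xml.etree.ElementTree`, and has no cost model. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

def evaluate_on_fragment(fragment, path):
    """Run the path query locally on one fragment (a list of XML documents)."""
    results = []
    for doc in fragment:
        root = ET.fromstring(doc)
        results.extend(node.text for node in root.findall(path))
    return results

def evaluate_distributed(fragments, path):
    """Evaluate the query on all fragments in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(fragments))) as pool:
        # map() yields partial results in fragment order.
        partials = pool.map(lambda f: evaluate_on_fragment(f, path), fragments)
        return [item for part in partials for item in part]

# Hypothetical horizontally fragmented collection: two "sites".
fragments = [
    ["<book><title>A</title></book>", "<book><title>B</title></book>"],
    ["<book><title>C</title></book>"],
]
titles = evaluate_distributed(fragments, "title")
```

Each worker touches only the documents of its own fragment, which mirrors the locality between sub-queries and their data that the abstract describes.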