A Paradigm for Learning Queries on Big Data
Specifying a database query in a formal query language is typically a challenging task for non-expert users. In the context of big data, the problem becomes even harder, as it requires users to deal with database instances that are too large to visualize easily. Such instances usually lack a schema to help users specify their queries, or have an incomplete schema because they come from disparate data sources. In this paper, we propose a novel paradigm for interactive learning of queries on big data that assumes no knowledge of the database schema. The paradigm can be applied to different database models and to a class of queries adequate to each model. In particular, we present two instantiations that validate the proposed paradigm: learning relational join queries and learning path queries on graph databases. Finally, we discuss the challenges of employing the paradigm for further data models and for learning cross-model schema mappings.
Reverse engineering queries in ontology-enriched systems: the case of expressive horn description logic ontologies
We introduce the query-by-example (QBE) paradigm for query answering in the presence of ontologies. Intuitively, QBE permits non-expert users to explore the data by providing examples of the information they (do not) want, which the system then generalizes into a query. Formally, we study the following question: given a knowledge base and sets of positive and negative examples, is there a query that returns all positive but none of the negative examples? We focus on description logic knowledge bases with ontologies formulated in Horn-ALCI and (unions of) conjunctive queries. Our main contributions are characterizations, algorithms, and tight complexity bounds for QBE.
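The decision problem above, restricted to a toy setting (single-table data, queries limited to conjunctions of attribute = value equalities, no ontology reasoning), can be sketched as follows; the function name and the least-general-generalization strategy are illustrative assumptions, not the paper's algorithm:

```python
def qbe_conjunction(rows, positives, negatives):
    """Return a dict of attribute -> value equalities that every positive
    example row satisfies and no negative example row satisfies, or None
    if no such conjunctive query exists in this restricted class."""
    # Least general generalization: keep only the attribute/value pairs
    # shared by every positive example.
    query = dict(rows[positives[0]])
    for i in positives[1:]:
        query = {a: v for a, v in query.items() if rows[i].get(a) == v}
    # The candidate query must exclude every negative example.
    for i in negatives:
        if all(rows[i].get(a) == v for a, v in query.items()):
            return None
    return query


rows = [{"genre": "jazz", "era": "50s"},
        {"genre": "jazz", "era": "60s"},
        {"genre": "rock", "era": "60s"}]
print(qbe_conjunction(rows, [0, 1], [2]))  # {'genre': 'jazz'}
```

If the positives share no attribute values, the generalization collapses to the empty (always-true) query, which every negative then satisfies, so the function correctly reports that no separating query exists.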
Regular Path Query Evaluation on Streaming Graphs
We study persistent query evaluation over streaming graphs, which is becoming
increasingly important. We focus on navigational queries that determine if
there exists a path between two entities that satisfies a user-specified
constraint. We adopt the Regular Path Query (RPQ) model that specifies
navigational patterns with labeled constraints. We propose deterministic
algorithms to efficiently evaluate persistent RPQs under both arbitrary and
simple path semantics in a uniform manner. Experimental analysis on real and
synthetic streaming graphs shows that the proposed algorithms can process up to
tens of thousands of edges per second and efficiently answer RPQs that are
commonly used in real-world workloads. A shorter version of this paper has been
accepted for publication in the 2020 International Conference on Management of
Data (SIGMOD 2020).
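A non-streaming baseline for RPQ evaluation is the classic product construction: breadth-first search over pairs of (graph node, automaton state). The sketch below is a minimal illustration of that textbook idea, not the paper's persistent streaming algorithms:

```python
from collections import deque

def rpq_reachable(edges, automaton, start_node, start_state, final_states):
    """edges: iterable of (u, label, v); automaton: dict (state, label) -> state
    for a DFA of the regular expression. Returns the set of nodes reachable
    from start_node along some path whose label sequence the DFA accepts."""
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
    seen = {(start_node, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in final_states:
            answers.add(node)
        # Move simultaneously along a graph edge and the matching DFA transition.
        for label, nxt in adj.get(node, []):
            nstate = automaton.get((state, label))
            if nstate is not None and (nxt, nstate) not in seen:
                seen.add((nxt, nstate))
                queue.append((nxt, nstate))
    return answers


# RPQ "a b*": state 0 --a--> 1, state 1 --b--> 1, final state 1.
edges = [("x", "a", "y"), ("y", "b", "z"), ("z", "b", "w")]
print(rpq_reachable(edges, {(0, "a"): 1, (1, "b"): 1}, "x", 0, {1}))
```

This arbitrary-path semantics visits each (node, state) pair once; simple-path semantics, which the paper also supports, is substantially harder because paths may not revisit nodes.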
Using Knowledge Anchors to Facilitate User Exploration of Data Graphs
This paper investigates how to facilitate users' exploration through data graphs for knowledge expansion. Our work focuses on knowledge utility: increasing users' domain knowledge while exploring a data graph. We introduce a novel exploration support mechanism underpinned by the subsumption theory of meaningful learning, which postulates that new knowledge is grasped by starting from familiar concepts in the graph, which serve as knowledge anchors from which links to new knowledge are made. A core algorithmic component for operationalising the subsumption theory of meaningful learning to generate exploration paths for knowledge expansion is the automatic identification of knowledge anchors in a data graph (KADG). We present several metrics for identifying KADG, which are evaluated against familiar concepts in human cognitive structures. A subsumption algorithm that utilises KADG for generating exploration paths for knowledge expansion is presented and applied in the context of a semantic data browser in a music domain. The resultant exploration paths are evaluated in a task-driven experimental user study against free data graph exploration. The findings show that exploration paths based on subsumption and using knowledge anchors lead to a significantly higher increase in users' conceptual knowledge and better usability than free exploration of data graphs. The work opens a new avenue in semantic data exploration that investigates the link between learning and knowledge exploration. This extends the value of exploration and enables broader applications of data graphs in systems where the end users are not experts in the specific domain.
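The paper's metrics for identifying knowledge anchors are not reproduced here; as a rough, hypothetical stand-in, the sketch below assumes a set of anchors is already given and builds an exploration path by walking up subsumption (child-to-parent) links from a target concept to a familiar anchor, then presenting the path anchor-first, as meaningful learning prescribes:

```python
def exploration_path(parent_of, target, anchors):
    """parent_of: dict mapping each concept to its subsuming (more general)
    concept. Walk up the subsumption hierarchy from target until a familiar
    anchor is reached, then return the path anchor-first: the learner starts
    from the familiar concept and descends to the new one."""
    path = [target]
    node = target
    while node not in anchors:
        node = parent_of.get(node)
        if node is None:
            return None  # no familiar anchor subsumes the target
        path.append(node)
    return list(reversed(path))


parents = {"cello": "string instrument", "string instrument": "instrument"}
print(exploration_path(parents, "cello", {"instrument"}))
# ['instrument', 'string instrument', 'cello']
```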
Network Structures, Concurrency, and Interpretability: Lessons from the Development of an AI Enabled Graph Database System
This thesis describes the development of the SmartGraph, an AI enabled graph database. The need for such a system has been independently recognized in the isolated fields of graph databases, graph computing, and computational graph deep learning systems, such as TensorFlow. Though prior works have investigated some relationships between these fields, we believe that the SmartGraph is the first system designed from conception to incorporate the most significant and useful characteristics of each. Examples include the ability to store graph structured data, run analytics natively on this data, and run gradient descent algorithms. It is the synergistic aspects of combining these fields that provide the most novel results presented in this dissertation. Key among them is how the notion of "graph querying" as used in graph databases can be used to solve a problem that has plagued deep learning systems since their inception; rather than attempting to embed graph structured datasets into restrictive vector spaces, we instead allow the deep learning functionality of the system to natively perform graph querying in memory during optimization as a way of interpreting (and learning) the graph. This results in a concept of natural and interpretable processing of graph structured data.
Graph computing systems have traditionally used distributed computing across multiple compute nodes (e.g. separate machines connected via Ethernet or internet) to deal with large-scale datasets whilst working sequentially on problems over entire datasets. In this dissertation, we outline a distributed graph computing methodology that facilitates all the above capabilities (even in an environment consisting of a single physical machine) while allowing for a workflow more typical of a graph database than a graph computing system; massive concurrent access allowing for arbitrarily asynchronous execution of queries and analytics across the entire system. Further, we demonstrate how this methodology is key to the artificial intelligence capabilities of the system
Learning Join Queries from User Examples
We investigate the problem of learning join queries from user examples. The user is presented with a set of candidate tuples and is asked to label them as positive or negative examples, depending on whether or not she would like the tuples as part of the join result. The goal is to quickly infer an arbitrary n-ary join predicate across an arbitrary number m of relations while keeping the number of user interactions as small as possible. We assume no prior knowledge of the integrity constraints across the involved relations. Inferring the join predicate across multiple relations when the referential constraints are unknown arises in several applications, such as data integration, reverse engineering of database queries, and schema inference. In such scenarios, the number of tuples involved in the join is typically large. We introduce a set of strategies that let us inspect the search space and aggressively prune what we call uninformative tuples, and we directly present to the user the informative ones, that is, those that allow the user to quickly find the goal query she has in mind. In this article, we focus on the inference of joins with equality predicates and also allow disjunctive join predicates and projection in the queries. We precisely characterize the frontier between tractability and intractability for the following problems of interest in these settings: consistency checking, learnability, and deciding the informativeness of a tuple. Next, we propose several strategies for presenting tuples to the user in a given order that allows minimization of the number of interactions. We show the efficiency of our approach through an experimental study on both benchmark and synthetic datasets.
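A toy version of the learning setting above, restricted to single-equality predicates R.a = S.b over two relations (the article handles arbitrary n-ary joins, disjunction, and projection), might look like this; all names are illustrative assumptions:

```python
from itertools import product

def learn_join_predicate(r_attrs, s_attrs, labeled_pairs):
    """labeled_pairs: list of (r_tuple, s_tuple, is_positive), where the
    tuples are dicts. Returns every candidate predicate R.a = S.b that is
    consistent with all user labels: positives must satisfy it, negatives
    must not."""
    candidates = set(product(r_attrs, s_attrs))
    for r, s, is_positive in labeled_pairs:
        candidates = {(a, b) for a, b in candidates
                      if (r[a] == s[b]) == is_positive}
    return candidates


examples = [
    ({"id": 1, "dept": "A"}, {"emp": 1, "dept": "B"}, True),   # belongs in join
    ({"id": 2, "dept": "A"}, {"emp": 3, "dept": "A"}, False),  # does not
]
print(learn_join_predicate(["id", "dept"], ["emp", "dept"], examples))
```

In this framing, an "informative" tuple is one whose label, whichever way the user answers, would shrink the candidate set; the article's pruning strategies aim to show only such tuples.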
Generalizing spreadsheet computation for evolving spreadsheets at scale
Spreadsheets are one of the most ubiquitous ad-hoc data analysis and manipulation tools. Their strength over traditional relational database management systems lies in their ability to let users manipulate data interactively through an intuitive interface. However, the capabilities of current spreadsheet systems to handle datasets that evolve over time are limited in several dimensions: (a) limited power: it is difficult to perform relational-style queries, often needed for large-scale data analysis, while keeping the convenience of formula-like automatic recalculation; (b) limited introspection: the ability to reason at a higher level about the source of changes between versions is often unsupported; (c) limited interactivity: computation in spreadsheets at scale can make the system unresponsive, rendering the strength of spreadsheets moot; and (d) limited structure utilization: computation in spreadsheets often fails to exploit the semi-structured nature of real-world spreadsheets.
The dissertation discusses developments that overcome these hurdles. First, we discuss an extension to spreadsheet formulae that allows relational-style queries in a manner consistent with typical formula computation engines. Second, we develop the theory of "diffing", which represents data updates in a concise manner. Third, we introduce Asynchronous Formula Computation, a technique that improves spreadsheet interactivity during formula computation while guaranteeing consistency of the results. Finally, we improve formula computation by utilizing the structure of real-world spreadsheets and building a more concise representation.
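As a loose illustration of the "diffing" idea (the dissertation's actual representation is not reproduced here), a cell-level diff between two versions of a sheet can be stored concisely and replayed later:

```python
def sheet_diff(old, new):
    """old, new: dicts mapping (row, col) -> value. Returns only the cells
    that changed, as (row, col) -> (old_value, new_value); None marks a cell
    absent in that version."""
    changed = {}
    for cell in old.keys() | new.keys():
        before, after = old.get(cell), new.get(cell)
        if before != after:
            changed[cell] = (before, after)
    return changed

def apply_diff(sheet, diff):
    """Replay a diff onto the older version, reproducing the newer one."""
    result = dict(sheet)
    for cell, (_, after) in diff.items():
        if after is None:
            result.pop(cell, None)  # cell was deleted in the new version
        else:
            result[cell] = after
    return result


v1 = {(0, 0): 1, (0, 1): 2}
v2 = {(0, 0): 1, (0, 1): 3, (1, 0): 4}
print(sheet_diff(v1, v2))  # only the two changed cells are recorded
```

Storing just the changed cells, rather than both full versions, is what makes version histories of large evolving sheets tractable in this sketch.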
Intelligent Support for Exploration of Data Graphs
This research investigates how to support a user's exploration through data graphs generated from semantic databases in a way that expands the user's domain knowledge. To be effective, approaches that facilitate exploration of data graphs should take into account the utility from a user's point of view. Our work focuses on knowledge utility: how useful exploration paths through a data graph are for expanding the user's knowledge. The main goal of this research is to design an intelligent support mechanism to direct the user to "good" exploration paths through big data graphs for knowledge expansion. We propose a new exploration support mechanism underpinned by the subsumption theory for meaningful learning, which postulates that new knowledge is grasped by starting from familiar concepts in the graph, which serve as knowledge anchors from which links to new knowledge are made. A core algorithmic component for adapting the subsumption theory to generate exploration paths is the automatic identification of Knowledge Anchors in a Data Graph (KADG). Several metrics for identifying KADG, and the corresponding algorithms for implementation, have been developed and evaluated against human cognitive structures. A subsumption algorithm which utilises KADG for generating exploration paths for knowledge expansion is presented and evaluated in the context of a semantic data browser in a musical instrument domain. The resultant exploration paths are evaluated in a controlled user study to examine whether they increase the users' knowledge compared to free exploration. The findings show that exploration paths using knowledge anchors and subsumption lead to a significantly higher increase in users' conceptual knowledge. The approach can be adopted in applications providing data graph exploration to facilitate learning and sensemaking for lay users who are not fully familiar with the domain presented in the data graph.