2,697 research outputs found
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled)(directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give an historical overview of its main development, and study the main current
systems that implement them
MVG Mechanism: Differential Privacy under Matrix-Valued Query
Differential privacy mechanism design has traditionally been tailored for a
scalar-valued query function. Although many mechanisms such as the Laplace and
Gaussian mechanisms can be extended to a matrix-valued query function by adding
i.i.d. noise to each element of the matrix, this method is often suboptimal as
it forfeits an opportunity to exploit the structural characteristics typically
associated with matrix analysis. To address this challenge, we propose a novel
differential privacy mechanism called the Matrix-Variate Gaussian (MVG)
mechanism, which adds a matrix-valued noise drawn from a matrix-variate
Gaussian distribution, and we rigorously prove that the MVG mechanism preserves
-differential privacy. Furthermore, we introduce the concept
of directional noise made possible by the design of the MVG mechanism.
Directional noise allows the impact of the noise on the utility of the
matrix-valued query function to be moderated. Finally, we experimentally
demonstrate the performance of our mechanism using three matrix-valued queries
on three privacy-sensitive datasets. We find that the MVG mechanism notably
outperforms four previous state-of-the-art approaches, and provides comparable
utility to the non-private baseline.Comment: Appeared in CCS'1
Experiences with Some Benchmarks for Deductive Databases and Implementations of Bottom-Up Evaluation
OpenRuleBench is a large benchmark suite for rule engines, which includes
deductive databases. We previously proposed a translation of Datalog to C++
based on a method that "pushes" derived tuples immediately to places where they
are used. In this paper, we report performance results of various
implementation variants of this method compared to XSB, YAP and DLV. We study
only a fraction of the OpenRuleBench problems, but we give a quite detailed
analysis of each such task and the factors which influence performance. The
results not only show the potential of our method and implementation approach,
but could be valuable for anybody implementing systems which should be able to
execute tasks of the discussed types.Comment: In Proceedings WLP'15/'16/WFLP'16, arXiv:1701.0014
Managing data through the lens of an ontology
Ontology-based data management aims at managing data through the lens of an ontology, that is, a conceptual representation of the domain of interest in the underlying information system. This new paradigm provides several interesting features, many of which have already been proved effective in managing complex information systems. This article introduces the notion of ontology-based data management, illustrating the main ideas underlying the paradigm, and pointing out the importance of knowledge representation and automated reasoning for addressing the technical challenges it introduces
Tight Lower Bounds for Differentially Private Selection
A pervasive task in the differential privacy literature is to select the
items of "highest quality" out of a set of items, where the quality of each
item depends on a sensitive dataset that must be protected. Variants of this
task arise naturally in fundamental problems like feature selection and
hypothesis testing, and also as subroutines for many sophisticated
differentially private algorithms.
The standard approaches to these tasks---repeated use of the exponential
mechanism or the sparse vector technique---approximately solve this problem
given a dataset of samples. We provide a tight lower
bound for some very simple variants of the private selection problem. Our lower
bound shows that a sample of size is required
even to achieve a very minimal accuracy guarantee.
Our results are based on an extension of the fingerprinting method to sparse
selection problems. Previously, the fingerprinting method has been used to
provide tight lower bounds for answering an entire set of queries, but
often only some much smaller set of queries are relevant. Our extension
allows us to prove lower bounds that depend on both the number of relevant
queries and the total number of queries
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster than its competitors
whenever they do not run out of memory, exceed 24-hour timeout, or encounter
internal design limitations.Comment: 61 pages, 9 figures, 2 table
Answering Conjunctive Queries under Updates
We consider the task of enumerating and counting answers to -ary
conjunctive queries against relational databases that may be updated by
inserting or deleting tuples. We exhibit a new notion of q-hierarchical
conjunctive queries and show that these can be maintained efficiently in the
following sense. During a linear time preprocessing phase, we can build a data
structure that enables constant delay enumeration of the query results; and
when the database is updated, we can update the data structure and restart the
enumeration phase within constant time. For the special case of self-join free
conjunctive queries we obtain a dichotomy: if a query is not q-hierarchical,
then query enumeration with sublinear delay and sublinear update time
(and arbitrary preprocessing time) is impossible.
For answering Boolean conjunctive queries and for the more general problem of
counting the number of solutions of k-ary queries we obtain complete
dichotomies: if the query's homomorphic core is q-hierarchical, then size of
the the query result can be computed in linear time and maintained with
constant update time. Otherwise, the size of the query result cannot be
maintained with sublinear update time. All our lower bounds rely on the
OMv-conjecture, a conjecture on the hardness of online matrix-vector
multiplication that has recently emerged in the field of fine-grained
complexity to characterise the hardness of dynamic problems. The lower bound
for the counting problem additionally relies on the orthogonal vectors
conjecture, which in turn is implied by the strong exponential time hypothesis.
By sublinear we mean for some
, where is the size of the active domain of the current
database
- …