Quantifying the Loss of Acyclic Join Dependencies
Acyclic schemes possess known benefits for database design, speeding up
queries, and reducing space requirements. An acyclic join dependency (AJD) is
lossless with respect to a universal relation if joining the projections
associated with the schema results in the original universal relation. An
intuitive and standard measure of loss entailed by an AJD is the number of
redundant tuples generated by the acyclic join. Recent work has shown that the
loss of an AJD can also be characterized by an information-theoretic measure.
Motivated by the problem of automatically fitting an acyclic schema to a
universal relation, we investigate the connection between these two
characterizations of loss. We first show that the loss of an AJD is captured
using the notion of KL-Divergence. We then show that the KL-divergence can be
used to bound the number of redundant tuples. We prove a deterministic lower
bound on the percentage of redundant tuples. For an upper bound, we propose a
random database model, and establish a high probability bound on the percentage
of redundant tuples, which coincides with the lower bound for large databases.
Comment: To appear in PODS 202
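For a two-component AJD, both measures of loss can be made concrete. The sketch below (illustrative only, not the paper's construction; the function name and tuple encoding are hypothetical) counts the redundant tuples produced by the acyclic join and computes the KL-divergence between the empirical distribution of the relation and the conditional-independence distribution induced by the decomposition; for this two-way case the KL term equals the conditional mutual information between the two sides given their shared attributes:

```python
from collections import Counter
from math import log2

def ajd_loss(rel, left, right):
    """Loss of the two-component AJD join[left, right] on `rel`, a list of
    dicts (one dict per tuple).  Returns (redundant_tuples, kl_bits).

    redundant_tuples: |pi_left(rel) JOIN pi_right(rel)| - |rel|
    kl_bits: D(p || q), where p is uniform over rel and
             q(t) = p(t[left]) * p(t[right]) / p(t[shared]),
    i.e. the conditional mutual information I(left ; right | shared);
    it is 0 exactly when the AJD is lossless.
    """
    shared = [a for a in left if a in right]
    key = lambda t, attrs: tuple(t[a] for a in attrs)

    n = len(rel)
    nl = Counter(key(t, left) for t in rel)    # marginal counts on `left`
    nr = Counter(key(t, right) for t in rel)   # marginal counts on `right`
    ns = Counter(key(t, shared) for t in rel)  # marginal counts on `shared`

    # Size of the join of the projections: distinct left/right projections
    # pair up freely within each group of shared-attribute values.
    dl = Counter(key(dict(zip(left, x)), shared) for x in set(nl))
    dr = Counter(key(dict(zip(right, x)), shared) for x in set(nr))
    join_size = sum(dl[s] * dr[s] for s in dl)

    # p(t)/q(t) simplifies to ns / (nl * nr) in raw counts.
    kl = sum(log2(ns[key(t, shared)]
                  / (nl[key(t, left)] * nr[key(t, right)]))
             for t in rel) / n
    return join_size - n, kl
```

On a lossy instance such as `[{'A': 1, 'B': 0, 'C': 1}, {'A': 2, 'B': 0, 'C': 2}]` with components `('A', 'B')` and `('B', 'C')`, the join produces 2 redundant tuples and the divergence is 1 bit; on a lossless instance both quantities are 0.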
Tree Projections and Constraint Optimization Problems: Fixed-Parameter Tractability and Parallel Algorithms
Tree projections provide a unifying framework to deal with most structural
decomposition methods of constraint satisfaction problems (CSPs). Within this
framework, a CSP instance is decomposed into a number of sub-problems, called
views, whose solutions are either already available or can be computed
efficiently. The goal is to arrange portions of these views in a tree-like
structure, called tree projection, which determines an efficiently solvable CSP
instance equivalent to the original one. Deciding whether a tree projection
exists is NP-hard. Solution methods have therefore been proposed in the
literature that do not require a tree projection to be given, and that either
correctly decide whether the given CSP instance is satisfiable, or return that
a tree projection actually does not exist. These approaches had not so far been
generalized to CSP extensions for optimization problems, where the goal
is to compute a solution of maximum value/minimum cost. The paper fills the
gap, by exhibiting a fixed-parameter polynomial-time algorithm that either
disproves the existence of tree projections or computes an optimal solution,
with the parameter being the size of the expression of the objective function
to be optimized over all possible solutions (and not the size of the whole
constraint formula, used in related works). Tractability results are also
established for the problem of returning the best K solutions. Finally,
parallel algorithms for such optimization problems are proposed and analyzed.
Given that the classes of acyclic hypergraphs, hypergraphs of bounded
treewidth, and hypergraphs of bounded generalized hypertree width are all
covered as special cases of the tree projection framework, the results in this
paper directly apply to these classes. These classes are extensively considered
in the CSP setting, as well as in conjunctive database query evaluation and
optimization.
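Since acyclic hypergraphs are the simplest special case covered by the tree projection framework, it may help to recall how acyclicity itself is tested. Below is a minimal sketch of the classical GYO reduction (standard textbook material, not an algorithm from this paper; the function name is hypothetical):

```python
from collections import Counter

def is_alpha_acyclic(hyperedges):
    """GYO reduction: repeatedly (1) delete vertices occurring in only one
    edge and (2) delete edges that are empty or contained in another edge.
    The hypergraph is alpha-acyclic iff this process empties it."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed and edges:
        changed = False
        count = Counter(v for e in edges for v in e)
        for e in edges:                 # rule 1: strip vertices unique to e
            lonely = {v for v in e if count[v] == 1}
            if lonely:
                e -= lonely
                changed = True
        for i, e in enumerate(edges):   # rule 2: drop subsumed/empty edges
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    return not edges
```

A path of edges such as `{A,B}, {B,C}, {C,D}` reduces to nothing and is acyclic; a triangle `{A,B}, {B,C}, {A,C}` gets stuck with no lonely vertex and no containment, so it is not.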
Tree Projections and Structural Decomposition Methods: The Power of Local Consistency and Larger Islands of Tractability
Evaluating conjunctive queries and solving constraint satisfaction problems
are fundamental problems in database theory and artificial intelligence,
respectively. These problems are NP-hard, so considerable research effort has
been devoted in the literature to identifying tractable classes, known as
islands of tractability, and to devising clever heuristics for efficiently
solving real-world instances. Many heuristic approaches are based on
enforcing on the given instance a property called local consistency, where (in
database terms) each tuple in every query atom matches at least one tuple in
every other query atom. Interestingly, it turns out that, for many well-known
classes of queries, such as for the acyclic queries, enforcing local
consistency is even sufficient to solve the given instance correctly. However,
the precise power of such a procedure was unclear, except in some very
restricted cases. The paper provides full answers to the long-standing questions about the
precise power of algorithms based on enforcing local consistency. The classes
of instances where enforcing local consistency turns out to be a correct
query-answering procedure are, however, not efficiently recognizable. The paper
therefore also focuses on certain subclasses defined in terms of the novel
notion of greedy tree projections. These classes are shown to be
efficiently recognizable and strictly larger than most islands of tractability
known so far, both in the general case of tree projections and for specific
structural decomposition methods.
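The local-consistency condition described above (every tuple in every atom matches some tuple in every other atom) can be enforced by iterating semi-joins to a fixpoint. A minimal pairwise sketch, with a hypothetical encoding of the database as `name -> (attribute tuple, set of tuples)`:

```python
def enforce_local_consistency(db):
    """Repeatedly semi-join each relation against every other relation
    sharing attributes with it, deleting unmatched tuples, until nothing
    changes.  db: dict mapping name -> (attrs, set of value tuples)."""
    changed = True
    while changed:
        changed = False
        for n1, (a1, t1) in db.items():
            for n2, (a2, t2) in db.items():
                if n1 == n2:
                    continue
                shared = [a for a in a1 if a in a2]
                if not shared:
                    continue
                i1 = [a1.index(a) for a in shared]
                i2 = [a2.index(a) for a in shared]
                keys2 = {tuple(t[i] for i in i2) for t in t2}
                kept = {t for t in t1
                        if tuple(t[i] for i in i1) in keys2}
                if kept != t1:        # some tuple of n1 had no match in n2
                    db[n1] = (a1, kept)
                    changed = True
    return db
```

For acyclic instances this kind of filtering is, as the abstract notes, already enough to decide the instance: the query is unsatisfiable exactly when some relation becomes empty.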
Putting Humpty-Dumpty together again: Reconstructing functions from their projections.
We present a problem decomposition approach to reducing neural net training times. The basic idea is to train neural nets in parallel on marginal distributions obtained from the original distribution (via projection), and then to reconstruct the original table from the marginals (via a procedure similar to the join operator in database theory). A function is said to be reconstructible if it may be recovered without error from its projections. Most distributions are non-reconstructible. The main result of this paper is the Reconstruction theorem, which enables non-reconstructible functions to be expressed in terms of reconstructible ones, and thus facilitates the application of decomposition methods.
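The project-then-join round trip described here is easy to state concretely. A minimal sketch (all names hypothetical) that checks whether a table survives reconstruction from a given set of projections:

```python
from functools import reduce

def project(rel, schema, attrs):
    """Project a set of value tuples over `schema` onto `attrs`."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rel}

def natural_join(r1, a1, r2, a2):
    """Natural join of (r1 over a1) and (r2 over a2); nested-loop, for
    illustration only.  Returns (tuples, output attribute tuple)."""
    shared = [a for a in a1 if a in a2]
    i1 = [a1.index(a) for a in shared]
    i2 = [a2.index(a) for a in shared]
    out_attrs = tuple(a1) + tuple(a for a in a2 if a not in a1)
    out = set()
    for t1 in r1:
        k1 = tuple(t1[i] for i in i1)
        for t2 in r2:
            if tuple(t2[i] for i in i2) == k1:
                out.add(t1 + tuple(v for a, v in zip(a2, t2)
                                   if a not in a1))
    return out, out_attrs

def is_reconstructible(rel, schema, parts):
    """True iff joining the projections of `rel` onto `parts` gives back
    exactly `rel` (no spurious tuples)."""
    projs = [(project(rel, schema, p), tuple(p)) for p in parts]
    joined, attrs = reduce(lambda x, y: natural_join(*x, *y), projs)
    idx = [attrs.index(a) for a in schema]      # restore column order
    return {tuple(t[i] for i in idx) for t in joined} == set(rel)
```

A table satisfying B -> C, such as `{(1,10,100), (2,10,100), (3,20,200)}` over `(A,B,C)`, is reconstructible from `(A,B)` and `(B,C)`; drop the dependency and the join manufactures spurious tuples.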
ExplainIt! -- A declarative root-cause analysis engine for time series data (extended version)
We present ExplainIt!, a declarative, unsupervised root-cause analysis engine
that uses time series monitoring data from large complex systems such as data
centres. ExplainIt! empowers operators to succinctly specify a large number of
causal hypotheses to search for causes of interesting events. ExplainIt! then
ranks these hypotheses, reducing the number of causal dependencies from
hundreds of thousands to a handful for human understanding. We show how a
declarative language, such as SQL, can be effective in declaratively
enumerating hypotheses that probe the structure of an unknown probabilistic
graphical causal model of the underlying system. Our thesis is that databases
are in a unique position to enable users to rapidly explore the possible causal
mechanisms in data collected from diverse sources. We empirically demonstrate
how ExplainIt! has helped us resolve over 30 performance issues in a commercial
product since late 2014, of which we discuss a few cases in detail.
Comment: SIGMOD Industry Track 201
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as, and up to
three orders of magnitude faster than, its competitors, whenever they do not
run out of memory, exceed the 24-hour timeout, or hit internal design
limitations.
Comment: 61 pages, 9 figures, 2 table
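The structure-aware computation described above can be illustrated on a toy case: for least squares over the join of R(k, b) and S(k, c), the sums of squares and cross-products that the learner needs factor through per-key aggregates, so the join never has to be materialized. A minimal sketch (illustrative only, not the AC/DC implementation; names and encodings are hypothetical):

```python
from collections import defaultdict

def gram_over_join(R, S):
    """Aggregates for least squares over the join of R(k, b) and S(k, c),
    computed without materializing the join.  For each join key k, every
    b-value pairs with every c-value, so:
      sum b^2  = sum_k (sum b^2)_k * |S_k|
      sum b*c  = sum_k (sum b)_k  * (sum c)_k
      sum c^2  = sum_k |R_k|      * (sum c^2)_k
    R and S are lists of (key, value) pairs."""
    rb = defaultdict(lambda: [0, 0.0, 0.0])   # key -> [count, sum, sum sq]
    for k, b in R:
        agg = rb[k]
        agg[0] += 1; agg[1] += b; agg[2] += b * b
    sc = defaultdict(lambda: [0, 0.0, 0.0])
    for k, c in S:
        agg = sc[k]
        agg[0] += 1; agg[1] += c; agg[2] += c * c

    sbb = sbc = scc = 0.0
    for k in rb.keys() & sc.keys():           # keys that actually join
        (n_r, sum_b, sum_b2), (n_s, sum_c, sum_c2) = rb[k], sc[k]
        sbb += sum_b2 * n_s
        sbc += sum_b * sum_c
        scc += n_r * sum_c2
    return sbb, sbc, scc
```

Each input is scanned once and the work is linear in the base tables, whereas materializing the join first costs as much as the join's (possibly much larger) size; this is the basic payoff of pushing aggregates past the join.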