Schema Independent Relational Learning
Learning novel concepts and relations from relational databases is an
important problem with many applications in database systems and machine
learning. Relational learning algorithms learn the definition of a new relation
in terms of existing relations in the database. Nevertheless, the same data set
may be represented under different schemas for various reasons, such as
efficiency, data quality, and usability. Unfortunately, the output of current
relational learning algorithms tends to vary quite substantially over the
choice of schema, both in terms of learning accuracy and efficiency. This
variation complicates their off-the-shelf application. In this paper, we
introduce and formalize the property of schema independence of relational
learning algorithms, and study both the theoretical and empirical dependence of
existing algorithms on the common class of (de)composition schema
transformations. We study both sample-based learning algorithms, which learn
from sets of labeled examples, and query-based algorithms, which learn by
asking queries to an oracle. We prove that current relational learning
algorithms are generally not schema independent. For query-based learning
algorithms, we show that the (de)composition transformations influence their
query complexity. We propose Castor, a sample-based relational learning
algorithm that achieves schema independence by leveraging data dependencies. We
support the theoretical results with an empirical study that demonstrates the
schema dependence/independence of several algorithms on existing benchmark and
real-world datasets under (de)compositions.
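To make the notion of a (de)composition transformation concrete, here is a minimal sketch in Python. The relation `student(id, name, dept)`, its data, and the helper names `project` and `natural_join` are illustrative assumptions, not taken from the paper. The sketch assumes that `id` functionally determines the other attributes, which is what makes the decomposition lossless.

```python
def project(attrs, rows, keep):
    """Set-semantics projection of a relation onto the attributes in `keep`."""
    idx = [attrs.index(a) for a in keep]
    return tuple(keep), {tuple(r[i] for i in idx) for r in rows}


def natural_join(a_attrs, a_rows, b_attrs, b_rows):
    """Nested-loop natural join on the attributes the two relations share."""
    shared = [a for a in a_attrs if a in b_attrs]
    extra = [b for b in b_attrs if b not in a_attrs]
    out = set()
    for ra in a_rows:
        for rb in b_rows:
            if all(ra[a_attrs.index(s)] == rb[b_attrs.index(s)] for s in shared):
                out.add(tuple(ra) + tuple(rb[b_attrs.index(e)] for e in extra))
    return tuple(a_attrs) + tuple(extra), out


# Hypothetical composed schema: student(id, name, dept), where id -> name, dept.
student_attrs = ("id", "name", "dept")
student_rows = {(1, "Ann", "CS"), (2, "Bob", "Math")}

# Decomposition: split into student_name(id, name) and student_dept(id, dept).
name_attrs, name_rows = project(student_attrs, student_rows, ("id", "name"))
dept_attrs, dept_rows = project(student_attrs, student_rows, ("id", "dept"))

# Composition: joining the two pieces on id recovers the original relation.
back_attrs, back_rows = natural_join(name_attrs, name_rows, dept_attrs, dept_rows)
```

Both schemas hold the same information, so a schema-independent learner would be expected to learn equivalent definitions over either one.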
Covers of Query Results
We introduce succinct lossless representations of query results called covers. They are subsets of the query results that correspond to minimal edge covers in the hypergraphs of these results.
We first study covers whose structures are given by fractional hypertree decompositions of join queries.
For any decomposition of a query, we give asymptotically tight size bounds for the covers of the query result over that decomposition and show that such covers can be computed in worst-case optimal time up to a logarithmic factor in the database size. For acyclic join queries, we can compute covers compositionally using query plans with a new operator called cover-join. The tuples in the query result can be enumerated from any of its covers with linearithmic pre-computation time and constant delay.
We then generalize covers from joins to functional aggregate queries that express a host of computational problems such as aggregate-join queries, in-database optimization, matrix chain multiplication, and inference in probabilistic graphical models.
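The idea of a cover can be illustrated with a small sketch. It rests on a simplified reading of the abstract: for a join R(A,B) ⋈ S(B,C) with bags {A,B} and {B,C}, a cover is taken to be a minimal subset of the result whose projections onto each bag equal those of the full result. The data and the helper names `proj`, `covers_result`, `is_minimal`, and `rebuild` are assumptions made for illustration, and the sketch does not implement the paper's cover-join operator.

```python
def proj(rows, idx):
    """Project a set of tuples onto the positions listed in idx."""
    return {tuple(r[i] for i in idx) for r in rows}


def covers_result(candidate, result, bags):
    """True if candidate is a subset of result with the same projection on every bag."""
    return candidate <= result and all(
        proj(candidate, b) == proj(result, b) for b in bags
    )


def is_minimal(candidate, result, bags):
    """True if dropping any single tuple breaks the covering property."""
    return all(not covers_result(candidate - {t}, result, bags) for t in candidate)


def rebuild(candidate):
    """Recover the join result by joining the cover's two bag projections on B."""
    ab = proj(candidate, (0, 1))
    bc = proj(candidate, (1, 2))
    return {(a, b, c) for (a, b) in ab for (b2, c) in bc if b == b2}


# Hypothetical result of R(A,B) join S(B,C): two A-values times three C-values.
result = {(a, 10, c) for a in (1, 2) for c in ("x", "y", "z")}
bags = [(0, 1), (1, 2)]  # positions of {A,B} and {B,C}

# Three tuples suffice to mention every R-tuple and every S-tuple.
cover = {(1, 10, "x"), (2, 10, "y"), (1, 10, "z")}
```

Here the six-tuple result is represented by three tuples, and `rebuild(cover)` returns the full result, which is the sense in which the representation is lossless.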
Discovery of data dependencies in relational databases
Knowledge discovery in databases is more than the nontrivial extraction of implicit, previously unknown, and potentially useful information from databases. We argue that, in contrast to machine learning, knowledge discovery in databases should be applied to real-world databases. Because real-world databases are known to be very large, they raise problems of access: they can only be accessed through database management systems, and the number of accesses has to be kept to a minimum. Given this constraint, we are forced to use, for example, the standard set-oriented interfaces of relational database management systems in order to apply methods of knowledge discovery in databases. We present a system for discovering data dependencies, which is built upon a set-oriented interface. The main effort has gone into the discovery of value restrictions and of unary inclusion and functional dependencies in relational databases. The system also embodies an inference relation to minimize database access.
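As a rough illustration of checking dependencies through a set-oriented interface, the sketch below issues one SQL query per candidate dependency using Python's standard `sqlite3` module. The tables `emp` and `dept`, their contents, and the function names `fd_holds` and `inclusion_holds` are hypothetical, and this is not the system described in the abstract. NULL handling is ignored, and identifiers are interpolated directly into the SQL, so they must come from a trusted source.

```python
import sqlite3


def fd_holds(conn, table, lhs, rhs):
    """Unary functional dependency lhs -> rhs: no lhs value maps to two rhs values."""
    sql = (
        f"SELECT COUNT(*) FROM (SELECT {lhs} FROM {table} "
        f"GROUP BY {lhs} HAVING COUNT(DISTINCT {rhs}) > 1)"
    )
    return conn.execute(sql).fetchone()[0] == 0


def inclusion_holds(conn, t1, c1, t2, c2):
    """Unary inclusion dependency t1.c1 <= t2.c2: every c1 value occurs in c2."""
    sql = (
        f"SELECT COUNT(*) FROM (SELECT {c1} FROM {t1} "
        f"EXCEPT SELECT {c2} FROM {t2})"
    )
    return conn.execute(sql).fetchone()[0] == 0


# Hypothetical example data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE emp(id INTEGER, dept TEXT, city TEXT);
    CREATE TABLE dept(name TEXT);
    INSERT INTO emp VALUES (1,'db','Oslo'),(2,'db','Oslo'),(3,'ml','Rome'),(4,'ml','Rome');
    INSERT INTO dept VALUES ('db'),('ml'),('hr');
    """
)

dept_determines_city = fd_holds(conn, "emp", "dept", "city")
dept_determines_id = fd_holds(conn, "emp", "dept", "id")
emp_dept_in_dept = inclusion_holds(conn, "emp", "dept", "dept", "name")
dept_in_emp_dept = inclusion_holds(conn, "dept", "name", "emp", "dept")
```

Each check costs a single aggregate query, so the database does the set-level work and only one number crosses the interface.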
Target and (Astro-)WISE technologies - Data federations and its applications
After its first implementation in 2003 the Astro-WISE technology has been
rolled out in several European countries and is used for the production of the
KiDS survey data. In the multi-disciplinary Target initiative this technology,
nicknamed WISE technology, has been further applied to a large number of
projects. Here, we highlight the data handling of other astronomical
applications, such as VLT-MUSE and LOFAR, together with some non-astronomical
applications such as the medical projects Lifelines and GLIMPS, the MONK
handwritten text recognition system, and business applications, by amongst
others, the Target Holding. We describe some of the most important lessons
learned and describe the application of the data-centric WISE type of approach
to the Science Ground Segment of the Euclid satellite.
Comment: 9 pages, 5 figures, Proceedings IAU Symposium No 325 Astroinformatics 201
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster, whenever the
competitors do not run out of memory, exceed the 24-hour timeout, or encounter
internal design limitations.
Comment: 61 pages, 9 figures, 2 table
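One well-known way in which query structure can help learning over joins is to push aggregates past the join instead of materializing it. The sketch below shows this for a single entry of the kind of sum-of-products statistic that regression needs. It is a generic illustration under assumed relations R(k, x) and S(k, y) and hypothetical function names, and it is not a description of how AC/DC is implemented.

```python
from collections import defaultdict


def sum_xy_materialized(R, S):
    """Sum of x*y over the join of R(k, x) and S(k, y), enumerating every joined pair."""
    return sum(x * y for k1, x in R for k2, y in S if k1 == k2)


def sum_xy_factorized(R, S):
    """Same aggregate, computed from per-key partial sums without building the join."""
    sx = defaultdict(float)
    sy = defaultdict(float)
    for k, x in R:
        sx[k] += x
    for k, y in S:
        sy[k] += y
    # For each shared key, (sum of x) * (sum of y) equals the sum over all joined pairs.
    return sum(sx[k] * sy[k] for k in sx if k in sy)


# Hypothetical data: key 1 joins 2x1 tuples, key 2 joins 1x2 tuples.
R = [(1, 2.0), (1, 3.0), (2, 4.0)]
S = [(1, 10.0), (2, 1.0), (2, 2.0)]

slow = sum_xy_materialized(R, S)
fast = sum_xy_factorized(R, S)
```

The factorized version makes one pass over each input relation, whereas the materialized version does work proportional to the number of tuple pairs.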