Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster than its competitors
whenever they do not run out of memory, exceed a 24-hour timeout, or encounter
internal design limitations. Comment: 61 pages, 9 figures, 2 tables
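As a rough illustration of the structure-aware idea in this abstract, the sketch below trains a two-feature ridge regression from aggregates computed over a join without building the joined rows. It is not AC/DC's algorithm. The relations `sales` and `prices`, the features promo and price, and the function names are all hypothetical, and the objective is assumed to be the unscaled form ||Xθ - y||² + λ||θ||².

```python
from collections import defaultdict

def aggregates_pushed_down(sales, prices):
    """Aggregates for ridge regression over Sales JOIN Items, without joined rows.

    sales:  list of (item, promo, units) tuples; units is the label y.
    prices: dict item -> price (hypothetical Items relation).
    Features are p = promo and q = price.
    Returns (sum p*p, sum p*q, sum q*q, sum p*y, sum q*y) over the inner join.
    """
    # Per-item partial aggregates over Sales only: n, sum p, sum p^2, sum y, sum p*y.
    per_item = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])
    for item, p, y in sales:
        acc = per_item[item]
        acc[0] += 1
        acc[1] += p
        acc[2] += p * p
        acc[3] += y
        acc[4] += p * y

    s_pp = s_pq = s_qq = s_py = s_qy = 0.0
    for item, (n, sp, spp, sy, spy) in per_item.items():
        if item not in prices:      # inner join: unmatched sales are dropped
            continue
        q = prices[item]            # price is constant per item, so it factors out
        s_pp += spp
        s_pq += q * sp
        s_qq += n * q * q
        s_py += spy
        s_qy += q * sy
    return (s_pp, s_pq, s_qq, s_py, s_qy)

def aggregates_materialized(sales, prices):
    """Same aggregates, computed the naive way over every joined row."""
    s_pp = s_pq = s_qq = s_py = s_qy = 0.0
    for item, p, y in sales:
        if item in prices:
            q = prices[item]
            s_pp += p * p
            s_pq += p * q
            s_qq += q * q
            s_py += p * y
            s_qy += q * y
    return (s_pp, s_pq, s_qq, s_py, s_qy)

def ridge_2d(aggs, lam):
    """Solve (X^T X + lam*I) theta = X^T y for two features via Cramer's rule."""
    s_pp, s_pq, s_qq, s_py, s_qy = aggs
    a, b, d = s_pp + lam, s_pq, s_qq + lam
    det = a * d - b * b             # positive whenever lam > 0
    return ((d * s_py - b * s_qy) / det, (a * s_qy - b * s_py) / det)

sales = [("a", 1, 10), ("a", 0, 7), ("b", 1, 4), ("c", 1, 9)]
prices = {"a": 2.0, "b": 5.0}
theta = ridge_2d(aggregates_pushed_down(sales, prices), lam=1.0)
```

The point of the sketch is only that the model parameters depend on the data through a handful of sums, and that those sums can be pushed past the join because price depends functionally on item.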
Inductive Logic Programming in Databases: from Datalog to DL+log
In this paper we address an issue that has been brought to the attention of
the database community with the advent of the Semantic Web, i.e. the issue of
how ontologies (and semantics conveyed by them) can help solving typical
database problems, through a better understanding of KR aspects related to
databases. In particular, we investigate this issue from the ILP perspective by
considering two database problems, (i) the definition of views and (ii) the
definition of constraints, for a database whose schema is represented also by
means of an ontology. Both can be reformulated as ILP problems and can benefit
from the expressive and deductive power of the KR framework DL+log. We
illustrate the application scenarios by means of examples. Keywords: Inductive
Logic Programming, Relational Databases, Ontologies, Description Logics, Hybrid
Knowledge Representation and Reasoning Systems. Note: To appear in Theory and
Practice of Logic Programming (TPLP). Comment: 30 pages, 3 figures, 2 tables
Extended Dualization: Application to Maximal Pattern Mining
The dualization in arbitrary posets is a well-studied problem in combinatorial enumeration and is a crucial step in many applications in logics, databases, artificial intelligence and pattern mining. The objective of this paper is to study reductions of the dualization problem on arbitrary posets to the dualization problem on boolean lattices, for which output quasi-polynomial time algorithms exist. Quasi-polynomial time algorithms are algorithms which run in time $n^{o(\log n)}$, where $n$ is the size of the input and output. We introduce convex embedding and poset reflection as key notions to characterize such reductions. As a consequence, we identify posets, which are not boolean lattices, for which the dualization problem remains in quasi-polynomial time and propose a classification of posets with respect to dualization. From these results, we study how they can be applied to maximal pattern mining problems. We deduce a new classification of pattern mining problems and we point out how known problems involving sequences and conjunctive query patterns fit into this classification. Finally, we explain how to adapt the seminal Dualize & Advance algorithm to deal with such patterns. As far as we know, this is the first contribution to give explicit non-trivial reductions for studying the hardness of maximal pattern mining problems and to extend the Dualize & Advance algorithm to complex patterns.
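To make the boolean-lattice case concrete: dualizing a family of maximal sets amounts to finding the minimal sets contained in none of them, which are exactly the minimal transversals (hitting sets) of the complements. The sketch below does this by brute force over all subsets, so it is exponential. It is neither a quasi-polynomial algorithm nor Dualize & Advance, and the function names are hypothetical.

```python
from itertools import combinations

def minimal_transversals(universe, edges):
    """Return all minimal sets that intersect every edge (brute force)."""
    edges = [frozenset(e) for e in edges]
    items = sorted(universe)
    found = []
    # Enumerate candidates by increasing size, so any transversal that has
    # no smaller transversal inside it is guaranteed to be minimal.
    for k in range(len(items) + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            if any(t <= s for t in found):
                continue            # contains a smaller transversal: not minimal
            if all(s & e for e in edges):
                found.append(s)
    return found

def dualize_boolean(universe, maximal_sets):
    """Minimal subsets of `universe` that are contained in no maximal set.

    A set X lies in no maximal set M exactly when X intersects every
    complement universe - M, hence the reduction to minimal transversals.
    """
    universe = frozenset(universe)
    complements = [universe - frozenset(m) for m in maximal_sets]
    return minimal_transversals(universe, complements)

# Example: maximal frequent itemsets {a,b}, {b,c}, {d} over items a..d.
border = dualize_boolean("abcd", [{"a", "b"}, {"b", "c"}, {"d"}])
```

In itemset-mining terms, `border` holds the minimal infrequent itemsets for the given maximal frequent ones.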
Compositional Mining of Multi-Relational Biological Datasets
High-throughput biological screens are yielding ever-growing streams of
information about multiple aspects of cellular activity. As more and more
categories of datasets come online, there is a corresponding multitude of ways
in which inferences can be chained across them, motivating the need for
compositional data mining algorithms. In this paper, we argue that such
compositional data mining can be effectively realized by functionally cascading
redescription mining and biclustering algorithms as primitives. Both these
primitives mirror shifts of vocabulary that can be composed in arbitrary ways
to create rich chains of inferences. Given a relational database and its
schema, we show how the schema can be automatically compiled into a
compositional data mining program, and how different domains in the schema can
be related through logical sequences of biclustering and redescription
invocations. This feature allows us to rapidly prototype new data mining
applications, yielding greater understanding of scientific datasets. We
describe two applications of compositional data mining: (i) matching terms
across categories of the Gene Ontology and (ii) understanding the molecular
mechanisms underlying stress response in human cells.