824 research outputs found

    Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

    Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground for theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory (such as schema information, query structure, functional dependencies, and recent advances in query evaluation algorithms) and from linear algebra (such as tensor and matrix operations), one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed the 24-hour timeout, or encounter internal design limitations. Comment: 61 pages, 9 figures, 2 tables

    Inductive Logic Programming in Databases: from Datalog to DL+log

    In this paper we address an issue that has been brought to the attention of the database community with the advent of the Semantic Web, i.e. the issue of how ontologies (and the semantics conveyed by them) can help solve typical database problems, through a better understanding of KR aspects related to databases. In particular, we investigate this issue from the ILP perspective by considering two database problems, (i) the definition of views and (ii) the definition of constraints, for a database whose schema is also represented by means of an ontology. Both can be reformulated as ILP problems and can benefit from the expressive and deductive power of the KR framework DL+log. We illustrate the application scenarios by means of examples. Keywords: Inductive Logic Programming, Relational Databases, Ontologies, Description Logics, Hybrid Knowledge Representation and Reasoning Systems. Note: To appear in Theory and Practice of Logic Programming (TPLP). Comment: 30 pages, 3 figures, 2 tables

    Extended Dualization: Application to Maximal Pattern Mining

    The dualization in arbitrary posets is a well-studied problem in combinatorial enumeration and is a crucial step in many applications in logics, databases, artificial intelligence and pattern mining. The objective of this paper is to study reductions of the dualization problem on arbitrary posets to the dualization problem on boolean lattices, for which output quasi-polynomial time algorithms exist. Quasi-polynomial time algorithms are algorithms which run in n^{o(log n)}, where n is the size of the input and output. We introduce convex embedding and poset reflection as key notions to characterize such reductions. As a consequence, we identify posets, which are not boolean lattices, for which the dualization problem remains in quasi-polynomial time, and propose a classification of posets with respect to dualization. From these results, we study how they can be applied to maximal pattern mining problems. We deduce a new classification of pattern mining problems and we point out how known problems involving sequences and conjunctive query patterns fit into this classification. Finally, we explain how to adapt the seminal Dualize & Advance algorithm to deal with such patterns. As far as we know, this is the first contribution to give explicit non-trivial reductions for studying the hardness of maximal pattern mining problems and to extend the Dualize & Advance algorithm to complex patterns.

    Compositional Mining of Multi-Relational Biological Datasets

    High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both of these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.

    Proceedings of the first international VLDB workshop on Management of Uncertain Data
