
    Machine Learning over Static and Dynamic Relational Data

    This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on exploiting the relational structure of the data to improve the runtime performance of the learning task. The tutorial has the following parts:
    1) Database research for data science
    2) Three main ideas to achieve performance improvements
       2.1) Turn the ML problem into a DB problem (see the sketch after this list)
       2.2) Exploit the structure of the data and problem
       2.3) Exploit the engineering tools of a DB researcher
    3) Avenues for future research
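    As a concrete, illustrative instance of idea 2.1 (not an example from the tutorial itself): the data-dependent part of least-squares learning can be phrased as a single SQL aggregate query, so the database, rather than the ML tool, does the heavy lifting. The schema, table names, and values below are made up for the sketch.

```python
# Minimal sketch of "turn the ML problem into a DB problem":
# push the data-dependent work of least-squares regression into the
# database as one aggregate query over the join. All names illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE R (k INTEGER, x1 REAL);
    CREATE TABLE S (k INTEGER, x2 REAL, y REAL);
    INSERT INTO R VALUES (1, 2.0), (2, 3.0);
    INSERT INTO S VALUES (1, 1.0, 5.0), (1, 4.0, 11.0), (2, 2.0, 8.0);
""")

# The learner never sees the join result row by row; one aggregate query
# returns the handful of sums that determine the least-squares solution.
sums = con.execute("""
    SELECT SUM(x1*x1), SUM(x1*x2), SUM(x2*x2), SUM(x1*y), SUM(x2*y), COUNT(*)
    FROM R JOIN S USING (k)
""").fetchone()
print(sums)  # a few numbers summarize the whole training set for this model
```

    The point of the sketch is the shape of the computation: the query result has constant size regardless of how many rows the join produces, so all per-row work stays inside the database engine.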

    F: Regression Models over Factorized Views

    ABSTRACT
    We demonstrate F, a system for building regression models over database views. At its core lies the observation that the computation and representation of materialized views, and in particular of joins, entail non-trivial redundancy that is not necessary for the efficient computation of the aggregates used for building regression models. F avoids this redundancy by factorizing data and computation, and can outperform the state-of-the-art systems MADlib, R, and Python StatsModels by orders of magnitude on real-world datasets. We illustrate how to incrementally build regression models over factorized views using both an in-memory implementation of F and its SQL encoding. We also showcase the effective use of F for model selection: F decouples the data-dependent computation step from the data-independent convergence of model parameters and performs the former only once to explore the entire model space.

    WHAT IS F?
    F is a fast learner of regression models over training datasets defined by select-project-join-aggregate (SPJA) views. It is part of an ongoing effort to integrate databases and machine learning that includes MADlib [2] and Santoku [1]. Database joins are an unnecessarily expensive bottleneck for learning due to the redundancy in their tabular representation. To alleviate this limitation, F learns models in one pass over factorized joins, in which repeating data patterns are computed and represented only once. This has both theoretical and practical benefits: the computational complexity of F follows that of factorized materialized SPJA views.

    F proceeds in two steps. The first step computes the factorized view on the input database and the aggregates necessary for regression. The output of this step is a matrix of reals whose dimensions depend only on the arity of the view and are independent of the database size. This matrix contains the information necessary to compute the parameters of any model defined by a subset of the features in the view. This step comes in three flavors.

    F's factorization and task decomposition rely on a representation of data and computation as expressions in the sum-product commutative semiring, which is subject to the law of distributivity of product over sum. Results of SPJA queries are naturally represented in this semiring, with Cartesian product as product and union as sum. The derivatives of the objective functions for Least-Squares, Ridge, Lasso, and Elastic-Net regression models are also expressible in the sum-product semiring. Optimization methods such as gradient descent and (quasi-)Newton, which rely on first- and second-order derivatives of such objective functions respectively, can thus be used to train any such model with F.

    HOW DOES F WORK?
    We next explain F by means of an example: learning a least-squares regression model over a factorized join.
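    To make the role of distributivity concrete, here is a small worked identity in our own notation (an illustrative example, not reproduced from the paper): factoring a shared value out of a union is exactly what lets factorized joins represent repeating patterns once.

```latex
% Flat vs. factorized representation of the same set of tuples.
% Product distributes over union in the sum-product semiring:
\[
  (a \times b_1) \cup (a \times b_2) \cup (a \times b_3)
  \;=\;
  a \times (b_1 \cup b_2 \cup b_3).
\]
% The left-hand side (the flat, tabular form) stores $a$ three times;
% the right-hand side (the factorized form) stores it once. For a join
% $R(A,B) \bowtie S(B,C)$, grouping on $B$ yields the factorized form
\[
  \bigcup_{b} \sigma_{B=b}(R) \times \pi_{C}\,\sigma_{B=b}(S),
\]
% whose size is $O(|R| + |S|)$, while the flat join output can be as
% large as $|R| \cdot |S|$.
\]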
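    The following is a minimal sketch of the two-step decoupling that F exploits for model selection, under illustrative assumptions: NumPy, and a plain in-memory matrix standing in for F's one pass over the factorized join; all names are ours, not F's API. The data-dependent aggregates are computed once, and parameter convergence for any feature subset then runs on the aggregates alone, never touching the data again.

```python
# Sketch of the two-step scheme (illustrative, not F's actual code):
# step 1 is data-dependent and done once; step 2 is data-independent
# and can be repeated for every candidate feature subset.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # stand-in for the (factorized) view
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=1000)

# Step 1: aggregates over the data. In F these come from one pass over
# the factorized join; here we form them directly. Their size depends
# only on the number of features, not on the number of rows.
Sigma = X.T @ X   # sums of x_i * x_j over all rows
c = X.T @ y       # sums of x_i * y over all rows

def train(features, lr=1e-4, steps=5000):
    """Gradient descent on least squares using only the aggregates."""
    S = Sigma[np.ix_(features, features)]
    b = c[features]
    theta = np.zeros(len(features))
    for _ in range(steps):
        # gradient of 1/2 * ||X theta - y||^2 is X^T X theta - X^T y,
        # which is expressible via the aggregates alone
        theta -= lr * (S @ theta - b)
    return theta

# Step 2, repeated: explore feature subsets without rescanning the data.
for subset in ([0, 1], [0, 1, 2], [0, 1, 2, 3]):
    print(subset, train(subset))
```

    Because Sigma and c summarize the data, exploring candidate feature subsets costs no further passes over the database, which is what makes exploring the entire model space affordable.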

    Compressed Representations of Conjunctive Query Results

    Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a data processing pipeline. Motivated by this problem, we study the construction of space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting efficient access to the intermediate compressed result for a given access pattern. In particular, we initiate the study of an important tradeoff: minimizing the space necessary to store the compressed result versus minimizing the answer time and delay for an access request over the result. Our main contribution is a novel parameterized data structure, which can be tuned to trade off space for answer time. The tradeoff allows us to control the space requirement of the data structure precisely, and depends on both the structure of the query and the access pattern. We show how to use the data structure in conjunction with query decomposition techniques in order to efficiently represent the outputs for several classes of conjunctive queries. To appear in PODS'18.
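    As a toy illustration of the space/delay tradeoff studied here (our own minimal example, not the paper's parameterized data structure), consider the query Q(a, c) :- R(a, b), S(b, c). Materializing the output can take quadratic space, while semi-join-reduced adjacency lists take linear space and still support the access pattern "given a, enumerate all c".

```python
# Toy linear-space compressed representation of the output of
# Q(a, c) :- R(a, b), S(b, c), supporting "given a, enumerate all c".
# Illustrative only; not the data structure from the paper.
from collections import defaultdict

R = [(1, 10), (1, 11), (2, 10)]        # pairs (a, b)
S = [(10, "x"), (10, "y"), (12, "z")]  # pairs (b, c)

# Build adjacency lists and semi-join reduce: drop b-values with no
# S partner, so every hop taken during enumeration yields an answer.
s_by_b = defaultdict(list)
for b, c in S:
    s_by_b[b].append(c)
r_by_a = defaultdict(list)
for a, b in R:
    if b in s_by_b:                    # semi-join reduction
        r_by_a[a].append(b)

def answers(a):
    """Enumerate all c with Q(a, c)."""
    seen = set()
    for b in r_by_a.get(a, ()):
        for c in s_by_b[b]:
            # Deduplicating projected answers is what can break a
            # constant-delay guarantee in the worst case -- one face of
            # the space/delay tradeoff the paper formalizes.
            if c not in seen:
                seen.add(c)
                yield c

print(list(answers(1)))                # -> ['x', 'y']
```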