F-IVM: Analytics over Relational Databases under Updates
This article describes F-IVM, a unified approach for maintaining analytics
over changing relational data. We exemplify its versatility in four
applications: processing queries with group-by aggregates and joins; learning
linear regression models using the covariance matrix of the input features;
building Chow-Liu trees using pairwise mutual information of the input
features; and matrix chain multiplication.
F-IVM has three main ingredients: higher-order incremental view maintenance;
factorized computation; and ring abstraction. F-IVM reduces the maintenance of
a task to that of a hierarchy of simple views. Such views are functions mapping
keys, which are tuples of input values, to payloads, which are elements from a
ring. F-IVM also supports efficient factorized computation over keys, payloads,
and updates. Finally, F-IVM treats seemingly disparate tasks uniformly: in the
key space, all tasks require joins and variable marginalization, while in the
payload space, tasks differ in the definition of the sum and product ring
operations.
We implemented F-IVM on top of DBToaster and show that it can outperform
classical first-order and fully recursive higher-order incremental view
maintenance by orders of magnitude while using less memory.
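The ring abstraction at the heart of F-IVM can be made concrete with the degree-2 (covariance) ring used for linear regression: a payload is a triple of count, sum vector, and sum-of-products matrix, and the ring product combines partial aggregates across a join. A minimal NumPy sketch, assuming the class name and the zero-padding convention for per-relation features are ours:

```python
import numpy as np

class CovRing:
    """Payloads of the covariance (degree-2) ring over d features:
    (count, sum vector, sum-of-products matrix)."""
    def __init__(self, c, s, Q):
        self.c, self.s, self.Q = c, s, Q

    @staticmethod
    def zero(d):
        return CovRing(0, np.zeros(d), np.zeros((d, d)))

    @staticmethod
    def lift(x):
        # lift one tuple's feature vector into the ring
        x = np.asarray(x, dtype=float)
        return CovRing(1, x.copy(), np.outer(x, x))

    def __add__(self, o):
        # ring addition: componentwise (union of tuples)
        return CovRing(self.c + o.c, self.s + o.s, self.Q + o.Q)

    def __mul__(self, o):
        # ring multiplication: aggregates of the join of two partial results
        return CovRing(self.c * o.c,
                       self.c * o.s + o.c * self.s,
                       self.c * o.Q + o.c * self.Q
                       + np.outer(self.s, o.s) + np.outer(o.s, self.s))

# Example: R(x) with x in {1, 2} cross-joined with S(y), y in {3, 4}.
# Pad each side's features to the full dimension, aggregate per relation,
# then one ring product yields count, sums, and sums of products of the join.
r = CovRing.lift([1, 0]) + CovRing.lift([2, 0])
s = CovRing.lift([0, 3]) + CovRing.lift([0, 4])
joined = r * s
# joined.c == 4; joined.s == [6, 14]; joined.Q[0, 1] == 21 (sum of x*y)
```

The payoff is that each relation is aggregated once and the join is never materialized; an update to one relation only re-aggregates that side.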
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground for theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by combining key
tools from database theory, such as schema information, query structure,
functional dependencies, and recent advances in query evaluation algorithms,
with tools from linear algebra, such as tensor and matrix operations, one can
formulate relational analytics problems and design efficient algorithms that
exploit both query and data structure.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster, whenever the
competitors do not run out of memory, exceed a 24-hour timeout, or encounter
internal design limitations.
Comment: 61 pages, 9 figures, 2 tables
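A key idea behind this class of systems is that models such as ridge linear regression depend on the training data only through small aggregates: once Sigma = X^T X and c = X^T y have been computed (possibly inside the database, without materializing the join), training reduces to a d x d linear system whose size is independent of the number of training tuples. A minimal sketch of that last step, with the function name being ours:

```python
import numpy as np

def ridge_from_aggregates(Sigma, c, lam):
    """Solve ridge regression from precomputed aggregates.

    Sigma = X^T X (d x d), c = X^T y (d,): the normal equations with an
    L2 penalty are (Sigma + lam * I) w = c; no pass over X is needed."""
    d = Sigma.shape[0]
    return np.linalg.solve(Sigma + lam * np.eye(d), c)
```

For n training rows and d features, the aggregation is the only step that touches the data; the solve costs O(d^3) regardless of n.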
Write once, rewrite everywhere: A Unified Framework for Factorized Machine Learning
This thesis describes TRINITY, a framework to optimize linear algebra algorithms operating over relational data in GraalVM. The framework implements a host-language-agnostic version of the optimizations introduced by the Morpheus project, meaning that a single implementation of the Morpheus rewrite rules can be used to optimize linear algebra algorithms written in arbitrary GraalVM languages. We evaluate its performance when hosted within FastR and GraalPython, GraalVM’s R and Python implementations respectively. In doing so, we also show that TRINITY can optimize across languages, meaning that it can execute and optimize an algorithm written in one language, such as Python, while using data originating from another language, such as R.
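The Morpheus rewrite rules that TRINITY generalizes push linear algebra operators past the join instead of materializing the joined matrix. One representative rule, sketched in NumPy (the function name and dense indicator matrix are illustrative; real implementations use sparse indicators):

```python
import numpy as np

def factorized_matvec(S, K, R, w):
    """Multiply the unmaterialized join matrix T = [S | K @ R] by w.

    S: n x dS local features of the entity table,
    K: n x m 0/1 indicator matrix encoding the key join
       (row i of T takes its R-features from the R-row that K[i] selects),
    R: m x dR features of the joined table.
    The rewrite replaces ([S | K @ R]) @ w by S @ wS + K @ (R @ wR),
    so R is reduced to an m-vector before being expanded to n rows."""
    dS = S.shape[1]
    wS, wR = w[:dS], w[dS:]
    return S @ wS + K @ (R @ wR)
```

Since iterative learners evaluate such matrix-vector products many times, avoiding the n x dR materialization of K @ R is where the factorized speedups come from.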
The Fast and the Private: Task-based Dataset Search
Modern dataset search platforms employ ML task-based utility metrics instead
of relying on metadata-based keywords to comb through extensive dataset
repositories. In this setup, requesters provide an initial dataset, and the
platform identifies complementary datasets to augment (join or union) the
requester's dataset such that the ML model (e.g., linear regression)
performance is improved most. Although effective, current task-based data
searches are stymied by (1) high latency, which deters users, (2) privacy
concerns arising from regulatory standards, and (3) low data quality, which
yields low utility. We introduce Mileena, a fast, private, and high-quality
task-based
dataset search platform. At its heart, Mileena is built on pre-computed
semi-ring sketches for efficient ML training and evaluation. Building on these
sketches, we develop a novel Factorized Privacy Mechanism that makes the
search differentially private and scales to arbitrary corpus sizes and numbers
of requests without major quality degradation. We also demonstrate early
promise in using LLM-based agents for automatic data transformation and in
applying semi-rings to support causal discovery and treatment effect
estimation.
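To give a flavor of how a semi-ring sketch can be privatized: the sketch entries are just bounded aggregates, so standard output perturbation applies. The sketch below is NOT Mileena's Factorized Privacy Mechanism, merely a textbook Laplace mechanism on a (count, sum, sum-of-squares) sketch of one column, with function name and budget split chosen by us:

```python
import numpy as np

def dp_sketch(x, eps, bound, rng):
    """Release a differentially private (count, sum, sum-of-squares) sketch.

    Values are clipped to [-bound, bound], so the L1 sensitivities of the
    three statistics are 1, bound, and bound**2 respectively. The epsilon
    budget is split evenly across the three releases (basic composition)."""
    x = np.clip(np.asarray(x, dtype=float), -bound, bound)
    e = eps / 3
    noisy = lambda v, sens: v + rng.laplace(scale=sens / e)
    return (noisy(len(x), 1.0),
            noisy(x.sum(), bound),
            noisy((x ** 2).sum(), bound ** 2))
```

These three noisy statistics are enough to evaluate, for example, the variance-explained of a candidate regression augmentation without ever exposing the raw column.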
In-Database Data Imputation
Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., the mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets. This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in terms of computation time, while maintaining high imputation quality.
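For reference, the chained-equations loop that MICE is built on is compact. The sketch below is not the in-database implementation from the article (which runs inside PostgreSQL/DuckDB with ring-based computation sharing), and it uses a deterministic least-squares imputer rather than the stochastic linear regression the article trains; it only illustrates the iteration structure:

```python
import numpy as np

def mice_impute(X, n_iters=10):
    """MICE-style chained equations (deterministic variant, for illustration):
    initialize missing values with column means, then repeatedly regress each
    incomplete column on all other columns and overwrite its missing entries
    with the regression predictions."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])  # mean initialization
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue  # complete column: nothing to impute
            obs = ~miss[:, j]
            A = np.delete(X, j, axis=1)               # all other columns
            A1 = np.column_stack([np.ones(len(A)), A])  # add intercept
            w, *_ = np.linalg.lstsq(A1[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A1[miss[:, j]] @ w
    return X
```

Each sweep refits one model per incomplete column; the article's contribution is making exactly this kind of repeated model fitting cheap inside the database via shared ring aggregates.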