To Join or Not to Join? Thinking Twice about Joins before Feature Selection
Closer integration of machine learning (ML) with data processing is a booming area in both the data management industry and academia. Almost all ML toolkits assume that the input is a single table, but many datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins to obtain features from all base tables and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this work, we show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can "avoid joins safely". We identify the core technical issue that could cause accuracy to decrease in some cases and analyze this issue theoretically. Using simulations, we validate our analysis and measure the effects of various properties of normalized data on accuracy. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins in order to help analysts exploit this runtime-accuracy tradeoff. Experiments with multiple real normalized datasets show that our rules are able to accurately predict when joins can be avoided safely, and in some cases, this led to significant reductions in the runtime of some popular feature selection methods.
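One way to see why joined-in features can be redundant: in a key-foreign key join, every feature pulled from the referenced table is a function of the foreign key already present in the referencing table. The sketch below illustrates only this redundancy. The tables `customers` and `orders`, all column names, and both helper functions are hypothetical, and the sketch does not implement the paper's decision rules.

```python
# Hypothetical normalized data: an "orders" table referencing "customers".
customers = {  # primary key -> features of the referenced table
    1: {"region": "EU", "tier": "gold"},
    2: {"region": "US", "tier": "basic"},
}
orders = [  # each row carries the foreign key customer_id
    {"order_id": 10, "customer_id": 1, "amount": 30.0},
    {"order_id": 11, "customer_id": 2, "amount": 12.5},
    {"order_id": 12, "customer_id": 1, "amount": 99.0},
]


def kfk_join(fact_rows, dim_table, fk):
    """Key-foreign key join: append the referenced row's features to each row."""
    return [{**row, **dim_table[row[fk]]} for row in fact_rows]


def fk_determines(joined_rows, fk, joined_cols):
    """Check that each foreign key value maps to exactly one tuple of joined features."""
    seen = {}
    for row in joined_rows:
        values = tuple(row[c] for c in joined_cols)
        if seen.setdefault(row[fk], values) != values:
            return False
    return True


joined = kfk_join(orders, customers, "customer_id")
redundant = fk_determines(joined, "customer_id", ["region", "tier"])
print(redundant)
```

Redundancy in this sense does not by itself guarantee that dropping the joined features leaves accuracy unchanged; that question is what the analysis and decision rules described in the abstract address.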
Boosting gets full Attention for Relational Learning
More often than not in benchmark supervised ML, tabular data is flat, i.e.
consists of a single (rows, columns) file, but cases abound in the
real world where observations are described by a set of tables with structural
relationships. Neural nets-based deep models are a classical fit to incorporate
general topological dependence among description features (pixels, words,
etc.), but their underperformance relative to tree-based models on tabular data is still
well documented. In this paper, we introduce an attention mechanism for
structured data that blends well with tree-based models in the training context
of (gradient) boosting. Each aggregated model is a tree whose training involves
two steps: first, simple tabular models are learned descending tables in a
top-down fashion with boosting's class residuals on tables' features. Second,
what has been learned progresses back bottom-up via attention and aggregation
mechanisms, progressively crafting new features that ultimately complete the
set of observation features over which a single tree is learned. Boosting's
iteration clock is then incremented and new class residuals are computed.
Experiments on simulated and real-world domains display the competitiveness of
our method against a state of the art containing both tree-based and neural
nets-based models.
A Relational Gradient Descent Algorithm For Support Vector Machine Training
We consider gradient-descent-like algorithms for Support Vector Machine (SVM)
training when the data is in relational form. The gradient of the SVM objective
cannot be efficiently computed by known techniques as it suffers from the
"subtraction problem". We first show that the subtraction problem cannot be
surmounted by showing that computing any constant approximation of the gradient
of the SVM objective function is #P-hard, even for acyclic joins. We,
however, circumvent the subtraction problem by restricting our attention to
stable instances, which intuitively are instances where a nearly optimal
solution remains nearly optimal if the points are perturbed slightly. We give
an efficient algorithm that computes a "pseudo-gradient" that guarantees
convergence for stable instances at a rate comparable to that achieved by using
the actual gradient. We believe that our results suggest that this sort of
stability analysis would likely yield useful insight in the context of
designing algorithms on relational data for other learning problems in which
the subtraction problem arises.
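For context, here is a minimal sketch of the straightforward baseline: materialize the join, then compute a subgradient of a regularized hinge-loss objective over the joined rows. The tables `T1` and `T2`, the column names, the L2 regularizer, and its weight `lam` are assumptions for illustration. This is not the paper's pseudo-gradient algorithm.

```python
# Hypothetical relational input: two tables joined on key "k".
# T1 holds feature x1 and the label y; T2 holds feature x2.
T1 = [
    {"k": 1, "x1": 2.0, "y": 1},
    {"k": 2, "x1": -1.0, "y": -1},
    {"k": 2, "x1": 0.5, "y": 1},
]
T2 = [
    {"k": 1, "x2": 1.0},
    {"k": 2, "x2": -2.0},
    {"k": 2, "x2": 0.5},
]


def materialize_join(left, right, key):
    """Nested-loop equi-join; the output can be far larger than the inputs."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def hinge_subgradient(rows, w, lam):
    """Subgradient of lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * <w, x_i>)."""
    n = len(rows)
    grad = [lam * wi for wi in w]
    for row in rows:
        x = (row["x1"], row["x2"])
        margin = row["y"] * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # only margin-violating rows contribute
            for j in range(len(w)):
                grad[j] -= row["y"] * x[j] / n
    return grad


rows = materialize_join(T1, T2, "k")
grad = hinge_subgradient(rows, w=[0.0, 0.0], lam=0.1)
print(len(rows), grad)
```

In this sketch the per-row test `margin < 1.0` depends on features from both tables at once, so the sum is taken over joined rows rather than over each table separately.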
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
Comment: 22 pages, SIGMOD 201
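The notion of an approximate model that "makes the same predictions as the full model" can be illustrated by measuring prediction agreement between a model fit on a sample and one fit on all the data. The sketch below does this for a one-dimensional logistic regression fit by maximum likelihood. The synthetic data, the sample size, the helper names, and the gradient-ascent settings are all assumptions. It measures agreement after the fact by training both models, and it is not BlinkML's method for obtaining guarantees.

```python
import math
import random


def fit_logistic(points, steps=200, lr=0.5):
    """MLE for 1-D logistic regression via gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in points:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (y - p) * x
            gb += y - p
        w += lr * gw / n
        b += lr * gb / n
    return w, b


def predict(model, x):
    w, b = model
    return 1 if w * x + b > 0 else 0


rng = random.Random(0)  # fixed seed so the run is repeatable
full_data = []
for _ in range(2000):
    x = rng.uniform(-3.0, 3.0)
    y = 1 if x + rng.gauss(0.0, 0.5) > 0 else 0  # noisy threshold labels
    full_data.append((x, y))

sample = rng.sample(full_data, 200)  # uniform sample, 10% of the data

full_model = fit_logistic(full_data)
approx_model = fit_logistic(sample)

# Fraction of inputs on which the two models predict the same class.
agreement = sum(
    predict(full_model, x) == predict(approx_model, x) for x, _ in full_data
) / len(full_data)
print(f"prediction agreement: {agreement:.3f}")
```

Training the full model only to check agreement defeats the purpose of sampling; the sketch is meant solely to make the quantity being guaranteed concrete.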