7 research outputs found
Learning Theory of Distributed Regression with Bias Corrected Regularization Kernel Network
Distributed learning is an effective way to analyze big data. In distributed
regression, a typical approach is to divide the big data into multiple blocks,
apply a base regression algorithm on each of them, and then simply average the
output functions learnt from these blocks. Since averaging decreases the
variance but not the bias, bias correction is expected to improve the learning
performance when the base regression algorithm is biased.
Regularization kernel network is an effective and widely used method for
nonlinear regression analysis. In this paper we investigate a bias-corrected
version of the regularization kernel network. We derive error bounds when it is
applied to a single data set and when it is used as the base algorithm in
distributed regression. We show that, under appropriate conditions, the optimal
learning rates can be reached in both situations.
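To make the divide-and-average scheme concrete, here is a minimal Python sketch of distributed regression with a bias-corrected kernel ridge regression (regularization kernel network) as the base learner. The correction used below, one extra fit on the residuals (a twicing step), is only an assumed form of the bias correction, and the function names and Gaussian kernel choice are illustrative rather than the paper's exact construction.

    import numpy as np

    def gaussian_kernel(X, Z, gamma=1.0):
        """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def krr_fit(X, y, lam, gamma=1.0):
        """Regularization kernel network / kernel ridge regression on one block."""
        K = gaussian_kernel(X, X, gamma)
        n = len(y)
        return np.linalg.solve(K + n * lam * np.eye(n), y)

    def bias_corrected_krr_fit(X, y, lam, gamma=1.0):
        """Assumed bias correction: one extra KRR fit on the residuals (twicing)."""
        alpha = krr_fit(X, y, lam, gamma)
        K = gaussian_kernel(X, X, gamma)
        residual = y - K @ alpha
        beta = krr_fit(X, residual, lam, gamma)
        return alpha + beta

    def distributed_predict(blocks, X_test, lam, gamma=1.0):
        """Fit the base algorithm on each block and average the output functions."""
        preds = []
        for Xb, yb in blocks:
            coef = bias_corrected_krr_fit(Xb, yb, lam, gamma)
            preds.append(gaussian_kernel(X_test, Xb, gamma) @ coef)
        return np.mean(preds, axis=0)

    # Toy usage: split a synthetic data set into blocks and average the local fits.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(600, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(600)
    blocks = [(X[i::4], y[i::4]) for i in range(4)]
    print(distributed_predict(blocks, X[:5], lam=1e-2))
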
Optimal Rates of Distributed Regression with Imperfect Kernels
Distributed machine learning systems have been receiving increasing
attention for their efficiency in processing large-scale data. Many distributed
frameworks have been proposed for different machine learning tasks. In this
paper, we study the distributed kernel regression via the divide and conquer
approach. This approach has been proved asymptotically minimax optimal if the
kernel is perfectly selected so that the true regression function lies in the
associated reproducing kernel Hilbert space. However, this is usually, if not
always, impractical, because kernels, which can only be selected via prior
knowledge or a tuning process, are hardly perfect. Instead, it is more common
that the kernel is good enough but imperfect, in the sense that the true
regression function can be well approximated by, but does not lie exactly in,
the kernel space. We show that distributed kernel regression can still achieve
the capacity independent optimal rate in this case. To this end, we first establish a
general framework that allows us to analyze distributed regression with
response-weighted base algorithms by bounding the error of such algorithms on a
single data set, provided that the error bounds factor in the impact of the
unexplained variance of the response variable. Then we perform a leave-one-out
analysis of the kernel ridge regression and bias corrected kernel ridge
regression, which in combination with the aforementioned framework allows us to
derive sharp error bounds and capacity independent optimal rates for the
associated distributed kernel regression algorithms. As a byproduct of the
thorough analysis, we also prove that kernel ridge regression can achieve rates
faster than $N^{-1}$ (where $N$ is the sample size) in the noise-free setting,
which, to the best of our knowledge, are observed for the first time in
regression learning.
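For illustration, the sketch below shows what a response-weighted base algorithm means in the case of kernel ridge regression: the prediction at a point is a weighted sum of the observed responses, with weights depending only on the inputs. The Gaussian kernel and the function name krr_response_weights are assumptions made for this example, not part of the paper.

    import numpy as np

    def rbf(X, Z, gamma=1.0):
        """Gaussian kernel matrix between the rows of X and the rows of Z."""
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def krr_response_weights(X_train, x, lam, gamma=1.0):
        """Weights w_i(x) such that the KRR prediction at x is sum_i w_i(x) * y_i."""
        n = len(X_train)
        K = rbf(X_train, X_train, gamma)
        k_x = rbf(x.reshape(1, -1), X_train, gamma).ravel()
        # KRR is linear in the responses: f(x) = k(x)^T (K + n*lam*I)^{-1} y.
        return np.linalg.solve(K + n * lam * np.eye(n), k_x)

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = X[:, 0] ** 2 - X[:, 1] + 0.05 * rng.standard_normal(200)
    w = krr_response_weights(X, X[0], lam=1e-2)
    print(float(w @ y))    # the KRR prediction at X[0], as a response-weighted sum
    print(float(w.sum()))  # the weights are not constrained to sum to one

The divide-and-conquer estimator averages such response-weighted fits over the data blocks, so its prediction remains linear in the responses of each block.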
WONDER: Weighted one-shot distributed ridge regression in high dimensions
In many areas, practitioners need to analyze large datasets that challenge
conventional single-machine computing. To scale up data analysis, distributed
and parallel computing approaches are increasingly needed.
Here we study a fundamental and highly important problem in this area: How to
do ridge regression in a distributed computing environment? Ridge regression is
an extremely popular method for supervised learning, and has several optimality
properties, so it is important to study. We study one-shot methods that
construct weighted combinations of ridge regression estimators computed on each
machine. By analyzing the mean squared error in a high dimensional
random-effects model where each predictor has a small effect, we discover
several new phenomena.
1. Infinite-worker limit: The distributed estimator works well for very large
numbers of machines, a phenomenon we call "infinite-worker limit".
2. Optimal weights: The optimal weights for combining local estimators sum to
more than unity, due to the downward bias of ridge. Thus, all averaging methods
are suboptimal.
We also propose a new Weighted ONe-shot DistributEd Ridge regression (WONDER)
algorithm. We test WONDER in simulation studies and using the Million Song
Dataset as an example. There it can save at least 100x in computation time,
while nearly preserving test accuracy.
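As a rough illustration of the weighted one-shot idea, the sketch below fits ridge regression locally on each machine and then combines the local estimators with weights. WONDER derives the optimal weights analytically under the high-dimensional random-effects model; here the weights are instead fit on a small validation set, which is only a data-driven surrogate, and the names local_ridge and wonder_like_combine are hypothetical.

    import numpy as np

    def local_ridge(X, y, lam):
        """Ridge estimator computed on one machine's data."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def wonder_like_combine(local_betas, X_val, y_val):
        """Surrogate combination: least-squares weights on a validation set."""
        P = np.column_stack([X_val @ b for b in local_betas])  # local predictions
        w, *_ = np.linalg.lstsq(P, y_val, rcond=None)
        return w, sum(w[i] * local_betas[i] for i in range(len(local_betas)))

    # Toy usage with 4 machines and a random-effects-style target.
    rng = np.random.default_rng(2)
    n, d, m = 2000, 50, 4
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d) / np.sqrt(d)   # each predictor has a small effect
    y = X @ beta + rng.standard_normal(n)
    splits = np.array_split(np.arange(n - 200), m)
    betas = [local_ridge(X[idx], y[idx], lam=float(d)) for idx in splits]
    w, beta_hat = wonder_like_combine(betas, X[-200:], y[-200:])
    print(w, w.sum())  # the fitted weights are not constrained to sum to one
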
Two-stage Best-scored Random Forest for Large-scale Regression
We propose a novel method designed for large-scale regression problems,
namely the two-stage best-scored random forest (TBRF). "Best-scored" means to
select one regression tree with the best empirical performance out of a certain
number of purely random regression tree candidates, and "two-stage" means to
divide the original random tree splitting procedure into two: In stage one, the
feature space is partitioned into non-overlapping cells; in stage two, child
trees grow separately on these cells. The strengths of this algorithm can be
summarized as follows: First of all, the pure randomness in TBRF leads to
almost optimal learning rates and also makes ensemble learning possible, which
resolves the boundary discontinuities that have long plagued existing algorithms.
Secondly, the two-stage procedure paves the way for parallel computing, leading
to computational efficiency. Last but not least, TBRF can serve as an inclusive
framework where different mainstream regression strategies, such as linear
predictors and least-squares support vector machines (LS-SVMs), can be
incorporated as value assignment approaches on the leaves of the child trees,
depending on the characteristics of the underlying data sets. Numerical
comparisons with other state-of-the-art methods on several large-scale real
data sets validate the promising prediction accuracy and high computational
efficiency of our algorithm.
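A rough Python sketch of the two-stage, best-scored construction is given below, using scikit-learn trees with random splits as a stand-in for the paper's purely random trees. The candidates are scored on their training error for brevity, the depths and candidate count are illustrative assumptions, and only a single ensemble member is built; a full TBRF would repeat the construction with independent partitions and average.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error

    def best_scored_tree(X, y, n_candidates=5, rng=None):
        """Pick the best-scoring tree out of several random-split candidates."""
        rng = rng or np.random.default_rng()
        best, best_err = None, np.inf
        for _ in range(n_candidates):
            t = DecisionTreeRegressor(splitter="random", max_depth=8,
                                      random_state=int(rng.integers(1 << 31)))
            t.fit(X, y)
            err = mean_squared_error(y, t.predict(X))  # scored on the fit sample
            if err < best_err:
                best, best_err = t, err
        return best

    def tbrf_like_fit(X, y, rng=None):
        """Stage one: a shallow random partition into cells.
           Stage two: an independent best-scored child tree per cell."""
        rng = rng or np.random.default_rng(0)
        splitter = DecisionTreeRegressor(splitter="random", max_depth=3,
                                         random_state=int(rng.integers(1 << 31)))
        splitter.fit(X, y)
        cells = splitter.apply(X)
        children = {c: best_scored_tree(X[cells == c], y[cells == c], rng=rng)
                    for c in np.unique(cells)}
        return splitter, children

    def tbrf_like_predict(model, X):
        splitter, children = model
        cells = splitter.apply(X)
        out = np.empty(len(X))
        for c in np.unique(cells):
            out[cells == c] = children[c].predict(X[cells == c])
        return out

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 1, size=(3000, 5))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(3000)
    model = tbrf_like_fit(X, y, rng)
    print(tbrf_like_predict(model, X[:5]))

Because the child trees grow only on the data inside their own cell, they can be fit in parallel, which is the computational benefit the two-stage design is after.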
Kernel-based L_2-Boosting with Structure Constraints
Developing efficient kernel methods for regression has been a popular topic
over the past decade. In this paper, utilizing boosting on kernel-based weak learners,
we propose a novel kernel-based learning algorithm called kernel-based
re-scaled boosting with truncation, dubbed KReBooT. The proposed KReBooT
benefits from controlling the structure of the estimator and producing sparse
estimates, and is nearly resistant to overfitting. We conduct both theoretical
analysis and numerical simulations to illustrate the power of KReBooT.
Theoretically, we prove that KReBooT can achieve the almost optimal numerical
convergence rate for nonlinear approximation. Furthermore, using the recently
developed integral operator approach and a variant of Talagrand's concentration
inequality, we provide fast learning rates for KReBooT, setting a new record
among boosting-type algorithms. Numerically, we carry out a series of simulations
to show the promising performance of KReBooT in terms of its good
generalization, near over-fitting resistance and structure constraints.
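The sketch below illustrates one plausible reading of kernel-based re-scaled boosting with truncation: at each iteration the current estimator is shrunk, the kernel weak learner (a kernel function centered at a sample) most correlated with the residual is added via a line search, and the coefficient vector is pulled back into an l1-ball. The shrinkage sequence 2/(k+2), the l1 truncation, and the function names are assumptions for illustration; the paper's exact re-scaling and truncation operators may differ.

    import numpy as np

    def kernel_column(X, center, gamma=1.0):
        """A kernel weak learner: the Gaussian kernel centered at one sample."""
        return np.exp(-gamma * ((X - center) ** 2).sum(-1))

    def kreboot_like_fit(X, y, n_iter=200, gamma=5.0, l1_budget=50.0):
        n = len(y)
        G = np.stack([kernel_column(X, X[i], gamma) for i in range(n)], axis=1)
        coef = np.zeros(n)                           # one coefficient per center
        for k in range(n_iter):
            coef *= 1.0 - 2.0 / (k + 2.0)            # re-scale the current estimator
            resid = y - G @ coef
            j = int(np.argmax(np.abs(G.T @ resid)))  # pick the best weak learner
            g = G[:, j]
            coef[j] += float(g @ resid) / float(g @ g)  # line-search step size
            norm = np.abs(coef).sum()
            if norm > l1_budget:                     # truncation: stay in an l1-ball
                coef *= l1_budget / norm
        return coef

    rng = np.random.default_rng(4)
    X = rng.uniform(-1, 1, size=(300, 1))
    y = np.sign(X[:, 0]) * X[:, 0] ** 2 + 0.05 * rng.standard_normal(300)
    coef = kreboot_like_fit(X, y)
    print(int((np.abs(coef) > 1e-8).sum()), "active kernel centers out of", len(coef))

Since at most one center is activated per iteration, the estimate stays sparse whenever the number of iterations is small relative to the sample size, which is how a structure constraint of this kind yields sparsity.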
Kernel regression in high dimensions: Refined analysis beyond double descent
In this paper, we provide a precise characterization of generalization
properties of high dimensional kernel ridge regression across the under- and
over-parameterized regimes, depending on whether the number of training samples n
exceeds the feature dimension d. By establishing a bias-variance decomposition
of the expected excess risk, we show that, while the bias is (almost)
independent of d and monotonically decreases with n, the variance depends on n,
d and can be unimodal or monotonically decreasing under different
regularization schemes. Our refined analysis goes beyond the double descent
theory by showing that, depending on the data eigen-profile and the level of
regularization, the kernel regression risk curve can be a double-descent-like,
bell-shaped, or monotonic function of n. Experiments on synthetic and real data
are conducted to support our theoretical findings.
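To make the bias-variance decomposition concrete, the Monte Carlo sketch below estimates the squared bias and variance of kernel ridge regression predictions as the sample size n grows with the dimension d fixed. The isotropic Gaussian inputs, linear target, and fixed regularization level are illustrative assumptions, not the eigen-profiles analyzed in the paper.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    def krr_bias_variance(n, d=50, reps=30, alpha=1e-1, rng=None):
        """Monte Carlo estimate of squared bias and variance of KRR predictions."""
        rng = rng or np.random.default_rng(0)
        beta = np.ones(d) / np.sqrt(d)
        X_test = rng.standard_normal((200, d))
        f_star = X_test @ beta
        preds = []
        for _ in range(reps):
            X = rng.standard_normal((n, d))
            y = X @ beta + rng.standard_normal(n)
            model = KernelRidge(alpha=alpha, kernel="rbf", gamma=1.0 / d).fit(X, y)
            preds.append(model.predict(X_test))
        preds = np.array(preds)
        bias2 = np.mean((preds.mean(axis=0) - f_star) ** 2)
        var = np.mean(preds.var(axis=0))
        return bias2, var

    for n in (25, 50, 100, 200, 400):
        b2, v = krr_bias_variance(n)
        print(f"n={n:4d}  bias^2={b2:.3f}  variance={v:.3f}")

Repeating the loop for different regularization levels gives a rough empirical handle on how the two components of the risk move with n.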
Histogram Transform Ensembles for Large-scale Regression
We propose a novel algorithm for large-scale regression problems named
histogram transform ensembles (HTE), composed of random rotations, stretchings,
and translations. First of all, we investigate the theoretical properties of
HTE when the regression function lies in the Hölder space $C^{k,\alpha}$,
$k \in \mathbb{N}_0$, $\alpha \in (0,1]$. In the case that $k = 0$, we adopt
the constant regressors and develop the naïve histogram transforms (NHT).
Within the space $C^{0,\alpha}$, although almost optimal convergence rates can
be derived for both single and ensemble NHT, we fail to show the benefits of
ensembles over single estimators theoretically. In contrast, in a subspace of
$C^{0,\alpha}$, we prove that, under an additional condition, the lower bound
of the convergence rates for single NHT turns out to be worse than the upper
bound of the convergence rates for ensemble NHT. In the other case, when
$k \geq 1$, the NHT may no longer be appropriate in predicting smoother regression
functions. Instead, we apply kernel histogram transforms (KHT) equipped with
smoother regressors such as support vector machines (SVMs), and it turns out
that both single and ensemble KHT enjoy almost optimal convergence rates. Then
we validate the above theoretical results by numerical experiments. On the one
hand, simulations are conducted to elucidate that ensemble NHT outperforms
single NHT. On the other hand, the effects of bin sizes on the accuracy of both
NHT and KHT also accord with the theoretical analysis. Last but not least, in the
real-data experiments, comparisons between the ensemble KHT, equipped with
adaptive histogram transforms, and other state-of-the-art large-scale
regression estimators verify the effectiveness and accuracy of our algorithm.
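As a rough illustration, the sketch below implements a naive histogram transform regressor: inputs are randomly rotated, stretched, and translated, then binned on an integer grid, the cell mean serves as the constant regressor, and several independent transforms are averaged to form the ensemble. The sampling distributions of the rotation, stretching, and translation, as well as the bin width, are illustrative assumptions rather than the paper's exact construction.

    import numpy as np

    def random_histogram_transform(d, bin_width, rng):
        """One random histogram transform: rotation, stretching, translation, binning."""
        A = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random rotation
        s = rng.uniform(0.5, 2.0, size=d) / bin_width      # per-coordinate stretching
        b = rng.uniform(0.0, 1.0, size=d)                  # translation
        return lambda X: np.floor(X @ A.T * s + b).astype(int)

    def nht_fit(X, y, transform):
        """Constant (cell-mean) regressors on the transformed histogram cells."""
        cells = transform(X)
        values = {}
        for key, yi in zip(map(tuple, cells), y):
            values.setdefault(key, []).append(yi)
        means = {k: float(np.mean(v)) for k, v in values.items()}
        return means, float(np.mean(y))

    def nht_predict(model, X, transform):
        means, default = model
        return np.array([means.get(tuple(c), default) for c in transform(X)])

    def hte_predict(X_train, y_train, X_test, n_transforms=10, bin_width=0.2, seed=0):
        """Average the single NHT estimators over independent random transforms."""
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_transforms):
            T = random_histogram_transform(X_train.shape[1], bin_width, rng)
            preds.append(nht_predict(nht_fit(X_train, y_train, T), X_test, T))
        return np.mean(preds, axis=0)

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 1, size=(5000, 2))
    y = np.sin(4 * X[:, 0]) + np.abs(X[:, 1] - 0.5) + 0.1 * rng.standard_normal(5000)
    print(hte_predict(X[:-5], y[:-5], X[-5:]))

Cells that receive no training points fall back to the global mean here purely as a convenience for the sketch; swapping the cell-mean regressor for an SVM fit per cell would give the KHT variant described above.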