Lepskii Principle in Supervised Learning
In the setting of supervised learning using reproducing kernel methods, we
propose a data-dependent regularization parameter selection rule that is
adaptive to the unknown regularity of the target function and is optimal both
for the least-squares (prediction) error and for the reproducing kernel Hilbert
space (reconstruction) norm error. It is based on a modified Lepskii balancing
principle using a varying family of norms.
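The balancing idea behind a Lepskii-type rule can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's modified rule: it uses a single fixed norm rather than the varying family of norms, and the constant kappa and the noise bounds are placeholder inputs that a real analysis would derive from high-probability error estimates.

```python
import numpy as np

def lepskii_select(estimates, noise_levels, kappa=2.0):
    """Classical Lepskii balancing over a grid of candidate estimators.

    estimates    : list of prediction vectors, ordered from strongest to
                   weakest regularization (so bias decreases along the list)
    noise_levels : placeholder high-probability bounds on the stochastic
                   error of each candidate, increasing along the list
    Picks the least-regularized candidate that stays within the combined
    noise bounds of every more-regularized one.
    """
    selected = 0
    for j in range(1, len(estimates)):
        balanced = all(
            np.linalg.norm(estimates[j] - estimates[i])
            <= kappa * (noise_levels[i] + noise_levels[j])
            for i in range(j)
        )
        if not balanced:
            break
        selected = j
    return selected
```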
Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces
In this paper, we study regression problems over a separable Hilbert space
with the square loss, covering non-parametric regression over a reproducing
kernel Hilbert space. We investigate a class of spectral-regularized
algorithms, including ridge regression, principal component analysis, and
gradient methods. We prove optimal, high-probability convergence results in
terms of variants of norms for the studied algorithms, considering a capacity
assumption on the hypothesis space and a general source condition on the target
function. Consequently, we obtain almost sure convergence results with optimal
rates. Our results improve on and generalize previous work, filling a
theoretical gap for the non-attainable cases.
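To make the notion of a spectral algorithm concrete, here is a minimal numpy sketch of the spectral-filter view on a kernel matrix; the function names are ours, not the paper's, and the ridge and cut-off filters are standard textbook examples of the filter functions the class covers.

```python
import numpy as np

def spectral_fit(K, y, filter_fn):
    """Fit kernel expansion coefficients via a spectral filter.

    A spectral algorithm replaces the inverse (K/n)^{-1} by g_lambda(K/n)
    for a filter function g_lambda, giving alpha = g_lambda(K/n) y / n.
    """
    n = len(y)
    evals, evecs = np.linalg.eigh(K / n)
    filtered = filter_fn(np.clip(evals, 0.0, None))  # filter the spectrum
    return evecs @ (filtered * (evecs.T @ y)) / n

lam = 1e-2
ridge = lambda s: 1.0 / (s + lam)                    # Tikhonov / ridge regression
cutoff = lambda s: np.where(s > lam, 1.0 / np.maximum(s, lam), 0.0)  # spectral cut-off (PCA)
```

Predictions at the training points are then K @ alpha; gradient methods correspond to yet another family of polynomial filters.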
Convergence rates of Kernel Conjugate Gradient for random design regression
We prove statistical rates of convergence for kernel-based least squares
regression from i.i.d. data using a conjugate gradient algorithm, where
regularization against overfitting is obtained by early stopping. This method
is related to Kernel Partial Least Squares, a regression method that combines
supervised dimensionality reduction with least squares projection. Following
the setting introduced in earlier related literature, we study so-called "fast
convergence rates" depending on the regularity of the target regression
function (measured by a source condition in terms of the kernel integral
operator) and on the effective dimensionality of the data mapped into the
kernel space. We obtain upper bounds, essentially matching known minimax lower
bounds, for the $L^2$ (prediction) norm as well as for the stronger
Hilbert norm, if the true regression function belongs to the reproducing kernel
Hilbert space. If the latter assumption is not fulfilled, we obtain similar
convergence rates for appropriate norms, provided additional unlabeled data are
available.
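A minimal sketch of the core iteration (our own simplification; the paper's algorithm and stopping rule are more refined): conjugate gradient is run on the normal equation K alpha = y, and the number of iterations plays the role of the regularization parameter.

```python
import numpy as np

def kernel_cg(K, y, n_iter):
    """Conjugate gradient on K @ alpha = y, stopped after n_iter steps.

    Early stopping regularizes: few iterations give smooth, heavily
    regularized fits; many iterations approach the interpolating solution.
    """
    alpha = np.zeros(len(y))
    r = y.copy()        # residual y - K @ alpha
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(n_iter):
        Kp = K @ p
        step = rs / (p @ Kp)
        alpha += step * p
        r -= step * Kp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return alpha
```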
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression
with potentially several passes over the data. While several passes have been
widely reported to give better predictive performance on unseen data in
practice, the existing theoretical analysis of SGD suggests
that a single pass is statistically optimal. While this is true for
low-dimensional easy problems, we show that for hard problems, multiple passes
lead to statistically optimal predictions while a single pass does not; we also
show that in these hard models, the optimal number of passes over the data
increases with sample size. In order to define the notion of hardness and show
that our predictive performances are optimal, we consider potentially
infinite-dimensional models and notions typically associated with kernel methods,
namely, the decay of eigenvalues of the covariance matrix of the features and
the complexity of the optimal predictor as measured through the covariance
matrix. We illustrate our results on synthetic experiments with non-linear
kernel methods and on a classical benchmark with a linear model.
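A minimal sketch of the multi-pass setup (the step size, sampling scheme, and iterate averaging below are placeholder choices; the paper's schedules are tuned to the hardness of the problem):

```python
import numpy as np

def multipass_sgd(X, y, n_passes, lr=0.01, seed=0):
    """SGD for least squares with several passes and averaged iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of the square loss at one sample
            w -= lr * grad
            t += 1
            w_bar += (w - w_bar) / t          # running (Polyak-Ruppert) average
    return w_bar
```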
Kernel Conjugate Gradient Methods with Random Projections
We propose and study kernel conjugate gradient methods (KCGM) with random
projections for least-squares regression over a separable Hilbert space.
Considering two types of random projections generated by randomized sketches
and Nyström subsampling, we prove optimal statistical results with respect
to variants of norms for the algorithms under a suitable stopping rule.
In particular, our results show that if the projection dimension is proportional
to the effective dimension of the problem, KCGM with randomized sketches can
generalize optimally, while achieving a computational advantage. As a
corollary, we derive optimal rates for classic KCGM in the case that the target
function may not be in the hypothesis space, filling a theoretical gap.
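As a rough illustration of the Nyström variant (the landmark choice, the projection dimension m, and the fixed iteration count are placeholders; the paper's analysis ties the projection dimension to the effective dimension and uses a principled stopping rule):

```python
import numpy as np

def nystrom_cg(K, y, landmarks, n_iter):
    """Early-stopped CG on a Nystrom-projected least-squares problem.

    landmarks : indices of the m subsampled points defining the projection.
    Returns fitted values on the training inputs.
    """
    K_nm = K[:, landmarks]
    K_mm = K[np.ix_(landmarks, landmarks)]
    evals, evecs = np.linalg.eigh(K_mm)
    inv_sqrt = evecs @ np.diag(np.clip(evals, 1e-12, None) ** -0.5) @ evecs.T
    Phi = K_nm @ inv_sqrt                     # n x m Nystrom feature map

    A, b = Phi.T @ Phi, Phi.T @ y             # projected normal equations
    w = np.zeros(len(b))
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):                   # iteration count regularizes
        Ap = A @ p
        step = rs / (p @ Ap)
        w += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return Phi @ w
```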