FALKON: An Optimal Large Scale Kernel Method
Kernel methods provide a principled way to perform nonlinear, nonparametric
learning. They rely on solid functional analytic foundations and enjoy optimal
statistical properties. However, at least in their basic form, they have
limited applicability in large scale scenarios because of stringent
computational requirements in terms of time and especially memory. In this
paper, we take a substantial step in scaling up kernel methods, proposing
FALKON, a novel algorithm that can efficiently process millions of
points. FALKON is derived by combining several algorithmic principles, namely
stochastic subsampling, iterative solvers and preconditioning. Our theoretical
analysis shows that optimal statistical accuracy is achieved requiring
essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental
analysis on large scale datasets shows that, even with a single machine, FALKON
outperforms previous state-of-the-art solutions, which exploit
parallel/distributed architectures. Comment: NIPS 2017
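The combination the abstract describes, Nyström-style stochastic subsampling plus an iterative solver, can be sketched in a few lines. The numpy fragment below is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth sigma, the iteration count t and all function names are illustrative, and the Cholesky-based preconditioner that gives FALKON its fast convergence is omitted for brevity.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def falkon_sketch(X, y, m, lam, t=20, sigma=1.0, seed=0):
    # Nystrom step: restrict the estimator to m randomly sampled centers,
    # then solve (K_nm^T K_nm + lam * n * K_mm) alpha = K_nm^T y with
    # plain conjugate gradient (FALKON preconditions this system first).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=m, replace=False)]
    K_nm = gaussian_kernel(X, centers, sigma)
    K_mm = gaussian_kernel(centers, centers, sigma)
    A = K_nm.T @ K_nm + lam * n * K_mm
    b = K_nm.T @ y
    alpha = np.zeros(m)
    r = b - A @ alpha
    p = r.copy()
    for _ in range(t):  # t CG iterations, each cheap once A is formed
        Ap = A @ p
        step = (r @ r) / (p @ Ap)
        alpha += step * p
        r_new = r - step * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return centers, alpha  # predict via f(x) = sum_j alpha[j] * k(x, centers[j])
```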
Convergence rates of Kernel Conjugate Gradient for random design regression
We prove statistical rates of convergence for kernel-based least squares
regression from i.i.d. data using a conjugate gradient algorithm, where
regularization against overfitting is obtained by early stopping. This method
is related to Kernel Partial Least Squares, a regression method that combines
supervised dimensionality reduction with least squares projection. Following
the setting introduced in earlier related literature, we study so-called "fast
convergence rates" depending on the regularity of the target regression
function (measured by a source condition in terms of the kernel integral
operator) and on the effective dimensionality of the data mapped into the
kernel space. We obtain upper bounds, essentially matching known minimax lower
bounds, for the $L^2$ (prediction) norm as well as for the stronger
Hilbert norm, if the true regression function belongs to the reproducing kernel
Hilbert space. If the latter assumption is not fulfilled, we obtain similar
convergence rates for appropriate norms, provided additional unlabeled data are
available.
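For intuition, early-stopped kernel CG is a short recursion: each iteration enlarges the Krylov space, so the stopping time plays the role of the regularization parameter. The sketch below is a simplification under assumptions not in the abstract: it runs textbook CG on $K\alpha = y$ (the analyzed method works in a slightly different inner product) and, for concreteness, stops by monitoring a held-out set rather than implementing the paper's stopping rules.

```python
import numpy as np

def kernel_cg_early_stop(K_tr, y_tr, K_val, y_val, t_max=50):
    # Conjugate gradient on K alpha = y; the iteration count is the
    # regularizer. Keep the iterate with the smallest held-out error.
    # K_val has shape (n_val, n_train).
    alpha = np.zeros_like(y_tr)
    r = y_tr - K_tr @ alpha
    p = r.copy()
    best_alpha, best_err = alpha.copy(), np.inf
    for _ in range(t_max):
        Kp = K_tr @ p
        step = (r @ r) / (p @ Kp)
        alpha += step * p
        r_new = r - step * Kp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        err = np.mean((K_val @ alpha - y_val) ** 2)  # held-out risk
        if err < best_err:
            best_err, best_alpha = err, alpha.copy()
    return best_alpha
```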
Kernel Conjugate Gradient Methods with Random Projections
We propose and study kernel conjugate gradient methods (KCGM) with random
projections for least-squares regression over a separable Hilbert space.
Considering two types of random projections generated by randomized sketches
and Nystr\"{o}m subsampling, we prove optimal statistical results with respect
to variants of norms for the algorithms under a suitable stopping rule.
Particularly, our results show that if the projection dimension is proportional
to the effective dimension of the problem, KCGM with randomized sketches can
generalize optimally, while achieving a computational advantage. As a
corollary, we derive optimal rates for classic KCGM in the case that the target
function may not be in the hypothesis space, filling a theoretical gap. Comment: 43 pages, 2 figures
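A rough picture of the projection step, under assumptions of my own choosing (a Gaussian randomized sketch rather than the paper's full menu, scipy's CG as the early-stopped solver): restrict the coefficient vector to the range of a random m x n sketch and iterate on the reduced problem.

```python
import numpy as np
from scipy.sparse.linalg import cg

def sketched_kernel_cg(K, y, m, t=20, seed=0):
    # Restrict coefficients to range(S^T) for a random Gaussian sketch S,
    # then run t CG iterations on the m x m reduced normal equations of
    # min_beta ||K S^T beta - y||^2. If m scales with the effective
    # dimension, the abstract's result says optimal rates are preserved.
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, K.shape[0])) / np.sqrt(m)
    C = K @ S.T                                # n x m projected kernel
    beta, _ = cg(C.T @ C, C.T @ y, maxiter=t)  # early-stopped CG
    return S.T @ beta                          # back to alpha in R^n
```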
Do optimization methods in deep learning applications matter?
With advances in deep learning, exponential data growth and increasing model
complexity, developing efficient optimization methods is attracting much
research attention. Several implementations favor Conjugate Gradient (CG) and
Stochastic Gradient Descent (SGD) as practical and elegant solutions for
achieving quick convergence; however, these optimization processes also
present many limitations in learning across deep learning applications.
Recent research is exploring higher-order optimization functions as better
approaches, but these present very complex computational challenges for
practical use. Comparing first- and higher-order optimization functions, our
experiments in this paper reveal that Levenberg-Marquardt (LM) converges
significantly faster than the first-order methods but suffers from very long
processing times, increasing the training complexity of both classification
and reinforcement learning problems. Our experiments compare off-the-shelf
optimization functions (CG, SGD, LM and L-BFGS) on standard CIFAR, MNIST,
CartPole and FlappyBird experiments. The paper presents arguments on which optimization
functions to use and further, which functions would benefit from
parallelization efforts to improve pretraining time and learning rate
convergence.
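The comparison pattern is easy to reproduce on a toy objective. A hypothetical mini-harness using scipy's off-the-shelf CG and L-BFGS on the Rosenbrock function, standing in for the paper's CIFAR/MNIST/CartPole/FlappyBird training runs:

```python
import numpy as np
from scipy.optimize import minimize, rosen

# Swap in your own loss for rosen to reproduce the comparison pattern;
# LM is not available through scipy.optimize.minimize, so only two of
# the paper's four optimizers appear here.
x0 = np.zeros(10)
for method in ("CG", "L-BFGS-B"):
    res = minimize(rosen, x0, method=method)
    print(f"{method:>8}: f={res.fun:.2e}, evals={res.nfev}")
```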
Early stopping and non-parametric regression: An optimal data-dependent stopping rule
The strategy of early stopping is a regularization technique based on
choosing a stopping time for an iterative algorithm. Focusing on non-parametric
regression in a reproducing kernel Hilbert space, we analyze the early stopping
strategy for a form of gradient-descent applied to the least-squares loss
function. We propose a data-dependent stopping rule that does not involve
hold-out or cross-validation data, and we prove upper bounds on the squared
error of the resulting function estimate, measured in either the $L^2(\mathbb{P})$
or the $L^2(\mathbb{P}_n)$ norm. These upper bounds lead to minimax-optimal rates for various
kernel classes, including Sobolev smoothness classes and other forms of
reproducing kernel Hilbert spaces. We show through simulation that our stopping
rule compares favorably to two other stopping rules, one based on hold-out data
and the other based on Stein's unbiased risk estimate. We also establish a
tight connection between our early stopping strategy and the solution path of a
kernel ridge regression estimator. Comment: 29 pages, 4 figures
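The iteration being stopped is ordinary gradient descent on the least-squares loss in the RKHS, which in coefficient form is a one-line recursion. A minimal sketch, with the stopping time T left as an input; the paper's contribution, a data-dependent rule for T computed from the empirical kernel eigenvalues without hold-out data, is not reproduced here.

```python
import numpy as np

def kernel_gd(K, y, eta, T):
    # Functional gradient descent for least squares in an RKHS:
    # f_t(.) = sum_i alpha[i] * k(., x_i), updated by
    # alpha <- alpha - (eta / n) * (K @ alpha - y).
    # Running exactly T steps is the early-stopping regularizer.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(T):
        alpha -= (eta / n) * (K @ alpha - y)
    return alpha
```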
Optimization with Sparsity-Inducing Penalties
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. They were first dedicated to linear variable
selection but numerous extensions have now emerged such as structured sparsity
or kernel selection. It turns out that many of the related estimation problems
can be cast as convex optimization problems by regularizing the empirical risk
with appropriate non-smooth norms. The goal of this paper is to present from a
general perspective optimization tools and techniques dedicated to such
sparsity-inducing penalties. We cover proximal methods, block-coordinate
descent, reweighted $\ell_2$-penalized techniques, working-set and homotopy
methods, as well as non-convex formulations and extensions, and provide an
extensive set of experiments to compare various algorithms from a computational
point of view.
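As one concrete instance of the proximal methods the paper covers: for $\ell_1$-penalized least squares, the proximal operator of the penalty is soft-thresholding, which gives the classic ISTA iteration. This is a textbook sketch, not code from the paper; for convergence the step size should be at most one over the largest eigenvalue of $X^T X$.

```python
import numpy as np

def ista(X, y, lam, step, T=200):
    # Proximal gradient for min_w 0.5 * ||X w - y||^2 + lam * ||w||_1:
    # a gradient step on the smooth part, then the l1 proximal operator
    # (soft-thresholding), which is what produces exact zeros in w.
    w = np.zeros(X.shape[1])
    for _ in range(T):
        z = w - step * (X.T @ (X @ w - y))  # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w
```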
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose
gradients are only accessible through a stochastic oracle that returns the
gradient at any given point plus a zero-mean finite variance random error. We
present the first algorithm that achieves jointly the optimal prediction error
rates for least-squares regression, both in terms of forgetting of initial
conditions in $O(1/n^2)$, and in terms of dependence on the noise and dimension $d$
of the problem, as $O(d/n)$. Our new algorithm is based on averaged accelerated
regularized gradient descent, and may also be analyzed through finer
assumptions on initial conditions and the Hessian matrix, leading to
dimension-free quantities that may still be small while the "optimal" terms
above are large. In order to characterize the tightness of these new bounds, we
consider an application to non-parametric regression and use the known lower
bounds on the statistical performance (without computational limits), which
happen to match our bounds obtained from a single pass on the data and thus
show optimality of our algorithm in a wide variety of particular trade-offs
between bias and variance.
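The averaging component is easy to isolate. Below is a minimal single-pass sketch of Polyak-Ruppert averaged SGD for least squares; the acceleration and regularization the paper adds to also obtain the $O(1/n^2)$ bias term are omitted, so this only illustrates where the $O(d/n)$ variance behavior comes from.

```python
import numpy as np

def averaged_sgd(X, y, step, seed=0):
    # One pass over the data; w_bar is the running average of the SGD
    # iterates (Polyak-Ruppert averaging), which is what tames the noise.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        w -= step * (X[i] @ w - y[i]) * X[i]  # stochastic gradient step
        w_bar += (w - w_bar) / t              # running average of iterates
    return w_bar
```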