98 research outputs found
Training (Overparametrized) Neural Networks in Near-Linear Time
The slow convergence rate and pathological curvature issues of first-order
gradient methods for training deep neural networks have initiated an ongoing
effort to develop faster second-order optimization
algorithms beyond SGD, without compromising the generalization error. Despite
their remarkable convergence rate (independent of the training batch
size $n$), second-order algorithms incur a daunting slowdown in the
cost per iteration (inverting the Hessian
matrix of the loss function), which renders them impractical. Very recently,
this computational overhead was mitigated by the works of [ZMG19, CGH+19],
yielding an $O(mn^2)$-time second-order algorithm for training two-layer
overparametrized neural networks of polynomial width $m$.
We show how to speed up the algorithm of [CGH+19], achieving an
$\tilde{O}(mn)$-time backpropagation algorithm for training (mildly
overparametrized) ReLU networks, which is near-linear in the dimension ($mn$)
of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to
reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and
then use a Fast-JL type dimension reduction to precondition the
underlying Gram matrix in time independent of $M$, allowing us to find a
sufficiently good approximate solution via first-order
conjugate gradient. Our result provides a proof-of-concept that advanced
machinery from randomized linear algebra -- which led to recent breakthroughs
in convex optimization (ERM, LPs, Regression) -- can be
carried over to the realm of deep learning as well.
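The sketch-then-precondition idea in this abstract can be illustrated on a plain least-squares problem. The following is a minimal sketch under stated assumptions, not the authors' algorithm: a dense Gaussian sketch stands in for the Fast-JL transform, the problem is a generic tall system rather than a Gauss-Newton step, and the function name `sketch_precondition_lstsq` and its parameters are hypothetical.

```python
import numpy as np

def sketch_precondition_lstsq(A, b, sketch_rows=None, iters=100, tol=1e-12, seed=0):
    """Approximately solve min_x ||A x - b||_2 for a tall matrix A (n >> d)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = sketch_rows or 4 * d
    # Assumption: a Gaussian sketch replaces the Fast-JL transform for simplicity.
    S = rng.standard_normal((s, n)) / np.sqrt(s)
    # The R factor of the sketched matrix acts as a right preconditioner,
    # so that B = A R^{-1} is well conditioned with high probability.
    _, R = np.linalg.qr(S @ A)

    def apply_B(y):
        return A @ np.linalg.solve(R, y)

    def apply_Bt(r):
        return np.linalg.solve(R.T, A.T @ r)

    # Conjugate gradient on the normal equations B^T B y = B^T b.
    y = np.zeros(d)
    r = apply_Bt(b)
    p = r.copy()
    rs = r @ r
    r0_norm = np.sqrt(rs)
    for _ in range(iters):
        if np.sqrt(rs) <= tol * r0_norm:
            break
        Bp = apply_B(p)
        alpha = rs / (Bp @ Bp)
        y += alpha * p
        r -= alpha * apply_Bt(Bp)
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    # Undo the change of variables x = R^{-1} y.
    return np.linalg.solve(R, y)
```

Because the preconditioner is computed from the small sketched matrix only, its cost does not grow with the number of rows of `A` beyond the sketching step itself, and the iteration count of conjugate gradient depends on the quality of the sketch rather than on the conditioning of `A`.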
PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates
This paper introduces PROMISE (Preconditioned Stochastic
Optimization Methods by Incorporating
Scalable Curvature Estimates), a suite of sketching-based
preconditioned stochastic gradient algorithms for solving large-scale convex
optimization problems arising in machine learning. PROMISE includes
preconditioned versions of SVRG, SAGA, and Katyusha; each algorithm comes with
a strong theoretical analysis and effective default hyperparameter values. In
contrast, traditional stochastic gradient methods require careful
hyperparameter tuning to succeed, and degrade in the presence of
ill-conditioning, a ubiquitous phenomenon in machine learning. Empirically, we
verify the superiority of the proposed algorithms by showing that, using
default hyperparameter values, they outperform or match popular tuned
stochastic gradient optimizers on a test bed of ridge and logistic
regression problems assembled from benchmark machine learning repositories. On
the theoretical side, this paper introduces the notion of quadratic regularity
in order to establish linear convergence of all proposed methods even when the
preconditioner is updated infrequently. The speed of linear convergence is
determined by the quadratic regularity ratio, which often provides a tighter
bound on the convergence rate compared to the condition number, both in theory
and in practice, and explains the fast global linear convergence of the
proposed methods. Comment: 127 pages, 31 Figures
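To make the idea of a preconditioned variance-reduced method concrete, here is a minimal sketch of preconditioned SVRG on ridge regression. It is an illustration under assumptions, not the PROMISE implementation: the preconditioner is a Hessian estimate from a random row subsample, computed once and never refreshed, and the names `preconditioned_svrg_ridge` and `ridge_objective`, the step size, and the batch sizes are hypothetical choices rather than the paper's defaults.

```python
import numpy as np

def ridge_objective(A, b, lam, w):
    """f(w) = (1/2n) ||A w - b||^2 + (lam/2) ||w||^2."""
    r = A @ w - b
    return 0.5 * (r @ r) / len(b) + 0.5 * lam * (w @ w)

def preconditioned_svrg_ridge(A, b, lam, step=0.05, epochs=20,
                              inner_steps=200, batch=100,
                              sketch_rows=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Assumption: subsampled-Hessian preconditioner, built once.
    rows = rng.choice(n, size=min(sketch_rows, n), replace=False)
    P = A[rows].T @ A[rows] / len(rows) + lam * np.eye(d)
    P_chol = np.linalg.cholesky(P)

    def apply_P_inv(g):
        # Solve P x = g using the factorization P = L L^T.
        return np.linalg.solve(P_chol.T, np.linalg.solve(P_chol, g))

    def grad(w, idx):
        Ai = A[idx]
        return Ai.T @ (Ai @ w - b[idx]) / len(idx) + lam * w

    w = np.zeros(d)
    all_rows = np.arange(n)
    for _ in range(epochs):
        # Snapshot point and its full gradient (the SVRG anchor).
        w_snap = w.copy()
        full_grad = grad(w_snap, all_rows)
        for _ in range(inner_steps):
            idx = rng.choice(n, size=min(batch, n), replace=False)
            # Variance-reduced minibatch gradient.
            g = grad(w, idx) - grad(w_snap, idx) + full_grad
            # Preconditioned step.
            w = w - step * apply_P_inv(g)
    return w
```

With the preconditioner applied, the same fixed step size works even when the columns of `A` have very different scales, which is the kind of ill-conditioning the abstract refers to.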
FALKON: An Optimal Large Scale Kernel Method
Kernel methods provide a principled way to perform nonlinear, nonparametric
learning. They rely on solid functional analytic foundations and enjoy optimal
statistical properties. However, at least in their basic form, they have
limited applicability in large scale scenarios because of stringent
computational requirements in terms of time and especially memory. In this
paper, we take a substantial step in scaling up kernel methods, proposing
FALKON, a novel algorithm that can efficiently process millions of
points. FALKON is derived by combining several algorithmic principles, namely
stochastic subsampling, iterative solvers and preconditioning. Our theoretical
analysis shows that optimal statistical accuracy is achieved requiring
essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental
analysis on large scale datasets shows that, even with a single machine, FALKON
outperforms previous state of the art solutions, which exploit
parallel/distributed architectures. Comment: NIPS 201
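The three ingredients named in this abstract (subsampling, an iterative solver, and preconditioning) can be combined in a small sketch for kernel ridge regression. This is a simplified illustration under assumptions, not the FALKON code: centers are sampled uniformly, everything is held in dense NumPy arrays, the preconditioner is formed explicitly from the center kernel matrix with a small jitter, and the names `gaussian_kernel` and `nystrom_krr_pcg` are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def nystrom_krr_pcg(X, y, num_centers=30, lam=1e-3, sigma=1.0,
                    iters=100, tol=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = num_centers
    # Stochastic subsampling: pick m centers uniformly at random.
    centers = X[rng.choice(n, size=m, replace=False)]
    K_nm = gaussian_kernel(X, centers, sigma)
    K_mm = gaussian_kernel(centers, centers, sigma)

    def apply_H(v):
        # H = K_nm^T K_nm + lam * n * K_mm, applied without forming H.
        return K_nm.T @ (K_nm @ v) + lam * n * (K_mm @ v)

    rhs = K_nm.T @ y
    # Preconditioner: replace K_nm^T K_nm by (n/m) K_mm^2 (assumed form).
    P = (n / m) * K_mm @ K_mm + lam * n * K_mm
    P += 1e-10 * np.trace(P) / m * np.eye(m)  # jitter for numerical safety
    L = np.linalg.cholesky(P)

    def apply_P_inv(r):
        return np.linalg.solve(L.T, np.linalg.solve(L, r))

    # Preconditioned conjugate gradient on H alpha = rhs.
    alpha = np.zeros(m)
    r = rhs.copy()
    q = apply_P_inv(r)
    p = q.copy()
    rq = r @ q
    rhs_norm = np.linalg.norm(rhs)
    for _ in range(iters):
        if np.linalg.norm(r) <= tol * rhs_norm:
            break
        Hp = apply_H(p)
        step = rq / (p @ Hp)
        alpha += step * p
        r -= step * Hp
        q = apply_P_inv(r)
        rq_new = r @ q
        p = q + (rq_new / rq) * p
        rq = rq_new
    return centers, alpha
```

Predictions at new points `X_new` would then be `gaussian_kernel(X_new, centers, sigma) @ alpha`. Only the `n` by `m` and `m` by `m` kernel blocks are ever formed, which is what keeps memory far below the `n` by `n` cost of exact kernel ridge regression.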
Randomized Riemannian Preconditioning for Orthogonality Constrained Problems
Optimization problems with (generalized) orthogonality constraints are
prevalent across science and engineering. For example, in computational science
they arise in the symmetric (generalized) eigenvalue problem, in nonlinear
eigenvalue problems, and in electronic structure computations, to name a few
problems. In statistics and machine learning, they arise, for example, in
canonical correlation analysis and in linear discriminant analysis. In this
article, we consider using randomized preconditioning in the context of
optimization problems with generalized orthogonality constraints. Our proposed
algorithms are based on Riemannian optimization on the generalized Stiefel
manifold equipped with a non-standard preconditioned geometry, which
requires us to develop the geometric components needed by
algorithms based on this approach. Furthermore, we perform an asymptotic
convergence analysis of the preconditioned algorithms, which helps to
characterize the quality of a given preconditioner using second-order
information. Finally, for the problems of canonical correlation analysis and
linear discriminant analysis, we develop randomized preconditioners along with
corresponding bounds on the relevant condition number.
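For readers unfamiliar with generalized orthogonality constraints, the condition is that the columns of X satisfy X^T B X = I for a symmetric positive definite matrix B, and the set of such X is the generalized Stiefel manifold. The sketch below only illustrates the constraint and one simple way to map an arbitrary full-rank matrix onto that set through a Cholesky factor. It does not implement the preconditioned geometry or the randomized preconditioners of the paper, and the function name `b_orthonormalize` is hypothetical.

```python
import numpy as np

def b_orthonormalize(Y, B):
    """Return X with the same column span as Y and X^T B X = I.

    Assumes B is symmetric positive definite and Y has full column rank.
    """
    # Gram matrix in the B-inner product, factored as M = L L^T.
    M = Y.T @ B @ Y
    L = np.linalg.cholesky(M)
    # X = Y L^{-T}, so X^T B X = L^{-1} M L^{-T} = I.
    return np.linalg.solve(L, Y.T).T

def constraint_violation(X, B):
    """Frobenius norm of X^T B X - I."""
    p = X.shape[1]
    return np.linalg.norm(X.T @ B @ X - np.eye(p))
```

A map of this kind is typically used after each optimization step to pull the iterate back onto the constraint set.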