    Training (Overparametrized) Neural Networks in Near-Linear Time

    The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks have initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $M$, which allows a sufficiently good approximate solution to be found via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.
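
    The sketch-then-precondition idea in this abstract can be illustrated with a small NumPy example. It is only a sketch under stated assumptions, not the paper's algorithm: a dense Gaussian sketch stands in for the Fast-JL transform, a random matrix `J` stands in for the network Jacobian, and the names `sketch_preconditioner` and `preconditioned_cg` as well as all sizes are hypothetical.

```python
import numpy as np

def sketch_preconditioner(A, sketch_rows, rng):
    """Return triangular R with R.T @ R ~= A.T @ A, from a QR of the sketch S @ A."""
    # Assumption: a dense Gaussian sketch replaces the Fast-JL transform.
    S = rng.standard_normal((sketch_rows, A.shape[0])) / np.sqrt(sketch_rows)
    return np.linalg.qr(S @ A, mode="r")

def preconditioned_cg(J, b, R, tol=1e-10, max_iter=200):
    """Solve (J @ J.T) g = b by CG, preconditioned with (R.T @ R)^{-1}."""
    gram = lambda v: J @ (J.T @ v)          # applies the Gram matrix without forming it
    prec = lambda v: np.linalg.solve(R, np.linalg.solve(R.T, v))
    g = np.zeros_like(b)
    res = b.copy()
    z = prec(res)
    p = z.copy()
    rz = res @ z
    for k in range(1, max_iter + 1):
        Gp = gram(p)
        alpha = rz / (p @ Gp)
        g += alpha * p
        res -= alpha * Gp
        if np.linalg.norm(res) <= tol * np.linalg.norm(b):
            break
        z = prec(res)
        rz_new = res @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return g, k

rng = np.random.default_rng(0)
n, p_dim = 50, 2000                          # hypothetical: n samples, p_dim parameters
# Row scaling makes the Gram matrix J @ J.T ill-conditioned on purpose.
J = np.logspace(0, 2, n)[:, None] * rng.standard_normal((n, p_dim))
b = rng.standard_normal(n)                   # stand-in for the residual vector
R = sketch_preconditioner(J.T, 4 * n, rng)   # sketch the tall matrix J.T
g, iters = preconditioned_cg(J, b, R)
```

    The sketch compresses the tall matrix `J.T` to `4 * n` rows, so the preconditioner costs one small QR factorization, and the iteration count of conjugate gradient then depends on the quality of the sketch rather than on the conditioning of the Gram matrix.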

    PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates

    This paper introduces PROMISE (Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates), a suite of sketching-based preconditioned stochastic gradient algorithms for solving large-scale convex optimization problems arising in machine learning. PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha; each algorithm comes with a strong theoretical analysis and effective default hyperparameter values. In contrast, traditional stochastic gradient methods require careful hyperparameter tuning to succeed, and degrade in the presence of ill-conditioning, a ubiquitous phenomenon in machine learning. Empirically, we verify the superiority of the proposed algorithms by showing that, using default hyperparameter values, they outperform or match popular tuned stochastic gradient optimizers on a test bed of $51$ ridge and logistic regression problems assembled from benchmark machine learning repositories. On the theoretical side, this paper introduces the notion of quadratic regularity in order to establish linear convergence of all proposed methods even when the preconditioner is updated infrequently. The speed of linear convergence is determined by the quadratic regularity ratio, which often provides a tighter bound on the convergence rate compared to the condition number, both in theory and in practice, and explains the fast global linear convergence of the proposed methods. Comment: 127 pages, 31 Figures
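
    To make the idea of a preconditioned variance-reduced method concrete, here is a minimal NumPy sketch of SVRG on a ridge regression problem, with each step multiplied by the inverse of a subsampled Hessian that is refreshed only every few epochs. Everything here is an assumption for illustration: the function names, the choice of a subsampled-Hessian preconditioner, the damping `rho`, and all default values are hypothetical and are not taken from the PROMISE paper or its software.

```python
import numpy as np

def subsampled_hessian(X, mu, rho, batch, rng):
    """Ridge Hessian estimated from a random row subsample, plus damping rho."""
    idx = rng.choice(X.shape[0], size=batch, replace=False)
    Xs = X[idx]
    return Xs.T @ Xs / batch + (mu + rho) * np.eye(X.shape[1])

def preconditioned_svrg(X, y, mu, eta=0.1, epochs=30, inner=100,
                        grad_batch=64, hess_batch=400, rho=1e-5,
                        refresh_every=5, seed=0):
    """Minimize 1/(2n) ||Xw - y||^2 + mu/2 ||w||^2 with preconditioned SVRG."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        if epoch % refresh_every == 0:
            # Infrequent preconditioner update: reuse P for several epochs.
            P = subsampled_hessian(X, mu, rho, hess_batch, rng)
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n + mu * w_snap
        for _ in range(inner):
            idx = rng.choice(n, size=grad_batch, replace=False)
            Xb, yb = X[idx], y[idx]
            g_w = Xb.T @ (Xb @ w - yb) / grad_batch + mu * w
            g_snap = Xb.T @ (Xb @ w_snap - yb) / grad_batch + mu * w_snap
            direction = g_w - g_snap + full_grad   # variance-reduced gradient
            w -= eta * np.linalg.solve(P, direction)
    return w

rng = np.random.default_rng(1)
n, d = 2000, 20
# Column scaling gives an ill-conditioned design matrix.
X = rng.standard_normal((n, d)) * np.logspace(0, -1.5, d)
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
mu = 1e-4
w = preconditioned_svrg(X, y, mu)
# Closed-form ridge solution, used only as a reference for this toy problem.
w_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
```

    Because the preconditioner roughly whitens the curvature, a single fixed step size can serve all coordinate directions of this toy problem, which is the intuition behind the abstract's claim about default hyperparameters.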

    FALKON: An Optimal Large Scale Kernel Method

    Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large-scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large-scale datasets shows that, even with a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures. Comment: NIPS 201
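
    The three ingredients named in this abstract (stochastic subsampling, an iterative solver, and preconditioning) can be combined as in the following NumPy sketch of Nyström kernel ridge regression solved by preconditioned conjugate gradient. It is written in the spirit of FALKON but is not the authors' implementation: the function names, the Gaussian kernel, the `jitter` value, the stopping rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def conjugate_gradient(apply_op, b, iters, tol=1e-10):
    """Plain CG for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = apply_op(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def falkon_like(X, y, num_centers, sigma, lam, iters=30, jitter=1e-8, seed=0):
    """Nystrom kernel ridge regression via preconditioned CG (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, M = X.shape[0], num_centers
    centers = X[rng.choice(n, size=M, replace=False)]   # stochastic subsampling
    KnM = gaussian_kernel(X, centers, sigma)
    KMM = gaussian_kernel(centers, centers, sigma)
    # Upper-triangular factors: T.T @ T ~= KMM, A.T @ A = T @ T.T / M + lam * I.
    T = np.linalg.cholesky(KMM + jitter * np.eye(M)).T
    A = np.linalg.cholesky(T @ T.T / M + lam * np.eye(M)).T
    solve = np.linalg.solve

    def apply_op(u):
        # Preconditioned operator B.T @ (KnM.T @ KnM / n + lam * T.T @ T) @ B,
        # with B = inv(T) @ inv(A); KnM.T @ KnM is never formed explicitly.
        v = solve(A, u)
        w = solve(T, v)
        h = solve(T.T, KnM.T @ (KnM @ w)) / n + lam * v
        return solve(A.T, h)

    rhs = solve(A.T, solve(T.T, KnM.T @ y / n))
    beta = conjugate_gradient(apply_op, rhs, iters)
    alpha = solve(T, solve(A, beta))                     # back to kernel coefficients
    return centers, alpha

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(1000)
centers, alpha = falkon_like(X, y, num_centers=100, sigma=0.1, lam=1e-3)
pred = gaussian_kernel(X, centers, 0.1) @ alpha
```

    Only the `n`-by-`M` and `M`-by-`M` kernel blocks are stored, and each CG iteration costs two products with the `n`-by-`M` block, which is where the memory and time savings over a full `n`-by-`n` kernel solve come from.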

    Randomized Riemannian Preconditioning for Orthogonality Constrained Problems

    Optimization problems with (generalized) orthogonality constraints are prevalent across science and engineering. For example, in computational science they arise in the symmetric (generalized) eigenvalue problem, in nonlinear eigenvalue problems, and in electronic structure computations, to name a few. In statistics and machine learning, they arise, for example, in canonical correlation analysis and in linear discriminant analysis. In this article, we consider using randomized preconditioning in the context of optimization problems with generalized orthogonality constraints. Our proposed algorithms are based on Riemannian optimization on the generalized Stiefel manifold equipped with a non-standard preconditioned geometry, which requires deriving the geometric components needed to build algorithms on this approach. Furthermore, we perform an asymptotic convergence analysis of the preconditioned algorithms, which helps to characterize the quality of a given preconditioner using second-order information. Finally, for the problems of canonical correlation analysis and linear discriminant analysis, we develop randomized preconditioners along with corresponding bounds on the relevant condition number.
