18,915 research outputs found
A Stochastic Derivative Free Optimization Method with Momentum
We consider the problem of unconstrained minimization of a smooth objective
function in a setting where only function evaluations are possible. We propose
and analyze a stochastic zeroth-order method with heavy-ball momentum. In
particular, we propose SMTP, a momentum version of the stochastic three-point
method (STP) \cite{Bergou_2018}. We show new complexity results for
non-convex, convex, and strongly convex functions. We test our method on a
collection of continuous control tasks on several MuJoCo \cite{Todorov_2012}
environments with varying difficulty and compare against STP, other
state-of-the-art derivative-free optimization algorithms, and policy gradient
methods. SMTP significantly outperforms STP and all other methods considered
in our numerical experiments. Our second contribution is SMTP with importance
sampling, which we call SMTP_IS. We provide a convergence analysis of this
method for non-convex, convex, and strongly convex objectives.
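A minimal sketch of the kind of zeroth-order heavy-ball step the abstract
above describes, assuming Gaussian direction sampling and a best-of-three
acceptance rule; the step sizes and details are illustrative, not the authors'
exact scheme:

```python
import numpy as np

def smtp_sketch(f, x0, alpha=0.1, beta=0.5, iters=1000, seed=0):
    """Illustrative zeroth-order heavy-ball loop in the spirit of SMTP.

    Each iteration probes the objective along a random unit direction,
    combined with momentum in both signs, and keeps the best of the
    three candidates (including staying put), so f never increases.
    """
    rng = np.random.default_rng(seed)
    x, v = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(iters):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)                         # random unit direction
        v_plus, v_minus = beta * v + s, beta * v - s   # momentum candidates
        cands = [(x, v),
                 (x - alpha * v_plus, v_plus),
                 (x - alpha * v_minus, v_minus)]
        x, v = min(cands, key=lambda c: f(c[0]))       # best of three
    return x

# usage: minimize a quadratic using only function evaluations
x_star = smtp_sketch(lambda z: float(np.sum(z ** 2)), np.ones(10))
```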
Fast, Better Training Trick -- Random Gradient
In this paper, we present a simple method to accelerate training and improve
performance, which we call random gradient (RG). The method can be added to
the training of any model without extra computational cost; we use image
classification, semantic segmentation, and GANs to confirm that it speeds up
the training of computer vision models. The central idea is to multiply the
loss by a random number so as to randomly scale down the back-propagated
gradient. Using this method, we obtain better results on the Pascal VOC,
CIFAR, and Cityscapes datasets.
Comment: arXiv admin note: text overlap with arXiv:1708.07120 by other authors
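The core trick lends itself to a short sketch. The following hypothetical
PyTorch training step scales the loss by a random factor before
backpropagation; the uniform (0, 1) distribution is an assumption, not
necessarily the paper's exact choice:

```python
import random
import torch

def rg_step(model, criterion, optimizer, inputs, targets):
    """One hypothetical training step with the random gradient (RG) trick:
    the loss is scaled by a random factor in (0, 1) before backpropagation,
    which shrinks every gradient for this step at no extra cost."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    (random.random() * loss).backward()  # randomly scaled gradients
    optimizer.step()
    return loss.item()                   # report the unscaled loss
```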
A Latent Variational Framework for Stochastic Optimization
This paper provides a unifying theoretical framework for stochastic
optimization algorithms by means of a latent stochastic variational problem.
Using techniques from stochastic control, the solution to the variational
problem is shown to be equivalent to that of a Forward Backward Stochastic
Differential Equation (FBSDE). By solving these equations, we recover a variety
of existing adaptive stochastic gradient descent methods. This framework
establishes a direct connection between stochastic optimization algorithms and
a secondary Bayesian inference problem on gradients, where a prior measure on
noisy gradient observations determines the resulting algorithm.
Comment: 8 pages main content, 8 pages appendix
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
In this paper, we describe a phenomenon we call "super-convergence", where
neural networks can be trained an order of magnitude faster than with standard
training methods. The existence of super-convergence is relevant to
understanding why deep networks generalize well. One of the key elements of
super-convergence is training with one learning rate cycle and a large maximum
learning rate. A primary insight that enables super-convergence training is
that large learning rates regularize the training, hence requiring a reduction
of all other forms of regularization in order to preserve an optimal
regularization balance. We also derive a simplification of the Hessian-free
optimization method to compute an estimate of the optimal learning rate.
Experiments demonstrate super-convergence on the CIFAR-10/100, MNIST, and
ImageNet datasets with ResNet, Wide ResNet, DenseNet, and Inception
architectures. In addition, we show that super-convergence provides a greater
boost in performance relative to standard training when the amount of labeled
training data is limited. The architectures and code to replicate the figures
in this paper are available at github.com/lnsmith54/super-convergence. See
http://www.fast.ai/2018/04/30/dawnbench-fastai/ for an application of
super-convergence to win the DAWNBench challenge (see
https://dawn.cs.stanford.edu/benchmark/).
Comment: This paper was significantly revised to show super-convergence as a
general fast training methodology
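For illustration, a minimal sketch of a single learning-rate cycle of the kind
the abstract describes, assuming a linear ramp up to the maximum rate and back
down; the base rate and triangular shape are assumptions, not the paper's
exact schedule:

```python
def one_cycle_lr(step, total_steps, max_lr, base_lr=None):
    """Learning rate for one triangular cycle: ramp linearly from base_lr
    up to max_lr at mid-cycle, then back down. The base_lr default of
    max_lr / 10 is an assumption, not the paper's prescription."""
    base_lr = max_lr / 10.0 if base_lr is None else base_lr
    half = total_steps / 2.0
    frac = step / half if step <= half else (total_steps - step) / half
    return base_lr + (max_lr - base_lr) * frac

# usage: a 10,000-step cycle peaking at a deliberately large rate
lrs = [one_cycle_lr(t, 10_000, max_lr=3.0) for t in range(10_001)]
```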
Deep Frank-Wolfe For Neural Network Optimization
Learning a deep neural network requires solving a challenging optimization
problem: it is a high-dimensional, non-convex and non-smooth minimization
problem with a large number of terms. The current practice in neural network
optimization is to rely on the stochastic gradient descent (SGD) algorithm or
its adaptive variants. However, SGD requires a hand-designed schedule for the
learning rate. In addition, its adaptive variants tend to produce solutions
that generalize less well on unseen data than SGD with a hand-designed
schedule. We present an optimization method that offers empirically the best of
both worlds: our algorithm yields good generalization performance while
requiring only one hyper-parameter. Our approach is based on a composite
proximal framework, which exploits the compositional nature of deep neural
networks and can leverage powerful convex optimization algorithms by design.
Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes
an optimal step-size in closed-form at each time-step. We further show that the
descent direction is given by a simple backward pass in the network, yielding
the same computational cost per iteration as SGD. We present experiments on the
CIFAR and SNLI data sets, where we demonstrate the significant superiority of
our method over Adam, Adagrad, as well as the recently proposed BPGrad and
AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed
learning rate schedule, and show that it provides similar generalization while
converging faster. The code is publicly available at
https://github.com/oval-group/dfw.
Comment: Published as a conference paper at ICLR 2019
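To illustrate the closed-form step-size idea, here is a hedged sketch of one
Frank-Wolfe step with exact line search on a quadratic objective, a stand-in
for the proximal SVM problem DFW actually solves; the quadratic model and
names are assumptions, not the paper's formulation:

```python
import numpy as np

def fw_quadratic_step(x, Q, b, s):
    """One Frank-Wolfe step on f(x) = 0.5 x^T Q x + b^T x with the
    closed-form optimal step size given by exact line search, clipped
    to [0, 1]. Here s is the vertex returned by the linear minimization
    oracle; Q must be symmetric positive semidefinite."""
    d = s - x                            # Frank-Wolfe direction
    grad = Q @ x + b
    curv = d @ Q @ d                     # curvature along d
    gamma = 1.0 if curv <= 0 else float(np.clip(-(grad @ d) / curv, 0.0, 1.0))
    return x + gamma * d
```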
Convergence Analysis of Proximal Gradient with Momentum for Nonconvex Optimization
In many modern machine learning applications, structures of underlying
mathematical models often yield nonconvex optimization problems. Due to the
intractability of nonconvexity, there is a rising need to develop efficient
methods for solving general nonconvex problems with certain performance
guarantee. In this work, we investigate the accelerated proximal gradient
method for nonconvex programming (APGnc). The method compares a usual
proximal gradient step with a linear extrapolation step and accepts the one
that yields the lower function value, achieving a monotonic decrease.
Specifically, under a general nonsmooth and nonconvex setting, we provide a
rigorous argument
to show that the limit points of the sequence generated by APGnc are critical
points of the objective function. Then, by exploiting the
Kurdyka-{\L}ojasiewicz (\KL) property for a broad class of functions, we
establish the linear and sub-linear convergence rates of the function value
sequence generated by APGnc. We further propose a stochastic variance reduced
APGnc (SVRG-APGnc), and establish its linear convergence under a special case
of the \KL property. We also extend the analysis to the inexact version of
these methods and develop an adaptive momentum strategy that improves the
numerical performance.
Comment: Accepted at ICML 2017, 9 pages, 4 figures
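A minimal sketch of the accept-the-better-step rule described above, assuming
a fixed step size and a simple heavy-ball extrapolation; the paper's parameter
choices and inexact variants are more refined:

```python
import numpy as np

def apgnc_sketch(F, grad_f, prox_g, x0, lr=0.1, momentum=0.9, iters=200):
    """Illustrative APGnc-style loop for F = f + g: take both a plain
    proximal gradient step and an extrapolated one, then keep whichever
    has the lower composite objective, guaranteeing a monotone decrease."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        plain = prox_g(x - lr * grad_f(x), lr)     # proximal gradient step
        y = x + momentum * (x - x_prev)            # extrapolation point
        accel = prox_g(y - lr * grad_f(y), lr)
        x_prev, x = x, min((plain, accel), key=F)  # monotone acceptance
    return x

# usage: soft-thresholding prox for g(x) = 0.01 * ||x||_1
prox_l1 = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - 0.01 * t, 0.0)
```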
L2 Regularization versus Batch and Weight Normalization
Batch Normalization is a commonly used trick to improve the training of deep
neural networks. These neural networks use L2 regularization, also called
weight decay, ostensibly to prevent overfitting. However, we show that L2
regularization has no regularizing effect when combined with normalization.
Instead, regularization has an influence on the scale of weights, and thereby
on the effective learning rate. We investigate this dependence, both in theory,
and experimentally. We show that popular optimization methods such as ADAM only
partially eliminate the influence of normalization on the learning rate. This
leads to a discussion of other ways to mitigate this issue.
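The scale-invariance argument is easy to check numerically. The following
sketch shows that rescaling the weights feeding a (non-affine) batch
normalization layer leaves the output essentially unchanged, so the L2 penalty
can act only through the weight scale and hence the effective learning rate:

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Plain batch normalization over the batch dimension (no affine part)."""
    return (h - h.mean(0)) / np.sqrt(h.var(0) + eps)

rng = np.random.default_rng(0)
x, W = rng.standard_normal((64, 10)), rng.standard_normal((10, 5))

# Rescaling W by any positive constant leaves the normalized output
# (numerically) unchanged, so shrinking ||W|| via L2 decay does not
# regularize the function; it only changes the effective learning rate.
assert np.allclose(batch_norm(x @ W), batch_norm(x @ (0.1 * W)), atol=1e-3)
```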
Preconditioner on Matrix Lie Group for SGD
We study two types of preconditioners and preconditioned stochastic gradient
descent (SGD) methods in a unified framework. We call the first one the Newton
type due to its close relationship to the Newton method, and the second one the
Fisher type as its preconditioner is closely related to the inverse of Fisher
information matrix. Both preconditioners can be derived from one framework, and
efficiently estimated on any matrix Lie group designated by the user using
natural or relative gradient descent minimizing certain preconditioner
estimation criteria. Many existing preconditioners and methods, e.g., RMSProp,
Adam, KFAC, equilibrated SGD, batch normalization, etc., are special cases of
or closely related to either the Newton type or the Fisher type ones.
Experimental results on relatively large-scale machine learning problems are
reported to study performance.
Comment: To appear at ICLR 2019
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
We propose an efficient method for approximating natural gradient descent in
neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC).
K-FAC is based on an efficiently invertible approximation of a neural network's
Fisher information matrix which is neither diagonal nor low-rank, and in some
cases is completely non-sparse. It is derived by approximating various large
blocks of the Fisher (corresponding to entire layers) as being the Kronecker
product of two much smaller matrices. While only several times more expensive
to compute than the plain stochastic gradient, the updates produced by K-FAC
make much more progress optimizing the objective, which results in an algorithm
that can be much faster than stochastic gradient descent with momentum in
practice. And unlike some previously proposed approximate
natural-gradient/Newton methods which use high-quality non-diagonal curvature
matrices (such as Hessian-free optimization), K-FAC works very well in highly
stochastic optimization regimes. This is because the cost of storing and
inverting K-FAC's approximation to the curvature matrix does not depend on the
amount of data used to estimate it, which is a feature typically associated
only with diagonal or low-rank approximations to the curvature matrix.
Comment: Reduction ratio formula corrected. Removed incorrect claim about
geodesics in footnote
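The Kronecker-factored inverse is simple to apply for a single fully connected
layer. A sketch, assuming the standard factorization of the layer's Fisher
block into input-activation and output-gradient second moments; damping and
statistics estimation are simplified relative to the paper:

```python
import numpy as np

def kfac_precondition(dW, a, g, damping=1e-3):
    """Apply a K-FAC-style preconditioner to the gradient dW of one fully
    connected layer. The layer's Fisher block is approximated by the
    Kronecker product A (x) G of the input-activation and output-gradient
    second-moment matrices, whose inverse applies as G^{-1} dW A^{-1}."""
    A = a.T @ a / a.shape[0] + damping * np.eye(a.shape[1])  # (n_in, n_in)
    G = g.T @ g / g.shape[0] + damping * np.eye(g.shape[1])  # (n_out, n_out)
    return np.linalg.solve(G, dW) @ np.linalg.inv(A)

# shapes: a is (batch, n_in), g is (batch, n_out), dW is (n_out, n_in)
```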
Advances in Optimizing Recurrent Networks
After a more than decade-long period of relatively little research activity
in the area of recurrent neural networks, several new developments will be
reviewed here that have allowed substantial progress both in understanding and
in technical solutions towards more efficient training of recurrent networks.
These advances have been motivated by and related to the optimization issues
surrounding deep learning. Although recurrent networks are extremely powerful
in what they can in principle represent in terms of modelling sequences, their
training is plagued by two aspects of the same issue regarding the learning of
long-term dependencies. Experiments reported here evaluate the use of clipping
gradients, spanning longer time ranges with leaky integration, advanced
momentum techniques, using more powerful output probability models, and
encouraging sparser gradients to help symmetry breaking and credit assignment.
The experiments are performed on text and music data and show off the combined
effects of these techniques in generally improving both training and test
error.
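Of the techniques evaluated above, gradient clipping is the most compact to
illustrate. A sketch of norm-based clipping, with an arbitrary placeholder
threshold:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    """Norm-based gradient clipping: rescale the gradient whenever its
    norm exceeds the threshold, taming exploding gradients on long
    sequences while leaving small gradients untouched."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```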