18,915 research outputs found
A Stochastic Derivative Free Optimization Method with Momentum
We consider the problem of unconstrained minimization of a smooth objective
function in a setting where only function evaluations are possible. We propose
and analyze a stochastic zeroth-order method with heavy-ball momentum. In
particular, we propose SMTP, a momentum version of the stochastic three-point
method (STP) \cite{Bergou_2018}. We show new complexity results for
non-convex, convex, and strongly convex functions. We test our method on a
collection of continuous control tasks on several MuJoCo \cite{Todorov_2012}
environments with varying difficulty and compare against STP, other
state-of-the-art derivative-free optimization algorithms, and policy gradient
methods. SMTP significantly outperforms STP and all other methods considered
in our numerical experiments. Our second contribution is SMTP with importance
sampling, which we call SMTP_IS. We provide a convergence analysis of this
method for non-convex, convex, and strongly convex objectives.
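A minimal sketch of the kind of zeroth-order heavy-ball step the abstract
above describes, assuming Gaussian direction sampling and a best-of-three
acceptance rule; the step sizes and details are illustrative, not the authors'
exact scheme:

```python
import numpy as np

def smtp_sketch(f, x0, alpha=0.1, beta=0.5, iters=1000, seed=0):
    """Illustrative zeroth-order heavy-ball loop in the spirit of SMTP.

    Each iteration probes the objective along a random unit direction,
    combined with momentum in both signs, and keeps the best of the
    three candidates (including staying put), so f never increases.
    """
    rng = np.random.default_rng(seed)
    x, v = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(iters):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)                         # random unit direction
        v_plus, v_minus = beta * v + s, beta * v - s   # momentum candidates
        cands = [(x, v),
                 (x - alpha * v_plus, v_plus),
                 (x - alpha * v_minus, v_minus)]
        x, v = min(cands, key=lambda c: f(c[0]))       # best of three
    return x

# usage: minimize a quadratic using only function evaluations
x_star = smtp_sketch(lambda z: float(np.sum(z ** 2)), np.ones(10))
```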
Fast, Better Training Trick -- Random Gradient
In this paper, we present a simple method to accelerate training and improve
performance, which we call random gradient (RG). The method can be added to
the training of any model without extra computational cost; we use image
classification, semantic segmentation, and GANs to confirm that it speeds up
the training of computer vision models. The central idea is to multiply the
loss by a random number so as to randomly scale down the back-propagated
gradient. Using this method, we obtain better results on the Pascal VOC,
CIFAR, and Cityscapes datasets.
Comment: arXiv admin note: text overlap with arXiv:1708.07120 by other authors
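The core trick lends itself to a short sketch. The following hypothetical
PyTorch training step scales the loss by a random factor before
backpropagation; the uniform (0, 1) distribution is an assumption, not
necessarily the paper's exact choice:

```python
import random
import torch

def rg_step(model, criterion, optimizer, inputs, targets):
    """One hypothetical training step with the random gradient (RG) trick:
    the loss is scaled by a random factor in (0, 1) before backpropagation,
    which shrinks every gradient for this step at no extra cost."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    (random.random() * loss).backward()  # randomly scaled gradients
    optimizer.step()
    return loss.item()                   # report the unscaled loss
```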
A Latent Variational Framework for Stochastic Optimization
This paper provides a unifying theoretical framework for stochastic
optimization algorithms by means of a latent stochastic variational problem.
Using techniques from stochastic control, the solution to the variational
problem is shown to be equivalent to that of a Forward Backward Stochastic
Differential Equation (FBSDE). By solving these equations, we recover a variety
of existing adaptive stochastic gradient descent methods. This framework
establishes a direct connection between stochastic optimization algorithms and
a secondary Bayesian inference problem on gradients, where a prior measure on
noisy gradient observations determines the resulting algorithm.
Comment: 8 pages main content, 8 pages appendix
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
In this paper, we describe a phenomenon we call "super-convergence", where
neural networks can be trained an order of magnitude faster than with standard
training methods. The existence of super-convergence is relevant to
understanding why deep networks generalize well. One of the key elements of
super-convergence is training with one learning rate cycle and a large maximum
learning rate. A primary insight that enables super-convergence training is
that large learning rates regularize the training, hence requiring a reduction
of all other forms of regularization in order to preserve an optimal
regularization balance. We also derive a simplification of the Hessian-free
optimization method to compute an estimate of the optimal learning rate.
Experiments demonstrate super-convergence on the CIFAR-10/100, MNIST, and
ImageNet datasets with ResNet, Wide ResNet, DenseNet, and Inception
architectures. In addition, we show that super-convergence provides a greater
boost in performance relative to standard training when the amount of labeled
training data is limited. The architectures and code to replicate the figures
in this paper are available at github.com/lnsmith54/super-convergence. See
http://www.fast.ai/2018/04/30/dawnbench-fastai/ for an application of
super-convergence to win the DAWNBench challenge (see
https://dawn.cs.stanford.edu/benchmark/).
Comment: This paper was significantly revised to show super-convergence as a
general fast training methodology
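For illustration, a minimal sketch of a single learning-rate cycle of the kind
the abstract describes, assuming a linear ramp up to the maximum rate and back
down; the base rate and triangular shape are assumptions, not the paper's
exact schedule:

```python
def one_cycle_lr(step, total_steps, max_lr, base_lr=None):
    """Learning rate for one triangular cycle: ramp linearly from base_lr
    up to max_lr at mid-cycle, then back down. The base_lr default of
    max_lr / 10 is an assumption, not the paper's prescription."""
    base_lr = max_lr / 10.0 if base_lr is None else base_lr
    half = total_steps / 2.0
    frac = step / half if step <= half else (total_steps - step) / half
    return base_lr + (max_lr - base_lr) * frac

# usage: a 10,000-step cycle peaking at a deliberately large rate
lrs = [one_cycle_lr(t, 10_000, max_lr=3.0) for t in range(10_001)]
```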
Deep Frank-Wolfe For Neural Network Optimization
Learning a deep neural network requires solving a challenging optimization
problem: it is a high-dimensional, non-convex and non-smooth minimization
problem with a large number of terms. The current practice in neural network
optimization is to rely on the stochastic gradient descent (SGD) algorithm or
its adaptive variants. However, SGD requires a hand-designed schedule for the
learning rate. In addition, its adaptive variants tend to produce solutions
that generalize less well on unseen data than SGD with a hand-designed
schedule. We present an optimization method that offers empirically the best of
both worlds: our algorithm yields good generalization performance while
requiring only one hyper-parameter. Our approach is based on a composite
proximal framework, which exploits the compositional nature of deep neural
networks and can leverage powerful convex optimization algorithms by design.
Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes
an optimal step-size in closed-form at each time-step. We further show that the
descent direction is given by a simple backward pass in the network, yielding
the same computational cost per iteration as SGD. We present experiments on the
CIFAR and SNLI data sets, where we demonstrate the significant superiority of
our method over Adam, Adagrad, as well as the recently proposed BPGrad and
AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed
learning rate schedule, and show that it provides similar generalization while
converging faster. The code is publicly available at
https://github.com/oval-group/dfw.
Comment: Published as a conference paper at ICLR 2019
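To illustrate the closed-form step-size idea, here is a hedged sketch of one
Frank-Wolfe step with exact line search on a quadratic objective, a stand-in
for the proximal SVM problem DFW actually solves; the quadratic model and
names are assumptions, not the paper's formulation:

```python
import numpy as np

def fw_quadratic_step(x, Q, b, s):
    """One Frank-Wolfe step on f(x) = 0.5 x^T Q x + b^T x with the
    closed-form optimal step size given by exact line search, clipped
    to [0, 1]. Here s is the vertex returned by the linear minimization
    oracle; Q must be symmetric positive semidefinite."""
    d = s - x                            # Frank-Wolfe direction
    grad = Q @ x + b
    curv = d @ Q @ d                     # curvature along d
    gamma = 1.0 if curv <= 0 else float(np.clip(-(grad @ d) / curv, 0.0, 1.0))
    return x + gamma * d
```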
Convergence Analysis of Proximal Gradient with Momentum for Nonconvex Optimization
In many modern machine learning applications, structures of underlying
mathematical models often yield nonconvex optimization problems. Due to the
intractability of nonconvexity, there is a rising need to develop efficient
methods for solving general nonconvex problems with certain performance
guarantee. In this work, we investigate the accelerated proximal gradient
method for nonconvex programming (APGnc). The method compares a usual
proximal gradient step with a linear extrapolation step and accepts the one
that yields the lower function value, achieving a monotonic decrease.
Specifically, under a general nonsmooth and nonconvex setting, we provide a
rigorous argument
to show that the limit points of the sequence generated by APGnc are critical
points of the objective function. Then, by exploiting the
Kurdyka-{\L}ojasiewicz (\KL) property for a broad class of functions, we
establish the linear and sub-linear convergence rates of the function value
sequence generated by APGnc. We further propose a stochastic variance reduced
APGnc (SVRG-APGnc), and establish its linear convergence under a special case
of the \KL property. We also extend the analysis to the inexact version of
these methods and develop an adaptive momentum strategy that improves the
numerical performance.
Comment: Accepted at ICML 2017, 9 pages, 4 figures
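A minimal sketch of the accept-the-better-step rule described above, assuming
a fixed step size and a simple heavy-ball extrapolation; the paper's parameter
choices and inexact variants are more refined:

```python
import numpy as np

def apgnc_sketch(F, grad_f, prox_g, x0, lr=0.1, momentum=0.9, iters=200):
    """Illustrative APGnc-style loop for F = f + g: take both a plain
    proximal gradient step and an extrapolated one, then keep whichever
    has the lower composite objective, guaranteeing a monotone decrease."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        plain = prox_g(x - lr * grad_f(x), lr)     # proximal gradient step
        y = x + momentum * (x - x_prev)            # extrapolation point
        accel = prox_g(y - lr * grad_f(y), lr)
        x_prev, x = x, min((plain, accel), key=F)  # monotone acceptance
    return x

# usage: soft-thresholding prox for g(x) = 0.01 * ||x||_1
prox_l1 = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - 0.01 * t, 0.0)
```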
L2 Regularization versus Batch and Weight Normalization
Batch Normalization is a commonly used trick to improve the training of deep
neural networks. These neural networks use L2 regularization, also called
weight decay, ostensibly to prevent overfitting. However, we show that L2
regularization has no regularizing effect when combined with normalization.
Instead, regularization has an influence on the scale of weights, and thereby
on the effective learning rate. We investigate this dependence, both in theory,
and experimentally. We show that popular optimization methods such as ADAM only
partially eliminate the influence of normalization on the learning rate. This
leads to a discussion of other ways to mitigate this issue.
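The scale-invariance argument is easy to check numerically. The following
sketch shows that rescaling the weights feeding a (non-affine) batch
normalization layer leaves the output essentially unchanged, so the L2 penalty
can act only through the weight scale and hence the effective learning rate:

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Plain batch normalization over the batch dimension (no affine part)."""
    return (h - h.mean(0)) / np.sqrt(h.var(0) + eps)

rng = np.random.default_rng(0)
x, W = rng.standard_normal((64, 10)), rng.standard_normal((10, 5))

# Rescaling W by any positive constant leaves the normalized output
# (numerically) unchanged, so shrinking ||W|| via L2 decay does not
# regularize the function; it only changes the effective learning rate.
assert np.allclose(batch_norm(x @ W), batch_norm(x @ (0.1 * W)), atol=1e-3)
```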
Preconditioner on Matrix Lie Group for SGD
We study two types of preconditioners and preconditioned stochastic gradient
descent (SGD) methods in a unified framework. We call the first one the Newton
type due to its close relationship to the Newton method, and the second one the
Fisher type as its preconditioner is closely related to the inverse of Fisher
information matrix. Both preconditioners can be derived from one framework, and
efficiently estimated on any matrix Lie group designated by the user using
natural or relative gradient descent minimizing certain preconditioner
estimation criteria. Many existing preconditioners and methods, e.g., RMSProp,
Adam, KFAC, equilibrated SGD, batch normalization, etc., are special cases of
or closely related to either the Newton type or the Fisher type ones.
Experimental results on relatively large-scale machine learning problems are
reported to study performance.
Comment: To appear at ICLR 2019
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
We propose an efficient method for approximating natural gradient descent in
neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC).
K-FAC is based on an efficiently invertible approximation of a neural network's
Fisher information matrix which is neither diagonal nor low-rank, and in some
cases is completely non-sparse. It is derived by approximating various large
blocks of the Fisher (corresponding to entire layers) as being the Kronecker
product of two much smaller matrices. While only several times more expensive
to compute than the plain stochastic gradient, the updates produced by K-FAC
make much more progress optimizing the objective, which results in an algorithm
that can be much faster than stochastic gradient descent with momentum in
practice. And unlike some previously proposed approximate
natural-gradient/Newton methods which use high-quality non-diagonal curvature
matrices (such as Hessian-free optimization), K-FAC works very well in highly
stochastic optimization regimes. This is because the cost of storing and
inverting K-FAC's approximation to the curvature matrix does not depend on the
amount of data used to estimate it, which is a feature typically associated
only with diagonal or low-rank approximations to the curvature matrix.
Comment: Reduction ratio formula corrected. Removed incorrect claim about
geodesics in footnote
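The Kronecker-factored inverse is simple to apply for a single fully connected
layer. A sketch, assuming the standard factorization of the layer's Fisher
block into input-activation and output-gradient second moments; damping and
statistics estimation are simplified relative to the paper:

```python
import numpy as np

def kfac_precondition(dW, a, g, damping=1e-3):
    """Apply a K-FAC-style preconditioner to the gradient dW of one fully
    connected layer. The layer's Fisher block is approximated by the
    Kronecker product A (x) G of the input-activation and output-gradient
    second-moment matrices, whose inverse applies as G^{-1} dW A^{-1}."""
    A = a.T @ a / a.shape[0] + damping * np.eye(a.shape[1])  # (n_in, n_in)
    G = g.T @ g / g.shape[0] + damping * np.eye(g.shape[1])  # (n_out, n_out)
    return np.linalg.solve(G, dW) @ np.linalg.inv(A)

# shapes: a is (batch, n_in), g is (batch, n_out), dW is (n_out, n_in)
```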
Advances in Optimizing Recurrent Networks
After a more than decade-long period of relatively little research activity
in the area of recurrent neural networks, several new developments will be
reviewed here that have allowed substantial progress both in understanding and
in technical solutions towards more efficient training of recurrent networks.
These advances have been motivated by and related to the optimization issues
surrounding deep learning. Although recurrent networks are extremely powerful
in what they can in principle represent in terms of modelling sequences, their
training is plagued by two aspects of the same issue regarding the learning of
long-term dependencies. Experiments reported here evaluate the use of clipping
gradients, spanning longer time ranges with leaky integration, advanced
momentum techniques, using more powerful output probability models, and
encouraging sparser gradients to help symmetry breaking and credit assignment.
The experiments are performed on text and music data and show off the combined
effects of these techniques in generally improving both training and test
error.
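Of the techniques evaluated above, gradient clipping is the most compact to
illustrate. A sketch of norm-based clipping, with an arbitrary placeholder
threshold:

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    """Norm-based gradient clipping: rescale the gradient whenever its
    norm exceeds the threshold, taming exploding gradients on long
    sequences while leaving small gradients untouched."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```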