216 research outputs found
On the Convergence Proof of AMSGrad and a New Version
The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular
optimizer in the training of deep neural networks. However, Reddi et al. have
recently shown that the convergence proof of Adam is problematic and proposed a
variant of Adam called AMSGrad as a fix. In this paper, we show that the
convergence proof of AMSGrad is also problematic. Concretely, the problem in
the convergence proof of AMSGrad lies in the handling of the hyper-parameters,
which are treated as equal when they are not; the same issue is overlooked in
the convergence proof of Adam. We provide an explicit counter-example in a
simple convex optimization setting to illustrate this neglected issue.
Depending on how the hyper-parameters are handled, we present several fixes for this issue. We
provide a new convergence proof for AMSGrad as the first fix. We also propose a
new version of AMSGrad called AdamX as another fix. Our experiments on the
benchmark dataset also support our theoretical results.
Comment: Update publication information
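For reference, a minimal sketch of the standard AMSGrad update that the analysis above concerns (variable names and default hyper-parameter values are the commonly used ones, not taken from the paper):

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_hat,
                 alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: identical to Adam except that the denominator uses the
    running maximum of the second-moment estimate ('long-term memory')."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    v_hat = np.maximum(v_hat, v)              # the key difference from Adam
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```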
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization
This paper studies a class of adaptive gradient based momentum algorithms
that update the search directions and learning rates simultaneously using past
gradients. This class, which we refer to as "Adam-type", includes popular
algorithms such as Adam, AMSGrad and AdaGrad. Despite their
popularity in training deep neural networks, the convergence of these
algorithms for solving nonconvex problems remains an open question. This paper
provides a set of mild sufficient conditions that guarantee the convergence for
the Adam-type methods. We prove that under our derived conditions, these
methods can achieve a convergence rate of order O(log T/√T) for
nonconvex stochastic optimization. We show that the conditions are essential in the
sense that violating them may make the algorithm diverge. Moreover, we propose
and analyze a class of (deterministic) incremental adaptive gradient
algorithms, which have the same O(log T/√T) convergence rate. Our
study could also be extended to a broader class of adaptive gradient methods in
machine learning and optimization.
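As a rough illustration of the shared structure of such "Adam-type" methods, the sketch below shows the generic update in which both the search direction and the per-coordinate learning rate are built from past gradients; the particular moment estimates are illustrative choices, not the paper's conditions:

```python
import numpy as np

def adam_type_step(theta, grad, m, v,
                   alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Generic Adam-type update: a momentum direction m and an element-wise
    learning rate alpha / sqrt(v), both computed from past gradients.
    beta1 = 0.9, beta2 = 0.999 gives Adam-like behaviour; beta1 = 0 with a
    uniform average for v gives AdaGrad-like behaviour."""
    m = beta1 * m + (1 - beta1) * grad          # search direction
    v = beta2 * v + (1 - beta2) * grad ** 2     # per-coordinate scaling
    theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta, m, v
```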
Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
Adaptive gradient methods, which adopt historical gradient information to
automatically adjust the learning rate, despite the nice property of fast
convergence, have been observed to generalize worse than stochastic gradient
descent (SGD) with momentum in training deep neural networks. How to close the
generalization gap of adaptive gradient methods thus remains an open problem.
In this work, we show that adaptive gradient methods such as Adam and Amsgrad are
sometimes "over adapted". We design a new algorithm, called the partially adaptive
momentum estimation method, which unifies Adam/Amsgrad with SGD by
introducing a partial adaptive parameter p, to achieve the best of both
worlds. We also prove the convergence rate of our proposed algorithm to a
stationary point in the stochastic nonconvex optimization setting. Experiments
on standard benchmarks show that our proposed algorithm can maintain a fast
convergence rate as Adam/Amsgrad while generalizing as well as SGD in training
deep neural networks. These results suggest that practitioners may once again
adopt adaptive gradient methods for faster training of deep neural networks.
Comment: 17 pages, 4 figures, 4 tables. In IJCAI 2020
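A minimal sketch of the partially adaptive idea described above, writing the partial adaptive parameter as p: p = 1/2 recovers an AMSGrad-style update, while p close to 0 behaves like SGD with momentum (the default value and variable names are assumptions for illustration):

```python
import numpy as np

def padam_step(theta, grad, m, v, v_hat,
               alpha=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """Partially adaptive momentum step: the second-moment denominator is raised
    to a power p in [0, 1/2] instead of the fixed exponent 1/2 of Adam/AMSGrad."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat, v)               # AMSGrad-style long-term memory
    theta = theta - alpha * m / (v_hat ** p + eps)
    return theta, m, v, v_hat
```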
Adaptive Gradient Methods with Dynamic Bound of Learning Rate
Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been
proposed to achieve a rapid training process with an element-wise scaling term
on learning rates. Though prevailing, they are observed to generalize poorly
compared with SGD or even fail to converge due to unstable and extreme learning
rates. Recent work has put forward some algorithms such as AMSGrad to tackle
this issue but they failed to achieve considerable improvement over existing
methods. In our paper, we demonstrate that extreme learning rates can lead to
poor performance. We provide new variants of Adam and AMSGrad, called AdaBound
and AMSBound respectively, which employ dynamic bounds on learning rates to
achieve a gradual and smooth transition from adaptive methods to SGD and give a
theoretical proof of convergence. We further conduct experiments on a variety of
popular tasks and models, a scope that is often insufficient in previous work.
Experimental results show that the new variants can eliminate the generalization
gap between adaptive methods and SGD and maintain higher learning speed early
in training at the same time. Moreover, they can bring significant improvement
over their prototypes, especially on complex deep networks. The implementation
of the algorithm can be found at https://github.com/Luolc/AdaBound.
Comment: Accepted to ICLR 2019. arXiv admin note: text overlap with arXiv:1904.09237 by other authors
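A sketch of the dynamic-bound idea: the element-wise learning rate is clipped between a lower and an upper bound that both converge to a constant SGD-style rate, so the method transitions from adaptive behaviour to SGD. The bound schedules and constants below are illustrative, not the exact ones used by AdaBound/AMSBound:

```python
import numpy as np

def adabound_like_step(theta, grad, m, v, t, alpha=1e-3, final_lr=0.1,
                       beta1=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """Adam-style step whose element-wise learning rate is clipped to
    [lower(t), upper(t)]; both bounds approach final_lr as the step count t
    (starting from 1) grows."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))   # illustrative schedule
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    step = np.clip(alpha / (np.sqrt(v) + eps), lower, upper)
    theta = theta - step * m
    return theta, m, v
```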
AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
Adam has been shown to be unable to converge to the optimal solution in certain
cases. Researchers have recently proposed several algorithms to avoid this
non-convergence issue of Adam, but their efficiency turns out to be unsatisfactory in
practice. In this paper, we provide new insight into the non-convergence issue
of Adam as well as other adaptive learning rate methods. We argue that there
exists an inappropriate correlation between the gradient g_t and the
second-moment term v_t in Adam (t is the timestep), which results in a
large gradient being likely to have a small step size while a small gradient may
have a large step size. We demonstrate that such biased step sizes are the
fundamental cause of the non-convergence of Adam, and we further prove that
decorrelating v_t and g_t will lead to an unbiased step size for each
gradient, thus solving the non-convergence problem of Adam. Finally, we propose
AdaShift, a novel adaptive learning rate method that decorrelates v_t and g_t
by temporal shifting, i.e., using a temporally shifted gradient g_{t-n}
to calculate v_t. The experimental results demonstrate that AdaShift is able to
address the non-convergence issue of Adam while still maintaining
competitive performance with Adam in terms of both training speed and
generalization.
Comment: Published as a conference paper at ICLR 2019
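A simplified sketch of the temporal-shift idea: the second-moment term at step t is updated with a gradient from n steps earlier, so the current gradient and the scaling applied to it are decorrelated. This is a stripped-down reading of the basic scheme (the paper also considers block-wise and spatially averaged variants), and the buffer size n is an illustrative choice:

```python
from collections import deque
import numpy as np

def adashift_like_step(theta, grad, v, grad_buffer,
                       n=10, alpha=1e-2, beta2=0.999, eps=1e-8):
    """Keep the last n gradients in a buffer; update v with the oldest
    (temporally shifted) gradient, then take the step with the current
    gradient, so the two are decorrelated."""
    grad_buffer.append(grad)
    if len(grad_buffer) <= n:                  # warm-up: not enough history yet
        return theta, v, grad_buffer
    shifted = grad_buffer.popleft()            # gradient from n steps ago
    v = beta2 * v + (1 - beta2) * shifted ** 2
    theta = theta - alpha * grad / (np.sqrt(v) + eps)
    return theta, v, grad_buffer
```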
Riemannian Adaptive Optimization Methods
Several first order stochastic optimization methods commonly used in the
Euclidean domain such as stochastic gradient descent (SGD), accelerated
gradient descent or variance reduced methods have already been adapted to
certain Riemannian settings. However, some of the most popular of these
optimization tools - namely Adam, Adagrad and the more recent Amsgrad - remain
to be generalized to Riemannian manifolds. We discuss the difficulty of
generalizing such adaptive schemes to the most agnostic Riemannian setting, and
then provide algorithms and convergence proofs for geodesically convex
objectives in the particular case of a product of Riemannian manifolds, in
which adaptivity is implemented across manifolds in the Cartesian product. Our
generalization is tight in the sense that choosing the Euclidean space as
Riemannian manifold yields the same algorithms and regret bounds as those that
were already known for the standard algorithms. Experimentally, we show faster
convergence and to a lower train loss value for Riemannian adaptive methods
over their corresponding baselines on the realistic task of embedding the
WordNet taxonomy in the Poincaré ball.
Comment: Accepted at International Conference on Learning Representations (ICLR), 2019
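A minimal sketch of what a per-manifold adaptive step can look like on a Cartesian product of manifolds, using an abstract factor interface (exponential map and Riemannian gradient). The interface, the AdaGrad-style accumulator, and the per-factor adaptivity are a rough reading of the setting, not the paper's exact algorithms:

```python
import numpy as np

class EuclideanFactor:
    """Trivial manifold factor: exp is vector addition and the Riemannian
    gradient equals the Euclidean one. A Poincaré-ball factor would supply its
    own exponential map and metric rescaling."""
    def exp(self, x, u):
        return x + u
    def rgrad(self, x, egrad):
        return egrad

def riemannian_adagrad_step(factors, xs, egrads, accum, alpha=1e-2, eps=1e-8):
    """One AdaGrad-style step across a product of manifolds: adaptivity is
    applied per factor (one accumulated squared norm per manifold)."""
    new_xs = []
    for i, (mfd, x, eg) in enumerate(zip(factors, xs, egrads)):
        g = mfd.rgrad(x, eg)
        accum[i] += float(np.sum(g ** 2))       # per-factor accumulator
        step = alpha / (np.sqrt(accum[i]) + eps)
        new_xs.append(mfd.exp(x, -step * g))    # move along the manifold
    return new_xs, accum
```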
On the Convergence of Adam and Beyond
Several recently proposed stochastic optimization methods that have been
successfully used in training deep networks such as RMSProp, Adam, Adadelta,
Nadam are based on using gradient updates scaled by square roots of exponential
moving averages of squared past gradients. In many applications, e.g. learning
with large output spaces, it has been empirically observed that these
algorithms fail to converge to an optimal solution (or a critical point in
nonconvex settings). We show that one cause for such failures is the
exponential moving average used in the algorithms. We provide an explicit
example of a simple convex optimization setting where Adam does not converge to
the optimal solution, and describe the precise problems with the previous
analysis of Adam algorithm. Our analysis suggests that the convergence issues
can be fixed by endowing such algorithms with `long-term memory' of past
gradients, and we propose new variants of the Adam algorithm which not only fix
the convergence issues but often also lead to improved empirical performance.
Comment: Appeared in ICLR 2018
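A toy numerical sketch in the spirit of the divergence behaviour described above: a one-dimensional online problem where a rare large gradient points toward the optimum, but its influence on the step size is quickly forgotten by the exponential moving average. The constants (C, beta2, alpha) are chosen purely for illustration and are not the paper's:

```python
import numpy as np

def toy_adam_drift(T=30000, C=10.0, alpha=1e-3, beta1=0.0, beta2=0.01, eps=1e-8):
    """Online losses on x in [-1, 1]: every third round the loss is C*x (its
    gradient C pushes toward x = -1, the minimizer of the average loss),
    otherwise the loss is -x. With beta2 this small, the rare large gradient is
    forgotten almost immediately and the iterate drifts toward x = +1."""
    x, m, v = 0.0, 0.0, 0.0
    for t in range(1, T + 1):
        g = C if t % 3 == 1 else -1.0          # gradient of this round's linear loss
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        x = float(np.clip(x - alpha * m / (np.sqrt(v) + eps), -1.0, 1.0))
    return x

print(toy_adam_drift())   # ends up near +1 rather than the optimal -1
```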
Deep Frank-Wolfe For Neural Network Optimization
Learning a deep neural network requires solving a challenging optimization
problem: it is a high-dimensional, non-convex and non-smooth minimization
problem with a large number of terms. The current practice in neural network
optimization is to rely on the stochastic gradient descent (SGD) algorithm or
its adaptive variants. However, SGD requires a hand-designed schedule for the
learning rate. In addition, its adaptive variants tend to produce solutions
that generalize less well on unseen data than SGD with a hand-designed
schedule. We present an optimization method that offers empirically the best of
both worlds: our algorithm yields good generalization performance while
requiring only one hyper-parameter. Our approach is based on a composite
proximal framework, which exploits the compositional nature of deep neural
networks and can leverage powerful convex optimization algorithms by design.
Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes
an optimal step-size in closed-form at each time-step. We further show that the
descent direction is given by a simple backward pass in the network, yielding
the same computational cost per iteration as SGD. We present experiments on the
CIFAR and SNLI data sets, where we demonstrate the significant superiority of
our method over Adam, Adagrad, as well as the recently proposed BPGrad and
AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed
learning rate schedule, and show that it provides similar generalization while
converging faster. The code is publicly available at
https://github.com/oval-group/dfw.
Comment: Published as a conference paper at ICLR 2019
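For orientation, a sketch of the classical Frank-Wolfe template that the method builds on: a linear minimization oracle over the feasible set plus a step size. The closed-form optimal step size that the paper derives for its SVM subproblem is specific to that formulation and is not reproduced here; the 2/(t+2) rule below is the textbook default:

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, T=200):
    """Classical Frank-Wolfe: linearize the objective, call the linear
    minimization oracle `lmo` to get a feasible vertex, and move toward it."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        s = lmo(grad_f(x))              # argmin over the feasible set of <grad, s>
        gamma = 2.0 / (t + 2.0)         # textbook step size (DFW instead derives a closed form)
        x = (1 - gamma) * x + gamma * s
    return x

# Toy usage: minimize ||x - c||^2 over the unit l1 ball.
c = np.array([0.3, 0.8])
grad_f = lambda x: 2.0 * (x - c)
lmo = lambda g: -np.sign(g) * (np.arange(g.size) == np.argmax(np.abs(g)))
print(frank_wolfe(grad_f, lmo, np.zeros(2)))   # approaches the l1 projection of c
```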
Convergence Analyses of Online ADAM Algorithm in Convex Setting and Two-Layer ReLU Neural Network
Nowadays, online learning is an appealing learning paradigm, which is of
great interest in practice due to the recent emergence of large scale
applications such as online advertising placement and online web ranking.
Standard online learning assumes a finite number of samples, while in practice
data is streamed indefinitely. In such a setting, gradient descent with a
diminishing learning rate does not work. We first introduce regret with rolling
window, a new performance metric for online streaming learning, which measures
the performance of an algorithm on every fixed number of contiguous samples. At
the same time, we propose a family of algorithms based on gradient descent with
a constant or adaptive learning rate and provide rigorous technical analyses
establishing regret bounds for the algorithms. We cover the convex
setting, showing regret of the order of the square root of the window size
in both the constant and the dynamic learning rate scenarios. Our proof is
applicable also to the standard online setting where we provide the first
analysis of the same regret order (the previous proofs have flaws). We also
study a two-layer neural network setting with ReLU activation. In this case, we
establish that if the initial weights are close to a stationary point, the same
square-root regret bound is attainable. We conduct computational experiments
demonstrating the superior performance of the proposed algorithms.
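A small sketch of the rolling-window regret metric as we read it from the description above: the algorithm's cumulative loss on every window of contiguous samples is compared against the best fixed decision for that window, and the worst window is reported. The comparator grid and the exact normalization are assumptions for illustration:

```python
def rolling_window_regret(algo_losses, loss_fn, candidates, window=100):
    """algo_losses[t] : loss the online algorithm incurred at step t.
       loss_fn(x, t)  : loss of the fixed decision x on the sample at step t.
       candidates     : finite set of fixed decisions used as comparators.
       Returns the maximum regret over all windows of `window` contiguous steps."""
    T = len(algo_losses)
    worst = float("-inf")
    for start in range(T - window + 1):
        algo = sum(algo_losses[start:start + window])
        best_fixed = min(sum(loss_fn(x, t) for t in range(start, start + window))
                         for x in candidates)
        worst = max(worst, algo - best_fixed)
    return worst
```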
SAdam: A Variant of Adam for Strongly Convex Functions
The Adam algorithm has become extremely popular for large-scale machine
learning. Under the convexity condition, it has been proved to enjoy a
data-dependent O(√T) regret bound, where T is the time horizon.
However, whether strong convexity can be utilized to further improve the
performance remains an open problem. In this paper, we give an affirmative
answer by developing a variant of Adam (referred to as SAdam) which achieves a
data-dependent O(log T) regret bound for strongly convex functions. The
essential idea is to maintain a faster-decaying yet controlled step size
for exploiting strong convexity. In addition, under a special configuration of
hyperparameters, our SAdam reduces to SC-RMSprop, a recently proposed variant
of RMSprop for strongly convex functions, for which we provide the first
data-dependent logarithmic regret bound. Empirical results on optimizing
strongly convex functions and training deep networks demonstrate the
effectiveness of our method.
Comment: 19 pages, 9 figures
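A rough, illustrative sketch of the kind of update the abstract alludes to: the overall step size decays like 1/t (faster than Adam's typical 1/sqrt(t)), while the per-coordinate denominator uses the second-moment estimate itself rather than its square root and is kept away from zero by a decaying offset. This is only our reading of the idea, with assumed names and constants, not SAdam's exact update:

```python
def sadam_like_step(theta, grad, m, v, t, alpha=1e-2, beta1=0.9, delta=1e-2):
    """Strongly-convex-style adaptive step (illustrative): learning rate ~ alpha/t,
    denominator v + delta/t instead of sqrt(v) + eps."""
    m = beta1 * m + (1 - beta1) * grad
    v = (1.0 - 1.0 / t) * v + (1.0 / t) * grad ** 2   # time-averaged second moment
    theta = theta - (alpha / t) * m / (v + delta / t)
    return theta, m, v
```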