4 research outputs found
Why gradient clipping accelerates training: A theoretical justification for adaptivity
We provide a theoretical explanation for the effectiveness of gradient
clipping in training deep neural networks. The key ingredient is a new
smoothness condition derived from practical neural network training examples.
We observe that gradient smoothness, a concept central to the analysis of
first-order optimization algorithms and often assumed to be a constant, in
fact varies significantly along the training trajectory of deep
neural networks. Further, this smoothness positively correlates with the
gradient norm, and contrary to standard assumptions in the literature, it can
grow with the norm of the gradient. These empirical observations limit the
applicability of existing theoretical analyses that rely on a fixed bound on
smoothness, and they motivate us to introduce a novel relaxation of gradient
smoothness that is weaker than the commonly used Lipschitz smoothness
assumption. Under the new condition, we prove that two
popular methods, namely, \emph{gradient clipping} and \emph{normalized
gradient}, converge arbitrarily faster than gradient descent with fixed
stepsize. We further explain why such adaptively scaled gradient methods can
accelerate empirical convergence and verify our results empirically in popular
neural network training settings.
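For reference, the relaxed condition introduced in this paper (often called $(L_0, L_1)$-smoothness) bounds the Hessian norm by an affine function of the gradient norm rather than a constant:

$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|.$$

The two update rules analyzed under this condition can be sketched as follows. This is a minimal illustration; the step size eta and clipping threshold gamma are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def clipped_gd_step(x, grad_fn, eta=0.1, gamma=1.0):
    """Gradient clipping: rescale the gradient whenever its norm
    exceeds the threshold gamma, then take a descent step."""
    g = grad_fn(x)
    g_norm = np.linalg.norm(g)
    if g_norm > gamma:
        g = g * (gamma / g_norm)
    return x - eta * g

def normalized_gd_step(x, grad_fn, eta=0.1, eps=1e-8):
    """Normalized gradient descent: the step length is (nearly)
    independent of the gradient magnitude."""
    g = grad_fn(x)
    return x - eta * g / (np.linalg.norm(g) + eps)
```

Both rules take smaller effective steps where the gradient (and hence, under the new condition, the local smoothness) is large, which is the mechanism behind the acceleration result.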
Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm in
deep learning, adaptive methods like Adam have been observed to outperform SGD
across important tasks, such as attention models. The settings under which SGD
performs poorly in comparison to Adam are not yet well understood. In this
paper, we provide empirical and theoretical evidence that a heavy-tailed
distribution of the noise in stochastic gradients is a root cause of SGD's poor
performance. Based on this observation, we study clipped variants of SGD that
circumvent this issue; we then analyze their convergence under heavy-tailed
noise. Furthermore, we develop a new adaptive coordinate-wise clipping
algorithm (ACClip) tailored to such settings. Subsequently, we show how
adaptive methods like Adam can be viewed through the lens of clipping, which
helps us explain Adam's strong performance under heavy-tail noise settings.
Finally, we show that the proposed ACClip outperforms Adam for both BERT
pretraining and finetuning tasks.
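As a rough illustration of the clipping mechanisms discussed above, here is a minimal sketch of a coordinate-wise clipped SGD step. ACClip itself adapts its per-coordinate thresholds online from running statistics of the gradients; the fixed threshold tau below is a hypothetical simplification, not the paper's algorithm:

```python
import numpy as np

def coordinatewise_clipped_sgd_step(x, stoch_grad, eta=0.01, tau=1.0):
    """Clip each coordinate of the stochastic gradient to [-tau, tau]
    before the update, limiting the impact of a heavy-tailed noise
    outlier on any single coordinate."""
    g = np.clip(stoch_grad(x), -tau, tau)
    return x - eta * g
```

Global-norm clipping (rescaling the whole gradient vector, as in the previous abstract) can be recovered by replacing np.clip with a rescaling by tau / ||g|| whenever ||g|| > tau.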
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise
Stochastic gradient descent with momentum (SGDm) is one of the most popular
optimization algorithms in deep learning. While there is a rich theory of SGDm
for convex problems, the theory is considerably less developed in the context
of deep learning where the problem is non-convex and the gradient noise might
exhibit a heavy-tailed behavior, as empirically observed in recent studies. In
this study, we consider a \emph{continuous-time} variant of SGDm, known as the
underdamped Langevin dynamics (ULD), and investigate its asymptotic properties
under heavy-tailed perturbations. Supported by recent studies from statistical
physics, we argue both theoretically and empirically that the heavy-tails of
such perturbations can result in a bias even when the step-size is small, in
the sense that \emph{the optima of stationary distribution} of the dynamics
might not match \emph{the optima of the cost function to be optimized}. As a
remedy, we develop a novel framework, which we coin as \emph{fractional} ULD
(FULD), and prove that FULD targets the so-called Gibbs distribution, whose
optima exactly match the optima of the original cost. We observe that the Euler
discretization of FULD has noteworthy algorithmic similarities with
\emph{natural gradient} methods and \emph{gradient clipping}, bringing a new
perspective on understanding their role in deep learning. We support our theory
with experiments conducted on a synthetic model and neural networks.
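To make the heavy-tailed setting concrete, here is a minimal Euler discretization of underdamped Langevin dynamics driven by symmetric alpha-stable noise (alpha < 2 gives heavy tails; alpha = 2 recovers the Gaussian case). This is the plain ULD that exhibits the bias discussed above, not the fractional FULD dynamics, and all constants are illustrative:

```python
import numpy as np
from scipy.stats import levy_stable

def uld_euler_step(x, v, grad_fn, eta=1e-3, friction=1.0,
                   alpha=1.8, noise_scale=0.1):
    """One Euler step of underdamped Langevin dynamics with symmetric
    alpha-stable noise; increments of an alpha-stable process over a
    step of size eta scale as eta**(1/alpha)."""
    noise = levy_stable.rvs(alpha, 0.0, scale=noise_scale, size=x.shape)
    v = v - eta * (friction * v + grad_fn(x)) + eta ** (1.0 / alpha) * noise
    x = x + eta * v
    return x, v
```

With alpha = 2 this reduces to the usual Gaussian Langevin discretization, a standard continuous-time surrogate for SGD with momentum.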
The Geometry of Sign Gradient Descent
Sign-based optimization methods have become popular in machine learning due
to their favorable communication cost in distributed optimization and their
surprisingly good performance in neural network training. Furthermore, they are
closely connected to so-called adaptive gradient methods like Adam. Recent
works on signSGD have used a non-standard "separable smoothness" assumption,
whereas some older works study sign gradient descent as steepest descent with
respect to the $\ell_\infty$-norm. In this work, we unify these existing
results by showing a close connection between separable smoothness and
$\ell_\infty$-smoothness and argue that the latter is the weaker and more
natural assumption. We then proceed to study the smoothness constant with
respect to the $\ell_\infty$-norm and thereby isolate geometric properties of
the objective function which affect the performance of sign-based methods. In
short, we find sign-based methods to be preferable over gradient descent if (i)
the Hessian is to some degree concentrated on its diagonal, and (ii) its
maximal eigenvalue is much larger than the average eigenvalue. Both properties
are common in deep networks.
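For completeness, the sign-based update the abstract refers to is, in its simplest deterministic form (the step size eta is a placeholder):

```python
import numpy as np

def sign_gd_step(x, grad_fn, eta=0.01):
    """Sign gradient descent: use only the sign of each gradient
    coordinate, i.e. steepest descent with respect to the
    l-infinity norm (up to the step-size convention)."""
    return x - eta * np.sign(grad_fn(x))
```

Because every coordinate moves by the same amount eta, the method's behavior is governed by the geometry of the objective in the $\ell_\infty$ sense, which is what conditions (i) and (ii) above capture.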