Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
The empirical success of deep learning is often attributed to SGD's
mysterious ability to avoid sharp local minima in the loss landscape, which
are well known to lead to poor generalization. Recently, empirical evidence of
heavy-tailed gradient noise was reported in many deep learning tasks; under the
presence of heavy-tailed gradient noise, it can be shown that SGD can escape
sharp local minima, providing a partial solution to the mystery. In this work,
we analyze a popular variant of SGD where gradients are truncated above a fixed
threshold. We show that it achieves a stronger notion of avoiding sharp minima;
it can effectively eliminate sharp local minima entirely from its training
trajectory. We characterize the dynamics of truncated SGD driven by
heavy-tailed noises. First, we show that the truncation threshold and the
width of the attraction field dictate the order of the first exit time from
the associated
local minimum. Moreover, when the objective function satisfies appropriate
structural conditions, we prove that as the learning rate decreases the
dynamics of heavy-tailed truncated SGD closely resemble those of a
continuous-time Markov chain which never visits any sharp minima. We verify our
theoretical results with numerical experiments and discuss the implications on
the generalizability of SGD in deep learning.
Comment: 92 pages (13 pages for the main paper and 79 pages for the supplementary materials), 7 figures
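The truncation mechanism analyzed above amounts to clipping the stochastic gradient at a fixed threshold. A minimal sketch of one such step, where the Cauchy noise stands in for heavy-tailed gradient noise and the function name and parameter values are illustrative rather than the paper's setup:

```python
import numpy as np

def truncated_sgd_step(w, grad_fn, lr=0.1, threshold=1.0, rng=None):
    """One SGD step with heavy-tailed gradient noise, truncated at a fixed
    threshold: gradients whose norm exceeds the threshold are rescaled down."""
    rng = rng or np.random.default_rng()
    # Cauchy noise is a simple heavy-tailed stand-in for the noise model.
    g = grad_fn(w) + rng.standard_cauchy(size=w.shape)
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)  # truncation: update norm capped at lr * threshold
    return w - lr * g
```

Because every update has norm at most `lr * threshold`, a single noise spike cannot carry the iterate arbitrarily far, which is consistent with the abstract's claim that the threshold and the attraction-field width jointly govern the first exit time.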
On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control
Reinforcement learning is a framework for interactive decision-making with
incentives sequentially revealed across time without a system dynamics model.
Due to its scaling to continuous spaces, we focus on policy search where one
iteratively improves a parameterized policy with stochastic policy gradient
(PG) updates. In tabular Markov Decision Problems (MDPs), under persistent
exploration and suitable parameterization, global optimality may be obtained.
By contrast, in continuous space, the non-convexity poses a pathological
challenge as evidenced by existing convergence results being mostly limited to
stationarity or arbitrary local extrema. To close this gap, we step towards
persistent exploration in continuous space through policy parameterizations
based on heavier-tailed distributions with tail-index parameter alpha, which
increases the likelihood of jumping in state space. Doing so
invalidates smoothness conditions of the score function common to PG. Thus, we
establish how the convergence rate to stationarity depends on the policy's tail
index alpha, a Holder continuity parameter, integrability conditions, and an
exploration tolerance parameter introduced here for the first time. Further, we
characterize the dependence of the set of local maxima on the tail index
through an exit and transition time analysis of a suitably defined Markov
chain, identifying that policies associated with Lévy processes with heavier
tails converge to wider peaks. This phenomenon yields improved stability to
perturbations in supervised learning, which we corroborate also manifests in
improved performance of policy search, especially when myopic and farsighted
incentives are misaligned.
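The idea of a heavier-tailed policy parameterization can be sketched by swapping the usual Gaussian policy for a Student-t policy. The Student-t is used here only as a convenient heavy-tailed stand-in: its degrees-of-freedom parameter `df` plays a role analogous to the tail index alpha, with smaller values giving heavier tails; the function names are illustrative:

```python
import numpy as np

def sample_action(mu, scale, df, rng):
    """Draw an action from a Student-t policy centred at mu; small df means
    heavy tails and hence occasional large exploratory jumps."""
    return mu + scale * rng.standard_t(df)

def score_mu(action, mu, scale, df):
    """d/dmu of the Student-t log-density: the score function that replaces
    the Gaussian score in a REINFORCE-style policy-gradient update."""
    z = (action - mu) / scale
    return (df + 1.0) * z / (scale * (df + z ** 2))
```

Note that the Student-t score decays like 1/z for large deviations instead of growing linearly as the Gaussian score does, which is one face of the smoothness issues for heavy-tailed policies mentioned in the abstract.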
Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections
Gaussian noise injections (GNIs) are a family of simple and widely-used
regularisation methods for training neural networks, where one injects additive
or multiplicative Gaussian noise to the network activations at every iteration
of the optimisation algorithm, which is typically chosen as stochastic gradient
descent (SGD). In this paper we focus on the so-called `implicit effect' of
GNIs, which is the effect of the injected noise on the dynamics of SGD. We show
that this effect induces an asymmetric heavy-tailed noise on SGD gradient
updates. In order to model this modified dynamics, we first develop a
Langevin-like stochastic differential equation that is driven by a general
family of asymmetric heavy-tailed noise. Using this model we then formally
prove that GNIs induce an `implicit bias', which varies depending on the
heaviness of the tails and the level of asymmetry. Our empirical results
confirm that different types of neural networks trained with GNIs are
well-modelled by the proposed dynamics and that the implicit effect of these
injections induces a bias that degrades the performance of networks.
Comment: Main paper of 12 pages, followed by appendix
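A minimal sketch of additive GNIs in a toy MLP; the architecture, tanh nonlinearity, and noise level `sigma` are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def mlp_forward_gni(x, weights, sigma=0.1, train=True, rng=None):
    """Forward pass of a small MLP that injects additive Gaussian noise into
    every activation during training; at test time (train=False) no noise is
    added, so only the implicit effect on the learned weights remains."""
    rng = rng or np.random.default_rng()
    h = x
    for W in weights:
        h = np.tanh(h @ W)
        if train:
            h = h + rng.normal(0.0, sigma, size=h.shape)  # the GNI step
    return h
```

Backpropagating through such noisy forward passes is what perturbs the SGD gradient updates; the abstract's point is that the resulting noise on the updates is asymmetric and heavy-tailed even though the injected noise is Gaussian.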
The Heavy-Tail Phenomenon in SGD
In recent years, various notions of capacity and complexity have been
proposed for characterizing the generalization properties of stochastic
gradient descent (SGD) in deep learning. Some of the popular notions that
correlate well with the performance on unseen data are (i) the `flatness' of
the local minimum found by SGD, which is related to the eigenvalues of the
Hessian, (ii) the ratio of the stepsize to the batch-size, which
essentially controls the magnitude of the stochastic gradient noise, and (iii)
the `tail-index', which measures the heaviness of the tails of the network
weights at convergence. In this paper, we argue that these three seemingly
unrelated perspectives for generalization are deeply linked to each other. We
claim that depending on the structure of the Hessian of the loss at the
minimum and the choices of the algorithm parameters (the stepsize and the
batch-size), the SGD iterates will converge to a \emph{heavy-tailed}
stationary distribution. We
rigorously prove this claim in the setting of quadratic optimization: we show
that even in a simple linear regression problem with independent and
identically distributed data whose distribution has finite moments of all
order, the iterates can be heavy-tailed with infinite variance. We further
characterize the behavior of the tails with respect to algorithm parameters,
the dimension, and the curvature. We then translate our results into insights
about the behavior of SGD in deep learning. We support our theory with
experiments conducted on synthetic data, fully connected, and convolutional
neural networks.
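The quadratic setting described above can be simulated directly: SGD on least squares with i.i.d. Gaussian data, where the stepsize-to-batch-size ratio controls how heavy the tails of the stationary iterates become. A sketch under illustrative parameter values (the data model and defaults are assumptions, not the paper's exact configuration):

```python
import numpy as np

def sgd_least_squares(n_iter=2000, lr=0.5, batch=1, d=1, rng=None):
    """Run SGD on linear regression with i.i.d. Gaussian inputs. Even though
    the data are light-tailed with all moments finite, large lr/batch ratios
    make the stationary distribution of the iterates heavy-tailed."""
    rng = rng or np.random.default_rng(0)
    w_star = np.ones(d)                # ground-truth regression vector
    w = np.zeros(d)
    iterates = np.empty((n_iter, d))
    for t in range(n_iter):
        X = rng.normal(size=(batch, d))
        y = X @ w_star + rng.normal(size=batch)
        g = X.T @ (X @ w - y) / batch  # stochastic gradient of 0.5*||Xw - y||^2/batch
        w = w - lr * g
        iterates[t] = w
    return iterates
```

Tracking a tail-index estimate (e.g. a Hill estimator) of these iterates while sweeping `lr` and `batch` is one simple way to observe the transition the abstract describes.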
On the Theoretical Properties of Noise Correlation in Stochastic Optimization
Studying the properties of stochastic noise to optimize complex non-convex
functions has been an active area of research in the field of machine learning.
Prior work has shown that the noise of stochastic gradient descent improves
optimization by overcoming undesirable obstacles in the landscape. Moreover,
injecting artificial Gaussian noise has become a popular idea to quickly escape
saddle points. Indeed, in the absence of reliable gradient information, the
noise is used to explore the landscape, but it is unclear what type of noise is
optimal in terms of exploration ability. In order to narrow this gap in our
knowledge, we study a general type of continuous-time non-Markovian process,
based on fractional Brownian motion, that allows for the increments of the
process to be correlated. This generalizes processes based on Brownian motion,
such as the Ornstein-Uhlenbeck process. We demonstrate how to discretize such
processes, which gives rise to the new algorithm fPGD. This method is a
generalization of the known algorithms PGD and Anti-PGD. We study the
properties of fPGD both theoretically and empirically, demonstrating that it
possesses exploration abilities that, in some cases, are favorable over PGD and
Anti-PGD. These results open the field to novel ways to exploit noise for
training machine learning models.
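The correlated perturbations can be sketched with fractional Gaussian noise, the increment process of fractional Brownian motion: Hurst parameter H = 0.5 recovers i.i.d. Gaussian perturbations (standard PGD-style noise), while H != 0.5 gives positively or negatively correlated increments. The Cholesky-based sampler below is a simple exact method, not the paper's implementation, and the function names and defaults are illustrative:

```python
import numpy as np

def fgn(n, hurst, rng=None):
    """Sample n increments of fractional Brownian motion (fractional Gaussian
    noise) via a Cholesky factor of the exact autocovariance matrix."""
    rng = rng or np.random.default_rng()
    k = np.arange(n)
    # autocovariance of fGn: gamma(k) = 0.5 (|k-1|^{2H} - 2|k|^{2H} + |k+1|^{2H})
    gamma = 0.5 * (np.abs(k - 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + (k + 1.0) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for stability
    return L @ rng.normal(size=n)

def perturbed_gd(grad_fn, w0, lr=0.05, sigma=0.1, hurst=0.5, n_iter=200, rng=None):
    """Gradient descent whose per-step perturbations are correlated fGn
    increments; hurst=0.5 reduces to i.i.d. Gaussian (PGD-style) noise."""
    rng = rng or np.random.default_rng()
    w = np.array(w0, dtype=float)
    noise = np.stack([fgn(n_iter, hurst, rng) for _ in range(w.size)], axis=1)
    for t in range(n_iter):
        w = w - lr * grad_fn(w) + sigma * noise[t]
    return w
```

The Cholesky approach costs O(n^3), which is fine for a sketch; anti-correlated increments (H < 0.5) correspond to the Anti-PGD end of the spectrum mentioned in the abstract.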