30 research outputs found
The Heavy-Tail Phenomenon in SGD
In recent years, various notions of capacity and complexity have been
proposed for characterizing the generalization properties of stochastic
gradient descent (SGD) in deep learning. Some of the popular notions that
correlate well with the performance on unseen data are (i) the `flatness' of
the local minimum found by SGD, which is related to the eigenvalues of the
Hessian, (ii) the ratio of the stepsize $\eta$ to the batch-size $b$, which
essentially controls the magnitude of the stochastic gradient noise, and (iii)
the `tail-index', which measures the heaviness of the tails of the network
weights at convergence. In this paper, we argue that these three seemingly
unrelated perspectives for generalization are deeply linked to each other. We
claim that depending on the structure of the Hessian of the loss at the
minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD
iterates will converge to a \emph{heavy-tailed} stationary distribution. We
rigorously prove this claim in the setting of quadratic optimization: we show
that even in a simple linear regression problem with independent and
identically distributed data whose distribution has finite moments of all
orders, the iterates can be heavy-tailed with infinite variance. We further
characterize the behavior of the tails with respect to algorithm parameters,
the dimension, and the curvature. We then translate our results into insights
about the behavior of SGD in deep learning. We support our theory with
experiments conducted on synthetic data, fully connected, and convolutional
neural networks.
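The linear-regression claim in the abstract above can be probed with a small simulation. The following is a minimal sketch, not the authors' experiment: it assumes one-dimensional Gaussian features and Gaussian label noise, and the helper names `sgd_final_errors` and `hill_tail_index`, the chain counts, and the use of a Hill estimator are all illustrative choices.

```python
import math
import random


def sgd_final_errors(eta, batch, n_chains=2000, n_steps=200, seed=0):
    """Run independent 1-D least-squares SGD chains and return x_k - x_star."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_chains):
        err = 0.0  # deviation from the minimizer x_star; start exactly at it
        for _ in range(n_steps):
            curvature = 0.0  # sum of a_i^2 over the mini-batch
            noise = 0.0      # sum of a_i * eps_i over the mini-batch
            for _ in range(batch):
                a = rng.gauss(0.0, 1.0)    # Gaussian feature (all moments finite)
                eps = rng.gauss(0.0, 1.0)  # Gaussian label noise
                curvature += a * a
                noise += a * eps
            # SGD step on 0.5 * (a * x - y)^2 with y = a * x_star + eps,
            # written in terms of err = x - x_star.
            err = (1.0 - eta * curvature / batch) * err + eta * noise / batch
        errors.append(err)
    return errors


def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest magnitudes."""
    mags = sorted((abs(s) for s in samples), reverse=True)
    threshold = mags[k]
    return k / sum(math.log(m / threshold) for m in mags[:k])


if __name__ == "__main__":
    for eta in (0.05, 1.0):
        alpha = hill_tail_index(sgd_final_errors(eta, batch=1), k=100)
        print(f"eta={eta}: estimated tail index {alpha:.2f}")
```

A smaller estimated tail index indicates heavier tails. In this toy setup the larger ratio `eta / batch` is expected to give the smaller estimate, in line with the abstract's claim about the stepsize-to-batch-size ratio.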
Sharp Concentration Results for Heavy-Tailed Distributions
We obtain concentration and large deviation results for the sums of independent and
identically distributed random variables with heavy-tailed distributions. Our
concentration results are concerned with random variables whose distributions
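satisfy $\mathbb{P}(X > t) \le e^{-I(t)}$, where $I: \mathbb{R} \to \mathbb{R}$ is an increasing function and $I(t)/t \to \alpha \in [0, \infty)$ as $t \to \infty$. Our main theorem can not only recover some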
of the existing results, such as the concentration of the sum of subWeibull
random variables, but it can also produce new results for the sum of random
variables with heavier tails. We show that the concentration inequalities we
obtain are sharp enough to offer large deviation results for the sums of
independent random variables as well. Our analyses, which are based on standard
truncation arguments, simplify, unify, and generalize the existing results on the
concentration and large deviation of heavy-tailed random variables.
Comment: 16 pages
Smoothed Gradient Clipping and Error Feedback for Distributed Optimization under Heavy-Tailed Noise
Motivated by understanding and analysis of large-scale machine learning under
heavy-tailed gradient noise, we study distributed optimization with smoothed
gradient clipping, i.e., in which certain smoothed clipping operators are
applied to the gradients or gradient estimates computed from local clients
prior to further processing. While vanilla gradient clipping has proven
effective in mitigating the impact of heavy-tailed gradient noises in
non-distributed setups, it incurs bias that causes convergence issues in
heterogeneous distributed settings. To address the inherent bias introduced by
gradient clipping, we develop a smoothed clipping operator, and propose a
distributed gradient method equipped with an error feedback mechanism, i.e.,
the clipping operator is applied on the difference between some local gradient
estimator and local stochastic gradient. We establish that, for the first time
in the strongly convex setting with heavy-tailed gradient noises that may not
have finite moments of order greater than one, the proposed distributed
gradient method's mean square error (MSE) converges to zero at a rate
$O(1/t^\iota)$, $\iota \in (0, 1/2)$, where the exponent $\iota$ stays bounded
away from zero as a function of the problem condition number and the first
absolute moment of the noise. Numerical experiments validate our theoretical
findings.
Comment: 25 pages, 2 figures
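The error-feedback idea described in the abstract above (clipping the difference between a local gradient estimator and the local stochastic gradient) can be sketched as follows. This is one hypothetical reading, not the paper's algorithm: the operator `smoothed_clip` (a norm-based smooth saturation with level `c`), the estimator update in `error_feedback_step`, and the plain server averaging are all assumptions made for illustration.

```python
import math


def smoothed_clip(v, c):
    """Smoothly saturate v: close to v for small norms, norm below c for large ones.

    Assumed form: v / sqrt(1 + (||v|| / c)^2). The paper's operator may differ.
    """
    norm = math.sqrt(sum(x * x for x in v))
    scale = 1.0 / math.sqrt(1.0 + (norm / c) ** 2)
    return [scale * x for x in v]


def error_feedback_step(x, estimators, stoch_grads, step, c):
    """One round: clients update their estimators, the server averages and steps."""
    new_estimators = []
    for m, g in zip(estimators, stoch_grads):
        # Clip the difference between the stochastic gradient and the estimator,
        # rather than the gradient itself.
        diff = [g_j - m_j for g_j, m_j in zip(g, m)]
        correction = smoothed_clip(diff, c)
        new_estimators.append([m_j + d_j for m_j, d_j in zip(m, correction)])
    n = len(new_estimators)
    avg = [sum(m[j] for m in new_estimators) / n for j in range(len(x))]
    new_x = [x_j - step * a_j for x_j, a_j in zip(x, avg)]
    return new_x, new_estimators
```

Because each estimator moves by less than `c` per round in this sketch, a single extreme gradient sample cannot move the iterate arbitrarily far, while small differences pass through almost unchanged.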
Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective
This work examines the deep disconnect between existing theoretical analyses
of gradient-based algorithms and the practice of training deep neural networks.
Specifically, we provide numerical evidence that in large-scale neural network
training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the
neural network's weights do not converge to stationary points where the
gradient of the loss is zero. Remarkably, however, we observe that even though
the weights do not converge to stationary points, the progress in minimizing
the loss function halts and training loss stabilizes. Inspired by this
observation, we propose a new perspective based on ergodic theory of dynamical
systems to explain it. Rather than studying the evolution of weights, we study
the evolution of the distribution of weights. We prove convergence of the
distribution of weights to an approximate invariant measure, thereby explaining
how the training loss can stabilize without weights necessarily converging to
stationary points. We further discuss how this perspective can better align
optimization theory with empirical observations in machine learning practice.
Distributionally Robust Learning with Weakly Convex Losses: Convergence Rates and Finite-Sample Guarantees
We consider a distributionally robust stochastic optimization problem and
formulate it as a stochastic two-level composition optimization problem with
the use of the mean--semideviation risk measure. In this setting, we consider a
single time-scale algorithm, involving two versions of the inner function value
tracking: linearized tracking of a continuously differentiable loss function,
and SPIDER tracking of a weakly convex loss function. We adopt the norm of the
gradient of the Moreau envelope as our measure of stationarity and show that
the sample complexity of $\mathcal{O}(\varepsilon^{-3})$ is possible in both
cases, with only the constant larger in the second case. Finally, we
demonstrate the performance of our algorithm with a robust learning example and
a weakly convex, non-smooth regression example.
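For readers unfamiliar with the risk measure named in the abstract above, the sketch below evaluates an empirical mean--semideviation. It assumes the common first-order form, mean plus `kappa` times the expected upper deviation from the mean, with `kappa` in [0, 1]; the function name `mean_semideviation` is hypothetical and the paper's exact formulation may differ.

```python
def mean_semideviation(losses, kappa):
    """Empirical first-order mean--semideviation of a list of loss values.

    Computes mean(Z) + kappa * mean(max(0, Z - mean(Z))).
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa is assumed to lie in [0, 1]")
    n = len(losses)
    mean = sum(losses) / n
    # The inner expectation (the mean) sits inside the outer one,
    # which is what makes this a two-level composition.
    upper_dev = sum(max(0.0, z - mean) for z in losses) / n
    return mean + kappa * upper_dev
```

For example, the losses `[1, 2, 3, 6]` have mean 3 and upper semideviation 0.75, so `kappa = 0.5` gives 3.375, while `kappa = 0` reduces to the plain mean.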
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Stochastic differential equations (SDEs) have recently been shown to
characterize well the dynamics of training machine learning models with SGD. This
provides two opportunities for better understanding the generalization
behaviour of SGD through its SDE approximation. First, under the SDE
characterization, SGD may be regarded as the full-batch gradient descent with
Gaussian gradient noise. This allows the application of the generalization
bounds developed by Xu & Raginsky (2017) to analyzing the generalization
behaviour of SGD, resulting in upper bounds in terms of the mutual information
between the training set and the training trajectory. Second, under mild
assumptions, it is possible to obtain an estimate of the steady-state weight
distribution of SDE. Using this estimate, we apply the PAC-Bayes-like
information-theoretic bounds developed in both Xu & Raginsky (2017) and Negrea
et al. (2019) to obtain generalization upper bounds in terms of the KL
divergence of the steady-state weight distribution of SGD with respect to
a prior distribution. Among various options, one may choose the prior as the
steady-state weight distribution obtained by SGD on the same training set but
with one example held out. In this case, the bound can be elegantly expressed
using the influence function (Koh & Liang, 2017), which suggests that the
generalization of the SGD is related to the stability of SGD. Various insights
are presented along the development of these bounds, which are subsequently
validated numerically.
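The view of SGD as full-batch gradient descent with Gaussian gradient noise, and the notion of a steady-state weight distribution, can be illustrated on a toy problem. The sketch below assumes a one-dimensional quadratic loss `0.5 * curvature * x**2` and a fixed noise scale `sigma`; both, along with the function names, are illustrative assumptions and not the paper's setting. For this linear recursion the stationary variance has a closed form, which the simulation can be compared against.

```python
import random


def noisy_gd_variance(eta, curvature, sigma, n_steps=100_000, burn_in=1_000, seed=0):
    """Simulate gradient descent with Gaussian gradient noise; return the variance of x."""
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    total_sq = 0.0
    kept = 0
    for step in range(n_steps):
        grad = curvature * x  # exact gradient of 0.5 * curvature * x^2
        x = x - eta * (grad + sigma * rng.gauss(0.0, 1.0))
        if step >= burn_in:  # discard the transient before averaging
            total += x
            total_sq += x * x
            kept += 1
    mean = total / kept
    return total_sq / kept - mean * mean


def stationary_variance(eta, curvature, sigma):
    """Closed-form stationary variance of x <- (1 - eta*curvature)*x - eta*sigma*xi."""
    contraction = 1.0 - eta * curvature
    return (eta * sigma) ** 2 / (1.0 - contraction ** 2)
```

The weights never settle at the minimizer here, yet their distribution does settle, and it is this steady-state distribution that the KL-based bounds in the abstract are stated in terms of.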
SGD with a Constant Large Learning Rate Can Converge to Local Maxima
Previous works on stochastic gradient descent (SGD) often focus on its
success. In this work, we construct worst-case optimization problems
illustrating that, when not in the regimes that the previous works often
assume, SGD can exhibit many strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that (1) SGD
converges to local maxima, (2) SGD escapes saddle points arbitrarily slowly,
(3) SGD prefers sharp minima over flat ones, and (4) AMSGrad converges to local
maxima. We also realize results in a minimal neural network-like example. Our
results highlight the importance of simultaneously analyzing the minibatch
sampling, discrete-time update rules, and realistic landscapes to understand
the role of SGD in deep learning.
Comment: ICLR 2022 Spotlight