The Heavy-Tail Phenomenon in SGD
In recent years, various notions of capacity and complexity have been
proposed for characterizing the generalization properties of stochastic
gradient descent (SGD) in deep learning. Some of the popular notions that
correlate well with the performance on unseen data are (i) the `flatness' of
the local minimum found by SGD, which is related to the eigenvalues of the
Hessian, (ii) the ratio of the stepsize to the batch-size, which
essentially controls the magnitude of the stochastic gradient noise, and (iii)
the `tail-index', which measures the heaviness of the tails of the network
weights at convergence. In this paper, we argue that these three seemingly
unrelated perspectives for generalization are deeply linked to each other. We
claim that depending on the structure of the Hessian of the loss at the
minimum and the choices of the algorithm parameters (the stepsize and the batch-size), the SGD
iterates will converge to a \emph{heavy-tailed} stationary distribution. We
rigorously prove this claim in the setting of quadratic optimization: we show
that even in a simple linear regression problem with independent and
identically distributed data whose distribution has finite moments of all
orders, the iterates can be heavy-tailed with infinite variance. We further
characterize the behavior of the tails with respect to algorithm parameters,
the dimension, and the curvature. We then translate our results into insights
about the behavior of SGD in deep learning. We support our theory with
experiments conducted on synthetic data, fully connected, and convolutional
neural networks.
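The claimed mechanism can be seen in a minimal simulation sketch (not the paper's proofs): SGD on one-dimensional linear regression with Gaussian data follows a Kesten-type random recursion, w_{k+1} = (1 - eta * H_k) * w_k + eta * q_k, whose stationary law becomes heavy-tailed when the stepsize-to-batch-size ratio is large, even though the data has finite moments of all orders. The stepsizes, batch sizes, and the Hill-estimator parameters below are illustrative choices.

```python
import numpy as np

# Illustrative simulation: SGD iterates on 1-d linear regression can become
# heavy-tailed for a large stepsize/batch-size ratio, even with Gaussian data.
rng = np.random.default_rng(0)

def sgd_tail_samples(eta, b, n_iters=120_000, burn_in=20_000):
    """Collect centered SGD iterates after an (approximate) burn-in period."""
    w, out = 0.0, []
    for k in range(n_iters):
        x = rng.standard_normal(b)
        y = x + rng.standard_normal(b)     # y_i = w* x_i + eps_i with w* = 1
        g = np.mean(x * (w * x - y))       # minibatch gradient of squared loss
        w -= eta * g
        if k >= burn_in:
            out.append(w - 1.0)
    return np.abs(np.asarray(out))

def hill_tail_index(samples, k=1000):
    """Hill estimator of the tail index: smaller alpha means heavier tails."""
    a = np.sort(samples)[::-1]
    return 1.0 / np.mean(np.log(a[:k] / a[k]))

alpha_big_ratio = hill_tail_index(sgd_tail_samples(eta=0.9, b=1))    # large eta/b
alpha_small_ratio = hill_tail_index(sgd_tail_samples(eta=0.2, b=5))  # small eta/b
```

With these settings the estimated tail index for the large eta/b run is markedly smaller (heavier tails) than for the small eta/b run, matching the qualitative prediction above.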
Differentially Private Accelerated Optimization Algorithms
We present two classes of differentially private optimization algorithms
derived from the well-known accelerated first-order methods. The first
algorithm is inspired by Polyak's heavy ball method and employs a smoothing
approach to decrease the accumulated noise on the gradient steps required for
differential privacy. The second class of algorithms is based on Nesterov's
accelerated gradient method and its recent multi-stage variant. We propose a
noise dividing mechanism for the iterations of Nesterov's method in order to
improve the error behavior of the algorithm. Convergence rate analyses are
provided for both the heavy-ball method and Nesterov's accelerated gradient
method with the help of dynamical-systems analysis techniques. Finally, we conclude
with our numerical experiments showing that the presented algorithms have
advantages over the well-known differentially private algorithms.
Comment: 28 pages, 4 figures
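A hedged sketch of the gradient-perturbation idea behind a differentially private heavy-ball method (this is not the paper's smoothing mechanism): each gradient is clipped to norm C and perturbed with Gaussian noise before the momentum update. The clipping norm C, noise multiplier sigma, stepsize, and momentum values are illustrative; a real deployment would calibrate sigma with a privacy accountant.

```python
import numpy as np

# Sketch: heavy-ball momentum with clipped, Gaussian-perturbed gradients,
# the standard gradient-perturbation recipe for differential privacy.
rng = np.random.default_rng(1)

def dp_heavy_ball(grad, x0, eta=0.1, beta=0.9, C=1.0, sigma=0.5, n_steps=400):
    x, x_prev = x0.astype(float), x0.astype(float)
    for _ in range(n_steps):
        g = grad(x)
        g = g * min(1.0, C / (np.linalg.norm(g) + 1e-12))  # clip to norm C
        g = g + sigma * C * rng.standard_normal(g.shape)   # Gaussian mechanism
        x, x_prev = x + beta * (x - x_prev) - eta * g, x   # heavy-ball update
    return x

# Strongly convex test problem: f(x) = 0.5 * ||x - x*||^2 with minimizer x*.
x_star = np.ones(3)
x_out = dp_heavy_ball(lambda x: x - x_star, np.zeros(3))
```

The iterates converge to a neighborhood of the minimizer whose radius is set by the injected noise; reducing that noise floor (here via smoothing or noise dividing) is exactly what the proposed mechanisms target.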
Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Algorithmic stability is an important notion that has proven powerful for
deriving generalization bounds for practical algorithms. The last decade has
witnessed an increasing number of stability bounds for different algorithms
applied on different classes of loss functions. While these bounds have
illuminated various properties of optimization algorithms, the analysis of each
case typically required a different proof technique with significantly
different mathematical tools. In this study, we make a novel connection between
learning theory and applied probability and introduce a unified guideline for
proving Wasserstein stability bounds for stochastic optimization algorithms. We
illustrate our approach on stochastic gradient descent (SGD) and we obtain
time-uniform stability bounds (i.e., the bound does not increase with the
number of iterations) for strongly convex losses and non-convex losses with
additive noise, where we recover similar results to the prior art or extend
them to more general cases by using a single proof technique. Our approach is
flexible and can be generalized to other popular optimizers, as it mainly
requires developing Lyapunov functions, which are often readily available in
the literature. It also illustrates that ergodicity is an important component
for obtaining time-uniform bounds -- which might not be achieved for convex or
non-convex losses unless additional noise is injected into the iterates. Finally,
we slightly stretch our analysis technique and prove time-uniform bounds for
SGD under convex and non-convex losses (without additional additive noise),
which, to our knowledge, is novel.
Comment: 49 pages, NeurIPS 202
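The stability notion above can be illustrated numerically (this is a toy experiment, not the paper's Wasserstein argument): run SGD with the same minibatch index sequence on two datasets that differ in a single point and track the distance between the two trajectories. For a strongly convex loss, ridge regression in this sketch, the distance stays bounded uniformly over iterations rather than growing with time; the dataset size, stepsize, and regularization below are illustrative.

```python
import numpy as np

# Toy illustration of time-uniform algorithmic stability for SGD on a
# strongly convex loss: two runs on neighbouring datasets stay close.
rng = np.random.default_rng(2)

n, d, eta, lam, T = 100, 5, 0.05, 0.1, 3000
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.standard_normal(d), 0.0   # neighbouring dataset: one point replaced

def grad(w, xi, yi):
    # Per-sample gradient of 0.5*(xi.w - yi)^2 + 0.5*lam*||w||^2 (strongly convex).
    return xi * (xi @ w - yi) + lam * w

w1, w2 = np.zeros(d), np.zeros(d)
dists = []
for t in range(T):
    i = rng.integers(n)                      # shared randomness across both runs
    w1 = w1 - eta * grad(w1, X[i], y[i])
    w2 = w2 - eta * grad(w2, X2[i], y2[i])
    dists.append(np.linalg.norm(w1 - w2))
```

The recorded distances fluctuate around a small stationary level instead of growing with the iteration count, which is the qualitative content of a time-uniform stability bound.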