The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects
Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has recently attracted considerable attention. Along this line, we study a general form of gradient-based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. By investigating these general dynamics, we analyze how SGD escapes from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima by measuring the alignment between the noise covariance and the curvature of the loss function.
Based on this indicator, two conditions are established to show which type of
noise structure is superior to isotropic noise in terms of escaping efficiency.
We further show that the anisotropic noise in SGD satisfies the two conditions,
and thus helps to escape from sharp and poor minima effectively, towards more
stable and flat minima that typically generalize well. We systematically design
various experiments to verify the benefits of the anisotropic noise, compared
with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics).
Comment: ICML 2019 camera ready
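For concreteness, the unified dynamics referred to above can be sketched as follows (the notation, step size \(\eta\), noise covariance \(\Sigma_t\), and minibatch \(B\), is chosen here for illustration and may differ from the paper's):
\[
\theta_{t+1} = \theta_t - \eta\big(\nabla L(\theta_t) + \epsilon_t\big), \qquad \epsilon_t \sim \mathcal{N}(0, \Sigma_t),
\]
where \(\Sigma_t = \sigma^2 I\) recovers full gradient descent plus isotropic diffusion (Langevin dynamics), while \(\Sigma_t \approx \tfrac{1}{|B|}\operatorname{Cov}_x\big[\nabla \ell(\theta_t; x)\big]\) recovers minibatch SGD, whose noise covariance is anisotropic; the escaping-efficiency indicator measures how well this covariance aligns with the loss curvature.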
On the Global Convergence of Continuous-Time Stochastic Heavy-Ball Method for Nonconvex Optimization
We study the convergence behavior of the stochastic heavy-ball method with a
small stepsize. Under a change of time scale, we approximate the discrete
method by a stochastic differential equation that models small random
perturbations of a coupled system of nonlinear oscillators. We rigorously show
that the perturbed system converges to a local minimum in logarithmic time. This indicates that the diffusion process approximating the stochastic heavy-ball method needs only a time proportional to the square root of the inverse stepsize (up to a logarithmic factor) to escape from all saddle points. This result may suggest fast convergence of its discrete-time counterpart. Our theoretical results are validated by numerical experiments.
Comment: accepted at IEEE International Conference on Big Data in 201
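As a sketch of the continuous-time model described above (with friction coefficient \(\gamma\), stepsize \(s\), objective \(f\), and Brownian motion \(W_t\) chosen here as illustrative notation, not necessarily the paper's), the stochastic heavy-ball method is approximated by a second-order SDE of the form
\[
d\theta_t = v_t\,dt, \qquad dv_t = -\big(\gamma v_t + \nabla f(\theta_t)\big)\,dt + \sqrt{s}\,\sigma(\theta_t)\,dW_t,
\]
i.e. a coupled system of nonlinear oscillators subject to small, stepsize-dependent random perturbations; the escape-time claim concerns how quickly this diffusion leaves neighborhoods of saddle points of \(f\).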
Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learning
but are not well understood. In this paper, we seek a crisp, clean and precise
characterization of their behavior in nonconvex settings. To this end, we first
provide a novel view of adaptive methods as preconditioned SGD, where the
preconditioner is estimated in an online manner. By studying the preconditioner
on its own, we elucidate its purpose: it rescales the stochastic gradient noise
to be isotropic near stationary points, which helps escape saddle points.
Furthermore, we show that adaptive methods can efficiently estimate the
aforementioned preconditioner. By gluing together these two components, we
provide the first (to our knowledge) second-order convergence result for any
adaptive method. The key insight from our analysis is that, compared to SGD,
adaptive methods escape saddle points faster, and can converge faster overall
to second-order stationary points.
Comment: Updated Theorem 4.1 and its proof to use martingale concentration bounds, i.e. the matrix Freedman inequality.
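As a minimal illustration of the "adaptive methods as preconditioned SGD" view described above, the sketch below (in Python, with hypothetical names) writes one RMSProp step as an SGD step scaled coordinate-wise by a diagonal preconditioner built from an online second-moment estimate; Adam's momentum and bias correction are omitted, and this is not the paper's exact construction.

import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.999, eps=1e-8):
    # Online estimate of the (diagonal) second moment of the stochastic gradient.
    v = beta * v + (1.0 - beta) * grad ** 2
    # Diagonal preconditioner; near stationary points it rescales the gradient
    # noise coordinate-wise, pushing it towards an isotropic shape.
    precond = 1.0 / (np.sqrt(v) + eps)
    # An ordinary SGD step, preconditioned coordinate-wise.
    return theta - lr * precond * grad, v

# Toy usage: noisy gradients of the quadratic 0.5 * ||theta||^2.
rng = np.random.default_rng(0)
theta, v = np.ones(5), np.zeros(5)
for _ in range(100):
    grad = theta + 0.1 * rng.normal(size=theta.shape)
    theta, v = rmsprop_step(theta, grad, v)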