The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects
Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has recently attracted considerable attention. Along this line, we study a general form of gradient-based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. By investigating these general dynamics, we analyze how SGD escapes from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima by measuring the alignment between the noise covariance and the curvature of the loss function.
Based on this indicator, two conditions are established to show which type of
noise structure is superior to isotropic noise in terms of escaping efficiency.
We further show that the anisotropic noise in SGD satisfies the two conditions,
and thus helps to escape from sharp and poor minima effectively, towards more
stable and flat minima that typically generalize well. We systematically design
various experiments to verify the benefits of the anisotropic noise, compared
with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics).
Comment: ICML 2019 camera ready
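For concreteness, the unified dynamics referred to above can be sketched as follows (the notation, step size \(\eta\), noise covariance \(\Sigma_t\), and minibatch \(B\), is chosen here for illustration and may differ from the paper's):
\[
\theta_{t+1} = \theta_t - \eta\big(\nabla L(\theta_t) + \epsilon_t\big), \qquad \epsilon_t \sim \mathcal{N}(0, \Sigma_t),
\]
where \(\Sigma_t = \sigma^2 I\) recovers full gradient descent plus isotropic diffusion (Langevin dynamics), while \(\Sigma_t \approx \tfrac{1}{|B|}\operatorname{Cov}_x\big[\nabla \ell(\theta_t; x)\big]\) recovers minibatch SGD, whose noise covariance is anisotropic; the escaping-efficiency indicator measures how well this covariance aligns with the loss curvature.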
On the Global Convergence of Continuous-Time Stochastic Heavy-Ball Method for Nonconvex Optimization
We study the convergence behavior of the stochastic heavy-ball method with a
small stepsize. Under a change of time scale, we approximate the discrete
method by a stochastic differential equation that models small random
perturbations of a coupled system of nonlinear oscillators. We rigorously show
that the perturbed system converges to a local minimum in logarithmic time. This indicates that the diffusion process approximating the stochastic heavy-ball method needs only a time proportional to the square root of the inverse stepsize (up to a logarithmic factor) to escape from all saddle points. This result may suggest fast convergence of its discrete-time counterpart. Our theoretical results are validated by numerical experiments.
Comment: accepted at IEEE International Conference on Big Data in 201
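As a sketch of the continuous-time model described above (with friction coefficient \(\gamma\), stepsize \(s\), objective \(f\), and Brownian motion \(W_t\) chosen here as illustrative notation, not necessarily the paper's), the stochastic heavy-ball method is approximated by a second-order SDE of the form
\[
d\theta_t = v_t\,dt, \qquad dv_t = -\big(\gamma v_t + \nabla f(\theta_t)\big)\,dt + \sqrt{s}\,\sigma(\theta_t)\,dW_t,
\]
i.e. a coupled system of nonlinear oscillators subject to small, stepsize-dependent random perturbations; the escape-time claim concerns how quickly this diffusion leaves neighborhoods of saddle points of \(f\).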
Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learning
but are not well understood. In this paper, we seek a crisp, clean and precise
characterization of their behavior in nonconvex settings. To this end, we first
provide a novel view of adaptive methods as preconditioned SGD, where the
preconditioner is estimated in an online manner. By studying the preconditioner
on its own, we elucidate its purpose: it rescales the stochastic gradient noise
to be isotropic near stationary points, which helps escape saddle points.
Furthermore, we show that adaptive methods can efficiently estimate the
aforementioned preconditioner. By gluing together these two components, we
provide the first (to our knowledge) second-order convergence result for any
adaptive method. The key insight from our analysis is that, compared to SGD,
adaptive methods escape saddle points faster, and can converge faster overall
to second-order stationary points.
Comment: Updated Theorem 4.1 and its proof to use martingale concentration bounds, i.e. the matrix Freedman inequality.
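As a minimal illustration of the "adaptive methods as preconditioned SGD" view described above, the sketch below (in Python, with hypothetical names) writes one RMSProp step as an SGD step scaled coordinate-wise by a diagonal preconditioner built from an online second-moment estimate; Adam's momentum and bias correction are omitted, and this is not the paper's exact construction.

import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.999, eps=1e-8):
    # Online estimate of the (diagonal) second moment of the stochastic gradient.
    v = beta * v + (1.0 - beta) * grad ** 2
    # Diagonal preconditioner; near stationary points it rescales the gradient
    # noise coordinate-wise, pushing it towards an isotropic shape.
    precond = 1.0 / (np.sqrt(v) + eps)
    # An ordinary SGD step, preconditioned coordinate-wise.
    return theta - lr * precond * grad, v

# Toy usage: noisy gradients of the quadratic 0.5 * ||theta||^2.
rng = np.random.default_rng(0)
theta, v = np.ones(5), np.zeros(5)
for _ in range(100):
    grad = theta + 0.1 * rng.normal(size=theta.shape)
    theta, v = rmsprop_step(theta, grad, v)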