Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learning
but are not well understood. In this paper, we seek a crisp, clean and precise
characterization of their behavior in nonconvex settings. To this end, we first
provide a novel view of adaptive methods as preconditioned SGD, where the
preconditioner is estimated in an online manner. By studying the preconditioner
on its own, we elucidate its purpose: it rescales the stochastic gradient noise
to be isotropic near stationary points, which helps escape saddle points.
Furthermore, we show that adaptive methods can efficiently estimate the
aforementioned preconditioner. By gluing together these two components, we
provide the first (to our knowledge) second-order convergence result for any
adaptive method. The key insight from our analysis is that, compared to SGD,
adaptive methods escape saddle points faster, and can converge faster overall
to second-order stationary points.

Comment: Updated Theorem 4.1 and its proof to use martingale concentration bounds, i.e., the matrix Freedman inequality.
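To make the preconditioned-SGD view concrete, here is a minimal sketch (not the authors' code; the function name and hyper-parameters are illustrative) of RMSProp written explicitly as SGD with a diagonal preconditioner estimated online from the squared stochastic gradients:

```python
import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-3, beta=0.999,
                                  eps=1e-8, steps=1000):
    """RMSProp viewed as preconditioned SGD: x <- x - lr * D @ g, where
    D = diag(1 / (sqrt(v) + eps)) and v is an online (exponential moving
    average) estimate of the squared stochastic gradients."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)                      # stochastic gradient sample
        v = beta * v + (1 - beta) * g**2    # online preconditioner estimate
        x -= lr * g / (np.sqrt(v) + eps)    # preconditioned SGD step
    return x
```

Dividing by sqrt(v) shrinks steps along coordinates with large gradient variance and stretches them along quiet ones, which is the sense in which the stochastic gradient noise is rescaled toward isotropic.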
Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
While first-order optimization methods such as stochastic gradient descent
(SGD) are popular in machine learning (ML), they come with well-known
deficiencies, including relatively slow convergence, sensitivity to the
settings of hyper-parameters such as learning rate, stagnation at high training
errors, and difficulty in escaping flat regions and saddle points. These issues
are particularly acute in highly non-convex settings such as those arising in
neural networks. Motivated by this, there has been recent interest in
second-order methods that aim to alleviate these shortcomings by capturing
curvature information. In this paper, we report detailed empirical evaluations
of a class of Newton-type methods, namely sub-sampled variants of trust region
(TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex
ML problems. In doing so, we demonstrate that these methods can not only be computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but are also highly robust to hyper-parameter settings. Further, in contrast to SGD with momentum, we show that the manner in which these Newton-type methods employ curvature information allows them to seamlessly escape flat regions and saddle points.

Comment: 21 pages, 11 figures. Restructured the paper and added experiments.
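As a rough illustration of the sub-sampled Newton-type idea (a sketch under simplifying assumptions, not the TR/ARC solvers evaluated in the paper), one trust-region step can be taken at the Cauchy point using only Hessian-vector products on a random subsample; `grad_fn` and `hvp_fn` are hypothetical helpers computing the minibatch gradient and Hessian-vector product:

```python
import numpy as np

def subsampled_tr_step(x, grad_fn, hvp_fn, data, radius,
                       batch_size=256, rng=np.random.default_rng(0)):
    """One sub-sampled trust-region step via the Cauchy point.
    Curvature enters only through Hessian-vector products on a
    subsample, so the Hessian is never formed explicitly."""
    idx = rng.choice(len(data), size=min(batch_size, len(data)), replace=False)
    g = grad_fn(x, data[idx])              # sub-sampled gradient
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return x                           # already first-order stationary
    gHg = g @ hvp_fn(x, g, data[idx])      # curvature along the gradient
    if gHg <= 0:                           # negative curvature: step all the
        tau = 1.0                          # way to the trust-region boundary
    else:
        tau = min(gnorm**3 / (radius * gHg), 1.0)
    return x - tau * (radius / gnorm) * g  # Cauchy-point step
```

When the curvature term gHg is non-positive (a flat or saddle direction), the step goes to the trust-region boundary rather than vanishing with the gradient, which is one mechanism by which such methods avoid stalling where SGD does. In practice the subproblem is solved more accurately (e.g., Steihaug-CG) and the radius is adapted from the ratio of actual to predicted decrease.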
Population Descent: A Natural-Selection Based Hyper-Parameter Tuning Framework
First-order gradient descent has been the base of the most successful
optimization algorithms ever implemented. On supervised learning problems with
very high dimensionality, such as neural network optimization, it is almost
always the algorithm of choice, mainly due to its memory and computational
efficiency. However, it is a classical result in optimization that gradient
descent converges to local minima on non-convex functions. Even more
importantly, in certain high-dimensional cases, escaping the plateaus of large
saddle points becomes intractable. On the other hand, black-box optimization
methods are not sensitive to the local structure of a loss function's landscape
but suffer from the curse of dimensionality. Memetic algorithms aim to
combine the benefits of both. Inspired by this, we present Population Descent,
a memetic algorithm focused on hyperparameter optimization. We show that an
adaptive m-elitist selection approach combined with a normalized-fitness-based
randomization scheme outperforms more complex state-of-the-art algorithms by up
to 13% on common benchmark tasks.
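A minimal toy version of the described loop might look as follows (a sketch reconstructed from the abstract alone; `loss_fn`, `grad_fn`, and all hyper-parameters are illustrative, not the paper's implementation):

```python
import numpy as np

def population_descent(loss_fn, grad_fn, dim, pop_size=10, m_elite=3,
                       lr=0.01, local_steps=20, generations=50,
                       sigma=0.5, rng=np.random.default_rng(0)):
    """Toy memetic loop: local gradient descent on every member,
    m-elitist selection, then normalized-fitness-based randomization
    (worse members are perturbed more)."""
    pop = rng.normal(size=(pop_size, dim))
    for _ in range(generations):
        for i in range(pop_size):          # local first-order refinement
            for _ in range(local_steps):
                pop[i] -= lr * grad_fn(pop[i])
        fitness = np.array([loss_fn(p) for p in pop])
        pop = pop[np.argsort(fitness)]     # m-elitist: best m survive intact
        f = np.sort(fitness)
        fnorm = (f - f[0]) / (f[-1] - f[0] + 1e-12)  # 0 = best, 1 = worst
        for i in range(m_elite, pop_size): # randomize the non-elites
            parent = pop[rng.integers(m_elite)]
            pop[i] = parent + sigma * fnorm[i] * rng.normal(size=dim)
    return pop[0]
```

Scaling the perturbation by normalized fitness makes poorly performing members explore broadly while near-elite members search locally, combining gradient efficiency with black-box robustness as the abstract describes.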
A Generic Approach for Escaping Saddle Points
A central challenge to using first-order methods for optimizing nonconvex
problems is the presence of saddle points. First-order methods often get stuck
at saddle points, greatly deteriorating their performance. Typically, to escape
from saddles one has to use second-order methods. However, most works on
second-order methods rely extensively on expensive Hessian-based computations,
making them impractical in large-scale settings. To tackle this challenge, we
introduce a generic framework that minimizes Hessian-based computations while
at the same time provably converging to second-order critical points. Our
framework carefully alternates between a first-order and a second-order
subroutine, using the latter only close to saddle points, and yields
convergence results competitive to the state-of-the-art. Empirical results
suggest that our strategy also enjoys good practical performance.
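The alternation the abstract describes can be sketched as follows (illustrative only; the paper's actual subroutines and switching rule differ). Cheap gradient steps run while the gradient is large, and Hessian-vector products (`hvp_fn`, a hypothetical helper) are paid for only near candidate saddles:

```python
import numpy as np

def first_then_second_order(x, grad_fn, hvp_fn, gtol=1e-4, lr=0.1,
                            nc_step=0.1, power_iters=50, shift=10.0,
                            max_rounds=100, rng=np.random.default_rng(0)):
    """Alternate a first-order loop with a second-order escape step."""
    for _ in range(max_rounds):
        g = grad_fn(x)
        while np.linalg.norm(g) > gtol:      # first-order subroutine
            x = x - lr * g
            g = grad_fn(x)
        # Second-order subroutine: power iteration on (shift*I - H)
        # amplifies the most negative eigen-direction of the Hessian,
        # assuming shift exceeds H's largest eigenvalue (a smoothness bound).
        v = rng.normal(size=x.shape)
        v /= np.linalg.norm(v)
        for _ in range(power_iters):
            v = shift * v - hvp_fn(x, v)
            v /= np.linalg.norm(v)
        if v @ hvp_fn(x, v) >= -gtol:        # no negative curvature left:
            return x                         # approx. second-order stationary
        sign = 1.0 if v @ g <= 0 else -1.0   # pick the descent sign along +/-v
        x = x + nc_step * sign * v
    return x
```

The design point matches the abstract: Hessian information is consulted only when the gradient is small, so the per-iteration cost stays close to a first-order method away from saddles.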
On the Global Convergence of Continuous-Time Stochastic Heavy-Ball Method for Nonconvex Optimization
We study the convergence behavior of the stochastic heavy-ball method with a
small stepsize. Under a change of time scale, we approximate the discrete
method by a stochastic differential equation that models small random
perturbations of a coupled system of nonlinear oscillators. We rigorously show
that the perturbed system converges to a local minimum in a logarithmic time.
This indicates that for the diffusion process that approximates the stochastic
heavy-ball method, it takes (up to a logarithmic factor) only time linear in the square root of the inverse stepsize to escape from all saddle points. These results suggest fast convergence of its discrete-time counterpart. Our theoretical results are validated by numerical experiments.

Comment: accepted at IEEE International Conference on Big Data in 201
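A direct Euler-Maruyama simulation of the limiting diffusion is straightforward (a sketch of the SDE as described in the abstract; the function name and parameter values are illustrative):

```python
import numpy as np

def heavy_ball_diffusion(grad_fn, x0, gamma=1.0, sigma=0.1, dt=1e-3,
                         steps=200_000, rng=np.random.default_rng(0)):
    """Euler-Maruyama simulation of the heavy-ball SDE
        dX = V dt,   dV = -(gamma * V + grad f(X)) dt + sigma dW,
    i.e., small random perturbations of coupled nonlinear oscillators."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)  # Brownian increment
        v += -(gamma * v + grad_fn(x)) * dt + sigma * dw  # momentum dynamics
        x += v * dt                                       # position update
    return x
```

For example, starting at the strict saddle of f(x) = x_0^2 - x_1^2 (so `grad_fn = lambda x: np.array([2*x[0], -2*x[1]])`), the noise term pushes the trajectory off the unstable manifold and the x_1 coordinate escapes quickly, illustrating the fast saddle-escape behavior the analysis quantifies.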