15,779 research outputs found
How to Escape Saddle Points Efficiently
This paper shows that a perturbed form of gradient descent converges to a
second-order stationary point in a number of iterations which depends only
poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The
convergence rate of this procedure matches the well-known convergence rate of
gradient descent to first-order stationary points, up to log factors. When all
saddle points are non-degenerate, all second-order stationary points are local
minima, and our result thus shows that perturbed gradient descent can escape
saddle points almost for free. Our results can be directly applied to many
machine learning applications, including deep learning. As a particular
concrete example of such an application, we show that our results can be used
directly to establish sharp global convergence rates for matrix factorization.
Our results rely on a novel characterization of the geometry around saddle
points, which may be of independent interest to the non-convex optimization
community.
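As a concrete illustration of the mechanism, here is a minimal Python sketch of the perturbed-gradient-descent idea: ordinary gradient steps, plus a small random perturbation whenever the gradient is small. The step size, threshold, radius, and cooldown below are illustrative placeholders, not the constants from the paper (there, the perturbation is drawn uniformly from a ball whose radius is tied to the smoothness and Hessian-Lipschitz constants of the objective).

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=1e-2, g_thresh=1e-3,
                               radius=1e-2, cooldown=50, max_iter=10_000,
                               rng=None):
    """Gradient descent that injects a small random perturbation whenever the
    gradient is small and no perturbation was added recently, so the iterate
    can roll off nearly flat saddle regions instead of stalling at them."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -cooldown
    for t in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= cooldown:
            x = x + rng.uniform(-radius, radius, size=x.shape)
            last_perturb = t
            g = grad(x)              # re-evaluate the gradient after perturbing
        x = x - eta * g
    return x
```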
Efficiently escaping saddle points on manifolds
Smooth, non-convex optimization problems on Riemannian manifolds occur in
machine learning as a result of orthonormality, rank or positivity constraints.
First- and second-order necessary optimality conditions state that the
Riemannian gradient must be zero, and the Riemannian Hessian must be positive
semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient
descent (PGD) for optimization on linear spaces [How to Escape Saddle Points
Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points
Efficiently (2019)], we propose a version of perturbed Riemannian gradient
descent (PRGD) to show that necessary optimality conditions can be met
approximately with high probability, without evaluating the Hessian.
Specifically, for an arbitrary Riemannian manifold $\mathcal{M}$ of dimension
$d$, a sufficiently smooth (possibly non-convex) objective function $f$, and
under weak conditions on the retraction chosen to move on the manifold, with
high probability, our version of PRGD produces a point with gradient smaller
than $\epsilon$ and Hessian within $\sqrt{\epsilon}$ of being positive
semidefinite in $O((\log d)^4 / \epsilon^2)$ gradient queries. This matches
the complexity of PGD in the Euclidean case. Crucially, the dependence on
dimension is low. This matters for large-scale applications including PCA and
low-rank matrix completion, which both admit natural formulations on manifolds.
The key technical idea is to generalize PRGD with a distinction between two
types of gradient steps: "steps on the manifold" and "perturbed steps in a
tangent space of the manifold." Ultimately, this distinction makes it possible
to extend Jin et al.'s analysis seamlessly. Comment: 18 pages, NeurIPS 2019
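As an illustration of the two step types, the following is a hedged Python sketch on the unit sphere, a simple Riemannian manifold with the normalization map as retraction. The perturbation episode (a few gradient steps carried out inside the current tangent space before retracting back) is a loose reading of the distinction described above; the paper's actual tangent-space steps, parameter choices, and guarantees differ.

```python
import numpy as np

def retract(x):
    """Retraction onto the unit sphere by normalization."""
    return x / np.linalg.norm(x)

def tproj(x, v):
    """Orthogonal projection of v onto the tangent space of the sphere at x."""
    return v - (x @ v) * x

def prgd_sphere(egrad, x0, eta=1e-2, g_thresh=1e-3, radius=1e-2,
                tangent_steps=20, max_iter=5000, rng=None):
    """Sketch of perturbed Riemannian gradient descent on the unit sphere:
    ordinary iterations take a projected-gradient step followed by a
    retraction ("steps on the manifold"); when the Riemannian gradient is
    small, a random tangent perturbation is added and a few steps are taken
    inside that tangent space before retracting back ("perturbed steps in a
    tangent space")."""
    rng = np.random.default_rng() if rng is None else rng
    x = retract(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        rgrad = tproj(x, egrad(x))
        if np.linalg.norm(rgrad) > g_thresh:
            x = retract(x - eta * rgrad)         # step on the manifold
        else:
            d = tproj(x, rng.normal(size=x.shape))
            v = radius * d / np.linalg.norm(d)   # random tangent perturbation
            for _ in range(tangent_steps):       # perturbed steps in T_x M
                v = v - eta * tproj(x, egrad(retract(x + v)))
            x = retract(x + v)
    return x
```

A typical test problem is minimizing x.T @ A @ x / 2 over the sphere with egrad = lambda x: A @ x, where eigenvectors of a symmetric matrix A associated with intermediate eigenvalues are saddle points.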
Gradient Descent Can Take Exponential Time to Escape Saddle Points
Although gradient descent (GD) almost always escapes saddle points
asymptotically [Lee et al., 2016], this paper shows that even with fairly
natural random initialization schemes and non-pathological functions, GD can be
significantly slowed down by saddle points, taking exponential time to escape.
On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et
al., 2017] is not slowed down by saddle points - it can find an approximate
local minimizer in polynomial time. This result implies that GD is inherently
slower than perturbed GD, and justifies the importance of adding perturbations
for efficient non-convex optimization. While our focus is theoretical, we also
present experiments that illustrate our theoretical findings. Comment: Accepted by NIPS 2017
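The contrast is easy to reproduce on a toy problem. The snippet below is not the paper's exponential-time construction (which chains many saddle points so that unperturbed GD must pass close to each one); it only shows the basic phenomenon: started exactly on the stable manifold of a strict saddle, plain GD converges to the saddle, while a single tiny perturbation lets the same iteration escape to a local minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 + y^4/4 has a strict saddle at the origin and local
# minima at (0, +/-sqrt(2)).
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1] + z[1] ** 3])

def run(x0, perturb, eta=0.05, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for t in range(steps):
        # Add a single tiny perturbation once the gradient has become small.
        if perturb and t == 100 and np.linalg.norm(grad(x)) < 1e-3:
            x += rng.uniform(-1e-2, 1e-2, size=2)
        x -= eta * grad(x)
    return x

print("plain GD:     ", run([1.0, 0.0], perturb=False))  # converges to the saddle
print("perturbed GD: ", run([1.0, 0.0], perturb=True))   # escapes toward y near +/-sqrt(2)
```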
Adaptive Stochastic Gradient Langevin Dynamics: Taming Convergence and Saddle Point Escape Time
In this paper, we propose a new adaptive stochastic gradient Langevin
dynamics (ASGLD) algorithmic framework and its two specialized versions, namely
adaptive stochastic gradient (ASG) and adaptive gradient Langevin
dynamics (AGLD), for non-convex optimization problems. All proposed algorithms
can escape from saddle points in a number of iterations that is nearly
dimension-free. Further, we show that ASGLD and ASG converge to a local
minimum within a bounded number of iterations. Also, ASGLD with full
gradients or ASGLD with a slowly linearly increasing batch size converge to a
local minimum with an iteration bound that outperforms existing first-order
methods. Comment: 24 pages, 13 figures
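To convey the general flavor, here is a generic stochastic-gradient-Langevin-dynamics update with an RMSProp-style adaptive preconditioner applied to both the gradient step and the injected Gaussian noise. This is an assumption about the kind of adaptivity involved, not the ASGLD/ASG/AGLD update rules actually defined and analyzed in the paper; the stoch_grad oracle and all constants are placeholders.

```python
import numpy as np

def adaptive_sgld(stoch_grad, x0, eta=1e-3, beta=0.999, noise_scale=1e-2,
                  eps=1e-8, max_iter=10_000, rng=None):
    """Illustrative adaptive SGLD loop: a running second-moment estimate
    preconditions both the stochastic gradient step and the injected
    Gaussian (Langevin) noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                       # running second-moment estimate
    for _ in range(max_iter):
        g = stoch_grad(x, rng)                 # noisy gradient oracle (assumed interface)
        v = beta * v + (1.0 - beta) * g * g
        precond = 1.0 / (np.sqrt(v) + eps)
        noise = rng.normal(size=x.shape)
        x = x - eta * precond * g + noise_scale * np.sqrt(eta * precond) * noise
    return x
```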
On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points
Gradient descent (GD) and stochastic gradient descent (SGD) are the
workhorses of large-scale machine learning. While classical theory focused on
analyzing the performance of these methods in convex optimization problems, the
most notable successes in machine learning have involved nonconvex
optimization, and a gap has arisen between theory and practice. Indeed,
traditional analyses of GD and SGD show that both algorithms converge to
stationary points efficiently. But these analyses do not take into account the
possibility of converging to saddle points. More recent theory has shown that
GD and SGD can avoid saddle points, but the dependence on dimension in these
analyses is polynomial. For modern machine learning, where the dimension can be
in the millions, such dependence would be catastrophic. We analyze perturbed
versions of GD and SGD and show that they are truly efficient---their dimension
dependence is only polylogarithmic. Indeed, these algorithms converge to
second-order stationary points in essentially the same time as they take to
converge to classical first-order stationary points. Comment: A preliminary
version of this paper, with a subset of the results that are presented here,
was presented at ICML 2017 (also as arXiv:1703.00887).
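A minimal sketch of the perturbed-SGD idea discussed here: stochastic gradient steps, plus an occasional perturbation drawn from a small sphere whenever the observed gradient is small. The stoch_grad oracle, thresholds, and constants are placeholders; in the analysis such parameters are set from smoothness and gradient-noise quantities of the objective.

```python
import numpy as np

def perturbed_sgd(stoch_grad, x0, eta=1e-2, g_thresh=1e-2, radius=1e-2,
                  cooldown=50, max_iter=20_000, rng=None):
    """Sketch of perturbed SGD: standard stochastic gradient steps, plus a
    perturbation in a uniformly random direction whenever the observed
    stochastic gradient is small and no perturbation was added recently."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -cooldown
    for t in range(max_iter):
        g = stoch_grad(x, rng)                       # noisy gradient oracle (assumed interface)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= cooldown:
            d = rng.normal(size=x.shape)
            x = x + radius * d / np.linalg.norm(d)   # random direction on a small sphere
            last_perturb = t
        x = x - eta * g
    return x
```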
Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization
We consider the problem of finding an approximate second-order stationary
point of a constrained non-convex optimization problem. We first show that,
unlike the gradient descent method for unconstrained optimization, the vanilla
projected gradient descent algorithm may converge to a strict saddle point even
when there is only a single linear constraint. We then provide a hardness
result by showing that checking $(\epsilon, \gamma)$-second order
stationarity is NP-hard even in the presence of linear constraints. Despite our
hardness result, we identify instances of the problem for which checking second
order stationarity can be done efficiently. For such instances, we propose a
dynamic second order Frank--Wolfe algorithm which converges to
$(\epsilon, \gamma)$-second order stationary points. The proposed algorithm
can be used in general constrained non-convex optimization as long as the
constrained quadratic sub-problem can be solved efficiently.
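For intuition only, here is a hedged sketch of one "second-order Frank-Wolfe style" step over a box constraint: minimize the local quadratic model over the feasible set, then move part of the way toward its solution. The paper's dynamic algorithm, its subproblem, and its guarantees are more involved; the box feasible set, the generic bounded solver, and the fixed step fraction below are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def second_order_fw_step(x, grad, hess, lo, hi, step=0.5):
    """One illustrative second-order Frank-Wolfe style step over the box
    [lo, hi]^n: build the local quadratic model at x, (approximately) minimize
    it over the box with a generic bounded solver, and move a fraction of the
    way toward the subproblem solution."""
    g, H = grad(x), hess(x)
    model = lambda y: g @ (y - x) + 0.5 * (y - x) @ H @ (y - x)
    model_grad = lambda y: g + H @ (y - x)
    res = minimize(model, x, jac=model_grad,
                   bounds=[(lo, hi)] * x.size, method="L-BFGS-B")
    return x + step * (res.x - x)
```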
Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently
We propose a family of nonconvex optimization algorithms that are able to
save gradient and negative curvature computations to a large extent, and are
guaranteed to find an approximate local minimum with improved runtime
complexity. At the core of our algorithms is the division of the entire domain
of the objective function into small and large gradient regions: our algorithms
only perform gradient descent based procedure in the large gradient region, and
only perform negative curvature descent in the small gradient region. Our novel
analysis shows that the proposed algorithms can escape the small gradient
region in only one negative curvature descent step whenever they enter it, and
thus the number of negative curvature direction computations they need is at
most the number of times the algorithms enter small gradient regions. For both
deterministic and stochastic
settings, we show that the proposed algorithms can potentially beat the
state-of-the-art local minima finding algorithms. For the finite-sum setting,
our algorithm can also outperform the best algorithm in a certain regime. Comment: 31 pages, 1 table
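The control flow described above (gradient steps in large-gradient regions, a single negative-curvature step in small-gradient regions) can be sketched as follows. The full eigendecomposition below is a stand-in for the cheaper negative-curvature-search subroutines the paper is concerned with saving, and the step sizes and threshold are placeholders.

```python
import numpy as np

def gd_with_negative_curvature(grad, hess, x0, eta=1e-2, g_thresh=1e-2,
                               nc_step=1e-1, max_iter=5000):
    """Take plain gradient steps while the gradient is large; only when the
    iterate enters a small-gradient region compute the most negative Hessian
    eigenvector and take a single step along it."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > g_thresh:
            x = x - eta * g                        # large-gradient region
        else:
            w, V = np.linalg.eigh(hess(x))         # small-gradient region
            if w[0] >= 0:
                return x                           # approximate local minimum
            v = V[:, 0]
            v = -v if g @ v > 0 else v             # do not ascend along the gradient
            x = x + nc_step * v
    return x
```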
Bowl breakout, escaping the positive region when searching for saddle points
We present a scheme improving the minimum-mode following method for finding
first order saddle points by confining the displacements of atoms to the subset
of those subject to the largest force. By doing so it is ensured that the
displacement remains of a local character within regions where all eigenvalues
of the Hessian matrix are positive. However, as soon as a region is entered
where an eigenvalue turns negative, all atoms are released to maintain the
ability of determining concerted moves. Applying the proposed scheme reduces
the required number of force calls for the determination of connected saddle
points by a factor of two or more compared to a free search. Furthermore, a wider
distribution of the relevant low barrier saddle points is obtained. Finally,
the dependency on the initial distortion and the applied maximal step size is
reduced, making minimum-mode guided searches both more robust and applicable. Comment: 19 pages, 7 figures
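The confinement rule alone (not the full minimum-mode following search) might be sketched roughly as follows; the choice of k atoms and the use of force magnitude as the selection criterion are assumptions made for illustration.

```python
import numpy as np

def masked_displacement(forces, min_eigenvalue, k=5):
    """While the lowest Hessian eigenvalue is still positive (inside the
    convex "bowl"), restrict the trial displacement to the k atoms feeling the
    largest force; once a negative eigenvalue appears, release all atoms so
    concerted moves can be determined."""
    forces = np.asarray(forces, dtype=float)       # shape (n_atoms, 3)
    step = forces.copy()
    if min_eigenvalue > 0:
        magnitudes = np.linalg.norm(forces, axis=1)
        keep = np.argsort(magnitudes)[-k:]         # indices of the k largest forces
        mask = np.zeros(len(forces), dtype=bool)
        mask[keep] = True
        step[~mask] = 0.0                          # freeze all other atoms
    return step
```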
PGDOT -- Perturbed Gradient Descent Adapted with Occupation Time
This paper develops further the idea of perturbed gradient descent (PGD), by
adapting perturbation with the history of states via the notion of occupation
time. The proposed algorithm, perturbed gradient descent adapted with
occupation time (PGDOT), is shown to converge at least as fast as the PGD
algorithm and is guaranteed to avoid getting stuck at saddle points. The
analysis is corroborated by empirical studies, in which a mini-batch version of
PGDOT is shown to outperform alternatives such as mini-batch gradient descent,
Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs). In
particular, the mini-batch PGDOT manages to escape saddle points whereas these
alternatives fail. Comment: 15 pages, 7 figures, 1 table
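One plausible reading of the occupation-time adaptation, written as a hedged sketch: the perturbation radius grows with the number of consecutive iterates that have remained inside a small neighborhood, i.e. with how long the trajectory has occupied one region. The precise PGDOT update is the one defined in the paper; everything below (the neighborhood test, the linear growth rule, the constants) is an illustrative assumption.

```python
import numpy as np

def pgd_occupation_time(grad, x0, eta=1e-2, g_thresh=1e-3, base_radius=1e-3,
                        nbhd=1e-2, max_iter=10_000, rng=None):
    """Perturbed gradient descent whose perturbation radius scales with an
    occupation-time counter: the number of consecutive iterates that have
    stayed within distance nbhd of a reference (anchor) point."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    anchor, occupation = x.copy(), 0
    for _ in range(max_iter):
        if np.linalg.norm(x - anchor) < nbhd:
            occupation += 1                     # still occupying the same region
        else:
            anchor, occupation = x.copy(), 0    # moved on; reset the clock
        if np.linalg.norm(grad(x)) <= g_thresh:
            d = rng.normal(size=x.shape)
            x = x + base_radius * (1 + occupation) * d / np.linalg.norm(d)
        x = x - eta * grad(x)
    return x
```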
Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization
Neural networks with Rectified Linear Unit (ReLU) activation functions
(a.k.a. ReLU networks) have achieved great empirical success in various
domains. Nonetheless, existing results for learning ReLU networks either
impose assumptions on the underlying data distribution, e.g. that it is
Gaussian, or require the network size and/or training set size to be
sufficiently large. In this
context, the problem of learning a two-layer ReLU network is approached in a
binary classification setting, where the data are linearly separable and a
hinge loss criterion is adopted. Leveraging the power of random noise
perturbation, this paper presents a novel stochastic gradient descent (SGD)
algorithm, which can \emph{provably} train any single-hidden-layer ReLU network
to attain global optimality, despite the presence of infinitely many bad local
minima, maxima, and saddle points in general. This result is the first of its
kind, requiring no assumptions on the data distribution, training/network size,
or initialization. Convergence of the resultant iterative algorithm to a global
minimum is analyzed by establishing both an upper bound and a lower bound on
the number of non-zero updates to be performed. Moreover, generalization
guarantees are developed for ReLU networks trained with the novel SGD
leveraging classic compression bounds. These guarantees highlight a key
difference (at least in the worst case) between reliably learning a ReLU
network and a leaky ReLU network in terms of sample complexity.
Numerical tests using both synthetic data and real images validate the
effectiveness of the algorithm and the practical merits of the theory. Comment: 23 pages, 7 figures, work in progress
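A minimal sketch of the kind of noise-perturbed SGD training loop described above, for a single-hidden-layer ReLU network with a hinge loss on labels in {-1, +1}. The fixed random second-layer signs, the Gaussian noise added to each active update, and all constants are assumptions for illustration; the paper's algorithm, noise injection, and analysis are specific.

```python
import numpy as np

def noisy_sgd_relu(X, y, n_hidden=10, eta=1e-1, noise=1e-3, epochs=50, rng=None):
    """Single-hidden-layer ReLU network f(x) = v . relu(W x) trained on the
    hinge loss with SGD whose updates are perturbed by small Gaussian noise.
    X: (n, d) features; y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, d))   # hidden-layer weights
    v = rng.choice([-1.0, 1.0], size=n_hidden)      # fixed second-layer signs (assumption)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x_i, y_i = X[i], y[i]
            h = np.maximum(W @ x_i, 0.0)            # ReLU hidden activations
            if y_i * (v @ h) < 1.0:                 # hinge loss is active
                active = (W @ x_i > 0).astype(float)
                grad_W = -y_i * np.outer(v * active, x_i)   # hinge subgradient
                W -= eta * grad_W + noise * rng.normal(size=W.shape)
    return W, v
```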