Escaping Saddle Points in Constrained Optimization
In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2},\rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $\mathcal{C}$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.
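To make the two-phase scheme concrete, here is a minimal Python sketch of the kind of framework the abstract describes, assuming hypothetical oracles project_onto_C (Euclidean projection onto $\mathcal{C}$) and approx_qp (the $\rho$-approximate quadratic-program solver); step sizes and thresholds are illustrative, not the paper's.

```python
import numpy as np

def escape_framework(grad, hess, project_onto_C, approx_qp, x0,
                     eps=1e-3, gamma=1e-3, eta=0.1, max_iter=1000):
    """Two-phase sketch: projected gradient steps while first-order
    progress is possible, then an approximate QP over C to search for a
    feasible negative-curvature direction."""
    x = project_onto_C(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        x_plus = project_onto_C(x - eta * grad(x))
        if np.linalg.norm(x_plus - x) / eta > eps:
            x = x_plus                        # first-order progress
            continue
        # Near-stationarity: ask the QP oracle for a feasible direction u
        # along which the quadratic model u^T H u is sufficiently negative.
        u, curvature = approx_qp(hess(x), x)  # rho-approximate QP solution
        if curvature > -gamma:
            return x                          # approximate (eps, gamma)-SOSP
        x = project_onto_C(x + eta * u)       # escape step
    return x
```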
Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization
We consider the problem of finding an approximate second-order stationary point of a constrained non-convex optimization problem. We first show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing that checking $(\epsilon_g,\epsilon_H)$-second order stationarity is NP-hard even in the presence of linear constraints. Despite our hardness result, we identify instances of the problem for which checking second order stationarity can be done efficiently. For such instances, we propose a dynamic second order Frank-Wolfe algorithm which converges to $(\epsilon_g,\epsilon_H)$-second order stationary points in $\mathcal{O}(\max\{\epsilon_g^{-2},\epsilon_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problem can be solved efficiently.
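As a rough illustration (not the authors' exact method), the sketch below pairs the standard Frank-Wolfe step with an escape step driven by the constrained quadratic sub-problem; lmo and qp_min are assumed oracles, and the closed-form simplex LMO shows one case where the linear step is trivial.

```python
import numpy as np

def fw_second_order(grad, hess, lmo, qp_min, x0, eps_g=1e-3, eps_H=1e-3,
                    max_iter=1000):
    """Frank-Wolfe with a second-order escape step (schematic). qp_min is
    assumed to return a feasible point minimizing the quadratic model,
    together with the attained curvature value."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        g = grad(x)
        v = lmo(g)                           # linear minimization oracle
        if g @ (x - v) > eps_g:              # Frank-Wolfe gap still large
            x = x + 2.0 / (t + 2) * (v - x)  # classic FW step
            continue
        u, curvature = qp_min(hess(x), x)    # constrained quadratic sub-problem
        if curvature > -eps_H:
            return x                         # (eps_g, eps_H)-SOSP reached
        x = x + 2.0 / (t + 2) * (u - x)      # second-order escape step
    return x

def simplex_lmo(g):
    """LMO over the probability simplex: the best vertex in closed form."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v
```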
Escaping Saddle Points for Nonsmooth Weakly Convex Functions via Perturbed Proximal Algorithms
We propose perturbed proximal algorithms that can provably escape strict saddles for nonsmooth weakly convex functions. The main results are based on a novel characterization of an $\epsilon$-approximate local minimum for nonsmooth functions, and recent developments on perturbed gradient methods for escaping saddle points for smooth problems. Specifically, we show that under standard assumptions, the perturbed proximal point, perturbed proximal gradient and perturbed proximal linear algorithms find an $\epsilon$-approximate local minimum for nonsmooth weakly convex functions in $O(\epsilon^{-2}\log^{4}(d))$ iterations, where $d$ is the dimension of the problem.
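A minimal sketch of the perturbed proximal point idea, with scipy.optimize.minimize standing in for the proximal subproblem solver; the thresholds and perturbation radius are illustrative assumptions, not the paper's parameters.

```python
import numpy as np
from scipy.optimize import minimize

def perturbed_prox_point(f, x0, lam=1.0, eps=1e-3, radius=1e-2,
                         stall_iters=10, max_iter=500, rng=None):
    """Perturbed proximal point (schematic): iterate the proximal map of a
    weakly convex f; when progress stalls near a candidate strict saddle,
    inject a small random perturbation, as in perturbed gradient descent."""
    rng = np.random.default_rng() if rng is None else rng
    x, stalled = np.asarray(x0, dtype=float), 0
    for _ in range(max_iter):
        # prox_{lam f}(x), solved (inexactly) by a generic local solver.
        prox = minimize(lambda y: f(y) + np.sum((y - x)**2) / (2 * lam), x).x
        if np.linalg.norm(prox - x) / lam < eps:
            stalled += 1
            if stalled > stall_iters:
                return x                     # likely approximate local minimum
            x = x + radius * rng.standard_normal(x.shape)   # perturb
        else:
            stalled, x = 0, prox             # ordinary proximal step
    return x
```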
One Sample Stochastic Frank-Wolfe
One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its widespread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing the efficiency. In this paper, we propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyperparameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-\epsilon$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.
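The following simplified sketch shows the single-sample momentum idea; the actual 1-SFW estimator additionally includes a gradient-variation correction term to remain unbiased, which is omitted here. stoch_grad and lmo are assumed callables, and the schedules are illustrative.

```python
import numpy as np

def one_sample_sfw(stoch_grad, lmo, x0, T=1000):
    """Simplified one-sample stochastic Frank-Wolfe step: a momentum-
    averaged gradient estimate built from a single fresh sample per
    iteration drives the linear minimization oracle."""
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)
    for t in range(1, T + 1):
        rho = 1.0 / t ** (2.0 / 3.0)             # momentum weight
        d = (1 - rho) * d + rho * stoch_grad(x)  # one sample per iteration
        v = lmo(d)                               # Frank-Wolfe direction
        x = x + (v - x) / T                      # fixed step size 1/T
    return x
```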
Efficiently escaping saddle points on manifolds
Smooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019)], we propose a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold $\mathcal{M}$ of dimension $d$, a sufficiently smooth (possibly non-convex) objective function $f$, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than $\epsilon$ and Hessian within $\sqrt{\epsilon}$ of being positive semidefinite in $\mathcal{O}((\log d)^{4}/\epsilon^{2})$ gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low. This matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: "steps on the manifold" and "perturbed steps in a tangent space of the manifold." Ultimately, this distinction makes it possible to extend Jin et al.'s analysis seamlessly.
Comment: 18 pages, NeurIPS 2019
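A schematic rendering of the two step types, with hypothetical helpers grad_riem, retract, and tangent_basis; the pulled-back gradient is only approximated here, so this is a sketch of the mechanism rather than the analyzed algorithm.

```python
import numpy as np

def prgd(grad_riem, retract, tangent_basis, x0, eps=1e-3, eta=0.1,
         radius=1e-2, escape_steps=20, max_iter=1000, rng=None):
    """Perturbed Riemannian gradient descent (schematic): ordinary steps
    move on the manifold through the retraction; when the Riemannian
    gradient is small, a perturbed run of steps happens inside a single
    tangent space, mirroring the Euclidean PGD analysis."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    for _ in range(max_iter):
        g = grad_riem(x)
        if np.linalg.norm(g) > eps:
            x = retract(x, -eta * g)       # step on the manifold
            continue
        # Perturbed steps in the tangent space at x (pulled-back objective).
        B = tangent_basis(x)               # orthonormal basis, d columns
        u = radius * rng.standard_normal(B.shape[1])
        for _ in range(escape_steps):
            # Crude approximation of the pulled-back gradient at B @ u.
            u = u - eta * (B.T @ grad_riem(retract(x, B @ u)))
        x = retract(x, B @ u)
    return x
```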
Escaping from saddle points on Riemannian manifolds
We consider minimizing a nonconvex, smooth function $f$ on a Riemannian manifold $\mathcal{M}$. We show that a perturbed version of the Riemannian gradient descent algorithm converges to a second-order stationary point (and hence is able to escape saddle points on the manifold). The rate of convergence depends as $1/\epsilon^{2}$ on the accuracy $\epsilon$, which matches a rate known only for unconstrained smooth minimization. The convergence rate depends polylogarithmically on the manifold dimension $d$, hence is almost dimension-free. The rate also has a polynomial dependence on the parameters describing the curvature of the manifold and the smoothness of the function. While the unconstrained problem (Euclidean setting) is well-studied, our result is the first to prove such a rate for nonconvex, manifold-constrained problems.
Comment: submitted to NeurIPS 2019
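For a concrete instance of the ingredients such a method needs, the unit sphere admits closed-form expressions for both the Riemannian gradient and a retraction (a standard construction, shown here for illustration):

```python
import numpy as np

# On the unit sphere, the Riemannian gradient is the tangential component
# of the Euclidean gradient, and renormalization is a valid retraction.
def sphere_riem_grad(euclid_grad, x):
    g = euclid_grad(x)
    return g - (x @ g) * x           # project out the normal component

def sphere_retract(x, v):
    y = x + v
    return y / np.linalg.norm(y)     # metric-projection retraction
```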
Stochastic Conditional Gradient++
In this paper, we consider the general non-oblivious stochastic optimization where the underlying stochasticity may change during the optimization procedure and depends on the point at which the function is evaluated. We develop Stochastic Frank-Wolfe++ (SFW++), an efficient variant of the conditional gradient method for minimizing a smooth non-convex function subject to a convex body constraint. We show that SFW++ converges to an $\epsilon$-first order stationary point by using $\mathcal{O}(1/\epsilon^3)$ stochastic gradients. Once further structures are present, SFW++'s theoretical guarantees, in terms of the convergence rate and quality of its solution, improve. In particular, for minimizing a convex function, SFW++ achieves an $\epsilon$-approximate optimum while using $\mathcal{O}(1/\epsilon^2)$ stochastic gradients. It is known that this rate is optimal in terms of stochastic gradient evaluations. Similarly, for maximizing a monotone continuous DR-submodular function, a slightly different form of SFW++, called Stochastic Continuous Greedy++ (SCG++), achieves a tight $[(1-1/e)\mathrm{OPT}-\epsilon]$ solution while using $\mathcal{O}(1/\epsilon^2)$ stochastic gradients. Through an information theoretic argument, we also prove that SCG++'s convergence rate is optimal. Finally, for maximizing a non-monotone continuous DR-submodular function, we can achieve a $[(1/e)\mathrm{OPT}-\epsilon]$ solution by using $\mathcal{O}(1/\epsilon^2)$ stochastic gradients. We should highlight that our results and our novel variance reduction technique trivially extend to the standard and easier oblivious stochastic optimization settings for (non-)convex and continuous submodular settings.
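To illustrate the continuous greedy scheme that SCG++ builds on, here is a sketch with grad_est standing in for the paper's variance-reduced gradient estimator and lmo an assumed linear maximization oracle over the constraint set.

```python
import numpy as np

def stochastic_continuous_greedy(grad_est, lmo, dim, T=100):
    """Stochastic continuous greedy (schematic): starting from 0, take T
    small Frank-Wolfe steps, each toward the feasible point best aligned
    with the current gradient estimate; for a monotone DR-submodular
    objective this scheme underlies the (1-1/e) guarantee."""
    x = np.zeros(dim)
    for t in range(T):
        d = grad_est(x, t)    # variance-reduced gradient estimate
        v = lmo(d)            # argmax over the feasible set of <d, v>
        x = x + v / T         # ascent step of size 1/T
    return x
```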
Escaping strict saddle points of the Moreau envelope in nonsmooth optimization
Recent work has shown that stochastically perturbed gradient methods can
efficiently escape strict saddle points of smooth functions. We extend this
body of work to nonsmooth optimization, by analyzing an inexact analogue of a
stochastically perturbed gradient method applied to the Moreau envelope. The
main conclusion is that a variety of algorithms for nonsmooth optimization can
escape strict saddle points of the Moreau envelope at a controlled rate. The
main technical insight is that typical algorithms applied to the proximal
subproblem yield directions that approximate the gradient of the Moreau
envelope in relative terms.
Comment: 29 pages, 1 figure
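The technical insight rests on a standard identity: the Moreau envelope $f_\lambda(x) = \min_y \{f(y) + \tfrac{1}{2\lambda}\|y-x\|^2\}$ has gradient $\nabla f_\lambda(x) = (x - \mathrm{prox}_{\lambda f}(x))/\lambda$, so approximate proximal solutions give approximate envelope gradients. A minimal numerical rendering, with a generic solver standing in for the proximal subproblem:

```python
import numpy as np
from scipy.optimize import minimize

def moreau_grad(f, x, lam=1.0):
    """Gradient of the Moreau envelope via the proximal map:
    grad f_lam(x) = (x - prox_{lam f}(x)) / lam. Any algorithm that solves
    the proximal subproblem approximately yields an inexact gradient."""
    prox = minimize(lambda y: f(y) + np.sum((y - x)**2) / (2 * lam), x).x
    return (x - prox) / lam
```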
Stochastic Gradient Langevin Dynamics with Variance Reduction
Stochastic gradient Langevin dynamics (SGLD) has gained the attention of
optimization researchers due to its global optimization properties. This paper
proves an improved convergence property to local minimizers of nonconvex
objective functions using SGLD accelerated by variance reduction. Moreover, we prove an ergodicity property of the SGLD scheme, which gives insight into its potential to find global minimizers of nonconvex objectives.
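A sketch of the kind of variance-reduced Langevin scheme the abstract refers to, combining an SVRG-style control variate with injected Gaussian noise; grad_i (the per-sample gradient) and all constants are assumptions for illustration.

```python
import numpy as np

def svrg_langevin(grad_i, n, x0, eta=1e-3, beta=1e3, epoch_len=50,
                  epochs=20, rng=None):
    """SGLD with SVRG-style variance reduction (schematic): each epoch
    anchors a full gradient; every step combines a control-variate
    gradient estimate with Gaussian noise at inverse temperature beta."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        anchor = x.copy()
        full = sum(grad_i(anchor, i) for i in range(n)) / n  # full gradient
        for _ in range(epoch_len):
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(anchor, i) + full      # control variate
            noise = np.sqrt(2 * eta / beta) * rng.standard_normal(x.shape)
            x = x - eta * g + noise                          # Langevin step
    return x
```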
Escaping Saddle-Points Faster under Interpolation-like Conditions
In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an $\epsilon$-local-minimizer under interpolation-like conditions is $\tilde{\mathcal{O}}(1/\epsilon^{2.5})$. While this obtained complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $\mathcal{O}(1/\epsilon^{1.5})$ corresponding to the deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
Comment: To appear in NeurIPS, 2020
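As a rough sketch of perturbed SGD in this regime (names and constants are illustrative, not the paper's): under interpolation the single-sample gradient is already a faithful descent direction, and a small perturbation is injected once it becomes small.

```python
import numpy as np

def psgd_interpolation(stoch_grad, x0, eta=0.1, eps=1e-3, radius=1e-2,
                       max_iter=10000, rng=None):
    """Perturbed SGD sketch for the interpolation regime: because the
    per-sample gradients vanish together at interpolating minimizers,
    single-sample steps behave like deterministic ones, and occasional
    Gaussian perturbations suffice to leave strict saddles."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = stoch_grad(x)                    # one sample per iteration
        if np.linalg.norm(g) < eps:
            x = x + radius * rng.standard_normal(x.shape)   # perturb
        else:
            x = x - eta * g                  # plain stochastic step
    return x
```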