16 research outputs found

    Escaping Saddle Points in Constrained Optimization

    In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho < 1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second-order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $\mathcal{C}$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.
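
    To make the two-phase structure concrete, here is a minimal Python sketch of such a framework, assuming a projection oracle project_C and a rho-approximate constrained quadratic solver approx_qp_C are supplied by the user; the step sizes, the acceptance test, and both oracle names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def escape_saddle_constrained(grad, hess, project_C, approx_qp_C, x0,
                              eps=1e-3, gamma=1e-3, eta=1e-2, max_iter=1000):
    """Two-phase sketch: projected gradient steps, then a QP oracle over C.

    grad(x), hess(x)     : objective gradient and Hessian.
    project_C(v)         : Euclidean projection onto the convex set C (assumed available).
    approx_qp_C(x, g, H) : approximate minimizer over C of the quadratic model
                           <g, u - x> + 0.5 (u - x)^T H (u - x) (assumed available).
    """
    x = project_C(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        g = grad(x)
        x_plus = project_C(x - eta * g)
        if np.linalg.norm(x_plus - x) / eta > eps:
            x = x_plus                      # first-order phase: gradient mapping still large
            continue
        H = hess(x)
        u = approx_qp_C(x, g, H)            # second-order phase: probe curvature over C
        d = u - x
        if g @ d + 0.5 * d @ H @ d < -gamma:
            x = project_C(x + d)            # sufficient model decrease: move and continue
        else:
            return x                        # approximate (eps, gamma)-SOSP
    return x
```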

    Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization

    We consider the problem of finding an approximate second-order stationary point of a constrained non-convex optimization problem. We first show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing that checking $(\epsilon_g, \epsilon_H)$-second-order stationarity is NP-hard even in the presence of linear constraints. Despite our hardness result, we identify instances of the problem for which checking second-order stationarity can be done efficiently. For such instances, we propose a dynamic second-order Frank-Wolfe algorithm which converges to $(\epsilon_g, \epsilon_H)$-second-order stationary points in $\mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problem can be solved efficiently.
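
    The switching logic of a Frank-Wolfe method that falls back to a constrained quadratic subproblem can be sketched as follows; the linear minimization oracle lmo and the approximate subproblem solver qp_oracle are hypothetical user-supplied callables, and the step-size and thresholding choices are simplifications, not the authors' exact algorithm.

```python
import numpy as np

def second_order_frank_wolfe(grad, hess, lmo, qp_oracle, x0,
                             eps_g=1e-3, eps_H=1e-3, max_iter=500):
    """Sketch of a Frank-Wolfe-style method that switches to a second-order step.

    lmo(g)            : linear minimization oracle, argmin_{v in C} <g, v> (assumed available).
    qp_oracle(x, g, H): approximate solver of the constrained quadratic subproblem
                        min_{x + d in C} <g, d> + 0.5 d^T H d (assumed available).
    """
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        g = grad(x)
        v = lmo(g)
        fw_gap = g @ (x - v)
        if fw_gap > eps_g:
            step = 2.0 / (t + 2.0)          # standard Frank-Wolfe step with diminishing step size
            x = x + step * (v - x)
            continue
        H = hess(x)                         # approximate first-order stationarity: probe curvature
        d = qp_oracle(x, g, H)
        if g @ d + 0.5 * d @ H @ d < -eps_H:
            x = x + d                       # descend along the negative-curvature direction
        else:
            return x                        # approximate (eps_g, eps_H)-second-order stationary point
    return x
```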

    Escaping Saddle Points for Nonsmooth Weakly Convex Functions via Perturbed Proximal Algorithms

    We propose perturbed proximal algorithms that can provably escape strict saddles for nonsmooth weakly convex functions. The main results are based on a novel characterization of $\epsilon$-approximate local minima for nonsmooth functions, and on recent developments in perturbed gradient methods for escaping saddle points of smooth problems. Specifically, we show that under standard assumptions, the perturbed proximal point, perturbed proximal gradient, and perturbed proximal linear algorithms find an $\epsilon$-approximate local minimum for nonsmooth weakly convex functions in $O(\epsilon^{-2}\log(d)^{4})$ iterations, where $d$ is the dimension of the problem.
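
    A minimal sketch of a perturbed proximal gradient iteration is given below, assuming a decomposition f = g + h with a smooth part g and a proximable nonsmooth part h; the perturbation schedule and the oracle names grad_smooth and prox_nonsmooth are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def perturbed_prox_gradient(grad_smooth, prox_nonsmooth, x0, lam=0.1,
                            eps=1e-3, radius=1e-2, patience=20, max_iter=2000,
                            rng=None):
    """Sketch of a perturbed proximal gradient method for f = g + h.

    grad_smooth(x)        : gradient of the smooth part g.
    prox_nonsmooth(v, lam): proximal operator of h with parameter lam (assumed available).
    A uniform-ball perturbation is injected whenever the proximal-gradient mapping
    stays small, so the iterates can leave a strict saddle region.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    since_perturb = patience
    for _ in range(max_iter):
        x_plus = prox_nonsmooth(x - lam * grad_smooth(x), lam)
        mapping_norm = np.linalg.norm(x - x_plus) / lam
        if mapping_norm <= eps and since_perturb >= patience:
            # Small proximal-gradient mapping: perturb inside a ball of the given radius.
            u = rng.normal(size=x.shape)
            x_plus = x_plus + radius * rng.uniform() * u / np.linalg.norm(u)
            since_perturb = 0
        else:
            since_perturb += 1
        x = x_plus
    return x
```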

    One Sample Stochastic Frank-Wolfe

    One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its widespread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing efficiency. We propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, and other complicated hyperparameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-\epsilon$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.
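
    The core of such a method is the momentum estimator built from one fresh sample per iteration. The sketch below uses a simplified gradient-difference correction in place of the paper's exact estimator; stoch_grad and lmo are hypothetical oracles, and the momentum weight and step size are only indicative.

```python
import numpy as np

def one_sample_sfw(stoch_grad, lmo, x0, n_iters=1000, rng=None):
    """Simplified sketch of a one-sample stochastic Frank-Wolfe loop.

    stoch_grad(x, z): unbiased stochastic gradient of F at x for sample/seed z.
    lmo(d)          : linear minimization oracle over the constraint set (assumed available).
    The estimator reuses the single fresh sample z at both x_t and x_{t-1}; this is a
    simplified stand-in for the paper's unbiased momentum estimator.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    x_prev = None
    d = None
    for t in range(1, n_iters + 1):
        z = rng.integers(0, 2**31 - 1)          # seed standing in for a single data sample
        rho = 1.0 / t**(2.0 / 3.0)              # decaying momentum weight (illustrative)
        g = stoch_grad(x, z)
        if d is None:
            d = g
        else:
            correction = g - stoch_grad(x_prev, z)   # one-sample gradient-difference correction
            d = (1.0 - rho) * (d + correction) + rho * g
        v = lmo(d)
        step = 1.0 / n_iters                     # fixed step size over the horizon
        x_prev, x = x, x + step * (v - x)
    return x
```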

    Efficiently escaping saddle points on manifolds

    Smooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019)], we propose a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold $\mathcal{M}$ of dimension $d$, a sufficiently smooth (possibly non-convex) objective function $f$, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than $\epsilon$ and Hessian within $\sqrt{\epsilon}$ of being positive semidefinite in $O((\log d)^4/\epsilon^{2})$ gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low. This matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: "steps on the manifold" and "perturbed steps in a tangent space of the manifold." Ultimately, this distinction makes it possible to extend Jin et al.'s analysis seamlessly. Comment: 18 pages, NeurIPS 2019.
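
    The two step types can be sketched as follows, assuming an embedded-manifold representation in which tangent vectors are plain arrays; riem_grad, retract, and random_tangent are hypothetical user-supplied oracles, and the perturbation schedule is a simplification of the actual PRGD analysis.

```python
import numpy as np

def perturbed_riemannian_gd(riem_grad, retract, random_tangent, x0,
                            eta=1e-2, eps=1e-3, radius=1e-2, patience=50,
                            max_iter=5000, rng=None):
    """Sketch of a perturbed Riemannian gradient descent (PRGD-style) loop.

    riem_grad(x)     : Riemannian gradient at x (a tangent vector, stored as an array).
    retract(x, v)    : retraction moving from x along tangent vector v.
    random_tangent(x): unit-norm random tangent vector at x (assumed available).
    The loop alternates plain gradient steps on the manifold with occasional
    perturbed steps taken inside a tangent space, mirroring the paper's distinction.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    since_perturb = patience
    for _ in range(max_iter):
        g = riem_grad(x)
        if np.linalg.norm(g) > eps or since_perturb < patience:
            x = retract(x, -eta * g)            # step on the manifold
            since_perturb += 1
        else:
            # Gradient is small: perturb within the tangent space, then keep descending.
            xi = radius * rng.uniform() * random_tangent(x)
            x = retract(x, xi)
            since_perturb = 0
    return x
```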

    Escaping from saddle points on Riemannian manifolds

    We consider minimizing a nonconvex, smooth function $f$ on a Riemannian manifold $\mathcal{M}$. We show that a perturbed version of the Riemannian gradient descent algorithm converges to a second-order stationary point (and hence is able to escape saddle points on the manifold). The rate of convergence depends as $1/\epsilon^2$ on the accuracy $\epsilon$, which matches a rate known only for unconstrained smooth minimization. The convergence rate depends polylogarithmically on the manifold dimension $d$, hence is almost dimension-free. The rate also has a polynomial dependence on the parameters describing the curvature of the manifold and the smoothness of the function. While the unconstrained problem (Euclidean setting) is well studied, our result is the first to prove such a rate for nonconvex, manifold-constrained problems. Comment: submitted to NeurIPS 2019.

    Stochastic Conditional Gradient++

    In this paper, we consider the general non-oblivious stochastic optimization setting, where the underlying stochasticity may change during the optimization procedure and depends on the point at which the function is evaluated. We develop Stochastic Frank-Wolfe++ (SFW++), an efficient variant of the conditional gradient method for minimizing a smooth non-convex function subject to a convex body constraint. We show that SFW++ converges to an $\epsilon$-first-order stationary point using $O(1/\epsilon^3)$ stochastic gradients. Once further structure is present, SFW++'s theoretical guarantees, in terms of the convergence rate and the quality of its solution, improve. In particular, for minimizing a convex function, SFW++ achieves an $\epsilon$-approximate optimum while using $O(1/\epsilon^2)$ stochastic gradients; it is known that this rate is optimal in terms of stochastic gradient evaluations. Similarly, for maximizing a monotone continuous DR-submodular function, a slightly different form of SFW++, called Stochastic Continuous Greedy++ (SCG++), achieves a tight $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution while using $O(1/\epsilon^2)$ stochastic gradients. Through an information-theoretic argument, we also prove that SCG++'s convergence rate is optimal. Finally, for maximizing a non-monotone continuous DR-submodular function, we can achieve a $[(1/e)\mathrm{OPT} - \epsilon]$ solution using $O(1/\epsilon^2)$ stochastic gradients. We should highlight that our results and our novel variance reduction technique trivially extend to the standard and easier oblivious stochastic optimization setting, for both (non-)convex and continuous submodular objectives.
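
    A heavily simplified sketch of the path-wise variance reduction behind SFW++/SCG++ follows: the gradient estimate is formed once with a larger batch and then updated with stochastic Hessian-vector products along the trajectory. The oracles stoch_grad, hvp, and lmo, the batch sizes, and the midpoint evaluation are all assumptions made for illustration, not the paper's exact estimator.

```python
import numpy as np

def sfw_plus_plus(stoch_grad, hvp, lmo, x0, n_iters=100, big_batch=256,
                  small_batch=16, rng=None):
    """Simplified sketch of a variance-reduced conditional gradient loop.

    stoch_grad(x, z): stochastic gradient of F at x for sample/seed z.
    hvp(x, v, z)    : stochastic Hessian-vector product at x applied to v (assumed available).
    lmo(d)          : linear minimization oracle over the constraint set (assumed available).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    seeds = rng.integers(0, 2**31 - 1, size=big_batch)
    d = np.mean([stoch_grad(x, z) for z in seeds], axis=0)   # one large-batch gradient estimate
    for _ in range(n_iters):
        v = lmo(d)
        step = 1.0 / n_iters
        x_new = x + step * (v - x)
        # Update the estimator with Hessian-vector products along the segment x -> x_new.
        seeds = rng.integers(0, 2**31 - 1, size=small_batch)
        d = d + np.mean([hvp(0.5 * (x + x_new), x_new - x, z) for z in seeds], axis=0)
        x = x_new
    return x
```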

    Escaping strict saddle points of the Moreau envelope in nonsmooth optimization

    Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms. Comment: 29 pages, 1 figure.
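
    The key identity is that the Moreau envelope f_lambda has gradient (x - prox_{lambda f}(x)) / lambda, so an approximate proximal point yields an approximate envelope gradient. The sketch below combines that identity with a small ball perturbation; inexact_prox is a hypothetical inner solver and the scheduling constants are illustrative.

```python
import numpy as np

def perturbed_moreau_gradient(inexact_prox, x0, lam=0.5, eta=0.1, eps=1e-3,
                              radius=1e-2, patience=30, max_iter=2000, rng=None):
    """Sketch of a perturbed gradient method run on the Moreau envelope of f.

    inexact_prox(x, lam): approximate minimizer of f(y) + ||y - x||^2 / (2*lam),
                          e.g. a few iterations of an inner solver (assumed available).
    Uses grad f_lam(x) = (x - prox_{lam f}(x)) / lam; a ball perturbation is injected
    whenever that gradient stays small, to move off strict saddles of the envelope.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    since_perturb = patience
    for _ in range(max_iter):
        g_env = (x - inexact_prox(x, lam)) / lam      # approximate Moreau envelope gradient
        if np.linalg.norm(g_env) <= eps and since_perturb >= patience:
            u = rng.normal(size=x.shape)
            x = x + radius * rng.uniform() * u / np.linalg.norm(u)
            since_perturb = 0
        else:
            x = x - eta * g_env
            since_perturb += 1
    return x
```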

    Stochastic Gradient Langevin Dynamics with Variance Reduction

    Stochastic gradient Langevin dynamics (SGLD) has gained the attention of optimization researchers due to its global optimization properties. This paper proves an improved convergence property to local minimizers of nonconvex objective functions for SGLD accelerated by variance reduction. Moreover, we prove an ergodicity property of the SGLD scheme, which gives insight into its potential to find global minimizers of nonconvex objectives.
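
    A sketch of SGLD with an SVRG-style variance-reduced gradient estimate is given below; the finite-sum oracle grad_i, the epoch structure, and the temperature handling are assumptions chosen for illustration rather than the exact scheme analyzed in the paper.

```python
import numpy as np

def svrg_langevin(grad_i, n_data, x0, step=1e-3, beta=1.0, epoch_len=50,
                  n_epochs=20, batch=8, rng=None):
    """Sketch of stochastic gradient Langevin dynamics with SVRG-style variance reduction.

    grad_i(x, i): gradient of the i-th component function at x; the full gradient is
                  the average over i = 0..n_data-1 (a hypothetical finite-sum model).
    Each update combines a variance-reduced gradient estimate with Gaussian noise of
    covariance 2*step/beta, the usual Langevin injection at inverse temperature beta.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        x_snap = x.copy()
        full_grad = np.mean([grad_i(x_snap, i) for i in range(n_data)], axis=0)
        for _ in range(epoch_len):
            idx = rng.integers(0, n_data, size=batch)
            corr = np.mean([grad_i(x, i) - grad_i(x_snap, i) for i in idx], axis=0)
            g = full_grad + corr                                  # SVRG gradient estimate
            noise = rng.normal(size=x.shape) * np.sqrt(2.0 * step / beta)
            x = x - step * g + noise                              # Langevin update
    return x
```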

    Escaping Saddle-Points Faster under Interpolation-like Conditions

    In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local minimizer matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that its oracle complexity to reach an $\epsilon$-local minimizer is $\tilde{\mathcal{O}}(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{\mathcal{O}}(1/\epsilon^{1.5})$ achieved by the deterministic Cubic-Regularized Newton method. It seems that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings. Comment: To appear in NeurIPS, 2020.
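
    For reference, a minimal sketch of the cubic-regularized Newton step underlying SCRN is shown below, with the cubic model minimized by plain gradient descent; grad_est and hess_est stand for sampled gradient and Hessian oracles, and the inner solver and step-size heuristic are simplifications.

```python
import numpy as np

def cubic_newton_step(g, H, M=1.0, inner_iters=100, inner_lr=None):
    """Sketch of one (stochastic) cubic-regularized Newton step.

    Approximately minimizes the cubic model m(s) = g^T s + 0.5 s^T H s + (M/6)||s||^3
    by gradient descent on s; g and H may themselves be sampled estimates, as in SCRN.
    """
    s = np.zeros_like(g, dtype=float)
    # Crude step-size guess based on the model's local smoothness (an assumption).
    inner_lr = inner_lr or 1.0 / (np.linalg.norm(H, 2) + M + 1e-12)
    for _ in range(inner_iters):
        model_grad = g + H @ s + 0.5 * M * np.linalg.norm(s) * s
        s = s - inner_lr * model_grad
    return s

def scrn(grad_est, hess_est, x0, M=1.0, n_iters=100):
    """Outer loop: move along approximate minimizers of the cubic model."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x + cubic_newton_step(grad_est(x), hess_est(x), M=M)
    return x
```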