
    How to Escape Saddle Points Efficiently

    This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.
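    For concreteness, here is a minimal sketch of the perturbed-gradient-descent idea summarized above: ordinary gradient steps, plus an occasional random kick when the gradient is small and the iterate may be near a saddle. The function, step size, thresholds, and perturbation schedule are illustrative assumptions, not the paper's exact algorithm or constants.

        import numpy as np

        def perturbed_gradient_descent(grad, x0, eta=1e-2, g_thresh=1e-3,
                                       radius=1e-2, t_noise=50, max_iter=10_000,
                                       rng=np.random.default_rng(0)):
            """Sketch of perturbed GD: plain gradient steps, plus a small random
            perturbation whenever the gradient is small (a possible saddle) and no
            perturbation was added in the last t_noise iterations."""
            x = np.asarray(x0, dtype=float)
            last_noise = -t_noise
            for t in range(max_iter):
                g = grad(x)
                if np.linalg.norm(g) <= g_thresh and t - last_noise >= t_noise:
                    xi = rng.normal(size=x.shape)                      # random direction
                    xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
                    x = x + xi                                         # uniform kick in a ball
                    last_noise = t
                else:
                    x = x - eta * g                                    # plain gradient step
            return x

        # Toy usage: f(x, y) = x^4/4 - x^2/2 + y^2 has a saddle at the origin and
        # minima at (+-1, 0); plain GD started exactly at the origin never moves.
        grad_f = lambda z: np.array([z[0] ** 3 - z[0], 2 * z[1]])
        print(perturbed_gradient_descent(grad_f, np.zeros(2)))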

    Efficiently escaping saddle points on manifolds

    Smooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019)], we propose a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold $\mathcal{M}$ of dimension $d$, a sufficiently smooth (possibly non-convex) objective function $f$, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than $\epsilon$ and Hessian within $\sqrt{\epsilon}$ of being positive semidefinite in $O((\log d)^4 / \epsilon^{2})$ gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low. This matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: "steps on the manifold" and "perturbed steps in a tangent space of the manifold." Ultimately, this distinction makes it possible to extend Jin et al.'s analysis seamlessly. Comment: 18 pages, NeurIPS 2019
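    A simplified sketch of the two kinds of steps described above, specialized to the unit sphere (Riemannian gradient = tangent projection of the Euclidean gradient, retraction = renormalization). It collapses the paper's tangent-space phase into a single perturbation and uses made-up constants, so it illustrates the flavour of PRGD rather than reproducing it.

        import numpy as np

        def prgd_sphere(egrad, x0, eta=1e-2, g_thresh=1e-3, radius=1e-2,
                        t_pert=30, max_iter=5_000, rng=np.random.default_rng(0)):
            """Simplified sketch of perturbed Riemannian GD on the unit sphere:
            Riemannian gradient = tangent projection of the Euclidean gradient,
            retraction = renormalisation, perturbations drawn in the tangent space."""
            x = np.asarray(x0, float)
            x /= np.linalg.norm(x)
            last_pert = -t_pert
            for t in range(max_iter):
                g = egrad(x)
                rgrad = g - (g @ x) * x                  # project onto tangent space at x
                if np.linalg.norm(rgrad) <= g_thresh and t - last_pert >= t_pert:
                    xi = rng.normal(size=x.shape)        # perturbed step in the tangent space
                    xi -= (xi @ x) * x
                    xi *= radius / np.linalg.norm(xi)
                    x = x + xi
                    last_pert = t
                else:
                    x = x - eta * rgrad                  # step on the manifold (before retraction)
                x /= np.linalg.norm(x)                   # retract back onto the sphere
            return x

        # Toy usage: minimise x^T A x on the sphere; directions of the larger
        # eigenvalues are saddles/maxima, the minimiser here is +-e_3.
        A = np.diag([3.0, 1.0, -2.0])
        print(prgd_sphere(lambda x: 2 * A @ x, np.array([1.0, 0.0, 1e-8])))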

    Gradient Descent Can Take Exponential Time to Escape Saddle Points

    Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points: it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings. Comment: Accepted by NIPS 2017
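    A toy contrast, not the paper's exponential-time construction: the snippet below counts how many iterations plain GD and a noise-injecting variant need to leave the neighbourhood of the saddle of f(x, y) = x^2 - y^2 when started close to its stable manifold. The step size, noise radius, and escape criterion are illustrative assumptions.

        import numpy as np

        def escape_iters(step_fn, z0, escape_dist=1.0, max_iter=100_000):
            """Count iterations until |y| exceeds escape_dist, i.e. the iterate
            has clearly left the neighbourhood of the saddle at the origin."""
            z = np.asarray(z0, float)
            for t in range(max_iter):
                if abs(z[1]) > escape_dist:
                    return t
                z = step_fn(z)
            return max_iter

        eta = 0.01
        grad = lambda z: np.array([2 * z[0], -2 * z[1]])     # f(x, y) = x^2 - y^2
        gd_step = lambda z: z - eta * grad(z)

        rng = np.random.default_rng(0)
        def pgd_step(z, radius=1e-3):
            # plain GD step plus a small random perturbation at every iteration
            return z - eta * grad(z) + radius * rng.normal(size=2)

        for delta in (1e-6, 1e-12):
            z0 = np.array([1.0, delta])     # start close to the stable manifold y = 0
            print(delta, escape_iters(gd_step, z0), escape_iters(pgd_step, z0))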

    Adaptive Stochastic Gradient Langevin Dynamics: Taming Convergence and Saddle Point Escape Time

    In this paper, we propose a new adaptive stochastic gradient Langevin dynamics (ASGLD) algorithmic framework and its two specialized versions, namely adaptive stochastic gradient (ASG) and adaptive gradient Langevin dynamics (AGLD), for non-convex optimization problems. All proposed algorithms can escape from saddle points with at most $O(\log d)$ iterations, which is nearly dimension-free. Further, we show that ASGLD and ASG converge to a local minimum with at most $O(\log d/\epsilon^4)$ iterations. Also, ASGLD with full gradients or ASGLD with a slowly linearly increasing batch size converges to a local minimum with iterations bounded by $O(\log d/\epsilon^2)$, which outperforms existing first-order methods. Comment: 24 pages, 13 figures
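    The sketch below shows a generic adaptive SGLD-style update (stochastic gradient, an RMSProp-like diagonal preconditioner, injected Gaussian noise); it conveys the family of methods the abstract refers to, not the paper's exact ASGLD/ASG/AGLD updates, and all hyper-parameters are illustrative assumptions.

        import numpy as np

        def adaptive_sgld(stoch_grad, x0, eta=1e-3, beta=0.99, temp=1e-4,
                          eps=1e-8, n_iter=5_000, rng=np.random.default_rng(0)):
            """Sketch of an adaptive SGLD-style update: a stochastic gradient step,
            an RMSProp-like diagonal preconditioner, and injected Gaussian noise
            (the Langevin term)."""
            x = np.asarray(x0, float)
            v = np.zeros_like(x)                       # running 2nd-moment estimate
            for _ in range(n_iter):
                g = stoch_grad(x, rng)
                v = beta * v + (1 - beta) * g * g
                precond = 1.0 / (np.sqrt(v) + eps)     # adaptive diagonal scaling
                noise = rng.normal(size=x.shape)       # Langevin noise
                x = x - eta * precond * g + np.sqrt(2 * eta * temp) * noise
            return x

        # Toy usage: noisy gradients of f(x, y) = (x^2 - 1)^2 + y^2 (saddle at 0).
        def stoch_grad(z, rng):
            clean = np.array([4 * z[0] * (z[0] ** 2 - 1), 2 * z[1]])
            return clean + 0.1 * rng.normal(size=2)    # mini-batch noise stand-in
        print(adaptive_sgld(stoch_grad, np.zeros(2)))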

    On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

    Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient: their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points. Comment: A preliminary version of this paper, with a subset of the results that are presented here, was presented at ICML 2017 (also as arXiv:1703.00887).
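    As an illustration of the perturbed SGD analyzed in this line of work, the sketch below runs per-sample SGD on a toy finite sum and adds a small random kick when the sampled gradient is small; the objective, constants, and perturbation schedule are assumptions made for the example, not the paper's.

        import numpy as np

        def perturbed_sgd(grad_i, n, x0, eta=5e-3, g_thresh=5e-2, radius=1e-2,
                          t_noise=100, n_epochs=200, rng=np.random.default_rng(0)):
            """Sketch of perturbed SGD: sample one index per step, take a stochastic
            gradient step, and occasionally add a small random perturbation when the
            sampled gradient looks small (a possible saddle region)."""
            x = np.asarray(x0, float)
            last_noise, t = -t_noise, 0
            for _ in range(n_epochs):
                for i in rng.permutation(n):
                    g = grad_i(x, i)
                    if np.linalg.norm(g) <= g_thresh and t - last_noise >= t_noise:
                        x = x + radius * rng.normal(size=x.shape)   # escape kick
                        last_noise = t
                    else:
                        x = x - eta * g                             # stochastic gradient step
                    t += 1
            return x

        # Toy finite sum with mean-zero per-sample offsets b_i, so the average loss
        # is (x0^2 - 1)^2 + x1^2: a saddle at the origin, minima at (+-1, 0).
        n = 20
        B = np.random.default_rng(1).normal(scale=0.02, size=(n, 2))
        B -= B.mean(axis=0)
        grad_i = lambda x, i: np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]]) + B[i]
        print(perturbed_sgd(grad_i, n, np.zeros(2)))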

    Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization

    We consider the problem of finding an approximate second-order stationary point of a constrained non-convex optimization problem. We first show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing that checking $(\epsilon_g, \epsilon_H)$-second-order stationarity is NP-hard even in the presence of linear constraints. Despite our hardness result, we identify instances of the problem for which checking second-order stationarity can be done efficiently. For such instances, we propose a dynamic second-order Frank-Wolfe algorithm which converges to $(\epsilon_g, \epsilon_H)$-second-order stationary points in $\mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problem can be solved efficiently.
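    For orientation, here is the classical first-order Frank-Wolfe iteration over the probability simplex, the building block that the proposed dynamic second-order variant extends; the second-order (negative-curvature) machinery from the paper is deliberately omitted, and the objective and step-size rule are illustrative assumptions.

        import numpy as np

        def frank_wolfe_simplex(grad, x0, n_iter=500):
            """Classical Frank-Wolfe over the probability simplex: the linear
            minimisation oracle picks the vertex (coordinate) with the most
            negative gradient entry, then we move toward it with step 2/(t+2)."""
            x = np.asarray(x0, float)
            for t in range(n_iter):
                g = grad(x)
                s = np.zeros_like(x)
                s[np.argmin(g)] = 1.0          # LMO over the simplex: best vertex
                gamma = 2.0 / (t + 2.0)        # standard open-loop step size
                x = (1 - gamma) * x + gamma * s
            return x

        # Toy usage: minimise ||x - b||^2 over the simplex (b lies outside it).
        b = np.array([0.8, 0.5, -0.2])
        grad = lambda x: 2 * (x - b)
        print(frank_wolfe_simplex(grad, np.ones(3) / 3))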

    Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently

    We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity. At the core of our algorithms is the division of the entire domain of the objective function into small and large gradient regions: our algorithms only perform a gradient descent based procedure in the large gradient region, and only perform negative curvature descent in the small gradient region. Our novel analysis shows that the proposed algorithms can escape the small gradient region in only one negative curvature descent step whenever they enter it, and thus they only need to perform at most $N_{\epsilon}$ negative curvature direction computations, where $N_{\epsilon}$ is the number of times the algorithms enter small gradient regions. For both deterministic and stochastic settings, we show that the proposed algorithms can potentially beat the state-of-the-art local minima finding algorithms. For the finite-sum setting, our algorithm can also outperform the best algorithm in a certain regime. Comment: 31 pages, 1 table
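    A sketch of the large/small gradient-region split described above: gradient steps while the gradient is large, and a negative-curvature step along the Hessian's most negative eigendirection once it is small. It uses a dense eigendecomposition purely for clarity (the paper's algorithms are designed precisely to avoid such computations), and all constants are illustrative assumptions.

        import numpy as np

        def gd_with_negative_curvature(grad, hess, x0, eta=1e-2, g_thresh=1e-3,
                                       h_thresh=1e-3, nc_step=0.1, max_iter=10_000):
            """Sketch of the region split: gradient steps in the large-gradient
            region, one negative-curvature step in the small-gradient region,
            termination at an approximate second-order stationary point."""
            x = np.asarray(x0, float)
            for _ in range(max_iter):
                g = grad(x)
                if np.linalg.norm(g) > g_thresh:
                    x = x - eta * g                       # large-gradient region
                    continue
                lam, V = np.linalg.eigh(hess(x))          # small-gradient region
                if lam[0] >= -h_thresh:
                    return x                              # approx. second-order point
                v = V[:, 0]                               # most negative curvature direction
                v = v if g @ v <= 0 else -v               # do not move uphill w.r.t. g
                x = x + nc_step * v
            return x

        # Toy usage: f(x, y) = x^4/4 - x^2/2 + y^2, saddle at origin, minima at (+-1, 0).
        grad = lambda z: np.array([z[0] ** 3 - z[0], 2 * z[1]])
        hess = lambda z: np.array([[3 * z[0] ** 2 - 1, 0.0], [0.0, 2.0]])
        print(gd_with_negative_curvature(grad, hess, np.zeros(2)))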

    Bowl breakout, escaping the positive region when searching for saddle points

    We present a scheme improving the minimum-mode following method for finding first order saddle points by confining the displacements of atoms to the subset of those subject to the largest force. By doing so it is ensured that the displacement remains of a local character within regions where all eigenvalues of the Hessian matrix are positive. However, as soon as a region is entered where an eigenvalue turns negative, all atoms are released to maintain the ability of determining concerted moves. Applying the proposed scheme reduces the required number of force calls for the determination of connected saddle points by a factor of two or more compared to a free search. Furthermore, a wider distribution of the relevant low barrier saddle points is obtained. Finally, the dependency on the initial distortion and the applied maximal step size is reduced, making minimum-mode guided searches both more robust and applicable. Comment: 19 pages, 7 figures
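    A loose sketch of the confinement rule, with arrays of random numbers standing in for a real force field: while all Hessian eigenvalues are positive, the trial displacement is masked down to the k atoms feeling the largest forces; once a negative eigenvalue appears, all atoms are released. The helper name and the choice of k are illustrative, not the authors' implementation.

        import numpy as np

        def confine_displacement(step, forces, lowest_eigval, k=5):
            """Confinement sketch: if every Hessian eigenvalue is positive (the
            'bowl'), keep only the displacement of the k atoms with the largest
            force norms; otherwise release all atoms for concerted moves.
            `step` and `forces` are (n_atoms, 3) arrays."""
            if lowest_eigval < 0.0:
                return step                             # outside the bowl: no mask
            norms = np.linalg.norm(forces, axis=1)
            active = np.argsort(norms)[-k:]             # k atoms with largest force
            mask = np.zeros(len(forces), dtype=bool)
            mask[active] = True
            return np.where(mask[:, None], step, 0.0)

        # Toy usage: only 5 of 20 atoms keep a nonzero displacement inside the bowl.
        rng = np.random.default_rng(0)
        step = 0.01 * rng.normal(size=(20, 3))
        forces = rng.normal(size=(20, 3))
        print(confine_displacement(step, forces, lowest_eigval=0.3, k=5).any(axis=1))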

    PGDOT -- Perturbed Gradient Descent Adapted with Occupation Time

    This paper further develops the idea of perturbed gradient descent (PGD) by adapting the perturbation to the history of states via the notion of occupation time. The proposed algorithm, perturbed gradient descent adapted with occupation time (PGDOT), is shown to converge at least as fast as the PGD algorithm and is guaranteed to avoid getting stuck at saddle points. The analysis is corroborated by empirical studies, in which a mini-batch version of PGDOT is shown to outperform alternatives such as mini-batch gradient descent, Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs). In particular, the mini-batch PGDOT manages to escape saddle points whereas these alternatives fail. Comment: 15 pages, 7 figures, 1 table
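    One illustrative reading of the idea, not the paper's PGDOT update: count how long the iterate has lingered in the small-gradient region (its occupation time there) and let the injected noise grow with that count. Everything below, including the growth rule and all constants, is an assumption made for the sketch.

        import numpy as np

        def pgd_occupation_time(grad, x0, eta=1e-2, g_thresh=1e-3, base_radius=1e-3,
                                max_iter=10_000, rng=np.random.default_rng(0)):
            """Loose sketch: the perturbation radius is adapted with the number of
            consecutive iterations spent in the small-gradient region."""
            x = np.asarray(x0, float)
            occupation = 0                                       # time spent near a flat region
            for _ in range(max_iter):
                g = grad(x)
                if np.linalg.norm(g) <= g_thresh:
                    occupation += 1
                    radius = base_radius * np.sqrt(occupation)   # history-dependent kick
                    x = x + radius * rng.normal(size=x.shape)
                else:
                    occupation = 0
                    x = x - eta * g                              # plain gradient step
            return x

        # Toy usage on f(x, y) = x^4/4 - x^2/2 + y^2 (saddle at the origin).
        grad = lambda z: np.array([z[0] ** 3 - z[0], 2 * z[1]])
        print(pgd_occupation_time(grad, np.zeros(2)))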

    Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

    Neural networks with Rectified Linear Unit (ReLU) activation functions (a.k.a. ReLU networks) have achieved great empirical success in various domains. Nonetheless, existing results for learning ReLU networks either impose assumptions on the underlying data distribution (e.g., that it is Gaussian), or require the network size and/or training sample size to be sufficiently large. In this context, the problem of learning a two-layer ReLU network is approached in a binary classification setting, where the data are linearly separable and a hinge loss criterion is adopted. Leveraging the power of random noise perturbation, this paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general. This result is the first of its kind, requiring no assumptions on the data distribution, training/network size, or initialization. Convergence of the resultant iterative algorithm to a global minimum is analyzed by establishing both an upper bound and a lower bound on the number of non-zero updates to be performed. Moreover, generalization guarantees are developed for ReLU networks trained with the novel SGD leveraging classic compression bounds. These guarantees highlight a key difference (at least in the worst case) between reliably learning a ReLU network and a leaky ReLU network in terms of sample complexity. Numerical tests using both synthetic data and real images validate the effectiveness of the algorithm and the practical merits of the theory. Comment: 23 pages, 7 figures, work in progress
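    A sketch in the spirit of the described algorithm: SGD with a random noise perturbation added at every update, applied to a one-hidden-layer ReLU network with fixed +-1 output weights and the hinge loss on +-1 labels. The architecture choices, step size, and noise level are illustrative assumptions and carry none of the paper's guarantees.

        import numpy as np

        def noisy_sgd_relu(X, y, k=4, eta=0.1, noise=1e-3, n_epochs=50,
                           rng=np.random.default_rng(0)):
            """Sketch: SGD with noise perturbation for a one-hidden-layer ReLU
            network, hinge loss on +-1 labels, fixed +-1 output weights."""
            n, d = X.shape
            W = rng.normal(scale=0.1, size=(k, d))          # hidden-layer weights
            v = np.where(np.arange(k) % 2 == 0, 1.0, -1.0)  # fixed output weights
            for _ in range(n_epochs):
                for i in rng.permutation(n):
                    x, yi = X[i], y[i]
                    pre = W @ x
                    out = v @ np.maximum(pre, 0.0)
                    if yi * out < 1.0:                      # hinge loss is active
                        # subgradient of max(0, 1 - y*out) with respect to W
                        gW = -yi * np.outer(v * (pre > 0), x)
                        W -= eta * gW
                    W += noise * rng.normal(size=W.shape)   # perturbation every update
            return W, v

        # Toy usage on linearly separable data: label = sign of the first coordinate.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 2)); X[:, 0] += np.sign(X[:, 0])   # margin along x0
        y = np.sign(X[:, 0])
        W, v = noisy_sgd_relu(X, y)
        pred = np.sign(np.maximum(X @ W.T, 0.0) @ v)
        print((pred == y).mean())                            # training accuracy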