
    Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions

    We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We first show that finding an ϵ-stationary point with first-order methods is impossible in finite time. We then introduce the notion of (δ, ϵ)-stationarity, which allows for an ϵ-approximate gradient to be the convex combination of generalized gradients evaluated at points within distance δ of the solution. We propose a series of randomized first-order methods and analyze their complexity of finding a (δ, ϵ)-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has minimax-optimal dependence on δ. Empirically, our methods perform well for training ReLU neural networks.
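    A toy computation makes the distinction concrete; the function f(x) = |x|, the sampling radius, and the sample count below are illustrative choices, not the paper's algorithm:

```python
import random

# Toy illustration (hypothetical parameters, not the paper's method): for
# f(x) = |x| the subgradient has norm 1 everywhere except the minimizer, so
# no nearby point is eps-stationary in the classical sense. Averaging
# generalized gradients sampled within distance delta, however, yields a
# small convex combination, i.e., a (delta, eps)-stationary certificate.

def subgrad_abs(x):
    return 1.0 if x > 0 else -1.0   # a subgradient of |x| (choose -1 at 0)

def delta_averaged_grad(x, delta, n_samples=4000, seed=0):
    rng = random.Random(seed)
    pts = [x + rng.uniform(-delta, delta) for _ in range(n_samples)]
    return sum(subgrad_abs(p) for p in pts) / n_samples  # convex combination

x, delta = 0.001, 0.1
print(abs(subgrad_abs(x)))                       # 1.0: never classically small
print(abs(delta_averaged_grad(x, delta)) < 0.1)  # True: small averaged gradient
```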

    Subgradient Descent Learns Orthogonal Dictionaries

    This paper concerns dictionary learning, i.e., sparse coding, a fundamental representation learning problem. We show that a subgradient descent algorithm, with random initialization, can provably recover orthogonal dictionaries on a natural nonsmooth, nonconvex ℓ₁ minimization formulation of the problem, under mild statistical assumptions on the data. This is in contrast to previous provable methods that require either expensive computation or delicate initialization schemes. Our analysis develops several tools for characterizing landscapes of nonsmooth functions, which might be of independent interest for provable training of deep networks with nonsmooth activations (e.g., ReLU), among numerous other applications. Preliminary experiments corroborate our analysis and show that our algorithm works well empirically in recovering orthogonal dictionaries.
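    A minimal sketch of the formulation, assuming an identity ground-truth dictionary and toy problem sizes (the paper's setting and guarantees are far more general):

```python
import math, random
random.seed(1)
# Toy sketch: minimize f(q) = (1/m) * ||q^T Y||_1 over the unit sphere by
# Riemannian subgradient descent with decaying steps. Columns of Y are
# sparse (Bernoulli-Gaussian), and the ground-truth dictionary is the
# identity, so minimizers are (up to sign) the standard basis vectors.
n, m = 3, 400
Y = [[random.gauss(0, 1) if random.random() < 0.3 else 0.0 for _ in range(m)]
     for _ in range(n)]

def normalize(q):
    s = math.sqrt(sum(v * v for v in q))
    return [v / s for v in q]

def f(q):
    return sum(abs(sum(q[i] * Y[i][j] for i in range(n))) for j in range(m)) / m

q0 = normalize([random.gauss(0, 1) for _ in range(n)])
q = q0
for t in range(500):
    g = [0.0] * n
    for j in range(m):
        s = sum(q[i] * Y[i][j] for i in range(n))
        sgn = (s > 0) - (s < 0)
        for i in range(n):
            g[i] += sgn * Y[i][j] / m          # a subgradient of f at q
    dot = sum(g[i] * q[i] for i in range(n))
    step = 0.3 / math.sqrt(t + 1)
    # Riemannian step: remove the radial component, move, renormalize
    q = normalize([q[i] - step * (g[i] - dot * q[i]) for i in range(n)])

print(f(q) < f(q0))            # True: the objective decreased
print(max(abs(v) for v in q))  # near 1: q aligns with a basis vector
```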

    Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems

    In this paper, we introduce a stochastic projected subgradient method for weakly convex (i.e., uniformly prox-regular) nonsmooth, nonconvex functions, a wide class that includes the additive and convex composite classes. At a high level, the method is an inexact proximal point iteration in which the strongly convex proximal subproblems are quickly solved with a specialized stochastic projected subgradient method. The primary contribution of this paper is a simple proof that the proposed algorithm converges at the same rate as the stochastic gradient method for smooth nonconvex problems. This result appears to be the first convergence rate analysis of a stochastic (or even deterministic) subgradient method for the class of weakly convex functions.
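    The two-level structure can be sketched on a toy weakly convex problem; the objective, constants, and iteration counts below are hypothetical choices for illustration:

```python
import random
random.seed(0)
# Toy sketch: minimize the weakly convex f(x) = (1/m) * sum_i |x^2 - a_i|
# with a_i near 1. Outer loop: inexact proximal point steps with
# regularization lam larger than the weak-convexity modulus (here 2, so
# lam = 3 makes each subproblem strongly convex). Inner loop: stochastic
# projected subgradient on the subproblem.
m = 200
a = [1.0 + random.uniform(-0.2, 0.2) for _ in range(m)]
lam = 3.0

def subgrad_sample(y, rng):
    ai = a[rng.randrange(m)]
    return (2 * y) * ((y * y > ai) - (y * y < ai))  # subgradient of |y^2 - a_i|

x = 2.0
rng = random.Random(1)
for k in range(40):            # outer: inexact proximal point
    y = x
    for t in range(60):        # inner: subproblem f_i(y) + (lam/2)(y - x)^2
        g = subgrad_sample(y, rng) + lam * (y - x)
        y -= g / (lam * (t + 1))    # decaying steps for strong convexity
        y = max(min(y, 3.0), -3.0)  # projection onto a box keeps iterates bounded
    x = y
print(abs(x))  # close to 1, a stationary point of f
```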

    Proximal Gradient Method for Nonsmooth Optimization over the Stiefel Manifold

    We consider optimization problems over the Stiefel manifold whose objective is the sum of a smooth function and a nonsmooth function. Existing methods for solving such problems fall into three classes. Algorithms in the first class rely on information about subgradients of the objective and thus tend to converge slowly in practice. Algorithms in the second class are proximal point algorithms, which involve subproblems that can be as difficult as the original problem. Algorithms in the third class are based on operator-splitting techniques, but they usually lack rigorous convergence guarantees. In this paper, we propose a retraction-based proximal gradient method for solving this class of problems. We prove that the proposed method globally converges to a stationary point. Iteration complexity for obtaining an ϵ-stationary solution is also analyzed. Numerical results on sparse PCA and compressed modes problems are reported to demonstrate the advantages of the proposed method.
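    A simplified single-step pattern can be sketched as follows; the paper's method solves a subproblem in the tangent space, whereas this toy takes a plain Euclidean proximal gradient step and retracts back, on a hypothetical sparse-PCA-style objective:

```python
import math, random
random.seed(3)
# Toy sketch: minimize -0.5*tr(X^T A X) + lam*||X||_1 over the Stiefel
# manifold St(n, p) = {X : X^T X = I}. Each iteration: Euclidean gradient
# step, entrywise soft-thresholding (the proximal map of lam*||.||_1),
# then retraction onto the manifold via Gram-Schmidt (thin-QR Q factor).
n, p, lam, eta = 4, 2, 0.1, 0.1
B = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
A = [[sum(B[k][i] * B[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]                          # symmetric PSD matrix

def retract(X):
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    out = []
    for c in cols:                               # Gram-Schmidt
        for o in out:
            d = sum(ci * oi for ci, oi in zip(c, o))
            c = [ci - d * oi for ci, oi in zip(c, o)]
        nrm = math.sqrt(sum(ci * ci for ci in c))
        out.append([ci / nrm for ci in c])
    return [[out[j][i] for j in range(p)] for i in range(n)]

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1 if v >= 0 else -1)

X = retract([[random.gauss(0, 1) for _ in range(p)] for _ in range(n)])
for _ in range(50):
    G = [[-sum(A[i][k] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(n)]                      # gradient of -0.5*tr(X^T A X)
    Y = [[soft(X[i][j] - eta * G[i][j], eta * lam) for j in range(p)]
         for i in range(n)]                      # Euclidean proximal step
    X = retract(Y)                               # back onto the manifold

err = max(abs(sum(X[i][r] * X[i][s] for i in range(n)) - (r == s))
          for r in range(p) for s in range(p))
print(err < 1e-9)  # True: X^T X = I holds after every retraction
```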

    Complexity of finding near-stationary points of convex functions stochastically

    In a recent paper, we showed that the stochastic subgradient method applied to a weakly convex problem drives the gradient of the Moreau envelope to zero at the rate O(k^{-1/4}). In this supplementary note, we present a stochastic subgradient method for minimizing a convex function, with the improved rate Õ(k^{-1/2}).
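    The Moreau envelope's gradient, the near-stationarity measure behind this rate, is available in closed form for f(x) = |x| (an illustrative example, not from the paper):

```python
# For f(x) = |x| the proximal map is soft-thresholding, so the gradient of
# the Moreau envelope f_lam is available in closed form and shows why it is
# a sensible near-stationarity measure: near 0 the subgradient of f has
# norm 1, but the envelope gradient is small.

def prox_abs(x, lam):
    # prox of lam*|.| at x: soft-thresholding
    return max(abs(x) - lam, 0.0) * (1 if x >= 0 else -1)

def moreau_grad(x, lam):
    # grad of the Moreau envelope f_lam at x equals (x - prox(x)) / lam
    return (x - prox_abs(x, lam)) / lam

x, lam = 0.01, 0.5
print(abs(moreau_grad(x, lam)))  # 0.02: x certifies near-stationarity for f
```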

    SpiderBoost and Momentum: Faster Stochastic Variance Reduction Algorithms

    SARAH and SPIDER are two recently developed stochastic variance-reduced algorithms, and SPIDER has been shown to achieve a near-optimal first-order oracle complexity in smooth nonconvex optimization. However, SPIDER uses an accuracy-dependent stepsize that slows down convergence in practice, and it cannot handle objective functions that involve nonsmooth regularizers. In this paper, we propose SpiderBoost, an improved scheme that allows a much larger, constant-level stepsize while maintaining the same near-optimal oracle complexity, and that can be extended with a proximal mapping to handle composite optimization (which is nonsmooth and nonconvex) with provable convergence guarantees. In particular, we show that proximal SpiderBoost achieves an oracle complexity of O(min{n^{1/2} ϵ^{-2}, ϵ^{-3}}) in composite nonconvex optimization, improving the state-of-the-art result by a factor of O(min{n^{1/6}, ϵ^{-1/3}}). We further develop a novel momentum scheme to accelerate SpiderBoost for composite optimization, which achieves the near-optimal oracle complexity in theory and substantial improvements in experiments.
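    The recursive SPIDER-style estimator with a constant stepsize can be sketched on toy quadratics (hypothetical data; for these components the recursive correction happens to be variance-free, which keeps the example deterministic):

```python
import random
# Toy sketch of the SPIDER/SpiderBoost gradient estimator with a constant
# stepsize: f_i(x) = 0.5*(x - a_i)^2, so the average is minimized at
# mean(a). The estimator v is refreshed with a full gradient every q steps
# and otherwise updated recursively from a single sampled component.
rng = random.Random(0)
a = [rng.uniform(0.0, 1.0) for _ in range(20)]
mean_a = sum(a) / len(a)

def grad_i(x, i):
    return x - a[i]                     # gradient of component f_i

def full_grad(x):
    return sum(grad_i(x, i) for i in range(len(a))) / len(a)

x, q, eta = 5.0, 5, 0.5
v = full_grad(x)
for k in range(100):
    x_prev, x = x, x - eta * v          # constant-stepsize update
    if (k + 1) % q == 0:
        v = full_grad(x)                # periodic full-gradient refresh
    else:
        i = rng.randrange(len(a))
        v = grad_i(x, i) - grad_i(x_prev, i) + v   # recursive estimator

print(abs(x - mean_a) < 1e-6)  # True (for quadratics the estimator is exact)
```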

    Catalyst Acceleration for Gradient-Based Non-Convex Optimization

    We introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. Even though these methods may require convexity to operate, the proposed approach allows one to use them on weakly convex objectives, which covers a large class of non-convex functions typically appearing in machine learning and signal processing. In general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of Nesterov and achieves a near-optimal convergence rate in function values. These properties are achieved without assuming any knowledge about the convexity of the objective, by automatically adapting to the unknown weak convexity constant. We conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks.
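    The outer-loop pattern can be sketched on a convex toy objective; the fixed extrapolation coefficient and constants below are illustrative, whereas the actual scheme sets them from (estimated) problem parameters:

```python
# Minimal Catalyst-style sketch (hypothetical constants): each outer step
# approximately minimizes f(x) + (kappa/2)*(x - y)^2 with a basic inner
# gradient method, then extrapolates. Here f(x) = 0.5*(x - 3)^2.

def inner_min(y, kappa, x0, steps=50, lr=0.2):
    x = x0
    for _ in range(steps):
        x -= lr * ((x - 3.0) + kappa * (x - y))  # grad of f + (kappa/2)(x-y)^2
    return x

kappa, x, y = 1.0, 10.0, 10.0
for _ in range(30):
    x_new = inner_min(y, kappa, x)
    y = x_new + 0.5 * (x_new - x)  # fixed extrapolation coefficient, for illustration
    x = x_new
print(abs(x - 3.0) < 1e-3)  # True: converges to the minimizer of f
```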

    A Smoothing SQP Framework for a Class of Composite L_q Minimization over Polyhedron

    The composite L_q (0 < q < 1) minimization problem over a general polyhedron has found various applications in machine learning, wireless communications, image restoration, signal reconstruction, etc. This paper provides a theoretical study of this problem. Firstly, we show that for any fixed 0 < q < 1, finding the global minimizer of the problem, even of its unconstrained counterpart, is strongly NP-hard. Secondly, we derive Karush-Kuhn-Tucker (KKT) optimality conditions for local minimizers of the problem. Thirdly, we propose a smoothing sequential quadratic programming framework for solving this problem; the framework requires an (approximate) solution of a convex quadratic program at each iteration. Finally, we analyze the worst-case iteration complexity of the framework for returning an ϵ-KKT point, i.e., a feasible point that satisfies a perturbed version of the derived KKT optimality conditions. To the best of our knowledge, the proposed framework is the first with a worst-case iteration complexity guarantee for solving composite L_q minimization over a general polyhedron.
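    The smoothing idea can be illustrated with a common ϵ-type smoothing of |x|^q (the paper's exact smoothing function may differ):

```python
# Hypothetical smoothing of |x|^q (q = 0.5): |x|^q has an unbounded
# derivative at 0, while (x^2 + mu^2)^(q/2) is smooth everywhere and
# converges to |x|^q as the smoothing parameter mu goes to 0.
q = 0.5

def smoothed(x, mu):
    return (x * x + mu * mu) ** (q / 2)

def smoothed_grad(x, mu):
    return q * x * (x * x + mu * mu) ** (q / 2 - 1)

print(abs(smoothed(2.0, 1e-6) - 2.0 ** q) < 1e-6)  # True: mu -> 0 recovers |x|^q
print(smoothed_grad(0.0, 0.1))                     # 0.0: finite at the kink
```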

    Graphical Convergence of Subgradients in Nonconvex Optimization and Learning

    We investigate the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex. Compositions of Lipschitz convex functions with smooth maps are the primary examples of such losses. We analyze the estimation quality of such nonsmooth and nonconvex problems by their sample average approximations. Our main results establish dimension-dependent rates on subgradient estimation in full generality and dimension-independent rates when the loss is a generalized linear model. As an application of the developed techniques, we analyze the nonsmooth landscape of a robust nonlinear regression problem.
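    Sample average approximation of subgradients can be checked numerically on a simple convex instance (illustrative only; the paper treats general weakly convex losses and graphical convergence):

```python
import random
# Toy illustration of subgradient estimation by sample average
# approximation (a hypothetical convex instance): for the loss
# loss(x, a) = |x - a| with a ~ Uniform[-1, 1], the population subgradient
# at x in (-1, 1) is E[sign(x - a)] = x; the empirical average of
# per-sample subgradients concentrates around it as the sample grows.
rng = random.Random(0)

def saa_subgrad(x, m):
    # average of per-sample subgradients sign(x - a_i)
    return sum((x > a) - (x < a) for a in (rng.uniform(-1, 1) for _ in range(m))) / m

x = 0.3
est = saa_subgrad(x, 20000)
print(abs(est - x) < 0.05)  # True: the empirical subgradient is close
```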

    Asynchronous Parallel Algorithms for Nonconvex Optimization

    We propose a new asynchronous parallel block-descent algorithmic framework for minimizing the sum of a smooth nonconvex function and a nonsmooth convex one, subject to both convex and nonconvex constraints. The proposed framework hinges on successive convex approximation techniques and a novel probabilistic model that captures key elements of modern computational architectures and asynchronous implementations more faithfully than current state-of-the-art models. Other key features of the framework are: i) it covers in a unified way several specific solution methods; ii) it accommodates a variety of possible parallel computing architectures; and iii) it can deal with nonconvex constraints. Almost sure convergence to stationary solutions is proved, and theoretical complexity results are provided, showing nearly ideal linear speedup when the number of workers is not too large.
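    A sequential stand-in for the block-descent idea (the framework's asynchrony and probabilistic model are not modeled here; problem data are hypothetical):

```python
import random
random.seed(2)
# Sequential sketch of block descent via successive convex approximation:
# minimize 0.5*||Ax - b||^2 + lam*||x||_1 by cyclically minimizing, over
# each coordinate "block", a convex surrogate (linearization of the smooth
# part plus a proximal term plus the l1 term), whose minimizer is a
# soft-thresholding step. tau exceeds the blockwise Lipschitz constant,
# which guarantees monotone descent.
n = 4
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
b = [1.0, -2.0, 0.5, 0.0]
lam, tau = 0.1, 25.0

def residual(x):
    return [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]

def objective(x):
    r = residual(x)
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in x)

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1 if v >= 0 else -1)

x = [0.0] * n
f0 = objective(x)
for sweep in range(100):
    for j in range(n):                             # each coordinate is a block
        r = residual(x)
        g = sum(A[i][j] * r[i] for i in range(n))  # partial gradient at x
        x[j] = soft(x[j] - g / tau, lam / tau)     # surrogate minimizer
print(objective(x) < f0)  # True: monotone decrease toward a stationary point
```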