4,422 research outputs found

    Non-asymptotic Analysis of Stochastic Methods for Non-Smooth Non-Convex Regularized Problems

    Stochastic Proximal Gradient (SPG) methods have been widely used for solving optimization problems with a simple (possibly non-smooth) regularizer in machine learning and statistics. However, to the best of our knowledge, no non-asymptotic convergence analysis of SPG exists for non-convex optimization with a non-smooth and non-convex regularizer. All existing non-asymptotic analyses of SPG for solving non-smooth non-convex problems require the non-smooth regularizer to be a convex function, and hence are not applicable to non-smooth non-convex regularized problems. This work initiates the analysis to bridge this gap and opens the door to non-asymptotic convergence analysis of non-smooth non-convex regularized problems. We analyze several variants of mini-batch SPG methods for minimizing a non-convex objective that consists of a smooth non-convex loss and a non-smooth non-convex regularizer. Our contributions are two-fold: (i) we show that they enjoy the same complexities as their counterparts for solving convex regularized non-convex problems in terms of finding an approximate stationary point; (ii) we develop more practical variants that use a dynamic mini-batch size instead of a fixed one, without requiring the target accuracy level of the solution. The significance of our results is that they improve upon the state-of-the-art results for solving non-smooth non-convex regularized problems. We also empirically demonstrate the effectiveness of the considered SPG methods in comparison with other peer stochastic methods. Comment: Accepted to NeurIPS 201
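
    The abstract describes the method only at a high level; as a rough, hedged sketch of the kind of update involved, the loop below performs mini-batch stochastic proximal gradient steps with a dynamically growing mini-batch size. The quadratic loss, the $\ell_0$ regularizer (whose proximal operator is hard thresholding), and all parameter choices are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def prox_l0(z, tau):
    """Proximal operator of tau * ||x||_0: keep entries with z_i^2 > 2*tau."""
    out = z.copy()
    out[z ** 2 <= 2.0 * tau] = 0.0
    return out

def minibatch_spg(A, b, lam=0.1, eta=0.01, epochs=30, b0=8, rho=1.2, seed=0):
    """Mini-batch SPG for min_x (1/2n)*||Ax - b||^2 + lam*||x||_0,
    with a geometrically growing (accuracy-free) mini-batch size."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    batch = b0
    for _ in range(epochs):
        idx = rng.choice(n, size=min(int(batch), n), replace=False)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)  # mini-batch gradient
        x = prox_l0(x - eta * g, eta * lam)              # proximal step
        batch *= rho                                     # grow mini-batch size
    return x

A = np.random.default_rng(1).normal(size=(200, 20))
x_true = np.zeros(20); x_true[:3] = 1.0
b = A @ x_true + 0.01 * np.random.default_rng(2).normal(size=200)
print(minibatch_spg(A, b)[:5])
```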

    Simple Stochastic Gradient Methods for Non-Smooth Non-Convex Regularized Optimization

    Our work focuses on stochastic gradient methods for optimizing a smooth non-convex loss function with a non-smooth non-convex regularizer. Research on this class of problems is quite limited, and until recently no non-asymptotic convergence results had been reported. We present two simple stochastic gradient algorithms, for finite-sum and general stochastic optimization problems, which have superior convergence complexities compared to the current state of the art. We also compare our algorithms' performance in practice for empirical risk minimization.
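
    To make the problem class concrete, here is a hedged sketch of a single-sample stochastic proximal gradient step with the minimax concave penalty (MCP), a standard non-smooth non-convex regularizer with a closed-form proximal operator. The choice of MCP and all constants are assumptions for illustration; the abstract does not specify the paper's algorithms.

```python
import numpy as np

def prox_mcp(z, eta, lam, gamma):
    """Closed-form prox of eta * MCP (standard form, assuming gamma > eta):
    MCP_{lam,gamma}(t) = lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam,
                         gamma*lam^2/2            otherwise."""
    out = np.empty_like(z)
    a = np.abs(z)
    small = a <= eta * lam
    mid = (a > eta * lam) & (a <= gamma * lam)
    out[small] = 0.0
    out[mid] = np.sign(z[mid]) * (a[mid] - eta * lam) / (1.0 - eta / gamma)
    out[~small & ~mid] = z[~small & ~mid]
    return out

def finite_sum_spg(A, b, lam=0.2, gamma=3.0, eta=0.05, iters=500, seed=0):
    """One-sample stochastic proximal gradient on
    min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + MCP(x)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]               # single-sample gradient
        x = prox_mcp(x - eta * g, eta, lam, gamma)  # non-convex proximal step
    return x
```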

    Convergence of Stochastic Proximal Gradient Algorithm

    We prove novel convergence results for a stochastic proximal gradient algorithm suitable for solving a large class of convex optimization problems, where a convex objective function is given by the sum of a smooth and a possibly non-smooth component. We consider convergence of the iterates and derive $O(1/n)$ non-asymptotic bounds in expectation in the strongly convex case, as well as almost sure convergence results under weaker assumptions. Our approach allows us to avoid averaging and to weaken the boundedness assumptions that are often imposed in theoretical studies and might not be satisfied in practice. Comment: 24 pages
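
    As a toy illustration of this setting (a strongly convex smooth part plus a non-smooth convex part, with the last iterate returned instead of an average), the sketch below runs stochastic proximal gradient on an $\ell_2$-regularized least-squares loss with an $\ell_1$ term and an $O(1/k)$ step size; all concrete choices are assumptions, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(z, tau):
    """Prox of tau * ||x||_1 (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def spg_last_iterate(A, b, mu=0.5, lam=0.1, c=1.0, iters=2000, seed=0):
    """Stochastic proximal gradient on the strongly convex problem
    min_x (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + (mu/2)*||x||^2 + lam*||x||_1,
    with step size c/(mu*(k+1)) and no averaging of the iterates."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i] + mu * x  # stochastic gradient, smooth part
        eta = c / (mu * (k + 1))               # O(1/k) decaying step size
        x = soft_threshold(x - eta * g, eta * lam)
    return x                                   # last iterate, not an average
```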

    Dual Iterative Hard Thresholding: From Non-convex Sparse Minimization to Non-smooth Concave Maximization

    Iterative Hard Thresholding (IHT) is a class of projected gradient descent methods for optimizing sparsity-constrained minimization models, with the best known efficiency and scalability in practice. As far as we know, the existing IHT-style methods are designed for sparse minimization in primal form. It remains open to explore duality theory and algorithms in such a non-convex and NP-hard problem setting. In this paper, we bridge this gap by establishing a duality theory for sparsity-constrained minimization with an $\ell_2$-regularized loss function and proposing an IHT-style algorithm for dual maximization. Our sparse duality theory provides a set of necessary and sufficient conditions under which the original NP-hard/non-convex problem can be equivalently solved in a dual formulation. The proposed dual IHT algorithm is a super-gradient method for maximizing the non-smooth dual objective. An interesting finding is that the sparse recovery performance of dual IHT is invariant to the Restricted Isometry Property (RIP), which is required by virtually all the existing primal IHT algorithms without sparsity relaxation. Moreover, a stochastic variant of dual IHT is proposed for large-scale stochastic optimization. Numerical results demonstrate the superiority of dual IHT algorithms over the state-of-the-art primal IHT-style algorithms in model estimation accuracy and computational efficiency.
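
    The abstract takes the primal IHT update as given; a minimal sketch of that primal update is below (the paper's dual IHT operates on a dual objective and is not reproduced here). The least-squares loss and parameter choices are illustrative.

```python
import numpy as np

def hard_threshold(z, k):
    """Projection onto the sparsity constraint ||x||_0 <= k:
    keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -k)[-k:]
    out[keep] = z[keep]
    return out

def iht(A, b, k, eta=None, iters=200):
    """Primal IHT for min_x 0.5*||Ax - b||^2  s.t.  ||x||_0 <= k."""
    n, d = A.shape
    if eta is None:
        eta = 1.0 / np.linalg.norm(A, 2) ** 2  # step size from the spectral norm
    x = np.zeros(d)
    for _ in range(iters):
        x = hard_threshold(x + eta * A.T @ (b - A @ x), k)  # gradient + projection
    return x
```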

    A Variable Sample-size Stochastic Quasi-Newton Method for Smooth and Nonsmooth Stochastic Convex Optimization

    Classical theory for quasi-Newton schemes has focused on smooth deterministic unconstrained optimization, while recent forays into stochastic convex optimization have largely resided in smooth, unconstrained, and strongly convex regimes. Naturally, there is a compelling need to address nonsmoothness, the lack of strong convexity, and the presence of constraints. Accordingly, this paper presents a quasi-Newton framework that can process merely convex and possibly nonsmooth (but smoothable) stochastic convex problems. We propose a framework that combines iterative smoothing and regularization with a variance-reduced scheme reliant on increasing sample sizes of gradients. We make the following contributions. (i) We develop a regularized and smoothed variable sample-size BFGS update (rsL-BFGS) that generates a sequence of Hessian approximations and can accommodate nonsmooth convex objectives by utilizing iterative regularization and smoothing. (ii) In strongly convex regimes with state-dependent noise, the proposed variable sample-size stochastic quasi-Newton scheme admits a non-asymptotic linear rate of convergence, while the oracle complexity of computing an $\epsilon$-solution is $\mathcal{O}(\kappa^{m+1}/\epsilon)$, where $\kappa$ is the condition number and $m \geq 1$. In nonsmooth (but smoothable) regimes, using Moreau smoothing retains the linear convergence rate, while using more general smoothing leads to a deterioration of the rate to $\mathcal{O}(k^{-1/3})$ for the resulting smoothed VS-SQN scheme. (iii) In merely convex but smooth settings, the regularized VS-SQN scheme rVS-SQN displays a rate of $\mathcal{O}(1/k^{1-\varepsilon})$. When the smoothness requirements are weakened, the rate for the regularized and smoothed VS-SQN scheme worsens to $\mathcal{O}(k^{-1/3})$. Such statements allow for a state-dependent noise assumption under a quadratic growth property.
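
    The proposed scheme combines several ingredients; the sketch below illustrates only two of them, Moreau smoothing of an $\ell_1$ term (whose envelope gradient is a clipped, Huber-type map) and geometrically increasing sample sizes, with a plain gradient step standing in for the rsL-BFGS direction, which is omitted. All names and constants are assumptions, not the paper's method.

```python
import numpy as np

def huber_grad(x, mu):
    """Gradient of the Moreau envelope of ||x||_1 with parameter mu
    (the Huber function): clip(x/mu, -1, 1)."""
    return np.clip(x / mu, -1.0, 1.0)

def smoothed_vs_gradient(A, b, lam=0.1, mu=0.1, eta=0.05, iters=40,
                         n0=4, rho=1.5, seed=0):
    """Variable sample-size gradient scheme on the smoothed objective
    (1/2n)*||Ax - b||^2 + lam * env_mu(||x||_1); sample sizes grow geometrically.
    A quasi-Newton direction (rsL-BFGS in the paper) would replace -g below."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    N = n0
    for _ in range(iters):
        idx = rng.choice(n, size=min(int(N), n), replace=False)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx) + lam * huber_grad(x, mu)
        x = x - eta * g   # plain gradient step in place of the quasi-Newton step
        N *= rho          # increasing sample size
    return x
```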

    Graphical Convergence of Subgradients in Nonconvex Optimization and Learning

    We investigate the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex. Compositions of Lipschitz convex functions with smooth maps are the primary examples of such losses. We analyze the estimation quality of such nonsmooth and nonconvex problems by their sample average approximations. Our main results establish dimension-dependent rates on subgradient estimation in full generality and dimension-independent rates when the loss is a generalized linear model. As an application of the developed techniques, we analyze the nonsmooth landscape of a robust nonlinear regression problem. Comment: 36 pages
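
    To make the estimated object concrete, the sketch below forms the sample-average subgradient of a robust nonlinear regression loss, a weakly convex composition of the Lipschitz convex $|\cdot|$ with a smooth map, which is the kind of estimate whose convergence the paper quantifies. The specific phase-retrieval-style loss and data model are assumptions.

```python
import numpy as np

def saa_subgradient(x, A, b):
    """Sample-average subgradient of the weakly convex robust loss
    F_m(x) = (1/m) * sum_i |(a_i^T x)^2 - b_i|
    (a composition of |.| with a smooth map)."""
    s = A @ x
    return A.T @ (np.sign(s ** 2 - b) * 2.0 * s) / len(b)

rng = np.random.default_rng(0)
x_star = rng.normal(size=10)
A = rng.normal(size=(5000, 10))                                # m = 5000 samples
b = (A @ x_star) ** 2 + 0.1 * rng.standard_t(df=2, size=5000)  # heavy-tailed noise
x = rng.normal(size=10)                                        # query point
print(np.linalg.norm(saa_subgradient(x, A, b)))  # SAA subgradient norm at x
```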

    Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence

    Difference of convex (DC) functions cover a broad family of non-convex and possibly non-smooth and non-differentiable functions, and have wide applications in machine learning and statistics. Although deterministic algorithms for DC functions have been extensively studied, stochastic optimization, which is more suitable for learning with big data, remains under-explored. In this paper, we propose new stochastic optimization algorithms and study their first-order convergence theories for solving a broad family of DC functions. We improve the existing algorithms and theories of stochastic optimization for DC functions from both practical and theoretical perspectives. On the practical side, our algorithm is more user-friendly, as it does not require a large mini-batch size, and more efficient, as it saves unnecessary computations. On the theoretical side, our convergence analysis does not necessarily require the involved functions to be smooth with Lipschitz continuous gradient. Instead, the convergence rate of the proposed stochastic algorithm adapts automatically to the Hölder continuity of the gradient of one component function. Moreover, we extend the proposed stochastic algorithms for DC functions to solve problems with a general non-convex non-differentiable regularizer, which does not necessarily have a DC decomposition but enjoys an efficient proximal mapping. To the best of our knowledge, this is the first work to give non-asymptotic convergence guarantees for non-convex optimization whose objective has a general non-convex non-differentiable regularizer. Comment: In the revised version, we present some improved complexity results for non-smooth and non-convex regularizers and for functions with known Hölder continuity parameter $\nu \in (0,1]$ by a simple change of an algorithmic parameter.
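
    A hedged sketch of the basic stochastic DC step: the concave part is linearized at the current iterate via a subgradient, the smooth part is handled with a stochastic gradient, and the remaining convex non-smooth part stays in the proximal mapping. The $\ell_1$-minus-$\ell_2$ penalty and all constants are illustrative assumptions, not the paper's full algorithm.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def stochastic_dc(A, b, lam=0.1, eta=0.05, iters=1000, seed=0):
    """Stochastic DC-style method for the non-convex l1-minus-l2 penalty:
    min_x (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + lam*(||x||_1 - ||x||_2).
    The concave part -lam*||x||_2 is linearized at x_t via a subgradient v_t;
    the convex l1 part is handled exactly by the prox."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)
        nrm = np.linalg.norm(x)
        v = lam * x / nrm if nrm > 0 else np.zeros(d)  # subgradient of lam*||x||_2
        g = (A[i] @ x - b[i]) * A[i] - v               # stochastic grad minus linearization
        x = soft_threshold(x - eta * g, eta * lam)     # prox handles lam*||x||_1
    return x
```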

    NEON+: Accelerated Gradient Methods for Extracting Negative Curvature for Non-Convex Optimization

    Accelerated gradient (AG) methods are breakthroughs in convex optimization, improving the convergence rate of the gradient descent method for optimization with smooth functions. However, the analysis of AG methods for non-convex optimization is still limited. It remains an open question whether AG methods from convex optimization can accelerate the convergence of the gradient descent method for finding local minima of non-convex optimization problems. This paper provides an affirmative answer to this question. In particular, we analyze two renowned variants of AG methods (namely Polyak's Heavy Ball method and Nesterov's Accelerated Gradient method) for extracting negative curvature from random noise, which is central to escaping from saddle points. By leveraging the proposed AG methods for extracting negative curvature, we present a new AG algorithm with double loops for non-convex optimization (in contrast to a single-loop AG algorithm proposed in a recent manuscript \citep{AGNON}, which directly analyzed Nesterov's AG method for non-convex optimization and appeared online on November 29, 2017; we emphasize that our work is independent, inspired by our earlier work \citep{NEON17}, and based on a different, novel analysis). The algorithm converges to a second-order stationary point $x$ such that $\|\nabla f(x)\| \leq \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\epsilon}\, I$ with $\widetilde{O}(1/\epsilon^{1.75})$ iteration complexity, improving on that of the gradient descent method by a factor of $\epsilon^{-0.25}$ and matching the best iteration complexity of second-order Hessian-free methods for non-convex optimization. Comment: The main result is merged into our manuscript "First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time" (arXiv:1711.01944).
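
    The core primitive, extracting a negative-curvature direction from random noise using only gradient differences, can be sketched roughly as follows: since $\nabla f(x+u) - \nabla f(x) \approx \nabla^2 f(x)\, u$ for small $u$, a heavy-ball recursion on these differences acts as a momentum power iteration that amplifies the most negative eigen-directions. The test function and all constants below are assumptions, and this is not the paper's exact procedure.

```python
import numpy as np

def neon_heavy_ball(grad, x, eta=0.1, beta=0.5, r=1e-3, iters=200, seed=0):
    """Extract an approximate negative-curvature direction of the Hessian at x
    using only gradient evaluations: grad(x + u) - grad(x) ~ H @ u, so the
    heavy-ball recursion u+ = u - eta*(H u) + beta*(u - u_prev) amplifies the
    eigen-directions of H with the most negative eigenvalues."""
    rng = np.random.default_rng(seed)
    u_prev = np.zeros_like(x)
    u = r * rng.normal(size=x.shape)  # start from small random noise
    u *= r / np.linalg.norm(u)
    g0 = grad(x)
    for _ in range(iters):
        Hu = grad(x + u) - g0         # finite-difference Hessian-vector product
        u, u_prev = u - eta * Hu + beta * (u - u_prev), u
        u *= r / np.linalg.norm(u)    # keep the perturbation small
    return u / np.linalg.norm(u)

# Saddle of f(x) = x0^2 - x1^2 at the origin: output should align with the
# x1 axis (eigenvalue -2), up to sign.
f_grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
print(neon_heavy_ball(f_grad, np.zeros(2)))
```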

    An Iterative Regularized Incremental Projected Subgradient Method for a Class of Bilevel Optimization Problems

    We study a class of bilevel convex optimization problems where the goal is to find the minimizer of an objective function in the upper level among the set of all optimal solutions of an optimization problem in the lower level. A wide range of problems in convex optimization can be formulated using this class. An important example is the case where an optimization problem is ill-posed. In this paper, our interest lies in addressing bilevel problems where the lower-level objective is given as a finite sum of separate nondifferentiable convex component functions. This is the case in a variety of applications in distributed optimization, such as large-scale data processing in machine learning and neural networks. To the best of our knowledge, this class of bilevel problems, with a finite sum in the lower level, has not been addressed before. Motivated by this gap, we develop an iterative regularized incremental subgradient method, where the agents update their iterates in a cyclic manner using a regularized subgradient. Under a suitable choice of the regularization parameter sequence, we establish the convergence of the proposed algorithm and derive a rate of $\mathcal{O}(1/k^{0.5-\epsilon})$ in terms of the lower-level objective function for an arbitrarily small $\epsilon > 0$. We present the performance of the algorithm on a binary text classification problem. Comment: 8 pages, 1 figure
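
    The update described here, cycling over the lower-level components while a vanishing regularization weight couples in the upper-level objective, might look roughly like the sketch below. The component losses, the min-norm upper-level objective, and the parameter decay choices are illustrative assumptions, not the sequences for which the paper proves its rate.

```python
import numpy as np

def ir_incremental_subgrad(A, b, cycles=200):
    """Iterative regularized incremental subgradient sketch for
    lower level: min_x (1/m) sum_i |a_i^T x - b_i|  (nondifferentiable components)
    upper level: among lower-level solutions, min_x 0.5*||x||^2 (min-norm choice).
    Each agent i steps along its own subgradient plus a vanishing
    regularization term lambda_k * x from the upper-level objective."""
    m, d = A.shape
    x = np.zeros(d)
    for k in range(1, cycles + 1):
        alpha = 1.0 / k ** 0.75  # illustrative diminishing step size
        lam = 1.0 / k ** 0.25    # illustrative vanishing regularization parameter
        for i in range(m):       # cyclic pass over the m components
            g_i = np.sign(A[i] @ x - b[i]) * A[i]  # subgradient of |a_i^T x - b_i|
            x = x - alpha * (g_i + lam * x)        # regularized incremental step
    return x
```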

    Stochastic Proximal Methods for Non-Smooth Non-Convex Constrained Sparse Optimization

    This paper focuses on stochastic proximal gradient methods for optimizing a smooth non-convex loss function with a non-smooth non-convex regularizer and convex constraints. To the best of our knowledge, we present the first non-asymptotic convergence results for this class of problems. We present two simple stochastic proximal gradient algorithms, for general stochastic and finite-sum optimization problems, which have the same or superior convergence complexities compared to the current best results for the unconstrained problem setting. In a numerical experiment we compare our algorithms with the current state-of-the-art deterministic algorithm and find our algorithms to exhibit superior convergence. Comment: arXiv admin note: text overlap with arXiv:1901.0836
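
    As one concrete instance of this composite structure (a non-convex regularizer plus a convex constraint), the proximal step can sometimes be evaluated exactly; the sketch below does so for an $\ell_0$ penalty combined with a box constraint containing the origin. This specific pairing and all constants are assumptions for illustration, not the paper's setting.

```python
import numpy as np

def prox_l0_box(z, tau, lo, hi):
    """Exact prox of tau*||x||_0 + indicator([lo, hi]^d), assuming lo <= 0 <= hi:
    compare the clipped point against zero, coordinatewise."""
    v = np.clip(z, lo, hi)                           # best feasible nonzero candidate
    keep = 0.5 * (v - z) ** 2 + tau < 0.5 * z ** 2   # does nonzero beat zero?
    return np.where(keep & (v != 0), v, 0.0)

def constrained_spg(A, b, lam=0.1, lo=-1.0, hi=1.0, eta=0.01, iters=500, seed=0):
    """Stochastic proximal gradient for
    min_x (1/2n)*||Ax - b||^2 + lam*||x||_0  s.t.  lo <= x_i <= hi."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]                 # single-sample gradient
        x = prox_l0_box(x - eta * g, eta * lam, lo, hi)
    return x
```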