
    Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence

    In this paper, we study stochastic non-convex optimization with non-convex random functions. Recent studies on non-convex optimization revolve around establishing second-order convergence, i.e., converging to a nearly second-order optimal stationary point. However, existing results on stochastic non-convex optimization are limited, especially regarding second-order convergence in high probability. We propose a novel updating step (named NCG-S) that leverages a stochastic gradient and a noisy negative curvature direction of a stochastic Hessian, where the stochastic gradient and Hessian are computed on a proper mini-batch of random functions. Building on this step, we develop two algorithms and establish their high probability second-order convergence. To the best of our knowledge, the proposed stochastic algorithms are the first with second-order convergence in {\it high probability} and a time complexity that is {\it almost linear} in the problem's dimensionality.
    Comment: This short paper will appear at the NIPS 2017 Workshop on Optimization for Machine Learning. Partial results are presented in arXiv:1709.08571. The second version corrects a statement regarding previous work.
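
    A minimal sketch of what such an update step might look like, assuming callable mini-batch oracles; the names `minibatch_grad` and `approx_min_eigvec`, the step size `eta`, and the curvature threshold `eps2` are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def ncg_s_step(x, minibatch_grad, approx_min_eigvec, eta, eps2):
    """One illustrative NCG-S-style update: follow the stochastic gradient
    unless the stochastic Hessian exhibits sufficiently negative curvature,
    in which case descend along the noisy curvature direction."""
    g = minibatch_grad(x)          # stochastic gradient from a mini-batch
    lam, v = approx_min_eigvec(x)  # noisy smallest eigenpair of a stochastic Hessian
    if lam < -eps2:
        if np.dot(g, v) > 0:       # orient v so it is a descent direction
            v = -v
        return x + eta * v         # negative-curvature step (escapes saddles)
    return x - eta * g             # ordinary stochastic gradient step
```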

    On Noisy Negative Curvature Descent: Competing with Gradient Descent for Faster Non-convex Optimization

    The Hessian-vector product has been utilized to find a second-order stationary solution with a strong complexity guarantee (e.g., almost linear time complexity in the problem's dimensionality). In this paper, we propose to further reduce the number of Hessian-vector products for faster non-convex optimization. Previous algorithms need to approximate the smallest eigenvalue with a sufficient precision (e.g., $\epsilon_2 \ll 1$) in order to achieve a sufficiently accurate second-order stationary solution (i.e., $\lambda_{\min}(\nabla^2 f(\mathbf x)) \geq -\epsilon_2$). In contrast, the proposed algorithms only need to compute an approximate smallest eigenvector whose eigenvalue matches the smallest eigenvalue up to a small power of the current gradient's norm. As a result, they can dramatically reduce the number of Hessian-vector products during the course of optimization before reaching first-order stationary points (e.g., saddle points). The key building block of the proposed algorithms is a novel updating step named the NCG step, which lets a noisy negative curvature descent compete with gradient descent. We show that the worst-case time complexity of the proposed algorithms with their favorable prescribed accuracy requirements can match the best in the literature for achieving a second-order stationary point, but with an arguably smaller per-iteration cost. We also show that the proposed algorithms can benefit from an inexact Hessian by developing variants that accept an inexact Hessian under a mild condition while achieving the same goal. Moreover, we develop a stochastic algorithm for a finite-sum or infinite-sum non-convex optimization problem. To the best of our knowledge, the proposed stochastic algorithm is the first one that converges to a second-order stationary point in {\it high probability} with a time complexity independent of the sample size and almost linear in the dimensionality.
    Comment: Added a stochastic algorithm with high probability second-order convergence and corrected some typos.
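
    A rough sketch of the "competing" idea, under the assumption that the two candidate steps are compared through a simple second-order model; the gain formulas and step sizes below are placeholders, not the paper's precise rule:

```python
import numpy as np

def ncg_step(x, g, lam, v, eta):
    """Illustrative NCG-style rule: take the gradient step or the noisy
    negative-curvature step, whichever a crude second-order model predicts
    will decrease the objective more.
    g: stochastic gradient at x; (lam, v): approximate smallest eigenpair."""
    grad_gain = 0.5 * eta * np.dot(g, g)                  # modeled decrease of a gradient step
    curv_gain = abs(lam) ** 3 / 6.0 if lam < 0 else 0.0   # modeled decrease of a curvature step
    if curv_gain > grad_gain:
        if np.dot(g, v) > 0:        # orient v as a descent direction
            v = -v
        return x + abs(lam) * v     # step length proportional to the curvature
    return x - eta * g
```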

    Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions

    Error bound conditions (EBC) are properties that characterize the growth of an objective function as a point moves away from the optimal set. They have recently received increasing attention in the field of optimization for developing algorithms with fast convergence. However, studies of EBC in statistical learning are still limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous and smooth convex random functions. Second, we establish fast and intermediate rates of an efficient stochastic approximation (SA) algorithm for risk minimization with Lipschitz continuous random functions, which requires only one pass over $n$ samples and adapts to the EBC. For both approaches, the convergence rates span a full spectrum between $\widetilde O(1/\sqrt{n})$ and $\widetilde O(1/n)$ depending on the power constant in the EBC, and could be even faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without requiring any knowledge of the EBC. Overall, this work not only strengthens the understanding of ERM for statistical learning but also brings new fast stochastic algorithms for solving a broad range of statistical learning problems.
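
    For concreteness, one common way to write an error bound condition (the exact parameterization used in the paper may differ) is:

```latex
% A representative error bound condition (EBC): for some c > 0 and a
% power constant \theta \in (0, 1], every w in the domain satisfies
\[
  \mathrm{dist}\bigl(w, \mathcal{W}_*\bigr) \;\le\; c\,\bigl(F(w) - F_*\bigr)^{\theta},
\]
% where \mathcal{W}_* is the optimal set and F_* is the minimal risk.
% A larger \theta means F grows faster away from \mathcal{W}_*, which
% is what drives rates across the spectrum between \widetilde O(1/\sqrt{n})
% and \widetilde O(1/n) mentioned in the abstract.
```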

    Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

    In this paper, we propose a new accelerated stochastic first-order method called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed noise in the stochastic gradients, and we derive the first high-probability complexity bounds for this method, closing a gap in the theory of stochastic optimization with heavy-tailed noise. Our method is based on a special variant of accelerated Stochastic Gradient Descent (SGD) and clipping of stochastic gradients. We extend our method to the strongly convex case and prove new complexity bounds that outperform state-of-the-art results in this case. Finally, we extend our proof technique and derive the first non-trivial high-probability complexity bounds for SGD with clipping without a light-tails assumption on the noise.
    Comment: 71 pages, 14 figures.
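
    The core primitive is norm clipping of the stochastic gradient. A minimal sketch, shown here inside a plain SGD loop rather than the accelerated scheme of the paper (the clipping level `lam` in clipped-SSTM follows a tuned schedule not reproduced here):

```python
import numpy as np

def clip(g, lam):
    """Scale g so its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(x0, stoch_grad, steps, eta, lam):
    """SGD with gradient clipping -- a simplification of clipped-SSTM,
    which additionally uses Nesterov-style acceleration."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * clip(stoch_grad(x), lam)  # clipped stochastic step
    return x
```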

    Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions

    We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding an $O(d/T)$ convergence rate for strongly convex objectives in $d$ dimensions and an $O(\sqrt{(s \log d)/T})$ convergence rate when the optimum is $s$-sparse. Our algorithm is based on successively solving a series of $\ell_1$-regularized optimization problems using Nesterov's dual averaging algorithm. We establish that the error of our solution after $T$ iterations is at most $O((s \log d)/T)$, with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge, and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to multiplicative constant factors. The effectiveness of our approach is also confirmed in numerical simulations, in which we compare to several baselines on a least-squares regression problem.
    Comment: 2 figures.
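
    A minimal sketch of one stage of $\ell_1$-regularized dual averaging, the inner solver the abstract describes; the paper wraps such a solver in a multi-stage scheme with changing regularization, which is not reproduced here, and the schedule `beta = gamma * sqrt(t)` is an assumption:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal map of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_dual_averaging(x0, stoch_grad, steps, lam, gamma):
    """One stage of Nesterov-style dual averaging with an l1 regularizer:
    average all past stochastic gradients, then solve the regularized
    proximal subproblem in closed form via soft-thresholding."""
    x, g_sum = x0.copy(), np.zeros_like(x0)
    for t in range(1, steps + 1):
        g_sum += stoch_grad(x)            # running sum of stochastic gradients
        beta = gamma * np.sqrt(t)         # weight of the proximal term
        x = soft_threshold(-g_sum / beta, lam * t / beta)
    return x
```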

    Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition

    In this paper, a new theory is developed for first-order stochastic convex optimization, showing that the global convergence rate is sufficiently quantified by a local growth rate of the objective function in a neighborhood of the optimal solutions. In particular, if the objective function $F(\mathbf w)$ in the $\epsilon$-sublevel set grows as fast as $\|\mathbf w - \mathbf w_*\|_2^{1/\theta}$, where $\mathbf w_*$ represents the closest optimal solution to $\mathbf w$ and $\theta\in(0,1]$ quantifies the local growth rate, then the iteration complexity of first-order stochastic optimization for achieving an $\epsilon$-optimal solution can be $\widetilde O(1/\epsilon^{2(1-\theta)})$, which is optimal at most up to a logarithmic factor. To achieve this faster global convergence, we develop two different accelerated stochastic subgradient methods that iteratively solve the original problem approximately in a local region around a historical solution, with the size of the local region gradually decreasing as the solution approaches the optimal set. Besides the theoretical improvements, this work also includes new contributions towards making the proposed algorithms practical: (i) we present practical variants of the accelerated stochastic subgradient methods that can run without knowledge of the multiplicative growth constant or even the growth rate $\theta$; (ii) we consider a broad family of problems in machine learning to demonstrate that the proposed algorithms enjoy faster convergence than the traditional stochastic subgradient method. We also characterize the complexity of the proposed algorithms for ensuring the gradient is small without the smoothness assumption.
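
    An illustrative skeleton of the restart-and-shrink pattern described above: stochastic subgradient steps confined to a ball around the last stage's solution, with the ball radius and step size decreasing between stages. All constants and the halving schedule are placeholders, not the paper's calibrated choices:

```python
import numpy as np

def restarted_subgradient(x0, stoch_subgrad, stages, steps_per_stage, eta0, r0):
    """Restarted stochastic subgradient method with shrinking local regions."""
    x, eta, r = x0.copy(), eta0, r0
    for _ in range(stages):
        center = x.copy()                     # historical solution anchoring this stage
        for _ in range(steps_per_stage):
            x = x - eta * stoch_subgrad(x)
            d = x - center                    # project back onto the ball of radius r
            n = np.linalg.norm(d)
            if n > r:
                x = center + (r / n) * d
        eta, r = eta / 2.0, r / 2.0           # shrink step size and region each stage
    return x
```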

    Stochastic Primal-Dual Algorithms with Faster Convergence than $O(1/\sqrt{T})$ for Problems without Bilinear Structure

    Previous studies on stochastic primal-dual algorithms for solving min-max problems with faster convergence rely heavily on the bilinear structure of the problem, which restricts their applicability to a narrow range of problems. The main contribution of this paper is the design and analysis of new stochastic primal-dual algorithms that use a mixture of stochastic gradient updates and a logarithmic number of deterministic dual updates to solve a family of convex-concave problems with no bilinear structure assumed. Faster convergence rates than $O(1/\sqrt{T})$, with $T$ being the number of stochastic gradient updates, are established under some mild conditions on the involved functions of the primal and dual variables. For example, for a family of problems that enjoys weak strong convexity in terms of the primal variable and has a strongly concave function of the dual variable, the convergence rate of the proposed algorithm is $O(1/T)$. We also investigate the effectiveness of the proposed algorithms for learning robust models and for empirical AUC maximization.
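
    A rough sketch of the update pattern the abstract describes: many cheap stochastic primal steps interleaved with a logarithmic number of deterministic dual updates. The stage structure, the halving schedule, and the oracle `exact_max_y` (standing in for the deterministic dual update) are assumptions for illustration only:

```python
def stochastic_primal_dual(x0, y0, stoch_grad_x, exact_max_y, stages, steps, eta):
    """Interleave O(log T) deterministic dual updates with T stochastic
    primal gradient steps on the min-max objective."""
    x, y = list(x0), list(y0)
    for _ in range(stages):                  # O(log T) outer stages
        y = exact_max_y(x)                   # deterministic dual update
        for _ in range(steps):               # cheap stochastic primal updates
            g = stoch_grad_x(x, y)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
        eta /= 2.0                           # decay the primal step size per stage
    return x, y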

    MixedGrad: An $O(1/T)$ Convergence Rate Algorithm for Stochastic Smooth Optimization

    It is well known that the optimal convergence rate for stochastic optimization of smooth functions is $O(1/\sqrt{T})$, which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of $O(1/T^2)$. In this work, we consider a new setup for optimizing smooth functions, termed {\bf Mixed Optimization}, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by allowing an additional small number of accesses to the full gradient oracle. We show that, with $O(\ln T)$ calls to the full gradient oracle and $O(T)$ calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of $O(1/T)$.
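
    One way to realize this oracle budget is an SVRG-style variance-reduction loop, sketched below under the assumption of a finite sum of n components: one full-gradient call anchors each of O(log T) stages while the inner loop uses only the stochastic oracle. This mirrors the mixed-oracle pattern, not the exact MixedGrad algorithm:

```python
import numpy as np

def mixed_optimization(x0, full_grad, sample_grad, n, stages, steps, eta, rng=None):
    """Mixed-oracle loop: O(stages) full-gradient calls, O(stages * steps)
    stochastic calls, with variance-reduced inner updates."""
    rng = rng or np.random.default_rng()
    x = x0.copy()
    for _ in range(stages):
        anchor, g_full = x.copy(), full_grad(x)       # one expensive full gradient
        for _ in range(steps):
            i = rng.integers(n)                       # draw one sample index
            g = sample_grad(x, i) - sample_grad(anchor, i) + g_full
            x = x - eta * g                           # variance-reduced step
    return x
```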

    Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the $O(1/T)$ Convergence Rate

    Stochastic approximation (SA) is a classical approach for stochastic convex optimization. Previous studies have demonstrated that the convergence rate of SA can be improved by introducing either a smoothness or a strong convexity condition. In this paper, we make use of smoothness and strong convexity simultaneously to boost the convergence rate. Let $\lambda$ be the modulus of strong convexity, $\kappa$ be the condition number, $F_*$ be the minimal risk, and $\alpha>1$ be some small constant. First, we demonstrate that, in expectation, an $O(1/[\lambda T^\alpha] + \kappa F_*/T)$ risk bound is attainable when $T = \Omega(\kappa^\alpha)$. Thus, when $F_*$ is small, the convergence rate can be faster than $O(1/[\lambda T])$ and approaches $O(1/[\lambda T^\alpha])$ in the ideal case. Second, to further benefit from a small risk, we show that, in expectation, an $O(1/2^{T/\kappa}+F_*)$ risk bound is achievable. Thus, the excess risk reduces exponentially until reaching $O(F_*)$, and if $F_*=0$, we obtain global linear convergence. Finally, we emphasize that our proof is constructive and that each risk bound is equipped with an efficient stochastic algorithm attaining that bound.

    Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity

    We develop model-based methods for solving stochastic convex optimization problems, introducing the approximate-proximal point, or aProx, family, which includes stochastic subgradient, proximal point, and bundle methods. When the modeling approaches we propose are appropriately accurate, the methods enjoy stronger convergence and robustness guarantees than classical approaches, even though the model-based methods typically add little to no computational overhead over stochastic subgradient methods. For example, we show that improved models converge with probability 1 and enjoy optimal asymptotic normality results under weak assumptions; these methods are also adaptive to a natural class of what we term easy optimization problems, achieving linear convergence under appropriate strong growth conditions on the objective. Our substantial experimental investigation shows the advantages of more accurate modeling over standard subgradient methods across many smooth and non-smooth optimization problems.
    Comment: To appear in SIAM Journal on Optimization.
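
    A minimal sketch of one member of such a model-based family: for a nonnegative loss, minimizing the linear model of the loss truncated at zero, plus a proximal term, yields the closed-form step below. This illustrates the truncated-model idea under those assumptions, not the full aProx framework:

```python
import numpy as np

def truncated_model_step(x, loss, grad, alpha):
    """One truncated-model proximal step for a nonnegative stochastic loss:
    the truncation at zero caps the step length, making the update more
    robust to step size than a plain subgradient step."""
    f, g = loss(x), grad(x)            # stochastic loss value and (sub)gradient at x
    gnorm2 = np.dot(g, g)
    if gnorm2 == 0.0:
        return x                       # this sample's model is already minimized
    step = min(alpha, f / gnorm2)      # truncation caps the nominal step alpha
    return x - step * g
```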