
    Non-convex Finite-Sum Optimization Via SCSG Methods

    We develop a class of algorithms, as variants of the stochastically controlled stochastic gradient (SCSG) methods (Lei and Jordan, 2016), for the smooth non-convex finite-sum optimization problem. Assuming the smoothness of each component, the complexity of SCSG to reach a stationary point with $\mathbb{E}\|\nabla f(x)\|^{2}\le \epsilon$ is $O\left(\min\{\epsilon^{-5/3}, \epsilon^{-1}n^{2/3}\}\right)$, which strictly outperforms stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art methods based on variance reduction, and it significantly outperforms them when the target accuracy is low. A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layer neural networks in terms of both training and validation loss. Comment: Add Lemma B.
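
    A minimal sketch of the SCSG update pattern described above, assuming a NumPy setting where a user-supplied helper `grads(x, idx)` returns the average gradient of the components indexed by `idx`; the batch sizes, step size, and geometric epoch length below are illustrative placeholders, not the paper's tuned choices.

```python
import numpy as np

def scsg(grads, x0, n, n_epochs=50, batch=256, mini_batch=8,
         step=0.05, rng=np.random.default_rng(0)):
    """Sketch of stochastically controlled stochastic gradient (SCSG).

    grads(x, idx) -> average gradient of components f_i, i in idx (assumed helper).
    """
    x = x0.copy()
    for _ in range(n_epochs):
        # Outer step: gradient estimate on a random batch (not the full sum).
        big = rng.choice(n, size=min(batch, n), replace=False)
        mu = grads(x, big)
        x_ref = x.copy()
        # Inner-loop length drawn geometrically, in the spirit of SCSG.
        T = rng.geometric(mini_batch / (mini_batch + batch))
        for _ in range(T):
            idx = rng.choice(n, size=mini_batch, replace=False)
            # Variance-reduced (semi-stochastic) gradient.
            v = grads(x, idx) - grads(x_ref, idx) + mu
            x = x - step * v
    return x
```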

    Stochastically Controlled Stochastic Gradient for the Convex and Non-convex Composition problem

    In this paper, we consider the convex and non-convex composition problem with the structure $\frac{1}{n}\sum_{i=1}^{n} F_i(G(x))$, where $G(x)=\frac{1}{n}\sum_{j=1}^{n} G_j(x)$ is the inner function and $F_i(\cdot)$ is the outer function. We explore variance-reduction-based methods to solve this composition optimization problem. Because directly estimating the inner and outer functions is impractical when their numbers are large, we apply the stochastically controlled stochastic gradient (SCSG) method to estimate the gradient of the composition function and the value of the inner function. The query complexity of our proposed method for the convex and non-convex problems is equal to or better than that of current methods for the composition problem. Furthermore, we also present a mini-batch version of the proposed method, which improves the query complexity with respect to the size of the mini-batch.
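
    The quantity being estimated here is the chain-rule gradient $\nabla f(x) = \partial G(x)^{\top}\,\frac{1}{n}\sum_i \nabla F_i(G(x))$, with both pieces replaced by sampled estimates. Below is a hedged sketch of one such batch estimator, assuming user-supplied helpers `G_batch`, `JG_batch` (batch Jacobian of the inner functions), and `dF_batch`; the sampling scheme is illustrative and not the paper's exact procedure.

```python
import numpy as np

def composition_grad_estimate(x, G_batch, JG_batch, dF_batch, n_inner, n_outer,
                              b_inner=64, b_outer=64, rng=np.random.default_rng(0)):
    """Estimate the gradient of f(x) = (1/n) sum_i F_i(G(x)) from sampled batches.

    G_batch(x, J)  -> average of G_j(x) over j in J            (inner value estimate)
    JG_batch(x, J) -> average Jacobian of G_j at x over j in J (inner Jacobian estimate)
    dF_batch(g, I) -> average of grad F_i evaluated at g over i in I (outer gradient)
    All three are assumed helpers supplied by the user.
    """
    J = rng.choice(n_inner, size=min(b_inner, n_inner), replace=False)
    I = rng.choice(n_outer, size=min(b_outer, n_outer), replace=False)
    g_hat = G_batch(x, J)                  # estimate of the inner value G(x)
    jac_hat = JG_batch(x, J)               # estimate of the inner Jacobian dG(x)
    return jac_hat.T @ dF_batch(g_hat, I)  # chain rule with both pieces estimated
```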

    Stochastic Nested Variance Reduction for Nonconvex Optimization

    We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with the conventional stochastic variance reduced gradient (SVRG) algorithm, which uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, the proposed algorithm converges to an $\epsilon$-approximate first-order stationary point (i.e., $\|\nabla F(\mathbf{x})\|_2\leq \epsilon$) within $\tilde{O}(n\land \epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$ stochastic gradient evaluations. This improves the best known gradient complexity of SVRG, $O(n+n^{2/3}\epsilon^{-2})$, and that of SCSG, $O(n\land \epsilon^{-2}+\epsilon^{-10/3}\land n^{2/3}\epsilon^{-2})$. For gradient dominated functions, our algorithm also achieves a better gradient complexity than the state-of-the-art algorithms. Comment: 28 pages, 2 figures, 1 table
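
    For reference, the two-reference-point SVRG estimator that this nested construction generalizes looks roughly like the sketch below (helper names and hyperparameters are illustrative); SNVRG maintains $K+1$ such reference points at different scales rather than the single snapshot shown here.

```python
import numpy as np

def svrg_epoch(grads, full_grad, x, n, inner_steps=100, step=0.05,
               rng=np.random.default_rng(0)):
    """One SVRG epoch: a full-gradient snapshot plus corrected stochastic steps.

    grads(x, i)  -> gradient of component f_i at x (assumed helper)
    full_grad(x) -> gradient of the full average objective (assumed helper)
    """
    x_ref = x.copy()       # reference point (snapshot)
    mu = full_grad(x_ref)  # reference gradient at the snapshot
    for _ in range(inner_steps):
        i = rng.integers(n)
        # Semi-stochastic gradient: unbiased, with variance shrinking as x -> x_ref.
        v = grads(x, i) - grads(x_ref, i) + mu
        x = x - step * v
    return x
```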

    Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

    We propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\tilde{O}(\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\mathbf{x}$, which satisfies $\|\nabla f(\mathbf{x})\|_2\leq\epsilon$ and $\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq -\sqrt{\epsilon}$ in the general stochastic optimization setting, where $\tilde{O}(\cdot)$ hides polylogarithmic terms and constants. This improves upon the $\tilde{O}(\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\tilde{O}(\epsilon^{-1/6})$. For nonconvex finite-sum optimization, our algorithm also outperforms the best known algorithms in a certain regime. Comment: 25 pages

    On the Adaptivity of Stochastic Gradient-Based Optimization

    Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at the cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we refer to as geometrization, that sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while its tuning parameters depend only on the smoothness parameter of the objective. Comment: Accepted by SIAM Journal on Optimization; 54 pages
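
    A minimal illustration of the "geometrization" device described above, assuming the epoch length is drawn from a geometric distribution whose mean tracks the current batch size; the constants and the batch-growth rule are placeholders, not the paper's schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometrized_epoch_length(batch_size, mini_batch=1):
    """Draw the number of inner steps as a geometric random variable.

    The success probability is chosen so the expected epoch length is
    roughly batch_size / mini_batch (an illustrative choice).
    """
    p = mini_batch / (mini_batch + batch_size)
    return rng.geometric(p)

# Example: adaptive batch sizes growing across epochs (illustrative schedule).
for epoch in range(5):
    B = min(2 ** (epoch + 4), 10_000)  # placeholder growth rule
    T = geometrized_epoch_length(B)
    print(f"epoch {epoch}: batch {B}, geometric inner length {T}")
```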

    On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

    The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that a naive application of the SVRG technique and related approaches fails, and we explore why.

    Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently

    We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity. At the core of our algorithms is the division of the entire domain of the objective function into small and large gradient regions: our algorithms only perform a gradient descent based procedure in the large gradient region, and only perform negative curvature descent in the small gradient region. Our novel analysis shows that the proposed algorithms can escape the small gradient region in only one negative curvature descent step whenever they enter it, and thus they only need to perform at most $N_{\epsilon}$ negative curvature direction computations, where $N_{\epsilon}$ is the number of times the algorithms enter small gradient regions. For both deterministic and stochastic settings, we show that the proposed algorithms can potentially beat the state-of-the-art local minima finding algorithms. For the finite-sum setting, our algorithm can also outperform the best algorithm in a certain regime. Comment: 31 pages, 1 table
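
    A hedged sketch of the dispatch rule described above: gradient steps while the gradient is large, a single negative-curvature step when it is small. The threshold, step sizes, and the `neg_curvature_direction` helper are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def local_min_search(grad, neg_curvature_direction, x, eps=1e-3,
                     step=0.1, nc_step=0.5, max_iter=10_000):
    """Alternate gradient descent (large-gradient region) with single
    negative-curvature steps (small-gradient region).

    grad(x)                    -> gradient at x (assumed helper)
    neg_curvature_direction(x) -> unit direction v with v^T H(x) v < 0,
                                  or None if no such direction is found (assumed helper)
    """
    nc_calls = 0  # counts N_eps, the visits to the small-gradient region
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps:
            x = x - step * g                # large-gradient region: gradient step
        else:
            v = neg_curvature_direction(x)  # small-gradient region
            nc_calls += 1
            if v is None:
                return x, nc_calls          # approximate local minimum
            # One negative-curvature step suffices to leave the region, per the paper.
            s = np.sign(v @ g) if (v @ g) != 0 else 1.0
            x = x - nc_step * s * v         # move along the descent side of v
    return x, nc_calls
```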

    Neon2: Finding Local Minima via First-Order Oracles

    We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works in both the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into algorithms finding approximate local minima, outperforming some of the best known results. Comment: versions 2 and 3 improve writing
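
    The standard way such a reduction stays first-order is by approximating Hessian-vector products with finite differences of gradients. A minimal sketch of that substitution, assuming a gradient oracle `grad`; this illustrates the idea only, not the full Neon2 procedure.

```python
import numpy as np

def hvp_from_gradients(grad, x, v, q=1e-5):
    """Approximate the Hessian-vector product using only gradient calls:
    H(x) v ~= (grad(x + q v) - grad(x)) / q  for small q.
    """
    return (grad(x + q * v) - grad(x)) / q

# Illustrative check on a quadratic f(x) = 0.5 x^T A x, where the Hessian is A exactly.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x, v = np.ones(2), np.array([1.0, -1.0])
print(hvp_from_gradients(grad, x, v), A @ v)  # the two outputs should nearly match
```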

    Inexact SARAH Algorithm for Stochastic Optimization

    We develop and analyze a variant of the SARAH algorithm that does not require computation of the exact gradient. Thus this new method can be applied to general expectation minimization problems rather than only finite-sum problems. While the original SARAH algorithm, as well as its predecessor SVRG, requires an exact gradient computation on each outer iteration, the inexact variant of SARAH (iSARAH), which we develop here, requires only a stochastic gradient computed on a mini-batch of sufficient size. The proposed method combines variance reduction via sample size selection with iterative stochastic gradient updates. We analyze the convergence rate of the algorithm for strongly convex and non-strongly convex cases, under a smoothness assumption, with an appropriate mini-batch size selected for each case. We show that, with an additional reasonable assumption, iSARAH achieves the best known complexity among stochastic methods in the case of non-strongly convex stochastic functions. Comment: Optimization Methods and Software
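
    A hedged sketch of the recursive SARAH-style estimator with an inexact (mini-batch) outer gradient, as described above; the batch sizes, step size, and the `grads` helper are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def isarah_epoch(grads, x, n, outer_batch=1024, inner_steps=100, step=0.05,
                 rng=np.random.default_rng(0)):
    """One outer iteration of an inexact-SARAH-style method.

    grads(x, idx) -> average gradient of the sampled components idx (assumed helper).
    Unlike SARAH/SVRG, the outer gradient is a large mini-batch estimate,
    not the exact full gradient.
    """
    idx = rng.choice(n, size=min(outer_batch, n), replace=False)
    v = grads(x, idx)            # inexact outer gradient
    x_prev, x = x, x - step * v
    for _ in range(inner_steps):
        i = rng.choice(n, size=1)
        # SARAH recursion: biased but variance-reduced gradient estimator.
        v = grads(x, i) - grads(x_prev, i) + v
        x_prev, x = x, x - step * v
    return x
```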

    Finding Local Minima via Stochastic Nested Variance Reduction

    We propose two algorithms that can find local minima faster than the state-of-the-art algorithms in both finite-sum and general stochastic nonconvex optimization. At the core of the proposed algorithms is $\text{One-epoch-SNVRG}^+$, which uses stochastic nested variance reduction (Zhou et al., 2018a) and outperforms state-of-the-art variance reduction algorithms such as SCSG (Lei et al., 2017). In particular, for finite-sum optimization problems, the proposed $\text{SNVRG}^{+}+\text{Neon2}^{\text{finite}}$ algorithm achieves $\tilde{O}(n^{1/2}\epsilon^{-2}+n\epsilon_H^{-3}+n^{3/4}\epsilon_H^{-7/2})$ gradient complexity to converge to an $(\epsilon, \epsilon_H)$-second-order stationary point, which outperforms $\text{SVRG}+\text{Neon2}^{\text{finite}}$ (Allen-Zhu and Li, 2017), the best existing algorithm, in a wide regime. For general stochastic optimization problems, the proposed $\text{SNVRG}^{+}+\text{Neon2}^{\text{online}}$ achieves $\tilde{O}(\epsilon^{-3}+\epsilon_H^{-5}+\epsilon^{-2}\epsilon_H^{-3})$ gradient complexity, which is better than both $\text{SVRG}+\text{Neon2}^{\text{online}}$ (Allen-Zhu and Li, 2017) and Natasha2 (Allen-Zhu, 2017) in certain regimes. Furthermore, we explore the acceleration brought by third-order smoothness of the objective function. Comment: 37 pages, 4 figures, 1 table