
    Non-convex Finite-Sum Optimization Via SCSG Methods

    We develop a class of algorithms, as variants of the stochastically controlled stochastic gradient (SCSG) methods (Lei and Jordan, 2016), for the smooth non-convex finite-sum optimization problem. Assuming the smoothness of each component, the complexity of SCSG to reach a stationary point with $\mathbb{E}\|\nabla f(x)\|^{2}\le \epsilon$ is $O\left(\min\{\epsilon^{-5/3}, \epsilon^{-1}n^{2/3}\}\right)$, which strictly outperforms stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art methods based on variance reduction, and it significantly outperforms them when the target accuracy is low. A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layer neural networks in terms of both training and validation loss. Comment: Add Lemma B.
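
    A minimal sketch of the SCSG update pattern described above, assuming a NumPy setting where a user-supplied helper `grads(x, idx)` returns the average gradient of the components indexed by `idx`; the batch sizes, step size, and geometric epoch length below are illustrative placeholders, not the paper's tuned choices.

```python
import numpy as np

def scsg(grads, x0, n, n_epochs=50, batch=256, mini_batch=8,
         step=0.05, rng=np.random.default_rng(0)):
    """Sketch of stochastically controlled stochastic gradient (SCSG).

    grads(x, idx) -> average gradient of components f_i, i in idx (assumed helper).
    """
    x = x0.copy()
    for _ in range(n_epochs):
        # Outer step: gradient estimate on a random batch (not the full sum).
        big = rng.choice(n, size=min(batch, n), replace=False)
        mu = grads(x, big)
        x_ref = x.copy()
        # Inner-loop length drawn geometrically, in the spirit of SCSG.
        T = rng.geometric(mini_batch / (mini_batch + batch))
        for _ in range(T):
            idx = rng.choice(n, size=mini_batch, replace=False)
            # Variance-reduced (semi-stochastic) gradient.
            v = grads(x, idx) - grads(x_ref, idx) + mu
            x = x - step * v
    return x
```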

    Stochastically Controlled Stochastic Gradient for the Convex and Non-convex Composition problem

    In this paper, we consider the convex and non-convex composition problem with the structure $\frac{1}{n}\sum_{i=1}^{n} F_i(G(x))$, where $G(x)=\frac{1}{n}\sum_{j=1}^{n} G_j(x)$ is the inner function and $F_i(\cdot)$ is the outer function. We explore variance-reduction-based methods to solve this composition optimization problem. Because directly estimating the inner and outer functions is impractical when their numbers are large, we apply the stochastically controlled stochastic gradient (SCSG) method to estimate the gradient of the composition function and the value of the inner function. The query complexity of our proposed method for the convex and non-convex problems is equal to or better than that of current methods for the composition problem. Furthermore, we also present a mini-batch version of the proposed method, which improves the query complexity with respect to the size of the mini-batch.
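
    The quantity being estimated here is the chain-rule gradient $\nabla f(x) = \partial G(x)^{\top}\,\frac{1}{n}\sum_i \nabla F_i(G(x))$, with both pieces replaced by sampled estimates. Below is a hedged sketch of one such batch estimator, assuming user-supplied helpers `G_batch`, `JG_batch` (batch Jacobian of the inner functions), and `dF_batch`; the sampling scheme is illustrative and not the paper's exact procedure.

```python
import numpy as np

def composition_grad_estimate(x, G_batch, JG_batch, dF_batch, n_inner, n_outer,
                              b_inner=64, b_outer=64, rng=np.random.default_rng(0)):
    """Estimate the gradient of f(x) = (1/n) sum_i F_i(G(x)) from sampled batches.

    G_batch(x, J)  -> average of G_j(x) over j in J            (inner value estimate)
    JG_batch(x, J) -> average Jacobian of G_j at x over j in J (inner Jacobian estimate)
    dF_batch(g, I) -> average of grad F_i evaluated at g over i in I (outer gradient)
    All three are assumed helpers supplied by the user.
    """
    J = rng.choice(n_inner, size=min(b_inner, n_inner), replace=False)
    I = rng.choice(n_outer, size=min(b_outer, n_outer), replace=False)
    g_hat = G_batch(x, J)                  # estimate of the inner value G(x)
    jac_hat = JG_batch(x, J)               # estimate of the inner Jacobian dG(x)
    return jac_hat.T @ dF_batch(g_hat, I)  # chain rule with both pieces estimated
```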

    Stochastic Nested Variance Reduction for Nonconvex Optimization

    We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with the conventional stochastic variance reduced gradient (SVRG) algorithm, which uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, the proposed algorithm converges to an $\epsilon$-approximate first-order stationary point (i.e., $\|\nabla F(\mathbf{x})\|_2\leq \epsilon$) within $\tilde{O}(n\land \epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$ stochastic gradient evaluations. This improves the best known gradient complexity of SVRG, $O(n+n^{2/3}\epsilon^{-2})$, and that of SCSG, $O(n\land \epsilon^{-2}+\epsilon^{-10/3}\land n^{2/3}\epsilon^{-2})$. For gradient dominated functions, our algorithm also achieves a better gradient complexity than the state-of-the-art algorithms. Comment: 28 pages, 2 figures, 1 table
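
    For reference, the two-reference-point SVRG estimator that this nested construction generalizes looks roughly like the sketch below (helper names and hyperparameters are illustrative); SNVRG maintains $K+1$ such reference points at different scales rather than the single snapshot shown here.

```python
import numpy as np

def svrg_epoch(grads, full_grad, x, n, inner_steps=100, step=0.05,
               rng=np.random.default_rng(0)):
    """One SVRG epoch: a full-gradient snapshot plus corrected stochastic steps.

    grads(x, i)  -> gradient of component f_i at x (assumed helper)
    full_grad(x) -> gradient of the full average objective (assumed helper)
    """
    x_ref = x.copy()       # reference point (snapshot)
    mu = full_grad(x_ref)  # reference gradient at the snapshot
    for _ in range(inner_steps):
        i = rng.integers(n)
        # Semi-stochastic gradient: unbiased, with variance shrinking as x -> x_ref.
        v = grads(x, i) - grads(x_ref, i) + mu
        x = x - step * v
    return x
```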

    Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

    We propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\tilde{O}(\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\mathbf{x}$, which satisfies $\|\nabla f(\mathbf{x})\|_2\leq\epsilon$ and $\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq -\sqrt{\epsilon}$ in the general stochastic optimization setting, where $\tilde{O}(\cdot)$ hides polylogarithmic terms and constants. This improves upon the $\tilde{O}(\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\tilde{O}(\epsilon^{-1/6})$. For nonconvex finite-sum optimization, our algorithm also outperforms the best known algorithms in a certain regime. Comment: 25 pages

    On the Adaptivity of Stochastic Gradient-Based Optimization

    Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at the cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we refer to as geometrization, that sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while its tuning parameters depend only on the smoothness parameter of the objective. Comment: Accepted by SIAM Journal on Optimization; 54 pages
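
    A minimal illustration of the "geometrization" device described above, assuming the epoch length is drawn from a geometric distribution whose mean tracks the current batch size; the constants and the batch-growth rule are placeholders, not the paper's schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometrized_epoch_length(batch_size, mini_batch=1):
    """Draw the number of inner steps as a geometric random variable.

    The success probability is chosen so the expected epoch length is
    roughly batch_size / mini_batch (an illustrative choice).
    """
    p = mini_batch / (mini_batch + batch_size)
    return rng.geometric(p)

# Example: adaptive batch sizes growing across epochs (illustrative schedule).
for epoch in range(5):
    B = min(2 ** (epoch + 4), 10_000)  # placeholder growth rule
    T = geometrized_epoch_length(B)
    print(f"epoch {epoch}: batch {B}, geometric inner length {T}")
```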

    On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

    The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that a naive application of the SVRG technique and related approaches fails, and we explore why.

    Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently

    We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity. At the core of our algorithms is the division of the entire domain of the objective function into small and large gradient regions: our algorithms only perform a gradient descent based procedure in the large gradient region, and only perform negative curvature descent in the small gradient region. Our novel analysis shows that the proposed algorithms can escape the small gradient region in only one negative curvature descent step whenever they enter it, and thus they only need to perform at most $N_{\epsilon}$ negative curvature direction computations, where $N_{\epsilon}$ is the number of times the algorithms enter small gradient regions. For both deterministic and stochastic settings, we show that the proposed algorithms can potentially beat the state-of-the-art local minima finding algorithms. For the finite-sum setting, our algorithm can also outperform the best algorithm in a certain regime. Comment: 31 pages, 1 table
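
    A hedged sketch of the dispatch rule described above: gradient steps while the gradient is large, a single negative-curvature step when it is small. The threshold, step sizes, and the `neg_curvature_direction` helper are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def local_min_search(grad, neg_curvature_direction, x, eps=1e-3,
                     step=0.1, nc_step=0.5, max_iter=10_000):
    """Alternate gradient descent (large-gradient region) with single
    negative-curvature steps (small-gradient region).

    grad(x)                    -> gradient at x (assumed helper)
    neg_curvature_direction(x) -> unit direction v with v^T H(x) v < 0,
                                  or None if no such direction is found (assumed helper)
    """
    nc_calls = 0  # counts N_eps, the visits to the small-gradient region
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps:
            x = x - step * g                # large-gradient region: gradient step
        else:
            v = neg_curvature_direction(x)  # small-gradient region
            nc_calls += 1
            if v is None:
                return x, nc_calls          # approximate local minimum
            # One negative-curvature step suffices to leave the region, per the paper.
            s = np.sign(v @ g) if (v @ g) != 0 else 1.0
            x = x - nc_step * s * v         # move along the descent side of v
    return x, nc_calls
```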

    Neon2: Finding Local Minima via First-Order Oracles

    We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works in both the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into algorithms finding approximate local minima, outperforming some of the best known results. Comment: versions 2 and 3 improve writing
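
    The standard way such a reduction stays first-order is by approximating Hessian-vector products with finite differences of gradients. A minimal sketch of that substitution, assuming a gradient oracle `grad`; this illustrates the idea only, not the full Neon2 procedure.

```python
import numpy as np

def hvp_from_gradients(grad, x, v, q=1e-5):
    """Approximate the Hessian-vector product using only gradient calls:
    H(x) v ~= (grad(x + q v) - grad(x)) / q  for small q.
    """
    return (grad(x + q * v) - grad(x)) / q

# Illustrative check on a quadratic f(x) = 0.5 x^T A x, where the Hessian is A exactly.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x, v = np.ones(2), np.array([1.0, -1.0])
print(hvp_from_gradients(grad, x, v), A @ v)  # the two outputs should nearly match
```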

    Inexact SARAH Algorithm for Stochastic Optimization

    We develop and analyze a variant of the SARAH algorithm that does not require computation of the exact gradient. Thus this new method can be applied to general expectation minimization problems rather than only finite-sum problems. While the original SARAH algorithm, as well as its predecessor SVRG, requires an exact gradient computation on each outer iteration, the inexact variant of SARAH (iSARAH), which we develop here, requires only a stochastic gradient computed on a mini-batch of sufficient size. The proposed method combines variance reduction via sample size selection with iterative stochastic gradient updates. We analyze the convergence rate of the algorithm for strongly convex and non-strongly convex cases, under a smoothness assumption, with an appropriate mini-batch size selected for each case. We show that, with an additional reasonable assumption, iSARAH achieves the best known complexity among stochastic methods in the case of non-strongly convex stochastic functions. Comment: Optimization Methods and Software
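
    A hedged sketch of the recursive SARAH-style estimator with an inexact (mini-batch) outer gradient, as described above; the batch sizes, step size, and the `grads` helper are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def isarah_epoch(grads, x, n, outer_batch=1024, inner_steps=100, step=0.05,
                 rng=np.random.default_rng(0)):
    """One outer iteration of an inexact-SARAH-style method.

    grads(x, idx) -> average gradient of the sampled components idx (assumed helper).
    Unlike SARAH/SVRG, the outer gradient is a large mini-batch estimate,
    not the exact full gradient.
    """
    idx = rng.choice(n, size=min(outer_batch, n), replace=False)
    v = grads(x, idx)            # inexact outer gradient
    x_prev, x = x, x - step * v
    for _ in range(inner_steps):
        i = rng.choice(n, size=1)
        # SARAH recursion: biased but variance-reduced gradient estimator.
        v = grads(x, i) - grads(x_prev, i) + v
        x_prev, x = x, x - step * v
    return x
```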

    Finding Local Minima via Stochastic Nested Variance Reduction

    We propose two algorithms that can find local minima faster than the state-of-the-art algorithms in both finite-sum and general stochastic nonconvex optimization. At the core of the proposed algorithms is $\text{One-epoch-SNVRG}^+$, which uses stochastic nested variance reduction (Zhou et al., 2018a) and outperforms state-of-the-art variance reduction algorithms such as SCSG (Lei et al., 2017). In particular, for finite-sum optimization problems, the proposed $\text{SNVRG}^{+}+\text{Neon2}^{\text{finite}}$ algorithm achieves $\tilde{O}(n^{1/2}\epsilon^{-2}+n\epsilon_H^{-3}+n^{3/4}\epsilon_H^{-7/2})$ gradient complexity to converge to an $(\epsilon, \epsilon_H)$-second-order stationary point, which outperforms $\text{SVRG}+\text{Neon2}^{\text{finite}}$ (Allen-Zhu and Li, 2017), the best existing algorithm, in a wide regime. For general stochastic optimization problems, the proposed $\text{SNVRG}^{+}+\text{Neon2}^{\text{online}}$ achieves $\tilde{O}(\epsilon^{-3}+\epsilon_H^{-5}+\epsilon^{-2}\epsilon_H^{-3})$ gradient complexity, which is better than both $\text{SVRG}+\text{Neon2}^{\text{online}}$ (Allen-Zhu and Li, 2017) and Natasha2 (Allen-Zhu, 2017) in certain regimes. Furthermore, we explore the acceleration brought by third-order smoothness of the objective function. Comment: 37 pages, 4 figures, 1 table