Non-convex Finite-Sum Optimization Via SCSG Methods
We develop a class of algorithms, as variants of the stochastically
controlled stochastic gradient (SCSG) methods (Lei and Jordan, 2016), for the
smooth non-convex finite-sum optimization problem. Assuming the smoothness of
each component, the complexity of SCSG to reach a stationary point with $\mathbb{E}\,\|\nabla f(x)\|^{2}\le \epsilon$ is $O\big(\min\{\epsilon^{-5/3},\, \epsilon^{-1}n^{2/3}\}\big)$, which strictly outperforms that of stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art
methods based on variance reduction and it significantly outperforms them when
the target accuracy is low. A similar acceleration is also achieved when the
functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments
demonstrate that SCSG outperforms stochastic gradient methods on training
multi-layer neural networks in terms of both training and validation loss.
Comment: Add Lemma B.
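The update rule itself is not spelled out in this listing. As a rough illustration of an SCSG-style epoch (variance reduction against a large-batch snapshot, with a geometrically distributed epoch length), here is a minimal Python sketch; the toy quadratic components, batch sizes, and step size are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

# A minimal SCSG-style epoch, sketched for a finite-sum objective
# f(x) = (1/n) * sum_i f_i(x).  `grad(i, x)` returns the gradient of a single
# component f_i at x.  The quadratic components, batch sizes, and step size
# below are illustrative assumptions, not the paper's settings.

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
grad = lambda i, x: A[i] * (A[i] @ x)    # gradient of f_i(x) = 0.5 * (A_i @ x)^2

def scsg_epoch(x, batch_size=100, mini_batch=10, step=0.01):
    # 1) Large-batch snapshot gradient at the reference point x_tilde.
    B = rng.choice(n, size=batch_size, replace=False)
    x_tilde = x.copy()
    g_B = np.mean([grad(i, x_tilde) for i in B], axis=0)

    # 2) Inner loop whose length is a geometric random variable
    #    (the "geometrization" used in SCSG's analysis).
    T = rng.geometric(p=mini_batch / (mini_batch + batch_size))
    for _ in range(T):
        I = rng.choice(n, size=mini_batch, replace=False)
        # Variance-reduced (semi-stochastic) gradient estimator.
        v = np.mean([grad(i, x) - grad(i, x_tilde) for i in I], axis=0) + g_B
        x = x - step * v
    return x

x = rng.normal(size=d)
for _ in range(20):
    x = scsg_epoch(x)
print("gradient norm:", np.linalg.norm(np.mean([grad(i, x) for i in range(n)], axis=0)))
```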
Stochastically Controlled Stochastic Gradient for the Convex and Non-convex Composition problem
In this paper, we consider the convex and non-convex composition problem with the structure $F(x) = f\big(g(x)\big)$, where the inner function $g$ and the outer function $f$ are each finite averages of component functions. We explore a variance-reduction-based method to solve this composition optimization problem. Because it is impractical to evaluate the inner and outer functions exactly when their numbers of components are large, we apply the stochastically controlled stochastic gradient (SCSG) method to estimate the gradient of the composition function and the value of the inner function. The query complexity of our proposed method for the convex and non-convex problems is equal to or better than that of existing methods for the composition problem. Furthermore, we also present a mini-batch version of the proposed method, whose query complexity improves with the size of the mini-batch.
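As an illustration of the mechanism described above (sampling inner components to estimate both the value of the inner function and its Jacobian, and outer components to estimate the outer gradient), here is a minimal sketch; the linear and quadratic toy components and the batch sizes are assumptions made only for the example.

```python
import numpy as np

# Illustrative sketch of estimating the gradient of a composition
# F(x) = f(g(x)) with inner function g(x) = (1/m) * sum_j g_j(x) and outer
# function f = (1/n) * sum_i f_i, by sampling inner components (to estimate
# both g(x) and its Jacobian) and outer components.  The toy components and
# batch sizes are assumptions for the example only.

rng = np.random.default_rng(1)
m, n, d, p = 500, 400, 8, 5
W = rng.normal(size=(m, p, d))        # g_j(x) = W_j @ x      (inner maps R^d -> R^p)
C = rng.normal(size=(n, p))           # f_i(y) = 0.5 * (C_i @ y)^2

def composite_grad_estimate(x, inner_batch=50, outer_batch=50):
    J = rng.choice(m, size=inner_batch, replace=False)
    I = rng.choice(n, size=outer_batch, replace=False)
    g_hat = np.mean([W[j] @ x for j in J], axis=0)      # estimate of g(x), shape (p,)
    jac_hat = np.mean(W[J], axis=0)                      # estimate of the Jacobian of g, shape (p, d)
    outer_grad = np.mean([(C[i] @ g_hat) * C[i] for i in I], axis=0)  # estimate of grad f at g_hat
    return jac_hat.T @ outer_grad                        # chain rule: Jg(x)^T grad f(g(x))

x = rng.normal(size=d)
print(composite_grad_estimate(x).shape)   # (8,)
```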
Stochastic Nested Variance Reduction for Nonconvex Optimization
We study finite-sum nonconvex optimization problems, where the objective
function is an average of nonconvex functions. We propose a new stochastic
gradient descent algorithm based on nested variance reduction. Compared with
the conventional stochastic variance reduced gradient (SVRG) algorithm, which uses
two reference points to construct a semi-stochastic gradient with diminishing
variance in each iteration, our algorithm uses nested reference points to
build a semi-stochastic gradient to further reduce its variance in each
iteration. For smooth nonconvex functions, the proposed algorithm converges to
an $\epsilon$-approximate first-order stationary point (i.e., $\|\nabla F(\mathbf{x})\|_2\le\epsilon$) within $\tilde{O}\big(n\wedge\epsilon^{-2}+\epsilon^{-3}\wedge n^{1/2}\epsilon^{-2}\big)$ stochastic gradient evaluations. This improves the best known gradient complexity of SVRG, $O\big(n+n^{2/3}\epsilon^{-2}\big)$, and that of SCSG, $O\big(n\wedge\epsilon^{-2}+\epsilon^{-10/3}\wedge n^{2/3}\epsilon^{-2}\big)$. For gradient
dominated functions, our algorithm also achieves a better gradient complexity
than the state-of-the-art algorithms.
Comment: 28 pages, 2 figures, 1 table
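For concreteness, the following sketch shows a nested variance-reduced estimator of the kind alluded to above: a large-batch gradient at an outer reference point, corrected by mini-batch gradient differences between successively closer reference points. The components, the number of levels, and the per-level batch sizes are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Rough sketch of a nested variance-reduced gradient estimator: keep several
# reference points x_refs[0..K] that are refreshed at different frequencies,
# and build the estimator as a telescoping sum of mini-batch gradient
# differences between consecutive reference levels.  Everything here
# (components, batch sizes per level) is an illustrative assumption.

rng = np.random.default_rng(2)
n, d = 1000, 10
A = rng.normal(size=(n, d))
grad = lambda i, x: A[i] * (A[i] @ x)   # gradient of component f_i(x) = 0.5 * (A_i @ x)^2

def nested_estimator(x, x_refs, batch_sizes):
    # x_refs[0] is the outermost reference point; the estimator starts from
    # the (large-batch) gradient there and corrects it level by level.
    B0 = rng.choice(n, size=batch_sizes[0], replace=False)
    v = np.mean([grad(i, x_refs[0]) for i in B0], axis=0)
    points = x_refs[1:] + [x]            # innermost "reference" is the current iterate
    prev = x_refs[0]
    for level, point in enumerate(points, start=1):
        I = rng.choice(n, size=batch_sizes[level], replace=False)
        v += np.mean([grad(i, point) - grad(i, prev) for i in I], axis=0)
        prev = point
    return v

x = rng.normal(size=d)
x_refs = [x.copy(), x.copy()]            # two nested reference points (K = 2)
v = nested_estimator(x, x_refs, batch_sizes=[200, 50, 10])
print(v.shape)
```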
Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima
We propose stochastic optimization algorithms that can find local minima
faster than existing algorithms for nonconvex optimization problems, by
exploiting the third-order smoothness to escape non-degenerate saddle points
more efficiently. More specifically, the proposed algorithm only needs
$\tilde{O}\big(\epsilon^{-10/3}\big)$ stochastic gradient evaluations to converge to an approximate local minimum $\mathbf{x}$, which satisfies $\|\nabla f(\mathbf{x})\|_2\le\epsilon$ and $\lambda_{\min}\big(\nabla^2 f(\mathbf{x})\big)\ge -\sqrt{\epsilon}$, in the general stochastic optimization setting, where $\tilde{O}(\cdot)$ hides logarithmic polynomial terms and constants. This improves upon the $\tilde{O}\big(\epsilon^{-7/2}\big)$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\tilde{O}\big(\epsilon^{-1/6}\big)$. For nonconvex finite-sum optimization, our
algorithm also outperforms the best known algorithms in a certain regime.
Comment: 25 pages
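The local-minimum criterion used in this line of work (small gradient norm and a Hessian whose smallest eigenvalue is not too negative) can be checked directly on small problems. A minimal sketch follows; the quadratic test function and tolerance are assumptions for the example.

```python
import numpy as np

# Checking an (epsilon, sqrt(epsilon))-approximate local minimum condition:
# small gradient norm and smallest Hessian eigenvalue bounded below by
# -sqrt(epsilon).  The quadratic test function is an illustrative assumption.

def is_approx_local_min(grad_fn, hess_fn, x, eps):
    g = grad_fn(x)
    lam_min = np.linalg.eigvalsh(hess_fn(x)).min()
    return np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(eps)

H = np.diag([1.0, 2.0, -0.001])              # a mildly indefinite Hessian
grad_fn = lambda x: H @ x
hess_fn = lambda x: H
print(is_approx_local_min(grad_fn, hess_fn, np.zeros(3), eps=1e-2))   # True: -0.001 >= -0.1
```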
On the Adaptivity of Stochastic Gradient-Based Optimization
Stochastic-gradient-based optimization has been a core enabling methodology
in applications to large-scale problems in machine learning and related areas.
Despite the progress, the gap between theory and practice remains significant,
with theoreticians pursuing mathematical optimality at a cost of obtaining
specialized procedures in different regimes (e.g., modulus of strong convexity,
magnitude of target accuracy, signal-to-noise ratio), and with practitioners
not readily able to know which regime is appropriate to their problem, and
seeking broadly applicable algorithms that are reasonably close to optimality.
To bridge these perspectives it is necessary to study algorithms that are
adaptive to different regimes. We present the stochastically controlled
stochastic gradient (SCSG) method for composite convex finite-sum optimization
problems and show that SCSG is adaptive to both strong convexity and target
accuracy. The adaptivity is achieved by batch variance reduction with adaptive
batch sizes and a novel technique, which we refer to as geometrization, that sets the length of each epoch as a geometric random variable. The
algorithm achieves strictly better theoretical complexity than other existing
adaptive algorithms, while the tuning parameters of the algorithm only depend
on the smoothness parameter of the objective.
Comment: Accepted by SIAM Journal on Optimization; 54 pages
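A small sketch of the outer schedule this suggests is given below: batch sizes that grow across epochs, with each epoch length drawn as a geometric random variable whose mean is the ratio of the big batch to the mini-batch. The growth rate and the cap at n are illustrative assumptions, not the paper's schedule.

```python
import numpy as np

# Sketch of an outer schedule with adaptively growing batch sizes, where each
# epoch length is drawn as a geometric random variable whose mean is roughly
# the ratio of the big batch to the mini-batch ("geometrization").  The growth
# rate and cap are illustrative assumptions.

rng = np.random.default_rng(3)
n = 100_000            # number of components in the finite sum
mini_batch = 10

for j in range(1, 11):
    big_batch = min(n, int(mini_batch * 1.5 ** j))       # adaptively growing batch
    epoch_len = rng.geometric(p=mini_batch / (mini_batch + big_batch))
    print(f"epoch {j:2d}: big batch {big_batch:6d}, epoch length {epoch_len}")
```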
On the Ineffectiveness of Variance Reduced Optimization for Deep Learning
The application of stochastic variance reduction to optimization has shown
remarkable recent theoretical and practical success. The applicability of these
techniques to the hard non-convex optimization problems encountered during
training of modern deep neural networks is an open problem. We show that naive
application of the SVRG technique and related approaches fail, and explore why.
Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently
We propose a family of nonconvex optimization algorithms that are able to
save gradient and negative curvature computations to a large extent, and are
guaranteed to find an approximate local minimum with improved runtime
complexity. At the core of our algorithms is the division of the entire domain
of the objective function into small and large gradient regions: our algorithms
only perform a gradient descent based procedure in the large gradient region, and
only perform negative curvature descent in the small gradient region. Our novel
analysis shows that the proposed algorithms can escape the small gradient
region in only one negative curvature descent step whenever they enter it, and
thus they only need to perform at most $N_{\epsilon}$ negative curvature direction computations, where $N_{\epsilon}$ is the number of times the
algorithms enter small gradient regions. For both deterministic and stochastic
settings, we show that the proposed algorithms can potentially beat the
state-of-the-art local minima finding algorithms. For the finite-sum setting,
our algorithm can also outperform the best algorithm in a certain regime.
Comment: 31 pages, 1 table
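As a rough illustration of the large/small gradient-region dispatch described above, here is a minimal sketch on a toy saddle. The dense eigendecomposition used to obtain the negative-curvature direction is purely for illustration (limiting such computations is exactly the paper's goal), and the step size and thresholds are assumptions.

```python
import numpy as np

# Dispatch sketch: take a gradient step while the gradient is large, and a
# negative-curvature step only when the gradient is small.  The eigensolve
# below is for illustration only; the step size and thresholds are assumptions.

def one_step(grad_fn, hess_fn, x, eps, step=0.1):
    g = grad_fn(x)
    if np.linalg.norm(g) > eps:                 # large-gradient region
        return x - step * g
    H = hess_fn(x)                              # small-gradient region
    lam, V = np.linalg.eigh(H)
    if lam[0] >= -np.sqrt(eps):                 # no usable negative curvature: done
        return x
    v = V[:, 0]                                 # direction of most negative curvature
    v = v if g @ v <= 0 else -v                 # descend along the curvature direction
    return x + step * v

# Toy saddle: f(x) = 0.5 * x^T diag(1, -1) x, saddle point at the origin.
H = np.diag([1.0, -1.0])
grad_fn = lambda x: H @ x
hess_fn = lambda x: H
x = np.array([1e-6, 1e-6])
for _ in range(50):
    x = one_step(grad_fn, hess_fn, x, eps=1e-3)
print(x)   # moves away from the saddle along the second coordinate (toy objective is unbounded below)
```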
Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2)
replace the Hessian-vector product computations with only gradient
computations. It works both in the stochastic and the deterministic settings,
without hurting the algorithm's performance.
As applications, our reduction turns Natasha2 into a first-order method
without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into
algorithms finding approximate local minima, outperforming some of the best known results.
Comment: versions 2 and 3 improve writing
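The basic identity that lets gradient evaluations stand in for Hessian-vector products is the finite-difference approximation $\nabla^2 f(x)\,v \approx \big(\nabla f(x+\delta v)-\nabla f(x)\big)/\delta$. The sketch below checks this on an assumed toy objective; it is not the Neon2 procedure itself, only the substitution such reductions rely on, and the test function and choice of $\delta$ are illustrative.

```python
import numpy as np

# Hessian-vector product from two gradient evaluations:
#   H(x) v  ~  (grad f(x + delta * v) - grad f(x)) / delta.
# The test function and delta are illustrative assumptions.

def hvp_from_grads(grad_fn, x, v, delta=1e-5):
    return (grad_fn(x + delta * v) - grad_fn(x)) / delta

# Toy objective f(x) = 0.25 * ||x||^4 with gradient (x @ x) * x.
grad_fn = lambda x: (x @ x) * x
x = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.0])

approx = hvp_from_grads(grad_fn, x, v)
exact = (x @ x) * v + 2.0 * (x @ v) * x     # Hessian of f is (x @ x) I + 2 x x^T
print(approx, exact)                         # should agree to about 1e-4
```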
Inexact SARAH Algorithm for Stochastic Optimization
We develop and analyze a variant of the SARAH algorithm, which does not
require computation of the exact gradient. Thus this new method can be applied
to general expectation minimization problems rather than only finite sum
problems. While the original SARAH algorithm, as well as its predecessor, SVRG,
require an exact gradient computation on each outer iteration, the inexact
variant of SARAH (iSARAH), which we develop here, requires only a stochastic
gradient computed on a mini-batch of sufficient size. The proposed method
combines variance reduction via sample size selection and iterative stochastic
gradient updates. We analyze the convergence rate of the algorithm for the strongly convex and non-strongly convex cases, under a smoothness assumption, with an appropriate mini-batch size selected for each case. We show that, under an additional reasonable assumption, iSARAH achieves the best known complexity
among stochastic methods in the case of non-strongly convex stochastic
functions.
Comment: Optimization Methods and Software
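For reference, a minimal sketch of a SARAH-style inner loop with an inexact (mini-batch) outer gradient is given below; the components, batch sizes, and step size are illustrative assumptions rather than the analyzed settings.

```python
import numpy as np

# Minimal sketch of a SARAH-style epoch with an inexact outer gradient: the
# outer gradient is estimated on a sampled mini-batch (rather than the full
# sum), and the inner loop uses the recursive estimator
#   v_t = grad_i(w_t) - grad_i(w_{t-1}) + v_{t-1}.
# Batch sizes, step size, and the toy components are illustrative assumptions.

rng = np.random.default_rng(4)
n, d = 1000, 10
A = rng.normal(size=(n, d))
grad = lambda i, w: A[i] * (A[i] @ w)            # gradient of f_i(w) = 0.5 * (A_i @ w)^2

def isarah_epoch(w, outer_batch=200, inner_steps=50, step=0.01):
    B = rng.choice(n, size=outer_batch, replace=False)
    v = np.mean([grad(i, w) for i in B], axis=0)  # inexact outer gradient (mini-batch, not full)
    w_prev, w = w, w - step * v
    for _ in range(inner_steps):
        i = rng.integers(n)
        v = grad(i, w) - grad(i, w_prev) + v      # recursive SARAH update
        w_prev, w = w, w - step * v
    return w

w = rng.normal(size=d)
for _ in range(20):
    w = isarah_epoch(w)
print("||w|| after 20 epochs:", np.linalg.norm(w))
```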
Finding Local Minima via Stochastic Nested Variance Reduction
We propose two algorithms that can find local minima faster than the
state-of-the-art algorithms in both finite-sum and general stochastic nonconvex
optimization. At the core of the proposed algorithms is the use of stochastic nested variance reduction (Zhou et al., 2018a), which outperforms state-of-the-art variance reduction algorithms such as SCSG (Lei et al., 2017). In particular, for finite-sum optimization problems, the proposed algorithm achieves a gradient complexity for converging to an approximate second-order stationary point that outperforms the best existing algorithm (Allen-Zhu and Li, 2017) in a wide regime. For general stochastic optimization problems, the proposed algorithm achieves a gradient complexity that is better than both Allen-Zhu and Li (2017) and Natasha2 (Allen-Zhu, 2017) in certain regimes. Furthermore, we explore the acceleration brought by third-order smoothness of the objective function.
Comment: 37 pages, 4 figures, 1 table