SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points
We analyze stochastic gradient algorithms for optimizing nonconvex problems.
In particular, our goal is to find local minima (second-order stationary
points) instead of just first-order stationary points, which may be bad,
unstable saddle points. We show that a simple perturbed version of the
stochastic recursive gradient descent algorithm (called SSRGD) can find an
$(\epsilon,\delta)$-second-order stationary point with
$\widetilde{O}(\sqrt{n}/\epsilon^2 + \sqrt{n}/\delta^4 + n/\delta^3)$
stochastic gradient complexity for nonconvex finite-sum problems. As a
by-product, SSRGD finds an $\epsilon$-first-order stationary point with
$\widetilde{O}(\sqrt{n}/\epsilon^2)$ stochastic gradients. These results are
almost optimal since Fang et al. [2018] provided a lower bound of
$\Omega(\sqrt{n}/\epsilon^2)$ for finding even just an $\epsilon$-first-order
stationary point. We emphasize that the SSRGD algorithm for finding second-order
stationary points is as simple as the one for finding first-order stationary
points: it merely adds a uniform perturbation from time to time, whereas all
other algorithms that find second-order stationary points with similar gradient
complexity must be combined with a negative-curvature search subroutine (e.g.,
Neon2 [Allen-Zhu and Li, 2018]). Moreover, the simple SSRGD algorithm admits a
simpler analysis.
We also extend our results from nonconvex finite-sum problems to nonconvex
online (expectation) problems and prove the corresponding convergence results.
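The perturbation idea can be pictured with a short sketch: a SARAH/SPIDER-style recursive gradient estimator drives the main updates, and a uniform perturbation is added whenever the full gradient becomes small, so the iterate can leave the neighborhood of a possible saddle point. The sketch below is a minimal illustration, assuming a finite-sum objective accessed through a per-component gradient oracle grad_i; the step size, batch size, epoch length, and perturbation radius are illustrative placeholders, not the constants or exact procedure from the paper.

    import numpy as np

    def ssrgd_sketch(grad_i, x0, n, eta=0.01, batch=32, epoch_len=100,
                     grad_tol=1e-3, radius=1e-2, epochs=50, seed=0):
        """Perturbed recursive gradient descent sketch (illustrative, not the paper's exact method)."""
        rng = np.random.default_rng(seed)
        x = x0.astype(float)
        for _ in range(epochs):
            full_grad = np.mean([grad_i(x, i) for i in range(n)], axis=0)
            # Add a uniform perturbation when the gradient is small, so the
            # iterate can escape a region around a possible saddle point.
            if np.linalg.norm(full_grad) <= grad_tol:
                xi = rng.uniform(-1.0, 1.0, size=x.shape)
                x = x + radius * xi / np.linalg.norm(xi)
                full_grad = np.mean([grad_i(x, i) for i in range(n)], axis=0)
            v, x_prev = full_grad, x.copy()
            x = x - eta * v
            # Recursive (SARAH-style) estimator: correct v with minibatch gradient differences.
            for _ in range(epoch_len):
                idx = rng.choice(n, size=batch, replace=False)
                v = v + np.mean([grad_i(x, i) - grad_i(x_prev, i) for i in idx], axis=0)
                x_prev = x.copy()
                x = x - eta * v
        return x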
Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees
Asynchronous distributed algorithms are a popular way to reduce
synchronization costs in large-scale optimization, and in particular for neural
network training. However, for nonsmooth and nonconvex objectives, few
convergence guarantees exist beyond cases where closed-form proximal operator
solutions are available. As most popular contemporary deep neural networks lead
to nonsmooth and nonconvex objectives, there is now a pressing need for such
convergence guarantees. In this paper, we analyze for the first time the
convergence of stochastic asynchronous optimization for this general class of
objectives. In particular, we focus on stochastic subgradient methods allowing
for block variable partitioning, where the shared-memory-based model is
asynchronously updated by concurrent processes. To this end, we first introduce
a probabilistic model which captures key features of real asynchronous
scheduling between concurrent processes; under this model, we establish
convergence with probability one to an invariant set for stochastic subgradient
methods with momentum.
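As an illustration of the shared-memory setting analyzed here, the following sketch runs several Python threads, each of which owns one block of a shared parameter vector and repeatedly applies a momentum subgradient step to that block while reading possibly stale values of the other blocks. The names subgrad, blocks, and data are illustrative assumptions; this is a toy model of the asynchrony, not the authors' implementation.

    import threading
    import numpy as np

    def async_block_subgradient(subgrad, x, blocks, data, steps=1000,
                                lr=0.01, momentum=0.9, seed=0):
        """Each worker updates only its own block of the shared vector x, lock-free."""
        velocity = [np.zeros(len(b)) for b in blocks]

        def worker(k):
            rng = np.random.default_rng(seed + k)
            block = np.asarray(blocks[k])
            for _ in range(steps):
                sample = data[rng.integers(len(data))]
                # Subgradient restricted to block k, computed from a possibly stale read of x.
                g = subgrad(x, sample)[block]
                velocity[k] = momentum * velocity[k] - lr * g
                x[block] += velocity[k]  # asynchronous write to the shared model

        threads = [threading.Thread(target=worker, args=(k,)) for k in range(len(blocks))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x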
From a practical perspective, one issue with the family of methods we consider
is that they are not efficiently supported by machine learning frameworks,
which mostly focus on distributed data-parallel strategies. To
address this, we propose a new implementation strategy for shared-memory based
training of deep neural networks, whereby concurrent parameter servers are
utilized to train a partitioned but shared model in single- and multi-GPU
settings. Based on this implementation, we achieve an average 1.2x speed-up
over state-of-the-art training methods on popular image classification tasks
without compromising accuracy.
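One way to picture the proposed implementation strategy is with concurrent "parameter server" threads, each owning one partition of the model and applying gradient blocks pushed by workers through a per-partition queue. Everything below (run_partitioned_servers, the queue protocol, the plain SGD update) is an illustrative assumption about how such a scheme could be organized, not the authors' single- or multi-GPU code.

    import queue
    import threading
    import numpy as np

    def run_partitioned_servers(params, partitions, lr=0.01, n_updates=1000):
        """Start one server thread per parameter partition; return their input queues."""
        inboxes = [queue.Queue() for _ in partitions]

        def server(k):
            idx = np.asarray(partitions[k])
            for _ in range(n_updates):
                g = inboxes[k].get()      # gradient block pushed by some worker
                params[idx] -= lr * g     # update only this server's partition

        threads = [threading.Thread(target=server, args=(k,), daemon=True)
                   for k in range(len(partitions))]
        for t in threads:
            t.start()
        return inboxes, threads

    # A worker that computed full_grad on a minibatch would push one block per server:
    #   for k, idx in enumerate(partitions):
    #       inboxes[k].put(full_grad[idx])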