Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees
Asynchronous distributed algorithms are a popular way to reduce
synchronization costs in large-scale optimization, and in particular for neural
network training. However, for nonsmooth and nonconvex objectives, few
convergence guarantees exist beyond cases where closed-form proximal operator
solutions are available. As most popular contemporary deep neural networks lead
to nonsmooth and nonconvex objectives, there is now a pressing need for such
convergence guarantees. In this paper, we analyze for the first time the
convergence of stochastic asynchronous optimization for this general class of
objectives. In particular, we focus on stochastic subgradient methods allowing
for block variable partitioning, where the shared-memory-based model is
asynchronously updated by concurrent processes. To this end, we first introduce
a probabilistic model which captures key features of real asynchronous
scheduling between concurrent processes; under this model, we establish
convergence with probability one to an invariant set for stochastic subgradient
methods with momentum.
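To make the block-partitioned scheme concrete, the following is a minimal Python sketch (our illustration, not the paper's implementation) in which several threads each own one block of a shared parameter vector and apply stochastic subgradient steps with momentum, without synchronizing between blocks; the subgrad callback, step size, and momentum constant are placeholder assumptions.

    # Minimal sketch: asynchronous block-partitioned stochastic subgradient with momentum.
    import threading
    import numpy as np

    def async_block_subgradient(subgrad, x0, n_blocks, steps, lr=0.01, beta=0.9):
        """subgrad(x, idx) returns a stochastic subgradient restricted to the index block idx."""
        x = x0.copy()                          # shared-memory model, read/written concurrently
        blocks = np.array_split(np.arange(x.size), n_blocks)
        velocities = [np.zeros(len(b)) for b in blocks]

        def worker(k):
            idx, v = blocks[k], velocities[k]
            for _ in range(steps):
                g = subgrad(x, idx)            # evaluated at a possibly stale view of x
                v[:] = beta * v - lr * g       # momentum (heavy-ball) update on this block
                x[idx] += v                    # in-place write to the shared model

        threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_blocks)]
        for t in threads: t.start()
        for t in threads: t.join()
        return x

    # toy usage: noisy subgradient of f(x) = ||x||_1
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        sg = lambda x, idx: np.sign(x[idx]) + 0.1 * rng.standard_normal(len(idx))
        x_final = async_block_subgradient(sg, rng.standard_normal(100), n_blocks=4, steps=500)
        print(np.abs(x_final).mean())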
From a practical perspective, one issue with the family of methods we
consider is that it is not efficiently supported by machine learning
frameworks, as they mostly focus on distributed data-parallel strategies. To
address this, we propose a new implementation strategy for shared-memory based
training of deep neural networks, whereby concurrent parameter servers are
utilized to train a partitioned but shared model in single- and multi-GPU
settings. Based on this implementation, we achieve an average 1.2x speed-up in
comparison to state-of-the-art training methods for popular image
classification tasks without compromising accuracy.
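As an illustration of the concurrent-parameter-server idea (a structural sketch under our own assumptions, not the authors' code), the Python snippet below gives each partition of the model its own server thread that applies incoming gradient updates pushed by workers; GPU placement and framework integration are omitted.

    # Minimal sketch: one parameter-server thread per model partition.
    import queue
    import threading
    import numpy as np

    class BlockServer(threading.Thread):
        def __init__(self, params, lr=0.01):
            super().__init__(daemon=True)
            self.params, self.lr, self.inbox = params, lr, queue.Queue()

        def run(self):
            while True:
                grad = self.inbox.get()
                if grad is None:                 # poison pill: shut down the server
                    break
                self.params -= self.lr * grad    # in-place update of this partition

    # usage: two servers, each owning half of a 10-dim model; a worker pushes noisy gradients
    model = np.ones(10)
    servers = [BlockServer(model[:5]), BlockServer(model[5:])]
    for s in servers: s.start()
    rng = np.random.default_rng(0)
    for _ in range(100):
        for s in servers:
            s.inbox.put(rng.standard_normal(5))
    for s in servers:
        s.inbox.put(None)
        s.join()
    print(model)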
A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization
We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth
finite-sum problems. In particular, the objective function is given by the
summation of a differentiable (possibly nonconvex) component, together with a
possibly non-differentiable but convex component. We propose a proximal
stochastic gradient algorithm based on variance reduction, called ProxSVRG+.
Our main contribution lies in the analysis of ProxSVRG+. It recovers several
existing convergence results and improves/generalizes them (in terms of the
number of stochastic gradient oracle calls and proximal oracle calls). In
particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm,
recently proposed by [Lei et al., 2017] for the smooth nonconvex case.
ProxSVRG+ is also more straightforward than SCSG and admits a simpler analysis.
Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent
(ProxGD) for a wide range of minibatch sizes, which partially solves an open
problem proposed in [Reddi et al., 2016b]. Also, ProxSVRG+ uses far fewer
proximal oracle calls than ProxSVRG [Reddi et al., 2016b]. Moreover, for
nonconvex functions satisfying the Polyak-\L{}ojasiewicz (PL) condition, we prove that
ProxSVRG+ achieves a global linear convergence rate without restart unlike
ProxSVRG. Thus, it can \emph{automatically} switch to the faster linear
convergence in some regions as long as the objective function satisfies the PL
condition locally in these regions. ProxSVRG+ also improves ProxGD and
ProxSVRG/SAGA, and generalizes the results of SCSG in this case. Finally, we
conduct several experiments, and the experimental results are consistent with
the theoretical results.
Comment: 32nd Conference on Neural Information Processing Systems (NeurIPS 2018).
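As a minimal sketch of a ProxSVRG+-style update (our illustration under stated assumptions, not the paper's code), consider the composite objective f(x) + h(x), where f is a smooth, possibly nonconvex finite sum and h is a convex nonsmooth term; here h is taken to be an l1 penalty so the proximal operator is soft-thresholding, and the batch sizes and step size are illustrative rather than the paper's recommended choices.

    # Minimal sketch: proximal stochastic gradient with SVRG-style variance reduction.
    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def prox_svrg_plus(grad_i, n, x0, lam, eta=0.1, outer=20, inner=10, B=64, b=8, seed=0):
        """grad_i(x, i) returns the gradient of the i-th smooth component f_i at x."""
        rng = np.random.default_rng(seed)
        x = x0.copy()
        for _ in range(outer):
            # snapshot gradient on a (possibly subsampled) batch of size B
            snap = x.copy()
            S = rng.choice(n, size=min(B, n), replace=False)
            g_snap = np.mean([grad_i(snap, i) for i in S], axis=0)
            for _ in range(inner):
                I = rng.choice(n, size=b, replace=False)
                # variance-reduced stochastic gradient estimator
                v = np.mean([grad_i(x, i) - grad_i(snap, i) for i in I], axis=0) + g_snap
                # proximal step on the convex nonsmooth part (l1 -> soft-thresholding)
                x = soft_threshold(x - eta * v, eta * lam)
        return x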
An Accelerated Stochastic ADMM for Nonconvex and Nonsmooth Finite-Sum Optimization
The nonconvex and nonsmooth finite-sum optimization problem with linear
constraint has attracted much attention in the fields of artificial
intelligence, computer science, and mathematics, due to its wide applications in
machine learning and the lack of efficient algorithms with convincing
convergence theories. A popular approach to solve it is the stochastic
Alternating Direction Method of Multipliers (ADMM), but most stochastic
ADMM-type methods focus on convex models. In addition, the variance reduction
(VR) and acceleration techniques are useful tools in the development of
stochastic methods due to their simplicity and practicability in providing
acceleration characteristics of various machine learning models. However, it
remains unclear whether the accelerated SVRG-ADMM algorithm (ASVRG-ADMM), which
extends SVRG-ADMM by incorporating momentum techniques, exhibits a comparable
acceleration characteristic or convergence rate in the nonconvex setting. To
fill this gap, we consider a general nonconvex nonsmooth optimization problem
and study the convergence of ASVRG-ADMM. By utilizing a well-defined potential
energy function, we establish its sublinear convergence rate $\mathcal{O}(1/T)$, where $T$
denotes the iteration number. Furthermore, under the additional
Kurdyka-\L{}ojasiewicz (KL) property, which is less stringent than the frequently
used conditions for showcasing linear convergence rates, such as strong
convexity, we show that the ASVRG-ADMM sequence has a finite length and
converges to a stationary solution with a linear convergence rate. Several
experiments on solving the graph-guided fused lasso problem and regularized
logistic regression problem validate that the proposed ASVRG-ADMM performs
better than the state-of-the-art methods.
Comment: 40 pages, 8 figures.
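The following Python sketch (an assumption-laden illustration, not the authors' implementation) shows one way an accelerated stochastic ADMM pass with variance reduction and momentum extrapolation can look for a graph-guided fused lasso-type problem min_x f(x) + lam*||Ax||_1, rewritten with the constraint Ax = z; the penalty parameter rho, step size eta, and momentum weight theta are placeholder values.

    # Minimal sketch: accelerated stochastic ADMM with SVRG-style variance reduction.
    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def asvrg_admm(grad_i, n, A, x0, lam, rho=1.0, eta=0.1, theta=0.5,
                   outer=10, inner=20, b=8, seed=0):
        """grad_i(x, i) returns the gradient of the i-th smooth loss component at x."""
        rng = np.random.default_rng(seed)
        d = x0.size
        x = x_prev = x0.copy()
        z = A @ x
        u = np.zeros(A.shape[0])                         # scaled dual variable
        M = rho * A.T @ A + np.eye(d) / eta              # x-subproblem system matrix
        for _ in range(outer):
            snap = x.copy()
            g_snap = np.mean([grad_i(snap, i) for i in range(n)], axis=0)  # full gradient at snapshot
            for _ in range(inner):
                y = x + theta * (x - x_prev)             # momentum extrapolation
                I = rng.choice(n, size=b, replace=False)
                v = np.mean([grad_i(y, i) - grad_i(snap, i) for i in I], axis=0) + g_snap
                x_prev = x
                rhs = y / eta - v + rho * A.T @ (z - u)
                x = np.linalg.solve(M, rhs)              # linearized x-update with proximal term
                z = soft_threshold(A @ x + u, lam / rho) # closed-form z-update (prox of l1)
                u = u + A @ x - z                        # dual ascent step
        return x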