Convergence Guarantees for Stochastic Subgradient Methods in Nonsmooth Nonconvex Optimization
In this paper, we investigate the convergence properties of the stochastic
gradient descent (SGD) method and its variants, especially in training neural
networks built from nonsmooth activation functions. We develop a novel
framework that assigns different timescales to stepsizes for updating the
momentum terms and variables, respectively. Under mild conditions, we prove the
global convergence of our proposed framework in both single-timescale and
two-timescale cases. We show that our proposed framework encompasses a wide
range of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion,
normalized SGD and clipped SGD. Furthermore, when the objective function adopts
a finite-sum formulation, we prove the convergence properties for these
SGD-type methods based on our proposed framework. In particular, we prove that
these SGD-type methods find the Clarke stationary points of the objective
function with randomly chosen stepsizes and initial points under mild
assumptions. Preliminary numerical experiments demonstrate the high efficiency
of our analyzed SGD-type methods.
Comment: 30 pages; the introduction is modified and some typos are corrected.
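The two-timescale idea described above (a faster stepsize for the momentum buffer, a slower one for the variables) can be illustrated with a minimal sketch. This is not the paper's exact framework: the stepsizes, the noise model, and the nonsmooth test function f(x) = |x| are assumptions made here for illustration.

```python
import numpy as np

def two_timescale_sgd(subgrad, x0, steps=1000, alpha=1e-2, beta=1e-1, seed=0):
    """Sketch of a momentum SGD update with two timescales:
    beta drives the momentum buffer, alpha drives the variable update."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = subgrad(x) + 0.01 * rng.standard_normal(x.shape)  # noisy subgradient
        m += beta * (g - m)   # momentum tracks the subgradient (fast timescale)
        x -= alpha * m        # variable update (slow timescale)
    return x

# Illustrative run on the nonsmooth function f(x) = |x|, whose subgradient
# away from 0 is sign(x); the iterate settles near the kink at 0.
x_star = two_timescale_sgd(np.sign, np.array([2.0]))
```

Setting beta much larger than alpha makes the momentum buffer an averaged estimate of the subgradient, which is the separation of timescales the abstract refers to.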
Nonsmooth nonconvex stochastic heavy ball
Motivated by the conspicuous use of momentum based algorithms in deep
learning, we study a nonsmooth nonconvex stochastic heavy ball method and show
its convergence. Our approach relies on semialgebraic assumptions, commonly met
in practical situations, which allow us to combine a conservative calculus with
nonsmooth ODE methods. In particular, we can justify the use of subgradient
sampling in practical implementations that employ backpropagation or implicit
differentiation. Additionally, we provide general conditions for the sample
distribution to ensure the convergence of the objective function. As for the
stochastic subgradient method, our analysis highlights that subgradient
sampling can make the stochastic heavy ball method converge to artificial
critical points. We address this concern by showing that these artifacts are
almost surely avoided when initializations are randomized.
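The heavy ball iteration studied here can be written as x_{k+1} = x_k - alpha * g_k + gamma * (x_k - x_{k-1}), with g_k a sampled subgradient. The following minimal sketch runs it on a one-dimensional nonsmooth loss; the stepsize, momentum constant, noise model, and the backpropagation-style convention relu'(0) = 0 are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def stochastic_heavy_ball(subgrad, x0, steps=2000, alpha=1e-2, gamma=0.9, seed=0):
    """Sketch of the stochastic heavy ball iteration
    x_{k+1} = x_k - alpha * g_k + gamma * (x_k - x_{k-1})."""
    rng = np.random.default_rng(seed)
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = subgrad(x) + 0.01 * rng.standard_normal(x.shape)  # sampled, noisy
        x, x_prev = x - alpha * g + gamma * (x - x_prev), x
    return x

# Nonsmooth loss f(x) = max(x, 0) + x**2 / 2 with a kink at 0; the sampled
# subgradient uses the convention relu'(0) = 0 (an assumption here), which is
# the kind of choice backpropagation makes at nondifferentiable points.
def subgrad(x):
    return np.where(x > 0, 1.0, 0.0) + x

x_final = stochastic_heavy_ball(subgrad, np.array([1.5]))
```

The minimizer of this loss sits exactly at the kink, so the sketch exercises the situation the abstract discusses: the method must handle a sampled subgradient at a nondifferentiable point.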
Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees
Asynchronous distributed algorithms are a popular way to reduce
synchronization costs in large-scale optimization, and in particular for neural
network training. However, for nonsmooth and nonconvex objectives, few
convergence guarantees exist beyond cases where closed-form proximal operator
solutions are available. As most popular contemporary deep neural networks lead
to nonsmooth and nonconvex objectives, there is now a pressing need for such
convergence guarantees. In this paper, we analyze for the first time the
convergence of stochastic asynchronous optimization for this general class of
objectives. In particular, we focus on stochastic subgradient methods allowing
for block variable partitioning, where the shared-memory-based model is
asynchronously updated by concurrent processes. To this end, we first introduce
a probabilistic model which captures key features of real asynchronous
scheduling between concurrent processes; under this model, we establish
convergence with probability one to an invariant set for stochastic subgradient
methods with momentum.
From the practical perspective, one issue with the family of methods we
consider is that it is not efficiently supported by machine learning
frameworks, as they mostly focus on distributed data-parallel strategies. To
address this, we propose a new implementation strategy for shared-memory based
training of deep neural networks, whereby concurrent parameter servers are
utilized to train a partitioned but shared model in single- and multi-GPU
settings. Based on this implementation, we achieve on average 1.2x speed-up in
comparison to state-of-the-art training methods for popular image
classification tasks without compromising accuracy.
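A toy version of the block-partitioned, shared-memory scheme described above can be sketched with Python threads: each worker owns one block of a shared parameter vector and updates it asynchronously from (possibly stale) gradients. The lock-free read/write pattern, the stepsize, and the quadratic test problem are assumptions for illustration, not the paper's parameter-server implementation.

```python
import threading
import numpy as np

def async_block_sgd(grad, x, blocks, steps=500, alpha=1e-2):
    """Sketch of asynchronous block-partitioned SGD: each worker repeatedly
    reads the shared vector x and writes only its own block, without locking
    (an assumption made here for simplicity)."""
    def worker(block, seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            # Read of x may interleave with other workers' writes: the
            # gradient is computed from a possibly stale copy of x.
            g = grad(x) + 0.01 * rng.standard_normal(x.shape)
            x[block] -= alpha * g[block]  # update only this worker's block
    threads = [threading.Thread(target=worker, args=(b, i))
               for i, b in enumerate(blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# Two concurrent workers, each owning half of a 4-dimensional quadratic
# problem f(x) = ||x||^2 / 2, whose gradient is simply x.
x = np.ones(4)
async_block_sgd(lambda v: v.copy(), x, blocks=[np.arange(2), np.arange(2, 4)])
```

Because no worker waits for the others, the iterates each worker sees can be inconsistent; the probabilistic scheduling model in the abstract is what makes convergence analysis possible despite this.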
Set-Valued Analysis, Viability Theory and Partial Differential Inclusions
Systems of first-order partial differential inclusions -- solutions of which are feedbacks governing viable trajectories of control systems -- are derived. A variational principle and an existence theorem of a (single-valued contingent) solution to such partial differential inclusions are stated. To prove such theorems, tools of set-valued analysis and tricks taken from viability theory are surveyed.
This paper is the text of a plenary conference to the World Congress on Nonlinear Analysis held at Tampa, Florida, August 19-26, 1992.