Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning
The widely used stochastic gradient methods for minimizing nonconvex
composite objective functions require the Lipschitz smoothness of the
differentiable part. However, this requirement fails for problem classes such as
quadratic inverse problems and the training of neural networks. To address
this issue, we investigate a family of stochastic Bregman proximal gradient
(SBPG) methods, which only require smooth adaptivity of the differentiable
part. SBPG replaces the upper quadratic approximation used in SGD with the
Bregman proximity measure, resulting in a better approximation model that
captures the non-Lipschitz gradients of the nonconvex objective. We formulate
the vanilla SBPG and establish its convergence properties in the nonconvex
setting without a finite-sum structure. Experimental results on quadratic inverse
problems confirm the robustness of SBPG. Moreover, we propose a momentum-based
version of SBPG (MSBPG) and prove it has improved convergence properties. We
apply MSBPG to the training of deep neural networks with a polynomial kernel
function, which ensures the smooth adaptivity of the loss function.
Experimental results on representative benchmarks demonstrate the effectiveness
and robustness of MSBPG in training neural networks. Since the additional
computational cost of MSBPG compared with SGD is negligible in large-scale
optimization, MSBPG can potentially be employed as a universal open-source
optimizer in the future.
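As a concrete illustration of the Bregman step described above: a minimal sketch of one vanilla SBPG update, assuming no nonsmooth term and the quartic kernel h(x) = 0.25*||x||^4 + 0.5*||x||^2 commonly used for quadratic inverse problems (the kernel choice, the absence of momentum, and the function names are assumptions here, not the paper's MSBPG).

```python
import numpy as np

def grad_h(x):
    # Gradient of the assumed kernel h(x) = 0.25*||x||^4 + 0.5*||x||^2.
    return (np.dot(x, x) + 1.0) * x

def sbpg_step(x, stoch_grad, lr):
    """One vanilla SBPG step with no nonsmooth term: find x_next such that
    grad_h(x_next) = grad_h(x) - lr * stoch_grad (a mirror-descent step)."""
    p = grad_h(x) - lr * stoch_grad
    norm_p = np.linalg.norm(p)
    if norm_p == 0.0:
        return np.zeros_like(x)
    # x_next is parallel to p: x_next = c * p, where c > 0 solves the
    # strictly increasing cubic c^3 * ||p||^2 + c - 1 = 0 (one real root).
    roots = np.roots([norm_p ** 2, 0.0, 1.0, -1.0])
    c = float(roots[np.abs(roots.imag) < 1e-10].real.max())
    return c * p
```

With SGD's quadratic model the step would simply be x - lr * stoch_grad; replacing it with the Bregman proximity measure of this kernel reduces the subproblem to a scalar cubic while tolerating non-Lipschitz gradients.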
Momentum-based variance reduction in non-convex SGD
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent
in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the optimal rate without requiring knowledge of $\sigma$. https://arxiv.org/pdf/1905.10018.pdf
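A minimal sketch of the recursive-momentum estimator described above, with the step size and momentum parameter held fixed for clarity (the actual STORM sets them adaptively from observed gradient magnitudes; grad_fn and sample_fn are placeholder names):

```python
import numpy as np

def storm(grad_fn, sample_fn, x0, steps, lr=0.1, a=0.1):
    """Sketch of STORM's variance-reduced momentum update.
    grad_fn(x, xi) is a stochastic gradient of F at x on sample xi;
    sample_fn() draws one fresh sample per iteration (no mega-batches)."""
    x_prev = np.asarray(x0, dtype=float).copy()
    d = grad_fn(x_prev, sample_fn())          # d_1: plain stochastic gradient
    x = x_prev - lr * d
    for _ in range(1, steps):
        xi = sample_fn()
        # Recursive momentum: correct the previous estimate with a gradient
        # difference evaluated on the SAME fresh sample xi.
        d = grad_fn(x, xi) + (1.0 - a) * (d - grad_fn(x_prev, xi))
        x_prev, x = x, x - lr * d
    return x
```

Setting a = 1 recovers plain SGD; the (1 - a) correction term is what drives the variance of the estimate d below that of a single stochastic gradient.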
Finite-sum optimization: Adaptivity to smoothness and loopless variance reduction
For finite-sum optimization, variance-reduced gradient methods (VR) compute
at each iteration the gradient of a single function (or of a mini-batch), and
yet achieve faster convergence than SGD thanks to a carefully crafted
lower-variance stochastic gradient estimator that reuses past gradients.
Another important line of research of the past decade in continuous
optimization is adaptive algorithms such as AdaGrad, which dynamically
adjust the (possibly coordinate-wise) learning rate to past gradients and
thereby adapt to the geometry of the objective function. Variants such as
RMSprop and Adam demonstrate outstanding practical performance that has
contributed to the success of deep learning. In this work, we present AdaVR,
which combines the AdaGrad algorithm with variance-reduced gradient estimators
such as SAGA or L-SVRG. We show that AdaVR inherits both the good convergence
properties of VR methods and the adaptive nature of AdaGrad: for $L$-smooth
convex functions we establish a gradient complexity bound that requires no
prior knowledge of the smoothness constant $L$. Numerical
experiments demonstrate the superiority of AdaVR over state-of-the-art methods.
Moreover, we empirically show that RMSprop and Adam combined with
variance-reduced gradient estimators achieve even faster convergence.
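The combination the abstract describes can be pictured as an AdaGrad per-coordinate step size fed by a loopless-SVRG estimator. The sketch below is only a guess at that structure under simplified assumptions (uniform sampling, fixed base step size); it is not the paper's exact AdaVR, and comp_grads is a placeholder name.

```python
import numpy as np

def ada_lsvrg(comp_grads, x0, steps, lr=1.0, p=0.1, eps=1e-8, seed=0):
    """AdaGrad-style step sizes driven by an L-SVRG gradient estimator.
    comp_grads[i](x) returns the gradient of the i-th component function."""
    rng = np.random.default_rng(seed)
    n = len(comp_grads)
    x = np.asarray(x0, dtype=float).copy()
    w = x.copy()                                # reference point
    mu = sum(gi(w) for gi in comp_grads) / n    # full gradient at the reference
    acc = np.zeros_like(x)                      # AdaGrad accumulator
    for _ in range(steps):
        i = rng.integers(n)
        g = comp_grads[i](x) - comp_grads[i](w) + mu   # L-SVRG estimator
        acc += g * g                                   # per-coordinate sum of squares
        x = x - lr * g / (np.sqrt(acc) + eps)          # AdaGrad-style step
        if rng.random() < p:                           # loopless: random refresh
            w = x.copy()
            mu = sum(gi(w) for gi in comp_grads) / n
    return x
```

The coin flip with probability p replaces the fixed-length inner loop of SVRG, which is what makes the estimator "loopless".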
Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity
We study the oracle complexity of gradient-based methods for stochastic
approximation problems. Though in many settings optimal algorithms and tight
lower bounds are known for such problems, these optimal algorithms do not
achieve the best performance when used in practice. We address this
theory-practice gap by focusing on instance-dependent complexity instead of
worst-case complexity. In particular, we first summarize known
instance-dependent complexity results and categorize them into three levels. We
identify the domination relation between different levels and propose a fourth
instance-dependent bound that dominates existing ones. We then provide a
sufficient condition under which an adaptive algorithm with moment
estimation can achieve the proposed bound without knowledge of noise levels.
Our proposed algorithm and its analysis provide a theoretical justification for
the success of moment estimation, as it achieves improved instance complexity.
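The abstract does not spell out its algorithm, so the following is only a generic illustration of what moment estimation typically means in this setting: scaling the step size by a running estimate of the gradient's second moment (Adam/RMSprop style), so that no noise level has to be supplied in advance. Function and parameter names are placeholders, not the paper's method.

```python
import numpy as np

def moment_adaptive_sgd(grad_fn, x0, steps, lr=0.1, beta=0.9, eps=1e-8):
    """Generic second-moment-estimation step sizes (not the paper's algorithm).
    grad_fn(x) returns a stochastic gradient of the objective at x."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                       # running second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        v = beta * v + (1.0 - beta) * g * g    # exponential moving average of g^2
        v_hat = v / (1.0 - beta ** t)          # bias correction
        x = x - lr * g / (np.sqrt(v_hat) + eps)
    return x
```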
- ā¦