An Adaptive Remote Stochastic Gradient Method for Training Neural Networks
We present the remote stochastic gradient (RSG) method, which computes the
gradients at configurable remote observation points, in order to simultaneously improve the convergence rate and suppress gradient noise across different curvatures. RSG is further combined with adaptive methods to construct ARSG for
acceleration. The method is efficient in computation and memory, and is
straightforward to implement. We analyze the convergence properties by modeling
the training process as a dynamic system, which provides a guideline to select
the configurable observation factor without grid search. ARSG attains a convergence-rate guarantee in non-convex settings that can be further improved in strongly convex settings. Numerical experiments
demonstrate that ARSG achieves both faster convergence and better
generalization, compared with popular adaptive methods, such as ADAM, NADAM,
AMSGRAD, and RANGER for the tested problems. In particular, for training
ResNet-50 on ImageNet, ARSG outperforms ADAM in convergence speed while surpassing SGD in generalization.
Comment: The generalization is improved by modifying the preconditioner. For training ResNet-50 on ImageNet, ARSG outperforms ADAM in convergence speed while surpassing SGD in generalization. We also present a convergence bound in the non-convex setting.
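The abstract above describes evaluating gradients at a configurable "remote observation point" rather than at the current iterate. The exact ARSG update and preconditioner are not reproduced here; the following is only a minimal sketch of the general idea, assuming the observation point is a momentum-based extrapolation controlled by a hypothetical obs_factor knob.

```python
import numpy as np

def remote_gradient_step(x, v, grad_fn, lr=0.01, momentum=0.9, obs_factor=0.5):
    """Sketch of one step that evaluates the gradient at a 'remote' point.

    x          : current parameters (np.ndarray)
    v          : momentum buffer (np.ndarray)
    grad_fn    : callable returning a stochastic gradient at a given point
    obs_factor : illustrative knob controlling how far ahead to observe
    """
    x_obs = x + obs_factor * momentum * v   # remote observation point
    g = grad_fn(x_obs)                      # gradient taken at the remote point
    v = momentum * v - lr * g               # momentum accumulation
    return x + v, v                         # updated parameters and buffer

# Illustrative usage on a toy quadratic f(x) = 0.5 * ||x||^2
x, v = np.ones(3), np.zeros(3)
for _ in range(100):
    x, v = remote_gradient_step(x, v, grad_fn=lambda z: z)
```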
On the Ineffectiveness of Variance Reduced Optimization for Deep Learning
The application of stochastic variance reduction to optimization has shown
remarkable recent theoretical and practical success. The applicability of these
techniques to the hard non-convex optimization problems encountered during
training of modern deep neural networks is an open problem. We show that naive
application of the SVRG technique and related approaches fail, and explore why.
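For reference, the SVRG technique discussed above reduces gradient variance by correcting each stochastic gradient with a periodically recomputed full gradient at a snapshot point. Below is a minimal sketch of the standard SVRG update, not the paper's experimental setup.

```python
import numpy as np

def svrg(x0, grad_i, full_grad, n, lr=0.1, epochs=10, inner_steps=100, seed=0):
    """Minimal SVRG sketch.

    grad_i(i, x) : stochastic gradient of the i-th of n component functions at x
    full_grad(x) : full gradient, recomputed once per outer epoch (snapshot)
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        snapshot = x.copy()
        mu = full_grad(snapshot)                         # anchor gradient
        for _ in range(inner_steps):
            i = rng.integers(n)
            v = grad_i(i, x) - grad_i(i, snapshot) + mu  # variance-reduced estimate
            x = x - lr * v
    return x
```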
Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, converge quickly but have been observed to generalize worse than stochastic gradient descent (SGD) with momentum when training deep neural networks. Closing this generalization gap of adaptive gradient methods remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, the partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partially adaptive parameter, achieving the best of both worlds. We also prove the convergence rate of our proposed algorithm to a
stationary point in the stochastic nonconvex optimization setting. Experiments
on standard benchmarks show that our proposed algorithm can maintain a fast
convergence rate as Adam/Amsgrad while generalizing as well as SGD in training
deep neural networks. These results suggest that practitioners can pick up adaptive gradient methods once again for faster training of deep neural networks.
Comment: 17 pages, 4 figures, 4 tables. In IJCAI 2020.
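The partially adaptive idea above amounts to raising the second-moment denominator of an Amsgrad-style update to a fractional power. The sketch below assumes an exponent p in [0, 1/2] where p = 1/2 recovers Amsgrad-like behavior and p near 0 behaves more like SGD with momentum; names and defaults are illustrative, not the authors' reference code.

```python
import numpy as np

def partially_adaptive_step(theta, m, v, vhat, g, lr=0.1,
                            beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One sketched step of a partially adaptive momentum update."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    vhat = np.maximum(vhat, v)               # Amsgrad-style running maximum
    theta = theta - lr * m / (vhat ** p + eps)   # denominator raised to power p
    return theta, m, v, vhat
```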
Interpreting Deep Learning: The Machine Learning Rorschach Test?
Theoretical understanding of deep learning is one of the most important tasks
facing the statistics and machine learning communities. While deep neural
networks (DNNs) originated as engineering methods and models of biological
networks in neuroscience and psychology, they have quickly become a centerpiece
of the machine learning toolbox. Unfortunately, DNN adoption, powered by recent successes and the open-source nature of the machine learning community, has outpaced our theoretical understanding. We cannot reliably identify when and why DNNs will make mistakes. While in some applications, like text translation, these mistakes may be comical and provide fun fodder for research talks, a single error can be very costly in tasks like medical imaging. As we utilize DNNs in increasingly sensitive applications, a better
understanding of their properties is thus imperative. Recent advances in DNN
theory are numerous and include many different sources of intuition, such as
learning theory, sparse signal analysis, physics, chemistry, and psychology. An
interesting pattern begins to emerge in the breadth of possible
interpretations. The seemingly limitless approaches are mostly constrained by the lens through which the mathematical operations are viewed. Ultimately, the
interpretation of DNNs appears to mimic a type of Rorschach test --- a
psychological test wherein subjects interpret a series of seemingly ambiguous
ink-blots. Validation for DNN theory requires a convergence of the literature.
We must distinguish between universal results that are invariant to the
analysis perspective and those that are specific to a particular network
configuration. Simultaneously, we must deal with the fact that many standard statistical tools for quantifying generalization or empirically assessing important network features are difficult to apply to DNNs.
Comment: 13 pages, 1 figure. Preprint is related to an upcoming Society for Industrial and Applied Mathematics (SIAM) News article.
Adaptive Weight Decay for Deep Neural Networks
Regularization in the optimization of deep neural networks is often critical to avoid undesirable over-fitting and to obtain better generalization of the model. One of the most popular regularization techniques is to impose an L2 penalty on the model parameters, which results in a decay of the parameters known as weight decay; the decay rate is generally kept constant for all model parameters over the course of optimization. In contrast to this constant-rate approach, we propose to use the residual, which measures the dissimilarity between the current state of the model and the observations, to determine the weight decay for each parameter in an adaptive way. In the resulting method, called adaptive weight decay (AdaDecay), gradient norms are normalized within each layer and the degree of regularization for each parameter is set in proportion to the magnitude of its gradient using the sigmoid function. We empirically demonstrate the effectiveness of AdaDecay in comparison to state-of-the-art optimization algorithms on popular benchmark datasets (MNIST, Fashion-MNIST, and CIFAR-10) with conventional neural network models ranging from shallow to deep. The quantitative evaluation of our proposed algorithm indicates that AdaDecay improves generalization, leading to better accuracy across all the datasets and models.
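A rough sketch of the per-parameter decay described above: gradient magnitudes are normalized within each layer and passed through a sigmoid to set each parameter's decay strength. The exact normalization and the residual-based scaling in the paper may differ; this only illustrates the stated idea, and base_decay and alpha are placeholder knobs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_decay_rates(layer_grads, base_decay=1e-4, alpha=1.0):
    """Sketch: per-parameter weight-decay rates from layer-normalized gradients.

    layer_grads : list of np.ndarray, one gradient tensor per layer
    alpha       : illustrative scaling of the normalized gradient magnitude
    """
    rates = []
    for g in layer_grads:
        mag = np.abs(g)
        # normalize gradient magnitudes within the layer
        normed = (mag - mag.mean()) / (mag.std() + 1e-12)
        # larger normalized gradients -> stronger decay, via the sigmoid
        rates.append(base_decay * sigmoid(alpha * normed))
    return rates
```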
Predictive Local Smoothness for Stochastic Gradient Methods
Stochastic gradient methods are dominant in nonconvex optimization, especially for deep models, but their asymptotic convergence is slow because they rely on a fixed smoothness constant. To address this problem, we propose a simple yet effective method for improving stochastic gradient methods, named predictive local smoothness (PLS). First, we derive a convergence condition that yields a learning rate varying adaptively with the local smoothness. Second, the local smoothness can be predicted from the latest gradients. Third, we use the adaptive learning rate in the stochastic gradient updates to pursue linear convergence rates. By
applying the PLS method, we implement new variants of three popular algorithms:
PLS-stochastic gradient descent (PLS-SGD), PLS-accelerated SGD (PLS-AccSGD),
and PLS-AMSGrad. Moreover, we provide much simpler proofs to ensure their
linear convergence. Empirical results show that the variants outperform the popular base algorithms, with faster convergence and alleviated exploding and vanishing gradients.
Comment: 14 pages, 7 figures.
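The abstract above predicts local smoothness from the latest gradients and sets the learning rate accordingly. A common finite-difference estimate of local smoothness is sketched below; the paper's precise predictor and convergence condition are not reproduced here, and lr_max is an illustrative cap.

```python
import numpy as np

def local_smoothness_lr(x_prev, x_curr, g_prev, g_curr, lr_max=1.0, eps=1e-12):
    """Sketch: learning rate from a local smoothness estimate.

    Estimates L ~ ||g_curr - g_prev|| / ||x_curr - x_prev|| and uses lr ~ 1/L,
    capped at lr_max. This is an illustrative predictor, not the paper's exact rule.
    """
    L = np.linalg.norm(g_curr - g_prev) / (np.linalg.norm(x_curr - x_prev) + eps)
    return min(lr_max, 1.0 / (L + eps))
```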
A Modular Analysis of Provable Acceleration via Polyak's Momentum: Training a Wide ReLU Network and a Deep Linear Network
Incorporating a so-called "momentum" dynamic in gradient descent methods is
widely used in neural net training as it has been broadly observed that, at
least empirically, it often leads to significantly faster convergence. At the
same time, there are very few theoretical guarantees in the literature to
explain this apparent acceleration effect. Even for the classical strongly
convex quadratic problems, several existing results only show Polyak's momentum
has an accelerated linear rate asymptotically. In this paper, we first revisit
the quadratic problems and show a non-asymptotic accelerated linear rate of
Polyak's momentum. Then, we provably show that Polyak's momentum achieves
acceleration for training a one-layer wide ReLU network and a deep linear
network, which are perhaps the two most popular canonical models for studying
optimization and deep learning in the literature. Prior work (Du et al. 2019 and Wu et al. 2019) showed that using vanilla gradient descent, and with the use of over-parameterization, the error decays as $(1-\Theta(1/\kappa))^t$ after $t$ iterations, where $\kappa$ is the condition number of a Gram matrix. Our result shows that with the appropriate choice of parameters Polyak's momentum has a rate of $(1-\Theta(1/\sqrt{\kappa}))^t$. For the deep linear network, prior work (Hu et al. 2020) showed that vanilla gradient descent has a rate of $(1-\Theta(1/\tilde{\kappa}))^t$, where $\tilde{\kappa}$ is the condition number of a data matrix. Our result shows that an accelerated rate of $(1-\Theta(1/\sqrt{\tilde{\kappa}}))^t$ is achievable by Polyak's momentum. All the
results in this work are obtained from a modular analysis, which can be of
independent interest. This work establishes that momentum does indeed speed up
neural net training.
Comment: Accepted at ICML 2021.
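For reference, the Polyak (heavy-ball) momentum update analyzed above takes the classical form below; this is the standard textbook update, not the paper's network-specific analysis.

```python
import numpy as np

def heavy_ball(grad_fn, x0, lr=0.01, beta=0.9, steps=1000):
    """Classical Polyak (heavy-ball) momentum:
    x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1})
    """
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(steps):
        x_next = x - lr * grad_fn(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```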
Aggregated Momentum: Stability Through Passive Damping
Momentum is a simple and widely used trick which allows gradient-based
optimizers to pick up speed along low curvature directions. Its performance
depends crucially on a damping coefficient $\beta$. Large $\beta$ values can potentially deliver much larger speedups, but are prone to oscillations and instability; hence one typically resorts to small values such as 0.5 or 0.9. We propose Aggregated Momentum (AggMo), a variant of momentum which combines multiple velocity vectors with different $\beta$ parameters. AggMo is trivial to implement, but significantly dampens oscillations, enabling it to remain stable even for aggressive $\beta$ values such as 0.999. We reinterpret
Nesterov's accelerated gradient descent as a special case of AggMo and analyze
rates of convergence for quadratic objectives. Empirically, we find that AggMo
is a suitable drop-in replacement for other momentum methods, and frequently
delivers faster convergence.
Comment: 11 primary pages, 11 supplementary pages, 12 figures total.
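A sketch of the aggregated-momentum idea described above: maintain several velocity buffers with different damping coefficients and average their contributions. The coefficient choices and the exact averaging below are illustrative and follow the abstract's description only loosely.

```python
import numpy as np

def aggmo_step(theta, velocities, g, lr=0.1, betas=(0.0, 0.9, 0.99)):
    """One sketched AggMo step: K velocity buffers with different damping betas."""
    for k, beta in enumerate(betas):
        velocities[k] = beta * velocities[k] - g    # each buffer damps differently
    return theta + (lr / len(betas)) * sum(velocities), velocities

# Illustrative usage on f(theta) = 0.5 * ||theta||^2
theta = np.ones(4)
velocities = [np.zeros_like(theta) for _ in (0.0, 0.9, 0.99)]
for _ in range(200):
    theta, velocities = aggmo_step(theta, velocities, g=theta)
```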
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Increasing the batch size is a popular way to speed up neural network
training, but beyond some critical batch size, larger batch sizes yield
diminishing returns. In this work, we study how the critical batch size changes
based on properties of the optimization algorithm, including acceleration and
preconditioning, through two different lenses: large scale experiments, and
analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate
that optimization algorithms that employ preconditioning, specifically Adam and
K-FAC, result in much larger critical batch sizes than stochastic gradient
descent with momentum. We also demonstrate that the NQM captures many of the
essential features of real neural network training, despite being drastically
simpler to work with. The NQM predicts our results with preconditioned
optimizers, previous results with accelerated gradient descent, and other
results around optimal learning rates and large batch training, making it a
useful tool to generate testable predictions about neural network optimization.
Comment: NeurIPS 2019.
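The noisy quadratic model mentioned above can be written down in a few lines: a diagonal quadratic loss whose stochastic gradients are corrupted by noise. The curvatures, noise scale, and learning rate below are arbitrary placeholders, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.logspace(-2, 0, num=10)          # placeholder per-dimension curvatures
theta = np.ones_like(h)

def noisy_gradient(theta, noise_scale=0.1):
    """Gradient of 0.5 * sum(h_i * theta_i^2) plus isotropic noise."""
    return h * theta + noise_scale * rng.standard_normal(theta.shape)

# plain SGD on the noisy quadratic model
lr = 0.5
for _ in range(1000):
    theta -= lr * noisy_gradient(theta)

loss = 0.5 * np.sum(h * theta ** 2)     # steady-state risk remains noise-limited
```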
Demon: Improved Neural Network Training with Momentum Decay
Momentum is a widely used technique for gradient-based optimizers in deep
learning. In this paper, we propose a decaying momentum (\textsc{Demon}) rule.
We conduct the first large-scale empirical analysis of momentum decay methods
for modern neural network optimization, in addition to the most popular
learning rate decay schedules. Across 28 relevant combinations of models,
epochs, datasets, and optimizers, \textsc{Demon} achieves the highest number of
Top-1 and Top-3 finishes, at 39\% and 85\% respectively, almost doubling the second-placed learning rate cosine schedule at 17\% and 60\%.
\textsc{Demon} also outperforms other widely used schedulers including, but not
limited to, the learning rate step schedule, linear schedule, OneCycle
schedule, and exponential schedule. Compared with the widely used learning rate
step schedule, \textsc{Demon} is observed to be less sensitive to parameter
tuning, which is critical to training neural networks in practice. Results are
demonstrated across a variety of settings and architectures, including image
classification, generative models, and language models. \textsc{Demon} is easy
to implement, requires no additional tuning, and incurs almost no extra
computational overhead compared to the vanilla counterparts. Code is readily available.
Comment: 12 pages.
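The decaying momentum rule itself is not spelled out in the abstract above. The sketch below assumes a schedule that decays the ratio beta/(1 - beta) linearly to zero over training, which is one way to implement a Demon-style decay; treat the exact functional form as an assumption rather than the authors' definition.

```python
def demon_beta(t, total_steps, beta_init=0.9):
    """Sketch of a decaying-momentum schedule (Demon-style).

    Assumes beta_t / (1 - beta_t) decays linearly from beta_init / (1 - beta_init)
    to zero over total_steps; the paper's exact rule may differ.
    """
    frac = 1.0 - t / float(total_steps)           # remaining fraction of training
    z = frac * beta_init / (1.0 - beta_init)      # decayed beta/(1-beta) ratio
    return z / (1.0 + z)                          # map back to beta_t
```

At t = 0 this returns beta_init and at t = total_steps it returns 0, so the momentum contribution is gradually removed as training ends.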