Adaptive Strategies in Non-convex Optimization
An algorithm is said to be adaptive to a certain parameter (of the problem)
if it does not need a priori knowledge of such a parameter but performs
competitively to those that know it. This dissertation presents our work on
adaptive algorithms in the following scenarios: 1. In the stochastic optimization
setting, we only receive stochastic gradients and the level of noise in
evaluating them greatly affects the convergence rate. Without prior knowledge of the noise scale, tuning is typically required to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that automatically ensure (near-)optimal rates under different noise scales without knowing them. 2. In training deep neural networks, the
scales of gradient magnitudes in each coordinate can scatter across a very wide
range unless normalization techniques, like BatchNorm, are employed. In such
situations, algorithms not addressing this problem of gradient scales can
behave very poorly. To mitigate this, we formally established the advantage of
scale-free algorithms that adapt to the gradient scales and demonstrated their real benefits in empirical experiments. 3. Traditional analyses in non-convex
optimization typically rely on the smoothness assumption. Yet, this condition
does not capture the properties of some deep learning objective functions,
including the ones involving Long Short-Term Memory networks and Transformers.
Instead, they satisfy a much more relaxed condition, with potentially unbounded
smoothness. Under this condition, we show that a generalized SignSGD algorithm
can theoretically match the best-known convergence rates obtained by SGD with
gradient clipping but does not need explicit clipping at all, and it can
empirically match the performance of Adam and beat others. Moreover, it can
also be made to automatically adapt to the unknown relaxed smoothness.
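For readers who want a concrete picture of what a noise-adaptive step size can look like, the short Python sketch below uses an AdaGrad-norm-style rule, eta_t = eta_0 / sqrt(b_0 + sum of past squared gradient norms), which automatically slows down when the stochastic gradients are noisy. It is a classical illustration of the idea only; the function names and hyperparameters are ours, and it is not claimed to be the dissertation's algorithm.

import numpy as np

def adagrad_norm_sgd(grad_oracle, x0, eta0=1.0, b0=1e-8, T=1000):
    """SGD with the AdaGrad-norm step size eta0 / sqrt(b0 + sum_s ||g_s||^2).

    The accumulator grows faster when the gradient noise is large, so the step
    size shrinks automatically without knowing the noise level in advance.
    """
    x = np.array(x0, dtype=float)
    acc = b0
    for _ in range(T):
        g = grad_oracle(x)
        acc += float(np.dot(g, g))
        x = x - (eta0 / np.sqrt(acc)) * g
    return x

# Toy usage: minimize 0.5*||x||^2 from noisy gradients with an unknown noise scale.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.5 * rng.standard_normal(x.shape)
print(adagrad_norm_sgd(noisy_grad, x0=np.ones(5)))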
A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance
Stochastic Gradient Descent (SGD) is a popular tool in training large-scale
machine learning models. Its performance, however, is highly variable,
depending crucially on the choice of the step sizes. Accordingly, a variety of
strategies for tuning the step sizes have been proposed, ranging from
coordinate-wise approaches (a.k.a. “adaptive” step sizes) to sophisticated
heuristics to change the step size in each iteration. In this paper, we study
two step size schedules whose power has been repeatedly confirmed in practice:
the exponential and the cosine step sizes. For the first time, we provide
theoretical support for them, proving convergence rates for smooth non-convex functions, with and without the Polyak-Łojasiewicz (PL) condition. Moreover, we show the surprising property that these two strategies are adaptive
to the noise level in the stochastic gradients of PL functions. That is,
contrary to polynomial step sizes, they achieve almost optimal performance
without needing to know the noise level or to tune their hyperparameters based on it. Finally, we conduct a fair and comprehensive empirical evaluation on real-world datasets with deep learning architectures. Results show that, even though they require at most two hyperparameters to tune, these two strategies best or match the performance of various finely-tuned state-of-the-art strategies.
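To make the two schedules studied in this paper concrete, the Python sketch below implements an exponential step size eta_t = eta_0 * alpha^t and a cosine step size eta_t = eta_0 * (1 + cos(pi*t/T)) / 2 and runs each inside a plain SGD loop on a toy quadratic. The decay factor, initial step size, and toy objective are illustrative choices for this sketch, not the paper's experimental settings.

import numpy as np

def exponential_step(eta0, alpha, t):
    """Exponential schedule: eta_t = eta0 * alpha**t, with 0 < alpha < 1."""
    return eta0 * alpha ** t

def cosine_step(eta0, t, T):
    """Cosine schedule: eta_t = eta0 * (1 + cos(pi * t / T)) / 2."""
    return eta0 * 0.5 * (1.0 + np.cos(np.pi * t / T))

def sgd(schedule, T=1000, seed=0):
    """Run SGD with noisy gradients on f(x) = 0.5 * ||x||^2 using the given schedule."""
    rng = np.random.default_rng(seed)
    x = np.ones(10)
    for t in range(T):
        grad = x + 0.1 * rng.standard_normal(x.shape)  # stochastic gradient of 0.5*||x||^2
        x = x - schedule(t) * grad
    return 0.5 * np.dot(x, x)

T = 1000
alpha = (1.0 / T) ** (1.0 / T)   # one common way to set the exponential decay over a horizon T
print("exponential:", sgd(lambda t: exponential_step(0.5, alpha, t), T=T))
print("cosine     :", sgd(lambda t: cosine_step(0.5, t, T), T=T))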
Adaptive strategies in non-convex optimization
Modern applications in machine learning have increasingly adopted non-convex formulations because they can often better capture the problem structure. One prominent example is deep neural networks, which have achieved innumerable successes in various fields including computer vision and natural language processing. However, optimizing a non-convex problem presents much greater difficulties than optimizing a convex one. A vastly popular optimizer for such scenarios is Stochastic Gradient Descent (SGD), but its performance depends crucially on the choice of its step sizes. Tuning step sizes is notoriously laborious, and the optimal choice can vary drastically across different problems. To save the labor of tuning, adaptive algorithms come to the rescue: an algorithm is said to be adaptive to a certain parameter (of the problem) if it does not need a priori knowledge of such a parameter but performs competitively with those that know it.
This dissertation presents our work on adaptive algorithms in the following scenarios:
1. In the stochastic optimization setting, we only receive stochastic gradients and the level of noise in evaluating them greatly affects the convergence rate. Without prior knowledge of the noise scale, tuning is typically required to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that automatically ensure (near-)optimal rates under different noise scales without knowing them.
2. In training deep neural networks, the scales of gradient magnitudes in each coordinate can scatter across a very wide range unless normalization techniques, like BatchNorm, are employed. In such situations, algorithms not addressing this problem of gradient scales can behave very poorly. To mitigate this, we formally established the advantage of scale-free algorithms that adapt to the gradient scales and demonstrated their real benefits in empirical experiments.
3. Traditional analyses in non-convex optimization typically rely on the smoothness assumption. Yet, this condition does not capture the properties of some deep learning objective functions, including the ones involving Long Short-Term Memory (LSTM) networks and Transformers. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this condition, we show that a generalized SignSGD algorithm (which updates each coordinate using only the sign of that coordinate of the stochastic gradient) can theoretically match the best-known convergence rates obtained by SGD with gradient clipping without needing explicit clipping at all, and it can empirically match the performance of Adam and beat others. Moreover, it can also be made to automatically adapt to the unknown relaxed smoothness.
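To make the sign-based update described in item 3 concrete, here is a minimal Python sketch of SignSGD with momentum: each coordinate moves by a fixed amount in the direction of the sign of a momentum average of the stochastic gradients, so the update stays bounded without any explicit clipping. This is a simplified illustration of the idea; the momentum form, hyperparameters, and toy objective are assumptions for the sketch, not the exact generalized SignSGD analyzed in the dissertation.

import numpy as np

def sign_sgd_momentum(grad_oracle, x0, eta=0.01, beta=0.9, T=1000):
    """Sign-based SGD with momentum: m <- beta*m + (1-beta)*g, x <- x - eta*sign(m).

    Because only the sign of each coordinate is used, the per-coordinate update is
    bounded by eta no matter how large the raw gradient is, which plays the role of
    clipping when the objective is only relaxed-smooth.
    """
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(T):
        g = grad_oracle(x)
        m = beta * m + (1.0 - beta) * g
        x = x - eta * np.sign(m)
    return x

# Toy usage on a badly scaled quadratic with noisy gradients.
rng = np.random.default_rng(1)
scales = np.array([1e-3, 1.0, 1e3])
noisy_grad = lambda x: scales * x + 0.1 * rng.standard_normal(x.shape)
print(sign_sgd_momentum(noisy_grad, x0=np.ones(3)))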
Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization
Stochastic Gradient Descent (SGD) has played a central role in machine
learning. However, it requires a carefully hand-picked stepsize for fast
convergence, which is notoriously tedious and time-consuming to tune. Over the
last several years, a plethora of adaptive gradient-based algorithms have
emerged to ameliorate this problem. They have proved efficient in reducing the
labor of tuning in practice, but many of them lack theoretical guarantees even in
the convex setting. In this paper, we propose new surrogate losses to cast the
problem of learning the optimal stepsizes for the stochastic optimization of a
non-convex smooth objective function onto an online convex optimization
problem. This allows the use of no-regret online algorithms to compute optimal
stepsizes on the fly. In turn, this results in an SGD algorithm with self-tuned
stepsizes that guarantees convergence rates that are automatically adaptive to
the level of noise.
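The sketch below illustrates the recipe described in this abstract: treat the stepsize as the decision of an online learner, charge it a convex surrogate loss after every SGD step, and let a no-regret update (projected online gradient descent here) tune the stepsize on the fly. The particular surrogate used below, a smoothness-based quadratic upper bound built from the observed gradient, is an illustrative assumption and not necessarily the surrogate loss proposed in the paper.

import numpy as np

def sgd_with_online_stepsize(grad_oracle, x0, L=1.0, eta_max=1.0, ogd_lr=0.01, T=1000):
    """SGD whose stepsize is tuned by projected online gradient descent.

    After each step the stepsize learner is charged the convex surrogate
        ell_t(eta) = -eta * ||g_t||^2 + 0.5 * L * eta**2 * ||g_t||^2,
    a smoothness-based upper bound on the objective decrease (illustrative choice).
    """
    x = np.array(x0, dtype=float)
    eta = eta_max / 2.0
    for _ in range(T):
        g = grad_oracle(x)
        x = x - eta * g
        sq = float(np.dot(g, g))
        surrogate_grad = -sq + L * eta * sq      # d/d(eta) of ell_t(eta)
        eta = eta - ogd_lr * surrogate_grad      # online gradient descent step
        eta = float(np.clip(eta, 0.0, eta_max))  # project back onto the feasible interval
    return x, eta

# Toy usage: the learner settles near a stepsize that trades progress against curvature.
rng = np.random.default_rng(2)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final, eta_final = sgd_with_online_stepsize(noisy_grad, x0=np.ones(4))
print(x_final, eta_final)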
A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
In distributed training of deep neural networks or Federated Learning (FL),
people usually run Stochastic Gradient Descent (SGD) or its variants on each
machine and communicate with other machines periodically. However, SGD might
converge slowly in training some deep neural networks (e.g., RNN, LSTM) because
of the exploding gradient issue. Gradient clipping is usually employed to
address this issue in the single machine setting, but exploring this technique
in the FL setting is still in its infancy: it remains mysterious whether the
gradient clipping scheme can take advantage of multiple machines to enjoy
parallel speedup. The main technical difficulty lies in dealing with nonconvex
loss function, non-Lipschitz continuous gradient, and skipping communication
rounds simultaneously. In this paper, we explore a relaxed-smoothness
assumption on the loss landscape, which LSTMs were shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This
algorithm can be run on multiple machines, where each machine employs a
gradient clipping scheme and communicates with other machines after multiple
steps of gradient-based updates. Our algorithm is proved to have an $O(1/(N\epsilon^4))$ iteration complexity for finding an $\epsilon$-stationary point, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup. We prove this result by
introducing novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validate our theory.
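As a reading aid for the algorithmic pattern described here (local clipped gradient steps with only periodic communication), the Python sketch below simulates several workers that each run clipped SGD on their own stochastic gradients and average their models every few local steps. The clipping rule, hyperparameters, and toy objectives are illustrative assumptions for the sketch, not the paper's exact algorithm or theory.

import numpy as np

def clip(g, threshold):
    """Rescale g so that its Euclidean norm is at most threshold."""
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

def local_clipped_sgd(grad_oracles, x0, eta=0.1, threshold=1.0, local_steps=10, rounds=50):
    """Simulate N workers doing clipped SGD locally and averaging models every `local_steps` steps."""
    xs = [np.array(x0, dtype=float) for _ in grad_oracles]
    for _ in range(rounds):
        for _ in range(local_steps):                   # local updates, no communication
            for i, oracle in enumerate(grad_oracles):
                xs[i] = xs[i] - eta * clip(oracle(xs[i]), threshold)
        avg = sum(xs) / len(xs)                        # one communication round: model averaging
        xs = [avg.copy() for _ in xs]
    return xs[0]

# Toy usage: 4 workers with differently shifted quadratics and noisy gradients.
rng = np.random.default_rng(3)
shifts = [rng.standard_normal(5) for _ in range(4)]
oracles = [lambda x, s=s: (x - s) + 0.1 * rng.standard_normal(x.shape) for s in shifts]
print(local_clipped_sgd(oracles, x0=np.zeros(5)))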