A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance
Stochastic Gradient Descent (SGD) is a popular tool in training large-scale
machine learning models. Its performance, however, is highly variable,
depending crucially on the choice of the step sizes. Accordingly, a variety of
strategies for tuning the step sizes have been proposed, ranging from
coordinate-wise approaches (a.k.a. ``adaptive'' step sizes) to sophisticated
heuristics to change the step size in each iteration. In this paper, we study
two step size schedules whose power has been repeatedly confirmed in practice:
the exponential and the cosine step sizes. For the first time, we provide
theoretical support for them, proving convergence rates for smooth non-convex
functions, with and without the Polyak-\L{}ojasiewicz (PL) condition. Moreover,
we show the surprising property that these two strategies are \emph{adaptive}
to the noise level in the stochastic gradients of PL functions. That is,
contrary to polynomial step sizes, they achieve almost optimal performance
without needing to know the noise level or to tune their hyperparameters based
on it. Finally, we conduct a fair and comprehensive empirical evaluation on
real-world datasets with deep learning architectures. Results show that, even
though they require at most two hyperparameters to tune, these two strategies
match or outperform various finely tuned state-of-the-art strategies.
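For concreteness, here is a minimal sketch of the two schedules inside a plain SGD loop (an illustration, not code from the paper): eta0 is the initial step size, alpha the exponential decay factor, and T the horizon; the paper's exact parameterization (e.g., how alpha is tied to T) may differ.

```python
import math

def exponential_step_size(eta0, alpha, t):
    # Exponential schedule: eta_t = eta0 * alpha**t with decay factor alpha in (0, 1).
    return eta0 * (alpha ** t)

def cosine_step_size(eta0, t, T):
    # Cosine schedule: eta_t = (eta0 / 2) * (1 + cos(pi * t / T)) over a horizon of T steps.
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))

def sgd(w, grad_fn, eta0=0.1, alpha=0.999, T=1000, schedule="cosine"):
    # Plain SGD loop; grad_fn(w) returns a stochastic gradient at w.
    for t in range(T):
        eta = (cosine_step_size(eta0, t, T) if schedule == "cosine"
               else exponential_step_size(eta0, alpha, t))
        w = w - eta * grad_fn(w)
    return w
```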
STL-SGD: Speeding Up Local SGD with Stagewise Communication Period
Distributed parallel stochastic gradient descent algorithms are workhorses
for large scale machine learning tasks. Among them, local stochastic gradient
descent (Local SGD) has attracted significant attention due to its low
communication complexity. Previous studies bound the communication complexity
of Local SGD with a fixed or an adaptive communication period, both when the
data distributions on the clients are identical (IID) and when they are not
(Non-IID), with bounds that grow with the number of clients and the number of
iterations. In this paper, to accelerate convergence by reducing the
communication complexity, we propose \textit{ST}agewise \textit{L}ocal
\textit{SGD} (STL-SGD), which increases the communication period gradually
along with decreasing learning rate. We prove that STL-SGD can keep the same
convergence rate and linear speedup as mini-batch SGD. In addition, as a
benefit of the increasing communication period, when the objective is strongly
convex or satisfies the Polyak-\L{}ojasiewicz condition, the communication
complexity of STL-SGD is significantly lower than that of Local SGD, in both
the IID and the Non-IID case. Experiments on both convex and non-convex
problems demonstrate the superior performance of STL-SGD.
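To illustrate the stagewise idea, here is a rough sketch, under assumed choices, of Local SGD in which each stage halves the learning rate and doubles the communication period; the stage lengths and the specific halving/doubling factors are illustrative, not the schedule analyzed in the paper.

```python
import numpy as np

def stl_sgd_sketch(client_grads, w0, eta0=0.1, I0=2, num_stages=5, rounds_per_stage=20):
    # client_grads[k](w) returns a stochastic gradient evaluated on client k's data.
    w, eta, I = np.asarray(w0, dtype=float), eta0, I0
    for _ in range(num_stages):
        for _ in range(rounds_per_stage):
            local = []
            for g in client_grads:           # every client starts from the current global model
                wk = w.copy()
                for _ in range(I):           # I local SGD steps without any communication
                    wk -= eta * g(wk)
                local.append(wk)
            w = np.mean(local, axis=0)       # one communication round: average the local models
        eta *= 0.5                           # stagewise: decrease the learning rate ...
        I *= 2                               # ... while increasing the communication period
    return w
```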
Adaptive Strategies in Non-convex Optimization
An algorithm is said to be adaptive to a certain parameter (of the problem)
if it does not need a priori knowledge of such a parameter but performs
competitively to those that know it. This dissertation presents our work on
adaptive algorithms in the following scenarios: 1. In the stochastic
optimization setting, we only receive stochastic gradients, and the level of
noise in evaluating them greatly affects the convergence rate. Without prior
knowledge of the noise scale, tuning is typically required to achieve the
optimal rate. Considering this, we designed and analyzed noise-adaptive
algorithms that automatically ensure (near-)optimal rates under different
noise scales without knowing them. 2. In training deep neural networks, the
scales of gradient magnitudes in each coordinate can scatter across a very wide
range unless normalization techniques, like BatchNorm, are employed. In such
situations, algorithms not addressing this problem of gradient scales can
behave very poorly. To mitigate this, we formally established the advantage of
scale-free algorithms that adapt to the gradient scales and demonstrated their real
benefits in empirical experiments. 3. Traditional analyses in non-convex
optimization typically rely on the smoothness assumption. Yet, this condition
does not capture the properties of some deep learning objective functions,
including the ones involving Long Short-Term Memory networks and Transformers.
Instead, they satisfy a much more relaxed condition, with potentially unbounded
smoothness. Under this condition, we show that a generalized SignSGD algorithm
can theoretically match the best-known convergence rates obtained by SGD with
gradient clipping but does not need explicit clipping at all, and it can
empirically match the performance of Adam and outperform other baselines.
Moreover, it can also be made to adapt automatically to the unknown relaxed smoothness.
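As a rough illustration of the sign-based, scale-free flavor of such an update (the dissertation's generalized SignSGD is more general than this plain momentum-sign variant, which is only an assumed simplification):

```python
import numpy as np

def sign_sgd_momentum(grad_fn, w0, eta=0.01, beta=0.9, T=1000):
    # Sign-based step with momentum: the direction uses only the signs of the
    # momentum buffer, so each coordinate's step is insensitive to its gradient scale.
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    for _ in range(T):
        g = grad_fn(w)                   # stochastic gradient
        m = beta * m + (1.0 - beta) * g  # exponential moving average of gradients
        w -= eta * np.sign(m)            # coordinate-wise sign step, no explicit clipping
    return w
```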
Attentional Biased Stochastic Gradient for Imbalanced Classification
In this paper, we present a simple yet effective method (ABSGD) for
addressing the data imbalance issue in deep learning. Our method is a simple
modification to momentum SGD where we leverage an attentional mechanism to
assign an individual importance weight to each gradient in the mini-batch.
Unlike many existing heuristic-driven methods for tackling data imbalance, our
method is grounded in {\it theoretically justified distributionally robust
optimization (DRO)}, which is guaranteed to converge to a stationary point of
an information-regularized DRO problem. The individual-level weight of a
sampled example is proportional to the exponential of its scaled loss value,
where the scaling factor is interpreted as the
regularization parameter in the framework of information-regularized DRO.
Compared with existing class-level weighting schemes, our method can capture
the diversity between individual examples within each class. Compared with
existing individual-level weighting methods using meta-learning that require
three backward propagations for computing mini-batch stochastic gradients, our
method is more efficient with only one backward propagation at each iteration
as in standard deep learning methods. To balance between the learning of
feature extraction layers and the learning of the classifier layer, we employ a
two-stage method that uses SGD for pretraining followed by ABSGD for learning a
robust classifier and finetuning lower layers. Our empirical studies on several
benchmark datasets demonstrate the effectiveness of the proposed method.
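A minimal sketch of the per-example weighting, assuming a PyTorch classifier and a momentum SGD optimizer; the softmax normalization over the mini-batch and the name lam for the scaling (regularization) parameter are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def absgd_style_step(model, optimizer, x, y, lam=5.0):
    # Per-example losses become attention-like weights proportional to exp(loss / lam);
    # the weighted loss is backpropagated once, as in standard training.
    optimizer.zero_grad()
    losses = F.cross_entropy(model(x), y, reduction="none")  # one loss per example
    weights = torch.softmax(losses.detach() / lam, dim=0)    # larger loss -> larger weight
    (weights * losses).sum().backward()                      # single backward pass
    optimizer.step()
```

Detaching the losses inside the softmax keeps the weights fixed during backpropagation, so only one backward pass per iteration is needed.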