Exploit Where Optimizer Explores via Residuals
In order to train neural networks faster, many efforts have been devoted to exploring a better solution trajectory, but few have been put into exploiting the existing solution trajectory. To exploit the trajectory of the (momentum) stochastic gradient descent (SGD(m)) method, we propose a novel method named SGD(m) with residuals (RSGD(m)), which boosts both convergence and generalization. Our new method can also be applied to other optimizers such as ASGD and Adam. We provide theoretical analysis showing that RSGD achieves a smaller growth rate of the generalization error and the same (but empirically better) convergence rate compared with SGD. Extensive deep learning experiments on image classification, language modeling and graph convolutional neural networks show that the proposed algorithm is faster than SGD(m)/Adam at the initial training stage, and similar to or better than SGD(m) at the end of training, with better generalization error.
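The abstract does not spell out the RSGD(m) update rule, so the Python sketch below only illustrates one plausible reading of "exploiting the existing trajectory": a residual built from the previous parameter displacement is folded back into the momentum step. All names and parameters (`rsgd_m_step`, `beta`, `gamma`, `residual`) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def rsgd_m_step(w, grad, state, lr=0.1, beta=0.9, gamma=0.1):
    """Hypothetical 'SGD with momentum plus residual' step.

    Not the paper's exact rule (the abstract omits it); it only sketches
    the idea of reusing the solution trajectory: a residual formed from
    the previous parameter displacement is added to the momentum step.
    """
    state.setdefault("momentum", np.zeros_like(w))
    state.setdefault("residual", np.zeros_like(w))

    state["momentum"] = beta * state["momentum"] + grad
    step = lr * state["momentum"] + gamma * state["residual"]
    w_new = w - step
    state["residual"] = w - w_new  # displacement along the trajectory
    return w_new, state

# Toy usage: minimize ||w||^2, whose gradient is 2w.
w, state = np.ones(3), {}
for _ in range(10):
    w, state = rsgd_m_step(w, 2 * w, state)
```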
Training Faster with Compressed Gradient
Although distributed machine learning methods show potential for speeding up the training of large deep neural networks, communication cost has been a notorious bottleneck constraining performance. To address this challenge, gradient-compression-based communication-efficient distributed learning methods were designed to reduce the communication cost, and more recently local error feedback was incorporated to compensate for the performance loss. However, in this paper we show the "gradient mismatch" problem of local error feedback in centralized distributed training, an issue that can lead to degraded performance compared with full-precision training. To solve this critical problem, we propose two novel techniques: 1) step ahead; 2) error averaging. Both our theoretical and empirical results show that the new methods alleviate the "gradient mismatch" problem. Experiments show that we can even train \textbf{faster with compressed gradient} than full-precision training \textbf{regarding training epochs}.
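For context, a minimal NumPy sketch of the baseline the abstract builds on, centralized compressed SGD with local error feedback, is given below. The paper's "step ahead" and "error averaging" remedies are not reproduced, and the top-k compressor and all names are assumptions chosen for illustration.

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_compressed_sgd_step(w, grads, errors, lr=0.1, k=2):
    """One centralized step of compressed SGD with local error feedback.

    Each worker adds its locally accumulated error to the fresh gradient,
    compresses the result, and stores what was dropped; the server averages
    the compressed gradients.  The paper's 'step ahead' and 'error
    averaging' techniques are intentionally omitted here.
    """
    compressed = []
    for i, g in enumerate(grads):
        corrected = g + errors[i]        # error feedback: add back what was lost
        c = topk_compress(corrected, k)  # lossy compression before communication
        errors[i] = corrected - c        # remember the residual locally
        compressed.append(c)
    avg = np.mean(compressed, axis=0)    # server-side aggregation
    return w - lr * avg, errors

# Toy usage with two workers on a 4-dimensional model.
w = np.ones(4)
errors = [np.zeros(4), np.zeros(4)]
grads = [2 * w + 0.1, 2 * w - 0.1]
w, errors = ef_compressed_sgd_step(w, grads, errors)
```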
High-Dimensional Stochastic Gradient Quantization for Communication-Efficient Edge Learning
Edge machine learning involves the deployment of learning algorithms at the
wireless network edge so as to leverage massive mobile data for enabling
intelligent applications. The mainstream edge learning approach, federated
learning, has been developed based on distributed gradient descent. In this
approach, stochastic gradients are computed at edge devices and then
transmitted to an edge server for updating a global AI model. Since each
stochastic gradient is typically high-dimensional (with millions to billions of
coefficients), communication overhead becomes a bottleneck for edge learning.
To address this issue, we propose in this work a novel framework of
hierarchical stochastic gradient quantization and study its effect on the
learning performance. First, the framework features a practical hierarchical
architecture for decomposing the stochastic gradient into its norm and
normalized block gradients, and efficiently quantizes them using a uniform
quantizer and a low-dimensional codebook on a Grassmann manifold, respectively.
Subsequently, the quantized normalized block gradients are scaled and cascaded
to yield the quantized normalized stochastic gradient using a so-called hinge
vector designed under the criterion of minimum distortion. The hinge vector is
also efficiently compressed using another low-dimensional Grassmannian
quantizer. The other feature of the framework is a bit-allocation scheme for
reducing the quantization error. The scheme determines the resolutions of the
low-dimensional quantizers in the proposed framework. The framework is proved
to guarantee model convergence by analyzing the convergence rate as a function of the quantization bits. Furthermore, simulations show that our design substantially reduces the communication overhead compared with the state-of-the-art signSGD scheme while achieving similar learning accuracy.
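As a rough illustration of the decomposition step, the sketch below splits a gradient into fixed-size blocks, maps each normalized block to the nearest codeword in a random unit-norm codebook (a stand-in for the paper's Grassmannian codebook), and uniformly quantizes the gradient norm. The hinge vector and the bit-allocation scheme are omitted, and all function names and parameters are assumptions, not the paper's design.

```python
import numpy as np

def quantize_gradient(g, block_size=4, codebook_size=16, norm_levels=32, seed=0):
    """Simplified norm / block-direction gradient quantization.

    A random unit-norm codebook (shared via the seed) stands in for the
    Grassmannian codebook; the hinge vector and bit allocation from the
    paper are omitted.  Assumes len(g) is divisible by block_size.
    """
    rng = np.random.default_rng(seed)
    blocks = g.reshape(-1, block_size)

    # Shared random codebook of unit-norm codewords.
    codebook = rng.standard_normal((codebook_size, block_size))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

    # Map each normalized block to its nearest codeword (max inner product).
    dirs = blocks / (np.linalg.norm(blocks, axis=1, keepdims=True) + 1e-12)
    idx = np.argmax(dirs @ codebook.T, axis=1)

    # Uniformly quantize the overall gradient norm.
    q_norm = np.round(np.linalg.norm(g) * norm_levels) / norm_levels

    # Receiver-side reconstruction from the codeword indices and the norm.
    recon = codebook[idx].reshape(g.shape)
    recon *= q_norm / (np.linalg.norm(recon) + 1e-12)
    return recon, idx, q_norm

# Toy usage on a 16-dimensional gradient.
g = np.random.default_rng(1).standard_normal(16)
recon, idx, q_norm = quantize_gradient(g)
```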