12 research outputs found

    Exploit Where Optimizer Explores via Residuals

    In order to train neural networks faster, many efforts have been devoted to exploring a better solution trajectory, but few have been put into exploiting the existing solution trajectory. To exploit the trajectory of the (momentum) stochastic gradient descent (SGD(m)) method, we propose a novel method named SGD(m) with residuals (RSGD(m)), which improves both convergence and generalization. Our new method can also be applied to other optimizers such as ASGD and Adam. We provide a theoretical analysis showing that RSGD achieves a smaller growth rate of the generalization error and the same (but empirically better) convergence rate compared with SGD. Extensive deep learning experiments on image classification, language modeling and graph convolutional neural networks show that the proposed algorithm is faster than SGD(m)/Adam at the initial training stage, and similar to or better than SGD(m) at the end of training, with better generalization error.
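    The abstract does not spell out the RSGD(m) update rule, so the following is only a minimal NumPy sketch: standard momentum SGD in which a hypothetical residual term (the difference between consecutive iterates, scaled by an assumed coefficient residual_coef) is folded back into the update, purely to illustrate the idea of reusing the existing solution trajectory rather than the paper's actual algorithm.

        import numpy as np

        def rsgd_m_sketch(grad_fn, w0, lr=0.1, momentum=0.9, residual_coef=0.1, steps=100):
            """Momentum SGD plus a hypothetical 'residual' term built from the
            iterate trajectory. The form of the residual and residual_coef are
            illustrative assumptions, not the update rule from the paper."""
            w = w0.copy()
            v = np.zeros_like(w)       # momentum buffer
            w_prev = w0.copy()         # previous iterate, used to form the residual
            for _ in range(steps):
                g = grad_fn(w)         # (stochastic) gradient at the current iterate
                v = momentum * v + g   # standard heavy-ball momentum accumulation
                residual = w - w_prev  # most recent displacement along the trajectory
                w_prev = w.copy()
                w = w - lr * v + residual_coef * residual  # reuse the trajectory
            return w

        # Toy usage: a quadratic 0.5 * ||w||^2, whose gradient is w itself.
        w_final = rsgd_m_sketch(lambda w: w, np.ones(5))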

    Training Faster with Compressed Gradient

    Although distributed machine learning methods show potential for speeding up the training of large deep neural networks, communication cost has been the notorious bottleneck constraining performance. To address this challenge, gradient-compression-based communication-efficient distributed learning methods were designed to reduce the communication cost, and more recently local error feedback was incorporated to compensate for the performance loss. In this paper, however, we show the "gradient mismatch" problem of local error feedback in centralized distributed training, an issue that can lead to degraded performance compared with full-precision training. To solve this critical problem, we propose two novel techniques: 1) step ahead; 2) error averaging. Both our theoretical and empirical results show that the new methods alleviate the "gradient mismatch" problem. Experiments show that we can even train faster with compressed gradients than with full-precision training in terms of training epochs.
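    For context, the sketch below shows the baseline local error feedback loop (EF-SGD style) that such gradient-compression methods build on: each worker compresses its scaled gradient plus the locally accumulated compression error and carries the leftover error forward to the next step. The top-k compressor and the single-worker framing are illustrative assumptions; the paper's "step ahead" and "error averaging" corrections are not reproduced here because the abstract does not detail them.

        import numpy as np

        def topk_compress(g, k):
            """Keep the k largest-magnitude entries of g and zero out the rest."""
            out = np.zeros_like(g)
            idx = np.argpartition(np.abs(g), -k)[-k:]
            out[idx] = g[idx]
            return out

        def ef_sgd_worker_step(w, grad, error, lr=0.1, k=10):
            """One worker-side step of standard local error feedback:
            compress (lr * gradient + carried-over error), apply the compressed
            update, and keep the compression residual for the next step.
            In a real centralized setup the server would average the compressed
            updates from all workers before applying them."""
            corrected = lr * grad + error       # add local error before compressing
            compressed = topk_compress(corrected, k)
            error = corrected - compressed      # residual carried to the next step
            w = w - compressed                  # apply (the aggregate of) the update
            return w, error

        # Toy usage: one step on a 100-dimensional quadratic whose gradient is w.
        w, err = np.ones(100), np.zeros(100)
        w, err = ef_sgd_worker_step(w, grad=w.copy(), error=err, lr=0.1, k=10)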

    High-Dimensional Stochastic Gradient Quantization for Communication-Efficient Edge Learning

    Edge machine learning involves deploying learning algorithms at the wireless network edge so as to leverage massive mobile data for enabling intelligent applications. The mainstream edge learning approach, federated learning, has been developed based on distributed gradient descent: stochastic gradients are computed at edge devices and then transmitted to an edge server for updating a global AI model. Since each stochastic gradient is typically high-dimensional (with millions to billions of coefficients), communication overhead becomes a bottleneck for edge learning. To address this issue, we propose in this work a novel framework of hierarchical stochastic gradient quantization and study its effect on the learning performance. First, the framework features a practical hierarchical architecture that decomposes the stochastic gradient into its norm and normalized block gradients, and efficiently quantizes them using a uniform quantizer and a low-dimensional codebook on a Grassmann manifold, respectively. Subsequently, the quantized normalized block gradients are scaled and cascaded to yield the quantized normalized stochastic gradient using a so-called hinge vector designed under the criterion of minimum distortion. The hinge vector is also efficiently compressed using another low-dimensional Grassmannian quantizer. The other feature of the framework is a bit-allocation scheme for reducing the quantization error; the scheme determines the resolutions of the low-dimensional quantizers in the proposed framework. The framework is proved to guarantee model convergence by analyzing the convergence rate as a function of the quantization bits. Furthermore, by simulation, our design is shown to substantially reduce the communication overhead compared with the state-of-the-art signSGD scheme, while both achieve similar learning accuracies.
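    The quantizer itself is only outlined in the abstract, so the sketch below illustrates just the norm/direction decomposition under simplifying assumptions: per-block norms are quantized uniformly and each normalized block is mapped to its nearest codeword (up to sign) in a random unit-norm codebook standing in for the Grassmannian codebook; the hinge-vector cascading and the bit-allocation scheme are omitted.

        import numpy as np

        def quantize_gradient_sketch(g, block_size=8, codebook_size=16, norm_bits=8, seed=0):
            """Hierarchical quantization sketch: split g into blocks, quantize each
            block norm uniformly, and encode each normalized block by the index of
            its nearest codeword. The random codebook and per-block norms are
            simplifying assumptions, not the paper's Grassmannian construction."""
            rng = np.random.default_rng(seed)
            blocks = g.reshape(-1, block_size)   # requires len(g) % block_size == 0

            # Random unit-norm codebook (assumption; the paper designs it on a Grassmann manifold).
            codebook = rng.standard_normal((codebook_size, block_size))
            codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

            # Uniform quantization of the block norms with norm_bits bits.
            norms = np.linalg.norm(blocks, axis=1)
            levels = 2 ** norm_bits - 1
            scale = norms.max() / levels if norms.max() > 0 else 1.0
            q_norms = np.round(norms / scale) * scale

            # Nearest codeword for each normalized block, matched up to sign
            # (directions on a Grassmann manifold are sign-ambiguous).
            normalized = blocks / np.maximum(norms[:, None], 1e-12)
            sims = normalized @ codebook.T
            idx = np.abs(sims).argmax(axis=1)
            signs = np.sign(sims[np.arange(len(blocks)), idx])

            # Reconstruction as the server would form it from the transmitted indices.
            return (q_norms[:, None] * signs[:, None] * codebook[idx]).reshape(g.shape)

        # Toy usage: quantize a 64-dimensional gradient and inspect the relative distortion.
        g = np.random.default_rng(1).standard_normal(64)
        g_hat = quantize_gradient_sketch(g)
        print(np.linalg.norm(g - g_hat) / np.linalg.norm(g))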