Exploit Where Optimizer Explores via Residuals

Abstract

To train neural networks faster, many efforts have been devoted to exploring a better solution trajectory, but few to exploiting the existing one. To exploit the trajectory of the (momentum) stochastic gradient descent (SGD(m)) method, we propose a novel method, SGD(m) with residuals (RSGD(m)), which boosts both convergence and generalization. The new method can also be applied to other optimizers such as ASGD and Adam. We provide theoretical analysis showing that RSGD achieves a smaller growth rate of the generalization error and the same (but empirically better) convergence rate compared with SGD. Extensive deep learning experiments on image classification, language modeling, and graph convolutional neural networks show that the proposed algorithm is faster than SGD(m)/Adam at the initial training stage, and is similar to or better than SGD(m) at the end of training, with better generalization error.
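
To make "exploiting the existing solution trajectory via residuals" concrete, the sketch below gives one plausible reading for illustration only: plain SGD augmented with a correction along the residual between the current iterate and a running average of past iterates. The update rule, the hyperparameters beta and alpha, and the rsgd_sketch helper are assumptions made for this sketch, not the paper's specification of RSGD(m).

    import numpy as np

    def rsgd_sketch(grad, x0, lr=0.1, beta=0.9, alpha=0.1, steps=100):
        """Illustrative SGD-with-residual sketch (hypothetical reading).

        Keeps an exponential moving average (EMA) of past iterates and adds a
        correction along the residual x - EMA(x), i.e. it reuses information
        from the existing trajectory instead of only following the gradient.
        """
        x = x0.copy()
        x_avg = x0.copy()                              # running average of the trajectory
        for _ in range(steps):
            x_avg = beta * x_avg + (1.0 - beta) * x    # track past iterates
            residual = x - x_avg                       # trajectory residual
            x = x - lr * grad(x) - alpha * residual    # plain SGD step plus residual term
        return x

    # Toy usage: minimize f(x) = 0.5 * ||A x - b||^2
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    grad = lambda x: A.T @ (A @ x - b)
    x_hat = rsgd_sketch(grad, np.zeros(5))
    print(np.linalg.norm(A @ x_hat - b))

The same wrapper pattern (track an average of iterates, add a residual correction to the base update) could in principle be layered on momentum SGD or Adam, which is the sense in which the abstract says the method applies to other optimizers.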
