To train neural networks faster, many efforts have been devoted to exploring better solution trajectories, but few to exploiting the existing solution trajectory. To exploit the trajectory of the (momentum) stochastic gradient descent (SGD(m)) method, we propose a novel method, SGD(m) with residuals (RSGD(m)), which improves both convergence and generalization. Our new method can also be
applied to other optimizers such as ASGD and Adam. We provide a theoretical analysis showing that RSGD achieves a smaller growth rate of the generalization error than SGD and the same theoretical convergence rate, while converging faster in practice.
Extensive deep learning experiments on image classification, language modeling, and graph convolutional neural networks show that the proposed algorithm converges faster than SGD(m)/Adam in the initial training stage and performs similarly to or better than SGD(m) at the end of training, with better generalization error.
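The abstract does not spell out the RSGD(m) update rule, so the sketch below only illustrates the two ingredients it refers to: a standard SGD-with-momentum step, plus one purely assumed way of reusing the existing solution trajectory through a running average of past iterates. The function names, the parameter alpha, and the averaging form are illustrative assumptions, not the paper's method.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One step of plain SGD with momentum (the SGD(m) baseline)."""
    v = beta * v + grad      # momentum buffer accumulates gradients
    w = w - lr * v           # parameter update along the buffer
    return w, v

def trajectory_averaged_step(w, w_avg, v, grad, lr=0.1, beta=0.9, alpha=0.1):
    """Hypothetical sketch of exploiting the trajectory: a running average
    of past iterates is maintained alongside the SGD(m) step.
    This is an assumed illustration, NOT the RSGD(m) update rule."""
    w, v = sgd_momentum_step(w, v, grad, lr, beta)
    w_avg = (1.0 - alpha) * w_avg + alpha * w   # average over the trajectory
    return w, w_avg, v

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([5.0, -3.0]); w_avg = w.copy(); v = np.zeros_like(w)
for _ in range(100):
    w, w_avg, v = trajectory_averaged_step(w, w_avg, v, grad=w)
print(w, w_avg)   # both approach the minimizer at the origin
```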