With the rapid growth of data, distributed stochastic gradient descent~(DSGD)
has been widely used for solving large-scale machine learning problems. Due to
network latency and limited bandwidth, communication has become the bottleneck
of DSGD when training large-scale models such as deep neural networks.
Communication compression with sparsified gradients, abbreviated as
\emph{sparse communication}, has been widely used for reducing communication
cost in DSGD. Recently, a method called deep gradient compression~(DGC) has
been proposed, which combines memory gradient and momentum SGD for sparse
communication. DGC has achieved promising performance in practice. However, a
convergence theory for DGC is still lacking. In this paper, we propose a novel
method, called \emph{\underline{g}}lobal \emph{\underline{m}}omentum
\emph{\underline{c}}ompression~(GMC), for sparse communication in DSGD. GMC
also combines memory gradient and momentum SGD. However, unlike DGC, which
adopts local momentum, GMC adopts global momentum. We theoretically prove the
convergence rate of GMC for both convex and non-convex problems. To the best of
our knowledge, this is the first work that proves the convergence of
distributed momentum SGD~(DMSGD) with sparse communication and memory gradient.
Empirical results show that, compared with the DMSGD counterpart without sparse
communication, GMC can reduce the communication cost by approximately 100-fold
without loss of generalization accuracy. GMC also achieves performance
comparable to~(and sometimes better than) that of DGC, with the additional
benefit of a theoretical guarantee.
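
To make the distinction between local and global momentum concrete, the
following schematic update rules are only an illustrative sketch in our own
notation~(the learning rate $\eta$, momentum coefficient $\beta$, memory/error
vector $e_{k,t}$, stochastic gradient $g_{k,t}$, and sparsification operator
$\mathrm{sparse}(\cdot)$ are assumptions for exposition, not definitions taken
verbatim from DGC):
\begin{align*}
\text{local momentum~(DGC-style):}\quad & u_{k,t} = \beta u_{k,t-1} + g_{k,t},
\qquad \text{worker $k$ sends } \mathrm{sparse}(u_{k,t} + e_{k,t});\\
\text{global momentum~(GMC-style):}\quad & \text{worker $k$ sends }
\mathrm{sparse}\bigl(\eta g_{k,t} + \beta(w_{t-1} - w_t) + e_{k,t}\bigr),
\end{align*}
where $w_{t-1} - w_t$ aggregates the most recent updates of all workers, so the
momentum direction is shared globally rather than maintained separately on each
worker; in both cases the un-transmitted residual is stored back into the
memory $e_{k,t+1}$.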