We study stochastic decentralized optimization for the problem of training
machine learning models with large-scale distributed data. We extend the widely
used EXTRA and DIGing methods with accelerated variance reduction (VR), and
propose two methods, which require
$O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations
and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$ communication rounds to
reach precision $\epsilon$, where $\kappa_s$ and $\kappa_b$ are the stochastic
condition number and the batch condition number for strongly convex and smooth
problems, respectively, $\kappa_c$ is the condition number of the communication
network, and $n$ is the sample size on each distributed node.
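For orientation, the following is a minimal sketch of the standard definitions these condition numbers usually carry in this literature; the smoothness constants $L_s$ and $L_b$, the strong convexity constant $\mu$, and the mixing matrix $W$ are assumed notation here, and the precise definitions appear in the body of the paper:
\[
  \kappa_s = \frac{L_s}{\mu}, \qquad
  \kappa_b = \frac{L_b}{\mu}, \qquad
  \kappa_c = \frac{1}{1 - \sigma_2(W)},
\]
where each component loss is $L_s$-smooth, each local batch loss is $L_b$-smooth, both are $\mu$-strongly convex, and $\sigma_2(W)$ is the second largest singular value of the gossip matrix $W$, so $\kappa_c$ grows as the network becomes more poorly connected.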
Our stochastic gradient computation complexity is the same as that of
single-machine accelerated variance reduction methods, such as Katyusha, and
our communication complexity is the same as that of accelerated full-batch
decentralized methods, such as MSDA; both complexities are optimal. We also
propose the non-accelerated VR-based EXTRA and
DIGing, and provide explicit complexities, for example, the
$O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient computation
complexity and the $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication
complexity for the VR-based EXTRA. These two complexities also match those of
single-machine VR methods, such as SAG, SAGA, and SVRG, and of non-accelerated
full-batch decentralized methods, such as EXTRA, respectively.
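To make the algorithmic idea concrete, here is a toy Python sketch, not the paper's VR-based EXTRA itself: it runs the classical EXTRA recursion on a ring network but feeds it SVRG-style variance-reduced estimators in place of the exact local gradients. The least-squares data, ring topology, step size, snapshot schedule, and iteration count are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 50, 3                      # nodes, samples per node, dimension
A = rng.normal(size=(m, n, d))          # hypothetical local least-squares data
b = rng.normal(size=(m, n))

def sgrad(i, x, j):
    """Gradient of the j-th component loss f_{i,j}(x) = 0.5*(a_j^T x - b_j)^2 on node i."""
    return A[i, j] * (A[i, j] @ x - b[i, j])

def fgrad(i, x):
    """Full local batch gradient of f_i = (1/n) * sum_j f_{i,j} on node i."""
    return A[i].T @ (A[i] @ x - b[i]) / n

# Doubly stochastic gossip matrix W on a ring; EXTRA also uses W_tilde = (I+W)/2.
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = 0.5
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.25
W_tilde = (np.eye(m) + W) / 2

def vr_grads(X, snap, snap_grad):
    """SVRG estimator: grad f_{i,j}(x_i) - grad f_{i,j}(snap_i) + grad f_i(snap_i)."""
    js = rng.integers(n, size=m)        # one sampled component per node
    return np.array([sgrad(i, X[i], js[i]) - sgrad(i, snap[i], js[i]) + snap_grad[i]
                     for i in range(m)])

alpha = 0.02                            # illustrative step size, not tuned
X_prev = np.zeros((m, d))
snap = X_prev.copy()
snap_grad = np.array([fgrad(i, snap[i]) for i in range(m)])
G_prev = vr_grads(X_prev, snap, snap_grad)
X = W @ X_prev - alpha * G_prev         # EXTRA's first step

for k in range(1, 3000):
    if k % n == 0:                      # periodically refresh the SVRG snapshot
        snap = X.copy()
        snap_grad = np.array([fgrad(i, snap[i]) for i in range(m)])
    G = vr_grads(X, snap, snap_grad)
    # EXTRA recursion with VR estimators in place of exact local gradients
    X_next = (np.eye(m) + W) @ X - W_tilde @ X_prev - alpha * (G - G_prev)
    X_prev, X, G_prev = X, X_next, G

print("consensus error:", np.linalg.norm(X - X.mean(axis=0)))
print("grad norm at avg:", np.linalg.norm(sum(fgrad(i, X.mean(axis=0)) for i in range(m)) / m))
```

The sketch also shows where the two complexity measures come from: each iteration performs one multiplication by the gossip matrix $W$ (one communication round) and costs $O(1)$ stochastic gradient evaluations per node, plus a periodic full-batch gradient for the snapshot.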