When training a machine learning model using multiple workers, each of which
collects data from its own data source, it is most useful when the
data collected by different workers are {\em unique} and {\em different}.
Ironically, recent analysis of decentralized parallel stochastic gradient
descent (D-PSGD) relies on the assumption that the data hosted on different
workers are {\em not too different}. In this paper, we ask the question: {\em
Can we design a decentralized parallel stochastic gradient descent algorithm
that is less sensitive to the data variance across workers?} We
present D2, a novel decentralized parallel stochastic gradient descent
algorithm designed for large data variance among workers (imprecisely,
``decentralized'' data). The core of D2 is a variance reduction extension of
the standard D-PSGD algorithm, which improves the convergence rate from
$O\left(\frac{\sigma}{\sqrt{nT}} + \frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$ to $O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^2$
denotes the variance among data on different workers. As a result, D2 is
robust to data variance among workers. We empirically evaluate D2 on image
classification tasks where each worker has access to only the data of a limited
set of labels, and find that D2 significantly outperforms D-PSGD.
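
As a rough illustration of the rate comparison, assume (as is conventional, though not stated here) that $n$ is the number of workers and $T$ the number of iterations, and ignore the constants hidden in the $O(\cdot)$ notation. The two terms of the D-PSGD rate can then be compared directly to see when the data-variance term matters:
\[
\frac{(n\zeta^2)^{1/3}}{T^{2/3}} \ge \frac{\sigma}{\sqrt{nT}}
\iff n^{5/6}\,\zeta^{2/3} \ge \sigma\,T^{1/6}
\iff T \le \frac{n^{5}\zeta^{4}}{\sigma^{6}}.
\]
Under this heuristic reading, the $\zeta$-dependent term dominates the D-PSGD rate for roughly the first $n^{5}\zeta^{4}/\sigma^{6}$ iterations, a horizon that grows quickly with $\zeta$, whereas the D2 rate contains no such term.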