D$^2$: Decentralized Training over Decentralized Data
When training a machine learning model using multiple workers, each of which
collects data from its own data source, the setup is most useful when the
data collected by different workers are {\em unique} and {\em different}.
Ironically, recent analysis of decentralized parallel stochastic gradient
descent (D-PSGD) relies on the assumption that the data hosted on different
workers are {\em not too different}. In this paper, we ask the question: {\em
Can we design a decentralized parallel stochastic gradient descent algorithm
that is less sensitive to the data variance across workers?} We present
D$^2$, a novel decentralized parallel stochastic gradient descent
algorithm designed for large data variance among workers (imprecisely,
"decentralized" data). The core of D$^2$ is a variance reduction extension of the
the standard D-PSGD algorithm, which improves the convergence rate from
to where
denotes the variance among data on different workers. As a result, D is
robust to data variance among workers. We empirically evaluated D$^2$ on image
classification tasks where each worker has access only to the data of a limited
set of labels, and found that D$^2$ significantly outperforms D-PSGD.
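To make the update concrete, here is a minimal NumPy simulation of a D$^2$-style iteration. It is a sketch under stated assumptions, not the paper's implementation: the quadratic local losses, ring mixing matrix, step size, and the helper stoch_grad are illustrative choices of ours; only the recursion x_{t+1} = W(2x_t - x_{t-1} - lr(g_t - g_{t-1})) follows the usual statement of the variance-reduction extension of D-PSGD.

```python
# Sketch of a D^2-style update on simulated workers (assumptions: quadratic
# local losses, ring topology, fixed step size -- none of this is the paper's setup).
import numpy as np

rng = np.random.default_rng(0)
n, d, T, lr = 8, 5, 200, 0.05

# Worker i holds a "very different" local objective f_i(x) = 0.5*||x - c_i||^2,
# so the minimizer of the average objective is the mean of the c_i.
C = rng.normal(size=(n, d)) * 5.0

def stoch_grad(i, x):
    """Stochastic gradient of f_i at x: exact gradient plus sampling noise."""
    return (x - C[i]) + 0.1 * rng.normal(size=d)

# Symmetric doubly stochastic mixing matrix for a ring topology.
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 1 / 3, 1 / 3, 1 / 3

X_prev = np.zeros((n, d))          # x_{t-1}, one row per worker
G_prev = np.array([stoch_grad(i, X_prev[i]) for i in range(n)])
X = W @ (X_prev - lr * G_prev)     # first step is a plain D-PSGD step

for t in range(T):
    G = np.array([stoch_grad(i, X[i]) for i in range(n)])
    # D^2-style recursion: x_{t+1} = W(2 x_t - x_{t-1} - lr (g_t - g_{t-1})).
    # Reusing the previous iterate and gradient cancels the across-worker
    # data variance that plain D-PSGD is sensitive to.
    X_next = W @ (2 * X - X_prev - lr * (G - G_prev))
    X_prev, G_prev, X = X, G, X_next

print("distance to optimum:", np.linalg.norm(X.mean(axis=0) - C.mean(axis=0)))
```

Note that each worker's local minimizer c_i is far from the global one, which is exactly the "decentralized data" regime the abstract targets; the reuse of consecutive iterates and gradients is what keeps the sketch stable there.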
Robust and Communication-Efficient Collaborative Learning
We consider a decentralized learning problem, where a set of computing nodes
aim at solving a non-convex optimization problem collaboratively. It is
well-known that decentralized optimization schemes face two major system
bottlenecks: stragglers' delay and communication overhead. In this paper, we
tackle these bottlenecks by proposing a novel decentralized, gradient-based
optimization algorithm named QuanTimed-DSGD. Our algorithm builds on two
main ideas: (i) we impose a deadline on the local gradient computations of each
node at each iteration of the algorithm, and (ii) the nodes exchange quantized
versions of their local models. The first idea makes the algorithm robust to
straggling nodes, and the second reduces the communication overhead. The key technical
contribution of our work is to prove that with non-vanishing noises for
quantization and stochastic gradients, the proposed method exactly converges to
the global optimum for convex loss functions, and finds a first-order
stationary point in non-convex scenarios. Our numerical evaluations of the
QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate
speedups of up to 3x in run-time, compared to state-of-the-art decentralized
optimization methods.
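The two ideas map naturally onto a small simulation. The NumPy sketch below is a hedged illustration, not the paper's algorithm: the deadline is modeled as a random per-node sample budget, the quantizer is a generic unbiased stochastic quantizer, and the topology, losses, step size, and helper names are assumptions of ours.

```python
# Sketch of the two QuanTimed-DSGD ingredients named in the abstract:
# (i) a compute deadline, simulated as a random per-node mini-batch budget, and
# (ii) unbiased stochastic quantization of the exchanged models.
import numpy as np

rng = np.random.default_rng(1)
n, d, T, lr = 8, 5, 300, 0.05
C = rng.normal(size=(n, d)) * 5.0   # worker i's local loss: 0.5*||x - c_i||^2

def deadline_grad(i, x):
    """Idea (i): average however many sample gradients finish before the deadline.
    Stragglers contribute fewer samples (a noisier gradient) instead of stalling."""
    b = rng.integers(1, 9)          # samples completed within the deadline
    samples = (x - C[i])[None, :] + 0.5 * rng.normal(size=(b, d))
    return samples.mean(axis=0)

def quantize(x, levels=16, scale=10.0):
    """Idea (ii): unbiased stochastic quantization of a model before transmission."""
    y = np.clip(x / scale, -1, 1) * levels
    low = np.floor(y)
    q = low + (rng.random(d) < (y - low))   # round up with prob. equal to remainder
    return q * scale / levels

W = np.zeros((n, n))                # doubly stochastic ring mixing matrix
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 1 / 3, 1 / 3, 1 / 3

X = np.zeros((n, d))
for t in range(T):
    Q = np.array([quantize(X[i]) for i in range(n)])   # nodes only see quantized models
    G = np.array([deadline_grad(i, X[i]) for i in range(n)])
    X = W @ Q - lr * G              # consensus on quantized models + local gradient step

print("distance to optimum:", np.linalg.norm(X.mean(axis=0) - C.mean(axis=0)))
```

Both noise sources in the sketch are non-vanishing, mirroring the setting of the convergence result quoted above: the quantizer is unbiased but never exact, and the deadline makes the gradient batch size random rather than fixed.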