D$^2$: Decentralized Training over Decentralized Data

Lian, Xiangru; Liu, Ji; Tang, Hanlin; Yan, Ming; Zhang, Ce

Authors: Xiangru Lian, Ji Liu, Hanlin Tang, Ming Yan, Ce Zhang
Publication date: 1 January 2018

Abstract

When training a machine learning model using multiple workers, each of which collects data from its own data sources, it would be most useful if the data collected by different workers could be {\em unique} and {\em different}. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are {\em not too different}. In this paper, we ask the question: {\em Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers?} We present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D$^2$ is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from $O\left(\frac{\sigma}{\sqrt{nT}} + \frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$ to $O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^2$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.
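
To make the difference between the two rates concrete, the following minimal Python sketch evaluates both expressions from the abstract for a few values of $\zeta^2$. It is an illustration only: the constants hidden by the $O(\cdot)$ notation are dropped, so the numbers show relative scaling rather than actual error. The function names `dpsgd_bound` and `d2_bound` and the sample values of `sigma`, `n`, and `T` are assumptions chosen for illustration and do not come from the paper.

```python
import math


def dpsgd_bound(sigma: float, zeta_sq: float, n: int, T: int) -> float:
    """D-PSGD rate from the abstract, with big-O constants dropped (assumption)."""
    return sigma / math.sqrt(n * T) + (n * zeta_sq) ** (1 / 3) / T ** (2 / 3)


def d2_bound(sigma: float, n: int, T: int) -> float:
    """D^2 rate from the abstract, with big-O constants dropped (assumption)."""
    return sigma / math.sqrt(n * T)


if __name__ == "__main__":
    # Hypothetical settings: sigma, n (workers), and T (iterations) are illustrative.
    sigma, n, T = 1.0, 16, 10_000
    for zeta_sq in (0.0, 1.0, 100.0):
        print(
            f"zeta^2={zeta_sq:7.1f}  "
            f"D-PSGD={dpsgd_bound(sigma, zeta_sq, n, T):.6f}  "
            f"D^2={d2_bound(sigma, n, T):.6f}"
        )
```

When $\zeta^2 = 0$ the two expressions coincide. As $\zeta^2$ grows, only the D-PSGD expression increases, which is the sense in which the abstract describes D$^2$ as robust to data variance among workers.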

Available Versions

Repository for Publications and Research Data

oai:www.research-collection.et...

Last time updated on 19/04/2020