Search CORE

7 research outputs found

D $^2$ : Decentralized Training over Decentralized Data

Author: Lian Xiangru
Liu Ji
Tang Hanlin
Yan Ming
Zhang Ce
Publication venue
Publication date: 01/01/2018
Field of study

While training a machine learning model using multiple workers, each of which collects data from their own data sources, it would be most useful when the data collected from different workers can be {\em unique} and {\em different}. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are {\em not too different}. In this paper, we ask the question: {\em Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers?} In this paper, we present D

^2

, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance \xr{among workers} (imprecisely, "decentralized" data). The core of D

^2

is a variance blackuction extension of the standard D-PSGD algorithm, which improves the convergence rate from

O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over T^{2/3}}\right)

O\left({\sigma \over \sqrt{nT}}\right)

where

\zeta^{2}

denotes the variance among data on different workers. As a result, D

^2

is robust to data variance among workers. We empirically evaluated D

^2

on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D

^2

significantly outperforms D-PSGD

arXiv.org e-Print Archive

Repository for Publications and Research Data

Recommended from our members

Exact Diffusion Learning over Networks

Author: Yuan Kun
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

In this dissertation, we study optimization, adaptation, and learning problems over connected networks. In these problems, each agent

k

collects and learns from its own local data and is able to communicate with its local neighbors. While each single node in the network may not be capable of sophisticated behavior on its own, the agents collaborate to solve large-scale and challenging learning problems. Different approaches have been proposed in the literature to boost the learning capabilities of networked agents. Among these approaches, the class of diffusion strategies has been shown to be particularly well-suited due to their enhanced stability range over other methods and improved performance in adaptive scenarios. However, diffusion implementations suffer from a small inherent bias in the iterates. When a constant step-size is employed to solve deterministic optimization problems, the iterates generated by the diffusion strategy will converge to a small neighborhood around the desired global solution but not to the exact solution itself. This bias is not due to any gradient noise arising from stochastic approximation; it is instead due to the update structure in diffusion implementations. The existence of the bias leads to three questions: (1) What is the origin of this inherent bias? (2) Can it be eliminated? (3) Does the correction of the bias bring benefits to distributed optimization, distributed adaptation, or distributed learning?This dissertation provides affirmative solutions to these questions. Specifically, we design a new {\em exact diffusion} approach that eliminates the inherent bias in diffusion. Exact diffusion has almost the same structure as diffusion, with the addition of a ``correction'' step between the adaptation and combination steps. Next, this dissertation studies the performance of exact diffusion for the scenarios of distributed optimization, distributed adaptation, and distributed learning, respectively. For distributed optimization, exact diffusion is proven to converge exponentially fast to the {\em exact} global solution under proper conditions. For distributed adaptation, exact diffusion is proven to have better steady-state mean-square-error than diffusion, and this superiority is analytically shown to be more evident for sparsely-connected networks such as line, cycle, grid, and other topologies. In distributed learning, exact diffusion can be integrated with the amortized variance-reduced gradient method (AVRG) so that it converges exponentially fast to the exact global solution while employing stochastic gradients per iteration. This dissertation also compares exact diffusion with other state-of-the-art methods in literature. Intensive numerical simulations are provided to illustrate the theoretical results derived in the dissertation

eScholarship - University of California