D²: Decentralized Training over Decentralized Data
When training a machine learning model with multiple workers, each collecting data from its own sources, the approach is most useful when the data collected by different workers are unique and different. Ironically, recent analyses of decentralized parallel stochastic gradient descent (D-PSGD) rely on the assumption that the data hosted on different workers are not too different. In this paper, we ask: can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers? We present D², a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D² is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from O(σ/√(nT) + (nζ²)^(1/3)/T^(2/3)) to O(σ/√(nT)), where ζ² denotes the variance among data on different workers. As a result, D² is robust to data variance among workers. We empirically evaluate D² on image classification tasks where each worker has access only to data from a limited set of labels, and find that D² significantly outperforms D-PSGD.
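For intuition, the following is a minimal NumPy simulation of a D²-style update: each worker corrects its local SGD step with its previous iterate and previous stochastic gradient, then averages with its neighbors through a doubly stochastic mixing matrix W. The function names, the local_gradient oracle, and the single-process simulation of all workers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def d2_step(W, x, x_prev, g, g_prev, lr):
    """One D^2-style iteration (sketch): a variance-corrected local step using the
    previous iterate/gradient, followed by gossip averaging with the doubly
    stochastic mixing matrix W.  x, x_prev, g, g_prev: (n_workers, dim) arrays."""
    half = 2.0 * x - x_prev - lr * (g - g_prev)  # corrected local step (reuses previous iterate/gradient)
    return W @ half                              # one round of decentralized (neighbor) averaging

def run_d2(local_gradient, W, x0, lr, steps):
    """Simulate all workers in one process; local_gradient(X) is a hypothetical
    oracle returning each worker's stochastic gradient on its own local data."""
    g_prev = local_gradient(x0)
    x_prev, x = x0, W @ (x0 - lr * g_prev)       # first iteration is a plain D-PSGD step
    for _ in range(steps - 1):
        g = local_gradient(x)
        x, x_prev, g_prev = d2_step(W, x, x_prev, g, g_prev, lr), x, g
    return x.mean(axis=0)                        # consensus (averaged) model
```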
Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent
Stochastic compositional optimization arises in many important machine learning applications. The objective function is a composition of two expectations of stochastic functions, and is more challenging to optimize than vanilla stochastic optimization problems. In this paper, we investigate stochastic compositional optimization in the general smooth non-convex setting. We employ the recently developed idea of Stochastic Recursive Gradient Descent to design a novel algorithm named SARAH-Compositional, and prove a sharp Incremental First-order Oracle (IFO) complexity upper bound for stochastic compositional optimization: O((n + m)^(1/2) ε^(-2)) in the finite-sum case and O(ε^(-3)) in the online case. This complexity is the best known among IFO complexity results for non-convex stochastic compositional optimization. Numerical experiments on risk-averse portfolio management validate the superiority of SARAH-Compositional over a few rival algorithms.
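The abstract builds on the SARAH recursive gradient estimator. As a point of reference, here is a minimal sketch of one epoch of plain (non-compositional) SARAH; SARAH-Compositional extends this recursion to the two-level objective f(g(x)) by additionally maintaining recursive estimates of the inner function value and its Jacobian (that extension is not shown). The callables, sample indexing, and step size are illustrative assumptions.

```python
import numpy as np

def sarah_epoch(full_grad, stoch_grad, w0, lr, m, n_samples, rng):
    """One epoch of the plain SARAH recursive gradient estimator (sketch).
    full_grad(w): exact gradient; stoch_grad(w, i): gradient of component i."""
    v = full_grad(w0)                                     # v_0 = grad F(w_0) anchors the recursion
    w_prev, w = w0, w0 - lr * v
    for _ in range(m):
        i = rng.integers(n_samples)                       # draw one component f_i
        v = stoch_grad(w, i) - stoch_grad(w_prev, i) + v  # recursive gradient estimate
        w_prev, w = w, w - lr * v
    return w
```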
Hop: Heterogeneity-Aware Decentralized Training
Recent work has shown that decentralized algorithms can deliver superior
performance over centralized ones in the context of machine learning. The two
approaches, with the main difference residing in their distinct communication
patterns, are both susceptible to performance degradation in heterogeneous
environments. Although vigorous efforts have been devoted to making centralized algorithms robust to heterogeneity, little has been explored for decentralized algorithms in this regard.
This paper proposes Hop, the first heterogeneity-aware decentralized training
protocol. Based on a unique characteristic of decentralized training that we
have identified, the iteration gap, we propose a queue-based synchronization
mechanism that can efficiently implement backup workers and bounded staleness
in the decentralized setting. To cope with deterministic slowdown, we propose
skipping iterations so that the effect of slower workers is further mitigated.
We build a prototype implementation of Hop on TensorFlow. Experimental results on CNN and SVM workloads show significant speedup over standard decentralized training in heterogeneous settings.
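As a rough illustration of the queue-based synchronization idea (not Hop's actual protocol or API), the sketch below buffers neighbor updates per iteration and lets a worker start iteration t only if no neighbor lags by more than a staleness bound and enough neighbors have already delivered their updates, approximating bounded staleness and backup-worker semantics. All class names, fields, and thresholds are assumptions.

```python
from collections import defaultdict, deque

class BoundedStalenessQueue:
    """Illustrative sketch of a queue-based synchronization check: one instance
    per worker, buffering the model updates received from its neighbors."""
    def __init__(self, neighbors, staleness=2, min_ready=None):
        self.staleness = staleness
        # min_ready < len(neighbors) gives backup-worker behavior (don't wait for all)
        self.min_ready = min_ready if min_ready is not None else len(neighbors)
        self.queues = {n: deque() for n in neighbors}   # per-neighbor update queues
        self.latest = defaultdict(lambda: -1)           # last iteration seen per neighbor

    def receive(self, neighbor, iteration, update):
        self.queues[neighbor].append((iteration, update))
        self.latest[neighbor] = max(self.latest[neighbor], iteration)

    def can_start(self, t):
        # bounded staleness: no neighbor may lag more than `staleness` iterations
        within_bound = all(self.latest[n] >= t - self.staleness for n in self.queues)
        # backup workers: enough neighbors already delivered updates for iteration t
        ready_now = sum(self.latest[n] >= t for n in self.queues)
        return within_bound and ready_now >= self.min_ready
```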
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance is pervasive in distributed deep learning training systems, caused either by inherent imbalance in the learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.
Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61, 2020.
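To illustrate the semantics of the two partial collectives, the sketch below is a toy single-process model, not the MPI-based implementation used by eager-SGD: it reduces only the gradients that have arrived, where solo proceeds with whatever is available and majority requires at least half of the participants. The function name and the use of None for lagging processes are assumptions.

```python
import numpy as np

def partial_allreduce(contributions, mode="majority"):
    """Toy model of a partial collective: `contributions` has one entry per
    process, either a gradient array (arrived in time) or None (lagging).
    'solo': reduce as soon as at least one gradient is present;
    'majority': proceed only once at least half of the processes contributed."""
    ready = [g for g in contributions if g is not None]
    if not ready:
        raise RuntimeError("even solo allreduce needs at least one contribution")
    if mode == "majority" and len(ready) < (len(contributions) + 1) // 2:
        raise RuntimeError("majority allreduce would block until half have contributed")
    # every process receives the same reduced value, as in an allreduce
    return np.mean(ready, axis=0)

# toy usage: 4 processes, the two slowest have not delivered their gradients yet
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), None, None]
print(partial_allreduce(grads, mode="majority"))   # averages the 2 available gradients
```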