Asynchronous Decentralized Parallel Stochastic Gradient Descent
Most commonly used distributed machine learning systems are either
synchronous or centralized asynchronous. Synchronous algorithms like
AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous
algorithms using a parameter server suffer from 1) a communication bottleneck at
the parameter servers when there are many workers, and 2) significantly worse
convergence when traffic to the parameter server is congested. Can we design an algorithm
that is robust in a heterogeneous environment, while being communication
efficient and maintaining the best-possible convergence rate? In this paper, we
propose an asynchronous decentralized stochastic gradient descent algorithm
(AD-PSGD) satisfying all of the above expectations. Our theoretical analysis shows
that AD-PSGD converges at the same optimal rate as SGD and achieves linear
speedup with respect to the number of workers. Empirically, AD-PSGD outperforms the best of
decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and
standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a
heterogeneous environment. When training ResNet-50 on ImageNet with up to 128
GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each
epoch can be up to 4-8X faster than its synchronous counterparts in a
network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the
first asynchronous algorithm that achieves a similar epoch-wise convergence
rate to AllReduce-SGD at a scale of over 100 GPUs.
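A minimal, single-process sketch of the decentralized update AD-PSGD is built around: an active worker computes a stochastic gradient on its stale local model, averages its model with one randomly chosen neighbor, and applies the gradient, with no global barrier. The toy least-squares problem, ring topology, step size, and the helper name adpsgd_step are illustrative assumptions, not the authors' implementation.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each worker holds a shard of a synthetic linear-regression problem.
n_workers, dim = 8, 5
w_true = rng.normal(size=dim)
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(100, dim))
    y = X @ w_true + 0.1 * rng.normal(size=100)
    shards.append((X, y))

# Ring topology: each worker's neighbors are the two adjacent workers.
neighbors = {i: [(i - 1) % n_workers, (i + 1) % n_workers] for i in range(n_workers)}
models = [np.zeros(dim) for _ in range(n_workers)]

def adpsgd_step(i, lr=0.05, batch=16):
    """One simulated asynchronous step of worker i:
    (1) compute a stochastic gradient on the local shard,
    (2) average the local model with one randomly chosen neighbor,
    (3) apply the gradient; no global barrier is involved."""
    X, y = shards[i]
    idx = rng.choice(len(y), size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ models[i] - y[idx]) / batch
    j = rng.choice(neighbors[i])
    averaged = 0.5 * (models[i] + models[j])
    models[j] = averaged              # the neighbor keeps the averaged model
    models[i] = averaged - lr * grad  # the active worker also applies its gradient

# Simulated asynchrony: workers become active in a random order.
for _ in range(3000):
    adpsgd_step(int(rng.integers(n_workers)))

print("consensus spread:", max(np.linalg.norm(m - models[0]) for m in models))
print("distance to w_true:", np.linalg.norm(models[0] - w_true))
```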
Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization
Despite the success of single-agent reinforcement learning, multi-agent
reinforcement learning (MARL) remains challenging due to complex interactions
between agents. Motivated by decentralized applications such as sensor
networks, swarm robotics, and power grids, we study policy evaluation in MARL,
where agents with jointly observed state-action pairs and private local rewards
collaborate to learn the value of a given policy. In this paper, we propose a
double averaging scheme, where each agent iteratively performs averaging over
both space and time to incorporate neighboring gradient information and local
reward information, respectively. We prove that the proposed algorithm
converges to the optimal solution at a global geometric rate. In particular,
such an algorithm is built upon a primal-dual reformulation of the mean squared
projected Bellman error minimization problem, which gives rise to a
decentralized convex-concave saddle-point problem. To the best of our
knowledge, the proposed double averaging primal-dual optimization algorithm is
the first to achieve fast finite-time convergence on decentralized
convex-concave saddle-point problems. Comment: final version as it appeared in NeurIPS 2018.
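A toy sketch of the two averaging operations the abstract describes, mixing local iterates with neighbors (space) and keeping a running average of local gradient information (time), on an invented decentralized quadratic. It is not the paper's primal-dual saddle-point algorithm; the ring mixing matrix, step size, and problem data are assumptions made only for illustration.
```python
import numpy as np

rng = np.random.default_rng(1)

# Toy decentralized quadratic: agent i privately holds (A_i, b_i) and the
# network wants to minimise sum_i ||A_i x - b_i||^2 / (2 * samples).
n_agents, dim = 6, 4
A = [rng.normal(size=(20, dim)) for _ in range(n_agents)]
b = [a @ rng.normal(size=dim) for a in A]

# Doubly-stochastic mixing matrix on a ring (averaging over space).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i + 1) % n_agents] = 0.25
    W[i, (i - 1) % n_agents] = 0.25

x = np.zeros((n_agents, dim))       # local iterates
g_avg = np.zeros((n_agents, dim))   # running time-average of local gradients

for t in range(1, 500):
    grads = np.stack([A[i].T @ (A[i] @ x[i] - b[i]) / len(b[i])
                      for i in range(n_agents)])
    g_avg += (grads - g_avg) / t    # averaging over time
    x = W @ x - 0.1 * g_avg         # averaging over space + gradient step

print("consensus spread:", np.linalg.norm(x - x.mean(axis=0)))
```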
Stochastic Gradient Push for Distributed Deep Learning
Distributed data-parallel algorithms aim to accelerate the training of deep
neural networks by parallelizing the computation of large mini-batch gradient
updates across multiple nodes. Approaches that synchronize nodes using exact
distributed averaging (e.g., via AllReduce) are sensitive to stragglers and
communication delays. The PushSum gossip algorithm is robust to these issues,
but only performs approximate distributed averaging. This paper studies
Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient
updates. We prove that SGP converges to a stationary point of smooth,
non-convex objectives at the same sub-linear rate as SGD, and that all nodes
achieve consensus. We empirically validate the performance of SGP on image
classification (ResNet-50, ImageNet) and machine translation (Transformer,
WMT'16 En-De) workloads. Our code will be made publicly available. Comment: ICML 2019.
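A minimal sketch of the PushSum-plus-SGD idea: each node holds a numerator x and a push-sum weight w, takes gradient steps at the de-biased point z = x / w, and pushes both x and w through a column-stochastic mixing matrix. The directed ring, quadratic objective, and step size below are illustrative assumptions rather than the paper's experimental setup.
```python
import numpy as np

rng = np.random.default_rng(2)

# Toy objective: each node holds a quadratic centred at a private target.
n_nodes, dim = 5, 3
targets = rng.normal(size=(n_nodes, dim))

# Column-stochastic mixing matrix for a directed ring: each node keeps half of
# its mass and pushes half to its out-neighbour (PushSum needs columns, not
# rows, to sum to one).
P = np.zeros((n_nodes, n_nodes))
for j in range(n_nodes):
    P[j, j] = 0.5
    P[(j + 1) % n_nodes, j] = 0.5

x = np.zeros((n_nodes, dim))   # push-sum numerators
w = np.ones(n_nodes)           # push-sum weights (denominators)
lr = 0.1

for _ in range(300):
    z = x / w[:, None]          # de-biased estimates used for gradients
    grads = z - targets         # gradient of 0.5 * ||z - target||^2
    x = x - lr * grads          # local SGD step on the numerator
    x = P @ x                   # push step: mix numerators
    w = P @ w                   # push step: mix weights

z = x / w[:, None]
print("consensus spread:", np.linalg.norm(z - z.mean(axis=0)))
print("distance to average target:", np.linalg.norm(z[0] - targets.mean(axis=0)))
```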
Robust and Communication-Efficient Collaborative Learning
We consider a decentralized learning problem, where a set of computing nodes
aim at solving a non-convex optimization problem collaboratively. It is
well-known that decentralized optimization schemes face two major system
bottlenecks: stragglers' delay and communication overhead. In this paper, we
tackle these bottlenecks by proposing a novel decentralized and gradient-based
optimization algorithm named QuanTimed-DSGD. Our algorithm builds on two
main ideas: (i) we impose a deadline on the local gradient computations of each
node at each iteration of the algorithm, and (ii) the nodes exchange quantized
versions of their local models. The first idea provides robustness to straggling
nodes, and the second reduces the communication overhead. The key technical
contribution of our work is to prove that, with non-vanishing quantization and
stochastic gradient noise, the proposed method converges exactly to
the global optimum for convex loss functions and finds a first-order
stationary point in non-convex scenarios. Our numerical evaluations of the
QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate
speedups of up to 3x in run time compared to state-of-the-art decentralized
optimization methods.
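A toy sketch of the two ingredients named above, a per-iteration compute deadline (each node uses only the gradients it finished in time) and exchange of quantized models, on an invented least-squares problem. The quantizer, deadline model, topology, and step size are assumptions for illustration, not the authors' exact QuanTimed-DSGD update or schedule.
```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(v, step=0.05):
    """Unbiased stochastic rounding of each coordinate to a grid of size `step`."""
    low = np.floor(v / step) * step
    prob_up = (v - low) / step
    return low + step * (rng.random(v.shape) < prob_up)

# Toy decentralized least squares on a ring of nodes.
n_nodes, dim, n_samples = 6, 4, 200
w_true = rng.normal(size=dim)
X = [rng.normal(size=(n_samples, dim)) for _ in range(n_nodes)]
y = [x @ w_true + 0.1 * rng.normal(size=n_samples) for x in X]
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i], W[i, (i + 1) % n_nodes], W[i, (i - 1) % n_nodes] = 0.5, 0.25, 0.25

models = np.zeros((n_nodes, dim))
lr = 0.05

for _ in range(500):
    # (ii) models are quantized before being mixed with neighbours
    q = np.stack([quantize(m) for m in models])
    mixed = W @ q
    new_models = np.empty_like(models)
    for i in range(n_nodes):
        # (i) deadline: each node averages only the gradients it finished in time
        done = int(rng.integers(5, 40))       # samples processed before the deadline
        idx = rng.choice(n_samples, size=done, replace=False)
        g = X[i][idx].T @ (X[i][idx] @ models[i] - y[i][idx]) / done
        new_models[i] = mixed[i] - lr * g
    models = new_models

print("max distance to w_true:", max(np.linalg.norm(m - w_true) for m in models))
```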
Distributed Stochastic Multi-Task Learning with Graph Regularization
We propose methods for distributed graph-based multi-task learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or a diminishing stepsize in these methods would yield consensus (single-task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.
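A small sketch of the skewed-averaging idea on two machines with related regression tasks: a self-weight of 0.5 corresponds to uniform averaging (consensus, single-task learning), while a self-weight near 1 keeps the per-machine models distinct but still shares information. The tasks, step size, and the run helper are invented for illustration and are not the paper's estimators.
```python
import numpy as np

rng = np.random.default_rng(4)

# Two machines, each with its own related regression task.
dim = 3
w_tasks = [np.ones(dim), np.ones(dim) + 0.5]      # related but different targets
X = [rng.normal(size=(200, dim)) for _ in range(2)]
y = [x @ w for x, w in zip(X, w_tasks)]

def run(self_weight, steps=2000, lr=0.05, batch=8):
    """Distributed SGD where each machine mixes its model with the other's.
    self_weight = 0.5 gives uniform averaging (consensus learning); a value
    close to 1 keeps the tasks distinct while still sharing information."""
    w = [np.zeros(dim), np.zeros(dim)]
    for _ in range(steps):
        new = []
        for i in (0, 1):
            idx = rng.choice(len(y[i]), size=batch, replace=False)
            g = X[i][idx].T @ (X[i][idx] @ w[i] - y[i][idx]) / batch
            mixed = self_weight * w[i] + (1 - self_weight) * w[1 - i]
            new.append(mixed - lr * g)
        w = new
    return w

for sw in (0.5, 0.9):
    w = run(sw)
    print(f"self_weight={sw}: model gap = {np.linalg.norm(w[0] - w[1]):.3f}")
```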
Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms
Communication-efficient SGD algorithms, which allow nodes to perform local
updates and periodically synchronize local models, are highly effective in
improving the speed and scalability of distributed SGD. However, a rigorous
convergence analysis and comparative study of different communication-reduction
strategies remains a largely open problem. This paper presents a unified
framework called Cooperative SGD that subsumes existing communication-efficient
SGD algorithms such as periodic-averaging, elastic-averaging and decentralized
SGD. By analyzing Cooperative SGD, we provide novel convergence guarantees for
existing algorithms. Moreover, this framework enables us to design new
communication-efficient SGD algorithms that strike the best balance between
reducing communication overhead and achieving fast error convergence with low
error floor.
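A minimal sketch of the periodic-averaging special case that the framework subsumes: every worker takes tau local SGD steps and then all workers synchronize to the average model. The toy quadratics, noise level, and value of tau below are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(5)

# Toy objective: K workers, each with a private quadratic; the global optimum
# is the mean of the local targets.
K, dim, tau = 4, 3, 10          # tau = number of local steps between averaging
targets = rng.normal(size=(K, dim))
x = np.zeros((K, dim))
lr = 0.05

for t in range(1, 401):
    # local SGD step on each worker (noisy gradient of 0.5 * ||x - target||^2)
    grads = (x - targets) + 0.1 * rng.normal(size=x.shape)
    x = x - lr * grads
    # periodic averaging: every tau steps all workers synchronise to the mean
    if t % tau == 0:
        x[:] = x.mean(axis=0)

print("final consensus spread:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```
Fewer synchronizations (larger tau) cut communication but let the workers drift further between averaging rounds, which is the trade-off the framework analyzes.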
Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication
Recently, the decentralized optimization problem has been attracting growing
attention. Most existing methods are deterministic with a high per-iteration cost
and have a convergence rate that depends quadratically on the problem condition
number. Moreover, dense communication is necessary to ensure convergence
even when the dataset is sparse. In this paper, we generalize the decentralized
optimization problem to a monotone operator root finding problem, and propose a
stochastic algorithm named DSBA that (i) converges geometrically at a rate that
depends linearly on the problem condition number, and (ii) can be implemented
using sparse communication only. Additionally, DSBA handles learning problems
like AUC-maximization which cannot be tackled efficiently in the decentralized
setting. Experiments on convex minimization and AUC-maximization validate the
efficiency of our method. Comment: Accepted to ICML 2018.
Distributed Convex Optimization With Limited Communications
In this paper, a distributed convex optimization algorithm, termed
\emph{distributed coordinate dual averaging} (DCDA), is proposed. The
DCDA algorithm addresses the scenario of a large distributed optimization
problem with limited communication among nodes in the network. Currently known
distributed subgradient methods, such as the distributed dual averaging or the
distributed alternating direction method of multipliers algorithms, assume that
nodes can exchange messages of large cardinality. Such network communication
capabilities are not available in many scenarios of practical relevance. In the
DCDA algorithm, on the other hand, communication of each coordinate of the
optimization variable is restricted over time. For the proposed algorithm, we
bound the rate of convergence under different communication protocols and
network architectures. We also consider the extensions to the case of imperfect
gradient knowledge and the case in which transmitted messages are corrupted by
additive noise or are quantized. Relevant numerical simulations are also
provided. Comment: Extended version of a submission to IEEE ICASSP 201
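A rough sketch of the coordinate-limited communication idea layered on textbook distributed dual averaging: accumulated gradients are mixed with neighbors, but only a small random subset of coordinates is exchanged in any given round. This is not the exact DCDA protocol or its communication schedule; the ring topology, coordinate budget, and step-size rule are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(6)

# Toy problem: n nodes jointly minimise sum_i 0.5 * ||x - c_i||^2 over x.
n_nodes, dim = 5, 6
c = rng.normal(size=(n_nodes, dim))

# Doubly-stochastic mixing matrix on a ring.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i], W[i, (i + 1) % n_nodes], W[i, (i - 1) % n_nodes] = 0.5, 0.25, 0.25

z = np.zeros((n_nodes, dim))    # accumulated (dual-averaged) gradients
x = np.zeros((n_nodes, dim))

for t in range(1, 1001):
    grads = x - c                                 # local (sub)gradients
    # Communication restriction: only a few coordinates are mixed this round;
    # the remaining coordinates of z are kept locally unchanged.
    coords = rng.choice(dim, size=2, replace=False)
    z_mixed = z.copy()
    z_mixed[:, coords] = W @ z[:, coords]
    z = z_mixed + grads
    x = -z / (np.sqrt(t) + 10.0)                  # dual-averaging primal step

print("consensus spread:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - c.mean(axis=0)))
```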
On Data Dependence in Distributed Stochastic Optimization
We study a distributed consensus-based stochastic gradient descent (SGD)
algorithm and show that the rate of convergence involves the spectral
properties of two matrices: the standard spectral gap of a weight matrix from
the network topology and a new term depending on the spectral norm of the
sample covariance matrix of the data. This data-dependent convergence rate
shows that distributed SGD algorithms perform better on datasets with small
spectral norm. Our analysis method also allows us to find data-dependent
convergence rates as we limit the amount of communication. Spreading a fixed
amount of data across more nodes slows convergence; for asymptotically growing
datasets, we show that adding more machines can help when minimizing
twice-differentiable losses.
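The two spectral quantities the stated rate depends on are straightforward to compute directly; the sketch below evaluates the spectral gap of a ring-topology weight matrix and the spectral norm of the sample covariance of a synthetic dataset. The specific topology and data are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(7)

# Spectral gap of a doubly-stochastic weight matrix on a ring of m nodes.
m = 8
W = np.zeros((m, m))
for i in range(m):
    W[i, i], W[i, (i + 1) % m], W[i, (i - 1) % m] = 0.5, 0.25, 0.25
eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
spectral_gap = 1.0 - eigvals[1]        # 1 minus the second-largest eigenvalue magnitude

# Spectral norm of the sample covariance of a dataset (the data-dependent term).
X = rng.normal(size=(1000, 20)) @ np.diag(np.linspace(0.1, 2.0, 20))
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
cov_spectral_norm = np.linalg.norm(cov, 2)

print(f"spectral gap of W: {spectral_gap:.3f}")
print(f"||sample covariance||_2: {cov_spectral_norm:.3f}")
```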
Graph Balancing for Distributed Subgradient Methods over Directed Graphs
We consider a multi-agent optimization problem in which a set of agents
collectively solves a global optimization problem with the objective function
given by the sum of locally known convex functions. We focus on the case when
information exchange among agents takes place over a directed network and
propose a distributed subgradient algorithm in which each agent performs local
processing based on information obtained from its incoming neighbors. Our
algorithm uses weight balancing to overcome the asymmetries caused by the
directed communication network, i.e., agents scale their outgoing information
with dynamically updated weights that converge to balancing weights of the
graph. We show that both the objective function values and the consensus
violation, at the ergodic average of the estimates generated by the algorithm,
converge at a sublinear rate in the number of iterations. A special case of our
algorithm provides a new distributed method to compute average consensus over
directed graphs.
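A small sketch of the balance condition that the dynamically updated weights converge to: each agent scales all of its outgoing information by a single weight, and the graph is balanced when every agent's total scaled outgoing weight equals its total incoming weight. The damped fixed-point iteration and the example digraph below are illustrative assumptions, not the paper's exact weight update.
```python
import numpy as np

# Directed graph on 4 nodes (i -> j means i sends to j), strongly connected.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
n = 4
out_deg = np.zeros(n)
A_in = np.zeros((n, n))          # A_in[i, j] = 1 if there is an edge j -> i
for j, i in edges:
    A_in[i, j] = 1.0
    out_deg[j] += 1

# Each node scales everything it sends out by a single weight b[i].  The graph
# is balanced when, for every node, total outgoing weight equals total incoming
# weight: out_deg[i] * b[i] == sum of b[j] over in-neighbours j.
b = np.ones(n)
for _ in range(200):
    # damped fixed-point iteration toward the balancing weights
    b = 0.5 * b + 0.5 * (A_in @ b) / out_deg

imbalance = out_deg * b - A_in @ b
print("balancing weights:", np.round(b, 3))
print("max imbalance:", np.abs(imbalance).max())
```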