Asynchronous Decentralized Parallel Stochastic Gradient Descent
Most commonly used distributed machine learning systems are either
synchronous or centralized asynchronous. Synchronous algorithms like
AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous
algorithms using a parameter server suffer from 1) a communication bottleneck at
the parameter servers when there are many workers, and 2) significantly worse
convergence when traffic to the parameter server is congested. Can we design an algorithm
that is robust in a heterogeneous environment, while being communication
efficient and maintaining the best-possible convergence rate? In this paper, we
propose an asynchronous decentralized stochastic gradient descent algorithm
(AD-PSGD) satisfying all of the above expectations. Our theoretical analysis shows
that AD-PSGD converges at the same optimal rate as SGD and achieves linear
speedup with respect to the number of workers. Empirically, AD-PSGD outperforms the best of
decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and
standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a
heterogeneous environment. When training ResNet-50 on ImageNet with up to 128
GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each
epoch can be up to 4-8X faster than its synchronous counterparts in a
network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the
first asynchronous algorithm that achieves a similar epoch-wise convergence
rate to AllReduce-SGD at a scale of over 100 GPUs.
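A minimal, single-process sketch of the decentralized update AD-PSGD is built around: an active worker computes a stochastic gradient on its stale local model, averages its model with one randomly chosen neighbor, and applies the gradient, with no global barrier. The toy least-squares problem, ring topology, step size, and the helper name adpsgd_step are illustrative assumptions, not the authors' implementation.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each worker holds a shard of a synthetic linear-regression problem.
n_workers, dim = 8, 5
w_true = rng.normal(size=dim)
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(100, dim))
    y = X @ w_true + 0.1 * rng.normal(size=100)
    shards.append((X, y))

# Ring topology: each worker's neighbors are the two adjacent workers.
neighbors = {i: [(i - 1) % n_workers, (i + 1) % n_workers] for i in range(n_workers)}
models = [np.zeros(dim) for _ in range(n_workers)]

def adpsgd_step(i, lr=0.05, batch=16):
    """One simulated asynchronous step of worker i:
    (1) compute a stochastic gradient on the local shard,
    (2) average the local model with one randomly chosen neighbor,
    (3) apply the gradient; no global barrier is involved."""
    X, y = shards[i]
    idx = rng.choice(len(y), size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ models[i] - y[idx]) / batch
    j = rng.choice(neighbors[i])
    averaged = 0.5 * (models[i] + models[j])
    models[j] = averaged              # the neighbor keeps the averaged model
    models[i] = averaged - lr * grad  # the active worker also applies its gradient

# Simulated asynchrony: workers become active in a random order.
for _ in range(3000):
    adpsgd_step(int(rng.integers(n_workers)))

print("consensus spread:", max(np.linalg.norm(m - models[0]) for m in models))
print("distance to w_true:", np.linalg.norm(models[0] - w_true))
```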
Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization
Despite the success of single-agent reinforcement learning, multi-agent
reinforcement learning (MARL) remains challenging due to complex interactions
between agents. Motivated by decentralized applications such as sensor
networks, swarm robotics, and power grids, we study policy evaluation in MARL,
where agents with jointly observed state-action pairs and private local rewards
collaborate to learn the value of a given policy. In this paper, we propose a
double averaging scheme, where each agent iteratively performs averaging over
both space and time to incorporate neighboring gradient information and local
reward information, respectively. We prove that the proposed algorithm
converges to the optimal solution at a global geometric rate. In particular,
such an algorithm is built upon a primal-dual reformulation of the mean squared
projected Bellman error minimization problem, which gives rise to a
decentralized convex-concave saddle-point problem. To the best of our
knowledge, the proposed double averaging primal-dual optimization algorithm is
the first to achieve fast finite-time convergence on decentralized
convex-concave saddle-point problems. Comment: final version as it appeared in NeurIPS 2018.
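A toy sketch of the two averaging operations the abstract describes, mixing local iterates with neighbors (space) and keeping a running average of local gradient information (time), on an invented decentralized quadratic. It is not the paper's primal-dual saddle-point algorithm; the ring mixing matrix, step size, and problem data are assumptions made only for illustration.
```python
import numpy as np

rng = np.random.default_rng(1)

# Toy decentralized quadratic: agent i privately holds (A_i, b_i) and the
# network wants to minimise sum_i ||A_i x - b_i||^2 / (2 * samples).
n_agents, dim = 6, 4
A = [rng.normal(size=(20, dim)) for _ in range(n_agents)]
b = [a @ rng.normal(size=dim) for a in A]

# Doubly-stochastic mixing matrix on a ring (averaging over space).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i + 1) % n_agents] = 0.25
    W[i, (i - 1) % n_agents] = 0.25

x = np.zeros((n_agents, dim))       # local iterates
g_avg = np.zeros((n_agents, dim))   # running time-average of local gradients

for t in range(1, 500):
    grads = np.stack([A[i].T @ (A[i] @ x[i] - b[i]) / len(b[i])
                      for i in range(n_agents)])
    g_avg += (grads - g_avg) / t    # averaging over time
    x = W @ x - 0.1 * g_avg         # averaging over space + gradient step

print("consensus spread:", np.linalg.norm(x - x.mean(axis=0)))
```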
Stochastic Gradient Push for Distributed Deep Learning
Distributed data-parallel algorithms aim to accelerate the training of deep
neural networks by parallelizing the computation of large mini-batch gradient
updates across multiple nodes. Approaches that synchronize nodes using exact
distributed averaging (e.g., via AllReduce) are sensitive to stragglers and
communication delays. The PushSum gossip algorithm is robust to these issues,
but only performs approximate distributed averaging. This paper studies
Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient
updates. We prove that SGP converges to a stationary point of smooth,
non-convex objectives at the same sub-linear rate as SGD, and that all nodes
achieve consensus. We empirically validate the performance of SGP on image
classification (ResNet-50, ImageNet) and machine translation (Transformer,
WMT'16 En-De) workloads. Our code will be made publicly available. Comment: ICML 2019.
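A minimal sketch of the PushSum-plus-SGD idea: each node holds a numerator x and a push-sum weight w, takes gradient steps at the de-biased point z = x / w, and pushes both x and w through a column-stochastic mixing matrix. The directed ring, quadratic objective, and step size below are illustrative assumptions rather than the paper's experimental setup.
```python
import numpy as np

rng = np.random.default_rng(2)

# Toy objective: each node holds a quadratic centred at a private target.
n_nodes, dim = 5, 3
targets = rng.normal(size=(n_nodes, dim))

# Column-stochastic mixing matrix for a directed ring: each node keeps half of
# its mass and pushes half to its out-neighbour (PushSum needs columns, not
# rows, to sum to one).
P = np.zeros((n_nodes, n_nodes))
for j in range(n_nodes):
    P[j, j] = 0.5
    P[(j + 1) % n_nodes, j] = 0.5

x = np.zeros((n_nodes, dim))   # push-sum numerators
w = np.ones(n_nodes)           # push-sum weights (denominators)
lr = 0.1

for _ in range(300):
    z = x / w[:, None]          # de-biased estimates used for gradients
    grads = z - targets         # gradient of 0.5 * ||z - target||^2
    x = x - lr * grads          # local SGD step on the numerator
    x = P @ x                   # push step: mix numerators
    w = P @ w                   # push step: mix weights

z = x / w[:, None]
print("consensus spread:", np.linalg.norm(z - z.mean(axis=0)))
print("distance to average target:", np.linalg.norm(z[0] - targets.mean(axis=0)))
```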
Robust and Communication-Efficient Collaborative Learning
We consider a decentralized learning problem, where a set of computing nodes
aim at solving a non-convex optimization problem collaboratively. It is
well-known that decentralized optimization schemes face two major system
bottlenecks: stragglers' delay and communication overhead. In this paper, we
tackle these bottlenecks by proposing a novel decentralized and gradient-based
optimization algorithm named QuanTimed-DSGD. Our algorithm builds on two
main ideas: (i) we impose a deadline on the local gradient computations of each
node at each iteration of the algorithm, and (ii) the nodes exchange quantized
versions of their local models. The first idea provides robustness to straggling
nodes, and the second reduces the communication overhead. The key technical
contribution of our work is to prove that, with non-vanishing quantization and
stochastic gradient noise, the proposed method converges exactly to
the global optimum for convex loss functions and finds a first-order
stationary point in non-convex scenarios. Our numerical evaluations of the
QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate
speedups of up to 3x in run time compared to state-of-the-art decentralized
optimization methods.
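A toy sketch of the two ingredients named above, a per-iteration compute deadline (each node uses only the gradients it finished in time) and exchange of quantized models, on an invented least-squares problem. The quantizer, deadline model, topology, and step size are assumptions for illustration, not the authors' exact QuanTimed-DSGD update or schedule.
```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(v, step=0.05):
    """Unbiased stochastic rounding of each coordinate to a grid of size `step`."""
    low = np.floor(v / step) * step
    prob_up = (v - low) / step
    return low + step * (rng.random(v.shape) < prob_up)

# Toy decentralized least squares on a ring of nodes.
n_nodes, dim, n_samples = 6, 4, 200
w_true = rng.normal(size=dim)
X = [rng.normal(size=(n_samples, dim)) for _ in range(n_nodes)]
y = [x @ w_true + 0.1 * rng.normal(size=n_samples) for x in X]
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i], W[i, (i + 1) % n_nodes], W[i, (i - 1) % n_nodes] = 0.5, 0.25, 0.25

models = np.zeros((n_nodes, dim))
lr = 0.05

for _ in range(500):
    # (ii) models are quantized before being mixed with neighbours
    q = np.stack([quantize(m) for m in models])
    mixed = W @ q
    new_models = np.empty_like(models)
    for i in range(n_nodes):
        # (i) deadline: each node averages only the gradients it finished in time
        done = int(rng.integers(5, 40))       # samples processed before the deadline
        idx = rng.choice(n_samples, size=done, replace=False)
        g = X[i][idx].T @ (X[i][idx] @ models[i] - y[i][idx]) / done
        new_models[i] = mixed[i] - lr * g
    models = new_models

print("max distance to w_true:", max(np.linalg.norm(m - w_true) for m in models))
```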
Distributed Stochastic Multi-Task Learning with Graph Regularization
We propose methods for distributed graph-based multi-task learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or a diminishing stepsize in these methods would yield consensus (single-task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.
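A small sketch of the skewed-averaging idea on two machines with related regression tasks: a self-weight of 0.5 corresponds to uniform averaging (consensus, single-task learning), while a self-weight near 1 keeps the per-machine models distinct but still shares information. The tasks, step size, and the run helper are invented for illustration and are not the paper's estimators.
```python
import numpy as np

rng = np.random.default_rng(4)

# Two machines, each with its own related regression task.
dim = 3
w_tasks = [np.ones(dim), np.ones(dim) + 0.5]      # related but different targets
X = [rng.normal(size=(200, dim)) for _ in range(2)]
y = [x @ w for x, w in zip(X, w_tasks)]

def run(self_weight, steps=2000, lr=0.05, batch=8):
    """Distributed SGD where each machine mixes its model with the other's.
    self_weight = 0.5 gives uniform averaging (consensus learning); a value
    close to 1 keeps the tasks distinct while still sharing information."""
    w = [np.zeros(dim), np.zeros(dim)]
    for _ in range(steps):
        new = []
        for i in (0, 1):
            idx = rng.choice(len(y[i]), size=batch, replace=False)
            g = X[i][idx].T @ (X[i][idx] @ w[i] - y[i][idx]) / batch
            mixed = self_weight * w[i] + (1 - self_weight) * w[1 - i]
            new.append(mixed - lr * g)
        w = new
    return w

for sw in (0.5, 0.9):
    w = run(sw)
    print(f"self_weight={sw}: model gap = {np.linalg.norm(w[0] - w[1]):.3f}")
```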
Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms
Communication-efficient SGD algorithms, which allow nodes to perform local
updates and periodically synchronize local models, are highly effective in
improving the speed and scalability of distributed SGD. However, a rigorous
convergence analysis and comparative study of different communication-reduction
strategies remains a largely open problem. This paper presents a unified
framework called Cooperative SGD that subsumes existing communication-efficient
SGD algorithms such as periodic-averaging, elastic-averaging and decentralized
SGD. By analyzing Cooperative SGD, we provide novel convergence guarantees for
existing algorithms. Moreover, this framework enables us to design new
communication-efficient SGD algorithms that strike the best balance between
reducing communication overhead and achieving fast error convergence with low
error floor.
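A minimal sketch of the periodic-averaging special case that the framework subsumes: every worker takes tau local SGD steps and then all workers synchronize to the average model. The toy quadratics, noise level, and value of tau below are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(5)

# Toy objective: K workers, each with a private quadratic; the global optimum
# is the mean of the local targets.
K, dim, tau = 4, 3, 10          # tau = number of local steps between averaging
targets = rng.normal(size=(K, dim))
x = np.zeros((K, dim))
lr = 0.05

for t in range(1, 401):
    # local SGD step on each worker (noisy gradient of 0.5 * ||x - target||^2)
    grads = (x - targets) + 0.1 * rng.normal(size=x.shape)
    x = x - lr * grads
    # periodic averaging: every tau steps all workers synchronise to the mean
    if t % tau == 0:
        x[:] = x.mean(axis=0)

print("final consensus spread:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```
Fewer synchronizations (larger tau) cut communication but let the workers drift further between averaging rounds, which is the trade-off the framework analyzes.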
Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication
Recently, the decentralized optimization problem has been attracting growing
attention. Most existing methods are deterministic with a high per-iteration cost
and have a convergence rate that depends quadratically on the problem condition
number. Moreover, dense communication is necessary to ensure convergence
even when the dataset is sparse. In this paper, we generalize the decentralized
optimization problem to a monotone operator root finding problem, and propose a
stochastic algorithm named DSBA that (i) converges geometrically at a rate that
depends linearly on the problem condition number, and (ii) can be implemented
using sparse communication only. Additionally, DSBA handles learning problems
like AUC-maximization which cannot be tackled efficiently in the decentralized
setting. Experiments on convex minimization and AUC-maximization validate the
efficiency of our method. Comment: Accepted to ICML 2018.
Distributed Convex Optimization With Limited Communications
In this paper, a distributed convex optimization algorithm, termed
\emph{distributed coordinate dual averaging} (DCDA), is proposed. The
DCDA algorithm addresses the scenario of a large distributed optimization
problem with limited communication among nodes in the network. Currently known
distributed subgradient methods, such as the distributed dual averaging or the
distributed alternating direction method of multipliers algorithms, assume that
nodes can exchange messages of large cardinality. Such network communication
capabilities are not available in many scenarios of practical relevance. In the
DCDA algorithm, on the other hand, communication of each coordinate of the
optimization variable is restricted over time. For the proposed algorithm, we
bound the rate of convergence under different communication protocols and
network architectures. We also consider the extensions to the case of imperfect
gradient knowledge and the case in which transmitted messages are corrupted by
additive noise or are quantized. Relevant numerical simulations are also
provided. Comment: Extended version of a submission to IEEE ICASSP 201
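A rough sketch of the coordinate-limited communication idea layered on textbook distributed dual averaging: accumulated gradients are mixed with neighbors, but only a small random subset of coordinates is exchanged in any given round. This is not the exact DCDA protocol or its communication schedule; the ring topology, coordinate budget, and step-size rule are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(6)

# Toy problem: n nodes jointly minimise sum_i 0.5 * ||x - c_i||^2 over x.
n_nodes, dim = 5, 6
c = rng.normal(size=(n_nodes, dim))

# Doubly-stochastic mixing matrix on a ring.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i], W[i, (i + 1) % n_nodes], W[i, (i - 1) % n_nodes] = 0.5, 0.25, 0.25

z = np.zeros((n_nodes, dim))    # accumulated (dual-averaged) gradients
x = np.zeros((n_nodes, dim))

for t in range(1, 1001):
    grads = x - c                                 # local (sub)gradients
    # Communication restriction: only a few coordinates are mixed this round;
    # the remaining coordinates of z are kept locally unchanged.
    coords = rng.choice(dim, size=2, replace=False)
    z_mixed = z.copy()
    z_mixed[:, coords] = W @ z[:, coords]
    z = z_mixed + grads
    x = -z / (np.sqrt(t) + 10.0)                  # dual-averaging primal step

print("consensus spread:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - c.mean(axis=0)))
```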
On Data Dependence in Distributed Stochastic Optimization
We study a distributed consensus-based stochastic gradient descent (SGD)
algorithm and show that the rate of convergence involves the spectral
properties of two matrices: the standard spectral gap of a weight matrix from
the network topology and a new term depending on the spectral norm of the
sample covariance matrix of the data. This data-dependent convergence rate
shows that distributed SGD algorithms perform better on datasets with small
spectral norm. Our analysis method also allows us to find data-dependent
convergence rates as we limit the amount of communication. Spreading a fixed
amount of data across more nodes slows convergence; for asymptotically growing
datasets, we show that adding more machines can help when minimizing
twice-differentiable losses.
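The two spectral quantities the stated rate depends on are straightforward to compute directly; the sketch below evaluates the spectral gap of a ring-topology weight matrix and the spectral norm of the sample covariance of a synthetic dataset. The specific topology and data are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(7)

# Spectral gap of a doubly-stochastic weight matrix on a ring of m nodes.
m = 8
W = np.zeros((m, m))
for i in range(m):
    W[i, i], W[i, (i + 1) % m], W[i, (i - 1) % m] = 0.5, 0.25, 0.25
eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
spectral_gap = 1.0 - eigvals[1]        # 1 minus the second-largest eigenvalue magnitude

# Spectral norm of the sample covariance of a dataset (the data-dependent term).
X = rng.normal(size=(1000, 20)) @ np.diag(np.linspace(0.1, 2.0, 20))
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
cov_spectral_norm = np.linalg.norm(cov, 2)

print(f"spectral gap of W: {spectral_gap:.3f}")
print(f"||sample covariance||_2: {cov_spectral_norm:.3f}")
```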
Graph Balancing for Distributed Subgradient Methods over Directed Graphs
We consider a multi-agent optimization problem in which a set of agents
collectively solves a global optimization problem with the objective function
given by the sum of locally known convex functions. We focus on the case when
information exchange among agents takes place over a directed network and
propose a distributed subgradient algorithm in which each agent performs local
processing based on information obtained from its incoming neighbors. Our
algorithm uses weight balancing to overcome the asymmetries caused by the
directed communication network, i.e., agents scale their outgoing information
with dynamically updated weights that converge to balancing weights of the
graph. We show that both the objective function values and the consensus
violation, at the ergodic average of the estimates generated by the algorithm,
converge at a sublinear rate in the number of iterations. A special case of our
algorithm provides a new distributed method to compute average consensus over
directed graphs.
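A small sketch of the balance condition that the dynamically updated weights converge to: each agent scales all of its outgoing information by a single weight, and the graph is balanced when every agent's total scaled outgoing weight equals its total incoming weight. The damped fixed-point iteration and the example digraph below are illustrative assumptions, not the paper's exact weight update.
```python
import numpy as np

# Directed graph on 4 nodes (i -> j means i sends to j), strongly connected.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
n = 4
out_deg = np.zeros(n)
A_in = np.zeros((n, n))          # A_in[i, j] = 1 if there is an edge j -> i
for j, i in edges:
    A_in[i, j] = 1.0
    out_deg[j] += 1

# Each node scales everything it sends out by a single weight b[i].  The graph
# is balanced when, for every node, total outgoing weight equals total incoming
# weight: out_deg[i] * b[i] == sum of b[j] over in-neighbours j.
b = np.ones(n)
for _ in range(200):
    # damped fixed-point iteration toward the balancing weights
    b = 0.5 * b + 0.5 * (A_in @ b) / out_deg

imbalance = out_deg * b - A_in @ b
print("balancing weights:", np.round(b, 3))
print("max imbalance:", np.abs(imbalance).max())
```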