
    Asynchronous Decentralized Parallel Stochastic Gradient Descent

    Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from (1) a communication bottleneck at the parameter servers when there are many workers, and (2) significantly worse convergence when the traffic to the parameter servers is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication-efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all of the above requirements. Our theoretical analysis shows that AD-PSGD converges at the same optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data-parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD at an over-100-GPU scale.
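
    Below is a minimal, single-process sketch of the update pattern the abstract describes, not the authors' implementation: at each atomic step a randomly chosen worker averages its model with one random neighbor on a ring and then applies a local stochastic gradient. The least-squares objective, ring topology, and all constants are illustrative assumptions.

        import numpy as np

        # Toy simulation of asynchronous decentralized SGD: workers fire in a
        # random order, so gossip averaging and gradient updates interleave
        # without any global barrier.
        rng = np.random.default_rng(0)
        n_workers, dim, lr = 8, 5, 0.05
        A = rng.normal(size=(n_workers, 64, dim))        # each worker's local data
        w_true = rng.normal(size=dim)
        b = [A[i] @ w_true + 0.01 * rng.normal(size=64) for i in range(n_workers)]
        x = [np.zeros(dim) for _ in range(n_workers)]    # local models

        for step in range(4000):
            i = rng.integers(n_workers)                  # worker that fires next
            j = (i + rng.choice([-1, 1])) % n_workers    # random ring neighbor
            mean = 0.5 * (x[i] + x[j])                   # gossip averaging step
            x[i], x[j] = mean, mean.copy()
            idx = rng.integers(0, 64, size=8)            # sampled mini-batch
            grad = A[i][idx].T @ (A[i][idx] @ x[i] - b[i][idx]) / len(idx)
            x[i] = x[i] - lr * grad                      # local gradient update

        print("consensus gap:", np.linalg.norm(x[0] - x[-1]))
        print("distance to optimum:", np.linalg.norm(np.mean(x, axis=0) - w_true))

    In a real deployment each worker would run this loop concurrently and exchange models over the network; the sequential simulation only illustrates why no global synchronization barrier is needed.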

    Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization

    Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, the algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems. Comment: final version as appeared in NeurIPS 2018.
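
    The following sketch illustrates only the generic "average over space and time" pattern on a toy quadratic objective, not the paper's primal-dual MSPBE algorithm: each agent mixes parameters with its neighbors through a doubly stochastic matrix W (space) and keeps a running average of its local stochastic gradients (time). Topology, objective, and step size are assumptions.

        import numpy as np

        # Each agent averages parameters over space (neighbors, via W) and
        # gradients over time (running average), on a toy quadratic problem.
        rng = np.random.default_rng(1)
        n_agents, dim, lr, T = 4, 3, 0.1, 500
        W = np.zeros((n_agents, n_agents))     # doubly stochastic ring mixing
        for i in range(n_agents):
            W[i, i] = 0.5
            W[i, (i - 1) % n_agents] = 0.25
            W[i, (i + 1) % n_agents] = 0.25

        targets = rng.normal(size=(n_agents, dim))   # private local objectives
        theta = np.zeros((n_agents, dim))
        g_avg = np.zeros((n_agents, dim))            # time-averaged gradients

        for t in range(1, T + 1):
            grads = theta - targets + 0.1 * rng.normal(size=theta.shape)
            g_avg = (1 - 1.0 / t) * g_avg + (1.0 / t) * grads  # average over time
            theta = W @ theta - lr * g_avg                     # average over space

        print("consensus gap:", np.linalg.norm(theta[0] - theta[-1]))

    The time average suppresses stochastic-gradient noise while the mixing matrix drives the agents toward consensus; the paper couples these two averages inside a primal-dual update rather than a plain gradient step.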

    Stochastic Gradient Push for Distributed Deep Learning

    Distributed data-parallel algorithms aim to accelerate the training of deep neural networks by parallelizing the computation of large mini-batch gradient updates across multiple nodes. Approaches that synchronize nodes using exact distributed averaging (e.g., via AllReduce) are sensitive to stragglers and communication delays. The PushSum gossip algorithm is robust to these issues, but only performs approximate distributed averaging. This paper studies Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient updates. We prove that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus. We empirically validate the performance of SGP on image classification (ResNet-50, ImageNet) and machine translation (Transformer, WMT'16 En-De) workloads. Our code will be made publicly available. Comment: ICML 2019.
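
    Here is a toy, single-process sketch of the PushSum mechanics under assumed details (directed topology, quadratic local objectives, constants): each node carries a value and a push-sum weight, pushes both along outgoing edges via a column-stochastic matrix, and takes gradient steps at the de-biased ratio of the two.

        import numpy as np

        # PushSum + SGD: node i keeps a value x_i and a push-sum weight w_i,
        # and evaluates gradients at the de-biased estimate z_i = x_i / w_i.
        rng = np.random.default_rng(2)
        n, dim, lr, T = 6, 4, 0.05, 800
        P = np.zeros((n, n))
        for i in range(n):                     # directed ring: i pushes to i+1
            P[i, i] = 0.5
            P[(i + 1) % n, i] = 0.5
        P[:, 0] = 0.0                          # node 0 pushes to two neighbors,
        P[[0, 1, 2], 0] = 1.0 / 3.0            # so P is column- but not row-stochastic

        targets = rng.normal(size=(n, dim))    # toy local objectives
        x, w = np.zeros((n, dim)), np.ones(n)

        for t in range(T):
            z = x / w[:, None]                 # de-biased parameter estimate
            g = z - targets + 0.1 * rng.normal(size=z.shape)
            x = P @ (x - lr * g)               # push values along out-edges
            w = P @ w                          # push weights along out-edges

        z = x / w[:, None]
        print("consensus gap:", np.linalg.norm(z[0] - z[-1]))

    Because P is only column-stochastic, the raw values x drift apart in scale; dividing by the push-sum weights w is what recovers an (approximate) average, which is the robustness PushSum trades for exactness.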

    Robust and Communication-Efficient Collaborative Learning

    We consider a decentralized learning problem, where a set of computing nodes aims to solve a non-convex optimization problem collaboratively. It is well known that decentralized optimization schemes face two major system bottlenecks: straggler delays and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized, gradient-based optimization algorithm named QuanTimed-DSGD. Our algorithm rests on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea makes the algorithm robust to straggling nodes, and the second reduces the communication overhead. The key technical contribution of our work is to prove that, despite non-vanishing quantization and stochastic-gradient noise, the proposed method exactly converges to the global optimum for convex loss functions, and finds a first-order stationary point in non-convex scenarios. Our numerical evaluations of QuanTimed-DSGD on the benchmark datasets MNIST and CIFAR-10 demonstrate speedups of up to 3x in run-time compared to state-of-the-art decentralized optimization methods.
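
    A small sketch of the two ingredients named above, with assumed details that may differ from the paper's: the deadline is simulated by a random number of gradient samples "finishing" in time, and neighbors' models are exchanged through an unbiased stochastic quantizer.

        import numpy as np

        # Deadline-limited gradients plus quantized model exchange on a ring.
        rng = np.random.default_rng(3)

        def quantize(v, levels=8):
            # unbiased stochastic quantization to a uniform grid (an assumed
            # quantizer, used here only for illustration)
            scale = np.max(np.abs(v)) + 1e-12
            y = np.abs(v) / scale * levels
            up = rng.random(v.shape) < (y - np.floor(y))
            return np.sign(v) * (np.floor(y) + up) * scale / levels

        n, dim, lr, eps = 6, 4, 0.05, 0.3
        targets = rng.normal(size=(n, dim))    # toy local objectives
        x = np.zeros((n, dim))

        for t in range(600):
            x_new = np.empty_like(x)
            for i in range(n):
                k = rng.integers(1, 16)        # samples finished by the deadline
                g = np.mean([x[i] - targets[i] + 0.1 * rng.normal(size=dim)
                             for _ in range(k)], axis=0)
                ql = quantize(x[(i - 1) % n])  # quantized neighbor models
                qr = quantize(x[(i + 1) % n])
                x_new[i] = (1 - eps) * x[i] + 0.5 * eps * (ql + qr) - lr * g
            x = x_new

        print("consensus gap:", np.linalg.norm(x[0] - x[-1]))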

    Distributed Stochastic Multi-Task Learning with Graph Regularization

    We propose methods for distributed graph-based multi-task learning that are based on weighted averaging of messages from other machines. Uniform averaging or a diminishing stepsize in these methods would yield consensus (single-task) learning. We show how simply skewing the averaging weights or controlling the stepsize allows learning different, but related, tasks on the different machines.
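
    A minimal sketch of the skewed-averaging idea, with an assumed toy objective and weights: giving each machine a large self-weight keeps its model anchored to its own task while still borrowing from the others.

        import numpy as np

        # Skewed averaging weights yield distinct but related per-machine models.
        rng = np.random.default_rng(4)
        n, dim, lr, self_weight = 4, 3, 0.1, 0.7
        W = np.full((n, n), (1 - self_weight) / (n - 1))
        np.fill_diagonal(W, self_weight)       # rows sum to 1, skewed to self

        task_opt = rng.normal(size=(n, dim))   # related per-machine optima
        w = np.zeros((n, dim))
        for t in range(500):
            g = w - task_opt + 0.1 * rng.normal(size=w.shape)
            w = W @ w - lr * g                 # weighted averaging + local SGD

        # with skewed weights the machines stay distinct; with uniform weights
        # (self_weight = 1/n) they would converge to a single consensus model
        print("task separation:", np.linalg.norm(w[0] - w[1]))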

    Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms

    Communication-efficient SGD algorithms, which allow nodes to perform local updates and periodically synchronize local models, are highly effective in improving the speed and scalability of distributed SGD. However, a rigorous convergence analysis and a comparative study of different communication-reduction strategies remain largely open problems. This paper presents a unified framework called Cooperative SGD that subsumes existing communication-efficient SGD algorithms such as periodic-averaging, elastic-averaging, and decentralized SGD. By analyzing Cooperative SGD, we provide novel convergence guarantees for existing algorithms. Moreover, this framework enables us to design new communication-efficient SGD algorithms that strike the best balance between reducing communication overhead and achieving fast error convergence with a low error floor.
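
    As a concrete instance of the framework, here is a sketch of periodic-averaging SGD on an assumed toy objective: each node runs tau local steps between synchronizations, and tau = 1 recovers fully synchronous SGD.

        import numpy as np

        # Periodic-averaging SGD: local updates, full averaging every tau steps.
        rng = np.random.default_rng(5)
        n, dim, lr, tau = 8, 4, 0.05, 10
        targets = rng.normal(size=(n, dim))    # toy local objectives
        x = np.zeros((n, dim))

        for t in range(1, 501):
            g = x - targets + 0.1 * rng.normal(size=x.shape)
            x = x - lr * g                     # local SGD step on every node
            if t % tau == 0:
                x[:] = x.mean(axis=0)          # periodic synchronization

        print("final model:", x[0])

    In the Cooperative SGD view, this scheme, elastic averaging, and decentralized SGD differ only in which mixing operation is applied at each iteration, which is what makes a single convergence analysis possible.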

    Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication

    Recently, the decentralized optimization problem has been attracting growing attention. Most existing methods are deterministic, with high per-iteration cost and a convergence rate that depends quadratically on the problem's condition number. Moreover, dense communication is necessary to ensure convergence even if the dataset is sparse. In this paper, we generalize the decentralized optimization problem to a monotone operator root-finding problem and propose a stochastic algorithm named DSBA that (i) converges geometrically, with a rate depending only linearly on the problem's condition number, and (ii) can be implemented using sparse communication only. Additionally, DSBA handles learning problems like AUC maximization which cannot be tackled efficiently in the decentralized setting. Experiments on convex minimization and AUC maximization validate the efficiency of our method. Comment: Accepted to ICML 2018.

    Distributed Convex Optimization With Limited Communications

    In this paper, a distributed convex optimization algorithm, termed the distributed coordinate dual averaging (DCDA) algorithm, is proposed. The DCDA algorithm addresses the scenario of a large distributed optimization problem with limited communication among nodes in the network. Currently known distributed subgradient methods, such as distributed dual averaging or the distributed alternating direction method of multipliers, assume that nodes can exchange messages of large cardinality. Such network communication capabilities are not available in many scenarios of practical relevance. In the DCDA algorithm, on the other hand, the communication of each coordinate of the optimization variable is restricted over time. For the proposed algorithm, we bound the rate of convergence under different communication protocols and network architectures. We also consider extensions to the case of imperfect gradient knowledge and the case in which transmitted messages are corrupted by additive noise or are quantized. Relevant numerical simulations are also provided. Comment: Extended version of a submission to IEEE ICASSP 201
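
    The sketch below illustrates the coordinate-restricted communication idea on a toy problem; the round-robin coordinate schedule, complete-graph mixing, and step sizes are assumptions, not the paper's exact protocol.

        import numpy as np

        # Distributed dual averaging in which only one coordinate of the
        # accumulated (dual) variable is exchanged per round.
        rng = np.random.default_rng(6)
        n, dim, T = 4, 5, 2000
        W = np.full((n, n), 1.0 / n)           # mixing weights (complete graph)
        targets = rng.normal(size=(n, dim))    # toy local objectives
        z = np.zeros((n, dim))                 # accumulated dual variables

        for t in range(1, T + 1):
            x = -z / np.sqrt(t)                # dual-averaging primal map
            g = x - targets                    # local subgradients
            c = t % dim                        # the one coordinate sent this round
            z[:, c] = W @ z[:, c]              # mix only coordinate c ...
            z = z + g                          # ... then accumulate gradients

        print("consensus gap:", np.linalg.norm(x[0] - x[-1]))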

    On Data Dependence in Distributed Stochastic Optimization

    We study a distributed consensus-based stochastic gradient descent (SGD) algorithm and show that the rate of convergence involves the spectral properties of two matrices: the standard spectral gap of a weight matrix from the network topology, and a new term depending on the spectral norm of the sample covariance matrix of the data. This data-dependent convergence rate shows that distributed SGD algorithms perform better on datasets with small spectral norm. Our analysis method also allows us to find data-dependent convergence rates as we limit the amount of communication. Spreading a fixed amount of data across more nodes slows convergence; for asymptotically growing datasets, we show that adding more machines can help when minimizing twice-differentiable losses.
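
    For concreteness, here is a short worked example computing the two quantities the rate depends on, using toy inputs (a 4-node ring and synthetic Gaussian data):

        import numpy as np

        rng = np.random.default_rng(7)

        # (1) spectral gap of a doubly stochastic weight matrix (4-node ring)
        W = np.array([[0.50, 0.25, 0.00, 0.25],
                      [0.25, 0.50, 0.25, 0.00],
                      [0.00, 0.25, 0.50, 0.25],
                      [0.25, 0.00, 0.25, 0.50]])
        eigs = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
        spectral_gap = 1.0 - eigs[1]           # 1 minus second-largest eigenvalue

        # (2) spectral norm of the sample covariance of a data matrix
        X = rng.normal(size=(1000, 20))        # one sample per row
        Xc = X - X.mean(axis=0)
        cov_norm = np.linalg.norm(Xc.T @ Xc / len(X), 2)

        print(f"spectral gap: {spectral_gap:.3f}, covariance norm: {cov_norm:.3f}")

    A larger spectral gap means faster information diffusion through the network, while a smaller covariance spectral norm corresponds to datasets on which, per the result above, distributed SGD converges faster.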

    Graph Balancing for Distributed Subgradient Methods over Directed Graphs

    We consider a multi-agent optimization problem where a set of agents collectively solves a global optimization problem with the objective function given by the sum of locally known convex functions. We focus on the case when information exchange among agents takes place over a directed network, and propose a distributed subgradient algorithm in which each agent performs local processing based on information obtained from its incoming neighbors. Our algorithm uses weight balancing to overcome the asymmetries caused by the directed communication network, i.e., agents scale their outgoing information with dynamically updated weights that converge to balancing weights of the graph. We show that both the objective function values and the consensus violation, at the ergodic average of the estimates generated by the algorithm, converge with rate $O(\frac{\log T}{\sqrt{T}})$, where $T$ is the number of iterations. A special case of our algorithm provides a new distributed method to compute average consensus over directed graphs.
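
    The sketch below illustrates weight balancing in isolation on a small strongly connected digraph, using an assumed multiplicative update rule rather than the paper's exact scheme: each node scales its outgoing arcs by a weight that is nudged until total incoming and outgoing weight match at every node.

        import numpy as np

        # Iteratively balance a directed graph: find per-node scalings y so
        # that weighted in-flow equals weighted out-flow at every node.
        A = np.array([[0, 1, 0, 0],      # adjacency: A[i, j] = 1 iff arc i -> j
                      [0, 0, 1, 1],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0]], dtype=float)
        n = len(A)
        y = np.ones(n)                   # per-node scaling of outgoing arcs

        for _ in range(200):
            Wb = A * y[:, None]          # node i sends weight y_i on each out-arc
            inflow = Wb.sum(axis=0)      # total weight arriving at each node
            outflow = Wb.sum(axis=1)     # total weight leaving each node
            y = y * (inflow / outflow) ** 0.5   # damped nudge toward balance

        Wb = A * y[:, None]
        print("max imbalance:", np.abs(Wb.sum(axis=0) - Wb.sum(axis=1)).max())

    Once the graph is balanced, standard subgradient-plus-consensus updates behave much as they do on an undirected network, which is how the algorithm overcomes the directionality of the communication links.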