
    Distributed Delayed Stochastic Optimization

    We analyze the convergence of gradient-based optimization algorithms that base their updates on delayed stochastic gradient information. The main application of our results is to the development of gradient-based distributed optimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. We take motivation from statistical problems where the size of the data is so large that it cannot fit on one computer; with the advent of huge datasets in biology, astronomy, and the internet, such problems are now common. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible and we can achieve order-optimal convergence results. In application to distributed optimization, we develop procedures that overcome communication bottlenecks and synchronization requirements. We show n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as \mathcal{O}(1/\sqrt{nT}) after T iterations. This rate is known to be optimal for a distributed system with n nodes even in the absence of delays. We additionally complement our theoretical results with numerical experiments on a statistical machine learning task.
    Comment: 27 pages, 4 figures
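    The master/worker pattern described above can be illustrated with a minimal sketch: updates are applied using gradients that were computed at parameters several steps old. Everything here is illustrative and not the paper's exact procedure; the quadratic objective, the fixed delay, the noise level, and the step-size schedule are all assumptions made for the example.

        # Minimal sketch (not the paper's exact algorithm): SGD whose updates use
        # gradients evaluated at parameters that are `delay` iterations stale,
        # mimicking asynchronous master/worker communication.
        import numpy as np

        rng = np.random.default_rng(0)
        d, T, delay, lr = 10, 5000, 5, 0.05
        x_star = rng.normal(size=d)           # synthetic ground-truth parameters

        def stochastic_grad(x):
            """Noisy gradient of 0.5*||x - x_star||^2 (stand-in for a data-driven loss)."""
            return (x - x_star) + 0.1 * rng.normal(size=d)

        x = np.zeros(d)
        history = [x.copy()]                  # past iterates, used to look up stale parameters
        for t in range(T):
            stale = history[max(0, len(history) - 1 - delay)]   # parameters `delay` steps old
            g = stochastic_grad(stale)        # worker's gradient, delayed by asynchrony
            x = x - lr / np.sqrt(t + 1) * g   # master update with a decaying step size
            history.append(x.copy())

        print("distance to optimum:", np.linalg.norm(x - x_star))

    Running the sketch shows the qualitative point of the abstract: with a decaying step size, a modest fixed delay barely changes where the iterates end up.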

    Local SGD Converges Fast and Communicates Little

    Mini-batch stochastic gradient descent (SGD) is state of the art in large-scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice because the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice but has so far eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size. The number of communication rounds can be reduced by up to a factor of T^{1/2}, where T denotes the total number of steps, compared to mini-batch SGD. This also holds for asynchronous implementations. Local SGD can also be used for large-scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.
    Comment: to appear at ICLR 2019, 19 pages
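    The communication pattern of local SGD is simple enough to sketch directly: each worker runs several independent SGD steps on its own samples, and the models are averaged only once per round. The sketch below is illustrative only; the quadratic objective, the noise model, and the values of H, the learning rate, and the number of workers are assumptions, not the paper's setup.

        # Minimal sketch of the local-SGD pattern: H local steps per worker, then one averaging
        # (communication) step per round.
        import numpy as np

        rng = np.random.default_rng(1)
        d, n_workers, rounds, H, lr = 10, 4, 200, 8, 0.05
        x_star = rng.normal(size=d)           # synthetic ground-truth parameters

        def stochastic_grad(x):
            # noisy gradient of a simple quadratic, standing in for a local mini-batch
            return (x - x_star) + 0.1 * rng.normal(size=d)

        x = np.zeros(d)
        for r in range(rounds):
            local_models = []
            for _ in range(n_workers):
                w = x.copy()
                for _ in range(H):            # H local steps without any communication
                    w -= lr * stochastic_grad(w)
                local_models.append(w)
            x = np.mean(local_models, axis=0) # single averaging step ends the round

        print("distance to optimum:", np.linalg.norm(x - x_star))

    Compared with mini-batch SGD, which would synchronize after every gradient evaluation, this loop communicates only once per H local steps, which is the frequency reduction the abstract quantifies.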

    Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling

    The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. It arises in various application domains, including distributed tracking and localization, multi-agent coordination, estimation in sensor networks, and large-scale optimization in machine learning. We develop and analyze distributed algorithms based on dual averaging of subgradients, and we provide sharp bounds on their convergence rates as a function of the network size and topology. Our method of analysis allows for a clear separation between the convergence of the optimization algorithm itself and the effects of communication constraints arising from the network structure. In particular, we show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and by simulations on various networks. Our approach covers both deterministic optimization and communication, as well as problems with stochastic optimization and/or communication.
    Comment: 40 pages, 4 figures
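    A minimal sketch of the distributed dual-averaging idea follows: each node mixes its neighbors' dual (accumulated-subgradient) variables through a doubly stochastic matrix, adds its own local subgradient, and maps the dual average back to a primal point. This is only an illustration under assumptions that are not in the abstract: a ring network, quadratic local objectives, the proximal function 0.5*||x||^2, and a 1/sqrt(t) step-size schedule.

        # Minimal sketch of distributed dual averaging over a ring graph (illustrative).
        import numpy as np

        rng = np.random.default_rng(2)
        n, d, T = 8, 5, 2000
        targets = rng.normal(size=(n, d))     # node i's local objective: 0.5*||x - targets[i]||^2

        # Doubly stochastic mixing matrix for a ring: average self and the two neighbors.
        P = np.zeros((n, n))
        for i in range(n):
            P[i, i] = P[i, (i - 1) % n] = P[i, (i + 1) % n] = 1.0 / 3.0

        z = np.zeros((n, d))                  # dual variables (accumulated subgradients)
        x = np.zeros((n, d))                  # primal iterates, one row per node
        for t in range(1, T + 1):
            grads = x - targets               # local gradients at the current primal points
            z = P @ z + grads                 # gossip on the duals, then add the local subgradient
            x = -z / np.sqrt(t)               # primal map for the proximal term 0.5*||x||^2

        x_bar = x.mean(axis=0)
        global_opt = targets.mean(axis=0)     # minimizer of the sum of the local quadratics
        print("consensus error:", np.linalg.norm(x - x_bar, axis=1).max())
        print("distance to global optimum:", np.linalg.norm(x_bar - global_opt))

    Making the ring larger (or any graph with a smaller spectral gap) slows the mixing step z = P @ z + grads, which is the network dependence the abstract's bounds capture.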

    Efficient Distributed Online Prediction and Stochastic Optimization with Approximate Distributed Averaging

    We study distributed methods for online prediction and stochastic optimization. Our approach is iterative: in each round nodes first perform local computations and then communicate in order to aggregate information and synchronize their decision variables. Synchronization is accomplished through the use of a distributed averaging protocol. When an exact distributed averaging protocol is used, it is known that the optimal regret bound of \mathcal{O}(\sqrt{m}) can be achieved using the distributed mini-batch algorithm of Dekel et al. (2012), where m is the total number of samples processed across the network. We focus on methods using approximate distributed averaging protocols and show that the optimal regret bound can also be achieved in this setting. In particular, we propose a gossip-based optimization method which achieves the optimal regret bound. The amount of communication required depends on the network topology through the second largest eigenvalue of the transition matrix of a random walk on the network. In the setting of stochastic optimization, the proposed gossip-based approach achieves nearly-linear scaling: the optimization error is guaranteed to be no more than \epsilon after \mathcal{O}(\frac{1}{n \epsilon^2}) rounds, each of which involves \mathcal{O}(\log n) gossip iterations, when nodes communicate over a well-connected graph. This scaling law is also observed in numerical experiments on a cluster.
    Comment: 30 pages, 2 figures
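    The round structure described above (local mini-batch gradients, then approximate averaging by gossip instead of an exact average) can be sketched as follows. The sketch is illustrative rather than the paper's method: the ring topology, the quadratic objective, the constant learning rate, and the fixed number of gossip iterations per round are all assumptions.

        # Minimal sketch of approximate distributed averaging via gossip: each round, nodes
        # compute local gradients, mix them for a few gossip iterations with a doubly
        # stochastic matrix W, and step with the near-consensus average.
        import numpy as np

        rng = np.random.default_rng(3)
        n, d, rounds, gossip_iters, lr = 8, 5, 300, 10, 0.1
        x_star = rng.normal(size=d)           # synthetic ground-truth parameters

        # Doubly stochastic gossip matrix for a ring graph.
        W = np.zeros((n, n))
        for i in range(n):
            W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

        def local_grad(x):
            # noisy mini-batch gradient of 0.5*||x - x_star||^2 at one node's point
            return (x - x_star) + 0.1 * rng.normal(size=x.shape)

        x = np.zeros((n, d))                  # one decision variable per node
        for r in range(rounds):
            g = np.vstack([local_grad(x[i]) for i in range(n)])
            avg = g.copy()
            for _ in range(gossip_iters):     # approximate averaging instead of an exact one
                avg = W @ avg                 # each node mixes with its ring neighbors
            x = x - lr * avg                  # every node steps with its near-consensus gradient

        print("consensus spread:", np.ptp(x, axis=0).max())
        print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - x_star))

    The number of gossip iterations needed for the nodes' averages to agree closely is governed by how well connected the graph is, which is the role the second largest eigenvalue plays in the abstract's communication bound.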