1,937 research outputs found
Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is state of the art in large
scale distributed training. The scheme can reach a linear speedup with respect
to the number of workers, but this is rarely seen in practice as the scheme
often suffers from large network delays and bandwidth limits. To overcome this
communication bottleneck, recent works propose to reduce the communication
frequency. An algorithm of this type is local SGD, which runs SGD independently
in parallel on different workers and averages the sequences only once in a
while.
This scheme shows promising results in practice, but has so far eluded thorough
theoretical analysis. We prove concise convergence rates for local SGD on
convex problems and show that it converges at the same rate as mini-batch SGD
in terms of the number of evaluated gradients; that is, the scheme achieves linear
speedup in the number of workers and mini-batch size. The number of
communication rounds can be reduced by up to a factor of T^{1/2}, where T denotes
the total number of steps, compared to mini-batch SGD. This also holds for
asynchronous implementations. Local SGD can also be used for large scale
training of deep learning models.
The results shown here aim to serve as a guideline for further exploring the
theoretical and practical aspects of local SGD in these applications.
Comment: to appear at ICLR 2019, 19 pages
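As an informal illustration of the scheme described above, the following is a minimal sketch of local SGD on a synthetic least-squares problem; the worker count, the number of local steps H, the step size, and the data are illustrative choices and not taken from the paper.

```python
# Minimal local-SGD sketch on a synthetic least-squares problem (illustrative
# values only: 4 workers, H = 10 local steps per communication round).
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(256, 10)), rng.normal(size=256)   # synthetic data
workers = 4
shards = np.array_split(np.arange(len(b)), workers)        # one data shard per worker

def stochastic_grad(x, idx):
    i = rng.choice(idx)                                      # sample one example from the shard
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(10)
step, H, rounds = 0.05, 10, 50
for _ in range(rounds):
    local = []
    for idx in shards:                                       # each worker starts from the shared iterate
        xi = x.copy()
        for _ in range(H):                                    # H independent local SGD steps
            xi -= step * stochastic_grad(xi, idx)
        local.append(xi)
    x = np.mean(local, axis=0)                               # communicate: average the local models
print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```

Communication happens only once per round of H local steps, which is exactly the frequency reduction the abstract refers to.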
Distributed Delayed Stochastic Optimization
We analyze the convergence of gradient-based optimization algorithms that
base their updates on delayed stochastic gradient information. The main
application of our results is to the development of gradient-based distributed
optimization algorithms where a master node performs parameter updates while
worker nodes compute stochastic gradients based on local information in
parallel, which may give rise to delays due to asynchrony. We take motivation
from statistical problems where the size of the data is so large that it cannot
fit on one computer; with the advent of huge datasets in biology, astronomy,
and the internet, such problems are now common. Our main contribution is to
show that for smooth stochastic problems, the delays are asymptotically
negligible and we can achieve order-optimal convergence results. In application
to distributed optimization, we develop procedures that overcome communication
bottlenecks and synchronization requirements. We show n-node architectures
whose optimization error in stochastic problems, in spite of asynchronous
delays, scales asymptotically as \mathcal{O}(1 / \sqrt{nT}) after T iterations.
This rate is known to be optimal for a distributed system with n nodes even
in the absence of delays. We additionally complement our theoretical results
with numerical experiments on a statistical machine learning task.
Comment: 27 pages, 4 figures
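As a rough illustration of the delayed-gradient setting, the sketch below simulates a master that applies stochastic gradients evaluated at iterates that are roughly tau steps stale, with a decaying step size on a synthetic least-squares problem; the fixed delay, the objective, and the step-size constant are simplifying assumptions rather than the paper's protocol.

```python
# Minimal sketch of stochastic gradient descent with delayed gradients
# (a fixed delay tau stands in for worker asynchrony; illustrative values only).
import numpy as np
from collections import deque

rng = np.random.default_rng(1)
A, b = rng.normal(size=(512, 10)), rng.normal(size=512)    # synthetic data

def stochastic_grad(x):
    i = rng.integers(len(b))                                 # sample one example
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(10)
tau, T = 5, 3000
stale = deque([x.copy()] * tau, maxlen=tau)                  # iterates the workers last saw
for t in range(1, T + 1):
    g = stochastic_grad(stale[0])                            # gradient at a roughly tau-step-old iterate
    x -= (0.05 / np.sqrt(t)) * g                             # master applies the delayed gradient
    stale.append(x.copy())                                   # the new iterate becomes visible later
print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```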
On the convergence of mirror descent beyond stochastic convex programming
In this paper, we examine the convergence of mirror descent in a class of
stochastic optimization problems that are not necessarily convex (or even
quasi-convex), and which we call variationally coherent. Since the standard
technique of "ergodic averaging" offers no tangible benefits beyond convex
programming, we focus directly on the algorithm's last generated sample (its
"last iterate"), and we show that it converges with probabiility if the
underlying problem is coherent. We further consider a localized version of
variational coherence which ensures local convergence of stochastic mirror
descent (SMD) with high probability. These results contribute to the landscape
of non-convex stochastic optimization by showing that (quasi-)convexity is not
essential for convergence to a global minimum: rather, variational coherence, a
much weaker requirement, suffices. Finally, building on the above, we reveal an
interesting insight regarding the convergence speed of SMD: in problems with
sharp minima (such as generic linear programs or concave minimization
problems), SMD reaches a minimum point in a finite number of steps (a.s.), even
in the presence of persistent gradient noise. This result is to be contrasted
with existing black-box convergence rate estimates that are only asymptotic.
Comment: 30 pages, 5 figures
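To make the last-iterate behaviour concrete, the sketch below runs stochastic mirror descent with the entropic mirror map on a noisy linear objective over the probability simplex, a problem with a sharp minimum at a vertex; the objective vector, noise level, step size, and horizon are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of stochastic mirror descent (entropic mirror map) on a noisy
# linear program over the simplex; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
c = np.array([0.9, 0.2, 0.5, 0.7])        # linear objective <c, x> over the simplex
x = np.full(4, 0.25)                       # start at the uniform distribution
step, T = 0.1, 2000
for _ in range(T):
    g = c + 0.3 * rng.normal(size=4)       # noisy gradient of the linear objective
    x = x * np.exp(-step * g)              # entropic (multiplicative-weights) mirror step
    x /= x.sum()                           # normalize back onto the simplex
print("last iterate:", np.round(x, 3))     # concentrates on the minimizing coordinate
```

Despite persistent gradient noise, the last iterate ends up concentrated on the coordinate with the smallest cost, in line with the sharp-minimum behaviour described above.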