
    Understanding Progressive Training Through the Framework of Randomized Coordinate Descent

    Full text link
    We propose a Randomized Progressive Training algorithm (RPT) -- a stochastic proxy for the well-known Progressive Training method (PT) (Karras et al., 2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was proposed as a heuristic, with no convergence analysis even for the simplest objective functions. In contrast, to the best of our knowledge, RPT is the first PT-type algorithm with rigorous and sound theoretical guarantees for general smooth objective functions. We cast our method into the established framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtárik & Takáč, 2014), for which (as a by-product of our investigations) we also propose a novel, simple and general convergence analysis encapsulating strongly convex, convex and nonconvex objectives. We then use this framework to establish a convergence theory for RPT. Finally, we validate the effectiveness of our method through extensive computational experiments.
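    The method is cast into the Randomized Coordinate Descent (RCD) framework. As a rough illustration of that framework (a minimal sketch, not the authors' RPT algorithm itself; the toy quadratic objective, the two coordinate blocks and the per-block step sizes are assumptions made only for this example), block RCD samples a random block of coordinates at each iteration and takes a gradient step on that block alone:

        import numpy as np

        # Sketch of block randomized coordinate descent (RCD): sample a block of
        # coordinates, take a gradient step on that block only. The objective,
        # the block partition and the step sizes are illustrative assumptions.
        rng = np.random.default_rng(0)
        d = 8
        A = rng.standard_normal((d, d))
        A = A.T @ A + np.eye(d)                       # smooth, strongly convex quadratic
        b = rng.standard_normal(d)

        def grad(x):                                  # gradient of 0.5*x'Ax - b'x
            return A @ x - b

        blocks = [np.arange(0, 4), np.arange(4, 8)]   # coordinate blocks ("stages")
        L = [np.linalg.eigvalsh(A[np.ix_(B, B)]).max() for B in blocks]

        x = np.zeros(d)
        for t in range(300):
            i = rng.integers(len(blocks))             # pick a block uniformly at random
            B = blocks[i]
            x[B] -= (1.0 / L[i]) * grad(x)[B]         # update only the sampled block

        print("||grad f(x)|| =", np.linalg.norm(grad(x)))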

    Error Feedback Shines when Features are Rare

    Full text link
    We provide the first proof that gradient descent (GD) with greedy sparsification (TopK) and error feedback (EF) can obtain better communication complexity than vanilla GD when solving the distributed optimization problem $\min_{x\in \mathbb{R}^d} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$, where $n$ = # of clients, $d$ = # of features, and $f_1,\dots,f_n$ are smooth nonconvex functions. Despite intensive research since 2014 when EF was first proposed by Seide et al., this problem remained open until now. We show that EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\lVert \nabla f(\hat{x}) \rVert^2 \leq \varepsilon$ in expectation, GD with the Top1 sparsifier and EF requires ${\cal O}\left(\left( L + r \sqrt{ \frac{c}{n} \min\left( \frac{c}{n} \max_i L_i^2, \frac{1}{n}\sum_{i=1}^n L_i^2 \right) }\right) \frac{1}{\varepsilon} \right)$ bits to be communicated by each worker to the server only, where $L$ is the smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, $c$ is the maximal number of clients owning any feature ($1\leq c \leq n$), and $r$ is the maximal number of features owned by any client ($1\leq r \leq d$). Clearly, the communication complexity improves as $c$ decreases (i.e., as features become more rare), and can be much better than the ${\cal O}(r L \frac{1}{\varepsilon})$ communication complexity of GD in the same regime.
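    As a concrete, hedged illustration of the mechanism in the abstract, the sketch below simulates distributed GD with a Top-1 sparsifier and one common form of error feedback, in which each client keeps the part of its message that was dropped and adds it back before compressing again. The quadratic local losses, the step size, and the numbers of clients and features are assumptions made only for this example, not the paper's setting:

        import numpy as np

        # Single-process simulation of distributed GD with Top-1 sparsification
        # and error feedback (EF). Local losses, step size and problem sizes are
        # illustrative assumptions.
        rng = np.random.default_rng(1)
        n, d = 4, 6
        A = [rng.standard_normal((d, d)) for _ in range(n)]
        A = [M.T @ M / d + np.eye(d) for M in A]      # smooth local quadratics
        b = [rng.standard_normal(d) for _ in range(n)]

        def grad_i(i, x):                             # gradient of f_i
            return A[i] @ x - b[i]

        def topk(v, k=1):                             # greedy Top-k sparsifier
            out = np.zeros_like(v)
            idx = np.argsort(np.abs(v))[-k:]
            out[idx] = v[idx]
            return out

        x = np.zeros(d)
        e = [np.zeros(d) for _ in range(n)]           # per-client error accumulators
        lr = 0.05
        for t in range(2000):
            msgs = []
            for i in range(n):
                v = e[i] + grad_i(i, x)               # add back what was not yet sent
                c = topk(v, k=1)                      # send a single coordinate
                e[i] = v - c                          # remember what was dropped
                msgs.append(c)
            x = x - lr * np.mean(msgs, axis=0)        # server averages the messages

        full_grad = np.mean([grad_i(i, x) for i in range(n)], axis=0)
        print("||grad f(x)||^2 =", np.linalg.norm(full_grad) ** 2)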

    Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

    Full text link
    Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theoretical understanding of this mechanism, there is still a lot to explore. In this work we study a modern form of error feedback called EF21 (Richtárik et al., 2021) which offers the currently best-known theoretical guarantees, under the weakest assumptions, and also works well in practice. In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean, which is always smaller, and can be substantially smaller, especially in heterogeneous data regimes. We take the reader on a journey of our discovery process. Starting with the idea of applying EF21 to an equivalent reformulation of the underlying problem which (unfortunately) requires (often impractical) machine cloning, we continue to the discovery of a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method. While this development applies to the simplest form of EF21, our approach naturally extends to more elaborate variants involving stochastic gradients and partial participation. Further, our technique improves the best-known theory of EF21 in the rare features regime (Richtárik et al., 2023). Finally, we validate our theoretical findings with suitable experiments. Comment: 70 pages, 14 figures, 6 tables.
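    The EF21 rule studied in the paper is compact: each client keeps a gradient estimate g_i, the server steps using the average of these estimates, and each client then communicates only a compressed correction toward its fresh local gradient. The sketch below simulates this update; the quadratic local losses, the Top-1 compressor and the step size are assumptions chosen purely for illustration:

        import numpy as np

        # Simulation of the EF21 update: x <- x - lr * mean(g_i),
        # then g_i <- g_i + C(grad f_i(x) - g_i) with a contractive compressor C.
        # The toy losses, compressor and step size are illustrative assumptions.
        rng = np.random.default_rng(2)
        n, d = 4, 6
        A = [rng.standard_normal((d, d)) for _ in range(n)]
        A = [M.T @ M / d + np.eye(d) for M in A]
        b = [rng.standard_normal(d) for _ in range(n)]

        def grad_i(i, x):
            return A[i] @ x - b[i]

        def top1(v):                                  # contractive Top-1 compressor
            out = np.zeros_like(v)
            j = np.argmax(np.abs(v))
            out[j] = v[j]
            return out

        x = np.zeros(d)
        g = [grad_i(i, x) for i in range(n)]          # initial gradient estimates
        lr = 0.05
        for t in range(2000):
            x = x - lr * np.mean(g, axis=0)           # server step with estimates
            for i in range(n):
                g[i] = g[i] + top1(grad_i(i, x) - g[i])   # compressed correction

        full_grad = np.mean([grad_i(i, x) for i in range(n)], axis=0)
        print("||grad f(x)||^2 =", np.linalg.norm(full_grad) ** 2)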

    3PC: Three point compressors for communication-efficient distributed training and a better theory for lazy aggregation

    No full text
    We propose and study a new class of gradient communication mechanisms for communication-efficient training -- three point compressors (3PC) -- as well as efficient distributed nonconvex optimization algorithms that can take advantage of them. Unlike most established approaches, which rely on a static compressor choice (e.g., Top-$K$), our class allows the compressors to {\em evolve} throughout the training process, with the aim of improving the theoretical communication complexity and practical efficiency of the underlying methods. We show that our general approach can recover the recently proposed state-of-the-art error feedback mechanism EF21 (Richtárik et al., 2021) and its theoretical properties as a special case, but also leads to a number of new efficient methods. Notably, our approach allows us to improve upon the state of the art in the algorithmic and theoretical foundations of the {\em lazy aggregation} literature (Chen et al., 2018). As a by-product that may be of independent interest, we provide a new and fundamental link between the lazy aggregation and error feedback literature. A special feature of our work is that we do not require the compressors to be unbiased. Comment: 52 pages.
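    In a 3PC mechanism, the rule that produces a client's next gradient estimate may depend on three points: the previous estimate, the previous gradient, and the fresh gradient, which is what lets the compressor evolve during training. The sketch below is an informal illustration of that interface (the two concrete rules and the threshold are assumptions made for this example, not the paper's formal definitions); it shows how an EF21-style rule and a lazy-aggregation-style rule both fit the same shape:

        import numpy as np

        # Informal sketch of a "three point" update rule: the next estimate may
        # depend on the previous estimate g, the previous gradient w and the
        # fresh gradient v. Both rules below are illustrative assumptions.
        def top1(v):
            out = np.zeros_like(v)
            j = np.argmax(np.abs(v))
            out[j] = v[j]
            return out

        def rule_ef21_style(g, w, v):
            # EF21-style: move the estimate toward v by a compressed step
            # (the previous gradient w is not used by this particular rule).
            return g + top1(v - g)

        def rule_lazy_style(g, w, v, threshold=0.1):
            # Lazy-aggregation-style: skip communication (keep g) when the
            # gradient has changed little since the last round.
            return g if np.linalg.norm(v - w) <= threshold else v

        # Tiny demo of one round under each rule.
        g = np.zeros(3)
        w = np.array([1.0, 0.0, 0.0])
        v = np.array([1.05, 0.2, 0.0])
        print(rule_ef21_style(g, w, v))   # -> [1.05, 0., 0.]: compressed move toward v
        print(rule_lazy_style(g, w, v))   # -> v, since ||v - w|| exceeds the threshold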