
    Understanding Progressive Training Through the Framework of Randomized Coordinate Descent

    Full text link
    We propose a Randomized Progressive Training algorithm (RPT) -- a stochastic proxy for the well-known Progressive Training method (PT) (Karras et al., 2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was proposed as a heuristic, with no convergence analysis even for the simplest objective functions. In contrast, to the best of our knowledge, RPT is the first PT-type algorithm with rigorous and sound theoretical guarantees for general smooth objective functions. We cast our method into the established framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtárik & Takáč, 2014), for which (as a by-product of our investigations) we also propose a novel, simple and general convergence analysis encapsulating strongly convex, convex and nonconvex objectives. We then use this framework to establish a convergence theory for RPT. Finally, we validate the effectiveness of our method through extensive computational experiments.
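    The method is cast into the Randomized Coordinate Descent (RCD) framework. As a rough illustration of that framework (a minimal sketch, not the authors' RPT algorithm itself; the toy quadratic objective, the two coordinate blocks and the per-block step sizes are assumptions made only for this example), block RCD samples a random block of coordinates at each iteration and takes a gradient step on that block alone:

        import numpy as np

        # Sketch of block randomized coordinate descent (RCD): sample a block of
        # coordinates, take a gradient step on that block only. The objective,
        # the block partition and the step sizes are illustrative assumptions.
        rng = np.random.default_rng(0)
        d = 8
        A = rng.standard_normal((d, d))
        A = A.T @ A + np.eye(d)                       # smooth, strongly convex quadratic
        b = rng.standard_normal(d)

        def grad(x):                                  # gradient of 0.5*x'Ax - b'x
            return A @ x - b

        blocks = [np.arange(0, 4), np.arange(4, 8)]   # coordinate blocks ("stages")
        L = [np.linalg.eigvalsh(A[np.ix_(B, B)]).max() for B in blocks]

        x = np.zeros(d)
        for t in range(300):
            i = rng.integers(len(blocks))             # pick a block uniformly at random
            B = blocks[i]
            x[B] -= (1.0 / L[i]) * grad(x)[B]         # update only the sampled block

        print("||grad f(x)|| =", np.linalg.norm(grad(x)))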

    Error Feedback Shines when Features are Rare

    Full text link
    We provide the first proof that gradient descent (GD) with greedy sparsification (TopK) and error feedback (EF) can obtain better communication complexity than vanilla GD when solving the distributed optimization problem $\min_{x\in \mathbb{R}^d} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$, where $n$ = # of clients, $d$ = # of features, and $f_1,\dots,f_n$ are smooth nonconvex functions. Despite intensive research since 2014 when EF was first proposed by Seide et al., this problem remained open until now. We show that EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\lVert \nabla f(\hat{x}) \rVert^2 \leq \varepsilon$ in expectation, GD with the Top1 sparsifier and EF requires ${\cal O}\left(\left( L + r \sqrt{ \frac{c}{n} \min\left( \frac{c}{n} \max_i L_i^2, \frac{1}{n}\sum_{i=1}^n L_i^2 \right) }\right) \frac{1}{\varepsilon} \right)$ bits to be communicated by each worker to the server only, where $L$ is the smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, $c$ is the maximal number of clients owning any feature ($1\leq c \leq n$), and $r$ is the maximal number of features owned by any client ($1\leq r \leq d$). Clearly, the communication complexity improves as $c$ decreases (i.e., as features become more rare), and can be much better than the ${\cal O}(r L \frac{1}{\varepsilon})$ communication complexity of GD in the same regime.
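    As a concrete, hedged illustration of the mechanism in the abstract, the sketch below simulates distributed GD with a Top-1 sparsifier and one common form of error feedback, in which each client keeps the part of its message that was dropped and adds it back before compressing again. The quadratic local losses, the step size, and the numbers of clients and features are assumptions made only for this example, not the paper's setting:

        import numpy as np

        # Single-process simulation of distributed GD with Top-1 sparsification
        # and error feedback (EF). Local losses, step size and problem sizes are
        # illustrative assumptions.
        rng = np.random.default_rng(1)
        n, d = 4, 6
        A = [rng.standard_normal((d, d)) for _ in range(n)]
        A = [M.T @ M / d + np.eye(d) for M in A]      # smooth local quadratics
        b = [rng.standard_normal(d) for _ in range(n)]

        def grad_i(i, x):                             # gradient of f_i
            return A[i] @ x - b[i]

        def topk(v, k=1):                             # greedy Top-k sparsifier
            out = np.zeros_like(v)
            idx = np.argsort(np.abs(v))[-k:]
            out[idx] = v[idx]
            return out

        x = np.zeros(d)
        e = [np.zeros(d) for _ in range(n)]           # per-client error accumulators
        lr = 0.05
        for t in range(2000):
            msgs = []
            for i in range(n):
                v = e[i] + grad_i(i, x)               # add back what was not yet sent
                c = topk(v, k=1)                      # send a single coordinate
                e[i] = v - c                          # remember what was dropped
                msgs.append(c)
            x = x - lr * np.mean(msgs, axis=0)        # server averages the messages

        full_grad = np.mean([grad_i(i, x) for i in range(n)], axis=0)
        print("||grad f(x)||^2 =", np.linalg.norm(full_grad) ** 2)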

    Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

    Full text link
    Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theoretical understanding of this mechanism, there is still a lot to explore. In this work we study a modern form of error feedback called EF21 (Richtárik et al., 2021) which offers the currently best-known theoretical guarantees, under the weakest assumptions, and also works well in practice. In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean, which is always smaller, and can be substantially smaller, especially in heterogeneous data regimes. We take the reader on a journey of our discovery process. Starting with the idea of applying EF21 to an equivalent reformulation of the underlying problem which (unfortunately) requires (often impractical) machine cloning, we continue to the discovery of a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method. While this development applies to the simplest form of EF21, our approach naturally extends to more elaborate variants involving stochastic gradients and partial participation. Further, our technique improves the best-known theory of EF21 in the rare features regime (Richtárik et al., 2023). Finally, we validate our theoretical findings with suitable experiments. Comment: 70 pages, 14 figures, 6 tables.
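    The EF21 rule studied in the paper is compact: each client keeps a gradient estimate g_i, the server steps using the average of these estimates, and each client then communicates only a compressed correction toward its fresh local gradient. The sketch below simulates this update; the quadratic local losses, the Top-1 compressor and the step size are assumptions chosen purely for illustration:

        import numpy as np

        # Simulation of the EF21 update: x <- x - lr * mean(g_i),
        # then g_i <- g_i + C(grad f_i(x) - g_i) with a contractive compressor C.
        # The toy losses, compressor and step size are illustrative assumptions.
        rng = np.random.default_rng(2)
        n, d = 4, 6
        A = [rng.standard_normal((d, d)) for _ in range(n)]
        A = [M.T @ M / d + np.eye(d) for M in A]
        b = [rng.standard_normal(d) for _ in range(n)]

        def grad_i(i, x):
            return A[i] @ x - b[i]

        def top1(v):                                  # contractive Top-1 compressor
            out = np.zeros_like(v)
            j = np.argmax(np.abs(v))
            out[j] = v[j]
            return out

        x = np.zeros(d)
        g = [grad_i(i, x) for i in range(n)]          # initial gradient estimates
        lr = 0.05
        for t in range(2000):
            x = x - lr * np.mean(g, axis=0)           # server step with estimates
            for i in range(n):
                g[i] = g[i] + top1(grad_i(i, x) - g[i])   # compressed correction

        full_grad = np.mean([grad_i(i, x) for i in range(n)], axis=0)
        print("||grad f(x)||^2 =", np.linalg.norm(full_grad) ** 2)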

    3PC: Three point compressors for communication-efficient distributed training and a better theory for lazy aggregation

    No full text
    We propose and study a new class of gradient communication mechanisms for communication-efficient training -- three point compressors (3PC) -- as well as efficient distributed nonconvex optimization algorithms that can take advantage of them. Unlike most established approaches, which rely on a static compressor choice (e.g., Top-$K$), our class allows the compressors to {\em evolve} throughout the training process, with the aim of improving the theoretical communication complexity and practical efficiency of the underlying methods. We show that our general approach can recover the recently proposed state-of-the-art error feedback mechanism EF21 (Richtárik et al., 2021) and its theoretical properties as a special case, but also leads to a number of new efficient methods. Notably, our approach allows us to improve upon the state of the art in the algorithmic and theoretical foundations of the {\em lazy aggregation} literature (Chen et al., 2018). As a by-product that may be of independent interest, we provide a new and fundamental link between the lazy aggregation and error feedback literature. A special feature of our work is that we do not require the compressors to be unbiased. Comment: 52 pages.
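    In a 3PC mechanism, the rule that produces a client's next gradient estimate may depend on three points: the previous estimate, the previous gradient, and the fresh gradient, which is what lets the compressor evolve during training. The sketch below is an informal illustration of that interface (the two concrete rules and the threshold are assumptions made for this example, not the paper's formal definitions); it shows how an EF21-style rule and a lazy-aggregation-style rule both fit the same shape:

        import numpy as np

        # Informal sketch of a "three point" update rule: the next estimate may
        # depend on the previous estimate g, the previous gradient w and the
        # fresh gradient v. Both rules below are illustrative assumptions.
        def top1(v):
            out = np.zeros_like(v)
            j = np.argmax(np.abs(v))
            out[j] = v[j]
            return out

        def rule_ef21_style(g, w, v):
            # EF21-style: move the estimate toward v by a compressed step
            # (the previous gradient w is not used by this particular rule).
            return g + top1(v - g)

        def rule_lazy_style(g, w, v, threshold=0.1):
            # Lazy-aggregation-style: skip communication (keep g) when the
            # gradient has changed little since the last round.
            return g if np.linalg.norm(v - w) <= threshold else v

        # Tiny demo of one round under each rule.
        g = np.zeros(3)
        w = np.array([1.0, 0.0, 0.0])
        v = np.array([1.05, 0.2, 0.0])
        print(rule_ef21_style(g, w, v))   # -> [1.05, 0., 0.]: compressed move toward v
        print(rule_lazy_style(g, w, v))   # -> v, since ||v - w|| exceeds the threshold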