Understanding Progressive Training Through the Framework of Randomized Coordinate Descent
We propose a Randomized Progressive Training algorithm (RPT) -- a stochastic
proxy for the well-known Progressive Training method (PT) (Karras et al.,
2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was
proposed as a heuristic, with no convergence analysis even for the simplest
objective functions. In contrast, to the best of our knowledge, RPT is the
first PT-type algorithm with rigorous and sound theoretical guarantees for
general smooth objective functions. We cast our method into the established
framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtárik &
Takáč, 2014), for which (as a by-product of our investigations) we also
propose a novel, simple and general convergence analysis encapsulating
strongly-convex, convex and nonconvex objectives. We then use this framework to
establish a convergence theory for RPT. Finally, we validate the effectiveness
of our method through extensive computational experiments.
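To ground the RCD framework the analysis is cast into, below is a minimal sketch of plain randomized coordinate descent on a toy strongly convex quadratic. The objective, the uniform coordinate sampling, and the $1/L_i$ stepsize are illustrative assumptions; this is only the RCD skeleton, not the RPT method itself.

```python
# Minimal RCD sketch: minimize f(x) = 0.5 x^T A x - b^T x by repeatedly
# updating a single randomly chosen coordinate with stepsize 1/L_i,
# where L_i = A_ii is the coordinate-wise smoothness constant.
import numpy as np

rng = np.random.default_rng(0)

def rcd(A, b, x0, iters=5000):
    x = x0.copy()
    L = np.diag(A)                  # coordinate smoothness constants L_i = A_ii
    d = len(x)
    for _ in range(iters):
        i = rng.integers(d)         # uniform coordinate sampling
        g_i = A[i] @ x - b[i]       # i-th partial derivative of f
        x[i] -= g_i / L[i]          # coordinate step
    return x

d = 20
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)             # symmetric positive definite: strongly convex f
b = rng.standard_normal(d)
x = rcd(A, b, np.zeros(d))
print(np.linalg.norm(A @ x - b))    # gradient norm; should be near zero
```

Roughly speaking, progressive training activates a growing subset of parameter blocks during training; randomizing which blocks are active lets updates of the above (block) coordinate type, and hence guarantees of this kind, apply.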
Error Feedback Shines when Features are Rare
We provide the first proof that gradient descent (GD) with greedy sparsification
(TopK) and error feedback (EF) can obtain better communication complexity than
vanilla GD when solving the distributed optimization problem
$\min_{x\in\mathbb{R}^d} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$, where $n$ = # of
clients, $d$ = # of features, and $f_1,\dots,f_n$ are smooth nonconvex
functions. Despite intensive research since 2014 when EF was first proposed by
Seide et al., this problem remained open until now. We show that EF shines in
the regime when features are rare, i.e., when each feature is present in the
data owned by a small number of clients only. To illustrate our main result, we
show that in order to find a random vector $\hat{x}$ such that
$\|\nabla f(\hat{x})\|^2 \leq \varepsilon$ in expectation, GD with the Top1
sparsifier and EF requires
$O\left(\left(L + r\sqrt{\frac{c}{n}\min\left(\frac{c}{n}\max_i L_i^2,\ \frac{1}{n}\sum_{i=1}^n L_i^2\right)}\right)\frac{1}{\varepsilon}\right)$
bits to be communicated by each worker to the server only, where $L$ is the
smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, $c$ is
the maximal number of clients owning any feature ($1\leq c\leq n$), and $r$ is
the maximal number of features owned by any client ($1\leq r\leq d$). Clearly,
the communication complexity improves as $c$ decreases (i.e., as features
become more rare), and can be much better than the
$O\left(L\frac{d}{\varepsilon}\right)$ communication complexity of GD in the
same regime.
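To make the mechanism concrete, here is a minimal sketch of distributed GD with a greedy TopK sparsifier and classic error feedback (Seide et al., 2014): each client transmits only the largest-magnitude entries of its stepsize-scaled gradient plus the accumulated residual, and remembers what it did not transmit. The quadratic (convex) toy losses, stepsize, and iteration count are illustrative assumptions; the rare-features structure driving the paper's improved rates is not modeled here.

```python
# Minimal sketch of distributed GD + TopK sparsification + classic error feedback.
import numpy as np

rng = np.random.default_rng(1)
n, d, K, gamma, T = 5, 30, 1, 0.02, 30000

def topk(v, k):
    """Greedy sparsifier: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

# Client i's loss: f_i(x) = 0.5 ||A_i x - b_i||^2 (convex toy stand-in for the
# paper's smooth nonconvex f_i).
A = rng.standard_normal((n, d, d)) / np.sqrt(d)
b = rng.standard_normal((n, d))

x = np.zeros(d)
e = np.zeros((n, d))                    # per-client error-feedback memory
for _ in range(T):
    msgs = np.zeros((n, d))
    for i in range(n):
        g = A[i].T @ (A[i] @ x - b[i])  # local gradient
        m = topk(gamma * g + e[i], K)   # compress scaled gradient + residual
        e[i] = gamma * g + e[i] - m     # remember what was not transmitted
        msgs[i] = m
    x -= msgs.mean(axis=0)              # server aggregates the sparse messages
grad = np.mean([A[i].T @ (A[i] @ x - b[i]) for i in range(n)], axis=0)
print(np.linalg.norm(grad) ** 2)        # should be small after enough rounds
```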
Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants
Error Feedback (EF) is a highly popular and immensely effective mechanism for
fixing convergence issues which arise in distributed training methods (such as
distributed GD or SGD) when these are enhanced with greedy communication
compression techniques such as TopK. While EF was proposed almost a decade ago
(Seide et al., 2014), and despite concentrated effort by the community to
advance the theoretical understanding of this mechanism, there is still a lot
to explore. In this work we study a modern form of error feedback called EF21
(Richtárik et al., 2021), which offers the currently best-known theoretical
guarantees, under the weakest assumptions, and also works well in practice. In
particular, while the theoretical communication complexity of EF21 depends on
the quadratic mean of certain smoothness parameters, we improve this dependence
to their arithmetic mean, which is always smaller, and can be substantially
smaller, especially in heterogeneous data regimes. We take the reader on a
journey of our discovery process. Starting with the idea of applying EF21 to an
equivalent reformulation of the underlying problem which (unfortunately)
requires (often impractical) machine cloning, we continue to the discovery of a
new weighted version of EF21 which can (fortunately) be executed without any
cloning, and finally circle back to an improved analysis of the original EF21
method. While this development applies to the simplest form of EF21, our
approach naturally extends to more elaborate variants involving stochastic
gradients and partial participation. Further, our technique improves the
best-known theory of EF21 in the rare features regime (Richtárik et al., 2023).
Finally, we validate our theoretical findings with suitable experiments.
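For reference, here is a minimal sketch of the basic EF21 step the paper analyzes: each client keeps a gradient estimate $g_i$, compresses only the difference between its fresh gradient and $g_i$, and the server moves along the average estimate. The toy quadratic losses, the TopK compressor, and the stepsize are illustrative assumptions.

```python
# Minimal sketch of the EF21 mechanism (compress gradient *differences*).
import numpy as np

rng = np.random.default_rng(2)
n, d, K, gamma, T = 5, 30, 3, 0.01, 20000

def topk(v, k):
    """Contractive (biased) compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

A = rng.standard_normal((n, d, d)) / np.sqrt(d)
b = rng.standard_normal((n, d))

def grad_i(i, x):
    return A[i].T @ (A[i] @ x - b[i])   # gradient of f_i(x) = 0.5||A_i x - b_i||^2

x = np.zeros(d)
g = np.array([grad_i(i, x) for i in range(n)])  # EF21 shifts, g_i^0 = grad f_i(x^0)
for _ in range(T):
    x = x - gamma * g.mean(axis=0)              # server step with aggregated estimate
    for i in range(n):
        c = topk(grad_i(i, x) - g[i], K)        # compress the gradient difference
        g[i] += c                               # both sides update the shift
full_grad = np.mean([grad_i(i, x) for i in range(n)], axis=0)
print(np.linalg.norm(full_grad) ** 2)           # should shrink toward zero
```

The paper's contribution is not this recursion itself but a sharper analysis of it, in which the dependence on the smoothness constants $L_i$ improves from their quadratic mean to their arithmetic mean.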
3PC: Three point compressors for communication-efficient distributed training and a better theory for lazy aggregation
We propose and study a new class of gradient communication mechanisms for
communication-efficient training -- three point compressors (3PC) -- as well as
efficient distributed nonconvex optimization algorithms that can take advantage
of them. Unlike most established approaches, which rely on a static compressor
choice (e.g., Top-$K$), our class allows the compressors to evolve
throughout the training process, with the aim of improving the theoretical
communication complexity and practical efficiency of the underlying methods. We
show that our general approach can recover the recently proposed
state-of-the-art error feedback mechanism EF21 (Richtárik et al., 2021) and
its theoretical properties as a special case, but also leads to a number of new
efficient methods. Notably, our approach allows us to improve upon the state of
the art in the algorithmic and theoretical foundations of the {\em lazy
aggregation} literature (Chen et al., 2018). As a by-product that may be of
independent interest, we provide a new and fundamental link between the lazy
aggregation and error feedback literature. A special feature of our work is
that we do not require the compressors to be unbiased.
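To illustrate the link between 3PC and lazy aggregation, here is a minimal sketch of a distributed gradient method in which each client's "compressor" is a lazy trigger: the client re-transmits its gradient only when the last communicated estimate has drifted sufficiently, and the server otherwise reuses stale values. The trigger rule, the constant zeta, and the toy losses are illustrative assumptions rather than the paper's exact CLAG conditions.

```python
# Minimal lazy-aggregation sketch: skip communication when the last
# transmitted gradient estimate is still "fresh enough".
import numpy as np

rng = np.random.default_rng(3)
n, d, gamma, zeta, T = 5, 30, 0.1, 0.5, 3000

A = rng.standard_normal((n, d, d)) / np.sqrt(d)
b = rng.standard_normal((n, d))

def grad_i(i, x):
    return A[i].T @ (A[i] @ x - b[i])   # gradient of f_i(x) = 0.5||A_i x - b_i||^2

x = np.zeros(d)
g = np.array([grad_i(i, x) for i in range(n)])  # last *communicated* gradients
old = g.copy()                                  # gradients at the previous iterate
sent = 0
for _ in range(T):
    x_new = x - gamma * g.mean(axis=0)          # server steps with possibly stale estimates
    for i in range(n):
        fresh = grad_i(i, x_new)
        # Lazy trigger (illustrative): re-send only if the communicated estimate
        # drifted more than the local gradient itself moved in the last step.
        if np.sum((fresh - g[i]) ** 2) > zeta * np.sum((fresh - old[i]) ** 2):
            g[i] = fresh                        # communication happens here
            sent += 1
        old[i] = fresh
    x = x_new
full_grad = np.mean([grad_i(i, x) for i in range(n)], axis=0)
print(np.linalg.norm(full_grad) ** 2, f"{sent} messages out of {n * T}")
```

Replacing the "send the fresh gradient" branch with a compressed difference update, as in the EF21 sketch above, gives the flavor of how one mechanism class can cover both error feedback and lazy aggregation.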