CGX: Adaptive System Support for Communication-Efficient Deep Learning
The ability to scale out training workloads has been one of the key
performance enablers of deep learning. The main scaling approach is
data-parallel GPU-based training, which has been boosted by hardware and
software support for highly efficient point-to-point communication, and in
particular via hardware bandwidth overprovisioning. Overprovisioning comes at a
cost: there is an order-of-magnitude price difference between "cloud-grade"
servers with such support and their popular "consumer-grade"
counterparts, although individual server-grade and consumer-grade GPUs can have
similar computational envelopes.
In this paper, we show that the costly hardware overprovisioning approach can
be supplanted via algorithmic and system design, and propose a framework called
CGX, which provides efficient software support for compressed communication in
ML applications, for both multi-GPU single-node training and
larger-scale multi-node training. CGX is based on two technical advances:
\emph{At the system level}, it relies on a re-developed communication stack for
ML frameworks, which provides flexible, highly-efficient support for compressed
communication. \emph{At the application level}, it provides \emph{seamless,
parameter-free} integration with popular frameworks, so that end-users do not
have to modify training recipes or significant portions of their training code. This is
complemented by a \emph{layer-wise adaptive compression} technique which
dynamically balances compression gains with accuracy preservation. CGX
integrates with popular ML frameworks, providing up to 3X speedups for
multi-GPU nodes based on commodity hardware, and order-of-magnitude
improvements in the multi-node setting, with negligible impact on accuracy.
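
To make the idea of layer-wise adaptive compression concrete, the following is a minimal sketch and not CGX's actual algorithm or API. It assumes uniform min-max quantization and a hypothetical selection rule: for each layer, pick the smallest bit-width from a candidate set whose relative reconstruction error stays under a tolerance. All function names, the candidate bit-widths, and the tolerance are illustrative assumptions.

```python
import math


def quantize(values, bits):
    """Uniform min-max quantization of a list of floats to `bits`-bit integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant layer, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def relative_error(original, restored):
    """L2 norm of the reconstruction error divided by the L2 norm of the original."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)))
    den = math.sqrt(sum(a * a for a in original))
    return num / den if den > 0 else 0.0


def choose_bits(values, tolerance, candidates=(2, 4, 8)):
    """Hypothetical rule: smallest candidate bit-width whose error is within tolerance."""
    for bits in sorted(candidates):
        codes, lo, scale = quantize(values, bits)
        if relative_error(values, dequantize(codes, lo, scale)) <= tolerance:
            return bits
    # No candidate met the tolerance; fall back to the most precise one.
    return max(candidates)


def compress_layers(gradients, tolerance=0.1):
    """Quantize each layer's gradient with its own bit-width.

    `gradients` maps a layer name to a list of floats. Returns a dict mapping
    each layer name to (bits, (codes, lo, scale)).
    """
    plan = {}
    for name, values in gradients.items():
        bits = choose_bits(values, tolerance)
        plan[name] = (bits, quantize(values, bits))
    return plan
```

In this sketch, a layer whose gradient values are easy to represent receives fewer bits, which is one simple way to trade compression gains against accuracy preservation on a per-layer basis. A real system would additionally pack the integer codes into a compact buffer before communication.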