Natural Compression for Distributed Deep Learning
Modern deep learning models are often trained in parallel over a collection
of distributed machines to reduce training time. In such settings,
communication of model updates among machines becomes a significant performance
bottleneck and various lossy update compression techniques have been proposed
to alleviate this problem. In this work, we introduce a new, simple yet
theoretically and practically effective compression technique: {\em natural
compression (NC)}. Our technique is applied individually to all entries of the
to-be-compressed update vector and works by randomized rounding to the nearest
(negative or positive) power of two, which can be computed in a "natural" way
by ignoring the mantissa. We show that compared to no compression, NC increases
the second moment of the compressed vector by not more than the tiny factor
$\nicefrac{9}{8}$, which means that the effect of NC on the convergence speed
of popular training algorithms, such as distributed SGD, is negligible.
However, the communication savings enabled by NC are substantial, leading to
a marked {\em improvement in overall theoretical running time}. For
applications requiring more aggressive compression, we generalize NC to {\em
natural dithering}, which we prove is {\em exponentially better} than the
common random dithering technique. Our compression operators can be used on
their own or in combination with existing operators for a more aggressive
combined effect, and offer a new state of the art both in theory and in practice. Comment: 8 pages, 20 pages of appendix, 6 tables, 14 figures.
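To make the rounding rule above concrete, the following is a minimal NumPy sketch of entry-wise natural compression: each nonzero entry is randomly rounded to one of its two neighboring powers of two, with probabilities chosen so the compressor stays unbiased. The function name and float-level arithmetic are illustrative assumptions; the paper's actual operator works directly on the IEEE-754 bit representation by dropping the mantissa, which this sketch does not reproduce.

```python
import numpy as np

def natural_compression(x, rng=None):
    """Sketch of entry-wise randomized rounding to the nearest power of two.

    Each nonzero entry t with 2^a <= |t| <= 2^(a+1) is mapped to
    sign(t) * 2^a or sign(t) * 2^(a+1), with probabilities chosen so
    that the compressor is unbiased: E[C(t)] = t.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    t = np.abs(x[nz])
    a = np.floor(np.log2(t))            # exponent of the lower power of two
    low, high = 2.0 ** a, 2.0 ** (a + 1)
    p_up = (t - low) / low              # P(round up); the gap high - low equals low
    rounded = np.where(rng.random(t.shape) < p_up, high, low)
    out[nz] = np.sign(x[nz]) * rounded
    return out

# Only sign and exponent survive, e.g. 0.3 becomes 0.25 or 0.5 at random.
print(natural_compression(np.array([0.3, -1.7, 5.0, 0.0])))
```

Because each entry is rounded up or down with the matching probabilities, the compressed vector equals the original in expectation, which is the unbiasedness the second-moment bound above builds on.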
Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning Training Platforms
Deep Learning (DL) training platforms are built by interconnecting multiple
DL accelerators (e.g., GPU/TPU) via fast, customized interconnects. As the size
of DL models and the compute efficiency of the accelerators have continued to
increase, there has also been a corresponding steady increase in the bandwidth
of these interconnects. Systems today provide hundreds of gigabytes per second (GB/s)
of interconnect bandwidth via a mix of solutions such as multi-chip packaging
modules (MCM) and proprietary interconnects (e.g., NVLink) that together form
the scale-up network of accelerators. However, as we identify in this work, a
significant portion of this bandwidth goes under-utilized. This is because (i)
using compute cores for executing collective operations such as all-reduce
decreases overall compute efficiency, (ii) there is memory bandwidth
contention between the accesses for arithmetic operations and those for
collectives, and (iii) significant internal bus congestion increases the
latency of communication operations. To address this challenge, we
propose a novel microarchitecture, called Accelerator Collectives Engine (ACE),
for DL collective communication offload. ACE is a smart network interface (NIC)
tuned to cope with the high-bandwidth and low latency requirements of scale-up
networks and is able to efficiently drive the various scale-up network
systems (e.g., switch-based or point-to-point topologies). We evaluate the
benefits of ACE with micro-benchmarks (e.g., single-collective performance)
and popular DL models using an end-to-end DL training simulator. For modern DL
workloads, ACE on average increases the network bandwidth utilization by
1.97x, resulting in 2.71x and 1.44x speedups in iteration time for ResNet-50 and
GNMT, respectively.
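As a rough illustration of why higher bandwidth utilization translates into shorter iterations, here is a back-of-the-envelope model in Python. All concrete numbers (link bandwidth, gradient size, compute time, utilization levels) and the ring all-reduce cost formula are illustrative assumptions, not figures or methodology taken from the ACE paper or its simulator.

```python
# Toy model: iteration time = compute + exposed all-reduce time on the
# scale-up network. Higher effective utilization of the same links shrinks
# the all-reduce. All numbers below are hypothetical.

def allreduce_seconds(grad_bytes, link_gbytes_per_s, utilization, num_accel):
    """Ring all-reduce sends roughly 2*(n-1)/n of the gradient over each link."""
    traffic = 2 * (num_accel - 1) / num_accel * grad_bytes
    return traffic / (link_gbytes_per_s * 1e9 * utilization)

grad_bytes = 1.1e9      # ~1.1 GB of fp32 gradients (a GNMT-sized model, assumed)
compute_s = 0.050       # assumed per-iteration compute time: 50 ms

for util in (0.30, 0.60):   # e.g. compute-core-driven vs. offloaded collectives
    comm_s = allreduce_seconds(grad_bytes, link_gbytes_per_s=150,
                               utilization=util, num_accel=8)
    total_ms = (compute_s + comm_s) * 1e3
    print(f"utilization {util:.0%}: all-reduce {comm_s * 1e3:.1f} ms, "
          f"iteration {total_ms:.1f} ms")
```

Even in this crude model, doubling the effective utilization of the same physical links removes a large slice of exposed communication time, which is the effect the paper quantifies with its simulator.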
Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
Distributed training is the de facto standard to scale up the training of
Deep Neural Networks (DNNs) with multiple GPUs. The performance bottleneck of
distributed training lies in communications for gradient synchronization.
Recently, practitioners have observed sparsity in gradient tensors, suggesting
the potential to reduce the traffic volume in communication and improve
end-to-end training efficiency. Yet, the optimal communication scheme to fully
leverage sparsity is still missing. This paper aims to address this gap. We
first analyze the characteristics of sparse tensors in popular DNN models to
understand the fundamentals of sparsity. We then systematically explore the
design space of communication schemes for sparse tensors and find the optimal
one. We also develop a gradient
synchronization system called Zen that approximately realizes it for sparse
tensors. We demonstrate that Zen can achieve up to 5.09x speedup in
communication time and up to 2.48x speedup in training throughput compared to
state-of-the-art methods.
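The traffic-volume argument behind sparsity is easy to see with a small Python sketch: sending only (index, value) pairs for the nonzero gradient entries can shrink the bytes on the wire by orders of magnitude at high sparsity. The encoding, tensor size, and 99% sparsity level below are illustrative assumptions; how to aggregate such sparse tensors across GPUs efficiently is exactly the design-space question Zen addresses and is not reproduced here.

```python
import numpy as np

def sparse_encode(grad):
    """Encode a sparse gradient as (int32 indices, float32 values)."""
    idx = np.flatnonzero(grad).astype(np.int32)
    return idx, grad[idx].astype(np.float32)

rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000).astype(np.float32)
g[rng.random(g.size) > 0.01] = 0.0           # keep ~1% of entries (illustrative sparsity)

idx, vals = sparse_encode(g)
dense_bytes = g.nbytes                        # dense fp32 tensor on the wire
sparse_bytes = idx.nbytes + vals.nbytes       # index + value pairs on the wire
print(f"dense: {dense_bytes / 1e6:.1f} MB, sparse: {sparse_bytes / 1e6:.2f} MB, "
      f"{dense_bytes / sparse_bytes:.0f}x less traffic (before synchronization overheads)")
```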
On Biased Compression for Distributed Learning
In the last few years, various communication compression techniques have
emerged as an indispensable tool helping to alleviate the communication
bottleneck in distributed learning. However, despite the fact that {\em biased}
compressors often show superior performance in practice when compared to the
much more studied and understood {\em unbiased} compressors, very little is
known about them. In this work we study three classes of biased compression
operators, two of which are new, and their performance when applied to
(stochastic) gradient descent and distributed (stochastic) gradient descent. We
show for the first time that biased compressors can lead to linear convergence
rates both in the single node and distributed settings. Our {\em distributed}
SGD method enjoys an ergodic convergence rate expressed in terms of a compression
parameter $\delta$, which grows as more compression is applied, the smoothness and
strong convexity constants $L$ and $\mu$, a term $C$ that captures stochastic gradient
noise ($C = 0$ if full gradients are computed on each node), and a term $D$ that captures
the variance of the gradients at the optimum ($D = 0$ for over-parameterized models).
Further, via a theoretical study of several synthetic and empirical
distributions of communicated gradients, we shed light on why and by how much
biased compressors outperform their unbiased variants. Finally, we propose a
new, highly performing biased compressor, a combination of Top-$k$ sparsification
and natural dithering, which in our experiments outperforms all other compression
techniques. Comment: 39 pages, 10 figures, 25 theorems and lemmas, 8 new compression
operators, 2 algorithms.
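For illustration, here is a minimal Python sketch of the Top-$k$ sparsifier, the canonical member of the biased class discussed above; the contraction bound noted in the docstring is the standard property such compressors satisfy. Composing it with a quantizer such as natural dithering, as the abstract proposes, would further quantize the surviving entries; that composition and the function name here are illustrative, not the paper's implementation.

```python
import numpy as np

def top_k(x, k):
    """Biased Top-k sparsifier: keep the k largest-magnitude entries, zero the rest.

    In general E[top_k(x)] != x, so the operator is biased, but it is contractive:
    ||x - top_k(x)||^2 <= (1 - k/d) * ||x||^2 for a d-dimensional input, which is
    the kind of property the linear-convergence results for biased compressors use.
    """
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest magnitudes
    out[idx] = x[idx]
    return out

x = np.array([0.1, -3.0, 0.5, 2.0, -0.05])
print(top_k(x, 2))    # only -3.0 and 2.0 survive; the rest are zeroed
```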