Large-scale training of deep neural networks by Dryden, Nikoli Joseph
© 2019 Nikoli Joseph Dryden




Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2019
Urbana, Illinois
Doctoral Committee:
Professor Marc Snir, Chair
Professor William Gropp
Professor Wen-mei Hwu
Assistant Professor Alexander Schwing
Dr. Brian Van Essen, Lawrence Livermore National Laboratory
ABSTRACT
Accelerating and scaling the training of deep neural networks (DNNs) is critical to keep
up with growing datasets, reduce training times, and enable training on memory-constrained
problems where parallelism is necessary. In this thesis, I present a set of techniques that
can leverage large high-performance computing systems for fast training of DNNs. I first
introduce a suite of algorithms to exploit additional parallelism in convolutional layers when
training, expanding beyond the standard sample-wise data-parallel approach to include spa-
tial parallelism and channel and filter parallelism. Next, I present optimizations to commu-
nication frameworks to reduce communication overheads at large scales. Finally, I discuss
communication quantization, which can directly reduce communication volumes. In concert,




The work leading up to and presented in this thesis would not have been possible without
the advice and support of many people. My advisor, Marc Snir, has been a constant source
of insight throughout, and this work would not have been possible without his support.
Brian Van Essen has also been instrumental in the development of this thesis. I would
also like to thank Professors Alex Schwing, Bill Gropp, and Wen-mei Hwu for their ideas
and advice. Through their extensive conversations and discussions with me, Tim Moon and
Naoya Maruyama also significantly contributed to the success of this work. Everyone on the
LBANN team has also been helpful in turning the ideas here into working applications.
Last, but not least, I would like to thank all the colleagues, friends, and collaborators
who have offered advice or support throughout the years: Alex, Ana, Anna, Babek, Bryan,
Calvin, Cole, Felix, Jon, Laura, RJ, Rob, Tal, Tim, Tom, Torsten, and Vu.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Motivation and Goals
1.2 Contributions
1.3 Organization
CHAPTER 2 BACKGROUND
2.1 Deep Neural Networks
2.2 Training Deep Neural Networks
2.3 Overview of Common Neural Network Architectures
2.4 Performance Modeling
2.5 Collective Communication
2.6 LBANN Toolkit
CHAPTER 3 DISTRIBUTED-MEMORY CONVOLUTION
3.1 Notation
3.2 Spatial Parallelism
3.3 Channel and Filter Parallelism
3.4 Parallel Execution Strategies
3.5 Evaluation
CHAPTER 4 COMMUNICATION OPTIMIZATION
4.1 Communication Requirements
4.2 Communication Optimizations
4.3 Interfacing with MPI
4.4 The Aluminum Library
4.5 Evaluation
CHAPTER 5 COMMUNICATION QUANTIZATION
5.1 Existing Quantization Algorithms
5.2 Adaptive Quantization
5.3 Quantized Allreduces
CHAPTER 6 RELATED WORK
6.1 Parallelizing Convolution
6.2 Scaling Training
6.3 Applications of Large-Scale Training
6.4 Communication Optimization
6.5 Communication Compression
6.6 Other Performance Considerations
CHAPTER 7 CONCLUSIONS
7.1 Future Work
7.2 Conclusions
REFERENCES
CHAPTER 1: INTRODUCTION
Deep neural networks (DNNs) [1, 2] have been applied with great success in many domains,
including image classification [3, 4, 5], object detection [6, 7, 8, 9], speech recognition [10, 11],
translation [12, 13], sequence modeling [14, 15], and natural language processing [16, 17].
They have also been successfully applied to many problems in scientific and computational
science domains, such as bioinformatics [18, 19], genomics [20, 21, 22], drug discovery [23, 24],
high-energy physics [25, 26, 27], cosmology [28, 29, 30], climate modeling [31, 32, 33, 34],
and inertial confinement fusion [35]. Over time, DNNs have grown increasingly large and
complex, with state-of-the-art DNNs having hundreds of layers and billions of parameters [3,
4, 36, 37, 38, 39]. The amount of data available has also grown rapidly, with dataset sizes
increasing [40, 41, 5, 42, 43, 44, 45] and larger data samples, such as high-resolution images
or video [46, 47, 48], becoming more common [45, 44, 49]. As larger models and datasets
have been shown to lead to improved performance, these trends are expected to continue.
Training a DNN is a very computationally expensive process, and it is not uncommon to
spend days or weeks to train to convergence. Some data are large enough that training on
them is infeasible due to the compute or memory requirements [50, 51, 52].
A typical workflow for a researcher or practitioner applying deep learning involves repeat-
edly training DNNs. This may be part of initial experimentation when first exploring a
new problem space, or as part of tuning the DNN architecture and training process. This
is particularly acute in scientific environments, where requirements may change rapidly.
Once finalized, DNNs may need to be frequently re-trained to incorporate newly-acquired
data [53].
It has become standard to leverage accelerators such as GPUs to reduce training times;
indeed, GPUs are often credited with helping to enable the “deep learning revolution” [54, 3].
Using multiple GPUs on a single compute node, or a small cluster of such nodes, often in a
cloud environment, is now common. Nevertheless, training state-of-the-art DNNs continues
to require days or weeks, and is a major expense.
Training DNNs is a high-performance computing problem, for which supercomputers
should be efficiently leveraged. In this thesis, I discuss techniques to accelerate training
on large-scale systems and enable training on previously-unreachable problems. This work
does so in three broad areas:
• Developing algorithms to exploit additional parallelism within the training process, in
order to make use of the massive amounts of parallelism available on modern super-
computers.
• Optimizing communication primitives, in order to reduce the overheads of distributed-
memory algorithms.
• Reducing communication volume within the training algorithms, to further reduce
overheads.
1.1 MOTIVATION AND GOALS
The goal of a practitioner training a DNN is to solve a problem of interest to them. To
be productive, they must be able to rapidly iterate on DNN training. There is a vast space
of approaches to accelerate training, including improved learning techniques. This thesis
focuses on approaches that are primarily independent of the particulars of the DNN, so
that they can be applied jointly with improvements in learning. I focus on addressing
the following limitations: memory pressures from large samples and models; communication
overheads; efficiently utilizing compute and network resources; and scalability.
1.1.1 Memory Pressures
The size of individual data samples has grown significantly, especially in image and sci-
entific datasets. This is due to increased emphasis on high-resolution imagery (e.g. satellite
imagery [46]) and applications from areas such as medical and scientific imaging [52, 51, 34],
as well as output data from numerical simulations [50]. These data can be many megapixels
or even gigapixels in size, may be 3D, and may have channels beyond color (e.g. multispec-
tral images or physical variables such as density, pressure, etc.). It can require hundreds of
megabytes or gigabytes of memory to load a single such sample.
When training a DNN, the intermediate activations of every layer must be kept in order to
perform backpropagation (see Section 2.2). The size of a layer’s activations depends on the
input size and details of the layer, but is typically proportional in size to the input. Hence, the
memory requirements when training a DNN are significantly greater than just the size of the
input data and model parameters, and grow with input size. Further, standard approaches
to training require assigning at least one sample to each processor (see Section 2.2). With
deep models and large input, memory requirements can easily exceed what is currently
available on a GPU or other accelerator (16-32 GB). Indeed, for very large input data, a
single sample may require too much memory. As data sizes are continuing to grow, this
situation is expected to continue.
Several approaches have been used to tackle this, but all suffer from various shortcomings.
The simplest solution is to use CPUs, which typically have much more abundant memory,
rather than GPUs for training. However, as CPUs are significantly slower than GPUs
for training DNNs, this approach can be expected to significantly increase training time
compared to the case where training could be performed on GPUs.
If at least one sample and its corresponding intermediate data can fit in GPU mem-
ory, “micro-batching” approaches can be used [55, 39]. In these, small numbers of samples
are forward and backpropagated through the network and parameter updates are accu-
mulated before being applied. This enables GPUs to be used for training, but pipelining
approaches [56, 57, 58, 13, 39] are required for efficiency, although they may have their own
performance implications (e.g. approximating backpropagation). However, such approaches
potentially impact the learning dynamics when networks use batch normalization [39], as
mini-batch statistics are computed over each micro-batch. This can also be alleviated by
moving to distributed data-parallel training, where each GPU has only as many samples
as can fit in its memory. While this enables all data to be resident in GPU memory, it
may not efficiently utilize compute resources and can be easily dominated by communication
time.
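The accumulation step of micro-batching can be sketched as follows. This is a minimal illustration on a toy one-parameter least-squares model, not the implementation used in the cited works; the function names and data are hypothetical. The toy model has no batch normalization, which is why the accumulated and full-batch gradients agree here.

```python
def grad_sum(w, batch):
    """Sum of per-sample gradients of 0.5 * (w * x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch)


def full_batch_gradient(w, batch):
    """Mean gradient computed over the whole mini-batch at once."""
    return grad_sum(w, batch) / len(batch)


def micro_batched_gradient(w, batch, micro_batch_size):
    """Mean gradient accumulated over micro-batches.

    Only `micro_batch_size` samples are processed at a time; their gradient
    contributions are accumulated and the result is used once at the end.
    """
    accumulated = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro_batch = batch[start:start + micro_batch_size]
        accumulated += grad_sum(w, micro_batch)
    return accumulated / len(batch)


# Hypothetical (x, y) samples.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
g_full = full_batch_gradient(0.5, batch)
g_micro = micro_batched_gradient(0.5, batch, micro_batch_size=1)
```

With batch normalization, the statistics inside each micro-batch would differ from those of the full mini-batch, so this equivalence would no longer hold.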
Lastly, out-of-core approaches can be used, where data is cached in host memory and
moved to accelerators on demand. On GPUs, CUDA unified memory can support this trans-
parently, as do some training frameworks [59, 60], although prefetch hints can be required
for good performance. However, these approaches result in additional memory transfers to
and from the GPU, potentially impacting performance and reducing bandwidth available
for inter-GPU communication when training [61, 62, 63].
The key limitation of these approaches is the requirement to train on at least a single,
complete data sample. This thesis solves this for convolutional neural networks by introduc-
ing spatial parallelism (Section 3.2) and channel and filter parallelism (Section 3.3). These
enable the data and computation to be further decomposed among processors, enabling ef-
ficient training even when a single sample is too large to fit in memory of a single processor.
1.1.2 Communication Overheads in Parallel Training
When training deep neural networks in distributed memory, a global allreduce operation is
required to combine the parameter updates produced by each processor. Each processor
contributes an update for each model parameter, which can require hundreds of megabytes of
data to be transferred (see Chapter 2 for more details on distributed training). These intense
communication requirements result in a major overhead, and communication
is often considered the primary bottleneck when training, limiting scalability [64, 65]. Indeed,
benchmarks performed as part of my work show this remains true on current leadership-class
systems (Figure 4.3). As improvements in compute capabilities continue to outpace improve-
ments in networking, we can expect the relative cost of communication to only increase.
There are two broad approaches to implementing these allreduces: parameter servers and
decentralized allreduce collectives (see Section 2.5). Parameter servers perform well at small
scale, and have properties that are advantageous in cloud environments, such as enabling
easy fault tolerance and performance in heterogeneous environments [66, 67, 68]. Decentralized
allreduces [69] are typically preferred at large scale and in HPC environments, as
the communication is more efficient and environments are typically dedicated and homogeneous
[70, 71]. Within this work, all results use decentralized allreduces.
Allreduces may be performed at a finer-grained level than all the parameters of an entire
DNN at once in order to enable improved communication/computation overlap. However,
this can result in a significant portion of the communication being latency-bound. Many
existing libraries for distributed DNN training have employed ring allreduces to accelerate
bandwidth-bound workloads [72, 73, 74, 75]. However, the latency cost of a ring allreduce
scales linearly with the number of processors involved, making them a poor choice for small
messages (as when overlapping) or at large numbers of processors.
This thesis presents results showing that a combination of ring allreduces for bandwidth-
bound workloads and butterfly allreduces for latency-bound workloads can significantly re-
duce communication overheads, particularly when combined with an efficient runtime frame-
work (Chapter 4).
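The trade-off between the two algorithms can be illustrated with the common latency-bandwidth (α-β) cost model, where α is the per-message latency and β the per-byte transfer time. The sketch below is an assumption-laden illustration: it uses the textbook cost of a ring allreduce and models the butterfly as recursive doubling on a power-of-two number of processors, ignores the cost of the reduction arithmetic, and uses made-up network parameters rather than measurements from this work.

```python
import math


def ring_allreduce_time(p, n_bytes, alpha, beta):
    """Ring allreduce: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n_bytes * beta


def butterfly_allreduce_time(p, n_bytes, alpha, beta):
    """Recursive-doubling butterfly: log2(p) steps, each moving all n bytes.

    Assumes p is a power of two.
    """
    steps = int(math.log2(p))
    return steps * (alpha + n_bytes * beta)


def faster_algorithm(p, n_bytes, alpha, beta):
    """Name of the algorithm with the lower modeled time."""
    ring = ring_allreduce_time(p, n_bytes, alpha, beta)
    butterfly = butterfly_allreduce_time(p, n_bytes, alpha, beta)
    return "ring" if ring < butterfly else "butterfly"


# Illustrative (assumed) network parameters: 1 microsecond latency, 10 GB/s.
ALPHA = 1e-6
BETA = 1e-10

small = faster_algorithm(1024, 4 * 1024, ALPHA, BETA)           # 4 KiB message
large = faster_algorithm(1024, 256 * 1024 * 1024, ALPHA, BETA)  # 256 MiB message
```

In this model the latency term of the ring grows linearly in p while that of the butterfly grows logarithmically, so the butterfly is favored for small messages and the ring for large ones.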
A more fundamental issue is simply that the volume of communication required for training
is very large. Methods to reduce the amount of data transmitted can, in addition to optimized
algorithms, further reduce communication overhead. There are two broad approaches to this:
sparsification and quantization [76, 77, 78, 79, 80, 81, 82, 83]. In sparsification methods, only
a subset of the parameters are transmitted, selected according to some importance metric.
In quantization methods, parameters are encoded into a more concise representation before
transmission. Both of these approaches trade increased local computation (to convert to
the sparsified or quantized form) in exchange for transmitting less data. Because these
change the parameter updates, these methods can also impact the learning and convergence
dynamics of the training process.
However, employing sparsification or quantization in practice still has many barriers and
open problems. Many approaches introduce new hyperparameters that need to be tuned,
and may require re-tuning of existing hyperparameters. The algorithms are often designed
for CPUs and may not efficiently transfer to GPUs or other accelerators, and are often only
considered in the parameter server setting. It is often unclear how best to implement them
in the context of a decentralized allreduce, which may have many intermediate steps where
quantized or sparsified data need to be combined.
This thesis presents adaptive quantization (see Section 5.2), a method that combines spar-
sification and quantization, and requires little additional hyperparameter tuning. Adaptive
quantization first selects only the important gradient updates to transmit, and then quan-
tizes them using one-bit quantization; an error feedback mechanism is used to help maintain
convergence. The only hyperparameter to tune controls the proportion of data transmitted.
I also present a technique to optimize reduction performance for decentralized allreduces on
one-bit quantized data (see Section 5.3).
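The general idea of combining sparsification, one-bit quantization, and error feedback can be sketched as below. This is only a rough illustration and not the algorithm of Section 5.2: selecting entries by sorting on magnitude, reconstructing each transmitted entry as the mean of the selected entries with the same sign, and all function names are assumptions made for the example.

```python
def quantize_with_error_feedback(gradient, residual, proportion):
    """Select roughly `proportion` of the entries and encode each with one bit.

    Returns (indices, signs, pos_value, neg_value, new_residual).
    """
    # Error feedback: add back what was not transmitted previously.
    corrected = [g + r for g, r in zip(gradient, residual)]

    # Sparsification (assumed rule): keep the largest-magnitude entries.
    keep = max(1, int(round(proportion * len(corrected))))
    order = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)
    indices = sorted(order[:keep])

    # One-bit quantization: each kept entry is sent as a sign bit and is
    # reconstructed as the mean of the kept entries with that sign.
    positives = [corrected[i] for i in indices if corrected[i] >= 0]
    negatives = [corrected[i] for i in indices if corrected[i] < 0]
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0
    signs = [corrected[i] >= 0 for i in indices]

    # The quantization error is kept locally for the next iteration.
    new_residual = list(corrected)
    for i, positive in zip(indices, signs):
        new_residual[i] -= pos_value if positive else neg_value
    return indices, signs, pos_value, neg_value, new_residual


def dequantize(length, indices, signs, pos_value, neg_value):
    """Rebuild a dense update from the transmitted representation."""
    dense = [0.0] * length
    for i, positive in zip(indices, signs):
        dense[i] = pos_value if positive else neg_value
    return dense


# Hypothetical gradient values.
gradient = [0.9, -0.05, 0.02, -1.1, 0.7, 0.01, -0.03, 0.04]
residual = [0.0] * len(gradient)
indices, signs, pos_value, neg_value, residual = quantize_with_error_feedback(
    gradient, residual, proportion=0.375
)
update = dequantize(len(gradient), indices, signs, pos_value, neg_value)
```

In this sketch the transmitted update plus the locally retained residual always equals the error-corrected gradient, so no gradient information is discarded, only delayed.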
1.1.3 Efficient System Utilization
It is critical to efficiently utilize system resources, such as compute and the network,
in order to reduce training times as much as possible. There are many aspects to this; for
example, having a local mini-batch that is too small may underutilize accelerators. Similarly,
the more efficient communication algorithms already discussed can utilize the network more
efficiently. In this thesis, I focus on underutilization caused by excessive synchronization
when performing communication. This is orthogonal to the other work on communication
optimization, and combining them together can result in further improvements.
Many communication libraries for deep learning build upon and/or emulate MPI. (While
the discussion in this thesis focuses on MPI, the issues apply more broadly.) MPI libraries
have optimized communication primitives, and many are also “CUDA-aware” [84], in that
they accept pointers to buffers located in GPU memory and can perform communication
operations with them. However, MPI runtimes are unaware of operations a user has en-
queued to run on a GPU; likewise, a GPU is unaware of MPI operations the host may
initiate. Therefore, when a user passes a GPU buffer to an MPI routine, MPI has no way to
determine whether a pending GPU operation will write to the buffer, and a GPU operation
cannot determine whether MPI will write to a buffer. To ensure correctness, a user must
manually add appropriate synchronization operations to complete pending GPU computa-
tions before performing communication, and cannot launch additional computation until the
communication completes. This forces applications to adopt a bulk-synchronous model of
separate computation and communication phases. Thus, the GPU is underutilized during
communication phases, and the network is underutilized during computation phases [85, 86].
Further, while launching GPU kernels is asynchronous, it is not free. Kernel launch latency
on modern GPUs is typically around 10 µs of CPU time. This overhead is typically hidden
by pipelining kernel launches: as long as kernels are not too short, the time to launch all but
the first can be hidden by other compute kernels. The bulk-synchronous model means that
this launch overhead must be paid every compute phase. This is of particular importance
when fine-grained communication is performed, which will be critical with algorithms such
as spatial parallelism (see Section 3.2).
Finally, frequent synchronization between the CPU and GPU tightly couples the two
together, limiting opportunities for overlapping GPU work with other operations on the
CPU. Of particular importance for training DNNs is the ability to overlap I/O with the
training process [87, 88].
In this thesis, I propose GPU-centric communication semantics, which enable communi-
cation operations to be treated “just like a CUDA kernel” (see Chapter 4). These enable
applications to avoid the bulk-synchronous model when performing communication. I also
describe an efficient implementation of this in a new open-source communication library
named Aluminum (Section 4.4).
1.1.4 Scalability
To accelerate training, large GPU supercomputers must be used, and it is important to
leverage them efficiently and scale effectively to large numbers of GPUs. Although there are
many notions of scalability that can be applied to training DNNs, it is useful to consider
mini-batch scaling. In this regime, strong scaling involves training a fixed model with a fixed
mini-batch on an increasing number of processors. Weak scaling fixes the number of samples
within the mini-batch assigned to each processor and increases the number of processors,
hence also increasing the global mini-batch size.
There are several immediate limitations to this scaling when using standard data-parallelism.
With strong scaling, the number of processors used cannot exceed the number of samples
within the mini-batch. With weak scaling, the learning dynamics and accuracy of the resulting
model limit the maximum size of the global mini-batch that can be productively used
(see Section 2.2). Both of these approaches break down when samples are large, as a single
processor may not be able to process a single sample.
To demonstrate scalability, this thesis combines the previously discussed approaches to
train on novel problems that were previously infeasible due to computational limitations.
The results of this are demonstrated primarily in Chapter 3.
1.2 CONTRIBUTIONS
This thesis makes the following contributions:
• Spatial parallelism, which partitions the spatial domain of convolutional layers, and
channel and filter parallelism, which partitions the channel and filter domains of convo-
lutional layers. These two approaches enable parallelism beyond the sample dimension
when training CNNs.
• Identification of optimized communication primitives for training deep neural networks,
particularly for latency-bound workloads.
• A GPU-centric semantics for communication operations, and an implementation of
them in the open-source Aluminum library.
• Adaptive quantization, a technique for quantizing communication when training DNNs
in order to reduce communication volume.
• Techniques to optimize the performance of quantized reduction operations in allre-
duces.
• In combination, these provide improved strong and weak scaling performance for training
existing DNN models, and enable training on previously unreachable problems.
1.3 ORGANIZATION
Chapter 2 provides an overview of relevant background material on neural networks and
their training. In Chapter 3, the algorithms for spatial parallelism and channel and filter
parallelism are developed and evaluated. Chapter 4 discusses communication optimizations
and GPU-centric communication semantics. Chapter 5 introduces adaptive quantization and
discusses the adaption of quantization in decentralized allreduces. Related work is discussed
in Chapter 6, and I provide concluding remarks and ideas for future research in Chapter 7.
CHAPTER 2: BACKGROUND
In this chapter, I present an overview of relevant background material. This is primarily
intended to convey the notation and terminology used in this thesis. Readers are encouraged
to utilize the references for a more comprehensive introduction to these topics.
2.1 DEEP NEURAL NETWORKS
Deep neural networks (DNNs), the primary class of machine learning model used in modern
deep learning [1, 2], are powerful function approximators [89, 90, 91] based on many-layered
artificial neural networks. I provide here a very brief overview of key components of DNNs,
with a focus on convolutional neural networks (CNNs). Training DNNs is discussed in the
next section.
The most basic component of a DNN is a neuron, which in modern DNNs can be thought of
as a perceptron combined with a nonlinear activation function, producing for a vector input
x, the output σ(wx + b), where w is a weight vector, b a bias vector, and σ the activation
function. Historically, activation functions such as the logistic or hyperbolic tangent function
were used; the rectified linear unit (ReLU), max(0, x), is now widely used [92, 93]. In practice,
many neurons are combined to form a fully-connected (or dense) layer; stacking many such
layers forms a multi-layer perceptron.
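As a minimal sketch of these definitions, the following computes σ(wx + b) for a single neuron with a ReLU activation and stacks neurons into fully-connected layers. The function names and the example weights are hypothetical, and each neuron is given a scalar bias, so that a layer's biases together form a vector.

```python
def relu(z):
    """Rectified linear unit, max(0, z)."""
    return max(0.0, z)


def neuron(weights, bias, x, activation=relu):
    """Compute sigma(w . x + b) for one neuron."""
    pre_activation = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return activation(pre_activation)


def dense_layer(weight_rows, biases, x, activation=relu):
    """A fully-connected layer: one neuron per row of weights."""
    return [neuron(w, b, x, activation) for w, b in zip(weight_rows, biases)]


def multi_layer_perceptron(layers, x):
    """Stack dense layers; `layers` is a list of (weight_rows, biases) pairs."""
    for weight_rows, biases in layers:
        x = dense_layer(weight_rows, biases, x)
    return x


# Hypothetical weights for a tiny two-layer network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # 2 inputs -> 2 neurons
    ([[2.0, 1.0]], [0.5]),                      # 2 inputs -> 1 neuron
]
output = multi_layer_perceptron(layers, [3.0, 1.0])
```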
The convolution operation [94, 40] was originally introduced to neural networks for image
classification problems, and has since become widely used in many domains. (Note that,
despite the name, the operation typically used in convolutional layers in practice is the
cross-correlation operator, not the convolution operator.) While conceptually similar to a
fully-connected layer, convolutional layers take advantage of local connectivity and weight
sharing to significantly reduce the number of parameters. It can be thought of as an infinitely
strong prior in favor of local interactions and translation equivariance, and has its origins in
the human vision system.
A bias term (the b from a perceptron) may be applied after a layer’s operation, prior to
applying the activation function. The bias is essentially a learned vector, which does not
depend on a layer’s input, that is added to the output in order to shift it from the origin.
2.2 TRAINING DEEP NEURAL NETWORKS
To train a DNN, one first selects a loss function (for example, mean squared error or
cross-entropy), and then selects parameters to minimize this loss function. By far the most
common approach for solving this optimization problem is mini-batch stochastic gradient
descent (SGD), often with momentum or Nesterov momentum [95]. In a mini-batch method,
a set of samples are randomly selected (without replacement) from a dataset and used to
approximate a descent direction for the optimizer; a complete pass over the dataset is referred
to as an epoch. One typically performs many epochs in order to minimize the loss function.
In addition to SGD, other commonly used optimizers include RMSProp [96], Adam [97], and
AdaGrad [98].
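The structure of mini-batch SGD with momentum can be sketched as follows, assuming a toy one-parameter model y = w·x with a squared-error loss. The classical momentum update, the hyperparameter values, and the synthetic dataset are all illustrative assumptions.

```python
import random


def sgd_momentum_epoch(w, velocity, dataset, batch_size, lr, momentum, rng):
    """Run one epoch of mini-batch SGD with classical momentum.

    The toy model is y = w * x with loss 0.5 * (w * x - y)**2.
    """
    order = list(range(len(dataset)))
    rng.shuffle(order)  # each sample is drawn once per epoch (no replacement)
    for start in range(0, len(order), batch_size):
        batch = [dataset[i] for i in order[start:start + batch_size]]
        # Mini-batch approximation of the gradient of the loss.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        velocity = momentum * velocity - lr * grad
        w = w + velocity
    return w, velocity


rng = random.Random(0)
# Synthetic data generated from y = 3x.
dataset = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
w, velocity = 0.0, 0.0
for epoch in range(200):
    w, velocity = sgd_momentum_epoch(w, velocity, dataset, 4, 0.05, 0.9, rng)
```

Setting `momentum` to zero recovers plain mini-batch SGD.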
Applying such optimizers requires computing gradients with respect to the loss function.
This is done using the backpropagation algorithm [99], which can be thought of as repeated
application of the chain rule from calculus, while using dynamic programming to avoid
redundant computations. Thus, at a high level, the training process consists first of a
forward propagation phase, in which the DNN makes its prediction on the samples in the
mini-batch (this is equivalent to inference), a backpropagation phase, in which gradients
are computed, and an optimization phase, in which the gradients are used to compute an
update to the parameters of the network. In general, backpropagation consists of two phases,
a backward-weights phase, in which the gradients for a layer’s parameters are computed, and
a backward-data phase, in which an error signal is computed to pass to the next layer (the
names of these phases are adapted from the corresponding cuDNN functions).
To parallelize training, the typical approach is to exploit sample-wise parallelism within the
mini-batch. This is commonly referred to as data parallelism, but I will refer to it as sample
parallelism in this thesis (see Chapter 3). In this approach, a mini-batch is partitioned
among processors, which can each perform forward and backpropagation independently.
After this, each processor holds the gradients for the samples it processed locally,
which need to be aggregated across all processors. This takes the form of an allreduce
with a summation reduction (see Section 2.5). Sample parallelism is widely used in both
shared- and distributed-memory environments.
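Sample parallelism can be sketched in a few lines by simulating the processors within one Python process. The toy loss, the partitioning scheme, and the `allreduce_sum` stand-in (which simply hands every simulated processor the global sum) are assumptions for illustration; a real implementation would use a communication library.

```python
def local_gradient_sum(w, samples):
    """Sum of per-sample gradients for the toy loss 0.5 * (w * x - y)**2."""
    return sum((w * x - y) * x for x, y in samples)


def partition(mini_batch, num_procs):
    """Split a mini-batch into contiguous, near-equal shards, one per processor."""
    base, extra = divmod(len(mini_batch), num_procs)
    shards, start = [], 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)
        shards.append(mini_batch[start:start + size])
        start += size
    return shards


def allreduce_sum(values):
    """Stand-in for an allreduce: every processor receives the global sum."""
    total = sum(values)
    return [total for _ in values]


# Hypothetical (x, y) samples.
mini_batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 9.0), (6.0, 8.0)]
w = 0.5
shards = partition(mini_batch, num_procs=4)
local = [local_gradient_sum(w, shard) for shard in shards]  # computed independently
reduced = allreduce_sum(local)                              # identical on every rank
sample_parallel_grad = reduced[0] / len(mini_batch)
serial_grad = local_gradient_sum(w, mini_batch) / len(mini_batch)
```

Because the gradient of the mini-batch loss is a sum over samples, the sample-parallel result matches the serial one.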
2.2.1 Training Convolutional Layers
To make this more concrete, I now discuss forward and backpropagation for a convolutional
layer in more detail. For simplicity, I will consider only convolution on 2D data; generalizing
to other dimensions is easy. Let the layer have F filters of size K×K, stride S, and padding
P; to further simplify I will assume that K is odd and write O = ⌊K/2⌋ to be the number
of filter entries on either side of the center. The input to the layer will consist of N samples
(the local mini-batch size), each with C channels, height H, and width W. Each filter will be
applied to each sample, resulting in N output samples with F channels (also called feature
maps), height H̃, and width W̃ , which are defined to be
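\[
\tilde{H} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1, \qquad
\tilde{W} = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1.
\]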
Thus one can think of the layer as having six associated 4D tensors: a N × C ×H ×W







(where L is the loss function).
To further simplify, I will assume stride S = 1 and “same” padding, so the output height









where we assume that subscripts may be “out of range” and are handled by padding.


























For simplicity, I will use ConvForward, ConvBackData, and ConvBackFilt to refer to the





Note that the sample-parallel formulation of distributed training is immediately visible
from Equation 2.5, as the summation over N can be distributed over processors and then
completed with an allreduce. The reduction aspect is due to the summation, and since
filters are replicated on every processor, each of which contributes updates, the reduction
must involve all processors.
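To make the three operations concrete, the following is a naive reference sketch of forward propagation, backward-filter, and backward-data for the simplified case of stride 1, "same" zero padding, and odd K. It uses the cross-correlation form and standard textbook index conventions, which are assumptions and may differ in detail from the notation used in this thesis; the function names mirror ConvForward, ConvBackFilt, and ConvBackData but are otherwise hypothetical.

```python
def zeros(*shape):
    """Nested lists of zeros with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(*shape[1:]) for _ in range(shape[0])]


def at(t, n, c, i, j):
    """Read t[n][c][i][j]; out-of-range spatial indices read as zero padding."""
    if 0 <= i < len(t[n][c]) and 0 <= j < len(t[n][c][0]):
        return t[n][c][i][j]
    return 0.0


def conv_forward(x, w):
    """y[n][f][i][j] = sum over c, a, b of x[n][c][i+a-O][j+b-O] * w[f][c][a][b]."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    F, K = len(w), len(w[0][0])
    O = K // 2
    y = zeros(N, F, H, W)
    for n in range(N):
        for f in range(F):
            for i in range(H):
                for j in range(W):
                    y[n][f][i][j] = sum(
                        at(x, n, c, i + a - O, j + b - O) * w[f][c][a][b]
                        for c in range(C) for a in range(K) for b in range(K))
    return y


def conv_back_filt(x, dy, K):
    """dL/dw: sums over samples and spatial positions."""
    N, C = len(x), len(x[0])
    F, H, W = len(dy[0]), len(dy[0][0]), len(dy[0][0][0])
    O = K // 2
    dw = zeros(F, C, K, K)
    for f in range(F):
        for c in range(C):
            for a in range(K):
                for b in range(K):
                    dw[f][c][a][b] = sum(
                        dy[n][f][i][j] * at(x, n, c, i + a - O, j + b - O)
                        for n in range(N) for i in range(H) for j in range(W))
    return dw


def conv_back_data(dy, w):
    """dL/dx: the error signal passed to the next layer in the backward pass."""
    N, F, H, W = len(dy), len(dy[0]), len(dy[0][0]), len(dy[0][0][0])
    C, K = len(w[0]), len(w[0][0])
    O = K // 2
    dx = zeros(N, C, H, W)
    for n in range(N):
        for c in range(C):
            for i in range(H):
                for j in range(W):
                    dx[n][c][i][j] = sum(
                        at(dy, n, f, i - a + O, j - b + O) * w[f][c][a][b]
                        for f in range(F) for a in range(K) for b in range(K))
    return dx


# Small integer-valued tensors: N=2 samples, C=2 channels, F=2 filters, 3x3 data.
x = [[[[float((3 * i + 2 * j + c + n) % 5 - 2) for j in range(3)]
       for i in range(3)] for c in range(2)] for n in range(2)]
w = [[[[float((a + 2 * b + c + f) % 3 - 1) for b in range(3)]
       for a in range(3)] for c in range(2)] for f in range(2)]
dy = [[[[float((i + j + f + n) % 4 - 1) for j in range(3)]
        for i in range(3)] for f in range(2)] for n in range(2)]
y = conv_forward(x, w)
dw = conv_back_filt(x, dy, K=3)
dx = conv_back_data(dy, w)
```

The sum over `n` in `conv_back_filt` is the part that sample parallelism distributes: calling it separately on each sample and adding the results gives the same `dw`, which is what the allreduce computes across processors.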
2.2.2 Scaling Training
It is natural now to consider scaling training by using more processors. For training
DNNs, scaling is typically defined as mini-batch scaling, with analogues to traditional strong
and weak scaling. In this regime, strong scaling is defined as training a fixed model with
a fixed mini-batch size using additional processors and weak scaling trains a fixed model
with a fixed mini-batch size per processor. Hence, as additional processors are added, the
global mini-batch size grows when doing weak scaling (this has its own set of issues, see
Section 2.2.3).
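The two regimes can be sketched as simple bookkeeping on mini-batch sizes. The function names and the example sizes are hypothetical; the check in `strong_scaling` reflects the limit that sample parallelism needs at least one sample per processor.

```python
def strong_scaling(global_mini_batch, num_procs):
    """Fixed global mini-batch; the per-processor share shrinks as processors are added."""
    if num_procs > global_mini_batch:
        raise ValueError("sample parallelism needs at least one sample per processor")
    base, extra = divmod(global_mini_batch, num_procs)
    return [base + (1 if rank < extra else 0) for rank in range(num_procs)]


def weak_scaling(per_proc_mini_batch, num_procs):
    """Fixed per-processor mini-batch; the global mini-batch grows with processors."""
    return per_proc_mini_batch * num_procs


# Per-processor sample counts for a fixed global mini-batch of 256.
strong = {p: strong_scaling(256, p) for p in (64, 128, 256)}
# Global mini-batch sizes for a fixed 32 samples per processor.
weak = {p: weak_scaling(32, p) for p in (1, 8, 64)}
```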
There are a number of different metrics one can consider when evaluating the speedup of
scaling training, or optimizations in general. I list several below:
Mini-batch time This is the time to process one mini-batch of a fixed size.
Epoch time This is the time to complete one pass over an entire dataset.
Time to accuracy This is the time to train a DNN to a given accuracy, where accuracy
refers to some metric used to evaluate the performance of the network. For classifica-
tion, this is the proportion of samples assigned the correct class.
These metrics are often closely related. For a fixed dataset and mini-batch size, one can
compute epoch time from mini-batch time; and if learning dynamics are not altered, one
should be able to compute time to accuracy from either. When scaling, mini-batch time is
often used to evaluate strong scaling and epoch time to evaluate weak scaling.¹
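The relationship between these metrics can be written out directly. The sketch below assumes every mini-batch takes the same time, that a final partial mini-batch counts as a full iteration, and that the number of epochs needed to reach the target accuracy is known and unaffected by the optimization being evaluated; the numbers are hypothetical.

```python
import math


def epoch_time(dataset_size, mini_batch_size, mini_batch_time):
    """Time for one pass over the dataset, given the time per mini-batch."""
    iterations = math.ceil(dataset_size / mini_batch_size)
    return iterations * mini_batch_time


def time_to_accuracy(epochs_needed, dataset_size, mini_batch_size, mini_batch_time):
    """Total training time if learning dynamics are unchanged."""
    return epochs_needed * epoch_time(dataset_size, mini_batch_size, mini_batch_time)


# Hypothetical run: 1,281,167 samples, mini-batch of 256, 0.25 s per mini-batch.
demo_epoch = epoch_time(1_281_167, 256, 0.25)
demo_total = time_to_accuracy(90, 1_281_167, 256, 0.25)
```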
2.2.3 Large-Batch Training
Based upon the above, a naïve approach to accelerating training would be to aggres-
sively weak-scale and use a mini-batch as large as available (distributed) memory permits.
Unfortunately, this “large-batch” training regime runs into major issues.
The first issue is the so-called “generalization gap”, whereby models that are trained
with large mini-batches perform worse on held-out validation datasets than the same model
trained with a smaller mini-batch [100]. The exact nature of this problem remains open.
Some have speculated that models trained with very large mini-batches converge to “sharper”
minima that do not generalize as well [101, 100], although the right definition of “sharpness”
is unclear [102]. Nevertheless, this gap has been consistently observed empirically [103].
¹ Another perspective on scaling that is important to note, although I do not consider it in this thesis, is
that when using the time to accuracy metric, mini-batch weak and strong scaling can both be considered
strong scaling when using a fixed model and dataset.
For certain models and datasets (most famously, ResNet-50, see Section 2.3.2), techniques
have been developed to enable training with large mini-batch sizes through careful learning
rate management, such as linear scaling with a warmup [103], square-root scaling [104], and
layer-wise adaptive learning rates [105]. Other works have explored varying the mini-batch
size itself [106, 107]. However, beyond a certain point, the generalization gap continues to
persist.
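The learning rate rules mentioned above can be sketched as follows. This is a minimal illustration of linear scaling, square-root scaling, and a linear warmup, not the exact schedules of [103] or [104]; the function names, the reference mini-batch size, and the warmup length are assumptions made for the example.

```python
import math


def scaled_lr(base_lr: float, base_batch: int, batch: int, rule: str = "linear") -> float:
    """Scale a learning rate tuned for `base_batch` to a larger mini-batch."""
    ratio = batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Linearly ramp from start_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Hypothetical example: a rate tuned at mini-batch 256, scaled to 8192.
target = scaled_lr(0.1, 256, 8192, rule="linear")
print([round(warmup_lr(s, 5, 0.1, target), 3) for s in range(7)])
```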
More recently, Shallue et al. [108] have studied more directly the effects of large mini-
batches on training when given a significant compute budget for hyperparameter tuning for
every mini-batch size. They observe that, even in this regime, the statistical efficiency (i.e.
number of iterations to converge) plateaus once a mini-batch becomes large enough, and
that this phenomenon empirically holds across a number of different models and datasets.
2.2.4 Model Parallelism
In addition to sample parallelism, model parallelism is another commonly-used class of
techniques for parallelizing training. While there is no widely-used definition of it, in this
thesis, I will view model parallelism as techniques that partition the parameters of a DNN,
in contrast to data parallelism, which partitions inputs and activations. Note that model
and data parallelism need not be exclusive, as the channel and filter parallelism described
in this thesis partitions both parameters and activations.
Section 6.2 provides a discussion on some approaches to model parallelism. In brief, it
is commonly applied to fully-connected layers, as they have significantly more parameters,
making partitioning them more profitable. This is typically realized through standard dis-
tributed linear algebra algorithms, such as distributed matrix-matrix products. One can
also combine model and sample parallelism hierarchically, where model parallelism is used
over a group of processors, to which a subset of samples is assigned.
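To make the fully-connected case concrete, the following sketch partitions the rows of a weight matrix (i.e. the output neurons) over several simulated processors, computes each partial output locally, and concatenates the results as an allgather would. It is a sequential simulation with hypothetical function names, not how a distributed linear algebra library is implemented.

```python
def matvec(weights, x):
    """Dense matrix-vector product y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def model_parallel_matvec(weights, x, processors):
    """Partition the rows of W over `processors` and assemble the output."""
    rows = len(weights)
    assert rows % processors == 0, "for simplicity, rows must divide evenly"
    block = rows // processors
    # Each simulated processor owns a contiguous block of output neurons.
    partial = [matvec(weights[r * block:(r + 1) * block], x)
               for r in range(processors)]
    # An allgather would concatenate the partial outputs in processor order.
    return [y for part in partial for y in part]


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1, 0, -1]
print(model_parallel_matvec(W, x, processors=2))
```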
2.3 OVERVIEW OF COMMON NEURAL NETWORK ARCHITECTURES
I now give a brief overview of some additional layers commonly used in CNNs, and some
common neural network architectures and their characteristics.
2.3.1 Additional Common Layers
The pooling layer is widely used within CNNs to provide invariance to small translations.
In many ways it functions similarly to a convolutional layer, but applies a fixed function,
such as the average or maximum within its filter, instead of a learnable filter.
Batch normalization [109] is a widely used operation to improve the convergence rate of
deep networks. At a high level, it normalizes a mini-batch to have mean 0 and variance
1, but with learnable parameters to modulate this transform. It is important to note that
the mean and variance used to normalize a mini-batch are computed from that mini-batch;
hence it is important to have a sufficiently large mini-batch to compute reliable statistics;
empirically, roughly 16 samples seems to be sufficient [110]. To avoid excess communication,
it is standard to compute this normalization separately on each processor using only the local
mini-batch; when this is too small, methods to aggregate statistics from multiple processors
have been employed [111].
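A minimal sketch of the computation, for a single feature, is given below. The first function normalizes using only the local mini-batch; the second shows one possible way to aggregate statistics across processors by summing counts, sums, and sums of squares (which an allreduce could provide). The latter is an assumption for illustration and is not necessarily the method of [111]; all names are hypothetical.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


def aggregated_stats(local_batches):
    """Global mean and variance from per-processor partial sums."""
    n = sum(len(b) for b in local_batches)
    s = sum(sum(b) for b in local_batches)
    ss = sum(sum(x * x for x in b) for b in local_batches)
    mean = s / n
    return mean, ss / n - mean * mean


print(batch_norm([1.0, 2.0, 3.0, 4.0]))
print(aggregated_stats([[1.0, 2.0], [3.0, 4.0]]))
```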
2.3.2 Common Architectures
AlexNet [3] is a CNN architecture for image classification considered by many to have
helped launch the modern success of deep learning. At a high level, it consists of a stack
of five convolutional layers, interspersed with three max-pooling layers, followed by two
4096-neuron fully-connected layers and a final 1000-neuron fully-connected layer. All layers
use the ReLU activation and the fully-connected layers use dropout regularization [112],
except the last, which uses a softmax activation. Although this architecture is historically
important, it is no longer particularly relevant today: other architectures have surpassed it.
In general, modern networks are significantly deeper and have eschewed large fully-connected
layers.
ResNets [4] have become the basis of many modern CNN architectures. These networks
are built out of many stacks of “building blocks”, small sets of layers that are repeated.
The key architectural advancement in ResNets was the introduction of residual connections,
identity skip connections around each building block. ResNets are also divided into several
“modules”, each of which operates on a fixed spatial domain size; as the spatial domain is
decreased, the number of filters (and hence channels) in each convolutional layer is increased
to keep the amount of computation roughly constant. While ResNets can be constructed
with almost arbitrary numbers of layers, five sizes were defined in the original paper, with
18, 34, 50, 101, and 152 layers. ResNet-50, with 50 layers, is widely used for classification on
the ImageNet dataset [5]. ResNets are often used as the basis for many modern classification
architectures (e.g., [113, 114, 38]) and architectures for other tasks (e.g., [115, 7, 116]).
Compared to AlexNet, ResNets are significantly deeper and do not contain any large fully-
connected layers. The basic structure of ResNets has influenced many subsequent networks,
and ResNets are often used as a component in networks for other problems (e.g. semantic
segmentation [115]). Hence, optimizations improving ResNets can be widely applicable
within modern deep learning.
An important variant of the standard CNN architecture is that of fully-convolutional
networks [117]; these are based on the observation that any fully-connected layer can be
replaced by a convolutional layer with an appropriately chosen filter size. A major advantage
of this change is that the network can be applied to arbitrarily-sized inputs.
Two variants of the convolution operation are also common: depthwise and dilated. A
depthwise convolution [118, 43] performs the summation over channel dimensions only within
fixed groups instead of over every group. These are often followed by a 1× 1 convolution to
form what are called depthwise-separable convolutions. Dilated, or atrous, convolutions [119]
insert holes into the convolutional filter while expanding it, enabling a larger field of view to
be processed without downsampling input data.
2.4 PERFORMANCE MODELING
Performance modeling will be important throughout this work as a way to guide intuition
on the performance of algorithms and to estimate how far implementations are from “optimal”
(under some model). In general, there are two broad categories of operations that will be
considered in this thesis when estimating performance: convolution and communication. For
most DNNs considered here, the convolution operation dominates compute time; and when
training in distributed memory, communication overheads are critical to estimate. While
other aspects (e.g. I/O) can be important, this coarse-grained estimate is sufficient for the
work here.
Convolution is particularly difficult to model, especially on GPUs. There are a great many
algorithms that can be used to implement it (e.g. [120, 121, 122]), many of which have further
variants and optimizations in vendor libraries to accelerate particular configurations. This is
further complicated by the difficulty of modeling performance on GPUs or other accelerators.
Instead of attempting to model this analytically, I instead use empirical measurements from
a simple cuDNN [123] benchmark. This benchmark first does autotuning to select the
fastest cuDNN algorithm, then does several warmup runs, then measures several iterations
of convolution and reports the arithmetic average runtime. Measurement is done using
CUDA events to measure execution time on the GPU without synchronization overheads. A
spin kernel is launched onto the GPU before any convolution kernels are enqueued, in order
to hide kernel launch overheads.
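The overall structure of such a benchmark (autotune, warm up, then average several timed iterations) can be sketched generically as below. This is only an analogy in plain Python: it times arbitrary callables with a host-side clock, whereas the benchmark described above selects among cuDNN algorithms and times them with CUDA events. The function names and default iteration counts are assumptions.

```python
import time


def time_once(fn) -> float:
    """Wall-clock time of a single call, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def benchmark(candidates, warmup=3, iters=10):
    """Pick the fastest candidate, warm it up, and report its mean runtime."""
    # "Autotuning": run each candidate once and keep the fastest.
    best = min(candidates, key=lambda name: time_once(candidates[name]))
    fn = candidates[best]
    for _ in range(warmup):  # warmup runs are not measured
        fn()
    times = [time_once(fn) for _ in range(iters)]
    return best, sum(times) / len(times)  # arithmetic average


# Hypothetical stand-ins for different implementations of one operation.
candidates = {
    "builtin_sum": lambda: sum(range(10_000)),
    "manual_loop": lambda: [i for i in range(10_000)],
}
print(benchmark(candidates))
```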
When representing convolution runtime in performance models, I will, in line with the
notation in Section 2.2.1, refer to the runtime of forward propagation, backward-data, and
backward-weights as ConvForward, ConvBackData, and ConvBackFilt, to elide the differ-
ences in underlying implementation.
Communication is modeled analytically using the linear model [124]. In this model, a
message consists of two parts: a latency or startup time, α, and the time to send a word
or the inverse bandwidth, β. I will further assume that the network is full-duplex and that
there is no interference. Thus, the time to send and receive a message of n words is α+ βn.
Estimating α and β requires empirical measurements, for which I utilize the OSU MPI
benchmarks [125]. Since communication costs can vary significantly depending on whether
data resides in CPU or GPU memory, different measurements are made for each case. To
compress notation, I will use SR(n) = α + βn to be the time to send and receive n words.
When computation is involved, a γn term is introduced, where γ is the time to perform the
computation on a single word. Also, throughout this thesis, I will use lg = log₂.
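As a small sketch of how the linear model is used, the code below evaluates SR(n) and recovers α and β from a set of (message size, time) measurements by ordinary least squares. The fitting procedure and the synthetic numbers are assumptions for illustration; the text above states only that α and β are estimated from OSU benchmark measurements.

```python
def send_recv_time(n: float, alpha: float, beta: float) -> float:
    """SR(n) = alpha + beta * n: time to send and receive n words."""
    return alpha + beta * n


def fit_alpha_beta(samples):
    """Least-squares fit of time = alpha + beta * n from (n, time) pairs."""
    k = len(samples)
    mean_n = sum(n for n, _ in samples) / k
    mean_t = sum(t for _, t in samples) / k
    cov = sum((n - mean_n) * (t - mean_t) for n, t in samples)
    var = sum((n - mean_n) ** 2 for n, _ in samples)
    beta = cov / var
    alpha = mean_t - beta * mean_n
    return alpha, beta


# Synthetic "measurements" generated from assumed alpha and beta values.
true_alpha, true_beta = 2e-6, 1e-9
samples = [(n, send_recv_time(n, true_alpha, true_beta))
           for n in (1, 1024, 65536, 1048576)]
print(fit_alpha_beta(samples))
```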
2.5 COLLECTIVE COMMUNICATION
Collective communication plays a critical role in distributed training algorithms, including
the standard sample-parallelism approaches, as well as new ones introduced in this thesis.
This section will provide a brief overview of some important collective operations and the
algorithms that implement them; see [126, 69] for a more complete introduction. We define
seven operations (although not all are directly used):
Broadcast A vector on a single processor is transmitted to all processors.
Reduce Each processor has a vector, and the result of combining every vector with an
element-wise operation (typically summation) is placed on a single processor.
Scatter A vector on a single processor is split evenly into segments and the segments are
transmitted such that the ith processor receives the ith segment.
Gather Each processor has a vector, and the result of concatenating the vectors in processor
order is placed on a single processor.
Reduce-scatter Each processor has a vector, all of which are combined with an element-
wise operation, and then the result is split evenly into segments and the segments
transmitted such that the ith processor receives the ith segment. This is logically
equivalent to performing a reduce followed by a scatter.
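Table 2.1: Summary of important collective algorithms and their runtime.
Collective      Algorithm            Runtime
Broadcast       Binomial tree        α lg p + βn lg p
                Van de Geijn         (lg p + p − 1)α + 2((p − 1)/p)nβ
Reduce          Binomial tree        α lg p + βn lg p + γn lg p
                Rabenseifner         2α lg p + 2((p − 1)/p)nβ + ((p − 1)/p)nγ
                Ring                 2(p − 1)α + 2((p − 1)/p)nβ + ((p − 1)/p)nγ
Reduce-scatter  Recursive-halving    α lg p + ((p − 1)/p)nβ + ((p − 1)/p)nγ
                Ring                 (p − 1)α + ((p − 1)/p)nβ + ((p − 1)/p)nγ
Allgather       Recursive-doubling   α lg p + ((p − 1)/p)nβ
                Ring                 (p − 1)α + ((p − 1)/p)nβ
Allreduce       Parameter server     2pα + 2pnβ
                Tree                 2α lg p + 2βn lg p + γn lg p
                Recursive-doubling   α lg p + βn lg p + γn lg p
                Rabenseifner         2α lg p + 2((p − 1)/p)nβ + ((p − 1)/p)nγ
                Ring                 2(p − 1)α + 2((p − 1)/p)nβ + ((p − 1)/p)nγ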
Allgather Each processor has a vector, and the result of concatenating all these vectors, in
processor order, is placed on every processor. This is logically equivalent to performing
a gather followed by a broadcast.
Allreduce Each processor has a vector, all of which are combined with an element-wise
operation, and then the result is placed on every processor. This is logically equivalent
to a reduce followed by a broadcast.
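The semantics of these operations, and the logical equivalences noted above, can be checked with a small sequential simulation in which a list of vectors stands for the data held by each processor. The function names are illustrative only and do not correspond to any particular communication library.

```python
def reduce(vectors):
    """Element-wise sum of every processor's vector (held by the root)."""
    return [sum(col) for col in zip(*vectors)]


def broadcast(vector, p):
    """Every one of p processors receives a copy of the root's vector."""
    return [list(vector) for _ in range(p)]


def scatter(vector, p):
    """The i-th processor receives the i-th of p equal segments."""
    seg = len(vector) // p
    return [vector[i * seg:(i + 1) * seg] for i in range(p)]


def gather(vectors):
    """Concatenate the vectors, in processor order, on the root."""
    return [x for v in vectors for x in v]


def reduce_scatter(vectors):
    return scatter(reduce(vectors), len(vectors))


def allgather(vectors):
    return broadcast(gather(vectors), len(vectors))


def allreduce(vectors):
    return broadcast(reduce(vectors), len(vectors))


data = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(reduce_scatter(data))
print(allreduce(data))
```

In this simulation, applying `allgather` to the output of `reduce_scatter` yields the same result as `allreduce`, which is the decomposition exploited by the long-message algorithms discussed in this section.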
The key operation for distributed training is the allreduce. This can be implemented in
a number of different ways. I will assume that there are p processors (excluding centralized
servers), and that p is a power of 2. In general, it is relatively easy to adapt these algorithms
to the non-power-of-2 case, although it typically requires an additional step. A common
approach uses a centralized parameter server, where each worker processor transmits a vector
to the server, and after all vectors are received and combined, the result is transmitted back
to all the workers. Assuming that the parameter server can receive from or send to one
worker processor at a time, this will take time 2pα + 2pnβ to receive the updates and send
the combined result back to each of the workers.
A better approach can be had by implementing the allreduce as a reduce to the parameter
server, followed by a broadcast. Broadcast is typically implemented with one of two
algorithms, depending on the message length:
• For short messages, a binomial tree algorithm, which takes time α lg p+ βn lg p.
• For long messages, an algorithm due to van de Geijn et al. [127, 128] implements
the broadcast as a binomial tree scatter followed by a ring allgather, taking time
(lg p + p − 1)α + 2((p − 1)/p)nβ. (Other variants on this algorithm are also possible.)
The reduce is commonly implemented with one of three algorithms:
• For short messages, a binomial tree algorithm, which takes time α lg p+βn lg p+γn lg p.
• For long messages, an algorithm due to Rabenseifner [129] implements the reduce
as a recursive-halving reduce-scatter followed by a binomial tree gather, taking time
2α lg p + 2((p − 1)/p)nβ + ((p − 1)/p)nγ.
• Also for long messages, a ring algorithm for the reduce-scatter, followed by a gather
implemented by each process sending directly to the root, can be used, taking time
2(p − 1)α + 2((p − 1)/p)nβ + ((p − 1)/p)nγ.
Instead of relying on a centralized parameter server, a decentralized allreduce can be
performed directly. One approach to this is to use a reduce followed by a broadcast, as with
parameter servers, but without needing a dedicated server. This approach, using a binomial
tree reduce and broadcast, has been used to implement latency-sensitive allreduces. Three
other algorithms are also common.
• For short messages, a recursive-doubling algorithm, which takes time α lg p+βn lg p+
γn lg p.
• For long messages, an algorithm again due to Rabenseifner [129] implements an allreduce
as a recursive-halving reduce-scatter followed by a recursive-doubling allgather, taking
time 2α lg p + 2((p − 1)/p)nβ + ((p − 1)/p)nγ.
• Also for long messages, and very commonly used in distributed DNN training, ring
algorithms implement the allreduce as a ring reduce-scatter followed by a ring allgather,
taking time 2(p − 1)α + 2((p − 1)/p)nβ + ((p − 1)/p)nγ.
Figure 2.1: LBANN software stack, with the Hydrogen distributed linear algebra library and
the Aluminum communication library.
Although the ring algorithm is strictly worse than Rabenseifner’s algorithm in this
performance model, it can perform well in practice
due to its nearest-neighbor communication pattern when the ring is appropriately mapped
to the underlying network topology.
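The nearest-neighbor structure of a ring allreduce can be seen in the following sequential simulation, written as a ring reduce-scatter followed by a ring allgather. This is the commonly described formulation and is given as a sketch under that assumption, not as the implementation used in this thesis; it assumes p evenly divides the vector length.

```python
def ring_allreduce(vectors):
    """Simulate a ring allreduce (sum); returns (per-processor data, steps)."""
    p = len(vectors)
    n = len(vectors[0])
    assert n % p == 0, "for simplicity, p must evenly divide n"
    seg = n // p
    data = [list(v) for v in vectors]

    def segment(i):
        i %= p
        return slice(i * seg, (i + 1) * seg)

    steps = 0
    # Reduce-scatter: in step s, rank r sends segment (r - s) to rank r + 1,
    # which adds it to its own copy of that segment.
    for s in range(p - 1):
        outgoing = [data[r][segment(r - s)] for r in range(p)]
        for r in range(p):
            dst, k = (r + 1) % p, segment(r - s)
            data[dst][k] = [a + b for a, b in zip(data[dst][k], outgoing[r])]
        steps += 1
    # Rank r now holds the fully reduced segment (r + 1).
    # Allgather: in step s, rank r forwards segment (r + 1 - s) to rank r + 1.
    for s in range(p - 1):
        outgoing = [data[r][segment(r + 1 - s)] for r in range(p)]
        for r in range(p):
            data[(r + 1) % p][segment(r + 1 - s)] = outgoing[r]
        steps += 1
    return data, steps


vectors = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
result, steps = ring_allreduce(vectors)
print(result[0], steps)
```

Each of the 2(p − 1) steps moves only n/p words between neighboring ranks, which is where the latency and bandwidth terms of the ring cost come from.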
The work in Section 3.3 will also require reduce-scatter and allgather collectives directly.
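The reduce-scatter is commonly implemented with one of two algorithms:
• A recursive-halving algorithm, taking time α lg p + ((p − 1)/p)nβ + ((p − 1)/p)nγ.
• A ring algorithm, taking time (p − 1)α + ((p − 1)/p)nβ + ((p − 1)/p)nγ.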
Similarly, the allgather can be implemented via two algorithms:
• A recursive-doubling algorithm, taking time α lg p + ((p − 1)/p)nβ.
• A ring algorithm, taking time (p − 1)α + ((p − 1)/p)nβ.
Table 2.1 summarizes the runtime of these collective operations.
Note that the algorithms presented here for collectives are not exhaustive, and many others
exist. In particular, more specialized algorithms for particular network topologies or system
configurations have been developed [130, 131, 132, 133, 134, 111].
Given the many different algorithms involved here, we will write AR(n, p) to be the time
for an allreduce on n words and p processors, RS(n, p) for a reduce-scatter, and AG(n, p)
for an allgather. The optimal algorithm for the given parameters and system can then be
selected.
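Selecting the cheapest algorithm under the linear model can be sketched as below for AR(n, p). The parameter-server, tree, and recursive-doubling costs are those given in this section; the Rabenseifner and ring costs are the standard expressions for these algorithms in the α-β-γ model and are stated here as an assumption. The values of α, β, and γ in the example are placeholders, not measurements, and p is assumed to be a power of 2.

```python
import math


def allreduce_costs(n, p, alpha, beta, gamma):
    """Modeled runtime of several allreduce algorithms on n words, p processors."""
    lg = math.log2(p)
    frac = (p - 1) / p
    return {
        "parameter server": 2 * p * alpha + 2 * p * n * beta,
        "tree": 2 * alpha * lg + 2 * beta * n * lg + gamma * n * lg,
        "recursive-doubling": alpha * lg + beta * n * lg + gamma * n * lg,
        # Standard alpha-beta-gamma costs (assumed, see lead-in).
        "Rabenseifner": 2 * alpha * lg + 2 * frac * n * beta + frac * n * gamma,
        "ring": 2 * (p - 1) * alpha + 2 * frac * n * beta + frac * n * gamma,
    }


def AR(n, p, alpha, beta, gamma):
    """Return (algorithm, time) for the cheapest modeled allreduce."""
    costs = allreduce_costs(n, p, alpha, beta, gamma)
    best = min(costs, key=costs.get)
    return best, costs[best]


# Placeholder parameters: 1 us latency, 1 ns per word, 0.1 ns per reduced word.
for n in (1, 10**4, 10**8):
    print(n, AR(n, 16, 1e-6, 1e-9, 1e-10))
```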
2.6 LBANN TOOLKIT
The LBANN toolkit2 [70] is a toolkit designed to support distributed training of DNNs on
HPC systems. It is primarily built upon MPI to facilitate parallel training and a combina-
tion of cuDNN-based [123] and custom CUDA kernels for GPU-accelerated training. Some
2https://github.com/LLNL/lbann
core components of its basic software stack are illustrated in Figure 2.1. The Aluminum li-
brary (see Chapter 4) provides the underlying communication interface, while the Hydrogen
library3, a fork of the now-defunct Elemental library [135], provides support for distributed
dense linear algebra.
Historically, LBANN began as a toolkit to support distributed model-parallel training for
large fully-connected layers, by leveraging the Elemental library for distributed dense linear
algebra. It has since evolved to a more comprehensive toolkit for deep learning.
3https://github.com/LLNL/Elemental/
CHAPTER 3: DISTRIBUTED-MEMORY CONVOLUTION
In this chapter, I present algorithms for parallelizing convolution beyond sample-wise mini-
batch parallelism, and evaluations of these algorithms.1 I begin by introducing the notation
used to describe these algorithms, then present two classes of algorithms: spatial parallelism,
to partition the spatial domains of activations when training a CNN, and channel and filter
parallelism, to partition the channels and filters of a CNN.
These approaches enable two broad advancements over standard data-parallel techniques
for training CNNs:
1. Training a CNN can be strong-scaled. This allows training to be accelerated by utilizing
large-scale compute resources. In particular, this can be done without introducing any
additional hyperparameters, which would require additional compute budgets in order
to tune.
2. Training is enabled for problems where parallelism is required due to memory limitations.
This provides a new capability to handle training CNNs on datasets with very large
spatial or channel domains, or very wide models with many filters.
Conceptually, the underlying idea of this work is to observe that a convolutional layer
applied to 2D data has dimensions N × C × F × H × W , each of which can be parti-
tioned (extending this to different numbers of spatial dimensions is simple). I will refer to
partitioning a dimension of this space as parallelizing that dimension; hence, the standard
data-parallel approach is sample (N) parallelism. The remaining dimensions give height and
width (together, spatial, H ×W ), channel (C), and filter (F ) parallelism. In this chapter, I
introduce approaches to partition the remaining dimensions.
3.1 NOTATION
I now introduce the notation we will use to describe the algorithms. This notation is meant
to be a simplified, high-level description of how tensors are distributed while eliding many
“lower level” details. It is inspired by the notation developed for the FLAME project [137,
138] and is also similar in idea to High Performance Fortran [139].
1This chapter incorporates my work from [50] and [136].
3.1.1 Processor grid
A layer is distributed over a processor grid $\mathcal{G}$, which is constructed as a multi-dimensional
grid over the available processors. We will define the dimensions of a processor grid as a
tuple, with each dimension name using a script notation, e.g. $(\mathcal{N}, \ldots)$. The processor grid
may vary between layers, since they can use different parallelization strategies or algorithms.
Note that the order of dimensions for the processor grid has no bearing on the ordering of
dimensions for local data, which can be laid out in whatever manner is most efficient.
3.1.2 Distributions
A distribution $D_a^{\mathcal{G}}$ for a tensor $a$ is a map from the multi-index of a processor in $\mathcal{G}$ to a set of
entries in $a$; when the processor grid is clear from context, we will omit it. While this can be
used to define arbitrary distributions, we will describe a method of constructing distributions
that is sufficient for describing the algorithms in this chapter. To do this, we will use two
functions, ◦ and ∗, to describe the distribution of a tensor dimension onto a processor grid
dimension. The key intuition behind these two functions is not the precise distribution of
data, but whether that dimension is replicated or distributed (in a load-balanced manner).
We now make these functions more precise. The function $\circ_{\mathcal{N}}^{N}(\mathrm{proj}_{\mathcal{N}}(p))$ maps from $\mathcal{G}$ to a
subset of $[|N|]$, where $N$ is the dimension of the tensor being mapped, $\mathcal{N}$ is the processor grid
dimension of $\mathcal{G}$ that $N$ will be mapped onto, $p \in \mathcal{G}$, and $[|N|]$ is the set of integer
indices in dimension $N$. To simplify construction, we always assume that $p$ is projected to
the dimension $\mathcal{N}$, i.e. that the distribution is based solely on the processor’s index in $\mathcal{N}$.
Essentially, $\circ_{\mathcal{N}}^{N}$ assigns indices of $N$ to each processor based on its index in $\mathcal{N}$.
The function $\ast_{\mathcal{N}}^{N}(\mathrm{proj}_{\mathcal{N}}(p))$ is a special case of $\circ_{\mathcal{N}}^{N}$ that replicates the indices of $N$ on every
processor: $\ast_{\mathcal{N}}^{N}(\mathrm{proj}_{\mathcal{N}}(p)) = [|N|]$. When they are clear from context, we will omit $N$ and $\mathcal{N}$
from ◦ and ∗. Given a specification of ◦ or ∗ for dimensions of a processor grid, a distribution
$D_a^{\mathcal{G}}$ is constructed as the Cartesian product of these functions. For this distribution to be
valid, each element of $a$ must be assigned to at least one processor.
Note that a tensor cannot be distributed over the same processor grid dimension multiple
times. If a grid dimension is not specified in $D_a$, the data is identical across that dimension.
We refer to such dimensions as the redundant grid dimensions. (This is equivalent to the
notion of redundant communicators in [138].) It is also possible to define distributions that
include overlap, although it is not directly represented in this notation. This can be used to
describe halo exchanges, which we will use in Section 3.2.
Throughout this chapter, we will use ◦ without defining it more precisely. In these cases, we
mean for the indices to be distributed in a blocked, load-balanced manner over the processor
grid. That is, let $r = \mathrm{proj}_{\mathcal{N}}(p)$ and $n = |\mathcal{N}|$ be the size of processor grid dimension $\mathcal{N}$. Then
$$\circ_{\mathcal{N}}^{N}(r) = \left\{ r\,\frac{|N|}{n},\ \ldots,\ (r+1)\,\frac{|N|}{n} - 1 \right\}, \tag{3.1}$$
where we have assumed that $n$ evenly divides $|N|$ for simplicity. We will also use a[· · · ] to
refer to a distribution, where · · · is some combination of ◦ and ∗, with the associated tensor
and processor grid dimensions being inferred from the order of a’s dimensions. We also may
write that a tensor a is distributed as [· · · ] to mean the same thing, typically when referring
to a similar distribution for multiple tensors.
As an example, we can define the data distribution for sample parallelism as being over
a one-dimensional processor grid of dimensions $(\mathcal{N})$. Then the input data x would be
distributed as x[◦, ∗, ∗, ∗], indicating that the sample dimension N is distributed, while the
channel and spatial dimensions are replicated.
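The sample-parallel example can be made concrete with a short sketch that computes, for one processor, the indices it holds in each tensor dimension. It covers only a one-dimensional processor grid, uses the characters `o` and `*` as stand-ins for ◦ and ∗, and assumes zero-based indices and a blocked distribution; the function names are hypothetical.

```python
def blocked(size, grid_size, rank):
    """The 'o' case: a contiguous, load-balanced block for this grid index."""
    assert size % grid_size == 0, "for simplicity, the grid must divide evenly"
    block = size // grid_size
    return range(rank * block, (rank + 1) * block)


def replicated(size, grid_size, rank):
    """The '*' case: every processor holds all indices of the dimension."""
    return range(size)


def local_index_sets(shape, spec, grid_size, rank):
    """Per-dimension index sets for a tensor on a 1-D processor grid."""
    # A tensor cannot be distributed over the same grid dimension twice.
    assert spec.count("o") <= 1
    rule = {"o": blocked, "*": replicated}
    return [rule[s](size, grid_size, rank) for size, s in zip(shape, spec)]


# x with N=8 samples, C=3 channels, H=W=4, distributed as x[o, *, *, *]
# over four processors; show what processor 1 holds.
print(local_index_sets((8, 3, 4, 4), "o***", grid_size=4, rank=1))
```

The full set of entries held by a processor is then the Cartesian product of these per-dimension index sets.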
3.1.3 Index Sets
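It is sometimes helpful to be able to refer to the set of indices assigned to a processor.
We use $I_N^{(p)}(D_a^{\mathcal{G}})$ to refer to the indices of dimension $N$ assigned
to processor $p$ under distribution $D_a^{\mathcal{G}}$; when the processor and distribution are clear from
context, we abbreviate this to $I_N$. (It is reasonable to omit the processor specification [...].)
The number of such indices is $|I_N^{(p)}(D_a^{\mathcal{G}})|$ (and we again omit $p$ and $D_a^{\mathcal{G}}$ when clear). To refer to the set of elements
assigned to $p$ across all dimensions, we write $I^{(p)}$.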
We will make use of segmented collectives, wherein disjoint sets of processors concurrently
perform the same collective operation using only the data local to the processors. A seg-
mented collective has the potential to be significantly faster than a global collective, since
it is performed among a smaller set of processors, and may involve less data from each
processor. The processors involved in such a collective correspond to the redundant grid
dimensions for the tensor the collective is performed on.
3.1.5 Redistributions
We use $a[{D'}_a^{\mathcal{G}}] \longleftarrow a[D_a^{\mathcal{G}}]$ to denote a redistribution (typically, an allgather) of $a$ from
distribution $D_a^{\mathcal{G}}$ to ${D'}_a^{\mathcal{G}}$, where both distributions are over the same processor grid. On a
per-dimension basis, there are four basic cases:
• ∗ ←− ◦ is an allgather that assembles the complete set of indices for a dimension on
every processor involved.
• ◦ ←− ∗ discards local data.
• ◦ ←− ◦ and ∗ ←− ∗ are NOPs.
We may also need to specify a redistribution ◦d′ ←− ◦d that moves data between distri-
butions over different processor grid dimensions; here we require that d be a proper subset
of d′ (or vice-versa). These are discards of local data or segmented allgathers.
One can define redistributions between distributions over different processor grids, but this
is more complicated; see Section 3.2.1. One may also consider redistributions of multiple
dimensions simultaneously; however, this is not needed in this thesis.
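The two non-trivial per-dimension cases can be illustrated with a sequential simulation over one tensor dimension, where a list of lists stands for the data held by each processor. The function names are hypothetical and a blocked distribution is assumed.

```python
def allgather_redistribute(local_blocks):
    """* <- o : every processor assembles the complete dimension."""
    full = [x for block in local_blocks for x in block]
    return [list(full) for _ in local_blocks]


def discard_redistribute(replicas):
    """o <- * : each processor keeps only its own block (no communication)."""
    p = len(replicas)
    block = len(replicas[0]) // p
    return [rep[r * block:(r + 1) * block] for r, rep in enumerate(replicas)]


blocks = [[0, 1], [2, 3], [4, 5]]          # distributed (o) over 3 processors
replicas = allgather_redistribute(blocks)  # now replicated (*)
print(replicas)
print(discard_redistribute(replicas))      # back to the blocked distribution
```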
3.1.6 Reductions
Data movement may also involve collective reductions, which we denote explicitly as
$a[D_a^{\mathcal{G}}] \longleftarrow \tilde{\sum}\, a[D_a^{\mathcal{G}}]$. This is logically a segmented allreduce over the redundant grid
dimensions. For clarity, we can add a subscript to the reduction operator, $\tilde{\sum}_A$, to explicitly
indicate the dimensions being reduced over: the reduction will be over the processors that
have different indices for the dimensions in $A$, and the same indices for the remaining
dimensions of $\mathcal{G}$.
We mention an important special case that fuses a reduction and a redistribution:
$a[{D'}_a^{\mathcal{G}}] \longleftarrow \tilde{\sum}\, a[D_a^{\mathcal{G}}]$, where the distribution change is ◦ ←− ∗. This can be implemented as a segmented
reduce-scatter, which is significantly cheaper than an allreduce.
An alternate, and perhaps more appealing, formulation of reductions can be defined en-
tirely in terms of data distributions. For simplicity, we will not use the following approach
in this thesis, but it may be useful for more general algorithms. Similarly to the approach
taken by Schatz [138], a temporary dimension can be introduced to a tensor, which may be
either distributed or replicated. The $\tilde{\sum}$ operator then sums out this temporary dimension,
with the exact operation being a function of the distributions: If the temporary distribution
is replicated, the operation is purely local; if it is distributed, it is either an allreduce or a
segmented allreduce depending on whether other dimensions are distributed.
3.1.7 Example: Sample Parallelism
Recall that sample parallelism partitions only the N dimension of input/output tensors
and that w is replicated on every processor. Our processor grid will consist of only one
dimension, $\mathcal{N}$, to distribute the sample dimension. The distributions of x and dL/dy are [◦, ∗, ∗, ∗],
as are the output distributions for y and dL/dx. As the weights are replicated, w and dL/dw have
distribution [∗, ∗], where we omit the K × K dimensions for simplicity (they are replicated
in every case). We also make use of temporary variables denoted $T_i$. Finally, we will omit
the optimization step, which is always performed locally on each processor. With this, we
can use our notation to describe sample parallelism as:
1. y[◦, ∗, ∗, ∗] ←− ConvForward(x[◦, ∗, ∗, ∗], w[∗, ∗])
2. dL/dx[◦, ∗, ∗, ∗] ←− ConvBackData(dL/dy[◦, ∗, ∗, ∗], w[∗, ∗])
3. $T_1$[∗, ∗] ←− ConvBackFilt(x[◦, ∗, ∗, ∗], dL/dy[◦, ∗, ∗, ∗])
4. dL/dw[∗, ∗] ←− $\tilde{\sum}_{\mathcal{N}}$ $T_1$[∗, ∗]
From this, we can see all the important parts of the algorithm, computation, and communication.
In (1), ConvForward implements Eq. 2.3 and in (2), ConvBackData implements
Eq. 2.4, both of which can be performed entirely locally. The computation of ConvBackFilt
in (3) begins locally and produces a temporary, and is followed by a global reduction in (4),
completing the summation of N in Eq. 2.5. This reduction is performed over all processors
that have different indices for the N dimension; since that dimension is the only one that is
distributed, and based upon the transition rules in Section 3.1.6, this is a global allreduce.
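The local computation in (3) and the global reduction in (4) can be checked numerically with a small sequential simulation. For brevity the sketch uses a one-dimensional, single-channel, unpadded cross-correlation rather than the full convolution of Eq. 2.5, so it illustrates only the structure of the algorithm; the function names and data are hypothetical.

```python
def conv_back_filt(x, dy, K):
    """Filter gradient of a 1-D, single-channel, unpadded cross-correlation:
    dw[k] = sum over samples n and positions i of dy[n][i] * x[n][i + k]."""
    return [sum(dy_n[i] * x_n[i + k]
                for x_n, dy_n in zip(x, dy)
                for i in range(len(dy_n)))
            for k in range(K)]


def sample_parallel_back_filt(x, dy, K, p):
    """Simulate steps (3) and (4) on p processors; returns dL/dw per processor."""
    assert len(x) % p == 0, "for simplicity, p must evenly divide N"
    block = len(x) // p
    # Step (3): each processor computes a temporary from its local samples.
    temps = [conv_back_filt(x[r * block:(r + 1) * block],
                            dy[r * block:(r + 1) * block], K)
             for r in range(p)]
    # Step (4): an allreduce sums the temporaries onto every processor.
    total = [sum(col) for col in zip(*temps)]
    return [list(total) for _ in range(p)]


x = [[n + i for i in range(5)] for n in range(4)]               # N=4, width 5
dy = [[(n + 1) * (i + 1) for i in range(3)] for n in range(4)]  # K=3 -> width 3
print(conv_back_filt(x, dy, 3))
print(sample_parallel_back_filt(x, dy, 3, p=2))
```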
3.2 SPATIAL PARALLELISM
We now present spatial parallelism, which partitions the spatial domains of input, acti-
vation, and error signal tensors. We begin by describing spatial parallelism, together with
sample parallelism, and then discuss our implementation. Our algorithms exactly replicate
convolution as if it were performed on a single GPU (up to floating point accumulation
issues).
3.2.1 Algorithm
Spatial parallelism is more complicated than sample parallelism. The spatial dimensions H
and W are partitioned across processors. Much of the convolution
can be performed locally, but when a filter of size greater than 1 × 1 is placed near the
border of a partition, remote data will be needed to compute the convolution. Thus a small
number of rows and/or columns will need to be transferred from the remote processors
in a halo exchange (as in a stencil computation). Backpropagation is similar, requiring a
halo exchange on dL/dy to compute dL/dx, and using the data from forward propagation
to compute the local contributions of dL/dw. Finally, an allreduce completes the sum in
dL/dw, as in sample parallelism. It should be observed that sample and spatial parallelism
are orthogonal and can be used simultaneously. We refer to this as hybrid sample/spatial
parallelism. Figures 3.1 and 3.2 graphically illustrate this halo exchange.
Figure 3.1: Remote data dependencies for a 3 × 3 convolutional filter centered on the solid
red box when data is partitioned spatially over four processors.
Figure 3.2: Data transferred to processor P0 as part of a halo exchange for a 3 × 3 convolutional
filter when data is partitioned spatially over four processors.
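The role of the halo exchange can be demonstrated with a one-dimensional simplification: a zero-padded cross-correlation is computed once on the whole signal and once on a spatially partitioned signal, where each simulated processor first obtains a halo of width ⌊K/2⌋ from its neighbors. This sketch assumes an odd filter length, stride 1, and partitions at least as wide as the halo; it is a sequential illustration, not the implementation described in this chapter.

```python
def conv1d_same(x, w):
    """Zero-padded 1-D cross-correlation with an odd-length filter."""
    O = len(w) // 2
    padded = [0] * O + list(x) + [0] * O
    return [sum(w[k] * padded[i + k] for k in range(len(w)))
            for i in range(len(x))]


def conv1d_spatial_parallel(x, w, p):
    """Same result, computed block by block after a simulated halo exchange."""
    O = len(w) // 2
    assert len(x) % p == 0, "for simplicity, p must evenly divide the width"
    block = len(x) // p
    assert block >= O, "halo must come from immediate neighbors only"
    out = []
    for r in range(p):
        lo, hi = r * block, (r + 1) * block
        # Halo exchange: O entries from each neighbor; zero padding at the
        # global boundary, where there is no neighbor.
        left = x[lo - O:lo] if r > 0 else [0] * O
        right = x[hi:hi + O] if r < p - 1 else [0] * O
        local = left + x[lo:hi] + right
        out += [sum(w[k] * local[i + k] for k in range(len(w)))
                for i in range(block)]
    return out


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
w = [1, -2, 3]
print(conv1d_same(x, w))
print(conv1d_spatial_parallel(x, w, p=4))
```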
We now present our algorithm more formally, combining both sample and spatial
parallelism. We keep the same simplifying assumptions as in Section 2.2.1. Data will be
distributed as [◦, ∗, ◦, ◦], i.e. with the sample, height, and width dimensions distributed.
The weights of the convolutional layer will be replicated, with w and dL/dw having distribution
[∗, ∗]. (Note that this notation omits explicit representation of the necessary halo exchange,
the details of which we will make more explicit soon.) Under these distributions, spatial
parallelism is similar to pure sample parallelism:
1. y[◦, ∗, ◦, ◦] ←− ConvForward(x[◦, ∗, ◦, ◦], w[∗, ∗])
2. dL/dx[◦, ∗, ◦, ◦] ←− ConvBackData(dL/dy[◦, ∗, ◦, ◦], w[∗, ∗])
3. $T_1$[∗, ∗] ←− ConvBackFilt(x[◦, ∗, ◦, ◦], dL/dy[◦, ∗, ◦, ◦])
4. dL/dw[∗, ∗] ←− $\tilde{\sum}$ $T_1$[∗, ∗]
Note, importantly, that in (4), the allreduce remains global, and is over all dimensions of
the processor grid. This is because w is replicated over all processors, and we need to complete
the sums over N , H, and W in Eq. 2.5.
Halo Exchange
A key limitation of this notation is in expressing halo exchanges, so we now make them
more explicit. Two halo exchanges are required, one on x to compute ConvForward in
(1) and one on dL/dy to compute ConvBackData in (2). This halo exchange is required any
time the convolution requires data not present on the local processor. The size of this halo
on processor p, ignoring padding issues, will be O rows or columns of length $I_W^{(p)}$ or $I_H^{(p)}$,
respectively.
Figure 3.3: Remote data dependencies for a 3 × 3 convolutional filter with dilation factor 2,
when data is partitioned spatially over four processors.
To make this more precise, let $q_H^{(p)} = \min I_H^{(p)}$ be the lowest index in dimension H assigned
to p and $r_H^{(p)} = \max I_H^{(p)}$ be the highest, and define $q_W^{(p)}$ and $r_W^{(p)}$ similarly. Then the non-local data
required by p is given by the indices
$$\left( \{q_H^{(p)} - O, \ldots, q_H^{(p)} - 1,\ r_H^{(p)} + 1, \ldots, r_H^{(p)} + O\} \cup I_H^{(p)},\ \{q_W^{(p)} - O, \ldots, q_W^{(p)} - 1,\ r_W^{(p)} + 1, \ldots, r_W^{(p)} + O\} \cup I_W^{(p)} \right) \setminus I^{(p)}. \tag{3.2}$$
It is important to observe that, while a halo exchange is required to compute dL/dx, we do
not need a halo exchange to compute dL/dw. Since these two operations can be computed in any
order for a given layer, the entire computation of dL/dx, including the necessary halo exchange,
can be performed concurrently with the computation of dL/dw.
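The halo region of Eq. 3.2 can be enumerated directly for a contiguous local block, as in the sketch below. It considers only the two spatial dimensions, ignores padding at the global boundary (so indices may fall outside the tensor), and uses hypothetical names.

```python
def halo_indices(h_range, w_range, O):
    """(h, w) indices in the O-wide border around a local H x W block."""
    q_h, r_h = min(h_range), max(h_range)
    q_w, r_w = min(w_range), max(w_range)
    # Local index ranges extended by O on each side.
    ext_h = range(q_h - O, r_h + O + 1)
    ext_w = range(q_w - O, r_w + O + 1)
    local = {(h, w) for h in h_range for w in w_range}
    extended = {(h, w) for h in ext_h for w in ext_w}
    # Everything in the extended region that is not already local.
    return extended - local


# A processor holding rows 4..7 and columns 4..7, with a 3 x 3 filter (O = 1).
halo = halo_indices(range(4, 8), range(4, 8), O=1)
print(len(halo), sorted(halo)[:3])
```

Note that the set includes the corner entries, which belong to diagonally adjacent partitions.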
There are several other important cases for the halo exchanges in spatial parallelism that
we discuss next. First, in the case that the stride S is greater than 1, the overall process is
similar, but the particulars of the halo region will need to be adjusted. This may result in
some processors not requiring a halo exchange on certain borders. Second, in the case that
[...] breaks down. However, as spatial partitioning is most efficient when the halo is small relative
to the size of the partition, to mitigate communication overheads, alternative parallelization
strategies should be preferred in such a case. Lastly, dilated convolution requires that the
halo region be expanded by a factor equivalent to the dilation factor, in order to account for
the larger kernel size. This is illustrated in Figure 3.3.
Figure 3.4: Remote data dependencies for a reverse halo exchange for a 3 × 3 max pooling
layer when data is partitioned spatially over four processors.
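The effect of dilation on the halo can be written down in a few lines. The sketch assumes an odd K × K filter and stride 1, for which the halo is ⌊K/2⌋ rows or columns scaled by the dilation factor; the function names are hypothetical.

```python
def effective_filter_size(K: int, dilation: int = 1) -> int:
    """Extent covered by a K-tap filter with holes inserted between taps."""
    return dilation * (K - 1) + 1


def halo_width(K: int, dilation: int = 1) -> int:
    """Rows/columns needed from each neighbor (odd K, stride 1)."""
    return dilation * (K // 2)


for d in (1, 2, 4):
    print(d, effective_filter_size(3, d), halo_width(3, d))
```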
Entire CNNs
The extension to an entire CNN is relatively straightforward. Input should be provided in
the appropriate distribution for the first layer. Each convolutional layer can be parallelized
as above. Element-wise operations such as ReLUs parallelize trivially regardless of distri-
bution. Batch normalization is typically computed locally on each processor; however, to
our knowledge, performing batch normalization on subsets of the spatial dimensions has not
been explored. Both purely local batch normalization and a variant that aggregates over the
spatial distribution of a sample are easy to implement.
Pooling layers require halo exchanges in forward and backpropagation, similarly to con-
volution (but have no parameters, and hence no backward-weights step). However, the
halo exchange in backpropagation is a “reverse” halo exchange, where values are pushed to
neighbors instead of received from neighbors. This is illustrated in Figure 3.4.
Other layers, such as fully-connected layers, do not require modifications, but may require
a data redistribution.
Data Redistributions
It may occur that two layers are adjacent, but are distributed differently, e.g. over different
processor grids. For example, $D_y^{G} \neq D_x^{G'}$, which would imply the two distributions assign
indices to processors differently. This may be because it is more profitable to parallelize the
two layers with different partitioning strategies, or because one is a fully-connected layer,
which uses an elemental distribution [135] in LBANN. When this occurs, data must be
shuffled between the two distributions on both forward and backpropagation. This shuffle
can, most generally, be implemented via an all-to-all collective, where a processor sends the



















We have implemented spatial parallelism by extending the LBANN toolkit. While it can
efficiently parallelize training using highly-optimized libraries, LBANN is limited to sample
parallelism for convolution.
To do this, we first develop a small C++ library for distributed tensor data structures
that provides high-level abstractions for common tensor primitives used in CNN training.
The design of the tensor library is strongly influenced by the design of high-level frameworks
for stencil computations, e.g. [140]. It presents a partitioned global view of multidimensional
tensors decomposed over distributed memory. For supporting convolutions, a halo exchange
among adjacent distributed sub-tensors is implemented as part of the library API. It uses
Aluminum for inter-node communication, and CUDA inter-process communication on-node
(see Chapter 4). Most of these operations do not require the programmer to explicitly man-
age the data distribution. The library automates the underlying bookkeeping of distributed
data structures as much as possible so long as this does not negatively impact performance.
Fortunately, similar to regular stencil computations in scientific simulations, CNN computa-
tions tend to be rather regular, and thus realizing automation without performance penalties
is possible (e.g. [140] for stencils).
We implement a basic set of layers used in typical CNNs, including convolutions, pooling,
batch normalization, and ReLU, on top of the tensor library. We focus on systems with
Nvidia GPUs as the main computing platform for training and use Nvidia’s cuDNN library
for optimized compute kernels. However, as cuDNN is not aware of the distribution of
tensors, the library performs halo exchanges before convolutions and pooling. We replace
LBANN’s tensor representation with ours with necessary data shuffling so that the overall
training pipeline can be used as is.
Optimization for Strong Scaling
While spatial parallelization allows us to use a larger set of parallel resources, careful
optimization of data movement becomes more important, especially for strong scaling. In
particular, unless it is well optimized, the overhead of halo exchange communication can be
prohibitively expensive, particularly with small spatial domains.
One of the well-known optimization techniques for halo exchanges is overlapping commu-
nication with independent computations. Our implementation automatically decomposes
an input tensor into its interior domain and boundary domains and calls cuDNN convo-
lution kernels for each region separately so that halo exchanges can be run concurrently
with the convolution of the interior domain. For backward convolutions, we exploit the
task-level parallelism of backward-data and -weights phases to overlap the halo exchange for
the backward-data step with the backward-weights computation. Note that the backward-
weights step does not require halo exchanges (see Eq. 2.5).
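The interior/boundary decomposition can be sketched as pure index bookkeeping. The following is a minimal illustration, assuming a stride of 1 and a halo of width O on every side; the function name `split_interior_boundary` is hypothetical, and the real implementation operates on GPU tensors and cuDNN calls rather than Python ranges. Output positions in the interior depend only on local data, so their convolution can start while the halo exchange is in flight; the boundary strips must wait for it.

```python
def split_interior_boundary(n_h, n_w, o):
    """Split a local n_h x n_w output domain into an interior rectangle and
    boundary strips of width o. Each region is a (row_range, col_range) pair."""
    if n_h <= 2 * o or n_w <= 2 * o:
        # The domain is too small to have an interior: everything needs the halo.
        return None, [(range(0, n_h), range(0, n_w))]
    interior = (range(o, n_h - o), range(o, n_w - o))
    boundaries = [
        (range(0, o), range(0, n_w)),              # north strip, including corners
        (range(n_h - o, n_h), range(0, n_w)),      # south strip, including corners
        (range(o, n_h - o), range(0, o)),          # west strip
        (range(o, n_h - o), range(n_w - o, n_w)),  # east strip
    ]
    return interior, boundaries


interior, boundaries = split_interior_boundary(8, 6, 1)
print(len(interior[0]) * len(interior[1]), [len(r) * len(c) for r, c in boundaries])
```

The small-domain branch reflects the observation above that halo overheads dominate when spatial domains are small: with no interior, there is nothing to overlap the exchange with.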
3.3 CHANNEL AND FILTER PARALLELISM
We now turn to channel and filter parallelism, which partitions the channel and/or filter
domains of every tensor in a convolutional layer. As with spatial parallelism, these algorithms
exactly replicate convolution as if it were performed on a single GPU. This consists of a family




The stationary-x algorithm is the first of our algorithms for exploiting channel and filter













































Figure 3.5: Illustration of the stationary-x algorithm’s forward propagation step for one
layer and two samples, using two-way channel/filter partitioning on four processors.
algorithms [137] which avoid communicating a particular matrix2. We first present the
algorithm in our notation, and then discuss our reasoning behind it and some implications.
We will assume that input data is distributed channel-wise in addition to using sample and
spatial parallelism: x[◦, ◦, ◦, ◦] over a processor grid (N ,H,W , C). This choice of dimension
ordering is not needed for correctness, but does result in good performance, as bandwidth-
intensive communication will tend to use faster links (e.g. NVLink2) due to channels being
partitioned over a small number of close GPUs. Note that the order of processor grid
dimensions need not match the tensor’s, and that we will write tensors in the ordering
given in Section 2.2.1 (e.g. N × C × H ×W ). Similarly, we will assume that the desired
distribution of y is [◦, ◦C, ◦, ◦], where the F dimension of y is distributed according to the C
dimension of the processor grid; this ensures that the inputs and outputs have matching data




are distributed similarly. Unlike sample or spatial
parallelism, we distribute the weights by channels instead of replicating them: w[∗, ◦]. Thus,
w is distributed such that each rank has, for every filter, only the channel parameters for its
local input channels. Note, when combining this with additional parallelism (e.g. sample),
these may still be replicated, but not over every processor. With this, the algorithm is as
follows:
2It may be possible to define other algorithms than these if more general communication operations are
permitted. For example, one may be able to develop an algorithm that is similar to Cannon’s algorithm for





































Figure 3.6: Illustration of the stationary-x algorithm’s backpropagation step for one layer
and two samples, using two-way channel/filter partitioning on four processors.
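1. $T_1[\circ, \ast, \circ, \circ] \leftarrow \mathrm{ConvForward}(x[\circ, \circ, \circ, \circ],\, w[\ast, \circ])$
2. $y[\circ, \circ, \circ, \circ] \leftarrow \widetilde{\sum}_{C}\, T_1[\circ, \ast, \circ, \circ]$
3. $\frac{dL}{dy}[\circ, \ast, \circ, \circ] \leftarrow \frac{dL}{dy}[\circ, \circ, \circ, \circ]$
4. $\frac{dL}{dx}[\circ, \circ, \circ, \circ] \leftarrow \mathrm{ConvBackData}(\frac{dL}{dy}[\circ, \ast, \circ, \circ],\, w[\ast, \circ])$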






As a high-level summary, this algorithm first does forward propagation locally (1), then
performs a segmented reduce-scatter (2) among each set of processors that has different
channels for the same sample and spatial region, in order to complete the sum over channels
in Eq. 2.3 and then produce the correct distribution for y. The scatter is needed because
each processor locally produces output for every filter. Figure 3.5 illustrates this forward
propagation phase. Backpropagation begins with a segmented allgather (3) to assemble the
filters of $\frac{dL}{dy}$. The backward-data step (Eq. 2.4) can then be computed completely locally (4).
The backward-weights step (Eq. 2.5) can be partially computed locally (5), but completing
the summations still requires aggregating over all sample and spatial distributions (6). (See
Figure 3.6 for an illustration of this.) A key difference, and advantage, of this algorithm is
that this allreduce is no longer global: it is a segmented allreduce over only the local data,
among processors that have the same channel data. Thus, we are able to reduce both the
amount of data in any single allreduce, and the number of processors participating in each
allreduce.
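The forward steps (1) and (2) of the stationary-x algorithm can be illustrated with a small single-process simulation. This is only a sketch under simplifying assumptions: a 1 × 1 convolution (so that only the sum over channels matters), one sample, and channel and filter counts divisible by the number of processors. The names `conv1x1`, `chunk`, and `stationary_x_forward` are hypothetical.

```python
def conv1x1(x, w):
    """x: [channels][pixels], w: [filters][channels] -> y: [filters][pixels]."""
    return [[sum(w_f[c] * x[c][p] for c in range(len(x)))
             for p in range(len(x[0]))] for w_f in w]


def chunk(seq, parts, i):
    """The i-th of `parts` equal contiguous blocks of seq."""
    n = len(seq) // parts
    return seq[i * n:(i + 1) * n]


def stationary_x_forward(x, w, parts):
    # Step (1): each processor convolves its local channels of x with the
    # matching channel slice of every filter, giving a partial sum for all filters.
    partials = []
    for p in range(parts):
        x_loc = chunk(x, parts, p)
        w_loc = [chunk(w_f, parts, p) for w_f in w]
        partials.append(conv1x1(x_loc, w_loc))
    # Step (2): reduce-scatter. Partial sums are added across processors and
    # processor p keeps only its own block of filters.
    num_filters, num_pixels = len(w), len(x[0])
    y_local = []
    for p in range(parts):
        my_filters = chunk(list(range(num_filters)), parts, p)
        y_local.append([[sum(t[f][i] for t in partials) for i in range(num_pixels)]
                        for f in my_filters])
    return y_local


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]   # 4 channels, 3 pixels
w = [[1, 0, 2, 1], [0, 1, 1, 3]]                   # 2 filters, 4 channels
y_parts = stationary_x_forward(x, w, parts=2)
print([row for part in y_parts for row in part] == conv1x1(x, w))
```

Concatenating the per-processor outputs recovers the single-device result, which is the exactness property claimed for these algorithms.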
This algorithm trades additional communication overhead during forward and backprop-
agation (from the reduce-scatter and allgather) for additional parallelism among channels.
Convolutions among different channels can be performed concurrently in forward propagation
and for backward-data, reducing total compute time. In backpropagation, the allgather
of $\frac{dL}{dy}$ before computing ConvBackData is necessary given our distribution of w. Other
distributions would require either additional storage or communication. The algorithm also
significantly reduces memory usage, as the complete channels of each tensor need only be stored during
the forward or backpropagation pass through their particular layer.
We also partition w by channels, as opposed to fully replicating it as is typical, for two
reasons. First, as the input data is partitioned by channels, only the portions of the filters
corresponding to those channels would be used during forward and backpropagation on
each processor. Second, if weights were replicated on every processor, the backward-filter
stage would require a global allreduce to update the parameters. By partitioning w, we
are able to instead use a segmented allreduce to reduce communication overheads, since
updates need only be communicated among processors with the same parameters. Thus,
for example, if a combination of sample and channel parallelism is used and channels are
partitioned among two processors, the segmented allreduce results in two disjoint allreduces
being performed, each among half the processors and over half the data. We also considered
alternate communication patterns, but these result in excess communication or buffer space.
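To make the two-processor example concrete, the sketch below lists which ranks take part in each segmented allreduce when sample and channel parallelism are combined. The rank layout (`rank = sample_index * n_channel + channel_index`) and the function name are assumptions made for illustration; the actual mapping depends on how the processor grid is constructed.

```python
def segmented_allreduce_groups(n_sample, n_channel):
    """Group ranks that hold the same channel slice of w.

    Assumes rank = sample_index * n_channel + channel_index (illustrative only).
    Each returned group performs its own, independent allreduce.
    """
    groups = {}
    for s in range(n_sample):
        for c in range(n_channel):
            groups.setdefault(c, []).append(s * n_channel + c)
    return [groups[c] for c in range(n_channel)]


# Four processors, channels partitioned two ways: two disjoint allreduces,
# each among half the processors.
print(segmented_allreduce_groups(2, 2))
```

Each group also reduces only its own slice of the weight gradients, which is where the reduction in per-allreduce data volume comes from.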
Stationary-y
The stationary-y algorithm is essentially symmetric to the stationary-x algorithm: it
avoids communicating y and $\frac{dL}{dy}$ instead of x and $\frac{dL}{dx}$ (compare lines (2) and (3) of stationary-





to the stationary-x algorithm. w is distributed as [◦C, ∗], so each processor has only its local
filters. The algorithm is then:
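1. $x[\circ, \ast, \circ, \circ] \leftarrow x[\circ, \circ, \circ, \circ]$
2. $y[\circ, \circ, \circ, \circ] \leftarrow \mathrm{ConvForward}(x[\circ, \ast, \circ, \circ],\, w[\circ, \ast])$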




4. $\frac{dL}{dx}[\circ, \circ, \circ, \circ] \leftarrow \widetilde{\sum}_{F}\, T_2[\circ, \ast, \circ, \circ]$






For this, the communication patterns have essentially been swapped between forward and
backpropagation compared to the stationary-x algorithm. An allgather is performed at the
beginning of forward propagation (1), followed by local convolution (2). The backward-
data step can be performed locally (3), and is then followed by a reduce-scatter (4). The
backward-filter computation (5, 6) is similar to the stationary-x algorithm, except now the
segmented allreduce aggregates gradient updates among processors with the same filters. A
similar discussion of communication choices also applies.
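The forward steps (1) and (2) of the stationary-y algorithm admit a similarly small simulation. As before, this is a sketch that assumes a 1 × 1 convolution and a single sample; `conv1x1` and `stationary_y_forward` are hypothetical names.

```python
def conv1x1(x, w):
    """x: [channels][pixels], w: [filters][channels] -> y: [filters][pixels]."""
    return [[sum(w_f[c] * x[c][p] for c in range(len(x)))
             for p in range(len(x[0]))] for w_f in w]


def stationary_y_forward(x_parts, w_parts):
    """x_parts[p]: channels held by processor p; w_parts[p]: its local filters,
    each spanning all input channels."""
    # Step (1): allgather, so every processor assembles all channels of x.
    x_full = [row for part in x_parts for row in part]
    # Step (2): purely local convolution; processor p produces only its own
    # filters of y, so no communication of y is needed.
    return [conv1x1(x_full, w_loc) for w_loc in w_parts]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]   # 4 channels, 3 pixels
w = [[1, 0, 2, 1], [0, 1, 1, 3]]                   # 2 filters, 4 channels
y_parts = stationary_y_forward([x[:2], x[2:]], [w[:1], w[1:]])
print([row for part in y_parts for row in part] == conv1x1(x, w))
```

Compared with the stationary-x sketch, the collective has moved from after the convolution (a reduce-scatter on y) to before it (an allgather on x).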
Stationary-w
We now present the stationary-w algorithm, a name which we choose for consistency
despite it being a bit of a misnomer, as a reduction is still performed over the $\frac{dL}{dw}$ tensor.
This is also the most complex of the algorithms we present, and in fact both the stationary-x
and -y algorithms are special cases of this one. At a high level, this algorithm distributes both
the C and F dimensions of w, which will involve communication during both forward and
backpropagation, while further segmenting the allreduce required to aggregate parameter
updates.
The processor grid for this algorithm has dimensions (N ,H,W , C,F). Input tensors are
distributed according to [◦, ◦C×F , ◦, ◦], where we have distributed the channel (or filter, for
y) dimension of the tensor on the linearization of the C and F dimensions of the processor
grid. The weights are distributed according to w[◦, ◦], and so are partitioned over both the
F and C dimensions of G. The stationary-w algorithm is now:
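1. $x[\circ, \circ_{C}, \circ, \circ] \leftarrow x[\circ, \circ_{C \times F}, \circ, \circ]$
2. $T_1[\circ, \circ_{F}, \circ, \circ] \leftarrow \mathrm{ConvForward}(x[\circ, \circ_{C}, \circ, \circ],\, w[\circ, \circ])$
3. $y[\circ, \circ_{C \times F}, \circ, \circ] \leftarrow \widetilde{\sum}_{C}\, T_1[\circ, \circ_{F}, \circ, \circ]$
4. $\frac{dL}{dy}[\circ, \circ_{F}, \circ, \circ] \leftarrow \frac{dL}{dy}[\circ, \circ_{C \times F}, \circ, \circ]$
5. $T_2[\circ, \circ_{C}, \circ, \circ] \leftarrow \mathrm{ConvBackData}(\frac{dL}{dy}[\circ, \circ_{F}, \circ, \circ],\, w[\circ, \circ])$
6. $\frac{dL}{dx}[\circ, \circ_{C \times F}, \circ, \circ] \leftarrow \widetilde{\sum}_{F}\, T_2[\circ, \circ_{C}, \circ, \circ]$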






In this algorithm, the channels of x are first allgathered such that they match the channel
distribution of w (1). Note that this does not fully assemble the channels on each processor,
but is required since x is distributed over C×F (i.e. over more processors than the C dimension
of w). Local forward propagation is then performed (2), followed by a reduce-scatter (3)
to complete the operation and produce the correct output distribution. Backpropagation
proceeds similarly. The allreduce (8) is now more extensively segmented, as both the channel
and filter dimensions of w are partitioned.3
The stationary-w algorithm generalizes both other algorithms, as we can recover them by
making the F or C processor grid dimension be size 1, respectively. When this is done, some
of the collective operations (e.g., steps (1) and (6) for stationary-x) can be elided, as they
are performed over a single processor.
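The claim that stationary-w generalizes the other two algorithms can be checked with a small simulation of its forward steps. This sketch again assumes a 1 × 1 convolution and one sample, starts from the state after step (1) (channels of x already distributed over the C dimension only), and omits the final scatter and the transposition noted in the footnote. All function names are hypothetical.

```python
def conv1x1(x, w):
    """x: [channels][pixels], w: [filters][channels] -> y: [filters][pixels]."""
    return [[sum(w_f[c] * x[c][p] for c in range(len(x)))
             for p in range(len(x[0]))] for w_f in w]


def blocks(seq, parts):
    """Split seq into `parts` equal contiguous blocks."""
    n = len(seq) // parts
    return [seq[i * n:(i + 1) * n] for i in range(parts)]


def stationary_w_forward(x, w, pc, pf):
    """Simulate a pc x pf (channel x filter) processor grid."""
    x_by_c = blocks(x, pc)    # state after step (1)
    w_by_f = blocks(w, pf)    # filter blocks of w
    y = []
    for f in range(pf):
        # Step (2): processor (c, f) convolves channel block c with the
        # (filter block f, channel block c) tile of w.
        partials = [conv1x1(x_by_c[c], [blocks(w_row, pc)[c] for w_row in w_by_f[f]])
                    for c in range(pc)]
        # Step (3): reduction over the C dimension of the grid (scatter omitted).
        y.extend([[sum(t[i][p] for t in partials) for p in range(len(x[0]))]
                  for i in range(len(w_by_f[f]))])
    return y


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 1, 1, 1]]
for pc, pf in [(2, 2), (2, 1), (1, 2)]:
    print((pc, pf), stationary_w_forward(x, w, pc, pf) == conv1x1(x, w))
```

With `pf = 1` the computation has the shape of stationary-x (partial sums over channel blocks), and with `pc = 1` the reduction is over a single processor and disappears, as in stationary-y.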
Blocking
As a future optimization to reduce the communication overhead of reduce-scatters and
allgathers during forward and backpropagation, these algorithms are amenable to blocking
and pipelining, similarly to stationary matrix product algorithms. For example, with the
stationary-x algorithm, blocks of channels in forward propagation can be used to overlap
ConvForward and the reduce-scatters; a similar approach is possible with ConvBackData and
the allgather. Note, however, that the allgather needs to complete for every block before
ConvBackFilt can be performed. Similar considerations apply for the other algorithms.
Layers Other than Convolution
Partitioning data along the channel/filter dimension has implications for other commonly-
used layers in CNNs. Most other layers naturally adapt to this partitioning and require no
additional communication while benefiting from reduced local computation. Many element-
wise layers, such as ReLU activations or dropout, are trivially adapted. Similarly, pooling
operations are applied channel-wise and thus require no further adaptation. Channel partitionings
do not make sense for fully-connected layers, as they operate on a linearized input;
depending on their implementation, data movement may be required to shift activations into
an appropriate distribution for processing.
3Note that due to the distribution of data over the linearization of the F and C dimensions, this algorithm
produces output that has been transposed from its input. This is not an issue in general, since subsequent
layers can distribute w in a transposed manner to operate on the data with no excess communication.
Batch normalization is also applied channel-wise, so much of its computation requires no
significant adjustment. It does have learnable parameters, the gradient updates for which
need to be aggregated via allreduce in the same manner as convolutional layers. As these
parameters are distributed channel-wise, this becomes a segmented allreduce as well. Thus,
a channel partition can help reduce communication overhead for batch normalization layers
in addition to convolutional layers.
3.3.2 Implementation
We have implemented these algorithms by extending the LBANN toolkit and its dis-
tributed tensor library (see Section 3.2.2) to support channel and filter partitioning for the
stationary-x algorithm. We have not yet evaluated the stationary-y or -w algorithms due to
compute time limitations.
We implemented the appropriate tensor distributions for the stationary-x algorithm and
added support to the convolutional layers for such distributions. Local convolution is per-
formed with Nvidia’s cuDNN library [123]. Each convolutional layer allocates temporary
memory using a shared CUB memory pool [142] that releases memory after the layer’s for-
ward or backpropagation work is done, in order to minimize memory overheads. NCCL [73]
is used to implement the reduce-scatter and allgather communication, for both intra- and
inter-node communication. Since data is locally stored in an NCHW format, this communication
is non-contiguous when more than one sample is present locally. To mitigate this,
we implement custom CUDA kernels for packing (for reduce-scatter) and unpacking (for
allgather).
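The non-contiguity can be seen from the index arithmetic alone. The sketch below shows one plausible packing order for a reduce-scatter over the filter dimension, assuming a flat buffer in N × F × (H·W) order and a filter count divisible by the number of processors. The function name is hypothetical, and the actual implementation uses custom CUDA kernels whose layout may differ.

```python
def pack_for_reduce_scatter(buf, n, f, hw, parts):
    """Reorder a flat N x F x (H*W) buffer so that the data destined for each
    processor (its block of filters, for every local sample) is contiguous."""
    fb = f // parts                       # filters per destination processor
    packed = []
    for p in range(parts):                # destination processor
        for s in range(n):                # local sample
            start = (s * f + p * fb) * hw
            packed.extend(buf[start:start + fb * hw])
    return packed


# Two samples, four filters, one pixel each, two destination processors.
buf = list(range(8))   # sample 0 holds 0..3, sample 1 holds 4..7
print(pack_for_reduce_scatter(buf, n=2, f=4, hw=1, parts=2))
```

With a single local sample (`n = 1`) the packed buffer equals the input, which matches the observation that packing is only needed when more than one sample is present locally.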
LBANN’s optimization framework was also modified to support segmented allreduces,
again using NCCL. To do this, we simply split the associated communicator appropriately.
During testing, we observed that segmented tree allreduces using NCCL had very high
performance variability, and that a similar effect was present with our system MPI. We
hypothesize that this is due to on-node interference when different processes on the same
node perform simultaneous communication operations on different communicators (see Sec-
tion 3.5.4). We therefore use only ring allreduces, which did not exhibit such variability.
As many of the allreduces are large (≥ 100 KiB), we do not expect this to fundamentally
impact our results. We observed that concurrent NCCL allreduces and allgathers during
backpropagation overlap well, so the stationary-x backpropagation does not interfere with
hiding allreduce communication.
We validated our implementation by extensively testing the channel-parallel implementa-
tion and comparing its results to convolution performed on a single GPU. Adapting other
layers (e.g., pooling, batch normalization) was straightforward once the tensor library sup-
ported channel-parallel distributions.
3.4 PARALLEL EXECUTION STRATEGIES
With the introduction of many additional knobs for partitioning and parallelizing CNN training,
selecting the appropriate partitioning schemes for each layer becomes a challenging
process. Instead of simply selecting an appropriate number of processors to partition a
mini-batch over, a user must decide how to partition each dimension of each layer to get a
good speedup while satisfying memory constraints. This is further complicated by the fact
that different layers may best be partitioned in different ways, which in turn requires data
shuffling in both forward and backpropagation, the costs of which can influence the optimal
partitioning strategy for the entire network. We refer to such an assignment of partitions to
each layer as a parallel execution strategy.
To help address this, we introduce a performance model to help predict the performance of
CNN configurations, and briefly sketch an optimization approach for selecting good parallel
execution strategies. This helps to both validate the performance of the existing algorithms
and lay a foundation for tuning performance. This section uses the notation and approach
discussed in Section 2.4 and the collective communication performance models in Section 2.5.
3.4.1 Performance Model
We first describe the details of our performance model for a single convolutional layer.
We focus on convolution since, for CNNs, it is by far the dominant operation in terms of
both compute and communication costs. We present a single model that applies to any
combination of sample, spatial, and channel and filter parallelism.
As a minor additional bit of notation, we shall write px = |F| if the stationary-y algorithm
is being used, px = |C| if the stationary-w algorithm is being used, and px = 0 otherwise.
Likewise, py = |C| if the stationary-x algorithm is being used, py = |F| if the stationary-w
algorithm is being used, and py = 0 otherwise. These are used to denote the number of
processors used for reduce-scatters and allgathers in the different algorithms. We will also
use H̃ and W̃ to denote the spatial dimensions after convolution.
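The case analysis for p_x and p_y can be written as a small lookup. This is only a restatement of the definitions above; the function name and the algorithm labels used as keys are hypothetical.

```python
def collective_group_sizes(algorithm, c_size, f_size):
    """Return (p_x, p_y) for a layer, given |C| and |F| of the processor grid.

    p_x and p_y are the numbers of processors taking part in the
    reduce-scatters and allgathers; 0 means no such collective is used.
    """
    table = {
        "stationary-x": (0, c_size),
        "stationary-y": (f_size, 0),
        "stationary-w": (c_size, f_size),
    }
    # Pure sample or spatial parallelism uses neither collective.
    return table.get(algorithm, (0, 0))


print(collective_group_sizes("stationary-w", c_size=2, f_size=4))
```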
Using our notation, the time for forward convolution is given by





$+\; \mathbb{1}_{|C|>1}\, \mathrm{RS}(I_N I_F I_{\tilde{H}} I_{\tilde{W}},\, p_y). \quad (3.3)$
In this, the indicator functions are used to determine whether a particular operation is
actually being performed. The allgather is used in the stationary-y and stationary-w algo-
rithms. The send/recvs are for halo exchanges in spatial parallelism, being the north/south,
east/west, and corner halo exchanges, respectively.4 Then forward convolution is included,
where we have elided the minor cost of convolution for boundary regions. Lastly, the reduce-
scatter is for the stationary-x and -w algorithms.
Similarly, the time for backward-data is given by










$+\; \mathbb{1}_{|F|>1}\, \mathrm{RS}(I_N I_C I_H I_W,\, p_x). \quad (3.4)$
Note the similarity to forward propagation; similar discussions apply.







$+\; \mathrm{AR}(I_F I_C K^2,\, |N||H||W|). \quad (3.5)$
4Note one can implement the halo exchange slightly differently to avoid the explicit corner halo exchanges.
Instead, a slightly wider north/south exchange can be done, followed by the east/west exchange; or vice-
versa. We omit this case for simplicity. Similarly, we omit the cases for processors on image boundaries,
which will do fewer halo exchanges. Since we consider only synchronous algorithms in this work, runtime
will be driven by the longest-running processor.
This consists solely of the backward-weights convolution and the allreduce to aggregate
parameter updates. Note that certain operations (e.g. allgathers for the stationary-x and
-w algorithms) may be necessary for this computation to proceed, but for simplicity we have
included them only in the performance model for backward-data.
We will write CostD(L) = FP + BP + BPw to refer to the total runtime cost of layer
L with distribution D (where D refers to the distributions for all tensors associated with
L). We will also allow Cost to adjust for overlap of communication and computation as
appropriate; note that this is non-trivial, as the allreduce can be overlapped with the computation of
subsequent layers in backpropagation, the runtime of which depends on the particulars of
their parallel execution strategies.
From these performance models, we can draw some conclusions about the relative costs
of the different approaches for parallelizing convolution. Considering only the forward and
backward-data phases, sample parallelism is the cheapest approach, followed by spatial parallelism
(although halo exchanges can often be overlapped), followed by channel and filter
parallelism (which requires reduce-scatters and allgathers that cannot be overlapped except
by blocking). However, channel/filter parallelism modifies the set of processors and size of
the allreduce in the backward-filter phase. While this can often be overlapped with compu-
tation, this overhead can nonetheless be quite large (see Chapter 4), so reducing it can have
a significant impact that may actually outweigh the additional costs in other phases.
We can also see the relative characteristics of the different algorithms for channel and
filter parallelism. Selecting between the stationary-x and -y algorithms depends primarily




). These, in turn, depend on the
relative number of input channels versus filters in the convolutional layer, and the stride and
padding used. In short, however, if x is smaller than y, then the stationary-y algorithm is
probably to be preferred; and the stationary-x algorithm otherwise. Selecting the stationary-
w algorithm is less straightforward, as it imposes communication requirements on both the
x and y matrices, in exchange for additional parallelism.
3.4.2 Memory Requirements
The memory requirements for sample and spatial parallelism work in the obvious way: the
memory needed for activations (and error signals) decreases proportionally to how much they
are partitioned. As w is replicated in both cases, the memory required to store parameters
and their gradients remains constant.
The stationary algorithms for channel and filter parallelism require additional memory to
hold data prior to reduce-scatters or data assembled by allgathers (the temporary variables
in the algorithm descriptions). The memory used prior to reduce-scatters is ephemeral
and only needed until that operation completes. For the stationary-x algorithm, the buffer
assembled by its allgather is similarly ephemeral; for the stationary-y and -w algorithms, the
allgathered x tensor must be held until the layer completes backpropagation. The stationary
algorithms also reduce the memory requirements for storing parameters; while this is often
small for many CNNs, this can be significant when networks are particularly wide.
3.4.3 Runtime for a CNN
Extending this performance model to an entire CNN is relatively straightforward. We
need to address three things: layers besides convolution, overlapping between layers, and
data redistributions.
Most layers other than convolution are computationally cheap, and can be treated as
computationally free for simplicity. Fully-connected layers use standard performance models
for distributed matrix multiplication [137]. Pooling layers require halo exchanges similar to
convolutional layers in both forward and backpropagation, but are otherwise computationally
inexpensive. Layers with learnable parameters (most notably, batch normalization) require
allreduces to aggregate updates, which can be modeled similarly to the above. Note that
because batch normalization operates channel-wise, its parameters can be segmented in the
same way as convolutional layers when using channel and filter parallelism.
Estimating communication/computation overlap precisely is difficult, due to hard-to-
model issues such as interference. We outline a simple, greedy approach that is implemented
by our performance model. A certain number of allreduces are permitted to run concurrently,
and allreduces are started in a given order. Once an allreduce begins, it is overlapped with
as much subsequent computation as possible. After an allreduce completes, the next eligible
one is considered and overlapped with the remaining computation (note, there could be no
remaining computation, in which case the allreduce is completely unoverlapped) until all
communication is complete. This approach essentially approximates the maximum possible
communication/computation overlap.
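A simplified version of this greedy overlap estimate can be sketched as follows. It assumes that only one allreduce runs at a time, that allreduces are started in backpropagation order, and that a layer's allreduce becomes eligible once that layer's backward computation finishes. The function name and the input format are hypothetical, and the model in this work permits a configurable number of concurrent allreduces.

```python
def backprop_time_with_overlap(compute, allreduce):
    """Estimate backpropagation time with greedy communication overlap.

    compute[i] and allreduce[i] are the backward compute time and allreduce
    time of the i-th layer visited in backpropagation.
    """
    t_compute = 0.0     # time at which the compute stream becomes idle
    t_comm_free = 0.0   # time at which the network becomes idle
    for comp, ar in zip(compute, allreduce):
        t_compute += comp                     # this layer's gradients are ready
        start = max(t_compute, t_comm_free)   # wait for gradients and the network
        t_comm_free = start + ar              # allreduce overlaps later compute
    # Any communication extending past the last computation is unoverlapped.
    return max(t_compute, t_comm_free)


print(backprop_time_with_overlap([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]))
```

In this example only the final allreduce is exposed; the earlier ones are hidden behind the computation of subsequent layers.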
Lastly, data redistributions must be considered when the distributions of adjacent layers
differ. This is an all-to-all collective operation, which we implement as a set of sends and
receives. To represent this cost, we will use Shuffle(Dy,Dx) to represent the cost of shuffling
from distribution Dy to distribution Dx.
3.4.4 Finding Parallel Execution Strategies
We now sketch a simple optimization approach for selecting good parallel execution strate-
gies using this performance model. We will first consider the case of a CNN that has a “line”
architecture with no branches (e.g. AlexNet), for simplicity.
First we generate candidate distributions for each layer. For convolutional layers, we
heuristically select distributions that are load-balanced, and prefer applying “cheaper” par-
titioning methods ahead of more expensive ones (e.g., sample over spatial parallelism). We
assume fully-connected layers are either entirely sample- or model-parallel (although this
could be generalized), and that other layers simply inherit the distribution of their parent
layer. We will write $D^k_{\ell_i}$ to be the kth candidate distribution for layer $\ell_i$.
Given this set of candidate distributions, we find an optimal parallel execution strategy
among them by reducing this to a single-source shortest path problem in the following graph.
First, create a vertex for every candidate distribution for every layer, and add source and
sink vertices s and t. We add an edge from each candidate distribution $D^k_{\ell_i}$ in layer $\ell_i$ to
each candidate distribution $D^{k'}_{\ell_{i+1}}$ in the subsequent layer $\ell_{i+1}$, with weight
$\mathrm{Cost}_{D^k_{\ell_i}}$




i.e., the cost of forward and backpropagation for layer $\ell_i$, plus any cost to shuffle data between
the two layers’ distributions. For each candidate distribution in the last layer $\ell$, we add an
edge from it to t, with weight $\mathrm{Cost}_{D^k_{\ell}}(\ell)$. Finally, we add an edge from s to each candidate
distribution of the first layer with weight 0 (since we treat I/O costs as free).
In such a graph, which can be thought of as a sequence of bipartite graphs, a shortest
path from s to t gives a parallel execution strategy with fastest end-to-end runtime for the
CNN. Since this is a directed acyclic graph, such a path can be found in linear time. While
a large number of vertices and edges are generated to construct this graph, and gathering
empirical performance data may take some time, we have found that it is not an issue in
practice. In particular, it is often the case that many configurations are the same, and
caching performance data helps significantly. If necessary, additional heuristics could be
used to prune candidate distributions for layers in order to reduce the size of the graph and
the amount of performance data needed.
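Because the graph is layered, the shortest path can be computed with a single forward sweep. The sketch below is one way to do this for a "line" network, assuming per-layer costs and shuffle costs are already available (for example from the performance model or from cached measurements). The function name, the `layer_costs` table, and the `shuffle_cost` callback are hypothetical.

```python
def best_strategy(layer_costs, shuffle_cost):
    """Shortest s-t path through the layered candidate graph.

    layer_costs[i][k]: cost of layer i under its k-th candidate distribution.
    shuffle_cost(i, k, k2): cost to redistribute data between candidate k of
    layer i and candidate k2 of layer i + 1.
    Returns (total_cost, [chosen candidate index per layer]).
    """
    # Edges from s to the first layer's candidates have weight 0.
    best = [(0.0, [k]) for k in range(len(layer_costs[0]))]
    for i in range(len(layer_costs) - 1):
        nxt = []
        for k2 in range(len(layer_costs[i + 1])):
            # Edge (i, k) -> (i + 1, k2) weighs Cost(layer i, k) plus the shuffle.
            nxt.append(min((cost + layer_costs[i][k] + shuffle_cost(i, k, k2), path + [k2])
                           for k, (cost, path) in enumerate(best)))
        best = nxt
    # Edges from the last layer's candidates to t carry that layer's cost.
    return min((cost + layer_costs[-1][k], path) for k, (cost, path) in enumerate(best))


costs = [[4, 6], [5, 1]]                      # two layers, two candidates each
shuffle = lambda i, k, k2: 0 if k == k2 else 1
print(best_strategy(costs, shuffle))
```

Each edge is examined once, which is consistent with the linear-time bound for shortest paths in a directed acyclic graph.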
For networks that have branches (e.g. ResNets), this approach cannot be applied directly,
as some layers have multiple parents or children. Instead, we rely on the following heuristic
approach. First, we find the longest path from the beginning to the end of the CNN, and
apply the above to every layer in this path (which, by definition, is a “line” network). The
distributions for these layers are then fixed, and this process is repeated with the next longest
path that contains as few of the already-used layers as possible, until every layer has been
assigned a distribution. The idea behind this approach is that the longest path is the most
computationally intensive, and so should be optimized first, in order to guarantee maximum
flexibility in distribution choice.
3.5 EVALUATION
3.5.1 Spatial Parallelism
We now evaluate our algorithms via microbenchmarks and end-to-end training. We use
Lassen [143], which consists of 650 nodes, each with two IBM POWER9 CPUs and four
Nvidia V100 (Volta) GPUs with NVLink2 and 16 GB of memory per GPU, interconnected
via dual-rail InfiniBand EDR. Each CPU has two GPUs attached to it, and the node-local
NVLink2 interconnect does not cross the socket boundary; thus, intra-node communication
between GPUs on different sockets uses the socket interconnect, which has significantly lower
bandwidth and imposes additional hops. Our implementation uses a recent development
version of LBANN and Aluminum. We use GCC 7.3.1, CUDA 9.2.148, Spectrum MPI
2019.01.30 [144], cuDNN 7.4.1 [123], and NCCL 2.4.2 [73]. For all results we use single-
precision floating point.
We consider two networks: a fully-convolutional ResNet-50 [4, 117] for ImageNet-1K clas-
sification and a proof-of-concept model for a 2D mesh-tangling problem which we formulate
as semantic segmentation. The mesh-tangling data consists of “images” representing a hy-
drodynamics simulation state at a timestep, and the problem is to predict, for each pixel,
whether the mesh cell at that location needs to be relaxed to prevent tangling. Mesh tangling
occurs when cells overlap, which is non-physical and results in the simulation degenerating
or failing entirely. It is hoped that CNNs, which can incorporate global information from the
simulation state, will offer better heuristics for predicting tangling than existing methods.
However, as interesting simulations are high-resolution, it has not been possible to train
neural networks on the data due to memory constraints.
For these tests, the input data is either 1024 × 1024 (1K) or 2048 × 2048 (2K) pixel
images, with 18 channels consisting of various state variables and mesh quality metrics from
a hydrodynamics simulation. We use 10,000 samples of each size. Our CNN is a very sim-
ple fully-convolutional model adapted from VGGNet [145] for our input sizes and semantic
segmentation. It consists of six blocks of either three (1K) or five (2K) convolution-batch
normalization-ReLU operations, using 3 × 3 convolutional filters, and a final convolutional




































Figure 3.7: Microbenchmark results for the conv1 layer of ResNet-50 comparing paralleliza-
tion schemes in forward (FP) and backpropagation (BP). Error bars are ± one standard
deviation. Black shapes are performance model predictions. The layer configuration is given
at the top. Each column of graphs benchmarks a different number of samples.
tional filter of each block. The model for the 2K mesh data is large enough, when including
intermediate activations, to exceed GPU memory when training with even one sample. For
performance benchmarks on this problem, we use synthetic data, as our goal is to focus on
the performance of our algorithms and demonstrate that models of this scale can be quickly
trained on HPC systems. We leave developing optimized models to future work, now that
training is feasible.
3.5.2 Layer benchmarks
We first present microbenchmark results for selected layers from ResNet-50 and the 2K
mesh model on up to four Lassen nodes. These show the performance characteristics of
spatial convolution at a fine-grained scale. We time forward and backpropagation of each
layer, with halo exchanges overlapped. We exclude the allreduce that accumulates gradients,
both to focus on the performance of convolution and because it is typically overlapped with
other computation. For each measurement, we first do warmup runs, then report the mean and
standard deviation of ten runs. To avoid measuring kernel launch overheads, we use a spin
kernel to ensure all work has reached the GPU before beginning measurements. We also used
Figure 3.8: Microbenchmark results for the res3b_branch2a layer of ResNet-50 comparing
parallelization schemes in forward (FP) and backpropagation (BP). Error bars are ±
one standard deviation. Black shapes are performance model predictions. The layer
configuration is given at the top. Each column of graphs benchmarks a different number of
samples.
Figure 3.9: Microbenchmark results for the conv1_1 layer from the 2K mesh model, comparing
parallelization schemes in forward (FP) and backpropagation (BP). Error bars are
± one standard deviation. Black shapes are performance model predictions. The layer
configuration is given at the top. Each column of graphs benchmarks a different number of
samples.
Figure 3.10: Microbenchmark results for the conv6_1 layer from the 2K mesh model, com-
paring parallelization schemes in forward (FP) and backpropagation (BP). Error bars are
± one standard deviation. Black shapes are performance model predictions. The layer con-
figuration is given at the top. Each column of graphs benchmarks a different number of
samples.
Figures 3.7 and 3.8 show results for ResNet-50 layers conv1 and res3b_branch2a with
N = 1, 4, and 32 samples. The first two mini-batch sizes can occur when strong scaling
sample-parallelism to few samples per GPU; the last is a typical target for efficient use
of a GPU. conv1 is the first layer of ResNet, with a relatively large (224 × 224) input,
but only three channels and 64 filters. The kernel is large, K = 7, requiring large halo
exchanges for spatial parallelism. For N = 1, forward propagation does not scale well, due
to limited computation to hide halo exchanges; backpropagation fares better, and results in
net improvements to forward and backpropagation of ∼1.35x with 8 GPUs. Performance
degrades somewhat with 16 GPUs, due to communication overheads. res3b_branch2a is a
1 × 1 convolution from the middle of ResNet, with a fairly small spatial domain. Because the
filter is 1 × 1, no halo exchange is needed, which avoids communication overheads. Forward
propagation does not show significant performance improvements beyond two GPUs, due to
fixed kernel overheads. Backpropagation shows improvements up to 16 GPUs, except that
the 2 GPUs/sample case is significantly slower than 4 GPUs/sample at 4 GPUs, due to the
performance of the underlying cuDNN kernels. With larger numbers of samples, spatial
decomposition remains competitive with pure sample parallelism, indicating halo exchanges
are hidden.
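To make the halo-exchange volumes discussed above concrete, the following is a minimal sketch rather than the implementation evaluated here. It assumes a one-dimensional split along the height and a halo of floor(K/2) rows per interior boundary, which ignores stride and padding details. The names `halo_rows` and `halo_bytes`, and the input shape used for the 1 × 1 case, are illustrative assumptions.

```python
def halo_rows(kernel_size: int) -> int:
    """Rows needed from each neighbor for an odd K x K kernel (assumed rule: K // 2)."""
    return kernel_size // 2


def halo_bytes(n, channels, width, kernel_size, bytes_per_elem=4):
    """Bytes one GPU sends to ONE neighbor when the height dimension is split."""
    return n * channels * width * halo_rows(kernel_size) * bytes_per_elem


# conv1 of ResNet-50 as described in the text: 224 x 224 input, 3 channels, K = 7.
conv1 = halo_bytes(n=1, channels=3, width=224, kernel_size=7)
# A 1 x 1 convolution needs no halo, whatever its input shape (shape is a placeholder).
pointwise = halo_bytes(n=1, channels=512, width=28, kernel_size=1)

print(conv1, pointwise)
```

Under these assumptions the conv1 halo is small in absolute terms, which is consistent with the explanation above that the N = 1 forward pass is limited by having too little computation to hide the exchange, not by its volume alone.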
Figures 3.9 and 3.10 present results for two layers of our 2048 × 2048 mesh model. Here
spatial domains are much larger, and we expect spatial parallelism to perform better. Results
are for N = 1, 2, and 4 samples since, due to the size of the data, only one or two samples can
be trained per node. (Since we benchmark only individual layers here, memory pressure is
not as significant as with an entire CNN.) conv1_1 is the first layer, and has an extremely large
spatial input. The N = 1 case scales very well in both forward and backpropagation,
achieving ∼14.8x speedup on 16 GPUs, indicating inter-node halo exchange overheads are
well-hidden. The two sample case is similar. With four samples, the overhead of the halo
exchange is very minor, enabling competitive scaling to sample parallelism. conv6_1 is from
much deeper within the network, and has a smaller input spatial domain. Nonetheless we
see continued benefit in the N = 1 case (∼1.4x).
These results illustrate three points. First, the empirical performance of
convolution can be difficult to predict. Second, the intuition from our performance
model is broadly correct: other things being equal, sample parallelism typically has the least
overhead, while spatial parallelism has moderate overheads, primarily driven by unoverlapped
communication when there is little local computation. Finally, the small N case, observed
when strong-scaling sample parallelism, can benefit significantly from spatial parallelism.
3.5.3 Training
We now present scaling results for end-to-end training of our models. We consider both
strong and weak scaling, although our focus is strong scaling (see Section 2.2.2). Strong
scaling has the advantage that one need not address issues of learning and generalization
with large mini-batches, as the learning process does not change: once a good mini-batch
size is selected, it can be strong-scaled to make training faster. Our results are primarily
hybrid sample-spatial parallelism, where samples are first partitioned onto groups of GPUs,
and then spatially parallelized within that group. We use the same data decomposition for
every layer in a given configuration, although this is not necessarily optimal; we leave a
full exploration of more varied decompositions to future work, but explore a simple case in
Section 3.5.4.
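The hybrid sample/spatial decomposition described above can be sketched as a mapping from ranks to groups. This is only an illustration: the function name `hybrid_layout` and the choice to place consecutive ranks in the same spatial group are assumptions, not a description of how LBANN assigns ranks.

```python
def hybrid_layout(num_gpus: int, gpus_per_sample: int):
    """Map each rank to (sample_group, spatial_index).

    Assumption: consecutive ranks form one spatial group, so a group of up to
    4 ranks stays within one 4-GPU node.
    """
    if num_gpus % gpus_per_sample != 0:
        raise ValueError("num_gpus must be a multiple of gpus_per_sample")
    return [(rank // gpus_per_sample, rank % gpus_per_sample)
            for rank in range(num_gpus)]


layout = hybrid_layout(num_gpus=8, gpus_per_sample=4)
num_groups = len({group for group, _ in layout})
print(layout)
print(num_groups)
```

With one sample per group, the number of groups equals the mini-batch size, so adding GPUs to each group strong-scales a fixed mini-batch.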
Mesh model
Figures 3.11 and 3.12 show scaling results for the 1K and 2K mesh models, up to 2048
GPUs (512 nodes). For the former, we present five cases: sample parallelism, and 2-, 4-, 8-,
and 16-way hybrid sample/spatial parallelism with mini-batch sizes N = 4 to 2048 (omitting
Figure 3.11: Weak scaling for the 1K mesh model. Error bars are ± one standard deviation.
Black markers are estimates from our performance model. 1 sample/GPU is pure sample
parallelism; other cases are hybrid sample/spatial parallelism. Strong scaling can be seen
between parallelization schemes (see Table 3.1).
Figure 3.12: Weak scaling for the 2K mesh model. Error bars are ± one standard deviation.
Black markers are estimates from our performance model. Note that there are 4 GPUs/node.
Spatial parallelism is required for this model due to memory requirements. Strong scaling
can be seen between parallelization schemes (see Table 3.2).
cases when they require too many GPUs). The model can fit only one sample per GPU,
so we do not explore additional sample parallelism. We run the same configurations for the
2K mesh model, except pure sample parallelism is not possible due to memory constraints
and our maximum mini-batch size is 1024. We compare strong-scaling at a fixed mini-batch
size across parallelism cases, and weak scaling as the mini-batch size grows. Note that when
using 8- or 16-way spatial parallelism, a sample is being partitioned across two or four nodes,
requiring both intra- and inter-node communication for halo exchanges.
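As a rough illustration of which configurations fit on the machine, the sketch below computes the total GPU count and the number of nodes one sample spans. It assumes one sample per spatial group, 4 GPUs per node, and the 2048-GPU limit stated above; the function name `config` is hypothetical.

```python
import math

GPUS_PER_NODE = 4      # Lassen, as stated in the text
MAX_GPUS = 2048        # largest run reported


def config(mini_batch: int, gpus_per_sample: int):
    """Return (total GPUs, nodes spanned by one sample), or None if too many GPUs."""
    total = mini_batch * gpus_per_sample   # one sample per spatial group
    if total > MAX_GPUS:
        return None
    return total, math.ceil(gpus_per_sample / GPUS_PER_NODE)


for ways in (1, 2, 4, 8, 16):
    print(ways, config(2048, ways), config(128, ways))
```

Under these assumptions, 16-way spatial parallelism reaches the 2048-GPU limit at N = 128 and 8-way at N = 256, which matches the n/a entries in Table 3.2.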
For strong scaling, spatial parallelism allows us to use additional GPUs for the same
mini-batch size. Table 3.1 gives mini-batch times and speedups for the 1K mesh model.
We see near-linear speedup for 2 GPUs/sample over sample parallelism (2x is ideal), and
significant further improvements with 4 GPUs/sample. Improvements continue with 8 and
16 GPUs/sample, although they are not as dramatic due to the increased overheads of halo
communication and local convolution kernels not scaling linearly. Weak scaling results can
be seen in Figure 3.11, where the flat mini-batch time for increasing numbers of GPUs
(hence, increasing mini-batch size) shows near-perfect weak scaling. This implies that our
spatial partitioning is not impacting the typical sample-parallel weak scaling trends. Weak
scaling for 8 and 16 GPUs/sample does show a slight trend of increasing mini-batch time
at large scale; due to the extensive data decomposition, each GPU has significantly less
work, and our implementation cannot fully overlap global allreduces with backpropagation
computation in these cases.
The performance degradation for 1 GPU/sample at 2048 GPUs in Figure 3.11 is due
to memory pressure requiring a smaller workspace for cuDNN, impacting local convolution
algorithm selection. The increased memory pressure is due to communication-related data
structures taking increased GPU memory, and could be mitigated with future optimizations.
Results for the 2K mesh model are shown in Table 3.2 (strong scaling) and Figure 3.12
(weak scaling). The strong scaling trend is similar to that for the 1K model, with compa-
rable speedups from each algorithm. Speedups from 2 GPUs/sample to 4 GPUs/sample are
essentially perfect; slight superlinear speedup is observed due to cuDNN algorithm changes.
When increasing from 4 GPUs/sample to 8 GPUs/sample, the 1K and 2K models see roughly
1.3x and 1.4x improvements in performance, respectively. Neither attains a linear increase,
due to the high overhead of fine-grained inter-node halo communication, but there is further
improvement nonetheless. The 2K model achieves slightly more speedup in this case, as
there is more work to overlap communication with. The weak scaling trend is also compara-
ble to that for the 1K model, although we only observe weak scaling performance degrading
Table 3.2: 2K mesh model strong scaling at mini-batch size N: mini-batch time and speedup
over 2 GPUs/sample.
N     2 GPUs/sample   4 GPUs/sample   8 GPUs/sample    16 GPUs/sample
2     0.247s          0.12s (2.1x)    0.0859s (2.9x)   0.0683s (3.6x)
4     0.249s          0.123s (2.0x)   0.0895s (2.8x)   0.0662s (3.8x)
8     0.25s           0.125s (2.0x)   0.0849s (2.9x)   0.0665s (3.8x)
16    0.249s          0.121s (2.1x)   0.0848s (2.9x)   0.0681s (3.7x)
32    0.251s          0.122s (2.1x)   0.0851s (2.9x)   0.0703s (3.6x)
64    0.252s          0.122s (2.1x)   0.0856s (2.9x)   0.0729s (3.5x)
128   0.252s          0.122s (2.1x)   0.0867s (2.9x)   0.0748s (3.4x)
256   0.25s           0.123s (2.0x)   0.089s (2.8x)    n/a
512   0.249s          0.123s (2.0x)   n/a              n/a
That we can achieve both good strong and weak scaling for the mesh-tangling problem
is important for being able to rapidly train and explore new models: good strong scaling
means we can accelerate training of models without changing learning dynamics, and good
weak scaling additionally helps with tuning mini-batch sizes.
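The speedups in Table 3.2 can be reproduced from the reported times, and the same arithmetic gives a parallel efficiency. The sketch below uses the N = 2 row; "efficiency" here is the standard ratio of achieved to ideal speedup, a quantity not reported in the table itself.

```python
# Mini-batch times in seconds for N = 2, copied from Table 3.2.
times = {2: 0.247, 4: 0.12, 8: 0.0859, 16: 0.0683}
baseline_gpus = 2


def speedup_and_efficiency(gpus_per_sample: int):
    """Speedup over 2 GPUs/sample, and that speedup divided by the ideal one."""
    speedup = times[baseline_gpus] / times[gpus_per_sample]
    ideal = gpus_per_sample / baseline_gpus
    return speedup, speedup / ideal


for g in sorted(times):
    s, e = speedup_and_efficiency(g)
    print(f"{g:2d} GPUs/sample: speedup {s:.1f}x, efficiency {e:.0%}")
```

The 4 GPUs/sample case comes out slightly above 100% efficiency, reflecting the superlinear speedup noted earlier, while efficiency falls off for 8 and 16 GPUs/sample.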
ResNet-50
We present strong scaling results for ResNet-50, comparing pure sample parallelism to
2-way and 4-way hybrid sample/spatial parallelism, in Table 3.3. We use 32 samples per GPU as
our baseline, as this is a typical choice to saturate GPUs. Using spatial parallelism we achieve
1.4x speedups with 2x as many GPUs, and up to 1.8x with 4x as many GPUs. Other work
has shown that strong scaling this problem size via sample-parallelism past ∼8-16 nodes
rapidly results in communication overhead making it unprofitable [146] (see Chapter 4).
Thus, weak scaling is typically preferred, ensuring sufficient local work to hide communi-
cation costs. However, one cannot weak scale indefinitely, even with large mini-batch tech-
niques, due to generalization issues (see Section 2.2.3). To continue to accelerate training
beyond this, strong scaling must still be employed. Table 3.3 shows we get continued im-
provement from spatial parallelism with larger mini-batch sizes. Speedups decrease slightly
at larger scale for more extensive decomposition, due to the implementation being unable
to fully overlap the cost of allreduces, since there is less work per GPU.
Achieving near-linear speedup for ResNet is unlikely, as most layers have small spatial
domains. This agrees with our microbenchmarks. Using channel/filter parallelism for such
layers is more promising, as we show in Section 3.5.4. Despite this, we are still able to
accelerate many problem sizes of interest.
Performance Model
The figures in this section also include the corresponding predictions from our performance
model (discussed in more detail in Section 3.4). Its predictions are quite accurate, and even
where they deviate, the model still captures the correct trend and ranking
of algorithms. Much of the inaccuracy is due to lower-order computations that are not
accounted for but matter with more extensive decompositions (e.g. 16 GPUs/sample). Network
noise and other similar factors are also not accounted for. Nevertheless, these results validate
the accuracy of the performance model, and we can be confident that generating parallel
execution strategies with it will be effective.
3.5.4 Channel and Filter Parallelism
We now evaluate the stationary-x algorithm using both micro-benchmarks and end-to-
end training. We begin by independently examining the compute and segmented allreduce
performance for representative layers of ResNet-50, and then examine the performance of
the full algorithm. This allows us to understand the upper bound of possible performance
improvements. We then evaluate the strong and weak scaling performance for end-to-end
training of ResNet-50 and two Wide ResNet architectures. Finally, we report accuracy
results for our models.
Our algorithms were evaluated on Lassen [143], which is described in Section 3.5.1. Our
implementation uses a recent development version of LBANN, GCC 7.3.1, Spectrum MPI
2019.01.30, CUDA 9.2.148, cuDNN 7.5.0, and NCCL 2.4.2. For all results we use single-
precision floating point.
We use ResNet-50 [4] as a benchmark case because it is representative of many modern
CNN architectures (see Section 2.3.2). We therefore expect that improvements for ResNet-50
training will have broad impact. Our evaluation focuses extensively on this network.
To understand the impact of wider CNN architectures, we also evaluate end-to-end training
for two Wide ResNet [147] architectures, Wide ResNet-50-2 and 50-4. These networks have a
similar structure to ResNet-50, but the 3×3 convolutions in the network have the number of
filters (their width, not to be confused with the spatial width dimension) increased by a factor
of 2 and 4, respectively. Wide ResNet-50-2 achieved superior accuracy to ResNet-50 in the
Figure 3.13: Microbenchmark results for a conv_3 layer of ResNet-50 without any commu-
nication. The layer configuration is given at the top. Each column of graphs benchmarks a
different number of samples.
We are able to demonstrate the training of these wider models, which had not previously
been done, and show the accuracy improvements they bring. We use fully-convolutional [117]
versions of all models.
We evaluate both strong and weak mini-batch scaling (see Section 2.2.2). Our evaluations
will consist of hybrid sample and channel parallelism: data is partitioned over an N × C
grid (H and W are trivial; compare with Section 3.5.1). Thus, when weak scaling, we fix
a mini-batch size for each “row” of the N processor grid dimension. When using two- or
four-way channel parallelism, we always partition channels within a single socket or node,
respectively. Eight-way channel parallelism requires two nodes, and hence more expensive
inter-node communication for reduce-scatters and allgathers.
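The relationship between mini-batch size, samples per row, and GPU count on the N × C grid can be sketched as follows. This is an illustrative calculation only: the function name `grid` is hypothetical, and "16k" is taken to mean 16384.

```python
GPUS_PER_NODE = 4  # Lassen


def grid(mini_batch: int, samples_per_row: int, channel_ways: int):
    """Return (rows in the N dimension, total GPUs, nodes) for an N x C grid."""
    if mini_batch % samples_per_row != 0:
        raise ValueError("mini_batch must be a multiple of samples_per_row")
    rows = mini_batch // samples_per_row
    gpus = rows * channel_ways
    return rows, gpus, -(-gpus // GPUS_PER_NODE)   # ceiling division


print(grid(16384, 32, 4))
print(grid(16384, 32, 1))
```

With 32 samples per row and a 16384-sample mini-batch, four-way channel parallelism gives 2048 GPUs on 512 nodes, the largest configuration reported for end-to-end ResNet-50 training.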
Microbenchmarks
Compute and communication. We first seek to understand the computational scaling
of the cuDNN kernels used for local convolution as we partition a layer. Figures 3.13 and 3.14
present the local convolution time on a single GPU as a layer is partitioned over additional
GPUs. We select two layers, one near the beginning of ResNet-50 (the conv_3 block, which
has relatively few channels and filters), and one near the end (conv_5, which has many
Figure 3.14: Microbenchmark results for a conv_5 layer of ResNet-50 without any commu-
nication. The layer configuration is given at the top. Each column of graphs benchmarks a
different number of samples.
Figure 3.15: Segmented allreduce runtime for a message size from a conv_3 layer of ResNet-
50.
Figure 3.17: Strong scaling for a conv_3 layer of ResNet-50. Allreduces are included in
backpropagation. The layer configuration is given at the top. Each column of graphs
benchmarks a different number of samples.
Figure 3.18: Strong scaling for a conv_5 layer of ResNet-50. Allreduces are included in back-
propagation. The layer configuration is given at the top. Each column of graphs benchmarks
a different number of samples.
Figure 3.19: Weak scaling for a conv_5 layer of ResNet-50 (layer configuration: C=2048, H=7,
W=7, F=512, K=1, S=1, P=0). Allreduces are included in backpropagation.
three mini-batch sizes for strong scaling: N = 1, 2, and 32, which correspond to extreme
sample parallelism and a more typical number of samples per GPU. To evaluate this, we
used CUDA events to time the convolution kernel on GPU. We first did several warmup
runs, then reported the mean of five runs. To avoid measuring kernel launch overheads, we
used a spin kernel to ensure all work reached the GPU before beginning measurements. We
use auto-tuning to select the fastest cuDNN algorithm for each configuration.
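The warmup-then-measure methodology above can be sketched with a small host-side harness. This is a stand-in only: it uses wall-clock timing in place of CUDA events, does not model the spin kernel, and the default of three warmup runs is an assumption, since the text says only "several".

```python
import statistics
import time


def benchmark(fn, warmup: int = 3, runs: int = 5):
    """Run fn() `warmup` times untimed, then return (mean, stdev) over `runs` timed calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


mean_s, std_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms")
```

On a GPU, the timed region would instead be bracketed by events recorded on the stream, so that only device execution time is measured.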
The N = 1 case is most indicative of the kernel scaling as the number of channels per
GPU is decreased. For the conv_3 layer, while we observe improvements due to channel
partitioning, the scaling is sub-linear. We also observe that forward propagation tends to
scale better than backpropagation. The reasons for this are unclear, but seem to be due
to the implementation of the underlying compute kernel. Direct convolution should scale
linearly regardless of how a dimension is partitioned, but optimized algorithms may exploit
the locality in channel accesses (e.g. im2col); as we partition channels, there is less locality
to exploit. Given the small total runtime, other overheads (e.g. kernel scheduling) may also
be significant. A similar trend occurs for N = 2, although with such a small mini-batch it
appears better to partition the channels instead of the samples. The N = 32 case exhibits
a trend where sample parallelism initially scales better, but eventually hybrid sample and
channel parallelism produces better results.
The trend is similar for the conv_5 layer, although forward propagation scaling is much
closer to linear. For N = 32, we observe that channel parallelism consistently outperforms
sample parallelism, due to the large number of channels. Backpropagation is more compa-
rable, although there are slight benefits.
We next examine the allreduce performance in isolation, as it is the key communication
bottleneck when scaling. We use the same two layers as above; the conv_3 layer has a much
smaller number of parameters than the conv_5 layer. Figures 3.15 and 3.16 present these
results. To obtain these measurements, we first conducted several warmup runs, then timed
the appropriate segmented allreduce on a buffer sized appropriately for the amount of channel
parallelism (e.g. half the size for two-way). Both cases are qualitatively similar: for a given
number of GPUs, the segmented allreduce is faster. The large jumps in runtime are due,
respectively, to the allreduce crossing the socket boundary, crossing the node boundary, and an
algorithm change in NCCL’s ring implementation. Note that as the allreduce is more segmented,
these jumps occur at larger numbers of GPUs, as the individual communicators are smaller.
As bandwidth-bound ring allreduces are primarily sensitive to the vector size, much of this
improvement is driven by the reduced buffer size. However, latency may become more of a
factor at large scale or with smaller parameter buffers (e.g. for batch normalization). Thus,
greater segmentation may be more beneficial for such cases.
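The effect of segmentation on a bandwidth-bound ring allreduce can be illustrated with the textbook latency-bandwidth estimate. This is a sketch under stated assumptions, not the performance model of Section 3.4. The latency `ALPHA` and inverse bandwidth `BETA` are placeholder values, and the buffer size assumes a conv_5 filter with the configuration printed with Figure 3.19.

```python
def ring_allreduce_time(p: int, nbytes: float, alpha: float, beta: float) -> float:
    """Textbook ring allreduce estimate: 2(p-1) steps, each moving nbytes/p."""
    if p <= 1:
        return 0.0
    return 2 * (p - 1) * (alpha + (nbytes / p) * beta)


def segmented_time(p, nbytes, ways, alpha, beta):
    """c-way channel parallelism: each segment spans p/ways GPUs and nbytes/ways bytes."""
    return ring_allreduce_time(p // ways, nbytes / ways, alpha, beta)


ALPHA = 5e-6              # assumed per-message latency (s)
BETA = 1 / 10e9           # assumed inverse bandwidth (s/byte), i.e. 10 GB/s
NBYTES = 2048 * 512 * 4   # C=2048, F=512, K=1 filter stored as 4-byte floats

for ways in (1, 2, 4):
    print(ways, segmented_time(64, NBYTES, ways, ALPHA, BETA))
```

This simple model predicts gains roughly proportional to the degree of segmentation. It omits the socket and node boundaries and the NCCL algorithm changes noted above, so measured improvements can be expected to fall short of it.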
We observe that while segmentation improves performance, it is again not a linear im-
provement. We investigated whether this was due to contention on shared InfiniBand HCAs,
but this did not explain the results. Our hypothesis is that NCCL has all InfiniBand con-
trol performed by threads on a single core of each socket. For a segmented allreduce, the
processes on each node can participate in different segments of the collective, and hence are
performing communication on different communicators. When this communication is routed
through a single control thread, contention occurs. A communication framework optimized
for segmented allreduces should be able to mitigate this impact. Nonetheless, we observe
reductions in allreduce runtime of over 50% and up to 4x at large scale.
Summary. Based upon the scaling of computation and communication individually, we
expect we will not see major gains in forward propagation runtime, nor backpropagation
computation, for every layer. This is primarily due to the overhead of the reduce-scatters
and allgathers the stationary-x algorithm performs, relative to the computation done in
layers with few input channels. However, we can expect to see significant improvements in
allreduce overhead at large scale, particularly as the faster communication makes it easier
to overlap communication and computation.
Layer scaling. We now examine the runtime of the conv_3 and conv_5 layers above using
the full stationary-x algorithm. Our measurement methodology is as above, and we perform
all communication. As we are looking at a single layer, the allreduce is not overlapped, and
its runtime is included in backpropagation. Figures 3.17 and 3.18 present strong scaling
results for these layers. For conv_3, forward and backpropagation are flat or do not scale well
when N is small; this is expected according to our performance model as the overhead of
communication is high and this layer has few channels. Eight-way channel parallelism, in
particular, suffers from the high cost of inter-node reduce-scatters and allgathers. Comparing
Figures 3.13 and 3.14 with 3.17 and 3.18, we can see that this overhead is often 2x or more.
For N = 32, forward propagation scales relatively well, although this is likely due primarily
to the scaling of sample parallelism. Backpropagation exhibits a “U” shape due to the impact
of allreduces at larger scales. We can observe that communication overhead is improved at
large scale by the segmented collectives, but sample parallelism is still superior. Again, this
is expected due to the few channels, but demonstrates that even when a layer is poorly
suited to it, moderate channel parallelism does not add too much overhead.
We observe much better trends for conv_5. At small N , eight-way channel parallelism
is again not profitable due to inter-node communication, but two- and four-way parallelism
show more promise. For N = 32, channel parallelism scales better than sample parallelism,
showing moderate improvements at larger scales, but allreduce overheads again rapidly dom-
inate. In sum, we see moderate benefits from strong scaling via channel parallelism for layers
with many channels, and a trend of reduced communication overheads at large scale.
With this in mind, we next consider weak scaling for the conv_5 layer in Figure 3.19. For
this plot, we fix 32 samples per row of N ; thus at the largest scale we consider 1024 GPUs
(256 nodes). We use a maximum of 16k samples in sample parallelism, as this is roughly the
maximum useful mini-batch size for training ResNet-50 with current techniques [108]. In
forward propagation, we observe scaling trends similar to what we would expect from Fig-
ure 3.18, as channel parallelism improves over sample parallelism. Weak scaling is excellent
in this case. In backpropagation, we can observe that channel parallelism again outper-
forms sample parallelism. Further, the rate of increase in allreduce runtime is less with
additional channel parallelism: we have successfully reduced communication overheads. In
end-to-end training, we can therefore expect channel parallelism to enable easier communi-
cation/computation overlap.
Summary. There are some limitations to our results: cuDNN kernels and segmented
allreduces do not always scale linearly. Nevertheless, we are able to demonstrate improve-
ments for important layers, primarily through the reduced communication overhead. While
the use of ring allreduces may limit weak scaling performance, since the segmentation reduces
both the vector size and number of processors involved, we expect these results to translate
similarly to tree or butterfly algorithms. The stationary-x algorithm is the simplest to im-
plement of our family of algorithms, so we have focused on it; since the structures of the
other algorithms are similar, their performance should be comparable to what we observe
here.
End-to-End Training
We now evaluate the performance of end-to-end training of ResNet-50 and Wide ResNets
on the ImageNet dataset within the LBANN toolkit. For simplicity, we use synthetic data
that matches the dimensions of ImageNet data; since our compute nodes can fit the entire
ImageNet-1K dataset in host memory, this is not significantly different than using real data.
We report the average mini-batch time over an epoch, skipping the first mini-batch (which
performs initialization).
We show ResNet-50 scaling performance in Figures 3.20 and 3.21 with 32 samples per row
of N , again using a maximum mini-batch size of 16k. At the largest scale, the four-way
channel parallel algorithm uses 2048 GPUs (512 nodes). For small mini-batch sizes (32-
512), sample and two-way channel parallelism weak scale very well, and have comparable
performance. Beyond this, sample parallelism begins to degrade in performance due to
allreduce overheads; two-way channel parallelism maintains its weak scaling for another two
Figure 3.20: Weak scaling for end-to-end training of ResNet-50, using 32 samples per GPU
for sample parallelism.
Figure 3.21: Strong scaling for end-to-end training of ResNet-50, using 32 samples per GPU
for sample parallelism. To see strong scaling, compare different partitioning schemes for a
fixed mini-batch size.
Figure 3.22: Weak scaling for end-to-end training of Wide ResNet-50-2, using 16 samples
per GPU for sample parallelism.
Figure 3.23: Strong scaling for end-to-end training of Wide ResNet-50-2, using 16 samples per
GPU for sample parallelism. To see strong scaling, compare different partitioning schemes
for a fixed mini-batch size.
Figure 3.24: Weak scaling for end-to-end training of Wide ResNet-50-4, using 8 samples per
GPU for sample parallelism.
Figure 3.25: Strong scaling for end-to-end training of Wide ResNet-50-4, using 8 samples per
GPU for sample parallelism. To see strong scaling, compare different partitioning schemes
for a fixed mini-batch size.
doublings. Four-way channel parallelism weak scales quite well throughout the entire range
of mini-batch sizes, but is outperformed at smaller scales. We have observed that this is due
to the overhead of inter-socket reduce-scatters and allgathers compared to compute scaling,
primarily for the early layers of ResNet-50. This agrees with our performance model and
microbenchmarks. At the largest scales, where allreduce overheads are greatest, four-way
channel parallelism is fastest. We do not show eight-way channel parallelism, as it was
not beneficial at these scales. The main cause of limited performance at smaller scales is
the overhead of reduce-scatters and allgathers for channel parallelism, particularly for early
layers in ResNet-50, outweighing the reduced allreduce overhead (which is mostly hidden by
other compute).
Given the observed overheads in the early layers of ResNet-50, we also experimented
with using the spatial parallelism of [50] for the conv_1 and conv_2 blocks (an N × H
grid) and then using channel parallelism for the remaining layers. These results are also
plotted in Figures 3.20 and 3.21, where we observe that this has significantly mitigated
overheads from the early layers while continuing to offer good weak scaling. The overhead
from shuffling between different tensor distributions is also negligible. At large scales, where
allreduce overheads are most acute, the benefit of two-way spatial parallelism is limited, but
we see up to 1.4x improvements with four-way spatial/channel parallelism. These results are
comparable to those in [50], while we observe much better weak scaling compared to sample
parallelism.
In Figures 3.22 through 3.25, we show the scaling performance for end-to-end training of
Wide ResNet-50-2 and 50-4, using 16 and 8 samples per row of N , respectively. We use
a smaller mini-batch per GPU due to the increased memory requirements of the networks:
Wide ResNet-50-2 has 2.7x as many parameters as ResNet-50, and 50-4 has 8.7x as many, in
addition to larger activations. Channel parallelism’s strong scaling enables it to outperform
sample parallelism at nearly every scale for both networks. For Wide ResNet-50-2, channel
parallelism is consistently at least 1.25x faster than sample parallelism beyond a mini-batch
size of 512. For Wide ResNet-50-4, we see performance improvements of at least 2x for
mini-batch sizes up to 256, and significant improvements for larger mini-batch sizes. Eight-
way channel parallelism is always at least 2x faster than sample parallelism at every scale,
and is over 3.3x faster with a mini-batch size of 2048. The channel parallel algorithms also
consistently weak scale better than sample parallelism; the eight-way algorithm in particular
demonstrates excellent weak scaling.
Overall, our channel parallel algorithms are able to significantly improve performance at
large scales, and can enable training very wide networks, which would be infeasible otherwise
due to memory constraints. While performance at small scale is network-dependent, channel
parallelism is able to consistently offer superior performance at large scales. Thus, one is
able to weak scale using sample parallelism until mini-batch size limits are reached, and then
leverage channel parallelism to further accelerate training.
Accuracy
Table 3.4: ImageNet validation accuracy.
Model              Mini-batch size   Top-1   Top-5
ResNet-50          8192              76.3%   93.0%
Wide ResNet-50-2   4096              77.3%   93.6%
Wide ResNet-50-4   2048              78.2%   94.0%
We evaluated the accuracy of these trained networks on ImageNet. We used the learning
rate schedule, hyperparameters, and data augmentation strategy of Goyal et al. [103] for
all models. Accuracies and the mini-batch sizes used are reported in Table 3.4. We use
only a single mini-batch size for each model due to compute time constraints. Our baseline
results for ResNet-50 match those of [103], and we show roughly a 1% absolute improvement
in top-1 and 0.5% absolute improvement in top-5 accuracy for each doubling of width.
This improvement is substantial, and comparable to that obtained from much deeper and/or more
complicated architectures (e.g. SENet-101 [148]).
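The per-doubling gains quoted above can be checked directly against Table 3.4. The following is a minimal sketch of that arithmetic; the accuracies are copied from the table, while the function and variable names are illustrative only.

```python
# Top-1 / top-5 validation accuracies (percent), copied from Table 3.4.
ACCURACY = [
    ("ResNet-50", 76.3, 93.0),
    ("Wide ResNet-50-2", 77.3, 93.6),
    ("Wide ResNet-50-4", 78.2, 94.0),
]


def width_doubling_gains(rows):
    """Return (wider model, top-1 gain, top-5 gain) for each doubling of width."""
    gains = []
    for (_, top1_a, top5_a), (name_b, top1_b, top5_b) in zip(rows, rows[1:]):
        # Round to one decimal place, the precision reported in the table.
        gains.append((name_b, round(top1_b - top1_a, 1), round(top5_b - top5_a, 1)))
    return gains


for name, d_top1, d_top5 in width_doubling_gains(ACCURACY):
    print(f"{name}: +{d_top1} top-1, +{d_top5} top-5 (percentage points)")
```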
This demonstrates the advantage of wider models, even without additional hyperparameter
tuning or data augmentation, to give improved accuracy, while avoiding extensive CNN
architecture development. This also shows that the warmup and linear scaling rule for
learning rate introduced by [103] can enable training with large mini-batches on models
other than ResNet-50.
CHAPTER 4: COMMUNICATION OPTIMIZATION
I now discuss approaches to reduce communication overheads, and present an implementation
of them in the new Aluminum library¹.
4.1 COMMUNICATION REQUIREMENTS
We begin by discussing in more detail the communication involved in training a deep
network, including where the communication occurs and what volume of data is moved.
This forms the basis of our subsequent discussion on optimizing communication.
4.1.1 Where and What is the Communication?
As discussed in Section 2.2, the key communication step when training is an allreduce to
aggregate gradient updates for layers with learnable parameters. While this can be imple-
mented with either a parameter server or a decentralized allreduce (e.g., MPI_Allreduce),
we focus only on the latter case in this work.
The granularity of communication can vary depending on the implementation. At one
extreme, all data could be combined into a single buffer and allreduced once backprop
completes for every layer. Alternatively, allreduces can be done as soon as the gradient
computation for a layer completes, and work on a per-buffer basis. Many implementations
(including ours) keep separate, non-contiguous buffers for the parameters for each layer for
simplicity, so operating on a per-buffer basis is typical.
In this work, our layers use 4 byte single-precision floats to store parameters, and we
communicate parameters in this format. Within the networks we consider, convolutional,
fully-connected, and batch normalization layers have parameters that must be learned. In
our implementation, convolutional and fully-connected layers have their parameters stored
in a single buffer per layer. Batch normalization, for convenience, has two buffers, one each
for its scale and bias.
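As a rough illustration of the buffer layout just described, the sketch below computes buffer sizes in bytes for a convolutional layer and a batch normalization layer. The layer shapes are hypothetical and chosen only for illustration, and the helper names are not part of any library.

```python
BYTES_PER_PARAM = 4  # single-precision float, as stated in the text


def conv_buffer_params(out_channels, in_channels, kernel_h, kernel_w):
    """Parameters in one convolutional layer's weight buffer (no bias)."""
    return out_channels * in_channels * kernel_h * kernel_w


def batchnorm_buffers(channels):
    """Batch normalization keeps two buffers, scale and bias, one value per channel."""
    return [channels, channels]


def buffer_bytes(num_params):
    """Size in bytes of a buffer holding num_params parameters."""
    return num_params * BYTES_PER_PARAM


# Hypothetical layer shapes, for illustration only.
conv_bytes = buffer_bytes(conv_buffer_params(64, 64, 3, 3))
bn_bytes = [buffer_bytes(n) for n in batchnorm_buffers(64)]
print(conv_bytes, bn_bytes)
```

Under these assumptions, a batch normalization layer over 64 channels contributes two buffers of 64 parameters (256 bytes) each, while a single 3x3 convolution between 64-channel feature maps contributes one much larger buffer.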
4.1.2 Communication Volume
We now look to understand the amount of data and number of buffers that must be
communicated in a training iteration. This depends on the architecture of the network
being trained (e.g. number and size of filters in a convolutional layer). Figures 4.1 and 4.2
plot histograms of parameter buffer size for two representative image classification networks,
AlexNet and ResNet-50 (see Section 2.3 for an overview).
Figure 4.1: Histogram breaking down the number of parameter buffers (essentially, a layer)
of a given size for AlexNet.
Figure 4.2: Histogram breaking down the number of parameter buffers (essentially, a layer)
of a given size for ResNet-50. In our implementation, each parameter is a 4-byte float.
Figure 4.3: Strong and weak scaling results for ResNet-50 using our synthetic benchmark
on Sierra, using NCCL with no communication/computation overlap.
AlexNet is a fairly shallow network that has several large fully-connected layers, and is a
commonly used baseline or building block where state-of-the-art accuracy is not necessary. It
has relatively few buffers: five convolutional layers and three fully-connected layers, with all
but the final layer having a separate bias. The three largest buffers are the fully-connected
layers, which contain a majority of the parameters.
ResNet-50 is more representative of modern CNNs, which have many more layers, batch
normalization, and fewer fully-connected layers. ResNet architectures do not have biases,
but many of the small buffers are due to the parameters for batch normalization layers.
Because of the widespread use of ResNet and similar architectures, we will focus on it for
the remainder of this chapter.
A key observation to make from these plots is that both networks require allreduces to be
performed on many small buffers. For ResNet-50, a majority of the buffers are 8 KiB or less.
However, there is also a very large range of buffer sizes, spanning 256 bytes to megabytes. A
single algorithm for performing the allreduce is unlikely to perform optimally for all of these




We now empirically examine the communication overhead involved in training ResNet-50
on ImageNet [5] in various configurations. Our goal in this section is to understand the
baseline performance, which can then be improved upon. We utilize a simple synthetic
benchmark that incorporates the compute cost of convolutional layers (the primary compu-
tational cost in ResNet-50) and the communication cost of synchronizing layer gradients.
The compute time is determined by benchmarking the runtime of the relevant cuDNN [123]
routines for convolution on the local problem size of each convolutional layer. Communica-
tion time is determined by benchmarking allreduces of the relevant sizes, using the NCCL
collective communication library [73]. We assume that a separate allreduce is performed
on each buffer. We neglect the fully-connected layer for simplicity; as it is a small layer,
this does not significantly affect our results. Note that this benchmark is meant to illus-
trate the major sources of communication and computation, and neglects many aspects of a
full training pipeline, such as I/O, optimization, activation layer computation, and internal
synchronization.
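The structure of this benchmark can be summarized as a simple additive model. The sketch below is an assumption-laden illustration, not the benchmark itself: the per-layer times are placeholder values rather than measurements, and the function name is hypothetical.

```python
def iteration_time_no_overlap(conv_times_us, allreduce_times_us):
    """Model one training iteration with no communication/computation overlap.

    conv_times_us: benchmarked convolution times per layer (microseconds).
    allreduce_times_us: benchmarked allreduce times, one per parameter buffer.
    """
    compute = sum(conv_times_us)
    communication = sum(allreduce_times_us)
    total = compute + communication
    return {
        "compute": compute,
        "communication": communication,
        "total": total,
        "comm_fraction": communication / total,
    }


# Placeholder timings (not measurements), for illustration only.
result = iteration_time_no_overlap([400, 300, 300], [50, 150, 300])
print(result)
```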
We run this benchmark on the Sierra supercomputer [149], which consists of 4,320 compute
nodes with two IBM POWER9 CPUs and four Nvidia V100 (Volta) GPUs with NVLink2
per node, interconnected via a dual-rail InfiniBand EDR network. We use CUDA 9.2.148,
cuDNN 7.2.1, and NCCL 2.3. This system has the same architecture as Lassen, but with
slightly older software (see Section 3.5.1). In this chapter, all experiments use only sample
parallelism.
Strong Scaling
To strong scale ResNet-50 training, we keep all parameters constant and increase the
number of GPUs being trained on. The mini-batch size is 256, per the original paper [4].
Due to memory constraints, we cannot train ResNet-50 on fewer than 8 GPUs, and the mini-
batch limits us to at most 256 GPUs. We additionally neglect issues that may be caused by
batch normalization having few samples per node [110, 150].
We plot the mini-batch iteration time, as well as a breakdown of computation versus
communication, in Figure 4.3 (left). As the number of GPUs increases, the computation
time decreases, but the scaling is unfortunately sublinear. Simultaneously, communication
requirements increase as more nodes are involved while the number of iterations remains
constant. Runtimes improve up to 32 GPUs, after which communication overheads outweigh
the benefits. The communication/computation ratio rapidly increases, and even at only 32
GPUs accounts for more than half the runtime.
Weak Scaling
For weak scaling, we keep every parameter but the mini-batch size fixed and train with 32
samples per GPU. This offers a good compromise between GPU utilization and memory re-
quirements. Note that as the mini-batch size increases, the number of iterations to complete
an epoch decreases (it is 4955 iterations when the mini-batch size is 256).
We plot total epoch time, again with a communication/computation breakdown, in Fig-
ure 4.3 (right). In this case, computation scales linearly. The total time for communication
decreases as the number of GPUs increases, because fewer iterations are performed, resulting
in fewer rounds of communication, although this trend breaks down for large numbers of
GPUs. However, the ratio of communication to computation steadily worsens, resulting in
a nearly 6x ratio of communication to computation on 1024 GPUs and 12x on 2048 GPUs.
Despite this, it remains profitable to weak scale ResNet-50 training to this scale, though it
suffers from significantly diminishing returns.
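The relationship between GPU count, mini-batch size, and iterations per epoch in the weak scaling setup can be sketched as follows. This assumes that the number of samples per epoch is fixed by the stated 4955 iterations at a mini-batch size of 256, and that a final partial mini-batch counts as an iteration; both are assumptions made for illustration.

```python
SAMPLES_PER_GPU = 32          # weak-scaling setting from the text
BASELINE_MINI_BATCH = 256
BASELINE_ITERATIONS = 4955    # iterations per epoch at mini-batch size 256, from the text


def mini_batch_size(num_gpus):
    """Global mini-batch size when every GPU processes SAMPLES_PER_GPU samples."""
    return SAMPLES_PER_GPU * num_gpus


def iterations_per_epoch(num_gpus):
    """Iterations per epoch, assuming a fixed number of samples per epoch."""
    samples = BASELINE_ITERATIONS * BASELINE_MINI_BATCH
    batch = mini_batch_size(num_gpus)
    return -(-samples // batch)  # ceiling division


for gpus in (8, 64, 512, 2048):
    print(gpus, mini_batch_size(gpus), iterations_per_epoch(gpus))
```

Each iteration still performs a full set of allreduces, so fewer iterations mean fewer rounds of communication per epoch, even though each round involves more GPUs.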
4.1.4 Model-Parallel Fully-Connected Layers
We briefly discuss the differences in communication when using model-parallel fully-
connected layers (see Section 2.2.4). These essentially implement a distributed matrix prod-
uct, which can be thought of as a collective operation involving every processor. Commu-
nication is now required in both forward and backward propagation to compute the layer’s
output, error signal, and gradients; however, no additional communication is needed to syn-
chronize the gradient update. Since matrix products typically require their input data to
have a particular distribution (e.g. blocked), data may need to be moved from a different
distribution for this. The communication operations performed depend on the algorithm
being used, but typically involve a variety of collectives beyond allreduce.
4.2 COMMUNICATION OPTIMIZATIONS
We now discuss two basic optimizations for reducing communication overhead and improving
performance: overlapping and latency-efficient allreduce algorithms. Neither of these
techniques is new. Overlapping communication and computation during training has been
discussed before (e.g. [103]), and we will provide additional detail on implementing it
with GPUs. Latency-efficient allreduces are similarly not new [69]; however, deep learning
applications have typically preferred bandwidth-optimized ring-based allreduce implementations
as in the Baidu allreduce [72] or NCCL [73] libraries.
4.2.1 Overlapping
Overlapping communication and computation when training deep nets involves perform-
ing gradient update allreduces concurrently with backpropagation and optimization. Thus,
to maximize the potential for overlapping, each layer should compute its local gradient
update first and then start an asynchronous allreduce on that buffer. The remainder of
backpropagation can be performed in the same manner, and the allreduce completed when
the optimization phase for that layer begins. This enables the allreduce to be hidden by the
error signal computation in the associated layer, and all computation in all remaining layers.
Achieving communication/computation overlap when running on GPUs requires additional
work, as we do not want to block the CUDA stream on which training computations are
performed. We instead make use of separate, internal streams to perform the
communication and handle synchronization as needed.
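The benefit of overlapping can be illustrated with a small timeline model. The sketch below is a simplification under stated assumptions: allreduces are serialized on a single communication channel, the optimization phase starts only after backpropagation finishes, and all times are placeholder values rather than measurements.

```python
def backprop_timeline(grad_times, error_times, allreduce_times):
    """Simulate backprop with allreduces overlapped on a separate channel.

    All lists are ordered from the last layer to the first, the order in which
    backprop visits the layers. Returns (compute_time, total_time); the
    difference between the two is the unoverlapped communication.
    """
    clock = 0       # time on the compute stream
    comm_free = 0   # time at which the communication channel becomes idle
    for grad, err, comm in zip(grad_times, error_times, allreduce_times):
        clock += grad                  # local gradient for this layer is ready
        start = max(clock, comm_free)  # allreduces are serialized in this model
        comm_free = start + comm
        clock += err                   # error signal computation overlaps the allreduce
    return clock, max(clock, comm_free)


# Placeholder per-layer times (arbitrary units), for illustration only.
compute, total = backprop_timeline([2, 2, 2], [1, 1, 1], [3, 3, 3])
print(compute, total, total - compute)
```

With these placeholder values, running the same work without overlap would take 9 + 9 = 18 time units, whereas the overlapped timeline finishes after 11, leaving only 2 units of communication exposed.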
4.2.2 Latency
While performing allreduces as soon as possible helps maximize overlap, it results in
many small allreduces being performed, some as small as 64 parameters (256 bytes). This
size regime is latency-dominated instead of being bandwidth-dominated, and the size of
allreduces that are latency-dominated increases as the number of GPUs increases.
Typically, allreduce libraries for deep learning have been bandwidth-optimized and employ
ring-based algorithms [72, 73]. These algorithms perform very well in multi-GPU shared-
memory systems (especially ones optimized to have ring topologies, such as the Nvidia
DGX1) or at small distributed-memory scales despite not being latency-optimized. They
are adequate for models that have few small allreduces, such as AlexNet (see Figure 4.1).
Butterfly-based allreduce algorithms can offer much better performance in latency dom-
inated regimes. In particular, recursive-doubling algorithms should be preferred for small
messages, and have optimal latency. Recursive-halving/doubling (also called Rabenseifner’s
algorithm) has slightly worse latency, but better bandwidth utilization, and is preferred for
larger messages. Section 2.5 presents performance models for these algorithms. One can
quickly see that, while the bandwidth term for a ring allreduce is nearly constant as the
number of processors p increases, the latency term is linear in p. With many small messages,
this term can rapidly become important. In contrast, butterfly algorithms have much better
latency terms. While Rabenseifner’s algorithm has the same bandwidth term and better
latency than the ring algorithm, the nearest-neighbor communication in rings often enables
them to outperform it in practice for large messages.
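To make the comparison concrete, the sketch below evaluates the standard latency-bandwidth (alpha-beta) cost models for the three algorithms. These are the commonly used textbook forms and may differ in detail from those given in Section 2.5. The sketch assumes p is a power of two, ignores the cost of the reduction itself, and uses illustrative values for alpha and beta that are not measurements of any system.

```python
import math

ALPHA = 5e-6   # assumed per-message latency in seconds (illustrative)
BETA = 1e-10   # assumed per-byte transfer time in seconds (illustrative)


def ring_allreduce_cost(p, n_bytes, alpha=ALPHA, beta=BETA):
    """Ring allreduce: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n_bytes * beta


def recursive_doubling_cost(p, n_bytes, alpha=ALPHA, beta=BETA):
    """Recursive doubling: log2(p) steps, each exchanging the full buffer."""
    steps = math.log2(p)
    return steps * alpha + steps * n_bytes * beta


def rabenseifner_cost(p, n_bytes, alpha=ALPHA, beta=BETA):
    """Recursive halving (reduce-scatter) followed by recursive doubling (allgather)."""
    return 2 * math.log2(p) * alpha + 2 * (p - 1) / p * n_bytes * beta


for n_bytes in (256, 2**20, 2**28):
    print(n_bytes,
          ring_allreduce_cost(1024, n_bytes),
          recursive_doubling_cost(1024, n_bytes),
          rabenseifner_cost(1024, n_bytes))
```

In this model the ring's latency term grows linearly with p while the butterfly algorithms grow logarithmically, which is why small buffers favor recursive doubling. The model does not capture the nearest-neighbor communication advantage that lets rings win in practice for large messages.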
In Section 4.4 we present the Aluminum library, which augments NCCL with butterfly-based
algorithms and dynamically selects the fastest algorithm based on the buffer size and the
number of processors. An additional optimization is to run multiple allreduces concurrently.
In a latency-dominated regime, we are not limited by packet injection rates or similar issues,
but instead by waiting for communication to complete. This enables pipelining the allreduces
to further reduce communication overhead.
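As an illustration of the butterfly communication pattern, the sketch below simulates a recursive-doubling sum-allreduce in a single process. It is not Aluminum's implementation: ranks are modeled as entries of a Python list, message exchange is modeled as list indexing, and the number of ranks is assumed to be a power of two.

```python
def recursive_doubling_allreduce(buffers):
    """Simulate a sum-allreduce; buffers[r] is the input buffer of rank r."""
    p = len(buffers)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch requires a power-of-two number of ranks")
    data = [list(b) for b in buffers]
    distance = 1
    while distance < p:
        # Each rank exchanges its full buffer with the partner whose rank
        # differs in exactly one bit, then reduces the two buffers locally.
        received = [data[rank ^ distance] for rank in range(p)]
        data = [[a + b for a, b in zip(mine, theirs)]
                for mine, theirs in zip(data, received)]
        distance *= 2
    return data


inputs = [[rank, 2 * rank] for rank in range(8)]
outputs = recursive_doubling_allreduce(inputs)
print(outputs[0])
```

After log2(p) exchange steps every rank holds the full sum, which is the source of the algorithm's low latency term.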
4.3 INTERFACING WITH MPI
Modern MPI distributions provide large suites of optimized communication algorithms,
including butterfly-based allreduce algorithms. Many of them are also “CUDA-aware”, in
that they accept pointers to GPU buffers and can perform communication on them. Why
can we not simply use CUDA-aware MPI directly for allreduces when appropriate? Fun-
damentally, we argue that because MPI is unaware of users’ CUDA streams, a semantic
mismatch between the MPI and CUDA programming models arises, leading to communica-
tion and computation overheads due to unnecessary synchronization. We will then discuss
approaches to fixing this mismatch.
4.3.1 Problems
When using CUDA to compute data on a GPU, one typically launches a sequence of
compute kernels on a CUDA stream. The CUDA runtime ensures that kernels launched
on a stream are executed in launch order (there is no ordering between multiple streams
unless one is imposed using explicit synchronization). This means that, provided kernels are
launched in the right order, all of a kernel’s inputs are ready when it begins execution. Kernel launches
(along with most other CUDA operations) are asynchronous and do not block the host, but
there is a cost (roughly 10 µs of CPU time) associated with launching them. For this reason,
one typically launches many kernels in a row without waiting for their completion, pipelining
the launches and hiding the launch latency for every kernel beyond the first.
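The effect of pipelining launches can be seen in a simple model. The sketch below assumes a fixed launch cost of 10 µs (the rough figure given above) and a hypothetical, uniform kernel execution time; it is an illustration of the reasoning, not a measurement.

```python
def synchronous_time(num_kernels, launch_us, exec_us):
    """Host waits for each kernel to finish before launching the next one."""
    return num_kernels * (launch_us + exec_us)


def pipelined_time(num_kernels, launch_us, exec_us):
    """Host launches kernels back to back; the stream runs them in launch order."""
    gpu_free = 0
    for i in range(num_kernels):
        launched = (i + 1) * launch_us  # host finishes issuing kernel i
        # A kernel starts once it has been launched and its predecessor is done.
        gpu_free = max(gpu_free, launched) + exec_us
    return gpu_free


# 100 kernels, 10 us launch cost, hypothetical 50 us execution time each.
print(synchronous_time(100, 10, 50), pipelined_time(100, 10, 50))
```

In the pipelined case only the first launch is exposed; every later launch is hidden behind the execution of earlier kernels, as long as kernels run longer than they take to launch.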
MPI runtimes are unaware of users’ CUDA streams. Therefore, when a user passes a
GPU buffer to an MPI routine, MPI has no way to determine whether there is a pending
computation on a stream that will write to the buffer. To ensure correctness when a kernel
may write to the buffer, the user must synchronize the stream to complete pending compu-
tation. This forces the application into a bulk-synchronous model of separated computation
and communication phases, preventing pipelining of kernel launches and overlapping of com-
munication and computation. Similarly, when MPI communication is in progress, there is
no way for a stream to wait for a blocking operation’s completion (e.g. MPI_Allreduce or
MPI_Wait). This further means that other streams that might synchronize with the first
stream also need to be blocked.
Alternating computation and communication phases in this manner leads to an awkward
and error-prone programming model, and underutilization of both the network (during com-
putation phases) and GPU (during communication). Frequent blocking on the host also
limits the ability to overlap communication and computation with other operations, such
as I/O. In the context of training deep nets, I/O can be quite expensive, so hiding it is
crucial. Finally, when latency-optimized communication is necessary for scaling, minimizing
additional synchronization is important.
A further concern with using CUDA-aware MPI is practical. We have observed that
CUDA-aware MPI runtimes often do not handle operations with GPU buffers correctly
when they are performed from multiple threads, even when MPI_THREAD_MULTIPLE is en-
abled. We hope that this can be resolved by improved documentation and bug fixes by MPI
distributions.
4.3.2 Possible Solutions
One solution that achieves correctness is to push the synchronization into the MPI library.
Since it is unaware of which user stream is producing the buffer to be communicated, the
library must synchronize the entire device, either explicitly or via CUDA’s default stream
semantics. This resolves none of the performance issues noted above.
A more promising solution is to treat MPI communication operations as “just another
kernel” to be enqueued on a stream. As a proof-of-concept, NCCL operations take a stream
as an argument and employ the usual kernel launch semantics: they do not block the host, are
ordered within the stream, and block the stream.
Unfortunately, MPI operations cannot take a stream parameter. However, we find it
sufficient to associate a single stream with a communicator. Every operation that uses the
communicator and a GPU buffer can then assume that the buffer is written to by some
kernel on that stream, and perform the appropriate synchronization with respect to only
that stream. Within MPI, this association can be implemented as an attribute attached to
the communicator. To achieve good performance, the implementation can then make use
of fine-grained CUDA events and other synchronization, driven by a background thread,
to progress communication without blocking execution. We have taken this approach and
implemented it in our Aluminum library, detailed in the next section.
While this chapter has focused on allreduces, due to their importance in training deep
networks, these approaches are in no way exclusive to allreduces and are applicable to any
communication operation.
Table 4.1: Aluminum allreduce capabilities based on backend.
Backend    Algorithm support                        Features
MPI        Ring, recursive-doubling, Rabenseifner   Ubiquitous, optimized
NCCL       Ring                                     GDR, optimized for GPUs
MPI-CUDA   Ring, recursive-doubling, Rabenseifner   Host-transfer algorithm
4.4 THE ALUMINUM LIBRARY
We have developed the Aluminum library as an open-source communication library². It
provides a generic API for communication operations implemented by multiple backends,
and currently supports MPI, NCCL, and custom implementations of various operations for
both CPU and GPU communication. Aluminum does not replace other communication
libraries, but provides a portable layer to optimized communication substrates as well as
benefiting from the ubiquity of MPI as a baseline. For example, it would be easy to support
AMD’s software stack with little change in applications.
This library encapsulates the proposed optimizations discussed in Sections 4.2 and 4.3.2,
including easy non-blocking operations on both host and GPU, latency-optimized algorithms,
and CUDA-friendly synchronization semantics. It is currently being leveraged by both the
LBANN toolkit and the Hydrogen distributed linear algebra library [151] (a fork of the
Elemental library [135]). Section 2.6 and Figure 2.1 provide more details.
4.4.1 API and Semantics
Aluminum is a C++11 library with an API inspired by MPI’s. This similarity means that
integrating Aluminum into existing applications should be quite simple. In particular, since
NCCL and/or MPI are frequently used by distributed deep learning frameworks, they can
easily take advantage of Aluminum.
² https://github.com/LLNL/Aluminum
It consists of a core providing internal implementation frameworks and three communica-
tion backends (and is extendable to support more):
MPI: provides both an interface to MPI (by directly calling MPI routines) and custom
collective implementations built atop MPI. It is meant to be used with host buffers.
NCCL: provides a direct interface to NCCL for use with GPU buffers.
MPI-CUDA: implements a variety of custom algorithms that are built on top of MPI
and CUDA for use with GPU buffers. This backend implements our “host-transfer”
allreduce. (This is independent of CUDA-aware MPI.)
The backends and notable features are summarized in Table 4.1.
The API to invoke a non-blocking, in-place allreduce (for example) looks like:
Al::NonblockingAllreduce<Backend>(buffer, count, op, comm, req)
where buffer and count define the buffer to be reduced, op is a reduction operation (e.g.
summation), comm is an Aluminum communicator object, and req is a request object. C++
templates are used to infer the type of the buffer and dispatch the operation to the correct
backend. Aluminum also handles algorithm selection where appropriate, making a reasonable
choice based on the buffer and communicator sizes (this can also be manually specified by
the user). The allreduce then proceeds asynchronously, and can be completed via a wait
operation: Al::Wait<Backend>(req). Every backend automatically handles Aluminum’s
synchronization semantics, described below.
Aluminum currently supports a subset of the standard MPI collective operations in both
blocking and non-blocking versions, including: reduce, allreduce, reduce-scatter, allgather,
and broadcast. The MPI-CUDA backend additionally supports the basic send, recv, and
sendrecv point-to-point operations for GPU buffers. The NCCL backend is currently limited
to only the subset of reduction operations that NCCL supports (summation, multiplication,
min, and max); our other backends support a more general set of reduction operations.
The semantics of Aluminum’s blocking and non-blocking operations differ from MPI’s:
Aluminum implements the approach discussed in Section 4.3.2 in a manner that provides a fairly
generic interface for both CPU and GPU operations. We associate a “stream of computation”
with each communicator. For GPU backends, this is a CUDA stream. For the MPI backend,
this stream is implicit, and can be thought of as the calling thread or process; this could be
made explicit in the future to better support threading or lightweight threading libraries.
All operations then synchronize the communicator’s stream as necessary. This is critically
important for GPU operations, where it means that no GPU operation blocks the host. From
the example above, if the Al::Wait operation is used with the MPI-CUDA backend, it will
(perhaps counterintuitively) not block the host, but instead block comm’s CUDA stream until
the allreduce has completed.
4.4.2 Implementation Details
We now detail some of the notable implementation details for Aluminum.
Communication Engine
Any communication that must perform operations on the host without blocking the main
thread of execution needs to be run in a separate, dedicated thread that serves as the
communication or progress engine. This thread is automatically bound by the library to a core,
and uses some basic heuristics to avoid conflicting with both other processes that may be
on the same node and other threads (e.g. OpenMP compute threads) that the application
may spawn. Asynchronous operations are submitted to the communication engine as a state
object that encapsulates the operation to be performed and any necessary state (essentially,
a closure). Submission is done via a lock-free single-producer, single-consumer queue (im-
plemented as a classic Lamport queue [152] with modifications described in [153], and could
be generalized to an MPSC queue). The engine maintains an internal queue of currently
running state objects, and invokes a step method on them, which should not block. When
the operation has completed, the engine can optionally indicate this to other threads by
atomically setting a flag in a request object.
This implementation approach is inspired by the communication engines that have been
used in other high-performance communication libraries [154, 155].
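The overall shape of such an engine can be sketched as follows. This is a simplified illustration, not Aluminum's implementation: Python's thread-safe `queue.Queue` stands in for the lock-free SPSC queue, `threading.Event` stands in for the atomic flag in a request object, the engine sleeps briefly when idle instead of polling continuously, and the `CountdownOp` state object is hypothetical.

```python
import queue
import threading


class CountdownOp:
    """Hypothetical state object that completes after `steps` calls to step()."""

    def __init__(self, steps):
        self.remaining = steps
        self.done = threading.Event()  # stands in for the flag in a request object

    def step(self):
        """Advance without blocking; return True once the operation is complete."""
        self.remaining -= 1
        return self.remaining <= 0


class ProgressEngine:
    """Dedicated thread that repeatedly steps all in-flight operations."""

    def __init__(self):
        self._submitted = queue.Queue()  # stand-in for the lock-free SPSC queue
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, op):
        self._submitted.put(op)

    def shutdown(self):
        """Finish all submitted operations, then stop the engine thread."""
        self._stop.set()
        self._thread.join()

    def _run(self):
        running = []
        while running or not self._stop.is_set() or not self._submitted.empty():
            try:
                if running:
                    running.append(self._submitted.get_nowait())
                else:
                    # Nothing in flight: wait briefly rather than spin.
                    running.append(self._submitted.get(timeout=0.001))
            except queue.Empty:
                pass
            unfinished = []
            for op in running:
                if op.step():
                    op.done.set()  # signal completion to other threads
                else:
                    unfinished.append(op)
            running = unfinished


engine = ProgressEngine()
op = CountdownOp(steps=3)
engine.submit(op)
completed = op.done.wait(timeout=5.0)
engine.shutdown()
print(completed)
```

The key property is that `step` never blocks, so a single engine thread can interleave progress on many outstanding operations.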
Aluminum’s MPI backend utilizes the progress engine to provide asynchronous progress
on the host both for custom algorithm implementations and via MPI_Test polling for non-
blocking MPI operations. We do this because we have observed that MPI implementations
often do not make adequate progress on their own without polling (see also e.g. [156]). The







Figure 4.4: Performing a non-blocking allreduce using NCCL. Data is computed on a stream
by the application, and a separate, internal stream is synchronized to the first. This stream
performs the NCCL allreduce, while the data stream can perform other computation. When
the result is needed, the internal stream can be synchronized back to the data stream. (Red
boxes are synchronization, such as CUDA events. Boxes are not to scale.)
Figure 4.5: Implementation of the blocking host-transfer allreduce. The data buffer to
transmit is computed on the data stream, after which a device-to-host memcpy
moves the buffer to the host. CUDA event synchronization is used to determine when the
transfer has completed, after which an MPI allreduce is performed. Meanwhile, the data
stream is blocked with a wait operation, until the host signals completion, after which a
host-to-device memcpy transfers the buffer back. A second event signals completion of this
















Figure 4.6: Implementation of the non-blocking host-transfer allreduce. This is similar to
the blocking version, but run on an internal stream and a separate completion operation is
used to invoke the synchronization with the data stream to complete the operation. (Boxes
are not to scale.)
Non-Blocking and Host-Transfer Allreduce
Aluminum has a heavy focus on non-blocking communication with GPU buffers. For
the NCCL backend, non-blocking allreduces are automatically run on one of Aluminum’s
internal CUDA streams, described in Figure 4.4. The Al::Wait operation implements the
synchronization to complete the communication. This allows communication to proceed
without blocking the user’s stream or the host. In our experiments and profiling, we have
observed that this strategy enables excellent communication/computation overlap.
For latency-dominated workloads, we have implemented a “host-transfer” allreduce that
encapsulates MPI’s allreduce algorithms. As described in Section 4.2.2, these can be sig-
nificantly more performant than NCCL in the right regimes. At a high level, this imple-
mentation simply transfers the GPU memory to the host, performs the allreduce in host
memory using MPI, and transfers the result back to the GPU. To avoid the caller blocking
the host, the operation enqueues the necessary kernels and events on the communicator’s
stream, and then delegates communication to Aluminum’s communication engine. Polling
on CUDA events is used to determine when memory transfers have completed. To block the
stream while communication is in progress, the cuStreamWaitValue32 operation from the
CUDA driver API is used3. This prevents any work submitted to the stream after the call
from beginning until a memory location is written. The entire process is described in more












Figure 4.8: Implementation of the host-transfer receive operation. (Boxes are not to scale.)
operations for NCCL, by running on an internal stream (see Figure 4.6).
Because we transfer the entire GPU buffer to the host, this approach could be signifi-
cantly optimized by utilizing GPUDirect RDMA (GDR) [157], and by pipelining for longer
messages.
While we have described and implemented this “host-transfer” approach for allreduces, it
can be applied to any communication operation. We briefly describe applying this approach
to send and recv operations next.
Other Operations
Send and recv operations that support Aluminum’s semantics for GPU buffers are useful
to support applications that require fine-grained communication, irregular communication
patterns, and as building blocks for custom implementations of collectives. Both operations
can be implemented similarly to the host-transfer allreduce. We use these to implement the
inter-node halo exchanges used in spatial parallelism (see Section 3.2).
For a send operation, we transfer the data from the GPU to the host and then use
MPI_Isend within the communication engine to perform the completion. The communi-
cator’s stream does not need to be blocked: similarly to MPI’s semantics, we consider it
locally complete when the user buffer can be reused. For recv, the communication engine
can begin an MPI_Irecv immediately while blocking the communicator’s stream. Once complete,
the buffer is transferred to the GPU and the stream is notified. These are illustrated in
Figures 4.7 and 4.8.
Using these operations as primitives, we have implemented our own ring allreduce in
Aluminum’s MPI-CUDA backend. This allreduce pipelines communication and host/GPU
memory transfers, supports both single- and bi-directional rings, and performs reduction
operations on-GPU. While this implementation is not always competitive with NCCL’s (in
particular, it does not take advantage of GDR), it does enable additional flexibility by




To demonstrate the advantages of our overlapping and latency optimizations, we apply
the same benchmark as in 4.1.3, now using Aluminum.
Overlapping
Figure 4.9 plots the runtime and unoverlapped communication/computation breakdown
for strong and weak scaling using NCCL with overlap (compare with Figure 4.3). At small
scales, we successfully overlap nearly all communication; indeed, for weak scaling, commu-
nication is not a significant factor until 256 GPUs.
For strong scaling, the runtime improves in every case; however, beyond 32 GPUs there
is simply too much communication and insufficient computation to hide it. In particular,
because many allreduces can only be started toward the end of backprop, allreduces later
in backprop always have less computation available to hide them. Nonetheless, overlapping
still reduces communication overhead in these cases. 32 GPUs remains the optimal number
to use in this case, and runtime is improved by ∼1.4x here.
Figure 4.9: Strong and weak scaling results for ResNet-50 using our synthetic benchmark on
Sierra, using Aluminum+NCCL with communication/computation overlap. The bars break
down runtime by computation and unoverlapped communication.









































































Figure 4.10: Strong scaling results for ResNet-50 using our synthetic benchmark on Sierra,
using Aluminum to dynamically select either NCCL or our latency-optimized host-transfer
allreduce. The bars break down runtime by computation and unoverlapped communication.
Figure 4.11: Weak scaling results for ResNet-50 using our synthetic benchmark on Sierra,
using Aluminum to dynamically select either NCCL or our latency-optimized host-transfer
allreduce.






















































Figure 4.12: Communication time and speedup for strong scaling for ResNet-50 in our
synthetic benchmark. The NCCL and minimal lines plot the absolute communication time
at that scale, and the speedup line plots the improvement of the minimal algorithm over
NCCL at that scale.


























































Figure 4.13: Communication time and speedup for weak scaling for ResNet-50 in our syn-
thetic benchmark. The NCCL and minimal lines plot the absolute communication time at
that scale, and the speedup line plots the improvement of the minimal algorithm over NCCL
at that scale.
For weak scaling, the constant amount of local computation means that Aluminum is able
to hide more of the communication. Unfortunately, at very large scales, communication
overheads with NCCL remain too high, and profitability for weak scaling is very low beyond
256 GPUs. At 2048 GPUs, the runtime is almost entirely communication.
Latency
To demonstrate the different regimes in which NCCL and our latency-optimized host-
transfer allreduce are better, we conducted a simple benchmark comparing their performance
across a range of node/GPU counts (2-512 nodes/8-2048 GPUs) and buffer sizes (1-2^28
parameters) on Sierra. For each configuration we computed the average over ten runs of the
in-place version of each allreduce algorithm, after a warmup run. The underlying MPI distribution
was MVAPICH2 2.3rc2.
Figure 4.15 plots the actual performance results for each scale. We can see that NCCL has
a significant advantage at the smallest scale (two nodes), which gradually disappears as the
number of nodes increases. At small scales, the impact of latency is smaller, so the difference





































Figure 4.14: The fastest allreduce algorithm for a given number of GPUs and buffer size
on Sierra. A green dot marks configurations where our host-transfer allreduce is fastest; a red
triangle marks those where NCCL is. The host-transfer point for 32 GPUs appears to be due
to a protocol change or similar within NCCL.
starts performing better at 64 GPUs, and at 128 GPUs, it is over 2x faster than NCCL for
small messages. At 2048 GPUs, this increases to over 20x. Further, the tree-based allreduces
scale much better with increasing node count than NCCL’s ring-based allreduce.
Figure 4.14 plots which implementation is faster for a given configuration, providing a
summary of Figure 4.15. Once running on 64 GPUs (16 nodes), the host-transfer allreduce
outperforms NCCL for messages of up to 32768 (2^15) parameters. At the largest scale, the
host-transfer allreduce is preferred for messages up to 2^19 parameters.
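The crossover points above suggest a simple table-driven selection rule. The following Python sketch is illustrative only: the names `select_allreduce` and `HT_CUTOFFS` are hypothetical and are not part of Aluminum. Only the 64-GPU and 2048-GPU entries come from the text, and reusing the cutoff of the nearest smaller benchmarked scale for intermediate GPU counts is a conservative assumption.

```python
import bisect

# (GPU count, largest buffer size in parameters for which host-transfer wins).
# Only these two entries are taken from the text; a real table would hold
# measured cutoffs for every benchmarked scale.
HT_CUTOFFS = [(64, 2**15), (2048, 2**19)]


def select_allreduce(num_gpus, num_params):
    """Return "host-transfer" or "nccl" for one allreduce call."""
    scales = [gpus for gpus, _ in HT_CUTOFFS]
    # Index of the largest tabulated scale that does not exceed num_gpus.
    pos = bisect.bisect_right(scales, num_gpus) - 1
    if pos < 0:
        return "nccl"  # below the smallest tabulated scale, always use NCCL
    cutoff = HT_CUTOFFS[pos][1]
    return "host-transfer" if num_params <= cutoff else "nccl"


print(select_allreduce(32, 1024))      # nccl
print(select_allreduce(64, 2**15))     # host-transfer
print(select_allreduce(2048, 2**20))   # nccl
```

Because the rule depends only on the GPU count and the buffer size, it can be evaluated once per layer before training starts.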
It may be somewhat surprising that NCCL performs well even for very small messages
up to 32 GPUs. We attribute this to two factors. First, NCCL is able to take advantage
of GPUDirect RDMA [157] and node-local topology information, to reduce communication
overhead and latency. Second, our implementation is a prototype whereas NCCL is an
optimized production library.
It is important to observe that the size range where the host-transfer allreduce outper-
forms NCCL corresponds to a significant portion of the allreduces required when training
AlexNet or ResNet-50 (see Figures 4.1 and 4.2). While these allreduces also tend to be faster,
improving their performance helps to reduce communication overheads during training.









































































































































































































































































that is a hybrid of the host-transfer and NCCL allreduces. This algorithm uses our prior
benchmarking results to select the fastest implementation for a given input configuration.
The results for strong and weak scaling (with and without overlap) are presented in Figures
4.10 and 4.11.
Strong scaling benefits less from the better allreduce algorithms, as the regime where it is
profitable is not significantly impacted by them. Nonetheless, at larger scales communica-
tion overhead is significantly reduced. This implies that with a better implementation and
improved compute scaling, we may be able to successfully strong-scale training further.
Weak scaling exhibits a more noticeable impact, dramatically improving the performance
at large scales. Whereas NCCL, even with overlap, barely improves performance beyond 256
GPUs, the minimal algorithm sees continued profit in scaling to 2048 GPUs. Furthermore,
communication overhead, while still quite high, is significantly improved, by over 5x at 2048
GPUs.
To illustrate more directly the communication improvements within the benchmark, we
plotted only the communication time for both NCCL and the minimal algorithm in Fig-
ures 4.12 and 4.13. Here we can quite clearly see that speedups in communication begin
at 64 GPUs; below that, the minimal algorithm is identical to NCCL. Beyond 64 GPUs,











































NCCL+HT speedup over NCCL
Figure 4.16: Strong scaling for end-to-end training of ResNet-50 in LBANN with Aluminum
on Sierra.
than NCCL alone at 2048 GPUs.
We investigated running multiple allreduces concurrently, but we have observed that
NCCL performs only a single allreduce at a time, even if multiple allreduces could be exe-
cuted. While our host-transfer allreduce does not have this restriction, we did not incorporate
this optimization into our benchmarks here.
4.5.2 Training
To evaluate end-to-end training in a real environment, we integrated Aluminum into the
LBANN toolkit (see Section 2.6). We train ResNet-50 on the ImageNet-1K dataset [5] using
Sierra, with data being read off a Spectrum Scale parallel filesystem, again using only sample
parallelism.
Strong scaling is performed by fixing the mini-batch size to 256, the default, and increasing
the number of GPUs. Weak scaling fixes a per-GPU mini-batch size of 32, increasing the
global mini-batch size as the number of GPUs increases. This results in fewer iterations
being performed per epoch. Note this is the same setup as in our synthetic benchmark.
MVAPICH2 v2.3 was used as the underlying MPI distribution (see Section 4.1.3 for other
system/software details). Experiments were performed using up to 256 GPUs on 64 nodes4.
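The two scaling regimes can be made concrete with a little arithmetic. The Python sketch below is illustrative: the function names are hypothetical, the training-set size of 1,281,167 images is the commonly cited ImageNet-1K figure and is an assumption here, and rounding a partial final mini-batch up to a full iteration is also an assumption.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # assumed ImageNet-1K training set size


def strong_scaling(num_gpus, global_batch=256):
    """Fixed global mini-batch; the per-GPU share shrinks as GPUs are added."""
    return global_batch, global_batch / num_gpus


def weak_scaling(num_gpus, per_gpu_batch=32):
    """Fixed per-GPU mini-batch; the global mini-batch grows with GPU count."""
    return per_gpu_batch * num_gpus, per_gpu_batch


def iterations_per_epoch(global_batch, dataset_size=IMAGENET_TRAIN_IMAGES):
    """Number of mini-batches per epoch, counting a partial final batch."""
    return math.ceil(dataset_size / global_batch)


for gpus in (8, 64, 256):
    s_global, s_local = strong_scaling(gpus)
    w_global, _ = weak_scaling(gpus)
    print(gpus, s_local, iterations_per_epoch(s_global),
          w_global, iterations_per_epoch(w_global))
```

Under these assumptions, strong scaling at 256 GPUs leaves one sample per GPU with 5005 iterations per epoch, while weak scaling at 256 GPUs uses a global mini-batch of 8192 and 157 iterations per epoch.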
We compare three configurations: LBANN using CUDA-aware MPI, Aluminum with
NCCL, and Aluminum with NCCL and the host-transfer allreduce (HT). In the last configu-








































NCCL+HT speedup over NCCL
Figure 4.17: Weak scaling for end-to-end training of ResNet-50 in LBANN with Aluminum
on Sierra. Note the log scale.
ration, a static performance model selects between NCCL and HT, similar to the “minimal”
algorithm in our prior benchmarks. Notably, based on our benchmarks, NCCL is preferred
exclusively when running on fewer than 64 GPUs. Strong and weak scaling results are
presented in Figures 4.16 and 4.17.
Both strong and weak scaling exhibit similar trends to those in our benchmark (compare
with Figures 4.10 and 4.11). However, I/O is now a major factor in runtime, which the
synthetic benchmark did not reflect. Other computations (optimization, etc.) are also present.
This results in additional work that communication can be overlapped with, reducing overhead.
CUDA-aware MPI is significantly outperformed by both Aluminum configurations. Alu-
minum+NCCL is ∼2.4x and ∼1.5x faster than CUDA-aware MPI for strong scaling at 64
GPUs and weak scaling at 256 GPUs, respectively. Aluminum+NCCL+HT is ∼2.5x and
∼1.9x faster in these cases. Aluminum’s semantics for communication with GPU buffers
means that both NCCL and the host-transfer allreduce are asynchronous with respect to
the host, enabling I/O to be overlapped much more extensively.
When strong scaling, this additional work enables scaling to be profitable up to 64 GPUs
(compared to 32 GPUs in our benchmark), after which communication overheads and poor
compute scaling begin to dominate. At 64 GPUs, the host-transfer algorithm begins to
slightly improve communication performance, resulting in a ∼1.05x improvement in runtime,
which is commensurate with the modest improvements over NCCL our benchmarks show at
this scale. We see larger speedups with more GPUs, despite it not being profitable; future
communication optimizations may enable strong scaling at these scales.
Weak scaling shows improvements of similar magnitude, except the better compute scal-
ing means that it is profitable up to 256 GPUs. The host-transfer algorithm again shows
improvements beginning at 64 GPUs, and results in a ∼1.25x performance improvement
over NCCL at 256 GPUs.
Overall, both strong and weak scaling demonstrate the advantages of Aluminum over
vanilla CUDA-aware MPI, and, at larger numbers of GPUs, the importance of taking latency
into consideration when selecting communication algorithms.
CHAPTER 5: COMMUNICATION QUANTIZATION
I now present quantization techniques to reduce communication volume, and an approach
to incorporate quantization more directly into decentralized allreduces¹. The key idea behind
quantization is to trade increased local compute time for decreased communication volume; if
quantization can be done fast enough, this will decrease the overall time for a communication
operation.
5.1 EXISTING QUANTIZATION ALGORITHMS
We begin with a brief overview of two existing quantization algorithms, one-bit and thresh-
old quantization, which we will examine throughout this chapter.
5.1.1 One-bit Quantization
One-bit quantization, also known as one-bit SGD, was introduced by Seide et al. [76].
At a high level, one-bit quantization compresses each word of its input vector to one bit,
enabling it to take on one of two values. Assuming input data requires 32 bits per word,
this reduces the data volume to 1/32 of the original (excluding codebook overhead).
We now make this more precise. For an input vector v, each word v_i is quantized as follows:
\[
Q(v_i) = \begin{cases} 0 & \text{if } v_i \geq 0 \\ 1 & \text{if } v_i < 0 \end{cases} \tag{5.1}
\]
though the choice of which case to assign to which bit is arbitrary. The resulting bits are then
packed into a bit vector. To dequantize the quantized vector and recover the original (up to
quantization error), we need to construct a codebook, which maps the quantized values
to a reconstructed value. We write τ+ for the reconstruction value for non-negative
entries (quantized to 0) and τ− for the reconstruction value for negative entries (quantized
to 1). Given that we must pick a single value for each case, the optimal choice is the mean
of the non-negative and negative values, respectively². This codebook is transmitted with
the quantized vector, and is used to implement the corresponding dequantization function, U.
¹ This chapter incorporates my work from [78].
² This is optimal in the sense that selecting the centroid minimizes the mean-squared error of the
reconstruction.
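The quantization and dequantization steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, the bits are kept in a plain list instead of being packed into a bit vector, and an empty class is given a reconstruction value of 0.0 as an assumption.

```python
def one_bit_quantize(v):
    """Quantize each entry to one bit and build the two-entry codebook."""
    bits = [0 if x >= 0 else 1 for x in v]          # Equation 5.1
    nonneg = [x for x in v if x >= 0]
    neg = [x for x in v if x < 0]
    # Reconstruction values are the class means (0.0 if a class is empty).
    tau_pos = sum(nonneg) / len(nonneg) if nonneg else 0.0
    tau_neg = sum(neg) / len(neg) if neg else 0.0
    return bits, (tau_pos, tau_neg)


def one_bit_dequantize(bits, codebook):
    """Dequantization function U: map each bit to its reconstruction value."""
    tau_pos, tau_neg = codebook
    return [tau_pos if b == 0 else tau_neg for b in bits]


v = [1.0, -2.0, 3.0, -4.0]
bits, codebook = one_bit_quantize(v)
print(bits)                                 # [0, 1, 0, 1]
print(codebook)                             # (2.0, -3.0)
print(one_bit_dequantize(bits, codebook))   # [2.0, -3.0, 2.0, -3.0]
```

Only the bits and the two codebook entries need to be transmitted; the receiver recovers an approximation of v from them.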
One-bit quantization introduces significant error because of the aggressive quantization.
To combat this, Seide et al. additionally make use of error feedback, which they find to be
necessary to stabilize training. The idea behind error feedback is to observe that, locally, each
processor can compute the error from one-bit quantization; this is stored locally and added
to the vector in the subsequent round. Using subscripts to denote iterations (beginning with
1), error feedback is performed as follows:
\begin{align}
E_0 &= 0 \tag{5.2} \\
q_t &= Q(v_t + E_{t-1}) \tag{5.3} \\
E_t &= v_t - U(q_t). \tag{5.4}
\end{align}
This implements a form of delayed SGD updates, similar to what can occur with stale
updates in asynchronous SGD.
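The error feedback loop can be sketched as follows. This Python example is illustrative only: the toy quantizer with fixed reconstruction values of +1 and -1 and all function names are hypothetical. It also assumes that the residual is measured against the error-corrected vector v_t + E_{t-1}, so that no part of an update is permanently discarded; that reading of the error term is an assumption.

```python
def sign_quantize(v):
    """Toy Q: one bit per entry (0 for non-negative, 1 for negative)."""
    return [0 if x >= 0 else 1 for x in v]


def sign_unquantize(bits):
    """Toy U with fixed reconstruction values +1 and -1."""
    return [1.0 if b == 0 else -1.0 for b in bits]


def run_error_feedback(updates, quantize, unquantize):
    """Apply Q with error feedback to a sequence of update vectors."""
    error = [0.0] * len(updates[0])                      # E_0 = 0
    reconstructed = []
    for v in updates:
        corrected = [x + e for x, e in zip(v, error)]    # v_t + E_{t-1}
        recon = unquantize(quantize(corrected))          # U(q_t)
        # Assumption: the residual is taken against the corrected vector.
        error = [c - r for c, r in zip(corrected, recon)]
        reconstructed.append(recon)
    return reconstructed, error


updates = [[0.25, -0.5], [0.25, -0.5]]
recon, error = run_error_feedback(updates, sign_quantize, sign_unquantize)
print(recon)   # [[1.0, -1.0], [-1.0, 1.0]]
print(error)   # [0.5, -1.0]
```

In this sketch, the sum of the reconstructed vectors plus the final residual equals the sum of the original updates, which is the sense in which the updates are delayed rather than lost.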
To make use of one-bit quantization, the gradient updates produced by each processor,
when training in a sample-parallel regime, are quantized, and then these quantized vec-
tors are aggregated. To help further mitigate quantization error, instead of using a single
codebook for an entire layer, separate codebooks are computed for sub-blocks of each layer;
typically these are selected such that parameters corresponding to a single neuron are quan-
tized with the same codebook.
5.1.2 Threshold Quantization
Threshold quantization was introduced by Strom [77], and is similar to one-bit quanti-
zation. The basic idea is to introduce sparsity by not sending values smaller than a user-
provided threshold. This fixed threshold hyperparameter, τ , is chosen in advance. Only
values vi such that |vi| ≥ τ will be transmitted; values not transmitted are treated as 0. The
values that are transmitted are quantized to one bit, and use τ or −τ as their reconstruction
value. Error feedback is applied as in one-bit SGD.
To transmit these values, each quantized value is packed as a 1-bit value and a 31-bit
index (larger words could be used if needed). Strom also notes that the indices are amenable
to delta encoding [158, chapter 27] followed by lossless Golomb-Rice coding [159], although
they do not evaluate this.
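The packing of each transmitted value into a 1-bit sign and a 31-bit index can be sketched in Python as below. The function names are hypothetical, and placing the sign in the most significant bit of a 32-bit word is an assumption about the layout; error feedback and the optional delta and Golomb-Rice coding are omitted.

```python
def threshold_quantize(v, tau):
    """Keep entries with |v_i| >= tau, packed as (sign << 31) | index."""
    packed = []
    for i, x in enumerate(v):
        if abs(x) >= tau:
            sign = 0 if x >= 0 else 1
            packed.append((sign << 31) | i)   # assumes i < 2**31
    return packed


def threshold_dequantize(packed, tau, n):
    """Rebuild a length-n vector; untransmitted entries are treated as 0."""
    out = [0.0] * n
    for word in packed:
        index = word & 0x7FFFFFFF             # low 31 bits
        out[index] = -tau if word >> 31 else tau
    return out


v = [0.5, -3.0, 2.0, -0.1]
packed = threshold_quantize(v, tau=1.0)
print(packed)                                    # [2147483649, 2]
print(threshold_dequantize(packed, 1.0, len(v)))  # [0.0, -1.0, 1.0, 0.0]
```

The number of packed words, and hence the message size, depends on how many entries exceed the threshold, which is why the data volume varies from one iteration to the next.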
5.1.3 Issues
There exist several issues with one-bit and threshold quantization, which we will demon-
strate in Section 5.2.3, but will enumerate here.
• One-bit quantization can be computationally expensive, relatively speaking. The quan-
tization operation is typically implemented with a conditional branch, so if data is
distributed symmetrically about zero, branch prediction will be ineffective.
• One-bit quantization is also unable to reduce the data volume below 1/32 of the
original (assuming 32-bit input values).
• In contrast, threshold quantization can be faster; however, properly selecting the
threshold hyperparameter can be quite difficult. One must make an assumption about
the magnitude of the gradients being computed, and further assume that this does
not vary significantly throughout training. This must be done for each layer. Further,
poorly chosen thresholds can lead to a “build up” of error feedback, where few val-
ues are transmitted for several iterations, until the error feedback crosses a threshold,
sending significant quantities of data.
5.2 ADAPTIVE QUANTIZATION
5.2.1 Algorithm
Adaptive quantization was developed to address the observed deficiencies in one-bit and
threshold quantization and to obtain the “best of both worlds”. It is designed to be
computationally fast and minimize quantization error, while having a single simple-to-tune
hyperparameter.
Adaptive quantization uses a fixed proportion hyperparameter, π, which controls the
proportion of data that is transmitted. This is similar to top-k SGD approaches for sparsification.
The algorithm first determines a positive threshold δ+ and a negative threshold δ− such that
only the desired proportion of values is greater than or equal to δ+ or less than or equal to
δ−, respectively. For simplicity, the algorithm in practice sends a proportion 1/π of each of the
non-negative and negative updates without accounting for the relative proportion of the two
classes. To determine the thresholds, suppose there are k non-negative values; we wish to
compute δ+ such that there are k/π values greater than or equal to δ+. Finding this threshold
can be done using a selection algorithm (e.g. quickselect) to find the (k/π)-th largest value.
Computing δ− can be done similarly.
Given these thresholds, adaptive quantization proceeds similarly to threshold quantization,
sparsifying the input data and quantizing the remaining values to one bit. Similarly to
one-bit quantization, the reconstruction values, τ+ and τ−, are computed to be the means of the
values greater than or equal to δ+ and less than or equal to δ−, respectively, which is again optimal.
Error feedback is applied, with the locally accumulated error summed in before computing
the thresholds to ensure a fixed proportion of values is always sent. Also similarly to one-bit
quantization, adaptive quantization can be applied with different codebooks for sub-blocks
of each layer.
As an optimization, instead of doing selection on the entire input vector to compute the
thresholds, we can instead randomly sample a sub-vector of the input and perform selection
on that. Since this uses a much smaller list, computing the now-approximate thresholds is
much faster. Additionally, delta and Golomb-Rice compression can be applied, similarly to
threshold quantization.
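The sampling optimization can be sketched as follows for the positive threshold; the negative threshold is analogous. This Python example is illustrative only: the function name and the `sample_size` parameter are hypothetical, and a sort of the sample stands in for a selection algorithm such as quickselect.

```python
import random


def sampled_threshold(values, pi, sample_size, rng):
    """Approximate the (k/pi)-th largest entry of `values` from a sample."""
    sample = rng.sample(values, min(sample_size, len(values)))
    sample.sort(reverse=True)                 # stand-in for quickselect
    rank = max(1, len(sample) // pi)          # keep roughly 1/pi of the sample
    return sample[rank - 1]


values = [float(i) for i in range(1000)]
rng = random.Random(0)
exact = sampled_threshold(values, pi=4, sample_size=len(values), rng=rng)
approx = sampled_threshold(values, pi=4, sample_size=100, rng=rng)
print(exact)    # 750.0
print(approx)   # an estimate of the exact threshold; varies with the sample
```

When the sample covers the whole vector the threshold is exact; smaller samples trade accuracy of the threshold, and therefore of the transmitted proportion, for less selection work.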
Example
We now examine a simple example of adaptive quantization. Consider the following vector
of twelve elements
v = (1, 3, −5, 1, 0, −3, −5, −1, 2, 5, −4, −2)
and suppose the proportion is π = 2, indicating 1/2 of the values should be transmitted.
The thresholds are computed to be δ+ = 2 and δ− = −4 and the reconstruction values are
computed to be τ+ = 3.33 and τ− = −4.67. Thus, the values 3, −5, −5, 2, 5, and −4 will be
quantized and transmitted.
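The example can be checked with a short Python sketch of the exact (unsampled) algorithm. The function name is hypothetical, a full sort stands in for a selection algorithm, and error feedback is omitted. When several values tie at a threshold, this sketch transmits all of them, which is an assumption.

```python
def adaptive_quantize(v, pi):
    """Select about 1/pi of the non-negative and 1/pi of the negative entries."""
    nonneg = sorted((x for x in v if x >= 0), reverse=True)  # largest first
    neg = sorted(x for x in v if x < 0)                      # most negative first
    # A full sort stands in for a selection algorithm such as quickselect.
    delta_pos = nonneg[max(1, len(nonneg) // pi) - 1] if nonneg else float("inf")
    delta_neg = neg[max(1, len(neg) // pi) - 1] if neg else float("-inf")
    kept_pos = [x for x in v if x >= delta_pos]
    kept_neg = [x for x in v if x <= delta_neg]
    # Reconstruction values are the means of the transmitted entries.
    tau_pos = sum(kept_pos) / len(kept_pos) if kept_pos else 0.0
    tau_neg = sum(kept_neg) / len(kept_neg) if kept_neg else 0.0
    # Each transmitted entry is (index, sign bit); the rest are treated as 0.
    sent = [(i, 0 if x >= 0 else 1) for i, x in enumerate(v)
            if x >= delta_pos or x <= delta_neg]
    return delta_pos, delta_neg, tau_pos, tau_neg, sent


v = [1, 3, -5, 1, 0, -3, -5, -1, 2, 5, -4, -2]
delta_pos, delta_neg, tau_pos, tau_neg, sent = adaptive_quantize(v, pi=2)
print(delta_pos, delta_neg)                     # 2 -4
print(round(tau_pos, 2), round(tau_neg, 2))     # 3.33 -4.67
print([v[i] for i, _ in sent])                  # [3, -5, -5, 2, 5, -4]
```

Six of the twelve entries are selected, matching the requested proportion of one half.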
5.2.2 Implementation
We incorporated one-bit, threshold, and adaptive quantization into LBANN’s sample-
parallel training process by augmenting the global allreduce procedure. We also included
variants of the quantization algorithms incorporating sampling approximations and
delta/Golomb-Rice compression. This work focused primarily on communication optimization, and used
CPUs for all computation and MPI for communication. Hence, the simplest implemen-
tation would be to use MPI_Allreduce with a custom reduction operation to manage the
intermediate reduction operations on quantized vectors. This has several challenges:
• The MPI standard permits the runtime to call reduction operations on arbitrary chunks
of data as a performance optimization. This makes it difficult to associate codebooks
with quantized vectors.
• With threshold and adaptive quantization, the size of the input and output vectors
may vary, due to different amounts of sparsity.
• In practice, MPI implementations fall back to a recursive-doubling allreduce algo-
rithm when given custom reduction operations. This algorithm may have suboptimal
performance in bandwidth-dominated regimes compared to other algorithms (see Sec-
tion 2.5).
To address this, we implemented our own allreduce algorithms on top of MPI non-blocking
send and receive calls. Since we target CPUs only in this study, we do not need to use
Aluminum. For this study, we used a pairwise-exchange reduce-scatter followed by a ring-
based allgather, which worked well for the configurations evaluated here, which are primarily
bandwidth-dominated. Note that when selecting allreduce algorithms for quantized data, it
is important to select the algorithm based on the size of the quantized data, not the original
data.
The sample-parallel communication proceeds as follows: Local gradient updates are quan-
tized and then reduce-scattered. The reduction involves the vectors being unquantized,
summed, and requantized. When using threshold quantization, the same threshold param-
eter τ is used throughout and adaptive quantization uses the same proportion parameter π
throughout. The reduced vectors are then allgathered in quantized form, after which the
final result is unquantized.
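The communication pattern described above can be simulated in memory to show where quantization and unquantization occur. This Python sketch is illustrative only: it does not use MPI, the function name is hypothetical, and quantizing each chunk separately before it is sent is an assumption about how the quantized data is partitioned. Any pair of quantize and unquantize functions can be supplied.

```python
def simulated_quantized_allreduce(local, quantize, unquantize):
    """Simulate a reduce-scatter plus ring allgather over in-memory vectors.

    local[r] is rank r's gradient vector; its length must be divisible by
    the number of ranks. Returns the final vector held by every rank.
    """
    p = len(local)
    c = len(local[0]) // p                       # chunk length

    def chunk(vec, j):
        return vec[j * c:(j + 1) * c]

    # Pairwise-exchange reduce-scatter: rank r ends up owning reduced chunk r.
    acc = [quantize(chunk(local[r], r)) for r in range(p)]
    for step in range(1, p):
        for r in range(p):
            src = (r - step) % p                 # partner sending to r this step
            incoming = quantize(chunk(local[src], r))
            summed = [a + b for a, b in
                      zip(unquantize(acc[r]), unquantize(incoming))]
            acc[r] = quantize(summed)            # unquantize, sum, requantize

    # Ring allgather: chunks travel around the ring in quantized form.
    have = [{r: acc[r]} for r in range(p)]
    for step in range(1, p):
        for r in range(p):
            j = (r - step) % p                   # chunk arriving from the left
            have[r][j] = have[(r - 1) % p][j]

    # Only the final result is unquantized.
    return [[x for j in range(p) for x in unquantize(have[r][j])]
            for r in range(p)]


identity = list  # a lossless stand-in for Q and U, used to check the pattern
result = simulated_quantized_allreduce(
    [[1, 2, 3, 4], [10, 20, 30, 40]], identity, identity)
print(result)   # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

With a lossless stand-in for the quantizer the result is the exact elementwise sum, which makes it easy to verify the communication pattern before substituting a lossy quantizer.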
5.2.3 Evaluation
Here we evaluate one-bit, threshold, and our adaptive quantization in terms of both com-
putational performance and the accuracy of the models they generate. We also examine
variants of these involving compression and sampling. Our baseline is to do no quantization
and simply use MPI_Allreduce on the raw gradient updates.
All tests were run on the Catalyst cluster at Lawrence Livermore National Laboratory [160].
Catalyst consists of 324 nodes, each with two Intel Xeon E5-2695 v2 twelve-core CPUs, 128
GB of RAM, 800 GB of NVRAM, and dual InfiniBand QDR network interfaces.
We use the MNIST handwritten digit dataset [40] for some evaluations. MNIST consists
of 60,000 28× 28 pixel training images containing a single numerical digit and an additional
10,000 test examples. Our test DNN consists of three 4096-neuron fully-connected layers
followed by a ten-unit softmax layer for classification. Layers used ReLU activations [92].
Training used AdaGrad [98] with an initial learning rate of 0.005. This model was distributed
using hybrid sample and model parallelism, with four-way sample-parallelism (four model
replicas) and 48-way model-parallelism. The total of 192 processes were distributed 12 to
each Catalyst node, using a total of 16 nodes. The underlying BLAS routines automatically
make use of the additional cores, and this ratio of processes to cores performs well empirically.
Each model replica processed 10 images in each local mini-batch, for a total mini-batch size
of 40, and used an identical initialization.³
Model Versus Sample Parallelism
A brief but important point to consider is when sample parallelism is superior to model
parallelism. To test this, we compare the time to complete an epoch with our four-way
sample-parallel model versus a single large model using 192-way model parallelism and a
mini-batch of size 40. This avoids all sample-parallel communication, but distributes the
layer parameter matrices over many more processes. The single large model completes an
epoch in an average of 3561 s; the sample-parallel model in 937 s, 3.8 times faster.
Beyond a certain point, the communication overhead of doing distributed matrix opera-
tions begins to dominate, at which point gains from model parallelism decrease and even-
tually become counter-productive. This is where sample parallelism can step in to allow
continued scalability by doing separate matrix operations on fewer processors to maintain
higher efficiency.
Allreduce Benchmark
We next examine the performance of our allreduce implementations on a synthetic
benchmark, using MPI_Allreduce (the “normal” variant) as a baseline. We generate square
matrices of size 64 × 64 to 16384 × 16384, filled with uniformly random 32-bit floats centered
on 0. An allreduce is run for twenty trials, and the average time taken is reported. We use
128 nodes, each with a separate matrix, and two processes on each node. This models the
communication patterns present in sample-parallel training with 128-way sample parallelism
and 2-way model parallelism. The matrices should be thought of as a proxy for the gradient
updates that are allreduced after each mini-batch. For example, the 16384× 16384 case cor-
responds to a layer with 16384 neurons and roughly 268 million parameters, requiring 1 GiB
³ We did not spend much time tuning hyperparameters on this network for maximal accuracy with the
different quantization routines. As MNIST is a small dataset, we use only 16 nodes in most of our evaluations;


















































Figure 5.1: Results of different allreduce implementations on uniformly random matrices
with 128 nodes. Each matrix is square and the x-axis gives its height/width. “WOS”
variants were run without our sampling optimization and variants prefixed with “C.” use
compression. Threshold used τ = 3.875 (chosen to leave 1/32 of the data); adaptive used
π = 64.
of memory to store. Note this is a rather large layer compared to our model for MNIST,
but this benchmark indicates our scaling trend for the larger models applicable to larger
datasets. One-bit quantization typically does a partial AdaGrad step during its allreduce,
which none of our other algorithms do. To make the results more fair, we disable this step
for the purposes of this evaluation. Our results are plotted in Figure 5.1.
For small matrix sizes, MPI_Allreduce is clearly superior in every case. In these cases, the
bandwidth requirements are small enough that the quantization is counter-productive. At
larger sizes, the approaches become more differentiated. Threshold quantization performs
the worst due to a build-up of the error feedback leading to a significant portion of the data
being sent with later allreduces. However, if we examine just the initial iterations before
error feedback has built up, its runtime is competitive at larger scales.
Here, compression is actively harmful to performance due to the increased computation
required (as noted by [77]); with further optimization it may prove useful. Sampling is crucial
for good performance in our adaptive quantization. Not only does it reduce the amount of
data examined, but the selection problem is also performed on a smaller set, which gives a significant
speedup. One-bit quantization receives only a slight benefit from sampling. MPI_Allreduce
remains competitive until the largest matrix sizes, at which point our adaptive quantization
outperforms it due to decreased data volume. For the largest matrices, adaptive quantization
is 1.76 times faster than MPI_Allreduce.
One-bit and threshold quantization do not appear to be competitive with MPI_Allreduce
at any scale; as far as we are aware, they have not been compared with MPI prior to this
work. Much of this is due to the overheads of intermediate reduction operations on quantized
values (see Section 5.3).
Based on these performance results, we do not consider compression further, nor do we
use variants without sampling.
Mini-Batch Performance
Communication is done after each mini-batch completes, so examining metrics for the
quantization algorithms at mini-batch granularity provides insight into fine details of their
performance and enables us to validate expectations. These were measured during the
training of our MNIST model described at the beginning of Section 5.2.3.
We first investigate the amount of data sent in each allreduce, plotted for a representative
fully-connected layer in Figure 5.2. The results for no quantization and one-bit quantization
are as expected: either the full update matrix is sent, or the quantity of data is reduced by a
factor of 32 (plus a small constant additional amount of meta-data to send the reconstruction
values). The adaptive quantization is more interesting. With π = 64, we should expect a data
reduction similar to that of one-bit quantization, but since it uses sampling to approximate
the thresholds to exclude updates, it does not exactly achieve a 32-times reduction. Our
sampling is a good enough approximation, however, for the data volume to never be too
great or too small, as intended. Threshold quantization sends very little data, even with
τ = 0.001, and the volume it does send is very erratic. This results in test accuracies that
are noticeably lower than other methods (see Section 5.2.3), as each model is essentially only
learning from its local mini-batch data.
Figure 5.3 looks at the time taken to perform the allreduce on the same layer. These results
are in line with what our allreduce benchmark (see Section 5.2.3) would predict, given the




















Figure 5.2: Volume of data sent each mini-batch during allreduces for the third fully-
connected layer (other fully-connected layers are similar). The x-axis refers to the particular
mini-batch. MPI_Allreduce and one-bit send the expected amount of data, and adaptive
quantization closely follows one-bit in data volume. Threshold quantization sends very little
data, resulting in poor learning.
size of each process’s local matrix, with the exception of threshold quantization. As it
sends very little data, threshold quantization achieves the lowest communication overhead
by a significant margin. The normal MPI_Allreduce is the next fastest, due to the small
local matrix sizes making quantization potentially counter-productive. It is also moderately
noisy, which we believe to be due to network effects on the shared cluster these tests were
run on. Adaptive quantization follows slightly behind MPI_Allreduce, and additionally
performs quite consistently. One-bit quantization performs noticeably slower than adaptive
quantization, but note that this is not an entirely fair comparison: a partial AdaGrad step
is performed in this allreduce, which the others do not do. This saves some computation
time later; however, even excluding the time to do AdaGrad, one-bit quantization is still




















Figure 5.3: Time spent performing the allreduce and quantization (if any) during each mini-
batch for the third fully-connected layer (other fully-connected layers are similar). The x-axis
refers to the particular mini-batch. These times are as our allreduce benchmark leads us to
expect, except for threshold, which is low due to transmitting almost no data.
slower than adaptive quantization. We do not disable this step, because we used this same
run to evaluate accuracy in the next section. The slight periodicity present in the one-bit
quantization’s timing is related to data-structures being reset after each epoch and does not
have a significant impact on overall epoch time.
Accuracy
It is important that quantization not degrade final test accuracy overly much, or any
performance gains will be mooted by poor models. Table 5.1 reports test accuracies for our
network on the MNIST handwritten digit dataset after twenty training epochs.







Table 5.1: Test accuracy on MNIST after 20 epochs of sample-parallel training.
quantization all achieve comparable accuracies. We do not attribute the small differences
between their accuracies to the quantization algorithms but rather to the particulars of the
initialization. We expect that all three approaches will be able to get comparable accura-
cies with appropriately-tuned hyperparameters. The original work on one-bit quantization
reported that in some instances, quantization benefitted AdaGrad and resulted in a small
accuracy gain, and in others, quantization produced a small accuracy loss. Here, we see
neither effect, though this may be due to the different natures of the data and networks as
one-bit quantization was originally applied to speech DNNs.
Threshold quantization performs noticeably worse than the other variants. It shares very
little data between model replicas, and so each model replica essentially learns only from its
local subset of data. To test this, we ran our model with no sample-parallel communication,
so each model learns solely from its local mini-batches. The best model achieves 98.09%
accuracy, indicating that threshold quantization offers little gain. The trouble here is due
to the difficulty in tuning the τ threshold parameter so that updates are still sent, but there
is meaningful data reduction. [77] used τ = 4.0, which results in no updates ever being
sent between our models. Our use of τ = 0.001 still leads to little data being transmitted.
Indeed, this demonstrates one of the advantages of our adaptive quantization: the threshold
is chosen automatically to achieve a given data reduction.
Sample-Parallel Performance
We examine the scalability of our sample-parallelism as we increase the number of model
replicas while keeping the per-replica mini-batch size constant. While this is not necessarily
the most realistic approach to training a DNN in practice, it provides a good benchmark for
understanding the performance of our quantization at larger scales. We train the DNN ar-
chitecture described in the beginning of Section 5.2.3, but vary the number of model replicas.
Each model replica still uses 48 processes and four nodes. Sample-parallel communication
was quantized using π = 64. The average time taken per epoch is reported in Figure 5.4.
We report only on adaptive quantization as the other approaches scale similarly in this test,















Figure 5.4: Sample-parallel scaling of adaptive quantization with π = 64 and a model with
three 4096-neuron fully-connected layers. We vary the number of model replicas from 1 to 8
with each replica doing a mini-batch of size 10. Replicas use 48 processes across four nodes.
The y-axis reports the average time to complete an epoch over the entire training set. Other
approaches yield similar scalability.
just with different constants.
The scaling trend we see here is excellent and we achieve a 7.5-times speedup when us-
ing eight model replicas. This further validates the viability of large-scale sample-parallel
training.
Lastly, we conduct a benchmark similar to our allreduce benchmark in Section 5.2.3, but
with a full DNN. We begin with our original DNN, consisting of three 4096-neuron hidden
layers using four-way sample parallelism and 48-way model parallelism, and increase the size
of each layer, up to 65,536 neurons.⁴ The average mini-batch time while training on the
⁴ This is not a realistic model for MNIST classification and we use it primarily as an example to
demonstrate scalability. Future work will look at larger datasets where such layers are more appropriate.
[Figure 5.5 plot: “Mini-batch time for sample-parallel training”; x-axis ticks 2^12 to 2^16; series: Normal, Adaptive]
Figure 5.5: Average mini-batch time for training a model with the fully-connected layers of
increasing size using four-way sample parallelism and 48-way model parallelism on 16 nodes.
Adaptive quantization is 1.18 times faster than MPI_Allreduce, despite the large amount of
model parallelism.
MNIST dataset is recorded. This allows us to evaluate our sample-parallel communication
in a more realistic context, albeit still at smaller scales due to the large quantity of model
parallelism lowering the quantity of data each process holds. Our results are in Figure 5.5.
For smaller layers both approaches perform comparably, with MPI_Allreduce being slightly
faster. For larger layers, adaptive quantization overtakes MPI_Allreduce, running 1.18 times
faster in the largest case.
This confirms the cross-over point seen in our allreduce benchmark in a more realistic
scenario: adaptive quantization begins to outperform MPI_Allreduce once the data volume
on each process becomes sufficiently large. Note that while this is a large network, prior work
with LBANN has looked at autoencoders with 400,000 neurons in a layer [70]. Further, there
is a trend toward using very deep models with many layers, which will have similar total
data transfer requirements. Our results indicate that our adaptive quantization will also be
appropriate for scaling such models.
5.3 QUANTIZED ALLREDUCES
A key issue one can observe from the results in Figure 5.1 is that a vanilla MPI_Allreduce
transmitting unquantized 32-bit floating point values outperforms the quantization algo-
rithms in many cases. A significant part of this is due to the overhead of the intermediate
reductions on quantized vectors, which are implemented by first unquantizing both input
vectors, summing them, and requantizing the result. In particular, for one-bit quantization,
this always performs strictly more operations than operating on an unquantized vector,
since floating-point vectors of the same length are summed in both cases, but quantization
overhead is also present, including extensive use of temporary memory. In this section, we
will present techniques to optimize this intermediate reduction step in the particular case of
one-bit quantization, although the techniques should be more broadly applicable.
To make this more concrete, observe that the intermediate reduction step is essentially a
quantized sum, r = q⊕ q′, where q, q′, and r are quantized vectors and ⊕ computes the sum
of the vectors. This sum can be implemented as
r = Q(U(q) + U(q′)) (5.5)
where Q and U are the quantization and unquantization functions, respectively.
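To make the cost of Equation 5.5 concrete, the following is a minimal sketch of the baseline quantized sum. It assumes a simple one-bit codebook in which bit 1 marks a non-negative value and bit 0 a negative one, with the two reconstruction values being the means of each group; the function names are illustrative and are not taken from any existing implementation.

```python
def quantize(values):
    """One-bit quantize: return (bits, tau_pos, tau_neg).

    Assumed convention: bit 1 marks a non-negative value, bit 0 a negative one.
    tau_pos / tau_neg are the means of the non-negative / negative values
    (0.0 is used as a fallback when a group is empty).
    """
    bits = [1 if v >= 0.0 else 0 for v in values]
    pos = [v for v in values if v >= 0.0]
    neg = [v for v in values if v < 0.0]
    tau_pos = sum(pos) / len(pos) if pos else 0.0
    tau_neg = sum(neg) / len(neg) if neg else 0.0
    return bits, tau_pos, tau_neg


def unquantize(q):
    """Expand a quantized vector back to one float per element."""
    bits, tau_pos, tau_neg = q
    return [tau_pos if b else tau_neg for b in bits]


def quantized_sum(q, q_prime):
    """Baseline r = Q(U(q) + U(q')): two full-length temporaries are created."""
    a = unquantize(q)
    b = unquantize(q_prime)
    return quantize([x + y for x, y in zip(a, b)])


q = quantize([0.5, -1.0, 1.5, -3.0])        # tau+ = 1.0, tau- = -2.0
q_prime = quantize([2.0, 4.0, -0.5, -1.5])  # tau'+ = 3.0, tau'- = -1.0
r_bits, r_tau_pos, r_tau_neg = quantized_sum(q, q_prime)
```

Note that every intermediate reduction touches full-length floating-point vectors, even though its inputs and output are only one bit per element.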
5.3.1 Optimized One-Bit Quantization
To provide a strong baseline, we begin by extensively optimizing the one-bit quantization
and unquantization methods. The standard implementation involves a loop over each value
and the use of if statements to select the appropriate code path. When data is distributed
symmetrically about 0, the processor will be unable to predict these branches effectively,
leading to poorly-performing code. Instead, we observe that we can simply extract the
sign bit (using signbit) from each value, which will be 0 for non-negative values and 1
for negative values.⁵ With this sign bit in hand, we can use it directly as the computed
one-bit quantized value, and as an index into accumulators for computing the means for the
⁵ This produces an incorrect result for the floating point -0 and other special values, but these should not
matter in practice. In our benchmarking, using signbit was faster than a branchless comparison such as
val < 0.0f.
codebook. This is combined with standard optimization techniques such as loop peeling in
order to further mitigate unneeded branches.
Note that if the data is not distributed symmetrically about 0, other techniques may
outperform these optimizations.
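The indexing idea can be sketched as follows. This is an illustration only: it extracts the IEEE 754 sign bit through Python's struct module rather than the C signbit macro, the function names are hypothetical, and Python gives none of the branch-prediction benefits discussed above. The bit follows the signbit convention (1 for negative values).

```python
import struct


def sign_bit(value):
    """Return the IEEE 754 sign bit (0 or 1) of value stored as a 32-bit float."""
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    return word >> 31


def quantize_onebit(values):
    """One-bit quantization without a per-value if statement.

    The sign bit is both the quantized value and the index into the
    accumulators used to compute the two codebook means.
    Returns (bits, mean of sign-bit-0 values, mean of sign-bit-1 values).
    """
    sums = [0.0, 0.0]
    counts = [0, 0]
    bits = []
    for v in values:
        s = sign_bit(v)
        bits.append(s)
        sums[s] += v
        counts[s] += 1
    # max(..., 1) guards against an empty group (an assumption of this sketch).
    means = [sums[i] / max(counts[i], 1) for i in range(2)]
    return bits, means[0], means[1]


# The trailing -0.0 has its sign bit set, so it lands in the negative group.
bits, tau_nonneg, tau_neg = quantize_onebit([0.5, -1.0, 1.5, -3.0, -0.0])
```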
5.3.2 Direct One-Bit Quantized Reductions
Instead of optimizing quantization and dequantization in order to accelerate the quantized
sum, we now show how to compute this operation directly, without an intermediate dequan-
tization step. It is still important to have optimized quantization/dequantization routines
for the initial quantization step and final dequantization step, however.
In this section, we will focus on one-bit quantization without error feedback, although the
results could be extended if need be. This is because there is limited utility to employing
error feedback in the intermediate reduction steps, as input vectors are quite homogeneous,
since each contains only two possible values.
The key observation is that computing the new quantized vector can be done using a single
bitwise operation, which is based on the relative value of the reconstruction values in the
two vectors’ codebooks. Consider two quantized vectors with their associated codebooks: x
with τ+ and τ−, and y with τ′+ and τ′−. For two bits xi and yi, if xi = yi = 0, then the new bit is 0;
likewise if xi = yi = 1, the new bit is 1. When the two bits differ, the result depends on the
relative values of the associated reconstruction values. For xi = 0 and yi = 1, the new bit is
\[
\begin{cases}
0 & \text{if } |\tau_-| > \tau'_+ \\
1 & \text{otherwise}
\end{cases}
\tag{5.6}
\]
and for xi = 1 and yi = 0, it is
\[
\begin{cases}
0 & \text{if } |\tau'_-| > \tau_+ \\
1 & \text{otherwise.}
\end{cases}
\tag{5.7}
\]
This can be implemented with a pre-computed lookup table, but we can also apply a
simple function, based on examining a truth-table of the different cases:
xi  yi    |τ−| > τ′+    τ′+ ≥ |τ−|    |τ−| > τ′+    τ′+ ≥ |τ−|
          τ+ ≥ |τ′−|    |τ′−| > τ+    |τ′−| > τ+    τ+ ≥ |τ′−|
0   0     0             0             0             0
0   1     0             1             0             1
1   0     1             0             0             1
1   1     1             1             1             1
In the first case, the output is simply xi; the second, yi; the third xi&yi; and the fourth
xi|yi (where & and | are the bitwise AND and OR operations, respectively). A key advantage
is that these operations can be easily vectorized.
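The case selection can be sketched in a few lines. Here Python integers stand in for packed bit-vectors (bit i of the integer is element i), with bit 1 selecting the positive reconstruction value as in Equations 5.6 and 5.7; the function name and argument order are assumptions made for this illustration.

```python
def reduce_bits(x, tau_pos, tau_neg, y, tau_pos_p, tau_neg_p):
    """Combine two packed one-bit vectors with a single bitwise operation.

    x, y: Python ints used as bit-vectors (bit 1 = positive value, bit 0 =
    negative value). (tau_pos, tau_neg) is the codebook of x and
    (tau_pos_p, tau_neg_p) the codebook of y.
    """
    zero_one_is_zero = abs(tau_neg) > tau_pos_p   # result for xi = 0, yi = 1
    one_zero_is_zero = abs(tau_neg_p) > tau_pos   # result for xi = 1, yi = 0
    if zero_one_is_zero and not one_zero_is_zero:
        return x            # first column of the truth table
    if one_zero_is_zero and not zero_one_is_zero:
        return y            # second column
    if zero_one_is_zero and one_zero_is_zero:
        return x & y        # third column
    return x | y            # fourth column


# Elements 0 and 2 of x are positive; elements 0 and 1 of y are positive.
r = reduce_bits(0b0101, 1.0, -2.0, 0b0011, 3.0, -1.0)
```

Because the choice of operation depends only on the four codebook values, it is made once per pair of vectors, and the operation itself then runs over whole machine words.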
The remaining step is to compute the means of the positive and negative values in the
resulting vector, τ̂+ and τ̂−, for the new codebook. This can be implemented as a weighted
average of the original reconstruction values, with the weighting depending on the number
of output configurations for each of the above cases. Computing these counts can be done
efficiently using population count instructions on appropriately constructed bit-vectors.
As a notational convenience we will write
\[
C(a, b, \mathit{pred}) = \mathbb{1}_{\mathit{pred}} \cdot \left|\{(x_i, y_i) \mid x_i = a \wedge y_i = b\}\right| \tag{5.8}
\]
to be the number of positions in bit vectors x and y where xi = a and yi = b, if the predicate
is true, and 0 otherwise. We can then compute the new reconstruction values as
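\[
\hat{\tau}_+ = \frac{(\tau_+ + \tau'_+)\,C(1, 1, 1) + (\tau_- + \tau'_+)\,C(0, 1, \tau'_+ \ge |\tau_-|) + (\tau_+ + \tau'_-)\,C(1, 0, \tau_+ \ge |\tau'_-|)}{C(1, 1, 1) + C(0, 1, \tau'_+ \ge |\tau_-|) + C(1, 0, \tau_+ \ge |\tau'_-|)}
\tag{5.9}
\]
\[
\hat{\tau}_- = \frac{(\tau_- + \tau'_-)\,C(0, 0, 1) + (\tau_- + \tau'_+)\,C(0, 1, |\tau_-| > \tau'_+) + (\tau_+ + \tau'_-)\,C(1, 0, |\tau'_-| > \tau_+)}{C(0, 0, 1) + C(0, 1, |\tau_-| > \tau'_+) + C(1, 0, |\tau'_-| > \tau_+)}.
\tag{5.10}
\]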
When implementing this step, we need to compute the number of times xi = 1 ∧ yi = 1,
and so on for the other three bit combinations. However, given that we also know the
total number of bits present, we need only compute counts for three of the cases. We take
advantage of this to implicitly compute the xi = 0 ∧ yi = 0 case, since this avoids having a
special case for the last bits in a vector (which are 0s) when its length is not evenly divisible by the
word size. Although we have so far only implemented this scheme on CPUs, we note that it
is easily amenable to parallelization with GPUs.
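A minimal sketch of the codebook update in Equations 5.9 and 5.10 follows. Python integers again stand in for packed bit-vectors, popcount is emulated with a string count rather than a hardware instruction, and the function names and the 0.0 fallback for an empty group are assumptions of this illustration.

```python
def popcount(v):
    """Number of set bits in a non-negative Python int."""
    return bin(v).count("1")


def reduced_codebook(x, tau_pos, tau_neg, y, tau_pos_p, tau_neg_p, n):
    """Reconstruction values of the direct sum of two one-bit vectors.

    x, y: Python ints used as packed bit-vectors of n elements, where
    bit 1 selects the positive value and bit 0 the negative one.
    """
    c11 = popcount(x & y)
    c01 = popcount(~x & y)      # positions with xi = 0, yi = 1
    c10 = popcount(x & ~y)      # positions with xi = 1, yi = 0
    c00 = n - c11 - c01 - c10   # implicit, so padding zeros need no special case

    pos_sum, pos_cnt = (tau_pos + tau_pos_p) * c11, c11
    neg_sum, neg_cnt = (tau_neg + tau_neg_p) * c00, c00

    # Each mixed case lands entirely in one group, chosen by its predicate.
    mixed = [
        (tau_neg + tau_pos_p, c01, tau_pos_p >= abs(tau_neg)),
        (tau_pos + tau_neg_p, c10, tau_pos >= abs(tau_neg_p)),
    ]
    for value, count, is_positive in mixed:
        if is_positive:
            pos_sum += value * count
            pos_cnt += count
        else:
            neg_sum += value * count
            neg_cnt += count

    new_pos = pos_sum / pos_cnt if pos_cnt else 0.0
    new_neg = neg_sum / neg_cnt if neg_cnt else 0.0
    return new_pos, new_neg


new_pos, new_neg = reduced_codebook(0b0101, 1.0, -2.0, 0b0011, 3.0, -1.0, n=4)
```

For these inputs the unquantized vectors are (1, −2, 1, −2) and (3, 3, −1, −1), whose sum (4, 1, 0, −3) has a non-negative mean of 5/3 and a negative mean of −3, so the direct computation agrees with the unquantize/sum/requantize route.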
Another advantage of this method is that the means for the new codebook are computed
[Figure 5.6 plot; recovered legend entry: "Onebit - direct reduction".]
Figure 5.6: Runtime for different one-bit quantized summation schemes, using 32-bit floating
point values as the original data. Note this is just the reduction operation, and does not
involve communication.
dard quantization implementation. This appears to be much more numerically stable, since
truncation error from the naïve summation algorithms can impact results.
5.3.3 Evaluation
We now evaluate our direct approach to one-bit quantized sums in comparison to no quantization
(simply summing two vectors of 32-bit floating point values) and two approaches
that unquantize, sum, and requantize: a baseline implementation and our optimized
approach. These results used an IBM POWER9 CPU on the Lassen system (see
Section 3.5.1 for further description). All runs were single-threaded and used a single
core. Results on an Intel Xeon (Broadwell) CPU were comparable.
Performance results are presented in Figure 5.6. One can first observe that both our
baseline and optimized approaches to one-bit quantized sums are outperformed by simply
summing two floating point vectors. This is to be expected, as these methods perform
sums on vectors of the same length but have additional unquantization and requantization
overhead. Note the large jump in runtime for the no quantization case is due to cache effects
(the greater runtime of the other cases hides this impact).
In contrast, the direct implementation of the one-bit quantized reduction significantly
outperforms the other cases, being over two orders of magnitude faster for long vectors.
Using no quantization remains faster for short vectors, but this implementation outperforms
it once vectors are longer than 256 elements. A significant factor driving this improved
performance is reduced memory usage, since all operations are performed on vectors that
are 1/32 the length of the original, enabling much better cache utilization. Additionally, since
all operations are vectorized on the bit-vectors, fewer instructions need to be executed as
each one processes more entries.
CHAPTER 6: RELATED WORK
I now provide an overview of related works on large-scale training of DNNs and techniques
to parallelize and accelerate training. The survey of Ben-Nun and Hoefler [161] provides a
comprehensive overview of work in these and other related areas.
6.1 PARALLELIZING CONVOLUTION
AlexNet [3] introduced an early form of model parallelism, partitioning convolutional filters
of some layers between two GPUs in order to avoid memory limits. To avoid communication
overheads, grouped convolutions were used at these layers instead of directly replicating
regular convolution.
Coates et al. [162] develop a distributed tensor library to support spatial partitioning of
locally-connected layers, among other model-parallel techniques. Locally-connected layers
are similar to convolutional layers, but do not share weights, enabling reduced communica-
tion burdens.
Gholami et al. [163] develop a general framework for parallelizing CNN training along
multiple dimensions, which they refer to as model, batch, and domain parallelism. Their
results primarily consist of simulation, however, and their formulation of channel and filter
parallelism differs from the one presented here. A simple, bulk-synchronous implementation
of this approach for spatial parallelism was evaluated for individual convolutional layers
in Jin et al. [164]. Similarly, Jia et al. [165] present a separate framework for generalizing
parallelism during training.
Previous deep learning frameworks, such as DistBelief [66] and Project Adam [68], have
supported partitioning the filters of convolutional layers onto multiple worker processors in a
parameter server model. These approaches are not directly applicable to a framework using
decentralized allreduces, unfortunately, and our formulation of channel and filter parallelism
differs significantly.
Significant work has been done to optimize and parallelize convolution on a single GPU.
Historically, frameworks such as Theano [166], Torch7 [167], and cuda-convnet2 [168] have
provided custom compute kernels for convolution. Vendor libraries, such as cuDNN [123] and
MKL-DNN [169], offer optimized primitives for particular architectures. The convolution
operation itself can be implemented in a number of different ways; “direct convolution”,
similar to a matrix product, is the most straightforward approach. “im2col” approaches [120],
which transform the convolution operation into a matrix product, are now commonly used,
as they are able to take advantage of well-optimized BLAS routines for matrix products.
Convolution can be implemented as an element-wise product in the Fourier domain [121].
Although this is algorithmically efficient, this approach has limited use in practice, as it
performs best for layers with large convolutional filters. Instead, one uses approaches based
upon Winograd’s minimal filtering algorithm [122], which is both algorithmically fast and
efficient in practice for small (e.g. 3 × 3) convolutional filters, which are common in state-
of-the-art CNNs.
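As an illustration of the im2col idea mentioned above, the following sketch handles the simplest setting only: a single-channel 2D input, one filter, stride 1, and no padding, using plain Python lists. The function names are hypothetical; real implementations operate on batched multi-channel tensors and hand the resulting matrix to a BLAS routine.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image (list of lists) into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]


def conv2d_im2col(image, kernel):
    """Stride-1, unpadded cross-correlation computed as a matrix-vector product."""
    k = len(kernel)
    flat_kernel = [kernel[di][dj] for di in range(k) for dj in range(k)]
    patches = im2col(image, k)
    out_w = len(image[0]) - k + 1
    flat = [sum(a * b for a, b in zip(row, flat_kernel)) for row in patches]
    # Fold the flat result back into a 2D output.
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d_im2col(image, kernel)
```

With several filters, the flattened kernels form the columns of a second matrix and the computation becomes a single matrix-matrix product, at the cost of duplicating overlapping input pixels in the patch matrix.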
In [170], Demmel and Dinh present communication lower bounds for forward propagation
in convolutional layers, and sequential algorithms that achieve them. However, backpropa-
gation, and training an entire CNN, are not considered.
6.2 SCALING TRAINING
Beyond just convolution, much work has been done on the general problem of parallelizing,
accelerating, and scaling training. I mention here some key approaches and implementations
that are not discussed elsewhere in this section; see [161] for a more systematic overview.
As a fundamental note, parallelizing training is necessary to scale training, as otherwise one
lacks the means to exploit additional compute resources. A key component of successfully
scaling is mitigating communication overheads, which I discuss separately.
Broadly, most techniques for parallelizing and scaling training fall into two categories,
data-parallelism and model-parallelism, although usage of the two terms varies within the
literature. Both techniques have long histories. Data-parallelism typically refers to paral-
lelizing training by partitioning a mini-batch or samples therein; in practice this has almost
always referred to sample-parallelism [171, 172, 173, 54]. Model-parallelism, in contrast,
typically refers to partitioning the parameters of a layer [172, 66, 70, 163]. Note, however,
that this can require partitioning input samples as well, so the distinction between model-
and data-parallelism is not always clear. Because fully-connected layers have many more pa-
rameters than convolutional layers, model-parallelism has traditionally been focused on such
layers. In many CNNs, where a stack of convolutional layers is followed by fully-connected
layers, a common strategy is to use data-parallelism for convolutional layers and switch to
model-parallelism for fully-connected layers [174].
Pipeline parallelism splits a mini-batch into sub-batches and pipelines them through lay-
ers [39, 57, 175, 176]. The layers may be on different processors, and the pipelining can
enable communication overheads to be hidden. This can be used to reduce memory over-
heads by staging layers on different processors, or to reduce computation times. In this
sense, one could view pipeline parallelism as a form of model-parallelism.
Recently, many deep learning frameworks have begun to offer improved support for dis-
tributed training. DistBelief [66], TensorFlow [177], PyTorch [59], Chainer [178], and many
other frameworks include communication primitives to support distributed sample-parallel
training. Specialized frameworks have also been emerging to handle this, including LBANN [70],
Mesh-TensorFlow [179], and TF-Replicator [180]. These frameworks have primarily focused
on supporting distributed model-parallelism for fully-connected layers in a simple manner.
6.3 APPLICATIONS OF LARGE-SCALE TRAINING
There has been significant effort directed toward accelerating the training of ResNet-50
on ImageNet data, as this is a popular benchmark in deep learning benchmark suites, such
as Deep500 [181], DAWNBench [182], and MLPerf [183]. Over time, the walltime needed to
train ResNet-50 to convergence has steadily decreased [103, 184, 105, 134, 111, 185]. These
works all exclusively use sample-parallelism, in combination with framework optimizations,
learning tricks, and/or special hardware. The techniques presented in this thesis, particularly
those for scaling convolution, have not otherwise been applied to training ResNet-50 and
can be used to further accelerate it, while also being applicable to training CNNs and DNNs
in general. Many of the learning tricks used require extensive hyperparameter tuning before
good results can be achieved, whereas this work allows additional performance to be achieved
by leveraging large-scale systems with no extensive problem-specific tuning.
Other applications have also made use of large-scale training to target particularly large or
challenging training problems. Kurth et al. [34] achieve over 1 ExaOP of sustained training
throughput using optimized sample-parallel techniques for a semantic segmentation problem
on large images produced by climate simulations. The CosmoFlow application [29] uses up
to 8192 nodes of the Cori supercomputer to train a CNN to predict cosmological parameters
based on large, 3D data cubes produced by numerical simulations. This work also used
conventional sample parallelism, and due to memory overheads was unable to train on full-
size data. The work on spatial parallelism described in this thesis applies naturally to these
problems.
6.4 COMMUNICATION OPTIMIZATION
There has been extensive work over the years on optimizing MPI communication in gen-
eral [69, 155, 186, 187] and for GPUs [84, 188, 189, 190, 191] in particular. Much of this work
has focused on HPC applications, which often exhibit different communication patterns than
those of training DNNs. This work is, in general, complementary to the results presented
here, and can be leveraged where appropriate. Work integrating CUDA-aware MPI into deep
learning frameworks such as S-Caffe [192] and CNTK [193] is oriented toward parameter
server approaches, as is work on optimized broadcasts [194]. FireCaffe [71] leverages MPI
for decentralized allreduces, but its communication backend is not GPU-aware nor does it
overlap communication and computation.
Several frameworks designed specifically to accelerate GPU-aware deep learning communi-
cation exist, including Baidu Allreduce [72], Facebook’s Gloo [74], Uber’s Horovod [75], and
Nvidia’s NCCL [73]. Baidu Allreduce was the first attempt to leverage ring allreduces for
training DNNs and is implemented atop CUDA-aware MPI, suffering the same issues (see
Section 4.3). NCCL provides highly optimized ring allreduce algorithms and GPU-centric
communication algorithms that perform very well for bandwidth-dominated workloads, but
it lacked latency-optimized allreduces until very recently. Gloo and Horovod both leverage
a combination of CUDA-aware MPI and NCCL, but only support latency-optimized GPU-
centric allreduces if NCCL does. Horovod also attempts tensor fusion to reduce the number
of latency-bound operations.
Parameter servers [67] provide an alternative to decentralized allreduces, and are widely
used [66, 177, 195]. Centralized communication results in communication bottlenecks at a
single dedicated node (or a handful of them), and in practice is often implemented using quadratic
point-to-point communication patterns. The tradeoffs between the two approaches are complicated,
and depend heavily on the system environment. Parameter servers are typically preferred
when doing distributed-memory training in cloud environments, where workers tend to be
heterogeneous and more loosely coupled, failures due to preemption are common, and com-
munication latency is relatively high. In contrast, HPC systems typically have dedicated,
tightly-coupled compute nodes with fast networks and homogeneous hardware, making allre-
duces more attractive.
Asynchronous SGD [196, 66, 197, 198, 199, 200], in which processors are allowed to become
desynchronized from one another and operate on slightly different versions of the model being
trained, is a common technique to reduce communication bottlenecks at large scale or with
heterogeneous workers. Since processors need not operate on the same model, slow processors
do not impact the runtime of training iterations. However, these approaches introduce
stale gradients and models, which can impact the convergence rate negatively. It is also
unclear how to generalize asynchronous SGD to a decentralized allreduce setting, although
gossip algorithms are one possibility [201]. These approaches are in a similar vein to chaotic
relaxation [202].
GPU-centric communication libraries have been developed in the past. Oden et al. [203]
present an implementation of InfiniBand Verbs on GPUs, finding that it in fact degrades
performance. GPUrdma [85] builds upon this work and uses further optimizations to demon-
strate performance benefits. Finally, NVSHMEM [86] extends this further, implementing
point-to-point communication atop InfiniBand Verbs on GPUs. These frameworks lack sup-
port for collective operations, but could be promising as building blocks for higher-level
systems.
6.5 COMMUNICATION COMPRESSION
Many approaches to reducing communication volume during DNN training have been
developed. A widely-used approach to reduce communication volume and accelerate training
is to train with 16-bit (“half precision”) parameters [204, 205], using either IEEE 754 16-bit
or bfloat16 format floating point [177, 206]. These approaches typically require a master
copy of model parameters to be kept at FP32, and that summations be accumulated at
FP32, in order to avoid significant floating point error. Reducing precision even further, to
one bit for weights [207] and for weights and activations [208], has also been proposed. These
methods still perform all computation at full precision, and in general the reduced-precision
models have worse accuracy than the originals.
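The need for FP32 accumulation can be seen in a small experiment. The sketch below emulates half- and single-precision rounding with Python's struct module (formats "e" and "f") and sums the same value repeatedly; the value 0.01 and the count of 10,000 are arbitrary choices for illustration, not figures from any cited work.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


# 10,000 identical small contributions, each exactly representable in FP16.
grads = [to_fp16(0.01)] * 10000
exact = 10000 * grads[0]
sum_fp16 = accumulate(grads, to_fp16)
sum_fp32 = accumulate(grads, to_fp32)
```

Once the FP16 running total grows large enough that each addend is below half the spacing between adjacent representable values, further additions are rounded away and the total stops increasing, whereas the FP32 accumulator stays close to the exact sum.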
More aggressive quantization schemes have been developed, including one-bit SGD [76],
TernGrad [82], and QSGD [81]. Sparsification schemes, such as thresholding (which also
includes quantization) [77] and others [79, 80] reduce volume by sending only a sparse sub-
set of the original values. In general, these algorithms are designed to be run on CPUs
and for parameter server communication. SparCML [209] provides a framework to support
allreduces over sparse vectors, including automatic conversion to dense vectors when sparsity
is no longer profitable. The GradiVeQ algorithm [83] uses a quantized representation that
does not require intermediate dequantization steps for reduction operations in decentralized
allreduces.
Many of these methods, beginning with one-bit SGD, include a local error feedback mech-
anism, whereby the compression errors are accumulated locally and fed into the subsequent
iteration. Error feedback mechanisms are frequently essential for fast convergence with lossy
compression schemes.
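A minimal sketch of local error feedback, applied here to a simple one-bit quantizer, is given below. The quantizer, the function name, and the fixed example gradient are assumptions for illustration; the cited methods differ in their details.

```python
def quantize_with_feedback(grad, residual):
    """One-bit quantize grad + residual.

    Returns (reconstruction, new_residual), where new_residual is the
    compression error carried into the next iteration.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    tau_pos = sum(pos) / len(pos) if pos else 0.0
    tau_neg = sum(neg) / len(neg) if neg else 0.0
    recon = [tau_pos if v >= 0.0 else tau_neg for v in corrected]
    new_residual = [v - q for v, q in zip(corrected, recon)]
    return recon, new_residual


grad = [0.9, 0.1, -0.2, -0.6]       # the same gradient every step, for simplicity
residual = [0.0] * len(grad)
sent_total = [0.0] * len(grad)
steps = 50
for _ in range(steps):
    sent, residual = quantize_with_feedback(grad, residual)
    sent_total = [s + q for s, q in zip(sent_total, sent)]
```

By construction, the sum of everything transmitted plus the final residual equals the sum of the true gradients, so no gradient information is permanently discarded; it is only delayed.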
Until recently, while quantized and/or sparsified communication empirically performed
well for training DNNs, theoretical justification for their convergence was limited. ECQ-
SGD [210] accumulates all quantization errors up to the current iteration, and the authors
prove its convergence for convex quadratic optimization problems. Alistarh et al. [211] prove
that sparsified SGD with local error feedback converges for both convex and smooth non-
convex functions.
Offline techniques to compress DNNs, primarily targeting deployment for inference on
resource-constrained devices, have been widely studied. Han et al. [212] developed network
pruning techniques to sparsify trained DNNs by replacing small parameters with zeros. Deep
Compression [213] extends this further with quantization and Huffman coding. Many other
techniques along these lines exist. A common aspect is that models must be retrained
during the compression process to retain their original accuracies; current techniques do not
permit directly training such aggressively compressed models.
6.6 OTHER PERFORMANCE CONSIDERATIONS
I/O overheads are a major bottleneck for fast training of DNNs, as modern GPUs and
other accelerators can often perform a mini-batch iteration faster than the I/O subsystem
can provide data [87, 88]. Practical frameworks have often attempted to mitigate this with
a combination of asynchronous I/O threads and data staging [34, 29].
Completely orthogonal to the work discussed here, there are many learning and DNN
architecture tricks and techniques that have been developed to accelerate the convergence of
models. Additionally, simply developing better architectures can enable a given accuracy on
a dataset to be achieved faster. In general, these can be applied in concert with the techniques
discussed in this thesis. For example, grouped convolutions (of which depthwise convolutions are a special case), along
with depthwise-separable convolutions [118, 43], have been used to reduce computational
overhead without compromising the accuracy of the model. Particular learning rate schedules
designed to accelerate convergence, such as the superconvergence technique [214], also serve




I now discuss a number of potential research directions that follow on from this work,
which I will pursue in the future.
Some research directions build directly upon the results presented in this thesis:
• While I introduced the stationary-y and -w algorithms in Section 3.3, a comprehensive
evaluation of them has not yet been undertaken.
• There is extensive follow-up work to build on the results for direct one-bit quantized
sums introduced in Section 5.3. This includes extending the approach to quantization
algorithms beyond one-bit quantization, such as n-bit quantization or QSGD [81].
Further, studying the impact of quantization algorithms that have been integrated
into decentralized allreduces is an open problem.
A wide range of additional performance optimizations, along with their associated trade-
offs, remain to be explored and integrated into workflows. I enumerate several of these:
• Wide halos for spatial parallelism (see e.g. [215] for a discussion of this in traditional
stencil applications). Increasing the width of halos means more data must be moved
(and more local computation done), but potentially means that subsequent layers will
not need to perform a halo exchange, as their needed input data is already present. The
number of layers that can avoid the halo exchange depends on the width of the halo and
the parameters of the convolutional layer (e.g., stride, kernel size). This optimization
is essentially trading decreased latency costs for increased bandwidth costs.
• There is a potential to improve communication frameworks’ support for segmented
collective operations, given the performance demonstrated in Section 3.5.4. This will
likely involve both intra-node and topology-aware optimizations in order to mitigate
the impact of concurrent communication operations, but seems likely to have significant
future benefits, as many systems have multiple network interfaces.
• Similarly, it was observed that convolution kernels, which are typically optimized for
particular configurations, do not necessarily perform optimally when layers are par-
titioned. It is infeasible to hand-optimize kernels for every possible configuration, so
using automatic methods (e.g. TVM [216]) to generate kernels may yield the best
results.
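The dependence of the wide-halo optimization on halo width and kernel size can be sketched with a small helper. It assumes stride-1 convolutions with odd kernel sizes, so that each layer consumes (k − 1)/2 halo pixels per side; the function name is hypothetical, and strided layers would need a more careful calculation.

```python
def layers_served_by_halo(halo_width, kernel_sizes):
    """Count consecutive stride-1 conv layers one halo exchange can serve.

    halo_width: number of halo pixels received per side in a single exchange.
    kernel_sizes: odd kernel sizes of the consecutive layers, in order.
    The count includes the layer that performs the exchange, so the number
    of subsequent layers that skip their exchange is the result minus one.
    """
    remaining = halo_width
    served = 0
    for k in kernel_sizes:
        need = (k - 1) // 2     # halo pixels this layer consumes per side
        if need > remaining:
            break
        remaining -= need
        served += 1
    return served


three_by_three = layers_served_by_halo(3, [3, 3, 3, 3])   # three layers served
mixed_kernels = layers_served_by_halo(3, [5, 3, 3])       # two layers served
```

Each layer served this way must also compute its outputs in the still-valid part of the halo region, which is the additional local computation noted above.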
Additional combinations of parallelism remain to be explored. Work on pipelining was
discussed in Section 6.2, and exploring it in combination with the parallelism approaches
discussed here is promising. Considering optimizations of algorithms across layers may also
yield new insights; for example, there may be potential to optimize communication when
two adjacent layers use the stationary-x and -y algorithms.
Another avenue to explore is co-designing DNN architectures and training algorithms to
optimize training runtime. This is in the same vein as approaches designing optimized neural
networks for mobile or other resource-constrained systems [217, 218, 219]. There are several
possibilities that could be pursued, although this is not an exhaustive list.
• Reduce-scatters and allgathers are required by the algorithms presented for channel
and filter parallelism. Using grouped convolutions can eliminate the need for this
communication.
• Locally-connected layers, which do not share weights as convolutional layers do, may
enable reduced communication in spatial partitions.
• Similarly, for certain problems, it may be possible to ignore the halo exchange entirely.
The layer will no longer be computing convolution precisely, as certain pixels will be
“seen” fewer times by the filters, but this may nonetheless provide sufficient accuracy.
• Many networks employ “bottlenecks”, where the dimensionality of activations is reduced
(the bottleneck blocks of ResNet are an example of this). These may provide a natural
point at which to exploit pipeline parallelism.
• Different forms of batch normalization, which aggregate over subsets of data, may
enable reduced communication while still maintaining accuracy.
In line with the numerous approaches to parallelism discussed in this thesis and also
proposed for the future, it becomes critical to have an effective method to select parallel
execution strategies. Section 3.4 sketched an approach to this, but the performance model
does not consider the techniques discussed here. Further, other approaches to selecting
parallel execution strategies, such as formulating the problem as a linear program or using
reinforcement learning, may be better in practice, especially as the optimization space grows.
Beyond optimizing “standard” architectures like convolution, emerging DNN architectures
are likely to be important in the future. In particular, graph neural networks [220, 221, 222]
have wide application in many domains where data is not grid-structured. Accelerating their
training is significantly more complicated than for standard DNN architectures, as data is
typically sparse and access patterns are irregular.
Lastly, applying DNNs within computational science, where the techniques proposed in
this thesis can be particularly impactful, is a fruitful direction to explore. Once one can
train DNNs on large data produced by experiments or computational simulations, it becomes
necessary to develop techniques such as regularization or objective functions that can ensure
the physical consistency of the DNN’s predictions. Along these lines, exploring DNNs as fast
substitutes for computational kernels or as surrogates for entire simulation runs may enable
improved scientific productivity. One can also consider incorporating DNNs within the
simulation workflow, using them to select algorithms for different data regimes or engage in
other computational steering applications.
7.2 CONCLUSIONS
In this thesis, I have presented a set of techniques to accelerate training of DNNs and
to open up the possibility of training on new problems where it was previously infeasible.
In a broad sense, these techniques fall into two categories: providing additional parallelism
for users and frameworks to exploit, and reducing overheads. The combination of spatial
parallelism and channel and filter parallelism, along with existing work on sample parallelism,
allows every dimension of a CNN to be partitioned when training. This enables
improved strong and weak scaling, while allowing training on very large samples and wide
models. The Aluminum library enables reduced communication overheads for training at
scale, via its GPU-centric communication semantics and latency-optimized polyalgorithms.
In concert, the distributed memory convolution algorithms and communication optimiza-
tions introduced here allow practitioners to leverage large-scale HPC systems to train models
faster without additional tuning phases. For further performance, I introduced approaches
to communication quantization that can reduce communication volume.
As data, datasets, and models grow in size and complexity, and as rapid training cycles
become more common, accelerating training becomes critical to maintain productivity. The
techniques introduced in this thesis provide a way to manage this.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016,
http://www.deeplearningbook.org.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep
convolutional neural networks,” in NeurIPS, 2012.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in CVPR, 2016.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual
Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3,
2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” in CVPR, 2014.
[7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object
detection with region proposal networks,” in NeurIPS, 2015.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,
real-time object detection,” in CVPR, 2016.
[9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in ICCV. IEEE,
2017.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,
V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups,” IEEE
Signal Processing Magazine, vol. 29, no. 6, 2012.
[11] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger,
S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech
recognition,” arXiv preprint arXiv:1412.5567, 2014.
[12] R. Sennrich, B. Haddow, and A. Birch, “Edinburgh neural machine translation systems
for WMT ’16,” arXiv preprint arXiv:1606.02891, 2016.
[13] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,
Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging
the gap between human and machine translation,” arXiv preprint arXiv:1609.08144,
2016.
[14] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural
networks,” in NeurIPS, 2014.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
[16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natu-
ral language processing (almost) from scratch,” Journal of Machine Learning Research,
vol. 12, 2011.
[17] D. Chen and C. Manning, “A fast and accurate dependency parser using neural net-
works,” in EMNLP, 2014.
[18] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, “Deep learning for computa-
tional biology,” Molecular Systems Biology, vol. 12, no. 7, 2016.
[19] S. Min, B. Lee, and S. Yoon, “Deep learning in bioinformatics,” Briefings in Bioinfor-
matics, vol. 18, no. 5, 2017.
[20] M. K. Leung, H. Y. Xiong, L. J. Lee, and B. J. Frey, “Deep learning of the tissue-
regulated splicing code,” Bioinformatics, vol. 30, no. 12, 2014.
[21] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua,
S. Gueroussov, H. S. Najafabadi, T. R. Hughes et al., “The human splicing code reveals
new insights into the genetic determinants of disease,” Science, vol. 347, no. 6218, 2015.
[22] E. Asgari and M. R. Mofrad, “Continuous distributed representation of biological se-
quences for deep proteomics and genomics,” PLOS ONE, vol. 10, no. 11, 2015.
[23] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik, “Deep neural nets as a
method for quantitative structure–activity relationships,” Journal of Chemical Infor-
mation and Modeling, vol. 55, no. 2, 2015.
[24] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande, “Low data drug discovery
with one-shot learning,” ACS Central Science, vol. 3, no. 4, 2017.
[25] T. Ciodaro, D. Deva, J. De Seixas, and D. Damazio, “Online particle detection with
neural networks based on topological calorimetry information,” in Journal of Physics:
conference series, vol. 368, no. 1. IOP Publishing, 2012.
[26] C. Adam-Bourdarios, G. Cowan, C. Germain, I. Guyon, B. Kégl, and D. Rousseau,
“The Higgs boson machine learning challenge,” in NeurIPS 2014 Workshop on High-
energy Physics and Machine Learning, 2015.
[27] J. H. Collins, K. Howe, and B. Nachman, “CWoLa hunting: Extending the bump hunt
with machine learning,” arXiv preprint arXiv:1805.02664, 2018.
[28] S. Ravanbakhsh, J. B. Oliva, S. Fromenteau, L. Price, S. Ho, J. G. Schneider, and
B. Póczos, “Estimating cosmological parameters from the dark matter distribution,”
in ICML, 2016.
[29] A. Mathuriya, D. Bard, P. Mendygral, L. Meadows, J. Arnemann, L. Shao, S. He,
T. Kärnä, D. Moise, S. J. Pennycook et al., “CosmoFlow: Using deep learning to learn
the universe at scale,” in SC. IEEE, 2018.
[30] M. Mustafa, D. Bard, W. Bhimji, Z. Lukić, R. Al-Rfou, and J. M. Kratochvil, “Cos-
moGAN: Creating high-fidelity weak lensing convergence maps using generative ad-
versarial networks,” Computational Astrophysics and Cosmology, vol. 6, no. 1, 2019.
[31] Y. Liu, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner,
W. Collins et al., “Application of deep convolutional neural networks for detecting
extreme weather in climate datasets,” arXiv preprint arXiv:1605.01156, 2016.
[32] S. Hong, S. Kim, M. Joh, and S.-k. Song, “GlobalNet: Convolutional neural
networks for typhoon eye tracking from remote sensing imagery,” arXiv preprint
arXiv:1708.03417, 2017.
[33] E. Racah, C. Beckham, T. Maharaj, S. E. Kahou, Prabhat, and C. Pal, “Ex-
tremeWeather: A large-scale climate dataset for semi-supervised detection, localiza-
tion, and understanding of extreme weather events,” in Advances in Neural Information
Processing Systems, 2017.
[34] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh,
M. Matheson, J. Deslippe, M. Fatica et al., “Exascale deep learning for climate ana-
lytics,” in SC. IEEE Press, 2018.
[35] K. Humbird, J. Peterson, and R. McClarren, “Deep neural network initialization with
decision trees,” arXiv preprint arXiv:1707.00784v3, 2018.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,”
in ECCV. Springer, 2016.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.
[38] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected
convolutional networks,” in CVPR, 2017.
[39] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen, “GPipe:
Efficient training of giant neural networks using pipeline parallelism,” arXiv preprint
arXiv:1811.06965, 2018.
[40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, vol. 86, no. 11, 1998.
[41] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,”
University of Toronto, Tech. Rep., 2009.
[42] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth,
and L.-J. Li, “YFCC100M: The new data in multimedia research,” Communications
of the ACM, vol. 59, no. 2, 2016.
[43] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in
CVPR, 2017.
[44] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness
of data in deep learning era,” in ICCV. IEEE, 2017.
[45] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe,
and L. van der Maaten, “Exploring the limits of weakly supervised pretraining,” arXiv
preprint arXiv:1805.00932, 2018.
[46] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, “A large contextual
dataset for classification, detection and counting of cars with deep learning,” in ECCV.
Springer, 2016.
[47] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Ag-
garwal, H. Lee, L. Davis et al., “A large-scale benchmark dataset for event recognition
in surveillance video,” in CVPR. IEEE, 2011.
[48] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for im-
proved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
[49] L. Peterson, “Machine learning aided discovery of a new NIF design,” in LLNL Data
Science Institute Workshop, 2018.
[50] N. Dryden, N. Maruyama, T. Benson, T. Moon, M. Snir, and B. Van Essen, “Improving
strong-scaling of CNN training by exploiting finer-grained parallelism,” in IPDPS.
IEEE, 2019.
[51] M. Wehner, K. A. Reed, D. Stone, W. D. Collins, and J. Bacmeister, “Resolution
dependence of future tropical cyclone projections of CAM5.1 in the US CLIVAR hur-
ricane working group idealized configurations,” Journal of Climate, vol. 28, no. 10,
2015.
[52] M. Ghafoorian, N. Karssemeijer, T. Heskes, M. Bergkamp, J. Wissink, J. Obels,
K. Keizer, F.-E. de Leeuw, B. van Ginneken, E. Marchiori et al., “Deep multi-scale
location-aware 3D convolutional neural networks for automated detection of lacunes
of presumed vascular origin,” NeuroImage: Clinical, vol. 14, 2017.
[53] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy,
B. Jia, Y. Jia, A. Kalro et al., “Applied machine learning at Facebook: A datacenter
infrastructure perspective,” in HPCA. IEEE, 2018.
[54] R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale deep unsupervised learning using
graphics processors,” in ICML. ACM, 2009.
[55] Y. Ito, R. Matsumiya, and T. Endo, “ooc_cuDNN: Accommodating convolutional
neural networks over GPU memory capacity,” in Big Data. IEEE, 2017.
[56] A. Petrowski, G. Dreyfus, and C. Girault, “Performance analysis of a pipelined back-
propagation parallel algorithm,” IEEE Transactions on Neural Networks, vol. 4, no. 6,
1993.
[57] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, “Pipelined back-propagation for
context-dependent deep neural networks,” in INTERSPEECH, 2012.
[58] C.-C. Chen, C.-L. Yang, and H.-Y. Cheng, “Efficient and robust parallel DNN training
through model parallelism on multi-GPU platform,” arXiv preprint arXiv:1809.02839,
2018.
[59] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,
L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NeurIPS-W, 2017.
[60] C. Meng, M. Sun, J. Yang, M. Qiu, and Y. Gu, “Training deeper models by GPU
memory optimization on TensorFlow,” in NeurIPS ML Systems Workshop, 2017.
[61] R. Landaverde, T. Zhang, A. K. Coskun, and M. Herbordt, “An investigation of unified
memory access performance in CUDA,” in HPEC. IEEE, 2014.
[62] W. Li, G. Jin, X. Cui, and S. See, “An evaluation of unified memory technology on
NVIDIA GPUs,” in CCGrid. IEEE, 2015.
[63] Ł. Jarząbek and P. Czarnul, “Performance evaluation of unified memory and dynamic
parallelism for selected parallel CUDA applications,” The Journal of Supercomputing,
vol. 73, no. 12, 2017.
[64] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “On parallelizability of stochastic gradient
descent for speech DNNs,” in ICASSP. IEEE, 2014.
[65] J. Keuper and F.-J. Pfreundt, “Distributed training of deep neural networks:
Theoretical and practical limits of parallel scalability,” in MLHPC. IEEE, 2016.
[66] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker,
K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in NeurIPS, 2012.
[67] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long,
E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter
server,” in OSDI, 2014.
[68] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project Adam: Building
an efficient and scalable deep learning training system,” in OSDI, vol. 14, 2014.
[69] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication
operations in MPICH,” The International Journal of High Performance Computing
Applications, vol. 19, no. 1, 2005.
[70] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen, “LBANN: Livermore big
artificial neural network HPC toolkit,” in MLHPC. ACM, 2015.
[71] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer, “FireCaffe: Near-linear
acceleration of deep neural network training on compute clusters,” in CVPR, 2016.
[72] Baidu Research, “Baidu allreduce,”
https://github.com/baidu-research/baidu-allreduce, 2018.
[73] NVIDIA, “NVIDIA collective communications library,”
https://developer.nvidia.com/nccl, 2019.
[74] Facebook, “Gloo,” https://github.com/facebookincubator/gloo, 2018.
[75] A. Sergeev and M. Del Balso, “Horovod: Fast and easy distributed deep learning in
TensorFlow,” arXiv preprint arXiv:1802.05799, 2018.
[76] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its
application to data-parallel distributed training of speech DNNs,” in INTERSPEECH,
2014.
[77] N. Strom, “Scalable distributed DNN training using commodity GPU cloud comput-
ing,” in INTERSPEECH, 2015.
[78] N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen, “Communication quantization
for data-parallel training of deep neural networks,” in MLHPC. IEEE, 2016.
[79] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,”
arXiv preprint arXiv:1704.05021, 2017.
[80] X. Sun, X. Ren, S. Ma, and H. Wang, “meprop: Sparsified back propagation for
accelerated deep learning with reduced overfitting,” arXiv preprint arXiv:1706.06197,
2017.
[81] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-
efficient SGD via gradient quantization and encoding,” in NeurIPS, 2017.
[82] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary
gradients to reduce communication in distributed deep learning,” in NeurIPS, 2017.
[83] M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing, M. Annavaram, and
S. Avestimehr, “GradiVeQ: Vector quantization for bandwidth-efficient gradient aggre-
gation in distributed CNN training,” in NeurIPS, 2018.
[84] H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, and D. K. Panda, “MVAPICH2-
GPU: Optimized GPU to GPU communication for InfiniBand clusters,” Computer
Science-Research and Development, vol. 26, no. 3-4, 2011.
[85] F. Daoud, A. Watad, and M. Silberstein, “GPUrdma: GPU-side library for high per-
formance networking from GPU kernels,” in ROSS. ACM, 2016.
[86] S. Potluri, A. Goswami, D. Rossetti, C. Newburn, M. G. Venkata, and N. Imam, “GPU-
centric communication on NVIDIA GPU clusters with InfiniBand: A case study with
OpenSHMEM,” in HiPC. IEEE, 2017.
[87] S. Pumma, M. Si, W.-c. Feng, and P. Balaji, “Towards scalable deep learning via I/O
analysis and optimization,” in HPCC. IEEE, 2017.
[88] S. W. Chien, S. Markidis, C. P. Sishtla, L. Santos, P. Herman, S. Narasimhamurthy,
and E. Laure, “Characterizing deep-learning I/O workloads in TensorFlow,” in PDSW-
DISCS. IEEE, 2018.
[89] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics
of Control, Signals and Systems, vol. 2, 1989.
[90] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural
networks, vol. 4, no. 2, 1991.
[91] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks:
A view from the width,” in NeurIPS, 2017.
[92] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann ma-
chines,” in ICML, 2010.
[93] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AIS-
TATS, 2011.
[94] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural
computation, vol. 1, no. 4, 1989.
[95] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine
learning,” SIAM Review, vol. 60, no. 2, 2018.
[96] T. Tieleman and G. Hinton, “Lecture 6.5-RMSProp: Divide the gradient by a running
average of its recent magnitude,” COURSERA: Neural networks for machine learning,
vol. 4, no. 2, 2012.
[97] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR,
2015.
[98] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning
and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul,
2011.
[99] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., “Learning representations by
back-propagating errors,” Cognitive modeling, vol. 5, no. 3, 1988.
[100] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-
batch training for deep learning: Generalization gap and sharp minima,” in ICLR,
2016.
[101] S. Hochreiter and J. Schmidhuber, “Flat minima,” Neural Computation, vol. 9, no. 1,
1997.
[102] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep
nets,” arXiv preprint arXiv:1703.04933, 2017.
[103] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch,
Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,”
arXiv preprint arXiv:1706.02677, 2017.
[104] E. Hoffer, I. Hubara, and D. Soudry, “Train longer, generalize better: closing the
generalization gap in large batch training of neural networks,” in NeurIPS, 2017.
[105] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “ImageNet training in
minutes,” in ICPP. ACM, 2018.
[106] L. Balles, J. Romero, and P. Hennig, “Coupling adaptive batch sizes with learning
rates,” arXiv preprint arXiv:1612.05086, 2016.
[107] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t decay the learning rate,
increase the batch size,” arXiv preprint arXiv:1711.00489, 2017.
[108] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl,
“Measuring the effects of data parallelism on neural network training,” arXiv preprint
arXiv:1811.03600, 2018.
[109] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[110] S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-
normalized models,” in NeurIPS, 2017.
[111] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, “Image classification at super-
computer scale,” in NeurIPS Systems for ML Workshop, 2018.
[112] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
a simple way to prevent neural networks from overfitting,” The Journal of Machine
Learning Research, vol. 15, no. 1, 2014.
[113] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations
for deep neural networks,” in CVPR, 2017.
[114] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet
and the impact of residual connections on learning,” in AAAI, 2017.
[115] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with
atrous separable convolution for semantic image segmentation,” in ECCV, 2018.
[116] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy:
Disentangling task transfer learning,” in CVPR, 2018.
[117] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic
segmentation,” in CVPR, 2015.
[118] L. Sifre, “Rigid-motion scattering for image classification,” Ph.D. dissertation, Ecole
Polytechnique, CMAP, 2014.
[119] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image
segmentation with deep convolutional nets and fully connected CRFs,” in ICLR, 2015.
[120] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional neural net-
works for document processing,” in Tenth International Workshop on Frontiers in
Handwriting Recognition. Suvisoft, 2006.
[121] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks
through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
[122] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in CVPR,
2016.
[123] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and
E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint
arXiv:1410.0759, 2014, see also https://developer.nvidia.com/cudnn.
[124] P. Fraigniaud and E. Lazard, “Methods and problems of communication in usual net-
works,” Discrete Applied Mathematics, vol. 53, 1994.
[125] OSU Network-Based Computing Laboratory, “OSU MPI benchmarks,”
http://mvapich.cse.ohio-state.edu/benchmarks/, 2019.
[126] MPI Forum, “MPI forum,” https://www.mpi-forum.org/, 2019.
[127] M. Barnett, L. Shuler, R. van De Geijn, S. Gupta, D. G. Payne, and J. Watts, “In-
terprocessor collective communication library (InterCom),” in Proceedings of IEEE
Scalable High Performance Computing Conference. IEEE, 1994.
[128] M. Shroff and R. A. Van De Geijn, “CollMark: MPI collective communication bench-
mark,” in International Conference on Supercomputing, 2000.
[129] R. Rabenseifner, “New optimized MPI reduce algorithm,”
https://web.archive.org/web/20081014060350/http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html,
2008.
[130] M. Barnett, R. Littlefield, D. G. Payne, and R. van de Geijn, “Global combine algorithms
for 2-D meshes with wormhole routing,” Journal of Parallel and Distributed Computing,
vol. 24, no. 2, 1995.
[131] E. Chan, R. van de Geijn, W. Gropp, and R. Thakur, “Collective communication
on architectures that support simultaneous communication over multiple links,” in
PPoPP. ACM, 2006.
[132] E. Chan, M. Heimlich, A. Purkayastha, and R. Van De Geijn, “Collective communica-
tion: theory, practice, and experience,” Concurrency and Computation: Practice and
Experience, vol. 19, no. 13, 2007.
[133] N. Jain and Y. Sabharwal, “Optimal bucket algorithms for large MPI collectives on
torus interconnects,” in ICS. ACM, 2010.
[134] H. Mikami, H. Suganuma, P. U-chupala, Y. Tanaka, and Y. Kageyama, “Mas-
sively distributed SGD: ImageNet/ResNet-50 training in a flash,” arXiv preprint
arXiv:1811.05233, 2018.
[135] J. Poulson, B. Marker, R. A. Van de Geijn, J. R. Hammond, and N. A. Romero,
“Elemental: A new framework for distributed memory dense matrix computations,”
ACM TOMS, vol. 39, no. 2, 2013.
[136] N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Essen, “Channel
and filter parallelism for large-scale CNN training,” In submission, 2019.
[137] M. D. Schatz, R. A. Van de Geijn, and J. Poulson, “Parallel matrix multiplication: A
systematic journey,” SIAM Journal on Scientific Computing, vol. 38, no. 6, 2016.
[138] M. D. Schatz, “Distributed tensor computations: formalizing distributions, redistribu-
tions, and algorithm derivation,” Ph.D. dissertation, University of Texas at Austin,
2015.
[139] C. Ancourt, F. Coelho, F. Irigoin, and R. Keryell, “A linear algebra framework for
static High Performance Fortran code distribution,” Scientific Programming, vol. 6,
no. 1, 1997.
[140] N. Maruyama, T. Nomura, K. Sato, and S. Matsuoka, “Physis: An implicitly parallel
programming model for stencil computations on large-scale GPU-accelerated super-
computers,” in SC, 2011.
[141] L. E. Cannon, “A cellular computer to implement the Kalman filter algorithm,” Ph.D.
dissertation, Montana State University-Bozeman, College of Engineering, 1969.
[142] NVIDIA Research, “Cub,” https://nvlabs.github.io/cub/, 2019.
[143] Lawrence Livermore National Laboratory, “Lassen,”
https://hpc.llnl.gov/hardware/platforms/lassen, 2019.
[144] IBM, “SpectrumMPI,” https://www.ibm.com/us-en/marketplace/spectrum-mpi,
2019.
[145] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” in ICLR, 2015.
[146] N. Dryden, N. Maruyama, T. Moon, T. Benson, A. Yoo, M. Snir, and B. Van Essen,
“Aluminum: An asynchronous GPU-aware communication library optimized for large-
scale training of deep neural networks on HPC systems,” in MLHPC. IEEE, 2018.
[147] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in BMVC, 2016.
[148] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
[149] Lawrence Livermore National Laboratory, “Sierra,”
https://hpc.llnl.gov/hardware/platforms/sierra, 2018.
[150] Y. Wu and K. He, “Group normalization,” in ECCV, 2018.
[151] Hydrogen team, “Hydrogen,” https://github.com/LLNL/Elemental, 2018.
[152] L. Lamport, “Proving the correctness of multiprocess programs,” IEEE transactions
on software engineering, no. 2, 1977.
[153] N. M. Lê, A. Guatto, A. Cohen, and A. Pop, “Correct and efficient bounded FIFO
queues,” in SBAC-PAD. IEEE, 2013.
[154] A. Brooks, H.-V. Dang, N. Dryden, and M. Snir, “PPL: an abstract runtime system
for hybrid parallel programming,” in ESPM2. ACM, 2015.
[155] H.-V. Dang, M. Snir, and W. Gropp, “Towards millions of communicating threads,” in
EuroMPI. ACM, 2016.
[156] P. R. Eller and W. Gropp, “Scalable non-blocking preconditioned conjugate gradient
methods,” in SC. IEEE, 2016.
[157] NVIDIA, “GPUDirect RDMA,”
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html, 2018.
[158] S. W. Smith, The scientist and engineer’s guide to digital signal processing. California
Technical Publishing, 1997.
[159] R. Rice and J. Plaunt, “Adaptive variable-length coding for efficient compression of
spacecraft television data,” IEEE Transactions on Communication Technology, vol. 19,
no. 6, 1971.
[160] Lawrence Livermore National Laboratory, “Catalyst,”
https://hpc.llnl.gov/hardware/platforms/catalyst, 2019.
[161] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An
in-depth concurrency analysis,” arXiv preprint arXiv:1802.09941, 2018.
[162] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng, “Deep learning
with COTS HPC systems,” in ICML, 2013.
[163] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc, “Integrated model, batch and
domain parallelism in training neural networks,” in SPAA, 2018.
[164] P. Jin, B. Ginsburg, and K. Keutzer, “Spatially parallel convolutions,” in ICLR Work-
shops, 2018.
[165] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural
networks,” arXiv preprint arXiv:1807.05358, 2018.
[166] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,
D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression com-
piler,” in SciPy, vol. 4, no. 3, 2010.
[167] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A MATLAB-like environment
for machine learning,” in BigLearn NeurIPS Workshop, 2011.
[168] A. Krizhevsky, “cuda-convnet2,” https://github.com/akrizhevsky/cuda-convnet2,
2015.
[169] Intel, “MKL-DNN,” https://01.org/mkl-dnn, 2019.
[170] J. Demmel and G. Dinh, “Communication-optimal convolutional neural nets,” arXiv
preprint arXiv:1802.06905, 2018.
[171] X. Zhang, M. Mckenna, J. P. Mesirov, and D. L. Waltz, “An efficient implementation of
the back-propagation algorithm on the connection machine CM-2,” in NeurIPS, 1990.
[172] J. Dean, “Parallel implementations of neural network training: Two back-propagation
approaches,” Senior thesis, University of Minnesota, 1990.
[173] P. Farber and K. Asanovic, “Parallel neural network training on Multi-Spert,” in Pro-
ceedings of 3rd International Conference on Algorithms and Architectures for Parallel
Processing. IEEE, 1997.
[174] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv
preprint arXiv:1404.5997, 2014.
[175] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and
P. Gibbons, “PipeDream: Fast and efficient pipeline parallel DNN training,” arXiv
preprint arXiv:1806.03377, 2018.
[176] Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing, “Pipe-SGD: A de-
centralized pipelined SGD framework for distributed deep net training,” in NeurIPS,
2018.
[177] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat,
G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in
OSDI, vol. 16, 2016.
[178] T. Akiba, K. Fukuda, and S. Suzuki, “ChainerMN: Scalable distributed deep learning
framework,” in NeurIPS Workshop on ML Systems, 2017.
[179] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins,
H. Lee, M. Hong, C. Young et al., “Mesh-TensorFlow: Deep learning for supercom-
puters,” in NeurIPS, 2018.
[180] P. Buchlovsky, D. Budden, D. Grewe, C. Jones, J. Aslanides, F. Besse, A. Brock,
A. Clark, S. G. Colmenarejo, A. Pope et al., “TF-Replicator: Distributed machine
learning for researchers,” arXiv preprint arXiv:1902.00465, 2019.
[181] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler, “A modular
benchmarking infrastructure for high-performance and reproducible deep learning,” in
IPDPS. IEEE, 2019.
[182] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Oluko-
tun, C. Ré, and M. Zaharia, “DAWNBench: An end-to-end deep learning benchmark
and competition,” in NeurIPS ML Systems Workshop, 2017.
[183] MLPerf, “MLPerf,” https://mlperf.org/, 2019.
[184] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely large minibatch SGD: Training
ResNet-50 on ImageNet in 15 minutes,” in NeurIPS 2017 Workshop: Deep Learning
at Supercomputer Scale, 2017.
[185] J. Howard, “Now anyone can train ImageNet in 18 minutes,”
https://www.fast.ai/2018/08/10/fastai-diu-imagenet/, 2018.
[186] G. Almási, P. Heidelberger, C. J. Archer, X. Martorell, C. C. Erway, J. E. Moreira,
B. Steinmacher-Burow, and Y. Zheng, “Optimization of MPI collective communication
on BlueGene/L systems,” in ICS. ACM, 2005.
[187] J. Liu, J. Wu, and D. K. Panda, “High performance RDMA-based MPI implementation
over InfiniBand,” International Journal of Parallel Programming, vol. 32, no. 3, 2004.
[188] S. Potluri, H. Wang, D. Bureddy, A. K. Singh, C. Rosales, and D. K. Panda, “Op-
timizing MPI communication on multi-GPU systems using CUDA inter-process com-
munication,” in IPDPSW. IEEE, 2012.
[189] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. K. Panda, “Efficient
inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with
NVIDIA GPUs,” in ICPP. IEEE, 2013.
[190] R. Shi, S. Potluri, K. Hamidouche, J. Perkins, M. Li, D. Rossetti, and D. K. D. Panda,
“Designing efficient small message transfer mechanism for inter-node MPI communi-
cation on InfiniBand GPU clusters,” in HiPC. IEEE, 2014.
[191] C.-H. Chu, K. Hamidouche, A. Venkatesh, A. A. Awan, and D. K. Panda, “CUDA
kernel based collective reduction operations on large-scale GPU clusters,” in CCGrid.
IEEE, 2016.
[192] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “S-Caffe: Co-designing
MPI runtimes and Caffe for scalable deep learning on modern GPU clusters,” in
PPoPP. ACM, 2017.
[193] D. S. Banerjee, K. Hamidouche, and D. K. Panda, “Re-designing CNTK deep learning
framework on modern GPU enabled clusters,” in CloudCom. IEEE, 2016.
[194] A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, “Optimized broadcast for
deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?” in Eu-
roMPI, 2018.
[195] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in
KDD. ACM, 2016.
[196] B. Recht, C. Ré, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing
stochastic gradient descent,” in NeurIPS, 2011.
[197] C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified analysis
of Hogwild-style algorithms,” in NeurIPS, 2015.
[198] J. Keuper and F.-J. Pfreundt, “Asynchronous parallel stochastic gradient descent: A
numeric core for scalable distributed machine learning algorithms,” in MLHPC. ACM,
2015.
[199] C. Noel and S. Osindero, “Dogwild! - distributed Hogwild for CPU & GPU,” in
NeurIPS Workshop on Distributed Machine Learning and Matrix Computations, 2014.
[200] T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang, “GPU asynchronous stochastic
gradient descent to speed up neural network training,” arXiv preprint arXiv:1312.6186,
2013.
[201] J. Daily, A. Vishnu, C. Siegel, T. Warfel, and V. Amatya, “GossipGraD: Scalable
deep learning using gossip communication based asynchronous gradient descent,” arXiv
preprint arXiv:1803.05880, 2018.
[202] D. Chazan and W. Miranker, “Chaotic relaxation,” Linear algebra and its applications,
vol. 2, no. 2, 1969.
[203] L. Oden, H. Fröning, and F.-J. Pfreundt, “Infiniband-verbs on GPU: A case study of
controlling an InfiniBand network device from the GPU,” in IPDPSW. IEEE, 2014.
[204] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with
limited numerical precision,” in ICML, 2015.
[205] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg,
M. Houston, O. Kuchaev, G. Venkatesh et al., “Mixed precision training,” in ICLR,
2018.
[206] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T.
Vooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., “A study of BFLOAT16 for
deep learning training,” arXiv preprint arXiv:1905.12322, 2019.
[207] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural
networks with binary weights during propagations,” in NeurIPS, 2015.
[208] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural
networks,” in NeurIPS, 2016.
[209] C. Renggli, D. Alistarh, and T. Hoefler, “SparCML: High-performance sparse
communication for machine learning,” arXiv preprint arXiv:1802.08021, 2018.
[210] J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized SGD and
its applications to large-scale distributed optimization,” in ICML, 2018.
[211] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli,
“The convergence of sparsified gradient methods,” in NeurIPS, 2018.
[212] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for
efficient neural network,” in NeurIPS, 2015.
[213] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural
networks with pruning, trained quantization and Huffman coding,” in ICLR, 2016.
[214] L. N. Smith and N. Topin, “Super-convergence: Very fast training of neural networks
using large learning rates,” in Artificial Intelligence and Machine Learning for
Multi-Domain Operations Applications, vol. 11006. International Society for Optics and
Photonics, 2019.
[215] F. B. Kjolstad and M. Snir, “Ghost cell pattern,” in Proceedings of the 2010 Workshop
on Parallel Programming Patterns. ACM, 2010.
[216] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang,
Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep
learning,” in OSDI, 2018.
[217] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
“SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model
size,” arXiv preprint arXiv:1602.07360, 2016.
[218] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile
vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[219] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient
convolutional neural network for mobile devices,” in CVPR, 2018.
[220] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst,
“Geometric deep learning: going beyond Euclidean data,” IEEE Signal Processing Magazine,
vol. 34, no. 4, 2017.
[221] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A
review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
[222] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey
on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
