Optimizing Multi-GPU Parallelization Strategies for Deep Learning
  Training by Pal, Saptadeep et al.
Saptadeep Pal 1 Eiman Ebrahimi 2 Arslan Zulfiqar 2 Yaosheng Fu 2
Victor Zhang 2 Szymon Migacz 2 David Nellans 2 Puneet Gupta 1
Abstract
Deploying deep learning (DL) models across mul-
tiple compute devices to train large and complex
models continues to grow in importance because
of the demand for faster and more frequent train-
ing. Data parallelism (DP) is the most widely
used parallelization strategy, but as the number of
devices in data parallel training grows, so does the
communication overhead between devices. Ad-
ditionally, a larger aggregate batch size per step
leads to statistical efficiency loss, i.e., a larger
number of epochs are required to converge to a de-
sired accuracy. These factors affect overall train-
ing time and beyond a certain number of devices,
the speedup from leveraging DP begins to scale
poorly. In addition to DP, each training step can be
accelerated by exploiting model parallelism (MP).
This work explores hybrid parallelization, where
each data parallel worker is comprised of more
than one device, across which the model dataflow
graph (DFG) is split using MP. We show that at-
scale, hybrid training will be more effective at
minimizing end-to-end training time than exploit-
ing DP alone. We project that for Inception-V3,
GNMT, and BigLSTM, the hybrid strategy pro-
vides an end-to-end training speedup of at least
26.5%, 8%, and 22% respectively compared to
what DP alone can achieve at scale.
1. Introduction
Deep learning (DL) models continue to grow and the
datasets used to train them are increasing in size, leading
to longer training times. Therefore, training is being ac-
celerated by deploying DL models across multiple devices
(e.g., GPUs/TPUs) in parallel. Data parallelism (DP) is the
simplest parallelization strategy (Krizhevsky et al., 2017;
Dean et al., 2012; Simonyan & Zisserman, 2014), where
replicas of a model are trained on independent devices using
independent subsets of data, referred to as mini-batches.
1University of California Los Angeles 2NVIDIA
Correspondence to: Eiman Ebrahimi <eebrahimi@nvidia.com>,
Saptadeep Pal <saptadeep@ucla.edu>.
All major frameworks (e.g., TensorFlow (Abadi et al., 2016), PyTorch
(PyTorch)) support DP using easy-to-use and intuitive
APIs (Sergeev & Balso, 2018). However, as the number
of devices used to exploit DP increases, the global batch
size also typically increases 1. This poses a fundamental
problem for data parallel scalability because for any given
DL network, there exists a global batch size beyond which
converging to the desired accuracy requires a significantly
larger number of iterations. This is primarily due to the
reduced statistical efficiency of the training process (Hoffer
et al., 2017). In addition, as the number of devices employed
increases, the synchronization/communication overhead of
sharing gradients across devices increases, further limiting
overall training speedup.
Model parallelism (MP) is a complementary technique in
which the model dataflow graph (DFG) is split across mul-
tiple devices while working on the same mini-batch (Dean
et al., 2012; Seide et al., 2014). MP has been traditionally
used to split large models (which can not fit in a single de-
vice’s memory), but employing MP can also help speed up
each training step by placing and running concurrent op-
erations on separate devices. Unfortunately the amount of
parallelism that exists in today’s models is often limited (Wu
et al., 2016; Szegedy et al., 2015), either by the algorithm or
by its implementation. Therefore, using MP alone to obtain
performance through parallelization typically does not scale
well to a large number of devices. Additionally, maximizing
the speedup from MP is often non-trivial (Mirhoseini et al.,
2017; 2018). Optimizing MP requires carefully splitting the
model to take into account the overhead of communicating
activations (during the forward pass) and gradients (during
the backward pass) between dependent operations (placed
on separate devices) in order to achieve the maximum possi-
ble speedup.
This work studies which parallelization strategies to adopt
to minimize end-to-end training time for a given DL model
on available hardware. We ask the question: how can we
improve the scaling obtained from DP, by combining MP
and DP to achieve the best possible end-to-end training time
at a given accuracy? The novel insight of this work is that
when the number of devices (and hence global batch size)
grows to a point where scaling from DP slows significantly,
MP should then be used in conjunction with DP to continue
improving training times. The speedup obtainable via MP
is critical to this tipping point. We show that every network
will have a unique scale at which DP’s scaling and statistical
efficiency degradation can be overcome by MP’s speedup.
1We discuss our methodology in Section 4.2 and other possibilities in Section 7.
arXiv:1907.13257v1 [cs.LG] 30 Jul 2019
This work makes the following contributions:
• We show that when DP’s inefficiencies become large,
a hybrid parallelization strategy where each parallel
worker is model parallelized across multiple devices
will further scale multi-device training.
• We develop an analytical framework to systematically
find this cross-over point (in terms of number of de-
vices - e.g., GPUs or TPUs - used to train a model) that
indicates which parallelization strategy to use when
optimizing training of a model, on a particular system.
• We show that hybrid parallelization outperforms DP
alone at different scales for different DL networks. We
implement 2-way model parallel versions of Inception-
V3, GNMT, and BigLSTM, and project that using
them, hybrid training provides a speedup of at least
26.5%, 8%, and 22% respectively above DP-only train-
ing at scale.
• We propose DLPlacer, an integer linear programming
based tool to find optimal operation-to-device place-
ment to maximize MP speedup. We demonstrate
DLPlacer’s effectiveness by using it to derive an opti-
mal placement for the Inception-V3 model (Szegedy
et al., 2015), showing the obtained 1.32x model parallel
speedup with two GPUs is within 6% of that predicted
by the tool.
2. Background
Neural Network Training: In neural network training, first
a batch of inputs is forward propagated through the network
to calculate the losses from each input. The losses are back
propagated through the network to compute the gradients.
The average of the batch’s gradients is then used to update
the weights. The size of the batch is chosen such that the
compute resource of the device used for training is fully
utilized. This process is called stochastic batch gradient
descent (Gardner, 1984; Zinkevich et al., 2010). One for-
ward and backward pass with gradient update to the weights
is typically referred to as a training step. One iteration
through the training data set where all inputs are processed
once involves multiple steps and is referred to as an epoch.
The training process, shown in Figure 1, is run for multiple
epochs until a desired training accuracy is reached.
Data Parallel Training: To accelerate training using DP,
a full set of model parameters (i.e., weights) is replicated
across multiple devices/workers. As Figure 2a shows, each
worker first performs a forward and backward pass independently
on a different batch of inputs. Gradients are then
communicated across workers and averaged, after which
each worker applies the same set of gradient values to the
model weights. The communication of the gradients across
the workers is done using all-reduce communication. The
method of updating the model after each iteration (using the
average of all gradients) is called synchronous stochastic
gradient descent (sync-SGD) and is the most widely used
technique for data parallel training. In this work, we call the
batch of inputs per worker a mini-batch and the collection
of all the mini-batches in a training step a global batch.
Figure 1. Deep learning training flow
Figure 2. Different Training Parallelization Strategies: (a) Data parallel training, (b) Model parallel training
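The sync-SGD update described above can be sketched in a few lines. This is a minimal single-process illustration, not the implementation used in this work: the function names, the gradient values, and the learning rate are hypothetical, and the all-reduce is modelled as a plain element-wise average.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise average of per-worker gradients.

    Stands in for the result every worker holds after an all-reduce
    followed by division by the number of workers.
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]


def sync_sgd_step(weights: List[float],
                  worker_grads: List[List[float]],
                  lr: float) -> List[float]:
    """Apply the same averaged gradient to the replicated weights."""
    avg_grad = all_reduce_mean(worker_grads)
    return [w - lr * g for w, g in zip(weights, avg_grad)]


# Hypothetical values: two workers, each with a gradient from its own mini-batch.
weights = [1.0, -2.0]
worker_grads = [[0.2, 0.4], [0.6, 0.0]]
new_weights = sync_sgd_step(weights, worker_grads, lr=0.5)
```

Because every worker applies the same averaged gradient, all replicas hold identical weights after the step, which is the property that distinguishes sync-SGD from asynchronous updates.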
Model Parallel Training: The DL model is split by placing
different operations of its DFG onto different devices.
This approach has been traditionally used for models whose
parameters will not fit into a single device’s memory (Wu
et al., 2016; Krizhevsky et al., 2017). However, MP can
provide per step training speedup (Mirhoseini et al., 2018;
Dean et al., 2012) even when the entire model fits on one
device by executing independent operations concurrently on
separate devices, as shown in Figure 2b. Splitting a DFG
among multiple devices is non-trivial for many networks.
The communication overhead of moving data between de-
vices may be so large that it may outweigh the gains of MP.
Thus, when dividing a network’s DFG, characteristics such
as compute intensity of each device, inter-device network
bandwidth, and even network topology must be considered.
An alternative approach to obtaining speedup is to split up
a model across multiple devices using pipelining (Huang
et al., 2018). This enables splitting a model across multiple
devices when a model does not have parallel branches and
is sequential in nature. Networks are partitioned into groups
containing one or a few layers of the network, where each
group is placed on a different device. To orchestrate parallel
execution, a mini-batch is split into yet smaller micro-batches,
and the devices work concurrently, each processing a
different micro-batch at any given time. While subtly distinct, for the
purposes of this work we consider pipeline parallelism as
an implementation instance of MP.
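To make the micro-batch idea concrete, the following sketch prints a simple fill-and-drain schedule for the forward pass only. It assumes that every stage takes one time slot per micro-batch and ignores the backward pass and communication; the function name and the stage and micro-batch counts are hypothetical.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int,
                      num_micro_batches: int) -> List[List[Optional[int]]]:
    """Return, for each time slot, the micro-batch index handled by each stage.

    None marks an idle stage. Stage s can start micro-batch b only after
    stage s - 1 has finished it, so stage s works on micro-batch t - s at slot t.
    """
    total_slots = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_slots):
        row = []
        for s in range(num_stages):
            mb = t - s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        schedule.append(row)
    return schedule


# Hypothetical configuration: a model split over 2 devices, 4 micro-batches.
schedule = pipeline_schedule(num_stages=2, num_micro_batches=4)
for slot, row in enumerate(schedule):
    print(slot, row)
```

Under these assumptions the two devices overlap in every slot except the first and the last, which is where the per-step speedup of pipelining comes from.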
3. Decomposing End-to-End Training Time
End-to-end training time for a DL model depends on three
factors: the average time per step (T ), the number of steps
per epoch (S) and the number of epochs (E) required to
converge to a desired accuracy. Therefore, the total training
time, i.e., time to converge (C) can be expressed as:
$$C = T \times S \times E \tag{1}$$
T is determined primarily by compute efficiency, i.e.,
given the same training setup, algorithm, and mini-batch
size, T depends solely on the compute capability of a device;
better performing hardware provides smaller T values. S on
the other hand, depends on the global batch size and number
of items in the training dataset. All items in the dataset are
processed once per epoch, therefore the number of steps
per epoch (S) is equal to the number of items in the data
set, divided by the global batch size. The number of epochs
to converge (E) depends on the global batch size and other
training hyper-parameters.
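Equation 1 and the definition of S can be written as a short helper. This is a sketch under stated assumptions: the dataset size, batch size, step time, and epoch count below are hypothetical, and rounding the step count up for a final partial batch is our choice rather than something specified in the text.

```python
import math


def steps_per_epoch(dataset_size: int, global_batch_size: int) -> int:
    """S: number of items in the dataset divided by the global batch size.

    Rounding up (to cover a final partial batch) is an assumption.
    """
    return math.ceil(dataset_size / global_batch_size)


def time_to_converge(time_per_step: float,
                     dataset_size: int,
                     global_batch_size: int,
                     epochs: int) -> float:
    """Equation 1: C = T x S x E, in the same unit as time_per_step."""
    return time_per_step * steps_per_epoch(dataset_size, global_batch_size) * epochs


# Hypothetical numbers chosen only for illustration.
S = steps_per_epoch(dataset_size=1_280_000, global_batch_size=256)
C = time_to_converge(time_per_step=0.5, dataset_size=1_280_000,
                     global_batch_size=256, epochs=4)
```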
3.1. Quantifying Data Parallel Training Time
In data parallel training, the network parameters (weights)
are replicated across multiple worker devices and each
worker performs a forward and a backward pass individ-
ually on a distinct batch of inputs (shown in Figure 2a).
In this work, we focus on synchronous stochastic gradient
descent (sync-SGD) for weight updates. In sync-SGD,
workers are synchronized, i.e., the gradients from workers
are shared and network parameters are updated such that
all workers have the same parameters after each step. An
alternative approach uses asynchronous updates, usually
with a parameter server. When scaling to a large number of
devices, this approach performs poorly (Chen et al., 2016).
Therefore, we use a ring-based all-reduce mechanism for
data parallel training which provides superior performance
and scalability over parameter server based approaches and
primarily supports sync-SGD. We call the batch of inputs
per worker a mini-batch and the collection of all the mini-
batches in a training step a global batch. When using DP
alone to accelerate training, the speedup from employing
N-way data parallelism ($SU_N$) compared to training on a
single device can be expressed as:
$$SU_N = \frac{T_1}{T_N} \times \frac{S_1}{S_N} \times \frac{E_1}{E_N} \tag{2}$$
$T_1$ is the average training time per step when only one device
is used for training, while $T_N$ is the time per step when
N data parallel devices (with a constant mini-batch size per
device) are used. $T_N$ is always larger than $T_1$ because in DP,
after each device has performed a forward and backward
pass, the gradients must be exchanged between the devices
using all-reduce communication (see Figure 2a)2. Due to
this communication overhead, $T_1/T_N$ will never be larger than
one and is typically less than one. We call this ratio $T_1/T_N$
the scaling efficiency ($SE_N$) of N-way DP.
S1 is the total number of steps required per epoch when one
device is used, while SN is the number of steps per epoch
when N devices are used. When a single device is used, the
global batch size is equal to the mini-batch size. In N-way
data parallelism each device performs an independent step
with its own mini-batch of data, therefore the global batch
size is N-times the mini-batch size per device. Thus, $S_1/S_N$ is
also equal to N.
E1 is the number of epochs required to converge when
one device is used, while EN is the number of epochs re-
quired when N devices are used. At larger global batch sizes
(higher N), the gradients from a larger number of training
samples are averaged which results in model over-fitting as
well as a tendency to get attracted to local minima or saddle
points. This eventually leads to poor generalization of the
network (Hoffer et al., 2017; Goyal et al., 2017; Smith et al.,
2017; Jastrzebski et al., 2017; Li et al., 2014; Keskar et al.,
2016) and therefore more epochs are typically required to
converge. As such, $E_1/E_N$ is usually less than one. Equation 2
can thus be simplified as:
$$SU_N = SE_N \times N \times \frac{E_1}{E_N} \tag{3}$$
When training at larger device counts (N), both $SE_N$ and
$E_1/E_N$ decrease. At large global batch sizes, hyper-parameter
tuning (which is a challenging and time consuming task)
can be used to try and minimize the increase in number
of epochs required for convergence. However, for any particular
network, beyond a certain global batch size it has
often been observed that the number of epochs required
to converge increases rapidly, even with hyper-parameter
tuning (Goyal et al., 2017). We describe how we calculate
the values for SEN , E1, and EN in detail in Section 4.
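Equation 3 is simple enough to evaluate directly. The sketch below is illustrative only: the function name and all numeric inputs are hypothetical and are not measurements from this work.

```python
def dp_speedup(n_devices: int,
               scaling_efficiency: float,
               epochs_1: float,
               epochs_n: float) -> float:
    """Equation 3: SU_N = SE_N x N x (E_1 / E_N)."""
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling efficiency must lie in (0, 1]")
    return scaling_efficiency * n_devices * (epochs_1 / epochs_n)


# Hypothetical case 1: perfect scaling and no statistical efficiency loss.
ideal = dp_speedup(n_devices=16, scaling_efficiency=1.0, epochs_1=5, epochs_n=5)

# Hypothetical case 2: 4x the devices, but SE_N drops to 0.9 and the
# epoch count grows from 5 to 9.
degraded = dp_speedup(n_devices=64, scaling_efficiency=0.9, epochs_1=5, epochs_n=9)
```

In the second case the device count grows fourfold while the speedup only doubles, which is the kind of diminishing return that both $SE_N$ and $E_1/E_N$ contribute to.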
3.2. Quantifying Model Parallel Training Time
As shown in Figure 2b, MP enables more than one device
to work on the same mini-batch at the same time. This
directly reduces the time taken for one training step; term T
in Equation 1. We call this speedup from M-way MP $SU^M$,
and it can be measured using real hardware by splitting
a model across multiple devices and measuring per step
execution time or estimated using a numerical model. Note
that the $SU^M$ speedup already includes the communication
cost of data movement between dependent operations placed
across multiple devices.
2Additionally, text and speech networks often exhibit straggler
effects where processing some mini-batches takes longer than others
and therefore, in sync-SGD, devices with a shorter execution time
of a mini-batch will suffer from under-utilization (Hashemi et al.,
2018).
As previously noted, the global batch size does not increase
when employing MP. Therefore, the number of steps per
epoch (term S in Equation 1) and number of epochs required
to converge (term E in Equation 1) do not change. As such,
improving $SU^M$ reduces convergence time by solely reducing
term T in Equation 1 while the other two terms remain
constant. We find that typically the inherent parallelism of
a given model, or its implementation, limits the achievable
$SU^M$. As a result, MP alone has not been considered a broadly
applicable scalable parallelization strategy. However, we
show that MP can be combined with DP to extend training
scalability beyond today’s limits.
3.3. Hybrid Data and Model Parallel Training:
In Section 3.1, we introduced the speedup obtained by N-way
DP in Equation 3. Now, let’s assume we have scaled
our training system up to N devices using N-way DP and
are happy with the training speedup achieved. If additional
devices (say M×N devices, where M is an integer) were
to become available for training, how should we best use
these devices for distributed training? Our goal is to identify
when to continue to use DP alone, and when to combine
DP with MP to obtain the highest possible training speedup.
Using DP alone, the speedup from M×N devices compared
to one device is (substituting M×N for N in Equation 3):
$$SU_{M \times N} = SE_{M \times N} \times M \times N \times \frac{E_1}{E_{M \times N}} \tag{4}$$
A few observations are important when comparing the
speedup from M×N-way DP (Equation 4) and speedup
from N-way DP (Equation 3): First, scaling efficiency is
generally lower for the system with M×N-way DP compared
to N-way DP (Thakur et al., 2005; Patarasuk &
Yuan, 2009). This is because all-reduce communication happens
between a larger number of devices. Depending on the
values of N, M, and system configuration, all-reduce communication
potentially crosses slower inter-node links, which
leads to increased all-reduce times and reduces $SE_{M \times N}$ (Shi
& Chu, 2017; NCCL, 2018). Second, since global batch size
is larger at M×N devices (to maintain a constant mini-batch
size), the number of steps per epoch is smaller by a factor
M compared to N-way DP. Third, the number of epochs required,
$E_{M \times N}$, is greater than or equal to $E_N$. These factors
all trend towards lower efficiency as the number of devices
employed in DP training grows.
When using M×N devices in a hybrid parallelization
strategy of N-way DP where each worker uses M-way MP,
we consider each worker’s per step speedup to be $SU^M$.
Thus the overall training speedup can be expressed as:
$$SU^M_N = SU^M \times SE_N \times N \times \frac{E_1}{E_N} \tag{5}$$
Figure 3. An example plot showing the speedup obtained from DP
alone, and the hybrid strategy. N refers to the total number of
devices used for training.
When comparing hybrid N-way DP with M-way model
parallel workers, versus N-way DP with single GPU work-
ers, the global batch size will remain the same. This is
because in the M×N-device configuration, every M devices
are grouped into a single data-parallel worker. Thus, the
number of steps per epoch remains the same as that of N-way
DP, and $E_1/E_N$ remains unchanged as well. As such,
the per-step speedup achieved through MP increases the
overall training speedup by a factor of $SU^M$, when comparing
Equations 3 and 5.
3.4. Choosing the Best Parallelization Strategy
By substituting Equations 4 and 5 into Equation 6 we can
determine the conditions under which using hybrid paral-
lelization will be better than DP scaling alone. Equation 6
shows that if the speedup obtained from MP (for a given
model parallelization step) is large enough to overcome
the scaling and statistical efficiency loss that comes from
increased communication, synchronization overhead, and
global batch size increase respectively, employing a hybrid
MP and DP strategy will improve network training time.
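$$
\begin{aligned}
SU^M_N &> SU_{M \times N} \\
SU^M \times SE_N \times N \times \frac{E_1}{E_N} &> SE_{M \times N} \times M \times N \times \frac{E_1}{E_{M \times N}} \\
SU^M &> M \times \frac{SE_{M \times N}}{SE_N} \times \frac{E_N}{E_{M \times N}}
\end{aligned}
\tag{6}
$$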
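The condition in Equation 6 can be checked numerically against Equations 4 and 5. The following is a minimal sketch: the function names are ours, and every input value is hypothetical, chosen only to exercise the formulas.

```python
def dp_only_speedup(m: int, n: int, se_mn: float, e1: float, e_mn: float) -> float:
    """Equation 4: speedup of (M x N)-way DP over a single device."""
    return se_mn * m * n * (e1 / e_mn)


def hybrid_speedup(su_m: float, n: int, se_n: float, e1: float, e_n: float) -> float:
    """Equation 5: N-way DP in which each worker uses M-way MP."""
    return su_m * se_n * n * (e1 / e_n)


def mp_speedup_threshold(m: int, se_n: float, se_mn: float,
                         e_n: float, e_mn: float) -> float:
    """Right-hand side of Equation 6: the MP speedup needed for hybrid to win."""
    return m * (se_mn / se_n) * (e_n / e_mn)


# Hypothetical inputs (not measurements from this work).
M, N = 2, 32
SE_N, SE_MN = 0.95, 0.85
E1, E_N, E_MN = 8.0, 10.0, 14.0
SU_M = 1.30

threshold = mp_speedup_threshold(M, SE_N, SE_MN, E_N, E_MN)
use_hybrid = SU_M > threshold

# Cross-check: the inequality agrees with comparing the two speedups directly.
hybrid = hybrid_speedup(SU_M, N, SE_N, E1, E_N)
dp_only = dp_only_speedup(M, N, SE_MN, E1, E_MN)
```

Note that $E_1$ and N cancel out of the comparison, so the decision depends only on M, the two scaling efficiencies, and the two epoch counts.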
Figure 3 illustrates this concept using a hypothetical sce-
nario. Let’s assume implementing MP provides a 45% and
65% improvement with two and four GPUs respectively.
The DP-only strategy scales well up to 32 devices after
which the improvement in speedup slows down. This enables
a hybrid 32-way DP & 2-way MP parallelization
strategy to perform better than 64-way DP given the
scaling and statistical efficiency losses at 64 devices, for this
example.
Similarly, a hybrid 16-way DP & 4-way MP strategy
outperforms DP-only when scaling from 32 to 128 devices.
However in this example, this hybrid strategy’s performance
is not as good as the hybrid strategy of 32-way DP & 2-way
MP. The reason is that 4-way MP’s per step speedup (SU4)
does not overcome the trade-off (of using four machines for
each data-parallel worker) as efficiently as 2-way MP’s per
step speedup, SU2 (when using two machines for each data-
parallel worker). Depending on these relative improvements
at any device count, the choice of parallelization strategy is
critical to the training speedup obtained when scaling to a yet
larger number of devices. This choice depends on the DL
network’s properties and system configuration parameters
as described above, so there is no one-size-fits-all solution
to efficient scale-out multi-device training.
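The comparison between 2-way and 4-way MP can be illustrated by enumerating the ways to use a fixed device budget. This sketch reuses the 45% and 65% MP improvements from the hypothetical scenario above; the epoch table, the value of $E_1$, and the assumption $SE = 1$ are invented for illustration and do not come from Figure 3.

```python
from typing import Dict, Tuple


def best_strategy(total_devices: int,
                  mp_speedup: Dict[int, float],
                  epochs: Dict[int, float],
                  e1: float,
                  se: float = 1.0) -> Tuple[Tuple[int, int], Dict[Tuple[int, int], float]]:
    """Evaluate Equation 5 for every (N, M) with N * M == total_devices.

    mp_speedup maps M to SU^M (M = 1 means DP-only, so SU^1 = 1.0).
    epochs maps the DP width N to the epochs needed at that global batch size.
    """
    results = {}
    for m, su_m in mp_speedup.items():
        if total_devices % m != 0:
            continue
        n = total_devices // m
        if n not in epochs:
            continue
        results[(n, m)] = su_m * se * n * (e1 / epochs[n])
    best = max(results, key=results.get)
    return best, results


# MP speedups from the hypothetical scenario; the epoch counts are made up.
mp_speedup = {1: 1.0, 2: 1.45, 4: 1.65}
epochs = {16: 10.0, 32: 10.0, 64: 18.0}

best, results = best_strategy(total_devices=64, mp_speedup=mp_speedup,
                              epochs=epochs, e1=10.0)
```

With these inputs, 4-way MP loses to 2-way MP because going from two to four devices per worker halves the DP width while raising the per-step speedup only from 1.45 to 1.65.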
4. Methodology
We use the following DL models in our evaluations with
their default hyper-parameters, unless otherwise specified:
• Inception-V3 (Szegedy et al., 2015) is used for image
recognition and visual feature extraction. The network
is composed of multiple blocks, each with several
branches of convolution and pooling operations.
These branches can be executed in parallel. We use
the implementation provided with the public NVIDIA
TensorFlow container 18.07 (NVIDIA Container, 2018)
and train the network using the ImageNet dataset (Deng
et al., 2009). We scale the initial learning rate linearly
with the increase in global batch size, as originally proposed
by Goyal et al. (2017). For measuring
epoch counts, we train the model until a training
loss of 6.1 is achieved.
• GNMT (Wu et al., 2016) is a language translation network
with an attention mechanism (Wu et al., 2016; Bahdanau
et al., 2014). We use 4 LSTM layers of size
1024 in the encoder and decoder. We use the public
repository at (Migacz, 2018) as the basis of our implementation.
We use exponential learning rate warm-up
for 200 training steps. Learning rate decay starts
after 6000 steps; the rate is then decayed four times in total,
once every 500 iterations, with a decay factor of 0.5.
Such a technique has been shown to scale well when
global batch size is scaled. We train the network using
the WMT’16 German-English dataset (Guillou et al.,
2016) until a BLEU score of 21.8 is achieved.
• BigLSTM (Józefowicz et al., 2016) is a large scale
language modelling network. It consists of an input
embedding layer of size 1024, 2 LSTM layers with
hidden state size of 8192, and a Softmax projection
layer of size 1024. We implemented the network in
the public NVIDIA PyTorch container v19.06, used
a learning rate of 0.1, and trained using the 1 billion
word language modelling dataset to a perplexity of 67.
4.1. System Configuration and Evaluation Points
For our experiments, we use an NVIDIA DGX-1 (NVIDIA,
2018a) with 4 Tesla V100 GPUs (NVIDIA, 2018), each with 16GB
of memory capacity, connected via NVLink (NVIDIA, 2018b).
In the BigLSTM experiments we used a similar
system but with GV100 cards having 32GB of memory, because
this network requires more capacity to execute on a
single GPU. We use NCCL 2.0 based all-reduce communication
for gradient sharing.
In order to project when hybrid training will perform bet-
ter than DP alone, we need to measure the epoch counts
to convergence and scaling efficiency for DP (defined in
Section 3.1) for different GPU counts. We also require the
speedup achieved via MP when M GPUs are used for a
model-parallel worker in a hybrid strategy. Without loss of
generality, we use M = 2 for the DL models we use
to make a case for future hybrid parallelization strategies.
The value chosen for M for an arbitrary DL model will
always depend on the speedup obtained from M-way MP
and the loss in scaling efficiency that the DP implementation
incurs.
4.2. Measuring Epoch Counts to Convergence
Typically, epoch counts to convergence for DP on N compute
nodes are obtained by running the training on N nodes.
We select mini-batch sizes to saturate single GPU through-
put or lower if the desired mini-batch size is limited by GPU
memory capacity. We perform experiments on a 4-GPU
NVIDIA DGX system, so the maximum global batch size
possible to measure is 4×B, where the mini-batch size is
B. To emulate larger global batch sizes (corresponding to
more than four GPUs), we use the delayed gradient update
approach (Ott et al., 2018) where multiple mini-batches
are processed per GPU before the gradients are shared for
weight update. For example, to emulate a batch size of
16×B that would be used in a 16 GPU system, each GPU
runs the forward and backward propagation of four mini-
batches before the GPUs share the gradients (using NCCL
2.0 based all-reduce (NCCL, 2018)) and update weights.
This methodology allows us to measure the effect of global
batch size on the epoch counts required for reaching a de-
sired accuracy, at higher device counts than we have in our
physical system. It is worth noting that even though we
complete training of a DL model to find EN , in practice,
many DL models are often re-trained many times during
development or as new data becomes available. Our pro-
posed systematic modelling approach helps find the best
parallelization strategy for optimizing the turnaround time
of such subsequent training runs.
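The arithmetic behind the delayed gradient update can be checked on a toy problem. The sketch below is a single-process illustration with a hypothetical one-parameter model and made-up data; it covers only the accumulation step (not the all-reduce across GPUs) and assumes equal-sized mini-batches.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, mini_batches: List[List[Sample]]) -> float:
    """Average per-mini-batch gradients before a single weight update.

    With equal-sized mini-batches this equals the gradient over the
    concatenated (larger) batch, which is what the emulation relies on.
    """
    return sum(grad_mse(w, mb) for mb in mini_batches) / len(mini_batches)


# Made-up data: four samples processed as two mini-batches of two.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
mini_batches = [data[:2], data[2:]]
w = 0.5

g_accum = accumulated_grad(w, mini_batches)  # delayed update over 2 mini-batches
g_full = grad_mse(w, data)                   # one batch that is twice as large
```

The two gradients coincide, so delaying the update changes the effective global batch size without requiring additional devices.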
Learning rate schedules are sometimes optimized to keep
epoch counts to convergence low at large global batch sizes.
For example, the learning rate schedules we use for GNMT
and Inception-V3 were tuned for this purpose.
However, in general, hyper-parameter tuning is time consuming
and requires many training runs. Similar to prior
work (Goyal et al., 2017), we find that even with such tuning,
beyond a certain global batch size, the number of epochs required
to converge increases rapidly. As such, the proposals
of this work are orthogonal to such efforts.
Figure 4. Number of epochs required for the networks to converge
versus increasing global batch size with increase in the number
of GPUs. We emulated larger global batch sizes corresponding to
a larger number of GPUs using the technique described in Section 4.2.
4.3. Estimating Scaling Efficiency
Unlike the methodology we use for emulating larger global
batch sizes than what our physical system allows, we cannot
obtain the scaling efficiency ($SE_N$) of data parallel training
on a larger number of GPUs when using just four. Thus,
we conservatively assume a scaling efficiency ($SE_N$) of 1,
i.e., the time overhead of communication and synchronization
after each step is negligibly small compared to the time
taken for the forward and backward passes. This assumption,
which is optimistic for DP, minimizes the projected benefit of
hybrid parallelization, but reflects the reality that framework
developers are constantly working to reduce overheads that
hinder DP scaling efficiency. In fact, for CNNs such as ResNet-50,
relative scaling efficiency of > 95% has been achieved for 2048-way
DP (Yamazaki et al., 2019).
4.4. Model Parallel Splitting
Inception-V3’s implementation allows a traditional model
parallel mapping of independent operations to different
GPUs. As such, we split the model’s DFG across two
GPUs using DLPlacer, later described in Section 6. We
observed that beyond 2-way splitting of the Inception-V3
DFG, the MP speedup is marginal (see Figure 8). For
GNMT and BigLSTM, we split their DFGs using pipeline
parallelism (Huang et al., 2018). Pipeline parallelism is
appropriate for implementing MP on these networks due
to the use of optimized libraries and fused RNN kernels in
their implementations. Pipelining could similarly be useful
for models which do not have parallel branches and are
sequential in nature (e.g., ResNet, AmoebaNet).
It is worth noting that the original GNMT implementa-
tion (Wu et al., 2016) uses 8-way MP. However, since we
use a system with V100 GPUs that have 14x more FLOPs
compared to the K80 GPUs used in that prior work, the
ratio of communication overhead to computation is larger
in our configuration. We use up-to-date cuDNN libraries
with fused RNN kernels and observe that splitting the model
beyond 2-way provides marginal per-step speedup because
of kernel overheads and pipeline imbalance.
5. Evaluation
Figure 4 shows the number of epochs required to hit the
desired accuracy versus the number of GPUs (workers) used
in data parallel training. The number of epochs generally
increases with an increasing number of GPUs (i.e., with
increasing global batch size). For Inception-V3, the number
of epochs increases sharply from four to seven as the global
batch size increases beyond 2048 (i.e., 32 GPUs) and grows
to 23 epochs at a global batch size of 16384 (i.e., 256 GPUs).
For GNMT, the epoch count decreases slightly when going
from two to four GPUs because the hyper-parameters used
are tuned for large global batch sizes. Even with these tuned
hyper-parameters, as the GPU count increases beyond 64,
the number of epochs required grows rapidly. In BigLSTM,
beyond 16 GPUs (i.e., global batch size of 2048), the num-
ber of epochs increases rapidly and in fact, 3.2 times the
number of epochs is required for 32-way DP compared to
16-way DP. Beyond 32-way DP, training did not converge
within a meaningful time limit. Overall, as we scale up the
number of GPUs used in DP training, $E_1/E_N$ becomes smaller,
which ultimately hinders the overall speedup achievable
through data parallel training alone.
As described in Section 3, splitting each network across
two GPUs using model parallelism results in per-step
speedup when done successfully. Table 1 shows the mea-
sured MP speedups on our test system for our evaluated
networks. Using the number of epochs required and per step
speedup from MP, together with the conservative estimates
of scaling efficiency, we can then calculate the minimum
projected speedup (over DP-only) that can be obtained by
implementing a hybrid parallelization strategy across dif-
ferent GPU counts. It is worth noting that the MP speedup
achieved on Inception-V3 using expert manual placement
of operations was 21%. In Section 6 we discuss DLPlacer, a
tool we developed for optimizing operation-to-device place-
ment, which improves the MP speedup for Inception-V3 to
32%.
Inception-V3 As shown in Figure 5a, beyond 32 GPUs,
a hybrid parallelization strategy performs better than DP-only.
This is because of the sharp increase in the number of
epochs required when the global batch size grows beyond
2048, which saturates the speedup obtainable from DP-only
parallelization. When moving from 32 GPUs to 64 GPUs, it
is better to use the additional 32 GPUs to do 2-way MP, and
our estimates show that the hybrid strategy will outperform
DP alone by at least 15.5%. As the number of GPUs grows
further, only marginal speedup can be obtained from DP-only
parallelization and at 256 GPUs, the hybrid strategy
will be at least 26.5% better than the DP-only strategy.
Figure 5. Projected speedup of hybrid MP-DP parallelization vs DP-only parallelization: (a) Inception-V3, (b) GNMT, (c) BigLSTM
Table 1. MP splitting strategy and the speedup obtained when split
across 2 GPUs
Network        MP splitting strategy      Speedup
Inception-V3   Partitioned w/ DLPlacer    1.32x
GNMT           Pipeline Parallelism       1.15x
BigLSTM        Pipeline Parallelism       1.22x
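As a hand check of the 15.5% projection for Inception-V3 at 64 GPUs, Equation 6 can be rearranged into the ratio of hybrid to DP-only speedup. The calculation below assumes that the epoch counts quoted in this section correspond to $E_{32} = 4$ and $E_{64} = 7$; this mapping of epoch counts to device counts is an assumption made for illustration. It takes $SU^M = 1.32$ from Table 1 and sets $SE_N = SE_{M \times N} = 1$ as in Section 4.3.

```latex
\frac{SU^{M}_{N}}{SU_{M \times N}}
  = \frac{SU^{M}}{M} \times \frac{SE_{N}}{SE_{M \times N}} \times \frac{E_{M \times N}}{E_{N}}
  = \frac{1.32}{2} \times 1 \times \frac{7}{4}
  \approx 1.155
```

Under these assumptions, the hybrid configuration of 32-way DP with 2-way MP comes out roughly 15.5% ahead of 64-way DP.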
GNMT As shown in Figure 5b, GNMT scales very well
to a large number of GPUs using DP alone. However, even
with tuned hyper-parameters for larger batch sizes, DP-only
speedup starts to slow down beyond 64 GPUs and dramatically
slows down when moving from 128 to 256 GPUs. The
hybrid parallelization strategy with 2-way MP and 128-way
DP outperforms a 256-way DP strategy by 8%. If the hyper-parameters
had not been tuned for large batch sizes,
the gains from hybrid parallelism would be larger and the
tipping point would occur at a lower number of GPUs.
BigLSTM As shown in Figure 5c, beyond 16 GPUs,
BigLSTM does not scale well with an increasing number
of GPUs. This is because the statistical efficiency of train-
ing decreases rapidly with increasing global batch size, and
therefore the significantly larger number of required epochs
offsets the throughput increase of multiple GPUs. At 32
GPUs, the large loss in statistical efficiency impacts the
overall training speedup of the DP-only strategy, and the
speedup drops significantly. As a result, the hybrid policy
provides a 1.22x speedup over the best-performing scale of
DP-only, which occurs at 16 GPUs, as Figure 5c shows.
In summary, these results show that when statistical ef-
ficiency loss reduces the effectiveness of DP-only paral-
lelization, hybrid parallelization (combining DP with MP)
will enable higher performance than employing DP alone.
Notably, using real scaling efficiency loss values (we
conservatively assumed $SE_N = 1$), the improvements from
hybrid parallelization would be more pronounced, since
$SE_{2N}/SE_N$ is often smaller than 0.9 for large LSTM-based
networks. Based on Equation 6, the smaller this ratio, the
higher the speedup from hybrid parallelism ($SU^M_N$)
compared to data parallelism alone ($SU_{M \times N}$).
Figure 6. DLPlacer flow diagram.
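The comparison summarized above can be illustrated numerically. Equation 6 is not reproduced in this section, so the sketch below assumes a simple model in which N-way DP speedup is N times the scaling efficiency times the ratio of single-device epochs to N-device epochs, and hybrid speedup multiplies that by the per-step MP speedup. The function names and the epoch counts are hypothetical and chosen only for illustration; only the 1.32x MP speedup comes from Table 1.

```python
def dp_speedup(n_devices, scaling_eff, epochs_1, epochs_n):
    """Assumed end-to-end speedup of N-way DP over a single device."""
    return n_devices * scaling_eff * (epochs_1 / epochs_n)


def hybrid_speedup(mp_speedup, n_workers, scaling_eff, epochs_1, epochs_n):
    """Assumed speedup of N-way DP in which every worker uses M-way MP."""
    return mp_speedup * dp_speedup(n_workers, scaling_eff, epochs_1, epochs_n)


# Hypothetical epochs-to-converge per DP width (not measurements from the paper).
epochs = {1: 4.0, 32: 5.0, 64: 8.0}

# 64 devices used as 64-way DP, with scaling efficiency conservatively set to 1.
dp_64 = dp_speedup(64, 1.0, epochs[1], epochs[64])

# The same 64 devices used as 32-way DP with 2-way MP (1.32x per step).
hybrid_32x2 = hybrid_speedup(1.32, 32, 1.0, epochs[1], epochs[32])

print(f"64-way DP: {dp_64:.2f}x, 32-way DP x 2-way MP: {hybrid_32x2:.2f}x")
```

Under these assumed numbers the hybrid configuration wins because the epoch count grows faster than the device count doubles; lowering the scaling efficiency passed for the 64-way case would widen the gap further.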
6. Maximizing MP Performance
Maximizing the speedup obtained from MP for a given
model improves the scalability of hybrid parallelism. For
some networks, optimal placements are easy to achieve by
examining a network’s DFG. For others, finding the optimal
operation-to-device placement that results in the maximum
per-step speedup is non-trivial. To this end, we developed
an integer-linear programming (ILP) based device place-
ment tool called DLPlacer. DLPlacer maximizes resource
utilization by extracting parallelism between operations in a
model while also minimizing the communication overhead
of moving data between the compute nodes.
Figure 6 shows DLPlacer’s tool flow. We express a DL
model as a compute DFG, with a set of vertices K corre-
sponding to compute operations and a set of uni-directional
edges $E$ showing operation dependencies. For example, for
some $k_1, k_2 \in K$ and $e_{k_1,k_2} \in E$, $e_{k_1,k_2} = 1$ means $k_2$ is
dependent on $k_1$. The expected execution time of a vertex is
represented as $\Delta(k)$ and the memory footprint of the vertex
for a given batch size is represented as $M(k)$. The edge weight
$D(e)$ corresponds to the number of bytes exchanged between
the operations it connects. The node and edge weights
can be obtained by profiling a model on a compute device
(e.g., GPU) or can be analytically calculated, with the for-
mer approach being more robust and the latter more flexible.
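A compute DFG of this kind can be held in a few dictionaries. The following is a minimal sketch, not DLPlacer's actual data structure; the class name `ComputeDFG`, its fields, and the four-operation example graph with its weights are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeDFG:
    """Toy compute dataflow graph using the section's notation."""
    exec_time: dict = field(default_factory=dict)   # Delta(k): seconds per vertex
    mem_bytes: dict = field(default_factory=dict)   # M(k): bytes per vertex
    edge_bytes: dict = field(default_factory=dict)  # D(e): bytes per (src, dst) edge

    def add_op(self, name, exec_time, mem_bytes):
        self.exec_time[name] = exec_time
        self.mem_bytes[name] = mem_bytes

    def add_edge(self, src, dst, num_bytes):
        # Corresponds to e_{src,dst} = 1: dst depends on src.
        self.edge_bytes[(src, dst)] = num_bytes

    def predecessors(self, name):
        return [s for (s, d) in self.edge_bytes if d == name]


# Hypothetical graph: one stem feeding two parallel branches that are merged.
dfg = ComputeDFG()
dfg.add_op("stem", 2e-3, 64e6)
dfg.add_op("branch_a", 3e-3, 32e6)
dfg.add_op("branch_b", 3e-3, 32e6)
dfg.add_op("concat", 1e-3, 16e6)
dfg.add_edge("stem", "branch_a", 8e6)
dfg.add_edge("stem", "branch_b", 8e6)
dfg.add_edge("branch_a", "concat", 4e6)
dfg.add_edge("branch_b", "concat", 4e6)

print(sorted(dfg.predecessors("concat")))
```

The weights could equally be filled in from a profiler trace or from an analytical cost model, matching the two options described above.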
Using similar notation, we express a system as a hardware
graph (Nowatzki et al., 2013a;b) where a set of compute
nodes $N$ (e.g., GPUs) and router nodes $R$ (network switches)
are connected through a set of physical links $L$. As an
example, for $n_1, n_2 \in N$ and $l_{n_1,n_2} \in L$, $l_{n_1,n_2} = 1$ means
nodes $n_1$ and $n_2$ are connected. We assume physical links
are bidirectional, so $l_{n_1,n_2} = l_{n_2,n_1}$. The bandwidth of the
physical link is denoted by $B(l)$.
DLPlacer’s ILP solver minimizes per-step training time
by providing an assignment of compute DFG operations
onto the hardware graph, a schedule, and a communication
routing of activations, weights, and gradients. This is done
by mapping the model DFG’s vertices to compute nodes,
mapping dependency edges to physical links (if dependent
vertices are placed on separate devices), and determining the
execution start time of each vertex on a device. This mapping
must satisfy a series of constraints, which are described
next. A summary of all the variables is provided
in Table 2.
Table 2. Summary of notations used in the ILP
Notation      Meaning
Inputs: Computation DFG
K             Set of compute operation/kernel vertices
E             Set of edges between vertices
∆(K)          Expected execution time of compute vertex
M(K)          Memory footprint of the compute vertex
D(E)          Number of data bytes transferred in an edge
Inputs: Hardware Graph
N             Set of compute hardware nodes
R             Set of router nodes
L             Set of physical links connecting routers and hardware nodes
B(l)          Bandwidth of the physical links
Mem(N)        Device memory capacity
Variables: Outputs
P_kn(K,N)     Mapping of compute vertex to hardware node
T_k(K)        Time a vertex is launched on the hardware
C_el(E,L)     Mapping of dependency edges to physical links
Variables: Intermediate
∆_e(L)        Delay of communication of edge e
Now, we describe the constraints in detail.
Placement of compute operation vertex: If the binary
variable $P_{kn}(k,n) = 1$, then vertex $k$ is mapped to node $n$.
Each operation of the compute DFG should be mapped to
exactly one node on the hardware graph, which gives the
following constraint:
$$\forall k, \quad \sum_{n} P_{kn}(k,n) = 1 \qquad (7)$$
Routing of Activation Data: The output from a vertex
needs to be routed to the dependent vertices through the
physical communication links. Each edge $e$ needs to be
mapped to a sequence of one or more links $l$. The path for
communication must start at the node hosting the origin
vertex and end at the node hosting the destination vertex
if the two vertices are placed on different nodes. Therefore,
for the source and destination nodes, exactly one link
should be allotted for the edge. This constraint
can be formulated as follows:
$$\forall e, n, k_i, k_j \mid e_{k_i,k_j} = 1, \ \text{if } P_{kn}(k_i,n) \neq P_{kn}(k_j,n): \quad \sum_{l \mid l_{n,n_x} = 1} C_{el}(e,l) = 1 \quad \forall n_x \in N \qquad (8)$$
To find a contiguous path, we enforce for all non-source
and non-destination nodes (including router/switch nodes)
that either two links should be allotted, one for the incoming
traffic and one for the outgoing traffic, or no links should be
allotted.
$$\forall e, k_i, k_j \mid e_{k_i,k_j} = 1, \ n \mid P_{kn}(k_i,n) = P_{kn}(k_j,n) = 0: \quad \sum_{l \mid l_{n,n_x} = 1} C_{el}(e,l) = \sum_{l \mid l_{n,n_y} = 1} C_{el}(e,l) \quad \forall n_x, n_y \in N \cup R \qquad (9)$$
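One reading of Equations 8 and 9 is a degree condition on the links selected for an edge: the source and destination nodes each touch exactly one selected link, and every other node touches either two or none. The sketch below checks that condition for a candidate route. The function name `is_valid_route` and the three-node hardware graph are hypothetical, and the check is an interpretation rather than DLPlacer's implementation. Note that a degree condition alone does not rule out a detached cycle of links elsewhere in the graph.

```python
from collections import Counter


def is_valid_route(selected_links, src_node, dst_node):
    """Check the per-node link counts described for Equations 8 and 9.

    selected_links: list of (node_a, node_b) pairs for which C_el(e, l) = 1.
    """
    degree = Counter()
    for a, b in selected_links:
        degree[a] += 1
        degree[b] += 1
    if src_node == dst_node:
        # Co-located vertices need no physical links.
        return len(degree) == 0
    # Source and destination: exactly one allotted link each.
    if degree[src_node] != 1 or degree[dst_node] != 1:
        return False
    # Every other node that is touched must have one link in and one out.
    return all(d == 2 for node, d in degree.items()
               if node not in (src_node, dst_node))


# Hypothetical hardware graph: gpu0 -- switch -- gpu1.
good = [("gpu0", "switch"), ("switch", "gpu1")]
broken = [("gpu0", "switch")]
print(is_valid_route(good, "gpu0", "gpu1"), is_valid_route(broken, "gpu0", "gpu1"))
```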
Scheduling of Vertices: Vertices need to be scheduled
such that their dependencies are met. We calculate the time
at which a vertex can begin executing by considering the
start time of the other vertices it is dependent upon and
the execution time and communication delay of the input
activations.
$$\forall k_{src}, k_{dest}, e \mid e_{k_{src},k_{dest}} = 1: \quad T(k_{dest}) \ge T(k_{src}) + \Delta(k_{src}) + \Delta_e(L) \qquad (10)$$
This equation ensures that a vertex $k_{dest}$ can begin only
after all the vertices it is dependent upon ($k_{src}$) have finished
executing and the input activations have been communicated
to the device where $k_{dest}$ is placed.
$\Delta_e(L)$ is the time to communicate the edge data. The
amount of data that needs to be routed between two vertices
is the total amount of output activation (dependent on
mini-batch size). We assumed that the time for communication
would depend on the number of links it needs to traverse and
the bandwidth and latency of these links. Therefore, with
$L(l)$ denoting the latency of link $l$, $\Delta_e(L)$ can be
computed as follows:
$$\forall e \in E, \quad \Delta_e(L) = \sum_{l \in L} C_{el}(e,l) \cdot \left( \frac{D(e)}{B(l)} + L(l) \right) \qquad (11)$$
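Equations 10 and 11 can be evaluated directly for a fixed placement and routing. The following sketch does so with plain Python; the function names and the bandwidth, latency, and timing values are hypothetical placeholders rather than measured link or kernel parameters.

```python
def comm_delay(num_bytes, route):
    """Equation 11 for one edge: sum of D(e)/B(l) + L(l) over the links used.

    route: list of (bandwidth_bytes_per_s, latency_s), one entry per link with
           C_el(e, l) = 1. An empty route means the vertices are co-located.
    """
    return sum(num_bytes / bandwidth + latency for bandwidth, latency in route)


def earliest_start(predecessors):
    """Equation 10: the tightest lower bound on T(k_dest).

    predecessors: list of (T_src, Delta_src, delay_of_connecting_edge).
    """
    return max((t + d + c for t, d, c in predecessors), default=0.0)


# Hypothetical edge: 8 MB routed over two 25 GB/s links with 5 us latency each.
delay = comm_delay(8e6, [(25e9, 5e-6), (25e9, 5e-6)])

# One remote predecessor (2 ms of compute) and one co-located predecessor (1 ms).
start = earliest_start([(0.0, 2e-3, delay), (0.0, 1e-3, 0.0)])
print(f"delay = {delay * 1e6:.0f} us, earliest start = {start * 1e3:.2f} ms")
```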
Another timing-related constraint comes from the fact
that multiple operations can be mapped to a device, but
co-located vertices cannot execute on that device
at the same time. The start times of consecutive
operations on a device should be separated by at least the
execution time of whichever operation starts earlier.
Note that this constraint is unnecessary for operations
which lie on the dependency path of each other because of
the previous constraint. Therefore, this constraint can be
formulated as follows:
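$$\forall k_x, k_y, n \mid P_{kn}(k_x,n) = P_{kn}(k_y,n) = 1 \ \text{and} \ e_{k_x,k_y} \neq 1: \quad \begin{cases} T(k_x) \ge T(k_y) + \Delta(k_y) & \text{if } T(k_x) > T(k_y) \\ T(k_y) \ge T(k_x) + \Delta(k_x) & \text{otherwise} \end{cases} \qquad (12)$$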
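Equation 12 is written as an if/else, which is not itself a linear constraint. The text does not state how DLPlacer encodes it; one standard option in ILP modeling is a big-M formulation, sketched below. The binary ordering variable $y_{xy}$ and the constant $M_{\text{big}}$ are assumptions introduced here, not symbols from the paper. $M_{\text{big}}$ must exceed any feasible schedule length.

```latex
\begin{aligned}
T(k_x) &\ge T(k_y) + \Delta(k_y) - M_{\text{big}}\,(1 - y_{xy}) \\
T(k_y) &\ge T(k_x) + \Delta(k_x) - M_{\text{big}}\,y_{xy} \\
y_{xy} &\in \{0, 1\}
\end{aligned}
```

With $y_{xy} = 1$ the first inequality is active and $k_x$ runs after $k_y$; with $y_{xy} = 0$ the second is active instead. Because co-location is itself decided by the solver, a complete encoding would also need to relax both inequalities when the two vertices land on different devices; that term is omitted from this sketch.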
Device memory capacity constraint: This constraint en-
sures that the summation of the memory footprint of all
the vertices placed on a device does not exceed the device
memory capacity.
$$\forall n \in N, \quad Mem(n) \ge \sum_{k \in K} P_{kn}(k,n) \cdot M(k) \qquad (13)$$
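The placement and memory constraints (Equations 7 and 13) are simple to verify for a candidate solution. The sketch below is a checker, not a solver; the function name `check_placement` and the toy footprints and capacities are hypothetical.

```python
def check_placement(placement, mem_bytes, capacity):
    """Check Equation 7 (one device per vertex) and Equation 13 (memory).

    placement: dict vertex -> device name; a dict can hold only one device
               per vertex, so "at most one" is satisfied by construction.
    mem_bytes: dict vertex -> M(k)
    capacity:  dict device -> Mem(n)
    """
    # Equation 7: every vertex must be assigned to a known device.
    for k in mem_bytes:
        if placement.get(k) not in capacity:
            return False
    # Equation 13: the footprint summed per device must fit in its memory.
    used = {n: 0 for n in capacity}
    for k, footprint in mem_bytes.items():
        used[placement[k]] += footprint
    return all(used[n] <= capacity[n] for n in capacity)


# Hypothetical footprints and capacities in arbitrary units.
mem = {"a": 6, "b": 6, "c": 3}
cap = {"gpu0": 10, "gpu1": 10}
print(check_placement({"a": "gpu0", "b": "gpu1", "c": "gpu0"}, mem, cap))
print(check_placement({"a": "gpu0", "b": "gpu0", "c": "gpu0"}, mem, cap))
```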
In this DLPlacer framework, we assumed the following:
1. Two operations which are co-located on a device are
executed back-to-back, without any delay in between
the end of one operation and the beginning of the other.
2. Communication of tensors between devices is overlapped
with computation.
Based on these constraints and assumptions, DLPlacer
predicts the training speedup for a given MP solution. In our
work, we considered operations at the granularity of
TensorFlow operations (e.g., conv2D, conv3D); however, DLPlacer
can also find placements when the operations
are partitioned into finer-grained operations (e.g.,
partitioned by channels, filters, etc.). Such fine-grained
operation splitting requires framework support for correct
back-propagation and therefore was not a focus of this work.
Note that because of framework-induced overheads and
unmodeled operating system effects, correct prediction of the
exact speedup is difficult. Modeling these overheads is
challenging and often depends on the mapping of kernels
to high-level operations (e.g., mapping of cuDNN (Chetlur
et al., 2014) kernels to convolution/FC/etc.), device
architecture, and the runtime implementation. Despite the
challenges in accurate prediction, we believe ILP-based MP
optimization is worthwhile to pursue based on the observed
improvements over manual optimization.
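To make the objective concrete, the sketch below searches placements for a toy four-operation graph. It deliberately swaps the ILP for exhaustive enumeration with in-order scheduling, which is only practical for very small graphs, and it assumes a single direct link between any two devices. All names, graph weights, and link parameters are hypothetical. In line with the two assumptions listed above, co-located operations run back-to-back and transfers do not occupy the device.

```python
from itertools import product


def step_time(ops, edges, order, placement, bandwidth, latency):
    """Schedule ops in the given topological order and return the makespan.

    ops:   dict name -> Delta(k) in seconds
    edges: dict (src, dst) -> D(e) in bytes
    """
    finish = {}
    device_free = {}
    for k in order:
        ready = 0.0
        for (src, dst), num_bytes in edges.items():
            if dst != k:
                continue
            delay = 0.0
            if placement[src] != placement[k]:
                # Cross-device edge: pay transfer time plus link latency.
                delay = num_bytes / bandwidth + latency
            ready = max(ready, finish[src] + delay)
        # A device runs one operation at a time.
        start = max(ready, device_free.get(placement[k], 0.0))
        finish[k] = start + ops[k]
        device_free[placement[k]] = finish[k]
    return max(finish.values())


def best_placement(ops, edges, order, devices, bandwidth, latency):
    """Try every op-to-device assignment (exponential; toy graphs only)."""
    best = None
    for combo in product(devices, repeat=len(order)):
        placement = dict(zip(order, combo))
        t = step_time(ops, edges, order, placement, bandwidth, latency)
        if best is None or t < best[0]:
            best = (t, placement)
    return best


ops = {"stem": 2e-3, "branch_a": 3e-3, "branch_b": 3e-3, "concat": 1e-3}
edges = {("stem", "branch_a"): 8e6, ("stem", "branch_b"): 8e6,
         ("branch_a", "concat"): 4e6, ("branch_b", "concat"): 4e6}
order = ["stem", "branch_a", "branch_b", "concat"]

t1, _ = best_placement(ops, edges, order, ["gpu0"], 25e9, 5e-6)
t2, placement = best_placement(ops, edges, order, ["gpu0", "gpu1"], 25e9, 5e-6)
print(f"1 GPU: {t1 * 1e3:.3f} ms, 2 GPUs: {t2 * 1e3:.3f} ms, "
      f"speedup {t1 / t2:.2f}x")
```

In this toy graph the best two-device placement keeps the merge operation on the same device as the branch that was moved, so only one transfer sits on the critical path. This mirrors the observation that a good placement keeps critical-path operations together.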
Inception-V3 Case Study
As inputs to DLPlacer, we analytically calculate the
execution and communication times of the operations in the
Inception-V3 DFG. For example, given the input/output
tensor sizes of a convolution operation, we calculate the number
of floating point operations (FLOPs) required, and based
on the advertised compute capability of NVIDIA’s V100, we
calculate the operations’ expected execution time. Similarly,
communication time between nodes is calculated based on
the tensor sizes of the nodes in the model DFG along with
NVLink bandwidth and latency. The placement solution
of Inception-V3 using two GPUs is shown in Figure 7. We
implement the placement directives from DLPlacer using
TensorFlow’s tf.device() command, and we have validated
DLPlacer’s speedup estimation against real hardware
performance.
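The analytical execution-time estimate described above can be sketched as follows. The FLOP formula is the usual dense-convolution count with multiplies and adds counted separately. The layer shape is hypothetical, and the 15.7e12 FLOP/s peak rate is an assumed figure used only for illustration; the text does not say which advertised rate or precision the authors used.

```python
def conv2d_flops(batch, out_h, out_w, out_c, in_c, k_h, k_w):
    """Forward-pass FLOPs of a dense 2D convolution (multiply and add counted)."""
    return 2 * batch * out_h * out_w * out_c * in_c * k_h * k_w


def estimated_time(flops, peak_flops_per_s):
    """Ideal execution time if the device sustained its peak rate."""
    return flops / peak_flops_per_s


# Hypothetical layer: 3x3 convolution, 64 -> 128 channels, 32x32 output, batch 32.
flops = conv2d_flops(batch=32, out_h=32, out_w=32, out_c=128, in_c=64, k_h=3, k_w=3)

# Assumed peak rate; substitute the rate matching the device and precision in use.
t = estimated_time(flops, 15.7e12)
print(f"{flops / 1e9:.2f} GFLOPs, about {t * 1e6:.0f} us at the assumed peak rate")
```

Real kernels rarely sustain peak throughput, which is one reason the text notes that profiling is the more robust way to obtain node weights.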
In Figure 8, the blue bars show the normalized per-step
speedup estimated by DLPlacer for the optimal placement
solution it finds. DLPlacer’s runtime on an 18-core Xeon-E5
system to find Inception-V3’s placement solution is ∼11-
18 minutes depending on the number of device nodes in
the hardware graph. The orange bar for each configuration
shows the speedup measured on real silicon with DLPlacer’s
placement applied to the TensorFlow implementation. The
speedups estimated by DLPlacer are within 6% of the
actual speedups obtained from the silicon runs. It is interesting
to note that the 1.32x speedup obtained with the real silicon
2-GPU placement is almost the same as what is optimally
obtainable with three or four GPUs. This is due to the lim-
ited parallelism available in the network, which DLPlacer
almost completely exploits with a 2-GPU placement. Iden-
tifying a 2-GPU placement that gives this performance by
simple observation of the network and without using a tool
like DLPlacer is non-trivial. DLPlacer essentially finds
the placement with the shortest possible critical path among
many feasible placement solutions and places the operations
on the critical path on one GPU so as to avoid communication
overhead. This shows the importance of such a tool for
maximizing the performance obtainable from MP while using
the minimum number of GPUs.
7. Related Work
This work identifies scaling and statistical efficiency losses
as the largest challenges to scalable data parallel training,
but researchers are improving the scalability of both data
and model parallel training rapidly. We summarize the most
significant related advancements here.
7.1. Hybrid Parallelization
Previous work (Das et al., 2016; Yadan et al., 2013;
Sridharan et al., 2018) has also used hybrid parallelization for
scaling DL training. To the best of our knowledge, none
of these proposals provides a systematic method to identify
which strategy is best for scaling out network training
at different device counts. Das et al. (2016)
perform hybrid training on CPUs and maintain the global
batch size by shrinking the mini-batch size per CPU, and
thus do not incur a statistical efficiency loss; this works on
CPUs because a small mini-batch size is large enough to
saturate CPU throughput.
Maintaining a constant global batch size while shrinking the
mini-batch size (per compute device) can also be done for
GPUs; however, GPUs typically require larger mini-batch
sizes to maintain high utilization. Yadan et al. (2013)
show that a hybrid (2-way DP, 2-way MP) approach
performs better than both MP-only and DP-only when training
AlexNet on a 4-GPU system, but do not discuss the cause
of the results or evaluate this effect across different GPU
counts. Dean et al. (2012) used hybrid parallelism
to train models which would not fit in a single GPU’s
memory. Therefore, in each data parallel worker, the model
replica is model parallelized across multiple devices.
However, with the increase in memory capacity of today’s
GPUs, large models such as Inception-V3, GNMT, etc. can
fit into a single GPU’s memory while using a sufficiently large
mini-batch size to saturate the compute throughput.
Moreover, using model parallelism for models that do not fit in a
single GPU’s memory is largely orthogonal to the issue we
address in this work. Gholami et al. (2017)
have shown that a hybrid parallelization strategy can result
in lower communication overhead than both MP and DP.
None of these works, however, has provided a systematic
analysis of which parallelization strategy would minimize
the end-to-end training time when a set of N compute
devices is available for training. Moreover, implementing
hybrid parallelism is often tricky because finding the optimal
strategy to split a model is non-trivial and is dependent
on the model DFG and system hardware.
Figure 7. DLPlacer’s placement solution for Inception-V3. Different colors denote different devices.
Figure 8. Normalized per-step speedup from model parallelism as
estimated by DLPlacer and obtained from silicon experiments for
the Inception-V3 network.
7.2. Orthogonal Parallelization Strategies
Exploiting model parallelism is just one way to achieve
per-step speedup without increasing the global batch size. Other
strategies exist that can be combined with, or used in place
of, model parallelism to augment data parallel scaling under
our proposed model. Jia et al. (2018) propose layer-wise
parallelism for CNNs where each network layer can use an
individual parallelization strategy. A combination of the
4D tensor dimensions can be used to parallelize a given
layer and exploring multiple dimensions may provide larger
runtime benefits than MP. However, such a technique is not
yet supported by most frameworks and is evaluated using a
custom framework (Legion (Bauer et al., 2012)). Similar to
GPipe (Huang et al., 2018) (discussed in Section 2), PipeDream
(Harlap et al., 2018) partitions a DL model’s DFG into
multi-layer stages and applies pipeline parallelism. To
enable maximum device utilization, PipeDream uses
asynchronous weight updates, which can lead to poor statistical
efficiency as the number of devices increases. It is likely that
one or a combination of the layer-wise, pipeline, and model
parallelism techniques can be combined with DP training to
maximize end-to-end training performance and efficiency.
7.3. Alternate Techniques to Improve DP Scaling
Data parallel training employing sync-SGD suffers from
poor scaling efficiency due to synchronization overheads.
Prior work (Chaturapruek et al., 2015; Zhang et al., 2013;
Paine et al., 2013; Recht et al., 2011) has attempted to
address this by using asynchronous SGD. However, asyn-
chronous SGD can still result in poor statistical efficiency
while making performance debugging difficult. Hyper-
parameter tuning is a broad approach to improving statis-
tical accuracy and training convergence. Techniques such
as tuning and scaling learning rates (Goyal et al., 2017;
Krizhevsky, 2014; You et al., 2017; Smith et al., 2017; Jas-
trzebski et al., 2017; Hoffer et al., 2017), or auto-tuning the
momentum (Zhang & Mitliagkas, 2017) are several important
examples. However, these techniques are very problem-specific,
require extensive knowledge of the DL models,
and are very time-consuming for developers. Furthermore,
hyper-parameter tuning is not always effective (Shallue
et al., 2018).
Other works (Polyak & Juditsky, 1992; Koliousis et al.,
2019) propose using a different learning algorithm, called
model averaging, for training with small batches. An aver-
age model can asymptotically converge faster, but finding
the asymptotic region is difficult (Xu, 2011). Koliousis et
al. (2019) use multiple learners (each using a small batch
size) that run on many GPUs, and an average main model is used
to synchronously track the learning. Model averaging is not
yet mainstream or supported by popular DL frameworks and
thus requires custom re-implementation of the DL models.
7.4. Reinforcement Learning-based Device Placement
Prior work has shown that by using reinforcement learning-based
(RL-based) placement of operations onto devices, MP
can achieve training speedup and that the RL-generated
placement is non-trivial (Mirhoseini et al., 2018). However,
RL-based approaches can be long-running and
compute-intensive with no notion of optimality. On the other hand,
DLPlacer can provide optimal device placement solutions,
though it can still be compute-intensive for complex DFGs
and when the system graph contains a large number of devices.
However, it should be noted that for simpler DFGs, simple
heuristics could achieve near-optimal placement results.
7.5. Framework Support
As we discuss in Section 4.4, we implement MP differently
for our BigLSTM and GNMT evaluation compared to
Inception-V3. This is mostly driven by the baseline
implementations of BigLSTM and GNMT, which make it highly
non-trivial to exploit intra-layer MP in these networks. We
use pipeline parallelism for exploiting inter-layer MP for
these two models. While a given network’s implementation
can be one hurdle to exploiting MP, the framework it is
implemented with can also add to the complexity. TensorFlow
and PyTorch have different levels of support for assigning
operations or tensors to different devices, but neither
provides any automatic intra-layer parallelism extraction support.
DSSTNE (dsstne) is Amazon’s deep scalable sparse tensor
network framework, which has more complete support
for extracting intra-layer parallelism. However, it only
supports fully connected layers and therefore is not a versatile
framework for implementing different types of DL networks
such as CNN- and RNN-based networks. Also, this framework
is not broadly used and as such was not a focus of our
evaluation in this work.
8. Conclusion
This paper demonstrates the benefits of combining model-
parallelism (MP) with data-parallelism (DP) to overcome
the inherent scaling and statistical efficiency losses that
data-parallel training has at scale. We analyze the end-to-end
training time of DP to understand how scaling and
statistical efficiency losses impact training scalability, and
show that the MP speedup achieved for a given DL model is
critical to the overall scalability of a hybrid parallelization
strategy. We demonstrate that when the global batch size
in DP grows to a point where DP-only training speedup
drops off significantly, MP can be used in conjunction with
DP to continue improving training times beyond what DP
can achieve alone. We evaluate the performance benefits
of such a hybrid strategy and project that for Inception-
V3, GNMT, and BigLSTM, the hybrid strategy provides
an end-to-end training speedup of at least 26.5%, 8%, and
22% respectively compared to what DP alone can achieve
at scale.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M.,
Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G.,
Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M.,
Levenberg, J., Mané, D., Monga, R., Moore, S., Murray,
D. G., Olah, C., Schuster, M., Shlens, J., Steiner,
B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke,
V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P.,
Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. Ten-
sorflow: Large-scale machine learning on heterogeneous
distributed systems. CoRR, abs/1603.04467, 2016. URL
http://arxiv.org/abs/1603.04467.
Bahdanau, D., Cho, K., and Bengio, Y. Neural machine
translation by jointly learning to align and translate.
CoRR, abs/1409.0473, 2014. URL http://arxiv.
org/abs/1409.0473.
Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. Le-
gion: Expressing locality and independence with logical
regions. In Proceedings of the International Conference
on High Performance Computing, Networking, Storage
and Analysis, SC ’12, pp. 66:1–66:11, Los Alamitos,
CA, USA, 2012. IEEE Computer Society Press. ISBN
978-1-4673-0804-5. URL http://dl.acm.org/
citation.cfm?id=2388996.2389086.
Chaturapruek, S., Duchi, J. C., and Ré, C. Asynchronous
stochastic convex optimization: the noise is in the
noise and sgd don’t care. In Cortes, C., Lawrence,
N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.),
Advances in Neural Information Processing Systems 28,
pp. 1531–1539. Curran Associates, Inc., 2015. URL
http://papers.nips.cc/paper/6031-
asynchronous-stochastic-convex-
optimization-the-noise-is-in-the-
noise-and-sgd-dont-care.pdf.
Chen, J., Monga, R., Bengio, S., and Józefowicz,
R. Revisiting distributed synchronous SGD. CoRR,
abs/1604.00981, 2016. URL http://arxiv.org/
abs/1604.00981.
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran,
J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient
primitives for deep learning. CoRR, abs/1410.0759, 2014.
URL http://arxiv.org/abs/1410.0759.
Das, D., Avancha, S., Mudigere, D., Vaidyanathan, K.,
Sridharan, S., Kalamkar, D. D., Kaul, B., and Dubey,
P. Distributed deep learning using synchronous stochastic
gradient descent. CoRR, abs/1602.06709, 2016. URL
http://arxiv.org/abs/1602.06709.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M.,
Mao, M., Ranzato, M., Senior, A., Tucker,
P., Yang, K., Le, Q. V., and Ng, A. Y. Large scale
distributed deep networks. In Pereira, F., Burges, C.
J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances
in Neural Information Processing Systems 25, pp.
1223–1231. Curran Associates, Inc., 2012. URL http:
//papers.nips.cc/paper/4687-large-
scale-distributed-deep-networks.pdf.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L.
Imagenet: A large-scale hierarchical image database. In
2009 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255, June 2009. doi: 10.1109/
CVPR.2009.5206848.
dsstne. Amazon DSSTNE: Deep Scalable Sparse Tensor
Network Engine. https://github.com/amzn/
amazon-dsstne.
Gardner, W. A. Learning characteristics of stochastic-
gradient-descent algorithms: A general study, analysis,
and critique. Signal processing, 6(2):113–133, 1984.
Gholami, A., Azad, A., Keutzer, K., and Buluç, A. Integrated
model and data parallelism in training neural
networks. CoRR, abs/1712.04432, 2017. URL http:
//arxiv.org/abs/1712.04432.
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P.,
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
He, K. Accurate, large minibatch SGD: training ima-
genet in 1 hour. CoRR, abs/1706.02677, 2017. URL
http://arxiv.org/abs/1706.02677.
Guillou, L., Hardmeier, C., Nakov, P., Stymne, S., Tiede-
mann, J., Versley, Y., Cettolo, M., Webber, B., and
Popescu-Belis, A. Findings of the 2016 WMT shared
task on cross-lingual pronoun prediction. In Proceedings
of the First Conference on Machine Translation: Volume
2, Shared Task Papers, volume 2, pp. 525–542, 2016.
Harlap, A., Narayanan, D., Phanishayee, A., Seshadri,
V., Devanur, N. R., Ganger, G. R., and Gibbons, P. B.
Pipedream: Fast and efficient pipeline parallel DNN
training. CoRR, abs/1806.03377, 2018. URL http:
//arxiv.org/abs/1806.03377.
Hashemi, S. H., Jyothi, S. A., and Campbell, R. H. Commu-
nication scheduling as a first-class citizen in distributed
machine learning systems. CoRR, abs/1803.03288, 2018.
URL http://arxiv.org/abs/1803.03288.
Hoffer, E., Hubara, I., and Soudry, D. Train longer, general-
ize better: closing the generalization gap in large batch
training of neural networks. In Advances in Neural Infor-
mation Processing Systems, pp. 1731–1741, 2017.
Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le,
Q. V., and Chen, Z. Gpipe: Efficient training of gi-
ant neural networks using pipeline parallelism. CoRR,
abs/1811.06965, 2018. URL http://arxiv.org/
abs/1811.06965.
Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A.,
Bengio, Y., and Storkey, A. J. Three factors influencing
minima in SGD. CoRR, abs/1711.04623, 2017. URL
http://arxiv.org/abs/1711.04623.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring
hidden dimensions in parallelizing convolutional neu-
ral networks. CoRR, abs/1802.04924, 2018. URL
http://arxiv.org/abs/1802.04924.
Józefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and
Wu, Y. Exploring the limits of language modeling. CoRR,
abs/1602.02410, 2016. URL http://arxiv.org/
abs/1602.02410.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy,
M., and Tang, P. T. P. On large-batch training for deep
learning: Generalization gap and sharp minima. CoRR,
abs/1609.04836, 2016. URL http://arxiv.org/
abs/1609.04836.
Koliousis, A., Watcharapichat, P., Weidlich, M., Mai, L.,
Costa, P., and Pietzuch, P. Crossbow: Scaling deep learn-
ing with small batch sizes on multi-gpu servers. arXiv
preprint arXiv:1901.02244, 2019.
Krizhevsky, A. One weird trick for parallelizing convo-
lutional neural networks. CoRR, abs/1404.5997, 2014.
URL http://arxiv.org/abs/1404.5997.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
classification with deep convolutional neural networks.
Commun. ACM, 60(6):84–90, May 2017. ISSN 0001-
0782. doi: 10.1145/3065386. URL http://doi.acm.
org/10.1145/3065386.
Li, M., Zhang, T., Chen, Y., and Smola, A. J. Effi-
cient mini-batch training for stochastic optimization. In
Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
KDD ’14, pp. 661–670, New York, NY, USA, 2014.
ACM. ISBN 978-1-4503-2956-9. doi: 10.1145/2623330.
2623612. URL http://doi.acm.org/10.1145/
2623330.2623612.
Migacz, S. GNMT v2: PyTorch Implemen-
tation. https://github.com/NVIDIA/
DeepLearningExamples/tree/master/
PyTorch/Translation/GNMT, 2018.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R.,
Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean,
J. Device placement optimization with reinforcement
learning. CoRR, abs/1706.04972, 2017. URL http:
//arxiv.org/abs/1706.04972.
Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q. V.,
and Dean, J. A hierarchical model for device placement.
In International Conference on Learning Representations,
2018. URL https://openreview.net/forum?
id=Hkc-TeZ0W.
NCCL. NVIDIA Collective Communications Library.
https://developer.nvidia.com/nccl, 2018.
Nowatzki, T., Sartin-Tarm, M., De Carli, L., Sankaralingam,
K., Estan, C., and Robatmili, B. A general constraint-
centric scheduling framework for spatial architectures.
SIGPLAN Not., 48(6):495–506, June 2013a. ISSN 0362-
1340. doi: 10.1145/2499370.2462163. URL http:
//doi.acm.org/10.1145/2499370.2462163.
Nowatzki, T., Sartin-Tarm, M., De Carli, L., Sankaralingam,
K., Estan, C., and Robatmili, B. A general constraint-
centric scheduling framework for spatial architectures.
In Proceedings of the 34th ACM SIGPLAN Conference
on Programming Language Design and Implementation,
PLDI ’13, pp. 495–506, New York, NY, USA, 2013b.
ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.
2462163. URL http://doi.acm.org/10.1145/
2491956.2462163.
NVIDIA. DGX-1. https://www.nvidia.com/en-
us/data-center/dgx-1/, 2018a.
NVIDIA. NVLink Fabric: Advancing Multi-GPU Process-
ing. https://www.nvidia.com/en-us/data-
center/nvlink/, 2018b.
NVIDIA. Tesla V100 Tensor Core GPU.
https://www.nvidia.com/en-us/data-
center/tesla-v100/, 2018.
NVIDIA Container. TensorFlow Release 18.07.
https://docs.nvidia.com/deeplearning/
dgx/tensorflow-release-notes/rel_18.
07.html, 2018.
Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling
neural machine translation. CoRR, abs/1806.00187, 2018.
URL http://arxiv.org/abs/1806.00187.
Paine, T., Jin, H., Yang, J., Lin, Z., and Huang, T. S. GPU
asynchronous stochastic gradient descent to speed up
neural network training. CoRR, abs/1312.6186, 2013.
URL http://arxiv.org/abs/1312.6186.
Patarasuk, P. and Yuan, X. Bandwidth optimal all-reduce
algorithms for clusters of workstations. J. Parallel Distrib.
Comput., 69(2):117–124, February 2009. ISSN 0743-
7315. doi: 10.1016/j.jpdc.2008.09.002. URL http://
dx.doi.org/10.1016/j.jpdc.2008.09.002.
Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic
approximation by averaging. SIAM J. Control Optim., 30
(4):838–855, July 1992. ISSN 0363-0129. doi: 10.1137/
0330046. URL http://dx.doi.org/10.1137/
0330046.
PyTorch. PyTorch: From Research To Production. https:
//pytorch.org/.
Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-
free approach to parallelizing stochastic gradient descent.
In Advances in neural information processing systems,
pp. 693–701, 2011.
Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. On par-
allelizability of stochastic gradient descent for speech
DNNs. 2014 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 235–239,
2014.
Sergeev, A. and Balso, M. D. Horovod: fast and
easy distributed deep learning in tensorflow. CoRR,
abs/1802.05799, 2018. URL http://arxiv.org/
abs/1802.05799.
Shallue, C. J., Lee, J., Antognini, J. M., Sohl-Dickstein,
J., Frostig, R., and Dahl, G. E. Measuring the effects
of data parallelism on neural network training. CoRR,
abs/1811.03600, 2018. URL http://arxiv.org/
abs/1811.03600.
Shi, S. and Chu, X. Performance modeling and evaluation
of distributed deep learning frameworks on gpus. CoRR,
abs/1711.05979, 2017. URL http://arxiv.org/
abs/1711.05979.
Simonyan, K. and Zisserman, A. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014. URL http://arxiv.org/
abs/1409.1556.
Smith, S. L., Kindermans, P., and Le, Q. V. Don’t de-
cay the learning rate, increase the batch size. CoRR,
abs/1711.00489, 2017. URL http://arxiv.org/
abs/1711.00489.
Sridharan, S., Vaidyanathan, K., Kalamkar, D. D., Das, D.,
Smorkalov, M. E., Shiryaev, M., Mudigere, D., Mellem-
pudi, N., Avancha, S., Kaul, B., and Dubey, P. On scale-
out deep learning training for cloud and HPC. CoRR,
abs/1801.08030, 2018. URL http://arxiv.org/
abs/1801.08030.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. Rethinking the inception architecture for computer
vision. CoRR, abs/1512.00567, 2015. URL http://
arxiv.org/abs/1512.00567.
Thakur, R., Rabenseifner, R., and Gropp, W. Opti-
mization of collective communication operations in
mpich. Int. J. High Perform. Comput. Appl., 19(1):49–
66, February 2005. ISSN 1094-3420. doi: 10.1177/
1094342005051521. URL http://dx.doi.org/
10.1177/1094342005051521.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M.,
Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey,
K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser,
L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens,
K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J.,
Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes,
M., and Dean, J. Google’s neural machine translation
system: Bridging the gap between human and machine
translation. CoRR, abs/1609.08144, 2016. URL http:
//arxiv.org/abs/1609.08144.
Xu, W. Towards optimal one pass large scale learn-
ing with averaged stochastic gradient descent. CoRR,
abs/1107.2490, 2011. URL http://arxiv.org/
abs/1107.2490.
Yadan, O., Adams, K., Taigman, Y., and Ranzato, M. Multi-
gpu training of convnets. CoRR, abs/1312.5853, 2013.
URL http://arxiv.org/abs/1312.5853.
Yamazaki, M., Kasagi, A., Tabuchi, A., Honda, T., Miwa,
M., Fukumoto, N., Tabaru, T., Ike, A., and Nakashima,
K. Yet another accelerated SGD: resnet-50 training on
imagenet in 74.7 seconds. CoRR, abs/1903.12650, 2019.
URL http://arxiv.org/abs/1903.12650.
You, Y., Gitman, I., and Ginsburg, B. Scaling sgd batch size
to 32k for imagenet training. 08 2017.
Zhang, J. and Mitliagkas, I. Yellowfin and the art of mo-
mentum tuning. arXiv preprint arXiv:1706.03471, 2017.
Zhang, S., Zhang, C., You, Z., Zheng, R., and Xu, B. Asyn-
chronous stochastic gradient descent for DNN training.
In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing, pp. 6660–6663, May 2013.
doi: 10.1109/ICASSP.2013.6638950.
Zinkevich, M., Weimer, M., Li, L., and Smola, A. J.
Parallelized stochastic gradient descent. In Laf-
ferty, J. D., Williams, C. K. I., Shawe-Taylor, J.,
Zemel, R. S., and Culotta, A. (eds.), Advances
in Neural Information Processing Systems 23, pp.
2595–2603. Curran Associates, Inc., 2010. URL
http://papers.nips.cc/paper/4006-
parallelized-stochastic-gradient-
descent.pdf.
