torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models by Kim, Chiheon et al.
torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models
Chiheon Kim1∗ Heungsub Lee1∗ Myungryong Jeong1 Woonhyuk Baek1
Boogeon Yoon1 Ildoo Kim1 Sungbin Lim2† Sungwoong Kim1
1Kakao Brain 2UNIST
1{chiheon.kim,heungsub.lee,myungryong.jeong,wbaek,eric.yoon,ildoo.kim,swkim}@kakaobrain.com
2sungbin@unist.ac.kr
Abstract
We design and implement a ready-to-use library in
PyTorch for performing micro-batch pipeline parallelism
with checkpointing proposed by GPipe [11]. In partic-
ular, we develop a set of design components to enable
pipeline-parallel gradient computation in PyTorch’s define-
by-run and eager execution environment. We show that
each component is necessary to fully benefit from pipeline
parallelism in such environment, and demonstrate the effi-
ciency of the library by applying it to various network ar-
chitectures including AmoebaNet-D [23] and U-Net [24].
Our library is available at https://github.com/kakaobrain/torchgpipe.
1. Introduction
In recent years, deep learning has seen significant growth, driven by
several methodologies which enable the training of deep neural networks
(DNNs) in a scalable way and by the development of more powerful hardware.
It has been observed that increasing the capacity of a DNN effectively
improves its performance. For example, AmoebaNet-B [23] scaled with
GPipe [11] has 557 million parameters and achieved a top-1 accuracy of
84.4%, which was the state-of-the-art result at the time, and GPT-2 [22]
is a Transformer-based [28] language model which has 1.5 billion
parameters (see Figure 1 of [11] for the effect of model scaling).
However, training such a massive model is very resource intensive.
One can mitigate this issue by reducing the size of the model
without losing the performance by pruning the model [8, 1],
designing more efficient architectures [10, 27], architecture
search under resource constraints [3], and many more.
We may wonder whether a more direct approach is possible: can we train a
massive model fast enough, given a large pool of devices? One obstacle is
that common optimization techniques to train a neural network are
sequential in nature. Those algorithms repeatedly compute the gradient of
the loss with respect to one given mini-batch at a time and update the
model parameters using the gradient. With abundant computational
resources, data parallelism [17] is commonly used to speed up the overall
optimization procedure by dividing the mini-batch into micro-batches and
delegating the per-micro-batch computation to available devices. With
careful hyperparameter tuning, this effectively reduces the training time
up to a certain mini-batch size, which may depend on the model, the
optimization algorithm, and the data [6, 25]. One drawback of
data-parallel training is that each device holds its own copy of the
network for executing the subdivided task, and network parameters must be
synchronized after each parameter update. This may induce a heavy
communication load when there are many parameters to synchronize.
∗Contributed equally.
†This work was done while Sungbin Lim was at Kakao Brain.
Note that data parallelism is not applicable when the model is so big
that it is impossible to compute the gradient even when a single data
point is fed into the network. Model parallelism [5] is a method for
training such a massive model, which partitions the model into several
pieces and places them on different devices. Each device only computes a
small part of the model, and updates only the parameters in that part.
However, model parallelism suffers from underutilization. Since most
neural networks consist of a sequence of layers, the device holding a
later part of the model must wait until the devices holding earlier parts
of the model finish their computation.
Another possible solution is to use gradient checkpointing [4], which
saves memory by storing only a subset of the activation maps and
re-computing the discarded activation maps when necessary. This requires
certain parts of the model to be computed twice, so the overall training
time increases.
It is beneficial to combine different types of parallelization
strategies [16, 14, 26, 12, 9, 11, 7], and recent lines of research
question how to find an optimal strategy [15, 19, 18, 29]. Among them,
pipeline parallelism is a way to accelerate neural network training by
combining model parallelism with data pipelining, either in a synchronous
way as in GPipe [11] or in an asynchronous way as in [12], PipeDream [9],
and XPipe [7]. We remark that gradient checkpointing (also called
re-materialization) is further combined in GPipe to allow training even
bigger models.
arXiv:2004.09910v1 [cs.DC] 21 Apr 2020
In this paper, we design and implement torchgpipe,
a ready-to-use library for GPipe in PyTorch [21]. In par-
ticular, we develop a set of design components for opti-
mized pipeline-parallel computations in PyTorch’s define-
by-run and eager execution environment. We show that
each component is necessary to fully benefit from pipeline
parallelism in such environment, and demonstrate the effi-
ciency of torchgpipe by conducting the speed and mem-
ory benchmarks on AmoebaNet-D [23] and U-Net [24]
when trained with the library.
The rest of the paper is organized as follows. In Sec-
tion 2, we discuss how the forward and backward passes
can be decomposed into subtasks (under certain assump-
tions), describe the device placement strategy of micro-
batch pipeline parallelism, and demonstrate what the de-
sired order of execution per device is. In Section 3,
we discuss complications for achieving the optimal time-
line of pipeline parallelism in PyTorch and explain how
torchgpipe resolves them. Additionally, we relax the as-
sumption that the model is sequentially composed, and pro-
vide a way for expressing models with long skip connec-
tions so that pipeline parallelism still applies without giving
up the efficiency. Then, we demonstrate that the optimiza-
tion components suggested in the paper are essential for the
performance, and evaluate the performance of the proposed
library in Section 4.
2. Pipeline Parallelism
Suppose that we have a neural network which is represented as a
composition of a sequence of subnetworks. Let us denote the subnetworks
by $f^1, \dots, f^n$ with parameters $\theta^1, \dots, \theta^n$ and let
the full network be
$$f = f^n \circ f^{n-1} \circ \cdots \circ f^1,$$
parameterized by $\theta = (\theta^1, \dots, \theta^n)$. For clarity, we
call $f^j$ the $j$th partition of $f$ and assume that the parameters of
partitions are mutually disjoint.
When training the network, gradient-based methods such as stochastic
gradient descent require computing the outcome $f(x)$ of the network
given a mini-batch $x$ of training data and the corresponding loss, and
the gradient $g$ of the loss with respect to the network parameter
$\theta$. Those two stages are called the forward and backward pass,
respectively.
Since $f$ is sequentially composed, in the forward pass $f(x)$ can be
computed by letting $x^0 = x$ and sequentially applying the partitions as
$x^j = f^j(x^{j-1})$ for $j = 1, \dots, n$. Furthermore, if $x$ consists
of $m$ smaller batches $x_1, \dots, x_m$ called micro-batches, computing
$f(x)$ dissolves into tasks $F_{i,j}$ where $x_i^0 = x_i$ and
$$x_i^j \leftarrow f^j(x_i^{j-1}) \qquad (F_{i,j})$$
for $i = 1, \dots, m$ and $j = 1, \dots, n$, assuming that $f$ does not
involve any intra-batch computation. One prominent exception to this is
batch normalization [13] (see footnote 1). The loss is obtained by
aggregating $x_i^n = f(x_i)$ and evaluating the loss function on them.
In a similar fashion, the backward pass is decomposed into tasks
$B_{i,j}$ where $dx_i^n$ is the gradient of the loss with respect to
$x_i^n$ and
$$dx_i^{j-1} \leftarrow \partial_x f^j(dx_i^j), \qquad g_i^j \leftarrow \partial_{\theta^j} f^j(dx_i^j) \qquad (B_{i,j})$$
for $i = 1, \dots, m$ and $j = 1, \dots, n$. Here
$$\partial_x f^j : v \mapsto v^T \cdot \left.\frac{df^j}{dx}\right|_{x = x_i^{j-1}}$$
is a function which does backward propagation (also known as
vector-Jacobian product) through the partition $f^j$, and
$\partial_{\theta^j} f^j$ is defined likewise. As a result, we get the
gradient of the loss with respect to $\theta^j$ by summing $g_i^j$ over
all $i$.
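The decomposition into micro-batch tasks can be checked on a toy model. The following sketch is only an illustration under simplifying assumptions that are not from the paper: each partition is the scalar map f^j(x) = theta_j * x applied element-wise, the loss is the sum of the final outputs, and the names forward and backward are hypothetical helpers. It shows that summing the per-micro-batch gradients g_i^j over i reproduces the full mini-batch gradient.

```python
# Toy model (assumption): partition j computes theta_j * x element-wise,
# and the loss is the sum of the outputs of the last partition.

def forward(thetas, batch):
    """Return the activations x^0, ..., x^n for one (micro-)batch."""
    acts = [list(batch)]
    for theta in thetas:
        acts.append([theta * v for v in acts[-1]])
    return acts

def backward(thetas, acts):
    """Return [g^1, ..., g^n], the gradient of the loss w.r.t. each theta_j."""
    n = len(thetas)
    dx = [1.0] * len(acts[-1])      # dx^n: gradient of sum(x^n) w.r.t. x^n
    grads = [0.0] * n
    for j in range(n - 1, -1, -1):  # walk the partitions in reverse order
        # Vector-Jacobian product w.r.t. theta_j: uses the partition input.
        grads[j] = sum(d * v for d, v in zip(dx, acts[j]))
        # Vector-Jacobian product w.r.t. the input: gives dx^{j-1}.
        dx = [d * thetas[j] for d in dx]
    return grads

thetas = [2.0, 3.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0]
m = 2                               # number of micro-batches
size = len(x) // m
micro_batches = [x[i * size:(i + 1) * size] for i in range(m)]

full_grads = backward(thetas, forward(thetas, x))
micro_grads = [backward(thetas, forward(thetas, mb)) for mb in micro_batches]
summed_grads = [sum(g[j] for g in micro_grads) for j in range(len(thetas))]
```

Because the toy partitions involve no intra-batch computation, summed_grads agrees with full_grads; a layer such as batch normalization would break this equality.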
Note that there are data dependencies between tasks. For example,
$F_{i,j}$ requires $x_i^{j-1}$, which is only available after
$F_{i,j-1}$; hence $F_{i,j-1}$ must be completed before starting
$F_{i,j}$, and likewise $B_{i,j+1}$ must be completed before starting
$B_{i,j}$. Figure 1 shows the full dependency graph in the case of
$m = 4$ and $n = 3$.
Given the set of tasks $\{F_{i,j}\}$ and $\{B_{i,j}\}$ and a pool of
devices which can work in parallel, different parallelization strategies
have their own rules to assign tasks to devices. Each device computes one
or more assigned tasks as soon as the dependencies are resolved. In the
setting above, all dependencies are among the tasks with the same
micro-batch index $i$. Hence, one can effectively parallelize the tasks
by assigning tasks with different micro-batch indices to different
devices, which is data parallelism.
2.1. Dependency Graph of GPipe
Pipeline parallelism’s strategy is to assign tasks with respect to the
partition index $j$ so that the $j$th partition lies entirely on the
$j$th device. In addition, it is enforced that $F_{i,j}$ must be
completed before executing $F_{i+1,j}$ and $B_{i,j}$ must be completed
before executing $B_{i-1,j}$.
In addition to the micro-batch pipelining, GPipe [11] further reduces the
memory requirement by utilizing gradient checkpointing for each
$B_{i,j}$. Since the $j$th device executes $B_{i,j}$ one at a time, only
the activation maps obtained from $F_{i,j}$ are needed to complete
$B_{i,j}$. By recomputing the forward pass $F_{i,j}$ right before
executing $B_{i,j}$, memory consumption is reduced by a factor of $m$.
Moreover, the re-computation can take place while the device is waiting
for $B_{i,j+1}$ to be done. This is summarized in Figure 2, where dashed
arrows denote the execution order between independent tasks induced by
the micro-batch order, and $F'_{i,j}$ denotes the re-computation of
$F_{i,j}$.
1 Applying pipeline parallelism to a network with batch normalization is
feasible, although the computation is no longer identical. Indeed, this
discrepancy also exists in the data-parallel training scheme and it may
result in degradation of the result.
Figure 1: Minimal dependency graph for forward and backward pass.
Figure 2: Dependency graph for pipeline parallelism with checkpointing.
Colors denote the devices that tasks are computed in.
Figure 3: The execution order that the $j$th device must follow.
We remark that re-computations for the last micro-batch, i.e.,
$F'_{m,j}$ for $j = 1, \dots, n$, are unnecessary. This is because on the
$j$th device the last task in the forward pass is $F_{m,j}$, so
discarding its intermediate activations in the forward pass and
re-computing them at the beginning of the backward pass does not reduce
memory; it only slows down the pipeline. For this reason, $F'_{m,j}$ is
omitted from the graph.
2.2. Device-wise Execution Order
To summarize, in pipeline parallelism (with checkpointing) each device is
assigned a set of tasks in a prescribed order. Each device executes the
given tasks one by one as soon as cross-device dependencies are met.
However, there is a missing component in this picture: data transfer
between the devices. For illustration, the full execution order that
device $j$ must follow is shown in Figure 3. Here data transfer
operations are explicitly denoted as ‘receive’ and ‘send’ for emphasis.
3. torchgpipe: A PyTorch Library for GPipe
torchgpipe is a PyTorch library for micro-batch pipeline parallelism with
checkpointing, also known as GPipe. The library provides a simple way to
apply GPipe to a generic sequential module written in PyTorch. The usage
of torchgpipe resembles that of the data parallel module of PyTorch: just
wrap your model with the wrapper.
Users must specify the number of micro-batches $m$ and how consecutive
layers form $n$ partitions. We remark that even though we simplified our
assumption so that the model is a sequence of partitions, torchgpipe
strictly requires the model to be a sequence of layers, which gives users
the flexibility to decide how to split the model. torchgpipe assumes that
each layer is a non-divisible, black-box, and referentially transparent
(see footnote 2) algorithm.
2This is required especially for checkpointing: referential transparency
ensures that recomputation is identical to the computation done in the for-
ward pass.
For convenience, the library provides the submodule torchgpipe.balance,
which computes a partition whose pairwise resource discrepancy is small,
where resource consumption is computed by profiling. Specifically, we
used the algorithm from [2].
As torchgpipe is built on PyTorch equipped with the CUDA backend, we will
often assume that devices are NVIDIA GPUs throughout this section.
Nevertheless, the underlying principles of the library apply in general
to implementing pipeline parallelism in any eager execution environment.
3.1. Complications in PyTorch
Our primary concern is efficiency. As we discussed in Section 2.2, in
order for pipeline parallelism to work as desired, the tasks must be
assigned to each device in the correct order. There are several
complications to achieving this in PyTorch.
First of all, kernels are issued to each device on-the-fly due to
PyTorch’s define-by-run style and its eager execution behavior (as
opposed to construct-and-run type frameworks). Hence, one must design the
host code carefully so that not only are device-bound tasks issued in the
correct order within each device, but also the execution of tasks on
devices (asynchronous to the CPU) is not delayed because the Python
interpreter fails to request it ahead of time. This kind of delay may
happen when some of the tasks are CPU-intensive or involve a lot of cheap
kernel calls. As a solution, torchgpipe introduces the deterministic
clock-cycle, which gives a total ordering of the tasks.
Secondly, the computation graph for the backward pass is constructed
dynamically during the forward pass in PyTorch. In other words, “it
avoids ever materializing a “forward graph”, recording only what is
necessary to differentiate the computation.” [21] Since PyTorch neither
records the forward computation graph nor maintains a gradient tape, the
automatic differentiation (autograd) engine of PyTorch does
back-propagation solely with respect to the graph. This implies that the
autograd engine may not run exactly in the reverse order of execution of
the forward pass, unless enforced by the structure of the graph. To deal
with this, we develop a pair of primitive functions called ‘fork’ and
‘join’ to create explicit dependencies on the fly in the backward
computation graph.
Thirdly, communication between several devices can cause two-way
synchronization if not carefully managed. This may cause
under-utilization, since the sender may wait to synchronize with the
receiver even when there is no explicit dependency between the copy and
the next task in the queue, or vice versa. torchgpipe avoids this issue
by using non-default CUDA streams so that copies never block computations
unless the computation must wait for the data.
Lastly, torchgpipe attempts to relax the restriction of micro-batch
pipeline parallelism that the model must be sequential. Although any
neural network can be written in a sequential form in principle, this
requires knowing the entire computation graph ahead of time, which is not
the case in PyTorch. In particular, if there is a tensor which skips from
a layer in device $j'$ to another layer in device $j > j' + 1$, the
tensor will be copied to all devices in between, since torchgpipe cannot
know this in advance. To circumvent this issue, we design an interface to
signify which intermediate tensors are skipped and which layers use them.
3.2. Optimization Components
In the remainder of this section, we explain how the components of
torchgpipe are designed and why each of them is essential for
performance.
3.2.1 Forward Dependency: Deterministic Clock-cycle
As we discussed in Section 3.1, the total ordering of tasks is determined
by the host code in the forward pass. Each device implicitly understands
the dependency between tasks by the order in which they are assigned by
the CPU. Ideally, if tasks could be assigned to devices at no cost, the
CPU could assign tasks to devices in any order as long as the ordering
within each device is correct. However, this assumption is not realistic:
launching kernels on a GPU is not free for the CPU, memory transfer
between GPUs may require synchronization, and a task may be
CPU-intensive. For this reason, we minimize the delay coming from the CPU
by sorting all tasks by their distance to $F_{1,1}$.
Algorithm 1: Deterministic clock-cycle
for k from 1 to m + n − 1 do
    for i, j such that i + j − 1 = k do
        if j > 1 then
            Copy $x_i^{j-1}$ to device j.
    for i, j such that i + j − 1 = k do
        Execute $F_{i,j}$.
We call this the deterministic clock-cycle (Algorithm 1). In the
algorithm, the CPU executes the clock cycles starting from the counter
$k = 1$ to $k = m + n - 1$. In the $k$th clock cycle, all copy kernels
for data needed to execute tasks $F_{i,j}$ where $i + j - 1 = k$ are
issued first, and then the computation kernels for executing the tasks
are registered to the corresponding devices (which can be safely
multithreaded since tasks in the same clock cycle are independent).
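Algorithm 1 can be sketched in plain Python as a schedule generator. This is a minimal illustration of the ordering only, not the library's implementation: it issues no CUDA kernels, indices are 1-based as in the text, and the names clock_cycles and issue_order are chosen here for illustration.

```python
def clock_cycles(m, n):
    """Yield, for each clock cycle k = 1, ..., m + n - 1, the list of
    tasks (i, j) with i + j - 1 == k, for m micro-batches and n partitions."""
    for k in range(1, m + n):
        # i = k - j + 1 must satisfy 1 <= i <= m, which bounds j.
        yield [(k - j + 1, j) for j in range(max(1, k - m + 1), min(n, k) + 1)]

def issue_order(m, n):
    """Return the host-side issue order: within each cycle, all copies first
    (copy of x_i^{j-1} to device j), then all computations F_{i,j}."""
    ops = []
    for tasks in clock_cycles(m, n):
        for i, j in tasks:
            if j > 1:
                ops.append(("copy", i, j))
        for i, j in tasks:
            ops.append(("run", i, j))
    return ops

schedule = list(clock_cycles(4, 3))  # the m = 4, n = 3 case of Figure 1
ops = issue_order(4, 3)
```

For m = 4 and n = 3 the schedule has six cycles, starting with the single task (1, 1) and ending with (4, 3); each device j sees its tasks in increasing micro-batch order.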
3.2.2 Backward Dependency: Fork and Join
Suppose now that we run a forward pass according to the deterministic
clock-cycle. The resulting computation graph for backward will look like
Figure 1 rather than Figure 2, even when the forward tasks
$F_{1,j}, \dots, F_{m,j}$ on device $j$ were executed in order. From such
a graph, the autograd engine of PyTorch would never know that
$B_{i+1,j}$ must be executed before $B_{i,j}$, and this disrupts the
timeline of the backward pass. For this reason, virtual dependencies
(dashed arrows in Figure 2) must be explicitly drawn during the forward
pass.
Figure 4: The backward computation graph with Fork and Join. Different
colors correspond to different devices. Arrows are drawn according to the
direction in the backward computation graph, and these relations are
constructed during the forward pass. Here the virtual dependency of
$F'_{i,j}$ on $B_{i+1,j}$ is created via Fork and Join, which is
illustrated by dashed arrows.
We design a pair of primitive functions called Fork and Join to express
such dependency. Basically, Fork is the autograd function mapping a
tensor $x$ to the pair $(x, \emptyset)$ where $\emptyset$ is an empty
tensor (see footnote 3), and Join is the autograd function mapping a pair
$(x, \emptyset)$ to the tensor $x$. Now, the dependency of $F_{i+1,j}$
upon $F_{i,j}$ (which translates to the dependency of $B_{i,j}$ upon
$B_{i+1,j}$ in the backward computation graph) can be expressed as
$$(x_i^j, \emptyset) \leftarrow \mathrm{Fork}(x_i^j)$$
$$x_{i+1}^{j-1} \leftarrow \mathrm{Join}(x_{i+1}^{j-1}, \emptyset).$$
See Figure 4 for illustration.
3In principle, the tensor which indicates the virtual dependency can be
arbitrary. We chose to use the empty tensor for this, however, to remove
any unnecessary computation caused by the tensor such as gradient accu-
mulation in PyTorch.
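As a rough illustration of why the virtual dependencies matter, the following plain-Python sketch models the backward tasks as nodes of a dependency graph and checks whether a given execution order is admissible. It does not use PyTorch or the actual Fork and Join autograd functions; the helper names backward_edges and respects are hypothetical.

```python
def backward_edges(m, n, virtual=False):
    """Return edges (a, b) meaning task a must finish before task b,
    where a task (i, j) stands for B_{i,j}."""
    edges = set()
    for i in range(1, m + 1):
        for j in range(1, n):
            # Data dependency: B_{i,j+1} produces dx_i^j needed by B_{i,j}.
            edges.add(((i, j + 1), (i, j)))
    if virtual:
        for i in range(1, m):
            for j in range(1, n + 1):
                # Virtual dependency (the role of Fork and Join):
                # B_{i+1,j} must run before B_{i,j} on the same device.
                edges.add(((i + 1, j), (i, j)))
    return edges

def respects(order, edges):
    """Check that an execution order satisfies every edge."""
    position = {task: k for k, task in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)

# m = 2 micro-batches, n = 2 partitions.
unwanted = [(1, 2), (1, 1), (2, 2), (2, 1)]   # micro-batch 1 processed first
desired = [(2, 2), (1, 2), (2, 1), (1, 1)]    # reverse micro-batch order

plain = backward_edges(2, 2)
with_virtual = backward_edges(2, 2, virtual=True)
```

With data dependencies alone both orders are admissible, so an engine that only sees the graph is free to pick the unwanted one; once the virtual edges are present, only the desired order remains valid.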
3.2.3 Concurrent Copy and Computation: Streams
PyTorch issues every device-bound kernel to the default stream, unless
specified otherwise. A stream is a device-bound sequence of kernels that
is executed in order. Kernels in the same stream are guaranteed to be
executed in the prescribed order, but kernels in different streams can be
interleaved, and can even overlap when possible. In particular, nearly
all CUDA devices with compute capability 1.1 and higher support
concurrent copy and execution: data transfer between devices can always
overlap with kernel execution (see section 4.5.1.5 of [20]).
torchgpipe registers every copy kernel to non-default streams while
keeping computation kernels on the default stream. This allows device $j$
to process $F_{i,j}$ concurrently with sending $x_{i-1}^j$ to device
$j + 1$ and/or receiving $x_i^{j-1}$ from device $j - 1$. Moreover, each
device uses a different stream for each micro-batch. Since there is no
true dependency between different micro-batches, this use of streams is
safe and allows copies to occur as fast as possible. See Figure 5 for
illustration.
3.2.4 Autograd Functions with Shared Memory
So far in this section, we did not discuss how to schedule the
re-computation tasks $F'_{i,j}$ when gradient checkpointing is in use.
Each must be scheduled prior to the back-propagation task $B_{i,j}$, upon
completion of $B_{i+1,j}$. This must be encoded in the computation graph
for the autograd engine as well. Indeed, PyTorch supports such
functionality via an in-house autograd function for checkpointing.
Checkpointing in PyTorch is implemented by defining an autograd function
which computes like a usual function in the forward pass but stores only
the inputs, not the intermediate activation maps. In the backward pass,
this function constructs a local computation graph for backward by
recomputing the function using the stored inputs, and computes gradients
by back-propagating through the local graph.
However, this tightly binds $F'_{i,j}$ and $B_{i,j}$ together.
Ultimately, we would like to insert the instruction for waiting for the
result $dx_i^j$ of $B_{i,j+1}$ to be copied from device $j + 1$ to device
$j$ in between $F'_{i,j}$ and $B_{i,j}$, so that $F'_{i,j}$ and the copy
can happen concurrently.
Figure 5: Timeline of device $j$ with or without non-default streams for
copy. (a) Default streams only: if only default streams are used, copy
kernels may block computation kernels (and vice versa) until the copy is
completely finished. (b) Non-default streams for copy: with copy streams,
computation can happen concurrently with sending or receiving data from
other devices.
For such fine-grained order control, torchgpipe implements checkpointing
with two separate autograd functions, Checkpoint and Recompute. At the
execution time of the task $F_{i,j}$, a pair of Checkpoint and Recompute
which share memory is generated. This shared memory is used in the
backward pass for transferring the local computation graph made by
executing Recompute to Checkpoint for back-propagation. By arranging the
functions so that $F'_{i,j}$, the synchronization for receiving $dx_i^j$,
and $B_{i,j}$ are executed in this order during the backward pass, it is
ensured that re-computation and copy can happen concurrently.
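The idea of splitting checkpointing into two halves that communicate through shared memory can be mimicked without PyTorch. The sketch below is a toy under stated assumptions: the checkpointed function is the scalar map x -> (3x)^2, the "local computation graph" is represented by a single recomputed intermediate value, and make_checkpoint_pair is a hypothetical helper, not the library's Checkpoint and Recompute classes.

```python
from collections import deque

def make_checkpoint_pair(saved_input):
    """Create the two halves of a checkpoint for the toy function (3x)^2."""
    shared = deque(maxlen=1)   # memory shared by the two halves
    events = []                # records the order in which steps happen

    def recompute():
        # Plays the role of F'_{i,j}: rerun the forward computation from
        # the saved input and hand the intermediate result over.
        hidden = 3.0 * saved_input
        shared.append(hidden)
        events.append("recompute")

    def checkpoint_backward(grad_output):
        # Plays the role of B_{i,j}: back-propagate using what the
        # recompute half left in the shared memory.
        hidden = shared.pop()
        events.append("backprop")
        return grad_output * 2.0 * hidden * 3.0   # chain rule for (3x)^2

    return recompute, checkpoint_backward, events

recompute, checkpoint_backward, events = make_checkpoint_pair(saved_input=2.0)
recompute()                        # may overlap with the incoming copy
events.append("wait for copy")     # synchronization sits between the halves
grad_input = checkpoint_backward(grad_output=1.0)
```

Because the two halves are separate callables, an arbitrary step (here the wait for the copied gradient) can be placed between them, which a single fused checkpoint function would not allow.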
3.3. Dealing with Non-sequential Models
In Section 2, we assumed that the model $f$ is composed of partitions
$f^1, \dots, f^n$ in sequence. In principle, any neural network can be
represented in this form by sorting all nodes in the forward computation
graph of $f$ in topological ordering. Hence, pipeline parallelism is
applicable to any model.
However, consider a symptomatic case in which all the partitions except
the first and the last one are parallel, i.e.,
$$f(x) = g^n(x^2, \dots, x^{n-1})$$
where $x^1 = g^1(x)$ and $x^j = g^j(x^1)$ for $j = 2, \dots, n - 1$. In a
sequential form, this is equivalent to $f = f^n \circ \cdots \circ f^1$
such that
$$f^n(x^1, x^2, \dots, x^{n-1}) := g^n(x^2, \dots, x^{n-1}),$$
$$f^j(x^1, \dots, x^{j-1}) := (x^1, \dots, x^{j-1}, g^j(x^1))$$
for $j = 2, \dots, n - 1$, and $f^1 = g^1$. In this case, it is quite
inefficient to use pipeline parallelism in its native form, since at the
boundary of devices $j - 1$ and $j$ the tuple
$(x_i^1, \dots, x_i^{j-1})$ must be copied instead of the single tensor
$x_i^1$, which is the only data required to compute the $j$th partition.
torchgpipe provides a submodule, torchgpipe.skip, which allows users to
indicate from which layer to which layer a tensor is skipped. With the
decorator @skippable, a user-defined layer can stash a tensor for later,
or pop a stashed one, via the yield operator in Python without returning
it. In particular, this does not change the input and output signature of
a layer. Hence, minimal effort is needed to add skip connections to a
preexisting sequential model.
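The mechanism of stashing and popping through yield can be illustrated with ordinary Python generators. The sketch below is not the torchgpipe.skip API: Stash, Pop, and run_layer are hypothetical stand-ins written for this example, and the layers operate on plain numbers rather than tensors. It only shows how a layer can hand a value to a driver, or request one from it, without changing what the layer takes as input or returns.

```python
import inspect

class Stash:
    """Request to store a value under a name (hypothetical stand-in)."""
    def __init__(self, name, value):
        self.name, self.value = name, value

class Pop:
    """Request to retrieve a previously stashed value (hypothetical stand-in)."""
    def __init__(self, name):
        self.name = name

def run_layer(layer, x, skips):
    """Run one layer, serving its Stash/Pop requests from the dict `skips`."""
    result = layer(x)
    if not inspect.isgenerator(result):
        return result                      # ordinary layer, no skip traffic
    try:
        request = next(result)
        while True:
            if isinstance(request, Stash):
                skips[request.name] = request.value
                request = result.send(None)
            elif isinstance(request, Pop):
                request = result.send(skips.pop(request.name))
            else:
                raise TypeError("unexpected request from layer")
    except StopIteration as stop:
        return stop.value                  # the layer's normal return value

def layer1(x):
    yield Stash("1to3", x)                 # keep the input for layer3
    return x + 1

def layer2(x):
    return x * 2                           # unaware of the skip connection

def layer3(x):
    skip = yield Pop("1to3")               # fetch what layer1 stashed
    return x + skip

skips = {}
out = 10
for layer in (layer1, layer2, layer3):
    out = run_layer(layer, out, skips)
```

In this toy run the value stashed by layer1 never passes through layer2, which mirrors the goal of not routing skip tensors through the layers in between.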
3.3.1 Hiding Skip Tensors in the Graph: Portals
Adding skip connections into the dependency graph (Figure 2) is fairly
straightforward. Indeed, no additional dependency is introduced no matter
how many skip connections are added, hence only the copy kernels for skip
connections need extra care. In torchgpipe, this is taken care of by
portals consisting of three autograd functions, PortalBlue, PortalOrange,
and PortalCopy, which share memory like Checkpoint and Recompute in
Section 3.2.4. They respectively save the skip tensor, load the tensor,
and move the saved tensor to the skipped device (and vice versa in the
backward pass). This mechanism is illustrated in Figure 6.
4. Experiments
Every experiment was conducted with NVIDIA Tesla P40 GPUs with CUDA
10.1.243, each having 22 GiB of memory. For reproducibility, the code for
all benchmarks provided in this section is made available in the
repository (see footnote 4).
4.1. Effects of Optimization Components
We conducted an experiment to show that every component of torchgpipe is
necessary to achieve the maximal efficiency. Starting from the baseline
which only has the deterministic clock-cycle, each component (backward
dependency via Fork and Join, non-default streams for copy kernels, and
portals for skip connections) is added incrementally. We report the
throughput, GPU utilization, and memory usage under each setting to
measure how each component contributes to the performance of torchgpipe.
We find that the addition of each component gives a speed-up, and with
all components torchgpipe runs nearly twice as fast as the baseline.
Results can be found in Table 1.
4 Further details available at this link.
Figure 6: The flow of skip connection with or without portals. (a)
Without portals, the skipped tensor from device 1 is copied to device 2
and subsequently to device 3. (b) With portals, the tensor is directly
copied to device 3. The gradient flows in the exact reverse direction in
the backward pass.
Optimization components             Throughput   Speed up   Utilization   Memory usage
×            ×          ×           30.662/s     1          44%           52.2 GiB
Dependency   ×          ×           41.306/s     1.347      59%           19.1 GiB
Dependency   Streams    ×           55.191/s     1.800      71%           30.0 GiB
Dependency   Streams    Portals     58.477/s     1.907      75%           23.5 GiB
Table 1: Performance of torchgpipe when optimization components are
incrementally added. The U-Net model with $(B, C) = (5, 64)$ is used for
the experiment. The batch size and the number of micro-batches are fixed
as 128 and 8, respectively. The model is partitioned and placed on four
devices via torchgpipe. Here the partition was found manually with the
aid of torchgpipe.balance.
Figure 7: Detailed view of the CUDA timeline for each setting in Table 1,
profiled with NVIDIA Nsight Systems 2019.5.1.58. Panels: (a) Baseline,
(b) Dependency, (c) Dependency and streams, (d) Dependency, streams, and
portals. Starting from the top, adjacent lanes with blue bars and red
bars visualize the timeline per device. Blue bars represent computation
kernels while red bars represent device-to-device copy (length
proportional to time).
We used U-Net for the experiment. Details of the ar-
chitecture can be found in Section 4.2.2 and we set (B,C)
to be (5, 64) as in the speed benchmark. In settings with-
out portals, the model is implemented as a fully sequential
version where skip connections are encoded as inputs and
outputs of layers that they pass through, as described in the
symptomatic example of Section 3.3. For the setting with
all components, it is implemented with torchgpipe.skip
while the architecture is identical.
We also visualized per-GPU timelines to help understand the role of each
component, as illustrated in Figure 7. The explanation for each picture
is summarized as follows.
(a) By deterministic clock-cycle, all kernels are issued in
the correct order during forward pass. It is illustrated
by the left part of the timeline. However, without ex-
plicit dependency encoded in the computation graph,
the autograd engine processes the micro-batches in an
uncontrollable order so the timeline is messed up.
(b) With backward dependency, kernels are now issued in
the correct, deterministic order in backward pass.
(c) By using non-default copy streams, copies and com-
putations are now concurrent as illustrated by overlap-
ping blue and red bars.
(d) Portals remove unnecessary copies caused by transferring the skipping
tensor to all devices in between. This is illustrated by the red bars
being shorter than in (c).
4.2. Performance Benchmarks
To demonstrate the efficiency of torchgpipe, we report performance
benchmarks similar to those conducted for GPipe [11].
4.2.1 AmoebaNet-D Speed Benchmark
We measured the throughput of AmoebaNet-D with various numbers of
devices. For this, we measured the throughput of the model when
torchgpipe is applied, with $n$ partitions and $m$ micro-batches. Here
throughput means the number of samples processed per second.
The experiment is conducted for each pair $(m, n)$ where
$m \in \{1, 4, 32\}$ and $n \in \{2, 4, 8\}$. When $m = 1$, we applied
checkpointing to all micro-batches (see footnote 5) to make a fair
comparison of the loss due to checkpointing with [11]. The model we used
is our implementation of a sequential version of AmoebaNet-D in PyTorch
(see footnote 6).
The model is trained by plain SGD for 10 epochs, and we report the
average throughput over the epochs except the first one. To exclude the
overhead caused by data loading, we used a synthesized dataset which
consists of 10,000 images whose dimension is $3 \times 224 \times 224$.
For each setting, the batch size and the number of micro-batches are
chosen to maximize the throughput. Relative speed-up is calculated
against the baseline case $(m, n) = (1, 2)$ and reported in Table 2. We
included the speed-up of GPipe for comparison.
The relative speed-up of torchgpipe shows a similar trend to that of
GPipe. We remark that differences in performance reported in Table 2
might be due to many unknown factors such as the balance of the
partitions, discrepancies between the implementations, differences in
devices, and so on.
4.2.2 U-Net Memory Benchmark
To evaluate the effectiveness of torchgpipe for models with long skip
connections, we used U-Net [24] for 2-dimensional segmentation. The
version of U-Net we used has five down-sampling layers and five
up-sampling layers, and two hyper-parameters $B$ and $C$ determining the
size of the model. Here $B$ stands for the number of convolution blocks
in between down-sampling layers, and $C$ stands for the number of output
channels of the first convolution. Channels are doubled after each
down-sampling layer (and halved after each up-sampling layer). Our
implementation of U-Net is more symmetric than the original model
proposed in [24] for effective balancing.
We conducted an experiment to measure the ability of torchgpipe to train
a bigger model. For 1, 2, 4 and 8 GPUs, we found the maximum $(B, C)$
that occupies each number of devices. In all settings, the input size is
set to $3 \times 192 \times 192$, the output size to
$1 \times 192 \times 192$, and the batch size to 32. The total memory
usage for training each model is reported in Table 3. Here each parameter
consumes 8 bytes for itself and its gradient.
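As a worked example of the 8-bytes-per-parameter accounting, the short script below estimates how much of the reported total memory in Table 3 is taken by parameters and gradients alone. The parameter counts and totals are copied from Table 3; the helper parameter_memory_gib and the reading of GiB as 2^30 bytes are assumptions made for this illustration.

```python
GIB = 2 ** 30
BYTES_PER_PARAM = 8   # parameter plus its gradient, as stated in the text

# name: (number of parameters, reported total memory usage in GiB)
table3 = {
    "Naive-1": (362.2e6, 20.3),
    "Pipeline-1": (2.21e9, 20.5),
    "Pipeline-2": (4.99e9, 43.4),
    "Pipeline-4": (7.80e9, 79.1),
    "Pipeline-8": (15.82e9, 154.1),
}

def parameter_memory_gib(num_params):
    """Memory taken by parameters and their gradients, in GiB."""
    return num_params * BYTES_PER_PARAM / GIB

for name, (num_params, total_gib) in table3.items():
    used = parameter_memory_gib(num_params)
    print(f"{name}: {used:.1f} GiB of {total_gib} GiB ({used / total_gib:.0%})")
```

Under these assumptions, parameters and gradients account for only a small part of the Naive-1 footprint but for most of the memory in the pipelined settings, which is consistent with checkpointing shrinking the activation memory.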
4.2.3 U-Net Speed Benchmark
We also measured the throughput of U-Net with various
numbers of devices. Naive-1 denotes the baseline with neither
pipeline parallelism nor checkpointing, and Pipeline-1, -2,
-4, -8 denote that the model is trained with torchgpipe
with the corresponding number of partitions. The
hyper-parameters determining the size of U-Net are set to
(B, C) = (5, 64) in this experiment. The batch size, the number
of micro-batches (m), and the balance of the partitions are
chosen to maximize the throughput. For each setting, throughput
is measured as in Section 4.2.1, except that the image size was
3 × 192 × 192 in this experiment. The results are summarized in
Table 4.

5 torchgpipe does not use checkpointing on the last micro-batch by
default, as explained in Section 2. This means that no checkpointing is
applied when m = 1.
6 We tried to make it as close as possible to the model in the official
repository of TensorFlow (link).

AmoebaNet-D        GPipe [11]                 Ours
             n = 2   n = 4   n = 8     n = 2   n = 4   n = 8
m = 1        1       1.13    1.38      1       1.00    0.93
m = 4        1.07    1.26    1.72      1.54    1.67    2.62
m = 32       1.21    1.84    3.48      1.77    2.71    4.95
Table 2: Speed benchmark on AmoebaNet-D (18, 256).
In [11], Cloud TPUv3s were used while we used NVIDIA
Tesla P40 GPUs in our experiments.

U-Net        (B, C)      Parameters   Memory usage
Naive-1      (6, 72)     362.2M       20.3 GiB
Pipeline-1   (11, 128)   2.21B        20.5 GiB
Pipeline-2   (24, 128)   4.99B        43.4 GiB
Pipeline-4   (24, 160)   7.80B        79.1 GiB
Pipeline-8   (48, 160)   15.82B       154.1 GiB
Table 3: Memory benchmark on U-Net.

U-Net        Throughput   Speed-up   Batch size   m
Naive-1      28.500/s     1          40           ×
Pipeline-1   24.456/s     0.858      80           2
Pipeline-2   35.502/s     1.246      512          32
Pipeline-4   67.042/s     2.352      512          16
Pipeline-8   88.497/s     3.105      640          40
Table 4: Speed benchmark on U-Net with (B, C) = (5, 64).
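The speed-up column of Table 4 can be reproduced from the throughput column, as the short sketch below shows. The numbers are copied from Table 4; the per-micro-batch size is derived under the assumption that each mini-batch is split evenly into m micro-batches, which the table itself does not state.

```python
# name -> (throughput per second, batch size, number of micro-batches m),
# copied from Table 4. Naive-1 uses no micro-batching, so m is None.
table4 = {
    "Naive-1": (28.500, 40, None),
    "Pipeline-1": (24.456, 80, 2),
    "Pipeline-2": (35.502, 512, 32),
    "Pipeline-4": (67.042, 512, 16),
    "Pipeline-8": (88.497, 640, 40),
}

baseline = table4["Naive-1"][0]

# Relative speed-up against the Naive-1 baseline.
speedup = {name: thr / baseline for name, (thr, _, _) in table4.items()}

# Samples per micro-batch, assuming an even split of the mini-batch.
micro_batch = {
    name: batch // m for name, (_, batch, m) in table4.items() if m is not None
}

for name, s in speedup.items():
    print(f"{name:<11} speed-up {s:.3f}  micro-batch size {micro_batch.get(name, '-')}")
```

Note that Pipeline-1 is slower than the baseline: with a single partition there is nothing to overlap, so only the overhead of checkpointing and micro-batching remains.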
5. Conclusion
In this paper, we introduced torchgpipe, a ready-to-use
library in PyTorch for micro-batch pipeline parallelism
with checkpointing, as proposed by GPipe [11]. This library is
designed and implemented for PyTorch’s define-by-run and
eager execution environment. The ablation study and performance
benchmarks presented in Section 4 demonstrate that
all components of torchgpipe are essential to achieve
the desired advantages of pipeline parallelism with
checkpointing in an eager execution environment. We believe that
the general principles we established in this paper apply to any
other framework with an eager execution environment.
We tried to avoid going too deep into the technical
details involved in torchgpipe. Our code is available at
https://github.com/kakaobrain/torchgpipe
for those who are interested in further details,
and for those who want to apply pipeline parallelism to
their models in PyTorch.
References
[1] Jose M Alvarez and Mathieu Salzmann. Learning the num-
ber of neurons in deep networks. In Advances in Neural In-
formation Processing Systems, pages 2270–2278, 2016. 1
[2] Imre Bárány and Victor S Grinberg. Block partitions of
sequences. Israel Journal of Mathematics, 206(1):155–164,
2015. 4
[3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct
neural architecture search on target task and hardware. In In-
ternational Conference on Learning Representations, 2019.
1
[4] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin.
Training deep nets with sublinear memory cost. arXiv
preprint arXiv:1604.06174, 2016. 1
[5] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen,
Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew
Senior, Paul Tucker, Ke Yang, et al. Large scale distributed
deep networks. In Advances in neural information process-
ing systems, pages 1223–1231, 2012. 1
[6] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis,
Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
Yangqing Jia, and Kaiming He. Accurate, large minibatch
sgd: Training imagenet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017. 1
[7] Lei Guan, Wotao Yin, Dongsheng Li, and Xicheng Lu.
Xpipe: Efficient pipeline model parallelism for multi-gpu
dnn training. arXiv preprint arXiv:1911.04610, 2019. 1,
2
[8] Song Han, Jeff Pool, John Tran, and William Dally. Learning
both weights and connections for efficient neural network. In
Advances in neural information processing systems, pages
1135–1143, 2015. 1
[9] Aaron Harlap, Deepak Narayanan, Amar Phanishayee,
Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gib-
bons. Pipedream: Fast and efficient pipeline parallel dnn
training. arXiv preprint arXiv:1806.03377, 2018. 1, 2
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
dreetto, and Hartwig Adam. Mobilenets: Efficient convolu-
tional neural networks for mobile vision applications. arXiv
preprint arXiv:1704.04861, 2017. 1
[11] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat,
Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam,
Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient train-
ing of giant neural networks using pipeline parallelism. In
Advances in Neural Information Processing Systems, pages
103–112, 2019. 1, 2, 7, 8
[12] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. De-
coupled parallel backpropagation with convergence guaran-
tee. arXiv preprint arXiv:1804.10574, 2018. 1, 2
[13] Sergey Ioffe and Christian Szegedy. Batch normalization:
Accelerating deep network training by reducing internal co-
variate shift. In Francis Bach and David Blei, editors, Pro-
ceedings of the 32nd International Conference on Machine
Learning, volume 37 of Proceedings of Machine Learning
Research, pages 448–456, Lille, France, 07–09 Jul 2015.
PMLR. 2
[14] Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. Explor-
ing hidden dimensions in accelerating convolutional neural
networks. In Jennifer Dy and Andreas Krause, editors, Pro-
ceedings of the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Machine Learning
Research, pages 2274–2283, Stockholmsmässan, Stockholm,
Sweden, 10–15 Jul 2018. PMLR. 1
[15] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and
model parallelism for deep neural networks. In The Confer-
ence on Systems and Machine Learning (SysML), 2019. 1
[16] Alex Krizhevsky. One weird trick for parallelizing convo-
lutional neural networks. arXiv preprint arXiv:1404.5997,
2014. 1
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural net-
works. In Advances in neural information processing sys-
tems, pages 1097–1105, 2012. 1
[18] Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner,
Quoc V. Le, and Jeff Dean. A hierarchical model for device
placement. In International Conference on Learning Repre-
sentations, 2018. 1
[19] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner,
Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad
Norouzi, Samy Bengio, and Jeff Dean. Device placement
optimization with reinforcement learning. In Proceedings
of the 34th International Conference on Machine Learning,
Volume 70, pages 2430–2439. JMLR.org, 2017. 1
[20] NVIDIA. NVIDIA CUDA programming guide 1.1. (link),
2007. 5
[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban
Desmaison, Luca Antiga, and Adam Lerer. Automatic dif-
ferentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
2, 4
[22] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, and Ilya Sutskever. Language models are unsuper-
vised multitask learners. 1
[23] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V
Le. Regularized evolution for image classifier architecture
search. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 4780–4789, 2019. 1, 2
[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
Convolutional networks for biomedical image segmentation.
MICCAI, Springer, LNCS, 9351:234–241, 2015. 1, 2, 8
[25] Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha
Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring
the effects of data parallelism on neural network training.
arXiv preprint arXiv:1811.03600, 2018. 1
[26] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran,
Ashish Vaswani, Penporn Koanantakool, Peter Hawkins,
HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al.
Mesh-tensorflow: Deep learning for supercomputers. In
Advances in Neural Information Processing Systems, pages
10414–10423, 2018. 1
[27] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model
scaling for convolutional neural networks. In International
Conference on Machine Learning, pages 6105–6114, 2019.
1
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in neural
information processing systems, pages 5998–6008, 2017. 1
[29] Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel
Wong, Peter C Ma, Qiumin Xu, Ming Zhong, Hanxiao Liu,
Anna Goldie, Azalia Mirhoseini, et al. Gdp: General-
ized device placement for dataflow graphs. arXiv preprint
arXiv:1910.01578, 2019. 1
