Training Large Neural Networks with Constant Memory using a New
  Execution Algorithm by Pudipeddi, Bharadwaj et al.
Bharadwaj Pudipeddi
Microsoft
Sunnyvale, CA
bharadwaj.pudipeddi@microsoft.com
Maral Mesmakhosroshahi
Microsoft
Sunnyvale, CA
maral.mesmakhosroshahi@microsoft.com
Jinwen Xi
Microsoft
Sunnyvale, CA
Jinwen.Xi@microsoft.com
Sujeeth Bharadwaj
Microsoft
Sunnyvale, CA
sujeeth.bharadwaj@microsoft.com
Abstract
Widely popular transformer-based NLP models such as BERT and Turing-NLG
have enormous capacity trending to billions of parameters. Current execution
methods demand brute-force resources such as HBM devices and high speed
interconnectivity for data parallelism. In this paper, we introduce a new relay-style
execution technique called L2L (layer-to-layer) where at any given moment, the
device memory is primarily populated only with the executing layer(s)’s footprint.
The model resides in the DRAM memory attached to either a CPU or an FPGA
as an entity we call eager param-server (EPS). To overcome the bandwidth issues
of shuttling parameters to and from EPS, the model is executed a layer at a time
across many micro-batches instead of the conventional method of minibatches over
the whole model. L2L is implemented using 16GB V100 devices for BERT-Large,
running it with a device batch size of up to 256. Our results show a 45% reduction
in memory and a 40% increase in throughput compared to the state-of-the-art
baseline. L2L is also able to fit models of up to 50 billion parameters on a machine
with a single 16GB V100 and a CPU with 512GB memory and without requiring
any model partitioning. L2L scales to arbitrary depth allowing researchers to
develop on affordable devices which is a big step toward democratizing AI. By
running the optimizer in the host EPS, we show a new form of mixed precision
for faster throughput and convergence. In addition, the EPS enables dynamic
neural architecture approaches by varying layers across iterations. Finally, we also
propose and demonstrate a constant-memory variation of L2L and we propose
future enhancements. This work was performed on GPUs first, but it also
targets all high TFLOPS/Watt accelerators.
1 Introduction
The transformer architecture spawned the "ResNet" moment in natural language processing (NLP),
where residual blocks of arbitrary depth can be stacked to create state-of-the-art models such as
BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019) and the recently published gigantic GPT-3
with 175B parameters (Brown et al., 2020). Although these models reduce design complexity, they
have significant overhead in memory requirements. BERT-large can barely train on a high-end GPU
such as the V100 with 16GB with a batch-size of 2.
Preprint. Under review.
arXiv:2002.05645v4 [cs.LG] 2 Jun 2020
Training large NLP models like BERT with billions of parameters has only been successfully carried
out on high-bandwidth memory devices such as GPUs and TPUs with high memory capacities. The
memory size is influenced not only by the model parameters but also by a sufficiently large batch size
required for convergence. Transformer-class models such as BERT can be classified as having
high weight/activation ratios: they have a high number of parameters and yet relatively small output
activations. For instance, BERT-large has 24 encoder layers and 350M parameters, but the layer output
size is only 1MB per sample. This is the key observation behind a more efficient execution
method for large NLP models.
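As a rough illustration of this weight/activation ratio, the sketch below counts the parameters of one encoder layer and the size of its output activation for a single sample. It assumes the BERT-Large sizes listed in Table 1, a standard encoder layer layout (four attention projections, a two-layer feed-forward block and two layer norms) and FP16 storage. The helper names are hypothetical and the numbers are estimates, not measurements from this paper.

```python
def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Parameter count of one standard transformer encoder layer (assumed layout)."""
    attention = 4 * (hidden * hidden + hidden)            # Q, K, V and output projections, with biases
    feed_forward = (hidden * intermediate + intermediate  # first dense layer
                    + intermediate * hidden + hidden)     # second dense layer
    layer_norms = 2 * 2 * hidden                          # two layer norms, scale and shift each
    return attention + feed_forward + layer_norms


def layer_output_bytes(seq_len: int, hidden: int, bytes_per_value: int) -> int:
    """Size in bytes of one layer's output activation for a single sample."""
    return seq_len * hidden * bytes_per_value


HIDDEN, INTERMEDIATE, SEQ_LEN = 1024, 4096, 512  # sizes from Table 1
FP16_BYTES = 2                                    # assumption: FP16 storage

weight_bytes = encoder_layer_params(HIDDEN, INTERMEDIATE) * FP16_BYTES
activation_bytes = layer_output_bytes(SEQ_LEN, HIDDEN, FP16_BYTES)

print(f"weights per layer:   {weight_bytes / 2**20:.1f} MiB")
print(f"output per sample:   {activation_bytes / 2**20:.1f} MiB")
print(f"weight/output ratio: {weight_bytes / activation_bytes:.1f}")
```

Under these assumptions, one layer's weights are roughly 24 times larger than the per-sample output it hands to the next layer, which is why reusing a resident layer across many samples is attractive.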
1.1 Related Work
Traditional distributed training of neural networks started with data parallelism which keeps a copy
of the whole model on each device and partitions data among multiple devices (Dean et al., 2012;
Krizhevsky, 2014; Lian et al., 2015; Zhang et al., 2016; Chen et al., 2016a). Data parallelism
works on the assumption that the whole model can fit on the device, which is not necessarily true
anymore. Training larger models has required the model to be partitioned across multiple
devices (Dean et al., 2012) using model parallelism approaches, which can often be inefficient and
hard to implement. Pipelining is another traditional approach that is common for distributed training,
which overlaps computations between the layers.
There are more novel approaches proposed in recent years to solve the memory limitation problem.
The first is PipeDream (Harlap et al., 2018), which partitions a model across multiple devices and
pipelines the execution of forward passes interspersing them with backward passes to maximize
hardware utilization. PipeDream updates on every minibatch and circumvents staleness by maintaining
various versions of the model. A related model parallelism approach is GPipe (Huang et al., 2018),
which also partitions the model across multiple devices. However, GPipe pipelines the execution of
microbatches before applying a single synchronous gradient update for the entire minibatch. GPipe
stacks the forward pass output activations and recomputes them during the backward pass as it pops each
microbatch off the stack. GPipe and PipeDream both have overheads at the start of the
pipeline, and both approaches require the number of devices to scale with the model depth and not
just the layer size. Therefore, they are not constant-memory approaches. Also, neither approach has made
specific extensions for distributed data parallelism training over model parallelism that can overcome
their overheads.
A third method is OpenAI’s gradient checkpointing (Chen et al., 2016b; Bulatov, 2018), which
trades memory for more computation. A deep neural network can checkpoint a subset of nodes
in the computational graph so that it does not need to retain the state of all the nodes. For a node’s
backward pass, the required activations are recomputed from the nearest checkpoint. A constant-memory
implementation of gradient checkpointing is feasible, but results in a computational complexity
that scales as O(N^2), and its recomputation costs for large models are massive.
vDNN (Rhu et al., 2016) is another technique, proposed by Nvidia, where special CUDA memory
management techniques convert a model to a layerwise execution on the device where older layers
are released to CPU memory based on a layer distance heuristic. vDNN was demonstrated on vision
models, and works exceptionally well on a Titan X GPU. However, vDNN is different by nature
as the heuristic trades off performance with a coarse grain of layer-level buffering. This will not
adapt well to large transformer-based models where the buffering requirements would be high due to
the enormous size of the layer and would still cause computational efficiency issues due to the smaller
memory available for execution. vDNN was also constructed to off-load entire activations to avoid
recompute, but even if it only off-loaded output activations, its heuristic, chosen by CUDA-level
software based on layer distance instead of carefully orchestrated transfers as in L2L, cannot hide
the transfer latencies or the data parallelism scaling overhead incurred.
The recently published DeepSpeed and ZeRO (Rajbhandari et al., 2019) partition a single copy of the
model across many GPUs while running them in data parallelism layer-by-layer. DeepSpeed is an
effective method for large models, as its authors demonstrate with a 17B-parameter model over 256 GPUs. But
DeepSpeed requires the model to fit across the combined memory of all the GPU devices.
There is no known solution, however, where a large size transformer-based model of billions of
parameters can be run on a single device with insufficient on-board memory at throughput that can be
theoretically adjusted to over 90% of the throughput of a device with sufficient memory.
1.2 L2L Contribution
In this paper, we propose a new relay-style execution algorithm called L2L (layer-to-layer) that runs
models of high weight/activation ratio on a single device. The following are the main contributions
of L2L:
• L2L keeps only the executing layer and transit buffers on the device which results in a less
than 1GB graph on device. The whole model and the optimizer with its state are in the
host which relays the next layer through the host-to-device interface after each layer-level
iteration on the device. Fig 1 shows L2L structure vs conventional approach.
• In L2L, a new innerlooping approach is proposed where we run multiple micro-batches (of
a minibatch) on one layer at a time. This increases throughput with reduced communication,
and it also enables running a larger batch size on a single device.
• Combining the layer-to-layer execution and micro-batching techniques, L2L can scale the
model by depth with constant memory without model partitioning.
• L2L enables using new precisions such as cross mixed precision (CMP) presented in this
paper more efficiently.
Figure 1: L2L execution with EPS compared with conventional execution.
L2L allows a researcher to run a very large model independent of depth on a single device or group
of devices with a sufficiently large batch size for convergence.
With L2L, we show that we can not only run BERT-large with a higher batch size, less memory and
performance comparable to the baseline, but also run a gigantic 384-layer BERT
on a single GPU with only 3.69GB. Every other technique results in out of memory even with 36
layers. Furthermore, L2L allows fitting models with 50 billion parameters on a single 16GB V100.
In theory, L2L can run on top of any model parallelism (pipelined or just partitioned) or checkpointing,
so it is complementary. In particular, it can be combined with DeepSpeed and ZeRO, as the same model
memory partitioning can be applied in the eager param-server, since each executing device only carries a
much smaller part of the model.
2 Proposed Algorithm
In this section, we propose a new layer-to-layer (L2L) execution algorithm for running large
transformer-based models with constant memory. L2L achieves this by reducing the graph size
and layer-wise microbatch looping explained below.
2.1 L2L Graph Reduction
In conventional methods, the whole model graph resides on device. In transformer-based models, all
encoder layers have similar architectures. Keeping all of the encoders on device is one of the major
limitations of running very large models. L2L moves the whole graph to the host which is a special
form of param-server we call Eager Param-Server (EPS).
A traditional synchronous param-server hosts a coherent space where devices keep their parameters
as a state dictionary from which they push all the gradients and update the models at every sync.
The EPS, on the other hand, not only services the state space on every layer-level sync, but it also
reduces eagerly, that is, as soon as the layer-level gradients arrive and in parallel
to execution. Fig 2 shows the execution strategy of L2L for running transformers compared with
the conventional approach. L2L only keeps one encoder on device, which is a flexible encoder layer,
meaning that it receives each layer’s weights from the EPS one at a time in the forward pass. When running
the forward pass, only the final activations of each layer are stashed and all intermediate activations are discarded.
In the backward pass, the stashed activations are used one at a time to recompute the forward layer
and are then used for the backpropagation of each layer. The gradients computed for each layer are sent to the
EPS and erased from the device memory before going to the next layer. The EPS updates the weights in the
host after receiving all the gradients.
Figure 2: Model graph residing in device for L2L compared with conventional execution.
Using this approach, we only need a 3 layer model to be on device which results in massive saving
in on-device memory. In addition, transformer activations can be moved to the EPS if required for
further memory saving.
2.2 Inner Looping in L2L
In conventional data-parallel minibatch execution, each device executes a minibatch of size mb
through forward and backward of the whole model and gradient reduction and weight updates
are performed afterward. Fig. 3a shows the conventional execution commonly used in the field.
The recomputation and the weight and gradient transfers used in L2L cost effective throughput.
But unlike other techniques, L2L can compensate for this by keeping the layer on the device long enough, i.e.,
running more data on each layer. To achieve this, L2L proposes a micro-batch looping technique
shown in Fig. 3. The idea is to run a long minibatch mb, if necessary dividing it into a number of
microbatches u1, u2, u3, on just one layer at a time so that the overall communication overhead of
transmitting the layers over a slow host-to-worker interface is insignificant. Note that once the
overhead is minimized, there is no need to increase the number of microbatches per minibatch further.
(a) Conventional data parallel execution
(b) L2L execution with micro-batch looping
Figure 3: L2L execution with micro-batch looping compared with the conventional data parallelism.
To give a better picture of the proposed L2L micro-batch looping, we compare this algorithm with
the conventional baseline with gradient accumulation. Algorithm 1 shows the execution order in the
baseline with accumulated gradients and Algorithm 2 shows the execution order of the L2L approach.
The main idea here is that L2L inverts the minibatch loop and layer loop. That is the key principle for
depth-independent memory sizing.
Algorithm 1 Baseline with AG Execution
Input: data x, #layers layers, #uBatches ub
for batch in data do
    for u in range(ub) do
        for l in layers do
            act_l = forward(act_{l-1})
        end for
        for l in reverse(layers) do
            g_l += backward(g_{l+1})
        end for
        loss = forward(batch)/u
        grads = backward(loss)
        acc_grads = acc_grads + grads
    end for
    for l in layers do
        w_l = optimizer(w_l, g_l)
    end for
end for

Algorithm 2 L2L Execution
Input: data data, #layers layers, #uBatches ub
for batch in data do
    for l in layers do
        for u in range(ub) do
            act_l = forward(act_{l-1})
        end for
    end for
    for l in reverse(layers) do
        for u in range(ub) do
            g_l = backward(g_{l+1})
        end for
    end for
    for l in layers do
        w_l = optimizer(w_l, g_l)
    end for
end for
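The loop inversion between the two algorithms can be made concrete with a toy example. The sketch below is a minimal illustration, not the authors' implementation: it assumes a chain of scalar tanh layers, uses plain Python lists to stand in for host and device memory, and all function names are hypothetical. It runs the baseline order (micro-batch loop outside, layer loop inside) and the L2L order (layer loop outside, micro-batch loop inside, with per-layer stashed inputs and recomputation in the backward pass) and checks that both accumulate the same weight gradients.

```python
import math


def forward(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def backward(w, x, grad_out):
    """Recompute the layer output from its input, then return (dL/dw, dL/dx)."""
    y = math.tanh(w * x)
    local = 1.0 - y * y
    return grad_out * local * x, grad_out * local * w


def baseline_grads(weights, micro_batches):
    """Algorithm 1 order: every micro-batch traverses the whole model."""
    grads = [0.0] * len(weights)
    for x in micro_batches:
        inputs = []
        for w in weights:
            inputs.append(x)
            x = forward(w, x)
        g = 1.0  # gradient of a toy loss equal to the final output
        for l in reversed(range(len(weights))):
            gw, g = backward(weights[l], inputs[l], g)
            grads[l] += gw
    return grads


def l2l_grads(host_weights, micro_batches):
    """Algorithm 2 order: one layer at a time across all micro-batches."""
    n = len(host_weights)
    acts = list(micro_batches)
    stash = []  # stash[l][u] is the input of layer l for micro-batch u
    for l in range(n):
        w = host_weights[l]  # only this layer is "on the device"
        stash.append(list(acts))
        acts = [forward(w, a) for a in acts]
    grads = [0.0] * n
    g = [1.0] * len(acts)
    for l in reversed(range(n)):
        w = host_weights[l]
        for u in range(len(g)):
            gw, g[u] = backward(w, stash[l][u], g[u])
            grads[l] += gw  # accumulated per layer, ready to send to the host
    return grads


weights = [0.5, -1.2, 0.8]
micro_batches = [0.3, -0.7, 1.1, 0.2]
print(baseline_grads(weights, micro_batches))
print(l2l_grads(weights, micro_batches))
```

In the L2L order the working set at any moment is one layer's weight plus the stash, which is what makes the device footprint independent of the number of layers once the stash is moved to the host.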
2.2.1 Computational Complexity Analysis on Minimizing Transfer Overhead
To evaluate the computational complexity of the forward pass, let us assume that the best effective
TFLOPs when running a layer is F, achieved with a micro-batch size of ub, that the number of layers is N,
and that L is the layer size in MB. Also, the PCIe bandwidth from host to device is B GB/sec and c is the
number of giga operations in a forward pass of one layer for ub samples. Without loss of generality, we assume a
backward pass that is twice as long as the forward pass, i.e., 2 × c. Under these assumptions,
the transfer time of one layer over PCIe is X = L/B in msec. The forward computation for one layer on ub samples is
C = c/F in msec. The total forward pass time for ub is N × (C + X). With the backward pass and
recompute, the total time for ub can be calculated using Eq. 1,
\[ \text{Total time} = N \times (C + X) + N \times (3 \times C + X) = N \times (4 \times C + 2 \times X), \tag{1} \]
and the throughputs for the forward pass and the whole training can be calculated using Eqs. 2 and 3,
\[ T_{\text{forward}} = \frac{1000 \times \mathit{ub}}{N \times (C + X)} \ \text{(samples/sec)}, \tag{2} \]
\[ T_{\text{training}} = \frac{1000 \times \mathit{ub}}{N \times (4 \times C + 2 \times X)} \ \text{(samples/sec)}. \tag{3} \]
With innerlooping, the overhead ratio X/C can be vastly reduced. Assuming that u is the number of
micro-batches in the inner loop, the forward computation for one layer on u × ub samples is C′ = u × C and
the total forward pass time for u × ub samples is N × (u × C + X). With the backward pass and recompute, the total
time for u × ub samples can be calculated using Eq. 4,
\[ \text{Total time} = N \times (u \times C + X) + N \times (3 \times u \times C + X) = N \times (4 \times u \times C + 2 \times X), \tag{4} \]
and the throughputs for the forward pass and the total training can be calculated using Eqs. 5 and 6,
\[ T_{\text{forward}} = \frac{1000 \times u \times \mathit{ub}}{N \times (u \times C + X)} \ \text{(samples/sec)}, \tag{5} \]
\[ T_{\text{training}} = \frac{1000 \times u \times \mathit{ub}}{N \times (4 \times u \times C + 2 \times X)} \ \text{(samples/sec)}. \tag{6} \]
The effect of X on the throughput can be diminished by choosing a large u. For instance, in
the case where X is the same as C, the transfer overhead can be reduced to less than 10% by choosing
u = 10: the transfers then account for 2X/(4uC + 2X) = 2/42, or about 5%, of the total time.
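The timing model above is easy to evaluate numerically. The following sketch divides the samples processed by the total time of Eq. 4 and reports the share of that time spent on transfers. The function names and the sample values for C, X, N and ub are illustrative assumptions, not measurements from this paper.

```python
def l2l_total_time_ms(u: int, n_layers: int, c_ms: float, x_ms: float) -> float:
    """Total time of Eq. 4 for u micro-batches: N * (4*u*C + 2*X), in msec."""
    return n_layers * (4 * u * c_ms + 2 * x_ms)


def l2l_training_throughput(ub: int, u: int, n_layers: int,
                            c_ms: float, x_ms: float) -> float:
    """Samples per second: u * ub samples processed in the total time of Eq. 4."""
    return 1000.0 * u * ub / l2l_total_time_ms(u, n_layers, c_ms, x_ms)


def transfer_fraction(u: int, c_ms: float, x_ms: float) -> float:
    """Share of the total time spent moving layers between host and device."""
    return 2 * x_ms / (4 * u * c_ms + 2 * x_ms)


# Illustrative values only: 24 layers, micro-batch of 64, and X equal to C.
N_LAYERS, UB, C_MS, X_MS = 24, 64, 1.0, 1.0

for u in (1, 2, 4, 10):
    print(
        f"u={u:2d}  throughput={l2l_training_throughput(UB, u, N_LAYERS, C_MS, X_MS):7.1f} samples/sec"
        f"  transfer share={100 * transfer_fraction(u, C_MS, X_MS):4.1f}%"
    )
```

Because the transfer term 2 × X is paid once per layer regardless of u, its share shrinks as u grows while the compute term scales with the work actually done.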
2.3 Cross Mixed Precision
To reduce the memory requirement, speed up kernel computation with fewer exchanges, and reach a higher
peak TFLOPs, it is necessary to run the models in FP16. However, key NLP workloads don’t converge
in pure FP16. Nvidia’s popular automatic mixed precision (AMP) package provides options to run
models with mixed precision, which requires keeping the FP32 master copy in the GPU memory. As
explained above, L2L keeps the master copy of the model in the EPS, allowing us to use a new way of
running mixed precision called cross mixed precision (CMP). In CMP, we keep an FP32 master copy
of the model in the host and run the reduced graph model on the GPU in FP16. With this approach,
the optimizer, which runs in the EPS, updates the weights in FP32 precision while the forward and
backward passes run in pure FP16. CMP gives us better performance than AMP (O2) due to
the flexibility L2L provides for running mixed precision.
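The split between an FP32 master copy on the host and FP16 execution on the device can be illustrated with a toy update loop. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain SGD instead of LAMB, a constant stand-in gradient, and emulates FP16 and FP32 rounding with the standard `struct` module. All function names are hypothetical. It shows one well-known reason a host-side FP32 master copy helps: a small update that rounds away in FP16 is retained in FP32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def cmp_step(master_w: float, grad_fn, lr: float) -> float:
    """One CMP-style step: FP16 compute on the device, FP32 update on the host."""
    device_w = to_fp16(master_w)           # FP16 copy relayed to the device
    grad = to_fp16(grad_fn(device_w))      # forward/backward in FP16
    return to_fp32(master_w - lr * grad)   # optimizer update on the FP32 master copy


def pure_fp16_step(w: float, grad_fn, lr: float) -> float:
    """Same step with the weight itself stored and updated in FP16."""
    grad = to_fp16(grad_fn(w))
    return to_fp16(w - lr * grad)


def constant_grad(_w: float) -> float:
    return 0.1  # stand-in gradient, for illustration only


master = fp16_only = 1.0
for _ in range(10):
    master = cmp_step(master, constant_grad, lr=1e-3)
    fp16_only = pure_fp16_step(fp16_only, constant_grad, lr=1e-3)

print("FP32 master after 10 steps:", master)
print("pure FP16 after 10 steps:  ", fp16_only)
```

In this toy setting each update is about 1e-4, smaller than half the FP16 spacing just below 1.0, so the pure FP16 weight never moves while the FP32 master copy accumulates every step.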
3 Experimental Results
In this section, we present the experimental results for the L2L approach compared with the baseline.
3.1 Experimental Data and Setups
We have used the GLUE dataset (Wang et al., 2019) in our experiments which includes 8 sequence
classification tasks. Our experiments are performed on a machine with 4 16GB V100 GPUs and a
CPU with 512GB memory. The HuggingFace library (Wolf et al., 2019) is used as a baseline for
development and experiments. The pretrained model provided by BERT (Devlin et al., 2019) is used
as initial weights for fine-tuning the sequence classification task in both baseline and L2L methods.
Table 1 shows the BERT configuration for both baseline and L2L.
Table 1: BERT Configuration.
BERT CONFIGURATION FOR BASELINE AND L2L
#TRANSFORMER LAYERS    24
HIDDEN SIZE            1024
INTERMEDIATE SIZE      4096
MAX SEQUENCE LENGTH    512
OPTIMIZER              LAMB (You et al., 2019)
3.2 Memory and Throughput Test
The major goal of L2L is to improve the speed and reduce the memory for running large language
models. To demonstrate this, we compare L2L and the baseline on running BERT-Large on 4
V100 GPUs with 16GB memory. Table 2 shows the throughput and memory comparison performed
on SST-2, which is one of the GLUE tasks. L2L BERT-Large with CMP runs with roughly half the memory and
twice the throughput of the baseline with AMP (4.96GB vs. 9.2GB and 52.48 vs. 26.2 samples/sec). L2L also allows us to fit a uBatch size of 64 on each device without any model
partitioning, while the baseline can only fit a device batch size of 2. Note that there is no inner-looping in
this run.
Table 2: Memory and throughput comparison between L2L and baseline-AG on 4 V100 GPUs.
METHOD    PRECISION  DEVICE BATCH SIZE  UBATCH SIZE  TOTAL BATCH SIZE  THROUGHPUT (SAMPLES/SEC)  MEMORY (GB)
BASELINE  FP32       2                  NA           256               16.52                     10.51
BASELINE  AMP        2                  NA           256               26.2                      9.2
L2L       FP32       64                 64           256               22.5                      9.45
L2L       CMP        64                 64           256               52.48                     4.96
3.2.1 Increasing L2L Throughput with Inner Looping
One of the main contributions of L2L is the innerlooping method discussed in Section 2.2. By
increasing the number of uBatches, we can further improve the throughput of L2L. This would, of
course, increase the overall minibatch size. Table 3 shows L2L results when increasing the number of
uBatches from 1 to 4. As the results show, innerlooping increases the throughput by over 60%.
Table 3: Memory and throughput test for L2L by increasing #uBatches over 4 V100 16GB GPUs.
UBATCH SIZE  #UBATCHES  DEVICE BATCH SIZE  THROUGHPUT (SAMPLES/SEC)  MEMORY (GB)
64           1          64                 52.48                     3.69
64           2          128                70.97                     3.89
64           4          256                84.91                     4.27
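The relative gains in Table 3 can be checked with a few lines of arithmetic. The snippet below only re-derives ratios from the throughput values reported in the table; the variable and function names are hypothetical.

```python
# Throughput in samples/sec from Table 3, keyed by the number of micro-batches.
throughput = {1: 52.48, 2: 70.97, 4: 84.91}


def speedup(num_ubatches: int) -> float:
    """Throughput relative to the run without inner looping (#uBatches = 1)."""
    return throughput[num_ubatches] / throughput[1]


for n in sorted(throughput):
    print(f"#uBatches={n}: {speedup(n):.2f}x the throughput without inner looping")
```

Going from 1 to 4 micro-batches gives a ratio of about 1.62, in line with the gain of over 60% stated above.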
3.3 Going Beyond BERT
L2L aims at removing the barriers of memory requirement for running giant transformer-based
models on affordable devices. To show this breakthrough, we tried to fit larger models on 16GB
V100 GPUs by increasing the depth and width of the model.
3.3.1 Constant Memory by Depth of the Model
In this experiment, we kept the size of each transformer layer constant and increased the depth of BERT for L2L
and the baseline. For L2L, we ran separate tests with the activation stash kept on the GPU and with it moved
to the CPU. Table 4 shows the memory required to fit L2L and the baseline on a single GPU.
Table 4: Memory comparison between the baseline and L2L.
METHOD            UBATCH SIZE  DEVICE BATCH SIZE  #LAYER  #PARAMETERS  MEMORY (GB)
BASELINE          2            2                  24      300 MILLION  9.23
BASELINE          2            2                  48      600 MILLION  OOM
L2L-STASH ON GPU  64           64                 24      300 MILLION  5.22
L2L-STASH ON GPU  64           64                 48      600 MILLION  6.76
L2L-STASH ON GPU  64           64                 96      1.2 BILLION  9.83
L2L-STASH ON CPU  64           64                 24      300 MILLION  3.69
L2L-STASH ON CPU  64           64                 96      1.2 BILLION  3.69
L2L-STASH ON CPU  64           64                 384     4.8 BILLION  3.69
According to the experiments, we can increase the depth of the model indefinitely (as long as the
CPU memory allows) without increasing the GPU memory requirement or partitioning the model.
The constant memory aspect of L2L opens up many opportunities for breaking the boundaries of
NLP model development and also reducing the cost of deploying models.
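The "stash on GPU" rows of Table 4 also show how the device memory would grow if the stash stayed on the device. The sketch below estimates the growth per added layer from those rows. It assumes that the growth is entirely due to the activation stash and that 1 GB equals 1000 MB; the names are hypothetical.

```python
# (number of layers, GPU memory in GB) for the "L2L-stash on GPU" rows of Table 4.
stash_on_gpu = [(24, 5.22), (48, 6.76), (96, 9.83)]
UBATCH_SIZE = 64  # device batch size used in those rows


def gb_per_layer(a, b):
    """Memory growth per added layer between two table rows."""
    (layers_a, mem_a), (layers_b, mem_b) = a, b
    return (mem_b - mem_a) / (layers_b - layers_a)


slopes = [gb_per_layer(stash_on_gpu[i], stash_on_gpu[i + 1])
          for i in range(len(stash_on_gpu) - 1)]
mb_per_sample_per_layer = [1000 * s / UBATCH_SIZE for s in slopes]

for s, m in zip(slopes, mb_per_sample_per_layer):
    print(f"{1000 * s:.1f} MB per layer, {m:.2f} MB per sample per layer")
```

Both intervals give about 64 MB per layer, or about 1 MB per sample per layer, which agrees with the layer output size quoted in the introduction. With the stash on the CPU this term leaves the device, and the footprint stays at 3.69GB regardless of depth.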
3.3.2 Memory by Width of the Model
We can increase the width of the model as long as one transformer layer fits in a single GPU and the
whole model fits in the CPU memory. To show the performance of L2L on models larger than BERT,
we performed a test on a transformer-based model with the settings used in Turing-NLG (Rajbhandari
et al., 2019). The model has 78 transformer layers, each with a hidden size of 4256,
and the maximum sequence length is set to 1024, which results in a 17 billion parameter model. Table 5
shows the performance of L2L on this model.
Table 5: Performance of L2L on Turing-NLG-like model.
UBATCH SIZE  #UBATCHES  DEVICE BATCH SIZE  #LAYER  #PARAMETERS  MEMORY (GB)  THROUGHPUT (SAMPLES/SEC)
8            16         128                78      17 BILLION   6.68         0.58
Table 5 shows that L2L enables fitting giant models on a single device with large batch sizes and
without partitioning the model. Using a machine with 16GB V100s and a CPU with 512GB
memory, we were able to fit models of up to 50 billion parameters on a single device.
3.4 Accuracy Test
L2L doesn’t change the model architecture and is almost mathematically equivalent to the original
execution. To confirm this, we performed an accuracy test on the GLUE tasks for BERT-Large and
compared it with the baseline. In addition, we compared the performance of CMP with Nvidia’s
mixed precision run on the baseline. For this comparison, we ran both the baseline and L2L for 3 epochs
with the learning rate set to 10^-3. Table 6 shows the test results for both FP32 and mixed precision.
Table 6: Accuracy comparison of L2L and baseline using gradient accumulation for different GLUE
dataset tasks.
METHOD    PRECISION  DEVICE BATCH SIZE  ACCURACY (%)
                                        QNLI   SST-2  COLA   STS-B  MRPC   RTE
BASELINE  FP32       2                  91.32  93.46  58.08  89.0   88.58  66.06
L2L       FP32       64                 91.46  93.92  61.05  88.50  88.30  69.67
BASELINE  AMP        2                  91.70  93.57  58.98  NA     87.97  67.87
L2L       CMP        64                 91.45  94.26  61.18  NA     88.50  72.2
Results demonstrate that both L2L and baseline converge to comparable accuracies within a reasonable
error bound. As BERT is very sensitive to hyperparameters, the reported accuracies can be further
tuned by changing them.
4 Conclusion
Training large language models requires massive resources and device memories that are only available
with high-end GPUs and TPUs. Moreover, newer ASICs for acceleration are emerging with high
FP16 performance and little or no off-chip memory. For these reasons, we introduce a new execution
paradigm called L2L by elastically using the CPU memory for keeping the model and the optimizer.
The devices only keep the executing layer while a process in the CPU called EPS prepares and
transmits the next layer. Using inner-looping, L2L reduces the frequency of transfers from EPS to
the devices. EPS also handles reduction and optimization tasks with the potential for virtually linear
scaling. An additional benefit is that L2L outperforms the baseline due to two factors: (a) faster execution due
to relaxed memory, and (b) infrequent updates.
We demonstrate the L2L method by running BERT-Large on V100 GPUs with 45% less memory and a 40%
increase in throughput compared to the baseline. We also demonstrate that L2L never runs out of memory
even when the BERT model grows to 384 layers while all other approaches go out-of-memory. In
conclusion, the constant-memory nature of this approach allows scaling to arbitrary depth in the
number of layers. We enable developers to run very large models on more affordable hardware.
Lastly, each layer can be structurally agnostic to others, encouraging dynamic modeling approaches
such as neural architecture search (NAS).
Broader Impact
By proposing L2L, we are following a path to democratize AI and make NLP models accessible
for everyone. L2L allows researchers to train such models with less memory and higher throughput
resulting in massive cost saving. It provides opportunities for using affordable devices to train and
deploy giant NLP models. In addition, L2L provides flexibility to pursue neural architecture search
considering the affordability and dynamic nature of the GPU model.
We hope this new execution paradigm will also influence the hardware industry that is currently
investing in single-tier devices with brute-force High Bandwidth Memory technologies and high
speed links to also consider a two-tier approach to training where the top tier is responsible for the
model and optimization (EPS) while the device tier is responsible for executing the layer.
Acknowledgements
This paper and the research behind it would not have been possible without the exceptional support
of our manager and colleagues. We would especially like to thank Tiyasa Mitra, Mohit Mittal, Layali
Rashid, Marc Tremblay, and Rajiv Kapoor for their advice and support during the development and
publishing of this paper.
References
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R.,
Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray,
S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D.
(2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Bulatov, Y. (2018). Fitting larger networks into memory. Technical report, OpenAI.
Chen, J., Monga, R., Bengio, S., and Józefowicz, R. (2016a). Revisiting distributed synchronous
SGD. arXiv preprint arXiv:1604.00981.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016b). Training deep nets with sublinear memory
cost. arXiv preprint arXiv:1604.06174.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior,
A. W., Tucker, P. A., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep networks. In
Proceedings of the 26th Conference on Neural Information Processing Systems.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association
for Computational Linguistics.
Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N., Ganger, G., and Gibbons,
P. (2018). Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint
arXiv:1806.03377.
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V.,
Wu, Y., and Chen, Z. (2018). Gpipe: Efficient training of giant neural networks using pipeline
parallelism. arXiv preprint arXiv:1811.06965.
Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv
preprint arXiv:1404.5997.
Lian, X., Huang, Y., Li, Y., and Liu, J. (2015). Asynchronous parallel stochastic gradient for
nonconvex optimization. In Proceedings of the 29th Conference on Neural Information Processing
Systems.
Rhu, M., Gimelshein, N., Clemons, J., Zulfiqar, A., and Keckler, S. W. (2016). vDNN: Virtualized deep neural networks for
scalable, memory-efficient neural network design. arXiv preprint arXiv:1602.08124.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are
unsupervised multitask learners.
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. (2019). Zero: Memory optimization towards
training a trillion parameter models. arXiv preprint arXiv:1910.02054.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). Glue: A multi-task
benchmark and analysis platform for natural language understanding. In Proceedings of the
International Conference on Learning Representations.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R.,
Funtowicz, M., and Brew, J. (2019). Huggingface’s transformers: State-of-the-art natural language
processing. arXiv preprint arXiv:1910.03771.
You, Y., Li, J., Hseu, J., Song, X., Demmel, J., and Hsieh, C. (2019). Large batch optimization for
deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962.
Zhang, W., Gupta, S., Lian, X., and Liu, J. (2016). Staleness-aware async-sgd for distributed deep
learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.
