Multi-node Bert-pretraining: Cost-efficient Approach by Lin, Jiahuang et al.
MULTI-NODE BERT-PRETRAINING: COST-EFFICIENT APPROACH
A PREPRINT
Jiahuang Lin
University of Toronto
Vector Institute
jacoblin@cs.toronto.edu
Xin Li
University of Toronto
Vector Institute
xin.li@vectorinstitute.ai
Gennady Pekhimenko
University of Toronto
Vector Institute
pekhimenko@cs.toronto.edu
August 4, 2020
ABSTRACT
Recently, large scale Transformer-based language models such as BERT, GPT-2, and XLNet have
brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP)
tasks. One of the common trends in these recent models is a significant increase in model complexity,
which introduces both more weights and computation. Moreover, with the advent of large-scale
unsupervised datasets, training time is further extended due to the increased amount of data samples
within a single training epoch. As a result, to train these models within reasonable time, machine
learning (ML) programmers often require advanced hardware setups such as the premium GPU-
enabled NVIDIA DGX workstations or specialized accelerators such as Google’s TPU Pods. Our
work addresses this limitation and demonstrates that the BERT pre-trained model can be trained
within 2 weeks on an academic-size cluster of widely available GPUs through careful algorithmic and
software optimizations. In this paper, we present these optimizations on how to improve single device
training throughput, distribute the training workload over multiple nodes and GPUs, and overcome
the communication bottleneck introduced by the large data exchanges over the network. We show
that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an
academic setting, but with a much less expensive and less aggressive hardware resource requirement
than in previously demonstrated industrial settings based on NVIDIA DGX machines or Google’s
TPU Pods.
1 Introduction
The BERT language model [1] has significantly improved the state-of-the-art performance of many downstream NLP
tasks such as language understanding and question answering. However, training BERT is computationally intensive due
to its high model complexity and large amount of training data that needs to be processed to achieve the state-of-the-art
model accuracy. There are 110M parameters in BERT base and 340M parameters in BERT large, and, due to the
unsupervised nature of the algorithm, the model is trained on an abundance of unlabeled data over many epochs, e.g.,
the BookCorpus dataset [2] (800M words) and the English Wikipedia (2500M words) dataset were trained for 40 epochs
in the original BERT model [1]. Nonetheless, with BERT being a Transformer-based language model [3], the stacked
attention and fully-connected layers allow significantly more parallelism than sequential models (e.g., LSTM
RNN models [4, 5, 6, 7, 8]), whose higher memory requirement prevents the GPU from utilizing its
compute cores efficiently [9]. This architectural advantage of BERT naturally provides opportunities to train BERT on
multiple GPU/TPU devices, which ideally reduces the training time linearly with the amount of hardware resources available.
Nonetheless, machines capable of servicing such a level of parallelism are usually very expensive and
dedicated to computationally intensive workloads [10]. For example, NVIDIA released BERT pre-training results
obtained on 32 interconnected DGX-1, DGX-2, or DGX-A100 workstations [11]. The DGX machines are
equipped with high-end GPUs, CPUs, and interconnects [12], and have a unit price ranging from $150K to $400K [13].
Therefore, such a hardware setup requires an initial capital cost of around $4.8M to $12.8M (32 machines at $150K to
$400K each). This capital cost is not affordable for many researchers and research institutions, despite the significant
improvement in training time for large models. In addition, the setup is over-powered for the routine computation needs of many
typical ML users. For example, over a one-month period, jobs on one of Vector Institute’s major clusters (64 nodes with
8 GPUs each) were allocated only 1.28 GPUs on average. This indicates that academic users have relatively little need
for the high-speed interconnects and communication collectives that DGX machines and similar high-performance
workstations are configured around. Anecdotally, we also observe
relatively low usage of advanced hardware features such as TensorCores [14] and mixed precision [15]. We attribute
this to two factors: (i) limited awareness among ML researchers/practitioners of advanced hardware features for ML
and (ii) the necessity to deal with problems such as overflow/underflow in FP16 computation [15, 16], which require special
software tricks [16] to be handled properly.
Our work aims to exploit the same parallelism in the BERT model, but distributes the workload over a much cheaper 32-node
cluster available to us at Vector Institute, connected with a commodity 10Gb/s network, where each node is equipped
with 8 low-budget NVIDIA T4 GPUs [17]. We apply multiple layers of optimization to improve both the single GPU
performance and distributed performance over the network and over internal PCIe interconnect (used within a node).
With a much lower hardware budget of approximately $600K and a much more generic compute environment, we
are able to achieve a 70% weak scaling efficiency and complete BERT training in 12 days, which we consider to be a
reasonable training time under an academic setting.
2 Related work
The pre-trained language models such as BERT [1], GPT-2 [18], XLNet [19], RoBERTa [20] and GPT-3 [21] have
been proven successful on many downstream NLP tasks including text classification, question answering, and natural
language inference. In this paper, we focus on the pre-training of BERT-large. We pick BERT-large because of its
wide adoption and state-of-the-art performance. From a systems perspective, BERT-large is also a suitable candidate
since the encoder-based attention layers [3] in BERT have similar characteristics to those of many of the aforementioned
models, and BERT is considered a next-generation workload for state-of-the-art ML benchmark suites such as
MLPerf [22].
2.1 BERT Pre-Training
In the original BERT paper [1], the model reuses the encoder implementation of the Transformer model [3]. Pre-training
is done with two tasks: (i) the masked language model, where the model predicts words that are randomly masked in
the input sentence, and (ii) next sentence prediction, where the model needs to classify whether two input sentences
are logically adjacent. A combination of the Wikipedia Corpus and BooksCorpus datasets is used to train BERT-large. The
pre-training process is carried out in two separate phases: phase 1 covers the first 90% of epochs using a
sequence length of 128 to improve training speed, and phase 2 covers the remaining 10% of epochs with a sequence length of
512 to learn the positional embeddings. The two training phases add up to a total of 40 epochs and take over 80 hours
to complete on 64 TPUv3 chips [1].
To reduce training time, a natural choice is to increase the training mini-batch size. A larger mini-batch size (number
of samples used for each back-propagation update) decreases the total number of iterations per training epoch as the
total number of samples in the dataset is fixed. Although a larger mini-batch size also increases the computation load
of each iteration, this additional workload can be fully parallelized given enough computation resources.
Nonetheless, experience has shown that the learning rate needs to be carefully tuned for large mini-batch training
to be successful [23]. As a result, the difficulty of training with very large mini-batches has prevented pre-training
from continuing to scale out with more hardware resources. To address this issue, the LAMB optimizer [24] was
introduced, in which the gradients and learning rate are further normalized and dynamically adjusted in a layer-wise
manner. The LAMB paper showed that, by following the two-phase pre-training convention, BERT-large pre-training time can
be shortened to 76 minutes by scaling to 1024 TPUv3 chips [24] without a loss of model accuracy.
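To make the layer-wise adaptation concrete, the following is a minimal sketch of one LAMB-style update for a single layer in plain Python. It follows the commonly described form of the algorithm (Adam-style moment estimates followed by a per-layer trust ratio); the function name `lamb_step` and all hyper-parameter defaults are illustrative assumptions, not the settings used in this work.

```python
import math

def lamb_step(w, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (lists of floats)."""
    # Adam-style first and second moment estimates with bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]
    # Per-element update direction plus decoupled weight decay.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio: rescale the step by ||w|| / ||update||.
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return new_w, m, v, trust

# Hypothetical two-parameter "layer" used only for illustration.
w0 = [3.0, 4.0]
new_w, m1, v1, trust = lamb_step(w0, [0.1, -0.2], [0.0, 0.0], [0.0, 0.0], step=1)
step_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_w, w0)))
```

In this sketch the norm of the applied step equals `lr` times the norm of the layer's weights regardless of the gradient scale, which is the property that makes very large mini-batches easier to train with.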
2.2 Distributed Training
As deep learning models become more powerful and complex, the training of those models also demands more
computation resources. Large models like ResNet [25] and DeepSpeech2 [26] can take weeks to train on a single GPU
device [27]. The need for shortening the training time of large deep learning models has brought up distributed training
algorithms. Among the distributed training algorithms, data parallelism [28] and model parallelism [29] are the two
most popular types. Data parallelism [28] is a natural way to scale out the training process by slicing and distributing
the training data across multiple devices. Each worker retains a full replica of the model and trains it on different data. Workers
synchronize the model updates by exchanging gradients. In contrast, model parallelism [29] divides
the model into pieces and distributes those pieces across devices to form a training pipeline. Workers
train on the same data but on different parts of the model, which allows the devices to fit a bigger model. Workers
synchronize over the activation maps.
However, as the training graphs of deep learning models are typically directed acyclic graphs, model parallelism [29], which
partitions the execution graph, essentially introduces strong sequential dependencies: at most one device is
fully utilized at any given time during training. To improve device utilization, pipelining across devices to overlap
computation was proposed [30]. Such overlapping requires the activation maps that are to be synchronized to
be stacked and stored. Assuming mini-batches enter the pipeline one at a time and the pipeline has length n, this
imposes extra memory storage that grows linearly with n on the devices in the pipeline, which does not scale as the number of
devices increases. Furthermore, the hard limit on device memory that model parallelism [29] brings leaves very little
room for researchers to optimize the training throughput. Data parallelism [28]
also introduces a hard limit: each device must be able to fit one complete replica of the model. One might argue that
this becomes a bottleneck as models grow; in practice, however, it is the feature maps rather than the model
itself that consume most of the memory. Although model size is usually not a bottleneck, data parallelism suffers from a trade-off between synchronization
cost and model parameter staleness in parameter updates [28].
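As a small illustration of synchronous data parallelism, the sketch below splits one mini-batch across two simulated workers, computes a local gradient on each shard, and averages the results. The toy model (a one-parameter linear fit), the data values, and the name `grad_mse` are assumptions made purely for illustration.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical mini-batch and model parameter.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Each worker holds a full model replica (here just `w`) and an equal data shard.
num_workers = 2
shard_size = len(batch) // num_workers
shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]

# Workers compute local gradients, then synchronize by averaging them.
local_grads = [grad_mse(w, shard) for shard in shards]
synced_grad = sum(local_grads) / num_workers

# Reference: gradient computed on the whole mini-batch by a single device.
full_grad = grad_mse(w, batch)
```

With equally sized shards, the averaged gradient matches the single-device gradient, so every replica applies the same update and no parameter staleness is introduced; the price is one gradient exchange per update.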
Network topologies have also been explored in the implementation of data parallelism in order to reduce the syn-
chronization cost. For example, a ring based system topology [31] has been proposed to maximize the inter-device
communication bandwidth. By having all devices jointly form a ring topology, each device only passes the computed
weight gradient to its neighbor in the ring. This approach ensures that the communication channel between any two
devices carries at most one model’s worth of gradients, avoiding traffic congestion. Further benchmarks have
shown that the aggregate bandwidth of this approach scales linearly with the number of devices [32].
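The ring exchange can be simulated in a few lines. The sketch below is a generic textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase), not the NCCL implementation; for simplicity each worker's gradient is assumed to have exactly one scalar chunk per worker.

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce; grads[i] is worker i's gradient (n chunks)."""
    n = len(grads)
    data = [list(g) for g in grads]
    assert all(len(g) == n for g in data), "one chunk per worker assumed"

    # Phase 1 (reduce-scatter): in step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): worker i now owns the fully reduced chunk
    # (i + 1) % n and forwards completed chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] = value
    return data

# Three simulated workers with integer gradients for exact comparison.
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker sends 2(n - 1) chunks of size 1/n of the gradient, so the per-link traffic stays close to twice the gradient size no matter how many workers join the ring.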
2.3 Mixed Precision Training
Reducing the arithmetic complexity can be beneficial to the run-time performance of training procedures, especially for
large models, as the weights of a deep learning model consume a great amount of memory. In addition to reducing the
memory footprint, lower-precision number representations also reduce the numerical computation complexity and thus
increase the calculation throughput, which effectively shortens the program execution time. For example, deep learning
models usually use full-precision floating point numbers (FP32) to store the weights and carry out computations. The
work on mixed precision training [15] showed the possibility of using 16-bit floating point numbers (FP16) while
preserving similar convergence behavior and model performance for DNNs. During training, FP16 is used to perform
multiplications between the weights and activations, and FP32 is used for the accumulation of the products during
the reduction. Since multiplications require more hardware resources, this optimization can significantly improve the
compute efficiency during training. Moreover, with the introduction of TensorCores [33], direct hardware support is
provided for this mixed-precision multiply-and-accumulate pattern. As a result, mixed precision training can
improve the training throughput and shorten the training time by 2-6 times on various representative DNN models [15].
This migration also effectively reduces the memory footprint during training, which allows a larger batch size to fit in
GPU memory.
Loss scaling is used to compensate for the loss of dynamic range from FP32 to FP16. During training, the gradients
usually have a very small magnitude (negative exponent). Since the exponent bits in FP16 have a representation range
of [-14,15], most of the positive exponent range is left unused while many small-magnitude gradients are rounded to
zero. To mitigate the zeroing of the gradients during the backward pass, the gradients are scaled up by a constant factor
to take advantage of the unused range of positive exponents. They are then scaled down before the weight updates to
preserve the same update magnitude as in the original FP32 model.
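The effect of loss scaling can be reproduced with Python's standard `struct` module, whose `'e'` format code rounds a value to IEEE 754 half precision. The gradient value and the scale factor of 1024 below are illustrative assumptions, not values taken from our training runs.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8           # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical constant scale factor

# Without scaling, the gradient is below the FP16 range and becomes zero.
unscaled_fp16 = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor, which moves
# the value into the representable FP16 range.
scaled_fp16 = to_fp16(grad * loss_scale)

# Unscale in higher precision before the weight update.
recovered = scaled_fp16 / loss_scale
```

Here `unscaled_fp16` is exactly zero, while `recovered` stays within about one percent of the original gradient, which is the behavior loss scaling is meant to provide.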
3 Methodology
We explain our training setup and distribution strategies used to train the BERT-large model below.
3.1 Datasets
3.1.1 Pre-training Datasets
Similarly to the original BERT paper [1], we used the Wikipedia Corpus [34] and BooksCorpus [2] datasets. The
Wikipedia Corpus has 2.5B words and BooksCorpus has 800M words. After extracting plain English text from those
two public datasets, we process the sentences exactly as in BERT [1]. Namely,
• tokenize the raw text through WordPiece tokenization [35]
• mask out 15% of the words in the input sentences for the model to learn the relationship within the sentence
• split and shuffle adjacent sentences with 50% probability for the model to do next sentence prediction
Table 1: Multi-node Hardware Setup for BERT-large Training
System Requirements                      Value
Node Count                               32
GPU Per Node                             8 (NVIDIA T4)
CPU Per Node                             32 (Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz)
Total CPU count                          1024
Total GPU count                          256
GPU-Interconnect                         PCIe 4.0 link with 64Gb/s bandwidth
Network Connection Between Nodes         10 Gb/s
Estimated Cost of Acquisition Per Node   $19.5K
Estimated Total Cost of Acquisition      $624K
Table 2: Multi-node Software Setup for BERT-large Training
Major Software Requirements    Version
Ubuntu                         18.04
CUDA                           10.0
cuDNN                          7.5.1
Python                         3.6
PyTorch                        1.1.0
APEX (A PyTorch Extension)     0.1
NCCL                           v2.4.8-1
MKL                            2019.4
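The masking step of the pre-processing described in Section 3.1.1 could be sketched as follows. This is a simplified illustration, not the actual pre-processing script: the function name `mask_tokens` is hypothetical, every selected position is replaced by a `[MASK]` token, and the finer replacement rules of the original BERT recipe are left out.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask roughly `mask_prob` of the tokens; return masked tokens and labels."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]   # the model must predict the original token
        masked[pos] = mask_token
    return masked, labels

# Hypothetical 20-token input sequence.
tokens = ["w%d" % i for i in range(20)]
masked, labels = mask_tokens(tokens)
```

The returned `labels` dictionary maps each masked position to its original token, which is what the masked language model objective is trained to recover.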
3.1.2 Fine-tuning Datasets
For fine-tuning, we picked the question-answering task trained on the Stanford Question Answering Dataset (SQuAD
1.0) [36]. This dataset contains 100k question-answer pairs from Wikipedia, in which a question and a passage are
provided as the training data and the corresponding answer is provided as the training label.
3.2 System Setup
We utilized data parallelism [28] for multi-node training. To characterize the hardware topology, we use the name
"<X>M<Y>G" to encode X number of "Machines" and "Y" number of "GPUs" on each machine. Table 1 shows our
32M8G hardware setup to conduct training. This setup has a unit cost of $19.5K per node, which is much less expensive
than the unit price of a single DGX system, which ranges from $150K to $400K [13]. The system also offers more
flexibility because it is more cost effective for light-weight computing needs, which do not utilize the interconnects as
heavily as is required in distributed training for some large models. For simplicity, let us consider a simple scenario of
a setup with 2 nodes and 4 GPUs each, as shown in Figure 1. In this scenario, each node has four NVIDIA T4 cards that are
connected through PCIe, and the nodes are connected to each other through network cards.
For the software stack, we used NVIDIA’s PyTorch implementation of the BERT model [11]. Table 2 shows the
software stack of our training program. We used NCCL v2.4.8-1 [31] as the distributed training backend
that implements data parallelism [28]. In addition, NCCL auto-detects the network topology when
setting up connections among nodes in order to form a ring topology if possible [31].
When training with data parallelism [28], each mini-batch of data is split across the individual devices. In the
backward pass, each device computes its local gradients. The gradients are exchanged through PCIe within a node and
through the network card across nodes, after which the parameters are updated based on the aggregated result. In
particular, the PCIe gradient exchange is independent of the network gradient exchange. Because each device holds a
full replica of the model, the gradients of the whole model on any device must be communicated to all other devices
during the backward pass. Given the bandwidths listed in Table 1 (64 Gb/s for PCIe versus 10 Gb/s for the network), the network gradient
transmission will lag behind the PCIe gradient transmission. We will see later that this is indeed the case:
the time needed to complete a backward pass is dominated by the gradient exchange time spent over the network.
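A rough back-of-the-envelope estimate shows why the network dominates. The sketch below uses the 340M parameter count of BERT-large and the link speeds from Table 1; the 2-byte (FP16) gradient size, the bandwidth-optimal ring all-reduce traffic formula, and the neglect of latency and protocol overhead are simplifying assumptions, so the numbers are only indicative.

```python
def allreduce_seconds(num_params, bytes_per_param, num_workers, link_gbps):
    """Idealized ring all-reduce time: 2 * (n - 1) / n of the payload per link."""
    payload_bits = num_params * bytes_per_param * 8
    traffic_bits = 2 * (num_workers - 1) / num_workers * payload_bits
    return traffic_bits / (link_gbps * 1e9)

BERT_LARGE_PARAMS = 340e6   # from Section 1
FP16_BYTES = 2              # assumption: gradients exchanged in half precision

# Two nodes over the 10 Gb/s network vs. eight GPUs over 64 Gb/s PCIe.
network_s = allreduce_seconds(BERT_LARGE_PARAMS, FP16_BYTES, 2, 10)
pcie_s = allreduce_seconds(BERT_LARGE_PARAMS, FP16_BYTES, 8, 64)
```

Under these assumptions a full gradient exchange costs roughly half a second over the network and well under a fifth of a second over PCIe, per update.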
Figure 1: PCIe based Data Parallelism
3.3 Pre-training
We trained BERT-large on the combined Wikipedia Corpus and BooksCorpus dataset. We also followed the two phase
training scheme when performing pre-training [1], where we used a sequence length of 128 to train the first 36 epochs
and a sequence length of 512 to train the last 4 epochs.
4 Optimization
To improve the single device throughput and fine-tune the training workload with respect to our distributed system
topology, we performed the following optimizations on BERT-large pre-training.
4.1 Data Sharding
Given the significant size of the raw data, 2.5B words from the Wikipedia Corpus [34] and 800M words from BooksCorpus [2],
file I/O became a bottleneck during data loading. With plain data parallelism [28], the data file is
first loaded into memory on each node and then split so that each device gets a portion of it.
However, with a dataset this large, loading and distributing the data incurs a long latency at the start of each training epoch.
Our benchmark shows that it takes 8 to 10 minutes to load and distribute the data on a single machine with
8 GPUs when the training program freshly starts. This latency drops to 3 to 5 minutes for the subsequent per-epoch
re-shuffling and re-distribution.
To address this I/O bottleneck, we shard the data before training based on the number of devices we have.
During training, each device reads only from the data shard assigned to it, which avoids the I/O congestion
caused by data transfers. Each processed shard is stored in the HDF5 file format [37] for flexible data storage and efficient
I/O during distributed training. The shards are evenly sized segments of the original dataset. This organization
of the data facilitates distributed training because data can be dispatched and shuffled efficiently to different nodes in
the network. After data sharding, our results show that the data loading time at the start of the training program
decreases to less than 2 minutes, and the time spent on re-loading at the beginning of each epoch is less than one
minute.
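The shard assignment itself is simple. The sketch below splits sample indices into one contiguous, evenly sized shard per device; the function name `shard_indices` and the shard file naming in the comment are hypothetical, and writing the shards to HDF5 is omitted so that the example needs only the standard library.

```python
def shard_indices(num_samples, num_shards):
    """Split sample indices into contiguous shards whose sizes differ by at most 1."""
    base, extra = divmod(num_samples, num_shards)
    shards, start = [], 0
    for rank in range(num_shards):
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards

# Toy example: 10 samples over 4 devices. In a real run each shard would be
# written once to its own file (e.g. a hypothetical "shard_<rank>.hdf5") and
# device `rank` would read only that file during training.
shards = shard_indices(10, 4)
shard_sizes = [len(s) for s in shards]
```

Because every device opens only its own shard, no node has to load and then split the full dataset at the start of an epoch.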
4.2 Automated Mixed Precision
By default, the BERT model stores its weights as 32-bit full-precision floating point numbers (FP32) during pre-training, which
means that training is carried out with 32-bit multiplication and accumulation compute units. We applied automated
mixed precision [15] to the pre-training of BERT-large. In other words, we convert the model to use the half-precision data
type whenever possible while keeping a full-precision (FP32) master replica of the model weights. Loss scaling [15] is
used to preserve small gradient values. Depending on the device, our empirical benchmark results (Table 4) show 1.7× to 2.5×
speedups after applying automated mixed precision to the pre-training of BERT-large.
Typically, not all operators in a computation graph are numerically safe in FP16. Operators that are considered
numerically dangerous are therefore computed in full precision. For example, an addition operator is marked as safe, while a
power or a log operator is considered numerically dangerous in half precision. Automated mixed precision handles the
categorization of the numerical safety level by rewriting the computation graph [16].
4.3 Kernel Fusion
A CUDA kernel is a compiled routine that runs on NVIDIA GPUs. These kernels are highly optimized to perform the matrix
algebra operations in the model. Although PyTorch uses Python as its front-end interface for users to build
deep learning applications, the front-end Python interpreter invokes compiled CUDA kernels to
perform training on GPU devices in order to speed up training. Generally, one kernel is provided for each operator in the front-end language. For
example, a sequence of operations such as a matrix-matrix multiplication followed by an element-wise tanh activation
would produce two CUDA kernels, one for each operator. Another example is the Gaussian Error
Linear Unit (GELU) [38], which is heavily used as an activation function in BERT [1]. The GELU function is approximated
by the following:
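$$\mathrm{GELU}(x) = 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$$
By replacing the constants with a, b and c (that is, $a = 0.5$, $b = \sqrt{2/\pi}$, $c = 0.044715$) we get:
$$\mathrm{GELU}(x) = ax\left(1 + \tanh\left(b\left(x + cx^{3}\right)\right)\right)$$
Without kernel fusion, the above equation will translate into 7 kernels as the following:
1. $f = x^{3}$
2. $f = c \cdot f$
3. $f = x + f$
4. $f = b \cdot f$
5. $f = \tanh(f) + 1$
6. $f = x \cdot f$
7. $f = a \cdot f$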
This is inefficient compared to a single fused CUDA kernel that combines all operators, because the fused kernel incurs less
kernel launch overhead and its repeated accesses to the same piece of data exploit memory locality better. Therefore, we applied
kernel fusion to layer normalization [39], the activation functions [38], and the optimizer [24] using the Apex utility
functions [40]. Based on our benchmarking results, the throughput improved by 8% to 11% on average
depending on the device.
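The equivalence between the seven-step sequence and the single fused expression can be checked in plain Python. This only illustrates the arithmetic: on a GPU each unfused step would be a separate kernel launch with its own pass over memory, whereas the fused form is evaluated in one pass. The function names are illustrative.

```python
import math

# Constants of the tanh-based GELU approximation.
A, B, C = 0.5, math.sqrt(2 / math.pi), 0.044715

def gelu_unfused(x):
    """Seven separate steps, each corresponding to one kernel launch."""
    f = x ** 3
    f = C * f
    f = x + f
    f = B * f
    f = math.tanh(f) + 1
    f = x * f
    f = A * f
    return f

def gelu_fused(x):
    """The same computation written as a single expression (one kernel)."""
    return A * x * (1 + math.tanh(B * (x + C * x ** 3)))

samples = [-2.0, -0.5, 0.0, 1.0, 3.0]
max_diff = max(abs(gelu_unfused(x) - gelu_fused(x)) for x in samples)
```

Both forms perform the same operations in the same order, so the difference `max_diff` is at the level of floating point rounding.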
4.4 Multiple Node Training
While the results of single-GPU optimization are promising, training BERT-large on a single GPU is still
practically infeasible. Table 3 justifies the need for multi-node training: it would take years to train BERT-large in
a single-GPU setting. We thus exploit data parallelism to scale to multiple GPUs in a multi-node context. During
training, each GPU has a complete copy of the model and is provided with a different batch of input training data.
Each GPU worker first computes the gradients with respect to its own input; the gradients are then exchanged
and accumulated across workers through NCCL [31]. Since both data loading and weight sharing
consume communication bandwidth of the interconnects, careful scheduling and allocation of the communication
bandwidth is required to minimize congestion and maximize GPU utilization. We therefore perform data loading
via the PCIe channels and weight sharing via the network, which minimizes competition for resources. To improve
GPU utilization in the presence of a non-trivial communication workload, we overlap the gradient computation with
communication, as illustrated in Figure 2. During the backward pass, the gradients are exchanged as soon as the amount
of available gradients exceeds a certain size threshold, so back-propagation and weight exchange proceed in
parallel in a pipelined fashion.
Figure 2: Timeline Comparison Between Non-overlapping/Overlapping Communication with Computation
Table 3: Single GPU Pre-training Time Estimation
Device                 Optimized Throughput (tokens/s)   Tokens/Epoch      Estimated Time Per Epoch   40-Epoch Time
P100 [41]              3228.8                            16752.7 Million   1441.6 hours (60 days)     2400 days
T4 (TensorCore) [17]   5429.1                            16752.7 Million   857.1 hours (36 days)      1440 days
2080Ti [42]            10765.8                           16752.7 Million   432.3 hours (18 days)      720 days
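The estimates in Table 3 follow directly from dividing the number of tokens per epoch by the measured throughput. The short sketch below redoes this arithmetic with the values from the table; small differences from the printed numbers come from rounding, since the table rounds the per-epoch days before multiplying by 40.

```python
TOKENS_PER_EPOCH = 16752.7e6   # from Table 3
EPOCHS = 40

# Optimized single-GPU throughput in tokens per second, from Table 3.
throughput = {"P100": 3228.8, "T4": 5429.1, "2080Ti": 10765.8}

def epoch_hours(tokens_per_second):
    """Hours needed to process one epoch at the given throughput."""
    return TOKENS_PER_EPOCH / tokens_per_second / 3600

estimates = {
    device: {
        "epoch_hours": epoch_hours(tps),
        "total_days": epoch_hours(tps) * EPOCHS / 24,
    }
    for device, tps in throughput.items()
}
```

Even on the fastest of the three devices, the estimate for 40 epochs is roughly two years of single-GPU time.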
There are two types of communication in the above process: intra-node communication through
PCIe and inter-node communication through the network. As our environment has only 10 Gbps of network bandwidth,
inter-node communication quickly became the bottleneck for multi-node training. Our benchmarking shows that, even after
overlapping communication with computation, in a simple scenario of training on 2 nodes with one
GPU each, the time spent at the synchronization barrier is comparable to that of the forward pass, the backward pass, and
the weight update combined. Figure 3 further illustrates this observation. In this figure, the X-axis denotes the
hardware configuration as we scale up the computation resources for training. We use the name "<X>M<Y>G" to
encode X number of "Machines" and "Y" number of "GPUs" on each machine. For inter-node scaling, i.e., increasing
"X", there is nearly zero throughput gain going from 1M1G to 2M1G, as almost half of the time is
spent on communication rather than computation. In addition, Figure 3 shows that the weak scaling
efficiency of inter-node scaling is upper bounded by 38% in practice, which is significantly less efficient than the intra-node scaling alternative.
Figure 3: Weak Scaling Comparison Between Intra-node Scaling and Inter-node
Figure 4: Gradient Memory Profile
With network bandwidth being the hard limit of the hardware, reducing the amount of data each node needs to transmit
is one way to maintain scalability. Prior work [43] proposed gradient sparsification to reduce the size of the gradients.
However, sparsification is effective primarily when the gradient data is sparse, which is unfavorable for our model. As
we show in Figure 4, the majority of the gradients are in the attention, intermediate, and output layers, which are mostly
fully-connected layers that produce dense gradients. In addition, picking the right sparsification threshold requires
not only extra computation but also a lot of tuning work: if the threshold is too low, the gradient size is reduced
only slightly; if the threshold is too high, we risk affecting training convergence because
some non-negligible gradient signals are skipped.
To address the aforementioned communication bottleneck, we apply gradient accumulation. Gradient accumulation is the
process of adding up the loss and gradients in each local worker over multiple mini-batch iterations and updating the
weights of the model globally once every several iterations. As Figure 5 illustrates, gradient accumulation essentially
reduces the ratio between communication workload and
computation workload. Since our hardware setup is limited by network bandwidth, a properly tuned number of gradient accumulation
steps can effectively balance the computation and communication time and therefore increase the overall compute
utilization rate. Note that gradient accumulation also effectively increases the batch size of the training, so other
hyper-parameters need to be adjusted accordingly.
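The reduction in communication can be seen in a small simulation. In the sketch below, `fake_all_reduce` is a hypothetical stand-in for the real gradient exchange that only counts how often it is called, and the scalar "gradients" are made-up values; the point is that with k accumulation steps the number of exchanges drops by a factor of k while the summed gradient signal is unchanged.

```python
def run_iterations(micro_batch_grads, accumulation_steps, all_reduce):
    """Sum local gradients and synchronize once every `accumulation_steps` iterations."""
    accumulated = 0.0
    synced_updates = []
    for iteration, grad in enumerate(micro_batch_grads, start=1):
        accumulated += grad                     # local sum, no network traffic
        if iteration % accumulation_steps == 0:
            synced_updates.append(all_reduce(accumulated))  # one exchange
            accumulated = 0.0
    return synced_updates

def fake_all_reduce(value):
    """Stand-in for the network exchange; counts the number of calls."""
    fake_all_reduce.calls += 1
    return value

# Hypothetical per-iteration local gradients of a single worker.
local_grads = [0.5, 0.25, 0.125, 0.125, 1.0, 1.0, 1.0, 1.0]

fake_all_reduce.calls = 0
updates_k4 = run_iterations(local_grads, 4, fake_all_reduce)
calls_k4 = fake_all_reduce.calls

fake_all_reduce.calls = 0
updates_k1 = run_iterations(local_grads, 1, fake_all_reduce)
calls_k1 = fake_all_reduce.calls
```

With four accumulation steps the eight iterations trigger two exchanges instead of eight, at the cost of a four times larger effective batch per update.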
Figure 5: CUDA Stream Timeline for Gradient Accumulated Training
5 Evaluation
We describe our performance improvements for the different layers of optimizations below.
5.1 Single GPU Optimization
For single-GPU optimization, Tables 4 and 5 summarize the throughput gains from using FP16 and kernel fusion. Using
FP16 improves the throughput by 1.7× on NVIDIA P100 and 2.5× on NVIDIA 2080 Ti. FP16 is more
effective on GPUs equipped with TensorCores, since the TensorCores are used only for FP16 operations. Kernel fusion further
enhances the single-GPU throughput by around 1.2× on all three devices. Since these optimization techniques can be
applied independently, combining both produces a final speedup of at least 2.05× on NVIDIA P100, 2.78× on
NVIDIA T4, and 3.05× on NVIDIA 2080 Ti.
Table 4: Throughput Comparison (Tokens/s)
Device                     Non-Optimized   FP16     FP16 & Fused Kernel   Seq Length
P100 [41]                  1576.3          2680.7   3228.8                128
T4 (TensorCore) [17]       1953.5          4430.9   5429.1                128
2080Ti (TensorCore) [42]   3527.2          8823.8   10765.8               128
Table 5: Throughput Speedups (using non-optimized baseline)
Device                     Non-Optimized   FP16   FP16 & Fused Kernel
P100 [41]                  1               1.7    2.05
T4 (TensorCore) [17]       1               2.27   2.78
2080Ti (TensorCore) [42]   1               2.5    3.05
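The speedups in Table 5 are the ratios of the throughputs in Table 4 to the non-optimized baseline. The sketch below recomputes them from the Table 4 values, including the additional gain from kernel fusion alone.

```python
# (non-optimized, FP16, FP16 & fused kernel) throughput in tokens/s, from Table 4.
table4 = {
    "P100": (1576.3, 2680.7, 3228.8),
    "T4": (1953.5, 4430.9, 5429.1),
    "2080Ti": (3527.2, 8823.8, 10765.8),
}

speedups = {}
for device, (baseline, fp16, fused) in table4.items():
    speedups[device] = {
        "fp16": fp16 / baseline,        # Table 5, FP16 column
        "fp16_fused": fused / baseline, # Table 5, FP16 & Fused Kernel column
        "fusion_only": fused / fp16,    # extra gain from kernel fusion
    }
```

Rounded to two decimals, the first two ratios reproduce Table 5, and the fusion-only ratio is about 1.2 on all three devices.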
5.2 Multi-Node Optimization
We trained BERT-large with 32 machine-nodes, each equipped with 8 NVIDIA T4 GPUs [17]. This amounts to 256
GPUs in total. We applied gradient accumulation for 4 steps to reduce network traffic. Individual GPU workers sum up
the gradients from 4 different mini-batches before exchanging and updating the model parameter among all the workers.
Combining the reduction of network traffic from performing gradient accumulation with our optimization work on
single GPU, we are able to achieve a weak scaling factor of 165 times with 10 Gbps network bandwidth. As we show in
Figure 6, the scaling efficiency decreases as we continue to increase the number of machines as communication and
synchronization overhead dominates the training time.
Figure 7 shows the loss curves of the two-phase training, and Table 6 lists the differences in training configurations between the
two pre-training phases. We had some convergence issues in phase 2: as Figure 7 illustrates, the training loss plateaus
after one epoch of training and, starting from the second epoch, the loss spikes at the very end of each epoch and
decreases afterwards. In phase 1, the loss value at the end of the last epoch is about 1.2. In phase 2, the average loss value in the
final epoch is around 1.3.
5.3 Pre-training and Fine-tuning Results
We evaluated our pre-trained model by fine-tuning it on the SQuAD v1.1 dataset using the same
fine-tuning configuration as [1]. Our model achieved F1 scores of 81% to 83%, depending on the pre-trained
checkpoint loaded. Compared with Google’s 90.9% [1] and NVIDIA’s 90% to 91% [11], there is a discrepancy of 9%–10%.
Figure 6: Multi-node Throughput Scaling
Figure 7: BERT-large Phase 1 (left) & Phase 2 (right) Pre-training Loss Plot
However, we believe that this discrepancy is not caused by the system performance optimizations that we
applied to shorten the pre-training time of BERT-large. To illustrate this, we ran two sample pre-training
runs for each phase, one with the performance optimizations and the other without any of them.
As Figure 8 shows, in phase 1 the two loss curves are highly similar, and in phase 2 the optimized loss curve
looks even more stable than the non-optimized one. This suggests that the convergence issue that we saw in phase 2 might
be caused by incorrect hyper-parameter settings. However, fine-tuning these hyper-parameters requires an excessive
amount of computing resources and time. Note that we trained for two extra epochs in phase 2 to reach the desired loss
value because of the convergence problem, which makes our total training time 13 days. With an ideal hyper-parameter setting,
phase 2 training should complete in 4 epochs, and the total training time can be further shortened to within 12
days.
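The 13-day and 12-day figures can be reproduced from the per-epoch times reported in Table 6. The sketch below is only this arithmetic; it assumes that every epoch of a phase takes the epoch time listed in the table.

```python
# Epoch counts and per-epoch times from Table 6.
PHASE1_EPOCHS, PHASE1_HOURS_PER_EPOCH = 36, 6
PHASE2_HOURS_PER_EPOCH = 16

def total_days(phase2_epochs):
    """Total pre-training time in days for a given number of phase 2 epochs."""
    hours = (PHASE1_EPOCHS * PHASE1_HOURS_PER_EPOCH
             + phase2_epochs * PHASE2_HOURS_PER_EPOCH)
    return hours / 24

actual_days = total_days(6)   # with the two extra phase 2 epochs
ideal_days = total_days(4)    # with the planned 4 phase 2 epochs
```

Phase 1 alone accounts for 9 days; the six phase 2 epochs add 4 more days, while four epochs would add under 3 days.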
6 Conclusion & Future Work
We have successfully completed the pre-training of BERT-large in 12 days with a lower hardware capital budget
than most of the published results by major software/hardware companies [1, 20]. For example, our capital cost for
the experiment is estimated to be $624K for a total of 32 nodes, whereas the DGX workstations used for NVIDIA's
pre-training of BERT cost around $4.8M to $12.8M. Owning the hardware is also more cost-efficient than
renting it through major cloud service providers. For example, the cost of 256 T4 GPUs on Google Cloud for 12 days
Table 6: Two Phase Pre-training Comparison (per GPU)

| Phase | Sentences (S) | Length/S | Predictions/S | Batch Size | Learning Rate | Epochs | Epoch Time |
|---|---|---|---|---|---|---|---|
| Phase 1 | 32 | 128 | 20 | 4096 | 1e-4 | 36 | 6 hours |
| Phase 2 | 4 | 512 | 80 | 2048 | 1e-4 | 6 | 16 hours |
Figure 8: BERT-large Phase 1 (left) & Phase 2 (right) Optimized vs. Non-optimized Training Loss Comparison
is estimated to be about $25.8K (Table 7 in the Appendices), which is about 24 times less than the price of owning the
hardware. However, the average replacement cycle for the hardware is about 3 years, which is enough time for
roughly 90 such 12-day experiments.
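The break-even point between owning and renting can be checked with a short calculation. The sketch below uses the $624K capital cost and the Table 7 cloud estimate. It assumes a 3-year (1095-day) replacement cycle with the cluster running 12-day experiments back to back, and it ignores power, cooling, and maintenance costs, which we did not estimate. All names are illustrative.

```python
import math

HARDWARE_COST_USD = 624_000        # capital cost of the 32-node cluster
CLOUD_COST_PER_RUN_USD = 25_804.8  # 256 T4 GPUs for 12 days (Table 7)
RUN_DAYS = 12
REPLACEMENT_CYCLE_DAYS = 3 * 365   # assumption: 3-year hardware lifetime

# Number of 12-day runs after which owning becomes cheaper than renting.
break_even_runs = math.ceil(HARDWARE_COST_USD / CLOUD_COST_PER_RUN_USD)

# Number of back-to-back runs that fit in one replacement cycle.
runs_per_cycle = REPLACEMENT_CYCLE_DAYS // RUN_DAYS

# Hardware cost amortized over every run in the cycle (capital cost only).
owned_cost_per_run = HARDWARE_COST_USD / runs_per_cycle

print(f"Break-even after {break_even_runs} runs")
print(f"Runs per replacement cycle: {runs_per_cycle}")
print(f"Amortized hardware cost per run: ${owned_cost_per_run:,.0f}")
```

Under these assumptions, owning pays off only if the cluster is used for more than about a quarter of its lifetime; a lightly used cluster would be cheaper to rent.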
Acknowledgements
We want to thank the Vector Institute NLP Project industry and technical staff participants who gave feedback. We
offer special thanks to Dr. Garth Gibson and Dr. Elham Dolatabadi from Vector Institute for their guidance and support
throughout this work, and to Punendu Mukherjee and Thor Johnsen from NVIDIA and Fillippo Pompilli from Thomson
Reuters for their help during the preliminary phase of the experiments.
This work was supported in part by the NSERC Discovery grant, the Canada Foundation for Innovation JELF grant, the
Connaught Fund, the Huawei grants, the Province of Ontario, the Government of Canada through CIFAR AI Chair
award, and sponsors of the Vector Institute (www.vectorinstitute.ai/#partners).
References
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional
transformers for language understanding. CoRR, abs/1810.04805, 2018.
[2] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja
Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,
2015.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and
Illia Polosukhin. Attention Is All You Need. arXiv e-prints, page arXiv:1706.03762, June 2017.
[4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim
Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. Google’s Neural Machine Translation System: Bridging the Gap between Human and
Machine Translation. arXiv e-prints, page arXiv:1609.08144, September 2016.
[6] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. arXiv
e-prints, page arXiv:1409.3215, September 2014.
[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to
Align and Translate. arXiv e-prints, page arXiv:1409.0473, September 2014.
[8] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. Massive Exploration of Neural Machine Translation
Architectures. arXiv e-prints, page arXiv:1703.03906, March 2017.
[9] Bojian Zheng, Abhishek Tiwari, Nandita Vijaykumar, and Gennady Pekhimenko. Echo: Compiler-based GPU
Memory Footprint Reduction for LSTM RNN Training. arXiv e-prints, page arXiv:1805.08899, May 2018.
[10] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in
nlp. arXiv preprint arXiv:1906.02243, 2019.
[11] NVIDIA. Deep learning example. https://github.com/NVIDIA/DeepLearningExamples, 2019. Accessed:
2020-05-16.
[12] NVIDIA. Nvidia dgx-1 with tesla v100 system architecture. https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf. Accessed: 2020-07-10.
[13] Ian Cutress. Nvidia’s dgx-2: Sixteen tesla v100s, 30 tb of nvme, only $400k. Accessed: 2020-05-10.
[14] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. NVIDIA tensor core
programmability, performance & precision. CoRR, abs/1803.04014, 2018.
[15] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg,
Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. arXiv e-prints,
page arXiv:1710.03740, October 2017.
[16] Matt Conley, Minmin Sun, and Yan Jun. Mixed precision grappler optimizer. https://github.com/tensorflow/tensorflow/pull/26342, 2019. Accessed: 2020-05-16.
[17] NVIDIA. Nvidia t4. https://www.nvidia.com/en-us/data-center/tesla-t4/. Accessed: 2020-05-10.
[18] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are
unsupervised multitask learners. 2019.
[19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet:
Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019.
[20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR,
abs/1907.11692, 2019.
[21] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[22] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin
Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim
Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar,
Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay
Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki,
Cliff Young, and Matei Zaharia. MLPerf Training Benchmark. arXiv e-prints, page arXiv:1910.01500, October
2019.
[23] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew
Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR,
abs/1706.02677, 2017.
[24] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT
pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR,
abs/1512.03385, 2015.
[26] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen,
Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse H. Engel, Linxi Fan, Christopher Fougner,
Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil
Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian
Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2: End-to-end speech
recognition in english and mandarin. CoRR, abs/1512.02595, 2015.
[27] Dario Amodei, Danny Hernandez, Girish Sastry, Jack Clark, Greg Brockman, and Ilya Sutskever. Ai and compute.
https://openai.com/blog/ai-and-compute/, 2018. Accessed: 2020-05-16.
[28] Christopher J. Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E.
Dahl. Measuring the effects of data parallelism on neural network training. CoRR, abs/1811.03600, 2018.
[29] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
[30] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen.
Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018.
[31] Nathan Luehr. Fast multi-gpu collectives with nccl, 2016.
[32] Nathan Luehr. Nccl tests. https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#bus-bandwidth, 2018. Accessed: 2020-05-10.
[33] NVIDIA. Nvidia tesla v100 gpu architecture., 2017.
[34] Giuseppe Attardi. Wikiextractor. https://github.com/attardi/wikiextractor, 2020.
[35] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim
Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. Google’s Neural Machine Translation System: Bridging the Gap between Human and
Machine Translation. arXiv e-prints, page arXiv:1609.08144, September 2016.
[36] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine
Comprehension of Text. arXiv e-prints, page arXiv:1606.05250, June 2016.
[37] The HDF Group. High level introduction to hdf5. https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf, 2016. Accessed: 2020-05-16.
[38] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear
units. CoRR, abs/1606.08415, 2016.
[39] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv e-prints, page
arXiv:1607.06450, July 2016.
[40] NVIDIA. Apex (A PyTorch Extension).
[41] NVIDIA. Nvidia p100. https://www.nvidia.com/en-us/data-center/tesla-p100/. Accessed: 2020-05-10.
[42] NVIDIA. Nvidia rtx2080ti. https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/.
Accessed: 2020-05-10.
[43] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient
distributed optimization. CoRR, abs/1710.09854, 2017.
[44] Google. Gpus pricing. https://cloud.google.com/compute/gpus-pricing. Accessed: 2020-05-10.
Appendices
Training Cost Estimation
Table 7: Google Cloud Price Estimation

| Devices | Number of Devices | Price/hour per Device (USD) | Training Time | Total Cost (USD) |
|---|---|---|---|---|
| NVIDIA T4 | 256 | $0.35 [44] | 12 days | $25,804.80 |
Table 8: NVIDIA DGX Cluster Price Estimation

| Devices | Number of Devices | Price per Device (USD) | Total Cost (USD) |
|---|---|---|---|
| NVIDIA DGX1 | 32 | $149,000 [13] | $4,768,000 |
| NVIDIA DGX2 | 32 | $399,000 [13] | $12,768,000 |
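The totals in Tables 7 and 8 are simple products of the listed quantities. The sketch below recomputes them as a sanity check. The helper names are hypothetical, and the cloud estimate assumes that the per-hour price applies to each GPU and that all 256 GPUs are billed for the full 12 days.

```python
HOURS_PER_DAY = 24


def cloud_cost(num_gpus: int, price_per_gpu_hour: float, days: int) -> float:
    """Rental cost in USD when every GPU is billed for the whole period."""
    return num_gpus * price_per_gpu_hour * days * HOURS_PER_DAY


def cluster_cost(num_machines: int, unit_price: int) -> int:
    """Purchase cost in USD for a cluster of identical machines."""
    return num_machines * unit_price


t4_cloud_total = cloud_cost(num_gpus=256, price_per_gpu_hour=0.35, days=12)
dgx1_total = cluster_cost(num_machines=32, unit_price=149_000)
dgx2_total = cluster_cost(num_machines=32, unit_price=399_000)

print(f"256 x T4 on Google Cloud, 12 days: ${t4_cloud_total:,.2f}")
print(f"32 x DGX1: ${dgx1_total:,}")
print(f"32 x DGX2: ${dgx2_total:,}")
```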
