Performance Analysis of Deep Learning Workloads on Leading-edge Systems by Ren, Yihui et al.
Performance Analysis of Deep Learning Workloads on
Leading-edge Systems
Yihui Ren
Shinjae Yoo
Adolfy Hoisie
yren@bnl.gov
sjyoo@bnl.gov
ahoisie@bnl.gov
Brookhaven National Laboratory
P.O. Box 5000
Upton, New York 11793
ABSTRACT
This work examines the performance of leading-edge systems designed
for machine learning computing, including the NVIDIA DGX-2, Amazon
Web Services (AWS) P3, IBM Power System Accelerated Compute Server
AC922, and a consumer-grade Exxact TensorEX TS4 GPU server.
Representative deep learning workloads from the fields of computer
vision and natural language processing are the focus of the analysis.
Performance is analyzed along a number of important dimensions.
Performance of the communication interconnects and of large and
high-throughput deep learning models is considered. Different
potential use models for the systems, standalone and in the cloud,
also are examined. The effect of various optimizations of the deep
learning models and system configurations is included in the analysis.
CCS CONCEPTS
• General and reference → Performance; • Hardware → Testing with
distributed and parallel systems; • Networks → Network performance
analysis; • Computer methodologies → Neural networks;
KEYWORDS
Performance Analysis, Deep Learning, HPC, DGX-2, GPU
ACM Reference format:
Yihui Ren, Shinjae Yoo, and Adolfy Hoisie. 2016. Performance Analysis of
Deep Learning Workloads on Leading-edge Systems. In Proceedings of ACM
Conference, Washington, DC, USA, July 2017 (Conference’17), 10 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for prot or commercial advantage and that copies bear this notice and the full citation
on the rst page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permied. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specic permission and/or a
fee. Request permissions from permissions@acm.org.
Conference’17, Washington, DC, USA
© 2016 ACM. 978-x-xxxx-xxxx-x/YY/MM. . .$15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
The growth of machine learning and deep learning (DL) extends
across all data analytical application areas, impacting many
disciplines and markets. Hence, their practical use potential appears
exponential and seemingly unbounded. In turn, the ever-insatiable
need for computing resources for these workloads has led to the
development of computer architectures and systems designed to
improve machine learning performance [1, 4, 10, 14, 16, 25, 26]. As
the presently preferred architectures for machine learning
application workloads, GPU-based systems are an important exemplar in
this category.
This work evaluates the performance of two important types of
DL algorithms on four leading-edge GPU-based systems. Specifically,
we consider convolutional neural network (CNN) algorithms, such as
AlexNet and ResNet, which are mostly used in computer vision, and
attention-mechanism-based algorithms for natural language
processing (BERT) on the NVIDIA DGX-1 and DGX-2, IBM Power
System AC922, and Exxact TensorEX TS4. Moreover, we analyze a
cloud-based Amazon Web Services (AWS) P3dn use mode for the
DGX-1 and compare DL performance against standalone use for
the other systems considered.
GPU-based systems are especially well suited for DL workloads,
as proven in practice and in scientific publications [3, 4, 22]. Briefly,
this stems from three properties: the single-instruction multiple-data
(SIMD) nature and arithmetic intensity of the algorithms map well to
the floating-point operations per second (FLOPS) available on GPUs;
large amounts of high-bandwidth memory allow data access at fast
rates and low latency; and high-speed interconnects afford
communication at high bandwidth with minimal contention. The first
three examples of leading-edge systems considered herein use the
NVIDIA Tesla V100 GPU with different topologies of the NVLink
interconnect. The Exxact TS4 is configured with the consumer-grade
GeForce RTX 2080 Ti GPU, which is popular among AI researchers,
developers, and hobbyists. Section 2.1 describes the systems and
their key architectural characteristics in more detail.
Section 3 details how DL models considered are trained, the
fundamental arithmetic operations involved during training, and
their effects on different hardware systems. Specifically, Section 3.2
dissects CNN models for computer vision, while Section 3.3
explores the state-of-the-art Bidirectional Encoder Representations
arXiv:1905.08764v1 [cs.PF] 21 May 2019
from Transformers (BERT) model for natural language processing
(NLP) [6].
The detailed performance analysis is done along a few important
dimensions. Section 4.1 presents the performance of key global
communication kernels used in the benchmarks considered. Section 4.2
discusses performance and scalability of large and high-throughput
DL models. Section 4.3 compares performance when the benchmarks
are expressed in an easy-to-code multi-GPU architecture
enabled by system software described in Section 2.2.
2 ENVIRONMENT
2.1 Hardware Environment
As part of this work, the following systems were evaluated:
NVIDIA DGX-1V and DGX-2 (DGX-2), IBM Power System AC922
(IBM-P9), AWS P3dn (AWS P3), and Exxact TensorEX TS4 (RTX).
Henceforth, the systems are referenced using the abbreviations
noted in parentheses. For added convenience, a consistent color
scheme and geometric shape are maintained for each system
represented in figures throughout this work (green diamond,
DGX-2; blue square, IBM-P9; orange triangle, AWS P3; red circle,
RTX). Of note, the AWS P3 essentially is a DGX-1V, as shown in the
communication bandwidth test depicted in Section 4.1.
Before delving into the details of each system, we first introduce
the key architectural component: the NVIDIA Tesla V100 GPU.
Tesla V100. e Tesla V100 GPU [17] is a building block for
three of the four systems under consideration. e V100 GPU has
640 Tensor cores and 5,120 CUDA cores with 32 GB (or 16 GB)
HBM2 GPU memory (900 GB/s bandwidth). It can achieve 15.7
TFLOPS for single-precision performance. For direct inter-device
(GPU-to-GPU) communication, the V100 has six NVLink-2.0 fabric
supporting 25 GB/s per link, per data direction. erefore, each
V100 has the ability to communicate with other GPU devices at
150 GB/s unidirectional (or 300 GB/s bidirectional) bandwidth. e
high bandwidth of inter-node communication is crucial for training
deep neural network models across multiple devices.
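As a quick check of the arithmetic above, the following minimal sketch derives the aggregate NVLink bandwidth of one V100 from the per-link figures quoted in the text. The function and constant names are illustrative only and do not belong to any NVIDIA API.

```python
# Aggregate NVLink bandwidth of one Tesla V100, using the figures quoted above.
NVLINK_PORTS = 6         # NVLink 2.0 links per V100
GB_PER_S_PER_LINK = 25   # bandwidth per link, per direction


def aggregate_bandwidth(ports, per_link_gb_s):
    """Return (unidirectional, bidirectional) bandwidth in GB/s."""
    unidirectional = ports * per_link_gb_s
    return unidirectional, 2 * unidirectional


uni, bi = aggregate_bandwidth(NVLINK_PORTS, GB_PER_S_PER_LINK)
print(f"unidirectional: {uni} GB/s, bidirectional: {bi} GB/s")
```

The same helper applies to the partial link counts that appear later, for example the three links between two GPUs on the IBM-P9.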
DGX-2. e bulk of the DGX-2’s computation capacity is from 16
V100 (32 GB) GPUs evenly distributed on two baseboards and con-
nected via 12 on-node switches, or NVSwitch [18]. Each NVSwitch
has 18 NVLink ports (16 in use) and supports 900 GB/s bidirectional
peak bandwidth. Eight NVLink ports are connected to dierent
GPU devices (one per link) on the same baseboard, whereas the
other eight NVLink ports are connected to the matching NVSwith
ports on the other baseboard (Figure 1a). is network connectiv-
ity aords communications at a bandwidth of up to 150 GB/s per
direction. Any two V100 GPUs can establish full bandwidth (up to
150 GB/s per direction) communication using all six NVLink ports.
e specic DGX-2 tested in this work has two hyper-threaded
24-core Intel Xeon 8168 CPUs (96 logic cores in total) with base
frequency of 2.7 GHz, 1.5 TB system memory, and 30 TB NVMe
SSD in eight-way RAID0.
AWS P3. AWS’ P3dn.24xlarge instance is similar to the NVIDIA
DGX-1V system [16] and is equipped with eight Tesla V100 (32
GB) GPUs connected in a hybrid cube-mesh topology (Figure 1b).
The hybrid cube-mesh topology leads to each node having four
immediate neighbors. This is a legacy design following the previous
Figure 1: GPU-to-GPU Communication Topology. (a) DGX-2 NVSwitch
Crossbar. (b) DGX-1V and AWS P3 Hybrid Cube-Mesh Topology. (c) IBM
AC922 Model 8335-GTH NVLink-enabled POWER9 CPU. Each Tesla V100 GPU
has six NVLink ports with unidirectional communication bandwidth of
25 GB/s per port. Numerically labeled boxes represent different GPU
devices. The six NVLinks from device-0 are colored differently.
DGX-1P system, where the Tesla P100 GPU featured only four
NVLink ports. Two of the four neighbors are connected via two
links each, while the other two are connected via only one. To connect
two P3 systems, AWS provides network connection bandwidth of up
to 100 Gbits/s. The caveat is that this limit can be reached only for
multi-flow connections. The single-flow bandwidth is 10 Gbits/s
(1.25 GB/s). The specific AWS P3 systems tested in this effort have
two hyper-threaded 24-core Intel Xeon 8175M CPUs (96 logical cores
in total) with a base frequency of 2.5 GHz, 768 GB of system memory,
and 2 TB of ephemeral NVMe SSD. Section 4.1 shows that the NVIDIA
DGX-1V system is analogous to the AWS P3. Thus, we include only
the results for the AWS P3.
IBM-P9. e IBM Power System AC922 [2] (Model 8335-GTH)
server tested is equipped with four Tesla V100 (32 GB) GPUs (Fig-
ure 1c). e tested AC922 server has two IBM POWER9 hyper-
threaded 20-core CPUs (160 logic cores in total) with base frequency
of 2.3 GHz and max frequency of 3.8 GHz. IBM’s POWER9 CPU
is NVLink-enabled. Each CPU has six direct NVLink connections
to GPUs (three per GPU), enabling a 75 GB/s unidirectional com-
munication bandwidth to each GPU. In addition, there are three
NVLink fabrics connecting two GPUs directly. If the GPUs are not
connected to the same CPU, communications must route through
the inter-CPU symmetric multiprocessing (SMP) cable with unidi-
rectional bandwidth of 32 GB/s. e POWER9 CPU connects to
the system main memory with accumulated (eight channels) unidi-
rectional bandwidth of 60 GB/s. e tested system has four nodes,
connected via high-bandwidth (24 GB/s unidirectional) InniBand.
All of the nodes use IBM General Parallel File System (GPFS) with
block size of 16 MB and bandwidth of approximately 18 GB/s.
RTX. e Exxact TensorEX 4U server (TS4-1598415-DPN) is
equipped with eight NVIDIA consumer-grade GeForce RTX 2080
Ti GPUs [19]. Each RTX 2080 Ti GPU has 4352 CUDA cores and
11 GB GDDR6 GPU memory with 616 GB/s memory bandwidth. It
can reach a peak performance of 13.4 TFLOPS for single-precision
performance, or about 85.4% of the V100 GPU’s peak performance.
e specic server tested in this work has two hyper-threaded
12-core Intel Xeon 4116 CPUs (48 logic cores in total) with base
frequency of 2.1 GHz. All eight GPUs are connected via a PCIe bus.
Compared to other high-end V100 GPU-based solutions, the RTX
GPU cards are a unique feature for this system. As such, we refer
to this system as RTX.
2.2 Soware Environment
Because of its popularity among AI researchers, its well-designed
user interface, and native support for NVIDIA communication and
computation backend kernels and MPI, we use the PyTorch DL plat-
form. To maintain a consistent and reproducible soware environ-
ment, we use docker containers, which also alleviate the diculty
in migrating the DL models to other hardware systems and elimi-
nate the performance dierences introduced by distinct soware
environments. For the x86 architecture (Intel Xeon CPU) systems,
including DGX-1, DGX-2, AWS P3, and RTX, we use the NVIDIA
ocial PyTorch docker image (NVCR)1 as the base soware envi-
ronment. For the ppc64le architecture (IBM POWER9 CPU) system,
IBM-P9, we use the PowerAI v1.6 [7].
To ensure our work is reproducible, Table 1 lists the
exact library versions of the NVIDIA Docker image and PowerAI v1.6.
The NVIDIA CUDA library is a programming interface to NVIDIA
GPUs for parallel computing, while NVIDIA’s cuDNN (deep neural
network) library provides device-level optimized,
neural-network-related backend kernels. The NVIDIA NCCL (collective
communication) library provides a multi-GPU communication interface,
supporting several communication means, such as NVLink, PCIe,
and Ethernet.
¹ nvcr.io/nvidia/pytorch:18.11-py3
Table 1: Soware Environment
Library NVIDIA NVCR IBM PowerAI
PyTorch 1.0.0a0 1.0.1
CUDA 10.0.130 10.1.105
cuDNN 7.401 7.5
NCCL 2.307 2.402
3 DEEP LEARNING MODELS
3.1 Data Movement and Communication
Between Devices
Deep learning is a data-driven modeling approach. The training
process, known as stochastic gradient descent, consists of numerous
iterations of feeding data to the model and adjusting the model
parameters to reduce the predefined loss. At each iteration, a batch
of data is selected at random (without replacement). The data are
loaded from the hard drive to the host memory, and, sometimes,
data-augmentation preprocessing procedures, such as randomly
flipping images or adjusting image sizes, are applied using CPU
threads. Then, the preprocessed batch is sent to the GPU memory via
the PCIe bus.
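The sampling step described above can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: `epoch_batches` and `random_flip` are hypothetical helper names, and a list of integers stands in for a row of image pixels. It is not the data loader used in the benchmarks.

```python
import random


def epoch_batches(num_samples, batch_size, seed=0):
    """Yield index batches; every index is drawn exactly once per epoch."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)  # random order without replacement
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


def random_flip(row, rng):
    """Toy CPU-side augmentation: mirror a row of pixels half of the time."""
    return row[::-1] if rng.random() < 0.5 else row


batches = list(epoch_batches(num_samples=10, batch_size=4))
rng = random.Random(1)
augmented = random_flip([1, 2, 3], rng)
```

Shuffling the index list once per epoch and slicing it is one simple way to obtain random batches without replacement; the last batch is smaller when the data set size is not a multiple of the batch size.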
The bulk of actual computation usually is done on one or multiple
GPUs. In the multiple-GPU case, the execution is done in a SIMD
fashion: each GPU has an exact replica of the neural network
model and applies the same operations to different sampled data
batches. In the ideal case, the throughput would grow linearly
with the number of GPUs. At the end of every iteration, all of
the model replicas require synchronization. This synchronization
is done by a collective communication using NCCL. Most of the
results in this work use the NCCL all-reduce kernel. Therefore, the
two major factors affecting the time cost of communication are:
1) the inter-device communication bandwidth and 2) the number of
model parameters.
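To make the synchronization step concrete, the sketch below simulates two model replicas in plain Python. It assumes a toy model y = w * x + b with a mean-squared-error loss and assumes that the all-reduce result is used as an average of the per-replica gradients. All function names are hypothetical, and no NCCL call is involved.

```python
def all_reduce_mean(grads_per_replica):
    """Average gradients element-wise and hand every replica the same result."""
    n = len(grads_per_replica)
    mean = [sum(g) / n for g in zip(*grads_per_replica)]
    return [list(mean) for _ in range(n)]


def grad_mse(params, batch):
    """Gradient of the mean squared error for the toy model y = w * x + b."""
    w, b = params
    n = len(batch)
    dw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [dw, db]


def sgd_step(params, grads, lr):
    """Plain gradient-descent update."""
    return [p - lr * g for p, g in zip(params, grads)]


# Two replicas start from identical parameters but see different batches.
replicas = [[0.0, 0.0], [0.0, 0.0]]
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

local_grads = [grad_mse(p, b) for p, b in zip(replicas, batches)]
synced = all_reduce_mean(local_grads)
replicas = [sgd_step(p, g, lr=0.01) for p, g in zip(replicas, synced)]
```

Because every replica applies the same averaged gradient, the replicas remain identical after the update. The amount of data exchanged equals the number of parameters, independent of the batch size.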
For this work, we have selected several representative DL models
to cover different ranges of parameters, computation-communication
ratios, application domains, and various types of neural network
layers. Because of the vast number of potential DL models, we
are unable to test all of them exhaustively. However, by
providing detailed descriptions and computation characteristics for these
select models, readers may be able to estimate the
performance (in terms of computation efficiency, not model accuracy) of
other models, as the fundamental types of numeric operations are
comparable. As computer vision and NLP are the two most
successful application domains for DL, we choose the AlexNet
and ResNet models from the computer vision domain and the BERT
model from NLP to represent DL methods in these
areas. We analyze the models in terms of their number of
trainable parameters and operations. The former affects the memory
footprint as well as the inter-device communication costs, while
the latter impacts the on-device computation time. We introduce
the ratio Γ, the number of operations per training instance divided by
the number of model parameters, as a quantitative measure. The total
computation cost scales linearly with the number of instances per
sampled data batch, known as the batch size. However, the actual
Table 2: Tested Deep Learning Models [Γ denotes the ratio between
the number of operations per instance (Ops/ins.) and the
number of parameters (Param.)].
Model Name    Param.     Ops/ins.    Ratio Γ
AlexNet       61.10 M    0.72 G      11.78
ResNet18      11.69 M    1.83 G      156.54
ResNet50      25.56 M    4.14 G      161.97
ResNet101     44.55 M    7.88 G      176.88
ResNet152     60.19 M    11.62 G     193.06
BERT-SWAG     109.5 M    0.19 G      1.74
BERT-SQuAD    109.5 M    2.87 G      26.21
computation cost depends on many other factors. Table 2 provides
a summary of the number of parameters, operations per instance,
and the ratio Γ for all of the models presented in this work.
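The ratio Γ can be recomputed directly from the parameter and operation counts reported in Table 2. The short sketch below does so; the dictionary layout and the function name `gamma` are our own choices for illustration.

```python
# (parameters in millions, operations per instance in billions), from Table 2
MODELS = {
    "AlexNet": (61.10, 0.72),
    "ResNet18": (11.69, 1.83),
    "ResNet50": (25.56, 4.14),
    "ResNet101": (44.55, 7.88),
    "ResNet152": (60.19, 11.62),
    "BERT-SWAG": (109.5, 0.19),
    "BERT-SQuAD": (109.5, 2.87),
}


def gamma(params_millions, ops_billions):
    """Operations per training instance divided by the number of parameters."""
    return (ops_billions * 1e9) / (params_millions * 1e6)


ratios = {name: round(gamma(p, o), 2) for name, (p, o) in MODELS.items()}
for name, ratio in ratios.items():
    print(f"{name:<11} {ratio:>7.2f}")
```

A low ratio, as for BERT-SWAG, indicates that relatively little computation is available to amortize the cost of synchronizing each parameter.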
3.2 Computer Vision
The goal of computer vision is to make computers gain a high-level
“understanding” of images. To evaluate whether a program (AI model)
truly “understands” an image, researchers have developed different
evaluation tasks to measure its comprehension. One type of these tasks,
known as image classification, provides an image to the program
and asks which predefined class the image belongs to. For
example, the MNIST (handwritten digit database) task asks the program
to tell which digit, from 0 to 9, a grayscale image (28-by-28
pixels) shows. This is considered one of the simplest computer
vision tasks, and traditional machine learning methods, such as the
support vector method, have reached 99.2% accuracy [13]. The
ImageNet Large Scale Visual Recognition Challenge, or ILSVRC [24],
a much more challenging image classification test, was introduced
in 2010. It contains 1000 predefined classes (including 60 different
dog breeds) and more than a million training images. The
best-performing model in the first ILSVRC (2011) achieved only about a
25% top-five error rate.² In 2012, AlexNet [12], considered the first
modern CNN-based model, successfully reduced the top-five error
rate to 16.4%. In 2015, ResNet [11] further reduced the error rate to
3.57%. It also introduced residual blocks to mitigate the “vanishing
gradient problem” when the neural network becomes too deep.
A deep neural network is a stack of multiple neural network
layers, usually of varying kinds. Each layer takes the previous layer’s
output as its input, where both input and output are tensors. A
Linear layer is one of the simplest kinds: a matrix of size
$c_i \times c_o$, where $c_i$ and $c_o$ are the numbers of input and
output channels. Therefore, the number of parameters of a Linear
layer is on the order of $O(c_i c_o)$.
The operation performed by a Linear layer essentially is a general
matrix-matrix multiplication (GEMM). In most cases, the multiplier
matrix (input) has a dimension of $B \times c_i$, and the multiplicand
matrix (Linear layer weights) has a dimension of $c_i \times c_o$. As such,
the number of operations for a batch size $B$ is
$B \times c_i \times c_o$. One can deduce that the
operation-to-parameter ratio Γ for a Linear layer is
$\Gamma_{\mathrm{Linear}} = B$, implying that computation cost grows
linearly with the number of parameters in the Linear layer and with
the batch size.
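The Linear-layer counts above translate into a very small calculation. The sketch below follows the text in ignoring the bias term; the function name and the example layer sizes are illustrative assumptions only.

```python
def linear_layer_cost(c_in, c_out, batch_size):
    """Return (parameters, operations, ratio) for a Linear layer.

    The bias is ignored, as in the order-of-magnitude analysis in the text.
    """
    params = c_in * c_out              # weight matrix of size c_in x c_out
    ops = batch_size * c_in * c_out    # GEMM of (B x c_in) by (c_in x c_out)
    return params, ops, ops / params


params, ops, ratio = linear_layer_cost(c_in=4096, c_out=1000, batch_size=128)
print(params, ops, ratio)
```

Whatever layer sizes are chosen, the ratio equals the batch size, which is the statement $\Gamma_{\mathrm{Linear}} = B$.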
2Top-ve error rate. For each test image, the algorithm is allowed to give ve predictions.
If any of the ve predictions match to the ground truth, it is considered a hit.
A two-dimensional convolutional (Conv2D) layer consists of $c_o$
kernels of size $c_i \times k \times k$. Therefore, the number of
parameters of a Conv2D layer is $c_o (k^2 c_i + 1)$. A kernel is simply
a small tensor applied to the input tensor in a sliding-window fashion,
where the step size is called the stride. When the stride is greater
than one, the input tensor is downsampled in the spatial dimension.
The number of operations for a Conv2D layer can be calculated by
considering the number of times the kernel is applied and the cost of
applying each kernel. Applying a Conv2D kernel to an input
tensor of size $B \times c_i \times H_i \times W_i$ means performing a
tensor dot product of size $c_i \times k^2$ at every pixel of the
spatial dimension $H \times W$.
For simplicity, assume the stride is 1 and the padding is
$\lfloor k/2 \rfloor$ such that the spatial dimension is unchanged:
$H_o = H_i$ and $W_o = W_i$.
Thus, each kernel is applied $H_o \times W_o$ times.³ At every
output pixel, a GEMM operation is performed,
which costs $C \equiv c_o (c_i k^2 + 1)$. Therefore, the total number
of operations of the Conv2D layer is $H_o \times W_o \times C$ per
instance, or $B \times H_o \times W_o \times C$ for a batch of size $B$.
Because the number of parameters of a Conv2D layer is also $C$, the
operation-to-parameter ratio Γ for a Conv2D layer is
$\Gamma_{\mathrm{Conv2D}} = B H_o W_o$. As in the
case of the Linear layer, the total number of operations scales with
the batch size. Yet, in contrast to the Linear layer, the total number
of operations also depends on the spatial dimension of the output
tensor. Each parameter of a Conv2D layer is used $H_o W_o$
times more often than a parameter in a Linear layer.
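The Conv2D counts can be checked the same way. The sketch below assumes stride 1 and padding ⌊k/2⌋, so the output spatial size equals the input size, exactly as in the simplified analysis above. The function name and the example layer shape are hypothetical.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, batch_size):
    """Return (parameters, operations, ratio) for a Conv2D layer.

    Assumes stride 1 and padding floor(k / 2), so h_out and w_out equal the
    input height and width.
    """
    params = c_out * (k * k * c_in + 1)        # C: weights plus one bias per kernel
    ops = batch_size * h_out * w_out * params  # C operations at every output pixel
    return params, ops, ops / params


params, ops, ratio = conv2d_cost(c_in=64, c_out=64, k=3,
                                 h_out=56, w_out=56, batch_size=32)
print(params, ops, ratio)
```

For this example shape, the ratio is 32 × 56 × 56, whereas a Linear layer with the same batch size would reach only 32.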
AlexNet consists of ve Conv2D layers of ∼ 221 parameters in
total, two hidden Linear layers (∼ 225), and one output Linear layer
(∼ 222). e Linear layer also uses an order of magnitude more
parameters. Compared to AlexNet, ResNet consists almost entirely
of Conv2D layers, except the nal Linear layer for classication out-
put. e sub-types of ResNet models are labeled as ResNetX, where
X represents the total number of parameterized layers (Conv2D and
Linear). e choices of X in the original paper [11] are 18, 34, 50,
101, and 152. ResNet18 serves as a high-throughput (small number
of operations), low-accuracy model because of the small amount of
parameters, while ResNet152 has the highest accuracy but slowest
training throughput. Using ResNet50 for ImageNet data (1000-way
classication) as a concrete example, the model contains about 224.6
parameters, where only 221 are from the Linear layer. As discussed,
each parameter of a Conv2D layer contributes a factor of Ho ×Wo
more operations than one in a Linear layer. As such, ResNet has a
much higher operation-to-parameter ratio than AlexNet.
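The powers of two quoted for ResNet50 can be verified with a short calculation. The total parameter count comes from Table 2. The final Linear layer is assumed here to have 2048 input features and 1000 outputs plus a bias, a shape that is not stated in the text and should be read as an assumption of this sketch.

```python
import math

resnet50_params = 25.56e6        # total parameters, from Table 2
fc_params = 2048 * 1000 + 1000   # assumed final Linear layer: weights plus bias

total_exponent = round(math.log2(resnet50_params), 1)
fc_exponent = round(math.log2(fc_params), 1)
fc_share = fc_params / resnet50_params

print(f"total ~ 2^{total_exponent}, Linear ~ 2^{fc_exponent}, "
      f"share = {fc_share:.1%}")
```

Under this assumption, the Linear layer accounts for roughly 8% of the parameters, so almost all parameters sit in Conv2D layers.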
3.3 Natural Language Processing
NLP is another successful application of DL techniques. Some NLP
tasks include speech recognition, translation, speech-to-text (and
vice versa), and question-and-answer systems. In the pre-DL era,
NLP was dominated by hidden Markov models [8]. Mikolov et
al. [15] introduced a DNN-based word embedding model to rep-
resent words as vectors based on their context. Namely, similar
words would have comparable context around them and end up
closer in the vector space. This approach provides a meaningful way
to represent non-numeric entities, i.e., words, as numeric vectors
and provides a foundation for solving a diverse range of NLP tasks.
3 Note that by seing striding greater than one, fewer kernel operations will be applied,
which can reduce the spatial dimension (downsampling). Whereas, by seing the space
between kernel points (dilation), the spatial dimension (upsampling) can increase. e
computation cost analysis is similar.
Graves et al. [10] developed a deep recurrent-neural-network-based
approach to perform automatic speech recognition and broke the
TIMIT phoneme recognition benchmark record [9]. By the end of
2016, all major technology companies had adopted the DNN-based
approach for their speech recognition systems. Vaswani et al. [27]
introduced the attention mechanism into NLP tasks and
demonstrated its superior performance in natural language translation
tasks.
The particular NLP model in this work, BERT, uses bidirectional
transformers [6] and surpassed 11 NLP benchmark records in
November 2018.⁴
The BERT model has two training phases: 1) pre-training and
2) fine-tuning. In the pre-training phase, BERT uses the
semi-supervised sequence learning approach [5] by masking out a
random word in a sentence. Unlike previous unidirectional
approaches, BERT tries to predict the masked word from both
directions. Training is done on large unlabeled corpora, such as the
English Wikipedia (2,500 million words). Herein, this pre-trained
model is known as the base-model. In the task-specific fine-tuning
phase, the base-model connects with a classification Linear layer
designed for the specific task. The data used for fine-tuning are
labeled and much smaller than the large corpora [21]. The
majority of attention mechanism operations are matrix
multiplications and layer-wise normalizations. For details regarding
how the attention mechanism works, readers can refer to several
available guides.⁵, ⁶
We use the pre-trained BERT base-model and fine-tune it for
two specific NLP tasks: SWAG and the Stanford Question Answering
Dataset (SQuAD). SWAG [28] is a multiple-choice task. Given a
situation described by a sentence as input, the model is asked to
select the most plausible scenario that happens next among multiple
choices. SQuAD [23] is a Question Answering task, where a
pair that includes a question and a relevant paragraph (containing
the answer) is provided and the model is tasked to find the answer
in the given paragraph.
Although the base model is the same, different max-seq-length
values are used to fully cover the training data: 80 for SWAG and
384 for SQuAD. As the max-seq-length determines
the attention span, it takes more operations to perform the SQuAD
task. Table 2 features the number of model parameters and
estimated operations of BERT-SWAG and BERT-SQuAD, respectively.
Of note, our benchmark code is modified from the source code.⁷
4 PERFORMANCE ANALYSIS
This section details the performance analysis of DL workloads using
the four systems under consideration (described in Section 2.1). The
all-important communication performance is presented first. Given the
different workload characteristics, the analysis is done separately
for large-scale and high-throughput models. Performance details
for an increasingly popular code expression (due to ease of coding),
PyTorch’s On-node Data Parallel [20], also are included.
⁴ As of March 2019, OpenAI and Microsoft have released their model challengers to
BERT.
⁵ http://nlp.seas.harvard.edu/2018/04/03/attention.html.
⁶ https://jalammar.github.io/illustrated-transformer/.
⁷ https://github.com/huggingface/pytorch-pretrained-BERT
4.1 Communication Performance
As shown in Section 2.1, leading-edge systems implement
various direct high-bandwidth inter-device communication topologies
based on NVLink. NCCL⁸ provides MPI-like primitives for
multi-GPU and multi-node collective communications. The library
is optimized for NVIDIA GPU devices to achieve high
communication bandwidth over NVLink and PCIe (when necessary). NCCL
supports collective communication primitives, such as all-reduce,
all-gather, reduce-scatter, reduce, and broadcast.
As the most relevant communication kernels occurring in the
benchmarks considered, all-reduce and broadcast are examined
for performance using NVIDIA’s NCCL-tests code.⁹ Results are
presented normalized to the “bus bandwidth,” a concept described
by NVIDIA in the NCCL-tests.¹⁰ Bus bandwidth is obtained by
applying to the measured bandwidth (“message size”/time) a
normalization factor that differs for each communication kernel to
reflect its communication complexity and topological mapping to
the network. Because the bus bandwidth reflects how optimally the
hardware is used, it provides a consistent and normalized way to
compare the results with the theoretical peak bandwidth, including
across different communication primitives.
In this work, data size varies from 1 MB to 1 GB, which covers the
communication needs for synchronizing model parameters. Each
data point is averaged over 500 iterations, except for the case of 16
GPUs using two AWS P3s, which is averaged over 50 iterations due
to the slow inter-node Ethernet connection. Figure 2 illustrates the
results.
The DGX-2 consistently achieves 120 GB/s for large message
sizes, regardless of the number of GPUs involved in the
communications. This can be attributed to the NVSwitch’s link bandwidth
and contention properties (described in Section 2.1).
The AWS P3 and DGX-1V yield analogous, if not exactly
identical, results because they share the same hybrid cube-mesh
topology (refer to Figure 1b). Because of the heterogeneity of this
topology, the measured peak bandwidth depends on the devices
involved in the communication. In the case of two GPUs, the test
employs device-0 and device-1, which are connected via a single
NVLink that offers 25 GB/s theoretical unidirectional bandwidth.
For four GPUs, device-0 to device-3 are used, and the NVLinks
connecting to device-4 to device-7 are not. The observed bandwidth is 80 GB/s.
For eight GPUs, the DGX-1 surpasses the DGX-2 in the all-reduce
tests (Figure 2a). In the broadcast test (Figure 2b), the crossover
occurs when the message size exceeds 256 MB. While these results
may seem unexpected given the higher bandwidth and topological
richness of the NVSwitch compared to the NVLink, the
explanation stems from the communication protocol changes introduced
on the NVSwitch. Here, posted requests are converted to
non-posted, which, in turn, requires acknowledgments at the expense
of bandwidth in the reverse direction. This is not the case on the
DGX-1V without NVSwitch. With access to only one DGX-1, the 16-GPU
case was run on AWS P3. The two AWS P3dn nodes are connected via a 100
Gbits/s multi-flow Ethernet connection. The experimental setup in
the AWS cloud allowed for only a single flow (review Section 2.1)
8hps://github.com/NVIDIA/nccl.
9hps://github.com/NVIDIA/nccl-tests.
10Described in detail here:
hps://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md.
Figure 2: Communication Bus Bandwidth. (a) All-reduce. (b) Broadcast.
with a peak bandwidth of 1.25 GB/s. In this case, the communication
bandwidth clearly is bottlenecked by the slow Ethernet connection.
IBM-P9 uses half of the NVLinks for CPU-GPU communication
(Figure 1c). This leaves three NVLinks to connect device-0 and
device-1. In the case of two GPUs, the measured bus bandwidth of 70
GB/s is quite close to the theoretical peak of 75 GB/s. However, with
four GPUs, the bus bandwidth drops to about 30 GB/s, matching
the theoretical bandwidth of 32 GB/s of the SMP bus connecting the
two POWER9 CPUs. Higher-count GPU configurations on the IBM
P9 (eight and 16 GPUs) exhibit lower bus bandwidth (Figure 2). This
achieved performance is due to NCCL not being optimized for the
InfiniBand interconnect.
The RTX system does not use NVLink technology, and all eight
RTX 2080 Ti GPUs connect through a PCIe bus. Therefore, the
communication bandwidth is throttled by the PCIe bus. Despite
its inferior communication performance, the RTX system serves as
the baseline for other systems.
4.2 Performance of Deep Learning Workloads
Computation performance is measured in terms of the model
training throughput: the average number of training samples, or
instances, the system can process per second. For each
combination of model, batch size, and number of GPUs, time intervals
are measured between consecutive iterations during training. For
computer vision DL models, each model runs for 200 iterations.
For the BERT models, the reported throughput is averaged over
one training epoch.¹¹ All of the models are represented in single
precision (FP32).
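The throughput metric can be computed from per-iteration timestamps as in the minimal sketch below. The function name, the `warmup` parameter for discarding initial iterations, and the sample timestamps are our own assumptions and are not taken from the benchmark code.

```python
def throughput(timestamps, batch_size_per_gpu, num_gpus, warmup=0):
    """Average training samples per second from iteration timestamps (seconds).

    `warmup` (an assumption of this sketch) drops the first few intervals.
    """
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])][warmup:]
    mean_interval = sum(intervals) / len(intervals)
    return batch_size_per_gpu * num_gpus / mean_interval


# Hypothetical run: one iteration every 0.5 s, 128 samples per GPU, 8 GPUs.
samples_per_second = throughput([0.0, 0.5, 1.0, 1.5], 128, 8)
print(samples_per_second)
```

Each iteration processes one batch per GPU, so the number of samples per iteration is the per-GPU batch size multiplied by the number of GPUs.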
Distributed data-parallel training with asynchronous data
prefetching is used. Each GPU is associated with j data-fetching CPU
processes using CUDA streams. In these tests, j = 4. This allows data
to be loaded and preprocessed asynchronously and concurrently
on the CPUs while the GPUs are in use. Every GPU device holds a
replica of the model and applies the model to different data batches.
At each iteration’s conclusion, all GPUs synchronize their
parameter gradients via an all-reduce NCCL operation. Then, all model
replicas individually update their parameters using the gradients.
The computer vision models are trained on the ILSVRC ImageNet
data set, while BERT models are fine-tuned on task-specific data
sets, SWAG and SQuAD (introduced in Section 3.3).
As the system performance characteristics vary for different
models, we group models such as ResNet101(152) and BERT as
large DL models and those with high throughput, e.g., ResNet18(50)
and AlexNet, as high-throughput DL models. The large DL model
results are discussed in Section 4.2.1, while high-throughput DL
models are addressed in Section 4.2.2. For added clarity, the bar
plots featured in this section depict systems ordered from left to
right, corresponding to the system order in the legends (inset,
from top to bottom).
4.2.1 Performance Analysis of Large Deep Learning Models.
Initially, the absolute throughput values of large DL models,
e.g., ResNet101, ResNet152, BERT-SWAG, and BERT-SQuAD, are
examined (Figure 3). As the amount of communication for
synchronization depends on the number of model parameters and not on
the batch size, we choose the largest batch size that can fit into
the 32 GB of memory of a single V100 GPU to achieve the best
possible scaling results. Specifically, the batch sizes used are: 128
per GPU for ResNet101 and ResNet152, 64 for BERT-SWAG, and 32
for BERT-SQuAD.
Among the four systems, the DGX-2 and AWS P3 have similar
performance up to eight GPUs. This is expected as both systems
have the same V100 GPUs and are connected via high-bandwidth
(over 120 GB/s) NVLinks. However, when 16 GPUs are in use,
two AWS P3s communicate through a relatively slow Ethernet
connection (about 1 GB/s). Figures 3c and 3d reveal the differences
in performance, especially in BERT models where the number of
parameters is large.
Given its high-bandwidth inter-node communication network,
for all models except the ResNet101, the IBM P9 exhibits similar
performance to DGX-2 all the way up to a 16 GPU configuration.
¹¹ One epoch is defined as going through the entire data set once.
Figure 3: Training Throughput of Large DL Models on RTX,
IBM-P9, AWS P3, and DGX-2. Panels: (a) ResNet101, (b) ResNet152,
(c) BERT-SWAG, (d) BERT-SQuAD.
The IBM P9’s mediocre performance on ResNet101 stems from the
fact that the IBM GPFS file system has a block size of 16 MB, much
larger than the individual image sizes, leading to sub-optimal
bandwidth utilization given that all data are fetched remotely through
the file system. This bottleneck becomes more apparent in testing
the high-throughput models (Section 4.2.2).
The RTX server has 11 GB of GDDR6 memory per GPU. Hence, the
batch sizes are even smaller: one-quarter of the size used with the
32 GB of memory on the V100 GPUs in all other systems. Specifically,
the batch size for ResNet101 and ResNet152 is 32, BERT-SQuAD is 8,
and BERT-SWAG is 16. This leads to a quadrupling of the amount of
communication for the same total of computed instances. RTX’s
slow inter-device communication via a PCIe bus further exacerbates
its performance degradation. For example, in the case of one GPU,
RTX can reach about 65.82% of the DGX-2 throughput averaged
over four DL models, yet merely 57.27% in the case of eight GPUs
(see Table 3). Hence, the RTX server is the least efficient system for
large model distributed training.
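The quadrupling of communication can be checked with a short calculation. The sketch below assumes one full FP32 gradient exchange per iteration, so the gradient bytes synchronized per processed instance scale as the parameter count divided by the per-GPU batch size. The parameter count is a hypothetical placeholder; only the ratio matters. The batch sizes are the BERT-SWAG values quoted in the text (64 on the V100, 16 on the RTX).

```python
def gradient_bytes_per_instance(num_parameters, batch_size_per_gpu, bytes_per_value=4):
    """Gradient bytes one GPU synchronizes per training instance it processes.

    Assumes one full gradient exchange per iteration and FP32 (4-byte) values.
    """
    return num_parameters * bytes_per_value / batch_size_per_gpu

# Hypothetical parameter count; the ratio below does not depend on it.
num_parameters = 100_000_000

v100 = gradient_bytes_per_instance(num_parameters, batch_size_per_gpu=64)
rtx = gradient_bytes_per_instance(num_parameters, batch_size_per_gpu=16)
print(rtx / v100)  # 4.0
```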
To examine the scaling more closely throughout the full span
of GPU configurations, we plot the throughput for all DL models
in a log-log scale (Figure 4), where the dashed reference line
depicts linear scalability. If the measured throughput follows the
reference line, or maintains a constant gap, it has good parallel
scalability. The DGX-2 exhibits good scalability on all four models,
whereas AWS P3 shows linear scalability up to eight GPUs. For
the RTX, there is a significant drop from one GPU to two GPUs in
terms of scalability because one GPU computation does not require
model synchronization, while that cost does apply for multiple GPU
configurations.
Figure 4: Linear Scaling in Log-Log Scale. Panels: (a) ResNet101,
(b) ResNet152, (c) BERT-SWAG, (d) BERT-SQuAD.
4.2.2 Performance Analysis of High-Throughput Learning Models.
Here, AlexNet, ResNet18, and ResNet50 are characterized as
high-throughput models. All systems except RTX use a batch size of
256 per GPU to fully utilize their 32 GB of memory for all
models. RTX uses a batch size of 64. Figure 5 illustrates the results.
Training high-throughput models implies the added cost of data
movement through the file system. Because of the large block
size of its GPFS file system, the IBM-P9 hits a throughput
ceiling of 4000 instances per second for AlexNet and ResNet18. Its
performance on ResNet50 is similar to that of the DGX-2 and AWS
P3, which use local NVMe drives and employ a file system
with a much smaller block size.
For ResNet50 (Figure 5c), both the DGX-2 and AWS P3 exhibit
linear scaling. Because of the ResNet50 model’s small size, the slow
inter-node Ethernet bandwidth of the AWS P3 does not bottleneck
the distributed training throughput performance.
Because AlexNet uses more than twice the number of parameters
of ResNet50, throughput performance is throttled down by the slow
Ethernet connection on AWS P3 when two nodes (with a total of 16
GPUs) are in use (Figure 5a). Even on the DGX-2, AlexNet does not
scale linearly to 16 GPUs (shown in Figure 6a). When 16 GPUs are
in use on the DGX-2, AlexNet spends about 80% of the active GPU
time in communication, whereas ResNet50 spends only about 4%.

Table 3: Relative Performance of RTX to DGX-2
Model Name                 1 GPU    2 GPUs   4 GPUs   8 GPUs
AlexNet                    78.19%   63.01%   53.41%   47.95%
ResNet18                   73.50%   69.13%   64.39%   54.80%
ResNet50                   67.97%   62.67%   62.97%   61.75%
Average (high-throughput)  73.22%   64.94%   60.26%   54.83%
ResNet101                  69.70%   63.72%   64.15%   62.69%
ResNet152                  69.73%   62.45%   62.96%   61.90%
BERT-SWAG                  64.04%   57.52%   57.20%   56.25%
BERT-SQuAD                 59.81%   49.79%   49.74%   48.22%
Average (large)            65.82%   58.37%   58.51%   57.27%
Overall avg.               68.99%   61.19%   59.26%   56.22%

Figure 5: Training Throughput of High-throughput DL Models on RTX,
IBM-P9, AWS P3, and DGX-2. Panels: (a) AlexNet, (b) ResNet18,
(c) ResNet50.
Figure 6: Examining Scaling in Log-Log Scale. Panels: (a) AlexNet,
(b) ResNet18, (c) ResNet50.
Because it has the fewest parameters, ResNet18’s need for
inter-device communication is modest. Even so, as shown in
Figure 6b, the scaling is not ideal. An interesting observation is that
when using 16 GPUs, the AWS P3 performs better than the DGX-2
(Figure 5b).
Figure 7: CPU Performance Bottleneck of ResNet18. Panels: (a) CPU
Core Speed, (b) Instructions per Cycle.
Recall from Section 4.2 that in all experiments, each GPU is
associated with j = 4 CPU processes for prefetching data. On
the AWS P3, the two CPUs on each node handle 32 processes
for the eight GPUs. On the DGX-2, the 16 GPUs require 64 CPU
data-fetching processes from the two associated CPUs. Explaining
why the AWS P3 outperforms the DGX-2 in Figure 5b requires
determining if the scaling inconsistency stems from a lower core
frequency and/or cache capacity effects. Figure 7a shows CPU
core speed measurements (with Turbo Boost technology enabled)
for both systems while varying j from 1 to 16 on the DGX-2 and AWS
P3. For example, if j = 16 and the DGX-2 uses all 16 GPUs, there are
256 CPU processes in total. The light green curve (Figure 7a) depicts
the case when only eight GPUs on the DGX-2 are in use, in which case
the DGX-2 has slightly better performance than AWS P3 (Figure 5b).
When using j = 1 CPU process per GPU, the DGX-2’s CPU core speed
is much higher than that of the AWS P3 because of its superior
CPU performance characteristics (see Section 2.1). However, as j
increases, the DGX-2’s CPU core speed decreases, which is typical
for Intel Turbo Boost technology. For j = 4, the specific case present
in the benchmark runs (also shown by the vertical dotted line in
Figure 7a), the DGX-2 maintains a higher CPU core speed than that
for the AWS P3. Hence, clock frequency is not the sole explanation
for the performance inconsistency. To understand the exact amount
of work the CPU does per unit time, Figure 7b shows the metric of
instructions per cycle (IPC). The IPC of the DGX-2 using 16 GPUs at
j = 4 is much lower than that of AWS P3: 1.35 versus 1.90, pointing
to cache utilization inefficiencies.¹² Additional measurements of
L1-cache data loading speed and data-translation lookaside buffer
(TLB) load misses confirm this hypothesis. The data also reveal that
j = 4 usually is a good choice. Of note, because we use pinned
memory¹³ to improve host-device data transfer, using large j will
cause high memory usage on the host.
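The process counts quoted in this paragraph follow from simple arithmetic, sketched below. The function name is hypothetical, and the per-CPU figure is a derived illustration that assumes processes are spread evenly across the two CPUs, which the text does not state.

```python
def data_fetching_processes(num_gpus, j, num_cpus):
    """Total data-fetching processes and the average number per CPU socket."""
    total = num_gpus * j
    return total, total / num_cpus

print(data_fetching_processes(num_gpus=8, j=4, num_cpus=2))    # AWS P3 node: (32, 16.0)
print(data_fetching_processes(num_gpus=16, j=4, num_cpus=2))   # DGX-2: (64, 32.0)
print(data_fetching_processes(num_gpus=16, j=16, num_cpus=2))  # DGX-2, j = 16: (256, 128.0)
```

At j = 4, each DGX-2 CPU serves twice as many data-fetching processes as each AWS P3 CPU, which is consistent with the heavier cache pressure the IPC measurements point to.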
For RTX versus DGX-2 performance, when one or two GPUs
are in use, RTX performance is close to that of the DGX-2
(refer to Table 3). Because of their smaller GPU memory footprints,
high-throughput workloads are better suited to the RTX than large
models. Just as with the case of performance on large models, RTX’s
scalability is lower than that of the DGX-2 (see Table 3 and Figure 6)
due to its slower communication performance. This makes the RTX
system most suited for small-scale model development rather than
full-scale training workloads.
¹² Note: The tested Intel Xeon CPU can reach a theoretical maximum of
four IPC when instructions are perfectly aligned by manual loop unrolling.
¹³ Employing pinned memory will prevent the host memory from being
swapped out and enable GPU drivers direct access to the host memory.
4.3 Comparing the PyTorch On-node Data
Parallel with Distributed Data Parallel
Until now, all of the results herein use the highly optimized
distributed data parallel code to achieve the highest system
performance possible. By contrast, PyTorch on-node data parallel is an
easy-to-use method for enabling computations on multiple GPUs.
Code modifications basically are confined to the introduction of a
directive-like instruction that wraps a non-parallel PyTorch Module
with a DataParallel syntax, such as
model = torch.nn.DataParallel(model).¹⁴ The communication
pattern of on-node data parallel differs from that of distributed
data parallel. In it, one GPU maintains a master copy of the model
parameters. At every iteration, it broadcasts the parameters to the
other GPUs in the configuration. At the end of every iteration,
the gradients are reduced back to the master GPU, which
updates the model parameters. Therefore, for each iteration, two
global communications (broadcast and reduce) are issued. To
emulate the common practice of most PyTorch models, we use the
default PyTorch data loader for on-node data parallel experiments
(torch.utils.data.DataLoader), which supports multi-worker
and pinned memory but not asynchronous data loading. PyTorch’s
on-node data parallel design maximizes its usefulness but targets
small parallel GPU configurations, such as those common in
workstations.
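A rough traffic model helps show why a master copy becomes a hotspot as the GPU count grows. The sketch below is deliberately simplified and rests on assumptions that are not from the paper: the master exchanges a full copy of the model with each peer individually, and the all-reduce is modeled as a ring all-reduce, in which each GPU sends 2(n - 1)/n times the model size. Actual PyTorch and NCCL implementations may use other collective algorithms, so the numbers are illustrative only.

```python
def master_based_traffic(num_gpus, model_bytes):
    """Bytes through the master GPU per iteration: broadcast out plus reduce in.

    Simplifying assumption: the master exchanges a full copy with each peer.
    """
    peers = num_gpus - 1
    return {"master_sends": peers * model_bytes,
            "master_receives": peers * model_bytes}

def ring_all_reduce_traffic(num_gpus, model_bytes):
    """Bytes each GPU sends per iteration in a ring all-reduce (assumed algorithm)."""
    if num_gpus == 1:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * model_bytes

# Hypothetical model size of 100 MB, not a figure from the paper.
model_bytes = 100 * 1024 * 1024
for num_gpus in (2, 4, 8, 16):
    master = master_based_traffic(num_gpus, model_bytes)["master_sends"]
    ring = ring_all_reduce_traffic(num_gpus, model_bytes)
    print(num_gpus, master / model_bytes, ring / model_bytes)
```

Under these assumptions, the load on the master grows linearly with the number of GPUs, while the per-GPU load of the ring all-reduce stays below twice the model size.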
Figure 8: Relative Throughput Performance of ResNet50 between
PyTorch On-node Data Parallel and Distributed Data Parallel.
Figure 8 presents the relative performance of models expressed
as on-node data parallel compared to the distributed data parallel
algorithms for all systems considered. The experiments are done
using ResNet50. For one GPU, the two data parallel schemes produce
similar results. As more GPUs are utilized, performance decreases
when using on-node data parallelism. When two GPUs are in use,
both DGX-2 and AWS P3 achieve about 90% of the distributed data
parallel performance. Then, it drops rapidly for larger numbers of
GPUs. The starkly lower performance on this test for the IBM P9 is
related to the choice of ResNet50, which is sub-optimal for the file
system on this system (refer to Section 4.2.2).
14hps://pytorch.org/docs/stable/ modules/torch/nn/parallel/data parallel.html.
5 CONCLUSION
In this work, we analyzed the performance of several leading-edge
systems architected for DL workload performance: DGX-2, AWS P3,
and IBM-P9. We also considered a consumer-grade, budget-efficient
system: an RTX-2080 Ti server. The inclusion of AWS P3, which
essentially is a DGX-1 system, was done to explore performance
along the ever-increasing use of cloud computing scenarios for DL
workloads. The tested DL models spanned the computer vision
and NLP domains, are realistic, and actually are used in real-life
DL applications. By varying the types of neural network models
and batch sizes per GPU, the systems were probed using different
realistic computation and communication scenarios. Some of the
specific performance aspects revealed in this work include:
• The DGX-2 offered the best 16 GPU collective communication,
making it most suited for training large models on 16
GPUs.
• When training on eight GPUs, the DGX-1, AWS P3, and
DGX-2 afforded similar performance.
• Because of the limited GPU memory and PCIe bandwidth,
when eight GPUs are in use, the RTX-2080 Ti server can
reach about 61.46% of the throughput performance offered
by the leading-edge systems considered in this evaluation.
• The cloud-use scenario does not lead to very large performance
degradation when the communication-to-computation
ratio of the DL models is low. However, achieving that level
of performance requires extensive understanding of
the cloud environment to maximize performance by
minimizing system contention, ensuring geographical closeness
of systems, and handling other idiosyncratic tasks.
• Scalability of the DL models was investigated up to the sizes
of the DGX-2 machine available as a standalone system.
Future work will need to consider scaling up to production-size
DL models.
Practical considerations can be readily extracted from the work
documented in this paper, including guidance for procuring
systems that maximize performance for a given workload of
interest, as well as for considering choice of machines, DL models,
and use modes. While as part of this work we implicitly considered
cost impacts in system selection, readers are left to weigh such an
analysis (and aspects related to it) on their own.
ACKNOWLEDGMENTS
The authors extend their sincere gratitude to Ethan Hereth
(University of Tennessee at Chattanooga) for his exhaustive assistance and
support related to the IBM-P9, as well as Anthony Skjellum
(University of Tennessee at Chattanooga) for his additional oversight.
Too, they thank IBM’s Xinghong He, Mladen Karcic, and Douglas
L. Lehr for facilitating access to internal benchmarking resources
(four IBM-P9 node configuration) used in this work. Thanks also to
Brian Barrett (Amazon Web Services) for his assistance related to
the AWS P3 and to Craig Tierney and Louis Capps (both of NVIDIA)
and Zhihua Dong (Brookhaven Lab Computational Science
Initiative) for DGX-2 benchmarking support. The authors are grateful for
the significant assistance received from Charity Plata (Brookhaven
Lab) in the editing and graphics enhancements of this paper.
This performance analysis was funded as part of the Exploiting
the Convergence of Research Challenges in Scientific Discovery And
National Security program within Brookhaven Lab’s Computational
Science Initiative with additional hardware infrastructure support
from the Empire State Development Corporation. Brookhaven
National Laboratory is operated and managed for the U.S. Department
of Energy’s Office of Science by Brookhaven Science Associates on
behalf of Stony Brook University and Battelle under Contract No.
DE-SC0012704.
REFERENCES
[1] R. Adolf, S. Rama, B. Reagen, G. Wei, and D. Brooks. 2016. Fathom: reference
workloads for modern deep learning methods. In 2016 IEEE International Sympo-
sium on Workload Characterization (IISWC). 1–10. hps://doi.org/10.1109/IISWC.
2016.7581275
[2] Alexandre Bicas Caldeira. 2018. IBM Power System AC922 Introduction and
Technical Overview. (2018), 74. hps://www.redbooks.ibm.com/redpapers/pdfs/
redp5472.pdf
[3] Sharan Chetlur, Cli Woolley, Philippe Vandermersch, Jonathan Cohen, John
Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Ecient Primitives
for Deep Learning. arXiv e-prints (Oct. 2014), arXiv:1410.0759.
[4] D. Ciregan, U. Meier, and J. Schmidhuber. 2012. Multi-column deep neural
networks for image classication. In 2012 IEEE Conference on Computer Vision
and Paern Recognition. 3642–3649. hps://doi.org/10.1109/CVPR.2012.6248110
[5] Andrew M Dai andoc V Le. 2015. Semi-supervised Sequence Learning. In
Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence,
D. D. Lee, M. Sugiyama, and R. Garne (Eds.). Curran Associates, Inc., 3079–3087.
hp://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding. ArXiv e-prints
(Oct. 2018).
[7] Jason Jason Furmanek. 2019. PowerAI 1.6.0 Introduction: A Full Transition to
Conda. (March 2019). hps://developer.ibm.com/linuxonpower/2019/03/20/
powerai-1-6-0-introduction-a-full-transition-to-conda/
[8] Mark Gales and Steve Young. 2007. e Application of Hidden Markov Models
in Speech Recognition. Found. Trends Signal Process. 1, 3 (Jan. 2007), 195–304.
hps://doi.org/10.1561/2000000004
[9] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Palle. 1993. DARPA
TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc
1-1.1. NASA STI/Recon Technical Report N 93 (Feb. 1993).
[10] A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep
recurrent neural networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing. 6645–6649. hps://doi.org/10.1109/ICASSP.2013.
6638947
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual
Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and
Paern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. 770–778.
hps://doi.org/10.1109/CVPR.2016.90
[12] Alex Krizhevsky, Ilya Sutskever, and Georey E Hinton. 2012. ImageNet Classi-
cation with Deep Convolutional Neural Networks. In Advances in Neural Infor-
mation Processing Systems 25, F. Pereira, C. J. C. Burges, L. Boou, and K. Q. Wein-
berger (Eds.). Curran Associates, Inc., 1097–1105. hp://papers.nips.cc/paper/
4824-imagenet-classication-with-deep-convolutional-neural-networks.pdf
[13] Y. Lecun, L. Boou, Y. Bengio, and P. Haner. 1998. Gradient-based learning
applied to document recognition. Proc. IEEE 86, 11 (Nov. 1998), 2278–2324.
hps://doi.org/10.1109/5.726791
[14] A. D. Malony, S. Biersdor, S. Shende, H. Jagode, S. Tomov, G. Juckeland, R.
Dietrich, D. Poole, and C. Lamb. 2011. Parallel Performance Measurement of
Heterogeneous Parallel Systems with GPUs. In 2011 International Conference on
Parallel Processing. 176–185. hps://doi.org/10.1109/ICPP.2011.71
[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Je Dean.
2013. Distributed Representations of Words and Phrases and their Com-
positionality. In Advances in Neural Information Processing Systems 26,
C. J. C. Burges, L. Boou, M. Welling, Z. Ghahramani, and K. Q. Wein-
berger (Eds.). Curran Associates, Inc., 3111–3119. hp://papers.nips.cc/paper/
5021-distributed-representations-of-words-and-phrases-and-their-compositionality.
pdf
[16] White Paper NVIDIA. 2017. NVIDIA DGX-1 With Tesla V100 Sys-
tem Architecture. (2017). hp://images.nvidia.com/content/pdf/
dgx1-v100-system-architecture-whitepaper.pdf
[17] White Paper NVIDIA. 2017. NVIDIA Tesla V100 GPU Architecture.
(Aug. 2017). hps://images.nvidia.com/content/volta-architecturae/pdf/
volta-architecture-whitepaper.pdf
[18] White Paper NVIDIA. 2018. NVIDIA NVSwitch: eWorld’s Highest-Bandwidth
On-Node Switch. (2018).
[19] White Paper NVIDIA. 2018. NVIDIA Turing Architecture. (2018). hps://www.
nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/
turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
2017. Automatic dierentiation in PyTorch. In NIPS-W.
[21] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018.
Improving Language Understanding by Generative Pre-Training. Online (2018),
12. hps://openai.com/blog/language-unsupervised/
[22] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. 2009. Large-scale Deep
Unsupervised Learning Using Graphics Processors. In Proceedings of the 26th
Annual International Conference on Machine Learning (ICML ’09). ACM, New
York, NY, USA, 873–880. hps://doi.org/10.1145/1553374.1553486 event-place:
Montreal, ebec, Canada.
[23] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.
SAD: 100,000+estions for Machine Comprehension of Text. In Proceed-
ings of the 2016 Conference on Empirical Methods in Natural Language Pro-
cessing. Association for Computational Linguistics, Austin, Texas, 2383–2392.
hps://doi.org/10.18653/v1/D16-1264
[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean
Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexan-
der C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition
Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.
hps://doi.org/10.1007/s11263-015-0816-y
[25] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell,
R. Agarwal, and Y. Liu. 2016. Knights Landing: Second-Generation Intel Xeon
Phi Product. IEEE Micro 36, 2 (March 2016), 34–46. hps://doi.org/10.1109/MM.
2016.25
[26] Nathan R. Tallent, Nitin A. Gawande, Charles Siegel, Abhinav Vishnu, and
Adolfy Hoisie. 2017. Evaluating On-Node GPU Interconnects for Deep Learning
Workloads. In PMBS@SC (Lecture Notes in Computer Science), Vol. 10724. Springer,
3–21.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Aention is All
you Need. In Advances in Neural Information Processing Systems 30, I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-
ne (Eds.). Curran Associates, Inc., 5998–6008. hp://papers.nips.cc/paper/
7181-aention-is-all-you-need.pdf
[28] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG:
A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In
EMNLP. Association for Computational Linguistics, 93–104.
