Demystifying the MLPerf Benchmark Suite by Verma, Snehil et al.
arXiv:1908.09207v1 [cs.LG] 24 Aug 2019
Demystifying the MLPerf Benchmark Suite
Snehil Verma¹, Qinzhe Wu¹, Bagus Hanindhito¹, Gunjan Jha²,
Eugene B. John², Ramesh Radhakrishnan³, and Lizy K. John¹
¹University of Texas at Austin
²University of Texas at San Antonio
³Dell Inc.
August 27, 2019
Abstract
MLPerf, an emerging machine learning benchmark suite, strives
to cover a broad range of machine learning applications. We
present a study of its characteristics and of how the MLPerf
benchmarks differ from previous deep learning benchmarks such
as DAWNBench and DeepBench. We find that application
benchmarks such as MLPerf (although rich in kernels) exhibit
different features compared to kernel benchmarks such as
DeepBench. The MLPerf benchmark suite contains a diverse set
of models, which allows various bottlenecks in the system to be
unveiled. Based on our findings, a dedicated low-latency
interconnect between GPUs in multi-GPU systems is required for
optimal distributed deep learning training. We also observe
variation in scaling efficiency across the MLPerf models. The
variation exhibited by the different models highlights the
importance of smart scheduling strategies for multi-GPU training.
Another observation is that CPU utilization increases with the
number of GPUs used for training. Corroborating prior work, we
also observe and quantify the improvements made possible by
compiler optimizations, mixed-precision training, and the use of
Tensor Cores.
1 Introduction
The recent advances in machine learning have led to an evolu-
tion of a myriad of applications, revolutionizing scientific, in-
dustrial and commercial fields. Machine learning, primarily
deep learning, is the state-of-the-art in providing models, meth-
ods, tools, and techniques for developing autonomous and intel-
ligent systems.
There are two parts to machine learning: training and in-
ference. Training refers to the process where the neural net-
work learns a new capability based on existing data. Train-
ing is a compute-intensive task since it operates typically on
massive datasets, tuning weights until the model meets the de-
sired quality. As the system’s compute power plays a signifi-
cant role in accelerating the neural network learning, training
is usually done using high-performance hardware or compute
clusters. However, reducing the training time while maintain-
ing the desired quality of the neural network model is still an
active area of research. On the other hand, inference utilizes the
capabilities of a trained neural network to make useful predic-
tions which requires less compute power. Inference is usually
performed inside the end-user hardware such as edge devices
where energy efficiency is also an important design considera-
tion.
Benchmarking various machine learning workloads and eval-
uating their performance based on a reasonable metric is a pre-
requisite for a fair comparison. MLPerf is an emerging con-
sortium that provides an extensive benchmark suite for mea-
suring the performance of machine learning software frame-
works, hardware accelerators, and cloud platforms [14, 27, 49].
The major contributors include Google, NVIDIA, Baidu, Intel,
AMD, and other commercial vendors, as well as research
universities such as Harvard, Stanford, and the University of
California, Berkeley. MLPerf's initial release, v0.5, consists of
benchmarks only for training, with inference benchmarks expected
shortly. The benchmarks provide reference implementations for
workloads in vision, language, product recommendation, and other
key areas where deep learning models have shown success and
datasets are publicly available.
Young [49] rightly points out five main attributes that a good
machine learning benchmark suite should possess, grouped
together as five "R"s. One of them is Representative workloads,
with regard to which he wrote:
A good benchmark suite is both diverse and represen-
tative, where each workload in the suite has unique
attributes and the suite collectively covers a large frac-
tion of the application space.
In a recent talk on "MLPerf design challenges", Mattson [24]
highlighted that the current set of training benchmarks covers a
wide range of applications.
We evaluate the MLPerf benchmarks with experiments on
diverse hardware platforms. We investigate whether the execution
characteristics of these benchmarks are sufficiently dissimilar,
or whether they are largely similar in spite of the diverse
domains. This work focuses solely on the training workloads.
Table 1: Summary of key insights from the work.
| Observation | Location | Insight/Explanation |
|---|---|---|
| MLPerf benchmark suite has a disjoint envelope from DAWNBench and DeepBench. | Figure 1a | MLPerf, DAWNBench, and DeepBench suites stress HBM2 memory at different levels, and are optimized to different extents. Throughput and arithmetic intensity: DAWNBench > MLPerf > DeepBench. |
| DeepBench, MLPerf, and DAWNBench are located in different regions in the roofline graph. | Figure 3 | (shared with the row above) |
| Every benchmark in MLPerf benchmark suite is on the boundary of the workload space. | Figure 1b | There is great diversity in the MLPerf benchmark suite, e.g., in terms of the scaling efficiency. This information is helpful for resource scheduling in systems with multiple devices, such as data centers and cloud platforms. |
| Different benchmarks scale up differently, and by exploiting these differences, the optimal scheduling can save hours of training on multi-GPU systems. | Table 5, Figure 6 | (shared with the row above) |
| The linkage distance between Res50_MX and Res50_TF is as much as the longest distance among a group of other workloads on different platforms and in different application domains (including Res18_Py). | Figure 2 | The same neural network model does not guarantee that the characteristics of the workloads would be similar. Hyperparameters and the implementation frameworks largely affect the behavior of the benchmark. |
| Data points representing machine learning workloads are close to the slanted roof line. | Figure 3 | It is easy to exploit the abundant parallelism in ML applications and finally end up being bound by hardware resources. |
| Mixed precision in combination with Tensor Cores earns significant speedup on MLPerf. | Figure 4 | Hardware support for reduced-precision arithmetic is important, especially for machine learning workloads. |
| With XLA enabled, Res50_TF converges to the same accuracy as without XLA, while the time reduces by 40%. | Figure 5 | Compiler optimizations, especially kernel fusion, provide a lot of potential for performance improvement. |
| When scaling to more GPUs, many benchmarks have a super-linear increase in PCIe / NVLink utilization. | Table 6 | Machine learning applications can become communication-heavy workloads, so the bus is worth attention in ML processor designs. Direct connections between GPUs facilitate machine learning workloads. |
| Training time: GPU-system with NVLink enabled < GPU-system with PCIe switch enabled < system with GPUs connected using CPU PCIe ports. | Figure 7, Table 4 | (shared with the row above) |
The objective of this work is to answer the following
questions:
• How different are the MLPerf benchmarks from the prior
deep learning benchmarks? How different are the MLPerf
benchmarks from each other? If hardware designers do
not have the time budget to evaluate all benchmarks, can
they use a subset of the benchmarks?
• What is the performance differential that can be obtained
by using reduced precision and NVIDIA’s tensor cores?
• How well does the training performance scale with an
increasing number of GPUs?
• What is an efficient way for a user to operate on multiple
GPUs to train several models: should they run distributed
jobs one-by-one on all GPUs or should they run jobs as-
signing one model to each GPU or is there any other better
solution?
• How heavily are the CPU, GPU, and interconnect utilized? Is
there a significant performance impact from the high-bandwidth
GPU interconnects?
The key insights revealed by this work are summarized in
Table 1, and the rest of the paper is organized as follows: Section 2
introduces the emerging MLPerf [27] benchmark suite as well
as some prior deep learning benchmarks like DAWNBench [9]
and DeepBench [3]. In Section 3, we expand on the system
configurations and topologies, on which various experiments
were performed. We investigate various benchmark character-
istics and present a detailed analysis of the similarity of vari-
ous benchmarks in Section 4. This section also presents perfor-
mance impact of mixed precision and tensor cores. Section 5
presents measurements on the system resources utilization to
provide insights on CPU’s, GPU’s, and interconnect’s impact
on machine learning training performance as well as the mem-
ory requirement to store the dataset during processing. Then,
we conclude the paper with Section 6.
2 Background
In this section, we present an introduction to MLPerf [27],
DAWNBench [9], and DeepBench [3] benchmarks for machine
learning.
2.1 MLPerf Benchmarks
The MLPerf [27] benchmark suite includes workloads from im-
age classification, object detection, translation, recommenda-
tion and reinforcement learning.
Table 2: Summary of benchmarks in MLPerf v0.5, DAWNBench, and DeepBench suites used in this study.
| Suite | Abbreviation | Domain | Model | Framework (submitter) | Dataset | Quality Target |
|---|---|---|---|---|---|---|
| MLPerf v0.5 | MLPf_Res50_TF | Image Classification | ResNet-50 | TensorFlow (Google) | ImageNet | Accuracy: 0.749 |
| MLPerf v0.5 | MLPf_Res50_MX | Image Classification | ResNet-50 | MXNet (NVIDIA) | ImageNet | Accuracy: 0.749 |
| MLPerf v0.5 | MLPf_SSD_Py | Object Detection | SSD (light-weight) | PyTorch (NVIDIA) | Microsoft COCO | mAP: 0.212 |
| MLPerf v0.5 | MLPf_MRCNN_Py | Object Detection | Mask RCNN (heavy-weight) | PyTorch (NVIDIA) | Microsoft COCO | Box mAP: 0.377, Mask mAP: 0.339 |
| MLPerf v0.5 | MLPf_XFMR_Py | Translation | Transformer | PyTorch (NVIDIA) | WMT17 | BLEU score (uncased): 25 |
| MLPerf v0.5 | MLPf_GNMT_Py | Translation | RNN GNMT | PyTorch (NVIDIA) | WMT17 | Sacre BLEU score (uncased): 21.80 |
| MLPerf v0.5 | MLPf_NCF_Py | Recommendation | Neural Collaborative Filtering | PyTorch (NVIDIA) | MovieLens 20-million | Hit rate @ 10: 0.635 |
| DAWNBench | Dawn_Res18_Py | Image Classification | ResNet-18 (modified) | PyTorch (bkj) | CIFAR10 | Test accuracy: 94% |
| DAWNBench | Dawn_DrQA_Py | Question Answering | DrQA | PyTorch (Yang et al.) | SQuAD | F1 score: 0.75 |
| DeepBench | Deep_GEMM_Cu | Dense Matrix Multiply | N/A | Bare-metal CUDA | N/A | N/A |
| DeepBench | Deep_Conv_Cu | Convolution | N/A | Bare-metal CUDA | N/A | N/A |
| DeepBench | Deep_RNN_Cu | Recurrent Layer | N/A | Bare-metal CUDA | N/A | N/A |
| DeepBench | Deep_Red_Cu | All-Reduce | N/A | Bare-metal CUDA | N/A | N/A |
• Image classification is a typical deep learning application
that identifies the object classes present in the image. This
benchmark uses the ResNet-50 [17, 18] model to classify
images. ResNet-50 is a 50-layer residual network that
effectively overcomes the problem of degradation of
training accuracy, is easier to optimize, and can gain
accuracy from considerably increased depth.
• Object detection is a technology that classifies individual
objects and localizes each using a bounding box. Mask
R-CNN [16] adds a branch for predicting segmentation
masks on each Region of Interest (RoI), along with the
existing branch for classification and bounding box
regression. In Mask R-CNN, the additional mask output is
distinct from the class and box outputs, as it extracts a
finer spatial layout of an object. In contrast, Single
Shot Detection (SSD) [23] discretizes the output space of
bounding boxes into a set of default boxes over different
aspect ratios and scales per feature map location. The SSD
model completely eliminates proposal generation and the
subsequent pixel or feature resampling stage and encapsulates
all computation in a single network. This makes SSD easy
to train and integrate into systems that require a detection
component.
• Translation is the task of converting an input text from
one language to another. The Transformer [42] model
architecture avoids recurrence and relies on an attention
mechanism to capture global dependencies between input
and output. The attention weights apply to all symbols
in the sequences. On the other hand, Google's
Neural Machine Translation system (GNMT) [46] model
uses residual connections as well as attention connections.
GNMT provides a decent balance between the flexibility
of “character”-delimited models and the efficiency of
“word”-delimited models, and handles translation of rare
words.
• Recommendation is a task accomplished by a recommendation
system, which predicts the "rating" or "preference" a user
would give to an item. This benchmark uses the Neural
Collaborative Filtering (NCF) model [19], which can express and
generalize matrix factorization under its framework. To
supercharge NCF modeling with non-linearities, a multi-layer
perceptron can be utilized in this model to learn the user-item
interaction function.
• Reinforcement Learning is concerned with how software
agents should take actions in an environment to maximize
the notion of cumulative reward. This benchmark is based
on a fork of the mini-go project [37], inspired by
DeepMind's AlphaGo algorithm [34,36]. There are four phases
in this benchmark, repeated in order: selfplay, training,
target evaluation, and model evaluation. Moreover, this
architecture is also extended for Chess and Shogi [35].¹
The above mentioned MLPerf benchmarks use various
datasets for training, such as ImageNet [10], Microsoft COCO
[22], WMT17 [6], and MovieLens 20-million [12, 15].
Table 2 displays a summary of the various workloads of
MLPerf v0.5 release, including respective models as well as
the datasets used. The metric used by MLPerf is the time
taken to reach a specified accuracy or quality target, which
is also listed in Table 2 for each benchmark. MLPerf bench-
mark implementations provided by the submitters currently in-
clude frameworks such as PyTorch [32], MXNet [7] and Ten-
sorFlow [1]. Many of the workloads consume days of training
time on powerful GPUs, as indicated in Table 3 for MLPerf’s
reference machine which has an NVIDIA Tesla P100 GPU.
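The time-to-quality metric described above can be sketched as a small timing harness. This is a minimal illustration, not MLPerf's reference code: the function `time_to_quality`, the `ToyModel` class, and its fixed per-epoch quality gain are all hypothetical stand-ins for a real training loop and evaluation routine.

```python
import time


def time_to_quality(train_one_epoch, evaluate, quality_target, max_epochs=100):
    """Train until evaluate() reaches quality_target.

    Returns (elapsed_seconds, epochs_used), or None if the target
    is not reached within max_epochs.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= quality_target:
            return time.perf_counter() - start, epoch
    return None


class ToyModel:
    """Hypothetical model whose quality rises by a fixed step per epoch."""

    def __init__(self):
        self.quality = 0.0

    def train_one_epoch(self):
        self.quality = min(1.0, self.quality + 0.125)

    def evaluate(self):
        return self.quality


model = ToyModel()
# 0.749 is the image classification accuracy target listed in Table 2.
result = time_to_quality(model.train_one_epoch, model.evaluate, quality_target=0.749)
elapsed_seconds, epochs_used = result
print(f"reached target after {epochs_used} epochs in {elapsed_seconds:.6f} s")
```

The point of the sketch is that the reported number is wall-clock time, so both the per-epoch speed and the number of epochs needed to converge affect the score.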
In order to balance fairness and innovation, MLPerf takes
two approaches: the closed model division and the open model
division. The MLPerf closed model division specifies the model
to be used and restricts the values of hyperparameters, such as
batch size and learning rate, with the emphasis on fair
comparisons of the hardware and software systems. In contrast,
in the open model division, the same problem is required to be
solved using the same dataset but with fewer restrictions, with
the emphasis on advancing the state-of-the-art of ML [27].
¹ Since our evaluation focus is on MLPerf v0.5 on GPU platforms, and the
only GPU code of Reinforcement Learning is the reference one, which spends
more time on the CPU than the GPU, Reinforcement Learning is excluded in
the rest of the paper.
Table 3: Training time of MLPerf reference implementations of
the benchmarks on MLPerf's reference machine (consisting of one
NVIDIA Tesla P100 GPU).
| Benchmark | Training Time (mins.) |
|---|---|
| Image Classification | 8831.3 |
| Object Detection (SSD) | 827.7 |
| Object Detection (M-RCNN) | 4999.5 |
| Translation (Transformer) | 1869.8 |
| Translation (GNMT) | 1334.5 |
| Recommendation (NCF) | 46.7 |
2.2 DAWNBench
DAWNBench [9], developed by Stanford University in 2017,
evaluates deep learning systems across different optimization
strategies, model architectures, software frameworks, clouds,
and hardware. It supports benchmarking of Image Classifica-
tion on CIFAR10 [21] and ImageNet [10], and Question An-
swering on SQuAD [33]. DAWNBench assesses the perfor-
mance based on four metrics: training time to a specified val-
idation accuracy, cost (in USD) of training, average latency
of performing inference, and the cost (in USD) of inference.
It provides reference implementations and seed entries, imple-
mented in two popular deep learning frameworks: PyTorch [32]
and TensorFlow [1]. The hyperparameters that DAWNBench
considers for optimizations are optimizer for gradient descent,
minibatch size, and regularization.
2.3 DeepBench
DeepBench [3], released in 2016, and updated in 2017 [4], pri-
marily uses the neural network libraries to benchmark the per-
formance of basic operations on different hardware. The perfor-
mance characteristics of models built for various applications
are different from each other. DeepBench essentially bench-
marks the underlying operations such as dense matrix multipli-
cation, convolutions, recurrent layers, and communication. For
training, DeepBench specifies the minimum precision require-
ments as 16 and 32 bits for multiplication and addition, respec-
tively [3]. The benchmarks are written in CUDA and thus, are
more fundamental than any deep learning framework or model
implementation. Additionally, there is no concept of a quality
target.
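To make the idea of a kernel benchmark concrete, the following sketch times a dense matrix multiply and reports throughput, with no model and no quality target involved. It is a pure-Python illustration under stated assumptions, not DeepBench code (which is written in CUDA): the helper names `gemm` and `benchmark_gemm`, the matrix sizes, and the best-of-N timing policy are all hypothetical choices.

```python
import time


def gemm(a, b):
    """Dense matrix multiply C = A x B, with matrices as lists of rows."""
    inner = len(b)
    b_cols = [[b[k][j] for k in range(inner)] for j in range(len(b[0]))]
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def benchmark_gemm(m, n, k, repeats=3):
    """Time an (m x k) by (k x n) multiply; return (flops, gflops_per_sec, c)."""
    a = [[1.0] * k for _ in range(m)]
    b = [[2.0] * n for _ in range(k)]
    best = float("inf")
    c = None
    for _ in range(repeats):
        start = time.perf_counter()
        c = gemm(a, b)
        best = min(best, time.perf_counter() - start)
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops, flops / max(best, 1e-12) / 1e9, c


flops, gflops_per_sec, c = benchmark_gemm(32, 32, 32)
print(f"{flops} floating point operations, {gflops_per_sec:.4f} GFLOPs/sec")
```

The only outputs are an operation count and a rate, which is why such kernel benchmarks sit below any framework or model implementation.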
With research in the field of deep learning, various other
benchmarks have also appeared in the past, such as Fathom [2]
and Training Benchmark for DNNs (TBD) [50], but our study
is restricted to MLPerf, DAWNBench and DeepBench.
3 Methodology
3.1 System configurations
In this work, we used different system configurations for exper-
imentation, whose hardware specifications are highlighted in
Table 4. All the systems, except C4140 (B), operate on Ubuntu
16.04.4 LTS. The operating system on C4140 (B) is CentOS
Linux 7.
3.2 Benchmarks
The benchmarks we chose to conduct research on are as fol-
lows:
• GPU submissions of the MLPerf [27] training benchmarks,
which were made by Google (cloud) and NVIDIA
(on-premise). The submitted source code was optimized
for performance on the respective hardware. Among the
various submissions, we picked Google's submission on
8x Volta V100 and NVIDIA's submission on DGX-1, as
we had access to platforms with a maximum of 8 GPUs.
Note that, as there was no GPU submission for the
Reinforcement Learning benchmark (one of the MLPerf training
benchmarks), we exclude this benchmark from the study.
• From DAWNBench [9], for Image Classification (CI-
FAR10) training we selected the ResNet-18 implemen-
tation [5] provided by bkj, and for Question Answer-
ing (SQuAD) training we chose the DrQA implementa-
tion [48] submitted by Yang et al.
• In the case of DeepBench [3], we used four NVIDIA
training benchmarks: gemm_bench, conv_bench,
rnn_bench, and nccl_single_all_reduce. We omitted
nccl_mpi_all_reduce as training on different nodes
is not the focus of this work. The aggregated numbers
are used for all the kernels with different sizes, except for
rnn_bench, for which we only take six configurations
because the benchmark takes a lot of time to profile. The six
configurations used are Vanilla DeepSpeech (Units=1760,
N=16), LSTM Machine Translation (Input=512, N=16),
LSTM Language Modeling (Input=4096, N=16), LSTM
Character Language Modeling (Input=256, N=16), GRU
DeepSpeech (Units=2816, N=32), and GRU Speaker ID
(Units=1024, N=32).
Note that hyperparameters such as batch size and learning
rate were scaled accordingly to ensure that the run² completed
successfully on our experimental setup.
3.3 Measurement tools
nvprof: We use the nvprof profiler from the CUDA toolkit to
profile the Region of Interest (ROI) in the benchmarks. The
information collected includes the invocation and duration of
kernels, floating point operation counts, and memory read/write
transactions. With this information, we add data points
representing the machine learning workloads to the roofline plot.
dstat: Additionally, we used dstat [44] to obtain real-time
statistics of system resource usage such as CPU usage,
memory usage, disk activity, and network traffic. On UNIX
platforms, dstat offers more flexibility by combining vmstat
(virtual memory statistics) [41], iostat (storage input/output
statistics) [39], and netstat (network statistics) [40]. The
statistics can then be exported to comma-separated values for
further analysis. Moreover, we can extend the functionality of
dstat by adding plugins such as one to measure NVIDIA GPU
utilization [43].
² “A run is a complete execution of an implementation on a system, training
a model from initialization to the quality target.” - MLPerf [27]
Table 4: Hardware specifications of systems for experimentation.
| Systems (Dell PowerEdge) | T640 | C4140 (B) | C4140 (K) | C4140 (M) | R940 XA | DSS 8440 |
|---|---|---|---|---|---|---|
| CPUs (Intel Xeon Gold): Model # | 6148 | 6148 | 6148 | 6148 | 6148 | 6142 |
| CPUs: Base freq. | 2.40GHz | 2.40GHz | 2.40GHz | 2.40GHz | 2.40GHz | 2.60GHz |
| Memory (Samsung/Micron DDR4): # DIMM | 12 | 12 | 12 | 24 | 24 | 12 |
| Memory: Size | 16GB | 16GB | 16GB | 16GB | 16GB | 32GB |
| GPUs (NVIDIA Tesla V100): Form Factor | PCIe Full Height/Length | PCIe Full Height/Length | SXM2 | SXM2 | PCIe Full Height/Length | PCIe Full Height/Length |
| GPUs: Interconnect | PCIe & UPI³ | PCIe | NVLink | NVLink | UPI³ | PCIe & UPI³ |
| GPUs: # GPUs | 4 | 4 | 4 | 4 | 4 | 8 |
| GPUs: Memory | 32GB HBM2 | 16GB HBM2 | 16GB HBM2 | 16GB HBM2 | 32GB HBM2 | 16GB HBM2 |
(The Topology row of the table contains system diagrams that are not reproduced here.)
dmon: Finally, we also make use of dmon, which is available
in the NVIDIA System Management Interface (nvidia-smi) [31], to
get individual GPU usage statistics, including GPU streaming
multiprocessor usage, GPU memory usage, temperature,
frequency, and PCI Express bus usage. We also employ the
nvidia-smi feature that measures NVLink bus utilization using
hardware counters.
4 Benchmark Analysis
The analysis is presented on the optimized codes submitted by
Google and NVIDIA to MLPerf unless specified otherwise. It
may be noted from the MLPerf website that only three vendors
(Google, NVIDIA, and Intel) have submitted results to MLPerf,
and no vendor has submitted results for all benchmarks. The
effort to run MLPerf codes on the systems mentioned in Tables 3
and 5 was non-trivial, and some of the benchmarks are omitted
from some studies due to difficulties with runs. Kernel
statistics are available in the appendix.
4.1 Similarity/Dissimilarity analysis
We perform Principal Component Analysis (PCA) on 8 collected
workload characteristics (namely, PCIe utilization, GPU
utilization, CPU utilization, DDR memory footprint, HBM2
footprint, flop throughput, memory throughput, and number of
epochs), and visualize the distribution of the targeted machine
learning benchmarks in the workload space. This analysis helps
us to understand how similar and how different these benchmarks
are. In addition, we generate a dendrogram to help users pick a
given number of the most representative benchmarks according
to their time budget.
³ UPI: Ultra Path Interconnect
As shown in Figure 1a, MLPerf benchmarks are so different
from DeepBench kernels and DAWNBench benchmarks on PC1
that they form two isolated clusters (with outliers labeled) on
opposite sides. PC1 is dominated by GPU memory footprint.
This placement reflects the fact that DeepBench kernels and
DAWNBench benchmarks work on relatively smaller datasets
and cannot stress GPU memory as much as MLPerf benchmarks
can. On the PC2 axis, MLPerf benchmarks have a shorter
span than the other benchmarks do, mainly because MLPerf
benchmarks are optimized end-to-end applications with a stable
floating point operation throughput, while more diversity
exists in the other benchmarks (e.g., the communication
kernel Deep_Red_Cu even has zero floating point operations).
MLPerf benchmarks are more sparsely spread on the PC3-PC4
plane (Figure 1b), and cover what the other benchmarks cover.
The intra-suite diversity is exposed in Figure 1 as well. For
PC1 to PC4 (covering 88% of the variance), each MLPerf
benchmark extends the boundary at least once, and no two
MLPerf benchmarks are very close to each other.
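The PCA step can be sketched in a few lines. This is a minimal illustration and makes several assumptions that are not from the paper: the three metric names and the five rows of values are invented placeholder data, only the first principal component is computed, and power iteration is swapped in for a full eigendecomposition to keep the sketch dependency-free. The "dominant metric" is chosen as in the caption of Figure 1, i.e. the metric with the greatest absolute value in the eigenvector.

```python
import math


def standardize(rows):
    """Z-score each column so that metrics with different units are comparable."""
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out_cols.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*out_cols)]


def covariance(rows):
    """Sample covariance matrix of already-centered data."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=500):
    """Dominant eigenvector and eigenvalue via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue


# Hypothetical placeholder measurements (NOT data from the paper).
metrics = ["gpu_util", "hbm2_footprint", "flop_throughput"]
workloads = [
    [90.0, 14.0, 9.0],
    [85.0, 12.0, 8.0],
    [30.0, 2.0, 1.0],
    [40.0, 3.0, 6.0],
    [95.0, 15.0, 7.0],
]

cov = covariance(standardize(workloads))
pc1, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
dominant = metrics[max(range(len(pc1)), key=lambda i: abs(pc1[i]))]
print(f"PC1 explains {explained:.1%} of variance; dominant metric: {dominant}")
```

Standardizing first matters here because the collected characteristics mix percentages, byte counts, and throughputs; without it, the metric with the largest numeric range would dominate every component.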
Figure 1: The distribution of MLPerf, DAWNBench, and DeepBench
in the dominant principal component workload space: (a) PC1 - PC2;
(b) PC3 - PC4. The dominant metric is the one with the greatest
absolute value in the eigenvector of a principal component.
The dendrogram shown in Figure 2 presents the result
of linkage-distance-based hierarchical clustering, where each
benchmark starts as a leaf node and the two benchmarks closest
to each other (i.e., most similar) are linked first, for instance,
MLPf_Res50_TF and MLPf_Res50_MX. Beyond presenting the
similarities between benchmarks, the dendrogram facilitates
benchmark selection for users who do not want to, or cannot,
run all the benchmarks due to time or cost limitations.
For example, in Figure 2, the dashed line crosses 4 branches
and thereby identifies 4 clusters for users who can evaluate
only 4 benchmarks. Such a user is supposed to use
Dawn_DrQA_Py, MLPf_SSD_Py, one of MLPf_Res50_TF
and MLPf_Res50_MX (for example, MLPf_Res50_TF),
and one from the purple group containing all the remaining
benchmarks (for example, Deep_Red_Cu). As a validation of the
subsetting, we report the range of the 8 metrics covered by the
4 selected benchmarks with respect to all benchmarks: PCIe
utilization 0.3% to 100%, GPU utilization 0% to 95.6%, CPU
utilization 0% to 100%, DDR memory footprint 0% to 100%,
HBM2 footprint 0% to 100%, flop throughput 0% to 100%,
memory throughput 8.0% to 100%, and number of epochs
0% to 98.4%.
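The subsetting procedure (link the closest benchmarks first, then cut the tree into the desired number of clusters and pick one representative from each) can be sketched as follows. The paper does not state which linkage criterion or distance it uses, so average linkage over Euclidean distance is an assumption here, and the six named points are hypothetical coordinates, not the measured workload characteristics.

```python
import math
from itertools import combinations


def average_linkage(c1, c2, points):
    """Mean pairwise Euclidean distance between two clusters."""
    total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster(points, k):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[name] for name in points]
    merges = []
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], points),
        )
        distance = average_linkage(clusters[i], clusters[j], points)
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical workloads placed in a 2-D space (e.g. two principal components).
points = {
    "w1": (0.0, 0.0),
    "w2": (0.1, 0.0),
    "w3": (1.0, 1.0),
    "w4": (1.2, 1.0),
    "w5": (0.0, 1.0),
    "w6": (2.0, 0.0),
}

clusters, merges = cluster(points, k=4)
representatives = [c[0] for c in clusters]  # one benchmark per cluster
print("clusters:", clusters)
print("subset to run:", representatives)
```

Stopping at `k` clusters plays the role of the dashed cut line in the dendrogram: a larger time budget corresponds to a larger `k` and a cut at a smaller linkage distance.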
Figure 2: Dendrogram of MLPerf, DAWNBench and DeepBench
benchmarks. If a subset of 4 is desired, pick one from
each cluster intercepted by the vertical line at linkage distance
around 0.8. (Horizontal axis: Linkage Distance, 0.0 to 1.2.
Leaves in plotted order: Dawn_DrQA_Py, MLPf_SSD_Py,
MLPf_Res50_TF, MLPf_Res50_MX, Deep_Red_Cu, Deep_GEMM_Cu,
Deep_Conv_Cu, MLPf_MRCNN_Py, Dawn_Res18_Py, MLPf_XFMR_Py,
MLPf_NCF_Py, MLPf_GNMT_Py, Deep_RNN_Cu.)
4.2 Roofline analysis
A roofline model [45] is a visual representation of the maximum
attainable performance for a given workload on given hardware,
combining the processing core performance, memory
bandwidth, and the data locality.
Figure 3: V100 roofline model marked with MLPerf benchmarks
(labeled in blue), DAWNBench (labeled in red), and
DeepBench (labeled in cyan). Red, blue, green polylines show
the empirical limitations (from available memory bandwidth
and computational resources) for V100 to perform double,
single, and half-precision floating point operations (measured with
Empirical Roofline Toolkit [47]). (Vertical axis: GFLOPs / sec;
horizontal axis: FLOPs / Byte, 10^-1 to 10^2. Roof labels in the
plot: HP 27219.9 GFLOPs/sec, SP 14190.8 GFLOPs/sec,
DP 7002.0 GFLOPs/sec; bandwidth labels: 439.6 GB/s, 541.0 GB/s,
741.7 GB/s.)
Figure 3 presents the roofline model for a single Tesla V100
GPU and the machine learning workloads we studied. The runs
were carried out on the T640 system, invoking just one GPU.
The vertical axis represents the attainable compute throughput,
expressed in floating point operations per second (FLOPs/sec).
The horizontal axis denotes the arithmetic intensity, i.e., the
number of floating point operations performed per byte of data
transferred (FLOPs/Byte). Memory-bound workloads have lower
arithmetic intensity, hence their performance is limited by
memory bandwidth (corresponding to the slope of the slanted
lines in Figure 3). Compute-bound workloads have high enough
arithmetic intensity that their performance is limited by the
computational resources (corresponding to the horizontal lines
in Figure 3). We indicate the location of different machine
learning workloads with points in different shapes. Workloads
from the same benchmark suite are assigned the same color.
As we can see from Figure 3, MLPerf benchmarks are more
optimized than DeepBench kernels, with more data reuse and
hence higher arithmetic intensity, while the two DAWNBench
workloads show even higher arithmetic intensities with higher
throughputs. Nevertheless, all the workloads are memory-bound
(they have not crossed the turning point to touch the horizontal
lines). This observation implies that memory is the system
bottleneck for machine learning workloads, and that we should
dedicate more resources to the memory interface for a
well-balanced system.
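The roofline bound itself is a one-line formula: attainable throughput is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The sketch below evaluates it for illustration. The two numeric constants are labels read off Figure 3, but pairing this particular bandwidth with this particular peak is an assumption, and the function names are hypothetical.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_s):
    """Roofline bound: min(compute roof, memory roof) in GFLOPs/sec."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOPs/Byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gb_s


# Labels taken from Figure 3; their pairing is an assumption of this sketch.
SP_PEAK_GFLOPS = 14190.8
BANDWIDTH_GB_S = 741.7

ridge = ridge_point(SP_PEAK_GFLOPS, BANDWIDTH_GB_S)
for intensity in (1.0, 4.0, 100.0):
    bound = attainable_gflops(intensity, SP_PEAK_GFLOPS, BANDWIDTH_GB_S)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:6.1f} FLOPs/Byte -> {bound:9.1f} GFLOPs/sec ({regime})")
```

A workload to the left of the ridge point sits under the slanted part of the roof, which is the situation the paragraph above describes for all of the studied workloads.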
4.3 Sensitivity of MLPerf models to Mixed Precision Training using Tensor Cores
Figure 4: Mixed Precision training (supported by Tensor Cores)
results in 1.5× to 3.3× speedups over single precision. (Note
that the time of NCF_Py is in seconds.) (Bar chart of time to
train with mixed precision and with single precision, annotated
with the speedup, for Res50_TF, SSD_Py, MRCNN_Py, GNMT_Py,
and NCF_Py.)
Prior work [13, 20, 25] suggests that reduced-precision
training helps deep learning training in the following ways:
• Lowering the on-chip memory requirement for the neural
network models.
• Reducing the memory bandwidth requirement by accessing
fewer or equal bytes compared to single precision.
• Accelerating the math-intensive operations, especially on
GPUs with Tensor Cores.
Typically, only some pieces of data employ reduced precision,
leading to mixed-precision implementations. Moreover,
employing mixed precision for training is getting easier for
programmers with the release of NVIDIA's Automatic Mixed
Precision (AMP) [30] feature on different frameworks like
TensorFlow [1], PyTorch [32] and MXNet [7]. Figure 4 shows the
speedup observed in different MLPerf training benchmarks by
employing half precision along with single precision
when tested on DSS 8440 using 8 GPUs. The observed speedup
ranges from 1.5× for MRCNN_Py to 3.3× for Res50_TF.
Thus, it can be inferred that MLPerf, an end-to-end benchmark
suite, is capable of testing the reduced-precision support of
processors, in this case Tensor Cores.
4.4 Compiler optimization impact on Benchmark performance
Deep learning frameworks offer building blocks for designing,
training and validating deep neural networks through a
high-level programming interface. They rely on GPU-accelerated
libraries such as cuDNN [8] and NCCL [28] to deliver high
performance for single-GPU as well as multi-GPU accelerated
training. In Figure 5, we can see that MLPf_Res50_TF takes around
270 minutes. These experiments are performed on C4140 (K)
using all 4 GPUs. It is interesting to note that the
TensorFlow/XLA JIT (just-in-time) compiler [38] optimizes
TensorFlow computations and reduces the execution time by about
40% for this use case. XLA uses JIT compilation techniques to
analyze and optimize the TensorFlow subgraphs created by the
user at runtime. Some optimizations are specialized for the
target device. The compiler then fuses multiple operators (kernel
fusion) together and generates efficient native machine code for
the device. This reduces both the execution time and
the required memory bandwidth for the application.
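Why kernel fusion reduces memory bandwidth can be illustrated with a toy example. This is a conceptual sketch in plain Python, not XLA and not GPU code: each list comprehension stands in for one kernel that reads its input from memory and writes its output back, and the choice of a scale, bias, and ReLU chain as well as all function names are hypothetical.

```python
def unfused(xs, scale, bias):
    """Three separate 'kernels', each materializing an intermediate buffer."""
    scaled = [x * scale for x in xs]          # kernel 1: read xs, write scaled
    shifted = [s + bias for s in scaled]      # kernel 2: read scaled, write shifted
    return [max(0.0, v) for v in shifted]     # kernel 3: read shifted, write output


def fused(xs, scale, bias):
    """One 'kernel': intermediates stay in registers, never touching memory."""
    return [max(0.0, x * scale + bias) for x in xs]


def element_transfers(n, num_kernels):
    """Each elementwise kernel reads n elements and writes n elements."""
    return 2 * n * num_kernels


xs = [-2.0, -0.5, 0.0, 1.5]
assert unfused(xs, 2.0, 1.0) == fused(xs, 2.0, 1.0)
print("result:", fused(xs, 2.0, 1.0))
print("unfused transfers:", element_transfers(len(xs), 3))
print("fused transfers:  ", element_transfers(len(xs), 1))
```

In this toy accounting the fused version moves one third of the data for the same arithmetic, which raises arithmetic intensity in the sense of the roofline model.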
Figure 5: Image Classification (Res50_TF) reaches the desired
accuracy in 60% of the time if the compiler uses XLA optimization.
(Plot of accuracy versus time in minutes for TensorFlow v1.12
with and without XLA.)
Figure 6: Scheduling a mix of MLPerf workloads on 4 GPUs:
(a) naive scheduling, which distributes the benchmarks one by
one across all the GPUs; (b) optimal scheduling, found by
searching through the possible space, saves 2.8 hours. (Both
panels plot time in hours against GPU index 0 to 3.)
4.5 Scalability of the benchmarks
The scalability study is performed on a system with 8 GPUs,
the DSS 8440, where the number of GPUs employed to train
the model is controlled. Ideally performance speedup of using
2 GPUs, 4 GPUs and 8 GPUs over 1 GPU should be 2×, 4× and
8×, respectively. Table 5 shows the scalability trends for every
MLPerf benchmark except for GNMT Py. The training time us-
ing a single GPU is also added to provide a better understand-
ing. We can see that some of the benchmarks like Res50 TF,
Res50 MX, and SSD Py scale well with the number of GPUs
Table 5: Scaling efficiency for distributed training.
             Training time on     Scalability (speedup)
Benchmark    1×V100 (minutes)     1-to-2   1-to-4   1-to-8
Res50 TF     1016.9               1.92×    3.84×    7.04×
Res50 MX     957.0                1.92×    3.76×    5.92×
SSD Py       206.1                1.94×    3.72×    7.28×
MRCNN Py     1840.4               1.76×    2.64×    5.60×
XFMR Py      636.0                1.42×    2.92×    5.60×
NCF Py       2.2                  1.88×    2.16×    2.32×
while for others, increasing the number of GPUs beyond a certain
point yields diminishing returns. For instance, in the case
of Res50 TF, when the number of GPUs is increased from 1 to 2,
1 to 4, and 1 to 8, training time improves by
approximately 1.9×, 3.8×, and 7×, respectively. In contrast,
for NCF Py the speedup achieved over a single GPU is
1.9×, 2.2×, and 2.3× when the number of GPUs is increased
to 2, 4, and 8, respectively. This data does not justify increasing
the number of GPUs beyond 2 for training the Recommendation
benchmark. We believe the small dataset (MovieLens 20-million)
causes this behavior: a small dataset
limits the maximum batch size, which in turn restricts the
scalability of the benchmark. A few other benchmarks, such as
XFMR Py and MRCNN Py, fall between the most and the least
scalable ones, providing a speedup of roughly 1.6×,
2.8×, and 5.6× for 2, 4, and 8 GPUs, respectively.
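The speedups in Table 5 can be turned into a scaling efficiency (speedup divided by GPU count, with 100% meaning ideal linear scaling). The sketch below is a minimal illustration using the Table 5 values; the helper name `scaling_efficiency` and the dictionary layout are our own assumptions for presentation.

```python
GPU_COUNTS = (2, 4, 8)

# Speedups over a single V100, copied from Table 5.
SPEEDUPS = {
    "Res50 TF": (1.92, 3.84, 7.04),
    "Res50 MX": (1.92, 3.76, 5.92),
    "SSD Py": (1.94, 3.72, 7.28),
    "MRCNN Py": (1.76, 2.64, 5.60),
    "XFMR Py": (1.42, 2.92, 5.60),
    "NCF Py": (1.88, 2.16, 2.32),
}


def scaling_efficiency(speedup, n_gpus):
    """Fraction of ideal linear speedup achieved on n_gpus GPUs."""
    return speedup / n_gpus


efficiency = {
    name: tuple(scaling_efficiency(s, n) for s, n in zip(speedups, GPU_COUNTS))
    for name, speedups in SPEEDUPS.items()
}

for name, eff in efficiency.items():
    cells = "  ".join(f"{n}-GPU: {e:5.1%}" for n, e in zip(GPU_COUNTS, eff))
    print(f"{name:9s} {cells}")
```

With these numbers, Res50 TF retains 88% efficiency on 8 GPUs, while NCF Py drops to 29%, which matches the qualitative ranking in the text.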
Such differences in scalability between workloads give users
hints on how to schedule a mix of machine learning training
tasks. The naive scheduling scheme, which sequentially distributes
every workload across all resources at once, avoids fragmentation
and keeps the resources busy all the time. However, it may not
be the most efficient in terms of total training time, because
users with multiple GPUs can choose to distribute the scalable
workloads while running workloads with poor
scalability simultaneously, each on fewer GPUs. Thus,
system administrators of supercomputing clusters
might be interested in finding an effective algorithm to schedule
the various machine learning training jobs submitted by
researchers, developers, and other machine learning
users. To show the potential benefit, we search through all
permutations of scheduling 7 MLPerf benchmarks on multiple
GPUs; Figure 6 presents the 4-GPU schedule for illustration.
In each subfigure, the available GPUs are listed along the x-axis,
with vertical dashed lines as their timelines. Different color
shades under the timeline correspond to the executions of the
7 different MLPerf workloads. Figure 6b shows the shortest
schedule of the 7 MLPerf benchmarks on 4 GPUs. Compared
with the naive scheduling in Figure 6a, it saves about
2.8 hours in finishing all the training tasks. In the optimal
schedule, the workloads chosen to be distributed on 4 GPUs, namely
XFMR Py and SSD Py, are the scalable benchmarks we observed
above. MRCNN Py gets two GPUs due to
its medium scalability. The two Image Classification workloads,
Res50 MX and Res50 TF, are each assigned a single GPU
to achieve faster overall training time. Note that two similar
workloads running in parallel finish sooner
than running them one after another in a distributed fashion, even if they
are highly scalable. Similarly, optimal scheduling could save
around 4.1 hours and 0.4 hours for the 2-GPU and 8-GPU settings,
respectively. It is worth mentioning that this performance gain
requires no effort in optimizing the software or adding costly
hardware.
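The claim that two similar, highly scalable workloads finish sooner side by side than one after another in distributed fashion can be checked with a small calculation. The sketch below is only an illustration, not the exhaustive search used for Figure 6: it borrows the single-GPU times and 1-to-2 speedups from Table 5 (measured on the DSS 8440) and assumes, as a simplification, that two jobs running on separate GPUs do not slow each other down. The function names are hypothetical.

```python
# Single-GPU training time in minutes and 1-to-2 GPU speedup (Table 5).
T1_MINUTES = {"Res50 TF": 1016.9, "Res50 MX": 957.0}
SPEEDUP_2GPU = {"Res50 TF": 1.92, "Res50 MX": 1.92}


def naive_makespan(jobs):
    """Run the jobs one after another, each distributed over both GPUs."""
    return sum(T1_MINUTES[j] / SPEEDUP_2GPU[j] for j in jobs)


def parallel_makespan(jobs):
    """Run each job on its own GPU at the same time (assumes no interference)."""
    return max(T1_MINUTES[j] for j in jobs)


jobs = ["Res50 TF", "Res50 MX"]
naive = naive_makespan(jobs)
parallel = parallel_makespan(jobs)
print(f"naive: {naive:.1f} min, parallel: {parallel:.1f} min, "
      f"saved: {naive - parallel:.1f} min")
```

Under these assumptions the side-by-side schedule finishes in about 1017 minutes versus about 1028 minutes for the naive one. The gap comes entirely from the speedup being 1.92× rather than the ideal 2×, so it widens for less scalable workloads.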
5 System Level Measurements
In this section, we present observations from system-level
utilization measurements, performed using measurement
tools such as dstat and dmon, to better understand
the impact of running DL training workloads and the system
requirements of the different models. The experiments are
performed on the C4140 (K) system by appropriately regulating
the number of GPUs.
Table 6: System resource usage statistics on C4140 (K). Utilization
and footprint increase with use of more GPUs.
                         Utilization      Memory Footprint    Bus Utilization
Benchmark       # GPUs   CPU     GPU      System    GPU       PCIe      NVLink
                         (%)     (%)      (MB)      (MB)      (MBps)    (MBps)
MLPf Res50 TF   1xV100   10.76   85.84    17,922    15,927    1,251     0
                2xV100   16.25   188.08   18,521    31,896    2,609     967
                4xV100   29.06   372.43   19,970    62,214    4,269     2,867
MLPf Res50 MX   1xV100   4.56    85.84    7,091     10,343    1,251     0
                2xV100   9.16    190.90   14,924    20,605    6,913     1,871
                4xV100   18.12   378.94   28,781    40,959    11,480    21,755
MLPf SSD Py     1xV100   3.89    96.13    4,100     15,406    4,720     0
                2xV100   7.21    180.58   10,305    30,772    6,998     509
                4xV100   13.69   334.84   20,273    60,539    9,791     1,500
MLPf MRCNN Py   1xV100   2.45    62.46    7,208     4,762     258       0
                2xV100   4.83    144.40   13,561    15,933    2,219     2,472
                4xV100   10.39   283.88   24,923    33,935    3,444     6,547
MLPf XFMR Py    1xV100   1.80    91.14    3,992     14,926    47        0
                2xV100   3.35    189.30   7,167     29,493    123       11,247
                4xV100   6.39    376.91   14,244    58,229    249       35,862
MLPf GNMT Py    1xV100   1.91    89.94    7,210     12,098    2,743     0
                2xV100   3.32    185.71   13,561    24,479    4,609     1,508
                4xV100   6.41    360.89   24,923    46,016    7,692     33,262
MLPf NCF Py     1xV100   0.76    96.39    1,550     13,870    42        0
                2xV100   2.41    194.44   3,077     24,847    110       17,887
                4xV100   5.69    333.11   5,978     39,634    200       75,051
Dawn Res18 Py   1xV100   4.67    76.90    2,670     2,056     176       0
Dawn DrQA Py    1xV100   48.84   20.30    6,721     2,657     52        0
Deep GEMM Cu    1xV100   1.80    99.60    333       1,067     13        0
Deep Conv Cu    1xV100   1.73    99.10    948       783       13        0
Deep RNN Cu     1xV100   1.80    94.80    994       2,536     3,747     0
Deep Red Cu     1xV100   0.75    91.30    313       631       27        0
                2xV100   0.96    193.20   430       994       86        77,992
                4xV100   1.68    366.24   1,123     2,320     134       404,376
5.1 CPU utilization across different workloads
In the previous section, we presented the scalability of each
benchmark for one-, two-, four-, and eight-GPU runs. Although
most of the computation of the MLPerf benchmark submissions
in this paper is offloaded to the GPUs, it is interesting to
see how the CPU is utilized during the execution of the benchmarks
as we increase the number of GPUs. We run each workload
on the C4140 (K) platform, configured to use
one, two, or all four GPUs available on that platform. We monitor
the CPU usage periodically using dstat.
The average CPU usage while running on one, two, and four
GPUs is summarized in Table 6. Note that the average CPU usage
includes operating system usage (e.g., kernel, low-level drivers)
as well as that of user programs. In general, as
we double the number of GPUs used to run each workload,
the utilization of the CPU roughly doubles. This trend is
observable for all submissions to MLPerf, which indicates that
the CPU must have adequate performance to keep all GPUs
busy; otherwise, it will become a bottleneck during the run.
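How closely CPU utilization tracks the GPU count can be read off Table 6 by dividing consecutive rows. The following minimal sketch does this for a subset of the MLPerf rows; the helper `growth_factors` and the choice of rows are illustrative assumptions.

```python
# Average CPU utilization (%) from Table 6, keyed by number of GPUs.
CPU_UTIL = {
    "MLPf Res50 TF": {1: 10.76, 2: 16.25, 4: 29.06},
    "MLPf Res50 MX": {1: 4.56, 2: 9.16, 4: 18.12},
    "MLPf SSD Py": {1: 3.89, 2: 7.21, 4: 13.69},
    "MLPf NCF Py": {1: 0.76, 2: 2.41, 4: 5.69},
}


def growth_factors(util):
    """Return the (1-to-2 GPU, 2-to-4 GPU) CPU utilization growth factors."""
    return util[2] / util[1], util[4] / util[2]


growth = {name: growth_factors(util) for name, util in CPU_UTIL.items()}

for name, (g12, g24) in growth.items():
    print(f"{name:14s} 1->2: {g12:.2f}x   2->4: {g24:.2f}x")
```

For the rows included here, the factor per doubling ranges from about 1.5× (Res50 TF, one to two GPUs) to about 3.2× (NCF Py, one to two GPUs), with Res50 MX very close to 2× at both steps.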
Among the MLPerf submissions, MLPf Res50 TF has the
highest CPU utilization, followed by MLPf Res50 MX. This
is because, compared to other workloads, both Image Classification
benchmarks require the CPU to perform more packaging
of the data before dispatching it to the GPUs and more
post-processing after the GPUs finish the requested tasks.
Moreover, the dataset used for the Image Classification benchmark
is significantly bigger (around 300GB) than the datasets
for other benchmarks. Since it is not feasible to store such a
large amount of data in GPU memory, the CPU has to stage
small parts of the dataset that fit in GPU memory
at one time. The GPU can then perform a partial computation.
This copying back and forth between CPU memory and GPU
memory also increases CPU utilization. MLPf NCF Py
shows the lowest CPU utilization, followed by MLPf GNMT Py
and MLPf XFMR Py. The Object Detection workloads are in
the middle in terms of CPU utilization.
Another observation that we would like to highlight comes
from Dawn DrQA Py. Although this benchmark runs on a single
GPU, it has the highest CPU usage of all the workloads
in Table 6. This benchmark also
shows the lowest GPU utilization among all the workloads, around
20%, which indicates that a major part of the computation is
performed on the CPU, with few tasks that can be offloaded to
the GPU.
5.2 GPU utilization for different workloads
The GPU utilization given in Table 6 is the sum of the utilization
of every GPU used during the run. Therefore,
single-, dual-, and quad-GPU runs have a maximum
utilization of 100%, 200%, and 400%, respectively. The two
Image Classification workloads, MLPf Res50 TF and
MLPf Res50 MX, show near-identical GPU utilization, with
around 85% GPU usage for the single-GPU run, around 190%
GPU usage (i.e., around 95% utilization per GPU) for the dual-GPU
run, and around 375% GPU usage (i.e., around 94%
utilization per GPU) for the quad-GPU run.
Most of the submissions to MLPerf show a similar trend
for single-GPU and dual-GPU runs. However, MLPf NCF Py
shows lower per-GPU usage for the quad-GPU run
than for the dual-GPU run. This observation agrees with the
one in Section 4.5: because the batch size for this workload
has limited room to grow, it is unable to utilize
the GPUs efficiently. The increase in communication
cost for multi-GPU runs, which can reduce individual GPU
utilization, is confirmed by the Deep Red Cu benchmark from
DeepBench, which shows the same trend.
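Because Table 6 reports GPU utilization summed over all GPUs, the per-GPU figures quoted above are obtained by dividing by the GPU count. A minimal sketch of that normalization, using three rows of Table 6 (the helper name `per_gpu` is our own):

```python
# Summed GPU utilization (%) from Table 6, keyed by number of GPUs.
GPU_UTIL_SUM = {
    "MLPf Res50 TF": {1: 85.84, 2: 188.08, 4: 372.43},
    "MLPf NCF Py": {1: 96.39, 2: 194.44, 4: 333.11},
    "Deep Red Cu": {1: 91.30, 2: 193.20, 4: 366.24},
}


def per_gpu(util_sum):
    """Average utilization per GPU for each GPU count."""
    return {n: total / n for n, total in util_sum.items()}


per_gpu_util = {name: per_gpu(u) for name, u in GPU_UTIL_SUM.items()}

for name, util in per_gpu_util.items():
    cells = "  ".join(f"{n} GPU: {v:5.1f}%" for n, v in util.items())
    print(f"{name:14s} {cells}")
```

With these values, NCF Py falls from about 97% per GPU on two GPUs to about 83% on four, and Deep Red Cu from about 97% to about 92%, which is the decreasing trend described in the text.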
5.3 CPU and GPU memory footprint
The system memory is mostly used to store the training dataset
as well as the intermediate data required
between computations. When the dataset is too large
to fit in the GPU memory, the system memory acts as a buffer to
store the dataset. The user program will move the data back and
forth between the system and GPU memory to perform partial
calculations. Moreover, in an extreme case, the dataset can be
too large to be stored inside the system memory. In that case, disk
storage (e.g., hard disk drive, solid state drive) is used to store
it, and the CPU is responsible for coordinating the switching
between the parts of the dataset.
From Table 6, we can notice that the system memory foot-
print roughly doubles every time we double the number of
GPUs. The GPU memory footprint is the total memory footprint
of every GPU used during the run. Note that the footprint
of GPU memory depends on the batch size, and the batch
sizes for the experiments are scaled accordingly from the original
submissions, as mentioned in Section 3.2.
Although the table only shows the memory footprint of each
benchmark, we would like to emphasize that the heterogeneity
of the media where the dataset is stored may become a
bottleneck, especially for memory-bound applications that
exchange data frequently. In this case, the interconnect
bandwidth between the storage media, and the ability
of the program to complete each data transfer just before the next
computation needs it and to manage data locality, can be
crucial factors.
In our C4140 (K) platform, for example, each CPU has
96GB of memory consisting of six 16GB DDR4-2666 DIMMs
in a six-channel memory configuration. The theoretical unidirectional
memory bandwidth available to each CPU is around
128GBps [11], while Intel's proprietary Ultra Path Interconnect
(UPI), which links the two CPUs, has a theoretical unidirectional
bandwidth of only 20.8GBps [26]. When a CPU
needs a part of the dataset stored in the other CPU's memory, the
data transfer performance is significantly reduced (i.e.,
128GBps for direct access to local DRAM vs. 20.8GBps for neighbor
DRAM access via UPI).
The same applies to a GPU, which has more limited
dedicated memory. In our C4140 (K) platform, each Nvidia
Tesla V100 is equipped with 16GB of HBM2 stacked memory,
which is capable of 450GB/s unidirectional bandwidth. When
the dataset cannot be fully stored inside the GPU memory,
the CPU must bring a part of the dataset from the system
memory into the GPU memory. This data exchange uses the PCIe
3.0 bus that connects the CPU and GPU, which provides a
theoretical unidirectional bandwidth of 15.8GBps for x16 lanes
and therefore limits the performance of data transfer.
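To put these bandwidth figures side by side, the sketch below estimates the best-case time to move 16GB (the capacity of one V100's memory) over each link, using only the theoretical unidirectional bandwidths quoted above. The 16GB payload and the helper name `transfer_seconds` are illustrative assumptions, and real transfers will be slower than these lower bounds.

```python
# Theoretical unidirectional bandwidths in GB/s, as quoted in the text.
BANDWIDTH_GBPS = {
    "GPU HBM2": 450.0,
    "Local DRAM": 128.0,
    "Remote DRAM via UPI": 20.8,
    "PCIe 3.0 x16": 15.8,
}


def transfer_seconds(size_gb, link):
    """Lower bound on the time to move size_gb gigabytes over the given link."""
    return size_gb / BANDWIDTH_GBPS[link]


PAYLOAD_GB = 16.0  # assumed payload: one V100's worth of memory
times = {link: transfer_seconds(PAYLOAD_GB, link) for link in BANDWIDTH_GBPS}

for link, seconds in times.items():
    print(f"{link:20s} {seconds * 1000:8.1f} ms")
```

Under these figures, filling a GPU's memory over PCIe 3.0 x16 takes about one second at best, roughly 28 times longer than reading the same amount from the GPU's own HBM2, and local DRAM access is about 6 times faster than access through UPI.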
Figure 7: Training time on 4-GPU systems. Training time on systems with the NVLink interconnect (the first 2 bars) is lower than
on the remaining systems. (Note that the time for NCF Py is in seconds.)
5.4 System and GPU bus utilization
In the previous section, we mentioned that the interconnection
bus between CPU and GPU and between GPUs may play an
important role in determining the overall system performance.
Moreover, the choice of interconnection
topology between CPU and GPU should be considered
carefully. In this section, we explain in more detail
how the performance is impacted by the interconnection bus,
based on the data in Table 6.
Modern microprocessor systems use the PCI Express (PCIe) bus
as the interconnection standard between the CPU and external
peripherals that require high-speed data communication. The PCIe 3.0
standard, introduced in 2010, has been widely adopted by most
computer system products available in today’s market. PCIe
3.0 provides a theoretical unidirectional bandwidth of up to 984.6
MBps per lane and up to 15.8 GBps per PCIe 3.0 compatible
device connected using 16 PCIe 3.0 lanes (PCIe 3.0 x16). This
bandwidth, in theory, should be sufficient for most
peripheral devices, including GPUs, network interface cards,
and non-volatile memory storage.
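The per-lane and x16 figures follow from the PCIe 3.0 signaling rate. The sketch below reproduces them; the 8 GT/s transfer rate and 128b/130b line encoding are properties of the PCIe 3.0 standard that are not stated in the text above, and the function name is our own.

```python
TRANSFER_RATE_GT_S = 8.0           # PCIe 3.0 raw rate per lane (giga-transfers/s)
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding overhead


def pcie3_bandwidth_mbps(lanes):
    """Theoretical unidirectional PCIe 3.0 bandwidth in MB/s for a lane count."""
    bits_per_second = TRANSFER_RATE_GT_S * 1e9 * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e6   # bits -> bytes -> megabytes


for lanes in (1, 8, 16):
    mbps = pcie3_bandwidth_mbps(lanes)
    print(f"x{lanes:<2d} {mbps:9.1f} MB/s  ({mbps / 1000:.2f} GB/s)")
```

This gives about 984.6 MB/s for a single lane and about 15.75 GB/s for x16, which rounds to the 15.8 GBps quoted above; an x8 link gets half of that.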
Usually, a GPU is connected to the CPU using PCIe 3.0 x16
to ensure that there is plenty of bandwidth between them. High
bandwidth is easy to achieve for a single-GPU system, but more
complicated for a multi-GPU system, since the number of PCIe
3.0 lanes that the CPU has is limited. A high-end Intel Xeon
may have up to 48 lanes of PCIe 3.0, which are then allocated
to various devices. With this constraint, each GPU in a four-GPU
system, for example, may only be assigned eight PCIe
3.0 lanes. While it depends on how we use the GPU and how
intensive the data exchange between the CPU and GPU is,
some applications, like gaming, may find that PCIe 3.0 x8 already
provides plenty of bandwidth. On the other hand, this much
bandwidth may not be optimal for deep learning training.
Alternatively, a PCIe switch, such as those manufactured by
PLX Technology, can be used to provide additional PCIe lanes,
so that each GPU can have PCIe 3.0 x16 lanes. This switch is
useful for GPU-to-GPU communication, since the data exchange
takes place on the switch without going through
the CPU. However, the switch does not improve
the bandwidth between the CPU and all GPUs on the system,
as the effective CPU-to-GPU bandwidth is still limited by what
the CPU provides. We discuss the interconnection topology and
how it affects performance in detail in Section 5.5.
Furthermore, apart from CPU-to-GPU communication, the PCIe
bus can be used for GPU-to-GPU communication in a multi-GPU
system. Although each GPU can be allocated
PCIe 3.0 x16 lanes, the available bandwidth may not be sufficient for
some workloads that require intensive data exchange between
the GPUs. Therefore, additional buses have been developed
specifically for GPU-to-GPU communication, such as
NVLink, a high-speed proprietary interconnect
for NVIDIA GPUs. Each NVLink lane provides 25 GBps of theoretical
unidirectional bandwidth. The Nvidia Tesla V100 GPU
in the SXM2 form factor has six NVLink lanes, which together are capable
of transferring data with a theoretical unidirectional bandwidth
of 150GBps. This is significantly faster than what PCIe 3.0
x16 can offer.
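The aggregate NVLink figure and its advantage over PCIe 3.0 x16 follow directly from the numbers in the text. A minimal worked calculation (the variable names are our own):

```python
NVLINK_GBPS_PER_LANE = 25.0    # per-lane unidirectional bandwidth, from the text
NVLINK_LANES_V100_SXM2 = 6     # lanes on a Tesla V100 SXM2, from the text
PCIE3_X16_GBPS = 15.8          # PCIe 3.0 x16 unidirectional bandwidth, from the text

nvlink_total_gbps = NVLINK_GBPS_PER_LANE * NVLINK_LANES_V100_SXM2
nvlink_vs_pcie = nvlink_total_gbps / PCIE3_X16_GBPS

print(f"NVLink aggregate: {nvlink_total_gbps:.0f} GB/s, "
      f"{nvlink_vs_pcie:.1f}x PCIe 3.0 x16")
```

That is, the six lanes together offer roughly 9.5 times the theoretical unidirectional bandwidth of a PCIe 3.0 x16 link.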
Table 6 shows the PCIe 3.0 bus utilization between the CPU and
the GPUs available on the system, as well as the NVLink utilization
between GPUs. The values presented in the table are the
sum of the bidirectional PCIe 3.0 bus utilization of each GPU
used during the run, and the sum of the NVLink lane utilization
of each GPU used during the run. As we can see
from the table, the data transfer rate over the NVLink bus increases
as we add more GPUs to the run. Deep Red Cu and
MLPf NCF Py use the highest NVLink bandwidth, which
means that the data exchange between GPUs for those benchmarks
is intensive. The utilization of the PCIe
3.0 bus also increases as we add more GPUs, as expected. In
a multi-GPU system equipped with NVLink, the PCIe 3.0 bus
is used only for communication between the CPU and each GPU,
because GPU-to-GPU communication is offloaded
to the higher-speed NVLink.
5.5 Impact of GPU-Interconnect Topology
To reduce training time, it is becoming increasingly common
to scale deep learning (DL) training across multiple GPUs
within a system. There are multiple ways in which GPUs can
be connected within the system. Primarily, there are two options
available: a PCIe based interconnect (which may include
PCIe switches if the number of lanes from the CPU is not
sufficient) and NVIDIA’s proprietary interconnect,
NVLink. The theoretical bandwidth of an NVLink interconnect
is about 10× higher than that of PCIe (300 GB/s vs. 32 GB/s) [29].
Additionally, communication libraries like NCCL from NVIDIA
are optimized to perform GPUDirect peer-to-peer (P2P) direct
access when NVLink is available between GPUs, which can
lower training times if there is significant peer-to-peer communication
during model training. GPUDirect P2P is also feasible
in certain PCIe topology designs where the GPUs are in the same
PCIe domain (single root complex). Using MLPerf, we conduct
a performance evaluation of five different 4-GPU platforms,
each with a unique GPU interconnect topology. Table 4
shows how the GPUs are interconnected for the servers
used in this study.
Figure 8: Utilization of CPU (left) and GPU (right) for single-GPU run, 8 independent uniform single-GPU runs, distributed
8-GPU run, and 8 independent mixed single-GPU runs. Higher CPU utilization in multiple-run vs distributed in 5 of 7 cases.
Two of the five servers, C4140 (M) and C4140 (K), include
the high-speed proprietary NVLink interconnect, providing
100GB/s bandwidth between any two GPUs. The difference
between the two NVLink based designs is the use of a PCIe
switch in the C4140 (K) to aggregate the PCIe connections to
the GPUs. The remaining three systems use PCIe based interconnects.
They take very different approaches to how the
GPUs are connected to the CPUs and how they communicate
with other GPUs. One system, the C4140 (B), uses a 96-lane PCIe
switch that allows 4 GPUs to be hosted in a single PCIe
domain, where GPUDirect peer-to-peer (P2P) transfers between
the GPUs go through the PCIe switch. This is not feasible
in the other two PCIe based platforms: the T640,
where two GPUs are hosted per CPU, and the R940 XA, which is
a 4-CPU platform with each GPU connected directly to the
PCIe lanes of a CPU.
The training times for the different servers are plotted in Figure 7,
which illustrates the impact of GPU interconnect topology
on DL training times. As expected, due to the lack of GPUDirect
P2P capability between any of the GPUs, the T640 and
R940 XA take the longest time to train all the MLPerf models.
Conversely, the two servers that use the NVLink interconnect
(the C4140 (M) and (K) systems) show the best training times
across all the MLPerf models. However, the performance improvement
depends on the model being trained,
ranging from 42% and 17% for the Translation benchmarks and
30% for MLPf MRCNN Py down to 11% for the Image Classification
benchmarks. The C4140 (B), which uses a PCIe topology
but can perform GPUDirect P2P between GPUs because all
GPUs are connected to a PCIe switch, shows performance parity with
the NVLink platforms for the Image Classification benchmarks
and better performance than the R940 XA and T640 servers for
the remaining benchmarks. This platform combines the flexibility
of PCIe based GPU cards with
higher performance than PCIe based designs that
do not support GPUDirect P2P transactions between GPUs.
5.6 Impact of job types on system utilization
GPU infrastructure in cloud or on-premise data centers typically
hosts different classes of training jobs with different purposes:
Distributed-run: to train a large, complex model over a large
training dataset across multiple GPUs for fast time-to-solution.
Multiple-run: to sweep the hyper-parameter space of the same
model, typically with one training run per GPU, each with different
hyper-parameter settings, on the same test dataset.
Mixed-run: different users submit different jobs that
train smaller models, each using a single GPU on a cluster.
In this section, we compare the system resource utilization
of these three methods of running machine learning workloads
on a multi-GPU system (the 8-GPU DSS 8440).
Figure 8 shows the CPU and GPU utilization for each
method. In general, running multiple instances of the same
benchmark results in higher CPU utilization than a single
instance on multiple GPUs (distributed-run). This is because,
for the multiple-run, each instance has its own host (CPU) program
that performs pre-processing, controls the GPU computation,
and collects the results from the GPU. The CPU must
handle every host program, leading to higher
CPU utilization. On the other hand, the mixed-run CPU utilization
is roughly the same as the sum of the CPU utilizations of the
single-GPU runs of each workload.
In GPU utilization, the topology of how the GPUs are connected
to the CPU plays an important role. MLPf NCF Py,
MLPf XFMR Py, MLPf GNMT Py, MLPf MRCNN Py, and
MLPf Res50 TF have higher GPU utilization for the distributed
eight-GPU run than for eight independent uniform
single-GPU runs. It turns out that their usage of the
PCIe bus for the distributed run is significantly higher than
for the independent runs. During the distributed-run, the
total data transfer rate for MLPf NCF Py, MLPf XFMR Py,
MLPf GNMT Py, MLPf MRCNN Py, and MLPf Res50 TF
over the PCIe bus reaches 58.95 GBps, 51.90 GBps, 39.1 GBps,
19.51 GBps, and 16.97 GBps, respectively. Meanwhile, for
the multiple-run, they only use 276 MBps, 606 MBps, 2.07
GBps, 2.07 GBps, and 12.39 GBps, respectively. We suspect
that most of this GPU utilization comes from communication
between GPUs: since there is no NVLink for GPU-GPU communication
on this system, each GPU competes for the
PCIe bus as well as for the UPI link. On the other hand,
MLPf Res50 MX and MLPf SSD Py show the opposite behavior:
the total data transfer rate for the distributed-run is
smaller than for the multiple-run.
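The size of the gap between the two modes is easier to see as a ratio. The sketch below divides the distributed-run PCIe transfer rate by the multiple-run rate for the five benchmarks listed above; the MBps values are converted to GBps assuming 1 GBps = 1000 MBps, and the dictionary layout is our own.

```python
# (distributed-run, multiple-run) total PCIe transfer rate in GB/s, from the text.
PCIE_GBPS = {
    "MLPf NCF Py": (58.95, 0.276),
    "MLPf XFMR Py": (51.90, 0.606),
    "MLPf GNMT Py": (39.1, 2.07),
    "MLPf MRCNN Py": (19.51, 2.07),
    "MLPf Res50 TF": (16.97, 12.39),
}

ratio = {
    name: distributed / multiple
    for name, (distributed, multiple) in PCIE_GBPS.items()
}

for name, r in ratio.items():
    print(f"{name:14s} distributed / multiple = {r:6.1f}x")
```

With these figures, the distributed run of NCF Py moves over 200 times more data per second across PCIe than eight independent runs, while for Res50 TF the difference is only about 1.4×.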
6 Conclusion
We have presented a detailed characterization of the recent
MLPerf benchmark suite in this paper. While MLPerf benchmark
characteristics may be heavily influenced by the specific
implementations, the suite does provide a diverse set of benchmarks,
which allows various bottlenecks in the system to be unveiled.
Our experiments point towards (i) the importance of powerful
interconnects in multi-GPU systems, (ii) the variation in scalability
exhibited by different ML models, (iii) the opportunity
for smart scheduling strategies in multi-GPU training that exploit
the variability in scaling, and (iv) the need for more powerful host
CPUs as the number of GPUs increases.
We also present the dissimilarity of the benchmarks to other
benchmarks in the suite (intra-suite dissimilarity) and their dissimilarity
to other suites such as DAWNBench and DeepBench
(inter-suite dissimilarity). MLPerf provides benchmarks with
moderately high memory transactions per second and moderately
high compute rates. DAWNBench provides a high-compute
benchmark with a low memory transaction rate, whereas DeepBench
provides low compute rate benchmarks. The various
MLPerf benchmarks show unique characteristics, such as high NVLink
utilization in NCF Py, low NVLink utilization in SSD Py,
near-perfect scalability with increasing GPU counts in Res50 TF
and SSD Py, and low scalability in NCF Py. MRCNN Py
makes only a 1.5× improvement with tensor cores and reduced
precision, whereas Res50 TF makes a 3.3× improvement.
DrQA Py from DAWNBench results in high CPU utilization
and low GPU utilization.
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,
C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin,
S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard,
Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,
D. Mané, R. Monga, S. Moore, D. Murray, C. Olah,
M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas,
O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine
learning on heterogeneous systems,” 2015, software
available from tensorflow.org. [Online]. Available: http://
tensorflow.org/
[2] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks,
“Fathom: Reference workloads for modern deep learning
methods,” 2016.
[3] Baidu, “Deepbench: Benchmarking deep learning
operations on different hardware,” 2017. [Online].
Available: https://github.com/baidu-research/DeepBench
[4] Baidu, “An update to deepbench with a focus on deep
learning inference,” 2017. [Online]. Available: https://
github.com/baidu-research/DeepBench
[5] bkj, “Resnet18 + minor modifications (submission
at DAWNBench),” https://github.com/bkj/basenet/
tree/49b2b61e5b9420815c64227c5a10233267c1fb14/
examples, 2018.
[6] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham,
B. Haddow, S. Huang, M. Huck, P. Koehn, Q. Liu, V. Lo-
gacheva, C. Monz, M. Negri, M. Post, R. Rubino, L. Spe-
cia, and M. Turchi, “Findings of the 2017 conference on
machine translation (wmt17),” in WMT, 2017.
[7] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao,
B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible
and efficient machine learning library for heterogeneous
distributed systems,” CoRR, vol. abs/1512.01274, 2015.
[Online]. Available: http://arxiv.org/abs/1512.01274
[8] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen,
J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Effi-
cient primitives for deep learning,” 2014.
[9] C. A. Coleman, D. Narayanan, D. Kang, T. Zhao,
J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and
M. Zaharia, “Dawnbench: An end-to-end deep learning
benchmark and competition,” in NIPS ML Systems Work-
shop, 2017.
[10] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei,
“Imagenet: A large-scale hierarchical image database,” in
2009 IEEE Conference on Computer Vision and Pattern
Recognition, June 2009, pp. 248–255.
[11] FUJITSU, “White paper fujitsu server primergy &
primequest memory performance of xeon scalable
processor(skylake-sp) based systems,” https://sp.ts.fujitsu.
com/dmsp/Publications/public/wp-skylake-memory-
performance-ww-en.pdf, 2018.
[12] GroupLens, “MovieLens,” https://grouplens.org/datasets/
movielens/20m/, 2016.
[13] S. Gupta, A. Agrawal, K. Gopalakrishnan, and
P. Narayanan, “Deep learning with limited numerical pre-
cision,” 2015.
[14] L. Gwennap, “Ai benchmarks remain immature,” Micro-
processor Report, January 28, 2019.
[15] F. M. Harper and J. A. Konstan, “The movielens datasets:
History and context,” ACM Trans. Interact. Intell. Syst.,
vol. 5, no. 4, pp. 19:1–19:19, Dec. 2015. [Online].
Available: http://doi.acm.org/10.1145/2827872
[16] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask
r-cnn,” IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, pp. 1–1, 2018.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learn-
ing for image recognition,” 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings
in deep residual networks,” 2016.
[19] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua,
“Neural collaborative filtering,” 2017.
[20] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and
Y. Bengio, “Quantized neural networks: Training neu-
ral networks with low precision weights and activations,”
2016.
[21] A. Krizhevsky, “Learning multiple layers of features from
tiny images,” https://www.cs.toronto.edu/~kriz/learning-
features-2009-TR.pdf, 2009.
[22] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Gir-
shick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and
P. Dollár, “Microsoft coco: Common objects in context,”
2014.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y.
Fu, and A. C. Berg, “Ssd: Single shot multibox detector,”
2015.
[24] P. Mattson, “Mlperf design challenges,” in FastPath 2019,
ISPASS, 2019.
[25] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen,
D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev,
G. Venkatesh, and H. Wu, “Mixed precision training,”
2017.
[26] Microway, “Performance characteristics of common
transports and buses,” https://www.microway.com/
knowledge-center-articles/performance-characteristics-
of-common-transports-buses/, 2019.
[27] “MLPerf,” https://mlperf.org/, MLPerf, 2018.
[28] NVIDIA, “Nvidia collective communications library
(nccl),” https://developer.nvidia.com/nccl.
[29] NVIDIA, “Nvidia tesla v100 gpu accelerator,” https://
images.nvidia.com/content/technologies/volta/pdf/tesla-
volta-v100-datasheet-letter-fnl-web.pdf, 2018.
[30] NVIDIA, “Automatic mixed precision (amp),” https://
developer.nvidia.com/automatic-mixed-precision, 2019.
[31] NVIDIA Corporation, “Nvidia system management in-
terface program,” https://developer.download.nvidia.com/
compute/DCGM/docs/nvidia-smi-367.38.pdf, 2016.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang,
Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer,
“Automatic differentiation in pytorch,” in NIPS-W, 2017.
[33] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad:
100,000+ questions for machine comprehension of text,”
2016.
[34] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre,
G. van den Driessche, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe,
J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap,
M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis,
“Mastering the game of go with deep neural networks and
tree search,” Nature, vol. 529, pp. 484–489, 01 2016.
[35] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou,
M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran,
T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis,
“Mastering chess and shogi by self-play with a general re-
inforcement learning algorithm,” 2017.
[36] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou,
A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai,
A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre,
G. van den Driessche, T. Graepel, and D. Hassabis,
“Mastering the game of go without human knowledge,” Nature,
vol. 550, pp. 354–359, 10 2017.
[37] “Minigo: A minimalist Go engine modeled after AlphaGo
Zero, built on MuGo,” https://github.com/tensorflow/
minigo, tensorflow.
[38] “XLA (accelerated linear algebra),” https://www.
tensorflow.org/xla/jit, tensorflow.
[39] The FreeBSD Project, “Iostat: I/o statistics tool,”
https://www.freebsd.org/cgi/man.cgi?query=iostat&
manpath=FreeBSD+12.0-RELEASE+and+Ports.
[40] The FreeBSD Project, “Netstat: Network status and
statistics tool,” https://www.freebsd.org/cgi/man.cgi?
query=netstat&sektion=1&manpath=FreeBSD+12.0-
RELEASE+and+Ports.
[41] The FreeBSD Project, “Vmstat: Virtual memory
statistics tool,” https://www.freebsd.org/cgi/man.cgi?
query=vmstat&sektion=8&manpath=FreeBSD+12.0-
RELEASE+and+Ports.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is
all you need,” 2017.
[43] V. Vryniotis, “Nvidia gpu utilization plugin for dstat,”
https://raw.githubusercontent.com/datumbox/dstat/
master/plugins/dstat_nvidia_gpu.py, 2017.
[44] D. Wieërs, “Dstat: Versatile resource statistics tool,”
http://dag.wiee.rs/home-made/dstat/.
[45] S. Williams, A. Waterman, and D. Patterson, “Roofline:
An insightful visual performance model for multicore
architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76,
Apr. 2009. [Online]. Available: http://doi.acm.org/10.
1145/1498765.1498785
[46] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey,
J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser,
S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens,
G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa,
A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and
J. Dean, “Google’s neural machine translation system:
Bridging the gap between human and machine
translation,” 2016.
[47] C. Yang, “Berkeley cs roofline toolkit,” https://bitbucket.
org/berkeleylab/cs-roofline-toolkit.
[48] R. Yang, Facebook-ParlAI, and B. Koonce, “DrQA
(submission at DAWNBench),” https://github.com/hitvoice/
DrQA, 2018.
[49] C. Young, “Why machine learning needs benchmarks,”
Computer Architecture Today, ACM SIGARCH, 2018.
[50] H. Zhu, M. Akrout, B. Zheng, A. Pelegris,
A. Phanishayee, B. Schroeder, and G. Pekhimenko, “Tbd:
Benchmarking and analyzing deep neural network training,”
2018.
Appendix
Table 7: The classification of kernels from 7 benchmarks from MLPerf v0.5.0 submission. Classes are sorted by time.
class time% acc% time(ms) #call #unique FLOPs transactions intensity(FLOPs/byte) throughput(GFLOPs/sec)
relu 17.62% 17.62% 18482.42 62,426 22 8,063,786,730,942 198,032,803,079 1.27 436.29
MM 256x128 6.60% 24.22% 6924.74 22,855 6 286,017,276,190 45,463,681,015 0.20 41.30
MM 128x128 3.94% 28.16% 4134.53 16,360 3 401,432,676,657 37,149,153,847 0.34 97.09
MM 128x64 2.81% 30.97% 2944.66 37,642 10 11,202,912,520,076 19,259,247,078 18.18 3804.48
MM 256x64 0.39% 31.35% 404.64 3,967 1 30,191,909,672 3,727,342,569 0.25 74.61
MM 64x64 0.37% 31.73% 393.19 1,787 3 18,356,862,434 6,165,125,205 0.09 46.69
MM 0.37% 32.10% 385.25 182,150 4 271,401,541,563 259,980,751 32.62 704.48
MM 32 0.31% 32.41% 329.51 3,780 4 33,426,507,336 3,174,607,380 0.33 101.44
MM 128x32 0.15% 32.57% 162.41 781 4 580,080,631,808 2,358,357,621 7.69 3571.62
MM 4x1 0.09% 32.65% 91.71 2,400 4 592,209,510,400 88,557,000 208.98 6457.51
MM 64x32 0.00% 32.65% 0.37 31 1 523,911,168 654,960 25.00 1400.68
subtotal of 10 15.03% 32.65% 15771.01 271,753 40 13,416,553,347,304 117,646,707,426 3.56 850.71
wgrad 128x128 7.68% 40.33% 8056.95 16,283 2 139,939,275,826 32,782,052,379 0.13 17.37
wgrad 1x4 4.20% 44.54% 4408.49 31,440 2 198,511,252,432 40,071,279,318 0.15 45.03
wgrad 256x128 0.58% 45.11% 603.20 822 1 16,231,956,048 1,854,362,760 0.27 26.91
wgrad 256x64 0.05% 45.16% 56.08 100 1 1,179,648,000 413,558,700 0.09 21.04
wgrad 128 0.01% 45.17% 6.59 100 1 26,982,297,600 10,601,600 79.53 4096.17
wgrad 512 0.00% 45.17% 3.51 100 1 2,248,908,800 507,400 138.51 639.93
subtotal of 6 12.52% 45.17% 13134.82 48,845 8 385,093,338,706 75,132,362,157 0.16 29.32
dgrad 128x128 9.41% 54.58% 9869.51 33,143 4 1,049,997,394,135 65,438,412,942 0.50 106.39
dgrad 256x64 0.94% 55.52% 981.43 2,044 3 127,396,347,819 8,469,780,667 0.47 129.81
dgrad 256x128 0.59% 56.11% 618.48 1,916 3 28,611,706,064 5,305,664,320 0.17 46.26
dgrad 128 0.48% 56.59% 507.78 406 2 5,013,687,936 770,984,560 0.20 9.87
subtotal of 4 11.42% 56.59% 11977.20 37,509 12 1,211,019,135,954 79,984,842,489 0.47 101.11
tensorIter 6.34% 62.93% 6651.16 323,426 24 3,436,211,135,239 127,141,050,448 0.84 516.63
normalize 5.01% 67.94% 5255.45 10,305 13 8,090,472,122,948 101,382,390,874 2.49 1539.44
Pointwise ThresholdUpdateGradInput 1.76% 69.70% 1843.42 24,041 3 337,114,947,912 31,431,449,451 0.34 182.87
Pointwise Copy 1.57% 71.27% 1645.53 134,198 22 50,666,448,212 27,970,992,933 0.06 30.79
Pointwise MaxValue 1.13% 72.40% 1189.46 27,941 4 334,982,250,018 20,056,797,471 0.52 281.62
Pointwise ThresholdUpdateOutput 0.22% 72.62% 225.82 2,200 1 66,178,890,600 4,114,842,600 0.50 293.06
Pointwise Fill 0.08% 72.70% 85.57 78,175 3 0 127,024,371 0.00 0.00
Pointwise Take 0.02% 72.72% 21.21 10,188 5 0 25,326,225 0.00 0.00
Pointwise Remainder 0.02% 72.74% 20.84 12,782 2 0 1,790,296 0.00 0.00
Pointwise EQValue 0.01% 72.75% 12.21 7,800 2 0 11,143,100 0.00 0.00
Pointwise neg Float 0.01% 72.76% 10.00 7,300 1 160,655,900 4,857,100 1.03 16.06
Pointwise GTValue 0.01% 72.77% 8.47 4,170 1 0 44,140,002 0.00 0.00
Pointwise WrapIndex 0.01% 72.78% 7.56 2,796 2 0 47,137,402 0.00 0.00
Pointwise PutAccumulate 0.01% 72.78% 6.88 2,796 2 309,819,336 6,927,868 1.40 45.03
Pointwise MaskedSelect 0.01% 72.79% 6.55 4,000 1 0 1,476,000 0.00 0.00
Pointwise Put 0.01% 72.79% 6.33 3,294 3 0 5,144,394 0.00 0.00
Pointwise GEValue 0.01% 72.80% 5.87 2,000 2 0 48,826,400 0.00 0.00
Pointwise 0.00% 72.80% 5.07 2,388 1 0 150,633 0.00 0.00
Pointwise log Float 0.00% 72.81% 4.37 900 1 6,842,237,600 66,396,100 3.22 1564.62
Pointwise Pow 0.00% 72.81% 2.55 1,592 2 156,408 103,610 0.05 0.06
Pointwise exp Float 0.00% 72.81% 2.48 1,600 1 314,246,400 2,120,000 4.63 126.78
Pointwise BitAnd 0.00% 72.82% 2.30 1,600 1 0 25,600 0.00 0.00
Pointwise LT 0.00% 72.82% 2.07 100 1 0 43,113,000 0.00 0.00
Pointwise Sign 0.00% 72.82% 1.37 796 1 0 60,572 0.00 0.00
Pointwise abs Float 0.00% 72.82% 1.23 796 1 156,408 46,226 0.11 0.13
Pointwise LTValue 0.00% 72.82% 1.21 796 1 0 39,053 0.00 0.00
Pointwise NEValue 0.00% 72.82% 0.85 400 1 0 668,000 0.00 0.00
Pointwise log2 Float 0.00% 72.82% 0.65 400 1 10,480,800 19,600 16.71 16.14
Pointwise MaskedFill 0.00% 72.82% 0.63 200 2 0 33,900 0.00 0.00
Pointwise sqrt Float 0.00% 72.82% 0.62 400 1 2,821,600 16,400 5.38 4.52
Pointwise BitOr 0.00% 72.82% 0.62 400 1 0 18,800 0.00 0.00
Pointwise Clamp 0.00% 72.83% 0.60 400 1 0 16,400 0.00 0.00
Pointwise floor Float 0.00% 72.83% 0.60 400 1 0 12,800 0.00 0.00
Pointwise MaskedCopy 0.00% 72.83% 0.43 200 1 0 61,000 0.00 0.00
Pointwise GE 0.00% 72.83% 0.18 100 1 0 23,500 0.00 0.00
Pointwise MinValue 0.00% 72.83% 0.17 100 1 0 15,000 0.00 0.00
subtotal of 34 4.88% 72.83% 5123.74 337,249 75 796,583,111,194 84,010,815,807 0.30 155.47
Transpose 3.49% 76.32% 3660.90 170,161 8 834,785,105,680 54,658,781,343 0.48 228.03
Reduce 3.20% 79.51% 3352.12 123,458 46 1,148,926,597,534 45,911,843,322 0.78 342.75
fusion 2.80% 82.31% 2936.77 75,800 373 2,577,445,621,800 50,446,856,600 1.60 877.65
SoftMax 2.16% 84.47% 2266.63 5,400 10 1,253,264,603,200 37,204,331,400 1.05 552.92
cub Merge 0.34% 84.81% 352.81 13,478 6 0 6,230,254,710 0.00 0.00
cub BlockSort 0.29% 85.10% 303.22 4,102 7 0 711,844,104 0.00 0.00
cub RadixSortScanBins 0.13% 85.23% 138.16 4,000 1 0 49,724,000 0.00 0.00
cub Partition 0.10% 85.33% 108.66 13,478 3 0 230,149,983 0.00 0.00
cub ParallelFor 0.08% 85.42% 87.35 23,376 5 0 730,411,098 0.00 0.00
cub CopyIf 0.06% 85.48% 63.44 15,470 1 0 33,774,810 0.00 0.00
cub DeviceRadixSortDownsweep 0.05% 85.53% 56.84 4,000 2 0 438,513,600 0.00 0.00
cub Scan 0.02% 85.55% 25.49 4,200 1 0 2,286,400 0.00 0.00
cub DeviceRadixSortUpsweep 0.02% 85.58% 22.18 4,000 2 0 23,642,400 0.00 0.00
cub Init 0.02% 85.60% 22.05 19,670 2 0 304,192 0.00 0.00
subtotal of 10 1.13% 85.60% 1180.20 105,774 30 0 8,450,905,297 0.00 0.00
Dropout 0.70% 86.30% 738.54 13,600 4 327,512,958,800 14,020,378,200 0.73 443.46
Copy 0.61% 86.91% 642.79 21,631 43 0 10,794,889,758 0.00 0.00
Gather 0.43% 87.35% 453.54 4,578 5 143,982,000 6,278,574 0.72 0.32
Elementwise 0.34% 87.68% 354.28 26,140 3 230,390,201,361 2,800,328,330 2.57 650.31
tanh 0.27% 87.96% 286.32 200 2 344,846,634,800 5,356,401,300 2.01 1204.41
Embedding 0.22% 88.18% 229.83 588 1 94,857,813,824 1,710,689,852 1.73 412.72
SortKV 0.18% 88.36% 192.67 6,241 6 238,796,800 3,769,225 1.98 1.24
Scatter 0.08% 88.44% 86.95 400 2 3,531,248,200 1,000,731,000 0.11 40.61
Pooling 0.05% 88.49% 51.55 1,200 2 13,134,239,200 833,788,800 0.49 254.81
winograd 4x4 0.05% 88.54% 51.47 2,700 6 113,593,864,400 594,620,500 5.97 2206.96
generate 0.01% 88.55% 11.44 2,500 3 545,412,400 58,183,700 0.29 47.69
Sigmoid 0.00% 88.56% 5.01 2,000 1 2,050,252,000 10,974,000 5.84 409.17
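
The two derived columns of Table 7 can be reproduced from the raw columns. The sketch below is illustrative only: the function names are hypothetical, and the 32 bytes per memory transaction is an assumption inferred from the table's numbers (it is not stated in the table). Under that assumption, intensity is FLOPs divided by bytes moved, and throughput is FLOPs divided by kernel time.

```python
# Assumption: each counted memory transaction moves 32 bytes. This value is
# inferred from the numbers in Table 7, not taken from the source text.
BYTES_PER_TRANSACTION = 32


def arithmetic_intensity(flops, transactions,
                         bytes_per_transaction=BYTES_PER_TRANSACTION):
    """Return FLOPs per byte of memory traffic (0.0 if there is no traffic)."""
    traffic_bytes = transactions * bytes_per_transaction
    return flops / traffic_bytes if traffic_bytes else 0.0


def throughput_gflops(flops, time_ms):
    """Return GFLOPs/sec given a FLOP count and a time in milliseconds."""
    seconds = time_ms / 1e3
    return flops / seconds / 1e9 if seconds else 0.0


# Raw columns of the relu row in Table 7.
relu = {
    "flops": 8_063_786_730_942,
    "transactions": 198_032_803_079,
    "time_ms": 18482.42,
}

relu_intensity = arithmetic_intensity(relu["flops"], relu["transactions"])
relu_throughput = throughput_gflops(relu["flops"], relu["time_ms"])

# Compare with the 1.27 FLOPs/byte and 436.29 GFLOPs/sec reported for relu.
print(f"intensity  = {relu_intensity:.2f} FLOPs/byte")
print(f"throughput = {relu_throughput:.2f} GFLOPs/sec")
```

Rows whose FLOP count is zero (for example the cub kernels) yield an intensity and throughput of 0.00 under the same formulas, which matches the table. Small deviations on very short kernels are expected because the time column is rounded to 0.01 ms.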
Table 8: The comparison between the kernel classifications of the 7 benchmarks from MLPerf v0.5.0 submission. Classes are sorted by time.
Res50 TF Res50 MX XFMR Py GNMT Py SSD Py MRCNN Py NCF Py
class time% #call time% #call time% #call time% #call time% #call time% #call time% #call
relu 22.90% 17,000 33.06% 10,142 0.00% 0 0.00% 0 25.12% 6,100 13.69% 29,184 0.00% 0
MM 256x128 0.00% 0 0.05% 200 33.98% 17,526 23.23% 4,729 0.00% 0 0.31% 400 0.00% 0
MM 128x128 0.00% 0 0.00% 0 20.31% 10,482 10.21% 5,596 0.00% 0 0.00% 0 18.81% 282
MM 128x64 0.00% 0 0.02% 100 8.00% 6,516 16.54% 28,479 0.17% 900 0.49% 1,600 1.23% 47
MM 256x64 0.00% 0 0.00% 0 0.00% 0 3.12% 3,567 0.00% 0 0.36% 400 0.00% 0
MM 64x64 0.00% 0 0.00% 0 0.76% 1,596 0.03% 50 0.00% 0 0.00% 0 12.94% 141
MM 0.63% 33,800 0.23% 19,724 0.00% 0 0.00% 0 0.14% 17,500 1.08% 111,126 0.00% 0
MM 32 0.00% 0 0.00% 0 2.51% 3,780 0.00% 0 0.00% 0 0.00% 0 0.00% 0
MM 128x32 0.05% 200 0.00% 0 0.00% 0 0.90% 87 0.00% 0 0.03% 400 2.65% 94
MM 4x1 0.16% 400 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.34% 2,000 0.00% 0
MM 64x32 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 31 0.00% 0
subtotal of 10 0.85% 34,400 0.31% 20,024 65.56% 39,900 54.03% 42,508 0.31% 18,400 2.60% 115,957 35.62% 564
wgrad 128x128 3.67% 3,200 8.68% 3,326 0.00% 0 0.00% 0 18.20% 2,900 3.11% 6,857 0.00% 0
wgrad 1x4 11.12% 7,200 7.02% 1,919 0.00% 0 0.00% 0 3.31% 1,600 5.51% 20,721 0.00% 0
wgrad 256x128 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 2.57% 822 0.00% 0
wgrad 256x64 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.19% 100 0.00% 0 0.00% 0
wgrad 128 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.02% 100 0.00% 0 0.00% 0
wgrad 512 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 100 0.00% 0 0.00% 0
subtotal of 6 14.79% 10,400 15.70% 5,245 0.00% 0 0.00% 0 21.73% 4,800 11.19% 28,400 0.00% 0
dgrad 128x128 5.10% 2,600 12.37% 3,630 0.00% 0 0.00% 0 17.32% 3,500 8.69% 23,413 0.00% 0
dgrad 256x64 1.12% 600 1.46% 404 0.00% 0 0.00% 0 2.03% 600 0.08% 440 0.00% 0
dgrad 256x128 0.00% 0 1.20% 606 0.00% 0 0.00% 0 0.00% 0 1.72% 1,310 0.00% 0
dgrad 128 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 2.16% 406 0.00% 0
subtotal of 4 6.21% 3,200 15.03% 4,640 0.00% 0 0.00% 0 19.35% 4,100 12.65% 25,569 0.00% 0
tensorIter 0.00% 0 0.00% 0 5.89% 14,900 6.84% 4,500 5.14% 7,300 15.11% 296,397 3.61% 329
normalize 0.00% 0 16.92% 5,804 0.00% 0 0.00% 0 7.39% 4,501 0.00% 0 0.00% 0
Pointwise ThresholdUpdateGradInput 0.00% 0 0.00% 0 1.81% 1,200 0.00% 0 2.62% 2,300 2.48% 20,400 10.37% 141
Pointwise Copy 0.00% 0 0.00% 0 0.24% 952 1.78% 5,900 1.69% 10,600 3.25% 116,370 7.10% 376
Pointwise MaxValue 0.00% 0 0.00% 0 0.00% 0 0.00% 0 1.55% 1,400 2.40% 26,400 7.06% 141
Pointwise ThresholdUpdateOutput 0.00% 0 0.00% 0 1.14% 1,200 0.00% 0 0.25% 1,000 0.00% 0 0.00% 0
Pointwise Fill 0.00% 0 0.00% 0 0.02% 2,000 0.01% 800 0.00% 600 0.35% 74,728 0.00% 47
Pointwise Take 0.00% 0 0.00% 0 0.01% 400 0.00% 200 0.00% 0 0.08% 9,588 0.00% 0
Pointwise Remainder 0.00% 0 0.00% 0 0.01% 800 0.00% 200 0.00% 200 0.08% 11,582 0.00% 0
Pointwise EQValue 0.00% 0 0.00% 0 0.00% 100 0.00% 100 0.00% 0 0.05% 7,600 0.00% 0
Pointwise neg Float 0.00% 0 0.00% 0 0.00% 400 0.01% 500 0.00% 0 0.04% 6,400 0.00% 0
Pointwise GTValue 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 200 0.03% 3,970 0.00% 0
Pointwise WrapIndex 0.00% 0 0.00% 0 0.00% 200 0.00% 200 0.00% 0 0.03% 2,396 0.00% 0
Pointwise PutAccumulate 0.00% 0 0.00% 0 0.00% 200 0.00% 200 0.00% 0 0.03% 2,396 0.00% 0
Pointwise MaskedSelect 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.03% 4,000 0.00% 0
Pointwise Put 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 100 0.03% 3,194 0.00% 0
Pointwise GEValue 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.03% 2,000 0.00% 0
Pointwise 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.02% 2,388 0.00% 0
Pointwise log Float 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 100 0.01% 800 0.00% 0
Pointwise Pow 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 1,592 0.00% 0
Pointwise exp Float 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 1,600 0.00% 0
Pointwise BitAnd 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 1,600 0.00% 0
Pointwise LT 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 100 0.00% 0 0.00% 0
Pointwise Sign 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 796 0.00% 0
Pointwise abs Float 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 796 0.00% 0
Pointwise LTValue 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.01% 796 0.00% 0
Pointwise NEValue 0.00% 0 0.00% 0 0.00% 300 0.00% 100 0.00% 0 0.00% 0 0.00% 0
Pointwise log2 Float 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 400 0.00% 0
Pointwise MaskedFill 0.00% 0 0.00% 0 0.00% 0 0.01% 200 0.00% 0 0.00% 0 0.00% 0
Pointwise sqrt Float 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 400 0.00% 0
Pointwise BitOr 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 400 0.00% 0
Pointwise Clamp 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 400 0.00% 0
Pointwise floor Float 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 400 0.00% 0
Pointwise MaskedCopy 0.00% 0 0.00% 0 0.00% 200 0.00% 0 0.00% 0 0.00% 0 0.00% 0
Pointwise GE 0.00% 0 0.00% 0 0.00% 0 0.00% 100 0.00% 0 0.00% 0 0.00% 0
Pointwise MinValue 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 100 0.00% 0 0.00% 0
subtotal of 34 0.00% 0 0.00% 0 3.25% 7,952 1.82% 8,500 6.14% 16,700 8.98% 303,392 24.52% 705
Transpose 0.05% 400 0.01% 201 0.00% 0 0.39% 996 1.62% 6,600 13.32% 161,964 0.00% 0
Reduce 1.02% 1,400 3.75% 804 2.24% 9,943 3.30% 2,170 3.73% 4,500 3.21% 104,406 4.04% 235
fusion 37.86% 75,800 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0
SoftMax 0.00% 0 0.01% 200 7.36% 3,800 4.52% 400 2.76% 200 0.02% 800 0.00% 0
cub Merge 0.00% 0 0.00% 0 0.04% 719 0.04% 488 0.76% 3,900 0.17% 6,679 3.44% 1,692
cub BlockSort 0.00% 0 0.00% 0 0.09% 400 0.13% 400 0.53% 400 0.38% 2,714 1.28% 188
cub RadixSortScanBins 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.59% 4,000 0.00% 0
cub Partition 0.00% 0 0.00% 0 0.02% 719 0.02% 488 0.19% 3,900 0.12% 6,679 0.85% 1,692
cub ParallelFor 0.00% 0 0.00% 0 0.02% 1,000 0.01% 500 0.07% 600 0.26% 21,088 0.09% 188
cub CopyIf 0.00% 0 0.00% 0 0.02% 400 0.01% 200 0.01% 100 0.24% 14,770 0.00% 0
cub DeviceRadixSortDownsweep 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.24% 4,000 0.00% 0
cub Scan 0.00% 0 0.00% 0 0.01% 200 0.00% 0 0.00% 0 0.10% 4,000 0.00% 0
cub DeviceRadixSortUpsweep 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.09% 4,000 0.00% 0
cub Init 0.00% 0 0.00% 0 0.01% 600 0.00% 200 0.00% 100 0.09% 18,770 0.00% 0
subtotal of 10 0.00% 0 0.00% 0 0.21% 4,038 0.21% 2,276 1.56% 9,000 2.28% 86,700 5.66% 3,760
Dropout 0.00% 0 0.00% 0 5.22% 12,400 0.52% 1,200 0.00% 0 0.00% 0 0.00% 0
Copy 0.61% 6,600 0.00% 0 0.96% 100 1.17% 200 0.47% 700 0.37% 13,937 5.43% 94
Gather 0.00% 0 0.00% 0 0.00% 100 0.00% 100 0.00% 0 1.93% 4,378 0.00% 0
Elementwise 0.00% 0 0.00% 0 0.00% 0 3.45% 26,140 0.00% 0 0.00% 0 0.00% 0
tanh 0.00% 0 0.00% 0 0.00% 0 2.79% 200 0.00% 0 0.00% 0 0.00% 0
Embedding 0.00% 0 0.00% 0 0.32% 200 0.28% 200 0.00% 0 0.00% 0 7.08% 188
SortKV 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.82% 6,241 0.00% 0
Scatter 1.11% 200 0.00% 0 0.00% 100 0.00% 100 0.00% 0 0.00% 0 0.00% 0
Pooling 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.22% 1,200 0.00% 0
winograd 4x4 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.17% 2,700 0.00% 0 0.00% 0
generate 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 100 0.05% 2,400 0.00% 0
Sigmoid 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.00% 0 0.02% 2,000 0.00% 0
Table 9: Unique kernel sources distribution for each kernel class. Sorted by number of unique kernels.
class all 7 Res50 TF Res50 MX XFMR Py GNMT Py SSD Py MRCNN Py NCF Py
fusion 373 373 0 0 0 0 0 0
Pointwise Copy 22 0 0 2 7 11 14 4
Pointwise Take 5 0 0 2 1 0 4 0
Pointwise MaxValue 4 0 0 0 0 2 4 1
Pointwise Put 3 0 0 0 0 1 3 0
Pointwise ThresholdUpdateGradInput 3 0 0 1 0 2 1 2
Pointwise Fill 3 0 0 3 2 2 3 1
Pointwise MaskedFill 2 0 0 0 2 0 0 0
Pointwise Pow 2 0 0 0 0 0 2 0
Pointwise WrapIndex 2 0 0 1 1 0 2 0
Pointwise EQValue 2 0 0 1 1 0 2 0
Pointwise PutAccumulate 2 0 0 1 1 0 1 0
Pointwise GEValue 2 0 0 0 0 0 2 0
Pointwise Remainder 2 0 0 1 1 1 2 0
Pointwise BitOr 1 0 0 0 0 0 1 0
Pointwise log2 Float 1 0 0 0 0 0 1 0
Pointwise exp Float 1 0 0 0 0 0 1 0
Pointwise NEValue 1 0 0 1 1 0 0 0
Pointwise MaskedCopy 1 0 0 1 0 0 0 0
Pointwise log Float 1 0 0 0 0 1 1 0
Pointwise GE 1 0 0 0 1 0 0 0
Pointwise floor Float 1 0 0 0 0 0 1 0
Pointwise GTValue 1 0 0 0 0 1 1 0
Pointwise MaskedSelect 1 0 0 0 0 0 1 0
Pointwise Clamp 1 0 0 0 0 0 1 0
Pointwise 1 0 0 0 0 0 1 0
Pointwise LT 1 0 0 0 0 1 0 0
Pointwise Sign 1 0 0 0 0 0 1 0
Pointwise LTValue 1 0 0 0 0 0 1 0
Pointwise MinValue 1 0 0 0 0 1 0 0
Pointwise neg Float 1 0 0 1 1 0 1 0
Pointwise BitAnd 1 0 0 0 0 0 1 0
Pointwise ThresholdUpdateOutput 1 0 0 1 0 1 0 0
Pointwise abs Float 1 0 0 0 0 0 1 0
Pointwise sqrt Float 1 0 0 0 0 0 1 0
subtotal of 34 75 0 0 16 19 24 55 8
Reduce 46 3 7 21 16 17 19 3
Copy 43 33 0 1 2 6 7 1
MM 128x64 10 0 1 8 8 2 3 1
MM 256x128 6 0 2 6 5 0 1 0
MM 4x1 4 2 0 0 0 0 3 0
MM 4 3 3 0 0 3 4 0
MM 32 4 0 0 4 0 0 0 0
MM 128x32 4 1 0 0 1 0 1 2
MM 128x128 3 0 0 3 3 0 0 2
MM 64x64 3 0 0 3 3 0 0 1
MM 256x64 1 0 0 0 1 0 1 0
MM 64x32 1 0 0 0 0 0 1 0
subtotal of 10 40 6 6 24 21 5 14 6
cub BlockSort 7 0 0 2 2 4 3 1
cub Merge 6 0 0 2 2 4 1 1
cub ParallelFor 5 0 0 2 3 3 5 1
cub Partition 3 0 0 1 1 3 1 1
cub DeviceRadixSortDownsweep 2 0 0 0 0 0 2 0
cub Init 2 0 0 2 1 1 2 0
cub DeviceRadixSortUpsweep 2 0 0 0 0 0 2 0
cub CopyIf 1 0 0 1 1 1 1 0
cub Scan 1 0 0 1 0 0 1 0
cub RadixSortScanBins 1 0 0 0 0 0 1 0
subtotal of 10 30 0 0 11 10 16 19 4
tensorIter 24 0 0 11 14 15 17 5
relu 22 5 10 0 0 9 10 0
normalize 13 0 9 0 0 4 0 0
dgrad 128x128 4 1 2 0 0 3 2 0
dgrad 256x128 3 0 1 0 0 0 2 0
dgrad 256x64 3 1 2 0 0 1 1 0
dgrad 128 2 0 0 0 0 0 2 0
subtotal of 4 12 2 5 0 0 4 7 0
SoftMax 10 0 2 4 4 2 2 0
Transpose 8 1 1 0 1 5 6 0
wgrad 128x128 2 1 2 0 0 2 1 0
wgrad 1x4 2 1 1 0 0 1 2 0
wgrad 128 1 0 0 0 0 1 0 0
wgrad 256x64 1 0 0 0 0 1 0 0
wgrad 512 1 0 0 0 0 1 0 0
wgrad 256x128 1 0 0 0 0 0 1 0
subtotal of 6 8 2 3 0 0 6 4 0
SortKV 6 0 0 0 0 0 6 0
winograd 4x4 6 0 0 0 0 6 0 0
Gather 5 0 0 1 1 0 4 0
Dropout 4 0 0 3 3 0 0 0
Elementwise 3 0 0 0 3 0 0 0
generate 3 0 0 0 0 1 2 0
tanh 2 0 0 0 2 0 0 0
Scatter 2 1 0 1 1 0 0 0
Pooling 2 0 0 0 0 0 2 0
Sigmoid 1 0 0 0 0 0 1 0
Embedding 1 0 0 1 1 0 0 1
