Hierarchical Roofline Performance Analysis for Deep Learning
  Applications by Wang, Yunsong et al.
Yunsong Wang, Charlene Yang
Steven Farrell
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
{yunsongwang, cjyang, sfarrell}@lbl.gov
Thorsten Kurth
NVIDIA Corporation
2001 Broadway
Oakland, CA 94612, USA
tkurth@nvidia.com
Samuel Williams
Computational Research Division
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
swwilliams@lbl.gov
Abstract—This paper presents a practical methodology for
collecting performance data necessary to conduct hierarchical
Roofline analysis on NVIDIA GPUs. It discusses the extension of
the Empirical Roofline Toolkit for more data precision support
and Tensor Core support and introduces an Nsight Compute
based method to accurately collect application performance
information. This methodology allows for automated machine
characterization and application characterization for Roofline
analysis across the entire memory hierarchy on NVIDIA GPUs,
and it is validated by a complex deep learning application used
for climate image segmentation. We will use two versions of the
code, in TensorFlow and PyTorch respectively, to demonstrate the
use and effectiveness of this methodology, and some insights will
be highlighted on how the application utilizes the compute and
memory capabilities on the GPU and how the implementation
and performance differ between the two deep learning frameworks.
Index Terms—Roofline Model, Performance Analysis, Memory
Hierarchy, NVIDIA GPUs, Deep Learning, Image Segmentation
I. INTRODUCTION
The Roofline model [1] is an intuitive performance model
that can offer valuable insights into application performance,
performance bottlenecks, and possible optimization opportunities.
Its ability to extract the key computational characteristics
and abstract away the complexity of modern computer
architectures has made it popular in recent years in both
traditional high-performance computing (HPC) and machine
learning. Roofline is a throughput-oriented model centered
around the interplay of computational capabilities, memory
bandwidth, and data locality. Data locality is expressed as the
arithmetic intensity (AI), i.e. the reuse of data once it has been
loaded from memory, and it is commonly calculated as the
ratio of the floating-point operations performed to the data
movement, i.e. FLOPs per byte. The sustained performance
(GFLOP/s) is then bound by two terms:
\[
\text{GFLOP/s} \le \min
\begin{cases}
\text{Peak GFLOP/s} \\
\text{Peak GB/s} \times \text{Arithmetic Intensity}
\end{cases}
\tag{1}
\]
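As a minimal sketch of Equation 1, the bound can be evaluated directly for a given arithmetic intensity. The function names and the ceiling values below are illustrative assumptions, not measurements from this paper.

```python
def roofline_bound(arithmetic_intensity, peak_gflops, peak_gbs):
    """Attainable GFLOP/s under Eq. 1: min(compute ceiling, bandwidth * AI)."""
    return min(peak_gflops, peak_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, peak_gbs):
    """Arithmetic intensity (FLOPs/byte) at which the two ceilings meet."""
    return peak_gflops / peak_gbs


# Hypothetical machine: 10 TFLOP/s compute ceiling, 1 TB/s memory bandwidth.
PEAK_GFLOPS = 10000.0
PEAK_GBS = 1000.0

for ai in (0.5, 10.0, 64.0):
    bound = roofline_bound(ai, PEAK_GFLOPS, PEAK_GBS)
    below_ridge = ai < ridge_point(PEAK_GFLOPS, PEAK_GBS)
    regime = "bandwidth-bound" if below_ridge else "compute-bound"
    print(f"AI = {ai:5.1f} FLOPs/byte -> at most {bound:8.1f} GFLOP/s ({regime})")
```

Kernels to the left of the ridge point are limited by the bandwidth term, and kernels to the right of it by the compute term.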
The Roofline model conventionally only focuses on one
level in the memory hierarchy, but this has been extended
in recent years to the full memory system to help understand
cache reuse and data locality and provide additional insights
into code performance. To facilitate the Roofline study, many
tools and workflows have sprung to life, for example, the
Empirical Roofline Toolkit (ERT) developed at the Lawrence
Berkeley National Laboratory, for more accurate machine
characterization [2], [3], and other tools, methodologies and
workflows for more streamlined application performance data
collection in [4]–[7]. A range of studies have also been
conducted on the application of Roofline in both traditional
HPC [7]–[12] and Machine Learning [7], [13], [14], and
the extension and refinement of the model to other related
topics such as instruction Roofline [15], time-based Roofline
[14], Roofline scaling trajectories [16], performance portability
analysis based on Roofline [3], and power and energy Roofline
[17], [18].
Deep learning has become one of the most dominant tools
in areas such as pattern recognition, object detection, image
segmentation, and language processing [19], [20], and its
training or inference process usually takes a long time and
requires significant computational resources. To tackle this
problem, many innovative methods have been proposed [21],
[22] to scale up such applications, and in this paper, we will
focus on the Roofline-based performance modeling to analyze
and examine how well various deep learning frameworks are
utilizing the different aspects of the computer architecture,
especially NVIDIA GPUs.
We will propose a practical methodology for collecting
necessary performance data in order to conduct hierarchical
Roofline analysis on NVIDIA GPUs. There are two com-
ponents to this methodology, machine characterization using
the Empirical Roofline Toolkit (ERT) [2] and application
characterization using Nsight Compute [23]. We will discuss
the extension of ERT for support on multiple data precisions
and Tensor Core operations, and the Nsight Compute metrics
used to measure application performance such as the runtime,
sustained throughput and data movement across the entire
memory hierarchy. This methodology then will be validated by
a state-of-the-art deep learning application, DeepCAM [21], in
climate image segmentation, to demonstrate its effectiveness
in application analysis. Two versions of the code will be
examined, in TensorFlow and PyTorch respectively, and some
insights will be highlighted on how deep learning applications
in general utilize the compute/memory capabilities on
NVIDIA GPUs and how the two deep learning frameworks,
TensorFlow and PyTorch, can differ in implementation and
performance.
arXiv:2009.05257v1 [cs.DC] 11 Sep 2020
II. METHODOLOGIES
In this section, we will discuss the extension work done on
the Empirical Roofline Toolkit (ERT) in order to support multi-
ple data precisions (such as FP16) and Tensor Core operations
on NVIDIA GPUs, and the set of metrics in Nsight Compute
that can be used to measure application performance such as
runtime, sustained throughput and data movement at differ-
ent levels of the memory hierarchy. These two components
together comprise the complete data collection methodology
for machine and application characterization in a hierarchical
Roofline analysis on NVIDIA GPUs.
A. ERT Extensions for Machine Characterization
The Empirical Roofline Toolkit (ERT) [2] is developed and
maintained by the Lawrence Berkeley National Laboratory.
It consists of micro-kernels that are finely tuned to test the
various aspects of the computer architecture such as memory
bandwidth and compute throughput. Compared to theoretical
values or marketing numbers from vendors, this provides a
more accurate understanding of the architecture’s capability
in real programming environments with real power, thermal
constraints, and programming models.
ERT is essentially a Python script that wraps around a range
of micro-kernels written in C++ and parallelized with various
programming models on different architectures. For example,
OpenMP and MPI are used on Intel CPUs, CUDA is used on
NVIDIA GPUs, and more micro-kernels are currently being
added to support AMD architectures, IBM Power processors,
and Intel GPUs. These micro-kernels are specifically tuned to
test different aspects of the architecture and provide an upper
bound for real-life applications on them, i.e. if such kernels
cannot reach a certain performance, there is almost no hope for
large complex applications in real life to achieve it.
Prior to this paper, ERT only supported double precision
(FP64) performance characterization. In this section, we
will detail how we have extended it to support single precision
(FP32), and half precision (FP16), as well as Tensor Core
operations on NVIDIA GPUs. The resultant Roofline ceilings
are shown in Fig. 1, with 7.7 TFLOP/s for FP64, 15.2
TFLOP/s for FP32, 29.2 TFLOP/s for FP16 on the CUDA
core, and 103.7 TFLOP/s on the Tensor Core, on V100 GPUs.
1) Single-Precision (FP32) and Half-Precision (FP16):
The original ERT is written in C and only supports double
precision (FP64) measurements. While this can be easily
extended to single precision (FP32) by replacing ‘double’ with
‘float’ in the code, supporting half precision (FP16) requires
more work. For maintainability and future extensibility,
we have rewritten ERT in C++ and leverage C++ templates to
support multiple data types.
For FP32, we have easily obtained 15.2 TFLOP/s peak
performance, which is within 5% of the advertised 15.7
TFLOP/s performance [24].
[Figure 1: Roofline plot. x-axis: Arithmetic Intensity [FLOPs/Byte]; y-axis: Performance [GFLOP/sec]. Ceilings: TENSOR 103742.3 GFLOP/s, FP16 29181.6 GFLOP/s, FP32 15158.2 GFLOP/s, FP64 7669.1 GFLOP/s; HBM 828.8 GB/s.]
Fig. 1: ERT Roofline graph for V100 GPU.
TABLE I: FP16 Performance on CUDA Core on V100 GPUs
Version  Implementation                 Performance (TFLOP/s)
v1       naive                          15.421
v2       replace half with half2        20.142
v3       uint32_t for indexing          28.152
v4       inline intermediate variables  28.376
v5       uint32_t only                  29.182
For FP16 (on the CUDA Core), some performance tuning
is required, as detailed in Tab. I. The naive implementation
(v1) simply passes half as the data type to the templated
functions, which results in performance similar to that of
FP32, 15.4 TFLOP/s. This is because V100s do
not support FP16 directly on the CUDA Core [24] and each
FP16 operation is essentially executed as an FP32 operation
(i.e. going through the same pipeline). To perform FP16
operations efficiently on the CUDA Core (even though utilizing
the Tensor Core would also be a good option), the vector type half2
can be used to pack two FP16 values into one 32-bit
register so that both are executed in a single instruction. In ERT, we
have implemented this using intrinsic functions and obtained
an improved performance of 20.1 TFLOP/s (v2) in Tab. I. In
real life, it is not feasible to implement large-scale applications
in intrinsics, but our implementation is an attempt to push the
Roofline ceiling as high as we possibly can.
The remaining three versions, v3-v5 in Tab. I, are a series of
optimizations that have proved beneficial to the development
of ERT and are expected to be helpful to real-life
applications and their performance tuning as well. Of
the three, replacing uint64_t indexing variables with the
uint32_t data type brings the largest performance
gain, from 20.1 TFLOP/s to 28.2 TFLOP/s. This is because
V100s only support INT32 integer operations at
the hardware level, so the second version of ERT (v2) incurs
constant type conversion between uint64_t and uint32_t.
With inlining of intermediate variables in v4 and
conversion of all integers to uint32_t in v5, the FP16 CUDA
Core performance of ERT has been brought on par with the
theoretical peak, at 29.2 TFLOP/s in Fig. 1.
2) Tensor Core: NVIDIA Tensor Cores are designed to
accelerate matrix-matrix multiplication operations, which rep-
resent the mathematical nature of many deep learning work-
loads, for example, convolutional neural networks (CNNs).
They operate on 4×4 matrices and can perform the following
matrix multiplication and accumulation extremely efficiently.
\[ D = A \times B + C \tag{2} \]
where A and B are matrices in FP16, and C and D are matrices
in either FP16 or FP32. V100 has 80 SMs and 8 tensor cores
per SM, and at 1.3 GHz clock frequency, its theoretical Tensor
Core peak can be calculated as
\[ 80 \times 8 \times 1.3 \times 4^{3} \times 2 = 106.496\ \text{TFLOP/s} \tag{3} \]
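The arithmetic in Equation 3 can be checked with a short sketch. The function and parameter names are illustrative; the numeric inputs are the V100 values quoted above.

```python
def tensor_core_peak_tflops(num_sms, tensor_cores_per_sm, clock_ghz, tile_dim=4):
    """Theoretical Tensor Core peak following Eq. 3."""
    fmas_per_cycle = tile_dim ** 3  # a 4x4 by 4x4 product needs 4^3 multiply-adds
    flops_per_fma = 2               # one multiply plus one add
    gflops = (num_sms * tensor_cores_per_sm * clock_ghz
              * fmas_per_cycle * flops_per_fma)
    return gflops / 1000.0


# V100 parameters quoted in the text: 80 SMs, 8 Tensor Cores per SM, 1.3 GHz.
print(f"{tensor_core_peak_tflops(80, 8, 1.3):.3f} TFLOP/s")
```

Because the clock is given in GHz, the product is in GFLOP/s; dividing by 1000 gives the TFLOP/s figure used in the text.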
To stress test the Tensor Cores on V100, we have imple-
mented ERT based on general matrix-matrix multiplications
(GEMMs), where α and β are constant coefficients:
\[ D = \alpha \ast A \times B + \beta \ast C \tag{4} \]
In general, there are two ways to program on Tensor
Core, using the WMMA (Warp Matrix Multiply Accumu-
late) API in CUDA [25], or libraries such as cuBLAS [26]
and cuDNN [27]. The nvcuda::wmma namespace in CUDA
provides specialized matrix load, multiply, accumulate and
store operations and allows for direct programming on Tensor
Cores. The cuBLAS and cuDNN libraries, on the other hand,
shield users from low-level CUDA programming and
provide a versatile, highly tuned, high-level user
API for GEMM and other operations.
For a given GEMM in Equation 4 with matrix size $M \times N$
for A, $N \times K$ for B, and $M \times K$ for C and D, if $M = N = K$,
the total number of FLOPs performed in this kernel can
be calculated as $M^3 \times 2$. This estimate excludes
the multiplications by the constant coefficients α and β, which usually
are performed on the CUDA Core, not the Tensor Core, and
are negligible. With the runtime $t$, we can then estimate the
FLOP/s performance of the kernel as $(M^3 \times 2)/t$; the results
for a range of matrix sizes are shown in Fig. 2.
It is clear that as the matrix size increases, so does the
performance of both wmma and cuBLAS approaches. At the
largest with M = N = K = 32768, we have obtained 103.7
TFLOP/s at 97.4% of the theoretical peak from the cuBLAS
approach, and 58 TFLOP/s at 54.5% from the wmma approach.
The gap is largely due to the optimizations in cuBLAS such as the
use of shared memory, data padding (to avoid bank conflicts
in shared memory), and highly tuned thread block size, tile size,
and other parameters.
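The FLOP/s estimate and the fractions of peak quoted above can be reproduced with a small sketch. The helper names are hypothetical, and the 0.7 s runtime is an invented illustration rather than a measured value; the 103.7, 58 and 106.496 TFLOP/s figures are those reported in the text.

```python
def gemm_flops(m):
    """FLOPs of a square GEMM with M = N = K, ignoring the alpha/beta scaling."""
    return 2 * m ** 3


def gemm_tflops(m, runtime_s):
    """Estimated TFLOP/s of a square GEMM that took runtime_s seconds."""
    return gemm_flops(m) / runtime_s / 1e12


def fraction_of_peak(measured_tflops, peak_tflops):
    return measured_tflops / peak_tflops


THEORETICAL_PEAK_TFLOPS = 106.496  # Eq. 3

# Hypothetical runtime for the largest matrix size, for illustration only.
print(f"{gemm_tflops(32768, 0.7):.1f} TFLOP/s for a 0.7 s run at M = 32768")

for name, measured in (("cuBLAS", 103.7), ("wmma", 58.0)):
    pct = 100 * fraction_of_peak(measured, THEORETICAL_PEAK_TFLOPS)
    print(f"{name}: {pct:.1f}% of the theoretical peak")
```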
For the rest of this paper, we will use 103.7 TFLOP/s as
the Tensor Core peak; however the 58 TFLOP/s performance
provides an empirical upper bound for users who program in
wmma on the Tensor Core.
B. Nsight Compute Metrics for Application Characterization
The application characterization methodology for Roofline
analysis on NVIDIA GPUs has been evolving with changes in the
developer toolchain. The first proposed methodology
was based on nvprof [28] in [7]; an Nsight
Compute [29] based methodology was then developed in [30] and
briefly presented in [31]. In this paper, we will discuss in detail
how the Nsight Compute metrics can be used for hierarchical
Roofline analysis on NVIDIA GPUs and demonstrate
its effectiveness in analyzing deep learning applications.
[Figure 2: Tensor Core performance of the wmma and cuBLAS approaches. x-axis: Matrix Size (in one dimension), 128 to 32768; y-axis: TFLOPS, 0 to 100.]
Fig. 2: Tensor Core Performance by varying the matrix size.
The Nsight profiling toolkit is replacing nvprof as the
new performance tool suite for NVIDIA GPU developers. It
consists of three components, Nsight Systems, Nsight
Compute, and Nsight Graphics, with the first two being
most relevant to scientific and machine learning
application development. Nsight Systems provides a
system-wide visualization of application performance and helps
users identify issues such as insufficient parallelism on the
GPU, unnecessary device-host data transfers, and inefficient
kernel synchronization, while Nsight Compute dives
deeper and allows for collection of more detailed performance
metrics such as warp issue statistics, instruction pipeline
utilization, and memory access patterns.
The two generations of developer tools, nvprof
and Nsight Compute, have a few major differences.
• nvprof uses CUPTI [32] while Nsight Compute is
based on PerfWorks [33], a new framework for performance
metric collection.
• The metrics in Nsight Compute are more fine-grained
than in nvprof, with some nvprof metrics broken down into
several Nsight Compute metrics.
• The naming and organizing convention in Nsight
Compute is more structured as well, with components
such as unit, subunit, interface, counter name, rollup metric
and submetric used to distinguish different metrics.
• Kernel replay (used when multiple metrics are being collected)
and profiling overhead are better optimized in Nsight
Compute, to provide faster and more accurate hardware
and software counter measurements.
To construct a hierarchical Roofline on NVIDIA GPUs, we
need to collect the following quantities: the kernel run time, the
total number of FLOPs performed in each kernel, and the
number of bytes read and written at each level of the
memory hierarchy. With Nsight Compute, we can use the following
command to collect the metrics listed in Tab. II.
nv-nsight-cu-cli --metrics metric ./application
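As a hedged sketch, the command above can be generated once per metric, which matches collecting a single metric per execution. The helper name is hypothetical, the metric list is only a subset of Tab. II, and the commands are built but not executed here.

```python
# A subset of the Nsight Compute metrics from Tab. II.
METRICS = [
    "sm__cycles_elapsed.avg",
    "sm__cycles_elapsed.avg.per_second",
    "sm__sass_thread_inst_executed_op_fadd_pred_on.sum",
    "sm__sass_thread_inst_executed_op_fmul_pred_on.sum",
    "sm__sass_thread_inst_executed_op_ffma_pred_on.sum",
    "sm__inst_executed_pipe_tensor.sum",
    "l1tex__t_bytes.sum",
    "lts__t_bytes.sum",
    "dram__bytes.sum",
]


def build_profile_commands(application, metrics, profiler="nv-nsight-cu-cli"):
    """Return one argument list per metric, e.g. for subprocess.run()."""
    return [[profiler, "--metrics", metric, *application] for metric in metrics]


# "./application" is the placeholder used in the text.
for command in build_profile_commands(["./application"], METRICS):
    print(" ".join(command))
```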
1) Kernel Run Time: As shown in TABLE II, we use
the metric sm__cycles_elapsed.avg to obtain the total
number of elapsed cycles and its submetric per_second
to get the rate (number of cycles per second), in order to
calculate the kernel execution time:
\[ \text{time} = \text{cycles} / \text{rate} \tag{5} \]
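Equation 5 amounts to a single division. The sketch below uses a hypothetical function name and invented counter values.

```python
def kernel_time_seconds(cycles_elapsed, cycles_per_second):
    """Kernel run time from sm__cycles_elapsed.avg and its per_second submetric."""
    if cycles_per_second <= 0:
        raise ValueError("cycles_per_second must be positive")
    return cycles_elapsed / cycles_per_second


# Hypothetical readings: 1.3 million elapsed cycles at 1.3e9 cycles per second.
print(kernel_time_seconds(1.3e6, 1.3e9), "seconds")
```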
2) FLOPs: To count the number of FLOPs performed
in the kernel, Nsight Compute does not provide a unified
metric like flop_count_dp in nvprof. Instead, for each
floating-point precision (FP64, FP32 and FP16), it splits the
measurement into three metrics based on the instruction type:
addition, multiplication, and fused multiply-add (FMA). Note
that each FMA counts as two FLOPs, so the total number
of FLOPs can be calculated as add + 2 × fma + mul for
each data precision. Also, one can tell from the naming of the
metrics that only non-predicated threads are counted in these
FLOPs, i.e. masked operations are not included.
For Tensor Core, we count the number of warp instructions
by using the sm__inst_executed_pipe_tensor.sum
metric and the total Tensor Core FLOPs is
\[ \text{FLOP}_{tc} = \text{Inst}_{tc} \times 512 \tag{6} \]
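The two FLOP-counting rules above can be sketched as follows. The function names and the counter values are hypothetical; the factor of 2 per FMA and the factor of 512 per Tensor Core warp instruction come from the text.

```python
def cuda_core_flops(add, mul, fma):
    """FLOPs for one precision: add + 2 x fma + mul."""
    return add + mul + 2 * fma


def tensor_core_flops(warp_instructions):
    """Eq. 6: each Tensor Core warp instruction counts as 512 FLOPs."""
    return 512 * warp_instructions


# Hypothetical counter values for one kernel.
fp32 = cuda_core_flops(add=1_000, mul=2_000, fma=10_000)
fp16 = cuda_core_flops(add=0, mul=500, fma=4_000)
tensor = tensor_core_flops(warp_instructions=4_000)
print("total FLOPs:", fp32 + fp16 + tensor)
```

Summing the per-precision counts and the Tensor Core count gives the kernel's total FLOPs.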
3) Bytes: Metrics are listed in TABLE II for measuring the
data movement on each level of the memory hierarchy.
TABLE II: Nsight Compute metrics for hierarchical Roofline
Quantity           Metrics
Time               sm__cycles_elapsed.avg
                   sm__cycles_elapsed.avg.per_second
FP64 FLOPs         sm__sass_thread_inst_executed_op_dadd_pred_on.sum
                   sm__sass_thread_inst_executed_op_dmul_pred_on.sum
                   sm__sass_thread_inst_executed_op_dfma_pred_on.sum
FP32 FLOPs         sm__sass_thread_inst_executed_op_fadd_pred_on.sum
                   sm__sass_thread_inst_executed_op_fmul_pred_on.sum
                   sm__sass_thread_inst_executed_op_ffma_pred_on.sum
FP16 FLOPs         sm__sass_thread_inst_executed_op_hadd_pred_on.sum
                   sm__sass_thread_inst_executed_op_hmul_pred_on.sum
                   sm__sass_thread_inst_executed_op_hfma_pred_on.sum
Tensor Core FLOPs  sm__inst_executed_pipe_tensor.sum
L1 Cache           l1tex__t_bytes.sum
L2 Cache           lts__t_bytes.sum
HBM                dram__bytes.sum
PCIe / NVLINK      lts__t_sectors_aperture_sysmem_op_read.sum
                   lts__t_sectors_aperture_sysmem_op_write.sum
For PCIe/NVLINK measurement, we collect the total number
of read and write transactions ($T_{read}$ and $T_{write}$) and multiply
them by the transaction size, $T_{size}$, which is 32 bytes per
transaction in this case:
\[ \text{Bytes} = (T_{read} + T_{write}) \times T_{size} \tag{7} \]
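A minimal sketch of Equation 7 follows. The function name and the transaction counts are hypothetical; the 32-byte transaction size is the value given above.

```python
TRANSACTION_SIZE_BYTES = 32  # bytes per transaction, as stated in the text


def sysmem_bytes(read_transactions, write_transactions,
                 transaction_size=TRANSACTION_SIZE_BYTES):
    """Eq. 7: bytes moved over PCIe/NVLINK from read and write transaction counts."""
    return (read_transactions + write_transactions) * transaction_size


# Hypothetical transaction counts for one kernel.
print(sysmem_bytes(read_transactions=100, write_transactions=50), "bytes")
```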
For device memory (or HBM), L2 cache, and L1 cache,
the latest Nsight Compute provides a unified byte metric
for each of them to facilitate measurement. Note that shared
memory transactions are not included in the current L1 metric.
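Once time, FLOPs and per-level bytes are known, each kernel yields one Roofline point per memory level. The sketch below shows one way to combine them; the function name and all numeric values are hypothetical.

```python
def roofline_points(time_s, flops, bytes_per_level):
    """Return {level: (arithmetic intensity, GFLOP/s)} for one kernel.

    The GFLOP/s value is shared by all levels; only the arithmetic
    intensity changes with the bytes moved at each level.
    """
    gflops_per_s = flops / time_s / 1e9
    points = {}
    for level, nbytes in bytes_per_level.items():
        # A level with no recorded traffic has unbounded arithmetic intensity.
        intensity = flops / nbytes if nbytes > 0 else float("inf")
        points[level] = (intensity, gflops_per_s)
    return points


# Hypothetical kernel: 2 ms run time, 4 GFLOPs, decreasing traffic down the hierarchy.
points = roofline_points(
    time_s=0.002,
    flops=4e9,
    bytes_per_level={"L1": 2e9, "L2": 1e9, "HBM": 2.5e8},
)
for level, (intensity, gflops) in points.items():
    print(f"{level}: AI = {intensity:.1f} FLOPs/byte, {gflops:.0f} GFLOP/s")
```

A large horizontal gap between two levels' points means far fewer bytes reach the lower level, i.e. good data reuse in the level above it.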
Due to profiling overhead, it is recommended to restrict the
number of kernels profiled with Nsight Compute at a time.
These metrics can also be collected in separate runs, as
long as the execution of the application is deterministic. Also,
note that as of 2020.1.0, Nsight Compute serializes multi-stream
execution, so performance gains from kernel
overlapping may be overlooked; however, the performance
analysis in this paper is still insightful for understanding
application performance at the kernel level.
III. EXPERIMENTAL SETUP
A. Hardware and Software Configuration
Results presented in this paper are obtained from the
Cori supercomputer, and in particular its GPU partition, at
the National Energy Research Scientific Computing Center
(NERSC), Lawrence Berkeley National Laboratory (LBNL).
The GPU partition is primarily deployed for GPU porting,
benchmarking, and testing efforts in the NERSC Exascale
Science Application Program (NESAP). Each node contains
two Intel Xeon Gold 6148 Skylake CPUs, 384 GB DDR4
memory, and 8 NVIDIA V100 GPUs. Each GPU has 16 GB
of HBM2 memory and 80 SMs, and GPUs on a node are
connected to each other in a ‘hybrid cube-mesh’ topology.
On the software side, we have used the TensorFlow 1 and
PyTorch implementations of the climate image segmentation
code in [34], and CUDA 10.2.89, cuDNN 7.6.5, Nsight
Compute 2020.1.0, Python 3.7, PyTorch GPU 1.5.0, and
TensorFlow GPU 1.15.0 for this study.
B. DeepCAM
DeepCAM [35] is a deep learning benchmark extracted
from the 2018 Gordon Bell winning project [21], used for
detection, classification and localization of extreme weather
patterns in climate images. It has two different implementa-
tions, in TensorFlow and PyTorch respectively, with the Py-
Torch version being selected for MLPerf [36] HPC benchmark
suite. In this paper, we will compare the performance of
these two implementations using the methodology presented
in Sec. II. To ensure a fair comparison, we have tuned the
parameters to be as close as possible, for example, the number
of layers in the encoder-decoder architecture, layer parameters,
optimization algorithms, step rates, batch size, usage of batch
norm, and Automatic Mixed Precision (AMP) settings.
To profile the code, the profile-from-start option
is disabled in Nsight Compute and we use CuPy [37] to
explicitly restrict the profiling region to include the iteration
loop only. To have relatively stable run time behavior during
profiling, we also set up a warm-up loop with 5 iterations
before the target profiling loop. We collect only one metric
during each execution to minimize the profiling overhead.
Because TensorFlow's runtime auto-tuning can make different
algorithmic choices from one execution to the next, NVIDIA
TensorFlow Determinism [38] is employed to remove this
uncertainty.
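The profiling setup described above (warm-up iterations first, then an explicitly delimited profiling region) can be sketched as below. The function name is hypothetical. The start and stop hooks are passed in as callables so that the sketch runs without a GPU; with CuPy they would presumably be `cupy.cuda.profiler.start` and `cupy.cuda.profiler.stop`, which is an assumption about how the profiling region is marked rather than a detail given in the text.

```python
def run_with_profiled_region(step, warmup_iters, profiled_iters,
                             start_profiler, stop_profiler):
    """Run warm-up iterations unprofiled, then profile only the target loop."""
    for _ in range(warmup_iters):
        step()
    start_profiler()
    try:
        for _ in range(profiled_iters):
            step()
    finally:
        # Always close the profiling region, even if an iteration fails.
        stop_profiler()


# Stand-ins that only record the call order.
events = []
run_with_profiled_region(
    step=lambda: events.append("iteration"),
    warmup_iters=5,  # the text uses a warm-up loop of 5 iterations
    profiled_iters=2,
    start_profiler=lambda: events.append("profiler start"),
    stop_profiler=lambda: events.append("profiler stop"),
)
print(events)
```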
If not otherwise stated, the default setting for the Tensor-
Flow DeepCAM implementation is with AMP enabled, and
for PyTorch DeepCAM with AMP optimization level O1. The
source code and full raw results are available at [39].
IV. RESULTS
In this section, we will apply the methodology in Sec. II on
the DeepCAM benchmark and discuss some of its performance
implications. On the following Roofline charts, each kernel is
represented by three open circles (blue for L1, red for L2
and green for HBM), and the circle size is proportional to
the kernel’s run time. There could be many invocations of the
same kernel and the data presented on these Roofline charts is
the aggregation of all these invocations of the same kernel. In
the following subsections, we will discuss how performance is
different in the forward and backward pass in both TensorFlow
and PyTorch implementations, and the performance impact of
the NVIDIA Automatic Mixed Precision package and the zero-
AI kernels.
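The per-kernel aggregation mentioned above can be sketched as follows, assuming that aggregation means summing run time, FLOPs and bytes over all invocations with the same kernel name (the text does not spell this out). The function name, record layout and values are hypothetical, and a single byte count stands in for the per-level counts.

```python
def aggregate_by_kernel(invocations):
    """Sum time, FLOPs and bytes over invocations that share a kernel name.

    Each invocation is a (name, time_s, flops, nbytes) tuple.
    """
    totals = {}
    for name, time_s, flops, nbytes in invocations:
        entry = totals.setdefault(
            name, {"invocations": 0, "time_s": 0.0, "flops": 0, "bytes": 0}
        )
        entry["invocations"] += 1
        entry["time_s"] += time_s
        entry["flops"] += flops
        entry["bytes"] += nbytes
    return totals


# Hypothetical profiler records.
records = [
    ("gemm", 1.0, 10, 100),
    ("gemm", 2.0, 30, 300),
    ("relu", 0.5, 0, 50),
]
for name, total in aggregate_by_kernel(records).items():
    print(name, total)
```

Each aggregated entry then produces one circle per memory level, with the summed run time setting the circle size.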
Note that the backward pass for TensorFlow DeepCAM in-
cludes both gradient calculation and gradient update, whereas
the PyTorch DeepCAM backward pass only includes gradient
calculation (with its ‘optimizer’ being the gradient update
step).
A. The TensorFlow version of DeepCAM
[Figure 3: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 3: Hierarchical Roofline of the TensorFlow DeepCAM in
the forward pass with default configurations. Dominant kernels
have very high Tensor Core utilization.
Fig. 3 shows the hierarchical Roofline of the TensorFlow
version of DeepCAM in its forward pass. The main computational
kernel, represented by the three large circles under the
Tensor Core ceiling, has very high Tensor Core
utilization, whereas many of the other kernels either do not
use the Tensor Core or are bandwidth bound. This major kernel’s
L1 circle (in blue) slightly overlaps with its L2 circle (in red),
indicating relatively low L1 cache locality; however, the large
gap between its L2 and HBM circles demonstrates that L2
cache misses rarely happen and the kernel benefits from high
L2 data locality. As for the rest of the kernels, their L1, L2,
and HBM circles are generally close to each other, implying
poor data locality across all levels of the memory hierarchy.
Fig. 4 shows the corresponding backward pass of the
TensorFlow DeepCAM. Instead of the single major kernel
seen in the forward pass, two very time-consuming
kernels are found in the backward pass calculation. These
two kernels both require longer run time
than the major kernel in the forward pass (notice the circle size),
which implies the backward pass has more compute-intensive
kernels than the forward pass and is generally more time-consuming.
Whereas only a few kernels use the Tensor Core in
the forward pass, more kernels benefit from
the Tensor Core pipeline in the backward pass, since they
sit above the half-precision peak. Another observation is
that more kernel invocations are involved in the backward pass
than in the forward pass. In the end, we can conclude that in both
the forward and backward passes, the main computational kernels are
compute-bound and are highly optimized for the underlying
architecture.
[Figure 4: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 4: Hierarchical Roofline of the TensorFlow DeepCAM in
the backward pass with default configurations. There are more
compute-intensive kernels than in the forward pass.
B. The PyTorch version of DeepCAM
[Figure 5: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 5: Hierarchical Roofline of the PyTorch DeepCAM in
the forward pass with default configurations. No single kernel
requires significantly longer run time than the others.
Compared to the TensorFlow result (Fig. 3), no dominant
kernel (one whose run time is significantly larger than the others)
can be found in the PyTorch forward pass (Fig. 5). The most
time-consuming kernel is located slightly below the single-precision
performance peak, and based on the distance between its symbols for
the different memory levels, it has better cache utilization
than the dominant kernel in TensorFlow (even though it runs
on the CUDA Core). In addition, similar to TensorFlow, a large
number of trivial kernels are HBM-bound in the PyTorch
implementation of DeepCAM.
[Figure 6: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 6: Hierarchical Roofline of the PyTorch DeepCAM in its
backward pass with default configurations.
Fig. 6 shows the PyTorch DeepCAM performance in the
backward pass, with default configurations. Surprisingly, the
number one time-consuming kernel does not utilize Tensor
Core and delivers only about 1 TFLOP/s performance. How-
ever, this implementation’s overall run time is still lower than
in the TensorFlow case, seen by the size of the circles, thanks
to optimizations in other kernels or the overall execution of
kernels.
[Figure 7: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 7: Hierarchical Roofline of the PyTorch DeepCAM in its
‘optimizer’ step.
Compared to TensorFlow, PyTorch has more flexibility
when profiling the model and the ‘optimizer’ step can be
easily separated from the gradient calculation in the back-
propagation. The optimization step is mainly to update model
parameters with newly calculated gradients and is thus low
on arithmetic intensity. Fig. 7 demonstrates this as all the
‘optimizer’ kernels are memory-bound and have a much lower
FLOP/s performance than some of the kernels in Fig. 5 or
Fig. 6. Note that 2709 kernel invocations are
involved in this process, even though the figure may not suggest so.
Only a few circles are visible because
these kernels all have very similar arithmetic intensity and
performance, and thus overlap.
C. Automatic Mixed Precision
[Figure 8: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 8: Hierarchical Roofline of a manual FP16 implementation
of DeepCAM in TensorFlow (backward pass).
The Automatic Mixed Precision (AMP) package developed
at NVIDIA is dedicated to accelerating deep learning processes
by partially converting single-precision data to half-precision
to reduce data movement and improve computational
throughput. It allows for automatic type conversion of
certain model parameters and also implements schemes such
as loss scaling to ensure numerical correctness and accuracy.
We have implemented an FP16 version of DeepCAM in TensorFlow
by manually selecting variables and type casting
them. Fig. 8 shows that the backward pass performance of
this implementation is very close to that of the TensorFlow
DeepCAM, where AMP is used, in Fig. 4, demonstrating the
effectiveness of the AMP package.
[Figure 9: hierarchical Roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte]; y-axis: Performance [GFLOP/s]. Ceilings: Tensor Core 103.7 TFLOP/s, FP16 29.2 TFLOP/s, FP32 15.2 TFLOP/s; L1 14336.0 GB/s, L2 2996.8 GB/s, HBM 828.8 GB/s. Legend: L1, L2, HBM.]
Fig. 9: Hierarchical Roofline of the PyTorch DeepCAM in its
backward pass with AMP O0.
Unlike with TensorFlow, AMP provides more detailed opti-
mization levels for PyTorch. According to the AMP documen-
tation [40], O0 level for PyTorch is usually used to establish
a stable baseline for the auto mixed-precision acceleration,
O1 follows a conservative type conversion and numerical
properties are highly preserved, while O2 implements a more
aggressive FP32 to FP16 conversion and extra care needs to
be taken for model convergence concerns. Our default setting
is O1 and its backward pass performance is shown in Fig. 6.
From the O0 optimization level in Fig. 9 to O1 in Fig. 6, kernel
runtime has been largely reduced and many kernels have been
moved to execute on the Tensor Core, providing a much higher
computational throughput.
D. Zero-AI kernels
[Figure 10: two panels, "#invocations in TensorFlow" and "#invocations in PyTorch", each covering the forward pass and the backward pass.]
Fig. 10: Ratio of zero-AI kernel invocations to the total number
of kernel invocations in two implementations, TensorFlow
DeepCAM and PyTorch DeepCAM.
Compared to traditional HPC applications where users usu-
ally have full control of kernel invocations, high-level Python-
based deep learning frameworks tend to implicitly invoke
many subsidiary kernels, either for data conversion or device-host
transfer purposes. Fig. 10 shows the ratio of these kernel
invocations to the total number of invocations. Around 40-50%
of the invocations are for such zero-AI kernels, where
no floating-point operation is performed. This may not
affect the overall performance much if these kernels
are perfectly overlapped with other kernel executions, but it
is very hard to achieve that in reality. As hardware constantly
evolves, new computer architectures tend to provide higher
and higher FLOP/s performance and bandwidth, but with less
progressive improvement on kernel launch overhead. To avoid
becoming overhead-bound, it is recommended that these deep
learning applications avoid such “implicit” zero-AI kernels as
much as possible.
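The ratio discussed above can be computed from per-invocation FLOP counts. The sketch below uses a hypothetical function name and invented counts; it treats an invocation as zero-AI when it performs no floating-point operations, as defined in the text.

```python
def zero_ai_ratio(flops_per_invocation):
    """Fraction of kernel invocations that perform no floating-point operations."""
    if not flops_per_invocation:
        return 0.0
    zero_ai = sum(1 for flops in flops_per_invocation if flops == 0)
    return zero_ai / len(flops_per_invocation)


# Hypothetical FLOP counts, one entry per kernel invocation.
counts = [0, 1_024, 0, 4_096, 512, 0, 0, 2_048]
print(f"{100 * zero_ai_ratio(counts):.0f}% of invocations are zero-AI")
```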
V. CONCLUSIONS
In this paper, we first revisited the need for mixed-precision
performance analysis and extended ERT to incorporate single-
precision, half-precision and Tensor Core performance mea-
surement. Then, based on the previous nvprof hierarchi-
cal Roofline methodology, we established a new Nsight
Compute methodology to collect Roofline data on NVIDIA
GPUs. In the third part of this paper, we applied this
new methodology to a real-life deep learning benchmark,
DeepCAM, with its two implementations in TensorFlow and
PyTorch. Results show that this new methodology is effective
in analyzing and understanding the performance of deep
learning applications. Many useful insights are shared, for
example on the computational characteristics of different
stages of the training process and on the performance impact
of the automatic mixed precision (AMP) package and zero-AI
kernels, which should be helpful to other deep learning
programmers and framework developers.
In the future, we would like to apply our time-based
Roofline [14] to a real-life deep learning application such as
DeepCAM. The current Nsight Compute methodology will also be
extended to incorporate cross-node performance analysis.
ACKNOWLEDGEMENTS
This material is based upon work supported by the Ad-
vanced Scientific Computing Research Program in the U.S.
Department of Energy, Office of Science, under Award Num-
ber DE-AC02-05CH11231. This research used resources of
the National Energy Research Scientific Computing Center
(NERSC) which is supported by the Office of Science of
the U.S. Department of Energy under Contract No. DE-
AC02-05CH11231. We thank NVIDIA Corporation for their
willingness to answer our myriad of questions on Nsight
metrics.
REFERENCES
[1] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful
visual performance model for floating-point programs and multicore
architectures,” Lawrence Berkeley National Lab.(LBNL), Berkeley, CA
(United States), Tech. Rep., 2009.
[2] “Empirical Roofline Toolkit (ERT),” accessed: 2020-08-01.
[Online]. Available: https://bitbucket.org/berkeleylab/cs-roofline-toolkit/
src/master/
[3] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo,
B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams,
“An Empirical Roofline Methodology for Quantitatively Assessing Per-
formance Portability,” in 2018 IEEE/ACM International Workshop on
Performance, Portability and Productivity in HPC (P3HPC). IEEE,
2018, pp. 14–23.
[4] NERSC Roofline Model Documentation. [Online]. Available: https:
//docs.nersc.gov/development/performance-debugging-tools/roofline/
[5] C. Yang, B. Friesen, T. Kurth, B. Cook, and S. Williams, “Toward
Automated Application Profiling on Cray Systems,” in Cray User Group
Conference (CUG), 2018.
[6] J. R. Madsen, M. G. Awan, H. Brunie, J. Deslippe, R. Gayatri, L. Oliker,
Y. Wang, C. Yang, and S. Williams, “Timemory: Modular Performance
Analysis for HPC,” in International Conference on High Performance
Computing. Springer, 2020, pp. 434–452.
[7] C. Yang, T. Kurth, and S. Williams, “Hierarchical Roofline Analysis
for GPUs: Accelerating Performance Optimization for the NERSC-9
Perlmutter System,” Concurrency and Computation: Practice and
Experience, p. e5547, 2019. [Online]. Available: https://doi.org/10.
1002/cpe.5547
[8] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth,
M. Lobet, T. Malas, J.-L. Vay, and H. Vincenti, “Applying the roofline
performance model to the intel xeon phi knights landing processor,” in
International Conference on High Performance Computing. Springer,
2016, pp. 339–353.
[9] T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry,
Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and
S. Williams, “A Novel Multi-Level Integrated Roofline Model Approach
for Performance Characterization,” in International Conference on High
Performance Computing. Springer, 2018, pp. 226–245.
[10] M. Del Ben, C. Yang, S. Louie, and J. Deslippe, “Accelerating Large-
Scale GW Calculations on Hybrid GPU-CPU Systems,” Bulletin of the
American Physical Society, vol. 65, 2020.
[11] R. Gayatri, C. Yang, T. Kurth, and J. Deslippe, “A Case Study For
Performance Portability Using OpenMP 4.5,” in International Workshop
on Accelerator Programming Using Directives. Springer, 2018, pp. 75–
95.
[12] C. Yang. 8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU:
Roofline Analysis and Other Tricks. [Online]. Available: https:
//arxiv.org/abs/2008.11326
[13] M. H. Javed, K. Z. Ibrahim, and X. Lu, “Performance analysis of deep
learning workloads using roofline trajectories,” CCF Transactions on
High Performance Computing, vol. 1, no. 3, pp. 224–239, 2019.
[14] Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams,
“Time-Based Roofline for Deep Learning Performance Analysis,”
in 2020 IEEE/ACM Deep Learning on Supercomputers Workshop.
(Submitted). [Online]. Available: https://arxiv.org/abs/2009.04598
[15] N. Ding and S. Williams, “An Instruction Roofline Model for GPUs,” in
2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation
of High Performance Computer Systems (PMBS). IEEE, 2019, pp.
7–18.
[16] K. Z. Ibrahim, S. Williams, and L. Oliker, “Performance Analysis of
GPU Programming Models Using the Roofline Scaling Trajectories,” in
International Symposium on Benchmarking, Measuring and Optimiza-
tion. Springer, 2019, pp. 3–19.
[17] J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, “A Roofline Model of
Energy,” in 2013 IEEE 27th International Symposium on Parallel and
Distributed Processing, 2013, pp. 661–672.
[18] A. Lopes, F. Pratas, L. Sousa, and A. Ilic, “Exploring GPU Performance,
Power And Energy-Efficiency Bounds with Cache-aware Roofline Mod-
eling,” in 2017 IEEE International Symposium on Performance Analysis
of Systems and Software (ISPASS), 2017, pp. 259–268.
[19] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech,
and time series,” The handbook of brain theory and neural networks,
vol. 3361, no. 10, p. 1995, 1995.
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in neural information processing systems, 2014, pp. 2672–
2680.
[21] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips,
A. Mahesh, M. Matheson, J. Deslippe, M. Fatica et al., “Exascale
deep learning for climate analytics,” in SC18: International Conference
for High Performance Computing, Networking, Storage and Analysis.
IEEE, 2018, pp. 649–660.
[22] W. Joubert, D. Weighill, D. Kainer, S. Climer, A. Justice, K. Fagnan,
and D. Jacobson, “Attacking the opioid epidemic: Determining the
epistatic and pleiotropic genetic architectures for chronic pain and opioid
addiction,” in SC18: International Conference for High Performance
Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 717–
730.
[23] “Nsight compute cli - metric comparison,” accessed: 2020-
08-01. [Online]. Available: https://docs.nvidia.com/nsight-compute/
NsightComputeCli/index.html#nvprof-metric-comparison
[24] NVIDIA, “NVIDIA Tesla V100 GPU architecture: The world’s most
advanced data center GPU,” White Paper WP-08608-001 v1.1, Aug. 2017.
[25] CUDA C++ wmma API. [Online]. Available: https://docs.nvidia.com/
cuda/cuda-c-programming-guide/index.html
[26] CUDA cuBLAS Library. [Online]. Available: https://docs.nvidia.com/
cuda/cublas/index.html
[27] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” arXiv
preprint arXiv:1410.0759, 2014.
[28] “Profiler user’s guide,” accessed: 2020-08-01. [Online]. Available:
https://docs.nvidia.com/cuda/profiler-users-guide/
[29] “Nvidia developer tools overview,” accessed: 2020-08-01. [Online].
Available: https://developer.nvidia.com/tools-overview
[30] “Roofline Methodology on NVIDIA GPUs.” [Online]. Available:
https://gitlab.com/NERSC/roofline-on-nvidia-gpus
[31] C. Yang. Hierarchical Roofline Analysis: How to Collect Data using
Performance Tools on Intel CPUs and NVIDIA GPUs. [Online].
Available: https://arxiv.org/abs/2009.02449
[32] NVIDIA CUPTI API reference guide. [Online]. Available: https:
//docs.nvidia.com/cupti/Cupti/
[33] PerfWorks measurement library for Nsight Compute. [Online].
Available: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.
html#metrics-structure
[34] Climate Segmentation Application in TensorFlow and PyTorch. [Online].
Available: https://github.com/PointKernel/climate-seg-benchmark
[35] Deep Learning Climate Segmentation Benchmark. [Online]. Available:
https://github.com/cyanguwa/DeepLearningProfiling
[36] MLPerf Benchmark. [Online]. Available: https://mlperf.org
[37] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, “Cupy: A
numpy-compatible library for nvidia gpu calculations,” in Proceedings
of Workshop on Machine Learning Systems (LearningSys) in The Thirty-
first Annual Conference on Neural Information Processing Systems
(NIPS), 2017.
[38] Deterministic Profiling for TensorFlow. [Online]. Available: https:
//github.com/NVIDIA/tensorflow-determinism
[39] DeepCAM code and results (TensorFlow and PyTorch). [Online].
Available: https://github.com/cyanguwa/DeepLearningProfiling
[40] “apex.amp,” accessed: 2020-08-01. [Online]. Available: https://nvidia.
github.io/apex/amp
