8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other
  Tricks by Yang, Charlene
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory (LBNL)
Berkeley, California 94720, USA
cjyang@lbl.gov
Abstract—Performance optimization can be a daunting task
especially as the hardware architecture becomes more and more
complex. This paper takes a kernel from the Materials Science
code BerkeleyGW, and demonstrates a few performance analysis
and optimization techniques. Despite challenges such as high
register usage, low occupancy, complex data access patterns,
and the existence of several long-latency instructions, we have
achieved 3.7 TFLOP/s of double-precision performance on an
NVIDIA V100 GPU, with 8 optimization steps. This is 55% of
the theoretical peak, 6.7 TFLOP/s, at nominal frequency 1312
MHz, and 70% of the more customized peak based on our
58% FMA ratio, 5.3 TFLOP/s. An array of techniques used to
analyze this OpenACC kernel and optimize its performance are
shown, including the use of hierarchical Roofline performance
model and the performance tool Nsight Compute. This kernel
exhibits computational characteristics that are commonly seen
in many high-performance computing (HPC) applications, so
these techniques are expected to be helpful to a general audience
of HPC developers and computational scientists as they pursue
more performance on NVIDIA GPUs.
Index Terms—NVIDIA GPU, hierarchical Roofline analysis,
Nsight Compute, performance optimization
I. INTRODUCTION
The Roofline performance model [1] has gained considerable
popularity in recent years for performance characterization,
analysis and optimization in high-performance computing (HPC).
It provides useful insights into machine characteristics,
bottlenecks of application performance, and performance
optimization strategies. Thanks to the community’s research
interest in Roofline, the model has been expanded to characterize
the full
memory hierarchy, instead of focusing on the highest level
cache/memory only. Methodologies for collecting Roofline
data have been proposed on Intel CPUs [2] and NVIDIA
GPUs [3], [4], and they have been integrated into production
tools Intel Advisor [5] and NVIDIA Nsight Compute [6]
respectively. The hierarchical Roofline helps understand cache
reuse and data locality, providing even more insights into how
efficiently the code is using the memory subsystem.
To facilitate the Roofline study, a range of tools have sprung
to life, for more accurate machine characterization such as
the Empirical Roofline Toolkit (ERT) [7], [8], and for more
streamlined methods to collect Roofline performance data
using open-source tools or workflows [3], [9]–[11]. Other
than tools development, there are also many studies on the
application of the Roofline model in both traditional HPC [3],
[12]–[14] and the new, emerging field of Machine Learning
[3], [15], [16], and the extension and refinement of the model,
such as instruction Roofline [17], Roofline scaling trajectories
[18], performance portability based on Roofline [8], and power
and energy Roofline [19], [20].
In this paper, we focus on a General Plasmon Pole (GPP)
kernel from the BerkeleyGW code and discuss what dif-
ficulties scientific HPC codes are usually faced with and
what performance analysis and optimization techniques can
be employed to achieve good performance on NVIDIA GPUs.
To that end, we will use the Roofline model for high-level
performance analysis [3], and the performance tool Nsight
Compute [21] for more detailed performance data collection.
The 8 optimization steps taken to speed up the GPP kernel
by 3 times include replacing long-latency instructions with
shorter ones, rearranging loops to gain arithmetic intensity,
reducing branching, and cache blocking. With these steps, we
have achieved 3.7 TFLOP/s double precision performance on
an NVIDIA V100 GPU, which is 55% of the theoretical peak
6.7 TFLOP/s at nominal frequency 1312 MHz, and 70% of
the more customized peak based on our 58% FMA ratio, 5.3
TFLOP/s. This is despite challenges such as high register
usage, low occupancy, complex data access patterns, and the
existence of several long-latency instructions. An array of
techniques used to analyze this OpenACC kernel and optimize
its performance is demonstrated. Because this kernel shares
its computational characteristics with many other HPC
applications, these techniques are expected to be helpful to
a large segment of the HPC audience, as computational
scientists and programmers pursue performance optimization
on NVIDIA GPUs.
The rest of the paper is organized as follows. Section II
will describe the GPP kernel and its implementation in detail,
Roofline data collection methodology used in this paper, and
the machine configuration. Section III will then discuss the 8
optimization steps taken to improve the kernel’s performance
from 2.3 TFLOP/s to 3.7 TFLOP/s on an NVIDIA V100
GPU. A combination of hierarchical Roofline analysis and
the performance tool Nsight Compute is employed, and a
more customized Roofline ceiling based on our FMA ratio is
presented. Finally, Section IV will draw conclusions on lessons
learned through this study, which will be helpful to many GPU
developers and computational scientists.
arXiv:2008.11326v2 [cs.DC] 3 Sep 2020
II. APPLICATION AND METHODOLOGY
A. General Plasmon Pole (GPP) Kernel
The GPP kernel [4] is abstracted from the Sigma module
of the Materials Science code BerkeleyGW [22], commonly
used for self-energy calculations in electronic structure studies.
It is written in Fortran and parallelized with OpenACC, and
the computation involved in this kernel represents work that
typically is performed on an individual MPI task in a much
larger calculation. The computation is tensor-contraction like,
and several pre-calculated complex double precision arrays are
multiplied and summed over certain dimensions and collapsed
into a small vector.
We use two benchmark systems for this study, Si-214 and
Si-510, which are silicon systems with 214 and 510 atoms,
respectively. The pseudo code of the kernel is shown below, and
the magnitudes of the different loops are listed for the Si-214
system. The Si-510 system is 3 to 4 times larger on the band,
igp, and ig loops than Si-214.
start timer
do band = 1, nbands        ! O(1000)
  do igp = 1, ngpown       ! O(1000)
    do ig = 1, ncouls      ! O(10000)
      do iw = 1, nw        ! small, nw=2
        load wtilde_array(ig,igp)
        load aqsntemp(ig,band)
        load eps(ig,igp)
        compute wdiff, delw, sch_array, ssx_array
        reduce on achtemp(iw), asxtemp(iw)
      end do
    end do
  end do
end do
end timer
Some computational characteristics of the GPP kernel are:
1) there is abundant parallelism, enough to saturate the GPU
for efficient utilization; 2) most arrays in this kernel are
in double precision, either real or complex numbers; 3)
dimensions of the arrays encompass almost all combinations
of indices band, igp and ig, possibly posing a challenge
in ensuring memory coalescence between threads in a warp,
and in the effectiveness of cache blocking; 4) among many
multiplications, additions, and fused multiply-adds
(FMAs), there are a few divides and abs() instructions,
which have a longer latency and could potentially hurt
the performance; 5) the reduction runs across all three loops
band, igp, and ig, and accumulates results to the iw-th
element of arrays achtemp and asxtemp. At the time of
writing, OpenACC does not support array reductions, so an
alternative is needed. Fortunately, the loop count nw is
very small and fixed, so scalar reductions can be used for each
iw explicitly. In PGI’s implementation of OpenACC, scalar
reductions are usually done in a separate kernel after the
main computational kernel, with much shorter computation
and far fewer threads. Two kernels will therefore appear
in the Nsight Compute profile, but we will only focus on
the compute kernel for the bulk of this performance study. The
runtime and throughput calculations (Tab. I, Fig. 1,
Fig. 3, Fig. 5 and Fig. 6), however, are still for the entire
calculation, as indicated by the timers in the pseudo code above.
B. Roofline Data Collection Methodology
The Roofline model [1] characterizes application perfor-
mance based on two quantities, arithmetic intensity (AI) and
floating-point operations per second (FLOP/s) performance. To
calculate these quantities, we need to collect the runtime, total
FLOPs performed (the count, not the throughput per second),
and the data movement (in bytes). The hierarchical Roofline
model [3] looks at data transactions between each pair of
memory/cache levels, and on NVIDIA GPUs, we particularly
focus on data transactions between these three levels: device
memory (or HBM), L2 cache, and L1 cache.
An nvprof based methodology for collecting Roofline data
is presented in [3], and a more up-to-date version based on
Nsight Compute is presented in [4]. Nsight Compute is
the NVIDIA performance tool that will replace nvprof in the
future, and it has incorporated an HBM-only Roofline analysis
feature in the CUDA 11 release. In this paper, we will employ
the methodology presented in [4] for hierarchical Roofline
analysis across three levels, HBM, L2, and L1, and particularly
the arithmetic intensity (AI) and FLOP/s performance are
calculated as follows.
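$$\mathrm{AI}_{\langle\text{precision}\rangle,\langle\text{level}\rangle} = \frac{\text{ncu FLOPs}_{\langle\text{precision}\rangle}}{\text{ncu Bytes}_{\langle\text{level}\rangle}} \tag{1}$$

$$\text{FLOP/s}_{\langle\text{precision}\rangle} = \frac{\text{ncu FLOPs}_{\langle\text{precision}\rangle}}{\text{ncu Runtime}} \tag{2}$$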
where <level> is L1, L2 or HBM, and <precision> is
FP64 since the GPP kernel mostly performs double precision
calculations. The Nsight Compute based methodology in [4]
collects far fewer raw metrics and hardware counters than
the nvprof based one in [3]. It potentially incurs lower
overhead during profiling, and is thus the recommended
method.
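As a small illustration of Eq. (1) and Eq. (2), the sketch below turns a FLOP count, per-level byte counts, and a runtime into one Roofline point per memory level. The function name and all sample numbers are hypothetical placeholders chosen for easy arithmetic. They are not measurements from this study, and the sketch does not reproduce the metric collection scripts in [4].

```python
def roofline_point(flops, bytes_moved, runtime_s):
    """Return (arithmetic intensity in FLOPs/Byte, performance in FLOP/s)."""
    arithmetic_intensity = flops / bytes_moved   # Eq. (1)
    flops_per_second = flops / runtime_s         # Eq. (2)
    return arithmetic_intensity, flops_per_second


# Hypothetical profiler totals for one kernel (illustrative values only).
fp64_flops = 4.0e12
bytes_by_level = {"L1": 2.0e12, "L2": 1.0e12, "HBM": 5.0e11}
runtime_s = 1.6

# One (AI, FLOP/s) point per memory level; FLOP/s is the same for all levels.
points = {
    level: roofline_point(fp64_flops, nbytes, runtime_s)
    for level, nbytes in bytes_by_level.items()
}

for level, (ai, perf) in points.items():
    print(f"{level}: AI = {ai:.2f} FLOPs/Byte, {perf / 1e12:.2f} TFLOP/s")
```

All three points share the same height on the chart, and the horizontal gaps between them reflect how much traffic each cache level filters out.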
C. Machine Configuration
All the studies in this paper are conducted on the GPU
partition of the Cori supercomputer at the National Energy
Research Scientific Computing Center (NERSC), at Lawrence
Berkeley National Laboratory (LBNL). Cori GPU is primarily
deployed for GPU porting, benchmarking, and testing efforts
in the NERSC Exascale Science Applications Program (NE-
SAP). There are 18 GPU nodes in this partition, and each
node contains two sockets of 20-core Intel Xeon Gold 6148
Skylake CPUs, 384 GB DDR4 memory, 930 GB on-node
NVMe storage, and 8 NVIDIA V100 Volta GPUs. Each of
the GPUs has 80 Streaming Multiprocessors (SMs), 16 GB
HBM2 memory, and is connected to the other seven in a
‘hybrid cube-mesh’ topology. Each SM on a Volta has 32 FP64
cores, and clocked at the nominal frequency of 1312 MHz, the
entire GPU delivers a theoretical peak of 6.7 TFLOP/s double
precision performance.
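The theoretical peak quoted above follows from the SM count, the FP64 core count, and the clock frequency. The sketch below spells out that arithmetic, under the assumption that each FP64 core retires one FMA (counted as 2 FLOPs) per cycle. The variable names are ours.

```python
# Hardware figures as stated in the text for one V100 GPU.
num_sms = 80
fp64_cores_per_sm = 32
clock_hz = 1312e6            # nominal frequency, 1312 MHz

# Assumption: one FMA per core per cycle, counted as 2 FLOPs.
flops_per_core_per_cycle = 2

peak_flops = num_sms * fp64_cores_per_sm * flops_per_core_per_cycle * clock_hz
peak_tflops = peak_flops / 1e12

print(f"Theoretical FP64 peak: {peak_tflops:.2f} TFLOP/s")
```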
III. OPTIMIZATION JOURNEY
In this section, we will discuss 8 steps we took to opti-
mize the GPP kernel and improve its performance from 2.3
TFLOP/s to 3.7 TFLOP/s (double precision) on an NVIDIA
V100 GPU. Two benchmark systems, Si-214 and Si-510, are
used to validate these optimizations, v1 to v8, and the runtime
and TFLOP/s performance for each step are listed in Tab. I.
TABLE I
OPTIMIZATION PATH OF THE GPP KERNEL ON AN NVIDIA V100 GPU
Version   Si-214             Si-510              Warps    Registers
          Time     TFLOP/s   Time      TFLOP/s   per SM   per thread
v0        1.691    2.337     24.705    2.216     12       154
v1        1.106    2.629     13.269    2.526     12       160
v2        1.098    2.628     13.260    2.525     12       160
v3        0.987    2.647     11.983    2.543     12       154
v4        0.977    2.754     11.246    2.641     8        170
v5        0.873    2.901     10.257    2.741     12       136
v6        1.022    2.392     11.923    2.313     8        178
v7        0.996    2.548     10.901    2.550     8        184
v8        0.717    3.710     7.565     3.638     16       128
The GPP kernel has a very high register usage throughout
these optimization steps, and we have recorded its number of
registers required per thread, and the actual number of active
warps per SM, in Tab. I. Note that these numbers are the same
for both benchmark systems, Si-214 and Si-510.
This optimization process is a balancing act between register
usage, SM occupancy, memory bandwidth usage, memory
access pattern for different arrays, instruction latency, and
arithmetic intensity, and we will discuss the 8 optimization
steps in detail in the following subsections. The Roofline charts
and Nsight Compute results in Fig. 1-8 are for the Si-214
benchmark; however, Si-510 presents a very similar profile, as
can be seen by the trajectory in Tab. I (highlighted in gray).
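To make the trajectory in Tab. I easier to read, the sketch below computes the cumulative speedup of each version over v0 from the runtimes in the table. The runtimes are copied from Tab. I, and the helper name is ours.

```python
versions = ["v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"]

# Runtimes copied from the "Time" columns of Tab. I.
time_si214 = [1.691, 1.106, 1.098, 0.987, 0.977, 0.873, 1.022, 0.996, 0.717]
time_si510 = [24.705, 13.269, 13.260, 11.983, 11.246, 10.257, 11.923, 10.901, 7.565]


def cumulative_speedup(times):
    """Speedup of every version relative to the first entry (v0)."""
    return [times[0] / t for t in times]


speedup_si214 = cumulative_speedup(time_si214)
speedup_si510 = cumulative_speedup(time_si510)

for name, s214, s510 in zip(versions, speedup_si214, speedup_si510):
    print(f"{name}: Si-214 {s214:.2f}x, Si-510 {s510:.2f}x")
```

The temporary regression at v6 shows up as a dip in both columns before v8 recovers it.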
v0. Baseline
The baseline version of the GPP kernel is a collapse(3)
of the band, igp, and ig loops, and the iw loop is unrolled
on each thread during the reduction. The reason behind this
design for an initial CPU-to-GPU port is that it creates ample
parallelism to fully utilize the available threads on a GPU.
This version of the kernel has a double precision performance
of 2.3 TFLOP/s as shown in Fig. 1, and there is little to no
cache reuse between L1, L2 and HBM levels, as seen by the
gaps between dots of the same color on the Roofline chart.
!$acc parallel loop gang vector &
!$acc reduction(+:...) collapse(3)
do band = 1, nbands          ! O(1000)
  do igp = 1, ngpown         ! O(1000)
    do ig = 1, ncouls        ! O(10000)
      do iw = 1, nw          ! small, nw=2
        ...
v1. Replace divides
There are a few divide instructions in the GPP kernel,
operating on complex double precision numbers. These
instructions have a very high execution latency, lower the
warp issue rate, and can hurt our performance. The division
of two complex numbers ultimately ends up as a floating
point divide, e.g. div.rn.f64, and this instruction requires
more cycles than a normal multiplication, addition, or FMA
(fused multiplication and addition). NVIDIA GPUs have a
faster reciprocal instruction, e.g. rcp.rn.f64, and here we
will try to coax the compiler into generating these reciprocals
instead of divides.
Fig. 1. Hierarchical Roofline analysis of v0 and v1 for Si-214
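The algebra behind this step can be sketched as follows: a complex division x / y is rewritten as x * conj(y) * (1 / |y|^2), so that only one real reciprocal is needed and the rest is multiplications. This is an illustration of the identity in Python, not the source change shown in Fig. 2, and the function names are ours. Whether a compiler actually emits a reciprocal instruction for the rewritten form depends on the compiler and its flags.

```python
def divide_direct(x: complex, y: complex) -> complex:
    """Reference: plain complex division."""
    return x / y


def divide_via_reciprocal(x: complex, y: complex) -> complex:
    """x / y computed with a single real reciprocal and multiplications."""
    inv_norm = 1.0 / (y.real * y.real + y.imag * y.imag)  # one real reciprocal
    return x * y.conjugate() * inv_norm


x = 3.0 + 4.0j
y = 1.0 - 2.0j
print(divide_direct(x, y), divide_via_reciprocal(x, y))
```

The two forms agree up to floating-point rounding, which should be checked against the accuracy requirements of the application before adopting the rewrite.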
Fig. 2 shows the code difference before and after this
optimization, and the Nsight Compute profile for the number
of active warp samples that land on these lines of code. It is
evident that the number has dropped significantly, and there is
a 13% improvement on the compute throughput, as shown in
Fig. 1. The kernel also moved to be more bandwidth bound
on the Roofline, which is expected as more data needs to be
drawn in order to satisfy the need of the faster computation.
We will observe this pattern again in v3.
Fig. 2. Replacing divides with reciprocals (Top: v0, Bottom: v1)
Usually, performance dots on the Roofline chart move quite
often as we optimize the code. Even though the general
narrative for Roofline based optimization process is to move
these dots rightward (more compute bound) and upward
(higher throughput, more performance), we do see these dots
sometimes become more bandwidth bound, and even move
downwards, as a temporary regression in performance. However,
as long as there is a clear philosophy and goal behind
the design of these optimizations, the performance can be
recovered with later optimizations. Steps v6, v7 and v8 will
serve as a great example of this.
v2. Reduce branching
Even though GPU architectures have evolved to support
branching more efficiently, it is still worth reducing the un-
necessary branches to avoid thread divergence within a warp.
In GPP, there is a 3-way branching code block shown below,
and in this case, it can be further simplified to be 2-way
as demonstrated in the !After section. Even though this
optimization did not bring any performance gain for GPP as
shown in Fig. 3, it has been shown to be beneficial to other
kernels in the full BerkeleyGW code, and is generally advisable.
! Before
if (wdiffr > limittwo .and. delwr < limitone) then
  calculate sch, ssx
else if (delwr > TOL_Zero) then
  calculate sch, ssx
else
  sch = 0.0d0
  ssx = 0.0d0
endif

! After
sch = 0.0d0
ssx = 0.0d0
if (wdiffr > limittwo .and. delwr < limitone) then
  calculate sch, ssx
else if (delwr > TOL_Zero) then
  calculate sch, ssx
endif
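The restructuring above can be checked for equivalence with a small sketch. The two "calculate sch, ssx" bodies are elided in the pseudo code, so branch_a and branch_b below are hypothetical stand-ins, and the threshold values are likewise made up for illustration.

```python
# Hypothetical thresholds; the real values are not given in the text.
LIMITTWO, LIMITONE, TOL_ZERO = 0.25, 4.0, 1.0e-12


def branch_a(wdiffr, delwr):
    """Hypothetical stand-in for the first 'calculate sch, ssx'."""
    return wdiffr + delwr, wdiffr - delwr


def branch_b(wdiffr, delwr):
    """Hypothetical stand-in for the second 'calculate sch, ssx'."""
    return wdiffr * delwr, wdiffr / delwr


def before(wdiffr, delwr):
    if wdiffr > LIMITTWO and delwr < LIMITONE:
        sch, ssx = branch_a(wdiffr, delwr)
    elif delwr > TOL_ZERO:
        sch, ssx = branch_b(wdiffr, delwr)
    else:
        sch, ssx = 0.0, 0.0
    return sch, ssx


def after(wdiffr, delwr):
    sch, ssx = 0.0, 0.0          # default assigned up front, no third branch
    if wdiffr > LIMITTWO and delwr < LIMITONE:
        sch, ssx = branch_a(wdiffr, delwr)
    elif delwr > TOL_ZERO:
        sch, ssx = branch_b(wdiffr, delwr)
    return sch, ssx


samples = [(1.0, 0.5), (0.1, 2.0), (0.1, 0.0), (1.0, 8.0), (0.0, 0.0)]
results_match = all(before(w, d) == after(w, d) for w, d in samples)
print(results_match)
```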
v3. Replace abs()
As alluded to earlier in v1, abs() is also generally more
expensive than a common multiply or add instruction, and in
the case of GPP, abs() is only used for condition evaluation
in if/else statements. The explicit results are not needed,
so we can replace these calls with squared quantities, i.e.
abs(a) * b < c is equivalent to a * conj(a) * b^2 < c^2 for a
complex number a, provided that b and c are non-negative. This
helps reduce the instruction latency and has proven to be
beneficial in both the Nsight Compute profile in Fig. 4 and on
the Roofline chart in Fig. 3.
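The sketch below checks the squared form of the comparison against the abs() form. It also shows why the rewrite relies on b and c being non-negative: squaring both sides of an inequality only preserves it when both sides are non-negative. The function names and sample values are ours, not taken from the GPP source.

```python
def cond_with_abs(a: complex, b: float, c: float) -> bool:
    """Original form of the condition, using abs()."""
    return abs(a) * b < c


def cond_squared(a: complex, b: float, c: float) -> bool:
    """Rewritten form: a * conj(a) * b^2 < c^2, no square root needed."""
    return (a * a.conjugate()).real * b * b < c * c


a = 3.0 + 4.0j                      # abs(a) = 5

# Non-negative b and c: both forms agree.
agree_true = cond_with_abs(a, 2.0, 11.0) == cond_squared(a, 2.0, 11.0)
agree_false = cond_with_abs(a, 2.0, 9.0) == cond_squared(a, 2.0, 9.0)

# Negative c: the squared form is no longer equivalent.
differs_for_negative_c = cond_with_abs(a, 2.0, -11.0) != cond_squared(a, 2.0, -11.0)

print(agree_true, agree_false, differs_for_negative_c)
```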
Fig. 3. Hierarchical Roofline analysis of v1, v2 and v3 for Si-214
Fig. 4. Replacing abs() with power 2 calculation (Top: v2, Bottom: v3)
v4. Increase arithmetic intensity
Up to this point, the GPP kernel has been operating in an
HBM bandwidth bound region on Roofline, with an arith-
metic intensity just below the ‘machine balance’ point (7.4
FLOPs/Byte on V100 for double precision and for the HBM
level). Since there is abundant parallelism and any two of the
loops, band, igp and ig, can still provide enough parallelism
to saturate a GPU, we can try collapsing only two of them and
leaving the third to run sequentially on each thread. This will
increase the data reuse of certain arrays, increase the arithmetic
intensity for the kernel, and move the kernel to a more compute
bound region. With a higher ceiling in the compute bound
region (compared to the bandwidth bound region), we are
more likely to reach higher performance if we can utilize the
computational side of resources well.
However, the selection of the third loop needs to be done
with care. There are a few multi-dimensional arrays in the
kernel, and serializing any of the loop indices could potentially
break the memory coalescence for some of them. Thorough
investigation shows that band has the fewest arrays
that use it as a non-first index, making it the logical
choice. With band serialized, memory access for arrays such
as wtilde_array(ig,igp), I_eps_array(ig,igp)
and aqsntemp(ig,band) is still coalesced, because their
fastest moving index ig is still mapped to the thread ID (note
that Fortran is column major). In OpenACC, this serialization
can be done as below.
!$acc parallel loop gang vector &
!$acc reduction(+:...) collapse(2)
do igp = 1, ngpown             ! O(1000)
  do ig = 1, ncouls            ! O(10000)
    !$acc loop seq
    do band = 1, nbands        ! O(1000)
      do iw = 1, nw            ! small, nw=2
        ...
Thanks to this optimization, Fig. 5 shows a better separation
between the L1 and L2 performance dots on Roofline (cyan),
and the overall throughput has improved as well, from 2.6
TFLOP/s to 2.7 TFLOP/s. The wider gaps between these dots
suggest better cache reuse and data locality in L1, compared
to previous versions, and with an arithmetic intensity of 12.5
FLOPs/Byte on the HBM level, it moves us back to the
compute bound region, with more headroom to optimize for.
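The notion of 'machine balance' used in this step can be made concrete with the basic Roofline bound, min(peak, AI x bandwidth). In the sketch below the HBM bandwidth is not a measured value: it is back-computed from the 6.7 TFLOP/s peak and the 7.4 FLOPs/Byte machine balance quoted in the text, and the function name is ours.

```python
PEAK_TFLOPS = 6.7                 # theoretical FP64 peak quoted in the text
MACHINE_BALANCE = 7.4             # FLOPs/Byte at the HBM level, from the text

# Assumption: bandwidth implied by the two numbers above, in TB/s.
HBM_BANDWIDTH_TBS = PEAK_TFLOPS / MACHINE_BALANCE


def attainable_tflops(arithmetic_intensity):
    """Roofline bound for a given HBM arithmetic intensity (FLOPs/Byte)."""
    return min(PEAK_TFLOPS, arithmetic_intensity * HBM_BANDWIDTH_TBS)


for ai in (3.7, 7.4, 12.5):
    region = "compute bound" if ai >= MACHINE_BALANCE else "bandwidth bound"
    print(f"AI = {ai:5.1f} FLOPs/Byte -> {attainable_tflops(ai):.2f} TFLOP/s ({region})")
```

At 12.5 FLOPs/Byte the bound is the flat compute ceiling, which is why v4 has more headroom than the earlier bandwidth bound versions.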
v5. Reduce branching, again
The baseline code v0 unrolled the iw loop, and the reduction
is done over four scalars because OpenACC
does not support array reductions at the time of writing.
However, this constant branching creates warp divergence
(especially when nw > 2), increases register usage (which
limits the number of active warps per SM), and requires more
thread block synchronization as well. In this step, we will
move the iw loop outside the kernel, and reduce on a set of
two scalars in a separate kernel. This may result in redundant
computation in both kernels (nw = 2); however, the reduced
register usage and increased occupancy (as seen in Tab. I) have
proven to outweigh the duplication of certain calculations, as
shown in Fig. 5. The throughput has increased a little, and
the kernel has moved to the left, back to the HBM bandwidth
bound region. We will attempt to tile the cache accesses to
increase the arithmetic intensity in the next three steps, to
move the kernel back to the compute bound region and to
improve performance.
Fig. 5. Hierarchical Roofline analysis of v3, v4 and v5 for Si-214
v6. Cache blocking
Cache blocking (or loop tiling) is a technique used to
rearrange data access to pull subsets (blocks) of data into cache
and to operate on this block to avoid repeatedly evicting it and
fetching it back from more remote memory such as the main
memory. This helps data reuse and gains cache locality. In
this step, we apply the blocking technique to ig and band
loops, as shown in the pseudo code below. Arrays with ig
and igp indices can be shared as different band iterations
are executed, and different threads can share the same band
related arrays as well.
ig_blksize = 128
band_blksize = 64
do iw = nstart, nend
  !$acc parallel loop gang vector &
  !$acc reduction(+:...) collapse(3)
  do band_blk = 1, band_blksize
    do igp = 1, ngpown
      do ig_blk = 1, ig_blksize
        !$acc loop seq
        do ig = ig_blk, ncouls, ig_blksize
          !$acc loop seq
          do band = band_blk, nbands, band_blksize
            ...
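A quick way to convince oneself that the strided blocking above still visits every (ig, band) pair exactly once is to enumerate the loop nest for small sizes. The sketch below mirrors the loop bounds of the pseudo code with 1-based ranges. The igp loop is omitted because it is not blocked, and the problem sizes are deliberately tiny and hypothetical.

```python
def blocked_pairs(ncouls, nbands, ig_blksize, band_blksize):
    """Enumerate (ig, band) pairs in the order of the blocked loop nest."""
    pairs = []
    for band_blk in range(1, band_blksize + 1):
        for ig_blk in range(1, ig_blksize + 1):
            # Sequential strided loops, as in the two 'loop seq' levels.
            for ig in range(ig_blk, ncouls + 1, ig_blksize):
                for band in range(band_blk, nbands + 1, band_blksize):
                    pairs.append((ig, band))
    return pairs


# Tiny hypothetical sizes that do not divide evenly, to exercise the remainders.
ncouls, nbands = 10, 7
pairs = blocked_pairs(ncouls, nbands, ig_blksize=4, band_blksize=3)

expected = {(ig, band) for ig in range(1, ncouls + 1) for band in range(1, nbands + 1)}
print(len(pairs), len(set(pairs)), set(pairs) == expected)
```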
As a first attempt, this did not bring any performance
benefit and instead caused a degradation in performance, as
shown in Fig. 6. However, the L1 and L2 cache locality has increased
significantly, as seen by the wider gaps between L1 and L2
dots, and between L2 and HBM dots. It has moved the kernel
back to the compute bound region, and in the next two steps,
we will try to make the kernel utilize the compute resources
more efficiently, to move the dots upward on Roofline.
v7. Swap array index
To comply with the new memory access pattern after
applying cache blocking, adjustment of the memory layout
for certain arrays may be necessary. For example, we need to
swap the two indices of arrays wx_array_t(iw,band)
and aqsmtemp_local(igp,band), to ensure that they
are still accessed contiguously along the band index. This
is different from ensuring memory coalescence among threads
in a warp (which sometimes needs to be re-examined too),
but it focuses on how these arrays are accessed between
different iterations of band on the same thread. As shown in
Fig. 6, this optimization has not provided much performance
improvement, but it is necessary for the next optimization to
take effect.
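The effect of the index swap on a single thread's access stride can be seen from the column-major address formula. The sketch below assumes that wx_array_t(iw,band) as written above is the layout before the swap and that (band,iw) is the layout after it. The array sizes are small hypothetical values and the helper name is ours.

```python
def column_major_offset(i, j, leading_dim):
    """0-based linear offset of the 1-based Fortran element (i, j)."""
    return (i - 1) + (j - 1) * leading_dim


nw, nbands = 2, 8   # hypothetical sizes for illustration

# Before the swap: wx_array_t(iw, band). Stepping band by one at fixed iw
# jumps over a whole column of length nw.
stride_before = column_major_offset(1, 2, nw) - column_major_offset(1, 1, nw)

# After the swap: wx_array_t(band, iw). Stepping band by one at fixed iw
# moves to the adjacent element.
stride_after = column_major_offset(2, 1, nbands) - column_major_offset(1, 1, nbands)

print(stride_before, stride_after)
```

For an array such as aqsmtemp_local(igp,band), the same formula gives a stride equal to the igp extent before the swap, which is far larger than the stride of nw seen here.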
Fig. 6. Hierarchical Roofline analysis of v5 to v8 for Si-214
v8. Adjust thread block size
Currently, we are well in the compute bound region, and
to improve the performance further, we should focus on the
compute related resources, such as whether we have enough
concurrency on each SM. As seen in Tab. I, the GPP kernel
has had a very high register usage throughout this process. In
the case of v7, there are 184 registers required by each thread,
and that has limited the number of active warps per SM to 8,
which is only 2 thread blocks in this OpenACC kernel (the
default vector_length is 128 threads, i.e. 4 warps).
This is a relatively low occupancy, considering that each SM
can execute 2048 threads concurrently on a V100, which is 64
warps. This low occupancy could reduce our latency-hiding
ability on the GPU and be detrimental to the performance.
In the case of GPP, we already have a lot of long-latency
instructions such as reciprocals, which are ‘long’ compared
to the regular multiplies and adds. Also, the bulk of our
operations are in double precision. There are 64 FP32 cores, 64
Int32 cores and 32 FP64 cores on each SM, and this disparity
in core count naturally dictates that FP64 instructions have
longer execution latency than their counterpart FP32 or Int32
instructions. Switching to FP32 for self-energy calculation is
not an option though, because of the high accuracy required,
as in many other scientific applications.
However, to increase the occupancy, we can try manually
limiting the number of registers per thread by either specifying
-Mcuda=maxregcount:128 to the PGI compiler, or by
adding an OpenACC clause vector_length(512) to the
parallel construct in the code. For GPP, both cases will
result in the same configuration of 512 threads per SM (i.e.
16 warps). This is double the number of warps we had before,
and is a significant improvement in occupancy. However, the
register spills created due to the squashing of the register usage
can be a concern.
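The relation between register usage and active warps can be approximated with a simple register-limited occupancy model. The sketch below is such a model, not the tool's own calculation. It assumes 65536 32-bit registers and at most 64 warps per SM on V100, that per-thread register counts are rounded up to a multiple of 8 at allocation time, and that thread blocks are 128 threads (four warps of 32 threads) unless stated otherwise. It ignores shared memory and other limits. Under these assumptions it reproduces the 'Warps per SM' column of Tab. I.

```python
REGS_PER_SM = 65536          # 32-bit registers per SM on V100
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32
REG_GRANULARITY = 8          # assumption: per-thread allocation rounds up to 8


def active_warps_per_sm(regs_per_thread, threads_per_block):
    """Register-limited number of resident warps per SM (simplified model)."""
    # Round the per-thread register count up to the allocation granularity.
    regs = -(-regs_per_thread // REG_GRANULARITY) * REG_GRANULARITY
    warps_by_regs = min(MAX_WARPS_PER_SM, REGS_PER_SM // (regs * WARP_SIZE))
    warps_per_block = threads_per_block // WARP_SIZE
    # Only whole thread blocks can be resident.
    resident_blocks = warps_by_regs // warps_per_block
    return resident_blocks * warps_per_block


# (registers per thread, threads per block) for selected rows of Tab. I.
cases = {"v0": (154, 128), "v4": (170, 128), "v5": (136, 128),
         "v7": (184, 128), "v8": (128, 512)}
warps = {name: active_warps_per_sm(*cfg) for name, cfg in cases.items()}
print(warps)
```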
As we set a limit on the register usage, certain variables
cannot be stored in registers anymore, and they will instead be
placed in local memory, which is called a ‘register spill’. Local
memory resides in device memory physically, and thus has a
longer latency compared to the lower level caches (L1 and L2).
Having said that, it is still worth trying to limit the register
usage and gain better occupancy, because sometimes the benefit
of the increased occupancy can outweigh the performance
penalty from register spills, as in our case. Fig. 6
shows that this optimization has improved GPP’s performance
significantly, with a 3.7 TFLOP/s throughput now. This of
course would not have been possible had we not applied cache
blocking and the array index re-adjustment in v6 and v7.
Fig. 7. Nsight Compute analysis for the cache blocking steps (Top: v6,
Middle: v7, and Bottom: v8)
Fig. 8. L1 and L2 cache hit rate (Top: v0, Bottom: v8)
The Nsight Compute profile in Fig. 7 validates the same
conclusion, and as an example, the time spent on the
matngmatmgp line has been largely reduced. Throughout
this entire process, from version v0 to v8, the cache hit rate
for both L1 and L2 has improved significantly as seen in Fig. 8,
which is a sign of successful optimization as well.
Based on Nsight Compute metrics for dmul, dadd and
dfma instructions, we have measured the ratio of FMA
instructions out of all FP64 instructions in GPP to be 58%,
i.e. dfma/(dmul+dadd+dfma)=58%. Increasing the FMA
ratio requires significant performance tuning effort, and even
when tuning at the lowest level, down to the assembly, we may
still not get a 100% FMA ratio. To realistically estimate the
upper bound for the performance, we can customize our peak
performance based on this 58% FMA ratio [3], i.e. with
58% FMAs, we expect (2 × 0.58 + 1 − 0.58)/2 = 79%
of the theoretical FP64 peak at 1312 MHz, 6.7 TFLOP/s,
which yields a more realistic upper bound of 0.79 ×
6.71 TFLOP/s = 5.3 TFLOP/s. This is the maximum attainable
performance given the 58% ratio,
and with 3.71 TFLOP/s in v8, we have achieved 70% of
that (3.71 TFLOP/s / 5.3 TFLOP/s), and 55% of the theoretical
peak (3.71 TFLOP/s / 6.71 TFLOP/s), as shown in Fig. 6. This
is a significant achievement considering all the challenges we
have faced, such as the complex double precision arithmetic,
long-latency instructions such as divides and now reciprocals,
complicated data access patterns for multiple multi-dimensional
arrays, and the high register pressure and possible register
spills.
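The customized ceiling above can be reproduced in a few lines. The sketch below uses the numbers stated in the text (58% FMA ratio, 6.71 TFLOP/s theoretical peak, 3.71 TFLOP/s achieved) and counts an FMA as 2 FLOPs and every other FP64 instruction as 1 FLOP. The variable names are ours.

```python
fma_ratio = 0.58               # dfma / (dmul + dadd + dfma), from the text
theoretical_peak = 6.71        # TFLOP/s at 1312 MHz
achieved = 3.71                # TFLOP/s in v8

# An FMA contributes 2 FLOPs per instruction slot, other instructions 1 FLOP.
# The theoretical peak assumes 2 FLOPs in every slot, hence the division by 2.
peak_fraction = (2 * fma_ratio + (1 - fma_ratio)) / 2
customized_peak = peak_fraction * theoretical_peak

print(f"Fraction of theoretical peak: {peak_fraction:.2f}")
print(f"Customized peak: {customized_peak:.2f} TFLOP/s")
print(f"Achieved vs customized peak: {achieved / customized_peak:.0%}")
print(f"Achieved vs theoretical peak: {achieved / theoretical_peak:.0%}")
```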
IV. CONCLUSION
This paper presents 8 optimization steps that have been
taken to improve the performance of a Materials Science ker-
nel GPP from 2.3 TFLOP/s to 3.7 TFLOP/s (double precision)
on a single NVIDIA V100 GPU. An array of performance
analysis and optimization techniques have been discussed in
detail, including the use of the hierarchical Roofline perfor-
mance model and the performance tool Nsight Compute. The
3.7 TFLOP/s performance is at 55% of the theoretical peak
6.7 TFLOP/s and 70% of the kernel’s customized attainable
peak based on its FMA ratio, 5.3 TFLOP/s. It is obtained despite
challenges such as complex number arithmetic,
long-latency instructions, complicated data access patterns, and
high register pressure. These challenges are very common in
many other scientific applications, and the techniques presented
in this paper are expected to be of help to applications with
similar characteristics.
REFERENCES
[1] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful
Visual Performance Model for Multicore Architectures,” Commun. ACM,
vol. 52, no. 4, 2009.
[2] T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry,
Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and
S. Williams, “A Novel Multi-Level Integrated Roofline Model Approach
for Performance Characterization,” in International Conference on High
Performance Computing. Springer, 2018, pp. 226–245.
[3] C. Yang, T. Kurth, and S. Williams, “Hierarchical Roofline Analysis
for GPUs: Accelerating Performance Optimization for the NERSC-
9 Perlmutter System,” Concurrency and Computation: Practice and
Experience. [Online]. Available: https://doi.org/10.1002/cpe.5547
[4] Data Collection Methodology for Roofline Analysis on NVIDIA GPUs.
[Online]. Available: https://gitlab.com/NERSC/roofline-on-nvidia-gpus/
-/tree/arxiv-paper
[5] Intel Advisor Roofline Analysis. [Online]. Available:
https://software.intel.com/content/www/us/en/develop/documentation/
advisor-user-guide/top/survey-trip-counts-flops-and-roofline-analyses/
roofline-analysis.html
[6] Nsight Compute Roofline Analysis. [Online]. Available: https://docs.
nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline
[7] Empirical Roofline Toolkit. [Online]. Available: https://bitbucket.org/
berkeleylab/cs-roofline-toolkit/src/master/
[8] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo,
B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams,
“An Empirical Roofline Methodology for Quantitatively Assessing Per-
formance Portability,” in 2018 IEEE/ACM International Workshop on
Performance, Portability and Productivity in HPC (P3HPC). IEEE,
2018, pp. 14–23.
[9] NERSC Roofline Model Documentation. [Online]. Available: https:
//docs.nersc.gov/development/performance-debugging-tools/roofline/
[10] C. Yang, B. Friesen, T. Kurth, B. Cook, and S. Williams, “Toward
Automated Application Profiling on Cray Systems,” in Cray User Group
Conference (CUG), 2018.
[11] J. R. Madsen, M. G. Awan, H. Brunie, J. Deslippe, R. Gayatri, L. Oliker,
Y. Wang, C. Yang, and S. Williams, “Timemory: Modular Performance
Analysis for HPC,” in International Conference on High Performance
Computing. Springer, 2020, pp. 434–452.
[12] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth,
M. Lobet, T. Malas, J.-L. Vay, and H. Vincenti, “Applying the Roofline
performance model to the Intel Xeon Phi Knights Landing processor,”
International Conference on High Performance Computing, pp. 339–
353, 2016.
[13] M. Del Ben, C. Yang, S. Louie, and J. Deslippe, “Accelerating Large-
Scale GW Calculations on Hybrid GPU-CPU Systems,” Bulletin of the
American Physical Society, vol. 65, 2020.
[14] R. Gayatri, C. Yang, T. Kurth, and J. Deslippe, “A Case Study For
Performance Portability Using OpenMP 4.5,” in International Workshop
on Accelerator Programming Using Directives. Springer, 2018, pp. 75–
95.
[15] Y. Wang, C. Yang, S. Farrel, Y. Zhang, T. Kurth, and
S. Williams, “Hierarchical Roofline Performance Analysis for Deep
Learning Applications,” in 2020 IEEE/ACM Performance Modeling,
Benchmarking and Simulation of High Performance Computer Systems
(PMBS). [Online]. Available: https://arxiv.org/
[16] Y. Wang, C. Yang, S. Farrel, Y. Zhang, T. Kurth, and S. Williams, “Time-
Based Roofline for Deep Learning Performance Analysis,” in 2020
IEEE/ACM Deep Learning on Supercomputers Workshop. [Online].
Available: https://arxiv.org/
[17] N. Ding and S. Williams, “An Instruction Roofline Model for GPUs,” in
2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation
of High Performance Computer Systems (PMBS). IEEE, 2019, pp.
7–18.
[18] K. Z. Ibrahim, S. Williams, and L. Oliker, “Performance analysis of
GPU programming models using the roofline scaling trajectories,” in
International Symposium on Benchmarking, Measuring and Optimiza-
tion. Springer, 2019, pp. 3–19.
[19] J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, “A Roofline Model of
Energy,” in 2013 IEEE 27th International Symposium on Parallel and
Distributed Processing, 2013, pp. 661–672.
[20] A. Lopes, F. Pratas, L. Sousa, and A. Ilic, “Exploring GPU Performance,
Power And Energy-Efficiency Bounds with Cache-aware Roofline Mod-
eling,” in 2017 IEEE International Symposium on Performance Analysis
of Systems and Software (ISPASS), 2017, pp. 259–268.
[21] “NVIDIA Nsight Compute Profiling Tool,” https://docs.nvidia.com/
nsight-compute/NsightCompute/index.html.
[22] “BerkeleyGW,” http://www.berkeleygw.org.
