Accelerating High-Order Stencils on GPUs by Sai, Ryuichi et al.
Accelerating High-Order Stencils on GPUs
Ryuichi Sai
Department of Computer Science
Rice University
Houston, TX, USA
ryuichi@rice.edu
John Mellor-Crummey
Department of Computer Science
Rice University
Houston, TX, USA
johnmc@rice.edu
Xiaozhu Meng
Department of Computer Science
Rice University
Houston, TX, USA
xm13@rice.edu
Mauricio Araya-Polo
Computational Science and Engineering
Total E&P Research and Technology US, LLC.
Houston, TX, USA
Jie Meng
Computational Science and Engineering
Total E&P Research and Technology US, LLC.
Houston, TX, USA
Abstract—Stencil computations are widely used in HPC appli-
cations. Today, many HPC platforms use GPUs as accelerators.
As a result, understanding how to perform stencil computations
fast on GPUs is important. While implementation strategies
for low-order stencils on GPUs have been well-studied in the
literature, not all of proposed enhancements work well for high-
order stencils, such as those used for seismic modeling. Further-
more, coping with boundary conditions often requires different
computational logic, which complicates efficient exploitation of
the thread-level parallelism on GPUs. In this paper, we study
high-order stencils and their unique characteristics on GPUs.
We manually crafted a collection of implementations of a 25-
point seismic modeling stencil in CUDA and related boundary
conditions. We evaluate their code shapes, memory hierarchy us-
age, data-fetching patterns, and other performance attributes. We
conducted an empirical evaluation of these stencils using several
mature and emerging tools and discuss our quantitative findings.
Among our implementations, we achieve twice the performance
of a proprietary code developed in C and mapped to GPUs
using OpenACC. Additionally, several of our implementations
have excellent performance portability.
Index Terms—stencil computation, high-order, boundary con-
dition, HPC, GPU
I. INTRODUCTION
”Remember that Time is Money.” [1] This is even more
true in today’s competitive business environments, such as
the oil and gas industry, where fast simulations enable more
realizations of experiments that can reduce the uncertainty of
hydrocarbon reserves location or CO2 storage management
processes. Seismic depth imaging is the main tool used to
extract information from seismic field records to identify
relevant subsurface structures. High-order stencil computations
typically serve as the foundation for seismic depth imaging.
Over the past two decades, the availability of sufficiently
powerful computational resources has led to the every-day use
of more complex stencil-based wave equation approximations,
and the power of today’s petascale systems enables simulations
based on the full-wave equation instead of simple approxima-
tions.
Today, HPC platforms often employ Graphics Processing
Units (GPUs) to increase their computational power. Accord-
ingly, using GPUs to accelerate full-wave equation simulations
based on high-order stencils is a natural approach. However,
the complexity of GPU architectures makes achieving top
performance with high-order stencil computations surprisingly
difficult. Without careful design, a stencil computation on a
GPU is likely to underperform. An efficient implementation
of high-order stencils with boundary conditions on a GPU
requires paying careful attention to data reuse, warp utilization,
work balance, and arithmetic intensity among other issues. For
that reason, understanding how to develop efficient high-order
stencils for GPUs is a topic of significant interest.
It is important to note that GPUs from different vendors
have different characteristics. Furthermore, even the character-
istics of GPUs from a single vendor may change significantly
between generations. For that reason, performance portability
across GPUs of different vendors and across multiple gener-
ations of GPUs is of significant interest. The best kernel on
one GPU may not be the best on GPUs from other vendors
and may not remain the best on newer generations of GPUs.
For these reasons, our current goal is to fathom what is and
how to achieve excellent performance for high-order stencils
on GPUs, and understand the factors that affect performance
portability. To do so, we need to understand the strengths and
weaknesses of various code shapes for high-order stencils.
This paper describes our progress toward this goal and makes
the following contributions:
• a careful comparison of existing approaches, including
an assessment of their strengths and weaknesses when
applied to high-order stencils with boundary conditions;
• the implementation and tuning of a collection of high-
order stencil kernels with a selected set of algorithms
and their variants using CUDA;
• a performance comparison of stencil implementations
across multiple generations of NVIDIA GPUs along with
a quantitative assessment of their performance using a
Roofline performance model for GPUs; and
• an investigation of the characteristics of different stencil
kernels that affect their performance, including register
and memory footprints, exposed latencies and stalls, as
ar
X
iv
:2
00
9.
04
61
9v
1 
 [c
s.D
C]
  1
0 S
ep
 20
20
well as theoretical and achieved occupancy.
The next section is an overview of related work. Sections
III and IV present our approaches and implementations, re-
spectively. Section V describes our evaluation methodology,
experimental results, and a discussion of our findings. Sec-
tion VI summarizes our conclusions and briefly discusses our
plans for future work.
II. RELATED WORK
There is a rich literature describing efforts to efficiently
implement stencil computations on CPUs [2], [3], [4], [5], [6],
[7], [8], [9], [10], [11], [15], [16], [12] as well as GPUs [13],
[14], [17], [18], [19], [22], [23]. We discuss the most related
efforts below.
Time skewing [7], [8], which combines blocking in both the
data and time domains, is well studied on CPUs [9], [10], [11].
A time step loop is skewed with one spatial loop over a data
tile so that the data dependencies are preserved across multiple
time steps. Time skewing improves data reuse by advancing a
region of a domain by multiple time steps while its data is in
cache.
Overlapped tiling is an approach that uses time skewing to
increase the arithmetic intensity by introducing overlapped re-
gions with redundant computation [12], [13]. Overlapped tiling
works on GPUs because loading data from global memory
takes many more cycles than performing computations. This
approach decreases the need for global memory bandwidth by
trading with an increase in the computational workload that
hides the memory latency. While this approach can improve
stencil performance on GPUs, prior evaluations were done
only for low-dimension low-order simple stencils. Without
careful design, overlapping tiling may not scale well with high-
order stencils, as redundant computation needed grows quickly
with stencil width and offsets the gain from the global memory
bandwidth reductions.
Split tiling [14] is another algorithm that employs time
skewing on GPUs. Instead of adding redundant computations
as in overlapped tiling, it computes points in two phases. The
first phase tiles in the time dimension as upright trapezoids,
and executes them in parallel. Once all tiles from the first
phase are complete, the second phase back-fills the missing
points in the time dimension, since all their dependencies are
either from points in the previous time step or the tiles in
previous phase.
The semi-stencil algorithm [15], [16], which has only been
studied on CPUs, factors the computation of a stencil into two
or more pieces. Rather than computing the result for a point
as a single computation, the semi-stencil algorithm divides
computation along one or more axes into two halves. While
sweeping along a dimension, it computes the final result for
one point and a partial result for another point a half-stencil
width ahead. Compared to the traditional implementation of
stencils, this approach changes the load/store ratio for the
computation by trading half of the loads along a dimension
for a store and a reload of a partial result. On a GPU, this has
the potential for nearly halving the cache footprint of a thread
block for a high-order stencil.
Nguyen et al. [17] introduce a 3.5D blocking algorithm as a
mix of 2.5D spatial blocking with 1D temporal blocking. 2.5D
spatial blocking involves blocking in a 2D plane and streaming
through a third dimension. To increase data reuse, active 2D
planes are stored in GPU shared memory. Additionally, 1D
temporal blocking performs multiple time step computations
before writing the data back to the global memory. While the
3.5D algorithm works very well on CPUs, the 1D temporal
blocking introduces two potential implementation challenges
for high-order stencils with boundary conditions on GPUs,
namely barrier synchronizations and limited parallelism. In
this paper, we evaluate 2.5D spatial blocking of high-order
stencils and plan to explore 3.5D blocking in future work.
Nguyen et al.’s approach [17] loads the currently active
plane and the halo planes above and below the current plane
into shared memory for faster data access during the actual
stencil computation. While this strategy improves data reuse,
the size of a data tile is limited by the shared memory size on
GPUs. To reduce the shared memory pressure, we looked into
the work that uses registers on GPUs. Micikevicius [18] also
uses 2.5D blocking, however, only the active plane is loaded
into shared memory. Registers are used to store the halo data
on the third dimension. When the 2D active plane streams
through the third dimension, the algorithm shifts the data in
the registers.
Recently, as part of their AN5D framework work, Mat-
sumura et al. [19] apply three more refinements to 2.5D and
3.5D solutions: fixed register allocations, double buffering, and
division of the streaming dimension. While these approaches
work extremely well for simple single-statement kernels,
neither boundary conditions nor multi-statement stencils are
evaluated. While these approaches work extremely well for
simple single-statement kernels, In our work, we study a
high-order stencil with boundary conditions, and part of our
application has multiple statements, instead of simple single-
statement stencil updates.
Other interesting approaches to tackle the stencil computa-
tions including auto-tuning with dynamic resource allocations
[20], DAG reordering [21], diamond tiling using polyhedral
model [24], [25], functional programming [26], [27], and
multi-layer intermediate representations [28], [29], [30].
From a software engineering perspective, there are two
strategies for developing stencils for NVIDIA GPUs: hand-
written kernels in CUDA and Domain-Specific Language
(DSL)-based approaches [6], [22], [31], [32], [33], [34], [35].
The DSL approach can simplify the generation of code with
complex logic. While we are interested in DSL-based ap-
proaches for the future, the focus of this paper is to understand
in detail the strengths and weaknesses of various algorithmic
strategies for achieving high performance and performance
portability. To avoid limitations as we explore this space, we
chose to evaluate hand-written kernels.
III. APPROACH
We developed several implementations of the acoustic
isotropic approximation of the wave equation [36], a com-
monly used wave equation approximation used in the oil
and gas industry. Solving this with finite differences involves
using a high-order stencil-based solver with suitable boundary
conditions. For oil and gas codes, this approach is applied
on large grids to model subsurface and generate seismic data
from source perturbations. In our work, we employ different
code shapes that differ principally in how they organize the
computation (e.g., 2D vs. 3D tiles) and how they manage
the memory hierarchy. In the rest of this section, we will
briefly describe the acoustic isotropic approximation, explain
the various data decomposition strategies we employ, describe
blocking strategies, and discuss how we structure our imple-
mentations.
A. Seismic Modeling and Acoustic Isotropic Kernel
We study a stencil-based implementation of the acoustic
isotropic wave equation approximation for seismic modeling.
The details of the model are described in [36]. The wave
equation for an acoustic isotropic operator with constant-
density has the following form:
1
V2
∂2u
∂t2
−∇2u = f , (1)
where u = u(x, y, z) is the wavefield, V is the Earth model
(with velocity as rock property), and f is the source perturba-
tion. The equation is discretized in time using a second-order
centered stencil, resulting in the semi-discretized equation:
un+1−Qun+un−1 = (∆t2)V2fn,with Q = 2+∆t2V2∇2.
(2)
Finally, the equation is discretized in space using a 25-point
stencil in 3D, with eight points in along each axis surrounding
a center point:
∇2u(x, y, z) ≈ cxyz × u(i, j, k)+
4∑
m=1
cxm × [u(i + m, j, k) + u(i−m, j, k)] +
cym × [u(i, j + m, k) + u(i, j −m, k)] +
czm × [u(i, j, k + m) + u(i, j, k −m)] (3)
where cxyz, cxm, cym, czm are the discretization parameters.
A high-level description of the algorithm is shown in
Algorithm 1. We apply a Perfectly-Matched Layer (PML) [37]
boundary condition to the regions around the physical domain.
The resulting extended domain consists of an “inner” region
where Equation (2) is applied, and the outer “boundary” region
where a PML calculation is applied.
Solving the acoustic isotropic approximation for the wave
equation involves applying complex multi-statement stencil
that are 8th-order in space and 2nd-order in time. It involves
applying a star shaped 25-point stencil data stored in the
u-array. Boundary conditions employ a different differential
Data: f : source
Result: un: wavefield at timestep n, for n← 1 to T
1 u0 := 0;
2 for n← 1 to T do
3 for each point in wavefield un do
4 Solve Eq. 2 (left hand side) for wavefield un;
5 end
6 un = un + fn (Eq. 2 right hand side);
7 end
Algorithm 1: A high-level description of the algorithm for
solving the acoustic isotropic approximation of the wave
equation with constant density.
operator, a 7-point star shaped stencil. Thus, one high-order,
and one lower-order stencil, respectively.
When real-world simulations are conducted, the grid (that
represent the physical domain) size is usually large (up to
4, 000 grid points in each dimension). In addition, the stencil
computations need to be applied iteratively for a large number
of time steps to yield precise results of practical use.
In this paper, we study the acoustic isotropic kernel as the
seismic model wave approximation. However, we believe that
our approach is general enough that it could be applied to
other high-order stencils with boundary conditions.
B. Data Domain Decomposition
As described in the previous section, our data domain con-
tains two regions, the inner region and the Perfectly Matched
Layer (PML) boundaries. The inner region is a cubic grid
sits at the center of the whole grid, while the PML region
corresponds to the volume between the inner region and the
data domain boundaries. The size of the inner region and the
width of the PML region are defined as user inputs during the
simulations.
Since GPUs have different architectural characteristics than
CPUs, GPU computations must be carefully structured to
leverage the fine-grain data parallelism of GPUs. GPUs mem-
ory hierarchies differ significantly from those of CPUs. Fur-
thermore, the execution model on a GPU is also different, in
that GPUs use a Single-Instruction-Multiple-Thread (SIMT)
execution model. On NVIDIA GPUs, the execution model is
realized by scheduling groups of 32 SIMT threads known as
warps. Each warp has its own quota for its shared memory
size, registers, and other hardware resource limitations. On
NVIDIA GPUs, one realizes parallelism by partitioning the
simulation domain into enough blocks to keep the GPUs
streaming multiprocessors busy. Therefore, we need to divide
our data domain into smaller subregions.
We proposed and experimented with three decomposition
strategies based on our data domain and boundary conditions.
First, we developed a single kernel that could be applied
to any region of the data domain. The kernel contained
conditionals that would employ the PML calculations near
any of the domain boundaries and compute the stencil for
the acoustic isotropic wave function approximation in the
Fig. 1: Data Domain Decomposition
central region of the data domain. This strategy yielded branch
divergence for subregions that contain points in both the PML
and central regions which hurt performance.
Next, we developed separate kernels for the central region
and the PML region. These separate kernels can be launched
concurrently. This strategy eliminates the need for checking
whether the point is inside inner region or PML region in
every kernel, thus, reducing the chance of branch divergence.
Nevertheless, it leaves unbalanced work among threads, along
the boundaries between the inner and PML regions when if
the size of the GPU blocks doesn’t evenly divide the extents
of the PML and central regions.
Figure 1 shows our decomposition strategy. We separate
the inner region from PML region, and further divide PML
region into six subregions. We slice the domain along the top
and bottom of the inner region, and this gives us a top block,
a bottom block, and a border of four walls. We further slice
along the front and back, and it becomes four separate walls.
These four walls and the two subregions from the first cuts
result in total of six subregions of the PML region, namely:
top, bottom, front, back, left and right PML subregions. The
symmetry of these subregions is a relevant characteristic that
we discuss later along with our results. Next, we perform
stencil computations as kernel launches for the seven regions,
one for the inner region, and six for the PML regions. This
approach does not have intrinsic branch divergence at the
boundaries. While there are still work imbalances due to
different grid sizes, they occur only for a few edge cases along
the borders. We expect that we can further reduce unbalanced
work at border locations through automated code generation
using fewer threads to match reduced number of points.
C. Blocking Strategies
For each of the seven data regions, we need to further
slice them into smaller blocks, so that each block can fit into
GPU’s designated resources for each kernel launch. We use
two blocking strategies in our experiments: 3D Blocking and
2.5D Blocking.
1) 3D Blocking: divides each of the data regions into axis-
aligned 3D blocks. To find the best block dimensions, in
Fig. 2: Blocking Strategies: (left) 3D Blocking; (right) 2.5D
Blocking
practice, they are fixed at runtime for easing experiments with
different values. To perform stencil computations on GPUs,
each block maps to a kernel launch with a 3D threadblock
with the thread dimensions matching the block dimensions.
All points inside the block and their halos are loaded into the
GPU on-chip memory before any kernel is launched.
2) 2.5D Blocking: partitions the data domain along the
inner two data dimensions - X and Y - and performs a
streaming computation along the outermost Z dimension. We
launch kernels with 2D threadblocks with their dimensions
matching the 2D planes.
D. Kernels and their variants
We implemented several kernels, each employs different
combinations of methods in terms of their blocking strategies,
algorithms, and strategies for managing data on a GPU. To
better understand the strengths and weaknesses, every imple-
mentation also comes with multiple variants, which block the
data domain into different tile sizes. We describe the details
of our kernel implementations in the next section.
IV. IMPLEMENTATIONS
The notation is defined as follows, we denote R as the width
of the halo, which is half the spatial order of the stencil. For
the acoustic isotropic simulations in our experiments, R is set
to 4. Let Nx, Ny, and Nz denote the dimension size of the
input data region in the X , Y , and Z axis, respectively. For 3D
blocks, we use (x, y, z) to denote the 3D coordinate for both
the point location in the 3D blocks and the thread in kernel
threadblocks. Similarly, for 2D planes, (x, y) is used to locate
the point in the 2D plane, as well as identifying the thread.
1) 3D Blocking Using Global Memory Only: This is the
conceptually and practically the simplest kernel to understand
and implement. Let Dx, Dy, and Dz denote the block
dimensions in the X , Y , and Z axis, respectively. Thus the
block size is Dx ∗Dy ∗Dz. Because we launch each kernel
with a threadblock of the same size, the total number of
points have to be equal or less than 1024 according to current
GPU hardware limit. The GPU grid size of each data region
is ru(Nx/Dx) ∗ ru(Ny/Dy) ∗ ru(Nz/Dz), where ru is a
round-up operator.
During kernel execution, each thread is responsible for
fetching the point for itself, as well as all 24 neighbors points
for the stencil computations. In this implementation, the data is
fetched directly from global memory. But one must adopt good
memory access pattern when fetching from global memory for
performance concerns. Since we store the 3D grid data as a
flat 1D array, we carefully use global memory coalescing for
the most innermost dimension X .
The kernel implementations under this category are identi-
fied with gmem_{Dx}_{Dy}_{Dz} in this paper.
2) 3D Blocking Using Shared Memory for U-Array: The
overall implementation of this approach is based on the
previous one, where the same 3D blocking strategy is used
for each of the data regions. But instead of reading data
from global memory directly, in this approach, u-array is
firstly fetched from global memory and stored into shared
memory for each of the 3D block prior to perform the
stencil computations. The total amount of points we need
to fetch in this case are Dx ∗Dy ∗Dz for the block itself
and (Dx ∗Dy + Dx ∗Dz + Dy ∗Dz) ∗R ∗ 2 for the halos
around the block. Because we work with high-order stencil, the
total amount of points we need to fetch from halos can easily
exceed the shared memory size limit. Therefore, we need to
carefully consider the block size, so that we don’t overwhelm
the shared memory.
Furthermore, also because of the high-order, the halo load-
ing takes a significant portion of the entire shared memory
fetching time. Thus, designing the right approach to speedup
the fetching is crucial to the overall performance. We propose
a few ways of fetching data from global memory into shared
memory, and we describe the fastest approach based on the
our experiments. At the same time, this approach is general
enough that can be applied to various code shapes.
Firstly, thread (i, j, k) fetches the point (i, j, k). Then, when
we design the block size, for each thread dimension, we must
have at least 2 ∗R threads, and we use the first 2 ∗R threads
for the halo fetching. Threads 0 to R− 1 are responsible for
fetching the halos to the one side, and threads R to 2 ∗R− 1
fetch the other side. So when a dimension has just same
amount of points, |D| = 2 ∗R, we have a perfectly balanced
work for every thread. However, if any dimension has more
than 2 ∗R threads, then this approach would result with some
unbalanced work. Nevertheless, because the R is 4 in acoustic
isotropic model, we need to have at least eight threads for each
dimension. Also considering maximum number of threads in a
block being 1024 and the maximum shared memory size limit
of the GPUs, therefore, the only practically possible dimension
size in this case is 8x8x8, which fortunately has a perfectly
balanced work load for every thread. While it is not practically
always possible, both the point fetching and the halo fetching
need to be done in the global memory coalescing fashion as
much as possible.
We identify this implementation as smem u.
3) 3D Blocking Using Shared Memory for boundary region
computing: This implementation also tries to take the advan-
tages of the shared memory, and still use the 3D blocking
strategy. But the main difference is that, this implementation
fetches eta-array into shared memory, whereas the previous
approach fetches the u-array. Because eta-array is only used
in the stencil computation inside the PML region computation,
the changes would only be seen in the PML kernel.
This approach may appears to be nothing new at the first
sight, however, it is twofold interesting in. First, as previously
described, stencil computations on eta is a lower-order stencil
rather than the high-order stencil. In fact, the halo size of eta
is just one. This kind of lower order stencil have been widely
covered in the literature, but the combination of high and low
order stencil is seldom addressed. Second, this implementation
gives us an opportunity to observe the performance changes by
using global memory with good access pattern for a high-order
stencil, meanwhile applying shared memory to a lower-order
stencil.
In terms of fetching eta into shared memory, we have two
implementations that differ in the number of conditionals. We
denote smem eta 3 the first one with three conditionals and
smem eta 1 with just one conditional, and we use R eta to
denote the width of halos for eta, and we know in acoustic
isotropic, R eta is 1.
smem eta 3 uses the similar algorithm as smem u, where
the first 2 ∗R eta threads from each dimension are used for
the halo fetching. Because R eta is just 1, we will only need
two threads from each of the thread dimension. Since we have
three dimensions, we need three conditionals, each for one
dimension, respectively, and checks for the first 2 ∗ R eta
threads. This could introduce unbalanced work, theoretically,
because for a 3D block of 8x8x8, if only two threads in
each dimension performs the fetching, that is one fourth of
the threads. We have three dimensions, that means, while the
majority of the threads are idle, only 1/64 of the threads do
the work.
Additionally, since the above described is already a concern,
and we definitely don’t want to waste such significant amount
of threads, we implement another shared memory fetching
strategy as a comparison, where we choose to use the first
six threads from x dimension, and we perform the fetching as
shown in Algorithm 2. This algorithm uses only one condition,
so we name the implementation smem eta 1. For 8x8x8
3D blocks, because we only use six threads from a single
dimension, this time, 6/8 threads do the work, and in theory,
we reduces the thread idleness to just 20%. However, this
algorithm has relatively complex logic when we tilt the thread
to its proper halo position, thus, experimental evaluation are
needed to compare whether it worths the tradeoff.
4) Semi-stencil: Semi-stencil was initially introduced for
CPUs, where it separates the stencil computations into two
phases, namely the forward phase and the backward phase.
The algorithm reads R + 1 points on one dimension, and the
Data: xidz,yidx, zidx: thread index of x, y, and z
dimension, respectively
Data: nt: number of threads per block dimension
Result: g: coordinate for global memory
Result: s: coordinate for shared memory
1 if zidx < 6 then
2 z ← zidx & 1;
3 sz ← z ∗ 9;
4 gz ← z ∗ (bt + 1)− 1;
5 xzswap ← zidx <= 1;
6 yzswap ← (zidx & 2) == 2;
7 si ← xzswap ? sz : (xidx+1);
8 sj ← yzswap ? sz : (yidx+1);
9 si ← xzswap ?(xidx+1) : (yzswap?(yidx+1) : sz );
10 gi ← xzswap ? gz : xidx;
11 gj ← yzswap ? gz : yidx;
12 gi ← xzswap ? xidx : (yzswap ? yidx: gz );
13 s← (si, sj, sk);
14 g← (gi, gj, gk);
15 end
Algorithm 2: Shared Memory Fetching Strategy for eta-
array Using Only One Conditional
forward phases compute as the points are the left side of
the stencil, and the partial result is stored to the rightmost
point. Backward phases then compute as if the points are the
right side of the stencil, and write out the final result to the
leftmost point. By doing this, semi-stencil has the potential
to improve performance by changing the ratio between load
and store. For example, for a 3D stencil of halo size of R,
with typical approach, it requires to load 6 ∗R + 1 points in
order to perform stencil computation for one point. Once the
computation is done, a single store writes the result back. So
the load-store-ratio is (6 ∗R + 1) : 1. On the other hand, using
semi-stencil on one dimension, we read the center point and
the half points with R + 1 loads, then forward phase writes
one store, and backward phase write another store, thus total
of two stores. Therefore, the load-store-ratio for semi-stencil is
(R+ 1) : 2. Also please note that, even for multi-dimensions,
this ratio does not change. This is in theory very appealing for
high-order stencils because the larger the size of the halo, the
potential benefits one might achieve as the algorithm trades
half of loads with just one more store. Therefore, we try to
adopt semi-stencil algorithm on GPUs.
Our GPU implementation also uses 3D blocking. While
the stencil computation in each 3D block is close to the
CPU implementations described in the original paper. we
further parallelize the executions by running all time steps
concurrently on GPUs.
We assign the identifier semi to this implementation.
5) 2.5D Streaming with Multi-Plane using Shared Memory:
Starting from this implementation, we use 2.5D blocking.
As already described, 2.5D algorithm mainly stream a 2D
plane through the third dimension. In our implementations,
we choose the 2D XY-subplane because X is the innermost
dimension in our data layout. Let Dx and Dy denote the
dimension size of X and Y axis, respectively, for the 2D plane.
So we launch kernels using 2D threadblocks with the same size
of Dx by Dy, thus total of Dx ∗Dy threads. The GPU grid
size of each data region is ru(Nx/Dx) ∗ ru(Ny/Dy), where
ru still stands for a round-up operator.
In this approach, we exploit shared memory as a buffer
to store all the data needed in the stencil computations for
a particular XY-subplane. This means, in addition to the
current XY-subplane, we also load R subplanes above the
current subplane and R subplanes below into the shared
memory. Therefore, we allocate a buffer for 2 ∗R + 1 planes,
each has (Dx + 2 ∗R) ∗ (Dy + 2 ∗R) points, thus, total
of (2 ∗R + 1) ∗ (Dx + 2 ∗R) ∗ (Dy + 2 ∗R) points. Hence,
the dimension size of each subplane must be well-thought,
such that the buffer size could be big enough for better data
reuse, but at the same time, without overwhelm the shared
memory on the GPU chips. Let B denotes our buffer, and
B[i] for the i-th subplane in the buffer.
Before we can start the streaming process, points from the
top halos are pre-loaded into buffer B[0..R), as well as the first
R XY-subplanes loaded into B[R..2R). Then, in our streaming
loop, for each z ← [0..Nz), we firstly load the (z+R)-th XY-
subplane into B[(z + R)%(2 ∗R + 1)]; and then, perform the
stencil computation for the z-th XY-subplane with the needed
points read from the shared memory; and finally, we store the
result directly back to the global memory.
We use a modulo operator for better illustration of the
method, in practice, because modulo operator is slow, we
should avoid using it. Inside the streaming loop, the z index
is always increased by one at a time. Thus, we can use faster
means, such as index rotation and hard-coded index fetches,
instead. In our implementation, we choose index rotation
because of its simplicity.
We identify the kernel implementations under this category
as st smem {Dx} {Dy}.
As described above, the buffer required in this approach
could potentially be capped by the GPU shared memory sizes.
Furthermore, some old generation GPUs may just not have
enough shared memory quota to use. Thus, one could try to
offload some of the data from shared memory to other GPU
resources that are available to us, such as registers. So, we
will discuss two approaches using registers in the following
sections.
6) 2.5D Streaming using Register Shifting: In this 2.5D
streaming approach, the points from the current XY-subplane
keeps using shared memory. However, since we stream
through the z-axis, we use registers for the halo points on
the z-axis. On the contrary to shared memory, where the data
loaded from one thread is also accessible from another thread
in the same block, registers are only available to the current
thread. But this is not a problem, since we are streaming on
the z-axis, and the data from z-axis needed for one thread is
not at all going to be needed by another thread.
So we allocate a shared memory space to hold
(Dx + 2 ∗R) ∗ (Dy + 2 ∗R) points for the currently active
plane. So the shared memory footprint compare to the previous
method is 1 : (2 ∗R + 1). Because of the high-order, R is
a large number, thus, the shared memory usage reduction is
optimal. Let S(x, y) denotes the shared memory with location
(x, y).
We also allocate 2 ∗ R + 1 registers for the current point
and the halo points from each direction on the z-axis. Since
we still can get the value of the current point from shared
memory, theoretically, we only need 2 ∗ R registers for the
halos, however, it is easier to use one more register for the
current point to ease the programming and to potentially speed
up the performance as reading data from registers is considered
faster than reading from shared memory. Let Reg(x, y)[i] to
denote the i-th register for the thread (x, y).
Before we can start the streaming process, thread (x, y)
fetches data from (x, y, z) for z ← [−R..R), and stores them
into register Reg(x, y)[0..2R), respectively. Then, inside the
stream loop, for each z ← [0..Nz), we firstly perform the
register shifting on each thread, such that for r ← (0..2R],
Reg(x, y)[r − 1] = Reg(x, y)[r]; then, we load the the fur-
thest halo point from the streaming dimension (x, y, z+R) to
register Reg(x, y)[2 ∗R]; additionally, we fetch data (x, y, z)
from global memory into S(x, y); and we finally perform the
stencil computation by using the data of XY-subplane from
shared memory and data of z-axis from registers; lastly, the
result is stored back to the global memory directly.
We identify the kernel implementations under this category
as st reg shft {Dx} {Dy}.
It is worth noting that, even though the denotation of the
registers may give the impression of using an array in regular
programming, in the actual implementation, they are written
out explicitly as 2 ∗R + 1 local variables. In fact, our acoustic
isotropic kernel has R = 4, same as the sample code in [18], so
our implementation keeps the variable names used from [18],
and calls them behind4, behind3, behind2, behind1,
current, front1, front2, front3, and front4.
7) 2.5D Streaming using Fixed Registers with Loop Un-
rolling: This approach, same as previous one, utilizes shared
memory for the current XY-subplane, as well as registers for
the halos on the z-axis, the streaming dimension. However, it
uses fixed registers instead of shifting them. To achieve this
by loop unrolling.
So we again allocate a shared memory of
(Dx + 2 ∗R) ∗ (Dy + 2 ∗R) points. Let S(x, y) denotes the
shared memory for location (x, y). We allocate 2 ∗ R + 1
registers as well, and denote Reg(x, y)[i] for the i-th register
of the thread (x, y). And similarly, in practice, they are
2 ∗R + 1 named variables.
Before we can start the streaming process, thread (x, y)
fetches data from (x, y, z) for z ← [−R..R), and stores
them into register Reg(x, y)[0..2R), respectively. Then, inside
the stream loop, for each z ← [0..Nz), we don’t modify
any values in the existing registers except updating regis-
ter Reg(x, y)[(z + 2 ∗ R)%(2 ∗ R + 1)] with the value
of point (x, y, z + R); then, we fetch data (x, y, z) from
global memory into S(x, y); next, we perform the sten-
cil computation by using the data of XY-subplane from
shared memory and i-th data above current point from
Reg(x, y[(z + R− i)%(2 ∗R + 1)], and j-th data below the
current point from Reg(x, y[(z + R + j)%(2 ∗R + 1)]; and
finally, the result is stored back to the global memory directly.
Furthermore, to achieve this without using expensive mod-
ule operators, we implement it using loop unrolling by intro-
ducing macro constructors with the register indices as micro
placeholders. Thus, inside the streaming loop, we expand
2 ∗ R + 1 macro calls, each with register indices shifted by
one. Additionally, we need to check to exit the loop when the
stream reaches the boundary of z-axis.
We identify the kernel implementations under this category
as st reg fixed {Dx} {Dy}.
V. EVALUATION
A. Experiment Environments
We evaluate all kernel implementations on three machines
with NVIDIA GPUs across several generations. Table I lists
our machine specifications. And we refer to the three machines
by their GPU models.
V100 P100 NVS510
CPU IBMPOWER9
IBM
POWER8NVL
Intel Xeon
E3-1245 v6
CPU Cores 160 160 8
RAM 256 GB 256 GB 16 GB
GPU NVIDIATesla V100
NVIDIA
Tesla P100
NVIDIA
NVS 510
GRAM 32 GB 16 GB 2 GB
OS RHEL v7.7 RHEL v7.4 Ubuntu18.04 LTS
CUDA 10.2.89 10.1 10.2.89
NVIDIA Driver 440.33.01 418.39 440.33.01
TABLE I: Machine Specifications
Machine V100 physically equips with four such GPUs, and
we dedicate one GPU for our experiments. We apply compiler
flag -arch=sm_70 to all compilations. In addition, P100
has four P100 installed, and we use one exclusively during our
experiments. Similarly, compiler options -arch=sm_60 are
applied to all kernels when being compiled on this machine.
Lastly, we specify -arch=sm_30 for NVS510. Please note
that, the support from some toolings have been marked as
deprecated on this machine, thus while they are working to
some extents, many have limited functionalities. Plus, the GPU
memory available on NVS510 really can’t meet the minimal
grid size of the real-world use. Therefore, we only use this
machine for basic comparisons across GPU generations. And
while we look into some of the metrics, we won’t discuss them
in details.
In most situations, we will let nvcc compiler to figure out
the register usage by itself, but we pay very close attentions
to the register usage. There are, however, a few cases, where
we must specify the maximum number of register usage, by
adding compiler flag -maxrregcount=X, to prevent from
register spilling.
We also use HPCToolkit [41], [42] version 20200803 and
Empirical Roofline Toolkit [40] 1 in our experiments. Fur-
thermore, NVIDIA Nsight Compute version 2019.5.0 is used
during our evaluations.
B. Evaluation Methodologies
We evaluate all the implementations and their variants.
We firstly conduct basic time measurement. Then we use
HPCToolkit to profile the kernel details with PC samplings.
Additionally, we run Nsight Compute for device-specific ker-
nel characteristics. Furthermore, we run Empirical Roofline
Toolkit to understand the algorithms performance character-
istics against HW constrains. And finally, we calculate the
arithmetic intensities and the performance of every kernels,
and compare them with the roofline chart. We will describe
each of our evaluation methods below.
1) Time Measurements: For each machine, based on its
device memory size, we try to run the kernels with a large
grid size that is allowed by the device. Thus, in V100, our
grid size is chosen to be 10003; for P100, the grid size is set to
8933; while for NVS510, the grid size is 3003. As described in
III-A, the stencil needs to be iterated enough time in practical
use, so, here we set number of time iterations to 1000 for all
kernels and all machines. For each execution, we warm up
the kernel by running the entire execution once, and then we
execute the kernel five times. We take the average of the five
runs, and report it as our time measurements.
2) HPCToolkit: We use the August 2020 release of the
HPCToolkit and evaluate the hardware performance metrics,
such as register use, block and grid size, as well as PC
sampling statistics, to name a few, exposed latencies and their
kinds.
The evaluation contains four steps: Firstly, running hpcrun
along with kernel executions for sampling using program
counters. Thanks to the very low overhead of hpcrun, we
can run our kernels with 1000 time iterations, which gives
us accurate measurements that match the kernel behavior in
the real world. Then, we use hpcstruct on the kernel
binaries and recover the information about their relations to
the source code. This is needed to contribute the performance
metrics back to the source code, so that we can evaluate it
later at the source code level, which makes the investigation
easier. The source code structure computed by hpcstruct
is then associated by hpcprof with the raw sampling data
from hpcrun. Finally, a HPCToolkit performance database is
generated.
HPCToolkit provides two graphic user interface tools to
analyze the performance database, namely, HPCViewer and
HPCTraceViewer. We use HPCViewer primarily for is-
sues such as global memory stalls, and we can easily spot
1We have to use a local fork of the tool for better Python 3 support and other
minor changes needed for our environments. Our changes are made avail-
able at https://github.com/rsrice/cs-roofline-toolkit-fork under the same open-
source license as the upstream repository https://bitbucket.org/berkeleylab/cs-
roofline-toolkit/src/master/.
which lines are contributing to the issue in HPCViewer. We
also use HPCViewer to quickly identify the performance
hotspot of our kernel executions thanks to its complementary
code-centric views from top-down, bottom-up, and flat angles,
respectively. In addition, HPCTraceViewer is used to in-
spect the program execution over time. It allows us to quickly
identify the processor idleness, and zoom into the problematic
area and see the calling stacks.
We will describe some of our findings using HPCToolkit
when we discuss the evaluation results.
3) Nsight Compute: We run Nsight Compute for kernel
characteristics, such as theoretical and achieved occupancy.
We can also verify some of the metrics that we got previously
from HPCToolkit. It is extremely helpful when the perfor-
mance differences are driven by the kernel characteristics. For
example, when a low occupancy happens, one could easily tell
from Nsight Compute report that whether or not the register
footprint might be the culprit, or the size of the shared memory,
or the number of threads, to name a few.
Nsight Compute repeats every kernel execution multiple
times, each for various kinds of profiling, hence, the overhead
is huge. Instead of running 1000 time iterations, when we run
Nsight Compute, we only need to run five time iterations.
4) Roofline Performance Model: We adopt Roofline per-
formance model in our evaluation, so that we understand
our machine characteristics and where our kernels are with
the respect to the machine’s peak performance based on the
kernel’s arithmetic intensity and memory system bandwidth.
First of all, We use Empirical Roofline Toolkit (ERT) for
machine characterizations. It runs several micro-benchmarks to
characterize the peak compute speed and memory bandwidth
of the machine. Benchmarking directly on the machine gives
us a realistic result rather than the theoretical number claimed
by the manufacturers.
With machine characterization in place, next, we work on
kernel characterizations. We use nvprof, a tool shipped
along with CUDA Toolkit, and measure several kernel metrics,
such as floating point operations (FLOP), L2 read and write
transactions, as well as DRAM read and write transactions.
The output from nvprof is then fed into the calculations
of both the performance and the arithmetic intensity for each
kernel. Performance uses the FLOP measured by nvprof and
divides it by the execution time. And arithmetic intensity also
uses the same FLOP, but is divided with the data movement
instead. The data movement is measured at both L2 and
DRAM levels.
In the roofline plot, we have x-axis as the arithmetic in-
tensity, while y-axis being the performance. We then compare
the kernels with their arithmetic intensities and their relative
performance. Additionally, we compare its performance with
machine’s peak performance for the arithmetic intensity. We
also consider to combine the evaluations from other means
and to reason the performance traits.
Kernel Machine
Kernel Identifier Dx Dy Dz Nr V100 P100 NVS510
gmem 4x4x4 4 4 4 - 77.77 181.99 682.89
gmem 8x8x4 8 8 4 - 71.91 167.75 674.09
gmem 8x8x8 8 8 8 - 53.88 117.74 415.85
gmem 16x16x4 16 16 4 - 85.52 195.82 760.72
gmem 32x32x1 32 32 1 - 292.36 639.62 2507.22
smem u 8 8 8 - 57.30 76.18 210.42
smem eta 1 8 8 8 - 54.87 119.15 397.56
smem eta 3 8 8 8 - 54.34 117.39 396.49
semi 8 8 8 - 172.84 217.29 1726.17
st smem 8x8 8 8 - - 116.38 112.71 509.18
st smem 8x16 8 16 - - 113.46 105.41 439.47
st smem 16x8 16 8 - - 59.92 77.91 425.73
st smem 16x16 16 16 - - 55.87 72.73 349.45
st reg shft 8x8 8 8 - - 104.36 144.89 209.87
st reg shft 16x16 16 16 - - 65.79 80.23 182.52
st reg shft 16x32 16 32 - - 65.61 82.25 199.61
st reg shft 16x64 16 64 - 64 115.54 98.19 240.41
st reg shft 32x16 32 16 - - 60.83 70.63 171.30
st reg shft 32x32 32 32 - 64 93.92 76.27 167.29
st reg shft 64x16 64 16 - 64 90.98 80.67 202.74
st reg fixed 8x8 8 8 - - 113.88 152.75 195.05
st reg fixed 16x8 16 8 - - 70.24 84.05 159.73
st reg fixed 16x16 16 16 - - 61.66 76.10 170.03
st reg fixed 32x16 32 16 - - 62.45 66.60 162.05
st reg fixed 32x32 32 32 - 64 58.96 61.74 160.91
TABLE II: Time Measurement
C. Results
We will save our discussions to a later part of this section,
so that we can discuss them by looking at them from multiple
angles in an intertwined fashion. Thus, to start this section,
we summarize and present all results in tables and plots.
We first present the time measurement result in Table II.
For 3D blockings, the columns Dx, Dy, and Dz stands for
the block dimension size for x-, y-, and z-axis, respectively.
And for 2.5D blockings, only columns Dx and Dy are used.
For Nr column, we write a value when we explicitly specify
the maximum register count, otherwise, we use - to denote
the ones that we will let compiler to decide.
In addition, Table III present the kernel characteristics for
inner data region at the top, and PML regions at the bottom,
respectively. Please note that, as we have described previously,
the PML region is decomposed into six sub-regions. However,
because they are symmetric, when we look at the performance
characteristics, there are only three classes, top/bottom pair,
front/back, and left/right, respectively. So we group each
class together in Table III, and name the columns based on
their class. Furthermore, some characteristics are the same
regardless of the data region, so we further extract them into
the Static column to compact the table for readability.
We present the performance characteristics of our imple-
mentations in Table IV. We combine the metrics from both
inner region and PML region into the same table, so that
we can discuss the entire execution. Also, Figure 3 shows
these performance characteristics using roofline performance
model, where Subfigures 3a and 3b being the roofline of
L2 and DRAM, respectively, and Subfigures 3c and 3d are
zoomed-in view of our implementations. All y-axes represent
performance, and x-axes are arithmetic intensity of L2 in
Subfigures 3a and 3c and of DRAM in Subfigures 3b and 3d.
The dots in each group of the implementations are categorized
with the same color and their coordinates can be found in Table
IV.
From these results, we propose the following observations.
3D Blocking using Global Memory:
We started with gmem_8x8x8, the simplest implementa-
tion with only using the global memory. Despite its simplicity,
on V100 it has the best performance. With L1 data cache and
shared memory functionality combined into a single memory
block on V100 architecture [43], we have a much large data
cache than previous generations. In addition, when no shared
memory is used in the program, all the combined memory
block will be usable as a cache. Therefore, when retrieving
data from global memory with good access pattern, we can
achieve a great performance.
However, when we compare its performance across GPU
generations, we notice its poor performance portability. In fact,
it is one of the slowest implementations on P100, which is
still used daily in industry. And it performs poorly on older
generation GPUs as well.
We then tried several variants of the global memory
implementations that differ each others in terms of code
shapes, from smaller to larger, gmem_4x4x4, gmem_8x8x4,
gmem_8x8x8, gmem_16x16x4, and gmem_32x32x1, to
name a few. Our results show that, gmem_8x8x8 is the best
among them. As we have high-order, and we need to load
all halos before performing stencil computations, for blocks
that smaller than gmem_8x8x8, such as gmem_4x4x4 and
Kernel Identifier Block Size Grid Size Registers
Per Thread
Achieved
Active
Warps
Achieved
Occupancy
Theoretical
Active
Warps
Theoretical
Occupancy
gmem 4x4x4 64 13,312,053 40 37.2 58.2 48.0 75.0
gmem 8x8x4 256 3,356,157 40 44.0 68.7 48.0 75.0
gmem 8x8x8 512 1,685,159 40 42.5 66.4 48.0 75.0
gmem 16x16x4 1,024 853,200 40 28.9 45.2 32.0 50.0
gmem 32x32x1 1,024 851,400 40 29.3 45.8 32.0 50.0
smem u 512 1,685,159 38 44.6 69.7 48.0 75.0
smem eta 1 512 1,685,159 40 42.4 66.3 48.0 75.0
smem eta 3 512 1,685,159 40 42.4 66.2 48.0 75.0
semi 768 1,685,159 40 41.2 64.4 48.0 75.0
st smem 8x8 64 14,161 56 19.9 31.1 20.0 31.2
st smem 8x16 128 7,140 56 27.9 43.6 28.0 43.7
st smem 16x8 128 7,140 56 27.9 43.5 28.0 43.7
st smem 16x16 256 3,600 56 31.6 49.4 32.0 50.0
st reg shft 8x8 64 14,161 96 19.0 29.7 20.0 31.2
st reg shft 16x16 256 3,600 96 15.9 24.9 16.0 25.0
st reg shft 16x32 512 1,800 96 16.0 25.0 16.0 25.0
st reg shft 16x64 1,024 900 64 32.0 50.0 32.0 50.0
st reg shft 32x16 512 1,800 96 16.0 25.0 16.0 25.0
st reg shft 32x32 1,024 900 64 32.0 50.0 32.0 50.0
st reg shft 64x16 1,024 900 64 32.0 50.0 32.0 50.0
st ref fixed 8x8 64 14,161 78 23.9 37.3 24.0 37.5
st ref fixed 16x8 128 7,140 78 23.9 37.3 24.0 37.5
st ref fixed 16x16 256 3,600 78 23.9 37.4 24.0 37.5
st ref fixed 32x16 512 1,800 78 16.0 25.0 16.0 25.0
st ref fixed 32x32 1,024 900 64 32.0 50.0 32.0 50.0
Static Top/Bottom Front/Back Left/Right
Kernel Identifier Block
Size
Registers
Per
Thread
Theoretical
Active
Warps
Theoretical
Occu-
pancy
Grid
Size
Achieved
Active
Warps
Achieved
Occu-
pancy
Grid
Size
Achieved
Active
Warps
Achieved
Occu-
pancy
Grid
Size
Achieved
Active
Warps
Achieved
Occu-
pancy
gmem 4x4x4 64 48 40.0 62.5 437500 38.0 59.5 414750 38.0 59.4 393183 38.2 59.7
gmem 8x8x4 256 48 40.0 62.5 109375 37.5 58.6 118500 37.5 58.6 112812 36.7 57.3
gmem 8x8x8 512 48 32.0 50.0 62500 29.2 45.7 59500 26.9 42.0 56644 28.2 44.1
gmem 16x16x4 1024 48 32.0 50.0 27783 30.0 46.6 29862 29.5 46.1 28440 26.0 40.2
gmem 32x32x1 1024 48 32.0 50.0 27648 29.0 45.0 30272 29.0 45.0 28380 24.2 38.0
smem u 512 48 32.0 50.0 62500 30.1 47.1 59500 27.8 43.5 56644 27.7 43.3
smem eta 1 512 32 64.0 100.0 62500 59.4 92.9 59500 54.8 85.6 56644 54.8 85.7
smem eta 3 512 32 64.0 100.0 62500 59.1 92.4 59500 53.9 84.3 56644 54.2 84.8
semi 768 64 24.0 37.5 62500 17.7 27.6 59500 18.7 29.3 56644 17.5 27.3
st smem 8x8 64 72 20.0 31.2 500 12.4 19.4 476 11.8 18.5 14161 19.7 30.8
st smem 8x16 128 72 28.0 43.7 252 12.6 19.7 238 11.8 18.5 7140 27.5 43.1
st smem 16x8 128 72 28.0 43.7 250 12.4 19.5 240 11.9 18.6 7140 27.5 43.0
st smem 16x16 256 72 24.0 37.5 126 12.7 19.8 120 12.0 18.7 3600 23.9 37.3
st reg shft 8x8 64 80 24.0 37.5 500 12.4 19.4 476 11.8 18.4 14161 23.6 36.8
st reg shft 16x16 256 80 24.0 37.5 126 12.6 19.7 120 11.9 18.6 3600 23.9 37.3
st reg shft 16x32 512 80 16.0 25.0 64 16.0 25.0 60 16.0 25.0 1800 15.9 24.9
st reg shft 16x64 1024 64 32.0 50.0 32 32.0 50.0 60 31.9 49.9 900 31.9 49.8
st reg shft 32x16 512 80 16.0 25.0 63 16.0 25.0 60 16.0 25.0 1800 15.9 24.9
st reg shft 32x32 1024 64 32.0 50.0 32 32.0 50.0 30 31.9 49.9 900 31.8 49.8
st reg shft 64x16 1024 64 32.0 50.0 63 31.9 49.9 30 31.9 49.9 900 31.8 49.8
st ref fixed 8x8 64 106 16.0 25.0 500 12.4 19.4 476 11.8 18.4 14161 15.7 24.6
st ref fixed 16x8 128 104 16.0 25.0 250 12.4 19.5 240 11.8 18.5 7140 15.7 24.6
st ref fixed 16x16 256 104 16.0 25.0 126 12.6 19.8 120 12.0 18.7 3600 15.8 24.7
st ref fixed 32x16 512 106 16.0 25.0 63 16.0 25.0 60 16.0 25.0 1800 15.9 24.9
st ref fixed 32x32 1024 64 32.0 50.0 32 32.0 50.0 30 31.9 49.9 900 31.9 49.8
TABLE III: Kernel Characteristics on V100: (top) Inner; (bottom) PML
gmem_8x8x4, the halo size dominates comparing to the
actual data points. Therefore, more time is spent in loading
halos relative to the time in loading points. Thus, harm the per-
formance. In addition, smaller block sizes also result in larger
GPU grid size as we can see from Table III, which means
more kernel launches, so, the accumulated overheads also
slow the overall executions. On the other hands, we also see
performance degradation in larger blocks, gmem_16x16x4
and gmem_32x32x1. Due to maximum number of threads
available to a block is 1024, in order to use all the threads,
we need to make the block thinner. Despite their smaller grid
size, both has low theoretical and achieved occupancy. At the
same time, we also observe an increase in their L2 cache
misses resulting in additional L2 transactions from Table IV,
especially with gmem_32x32x1. These cache misses can be
seen as very expensive, as the very kernel performs extremely
poorly.
As a quick summary, global memory implementations are
Kerner Identifier FLOP
(x1013)
Achieved
Perfor-
mance
(GFLOPs)
L2
Trans-
actions
(x1012)
L2
Arith-
metic
Inten-
sity
L2 Ma-
chine
Peak
Perfor-
mance
(GFLOPs)
L2
Achieved
Per-
centage
DRAM
Trans-
actions
(x1011)
DRAM
Arith-
metic
Inten-
sity
DRAM
Ma-
chine
Peak
Perfor-
mance
(GFLOPs)
DRAM
Achieved
Per-
centage
gmem 4x4x4 opt 4.453 533 3.38 0.41 1361 39.19% 8.42 1.65 1291 41.29%
gmem 8x8x4 opt 4.453 577 2.81 0.49 1635 35.27% 7.26 1.92 1498 38.50%
gmem 8x8x8 opt 4.453 770 1.79 0.78 2566 30.00% 7.26 1.92 1498 51.39%
gmem 16x16x4 opt 4.453 485 2.45 0.57 1877 25.83% 6.67 2.08 1628 29.78%
gmem 32x32x1 opt 4.453 142 13.90 0.10 330 42.95% 6.56 2.12 1656 8.57%
smem u opt 4.453 724 1.82 0.77 2531 28.60% 7.37 1.89 1474 49.11%
smem eta 1 opt 4.453 756 1.82 0.76 2522 29.97% 7.31 1.90 1487 50.81%
smem eta 3 opt 4.453 763 1.81 0.77 2535 30.10% 7.31 1.90 1488 51.30%
semi opt 6.400 345 2.67 0.75 2480 13.90% 18.40 1.08 847 40.71%
st smem 8x8 opt 4.453 356 1.59 0.87 2891 12.33% 12.30 1.13 885 40.27%
st smem 8x16 opt 4.453 366 1.47 0.95 3130 11.68% 13.30 1.05 820 44.58%
st smem 16x8 opt 4.453 692 1.17 1.19 3933 17.59% 7.74 1.80 1404 49.27%
st smem 16x16 opt 4.453 742 1.04 1.34 4414 16.81% 6.97 2.00 1560 47.58%
st reg shft 8x8 opt 4.453 397 1.57 0.89 2935 13.54% 10.40 1.34 1047 37.96%
st reg shft 16x16 opt 4.453 630 1.20 1.16 3841 16.41% 7.22 1.93 1506 41.86%
st reg shft 16x32 opt 4.453 632 1.15 1.21 3991 15.84% 6.76 2.06 1607 39.32%
st reg shft 16x64 opt 4.453 359 1.99 0.70 2317 15.49% 17.00 0.82 638 56.25%
st reg shft 32x16 opt 4.453 682 0.94 1.47 4861 14.02% 6.94 2.00 1566 43.54%
st reg shft 32x32 opt 4.453 442 1.67 0.83 2750 16.05% 15.50 0.90 701 62.95%
st reg shft 64x16 opt 4.453 456 1.57 0.89 2938 15.52% 14.50 0.96 752 60.64%
st reg fixed 8x8 opt 4.453 364 1.65 0.84 2791 13.05% 15.00 0.93 723 50.36%
st reg fixed 16x8 opt 4.453 590 1.27 1.10 3632 16.26% 9.59 1.45 1133 52.11%
st reg fixed 16x16 opt 4.453 673 1.18 1.18 3899 17.25% 7.71 1.80 1409 47.72%
st reg fixed 32x16 opt 4.453 664 9.12 1.53 5043 13.17% 7.14 1.95 1522 43.62%
st reg fixed 32x32 opt 4.453 703 1.09 1.27 4209 16.71% 9.08 1.53 1197 58.78%
TABLE IV: Kernel Performance Characteristics on V100
the simplest, at the same time, it has the least programming
complexity and needs very little performance tuning effort.
With the right code shape and using good global memory
access pattern, on the latest GPU architectures, such as V100,
one can achieve amazingly good performance with little effort.
From software engineering’s perspective, this implementation
is easy to understand, and has low maintenance cost.
Shared memory:
Firstly, both the P100 and NVS510 results clearly show us-
ing shared memory provide significant performance boost, and
this aligns with results from existing literature. In addition, re-
call that smem_u is a high-order stencil while smem_eta_1
and smem_eta_3 are not, on these architectures, smem_u
running faster than smem_eta_1 and smem_eta_3 also
back the understanding that, with high-order stencils, by
loading larger-size halos into the shared memory would largely
benefit the performance on these architectures.
The results from 3D blocking implementations using shared
memory reiterate our previous confirmation about the impli-
cations from the combined L1 data cache and shared memory
on latest NVIDIA GPUs. Since, the results with V100 are the
opposite, and the proper global memory use is a better option
for 3D blockings on this architecture, as smem_eta_1 and
smem_eta_3 rely more on global memory than smem_u,
and they run faster.
Furthermore, regardless of the architectures, for larger and
thinner blocks, such as 32x32x1, a 3D block with the thick-
ness of 1, which is essentially a 2D plane, the results show that
using shared memory is still preferred over global memory,
especially for high-order stencils. Because a larger plane will
relatively need less halo data before it can preform the stencil
computation for its points. All 2.5D-blocking implementations
load the current plane into the shared memory because of the
same reason, and we will discuss the code shapes for 2.5D
blocking in a few paragraphs below.
Nevertheless, at the same time, due to the large halo size, for
implementations like st_smem_∗, it is easier to not having
enough shared memory for the desired plane size, such as
32x32. Thus, while shared memory helps performance, for
high-order stencils, when exploring the algorithmic design
spaces, shared memory quota still need to carefully considered,
as it could potentially limit it from being used for larger
blocks, otherwise, crash the program execution.
Semi-stencil:
Thread synchronizations on GPUs are very expensive, and
our evaluations also prove so with our semi-stencil imple-
mentation. Because of the need of storing and loading partial
results, thread synchronizations are necessary to ensure the
completeness of the required computation before it can pro-
ceed to the next. As GPU runs threads concurrently in warps,
we must introduce proper barriers to prevent data from being
corrupted. Our choice of using 3D blocking requires thread
(a) Device Performance vs L2 Arithmetic Intensity (b) Device Performance vs DRAM Arithmetic Intensity
(c) Kernel Performance vs L2 Arithmetic Intensity (d) Kernel Performance vs DRAM Arithmetic Intensity
Fig. 3: Roofline Performance on V100
synchronizations on all three dimensions, which exacerbates
the problem. HPCToolkit also backs our reasoning with its
second most significant bottleneck being the thread synchro-
nization (STL_SYNC).
Nevertheless, the methodology behind semi-stencil algo-
rithm is still valid. Thus, to avoid the excess use of thread
synchronizations, we will investigate into using double buffer-
ing. As well, we will explore using 2.5D blocking instead of
3D blocking in our future work.
Code Shape for 2.5D-Blockings:
For implementations using 2.5D-blocking strategy, in gen-
eral, the larger the 2D plane, the better the performance.
There are two main reasons: First, a larger 2D plane means
a higher degree of concurrency. Second, with a larger plane,
the percentage of the halo points in shared memory fetching
is smaller, which speeds up the overall performance by the
reduction of expensive data fetching.
In addition, one should cut the plane such that the x-
dimension of the GPU’s threadblock is assigned to the
innermost dimension in the data layout with a relatively
larger size. When we compare st_reg_shft_32x16
and st_reg_shft_16x32, st_reg_shft_32x16 runs
faster. From Table IV, we see more L2 transactions with
st_reg_shft_16x32, thus, need more GPU cycles for the
data to be available for the computing units, which in turn,
harm the performance.
It is also worth noting that, as already mentioned previously,
due to the shared memory quota on the physical device, one
may not be able to have a desired large plane. What’s more,
register pressure could be another consideration in the code
shape design, as it is interesting by its own, we will discuss
it in details next.
Register Footprint in 2.5D-Blockings:
When we evaluate st_reg_shft_∗ implementations,
the variants with 2D plane size of 1024, namely
st_reg_shft_16x64, st_reg_shft_32x32, and
st_reg_shft_64x16, show poor performance on V100.
And we see the performance degradation is caused by register
spilling. Because the maximum registers in a blockthread
is 64 ∗ 1024 = 65536, and for these implementations, we
have 1024 threads, thus, we use -maxrregcount=64 to
limit the maximum register usage per thread. However, if we
don’t explicitly set this compiler flag, the nvcc assigns 80
and 96 registers to the PML and inner kernels, respectively.
If we execute the generated mul-functioning binary, the
calculated result is just gibberish. Nevertheless, these number
of registers assigned by nvcc provides us insights about how
many registers are ideally needed. Therefore, our overrides
of limiting the number of registers to 64 makes it inadequate
to hold all the variables, thus, register spilling happens with
some variables be potentially moved out and in the registers.
What’s worse, in this class of implementations, register
shifting is used, so the frequency of accessing registers
are high, but with the register spilling, we lose quite some
benefits that is supposed to be provided by registers, hence,
we see a worse performance.
However, please note that, register spilling also happen
to other kernels, such as st_fixed_reg_32x32. But we
don’t see performance degradation, thanks to its algorithm
uses fixed registers with loop unrolling. Because the registers
are fixed, there is no register shifting, the chances of data
movement in and out registers are smaller, therefore, the
performance impacted by the register spilling is hide by other
thread activities.
2.5D-Blockings Have Good Occupancies and Performance
Portability:
From Table III, we notice that the theoretical and achieved
occupancies for 2.5D blocking are generally good.
Besides, the best performed implementations on P100 and
NVS 510 come from 2.5D approaches. Although they may
not be the best kernels on V100, they are still at the fastest
tier. Thus, if performance portability is a concern, implementa-
tions using 2.5D blocking, such as st_reg_fixed_32x32,
would be preferred.
Gaps to the Roofline Ceilings:
Lastly, the gaps between our implementations and the
ceilings from roofline performance model deserve some dis-
cussions as well. Our interpretation of the performance gaps
are twofolds:
First, we see rooms for further performance tuning. Despite
the state of art may not fully suitable for high-order stencils
with boundary conditions, we could improve the arithmetic
intensities by designing new algorithms or fusing several ex-
isting algorithms that could increase the work being performed
while data is available in the computing unit. In addition,
all of our current implementations are manually written in
CUDA, we could use a DSL approach that facilitate the
implementation of complex approaches that produces better
results for high-order stencils with boundary conditions.
Second, ERT uses simple micro-benchmarking to profile the
machines. On the contrary, acoustic isotropic model not only
is a high-order stencils with boundary conditions to handle,
but also contains complicated logic with multiple statements.
Thus, while the roofline ceilings provide us a guidance, the
real-world nature in our simulations may always keep the
reality in check.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we discussed the importance of high-order
stencils with boundary conditions from the angles of both
academic perspective and industrial applications. In addition,
We reviewed the current techniques for stencil computations,
verdict and filtered them for high-order stencils. Furthermore,
we implemented several kernels and their variants with the
current techniques and fusions of them. We also modified
the implementation when applicable for better accommodating
high-order stencils with boundary conditions. Besides, we
evaluated our implementations with several tools on multiple
platforms, we quantitatively compared them from various
perspectives and different levels, and we summarized our
observations. Shortly before submitting this paper, we gained
access to GPA, an emerging GPU Performance Advisor tool
being developed by Keren Zhou at Rice University. We tried
applying it to the smem_u implementation and found that
GPA provided tuning guidance that enabled us to improve its
performance. These promising initial results from GPA give
us confidence that we can use GPA on more of our kernels to
search for other tuning opportunities.
We originally began our evaluation by computing 25-point
stencil algorithms over the entire data domain in a single
kernel launch. Our finding of inefficiency caused by branch
divergence led us to apply domain decomposition and compute
the boundaries separate from the center region. While this
improved efficiency, having the regions separate impedes our
ability to apply time skewing along the streaming dimension
for the 2.5D algorithm. We would like to try reintegrating
the boundary computations with the central region to enable
us to evaluate 3.5D algorithms on high-order stencils with
boundary conditions by adding time skewing to the streaming
dimension. In addition, we would like to explore whether
applying the semi-stencil algorithm along the streaming di-
mension improves stencil performance by reducing the cache
footprint of a block’s stencil calculations. Meanwhile, we
would like to see if our observations on the V100 also hold for
the latest NVIDIA A100 GPU. In addition to NVIDIA GPUs,
we would like to expand the scope of our evaluation to explore
the performance of various high-order stencil implementations
on GPUs from other vendors.
We recognize that implementing, tuning, maintaining, and
porting high-performance GPU kernels for high-order stencils
is quite difficult. For the long term, we believe that a high-level
representation of stencil computations in conjunction with
powerful compiler technology is arguably the best strategy to
improve development productivity and performance portability
while also lowering maintenance costs by reducing complexity.
However, a concern for DSL users is the long-term viability
of their code. We are hopeful that adding DSL technology to
the LLVM compiler framework will provide a path forward
that will address this concern.
ACKNOWLEDGMENT
This work was supported in part by a contract from Total
E&P Research & Technology USA, LLC. We thank Keren
Zhou from Rice University for helping us use his emerging
GPU Performance Advisor tool, which offered insights for
tuning some of the kernels we studied.
REFERENCES
[1] F. Benjamin, “Advice to a Young Tradesman,” National Archives and
Records Administration/University of Virginia Press, July 1748.
[2] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, “Cache-
Oblivious Algorithms,” in Proceedings of the 40th Annual Symposium
on Foundations of Computer Science, USA, Oct. 1999, p. 285.
[3] M. Frigo and V. Strumpen, “Cache oblivious stencil computations,” in
Proceedings of the 19th annual international conference on Supercom-
puting, Cambridge, Massachusetts, Jun. 2005, pp. 361-366.
[4] M. Frigo and V. Strumpen, “The cache complexity of multithreaded
cache oblivious algorithms,” in Proceedings of the eighteenth annual
ACM symposium on Parallelism in algorithms and architectures, Cam-
bridge, Massachusetts, USA, Jul. 2006, pp. 271-280.
[5] R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel, “Cache oblivious
parallelograms in iterative stencil computations,” in Proceedings of
the 24th ACM International Conference on Supercomputing, Tsukuba,
Ibaraki, Japan, Jun. 2010, pp. 49-59.
[6] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C.
E. Leiserson, “The pochoir stencil compiler,” in Proceedings of the
twenty-third annual ACM symposium on Parallelism in algorithms and
architectures, San Jose, California, USA, Jun. 2011, pp. 117-128.
[7] D. Wonnacott, “Using time skewing to eliminate idle time due to
memory bandwidth and network limitations,” in Proceedings 14th Inter-
national Parallel and Distributed Processing Symposium. IPDPS 2000,
May 2000, pp. 171-180.
[8] D. Wonnacott, “Achieving Scalable Locality with Time Skewing,” Inter-
national Journal of Parallel Programming, vol. 30, no. 3, pp. 181-221,
Jun. 2002.
[9] G Jin, J. Mellor-Crummey, and R. Fowler, “Increasing Temporal Locality
with Skewing and Recursive Blocking,” in SC01: Proceedings of the
2001 ACM/IEEE Conference on Supercomputing, Nov. 2001, pp. 57.
[10] J. McCalpin and D. Wonnacott, “Time Skewing: A Value-Based Ap-
proach to Optimizing for Memory Locality,” 1998.
[11] Y. Song and Z. Li, “New tiling techniques to improve cache temporal
locality,” SIGPLAN Not., vol. 34, no. 5, pp. 215-228, May 1999.
[12] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A.
Rountev, and P. Sadayappan, “Effective automatic parallelization of
stencil computations,” SIGPLAN Not., vol. 42, no. 6, pp. 235-244, Jun.
2007.
[13] J. Holewinski, L.-N. Pouchet, and P. Sadayappan, “High-performance
code generation for stencil computations on GPU architectures,” in Pro-
ceedings of the 26th ACM international conference on Supercomputing,
San Servolo Island, Venice, Italy, Jun. 2012, pp. 311-320.
[14] T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan,
and S. Verdoolaege, “Split tiling for GPUs: automatic parallelization
using trapezoidal tiles,” in Proceedings of the 6th Workshop on General
Purpose Processor Using Graphics Processing Units, Houston, Texas,
USA, Mar. 2013, pp. 24-31.
[15] R. de la Cruz, M. Araya-Polo, and J. M. Cela, “Introducing the Semi-
stencil Algorithm,” in Parallel Processing and Applied Mathematics,
Berlin, Heidelberg, 2010, pp. 496-506.
[16] R. de la Cruz and M. Araya-Polo, “Algorithm 942: Semi-Stencil,” ACM
Trans. Math. Softw., vol. 40, no. 3, p. 23:1-23:39, Apr. 2014.
[17] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey, “3.5-D
Blocking Optimization for Stencil Computations on Modern CPUs and
GPUs,” in SC 10: Proceedings of the 2010 ACM/IEEE International
Conference for High Performance Computing, Networking, Storage and
Analysis, Nov. 2010, pp. 1-13.
[18] P. Micikevicius, “3D finite difference computation on GPUs using
CUDA,” in Proceedings of 2nd Workshop on General Purpose Process-
ing on Graphics Processing Units, Washington, D.C., USA, Mar. 2009,
pp. 79-84.
[19] K. Matsumura, H. R. Zohouri, M. Wahib, T. Endo, and S. Matsuoka,
“AN5D: automated stencil framework for high-degree temporal blocking
on GPUs,” in Proceedings of the 18th ACM/IEEE International Sym-
posium on Code Generation and Optimization, San Diego, CA, USA,
Feb. 2020, pp. 199-211.
[20] P. S. Rawat, M. Vaidya, A. Sukumaran-Rajam, A. Rountev, L.-N.
Pouchet, and P. Sadayappan, “On Optimizing Complex Stencils on
GPUs,” in 2019 IEEE International Parallel and Distributed Processing
Symposium (IPDPS), May 2019, pp. 641-652.
[21] P. S. Rawat, F. Rastello, A. Sukumaran-Rajam, L.-N. Pouchet, A.
Rountev, and P. Sadayappan, “Register optimizations for stencils on
GPUs,” SIGPLAN Not., vol. 53, no. 1, pp. 168-182, Feb. 2018.
[22] P. Rawat et al., “SDSLc: a multi-target domain-specific compiler for
stencil computations,” in Proceedings of the 5th International Workshop
on Domain-Specific Languages and High-Level Frameworks for High
Performance Computing, Austin, Texas, Nov. 2015, pp. 1-10.
[23] P. S. Rawat et al., “Domain-Specific Optimization and Generation of
High-Performance GPU Code for Stencil Computations,” Proceedings
of the IEEE, vol. 106, no. 11, pp. 1902-1920, Nov. 2018.
[24] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, “A prac-
tical automatic polyhedral parallelizer and locality optimizer,” SIGPLAN
Not., vol. 43, no. 6, pp. 101-113, Jun. 2008.
[25] V. Bandishti, I. Pananilath, and U. Bondhugula, “Tiling stencil com-
putations to maximize parallelism,” in SC 12: Proceedings of the
International Conference on High Performance Computing, Networking,
Storage and Analysis, Nov. 2012, pp. 1-11.
[26] M. Steuwer, T. Remmelg, and C. Dubach, “LIFT: A functional data-
parallel IR for high-performance GPU code generation,” in 2017
IEEE/ACM International Symposium on Code Generation and Opti-
mization (CGO), Feb. 2017, pp. 74-85.
[27] M. L?cke, M. Steuwer, and A. Smith, “A functional pattern-based
language in mlir,” p. 6, 2020.
[28] O. Fuhrer et al., “Towards a performance portable, architecture agnostic
implementation strategy for weather and climate models,” Supercomput.
Front. Innov.: Int. J., vol. 1, no. 1, pp. 45-62, Apr. 2014.
[29] J.-M. Gorius and T. Grosser, “Modeling Stencils in a Multi-Level
Intermediate Representation,” p. 15, 2019.
[30] T. Gysi et al., “Domain-Specific Multi-Level IR Rewriting for GPU,”
arXiv:2005.13014 [cs], May 2020, Accessed: Jul. 19, 2020. [Online].
Available: http://arxiv.org/abs/2005.13014.
[31] R. Baghdadi et al., “PENCIL: A Platform-Neutral Compute Intermediate
Language for Accelerator Programming,” in 2015 International Confer-
ence on Parallel Architecture and Compilation (PACT), Oct. 2015, pp.
138-149.
[32] R. Baghdadi et al., “Tiramisu: a polyhedral compiler for expressing fast
and portable code,” in Proceedings of the 2019 IEEE/ACM International
Symposium on Code Generation and Optimization, Washington, DC,
USA, Feb. 2019, pp. 193-205, Accessed: Jul. 20, 2020. [Online].
[33] M. Christen, O. Schenk, and H. Burkhart, “PATUS: A Code Generation
and Autotuning Framework for Parallel Iterative Stencil Computations
on Modern Microarchitectures,” in 2011 IEEE International Parallel
Distributed Processing Symposium, May 2011, pp. 676-687.
[34] M. Louboutin et al., “Devito (v3.1.0): an embedded domain-specific lan-
guage for finite differences and geophysical exploration,” Geoscientific
Model Development, vol. 12, no. 3, pp. 1165-1187, Mar. 2019.
[35] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S.
Amarasinghe, “Halide: a language and compiler for optimizing paral-
lelism, locality, and recomputation in image processing pipelines,” in
Proceedings of the 34th ACM SIGPLAN Conference on Programming
Language Design and Implementation, Seattle, Washington, USA, Jun.
2013, pp. 519-530.
[36] J. Meng, A. Atle, H. Calandra, and M. Araya-Polo, “Minimod: A Finite
Difference solver for Seismic Modeling,” Jul. 2020, Accessed: Aug. 15,
2020. [Online]. Available: https://arxiv.org/abs/2007.06048v1.
[37] D. Komatitsch, J. Tromp, “A perfectly matched layer absorbing boundary
condition for the second-order seismic wave equation,” Geophysical
Journal International, volume 154, number 1, pp. 146-153, July 2003.
[38] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful
visual performance model for multicore architectures,” Commun. ACM,
vol. 52, no. 4, pp. 65-76, Apr. 2009.
[39] N. Ding and S. Williams, “An Instruction Roofline Model for GPUs,” in
2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation
of High Performance Computer Systems (PMBS), Nov. 2019, pp. 7-18.
[40] C. Yang, “Hierarchical Roofline Analysis on GPUs,” ECP Annual
Meeting, February 2020,
[41] J. Mellor-Crummey, R. Fowler, and D. Whalley, “Tools for application-
oriented performance tuning,” in Proceedings of the 15th international
conference on Supercomputing, Sorrento, Italy, Jun. 2001, pp. 154-165.
[42] K. Zhou, M. Krentel, and J. Mellor-Crummey, “A tool for top-down
performance analysis of GPU-accelerated applications,” in Proceedings
of the 25th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, San Diego, California, Feb. 2020, pp. 415-416.
[43] NVIDIA, “NVIDIA Tesla V100 GPU Architecture,” August 2017.
