A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs by Yang Yang et al.
A Hybrid Circular Queue Method for Iterative Stencil Computations
on GPUs
Yang Yang1,2, Hui-Min Cui1,2, Xiao-Bing Feng1, and Jingling Xue3
1State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
Beijing 100190, China
2Graduate University of Chinese Academy of Sciences, Beijing 100190, China
3Programming Languages and Compilers Group, School of Computer Science and Engineering
University of New South Wales, Sydney, NSW 2052, Australia
Abstract In this paper, we present a hybrid circular queue method that can signiﬁcantly boost the performance of stencil
computations on GPU by carefully balancing usage of registers and shared-memory. Unlike earlier methods that rely on
circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new
reuse pattern spanning across the multiple time steps in stencil computations so that circular queues can be implemented
by both shared memory and registers eÆectively in a balanced manner. We describe a framework that automatically ﬁnds
the best placement of data in registers and shared memory in order to maximize the performance of stencil computations.
Validation using four diÆerent types of stencils on three diÆerent GPU platforms shows that our hybrid method achieves
speedups up to 2.93X over methods that use circular queues implemented with shared-memory only.
Keywords stencil computation, circular queue, GPU, occupancy, register
1 Introduction
Iterative stencil computations form what are known
as stencil kernels in which some computations are per-
formed successively in multiple time steps until con-
vergence. The intermediate results are not of interest
and only the results computed at the last time step are
needed. Thus, it is important to improve performance
by reusing the data accessed multiple times across the
multiple time steps. Otherwise, memory bandwidth
may become a serious performance bottleneck. Thus, a
lot of prior work exploits this kind of reuse across mul-
tiple time steps[1-5].
The circular queue method introduced in [6] can ex-
ploit such data reuse across the time steps. By keep-
ing only necessary temporary data on-chip, the on-chip
storage required is low as it is proportional to the num-
ber of time steps, T, computed at a time. In addition,
the on-chip storage required is reused circularly. Ear-
lier implementations of circular queue algorithms on
platforms of X86[1] and Cell[7] have resulted in good
performance. These earlier eÆorts show that these algo-
rithms can be ﬂexibly implemented via indirect memo-
ry addressing, which requires the underlying memory
to be indirectly addressable, as is the case for the main
memory in X86 and the local memory in Cell.
Recently, circular queue methods have also been im-
plemented on GPU, which has become an important
parallel computing architecture. A naive solution can
be obtained by implementing a circular queue using
only its on-chip shared memory, which is indirectly ad-
dressable. However, this will not be e±cient for GPU
since its large number of (on-chip) registers is not uti-
lized. Compared to shared memory, the register ﬁle is
larger and no extra instructions to access are needed.
Therefore, it is possible to accelerate the performance
of iterative stencil computations further if registers can
also be used in implementing circular queues. Further-
more, registers and shared memory should be used in
a balanced manner to increase thread occupancy andYang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 59
the length of tiles can be arbitrarily long. For 3D Ja-
cobi, a grid is decomposed into cubes, each of which is
assigned to a thread block. Within each thread block,
each thread is responsible for executing a small tile in
a plane. A thread block executes a cube plane by plane
along the Z-direction. For the same reasons, cubes can
be arbitrarily long in the Z-direction. However, in the
X- and Y -directions, their lengths are determined by
thread block size and thread granularity.
Circular Queue with Multiple Time Steps. The cir-
cular queue method can compute multiple time steps
in a row between loading data from and storing results
back to oÆ-chip DRAM. The parameter T describes
the the number of time steps computed between the
loading and storing. Using a large T means comput-
ing more time steps between loading from and storing
to oÆ-chip global memory, which increases the ratio of
computation to memory access. On the other hand,
more redundant computations arise due to large ghost
zones used[8]. Therefore, a choice of T involves tradeoÆs
between reducing memory tra±c and tolerating compu-
tation redundancy.
Data-Flow. Fig.2 shows how stencil computations
are performed for each thread block for a 2D stencil,
in which Figs.2(a)ª2(d) show the pipeline setup and
Figs.2(e)ª2(f) show the steady state. By steady state,
we mean the computations of full skewed columns,
where ﬁnal results are computed. Pipeline setup refers
to the computation of the few partial skewed columns
at the beginning of the whole process. The pyramid-
like grid is computed by a thread block. In this case,
three consecutive time steps are computed at a time
(T = 3). The bottom yellow bars are data loaded from
global memory at the time step 0. The top red bars
are computed results at the time step 3, which are to
be stored back to global memory. The middle blue bars
represent computed results at the time steps 1 and 2. In
each iteration, a skewed column (dark-colored bars in
Figs.2(b)ª2(f), marked by black circles) is computed.
Within each iteration, the computation is performed
from bottom-right towards top-left. For each iteration
in the steady state (Figs.2(e) and 2(f)), at least two
skewed columns plus one bar of data are needed to per-
form the computation. The storage space is immedia-
tely retrieved for newly loaded or computed data once
it is no longer used.
Listing 1 shows the pseudo code of the circular queue
method for 2D 5-point stencil implemented with shared
memory arrays with 3 £ T storage locations[1]. Values
computed in the current skewed column (the inner loop
computes a skewed column) is used by adjacent threads
during the next skewed column, and thus synchroniza-
tion happens after each skewed column is computed to
Fig.2. The computation process of a 2D stencil on GPU. (a)ª(d)
show the pipeline setup and (e) ª (f) show the steady state.
The pyramid-like grid is computed by a thread block. The bot-
tom yellow bars are data loaded from global memory, the top
red bars are results to be stored back to global memory and
the middle blue bars are intermediate values. White bars cor-
respond to data not yet loaded or computed. Bars with darker
colors indicate data being loaded or computed in the current it-
eration. In each subﬁgure except (a), those dark-colored bars
form a skewed column, marked by black circles. Only those bars
to the bottom-left, bottom-right and under a dark-colored bar
need to be stored on-chip. In each diagram, computations are
performed in order from bottom-right to top-left along skewed
columns. Storage spaces are recycled for reuse immediately after
values are no longer used.
Listing 1. Pseudo Code of the Circular Queue Method
Implemented With Shared Memory Arrays with 3 £ T
Storage Locations
to ensure data exchange.
2.2 Problems with Shared-Memory-Based
Circular Queue on GPU
2.2.1 Unbalanced Usage of On-Chip Storages and the
Overhead of Accessing Shared Memory
For GPU, the concept of occupancy[9] is deﬁned as:60 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
Occupancy =
No. of active threads per SM
max No. of active threads per SM
.
(1)
SM in the above formula is short for Streaming
Multiprocessor[9]. Higher occupancy indicates more
threads to hide latency, which may have great perfor-
mance impact. According to [10], the number of active
threads per SM (NATPS) is determined by
NATPS = min
n
8,
registers per SM
registers per thread block
,
shared memory per SM
shared memory per thread block
o
£
thread Block Size. (2)
As a result, increasing occupancy requires usage of
register ﬁle and shared memory to be balanced. In ad-
dition, excessive usage of either storage may result in an
decrease of occupancy. All current generations of GPU
have a much larger per SM register ﬁle than shared
memory. For each thread block, the data partition
should therefore be biased towards registers. However,
for circular queues implemented using shared memory
only, data are allocated on shared memory. In this case,
the occupancy is determined by shared memory usage
with many on-chip registers being left unused.
Besides, shared memory takes extra load and store
instructions to access. Putting a circular queue data
structure entirely in shared memory is therefore not ef-
ﬁcient.
2.2.2 Limited Shared Memory Space
GPU’s per SM shared memory is only 48KB even
for newly released Fermi[9] and 16KB for all older gene-
rations. Shared memory is shared by all active threads
on an SM at a given time. It is recommended that at
least 192 threads be simultaneously on an SM[9], and
thus, for each thread block, the number of time steps,
T, cannot be very large, which probably prevents us
from obtaining the best performance.
In summary, implementing a circular queue using
shared memory only is not e±cient. Instead, we should
consider leveraging both registers and shared memory.
2.3 Challenges of Implementing Circular
Queue in Register File
If we are to move part or all of circular queue from
shared memory to registers, the following challenges
must be addressed:
• Registers and shared memory are both precious
resources. Thus, how can we utilize both eÆectively?
• Circular queues require storage reuse, which is
easy to implement with pointer arithmetics. However,
for registers, which are not indirectly addressable, how
should we implement a circular queue using registers?
• Stencils require inter-thread communication. If we
put some data in registers, how should we handle the
data exchange?
• How do we ﬁnd the best partition of data between
registers and shared memory?
We will provide our solutions to address all these
issues below.
3 Scalar-Based Circular Queue on GPU
To exploit the potential of registers for circular
queue, we must shift from array-based storage to scalar-
based storage. In this section, we propose scalar-based
circular queue. These scalars variables (SVs) can either
reside in registers or shared memory. Two issues need
to be ﬁgured out for this purpose: 1) The circular queue
should be implemented with the least number of SVs;
2) The SVs should be reused explicitly.
To cover the these two issues, we designed the
method of reuse pattern. We ﬁrst discuss it for 2D sten-
cils with each thread computing only one point in the
X-dimension, and extend it to coarser thread granu-
larity and higher dimensions later. We refer to the
dimension along which the computation moves as the
moving dimension. In the 2D case, the moving dimen-
sion is the Y -dimension while for the 3D case it is the
Z-dimension. We deﬁne H as the halo size (the number
of neighborhood elements needed in each direction to
compute 1 point[8]) of the moving dimension. We as-
sume the halo size along the Y -axis positive and nega-
tive directions are the same.
For a two-dimensional stencil with halo size H, to
compute T time steps in a row, intuitively, (2 £ H +
1) £ T SVs are needed, where 2 £ H + 1 SVs are allo-
cated for each time step. Actually, there are two cases
where SVs can be reused: 1) The oldest value to com-
pute a point is no longer used after computation, and
the SV can be used to hold a newly computed value; 2)
The ﬁnal result is immediately written back to global
memory, and thus the SV can be used to hold the value
loaded from global memory of the next skewed column.
The reuse reduces the number of SVs to 2£H £T +1.
We name the SV holding the newly loaded value
BaseSV, shown as circled node 0 in Fig.3(a)-1. We
could start from BaseSV, allocate SVs to values as in
Fig.3(a)-1: in each level, if the SV holding a value is
SV r, then, the SV holding the value to its left is:
SV l = SV ((r°1+(2£H£T+1))mod(2£H£T+1)). (3)
After 2 £ H £ T + 1 skewed columns (in the case ofYang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 61
Fig.3. SV reuse and reuse pattern for T = 2 and H = 1. (a) and
(b) shows how SVs are used. (c) summarizes the reuse pattern
for T = 2 and H = 1. (d) shows the reuse pattern for T = 2 and
H = 2.
Fig.3(a), ﬁve skewed columns), the allocation pattern
goes back to Fig.3(a)-1, forming a repetitive allocation
pattern. We call this repetitive allocation pattern the
steady state of the reuse pattern. The reuse of SVs are
reﬂected in that: 1) The leftmost value and the newly
computed value in upper level are assigned the same
SV; 2) The ﬁnal result and the newly loaded value of
the next skewed column are assigned the same SV. For
example, ﬁnal result of Fig.3(a)-5 and the newly loaded
value of Fig.3(a)-1 are assigned to the same SV-node 0.
Fig.3(b) shows how each SV iterates over the steady
state, and Fig.3(c) summarizes the reuse pattern when
T = 2 and H = 1. Note that, while SVs iterate over
the steady state, the time steps of the values they are
holding also change, as shown in Figs.3(a) and 3(b).
Fig.3(d) shows the reuse pattern of a 2D 5-point stencil
with bigger halo size. For stencils with bigger halo size,
skewed columns are skewed more towards the horizon-
tal line, and steady states are longer.
The prolog contains 2 £ H £ T partial skewed
columns, which can be divided into T consecutive
groups, each containing 2 £ H partial skewed columns
of the same length. Each partial skewed columns in the
ﬁrst group only contains a load from global memory,
each column in the next group contains one load fol-
lowed by one computation, etc. Skewed columns in the
next group has one more computation than those in the
previous group.
From the above analysis, we could summarize a reuse
pattern for a stencil given T and H as follows.
• The steady state of the pattern involves exactly
2 £ H £ T + 1 skewed columns for the pattern to re-
peat, which means the loop body contains 2£H£T +1
skewed columns.
• The prolog of the pattern involves exactly 2£H£T
partial skewed columns. The length of i-th partial
skewed columns is di/(2 £ H)e.
• To compute a new value stored in SV k, the
SVs needed are SV k, SV (k+1)mod(2£H£T+1), ...,
SV (k+2£H)mod(2£H£T+1).
• The SV holding the next value to compute in the
skewed column immediately after the computation of
SV k is:
SV next = SV ((k°2£H+(2£H£T+1))mod(2£H£T+1)).
(4)
Once the SV used in the ﬁrst partial skewed column
of the prolog is set, we could derive the reuse pattern.
Given T and H, code can be automatically generated
with the above rules. Some of the SVs will reside in
registers, while the rest reside in shared memory, which
will be addressed in Subsection 5.1.
It is easy to extend to cases with coarser thread
granularity. If each thread computes k points in the
X-direction, the k points can be viewed as a super-SV,
which can be treated as one plain SV in our method.
Therefore we need to allocate 2£H £T +1 super-SVs,
with each super-SV consisting of k points, resulting in
a total of k£(2£H£T +1) SVs. Extending to stencils
with higher dimensions is similar. For example, in a 3D
7-point stencil, each thread operates on small 2D tiles
of data of size m£n, we allocate m£n£(2£H£T +1)
SVs, which are organized as m£n super-SVs with each
consisting of 2£H£T +1 points. Inside each super-SV,
computations of diÆerent plain SVs can be performed in
any order, since Jacobi-like stencils have no order con-
straints. The reuse pattern can extend to cases where
halo size along Y -axis positive and negative directions
are diÆerent. In this case, we just need to replace 2£H
in the above formulae with Hp + Hn.
4 Inter-thread Communication
Within a thread block, each thread needs data
from neighboring threads to perform a stencil compu-
tation. In this section, we discuss how to exchange data
through shared memory, and how to reuse shared mem-
ory space. Again, we start from 2D stencils with one62 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
point per thread, and then we extend it to more general
cases.
In Fig.4(a), we divide a stencil pattern into three ar-
eas: left, right and self. Each point has an index for the
Y -dimension. In the case that each thread only com-
putes one point, points in the left and right parts to-
gether reﬂect what data are needed by adjacent threads.
Y -indices of these points form a subset, which indicate
data to be broadcasted. We divide stencils into two
categories according to the size of the subset needed,
as shown in Figs.4(b) and 4(c):
• Type A: all the points in the left and right areas
have the same Y -index;
• Type B: points in the left and right areas have
diÆerent Y -index.
Fig.4. Our division of stencil types. (a) shows our anatomy of
a 2D stencil patterns. (b) and (c) use 2D 5-point and 2D 9-
point stencils as examples respectively to demonstrate Type A
and Type B stencils we deﬁned. Each circle represents a loaded
or computed data. The parallelogram boxes show data needed
by adjacent threads to compute the skewed column indicated by
arrows. In (c), as the dark circles show, each data is used by
adjacent threads 3 times.
The diÆerence between these two types is that for
Type B, as computation moves forward, the data broad-
casted has temporal reuse in the next few skewed
columns, while for Type A, it is only used once. For
Type B, we deﬁne NIdx as the number of diÆerent Y -
indices in the left area and the right area, MaxIdx and
MinIdx as the biggest and smallest among these indices.
Note that between MaxIdx and MinIdx, there may be
indices that do not appear in the left and right parts.
For Type A, to compute a skewed column, each
thread needs to broadcast T values to its adjacent
threads. Thus, each thread needs no more than T lo-
cations for communication. To use less shared memory
space, we could allocate dT/RDe locations (1 < RD 6
T), in which case communication space is reused within
a skewed column. We name this scheme single column
reuse (SCR), and RD as reuse degree of communica-
tion space. The bigger RD is, the less shared memory
space for communication we need, which is good for
occupancy. However, since this storage reuse intro-
duces write-after-read hazard, extra synchronizations
are needed, which brings overhead.
For Type B, similar to Type A, we could adopt SCR,
allocating dT/RDe£NIdx locations (1 6 RD 6 T). In
this manner, each thread needs to write NIdx values for
each single computation. Alternatively, we could ex-
ploit the temporal reuse, so that each value only needs
to be written once, instead of NIdx times for SCR.
We propose multiple column reuse (MCR), in which
once a value is written to shared memory for commu-
nication, it is not overwritten until it is not used any-
more. For each single stencil computation, data in the
range between MaxIdx and MinIdx should be kept in
shared memory. To save space, locations holding va-
lues not used anymore are immediately reused to hold
newly broadcasted value. The reuse of communica-
tion locations follow the reuse pattern with parame-
ter Hp + Hn being equal to MaxIdx + MinIdx. Thus,
(MaxIdx °MinIdx)£T +1 locations are needed. When
MaxIdx+MinIdx does not equal 2£H, reuse pattern for
SVs and communication spaces are diÆerent, and thus
have diÆerent length of steady state. We may need to
unroll the original code to their least common multiple.
To compare SCR and MCR, we consider shared
memory writes and space requirement. For Type B,
the number of shared memory writes of MCR is only
1/NIdx of the number of SCR. As of space require-
ment, for a speciﬁc stencil, the space requirement of
MCR is a ﬁxed value while the space requirement
of SCR can be reduced by increasing RD, and thus
potentially has higher occupancy. Particularly when
NIdx < MaxIdx ° MinIdx + 1, MCR has bigger disad-
vantage in space requirement. It is expected that when
occupancy plays a more important role than instruction
count, SCR will outperform MCR and vice versa.
When MaxIdx equals H, newly loaded or computed
values are needed in the next computation along the
same skewed column. In this case, new values must be
broadcasted immediately after its generation. In other
cases, shared memory writes can be clustered together
before the computation of a skewed column (or a sec-
tion of a skewed column when RD > 1 for SCR).
For cases with higher thread granularity, where each
thread computes several points in non-moving dimen-
sions, it is possible that data of some points are only
needed by points computed by the same thread, and
thus not needed by other threads. We need to ﬁnd outYang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 63
the subset of Y -indices that are only needed by other
threads. As a result, points within a thread may have
diÆerent subsets and some points may not need to
broadcast data to other threads. For 3D stencils, we
need to analyze adjacent threads from both X- and Y -
dimensions to decide the subset of needed data, and the
rest is similar.
5 Framework for Circular Queue on GPUs
In this section, we describe our framework that au-
tomatically generates code and searches for the best
partition of SVs between register and shared memory.
Our iterative search engine aims at ﬁnding the best par-
tition between registers and shared memory as well as
communication scheme for given thread block size and
T.
For readability, we summarize denotations that will
be used:
T: the number of time steps calculated between one
load and store.
H: the number of neighborhood elements along the
moving dimension needed in each direction to compute
one point.
N R: the number of SVs allocated in registers.
N S: the number of SVs allocated in shared memory.
5.1 Code Generation
A script which takes H, T, RD and N R as input
is used for code generation. Listing 2 shows the gene-
rated code for a 2D 5-point stencil with H = 1, T = 2,
each thread computing one point, with four SVs in regi-
sters and one in shared memory. A complete code con-
sists of four parts: declaration (lines 7ª9), prolog (lines
17ª31), steady state (lines 33ª78) and epilog (lines
79ª100).
Listing 2. Code of 2D 5-point Stencil with T = 2,
RD = 1, H = 1, and Each Thread Computing one
Point in the X-direction
Given N R, we assign the N R SVs with highest SV
numbers to registers. These N R SVs are declared as
automatic variables in the form aX, where X is the SV
number (line 7), while the rest are declared aggregately
as a shared memory array with SV numbers being array
indices. We allocate a two-dimensional shared memory
array (for 3D stencil, it would be a 3D array) to ac-
commodate both SVs, communication locations and an64 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
extra space for global memory coalescing (lines 8ª9),
and each thread owns one row of it. The row length
of the array is padded to odd numbers (lines 5ª6) to
avoid bank conﬂict.
As explained in the previous section, the length
of prolog is 2 £ H £ T, each of which computes a
partial skewed column, and the steady state contains
2 £ H £ T + 1 skewed columns. The epilog handles
arbitrary tile size in the slowest changing dimension.
In Listing 2, due to limited space, we only show epi-
log code when epilog = 1. With the reuse pattern, the
script is able to derive SV numbers involved in each
load, computation or store. According to SV number,
the script either emits an automatic variable name or
a shared memory array element.
For SCR, when RD is bigger than 1, the generated
code for a skewed column is strip-mined into several
sections, with each section does dT/RDe stencils (ex-
cept the last section of a skewed column). In each sec-
tion, data needed by other threads for the following
computations are ﬁrst written to the array, guarded by
synchronization statements (lines 35ª38, for example),
then computations are performed with data written by
adjacent threads (lines 39ª40). Our script examines
SV number, and only emits write statements for SVs in
registers. For SVs in shared memory, they will be ac-
cessed through the array indices of their original storage
locations (line 66).
Listing 3. Deﬁnition of Macros
The script uses macros as templates for code gene-
ration. The macros used are deﬁned in Listing
3. LOAD DATA and STORE DATA load data from
and store data back to global memory respectively.
WRITE SHARED MEM writes values in registers to
shared memory for communication. The DO STENCIL
macro is deﬁned by users to describe how to compute
each point of a thread. Except mid idx, all other ar-
guments of this macro are SVs (either in registers or
in shared memory) generated by the script. Users use
mid idx to index into other threads’ data in the shared
memory. For 2D stencils, sh array is 2-dimensional,
and each thread stores its broadcasted data in the
sh array[tx + HALO] sub-array. For 3D stencils,
sh array is 2-dimensional, and each thread stores its
broadcasted data in the sh array[ty + HALO][tx +
HALO] sub-array. Data of adjacent threads can be
accessed by changing oÆsets of the highest or the high-
est two dimensions for 2D and 3D stencils respectively.
Global memory access must be coalesced[9]. Sten-
cil computations need ghost zones to perform compu-
tation. Assuming that there are m threads in the X
dimension for each thread block and each thread com-
putes one point, then only the inner m°2£T threads
write the ﬁnal result. When m°2£T is not a multiple of
16, the alignment condition of coalescing is not satisﬁed.
To make writes aligned, the script generates code that
calculates an oÆset (line 13 of Listing 2). When writ-
ing global memory, threads ﬁrst write data to shared
memory, then a selected subset of threads, determined
by the oÆset, read other threads’ data with this oÆset,
and write them to global memory (lines 7ª13 of List-
ing 3). There are cases where the oÆset is too big and
the subset does not have enough threads to store all
the results, thus, the script generates additional global
memory writes, guarded by if statement, for them (lines
14ª19 of Listing 3). Global memory reads have the
same alignment issue, but we could bind the global
memory array to a texture[9], and use the read-only
texture memory to easily solve this problem (lines 1ª4
of Listing 3).
In this work we do not take boundary conditions into
account. For a stencil with halo size H and time block-
ing factor T, we add H £ T elements to both ends of
each dimension. For example, for an N £ N domain,
we actually use an array of size (N +2£H £T)£(H +
2 £ H £ T), in which only the inner N £ N sub-array
corresponds to points to be calculated, and the outside
are used as ghost zones for blocks at domain bound-
ary. As a result, boundary blocks have the same access
pattern as inner blocks. We do this to simplify code
generation. Boundary condition handling is orthogo-
nal to the proposed reuse pattern and data partition as
reuse pattern and data partition deals with data stored
on-chip. Special handling can be added to ﬁx global
memory array indices for boundary points. Thus, the
proposed method will still work with boundary condi-
tion handling code.Yang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 65
5.2 Search for the Best Parameters
As shown in Subsection 3.2, the best partition be-
tween registers and shared memory is the result of the
complex dynamics among occupancy, synchronization
overhead and shared memory access overhead. For exa-
mple, for two code variants, one with slightly higher oc-
cupancy but less data in registers, the other with lower
occupancy but more data in registers, it is hard to de-
cide which one is better without running. Similarly,
two code variants diÆering both in RD and occupancy
may not be compared directly either. Thus, we resort
to running to ﬁnd the best variant among those not
directly comparable. An alternative would be to use
analytical performance models to help decide. How-
ever, models with higher accuracy than those proposed
in [11-12] are required. Luckily, RD and N R are known
during code generation, and occupancy can be obtained
after compilation. Thus, only a few variants need to be
evaluated through running.
For a stencil code with given parameters of T and
thread block size, our framework works as follows.
Firstly, we select candidates for each RD value re-
spectively. We iterate RD from 1 to T. For each
RD value, we start from the version that N S is 0,
and iteratively move one SV from register to shared
memory. Each version is compiled to get registers
and shared memory requirement, which are then fed
to CUDA Occupancy Calculator[10] to compute occu-
pancy. During the process, whenever an SV movement
causes occupancy to increase, we mark the new ver-
sion as a candidate, as this version has higher occu-
pancy than previous ones, and among those with the
same occupancy, it has the most SVs in register, and
thus has the least instruction count among them. The
iteration stops until occupancy drops or all SVs are
in shared memory, which eliminates code variants with
occupancy lower than the starting version. We also
include the shared memory version as a candidate, be-
cause it needs no communication space, which may have
smaller shared memory requirement, and thus higher
occupancy than hybrid versions. For Type B, we also
add the version with MSR scheme as a candidate.
Secondly, among candidates with diÆerent RD val-
ues, we eliminate versions known to perform worse than
others. We group all candidates according to their oc-
cupancy. In each group, we compare each pair of can-
didates. Note that, candidates in the same group have
diÆerent RD values. If the code variant with bigger RD
value does not have less data in register, then we elimi-
nate it from the candidate set. Therefore, for each RD
value, at most one code variant remains.
Finally, we run the remaining code variants, and
select the outperforming one under certain T and
thread block size.
The process is illustrated in Fig.5.
Fig.5. Illustration of the iterative search process. (a) and (b)
show how occupancy changes according to R S for RD being 3
and 6 respectively. 2D 5-point stencil is used, with T = 6 on
9800GX2. Code variants in black circles are selected by step 2.
The iteration stops when R S turns 7 and 8 respectively for (a)
and (b). In step 3, the two version with RD = 6 are eliminated,
because compared to corresponding versions with RD = 3, they
have either the same or less data in registers, which will result in
worse performance.
5.3 Discussion
According to the pruning process, the number of
candidates left to run is not more than the product
of number of diÆerent occupancy values and RD val-
ues. Since ranges of both quantities are small, we do
not expect many candidates are left for the third step.
In reality, as our experiment results show, the number
of candidates that require running is very small.
It is important to guarantee that the pruned space
still contains the point with optimal performance. Since
our code generator is on the source code level, interac-
tion with compiler optimizations must be considered.
Our pruning process is based on the assumption that, if
occupancy stays the same, moving one SV from register
to shared memory or increasing RD would impair per-
formance as they increase dynamic instruction count.
However, it is not possible to predict how these source-
level behaviors would impact compiler optimizations
such as instruction selection/scheduling. It is theoreti-
cally possible that moving one SV to shared memory
causes performance to increase slightly although occu-
pancy stays the same. These anomalies will not keep
us from ﬁnding very near-optimal points because: 1) If
occupancy stays the same, such modiﬁcation as mov-
ing one SV to shared memory will only make tiny diÆe-
rences in performance; 2) By moving SVs to shared me-
mory or increasing RD, the overall trend is that added
instructions cause performance to drop. Local anoma-
lies will not compromise the total performance trend.
Thus, even if we miss the optimal point, we still will
capture points with near-optimal performance.66 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
6 Experimental Results
In this section, we present the experiment results.
This section could be divided into 4 parts. In the ﬁrst
part, we ﬁx the value of T, and see how the proposed
techniques aÆect performance under diÆerent T values.
In the second part, we incorporate T into discussion.
For a certain kind of stencil on GPUs, the best T value
changes after the application of our techniques. Thus,
the purpose of the second part is to evaluate overall per-
formance improvement for a certain kind of stencil. In
the third part, we present the results on the 3D 19-point
stencil, and compare our results with those presented in
Paulius Micikevicius’s paper[22] in order to ﬁnd diÆer-
ence between our implementations. In the last part, we
present our ﬁndings related to compiler’s register allo-
cation, and suggest further reﬁnement. We tested 4 dif-
ferent kinds of stencils: 2D 5-point, 2D 9-point, 2D 13-
point (halo size=3), and 3D 7-point stencils, as shown
in Fig.6. Our evaluation is based on three diÆerent
generations of GPUs, with the conﬁgurations shown in
Table 1 (9800GX2 contains two GPUs, we only use one
of them). All the programs are compiled by nvcc under
-O3 option, and CUDA Proﬁler is used to measure ex-
ecution time and dynamic instruction count. Register
and share memory requirements are obtained from the
.cubin ﬁle generated by nvcc. We only measured the
execution time on GPU, and CPU-GPU transfer time
is excluded.
Fig.6. Stencils we used in our experiments. (a) 2D 5-point sten-
cil. (b) 2D 9-point stencil. (c) 2D 13-point stencil. (d) 3D 7-point
stencil.
For 2D stencils, we set the tile size in Y -direction
500 points, and tile size in Z-direction 100 points for
3D stencils, so that redundant computation is not too
big due to ghost zones[8]. The shared memory versions
used 2£H£T +1 elements per thread for data storage,
and no space for communication is needed. We set the
thread granularity in the X-direction to 1 in our study.
6.1 Performance Improvement for Individual
T Values
Except 2D 9-point stencil, we present three versions:
the shared mem version with all SVs in shared mem-
ory, the hyb reg version with all SVs in registers and
no communication space reuse, which is the start point
of our search, and the hyb search version, which is the
best version through search. For 2D 9-point stencil,
we used both SCR and MCR. The thread block sizes
for the 4 stencils are 192, 192, 256 and 512 (32 £ 16)
respectively.
Figs.7ª9 show the occupancy, dynamic instruction
count and performance of various versions under diÆer-
ent T values. The instruction counts are normalized to
the instruction count of reg versions, and for 2D 9-point
stencil, they are normalized to the MCR reg versions.
Table 2 summarizes speedups of search versions over
shared-mem and reg versions. For 2D 9-point stencils,
data before slashes show speedups of MCR search ver-
sions over shared-mem versions and MCR search ver-
sions over MCR reg versions while data after slashes
show speedups of SCR search versions over shared-mem
versions and SCR search versions over SCR reg ver-
sions. The circular queue’s on-chip resource consump-
tion is proportional to T. Thus, when T increases, the
occupancy curves present ladder form fall. As shown in
Fig.7, our method enables bigger T values, and for each
T value, our search versions provide higher occupancies.
Our method generally makes occupancy decrease slower
when T increases. Reg versions have all SVs in regis-
ters, which eliminates the instructions to access shared
memory during computation. Thus, they have the least
instruction count. Search versions may have more in-
structions than reg versions when RD is bigger than 1
or N S is bigger than 0.
For 2D 5-point stencil, on GTX 285, the search ver-
sion outperforms the reg version when T is equal to
and greater than 5. Actually, when T = 4, the search
version has higher occupancy, but no performance ad-
vantage, which shows that the occupancy level of the
reg version is high enough to hide latency. Occupancy
plays a more important role when its value is small,
which explains the big gap between the search and the
reg version when 9 6 T 6 16. 9800GX2 has a smaller
Table 1. Machine Conﬁguration
Peak Arithmetic Global Memory No. Max No. Shared No. 32-bit Driver NVCC
Performance Bandwidth SMs Threads Memory per Registers Version Version
(GFLOPS) (GB/s) per SM SM (KB) per SM
9800GX2 384 64 16 768 16 8192 190.18 2.3
GTX285 709 141 30 1024 16 16384 190.53 2.3
C2050 1030 144 14 1536 48 32768 260.19 3.2Yang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 67
Fig.7. Occupancy of the four stencils on all platforms.
Table 2. Performance Speedups for Individual T Values
Search/Shared-mem Search/Reg
9800GX2 GTX285 C2050 9800GX2 GTX285 C2050
2D 5-point 1-1.72 1.04-2.10 1-2.93 1-1.59 1-1.42 1-1.11
2D 9-point 1-1.45/1-1.15 1-1.59/1-1.13 1-2.51/1-1.12 1-1.39/1-1.49 1-1.65/1 1-2.6/1-1.01
2D 13-point 1.06-1.08 1.36-1.71 1-2.29 1-1.43 1 1-1.02
3D 7-point 1-1.03 1.11-1.44 1-2.05 1 1 1
register ﬁle, thus, the reg version even performs worse
than the shared-mem version when T = 4, because of
the unbalanced usage of register ﬁle and shared mem-
ory. Tesla C2050’s register ﬁle and shared memory is
signiﬁcantly larger than its previous generations, and
thus, all three versions are able to sustain high occu-
pancy levels even for big T values. The ratio of reg-
ister and shared memory requirement of the reg ver-
sion is close to C2050’s hardware ratio, and the search
version is not able to further improve occupancy. As
a result, the curves of the two versions almost overlap.
The same happens for 2D 13-point and 3D 7-point sten-
cils on C2050.
For 2D 9-point stencil, on all three machines, MCR
search versions have the same occupancies as shared-
mem versions, but have less SVs in shared memory
than shared-mem versions. Thus, their performance gap
shows the eÆect of instruction reduction. Similar to 2D68 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
Fig.8. Dynamic instruction count of the four stencils on all platforms, normalized to the reg versions. For 2D 9-point stencil, normalized
to MCR reg version.
5-point stencil, due to the smaller register ﬁle, the MCR
reg version on 9800GX2 has lower occupancies than
the shared-mem version and thus performs worse. The
SCR versions have more instructions than MCR be-
cause they cannot take advantage of the reuse of broad-
casted data and thus have more shared memory writes
for communication. They even have more instructions
than the shared-mem version, because the shared-mem
version can exploit the reuse. The performance diÆer-
ence of SCR search and MCR search versions is the re-
sult of a combat between occupancy and instruction re-
duction. When T is small, occupancy is less signiﬁcant
and thus the MCR search version wins as having less
shared memory writes. For bigger T values, occupancy
is more important and the SCR search version beats the
MCR search. SCR-reg versions need a large amount
of shared memory to store the broadcasted data of a
skewed column, even more than shared-mem versions
and thus they perform the worst.
For stencils with big halo sizes such as 2D 13-point
stencil, number of SVs grows faster with T. As a re-
sult, value range of T is smaller. The occupancy is
determined by register usage and reusing communi-
cation space and partition hardly ﬁnd any chance to
move enough data from registers to shared memory
to increase occupancy. Therefore, most of the time,
search versions and reg versions are the same. On
9800GX2, due to a smaller register ﬁle and the large
storage requirement, even the most balanced version
cannot improve occupancy.
3D 7-point stencils have small range of T because
we choose the thread block size of 512. With such a
thread block size, when T increases, storage require-
ment quickly exceeds the hardware capacity, which
makes code variant not runable. For 3D stencils, the
ratio of redundant computation increases faster with TYang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 69
Fig.9. Performance of the four stencils on all platforms.
than 2D stencils and thus a big thread block size is
preferred to increase tile size[8]. 9800GX2 can only run
at most 768 threads per SM and thus with a thread
block size of 512, there is no chance to improve occu-
pancy. The limited speedup when T = 2 comes from
instruction reduction.
Table 3 shows the eÆectiveness of the search space
pruning. For 9800GX2, T values used for the four
Table 3. Search Space Pruning
2D 5-point 2D 9-point 2D 13-point 3D 7-point
9800GX2 2/84 3/76 2/24 3/5
GTX285 3/94 3/182 3/25 2/9
C2050 5/193 6/341 2/166 3/31
Note: In the form a/b, a is the number of candidates actually
run in the 3rd step, and b is the total number of runable code
variants in the search space.
stencils are 10, 10, 3, 3 respectively while for the rest
two platforms, T values are 15, 15, 8, 4. Thread block
sizes used are 192, 192, 256 and 512 (32 £ 16) respec-
tively. As illustrated in the table, only a very small
amount of code variants are actually run to ﬁnd the best
one, which shows that our pruning scheme is very ef-
fective. For all the four stencils on three platforms, our
framework succeeded in capturing the optimal points.
Table 4 shows total and average time spent on code
generation, compilation and running of the four sten-
cils on C2050 with the above T value and thread block
sizes. The CPU of the C2050 system is Intel Xeon
E5506 at 2.13GHz and the main memory is of size 4GB.
Code generation and compilation account for most of
the search time. This is due to the fact that: 1) only a
small portion of generated and compiled code variants70 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
Table 4. Search Time Breakdown on Tesla C2050
2D 5-point 2D 9-point 2D 13-point 3D 7-point
Time (seconds) 371(4.9)/743(9.9)/27(5.4) 667(6.2)/4359(40.7)/34(5.7) 629(7.8)/2291(28.3)/12(6) 8(0.4)/38(2)/11(3.7)
Note: In the form a(x)/b(y)/c(z), a, b and c are the total time spent on generation of source code, compilation, and running
respectively. Numbers in parenthesis show average time spent on generation of source code, compilation and running for each code
variant. Running time includes the time of running a kernel for 10 times and CPU-GPU data transfer time.
need running; 2) the average time spent on code genera-
tion or compilation is on par or longer than the average
running time. The number of code variants generated
and compiled are 75, 107, 81 and 19 for the four sten-
cils respectively. The size of the loop body is linear
to T and thus when T is big, code size is big. Fur-
thermore, the epilog almost doubles the code size. All
these cause the code generation and compilation to slow
down. The average code generation and compilation
time is small for 3D 7-point stencil because a small T is
used. To reduce overall search time, possible solutions
are: 1) Reduce the number of code variants that are
generated and compiled. For example, moving two SVs
from registers to shared memory at a time instead of
one or developing models to rule out code variants that
are unlikely to have high occupancy. Both ways have
the risk of missing optimal conﬁgurations. 2) Speed up
code generation by implementing it with more e±cient
languages such as C instead of shell scripts we are us-
ing now. As for compilation time, it is currently out of
our control. However, if inline assembly is allowed, we
could generate code with inline assembly directly, which
we believe could save some compilation time. We will
consider reducing search time in our future work.
6.2 Interaction With T
In this subsection, we show how the proposed tech-
niques interact with T and compare performance un-
der each versions’ best T values. As discussed in [8],
the best T value is a tradeoÆ between global memory
tra±c reduction and redundant computation brought
in. On GPUs, performance is also aÆected by oc-
cupancy. Decrease of occupancy may create sudden
drops in the performance curve, which makes the ac-
tual best T value smaller. The search versions are also
aÆected by communication space reuse and data parti-
tion, which impacts synchronization frequency and in-
struction count. When T increases, to prevent occu-
pancy falling, RD and N S tend to increase, and may
hurt performance. When occupancy stays the same,
performance of reg versions no longer increases from
T = 18 and T = 2 for GTX285 in Figs.9(a) and 9(b),
indicating the theoretical best T value for the two cases.
For the rest cases, the theoretical best T values are not
yet reached until the curves of reg versions end. 3D
stencils have smaller theoretical best T values because
redundancy grows faster[8].
Due to the occupancy change, our search versions
are able to reach bigger T values with higher perfor-
mance originally unreachable, such as the cases of 2D
13-point stencil on all three platforms and 3D 7-point
stencil on 9800GX2, as shown in Figs.9(c) and 9(d). In
the rest cases, the best T values either stay the same
while performance increases or the best T values itself
increase.
For each stencil, we tried thread block sizes of 64,
128, 192, 256, 320, 384, 448, and 512. Fig.10 summa-
rizes the speedups of our search versions over shared
mem versions under their own best T values.
Table 5 and Table 6 show the parameters of our best
Fig.10. Speedups of best search versions over best shared-mem
versions for the four stencils on all platforms. (pt: point)
Table 5. Parameters of the Balanced Versions for 9800GX2, GTX285 and C2050
9800GX2 GTX285 C2050
2D 5-pt 2D 9-pt 2D 13-pt 3D 7-pt 2D 5-pt 2D 9-pt 2D 13-pt 3D 7-pt 2D 5-pt 2D 9-pt 2D 13-pt 3D 7-pt
Thread Block Size 192 192 128 32 £ 16 256 192 256 32 £ 16 192 192 256 32 £ 32
T 64 33 94 32 96 44
Total Number of SVs 13 9 19 7 19 9 19 5 19 13 25 9
SVs in Register 9 7 19 6 18 9 19 5 19 13 25 9
RD 2 MCR 1 1 2 MCR 1 1 1 3 1 1
Total Regs per Thread 21 20 31 16 31 24 31 14 34 27 40 22
Shared Mem (B) 7020 7020 2712 12276 7252 7012 5264 7372 8536 5432 5240 23120
Occupancy 0.5 0.5 0.333 0.667 0.5 0.375 0.5 1 0.625 0.75 0.5 0.667
Reg/Shared mem 2.30 2.19 5.85 2.67 4.38 2.63 6.03 3.89 3.06 4.50 7.80 3.90
Performance (GFLOPS) 93.6 98.6 99.3 39.0 211.0 180.5 219.6 91.3 218.1 169.6 236.3 95.7
Note: pt-point.Yang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 71
Table 6. Parameters of the Best Shared memory Based Versions for 9800GX2, GTX285 and C2050
9800GX2 GTX285 C2050
2D 5-pt 2D 9-pt 2D 13-pt 3D 7-pt 2D 5-pt 2D 9-pt 2D 13-pt 3D 7-pt 2D 5-pt 2D 9-pt 2D 13-pt 3D 7-pt
Thread Block Size 128 192 128 32 £ 16 192 192 128 16 £ 20 256 256 192 32 £ 16
T 74 22 44 22 74 32
Total Number of SVs 15 9 13 5 9 9 13 5 15 9 19 5
Regs per Thread 12 12 12 13 12 12 12 13 34 15 20 18
Shared Mem (B) 7836 7020 7000 12276 7012 7012 6992 7948 15480 9288 15048 12240
Occupancy 0.333 0.5 0.333 0.667 0.375 0.375 0.25 0.625 0.5 0.833 0.375 1
Reg/Shared mem 0.78 1.31 0.88 2.16 1.31 1.31 0.88 2.09 2.25 1.65 1.02 3.01
Performance (GFLOPS) 77.1 89.7 77.2 35.9 145.8 162.5 134.5 64.9 161.3 148.7 169.8 64.5
Note: pt-point.
versions and the best shared memory only versions re-
spectively. For 2D 9-point stencil on GTX285, MCR
has the best performance, while on 9800GX2, best per-
formance comes with SCR-1. For 3D stencils, thread
block sizes are expressed as X £ Y , in which X and Y
are dimensionalities of X and Y directions. The maxi-
mum numbers of threads per SM are diÆerent on all the
three platforms, and therefore, the same occupancy on
the two platforms indicates diÆerent number of active
threads on an SM.
6.3 Comparison with Paulius Micikevicius’s
Results
Paulius Micikevicius presented results of some 3D
diÆerence computation on GPUs[22]. His work focused
on 3D stencils with big halo sizes. He also combined
the registers and shared memory in the implementa-
tion of circular queue but only explored the cases when
T = 1. In this subsection, we present experimental
results of 3D 19-point stencil and compare our results
with Micikevicius’s results in the hope of ﬁnding some-
thing interesting.
There are two diÆerences between our implementa-
tions. 1) In Paulius’s implementation, all threads in a
thread block compute a ﬁnal result (they only imple-
mented the T = 1 case), and data of the ghost zones
are loaded by threads near thread block boundaries. In
our implementation, however, only the inner (BLK X°
HALO£T)£(BLK Y °HALO£T) threads of a thread
block compute ﬁnal results. Thus, the ratio of redun-
dant computation of our implementation is substan-
tially higher than that of Paulius’s implementation. 2)
Paulius’s implementation involves register-to-register
copies when moving along the moving-dimension while
in our implementation, these copies are eliminated
through reuse pattern. Table 7 shows the performance
of the four versions of 3D 19-point stencil on Tesla
C2050: Paulius’s version, the shared memory version,
our search version and a modiﬁed implementation of the
search version, in which the inner (BLK X °HALO £
(T °1))£(BLK Y °HALO£(T °1)) in a thread block
compute ﬁnal results. The search version ends up the
same as the register version. When T = 1, our modiﬁed
search version has the same ratio of redundant com-
putation as Paulius’s version. Thread block sizes are
32£16 and T is 1 for all the four versions. We substitute
global memory reads in Paulius’s version with texture
loads so as to make the comparison fair, which improves
the performance by 2 GFLOPS. Both Paulius’s version
and our modiﬁed search version greatly outperform our
original search version due to their relatively lower ratio
of redundant computation. When T = 1, our modiﬁed
search version slightly outperforms Paulius’s, which we
believe is due to the elimination of register-to-register
copies. When T = 2, performance of both our origi-
nal and modiﬁed search versions drop greatly because
the reduction in memory tra±c does not compensate
the performance loss brought by the increased ratio of
redundant computation.
Table 7. Performance Results of 3D 19-Point
Stencil on C2050
Micikevicius’ Shared Search Search
Memory (modiﬁed)
T = 1 100.1 49.9 70.4 106.0
T = 2 17.1 37.4 73.2
Note: Performance comparison of four versions of the 3D 19-
point stencil (halo size = 3) on Tesla C2050. Thread block
size is 32 £ 16 for all four versions. Performance is measured
in GFLOPS.
We can observe great performance improvement of
our search version over the shared memory version. The
occupancies of the shared mem version are 0.667 and
0.333 respectively when T = 1 and T = 2 while the oc-
cupancies of the search version are 1.0 and 0.667, which
is major reason of the performance improvement. Ex-
cessive shared memory requirement causes the low oc-
cupancies of the shared memory version. Each thread
needs a shared memory space of 2£HALO £T +1 ele-
ments. When HALO is big, it requires large shared
memory space while some registers are not utilized.
The searcg version (in this case, also the register ver-
sion), however, is able to achieve a more balanced usage
of shared memory and registers.
6.4 Interaction with Register Allocation
The reuse pattern we proposed with minimum72 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
number of SVs requires that an SV be assigned the
same hardware register for all of its live ranges. Other-
wise, more hardware registers are needed than the min-
imum. With decuda[13], a third party disassembler, we
ﬁnd that the register allocation is not exactly what we
expected, and thus more registers are used. The regis-
ter allocation part of the tool-chain is not open-sourced,
nor do we have control over it. However, we believe it is
completely feasible for a compiler to incorporate such a
feature as remembering the diÆerent live ranges of the
same variable in the source code, and assigning them
the same hardware register during register allocation.
In such a way, the proposed reuse pattern can reach its
minimum register requirement.
7 Related Work
Occupancy and Shared Memory/Register Pressure
Related GPU Optimizations. A lot of research has tried
to increase occupancy to improve performance. [14]
used a method based on graph coloring to reuse shared
memory space. It focuses on making diÆerent user-
declared shared memory arrays use the same shared
memory partition. Since the data structure of circu-
lar queue is usually declared as one array, it will ﬁnd
no optimization opportunity for circular queue. [15]
proposed to apply loop ﬁssion to loops with multi-
ple independent statements and split kernel with if-
else statement into two kernels, where one kernel only
executes the if branch and the other only executes the
else branch, so as to reduce register pressure. The lat-
ter is also adopted by [16]. [17] manually spilled some
register variables into shared memory for FFT. Their
adjustment of register pressure is ad-hoc, and not as
systematic as ours. [18] and [19] separated a complex
kernel into several simple kernels, which brings higher
occupancy, due to less resource requirement of each sim-
pliﬁed kernel. These techniques are either application-
speciﬁc, or not suitable for stencils.
Temporal Blocking for Iterative Stencil Computa-
tions. A lot of research has addressed the issue of
data reuse across multiple time steps for iterative sten-
cil computations. In [1-2], Wonnacott proposed time
skewing to take advantage of temporal reuse by tiling
both space and time dimensions. Song and Li proposed
similar techniques in [4]. [3] proposed a cache-oblivious
algorithm to compute stencil computations for the same
purpose of temporal reuse. [6] proposed circular queue,
which uses a special data structure to hold temporary
values. [7] implemented circular queue using the local
store of the CBEA. Our work is based on circular queue,
and extends it to registers.
Stencil on GPU. [8] proposed a model to determine
the best ghost zone size for stencil computations on
GPU and a related transformation framework. Their
work suggested very small best T values for 2D and 3D
stencils, because the tile size of their implementation is
limited by on-chip storage. By adopting circular queue,
tile size can be enlarged, and thus, our experiment re-
sulted in bigger best T values. Their model cannot
be applied directly to circular queue, because when T
changes, storage requirement changes too, which may
result in a change in occupancy. Their model did not
take occupancy into consideration. [20] proposed to
loose the synchronization constraints for stencils on
GPU. Unlike our work, theirs is an alteration of al-
gorithm. [6, 21-22] all adopted circular queue method,
but they only explored the case when T = 1 for GPU.
Besides, [21] only used shared memory. Although [6, 12]
combined register ﬁle and shared memory, they did not
attempt to balance the usage of the two storages as we
do, and they both used 3£H £T locations for storage.
[23] proposed a new parallel algorithm for computing
SSOR, a diÆerent kind of stencils than we do. Besides,
their work was on the algorithmic level.
[6, 24] proposed an autotuning framework for sten-
cils on various platforms including GPU. Our tuning of
register and shared memory usage complements their
work.
[25] presented new way of tiling as well as ILP-level
tuning on both CPUs and GPUs. Our work shared
a lot of similarities, but focuses on diÆerent aspects.
Our work focuses on economizing and balancing on-
chip storages. Such techniques as using only 2T +1
SVs, communication space reuse and search for the best
partition of data were not used in their work. Similarly,
we did not tune ILP as they did. Apart from ILP, their
implementation is similar to our reg versions, which our
search versions outperform in some cases.
Optimization Space Pruning. Search is a powerful
method to ﬁnd the optimization combination, sequence
and parameters of the best performing version. How-
ever, exploring the full optimization space incurs pro-
hibitive overhead. Lots of research has been done to
prune the search space. On traditional platforms, some
use analytical model to eliminate unnecessary candi-
dates, such as [26-29]. These works focus on a collection
of traditional compiler optimizations, and use a model
to reduce search candidates. Some adopt method from
artiﬁcial intelligence to guide search process, such as
[30-33]. On a massively parallel architecture such as
GPU, [34] proposed utilization and e±ciency as met-
rics and combined them to prune optimization space.
Their method is general and requires that global mem-
ory bandwidth is not the bottleneck for performance.
In [35], a new compressed format is proposed for sparse
matrix on GPU, and search is needed to determine cer-
tain parameters of the format. They proposed anYang Yang et al.: A Hybrid Circular Queue Method for Stencils on GPUs 73
analytical model speciﬁc to SpMV to eliminate search
candidates. Our iterative search algorithm is designed
speciﬁcally for the circular queue method on GPU. It
tries to ﬁnd a balance between proposed technique of
register/shared memory partition and communication
space reusing. It prunes the search space by lever-
aging knowledge speciﬁc to GPUs and our proposed
techniques.
Other Related Work. [36] proposed modulo variable
expansion (MVE), which assigns multiple registers to
a variable in a loop, to eliminate anti- and output-
dependence in registers across successive iterations for
software pipelining. Both our works used unrolling to
expose a reuse pattern. However, MVE was proposed to
eliminate anti- and output-dependence. Our work, on
the other hand, leverages this reuse pattern to minimize
storage requirement. Besides, MVE is an instruction-
level compiler optimization, while our work is a source
code level optimization that is well aware of the data
ﬂow of the algorithm and speciﬁc to circular queue. [37]
proposed scalar replacement to keep some subscripted
variables in registers to expose reuse. Our work tries
to move data originally in lower level of memory into
registers. However, our work is fundamentally diÆer-
ent in that scalar replacement does not change where a
variable is originally stored. It just keeps a copy in a
register. Our work is more of a data partition problem
that may change the storage location of a variable.
8 Conclusions and Future Work
In this paper, we addressed the issue of balancing
the usage of registers and shared-memory for the cir-
cular queue method on GPUs. We proposed a reuse
pattern that enables the usage of registers in the cir-
cular queue method. Also, we proposed a framework
that can generate circular queue code for stencils on
GPUs, and can search for the best data partition. The
four diÆerent kinds of stencils we studied showed up
to 2.93X speedup over shared-memory based versions.
For future work, we will 1) develop a model to guide
the search process to reduce search time, and 2) extend
our discussion of on-chip storage allocation on GPUs
to a broader range of applications, and develop a more
general framework to help programmers optimize their
code.
References
[1] Wonnacott D. Achieving scalable locality with time skewing.
Int. J. Parallel Program, 2002, 30(3): 181-221.
[2] Mccalpin J, Wonnacott D. Time skewing: A value-based ap-
proach to optimizing for memory locality. Technical Report
DCS-TR-379, Department of Computer Science, Rugers Uni-
versity. 1999.
[3] Strzodka R, Shaheen M, Pajak D et al. Cache oblivious
parallelograms in iterative stencil computations. In Proc.
the 24th ACM Int. Conf. Supercomputing, Tsukuba, Japan,
Jun. 1-4, 2010, pp.49-59.
[4] Song Y, Li Z. New tiling techniques to improve cache temporal
locality. In Proc. ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, Atlanta, USA,
May 1-4, 1999, pp.215-228.
[5] Jin G, Mellor-Crummey J, Fowler R. Increasing tempo-
ral locality with skewing and recursive blocking. In Proc.
ACM/IEEE Conference on Supercomputing, Denver, USA,
Nov. 10-16, 2001, pp.43-43.
[6] Datta K, Murphy M, Volkov V et al. Stencil computation op-
timization and auto-tuning on state-of-the-art multicore ar-
chitectures. In Proc. ACM/IEEE Conference on Supercom-
puting, Austin, USA, Nov.15-21, 2008, Article 4.
[7] Williams S, Shalf J, Oliker L et al. Scientiﬁc computing Ker-
nels on the cell processor. Int. J. Parallel Program, 2007,
35(3): 263-298.
[8] Meng J, Skadron K. Performance modeling and automatic
ghost zone optimization for iterative stencil loops on GPUs.
In Proc. the 23rd International Conference on Supercomput-
ing, Yorktown Heights, USA, Jun. 8-12, 2009, pp.256-265.
[9] NVIDIA. NVIDIA CUDA programming guide 3.0, http://de-
veloper.download.nvidia.com/compute/cuda/3 0/toolkit/do-
cs/NVIDIA CUDA ProgrammingGuide-pdf, 2010.
[10] NVIDIA Corp. CUDA Occupancy Calculator, 2010.
[11] Hong S, Kim H. An analytical model for a GPU architecture
with memory-level and thread-level parallelism awareness. In
Proc. the 36th Annual Int. Symp. Computer Architecture,
Austin, USA, Jun. 20-24, 2009, pp.152-163.
[12] Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu W
W. An adaptive performance modeling tool for GPU archi-
tectures. In Proc. the 15th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, Bangalore,
India, Jan. 9-14, 2010, pp.105-114.
[13] van der Laan W J. Decuda. http://wiki.github.com/laanwj/
decuda/, Sept., 2010.
[14] Moazeni M, Bui A, Sarrafzadeh M. A memory optimization
technique for software-managed scratchpad memory in GPUs.
In Proc. the 7th IEEE Symposium on Application Speciﬁc
Processors, San Francisco, USA, Jul. 27-28, 2009, pp.43-49.
[15] Carrillo S, Siegel J, Li X. A control-structure splitting opti-
mization for GPGPU. In Proc. the 6th ACM Conf. Comput-
ing Frontiers, Ischia, Italy, May 18-20, 2009, pp.147-150.
[16] Park S J, Ross J, Shires D, Richie D, Henz B, Nguyen L. Hy-
brid core acceleration of UWB SIRE radar signal processing.
IEEE Trans. Parallel Distrib. Syst, 2011, 22(1): 46-57.
[17] Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFT
library on CUDA GPU. In Proc. the 24th ACM Int. Conf.
Supercomputing, Tsukuba, Japan, Jun. 1-4, 2010, pp.305-314.
[18] Goorts P, Rogmans S, Bekaert P. Optimal data distribu-
tion for versatile ﬁnite impulse response ﬁltering on next-
generation graphics hardware using CUDA. In Proc. the 15th
International Conference on Parallel and Distributed Sys-
tems, Shenzhen, China, Dec. 9-11, 2009, pp.300-307.
[19] Bailey P, Myre J, Walsh S D C, Lilja D J, Saar M O. Accele-
rating lattice Boltzmann ﬂuid ﬂow simulations using graphics
processors. In Proc. International Conference on Parallel
Processing, Vienna, Austria, Sep. 22-25, 2009, pp.550-557.
[20] Venkatasubramanian S, Vuduc R W. Tuned and wildly asyn-
chronous stencil kernels for hybrid CPU/GPU systems. In
Proc. the 23rd International Conference on Supercomputing,
Yorktown Heights, USA, Jun. 8-12, 2009, pp.244-255.
[21] Christen M, Schenk O, Neufeld E et al. Parallel data-locality
aware stencil computations on modern micro-architectures.
In Proc. IEEE Int. Symp. Parallel & Distributed Process-
ing, Rome, Italy, May 23-29, 2009, pp.1-10.74 J. Comput. Sci. & Technol., Jan. 2012, Vol.27, No.1
[22] Micikevicius P. 3D ﬁnite diÆerence computation on GPUs us-
ing CUDA. In Proc. the 2nd Workshop on General Purpose
Processing on Graphics Processing Units, Washington, USA,
Mar. 8, 2009, pp.79-84.
[23] Di P, Wan Q, Zhang X et al. Toward harnessing DOACROSS
parallelism for multi-GPGPUs. In Proc. the 39th Int. Conf.
Parallel Processing, San Diego, USA, Sep. 13-16, 2010, pp.40-
50.
[24] Christen M, Schenk O, Burkhart H. Patus: A code genera-
tion and autotuning framework for parallel iterative stencil
computations on modern microarchitectures. In Proc. IEEE
International Parallel & Distributed Processing Symposium,
Anchorage, USA, May 16-20, 2011, pp.676-687.
[25] Nguyen A, Satish N, Chhugani J, Kim C, Dubey P. 3.5-D
blocking optimization for stencil computations on modern
CPUs and GPUs. In Proc. ACM/IEEE Int. Conf. for High
Performance Computing, Networking, Storage and Analysis,
New Orleans, USA, Nov. 13-19, 2010, pp.1-13.
[26] Qasem A, Kennedy K. Proﬁtable loop fusion and tiling us-
ing model-driven empirical search. In Proc. the 20th Annual
International Conference on Supercomputing, Cairns, Aus-
tralia, Jun. 28-Jul. 1, 2006, pp.249-258.
[27] Knijnenburg P M W, Kisuki T, Gallivan K et al. The eÆect of
cache models on iterative compilation for combined tiling and
unrolling: Research articles. Concurrency and Computation:
Practice & Experience, 2004, 16(2-3): 247-270.
[28] Yotov K, Li X, Ren G et al. A comparison of empirical and
model-driven optimization. In Proc. the ACM SIGPLAN
2003 Conference on Programming Language Design and Im-
plementation, San Diego, USA, Jun. 8-11, 2003, pp.63-76.
[29] Chen C, Chame J, Hall M. Combining models and guided
empirical search to optimize for multiple levels of the mem-
ory hierarchy. In Proc. Int. Symp. Code Generation and
Optimization, San Jose, USA, Mar. 20-23, 2005, pp.111-122.
[30] Epshteyn A, Garzaran M, DeJong G et al. Analytical models
and empirical search: A hybrid approach to code optimiza-
tion. In Proc. the 18th International Workshop on Lan-
guages and Compilers for Parallel Computing, Hawthorne,
USA, Oct. 20-22, 2005, pp.259-273.
[31] Agakov F, Bonilla E, Cavazos J et al. Using machine learning
to focus iterative optimization. In Proc. Int. Symp. Code
Generation and Optimization, New York, USA, Mar. 26-29,
2006, pp.295-305.
[32] Almagor L, Cooper K D, Grosul A, Harvey T J, Reeves S W,
Subramanian D, Torczon L, Waterman T. Finding eÆective
compilation sequences. In Proc. ACM SIGPLAN/SIGBED
Conference on Languages, Compilers, and Tools for Embed-
ded Systems, Washington, USA, Jun. 11-13, 2004, pp.231-239.
[33] Vaswani K, Thazhuthaveetil M J, Srikant Y N et al. Mi-
croarchitecture sensitive empirical models for compiler opti-
mizations. In Proc. Int, Symp, Code Generation and Opti-
mization, San Jose, USA, Mar. 11-14, 2007, pp.131-143.
[34] Ryoo S, Rodrigues C I, Stone S S et al. Program optimiza-
tion space pruning for a multithreaded gpu. In Proc. the
6th Annual IEEE/ACM Int. Symp. Code Generation and
Optimization, Boston, USA, Apr. 6-9, 2008, pp.195-204.
[35] Choi J W, Singh A, Vuduc R W. Model-driven autotuning of
sparse matrix-vector multiply on GPUs. In Proc. the 15th
ACM SIGPLAN Symp. Principles and Practice of Parallel
Programming, Bangalore, India, Jan. 9-14, 2010, pp.115-126.
[36] Lam M. Software pipelining: An eÆective scheduling tech-
nique for VLIW machines. In Proc. ACM SIGPLAN Confe-
rence on Programming Language Design and Implementa-
tion, Atlanta, USA, Jun. 20-24, 1988, pp.318-328.
[37] Callahan D, Carr S, Kennedy K. Improving register allocation
for subscripted variables. In Proc. ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation,
White Plains, USA, Jun. 20-22, 1990, pp.53-65.
.