Lowering IrGL to CUDA by Pai, Sreepathi & Pingali, Keshav
Lowering IrGL to CUDA
Sreepathi Pai, Keshav Pingali
The University of Texas at Austin
sreepai@ices.utexas.edu, pingali@cs.utexas.edu
Abstract
The IrGL intermediate representation is an explicitly parallel
representation for irregular programs that targets GPUs. In
this report, we describe IrGL constructs, examples of their
use and how IrGL is compiled to CUDA by the Galois GPU
compiler.
Keywords Irregular applications, amorphous data-parallelism,
GPUs, compilers, CUDA
1. Compiling IrGL
Our implementation of the IrGL compiler is written in
Python and operates on an AST of IrGL constructs, which is
defined in ASDL [10] in Listing 6. Apart from the constructs
in Table 1, this AST also contains the CBlock construct for
the C++ code used to write the operator. Our compiler parses
this code to check well-formedness and to generate read/write
sets. Because the parser is limited to C99 syntax, the compiler
also accepts annotations that describe read/write sets. The
compiler generates CUDA output, targeting Kepler and Maxwell
GPUs. In the remainder of this report, we describe how the
IrGL AST is lowered to CUDA. We assume a deep familiarity
with CUDA.
We use the following typographical conventions in this
document: terminals in the AST are set in a sans-serif font,
while attributes on AST nodes and values in code are set in
a typewriter font.
2. Overall Structure of the AST
The IrGL AST is rooted at the Module node. The only sup-
ported children of Module are Kernel and global-level dec-
larations. The Names node provides support for importing
foreign names (such as #define constants) from C into the
local IrGL symbol table. Many constructs in the AST, such
as While, If, etc. only serve to expose control flow to the IrGL
compiler. C code blocks, represented by CBlocks, must ob-
serve single-entry, single-exit control-flow behaviour, akin
to basic blocks, but are allowed to call functions.
Not shown in the AST definition are compiler-specific
annotations that can be applied to certain nodes. For example,
the CUDA __launch_bounds__ annotation, conventionally
used to indicate register-usage restrictions to the CUDA
compiler, is also supported by our IrGL compiler (which
passes it through, but see Section 4.3). Other significant
annotations are mentioned below when the nodes they annotate
are discussed.
3. Compiling Kernels
A Kernel node designates a plain IrGL kernel, a host kernel
(host=true) or a device kernel (device=true). Host ker-
nels execute on the CPU. Device kernels correspond directly
to CUDA device kernels.
Only a host Kernel may use the IrGL orchestration con-
structs – Invoke, Iterate and Pipe. Similarly, only a non-
host, non-device Kernel can use the IrGL kernel constructs –
Atomic, Exclusive, Retry, Respawn, ReduceAndReturn and
ForAll. Host kernels may use ForAll, but it is treated as For
by our current compiler.
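The placement rules above can be sketched as a small check over a kernel's constructs. This is a minimal illustration in Python, not the compiler's actual code: the `Kernel` dataclass, its flat `constructs` list and the helpers `kernel_kind` and `illegal_constructs` are hypothetical simplifications of the tree in Listing 6.

```python
from dataclasses import dataclass, field

# Construct names are taken from the text; grouping them into sets is illustrative.
ORCHESTRATION = {"Invoke", "Iterate", "Pipe"}
KERNEL_ONLY = {"Atomic", "Exclusive", "Retry", "Respawn", "ReduceAndReturn"}


@dataclass
class Kernel:
    """Hypothetical, flattened stand-in for the Kernel node of Listing 6."""
    name: str
    host: bool = False
    device: bool = False
    constructs: list = field(default_factory=list)  # names of constructs used


def kernel_kind(k):
    """Classify a kernel as 'host', 'device' or plain 'irgl'."""
    if k.host:
        return "host"
    if k.device:
        return "device"
    return "irgl"


def illegal_constructs(k):
    """Return the constructs that the stated rules forbid in kernel k."""
    kind = kernel_kind(k)
    bad = []
    for c in k.constructs:
        if c in ORCHESTRATION and kind != "host":
            bad.append(c)   # orchestration is reserved for host kernels
        elif c in KERNEL_ONLY and kind != "irgl":
            bad.append(c)   # kernel constructs need a plain IrGL kernel
        elif c == "ForAll" and kind == "device":
            bad.append(c)   # host kernels may use ForAll (treated as For)
    return bad
```

For instance, a host kernel that uses Pipe, Invoke and ForAll passes this check, while a plain IrGL kernel that uses Invoke does not.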
The compiler primarily uses device kernels when im-
plementing the iteration outlining optimization. While user-
provided device kernels are supported, they are treated as
opaque and are largely ignored by our compiler.
The following subsections discuss how we compile the
ForAll, Atomic, Exclusive and ReduceAndReturn kernel
constructs. We defer discussion of Retry and Respawn to
Section 4.
3.1 Compiling ForAll
The iterations of the outermost ForAll in an IrGL kernel are
mapped to CUDA threads. Each CUDA thread usually ex-
ecutes multiple iterations. Section 4.3 describes the process
the compiler uses to determine the number of CUDA threads
to use for a kernel.
The object stored in iterator represents a random-
access iterator and the order of iteration execution is not
defined. By default, consecutive iterations of the ForAll are
mapped to consecutive CUDA threads. However, other mappings
are possible and these mappings are represented by
compiler-specific annotations. For example, when implementing
the Retry Backoff optimization, a mapping that distributes
contiguous blocks of iterations to a single thread is
used to reduce conflicts.
1 2018/11/8
arXiv:1607.05707v1 [cs.PL] 19 Jul 2016

Construct
    Semantics

Kernel Constructs

ForAll (iterator) { stmts }
    Traverse iterator in parallel executing stmts.
ReduceAndReturn (bool-expr)
    Reduce values of bool-expr and return as kernel return value.
    The actual reduction, one of Any or All, is specified at
    kernel invocation.
Atomic (lock-expr) { locked-stmts } [ Else { failed-stmts } ]
    Acquire lock-expr and execute locked-stmts. If an Else block
    is provided, execute failed-stmts if lock-expr was not
    acquired. Implements divergence-free blocking locks [7].
Exclusive (object, elements) { locked-stmts } [ Else { failed-stmts } ]
    Try once to acquire locks for elements in object and execute
    locked-stmts on succeeding. On failure, execute failed-stmts
    if provided, otherwise execute the next statement. One thread
    is guaranteed to execute locked-stmts on conflicts.
SyncRunningThreads
    Compiler-supported safe implementation of GPU-wide global
    barriers [11].
Retry item (or Respawn item)
    Push item into a retry worklist and re-execute the kernel.
    Use of Retry indicates a runtime conflict and triggers
    conflict management in the runtime (e.g. serial execution).

Orchestration Constructs

[Any | All (] Invoke kernel(args) [)]
    Invoke kernel, passing current worklists if kernel uses them.
Iterate [While | Until Any | All] kernel(args) [Initial (init-iter-expr)]
    Iteratively invoke kernel until the termination condition is
    met or the worklist is depleted. If invoked standalone,
    establish fresh worklists using init-iter-expr for
    initialization, else pass current worklists.
Pipe [Once] { stmts }
    Establish worklists to be used by stmts. Without Once, repeat
    stmts until worklists are empty. Nested Pipes will not
    establish worklists.

Table 1. Summary of IrGL statements. [] indicates optional parts,
| indicates options. See Listing 6 for a formal definition of
the abstract syntax tree.
For is used to represent a loop whose iterations cannot be
executed in parallel. When the amount of parallelism can
only be discovered at runtime, as in most irregular graph
algorithms, ForAll is used with additional synchronization
inside the body of the loop.
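The two mappings of ForAll iterations onto CUDA threads mentioned in this section can be sketched as follows. This is a minimal Python model under stated assumptions, not generated code: the function names `cyclic_mapping` and `blocked_mapping` are hypothetical, and threads are modelled simply as integer ids.

```python
def cyclic_mapping(num_iterations, num_threads):
    """Default mapping: consecutive iterations go to consecutive threads."""
    return {t: list(range(t, num_iterations, num_threads))
            for t in range(num_threads)}


def blocked_mapping(num_iterations, num_threads):
    """Alternative mapping: each thread gets one contiguous block."""
    chunk = (num_iterations + num_threads - 1) // num_threads  # ceiling division
    return {t: list(range(t * chunk, min((t + 1) * chunk, num_iterations)))
            for t in range(num_threads)}
```

With 8 iterations and 3 threads, the cyclic mapping gives thread 0 the iterations 0, 3 and 6, whereas the blocked mapping gives it 0, 1 and 2.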
3.2 Compiling Atomic and Exclusive
IrGL provides two statements, Atomic and Exclusive, that
allow iterations of a ForAll loop to implement mutual
exclusion. Both statements implement functionality that is
hard to get right [1, 7]. Atomic and Exclusive currently use
software implementations but can be recompiled to use
proposed hardware primitives [2, 7] if such primitives become
available.
3.2.1 Atomic
Atomic implements an atomic section, a block of code that
is executed under the control of a single lock. Atomic can be
nested. Two forms of Atomic are supported. The default form
is blocking and waits for the lock to be acquired. The other
form, indicated by a non-empty fail_stmts (i.e. Else), is
non-blocking and executes the statements in fail_stmts
when the lock is not acquired.
Atomic provides a safe alternative to spinlocks, since
spinlocks can deadlock on GPUs due to warp divergence.
Internally, Atomic uses atomicCAS to set the state of a lock
variable. If the atomicCAS fails to acquire the lock and the
Atomic is blocking, a divergence-safe loop similar to that
described in [5, 7] is generated by the compiler to reattempt
locking.
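One commonly described shape for such a divergence-safe loop keeps the critical section and the lock release inside the same branch as the successful atomicCAS, so that the lane holding the lock never waits for the other lanes of its warp. The following Python model illustrates that shape under stated assumptions; it is not the code our compiler emits. The helpers `atomic_cas` and `run_warp` are hypothetical, and a warp is modelled as lanes that execute each step together.

```python
def atomic_cas(mem, addr, expected, new):
    """Model of CUDA atomicCAS: returns the old value, swaps on a match."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
    return old


def run_warp(num_lanes):
    """Every lane increments a shared counter under one lock.

    Returns (final counter value, number of loop rounds executed).
    """
    mem = {"lock": 0, "count": 0}
    done = [False] * num_lanes
    rounds = 0
    while not all(done):  # the warp keeps looping while any lane is unfinished
        rounds += 1
        # Step 1: all unfinished lanes execute the atomicCAS together;
        # at most one of them sees the lock free.
        acquired = [not done[lane] and atomic_cas(mem, "lock", 0, 1) == 0
                    for lane in range(num_lanes)]
        # Step 2: the winning lane runs the critical section and releases
        # the lock before the warp reconverges.
        for lane in range(num_lanes):
            if acquired[lane]:
                mem["count"] += 1
                mem["lock"] = 0
                done[lane] = True
    return mem["count"], rounds
```

In this model one lane completes per round, so a warp of four lanes finishes after four rounds with the counter at four.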
We illustrate the use of Atomic using Borůvka’s algorithm
for minimum spanning tree. Borůvka’s algorithm begins
by treating each node of the input graph as a component.
Then, it finds the minimum cross-component edge out
of each component. These edges are added to the minimum
spanning tree, and the components they connect are merged
(or unified). The procedure then repeats on these merged
components until only one component remains or no
cross-component edges can be found (i.e. in a disconnected graph).
Particularly challenging is the implementation of finding
the minimum edge out of a component. Since a component
can consist of many nodes, recording the minimum edge re-
quires at least two updates that must be carried out atomi-
cally – the minimum weight and the edge itself. No CUDA
primitive suffices to perform multiple updates atomically.
Previous implementations, notably that of Vineet et al. [9],
store the weight and edge identifier as bitfields of a 32-bit
integer, which would allow use of a single atomicMin at the
cost of severely limiting the generality of the resulting code
and its applicability to input graphs.
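The bitfield approach and its limitation can be illustrated with a short Python sketch. The 16/16 split of the 32-bit word is an assumption made for illustration (the field widths used in [9] are not given here), and `pack`, `unpack` and `min_edge` are hypothetical names; `min_edge` stands in for repeated atomicMin operations on a single word.

```python
WEIGHT_BITS = 16  # assumed split; the actual field widths are not stated
EDGE_BITS = 16


def pack(weight, edge):
    """Pack weight (high bits) and edge id (low bits) into one 32-bit key."""
    if not 0 <= weight < (1 << WEIGHT_BITS):
        raise ValueError("weight does not fit in the weight bitfield")
    if not 0 <= edge < (1 << EDGE_BITS):
        raise ValueError("edge id does not fit in the edge bitfield")
    return (weight << EDGE_BITS) | edge


def unpack(key):
    """Recover (weight, edge) from a packed key."""
    return key >> EDGE_BITS, key & ((1 << EDGE_BITS) - 1)


def min_edge(candidates):
    """Stand-in for repeated atomicMin operations on one packed word."""
    best = (1 << (WEIGHT_BITS + EDGE_BITS)) - 1  # all ones: no edge seen yet
    for weight, edge in candidates:
        best = min(best, pack(weight, edge))
    return unpack(best)
```

Because the weight occupies the high bits, the minimum packed key carries the minimum weight. The `ValueError` branches show the cost: any weight or edge identifier that exceeds its field cannot be represented.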
Listing 1 describes the find-min-edge kernel in the
IrGL implementation that uses an Atomic to update the com-
ponent’s data in an atomic context. Note that this instance of
Atomic is a blocking lock (i.e. no Else clause), so it can use
    ForAll (n_idx In wl) {
      n = wl.pop(n_idx);
      n_component = components[n];
      minwt = INF;

      for (e In edges) {
        // find minimum cross-component edge
        // out of node n; store weight in minwt,
        // and edge id in minedge
      }

      Atomic (component_locks[n_component]) {
        if (component_minwt[n_component] > minwt) {
          component_minwt[n_component] = minwt
          component_minedge[n_component] = minedge
          // other updates
        }
      }

      if (node has cross-component edge) {
        wl.push(n)
      }
    }
Listing 1. Find-Minimum-Cross-Component-Edge kernel
of Borůvka’s MST algorithm. The worklist initially contains
all nodes.
a ticket lock from our runtime by simply setting a compiler
flag.
3.3 Exclusive
Exclusive encloses a block of code that must acquire a large
number of locks. Internally, threads are assigned priorities
so that at least one thread always acquires all the locks it
needs. Exclusive never blocks and may not be nested. The
statements in fail_stmts are executed if the locks were not
acquired.
The set of locks to be acquired is obtained from an array
of lock indices. In the simplest form, objs_lock indicates
an Array which contains the lock indices to be acquired.
In the ArrayIterator form, objs_lock specifies an array
iterator that yields the indices to be locked.
Consider the use of Exclusive in Delaunay Mesh Refine-
ment (DMR). For DMR, the key kernel is refine which,
when presented with a worklist of bad triangles, fixes each
one of them in parallel. Each thread must have exclusive ac-
cess to the triangles in the cavity of its bad triangle, as well
as to the triangles that form the boundary of the cavity. The
Exclusive construct is key to simplifying the implementation
of DMR. Listing 2 illustrates how the triangles in the cavity
are passed as input to Exclusive which then permits access
to triangles in the cavity to one thread.
The Exclusive statement is implemented using a three-phase
algorithm, with each phase separated by SyncRunningThreads.
Our implementation is similar to the
race–prioritycheck–check scheme described in [4]. In the first
phase, Exclusive claims the locks supplied for the execut-
ing thread. If multiple threads claim the same lock, only one
    ForAll (bt_idx In wl) {
      bad_triangle = wl.pop(bt_idx);

      build_cavity(bad_triangle, &cavity_size, &cavity);

      Exclusive (mesh, cavity_size, cavity) {
        delete_cavity(...)
        // create new triangles
      }

      SyncRunningThreads();
    }
Listing 2. Simplified Refine kernel in Delaunay Mesh
Refinement.
claim is allowed. In the second phase, each thread checks
to see if its claim for every lock stands. If another thread
was granted the claim, then the threads that lost the claim at-
tempt to win priority over the claim. In the third phase, each
thread checks to see if it still retains the claims it sought. Any
threads that do so proceed to execute the statements within
the Exclusive, while those that do not move to the next state-
ment or execute the Else clause if one is supplied. Our use
of SyncRunningThreads for implementing Exclusive places
restrictions on its use. An Exclusive must be placed in a loca-
tion that will be uniformly executed by all threads and thus
cannot be nested. In practice, this means that Exclusive may
only be directly placed underneath the outermost ForAll.
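The three phases can be modelled sequentially to show why at least one thread always succeeds. The following Python sketch rests on stated assumptions and is not the generated CUDA: priorities are modelled by thread id (a lower id wins), the barrier between phases is implied by running each phase to completion, and the function name `exclusive` and its `requests` argument are hypothetical.

```python
def exclusive(requests):
    """Decide which threads may run locked-stmts.

    requests maps a thread id to the list of lock indices it needs.
    Returns the sorted ids of the threads that hold all their locks.
    """
    owner = {}

    # Phase 1 (race): every thread writes its id; one arbitrary claim survives.
    for tid, locks in requests.items():
        for lock in locks:
            owner[lock] = tid
    # SyncRunningThreads would separate the phases here.

    # Phase 2 (priority check): a thread that lost a claim takes it over if
    # it has higher priority. Assumption: a lower thread id is a higher priority.
    for tid, locks in requests.items():
        for lock in locks:
            if owner[lock] != tid and tid < owner[lock]:
                owner[lock] = tid
    # SyncRunningThreads would separate the phases here.

    # Phase 3 (check): a thread proceeds only if it still holds every claim.
    return sorted(tid for tid, locks in requests.items()
                  if all(owner[lock] == tid for lock in locks))
```

After the second phase every lock belongs to the highest-priority thread that requested it, so the thread with the highest priority overall is guaranteed to hold all of its locks.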
3.4 Compiling SyncRunningThreads
Like CPUs, GPUs can create and execute many more threads
than can run concurrently on hardware. However, unlike
CPUs, a GPU thread usually runs to completion and cannot
be preempted. Thus, the notion of a global barrier that syn-
chronizes all threads does not readily translate to the GPU. If
all threads being synchronized are not running concurrently,
the global barrier will deadlock.
Nevertheless, barrier-like functionality is useful, even if
it is limited to only those threads that are running
concurrently. Such “device-wide barriers” have been described
previously [11] and are supported in IrGL through the
SyncRunningThreads statement, though our implementation is
derived from code used in [3]. Safe use of SyncRunningThreads
requires that a GPU kernel never be launched with
more physical threads than can run concurrently. This
number can vary from GPU to GPU and also depends on the size
of the CUDA thread block.
CUDA 6.5 introduced the occupancy API, which allows this
number to be calculated at runtime for a kernel for each GPU
present in the system. When our compiler generates code
to launch an IrGL kernel that uses SyncRunningThreads, it
limits the number of threads using this occupancy API to
ensure deadlock-free execution. We note that this method is
not portable to other devices [8] and, even on NVIDIA GPUs,
assumes that all thread blocks of the kernel will eventually
execute concurrently.
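The arithmetic behind this limit is simple and can be sketched in Python. In generated CUDA code the per-multiprocessor residency would come from the occupancy API (for example `cudaOccupancyMaxActiveBlocksPerMultiprocessor`); here it is just an input, and the function name `safe_grid_size` and the numbers in the usage note are hypothetical.

```python
def safe_grid_size(blocks_per_sm, num_sms, blocks_wanted):
    """Largest grid, up to blocks_wanted, that can be fully resident at once."""
    if blocks_per_sm < 1:
        raise ValueError("kernel cannot be resident with this block size")
    max_resident = blocks_per_sm * num_sms  # blocks that can run concurrently
    return min(blocks_wanted, max_resident)
```

For a kernel with a residency of 4 blocks per multiprocessor on a GPU with 15 multiprocessors, a request for 1000 blocks would be capped at 60.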
    Kernel BFS(graph, LEVEL) {
      ForAll (wl_idx In wl) {
        n = wl.pop(wl_idx)
        ForAll (e In graph.edges(n)) {
          if (e.dst.level == INF) {
            e.dst.level = LEVEL;
            wl.push(e.dst.id);
          }
        }
      }
    }

    LEVEL=0
    Iterate BFS(graph, LEVEL) Initial [src] {
      LEVEL++;
    };
Listing 3. Level-by-level BFS kernel using a worklist
3.5 Compiling ReduceAndReturn
The ReduceAndReturn statement is used to construct a re-
turn value for an IrGL kernel using a reduction. The actual
reduction is specified when invoking the kernel using Invoke
or Iterate. Our compiler currently supports the Any and All
reductions, and therefore the value to be reduced is a boolean
expression stored in value.
Any returns true if any evaluation of value yields true. All
returns true only if every evaluation of value yields true.
ReduceAndReturn terminates execution of the kernel, except
when invoked inside a ForAll, where it terminates only
the current iteration.
The simplest compilation of ReduceAndReturn uses
global memory storage and CUDA atomic instructions to
implement these reductions. However, this can be made
cheaper by re-using the cooperative conversion optimization
machinery. In effect, each CUDA thread partially aggregates
the ReduceAndReturn values, with atomics being used only
to aggregate the values of each CUDA thread block. Un-
fortunately, since CUDA does not support virtual functions,
we must generate multiple variants for each kernel (e.g. by
using C++ templates) for each reduction used in the calling
Invoke or Iterate.
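The two-level aggregation described above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the generated code: the function name `reduce_and_return` is hypothetical, each thread is assumed to contribute one already-aggregated boolean, and the single global update per block stands in for a CUDA atomic.

```python
def reduce_and_return(thread_values, block_size, aggr):
    """Reduce one boolean per CUDA thread; aggr is 'Any' or 'All'."""
    combine = {"Any": any, "All": all}[aggr]
    # Identity element: False for Any, True for All.
    result = (aggr == "All")
    for start in range(0, len(thread_values), block_size):
        # Aggregation within a thread block (no global atomics needed here).
        block_result = combine(thread_values[start:start + block_size])
        # One global update per thread block (an atomic in real CUDA code).
        result = combine([result, block_result])
    return result
```

Selecting `combine` at run time is convenient in Python; the generated CUDA code instead needs one kernel variant per reduction, as noted above.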
4. Compiling Orchestration
The orchestration constructs Invoke, Iterate and Pipe are
used to invoke IrGL kernels. Since data-driven IrGL kernels
often use worklists, Iterate and Pipe also set up worklists.
They can also execute a series of IrGL kernels until the
worklist is empty, since iterative execution is a common pattern.
IrGL provides a default worklist object named WL that
exposes push, pop and an iterator to each kernel. Therefore,
pop and push are encoded in the AST as MethodInvocation
on this object and do not appear as first-class AST nodes.
4.1 Worklist Mechanics
Kernels use worklists to manage work and as a means of
communication of work between kernels. IrGL provides a
default worklist to every kernel. A kernel may push values
    Pipe Once {
      Invoke identify_bad_triangles(mesh);
      printf("initial bad: %

      // looping Pipe
      Pipe {
        Invoke refine(mesh);
        ...  // other mesh maintenance code

        // only among newly created triangles
        Invoke incremental_id_bad_triangles(mesh);
      }

      // sanity check
      Invoke identify_bad_triangles(mesh);
      printf("final bad: %
    }
Listing 4. Example of Pipe in DMR
    Pipe Once {
      Invoke A();

      if (cond) {
        Invoke B();
      } else {
        Invoke C();
      }
    }
Listing 5. Example of Dynamic Piping
onto a worklist to enqueue work, and may pop values off
the worklist to perform work, usually using a ForAll. IrGL
worklists exhibit bulk-synchronous behaviour – work items
pushed during an invocation cannot be popped in the same
invocation.
Worklists are created and managed by Iterate and Pipe
constructs. Iterate is best illustrated by the BFS code in
Listing 3. It creates a worklist, initially populated with src
and invokes the BFS kernel repeatedly until the worklist is
depleted. After every invocation, code in stmts is executed.
In this example, the LEVEL variable is incremented. The
automatically created worklist is not available beyond the
execution of the Iterate statement.
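The bulk-synchronous behaviour of the worklist and the role of Iterate can be modelled sequentially in Python. This is a sketch of the semantics, not of the generated CUDA. The names `bfs_kernel` and `iterate_bfs` and the adjacency-list `graph` are hypothetical, and the sketch assumes that the source node is given level 0 up front and that LEVEL starts at 1, since Listing 3 leaves the initialization of the source implicit.

```python
INF = float("inf")


def bfs_kernel(graph, level, dist, wl_in, wl_out):
    """One invocation of the BFS kernel, run sequentially."""
    for n in wl_in:               # outer ForAll over the worklist
        for dst in graph[n]:      # inner ForAll over the edges of n
            if dist[dst] == INF:
                dist[dst] = level
                wl_out.append(dst)  # visible only to the next invocation


def iterate_bfs(graph, src):
    """Model of Iterate: invoke the kernel until the worklist is depleted."""
    dist = {n: INF for n in graph}
    dist[src] = 0                 # assumption: source level preset to 0
    level = 1                     # assumption: LEVEL starts at 1
    wl_in = [src]                 # worklist created and populated by Iterate
    while wl_in:
        wl_out = []
        bfs_kernel(graph, level, dist, wl_in, wl_out)
        wl_in = wl_out            # pushed items become the next input
        level += 1                # stmts executed after every invocation
    return dist
```

Pushes go to a separate output list, so an item pushed during one invocation is never popped in that same invocation.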
The Pipe statement establishes a shared worklist for the
Iterate, Invoke and Pipe statements within it. A Pipe may
execute once or loop until the worklist is empty. Inside a
Pipe, the values pushed by an invocation of a kernel are
forwarded to the next kernel in the pipe. Listing 4 illustrates
the use of Pipe in the main loop of the DMR benchmark.
After receiving an initial set of bad triangles, the inner Pipe
iteratively refines the mesh, communicating the worklists
between the two kernels inside the pipe.
The “flow” of worklists between kernels is not fixed at
compile time. For example, Listing 5 is perfectly valid IrGL
code. Depending on what cond evaluates to, the worklist
produced by A may be consumed by either B or C.
[Figure 1: flowchart with nodes "previous kernel", K and "next kernel"; edges labelled "retry/respawn; swap retry, in", "swap in, out" and "iterate loop".]
Figure 1. Flow control and worklist management for Iterate
4.2 Compiling Pipe
The actual creation and communication of worklists be-
tween kernels is the responsibility of the Pipe/Iterate state-
ments. Currently, the outermost Pipe (or Iterate) state-
ment in a host Kernel creates a pipe context. All nested
Pipe, Iterate and Invoke statements inherit this pipe con-
text. In our implementation, the pipe context contains the
incoming, outgoing and retry worklists named in, out and
retry respectively. The wlinit attribute specifies the size
of the worklist (size) and how the initial worklist is pop-
ulated, which is implementation-dependent. For example,
our compiler supports initializing worklists from a list of
scalar expressions (WorklistInitializer) or from an array
(WorklistInitializerFromArray).
When compiling an invocation to a kernel that reads,
writes or iterates over the WL object, all pops are executed
on the in worklist. Similarly all pushes are executed on the
out worklist, using cooperative conversion where applicable
to improve performance. Workitems to be retried are pushed
into the retry worklist.
Figures 1 and 2 illustrate how control flows within an
Iterate or Invoke. In general, flow is linear from the previous
kernel to the next, unless the kernel is invoked by Iterate, in
which case the kernel is invoked repeatedly until no more items
are left to process. Each invocation swaps the in and out worklists.
If the kernel uses the retry worklist, it is likewise invoked
repeatedly, but the in and retry worklists are swapped
while the out worklist remains the same.
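The control flow just described can be written down as a small driver loop. This Python sketch is a model under stated assumptions, not the compiler's output: the name `run_kernel` and the dictionary-based pipe context are hypothetical, worklists are plain lists, and a worklist is assumed to be reset to empty once it has been swapped away.

```python
def run_kernel(kernel, ctx, iterate=False):
    """Invoke kernel on the pipe context ctx.

    ctx holds the 'in', 'out' and 'retry' worklists. With iterate=True the
    kernel is re-invoked until no items are left, as for Iterate.
    """
    while True:
        kernel(ctx["in"], ctx["out"], ctx["retry"])
        if ctx["retry"]:
            # Retry/Respawn: re-run on the retried items; out is left untouched.
            ctx["in"], ctx["retry"] = ctx["retry"], []
            continue
        # Normal completion: the pushed items become the next input.
        ctx["in"], ctx["out"] = ctx["out"], []
        if not iterate or not ctx["in"]:
            return
```

After a plain Invoke returns, the items pushed by the kernel sit in `ctx["in"]`, ready for the next kernel in the Pipe.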
Our compiler also sets up storage to store the return
value when compiling Invoke and Iterate for kernels that use
ReduceAndReturn.
Note that Iterate is syntactic sugar for a loop that wraps
Invoke for kernels that do not use worklists. Iterate is
essentially equivalent to a Pipe for kernels that do use
worklists. Apart from terminating when the worklist is empty, it
is possible to specify (in extra_cond) additional conditions
that cause the loop to exit even when the worklist is
not empty. This extra condition may be combined with the
empty-worklist check using either And or Or.
4.3 Compiling Kernel Invocations
In the most general case, when the statement invoking the
kernel, an Iterate or Invoke or Pipe, lies in a host kernel, a
kernel invocation compiles down to a CUDA kernel launch.
However, when the kernel invocation lies in the control kernel
of an outlined Pipe, which is a CUDA __global__ function, it is
[Figure 2: flowchart with nodes "previous kernel", K and "next kernel"; edges labelled "retry/respawn; swap retry, in" and "swap in, out".]
Figure 2. Flow control and worklist management for Invoke
on a kernel that uses worklists
compiled to a device kernel function invocation. Our com-
piler also supports the use of CUDA Dynamic Parallelism
when launching kernels from control kernels, but the perfor-
mance is poor, and it is not recommended.
Since IrGL kernels have no notion of threads, our com-
piler must also choose appropriate grid and thread block
sizes for the CUDA launch. If SyncRunningThreads or Ex-
clusive are not used in the kernel, then any grid size can be
used, with our compiler using a fixed grid size calculated
from the number of multiprocessors in the GPU. The use of
these constructs requires that the grid size be chosen care-
fully as described earlier in Section 3.4. Without optimiza-
tions enabled, IrGL kernels also naturally compile down to
elastic kernels [6] and so can run with any thread block size.
However, when optimizations are enabled, the thread block
sizes for a kernel may be constrained as we describe below.
Our compiler allows programmers to use the CUDA
__launch_bounds__(maxthreadsperblock, minblocks)
annotation on individual kernels. The optional minblocks
parameter is advisory and requests the CUDA compiler to
achieve a residency of at least minblocks thread blocks on
each multiprocessor of the GPU. It is ignored by our compiler.
The maxthreadsperblock parameter, on the other hand, informs
the CUDA compiler that the kernel will not be launched with
more than maxthreadsperblock threads per block, which changes
the behaviour of the register allocator. Attempting to launch
a kernel with more than maxthreadsperblock threads per block
will result in failure. Thus, __launch_bounds__ establishes an
upper bound on the thread block size that can be used by our
compiler.
Using nested parallelism or cooperative conversion also
imposes a constraint on the thread block size that can be
selected for a kernel. Essentially, both these optimizations
make use of CUDA shared memory for communication with
the size of shared memory used depending on the thread
block size. Similarly, some libraries that we use internally
use C++ template parameters to specialize for a statically
specified thread block size. Such kernels are therefore lim-
ited to a fixed thread block size.
To summarize, IrGL kernels can fall into three categories
depending on the thread block size they support. First are
the ElasticBlock kernels, which can execute with any thread
block size. Second are the ShrinkableBlock kernels, which
place an upper bound on their thread block size. Finally, in
the third category are the FixedBlock kernels, which can only
execute with a fixed thread block size.
If iteration outlining is not used, the constraint on one
kernel does not affect another kernel. However, when a Pipe
containing different kernels is outlined to the GPU and dy-
namic parallelism is not used, the thread block size chosen
for the control kernel must satisfy the constraints on all ker-
nels in the Pipe.
Since the maximum thread block size is limited by CUDA
to 1024 on all GPUs we support, the set of possible thread
block sizes for a kernel $k$, denoted by $T_k$, is finite. If $K$ is
the set of kernels in a Pipe, then $T_{control}$ is simply:
$$T_{control} = \bigcap_{k \in K} T_k$$
Thus, the set of admissible thread block sizes for the control
kernel is simply the intersection of the domain sets for the
constraint variables of each kernel. If this intersection is
empty, then iteration outlining cannot be performed on this
Pipe. If this intersection contains multiple values, our
compiler chooses the highest value.
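The selection of a thread block size for the control kernel can be sketched directly with Python sets. The helper names `elastic`, `shrinkable`, `fixed` and `control_block_size` are hypothetical, and the sketch assumes that every size from 1 to 1024 is admissible for an ElasticBlock kernel.

```python
MAX_BLOCK = 1024  # CUDA limit on threads per block, as stated in the text


def elastic():
    """ElasticBlock: any thread block size (assumed here to be 1..1024)."""
    return set(range(1, MAX_BLOCK + 1))


def shrinkable(upper):
    """ShrinkableBlock: any size up to an upper bound."""
    return set(range(1, min(upper, MAX_BLOCK) + 1))


def fixed(size):
    """FixedBlock: exactly one size."""
    return {size}


def control_block_size(kernel_sets):
    """Intersect the per-kernel sets; return the largest common size or None."""
    common = elastic()
    for sizes in kernel_sets:
        common &= sizes
    return max(common) if common else None
```

A Pipe that mixes an ElasticBlock kernel, a ShrinkableBlock kernel bounded at 512 and a FixedBlock kernel of size 256 yields 256. Two FixedBlock kernels with different sizes yield `None`, which corresponds to the case where iteration outlining cannot be performed.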
It is possible for ElasticBlock and ShrinkableBlock ker-
nels to support different thread block sizes in different Pipes.
However, for simplicity, our compiler picks a single thread
block size for each kernel that is used at every invocation.
5. Conclusion
In this report, we have described the AST for IrGL and how
it is lowered to CUDA. Our scope has been limited to the
primary IrGL constructs since our intent was to provide a
high-level overview of the process. We hope that this
document will also be helpful in understanding the organization
of the IrGL compiler source, which will be released separately.
Among this document’s omissions, we note the absence of a
discussion of the annotations supported by our compiler,
as well as of its support for selecting optimizations at the
Block level, which allows a richer search space for auto-tuning.
Both are omitted because these constructs are currently in flux.
References
[1] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan,
J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU
concurrency: Weak behaviours and programming assumptions.
In Ö. Özturk, K. Ebcioglu, and S. Dwarkadas, editors,
Proceedings of the Twentieth International Conference
on Architectural Support for Programming Languages and
Operating Systems, ASPLOS ’15, Istanbul, Turkey, March 14-18,
2015, pages 577–591. ACM, 2015. ISBN 978-1-4503-2835-7.
URL http://doi.acm.org/10.1145/2694344.2694391.
[2] W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt.
Kilo TM: hardware transactional memory for GPU architectures.
IEEE Micro, 32(3), 2012.
[3] D. Merrill, M. Garland, and A. S. Grimshaw. Scalable GPU
graph traversal. In PPOPP 2012. ACM, 2012.
[4] R. Nasre, M. Burtscher, and K. Pingali. Morph algorithms on
GPUs. In PPoPP ’13, 2013.
[5] L. Nyland and S. Jones. Understanding and using atomic
memory operations. GTC 2013, 2013.
[6] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving
GPGPU concurrency with elastic kernels. In V. Sarkar and
R. Bodík, editors, Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’13, Houston,
TX, USA - March 16 - 20, 2013, pages 407–418. ACM, 2013.
ISBN 978-1-4503-1870-9. URL http://doi.acm.org/10.1145/2451116.2451160.
[7] A. Ramamurthy. Towards scalar synchronization in SIMT
architectures. Master’s thesis, The University of British
Columbia, 2011.
[8] T. Sorensen and A. F. Donaldson. The hitchhiker’s guide to
cross-platform OpenCL application development. In Proceedings
of the 4th International Workshop on OpenCL, IWOCL
2016, Vienna, Austria, April 19-21, 2016, pages 2:1–2:12.
ACM, 2016. ISBN 978-1-4503-4338-1.
URL http://doi.acm.org/10.1145/2909437.2909440.
[9] V. Vineet, P. Harish, S. Patidar, and P. J. Narayanan. Fast
minimum spanning tree for large graphs on the GPU. In
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference
on High Performance Graphics 2009. ACM, 2009.
[10] D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra.
The Zephyr abstract syntax description language. In
C. Ramming, editor, Proceedings of the Conference on
Domain-Specific Languages, DSL’97, Santa Barbara,
California, USA, October 15-17, 1997, pages 213–228. USENIX,
1997. URL http://www.usenix.org/publications/library/proceedings/dsl97/wang.html.
[11] S. Xiao and W. Feng. Inter-block GPU communication via
fast barrier synchronization. IPDPS 2010. IEEE, 2010.
    module = Module(mod_stmts stmts)
    mod_stmts = Kernel(identifier name, param* params,
                       bstmts stmts, str ret_type,
                       bool host, bool device)
              | CDeclGlobal(cdecl* decls, bool parse, bool dont_move)
              | CBlock(str* stmts, bool parse)
              | NOP()
              | Names(str* names)

    param = (str type, identifier name)
    cdecl = (str type, identifier name, str initializer)

    bstmts = Block(stmts* stmts)

    stmts = Block(stmts* stmts)
          | Assign(str lhs, str rhs)
          | For(identifier ndxvar, object iterator, bstmts* stmts)
          | ForAll(identifier ndxvar, object iterator, bstmts* stmts)
          | CFor(str init, str cond, str update, bstmts* stmts)

          | While(str cond, bstmts* stmts)
          | DoWhile(str cond, bstmts* stmts)

          | Atomic(str lock, str lockndx, stmts* stmts, bstmts* fail_stmts)
          | Exclusive(object_list objs_lock, stmts* stmts, bstmts* fail_stmts)

          | Retry(str args, bool merge)
          | Respawn(str args)
          | CBlock(str* stmts, bool parse)
          | CDecl(cdecl* decls, bool parse, bool dont_move)

          | If(str cond, bstmts* true_stmts, bstmts* false_stmts)

          | Invoke(aggr_func aggr, identifier kernel, str* args)
          | Iterate(term_cond cond, aggr_func aggr, identifier kernel,
                    str* args, worklist_initializer wlinit, bstmts* smts,
                    extra_cond extra_cond)
          | ReduceAndReturn(str value)

          | SyncRunningThreads()
          | LocalBarrier()

          | Pipe(bstmts* stmts, worklist_initializer wlinit, bool once)

          | Expr(expr e)

    expr = CExpr(str expr, bool parse)
         | MethodInvocation(str obj, identifier method, str obj_type, str* args)

    object_list = Tuple(identifier object, item_list* items)

    item_list = Array(int size, identifier name)
              | ArrayIterator(identifier array, str start, str end, str step)

    term_cond = While | Until
    aggr_func = Any | All
    extra_cond = And(str cond) | Or(str cond)

    worklist_initializer = WorklistInitializer(str size, str* initial)
                         | WorklistInitializerFromArray(str size, identifier array, str array_size)
Listing 6. IrGL Abstract Syntax Tree in ASDL
