On the Correctness of the SIMT Execution Model of GPUs by Habermaier, Axel & Knapp, Alexander
UNIVERSITÄT AUGSBURG
On the Correctness of the SIMT Execution
Model of GPUs
Extended version of the author’s ESOP’12 paper
Axel Habermaier and Alexander Knapp
Report 2012-01 January 2012
INSTITUT FÜR INFORMATIK
D-86135 AUGSBURG
Copyright © Axel Habermaier and Alexander Knapp
Institut für Informatik
Universität Augsburg
D–86135 Augsburg, Germany
http://www.Informatik.Uni-Augsburg.DE
— all rights reserved —
Institute for Software and Systems Engineering, University of Augsburg
{habermaier,knapp}@informatik.uni-augsburg.de
Abstract. GPUs are becoming a primary resource of computing power. They
use a single instruction, multiple threads (SIMT) execution model that executes
batches of threads in lockstep. If the control flow of threads within the same batch
diverges, the different execution paths are scheduled sequentially; once the con-
trol flows reconverge, all threads are executed in lockstep again. Several thread
batching mechanisms have been proposed, albeit without establishing their se-
mantic validity or their scheduling properties. To increase the level of confidence
in the correctness of GPU-accelerated programs, we formalize the SIMT execu-
tion model for a stack-based reconvergence mechanism in an operational seman-
tics and prove its correctness by constructing a simulation between the SIMT se-
mantics and a standard interleaved multi-thread semantics. We also demonstrate
that the SIMT execution model produces unfair schedules in some cases. We dis-
cuss the problem of unfairness for different batching mechanisms like dynamic
warp formation and a stack-less reconvergence strategy.
1 Introduction
Since the introduction of general-purpose programming frameworks for graphics
processing units (GPUs) a few years ago, GPUs have been capable of accelerating many
kinds of data-parallel algorithms aside from graphics computations. Speedups of one or
two orders of magnitude compared to CPU-based implementations have been achieved
for applications in the fields of molecular dynamics, medical imaging, seismic imag-
ing, fluid dynamics, and many others [8, 20]. GPUs are well suited for such massively
parallel problems because of their high computational power and memory bandwidth.
CPUs, on the other hand, are more optimized for sequential code containing many data-
dependent branch instructions. The model of computation of GPUs is therefore unlike
the traditional one of CPUs, even though they are converging [16].
GPUs typically launch thousands of threads to execute a data-parallel program.
Each thread executes the program on different input data akin to the single program,
multiple data (SPMD) principle. NVIDIA GPUs based on the FERMI architecture pro-
cess up to 512 threads in parallel, with thousands more being idle and waiting to be
scheduled later on [20]. Instead of executing each thread individually, however, the
hardware transparently batches several threads together for improved efficiency. The
threads of a batch always execute the same instruction in lockstep on a single instruc-
tion, multiple data (SIMD) unit, i.e., in parallel on different operands [23]. As threads
are batched dynamically at runtime, GPUs based on NVIDIA’s FERMI architecture or
on AMD’s GRAPHICS CORE NEXT design have no explicit support for SIMD vectors in
their instruction sets [15, 22], which distinguishes them from classical SIMD architec-
tures [13]. NVIDIA uses the term single instruction, multiple threads (SIMT) to describe
the SPMD- and SIMD-like execution model of their hardware [23].
A batch of threads that is executed in lockstep is called a warp or wavefront by
NVIDIA and AMD, respectively. Current NVIDIA hardware always uses a warp size of
32 threads, whereas AMD’s GRAPHICS CORE NEXT architecture has a wavefront size
of 64 [15, 23]. The SIMT design promises significant performance benefits especially
for graphics applications [13], but is in general not well-suited for code that heavily
relies on data-dependent control flow instructions: If the control flow of threads of the
same warp diverges, execution of the warp is serialized for each unique path, disabling
the threads that did not take the path. The hardware must therefore track the threads’
activation states and schedule all paths for execution one after another. Once all paths
complete, the threads reconverge and proceed in lockstep again. It is desirable to recon-
verge threads as soon as possible for performance reasons, as otherwise many hardware
units remain unused [23]. There are several different mechanisms to handle divergent
control flow on a SIMT architecture [4], which differ in thread scheduling and per-
formance; in particular, FERMI uses a stack-based reconvergence mechanism based on
immediate post-dominators, enabling full C++ support on the GPU [5, 19].
The growing complexity of GPU-based software potentially increases the number of
software errors, while at the same time the more wide-spread adoption requires a higher
level of confidence in program correctness. For formal proofs of correctness, however,
not only a precise understanding but also a formal foundation of the underlying SIMT
execution model is required. We therefore provide a formal semantics for NVIDIA’s
stack-based reconvergence mechanism. To keep the formalization concise, it is based
on a high-level programming language, disregarding much of NVIDIA’s technical en-
vironment (a full discussion based on the assembly-level language PTX including the
memory model can be found in [9]). We analyze the semantic validity of this execution
model by comparing it to a standard interleaved multi-thread semantics: We construct
and prove a simulation that demonstrates that each warp execution including serializa-
tions on control-flow divergences corresponds to an interleaved multi-thread execution.
In this sense, the SIMT execution model is correct with respect to the interleaving
semantics. However, there is a difference between the two models regarding fairness
of execution. In the interleaving model, generally weak fairness is assumed such that
every enabled thread will eventually take a step. In contrast, FERMI’s stack-based
reconvergence mechanism does not guarantee such a fairness condition; rather, its unfair
scheduling of divergent threads prevents some otherwise valid programs from
terminating in certain corner cases. We discuss the issue of unfairness for both NVIDIA’s
mechanism and for alternative SIMT implementations and provide a sufficient criterion
for detecting such situations.
Overview. Section 2 gives a brief overview of the hardware architecture and introduces
NVIDIA’s SIMT execution model with the help of an example. Section 3 presents the
formalization of this execution model. The interleaved multi-thread semantics and their
simulation of the SIMT behavior are given in Sect. 4. Section 5 then discusses the issue
of unfairness and Sect. 6 summarizes our findings and gives an outlook to future work.
2 SIMT Hardware Model
To set the context for the remainder of the paper, we give an overview of the FERMI
architecture and its programming model, omitting all graphics-related details. We
focus on NVIDIA’s Compute Unified Device Architecture (CUDA) and adopt its terminology
(in particular, batches of threads are referred to as warps), as it is representative of
other GPU programming frameworks and hardware designs: The underlying principles also
apply to DirectCompute, the Open Computing Language (OPENCL), and AMD’s GPUs [1, 12, 15, 21, 23].
Hennessy and Patterson [10] provide a more detailed introduction to CUDA and FERMI
and contrast FERMI’s design with traditional vector and SIMD architectures.
2.1 Hardware Architecture and Programming Model
CUDA programs launch parallel compute kernels on the GPU. Kernels are typically ex-
ecuted by thousands of threads in parallel, organized into a hierarchy: A grid represents
a set of threads executing the same kernel in a data-parallel fashion similar to the SPMD
principle. Grids are divided into thread blocks, each of which is assigned to a particular
SIMT core by the hardware. Threads belonging to the same block communicate via a
special on-chip memory on the SIMT core. Threads of different thread blocks share data
using the GPU’s global memory, which is not guaranteed to be consistent. The SIMT
core splits thread blocks into warps, which comprise a fixed number of scalar threads
that are executed in lockstep by an array of scalar processors on the SIMT core.
GPUs have multiple SIMT cores that execute warps as SIMD groups on the scalar
processors. Typically, there are more warps allocated on a SIMT core than can be exe-
cuted in parallel. This enables the GPU to hide high-latency memory operations: Instead
of using advanced hardware mechanisms such as prediction and out-of-order execution,
GPUs exploit the massive parallelism of the thread hierarchy by interleaving the execu-
tion of different warps; switching between warps incurs no overhead. Different warps
are executed independently, hence there is no performance gain or penalty when they
are executing common or disjoint code paths.
The focus of this paper is divergence and reconvergence within a warp, so we mainly
consider one single warp only; Sect. 5.2 discusses alternative SIMT implementations,
some of which take multiple warps into consideration to improve efficiency.
2.2 SIMT Control Flow
SIMT cores execute warps in lockstep, that is, each thread of a warp executes the same
instruction in parallel on different operands. In this way, instruction fetch and decoding
costs are amortized over all threads of a warp and memory operations performed by
threads of the same warp can often be coalesced into fewer memory accesses, which is
likely to result in significant performance benefits [13, 23]. To support a wide variety
of programs, however, the SIMT cores must be able to deal with diverging control flow
that occurs when at least two threads of the same warp take different paths due to a
data-dependent control flow instruction. Earlier GPU designs serialized the execution
of the remainder of the program once the control flow diverges. Serialization with-
out a reconvergence mechanism results in a low utilization of processing resources for
branch-heavy programs and performance drops by a factor proportional to the warp size
in the worst case [18]. Consequently, today’s GPU architectures implement mechanisms
for reconverging threads once their control flows return to a common path. Our focus
in this paper is the formalization of an operational semantics for NVIDIA’s stack-based
reconvergence mechanism as outlined by Hennessy and Patterson [10] and explained in
detail in one of NVIDIA’s US patent applications [5]. Neither AMD nor NVIDIA provides
any official information on their SIMT implementations.
2.3 Stack-based SIMT Reconvergence
The main idea of a stack-based reconvergence mechanism is to store information about
control flow divergence and reconvergence in a reconvergence stack. Possible causes
for divergence are branches and loops with conditions depending on thread-specific
data as well as loops and function calls with bodies containing data-dependent break
or return statements. Whenever the control flow might diverge, a token is pushed onto
the reconvergence stack. Tokens store both the continuation of a potentially divergent
instruction and the threads participating in its execution. Once the control flows recon-
verge (or the execution of the branch, loop, or function call completes without causing
any divergence at all), the topmost token is popped off the stack and the SIMT core uses
the information contained in the token to continue the execution of the program.
The reconvergence stack belongs to the execution state of a warp, also comprising
a program counter (PC), an active mask, and a disable mask. The active mask indicates
which threads are active and participate in the execution of the instruction referred to
by the warp’s PC. Inactive threads do not execute the instruction and their respective
scalar processors remain idle. The disable mask records the disable state of each thread:
b or r indicate that the thread executed a break or return instruction, respectively,
whereas 0 means that the thread is not disabled. Only threads with a disable state of
0 can be reactivated when a token is popped off the stack, thereby guaranteeing the
correct handling of nested control flow instructions within a compute kernel.
Each token on the reconvergence stack is of a specific type and comprises an ac-
tive mask and a program counter. Once a token is popped off the stack, the token’s PC
indicates the next instruction to be executed by the warp, the active mask determines
which threads are activated or deactivated, and the token type affects the update oper-
ations performed on the warp’s current active and disable masks. The type of a token
is either div or sync for branches, brk for loops, or call for function calls. A div token
stores all the information required for the execution of the second path of a branch af-
ter the first path terminates, while sync and brk tokens mark the reconvergence points
of branches and while loops (i.e., their continuations), respectively. It seems that the
reconvergence point of a control flow statement always corresponds to the statement’s
immediate post-dominator [5, 19], which denotes the first instruction that must be ex-
ecuted by all divergent (and still active) threads before they return from the current
function. The return address of a function call is stored in the corresponding call token
on the reconvergence stack, hence no function call stack is required to store return ad-
dresses (such a stack would only be necessary to push and pop function arguments).
Tokens of types brk and call are also used to determine all instructions a thread must
skip after executing a break or return.
The execution state of a warp is stored directly on the SIMT core, but older entries
of the stack may be spilled to global memory if necessary, making operations on the
stack potentially time-consuming. In fact, the disable mask is merely an optimization
that eliminates the need to modify the stack when threads are deactivated [5].
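The execution state described in this section can be pictured as a small record. The following Python sketch is purely illustrative: the class names Token and WarpState, the use of sets of thread ids for masks, and the strings "0", "b", and "r" for disable states are assumptions of the sketch and say nothing about how the hardware actually lays out this state. The sample values mirror the state that Table 1 lists right after the call at line 14.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Token:
    kind: str          # one of "div", "sync", "brk", "call"
    mask: frozenset    # threads recorded in the token's active mask
    pc: int            # instruction to continue with once the token is popped

@dataclass
class WarpState:
    pc: int
    active: set                                  # active mask as a set of thread ids
    disable: dict = field(default_factory=dict)  # thread id -> "0", "b", or "r"
    stack: list = field(default_factory=list)    # reconvergence stack, top at the end

# A warp of four threads that has just executed a call whose return address is 15.
warp = WarpState(pc=3, active={0, 1, 2, 3},
                 disable={t: "0" for t in range(4)})
warp.stack.append(Token("call", frozenset({0, 1, 2, 3}), 15))
```

Keeping the disable states outside the tokens reflects the remark above that the disable mask avoids modifying the stack when threads are deactivated.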
Example 1. We demonstrate how NVIDIA’s SIMT implementation handles nested con-
trol flow instructions by means of Prog. 1’s compute kernel main; the kernel serves
no real purpose other than an illustrative one. Note that when Prog. 1 is run on a GPU,
the actual execution is likely to differ from what is described below, as the compiler
performs (semantics preserving) control flow optimizations to minimize the overhead
incurred by operations on the reconvergence stack; particularly, the compiler will most
certainly inline the call to func.
Table 1 depicts the evolution of the warp’s execution state as the kernel is being ex-
ecuted. The active mask is shown as a bit field with a value of 1 at position n indicating
that the thread with id n is active; the disable mask is given in a similar bit-field-like
fashion. The rightmost column represents the reconvergence stack with the topmost
token on the left. Each consecutive pair of rows shows how the instruction pointed to
by the PC in between them manipulates the warp’s execution state. Figure 1 shows the
control flow graph of the while loop in function func of Prog. 1, highlighting the
immediate post-dominators of the loop at line 4 and the branch at line 5.
We assume a warp size of four for illustration purposes, with all four threads being
active initially (in situations where the number of threads executing a kernel is not a
multiple of the warp size, there are underpopulated warps with idle scalar processors).
The global pointer variables a and b are assumed to be shared arrays of length four,
acting as the kernel’s input and output parameters; we avoid function parameters and
return values to simplify matters, even though they are supported by CUDA, OPENCL,
and DirectCompute. We assume that the values of the arrays a and b are 0, 1, 1, 1 and
0, 1, 1, 3 for indices 0 to 3, respectively. For each thread executing the kernel, the thread-
local variable tid contains the unique id of the thread (0 for the first thread, 1 for the
second one, and so on), which is used to index into the arrays.
Execution of the compute kernel begins at line 14 where all threads of the warp call
func in lockstep. A call token is pushed onto the stack, with its program counter set to
the return address of the call, i.e., 15. The token’s active mask marks all four threads as
active, meaning that all four threads should eventually execute the return at line 15.
Execution continues at line 3 where all threads initialize their local variables i and
j. Subsequently, the while loop is encountered and a brk token is pushed onto the
stack. The loop condition evaluates to false for thread 0, hence the corresponding bit
in the active mask is set to 0 and the thread does not participate in the execution of
the loop. Next, the remaining threads push a sync token onto the stack because of the
if instruction at line 5. The token’s PC is set to the instruction at line 9, which is
the reconvergence point where all threads taking one of the two paths of the branch
should be executed in lockstep again. As thread 0 is not active when the if instruction
is encountered, its bit in the sync token’s active mask is not set. The branch statement
also causes a div token to be pushed onto the stack, as threads 1 and 2 take the else-
path and thread 3 takes the then-path; if no divergence had occurred, the stack would
remain unchanged. Execution of both paths is serialized, with current NVIDIA hardware
executing the else-path first. Thus, thread 3 is disabled and threads 1 and 2 continue
execution at line 8. The div token stores the information required to execute the
then-path once the else-path is completed. For this reason, the token’s PC is set to point to
the instruction of the then-path at line 6 and its active mask has only thread 3 activated.

 1  __shared int *a, *b;
 2  void func() {
 3    int i = a[tid], j = b[tid];
 4    while (i > 0) {
 5      if (j > 2 * i)
 6        b[tid] += i;
 7      else
 8        break;
 9      --i;
10    }
11    return;
12  }
13  void main() {
14    func();
15    return;
16  }
Program 1. A compute kernel with nested control flow instructions

[Figure: control flow graph over lines 4, 5, 6, 8, 9, and 11 of Prog. 1, with edges labeled i > 0, i <= 0, j > 2 * i, and j <= 2 * i]
Fig. 1. Control flow graph of the while statement in Prog. 1; immediate post-dominators are highlighted

PC   Active Mask   Disable Mask   Top of Stack
     1111          0000           –
14
     1111          0000           (call, 1111, 15)
 3
     1111          0000           (call, 1111, 15)
 4
     0111          0000           (brk, 1111, 11) (call, 1111, 15)
 5
     0110          0000           (div, 0001, 6) (sync, 0111, 9) . . .
 8
     0001          0bb0           (sync, 0111, 9) (brk, 1111, 11) . . .
 6
     0001          0bb0           (brk, 1111, 11) (call, 1111, 15)
 9
     0001          0bb0           (brk, 1111, 11) (call, 1111, 15)
 4
     1111          0000           (call, 1111, 15)
11
     1111          0000           –
15
     1111          0000           –
Table 1. Evolution of the warp’s execution state during the execution of Prog. 1
Threads 1 and 2 execute the break statement at line 8. Their bits in the active mask
are cleared and their disable states are set to b. Consequently, there no longer are any
active threads, so the warp pops the div token off the stack, which causes thread 3 to execute
the then-path of the branch at line 6. After the execution of the assignment, the end of
the then-path is reached, causing the sync token to be popped off the stack. Execution
now resumes at line 9; however, the warp cannot just use the token’s active mask, as
threads 1 and 2 executed a break and should therefore not participate in the execution
of the loop anymore. To support such arbitrarily nested control flow instructions within
a compute kernel, the token’s active mask is combined with the warp’s disable mask,
so that threads 1 and 2 remain disabled in this case because their disable states
have not yet been reset to 0.
Thread 3 returns to the beginning of the loop at line 4. As the condition evaluates
to false, there no longer are any active threads and the warp subsequently pops the
brk token off the stack. A token of type brk resets a disable state of b of all threads
that are active in the token’s active mask. Hence, all four threads execute the return
instruction at line 11 that causes the threads to jump to the function’s return address by
popping the call token off the stack. They execute the next return statement at line
15, which completes the execution of the compute kernel. □
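As a cross-check on the walkthrough, the kernel can also be run one thread at a time. The following Python transliteration of func from Prog. 1 is only a sketch: the sequential loop over thread ids stands in for the four threads (an assumption that is harmless here because every thread touches only its own array cells), and the lists a and b hold the input values assumed in Ex. 1.

```python
def func(tid, a, b):
    """Body of func from Prog. 1, executed by a single thread in isolation."""
    i, j = a[tid], b[tid]
    while i > 0:
        if j > 2 * i:
            b[tid] += i      # then-path (line 6)
        else:
            break            # else-path (line 8)
        i -= 1               # line 9

a = [0, 1, 1, 1]
b = [0, 1, 1, 3]
for tid in range(4):         # threads access disjoint cells, so the order is irrelevant
    func(tid, a, b)
print(b)
```

Only thread 3 takes the then-path, so only b[3] changes, which agrees with the warp-level trace in which thread 3 alone executes line 6.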
During the compilation of a compute kernel written in a structured programming
language like CUDA-C, branches and loops are replaced by unstructured conditional
jumps. The compiler preserves the structural information by adding special flags and
instructions into the assembly-level code, allowing the hardware to efficiently deter-
mine the immediate post-dominators [5, 22]. Our formalization of the SIMT execution
model disregards these low-level implementation details and focuses on a structured
programming language instead. A formal semantics of the SIMT behavior based on
NVIDIA’s assembly-level language PTX can be found in [9].
3 Formalization of the SIMT Execution Model
We formalize the SIMT behavior as discussed in the preceding section in an operational
semantics for a C-like high-level language. We focus on the main ideas of the mecha-
nism and omit other hardware supported features such as indirect branches and function
calls, as these can be reduced to sequences of their direct counterparts [5]. Due to the
stack-based reconvergence mechanism, we base our formalization of the SIMT seman-
tics on an instruction stack that unifies the treatment of statements from the structured
programming language and the warp management tokens.
3.1 Basic Domains and Language Grammar
We assume a syntactic category Var of variable identifiers with typical element x (all
metavariables may occur in an arbitrarily adorned form) and a syntactic category Func
of function identifiers, of which f will be a typical element. We also assume a syntactic
category Expr of (side-effect-free arithmetical) expressions over Var, ranged over by
e, which we do not specify more precisely.
Our programming language is a simple while-language with function calls; we dis-
tinguish between statements, statement lists, and programs. The grammar of (C-like)
statements and statement lists is as follows:
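\[
\begin{array}{rcl}
\mathit{Stm} \ni s & ::= & \texttt{;} \mid x \texttt{ = } e\texttt{;} \\
 & \mid & \texttt{if (}e\texttt{) } S_1 \texttt{ else } S_2 \\
 & \mid & \texttt{while (}e\texttt{) } S \mid \overline{\texttt{while}}\texttt{ (}e\texttt{) } S \mid \texttt{break;} \\
 & \mid & f\texttt{();} \mid \texttt{return;} \\
\mathit{Stms} \ni S & ::= & s \mid s\ S
\end{array}
\]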
Stm does not include a separate sequential composition but relies on statement lists
instead. The plain statement while and its marked variant $\overline{\texttt{while}}$ are required to
distinguish between the first iteration of a loop and subsequent iterations, which do not
generate further brk tokens. The domain Prog ⊆ Stms of programs, ranged over by P, singles out those
statement lists that do not contain $\overline{\texttt{while}}$ and have break; only within while loops.
A function environment ϕ ∈ FuncEnv = Func → Prog maps each function identifier
to a program.
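The grammar and the side conditions on Prog can be mirrored by a small abstract syntax. The Python sketch below is one possible encoding, not the one used in our proofs: the class names, the representation of expressions as plain strings, and the helper is_program are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Skip: pass                 # ;

@dataclass(frozen=True)
class Assign:                    # x = e;
    var: str
    expr: str

@dataclass(frozen=True)
class If:                        # if (e) S1 else S2
    cond: str
    then: Tuple
    orelse: Tuple

@dataclass(frozen=True)
class While:                     # while (e) S; marked=True encodes the marked variant
    cond: str
    body: Tuple
    marked: bool = False

@dataclass(frozen=True)
class Break: pass                # break;

@dataclass(frozen=True)
class Call:                      # f();
    func: str

@dataclass(frozen=True)
class Return: pass               # return;

def is_program(stmts, in_loop=False):
    """True if stmts is non-empty, has no marked while, and breaks only in loops."""
    if not stmts:
        return False
    for s in stmts:
        if isinstance(s, Break) and not in_loop:
            return False
        if isinstance(s, While) and (s.marked or not is_program(s.body, True)):
            return False
        if isinstance(s, If) and not (is_program(s.then, in_loop)
                                      and is_program(s.orelse, in_loop)):
            return False
    return True

# Body of func from Prog. 1, with b[tid] += i and --i written as plain assignments.
func_body = (
    Assign("i", "a[tid]"), Assign("j", "b[tid]"),
    While("i > 0", (If("j > 2 * i", (Assign("b[tid]", "b[tid] + i"),), (Break(),)),
                    Assign("i", "i - 1"))),
    Return(),
)
print(is_program(func_body))
```

A function environment then corresponds to a dictionary from function names to such tuples.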
Statement lists are executed by threads θ taken from a domain of thread identifiers
Thread; the domain of all subsets of Thread is ranged over by Θ. Threads can share
variables or have their own local copies depending on the initialization of the threads’
variable environments. A variable environment ν ∈ VarEnv = Var → Addr of a
thread assigns an address from the domain Addr to each variable in a program. A thread
environment η ∈ ThreadEnv = Thread → VarEnv maps a thread to its variable
environment. We assume a memory of type Mem = Addr → Val, ranged over by µ,
that holds integer values Val = ℤ and is accessible by all threads (we disregard caching).
Appendix A gives an overview of all metavariables and semantic domains used
throughout this paper.
3.2 Warp Configurations and Transitions
The execution state of a warp consists of an instruction stack comprising statements
and tokens, an active mask, and a disable mask. In contrast to the informal description
in Sect. 2.3, our formal model does not maintain a separate reconvergence stack.
A thread is considered active if it is contained in the warp’s active mask $\Theta \subseteq \mathit{Thread}$;
we write $\Theta^+$ for an active mask with at least one active thread, i.e., $\Theta^+ \neq \emptyset$. A
disable mask $\Delta$ is defined as a function in $\mathit{DisableMask} = \mathit{Thread} \to \mathit{DisableState}$,
which maps a thread to its disable state $\delta$ that is either 0, b, or r. Token types are denoted
by $t$ and have a value of either div, sync, brk, or call. A token $\tau = t_\Theta$ comprises a token
type $t$ and an active mask $\Theta$. Warp instruction stacks are given by
\[ \mathit{WStack} \ni W ::= \varepsilon \mid s\ W \mid \tau\ W , \]
combining tokens and statements. In contrast to Sect. 2.3, a token’s continuation is not
given by a program counter but rather as the remainder of the instruction stack.
A warp configuration $\varphi, \eta \triangleright W, \Theta, \Delta, \mu$ consists of a static and a dynamic part,
separated by $\triangleright$. The former comprises a function environment $\varphi$ and a thread
environment $\eta$, whereas the latter contains a warp instruction stack $W$, an active mask
$\Theta$, a disable mask $\Delta$, and a memory $\mu$. An initial warp configuration is of the form
$\varphi, \eta \triangleright P\ \mathsf{call}_\Theta, \Theta, \{\Theta \mapsto 0\}, \mu$ where $P$ is a program, $\Theta$ is an arbitrary subset of
threads, and $\mu$ is an arbitrary memory. For reasons of uniformity, we assume that there
always is a call token at the bottom of the stack which corresponds to the invocation of
the compute kernel. A warp transition $\varphi, \eta \triangleright W, \Theta, \Delta, \mu \Rightarrow_w W', \Theta', \Delta', \mu'$ describes
a step transforming a warp configuration into another warp configuration, not repeating
the static parts of the configurations.
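\[
\begin{array}{ll}
(\mathrm{skip}_w) & \texttt{;},\ \Theta^+ \Rightarrow_w \varepsilon,\ \Theta^+ \\
(\mathrm{assign}_w) & \eta \triangleright x \texttt{ = } e\texttt{;},\ \Theta^+,\ \mu \Rightarrow_w \varepsilon,\ \Theta^+,\ \mu\{\eta(\Theta^+)(x) \mapsto \mathcal{E}[\![e]\!]\,\eta(\Theta^+)\,\mu\} \\
(\mathrm{if}_w) & \eta \triangleright \texttt{if (}e\texttt{) } S_1 \texttt{ else } S_2,\ \Theta^+,\ \mu \Rightarrow_w \\
 & \quad S_2\ \mathsf{div}_{\mathit{act}(\Theta^+, e, \eta, \mu)}\ S_1\ \mathsf{sync}_{\Theta^+},\ \Theta^+ \setminus \mathit{act}(\Theta^+, e, \eta, \mu),\ \mu \\
(\mathrm{while}_w) & \eta \triangleright \texttt{while (}e\texttt{) } S,\ \Theta^+,\ \mu \Rightarrow_w S\ \overline{\texttt{while}}\texttt{ (}e\texttt{) } S\ \mathsf{brk}_{\Theta^+},\ \mathit{act}(\Theta^+, e, \eta, \mu),\ \mu \\
(\overline{\mathrm{while}}_w) & \eta \triangleright \overline{\texttt{while}}\texttt{ (}e\texttt{) } S,\ \Theta^+,\ \mu \Rightarrow_w S\ \overline{\texttt{while}}\texttt{ (}e\texttt{) } S,\ \mathit{act}(\Theta^+, e, \eta, \mu),\ \mu \\
(\mathrm{break}_w) & \texttt{break;}\ [S]\ \tau,\ \Theta^+,\ \Delta \Rightarrow_w \tau,\ \emptyset,\ \Delta\{\Theta^+ \mapsto \mathsf{b}\} \\
(\mathrm{call}_w) & \varphi \triangleright f\texttt{();},\ \Theta^+ \Rightarrow_w \varphi(f)\ \mathsf{call}_{\Theta^+},\ \Theta^+ \\
(\mathrm{return}_w) & \texttt{return;}\ [S]\ \tau,\ \Theta^+,\ \Delta \Rightarrow_w \tau,\ \emptyset,\ \Delta\{\Theta^+ \mapsto \mathsf{r}\} \\
(\mathrm{token}_w) & \tau,\ \Theta_1,\ \Delta_1 \Rightarrow_w \varepsilon,\ \Theta_2,\ \Delta_2 \quad \text{where } (\Theta_2, \Delta_2) = \mathit{enable}(\Delta_1, \tau) \\
(\mathrm{inact}_w) & S\ \tau,\ \emptyset \Rightarrow_w \tau,\ \emptyset
\end{array}
\]
Table 2. Operational semantics of a warp; $\overline{\texttt{while}}$ denotes the marked variant of while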
3.3 Warp Operational Semantics
The warp operational semantics is the smallest binary relation $\Rightarrow_w$ on warp
configurations that is closed under the rules in Table 2. These are, in fact, rule schemes where
warp transitions are obtained by replacing the metavariables with suitable instances.
We use the following notational conventions: We only give the initial segment of the in-
struction stack W that is of relevance for the given rule; the remainder of W is omitted
and remains unchanged. Similarly, we drop all other irrelevant parts of an operational
judgement that remain unchanged. For example, the rule
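\[ \varphi, \eta \triangleright f\texttt{();}\ W,\ \Theta^+,\ \Delta,\ \mu \Rightarrow_w \varphi(f)\ \mathsf{call}_{\Theta^+}\ W,\ \Theta^+,\ \Delta,\ \mu \]
is abbreviated as follows, where $\Theta^+$ is not omitted (even though it remains unchanged)
as it has to be stored in the call token and the rule may only be applied to warp
configurations with a non-empty set of active threads:
\[ \varphi \triangleright f\texttt{();},\ \Theta^+ \Rightarrow_w \varphi(f)\ \mathsf{call}_{\Theta^+},\ \Theta^+ . \]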
The operational rules in Table 2 process the instruction stack of a warp configuration
disregarding all possible compiler or hardware optimizations. The skip operation ; is
simply popped off the instruction stack; like all other rules except for (tokenw) and
(inactw), it can only be applied to warp configurations with at least one active thread.
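To execute an assignment x = e;, all active threads use the function
$\mathcal{E}[\![-]\!] : \mathit{Expr} \to \mathit{VarEnv} \times \mathit{Mem} \to \mathit{Val}$ to evaluate $e$ before any of them write to $x$, thereby
avoiding potential nondeterminism. However, the order of conflicting writes (that is,
$\eta(\theta_1)(x) = \eta(\theta_2)(x)$ but $\mathcal{E}[\![e]\!]\,\eta(\theta_1)\,\mu \neq \mathcal{E}[\![e]\!]\,\eta(\theta_2)\,\mu$ for two active threads $\theta_1, \theta_2 \in \Theta$)
is undefined [23]; this is modeled by the nondeterminism in the instantiation of the
rule $(\mathrm{assign}_w)$: $\mu\{\eta(\Theta^+)(x) \mapsto \mathcal{E}[\![e]\!]\,\eta(\Theta^+)\,\mu\}$ abbreviates the memory update
\[ \mu\{\eta(\theta_1)(x) \ldots \eta(\theta_n)(x) \mapsto \mathcal{E}[\![e]\!]\,\eta(\theta_1)\,\mu \ldots \mathcal{E}[\![e]\!]\,\eta(\theta_n)\,\mu\} \]
for some arbitrary order of threads $\theta_i \in \Theta^+$.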
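The two phases of a warp-level assignment (all reads first, then all writes in an arbitrary order) can be sketched as follows. This Python fragment is an illustration under assumptions: the function name assign, the dictionary encodings of the thread environment and the memory, and the use of random.shuffle to stand for the unspecified write order are not part of the formal model.

```python
import random

def assign(active, var, expr, env, mem):
    """Warp-level x = e;: evaluate e for all active threads, then write."""
    # Phase 1: every active thread evaluates e against the old memory.
    values = {t: expr(env[t], mem) for t in active}
    # Phase 2: the writes are performed in some arbitrary thread order.
    order = list(active)
    random.shuffle(order)
    for t in order:
        mem[env[t][var]] = values[t]

# Two threads whose x and y are crossed over: thread 0's x is thread 1's y and vice versa.
env = {0: {"x": 0, "y": 1}, 1: {"x": 1, "y": 0}}
mem = {0: 10, 1: 20}
assign({0, 1}, "x", lambda nu, mu: mu[nu["y"]] + 1, env, mem)
print(mem)
```

Because both threads read before either one writes, the result in this example does not depend on the write order; the order only matters when two threads write different values to the same address.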
An if statement pushes two tokens onto the stack: The sync token marks the
reconvergence point where all active threads $\Theta^+$ are reactivated again, provided that they
do not execute a break or return instruction in the meantime. The div token stores
the information needed to execute the then-path of the branch once the execution of the
else-path is completed. The rule $(\mathrm{if}_w)$ uses the function
\[ \mathit{act}(\Theta, e, \eta, \mu) = \{\theta \in \Theta \mid \mathcal{E}[\![e]\!]\,\eta(\theta)\,\mu \neq 0\} \]
to determine the set of threads for which expression $e$ evaluates to true (i.e., non-zero).
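The function act can be tried out on the branch at line 5 of Prog. 1. In the Python sketch below, the encoding of expressions as functions of a variable environment and a memory, the address scheme ("i", t), and all names are assumptions chosen for illustration; the values of i and j are those the threads hold when they reach the branch in Ex. 1.

```python
def act(threads, cond, env, mem):
    """Threads among `threads` for which the condition evaluates to non-zero."""
    return {t for t in threads if cond(env[t], mem) != 0}

# Each thread has private copies of i and j, addressed as ("i", t) and ("j", t).
env = {t: {"i": ("i", t), "j": ("j", t)} for t in range(4)}
mem = {}
for t, (i, j) in enumerate(zip([0, 1, 1, 1], [0, 1, 1, 3])):
    mem[("i", t)], mem[("j", t)] = i, j

cond = lambda nu, mu: int(mu[nu["j"]] > 2 * mu[nu["i"]])   # j > 2 * i
active = {1, 2, 3}                    # thread 0 never entered the loop
then_threads = act(active, cond, env, mem)
else_threads = active - then_threads
print(then_threads, else_threads)
```

The then-threads end up in the div token while the else-threads form the new active mask, which matches the masks 0001 and 0110 in Table 1.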
A while statement at the top of the instruction stack is replaced by a sequence of
instructions: First, a brk token storing the continuation for all threads exiting the loop
is pushed onto the stack. Second, a corresponding marked $\overline{\texttt{while}}$ statement is created, which
does not generate another brk token once it is encountered. Finally, the statement list
forming the while statement’s body is pushed onto the stack. Whether there are any
threads for which the loop condition holds is again determined by the function act.
A break or return statement deactivates all active threads and sets their disable
state to b or r, respectively. [S] denotes a possibly empty statement list, so either break
and return are directly followed by a token τ or all statements up to the next token
on the stack are skipped.
A call to a function f places the program ϕ(f) on the top of the stack. The call token
that is pushed onto the stack beforehand stores the currently active threads Θ+, all of
which are reactivated once the warp begins to execute the continuation of the call token.
When a token is popped off the instruction stack, the warp’s active and disable
masks are updated. The reactivation of an inactive thread depends on the type of the
token and the thread’s disable state. For instance, a thread with a disable state of r may
only be reactivated if the token is of type call. The predicate
\[ \mathit{awaits}(\delta, t) \leftrightarrow (\delta = \mathsf{b} \to t = \mathsf{brk}) \land (\delta = \mathsf{r} \to t = \mathsf{call}) \]
establishes this relationship between disable states and token types. The function enable
clears the disable states of all threads for which awaits holds. Furthermore, it replaces
the warp’s active mask with the one of the token, removing all threads with a disable
state other than 0. Formally, we define enable as
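\[ \mathit{enable}(\Delta_1, t_\Theta) = (\{\theta \in \Theta \mid \Delta_2(\theta) = 0\},\ \Delta_2) \quad \text{where } \Delta_2 = \Delta_1\{\{\theta \in \Theta \mid \mathit{awaits}(\Delta_1(\theta), t)\} \mapsto 0\} . \]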
In particular, enable only changes the disable states of threads that are contained in
the token’s active mask; otherwise, threads would be reactivated too soon, as already
illustrated in Ex. 1. For another example, consider a thread $\theta$ that executes a
return statement in the else-path of a branch, while the other threads $\Theta$ of the warp
call another function $f'$ when they later execute the then-path of the branch. Once the
control flow returns from $f'$, $\theta$’s disable state remains unchanged because $\theta$ is not active in
the call token corresponding to the invocation of $f'$. Hence, when the sync token of the
branch is subsequently popped, $\theta$ is removed from the token’s active mask before it is
copied into the warp’s execution state because its disable state is still set to r.
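The interplay of awaits and enable can be replayed on the two token pops of Ex. 1 that follow the break of threads 1 and 2. The Python sketch below is a direct transcription under assumptions: disable masks are dictionaries from thread ids to the strings "0", "b", and "r", and a token is passed as its type and its set of threads.

```python
def awaits(state, kind):
    """(state = b implies kind = brk) and (state = r implies kind = call)."""
    return (state != "b" or kind == "brk") and (state != "r" or kind == "call")

def enable(disable, kind, mask):
    """Pop a token (kind, mask); return the new active mask and disable mask."""
    new_disable = dict(disable)
    for t in mask:                       # only threads of the token are touched
        if awaits(disable[t], kind):
            new_disable[t] = "0"
    active = {t for t in mask if new_disable[t] == "0"}
    return active, new_disable

disable = {0: "0", 1: "b", 2: "b", 3: "0"}          # disable mask 0bb0

# Popping the sync token of the branch: threads 1 and 2 stay disabled.
active_sync, disable_sync = enable(disable, "sync", {1, 2, 3})

# Popping the brk token of the loop: all four threads are reactivated.
active_brk, disable_brk = enable(disable, "brk", {0, 1, 2, 3})
print(active_sync, active_brk)
```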
The rule (inactw) is used to skip all statements up to the topmost token on the stack
if the warp’s active mask is empty. Such a situation typically arises when the condition
of a while statement evaluates to false for all active threads or when a token does not
activate any threads at all. The latter occurs when all threads return from a function call
within an if-statement, for example. In that case, the rule (inactw) skips all statements
up to the next token, which is then dealt with by the rule (tokenw). Once the last token is
popped off the stack, the compute kernel terminates, as the last token on the instruction
stack is the call token corresponding to the invocation of the compute kernel.
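To make the rules of Table 2 concrete, the following Python sketch implements them as a small interpreter and runs Prog. 1. It is an illustration under several assumptions, not a verified artifact: statements are encoded as tuples, "while_" stands for the marked while, expressions are Python functions of a variable environment and a memory, the variables "a" and "b" stand for a[tid] and b[tid], and conflicting writes are resolved in one fixed order although the semantics leaves the order open.

```python
def act(threads, e, eta, mu):
    return {t for t in threads if e(eta[t], mu) != 0}

def awaits(d, kind):
    return (d != "b" or kind == "brk") and (d != "r" or kind == "call")

def enable(dis, kind, mask):
    new = {t: ("0" if t in mask and awaits(d, kind) else d) for t, d in dis.items()}
    return {t for t in mask if new[t] == "0"}, new

def tok(kind, mask):
    return ("tok", kind, frozenset(mask))

def skip_to_token(W):
    while W[0][0] != "tok":              # a call token always sits at the bottom
        W = W[1:]
    return W

def step(phi, eta, W, A, dis, mu):
    """One warp transition; returns the new (W, A, dis) and updates mu in place."""
    top, rest = W[0], W[1:]
    op = top[0]
    if op == "tok":                                    # (token_w)
        A, dis = enable(dis, top[1], top[2])
        return rest, A, dis
    if not A:                                          # (inact_w)
        return skip_to_token(rest), A, dis
    if op == "skip":                                   # (skip_w)
        return rest, A, dis
    if op == "assign":                                 # (assign_w)
        _, x, e = top
        vals = {t: e(eta[t], mu) for t in A}           # all reads before any write
        for t in sorted(A):                            # fixed write order (assumption)
            mu[eta[t][x]] = vals[t]
        return rest, A, dis
    if op == "if":                                     # (if_w): else-path runs first
        _, e, S1, S2 = top
        T = act(A, e, eta, mu)
        return S2 + [tok("div", T)] + S1 + [tok("sync", A)] + rest, A - T, dis
    if op in ("while", "while_"):                      # plain and marked while
        _, e, S = top
        brk = [tok("brk", A)] if op == "while" else []
        return S + [("while_", e, S)] + brk + rest, act(A, e, eta, mu), dis
    if op in ("break", "return"):                      # (break_w), (return_w)
        flag = "b" if op == "break" else "r"
        dis = {t: (flag if t in A else d) for t, d in dis.items()}
        return skip_to_token(rest), set(), dis
    if op == "call":                                   # (call_w)
        return phi[top[1]] + [tok("call", A)] + rest, A, dis
    raise ValueError(op)

def run(phi, eta, threads, mu, entry="main"):
    W, A = phi[entry] + [tok("call", threads)], set(threads)
    dis = {t: "0" for t in threads}
    while W:
        W, A, dis = step(phi, eta, W, A, dis, mu)
    return mu

# Prog. 1 in the tuple encoding.
rd = lambda x: (lambda nu, mu: mu[nu[x]])
body = [("if", lambda nu, mu: int(mu[nu["j"]] > 2 * mu[nu["i"]]),
         [("assign", "b", lambda nu, mu: mu[nu["b"]] + mu[nu["i"]])],
         [("break",)]),
        ("assign", "i", lambda nu, mu: mu[nu["i"]] - 1)]
phi = {"func": [("assign", "i", rd("a")), ("assign", "j", rd("b")),
                ("while", lambda nu, mu: int(mu[nu["i"]] > 0), body),
                ("return",)],
       "main": [("call", "func"), ("return",)]}
threads = range(4)
eta = {t: {x: (x, t) for x in "abij"} for t in threads}   # no variables are shared
mu = {("a", t): v for t, v in enumerate([0, 1, 1, 1])}
mu.update({("b", t): v for t, v in enumerate([0, 1, 1, 3])})
run(phi, eta, threads, mu)
print([mu[("b", t)] for t in threads])
```

The run terminates when the call token of the kernel invocation is popped, and only the cell belonging to thread 3 is modified, as in Ex. 1.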
Example 2. We apply our formal semantics of the SIMT execution model to the com-
pute kernel main of Prog. 1. As in Ex. 1, we illustrate active and disable masks as bit
fields and omit the memory as well as the function and thread environments for reasons
of brevity (observe that our formal model treats indexed array accesses such as b[tid]
as regular variables; we assume that the thread environment is initialized such that no
variables are shared). The following shows the derivation sequence of Prog. 1 begin-
ning with the while statement at line 4 and ending just before the execution of the
return statement at line 11. Again considering only four threads as in Ex. 1, the first
warp configuration is given as while (i > 0) . . . return; W, 1111, 0000 with
W = call1111 return; call1111 denoting the remainder of the instruction stack; W
contains two call tokens that correspond to the invocations of func and main. Recall
that the values of the shared arrays a and b are 0, 1, 1, 1 and 0, 1, 1, 3 for indices 0 to 3,
respectively.
while (i > 0) . . . return;W, 1111, 0000
while===⇒w if (j > 2 * i) b[tid] += i; else break; --i;
while (i > 0) . . . brk1111 return;W, 0111, 0000
if===⇒w break; div0001 b[tid] += i; sync0111 --i;
while (i > 0) . . . brk1111 return;W, 0110, 0000
break===⇒w div0001 b[tid] += i; sync0111 --i;
while (i > 0) . . . brk1111 return;W, 0000, 0bb0
token===⇒w b[tid] += i; sync0111 --i;
while (i > 0) . . . brk1111 return;W, 0001, 0bb0
assign===⇒w sync0111 --i; while (i > 0) . . . brk1111 return;W, 0001, 0bb0
token===⇒w --i; while (i > 0) . . . brk1111 return;W, 0001, 0bb0
assign===⇒w while (i > 0) . . . brk1111 return;W, 0001, 0bb0
while===⇒w if (j > 2 * i) b[tid] += i; else break; --i;
while (i > 0) . . . brk1111 return;W, 0000, 0bb0
inact===⇒w brk1111 return;W, 0000, 0bb0
token===⇒w return;W, 1111, 0000
The application of the rule (whilew) pushes the entire body of the while statement onto
the instruction stack again, even though there no longer are any active threads. However,
the body of a while statement consists of statements only, hence no new tokens are
pushed onto the stack. The rule (inactw) is therefore able to skip all statements on the
stack up to the brk token. When the rule (tokenw) pops the brk token off the stack, the
function enable reactivates all threads with a disable state of b. In contrast, the threads
remain disabled when the div and sync tokens are encountered. □
The following lemma summarizes a few invariants of reachable warp configura-
tions. A warp configuration w is reachable if a finite sequence of warp transitions trans-
forms some initial warp configuration into w; a warp transition is reachable if the warp
configuration on its left hand side is reachable.
As can be seen by inspecting the operational rules for warps in Table 2, no token
lies between a div token and its sync token, and the active mask of a sync token comprises
both the currently active threads and the active mask of the previous div token. Simi-
larly, each while statement is directly followed by a brk token and the currently active
threads are contained in the brk token’s active mask. By induction and observing that
no warp rule alters previously pushed tokens, the active mask of a token τ1 is a subset
of the active mask of another token τ2 lower on the stack, if neither token is of type div.
Analogously, the warp’s current active mask always is a subset of the active masks of
all tokens on the stack; except for div tokens, which always disable all active threads.
Lemma 1. Let ϕ, η ▷ W, Θ, ∆, µ be a reachable warp configuration.
1. If W = divΘ′ W0, then W0 = S syncΘ′′ W1 and Θ ∪ Θ′ ⊆ Θ′′.
2. If W = while (e) S W0, then W0 = brkΘ′ W1 and Θ ⊆ Θ′.
3. If W = W1 t1,Θ1 W2 t2,Θ2 W3 and t1 ≠ div ≠ t2, then Θ1 ⊆ Θ2.
4. If W = W1 tΘ′ W2, then Θ ∩ Θ′ = ∅ if t = div or Θ ⊆ Θ′ otherwise.
Proof. We prove claims (1–4) simultaneously by induction over the length of the
derivation to reach ϕ, η ▷ W, Θ, ∆, µ and a case distinction over the warp rules. In an
initial configuration the claims are obvious.
Claims (1) and (2) follow directly from the (ifw) and (whilew) rule, respectively, and
the induction hypothesis using claim (4). We only have to consider (3) and (4).
If (skipw), (assignw), or (inactw) was the last rule applied to obtain ϕ, η ▷ W, Θ, ∆, µ,
claims (3) and (4) follow from the induction hypothesis.
For (ifw), let Θ+ be the active mask of the sync token pushed onto the instruction
stack in the previous warp configuration. For (3) let tΘ′ be the first non-div token that
was on the instruction stack; if no such token exists the claim is immediate. By claim (4)
from the induction hypothesis, Θ+ ⊆ Θ′. For (4), we additionally have Θ = Θ+ and
also that act(Θ+, e, η, µ) ∩ (Θ+ \ act(Θ+, e, η, µ)) = ∅.
For (whilew), (whilew), (breakw), (callw), and (returnw) the claims follow from the
induction hypothesis, since the new active mask is contained in or equal to the active
mask of the previous warp configuration and the newly pushed tokens have an active
mask equal to the active mask of the previous warp configuration.
For (tokenw), let ϕ, η ▷ tΘ0 W0, Θ1, ∆1 be the previous warp configuration and
(Θ, ∆2) = enable(∆1, tΘ0) with Θ = {θ ∈ Θ0 | ∆2(θ) = 0} and
∆2 = ∆1{{θ ∈ Θ0 | awaits(∆1(θ), t)} ↦ 0}; in particular, Θ ⊆ Θ0. Claim (3) is
immediate since W is a prefix of W0. For (4), let t′Θ′ be a token in W0; if no such token
exists the claim is obvious. Token tΘ0 has been pushed onto the instruction stack above
t′Θ′ either by an application of the (whilew) rule (for t = brk), or the (callw) rule
(for t = call), or the (ifw) rule (for t = div); in case of an (ifw) also a token syncΘ′′
with Θ0 ⊆ Θ′′ has been pushed above t′Θ′. In case of (whilew) and (callw) the active mask
Θ0 of tΘ0 has been the active mask of the current warp configuration when applying the
rule; for (ifw) this is the case for syncΘ′′. By the induction hypothesis, claim (4) holds
for the result of this rule application. Since no rule alters previously pushed tokens,
claim (4) follows for the induction step. □
4 Simulating SIMT Execution by Interleaved Multi-Threading
The formalization of the SIMT execution model in the preceding section allows us to
formally establish its semantic validity by constructing a simulation relation between
the warp semantics and a standard interleaved multi-thread semantics. The simulation
shows that the SIMT execution model is correct in the sense that warps execute control
flow instructions in a way that can be reproduced by a certain schedule of the interleaved
thread semantics.
4.1 Interleaved Multi-Thread Semantics
The concepts of active masks, disable masks, and divergence do not apply to individual
threads. However, threads still depend on an instruction stack that comprises statements
and contexts c. Similar to the tokens in a warp’s instruction stack, contexts denote the
thread’s continuations once a loop is exited or a break or return statement is exe-
cuted. Thread instruction stacks are defined as follows, with contexts c being either brk
or call:
TStack ∋ T ::= ε | s T | c T .
A thread configuration ϕ, ν ▷ T, µ consists of a function environment ϕ, a variable
environment ν, a thread instruction stack T, and a memory µ. A thread transition
ϕ, ν ▷ T, µ ⇒t T′, µ′ transforms a thread configuration into another thread configuration.
The rules of our thread operational semantics ⇒t are given in Table 4, where we
reuse our notational conventions for warps. It is a fairly standard small-step semantics
(see e.g. [25]), so we only remark the following: Contexts that reach the top of the
instruction stack are simply discarded by the rule (contextt). When a thread encounters
a break or return statement, it skips all statements up to the next brk or call context
on the stack, respectively. T¬call denotes a possibly empty list of instructions that does
not contain any call contexts.
(skipt)      ; ⇒t ε
(assigntrd)  ν ▷ x = e;, µ ⇒t x = v;, µ                     where E⟦e⟧ ν µ = v
(assigntwr)  ν ▷ x = v;, µ ⇒t ε, µ{ν(x) ↦ v}
(ifttt)      ν ▷ if (e) S1 else S2, µ ⇒t S1, µ              if E⟦e⟧ ν µ ≠ 0
(iftff)      ν ▷ if (e) S1 else S2, µ ⇒t S2, µ              if E⟦e⟧ ν µ = 0
(whilettt)   ν ▷ while (e) S, µ ⇒t S while (e) S brk, µ     if E⟦e⟧ ν µ ≠ 0
(whiletff)   ν ▷ while (e) S, µ ⇒t ε, µ                     if E⟦e⟧ ν µ = 0
(whilettt)   ν ▷ while (e) S, µ ⇒t S while (e) S, µ         if E⟦e⟧ ν µ ≠ 0
(whiletff)   ν ▷ while (e) S, µ ⇒t ε, µ                     if E⟦e⟧ ν µ = 0
(breakt)     break; [S] brk ⇒t ε
(callt)      ϕ ▷ f(); ⇒t ϕ(f) call
(returnt)    return; T¬call call ⇒t ε
(contextt)   c ⇒t ε
Table 4. Operational semantics of a single thread
Multiple threads interleave execution. An interleaved thread configuration ϕ, η ▷ ς, µ
uses a thread stack function ς : Thread → TStack to determine the instruction
stack of each thread participating in the execution. An interleaved thread transition
ϕ, η ▷ ς, µ ⇒i ς′, µ′ describes a step transforming an interleaved thread configuration
into another interleaved thread configuration by selecting an arbitrary thread and
executing a thread transition. With the help of our notational conventions for warps, the
rule for our interleaved thread semantics ⇒i is given as:
    ϕ, η(θ) ▷ ς(θ), µ ⇒t T, µ′
    ─────────────────────────────
    ϕ, η ▷ ς, µ ⇒i ς{θ ↦ T}, µ′
The thread semantics handles assignments in two steps: The rule (assigntrd) eval-
uates the expression before the rule (assigntwr) performs the actual memory update; if
there was only one rule for assignments, the interleaved thread semantics would be un-
able to simulate the SIMT execution model. For instance, consider a statement x =
x + 1 for some shared variable x. As a warp executes all of its n active threads in
lockstep, the value of x is incremented by 1 in total. With only one assignment rule
for threads, the interleaved thread semantics would always increment x by n. With the
two rules for assignment, however, the interleaved semantics nondeterministically in-
crements x by l with 1 ≤ l ≤ n, depending on the order in which the threads apply the
rules (assigntrd) and (assigntwr). The simulation of the SIMT behavior therefore applies
the rule (assigntrd) to all active threads before any of them applies the rule (assigntwr); in
the example, this guarantees that x is incremented by 1.
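The effect of splitting assignments into a read and a write step can be checked by brute force. The following Python sketch is an illustration under simplifying assumptions, not part of the formal semantics: it models only the statement x = x + 1 on one shared variable with initial value 0, and the function names are hypothetical. It enumerates all interleavings in which each thread reads before it writes and collects the resulting increments.

```python
from itertools import permutations


def final_increments(n: int) -> set:
    """All final values of x (initially 0) after n threads run x = x + 1,
    where each thread first reads (rd) and later writes (wr)."""
    events = [(th, ph) for th in range(n) for ph in ("rd", "wr")]
    results = set()
    for order in permutations(events):
        # Discard schedules in which some thread writes before it reads.
        if any(order.index((th, "rd")) > order.index((th, "wr")) for th in range(n)):
            continue
        x = 0
        pending = {}  # value each thread computed in its read step
        for th, ph in order:
            if ph == "rd":
                pending[th] = x + 1
            else:
                x = pending[th]
        results.add(x)
    return results


def lockstep_increment(n: int) -> int:
    """Schedule used to simulate a warp: all reads first, then all writes."""
    x = 0
    pending = {th: x + 1 for th in range(n)}
    for th in range(n):
        x = pending[th]
    return x
```

For three threads, the enumeration yields exactly the increments 1, 2, and 3, and the lockstep schedule yields 1, in line with the bound 1 ≤ l ≤ n stated above.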
4.2 Simulation Relation
Figure 2 shows the intended simulation relation between the warp semantics and the
interleaved thread semantics: Any reachable warp transition shall be matched by a finite
sequence of zero, one, or more interleaved thread transitions (written as ⇒i,∗). Based
on the simulation of one single warp transition, we extend the simulation to sequences
of reachable warp transitions, meaning that all warp executions are correct with regard
to the interleaved thread semantics.
    ϕ, η ▷ πΘ(W, ∆), µ   ⇒i,∗   πΘ′(W′, ∆′), µ′
            ↑ γ                       ↑ γ
    ϕ, η ▷ W, Θ, ∆, µ     ⇒w     W′, Θ′, ∆′, µ′
Fig. 2. Simulation of a reachable warp transition by a sequence of interleaved thread transitions
The simulation depends on the mutually recursive projection functions πθ, π¬θ :
WStack × DisableState → TStack defined in Table 5 for each θ ∈ Thread that
transform a warp instruction stack into a thread instruction stack. πθ projects active
threads, whereas π¬θ is used for inactive ones. The former simply outputs all statements
it encounters, replaces all brk and call tokens by brk and call contexts, removes all sync
and div tokens as they have no meaning for an individual thread, and deactivates the
thread whenever a div token is encountered by calling π¬θ. In the definition of πθ, the
active masks of the tokens on the stack do not have to be considered because of Lem. 1(4).
For an inactive thread, π¬θ skips all instructions until it encounters an activation token,
i.e., a token that reactivates the thread. The function actToken determines whether a
given token is a thread’s activation token:
actTokenθ(tΘ, δ) ↔ θ ∈ Θ ∧ awaits(δ, t) .
The projection functions πθ and π¬θ are combined into the curried thread stack
function πΘ : (WStack × DisableMask) → Thread → TStack with
πΘ(W, ∆)(θ) = πθ(W, ∆(θ))    if θ ∈ Θ
              π¬θ(W, ∆(θ))   otherwise ,
parameterized by Θ, which is used by the conversion function γ of Fig. 2 to turn warp
configurations into interleaved thread configurations.
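The projection functions of Table 5 translate into two mutually recursive functions. The following Python sketch is one possible transcription under an assumed encoding chosen for illustration: statements are strings, tokens are pairs of type and active mask, contexts are the strings "brk" and "call", and disable states are "0", "b", and "r". The function names are hypothetical.

```python
from typing import FrozenSet, List, Tuple, Union

Token = Tuple[str, FrozenSet[int]]
Item = Union[str, Token]  # a statement (str) or a token


def awaits(delta: str, t: str) -> bool:
    return (delta != "b" or t == "brk") and (delta != "r" or t == "call")


def act_token(theta: int, token: Token, delta: str) -> bool:
    """A token activates theta if theta is in its mask and theta awaits it."""
    t, mask = token
    return theta in mask and awaits(delta, t)


def project_active(theta: int, stack: List[Item], delta: str) -> List[str]:
    """Projection for an active thread."""
    out: List[str] = []
    for i, item in enumerate(stack):
        if isinstance(item, str):
            out.append(item)            # statements are kept
        elif item[0] in ("brk", "call"):
            out.append(item[0])         # tokens become contexts
        elif item[0] == "div":
            # A div token deactivates the thread for the rest of the stack.
            return out + project_inactive(theta, stack[i + 1:], delta)
        # sync tokens are dropped
    return out


def project_inactive(theta: int, stack: List[Item], delta: str) -> List[str]:
    """Projection for an inactive thread: skip up to the activation token."""
    for i, item in enumerate(stack):
        if not isinstance(item, str) and act_token(theta, item, delta):
            return project_active(theta, stack[i + 1:], "0")
    return []


def project(theta: int, active: FrozenSet[int], stack: List[Item], disable: dict) -> List[str]:
    """Curried thread stack function applied to a single thread."""
    if theta in active:
        return project_active(theta, stack, disable[theta])
    return project_inactive(theta, stack, disable[theta])
```

Applied to the configuration of Ex. 2 that follows the break step (active mask 0000, disable mask 0bb0), thread 1 is projected onto the statements and contexts below the brk token only, whereas thread 3 resumes directly after the div token.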
πθ(ε, δ) = ε
πθ(s W, δ) = s πθ(W, δ)
πθ(callΘ W, δ) = call πθ(W, δ)
πθ(brkΘ W, δ) = brk πθ(W, δ)
πθ(syncΘ W, δ) = πθ(W, δ)
πθ(divΘ W, δ) = π¬θ(W, δ)
π¬θ(ε, δ) = ε
π¬θ(s W, δ) = π¬θ(W, δ)
π¬θ(τ W, δ) = πθ(W, 0)    if actTokenθ(τ, δ)
              π¬θ(W, δ)   otherwise
Table 5. Definitions of the projection functions
The existence proof for the simulation relation relies on a series of lemmata that
relate warp configurations to interleaved thread configurations. The first lemma
establishes a relationship between the functions actToken and enable, ensuring that a warp
correctly activates inactive threads: It only activates inactive threads when it reaches
their activation token and conversely, inactive threads are reactivated once their
activation token is encountered.
Lemma 2. Let ϕ, η ▷ W, Θ, ∆, µ be a reachable warp configuration, θ ∉ Θ, and τ a
token in W. Then actTokenθ(τ, ∆(θ)) is true if and only if enable(∆, τ) = (Θ′, ∆′)
with θ ∈ Θ′.
Proof. Let δ = ∆(θ) for some inactive thread θ and let W contain some token τ = tΘ0.
If actTokenθ(tΘ0, δ) is true, then θ ∈ Θ0 and awaits(δ, t) holds. Consequently, it
follows that ∆′(θ) = 0 and therefore θ ∈ Θ′.
If, on the other hand, θ ∈ Θ′, we know that θ ∈ Θ0 and ∆′(θ) = 0. Therefore, either
δ = 0 and awaits(δ, t) holds or δ ≠ 0 and awaits(δ, t) holds anyway. In any case,
actTokenθ(τ, δ) holds as well. □
Using Lem. 2, we can show that all inactive threads of a reachable warp configura-
tion simulate an arbitrary operational warp transition by simply remaining idle. This is
because the instruction stacks of inactive threads remain unchanged.
Lemma 3. Let ϕ, η ▷ W, Θ, ∆, µ ⇒w W′, Θ′, ∆′, µ′ be a reachable warp transition
and let θ ∉ Θ. Then πΘ(W, ∆)(θ) = πΘ′(W′, ∆′)(θ).
Proof. Let δ = ∆(θ) and δ′ = ∆′(θ). In particular, πΘ(W, ∆)(θ) = π¬θ(W, δ).
For the rules (skipw), (assignw), (whilew), (breakw), and (returnw), W is of the
form S W0 and W′ = W0 (where W0 starts with a token for (breakw) and (returnw)).
The disable mask for inactive threads remains unchanged, i.e., δ = δ′, and no inactive
threads become active, i.e., θ ∉ Θ′. The projection π¬θ(S W0, δ) steps over S, hence
π¬θ(W, δ) = π¬θ(W0, δ) = π¬θ(W0, δ′) = π¬θ(W′, δ′) = πΘ′(W′, ∆′)(θ).
For the rules (ifw), (whilew), and (callw), W is of the form s W0 and W′ of the
form W1 W0. No inactive thread is in the active masks of the tokens generated in W1,
hence none of these tokens is an activation token for an inactive thread. Additionally,
no threads are activated by these rules, i.e., θ ∉ Θ′, and the disable mask remains
unchanged, i.e., δ = δ′. Consequently, the projections skip over the statements and
tokens referred to by the transition, that is, π¬θ(W, δ) = π¬θ(W0, δ) = π¬θ(W0, δ′) =
π¬θ(W′, δ′) = πΘ′(W′, ∆′)(θ).
For the rule (tokenw), W is of the form τ W0 and W′ = W0 with τ = tΘ0. The
projection π¬θ(τ W0, δ) removes the token and yields either πθ(W0, 0) or π¬θ(W0, δ),
depending on whether τ is θ’s activation token. We have to show that in either case this
matches πΘ′(W′, ∆′)(θ). If actTokenθ(τ, δ) is true, then θ ∈ Θ′ and δ′ = 0 according
to Lem. 2. Thus, π¬θ(W, δ) = π¬θ(τ W0, δ) = πθ(W0, 0) = πΘ′(W0, ∆′)(θ). However,
if actTokenθ(τ, δ) is false, we know that θ ∉ Θ′ because of Lem. 2. Moreover,
δ = δ′ because either θ ∉ Θ0 or awaits(δ, t) is false. Hence we have π¬θ(W, δ) =
π¬θ(τ W0, δ) = π¬θ(W0, δ) = πΘ′(W0, ∆′)(θ).
For the rule (inactw), all threads are and remain inactive, no changes are made to
the disable mask, and we have π¬θ(W, δ) = π¬θ(S τ W0, δ) = π¬θ(τ W0, δ′) =
π¬θ(W′, δ′) = πΘ′(W′, ∆′)(θ). □
For the active threads in a reachable warp configuration, we would also like to pro-
ceed by focusing on a single thread, showing that each single active thread can simu-
late a warp transition. Assignments, however, are a special case that requires all active
threads to be considered, as the order of conflicting writes to the same memory address
is undefined. Lemma 4 therefore covers the simulation of assignments separately, where
⇒i,+ denotes a finite sequence of one or more interleaved thread transitions. Lemma 5
covers the remaining cases focusing solely on one active thread; ⇒t,= denotes zero or
one thread transitions.
Lemma 4. Let ϕ, η ▷ W, Θ, ∆, µ ⇒w W′, Θ′, ∆′, µ′ be a reachable warp transition
using the rule (assignw). Then ϕ, η ▷ πΘ(W, ∆), µ ⇒i,+ πΘ′(W′, ∆′), µ′.
Proof. Since (assignw) is used, W = x = e; W0, W′ = W0, ∅ ≠ Θ = Θ′, and
∆ = ∆′. Each θ ∈ Θ can match this operation using its (assigntrd) rule, because
πθ(W, ∆(θ)) = πθ(x = e; W0, ∆(θ)) = x = e; πθ(W0, ∆(θ)). To simulate the
assignment, the interleaved threads first apply the rule (assigntrd) for all active threads
θ ∈ Θ, yielding T = x = v; πθ(W0, ∆(θ)) with v = E⟦e⟧ η(θ) µ for each θ. As the
threads of a warp evaluate the expression e in parallel before any writes are performed
on the memory, the interleaved semantics must simulate this behavior; otherwise,
assignments such as x = x + 1 could not result in the same memory updates for the
warp and the interleaved thread semantics as already mentioned in Sect. 4.1. After all
active threads have applied the rule (assigntrd), each active thread uses the rule
(assigntwr) on T to write the result of the expression evaluation to memory. Consequently,
T is transformed into T′ = πθ(W0, ∆(θ)) = πθ(W′, ∆′(θ)) for each θ. We now construct the
sequence of interleaved thread transitions such that it matches the order of the warp: The
last write to a variable is performed by a thread that writes the value ultimately written
by the warp, which in the end guarantees that both the warp and the interleaved
semantics change the memory in the same way. □
Lemma 5. Let ϕ, η ▷ W, Θ, ∆, µ ⇒w W′, Θ′, ∆′, µ′ be a reachable warp transition
not using the rule (assignw) and let θ ∈ Θ. Then ϕ, η(θ) ▷ πΘ(W, ∆)(θ), µ ⇒t,=
πΘ′(W′, ∆′)(θ), µ′.
Proof. Let δ = ∆(θ) and δ′ = ∆′(θ). We construct a simulating thread transition
ϕ, η(θ) ▷ πθ(W, δ), µ ⇒t πθ(W′, δ′), µ′ or ϕ, η(θ) ▷ πθ(W, δ), µ ⇒t π¬θ(W′, δ′), µ′.
We proceed by a case distinction on the rule applied for the warp transition. In the
following, W0 always refers to the remainder of the current statement that is omitted in
the rules due to our notational convention. Since only (assignw) modifies the memory,
we always have µ = µ′.
(skipw): As Θ = Θ′ and ∆ = ∆′, we can simply apply rule (skipt) to construct
the transition of θ, because πθ(W, δ) = πθ(; W0, δ) = ; πθ(W0, δ) and πθ(W′, δ′) =
πθ(W0, δ′) = πθ(W0, δ).
(ifw): If θ ∈ act(Θ, e, η, µ), then thread θ can use rule (ifttt) because the side
condition E⟦e⟧ η(θ) µ ≠ 0 is satisfied. Furthermore, the div token is θ’s activation token,
hence π¬θ(S2 divact(Θ,e,η,µ) S1 syncΘ W0, δ′) = S1 πθ(W0, δ) and θ does not execute
S2. If, on the other hand, θ ∈ Θ \ act(Θ, e, η, µ), then thread θ can use rule (iftff)
because E⟦e⟧ η(θ) µ = 0 is satisfied. Thus, πθ(S2 divact(Θ,e,η,µ) S1 syncΘ W0, δ′) =
S2 πθ(W0, δ′), as the div token deactivates the thread and the sync token reactivates it.
Consequently, θ does not execute S1.
(whilew): If θ ∈ act(Θ, e, η, µ), then thread θ simulates the warp by using the rule
(whilettt) since E⟦e⟧ η(θ) µ ≠ 0 is satisfied. Therefore, the thread executes the while
statement because the projection πθ(S while (e) S brkΘ W0, δ′) yields the thread
statement S while (e) S brk πθ(W0, δ′). If, on the other hand, θ ∉ act(Θ, e, η, µ),
then thread θ can use rule (whiletff) because the side condition E⟦e⟧ η(θ) µ = 0 is
satisfied. The thread skips the while statement since π¬θ(S while (e) S brkΘ W0, δ′)
yields πθ(W0, δ′) and the brk token is the thread’s activation token.
(whilew): If θ ∈ act(Θ, e, η, µ), then thread θ simulates the warp using the rule
(whilettt) because E⟦e⟧ η(θ) µ ≠ 0 is satisfied. Therefore, the thread executes the while
statement because πθ(S while (e) S W0, δ′) yields S while (e) S πθ(W0, δ′).
If, on the other hand, θ ∉ act(Θ, e, η, µ), then thread θ can use rule (whiletff) because
the side condition E⟦e⟧ η(θ) µ = 0 is satisfied. Due to Lem. 1(2), we know that W0 =
brkΘ¯ W1 and that the brk token is θ’s activation token. Thus, the thread skips the while
statement because π¬θ(S while (e) S brkΘ¯ W1, δ′) = πθ(W1, δ′).
(breakw): Such a warp transition is matched by the thread rule (breakt). As the warp
transition is reachable, we know that the execution of the warp started with an initial
warp configuration for some program P ∈ Prog where break; statements only occur
within while loops. We therefore know that at least one brk token must be on the
stack if the warp ever encounters a break; instruction. Consequently, τ is either a
brk token or W0 = W1 brkΘ¯ W2 and W1 does not contain any call or brk tokens. In
the former case the statement lists [S]θ and [S]W of both the thread and the warp are
identical, whereas in the latter case it follows that [S]θ = πθ([S]W τ W1, δ). All active
threads execute the break;, so they are deactivated and their disable state is set to
b. Consequently, the projection of W′ skips all statements and tokens until it reaches
the first brk token and thus π¬θ(W′, δ′) = πθ(W0, δ′) or π¬θ(W′, δ′) = πθ(W2, δ′),
depending on whether τ is a brk token.
(callw): Again, Θ = Θ′ and ∆ = ∆′. Thread θ can apply its (callt) rule to match the
operation by the warp, as πθ(W0, δ) = πθ(W0, δ′) and therefore πθ(ϕ(f) callΘ W0, δ′)
returns ϕ(f) call πθ(W0, δ′).
(returnw): Such a warp transition is matched by the thread rule (returnt). There is at
least one call token on the stack because the warp configuration is reachable and so the
execution of the warp began with an initial warp configuration, and warps always have at
least one call token on the stack as long as they are being executed. We therefore know
that either τ is a call token or W0 = W1 callΘ¯ W2 and W1 does not contain any call
tokens. In the former case we have T¬call = [S], whereas in the latter case it follows
that T¬call = πθ([S] τ W1, δ). All active threads execute the return;, so they are
deactivated and their disable state is set to r. Consequently, the projection of W′ skips
all statements and tokens until it reaches the first call token and thus π¬θ(W′, δ′) =
πθ(W0, δ′) or π¬θ(W′, δ′) = πθ(W2, δ′), depending on whether τ is a call token.
(tokenw): If θ is deactivated by τ and thus θ ∉ Θ′, Lem. 1(4) applies and
consequently τ must be a div token. Furthermore, W0 = S1 syncΘ¯ W1, and θ ∈ Θ¯,
that is, the sync token is θ’s activation token according to Lem. 1(1). Consequently,
πθ(W, δ) = π¬θ(W′, δ′) = πθ(W1, δ′) and therefore θ skips statement S1 and
continues executing (the projected) W1. Hence, a thread θ that is deactivated by τ does not
perform a transition in the interleaved semantics. If, on the other hand, θ remains
active, there are three possibilities (cf. Lem. 1(3)):
πθ(τ W0, δ) = call πθ(W0, δ)   if τ is a call token
              brk πθ(W0, δ)    if τ is a brk token
              πθ(W0, δ)        if τ is a sync token
In the former two cases, θ can get rid of the context using the rule (contextt). In the
latter case, θ does not perform a transition in the interleaved semantics as πθ(W, δ) =
πθ(W0, δ) = πθ(W′, δ′).
(inactw): Trivial as there are no active threads. □
The combination of Lemmata 3, 4, and 5 proves that there always exists a sequence
of interleaved thread transitions to simulate some arbitrary reachable warp transition.
This result is summarized in the following proposition, completing the proof of the
simulation relation shown in Fig. 2.
Proposition 1. Let ϕ, η ▷ W, Θ, ∆, µ ⇒w W′, Θ′, ∆′, µ′ be a reachable warp
transition. Then ϕ, η ▷ πΘ(W, ∆), µ ⇒i,∗ πΘ′(W′, ∆′), µ′.
The full simulation result follows from Prop. 1 by induction: All sequences of
reachable warp transitions can be simulated by sequences of interleaved thread transitions.
Theorem 1 (Simulation of the SIMT Execution Model). Let ϕ, η ▷ W, Θ, ∆, µ ⇒w,∗
W′, Θ′, ∆′, µ′ be a sequence of reachable warp transitions. Then ϕ, η ▷ πΘ(W, ∆), µ
⇒i,∗ πΘ′(W′, ∆′), µ′.
From Thm. 1 it follows directly that all threads simulating the execution of a warp
terminate once the warp terminates, that is, they have fully executed the program.
Lemma 3 shows that the instruction stacks of inactive threads do not change, hence
ensuring that inactive threads do not skip any instructions. Additionally, Lemmata 2
and 3 guarantee that inactive threads are not left behind if the warp pops their activation
token off the stack. However, the SIMT execution model cannot ensure that all inactive
threads will eventually be reactivated, even though the call token at the bottom of the
stack is an activation token for all threads: In some cases, there are tokens on the instruc-
tion stack that are never reached again; the instruction stack is continuously modified
without shrinking below a certain threshold. Theorem 1 holds even for non-terminating
programs, hence the interleaved thread semantics is still able to simulate the warp ex-
ecution. Obvious reasons for non-termination are bugs causing non-terminating loops
or infinite recursion; there is, however, a more fundamental problem with the SIMT
execution model: unfairness.
5 Unfairness of the SIMT Execution Model
Divergent threads within a warp must be scheduled and executed one after another. To-
day’s GPUs use an unfair scheduling strategy in the sense that one of the divergent paths
is fully executed before execution of the second path begins; if the first one does not ter-
minate, the second one is not executed at all. For some programs, this unfair scheduling
strategy makes it impossible for the warp to eventually terminate, even though in the
interleaved semantics all weakly fair schedules terminate (where weak fairness means
that no thread is left behind indefinitely).
The SIMT execution model is not part of CUDA’s or OPENCL’s specification [12,
23]; instead, it is considered an implementation detail that programmers can “essen-
tially ignore” for “the purposes of correctness” according to NVIDIA [23, p. 62]. Our
findings in the preceding section support this statement insofar as they formally show
that warps execute control flow instructions as if each thread executed them individ-
ually in some schedule. The correctness of the SIMT execution model (in the sense
of simulatability) is therefore unaffected by fairness considerations. On the other hand,
unfairness potentially affects program termination and thus program correctness, which
may be the reason for the qualifying “essentially” in NVIDIA’s statement.
5.1 Programs Affected by the Unfairness Problem
We first illustrate the problem of unfairness with two example programs before dis-
cussing it more generally: Suppose that in Prog. 2, the variable lock is shared among
all threads of the warp with an initial value of 0, whereas tid stores the id of each
thread. Execution of the program terminates if the interleaved thread semantics chooses
a fair schedule; namely, it eventually executes the thread with id 0, causing the condi-
tion of the loop to evaluate to false for all other threads, which then terminate. A warp,
on the other hand, schedules the else-path before allowing thread 0 to set lock to 1.
The loop therefore never terminates, thereby preventing the program from terminating.
If the hardware were to execute the if-path first or if the conditional statement were
reversed, the program would successfully terminate.
Program 3 (with lock and tid as above) also does not terminate when executed
by a warp, although the unfairness has a different cause in this case. As the warp first
encounters the loop, the condition evaluates to true for all threads except for thread 0. As
the warp chooses the immediate post-dominator of the loop as the reconvergence point,
thread 0 is not allowed to continue execution. Instead, the warp continuously executes
the remaining threads, which never leave the loop as thread 0 never increments the
value of lock. Again, a fair interleaved schedule would eventually allow each thread
to increment lock, resulting in a successful termination of the program.
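The difference between the two schedules for Prog. 2 can be reproduced with a small hand-written model. The following Python sketch is an illustration under strong simplifying assumptions and does not implement the formal semantics: it hard-codes the control flow of Prog. 2 (thread 0 sets lock, all other threads spin on it), bounds every run by a step budget max_steps, and uses hypothetical function names.

```python
def run_warp_else_first(n_threads: int, max_steps: int) -> bool:
    """Warp-like schedule: the else-path runs to completion before the if-path.
    Returns True if the program terminates within the step budget."""
    lock = 0
    spinning = set(range(1, n_threads))  # threads in the else-path
    steps = 0
    while spinning and steps < max_steps:
        # All spinning threads re-evaluate the loop condition in lockstep.
        if lock == 1:
            spinning.clear()
        steps += 1
    if spinning:
        return False  # budget exhausted; the if-path was never scheduled
    lock = 1          # only now would thread 0 execute the if-path
    return True


def run_round_robin(n_threads: int, max_steps: int) -> bool:
    """Fair interleaved schedule: every unfinished thread gets a turn."""
    lock = 0
    done = [False] * n_threads
    steps = 0
    while not all(done) and steps < max_steps:
        for tid in range(n_threads):
            if done[tid]:
                continue
            if tid == 0:
                lock = 1          # if-path
                done[tid] = True
            elif lock == 1:
                done[tid] = True  # loop condition is false, thread leaves
            steps += 1
    return all(done)
```

In this model, the warp-like schedule exhausts any step budget as soon as at least one thread takes the else-path, whereas the round-robin schedule finishes after thread 0 has had its first turn.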
if (tid == 0)
  lock = 1;
else
  while (lock != 1) {}
Program 2. Unfair scheduling of divergent branches

while (lock != tid) {}
// ...
++lock;
Program 3. Reconverging at the immediate post-dominator results in unfair schedules

while (next != 32) {
  if (tid == next)
    ++next;
}
Program 4. Lem. 6 is not a necessary condition for warp termination

Programs affected by the unfairness problem are uncommon in practice. Particularly,
Prog. 2 uses shared variables without any means of synchronization in both paths
of an if statement, which is generally considered bad programming practice. Even if
the code was not affected by the unfairness problem, it exploits the implicit knowledge
about the sequential execution of the paths and might therefore break on future
hardware if this assumption is no longer valid. Busy-loops like the one in Prog. 3 are often
used in an attempt to implement global synchronization mechanisms that all NVIDIA
GPUs are currently lacking [23]. Global synchronization is in fact impossible to
implement, though not because of unfairness issues: A compute kernel might be executed by
more threads than the hardware is capable of allocating concurrently, hence threads at
the synchronization point might be waiting for threads that do not even exist yet and
cannot be allocated, resulting in a deadlock.
The following lemma provides a sufficient criterion for programs that are unaffected
by the unfairness problem. It is based on a new kind of thread transition ⇝tη defined as
    ϕ, η(θ) ▷ T, µ ⇒t T′, µ′
    ───────────────────────────
    ϕ, η(θ) ▷ T, µ ⇝tη T′, µ′′
    where ∀a ∈ Addr . a ∉ saddr(η) → µ′′(a) = µ′(a)
with saddr(η) denoting the set of addresses shared among the threads of thread
environment η. Such a thread transition makes arbitrary changes to the contents of all
shared addresses. If all possible sequences of ⇝tη transitions terminate for all threads,
the warp execution is guaranteed to terminate as well. Assume for a contradiction that
the warp execution does not terminate. Then by Thm. 1 there is an infinite sequence of
⇒t transitions with a corresponding infinite sequence of ⇝tη for at least one thread.
Lemma 6. Let ϕ, η ▷ P callΘ, Θ, {Θ ↦ 0}, µ0 be an initial warp configuration. If
there is no infinite sequence ϕ, η(θ) ▷ P call, µ0 ⇝tη T1, µ1 ⇝tη T2, µ2 ⇝tη . . . for all
θ ∈ Θ, then ϕ, η ▷ P callΘ, Θ, {Θ ↦ 0}, µ0 ⇒w,∗ ε, Θ′, ∆′, µ′.
Lemma 6 is only a sufficient condition for warp termination, but not a necessary
one as exemplified by Prog. 4. Assuming that the shared variable next is initialized to
0, tid stores each thread’s id, and the warp size is 32, the warp execution terminates:
The condition of the if statement eventually evaluates to true for all threads, so next
is equal to 32 at some point and the loop terminates. By contrast, a sequence of ⇝tη
transitions that always resets next to 0 obviously never terminates.
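Both behaviors of Prog. 4 can be observed in a small model. The following Python sketch is an illustration only and relies on assumptions not taken from the formal semantics: the lockstep run hard-codes the control flow of Prog. 4, the adversary is one particular choice of arbitrary changes to the shared variable next (it resets next to 0 after every iteration), and non-termination is approximated by a step budget. The function names are hypothetical.

```python
def run_prog4_lockstep(warp_size: int = 32):
    """Lockstep execution of Prog. 4: returns (final next, loop iterations)."""
    nxt = 0
    iterations = 0
    # next is shared, so the loop condition is the same for all threads.
    while nxt != warp_size:
        # Exactly the threads with tid == next take the if-path.
        writers = [tid for tid in range(warp_size) if tid == nxt]
        if writers:
            nxt = nxt + 1
        iterations += 1
    return nxt, iterations


def single_thread_with_adversary(tid: int, warp_size: int = 32, max_steps: int = 10_000) -> bool:
    """One thread of Prog. 4 where shared memory is overwritten after each step.
    Returns True if the thread leaves the loop within the step budget."""
    nxt = 0
    for _ in range(max_steps):
        if nxt == warp_size:
            return True
        if tid == nxt:
            nxt += 1
        nxt = 0  # adversarial change to the shared address
    return False
```

The lockstep run leaves the loop after 32 iterations with next equal to 32, while the single thread under the resetting adversary never observes next = 32 within the budget.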
In practice, however, infinite sequences of ⇝tη transitions are rarely caused by
shared variables: Most compute kernels do not use shared variables in a way that affects
loops or recursion and graphics APIs have only recently introduced shared variables or
atomic operations for some shader types [11, 21].
5.2 Unfairness of Alternative SIMT Execution Models
Several alternative implementations of the SIMT execution model have been proposed,
be it for performance reasons or generality [4, 6, 7, 17]. A stack-less approach, for in-
stance, replaces the warp’s reconvergence stack by a set of program counters, one for
each thread and updated appropriately, that the warp uses to handle reconvergence. We
formalize this stack-less warp semantics ⇒w as follows, where the abstract function
schedule : (Thread → TStack) → 2^Thread selects a set of threads with the same PC,
that is, a set of threads for which ∀θ, θ′ ∈ schedule(ς) . ς(θ) = ς(θ′) holds:
    (ϕ, η(θ) ▷ ς(θ), µ ⇒t T′θ, µ′θ)θ∈Θ
    ──────────────────────────────────────────────────────────────
    ϕ, η ▷ ς, µ ⇒w ς{(θ ↦ T′θ)θ∈Θ}, µ{(a ↦ µ′θ(a))θ∈Θ,a∈Addr}
    where schedule(ς) = Θ
Collange [4] suggests a similar stack-less approach. Unlike Collange’s approach, our
stack-less semantics ⇒w does not consider (function call) stack pointers when selecting the
next PC to execute, as that only affects performance but does not influence correctness
or fairness. As reconvergence is based on equality of program counters, the fairness of
⇒w and Collange’s approach depends on the fairness of the choice function schedule.
In particular, Collange’s lowest program counter scheduling policy makes the overall
mechanism unfair.
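The dependence on the choice function can be made concrete with a toy model. The Python sketch below is an illustration under stated assumptions, not Collange’s mechanism or the formal semantics above: it uses a hypothetical five-instruction program in which thread 0 jumps ahead to set a shared flag while all other threads spin on that flag at a lower PC. The functions lowest_pc and oldest_first are assumed stand-ins for schedule; they additionally receive a waiting-time map, which the abstract function in the text does not have.

```python
HALT = 4  # PC value meaning "thread has finished"


def step(pc, tid, mem):
    """Execute one instruction of the hypothetical program for one thread.

    0: if tid == 0 goto 3 else goto 1
    1: if flag == 0 goto 1 else goto 2   (spin on the shared flag)
    2: goto HALT
    3: flag = 1; goto HALT
    """
    if pc == 0:
        return 3 if tid == 0 else 1
    if pc == 1:
        return 1 if mem["flag"] == 0 else 2
    if pc == 2:
        return HALT
    if pc == 3:
        mem["flag"] = 1
        return HALT
    raise ValueError(f"unexpected PC {pc}")


def lowest_pc(pcs, waited):
    """Select all live threads that share the lowest PC."""
    live = {t: pc for t, pc in pcs.items() if pc != HALT}
    target = min(live.values())
    return {t for t, pc in live.items() if pc == target}


def oldest_first(pcs, waited):
    """Select all live threads sharing the PC of the longest-waiting thread."""
    live = {t: pc for t, pc in pcs.items() if pc != HALT}
    oldest = max(live, key=lambda t: waited[t])
    return {t for t, pc in live.items() if pc == live[oldest]}


def run(schedule, num_threads=4, max_steps=100):
    """Run the stack-less warp; return the step count or None on budget exhaustion."""
    pcs = {t: 0 for t in range(num_threads)}
    waited = {t: 0 for t in range(num_threads)}
    mem = {"flag": 0}
    steps = 0
    while any(pc != HALT for pc in pcs.values()):
        if steps == max_steps:
            return None
        chosen = schedule(pcs, waited)
        for t in pcs:
            if t in chosen:
                pcs[t] = step(pcs[t], t, mem)
                waited[t] = 0
            elif pcs[t] != HALT:
                waited[t] += 1
        steps += 1
    return steps


if __name__ == "__main__":
    print("lowest PC first:", run(lowest_pc))
    print("oldest first:", run(oldest_first))
```

Under the lowest-PC policy the spinning threads are selected in every step, so thread 0 never gets to set the flag and the run exhausts its budget. The oldest-first policy eventually selects thread 0, after which all threads finish.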
Fung et al. propose a stack-less technique for more than one warp: dynamic warp
formation [7]. The SIMT core dynamically regroups all of its threads with the same
PC into one or more warps. Fairness depends on the warp scheduling policy; of the
five suggested policies, only DTime, which selects the oldest warp, is generally fair. Thread
block compaction [6] is another approach proposed by Fung et al. It reintroduces the
reconvergence stack, albeit at the thread block level. The stack is used for block-wide
synchronization at divergent branches and reconvergence points; divergent warps are
regrouped into non-divergent ones, restoring the original warp groupings upon encountering
the reconvergence point. Because of this synchronization, thread block compaction
also suffers from the unfairness problem.
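The regrouping step of dynamic warp formation, collecting all threads with the same PC into one or more warps, can be sketched as follows. This Python fragment is a simplified illustration and not the hardware mechanism of Fung et al.: the function name form_warps, the representation of threads as a map from thread id to PC, and the ordering of the resulting warps are assumptions, and any hardware constraints on regrouping are ignored.

```python
from collections import defaultdict


def form_warps(pcs, warp_size):
    """Group thread ids by PC and split each group into warps of bounded size.

    `pcs` maps thread id -> program counter. The result is a list of
    (pc, [thread ids]) pairs; every warp contains only threads with equal PC.
    """
    by_pc = defaultdict(list)
    for tid in sorted(pcs):
        by_pc[pcs[tid]].append(tid)

    warps = []
    for pc in sorted(by_pc):
        tids = by_pc[pc]
        for start in range(0, len(tids), warp_size):
            warps.append((pc, tids[start:start + warp_size]))
    return warps


if __name__ == "__main__":
    example = {0: 10, 1: 20, 2: 10, 3: 10, 4: 20}
    for pc, tids in form_warps(example, warp_size=2):
        print(pc, tids)
```

Which of the formed warps is issued next is left open here; as noted above, that choice is what determines fairness.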
Meng et al. [17] propose dynamic warp subdivision where warps are dynamically
subdivided on branch (or memory) divergence. Each so-called warp-split is individually
scheduled, so the execution of divergent paths is interleaved. Additionally,
threads might reconverge at some point past the immediate post-dominator, reuniting
the warp-splits. Their approach consequently has the potential of solving the unfairness
problem; in practice, however, unfairness is still an issue as warp subdivision is only
allowed on statically determined “appropriate” branches in order to avoid undesirable
over-subdivision.
6 Conclusions and Future Work
The single instruction, multiple threads execution model used by today’s GPUs groups
threads into batches that execute a compute kernel in lockstep, requiring a special mech-
anism to efficiently and correctly handle divergent control flow. Our formalization of
the SIMT execution model allows us to prove its correctness in the sense that each
SIMT execution corresponds to a standard interleaved multi-thread execution for some
schedule. SIMT execution potentially affects program termination, however, as diver-
gent threads are scheduled in an unfair way. Some alternative implementations of the
SIMT execution model also exhibit this unwanted behavior.
As more and more GPU-accelerated algorithms are used in safety- or security-
critical applications such as medical imaging [8,20], the importance of formally verified
program correctness increases. In particular, GPUs are capable of accelerating model
checking algorithms that in turn are used in formal analyses of various problems in a
wide range of application domains [2, 3]. Our work establishes the semantic validity of
the underlying SIMT execution model, contributing to the development of formal ver-
ification tools and mechanisms for GPU-based applications. We plan to use a theorem
prover to verify correctness and other properties of GPU-based programs.
Several research papers propose changes to the SIMT execution model in order to
improve efficiency and performance. While the main point of interest is performance
for the time being, new mechanisms should also explore the possibilities of solving the
unfairness problem to avoid unexpected non-termination, especially since the current
trend is the unification of the CPU and GPU programming models: For example, the
CUDA 4.1 compiler is based on the LLVM compiler infrastructure [14] with the intention
of allowing CUDA programs to run on either the GPU or the CPU [24]. In order to make
the verification of program correctness independent of the execution model, we plan to
study stronger criteria for the preservation of termination and other liveness properties
when the underlying hardware uses the SIMT execution model instead of a weakly
fair multi-thread semantics. Furthermore, it might be worthwhile to check whether our
findings can be generalized to the SIMD execution models found in some contemporary
CPU architectures.
References
1. AMD. Evergreen Family Instruction Set Architecture, 2011. Reference Guide.
2. J. Barnat, L. Brim, M. Ceska, and T. Lamr. CUDA Accelerated LTL Model Checking. In
Proc. 15th Int. Conf. Parallel and Distributed Systems (ICPADS’09), pages 34–41, 2009.
3. D. Bošnački, S. Edelkamp, D. Sulewski, and A. Wijs. GPU-PRISM: An Extension of PRISM
for General Purpose Graphics Processing Units. In Proc. 9th Int. Wsh. Parallel and Distributed
Methods in Verification (PDMV’10), pages 17–19, 2010.
4. S. Collange. Stack-less SIMT Reconvergence at Low Cost. Technical Report HAL-
00622654, INRIA, 2011.
5. B. W. Coon, J. R. Nickolls, L. Nyland, P. C. Mills, and J. E. Lindholm. Indirect Function
Call Instructions in a Synchronous Parallel Thread Processor, 2009. United States Patent
Application #2009/0240931.
6. W. W. L. Fung and T. M. Aamodt. Thread Block Compaction for Efficient SIMT Control
Flow. In Proc. 17th IEEE Int. Symp. High Performance Computer Architecture (HPCA’11),
pages 25–36, 2011.
7. W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Schedul-
ing for Efficient GPU Control Flow. In Proc. 40th Ann. IEEE/ACM Int. Symp. Microarchi-
tecture (MICRO’07), pages 407–420, 2007.
8. M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips,
Y. Zhang, and V. Volkov. Parallel Computing Experiences with CUDA. IEEE Micro,
28(4):13–27, 2008.
9. A. Habermaier. The Model of Computation of CUDA and its Formal Semantics. Technical
Report 2011-14, University of Augsburg, 2011.
10. J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Elsevier
Science & Technology, 5th edition, 2011.
11. Khronos Group Inc. The OpenGL Shading Language 4.20, 2011. Revision 6.
12. Khronos OpenCL Working Group. The OpenCL Specification 1.2, 2011. Revision 15.
13. A. Levinthal and T. Porter. Chap – A SIMD Graphics Processor. SIGGRAPH Comput.
Graph., 18:77–82, 1984.
14. The LLVM Compiler Infrastructure. http://www.llvm.org/ (01/04/2012).
15. M. Mantor and M. Houston. AMD Graphic Core Next: Low Power High Performance Graph-
ics & Parallel Compute, 2011. Presentation at the AMD Fusion Developer Summit.
16. W. Mark. Future Graphics Architectures. ACM Queue, 6:54–64, 2008.
17. J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch
and Memory Divergence Tolerance. In Proc. 37th Ann. Int. Symp. Computer Architecture
(ISCA’10), pages 235–246, 2010.
18. S. Moy and J. E. Lindholm. Method and System for Programmable Pipelined Graphics
Processing with Branching Instructions, 2005. United States Patent #6,947,047.
19. S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Pub-
lishers Inc., 1997.
20. J. R. Nickolls and W. Dally. The GPU Computing Era. IEEE Micro, 30(2):56–69, 2010.
21. NVIDIA. DirectCompute Programming Guide 3.2, 2010.
22. NVIDIA. cuobjdump, 2011. CUDA Toolkit 4.1.
23. NVIDIA. NVIDIA CUDA C Programming Guide 4.1, 2011.
24. NVIDIA. NVIDIA Opens Up CUDA Platform by Releasing Compiler Source Code, 2011.
http://tiny.cc/NvidiaLLVM (01/04/2012).
25. J. C. Reynolds. Theories of Programming Languages. Cambridge University Press, 1998.
A List of Symbols
x ∈ Var                       Variable identifier
e ∈ Expr                      Side-effect free arithmetical expression over Var
f ∈ Func                      Function identifier
θ ∈ Thread                    Thread identifier
ϕ ∈ FuncEnv                   Function environment
ν ∈ VarEnv                    Variable environment
η ∈ ThreadEnv                 Thread environment
µ ∈ Mem                       Memory
Θ ∈ 2^Thread                  Active mask with Θ+ ≠ ∅
∆ ∈ DisableMask               Disable mask
δ ∈ DisableState              Disable state
t ∈ {sync, div, brk, call}    Token type
τ = t_Θ                       Token
c ∈ {brk, call}               Context
Table 6. Semantic domains
s ∈ Stm       Statement
S             Non-empty list of statements
P ∈ Prog      Program
W ∈ WStack    Warp instruction stack
T ∈ TStack    Thread instruction stack
T_¬call       Possibly empty list of thread instructions without any call contexts
ς             Thread stack function
Table 7. Statements
ϕ, η ▷ W, Θ, ∆, µ    Warp configuration
ϕ, ν ▷ T, µ          Thread configuration
ϕ, η ▷ ς, µ          Interleaved thread configuration
⇒w                   Warp transition
⇒t                   Thread transition
⇒i                   Interleaved thread transition
Table 8. Configurations and transitions
