Stateless Model Checking for POWER by Abdulla, Parosh Aziz et al.
Stateless Model Checking for POWER
Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, and Carl Leonardsson
Dept. of Information Technology, Uppsala University, Sweden
arXiv:1605.02185v1 [cs.LO], 7 May 2016
Abstract. We present the first framework for efficient application of stateless
model checking (SMC) to programs running under the relaxed memory model of
POWER. The framework combines several contributions. The first contribution is
that we develop a scheme for systematically deriving operational execution mod-
els from existing axiomatic ones. The scheme is such that the derived execution
models are well suited for efficient SMC. We apply our scheme to the axiomatic
model of POWER from [7]. Our main contribution is a technique for efficient
SMC, called Relaxed Stateless Model Checking (RSMC), which systematically
explores the possible inequivalent executions of a program. RSMC is suitable for
execution models obtained using our scheme. We prove that RSMC is sound and
optimal for the POWER memory model, in the sense that each complete program
behavior is explored exactly once. We show the feasibility of our technique by
providing an implementation for programs written in C/pthreads.
1 Introduction
Verification and testing of concurrent programs is difficult, since one must consider
all the different ways in which parallel threads can interact. To make matters worse,
current shared-memory multicore processors, such as Intel’s x86, IBM’s POWER,
and ARM [28,44,27,8], achieve higher performance by implementing relaxed memory
models that allow threads to interact in even subtler ways than by interleaving of
their instructions, as would be the case in the model of sequential consistency (SC) [31].
Under the relaxed memory model of POWER, loads and stores to different memory lo-
cations may be reordered by the hardware, and the accesses may even be observed in
different orders on different processor cores.
Stateless model checking (SMC) [24] is one successful technique for verifying
concurrent programs. It detects violations of correctness by systematically exploring
the set of possible program executions. Given a concurrent program that is terminating
and deterministic within each thread (e.g., by fixing any input data to avoid data
nondeterminism), a special runtime scheduler drives the SMC exploration by controlling
decisions that may affect subsequent computations, so that the exploration covers
all possible executions. The technique is automatic, has no false positives, can be applied
directly to the program source code, and can easily reproduce detected bugs. SMC
has been successfully implemented in tools such as VeriSoft [25], CHESS [36],
Concuerror [16], rInspect [48], and Nidhugg [1].
However, SMC suffers from the state-space explosion problem, and must therefore
be equipped with techniques to reduce the number of explored executions. The most
prominent one is partial order reduction [46,38,23,17], adapted to SMC as dynamic
partial order reduction (DPOR) [2,22,42,39]. DPOR addresses state-space explosion
caused by the many possible ways to schedule concurrent threads. DPOR retains full
behavior coverage, while reducing the number of explored executions by exploiting
that two schedules which induce the same order between conflicting instructions will
induce equivalent executions. DPOR has been adapted to the memory models TSO and
PSO [1,48], by introducing auxiliary threads that induce the reorderings allowed by
TSO and PSO, and using DPOR to counteract the resulting increase in thread schedul-
ings.
In spite of impressive progress in SMC techniques for SC, TSO, and PSO, there is
so far no effective technique for SMC under more relaxed models, such as POWER. A
major reason is that POWER allows more aggressive reorderings of instructions within
each thread, as well as looser synchronization between threads, making it significantly
more complex than SC, TSO, and PSO. Therefore, existing SMC techniques for SC,
TSO, and PSO cannot be easily extended to POWER.
In this paper, we present the first SMC algorithm for programs running under the
POWER relaxed memory model. The technique is both sound, in the sense that it is
guaranteed to explore each programmer-observable behavior at least once, and optimal, in
the sense that it does not explore the same complete behavior twice. Our technique
combines solutions to several major challenges.
The first challenge is to design an execution model for POWER that is suitable
for SMC. Existing execution models fall into two categories. Operational models, such
as [20,41,40,11], define behaviors as resulting from sequences of small steps of an
abstract processor. Basing SMC on such a model would induce large numbers of ex-
ecutions with equivalent programmer-observable behavior, and it would be difficult
to prevent redundant exploration, even if DPOR techniques are employed. Axiomatic
models, such as [7,35,6], avoid such redundancy by being defined in terms of an ab-
stract representation of programmer-observable behavior, due to Shasha and Snir [43],
here called Shasha-Snir traces. However, being axiomatic, they judge whether an ex-
ecution is allowed only after it has been completed. Directly basing SMC on such a
model would lead to much wasted exploration of unallowed executions. To address this
challenge, we have therefore developed a scheme for systematically deriving execution
models that are suitable for SMC. Our scheme derives an execution model, in the form
of a labeled transition system, from an existing axiomatic model, defined in terms of
Shasha-Snir traces. Its states are partially constructed Shasha-Snir traces. Each transi-
tion adds (“commits”) an instruction to the state, and also equips the instruction with a
parameter that determines how it is inserted into the Shasha-Snir trace. The parameter
of a load is the store from which it reads its value. The parameter of a store is its position
in the coherence order of stores to the same memory location. The order in which in-
structions are added must respect various dependencies between instructions, such that
each instruction makes sense at the time when it is added. For example, when adding a
store or a load instruction, earlier instructions that are needed to compute which mem-
ory address it accesses must already have been added. Our execution model therefore
takes as input a partial order, called commit-before, which constrains the order in which
instructions can be added. The commit-before order should be tuned to suit the given
axiomatic memory model. We define a condition of validity for commit-before orders,
under which our derived execution model is equivalent to the original axiomatic one, in
that they generate the same sets of Shasha-Snir traces. We use our scheme to derive an
execution model for POWER, equivalent to the axiomatic model of [7].
Having designed a suitable execution model, we address our main challenge, which
is to design an effective SMC algorithm that explores all Shasha-Snir traces that can
be generated by the execution model. We address this challenge by a novel exploration
technique, called Relaxed Stateless Model Checking (RSMC). RSMC is suitable for
execution models in which each instruction can be executed in many ways with different
effects on the program state, such as those derived using our execution model
scheme. The exploration by RSMC combines two mechanisms: (i) RSMC considers
instructions one-by-one, respecting the commit-before order, and explores the effects
of each possible way in which the instruction can be executed. (ii) RSMC monitors the
generated execution for data races from loads to subsequent stores, and initiates alter-
native explorations where instructions are reordered. We define the property deadlock
freedom of execution models, meaning intuitively that no run will block before being
complete. We prove that RSMC is sound for deadlock free execution models, and that
our execution model for POWER is indeed deadlock free. We also prove that RSMC
is optimal for POWER, in the sense that it explores each complete Shasha-Snir trace
exactly once. Similar to sleep set blocking for classical SMC/DPOR, it may happen for
RSMC that superfluous incomplete Shasha-Snir traces are explored. Our experiments
indicate, however, that this is rare.
To demonstrate the usefulness of our framework, we have implemented RSMC in
the stateless model checker Nidhugg [32]. For test cases written in C with pthreads, it
explores all Shasha-Snir traces allowed under the POWER memory model, up to some
bounded length. We evaluate our implementation on several challenging benchmarks.
The results show that RSMC efficiently explores the Shasha-Snir traces of a program,
since (i) on most benchmarks, our implementation performs no superfluous exploration
(as discussed above), and (ii) the running times correlate with the number of Shasha-Snir
traces of the program. We show the competitiveness of our implementation by comparing
with an existing state-of-the-art analysis tool for POWER: goto-instrument [4].
Outline. The next section presents our derivation of execution models. Section 3
presents our RSMC algorithm, and Section 4 presents our implementation and exper-
iments. Proofs of all theorems, and formal definitions, are provided in the appendix.
Our implementation is available at [32].
2 Execution Model for Relaxed Memory Models
POWER — a Brief Glimpse. The programmer-observable behavior of POWER mul-
tiprocessors emerges from a combination of many features, including out-of-order
and speculative execution, various buffers, and caches. POWER provides significantly
weaker ordering guarantees than, e.g., SC and TSO.
We consider programs consisting of a number of threads, each of which runs
deterministic code, built as a sequence of assembly instructions. The grammar of our
assumed language is given in Fig. 1. The threads access a shared memory, which is a
〈prog〉    ::= 〈varinit〉∗ 〈thrd〉+
〈varinit〉 ::= 〈var〉 '=' Z
〈thrd〉    ::= 'thread' 〈tid〉 ':' 〈linstr〉+
〈linstr〉  ::= 〈label〉 ':' 〈instr〉 ';'
〈instr〉   ::= 〈reg〉 ':=' 〈expr〉           | // register assignment
              'if' 〈expr〉 'goto' 〈label〉  | // conditional branch
              〈reg〉 ':=' '[' 〈expr〉 ']'   | // memory load
              '[' 〈expr〉 ']' ':=' 〈expr〉  | // memory store
              'sync' | 'lwsync' | 'isync'   // fences
〈expr〉    ::= (arithmetic expression over literals and registers)
Fig. 1. The grammar of concurrent programs
x = 0   y = 0
thread P:            thread Q:
L0: r0 := x;         L2: r1 := y;
L1: y := r0+1;       L3: x := 1;

Trace over the events L0-L3, with edges:
  L0 --po,data--> L1      L2 --po--> L3
  L3 --rf--> L0           L1 --rf--> L2
Fig. 2. Left: An example program: LB+data. Right: A trace of the same program.
mapping from addresses to values. A program may start by declaring named global vari-
ables with specific initial values. Instructions include register assignments and condi-
tional branches with the usual semantics. A load 'r:=[a]' loads the value from the mem-
ory address given by the arithmetic expression a into the register r. A store '[a0]:=a1'
stores the value of the expression a1 to the memory location addressed by the evaluation
of a0. For a global variable x, we use x as syntactic sugar for [&x], where &x is the ad-
dress of x. The instructions sync, lwsync, isync are fences (or memory barriers), which
are special instructions preventing some memory ordering relaxations. Each instruction
is given a label, which is assumed to be unique.
As an example, consider the program in Fig. 2. It consists of two threads P and Q,
and has two zero-initialized memory locations x and y. The thread P loads the value of
x, and stores that value plus one to y. The thread Q is similar, but always stores the value
1, regardless of the loaded value. Under the SC or TSO memory models, at least one
of the loads L0 and L2 is guaranteed to load the initial value 0 from memory. However,
under POWER the order between the load L2 and the store L3 is not maintained. Then
it is possible for P to load the value 1 into r0, and for Q to load 2 into r1. Inserting a
sync between L2 and L3 would prevent such a behavior.
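The SC claim above can be checked by brute force. The following is a minimal sketch, assuming a hand-written interpreter for the four instructions of Fig. 2 (the names `PROGRAM`, `sc_interleavings`, and `run_sc` are illustrative and not part of the paper's tooling). It enumerates every interleaving that preserves program order and collects the final values of (r0, r1).

```python
from itertools import permutations

# Instruction labels per thread, in program order (LB+data from Fig. 2).
PROGRAM = {"P": ["L0", "L1"], "Q": ["L2", "L3"]}

def sc_interleavings(program):
    """Yield every interleaving that preserves each thread's program order."""
    labels = [l for instrs in program.values() for l in instrs]
    for perm in permutations(labels):
        if all(perm.index(a) < perm.index(b)
               for instrs in program.values()
               for a, b in zip(instrs, instrs[1:])):
            yield perm

def run_sc(schedule):
    """Execute the schedule one instruction at a time (SC semantics)."""
    mem = {"x": 0, "y": 0}
    reg = {"r0": None, "r1": None}
    for label in schedule:
        if label == "L0":
            reg["r0"] = mem["x"]
        elif label == "L1":
            mem["y"] = reg["r0"] + 1
        elif label == "L2":
            reg["r1"] = mem["y"]
        elif label == "L3":
            mem["x"] = 1
    return reg["r0"], reg["r1"]

sc_outcomes = {run_sc(s) for s in sc_interleavings(PROGRAM)}
print(sorted(sc_outcomes))
```

In every SC outcome at least one register holds 0, so the POWER outcome r0 = 1, r1 = 2 described above is not reachable by interleaving alone.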
Axiomatic Memory Models. Axiomatic memory models, of the form in [7], operate
on an abstract representation of observable program behavior, introduced by Shasha
and Snir [43], here called traces. A trace is a directed graph, in which vertices are
executed instructions (called events), and edges capture dependencies between them.
More precisely, a trace π is a quadruple (E, po, co, rf) where E is a set of events, and
po, co, and rf are relations over E.¹ An event is a tuple (t, n, l) where t is an identifier for
the executing thread, l is the unique label of the instruction, and n is a natural number
which disambiguates instructions. Let 𝔼 denote the set of all possible events. For an
¹ [7] uses the term “execution” to denote what we call “trace.”
Event             Parameter   Semantic Meaning
L3: x := 1        0           First in coherence order for x
L0: r0 := x       L3          Read value 1 from L3
L1: y := r0+1     0           First in coherence order for y
L2: r1 := y       L1          Read value 2 from L1
Fig. 3. The run L3[0].L0[L3].L1[0].L2[L1] of the program in Fig. 2 (left), leading to the complete
state corresponding to the trace given in Fig. 2 (right). Here we use the labels L0-L3 as shorthands
for the corresponding events.
event e = (t, n, l), let tid(e) denote t and let instr(e) denote the instruction labelled
l in the program code. The relation po (for “program order”) totally orders all events
executed by the same thread. The relation co (for “coherence order”) totally orders all
stores to the same memory location. The relation rf (for “read-from”) contains the pairs
(e, e′) such that e is a store and e′ is a load which gets its value from e. For simplicity,
we assume that the initial value of each memory address x is assigned by a special
initializer instruction initx, which is first in the coherence order for that address. A trace
is a complete trace of the program P if the program order over the committed events
of each thread makes up a path from the first instruction in the code of the thread, to
the last instruction, respecting the evaluation of conditional branches. Fig. 2 shows the
complete trace corresponding to the behavior described in the beginning of this section,
in which each thread loads the value stored by the other thread.
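As an illustration of the quadruple (E, po, co, rf), the trace of Fig. 2 (right) can be written down as plain sets of pairs. This is only a sketch: the class names `Event` and `Trace`, the helper `source_of`, and the choice to model the initializers as events of a pseudo-thread named "init" are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    tid: str    # identifier t of the executing thread
    n: int      # natural number disambiguating instructions
    label: str  # unique instruction label l

@dataclass
class Trace:
    events: set  # E
    po: set      # program order, as pairs (e, e')
    co: set      # coherence order, as pairs of stores
    rf: set      # read-from, as pairs (store, load)

    def source_of(self, load):
        """Return the unique store that `load` reads from, according to rf."""
        sources = [w for (w, r) in self.rf if r == load]
        assert len(sources) == 1, "each load reads from exactly one store"
        return sources[0]

# Initializer events (modelled here as belonging to a pseudo-thread "init").
init_x, init_y = Event("init", 0, "init_x"), Event("init", 1, "init_y")
L0, L1 = Event("P", 0, "L0"), Event("P", 1, "L1")
L2, L3 = Event("Q", 0, "L2"), Event("Q", 1, "L3")

fig2 = Trace(
    events={init_x, init_y, L0, L1, L2, L3},
    po={(L0, L1), (L2, L3)},
    co={(init_x, L3), (init_y, L1)},   # initializers are first in coherence
    rf={(L3, L0), (L1, L2)},           # each thread reads the other's store
)
print(fig2.source_of(L0).label, fig2.source_of(L2).label)
```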
An axiomatic memory model M (following the framework of [7]) is defined as a
predicate M over traces π, such that M(π) holds precisely when π is an allowed trace under
the model. Deciding whether M(π) holds involves checking (i) that the trace is
internally consistent, defined in the natural way (e.g., the relation co relates precisely events
that access the same memory location), and (ii) that various combinations of relations
that are derived from the trace are acyclic or irreflexive. Which specific relations need
to be acyclic depends on the memory model.
We define the axiomatic semantics under M as a mapping from programs P to their
denotations [[P]]^Ax_M, where [[P]]^Ax_M is the set of complete traces π of P such that M(π)
holds. In the following, we assume that the axiomatic memory model for POWER,
here denoted MPOWER, is defined as in [7]. The interested reader is encouraged to read
the details in [7], but the high-level understanding given above should be enough to
understand the remainder of this text.
Deriving an Execution Model. Let an axiomatic model M be given, in the style of [7].
We will derive an equivalent execution model in the form of a transition system.
States. States of our execution model are traces, augmented with a set of fetched events.
A state σ is a tuple of the form (λ, F, E, po, co, rf) where λ(t) is a label in the code of
t for each thread t, F ⊆ 𝔼 is a set of events (𝔼 being the set of all possible events), and
(E, po|E, co, rf) is a trace such that E ⊆ F. (Here po|E is the restriction of po to E.)
For a state σ = (λ, F, E, po, co, rf), we let exec(σ) denote the trace (E, po|E, co, rf).
Intuitively, F is the set of all currently fetched events and E is the set of events that
have been committed. The function λ gives the label of the next instruction to fetch for
each thread. The relation po is the program order between all fetched events. The
relations co and rf are defined for committed events (i.e., events in E) only. The set of
all possible states is denoted S. The initial state
σ0 ∈ S is defined as σ0 = (λ0, E0, E0, ∅, ∅, ∅) where λ0 is the function providing
the initial label of each thread, and E0 is the set of initializer events for all memory
locations.
Commit-Before. The order in which events can be committed (effectively a linearization
of the trace) is restricted by a commit-before order. It is a parameter of our execution
model which can be tuned to suit the given axiomatic model. Formally, a commit-before
order is defined by a commit-before function cb, which associates with each state
σ = (λ, F, E, po, co, rf) a commit-before order cbσ ⊆ F × F, which is a partial order
on the set of fetched events. For each state σ, the commit-before order cbσ induces a
predicate enabledσ over the set of fetched events e ∈ F such that enabledσ(e) holds if
and only if e ∉ E and the set {e′ ∈ F | (e′, e) ∈ cbσ} is included in E. Intuitively, e can
be committed only if all the events it depends on have already been committed. Later in
this section, we define requirements on commit-before functions, which are necessary
for the execution model and for the RSMC algorithm respectively.
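The enabledness predicate translates almost literally into code. The sketch below assumes that events are represented by their labels and that cb is given as a set of pairs; the function name `enabled` and the example values are illustrative. For the program of Fig. 2, the example takes cb to contain only the data dependency from L0 to L1.

```python
def enabled(e, fetched, committed, cb):
    """enabled(e) holds iff e is fetched, not yet committed, and every
    fetched cb-predecessor of e has already been committed."""
    if e not in fetched or e in committed:
        return False
    preds = {a for (a, b) in cb if b == e and a in fetched}
    return preds <= committed

F = {"L0", "L1", "L2", "L3"}    # all four events fetched
CB = {("L0", "L1")}             # assumed: only the data dependency L0 -> L1

before = {e for e in F if enabled(e, F, set(), CB)}
after = {e for e in F if enabled(e, F, {"L0"}, CB)}
print(sorted(before), sorted(after))
```

Initially L1 is not enabled because its cb-predecessor L0 is uncommitted; once L0 is committed, L1 becomes enabled and L0 itself no longer is.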
Transitions. The transition relation between states is given by a set of rules, in Fig. 4.
The function valσ(e, a) denotes the value taken by the arithmetic expression a, when
evaluated at the event e in the state σ. The value is computed in the natural way, respecting
data-flow. (Formal definition given in Appendix A.1.) For example, in the state σ
corresponding to the trace given in Fig. 2, where e is the event corresponding to label
L1, we would have valσ(e, r0+1) = 2. The function addressσ(e) associates with each
load or store event e the memory location accessed. For a label l, let λnext(l) denote the
next label following l in the program code. Finally, for a state σ with coherence order co
and a store e to some memory location x, we let extendσ(e) denote the set of coherence
orders co′ which result from inserting e anywhere in the total order of stores to x in co.
For each such order co′, we let position_{co′}(e) denote the position of e in the total order,
i.e., position_{co′}(e) is the number of (non-initializer) events e′ which precede e in co′.
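A small sketch of extend and position, under the assumption that the coherence order of one memory location is kept as a Python list of its non-initializer stores (the initializer is implicitly first). The function names mirror the text, but the list representation is an assumption of this example.

```python
def extend(co_x, e):
    """All coherence orders obtained by inserting store e anywhere in co_x."""
    return [co_x[:i] + [e] + co_x[i:] for i in range(len(co_x) + 1)]

def position(co_x, e):
    """Number of non-initializer stores that precede e in co_x."""
    return co_x.index(e)

# One store (L3) to x is already committed; a second store L2 is inserted.
candidates = extend(["L3"], "L2")
positions = [position(c, "L2") for c in candidates]
print(candidates, positions)
```

With k stores already committed to a location, a new store has k + 1 candidate positions, each of which is still subject to the guard imposed by the memory model.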
The intuition behind the rules in Fig. 4 is that events are committed non-
deterministically out of order, but respecting the constraints induced by the commit-
before order. When a memory access (load or store) is committed, a non-deterministic
choice is made about its effect. If the event is a store, it is non-deterministically inserted
somewhere in the coherence order. If the event is a load, we non-deterministically pick
the store from which to read. Thus, when committed, each memory access event e is
parameterized by a choice p: the coherence position for a store, and the source store
for a load. We call e[p] a parameterized event, and let P denote the set of all possible
parameterized events. A transition committing a memory access is only enabled if the
resulting state is allowed by the memory model M. Transitions are labelled with FLB
when an event is fetched or a local event is committed, or with e[p] when a memory
access event e is committed with parameter p.
We illustrate this intuition for the program in Fig. 2 (left). The trace in Fig. 2 (right)
can be produced by committing the instructions (events) in the order L3, L0, L1, L2.
For the load L0, we can then choose the already performed L3 as the store from which
it reads, and for the load L2, we can choose to read from the store L1. Each of the two
stores L3 and L1 can only be inserted at one place in their respective coherence orders,
since the program has only one store to each memory location. We show the resulting
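\[
\textsc{Fetch}\quad
\frac{F_t = \{e'' \in F \mid \mathrm{tid}(e'') = t\} \qquad e = (t, |F_t|, \lambda(t)) \qquad
      \nexists\, e', a, l \,.\; e' \in F \setminus E \wedge \mathrm{tid}(e') = t \wedge \mathrm{instr}(e') = (\texttt{if } a \texttt{ goto } l)}
     {\sigma \xrightarrow{\mathrm{FLB}} (\lambda[t \hookleftarrow \lambda_{\mathrm{next}}(\lambda(t))],\; F \cup \{e\},\; E,\; \mathit{po} \cup (F_t \times \{e\}),\; \mathit{co},\; \mathit{rf})}
\]
\[
\textsc{Brt}\quad
\frac{\mathrm{instr}(e) = (\texttt{if } a \texttt{ goto } l) \qquad t = \mathrm{tid}(e) \qquad
      \mathrm{val}_\sigma(e, a) \in \mathbb{Z} \setminus \{0\} \qquad \mathrm{enabled}_\sigma(e)}
     {\sigma \xrightarrow{\mathrm{FLB}} (\lambda[t \hookleftarrow l],\; F,\; E \cup \{e\},\; \mathit{po},\; \mathit{co},\; \mathit{rf})}
\]
\[
\textsc{Loc}\quad
\frac{\mathrm{instr}(e) \in \{\texttt{sync}, \texttt{lwsync}, \texttt{isync}, r\texttt{:=}a\} \qquad \mathrm{enabled}_\sigma(e)}
     {\sigma \xrightarrow{\mathrm{FLB}} (\lambda,\; F,\; E \cup \{e\},\; \mathit{po},\; \mathit{co},\; \mathit{rf})}
\]
\[
\textsc{Brf}\quad
\frac{\mathrm{instr}(e) = (\texttt{if } a \texttt{ goto } l) \qquad \mathrm{val}_\sigma(e, a) = 0 \qquad \mathrm{enabled}_\sigma(e)}
     {\sigma \xrightarrow{\mathrm{FLB}} (\lambda,\; F,\; E \cup \{e\},\; \mathit{po},\; \mathit{co},\; \mathit{rf})}
\]
\[
\textsc{St}\quad
\frac{\mathrm{instr}(e) = ([a]\texttt{:=}a') \qquad \mathrm{enabled}_\sigma(e) \qquad M(\mathrm{exec}(\sigma')) \qquad
      \sigma' = (\lambda, F, E \cup \{e\}, \mathit{po}, \mathit{co}', \mathit{rf}) \qquad \mathit{co}' \in \mathrm{extend}_\sigma(e)}
     {\sigma \xrightarrow{e[\mathrm{position}_{\mathit{co}'}(e)]} \sigma'}
\]
\[
\textsc{Ld}\quad
\frac{\mathrm{instr}(e) = (r\texttt{:=}[a]) \qquad \mathrm{enabled}_\sigma(e) \qquad e_w \in E \qquad \mathrm{instr}(e_w) = ([a']\texttt{:=}a'') \qquad
      \mathrm{address}_\sigma(e_w) = \mathrm{address}_\sigma(e) \qquad
      \sigma' = (\lambda, F, E \cup \{e\}, \mathit{po}, \mathit{co}, \mathit{rf} \cup \{(e_w, e)\}) \qquad M(\mathrm{exec}(\sigma'))}
     {\sigma \xrightarrow{e[e_w]} \sigma'}
\]
Fig. 4. Execution model of programs under the memory model M. Here σ = (λ, F, E, po, co, rf).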
sequence of committed events in Fig. 3: the first column shows the sequence of events in
the order they are committed, the second column is the parameter assigned to the event,
and the third column explains the parameter. Note that other traces can be obtained by
choosing different values of parameters. For instance, the load L2 can also read from
the initial value, which would generate a different trace.
Next we explain each of the rules. The rule FETCH fetches the next instruction
according to the control flow of the program code. The first two requirements
identify the next instruction. To fetch an event, all preceding branch events must already
be committed. Therefore events are never fetched along a control flow path that is not
taken. We point out that this restriction does not prevent our execution model from
capturing the observable effects of speculative execution (formally ensured by Theorem 1).
The rules LOC, BRT and BRF describe how to commit non-memory access events.
When a store event is committed by the ST rule, it is inserted non-deterministically
at some position n = position_{co′}(e) in the coherence order. The guard M(exec(σ′))
ensures that the resulting state is allowed by the axiomatic memory model.
The rule LD describes how to commit a load event e. It is similar to the ST rule. For
a load we non-deterministically choose a source store ew, from which the value can be
read. As before, the guard M(exec(σ′)) ensures that the resulting state is allowed.
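Given two states σ, σ′ ∈ S, we use $\sigma \xrightarrow{\mathrm{FLB(max)}} \sigma'$ to denote that
$\sigma \xrightarrow{\mathrm{FLB}}{}^{*} \sigma'$ and there is no state σ′′ ∈ S with
$\sigma' \xrightarrow{\mathrm{FLB}} \sigma''$. A run τ from some state σ is a sequence of
parameterized events e1[p1].e2[p2]. · · · .ek[pk] such that
\[
\sigma \xrightarrow{\mathrm{FLB(max)}} \sigma_1 \xrightarrow{e_1[p_1]} \sigma'_1 \xrightarrow{\mathrm{FLB(max)}} \cdots
\xrightarrow{e_k[p_k]} \sigma'_k \xrightarrow{\mathrm{FLB(max)}} \sigma_{k+1}
\]
for some states σ1, σ′1, . . . , σ′k, σk+1 ∈ S. We write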
e[p] ∈ τ to denote that the parameterized event e[p] appears in τ . Observe that the
sequence τ leads to a uniquely determined state σk+1, which we denote τ(σ). A run
τ , from the initial state σ0, is complete iff the reached trace exec(τ(σ0)) is complete.
Fig. 3 shows an example complete run of the program in Fig. 2 (left).
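To make the notion of a run concrete, the following sketch replays the run of Fig. 3 and accumulates the resulting co and rf. It deliberately ignores the FLB steps and the guard M(exec(σ′)) of the ST and LD rules; the function `replay` and the table `STORE_LOC`, which maps each store to its memory location, are assumptions of this example.

```python
def replay(run, store_loc):
    """Apply parameterized events e[p] in order.
    For a store, p is its coherence position; for a load, p is its source store."""
    co = {loc: [] for loc in set(store_loc.values())}
    rf = set()
    for event, param in run:
        if event in store_loc:
            co[store_loc[event]].insert(param, event)
        else:
            rf.add((param, event))
    return co, rf

RUN = [("L3", 0), ("L0", "L3"), ("L1", 0), ("L2", "L1")]  # L3[0].L0[L3].L1[0].L2[L1]
STORE_LOC = {"L3": "x", "L1": "y"}

co, rf = replay(RUN, STORE_LOC)
print(co, sorted(rf))
```

The result matches the trace of Fig. 2 (right): L0 reads from L3, L2 reads from L1, and each store is the only non-initializer store to its location.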
In summary, our execution model represents a program P as a labeled transition
system TS^P_{M,cb} = (S, σ0, −→), where S is the set of states, σ0 is the initial state, and
−→ ⊆ S × (P ∪ {FLB}) × S is the transition relation. We define the execution semantics
under M and cb as a mapping, which maps each program P to its denotation [[P]]^Ex_{M,cb},
which is the set of complete runs τ induced by TS^P_{M,cb}.
Validity and Deadlock Freedom. Here, we define validity and deadlock freedom for
memory models and commit-before functions. Validity is necessary for the correct op-
eration of our execution model (Theorem 1). Deadlock freedom is necessary for sound-
ness of the RSMC algorithm (Theorem 4). First, we introduce some auxiliary notions.
We say that a state σ′ = (λ′, F ′, E′, po′, co′, rf′) is a cb-extension of a state σ =
(λ, F ,E, po, co, rf), denoted σ ≤cb σ′, if σ′ can be obtained from σ by fetching in
program order or committing events in cb order. Formally σ ≤cb σ′ if po = po′|F ,
co = co′|E , rf = rf′|E , F is a po′-closed subset of F ′, and E is a cbσ′ -closed subset of
E′. More precisely, the condition on F means that for any events e, e′ ∈ F ′, we have
[e′ ∈ F ∧ (e, e′) ∈ po′]⇒ e ∈ F . The condition on E is analogous.
We say that cb is monotonic w.r.t. M if whenever σ ≤cb σ′, then (i) M(exec(σ′)) ⇒
M(exec(σ)), (ii) cbσ ⊆ cbσ′, and (iii) for all e ∈ F such that either e ∈ E or
(enabledσ(e) ∧ e ∉ E′), we have (e′, e) ∈ cbσ ⇔ (e′, e) ∈ cbσ′ for all e′ ∈ F′.
Conditions (i) and (ii) are natural monotonicity requirements on M and cb. Condition
(iii) says that while an event is committed or enabled, its cb-predecessors do not change.
A state σ induces a number of relations over its fetched (possibly committed) events.
Following [7], we let addrσ, dataσ, ctrlσ denote respectively address dependency, data
dependency and control dependency. Similarly, po-locσ is the subset of po that relates
memory accesses to the same memory location. Lastly, syncσ and lwsyncσ relate events
that are separated in program order by respectively a sync or lwsync. The formal
definitions can be found in [7], as well as in Appendix A.1. We can now define a weakest
reasonable commit-before function cb0, capturing natural dependencies:
\[
\mathrm{cb}^0_\sigma = (\mathrm{addr}_\sigma \cup \mathrm{data}_\sigma \cup \mathrm{ctrl}_\sigma \cup \mathit{rf})^+ ,
\]
where R+ denotes the transitive (but not reflexive) closure of R.
We say that a commit-before function cb is valid w.r.t. a memory model M if cb
is monotonic w.r.t. M, and for all states σ such that M(exec(σ)) we have that cbσ is
acyclic and cb0σ ⊆ cbσ .
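The definition of cb0 and the acyclicity requirement can be sketched with relations represented as sets of pairs. The helper names below are illustrative, and the input relations are written by hand for the final state of the run in Fig. 3 (one data dependency and two rf edges; no address or control dependencies).

```python
def transitive_closure(rel):
    """Smallest transitive relation containing rel (not reflexive)."""
    closure = set(rel)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def is_acyclic(rel):
    """A relation is acyclic iff its transitive closure is irreflexive."""
    return all(a != b for (a, b) in transitive_closure(rel))

def cb0(addr, data, ctrl, rf):
    """cb0 = (addr U data U ctrl U rf)+"""
    return transitive_closure(addr | data | ctrl | rf)

# Final state of the run L3[0].L0[L3].L1[0].L2[L1] for the program of Fig. 2.
data = {("L0", "L1")}
rf = {("L3", "L0"), ("L1", "L2")}
cb0_sigma = cb0(set(), data, set(), rf)
print(sorted(cb0_sigma), is_acyclic(cb0_sigma))
```

Here cb0 orders the events as L3, L0, L1, L2, which is exactly the commit order used in Fig. 3.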
Theorem 1 (Equivalence with Axiomatic Model). Let cb be a commit-before function
valid w.r.t. a memory model M. Then [[P]]^Ax_M = {exec(τ(σ0)) | τ ∈ [[P]]^Ex_{M,cb}}.
The commit-before function cb0 is valid w.r.t. MPOWER, implying (by Theorem 1)
that [[P]]^Ex_{MPOWER,cb0} is a faithful execution model for POWER. However, cb0 is not strong
enough to prevent blocking runs in the execution model for POWER. That is, it is possible,
with cb0, to create an incomplete run which cannot be completed. Any such blocking
is undesirable for SMC, since it corresponds to wasted exploration. Fig. 5 shows an
example of how the POWER semantics may deadlock when based on cb0.
Program:
x = 0   y = 0
thread P:          thread Q:
L0: r0:=y;         L3: x:=3;
L1: x:=r0;         L4: sync;
L2: x:=2;          L5: y:=1;

Blocked run τ: L3[0].L5[0].L0[L5].L2[0] (L1 blocked)

Blocked state σ: graph over the events L0-L5 with edges labelled data, po-loc, sync, co, and rf.
Fig. 5. If the weak commit-before function cb0 is used, the POWER semantics may deadlock.
When the program above (left) is executed according to the run τ (center) we reach a state σ
(right) where L0, L2, L3-L5 are successfully committed. However, any attempt to commit L1 will
close a cycle in the relation co; syncσ; rf; dataσ; po-locσ, which is forbidden under POWER.
This blocking behavior is prevented when the stronger commit-before function cbpower is used,
since it requires L1 and L2 to be committed in program order.
We say that a memory model M and a commit-before function cb are deadlock free
if for all runs τ from σ0 and memory access events e such that enabledτ(σ0)(e) there
exists a parameter p such that τ .e[p] is a run from σ0. I.e., it is impossible to reach a state
where some event is enabled, but has no parameter with which it can be committed.
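The blocked state of Fig. 5 can be checked mechanically. The sketch below composes the relations named in the caption and looks for a reflexive pair. The edge sets are written by hand from the program and the blocked run (for instance, L2[0] places L2 before L3 in coherence), the relations are restricted to committed events, and the helper names are assumptions of this example. Only the co edge relevant to the cycle is listed.

```python
def compose(*rels):
    """Relational composition r1; r2; ...; rn."""
    result = rels[0]
    for r in rels[1:]:
        result = {(a, d) for (a, b) in result for (c, d) in r if b == c}
    return result

def restrict(rel, events):
    return {(a, b) for (a, b) in rel if a in events and b in events}

co = {("L2", "L3")}       # L2 was inserted at position 0, before L3
sync = {("L3", "L5")}     # L3 and L5 are separated by the sync L4
rf = {("L5", "L0")}       # L0 reads from L5
data = {("L0", "L1")}     # L1 stores the value loaded by L0
po_loc = {("L1", "L2")}   # L1 and L2 both store to x

def has_forbidden_cycle(committed):
    rels = [restrict(r, committed) for r in (co, sync, rf, data, po_loc)]
    return any(a == b for (a, b) in compose(*rels))

blocked_state = {"L0", "L2", "L3", "L4", "L5"}
print(has_forbidden_cycle(blocked_state), has_forbidden_cycle(blocked_state | {"L1"}))
```

The cycle appears regardless of the coherence position chosen for L1, because the edge from L2 to L3 in co is already fixed. This is why L1 has no parameter with which it can be committed.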
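Commit-Before Order for POWER. We will now define a stronger commit-before
function for POWER, which is both valid and deadlock free:
\[
\mathrm{cb}^{\mathrm{power}}_\sigma = (\mathrm{cb}^0_\sigma \cup (\mathrm{addr}_\sigma ; \mathit{po}) \cup \text{po-loc}_\sigma \cup \mathrm{sync}_\sigma \cup \mathrm{lwsync}_\sigma)^+
\]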
Theorem 2. cbpower is valid w.r.t. MPOWER.
Theorem 3. MPOWER and cbpower are deadlock free.
3 The RSMC Algorithm
Having derived an execution model, we address the challenge of defining an SMC algo-
rithm, which explores all allowed traces of a program in an efficient manner. Since each
trace can be generated by many equivalent runs, we must, just as in standard SMC for
SC, develop techniques for reducing the number of explored runs, while still guaran-
teeing coverage of all traces. Our RSMC algorithm is designed to do this in the context
of semantics like the one defined above, in which instructions can be committed with
several different parameters, each yielding different results.
Our exploration technique basically combines two mechanisms:
(i) In each state, RSMC considers an instruction e, whose cb-predecessors have al-
ready been committed. For each possible parameter value p of e in the current
state, RSMC extends the state by e[p] and continues the exploration recursively
from the new state.
(ii) RSMC monitors generated runs to detect read-write conflicts (or “races”), i.e., the
occurrence of a load and a subsequent store to the same memory location, such
that the load would be able to read from the store if they were committed in the
reverse order. For each such conflict, RSMC starts an alternative exploration, in
which the load is preceded by the store, so that the load can read from the store.
Mechanism (ii) is analogous to the detection and reversal of races in conventional
DPOR, with the difference that RSMC need only detect conflicts in which a load is
followed by a store. A race in the opposite direction (store followed by load) does
not induce reordering by mechanism (ii). This is because our execution model allows
the load to read from any of the already committed stores to the same memory location,
without any reordering. An analogous observation applies to occurrences of several
stores to the same memory location.
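Mechanism (ii) can be sketched as a scan over a run for load-then-store pairs on the same location that are not ordered by commit-before. This is a simplification: it does not check whether the load could actually read from the store under the memory model, and the function name and the dictionaries `KIND` and `LOC` are assumptions of this example. The input is the commit order L0, L1, L2, L3 of the program in Fig. 2, with cb taken to contain only the data dependency from L0 to L1.

```python
def read_write_races(run, kind, loc, cb):
    """Pairs (load, store) where the load is committed before a store to the
    same location and the pair is not ordered by commit-before."""
    races = []
    for i, e in enumerate(run):
        if kind[e] != "store":
            continue
        for er in run[:i]:
            if kind[er] == "load" and loc[er] == loc[e] and (er, e) not in cb:
                races.append((er, e))
    return races

RUN = ["L0", "L1", "L2", "L3"]
KIND = {"L0": "load", "L1": "store", "L2": "load", "L3": "store"}
LOC = {"L0": "x", "L1": "y", "L2": "y", "L3": "x"}
CB = {("L0", "L1")}   # assumed: only the data dependency

races = read_write_races(RUN, KIND, LOC, CB)
print(races)
```

The pair L1, L2 is not reported: there the store precedes the load, so the load can already choose to read from it without any reordering.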
Instruction       Parameter   Semantic Meaning
L0: r0 := x       initx       (read initial value)
L1: y := r0+1     0           (first in coherence of y)
L2: r1 := y       inity       (read initial value)
L3: x := 1        0           (first in coherence of x)
Fig. 6. The first explored run of the program in Fig. 2
We illustrate the basic idea of RSMC on the program in Fig. 2 (left). As usual
in SMC, we start by running the program under an arbitrary schedule, subject to the
constraints imposed by the commit-before order cb. For each instruction, we explore
the effects of each parameter value which is allowed by the memory model. Let us
assume that we initially explore the instructions in the order L0, L1, L2, L3. For this
schedule, there is only one possible parameter for L0, L1, and L3, whereas L2 can read
either from the initial value or from L1. Let us assume that it reads the initial value.
This gives us the first run, shown in Fig. 6. The second run is produced by changing the
parameter for L2, letting it read the value 1 written by L1.
During the exploration of the first two runs, the RSMC algorithm also detects a
race between the load L0 and the store L3. An important observation is that L3 is not
ordered after L0 by the commit-before order, implying that their order can be reversed.
Reversing the order between L0 and L3 would allow L0 to read from L3. Therefore,
RSMC initiates an exploration where the load L0 is preceded by L3 and reads from
it. (If L3 had been preceded by other events needed to enable L3, these would be
executed before L3.) After the sequence L3[0].L0[L3], RSMC is free to choose the
order in which the remaining instructions are considered. Assume that the order L1, L2
is chosen. In this case, the load L2 can read from either the initial value or from L1. In
the latter case, we obtain the run in Fig. 3, corresponding to the trace in Fig. 2 (right).
After this, there are no more unexplored parameter choices, and so the RSMC algo-
rithm terminates, having explored four runs corresponding to the four possible traces.
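The four traces of this walkthrough differ only in the read-from choices for the two loads, so they can be listed directly. The sketch below enumerates those choices and computes the register values each one implies; it does not check the POWER axioms (the text above states that all four are possible), and the names `outcome` and `traces` are illustrative.

```python
from itertools import product

def outcome(src_L0, src_L2):
    """Final (r0, r1) for the program of Fig. 2, given the store each load reads from."""
    r0 = 1 if src_L0 == "L3" else 0           # L3 stores 1 to x; init_x holds 0
    y_written = r0 + 1                        # value stored to y by L1
    r1 = y_written if src_L2 == "L1" else 0   # init_y holds 0
    return r0, r1

traces = {(s0, s2): outcome(s0, s2)
          for s0, s2 in product(["init_x", "L3"], ["init_y", "L1"])}
for choice, regs in traces.items():
    print(choice, regs)
```

The last combination, in which L0 reads from L3 and L2 reads from L1, is the trace of Fig. 2 (right) with r0 = 1 and r1 = 2.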
In the following section, we will provide a more detailed look at the RSMC algo-
rithm, and see formally how this exploration is carried out.
3.1 Algorithm Description
In this section, we present our algorithm, RSMC, for SMC under POWER. We prove
soundness of RSMC, and optimality w.r.t. explored complete traces.
// P[e] holds a run preceding the load event e.
global P = λe.〈〉
// Q[e] holds a set of continuations leading to the
// execution of the load event e after P[e].
global Q = λe.∅

Explore(τ, σ)
      // Fetch & commit local greedily.
  1:  while(∃σ′. σ --FLB--> σ′){ σ := σ′; }
      // Find committable memory access e.
  2:  if(∃e. enabledσ(e)){
  3:    if(e is a store){
          // Explore all ways to execute e.
  4:      S := {(n, σ′) | σ --e[n]--> σ′};
  5:      for((n, σ′) ∈ S){
  6:        Explore(τ.e[n], σ′);
  7:      }
  8:      DetectRace(τ, σ, e);
  9:    }else{ // e is a load
 10:      P[e] := τ;
          // Explore all ways to execute e.
 11:      S := {(ew, σ′) | σ --e[ew]--> σ′};
 12:      for((ew, σ′) ∈ S){
 13:        Explore(τ.e[ew], σ′);
 14:      }
          // Handle R -> W races.
 15:      explored = ∅;
 16:      while(∃τ′ ∈ Q[e] \ explored){
 17:        explored := explored ∪ {τ′};
 18:        Traverse(τ, σ, τ′);
 19:      }
 20:    }
 21:  }

DetectRace(τ, σ, e)
  1:  for(er[ew] ∈ τ s.t. er is a load ∧ (er, e) ∉ cbσ
          ∧ addressσ(er) = addressσ(e)){
        // Compute postfix after P[er].
  2:    τ′ := the τ′ s.t. τ = P[er].τ′;
        // Remove events not cb-before e.
  3:    τ′′ := normalize(cut(τ′, e, σ), cbσ);
        // Construct new continuation.
  4:    τ′′′ := τ′′.e[*].er[e];
        // Add to Q, to explore later.
  5:    Q[er] := Q[er] ∪ {τ′′′};
  6:  }

Traverse(τ, σ, τ′)
  1:  if(τ′ = 〈〉){
  2:    Explore(τ, σ);
  3:  }else{
        // Fetch & commit local greedily.
  4:    while(∃σ′. σ --FLB--> σ′){ σ := σ′; }
  5:    e[p].τ′′ := τ′; // Get first event.
  6:    if(p = *){
          // Explore all ways to execute e.
  7:      S := {(n, σ′) | σ --e[n]--> σ′};
  8:      for((n, σ′) ∈ S){
  9:        Traverse(τ.e[n], σ′, τ′′);
 10:      }
 11:    }else if(∃σ′. σ --e[p]--> σ′){
 12:      Traverse(τ.e[p], σ′, τ′′);
 13:    }else{
          // Only happens when the final load in τ′ does not
          // accept its parameter. Stop exploring.
 14:    }
 15:  }
Fig. 7. An algorithm to explore all traces of a given program. The initial call is Explore(〈〉, σ0).
The RSMC algorithm is shown in Fig. 7. It uses the recursive procedure Explore,
which takes parameters τ and σ such that σ = τ(σ0). Explore will explore all states
that can be reached by complete runs extending τ .
First, on line 1, we fetch instructions and commit all local instructions as far as
possible from σ. The order of these operations makes no difference. Then we turn to
memory accesses. If the run is not yet terminated, we select an enabled event e on line 2.
If the chosen event e is a store (lines 3-8), we first collect, on line 4, all parameters
for e which are allowed by the memory model. For each of them, we recursively explore
all of its continuations on line 6. That is, for each coherence position n that is allowed
for e by the memory model, we explore the continuation of τ obtained by committing e[n].
Finally, we call DetectRace. We will return shortly to a discussion of that mechanism.
If e is a load (lines 9-20), we proceed in a similar manner. Line 10 is related to
DetectRace, and discussed later. On line 11 we compute all allowed parameters for
the load e. They are (some of the) stores in τ which access the same address as e. On
line 13, we make one recursive call to Explore per allowed parameter. The structure of
this exploration is illustrated in the two branches from σ1 to σ2 and σ5 in Fig. 8(a).
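To make the two kinds of parameters concrete, the following is a minimal Python sketch,
not the Nidhugg implementation. It only enumerates candidate parameters and omits the
memory-model check that filters them. The names Event, store_params and load_params,
and the pseudo-store "init" standing for the initial value, are hypothetical.

```python
from typing import List, NamedTuple

class Event(NamedTuple):
    name: str      # label of the instruction, e.g. "L1"
    kind: str      # "store" or "load"
    address: str   # memory location accessed

def earlier_stores(run: List[Event], address: str) -> List[str]:
    """Names of the stores in the run that access the given address."""
    return [ev.name for ev in run
            if ev.kind == "store" and ev.address == address]

def store_params(run: List[Event], e: Event) -> List[int]:
    """Candidate coherence positions for the store e: one slot before,
    between, or after the k earlier stores to the same address (0..k)."""
    k = len(earlier_stores(run, e.address))
    return list(range(k + 1))

def load_params(run: List[Event], e: Event) -> List[str]:
    """Candidate read-from sources for the load e: the initial value
    (assumption: modelled as the pseudo-store "init") and every
    earlier store to the same address."""
    return ["init"] + earlier_stores(run, e.address)

run = [Event("L1", "store", "x"), Event("M0", "store", "y")]
print(store_params(run, Event("L2", "store", "x")))  # [0, 1]
print(load_params(run, Event("L0", "load", "x")))    # ['init', 'L1']
```

In both cases the candidates depend only on the events already in the run, which is
why stores that appear later than a load need the separate race-detection mechanism.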
[Figure 8: two exploration trees, (a) and (b), both rooted at σ0; diagram not reproduced.]
(a) A new branch τ2.eˆw[*].er[eˆw] is added to Q[er] and later explored, starting from
σ1. τ2 is a restriction of τ1, containing only events that are cbσ4-before eˆw.
(b) Another read-write race is detected, starting from the leaf of a branch explored by
Traverse. The new branch τ4.eˆw[*].er[eˆw] is added at σ1, not at σ7.
Fig. 8. How Explore applies event parameters, and introduces new branches. Thin arrows indicate
exploration performed directly by Explore. Bold arrows indicate traversal by Traverse.
Notice in the above that both for stores and loads, the available parameters are de-
termined entirely by τ , i.e. by the events that precede e in the run. In the case of stores,
the parameters are coherence positions between the earlier stores occurring in τ . In the
case of loads, the parameters are the earlier stores occurring in τ . For stores, this way of
exploring is sufficient. But for loads it is necessary to also consider parameters which
appear later than the load in a run. Consider the example in Fig. 8(a). During the recur-
sive exploration of a run from σ0 to σ4 we encounter a new store eˆw, which is in a race
with er. If the load er and the store eˆw access the same memory location, and er does
not precede eˆw in the cb-order, they could appear in the opposite order in a run (with eˆw
preceding er), and eˆw could be an allowed parameter for the load er. This read-write
race is detected on line 1 in the function DetectRace, when it is called from line 8 in
Explore when the store eˆw is being explored. We must then ensure that some run is
explored where eˆw is committed before er so that eˆw can be considered as a parameter
for er. Such a run must include all events that are before eˆw in cb-order, so that eˆw can
be committed. We construct τ2, which is a template for a new run, including precisely
the events in τ1 which are cb-before the store eˆw. The run template τ2 can be explored
from the state σ1 (the state where er was previously committed) and will then lead to a
state where eˆw can be committed. The run template τ2 is computed from the complete
run in DetectRace on lines 2-3. This is done by first removing (at line 2) the prefix τ0
which precedes er (stored in P[er] on line 10 in Explore). Thereafter (at line 3) events
that are not cb-before eˆw are removed using the function cut (here, cut(τ , e, σ) restricts
τ to the events which are cbσ-before e), and the resulting run is normalized. The func-
tion normalize normalizes a run by imposing a predefined order on the events which
are not ordered by cb. This is done to avoid unnecessarily exploring two equivalent run
templates. (Formal definitions are given in Appendix A.2.) The run template
τ2.eˆw[*].er[eˆw] is then stored on line 5 in the set Q[er], to ensure that it is explored
later. Here we use the special pseudo-parameter * to indicate that every allowed parameter
for eˆw should be explored (see lines 6-10 in Traverse).
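The construction of a run template on lines 2-4 of DetectRace can be sketched in Python
as follows. This is an illustrative sketch under stated assumptions, not the formal
definition from Appendix A.2: a run is a list of (event, parameter) pairs, cb is passed
directly as a set of pairs rather than derived from a state σ, and the "predefined order"
used by normalize is assumed to be alphabetical order of event names. The helper
race_continuation is hypothetical.

```python
import heapq
from typing import List, Set, Tuple

Run = List[Tuple[str, object]]   # (event name, parameter) pairs
Cb = Set[Tuple[str, str]]        # (a, b) means a is cb-before b

def cb_predecessors(e: str, cb: Cb) -> Set[str]:
    """All events that are (transitively) cb-before e."""
    preds: Set[str] = set()
    stack = [e]
    while stack:
        cur = stack.pop()
        for a, b in cb:
            if b == cur and a not in preds:
                preds.add(a)
                stack.append(a)
    return preds

def cut(run: Run, e: str, cb: Cb) -> Run:
    """Restrict the run to the events that are cb-before e."""
    preds = cb_predecessors(e, cb)
    return [(ev, p) for ev, p in run if ev in preds]

def normalize(run: Run, cb: Cb) -> Run:
    """Reorder the run so that cb is respected and events not ordered
    by cb appear in a fixed order (assumption: by event name)."""
    params = dict(run)
    indeg = {ev: 0 for ev in params}
    succs = {ev: [] for ev in params}
    for a, b in cb:
        if a in params and b in params:
            indeg[b] += 1
            succs[a].append(b)
    ready = [ev for ev in params if indeg[ev] == 0]
    heapq.heapify(ready)
    out: Run = []
    while ready:
        ev = heapq.heappop(ready)
        out.append((ev, params[ev]))
        for nxt in succs[ev]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                heapq.heappush(ready, nxt)
    return out

def race_continuation(postfix: Run, store: str, load: str, cb: Cb) -> Run:
    """Cut, normalize, then append store[*] and load[store]."""
    return normalize(cut(postfix, store, cb), cb) + [(store, "*"), (load, store)]

cb = {("A", "W"), ("B", "W")}
print(race_continuation([("B", 0), ("C", 0), ("A", 0)], "W", "R", cb))
# [('A', 0), ('B', 0), ('W', '*'), ('R', 'W')]
```

Two postfixes that differ only in the order of cb-unordered events, such as B.C.A and
A.B.C above, yield the same template, so a set such as Q[er] stores it only once.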
All of the run templates collected in Q[er] are explored from the same call to
Explore(τ0, σ1) where er was originally committed. This is done on lines 15-19. The
new branch is shown in Fig. 8(a) in the run from σ0 to σ8. Notice on line 18 that the
new branch is explored by the function Traverse, rather than by Explore itself. This
has the effect that τ2 is traversed, with each event using the parameter given in τ2, until
er[eˆw] is committed. The traversal by Traverse is marked with bold arrows in Fig. 8. If
the memory model does not allow er to be committed with the parameter eˆw, then the
exploration of this branch terminates on line 13 in Traverse. Otherwise, the exploration
continues using Explore, as soon as er has been committed (line 2 in Traverse).
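The loop on lines 15-19 of Explore can be pictured as a worklist over a set that may
grow while it is being processed. The Python sketch below is a simplified illustration
under our own assumptions: run templates are represented as tuples of strings, and the
function explore_race_branches and the demo callback are hypothetical stand-ins for the
real Traverse.

```python
from typing import Callable, List, Set, Tuple

Branch = Tuple[str, ...]   # a run template such as ("L1[*]", "L0[L1]")

def explore_race_branches(queue: Set[Branch],
                          traverse: Callable[[Branch], None]) -> List[Branch]:
    """Pick unexplored branches from the queue until none remain.
    The traverse callback may add branches to the queue meanwhile."""
    explored: Set[Branch] = set()
    order: List[Branch] = []
    while queue - explored:
        branch = min(queue - explored)   # arbitrary but deterministic choice
        explored.add(branch)
        order.append(branch)
        traverse(branch)
    return order

# Hypothetical demo: traversing the L2 branch detects the L0/L1 race again.
q_L0: Set[Branch] = {("L1[*]", "L0[L1]"), ("L2[*]", "L0[L2]")}

def demo_traverse(branch: Branch) -> None:
    if branch == ("L2[*]", "L0[L2]"):
        q_L0.add(("L1[*]", "L0[L1]"))   # already present, so the set is unchanged

order = explore_race_branches(q_L0, demo_traverse)
print(order)   # [('L1[*]', 'L0[L1]'), ('L2[*]', 'L0[L2]')]
```

Because all templates for races on the same load live in one set, a template that is
rediscovered in a later run is not traversed a second time.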
Let us now consider the situation in Fig. 8(b) in the run from σ0 to σ10. Here
τ2.eˆw[*].er[eˆw] is explored as described above. Then Explore continues the
exploration, and a read-write race is discovered from er to eˆw. From earlier DPOR
algorithms such as [22], one might expect that this case is handled by exploring a new
branch of the form τ2.eˆw[p].τ′3.eˆw[p′].er[eˆw], where er is simply delayed after σ7
until eˆw has been committed. Our algorithm handles the case differently, as shown in the run
from σ0 to σ13. Notice that P[er] can be used to identify the position in the run where
er was last committed by Explore (as opposed to by Traverse), i.e., σ1 in Fig. 8(b).
We start the new branch from that position (σ1), rather than from the position where
er was committed when the race was detected (i.e., σ7). The new branch τ4 is con-
structed when the race is detected on lines 2-3 in DetectRace, by restricting the sub-run
τ2.eˆw[p].er[eˆw].τ3 to events that cb-precede the store eˆw.
The reason for returning all the way up to σ1, rather than starting the new branch at
σ7, is to avoid exploring multiple runs corresponding to the same trace. This could oth-
erwise happen when the same race is detected in multiple runs. To see this happen, let
us consider the program given in Fig. 9. A part of its exploration tree is given in Fig. 10.
In the interest of brevity, when describing the exploration of the program runs, we will
ignore some runs which would be explored by the algorithm, but which have no impact
on the point of the example. Throughout this example, we will use the labels L0, L1,
and L2 to identify the events corresponding to the labelled instructions. We assume that
in the first run to be explored (the path from σ0 to σ3 in Fig. 10), the load at L0 is com-
mitted first (loading the initial value of x), then the stores at L1 and L2. There are two
read-write races in this run, from L0 to L1 and to L2. When the races are detected, the
branches L1[*].L0[L1] and L2[*].L0[L2] will be added to Q[L0]. These branches
are later explored, and appear in Fig. 10 as the paths from σ0 to σ6 and from σ0 to σ9
respectively. In the run ending in σ9, we discover the race from L0 to L1 again. This
indicates that a run should be explored where L0 reads from L1. If we were to continue
exploration from σ7 by delaying L0 until L1 has been committed, we would follow the
path from σ7 to σ11 in Fig. 10. In σ11, we have successfully reversed the race between
L0 and L1. However, the trace of σ11 turns out to be identical to the one we already
explored in σ6. Hence, by exploring in this manner, we would end up exploring redun-
dant runs. The Explore algorithm avoids this redundancy by exploring in the different
manner described above: when the race from L0 to L1 is discovered at σ9, we consider
the entire sub-run L2[0].L0[L2].L1[1] from σ0, and construct the new sub-run
L1[*].L0[L1] by removing all events that are not cb-before L1, generalizing the
parameter of L1 to *, and appending L0[L1] to the result. The new branch L1[*].L0[L1]
is added to Q[L0]. But Q[L0] already contains the branch L1[*].L0[L1], which was
added at the beginning of the exploration. Since it has already been explored (it has
already been added to the set explored at line 17), we avoid exploring it again.
thread P: thread Q: thread R:
L0: r := x L1: x := 1 L2: x := 2
Fig. 9. A small program where one thread P loads from x, and two threads Q and R store to x.
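[Figure 10: exploration tree rooted at σ0; diagram not reproduced. Its paths are:
  σ0 --L0[initx]--> σ1 --L1[0]--> σ2 --L2[0]--> σ3
  σ0 --L1[0]--> σ4 --L0[L1]--> σ5 --L2[0]--> σ6
  σ0 --L2[0]--> σ7 --L0[L2]--> σ8 --L1[1]--> σ9
  σ7 --L1[1]--> σ10 --L0[L1]--> σ11
The detected races are marked. An inset shows the trace over the events L0, L1, L2 with
its rf and co edges, for which exec(σ6) = exec(σ11).]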
Fig. 10. Part of a faulty exploration tree for the program above, containing redundant branches.
The branches ending in σ6 and σ11 correspond to the same trace. The Explore algorithm avoids
this redundancy, by the mechanism where all branches for read-write races from the same load
er are collected in one set Q[er].
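The claim that the runs ending in σ6 and σ11 have the same trace can be checked with a
small Python sketch. It assumes, as our reading of the store parameter, that a coherence
position n means "insert the store at index n of the coherence order built so far", and
it represents a trace over the single location x as a pair (rf, co). The function
exec_of is a hypothetical simplification of exec.

```python
def exec_of(run):
    """Replay a run over one location and return its trace as (rf, co):
    rf records which store each load reads from, co is the coherence order."""
    co = []   # coherence order of the stores committed so far
    rf = {}
    for event, param in run:
        if isinstance(param, int):   # store: parameter is a coherence position
            co.insert(param, event)
        else:                        # load: parameter names the store read from
            rf[event] = param
    return (tuple(sorted(rf.items())), tuple(co))

# The four runs of Fig. 10, written as (event, parameter) pairs.
sigma3 = [("L0", "init"), ("L1", 0), ("L2", 0)]
sigma6 = [("L1", 0), ("L0", "L1"), ("L2", 0)]
sigma9 = [("L2", 0), ("L0", "L2"), ("L1", 1)]
sigma11 = [("L2", 0), ("L1", 1), ("L0", "L1")]

print(exec_of(sigma6) == exec_of(sigma11))                           # True
print(len({exec_of(r) for r in (sigma3, sigma6, sigma9, sigma11)}))  # 3
```

Under these assumptions the four runs yield only three distinct traces, which is the
redundancy that collecting branches in Q[er] is meant to avoid.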
Soundness and Optimality. We first establish soundness of the RSMC algorithm in
Fig. 7 for the POWER memory model, in the sense that it guarantees to explore all
Shasha-Snir traces of a program. We thereafter establish that RSMC is optimal, in the
sense that it will never explore the same complete trace twice.
Theorem 4 (Soundness). Assume that cb is valid w.r.t. M, and that M and cb are
deadlock free. Then, for each π ∈ [[P]]AxM, the evaluation of a call to Explore(〈〉, σ0)
will contain a recursive call to Explore(τ, σ) for some τ, σ such that exec(σ) = π.
Corollary 1. RSMC is sound for POWER using MPOWER and cbpower.
The proof of Theorem 4 involves showing that if an allowed trace exists, then the
races detected in previously explored runs are sufficient to trigger the later exploration
of a run corresponding to that trace.
Theorem 5 (Optimality for POWER). Assume that M = MPOWER and cb = cbpower.
Let π ∈ [[P]]AxM. Then during the evaluation of a call to Explore(〈〉, σ0), there will be
exactly one call Explore(τ, σ) such that exec(σ) = π.
While the RSMC algorithm is optimal in the sense that it explores precisely one
complete run per Shasha-Snir trace, it may initiate explorations that block before reach-
ing a complete trace (similarly to sleep set blocking in classical DPOR). Such blocking
may arise when the RSMC algorithm detects a read-write race and adds a branch to Q,
which upon traversal turns out to be not allowed under the memory model. Our experi-
ments in Section 4 indicate that the effect of such blocking is almost negligible, without
any blocking in most benchmarks, and otherwise at most 10% of explored runs.
4 Experimental Results
In order to evaluate the efficiency of our approach, we have implemented it as a part
of the open source tool Nidhugg [32], for stateless model checking of C/pthreads
programs under relaxed memory models. It operates under the restrictions that (i) all executions
are bounded by loop unrolling, and (ii) the analysis runs on a given compilation of the
target C code. The implementation uses RSMC to explore all allowed program behav-
iors under POWER, and detects any assertion violation that can occur. We validated the
correctness of our implementation by successfully running all 8070 relevant litmus tests
published with [7].
The main goals of our experimental evaluation are (i) to show the feasibility and
competitiveness of our approach, in particular to show for which programs it performs
well, (ii) to compare with goto-instrument, which to our knowledge is the only other
tool analyzing C/pthreads programs under POWER (see footnote 2), and (iii) to show the effectiveness
of our approach in terms of wasted exploration effort.
Table 1 shows running times for Nidhugg and goto-instrument for several benchmarks
in C/pthreads. All benchmarks were run on a 3.07 GHz Intel Core i7 CPU with
6 GB RAM. We use goto-instrument version 5.1 with cbmc version 5.1 as backend.
We note here that the comparison of running time is mainly relevant for the benchmarks
where no error is detected (errors are indicated with a * in Table 1). This is
because when an error is detected, a tool may terminate its analysis without searching
the remaining part of the search space (i.e., the remaining runs in our case). Therefore
the time consumption in such cases is determined by whether the search strategy was
lucky or not. This also explains why in e.g. the dekker benchmark, fewer Shasha-Snir
traces are explored in the version without fences, than in the version with fences.

Tool running time (s), and trace count

                           goto-instrument   Nidhugg
Benchmark         F   LB   time              time       SS       B
dcl singleton     -   7    *0.40             *0.13      3        0
dcl singleton     y   7    5.05              0.19       7        0
dekker            -   10   *229.39           *0.11      5        0
dekker            y   10   t/o               0.76       246      0
fib false         -   -    *1.86             t/o        109171   0
fib false join    -   -    *0.84             *35.46     11938    0
fib true          -   -    7.05              t/o        109122   0
fib true join     -   -    8.92              57.67      19404    0
indexer           -   5    68.16             1.57       19       0
lamport           -   8    *635.45           *0.12      3        0
lamport           y   8    t/o               0.20       50       2
parker            -   5    1.20              *0.13      5        0
parker            y   5    1.24              7.44       1126     0
peterson          -   -    *0.24             *0.11      3        0
peterson          y   -    0.19              0.11       10       1
pgsql             -   8    *161.05           *0.11      2        0
pgsql             y   8    t/o               0.58       16       0
pgsql bnd         -   -    t/o               *0.11      2        0
pgsql bnd         y   -    t/o               t/o        36211    0
stack safe        -   -    13.84             73.86      1005     0
stack unsafe      -   -    *1.03             *3.32      20       0
szymanski         -   -    *1.02             *0.11      17       0
szymanski         y   -    304.87            0.31       226      0

Table 1. A comparison of running times (in seconds) for our implementation Nidhugg and
goto-instrument. The F column indicates whether fences have been inserted into the code
to regain safety. The LB column indicates whether the tools were instructed to unroll
loops up to a certain bound. A t/o entry means that the tool failed to complete within
900 seconds. An asterisk (*) means that the tool found a safety violation. A struck out
entry means that the tool gave the wrong answer regarding the safety of the benchmark.
The superior running time for each benchmark is given in bold font. The SS column
indicates the number of complete traces explored by Nidhugg before detecting an error,
exploring all traces, or timing out. The B (for “blocking”) column indicates the number
of incomplete runs that Nidhugg started to explore, but that turned out to be invalid.

Footnote 2. The cbmc tool previously supported POWER [5], but has withdrawn support in
later versions.
Comparison with goto-instrument. goto-instrument employs code-to-code transfor-
mation in order to allow verification tools for SC to work for more relaxed mem-
ory models such as TSO, PSO and POWER [4]. The results in Table 1 show that
our technique is competitive. In many cases Nidhugg significantly outperforms
goto-instrument. The benchmarks for which goto-instrument performs better than Nidhugg
have in common that goto-instrument reports that no trace may contain a cycle which
indicates non-SC behavior. This allows goto-instrument to avoid expensive program
instrumentation to capture the extra program behaviors caused by memory consistency
relaxation. While this treatment is very beneficial in some cases (e.g. for stack * which
is data race free and hence has no non-SC executions), it also leads to false negatives
in cases like parker, when goto-instrument fails to detect Shasha-Snir cycles that cause
safety violations. In contrast, our technique is precise, and will never miss any behaviors
caused by memory consistency relaxation within the execution length bound.
We remark that our approach is restricted to thread-wisely deterministic programs
with fixed input data, whereas the bounded model-checking used as a backend (CBMC)
for goto-instrument can handle both concurrency and data nondeterminism.
x = 0 y = 0 z = 0
thread P: thread Q:
L0: x := 1; M0: y := 1;
L1: sync; M1: sync;
L2: r0 := y; M2: r1 := x;
L3: if r0 = 1 M3: if r1 = 1
goto L14; goto M14;
L4: z := 1; M4: z := 1;
L5: z := 1; M5: z := 1;
L6: z := 1; M6: z := 1;
L7: z := 1; M7: z := 1;
L8: z := 1; M8: z := 1;
L9: z := 1; M9: z := 1;
L10: z := 1; M10: z := 1;
L11: z := 1; M11: z := 1;
L12: z := 1; M12: z := 1;
L13: z := 1; M13: z := 1;
L14: r0 := 0; M14: r1 := 0;
Fig. 11. SB+10W+syncs: A litmus
test based on the idiom known as
“Dekker” or “SB”. It has 3 allowed
Shasha-Snir traces under POWER. If
the sync fences at lines L1 and M1
are removed, then it has 184759 al-
lowed Shasha-Snir traces. This test is
designed to have a large difference be-
tween the total number of coherent
Shasha-Snir traces and the number of
allowed Shasha-Snir traces.
Efficiency of Our Approach. While our RSMC al-
gorithm is optimal, in the sense that it explores
precisely one complete run per Shasha-Snir trace,
it may additionally start to explore runs that then
turn out to block before completing, as described
in Section 3. The SS and B columns of Table 1
indicate that the effect of such blocking is almost
negligible, with no blocking in most benchmarks,
and at most 10% of the runs.
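The 10% bound can be checked against the two rows of Table 1 that have a non-zero B
entry. The sketch below assumes that the fraction is taken relative to all explored
runs, i.e. SS + B; this reading is ours and is not stated explicitly in the text.

```python
# (SS, B) pairs copied from the two Table 1 rows with blocked runs.
rows = {"lamport y": (50, 2), "peterson y": (10, 1)}

def blocked_fraction(ss: int, b: int) -> float:
    """Fraction of explored runs that blocked, assuming total = SS + B."""
    return b / (ss + b)

for name, (ss, b) in rows.items():
    print(f"{name}: {blocked_fraction(ss, b):.1%}")
# lamport y: 3.8%
# peterson y: 9.1%
```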
A costly aspect of our approach is that every
time a new event is committed in a trace, Nid-
hugg will check which of its possible parameters
are allowed by the axiomatic memory model. This
check is implemented as a search for particular
cycles in a graph over the committed events. The
cost is alleviated by the fact that RSMC is opti-
mal, and avoids exploring unnecessary traces.
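As a generic illustration of such a check, the Python sketch below tests whether a
candidate parameter would close a cycle in a directed graph. It is not Nidhugg's
implementation and does not model the specific relations of the POWER axioms in [7];
the functions has_cycle and allowed_params, and the idea of representing each candidate
by the extra edges it would add, are assumptions made for the example.

```python
def has_cycle(edges):
    """Depth-first search for a directed cycle in a graph given as edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY:          # back edge: cycle found
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

def allowed_params(committed_edges, candidates):
    """Keep the candidate parameters whose extra edges keep the graph acyclic."""
    return [param for param, extra in candidates
            if not has_cycle(committed_edges + extra)]

committed = [("a", "b"), ("b", "c")]
candidates = [("p1", [("c", "a")]), ("p2", [("a", "c")])]
print(allowed_params(committed, candidates))   # ['p2']
```

A check of this kind runs once per committed event and candidate parameter, which is
why the number of explored runs has a direct effect on the total cost.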
To illustrate this tradeoff, we present the small
program in Fig. 11. The first three lines of each
thread implement the classical Dekker idiom. It
is impossible for both threads to read the value
0 in the same execution. This property is used to
implement a critical section, containing the lines
L4-L13 and M4-M13. However, if the fences at L1
and M1 are removed, the mutual exclusion prop-
erty can be violated, and the critical sections may
execute in an interleaved manner. The program
with fences has only three allowed Shasha-Snir traces, corresponding to the different
observable orderings of the first three instructions of both threads. Without the fences,
the number rises to 184759, due to the many possible interleavings of the repeated stores
to z. The running time of Nidhugg is 0.01s with fences and 161.36s without fences.
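One way to account for the number 184759, which the text does not spell out, is the
following. We assume that stores by the same thread to z keep their program order in the
coherence order, so that when both threads enter their critical sections the coherence
order on z is an interleaving of two sequences of 10 stores. The remaining traces are
the three in which at least one thread skips its critical section, i.e. the three traces
that are also allowed with fences.

```python
from math import comb

stores_per_thread = 10

# Interleavings of two program-ordered sequences of 10 stores each.
both_enter = comb(2 * stores_per_thread, stores_per_thread)

# Traces in which at least one thread reads 1 and jumps past its stores.
with_fences = 3

print(both_enter)                 # 184756
print(both_enter + with_fences)   # 184759
```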
We compare this with the results of the litmus test checking tool herd [7], which
operates by generating all possible Shasha-Snir traces, and then checking which are al-
lowed by the memory model. The running time of herd on SB+10W+syncs is 925.95s
with fences and 78.09s without fences. Thus herd performs better than Nidhugg on the
litmus test without fences. This is because a large proportion of the possible Shasha-
Snir traces are allowed by the memory model. For each of them herd needs to check
the trace only once. On the other hand, when the fences are added, the performance of
herd deteriorates. This is because herd still checks every Shasha-Snir trace against the
memory model, and each check becomes more expensive, since the fences introduce
many new dependency edges into the traces.
We conclude that our approach is particularly well suited for application-style
programs with control structures, mutual exclusion primitives etc., where relaxed memory
effects are significant, but where most potential Shasha-Snir traces are forbidden.
5 Conclusions
We have presented the first framework for efficient application of SMC to programs
running under POWER. Our framework combines solutions to several challenges. We
developed a scheme for systematically deriving execution models that are suitable for
SMC, from axiomatic ones. We present RSMC, a novel algorithm for exploring all
relaxed-memory traces of a program, based on our derived execution model. We show
that RSMC is sound for POWER, meaning that it explores all Shasha-Snir traces of a
program, and optimal in the sense that it explores the same complete trace exactly once.
The RSMC algorithm can in some situations waste effort by exploring blocked runs,
but our experimental results show that this is rare in practice. Our implementation
shows that the RSMC approach is competitive relative to an existing state-of-the-art
implementation. We expect that RSMC will be sound also for other similar memory
models with suitably defined commit-before functions.
Related Work. Several SMC techniques have been recently developed for programs
running under the memory models TSO and PSO [1,48,19]. In this work we propose a
novel and efficient SMC technique for programs running under POWER.
In [7], a similar execution model was suggested, also based on the axiomatic
semantics. However, compared to our semantics, it will lead to many spurious executions
that will be blocked by the semantics as they are found to be disallowed. This would
cause superfluous runs to be explored, if used as a basis for stateless model checking.
Beyond SMC techniques for relaxed memory models, there have been many
works related to the verification of programs running under relaxed memory models
(e.g., [34,29,30,18,3,13,14,10,12,47]). Some of these works propose precise analysis
techniques for finite-state programs under relaxed memory models (e.g., [3,10,20]).
Others propose algorithms and tools for monitoring and testing programs running un-
der relaxed memory models (e.g., [13,14,15,34,21]). Different techniques based on ex-
plicit state-space exploration for the verification of programs running under relaxed
memory models have also been developed in recent years (e.g., [26,37,29,30,33]).
There are also a number of efforts to design bounded model checking techniques for
programs under relaxed memory models (e.g., [5,47,12,45]) which encode the verifi-
cation problem in SAT/SMT. Finally, there are code-to-code transformation techniques
(e.g., [9,4,10]) which reduce verification of a program under relaxed memory models
to verification of a transformed program under SC. Most of these works do not handle
POWER. In [20], the robustness problem for POWER has been shown to be PSPACE-
complete.
The closest works to ours were presented in [5,4,7]. The work [4] extends cbmc to
work with relaxed memory models (such as TSO, PSO and POWER) using a code-to-
code transformation. The work in [5] develops a bounded model checking technique
that can be applied to different memory models (e.g., TSO, PSO, and POWER). The
cbmc tool previously supported POWER [5], but has withdrawn support in its later
versions. The tool herd [7] operates by generating all possible Shasha-Snir traces, and
then for each one of them checking whether it is allowed by the memory model. In
Section 4, we experimentally compare RSMC with the tools of [4] and [7].
References
1. Abdulla, P.A., Aronis, S., Atig, M.F., Jonsson, B., Leonardsson, C., Sagonas, K.F.: Stateless
model checking for TSO and PSO. In: TACAS, LNCS, vol. 9035, pp. 353–367. Springer
(2015)
2. Abdulla, P.A., Aronis, S., Jonsson, B., Sagonas, K.F.: Optimal dynamic partial order reduc-
tion. In: POPL. pp. 373–384. ACM (2014)
3. Abdulla, P.A., Atig, M.F., Chen, Y., Leonardsson, C., Rezine, A.: Counter-example guided
fence insertion under TSO. In: TACAS, LNCS, vol. 7214, pp. 204–219. Springer (2012)
4. Alglave, J., Kroening, D., Nimal, V., Tautschnig, M.: Software verification for weak memory
via program transformation. In: ESOP. LNCS, vol. 7792, pp. 512–532. Springer (2013)
5. Alglave, J., Kroening, D., Tautschnig, M.: Partial orders for efficient bounded model check-
ing of concurrent software. In: CAV. LNCS, vol. 8044, pp. 141–157. Springer (2013)
6. Alglave, J., Maranget, L.: Stability in weak memory models. In: CAV. LNCS, vol. 6806, pp.
50–66. Springer (2011)
7. Alglave, J., Maranget, L., Tautschnig, M.: Herding cats: Modelling, simulation, testing, and
data mining for weak memory. ACM Trans. Program. Lang. Syst. 36(2), 7:1–7:74 (2014)
8. ARM: ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition (2014)
9. Atig, M.F., Bouajjani, A., Parlato, G.: Getting rid of store-buffers in TSO analysis. In: CAV.
LNCS, vol. 6806, pp. 99–115. Springer (2011)
10. Bouajjani, A., Derevenetc, E., Meyer, R.: Checking and enforcing robustness against TSO.
In: ESOP. LNCS, vol. 7792, pp. 533–553. Springer (2013)
11. Boudol, G., Petri, G., Serpette, B.P.: Relaxed operational semantics of concurrent program-
ming languages. In: EXPRESS/SOS 2012. EPTCS, vol. 89, pp. 19–33 (2012)
12. Burckhardt, S., Alur, R., Martin, M.M.K.: CheckFence: checking consistency of concurrent
data types on relaxed memory models. In: PLDI. pp. 12–21. ACM (2007)
13. Burckhardt, S., Musuvathi, M.: Effective program verification for relaxed memory models.
In: CAV. LNCS, vol. 5123, pp. 107–120. Springer (2008)
14. Burnim, J., Sen, K., Stergiou, C.: Sound and complete monitoring of sequential consistency
for relaxed memory models. In: TACAS. LNCS, vol. 6605, pp. 11–25. Springer (2011)
15. Burnim, J., Sen, K., Stergiou, C.: Testing concurrent programs on relaxed memory models.
In: ISSTA. pp. 122–132. ACM (2011)
16. Christakis, M., Gotovos, A., Sagonas, K.F.: Systematic testing for detecting concurrency
errors in erlang programs. In: ICST. pp. 154–163. IEEE Computer Society (2013)
17. Clarke, E.M., Grumberg, O., Minea, M., Peled, D.A.: State space reduction using partial
order techniques. STTT 2(3), 279–287 (1999)
18. Dan, A.M., Meshman, Y., Vechev, M.T., Yahav, E.: Predicate abstraction for relaxed memory
models. In: SAS. LNCS, vol. 7935, pp. 84–104. Springer (2013)
19. Demsky, B., Lam, P.: SATCheck: SAT-directed stateless model checking for SC and TSO.
In: OOPSLA 2015. pp. 20–36. ACM (2015)
20. Derevenetc, E., Meyer, R.: Robustness against Power is PSpace-complete. In: ICALP (2).
LNCS, vol. 8573, pp. 158–170. Springer (2014)
21. Flanagan, C., Freund, S.N.: Adversarial memory for detecting destructive races. In: PLDI.
pp. 244–254. ACM (2010)
22. Flanagan, C., Godefroid, P.: Dynamic partial-order reduction for model checking software.
In: POPL. pp. 110–121. ACM (2005)
23. Godefroid, P.: Partial-Order Methods for the Verification of Concurrent Systems - An Ap-
proach to the State-Explosion Problem, LNCS, vol. 1032. Springer (1996)
24. Godefroid, P.: Model checking for programming languages using verisoft. In: POPL. pp.
174–186. ACM Press (1997)
25. Godefroid, P.: Software model checking: The VeriSoft approach. Formal Methods in System
Design 26(2), 77–101 (2005)
26. Huynh, T.Q., Roychoudhury, A.: Memory model sensitive bytecode verification. Formal
Methods in System Design 31(3), 281–305 (2007)
27. IBM: Power ISA, Version 2.07 (2013)
28. Intel Corporation: Intel 64 and IA-32 Architectures Software Developer's Manual (2012)
29. Kuperstein, M., Vechev, M.T., Yahav, E.: Automatic inference of memory fences. In: FM-
CAD. pp. 111–119. IEEE (2010)
30. Kuperstein, M., Vechev, M.T., Yahav, E.: Partial-coherence abstractions for relaxed memory
models. In: PLDI. pp. 187–198. ACM (2011)
31. Lamport, L.: How to make a multiprocessor that correctly executes multiprocess programs.
IEEE Trans. Computers 28(9), 690–691 (1979)
32. Leonardsson, C.: Nidhugg. https://github.com/nidhugg/nidhugg
33. Linden, A., Wolper, P.: A verification-based approach to memory fence insertion in PSO
memory systems. In: TACAS. LNCS, vol. 7795, pp. 339–353. Springer (2013)
34. Liu, F., Nedev, N., Prisadnikov, N., Vechev, M.T., Yahav, E.: Dynamic synthesis for relaxed
memory models. In: PLDI. pp. 429–440. ACM (2012)
35. Mador-Haim, S., Maranget, L., Sarkar, S., Memarian, K., Alglave, J., Owens, S., Alur, R.,
Martin, M.M.K., Sewell, P., Williams, D.: An axiomatic memory model for POWER multi-
processors. In: CAV. LNCS, vol. 7358, pp. 495–512. Springer (2012)
36. Musuvathi, M., Qadeer, S., Ball, T., Basler, G., Nainar, P.A., Neamtiu, I.: Finding and repro-
ducing heisenbugs in concurrent programs. In: OSDI. pp. 267–280. USENIX (2008)
37. Park, S., Dill, D.L.: An executable specification and verifier for relaxed memory order. IEEE
Trans. Computers 48(2), 227–235 (1999)
38. Peled, D.A.: All from one, one for all: on model checking using representatives. In: CAV.
LNCS, vol. 697, pp. 409–423. Springer (1993)
39. Saarikivi, O., Kähkönen, K., Heljanko, K.: Improving dynamic partial order reductions for
concolic testing. In: ACSD. pp. 132–141. IEEE Computer Society (2012)
40. Sarkar, S., Memarian, K., Owens, S., Batty, M., Sewell, P., Maranget, L., Alglave, J.,
Williams, D.: Synchronising C/C++ and POWER. In: PLDI. pp. 311–322. ACM (2012)
41. Sarkar, S., Sewell, P., Alglave, J., Maranget, L., Williams, D.: Understanding POWER mul-
tiprocessors. In: PLDI. pp. 175–186. ACM (2011)
42. Sen, K., Agha, G.: A race-detection and flipping algorithm for automated testing of multi-
threaded programs. In: HVC. LNCS, vol. 4383, pp. 166–182. Springer (2007)
43. Shasha, D., Snir, M.: Efficient and correct execution of parallel programs that share memory.
ACM Trans. Program. Lang. Syst. 10(2), 282–312 (1988)
44. SPARC International, Inc.: The SPARC Architecture Manual Version 9 (1994)
45. Torlak, E., Vaziri, M., Dolby, J.: Memsat: checking axiomatic specifications of memory mod-
els. In: PLDI. pp. 341–350. ACM (2010)
46. Valmari, A.: Stubborn sets for reduced state space generation. In: Advances in Petri Nets.
LNCS, vol. 483, pp. 491–515. Springer (1989)
47. Yang, Y., Gopalakrishnan, G., Lindstrom, G., Slind, K.: Nemos: A framework for axiomatic
and executable specifications of memory consistency models. In: IPDPS. IEEE (2004)
48. Zhang, N., Kusano, M., Wang, C.: Dynamic partial order reduction for relaxed memory
models. In: PLDI. pp. 250–259. ACM (2015)
Appendix Overview. These appendices contain formal definitions and proofs elided
from the main text.
Appendix A contains formal definitions of some concepts used in the semantics (Sec-
tion 2) and RSMC algorithm (Section 3).
Appendix B contains proofs of theorems about the execution model (Section 2): In
particular the proof of equivalence between an execution model and the axiomatic
model it is derived from, the proof of validity of cbpower, and the deadlock freedom
of MPOWER and cbpower.
Appendix C contains proofs of theorems about the RSMC algorithm (Section 3): In
particular the proof of soundness, and the proof of optimality.
A Additional Formal Definitions
Here we provide some formal definitions that were elided from the main text.
A.1 Additional Definitions for Section 2
x = 0
y = 0
thread P: thread Q:
L0: x := 1; L3: r1 := y;
L1: lwfence; L4: ffence;
L2: y := 1; L5: r2 := x;
Fig. 12. An example program: MP+lwfence+ffence.
In the following, we introduce some notations and definitions following [7] that are
needed in order to define dependencies between events. We also give the formal defini-
tion of the partial function valσ which gives the evaluation of an arithmetic expression.
Let σ = (F ,E, po, co, rf) ∈ S be a state. We define two partial functions valσ
and adepsσ over the set of events and arithmetic expressions so that valσ(e, a) is the
value of the arithmetic expression a when evaluated at the event e ∈ F in the state σ,
and adepsσ(e, a) is the set of load events in F which are dependencies for the evalu-
ation of the arithmetic expression a at the event e. Here, valσ(e, a) can be undefined
(valσ(e, a) =⊥) when the value of a at the event e depends on the value of a load
which is not yet executed. Formally, we define valσ(e, a) and adepsσ(e, a) recursively,
depending on the type of arithmetic expression:
– If a is a literal integer i, then valσ(e, a) = i and
adepsσ(e, a) = ∅.
– If a = f(a0, · · · , an) for some arithmetic operator f and subexpressions
a0, · · · , an, then valσ(e, a) = f(valσ(e, a0), · · · , valσ(e, an)) and
adepsσ(e, a) = ⋃_{i=0}^{n} adepsσ(e, ai).
– If a = r for some register r, then let er ∈ F be the po-greatest event such that
(er, e) ∈ po and either instr(er) = r:=a′ or instr(er) = r:=[a′] for some expres-
sion a′.
• If there is no such event er, then valσ(e, a) = 0 and adepsσ(e, a) = ∅.
• If instr(er) = r:=a′, then valσ(e, a) = valσ(er, a′) and adepsσ(e, a) =
adepsσ(er, a′).
• If instr(er) = r:=[a′] and er ∈ E then let ew ∈ E be the event such
that (ew, er) ∈ rf. Let a′′, a′′′ be the arithmetic expressions s.t. instr(ew) =
[a′′]:=a′′′. Now we define valσ(e, a) = valσ(ew, a′′′) and adepsσ(e, a) =
{er}.
• If instr(er) = r:=[a′] and er ∉ E then valσ(e, a) = ⊥ and
adepsσ(e, a) = {er}.
We overload the function adepsσ for event arguments:
– If instr(e) = r:=a, then adepsσ(e) = adepsσ(e, a).
– If instr(e) = if a goto l, then adepsσ(e) = adepsσ(e, a).
– If instr(e) = r:=[a], then adepsσ(e) = adepsσ(e, a).
– If instr(e) = [a]:=a′, then adepsσ(e) = adepsσ(e, a) ∪ adepsσ(e, a′).
– If instr(e) ∈ {sync, lwsync, isync}, then adepsσ(e) = ∅.
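As an illustration, the recursive definitions of valσ and adepsσ above can be sketched in Python. This is a minimal sketch and not part of the formal development: the names State, val_adeps and adeps_event, the tuple encoding of instructions, and the use of None for ⊥ are all assumptions made here. An operator applied to an undefined argument is assumed to yield an undefined result.

```python
from dataclasses import dataclass, field

# Hypothetical encoding (not from the paper):
#   expressions : int literal | register name (str) | (function, [subexpressions])
#   instructions: ("assign", r, a)  for r:=a      ("load", r, a)      for r:=[a]
#                 ("store", a, a2)  for [a]:=a2   ("branch", a, label) for if a goto l
#                 ("fence", name)   for sync, lwsync, isync

@dataclass
class State:
    threads: dict                                  # thread id -> fetched events in program order
    instr: dict                                    # event id -> instruction tuple
    committed: set = field(default_factory=set)    # the set E
    rf: dict = field(default_factory=dict)         # load id -> store id it reads from

def po_predecessors(state, e):
    """Events of the same thread that precede e in program order."""
    for events in state.threads.values():
        if e in events:
            return events[:events.index(e)]
    raise KeyError(e)

def val_adeps(state, e, a):
    """Return (value or None for undefined, set of loads that a depends on at e)."""
    if isinstance(a, int):                         # literal integer
        return a, set()
    if isinstance(a, tuple):                       # operator applied to subexpressions
        f, args = a
        results = [val_adeps(state, e, sub) for sub in args]
        deps = set().union(*(d for _, d in results))
        vals = [v for v, _ in results]
        return (None if None in vals else f(*vals)), deps
    # a is a register: find the po-greatest earlier event that writes it
    writers = [er for er in po_predecessors(state, e)
               if state.instr[er][0] in ("assign", "load") and state.instr[er][1] == a]
    if not writers:
        return 0, set()
    er = writers[-1]
    kind, _, expr = state.instr[er]
    if kind == "assign":
        return val_adeps(state, er, expr)
    if er in state.committed:                      # committed load: value of its rf source
        ew = state.rf[er]
        stored = state.instr[ew][2]
        return val_adeps(state, ew, stored)[0], {er}
    return None, {er}                              # uncommitted load: undefined value

def adeps_event(state, e):
    """The overloaded adeps for event arguments."""
    ins = state.instr[e]
    if ins[0] in ("assign", "load"):
        return val_adeps(state, e, ins[2])[1]
    if ins[0] == "branch":
        return val_adeps(state, e, ins[1])[1]
    if ins[0] == "store":
        return val_adeps(state, e, ins[1])[1] | val_adeps(state, e, ins[2])[1]
    return set()                                   # fences have no dependencies
```

In this encoding, a store whose address is held in a register written by an uncommitted load has an undefined address but a non-empty dependency set, which is the situation the fourth sub-case of the register rule describes.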
x = 0
y = 0
thread P: thread Q:
L0: r0 := x; L3: r1 := x;
L1: if r0 = 1 goto L0; L4: [r1] := 1;
L2: x := 1; L5: r2 := y;
Fig. 13. A program with address and control dependencies.
We also define the address dependency relation addrσ ⊆ (F × F ) to capture how
events depend on earlier loads for the computation of their address. For a memory
access event e where instr(e) is of the form [a]:=a′ or r:=[a], we have (e′, e) ∈ addrσ
for any event e′ ∈ adepsσ(e, a). For instance, in the example described in Figure 13,
there is an address dependency between the load L3 and the store L4.
We define the data dependency relation dataσ ⊆ (F × F ) to capture how events
depend on earlier loads for the computation of their data. For an event e where instr(e)
is of the form r:=a, if a goto l or [a′]:=a, we have (e′, e) ∈ dataσ for any event
e′ ∈ adepsσ(e, a). For instance, in the example described in Figure 2, there is a data
dependency between the load L0 and the store L1.
We define the relation ctrlσ ⊆ F × F to capture how the control flow to an event
depends on earlier loads. For two events e ∈ F and e′ ∈ F we have (e, e′) ∈ ctrlσ
iff instr(e) = r:=[a′] (i.e., e is a load event) and there is a branch event eb with
instr(eb) = if a goto l for some arithmetic expression a and label l such that
(e, eb), (eb, e′) ∈ po and e ∈ adepsσ(eb, a). In the example given in Figure 13, there is
a control dependency between the load L0 and the store L2.
We define the relation po-locσ ⊆ F × F to capture the program order between
accesses to the same memory location: po-locσ = {(e, e′) ∈ po | addressσ(e) =
addressσ(e′) ∧ addressσ(e) ≠ ⊥}. In the example described in Figure 13, the pair (L0,
L2) is in po-locσ .
Finally, we define the relations syncσ, lwsyncσ ⊆ F × F that contain the set of
pairs of events that are separated by the fence instruction sync and lwsync respectively.
For instance, in the example described in Figure 12, syncσ and lwsyncσ will contain
the pairs (L3, L5) and (L0, L2), respectively. We then restrict lwsyncσ to the pairs
{(e, e′) ∈ lwsyncσ | ¬(e is a store and e′ is a load)}, corresponding to the intuition that
the order between a store and a later load is not enforced by an lwsync under POWER.
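The fence relations for the program of Figure 12 can be computed with the following Python sketch. The function names and the list encoding of threads are assumptions made here for illustration. The sketch also assumes that only memory accesses are related by a fence, and that lwfence and ffence in the figure denote lwsync and sync, as the pairs quoted above suggest.

```python
MEMORY = {"load", "store"}

def separated_by(thread, fence):
    """Pairs of memory accesses of one thread with a `fence` instruction between them.

    `thread` is a list of (label, kind) pairs in program order (hypothetical encoding).
    """
    pairs = set()
    for k, (_, kind) in enumerate(thread):
        if kind != fence:
            continue
        before = [lab for lab, kd in thread[:k] if kd in MEMORY]
        after = [lab for lab, kd in thread[k + 1:] if kd in MEMORY]
        pairs.update((a, b) for a in before for b in after)
    return pairs

def restrict_lwsync(pairs, kind_of):
    """Drop store-to-load pairs, whose order lwsync does not enforce."""
    return {(a, b) for a, b in pairs
            if not (kind_of[a] == "store" and kind_of[b] == "load")}

# The program of Figure 12.
P = [("L0", "store"), ("L1", "lwsync"), ("L2", "store")]
Q = [("L3", "load"), ("L4", "sync"), ("L5", "load")]
kind_of = dict(P + Q)

sync_rel = separated_by(P, "sync") | separated_by(Q, "sync")
lwsync_rel = restrict_lwsync(separated_by(P, "lwsync") | separated_by(Q, "lwsync"), kind_of)
```

Here the pair (L0, L2) survives the restriction because both events are stores; a store followed by an lwsync and then a load would be dropped.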
A.2 Additional Definitions for Section 3
Definition of the cut Function In order to define the cut function, we need to define an
auxiliary function cut′. We then define cut(τ , e, σ) = cut′(τ , e, σ, λa.∅).
The function cut′(τ , e, σ,W ) works by recursively traversing τ and removing each
event which is not cbσ-before e. While doing so, for each store ew[n] that is removed,
the parameter n is stored in W (addressσ(ew)). When a store ew[n] is retained in the
run, its parameter n is updated to reflect that all the preceding stores with parameters
W (addressσ(ew)) have disappeared. Formally, the function cut′ is defined as follows:
cut′(〈〉, e, σ, W ) = 〈〉
cut′(e0[p0].τ , e, σ, W ) =
  e0[p0].cut′(τ , e, σ, W )     if e0 ∈ R ∧ (e0, e) ∈ cbσ
  cut′(τ , e, σ, W )            if e0 ∈ R ∧ (e0, e) ∉ cbσ
  e0[p′0].cut′(τ , e, σ, W )    if e0 ∈ W ∧ (e0, e) ∈ cbσ
      where p′0 = p0 − |{i ∈ A | i < p0}|
      where A = W (addressσ(e0))
  cut′(τ , e, σ, W ′)           if e0 ∈ W ∧ (e0, e) ∉ cbσ
      where W ′ = W [a ←↩ W (a) ∪ {p0}]
      where a = addressσ(e0)
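The definition of cut′ can be read as a single left-to-right pass over the run. The following Python sketch follows the four cases literally; the encoding of a run as a list of (event, parameter) pairs and the callback names is_store, cb_before and address are assumptions made here, not notation from the main text.

```python
def cut(run, e, is_store, cb_before, address):
    """Keep only the events of `run` that are cb-before `e`, renumbering store parameters.

    run       : list of (event, parameter) pairs (hypothetical encoding)
    is_store  : event -> bool
    cb_before : (event, event) -> bool, membership in the commit-before relation
    address   : event -> address of a store
    """
    removed = {}        # the map W: address -> parameters of removed stores
    result = []
    for e0, p0 in run:
        keep = cb_before(e0, e)
        if not is_store(e0):
            if keep:                               # retained load: copied unchanged
                result.append((e0, p0))
            continue
        gone = removed.setdefault(address(e0), set())
        if keep:
            # retained store: shift its position past the removed earlier stores
            result.append((e0, p0 - sum(1 for i in gone if i < p0)))
        else:
            gone.add(p0)                           # removed store: remember its parameter
    return result
```

The iterative form is equivalent to the recursive one because the map W is only ever passed forward to the remainder of the run.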
B Proofs for Section 2
Here we provide proofs for the various theorems appearing in Section 2.
B.1 Proof of Theorem 1 (Equivalence of Semantics)
Proof (Proof of Theorem 1). We prove first that {exec(τ(σ0)) | τ ∈ [[P]]ExM,cb} ⊆
[[P]]AxM . This follows directly from the fact that every rule in the operational semantics
checks the new state against M before allowing the transition, and from M(σ0).
We turn instead to proving the other direction [[P]]AxM ⊆ {exec(τ(σ0)) |
τ ∈ [[P]]ExM,cb}. Let π = (E, po, co, rf) be an execution in [[P]]AxM . Let σ =
(λ,E,E, po, co, rf), where λ maps every thread to its final state, be the complete state
corresponding to π. From the assumption that cb is valid w.r.t. M, we know that cbσ is
acyclic. Let τ be some linearization w.r.t. cbσ of the memory access events in E,
instantiated with parameters according to co and rf. We will show that τ is a run in [[P]]ExM,cb
such that τ(σ0) = σ.
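Choosing a linearization of an acyclic relation is a topological sort. The following Python sketch shows one way to obtain such an order using the standard-library graphlib module; the function name linearize and the encoding of cbσ as a set of pairs are assumptions made here. Only the order of events is produced; instantiating the parameters from co and rf is not shown.

```python
from graphlib import TopologicalSorter, CycleError

def linearize(events, cb):
    """Return the events in an order consistent with every cb edge among them.

    events : set of memory access events
    cb     : set of (before, after) pairs; assumed transitively closed, so that
             dropping edges through other events loses no ordering constraint
    Raises CycleError if cb restricted to `events` is cyclic.
    """
    sorter = TopologicalSorter({e: set() for e in events})
    for before, after in cb:
        if before in events and after in events:
            sorter.add(after, before)      # `before` is a predecessor of `after`
    return list(sorter.static_order())
```

Acyclicity of cbσ, which validity of cb guarantees, is exactly the condition under which the call does not raise.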
Let τ = e1[p1].e2[p2] · · · en[pn], and for every 0 ≤ i ≤ n, let τ^i denote the
prefix e1[p1] · · · ei[pi] of τ . In the case that τ^i is a run (it doesn't block), let σ^i =
(λ^i, F^i, E^i, po^i, co^i, rf^i) = τ^i(σ0) for all 0 ≤ i ≤ n.
We will prove by induction that for all 0 ≤ i ≤ n it holds that τ^i is a run, and
σ^i ≤cb σ and the restriction of E^i to memory access events is the set of events in τ^i.
Base case (i = 0): From the definition of runs, we see that τ^0 = 〈〉 is a run if there is a
state σ^0 such that σ0 −FLB(max)→ σ^0, where σ0 denotes the initial state. This holds
vacuously by the definition of −FLB(max)→.
Furthermore, we see from the definition of −FLB(max)→ that F^0 will consist of all events
that can be fetched without committing any branch which depends on a load. The same
events must necessarily be fetched in any complete state, and therefore we have F^0 ⊆
F . We see that E^0 consists of all local instructions that do not depend on any memory
access. For the same reason, the same events must also be committed in any complete
state. And so we have E^0 ⊆ E. It follows similarly that po^0 ⊆ po. Since no memory
access events have been committed, co^0 = ∅ ⊆ co and rf^0 = ∅ ⊆ rf. Hence we
have σ^0 ≤cb σ. No memory access events have been committed by −FLB(max)→, and no
memory access events appear in τ^0 = 〈〉, and so the restriction of E^0 to memory access
events is the set of events in τ^0.
Inductive case (0 < i + 1 ≤ n): We assume as inductive hypothesis that τ^i is a run,
and σ^i ≤cb σ and the restriction of E^i to memory access events is the set of events in
τ^i for some 0 ≤ i < n.
We know that e_{i+1} ∈ F^i, since all earlier branch events in F^i have been committed
(in E^i) by −FLB(max)→ and all loads that they depend on have been committed (notice
(el, e_{i+1}) ∈ ctrlσ ⊆ cb0σ ⊆ cbσ for all such loads el) with the same sources as in
σ. Since the restriction of E^i to memory access events is the set of events in τ^i, we
know that e_{i+1} ∉ E^i. To show that enabled_{σ^i}(e_{i+1}) holds, it remains to show that
for all events e such that (e, e_{i+1}) ∈ cb_{σ^i} it holds that e ∈ E^i. This follows from the
monotonicity of cb and M as follows: Since σ^i ≤cb σ, we have cb_{σ^i} ⊆ cbσ . Since τ
is a linearization of cbσ , for any e with (e, e_{i+1}) ∈ cb_{σ^i} it must hold that either e is a
memory access, and then precedes e_{i+1} in τ and is therefore already committed in σ^i, or
e is a local event which depends only on memory accesses that similarly precede e_{i+1},
in which case e has been committed by −FLB(max)→. Hence we have enabled_{σ^i}(e_{i+1}).
We now split into cases, depending on whether e_{i+1} is a store or a load.
Assume first that e_{i+1} is a store. From the construction of τ , we know that the
parameter p_{i+1} is a coherence position chosen such that p_{i+1} = position_{co^{i+1}} for some
coherence order co^{i+1} ∈ extend_{σ^i}(e_{i+1}) such that e_{i+1} is ordered with the previous
stores in E^i in the same order as in co. In order to show that the rule −e_{i+1}[p_{i+1}]→ applies,
we must first show that enabled_{σ^i}(e_{i+1}) holds, which was established above. The rule
−e_{i+1}[p_{i+1}]→ in the operational
semantics will produce a state σ′ = (λ^i, F^i, E^i ∪ {e_{i+1}}, po^i, co^{i+1}, rf^i), and then
check whether M(exec(σ′)) holds. In order to show that τ^{i+1} is a run, we need to show
that M(exec(σ′)) does indeed hold. Since we have σ^i ≤cb σ, and e_{i+1} was chosen for
τ from the committed events in σ, and co^{i+1} orders e_{i+1} with E^i in the same way as
co, we also have σ′ ≤cb σ. Then by the monotonicity of the memory model M, we
have M(exec(σ′)). Hence τ^{i+1} is a run, and σ^{i+1} = τ^{i+1}(σ0) for some σ^{i+1} such
that σ′ −FLB(max)→ σ^{i+1}. By the same argument as in the base case, it then follows that
σ^{i+1} ≤cb σ and the restriction of E^{i+1} to memory access events is the set of events in
τ^{i+1}.
Assume next that e_{i+1} is a load. From the construction of τ , we know that the
parameter p_{i+1} is the store event ew such that (ew, e_{i+1}) ∈ rf. Since τ is a
linearization of cbσ , and (ew, e_{i+1}) ∈ rf ⊆ cbσ , we know that ew appears before e_{i+1} in τ ,
and hence in τ^i.
Therefore we know that ew is already committed in σ^i, and that it is therefore
available as a parameter for the load e_{i+1}. Since all loads that precede ew and e_{i+1} in cb_{σ^i}
have been committed in the same way as in σ we know that the addresses accessed
by ew and e_{i+1} are computed in the same way in σ^i as in σ, and therefore we have
address_{σ^i}(ew) = address_{σ^i}(e_{i+1}). The rule −e_{i+1}[p_{i+1}]→ in the operational semantics
will produce a state σ′ = (λ^i, F^i, E^i ∪ {e_{i+1}}, po^i, co^i, rf^i ∪ {(ew, e_{i+1})}), and then
check whether M(exec(σ′)) holds. In order to show that τ^{i+1} is a run, we need to show
that M(exec(σ′)) does indeed hold. Since we have σ^i ≤cb σ, and e_{i+1} was chosen
for τ from the committed events in σ, and ew was chosen such that (ew, e_{i+1}) ∈ rf,
we also have σ′ ≤cb σ. Then by the monotonicity of the memory model M, we have
M(exec(σ′)). Hence τ^{i+1} is a run, and σ^{i+1} = τ^{i+1}(σ0) for some σ^{i+1} such that
σ′ −FLB(max)→ σ^{i+1}. By the same argument as in the base case, it then follows that
σ^{i+1} ≤cb σ and the restriction of E^{i+1} to memory access events is the set of events in
τ^{i+1}.
This concludes the inductive sub-proof.
Since τ^n = τ is a run, and σ^n = τ(σ0) ≤cb σ, and the committed memory access
events in τ(σ0) are the same as in σ, and σ is a complete state, we have that τ(σ0) = σ.
Then τ must also be complete, and hence we have τ ∈ [[P]]ExM,cb. This concludes the
proof. ⊓⊔
B.2 Proof of Theorem 2 (Validity of cbpower)
Proof (Proof of Theorem 2). Monotonicity is proven in Lemma 1. Acyclicity is proven
in Lemma 2. That cb0σ ⊆ cbpowerσ for any state σ follows directly from the definition of
cbpower. ⊓⊔
Monotonicity of POWER
Lemma 1. cbpower is monotonic w.r.t. MPOWER.
Proof (Proof of Lemma 1). Assume that cb = cbpower and M = MPOWER. Let σ =
(λ, F ,E, po, co, rf) and σ′ = (λ′, F ′, E′, po′, co′, rf′) be two states such that σ ≤cb σ′.
We prove first condition (i): that if M(σ′) then M(σ). To see this, we need to study
the definition of the POWER axiomatic memory models as given in [7]. We see that
an execution is allowed by the axiomatic memory model, unless it contains certain
cycles in the relations between events. All such forbidden cycles are constructed from
some combination of the following relations: po-loc, co, rf, fr, addr, data, fre, rfe, rfi,
ctrl + isync, coe, ctrl, addr; po, sync, lwsync. The construction of the forbidden cycles
is such that adding more relations between events can never cause a forbidden cycle
to disappear. Studying these relations one by one, we see that for each of them, the
relation in σ is a subset of the relation in σ′. We discuss here only one of the more
interesting cases: po-loc. Consider two events e and e′ which are committed in σ, and
where (e, e′) ∈ po-locσ . The same events must also be committed in σ′, and be ordered
in the same way in program order in σ′ as in σ. Therefore we must argue that e and e′
both access the same memory location in σ′ as in σ. This follows from the fact that the
set of committed events E in σ is cbσ′-closed. Since cbpower contains all three of addr,
data and rf, the computation of the address in e and e′ must produce the same value
in σ′ as in σ. Hence we have (e, e′) ∈ po-locσ′ . Since all of the relations participating
in forbidden cycles in σ are subsets of the corresponding relations in σ′, we know that
any forbidden cycle in σ must also be in σ′. Therefore ¬M(σ)⇒ ¬M(σ′). The contra-
positive gives us condition (i).
We turn now to condition (ii): that cbσ ⊆ cbσ′ . We will show that any edge in cbσ is
also in cbσ′ . From the definition of cbpower, we know that cbσ is the transitive irreflexive
closure of the union of the following relations: addrσ , dataσ , ctrlσ , rf, (addrσ; po),
po-locσ , syncσ , lwsyncσ . We will consider an arbitrary edge (e, e′) ∈ cbσ which is in
one of those relations, and show that (e, e′) is also in the corresponding relation in σ′. If
(e, e′) is in addrσ or dataσ , then e′ uses the value in a register provided by the program
order-earlier event e. We know that in the extended state σ′, the same relation persists.
This is because any new event e′′ which might appear in σ′ and which breaks the data-
flow from e to e′ must be between e and e′ in program order. This would contradict the
assumption that F is a po′-closed subset of F ′. The case when (e, e′) ∈ (addrσ; po)
follows similarly. The case (e, e′) ∈ po-locσ was covered in the proof for condition (i)
above. In all of the remaining cases, ctrlσ , lwsyncσ , syncσ , there is some event e′′ (a
branch or some fence) which comes between e and e′ in program order in σ. Since we
have F ⊆ F ′ and po = po′|F , the same event must also appear in σ′, and cause the
same relation between e and e′.
Finally we turn to proving condition (iii): that for all e ∈ F such that either
enabledσ(e) and e ∉ E′ or e ∈ E, we have (e′, e) ∈ cbσ ⇔ (e′, e) ∈ cbσ′ for all
e′ ∈ F ′. We have already shown that cbσ ⊆ cbσ′ . So we have (e′, e) ∈ cbσ ⇒ (e′, e) ∈
cbσ′ for all e, e′ ∈ F ′. It remains to show that for any event e which is either enabled
or committed in σ, and which does not become committed when extending σ to σ′, we
have (e′, e) ∈ cbσ′ ⇒ (e′, e) ∈ cbσ for all e′ ∈ F ′, i.e., that no additional incoming cb
edges to e appear in σ′ which are not in σ. Let e be such an event. If any new incoming
cb edge to e has appeared in σ′, then there must be an event e′ ∈ F ′ such that (e′, e)
is in one of the relations making up cbσ′ , i.e.: addrσ′ , dataσ′ , ctrlσ′ , rf′, (addrσ′ ; po′),
po-locσ′ , syncσ′ , or lwsyncσ′ . We will show that in each case (e′, e) is also in the
corresponding relation in σ. Since F is a po′-closed subset of F ′, all events which program
order-precede e must be fetched in σ. Hence the data-flow forming address or data de-
pendencies in σ′ are already visible in σ. So if (e′, e) is in addrσ′ or dataσ′ , it must
also be in addrσ or dataσ . The cases when (e′, e) is in ctrlσ′ , (addrσ′ ; po′), syncσ′ or
lwsyncσ′ are similar. If (e′, e) ∈ rf′, then e must be committed in σ′, since read-from
edges are only added upon committing. By assumption we have either enabledσ(e)
and e ∉ E′ or e ∈ E. Therefore we must have e ∈ E. We have rf = rf′|E since σ′
is a cb-extension of σ. Therefore we have (e′, e) ∈ rf. The remaining case is when
(e′, e) ∈ po-locσ′ . Again, since F is po′-closed, we have e′ ∈ F . Since e is either en-
abled or committed in σ, its address must be computed in σ. It remains to show that the
address of e′ is also computed in σ. If, for a contradiction, the address of e′ is not
computed in σ, then there exists another event e′′ such that (e′′, e′) ∈ addrσ , and e′′ ∉ E.
However, then we have (e′′, e) ∈ (addrσ; po) ⊆ cbσ . Since e′′ is not committed in
σ, this would contradict the assumption that e is enabled or committed in σ and E is
cbσ′ -closed. This concludes the proof. ⊓⊔
Acyclicity Proof for cb0 and cbpower under POWER. In the following we prove that
the commit-before functions cb0 and cbpower are acyclic in states that are allowed under
POWER.
Lemma 2. For any state σ such that MPOWER(exec(σ)), the relations cb0σ and cbpowerσ
are acyclic.
From the definition of the commit-before functions, we see that (addrσ ∪ dataσ ∪
ctrlσ) ⊆ cb0σ ⊆ cbpowerσ . Since cbpower is the strongest one, it is sufficient to prove only
the acyclicity of cbpower w.r.t. POWER. Here we assume that the POWER memory
model is defined in the way described in the Herding Cats paper [7]. We will assume
that the reader is familiar with the notations and definitions used in [7].
Define the sets R,W,M ⊆ E as the sets of load events, store events and memory
access events respectively:
R = {e ∈ E|∃r, a.instr(e) = r:=[a]} and
W = {e ∈ E|∃a, a′.instr(e) = [a]:=a′} and M = R∪W. Define RR = R×R. Define
RW, RM, WR, WW, WM, MR, MW, MM similarly.
The proof of the acyclicity of cbpower w.r.t. POWER is done by contradiction. Let
us assume that there is a state σ = (F ,E, po, co, rf) such that M(exec(σ)) holds, with
M = MPOWER, but acyclic(cbpowerσ ) does not. This implies that there is a sequence of events
e0, e1, . . . , en ∈ F such that (e0, e1), (e1, e2), . . . , (en−1, en), (en, e0) ∈ cbpowerσ is a
cycle. Let rfe = {(e, e′) ∈ rf | tid(e) ≠ tid(e′)}, rfi = rf \ rfe, dp = addrσ ∪ dataσ ,
cc0 = dp ∪ ctrlσ ∪ (addrσ; po) ∪ po-locσ , and fences = syncσ ∪ lwsyncσ . First, we will
show the acyclicity of the relation cc0 ∪ fences ∪ rfi.
Lemma 3. The relation cc0 ∪ fences ∪ rfi is acyclic.
Proof. This is an immediate consequence of the fact that po is acyclic by definition and
cc0 ∪ fences ∪ rfi ⊆ po. ⊓⊔
Since rfe is the only relation in the definition of cbpowerσ which relates events of
different threads, the cycle must contain at least two events belonging to two different
threads and related by rfe (otherwise, we would have a cycle in cbpowerσ \ rfe = cc0 ∪
fences ∪ rfi, which contradicts Lemma 3). We assume w.l.o.g. that
tid(e0) ≠ tid(en). Since (en, e0) ∈ rfe, we have that e0 ∈ R and en ∈ W. This implies
(e0, e1) ∉ rfe.
Let i1, i2, . . . , ik ∈ {0, . . . , n} be the maximal sequence of indices such that for
every j ∈ {1, . . . , k}, we have (e_{i_j}, e_{(i_j+1) mod (n+1)}) ∈ rfe. Let i0 = −1. For every
j ∈ {1, . . . , k}, we have e_{i_j} ∈ W ∩ E and e_{i_{j−1}+1} ∈ R ∩ E. In the following, we
will show that (e_{i_{j−1}+1}, e_{i_j}) ∈ ppo ∪ fences for all j ∈ {1, . . . , k}. (Observe that
ppo is defined as in [7].) This can be seen as an immediate consequence of Lemma 4,
since (e_{i_{j−1}+1}, e_{i_j}) ∈ (cc0 ∪ fences ∪ rfi)∗ ∩ RW by definition. This implies that the
sequence of events e0, e_{i_1}, . . . , e_{i_k} forms a cycle in hb = ppo ∪ fences ∪ rfe. Furthermore
e0, e_{i_1}, . . . , e_{i_k} are events in exec(σ). This contradicts the POWER axiom “NO THIN
AIR” which requires the acyclicity of hb in order that M(exec(σ)) holds. The rest of
the proof is dedicated to the proof of the following lemma under the POWER memory
model:
Lemma 4. (cc0 ∪ fences ∪ rfi)∗ ∩ RW ⊆ ppo ∪ fences.
Proof. Assume two events e, e′ ∈ E such that (e, e′) ∈ (cc0 ∪ fences∪ rfi)∗ ∩RW. We
will show that (e, e′) ∈ ppo ∪ fences.
Since (e, e′) ∈ (cc0 ∪ fences ∪ rfi)∗ ∩ RW then there is a sequence of events
e0, e1, . . . , en ∈ E such that e0 = e, en = e′, and (ei−1, ei) ∈ (cc0 ∪ fences ∪ rfi)
for all i ∈ {1, . . . , n}.
Let us assume first that there is some i ∈ {1, . . . , n} such that (ei−1, ei) ∈ fences.
Then, there is some fence (sync or lwsync) which lies between ei−1 and
ei in program order. Since cc0 ∪ fences ∪ rfi ⊆ po it also holds that the fence lies between
e0 and en in program order. Since we know that (e0, en) ∈ RW, we have (e0, en) ∈ fences. Thus we
conclude that (e, e′) is in ppo ∪ fences.
Let us assume now that there is no i ∈ {1, . . . , n} such that (ei−1, ei) ∈ fences. We
will show that for each i ∈ {1, . . . , n}, we have (ei−1, ei) ∈ cc. First observe that rfi ⊆
po-locσ ⊆ cc0. Then, (ei−1, ei) ∈ cc trivially holds since (ei−1, ei) ∈ cc0 ∪ rfi ⊆ cc0
and by definition of cc, we have cc0 ⊆ cc. Since cc is transitive by definition we have
(e, e′) ∈ cc. Finally, from the definition of ppo, we have cc ⊆ ic and (ic∩RW) ⊆ ppo.
This implies that (e, e′) ∈ ppo since (e, e′) ∈ RW. ⊓⊔
This concludes the proof of Lemma 2. ⊓⊔
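The shape of the contradiction used above (a cb cycle must pass through rfe edges, and then yields a cycle in hb) can be checked mechanically on small examples. The following Python sketch tests acyclicity of a union of relations with Kahn's algorithm; the function name is_acyclic and the four-event example are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def is_acyclic(*relations):
    """True iff the union of the given sets of (source, target) pairs has no cycle."""
    succ = defaultdict(set)
    indeg = defaultdict(int)
    nodes = set()
    for rel in relations:
        for a, b in rel:
            nodes.update((a, b))
            if b not in succ[a]:           # count each distinct edge once
                succ[a].add(b)
                indeg[b] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while ready:                           # repeatedly remove nodes without incoming edges
        n = ready.pop()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return removed == len(nodes)           # leftover nodes lie on or behind a cycle

# Hypothetical example: two threads, each with a load ordered before a store by ppo.
ppo = {("r0", "w0"), ("r1", "w1")}
one_rfe = {("w0", "r1")}
two_rfe = {("w0", "r1"), ("w1", "r0")}
```

With a single rfe edge the union is acyclic; adding the second rfe edge closes a cycle that alternates between ppo and rfe edges, which is the kind of hb cycle the NO THIN AIR axiom forbids.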
B.3 Proof of Theorem 3 (Deadlock Freedom of POWER)
We prove here that our operational semantics instantiated with M = MPOWER and cb =
cbpower never deadlocks.
We recall what needs to be proven: Assume that M = MPOWER and cb = cbpower.
Let τ be a run from σ0, with στ = τ(σ0). Assume that e is a memory access event
(load or store) such that enabledστ (e). There exists a parameter p such that τ .e[p] is a
run from σ0 with σ′τ = τ .e[p](σ0).
Proof (Proof of Theorem 3). If e is a store, then let p be the number of committed stores
in στ to the same address as e, so that e becomes co-last in σ′τ . If e is a load, then let p
be the co-last store to addressστ (e).
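The choice of parameter in this proof is simple enough to state as code. The following Python sketch assumes a hypothetical encoding in which co maps each address to the list of committed stores to that address in coherence order, and in which every address has at least one committed store (for example an initializing store) before it is loaded.

```python
def choose_parameter(is_store, addr, co):
    """Pick the parameter used in the deadlock-freedom argument.

    is_store : True if the enabled event is a store, False if it is a load
    addr     : the address accessed by the event
    co       : dict mapping an address to its committed stores in coherence order
               (hypothetical encoding)
    """
    stores = co.get(addr, [])
    if is_store:
        # Position after all committed stores: the new store becomes co-last.
        return len(stores)
    if not stores:
        raise ValueError("no committed store to this address")
    # A load reads from the co-last committed store.
    return stores[-1]
```

For instance, with two committed stores to x, a new store to x receives position 2 and a load of x reads from the second store.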
We will now investigate the new edges in various inter-event relations in σ′τ . Let
πτ = (Eπτ , poπτ , coπτ , rfπτ ) = exec(στ ) and π′τ = (Eπ′τ , poπ′τ , coπ′τ , rfπ′τ ) =
exec(σ′τ ).
E: We have Eπ′τ = Eπτ ∪ {e}.
po-loc: We have po-locπ′τ ⊆ po-locπτ ∪ {(e′, e)|e′ ∈ E}.
co: We have coπ′τ ⊆ coπτ ∪ {(e′, e)|e′ ∈ E}.
fr: We have frπ′τ ⊆ frπτ ∪ {(e′, e)|e′ ∈ E}.
fre: We have freπ′τ ⊆ freπτ ∪ {(e′, e)|e′ ∈ E}.
rf: We have rfπ′τ ⊆ rfπτ ∪ {(e′, e)|e′ ∈ E}.
rfi: We have rfiπ′τ ⊆ rfiπτ ∪ {(e′, e)|e′ ∈ E}.
rfe: We have rfeπ′τ ⊆ rfeπτ ∪ {(e′, e)|e′ ∈ E}.
com: We have comπ′τ ⊆ comπτ ∪ {(e′, e)|e′ ∈ E}.
fences: We have fencesπ′τ ⊆ fencesπτ ∪ {(e′, e)|e′ ∈ E}.
This holds since, for each new edge e0 −fencesπ′τ→ e1 with (e0, e1) ∉ fencesπτ , we must have
e0 = e or e1 = e. Furthermore, we cannot have e −fencesπ′τ→ e1, since fencesπ′τ ⊆
cbpowerπ′τ , and e is cbpowerπ′τ -last.
ffence: We have ffenceπ′τ ⊆ ffenceπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences.
dp: We have dpπ′τ ⊆ dpπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences.
rdw: We have rdwπ′τ ⊆ rdwπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences, where we notice that rdwπ′τ ⊆ po-locπ′τ ⊆ cbpowerπ′τ .
ctrl+cfence: We have ctrl+cfenceπ′τ ⊆ ctrl+cfenceπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences, where we notice that ctrl+cfenceπ′τ ⊆ ctrlπ′τ ⊆
cbpowerπ′τ .
detour: We have detourπ′τ ⊆ detourπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences, where we notice that detourπ′τ ⊆ po-locπ′τ ⊆
cbpowerπ′τ .
po-loc: We have po-locπ′τ ⊆ po-locπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences.
ctrl: We have ctrlπ′τ ⊆ ctrlπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences.
addr;po: We have addrπ′τ ; poπ′τ ⊆ addrπτ ; poπτ ∪ {(e′, e)|e′ ∈ E}.
Same motivation as for fences.
ppo: We have ppoπ′τ ⊆ ppoπτ ∪ {(e′, e)|e′ ∈ E}.
To see this, let (e0, e1) be any edge in ppoπ′τ \ ppoπτ . From the definition of ppo,
we have that e0 −(ii0π′τ ∪ ci0π′τ ∪ cc0π′τ)+→ e1. Furthermore there must be at least one
edge along that chain which is not in ii0πτ ∪ ci0πτ ∪ cc0πτ . We have seen above
that any edge which is in ii0π′τ ∪ ci0π′τ ∪ cc0π′τ but not in ii0πτ ∪ ci0πτ ∪ cc0πτ
must be of the form (e2, e) for some e2. Hence we have
e0 −(ii0π′τ ∪ ci0π′τ ∪ cc0π′τ)∗→ e2 −(ii0π′τ ∪ ci0π′τ ∪ cc0π′τ)→ e −(ii0π′τ ∪ ci0π′τ ∪ cc0π′τ)∗→ e1
But there is no edge going from e in ii0π′τ ∪ ci0π′τ ∪ cc0π′τ , so it must be the case
that e = e1.
hb: We have hbπ′τ ⊆ hbπτ ∪ {(e′, e)|e′ ∈ E}.
By the above, we have
hbπ′τ =
fencesπ′τ ∪ rfeπ′τ ∪ ppoπ′τ ⊆
fencesπτ ∪ rfeπτ ∪ ppoπτ ∪ {(e′, e)|e′ ∈ E} =
hbπτ ∪ {(e′, e)|e′ ∈ E}
Notice that none of the πτ -relations contain links to or from e, since e is not in Eπτ .
We will now show that POWER(π′τ ). We need to show that each of the four
POWER axioms [7] holds for π′τ :
Subproof 3.1: Show acyclic(po-locπ′τ ∪ comπ′τ )
By the assumption POWER(πτ ), we have acyclic(po-locπτ ∪ comπτ ). Since
(po-locπ′τ ∪ comπ′τ ) \ (po-locπτ ∪ comπτ ) only contains edges leading to e, and
po-locπ′τ ∪ comπ′τ contains no edges leading from e, we also have acyclic(po-locπ′τ ∪
comπ′τ ).
Subproof 3.2: Show acyclic(hbπ′τ )
The proof is analogous to that in Subproof 3.1.
Subproof 3.3: Show irreflexive(freπ′τ ; propπ′τ ; hbπ′τ∗)
By the assumption POWER(πτ ), we have irreflexive(freπτ ; propπτ ; hbπτ∗). Assume
for a contradiction that we have (e0, e0) ∈ freπ′τ ; propπ′τ ; hbπ′τ∗. By examining
the definition of prop, we see that every edge building up the chain from e0 to e0
through freπ′τ ; propπ′τ ; hbπ′τ∗ must be in one of freπ′τ , rfeπ′τ , fencesπ′τ , hbπ′τ , comπ′τ or
ffenceπ′τ . Since at least one edge must not be in the corresponding πτ -relation, the chain
of relations must go through e. But we have seen above that there is no edge going out
from e in any of the above mentioned relations. Hence there can be no such cycle.
Subproof 3.4: Show acyclic(coπ′τ ∪ propπ′τ )
The proof is analogous to that in Subproof 3.3.
This concludes the proof. ⊓⊔
C Proofs for Section 3
Here we provide proofs for the theorems appearing in Section 3.
C.1 Proof of Theorem 4 (Soundness of RSMC)
Lemma 5. Assume that cb is valid w.r.t. M, and that M and cb are deadlock
free. Assume that τA = τ .ew[nw].er[ew] is a run from σ0. Further assume
that enabledτ(σ0)(er). Then there is a run τB = τ .er[e′w] from σ0 such that
enabledτB(σ0)(ew).
Proof (Proof of Lemma 5). The lemma follows from deadlock freedom and
monotonicity. ⊓⊔
Lemma 6. Assume that cb is valid w.r.t. M, and that M and cb are deadlock free.
Assume that τA = τ .er[ew0] is a run from σ0. Assume that τB = τ .τ ′.ew[nw].er[ew]
is a run from σ0. Let σB = τB(σ0). Then either
– τC = τ .er[ew1].τ ′ is a run from σ0 and enabledτC(σ0)(ew), or
– there is an event e′w ∈ τ ′ such that τD = τ .τ ′′.e′w[n′w].er[e′w].τ ′′′ is a run from σ0
with τ ′′.e′w[n′w].τ ′′′ = τ ′ and enabledτD(σ0)(ew).
Proof (Proof of Lemma 6). Monotonicity of cb and Lemma 5 give that there exists a
run τ ′B = τ .τ ′.er[p] from σ0 for some store p with enabledτ ′B(σ0)(ew). We case split
on whether or not p ∈ τ ′:
Assume first that p ∉ τ ′. Then since enabledτ(σ0)(er), we know that er is not
cb-related to any event in τ ′. Therefore, we can rearrange τ ′B into τC = τ .er[p].τ ′, which
is a run from σ0. Furthermore, since the committed events are the same in τC as in τ ′B ,
we have enabledτC(σ0)(ew).
Assume instead that p ∈ τ ′. Then τ ′B = τ .τ ′′.e′w[n′w].τ ′′′.er[e′w] for some τ ′′, n′w,
τ ′′′ and e′w = p. Then by the same argument as above, we can reorder this run to form
τD = τ .τ ′′.e′w[n′w].er[e′w].τ ′′′ which is a run from σ0 and τ ′′.e′w[n′w].τ ′′′ = τ ′ and
enabledτD(σ0)(ew). ⊓⊔
We are now ready to state and prove the main lemma, from which the soundness
theorem directly follows:
Lemma 7. Assume that cb is valid w.r.t. M, and that M and cb are deadlock free. Let
τ , σ, π be such that τ(σ0) = σ and exec(σ) = π and M(π). Then for all π′ s.t. M(π′),
and π′ is a complete cb-extension of π, the evaluation of Explore(τ , σ) will contain a
recursive call to Explore(τ ′, σ′) for some τ ′, σ′ such that exec(σ′) = π′.
Proof (Proof of Lemma 7). By assumption there is an upper bound B on the length of
any run of the fixed program. Therefore, we can perform the proof by total induction
on B minus the length of τ .
Fix arbitrary τ , σ = (λ, F ,E, po, co, rf) and π. We will show that for all π′ s.t.
M(π′), and π′ is a complete cb-extension of π, the evaluation of Explore(τ , σ) will
contain a recursive call to Explore(τ ′, σ′) for some τ ′, σ′ such that exec(σ′) = π′. Our
inductive hypothesis states that for all τ ′′, σ′′, π′′ such that |τ | < |τ ′′| and τ ′′(σ0) = σ′′
and exec(σ′′) = π′′ and M(π′′), it holds that for all π′′′ s.t. M(π′′′), and π′′′ is a
complete cb-extension of π′′, the evaluation of Explore(τ ′′, σ′′) will contain a recursive
call to Explore(τ ′′′, σ′′′) for some τ ′′′, σ′′′ such that exec(σ′′′) = π′′′.
Consider the evaluation of Explore(τ , σ). If there is no enabled event on line 2,
then σ is complete, and there are no (non-trivial) cb-extensions of π. The Lemma is
then trivially satisfied. Assume therefore instead that an enabled event e is selected on
line 2.
We notice first that for all executions π′ = (E′, po′, co′, rf′) s.t. M(π′), and π′ is
a complete cb-extension of π, it must be the case that e ∈ E′. This is because π′ is an
extension of π, and e is enabled after π and π′ is complete. Since e is enabled in π, it
must be executed at some point before the execution can become complete.
The event e is either a store or a load. Assume first that e is a store. Let π′ =
(E′, po′, co′, rf′) be an arbitrary execution s.t. M(π′), and π′ is a complete cb-extension
of π. There exists some parameter (natural number) n for e which inserts e in the same
position in the coherence order relative to the other stores in E as e has in π′. Let σn =
τ .e[n](σ0) and πn = exec(σn). Notice that π′ is an extension of πn. We have M(π′),
and by the monotonicity of the memory model we then also have that the parameter n
is allowed for the event e by the memory model, i.e., M(πn). Since the parameter n is
allowed for e in σ by the memory model, (n, σn) will be in S on line 4, and so there will
be a recursive call Explore(τ .e[n], σn) on line 6. The run τ .e[n] is a longer run than τ .
Hence the inductive hypothesis can be applied, and yields that the sought recursive call
Explore(τ ′, σ′) will be made. This concludes the case when e is a store.
Next assume instead that e is a load. Let E be the set of all executions pi′ s.t. M(pi′),
and pi′ is a complete cb-extension of pi. Now define the set E0 ⊆ E s.t. E0 contains
precisely the executions pi′ = (E′, po′, co′, rf′) ∈ E where (ew, e) ∈ rf′ for some
ew ∈ E. I.e. we define E0 to be the executions in E where the read-from source for e is
already committed in pi. Let E1 = E \ E0. We will show that the lemma holds, first for
all executions in E0, and then for all executions in E1.
Let pi′ be an arbitrary execution in E0. We can now apply a reasoning analogue to
the reasoning for the case when e is a store to show that there will be a recursive call
Explore(τ ′, σ′) with exec(σ′) = pi′.
We consider instead the executions in E1. Assume for a proof by contradiction that
there are some executions in E1 that will not be explored. Let E2 ⊆ E1 be the set of
executions in E1 that we fail to explore. I.e., let E2 ⊆ E1 be the set of executions pi′ such
that there are no τ ′, σ′ where τ ′(σ0) = σ′ and exec(σ′) = pi′ and there is a recursive
call Explore(τ ′, σ′) made during the evaluation of Explore(τ , σ). We will now define
a ranking function R over executions in E2, and then investigate one of the executions
that minimize R. For a run τ ′ of the form τ .τ0.ew[nw].e[ew].τ1 (notice the fixed run τ
and the fixed load event e), define R(τ ′) = |τ0|. For an execution pi′ ∈ E2, let T (pi′) be
the set of runs τ ′ of the form τ .τ0.ew[nw].e[ew].τ1 such that exec(τ ′(σ0)) = pi′. Now
define R(pi′) = R(τ ′) for a run τ ′ ∈ T (pi′) that minimizes R within T (pi′).
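As a hedged illustration of the ranking function, the following sketch computes R for runs and for an execution given by its set T of runs. It assumes that a run is encoded as a list of (event, parameter) pairs and that the parameter of a load is the name of the store it reads from; this encoding and the names run_rank and execution_rank are assumptions made for the example only.

```python
def run_rank(run, tau, e):
    """Return |tau0| for a run of the form tau.tau0.ew[nw].e[ew].tau1.

    A run is a list of (event, parameter) pairs (assumed encoding).
    """
    if run[:len(tau)] != tau:
        raise ValueError("run does not extend tau")
    pos = next((k for k, (ev, _) in enumerate(run) if ev == e), None)
    if pos is None or pos - 1 < len(tau):
        raise ValueError("run does not have the required form")
    ew, _ = run[pos - 1]
    if run[pos][1] != ew:
        raise ValueError("e must read from the store committed just before it")
    # tau0 is everything between the prefix tau and the store ew.
    return pos - 1 - len(tau)


def execution_rank(runs, tau, e):
    """Return the minimum rank over the runs in T(pi') of one execution."""
    return min(run_rank(r, tau, e) for r in runs)


tau = [("a", 0)]
r1 = tau + [("b", 0), ("c", 0), ("w", 1), ("e", "w"), ("d", 0)]
r2 = tau + [("w", 1), ("e", "w"), ("b", 0), ("c", 0), ("d", 0)]
rank = execution_rank([r1, r2], tau, "e")
```

In this toy example r1 has rank 2 and r2 has rank 0, so the execution they both produce has rank 0.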
Now let pi′ ∈ E2 be an execution minimizing R in E2 and τ ′ =
τ .τ0.ew[nw].e[ew].τ1 be a run in T (pi′) minimizing R in T (pi′). Let σ′ = τ ′(σ0).
Lemma 6 tells us that either
– τC = τ .e[ew1].τ0 is a run from σ0 and enabledτC(σ0)(ew), or
– there is an event e′w ∈ τ0 such that τD = τ .τ ′.e′w[n′w].e[e′w].τ ′′ is a run from σ0
and τ ′.e′w[n′w].τ ′′ = τ0 and enabledτD(σ0)(ew).
Consider first the case when τC = τ .e[ew1].τ0 is a run from σ0 and
enabledτC(σ0)(ew). Since τC is a run, we know that ew1 is an allowed parameter
for e after τ . So (ew1, σw1) will be added to S for some σw1 on line 11 in the call
Explore(τ , σ), and a recursive call Explore(τ .e[ew1], σw1) will be made on line 13.
Since τ .e[ew1] is a longer run than τ , the inductive hypothesis tells us that all complete
cb-extensions of exec(σw1) will be explored. Let pi′C be any complete cb-extension of
exec(τC(σ0)). Notice that pi′C is also a complete cb-extension of exec(σw1). Since we
have enabledτC(σ0)(ew) it must be the case that ew is committed in pi′C . Then the race
detection code in DetectRace is executed for ew on line 8 in the call to Explore where
ew is committed. When the race detection code is run for ew, the R → W race from
e to ew will be detected. To see this, notice that pi′C is an extension of exec(τC(σ0)).
Hence we know that the events that are cb-before ew in pi′C are the same as the events
that are cb-before ew in τC and also in τ ′. Therefore ew targets the same memory
location in pi′C as it does in τ ′. Furthermore, e cannot be cb-before ew, since e appears after
ew in τ ′. Therefore, the race from e to ew is detected, and the branch τ0.ew[*].e[ew] is
added to Q[e]. When the lines 15-19 are executed in the call Explore(τ , σ), that branch
will be traversed. During the traversal, all parameters for ew will be explored, and in
particular the following call will be made: Explore(τ .τ0.ew[nw].e[ew], σw) for some
σw. Since the run τ .τ0.ew[nw].e[ew] is longer than τ , the inductive hypothesis tells us
that all its complete cb-extensions will be explored. These include pi′, contradicting our
assumption that pi′ is never explored.
Consider then the case when there is an event e′w ∈ τ0 such that τD =
τ .τ ′.e′w[n′w].e[e′w].τ ′′ is a run from σ0 and τ ′.e′w[n′w].τ ′′ = τ0 and enabledτD(σ0)(ew).
Let pi′D be any execution which is a complete cb-extension of exec(τD(σ0)). Since pi′D
can be reached by a run which extends τD, it must be the case that pi′D has a lower rank
than pi′, i.e., R(pi′D) < R(pi′). Since pi′ has a minimal rank in E2, it must be the case
that pi′D ∈ E1 \ E2, and so we know that pi′D is explored by some call Explore(τ ′D, σ′D),
made recursively from Explore(τ , σ), with τ ′D(σ0) = σ′D and exec(σ′D) = pi′D. By
reasoning analogous to that in the previous case, the store ew must be committed in
pi′D, and the R → W race from e to ew is detected by DetectRace. Again, the branch
τ0.ew[*].e[ew] is added to Q[e]. So there will be a recursive call Explore(τw, σw) for
τw = τ .τ0.ew[nw].e[ew] and τw(σ0) = σw where the parameter * for ew has been
instantiated with nw. Since τw is a longer run than τ , the inductive hypothesis tells us
that all extensions of τw will be explored. The sought execution pi′ is an extension of
exec(τw(σ0)), and so it will be explored. This again contradicts our assumption that pi′
is never explored. This concludes the proof. ⊓⊔
Proof (Proof of Theorem 4). The theorem follows directly from Lemma 7, since each
complete execution is a cb-extension of exec(σ0). ⊓⊔
C.2 Proof of Theorem 5 (Optimality of RSMC for POWER)
Lemma 8. Assume that M = MPOWER and cb = cbpower. Let τ , τ .τA.e[pA],
and τ .τB .e[pB ] be runs. Let σA = τ .τA(σ0) and σB = τ .τB(σ0). Let τa =
normalize(cut(τA, e, σA)) and τ b = normalize(cut(τB , e, σB)). Assume τa ≠ τ b.
Then there is some event e′ and parameters pa ≠ pb such that e′[pa] ∈ τa and
e′[pb] ∈ τ b.
Proof (Proof of Lemma 8). We start by considering the control flow leading to e in
the thread tid(e). Either the control flow to e is the same in both τA and τB , or it
differs. Assume first that the control flow differs. Then there is some branch instruction
which program order-precedes e which evaluates differently in τA and τB . That means
that the arithmetic expression which is the condition for a conditional branch evaluates
differently in τA and τB . Our program semantics does not allow data nondeterminism.
The only way for an expression to evaluate differently is for it to depend on the value
of a program order-earlier load el. This loaded value must differ in τA and τB . This
can only happen if el gets its value from some chain (possibly empty) of read-from and
data dependencies, which starts in a load e′l (possibly equal to el) which takes different
parameters in τA and τB . However, since both read-from and data dependencies are a
part of cbpower, this gives us (e′l, el) ∈ cbσA∗ and (e′l, el) ∈ cbσB ∗. Furthermore, since
el provides a value used in a branch that program order-precedes e, we have a control
dependency between el and e. Control dependencies are also part of cbpower, so we have
(e′l, e) ∈ cbσA∗ and (e′l, e) ∈ cbσB ∗. Then e′l is the sought witness event.
Now assume instead that the control flow leading to e is the same in τA and τB .
Consider the sets A = {e ∈ E|∃p.e[p] ∈ τa} and B = {e ∈ E|∃p.e[p] ∈ τ b}.
Assume first that A = B. Then if all events in A (or equivalently in B) have the
same parameters in τa as in τ b, then τa and τ b consist of exactly the same parameterized
events. Since all the events have the same parameters in τa and τ b, they must also be
related in the same way by the commit-before relation. But then since τa and τ b are
normalized, (by the function normalize) it must hold that τa = τ b, which contradicts
our assumption τa ≠ τ b. Hence there must be some event which appears in both τa
and τ b which takes different parameters in τa and τ b.
Assume next that A ≠ B. Assume without loss of generality that A contains some
event which is not in B. Let C be the largest subset of A such that C ∩ B = ∅.
Then by the construction of τa, at least one of the events ec ∈ C must precede some
event ea ∈ (A ∩ B) ∪ {e} in cbσA . Since ec ∉ B we also have (ec, ea) ∉ cbσB .
We now look to the definition of cbpower, and consider the relation between ec and
ea in σA. From (ec, ea) ∈ cbσA we know that (ec, ea) ∈ addrσA ∪ dataσA ∪ ctrlσA ∪
(addrσA ; poA)∪syncσA∪lwsyncσA∪po-locσA∪rfA. However, all the relations addrσA ,
dataσA , ctrlσA , (addrσA ; poA), syncσA , lwsyncσA are given by the program (and the
control flow, which is fixed by assumption). So if (ec, ea) ∈ addrσA∪dataσA∪ctrlσA∪
(addrσA ; poA) ∪ syncσA ∪ lwsyncσA , then the same relation holds in τB , and then we
would have (ec, ea) ∈ cbσB contradicting our previous assumption. Hence it must be
the case that (ec, ea) ∈ po-locσA or (ec, ea) ∈ rfA. If (ec, ea) ∈ rfA then ea ≠ e, since
e is not committed in σA, and therefore cannot have a read-from edge. Hence ea must
then be a load which appears both in τa and τ b. Furthermore, since (ec, ea) ∉ rfB the
load ea must have different parameters in τa and τ b. Then ea is the sought witness.
The final case that we need to consider is when (ec, ea) ∈ po-locσA . Since we have
(ec, ea) ∈ po-locσA and (ec, ea) ∉ po-locσB , the address of either ec or ea must be
computed differently in τa and τ b. Therefore, the differing address must depend on the
value read by some earlier load el, which reads different values in τa and τ b. If the
differing address is in ea, then we have (el, ea) ∈ addrσA and (el, ea) ∈ addrσB . If
the differing address is in ec, then we have (el, ea) ∈ (addrσA ; poA) and (el, ea) ∈
(addrσB ; poB). In both cases we have (el, ea) ∈ cbσA and (el, ea) ∈ cbσB . Hence
el ∈ A ∩ B. Now, by the same reasoning as above, in the case for differing control
flows, we know that there is a load e′l ∈ A ∩ B which has different parameters in τa
and τ b. This concludes the proof. ⊓⊔
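The conclusion of Lemma 8 can be illustrated by a small search for the witness event. The sketch below assumes that the two subruns are given as lists of (event, parameter) pairs; it does not implement normalize or cut, and the name find_witness is hypothetical. Lemma 8 states that, for distinct normalized cuts under the POWER model, such a search cannot fail; the code itself only checks the given inputs.

```python
def find_witness(tau_a, tau_b):
    """Return (event, pa, pb) for an event that occurs in both subruns
    with different parameters, or None if there is no such event."""
    params_b = {ev: p for ev, p in tau_b}
    for ev, p in tau_a:
        if ev in params_b and params_b[ev] != p:
            return ev, p, params_b[ev]
    return None


# Toy subruns: the load r reads from w1 in one and from w2 in the other.
tau_a = [("w1", 0), ("r", "w1")]
tau_b = [("w2", 0), ("r", "w2")]
witness = find_witness(tau_a, tau_b)
```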
We recall the statement of Theorem 5 from Section 3:
Theorem 5 (Optimality for POWER). Assume that M = MPOWER and cb = cbpower.
Let pi ∈ [[P]]AxM . Then during the evaluation of a call to Explore(〈〉, σ0), there will be
exactly one call Explore(τ , σ) such that exec(σ) = pi.
Proof (Proof of Theorem 5). It follows from Corollary 1 that at least one call
Explore(τ , σ) such that exec(σ) = pi is made. It remains to show that at most one
such call is made.
Assume for a proof by contradiction that two separate calls Explore(τa, σa) and
Explore(τ b, σb) are made such that exec(σa) = exec(σb) = pi.
Let Explore(τ c, σc) be the latest call to Explore that is an ancestor of both of
the calls Explore(τa, σa) and Explore(τ b, σb). Since at least two calls were made from
Explore(τ c, σc) to Explore or Traverse, we know that an event e = ec must have been
chosen on line 2 in the call Explore(τ c, σc).
The event ec is a store or a load. Assume first that ec is a store. We see that
Explore(τ c, σc) may make calls to Explore, but not to Traverse. Hence the calls
Explore(τa, σa) and Explore(τ b, σb) must be reached from two different calls to
Explore on line 6 in the call Explore(τ c, σc). However, we see that the different calls
made to Explore from Explore(τ c, σc) will all fix different parameters for ec. There-
fore there will be two stores which are ordered differently in the coherence order of the
different calls to Explore. As we proceed deeper in the recursive evaluation of Explore,
new coherence edges may appear. But coherence edges can never disappear. Therefore
it cannot be the case that exec(σa) = exec(σb), which is the sought contradiction.
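The argument that sibling calls can never reach the same execution can be sketched as follows. The example assumes that the coherence order of one location is a list and that coherence edges are the ordered pairs it induces; the helper names are hypothetical. Two different positions for the same store always produce a pair of stores ordered in opposite ways, and since later steps only add edges, that disagreement persists.

```python
from itertools import combinations


def co_edges(order):
    """All coherence edges (earlier, later) implied by a total order
    given as a list (an assumed encoding)."""
    return set(combinations(order, 2))


def conflicting_edges(order_a, order_b):
    """Pairs ordered one way in order_a and the opposite way in order_b."""
    edges_b = co_edges(order_b)
    return {(x, y) for (x, y) in co_edges(order_a) if (y, x) in edges_b}


# Two sibling calls insert the store e at positions 0 and 1.
order_a = ["e", "w1", "w2"]
order_b = ["w1", "e", "w2"]
conflict = conflicting_edges(order_a, order_b)
```

In this toy case the conflict set contains exactly the pair (e, w1): any execution extending the first call keeps e before w1, while any execution extending the second keeps w1 before e.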
Next we assume instead that ec is a load. In this case the calls Explore(τa, σa) and
Explore(τ b, σb) will be reached via calls to either Explore on line 13 or Traverse on
line 18. If both are reached through calls to Explore, then the contradiction is reached
in a way analogous to the case when ec is a store above.
Assume instead that the call Explore(τa, σa) is reached via a recursive call
to Explore(τ c.ec[ew], σd) from the call Explore(τ c, σc), and that Explore(τ b, σb)
is reached via a call to Traverse(τ c, σc, τ ′c) from the call Explore(τ c, σc). Notice
from the way that subruns in Q[ec] are constructed that τ ′c must have the form
τ ′′c .e′w[*].ec[e′w] for some subrun τ ′′c and some store e′w which is not committed in σc.
In the evaluation of Traverse(τ c, σc, τ ′c), the subrun τ ′c will be traversed, and for any
call to Explore(τe, σe) made during that evaluation it will hold that ec loads from the
store e′w. However, in the call Explore(τ c.ec[ew], σd) it holds that ec loads from ew.
Since ew is committed in σc, it must be that ew ≠ e′w. And so by the same reasoning as
above, we derive the contradiction exec(σa) ≠ exec(σb).
Next we assume that Explore(τa, σa) and Explore(τ b, σb) are reached
via calls from Explore(τ c, σc) to Traverse(τ c, σc, τac .eaw[*].ec[eaw]) and
Traverse(τ c, σc, τ bc.ebw[*].ec[ebw]) respectively. If eaw ≠ ebw, then the contradiction
is derived as above. Assume therefore that eaw = ebw.
If τac = τ bc, then the entire new branches are equal: τac .eaw[*].ec[eaw] =
τ bc.ebw[*].ec[ebw]. By the mechanism on lines 15-19 using the set explored, we know
that the calls Traverse(τ c, σc, τac .eaw[*].ec[eaw]) and Traverse(τ c, σc, τ bc.ebw[*].ec[ebw])
must then be the same call, since Traverse is not called twice with the same new
branch. The call to Traverse will eventually call Explore after traversing the new
branch. From the assumption that the call Explore(τ c, σc) is the last one that is an
ancestor to both calls Explore(τa, σa) and Explore(τ b, σb), we know that the call to
Traverse must perform two different calls to Explore, which will eventually lead to
the calls Explore(τa, σa) and Explore(τ b, σb) respectively. However, when the function
Traverse traverses a branch, all events have fixed parameters, except the last
store (i.e. eaw or ebw). So if different calls to Explore from Traverse lead to the calls
Explore(τa, σa) and Explore(τ b, σb), then eaw (which is the same as ebw) must have
different coherence positions in σa and σb. This leads to the usual contradiction.
Hence we know that τac ≠ τ bc. From the way new branches are constructed in
DetectRace, we know that the branches τac .eaw[*].ec[eaw] and τ bc.ebw[*].ec[ebw] must have
been constructed and added to Q[ec] during the exploration of some continuation of τ c.
So there must exist runs τ c.τAc .eaw[na] and τ c.τBc .ebw[nb] ending in the states σA and
σB respectively. Furthermore τac is the restriction of τAc to events that precede eaw in
cbσA , and τ bc is the restriction of τBc to events that precede ebw in cbσB . From τac ≠ τ bc
and Lemma 8, it now follows that there is an event ed which appears both in τac and
τ bc but which has different parameters in the two subruns. This difference will also be
reflected in σa and σb. Hence exec(σa) ≠ exec(σb), which gives a contradiction and
concludes the proof. ⊓⊔
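The property stated by Theorem 5 can be checked on a log of explored executions, for instance when testing an implementation. The following is a minimal sketch under the assumption that each execution is recorded as a hashable value (here a pair of frozen sets of read-from and coherence edges); the function is_optimal and the log format are hypothetical and not taken from the tool described in the paper.

```python
from collections import Counter


def is_optimal(explored, expected):
    """Check that every expected execution appears exactly once in the
    log and that nothing else was explored."""
    counts = Counter(explored)
    return set(counts) == set(expected) and all(
        c == 1 for c in counts.values()
    )


# Toy executions: (read-from edges, coherence edges).
e1 = (frozenset({("w1", "r")}), frozenset({("w1", "w2")}))
e2 = (frozenset({("w2", "r")}), frozenset({("w1", "w2")}))

ok = is_optimal([e1, e2], {e1, e2})
duplicate = is_optimal([e1, e2, e1], {e1, e2})
missing = is_optimal([e1], {e1, e2})
```

A log with a repeated execution violates the "at most one call" part of the theorem, and a log with a missing execution violates the "at least one call" part.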
