A wide-spectrum language for verification of programs on weak memory
  models by Colvin, Robert J. & Smith, Graeme
arXiv:1802.04406v1 [cs.PL] 13 Feb 2018
A wide-spectrum language for verification of
programs on weak memory models
Robert J. Colvin and Graeme Smith
School of Information Technology and Electrical Engineering
University of Queensland
Abstract. Modern processors deploy a variety of weak memory models,
which for efficiency reasons may (appear to) execute instructions in an
order different to that specified by the program text. The consequences
of instruction reordering can be complex and subtle, and can impact
on ensuring correctness. Previous work on the semantics of weak mem-
ory models has focussed on the behaviour of assembler-level programs.
In this paper we utilise that work to extract some general principles
underlying instruction reordering, and apply those principles to a wide-
spectrum language encompassing abstract data types as well as low-level
assembler code. The goal is to support reasoning about implementations
of data structures for modern processors with respect to an abstract
specification.
Specifically, we define an operational semantics, from which we derive
some properties of program refinement, and encode the semantics in the
rewriting engine Maude as a model-checking tool. The tool is used to val-
idate the semantics against the behaviour of a set of litmus tests (small
assembler programs) run on hardware, and also to model check imple-
mentations of data structures from the literature against their abstract
specifications.
1 Introduction
Modern processor architectures provide a challenge for developing efficient and
correct software. Performance can be improved by parallelising computation to
utilise multiple cores, but communication between threads is notoriously error-
prone. Weak memory models go further and improve overall system efficiency
through sophisticated techniques for batching reads and writes to the same vari-
ables and to and from the same processors. However, code that is run on such
memory models is not guaranteed to take effect in the order specified in the
program code, creating unexpected behaviours for those who are not forewarned
[1]. For instance, the instructions x :=1; y :=1 may be reordered to y :=1; x :=1.
Architectures typically provide memory barrier/fence instructions which can
enforce local ordering, so that x := 1 ; fence ; y := 1 cannot be reordered, but
reduce performance improvements (and so should not be overused).
Previous work on formalising weak memory models has resulted in abstract
formalisations which were developed incrementally through communication with
processor vendors and rigorous testing on real machines [22,3,9]. A large collection
of “litmus tests” has been developed [2,16] which demonstrate the sometimes
confusing behaviour of hardware. We utilise this existing work to provide a
wide-spectrum programming language and semantics that runs on the same re-
laxed principles that apply to assembler instructions. When these principles are
specialised to the assembler of ARM and POWER processors our semantics gives
behaviour consistent with existing litmus tests. Our language and semantics,
therefore, connect instruction reordering to higher-level notions of correctness.
This enables verification of low-level code targeting specific processors against
abstract specifications.
We begin in Sect. 2 with the basis of an operational semantics that allows
reordering of instructions according to pair-wise relationships between instruc-
tions. In Sect. 3 we describe the semantics in more detail, focussing on its in-
stantiation for the widely used ARM and POWER processors. In Sect. 4 we give
a summary of the encoding of the semantics in Maude and its application to
model-checking concurrent data structures. We discuss related work in Sect. 5
before concluding in Sect. 6.
2 Instruction reordering in weak memory models
2.1 Thread-local reorderings
It is typically assumed processes are executed in a fixed sequential order (as given
by sequential composition – the “program order”). However, program order may
be inefficient, e.g., when retrieving the value of a variable from main memory
after setting its value, as in x := 1 ; r := x, and hence weak memory models
sometimes allow execution out of program order to improve overall system effi-
ciency. While many reorderings can seem surprising, there are basic principles
at play which limit the number of possible permutations, the key being that the
new ordering of instructions preserves the original sequential intention.
A classic example of weak memory models producing unexpected behaviour
is the “store buffer” pattern below [2]. Assume that all variables are initially 0,
and that thread-local variables (registers) are named r , r1, r2, etc., and that x
and y are shared variables.
(x := 1 ; r1 := y) ‖ (y := 1 ; r2 := x ) (1)
It is possible to reach a final state in which r1 = r2 = 0 in several weak memory
models: the two assignments in each process are independent (they reference
different variables), and hence can be reordered. From a sequential semantics
perspective, reordering the assignments in process 1, for example, preserves the
final values for r1 and x .
Assume that c and c′ are programs represented as sequences of atomic actions
α ; β ; . . ., as in a sequence of instructions in a processor or more abstractly a
semantic trace. Program c may be reordered to c′, written c ❀ c′, if the following
holds:
1. c′ is a permutation of the actions of c, possibly with some modifications due
to forwarding (see below).
2. c′ preserves the sequential semantics of c. For example, in a weakest precon-
ditions semantics [8], (∀S • wp(c, S )⇒ wp(c′, S )).
3. c′ preserves coherence-per-location with respect to c (cf. po-loc in [3]). This
means that the order of updates and accesses of each shared variable, con-
sidered individually, is maintained.
We formalise these constraints below. The key challenge for reasoning about
programs executed on a weak memory model is that the behaviour of c ‖ d is in
general quite different to the behaviour of c′ ‖ d , even if c ❀ c′.
2.2 Reordering and forwarding instructions
We write $\alpha \overset{r}{\Leftarrow} \beta$ if instruction β may be reordered before instruction α. It is
relatively straightforward to define when two assignment instructions (encompassing
stores, loads, and register operations at the assembler level) may be reordered.
Below let x nfi f mean that x does not appear free in the expression f , and say
expressions e and f are load-distinct if they do not reference any common shared
variables.
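\[
x := e \;\overset{r}{\Leftarrow}\; y := f
\quad\text{if}\quad
\begin{array}{l}
\text{1) } x, y \text{ are distinct; 2) } x \mathrel{\mathsf{nfi}} f\text{; 3) } y \mathrel{\mathsf{nfi}} e\text{; and} \\
\text{4) } e, f \text{ are load-distinct}
\end{array}
\tag{2}
\]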
Note that in general $\overset{r}{\Leftarrow}$ is not symmetric: in TSO processors a load may be
reordered before a store, but not vice versa [23].
Provisos 1), 2) and 3) ensure executing the two assignments in either order
results in the same final values for x and y, and proviso 4) maintains order on
accesses of the shared state. If two updates do not refer to any common variables
they may be reordered. The provisos allow some reordering when they share
common variables. Proviso 1) eliminates reorderings such as (x := 1 ; x := 2)❀
(x := 2 ; x := 1) which would violate the sequential semantics (the final value of
x ). Proviso 2) eliminates reorderings such as (x := 1 ; r := x )❀ (r := x ; x := 1)
which again would violate the sequential semantics (the final value of r). Proviso
3) eliminates reorderings such as (r := y ; y := 1) ❀ (y := 1 ; r := y) which
again would violate the sequential semantics (the final value of r). Proviso 4),
requiring the update expressions to be load-distinct, preserves coherence-per-
location, eliminating reorderings such as (r1 := x ; r2 := x )❀ (r2 := x ; r1 := x ),
where r2 may receive an earlier value of x than r1 in an environment which
modifies x .
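The four provisos of (2) can be sketched as a small executable predicate. This is only an illustration, not the authors' Maude encoding: the expression representation (an integer, a variable name, or an `(op, left, right)` tuple), the `Assign` type and the helper names are all assumptions made for the example, and registers are recognised purely by the naming convention of the text (names beginning with `r`).

```python
from typing import NamedTuple, Union

# Expressions: an int constant, a variable name, or a tuple (op, left, right).
Expr = Union[int, str, tuple]


class Assign(NamedTuple):
    target: str
    expr: Expr


def free_vars(e: Expr) -> set:
    """Names of all variables appearing in expression e."""
    if isinstance(e, int):
        return set()
    if isinstance(e, str):
        return {e}
    _op, left, right = e
    return free_vars(left) | free_vars(right)


def is_shared(v: str) -> bool:
    # Assumed convention (from the text): registers are r, r1, r2, ...;
    # every other name is a shared variable.
    return not v.startswith("r")


def shared_vars(e: Expr) -> set:
    return {v for v in free_vars(e) if is_shared(v)}


def may_reorder(alpha: Assign, beta: Assign) -> bool:
    """True if beta may be reordered before alpha, per provisos 1)-4) of (2)."""
    x, e = alpha
    y, f = beta
    return (
        x != y                                      # 1) distinct targets
        and x not in free_vars(f)                   # 2) x nfi f
        and y not in free_vars(e)                   # 3) y nfi e
        and not (shared_vars(e) & shared_vars(f))   # 4) load-distinct
    )


# The example pairs discussed in the text.
print(may_reorder(Assign("x", 1), Assign("r1", "y")))     # store buffer pair
print(may_reorder(Assign("x", 1), Assign("x", 2)))        # fails proviso 1)
print(may_reorder(Assign("x", 1), Assign("r", "x")))      # fails proviso 2)
print(may_reorder(Assign("r", "y"), Assign("y", 1)))      # fails proviso 3)
print(may_reorder(Assign("r1", "x"), Assign("r2", "x")))  # fails proviso 4)
```

Only the first pair is accepted; each of the others is rejected by the proviso named in its comment.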
In practice, proviso 2) may be circumvented by forwarding.¹ This refers to
taking into account the effect of the earlier update on the expression of the later
one. We write β[α] to represent the effect of forwarding the (assignment) instruction
α to the instruction β. For assignments we define
(y := f)[x := e] = y := (f[x\e])   if e does not refer to global variables   (3)
where the term f[x\e] stands for the syntactic replacement in expression f of
references to x with e. The proviso of (3) prevents additional loads of globals
being introduced by forwarding.
¹ We adopt the term “forwarding” from ARM and POWER [3]. The equivalent effect
is referred to as bypassing on TSO [23].
We specify the reordering and forwarding relationships with other instruc-
tions such as branches and fences in Sect. 3.3.
2.3 General operational rules for reordering
The key operational principle allowing reordering is given by the following tran-
sition rules for a program (α ; c), i.e., a program with initial instruction α.
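\[
\frac{}{(\alpha ; c) \xrightarrow{\alpha} c}\ \text{(a)}
\qquad\qquad
\frac{c \xrightarrow{\beta} c' \qquad \alpha \overset{r}{\Leftarrow} \beta[\alpha]}
     {(\alpha ; c) \xrightarrow{\beta[\alpha]} (\alpha ; c')}\ \text{(b)}
\tag{4}
\]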
Rule (4a) is the straightforward promotion of the first instruction into a step in
a trace, similar to the basic prefixing rules of CCS [18] and CSP [11]. Rule (4b),
however, states that, unique to weak memory models, an instruction of c, say
β, can happen before α, provided that β[α] can be reordered before α according
to the rules of the architecture. Note that we forward the effect of α to β before
deciding if the reordering is possible.
Applying Rule (4b) then Rule (4a) gives the following reordered behaviour
of two assignments.
\[
(r := 1 ; x := r ; \mathbf{nil}) \xrightarrow{x := 1} (r := 1 ; \mathbf{nil}) \xrightarrow{r := 1} \mathbf{nil}
\tag{5}
\]
We use the command nil to denote termination. The first transition above is
possible because we calculate the effect of r := 1 on the update of x before
executing that update, i.e., (x := r)[r := 1] = x := 1.
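Rules (4a) and (4b), together with forwarding (3), can be sketched as a step enumerator over a sequence of assignments. This is a minimal sketch under stated assumptions rather than the paper's tool: programs are tuples of `(target, expr)` pairs, expressions are integers, names or `(op, left, right)` tuples, registers are identified by names starting with `r`, and the helper names `steps` and `traces` are hypothetical.

```python
def free_vars(e):
    """Variables in an expression (int constant, name, or (op, left, right))."""
    if isinstance(e, int):
        return set()
    if isinstance(e, str):
        return {e}
    return free_vars(e[1]) | free_vars(e[2])


def shared_vars(e):
    # Assumed convention: names starting with "r" are registers.
    return {v for v in free_vars(e) if not v.startswith("r")}


def may_reorder(alpha, beta):
    """Provisos of (2): may beta be reordered before alpha?"""
    (x, e), (y, f) = alpha, beta
    return (x != y and x not in free_vars(f) and y not in free_vars(e)
            and not (shared_vars(e) & shared_vars(f)))


def subst(f, x, e):
    """Syntactic replacement of references to x in f by e."""
    if isinstance(f, int):
        return f
    if isinstance(f, str):
        return e if f == x else f
    return (f[0], subst(f[1], x, e), subst(f[2], x, e))


def forward(beta, alpha):
    """beta[alpha] as in (3); beta is unchanged if e mentions a shared variable."""
    (x, e), (y, f) = alpha, beta
    return beta if shared_vars(e) else (y, subst(f, x, e))


def steps(prog):
    """All (label, remaining program) transitions of a tuple of assignments."""
    if not prog:
        return
    alpha, rest = prog[0], prog[1:]
    yield alpha, rest                          # Rule (4a)
    for beta, rest2 in steps(rest):            # Rule (4b)
        fwd = forward(beta, alpha)
        if may_reorder(alpha, fwd):
            yield fwd, (alpha,) + rest2


def traces(prog):
    """All complete traces of prog."""
    if not prog:
        return [()]
    return [(label,) + t for label, rest in steps(prog) for t in traces(rest)]


program = (("r", 1), ("x", "r"))               # r := 1 ; x := r
for t in traces(program):
    print(t)
```

The program of (5) has two traces in this sketch: the program-order one, and the reordered one in which the forwarded store x := 1 is emitted before r := 1.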
The definitions of instruction reordering, $\alpha \overset{r}{\Leftarrow} \beta$, and instruction forwarding,
β[α], are architecture-specific, and are the only definitions required to specify an
architecture’s instruction ordering.² The instantiations for sequentially consistent
processors (i.e., those which do not have a weak memory model) are trivial:
$\alpha \overset{r}{\nLeftarrow} \beta$ for all α, β, and there is no forwarding. Since reordering is not possible, Rule (4b)
never applies and hence the standard prefixing semantics is maintained. TSO is
relatively straightforward: loads may be reordered before stores (provided they
reference different shared variables). In this paper we focus on the more complex
ARM and POWER memory models. These memory models are very similar,
the notable difference being the inclusion of the lightweight fence instruction in
POWER. Due to space limitations, we omit lightweight fences in this paper but
a full definition which has been validated against litmus tests can be found in
Appendix A.
² Different architectures may have different storage subsystems, however, and these
need to be separately defined (see Sect. 3.2).
2.4 Reasoning about reorderings
The operational rules allow a standard trace model of correctness to be adopted,
that is, we say program c refines to program d, written c ⊑ d, iff every trace of
d is a trace of c. Let the program α  c have the standard semantics of prefixing,
that is, the action α always occurs before any action in c (Rule (4a)). Then
we can derive the following laws that show the interplay of reordering and true
prefixing.
α ; c ⊑ α  c (6)
α ; (β  c) ⊑ β[α]  (α ; c)   if $\alpha \overset{r}{\Leftarrow} \beta[\alpha]$   (7)
Note that in Law (7) α may be further reordered with instructions in c. A typical
interleaving law is the following.
(α  c) ‖ d ⊑ α  (c ‖ d) (8)
We may use these laws to show how the “surprise” behaviour of the store buffer
pattern above arises.³ In derivations such as the following, to save space, we
abbreviate a thread α ; nil or α  nil to α, that is, we omit the trailing nil.
(x := 1 ; r1 := y) ‖ (y := 1 ; r2 := x )
⊑ From Law (7) (twice), since $x := 1 \overset{r}{\Leftarrow} r1 := y$ from (2).
(r1 := y  x := 1) ‖ (r2 := x  y := 1)
⊑ Law (8) (four times) and commutativity of ‖.
r1 := y  r2 := x  x := 1  y := 1
If initially x = y = 0, a standard sequential semantics shows that r1 = r2 = 0 is
a possible final state in this behaviour.
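The same conclusion can be checked by brute force. The sketch below enumerates every interleaving of the two store-buffer threads over a single shared state, once with program order only and once with the reordering of Rule (4b). It is an illustration under assumptions, not the paper's Maude model: expressions are restricted to a constant or a single variable, registers are identified by names starting with `r`, the storage is a plain multi-copy atomic dictionary, and all function names are hypothetical.

```python
def free_vars(e):
    # Expressions here are just an int constant or a variable name.
    return set() if isinstance(e, int) else {e}


def shared_vars(e):
    # Assumed convention: names starting with "r" are registers.
    return {v for v in free_vars(e) if not v.startswith("r")}


def may_reorder(alpha, beta):
    (x, e), (y, f) = alpha, beta
    return (x != y and x not in free_vars(f) and y not in free_vars(e)
            and not (shared_vars(e) & shared_vars(f)))


def forward(beta, alpha):
    (x, e), (y, f) = alpha, beta
    return (y, e) if f == x and not shared_vars(e) else beta


def steps(thread, reorder):
    """Transitions of one thread: Rule (4a), plus Rule (4b) if reorder is set."""
    if not thread:
        return
    alpha, rest = thread[0], thread[1:]
    yield alpha, rest
    if reorder:
        for beta, rest2 in steps(rest, reorder):
            fwd = forward(beta, alpha)
            if may_reorder(alpha, fwd):
                yield fwd, (alpha,) + rest2


def outcomes(threads, state, reorder):
    """Final states reachable by interleaving the threads over one shared state."""
    if not any(threads):
        return {tuple(sorted(state.items()))}
    found = set()
    for i, thread in enumerate(threads):
        for (target, e), rest in steps(thread, reorder):
            value = e if isinstance(e, int) else state[e]
            nxt = dict(state, **{target: value})
            found |= outcomes(threads[:i] + (rest,) + threads[i + 1:], nxt, reorder)
    return found


store_buffer = ((("x", 1), ("r1", "y")), (("y", 1), ("r2", "x")))
init = {"x": 0, "y": 0, "r1": 0, "r2": 0}


def register_pairs(reorder):
    """The set of final (r1, r2) values for the store buffer program (1)."""
    finals = outcomes(store_buffer, init, reorder)
    return {(dict(s)["r1"], dict(s)["r2"]) for s in finals}


print(sorted(register_pairs(reorder=False)))   # program order only
print(sorted(register_pairs(reorder=True)))    # with reordering
```

In this sketch the pair (0, 0) is reachable only when reordering is enabled, matching the derivation above.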
3 Semantics
3.1 Formal language
The elements of our wide-spectrum language are actions (instructions) α, com-
mands (programs) c, processes (local state and a command) p, and the top level
system s , encompassing a shared state and all processes. Below x is a variable
(shared or local) and e an expression.
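\[
\begin{array}{rcl}
\alpha & ::= & x := e \mid [e] \mid \mathbf{fence} \mid \mathbf{cfence} \mid \alpha^{*} \\
c & ::= & \mathbf{nil} \mid \alpha ; c \mid c_1 \sqcap c_2 \mid \mathbf{while}\ b\ \mathbf{do}\ c \\
p & ::= & (\mathbf{lcl}\ \sigma \bullet c) \mid (\mathbf{tid}_n\ p) \mid p_1 \parallel p_2 \\
s & ::= & (\mathbf{glb}\ \sigma \bullet p) \mid (\mathbf{stg}\ W \bullet p)
\end{array}
\tag{9}
\]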
³ To focus on instruction reorderings we leave local variable declarations and process
ids implicit, and assume a multi-copy atomic storage system (see Sect. 3.2).
An action may be an update x := e, a guard [e], a (full) fence, a control fence
(see Sect. 3.3), or a finite sequence of actions, α*, executed atomically. Throughout
the paper we denote an empty sequence by 〈〉, and construct a non-empty
sequence as 〈α1, α2, …〉.
A command may be the empty command nil, which is already terminated,
a command prefixed by some action α, a choice between two commands, or an
iteration (for brevity we consider only one type of iteration, the while loop).
Conditionals are modelled using guards and choice.
if b then c1 else c2 =̂ ([b] ; c1) ⊓ ([¬b] ; c2) (10)
A well-formed process is structured as a process id n ∈ PID encompassing a
(possibly empty) local state σ and command c, i.e., a term (tidn lcl σ • c). We
assume that all local variables referenced in c are contained in the domain of σ.
A system is structured as the parallel composition of processes within the
global storage system, which may be either a typical global state, σ, that maps
all global variables to their values (modelling the storage systems of TSO, the
most recent version of ARM, and abstract specifications), or a storage system,
W , formed from a list of “writes” to the global variables (modelling the storage
systems of older versions of ARM and POWER). Hence a system is in one of
the two following forms.
(glb σ • (tid1 lcl σ1 • c1) ‖ (tid2 lcl σ2 • c2) ‖ . . .)
(stg W • (tid1 lcl σ1 • c1) ‖ (tid2 lcl σ2 • c2) ‖ . . .)
(11)
3.2 Operational semantics
The meaning of our language is formalised using an operational semantics, summarised
in Fig. 1. Given a program c the operational semantics generates a trace,
i.e., a possibly infinite sequence of steps $c_0 \xrightarrow{\alpha_1} c_1 \xrightarrow{\alpha_2} \ldots$ where the labels in
the trace are actions, or a special label τ representing a silent or internal step
that has no observable effect.
The terminated command nil has no behaviour; a trace that ends with this
command is assumed to have completed. The effect of instruction prefixing in
Rule (12) is discussed in Sect. 2.3. Note that actions become part of the trace.
We describe an instantiation for reordering and forwarding corresponding to the
semantics of ARM and POWER in Sect. 3.3.
A nondeterministic choice (the internal choice of CSP [11]) can choose either
branch, as given by Rule (13). The semantics of loops is given by unfolding, e.g.,
Rule (14) for a ‘while’ loop. Note that speculative execution, i.e., early execution
of instructions which occur after a branch point [24], is theoretically unbounded,
and loads from inside later iterations of the loop could occur in earlier iterations.
For ease of presentation in defining the semantics for local states, we give
rules for specific forms of actions, i.e., assuming that r is a local variable in the
domain of σ, and that x is a global (not in the domain of σ). The more general
version can be straightforwardly constructed from the principles below.
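\[
\frac{}{(\alpha ; c) \xrightarrow{\alpha} c}\ \text{(a)}
\qquad\qquad
\frac{c \xrightarrow{\beta} c' \qquad \alpha \overset{r}{\Leftarrow} \beta[\alpha]}
     {(\alpha ; c) \xrightarrow{\beta[\alpha]} (\alpha ; c')}\ \text{(b)}
\tag{12}
\]
\[
c \sqcap d \xrightarrow{\tau} c
\qquad\qquad
c \sqcap d \xrightarrow{\tau} d
\tag{13}
\]
\[
\mathbf{while}\ b\ \mathbf{do}\ c \xrightarrow{\tau} ([b] ; c ; \mathbf{while}\ b\ \mathbf{do}\ c) \sqcap ([\neg b] ; \mathbf{nil})
\tag{14}
\]
\[
\frac{c \xrightarrow{r := v} c'}
     {(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{\tau} (\mathbf{lcl}\ \sigma[r := v] \bullet c')}
\tag{15}
\]
\[
\frac{c \xrightarrow{x := r} c' \qquad \sigma(r) = v}
     {(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{x := v} (\mathbf{lcl}\ \sigma \bullet c')}
\tag{16}
\]
\[
\frac{c \xrightarrow{r := x} c'}
     {(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{[x = v]} (\mathbf{lcl}\ \sigma[r := v] \bullet c')}
\tag{17}
\]
\[
\frac{c \xrightarrow{[e]} c'}
     {(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{[e_\sigma]} (\mathbf{lcl}\ \sigma \bullet c')}
\tag{18}
\]
\[
\frac{p \xrightarrow{\alpha} p'}
     {(\mathbf{tid}_n\ p) \xrightarrow{n:\alpha} (\mathbf{tid}_n\ p')}
\tag{19}
\]
\[
\frac{p_1 \xrightarrow{\alpha} p_1'}
     {p_1 \parallel p_2 \xrightarrow{\alpha} p_1' \parallel p_2}
\qquad\qquad
\frac{p_2 \xrightarrow{\alpha} p_2'}
     {p_1 \parallel p_2 \xrightarrow{\alpha} p_1 \parallel p_2'}
\tag{20}
\]
\[
\frac{p \xrightarrow{n:x := e} p'}
     {(\mathbf{glb}\ \sigma \bullet p) \xrightarrow{\tau} (\mathbf{glb}\ \sigma[x := e_\sigma] \bullet p')}
\tag{21}
\]
\[
\frac{p \xrightarrow{n:[e]} p' \qquad e_\sigma \equiv \mathit{true}}
     {(\mathbf{glb}\ \sigma \bullet p) \xrightarrow{\tau} (\mathbf{glb}\ \sigma \bullet p')}
\tag{22}
\]
Fig. 1. Semantics of the language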
Rule (15) states that an action updating variable r to value v results in a
change to the local state (denoted σ[r := v ]). Since this is a purely local operation
there is no interaction with the storage subsystem and hence the transition is
promoted as a silent step τ . Rule (16) states that a store of the value in variable
r to global x is promoted as an instruction x := v where v is the local value for
r . Rule (17) covers the case of a load of x into r . The value of x is not known
locally. The promoted label is a guard requiring that the value read for x is
v . This transition is possible for any value of v , but the correct value will be
resolved when the label is promoted to the storage level. Rule (18) states that a
guard is partially evaluated with respect to the local state before it is promoted
to the global level. The notation eσ replaces x with v in e for all (x ↦ v) ∈ σ.
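The partial evaluation eσ used in Rule (18) amounts to a substitution of local values into an expression. A minimal sketch, assuming expressions are integers, variable names or `(op, left, right)` tuples and that the local state is a dictionary (the name `partial_eval` is hypothetical):

```python
def partial_eval(e, sigma):
    """Replace every variable in the domain of sigma by its local value."""
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return sigma.get(e, e)      # variables outside sigma stay symbolic
    op, left, right = e
    return (op, partial_eval(left, sigma), partial_eval(right, sigma))


local_state = {"r1": 0}
guard = ("=", "r1", "x")            # the guard [r1 = x]
print(partial_eval(guard, local_state))
```

The register r1 is replaced by its local value while the global x is left for the storage level to resolve.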
Rule (19) simply tags the process id to an instruction, to assist in the in-
teraction with the storage system, and otherwise has no effect. Instructions of
concurrent processes are interleaved in the usual way as described by Rule (20).
Other straightforward rules which we have omitted above include the pro-
motion of fences through a local state, and that atomic sequences of actions are
handled inductively by the above rules.
Multi-copy atomic storage subsystem. Traditionally, changes to shared
variables occur on a shared global state, and when written to the global state are
seen instantaneously by all processes in the system. This is referred to as multi-
copy atomicity and is a feature of TSO and the most recent version of ARM [21].
Older versions of ARM and POWER, however, lack such multi-copy atomicity
and require a more complex semantics. We give the simpler case (covered in
Fig. 1) first.⁴
Recall that at the global level the process id n has been tagged to the actions
by Rule (19). Rule (21) covers a store of some expression e to x . Since all local
variable references have been replaced by their values at the process level due to
Rules (15)-(18), expression e must refer only to shared variables in σ. The value
of x is updated to the fully evaluated value, eσ.
Rule (22) states that a guard transition [e] is possible exactly when e evaluates
to true in the global state. If it does not, no transition is possible and
execution stops; this is how incorrect branches are eliminated from the traces.
This corresponds to a false guard, i.e., magic [20,4], and such behaviours do not
terminate and are ignored for the purposes of determining behaviour of a real
system. Interestingly, this concept from standard refinement theory allows us to
handle speculative execution straightforwardly. In existing approaches, the
semantics is complicated by needing to restart reads if speculation proceeds down
the wrong path. Treating branch points as guards works because speculation
should have no effect if the wrong branch was chosen.
To understand how this approach to speculative execution works, consider
the following derivation. Assume that (a) loads may be reordered before guards
if they reference independent variables, and (b) loads may be reordered if they
reference different variables. Recall that we omit trailing nil commands to save
space.
r1 := x ; (if r1 = 0 then r2 := y)
= Definition of if (10)
r1 := x ; (([r1 = 0] ; r2 := y) ⊓ [r1 ≠ 0])
⊑ Resolve to the first branch, since (c ⊓ d) ⊑ c
r1 := x ; [r1 = 0] ; r2 := y
⊑ From Law (7) and assumption (a)
r1 := x ; r2 := y  [r1 = 0]
⊑ From Law (7) and assumption (b)
r2 := y  r1 := x ; [r1 = 0]
This shows that the inner load, r2 := y, may be reordered before the branch
point, and subsequently before an earlier load. Note that this behaviour results
in a terminating trace only if r1 = 0 holds when the guard is evaluated, and
otherwise becomes magic (speculation down an incorrect path). On ARM processors,
placing a control fence (cfence) instruction inside the branch, before
the inner load, prevents this reordering (see Sect. 3.3).
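The treatment of a mis-speculated path as a false guard can be illustrated by replaying the reordered trace on a concrete state. This is a sketch under assumptions: steps are tuples carrying Python functions over a state dictionary, the state is a single flat dictionary, and `run` is a hypothetical helper that returns `None` for a trace that is discarded because a guard fails.

```python
def run(trace, state):
    """Execute a trace on a copy of state; return None if some guard is false."""
    s = dict(state)
    for step in trace:
        if step[0] == "guard":
            if not step[1](s):
                return None              # false guard: no transition ("magic")
        else:
            _, target, expr = step
            s[target] = expr(s)
    return s


# The reordered trace from the last line of the derivation:
# the load of y, then the load of x, then the guard [r1 = 0].
reordered = [
    ("assign", "r2", lambda s: s["y"]),
    ("assign", "r1", lambda s: s["x"]),
    ("guard", lambda s: s["r1"] == 0),
]

print(run(reordered, {"x": 0, "y": 7, "r1": 0, "r2": 0}))  # guard holds
print(run(reordered, {"x": 1, "y": 7, "r1": 0, "r2": 0}))  # guard fails
```

When x is 0 the speculative load of y is kept; when x is 1 the whole trace is dropped, so the early load has no observable effect.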
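⁴ In this straightforward model of shared state there is no global effect of fences, and
we omit the straightforward promotion rule.
Non-multi-copy atomic storage subsystem. Some versions of ARM and
POWER allow processes to communicate values to each other without accessing
the heap. That is, if process p1 is storing v to x, and process p2 wants to load
x into r, p2 may preemptively load the value v into r, before p1’s store hits the
global shared storage. Therefore different processes may have different views of
the value of a global variable, as exposed by litmus tests such as the WRC family
[3].
\[
\frac{p \xrightarrow{n:[x=v]} p' \qquad
      \forall w \in \mathrm{ran}(W_1) \bullet x = w.\mathit{var} \Rightarrow n \notin w.\mathit{seen}}
     {(\mathbf{stg}\ W_1 \frown (x \mapsto v)^{m}_{ns} \frown W_2 \bullet p)
      \xrightarrow{n:[x=v]}
      (\mathbf{stg}\ W_1 \frown (x \mapsto v)^{m}_{ns \cup \{n\}} \frown W_2 \bullet p')}
\tag{23}
\]
\[
\frac{p \xrightarrow{n:x := v} p' \qquad
      \forall w \in \mathrm{ran}(W_1) \bullet n \neq w.\mathit{thread} \land (x = w.\mathit{var} \Rightarrow n \notin w.\mathit{seen})}
     {(\mathbf{stg}\ W_1 \frown W_2 \bullet p)
      \xrightarrow{n:x := v}
      (\mathbf{stg}\ W_1 \frown (x \mapsto v)^{n}_{\{n\}} \frown W_2 \bullet p')}
\tag{24}
\]
\[
\frac{p \xrightarrow{n:\mathbf{fence}} p'}
     {(\mathbf{stg}\ W \bullet p) \xrightarrow{n:\mathbf{fence}} (\mathbf{stg}\ \mathit{flush}_n(W) \bullet p')}
\tag{25}
\]
where
\[
\mathit{flush}_n(\langle\rangle) = \langle\rangle
\qquad
\mathit{flush}_n(w \frown W) =
\begin{cases}
w[\mathit{seen} := \mathrm{PID}] \frown \mathit{flush}_n(W) & \text{if } n \in w.\mathit{seen} \\
w \frown \mathit{flush}_n(W) & \text{otherwise}
\end{cases}
\]
Fig. 2. Rules for the non-multi-copy atomic subsystem of ARM and POWER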
Our approach to modelling this is based on that of the operational model
of [22]. However, that model maintains several partial orders on operations re-
flecting the nondeterminism in the system, whereas we let the nondeterminism
be represented by choices in the operational rules. This means we maintain a
simpler data structure, a single global list of writes. The shared state from the
perspective of a given process is a particular view of this list. There is no single
definitive shared state. In addition, viewing a value in the list causes the list to
be updated and this affects later views. To obtain the value of a variable this
list is searched starting with the most recent write first. A process p1 that has
already seen the later of two updates to a variable x may not subsequently
see the earlier update. Hence the list keeps track of which processes have seen
which stores. Furthermore, accesses of the storage subsystem are influenced by
fences.
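A write w has the syntactic form $(x \mapsto v)^{n}_{ns}$, where x is a global variable
being updated to value v, n is the process id of the process from which the
store originated, and ns is the set of process ids that have “seen” the write.
For such a w, we let w.var = x, w.thread = n and w.seen = ns. For a write
$(x \mapsto v)^{n}_{ns}$ it is always the case that n ∈ ns. The storage W is a list of writes,
initially populated with writes for the initial values of global variables, which all
processes have “seen”.
\[
\begin{array}{rcll}
\alpha & \overset{r}{\nLeftarrow} & \mathbf{fence} & (26) \\
\mathbf{fence} & \overset{r}{\nLeftarrow} & \alpha & (27) \\
{[b]} & \overset{r}{\nLeftarrow} & \mathbf{cfence} & (28) \\
\mathbf{cfence} & \overset{r}{\nLeftarrow} & r := e & (29) \\
{[b_1]} & \overset{r}{\Leftarrow} & [b_2] & (30) \\
{[b]} & \overset{r}{\nLeftarrow} & \varphi := e & (31) \\
{[b]} & \overset{r}{\Leftarrow} & r := e \quad \text{iff } r \mathrel{\mathsf{nfi}} b & (32) \\
x := e & \overset{r}{\Leftarrow} & [b] \quad \text{iff } x \mathrel{\mathsf{nfi}} b & (33) \\
x := e & \overset{r}{\Leftarrow} & y := f \quad \text{iff } x \neq y,\ x \mathrel{\mathsf{nfi}} f,\ y \mathrel{\mathsf{nfi}} e, \text{ and } e, f \text{ are load-distinct} & (34) \\
\alpha & \overset{r}{\Leftarrow} & \beta \quad \text{in all other cases} &
\end{array}
\]
\[
\begin{array}{rcll}
(x := e)[y := f] & = & x := e[y \backslash f] \quad \text{if } f \text{ has no shared variables} & (35) \\
{[e]}[y := f] & = & [e[y \backslash f]] \quad \text{if } f \text{ has no shared variables} & (36) \\
\beta[\alpha] & = & \beta \quad \text{otherwise} &
\end{array}
\]
Fig. 3. Reordering and forwarding following ARM assembler semantics. Let x, y denote
any variable, r a local variable, and ϕ a global variable.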
We give two specialised rules (for a load and store) in Fig. 2.⁵ Rule (23)
states that a previous write to x may be seen by process n if there are no more
recent writes to x that it has already seen. Its id is added to the set of processes
that have seen that write. Rule (24) states that a write to x may be added to the
system by process n, appearing earlier than existing writes in the system, if the
following two conditions hold for each of those existing writes w: they are not
by n (n ≠ w.thread, local coherence), and x = w.var ⇒ n ∉ w.seen, i.e., writes
to the same variable are seen in a consistent order (although not all writes need
be seen).
A fence action by process n ‘flushes’ all previous writes by and seen by n.
The flush function modifies W so that all processes can see every write that n
has seen (which includes all writes by n, since the originating process of a write
is always in its seen set), effectively overwriting earlier writes. This is achieved
by updating each such write so that all processes have seen it, written as
w[seen := PID].
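The write list and Rules (23)-(25) can be sketched directly. The code below is an illustration only, under several assumptions not fixed by the text: the storage is a Python tuple whose index 0 is the most recent write, initial writes are attributed to a hypothetical thread id 0, and the functions `loads`, `stores` and `flush` return the possible successor storages rather than being part of a full transition system.

```python
from typing import FrozenSet, NamedTuple


class Write(NamedTuple):
    var: str
    val: int
    thread: int
    seen: FrozenSet[int]


def loads(W, n, x):
    """Rule (23): every (value, new storage) process n may get when loading x."""
    options = []
    for i, w in enumerate(W):
        if w.var != x:
            continue
        # No more recent write to x may already have been seen by n.
        if all(not (u.var == x and n in u.seen) for u in W[:i]):
            marked = w._replace(seen=w.seen | {n})
            options.append((w.val, W[:i] + (marked,) + W[i + 1:]))
    return options


def stores(W, n, x, v):
    """Rule (24): every storage that can result from process n storing v to x."""
    new = Write(x, v, n, frozenset({n}))
    options = []
    for i in range(len(W) + 1):
        # Writes that end up more recent than the new one must not be by n,
        # and must not be writes to x that n has already seen.
        if all(u.thread != n and not (u.var == x and n in u.seen) for u in W[:i]):
            options.append(W[:i] + (new,) + W[i:])
    return options


def flush(W, n, pids):
    """Rule (25): every write seen by n becomes seen by all processes."""
    return tuple(w._replace(seen=frozenset(pids)) if n in w.seen else w for w in W)


PIDS = {1, 2}
W0 = (Write("x", 0, 0, frozenset(PIDS)),)      # initial value, seen by everyone
(W1,) = stores(W0, 1, "x", 1)                  # process 1 stores 1 to x
print([v for v, _ in loads(W1, 2, "x")])               # before a fence
print([v for v, _ in loads(flush(W1, 1, PIDS), 2, "x")])  # after process 1 fences
```

Before the fence process 2 may read either the new or the initial value of x; after the flush only the new value remains visible to it.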
3.3 Reordering and forwarding for ARM and POWER
Our general semantics is instantiated for ARM and POWER processors in Fig. 3
which provides particular definitions for the reordering relation and forwarding
that are generalised from the orderings on stores and loads in these processors.⁶
⁵ To handle the general case of an assignment x := e, where e may contain more than
one shared variable, the antecedents of the rules are combined, retrieving the value
of each variable referenced in e individually and accumulating the changes to W.
⁶ We have excluded address shifting, which creates address dependencies [3], as this
does not affect the majority of high-level algorithms in which we are interested. However,
address dependencies are accounted for in our tool as discussed in Appendix B.
Fences prevent all reorderings (26, 27). Control fences prevent speculative
loads when placed between a guard and a load (28, 29). Guards may be reordered
with other guards (30), but stores to shared variables may not come before a
guard evaluation (31). This prevents speculative execution from modifying the
global state, in the event that the speculation was down the wrong branch. An
update of a local variable may be reordered before a guard provided it does
not affect the guard expression (32). Guards may be reordered before updates if
those updates do not affect the guard expression (33).
Assignments may be reordered as shown in (34) and discussed in Sect. 2.2.
Forwarding is defined straightforwardly so that an earlier update modifies the
expression of a later update or guard (35, 36), provided it references no shared
variables.
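The reordering relation of Fig. 3 can be written as a single decision function. This is a sketch under assumptions, not the paper's Maude rules: instructions are tuples `("fence",)`, `("cfence",)`, `("guard", b)` and `("assign", x, e)`, expressions are integers, names or `(op, left, right)` tuples, and local variables are recognised by names starting with `r`.

```python
def free_vars(e):
    """Variables in an expression (int constant, name, or (op, left, right))."""
    if isinstance(e, int):
        return set()
    if isinstance(e, str):
        return {e}
    return free_vars(e[1]) | free_vars(e[2])


def is_local(v):
    # Assumed convention: names starting with "r" are local registers.
    return v.startswith("r")


def shared_vars(e):
    return {v for v in free_vars(e) if not is_local(v)}


def may_reorder(alpha, beta):
    """True if beta may be reordered before alpha, following (26)-(34)."""
    ka, kb = alpha[0], beta[0]
    if ka == "fence" or kb == "fence":
        return False                                   # (26), (27)
    if ka == "guard" and kb == "cfence":
        return False                                   # (28)
    if ka == "cfence" and kb == "assign" and is_local(beta[1]):
        return False                                   # (29)
    if ka == "guard" and kb == "guard":
        return True                                    # (30)
    if ka == "guard" and kb == "assign":
        if not is_local(beta[1]):
            return False                               # (31)
        return beta[1] not in free_vars(alpha[1])      # (32)
    if ka == "assign" and kb == "guard":
        return alpha[1] not in free_vars(beta[1])      # (33)
    if ka == "assign" and kb == "assign":              # (34)
        (_, x, e), (_, y, f) = alpha, beta
        return (x != y and x not in free_vars(f) and y not in free_vars(e)
                and not (shared_vars(e) & shared_vars(f)))
    return True                                        # all other cases


guard = ("guard", ("=", "r1", 0))
load = ("assign", "r2", "y")
store = ("assign", "x", 1)
print(may_reorder(guard, load))            # speculative load past a guard
print(may_reorder(guard, store))           # store past a guard
print(may_reorder(("cfence",), load))      # load past a control fence
```

A load can be speculated ahead of an unrelated guard, whereas a store to a shared variable cannot, and a control fence between the guard and the load blocks the speculation.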
4 Model checking concurrent data structures
Our semantics has been encoded in the Maude rewriting system [6]. We have
used the resulting prototype tool to validate the semantics against litmus tests
which have been used in other work on ARM (348 tests) [9] and POWER (758
tests) [22]. As that research was developed through testing on hardware and in
consultation with the processor vendors themselves we consider compliance with
those litmus tests to be sufficient validation. With two exceptions, as discussed
in Sect. 5, our semantics agrees with those results.
We have employed Maude as a model checker to verify that a (test-and-set)
lock provides mutual exclusion on ARM and POWER, and that a lock-free stack
algorithm, and a deque (double-ended queue) algorithm, satisfy their abstract
specifications on ARM and POWER. We describe the verification of the deque
below, in which we found a bug in the published algorithm.
4.1 Chase-Lev deque
Lê et al. [15] present a version of the Chase-Lev deque [5] adapted for ARM
and POWER. The deque is implemented as an array, where elements may be
put on or taken from the tail, and additionally, processes may steal an element
from the head of the deque. The put and take operations may be executed by a
single process only, hence there is no interference between these two operations
(although instruction reordering could cause consecutive invocations to overlap).
The steal operation can be executed by multiple processes concurrently.
The code we tested is given in Fig. 4 where L is the maximum size of the
deque which is implemented as a cyclic array, with all elements initialised to
some irrelevant value. The original code includes handling array resizing, but
here we focus on the insert/delete logic. For brevity we omit trailing nils. We
have used a local variable return to model the return value, and correspondingly
have refactored the algorithm to eliminate returns from within a branch. A
CAS(x, r, e) (compare-and-swap) instruction atomically compares the value of
global x with the value r and, if they are the same, updates x to e. We model a
conditional statement with a CAS as follows.
if CAS(x, r, e) then c1 else c2 =̂ (〈[x = r] , x := e〉 ; c1) ⊓ ([x ≠ r] ; c2)   (37)
Initial state: {head ↦ 0, tail ↦ 0, tasks ↦ 〈_, _, …〉}
put(v) =̂
  lcl t ↦ _ •
  t := tail ;
  tasks[t mod L] := v ;
  fence ;
  tail := t + 1
take =̂
  lcl h ↦ _, t ↦ _, return ↦ _ •
  t := tail − 1 ;
  tail := t ;
  fence ;
  h := head ;
  if h ≤ t then
    return := tasks[t mod L] ;
    if h = t then
      if ¬CAS(head, h, h + 1) then
        return := empty
      tail := t + 1
  else
    return := empty ;
    tail := t + 1
steal =̂
  lcl h ↦ _, t ↦ _, return ↦ _ •
  h := head ;
  fence ;
  t := tail ;
  cfence ; // unnecessary
  if h < t then
    return := tasks[h mod L] ;
    cfence ; // incorrectly placed
    if ¬CAS(head, h, h + 1) then
      return := fail
  else
    return := empty
Fig. 4. A version of Lê et al.’s work-stealing deque algorithm for ARM [15]
The put operation straightforwardly adds an element to the end of the deque,
incrementing the tail index. It includes a full fence so that the tail pointer is
not incremented before the element is placed in the array. The take operation
uses a CAS operation to atomically increment the head index. Interference can
occur if there is a concurrent steal operation in progress, which also uses CAS
to increment head to remove an element from the head of the deque. The take
and steal operations return empty if they observe an empty deque. In addition,
the steal operation may return the special value fail if interference on head
occurs. Complexity arises if the deque has one element and there are concurrent
processes trying to both take and steal that element at the same time.
Operations take and steal use a fence operation to ensure they have con-
sistent readings for the head and tail indexes, and later use CAS to atomically
update the head pointer (only if necessary, in the case of take). Additionally, the
steal operation contains two cfence barriers (ctrl_isync in ARM).
Verification. We use an abstract model of the deque and its operations to
specify the allowed final values of the deque and return values. The function
last(q) returns the last element in q and front(q) returns q excluding its last
element.
put(v) =̂ q := q ⌢ 〈v〉
take =̂ lcl return := none •
    〈[q = 〈〉] , return := empty〉 ⊓
    〈[q ≠ 〈〉] , return := last(q) , q := front(q)〉
steal =̂ lcl return := none •
    〈[q = 〈〉] , return := empty〉 ⊓
    〈[q ≠ 〈〉] , return := head(q) , q := tail(q)〉
The abstract specification for steal does not attempt to detect interference and
return fail . As such we exclude these behaviours of the concrete code from the
analysis.
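For reference, the abstract specification can be mirrored by three pure functions on a Python list. This is a sketch under assumptions: the sequence q is a list whose last element is the tail end, each operation returns the pair (return value, new q) instead of updating state in place, and `EMPTY` is a hypothetical stand-in for the value empty.

```python
EMPTY = "empty"          # hypothetical marker for the 'empty' return value


def put(q, v):
    """q := q ^ <v>: append v at the tail end."""
    return None, q + [v]


def take(q):
    """Remove and return the last element, or report an empty deque."""
    if not q:
        return EMPTY, q
    return q[-1], q[:-1]          # last(q), front(q)


def steal(q):
    """Remove and return the first element, or report an empty deque."""
    if not q:
        return EMPTY, q
    return q[0], q[1:]            # head(q), tail(q)


_, q = put([], 1)
_, q = put(q, 2)
print(take(q))     # takes from the tail end
print(steal(q))    # steals from the head end
```

Each operation is atomic here by construction, which is what the angle-bracketed sequences in the specification express.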
We model-checked combinations of one to three processes operating in par-
allel, each executing one or two operations in sequence. The final states of the
abstract and concrete code were compared via a simulation relation. This ex-
posed a bug in the code which may occur when a put and steal operation execute
in parallel on an empty deque. The load return := tasks [h mod L] can be specu-
latively executed before the guard h < t is evaluated, and hence also before the
load of tail . Thus the steal process may load head , load an irrelevant return value,
at which point a put operation may complete, storing a value and incrementing
tail . The steal operation resumes, loading the new value for tail and observing
a non-empty deque, succeeding with its CAS and returning the irrelevant value,
which was loaded before the put operation had begun.
Swapping the order of the second cfence with the load of tasks [h mod L]
eliminates this bug, and our analysis did not reveal any other problems. In
addition, eliminating the first cfence does not change the possible outcomes.
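The problematic interleaving can be replayed deterministically. The sketch below is an illustration in Python of the single trace described above, not the concrete code of [15] nor our Maude model; the buffer size L = 4, the STALE marker, the value 42 and the function replay are all hypothetical, and memory is treated as a plain dictionary.

```python
L = 4             # hypothetical buffer size
STALE = "stale"   # hypothetical marker for whatever the slot held before the put


def replay(speculate_task_load: bool):
    """Replay one interleaving of steal and put on an empty deque.

    If speculate_task_load is True, the load of tasks[h mod L] is executed
    before the load of tail and the guard h < t, as in the reported bug.
    """
    mem = {"head": 0, "tail": 0, "tasks": [STALE] * L}

    # steal: load head
    h = mem["head"]
    ret = None
    if speculate_task_load:
        ret = mem["tasks"][h % L]   # early load of an irrelevant value

    # a concurrent put(42) runs to completion
    mem["tasks"][mem["tail"] % L] = 42
    mem["tail"] += 1

    # steal resumes: load tail and evaluate the guard h < t
    t = mem["tail"]
    if not h < t:
        return "empty"
    if not speculate_task_load:
        ret = mem["tasks"][h % L]   # load held back until after the guard

    # CAS(head, h, h + 1)
    if mem["head"] == h:
        mem["head"] = h + 1
        return ret
    return "fail"
```

Under these assumptions replay(True) returns the stale slot contents while replay(False) returns the value stored by the put, which mirrors the effect of placing the control fence before the load of tasks [h mod L].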
5 Related work
This work makes use of an extensive suite of tests elucidating the behaviour of
weak memory models in ARM and POWER via both operational and axiomatic
semantics [3,22,17,9]. Those semantics were developed and validated through
testing on real hardware and in consultation with processor vendors themselves.
Our model is validated against their results, in the form of the results of litmus
tests.
Excluding two tests involving “shadow registers”, which appear to be pro-
cessor-specific facilities which are not intended to conform to sequential seman-
tics (they do not correspond to higher-level code), all of the 348 ARM litmus
tests run on our model agreed with the results in [9], and all of the 758 POWER
litmus tests run on our model agreed with the results in [22], with the exception
of litmus test PPO015, which we give below, translated into our formal
language.7
x := 1 ; fence ; y := 1 ‖
r0 := y ; z := (r0 xor r0) + 1 ; z := 2 ; r3 := z ;
(if r3 = r3 then nil else nil) ; cfence ; r4 := x
(38)
The tested condition is z = 2 ∧ r0 = 1 ∧ r4 = 0, which asks whether it is
possible to load x (the last statement of process 2) before loading y (the first
statement of process 2). At first glance the control fence prevents the load
of x happening before the branch. However, as indicated by litmus tests such
as MP+dmb.sy+fri-rfi-ctrisb [9, Sect. 3, Out of order execution], under some
circumstances the branch condition can be evaluated early, as discussed in the
speculative execution example. We expand on this below by manipulating the
second process, taking the case where the success branch of the if statement
is chosen. To aid clarity we underline the instruction that is the target of the
(next) refinement step.
r0 := y ; z := (r0 xor r0) + 1 ; z := 2 ; r3 := z ; [r3 = r3] ; cfence ; r4 := x
⊑ Promote load with forwarding (from z := 2), from Laws (6) and (7)
r3 := 2  r0 := y ; z := (r0 xor r0) + 1 ; z := 2 ; [r3 = r3] ; cfence ; r4 := x
⊑ Promote guard by Laws (6) and (7) (from (33))
r3 := 2  [r3 = r3]  r0 := y ; z := (r0 xor r0) + 1 ; z := 2 ; cfence ; r4 := x
⊑ Promote control fence by Laws (6) and (7) ((28) does not now apply)
r3 := 2  [r3 = r3]  cfence  r0 := y ; z := (r0 xor r0) + 1 ; z := 2 ; r4 := x
⊑ Promote load by Laws (6) and (7)
r3 := 2  [r3 = r3]  cfence  r4 := x  r0 := y ; z := (r0 xor r0) + 1 ; z := 2
The load r4 := x has been reordered before the load r0 := y, and hence when in-
terleaved with the first process from (38) it is straightforward that the condition
may be satisfied.
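The final claim can be checked by brute force. The following Python sketch enumerates all interleavings of the two processes of (38) and tests the condition z = 2 ∧ r0 = 1 ∧ r4 = 0, once with process 2 in program order and once in the reordered form derived above. It assumes a simple sequentially consistent shared store (not the storage subsystem of our semantics), treats fences and the guard [r3 = r3] as no-ops, and uses hypothetical helper names (assign, skip, interleavings, reachable).

```python
from itertools import combinations


def assign(var, expr):
    """Build an atomic step `var := expr(state)`."""
    def step(s):
        s[var] = expr(s)
    return step


def skip(s):
    """Fences and the trivially true guard [r3 = r3] have no effect here."""


# Process 1: x := 1 ; fence ; y := 1
P1 = [assign("x", lambda s: 1), skip, assign("y", lambda s: 1)]

LOAD_Y = assign("r0", lambda s: s["y"])
STORE_Z1 = assign("z", lambda s: (s["r0"] ^ s["r0"]) + 1)
STORE_Z2 = assign("z", lambda s: 2)
LOAD_X = assign("r4", lambda s: s["x"])

# Process 2 in program order, and in the reordered form (r3 := 2 by forwarding).
P2_PROGRAM_ORDER = [LOAD_Y, STORE_Z1, STORE_Z2,
                    assign("r3", lambda s: s["z"]), skip, skip, LOAD_X]
P2_REORDERED = [assign("r3", lambda s: 2), skip, skip,
                LOAD_X, LOAD_Y, STORE_Z1, STORE_Z2]


def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def reachable(p2):
    """Is z = 2 and r0 = 1 and r4 = 0 a possible final state?"""
    for run in interleavings(P1, p2):
        s = dict(x=0, y=0, z=0, r0=0, r3=0, r4=0)
        for step in run:
            step(s)
        if s["z"] == 2 and s["r0"] == 1 and s["r4"] == 0:
            return True
    return False
```

With these definitions reachable(P2_PROGRAM_ORDER) is False and reachable(P2_REORDERED) is True: once r4 := x precedes r0 := y, the first process can run entirely between the two loads.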
In the Flowing/POP model of [9], this behaviour is forbidden because there
is a data dependency from the load of y into r0 to r3, via z . This appears to
be because of the consecutive stores to z , one of which depends on r0. In the
testing of real processors reported in [9], the behaviour that we allow was never
observed, but is allowed by the model in [3]. As such we deem this discrepancy
to be a minor issue in Flowing/POP (preservation of transitive dependencies)
rather than a fault in our model.
Our model of the storage subsystem is similar to that of the operational models
of [22,9]. However our thread model is quite different, being defined in terms
of relationships between actions. The key difference is how we handle branching
and the effects of speculative execution. The earlier models are complicated in
the sense that they are closer to the real execution of instructions on a processor,
involving restarting reads if an earlier read invalidates the choice taken at a
branch point.
7 We simplified some of the syntax for clarity, in particular introducing a higher-level
if statement to model a jump command and implicit register (referenced by the compare
(CMP) and branch-not-equal (BNE) instructions). We have also combined some
commands, retaining dependencies, in a way that is not possible in the assembler
language. The xor operator is exclusive-or; its use here artificially creates a data
dependency [3] between the updates to r0 and z.
The axiomatic models, as exemplified by [3], define relationships between in-
structions in a whole-system way, including relationships between instructions in
concurrent processes. This gives a global view of how an architecture’s reorder-
ing rules (and storage system) interact to reorder instructions in a system. Such
global orderings are not immediately obvious from our pair-wise orderings on
instructions. On the other hand, those global orderings become quite complex
and obscure some details, and it is unclear how to extract some of the generic
principles such as (2).
6 Conclusion
We have utilised earlier work to devise a wide-spectrum language and semantics
for weak memory models which is relatively straightforward to define and extend,
and which lends itself to verifying low-level code against abstract specifications.
While abstracting away from the details of the architecture, we believe it provides
a complementary insight into why some reorderings are allowed, requiring a pair-
wise relationship between instructions rather than one that is system-wide.
A model-checking approach based on our semantics exposed a bug in an
algorithm in [15] in relation to the placement of a control fence. The original
paper includes a hand-written proof of the correctness of the algorithm based
on the axiomatic model of [17]. The possible traces of the code were enumerated
and validated against a set of conditions on adding and removing elements from
the deque (rather than with respect to an abstract specification of the deque).
The conditions being checked are non-trivial to express using final state analysis
only. An advantage of having a semantics that can apply straightforwardly to
abstract specifications, rather than a proof technique that analyses behaviours
of the concrete code only, is that we may reason at a more abstract level.
We have described the ordering condition as syntactic constraints on atomic
actions, which fits with the low level decisions of hardware processors such as
ARM and POWER. However our main reordering principle (2) is based on se-
mantic concerns, and as such may be applicable as a basis for understanding
the interplay of software memory models, compiler optimisations and hardware
memory models [14].
The wide-spectrum language has as its basic instruction an assignment, which
is sufficient for specifying many concurrent programs. However we hope to ex-
tend the language to encompass more general constructs such as the specification
command [19] and support rely-guarantee reasoning [12,13,10,7].
Acknowledgements. We thank Kirsten Winter for feedback on this work, and
acknowledge the support of Australian Research Council Discovery Grant DP160102457.
References
1. Sarita V. Adve and Hans-J. Boehm. Memory models: A case for rethinking parallel
languages and hardware. Commun. ACM, 53(8):90–101, August 2010.
2. Jade Alglave, Luc Maranget, Susmit Sarkar, and Peter Sewell. Litmus: Running
tests against hardware. In Parosh Aziz Abdulla and K. Rustan M. Leino, editors,
Tools and Algorithms for the Construction and Analysis of Systems: 17th Interna-
tional Conference, TACAS 2011, Held as Part of the Joint European Conferences
on Theory and Practice of Software, ETAPS 2011, Saarbrücken, Germany, March
26–April 3, 2011. Proceedings, pages 41–44, Berlin, Heidelberg, 2011. Springer
Berlin Heidelberg.
3. Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: Modelling,
simulation, testing, and data mining for weak memory. ACM Trans. Program.
Lang. Syst., 36(2):7:1–7:74, July 2014.
4. R. J. Back and J. von Wright. Refinement Calculus: A Systematic Introduction.
Springer-Verlag, 1998.
5. David Chase and Yossi Lev. Dynamic circular work-stealing deque. In SPAA’05:
Proceedings of the 17th annual ACM symposium on Parallelism in algorithms and
architectures, pages 21–28, New York, NY, USA, 2005. ACM Press.
6. Manuel Clavel, Francisco Durán, Steven Eker, Patrick Lincoln, Narciso Martí-Oliet,
José Meseguer, and José F. Quesada. Maude: specification and programming in
rewriting logic. Theoretical Computer Science, 285(2):187 – 243, 2002.
7. Robert J. Colvin, Ian J. Hayes, and Larissa A. Meinicke. Designing a semantic
model for a wide-spectrum language with concurrency. Formal Aspects of Com-
puting, 29(5):853–875, Sep 2017.
8. Edsger W. Dijkstra. Guarded commands, nondeterminacy and formal derivation
of programs. Commun. ACM, 18(8):453–457, August 1975.
9. Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc
Maranget, Will Deacon, and Peter Sewell. Modelling the ARMv8 architecture,
operationally: Concurrency and ISA. In Proceedings of the 43rd Annual ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL
’16, pages 608–621, New York, NY, USA, 2016. ACM.
10. Ian J. Hayes, Robert J. Colvin, Larissa A. Meinicke, Kirsten Winter, and An-
drius Velykis. An algebra of synchronous atomic steps. In John Fitzgerald, Con-
stance Heitmeyer, Stefania Gnesi, and Anna Philippou, editors, FM 2016: Formal
Methods: 21st International Symposium, Limassol, Cyprus, November 9-11, 2016,
Proceedings, pages 352–369, Cham, 2016. Springer International Publishing.
11. C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, Inc., Upper
Saddle River, NJ, USA, 1985.
12. Cliff B. Jones. Specification and design of (parallel) programs. In IFIP Congress,
pages 321–332, 1983.
13. Cliff B. Jones. Tentative steps toward a development method for interfering pro-
grams. ACM Trans. Program. Lang. Syst., 5:596–619, October 1983.
14. Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer.
A promising semantics for relaxed-memory concurrency. In Proceedings of the
44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL
2017, pages 175–189, New York, NY, USA, 2017. ACM.
15. Nhat Minh Lê, Antoniu Pop, Albert Cohen, and Francesco Zappa Nardelli. Correct
and efficient work-stealing for weak memory models. In Proceedings of the 18th
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
PPoPP ’13, pages 69–80, New York, NY, USA, 2013. ACM.
16. Sela Mador-Haim, Rajeev Alur, and Milo M. K. Martin. Generating litmus tests
for contrasting memory consistency models. In Tayssir Touili, Byron Cook, and
Paul Jackson, editors, Computer Aided Verification: 22nd International Confer-
ence, CAV 2010, Edinburgh, UK, July 15-19, 2010. Proceedings, pages 273–287,
Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
17. Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Al-
glave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek
Williams. An axiomatic memory model for POWER multiprocessors. In Proceed-
ings of the 24th International Conference on Computer Aided Verification, CAV’12,
pages 495–512, Berlin, Heidelberg, 2012. Springer-Verlag.
18. Robin Milner. A Calculus of Communicating Systems. Springer-Verlag New York,
Inc., 1982.
19. Carroll Morgan. The specification statement. ACM Trans. Program. Lang. Syst.,
10:403–419, July 1988.
20. Carroll Morgan. Programming from Specifications. Prentice Hall, second edition,
1994.
21. Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and
Peter Sewell. Simplifying ARM concurrency: Multicopy-atomic axiomatic and op-
erational models for ARMv8. In Proceedings of the ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages (POPL). ACM Press, 2018.
To appear.
22. Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams.
Understanding POWER multiprocessors. SIGPLAN Not., 46(6):175–186, June
2011.
23. Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Mag-
nus O. Myreen. X86-TSO: A rigorous and usable programmer’s model for x86
multiprocessors. Commun. ACM, 53(7):89–97, July 2010.
24. Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consis-
tency and Cache Coherence. Morgan & Claypool Publishers, 1st edition, 2011.
A Lightweight fences
POWER’s lightweight fences maintain order between loads, loads then stores,
and stores, but not stores and subsequent loads (loads can come before ear-
lier stores). If lightweight fences did not maintain load-load order it would be
straightforward to define their effect in terms of one instruction. However to
allow reordering later loads with earlier stores but not earlier loads we model a
lwfence as two “gates”, one blocking loads and one stores.
We define lwfence ; c as fenceL ; fenceS ; c where
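\[
\begin{array}{ll}
r := x \;\overset{\textsc{r}}{\nLeftarrow}\; \mathsf{fenceL} \qquad &
\mathsf{fenceL} \;\overset{\textsc{r}}{\nLeftarrow}\; r := x \\[4pt]
x := v \;\overset{\textsc{r}}{\nLeftarrow}\; \mathsf{fenceS} \qquad &
\mathsf{fenceS} \;\overset{\textsc{r}}{\nLeftarrow}\; x := v
\end{array}
\tag{39}
\]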
Consider the code c ; fenceL ; fenceS ; d for arbitrary c and d . Without the
intervening gates that constitute a lightweight fence the instructions of c and
d could be reordered according to the usual restrictions. The fenceS instruc-
tion however prevents stores in d from interleaving with instructions in c (but
loads may be reordered past the fenceS instruction). Additionally, the fenceL
may be reordered past any stores in c but not past any loads. Hence stores from c
and loads from d can mix between the two fence instructions, where they may
be reordered subject to the usual constraints. The lightweight fence therefore
maintains store/store, load/load, and load/store order.
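The argument above can be mechanised for the four load/store combinations. The following Python sketch encodes only the gate constraints of (39); it assumes that the two gates keep their relative order, that the two instructions are otherwise free to reorder (for example, they access distinct variables), and it uses hypothetical names (crosses, may_overtake).

```python
KINDS = ("load", "store")


def crosses(instr: str, gate: str) -> bool:
    """Constraint (39): loads cannot cross fenceL, stores cannot cross fenceS."""
    blocked = (gate == "fenceL" and instr == "load") or \
              (gate == "fenceS" and instr == "store")
    return not blocked


def may_overtake(earlier: str, later: str) -> bool:
    """For c ; fenceL ; fenceS ; d, can `later` (in d) move ahead of `earlier` (in c)?

    The two instructions can be swapped only if they can meet in a common
    region: before both gates, between the gates, or after both gates.
    """
    meet_before = crosses(later, "fenceS") and crosses(later, "fenceL")
    meet_between = crosses(earlier, "fenceL") and crosses(later, "fenceS")
    meet_after = crosses(earlier, "fenceL") and crosses(earlier, "fenceS")
    return meet_before or meet_between or meet_after


# The only pair whose order is not preserved by the lightweight fence.
relaxed = [(e, l) for e in KINDS for l in KINDS if may_overtake(e, l)]
```

Under these assumptions relaxed contains only the pair ("store", "load"): an earlier store and a later load can meet between the gates, and every other combination is kept in order.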
A lightweight fence also has a global effect on the storage system, which
we encode in the semantics of the fenceS instruction. A lightweight fence by
n marks any store in W seen by n with a tag lwf(n). A store w by process
n may not be inserted in W before a write with the lwf(n) tag. In addition,
if another process m loads a value stored by n, it sees not only that store but
also all stores marked with lwf(n). This transitive effect gives cumulativity of
lightweight fences [3].
As fenceS is the latter to reach the storage system we give it the global
effect; a fenceL instruction has no global effect.
\[
\frac{p \xrightarrow{\,n:\mathsf{fenceS}\,} p'}
     {(\mathbf{stg}\ W \bullet p) \xrightarrow{\,n:\mathsf{fenceS}\,}
      (\mathbf{stg}\ \mathit{lwflush}_n(W) \bullet p')}
\tag{40}
\]
where
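\[
\begin{array}{l}
\mathit{lwflush}_n(\langle\rangle) = \langle\rangle \\[4pt]
\mathit{lwflush}_n\big((x \mapsto v)^{m}_{ns} \frown W\big) =
\begin{cases}
(x \mapsto v)^{m}_{\mathsf{lwf}(n) \frown ns} \frown \mathit{lwflush}_n(W) & \text{if } n \in ns \\[2pt]
(x \mapsto v)^{m}_{ns} \frown \mathit{lwflush}_n(W) & \text{otherwise}
\end{cases}
\end{array}
\]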
Adding the lwf(n) tag to the list of process ids that have observed a write affects
the allowed ordering of how writes are seen. The key point is that n now sees,
and lightweight-fences, writes by m that m has lightweight-fenced.
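The function lwflush translates directly into code. The following is a minimal Python sketch, assuming a write is recorded with its variable, value, writing process and the sequence of tags that have seen it; the Write class, the lwf helper and the tuple encoding of tags are illustrative assumptions rather than the representation used in our Maude encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A tag is either a process id or a marker lwf(n), encoded here as ("lwf", n).
Tag = Union[int, Tuple[str, int]]


@dataclass(frozen=True)
class Write:
    var: str                 # variable written, x
    val: int                 # value written, v
    writer: int              # writing process, m
    seen: Tuple[Tag, ...]    # processes (and lwf tags) that have seen it, ns


def lwf(n: int) -> Tag:
    return ("lwf", n)


def lwflush(n: int, W: List[Write]) -> List[Write]:
    """Tag every write in W that process n has seen with lwf(n)."""
    return [
        Write(w.var, w.val, w.writer, (lwf(n),) + w.seen) if n in w.seen else w
        for w in W
    ]


W = [Write("x", 1, writer=1, seen=(1, 2)), Write("y", 1, writer=2, seen=(2,))]
flushed = lwflush(1, W)
```

Here process 1 has seen the write to x but not the write to y, so only the first entry acquires the tag lwf(1).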
The antecedent for Rule (24) needs to be updated to include lwf(n) ∉ w.seen
as a further constraint on where writes can be placed in the global order W : a
write may not come before a write that the process has lightweight-fenced, even
if that write is to a different variable.
B Address shifting
In ARM (and POWER) the value loaded from (or stored to) an address may be
shifted. For the majority of high-level algorithms such details are hidden. However,
address shifting is investigated at the hardware level because it can affect
reordering, via so-called “address dependencies” [3]. The instruction LDR R1, [R2, X]
loads into R1 the value at address X shifted by the amount in R2. Precisely
modelling the semantics of address shifting requires a more concrete model than the
one we propose. However, as determined by the litmus tests of [9], the effects of
address dependencies can be investigated even when the shift amount is 0 (resulting
in a load of the value at the address). As such we define that an address
shift of 0 on a variable x gives x, and leave the effect of other shift amounts
undefined.
Address dependencies constrain the reorderings in the following ways: a
branch may not be reordered before a load or store with an (unresolved) ad-
dress dependency; a store may not be reordered before an instruction with an
(unresolved) address dependency; and any instruction α which shares a regis-
ter or variable with β where β has an unresolved address dependency may not
be reordered with β. We incorporate these conditions into the general rule for
assignments.
A further consequence of address shifting is that a load r2 := x may be
reordered before r1 := [n, x ] even though this would violate load-distinctness.
However, to preserve coherence-per-location, the load into r2 must not load a
value of x that was written before the value read by the load into r1. This complex
situation is handled in [9] by restarting load instructions if an earlier value is
read into r2. We handle it more abstractly by treating the load as speculation,
where if an earlier value for r2 is loaded then the effect of that speculation is
thrown away.
We can give this extra semantics by adding an extra operational rule which
applies only in those specific circumstances.
\[
\frac{p \xrightarrow{\,r_2 := x\,} p'}
     {r_1 := [n, x] \mathbin{;} p \xrightarrow{\,r_2 := x\,}
      r_1 := [n, x] \mathbin{;} [r_1 = r_2] \mathbin{;} p'}
\tag{41}
\]
In practical terms it is possible that the first load of x (into r1) is delayed while
determining the offset value. The later load is allowed to proceed, freeing up p′
to continue speculatively executing until the dependency is resolved. The load
into r1 must then still be issued, and its result is checked against r2. This check
is needed to preserve coherence, since the load into r2 cannot read a value earlier
than that read into r1. Note that loads in p′ can now potentially be reordered
to execute ahead of the load into r1.
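The role of the guard [r1 = r2] can be illustrated on a timeline of values of x. The Python sketch below assumes that history[t] is the value of x visible at time t, that the promoted load into r2 executes at some time t2 and the delayed load into r1 at a time t1 ≥ t2, and that distinct writes store distinct values; the function surviving_pairs and this timeline encoding are hypothetical simplifications.

```python
from typing import List, Tuple


def surviving_pairs(history: List[int]) -> List[Tuple[int, int]]:
    """Return the (t2, t1) load times that pass the guard [r1 = r2].

    r2 is loaded early (at t2) and r1 later (at t1 >= t2). A run where the
    values differ means r2 holds an older value than r1, which would violate
    coherence, so the guard discards that speculation.
    """
    kept = []
    for t2 in range(len(history)):
        for t1 in range(t2, len(history)):
            r2, r1 = history[t2], history[t1]
            if r1 == r2:
                kept.append((t2, t1))
    return kept


# x holds 0 at times 0 and 1, then a write makes it 1 at time 2.
kept = surviving_pairs([0, 0, 1])
```

In this example the pairs (0, 2) and (1, 2) are discarded: in both, the early load into r2 read 0 while the later load into r1 read the newer value 1.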