A high-level operational semantics for hardware weak memory models by Colvin, Robert J. & Smith, Graeme
arXiv:1812.00996v1 [cs.LO] 3 Dec 2018
A high-level operational semantics for hardware weak
memory models
Robert J. Colvin
School of Electrical Engineering and Information Technology
The University of Queensland
Graeme Smith
School of Electrical Engineering and Information Technology
The University of Queensland
Abstract
Modern processors deploy a variety of weak memory models, which for effi-
ciency reasons may execute instructions in an order different to that spec-
ified by the program text. The consequences of instruction reordering can
be complex and subtle, and can impact on ensuring correctness. In this pa-
per we build on extensive work elucidating the semantics of assembler-level
languages on hardware architectures with weak memory models (specifically
TSO, ARM and POWER) and lift the principles to a straightforward op-
erational semantics which allows reasoning at a higher level of abstraction.
To this end we introduce a wide-spectrum language that encompasses op-
erations on abstract data types as well as low-level assembler code, define
its operational semantics using a novel approach to allowing reordering of
instructions, and derive some refinement laws that can be used to explain
behaviours of real processors. In this framework memory models are mostly
distinguished via a pair-wise static ordering on instruction types that deter-
mines when later instructions may be reordered before earlier instructions. In
addition, memory models may use different types of storage systems. For in-
stance, non-multicopy atomic systems allow sibling processes to see updates
to different variables in different orders.
We encode the semantics in the rewriting engine Maude as a model-checking
tool, and develop confidence in our framework by validating our semantics
against existing sets of litmus tests (small assembler programs), comparing
our results with those observed on hardware and in existing semantics. We
also use the tool as a prototype to model check implementations of data
structures from the literature against their abstract specifications.
Keywords: weak memory models, operational semantics, verification
Email addresses: r.colvin@uq.edu.au (Robert J. Colvin), smith@itee.uq.edu.au
(Graeme Smith)
Preprint submitted to Theoretical Computer Science December 5, 2018
1. Introduction
Modern processor architectures provide a challenge for developing efficient
and correct software. Performance can be improved by parallelising compu-
tation and utilising multiple cores, but communication between threads is
notoriously error prone. Weak memory models go further and improve overall
system efficiency through sophisticated techniques for batching reads and
writes to the same variables and to and from the same processors. However,
code that is run on such memory models is not guaranteed to execute in the
order specified in the program text, creating unexpected behaviours for those
who are not forewarned [1]. To aid the programmer, architectures typically
provide memory barrier/fence instructions which can enforce thread-local or-
der corresponding to the program text, but if overused fences can eliminate
performance gains.
Previous work on formalising hardware weak memory models has resulted
in abstract formalisations which were developed incrementally through com-
munication with processor vendors and rigorous testing on real machines
[2, 3, 4]. A large collection of “litmus tests” [5, 6] demonstrates the
sometimes confusing behaviour of hardware. We build on this existing work and
provide a programming language and operational semantics that runs on the
same relaxed principles that apply to the assembler instructions. The seman-
tics is validated against litmus tests from the literature, and then applied to
model check some realistic concurrent data structures.
We begin in Sect. 2 with the basis of a straightforward operational seman-
tics that allows reordering of instructions according to pair-wise relationships
between instructions, and an overview of the results of the paper. In Sect. 3
we introduce our wide-spectrum language and an informal description of the
instructions. In Sect. 4 we describe the semantics in more detail, and derive
some properties that support algebraic and refinement-based reasoning as a
basis for theorem proving. Later we show the instantiations of the thread-
local definitions to three well-known weak memory models, TSO [7] in Sect. 5,
ARM [4] in Sections 6 and 7 and POWER [2] in Sect. 8. We then consider the
implications of weak memory models on more complex algorithms in Sect. 9:
we verify a simple lock [8, Sect. 7.3], the Treiber lock-free stack [9] run-
ning on ARM and POWER, and find (and fix) a bug in an implementation
of the Chase-Lev work-stealing deque (double-ended queue) [10] developed
specifically for ARM [11]. We discuss related work in Sect. 10.
Contributions. This paper extends our earlier work [12] in the following ways:
• We address TSO.
• We take into account a more recent version of ARM.
• We compare our semantics to a larger set of litmus test results (over
18,000 in this paper vs. approx 1,000 in [12]), and as a result handle
more constructs (e.g., POWER’s lightweight fences and eieio fences),
and other types of constraints (e.g., address shifting).
• We apply the semantics to more case studies.
2. Instruction reordering in weak memory models
2.1. Thread-local reorderings
It is typically assumed processes are executed in a fixed sequential order
(as given by sequential composition – the “program order”). However pro-
gram order may be inefficient, e.g., when retrieving the value of a variable
from main memory after setting its value, as in x := 1 ; r := x, and hence
weak memory models sometimes allow execution to appear out of program
order to improve overall system efficiency. Specifically, in the above case, the
value 1 may be used for r in later calculations, possibly including writing to
some other shared variable, without waiting for the update to x to propagate
to all other threads in the system. While many reorderings can seem sur-
prising, there are basic principles at play which limit the number of possible
permutations, the key being that the new ordering of instructions preserves
the original sequential intention.
A classic example of weak memory models producing unexpected be-
haviour is the “store buffer” pattern below [5]. Assume that all variables are
initially 0, that r1 and r2 are thread-local variables, and that x and y are
shared variables.
(x := 1 ; r1 := y) ‖ (y := 1 ; r2 := x ) (1)
It is possible to reach a final state in which r1 = r2 = 0 in several weak
memory models: the two assignments in each process are independent (they
reference different variables), and hence can be reordered. From a sequential
semantics perspective, reordering the assignments in process 1, for example,
preserves the final values for r1 and x .
Assume that c and c ′ are programs represented as sequences of atomic
actions α ; β ; . . ., as in a sequence of instructions of a thread or more
abstractly a semantic trace. Program c may be reordered to c′, written
$c \overset{r}{\sqsubseteq} c'$, if the following holds:
1. c ′ is a permutation of the actions of c, possibly with some modifications
due to forwarding (see below).
2. c ′ preserves the sequential semantics of c. For example, in a weak-
est preconditions semantics [13], for all predicates P , wp(c,P) ⇒
wp(c ′,P).
3. c ′ preserves coherence-per-location with respect to c (cf. po-loc in
[3]). This means that the order of updates and accesses of each shared
variable, considered individually, is maintained.
We formalise these constraints in the context of pair-wise reordering of in-
structions below. The key challenge for reasoning about programs executed
on a weak memory model is that the behaviour of c ‖ d is in general quite
different to the behaviour of c′ ‖ d, even if $c \overset{r}{\sqsubseteq} c'$.
2.2. Reordering and forwarding instructions
We write $\alpha \overset{r}{\Leftarrow} \beta$ if instruction β may be reordered
before instruction α.
For TSO the well-known weakening of instruction order is that loads can
appear before stores (to different variables). If we let x and y be shared
variables, r be a local variable, and v some value, we can represent this as
$$x := v \;\overset{r}{\Leftarrow}\; r := y$$
where x and y are distinct. This rule applies to a specific case of assignment
statements that correspond to assembler-level stores and loads; the relation
is generalised to all assignments for TSO in Sect. 5.
We give the more complex rule for reordering of updates in the ARM and
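POWER memory models below. We use the notation $e \overset{v}{\nsim} f$ to mean
that expressions e and f do not reference any variables in common; hence
$x \overset{v}{\nsim} f$ can be read as “x is not free in f”. The related notation
$e \overset{sv}{\nsim} f$ is weaker, requiring only that the shared variables of e
and f are distinct.
$$e \overset{v}{\nsim} f \quad\text{iff the free variables of } e \text{ and } f \text{ are distinct} \tag{2}$$
$$e \overset{sv}{\nsim} f \quad\text{iff the shared variables of } e \text{ and } f \text{ are distinct} \tag{3}$$
$$x := e \;\overset{r}{\Leftarrow}\; y := f \quad\text{iff}\quad \text{(i) } x \overset{v}{\nsim} y \quad \text{(ii) } x \overset{v}{\nsim} f \quad \text{(iii) } y \overset{v}{\nsim} e \quad \text{(iv) } e \overset{sv}{\nsim} f \tag{4}$$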
Note that $\overset{r}{\Leftarrow}$ as defined above is symmetric; however, when
calculated after
the effect of forwarding is applied (as described below) there are instructions
that may be reordered in one direction but not the other. The relation is
neither reflexive nor transitive.
Provisos (i), (ii) and (iii) ensure executing the two assignments in either
order results in the same final values for x and y , and proviso (iv) maintains
order on accesses of the shared state. If two updates do not refer to any
common variables they may be reordered.
Proviso (i) eliminates reorderings such as
$(x := 1 ; x := 2) \overset{r}{\sqsubseteq} (x := 2 ; x := 1)$,
which would violate the sequential semantics (the final value of x). Proviso
(ii) eliminates reorderings such as
$(x := 1 ; r := x) \overset{r}{\sqsubseteq} (r := x ; x := 1)$,
which again would violate the sequential semantics (the final value of r).
Proviso (iii) eliminates reorderings such as
$(r := y ; y := 1) \overset{r}{\sqsubseteq} (y := 1 ; r := y)$,
which again would violate the sequential semantics (the final value of r).
Proviso (iv), requiring the update expressions’ shared variables are distinct,
preserves coherence-per-location, eliminating reorderings such as
$(r_1 := x ; r_2 := x) \overset{r}{\sqsubseteq} (r_2 := x ; r_1 := x)$,
where r2 may receive an earlier value of x than
r1 in an environment which modifies x.
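The four provisos of (4) can be checked mechanically. The following is a minimal Python sketch, not the Maude encoding described later; the `Assign` class, the flat token representation of expressions, and the `SHARED` set naming the global variables are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption for this sketch: these names are the shared (global) variables;
# every other name is a thread-local register.
SHARED = {"x", "y", "z"}


@dataclass(frozen=True)
class Assign:
    target: str   # variable being updated
    expr: tuple   # flat tuple of tokens, e.g. ("r1", "+", 1)


def free_vars(expr):
    """Variable names occurring in an expression (operators and ints are skipped)."""
    return {tok for tok in expr if isinstance(tok, str) and tok.isidentifier()}


def can_reorder(alpha, beta):
    """True if beta may be reordered before alpha, per provisos (i)-(iv) of (4)."""
    fe, ff = free_vars(alpha.expr), free_vars(beta.expr)
    return (
        alpha.target != beta.target        # (i)   distinct targets
        and alpha.target not in ff         # (ii)  x is not free in f
        and beta.target not in fe          # (iii) y is not free in e
        and not (fe & ff & SHARED)         # (iv)  no shared variable read by both
    )


# The four counterexamples from the text, followed by an allowed reordering.
print(can_reorder(Assign("x", (1,)), Assign("x", (2,))))        # violates (i)
print(can_reorder(Assign("x", (1,)), Assign("r", ("x",))))      # violates (ii)
print(can_reorder(Assign("r", ("y",)), Assign("y", (1,))))      # violates (iii)
print(can_reorder(Assign("r1", ("x",)), Assign("r2", ("x",))))  # violates (iv)
print(can_reorder(Assign("x", (1,)), Assign("r1", ("y",))))     # independent
```

Because every proviso is symmetric in the two instructions, swapping the arguments of `can_reorder` gives the same answer, matching the remark that the relation is symmetric before forwarding is taken into account.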
The instructions used in the above examples, where each instruction ref-
erences at most one global variable and uses simple integer values, correspond
to the basic load and store instruction types of ARM and POWER processors.
We may instantiate (4) to such instructions, giving reordering rules
such as the following, which states that a store may be reordered before a
load if they are to different locations
($r_1 := y \overset{r}{\Leftarrow} x := r_2$). We use ARM syntax
to emphasise the application to a real architecture.
$$\texttt{LDR r1, y} \;\overset{r}{\Leftarrow}\; \texttt{STR x, r2} \tag{5}$$
In practice, proviso (ii) may be circumvented by forwarding (see footnote 1). This refers
to taking into account the effect of the update moved earlier on the expression
of the other update. We write β〈α〉 to represent the effect of forwarding the
(assignment) instruction α to the instruction β. For assignments we define
(y := f )〈x := e〉 = y :=(f[x\e]) if e does not refer to global variables (6)
where the term f[x\e] stands for the syntactic replacement in expression f of
references to x with e. The proviso of (6) prevents additional loads of globals
being introduced by forwarding.
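Forwarding as defined in (6) is a syntactic substitution guarded by a check on the forwarded expression. The sketch below illustrates this in Python under stated assumptions: expressions are an int, a variable name, or a nested tuple `(op, arg, ...)`; assignments are `(target, expr)` pairs; and the `SHARED` set is a hypothetical choice of global variables.

```python
# Assumption for this sketch: "x" and "y" are global; all other names are local.
SHARED = {"x", "y"}


def variables(e):
    """Variables in an expression: an int, a name, or a tuple (op, arg, ...)."""
    if isinstance(e, str):
        return {e}
    if isinstance(e, tuple):
        return set().union(*(variables(arg) for arg in e[1:]))
    return set()


def substitute(f, x, e):
    """Syntactic replacement of every reference to x in f by e."""
    if f == x:
        return e
    if isinstance(f, tuple):
        return (f[0],) + tuple(substitute(arg, x, e) for arg in f[1:])
    return f


def forward(beta, alpha):
    """Effect of forwarding alpha = (x, e) to the later assignment beta = (y, f)."""
    (y, f), (x, e) = beta, alpha
    if variables(e) & SHARED:
        return beta              # proviso of (6): never introduce extra global loads
    return (y, substitute(f, x, e))


print(forward(("x", "r"), ("r", 1)))              # x := r after r := 1
print(forward(("x", ("+", "r", 1)), ("r", 2)))    # x := r + 1 after r := 2
print(forward(("r2", "r1"), ("r1", "y")))         # r1 := y loads a global
```

In the last call the earlier instruction reads the global y, so the later instruction is left unchanged rather than being turned into a second load of y.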
We specify the reordering and forwarding relationships with other instruc-
tions such as branches and fences in the sections on specific architectures.
2.3. General operational rules for reordering
The key operational principle allowing reordering is given by the following
transition rules for a program (α ; c), i.e., a program with initial instruction
α.
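$$(\alpha ; c) \xrightarrow{\alpha} c \quad \text{(a)} \qquad\qquad \frac{c \xrightarrow{\beta} c' \qquad \alpha \overset{r}{\Leftarrow} \beta\langle\alpha\rangle}{(\alpha ; c) \xrightarrow{\beta\langle\alpha\rangle} (\alpha ; c')} \quad \text{(b)} \tag{7}$$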
Rule (7a) is the straightforward promotion of the first instruction into a
step in a trace, similar to the basic prefixing rules of CCS [14] and CSP
[15]. Rule (7b), however, states that, unique to weak memory models, an
instruction of c, say β, can happen before α, provided that β〈α〉 can be
reordered before α according to the rules of the architecture. Note that we
forward the effect of α to β before deciding if the reordering is possible.
Applying Rule (7b) then Rule (7a) gives the following reordered behaviour
of two assignments.
$$(r := 1 ; x := r ; \mathbf{skip}) \xrightarrow{x := 1} (r := 1 ; \mathbf{skip}) \xrightarrow{r := 1} \mathbf{skip} \tag{8}$$
We use the command skip to denote termination. The first transition above
is possible because we calculate the effect of r := 1 on the update of x before
executing that update, i.e., x := r 〈r := 1〉 = x := 1.
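Rules (7a) and (7b) can be read directly as a recursive generator of transitions. The following Python sketch is an illustration only: it restricts instructions to `(target, source)` pairs whose source is an int or a single variable name, and the `SHARED` set and all function names are assumptions of the sketch.

```python
# Assumption for this sketch: "x" and "y" are global; instructions are
# (target, source) pairs where source is an int or a single variable name.
SHARED = {"x", "y"}


def reads(instr):
    """Variables read by an instruction (empty for a constant source)."""
    return {instr[1]} if isinstance(instr[1], str) else set()


def forward(beta, alpha):
    """Forward alpha to beta, restricted to single-token sources."""
    if beta[1] == alpha[0] and not (reads(alpha) & SHARED):
        return (beta[0], alpha[1])
    return beta


def can_reorder(alpha, beta):
    """Relation (4): may beta be reordered before alpha?"""
    return (alpha[0] != beta[0]
            and alpha[0] not in reads(beta)
            and beta[0] not in reads(alpha)
            and not (reads(alpha) & reads(beta) & SHARED))


def steps(prog):
    """Yield (label, remaining program) pairs for a tuple of instructions."""
    if not prog:
        return
    alpha, rest = prog[0], prog[1:]
    yield alpha, rest                            # Rule (7a)
    for beta, rest_after in steps(rest):         # premise: c can do beta
        fwd = forward(beta, alpha)
        if can_reorder(alpha, fwd):              # premise: reordering allowed
            yield fwd, (alpha,) + rest_after     # Rule (7b)


def traces(prog):
    """All complete traces of a finite straight-line program."""
    if not prog:
        return [()]
    return [(label,) + t for label, rest in steps(prog) for t in traces(rest)]


for t in traces((("r", 1), ("x", "r"))):
    print(t)
```

For the program r := 1 ; x := r this yields the program-order trace and the reordered trace of (8), in which the store appears as x := 1 because the forwarded value replaces the reference to r.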
The definition of instruction reordering, $\alpha \overset{r}{\Leftarrow} \beta$, is
architecture-specific (instruction forwarding, β〈α〉, is constant for the
architectures we consider); see footnote 2.
Footnote 1: We adopt the term “forwarding” from ARM and POWER [3]. The equivalent
effect is sometimes referred to as bypassing on TSO [7].
Footnote 2: Typically this is the only definition required to specify an
architecture’s instruction ordering, but some behaviours may require specialised
operational rules, e.g., see Sect. 6.3. In addition, different architectures may
have different storage subsystems, and these need to be separately defined (see
Sect. 4.1).
2.4. Reasoning about reorderings
The operational rules allow a standard trace model of correctness to be
adopted, i.e., we say program c refines to program d , written c ⊑ d , iff
every trace of d is a trace of c. Let the program α  c have the standard
semantics of prefixing, that is, the action α always occurs before any action in
c (Rule (7a)). Then we can derive the following laws that show the interplay
of reordering and true prefixing.
α ; c ⊑ α  c (9)
α ; (β  c) ⊑ β〈α〉  (α ; c) if $\alpha \overset{r}{\Leftarrow} \beta\langle\alpha\rangle$ (10)
Note that in Law (10) α may be further reordered with instructions in c.
Let c ‖ d denote program c running concurrently with program d . A trace-based
interleaving semantics allows us to derive the following laws
straightforwardly.
(α  c) ‖ d ⊑ α  (c ‖ d) (11)
c ⊑ c ′ ∧ d ⊑ d ′ ⇒ c ‖ d ⊑ c ′ ‖ d ′ (12)
Law (11) is a typical interleaving law, and Law (12) states that refining either
program, or both programs, results in a refinement of their composition. We
leave the use of Law (12), and properties such as commutativity, implicit in
our derivations in this paper.
We may use these laws to show how the “surprise” behaviour of the store
buffer pattern above arises (see footnote 3). In derivations such as the
following, to save space, we abbreviate a thread α ; skip or α  skip to α, that
is, we omit the trailing skip.
Footnote 3: To focus on instruction reorderings we leave local variable
declarations and process ids implicit, and assume a multi-copy atomic storage
system (see Sect. 4.1).
(x := 1 ; r1 := y) ‖ (y := 1 ; r2 := x )
⊑ From Law (10) (twice), since $x := 1 \overset{r}{\Leftarrow} r_1 := y$ from (4).
(r1 := y  x := 1) ‖ (r2 := x  y := 1)
⊑ Law (11) (four times).
r1 := y  r2 := x  x := 1  y := 1
If initially x = y = 0, a standard sequential semantics shows that r1 = r2 = 0
is a possible final state in this behaviour.
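The derivation can be cross-checked by brute force. The Python sketch below enumerates every interleaving of the two threads of (1), first in program order and then with each thread's load moved before its store, and collects the final values of (r1, r2). The tuple encoding of instructions and the function names are assumptions of the sketch, and a single shared state stands in for a multi-copy atomic storage system.

```python
def interleavings(a, b):
    """Yield every interleaving of tuples a and b that keeps each one's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest


def run(trace):
    """Execute (target, source) assignments from the all-zero state; return (r1, r2)."""
    state = {"x": 0, "y": 0, "r1": 0, "r2": 0}
    for target, source in trace:
        state[target] = state[source] if isinstance(source, str) else source
    return state["r1"], state["r2"]


def outcomes(t1, t2):
    """Final (r1, r2) values over all interleavings of two threads."""
    return {run(t) for t in interleavings(t1, t2)}


# Program order, as written in (1).
T1 = (("x", 1), ("r1", "y"))
T2 = (("y", 1), ("r2", "x"))
# Each load reordered before the independent store, as permitted by (4).
T1_REORDERED = (("r1", "y"), ("x", 1))
T2_REORDERED = (("r2", "x"), ("y", 1))

print(sorted(outcomes(T1, T2)))
print(sorted(outcomes(T1_REORDERED, T2_REORDERED)))
```

Only the second set contains (0, 0), which illustrates why c ‖ d and c′ ‖ d can behave differently even though c′ is a sequentially equivalent reordering of c.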
3. Wide-spectrum language
In this section we give an overview of the syntax for our wide-spectrum
language. Its elements are actions (instructions) α, commands (programs)
c, processes (local state and a command) p, and the top level system s ,
encompassing a shared state and all processes. Below x is a variable (shared
or local) and e an expression.
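$$\begin{array}{rcl}
\alpha & ::= & x := e \mid [e] \mid \mathbf{fence} \mid \alpha^{*} \\
c & ::= & \mathbf{skip} \mid \alpha ; c \mid c_1 \sqcap c_2 \mid \mathbf{while}\ b\ \mathbf{do}\ c \\
p & ::= & (\mathbf{lcl}\ \sigma \bullet c) \mid (\mathbf{tid}_n\ p) \mid p_1 \parallel p_2 \\
s & ::= & (\mathbf{store}\ \sigma \bullet p)
\end{array} \tag{13}$$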
The basic actions of a weak memory model are an update x := e, a guard
[e], a (full) fence, or a finite sequence of actions, α∗, executed atomically.
Throughout the paper we denote an empty sequence by 〈〉, and construct
a non-empty sequence as 〈α1 , α2 . . .〉. ARM and POWER introduce other
instruction types, especially different types of fences, which we discuss in the
relevant sections.
A command may be the empty command skip, which is already termi-
nated, a command prefixed by some action α, a choice between two com-
mands, or an iteration (for brevity we consider only one type of iteration,
the while loop).
A well-formed process is structured as a process id n ∈ PID encompassing
a (possibly empty) local state σ and command c, i.e., a term (tidn lcl σ • c).
We assume that all local variables referenced in c are contained in the domain
of σ.
A system is structured as the parallel composition of processes within the
global storage system. The typical structure is that of a global state, σ, that
maps all global variables to their values, which models the storage systems
of TSO, the most recent version of ARM [17], and abstract specifications.
(store σ • (tid1 lcl σ1 • c1) ‖ (tid2 lcl σ2 • c2) ‖ . . .) (14)
Older versions of ARM and POWER have a more complex storage system,
though the structure of the overall system remains the same, as discussed in
Sect. 7.1.
3.1. Abbreviations
Conditionals are modelled using guards and choice (where false branches
are never executed).
if b then c1 else c2 =̂ ([b] ; c1) ⊓ ([¬b] ; c2) (15)
By allowing instructions in c1 or c2 to be reordered before the guards one can
model speculative execution, i.e., early execution of instructions which occur
after a branch point [16]: see Sect. 6.1.
Although the basic thread language is very simple (reflecting a sequence
of instructions on a processor, or a trace in a denotational semantics model) we
may construct more familiar imperative programming constructs in the usual
way. Sequential composition of commands, as opposed to action prefixing,
can be defined by induction.
skip ; c = c (16)
(α ; c1) ; c2 = α ; (c1 ; c2) (17)
(c1 ⊓ c2) ; c3 = (c1 ; c3) ⊓ (c2 ; c3) (18)
Loops are modelled using unfolding, as in Rule (23) below.
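The inductive definition (16)-(18) translates into a structural recursion over the command syntax. The following Python sketch uses hypothetical `Skip`, `Prefix` and `Choice` classes for three of the command forms of (13); while loops are omitted, since they are handled by unfolding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skip:
    """The terminated command."""


@dataclass(frozen=True)
class Prefix:
    action: str    # the leading action alpha, kept abstract as a string here
    cmd: object    # the command that follows it


@dataclass(frozen=True)
class Choice:
    left: object
    right: object


def seq(c1, c2):
    """Sequential composition c1 ; c2 of commands, by induction on c1."""
    if isinstance(c1, Skip):
        return c2                                             # (16)
    if isinstance(c1, Prefix):
        return Prefix(c1.action, seq(c1.cmd, c2))             # (17)
    if isinstance(c1, Choice):
        return Choice(seq(c1.left, c2), seq(c1.right, c2))    # (18)
    raise TypeError(f"unsupported command: {c1!r}")


branch = Choice(Prefix("[b]", Skip()), Prefix("[not b]", Skip()))
print(seq(branch, Prefix("x := 1", Skip())))
```

The continuation is copied into both branches of a choice, so instructions of the continuation sit directly behind each guard in the resulting term.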
Read-modify-write primitives that allow atomic access of more than one
variable can be modelled as an atomic sequence of steps. For instance, con-
sider a fenced compare-and-swap (CAS ) instruction, where CAS (x , r , e) up-
dates shared variable x to the value of expression e if x = r , and otherwise
does nothing.
CAS(x, r, e) =̂ 〈[x = r] , x := e , fence〉 ⊓ [x ≠ r] (19)
When used as the expression in a conditional we use the following abbrevia-
tion.
if CAS(x, r, e) then c1 else c2 =̂
(〈[x = r] , x := e , fence〉 ; c1) ⊓ ([x ≠ r] ; c2)
(20)
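Read operationally, (19) offers two guarded branches of which exactly one is enabled in any given store. The sketch below illustrates this in Python; the function name `cas`, the dictionary store, and the use of already-evaluated values for r and e are assumptions, and the fence is not modelled because no reordering takes place in the sketch.

```python
def cas(store, x, r, e):
    """Return the enabled branches of CAS(x, r, e) as (new_store, succeeded) pairs.

    store maps shared variable names to values; r and e are already-evaluated values.
    """
    branches = []
    if store[x] == r:
        # Guard [x = r] holds: the update x := e happens atomically with the test.
        branches.append(({**store, x: e}, True))
    if store[x] != r:
        # Guard [x != r] holds: the store is left unchanged.
        branches.append((dict(store), False))
    return branches


print(cas({"x": 0}, "x", 0, 5))   # success branch only
print(cas({"x": 1}, "x", 0, 5))   # failure branch only
```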
4. Operational semantics
The meaning of our language is formalised using an operational semantics,
which, excluding the global storage system, is summarised in Fig. 1. Given
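a program c the operational semantics generates a trace, i.e., a possibly
infinite sequence of steps $c_0 \xrightarrow{\alpha_1} c_1 \xrightarrow{\alpha_2} \ldots$
where the labels in the trace are
actions, or a special label τ representing a silent or internal step that has
no observable effect. For brevity we omit rules that are a straightforward
promotion of a label from a subterm to a parent term, i.e., rules of the form
$p \xrightarrow{\ell} p' \Rightarrow C[p] \xrightarrow{\ell} C[p']$.
Rule 21 (Prefix with reordering).
$$(\alpha ; c) \xrightarrow{\alpha} c \quad \text{(a)} \qquad\qquad \frac{c \xrightarrow{\beta} c' \qquad \alpha \overset{r}{\Leftarrow} \beta\langle\alpha\rangle}{(\alpha ; c) \xrightarrow{\beta\langle\alpha\rangle} (\alpha ; c')} \quad \text{(b)}$$
Rule 22 (Choice).
$$c \sqcap d \xrightarrow{\tau} c \qquad\qquad c \sqcap d \xrightarrow{\tau} d$$
Rule 23 (Loop).
$$\mathbf{while}\ b\ \mathbf{do}\ c \xrightarrow{\tau} ([b] ; c ; \mathbf{while}\ b\ \mathbf{do}\ c) \sqcap ([\neg b] ; \mathbf{skip})$$
Rule 24 (Locals - update).
$$\frac{c \xrightarrow{r := v} c'}{(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{\tau} (\mathbf{lcl}\ \sigma[r := v] \bullet c')}$$
Rule 25 (Locals - store).
$$\frac{c \xrightarrow{x := r} c' \qquad \sigma(r) = v}{(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{x := v} (\mathbf{lcl}\ \sigma \bullet c')}$$
Rule 26 (Locals - load).
$$\frac{c \xrightarrow{r := x} c'}{(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{[x = v]} (\mathbf{lcl}\ \sigma[r := v] \bullet c')}$$
Rule 27 (Locals - guard).
$$\frac{c \xrightarrow{[e]} c'}{(\mathbf{lcl}\ \sigma \bullet c) \xrightarrow{[e_\sigma]} (\mathbf{lcl}\ \sigma \bullet c')}$$
Rule 28 (Thread id).
$$\frac{p \xrightarrow{\alpha} p'}{(\mathbf{tid}_n\ p) \xrightarrow{n:\alpha} (\mathbf{tid}_n\ p')}$$
Rule 29 (Interleave parallel).
$$\frac{p_1 \xrightarrow{\alpha} p_1'}{p_1 \parallel p_2 \xrightarrow{\alpha} p_1' \parallel p_2} \qquad\qquad \frac{p_2 \xrightarrow{\alpha} p_2'}{p_1 \parallel p_2 \xrightarrow{\alpha} p_1 \parallel p_2'}$$
Figure 1: Semantics of the language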
The terminated command skip has no behaviour; a trace that ends with
this command is assumed to have completed. The effect of instruction pre-
fixing in Rule (21) is discussed in Sect. 2.3. Note that actions become part
of the trace.
A nondeterministic choice (the internal choice of CSP [15]) can choose
either branch, as given by Rule (22). The semantics of loops is given by
unfolding, e.g., Rule (23) for a ‘while’ loop. Note that speculative execution
is theoretically unbounded, and loads from inside later iterations of the loop
could occur in earlier iterations.
For ease of presentation in defining the semantics for local states, we give
rules for specific forms of actions, i.e., assuming that r is a local variable
in the domain of σ, and that x is a global (not in the domain of σ). The
more general version can be straightforwardly constructed from the principles
below.
Rule (24) states that an action updating variable r to value v results in
a change to the local state (denoted σ[r := v ]). Since this is a purely local
operation there is no interaction with the storage subsystem and hence the
transition is promoted as a silent step τ . Rule (25) states that a store of the
value in variable r to global x is promoted as an instruction x := v where v
is the local value for r . Rule (26) covers the case of a load of x into r . The
value of x is not known locally. The promoted label is a guard requiring that
the value read for x is v . This transition is possible for any value of v , but
the correct value will be resolved when the label is promoted to the storage
level. Rule (27) states that a guard is partially evaluated with respect to the
local state before it is promoted to the global level. The notation $e_\sigma$
replaces x with v in e for all (x ↦ v) ∈ σ.
Rule (28) simply tags the process id to an instruction, to assist in the
interaction with the storage system, and otherwise has no effect. Instructions
of concurrent processes are interleaved in the usual way as described by
Rule (29).
Other straightforward rules which we have omitted above include the
promotion of fences through a local state, and that atomic sequences of
actions are handled inductively by the above rules.
Rule 30 (Globals - store).
$$\frac{p \xrightarrow{n:x := e} p'}{(\mathbf{store}\ \sigma \bullet p) \xrightarrow{\tau} (\mathbf{store}\ \sigma[x := e_\sigma] \bullet p')}$$
Rule 31 (Globals - guard).
$$\frac{p \xrightarrow{n:[e]} p' \qquad e_\sigma \equiv \mathit{true}}{(\mathbf{store}\ \sigma \bullet p) \xrightarrow{\tau} (\mathbf{store}\ \sigma \bullet p')}$$
Figure 2: Semantics of a standard (multicopy-atomic) storage system
4.1. Multi-copy atomic storage subsystem.
Traditionally, changes to shared variables occur on a shared global state,
and when written to the global state are seen instantaneously by all processes
in the system. This is referred to as multi-copy atomicity and is a feature
of TSO and the most recent version of ARM [17]. Older versions of ARM
and POWER, however, lack such multi-copy atomicity and require a more
complex semantics. We give the simpler case (covered in Fig. 2) first. For
the store model the thread ids are not used, but they do become important
in later sections.
Recall that at the global level the process id n has been tagged to the
actions by Rule (28). Rule (30) covers a store of some expression e to x .
Since all local variable references have been replaced by their values at the
process level due to Rules (24)-(27), expression e must refer only to shared
variables in σ. The value of x is updated to the fully evaluated value, eσ.
Rule (31) states that a guard transition [e] is possible exactly when e
evaluates to true in the global state. If it does not, no transition is possi-
ble; this is how incorrect branches are eliminated from the traces, which we
discuss in more detail in the context of speculative execution in Sect. 6.1.
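The interplay between Rule (26), where a load may guess any value, and Rule (31), where only a true guard may proceed, can be illustrated in a few lines of Python. This is a sketch under assumptions: labels are encoded as `("store", x, v)` or `("guard", x, v)` tuples, states are dictionaries, the value domain is finite, and thread ids are omitted.

```python
def local_load(local, r, x, domain):
    """Rule (26): a load r := x may pick any value v, emitting the guard [x = v]."""
    return [(("guard", x, v), {**local, r: v}) for v in domain]


def local_store(local, x, r):
    """Rule (25): a store x := r is promoted carrying the local value of r."""
    return ("store", x, local[r]), local


def global_step(store, label):
    """Rules (30) and (31): apply a store, or admit a guard only if it holds."""
    kind, x, v = label
    if kind == "store":
        return {**store, x: v}
    return store if store[x] == v else None   # None: no transition is possible


store = {"x": 1}
surviving = [
    (label, local)
    for label, local in local_load({"r": 0}, "r", "x", range(3))
    if global_step(store, label) is not None
]
print(surviving)
```

Of the three guesses made at the local level, only the one that agrees with the global value of x leads to a transition, so r ends up holding the value actually stored.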
4.2. Reordering and forwarding (for sequential consistency)
It remains to define the reordering relation $\overset{r}{\Leftarrow}$ for
particular architectures and the effect of forwarding so that the effect of
Rule (21) can be determined; and to model the global storage system where the
system lacks the multi-copy atomicity defined by the rules given above. We
define the reordering relation in the following sections, though we may
straightforwardly define the reordering for an atomic sequence of instructions
recursively as below, where s is a sequence of instructions.
$$\alpha \overset{r}{\Leftarrow} \langle\rangle \tag{34}$$
$$\alpha \overset{r}{\Leftarrow} \langle\beta\rangle \frown s \;\Leftrightarrow\; \alpha \overset{r}{\Leftarrow} \beta \;\wedge\; \alpha \overset{r}{\Leftarrow} s \tag{35}$$
We define similarly for the cases $s \overset{r}{\Leftarrow} \alpha$.
We note the trivial case for defining reordering for sequentially consistent
(SC) processors: $\alpha \overset{r}{\nLeftarrow} \beta$ for all α, β, and there is
no forwarding. Since reordering is not possible the second case of Rule (7)
never applies and hence the standard prefixing semantics is maintained. SC
semantics uses a storage system defined by the rules in Fig. 2.
$$(x := e)\langle y := f\rangle = x := e[y \backslash f] \quad \text{if } f \text{ has no shared variables} \tag{32}$$
$$[e]\langle y := f\rangle = [e[y \backslash f]] \quad \text{if } f \text{ has no shared variables} \tag{33}$$
$$\beta\langle\alpha\rangle = \beta \quad \text{otherwise}$$
Figure 3: Forwarding (bypassing)
Forwarding, as given in Fig. 3, is regular across all the architectures we
have considered: β〈α〉, where α is an assignment y := f where f does not
contain shared variables, is straightforward replacement of y by f in the
expression of an assignment (32) or guard (33). Otherwise forwarding has
no effect. If forwarding were applied when f contained shared variables, e.g.,
when f is the expression z , this would create more loads of z resulting in
potentially different values. This approach to modelling forwarding contrasts
with an explicit FIFO buffer which is often used in modelling TSO. The most
recent store to a global x is recorded in the program text, and need not be
explicitly kept separately in a buffer structure.
4.3. Tool support and validation
The operational semantics have been encoded in Maude [18, 19] as rewrite
rules. A process in the language is rewritten to a trace, with the Maude
system generating all possible traces through backtracking. To validate the
semantics of particular architectures and to verify data structures running
on them, we devised a straightforward mechanism for checking the final state
against a condition.
Modelling a particular architecture requires instantiating the reordering
relation. For commercial reasons, formal definitions of the hardware are not
provided by the vendors. To establish confidence in our semantics, therefore,
we validated it against litmus tests, small assembler programs. We are fortu-
nate in that considerable effort has gone into testing real hardware, collecting
the results, and using these to fine-tune an understanding of the hardware
for TSO, ARM and POWER. However, it must be noted that the use of
litmus tests, and their results on hardware, are problematic for validation for
several reasons:
• The set of litmus tests is unlikely to be complete.
• There may be a bug in the particular hardware tested giving incorrect
behaviour.
• The particular hardware tested may not implement all features allowed
by the memory model.
• The “specification” of the hardware may have imprecisions that re-
sulted in vendors allowing behaviours that were intended to be forbid-
den.
• Expected behaviours can change as new hardware is released.
• The absence of a behaviour does not mean that it is forbidden.
Due to the above limitations, we do not attempt to achieve full confor-
mance to the hardware results reported (where the above limitations are also
noted), nor do we try to match exactly the results of other models – indeed
some of the models themselves do not achieve full conformance, in particular
allowing many behaviours that were not observed on hardware. Instead we
aim to agree with litmus tests in the majority of cases, noting that refining
a model to agree on all known litmus tests may quickly become redundant
due to reasons above.
The litmus tests are provided in assembler syntax which we must trans-
late to our wide-spectrum language. Branch instructions (e.g., ARM’s BNE)
are modelled using a combination of guards and nondeterministic choice. A
guard [e], where e is an expression, does not directly map from a hardware
instruction. Abstractly, a command ([r = 0] ; c) means that if r = 0 in the
local state then c may continue execution. If r ≠ 0, then no execution is
possible. As such, our guard corresponds to the guards in Dijkstra’s guarded
command language [13]. In our language we can use guards to model branch-
ing provided a straightforward structure is used, as outlined below. Let αi
stand for instructions. The BNE L instruction jumps to label L if a special
register is not equal to 0, while B L unconditionally jumps to L. Thus a
structure such as the following
α1 ; BNE L1 ; α2 ; B L2 ; L1: α3 ; L2: . . . (36)
becomes α1 ; (if cmpr = 0 then α2 else α3) ; . . . (37)
We have used the name cmpr for the local register implicitly accessed by the
BNE instruction. Note that in our framework a branch instruction such as
BNE in the assembler code structurally corresponds to a guard (covering the
true and false cases). This structured programming approach to denoting
branching cannot cover all possible jumps within hardware addresses, but is
sufficiently expressive to capture the behaviours found in the litmus tests,
and is suitable for modelling higher-level structured code.
5. TSO
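$$\begin{array}{rcll}
\alpha & \overset{r}{\nLeftarrow} & \mathbf{fence} & (38) \\
\mathbf{fence} & \overset{r}{\nLeftarrow} & \alpha & (39) \\
{[b]} & \overset{r}{\nLeftarrow} & \alpha & (40) \\
\alpha & \overset{r}{\nLeftarrow} & [b] & (41) \\
x := e & \overset{r}{\Leftarrow} & r := f \quad \text{if } x \overset{v}{\nsim} f,\ r \overset{v}{\nsim} e, \text{ and } e \text{ has no free globals} & (42) \\
\alpha & \overset{r}{\nLeftarrow} & \beta \quad \text{in all other cases} & (43)
\end{array}$$
Figure 4: Reordering following TSO assembler semantics. x denotes any variable, and r
a local variable.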
The reordering relation for TSO is given in Fig. 4. It uses a multi-copy
atomic storage system as defined by the rules in Fig. 2. TSO is a relatively
strong memory model with $\alpha \overset{r}{\nLeftarrow} \beta$ for all α, β except
as specified in (42), which allows loads to come before independent stores.
In addition, (42) allows independent register operations to be reordered
before stores, allowing forwarding (or bypassing). This means that
a load of x may take the most recently written value of x. In
our framework, this means that if a load of x is reordered before a store to
x, it takes that value. That is, since
r := x 〈x :=1〉 = r := 1 (44)
from (32), we have
x := 1 ; r := x ⊑ r := 1  x := 1 (45)
by Law (10). Note that the instruction type changes from a load (r := x ) to
a simple update to a local register (r := 1), and hence is not affected by any
earlier stores to x .
TSO’s fence instruction can be employed to prevent the reordering of
stores and loads (38,39).
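The TSO relation of Fig. 4 is small enough to transcribe directly. The Python sketch below does so under stated assumptions: instructions are `("fence",)`, `("guard", b)` or `("assign", target, source)` tuples with an int or single variable name as source, `SHARED` is a hypothetical set of globals, and the side conditions of (42) are transcribed as stated in the figure.

```python
# Assumption for this sketch: "x" and "y" are global; all other names are registers.
SHARED = {"x", "y"}


def reads(assign):
    """Variables read by an ("assign", target, source) instruction."""
    return {assign[2]} if isinstance(assign[2], str) else set()


def forward(beta, alpha):
    """Forwarding per (32), restricted to single-token assignments."""
    if (alpha[0] == "assign" and beta[0] == "assign"
            and beta[2] == alpha[1] and not (reads(alpha) & SHARED)):
        return ("assign", beta[1], alpha[2])
    return beta


def tso_can_reorder(alpha, beta):
    """Fig. 4: may beta be reordered before alpha under TSO?"""
    if alpha[0] != "assign" or beta[0] != "assign":
        return False                          # (38)-(41): fences and guards block
    x, r = alpha[1], beta[1]
    return (r not in SHARED                   # beta updates a local register
            and x not in reads(beta)          # x is not free in f
            and r not in reads(alpha)         # r is not free in e
            and not (reads(alpha) & SHARED))  # e has no free globals


store_x = ("assign", "x", 1)
load_x = ("assign", "r", "x")
load_y = ("assign", "r", "y")

print(tso_can_reorder(store_x, load_y))                    # load of another variable
print(tso_can_reorder(store_x, load_x))                    # same variable, no forwarding
print(forward(load_x, store_x))                            # as in (44)
print(tso_can_reorder(store_x, forward(load_x, store_x)))  # as used in (45)
print(tso_can_reorder(store_x, ("assign", "y", 2)))        # store-store order is kept
```

Applying `forward` before the check mirrors Rule (7b): the load of x becomes a register update, which (42) then allows to precede the store.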
5.1. Validation
We tested our definitions for TSO against the litmus tests mentioned in [7]
and 25 tests generated using the herd tool (http://diy.inria.fr/herd/).
Those litmus tests cover the essence of TSO, namely that loads can appear
to come before stores, and forwarding (or bypassing) takes place.
6. Revised ARM v8
In this section we consider the latest (revised) version of ARM v8 which
is multi-copy atomic [17]. We consider older versions of ARM which lack
multi-copy atomicity in Sect. 7.
In addition to stores, loads, register operations, and full fences, ARM’s
instruction set includes a control fence, cfence, which affects local reordering
by acting as a barrier preventing subsequent loads being reordered with ear-
lier instructions. It is used in conjunction with branches to avoid the effect
of speculative execution, discussed in Sect. 6.1. ARM also has a store-only
fence.
Our general semantics is instantiated for ARM processors in Fig. 5 which
provides particular definitions for the reordering relation that are generalised
from the orderings on stores and loads in these processors.
16
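$\alpha ::= \ldots \mid \mathbf{fence.st} \mid \mathbf{cfence}$ (46)
$\alpha \overset{r}{\nLeftarrow} \mathbf{fence}$ (47)
$\mathbf{fence} \overset{r}{\nLeftarrow} \alpha$ (48)
$x := e \overset{r}{\nLeftarrow} \mathbf{fence.st}$ if $x$ is shared (49)
$\mathbf{fence.st} \overset{r}{\nLeftarrow} x := e$ if $x$ is shared (50)
$[b] \overset{r}{\nLeftarrow} \mathbf{cfence}$ (51)
$\mathbf{cfence} \overset{r}{\nLeftarrow} r := e$ (52)
$[b_1] \overset{r}{\Leftarrow} [b_2]$ if $b_1 \overset{sv}{\nsim} b_2$ (53)
$[b] \overset{r}{\nLeftarrow} x := e$ if $x$ is shared (54)
$[b] \overset{r}{\Leftarrow} r := e$ if $r \overset{v}{\nsim} b$ and $e \overset{sv}{\nsim} b$ (55)
$x := e \overset{r}{\Leftarrow} [b]$ if $x \overset{v}{\nsim} b$ and $e \overset{sv}{\nsim} b$ (56)
$x := e \overset{r}{\Leftarrow} y := f$ if $x \overset{v}{\nsim} y$, $x \overset{v}{\nsim} f$, $y \overset{v}{\nsim} e$, and $e \overset{sv}{\nsim} f$ (57)
$\alpha \overset{r}{\Leftarrow} \beta$ in all other cases
Figure 5: Reordering following ARM assembler semantics. x , y denote any variable and r
a local variable.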
Fences prevent all reorderings as with TSO (47,48), while a store-only
barrier fence.st (corresponding to ARM’s DMB.ST and DSB.ST instructions)
maintains order on stores but not on other instruction types (49,50). A
control fence cfence prevents speculative loads when placed between a guard
and a load (51,52). Guards may be reordered with other guards provided they
do not both access the same shared variables (53) (otherwise local coherence
would be violated), but stores to shared variables may not come before a
guard evaluation (54). This prevents speculative execution from modifying
the global state, in the event that the speculation was down the wrong branch.
An update of a local variable may be reordered before a guard provided
it does not affect the guard expression and respects local coherence (55).
Guards may be reordered before updates if those updates do not affect the
guard expression and local coherence is respected (56). (Note that in ARM
assembler the $e \overset{sv}{\nsim} b$ constraints for guards are always satisfied as guards
(branch points) do not reference globals.) Assignments may be reordered as
shown in (57) and discussed in Sect. 2.2.
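The side conditions of Fig. 5 can be read as a decision procedure on pairs of instructions. The following Python sketch is one such reading and is not the Maude encoding used for validation. The names `Instr`, `SHARED`, `shares_shared_var` and `can_reorder` are hypothetical, and two interpretations are assumptions: the relation annotated with v is taken to mean "the variable does not occur in the other term", and the relation annotated with sv is taken to mean "the two terms have no shared variable in common".

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: these names are shared variables; all others are local registers.
SHARED = frozenset({"x", "y", "z"})

@dataclass(frozen=True)
class Instr:
    kind: str                       # "fence", "fence.st", "cfence", "guard" or "assign"
    target: Optional[str] = None    # variable written (assignments only)
    reads: frozenset = frozenset()  # variables read by the guard or expression

def shares_shared_var(a: Instr, b: Instr) -> bool:
    """Both instructions read at least one common shared variable."""
    return bool(a.reads & b.reads & SHARED)

def can_reorder(a: Instr, b: Instr) -> bool:
    """True if the later instruction b may be reordered before the earlier a."""
    a_store = a.kind == "assign" and a.target in SHARED
    b_store = b.kind == "assign" and b.target in SHARED
    if "fence" in (a.kind, b.kind):                       # (47), (48)
        return False
    if (a_store and b.kind == "fence.st") or (a.kind == "fence.st" and b_store):
        return False                                      # (49), (50)
    if a.kind == "guard" and b.kind == "cfence":          # (51)
        return False
    if a.kind == "cfence" and b.kind == "assign" and not b_store:
        return False                                      # (52)
    if a.kind == "guard" and b.kind == "guard":           # (53)
        return not shares_shared_var(a, b)
    if a.kind == "guard" and b.kind == "assign":
        if b_store:                                       # (54)
            return False
        return b.target not in a.reads and not shares_shared_var(a, b)  # (55)
    if a.kind == "assign" and b.kind == "guard":          # (56)
        return a.target not in b.reads and not shares_shared_var(a, b)
    if a.kind == "assign" and b.kind == "assign":         # (57)
        return (a.target != b.target
                and a.target not in b.reads
                and b.target not in a.reads
                and not shares_shared_var(a, b))
    return True                                           # all other cases
```

For example, with `guard = Instr("guard", reads=frozenset({"r1"}))` and `load = Instr("assign", "r2", frozenset({"y"}))`, the call `can_reorder(guard, load)` holds by (55), whereas `can_reorder(Instr("cfence"), load)` does not, by (52).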
6.1. Speculative execution
Many processors allow some form of speculative execution, where the in-
structions in a branch are tentatively executed and the effect stored locally
while, for instance, waiting for a load of a global to be serviced. On TSO
and related architectures the results of speculative execution are not visible,
i.e., speculatively executed loads are restarted if it is detected that an old
value was loaded. On ARM processors the effect of speculative execution
can become visible, i.e., the effect of speculatively executing loads is not
(conditionally) unwound. However, in all cases, if speculation was down a
branch that was eventually determined to be incorrectly chosen, no effect
is (or should be) visible.4 Fortunately this complication can be handled
straightforwardly in our semantics.
If a guard does not evaluate to true, execution stops in the sense that
no transition is possible. This corresponds to a false guard, i.e., magic
[21, 22], and such behaviours do not terminate and are ignored for the
purposes of determining behaviour of a system. Interestingly, this simple
concept from standard refinement theory allows us to handle speculative
execution straightforwardly. In existing approaches, the semantics is
complicated by needing to restart reads if speculation proceeds down the
wrong path. Treating branch points as guards works because speculation
should have no effect if the wrong branch was chosen.
4 The recently discovered Spectre security vulnerability [20] shows that this
is not strictly the case.
To understand how this approach to speculative execution works, consider
the following derivation.
r1 := x ; (if r1 = 0 then r2 := y)
= Definition of if (15)
r1 := x ; (([r1 = 0] ; r2 := y) ⊓ [r1 ≠ 0])
⊑ Resolve to the first branch, since (c ⊓ d) ⊑ c
r1 := x ; [r1 = 0] ; r2 := y
⊑ From Law (10) and (55)
r1 := x ; r2 := y  [r1 = 0]
⊑ From Law (10) and (57)
r2 := y  r1 := x ; [r1 = 0]
This shows that the inner load (underlined) may be reordered before the
branch point, and subsequently before an earlier load. Note that this be-
haviour results in a terminating trace only if r1 = 0 holds when the guard
is evaluated, and otherwise becomes magic (speculation down an incorrect
path). On ARM processors, placing a control fence (cfence) instruction
inside the branch, before the inner load, prevents this reordering.
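The role of the guard as a filter on traces can be illustrated with a small Python sketch. It takes the reordered sequence from the last line of the derivation (load of y, load of x, then the guard r1 = 0), interleaves it with a single store x := 1 by a hypothetical second process, and discards every execution in which the guard is false. The function names and the triple encoding of instructions are assumptions made for this illustration only.

```python
from itertools import combinations

def interleavings(a, b):
    """All merges of sequences a and b that keep each sequence's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        slots, ia, ib = set(slots), iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]

def execute(trace, init):
    """Run a trace; return the registers, or None if a guard is false (magic)."""
    shared, regs = dict(init), {}
    for kind, dst, src in trace:
        if kind == "store":        # ("store", variable, value)
            shared[dst] = src
        elif kind == "load":       # ("load", register, variable)
            regs[dst] = shared[src]
        elif kind == "guard" and regs[dst] != src:   # ("guard", register, value)
            return None            # speculation down the wrong branch
    return regs

# Reordered reader: r2 := y, then r1 := x, then [r1 = 0].
reader = [("load", "r2", "y"), ("load", "r1", "x"), ("guard", "r1", 0)]
writer = [("store", "x", 1)]       # hypothetical sibling process

results = [execute(t, {"x": 0, "y": 0}) for t in interleavings(reader, writer)]
surviving = [r for r in results if r is not None]
```

Executions in which the store lands before the load of x leave r1 = 1, so the guard fails and the trace is ignored; the early load of y into r2 then has no visible effect, which is the sense in which wrong-path speculation is handled.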
6.2. Address shifting
In ARM the address an instruction loads from (or stores to) may be
shifted. The instruction LDR R1, [R2, X] loads into R1 the value at address
X shifted by the amount in R2. The presence of address shifting, or other
mechanisms for modifying the address target of an instruction, can have an
influence on the reordering relation, as captured by the addr relation in [3].
In our framework most of the restrictions introduced by address shifting are
already captured by Rule (21), where we interpret ‘x ’ to range over expres-
sions including address shifts such as y&+r , and hence the set of variables
being checked against the conditions is {y , r}. However, as mentioned in [4],
an instruction x&+e := f has at least one other effect on reordering, which is
that later stores cannot be reordered before it, in case the shift amount e
gives an invalid address and results in an exception being thrown. In such
cases, the effect of later writes should not be visible to other processes.
$x\&{+}e := f \overset{r}{\nLeftarrow} y := g$ if $y$ is a shared variable
To precisely model the semantics of address shifting requires a more concrete
model than the one we propose, however, as determined by the litmus tests
of [4], the effects of address shifting on reorderings can be investigated even
when the shift amount is 0 (resulting in a load of the value at the address).
As such we define x&+0 = x , and leave the effect of other shift amounts
undefined. Note that when the shift amount expression is evaluated to 0
(e.g., by forwarding) the address shifting is removed and this can have an
effect on the allowed reorderings.
6.2.1. Load speculation
A further aspect of address shifting is that in some circumstances a load
r2 := x may be reordered before a load r1 := x&+e , even though this would
appear to violate coherence-per-location. However, the load into r2 must not
load a value of x that was written before the value read by the load into r1.
This complex situation is handled in [4] by restarting load instructions if an
earlier value is read into r2. We handle it more abstractly by treating the
load as speculation, where if an earlier value for r2 is loaded then the effect
of that speculation is thrown away (the point of allowing this reordering is
apparently to allow execution after the second load to continue while the
value of the shift amount e is calculated). This is given by the following
operational rule.
$$\frac{p \xrightarrow{r_2 := x} p'}{r_1 := x\&{+}n \mathbin{;} p \xrightarrow{r_2 := x} r_1 := x\&{+}n \mathbin{;} [r_1 = r_2] \mathbin{;} p'} \qquad (58)$$
In practical terms it is possible the first load of x (into r1) is delayed while
determining the offset value. The later load is allowed to proceed, freeing
up p′ to continue speculatively executing until the dependency is resolved.
The load into r1 then must still be issued, the result being checked against
r2. This check must occur to preserve coherency as the load into r2 cannot
read a value earlier than that read into r1. Note that loads in p′ can now
potentially be reordered to execute ahead of the load into r1.
6.3. Eliminating earlier writes
An additional aspect of ARM processors is that when there are consec-
utive writes to a variable x on a process the first write can effectively be
eliminated: locally only the effect of the second write will be seen (sequential
semantics is preserved), and globally it is always a valid behaviour that a
sibling process did not see the effect of the first write because the second
occurred immediately after it. Write elimination is captured by the following
rule.
$$\frac{c \xrightarrow{x := v} c'}{x := w \mathbin{;} c \xrightarrow{x := v} c'} \qquad (59)$$
We may derive the following elimination law.
x := w ; x := v ; c ⊑ x := v ; c (60)
6.4. Validation
The new version of ARM as reported in [17] is quite recent and the litmus
tests used in that paper are not available at the time of writing. We have
validated earlier versions of ARM which are more complicated, and as such
we defer discussion of validation until Sect. 7.2.
7. Original ARM v8 and earlier – non-multicopy atomicity
In this section we consider the versions of ARM which lack multi-copy
atomicity. These include the original version of ARM v8 [4] and all earlier
versions. These versions of ARM allow processes to communicate values to
each other without accessing the heap. That is, if process p1 is storing v to
x , and process p2 wants to load x into r , p2 may preemptively load the value
v into r , before p1’s store hits the global shared storage. Therefore different
processes may have different views of the values of global variables; see litmus
tests such as the WRC family [3]. To properly model these versions of ARM
we must therefore introduce the storage subsystem, (storage ω • c), which
replaces the (store σ • p) notation defined earlier.
7.1. Storage subsystem
We conceptualise the stores in the system as a list ω of writes w1,w2, . . .,
with each write wi being of the form
$(x \mapsto v)^{n}_{S}$ (61)
where n is the process id of the thread that executed a store of value v to
address x . The list S is a list of process ids that have seen this write, that
is, loaded that value into some local register. For such a write w , we let
w.var = x, w.thread = n and w.seen = S. For a write $(x \mapsto v)^{n}_{S}$ it is always
the case that n ∈ S.
The order of writes in ω and previous values seen by a process affect the
values it loads, which in general are nondeterministic. When a new write
w is executed by a thread, w is not necessarily appended to the end of ω,
but instead may be “inserted” earlier, according to certain rules. The basic
principles of inserting a new write w of the form $(x \mapsto v)^{n}_{\{n\}}$ into the list ω
are:
1. Request w may not come before any earlier write by n (local coherence).
2. Request w may not come before a write w ′ to x by another process that
has been seen by n (global coherence).
When process n loads the value of location x from ω it may see either the
most recent value of x that n has already seen, or any that have been added
more recently in ω. When n sees some write w then n is added to the list
of seen process ids in w . The shared state from the perspective of a given
process is a particular view of this list. There is no single definitive shared
state. In addition, viewing a value in the list causes the list to be updated
and this affects later views.
Initially ω holds writes giving the initial values of the shared variables.
These initial values are assumed to have been seen by every process.
We give two specialised rules (for a load and store) in Fig. 6. To handle
the general case of an assignment x := e, where e may contain more than
one shared variable, the antecedents of the rules are combined, retrieving
the value of each variable referenced in e individually and accumulating the
changes to ω.
Rule (63) states that process n can load the value v for x provided there
is a write $(x \mapsto v)^{m}_{S}$ in the system where all earlier writes can be “seen past”,
i.e., as given in (66), an earlier write to variable x has not been seen. As a
result, n is added to the set of process ids that have seen that write (and
hence later reads of x by n will not be able to see any earlier writes to x ).
Rule (64) governs where a new write w may be “inserted” into the global
storage. Write w may appear earlier than writes that are already in the
system, provided it can be reordered with them as given in (67). We say w
can come before a write v, written $v \overset{w}{\Leftarrow} w$, provided v was by another
process, and if it is a write to the same variable as w then it has not been
seen by the process issuing w. This constraint keeps all writes by a single
process in the same order, and keeps a system-wide coherence on any one
shared variable, but allows different processes to see updates to different
variables in a different order. For instance, if the global storage contains
writes to x and y by process n,
$(x \mapsto v)^{n}_{S_1} \frown (y \mapsto w)^{n}_{S_2}$
then process m may see that x has changed but read the initial value for y ,
while another process p may see that y has changed but read the initial value
for x (assuming neither m nor p are in S1 or S2).
$s ::= \ldots \mid (\mathbf{storage}\ \omega \bullet p)$ (62)
Rule 63 (Storage - load).
$$\frac{p \xrightarrow{n:[x=v]} p' \qquad \forall w \in \mathrm{ran}(\omega_2) \bullet canSeePast_n(x, w)}{(\mathbf{storage}\ \omega_1 \frown (x \mapsto v)^{m}_{S} \frown \omega_2 \bullet p) \xrightarrow{n:[x=v]} (\mathbf{storage}\ \omega_1 \frown (x \mapsto v)^{m}_{S \cup \{n\}} \frown \omega_2 \bullet p')}$$
Rule 64 (Storage - store).
$$\frac{p \xrightarrow{n:x := v} p' \qquad w' = (x \mapsto v)^{n}_{\{n\}} \qquad \forall w \in \mathrm{ran}(\omega_2) \bullet w \overset{w}{\Leftarrow} w'}{(\mathbf{storage}\ \omega_1 \frown \omega_2 \bullet p) \xrightarrow{n:x := v} (\mathbf{storage}\ \omega_1 \frown w' \frown \omega_2 \bullet p')}$$
Rule 65 (Storage - fence).
$$\frac{p \xrightarrow{n:\mathbf{fence}} p'}{(\mathbf{storage}\ \omega \bullet p) \xrightarrow{n:\mathbf{fence}} (\mathbf{storage}\ flush_n(\omega) \bullet p')}$$
where
$canSeePast_n(x, w) \mathrel{\widehat{=}} x = w.var \Rightarrow n \notin w.seen$ (66)
$w \overset{w}{\Leftarrow} (x \mapsto v)^{n}_{\{n\}} \mathrel{\widehat{=}} n \neq w.thread \land (x = w.var \Rightarrow n \notin w.seen)$ (67)
$flush_n(\langle\rangle) = \langle\rangle \qquad flush_n(\omega \frown w) = flush_n(\omega) \frown flush_n(w)$ (68)
$flush_n(w) = \begin{cases} w[seen := PID] & \text{if } n \in w.seen \\ w & \text{otherwise} \end{cases}$ (69)
Figure 6: Rules for the non-multi-copy atomic subsystem of ARM and POWER.
Rule (65) states that a fence action by process n ‘flushes’ all previous
writes seen by n (which includes those writes by n). The flush function
modifies ω so that all processes can see all writes by n, effectively overwriting
earlier writes. This is achieved by updating the write so that all processes
have seen it, written as w[seen :=PID], and defined recursively by (68,69). A
fence.st instruction also flushes the storage as in Rule (65).
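The storage rules of Fig. 6 can be prototyped directly. The Python sketch below is a minimal model under stated assumptions, not the Maude encoding: ω is a tuple ordered from oldest to newest write, `PIDS` is a hypothetical set of process ids in which 0 stands for the pseudo-process that wrote the initial values, and the function names `loads`, `stores` and `flush` are chosen here for illustration.

```python
from dataclasses import dataclass, replace

# Assumption: the process ids in the system; 0 wrote the initial values.
PIDS = frozenset({0, 1, 2, 3})

@dataclass(frozen=True)
class Write:
    var: str
    val: int
    thread: int
    seen: frozenset

def can_see_past(n, x, w):
    """(66): n may look past w when loading x."""
    return w.var != x or n not in w.seen

def loads(omega, n, x):
    """Rule 63: yield (value, updated omega) for each write to x that n may load."""
    for i, w in enumerate(omega):
        if w.var == x and all(can_see_past(n, x, later) for later in omega[i + 1:]):
            marked = replace(w, seen=w.seen | {n})
            yield w.val, omega[:i] + (marked,) + omega[i + 1:]

def can_insert_before(w, n, x):
    """(67): a new write by n to x may be placed before the existing write w."""
    return n != w.thread and (w.var != x or n not in w.seen)

def stores(omega, n, x, v):
    """Rule 64: yield each omega that results from inserting the new write."""
    new = Write(x, v, n, frozenset({n}))
    for i in range(len(omega), -1, -1):
        if all(can_insert_before(w, n, x) for w in omega[i:]):
            yield omega[:i] + (new,) + omega[i:]

def flush(omega, n):
    """(68), (69): every write seen by n becomes seen by all processes."""
    return tuple(replace(w, seen=PIDS) if n in w.seen else w for w in omega)

# Initial values followed by writes to x and y by process 1.
omega0 = (Write("x", 0, 0, PIDS), Write("y", 0, 0, PIDS),
          Write("x", 1, 1, frozenset({1})), Write("y", 1, 1, frozenset({1})))
```

Starting from `omega0`, process 2 may load either value of x; once it has loaded the new value of x it can no longer load the old one, yet it may still load the initial value of y. This is the behaviour described in the example above, and a `flush` by process 1 removes it.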
7.2. Validation
The semantics was validated against two sets of litmus tests (with many
overlapping tests). The first was a set of 348 litmus tests developed in [4]
(we excluded some that could not be automatically translated, and also
excluded one that involves shadow registers, for reasons described in
Appendix A). We compared our results (pass/fail) against the expected results
on hardware taken from the supplemental material for [4]5. The translation
process was straightforward, with conditional statements translated into guarded
branches as described in Sect. 3.1. In addition some redundant register op-
erations were eliminated to reduce tool time, for instance, r := 1; x := r
becomes x := 1 provided r is not used elsewhere in the code.
The majority of the tests (333) were performed in under 3s of proces-
sor time. Those with 3 or more processes, limited local ordering (lack of
fences, etc.), and many writes, were the slowest. The Maude system mea-
sures rewrites and the largest test required approximately 50 million rewrites,
taking 33s. Results of three of the 348 tests were not recorded in [4] as they
did not complete in a reasonable time. Of the remaining litmus tests all
but three agreed with the model results in [4] which we discuss below in
Sect. 7.2.1.
5 https://www.cl.cam.ac.uk/~pes20/arm-supplemental/index.html
The second set of tests we used for validation was the set of 9790 tests
from [3]. These tests were used to validate an axiomatic semantics rather
than an operational model as in [4]. Our results for this larger set,
excluding 5 that did not parse and 133 that did not complete in a reasonable
amount of time, are that 9556 tests are in agreement with [3] and 96 in
disagreement. Of those 96, we agree with the results obtained by hardware in 52:
5 where we allow a behaviour seen on hardware (but disallowed by [3]), the
remainder where our model says such behaviours should be forbidden (and
were not observed on hardware). Of the remaining 44, all but 3 involve a
store(x);store(x) or load(x);load(x) pattern; of those 3, 2 are iden-
tical (MP+dmb+addr-po-ctrlisb and MP0110). Those 2 identical tests are
allowed in our model possibly erroneously, as they allow load speculation in
the presence of an unresolved write with address shifting. The remaining test,
MP+PPO015, looks like it is disallowed by us erroneously because we do not
have “chain forwarding”, or possibly forwarding from a register assignment,
which is necessary inside a branch to resolve an address shift expression.
If we instead exclude from the 96 disagreements those tests involving
address shifting we are left with 20 tests where we disagree with [3]. In all
20 cases we allow the behaviour that is disallowed by [3] (and which has
not been observed on hardware). All of these 20 cases involve at least one
instance of an access to a shared variable twice or more in at least one process,
i.e., po-loc becomes relevant. By allowing more behaviours than [3], we are
erring on the side of soundness, i.e., if we can prove a program involving one
of these 20 cases correct in our semantics, it is also certainly correct in the
semantics of [3].
We note that the results in [3] include 558 tests where their model for-
bids a result which was observed on hardware, and 1525 tests where the [3]
model allows a behaviour which was not observed. Regarding the former,
the discrepancy is attributed to the load-load hazard (e.g., [23]) and fewer
discrepancies remain between their model and hardware when the affected
litmus tests are excluded.
7.2.1. Discrepancies with [4]
We present litmus test PPO017 below,6 translated into our wide-spectrum
language, for which our model gives a different result to that of [4].7 This
6 Available at http://www.cl.cam.ac.uk/~pes20/arm-supplemental/src/PPO017.litmus
7We simplified some of the syntax for clarity, in particular introducing a higher-level
if statement to model a jump command and implicit register (referenced by the compare
is structurally similar to the test PPO015, discussed in [12], and test PPO012:
those two other tests pass in our model for essentially the same reason as
PPO017.
x := 1 ; fence ; y := 1 ‖
r0 := y ; r2 := z&+(r0 xor r0) ; z := 1 ; r4 := z ;
(if r4 = r4 then skip else skip) ; cfence ; r5 := x (70)
The tested condition is r0 = 1 ∧ r5 = 0, which asks whether it is possible to
load x (the last statement of process 2) before loading y (the first statement
of process 2). At a first glance the control fence prevents the load of x
happening before the branch. However, as indicated by litmus tests such as
MP+dmb.sy+fri-rfi-ctrisb, [4][Sect 3,Out of order execution], under some
circumstances the branch condition can be evaluated early. We expand on
this below by manipulating the second process, taking the case where the
success branch of the if statement is chosen. To aid clarity we underline the
instruction that is the target of the (next) refinement step.
r0 := y ; r2 := z&+(r0xorr0) ; z := 1 ; r4 := z ; [r4 = r4] ; cfence ; r5 := x
⊑ Promote load with forwarding (from z := 1), from Laws (9) and (10)
r4 := 1  r0 := y ; r2 := z&+(r0xorr0) ; z := 1 ; [r4 = r4] ; cfence ; r5 := x
⊑ Promote guard by Laws (9) and (10) (from (56))
r4 := 1  [r4 = r4]  r0 := y ; r2 := z&+(r0xorr0) ; z := 1 ; cfence ; r5 := x
⊑ Promote control fence by Laws (9) and (10) ((51) does not now apply)
r4 := 1  [r4 = r4]  cfence  r0 := y ; r2 := z&+(r0xorr0) ; z := 1 ; r5 := x
⊑ Promote load by Laws (9) and (10)
r4 := 1  [r4 = r4]  cfence  r5 := x  r0 := y ; r2 := z&+(r0xorr0) ; z := 1
The load r5 := x has been reordered before the load r0 := y , and hence when
interleaved with the first process from (70) it is straightforward that the
condition may be satisfied.
In the Flowing/POP model of [4], this behaviour is forbidden because
there is an address dependency from the load of y into r0 to r4, via z . In the
testing of real processors reported in [4], the behaviour that we allow was
never observed, but it is also allowed by the model in [3].
(CMP) and branch-not-equal (BNE) instructions). We have also combined some commands,
retaining dependencies, in a way that is not possible in the assembler language. The xor
operator is exclusive-or; its use here (artificially) creates an address dependency [3] between
the updates to r0 and r2. The r4 = r4 empty conditional creates a control dependency
(and with the control fence, a control fence dependency) between loads.
8. POWER
POWER processors allow similar reorderings to ARM, as well as those
aspects discussed in Sect. 6.1 to Sect. 6.3. Additionally, POWER includes
so-called “lightweight fences”, lwfence, which have both a local and global
effect, and the similar eieio barriers. The reordering and forwarding defini-
tions for POWER are otherwise the same as those in Fig. 5; we discuss how
that relates to hardware tests in Sect. 8.2.
8.1. Lightweight fences
A lightweight fence, denoted lwfence, maintains order between loads,
loads then stores, and stores, but not stores and subsequent loads (i.e.,
load;load, load;store, store;store, but not store;load). As shown
in the reordering rule (72) of Fig. 7 we use two types of instruction to model
the lightweight fence instruction.8 These are “gates”, with the storegate in-
struction allowing stores to “move backwards” and the loadgate instruction
allowing loads to “move forward”, as given by the reordering rules (73,74).
For instance, assume the following sequence of instructions, where li are
loads and si are stores.
l1 ; s1 ; storegate ; loadgate ; l2 ; s2
Assuming all loads and stores are to different variables and hence there are
no pairwise constraints on reordering, application of the Laws (9) and (10)
gives
l1  storegate  l2  s1  loadgate  s2
Note that the order between loads, between stores, and between loads then
stores has been maintained, but load l2 may be reordered before the store s1.
8If lightweight fences did not maintain load;load order it would be straightforward to
define their effect in terms of one instruction. It is possible to do it with one fence that
marks any later load which ‘jumps’ it so that load can be reordered with earlier stores but
not earlier loads, however this seems less elegant and less amenable to algebraic analysis.
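The gate example can be checked mechanically by enumerating every execution order permitted by (73) and (74). The Python sketch below does this for the six-instruction sequence above. It rests on two assumptions that are not stated in Fig. 7: gates may be reordered only as (73) and (74) allow, and the loads and stores are to distinct variables so that they reorder freely with each other. The names `can_reorder` and `orders` are hypothetical.

```python
LOADS = {"l1", "l2"}
STORES = {"s1", "s2"}
PROGRAM = ("l1", "s1", "storegate", "loadgate", "l2", "s2")

def can_reorder(earlier, later):
    """True if `later` may be executed before `earlier`."""
    if later == "storegate":
        return earlier in STORES          # (73): storegate moves back past stores
    if earlier == "loadgate":
        return later in LOADS             # (74): loads move forward past loadgate
    if earlier == "storegate" or later == "loadgate":
        return False                      # assumption: no other gate reordering
    return True                           # loads/stores to distinct variables

def orders(remaining):
    """Yield every order in which the remaining instructions may execute."""
    if not remaining:
        yield ()
        return
    for i, instr in enumerate(remaining):
        # instr may go next if it reorders with everything still ahead of it
        if all(can_reorder(e, instr) for e in remaining[:i]):
            for rest in orders(remaining[:i] + remaining[i + 1:]):
                yield (instr,) + rest

all_orders = set(orders(PROGRAM))
```

Under these assumptions the order l1, storegate, l2, s1, loadgate, s2 is among the results, while no generated order places l2 before l1, s2 before l1, or s2 before s1.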
In addition to a local constraint on possible reorderings, a lightweight
fence also has a global effect on the storage system, which we encode in the
semantics of the loadgate instruction (Rule (77)). Informally, a lightweight
fence requires other processes to see changes to (different) shared variables
in the same order as the process that wrote them. This is subtly different to
a full fence, as we describe below.
Effect on the storage subsystem. As given in Rule (77) a loadgate instruction
by process n has an effect on the storage ω similar to a full fence, but in this
case all writes that n has seen (which includes all writes by n) are also
tagged as lightweight-fenced by n by adding the lwf(n) tag to the seen part
of a write. This is given by the recursive definition of lwflush in (79).
In the presence of lightweight fences the rule for a load (Rule (63)) changes
to Rule (78). The change is that if process n sees a write by process m
then it must also see any earlier writes by m that m has lightweight-fenced.
This is given by the function $see^{m}_{n}(\omega)$ which simply marks any writes that
are lightweight-fenced by m to be seen and lightweight-fenced by n. This
transitive effect gives cumulativity of lightweight fences [3].
The inclusion of a lightweight fence also has a subtle effect on the write
order in the storage subsystem: the definition of write reordering as used in
Rule (64) is updated to (81), where $w_1 \overset{w}{\Leftarrow} w_2$ is defined so that write w2 by
n to x may come before write w1 (to different variable y) provided that w1
has not previously been lightweight-fenced by n.
In addition to lightweight fences POWER also includes an eieio barrier.
Based on the discussion in [3] we treated this as a barrier on stores only
(75,76). In addition an eieio barrier acts as a lightweight flush on the storage
system as in Rule (77).
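The tagging functions (79) and (80) and the updated write ordering (81) are small enough to sketch in Python. This is an illustrative model only: the `Write` record, the encoding of the tag lwf(n) as the tuple `("lwf", n)`, and the function names are assumptions made here.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Write:
    var: str
    val: int
    thread: int
    seen: frozenset   # process ids, plus ("lwf", n) tags

def lwf(n):
    """Tag recording that process n has lightweight-fenced a write."""
    return ("lwf", n)

def lwflush(omega, n):
    """(79): tag every write that n has seen as lightweight-fenced by n."""
    return tuple(replace(w, seen=w.seen | {lwf(n)}) if n in w.seen else w
                 for w in omega)

def see(omega, m, n):
    """(80): writes lightweight-fenced by m become seen and fenced by n."""
    return tuple(replace(w, seen=w.seen | {lwf(n), n}) if lwf(m) in w.seen else w
                 for w in omega)

def can_insert_before(w, n, x):
    """(81): a new write by n to x may be placed before the existing write w."""
    return (n != w.thread
            and (w.var != x or n not in w.seen)
            and (w.var == x or lwf(n) not in w.seen))

# Process 1 writes x, issues a lightweight fence, then writes y.
omega = lwflush((Write("x", 1, 1, frozenset({1})),), 1)
omega = omega + (Write("y", 1, 1, frozenset({1})),)

# Process 2 loads y = 1 (Rule 78): the writes before it are passed through see.
after_load = see(omega[:1], 1, 2) + (replace(omega[1], seen=omega[1].seen | {2}),)
```

After the load, the write to x is marked as seen by process 2, so process 2 can no longer read an older value of x. It is also tagged as lightweight-fenced by process 2, so a later write by process 2 to another variable cannot be inserted before it, which is the cumulative effect described above.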
8.2. Validation
There are two litmus test resources we used. Firstly, we validated the
semantics against a set of 758 litmus tests taken from the supplementary
material of [2]9 (there are more tests than for ARM because of the extra
lwfence cases). As with ARM, some tests were excluded due to parsing issues.
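9 https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/index.html
As with the results from [4], our model disagrees on the same three tests
discussed in Sect. 7.2.1, and otherwise agrees with the expected results on
hardware, and agrees with the results of their model where available (52
litmus tests did not complete in a reasonable amount of time using the tool
in [2]).10
$\alpha ::= \ldots \mid \mathbf{storegate} \mid \mathbf{loadgate} \mid \mathbf{eieio}$ (71)
$\mathbf{lwfence} \mathbin{;} c \mathrel{\widehat{=}} \mathbf{storegate} \mathbin{;} \mathbf{loadgate} \mathbin{;} c$ (72)
$\alpha \overset{r}{\Leftarrow} \mathbf{storegate}$ if $\alpha$ is a store (73)
$\mathbf{loadgate} \overset{r}{\Leftarrow} \alpha$ if $\alpha$ is a load (74)
$\mathbf{eieio} \overset{r}{\nLeftarrow} \alpha$ if $\alpha$ is a store (75)
$\alpha \overset{r}{\nLeftarrow} \mathbf{eieio}$ if $\alpha$ is a store (76)
Rule 77 (Storage - lightweight fence).
$$\frac{p \xrightarrow{n:\mathbf{loadgate}} p'}{(\mathbf{storage}\ \omega \bullet p) \xrightarrow{n:\mathbf{loadgate}} (\mathbf{storage}\ lwflush_n(\omega) \bullet p')}$$
Rule 78 (Storage - load in presence of lightweight fences).
$$\frac{p \xrightarrow{n:[x=v]} p' \qquad \forall w \in \mathrm{ran}(\omega_2) \bullet canSeePast_n(x, w)}{(\mathbf{storage}\ \omega_1 \frown (x \mapsto v)^{m}_{S} \frown \omega_2 \bullet p) \xrightarrow{n:[x=v]} (\mathbf{storage}\ see^{m}_{n}(\omega_1) \frown (x \mapsto v)^{m}_{S \cup \{n\}} \frown \omega_2 \bullet p')}$$
where
$lwflush_n(\langle\rangle) = \langle\rangle \qquad lwflush_n(\omega \frown w) = lwflush_n(\omega) \frown lwflush_n(w)$
$lwflush_n((x \mapsto v)^{m}_{S}) = \begin{cases} (x \mapsto v)^{m}_{S \cup \{lwf(n)\}} & \text{if } n \in S \\ (x \mapsto v)^{m}_{S} & \text{otherwise} \end{cases}$ (79)
$see^{m}_{n}(\langle\rangle) = \langle\rangle \qquad see^{m}_{n}(\omega \frown w) = see^{m}_{n}(\omega) \frown see^{m}_{n}(w)$
$see^{m}_{n}((x \mapsto v)^{-}_{S}) = \begin{cases} (x \mapsto v)^{-}_{S \cup \{lwf(n), n\}} & \text{if } lwf(m) \in S \\ (x \mapsto v)^{-}_{S} & \text{otherwise} \end{cases}$ (80)
$w \overset{w}{\Leftarrow} (x \mapsto v)^{n}_{\{n\}} \mathrel{\widehat{=}} n \neq w.thread \land (x = w.var \Rightarrow n \notin w.seen) \land (x \neq w.var \Rightarrow lwf(n) \notin w.seen)$ (81)
Figure 7: Additions to storage subsystem to handle POWER’s lightweight fences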
The second set of tests comes from [3]. As we maintain the same order-
ing relationship for ARM and POWER, yet the POWER model in [3] was
stronger, unsurprisingly our model disagrees with more cases for POWER.
That is, of the 7820 successful runs (approx. 350 tests were excluded due to
parsing or timeout problems) 550, or 7%, disagreed with the model in [3] or
the hardware results.11 Based on the discussion in [3][Sect. 8.1] we conjec-
ture that this is because forwarding, especially with respect to branches, is
handled differently. However, given the conformance with the more recent
758 tests reported above, and for reasons discussed in Sect. 4.3, we have not
pursued a specific change to our model to account for the discrepancies.
Instead we note that if we exclude litmus tests with multiple stores to
the same variables in one process, multiple loads of the same variable in one
process (i.e., po-loc issues), there are only 18 tests out of 3142 where our
model disagrees with both the model in [3] and the POWER hardware they
tested.
9. Verification of higher-level algorithms
We now show how our semantics and its Maude encoding can be used
to investigate running programs expressed in our wide-spectrum language on
architectures with weak memory models. Throughout we assume that simple
assignments are atomic, noting where we use more complex constructs such
as compare-and-swap and allocating a new node on the heap. The code
listings are close to the Maude code used for verification, but are refactored
from the sources in the literature to eliminate returns from inside a loop or
branch, etc., so that the code is expressed in a straightforward structure that
preserves the intended paths.
9.1. Locking
We analyse a simple lock/unlock algorithm for correctness under ARM
and POWER (the code is taken from a version for Java in [8, Sect. 7.3]).
10Litmus test propagate-sync-coherence was apparently not run on hardware, but
our model’s result agreed with that of [2].
11We note that the POWER model of [3] allows behaviours not observed in 1002 out of
8141 tests; there are no cases where the model of [3] forbids behaviour that was observed.
Initial state: {locked ↦ false}

unlock =̂
    locked := false

lock =̂
    lcl success ↦ false •
    while ¬success do
        success := ¬locked.getAndSet(true)
We use lcl to declare and initialise the local variable names used (registers
in assembler code). An important part of the algorithm is that the unlock
operation does not need to include a fence. The key line of code is success :=
¬locked .getAndSet(true) which in our framework we treat as an atomic block
with an implicit fence.
success := ¬locked .getAndSet(true) =̂
〈[¬locked ] , locked := true , fence〉 ; success := true ⊓
[locked ] ; success := false
The primitive locked .getAndSet(b) atomically changes the boolean variable
locked to b and returns the previous value of locked . Thus if locked is already
true locked .getAndSet(true) has no effect and returns true; if locked is false
then it becomes true and false is returned. The atomic action may be imple-
mented using load-linked/store-conditional or compare-and-swap primitives,
but we abstract from that and use an atomic block to define getAndSet . Note
that the guard condition references a global variable. This is not possible
in ARM or POWER directly, requiring a load of locked into a local regis-
ter first, however we can straightforwardly accommodate this in our general
semantics.
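For readers more familiar with library code than with atomic blocks, the following Python sketch mirrors the lock and unlock operations. It is an analogy only: the class `AtomicBool` is hypothetical, a `threading.Lock` stands in for the atomic block, and CPython threads do not exhibit the ARM or POWER reorderings that the model checking explores, so the sketch shows the intended sequential meaning of getAndSet rather than its weak memory behaviour.

```python
import threading

class AtomicBool:
    """Boolean cell whose get_and_set is atomic (stand-in for the atomic block)."""

    def __init__(self, value=False):
        self._value = value
        self._guard = threading.Lock()

    def get_and_set(self, new):
        # Atomically store `new` and return the previous value.
        with self._guard:
            old, self._value = self._value, new
            return old

    def set(self, new):
        with self._guard:
            self._value = new

locked = AtomicBool(False)   # initial state: locked = false

def lock():
    success = False
    while not success:
        # Succeeds only if the previous value was false, i.e. the lock was free.
        success = not locked.get_and_set(True)

def unlock():
    locked.set(False)
```

If `locked` is already true, `get_and_set(True)` leaves it unchanged and returns true, so the loop retries; if it is false, it becomes true and the caller holds the lock.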
9.1.1. Model checking
To test that this lock implementation provides mutual exclusion in ARM or
POWER, we call lock ; unlock with an intervening abstract critical section,
in which a flag per process is set to true/false on entry/exit, with one process
setting a variable conflict to true if both are in their critical section at the
same time. For model checking purposes we unfold the loop a finite number
of times, giving a structure of nested branches. Many of the generated paths
end with a guard testing success , which may be false; in which case that path
is removed from the analysis.
The prototype tool was able to confirm mutual exclusion was satisfied
for two concurrent processes with approximately 800 million rewrites in 14
minutes of processor time.
9.2. Treiber stack
We model the Treiber stack, a well known lock-free data structure im-
plementing an abstract stack [9], in which push and pop operations can be
called concurrently. The operations are structured as potentially infinite
loops which retry if interference occurs on the Head of the stack, other-
wise succeeding and modifying Head atomically with a compare-and-swap
instruction.
Initial state: {head ↦ null}

push(v) =̂
    lcl head ↦ null, n ↦ null •
    n := newNode(v)
    repeat
        head := Head
        n.next := head
    until CAS(Head, head, n)

pop =̂
    lcl head ↦ null, n ↦ null,
        return ↦ retry •
    while return = retry do
        head := Head
        if head = null then
            return := empty
        else
            n := heap(head)
            if CAS(Head, head, n.next) then
                return := n.value
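The listing can be transliterated into Python to make the retry structure concrete. This is a sketch under several assumptions: `AtomicRef` is a hypothetical cell whose `cas` is made atomic with a `threading.Lock`, the heap is a Python list of (value, next) pairs indexed by address, nodes are never freed (so the ABA problem does not arise), and the loops return directly rather than using the refactored single-exit form. CPython threads do not reorder instructions as ARM or POWER do, so this illustrates the algorithm and not its weak memory behaviour.

```python
import threading

class AtomicRef:
    """Cell with an atomic compare-and-swap (stand-in for the CAS instruction)."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        with self._guard:
            return self._value

    def cas(self, expected, new):
        # Update to `new` only if the current value equals `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

heap = []                        # heap[i] is the node (value, next) at address i
heap_guard = threading.Lock()    # makes allocation atomic
Head = AtomicRef(None)           # shared pointer to the top node, initially null
EMPTY = object()                 # return value of pop on an empty stack

def new_node(v):
    with heap_guard:
        heap.append((v, None))
        return len(heap) - 1     # address of the new node

def push(v):
    n = new_node(v)
    while True:
        head = Head.get()
        heap[n] = (heap[n][0], head)      # n.next := head
        if Head.cas(head, n):
            return

def pop():
    while True:
        head = Head.get()
        if head is None:
            return EMPTY
        value, nxt = heap[head]
        if Head.cas(head, nxt):
            return value
```

Calling `push(1)`, `push(2)` and then `pop()` twice returns 2 and then 1, and a further `pop()` returns `EMPTY`.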
The algorithm accesses the heap, which we model as individual shared
variables (addresses) heap(0), heap(1), etc., which can be both assigned to
and appear in expressions. An implicit shared variable maxh, initially 0,
keeps track of the next free address in the heap. Addresses are allocated in
order from 0. A node value is the term node(v , p) where p is a pointer (either
a natural number index into the heap or null).12
12If pointers are freed and reused the algorithms suffer from the common “ABA prob-
lem”, where a new node may reuse a heap location and cause a CAS operation to succeed
where it should fail. The chances of this occurring in a practical setting can be made
acceptably small by introducing modification counts to the pointers [24]. For our model
32
We assume the underlying system provides the following implementation
of n := newNode(v).
〈heap(maxh) := node(v , null) , n :=maxh , maxh := (maxh + 1) , fence〉
That is, when a new node is allocated, the next available address is initialised
to a new node value, maxh is incremented, the return value n is set to point
to the new allocation, and a final fence ensures that all other threads see
the effect on maxh atomically.13
Updating the next or val part of a node also uses a shorthand, i.e.,
n.next := head is expanded to heap(n) := node(heap(n).value, head), which
overwrites the node’s next pointer, but keeps the original value.
9.2.1. Verification
To verify this algorithm we compare the final values of the stack and the
return values of the pop operations against those of an abstract specification
of a stack (we do not prove termination). We use braces 〈..〉 to enclose
sequences and ⌢ for sequence concatenation.
push(v) =̂ s := 〈v〉 ⌢ s (82)
pop =̂ lcl return 7→ empty • (83)
〈[s = 〈〉] , return := empty〉 ⊓ (84)
〈[s 6= 〈〉] , return := head(s) , s := tail(s)〉 (85)
The operations are modelled using atomic steps, and we model an explicit
return value in the abstract specification.
Although we check the values returned by pop, we do not attempt to
relate intermediate values of the stack to their abstract counterpart, only
the final value. Agreement on final values of the stack does give some confi-
dence that incorrect values are not discovered during the execution; however
we emphasise that this approach is an approximation of correctness only.
checking purposes we assume the system provides some garbage collection mechanism that
avoids this problem and we do not explicitly model freeing nodes or avoiding the ABA
problem.
13 Note that the fence does not affect variables other than maxh as there are no preceding
instructions in push and all operations (potentially preceding the push) end with a CAS
with an implicit fence (20).
Nevertheless, the technique can potentially identify problems in running al-
gorithms on weak memory models: the same technique exposed a flaw in a
published algorithm, as discussed in Sect. 9.3.
We validate the code against the abstract specification by running a
process p which is a combination of push and pop operations, for instance
push(v) ‖ pop. We then check that the heap and Head pointers abstractly
give a valid stack, and the return values correspond with the expected ab-
stract return values.
We ran 4 combinations of parallel processes (in addition to testing push
and pop running in isolation), and all four gave the expected abstract be-
haviour. The most time intensive test involved three concurrent processes
formed from a push and two pops. This required 103 billion rewrites and took
23 hours to return. These results give confidence that the Treiber stack algo-
rithm will work on ARM and POWER-style weak memory models assuming
CAS is implemented as in (20).
9.3. Chase-Lev deque
Lê et al. [11] present a version of the Chase-Lev deque (double-ended
queue) [10] adapted for weak memory models. The deque is implemented
as an array, where elements may be put on or taken from the tail, and
additionally, processes may steal an element from the head of the deque.
The put and take operations may be executed by a single thread only, hence
there is no interference between these two operations, although the thread-
local reorderings could cause consecutive invocations to overlap. The steal
operation can be executed by other processes.
The code we test is given in Fig. 8. The original code includes handling
array resizing, but here we focus on the insert/delete logic. As above, we
have refactored the algorithm to eliminate returns from within a branch, and
use CAS terminology for the atomic updates.
The put operation straightforwardly adds an element to the end of the
deque, incrementing the tail index. It includes a full fence so that the tail
pointer is not incremented before the element is placed in the array. The take
operation uses a CAS operation to atomically increment the head index.
Interference can occur if there is a concurrent steal operation in progress,
which also uses CAS to increment head to remove an element from the head
of the deque. The take and steal operations return empty if they observe an
empty deque. In addition the steal operation may return the special value
fail if interference on head occurs. Complexity arises if the deque has one
Initial state: {head ↦ 0, tail ↦ 0, tasks ↦ 〈 , , . . .〉}
put(v) =̂
  lcl t ↦ •
  t := tail ;
  tasks[t mod L] := v ;
  fence ;
  tail := t + 1
take =̂
  lcl h ↦ , t ↦ , return ↦ •
  t := tail − 1 ;
  tail := t ;
  fence ;
  h := head ;
  if h ≤ t then
    return := tasks[t mod L] ;
    if h = t then
      if ¬CAS(head, h, h + 1) then
        return := empty
      tail := t + 1
  else
    return := empty ;
    tail := t + 1
steal =̂
  lcl h ↦ , t ↦ , return ↦ •
  h := head ;
  fence ;
  t := tail ;
  cfence ; // unnecessary
  if h < t then
    return := tasks[h mod L] ;
    cfence ; // incorrectly placed
    if ¬CAS(head, h, h + 1) then
      return := fail
  else
    return := empty
Figure 8: A version of Lê et al.’s work-stealing deque algorithm for ARM [11]
element and there are concurrent processes trying to both take and steal
that element at the same time.
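To make the insert/delete logic concrete, the three operations of Fig. 8 can be transcribed into sequential Python. This is a sketch under strong assumptions: there is no concurrency and no reordering, so the fence and cfence instructions appear only as comments; the class name Deque, the method cas and the default size are hypothetical.

```python
EMPTY, FAIL = "empty", "fail"  # hypothetical encodings of the special values


class Deque:
    def __init__(self, size=8):
        self.L = size
        self.head = 0
        self.tail = 0
        self.tasks = [None] * size

    def cas(self, expected, new):
        # Models CAS(head, expected, new) as one indivisible step.
        if self.head == expected:
            self.head = new
            return True
        return False

    def put(self, v):
        t = self.tail
        self.tasks[t % self.L] = v
        # fence: the element is stored before tail is incremented
        self.tail = t + 1

    def take(self):
        t = self.tail - 1
        self.tail = t
        # fence
        h = self.head
        if h <= t:
            ret = self.tasks[t % self.L]
            if h == t:
                # Last element: compete with any concurrent steal.
                if not self.cas(h, h + 1):
                    ret = EMPTY
                self.tail = t + 1
        else:
            ret = EMPTY
            self.tail = t + 1
        return ret

    def steal(self):
        h = self.head
        # fence
        t = self.tail
        if h < t:
            ret = self.tasks[h % self.L]
            if not self.cas(h, h + 1):
                ret = FAIL
        else:
            ret = EMPTY
        return ret


d = Deque()
d.put(1)
d.put(2)
first = d.steal()   # removes from the head
last = d.take()     # removes from the tail
```

In this single-threaded setting the CAS in steal never fails; the fail result only becomes reachable once take and steal are interleaved.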
Operations take and steal use a fence operation to ensure they have con-
sistent readings for the head and tail indexes, and later use CAS to atomically
update the head pointer (only if necessary, in the case of take). Additionally,
the steal operation contains two cfence barriers (ctrl_isync in ARM). Our
analysis suggests that the first control fence is redundant, and the second is
incorrectly placed. Eliminating the first cfence and swapping the order of
the second control fence with the preceding load from tasks gives the expected
behaviour. We describe this in more detail below.
9.3.1. Verification
As with the Treiber stack we use an abstract model of the deque and
its operations to generate the allowed final values of the deque and return
values. The function last(q) returns the last element in q and front(q) returns
q excluding its last element.
put(v) =̂ q := q ⌢ 〈v〉
take =̂ lcl return := none •
〈[q = 〈〉] , return := empty〉 ⊓
〈[q ≠ 〈〉] , return := last(q) , q := front(q)〉
steal =̂ lcl return := none •
〈[q = 〈〉] , return := empty〉 ⊓
〈[q ≠ 〈〉] , return := head(q) , q := tail(q)〉
The abstract specification for steal is not precise as it does not attempt
to detect interference and return fail. As such we exclude these behaviours
of the concrete code from the analysis.
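Read as functions on a sequence, the abstract deque is equally short. The Python sketch below uses hypothetical function names and a hypothetical string for the empty result; as in the specification, steal never returns fail.

```python
EMPTY = "empty"  # hypothetical encoding of the special return value


def put(q, v):
    """Abstract put: q := q ^ <v>."""
    return q + [v]


def take(q):
    """Abstract take: remove and return last(q), leaving front(q)."""
    if not q:
        return q, EMPTY
    return q[:-1], q[-1]


def steal(q):
    """Abstract steal: remove and return head(q), leaving tail(q)."""
    if not q:
        return q, EMPTY
    return q[1:], q[0]


q = put(put(put([], 1), 2), 3)   # <1, 2, 3>
q, stolen = steal(q)             # stolen is 1, q is <2, 3>
q, taken = take(q)               # taken is 3, q is <2>
```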
That the first control fence is redundant is shown by the following deriva-
tion.
h := Head ; fence ; t := Tail ; cfence; if . . .
⊑ Law (9)
h := Head ; fence ; t := Tail  cfence; if . . .
⊑ Law (10)
h := Head ; fence ; cfence  t := Tail ; if . . .
The control fence can be reordered before the previous load; it is now im-
mediately after a fence, and has no further effect on reorderings of later
instructions, and hence is redundant.
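The effect of the two control fences can be explored with a toy pairwise reordering check. The Python sketch below is a deliberate simplification and not the reordering relation defined earlier in the paper: the rules in can_reorder are assumptions drawn only from the statements in this section (a full fence orders everything, a control fence stops later loads moving ahead of it and cannot itself move ahead of a branch guard, a store cannot move ahead of a guard, and dependent actions keep their order). An execution order is assumed to be allowed when every pair that has been swapped relative to program order is reorderable. All names are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Action:
    name: str
    kind: str                        # "load", "store", "guard", "fence" or "cfence"
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def can_reorder(alpha, beta):
    """May beta, which follows alpha in program order, take effect first?"""
    if "fence" in (alpha.kind, beta.kind):
        return False                 # assumed: a full fence orders everything
    if alpha.kind == "cfence":
        return beta.kind != "load"   # assumed: loads may not pass a cfence
    if beta.kind == "cfence":
        return alpha.kind != "guard"  # assumed: a cfence may not pass a guard
    if alpha.kind == "guard" and beta.kind == "store":
        return False                 # assumed: stores are not speculated
    if alpha.writes & (beta.reads | beta.writes):
        return False                 # beta depends on alpha
    return not (beta.writes & alpha.reads)


def allowed_orders(program):
    """Orders (as index tuples) in which every inverted pair is reorderable."""
    n = len(program)
    for order in permutations(range(n)):
        pos = {idx: k for k, idx in enumerate(order)}
        if all(can_reorder(program[a], program[b])
               for a in range(n) for b in range(a + 1, n)
               if pos[b] < pos[a]):
            yield order


def may_precede(program, later, earlier):
    """Can program[later] take effect before program[earlier]?"""
    return any(o.index(later) < o.index(earlier) for o in allowed_orders(program))


load_head = Action("h := Head", "load", frozenset({"Head"}), frozenset({"h"}))
fence = Action("fence", "fence")
load_tail = Action("t := Tail", "load", frozenset({"Tail"}), frozenset({"t"}))
cfence = Action("cfence", "cfence")
guard = Action("[h < t]", "guard", frozenset({"h", "t"}))
load_task = Action("return := tasks[h mod L]", "load",
                   frozenset({"h", "tasks"}), frozenset({"return"}))

original = [load_head, fence, load_tail, cfence, guard, load_task, cfence]
without_first = [load_head, fence, load_tail, guard, load_task, cfence]
repaired = [load_head, fence, load_tail, guard, cfence, load_task]

# The first cfence can move ahead of the load of Tail.
first_cfence_moves = may_precede(original, 3, 2)
# With or without the first cfence, the load from tasks can precede the load of Tail.
early_load_original = may_precede(original, 5, 2)
early_load_without_first = may_precede(without_first, 4, 2)
# Placing the cfence between the guard and the load rules this out.
early_load_repaired = may_precede(repaired, 5, 2)
```

Under these assumed rules the first three checks evaluate to True and the last to False, which agrees with the claims that the first control fence is redundant and that the second has an effect only when it sits between the branch and the load.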
Our model checking using the Maude encoding exposes a bug in the code
which may occur when a put and a steal operation execute in parallel on an
empty deque. The load return := tasks[h mod L] can be speculatively executed
before the branch is evaluated, and hence also before the load of tail. Thus
the steal process may load head, then load an irrelevant value from tasks, at which
point a put operation may complete, storing a value and incrementing tail.
The steal operation resumes, loading the new value for tail and observing
a non-empty deque, succeeding with its CAS and returning the irrelevant
value, which was loaded before the put operation had begun.
More concretely, the following reordering is possible (similar to the deriva-
tion in Sect. 6.1). We have removed the first (redundant) control fence, and
we leave much of the structure summarised as . . . as it is only the first in-
structions that are relevant.
h := Head ; fence ; t := Tail ;
if h < t then return := tasks [h mod L] ; cfence . . .
⊑ Defn. of if ; resolve to first branch
h := Head ; fence ; t := Tail ;
[h < t ] ; return := tasks [h mod L] ; cfence . . .
⊑ Law (9)
h := Head ; fence ; t := Tail ;
[h < t ] ; return := tasks [h mod L]  cfence . . .
⊑ Law (10)
h := Head ; fence ; t := Tail ;
return := tasks [h mod L]  [h < t ] ; cfence . . .
⊑ Law (10)
h := Head ; fence ; return := tasks [h mod L]  t := Tail ;
[h < t ] ; cfence . . .
The access to local variable h in the index expression now precedes the load
t := Tail. Hence, if Head = Tail = 0 initially, the deque is empty, and the
assignment return := tasks[h mod L] sets return to the value at tasks[0],
which is an irrelevant value. Now a sibling process may execute a put(v),
inserting v at tasks[0] after it has been read and incrementing Tail. Then the
original process resumes, reading t := 1 and passing the guard, and
eventually returns the stale default value read from tasks[0]. The trailing cfence in the
original algorithm has no effect on this possible reordering as it occurs after
the load.
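The problematic trace can be replayed step by step. The Python sketch below hard-codes two schedules for a steal running against a put on an empty deque: one in which the load from tasks happens after the load of Tail, and one in which it has been reordered before it. The step functions, the value 42 and the string "default" for the initial array contents are all hypothetical, and the guard and CAS are folded into a single final step.

```python
EMPTY, FAIL, DEFAULT = "empty", "fail", "default"
L = 4  # hypothetical array length


def load_head(mem, reg):
    reg["h"] = mem["head"]


def load_tail(mem, reg):
    reg["t"] = mem["tail"]


def load_task(mem, reg):
    # return := tasks[h mod L], possibly executed speculatively
    reg["ret"] = mem["tasks"][reg["h"] % L]


def put(v):
    def step(mem, reg):
        # The whole put runs to completion on a sibling process.
        t = mem["tail"]
        mem["tasks"][t % L] = v
        mem["tail"] = t + 1
    return step


def guard_and_cas(mem, reg):
    # Resolve the branch [h < t]; a speculated load is discarded if it fails.
    if reg["h"] < reg["t"]:
        if mem["head"] == reg["h"]:
            mem["head"] = reg["h"] + 1
        else:
            reg["ret"] = FAIL
    else:
        reg["ret"] = EMPTY


def run(trace):
    mem = {"head": 0, "tail": 0, "tasks": [DEFAULT] * L}
    reg = {}
    for step in trace:
        step(mem, reg)
    return reg["ret"]


in_order = [load_head, put(42), load_tail, load_task, guard_and_cas]
reordered = [load_head, load_task, put(42), load_tail, guard_and_cas]

in_order_result = run(in_order)      # the steal sees the element that was put
reordered_result = run(reordered)    # the steal returns the stale array content
```

In the reordered schedule the steal succeeds with its CAS yet returns a value that was never put, which is the behaviour the abstract specification forbids.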
Swapping the order of this second cfence with the load from tasks eliminates
the above reordering, and our analysis did not reveal any other problems. In
addition, eliminating the first cfence does not change the possible outcomes.
The original placement of the control fence is unusual in that it comes after
a load and before an atomic load and store. It is reasonable to assume that a
CAS cannot be reordered before a branch since it involves a store to a global
address. Therefore the placement of the control fence may indicate a minor
misunderstanding of the subtleties of where a control fence must be placed
to have the desired effect; certainly, a control fence appears to be required
for the algorithm to work correctly.
We tested 5 combinations of the three (modified) operations in parallel
(as well as testing single operations and combinations of put and take on
a single thread). The longest test to complete was a put with two steals,
which required 2 billion rewrites and 35 minutes.
10. Related work
This work builds on a significant body of work in elucidating the be-
haviour of weak memory models in TSO, ARM and POWER via both op-
erational and axiomatic semantics [7, 3, 2, 25, 4]. Those semantics were de-
veloped and validated through testing on real hardware and in consultation
with processor vendors. We therefore had the easier task of validating our
semantics against their results, in the form of the results of litmus tests. The
intention of that body of work was to provide the foundation for higher-level
verification of the sort that we have presented here.
More specifically, our model of the storage subsystem is similar to that of
the operational models of [2, 4]. However our thread model is quite different,
being defined in terms of relationships between actions. The key difference is
how we handle branching and the effects of speculative execution. The earlier
models are complicated in the sense that they are closer to the real execu-
tion of instructions on a processor, involving restarting reads if an earlier
read invalidates the choice taken at a branch point. We instead use a more
abstract formulation of branches as guards. Because speculative execution
should have no effect if an incorrect choice is made, it is straightforward to
eliminate such behaviours. However, the behaviour where the correct choice
is (eventually) made contains that choice as a guard action, before which
later actions may have been reordered (if allowed by the rules of the archi-
tecture). Our semantics is presented in a conventional operational semantics
style, where actions appear in the trace. Unlike Plotkin-style operational
semantics [26, 27], we do not keep the state in the configurations, but
instead represent it as a first-order command of the language. This style
interacts well with syntax-specific behaviours such as distinguishing between
behaviours for registers and shared variables. A similar approach is used by
Owens [28] and Abadi & Harris [29] in operational semantics; the syntax of
labels is also used in a denotational semantics by Brookes [30].
Our approach to modelling the non-multicopy atomic storage subsystem
state is based on that of the operational model of [2]. However, that model
maintains several partial orders on operations reflecting the nondeterminism
in the system, whereas we let the nondeterminism be represented by choices
in the operational rules. This means we maintain a simpler data structure,
a single global list of writes.
The axiomatic models, as exemplified by [3], define relationships between
instructions in a whole-system way, including relationships between instruc-
tions in concurrent threads. This gives a global view of how an architecture’s
reordering rules (and storage system) interact to reorder instructions in a sys-
tem. Such global orderings are not immediately obvious from our pair-wise
orderings on instructions. On the other hand, those global orderings become
quite complex and obscure some details, and it is unclear how to extract some
of the generic principles such as (4).
We do not distinguish between ARM and POWER, and our model is less
accurate to POWER with respect to the litmus tests than it is to ARM. The
difference between ARM and POWER is accounted for in [3] by weakening
the POWER model to obtain their ARM model, loosening the “po-loc”
constraint, i.e., allowing loads and stores to the same variable on a process
to not necessarily occur in program order. This is fundamentally against
our basic principles for reordering and we cannot directly represent the same
change in our framework. However, many of the behaviours discussed in
[3] as being peculiar to ARM were modelled by allowing forwarding (and
eliminating earlier writes).
The model checking approach we developed exposed a bug in an algo-
rithm in [11] in relation to the placement of a control fence. That paper
includes a formal proof of the correctness of the algorithm based on the ax-
iomatic model of [25]. The possible traces of the code were enumerated and
validated against a set of conditions on adding and removing elements from
the deque (rather than with respect to an abstract specification of the deque).
As shown via derivations in Sect. 9.3.1 the reordering is straightforward to
observe directly by looking at the code. The reordering relation for the ARM
architecture shows that the first control fence is redundant (does not prevent
any reorderings) because it can be reordered to the previous fence. Similarly
the load in the branch can come before the branch point itself (speculatively),
and hence before the earlier load controlling the branch. The control fence,
in its original position, does nothing to prevent this. The semantics of [3]
does not uncover this anomaly as directly because it is more complex to
construct the whole-code relations, while operational models that are more
closely based on hardware mechanisms [2, 4] are more complex and obscure
this relatively straightforward property.
Other approaches. Describing weak memory models has been tackled in a
variety of other approaches. Our results agree broadly with those of [31] in
that many reorderings cannot be explained locally only, and need a storage
system for explanation. That work provides some results relating TSO, ARM
and POWER, although that model does not handle control fences. Alglave
and Cousot [32] develop an Owicki-Gries style proof method for concurrent
algorithms in which the program text is annotated with invariants. The
algebraic approach we adopt is closer to the style of the Concurrent Kleene
Algebra [33], where sequential and parallel composition contribute to the
event ordering.
Tool support and model checking. The tool we developed was written in
Maude without any specific attempt to specialise for the performance issues
of weak memory models. The number of interleavings of parallel processes
is factorial in the total number of actions, and this explosion is compounded
by local reorderings. The relatively simple algorithms in Sect. 9 became in-
feasible to model check for 4 or more parallel processes. Potentially we could
restructure the semantics to develop partial orders on actions rather than
traces, and hence benefit from other model checking approaches such as [34].
However, our tool does perform well on the litmus tests in comparison with
the tools in [2, 4], which did not report results for some tests that our tool
was able to.
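The growth in the number of interleavings can be made concrete with the standard multinomial count: k processes with n1, ..., nk atomic actions have (n1 + ... + nk)! / (n1! ... nk!) interleavings. The Python sketch below computes this figure; it is a textbook formula rather than something taken from the tool, the function name is hypothetical, and local reorderings, which enlarge the state space further, are ignored.

```python
from math import factorial


def interleavings(lengths):
    """Number of interleavings of processes with the given action counts."""
    total = factorial(sum(lengths))
    for n in lengths:
        total //= factorial(n)   # exact: each partial quotient is an integer
    return total


two_processes = interleavings([5, 5])
three_processes = interleavings([5, 5, 5])
four_processes = interleavings([5, 5, 5, 5])
```

Even with only five actions per process, moving from three to four processes multiplies the count by more than ten thousand, which is consistent with the observation that four or more parallel processes became infeasible to check.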
11. Conclusion
We have built upon earlier work to devise a model of relaxed memory
which is relatively straightforward to define and extend, and which lends
itself to model checking and formal analysis. While abstracting away from
the details of the architecture, we believe it provides a complementary insight
into why some reorderings are allowed, requiring a pair-wise relationship
rather than system-wide.
We have described the ordering condition as syntactic constraints on
atomic actions. This fits with the low-level decisions of hardware proces-
sors such as ARM and POWER, but variable references are not in general
maintained by compilers (for instance, r := y × 0 may be reduced to r := 0,
eliminating what is syntactically a load). Our main reordering principle (4)
is based on semantic concerns: preserving sequential behaviour. As such
our semantics may be applicable as a basis for understanding the interplay
of software memory models, compiler optimisations and hardware memory
models [35].
One of the key aspects of our semantics is that it uses labels to describe
the traces. Traditional Plotkin-style operational semantics [27] keep the state
in the configuration of the rule, which makes it difficult, if not impossible,
to determine the exact nature of an instruction executed by a subterm, and
hence to check the constraints of the reordering principle (4). In addition,
the use of guards from Dijkstra’s guarded command language [13] allowed an
abstract treatment of speculative execution and the effect of early loads (loads
occurring before a branch has been evaluated). Systems are specified in a
term structure (14) which theoretically lends itself to algebraic manipulation,
with the intention of being able to formally prove correctness of higher-level
algorithms. The first steps towards this goal are taken in Sections 9.2.1
and 9.3.1.
Acknowledgements. We thank Kirsten Winter and Ian Hayes for feedback
on this work. We also thank Luc Maranget, Peter Sewell, Jade Alglave, and
Christopher Pulte for assistance with litmus test analysis. The work was
supported by Australian Research Council Discovery Grant DP160102457.
References
[1] S. V. Adve, H.-J. Boehm, Memory models: A case for rethinking
parallel languages and hardware, Commun. ACM 53 (8) (2010)
90–101. doi:10.1145/1787234.1787255.
URL http://doi.acm.org/10.1145/1787234.1787255
[2] S. Sarkar, P. Sewell, J. Alglave, L. Maranget, D. Williams,
Understanding POWER multiprocessors, SIGPLAN Not. 46 (6) (2011)
175–186. doi:10.1145/1993316.1993520.
URL http://doi.acm.org/10.1145/1993316.1993520
[3] J. Alglave, L. Maranget, M. Tautschnig, Herding cats: Modelling,
simulation, testing, and data mining for weak memory, ACM Trans.
Program. Lang. Syst. 36 (2) (2014) 7:1–7:74. doi:10.1145/2627752.
URL http://doi.acm.org/10.1145/2627752
[4] S. Flur, K. E. Gray, C. Pulte, S. Sarkar, A. Sezgin, L. Maranget,
W. Deacon, P. Sewell, Modelling the ARMv8 architecture,
operationally: Concurrency and ISA, in: Proceedings of the 43rd
Annual ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, POPL ’16, ACM, New York, NY, USA,
2016, pp. 608–621. doi:10.1145/2837614.2837615.
URL http://doi.acm.org/10.1145/2837614.2837615
[5] J. Alglave, L. Maranget, S. Sarkar, P. Sewell, Litmus: Running tests
against hardware, in: P. A. Abdulla, K. R. M. Leino (Eds.), Tools and
Algorithms for the Construction and Analysis of Systems: 17th
International Conference, TACAS 2011, Held as Part of the Joint
European Conferences on Theory and Practice of Software, ETAPS
2011, Saarbrücken, Germany, March 26–April 3, 2011. Proceedings,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 41–44.
doi:10.1007/978-3-642-19835-9_5.
[6] S. Mador-Haim, R. Alur, M. M. K. Martin, Generating litmus tests for
contrasting memory consistency models, in: T. Touili, B. Cook,
P. Jackson (Eds.), Computer Aided Verification: 22nd International
Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010. Proceedings,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 273–287.
doi:10.1007/978-3-642-14295-6_26.
[7] P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, M. O. Myreen, X86-TSO:
A rigorous and usable programmer’s model for x86 multiprocessors,
Commun. ACM 53 (7) (2010) 89–97. doi:10.1145/1785414.1785443.
URL http://doi.acm.org/10.1145/1785414.1785443
[8] M. Herlihy, N. Shavit, The art of multiprocessor programming, Morgan
Kaufmann, 2011.
[9] R. K. Treiber, Systems Programming: Coping with Parallelism. RJ5118,
Tech. rep., IBM Almaden Research Center (April 1986).
[10] D. Chase, Y. Lev, Dynamic circular work-stealing deque, in: SPAA’05:
Proceedings of the 17th annual ACM symposium on Parallelism in
algorithms and architectures, ACM Press, New York, NY, USA, 2005,
pp. 21–28. doi:10.1145/1073970.1073974.
[11] N. M. Lê, A. Pop, A. Cohen, F. Zappa Nardelli, Correct and efficient
work-stealing for weak memory models, in: Proceedings of the 18th
ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP ’13, ACM, New York, NY, USA, 2013, pp.
69–80. doi:10.1145/2442516.2442524.
URL http://doi.acm.org/10.1145/2442516.2442524
[12] R. J. Colvin, G. Smith, A wide-spectrum language for verification of
programs on weak memory models, in: K. Havelund, J. Peleska,
B. Roscoe, E. de Vink (Eds.), Formal Methods, Springer International
Publishing, Cham, 2018, pp. 240–257.
[13] E. W. Dijkstra, Guarded commands, nondeterminacy and formal
derivation of programs, Commun. ACM 18 (8) (1975) 453–457.
doi:10.1145/360933.360975.
URL http://doi.acm.org/10.1145/360933.360975
[14] R. Milner, A Calculus of Communicating Systems, Springer-Verlag
New York, Inc., 1982.
[15] C. A. R. Hoare, Communicating Sequential Processes, Prentice-Hall,
Inc., Upper Saddle River, NJ, USA, 1985.
[16] D. J. Sorin, M. D. Hill, D. A. Wood, A Primer on Memory Consistency
and Cache Coherence, 1st Edition, Morgan & Claypool Publishers,
2011.
[17] C. Pulte, S. Flur, W. Deacon, J. French, S. Sarkar, P. Sewell,
Simplifying ARM concurrency: Multicopy-atomic axiomatic and
operational models for ARMv8, in: Proceedings of the ACM
SIGPLAN-SIGACT Symposium on Principles of Programming
Languages (POPL), ACM Press, 2018, to appear.
[18] M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer,
J. F. Quesada, Maude: specification and programming in rewriting
logic, Theoretical Computer Science 285 (2) (2002) 187 – 243.
doi:10.1016/S0304-3975(01)00359-0.
URL http://www.sciencedirect.com/science/article/pii/
S0304397501003590
[19] A. Verdejo, N. Martí-Oliet, Executable structural operational semantics
in Maude, Journal of Logic and Algebraic Programming 67 (1-2)
(2006) 226 – 293. doi:10.1016/j.jlap.2005.09.008.
URL http://www.sciencedirect.com/science/article/pii/
S1567832605000846
[20] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp,
S. Mangard, T. Prescher, M. Schwarz, Y. Yarom, Spectre attacks:
Exploiting speculative execution, CoRR abs/1801.01203.
arXiv:1801.01203.
URL http://arxiv.org/abs/1801.01203
[21] C. Morgan, Programming from Specifications, 2nd Edition, Prentice
Hall, 1994.
[22] R. J. Back, J. von Wright, Refinement Calculus: A Systematic
Introduction, Springer-Verlag, 1998.
[23] Cortex-A9 MPCore, Programmer Advice Notice, Read-after-Read
Hazards., ARM Ltd. (2011).
[24] M. Moir, Practical implementations of non-blocking synchronization
primitives, in: Proceedings of the Sixteenth Annual ACM Symposium
on Principles of Distributed Computing, PODC ’97, ACM, New York,
NY, USA, 1997, pp. 219–228. doi:10.1145/259380.259442.
URL http://doi.acm.org/10.1145/259380.259442
[25] S. Mador-Haim, L. Maranget, S. Sarkar, K. Memarian, J. Alglave,
S. Owens, R. Alur, M. M. K. Martin, P. Sewell, D. Williams, An
axiomatic memory model for POWER multiprocessors, in:
Proceedings of the 24th International Conference on Computer Aided
Verification, CAV’12, Springer-Verlag, Berlin, Heidelberg, 2012, pp.
495–512. doi:10.1007/978-3-642-31424-7_36.
URL http://dx.doi.org/10.1007/978-3-642-31424-7_36
[26] G. D. Plotkin, A structural approach to operational semantics., Tech.
Rep. DAIMI FN-19, Computer Science Department, Aarhus University
(1981).
[27] G. D. Plotkin, A structural approach to operational semantics., J. Log.
Algebr. Program. 60-61 (2004) 17–139.
[28] S. Owens, A sound semantics for OCaml light, in: S. Drossopoulou
(Ed.), European Symposium on Programming (ESOP), Vol. 4960 of
Lecture Notes in Computer Science, Springer, 2008, pp. 1–15.
[29] M. Abadi, T. Harris, Perspectives on transactional memory, in:
M. Bravetti, G. Zavattaro (Eds.), Proceeding of Concurrency Theory
(CONCUR 2009), Vol. 5710 of Lecture Notes in Computer Science,
Springer, 2009, pp. 1–14.
[30] S. Brookes, A semantics for concurrent separation logic, Theoretical
Computer Science 375 (1-3) (2007) 227 – 270.
doi:10.1016/j.tcs.2006.12.034.
URL http://www.sciencedirect.com/science/article/pii/
S0304397506009248
[31] O. Lahav, V. Vafeiadis, Explaining relaxed memory models with
program transformations, in: J. Fitzgerald, C. Heitmeyer, S. Gnesi,
A. Philippou (Eds.), FM 2016: Formal Methods: 21st International
Symposium, Limassol, Cyprus, November 9-11, 2016, Proceedings,
Springer International Publishing, Cham, 2016, pp. 479–495.
doi:10.1007/978-3-319-48989-6_29.
URL http://dx.doi.org/10.1007/978-3-319-48989-6_29
[32] J. Alglave, P. Cousot, Ogre and Pythia: An invariance proof method
for weak consistency models, in: Proceedings of the 44th ACM
SIGPLAN Symposium on Principles of Programming Languages,
POPL 2017, ACM, New York, NY, USA, 2017, pp. 3–18.
doi:10.1145/3009837.3009883.
URL http://doi.acm.org/10.1145/3009837.3009883
[33] C. A. R. Hoare, B. Möller, G. Struth, I. Wehrman, Concurrent
Kleene algebra, in: M. Bravetti, G. Zavattaro (Eds.), CONCUR 2009 -
Concurrency Theory: 20th International Conference, CONCUR 2009,
Bologna, Italy, September 1-4, 2009. Proceedings, Springer Berlin
Heidelberg, Berlin, Heidelberg, 2009, pp. 399–414.
doi:10.1007/978-3-642-04081-8_27.
URL http://dx.doi.org/10.1007/978-3-642-04081-8_27
[34] J. Alglave, D. Kroening, M. Tautschnig, Partial orders for efficient
bounded model checking of concurrent software, in: N. Sharygina,
H. Veith (Eds.), Computer Aided Verification: 25th International
Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013.
Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp.
141–157. doi:10.1007/978-3-642-39799-8_9.
URL http://dx.doi.org/10.1007/978-3-642-39799-8_9
[35] J. Kang, C.-K. Hur, O. Lahav, V. Vafeiadis, D. Dreyer, A promising
semantics for relaxed-memory concurrency, in: Proceedings of the 44th
ACM SIGPLAN Symposium on Principles of Programming Languages,
POPL 2017, ACM, New York, NY, USA, 2017, pp. 175–189.
doi:10.1145/3009837.3009850.
URL http://doi.acm.org/10.1145/3009837.3009850
Appendix A. Shadow registers
Consider the test MP+dmb+rs from [4]14
(x := 1 ; fence ; y := 1) ‖ (r1 := y ; r2 := r1 ; r1 := x ) (A.1)
Shadow registers allow behaviour where the above process can reach a final
state where r1 = 0 and r2 = 1. This implies that r1 has read the initial
value of x after reading an updated version of y , which is prohibited by the
14 http://www.cl.cam.ac.uk/~pes20/arm-supplemental/src/MP+dmb+rs.litmus
fence in the first process. According to (4) reordering of any of the register
operations in process 2 should be disallowed because there is a dependency
(each references r1). But it seems that the first two are collapsed into a load
r2 := y , and this can be reordered with r1 := x as there is no dependency.
This would be sound except that the value of r1 should be preserved. This
litmus test and surrounding discussion indicate that the final value of shadow
registers should not be referenced, and that shadow registers are used for
storing temporary calculated values. In this case, we do not need to model
such registers for high-level code as they do not have the typical semantics of
a local variable (one would instead use an explicit temporary variable rather
than reuse a variable name such as r1).
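For comparison, the outcomes of (A.1) when neither process reorders its instructions can be enumerated directly. The Python sketch below interleaves the two processes in program order over a single shared store, so the fence is a no-op and is omitted. It is a sequentially consistent baseline only, not a model of ARM or of shadow registers, and the helper names are hypothetical.

```python
def assign_const(var, val):
    def step(state):
        state[var] = val
    return step


def assign_var(dst, src):
    def step(state):
        state[dst] = state[src]
    return step


def interleave(a, b):
    """All interleavings of two step lists, preserving each list's order."""
    if not a:
        yield tuple(b)
        return
    if not b:
        yield tuple(a)
        return
    for rest in interleave(a[1:], b):
        yield (a[0],) + rest
    for rest in interleave(a, b[1:]):
        yield (b[0],) + rest


# x := 1 ; fence ; y := 1   (the fence has no effect in this baseline)
p1 = [assign_const("x", 1), assign_const("y", 1)]
# r1 := y ; r2 := r1 ; r1 := x
p2 = [assign_var("r1", "y"), assign_var("r2", "r1"), assign_var("r1", "x")]

outcomes = set()
for trace in interleave(p1, p2):
    state = {"x": 0, "y": 0, "r1": 0, "r2": 0}
    for step in trace:
        step(state)
    outcomes.add((state["r1"], state["r2"]))
```

The final state r1 = 0 and r2 = 1 does not appear among these outcomes, so observing it requires the second process to perform its load of x ahead of its load of y.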
While we rule out shadow registers from consideration, we could straight-
forwardly extend our framework to allow them. This would require distin-
guishing them as a special variable type with tailored instruction types and
liberal reordering relation. This change would be unlikely to affect the results
of any other litmus test, since reusing registers does not typically occur; nor
would the change be likely to affect the majority of high-level code since,
as discussed above, local variable names are typically not reused. However,
these changes would violate the principle of sequential consistency for the
local thread (4). Hence we consider shadow registers to be a special type of
variable with a special semantics designed for low-level use, and therefore be-
lieve that their behaviour does not invalidate the principles of our reordering
model.
