Abstract. We present a class of relaxed memory models, defined in Coq, parameterised by the chosen permitted local reorderings of reads and writes, and the visibility of inter- and intra-processor communications through memory (e.g. store atomicity relaxation). We prove results on the required behaviour and placement of memory fences to restore a given model (such as Sequential Consistency) from a weaker one. Based on this class of models we develop a tool, diy, that systematically and automatically generates and runs litmus tests to determine properties of processor implementations. We detail the results of our experiments on Power and the model we base on them. This work identified a rare implementation error in Power 5 memory barriers (for which IBM is providing a workaround); our results also suggest that Power 6 does not suffer from this problem.
Introduction
Most multiprocessors exhibit subtle relaxed-memory behaviour, with writes from one thread not immediately visible to all others; they do not provide sequentially consistent (SC) memory [17]. For some, such as x86 [22, 20] and Power [21], the vendor documentation is in inevitably ambiguous informal prose, leading to confusion. Thus we have no foundation for software verification of concurrent systems code, and no target specification for hardware verification of microarchitecture. To remedy this state of affairs, we take a firmly empirical approach, developing, in tandem, testing tools and models of multiprocessor behaviour: the test results guide model development and the modelling suggests interesting tests. In this paper we make five new contributions:
1. We introduce a class of memory models, defined in Coq [8], which we show how to instantiate to produce SC, TSO [24], and a Power model (3 below).
2. We describe our diy testing tool. Much discussion of memory models has been in terms of litmus tests (e.g. iriw [9]): ad-hoc multiprocessor programs for which particular final states may be allowed on a given architecture. Given a violation of SC, diy systematically and automatically generates litmus tests (including classical ones such as iriw) and runs them on the hardware.
3. We model important aspects of Power processors' behaviour, i.e. ordering relaxations, the lack of store atomicity [3, 7], and A-cumulative barriers [21].
4. We use diy to generate about 800 tests, running them up to 1e12 times on 3 Power machines. Our experimental results confirm that our model captures many important aspects of the processor's behaviour, despite being in a simple global-time style rather than the per-processor timelines of the architecture text. They also identified a rarely occurring implementation error in Power 5 memory barriers (for which IBM is providing a workaround). They further suggest that Power 6 does not suffer from this.
5. We prove in Coq theorems about the strength and placement of memory barriers required to regain a strong model from a weaker model.

Fig. 1(a). A litmus test. Init: x=0; y=0. P0: (a) x ← 1; (b) r1 ← y. P1: (c) y ← 1; (d) r2 ← x. Observed? r1=0; r2=0.
The experimental details and the sources and documentation of diy are available online 1 , as are the Coq development and typeset outlines of the proofs 2 .
Our class of models
A memory model determines whether a candidate execution of a program is valid. For example, Fig. 1(a) shows a simple litmus test, comprising an initial state (which gathers the initial values of registers and memory locations used in the test), a program in pseudo- or assembly code, and a final condition on registers and memory (we write x, y for memory locations and r1, r2 for registers). If each location initially holds 0 (henceforth we omit the initial state if so), then, e.g. on x86 processors, there are valid executions with the specified final state [20]. Rather than dealing directly with programs, our models are in terms of the events E occurring in a candidate program execution. A memory event m represents a memory access, specified by its direction (write or read), its location loc(m), its value val(m), its processor proc(m), and a unique label. The store to x with value 1 marked (a) in Fig. 1(a) generates the event (a) Wx1 in Fig. 1(b). Henceforth, we write r (resp. w) for a read (resp. write) event. We write M_{ℓ,v} (resp. R_{ℓ,v}, W_{ℓ,v}) for the set of memory events (resp. reads, writes) to a location ℓ with value v (we omit ℓ and v when quantifying over all of them). A barrier instruction generates a barrier event b; we write B for the set of all such events. The models are defined in terms of binary relations over these events, and Fig. 2 has a table of the relations we use.
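To make the role of the final condition concrete, the following minimal Python sketch enumerates every sequentially consistent interleaving of the test of Fig. 1(a) and collects the reachable final register states. The tuple encoding of instructions and the function names are illustrative assumptions, not diy's input format.

```python
# Hypothetical encoding of the Fig. 1(a) test: (label, kind, location, value or register).
P0 = [("a", "W", "x", 1), ("b", "R", "y", "r1")]
P1 = [("c", "W", "y", 1), ("d", "R", "x", "r2")]


def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that keeps each thread's program order."""
    if not t0:
        yield list(t1)
    elif not t1:
        yield list(t0)
    else:
        for rest in interleavings(t0[1:], t1):
            yield [t0[0]] + rest
        for rest in interleavings(t0, t1[1:]):
            yield [t1[0]] + rest


def run_sc(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}  # initial state of the test
    registers = {}
    for _label, kind, loc, arg in schedule:
        if kind == "W":
            memory[loc] = arg
        else:
            registers[arg] = memory[loc]
    return (registers["r1"], registers["r2"])


sc_outcomes = {run_sc(s) for s in interleavings(P0, P1)}
print(sorted(sc_outcomes))       # [(0, 1), (1, 0), (1, 1)]
print((0, 0) in sc_outcomes)     # False
```

The outcome r1=0; r2=0 is absent from the set, so observing it on hardware witnesses a non-SC execution.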
As usual, the program order po → is a total order amongst the events from the same processor that never relates events from different processors. It reflects the sequential execution of instructions on a single processor: given two instruction execution instances i_1 and i_2 that generate events e_1 and e_2, e_1 po → e_2 means that a sequential processor would execute i_1 before i_2. When instructions may perform several memory accesses, we take intra-instruction dependencies [22] into account to build a total order.
We postulate a dp → relation to model the dependencies between instructions, such as data or control dependencies [21, pp. 653-668] . This relation is a subrelation of po →, and always has a read as its source.
Execution witnesses
Although po → conveys important features of program execution, e.g. branch resolution, it does not characterise an execution. To do so, we postulate two relations rf → and ws → over memory events.
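Reads-from map. We write w rf → r to mean that r loads the value stored by w (so w and r must share the same location and value). Given a read r there exists a unique write w such that w rf → r (w can be an init store when r loads from the initial state). Thus, rf → must be well formed following the wf-rf predicate:

$$\textsf{wf-rf}(\xrightarrow{\textsf{rf}}) \triangleq \Big(\xrightarrow{\textsf{rf}} \subseteq \bigcup_{\ell,v} (W_{\ell,v} \times R_{\ell,v})\Big) \wedge \big(\forall r.\ \exists!\, w.\ w \xrightarrow{\textsf{rf}} r\big)$$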
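Write serialisation. We assume all values written to a given location ℓ to be serialised, following a coherence order. This property is widely assumed by modern architectures. We define ws → as the union of the coherence orders for all memory locations, which must be well formed following the wf-ws predicate:

$$\textsf{wf-ws}(\xrightarrow{\textsf{ws}}) \triangleq \Big(\xrightarrow{\textsf{ws}} \subseteq \bigcup_{\ell} (W_{\ell} \times W_{\ell})\Big) \wedge \big(\forall \ell.\ \textsf{total-order}(\xrightarrow{\textsf{ws}}, W_{\ell})\big)$$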
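From-read map. We define the following derived relation fr → [4], which gathers all pairs of reads r and writes w such that r reads from a write that is before w in ws →:

$$r \xrightarrow{\textsf{fr}} w \triangleq \exists w'.\ w' \xrightarrow{\textsf{rf}} r \wedge w' \xrightarrow{\textsf{ws}} w$$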
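We define an execution witness X as follows (the well-formedness predicate wf on execution witnesses is the conjunction of those for ws → and rf →):

$$X \triangleq (E, \xrightarrow{\textsf{po}}, \xrightarrow{\textsf{dp}}, \xrightarrow{\textsf{rf}}, \xrightarrow{\textsf{ws}})$$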
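As an illustration, the relations of this section can be sketched in Python for the execution of Fig. 1 that ends with r1=0; r2=0. The `Event` class, the explicit initial writes `ix` and `iy`, and the function names are assumptions made for this sketch; they are not taken from the Coq development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str
    direction: str  # "W" or "R"
    loc: str
    val: int
    proc: int  # -1 marks the (hypothetical) initial-state writes


ix = Event("ix", "W", "x", 0, -1)
iy = Event("iy", "W", "y", 0, -1)
a = Event("a", "W", "x", 1, 0)
b = Event("b", "R", "y", 0, 0)
c = Event("c", "W", "y", 1, 1)
d = Event("d", "R", "x", 0, 1)
events = {ix, iy, a, b, c, d}

rf = {(iy, b), (ix, d)}  # both reads load the initial value
ws = {(ix, a), (iy, c)}  # the init write precedes the store, per location


def wf_rf(rf, events):
    """Each read has exactly one rf source: a write to the same location and value."""
    if any(r.direction != "R" for (_, r) in rf):
        return False
    for r in (e for e in events if e.direction == "R"):
        sources = [w for (w, r2) in rf if r2 == r]
        if len(sources) != 1:
            return False
        w = sources[0]
        if w.direction != "W" or (w.loc, w.val) != (r.loc, r.val):
            return False
    return True


def wf_ws(ws, events):
    """ws relates writes to one location and orders every such pair exactly one way.
    Transitivity is not checked; with two writes per location it holds trivially."""
    for (w1, w2) in ws:
        if w1.direction != "W" or w2.direction != "W" or w1.loc != w2.loc:
            return False
    writes = [e for e in events if e.direction == "W"]
    for w1 in writes:
        for w2 in writes:
            if w1 != w2 and w1.loc == w2.loc:
                if ((w1, w2) in ws) == ((w2, w1) in ws):
                    return False
    return True


def from_read(rf, ws):
    """r fr w  iff  some w' satisfies  w' rf r  and  w' ws w."""
    return {(r, w) for (w1, r) in rf for (w2, w) in ws if w1 == w2}


fr = from_read(rf, ws)
print(sorted((r.label, w.label) for (r, w) in fr))  # [('b', 'c'), ('d', 'a')]
```

Here each read takes its value from the initial state, so it is fr-before the store to the same location issued by the other processor.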
Global Happens-Before
An execution witness is valid if the memory events can be embedded in an acyclic global happens-before relation ghb → (together with two auxiliary conditions detailed in Sec. 2.3). This order corresponds roughly to the vendor documentation concept of memory events being globally performed [21, 13] : a write in ghb → represents the point in global time when this write becomes visible to all processors; whereas a read in ghb → represents the point in global time when the read takes place. There remain key choices as to which relations we include in ghb → (i.e. which we consider to be in global time), which leads us to define a class of models.
Globality. Writes are not necessarily globally performed at once. Thus, rf → is not necessarily included in ghb →; we write grf → for the subrelation of rf → that is. By contrast, ws → and fr → are always included in ghb →. Indeed, the write serialisation for a given location ℓ is the order in which writes to ℓ are globally performed. Moreover, as r fr → w expresses that the write w′ from which r reads is globally performed before w, it forces the read r to be globally performed (since a read is globally performed as soon as it is performed) before w is globally performed.
Preserved program order. In any given architecture, certain pairs of events in the program order are guaranteed to occur in that order. We postulate a global relation ppo → gathering all such pairs. For example, the execution witness in Fig. 1(c) is only valid if the writes and reads to different locations on each processor have been reordered. Indeed, if these pairs were forced to be in program order, we would have a cycle in ghb →: (a) ppo → (b) fr → (c) ppo → (d) fr → (a).
Barrier constraints. Architectures also provide barrier instructions, e.g. the Power sync (discussed in Sec. 3), to enforce ordering between pairs of events. We postulate a global relation ab → gathering all such pairs.
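We define ghb → as the union of the global relations:

$$\xrightarrow{\textsf{ghb}} \triangleq \xrightarrow{\textsf{ws}} \cup \xrightarrow{\textsf{fr}} \cup \xrightarrow{\textsf{grf}} \cup \xrightarrow{\textsf{ppo}} \cup \xrightarrow{\textsf{ab}}$$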
Validity of an execution w.r.t. an architecture
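We now add two sanity conditions to the above. First, we require each processor to respect memory coherence for each location [11]. If a processor writes e.g. v to ℓ and then reads v′ from ℓ, v′ should not precede v in the write serialisation. We define the relation po-loc → over pairs of accesses to the same location in the program order, and require it to be compatible with rf →, ws → and fr →:

$$\textsf{uniproc}(X) \triangleq \textsf{acyclic}(\xrightarrow{\textsf{rf}} \cup \xrightarrow{\textsf{ws}} \cup \xrightarrow{\textsf{fr}} \cup \xrightarrow{\textsf{po-loc}})$$

Second, we rule out programs where values come out of thin air [19] (as in Fig. 3(b)):

$$\textsf{thin}(X) \triangleq \textsf{acyclic}(\xrightarrow{\textsf{rf}} \cup \xrightarrow{\textsf{dp}})$$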
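We define the validity of an execution w.r.t. an architecture A as the conjunction of three checks independent of the architecture, namely wf(X), uniproc(X) and thin(X), with a last one that characterises the architecture:

$$\textsf{valid}_A(X) \triangleq \textsf{wf}(X) \wedge \textsf{uniproc}(X) \wedge \textsf{thin}(X) \wedge \textsf{acyclic}(\textsf{ghb}_A(X))$$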
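The architecture-specific check can be sketched as an acyclicity test on a set of edges. The sketch below covers only that last check (wf, uniproc and thin are omitted), works on event labels, and leaves out the initial writes, so ws is empty. The TSO-like choice of ppo (dropping write-to-read pairs) is an assumption made for illustration.

```python
def acyclic(edges):
    """Return True if the directed graph given as (src, dst) pairs has no cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def visit(node):
        state[node] = "open"
        for nxt in graph[node]:
            if state.get(nxt) == "open":
                return False  # back edge: a cycle exists
            if nxt not in state and not visit(nxt):
                return False
        state[node] = "done"
        return True

    return all(visit(n) for n in graph if n not in state)


# Execution of Fig. 1 with final state r1=0; r2=0.
direction = {"a": "W", "b": "R", "c": "W", "d": "R"}
po = {("a", "b"), ("c", "d")}
fr = {("b", "c"), ("d", "a")}  # each read sees the initial value
ws = set()                      # one non-initial write per location
grf = set()                     # both reads load from the initial state
ab = set()                      # the test contains no barrier


def ghb(ppo):
    """Union of the global relations for a given choice of ppo."""
    return ws | fr | grf | ppo | ab


ppo_sc = po  # SC: every program-order pair is preserved
# Assumption: a TSO-like model drops write-to-read pairs from ppo.
ppo_tso = {(x, y) for (x, y) in po
           if not (direction[x] == "W" and direction[y] == "R")}

print(acyclic(ghb(ppo_sc)))   # False: a -> b -> c -> d -> a
print(acyclic(ghb(ppo_tso)))  # True
```

The same witness is thus rejected when all program-order pairs are preserved and accepted once the write-to-read pairs are relaxed.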
Comparing architectures via validity predicates
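From our definition of validity arises a simple notion of comparison among architectures. A_1 ≤ A_2 means that A_1 is weaker than A_2:

$$A_1 \leq A_2 \triangleq (\xrightarrow{\textsf{ppo}_1} \subseteq \xrightarrow{\textsf{ppo}_2}) \wedge (\xrightarrow{\textsf{grf}_1} \subseteq \xrightarrow{\textsf{grf}_2})$$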
The validity of an execution is decreasing w.r.t. the strength of the predicate; i.e. a weak architecture exhibits at least all the behaviours of a stronger one:
Programs running on an architecture A^ε_1 exhibit executions that would be valid on a stronger architecture A^ε_2; we characterise all such executions as follows:
These two theorems, though fairly simple, will be useful to compare two models and to restore a strong model from a weaker one, as in Sec. 3.
Examples
We propose here alternative formulations of SC [17] and Sparc's TSO [24] in our framework, which we proved equivalent to the original definitions. We omit proofs and the formal details for lack of space, but they can be found online 2 . We write po(X) (resp. rf(X), rfe(X)) for the function extracting the po → (resp. rf →, rfe →) relation from X. We define notations to extract pairs of memory events from the program order:
SC allows no reordering of events ( 
Semantics of barriers
We define the semantics and placement in the code that barriers should have to restore a stronger model from a weaker one. It is clearly enough to have w ab_1 → r whenever w grf_{2\1} → r holds to restore store atomicity, i.e. a barrier ensuring rf → is global. But then a processor holding such a barrier placed after r would wait until w is globally performed, then read again to ensure r is globally performed after w. We provide a less costly requirement: when w rf → r po → m, where r may not be globally performed after w is, inserting a barrier instruction between the instructions generating r and m only forces the processor generating r and m to delay m until w is globally performed.
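Formally, given A_1 ≤ A_2, we define the predicate fb (fully barriered) on executions X by

$$\textsf{fb}_{A_1,A_2}(X) \triangleq \Big(\xrightarrow{\textsf{ppo}_{2\setminus 1}} \cup\ \big(\xrightarrow{\textsf{grf}_{2\setminus 1}} ; \xrightarrow{\textsf{ppo}_2}\big)\Big) \subseteq \xrightarrow{\textsf{ab}_1}$$

where r_{2\1} → = r_2 → \ r_1 → is the set difference, and x (r_1 →; r_2 →) y holds when there exists z such that x r_1 → z and z r_2 → y. The fb predicate provides an insight into the strength that the barriers of the architecture A_1 should have to restore the stronger A_2. They should: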
1. restore the pairs that are preserved in the program order on A_2 and not on A_1, which is a static property;
2. compensate for the fact that some writes may not be globally performed at once on A_1 while they are on A_2, which we model by (some subrelation of) rf → not being global on A_1 while it is on A_2; this is a dynamic property.
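We can then prove that the above condition on ab_1 → is sufficient to regain A^ε_2 from A_1:

Theorem 1 (Barrier guarantee).

$$\forall A_1\, A_2,\ (A_1 \leq A_2) \Rightarrow \big(\forall X,\ (\textsf{valid}_{A_1}(X) \wedge \textsf{fb}_{A_1,A_2}(X)) \Rightarrow \textsf{valid}_{A^{\epsilon}_2}(X)\big)$$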
The static property of barriers is expressed by the condition ppo_{2\1} → ⊆ ab_1 →. A barrier provided by A_1 should ensure that the events generated by the same processor are globally performed in program order if they are on A_2. In this case, it is enough to insert a barrier between the instructions that generate these events.
The dynamic property of barriers is expressed by the condition (grf_{2\1} →; ppo_2 →) ⊆ ab_1 →. A barrier provided by A_1 should ensure store atomicity to the write events that have this property on A_2. This is how we interpret the cumulativity of barriers as stated by Power [21]: the A-cumulativity (resp. B-cumulativity) property applies to barriers that enforce ordering of pairs in rf →; po → (resp. po →; rf →).
We consider a barrier that only preserves pairs in po → to be non-cumulative. Thm. 1 states that, to restore A_2 from A_1, it suffices to insert an A-cumulative barrier between each pair of instructions such that the first one in the program order reads from a write which is to be globally performed on A_2 but is not on A_1.
Restoring SC
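We model an A-cumulative barrier as a function returning an ordering relation when given a placement of the barriers in the code:

$$\textsf{A-cumul}(X, \xrightarrow{\textsf{fenced}}) \triangleq \xrightarrow{\textsf{fenced}} \cup\ \big(\xrightarrow{\textsf{rf}} ; \xrightarrow{\textsf{fenced}}\big)$$

Thm. 1 shows that inserting such a barrier between all po → pairs restores SC: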
Corollary 1 (Barriers restoring SC ).
Consider e.g. the iriw test depicted in Fig. 4. The specified outcome may be the result of a non-SC execution on a weak architecture in the absence of barriers. Our A-cumulative barrier forbids this outcome, as shown in Fig. 4: if placed between each pair of reads on P0 and P1, it not only prevents their reordering, but also ensures that the write (e) on P2 (resp. (f) on P3) is globally performed before the second read (b) on P0 (resp. (d) on P1).
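The effect described above can be checked on a small encoding of iriw. The sketch below rests on several assumptions: events are labelled (a) to (f) with the reads on P0 and P1 and the writes on P2 and P3, rf → is taken to be non-global, and the A-cumulative ordering is modelled as the fenced pairs together with the composition of rf → with the fenced pairs.

```python
def compose(r1, r2):
    """Relational composition: x (r1; r2) y iff x r1 z and z r2 y for some z."""
    return {(x, y) for (x, z1) in r1 for (z2, y) in r2 if z1 == z2}


def has_cycle(edges):
    """Detect a cycle by repeatedly removing nodes without incoming edges."""
    edges = set(edges)
    nodes = {n for e in edges for n in e}
    while nodes:
        sources = {n for n in nodes if not any(dst == n for (_, dst) in edges)}
        if not sources:
            return True  # every remaining node has a predecessor
        nodes -= sources
        edges = {(s, d) for (s, d) in edges if s not in sources}
    return False


# iriw outcome: (a) reads x=1 then (b) reads y=0 on P0;
# (c) reads y=1 then (d) reads x=0 on P1; (e) writes x, (f) writes y.
rf = {("e", "a"), ("f", "c")}
fr = {("b", "f"), ("d", "e")}
fenced = {("a", "b"), ("c", "d")}  # a barrier between each pair of reads

ab_non_cumulative = fenced
ab_a_cumulative = fenced | compose(rf, fenced)  # adds (e, b) and (f, d)

# rf is assumed non-global, so it does not contribute to ghb directly.
print(has_cycle(fr | ab_non_cumulative))  # False: the outcome stays allowed
print(has_cycle(fr | ab_a_cumulative))    # True: e -> b -> f -> d -> e
```

Under these assumptions only the cumulative variant creates a cycle in the global order, which is what forbids the outcome.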
diy: a testing tool
We present our diy (do it yourself) tool, which computes litmus tests in x86 or Power assembly code by generating violations of SC, i.e. cycles in ws → ∪ rf → ∪ fr → ∪ po →.
Cycles as specifications of litmus tests
We call sequences of relations relaxations and give them a concrete syntax (see Fig. 7 and 8). Thus Rfe represents a rfe → arrow, Fre a fre → arrow, and DpdR a dp → (Dp) arrow targeting a read (R), with different (d) source and target locations. diy must be told which relaxations are considered global and which are not. Given a pool of global relaxations, a single non-global relaxation, and a size n (i.e. the number of relaxation arrows in the cycle, e.g. 6 for iriw), diy generates cycles up to size n that contain at least one occurrence of the non-global relaxation. If no non-global relaxation is specified, diy generates cycles up to size n containing the specified global relaxations. When the cycle generation is done, diy computes litmus tests from these cycles, as detailed in the following.
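A brute-force version of this enumeration can be sketched as follows. This is not diy's algorithm: the sketch only checks that adjacent relaxations agree on the direction of the event they share, whereas diy applies further constraints. The table of source and target directions and the function names are assumptions; the directions given for PodRR and PodRW are read off the naming convention described for DpdR.

```python
from itertools import product

# Hypothetical table: relaxation name -> (source direction, target direction).
RELAXATIONS = {
    "Rfe": ("W", "R"),
    "Fre": ("R", "W"),
    "DpdR": ("R", "R"),
    "PodRR": ("R", "R"),
    "PodRW": ("R", "W"),
}


def well_formed(cycle):
    """Adjacent relaxations must agree on the direction of their shared event."""
    return all(
        RELAXATIONS[cycle[i]][1] == RELAXATIONS[cycle[(i + 1) % len(cycle)]][0]
        for i in range(len(cycle))
    )


def canonical(cycle):
    """Pick one representative among the rotations of a cycle."""
    return min(cycle[i:] + cycle[:i] for i in range(len(cycle)))


def generate_cycles(global_pool, non_global, max_size):
    """Enumerate cycles of at most max_size relaxations; when non_global is
    given, keep only the cycles that contain it at least once."""
    pool = list(global_pool) + ([non_global] if non_global else [])
    found = set()
    for size in range(1, max_size + 1):
        for cycle in product(pool, repeat=size):
            if non_global and non_global not in cycle:
                continue
            if well_formed(cycle):
                found.add(canonical(cycle))
    return found


cycles = generate_cycles(["Fre", "DpdR"], "Rfe", 6)
iriw_with_deps = ("DpdR", "Fre", "Rfe", "DpdR", "Fre", "Rfe")
print(iriw_with_deps in cycles)  # True
```

Rotations of a cycle describe the same test, which is why the sketch keeps a single canonical rotation of each cycle.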
Code generation
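We show here how we generate a Power litmus test from a given cycle of relaxations, by way of an example. The complete algorithm for code generation is available online 2. We write _ for the information not yet set by diy: _ is an undetermined event, W_ a write with yet unset location and value, and Rx_ a read from x with undetermined value.

1. Consider e.g. the input cycle, issued by diy's cycle generation phase:

$$(a)\,\_ \xrightarrow{\textsf{Rfe}} (b)\,\_ \xrightarrow{\textsf{DpdR}} (c)\,\_ \xrightarrow{\textsf{Fre}} (d)\,\_ \xrightarrow{\textsf{Rfe}} (e)\,\_ \xrightarrow{\textsf{DpdR}} (f)\,\_ \xrightarrow{\textsf{Fre}} (a)$$

2. A linear scan sets the directions from the edges. Observe e.g. the last edge: Fre requires an R source and a W target:

$$(a)\,W\_ \xrightarrow{\textsf{Rfe}} (b)\,R\_ \xrightarrow{\textsf{DpdR}} (c)\,R\_ \xrightarrow{\textsf{Fre}} (d)\,W\_ \xrightarrow{\textsf{Rfe}} (e)\,R\_ \xrightarrow{\textsf{DpdR}} (f)\,R\_ \xrightarrow{\textsf{Fre}} (a)$$

3. We pick an event e which is the target of a relaxation specifying a location change. If there is none, generation fails. Otherwise, a linear scan starting from e sets the locations. At the end of the scan, if e and its predecessor have the same location (e.g. _ Rfe → e PodRW → _), generation fails. As DpdR specifies a location change (i.e. we pick (c)), we rewrite the cycle as:

$$(c)\,R\_ \xrightarrow{\textsf{Fre}} (d)\,W\_ \xrightarrow{\textsf{Rfe}} (e)\,R\_ \xrightarrow{\textsf{DpdR}} (f)\,R\_ \xrightarrow{\textsf{Fre}} (a)\,W\_ \xrightarrow{\textsf{Rfe}} (b)\,R\_ \xrightarrow{\textsf{DpdR}} (c)$$

We set the locations starting from (c), changing location between (e) and (f):

$$(c)\,Rx\_ \xrightarrow{\textsf{Fre}} (d)\,Wx\_ \xrightarrow{\textsf{Rfe}} (e)\,Rx\_ \xrightarrow{\textsf{DpdR}} (f)\,Ry\_ \xrightarrow{\textsf{Fre}} (a)\,Wy\_ \xrightarrow{\textsf{Rfe}} (b)\,Ry\_ \xrightarrow{\textsf{DpdR}} (c)$$

4. We cut the input cycle into maximal sequences of events with the same location (i.e. (c)(d)(e) and (f)(a)(b)), each being scanned w.r.t. the cycle order: the first write in each sequence is given value 1, the second one 2, etc. The values then reflect the write serialisation order for the specified location:

$$(c)\,Rx\_ \xrightarrow{\textsf{Fre}} (d)\,Wx1 \xrightarrow{\textsf{Rfe}} (e)\,Rx\_ \xrightarrow{\textsf{DpdR}} (f)\,Ry\_ \xrightarrow{\textsf{Fre}} (a)\,Wy1 \xrightarrow{\textsf{Rfe}} (b)\,Ry\_ \xrightarrow{\textsf{DpdR}} (c)$$

5. Significant reads are the sources of fr → and the targets of rf → edges. We associate them with the write on the other side of the edge. In the rf → case, the value of the read is the one of its associated write. In the fr → case, the value of the read is the value of the predecessor of its associated write in ws →, i.e. by construction the value of its associated write minus 1. Non-significant reads do not appear in the test condition. All the reads are significant here:

$$(c)\,Rx0 \xrightarrow{\textsf{Fre}} (d)\,Wx1 \xrightarrow{\textsf{Rfe}} (e)\,Rx1 \xrightarrow{\textsf{DpdR}} (f)\,Ry0 \xrightarrow{\textsf{Fre}} (a)\,Wy1 \xrightarrow{\textsf{Rfe}} (b)\,Ry1 \xrightarrow{\textsf{DpdR}} (c)$$

6. We generate the litmus test given in Fig. 5 for Power. We add e.g. a xor instruction between the instructions associated with the events (b) and (c) to implement the dependency required by the DpdR relation between them.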
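Steps 2 to 5 can be sketched in Python as below. This is a simplified illustration, not the complete algorithm mentioned above: the relaxation table, the function name `annotate`, the choice of a fresh location at every location change, and the naming of the first location as x are all assumptions.

```python
# Hypothetical table: name -> (source dir, target dir, changes location, relation).
RELAX = {
    "Rfe": ("W", "R", False, "rf"),
    "Fre": ("R", "W", False, "fr"),
    "DpdR": ("R", "R", True, "dp"),
}


def annotate(cycle, labels="abcdefgh", locations="xyzw"):
    """Return the events of the cycle, in scan order, as strings such as '(d)Wx1'."""
    n = len(cycle)
    edge = [RELAX[name] for name in cycle]  # edge[i] goes from event i to event i+1

    # Step 2: event i is the source of edge i and the target of edge i-1.
    dirs = [edge[i][0] for i in range(n)]
    if any(edge[i - 1][1] != dirs[i] for i in range(n)):
        raise ValueError("adjacent relaxations disagree on a direction")

    # Step 3: start at an event targeted by a location-changing relaxation.
    starts = [i for i in range(n) if edge[i - 1][2]]
    if not starts:
        raise ValueError("no relaxation changes location")
    order = [(starts[0] + k) % n for k in range(n)]
    locs, current = {}, 0
    for k, i in enumerate(order):
        if k > 0 and edge[i - 1][2]:
            current += 1  # assumption: a fresh location at every change
        locs[i] = locations[current]
    if locs[order[-1]] == locs[order[0]]:
        raise ValueError("start event and its predecessor share a location")

    # Step 4: number the writes to each location along the scan.
    vals, counter = {}, {}
    for i in order:
        if dirs[i] == "W":
            counter[locs[i]] = counter.get(locs[i], 0) + 1
            vals[i] = counter[locs[i]]

    # Step 5: significant reads take their value from the associated write.
    for i in order:
        if dirs[i] == "R":
            if edge[i - 1][3] == "rf":
                vals[i] = vals[(i - 1) % n]
            elif edge[i][3] == "fr":
                vals[i] = vals[(i + 1) % n] - 1
            else:
                vals[i] = "_"  # non-significant read

    return [f"({labels[i]}){dirs[i]}{locs[i]}{vals[i]}" for i in order]


print(annotate(["Rfe", "DpdR", "Fre", "Rfe", "DpdR", "Fre"]))
# ['(c)Rx0', '(d)Wx1', '(e)Rx1', '(f)Ry0', '(a)Wy1', '(b)Ry1']
```

The sketch stops before step 6: emitting Power assembly, including the xor that implements the dependency, is left out.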
The test in Fig. 5 is in fact a Power implementation of iriw [9] with dependencies. diy indeed recovers many classical tests, such as rwc [9] (see also Fig. 8).
Case study: the Power architecture
We now instantiate the formalism of Sec. 2 for Power by adding register events to reflect register accesses [22], and commit events to express branching decisions. C is the set of commits, and c is an element of C. We handle three barrier instructions: isync, sync and lwsync. We distinguish the corresponding events by the eponymous predicates, e.g. is-isync. An execution witness includes an additional intra-instruction causality relation iico →.

Running iriw with data dependencies (Fig. 5) on Power reveals that rfe → is not global either. This is the main particularity of the Power architecture.
Barriers. We define the barrier relations in Fig. 6(b). lwsync acts as sync except on store-load pairs, in both the base and cumulativity cases.

We assumed a set of relations to be global and tested this assumption by computing safe tests, whose input cycles only include relaxations we suppose global, e.g. Fre. We then ran the tests that, according to our model, should exhibit relaxations. These tests are given in Fig. 7 (where M stands for million). We observed each of them on at least one machine, which is consistent with our model. For each relaxation observed on a given machine, we write the highest number of outcomes. When a relaxation is not observed, we write the total number of outcomes: thus we write e.g. 0/16725M for PodRR on vargas.
For each machine, we observed the number of runs required to exhibit the least frequent relaxation (e.g. 32 million for Rfe on vargas), and ran the safe tests at least 20 times this number. The outcomes of the safe tests have not been observed on vargas and squale, which increases our confidence in the safe set we assumed. Yet, hpcx exhibits non-SC behaviours for some A-cumulativity tests, including classical ones [9] like iriw with sync instructions on P0 and P1 (see Fig. 8). We understand that this is due to an erratum in the Power 5 implementation. IBM is providing a workaround, replacing the sync barrier by a short code sequence [Personal Communication], and our testing suggests this does regain SC behaviour for the examples in question (e.g. with 0/4e10 non-SC results for iriw). We understand also that the erratum should not be observable for conventional lock-based code and that Power 6 is not subject to it; the latter is consistent with our testing on vargas.

Figure note. Notation: rℓ (wℓ) is a read (write) event with location ℓ.
Related Work
Formal memory models roughly fall into two classes: operational models and axiomatic models. Operational models, e.g. [25, 15] , are abstractions of actual machines composed of idealised hardware components such as queues. They can be appealingly intuitive and offer a relatively direct path to simulation, at least in principle. Axiomatic models focus on segregating allowed and forbidden behaviours, usually by constraining various order relations on memory accesses; they are particularly well adapted for model exploration, as we do here. Several of the more formal vendor specifications have been in this style [5, 24, 16] .
One generic axiomatic model related to ours is Nemos [26]. This covers a broad range of models including Itanium as the most substantial example. Itanium is rather different to Power; we do not know whether our framework could handle such a model or whether a satisfactory Power model could be expressed in Nemos. By contrast, our framework owes much to the concept of relaxation, informally presented in [3]. As regards tools, Nemos calculates the behaviour of example programs w.r.t. a model, but offers no support for generating or running tests on actual hardware.
Previous work on model-building based on experimental testing includes that of Collier [12] and Adir et al. [2, 1] . The former is based on hand-coded test programs and Collier's model, in which the cumulativity of the Power barriers does not seem to fit naturally. The latter developed an axiomatic model for a version of Power before cumulative barriers [1] ; their testing [2] aims to produce interesting collisions (accesses to related locations) with knowledge of the microarchitecture, using an architecture model as an oracle to determine the legal results of tests rather than (as we do) generating interesting tests from the memory model.
Conclusion
We present here a general class of axiomatic memory models, extending smoothly from SC to a highly relaxed model for Power processors. We model their relaxation of store atomicity without requiring multiple write events per store [16], or a view order per processor [12, 1, 21, 6]. Our principal validity condition is simple, just an acyclicity check of the global happens-before relation. This check is already known for SC [18], and recent verification tools use it for architectures with store buffer relaxation [14, 10]. Our Power model captures key aspects of the behaviour of cumulative barriers, though we do not regard it as definitive: on the one hand there are known tests for which the model is too weak w.r.t. our perception of the architect's intent (particularly involving the lightweight barrier lwsync); on the other hand, given that we rely heavily on black-box testing, it is hard to establish confidence that there are no tests that would invalidate our model. Despite that, our automatic test generation based on the model succeeds in generating interesting tests, revealing a rare Power 5 implementation erratum for barriers in lock-free code. This is a significant advance over reliance on hand-crafted litmus tests.
