Proving the correctness of programs written for multiple processors is a challenging problem, due in no small part to the weaker memory guarantees afforded by most modern architectures. In particular, the existence of store buffers means that the programmer can no longer assume that writes to different locations become visible to all processors in the same order. However, all practical architectures do provide a collection of weaker guarantees about memory consistency across processors, which enable the programmer to write provably correct programs in spite of a lack of full sequential consistency. In this work, we present a mechanization in the ACL2 theorem prover of an axiomatic weak memory model (introduced by Alglave et al. [2]). In the process, we provide a new proof of an established theorem involving these axioms.
Introduction
Analysis of sequential programs is a well-understood problem for which a variety of proof techniques and methodologies exist [5]. Many of these techniques can be adapted to a multiprocessor setting if we assume sequential consistency (SC), i.e., that for any concurrent execution of the program, there exists an interleaving of the memory events that is consistent with both the program order and the communication dependencies between processes [6, 8]. However, sequential consistency turns out to be a much stronger requirement than is practically necessary. Moreover, due to the inherently high runtime and resource penalties of SC, designers of multiprocessor architectures are motivated to relax this constraint in order to achieve better performance.
To understand why a lack of sequential consistency impacts us as programmers, consider the following example. Suppose our architecture consists of a number of processors P1, ..., Pn and a shared memory M. Assume that when a processor issues a write to memory, that write is immediately visible to all other processors.

Consider the program execution represented in Figure 1. Each processor assigns the value 1 to memory location x or y, and reads the value at the other location into a register. (Assume x and y are both initially equal to 0.) Now, we ask the question: what are the possible values of registers r0 and r1 after running this program? It is easy to see that r0 = 1, r1 = 1 is one possible final state, obtained by a scheduler that alternates between P0 and P1. We can also obtain r0 = 0, r1 = 1 by running P0's program to the end, and then subsequently running P1's program to the end. Likewise, it is also possible to obtain r0 = 1, r1 = 0. These are the only possible final states, because this (sketch of an) architecture is sequentially consistent; every processor completely executes its first instruction before continuing to the second.

Now, consider the following modification of this architecture. Each of the processors Pi is equipped with a store buffer Bi. When Pi issues a write, instead of propagating the write directly to shared memory, the write is initially sent to buffer Bi. That write will eventually hit memory, although we have no guarantee of when that will happen (unless the programmer inserts an explicit memory fence). If Pi wishes to read a value from memory, it first checks its own store buffer to see if it has issued any pending writes to that memory location. If it has, it uses that value; otherwise, it obtains the value from memory.

If we run the same program on this architecture, it is easy to see that the final state r0 = 0, r1 = 0 is obtainable if neither processor's store buffer is flushed before the reads are performed; both processors issue a write, but those writes are not globally visible by the time each processor issues its read, and hence both processors read the "old" values of x and y. This is a clear violation of sequential consistency. There is no way to linearly order the instructions of the two programs as atomic memory events and obtain this final state; nevertheless, this behavior is possible on this architecture. This odd behavior isn't merely a theoretical possibility; it is actually observable on x86 machines.
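The two architectures sketched above can be explored with a small brute-force enumeration. The following Python sketch is illustrative only and is not part of the ACL2 development; it assumes that the Figure 1 program is P0: x := 1; r0 := y and P1: y := 1; r1 := x (the figure itself is not reproduced here), and the names WRITES, READS, and outcomes are hypothetical.

```python
from itertools import permutations

# Assumed shape of the Figure 1 program (the figure is not reproduced here):
#   P0: x := 1; r0 := y        P1: y := 1; r1 := x
WRITES = {0: "x", 1: "y"}   # location each processor writes
READS = {0: "y", 1: "x"}    # location each processor reads

def outcomes(buffered):
    """Return the set of reachable (r0, r1) pairs by trying every schedule."""
    steps = [(p, kind) for p in (0, 1) for kind in ("write", "read")]
    if buffered:
        # With store buffers, each write needs a later, separate flush step.
        steps += [(p, "flush") for p in (0, 1)]
    results = set()
    for order in permutations(steps):
        pos = {step: i for i, step in enumerate(order)}
        # Program order: each processor writes before it reads.
        if any(pos[(p, "write")] > pos[(p, "read")] for p in (0, 1)):
            continue
        # A buffer can only be flushed after the write has entered it.
        if buffered and any(pos[(p, "write")] > pos[(p, "flush")] for p in (0, 1)):
            continue
        memory = {"x": 0, "y": 0}
        buffers = {0: {}, 1: {}}
        regs = {}
        for p, kind in order:
            if kind == "write":
                target = buffers[p] if buffered else memory
                target[WRITES[p]] = 1
            elif kind == "flush":
                memory.update(buffers[p])
                buffers[p].clear()
            else:
                # A read consults the processor's own buffer first, then memory.
                loc = READS[p]
                regs[p] = buffers[p].get(loc, memory[loc])
        results.add((regs[0], regs[1]))
    return results

print(sorted(outcomes(buffered=False)))  # [(0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(buffered=True)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Under these assumptions the unbuffered machine never produces (0, 0), while the buffered one does, matching the discussion above.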
In spite of the fact that we do not generally have sequential consistency, most weaker memory models do uphold a set of guarantees which, though they are not as strong as sequential consistency, do prohibit certain behaviors. These guarantees vary greatly from model to model [3, 4, 7, 9, 10] , and the variety and abundance of these models suggests the need for a more generic framework for weak memory. Such a framework ought to be both general enough to capture the semantics of all modern architectures, and strong enough to enforce meaningful constraints that are universally upheld. One such framework is introduced in Alglave et al. [2] , and in this paper we present its mechanization in ACL2. Furthermore, we present a new proof of an established theorem about this framework, and we discuss the mechanized proof.
A brief notational remark: throughout this paper, given a relation R, we will let $R^+$ denote the irreflexive transitive closure of R. Given two relations R and Q, we let R; Q denote the sequencing of R and Q, i.e.
$$x \xrightarrow{R;Q} y \iff \exists z.\; x \xrightarrow{R} z \,\wedge\, z \xrightarrow{Q} y.$$
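For finite relations, both operations are easy to compute directly. The following Python sketch represents a relation as a set of pairs; the function names seq and transitive_closure are our own and are used only for illustration.

```python
def seq(r, q):
    """Sequencing R;Q: x is related to y if x R z and z Q y for some z."""
    return {(x, y) for (x, z1) in r for (z2, y) in q if z1 == z2}

def transitive_closure(r):
    """Smallest transitive relation containing r (no reflexive pairs are added)."""
    closure = set(r)
    while True:
        new_pairs = seq(closure, closure) - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

R = {("a", "b"), ("b", "c")}
Q = {("c", "d")}
print(sorted(seq(R, Q)))              # [('b', 'd')]
print(sorted(transitive_closure(R)))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```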
Background: An Axiomatic Framework for Weak Memory
The execution of a sequential program results in a linear sequence of events (usually reads or writes from/to a location in memory). The event order derived from this sequence is called the program order. The program order is a total order on all events, and from this order we can reason in a straightforward way about the possible final states that can result from a run of the program by considering all possible event orderings and demonstrating that they all produce a final state in a particular configuration.
With concurrent programs, however, the situation is more complicated. Generally speaking, an execution on a concurrent machine is not simply a sequence of events with a global program order. Events that occur on different processors are not necessarily comparable, because a write issued by one processor may not be visible to any other processors for some time (despite being immediately visible to the process that executed it). Therefore, in order to specify a set of requirements for our weaker memory guarantees, we need a weakened definition of a program execution that retains enough structure to be amenable to subsequent constraints and analyses. In this section, we describe a compelling axiomatic framework for weak memory [2] , which includes both a more general notion of execution for multiple processors and a parameterized set of requirements that is meant to characterize all modern multiprocessor architectures.
Concurrent Executions
We begin with two definitions.
Definition. An event e is an object which consists of a unique identifier id(e), a process proc(e), a type type(e) which identifies e as being either a read or a write, an address addr(e) equal to the address in memory that e reads from or writes to, and a value val(e) equal to the value read or written by e.
Definition. An execution is a tuple E = (E, po,co,rf) where E is a collection of events, and po, co, and rf are all relations on E satisfying:
• po is a total order on events, when restricted to a single process
• co is a total order on writes, when restricted to a single address
• rf is a relation from writes to reads such that for all reads r ∈ E, there exists a unique write w ∈ E such that $w \xrightarrow{rf} r$ (we also require that val(w) = val(r)).
The relation po is undefined on events belonging to different processes, and likewise, co is undefined on any pair of events that are not writes to the same address.
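To make the two definitions concrete, the following Python sketch checks the three conditions on a finite candidate execution. It is only an illustration under stated assumptions: the Event class, the helper is_strict_total_order, and the function well_formed are hypothetical names, and we additionally assume that rf only relates events at the same address, which the definition above leaves implicit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    id: int
    proc: int
    type: str   # "R" for a read, "W" for a write
    addr: str
    val: int

def is_strict_total_order(rel, items):
    """True if rel relates only items and is irreflexive, transitive and total."""
    items = set(items)
    if any(a not in items or b not in items or a == b for a, b in rel):
        return False
    if any((a, d) not in rel for (a, b) in rel for (c, d) in rel if b == c):
        return False
    return all(a == b or (a, b) in rel or (b, a) in rel for a in items for b in items)

def well_formed(events, po, co, rf):
    """Check the three conditions in the definition of an execution."""
    writes = [e for e in events if e.type == "W"]
    reads = [e for e in events if e.type == "R"]
    # po: undefined across processes, total within each process.
    po_ok = all(a.proc == b.proc for a, b in po) and all(
        is_strict_total_order({(a, b) for a, b in po if a.proc == p},
                              [e for e in events if e.proc == p])
        for p in {e.proc for e in events})
    # co: undefined across addresses, total on the writes to each address.
    co_ok = all(a.addr == b.addr for a, b in co) and all(
        is_strict_total_order({(a, b) for a, b in co if a.addr == m},
                              [w for w in writes if w.addr == m])
        for m in {e.addr for e in events})
    # rf: from writes to reads, matching values (and, by assumption, addresses),
    # with exactly one source write per read.
    rf_ok = all(w.type == "W" and r.type == "R" and w.addr == r.addr and w.val == r.val
                for w, r in rf) and all(
        sum(1 for _, r2 in rf if r2 == r) == 1 for r in reads)
    return po_ok and co_ok and rf_ok

w = Event(id=0, proc=0, type="W", addr="x", val=1)
r = Event(id=1, proc=1, type="R", addr="x", val=1)
print(well_formed([w, r], po=set(), co=set(), rf={(w, r)}))  # True
print(well_formed([w, r], po=set(), co=set(), rf=set()))     # False: r has no source
```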
The relation po is our concurrent version of program order; it is a total order not on all events, but only on those belonging to the same processor. The "coherence order" co is a total order on writes to the same location in memory. This order corresponds to our intuition that the writes to each individual location hit memory in a particular sequential order. The read-from relation rf captures the dependency between writes and reads; $w \xrightarrow{rf} r$ means "r takes its value from the write w."¹ It is a surjective relation with a one-sided inverse function, $rf^{-1}$.
The purpose of co and rf is to capture interprocess dependencies between events occurring at the same location; co captures dependencies between two writes arising from their relative visibility with respect to time, and rf captures the dependency of reads on the writes they take their value from. However, it is also intuitively possible to have a write "depend" on a read. If w, w′ are writes and r is a read such that $w' \xrightarrow{rf} r$ and $w' \xrightarrow{co} w$, then there is a sense in which w "comes after" r, because r takes its value from an earlier write. Therefore, we have another relation, which we refer to as the "from-read" relation.
Definition. Let E = (E, po,co,rf) be an execution. The "from-read" relation fr is defined as
$$r \xrightarrow{fr} w \iff \exists w'.\; w' \xrightarrow{rf} r \,\wedge\, w' \xrightarrow{co} w.$$

¹ The reader may be wondering why we choose to write $w \xrightarrow{rf} r$ rather than $r \xrightarrow{rf} w$; the latter certainly seems more sensible when read aloud ("r read-from w"). The reason is that the direction of the arrow is meant to represent a dependency between two events, with the arrow pointing toward the dependent ("later") event. This will enable us to state our weak memory requirements as assertions of the acyclicity of various combinations of these and other relations.

[Figure 2: (a) per-process view of an execution; (b) per-location view for a location M0.]

Our three relations rf, co, and fr will be sufficient to specify certain communication dependencies regarding reads and writes to the same location. We abbreviate the three into a single relation.
Definition. Let E = (E, po,co,rf) be an execution. The relation com is defined as
$$com = co \cup rf \cup fr.$$
The po and com relations represent two distinct types of dependencies between events; po captures per-process dependencies, and com captures per-location dependencies. The existence of these two relations suggests two distinct views of our event graph. The first is the per-process view, where we organize all the events by the process they belong to, and list them in program order (see Figure 2a). The second is the per-location view, where we organize the events by the memory location at which they occur, and list each write event in coherence order (see Figure 2b for an example of what this might look like for a particular location M0).
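As an illustration of how fr and com are derived from co and rf, consider the following Python sketch on a single location. The event names w1, w2, and r and the function names from_read and communication are hypothetical; relations are finite sets of pairs.

```python
def from_read(rf, co):
    """fr relates a read r to every write that is co-after the write r reads from."""
    return {(r, w) for (src, r) in rf for (earlier, w) in co if src == earlier}

def communication(rf, co):
    """com is the union of co, rf and fr."""
    return set(co) | set(rf) | from_read(rf, co)

# Two writes to one location, with w1 before w2 in coherence order,
# and a read r that takes its value from w1.
co = {("w1", "w2")}
rf = {("w1", "r")}

print(from_read(rf, co))              # {('r', 'w2')}
print(sorted(communication(rf, co)))  # [('r', 'w2'), ('w1', 'r'), ('w1', 'w2')]
```

Here r reads from w1, and w2 overwrites w1, so r must "come before" w2; this is exactly the pair contributed by fr.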
Sequential Consistency and SC-Per-Location
In the previous section, we presented a generalization of the notion of a sequential execution to an arbitrary number of processors. Whereas a sequential execution has a single relation, the program order (which is a total order on all events), a concurrent execution consists of two: its per-process program order po, and the communication dependency relation com. In our framework, the usual definition of sequential consistency [6] is that there exists a completion of the relation po∪ com which is a total order on all events. An equivalent way to state this is that the relation po∪ com is acyclic, and so we have the following definition:
Definition. An execution E = (E, po,co,rf) is sequentially consistent (SC) if acyclic(po∪ com),
i.e. the union of the po and com relations is acyclic.
As we have already discussed, sequential consistency does not hold in general for modern multiprocessor architectures. However, if we restrict the program order po to events at the same location, then we get a new, weaker property. As it happens, this property holds for all modern architectures.
To this end, we define another relation, pol, which is the restriction of po to events that occur at the same location.
Definition. Let E = (E, po,co,rf) be an execution. The relation pol is defined as
$$x \xrightarrow{pol} y \iff x \xrightarrow{po} y \,\wedge\, addr(x) = addr(y).$$

We are now in a position to reproduce the definition for a weakened version of sequential consistency for concurrent executions (originally given in [2]), which we refer to as sequential consistency per location.
Definition. An execution E = (E, po,co,rf) is sequentially consistent per location (SC-Per-Location) if acyclic(pol ∪ com),
i.e. the union of the pol and com relations is acyclic.
The intuition behind this definition is that if we restrict ourselves to examining one memory location, the system appears to be sequentially consistent. The acyclicity of program order and the communication relations co, rf and fr guarantees the existence of a sequential execution of these events that produces the same behavior (for this memory location) as the concurrent one. However, this cannot necessarily be generalized to multiple memory locations; the sequential ordering of events for one location may conflict (i.e. create a cycle) with the sequential ordering for another location.
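The difference between the two properties can be seen on the store-buffering outcome from the introduction, in which both reads return 0. The Python sketch below is illustrative: the event names are hypothetical, the initial writes ix and iy are modeled as explicit events that we assume are unrelated by po, and acyclicity is checked by looking for a reflexive pair in the transitive closure.

```python
def transitive_closure(rel):
    closure = set(rel)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

def acyclic(rel):
    """A finite relation is acyclic iff its transitive closure is irreflexive."""
    return all(a != b for a, b in transitive_closure(rel))

# Events: ix, iy are the initial writes of 0 to x and y (assumed po-unrelated);
# P0 performs a (write x) then b (read y); P1 performs c (write y) then d (read x).
ADDR = {"ix": "x", "a": "x", "d": "x", "iy": "y", "b": "y", "c": "y"}
po = {("a", "b"), ("c", "d")}
co = {("ix", "a"), ("iy", "c")}
rf = {("iy", "b"), ("ix", "d")}   # both reads see the initial value 0

fr = {(r, w) for (src, r) in rf for (earlier, w) in co if src == earlier}
com = co | rf | fr
pol = {(x, y) for (x, y) in po if ADDR[x] == ADDR[y]}

print(acyclic(po | com))   # False: a -> b -> c -> d -> a is a cycle, so not SC
print(acyclic(pol | com))  # True: SC-Per-Location still holds
```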
The full set of requirements
SC-Per-Location is one of the four requirements of this framework. It is the only requirement described solely in terms of executions; the other three are defined in terms of a particular architecture. This requires a formal definition of an architecture.
Definition. An architecture is a function A which maps executions E = (E, po,co,rf) to tuples (ppo, fence,prop) such that for all executions E,
• ppo is some subset of po
• fence is some relation on events
• prop is some relation on the writes of E (not necessarily to the same location)
Here, the relation ppo ("preserved program order") refers to some subset of the program order that relates events which aren't allowed to be reordered in an execution, fence refers to pairs of events which are separated by a fence, and prop ("propagation order") refers to additional constraints (beyond those specified by co) on the order in which events get propagated to memory.
This definition formulates the notion of an architecture as a set of further restrictions on executions.
Depending on how we define the orders ppo, fence, and prop on an execution, our model will satisfy different memory constraints, because our constraints are defined in terms of these relations. The set of all possible architectures that can be specified from this framework corresponds to all the different ways we can define these relations in terms of a given execution.
The full set of weak memory requirements is as follows. Let A be an architecture. Then for any execution E = (E, po,co,rf), we require
• SC-Per-Location: acyclic(pol ∪ com)
• No Thin Air: acyclic(hb)
• Observation: irreflexive(fre; prop; hb*)
• Propagation: acyclic(co ∪ prop)
where hb and fre are auxiliary relations defined in [2]. SC-Per-Location was described above; the other three requirements are discussed thoroughly in [2], and are best understood in the context of the various examples provided in that work. We present the full framework here for completeness, but our investigation into these properties was limited to SC-Per-Location.
SC-Per-Location: an alternate definition
The definition we have for SC-Per-Location makes intuitive sense; it corresponds directly to the classic definition of sequential consistency. However, this definition is equivalent to a seemingly weaker property (originally introduced in [1]), which we reproduce below.
Definition. An execution E = (E, po,co,rf) satisfies the property SC-Per-Location-2 if
$$\neg \exists x, y \in E.\; x \xrightarrow{pol} y \,\wedge\, y \xrightarrow{com^+} x,$$
i.e. no two events are related by pol in one direction and com+ in the other direction.
This alternate definition captures the intuition that if an event precedes another event in program order, it cannot have a communication dependency (or a sequence of dependencies) on the latter event.
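Clearly, the existence of such a dependency would create a cycle in pol ∪ com, and so it is easy to see that SC-Per-Location implies SC-Per-Location-2. In fact, SC-Per-Location-2 is equivalent to the definition given in Section 2.2; this was first proved in Alglave [1], and we give a new proof of this result in the next section.

The com+ relation can be written as the union of the five relations rf, co, fr, co;rf, and fr;rf. We state this as a theorem, and provide a sketch of the proof.

Theorem 2.1. Let E = (E, po,co,rf) be an execution. Then
$$com^+ = rf \cup co \cup fr \cup (co;rf) \cup (fr;rf).$$

Proof (sketch). Each of the five relations is clearly contained in com+. For the other direction, suppose x and y are connected by a path in com with k intermediate events; we proceed by induction on k. If k = 0, then $x \xrightarrow{com} y$, and we are done. Now, suppose k ≥ 1 and assume inductively that the theorem holds for all shorter paths. We have $x \xrightarrow{com} p_1$ for some event $p_1$ that, by the inductive hypothesis, is related to y by one of the five relations. A case analysis on the two relations involved, using the transitivity of co, the uniqueness of the write that each read takes its value from, and the types (read or write) of the events, shows that x is again related to y by one of the five relations.

From this theorem, we can clearly see that an execution satisfies SC-Per-Location-2 if and only if it does not contain any of the patterns in Figure 3. We will ultimately prove that SC-Per-Location is equivalent to SC-Per-Location-2, which guarantees that a cycle of any kind in pol ∪ com, no matter how big the cycle is, will imply the existence of a "mini"-cycle of one of these five variants.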
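Theorem 2.1 and the SC-Per-Location-2 condition can be checked by hand on a small example. The Python sketch below is purely illustrative and proves nothing in general; it uses a hypothetical single-location execution with three writes (w1, w2, w3 in coherence order) and two reads (r1 reading from w1, r2 reading from w3), and the function name violations is our own.

```python
def seq(r, q):
    return {(x, y) for (x, z1) in r for (z2, y) in q if z1 == z2}

def transitive_closure(rel):
    closure = set(rel)
    while True:
        new_pairs = seq(closure, closure) - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

co = {("w1", "w2"), ("w2", "w3"), ("w1", "w3")}
rf = {("w1", "r1"), ("w3", "r2")}
fr = {(r, w) for (src, r) in rf for (earlier, w) in co if src == earlier}
com = co | rf | fr

com_plus = transitive_closure(com)
five_relations = rf | co | fr | seq(co, rf) | seq(fr, rf)
print(com_plus == five_relations)  # True, as Theorem 2.1 predicts

def violations(pol):
    """Pairs related by pol one way and by com+ the other way."""
    return {(x, y) for (x, y) in pol if (y, x) in com_plus}

print(violations({("w1", "r1")}))  # set(): reading one's own earlier write is fine
print(violations({("r1", "w1")}))  # {('r1', 'w1')}: r1 reads from a po-later write
```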
An equivalence theorem
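Before we state and prove the equivalence theorem (originally proved in [1], but proved here in a somewhat more straightforward manner), we first establish two simple lemmas.

Lemma 2.2. The relation com+ is irreflexive.

Proof. Suppose $x \xrightarrow{com^+} x$. By Theorem 2.1, we have three cases.

Case 1: x is related to itself by co, rf, or fr. This is impossible; co is irreflexive by definition (it is an irreflexive total order), and rf and fr are both trivially irreflexive because they only relate events of different types.

Case 2: $x \xrightarrow{co;rf} x$. This is impossible; co;rf relates writes to reads, and hence is irreflexive.

Case 3: $x \xrightarrow{fr;rf} x$. Then $x \xrightarrow{fr} w \xrightarrow{rf} x$ for some write w. Since the write that x reads from is unique, $w = rf^{-1}(x)$, and so $x \xrightarrow{fr} w$ implies $w \xrightarrow{co} w$, contradicting the irreflexivity of co.

Lemma 2.3. Let x and y be events at the same address. Then one of the following holds: $x \xrightarrow{com^+} y$; $y \xrightarrow{com^+} x$; x and y are writes with x = y; or x and y are reads with $rf^{-1}(x) = rf^{-1}(y)$.

Proof. We have four cases, corresponding to x and y each being either reads or writes; however, the symmetry of the read-write cases reduces the number to three. In all three cases, the theorem reduces to the totality of co.

Case 1: x is a write, y is a write. Then by totality of co, either $x \xrightarrow{co} y$, x = y, or $y \xrightarrow{co} x$.

Case 2: x is a read, y is a write. Then by totality of co, either $rf^{-1}(x) \xrightarrow{co} y$, $rf^{-1}(x) = y$, or $y \xrightarrow{co} rf^{-1}(x)$. These give $x \xrightarrow{fr} y$, $y \xrightarrow{rf} x$, and $y \xrightarrow{co;rf} x$, respectively.

Case 3: x is a read, y is a read. Then by totality of co, either
$$rf^{-1}(x) \xrightarrow{co} rf^{-1}(y), \quad rf^{-1}(x) = rf^{-1}(y), \quad\text{or}\quad rf^{-1}(y) \xrightarrow{co} rf^{-1}(x).$$
In the first case, $x \xrightarrow{fr;rf} y$; in the second, we are done; and in the third, $y \xrightarrow{fr;rf} x$.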
Theorem 2.4. Let E be an execution. Then E satisfies SC-Per-Location if and only if E satisfies SC-Per-Location-2.
Proof. It is clear that SC-Per-Location implies SC-Per-Location-2. We prove the other direction by contrapositive. Suppose SC-Per-Location does not hold; that is, there exists a cycle in pol∪ com. Clearly any such cycle is also a cycle in pol ∪ com + (since com ⊆ com + ).
We proceed by induction on the length of this cycle, noting trivially that the length cannot be 1 (because we know that pol and com are both irreflexive).
If the cycle has length two, we must either have
$$x \xrightarrow{pol} p \xrightarrow{com^+} x \quad\text{or}\quad x \xrightarrow{com^+} p \xrightarrow{pol} x,$$
because both of these relations are by themselves acyclic. In either case, the SC-Per-Location-2 condition is clearly violated by x and p.

Suppose the cycle has length three or more, i.e.
$$x \to p_1 \to p_2 \to \cdots \to x,$$
where → abbreviates the union of pol and com+. Also, inductively assume that the existence of a shorter cycle implies that SC-Per-Location-2 does not hold. Assume that
$$x \xrightarrow{pol} p_1 \xrightarrow{com^+} p_2 \quad\text{or}\quad x \xrightarrow{com^+} p_1 \xrightarrow{pol} p_2,$$
because otherwise it is clear by transitivity of com+ and pol that we can obtain a shorter cycle $x \to p_2 \to \cdots \to x$, and so by our inductive hypothesis SC-Per-Location-2 doesn't hold. Then we have several cases, based on Lemma 2.3 applied to x and $p_2$; in each case we obtain either a shorter cycle or a direct violation of SC-Per-Location-2.

We believe this proof is new. Its direct use of an inductive argument and a "totality" lemma (Lemma 2.3) for com+ both distinguishes it from the original [1], and makes its mechanization in ACL2 much easier. One of ACL2's big strengths is its ability to prove theorems inductively, and by understanding an inductive hand proof of this theorem, we were able to make the ACL2 proof much more straightforward.
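As an informal sanity check of Theorem 2.4 (it is in no way a substitute for the proof), one can compare the two properties on many small randomly generated executions. The Python sketch below is an assumption-laden illustration: all events are at a single address, so pol coincides with po; event 0 is an initial write on its own process so that every read has a possible source; values are not modeled; and the names random_execution and check are hypothetical.

```python
import random

def transitive_closure(rel):
    closure = set(rel)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

def random_execution(rng, n_events=5):
    """Random single-address execution; returns (pol, co, rf).

    Events are (id, proc, kind) triples with kind "R" or "W".
    """
    events = [(0, -1, "W")]
    events += [(i, rng.randrange(2), rng.choice("RW")) for i in range(1, n_events + 1)]
    writes = [e for e in events if e[2] == "W"]
    reads = [e for e in events if e[2] == "R"]
    order = writes[:]
    rng.shuffle(order)  # an arbitrary total coherence order on the writes
    co = {(order[i], order[j]) for i in range(len(order)) for j in range(i + 1, len(order))}
    rf = {(rng.choice(writes), r) for r in reads}  # exactly one source per read
    # Program order follows event ids within each process.
    pol = {(a, b) for a in events for b in events if a[1] == b[1] and a[0] < b[0]}
    return pol, co, rf

def check(pol, co, rf):
    """Return (SC-Per-Location holds, SC-Per-Location-2 holds)."""
    fr = {(r, w) for (src, r) in rf for (earlier, w) in co if src == earlier}
    com = co | rf | fr
    sc_per_location = all(a != b for a, b in transitive_closure(pol | com))
    com_plus = transitive_closure(com)
    sc_per_location_2 = not any((y, x) in com_plus for (x, y) in pol)
    return sc_per_location, sc_per_location_2

rng = random.Random(0)
results = [check(*random_execution(rng)) for _ in range(500)]
print(all(first == second for first, second in results))  # True
```

The two properties agree on every sampled execution, as the theorem requires.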
ACL2 Mechanization
In this section we present our ACL2 mechanization of the framework and proofs presented above. We make extensive use of the defun-sk construct; our definitions of the relations po, co, rf, and fr, as well as various combinations of these relations, are introduced with defun-sk in order to make the concepts as general as possible; instead of defining them in terms of a specific data structure (like a graph), we define them as completely general relations which satisfy only the properties we require.
For clarity, we have chosen to present the ACL2 mechanization in a separate section from the preceding one. We have also opted to reproduce most of the definitions, theorems, and even a few key lemmas in order to give the reader a fuller understanding of how these ideas were mechanized. The interested reader might gain some insight from reading the ACL2 code carefully, but is encouraged to skim through it if necessary.
Mechanization of Concurrent Executions
We formalize the concepts of events, po, co, and rf as constrained functions that satisfy the requirements given in the previous section. 
constraints omitted )
The required properties of these functions are guaranteed by a number of exported theorems, such as totality of po on events in the same process, totality of co on writes to the same location, and the one-sided invertibility of rf (this last property implicitly makes use of rf's inverse function rf-inv-fn).
We define the function fr in terms of co and rf using ACL2's defun-sk construct:
(defun-sk fr (x z) (exists y (and (rf y x) (co y z))))
We define the ACL2 analogues of sequenced relations co;rf and fr;rf similarly:
(defun-sk co->rf (x z)
  (exists y (and (co x y) (rf y z))))

(defun-sk fr->rf (x z)
  (exists y (and (fr x y) (rf y z))))
We define the functions com and pol as expected:
(defun com (x y)
  (or (co x y) (rf x y) (fr x y)))

(defun pol (x y)
  (and (po x y)
       (equal (addr x) (addr y))))
The transitive closure of com is defined in terms of the existence of a path:
(defun com-pathp (path x y)
  (cond ((endp path) (com x y))
        (t (and (com x (car path))
                (com-pathp (cdr path) (car path) y)))))

(defun-sk com+ (x y)
  (exists path (com-pathp path x y)))
The variable path represents the elements between (and not including) x and y. We prove that we can rewrite com+ according to Theorem 2.1:
(defthm rewrite-com+ (equal (com+ x y) (or (com x y) (co->rf x y) (fr->rf x y))))
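For readers less familiar with ACL2, the path-based definition of com+ can be mirrored on a finite relation. The following Python sketch is an informal analogue, not a translation that ACL2 would accept; com is represented as a set of pairs, the event names are hypothetical, and path lists the intermediate events strictly between x and y, as in com-pathp.

```python
def com_pathp(com, path, x, y):
    """True if x reaches y through com, visiting exactly the events in path."""
    if not path:
        return (x, y) in com
    return (x, path[0]) in com and com_pathp(com, path[1:], path[0], y)

# w1 is read by r, and r is fr-before w2 (together with the co edge w1 -> w2).
com = {("w1", "w2"), ("w1", "r"), ("r", "w2")}

print(com_pathp(com, [], "w1", "w2"))     # True: a direct com edge
print(com_pathp(com, ["r"], "w1", "w2"))  # True: w1 -> r -> w2
print(com_pathp(com, ["r"], "w2", "w1"))  # False: no such path
```

In ACL2 the existential quantifier over path is supplied by defun-sk; in this finite setting one would instead search over candidate paths or compute a transitive closure.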
We prove that com+ is irreflexive, corresponding to Lemma 2.2:
(x x) (potential-cycle (list y (fr->rf-witness y x))))))))
Finally, we state the fully quantified version of the theorem, which ACL2 proves immediately:
(defthm sc-per-location-1-implies-2
  (implies (sc-per-location-1) (sc-per-location-2)))
Mechanization of the equivalence proof, Part 2
The proof that (sc-per-location-2) implies (sc-per-location-1) was broken down into three steps:
1. Prove that any 2-cycle in pol-com+ violates sc-per-location-2, and that if there is a cycle of length 3 or greater in pol-com+, where pol-com+ is the union of pol and com+, then there is a smaller cycle in pol-com+.
2. Use the theorems in step 1 to define a function, collapse-cycle, which takes a cycle in pol-com+ and produces a pair (x y) such that (pol x y) and (com+ y x).
3. Combine steps 1 and 2 to show that if we have a cycle in pol-com (i.e. a violation of sc-per-location-1), we have a pair (x y) which violates sc-per-location-2.
Step 1 is summarized by two theorems, one that states that 2-cycles in pol-com+ violate sc-per-location-2, and one that takes cycles longer than 2 and produces a smaller cycle.
;; A 2-cycle through x and (car cycle) pairs pol in one direction
;; with com+ in the other.
(defthm cycle-2
  (implies (and (pol-com+-cyclep cycle x)
                (endp (cdr cycle))
                (not (and (pol x (car cycle))
                          (com+ (car cycle) x))))
           (and (pol (car cycle) x)
                (com+ x (car cycle)))))

;; Contrapositive form: if none of the listed smaller cycles exists,
;; then the longer cycle does not exist either.
(defthm collapse-cycle-thm
  (implies (and (not (pol-com+-cyclep (list p1) x))
                (not (pol-com+-cyclep (list* p2 rst) x))
                (not (pol-com+-cyclep rst x))
                (not (pol-com+-cyclep (list p2) p1)))
           (not (pol-com+-cyclep (list* p1 p2 rst) x)))
  :hints (("Goal"
           :cases ((com+ x p2)
                   (and (writep x) (writep p2) (equal x p2))
                   (and (readp x) (readp p2)
                        (equal (rf-inv-fn x) (rf-inv-fn p2)))
                   (com+ p2 x)))))
Notice that the case split corresponds exactly to Lemma 2.3, just as in the written proof.
For Step 2, we define the function collapse-cycle to shorten the cycle according to the previous theorem. The collapse-cycle function satisfies the property that if it is given a violation of sc-per-location-1, it produces a violation of sc-per-location-2:
(defun collapse-cycle (cycle x)
The concepts of No Thin Air, Observation, and Propagation were defined as follows:
(defun-sk no-thin-air ()
  (forall (x potential-cycle)
          (not (hb-cyclep potential-cycle x))))

(defun-sk observation ()
  (forall x (not (fre->prop->hb* x x))))

(defun-sk propagation ()
  (forall (x cycle)
          (not (co-prop-cyclep cycle x))))
We did not investigate these requirements to the extent that we analyzed SC-Per-Location. We reproduce their definitions here for completeness.
Conclusions
In this work, we presented an ACL2 mechanization of a generic framework for weak memory, as well as a novel proof of an established result for this framework. We hope to incorporate this framework into our ongoing research into how a theorem prover like ACL2 can be used to verify correctness properties of real-world concurrent programs. Our most immediate future work consists of applying these concepts (actually, a simplification of these concepts) to proofs on a multi-processor x86 model, but this work suggests the possibility of applying a general weak memory framework to other models as well.

This work was supported by NSF. We gratefully acknowledge the many helpful comments and discussions provided by Jade Alglave and Matt Kaufmann.
