Abstract: The memory model for RISC-V, a newly developed open-source ISA, has not been finalized yet and thus offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models, WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except overtaking of loads by a store. We show that this restriction has little impact on performance and that it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising from shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. We give the operational definitions of both models using Instantaneous Instruction Execution (I²E), which has been used in the definitions of SC and TSO. We also show how both models can be implemented using conventional cache-coherent memory systems and out-of-order processors, and that they encompass the behaviors of most known optimizations.
I. INTRODUCTION
A memory model for an ISA is the specification of all legal multithreaded program behaviors. If microarchitectural changes conform to the memory model, software remains compatible. Leaving the meanings of corner cases to be implementation dependent makes the task of proving the correctness of multithreaded programs, microarchitectures and cache protocols untenable. While strong memory models like SC and SPARC/Intel-TSO are well understood, weak memory models of commercial ISAs like ARM and POWER are driven too much by microarchitectural details and are inadequately documented by manufacturers. For example, the memory model in the POWER ISA manual [11] is "defined" as reorderings of events, where an event refers to performing an instruction with respect to a processor. While reorderings capture some properties of memory models, they do not specify the result of each load, which is the most important information for understanding program behaviors. This forces researchers to formalize these weak memory models by empirically determining the allowed/disallowed behaviors of commercial processors and then constructing models to fit these observations [7]-[10], [12]-[16].
The newly designed open-source RISC-V ISA [17] offers a unique opportunity to reverse this trend by giving a clear definition with understandable implications for implementations. The RISC-V ISA manual only states that its memory model is weak in the sense that it allows a variety of instruction reorderings [18] . However, so far no detailed definition has been provided, and the memory model is not fixed yet.
In this paper we propose two weak memory models for RISC-V: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. The two models differ in store atomicity, which is often classified into three types [19]: single-copy atomic, multi-copy atomic, and non-atomic.

Figure 1. Summary of different memory models

... model are those that can result by running the program on the abstract machine. We observe a growing interest in operational definitions: memory models of x86, ARM and POWER have all been formalized operationally [2], [7], [8], [15], [25], and researchers are even seeking operational definitions for high-level languages like C++ [26]. This is perhaps because all possible program results can be derived from operational definitions mechanically, while axiomatic definitions require guessing the whole program execution at the beginning. For complex programs with dependencies, loops and conditional branches, guessing the whole execution may become prohibitive.

Unfortunately, the operational models of ARM and POWER are too complicated because their abstract machines involve microarchitectural details like reorder buffers (ROBs), partial and speculative instruction execution, instruction replay on speculation failure, etc. The operational definitions of WMM and WMM-S are much simpler because they are described in terms of Instantaneous Instruction Execution (I²E), which is the style used in the operational definitions of SC [1] and TSO [2], [25]. An I²E abstract machine consists of n atomic processors and an n-ported atomic memory. The atomic processor executes instructions instantaneously and in order, so it always has the up-to-date architectural (register) state. The atomic memory executes loads and stores instantaneously.
Instruction reorderings and store atomicity/non-atomicity are captured by including different types of buffers between the processors and the atomic memory, like the store buffer in the definition of TSO. In the background, data moves between these buffers and the memory asynchronously, e.g., to drain a value from a store buffer to the memory.
I²E definitions free programmers from reasoning about partially executed instructions, which is unavoidable for ARM and POWER operational definitions. One key tradeoff to achieve I²E is to forbid a store to overtake a load, i.e., to disallow Ld-St reordering. Allowing such reordering requires each processor in the abstract machine to maintain multiple unexecuted instructions in order to see the effects of future stores, so the abstract machine has to contain complicated ROB-like structures. Ld-St reordering also complicates axiomatic definitions because it creates the possibility of "out-of-thin-air" behaviors [27], which are impossible in any real implementation and consequently must be disallowed. We also offer evidence, based on simulation experiments, that disallowing Ld-St reordering has no discernible impact on performance.
For a quick comparison, we summarize the properties of common memory models in Figure 1. SC and TSO have simple definitions but forbid Ld-Ld and St-St reorderings, and consequently are not candidates for RISC-V. WMM is similar to RMO and Alpha, but neither of those has an operational definition. WMM also has a simple axiomatic definition, while Alpha requires a complicated axiom to forbid out-of-thin-air behaviors (see Section V-B), and RMO has an incorrect axiom about data-dependency ordering (see Section X).
ARM, POWER, and WMM-S are similar models in the sense that they all admit non-atomic stores. While the operational models of ARM and POWER are complicated, WMM-S has a simpler I²E definition and allows competitive implementations (see Section IX-B). The axiomatic models of ARM and POWER are also complicated: four relations in the POWER axiomatic model [10, Section 6] are defined in a fixed-point manner, i.e., their definitions mutually depend on each other.
Release Consistency (RC) is often conflated with the concept of "SC for data-race-free (DRF) programs" [28]. It should be noted that "SC for DRF" is inadequate for an ISA memory model, which must specify the behaviors of all programs. The original RC definition [6] attempts to specify all program behaviors, and is more complex and subtle than the "SC for DRF" concept. We show in Section X that the RC definition fails a litmus test for non-atomic stores and forbids shared write-through caches in implementations.
This paper makes the following contributions:
1) WMM, the first weak memory model that is defined in I²E and allows Ld-Ld reordering, and its axiomatic definition;
2) WMM-S, an extension of WMM that admits non-atomic stores and has an I²E definition;
3) WMM and WMM-S implementations based on OOO processors that admit all uniprocessor speculative techniques (such as load-value prediction) without additional checks;
4) Introduction of invalidation buffers in the I²E definitional framework to model Ld-Ld and other reorderings.
Paper organization: Section II presents the related work. Section III gives litmus tests for distinguishing memory models. Section IV introduces I²E. Section V defines WMM. Section VI shows the WMM implementation using OOO processors. Section VII evaluates the performance of WMM and the influence of forbidding Ld-St reordering. Section VIII defines WMM-S. Section IX presents the WMM-S implementations with non-atomic stores. Section X shows the problems of RC and RMO. Section XI offers the conclusion.
II. RELATED WORK
SC [1] is the simplest model, but naive implementations of SC suffer from poor performance. Although researchers have proposed aggressive techniques to preserve SC [29]-[38], they are rarely adopted in commercial processors, perhaps due to their hardware complexity. Instead, manufacturers and researchers have chosen to present weaker memory models, e.g., TSO [2], [3], [25], [39], PSO [4], RMO [4], Alpha [5], Processor Consistency [40], Weak Consistency [41], RC [6], CRF [42], Instruction Reordering + Store Atomicity [43], POWER [11] and ARM [44]. The tutorials by Adve et al. [45] and by Maranget et al. [46] provide relationships among some of these models.
A large amount of research has also been devoted to specifying the memory models of high-level languages: C++ [26] , [47] - [50] , Java [51] - [53] , etc. We will provide compilation schemes from C++ to WMM and WMM-S.
Recently, Lustig et al. have used Memory Ordering Specification Tables (MOSTs) to describe memory models, and proposed a hardware scheme to dynamically convert programs across memory models described in MOSTs [19] . MOST specifies the ordering strength (e.g., locally ordered, multi-copy atomic) of two instructions from the same processor under different conditions (e.g., data dependency, control dependency). Our work is orthogonal in that we propose new memory models with operational definitions.
III. MEMORY MODEL LITMUS TESTS
Here we offer two sets of litmus tests to highlight the differences between memory models regarding store atomicity and instruction reorderings, including enforcement of dependency ordering. All memory locations are initialized to 0.
A. Store Atomicity Litmus Tests
Figure 2 shows four litmus tests to distinguish between these three types of stores. We have deliberately added data dependencies and Ld-Ld fences (FENCE_LL) to these litmus tests to prevent instruction reordering, e.g., the data dependency between I2 and I3 in Figure 2a. Thus the resulting behaviors can arise only because of different store atomicity properties. We use FENCE_LL for memory models that can reorder data-dependent loads, e.g., I5 in Figure 2b would be the MB fence for Alpha. For other memory models that order data-dependent loads (e.g., ARM), FENCE_LL could be replaced by a data dependency (like the data dependency between I2 and I3 in Figure 2a). The Ld-Ld fences only stop Ld-Ld reordering; they do not affect store atomicity in these tests.
Figure 2. Store atomicity litmus tests (program listings omitted; only the surviving instructions and outcomes are listed):
(a) SBE: test for multi-copy atomic stores. Last instruction of P2: I6: r4 = Ld (a + r3 − 1). SC forbids but TSO allows: r1 = 1, r2 = 0, r3 = 1, r4 = 0.
(b) WRC: test for non-atomic stores [7]. Last instructions of P3: I5: FENCE_LL; I6: r3 = Ld a. TSO, RMO and Alpha forbid, but RC, ARM and POWER allow: r1 = 2, r2 = 1, r3 = 0.
(c) WWC: test for non-atomic stores [46], [54]. Surviving instruction: I5: St a r2. TSO, RMO, Alpha and RC forbid, but ARM and POWER allow.
(d) IRIW (four processors, P1 to P4).
SBE: In a machine with single-copy atomic stores (e.g., an SC machine), when both I2 and I5 have returned value 1, stores I1 and I4 must have been globally advertised. Thus r2 and r4 cannot both be 0. However, a machine with store buffers (e.g., a TSO machine) allows P1 to forward the value of I1 to I2 locally without advertising I1 to other processors, violating the single-copy atomicity of stores.
WRC: Assuming the store buffer is private to each processor (i.e., multi-copy atomic stores), if one observes r1 = 2 and r2 = 1 then r3 must be 2. However, if an architecture allows a store buffer to be shared by P1 and P2 but not P3, then P2 can see the value of I1 from the shared store buffer before I1 has updated the memory, allowing P3 to still see the old value of a. A write-through cache shared by P1 and P2 but not P3 can cause this non-atomic store behavior in a similar way, e.g., I1 updates the shared write-through cache but has not invalidated the copy in the private cache of P3 before I6 is executed.
WWC: This litmus test is similar to WRC but replaces the load in I6 with a store. The behavior is possible if P1 and P2 share a write-through cache or store buffer. However, RC forbids this behavior (see Section X).
IRIW: This behavior is possible if P1 and P2 share a write-through cache or a store buffer and so do P3 and P4.
B. Instruction Reordering Litmus Tests
Although processors fetch and commit instructions in order, speculative and out-of-order execution causes behaviors as if instructions were reordered. Figure 3 shows the litmus tests on these reordering behaviors.
(a) SB: test for St-Ld reordering [46]

(b) MP: test for Ld-Ld and St-St reorderings [7]
Proc. P1          Proc. P2
I1: St a 1        I3: r1 = Ld b
I2: St b 1        I4: r2 = Ld a
TSO forbids, but Alpha and RMO allow: r1 = 1, r2 = 0

(c) LB: test for Ld-St reordering [7]
Proc. P1          Proc. P2
I1: r1 = Ld b     I3: r2 = Ld a
I2: St a 1        I4: St b 1
TSO forbids, but Alpha, RMO, RC, POWER and ARM allow: r1 = r2 = 1

Figure 3. Litmus tests for instruction reorderings

SB: A TSO machine can execute I2 and I4 while I1 and I3 are buffered in the store buffers. The resulting behavior is as if the store and the load were reordered on each processor.
MP: In an Alpha machine, I1 and I2 may be drained from the store buffer of P1 out of order; I3 and I4 in the ROB of P2 may be executed out of order. This is as if P1 reordered the two stores and P2 reordered the two loads.
LB: Some machines may enter a store into the memory before all older instructions have been committed. This results in the Ld-St reordering shown in Figure 3c. Since instructions are committed in order and stores are usually not on the critical path, the benefit of the eager execution of stores is limited. In fact, we will show by simulation that Ld-St reordering does not improve performance (Section VII).
MP+Ctrl: This test is a variant of MP. The two stores in P1 must update memory in order due to the fence. Although the execution of I6 is conditional on the result of I4, P2 can issue I6 speculatively by predicting branch I5 to be not taken. The execution order I6, I1, I2, I3, I4, I5 results in r1 = 1 and r2 = 0.
MP+Mem: This test replaces the control dependency in MP+Ctrl with a (potential) memory dependency, i.e., the unresolved store address of I5 may be the same as the load address of I6 before I4 is executed. However, P2 can execute I6 speculatively by predicting that the addresses are not the same. This results in I6 overtaking I4 and I5.
MP+Data: This test replaces the control dependency in MP+Ctrl with a data dependency, i.e., the load address of I5 depends on the result of I4. A processor with load-value prediction [20]-[24] may guess the result of I4 before executing it, and issue I5 speculatively. If the guess fails to match the real execution result of I4, then I5 would be killed. But if the guess is right, then essentially the execution of the two data-dependent loads (I4 and I5) has been reordered.
C. Miscellaneous Tests
All programmers expect memory models to obey per-location SC [55], i.e., all accesses to a single address appear to execute in a sequential order which is consistent with the program order of each thread (Figure 4).
IV. DEFINING MEMORY MODELS IN I²E
Figure 6 shows the I²E abstract machines for SC, TSO/PSO and WMM models. All abstract machines consist of n atomic processors and an n-ported atomic memory m. Each processor contains a register state s, which represents all architectural registers, including both the general purpose registers and special purpose registers, such as PC. The abstract machines for TSO/PSO and WMM also contain a store buffer sb for each processor, and the one for WMM also contains an invalidation buffer ib for each processor as shown in the figure. In the abstract machines all buffers are unbounded. The operations of these buffers will be explained shortly.
The operations of the SC abstract machine are the simplest: in one step we can select any processor to execute the next instruction on that processor atomically. That is, if the instruction is a non-memory instruction (e.g., ALU or branch), it just modifies the register states of the processor; if it is a load, it reads from the atomic memory instantaneously and updates the register state; and if it is a store, it updates the atomic memory instantaneously and increments the PC.
A. TSO Model
The TSO abstract machine proposed in [2], [25] (Figure 6b) contains a store buffer sb for each processor. Just like SC, any processor can execute an instruction atomically, and if the instruction is a non-memory instruction, it just modifies the local register state. A store is executed by inserting its ⟨address, value⟩ pair into the local sb instead of writing the data in memory. A load first looks for the load address in the local sb and returns the value of the youngest store for that address. If the address is not in the local sb, then the load returns the value from the atomic memory. TSO can also perform a background operation, which removes the oldest store from an sb and writes it into the atomic memory. As we discussed in Section III, the store buffer allows TSO to do St-Ld reordering, i.e., pass the SB litmus test (Figure 3a).
In order to enforce ordering in accessing the memory and to rule out non-SC behaviors, TSO has a fence instruction, which we refer to as Commit. When a processor executes a Commit fence, it gets blocked unless its sb is empty. Eventually, any sb will become empty as a consequence of the background operations that move data from the sb to the memory. For example, we need to insert a Commit fence after each store in Figure 3a to forbid the non-SC behavior in TSO.
We summarize the operations of the TSO abstract machine in Figure 7 . Each operation consists of a predicate and an action. The operation can be performed by taking the action only when the predicate is true. Each time we perform only one operation (either instruction execution or sb dequeue) atomically in the whole system (e.g., no two processors can execute instructions simultaneously). The choice of which operation to perform is nondeterministic. Enabling St-St reordering: We can extend TSO to PSO by changing the background operation to dequeue the oldest store for any address in sb (see the PSO-DeqSb operation in Figure 7 ). This extends TSO by permitting St-St reordering.
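As an illustration of the operations described above, the following Python sketch exhaustively explores a small TSO abstract machine (and, with pso=True, the PSO variant of the background dequeue). It is a sketch under stated assumptions rather than a reference definition: the tuple encoding of instructions, the function name tso_outcomes, and the example programs are introduced here for illustration, and the SB program is assumed to be the usual store-buffering test (on each processor, a store followed by a load of the other address).

```python
def tso_outcomes(programs, pso=False):
    """Explore every execution of the sketch machine and return the set of
    final register assignments, each a sorted tuple of (register, value)."""
    n = len(programs)

    def put(t, i, x):
        # Copy of tuple t with slot i replaced by x.
        return t[:i] + (x,) + t[i + 1:]

    start = ((0,) * n, ((),) * n, (), ())  # pcs, store buffers, memory, registers
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, sbs, mem, regs = stack.pop()
        m, succs = dict(mem), []
        for i in range(n):
            sb = sbs[i]  # (addr, value) pairs, oldest first
            # Background dequeue: TSO removes the oldest entry; PSO may remove
            # the oldest entry for any address.
            for k, (a, v) in enumerate(sb):
                if k == 0 or (pso and all(a2 != a for a2, _ in sb[:k])):
                    m2 = dict(m)
                    m2[a] = v
                    succs.append((pcs, put(sbs, i, sb[:k] + sb[k + 1:]),
                                  tuple(sorted(m2.items())), regs))
            if pcs[i] == len(programs[i]):
                continue
            ins = programs[i][pcs[i]]
            npcs = put(pcs, i, pcs[i] + 1)
            if ins[0] == "St":      # insert (addr, value) into the local sb
                _, a, v = ins
                succs.append((npcs, put(sbs, i, sb + ((a, v),)), mem, regs))
            elif ins[0] == "Ld":    # youngest matching sb entry, else memory
                _, r, a = ins
                hits = [v for a2, v in sb if a2 == a]
                new_regs = dict(regs)
                new_regs[r] = hits[-1] if hits else m.get(a, 0)
                succs.append((npcs, sbs, mem, tuple(sorted(new_regs.items()))))
            elif ins[0] == "Commit" and not sb:  # blocked until sb is empty
                succs.append((npcs, sbs, mem, regs))
        if not succs:               # all programs finished and all sbs drained
            finals.add(regs)
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals


# Assumed SB program, the same program with a Commit after each store, and MP.
SB = [[("St", "a", 1), ("Ld", "r1", "b")],
      [("St", "b", 1), ("Ld", "r2", "a")]]
SB_FENCED = [[("St", "a", 1), ("Commit",), ("Ld", "r1", "b")],
             [("St", "b", 1), ("Commit",), ("Ld", "r2", "a")]]
MP = [[("St", "a", 1), ("St", "b", 1)],
      [("Ld", "r1", "b"), ("Ld", "r2", "a")]]
```

Under these assumptions, the outcome r1 = 0, r2 = 0 is reachable for SB but not for SB_FENCED, and the outcome r1 = 1, r2 = 0 for MP is reachable only with pso=True, which matches the reorderings attributed to TSO and PSO in the text.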
Figure 7. Operations of the TSO abstract machine (PSO-DeqSb is the PSO variant of the background dequeue):
TSO-Nm (non-memory execution)
Predicate: The next instruction of a processor is a non-memory instruction.
Action: Instruction is executed by local computation.
TSO-Ld (load execution)
Predicate: The next instruction of a processor is a load.
Action: Assume the load address is a. The load returns the value of the youngest store for a in sb if a is present in the sb of the processor; otherwise, the load returns m[a], i.e., the value of address a in the atomic memory.
TSO-St (store execution)
Predicate: The next instruction of a processor is a store.
Action: Assume the store address is a and the store value is v. The processor inserts the store ⟨a, v⟩ into its sb.
TSO-Com (Commit execution)
Predicate: The next instruction of a processor is a Commit and the sb of the processor is empty.
Action: The Commit fence is executed simply as a NOP.
TSO-DeqSb (background store buffer dequeue)
Predicate: The sb of a processor is not empty.
Action: Assume the ⟨address, value⟩ pair of the oldest store in the sb is ⟨a, v⟩. Then this store is removed from sb, and the atomic memory m[a] is updated to v.
PSO-DeqSb (background store buffer dequeue)
Predicate: The sb of a processor is not empty.
Action: Assume the value of the oldest store for some address a in the sb is v. Then this store is removed from sb, and the atomic memory m[a] is updated to v.

V. WMM MODEL
WMM allows Ld-Ld reordering in addition to the reorderings allowed by PSO. Since a reordered load may read a stale value, we introduce a conceptual device called invalidation buffer, ib, for each processor in the I²E abstract machine (see Figure 6c). ib is an unbounded buffer of ⟨address, value⟩ pairs, each representing a stale memory value for an address that can be observed by the processor. Multiple stale values for an address in ib are kept ordered by their staleness.
The operations of the WMM abstract machine are similar to those of PSO except for the background operation and the load execution. When the background operation moves a store from sb to the atomic memory, the original value in the atomic memory, i.e., the stale value, enters the ib of every other processor. A load first searches the local sb. If the address is not found in sb, it either reads the value in the atomic memory or any stale value for the address in the local ib, the choice between the two being nondeterministic.
The abstract machine operations maintain the following invariants: once a processor observes a store, it cannot observe any staler store for that address. Therefore, (1) when a store is executed, values for the store address in the local ib are purged; (2) when a load is executed, values staler than the load result are flushed from the local ib; and (3) the background operation does not insert the stale value into the ib of a processor if the sb of the processor contains the address.
Just like introducing the Commit fence in TSO, to prevent loads from reading the stale values in ib, we introduce the Reconcile fence to clear the local ib. Figure 8 summarizes the operations of the WMM abstract machine.
Like PSO, WMM allows St-Ld and St-St reorderings (Figures 3a and 3b). To forbid the behavior in Figure 3a, we need to insert a Commit followed by a Reconcile after the store in each processor. Reconcile is needed to prevent loads from getting stale values from ib. The I²E definition of WMM automatically forbids Ld-St reordering (Figure 3c) and out-of-thin-air behaviors (Figure 5).
Ld-Ld reordering: WMM allows the behavior in Figure 3b due to St-St reordering. Even if we insert a Commit between the two stores in P1, the behavior is still allowed because I4 can read the stale value 0 from ib. This is as if the two loads in P2 were reordered. Thus, we also need a Reconcile between the two loads in P2 to forbid this behavior in WMM.
No dependency ordering: WMM does not enforce any dependency ordering. For example, WMM allows the behaviors of litmus tests in Figures 3d, 3e and 3f (I2 should be Commit in case of WMM), because the last load in P2 can always get the stale value 0 from ib in each test. Thus, WMM requires Reconcile fences to enforce dependency ordering. In particular, WMM can reorder the data-dependent loads (i.e., I4 and I5) in Figure 3f.
Multi-copy atomic stores: Stores in WMM are multi-copy atomic, and WMM allows the behavior in Figure 2a even when Reconcile fences are inserted between the Ld-Ld pairs (I2, I3) and (I5, I6). This is because a store can be read by a load from the same processor while the store is in sb. However, once the store is pushed from sb to the atomic memory, it becomes visible to all other processors simultaneously. Thus, WMM forbids the behaviors in Figures 2b, 2c and 2d (FENCE_LL should be Reconcile in these tests).
Per-location SC: WMM enforces per-location SC (Figure 4), because both sb and ib enforce FIFO ordering on same-address entries.
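The WMM abstract machine described in this section can be explored in the same exhaustive style. The Python sketch below follows the prose description only (store buffer, invalidation buffer, the three purge rules, Commit and Reconcile); the tuple encoding of instructions, the name wmm_outcomes, and the example programs are assumptions made for illustration, not the authors' formal definition.

```python
def wmm_outcomes(programs):
    """Return every reachable final register assignment of the sketch machine,
    each as a sorted tuple of (register, value) pairs."""
    n = len(programs)

    def put(t, i, x):
        # Copy of tuple t with slot i replaced by x.
        return t[:i] + (x,) + t[i + 1:]

    start = ((0,) * n, ((),) * n, ((),) * n, (), ())  # pcs, sbs, ibs, mem, regs
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, sbs, ibs, mem, regs = stack.pop()
        m, succs = dict(mem), []
        for i in range(n):
            sb, ib = sbs[i], ibs[i]  # both hold (addr, value), oldest first
            # Background dequeue of the oldest store for some address: the
            # overwritten value enters the ib of every other processor whose
            # sb does not contain that address.
            for k, (a, v) in enumerate(sb):
                if any(a2 == a for a2, _ in sb[:k]):
                    continue
                stale = (a, m.get(a, 0))
                new_ibs = tuple(
                    ibs[j] + (stale,)
                    if j != i and all(a2 != a for a2, _ in sbs[j]) else ibs[j]
                    for j in range(n))
                m2 = dict(m)
                m2[a] = v
                succs.append((pcs, put(sbs, i, sb[:k] + sb[k + 1:]), new_ibs,
                              tuple(sorted(m2.items())), regs))
            if pcs[i] == len(programs[i]):
                continue
            ins = programs[i][pcs[i]]
            npcs = put(pcs, i, pcs[i] + 1)

            def with_reg(r, val):
                d = dict(regs)
                d[r] = val
                return tuple(sorted(d.items()))

            if ins[0] == "St":
                _, a, v = ins  # a store purges the local ib entries for a
                purged = tuple(e for e in ib if e[0] != a)
                succs.append((npcs, put(sbs, i, sb + ((a, v),)),
                              put(ibs, i, purged), mem, regs))
            elif ins[0] == "Ld":
                _, r, a = ins
                hits = [v for a2, v in sb if a2 == a]
                if hits:  # forward from the youngest matching sb entry
                    succs.append((npcs, sbs, ibs, mem, with_reg(r, hits[-1])))
                else:
                    # Choice 1: read memory; every ib value for a is staler.
                    cleared = tuple(e for e in ib if e[0] != a)
                    succs.append((npcs, sbs, put(ibs, i, cleared), mem,
                                  with_reg(r, m.get(a, 0))))
                    # Choice 2: read any stale value; drop the staler ones.
                    for k, (a2, v) in enumerate(ib):
                        if a2 == a:
                            kept = tuple(e for j, e in enumerate(ib)
                                         if e[0] != a or j >= k)
                            succs.append((npcs, sbs, put(ibs, i, kept), mem,
                                          with_reg(r, v)))
            elif ins[0] == "Commit":
                if not sb:  # blocked until the local sb is empty
                    succs.append((npcs, sbs, ibs, mem, regs))
            elif ins[0] == "Reconcile":  # clears the local ib
                succs.append((npcs, sbs, put(ibs, i, ()), mem, regs))
        if not succs:
            finals.add(regs)
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals


# MP (Figure 3b) without fences, with Commit only, and with both fences.
MP = [[("St", "a", 1), ("St", "b", 1)],
      [("Ld", "r1", "b"), ("Ld", "r2", "a")]]
MP_COMMIT = [[("St", "a", 1), ("Commit",), ("St", "b", 1)],
             [("Ld", "r1", "b"), ("Ld", "r2", "a")]]
MP_FENCED = [[("St", "a", 1), ("Commit",), ("St", "b", 1)],
             [("Ld", "r1", "b"), ("Reconcile",), ("Ld", "r2", "a")]]
```

In this sketch the outcome r1 = 1, r2 = 0 is reachable for MP and MP_COMMIT (the second load reads the stale 0 from ib) and unreachable for MP_FENCED, in line with the Ld-Ld reordering discussion above.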
B. Axiomatic Definition of WMM
Based on the above properties of WMM, we give a simple axiomatic definition for WMM in Figure 9 in the style of the axiomatic definitions of TSO and Alpha. A True entry in the order-preserving table (Figure 9b) indicates that if instruction X precedes instruction Y in the program order ($X <_{po} Y$) then the order must be maintained in the global memory order ($<_{mo}$). $<_{mo}$ is a total order of all the memory and fence instructions from all processors. The notation $S \xrightarrow{rf} L$ means a load L reads from a store S. The notation $\max_{<_{mo}}\{\text{set of stores}\}$ means the youngest store in the set according to $<_{mo}$. The axioms are self-explanatory: the program order must be maintained if the order-preserving table says so, and a load must read from the youngest store among all stores that precede the load in either the memory order or the program order. (See Appendix A for the equivalence proof of the axiomatic and I²E definitions.)
These axioms also hold for Alpha with a slightly different order-preserving table, which marks the (Ld,St) entry as a = b. (Alpha also merges Commit and Reconcile into a single fence.) However, allowing Ld-St reordering creates the possibility of out-of-thin-air behaviors, and Alpha uses an additional complicated axiom to disallow such behaviors [5, Chapter 5.6.1.7]. This axiom requires considering all possible execution paths to determine if a store is ordered after a load by dependency, while normal axiomatic models only examine a single execution path at a time. Allowing Ld-St reordering also makes it difficult to define Alpha operationally.
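The load-value rule stated above (a load reads from the youngest store, in memory order, among the same-address stores that precede it in memory order or in program order) can be written as a short function. This is a minimal Python sketch; the event encoding (proc, po_index, kind, addr, value) and the name load_value are assumptions for illustration, and the order-preserving table of Figure 9b is not modeled.

```python
def load_value(load, mo, init=0):
    """Value read by `load` under the load-value rule sketched above.

    Each event is a tuple (proc, po_index, kind, addr, value); `mo` lists all
    memory events in global memory order, oldest first."""
    proc, po_index, _, addr, _ = load
    pos = {e: i for i, e in enumerate(mo)}  # position of each event in <mo
    candidates = [
        s for s in mo
        if s[2] == "St" and s[3] == addr
        # s precedes the load in memory order, or in program order
        and (pos[s] < pos[load] or (s[0] == proc and s[1] < po_index))
    ]
    if not candidates:
        return init  # no eligible store: the load sees the initial value
    return max(candidates, key=pos.get)[4]  # youngest eligible store in <mo


# A store that is still "buffered": it follows both loads in memory order.
st = (0, 0, "St", "a", 1)
ld_same_proc = (0, 1, "Ld", "a", None)
ld_other_proc = (1, 0, "Ld", "a", None)
mo = [ld_other_proc, ld_same_proc, st]
```

With this memory order, the load on the storing processor reads 1 through the program-order clause, while the load on the other processor reads the initial value 0, which mirrors local forwarding from a store buffer.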
Figure 9. Axiomatic definition of WMM, including Axiom Inst-Order (preserved instruction ordering).

C++ primitives [47] can be mapped to WMM instructions in an efficient way as shown in Figure 10. For the purpose of comparison, we also include a mapping to POWER [56]. The Commit; Reconcile sequence in WMM is the same as a sync fence in POWER, and Commit is similar to lwsync. The cmp; bc; isync sequence in POWER serves as a Ld-Ld fence, so it is similar to a Reconcile fence in WMM. In the case of Store SC in C++, WMM uses a Commit while POWER uses a sync, so WMM effectively saves one Reconcile. On the other hand, POWER does not need any fence for Load Consume in C++, while WMM requires a Reconcile.
Besides the C++ primitives, a common programming paradigm is the well-synchronized program, in which all critical sections are protected by locks. To maintain SC behaviors for such programs in WMM, we can add a Reconcile after acquiring the lock and a Commit before releasing the lock.
For any program, if we insert a Commit before every store and insert a Commit followed by a Reconcile before every load, then the program behavior in WMM is guaranteed to be sequentially consistent. This provides a conservative way for inserting fences when performance is not an issue.
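The conservative fence-insertion rule above is mechanical, as the following Python sketch shows. The tuple encoding of instructions and the name insert_sc_fences are assumptions introduced for illustration.

```python
def insert_sc_fences(program):
    """Insert a Commit before every store and a Commit followed by a
    Reconcile before every load; other instructions are left untouched."""
    fenced = []
    for ins in program:
        if ins[0] == "St":
            fenced.append(("Commit",))
        elif ins[0] == "Ld":
            fenced.extend([("Commit",), ("Reconcile",)])
        fenced.append(ins)
    return fenced


example = insert_sc_fences([("St", "a", 1), ("Ld", "r1", "b")])
```

For the two-instruction program above, the result is Commit; St a 1; Commit; Reconcile; r1 = Ld b.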
VI. WMM IMPLEMENTATION
WMM can be implemented using conventional OOO multiprocessors, and even the most aggressive speculative techniques cannot step beyond WMM. To demonstrate this, we describe an OOO implementation of WMM, and show simultaneously how the WMM model (i.e., the I 2 E abstract machine) captures the behaviors of the implementation. The implementation is described abstractly to skip unrelated details (e.g., ROB entry reuse). The implementation consists of n OOO processors and a coherent write-back cache hierarchy which we discuss next.
A. Write-Back Cache Hierarchy (CCM)
We describe CCM as an abstraction of a conventional write-back cache hierarchy to avoid too many details. In the following, we explain the function of such a cache hierarchy, abstract it to CCM, and relate CCM to the WMM model. Consider a real n-ported write-back cache hierarchy with each port i connected to processor Pi. A request issued to port i may be from a load instruction in the ROB of Pi or a store in the store buffer of Pi. In conventional coherence protocols, all memory requests can be serialized, i.e., each request can be considered as taking effect at some time point within its processing period [57]. For example, consider the non-stalling MSI directory protocol in the Primer by Sorin et al. [58, Chapter 8.7.2]. In this protocol, a load request takes effect immediately if it hits in the cache; otherwise, it takes effect when it gets the data at the directory or a remote cache with M state. A store request always takes effect at the time of writing the cache, i.e., either when it hits in the cache, or, in the case of a miss, when it has received the directory response and all invalidation responses. We also remove the requesting store from the store buffer when a store request takes effect. Since a cache cannot process multiple requests to the same address simultaneously, we assume requests to the same address from the same processor are processed in the order that the requests are issued to the cache. CCM (Figure 11) abstracts the above cache hierarchy by operating as follows: every new request from port i is inserted into a memory request buffer mrb[i], which keeps requests to the same address in order; at any time we can remove the oldest request for an address from an mrb, let the request access the atomic memory m, and either send the load result to the ROB (which may experience a delay) or immediately dequeue the store buffer. m represents the coherent memory states.
Removing a request from mrb and accessing m captures the moment when the request takes effect.
It is easy to see that the atomic memory in CCM corresponds to the atomic memory in the WMM model, because they both hold the coherent memory values. We will show shortly how WMM captures the combination of CCM and OOO processors. Thus any coherence protocol that can be abstracted as CCM can be used to implement WMM.
B. Out-of-Order Processor (OOO)
The major components of an OOO processor are the ROB and the store buffer (see Figure 11 ). Instructions are fetched into and committed from ROB in order; loads can be issued (i.e., search for data forwarding and possibly request CCM) as soon as its address is known; a store is enqueued into the store buffer only when the store commits (i.e., entries in a store buffer cannot be killed). To maintain the per-location SC property of WMM, when a load L is issued, it kills younger loads which have been issued but do not read from stores younger than L. Next we give the correspondence between OOO and WMM. Store buffer: The state of the store buffer in OOO is represented by the sb in WMM. Entry into the store buffer when a store commits in OOO corresponds to the WMM-St operation. In OOO, the store buffer only issues the oldest store for some address to CCM. The store is removed from the store buffer when the store updates the atomic memory in CCM. This corresponds to the WMM-DeqSb operation. ROB and eager loads: Committing an instruction from ROB corresponds to executing it in WMM, and thus the architectural register state in both WMM and OOO must match at the time of commit. Early execution of a load L to address a with a return value v in OOO can be understood by considering where a, v resides in OOO when L commits.
Reading from sb or atomic memory m in the WMM-Ld operation covers the cases that a, v is, respectively, in the store buffer or the atomic memory of CCM when L commits. Otherwise a, v is no longer present in CCM+OOO at the time of load commit and must have been overwritten in the atomic memory of CCM. This case corresponds to having performed the WMM-DeqSb operation to insert a, v into ib previously, and now using the WMM-Ld operation to read v from ib. Speculations: OOO can issue a load speculatively by aggressive predictions, such as branch prediction (Figure 3d ), memory dependency prediction ( Figure 3e ) and even loadvalue prediction ( Figure 3f ). As long as all predictions related to the load eventually turn out to be correct, the load result got from the speculative execution can be preserved. No further check is needed. Speculations effectively reorder dependent instructions, e.g., load-value speculation reorders data-dependent loads. Since WMM does not require preserving any dependency ordering, speculations will neither break WMM nor affect the above correspondence between OOO and WMM. Fences: Fences never go into store buffers or CCM in the implementation. In OOO, a Commit can commit from ROB only when the local store buffer is empty. Reconcile plays a different role; at the time of commit it is a NOP, but while it is in the ROB, it stalls all younger loads (unless the load can bypass directly from a store which is younger than the Reconcile). The stall prevents younger loads from reading values that would become stale when the Reconcile commits. This corresponds to clearing ib in WMM. Summary: For any execution in the CCM+OOO implementation, we can operate the WMM model following the above correspondence. 
Each time CCM+OOO commits an instruction I from ROB or dequeues a store S from a store buffer to memory, the atomic memory of CCM, store buffers, and the results of committed instructions in CCM+OOO are exactly the same as those in the WMM model when the WMM model executes I or dequeues S from sb, respectively.
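The correspondence above can be made concrete with a small executable sketch of the WMM abstract machine. This is only an illustration under stated assumptions, not the paper's formal definition: the class and method names (`WMM`, `st`, `deq_sb`, `ld`, `ld_stale`, `com`, `rec`, `message_passing`) are hypothetical, memory is assumed to start at 0, and the exact rules for discarding stale values on a load are assumed from the proof in Appendix A.

```python
from collections import defaultdict


class WMM:
    """Toy model of the WMM abstract machine (illustrative only)."""

    def __init__(self, nprocs):
        self.m = defaultdict(int)               # atomic memory; assumed to start at 0
        self.sb = [[] for _ in range(nprocs)]   # store buffers: (addr, val), oldest first
        self.ib = [[] for _ in range(nprocs)]   # invalidation buffers: stale (addr, val)

    def _in_sb(self, i, a):
        return any(x == a for x, _ in self.sb[i])

    def st(self, i, a, v):
        """WMM-St: enqueue the store and drop stale values for a from the local ib."""
        self.sb[i].append((a, v))
        self.ib[i] = [(x, y) for x, y in self.ib[i] if x != a]

    def deq_sb(self, i, a):
        """WMM-DeqSb: move the oldest store for a from sb[i] into atomic memory."""
        k = next(k for k, (x, _) in enumerate(self.sb[i]) if x == a)
        _, v = self.sb[i].pop(k)
        for j in range(len(self.sb)):
            if j != i and not self._in_sb(j, a):
                self.ib[j].append((a, self.m[a]))   # the old memory value becomes stale
        self.m[a] = v

    def ld(self, i, a):
        """WMM-Ld that reads the youngest local store, or else atomic memory."""
        for x, v in reversed(self.sb[i]):
            if x == a:
                return v
        # Assumption: reading memory discards every stale value for a.
        self.ib[i] = [(x, y) for x, y in self.ib[i] if x != a]
        return self.m[a]

    def ld_stale(self, i, k):
        """WMM-Ld that reads the stale entry ib[i][k]; older stale values for a are dropped."""
        a, v = self.ib[i][k]
        assert not self._in_sb(i, a)   # sb and ib never hold the same address
        self.ib[i] = [e for n, e in enumerate(self.ib[i]) if n >= k or e[0] != a]
        return v

    def com(self, i):
        """WMM-Com: a Commit may execute only when the local store buffer is empty."""
        if self.sb[i]:
            raise RuntimeError("Commit must wait for the store buffer to drain")

    def rec(self, i):
        """WMM-Rec: a Reconcile clears the local invalidation buffer."""
        self.ib[i].clear()


def message_passing(reconcile):
    """P0: St a 1; St b 1.  P1: Ld b; [Reconcile]; Ld a.  Returns (b, a) seen by P1."""
    w = WMM(2)
    w.st(0, "a", 1)
    w.deq_sb(0, "a")
    w.st(0, "b", 1)
    w.deq_sb(0, "b")
    r1 = w.ld(1, "b")                  # reads the new value of b from memory
    if reconcile:
        w.rec(1)                       # stale value of a is discarded
        r2 = w.ld(1, "a")
    else:
        r2 = w.ld_stale(1, 0)          # stale value of a is still readable
    return r1, r2
```

In this sketch, `message_passing(False)` lets P1 observe the new b together with a stale a, which mirrors an early-executed load in OOO, while `message_passing(True)` shows the Reconcile removing that possibility.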
VII. PERFORMANCE EVALUATION OF WMM
We evaluate the performance of implementations of WMM, Alpha, SC and TSO. All implementations use OOO cores and a coherent write-back cache hierarchy. Since Alpha allows Ld-St reordering, the comparison of WMM and Alpha will show whether such reordering affects performance.
A. Evaluation Methodology
We ran SPLASH-2x benchmarks [59], [60] on an 8-core multiprocessor using the ESESC simulator [61]. We ran all benchmarks except ocean_ncp, which allocates too much memory and breaks the original simulator. We used sim-medium inputs except for cholesky, fft and radix, where we used sim-large inputs. We ran all benchmarks to completion without sampling.
The configuration of the 8-core multiprocessor is shown in Figures 12 and 13. We do not use load-value speculation in this evaluation. The Alpha implementation can mark a younger store as committed when instruction commit is stalled, as long as the store can never be squashed and the early commit will not affect single-thread correctness. A committed store can be issued to memory or merged with another committed store in WMM and Alpha. SC and TSO issue loads speculatively and monitor L1 cache evictions to kill speculative loads that violate the consistency model. We also implement store prefetch as an optional feature for SC and TSO; we use SC-pf and TSO-pf to denote the respective implementations with store prefetch.
B. Simulation Results
A common way to study the performance of memory models is to monitor the commit of instructions at the commit slot of ROB (i.e., the oldest ROB entry). Here are some reasons why an instruction may not commit in a given cycle:
• empty: The ROB is empty.
• exe: The instruction at the commit slot is still executing.
• pendSt: The load (in SC) or Commit (in TSO, Alpha and WMM) cannot commit due to pending older stores.
• flushLS: ROB is being flushed because a load is killed by another older load (only in WMM and Alpha) or older store (in all models) to the same address.
• flushInv: ROB is being flushed after cache invalidation caused by a remote store (only in SC or TSO).
• flushRep: ROB is being flushed after cache replacement (only in SC or TSO).
Figure 14 shows the execution time (normalized to WMM) and its breakdown at the commit slot of ROB. The total height of each bar represents the normalized execution time, and stacks represent different types of stall times added to the active committing time at the commit slot.
WMM versus SC: WMM is much faster than both SC and SC-pf for most benchmarks, because a pending older store in the store queue can block SC from committing loads.
WMM versus TSO: WMM never does worse than TSO or TSO-pf, and in some cases it shows up to 1.45× speedup over TSO (in radix) and 1.18× over TSO-pf (in lu_ncb). There are two disadvantages of TSO compared to WMM. First, load speculation in TSO is subject to L1 cache eviction, e.g., in benchmark ocean_cp. Second, TSO requires prefetch to reduce store miss latency, e.g., a full store queue in TSO stalls issue to ROB and makes ROB empty in benchmark radix. However, prefetch may sometimes degrade performance due to interference with load execution, e.g., TSO-pf has more commit stalls due to unfinished loads in benchmark lu_ncb.
WMM versus Alpha: Figure 15 shows the average number of cycles by which a store in Alpha can commit before it reaches the commit slot. However, the early commit (i.e., Ld-St reordering) does not make Alpha outperform WMM (see Figure 14), because store buffers can already hide the store miss latency. Note that ROB is typically implemented as a FIFO (i.e., a circular buffer) for register renaming (e.g., freeing physical registers in order), precise exceptions, etc. Thus, if the early committed store is in the middle of ROB, its ROB entry cannot be reused by a newly fetched instruction, i.e., the effective size of the ROB will not increase. In summary, the Ld-St reordering in Alpha does not increase performance but complicates the definition (Section V-B).
VIII. WMM-S MODEL
Unlike the multi-copy atomic stores in WMM, stores in some processors (e.g., POWER) are non-atomic due to shared write-through caches or shared store buffers. If multiple processors share a store buffer or write-through cache, a store by any of these processors may be seen by all these processors before other processors. Although we could tag stores with processor IDs in the store buffer, it is infeasible to separate values stored by different processors in a cache.
In this section, we introduce a new I 2 E model, WMM-S, which captures the non-atomic store behaviors in a way independent from the sharing topology. WMM-S is derived from WMM by adding a new background operation. We will show later in Section IX why WMM-S can be implemented using memory systems with non-atomic stores.
A. I 2 E Definition of WMM-S
The structure of the abstract machine of WMM-S is the same as that of WMM. To model non-atomicity of stores, i.e., to make a store by one processor readable by another processor before the store updates the atomic memory, WMM-S introduces a new background operation that copies a store from one store buffer into another. However, we need to ensure that all stores for an address can still be put in a total order (i.e., the coherence order), and the order seen by any processor is consistent with this total order (i.e., per-location SC).
To identify all the copies of a store in various store buffers, we assign a unique tag t when a store is executed (by being inserted into sb), and this tag is copied when a store is copied from one store buffer to another. When a background operation dequeues a store from a store buffer to the memory, all its copies must be deleted from all the store buffers which have them. This requires that all copies of the store are the oldest for that address in their respective store buffers.
All the stores for an address in a store buffer can be strictly ordered as a list, where the youngest store is the one that entered the store buffer last. We make sure that all ordered lists (of all store buffers) can be combined transitively to form a partial order (i.e., no cycle), which has now to be understood in terms of the tags on stores because of the copies. We refer to this partial order as the partial coherence order (< co ), because it is consistent with the coherence order.
Consider the states of store buffers shown in Figure 16 (primes are copies). A, B, C and D are different stores to the same address, and their tags are t A , t B , t C and t D , respectively. A′ and B′ are copies of A and B respectively created by the background copy operation. Ignoring C′, the partial coherence order contains: t D < co t B < co t A (D is older than B, and B is older than A in P2), and t C < co t B (C is older than B in P3). Note that t D and t C are not related here.
At this point, if we copied C in P3 as C′ into P1, we would add a new edge t A < co t C , breaking the partial order by introducing the cycle t A < co t C < co t B < co t A . Thus copying of C into P1 should be forbidden in this state. Similarly, copying a store with tag t A into P1 or P2 should be forbidden because it would immediately create a cycle: t A < co t A . In general, the background copy operation must be constrained so that the partial coherence order is still acyclic after copying.
Figure 17 shows the background operations of the WMM-S abstract machine. The operations that execute instructions in WMM-S are the same as those in WMM, so we do not show them again. (The store execution operation in WMM-S also needs to insert the tag of the store into sb.)
Binding background copy with load execution: If the WMM-S-Copy operation is restricted to always happen right before a load execution operation that reads from the newly created copy, it is not difficult to prove that the WMM-S model remains the same, i.e., legal behaviors do not change. In the rest of the paper, we will only consider this "restricted" version of WMM-S. In particular, all WMM-S-Copy operations in the following analysis of litmus tests fall into this pattern.
Figure 17: Background operations of WMM-S.
WMM-DeqSb (background store buffer dequeue)
Predicate: There is a store S in a store buffer, and all copies of S are the oldest store for that address in their respective store buffers.
Action: Assume the ⟨address, value, tag⟩ tuple of store S is ⟨a, v, t⟩. First, the stale ⟨address, value⟩ pair ⟨a, m[a]⟩ is inserted to the ib of every processor whose sb does not contain a. Then all copies of S are removed from their respective store buffers, and the atomic memory m[a] is updated to v.
WMM-S-Copy (background store copy)
Predicate: There is a store S that is in the sb of some processor i but not in the sb of some other processor j. Additionally, the partial coherence order will still be acyclic if we insert a copy of S into the sb of processor j.
Action: Insert a copy of S into the sb of processor j, and remove all values for the store address of S from the ib of processor j.
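The predicates of the two background operations can be checked mechanically. The following is a minimal sketch, assuming a single address and representing each store buffer as a list of tags ordered from oldest to youngest; the function names (`is_acyclic`, `can_copy`, `can_dequeue`) are hypothetical, and the contents of P1's store buffer in the example are an assumption inferred from the discussion of Figure 16.

```python
def is_acyclic(sbs):
    """Check that the per-buffer orders combine into an acyclic relation on tags."""
    succ, nodes = {}, set()
    for sb in sbs.values():
        nodes.update(sb)
        for older, younger in zip(sb, sb[1:]):
            succ.setdefault(older, set()).add(younger)
    indeg = {t: 0 for t in nodes}
    for outs in succ.values():
        for t in outs:
            indeg[t] += 1
    # Kahn's algorithm: every tag is removed exactly when no cycle exists.
    ready = [t for t in nodes if indeg[t] == 0]
    removed = 0
    while ready:
        t = ready.pop()
        removed += 1
        for u in succ.get(t, ()):
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return removed == len(nodes)


def can_copy(sbs, tag, j):
    """Predicate of WMM-S-Copy for copying the store with this tag into sb of j."""
    if tag in sbs[j]:
        return False                                   # j already holds a copy
    if not any(tag in sb for sb in sbs.values()):
        return False                                   # no store with this tag exists
    trial = {p: list(sb) for p, sb in sbs.items()}
    trial[j].append(tag)                               # the copy enters as the youngest
    return is_acyclic(trial)


def can_dequeue(sbs, tag):
    """Predicate of WMM-DeqSb: every copy is the oldest store in its buffer."""
    holders = [sb for sb in sbs.values() if tag in sb]
    return bool(holders) and all(sb[0] == tag for sb in holders)


# Assumed state for Figure 16 (tags only, oldest first, one address).
FIG16 = {"P1": ["A"], "P2": ["D", "B", "A"], "P3": ["C", "B"]}
```

With this state, `can_copy(FIG16, "C", "P1")` is rejected because it would close the cycle through the tags of A, C and B, whereas `can_copy(FIG16, "A", "P3")` is accepted. Only D and C satisfy `can_dequeue`, since every other store has an older store ahead of one of its copies.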
B. Properties of WMM-S
WMM-S enforces per-location SC (Figure 4), because it prevents cycles in the order of stores to the same address. It also allows the same instruction reorderings as WMM does (Figure 3). We focus on the store non-atomicity of WMM-S.
Non-atomic stores and cumulative fences: Consider the litmus tests for non-atomic stores in Figures 2b, 2c and 2d (FENCE LL should be Reconcile in these tests). WMM-S allows the behavior in Figure 2b by copying I 1 into the sb of P2 and then executing I 2 , I 3 , I 4 , I 5 , I 6 sequentially. I 1 will not be dequeued from sb until I 6 returns value 0. To forbid this behavior, a Commit is required between I 2 and I 3 in P2 to push I 1 into memory. Similarly, WMM-S allows the behavior in Figure 2c (i.e., we copy I 1 into the sb of P2 to satisfy I 2 , and I 1 is dequeued after I 5 has updated the atomic memory), and we need a Commit between I 2 and I 3 to forbid the behavior. In both litmus tests, the inserted fences have a cumulative global effect in ordering I 1 before I 3 and the last instruction in P3.
WMM-S also allows the behavior in Figure 2d by copying I 1 into the sb of P2 to satisfy I 2 , and copying I 5 into the sb of P4 to satisfy I 6 . To forbid the behavior, we need to add a Commit right after the first load in P2 and P4 (but before the FENCE LL /Reconcile that we added to stop Ld-Ld reordering). As we can see, Commit and Reconcile are similar to release and acquire respectively. Cumulation is achieved by globally advertising observed stores (Commit) and preventing later loads from reading stale values (Reconcile).
Programming properties: WMM-S is the same as WMM in the properties described in Section V-C, including the compilation of C++ primitives, maintaining SC for well-synchronized programs, and the conservative way of inserting fences.
IX. WMM-S IMPLEMENTATIONS
Since WMM-S is strictly more relaxed than WMM, any WMM implementation is a valid WMM-S implementation. However, we are more interested in implementations with non-atomic memory systems. Instead of discussing each specific system one by one, we explain how WMM-S can be implemented using the ARMv8 flowing model, which is a general abstraction of non-atomic memory systems [8]. We first describe the adapted flowing model (FM), which uses the fences of WMM-S instead of those of ARM, and then explain how it obeys WMM-S. Figure 19 shows four OOO processors (P1...P4) connected to a 4-ported FM which has six segments (s[1...6]). Each segment is a list of memory requests (e.g., the list of blue nodes in s[6], whose head is at the bottom and the tail is at the top).
A. The Flowing Model (FM)
OOO interacts with FM in a slightly different way than CCM. Every memory request from a processor is appended to the tail of the list of the segment connected to the processor (e.g., s[1] for P1). OOO no longer contains a store buffer; after a store is committed from ROB, it is directly sent to FM and there is no store response. When a Commit fence reaches the commit slot of ROB, the processor sends a Commit request to FM, and the ROB will not commit the Commit fence until FM sends back the response for the Commit request.
Inside FM, there are three background operations: (1) Two requests in the same segment can be reordered in certain cases; (2) A load can bypass from a store in the same segment; (3) The request at the head of the list of a segment can flow into the parent segment (e.g., flow from s[1] into s[5]) or the atomic memory (in case the parent of the segment, e.g., s[6], is m). Details of these operations are shown in Figure 18.
It is easy to see that FM abstracts non-atomic memory systems, e.g., Figure 19 abstracts a system in which P1 and P2 share a write-through cache while P3 and P4 share another.
Two properties of FM+OOO: First, FM+OOO enforces per-location SC because the segments in FM never reorder requests to the same address. Second, stores for the same address, which lie on the path from a processor to m in the tree structure of FM, are strictly ordered based on their distance to the tree root m; and the combination of all such orderings will not contain any cycle. For example, in Figure 19, stores in segments s[3] and s[6] are on the path from P3 to m; a store in s[6] is older than any store (for the same address) in s[3], and stores (for the same address) in the same segment are ordered from bottom to top (bottom is older).
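The flow of requests through FM can be illustrated with a short sketch. It models only the flow operation and the responses generated at the atomic memory; reordering and load bypassing inside a segment are deliberately left out because their conditions are given in Figure 18. The class `FlowingModel`, its method names, the tuple layout of requests, and the topology in `PARENT` (in particular that both s[5] and s[6] are children of m) are assumptions made for illustration.

```python
from collections import deque


class FlowingModel:
    """Toy flowing model: a tree of request queues rooted at atomic memory "m"."""

    def __init__(self, parent, port):
        self.parent = parent                       # segment -> parent segment, or "m"
        self.port = port                           # processor -> segment it is attached to
        self.seg = {s: deque() for s in parent}    # head (oldest request) is on the left
        self.m = {}                                # atomic memory; unset addresses read as 0
        self.responses = []                        # responses sent back to processors

    def issue(self, proc, kind, addr=None, val=None):
        """A processor appends a request to the tail of its own segment."""
        self.seg[self.port[proc]].append((proc, kind, addr, val))

    def flow(self, s):
        """The head request of segment s flows into the parent segment or into memory."""
        req = self.seg[s].popleft()
        up = self.parent[s]
        if up != "m":
            self.seg[up].append(req)
            return
        proc, kind, addr, val = req
        if kind == "ld":
            self.responses.append((proc, "ld", addr, self.m.get(addr, 0)))
        elif kind == "st":
            self.m[addr] = val                     # stores produce no response
        elif kind == "commit":
            self.responses.append((proc, "commit", None, None))


# Assumed topology for Figure 19: P1 and P2 share s[5], P3 and P4 share s[6].
PARENT = {1: 5, 2: 5, 3: 6, 4: 6, 5: "m", 6: "m"}
PORT = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
```

For instance, a store from P1 followed by a Commit reaches memory only after flowing through s[1] and s[5], and the Commit response is produced only once the Commit request itself arrives at m, which is why the ROB has to wait for it.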
Figure 18: Background operations of FM (FM-Reorder, bypass and flow). When a request r flows into the atomic memory m:
• If r is a load, we send a load response using the value in m.
• If r is a store ⟨a, v⟩, we update m[a] to v.
• If r is a Commit, we send a response to the requesting processor and the Commit fence can then be committed from ROB.
3) The sb of each processor P i in WMM-S holds exactly all the stores in FM+OOO that are observed by the commits of P i but have not updated the atomic memory. (A store is observed by the commits of P i if it has been either committed by P i or returned by a load that has been committed by P i.)
4) The order of stores for the same address in the sb of any processor in WMM-S is exactly the order of those stores on the path from P i to m in FM+OOO.
It is easy to see how the invariants are maintained when the atomic memory is updated or a non-load instruction is committed in FM+OOO. To understand the commit of a load L to address a with result v in processor P i in FM+OOO, we still consider where ⟨a, v⟩ resides when L commits. Similar to WMM, reading atomic memory m or local ib in the load execution operation of WMM-S covers the cases that ⟨a, v⟩ is still in the atomic memory of FM or has already been overwritten by another store in the atomic memory of FM, respectively. In case ⟨a, v⟩ is a store that has not yet updated the atomic memory in FM, ⟨a, v⟩ must be on the path from P i to m. In this case, if ⟨a, v⟩ has been observed by the commits of P i before L is committed, then L can be executed by reading the local sb in WMM-S. Otherwise, on the path from P i to m, ⟨a, v⟩ must be younger than any other store observed by the commits of P i. Thus, WMM-S can copy ⟨a, v⟩ into the sb of P i without breaking any invariant. The copy will not create any cycle in < co because of invariants 3 and 4 as well as the second property of FM+OOO mentioned above. After the copy, WMM-S can have L read v from the local sb.
Performance comparison with ARM and POWER: As we have shown that WMM-S can be implemented using the generalized memory system of ARM, we can turn an ARM multicore into a WMM-S implementation by stopping Ld-St reordering in the ROB. Since Section VII already shows that Ld-St reordering does not affect performance, we can conclude qualitatively that there is no discernible performance difference between WMM-S and ARM implementations. The same arguments apply to the comparison against POWER and RC.
Here we elaborate the problems of RC (both RC sc and RC pc ) and RMO, which have been pointed out in Section I.
RC: Although the RC definition [6] allows the behaviors of WRC and IRIW (Figures 2b and 2d), it disallows the behavior of WWC (Figure 2c). In WWC, when I 2 reads the value of store I 1 , the RC definition says that I 1 is performed with respect to (w.r.t.) P2. Since store I 5 has not been issued due to the data dependencies in P2 and P3, I 1 must be performed w.r.t. P2 before I 5 . The RC definition says that "all writes to the same location are serialized in some order and are performed in that order with respect to any processor" [6, Section 2]. Thus, I 1 is before I 5 in the serialization order of stores for address a, and the final memory value of a cannot be 2 (the value of I 1 ), i.e., RC forbids the behavior of WWC and thus forbids shared write-through caches in implementations.
RMO: The RMO definition [4, Section D] is incorrect in enforcing dependency ordering. Consider the litmus test in Figure 20 (MEMBAR is the fence in RMO). In P2, the execution of I 6 is conditional on the result of I 4 , I 7 loads from the address that I 6 stores to, and I 9 uses the results of I 7 . According to the definition of dependency ordering in RMO [4, Section D.3.3], I 9 depends on I 4 transitively. Then the RMO axioms [4, Section D.4] dictate that I 9 must be after I 4 in the memory order, and thus forbid the behavior in Figure 20. However, this behavior is possible in hardware with speculative load execution and store forwarding, i.e., I 7 first speculatively bypasses from I 6 , and then I 9 executes speculatively to get 0. Since most architects will not be willing to give up on these two optimizations, RISC-V should not adopt RMO.

XI. CONCLUSION
We have proposed two weak memory models, WMM and WMM-S, for RISC-V with different tradeoffs between definitional simplicity and implementation flexibility. However, RISC-V can have only one memory model. Since there is no obvious evidence that restricting to multi-copy atomic stores affects performance or increases hardware complexity, RISC-V should adopt WMM in favor of simplicity.
XII. ACKNOWLEDGMENT
We thank all the anonymous reviewers on the different versions of this paper over the two years. We have also benefited from the discussions with Andy Wright, Thomas Bourgeat, Joonwon Choi, Xiangyao Yu, and Guowei Zhang. This work was done as part of the Proteus project under the DARPA BRASS Program (grant number 6933218).
APPENDIX A. PROOF OF EQUIVALENCE BETWEEN WMM I 2 E MODEL AND WMM AXIOMATIC MODEL
Here we present the equivalence proof for the I 2 E definition and the axiomatic definition of WMM.
Proof: The goal is that for any execution in the WMM I 2 E model, we can construct relations < po , < mo , $\xrightarrow{\text{rf}}$ that have the same program behavior and satisfy the WMM axioms. To do this, we first introduce the following ghost states to the I 2 E model:
• Field source in the atomic memory: For each address a, we add state m[a].source to record the store that writes the current memory value.
• Fields source and overwrite in the invalidation buffer: For each stale value ⟨a, v⟩ in an invalidation buffer, we add state v.source to denote the store of this stale value, and add state v.overwrite to denote the store that overwrites v.source in the memory.
• Per-processor list < po-i2e : For each processor, < po-i2e is the list of all the instructions that have been executed by the processor. The order in < po-i2e is the same as the execution order in the processor. We also use < po-i2e to represent the ordering relation in the list (the head of the list is the oldest/minimum in < po-i2e ).
• Global list < mo-i2e : < mo-i2e is a list of all the executed loads, executed fences, and stores that have been dequeued from the store buffers. < mo-i2e contains instructions from all processors. We also use < mo-i2e to represent the ordering relation in the list (the head of the list is the oldest/minimum in < mo-i2e ).
• Read-from relations $\xrightarrow{\text{rf-i2e}}$: $\xrightarrow{\text{rf-i2e}}$ is a set of edges. Each edge points from a store to a load, indicating that the load has read from the store in the I 2 E model.
m[a].source initially points to the initialization store, and < po-i2e , < mo-i2e , $\xrightarrow{\text{rf-i2e}}$ are all initially empty. We now show how these states are updated in the operations of the WMM I 2 E model.
1) WMM-Nm, WMM-Com, WMM-Rec, WMM-St: Assume the operation executes an instruction I in processor i. We append I to the tail of list < po-i2e of processor i. If I is a fence (i.e., the operation is WMM-Com or WMM-Rec), then we also append I to the tail of list < mo-i2e .
2) WMM-DeqSb: Assume the operation dequeues a store S for address a. In this case, we update m[a].source to be S. Let S 0 be the original m[a].source before this operation is performed. Then for each new stale value ⟨a, v⟩ inserted into any invalidation buffer, we set v.source = S 0 and v.overwrite = S. We also append S to the tail of list < mo-i2e .
3) WMM-Ld: Assume the operation executes a load L for address a in processor i. We append L to the tail of list < po-i2e of processor i. The remaining actions depend on how L gets its value in this operation:
• If L reads from a store S in the local store buffer, then we add edge S $\xrightarrow{\text{rf-i2e}}$ L, and append L to the tail of list < mo-i2e .
• If L reads the atomic memory m[a], then we add edge m[a].source $\xrightarrow{\text{rf-i2e}}$ L, and append L to the tail of list < mo-i2e .
• If L reads a stale value ⟨a, v⟩ in the local invalidation buffer, then we add edge v.source $\xrightarrow{\text{rf-i2e}}$ L, and we insert L to be right before v.overwrite in list < mo-i2e (i.e., L is older than v.overwrite, but is younger than any other instruction which is older than v.overwrite).
As we will see later, at the end of the I 2 E execution, < po-i2e , < mo-i2e and $\xrightarrow{\text{rf-i2e}}$ will become the < po , < mo , $\xrightarrow{\text{rf}}$ relations that satisfy the WMM axioms. Before getting there, we show that the I 2 E model has the following invariants after each operation is performed:
1) For each address a, m[a].source in the I 2 E model is the youngest store for a in < mo-i2e .
2) All loads and fences that have been executed in the I 2 E model are in < mo-i2e .
3) An executed store is either in < mo-i2e or in store buffer, i.e., for each processor i, the store buffer of processor i contains exactly every store that has been executed in the I 2 E model but is not in < mo-i2e .
4) For any two stores S 1 and S 2 for the same address in the store buffer of any processor i in the I 2 E model, if S 1 is older than S 2 in the store buffer, then S 1 < po-i2e S 2 .
5) For any processor i and any address a, address a cannot be present in the store buffer and invalidation buffer of processor i at the same time.
6) For any stale value v for any address a in the invalidation buffer of any processor i in the I 2 E model, the following invariants hold:
a) v.source and v.overwrite are in < mo-i2e , and v.source < mo-i2e v.overwrite, and there is no other store for a between them in < mo-i2e .
b) For any Reconcile fence F that has been executed by processor i in the I 2 E model, F < mo-i2e v.overwrite.
c) For any store S for a that has been executed by processor i in the I 2 E model, S < mo-i2e v.overwrite.
d) For any load L for a that has been executed by processor i in the I 2 E model, if store S
7) For any two stale values v 1 and v 2 for the same address in the invalidation buffer of any processor i in the I 2 E model, if v 1 is older than v 2 in the invalidation buffer, then v 1 .source < mo-i2e v 2 .source.
8) For any instructions I 1 and I 2 , if I 1 < po-i2e I 2 and order(I 1 , I 2 ) and I 2 is in < mo-i2e , then I 1 < mo-i2e I 2 .
9) For any load L and store S, if S $\xrightarrow{\text{rf-i2e}}$ L, then the following invariants hold:
a) If S is not in < mo-i2e , then S is in the store buffer of the processor of L, and S < po-i2e L, and there is no store S′ for the same address in the same store buffer such that S < po-i2e S′ < po-i2e L.
and there is no other store S′ for the same address in the store buffer of the processor of L such that S′ < po-i2e L.
We now prove inductively that all invariants hold after each operation R is performed in the I 2 E model, i.e., we assume all invariants hold before R is performed. In case performing R changes some states (e.g., < mo-i2e ), we use superscript 0 to denote the state before R is performed (e.g., < 0 mo-i2e ) and use superscript 1 to denote the state after R is performed (e.g., < 1 mo-i2e ). Now we consider the type of R:
1) WMM-Nm: All invariants still hold.
2) WMM-St: Assume R executes a store S for address a in processor i. R changes the states of the store buffer, invalidation buffer, and < po-i2e of processor i. Now we consider each invariant.
• Invariant 1, 2: These are not affected.
• Invariant 3: This invariant still holds for the newly executed store S.
• Invariant 4: Since S becomes the youngest store in the store buffer of processor i, this invariant still holds.
• Invariant 5: Since R will clear address a from the invalidation buffer of processor i, this invariant still holds.
• Invariant 6: Invariants 6a, 6b, 6d are not affected.
Invariant 6c still holds because there is no stale value for a in the invalidation buffer of processor i after R is performed.
• Invariant 7: This is not affected, because R can only remove values from the invalidation buffer.
• Invariant 8: This is not affected because R is not in < mo-i2e .
• Invariant 9: Consider load L * and store S * for address a such that S * $\xrightarrow{\text{rf-i2e}}$ L * and L * is from processor i. We need to show that this invariant still holds for L * and S * . Since L * has been executed, we have L * < 1 po-i2e S. Thus this invariant cannot be affected.
3) WMM-Com: Assume R executes a Commit fence F in processor i. R adds F to the end of the < po-i2e of processor i and adds F to the end of < mo-i2e . Now we consider each invariant.
• Invariants 1, 3, 4, 5, 6, 7, 9: These are not affected.
• Invariant 2: This still holds because F is added to < mo-i2e .
• Invariant 8: Consider instruction I in processor i such that I < po-i2e F and order(I, F ). We need to show that I < 1 mo-i2e F . Since order(I, F ), I can be a load, or store, or fence. If I is a load or fence, since I has been executed, invariant 2 says that I is in < 0 mo-i2e before R is performed. Since F is added to the end of < mo-i2e , I < 1 mo-i2e F . If I is a store, the predicate of R says that I is not in the store buffer. Then invariant 3 says that I must be in < 0 mo-i2e , and we have I < 1 mo-i2e F .
4) WMM-Rec: Assume R executes a Reconcile fence F in processor i. R adds F to the end of the < po-i2e of processor i, adds F to the end of < mo-i2e , and clears the invalidation buffer of processor i. Now we consider each invariant.
• Invariants 1, 3, 4, 9: These are not affected.
• Invariants 5, 6, 7: These invariants still hold because the invalidation buffer of processor i is now empty.
• Invariant 8: Consider instruction I in processor i such that I < po-i2e F and order(I, F ). We need to show that I < 1 mo-i2e F . Since order(I, F ), I can be a load or fence. Since I has been executed, I must be in < 0 mo-i2e before R is performed according to invariant 2. Thus, I < 1 mo-i2e F .
5) WMM-DeqSb: Assume R dequeues a store S for address a from the store buffer of processor i. R changes the store buffer of processor i, the atomic memory m[a], and invalidation buffers of other processors. R also adds S to the end of < mo-i2e . Now we consider each invariant.
• Invariant 1: This invariant still holds, because m[a].source 1 = S and S becomes the youngest store for a in < 1 mo-i2e .
• Invariant 2: This is not affected.
• Invariant 3: This invariant still holds, because S is removed from store buffer and added to < mo-i2e .
• Invariant 4: This is not affected because we only remove stores from the store buffer.
• Invariant 5: The store buffer and invalidation buffer of processor i cannot be affected. The store buffer and invalidation buffer of processor j (≠ i) may be affected, because m[a] 0 may be inserted into the invalidation buffer of processor j. The predicate of R ensures that the insertion will not happen if the store buffer of processor j contains address a, so the invariant still holds.
• Invariant 6: We need to consider the influence on both existing stale values and the newly inserted stale values.
a) Consider stale value ⟨a, v⟩ which is in the invalidation buffer of processor j both before and after operation R is performed. This implies j ≠ i, because the store buffer of processor i contains address a before R is performed, and invariant 5 says that the invalidation buffer of processor i cannot have address a before R is performed. Now we show that each invariant still holds for ⟨a, v⟩.
-Invariant 6a: This still holds because S is the youngest in < mo-i2e .
-Invariant 6b: This is not affected.
-Invariant 6c: This is not affected because S is not executed by processor j.
-Invariant 6d: Since S is not in < 0 mo-i2e , invariant 9a says that any load that has read S must be from processor i. Since i ≠ j, this invariant cannot be affected.
$\xrightarrow{\text{rf-i2e}}$ L. The predicate of R says that the store buffer of processor j cannot contain address a. Thus, S * must be in < 0 mo-i2e according to invariant 9a. Therefore, S * < 1 mo-i2e S = v.overwrite, and the invariant still holds.
• Invariant 8: Consider instruction I such that I < po-i2e S and order(I, S). Since order(I, S), I can be a load, fence, or store for a. If I is a load or fence, then invariant 2 says that I is in < 0 mo-i2e , and thus I < 1 mo-i2e S, i.e., the invariant holds. If I is a store for a, then the predicate of R and invariant 4 imply that I is not in the store buffer of processor i. Then invariant 3 says that I must be in < 0 mo-i2e , and thus I < 1 mo-i2e S, i.e., the invariant holds.
• Invariant 9: We need to consider the influence on both loads that read S and loads that read stores other than S.
a) Consider load L for address a that reads from S, i.e., S $\xrightarrow{\text{rf-i2e}}$ L. Since S is not in < 0 mo-i2e before R is performed, invariant 9a says that L must be executed by processor i, S < po-i2e L, and there is no store S′ for a in the store buffer of processor i such that S < po-i2e S′ < po-i2e L. Now we show that both invariants still hold for S $\xrightarrow{\text{rf-i2e}}$ L.
-Invariant 9a: This is not affected because S is in < 1 mo-i2e after R is performed.
-Invariant 9b: Since S < po-i2e L and S is the youngest in < 1 mo-i2e , S satisfies the max mo-i2e formula. We prove the rest of this invariant by contradiction, i.e., we assume there is store S′ for a in the store buffer of processor i after R is performed such that S′ < po-i2e L. Note that < po-i2e is not changed by R. The predicate of R ensures that S is the oldest store for a in the store buffer. Invariant 4 says that S < po-i2e S′. Now we have S < po-i2e S′ < po-i2e L (before R is performed), contradicting invariant 9a.
b) Consider load L for address a from processor j that reads from store S * (≠ S), i.e., S ≠ S * $\xrightarrow{\text{rf-i2e}}$ L. Now we show that both invariants still hold for S * $\xrightarrow{\text{rf-i2e}}$ L.
-Invariant 9a: This invariant cannot be affected, because performing R can only remove a store from a store buffer.
-Invariant 9b: This invariant can only be affected when S * is in < 0 mo-i2e . Since R can only remove a store from a store buffer, the second half of this invariant is not affected (i.e., no store S′ in the store buffer and so on). We only need to focus on the max mo-i2e formula, i.e.,
mo-i2e S, this formula can only be affected when S < po-i2e L and i = j. In this case, before R is performed, S is in the store buffer of processor i, and S < po-i2e L, and L reads from S * ≠ S. This contradicts invariant 9b, which is assumed to hold before R is performed. Thus, the max mo-i2e formula cannot be affected either, i.e., the invariant holds.
6) WMM-Ld that reads from local store buffer: Assume R executes a load L for address a in processor i, and L reads from store S in the local store buffer. R appends L to the < po-i2e of processor i, appends L to < mo-i2e , and adds S $\xrightarrow{\text{rf-i2e}}$ L. Note that R does not change any invalidation buffer or store buffer. Now we consider each invariant.
• Invariants 1, 3, 4, 5, 7: These are not affected.
• Invariant 2: This still holds because $L$ is added to $<_{\text{mo-i2e}}$.
• Invariant 6: We consider each invariant.
-Invariants 6a, 6b, 6c: These are not affected.
-Invariant 6d: L can only influence stale values for a in the invalidation buffer of processor i. However, since S is in the store buffer of processor i before R is performed, invariant 5 says that the invalidation buffer of processor i cannot contain address a. Therefore this invariant still holds.
• Invariant 8: We consider instruction $I$ such that $I <^{1}_{\text{po-i2e}} L$ and $order(I, L)$. Since $order(I, L)$, $I$ can only be a Reconcile fence or a load for $a$. In either case, invariant 2 says that $I$ is in $<^{0}_{\text{mo-i2e}}$. Since $L$ is appended to the end of $<_{\text{mo-i2e}}$, $I <^{1}_{\text{mo-i2e}} L$, i.e., the invariant still holds.
• Invariant 9: Since $R$ does not change any store buffer or any load/store already in $<^{0}_{\text{mo-i2e}}$, $R$ cannot affect this invariant for loads other than $L$. We only need to show that $S \xrightarrow{\text{rf-i2e}} L$ satisfies this invariant. Since $S$ is in the store buffer, invariant 3 says that $S$ is not in $<_{\text{mo-i2e}}$. Therefore we only need to consider invariant 9a. We prove by contradiction, i.e., we assume there is a store $S'$ for $a$ in the store buffer of processor $i$ and $S <^{1}_{\text{po-i2e}} S' <^{1}_{\text{po-i2e}} L$. Since $R$ does not change store buffer states, $S$ and $S'$ are both in the store buffer before $R$ is performed. We also have $S <^{0}_{\text{po-i2e}} S'$ (because the only change in $<_{\text{po-i2e}}$ is to append $L$ to the end). According to the predicate of $R$, $S$ should be younger than $S'$, so $S' <^{0}_{\text{po-i2e}} S$ (according to invariant 4), contradicting the previous conclusion. Therefore, the invariant still holds.
7) WMM-Ld that reads from atomic memory: Assume $R$ executes a load $L$ for address $a$ in processor $i$, and $L$ reads from atomic memory $m[a]$. $R$ appends $L$ to the $<_{\text{po-i2e}}$ of processor $i$, appends $L$ to $<_{\text{mo-i2e}}$, adds $m[a].source \xrightarrow{\text{rf-i2e}} L$, and may remove stale values from the invalidation buffer of processor $i$. Now we consider each invariant.
• Invariants 1, 3, 4: These are not affected.
• Invariants 5, 7: These are not affected because $R$ only removes values from an invalidation buffer.
-Invariants 6a, 6b, 6c: These are not affected because $R$ only removes values from an invalidation buffer.
-Invariant 6d: $L$ can only influence stale values for $a$ in the invalidation buffer of processor $i$. However, $R$ will remove address $a$ from the invalidation buffer of processor $i$. Therefore this invariant still holds.
• Invariant 9: Since $R$ does not change any store buffer or any load/store already in $<^{0}_{\text{mo-i2e}}$, $R$ cannot affect this invariant for loads other than $L$. We only need to show that m[a].source
is not changed before and after $R$ is performed). According to invariant 1, $m[a].source$ is the youngest store for $a$ in $<^{0}_{\text{mo-i2e}}$. Therefore we only need to consider invariant 9b. Since we also have
.source, i.e., the first half of the invariant holds. The predicate of $R$ ensures that there is no store for $a$ in the store buffer of processor $i$, so the second half of the invariant also holds.
8) WMM-Ld that reads from the invalidation buffer: Assume $R$ executes a load $L$ for address $a$ in processor $i$, and $L$ reads from the stale value $\langle a, v \rangle$ in the local invalidation buffer. $R$ appends $L$ to the $<_{\text{po-i2e}}$ of processor $i$, appends $L$ to $<_{\text{mo-i2e}}$, adds v.source
• Invariants 5, 7: These are not affected because R can only remove values from an invalidation buffer.
-Invariants 6a, 6b, 6c: These are not affected because $R$ can only remove values from an invalidation buffer.
-Invariant 6d: Only stale values in the invalidation buffer of processor $i$ can be affected. Consider stale value $\langle a, v \rangle$ in the invalidation buffer of processor $i$ after $R$ is performed. We need to show that v.source <
• Invariant 9: Since $R$ does not change any store buffer or any load/store already in $<^{0}_{\text{mo-i2e}}$, $R$ cannot affect this invariant for loads other than $L$. We only need to show that v.source
Since $v.source$ is in $<^{0}_{\text{mo-i2e}}$, we only need to consider invariant 9b. The predicate of $R$ ensures that the store buffer of processor $i$ cannot contain address $a$, so the second half of the invariant holds (i.e., there is no $S'$ and so on). Now we prove the first half of the invariant, i.e., consider
L)}. Consider any store $S$ that is also in this set,
overwrite. In either case, we have $S <_{\text{mo-i2e}} v.overwrite$. Since we have proved invariant 6a holds after $R$ is performed, either $S = v.source$ or $S <_{\text{mo-i2e}} v.overwrite$. Therefore, $\max_{\text{mo-i2e}}$ will return $v.source$.

It is easy to see that at the end of the I$^2$E execution (of a program), there is no instruction left to execute in any processor and all store buffers are empty (i.e., all executed loads, stores and fences are in $<_{\text{mo-i2e}}$). At that time, if we define the axiomatic relations $<_{po}$, $<_{mo}$, $\xrightarrow{rf}$ as $<_{\text{po-i2e}}$, $<_{\text{mo-i2e}}$, $\xrightarrow{\text{rf-i2e}}$ respectively, then invariants 8 and 9b become the Inst-Order and Ld-Val axioms respectively. That is, $<_{\text{po-i2e}}$, $<_{\text{mo-i2e}}$, $\xrightarrow{\text{rf-i2e}}$ are relations that satisfy the WMM axioms and have the same program behavior as the I$^2$E execution.
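For reference, the two WMM axioms that the remainder of this proof invokes can be written as below. This is a restatement inferred from how the axioms are applied in this appendix, not a quotation of their formal definition; the names $I_1$, $I_2$, $v$ and $v'$ are chosen here for illustration.

```latex
\begin{align*}
\textbf{Inst-Order:}\quad
  & I_1 <_{po} I_2 \;\wedge\; order(I_1, I_2) \;\Longrightarrow\; I_1 <_{mo} I_2 \\
\textbf{Ld-Val:}\quad
  & \mathsf{St}\ a\ v \xrightarrow{rf} \mathsf{Ld}\ a \;\Longrightarrow\;
    \mathsf{St}\ a\ v = \max\nolimits_{<_{mo}}
    \{\, \mathsf{St}\ a\ v' \mid \mathsf{St}\ a\ v' <_{mo} \mathsf{Ld}\ a
       \;\vee\; \mathsf{St}\ a\ v' <_{po} \mathsf{Ld}\ a \,\}
\end{align*}
```

Read this way, a load may take its value from a store that is later than the load in $<_{mo}$ only if that store precedes the load in $<_{po}$, which is the store-buffer forwarding case used repeatedly below.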
The goal is that for any axiomatic relations $<_{po}$, $<_{mo}$, $\xrightarrow{rf}$ that satisfy the WMM axioms, we can run the same program in the I$^2$E model and get the same program behavior. We will devise an algorithm to operate the I$^2$E model to get the same program behavior as in the axiomatic relations $<_{po}$, $<_{mo}$, $\xrightarrow{rf}$. In the algorithm, for each instruction in the I$^2$E model, we need to find its corresponding instruction in the $<_{po}$ of the axiomatic relations. Note that this mapping should be a one-to-one mapping, i.e., one instruction in the I$^2$E model corresponds to exactly one instruction in the axiomatic relations and vice versa, so we do not distinguish between the directions of the mapping. The algorithm creates this mapping incrementally. Initially (i.e., before the I$^2$E model performs any operation), for each processor $i$, we only map the next instruction to execute in processor $i$ of the I$^2$E model to the oldest instruction in the $<_{po}$ of processor $i$ in the axiomatic relations. After the algorithm starts to operate the I$^2$E model, whenever we have executed an instruction in a processor in the I$^2$E model, we map the next instruction to execute in that processor in the I$^2$E model to the oldest unmapped instruction in the $<_{po}$ of that processor in the axiomatic relations. The mapping scheme obviously has the following two properties:
• The $k$-th executed instruction in a processor in the I$^2$E model is mapped to the $k$-th oldest instruction in the $<_{po}$ of that processor in the axiomatic relations.
• In the I$^2$E model, when a processor has executed $x$ instructions, only the first $x + 1$ instructions (i.e., the executed $x$ instructions and the next instruction to execute) of that processor are mapped to instructions in the axiomatic relations.
Of course, later in the proof, we will show that the two corresponding instructions (one in the I$^2$E model and the other in the axiomatic relations) have the same instruction types, same load/store addresses (if they are memory accesses), same store data (if they are stores), and same execution results. In the following, we treat the action of adding new instruction mappings as an implicit procedure in the algorithm, so we do not restate it each time we explain the algorithm. When there is no ambiguity, we do not distinguish an instruction in the I$^2$E model from an instruction in the axiomatic relations if these two instructions correspond to each other (i.e., the algorithm has built the mapping between them).

Now we give the details of the algorithm. The algorithm begins with the I$^2$E model (in initial state), an empty set $Z$, and a queue $Q$ which contains all the memory and fence instructions in $<_{mo}$. The order of instructions in $Q$ is the same as $<_{mo}$, i.e., the head of $Q$ is the oldest instruction in $<_{mo}$. The instructions in $Q$ and $Z$ are all considered as instructions in the axiomatic relations. In each step of the algorithm, we perform one of the following actions:
1) If the next instruction of some processor in the I$^2$E model is a non-memory instruction, then we perform the WMM-Nm operation to execute it in the I$^2$E model.
2) Otherwise, if the next instruction of some processor in the I$^2$E model is a store, then we perform the WMM-St operation to execute that store in the I$^2$E model.
3) Otherwise, if the next instruction of some processor in the I$^2$E model is mapped to a load $L$ in set $Z$, then we perform the WMM-Ld operation to execute $L$ in the I$^2$E model, and we remove $L$ from $Z$.
4) Otherwise, we pop instruction $I$ from the head of $Q$ and process it in the following way:
a) If $I$ is a store, then $I$ must have been mapped to a store in some store buffer (we will prove this), and we perform the WMM-DeqSb operation to dequeue $I$ from the store buffer in the I$^2$E model.
b) If $I$ is a Reconcile fence, then $I$ must have been mapped to the next instruction to execute in some processor (we will prove this), and we perform the WMM-Rec operation to execute $I$ in the I$^2$E model.
c) If $I$ is a Commit fence, then $I$ must have been mapped to the next instruction to execute in some processor (we will prove this), and we perform the WMM-Com operation to execute $I$ in the I$^2$E model.
d) $I$ must be a load in this case. If $I$ has been mapped, then it must be mapped to the next instruction to execute in some processor in the I$^2$E model (we will prove this), and we perform the WMM-Ld operation to execute $I$ in the I$^2$E model. Otherwise, we just add $I$ into set $Z$.
For proof purposes, we introduce the following ghost states to the I$^2$E model:
• For proof purposes, we define a function $overwrite$. For each store $S$ in $<_{mo}$, $overwrite(S)$ returns the store for the same address such that
• $S <_{mo} overwrite(S)$, and
• there is no store $S'$ for the same address such that $S <_{mo} S' <_{mo} overwrite(S)$.
In other words, $overwrite(S)$ returns the store that overwrites $S$ in $<_{mo}$. ($overwrite(S)$ does not exist if $S$ is the last store for its address in $<_{mo}$.)
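The definition of overwrite can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the memory order is given as a list containing just the stores, oldest first; the `Store` type and its `name` and `addr` fields are hypothetical names that do not come from the model.

```python
from typing import NamedTuple, Optional, Sequence


class Store(NamedTuple):
    """A store instruction; `name` and `addr` are illustrative fields."""
    name: str
    addr: str


def overwrite(mo: Sequence[Store], s: Store) -> Optional[Store]:
    """Return the first store after `s` in `mo` that has the same address.

    Returns None when `s` is the last store for its address, which is the
    case where overwrite(S) does not exist.
    """
    position = mo.index(s)
    for later in mo[position + 1:]:
        if later.addr == s.addr:
            return later
    return None


# Example memory order: S1 and S3 write address "a", S2 writes address "b".
s1, s2, s3 = Store("S1", "a"), Store("S2", "b"), Store("S3", "a")
mo = [s1, s2, s3]
```

In this example `overwrite(mo, s1)` is `s3`, because `s2` writes a different address, while `overwrite(mo, s2)` and `overwrite(mo, s3)` do not exist.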
Also for proof purposes, at each time in the algorithm, we use $V_i$ to represent the set of every store $S$ in $<_{mo}$ that satisfies all the following requirements:
1) The store buffer of processor $i$ does not contain the address of $S$.
2) $overwrite(S)$ exists and $overwrite(S)$ has been popped from $Q$.
3) For each Reconcile fence $F$ that has been executed by processor $i$ in the I$^2$E model, $F <_{mo} overwrite(S)$.
4) For each store $S'$ for the same address that has been executed by processor $i$ in the I$^2$E model, $S' <_{mo} overwrite(S)$.
5) For each load $L$ for the same address that has been executed by processor $i$ in the I$^2$E model, if store $S' \xrightarrow{rf} L$ in the axiomatic relations, then $S' <_{mo} overwrite(S)$.

With the above definitions and new states, we introduce the invariants of the algorithm. After each step of the algorithm, we have the following invariants for the states of the I$^2$E model, $Z$ and $Q$:
1) For each processor $i$, the execution order of all executed instructions in processor $i$ in the I$^2$E model is a prefix of the $<_{po}$ of processor $i$ in the axiomatic relations.
2) The predicate of any operation performed in this step is satisfied.
3) If we perform an operation to execute an instruction in the I$^2$E model in this step, the operation is able to get the same instruction result as that of the corresponding instruction in the axiomatic relations.
4) The instruction type, load/store address, and store data of every mapped instruction in the I$^2$E model are the same as those of the corresponding instruction in the axiomatic relations.
5) All loads that have been executed in the I$^2$E model are mapped exactly to all loads in $<_{mo}$ but not in $Q$ or $Z$.
6) All fences that have been executed in processor $i$ are mapped exactly to all fences in $<_{mo}$ but not in $Q$.
7) All stores that have been executed and dequeued from the store buffers in the I$^2$E model are mapped exactly to all stores in $<_{mo}$ but not in $Q$.
8) For each address $a$, $m[a].source$ in the I$^2$E model is mapped to the youngest store for $a$, which has been popped from $Q$, in $<_{mo}$.
9) For each processor $i$, the store buffer of processor $i$ contains exactly every store that has been executed in the I$^2$E model but is still in $Q$.
10) For any two stores $S_1$ and $S_2$ for the same address in the store buffer of any processor $i$ in the I$^2$E model, if $S_1$ is older than $S_2$ in the store buffer, then $S_1 <_{po} S_2$.
11) For any processor $i$ and any address $a$, address $a$ cannot be present in the store buffer and invalidation buffer of processor $i$ at the same time.
12) For any processor $i$, for each store $S$ in $V_i$, the invalidation buffer of processor $i$ contains an entry whose source field is mapped to $S$.
13) For any stale value $\langle a, v \rangle$ in any invalidation buffer, $v.source$ has been mapped to a store in $<_{mo}$, $overwrite(v.source)$ exists, and $overwrite(v.source)$ is not in $Q$.
14) For any two stale values $v_1$ and $v_2$ for the same address in the invalidation buffer of any processor $i$ in the I$^2$E model, if $v_1$ is older than $v_2$ in the invalidation buffer, then $v_1.source <_{mo} v_2.source$.
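The five requirements that define the set $V_i$ (given before the invariant list and used in invariant 12) can be checked mechanically. The following Python sketch is one way to do so under stated assumptions: the `Instr` type, the function name `in_V`, and the way the state is passed in (a list for $<_{mo}$, a set of store-buffer addresses, a set of popped instructions, a list of instructions executed by processor $i$, and a dictionary for the read-from relation) are all hypothetical, and every load is assumed to read from a store that appears in $<_{mo}$.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Instr(NamedTuple):
    """Illustrative instruction record (not part of the model)."""
    name: str
    kind: str            # "st" (store), "ld" (load) or "rec" (Reconcile fence)
    addr: Optional[str]  # None for fences


def overwrite(mo: List[Instr], s: Instr) -> Optional[Instr]:
    """First store after `s` in `mo` for the same address, or None."""
    for later in mo[mo.index(s) + 1:]:
        if later.kind == "st" and later.addr == s.addr:
            return later
    return None


def in_V(s: Instr, mo: List[Instr], sb_addrs: Set[str], popped: Set[Instr],
         executed: List[Instr], rf: Dict[Instr, Instr]) -> bool:
    """Check requirements 1-5 for store `s` with respect to one processor i.

    sb_addrs: addresses present in the store buffer of processor i
    popped:   instructions already popped from Q
    executed: instructions executed so far by processor i
    rf:       maps each executed load to the store it reads from
    """
    pos = {ins: k for k, ins in enumerate(mo)}
    ow = overwrite(mo, s)
    if s.addr in sb_addrs:                       # requirement 1
        return False
    if ow is None or ow not in popped:           # requirement 2
        return False
    for ins in executed:
        if ins.kind == "rec":                    # requirement 3
            if not pos[ins] < pos[ow]:
                return False
        elif ins.addr == s.addr:
            # requirement 4 uses the store itself, requirement 5 the store
            # that the load reads from
            src = ins if ins.kind == "st" else rf[ins]
            if not pos[src] < pos[ow]:
                return False
    return True


# Example: two stores to "a" from other processors, then a Reconcile fence.
s1, s2 = Instr("S1", "st", "a"), Instr("S2", "st", "a")
f = Instr("F", "rec", None)
mo = [s1, s2, f]
```

With this example, `s1` is in the set once `s1` and `s2` have been popped and processor $i$ has executed nothing; it leaves the set as soon as processor $i$ executes the fence `f`, or while the store buffer of processor $i$ holds address `"a"`.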
These invariants guarantee that the algorithm will operate the I$^2$E model to produce the same program behavior as the axiomatic model. We now prove inductively that all invariants hold after each step of the algorithm, i.e., we assume all invariants hold before the step. In case a state is changed in this step, we use superscript 0 to denote the state before this step (e.g., $Q^0$) and use superscript 1 to denote the state after this step (e.g., $Q^1$). We consider which action is performed in this step.
• Action 1: We perform a WMM-Nm operation that executes a non-memory instruction in the I$^2$E model. All the invariants still hold after this step.
• Action 2: We perform a WMM-St operation that executes a store $S$ for address $a$ in processor $i$ in the I$^2$E model. We consider each invariant.
-Invariants 1, 2, 4: These invariants obviously hold.
-Invariants 3, 5, 6, 7, 8: These are not affected.
-Invariant 9: Note that $S$ is mapped before this step. Since $S$ cannot be dequeued from the store buffer before this step, invariant 7 says that $S$ is still in $Q$. Thus, this invariant holds.
-Invariant 10: Since $S$ is the youngest store in the store buffer and invariant 1 holds after this step, this invariant also holds.
-Invariant 11: Since the WMM-St operation removes all stale values for $a$ from the invalidation buffer of processor $i$, this invariant still holds.
-Invariant 12: For any processor $j$ ($j \neq i$), the action in this step cannot change $V_j$ or the invalidation buffer of processor $j$. We only need to consider processor $i$. The action in this step cannot introduce any new store into $V_i$, i.e., $V_i^1 \subseteq V_i^0$. Also notice that $V_i^1$ does not contain any store for $a$ due to requirement 1. Since the action in this step only removes values for address $a$ from the invalidation buffer of processor $i$, this invariant still holds for $i$.
-Invariants 13, 14: These still hold, because we can only remove values from the invalidation buffer in this step.
• Action 3: We perform a WMM-Ld operation that executes a load $L$ in $Z$. (Note that $L$ has been popped from $Q$ before.) We assume $L$ is in processor $i$ (both the axiomatic relations and the I$^2$E model agree on this because of the way we create mappings). We also assume that $L$ has address $a$ in the axiomatic relations, and that store $S \xrightarrow{rf} L$ in the axiomatic relations. According to invariant 4, $L$ also has load address $a$ in the I$^2$E model. We first consider several simple invariants:
-Invariants 1, 2, 5: These invariants obviously hold.
-Invariants 6, 7, 8, 9, 10: These are not affected.
-Invariant 11: Since the WMM-Ld operation does not change store buffers and can only remove values from the invalidation buffers, this invariant still holds.
-Invariants 13, 14: These still hold, because we can only remove values from the invalidation buffer in this step.
We now consider the remaining invariants, i.e., 3, 4 and 12, according to the current state of $Q$ (note that $Q$ is not changed in this step):
1) $S$ is in $Q$: We show that the WMM-Ld operation can read the value of $S$ from the store buffer of processor $i$ in the I$^2$E model. We first show that $S$ is in the store buffer of processor $i$. Since $L$ is not in $Q$, we have $L <_{mo} S$. According to the Ld-Val axiom, we know $S <_{po} L$, so $S$ must have been executed. Since $S$ is in $Q$, invariant 7 says that $S$ cannot be dequeued from the store buffer, i.e., $S$ is in the store buffer of processor $i$. Now we prove that $S$ is the youngest store for $a$ in the store buffer of processor $i$ by contradiction, i.e., we assume there is another store $S'$ for $a$ which is in the store buffer of processor $i$ and is younger than $S$. Invariant 10 says that $S <_{po} S'$. Since $S$ and $S'$ are stores for the same address, the Inst-Order axiom says that $S <_{mo} S'$. Since $S'$ is in the store buffer, it is executed before $L$. According to invariant 1, $S' <_{po} L$. Then $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom. Now we can prove the invariants:
-Invariant 3: This holds because the WMM-Ld operation reads $S$ from the store buffer.
-Invariant 4: This holds because invariant 3 holds after this step.
-Invariant 12: The execution of $L$ in this step cannot introduce new stores into $V_j$ for any $j$, i.e., $V_j^1 \subseteq V_j^0$. Since there is no change to any invalidation buffer when WMM-Ld reads from the store buffer, this invariant still holds.
2) $S$ is not in $Q$ but $overwrite(S)$ is in $Q$: We show that the WMM-Ld operation can read the value of $S$ from the atomic memory. Since $S$ has been popped from $Q$ while $overwrite(S)$ has not, $S$ is the youngest store for $a$ in $<_{mo}$ that has been popped from $Q$. According to invariant 8, the current $m[a].source$ in the I$^2$E model is $S$. To let WMM-Ld read $m[a]$, we only need to show that the store buffer of processor $i$ does not contain any store for $a$. We prove by contradiction, i.e., we assume there is a store $S'$ for $a$ in the store buffer of processor $i$. According to invariant 9, $S'$ has been executed in the I$^2$E model, and $S'$ is still in $Q$. Thus, we have $S' <_{po} L$ (according to invariant 1), and $S <_{mo} S'$. Then $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom.
-Invariant 12: Since there is no change to any invalidation buffer of any processor other than $i$, we only need to consider processor $i$. The WMM-Ld removes all values for $a$ from the invalidation buffer of processor $i$, so the goal is to show that there is no store for $a$ in $V_i^1$. We prove by contradiction, i.e., assume there is a store $S'$ for $a$ in $V_i^1$. Requirement 5 for $V_i$ says that $S <_{mo} overwrite(S')$. Since $overwrite(S)$ is in $Q$, $overwrite(S')$ is also in $Q$. This contradicts requirement 2. Therefore, there is no store for $a$ in $V_i^1$, and this invariant holds.
3) Both $S$ and $overwrite(S)$ are not in $Q$: We show that the WMM-Ld operation can read the value of $S$ from the invalidation buffer of processor $i$. That is, we need to show $S \in V_i^0$. We now prove that $S$ satisfies all the requirements for $V_i^0$:
-Requirement 1: We prove by contradiction, i.e., we assume there is a store $S'$ for $a$ in the store buffer of processor $i$. Invariant 9 says that $S'$ has been executed and is still in $Q$. Then we have $S' <_{po} L$ (invariant 1) and $S <_{mo} S'$. Then $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom.
-Requirement 2: This is satisfied because we assume $overwrite(S)$ is not in $Q$.
-Requirement 3: We prove by contradiction, i.e., we assume that Reconcile fence $F$ has been executed by processor $i$, and $overwrite(S) <_{mo} F$. Since $F$ is executed before $L$, invariant 1 says that $F <_{po} L$. Since $order(F, L)$, the Inst-Order axiom says that $F <_{mo} L$. Now we have $S <_{mo} overwrite(S) <_{mo} F <_{mo} L$. Thus, $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom.
-Requirement 4: We prove by contradiction, i.e., we assume that store $S'$ for $a$ has been executed by processor $i$, and either $S' = overwrite(S)$ or $overwrite(S) <_{mo} S'$. According to the definition of overwrite, we have $S <_{mo} S'$. Since $S'$ has been executed, invariant 1 says that $S' <_{po} L$. Then $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom.
-Requirement 5: We prove by contradiction, i.e., we assume that store $S'$ and load $L'$ are both for address $a$, $L'$ has been executed by processor $i$, $S' \xrightarrow{rf} L'$, and either $S' = overwrite(S)$ or $overwrite(S) <_{mo} S'$. According to the definition of overwrite, we have
Since there is no change to any invalidation buffer of any processor other than $i$, we only need to consider processor $i$. Assume the invalidation buffer entry read by the WMM-Ld operation is $\langle a, v \rangle$, and $v.source = S$. The WMM-Ld rule removes any stale value $\langle a, v' \rangle$ that is older than $\langle a, v \rangle$ from the invalidation buffer of processor $i$. The goal is to show that $v'.source$ cannot be in $V_i^1$. We prove by contradiction, i.e., we assume that $v'.source \in V_i^1$. Since $L$ has been executed after this step, requirement 5 says that $S <_{mo} overwrite(v'.source)$. Since $v'$ is older than $v$ in the invalidation buffer before this step, invariant 14 says that $v'.source <_{mo} v.source = S$. The above two statements contradict each other. Therefore, this invariant still holds.
• Action 4a: We pop a store $S$ from the head of $Q$, and we perform a WMM-DeqSb operation to dequeue $S$ from the store buffer. Assume that $S$ is for address $a$, and in processor $i$ in the axiomatic relations. We first prove that $S$ has been mapped before this step. We prove by contradiction, i.e., we assume $S$ has not been mapped to any instruction in the I$^2$E model before this step. Consider the state right before this step. Let $I$ be the next instruction to execute in processor $i$ in the I$^2$E model. We know $I$ is mapped and $I <_{po} S$. The condition for performing action 4a in this step says that $I$ can only be a fence or load, and we have $order(I, S)$. According to the Inst-Order axiom, $I <_{mo} S$, so $I$ has been popped from $Q^0$.
1) If $I$ is a fence, since $I$ is in $<_{mo}$ but not in $Q^0$, invariant 6 says that $I$ must be executed, contradicting our assumption that $I$ is the next instruction to execute.
2) If $I$ is a load, since $I$ is not executed, $I$ is in $<_{mo}$, and $I$ is not in $Q^0$, invariant 5 says that $I$ must be in $Z^0$. Then this algorithm step should use action 3 instead of action 4a.
Due to the contradictions, we know $S$ must have been mapped. Note that the next instruction to execute in processor $i$ cannot be a store, because otherwise this step would use action 2. According to invariant 4, $S$ cannot be mapped to the next instruction to execute in processor $i$. Therefore $S$ must have been executed in processor $i$ in the I$^2$E model before this step. Also according to invariant 4, the address and data of $S$ in the I$^2$E model are the same as those in the axiomatic relations. Now we consider each invariant.
-Invariants 1, 4, 7, 9, 10: These invariants obviously hold.
-Invariants 3, 5, 6: These are not affected.
-Invariant 2: We prove by contradiction, i.e., we assume there is a store $S'$ for $a$ older than $S$ in the store buffer of processor $i$ (before this step). According to invariant 9, $S'$ is in $Q$. Since $S$ is the head of $Q^0$, $S <_{mo} S'$. According to invariant 10, $S' <_{po} S$. Since $order(S', S)$, $S' <_{mo} S$, contradicting the previous statement. Thus, the predicate of the WMM-DeqSb operation is satisfied, and the invariant holds.
-Invariant 8: $S$ is the youngest instruction in $<_{mo}$ that has been popped from $Q$, and $m[a].source$ is updated to $S$. Thus, this invariant still holds.
-Invariant 11: Since the WMM-DeqSb operation will not insert the stale value into an invalidation buffer of a processor if the store buffer of that processor contains the same address, this invariant still holds.
-Invariant 12: For any processor $j$, the action in this step will not remove stores from $V_j$ but may introduce new stores to $V_j$, i.e., $V_j^0 \subseteq V_j^1$. We consider the following two cases.
1) Processor $i$: We show that $V_i^1 = V_i^0$. We prove by contradiction, i.e., we assume there is a store $S'$ such that $S' \in V_i^1$ but $S' \notin V_i^0$. Since $S'$ satisfies requirements 3, 4, 5 after this step, it also satisfies these three requirements before this step. Then $S'$ must fail to meet at least one of requirements 1 and 2 before this step.
a) If $S'$ does not meet requirement 1 before this step, then $S'.addr$ is in the store buffer of processor $i$ before this step and $S'.addr$ is not in this store buffer after this step. Thus, $S'.addr$ must be $a$. Since $S'$ meets requirement 4 before this step and $S$ has been executed by processor $i$ before this step, we know $S <_{mo} overwrite(S')$. Since $S'$ meets requirement 2 after this step, $overwrite(S')$ is not in $Q^1$. Since $Q^1$ is derived by popping the oldest store from $Q^0$, we know $S$ is not in $Q^0$. Since $S$ is in the store buffer before this step, this contradicts invariant 9. Therefore this case is impossible.
b) If $S'$ does not meet requirement 2 before this step, then $overwrite(S')$ is in $Q^0$ but not in $Q^1$. Then $overwrite(S') = S$. Since $S$ has been executed by processor $i$, $S'$ will fail to meet requirement 4 after this step. This contradicts $S' \in V_i^1$, so this case is also impossible.
Now we have proved that $V_i^1 = V_i^0$. Since the WMM-DeqSb operation does not change the invalidation buffer of processor $i$, this invariant holds for processor $i$.
2) Processor $j$ ($\neq i$): We consider any store $S'$ such that $S' \in V_j^1$ but $S' \notin V_j^0$. Since $S'$ satisfies requirements 1, 3, 4, 5 after this step, it also satisfies these four requirements before this step. Then $S'$ must fail to meet requirement 2 before this step, i.e., $overwrite(S')$ is in $Q^0$ but not in $Q^1$. Then $overwrite(S') = S$. According to invariant 8, we know $S'$ is $m[a].source$. Consider the following two cases.
a) The store buffer of processor $j$ contains address $a$: Since $S'$ cannot meet requirement 1, $V_j^1 = V_j^0$. Since WMM-DeqSb cannot remove any value from the invalidation buffer of processor $j$, this invariant holds.
b) The store buffer of processor $j$ does not contain address $a$: In this case, the WMM-DeqSb operation will insert stale value $\langle a, m[a]^0 \rangle$ into the invalidation buffer of processor $j$, so the invariant still holds.
• Action 4b: We pop a Reconcile fence $F$ from $Q$, and perform a WMM-Rec operation to execute it in the I$^2$E model. Assume $F$ is in processor $i$ in the axiomatic relations. We first prove that $F$ has been mapped before this step. We prove by contradiction, i.e., we assume $F$ is not mapped before this step. Consider the state right before this step. Let $I$ be the next instruction to execute in processor $i$ in the I$^2$E model. We know $I$ is mapped and $I <_{po} F$. The condition for performing action 4b in this step says that $I$ can only be a fence or load, so we have $order(I, F)$. According to the Inst-Order axiom, $I <_{mo} F$, so $I$ has been popped from $Q^0$.
1) If $I$ is a fence, since $I$ is in $<_{mo}$ but not in $Q^0$, invariant 6 says that $I$ must be executed, contradicting our assumption that $I$ is the next instruction to execute.
2) If $I$ is a load, since $I$ is not executed, $I$ is in $<_{mo}$, and $I$ is not in $Q^0$, invariant 5 says that $I$ must be in $Z^0$. Then this algorithm step should use action 3 instead of action 4b.
Due to the contradictions, we know $F$ must have been mapped before this step. According to invariant 6, since $F$ is in $Q^0$, $F$ must not have been executed in the I$^2$E model. Thus, $F$ is mapped to the next instruction to execute in processor $i$ in the I$^2$E model. Now we consider each invariant:
-Invariants 1, 2, 4, 6: These obviously hold.
-Invariants 3, 5, 7, 8, 9, 10: These are not affected.
-Invariants 11, 13, 14: These invariants hold, because the invalidation buffer of any processor $j$ ($\neq i$) is not changed, and the invalidation buffer of processor $i$ is empty after this step.
-Invariant 12: For any processor $j$ ($\neq i$), $V_j^1 = V_j^0$ and the invalidation buffer of processor $j$ is not changed in this step. Thus, this invariant holds for any processor $j$ ($\neq i$). We now consider processor $i$. The invalidation buffer of processor $i$ is empty after this step, so we need to show that $V_i^1$ is empty. We prove by contradiction, i.e., we assume there is a store $S \in V_i^1$. Since $F$ has been executed in processor $i$ after this step, requirement 3 says that $F <_{mo} overwrite(S)$. Since $F$ is the head of $Q^0$, $overwrite(S)$ must be in $Q^1$. Then $S$ fails to meet requirement 2 after this step, contradicting $S \in V_i^1$. Therefore $V_i^1$ is empty, and this invariant also holds for processor $i$.
• Action 4c: We pop a Commit fence $F$ from $Q$, and perform a WMM-Com operation to execute it in the I$^2$E model. Assume $F$ is in processor $i$ in the axiomatic relations. Using the same argument as in the previous case, we can prove that $F$ is mapped to the next instruction to execute in processor $i$ in the I$^2$E model (before this step). Now we consider each invariant:
-Invariants 1, 4, 6: These obviously hold.
-Invariants 3, 5, 7, 8, 9, 10, 11, 12, 13, 14: These are not affected.
-Invariant 2: We prove by contradiction, i.e., we assume there is a store $S$ in the store buffer of processor $i$ before this step. According to invariant 9, $S$ has been executed in processor $i$ and is in $Q^0$. Thus, we have $S <_{po} F$. Since $order(S, F)$, the Inst-Order axiom says that $S <_{mo} F$. Then $F$ is not the head of $Q^0$, contradicting the fact that we pop $F$ from the head of $Q^0$.
• Action 4d: We pop a load $L$ from $Q$. Assume that $L$ is for address $a$ and is in processor $i$ in the axiomatic relations. If we add $L$ to $Z$, then all invariants obviously hold. We only need to consider the case that we perform a WMM-Ld operation to execute $L$ in the I$^2$E model. In this case, $L$ is mapped before this step. Since $L$ is in $Q^0$, according to invariant 5, $L$ must be mapped to an unexecuted instruction in the I$^2$E model. That is, $L$ is mapped to the next instruction to execute in processor $i$ in the I$^2$E model. Invariant 4 ensures that $L$ has the same load address in the I$^2$E model. We first consider several simple invariants:
-Invariants 1, 2, 5: These invariants obviously hold.
-Invariants 6, 7, 8, 9, 10: These are not affected.
-Invariant 11: Since the WMM-Ld operation does not change store buffers and can only remove values from the invalidation buffers, this invariant still holds.
-Invariants 13, 14: These still hold, because we can only remove values from the invalidation buffer in this step.
Assume store $S \xrightarrow{rf} L$ in the axiomatic relations. We prove the remaining invariants (i.e., 3, 4 and 12) according to the position of $S$ in $<_{mo}$.
1) $L <_{mo} S$: We show that the WMM-Ld operation can read $S$ from the store buffer of processor $i$ in the I$^2$E model. The Ld-Val axiom says that $S <_{po} L$. Then $S$ must have been executed in processor $i$ in the I$^2$E model according to invariant 1. Since $S$ is in $Q^0$, invariant 9 ensures that $S$ is in the store buffer of processor $i$ before this step. To let WMM-Ld read $S$ from the store buffer, we now only need to prove that $S$ is the youngest store for $a$ in the store buffer of processor $i$. We prove by contradiction, i.e., we assume there is another store $S'$ for $a$ which is in the store buffer of processor $i$ and is younger than $S$. Invariant 10 says that $S <_{po} S'$. Since $S$ and $S'$ are stores for the same address, the Inst-Order axiom says that $S <_{mo} S'$. Since $S'$ is in the store buffer, it is executed before $L$. According to invariant 1, $S' <_{po} L$. Then $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom. Now we can prove the invariants:
-Invariant 3: This holds because the WMM-Ld operation reads $S$ from the store buffer.
-Invariant 4: This holds because invariant 3 holds after this step.
-Invariant 12: The execution of $L$ in this step cannot introduce new stores into $V_j$ for any $j$, i.e., $V_j^1 \subseteq V_j^0$. Since there is no change to any invalidation buffer when WMM-Ld reads from the store buffer, this invariant still holds.
2) $S <_{mo} L$: We show that the WMM-Ld operation can read the value of $S$ from the atomic memory. Since $L$ is the head of $Q^0$, $S$ is not in $Q^0$. According to the Ld-Val axiom, there cannot be any store for $a$ between $S$ and $L$ in $<_{mo}$. Thus, $S$ is the youngest store for $a$ in $<_{mo}$ that has been popped from $Q$. According to invariant 8, the current $m[a].source$ in the I$^2$E model is $S$. To let WMM-Ld read $m[a]$, we only need to show that the store buffer of processor $i$ does not contain any store for $a$. We prove by contradiction, i.e., we assume there is a store $S'$ for $a$ in the store buffer of processor $i$. According to invariant 9, $S'$ has been executed in the I$^2$E model, and $S'$ is in $Q^0$. Thus, we have $S' <_{po} L$ (according to invariant 1), and $S <_{mo} S'$. Then $S \xrightarrow{rf} L$ contradicts the Ld-Val axiom.
-Invariant 12: Since there is no change to any invalidation buffer of any processor other than $i$, we only need to consider processor $i$. The WMM-Ld removes all values for $a$ from the invalidation buffer of processor $i$, so the goal is to show that there is no store for $a$ in $V_i^1$. We prove by contradiction, i.e., assume there is a store $S'$ for $a$ in $V_i^1$. Requirement 5 for $V_i$ says that $S <_{mo} overwrite(S')$. Since $S$ is the youngest store for $a$ that is in $<_{mo}$ but not in $Q^0$, $overwrite(S')$ must be in $Q^0$. This contradicts requirement 2. Therefore, there is no store for $a$ in V
