Write buffering is one of many successful mechanisms that improve the performance and scalability of multiprocessors. However, it leads to more complex memory system behavior, which cannot be described using intuitive consistency models, such as Sequential Consistency. It is crucial to provide programmers with a specification of the exact behavior of such complex memories. This article presents a uniform framework for describing systems at different levels of abstraction and proving their equivalence. The framework is used to derive and prove correct simple specifications, in terms of program-level instructions, of the SPARC total store order and partial store order memories.
INTRODUCTION
Distributed algorithm designers typically assume strong consistency models such as Sequential Consistency and Linearizability [Lamport 1979a; Herlihy and Wing 1990] . Multiprocessor architectures, however, incorporate mechanisms such as write-buffers to improve the performance and scalability. Such mechanisms lead to the relaxation of the resulting memory consistency model. These weakened consistency guarantees make it challenging for a programmer to be certain of the possible program outcomes and to implement correct and efficient distributed algorithms. Some attempted specifications of the behavior of such complex memory systems resulted in incomplete or incorrect conclusions. Thus, such specifications must be shown to exactly capture the operational behavior of the underlying memory architecture.
This article presents a framework that allows us to describe systems at various levels of abstraction, ranging from operational or architectural hardware functionality, to object-oriented or transaction-based abstractions. The uniform framework used for such descriptions facilitates the comparison of different models of consistency, so that claims of the relative strength of models can be stated precisely and proven correct. In particular, it provides the setting in which to prove that a nonoperational description of a multiprocessor system captures exactly the operational behavior of that system. The uniformity of the descriptions at various levels of abstraction means that proofs of correct implementations of abstract systems on systems with finer granularity of description can be composed. Starting from a highly abstract description, a sequence of provably correct implementations onto successively lower levels of abstraction can be constructed within the framework, eventually arriving at the actual operational machine. Thus, the framework can be seen as a generalization of the well-established notions of levels of abstraction and proofs of correctness of implementations, to include arbitrary memory consistency models.
The framework is used to describe, at an operational level, the execution of programs on possible write-buffer multiprocessor architectures. Three different write-buffer machines are considered. One represents the "bare minimum" write-buffer machine capturing the very basic behavior of write-buffers. The other two successively stronger machines capture the operational behavior of SPARC's partial and total store order machines [SPARC International, Inc. 1992; Weaver and Germond 2000] . Using the framework, three sets of increasingly stringent constraints on outcomes of program executions on these machines are specified.
The same framework is used to provide two more abstract (nonoperational) memory consistency models, corresponding to the SPARC's total and partial store order machines. These higher level models constitute the programmer's view of the behavior of multiprocess programs on the respective machines. For each of the two pairs, the operational and nonoperational descriptions are proven to be equivalent.
The SPARC version 9 manual [Weaver and Germond 2000] specifies a third memory consistency model called relaxed memory order, which seems to be intended to capture the executions of a still more permissive write-buffer architecture. Exploiting our framework, we show, however, that relaxed memory order does not correspond to any implementation on a write-buffer multiprocessor. In fact, we prove that Coherence [Dubois et al. 1986; Goodman 1989 ] or any weaker model does not correspond to a write-buffer architecture. We also prove that Alpha consistency [Compaq Computer Corporation 1998 ], which is Coherence combined with some dependencies [Attiya and Friedman 1994] that arise from processors' private registers, admits computations that cannot arise on any write-buffer machine. Thus, any implementation of Alpha consistency using write-buffers must produce a memory consistency model strictly stronger than that specified by the Alpha documentation [Compaq Computer Corporation 1998 ], even though this documentation suggests such an implementation.
Software system designers require an unambiguous description of the memory consistency conditions at the level of the operations used in their programs. This applies to the SPARC architecture, with which the UltraSPARC I, II, III, and IV processors are all compliant [Sun Microsystems 2004], since it continues to be one of the commercial choices for server machines. Typically, hardware architecture manuals use natural language or axiomatic specifications to describe how the hardware operates [SPARC International, Inc. 1992; Weaver and Germond 2000; Intel Corporation 2002; International Business Machines Corporation 1997]. These descriptions are not programmer centric and may contain ambiguities. They can lead to incorrect or inefficient programs and they are complex to work with. Frameworks have been defined to provide a way to formalize and unify descriptions of memory consistency models, and to help us reason about them. The literature contains many examples of such frameworks [Hoare 1972; Owicki and Gries 1976; Misra 1986; Lamport 1979b, 1997, 1986a, 1986b; Anger 1989; Attiya et al. 1998; Attiya and Friedman 1992; Friedman 1995; Ahamad et al. 1993, 1995; Kohli et al. 1993; Gibbons and Merritt 1992; Lynch and Tuttle 1989; Lynch 1996; Herlihy and Wing 1990; Adir et al. 2003]. A comparison of our framework to the closely related ones is provided in Subsection 2.3 after ours is defined.
Capturing the consistency of a multiprocessor architecture simply and precisely at a programmer oriented level is crucial for the development of correct and efficient programs targeted to these architectures. But this has proven to be surprisingly tricky for several machines. Erroneous definitions may lead to incorrect programs. Complex definitions lead programmers to unnecessarily and aggressively use expensive synchronization primitives, which negatively impact program performance. For example, one earlier attempt to define the semantics of the SPARC memory consistency model called total store order resulted in a definition that is stronger than what the machine level architecture description in the SUN Microsystems manuals [SPARC International, Inc. 1992; Weaver and Germond 2000] actually provides. In fact, any program using only read and write operations on variables that is correct for Sequential Consistency can be compiled into an equivalent program with only read/write operations that is correct for this strong definition [Higham and Kawash 2000]. Thus, using this definition, there is a solution to the mutual exclusion problem that uses only read-write variables. In earlier work [Higham and Kawash 2000; Kawash 2000], we exploit the definitions derived and proven correct in this article to prove, to the contrary, that read-write variables are insufficient to solve the mutual exclusion problem on a SPARC total store order machine. The total and partial store order definitions derived in this article are used in another work [Higham and Kawash 2005] to determine under what conditions wait-free producer-consumer coordination is possible or impossible in various SPARC models without resorting to expensive synchronization primitives, in spite of the impossibility of mutual exclusion.
The simplicity of the SPARC memory consistency descriptions provided in this article gives programmers an improved tool for reasoning about the outcomes of their programs. It also facilitates the comparison of the SPARC models to each other and to proposed consistency models including Processor Consistency [Goodman 1989 ], Causal Consistency [Ahamad et al. 1995] , and Java consistency [Gontmakher and Schuster 2000; Higham and Kawash 1998 ], and aids the development of verification tools [Park and Dill 1999] .
The rest of the article is organized as follows. Section 2 describes the framework that is used throughout the article. In Section 3, the framework is used to describe two write-buffer machines that capture the familiar total store order and partial store order semantics of a multiprocessor with write-buffers. Section 4 defines two nonoperational memory consistency specifications. It is then proved that the total store order and partial store order machines implement exactly these specifications, establishing that these two nonoperational specifications are correct. Section 5 establishes that these operational and nonoperational models are equivalent to the corresponding consistency models described in the SPARC manuals. A computation is given in Section 6 that is possible on any of the memory consistency models defined by the SPARC relaxed memory order, Alpha consistency, and Coherence, yet cannot arise from any write-buffer machine, even one substantially more permissive than a partial store order write-buffer machine. Thus, these consistency models are not equivalent to any write-buffer model. Section 7 summarizes and concludes.
A CONSISTENCY MODELLING FRAMEWORK

Describing Systems
A multiprocess system can be modelled as a collection of programs operating on a collection of shared data objects under some partial order constraints called a memory consistency model. In this section, we specify each of these components.
Two running examples are provided and will be used in subsequent sections of this article.
Shared Data Objects.
A shared data object can be described by providing an object's initial state, the operations that can be applied to it, and the change of state and response that results from each applicable operation. This gives rise to a set of allowable sequences of operations for each such object. So, we specify a shared data object to be a set of sequences of operations. An operation has the form out←ACT(obj,in) where ACT is an action with input parameters in the list 'in' applied to object 'obj' and that returns the output values in the list 'out'. If 'in' is non-empty the operation is a state-change operation and if 'out' is non-empty the operation is an output-generating operation. An operation could be both a state-change and an output-generating operation. An operation out←ACT(obj,in) has two components: its invocation is 'ACT(obj,in)' and its response is 'out'. For a more concise notation, if an operation does not generate an 'out' value, it is written as ACT(obj,in). If it does not require an input value, 'in', it is written as out←ACT(obj).
An arbitrary sequence of operations applied to object X is valid for X if and only if it is in the specification of X. An arbitrary sequence S of operations (applied to possibly several objects) is valid if and only if, for each object X, the subsequence of S consisting of exactly those operations applied to X is valid for X. We say that a sequence S of operations and the sequence of operation invocations formed by removing the response component of each operation in S are associated sequences.
Throughout, we assume that for each state-change operation the input parameter has a distinct value. A footnote will indicate when this assumption is being used. This is a common assumption when defining consistency models. It is not essential but removing it adds messiness. Section 7 comments further on this assumption and its implications. Also, for this article, any nonempty input or output list contains only one item and is therefore abbreviated by omitting list delimiters. An operation type is just an operation that has one or more parameters chosen from some given set. The set of possible output parameters is indicated by , input parameters by ·, and object parameters by ·. For example, the notation WRITE(x,·) denotes the set of WRITE operations on object x, or (depending on context) any element of that set; ←READ(·) denotes the set of all READ operations.
For most of this article, we consider only two types of objects: read/write objects (which we call variables) and list objects. Section 6 introduces a set object to support our impossibility results. Lists (respectively, sets) will be used to capture the behavior of ordered (respectively, unordered) write-buffers.
Example 1(a) -Variable. A variable, x, is the set of sequences over operations of the type WRITE(x,·) and ←READ(x) that satisfy the validity condition:
The output value returned by each READ operation is the same as the input value written by the most recent preceding WRITE operation in the sequence, if such a WRITE exists, and is ⊥ otherwise.
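The validity condition for a variable can be checked mechanically. The following is a minimal sketch, not part of the original framework: the tuple encoding of operations, the function name `valid_for_variable`, and the use of `None` to stand for ⊥ are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical encoding: ("WRITE", "x", 5) is WRITE(x,5) and
# ("READ", "x", 5) is 5←READ(x). None stands in for ⊥.
Op = Tuple[str, str, Optional[int]]


def valid_for_variable(seq: Sequence[Op]) -> bool:
    """Check the validity condition of Example 1(a) for a single variable."""
    current = None  # ⊥ until some WRITE appears in the sequence
    for kind, _obj, value in seq:
        if kind == "WRITE":
            current = value  # becomes the most recent preceding WRITE
        elif kind == "READ":
            if value != current:
                return False  # response differs from the most recent WRITE
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return True


ok = [("READ", "x", None), ("WRITE", "x", 1), ("READ", "x", 1)]
bad = [("WRITE", "x", 1), ("WRITE", "x", 2), ("READ", "x", 1)]
print(valid_for_variable(ok), valid_for_variable(bad))
```

The second sequence is rejected because its READ returns the value of a WRITE that is not the most recent one.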
Example 1(b) -List Object. In a sequence of operations that includes types APPEND(l ,·) and DELETE(l ,·), an a = APPEND(l ,x) has a matching delete if there is a DELETE(l ,x) that follows a in the sequence. Otherwise, it is undeleted.
A list object, l , is the set of sequences over operations of the type APPEND(l ,·), DELETE(l ,·), and ←LAST(l ) that satisfy the validity condition:
Any LAST operation on list object l returns a value ρ ≠ ⊥ if and only if the most recent preceding undeleted APPEND operation is APPEND(l ,ρ). Otherwise, it returns ⊥.
Definitions of objects typically constrain the output of output-generating operations to be related to the input of some preceding state-change operation. For variables, a READ operation returns the value of a preceding WRITE operation; and for list objects LAST returns the value of the most recent preceding undeleted APPEND operation. In each case, the output-generating operation and the unique related state-change operation are said to be causally related.
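The list object condition can be sketched in the same way. The code below is illustrative only. It assumes that "undeleted" is judged relative to the prefix of the sequence that precedes the LAST operation, that input values are distinct (as assumed throughout), and that a DELETE with no matching earlier APPEND has no effect. The encoding and the name `valid_for_list` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical encoding for one list object l:
# ("APPEND", 7), ("DELETE", 7), and ("LAST", 7) for 7←LAST(l).
# None stands in for ⊥.
ListOp = Tuple[str, Optional[int]]


def valid_for_list(seq: Sequence[ListOp]) -> bool:
    """Check the validity condition of Example 1(b) for a single list object."""
    undeleted: List[Optional[int]] = []  # appended values with no matching delete so far
    for kind, value in seq:
        if kind == "APPEND":
            undeleted.append(value)
        elif kind == "DELETE":
            if value in undeleted:  # values are assumed distinct
                undeleted.remove(value)
        elif kind == "LAST":
            expected = undeleted[-1] if undeleted else None
            if value != expected:
                return False
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return True


# The valid sequence quoted later in Example 4(b).
example_4b = [("APPEND", 7), ("APPEND", 4), ("APPEND", 6), ("DELETE", 6),
              ("LAST", 4), ("DELETE", 4), ("LAST", 7)]
print(valid_for_list(example_4b))
```

In this sketch the causally related APPEND of each LAST is the one whose value sits at the end of `undeleted` when the LAST is reached.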
Programs and Multiprocesses. An individual program in a multiprocess system is sequential computer code consisting of local operation invocations, local computation, control structures, and operation invocations on shared objects. A collection of individual programs is a multiprogram. Let P be a multiprogram whose non-local operation invocations are applied to shared objects in J. Then, the pair (P, J) is called a multiprocess and P is compatible with J.
Example 2(a) -Variable Multiprogram. Let x, y and z be variables, and let { p, q} be the multiprogram: READ(x) Then ({ p, q}, {x, y, z}) is a (variable) multiprocess and { p, q} is compatible with {x, y, z}.
Example 2(b) -List Object Multiprogram. Let l be a list object, and let { p, q} be the multiprogram:
Then ({ p, q}, {l }) is a (list object) multiprocess and { p, q} is compatible with {l }.
The preceding two program examples are particularly simple because they are only invocations on shared data objects and they lack any control structures. More elaborate programs could contain code that includes branches and loops.
To highlight the association of operations with their invoking programs, ACT is subscripted with the program identifier when required. For instance, the notation WRITE p (x, ν) emphasizes that this write operation is invoked by individual program p.
Memory Consistency Models. Informally, when a multiprogram is executed, the individual programs of the processes somehow interact, and each individual program's control structures give rise to a path through its code, producing responses to the output-generating instruction invocations on these paths. Due to control structures and nondeterministic timing and interactions, different executions of the same multiprogram can, in general, generate different individual paths, and each set of paths might generate different sets of responses depending on the rules governing the interactions. For a Sequentially Consistent [Lamport 1979a] multiprocess, any such execution can be modelled as a single valid sequence of operations that arises from some interleaving of the operation invocations generated by paths through the individual programs of the multiprocess. In weaker memory consistency models, no such straightforward interleaving is guaranteed.
To model the more general settings, we first define a computation. An individual computation of a program is a sequence of operations. The order of the operations in this sequence is called program order and must be the same as the order in which the associated operation invocations appear in some path through the individual program. A (system) computation of a multiprocess (P, J ) is a collection of individual computations, one for each individual program, p ∈ P . For this article, it suffices to consider completed (as opposed to partial) computations of a multiprogram; that is, the individual computations of a multiprogram must be associated with the sequences formed from all the operation invocations in the paths executed by the multiprogram. Therefore, when the individual programs of a multiprogram P are each straight line programs as in our examples above, a computation of P is just like P except each operation invocation is completed to an operation; that is, all "out" values are filled in.
Example 3(a) -Read/Write Computations. C a (i) and C a (ii) are both computations of the read/write multiprocess in Example 2(a), assuming that initial values of x, y and z are 0.
Example 3(b) -List Object Computation. C b is a computation of the list object multiprocess in Example 2(b).
Notice that the definition of a (system) computation does not constrain the response components of the operations in the computation. Rather, the possible responses are determined by the architecture, which we model as a set of memory consistency constraints. A memory consistency model is a set of partial order constraints on the operations of a computation. These partial orders are defined on subsets of all the operations. For example, Lamport's Sequential Consistency [Lamport 1979a ] requires: "the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
Let O be all the operations of a computation C of a multiprocess (P, J
Example 4(a) -Sequential Consistency on a Variable Multiprocess. Neither Computation C a (i) nor C a (ii) (of Example 3(a)) is Sequentially Consistent because it is not possible to create a sequence of all the computation's operations that both maintains program order and is valid. The following cycle in the operations of either C a (i) or C a (ii) shows that these computations are not Sequentially Consistent:
Example 4(b) -Sequential Consistency on a List Object Multiprocess. Computation C b (of Example 3(b)) is Sequentially Consistent. One valid sequence that preserves program order is: APPEND(l , 7), APPEND(l , 4), APPEND(l , 6), DELETE(l , 6), 4←LAST(l ), DELETE(l , 4), 7←LAST(l ).
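For small computations on variables, Sequential Consistency can be decided by brute force: search for an interleaving of the individual computations that preserves program order and is valid. The sketch below is an illustration under stated assumptions. The two computations in it are hypothetical and are not the computations of Example 3, the name `sc_witness` is invented here, and `None` stands for ⊥ as the initial value of every variable.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[str, str, Optional[int]]  # (kind, variable, value); None stands for ⊥


def sc_witness(computation: Dict[str, List[Op]]) -> Optional[List[Tuple[str, Op]]]:
    """Return a valid sequence that preserves program order, or None if none exists."""
    procs = sorted(computation)

    def search(pos, memory, prefix):
        if all(pos[p] == len(computation[p]) for p in procs):
            return prefix  # every operation has been placed
        for p in procs:
            if pos[p] == len(computation[p]):
                continue
            op = computation[p][pos[p]]  # next operation of p in program order
            kind, var, value = op
            if kind == "READ" and memory.get(var) != value:
                continue  # placing this READ here would make the sequence invalid
            next_memory = {**memory, var: value} if kind == "WRITE" else memory
            found = search({**pos, p: pos[p] + 1}, next_memory, prefix + [(p, op)])
            if found is not None:
                return found
        return None

    return search({p: 0 for p in procs}, {}, [])


# Hypothetical computation: each program writes one variable, then reads the other as ⊥.
store_buffering = {
    "p": [("WRITE", "x", 1), ("READ", "y", None)],
    "q": [("WRITE", "y", 1), ("READ", "x", None)],
}
# Hypothetical computation that does admit a valid interleaving.
benign = {
    "p": [("WRITE", "x", 1), ("READ", "y", None)],
    "q": [("WRITE", "y", 1), ("READ", "x", 1)],
}
print(sc_witness(store_buffering))
print(sc_witness(benign))
```

The search is exponential in the number of operations, so it is only suitable for examples of this size. The first computation has no witness because each READ would have to precede the other program's WRITE, which produces a cycle with program order.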
Partial order constraints that are weaker than Sequential Consistency constitute a weak memory consistency model. In Section 4.1, we will define weak consistency models TSO and PSO and show that Computation C a (i) is PSO but not TSO and C a (ii) is TSO.
Let J be a set of objects and P a multiprogram compatible with J , and let M be a memory consistency model. Then, the triple (P, J, M ) is called a multiprocess system and the couple (J, M ) is called a platform. The computations of a multiprocess system (P, J, M ) are all the computations of the multiprocess (P, J ) that satisfy the memory consistency constraints M .
Comparing and Implementing Systems
Transformations between multiprocesses are our main tool for comparing systems, and for developing abstract descriptions of the memory consistency models that arise from concrete or operational level executions of multiprograms. One multiprocess, called the specified multiprocess (with corresponding specified objects and multiprograms) is transformed into the target multiprocess (with corresponding target objects and multiprograms). An operation on a specified object is transformed to the target objects by providing a subroutine for the operation's invocation where this subroutine uses only operation invocations on the target objects. If the specified operation is output-generating, then the subroutine must return a value of the same type as this output. An object is transformed to the target object(s) by transforming each of its operations. A transformation of each of the objects of a specified multiprocess to objects of the target multiprocess can be naturally extended to a program transformation by replacing each operation invocation in the specified multiprogram with the subroutine for that operation invocation. The transformed multiprogram together with the target objects comprise the transformed multiprocess.
The transformed multiprocess and a target memory consistency model (together called the target system) give rise to a collection of computations: exactly those that arise from the transformed multiprogram interacting with the target objects and satisfying the target memory consistency constraints. Any such computation can be interpreted as a computation of the specified multiprogram by attaching to each operation invocation of the specified multiprogram the value returned by the corresponding subroutine. Thus, the set of computations of the target system provides a set of interpreted computations of the specified multiprocess.
Another way to associate a set of computations with the specified multiprocess is to specify a memory consistency model and consider the set of computations of this specified system. The relationship between these two sets of computations of the specified system determines whether or not we consider the transformation to be an implementation. A transformation of a specified multiprogram will be called an implementation of the specified system if, informally (in Figure 1), the set of interpreted computations produced by traveling the long way around is a non-empty subset of the set C of computations allowed by the specified system. More precisely, let C be the set of computations of a specified multiprocess system (P, J, M). Let τ be a transformation from the objects in J to the objects in Ĵ and denote by τ(P) the transformation of the multiprogram P using τ. Let D̂ be any computation of system (τ(P), Ĵ, M̂) and let D be the interpretation of D̂ for (P, J). Then, the transformation τ implements the system (P, J, M) on the platform (Ĵ, M̂) if D ∈ C for any such D̂. The transformation τ exactly implements the system (P, J, M) on the platform (Ĵ, M̂) if τ implements (P, J, M) on (Ĵ, M̂) and, for any D ∈ C, there is a computation D̂ of (τ(P), Ĵ, M̂) whose interpretation is D. A transformation is a compiler (respectively, an exact compiler) from platform (J, M) to platform (Ĵ, M̂) if, for any multiprogram P compatible with J, the transformation implements (respectively, exactly implements) the system (P, J, M) on the platform (Ĵ, M̂).
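When the sets of computations are finite, the two notions reduce to set comparisons. The sketch below is a deliberately simplified illustration that follows the precise definition rather than the informal one. Computations are flattened to tuples of strings, and `interpret` is a hypothetical stand-in that merely drops target-level steps (marked with a leading underscore) instead of attaching subroutine return values to invocations.

```python
from typing import Callable, Hashable, Iterable, Set


def implements(target_computations: Iterable[Hashable],
               interpret: Callable[[Hashable], Hashable],
               specified: Set[Hashable]) -> bool:
    """Every interpreted target computation is allowed by the specified system."""
    return all(interpret(d_hat) in specified for d_hat in target_computations)


def exactly_implements(target_computations: Iterable[Hashable],
                       interpret: Callable[[Hashable], Hashable],
                       specified: Set[Hashable]) -> bool:
    """Implements, and every specified computation is the interpretation of some target one."""
    interpreted = {interpret(d_hat) for d_hat in target_computations}
    return interpreted <= specified and specified <= interpreted


def drop_internal(d_hat):
    # Toy interpretation: discard steps that exist only at the target level.
    return tuple(step for step in d_hat if not step.startswith("_"))


# Toy data, invented for this example.
C = {("w1", "r1"), ("w1", "r0")}
targets_partial = [("w1", "_move", "r1")]
targets_full = [("w1", "_move", "r1"), ("w1", "r0", "_move")]

print(implements(targets_partial, drop_internal, C))
print(exactly_implements(targets_partial, drop_internal, C))
print(exactly_implements(targets_full, drop_internal, C))
```

The first target set implements but does not exactly implement the toy specification, because no target computation is interpreted as ("w1", "r0").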
Comparison of the Consistency Modelling Framework with Other Frameworks
One of the earliest frameworks, an axiomatic one by Hoare [1972], is difficult to use to prove correctness even for simple algorithms. Owicki and Gries [1976] provide strengthened axioms and used them to provide proofs of coordination algorithms in a parallel programming language. Misra [1986] presented a framework and applied it to a set of register axioms. In each case, the framework is customized for a particular level of system design (parallel programs or hardware registers). Lamport [1979b] provided an axiomatic system and used it as the basis for descriptions and proofs of distributed system features [Lamport 1997, 1986a, 1986b]. In his extension of these axioms, Anger [1989] states that "The method of analysis for high-level system descriptions is the same as for low-level operation descriptions."
The framework developed by Attiya et al. [1998], Attiya and Friedman [1992], and Friedman [1995] is used to describe coordination algorithms and provide transformations of programs written assuming one memory consistency model to new programs intended to be run under different consistency assumptions. Ahamad et al. [1993, 1995] and Kohli et al. [1993] present a framework that is used to define Causal Consistency and to re-state some memory consistency definitions, such as Processor Consistency and pipelined-RAM. Gibbons and Merritt [1992] present a framework that uses I/O Automata to represent architectural assumptions and a memory consistency model that restricts its actions.
Our framework uses ideas from much of this previous work. For example, it supports descriptions at various levels of abstraction. Transformations between different memory consistency models are facilitated by the framework, as in the work of Attiya and Friedman [1992]. The descriptive style provided by the framework is general, but is similar to that of several earlier works, particularly that of Ahamad et al. [1993, 1995], Lynch's I/O Automata [Lynch and Tuttle 1989; Lynch 1996], and Herlihy and Wing's Linearizability [Herlihy and Wing 1990]. There is also less closely related work that we do not discuss here, primarily due to differences in intended objectives.
The major contribution of this article, as we see it, is a general framework, the Consistency Modeling Framework (henceforth abbreviated CMF), for describing consistency conditions at various levels of abstraction and proving the equivalence between them. This is demonstrated on SPARC, a real-life example. Hence, the most closely related work to ours is Lynch and Tuttle's I/O Automata [Lynch 1996; Lynch and Tuttle 1989] and Herlihy and Wing's Linearizability [Herlihy and Wing 1990] . Both propose well-established models that support different levels of abstraction, as does CMF. I/O Automata is operational and Linearizability and CMF are non-operational. The definition style in CMF is closer to the Linearizability style. However, CMF, as presented in this paper, does not support Linearizability's global time. CMF can be extended to include a notion of global time, but this is beyond the scope of this article.
The important difference between CMF and these two models is its generalization of the notion of computations. For Linearizability and I/O Automata, a computation is a sequence of operation invocations and responses (for the former) or events (for the latter). This is also the case with Lamport's Sequential Consistency [Lamport 1979a] (a sequence of operations). Our notion of computation, a set of sequences of operations, one for each process, is more general and is more suitable for modern (loosely coupled) machines. This notion of a computation is similar to the notion of history by Ahamad et al. [1993, 1995]. However, CMF is more general as it can be applied to arbitrary consistency models with arbitrary objects. Furthermore, the layering approach in CMF does not exist in the work of Ahamad et al. [1993, 1995].
In CMF, the notion of validity is independent from the computation. In a computation, each response (the "out" parameter) is simply glued to its operation invocation, regardless of when it is actually received by the process. This provides needed flexibility. Also, unlike some definitions (such as the complex value axiom in Weaver and Germond [2000] ), in CMF the notion of a valid sequence of operations is the natural one for the object and does not change.
Adir et al.'s framework [Adir et al. 2003] for Information Flow Modelling includes private registers and requires a more thorough treatment of causally related operations (what they call the reads-from mapping). These details can be crucial (as Adir et al. demonstrate with the PowerPC architecture) but are not required to achieve the objective of this article. Modelling systems at different levels of abstraction, and proving equivalence between them, is not a goal of the work of Adir et al. and does not seem to be facilitated by their framework. It is a distinctive feature of CMF, which this article demonstrates.
WRITE-BUFFER ARCHITECTURES
One common multiprocessor architecture, such as the SPARC multiprocessor, associates a write-buffer with each processor as shown in Figure 2 . The main memory is single ported with a nondeterministic switch providing one memory access at a time. Each write-buffer operates in parallel with the processor.
We contend that even the most permissive variant of any write-buffer machine will execute as follows: When a processor performs a write, it need not wait for it to be committed to main memory. Instead, it is stored in the write-buffer, which is responsible for committing pending writes to main memory. When a read is issued by a processor, the processor's associated write-buffer is checked for pending writes to the same location. If there is any such write, the value "to be written" by some such write to that location is returned. In this case, the read completes without accessing main memory. Otherwise, the read accesses main memory and returns the value of the location in main memory. Notice that this basic machine has quite weak operational constraints. A read action by some processor applied to some location can return any value written by the same processor for the same location that is not yet committed. Furthermore, operations need not be blocking; once the buffer action is complete, a process can invoke its next operation in program order. However, accesses to any individual location in main memory by any one processor are via a FIFO channel, which connects the buffer and main memory. Therefore, for each location and each processor, pending memory write and read operations of main memory by that processor and to that location are performed in the same order as they are issued. This means that a write that has left the buffer and is still on its way to main memory, cannot be bypassed by a read of the same location by the same processor that is issued later than the write.
Two more restrictive write-buffer behaviors are distinguished by (1) constraining the order in which pending main memory operations are performed; (2) constraining which value of a location that is in the write-buffer can be returned by a read; and (3) requiring the processor to block on reads.
A partial store order write-buffer machine, which is modelled after a SPARC machine described in the version 8 manual [SPARC International, Inc. 1992], adds two constraints to the basic write-buffer machine: (1) when the write-buffer contains pending writes to a location being read, the value of the most recent such write is returned; and (2) when the write-buffer does not contain any pending writes to a location being read, the processor blocks until the read is complete.
A total store order write-buffer machine, also modelled after a SPARC version 8 machine, further constrains the partial store order write-buffer machine by requiring that all pending writes from a write-buffer to main memory (rather than just those to the same location) must be performed in FIFO order.
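The difference between the two machines can be explored with a small state-space search. The following is a simplified sketch under explicit assumptions and is not the formal system defined in the next subsection. A commit from a buffer to main memory is treated as a single atomic step, so the FIFO channel between buffer and memory is not modelled separately. A blocking read is treated as an atomic access to main memory. `None` stands for ⊥ as the initial content of every location. The function `producible` and the two test computations are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[str, str, Optional[int]]  # (kind, location, value); None stands for ⊥


def _set(t: tuple, i: int, v) -> tuple:
    """Return a copy of tuple t with position i replaced by v."""
    return t[:i] + (v,) + t[i + 1:]


def producible(computation: Dict[str, List[Op]], mode: str) -> bool:
    """Can the simplified TSO or PSO write-buffer machine produce these responses?"""
    if mode not in ("TSO", "PSO"):
        raise ValueError("mode must be 'TSO' or 'PSO'")
    procs = sorted(computation)
    # State: program counters, per-processor buffers, main memory as sorted items.
    start = (tuple(0 for _ in procs), tuple(() for _ in procs), ())
    seen, stack = {start}, [start]
    while stack:
        pcs, buffers, mem_items = stack.pop()
        if all(pcs[i] == len(computation[p]) for i, p in enumerate(procs)):
            return True  # every program ran to completion with the given responses
        memory = dict(mem_items)
        successors = []
        for i, p in enumerate(procs):
            buf = buffers[i]
            if pcs[i] < len(computation[p]):  # processor step
                kind, loc, value = computation[p][pcs[i]]
                if kind == "WRITE":
                    successors.append((_set(pcs, i, pcs[i] + 1),
                                       _set(buffers, i, buf + ((loc, value),)),
                                       mem_items))
                else:
                    pending = [v for (x, v) in buf if x == loc]
                    # Most recent pending write if any, otherwise main memory.
                    observed = pending[-1] if pending else memory.get(loc)
                    if observed == value:
                        successors.append((_set(pcs, i, pcs[i] + 1), buffers, mem_items))
            if mode == "TSO":  # only the oldest pending write may commit
                committable = [0] if buf else []
            else:  # PSO: the oldest pending write to each location may commit
                committable = [j for j, (x, _) in enumerate(buf)
                               if all(y != x for (y, _) in buf[:j])]
            for j in committable:
                loc, value = buf[j]
                new_memory = {**memory, loc: value}
                successors.append((pcs, _set(buffers, i, buf[:j] + buf[j + 1:]),
                                   tuple(sorted(new_memory.items()))))
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return False


# Hypothetical computation: q observes p's second write but not its first.
message_passing = {
    "p": [("WRITE", "x", 1), ("WRITE", "y", 1)],
    "q": [("READ", "y", 1), ("READ", "x", None)],
}
# Hypothetical computation: each program reads ⊥ after buffering its own write.
store_buffering = {
    "p": [("WRITE", "x", 1), ("READ", "y", None)],
    "q": [("WRITE", "y", 1), ("READ", "x", None)],
}
print(producible(message_passing, "PSO"), producible(message_passing, "TSO"))
print(producible(store_buffering, "TSO"))
```

Under these assumptions the first computation is reachable only when writes to different locations may commit out of order, which is exactly the freedom that the total store order machine removes.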
Throughout this article, we address only read and write operations to variables, even though multiprocessors support other operations that affect shared memory. Complicated memory consistency behavior arises, however, because read and write operations are executed as a sequence of (sometimes, nonblocking) steps. Operations such as various read-modify-write and memory barrier operations impose stronger coordination than reads and writes of variables, and hence they limit or eliminate these complicated scenarios. Thus, they can be easily added to all the systems considered here [Kawash 2000 ]. Since it is most challenging to pin down the memory consistency constraints of the weak operations, we focus on them in this article. We are also motivated to minimize the use of strong synchronization primitives since they negatively impact the system performance. So this article focuses on specifying the behaviors of these systems when they avoid the use of operations stronger than just read and write.
Partial and Total Store Order Write-Buffer Systems
The preceding informal descriptions of the behavior of each write-buffer machine can be converted to a precise definition of a system, using the framework of Section 2. For the remainder of this article, J is a set of (specified) variables and P = {p 1 , . . . , p n } is a multiprogram compatible with J . A target write-buffer system associated with multiprocess (P, J ) is defined by specifying each of its components.
The partial store order write-buffer system ( P , J , PSO) and the total store order write-buffer system ( P , J , TSO) associated with multiprocess (P, J ) are defined as follows. They differ only in the memory consistency component.
Objects. For each variable in J there is a corresponding main memory variable and there is a list object for that variable in each processor's write-buffer. More precisely, for each x ∈ J , associate with x a variable x, and for each pair x ∈ J and i ∈ {1, . . . , n}, associate a list object x i . The set of objects J is:
Programs. To define the transformation τ from the READ and WRITE operations on the variables in J to operations on the objects in J , it is convenient to extend the definition of an operation so that it is applicable to more than one object. A compound operation has the form out←ACT (objs,in) where ACT is an action applied to the collection of objects "objs" with input parameters in the list "in" returning the output values in the list "out" and which is defined as an indivisible sequence of operations. Indivisible sequences must appear to be atomic. That is, in any valid sequence of operations that is used to confirm that a computation satisfies a given memory consistency, the subsequence of operations that comprise an indivisible operation must be contiguous. For list object x i and variable x, define the compound operation MOVE({ x i , x}, v) to be the indivisible sequence of operations: DELETE( x i , v), WRITE( x, v) . A sequence of operations on list objects and variables that includes MOVE operations is valid if and only if the sequence modified by replacing each MOVE with the DELETE,WRITE sequence that defines the MOVE is valid.
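The rule that a sequence containing MOVE operations is valid exactly when its expansion is valid can be sketched as follows. The tuple encoding of operations is hypothetical, and the list-object semantics used here are assumptions: APPEND adds at the end, LAST returns the most recently appended remaining value or ⊥, and DELETE removes the oldest occurrence of its value and is invalid if the value is absent. The initial value 0 for variables is also an assumption.

```python
BOTTOM = None  # stands for the ⊥ returned by LAST on an empty list object


def expand_moves(sequence):
    """Replace each MOVE by the DELETE, WRITE pair that defines it."""
    expanded = []
    for op in sequence:
        if op[0] == "MOVE":
            _, list_obj, variable, value = op
            expanded.append(("DELETE", list_obj, value))
            expanded.append(("WRITE", variable, value))
        else:
            expanded.append(op)
    return expanded


def is_valid(sequence, initial=0):
    """Check validity of a sequence over list objects and variables."""
    lists, variables = {}, {}
    for op in expand_moves(sequence):
        kind, target, value = op
        if kind == "APPEND":
            lists.setdefault(target, []).append(value)
        elif kind == "DELETE":
            items = lists.setdefault(target, [])
            if value not in items:
                return False
            items.remove(value)          # removes the oldest occurrence
        elif kind == "LAST":
            items = lists.get(target, [])
            if value != (items[-1] if items else BOTTOM):
                return False
        elif kind == "WRITE":
            variables[target] = value
        elif kind == "MEM-READ":
            if value != variables.get(target, initial):
                return False
        else:
            raise ValueError(f"unknown operation type: {kind}")
    return True
```

Because the expansion is contiguous, no other operation can fall between the DELETE and the WRITE of one MOVE, which is what makes the compound operation appear atomic.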
To distinguish between READ operations on variables in J and READ operations on the variables in J , the latter are renamed MEM-READ.
The transformation τ from J to J is defined by:
The transformation τ is used to transform each WRITE and each READ operation invocation in P yielding a transformed multiprogram P = τ (P ). Clearly P is compatible with J . The program order of operations in the transformed
Example 5(a) -Write-Buffer Multiprogram. Transformation τ applied to the variable multiprocess in Example 2(a) gives the write-buffer multiprocess ({ p, q}, { x, y, z, x p 
, where the multiprogram { p, q} is as follows:
Memory Consistency. Let O be the set of all the operations of a computation C of (τ (P ), J ). Subsets of O are denoted by O| Q where Q is a collection of operation types, a process or an object. For example, O| APPEND∪LAST is the set of all the APPEND and LAST operations in O and O| x is the subset of O consisting of all the operations on object x. The notation is often combined to produce an intersection of these subsets.
Informally, matching-ops order ensures that each READ and WRITE is implemented according to the definition of its transformation. Buffer-lists order ensures that each READ or WRITE operation appears to be invoked in program order. In particular, requiring that o 1 , o 2 ∈ O| APPEND∪LAST ensures that the initial part of the implementation of each READ or WRITE, which is applied to the local buffer, is applied in the program order of the corresponding READ or WRITE. Notice that READ and WRITE operations are not in general required to complete in program order since the MEM-READ and MOVE operations are not similarly constrained. FIFO-per-location-memory ensures that MOVE and MEM-READ operations by the same individual program to the same location remain in program order. Blocking-loads order ensures that when a LAST returns ⊥ and hence main memory is consulted for a value, the invoking processor waits for the MEM-READ to return, before initiating either another READ (beginning with LAST) or another WRITE (beginning with an APPEND). The FIFO-move order ensures that pending WRITEs from a buffer are performed in FIFO order.
Thus, these orders capture the informal description of the various behavioral conditions of the write-buffer machines. The informal description of the behavior of the partial store order write-buffer machine is captured by the first four partial orders.
PSO Consistency. A computation C satisfies PSO consistency (abbreviated PSO) if there exists a valid total order of the operations O of C that preserves matching-ops order, buffer-lists order, FIFO-per-location-memory order and blocking-loads order.
The stronger requirement of FIFO-move, together with the orders required for PSO consistency captures the informal description of the behavior of the total store order write-buffer machine.
TSO Consistency. A computation C satisfies TSO consistency (abbreviated TSO) if there exists a valid total order of the operations O of C that preserves matching-ops order, buffer-lists order, FIFO-per-location-memory order, blocking-loads order and FIFO-move order.
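The two definitions differ only in the set of partial orders that the witnessing total order must preserve. The following is a minimal sketch of that check, assuming operations are encoded as plain strings and each partial order as a set of (earlier, later) pairs. The four-operation fragment is hypothetical (it is not the multiprogram of Example 5(a)), the reading of matching-ops order as "APPEND before its MOVE" is an assumption, and validity of the sequence is not checked here.

```python
def preserves(sequence, order_pairs):
    """True if every (earlier, later) pair occurs in that order in sequence."""
    position = {op: index for index, op in enumerate(sequence)}
    return all(position[a] < position[b] for a, b in order_pairs)


# Hypothetical fragment: processor p issues WRITE(x,1) and then WRITE(z,2).
matching_ops = {("APPEND(x_p,1)", "MOVE(x_p,x,1)"),
                ("APPEND(z_p,2)", "MOVE(z_p,z,2)")}
buffer_lists = {("APPEND(x_p,1)", "APPEND(z_p,2)")}
fifo_move = {("MOVE(x_p,x,1)", "MOVE(z_p,z,2)")}

# FIFO-per-location-memory and blocking-loads contribute no pairs here,
# because the two writes are to different locations and there are no reads.
pso_orders = matching_ops | buffer_lists
tso_orders = pso_orders | fifo_move

# The write to z reaches main memory before the earlier write to x.
candidate = ["APPEND(x_p,1)", "APPEND(z_p,2)", "MOVE(z_p,z,2)", "MOVE(x_p,x,1)"]
```

Under these assumptions `candidate` preserves the PSO orders but not the TSO orders, since the two MOVE operations leave the buffer out of FIFO order.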
The partial orders that define PSO and TSO consistency relate directly to the operational behavior of the corresponding write-buffer machine. However, the combinations of partial orders contain redundancies. Simplifying them to remove these redundancies will, in turn, simplify some of our proofs in Section 4. Define the partial order: PROOF. For any computation that satisfies the definition of PSO consistency, there is a valid sequence of all its operations that preserves matching-ops order, buffer-lists order, FIFO-per-location-memory order and blocking-loads order. It is straightforward to confirm that this same sequence preserves the partial orders required for the claim.
For the other direction, let pso be a valid sequence of all the operations O in C, where this sequence preserves matching-ops order, FIFO-per-location-move order and blocking-loads order. Then pso clearly preserves buffer-lists order because buffer-lists is a subset of blocking-loads order. It remains to show that pso also preserves FIFO-per-location-memory order.
PROOF. By the definitions of PSO and TSO consistencies, a computation satisfies TSO consistency if and only if there is a sequence of all its operations that satisfies the requirements of PSO consistency and preserves FIFO-move order. Thus, by Claim 3.1, a computation C satisfies TSO consistency if and only if there exists a valid total order of its operations that preserves matching-ops order, FIFO-per-location-move order, blocking-loads order, and FIFO-move order. Since FIFO-per-location-move order is a subset of FIFO-move order, the claim follows.
Example 6(a) -Write-Buffer Computations. C a (i) and C a (ii) are computations of the multiprogram in Example 5(a).
Computation C a (i) satisfies PSO as is confirmed by the following sequence:
In this sequence, MOVE({ z p , z}, 2) and MOVE({ x p , x}, 1) violate TSO's FIFO-move order, which is not required for PSO. Computation C a (i) does not satisfy TSO due to the following cycle:
Computation C a (ii) satisfies TSO as is confirmed by the following sequence:
PARTIAL AND TOTAL STORE ORDER CONSISTENCY
One way to associate a set of computations with (P, J ) is to first transform it to (τ (P ), J ) (using τ as defined in Section 3.1) and then to consider all PSO (or all TSO) computations that can arise. Each such computation can be interpreted as a computation of (P, J ) by setting the "out" value of each READ in P to the value returned by the corresponding LAST or MEM-READ. Another way is to specify a memory consistency model M and consider all computations of the system (P, J, M ).
In this section, we define two such memory consistency models, PSO consistency and TSO consistency. Then, we prove that the set of computations of the system (P, J, PSO) is exactly the same as the set of interpreted computations of (τ (P ), J , PSO), and the corresponding result for TSO. Thus, we establish that τ is an exact compiler (defined in Section 2.2) from the platform (J, PSO) to the platform ( J , PSO) and from (J, TSO) to the platform ( J , TSO). Hence PSO and TSO, which are our "programmer level" definitions of partial and total store order computations, are correct in that they exactly capture the set of possible outcomes that can arise when any multiprocess that operates on shared variables is executed on a partial or total store order write-buffer machine.
Definitions
Let O be all operations of a computation C of (P, J Informally, these partial orders together capture constraints on a "view of operations" as is "seen" by main memory. The same-object order ensures that the write-buffers in these models are at least FIFO per-location and the channels connecting these buffers to main memory are also FIFO. A foreign READ necessarily misses the buffer and returns a value from main memory. These READs are ordered according to when main memory "sees" them. A similar guarantee cannot be made about a domestic READ because it may hit the buffer and if it does so, it returns values that are yet to be committed to main memory. Such a READ will be indirectly "seen" by main memory in the future, only after its pending WRITE is applied to main memory. This allows domestic READs to overtake some preceding operations in program order. The following-write order captures the FIFO buffers and the fact that READs are blocking.
Define two weak memory consistency models using these orders.
PSO Consistency. Computation C satisfies PSO consistency if there exists a valid total order of all its operations that preserves same-object and preceding-read orders.
TSO Consistency. Computation C satisfies TSO consistency if there exists a valid total order of all its operations that preserves same-object, preceding-read and following-write orders.
Example 7(a) -TSO and PSO Consistency. Computation C a (i) (of Example 3(a)) satisfies PSO consistency as shown by the following valid total order that preserves same-object and preceding-read orders:
However, C a (i) does not satisfy TSO consistency. The following cycle shows that there is no valid total order that extends preceding-read and following-write orders:
Computation C a (ii) is PSO and TSO as shown by the valid total order that preserves same-object, preceding-read and following-write orders:
Note that C a (i) (respectively, C a (ii)) of Example 3(a) is an interpretation of the write-buffer computation C a (i) (respectively, C a (ii)) of Example 6(a).
The proofs for partial and total store order in the next two subsections are similar, so it is convenient to define some general notation and constructions that apply to either system. Denote by C * any computation of (P, J, PSO) or (P, J, TSO). Similarly, denote by C * any computation of (τ (P ), J , PSO) or (τ (P ), J , TSO). 
Transformation τ Is a Compiler
The definitions of PSO consistency and TSO consistency guarantee that there is a valid total order, which we denote by (O, * so −→), on the operations of C * that preserves some subset of the program order of P . Similarly, the definitions of PSO and TSO guarantee that there is a valid total order, which we denote by ( O, * so −→), on the operations of C * that preserves some subset of the program order of τ (P ).
Construction One. This procedure constructs a computation, C, of (P, J ) and a sequence, S, of all operations of C from a valid total order ( O, * so −→) of all the operations of a computation C * of (τ (P ), J , PSO) or of (τ (P ), J , TSO).
Consider the unique sequence * so that agrees with some total order ( O, * so −→ (x, v) .
Computation C of (P, J ) is constructed by applying to computation C * the same deletions and replacements of operations as is done in the conversion step (above). Say that each WRITE (respectively, READ) 
. The sequence S is a valid total order of all operations in C.
PROOF. Sequence * so is valid. Also, the WRITE operations in S and the READ operations in S that relate to MEM-READ operations are in the same order as the related MOVE and MEM-READ operations in * so. Therefore, each READ in S that relates to a MEM-READ returns the value of the most recent preceding WRITE to the same variable. In the reordering step, each LAST that returns a non-⊥ value is moved so that the value it returns is the value written by the most recent preceding MOVE to the same list object. Therefore, each READ in S that relates to a LAST in * so also returns the value of the most recent preceding WRITE to the same variable. Since S contains exactly the operations in C, it is a valid total order of the operations in C. PROOF. For any multiprogram P compatible with J , consider any computation C * of (τ (P ), J , PSO) (respectively, (τ (P ), J , TSO)). Construct the computation C and the sequence S using construction one from the total order * so of the operations O in C * . By Claim 4.1, S is a valid total order of the operations O in C. By Claim 4.3, S preserves the same-object order of O. By Claim 4.2, S preserves the preceding-read order of O. Thus, C satisfies PSO consistency and Case (1) 
PROOF. Since o 1 prog −→ o 2 in C, τ (o 1 ) prog −→ τ (o 2 ) in C * . So in particular, o 1 prog −→ o 2 in C * ,
Transformation τ Is An Exact Compiler
Construction Two. Using the computation C * of (P, J, PSO) or (P, J, TSO) and the valid sequence * so, this procedure constructs a computation C of (τ (P ), J ) and a sequence S of all the operations, O, in C. First an intermediate computation D and sequence T are constructed by a straightforward conversion from C * . Then, D and T are adjusted to form the final computation C and sequence S.
Construct a computation D of (τ (P ), J ) from C * by replacing each
v)) and replacing each v←READ p i (x) with (⊥ ←LAST p i ( x i ), v←MEM-READ p i ( x)
). An operation in C * and the pair of operations that replaced it in D are said to potentially correspond.
Sequence T is constructed by the algorithm in Figure 3 . We use the notation T ← T ++ o to denote that operation o is appended to T . For each individual computation of p i in C * , maintain a pointer ↓ p i that initially points at the first operation in p i 's computation. If ↓ p i points at o, then we say ↓ p i = o. Similarly, maintain the pointer ↓ * so to operations in * so. Initially, ↓ * so points at the first operation in * so. When there are no more operations to consider in * so, we say ↓ * so = nil. Also, advancing a pointer means the pointer is incremented to point at the next operation in the corresponding sequence. Initially, all operations in * so are unmarked.
To create the final computation C and sequence S, delete from both T and D each v←MEM-READ( x) operation that returns the value v of a MOVE({ x i , x}, v) such 
Matching LAST and MEM-READ operations are always added to T in order of LAST immediately followed by MEM-READ. S maintains the order of all operations that are not removed.
CLAIM 4.8. The sequence S is a valid total order of all operations in C.
PROOF. Exactly the same operations have been inserted into T and into D.
Also, exactly the same operations have been altered or deleted from T and D before renaming to S and C. Thus, S is a total order of the operations in C.
Sequence * so is valid. Consider any READ operation, v←READ p j (x), in * so. Call this READ r. It returns the value of the most recent preceding WRITE operation to the same object, WRITE p i (x, v). Call this WRITE w. We now examine the validity of the LAST and MEM-READ operations that are placed onto S and correspond to r. Define I to be the interval in * so between w and r and J to be the interval in p j 's individual computation between o j and r.
Fact 1. By the validity of * so, I contains no WRITE operations to object x.
Fact 2. J contains no WRITE operations to object x. If there were one, say w′(x), then w′(x) prog −→ r, which implies w′(x) * so −→ r by same-object order. This implies that w′(x) * so −→ w by Fact 1. This is impossible since ↓ * so advanced beyond w′(x) only if ↓ p j advanced beyond w′(x).
When ↓ * so advances past w a corresponding MOVE is placed onto T . The pair of operations ⊥←LAST( x) and v←MEM-READ( x) that potentially correspond to r are placed onto T when the first of ↓ * so advances to r, or ↓ p j advances to r. Consider the interval K in T between this MOVE and this pair of LAST The algorithm placed the APPEND that corresponds to w onto T when ↓ p j =w. It placed the LAST and MEM-READ pair that potentially correspond to r onto T when ↓ p j advanced to r. Both of these occurred before ↓ * so advanced to o j and hence before ↓ * so advanced past w, which places the MOVE that corresponds to w onto T . Thus, the LAST and MEM-READ for r are between the APPEND and MOVE for w. Hence the final adjustment that transforms T to S replaces this LAST and MEM-READ pair with a LAST that returns the value of the APPEND that corresponds to w, which is valid. PROOF. Since Lemma 4.4 establishes that τ is a compiler it remains only to show that for any computation, C of (P, J, PSO) (respectively, (P, J, TSO)), there is a computation C of (τ (P ), J , PSO) (respectively, (τ (P ), J , TSO)) whose interpretation is C.
Construct the computation C and the sequence S using construction two from the total order * so of the operations in C. By Claim 4.8, S is a valid total order of the operations in C. By Claims 3.1 and 3.2, it suffices to prove that S preserves matching-ops order, FIFO-per-location-move order and blocking-loads order (for PSO) and matching-ops order, FIFO-move order and blocking-loads order (for TSO). Claim 4.6 ensures blocking-loads order and Claim 4.7 ensures matching-ops order.
By Claim 4.5, the MOVE operations in S are in the same order as their corresponding WRITE operations in * so. If C satisfies PSO, then * so preserves same-object order and, therefore, S maintains FIFO-per-location-move order. Similarly, if C satisfies TSO, then * so preserves following-write order and, therefore, S maintains FIFO-move order.
SPARC ARCHITECTURE MANUAL SPECIFICATIONS
It remains to show that the systems defined in Section 3.1 are equivalent to these specifications in the SPARC architecture manuals [SPARC International, Inc. 1992; Weaver and Germond 2000] . The SPARC version 8 manual defines total store order and partial store order [SPARC International, Inc. 1992] . The version 9 manual [Weaver and Germond 2000] redefines these two models and introduces a third model, the relaxed memory order. The manuals use both operational and axiomatic descriptions. In either case, it is difficult to interpret these models as successive weakenings of the intuitive Sequential Consistency. It is even more difficult to design algorithms for multiprocessor systems using these descriptions.
Section 5.1 defines partial store order and total store order architecture systems within our framework using a validity condition and a program order definition that mimic those used in the version 9 architecture manual [Weaver and Germond 2000] . In Section 5.2, we argue that our architectural level definition captures that of the SPARC manuals. Finally, in Section 5.3, we prove that the partial and total store order write-buffer systems of Section 3.1 exactly implement the partial and total store order architecture systems (respectively). Throughout, we restrict our scope to READ and WRITE operations even though the manual defines a series of memory barrier and atomic read-modify-write operations. These can be easily added [Kawash 2000 ], but are outside the scope of this article.
Systems Based on SPARC Manual Specifications
The systems (P a , J a , TSO a ) and (P a , J a , PSO a ) that correspond to the manual specifications are defined using our Consistency Modelling Framework's notion of programs, objects, and memory consistency. In the SPARC architecture manual, programs are described in terms of the objects they operate on, and the validity of these objects is defined in terms of the programs. So, in order to remain close to the manual specifications, we begin our definition with programs, even though there is a forward reference to objects.
Programs. P a = {p 1 , . . . , p n } is a collection of individual programs, each containing operation invocations on SPARC variables. The order of operations in each individual program is denoted <p .
Objects. A SPARC variable x is the set of sequences over operations of the type −→) that preserves SPARC-same-object, SPARC-read and SPARC-write orders.
Because SPARC-same-object order is a subset of SPARC-write order we redefine TSO a consistency more concisely as follows:
TSO a consistency. Computation C a satisfies TSO a consistency if there exists a valid total order (O a , tso a −→) that preserves SPARC-read and SPARC-write orders.
Equivalence with SPARC Manual Specifications
The purpose of this subsection is to argue that the systems (P a , J a , PSO a ) and (P a , J a , TSO a ) just defined using the Consistency Modelling Framework are equivalent to the systems defined in the SPARC Architecture Manual version 9 [Weaver and Germond 2000]. This is done by showing where each part of the definition is captured in the manual. Of necessity, this section relies heavily on this manual, especially Appendix D: Formal Specifications of the Memory Model. Thus, this section is not self-contained. It can be skipped, however, without jeopardizing understanding of the rest of the article. The discussion focuses on the memory consistency model and the validity condition of the SPARC variables of the manual.
Programs. The manual (Section D.1) defines the system to be a collection of processors, P 0 , P 1 , . . . , P n−1 , each with its own instruction stream, sharing address space and accessing real memory and I/O locations. Our abstraction captures this as a multiprogram P a = {p 1 , . . . , p n }. SPARC program order, denoted by <p, is defined as "X n <p Y n is true if and only if the memory transaction X n is caused by an instruction that is executed before the instruction that caused memory transaction Y n ," where X n and Y n are executed in the same processor n (Section D.3.2 of Weaver and Germond [2000]). Thus, SPARC program order agrees with the program order defined in our framework.
Objects. The validity condition for SPARC variables is more complex than the validity condition of variables defined in Section 2 of this article; it is based directly on the manual definition of Section D.4.5. "The value of a load [· · ·] is the value of the most recent store that was performed with respect to memory order or the value of the most recent initiated store by the same processor" (Section D.4.5 of Weaver and Germond [2000]). It is straightforward to check that the definition of SPARC variables is exactly this condition.
Memory Consistency. Sections D.5 and D.6 specify the partial and total store orders that we call PSO a and TSO a . These specifications, however, rely on the RMO (relaxed memory order) specification of Section D.4. They also rely on definitions of MEMBAR instructions. A MEMBAR instruction is not an operation; rather it is a memory barrier that enforces constraints on what instructions can be reordered. There are four basic MEMBAR instructions. A MEMBAR #LoadStore imposes the constraint that any instruction with load semantics that is before the MEMBAR in program order must be completed before any instruction with store semantics that is after the MEMBAR can be invoked. MEMBAR #LoadLoad, MEMBAR #StoreLoad, and MEMBAR #StoreStore are defined similarly. Basic MEMBARs can be combined; for example, the constraint imposed by MEMBAR #LoadLoad|#LoadStore is the union of the constraints of MEMBAR #LoadLoad and MEMBAR #LoadStore. Our focus is on Sections D.4.4 (Memory Order Constraints) and D.4.5 (Value of Memory Transactions).
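The way basic MEMBARs combine by union can be sketched with a bit-flag type. This is an illustration only: the `Membar` enumeration and the `orders` function are hypothetical names, not part of the SPARC manual, and each instruction is reduced to the string "load" or "store" according to its semantics.

```python
from enum import Flag, auto


class Membar(Flag):
    """The four basic MEMBAR constraints (hypothetical encoding)."""
    LOAD_LOAD = auto()
    LOAD_STORE = auto()
    STORE_LOAD = auto()
    STORE_STORE = auto()


# Which basic MEMBAR orders an earlier instruction before a later one.
_REQUIRED = {
    ("load", "load"): Membar.LOAD_LOAD,
    ("load", "store"): Membar.LOAD_STORE,
    ("store", "load"): Membar.STORE_LOAD,
    ("store", "store"): Membar.STORE_STORE,
}


def orders(mask, earlier, later):
    """True if a MEMBAR with this mask forces `earlier` to complete before `later`."""
    return bool(mask & _REQUIRED[(earlier, later)])


# A combined MEMBAR is the union of the constraints of its parts.
after_load = Membar.LOAD_LOAD | Membar.LOAD_STORE
```

With this encoding, `after_load` orders a load before any following load or store, but places no constraint on a store followed by a load.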
Rule (1) of D.4.4 requires that dependence order be maintained in the memory order when the preceding operation has load semantics. For PSO a and TSO a , the implied MEMBAR #LoadLoad|#LoadStore after each instruction with load semantics as specified in Section D.5 subsumes the RMO Rule (1) of D.4.4. This MEMBAR ensures that instructions with load semantics are completed before any following instruction in the program with load or store semantics is invoked. All the operations we consider have either load or store semantics. Therefore, the SPARC-read order captures the constraints of the implied MEMBAR in D.5.
Rule (2) of D.4.4 refers to an explicit MEMBAR operation, which is not included in our model.
Rule (3) of D.4.4 is equivalent to the SPARC-same-object partial order of PSO a . For TSO a , the implied MEMBAR #StoreStore (Section D.6) after every store instruction subsumes Rule (3) of RMO. The SPARC-write order of TSO a is equivalent to this implied MEMBAR.
Therefore, the systems (P a , J a , TSO a ) and (P a , J a , PSO a ) are faithful to the descriptions of total and partial store orders in the SPARC manuals.
Equivalence with the Systems of Section 3
The ideas in the proof of the following theorem are almost identical to those in Section 4. Hence, the proof has been relegated to Appendix A. Transformation τ refers to the transformation with this name in Section 3.1. The key difference between this proof and those in Section 4 is that a new construction is used to show that τ is a compiler. In the new construction, no re-ordering is required; the following conversions are applied: all LAST operations that return non-⊥ and MEM-READs are replaced by READ operations, and MOVEs are replaced by WRITEs. Construction two is used to establish that the compiler is exact. Due to the different validity condition in the definition of the variables, the lemmas associated with Construction two must be re-established.
THEOREM 5.4. Transformation τ is an exact compiler
Section 4 showed that the PSO and TSO memory consistency models capture exactly the computations that can arise on a machine with write-buffers that operates with partial store order or total store order semantics, respectively. Similarly, Section 5 shows that this same write-buffer machine captures the computations specified as PSO a and TSO a in the SPARC manual. Hence, our nonoperational descriptions of PSO and TSO are equivalent to the SPARC manual architectural descriptions.
OTHER WRITE-BUFFER ARCHITECTURES
The SPARC version 9 architecture includes another memory consistency model called the relaxed memory order [Weaver and Germond 2000], which is a relaxation of PSO. An obvious question is whether this relaxed memory order model is also equivalent to some natural operation of a write-buffer machine. In this section, we investigate this question and extend our observations to the weak memory consistency model Coherence and to the memory consistency of the Alpha multiprocessor [Compaq Computer Corporation 1998] as formalized by Attiya and Friedman [1994]. We will argue that none of these memory consistency models has an exact implementation on any natural write-buffer multiprocessor. The key is a Coherent computation (defined below) that cannot occur on even the most permissive write-buffer machine we might imagine. So, we first define this very basic write-buffer system.
Basic Write-Buffer System
A set object is similar to a list object but lacks order.
A set object s, is the set of sequences over operations of the type INSERT(s,·), DELETE(s,·), and ←SELECT(s) that satisfy the validity condition:
For any SELECT operation on set object s, it returns a value ρ ≠ ⊥ if and only if there is a preceding INSERT(s,ρ) operation and, between this INSERT and the SELECT, there is no DELETE(s,ρ). Otherwise, it returns ⊥.
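The validity condition for a single set object can be checked mechanically. The following is a sketch only, assuming a hypothetical tuple encoding ("INSERT", v), ("DELETE", v), ("SELECT", result) and reading the condition as: SELECT may return any value that is currently inserted and not deleted, and returns ⊥ only when no such value exists.

```python
BOTTOM = None  # stands for ⊥


def set_sequence_is_valid(sequence):
    """Check the set-object validity condition for one set object."""
    live = set()  # values with an INSERT and no later DELETE
    for kind, value in sequence:
        if kind == "INSERT":
            live.add(value)
        elif kind == "DELETE":
            live.discard(value)
        elif kind == "SELECT":
            if value is BOTTOM:
                if live:              # a non-⊥ value was available
                    return False
            elif value not in live:   # returned value is not pending
                return False
        else:
            raise ValueError(f"unknown operation type: {kind}")
    return True
```

Unlike LAST on a list object, SELECT here is free to return any pending value rather than the most recent one.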
The basic write-buffer platform is defined similarly to the partial store order and total store order write-buffer platforms except that J contains set objects (instead of list objects). The result is a machine that admits many more computations than the partial store order write-buffer machine.
The basic write-buffer system ( P , J , WB) associated with multiprocess (P, J ) is defined as follows:
Objects. For each variable in J , there is a corresponding main-memory variable and there is a set object for that variable in each processor's write-buffer.
More precisely, for each x ∈ J , associate with x a variable x, and for each pair x ∈ J and i ∈ {1, . . . , n}, associate a set object x i . The set of objects J is:
Programs. For set object or list object x i and variable x, define the compound operation MOVE({ x i , x}, v) to be the indivisible sequence of operations: DELETE( x i , v), WRITE( x, v). A sequence of operations on set objects and variables that includes MOVE operations is valid if and only if the sequence modified by replacing each MOVE with the DELETE,WRITE sequence that defines the MOVE is valid.
To distinguish between READ operations on variables in J , and READ operations on the variables in J , the latter are renamed MEM-READ. The transformation γ from J to J is defined by:
The transformation γ is used to transform each WRITE and each READ operation invocation in P yielding a transformed multiprogram P = γ (P ). Clearly, P is compatible with J . The program order of operations in the transformed multiprogram γ (P ) is denoted
Notice that in a basic write-buffer machine, a READ can return the value of any of the previous WRITEs to the same location by the same processor that has not been moved to main memory, not necessarily the latest one. This may seem unrealistic. This excessively permissive definition, however, serves to strengthen our impossibility claims in Section 6.3.
Coherence, Alpha, and RMO Consistency
Coherence is a weak memory consistency condition that is sometimes assumed to be a minimum consistency requirement for any reasonable multiprocessor [Frigo 1998 ]. Coherence only requires the same-object order of Section 4.1. Recall that J is a set of variables and P is any multiprogram compatible with J , and O is all the operations of a computation C of (P, J ).
Coherence. Computation C satisfies Coherence if for each object x, there is a valid total order of all the operations in O|x that preserves same-object order.
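A brute-force check of this definition is sketched below for small computations. The encoding is hypothetical: a computation maps each process name to its list of (kind, variable, value) operations in program order. Two assumptions are made: same-object order is taken to be program order restricted to operations on the same variable, and every variable initially holds 0. The search is exponential and is meant only to illustrate the definition.

```python
def _has_valid_order(streams, initial):
    """Search for an interleaving of per-process streams on ONE variable in which
    every READ returns the value of the most recent preceding WRITE."""

    def search(positions, current):
        if all(pos == len(s) for pos, s in zip(positions, streams)):
            return True
        for i, stream in enumerate(streams):
            pos = positions[i]
            if pos == len(stream):
                continue
            kind, _, value = stream[pos]
            if kind == "READ" and value != current:
                continue  # this READ cannot be placed next
            advanced = positions[:i] + (pos + 1,) + positions[i + 1:]
            if search(advanced, value):
                return True
        return False

    return search(tuple(0 for _ in streams), initial)


def is_coherent(computation, initial=0):
    """True if each variable has its own valid total order preserving program order."""
    variables = {var for ops in computation.values() for _, var, _ in ops}
    return all(
        _has_valid_order(
            [[op for op in ops if op[1] == var] for ops in computation.values()],
            initial,
        )
        for var in variables
    )
```

Because each variable is checked separately, this condition places no constraint on the relative order of operations on different variables.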
According to the Alpha architecture specifications [Compaq Computer Corporation 1998 ], the memory consistency of the Alpha multiprocessor is at least as strong as Coherence. The specifications state that "All processors must provide a coherent view of memory," but indicate that "Write buffers may be used to delay and aggregate writes" (pages 5-4 and 5-5 of Compaq Computer Corporation [1998] ). From the Alpha definition by Attiya and Friedman [1994] , the Alpha multiprocessor becomes exactly Coherence when no control and data dependences are present.
The relaxed memory order consistency of the SPARC machine as described in the SPARC architecture manual [Weaver and Germond 2000] allows some computations that are not Coherent. The third "memory order constraint" (page 260 of Weaver and Germond [2000]), however, requires that if two operations are on the same object, these operations must maintain their program order in the "memory order" provided at least one of them is a WRITE. Relaxed memory order does not require program order to be maintained between two operations that are on different objects, as long as there is no "dependence" relation (page 258 of Weaver and Germond [2000]) between them. This weakening can be expressed as a partial order:
The SPARC relaxed memory order (called RMO) can be defined as:
RMO Consistency. Computation C satisfies RMO consistency if there is a valid total order of all its operations that preserves weak same object order.
Coherence, Alpha and RMO using Write-Buffers
Consider the following multiprogram, where x and y are variables:
The following is a computation of this multiprogram:
Observe that there is only one valid total order for variable x that maintains program order.
Similarly, there is only one valid total order for variable y that maintains program order:
These valid total orders confirm that $C_{WB}$ is Coherent. Notice that all READ operations in $C_{WB}$ are foreign READ operations. Also observe that $C_{WB}$ satisfies neither PSO nor TSO consistency because there is no valid total order that preserves preceding-read order. Now consider the basic write-buffer system:
associated with ({p, q}, {x, y}) where x and y are variables and $x_p$, $y_p$, $x_q$, and $y_q$ are set objects.
CLAIM 6.1. There is no computation of ( P , K , WB) whose interpretation is $C_{WB}$.
PROOF. Suppose there is a computation C of ( P , K , WB) whose interpretation is $C_{WB}$. Because each READ in $C_{WB}$ is a foreign READ, each $\text{READ}_u(z)$ for z ∈ {x, y} and u ∈ {p, q} is transformed to the sequence of operations:
Since C is a basic write-buffer computation, there is a valid total order of its operations that extends buffer-sets order, matching-ops order, and FIFO-per-location-memory order. These partial orders on the operations of C are represented in Figure 4, where buffer-sets order, matching-ops order, and FIFO-per-location-memory order are each drawn with a distinct arrow style. To ensure the validity of the two SELECT operations in dashed boxes, each must be preceded by the MOVE operation by the same process in the solid box, since otherwise SELECT cannot return ⊥. In addition, to ensure validity of the $5 \leftarrow \text{MEM-READ}_p(x)$ and $6 \leftarrow \text{MEM-READ}_q(y)$ operations, each must be preceded by the opposite processor's causally related MOVE operation, $\text{MOVE}_q(\{x_q, x\}, 5)$ and $\text{MOVE}_p(\{y_p, y\}, 6)$ respectively. Adding these four validity arrows creates a cycle:
Since the union of the partial orders cannot be extended into a valid total order that preserves the basic write-buffer orders, computation C does not exist.
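The last step of this argument rests on a standard fact: a set of ordering constraints can be extended to a total order only if it contains no cycle. The following Python sketch shows that check using Kahn's algorithm. The operation names `a` through `d` and the two edge lists are hypothetical placeholders; they do not reproduce the operations or arrows of Figure 4.

```python
from collections import defaultdict, deque

def extend_to_total_order(nodes, edges):
    """Return a total order extending the edges, or None if the edges contain a cycle."""
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    # If some node was never released, it lies on or behind a cycle.
    return order if len(order) == len(nodes) else None

nodes = ["a", "b", "c", "d"]
machine_order_edges = [("a", "b"), ("c", "d")]   # stand-ins for the machine's partial orders
validity_edges = [("b", "c"), ("d", "a")]        # stand-ins for the added validity arrows

print(extend_to_total_order(nodes, machine_order_edges))                   # ['a', 'c', 'b', 'd']
print(extend_to_total_order(nodes, machine_order_edges + validity_edges))  # None
```

Acyclicity is only a necessary condition here: a total order that extends the constraints must additionally be valid for the objects involved, which this sketch does not check.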
Computation $C_{WB}$ establishes that some particular platforms cannot be exactly implemented on any write-buffer machine.
PROOF. $C_{WB}$ was observed to be a Coherent computation. Since it has no control or data dependencies, it is also an Alpha computation. Since RMO is weaker than Coherence, $C_{WB}$ is also an RMO computation. That is, $C_{WB}$ is a computation that could occur on any Coherent, RMO, or Alpha platform. However, by Claim 6.1, $C_{WB}$ is not the interpretation of any computation of any multiprogram when transformed to the basic write-buffer system. Hence, $C_{WB}$ cannot be the interpretation of any write-buffer system computation.
Since there is a computation that is Coherent, RMO, and Alpha but does not satisfy even basic write-buffer consistency, the various write-buffer consistencies are either stronger than or incomparable to Coherence, RMO, and Alpha. To examine this further, first consider Coherence. Any PSO computation has a valid total order that extends same object order, so any PSO computation is Coherent. Since the partial store order write-buffer machine is an exact compiler for a PSO platform, we conclude that PSO (and hence TSO) consistency is strictly stronger than Coherence. RMO is weaker than Coherence, so PSO is also strictly stronger than RMO. For Alpha, the case is more involved. Alpha consistency is stronger than Coherence because any Alpha computation must have a valid total order that preserves control and data dependencies as well as Coherence. The framework as presented in this article does not capture dependencies that arise from individual programs and the private registers of processes. However, we could imagine a write-buffer platform that also imposes the register, control, and data dependencies that are required by Alpha. Such a machine, with PSO consistency for memory operations, would be strictly stronger than Alpha consistency. Equivalently, PSO is strictly stronger than the consistency of Alpha when the additional constraints imposed by each process's local actions on its private registers and the control structure of its program are ignored. The details required to capture the full Alpha consistency are beyond the scope of this article.
Weakening the partial store order write-buffer machine to a basic write-buffer machine changes the relationship of the machine with Coherence, Alpha, and RMO from strictly stronger to incomparable. Consider the trivial multiprocess ({s}, {x}) with just one process and one variable, where the program for s is:
and the computation of the multiprocess ({γ(s)}, {γ(x)}) is:
The following is a valid sequence of all the operations of C1 that preserves matching-ops order, buffer-sets order, and FIFO-per-location-memory order, assuming x is initialized to 0:
Thus, C1 is a computation of the basic write-buffer machine. 
SUMMARY AND FURTHER COMMENTS
This article presented a framework for specifying memory consistency models and proving them correct. Distributed and parallel systems can be uniformly specified at different levels of abstraction. At any level, the system is described as a triple consisting of programs, shared objects, and a memory consistency model. The proofs establish a relationship between the components of one system at one level with the components of a corresponding system at a different level.
The article provides simple memory consistency specifications for the SPARC version 8 architecture variants, total and partial store orders. The framework is used to show that these nonoperational specifications exactly capture the operational descriptions, and also the more complicated non-operational descriptions of the subsequent SPARC version 9 manual. These equivalences highlight some serious flaws in other specification attempts or even in the official manual specifications, such as RMO and Alpha consistency.
The minimum consistency guarantee that would arise from any reasonable implementation on a machine with write-buffers of a multiprogram that uses shared variables is defined. This guarantee is used to prove that each of several well known memory consistency models does not exactly correspond to any implementation that uses write-buffers. For instance, RMO as described in the official SPARC version 9 manual [Weaver and Germond 2000] cannot be a description of a write-buffer machine. We conjecture that the manual description was not faithful to the intended operation of an RMO machine. As another example, the memory consistency of the Alpha multiprocessor [Compaq Computer Corporation 1998] as formalized by Attiya and Friedman [1994] cannot be described as a basic write-buffer system. The DEC-Alpha reference manual explicitly states that write-buffers can be utilized in Alpha multiprocessors (see pages 5-4 and 5-5 of Compaq Computer Corporation [1998]). However, we have shown that any such utilization of write-buffers would necessarily give a memory consistency model more constrained than that of Alpha.
The assumption of distinct input values for state-change operations is for convenience and is often assumed in memory consistency modelling. In this article, it simplifies the definitions of set and list objects considerably. The more general definitions would add messiness to the already detailed proofs without adding insights. Notice that determining whether a read is "foreign" or "domestic" is easy if written values do happen to be distinct, but this is not part of the definition of foreign/domestic reads. Whether or not some reads are foreign could possibly be inferred in other ways by analyzing the program even when written values are not distinct. For example, reads of a shared single-writer variable by all other processes are necessarily foreign. When it cannot be asserted that a read is foreign, programmers must allow for the case that the read may be domestic and therefore ensure their code is correct under the additional re-orderings that might arise. For chip testing and verification purposes, which is one area where we envision this work to be applicable, unique values can be enforced by augmenting each value with a unique Lamport time-stamp [Lamport 1978] .
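One possible way to enforce unique written values in a test harness is sketched below in Python. Each written value is paired with a (counter, process id) time-stamp in the style of Lamport clocks, so two writes of the same raw value remain distinguishable. The class `TaggingProcess` and its methods are hypothetical and are offered only as an illustration of the idea, not as the mechanism used in this article.

```python
class TaggingProcess:
    """Hypothetical test-harness wrapper that makes every written value unique."""

    def __init__(self, pid):
        self.pid = pid
        self.clock = 0

    def tag_write(self, value):
        """Advance the local clock and attach a (clock, pid) time-stamp to the value."""
        self.clock += 1
        return (value, (self.clock, self.pid))

    def observe(self, tagged_value):
        """On a read, catch the local clock up to the time-stamp that was seen."""
        _value, (seen_clock, _pid) = tagged_value
        self.clock = max(self.clock, seen_clock)

def writer_of(tagged_value):
    """Recover the writing process from the time-stamp."""
    return tagged_value[1][1]

p, q = TaggingProcess("p"), TaggingProcess("q")
first = p.tag_write(7)
second = q.tag_write(7)    # same raw value, different process
q.observe(first)
third = q.tag_write(7)     # same raw value, same process, later clock

print(first, second, third)
print(writer_of(first), writer_of(third))
```

Time-stamps from different processes differ in the process id, and time-stamps from the same process differ in the counter, so no two tagged values coincide. With such tags, the process that wrote the value returned by a read can be identified directly from the value.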
Many concerns such as asynchrony, weak consistency and faults make implementation of correct distributed systems very difficult. Writing correct programs is facilitated by providing the parallel and distributed application developers with definitions at the level of the programming instructions. Elsewhere [Higham and Kawash 2000, 2005; Kawash 2000], we exploited the nonoperational definitions for TSO and PSO developed in this article to show some algorithmic possibilities and impossibilities for these architectures. For example,
- In contrast to Sequential Consistency, there is no solution for mutual exclusion on a TSO or PSO machine using only variables, so expensive synchronization primitives are essential.
- There is a (non-wait-free) construction of a producer-consumer queue in a TSO or PSO system, using only reads and writes of variables for any number of producers and consumers.
- One read/write (multi-writer) variable is necessary and sufficient for building a one producer and one consumer queue in a TSO or PSO system.
- There are wait-free solutions for a two-process producer-consumer queue that use only single-writer variables in a TSO or PSO system.
The task of reasoning about the correctness of distributed systems is further eased if programmers can assume a strong consistency model such as Sequential Consistency or Linearizability. This motivates us to hide the consistency model complexities by building compilers that transform a program that is correct under Sequential Consistency to an equivalent program that is also correct under weaker consistency models. Necessary initial steps include understanding exactly what constraints are guaranteed by a given weak system and refining techniques that prove the correctness of these compilers, as is demonstrated in this article.
APPENDIX
A. PROOF OF THEOREM 5.1
A.1 Transformation τ Is a Compiler
Construction Three. This procedure constructs a computation, $C_a$, of $(P_a, J_a)$ and a sequence, $S_a$, of all the operations of $C_a$ from a valid total order $(O, \xrightarrow{*so})$ of all the operations of a computation $C^*$ of $(\tau(P_a), J, \text{PSO})$ or of $(\tau(P_a), J, \text{TSO})$.
PROOF. Because $*so$ is a total order of all operations in $C^*$, the sequence $S_a$ created by Construction Three is a total order of the operations in $C_a$. It remains to show $S_a$ is valid. The sequence $*so$ is valid (that is, using the validity condition of Section 5.1).
By Construction Three and the validity of $*so$, any READ that relates to a MEM-READ will return the value of the most recent WRITE to the same object that precedes the READ in $S_a$. However, we must show that there is no other WRITE $w$ to that object by the same process such that $w$ precedes this READ in $<_p$ of $C_a$ and follows it in $S_a$. Suppose such a $w$ exists. Let $mr$ be the related operation to this READ $r$ and $l$ be its matching LAST
Now consider a READ $r$ that relates to a non-⊥ LAST operation $l$. It returns the value of a later WRITE $w$ in $S_a$ whose related MOVE $m$ matches the most recent preceding APPEND $a$ in $*so$ by the same process to the same object. This $a$ is causally related to $l$. By the validity of $*so$, we have $a \xrightarrow{*so} l \xrightarrow{*so} m$. Since $a \xrightarrow{*so} l$ and $*so$ satisfies blocking-loads order, it must be that $a \xrightarrow{prog} l$ in $C^*$. Since $C^*$ is a computation of $\tau(P_a)$, we must also have $a \xrightarrow{prog} m \xrightarrow{prog} l$. Hence, $w <_p r$ in $C_a$. It remains to show that there is no later WRITE $w'$ to the same object by the same process as $r$ where $w \xrightarrow{S_a} w'$ and $w' <_p r$. Let $w'$ be any WRITE after $w$ in $S_a$. Its related MOVE $m'$ must follow $m$ in $*so$. Since $*so$ preserves FIFO-per-location-memory order, it follows that $m \xrightarrow{prog} m'$. By blocking-loads order, the APPEND $a'$ that matches $m'$ must follow $a$ in $*so$. But then, by the validity of $l$ (i.e., the assumption that $a$ was the most recent preceding APPEND in $*so$), $l \xrightarrow{*so} a'$. Thus, $r <_p w'$ and $w$ is the latest WRITE in $S_a$ that precedes $r$ in program order, and therefore $S_a$ is valid.
(1) from $(J_a, \text{PSO}_a)$ to $(J, \text{PSO})$ and (2) from $(J_a, \text{TSO}_a)$ to $(J, \text{TSO})$.
PROOF. For any multiprogram $P_a$ compatible with $J_a$, consider any computation $C^*$ of $(\tau(P_a), J, \text{PSO})$ (respectively, $(\tau(P_a), J, \text{TSO})$). Construct the computation $C_a$ and the sequence $S_a$ using Construction Three from the total order $*so$ of the operations in $C^*$. By Claim A.
A.2 Transformation τ Is an Exact Compiler
Let $(O_a, \xrightarrow{*so_a})$ be a valid total order of all the operations of a computation $C_a$ of $(P_a, J_a, \text{PSO}_a)$ or $(P_a, J_a, \text{TSO}_a)$. Let $*so_a$ be the unique sequence that agrees with $(O_a, \xrightarrow{*so_a})$. Construction Two of Section 4 is applied to the computation $C_a$ of $(P_a, J_a, \text{PSO}_a)$ or $(P_a, J_a, \text{TSO}_a)$ and the sequence $*so_a$, yielding a computation $C$ of $(\tau(P_a), J)$ and a sequence $S$ of all the operations, $O$, in $C$. The proofs of Claims 4.5, 4.6, and 4.7 remain true when $*so_a$ replaces $*so$ as an input sequence for Construction Two. Since $*so_a$ satisfies a different validity condition from $*so$, the following claim is required to replace Claim 4.8.
CLAIM A.5. The sequence S is a valid total order of all operations in C.
PROOF. Exactly the same operations have been inserted into T and into D. Also, exactly the same operations have been altered or deleted from T and D before renaming to S and C. Thus, S is a total order of the operations in C.
Sequence $*so_a$ is a valid sequence of operations on SPARC variables. Consider any READ operation, v←READ p j (x), in $*so_a$. Call this READ $r$. It returns the value of the latest in $*so_a$ of either (1) the most recent WRITE to the same object that
