Abstract. We formulate a programmer-centric description of the memory consistency model provided by the Itanium architecture. This allows reasoning about programs at a non-operational level in the natural way, not obscured by the implementation details of the underlying architecture. However, our definition is not tight. We provide two very similar definitions and show that the specification of the Itanium memory model lies between the two. These two definitions are motivated by slightly different implementations of load-acquire instructions.
Introduction
We contend that for programming purposes, a memory consistency model should be specified as a set of (ordering) rules on the instructions used by the programmer, rather than on a lower level collection of operations. Furthermore, the validity condition should be the natural notion of validity of sequences of these instructions acting on the objects of the system. For example, in a valid sequence of loads and stores, the value returned by each load instruction should be the value written by the most recent preceding instruction in the sequence that stored a value to the same memory location. Such a description is useful to a programmer of the system since she can reason about her code directly, and therefore we call it programmer-centric. Descriptions in terms of lower level operations specify an implementation (in hardware or on a virtual platform) and are useful for an architect who is building the system, but should not be confused with its specification. In this case these lower level implementations should be proved equivalent to the specification. A further advantage of our approach is that constructions can be composed. A high level specification of an object oriented system can be implemented by a succession of constructions, such that an implementation at one level is the specification for a still lower level, and each level of implementation is proved to correctly implement its specification. This, of course, is the familiar notion of abstraction; we simply extend it to weak models of memory consistency.
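The notion of validity described above can be sketched in a few lines. The following is a minimal illustration, not part of the formal framework: the tuple encoding of instructions, the function name `is_valid`, and the `init` map of initial values are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: ("st", location, value stored) or
# ("ld", location, value returned).
Instruction = Tuple[str, str, int]


def is_valid(sequence: List[Instruction], init: Dict[str, int]) -> bool:
    """Return True if every load returns the value written by the most
    recent preceding store to the same location (or the initial value)."""
    memory = dict(init)  # current value of each location
    for kind, loc, value in sequence:
        if kind == "st":
            memory[loc] = value
        elif kind == "ld":
            if memory.get(loc) != value:
                return False
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return True


valid = [("st", "x", 1), ("ld", "x", 1), ("st", "x", 2), ("ld", "x", 2)]
stale = [("st", "x", 1), ("st", "x", 2), ("ld", "x", 1)]
print(is_valid(valid, {"x": 0}), is_valid(stale, {"x": 0}))
```

The second sequence is rejected because its load returns a value that has already been overwritten.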
In previous work we established a framework for specifying programmer-centric memory consistency models and for proving such equivalences between specifications and implementations [7, 6]. This paper applies these ideas to the Intel Itanium architecture. That is, we aim for a programmer-centric specification of the memory consistency of the Itanium multiprocessor. As will be seen, we failed to realize this goal. Instead, we define two very similar programmer-centric memory consistency models, Itanium A and Itanium B, and show that "official" Itanium memory consistency [9], henceforth referred to as Itanium (with no subscript), lies strictly between these two (Section 3). Itanium B and Itanium A differ only slightly in the ordering constraints involving Itanium load-acquire instructions, and each is motivated by a plausible hardware implementation. Several other plausible definitions also fail to exactly capture the Itanium memory consistency specification (Section 5). The main results of this paper are preceded, in Section 2, with an overview of the Itanium architecture and a synopsis of its operational level memory consistency as described by Intel [9].
Several other frameworks for describing memory consistency have been proposed but are not central to this paper. The framework of Adir, Attiya and Shurek [1] is very similar to ours and precedes it. Arvind and Maessen [3] provide a framework for serializable memory models. We are unaware, however, of how to use these frameworks to prove equivalence between systems. Yang et al. [11, 12, 4] present a non-operational approach to specifying and analyzing shared memory consistency models and use it to provide a translation of the rules of the Itanium specification. Adve and Gharachorloo [2] consider the question of programmer centricity of memory consistency models and provide it by giving the illusion of sequentially consistent memory. The TLA work of Joshi et al. [10] is a precise specification of Itanium and is the basis of the official specification [9].
Itanium Multiprocessors

Itanium Architecture
The Itanium specifications [9] are independent of specific machine implementations. Although we do not know of a concrete machine implementation that exactly captures the Itanium specifications, in this subsection we overview some of the architectural features of such a machine.
Itanium provides a distributed-shared memory (DSM) architecture where each processor maintains a replicated copy of the shared address space. The rules that govern processor execution and inter-processor communication are complicated and give rise to complex behaviors. Itanium also supports write-buffers with read by-passing, which further complicates the behavior of Itanium. Bypassing loads can complete before earlier buffered stores and give rise to an out-of-order execution. When a processor loads a variable that it never stores, however, (such as a single-writer variable owned by a different processor) the load returns the value from the local replica rather than from the local buffer. Itanium write buffers are guaranteed to be FIFO only per variable. Hence, two store instructions to different variables can be applied to a replica in the opposite order to that in which they occur in a processor's program.
To constrain out-of-order execution, Itanium supports the extensions of "acquire" and "release" to load and store instructions, respectively. A load-acquire instruction (denoted ld.acq) is always performed before any subsequent instruction in the program. A store-release (denoted st.rel) is always performed after every preceding instruction in the program. Store-releases also constrain interprocessor interaction. Specifically, incoming store-releases force earlier stores by the same processor to be applied remotely in the same order they are applied at the issuing processor. Acquires and releases also restrict the write buffer's behavior. For instance, when a release is buffered it forces all previously buffered stores to be removed from the buffer and applied to the local replicas before the release itself. A load-acquire can also force the buffer to be flushed, but this is not necessary in general.
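The buffer behavior described in this subsection can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of any Itanium implementation: the class `ToyProcessor` and its method names are hypothetical, there is a single processor with one local replica, and a store-release is applied to the replica immediately after the flush instead of being buffered itself.

```python
from collections import deque
from typing import Deque, Dict


class ToyProcessor:
    """Toy model: one local replica plus a write buffer that is FIFO
    only per variable."""

    def __init__(self, init: Dict[str, int]) -> None:
        self.replica: Dict[str, int] = dict(init)
        self.buffer: Dict[str, Deque[int]] = {}

    def store(self, var: str, value: int) -> None:
        # A plain store is only buffered; it reaches the replica later.
        self.buffer.setdefault(var, deque()).append(value)

    def load(self, var: str) -> int:
        # Read bypassing: the newest buffered store to var, if any,
        # satisfies the load; otherwise the local replica does.
        pending = self.buffer.get(var)
        if pending:
            return pending[-1]
        return self.replica[var]

    def drain_one(self, var: str) -> None:
        # Apply the oldest buffered store to var; other variables are
        # untouched, so different variables may drain in any order.
        pending = self.buffer.get(var)
        if pending:
            self.replica[var] = pending.popleft()

    def store_release(self, var: str, value: int) -> None:
        # A release flushes every previously buffered store to the
        # replica, then takes effect itself (simplified: no buffering).
        for v, pending in self.buffer.items():
            while pending:
                self.replica[v] = pending.popleft()
        self.replica[var] = value


p = ToyProcessor({"x": 0, "y": 0, "z": 0})
p.store("x", 1)
p.store("y", 2)
p.drain_one("y")          # y reaches the replica before x
print(p.replica, p.load("x"))
p.store_release("z", 3)   # forces the pending store to x out first
print(p.replica)
```

In this run the store to y is applied to the replica before the earlier store to x, while the load of x is still satisfied from the buffer; the store-release then forces x into the replica before z is written.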
Itanium Memory Consistency According to the Itanium Manual
Itanium memory consistency is specified in the Intel manual [9]. We paraphrase (and simplify) it here as concisely as possible, so that it can be compared to the programmer-centric version that we describe in Section 3. When the same things are named differently in the manual [9] Similarly, the value loaded by ld[.acq] (respectively, read by RD) is denoted RdVal(ld[.acq]) (respectively, RdVal(RD)). Every location b in memory has an initial value, denoted by InitVal(b), that will be returned to read operations when they occur before there are any write operations to that location.
Any computation of the basic Itanium processor family memory ordering model must have an associated visibility order, which linearly orders all the operations that correspond to all the instructions of the computation and satisfies the Itanium rules below. If there is no visibility order for a computation that satisfies all of these rules, the computation is not permitted by the architecture.
If an instruction i is by a processor p, we write p = Proc(i). For any two operations O and U, $O \xrightarrow{V} U$ means that O precedes U in the visibility order V. If there is a store instruction st p (x, ·) and a load instruction ld p (x) such that
) is a local read in V and ld p (x) is a local load in V (or simply a local load or local read when V is clear).
Itanium Rules (WO):
Every store becomes visible locally before it becomes visible remotely.
For every store st
and -
The remaining rules determine what value must be returned by a load, which depends on the placement of the read of the load within the low level write operations to the same location.
Programmer-Centric Itanium-Based Consistency
Framework
As each process in a multiprocess system executes, it issues a sequence of instruction invocations on shared memory objects. For this paper the shared memory consists of only shared variables, and each instruction invocation is Itanium-based. That is, each instruction invocation is of the form st p (x, v) or st.rel p (x,
An (Itanium-based) multiprogram is a finite set of these individual programs.
An instruction is an instruction invocation completed with a response. In our setting the response of a store instruction invocation or a fence instruction invocation is an acknowledgment and is ignored. The response of a load invocation is the value returned by the invocation. A (multiprocess) computation of an Itanium-based multiprogram P is created from P by changing each load instruction invocation, ld p (x) (respectively, ld.acq p (x)) to ν ←ld p (x) (respectively, ν ←ld.acq p (x)) where ν is either the initial value of x or some value stored to x by some store to x in the multiprogram.
Notice that the definition of a computation permits the value returned by each ld(x) or ld.acq(x) instruction invocation to be arbitrarily chosen from the set of values stored to x by the multiprogram. In an Itanium (or any other) multiprocessor, the values that might actually be returned are substantially further constrained by its architecture, which determines the way in which the processes communicate and the way in which shared memory is implemented. A memory consistency model captures these constraints by specifying a set of additional requirements that computations must satisfy. Typically, these require the existence of a set of sequences of instructions that satisfy certain properties. A collection of such sequences for a computation C that meet all the requirements of memory consistency model M is called a set of M-verifying sequences for C. We use C(P, M) to denote the set of all computations of multiprogram P that satisfy the memory consistency model M. Memory consistency model M is stronger than M′ if, for every multiprogram P, C(P, M) ⊆ C(P, M′).
The terms weaker and strictly weaker are defined similarly.
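To make the definitions of computation and relative strength concrete, the sketch below enumerates every computation of a small multiprogram and compares two sets of permitted computations by inclusion. It is illustrative only: the encoding of invocations, the function names `computations` and `is_stronger`, and the reading of "stronger" as "permits a subset of the computations" are assumptions of this example.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Hypothetical encoding of invocations: ("st", var, value) or ("ld", var).
Invocation = Tuple
# A computation records, for each load, (processor, position, value returned).
Computation = Tuple[Tuple[str, int, int], ...]


def computations(prog: Dict[str, List[Invocation]],
                 init: Dict[str, int]) -> Set[Computation]:
    """Enumerate all computations: each load returns the initial value of
    its variable or any value stored to it somewhere in the multiprogram."""
    stored: Dict[str, Set[int]] = {}
    for program in prog.values():
        for inv in program:
            if inv[0] == "st":
                stored.setdefault(inv[1], set()).add(inv[2])
    loads = [(p, k, inv[1])
             for p, program in prog.items()
             for k, inv in enumerate(program) if inv[0] == "ld"]
    choices = [sorted({init[x]} | stored.get(x, set())) for _, _, x in loads]
    result: Set[Computation] = set()
    for values in product(*choices):
        result.add(tuple((p, k, v) for (p, k, _), v in zip(loads, values)))
    return result


def is_stronger(allowed_m: Set[Computation],
                allowed_m2: Set[Computation]) -> bool:
    """On one multiprogram: M is stronger than M' if the computations
    permitted by M are a subset of those permitted by M'."""
    return allowed_m <= allowed_m2


prog = {"p": [("st", "x", 1)], "q": [("ld", "x")]}
all_computations = computations(prog, {"x": 0})
print(sorted(all_computations))
```

Here q's single load may return either the initial value 0 or the stored value 1, giving two computations; a memory consistency model then selects a subset of them.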
The description of a memory consistency model is simplified by assuming that each store instruction invocation has a distinct value. Although it is technically straightforward to remove this assumption, without it, the description of the memory model is messy and its properties are consequently obscured. 
For an Itanium-based computation C, I(C) denotes all the instructions in C. I(C)|p is the subset of I(C) in processor p's program sequence; I(C)|x is the subset of I(C) applied to variable x; I(C)|r is the subset containing only the load instructions; I(C)|w
−→ ).
A load instruction is domestic if the value it returns was stored into shared memory location x by a store instruction by the same processor; memory fence instructions and load instructions that are not domestic are foreign. If an instruction, i, with load semantics returns the value stored by an instruction, j, with store semantics then i and j are causally related.
Weak and Strong Itanium Memory Consistency
This section uses the framework of Subsection 3.1 to formulate two programmer-centric definitions of Itanium consistency and describe informally what the definitions are intended to capture. They differ only slightly in the constraints on ld.acq instructions.
Define the following partial orders. Let $i, j \in I(C)$ such that $i \xrightarrow{prog} j$.
Acquire A: $i \xrightarrow{\text{Acquire A}} j$ if and only if $i \in I(C)|acq$.
Acquire B: $i \xrightarrow{\text{Acquire B}} j$ if and only if $i \in I(C)|acq$ and $i$ is foreign.
Acquire A describes a conservative implementation of ld.acq instructions, which requires any ld.acq to precede all instructions that follow it in the program. In the presence of buffers, certain architectural decisions can sacrifice this "text-book" behavior. For instance, Acquire B captures the situation when a ld.acq can be satisfied from the buffer (a domestic ld.acq). A ld that follows the ld.acq in program order could by-pass the buffer, or a following st to a different variable could be committed to the local replica earlier than the buffered st that is used to satisfy the ld.acq. In these cases, the program order between the ld.acq and the subsequent ld or st is not necessarily preserved. There is one occurrence of each st in a processor's view, and these views are constructed based on the order in which stores occur in the local replicas. To maintain the intuitive notion of validity, the ld.acq must be delayed in the view until its causally-related st occurs in the local replica. Hence, a domestic ld.acq may occur in a view after a ld or a st that follows it in program order.
Acquire B allows this behavior, but prohibits it when the ld.acq is foreign (necessarily satisfied from the local replica rather than the buffer). It is also prohibited when ld.acq and the ld are applied to the same variable: if the ld.acq is satisfied from the buffer, then either the ld is also satisfied from the buffer or, if not, the st under consideration must have been applied to the local replica. This will be taken care of by the coherence requirement in the Itanium consistency definition.
One mechanism to prohibit a domestic ld.acq from occurring in a processor's view later than it should is to flush the buffer before the ld.acq is completed, ensuring that the ld.acq is always satisfied from the local replica. Such an architecture could achieve views satisfying Acquire A.
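The difference between the two acquire orders can be sketched on a single processor's program. This is an informal illustration under stated assumptions: the `Instr` record and the function `acquire_order` are hypothetical, I(C)|acq is taken to be the set of ld.acq instructions, and whether a load is domestic is supplied as a flag instead of being derived from a computation.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Instr:
    name: str               # label used only to report ordered pairs
    kind: str               # "ld", "ld.acq", "st", or "st.rel"
    var: str
    domestic: bool = False  # loads only: value was stored by the same processor


def acquire_order(program: List[Instr], variant: str) -> Set[Tuple[str, str]]:
    """Pairs (i, j), with i before j in program order, that the chosen
    acquire order (variant "A" or "B") requires a view to preserve."""
    if variant not in ("A", "B"):
        raise ValueError("variant must be 'A' or 'B'")
    pairs: Set[Tuple[str, str]] = set()
    for pos, i in enumerate(program):
        if i.kind != "ld.acq":
            continue
        if variant == "B" and i.domestic:
            continue  # Acquire B orders only foreign ld.acq instructions
        for j in program[pos + 1:]:
            pairs.add((i.name, j.name))
    return pairs


program = [
    Instr("st_x", "st", "x"),
    Instr("acq_x", "ld.acq", "x", domestic=True),  # satisfied by st_x
    Instr("ld_y", "ld", "y"),
]
print(acquire_order(program, "A"))
print(acquire_order(program, "B"))
```

With a domestic ld.acq, Acquire A still orders it before the following ld, whereas Acquire B imposes no constraint, which matches the buffer scenario discussed above.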
The following definition is parameterized by an arbitrary partial order on I(C), denoted R, which will be replaced by various partial orders (such as Acquire A and B) to construct variants of Itanium consistency. 
Itanium A abbreviates Itanium R when R = Acquire A. Itanium B is defined similarly (R = Acquire B). Section 5 defines additional Itanium models based on further variants of acquire orders.
Notice that a view of a processor consists of its own instructions in addition to the store instructions of all other processors. A specified Acquire Order is maintained by each view (item 1). The Release Order (item 2) is simply what a programmer expects: any instruction preceding a st.rel must maintain this order in the processors' views. Item 3 specifies the coherence requirement. The remaining items are requirements that establish some agreement between the views of each processor. Since channels between processors are FIFO for each variable, the communicated store instructions to the same variable must appear in every view in the same order (item 4). A st.rel instruction occurs in all replicas atomically, so item 5 requires the st.rel instructions to be seen in the same order by all processors. Furthermore, by item 6, if a st is seen by the storing processor after a st.rel, that st must be seen after the st.rel by all processors. Item 7 is a technical condition arising from timing considerations. Consider a store s p by p and a store s q by q. Since a store is visible to the storing processor before it is visible to others, it is not possible for p to see s q before s p , and yet for q to see s p before s q . Item 7 generalizes this to any number of processors.
Itanium Is Strictly Between Itanium A and Itanium B
Because of space constraints, the proofs of our two major theorems, Theorems 1 and 2, are omitted but can be found elsewhere [8].
Computation 1 does not satisfy Itanium because of the following cycle of operations:
−→ R(4 ←ld.ac(y))
Any verifying visibility sequence is a total order, so no such sequence could extend the orders of this cycle. 
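The argument relies on the fact that a set of ordering constraints can be extended to a total order only if it contains no cycle. The sketch below checks this with Python's standard `graphlib` module (Python 3.9 or later); the function name `linear_extension` and the operation labels are placeholders chosen for this example and do not correspond to the operations of Computation 1.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, List, Optional, Set, Tuple


def linear_extension(edges: Iterable[Tuple[str, str]]) -> Optional[List[str]]:
    """Return a total order that extends the constraints, where an edge
    (a, b) means a must precede b, or None if the constraints are cyclic."""
    predecessors: Dict[str, Set[str]] = {}
    for a, b in edges:
        predecessors.setdefault(a, set())
        predecessors.setdefault(b, set()).add(a)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


print(linear_extension([("A", "B"), ("B", "C")]))
print(linear_extension([("A", "B"), ("B", "C"), ("C", "A")]))
```

The acyclic constraints admit a total order, while the cyclic ones do not, so no visibility sequence can satisfy them.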
Itanium A does not allow Computation 2 since Itanium A requires all ld.acq instructions to be satisfied from the local replica rather than the buffer. Hence, st q (x, 3) is guaranteed to be applied to q's replica before even st q (y, 4) is buffered. p must see st q (y, 4) before it buffers st p (x, 5) because it sees the value in y through a ld.acq instruction. When p sees st q (y, 4), the value of x in q's replica must be 3. p's st.rel p (z, 2) forces st p (x, 5) to be applied everywhere before the st.rel itself. When q sees st.rel p (z, 2), it must also have seen st p (x, 5). So the value of x in q's replica must be 5, overwriting the earlier value of 3. 3 ←ld q (x) must take place after 2 ←ld.acq(z), since Itanium A requires the ld.acq to precede any following instruction. However, we have already argued that the value of x according to q cannot be 3 at that point.
Formally, the Itanium A sequence, S p , must extend: 
. This makes the final 3 ←ld q (x) invalid.
Other Acquire Orders
Itanium B and Itanium A bound Itanium, and they differ only slightly in the Acquire Order. So a natural question is: "Is there a definition of an Acquire Order that yields a programmer-centric memory consistency specification that is equivalent to Itanium?" This section examines several plausible Acquire Order definitions and compares their relative strengths. One interesting result is another memory consistency model that is weaker than Itanium A yet still strictly stronger than Itanium.
Define the write-before-read relation (I(C), Acquire C models a possible implementation with two load instructions, i 1 = ld.acq and i 2 = ld or ld.acq, where i 1 precedes i 2 in program order. Here i 1 checks the write buffer, misses, bypasses any pending stores, and returns its value from the local replica. Meanwhile i 2 hits the buffer and returns. The effect is that i 2 bypasses i 1, because when the processor's view is constructed i 2 is delayed until its causally-related buffered write is committed to the local replica. Acquire D restricts the behavior in which any instruction can similarly bypass an earlier (in program order) domestic ld.acq: the bypassing instruction cannot be moved too early in the processor's view. It must follow the st that is causally related to the bypassed ld.acq.
These two partial orders give rise to two new definitions for Itanium consistency, in particular Itanium C (Definition 1 with R = Acquire C) and Itanium D (Definition 1 with R = Acquire D).
More variants of the Itanium memory consistency model are defined by combining the four basic acquire orders based either on intersection or conjunction as follows. Let γ, β ∈ {A, B, C, D}.
Intersection: A computation C satisfies Itanium γ∩β if C satisfies Itanium γ and Itanium β .
Conjunction: A computation C satisfies Itanium γ∧β if C satisfies Itanium γ∩β and there is a set of Itanium γ -verifying sequences for C that are also Itanium β -verifying sequences for C.
Note that the models Itanium γ∩β allow the Itanium γ -verifying sequences for C to be different from the Itanium β -verifying sequences for C. Hence, Itanium γ∧β is stronger than Itanium γ∩β . Since Itanium A is stronger than each of Itanium B , Itanium C , and Itanium D , this introduces six new and distinct Itanium memory consistency models: Itanium C∩B , Itanium C∩D , Itanium D∩B , Itanium C∧B , Itanium C∧D , and Itanium D∧B . Observe that Itanium A is also stronger than each of the models Itanium C∧B , Itanium C∧D and Itanium D∧B . Figure 1 shows the relative strength of each system. The proofs are elsewhere [8].
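The distinction between intersection and conjunction can be illustrated by treating each set of verifying sequences as an opaque object. In this sketch the function names and the string labels standing in for verifying sequences are hypothetical; `gamma` and `beta` hold, for one computation C, every set of Itanium γ-verifying and Itanium β-verifying sequences respectively.

```python
from typing import FrozenSet, Set

# One set of verifying sequences for C, represented opaquely by labels.
VerifyingSet = FrozenSet[str]


def satisfies_intersection(gamma: Set[VerifyingSet],
                           beta: Set[VerifyingSet]) -> bool:
    """C satisfies both models, possibly witnessed by different sets."""
    return bool(gamma) and bool(beta)


def satisfies_conjunction(gamma: Set[VerifyingSet],
                          beta: Set[VerifyingSet]) -> bool:
    """Some single set of sequences is verifying for both models."""
    return bool(gamma & beta)


# Each model is satisfied, but no common witness exists.
gamma = {frozenset({"S1", "S2"})}
beta = {frozenset({"T1", "T2"})}
print(satisfies_intersection(gamma, beta), satisfies_conjunction(gamma, beta))
```

A computation like this hypothetical one satisfies the intersection model but not the conjunction model, which is why the conjunction model is the stronger of the two.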
A conclusion of this investigation is that Itanium D∧B is weaker than Itanium A but still stronger than Itanium. At present a programmer-centric consistency model that is equivalent to Itanium has not been identified. However, there is promise in this technique of strengthening the Acquire B order.
