Existing memory models and cache consistency protocols assume the memory coherence property, which requires that all processors observe the same ordering of write operations to the same location. In this paper, we address the problem of defining a memory model that does not rely on the memory coherence assumption, and also the problem of designing a cache consistency protocol based on such a memory model. We define a new memory consistency model, called Location Consistency (LC), in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write and synchronization operations. We prove that LC is strictly weaker than existing memory models, but is still equivalent to stronger models for parallel programs that have no data races.
Introduction
A memory consistency model specifies the semantics of concurrent memory operations, such as load/store and synchronization operations, in a multiprocessor system. The most commonly used hardware memory consistency model is sequential consistency (SC) [17]. The main approach taken in recent work on memory consistency models is to allow performance optimizations to be applied, while ensuring that sequential consistency is retained for a restricted class of programs, mainly programs that do not exhibit data races [1] (see footnote 1). Therefore, we refer to these weaker memory consistency models as SC-derived models.
A fundamental limitation of all SC-derived memory consistency models is the memory coherence assumption, which can be stated as follows [12]: "all writes to the same location are serialized in some order and are performed in that order with respect to any processor". We have argued earlier that the memory coherence assumption poses fundamental obstacles to defining a scalable and efficient view of memory consistency in computer systems [11]. The memory coherence assumption imposes restrictions on the ordering of memory operations that go beyond the partial order defined by synchronization operations in a parallel program. In this paper, we address the problem of defining a memory consistency model and of designing an accompanying cache consistency protocol based on a partial order semantics that does not rely on the memory coherence assumption.
The rest of the paper is organized as follows. Section 2 introduces the LC memory model. Section 3 establishes some important properties of the LC model to demonstrate its usefulness. Section 4 describes the LC cache consistency protocol. Section 5 discusses related work, and Section 6 contains our conclusions.
The Location Consistency (LC) Model
The main purpose of a memory model is to serve as an agreement between hardware and software on the semantics of memory operations so as to ensure correct execution of user programs. However, the bulk of past work on memory models has been pursued from the hardware viewpoint. These models assume the memory coherence property because it is viewed as a "natural" property from the hardware viewpoint. Our position is that it is essential to adopt an end-to-end view of memory consistency to obtain memory models that are scalable and that can be understood uniformly at all levels of software and hardware. We believe that this is possible with a memory consistency model based on partial order execution semantics, which is the motivation for the Location Consistency (LC) model described in this section.
Footnote 1: In a data-race-free parallel program, all accesses to shared data must be protected by data synchronization or by control sequencing.
Program Model
In this section, we outline the program model assumed for this work. This model is constructed in such a way as not to favor any particular concurrent programming paradigm. The focus of the LC model is on defining memory consistency. To that end, we define a small set of memory and synchronization operations that are relevant to defining memory consistency. Using these memory and synchronization operations as primitives, the LC model can be used to support parallel control structures selected from a variety of different parallel program models such as fork-
The primitive memory and synchronization operations assumed in defining the LC model are as follows. For simplicity, we assume that all other shared-memory and synchronization operations in a multiprocessor system can be defined as a combination of these memory and synchronization operations. A definition of the LC model for a richer set of memory and synchronization operations can be found in [9, 10].
Memory write: if processor $P_i$ needs to write value $v$ in location $L$, it performs a $\mathit{write}(P_i, v, L)$ operation, which we also represent by the notation L := v in processor $P_i$'s instruction sequence.
Memory read: if processor $P_i$ needs to read a value from location $L$, it performs a $\mathit{read}(P_i, L)$ operation, which we also represent by the notation read L in processor $P_i$'s instruction sequence.
Acquire-release data synchronization: if processors $P_1, \ldots, P_k$ need exclusive access to a shared location $L$, the synchronization is accomplished by each processor performing an $\mathit{acquire}(P_i, L)$ operation followed by a $\mathit{release}(P_i, L)$ operation [12]. In general, the processors may use acquire-release to request exclusive access to a set of shared variables, $\{L_1, \ldots, L_m\}$, rather than a single location.
LC Memory Consistency Model
We now define the LC memory model for the program model outlined in section 2.1. Definitions of the LC memory model for more complicated program models were provided in [9, 10].
Abstraction of Memory System
In Location Consistency, we view the state of a memory location as a partial order (rather than a total order) of write operations. Specifically, the state of a memory location, $L$, is conceptually represented by a partially ordered multiset (pomset), $\mathit{state}(L) = (S, \prec)$, where $S$ is a multiset and $\prec \subseteq S \times S$ is a partial order on $S$. Each element $e$ of the multiset $S$ denotes a write operation or a synchronization operation involving location $L$. We define $\mathit{processorset}(e)$ to be the set of processors involved in operation $e$. For the operations defined in section 2.1, $\mathit{processorset}(e)$ will always be a singleton; however, $\mathit{processorset}(e)$ can be a set in general when considering operations such as barrier synchronization as discussed in [9, 10].
Specifically, an element $e$ in multiset $S$ can denote one of three different types of operations:
Element $e = \mathit{write}(P_i, v, L)$ denotes a write operation on location $L$ from processor $P_i$ with value $v$ and $\mathit{processorset}(e) = \{P_i\}$.
Element $e = \mathit{acquire}(P_i, L)$ denotes an acquire synchronization operation on location $L$ by processor $P_i$ with $\mathit{processorset}(e) = \{P_i\}$.
Element $e = \mathit{release}(P_i, L)$ denotes a release synchronization operation on location $L$ by processor $P_i$ with $\mathit{processorset}(e) = \{P_i\}$. For the sake of simplicity, we assume that each consecutive acquire/release pair of operations performed on a given memory location must be issued by the same processor.
For a given acquire operation $e$, we define $\mathit{most\_recent\_release}(e) = e'$, where $e'$ is the most recent release operation performed on location $L$ prior to acquire operation $e$. Due to the mutual exclusion imposed by acquire/release, there can be at most one such release operation $e'$. If $e$ is the first acquire operation to be performed on location $L$, we say that $\mathit{most\_recent\_release}(e)$ is undefined.
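The definition of the most recent release can be illustrated with a minimal Python sketch. It assumes that the operations performed on one location are available as a time-ordered list of (kind, processor) tuples; this list representation and the function name are hypothetical and serve only as an illustration, not as part of the LC definition.

```python
def most_recent_release(history, acquire_index):
    """Return the index of the latest release before history[acquire_index].

    history is a list of (kind, processor) tuples for a single location, in
    the order the operations were performed (an assumed representation).
    Returns None when no release precedes the acquire (the undefined case).
    """
    kind, _ = history[acquire_index]
    if kind != "acquire":
        raise ValueError("acquire_index must point at an acquire operation")
    # Scan backwards from the acquire until a release is found.
    for i in range(acquire_index - 1, -1, -1):
        if history[i][0] == "release":
            return i
    return None


history = [
    ("acquire", 1),
    ("write", 1),
    ("release", 1),
    ("write", 2),
    ("acquire", 2),
]
first = most_recent_release(history, 0)  # first acquire on the location
later = most_recent_release(history, 4)  # acquire issued after a release
```

Here the first acquire has no preceding release, so the result is the undefined case, while the second acquire is paired with the release at index 2.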
State-Update Operations for a Memory Location
The state (pomset) for a memory location is updated when a write, acquire, or release operation is performed. The basic rule for computing the new pomset $(S_{new}, \prec_{new})$ from the old pomset $(S_{old}, \prec_{old})$ after operation $e$ is as follows:
$S_{new} := S_{old} \cup \{e\}$
$\prec_{new} := \prec_{old} \cup \{(x, e) \mid x \in S_{old} \wedge \mathit{processorset}(x) \cap \mathit{processorset}(e) \neq \emptyset\}$
The basic rule is simple. The new operation, e, is inserted into the multiset S, and the partial order is updated so that x precedes e in S, if processorset(e) and processorset(x) have a non-empty intersection. The initial state of the memory location is assumed to be the empty pomset. In general, the partial order captures the sequencing constraints of the memory and synchronization operations performed on the location. There are no additional processor-based sequencing constraints as in the memory coherence assumption. Since all ordering constraints are captured at the location level, we chose the name Location Consistency for our memory model.
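The basic rule can be sketched in a few lines of Python. The class and field names below (Op, Pomset, uid) are assumptions made for illustration; the sketch stores only the pairs that the rule adds and does not compute a transitive closure. As noted elsewhere in the paper, the pomset is an abstraction for defining the model and is not meant to be implemented on a real system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    kind: str             # "write", "acquire", or "release"
    procs: frozenset      # processorset(e)
    value: object = None  # written value; unused for synchronization operations
    uid: int = 0          # tells apart otherwise identical elements of the multiset


@dataclass
class Pomset:
    elems: list = field(default_factory=list)  # multiset S, initially empty
    order: set = field(default_factory=set)    # pairs (x, e): x precedes e

    def basic_update(self, e):
        # Order e after every existing element that shares a processor with it.
        self.order |= {(x, e) for x in self.elems if x.procs & e.procs}
        # Insert e into the multiset.
        self.elems.append(e)


w1 = Op("write", frozenset({1}), value=1, uid=1)
w2 = Op("write", frozenset({2}), value=2, uid=2)
w3 = Op("write", frozenset({1}), value=3, uid=3)

state = Pomset()
for op in (w1, w2, w3):
    state.basic_update(op)
```

In this run the two writes from processor 1 are ordered, while the write from processor 2 remains unrelated to both, which is the kind of partial order that a memory-coherent model would not permit.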
The basic update rule defined above is used for write and release operations. A modified rule that uses $\mathit{most\_recent\_release}()$ is used for the acquire operation. The state-update rules for the write, acquire, and release operations are as follows. (State-update rules for additional operations can be found in [9, 10].)
write: when a write operation from any processor $P_i$ to location $L$ with value $v$ is executed, the element $e = \mathit{write}(P_i, v)$ is inserted into the pomset for $\mathit{state}(L)$, using the basic state-update rule.
release: when a release operation is performed on location $L$ by any processor $P_i$, the element $e = \mathit{release}(P_i)$ is inserted into the pomset for location $L$, using the basic state-update rule.
acquire: when an acquire operation is performed on location $L$ by any processor $P_i$, the element $e = \mathit{acquire}(P_i)$ is inserted into the pomset for location $L$. If
Note that the pomset model for the state of a memory location is solely an abstraction used for defining the LC model. It is not intended that the pomset and the accompanying state-update rules be actually implemented on a real multiprocessor system. Instead, as described below, the pomset defines the set of all permissible values that can be returned for a read operation executed on a computer system that satisfies the LC model.
State Observability for a Memory Location
The state-update rules specify how the state of a memory location is updated by write operations and synchronization operations. We now define how the state of a memory location is observed via read operations.
Consider a read operation $e$ on memory location $L$ from processor $P_i$. Since the state, $(S, \prec)$, of $L$ is a pomset, in general, the read operation may return an element from the set, $V(e)$, of possible "live" values depending on $(S, \prec)$. We now specify the rule for deriving the value set $V(e)$ from $(S, \prec)$, for a given read operation. As an example, Figure 1 shows the extended partial orders and value sets for two scenarios for the execution of the read operation on processor 1 in the simple parallel program trace shown in Figure 1:
Case 1: When the read executes before the acquire on processor 2.
In this case, the partial order contains only a single write operation (from processor 1), and the value set for the read operation is $V(e) = \{\mathit{val1}\}$ according to condition 1 above.
Case 2: When the read executes after the release on processor 2.
In this case, the partial order contains two write operations (from processors 1 and 2). In the extended partial order, we see that $\mathit{write}(P_1)$ precedes $\mathit{read}(P_1)$ but $\mathit{write}(P_2)$ and $\mathit{read}(P_1)$ are unrelated. Therefore, the MRPW set is the same as in case 1. However, the value set for the read operation is $V(e) = \{\mathit{val1}, \mathit{val2}\}$; val2 is included in $V(e)$ because of condition 2 above.
Since $V(e)$ contains multiple elements, there is a data race in the program trace in Figure 1. The programmer could have avoided the data race by enclosing the read operation in a pair of acquire-release synchronization operations.
Figure 1: Example of extended partial orders and value sets for read operation
We now formalize the connection between execution scenarios and value sets by precisely defining what it means for a parallel execution of a program to be location consistent. The formal definition for Location Consistency uses the concept of an abstract interpreter for a given concurrent program. The abstract interpreter works under an idealized execution model as outlined below. The reason for using the abstract interpreter in defining the memory model is to break the cycle between program execution and memory semantics; namely, the memory semantics depends on the program execution, and the unfolding of the program execution depends on the memory semantics. Many memory models defined in past work ignored this cyclic dependence and assumed the availability of some specific program execution trace when defining the memory semantics.
The execution model for the abstract interpreter maintains the state of each memory location as a pomset. The initial state of each memory location is assumed to be a singleton pomset containing $\mathit{write}(\{P_1, P_2, \ldots\}, \bot)$, i.e., each memory location is initialized with a (pseudo) write of an undefined value (represented by $\bot$) from all processors at the start of program execution.
The abstract interpreter mimics the execution of the memory and synchronization operations encountered in the concurrent program, and updates the states of memory locations according to the rules specified earlier. For each read operation $r$ on location $L$, the abstract interpreter computes the value set $V(r)$ from the pomset for location $L$ and (arbitrarily) returns a value from the set $V(r)$ as the result of the read operation [footnote 2]. As in definitions of other memory consistency models [12], we assume that all uniprocessor control and data dependences are satisfied.
Analogous to the notion of a sequentially consistent execution defined in [1], we introduce the notion of a location consistent execution as any execution of a program by the abstract interpreter discussed above. Note that the execution model of the abstract interpreter makes no assumption on the timing of events in the program execution. Therefore, there may be many location consistent executions for a given set of program inputs due to the nondeterminism and data races that may be inherent in the program.
The Location Consistency (LC) model is now defined as follows:
Definition 2.1. A multiprocessor system is location consistent if, for any execution of a program on the system: (1) the operations of the execution are the same as those for some location consistent execution of the program; and (2) for each read operation $R$ with target location $L$ that is executed on the multiprocessor, the result returned by $R$ belongs to the value set $V$, where $V$ is specified by the state of the memory location $L$ as maintained by the abstract interpreter in the corresponding location consistent execution.
Note that our definition of LC requires that the results of any (every) execution of a program on location consistent hardware can be obtained by some location consistent execution of the program on the abstract interpreter, but not vice versa; i.e., as might be expected, it does not require that the results of any (every) location consistent execution of a program on the abstract interpreter be reproducible by some execution of the program on location consistent hardware.
3 Properties of the LC Model compared to other Cache Consistency Models
In this section, we establish some important properties of the LC model to demonstrate that it is a robust and reasonable memory consistency model. In addition, we show that the LC model is strictly weaker than other memory consistency models that have been presented in past work, thus making the LC model more attractive for use in future scalable shared-memory multiprocessor systems. Our study of these properties for the LC model was inspired in part by a study of these kinds of properties in [8] for the dag consistency model using a computation-centric framework, i.e., a framework in which memory consistency is defined for some specific (partially ordered) trace of a program execution (unlike our framework based on the abstract interpreter described at the end of section 2.2). Specifically, we prove that the LC model satisfies the four properties listed below:
Weakness Property. The LC model is strictly weaker than the Release Consistency (RC) model (which is in turn strictly weaker than the Sequential Consistency model).
The Weakness Property reveals why we can expect to find a more efficient protocol for the LC memory model compared to protocols for the RC and stronger models.
Equivalence Property. For parallel programs that have no data races, the LC model is equivalent to the RC model. Since most parallel programs are designed to be free of data races, the Equivalence Property shows that the LC model is as useful as the RC model (and the SC model) for this large class of programs.
Monotonicity. The LC model is monotonic with respect to parallelism: if the LC model permits a certain mapping of values to dynamic instances of read instructions in the execution of a parallel program, it must permit the same mapping in an isomorphic execution of a legal "more parallel" version of the same program (i.e., in an execution that has fewer edges/constraints in the partial order of the program's execution).
Monotonicity is a measure of robustness of the LC model. If a legal parallelization transformation is performed on a program, any set of values returned by read operations in the original program should also be permitted in the parallelized program.
Non-intrusive Reads. Reads are non-intrusive in the LC model: the addition or removal of a read instruction in a parallel program cannot change the legality of values returned by dynamic instances of read instructions in a given execution of the parallel program. The non-intrusive reads property is another measure of robustness of the LC model. It implies that the addition or removal of debug statements (for example) will not change the memory semantics of locations being examined.
We will use the notation $E$ to refer to a specific execution of a parallel program and the notation $R(E)$ to refer to the read mapping for execution $E$. Specifically, $R(E)[r] = v$ means that $v$ is the value that was returned by (dynamic) read operation $r$ in execution $E$. In general, $E$ may refer to a partial execution of a parallel program.
Recall that when the abstract interpreter defined in section 2 performs a location consistent execution of a program, it simulates an abstract memory system in which it maintains the state of each memory location $L$ as a pomset of write/synchronization instructions, $\mathit{state}(L) = (S, \prec)$. The value set, $V(r, L)$, for a read operation $r$ on location $L$ can then be derived from $\mathit{state}(L)$, as defined in section 2. In a location consistent execution, each read operation $r$ on some location $L$ must return a value that is an element of $V(r, L)$.
Weakness Property
In this section, we prove that the LC model is strictly weaker than the RC model (which in turn is strictly weaker than the SC model). The first part of the proof is to show that every release consistent execution of a parallel program also satisfies the LC model. This can be shown by establishing that the value returned for each read operation in an RC execution must also be an element of the value set for the read operation in an LC execution. We omit the details for the sake of brevity.
The second part of the proof is to show that the LC model is strictly weaker than the RC model. To do so, we outline a program execution that satisfies the LC model but does not satisfy the RC model. Consider a parallel execution E of the program shown in Figure 2, in which processor 1 writes L := 1 asynchronously (without locking; write operation w1), and processor 2 acquires the lock for location L to write L := 2 (write operation w2). Next, processor 1 acquires the lock for location L and reads the value in L (read operation r1); processor 1 later performs a second read operation (r2) on location L, but this second read is asynchronous, i.e., read operation r2 is performed without any locking.
In the RC model, it is illegal for read operations r1 and r2 to return different values for program execution E. This is because both write operations, w1 and w2, are executed in E before read operations r1 and r2. According to the RC model (and any other consistency model that relies on the memory coherence assumption), all writes to the same location are serialized in some order and are performed in that order with respect to any processor; this implies that read operations r1 and r2 must both see the same value (either from w1 or from w2, whichever appears last in the order) in a memory model that obeys memory coherence. Even from this simple example, one can surmise that a cache consistency protocol for the RC model must include some bookkeeping overhead to identify a fixed (serialized) order for all writes to the same location. This bookkeeping is usually achieved by maintaining a single "home location" for each shared location in shared-memory multiprocessors.
However, in the LC model, the value sets for read operations r1 and r2 are $V(r1) = V(r2) = \{1, 2\}$, i.e., the value sets include the values from both w1 and w2, which is consistent with the program's partial order relationship revealed by execution E. Thus, read operations r1 and r2 may return different values under the LC model, say 1 and 2 respectively as shown in Figure 2. From this simple example, one can see that cache consistency protocols for the LC model need not incur the bookkeeping overhead of identifying a fixed (serialized) order for all writes to the same location. Instead, a protocol for the LC model can return any value for a read operation $r$ that belongs to the value set $V(r)$. This property is exploited by the LC cache consistency protocol presented in section 4.
Equivalence Property
The Weakness Property from section 3.1 reveals why the LC model can be more attractive than other memory consistency models for use in future scalable shared-memory multiprocessor systems. In this section, we show why LC is a useful model by proving that the LC model is equivalent to the RC model for all program executions that are free of data races (access anomalies). Prior work [12] has shown that the RC model is equivalent to the SC model for such program executions, so the Equivalence Property also implies that the LC model is equivalent to Sequential Consistency for all program executions that are free of data races.
The Equivalence Property holds for parallel programs that are free of data races. We say that a program is free of data races if all its accesses to shared data are protected by control sequencing or by either direct or indirect data synchronization. This class of programs has also been referred to as data-race-free [1] and as properly labeled [12] in the literature.
To prove the Equivalence Property, we define a Memory Coherent Abstract Interpreter (MCAI) that is a variant of the Location Consistent Abstract Interpreter (LCAI) introduced in section 2. The MCAI treats each write operation $w$ on location $L$ as having $\mathit{processorset}(w) = \{1, \ldots, P\}$, i.e., every processor is included in $\mathit{processorset}(w)$ regardless of which processor performed the write operation. Therefore, all write operations will be totally ordered in the state of each memory location, instead of the partial order maintained by the LCAI. Further, if we consider an extended pomset $(S', \prec')$ for any read operation $e$ on location $L$, all write operations to location $L$ must be predecessors of $e$. Hence, the value set for any read operation in the MCAI will always have size at most 1, because there will be at most one "most recent predecessor write" for each location. The value returned by a read operation in the MCAI will be undefined if the value set is empty. An execution of a parallel program will be release consistent (i.e., will obey the RC model) if the result of each read operation is the same as the result of the read operation in a corresponding execution of the MCAI.
Now consider a location consistent execution $E$ that has no data races. Since the execution obeys the LC model, the result of each read operation $r$ in $E$ must belong to the value set $V(r)$ in the LCAI execution that corresponds to $E$. Since $E$ has no data races, each $V(r)$ set must have size $\leq 1$. Therefore, the value returned by read operation $r$ in the LCAI execution must be identical to the value returned by a corresponding MCAI execution. Hence execution $E$ obeys the RC model.
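The difference between the two interpreters can be made concrete with a small Python sketch that applies the basic state-update rule of section 2 to a sequence of writes. The function name write_order and the representation of writes by the issuing processor's number are assumptions for illustration only; synchronization operations and the acquire rule are deliberately left out.

```python
def write_order(writers, all_procs, coherent):
    """Return the index pairs (i, j), i before j, added by the basic rule.

    writers lists the issuing processor of each write to one location, in
    execution order. With coherent=True every write is given the full
    processor set (MCAI); otherwise only its own processor (LCAI).
    """
    procsets = []
    order = set()
    for j, p in enumerate(writers):
        ps = frozenset(all_procs) if coherent else frozenset({p})
        for i, earlier in enumerate(procsets):
            if earlier & ps:  # non-empty intersection of processor sets
                order.add((i, j))
        procsets.append(ps)
    return order


writers = [1, 2, 1]  # processor 1, then processor 2, then processor 1 again
lcai = write_order(writers, {1, 2}, coherent=False)
mcai = write_order(writers, {1, 2}, coherent=True)
```

Under the LCAI only the two writes from processor 1 are ordered, whereas under the MCAI every pair of writes is ordered, so the writes form a single chain.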
Monotonicity
In this section, we show that the LC model is monotonic with respect to parallelism. Consider a location consistent execution $E_1$ of a given parallel program $P_1$. Now consider parallel program $P_2$ obtained by removing a single pair of acquire-release operations from $P_1$; we say that program $P_2$ is strictly "more parallel" than program $P_1$. We assume that this transformation is legal, i.e., performing the transformation does not violate any data dependences. (Analogous transformations can be defined for other synchronization operations.)
We derive an execution $E_2$ for program $P_2$ from execution $E_1$ by simply deleting all dynamic instances of the acquire and release operations that were deleted from $P_1$ to obtain $P_2$. Monotonicity guarantees that $E_2$ will be a location consistent execution of program $P_2$.
The proof of this result hinges on a simple observation. Let $R(E_1)$ and $R(E_2)$ be the read mappings for program executions $E_1$ and $E_2$. Consider a read operation $r$ on location $L$ in execution $E_1$; $r$ must also be present in execution $E_2$. Let $V_1(r, L)$ and $V_2(r, L)$ be the value sets for read operation $r$ in executions $E_1$ and $E_2$ respectively. Since program $P_2$ is more parallel than program $P_1$, it must be the case that the partial order $\prec_2$ for the state of location $L$ in execution $E_2$ is a subset of the partial order $\prec_1$ for the state of location $L$ in execution $E_1$, i.e., $\prec_2 \subseteq \prec_1$. This implies a subset relationship among the most-recent-predecessor-write sets with respect to any read operation $e$ on location $L$, $\mathit{MRPW}_1(S, \prec_1, e) \subseteq \mathit{MRPW}_2(S, \prec_2, e)$, and hence among value sets, $V_1(r, L) \subseteq V_2(r, L)$. The value $v = R(E_1)[r]$ returned by read operation $r$ in execution $E_1$ must satisfy $v \in V_1(r, L)$. Therefore, $v \in V_2(r, L)$ and hence $R(E_2)[r] = v$ conforms with a location consistent execution of program $P_2$.
As a concluding note, we observe that a similar monotonicity result can also be established when $P_2$ is obtained by partitioning a sequential thread from $P_1$ into two parallel threads (assuming that this partitioning satisfies all control and data dependences). This is another way in which $P_2$ can be strictly "more parallel" than $P_1$.
To the best of our knowledge, the monotonicity property holds for all memory consistency models proposed in the literature. Proving monotonicity for the LC model provides extra evidence that LC is a reasonable memory model.
Non-intrusive Reads
In this section, we show that reads are non-intrusive in the LC model. Consider a location consistent execution $E_1$ of a given parallel program $P_1$. Now consider parallel program $P_2$ obtained by adding a single read operation to $P_1$. We derive an execution $E_2$ for program $P_2$ from execution $E_1$ by inserting all dynamic instances of the read operation that was added to $P_1$ so as to obtain $P_2$. The non-intrusive reads property guarantees that $E_2$ will be a location consistent execution of program $P_2$.
The proof of the non-intrusive reads property for the LC model follows simply from the fact that read operations do not appear in the pomset state of a memory location. Let $R(E_1)$ and $R(E_2)$ be the read mappings for program executions $E_1$ and $E_2$. Consider a read operation $r$ on location $L$ in execution $E_1$; $r$ must also be present in execution $E_2$. Let $V_1(r, L)$ and $V_2(r, L)$ be the value sets for read operation $r$ in executions $E_1$ and $E_2$ respectively. Since read operations do not appear in $\mathit{state}(L)$, it must be the case that $V_1(r, L) = V_2(r, L)$. Note that any value $v = R(E_1)[r]$ returned by read operation $r$ in execution $E_1$ must satisfy $v \in V_1(r, L)$. Therefore, $v \in V_2(r, L)$ and hence $R(E_2)[r] = v$ conforms with a location consistent execution of program $P_2$.
To the best of our knowledge, the non-intrusive reads property holds for all memory consistency models proposed in the literature. Proving the non-intrusive reads property for the LC model provides extra evidence that LC is a reasonable memory model.
4 The LC Cache Consistency Protocol
In cache-based shared-memory multiprocessor systems, the management of the cache is a vital issue that has a significant impact on system performance. The presence of copies of the same location in multiple caches requires that these copies be managed in a way that does not violate the requirements of the underlying memory consistency model. This is known as the cache consistency problem. Aggressive cache management schemes can exploit loose constraints in weaker memory models by reducing consistency-related traffic in the memory system.
In this section, we introduce the LC protocol [footnote 3], a new cache consistency protocol that supports the semantics of the LC model. As we will see, the partial order semantics of the LC model enables the LC protocol to be simpler and more scalable than existing protocols that must obey the memory coherence assumption.
Classification of Existing Cache Protocols: Snooping or Directory-Based
Existing cache consistency protocols for SC-derived memory models can be divided into two classes [13]:
Snooping: Each cache block contains the shared status of the corresponding memory block, but no central state information is kept in memory. Snooping protocols rely on the presence of a global shared-memory bus and on a cache controller in each processor that "snoops" on the bus transactions. If a processor contains a cached copy of the block involved in a bus transaction, the cache controller for the processor performs the appropriate action based on the state information of the cached block.
Directory-based: The state of a memory block and its cached copies is maintained in a directory. The directory keeps track of the state changes of each cache block and takes appropriate actions to maintain coherence by sending messages to individual processors listed in the directory.
We rst discuss the issues of write-through vs. write-back policies and invalidation vs. write broadcast approaches in cache consistency protocols. Maintaining the \memory coherence" requirement of the SC-derived memory consistency models is simpler when a write-through policy is assumed. However, most modern processors nd the extra memory tra c caused by a write-through policy to be prohibitively high, and thus employ a write-back policy instead. In the presence of a write-back policy, both snooping and directory-based protocols maintain coherence by ensuring that at any time there is a unique \owner" of a cached block.
There are two approaches to maintaining this ownership. The rst is to ensure exclusive access to a data item before a write to that item takes place | this is called the invalidation approach because other copies of the data item are invalidated before the write. The second is the write broadcast approach in which a write operation on one processor triggers a broadcast of the new value to all other cached copies. Among these two approaches, there seems to be a marked preference for the invalidation approach. Therefore, we will limit our attention in this paper to cache consistency protocols that assume a write-back policy and that follow an invalidation approach since these are the predominant assumptions in modern cache consistency protocols.
To ensure the single-ownership property, snooping protocols use the serialization of accesses on the bus as a single point of arbitration, which dictates a serialization of writes. That is, when two writes are in a race, the write that wins the arbitration will be initiated and will not complete until all its invalidation requests are satis ed. Snooping protocols have been used successfully in small-scale bus-based shared-memory multiprocessors. On the other hand, directory-based protocols have been proposed for large-scale shared memory multiprocessors that use an interconnection network instead of a bus. There is no longer a single point of arbitration. However, a single place | the directory | is provided to maintain the state information of cached blocks for each memory block. In this case a more sophisticated protocol engine is needed to ensure that unique ownership is enforced. The directory serves as a single rendezvous point for the serialization of writes.
Let us use the parallel program outlined in Figure 3 as a simple example to discuss existing snooping and directory-based protocols. In this example, each processor executes a while loop; the only synchronization among the processors is through the acquire and release operations on the shared variable X. Each processor uses the acquire-release construct to perform some (processor-specific) read-modify-write sequence on X in mutual exclusion. The acquire-release operations make the execution of this parallel program nondeterministic, but all executions of the program are assumed to be free of data races. Self-scheduled execution of a parallel loop [22] is a simple example that fits the parallel program structure shown in Figure 3.
Consider the execution of the program in Figure 3 with a snooping protocol. Executing a write operation on X in the acquire-release construct will force an invalidation of cached copies of X on all other processors, so as to change the ownership from "shared" to "exclusive". This invalidation is usually accomplished by broadcasting an invalidate message on the bus. Though the broadcast is convenient to perform on a bus, it still consumes an entire bus cycle and leads to extra bus traffic.
Directory-based protocols avoid the broadcast overhead incurred for write operations in snooping protocols by instead having the owner send invalidate messages to only those processors listed in the directory as containing cached copies of the block being written into. This can lead to a savings if only a small number of processors contain cached copies. However, even in that case, the savings comes at the cost of the extra complexity of maintaining directories in the cache consistency protocol.
A New Cache Protocol Based on the LC Model
Under the LC model the only requirement of a memory system is to obey the ordering dictated by the partial order semantics, which is specified by the programming model and memory abstractions (see Section 2). Since there are no memory-coherence or serialization requirements beyond what is implied by the program's partial order, a cache consistency protocol for the LC model does not need to ensure single ownership of memory blocks. Therefore, we can develop new cache protocols to support the LC model. The properties that we desire for such new protocols are that they:
should not be directory-based, thus avoiding the cost associated with the maintenance of a directory;
should not rely on snooping or on any centralized arbitration to maintain coherence, thus eliminating a synchronization bottleneck and allowing for the construction of scalable machines; and
should avoid invalidate requests for maintenance of memory and cache coherence.
Intuitively, since no memory coherence requirement is imposed, these new cache protocols should guarantee that a read operation always returns an element of the "value set" specified by the LC model (see section 2.2). The solution adopted in this paper is to always keep a valid value in the main memory location. Thus, any read miss in a cache will always find a legal value in the corresponding memory location.
Our new cache protocol only needs to update the main memory location. Since the value set of a read operation can contain multiple legal values, there is no need to invalidate other cached copies of the memory location. In other words, memory and caches do not need to be coherent! Several caches might contain different (but legal) values of the same memory location. Consequently, our new protocol does not require a directory or the use of a bus snooping mechanism to maintain cache consistency.
An outline for the LC cache consistency protocol is as follows. In order to simplify the initial discussion, we assume that each cache line contains a single word (location). An extension to a cache with multi-word lines is presented in section 4.4. The LC protocol assumes that each cache line may be in one of the following states:
invalid: the cache line does not contain valid information. A read or write operation to the line will result in a miss.
clean: the cache line contains valid information. A read or write operation to the line will result in a hit. If the cache line receives a "self-invalidation" it will go to an invalid state. A self-invalidation might result from an acquire operation performed in the same processor, as explained later.
dirty: the cache line contains valid information. A read or write operation to the line will result in a hit. The cache line will remain in the dirty state even if it is the target of a self-invalidation operation.
The above three states can be easily implemented through the use of two state bits: a valid bit and a dirty bit.
Table 1: Correspondence between cache states and state bits.
The state information for the LC protocol is only maintained in each processor's cache; no form of directory information or state information needs to be maintained in the memory. We observe that the two state bits per cache line required by the LC protocol are already present in many uniprocessor cache implementations in modern microprocessors. The main difference from the uniprocessor case is in the use of these state bits by the LC cache protocol to determine what consistency actions should be performed for acquire and release synchronization operations.
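The contents of Table 1 are not reproduced in this text. As a minimal sketch, the two state bits could be decoded as follows, assuming the conventional encoding in which the dirty bit is meaningful only when the valid bit is set. The names `LineState` and `decode_state` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class LineState(Enum):
    INVALID = "invalid"
    CLEAN = "clean"
    DIRTY = "dirty"


def decode_state(valid: bool, dirty: bool) -> LineState:
    """Map the (valid, dirty) bits of a cache line to an LC protocol state.

    Assumed encoding (not taken from Table 1): a line whose valid bit is
    clear is invalid regardless of its dirty bit.
    """
    if not valid:
        return LineState.INVALID
    return LineState.DIRTY if dirty else LineState.CLEAN
```

Under this assumed encoding the three LC states need no storage beyond the two bits that a write-back uniprocessor cache typically already keeps.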
Specifically, the actions performed by the LC protocol for read/write/release/acquire operations are as follows:
Read Operation
If the line X is in a clean or dirty state, read(X) results in a hit. The value in the cache is a legal value to be returned by the operation read(X).
If the line X is in an invalid state, read(X) results in a miss. The line X of the cache will be fetched from the memory and stored in the cache in a clean state. Our protocol guarantees that the value brought from the memory is a legal value.
Write Operation
If the line X is in a dirty state, operation write(X) will simply write the new value in the line.
If the line X is in a clean or an invalid state, operation write(X) will write the new value in the line and change the line into a dirty state.
Acquire Operation
When executing an acquire(X) operation, a processor p first performs the atomic hardware operation used to get the lock, and then performs the following consistency operations:
If the cache line X is in a clean state, p will invalidate X in the cache. This self-invalidation by processor p is necessary because the last processor to perform a release(X) may have written a value for X that p must observe in accordance with the LC model.
If the cache line X is in a dirty or invalid state, no state change is necessary.
Release Operation
If the line X is in a dirty state, operation release(X) will update the memory, wait for an acknowledgment, and set the cache line X to a clean state.
If the line X is in a clean or an invalid state, operation release(X) will not incur any state change.
To complete the release operation, the processor also needs to give up the lock after performing the above consistency actions. We assume that all write operations to X in the same processor must complete before the release(X) operation can complete. Thus, if a write buffer is used, a release(X) operation cannot complete until all write operations to X have been transferred from the write buffer to memory. The processor then waits for an acknowledgment from memory that all its write operations to X have completed. Note that this wait for write completion does not require exclusive ownership.
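The read, write, acquire and release rules above can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the authors' implementation: the class name `LCCache` is hypothetical, main memory is a plain dictionary, lock handling is omitted, and the memory acknowledgment is assumed to arrive immediately.

```python
class LCCache:
    """One processor's cache under the LC protocol (single-word lines)."""

    def __init__(self, memory: dict):
        self.memory = memory   # shared main memory: address -> value
        self.state = {}        # address -> "clean" | "dirty"; absent means invalid
        self.value = {}        # address -> cached value

    def line_state(self, addr) -> str:
        return self.state.get(addr, "invalid")

    def read(self, addr):
        if self.line_state(addr) == "invalid":        # miss: fetch from memory
            self.value[addr] = self.memory.get(addr)  # None models the undefined initial value
            self.state[addr] = "clean"
        return self.value[addr]                       # hit on a clean or dirty line

    def write(self, addr, value):
        self.value[addr] = value                      # always performed locally
        self.state[addr] = "dirty"

    def acquire(self, addr):
        # Lock acquisition itself is not modelled here.
        if self.line_state(addr) == "clean":          # self-invalidation
            del self.state[addr]
            del self.value[addr]

    def release(self, addr):
        if self.line_state(addr) == "dirty":
            self.memory[addr] = self.value[addr]      # write back (ack assumed immediate)
            self.state[addr] = "clean"
        # Giving up the lock is not modelled here.


memory = {}
p0, p1 = LCCache(memory), LCCache(memory)
p1.read("X")                  # p1 caches the initial value in a clean line
p0.acquire("X")
p0.write("X", 1)
p0.release("X")               # p0's value reaches memory; p1's cache is untouched
stale = p1.read("X")          # hit on p1's old clean copy
p1.acquire("X")               # self-invalidates p1's clean copy
fresh = p1.read("X")          # miss: fetches the released value from memory
```

In this run `stale` is the initial value and `fresh` is the value released by `p0`: the only consistency traffic is between each cache and memory, never between caches.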
Under the LC protocol, a write operation is always performed locally in the cache. No consistency-related actions, such as obtaining exclusive ownership of the location before the write, are necessary. When a cache line X that is in a dirty state needs to be ejected to make room for a new value brought in by a read miss, the value of X is written back to memory, but no other actions are needed. Figure 4 presents the state transition diagram for the proposed LC cache protocol. Note that this state diagram is much simpler than the state diagram for traditional cache coherence protocols such as those described in [14]. For example, the LC protocol does not need to send any global invalidation messages to other caches. All the operations presented in the state transition diagram are performed by the local processor (e.g., read, write, release, acquire).
For the sake of comparison, consider the execution of the data-race-free example program in Figure 3 with the LC protocol. Each processor performs a "self-invalidate" of X when it acquires the lock for X, thus ensuring that it receives the most recent value from memory. In addition, each processor writes back its most recent value of X to memory when performing a release operation. There is no broadcast of invalidates as in a snooping protocol, and no sending of invalidate messages to multiple readers as in a directory-based protocol.
We conclude this section with a theorem that establishes the correctness of the LC cache consistency protocol.
ii. The read is a miss. In this case, the value v returned by the read miss is delivered from the memory location. Assume that this value was written by a write operation $w = write(P_k, v)$ performed by processor $P_k$. There are three possible cases for how value v written by write operation w in processor $P_k$ might have reached the memory location:
(a) Processor $P_k$ had the location cached in a dirty state and then performed a release operation on the location, causing value v to be written back to memory by the LC protocol.
(b) Processor $P_k$ had the location cached in a dirty state and incurred a conflict/capacity miss that caused value v to be written back to memory by the LC protocol.
(c) Write operation w is the initial write of undefined value $\bot$. (Recall from section 2.2 that we assume that each memory location is initialized with a write of an undefined value at the start of program execution.)
In the first two cases, the value was copied to the memory from a dirty cache location. We will show (by contradiction) that no other write operation w
In the third case above, the value returned by read operation r is the initial undefined value, $v = \bot$, for the location created by (pseudo) write operation w. Again, we will show (by contradiction) that no other write operation w
This concludes the proof for the case when read operation r is a miss.
iii. The read is a hit in a clean line. If the cache line is in a clean state, we have to consider two situations:
(a) The cache location was updated by a prior cache miss on processor $P_j$, but the memory location has since been changed by a write-back or release operation on processor $P_k \neq P_j$. If the cache line in processor $P_j$ is still in a clean state, it means that $P_j$ has not issued an acquire operation for this location, meaning that the programmer did not intend for this value to be synchronized. Therefore the value stored in the clean line of processor $P_j$ is a member of the value set for the read operation, and hence still a legal value according to the LC model.
(b) The cache location has the same value as the corresponding memory location. This is similar to the read miss case discussed above where the value returned by the read operation is the same as the value in the corresponding memory location.
Having considered all possible cases for a read operation, we conclude our proof that the value returned by any read operation is a legal value in the corresponding LC value set. $\Box$
Comparison with Existing Cache Protocols
In this section, we provide a qualitative comparison of the LC protocol with existing snooping and directory-based cache consistency protocols. The main advantages of the LC protocol are its simplicity and its scalability. The simplicity comes from the fact that the LC protocol does not require extra hardware either to snoop on a bus or to maintain a directory. The scalability comes from the fact that the LC protocol does not perform invalidations on multiple processors when a variable is acquired or written. In contrast, a snooping protocol has to broadcast invalidates to all processors when performing an acquire or a write operation. A directory-based protocol has to send invalidate messages to all processors that have a cached copy of the location being written or acquired.
The LC protocol derives these benefits from the underlying LC model. The fact that the LC model is weaker than RC and other consistency models based on memory coherence enables the LC protocol to be simpler than consistency protocols for memory models based on the memory coherence assumption. If either the snooping protocols or the directory-based protocols were relaxed so as to only perform self-invalidates as in the LC protocol, they would no longer be correct protocols for their underlying memory consistency model.
The properties described in section 3 have already established the usefulness of the LC model. In particular, the Equivalence Property guarantees that the LC model is equivalent to the RC model for parallel programs that have no data races. For such programs the simpler and more scalable LC protocol is guaranteed to implement the same memory consistency semantics as the snooping and directory-based protocols for the SC and RC models. We saw an illustration of this in section 4.2 in the discussion of how the LC protocol operates on the parallel program in Figure 3.
For programs with data races, it is the job of the memory consistency protocol to implement the semantics of data races specified by the underlying memory model. We advocate the use of the LC model because it guarantees important properties such as the Equivalence Property, while being strictly weaker than other memory consistency models that guarantee the same properties. In snooping and directory-based protocols the extra overhead of supporting memory coherence in programs with data races is incurred by all programs, including programs that are free of data races. It is not easy to disable this support because it is not possible to determine at the start of a program's execution whether or not the execution will exhibit a data race. This is unfortunate because it is desirable for parallel programs to be data-race-free, and in fact most parallel program executions in practice do not exhibit data races.
The LC model ensures that data-race-free programs are not penalized by extra consistency overhead that would only be necessary for programs with data races. Instead, the LC model provides a weaker semantics for programs with data races; if there is a need to write a parallel program using the LC model that has data races and that satisfies the memory coherence assumption for some variable(s), each memory access to the variable(s) can be enclosed in an acquire-release construct (see footnote 4). Thus, in the LC model and the LC protocol, the extra overhead of maintaining memory coherence is only incurred by programs that have data races and that need the memory coherence property.
Figure 5: Example of a parallel program with data races.
To demonstrate how the LC protocol works for a program with data races, consider the relaxation-style parallel program outlined in Figure 5. The main difference from the program in Figure 3 is that each processor now performs an asynchronous read of X in each iteration of the while loop. In this example, we assume that there are no asynchronous write operations. The presence of asynchronous reads is sufficient to cause data races because there may be multiple candidate write operations that can supply the value of the read operation. However, the uniprocessor ordering of the read X operation following the release(X) ensures that the result of the read must either come from the most recent write performed in the same processor or from a later write operation on another processor. This kind of guarantee can be sufficient to ensure termination in many relaxation algorithms.
With the LC protocol, the acquire-release construct will execute just like the acquire-release construct in Figure 3 discussed in section 4.2. For the asynchronous read, there are two cases:
Case 1: X remained in the processor's cache between release(X) and read X. In this case, the LC protocol will just return the value of X in the cache, which will still be valid.
Case 2: X was evicted from the processor's cache between release(X) and read X. In this case, the read operation will incur a cache miss just as in the uniprocessor case. The LC protocol would have ensured that the value in memory was written back either by the most recent release(X) on the same processor or by a later release(X) performed on another processor.
Therefore, each acquire/release/read/write operation incurs a constant-time overhead without involving additional processors.
With a snooping or directory-based protocol, each write operation will cause all N copies of X to be invalidated on N processors. Therefore, O(N) overhead is incurred for each write operation in these protocols; this overhead grows linearly with the number of processors, whereas the overhead incurred in the LC protocol is constant. This example demonstrates the scalability of the LC protocol compared to snooping or directory-based protocols.
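The scaling argument above can be made concrete with a rough counting model. This is a simplified sketch under stated assumptions, not a measurement: every processor is assumed to re-cache X through its asynchronous read between writes, each invalidation is counted as one message, and the LC protocol is charged at most one refetch and one write-back per critical section. The function names are hypothetical.

```python
def invalidations_sent(n_procs: int, n_writes: int) -> int:
    """Remote invalidations under an invalidation-based protocol.

    Assumes every processor holds a cached copy of X before each write,
    as in the program where all processors read X asynchronously.
    """
    all_procs = set(range(n_procs))
    sharers = set(all_procs)
    total = 0
    for i in range(n_writes):
        writer = i % n_procs
        total += len(sharers - {writer})  # invalidate every other cached copy
        sharers = set(all_procs)          # asynchronous reads re-cache X everywhere
    return total


def lc_memory_transactions(n_writes: int) -> int:
    """Upper bound on memory transactions under the LC protocol.

    Assumes one refetch after the acquire's self-invalidation and one
    write-back at the release per critical section; no other cache is involved.
    """
    return 2 * n_writes
```

In this model the invalidation count per write is N - 1 and therefore grows with the machine size, while the LC count per write does not depend on N at all.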
Extensions to the LC Protocol
We describe a straightforward extension of the LC protocol to caches with multi-word (multi-location) lines. The basic idea is as follows: the state information described in section 4.2 is maintained for each location in a line. The main effect of having multiple locations per cache block can be thought of as implementing anticipatory prefetches of independent locations to exploit spatial locality.
As a result, different locations in a cache line can be in different states! When a cache line is brought into the cache, all invalid locations in the line are fetched into the cache in the clean state. When a cache line is ejected from the cache, the consistency actions are performed for each location that is being ejected. In other words, the ejection of a line from the cache can be thought of as a block ejection of multiple independent locations.
We illustrate the general mechanism by describing how read and write operations would work in the extended setting. The release and acquire operations can be extended in a similar fashion:
Read Operations
A read to a location L hits in a cache if the line containing the location is in the cache and the location within the line contains valid data (i.e., if the location is not in the invalid state).
In this case the value of location L in the cache line is a legal value to be returned by the read. Otherwise, the read is said to miss. All read-misses in a cache are guaranteed to be legally serviced by memory. Our LC-based protocol guarantees that if a read access cannot be serviced by a cache, the value retrieved from memory is a legal value.
A read to a location L misses if the corresponding line is not in the cache, or if the location within the line is in invalid state. When a read-miss occurs, the location is fetched from memory and put in clean state. All other locations that are fetched from memory are also put in the clean state. We assume that the cache replacement mechanism will take care of writing back locations in the replaced line that are in dirty state.
Write Operations
When a write operation to location L is executed, it checks the cache state of L. If L is in the dirty state, it simply writes the value into the location of the corresponding cache line. If L is in the clean state, it writes the value into the location and changes the state to dirty. If L is in the invalid state (i.e., the write results in a cache miss), it likewise writes the value into the location and changes the state to dirty. If the line containing location L is not in the cache, a separate replacement mechanism will make room in the cache for the new line; location L will be written and put in the dirty state, and all the other locations in the line will be in the invalid state (see footnote 5).
Even in the case of multiple locations per cache line, all consistency state information is maintained at the location level in this extension. This extension implies that two state bits are needed for each word in the cache, which leads to extra cost in the cache support of some modern processor architectures. For example, the Pentium-Pro and the MIPS-4000 processors maintain only one valid and one dirty bit per cache line. The cost of the extension is lowered if the processor maintains a valid bit and a dirty bit per subblock, as in the Alpha 21164 processor. Finally, if it is not possible to maintain valid and dirty bits at the location level, our plan is to extend the LC protocol along the lines of the data merging protocol reported in [15].
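The per-location bookkeeping described in this section can be sketched as follows. This is an illustrative model only: the class name `MultiWordLCCache` is hypothetical, addresses are integers, memory is a dictionary, and line replacement, acquire and release are not modelled.

```python
class MultiWordLCCache:
    """LC cache with per-location state inside multi-word lines."""

    def __init__(self, memory: dict, words_per_line: int = 4):
        self.memory = memory
        self.wpl = words_per_line
        self.lines = {}   # line number -> {"state": [...], "data": [...]}

    def _line(self, addr: int):
        line_no, offset = divmod(addr, self.wpl)
        if line_no not in self.lines:              # allocate with every location invalid
            self.lines[line_no] = {"state": ["invalid"] * self.wpl,
                                   "data": [None] * self.wpl}
        return line_no, offset, self.lines[line_no]

    def read(self, addr: int):
        line_no, offset, line = self._line(addr)
        if line["state"][offset] == "invalid":     # read miss on this location
            base = line_no * self.wpl
            for i in range(self.wpl):              # fetch only the invalid locations
                if line["state"][i] == "invalid":
                    line["data"][i] = self.memory.get(base + i)
                    line["state"][i] = "clean"
        return line["data"][offset]

    def write(self, addr: int, value):
        _, offset, line = self._line(addr)
        line["data"][offset] = value               # other locations keep their state
        line["state"][offset] = "dirty"
```

For example, a write to address 1 followed by a read of address 0 leaves line 0 with states clean, dirty, clean, clean: the fill triggered by the read miss does not overwrite the locally written word.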
Optimization of the LC Cache Protocol
So far, our discussion of the LC cache protocol has been based directly on memory operations such as read, write, acquire, and release. However, further refinement and optimization is possible and desirable.
For example, examine the code section (a) shown in Figure 6. This is a critical section in which the variable X is read twice and updated once. In code section (b), we show how the code is refined. We separate the two functions of the acquire/release operations: the lock/unlock operations and the consistency-related operations. That is, acquire and release operations are replaced by normal lock/unlock operations, which are commonly supported as low-level primitives in modern processor architectures. The consistency-related operations associated with acquire and release are now implemented by "refresh" and "write-back" operations.
We introduce a "refresh" operator before each read operation. A refresh(X) operation will perform a self-invalidation of X following the same rule used for an acquire operation. We also introduce a "write-back" operation after each write operation. The "write-back" operation plays the role of a synchronization with the final "sync-writeback" operation to be placed before the corresponding release operation. The sync-writeback operation can only complete when it has received all synchronization signals from the write-back operations in the critical section.
This ensures that the unlock is not performed until all the write operations to X have been completed.
The refinement presented in this section can be an optimization if, for example, the read operations are conditionally executed. In this case, a refresh(X) operation is performed only when a read(X) operation is performed; the refresh(X) operation need not be performed at each acquire(X) operation as described in section 4.2. Therefore, this refinement can be useful in removing some unnecessary self-invalidation operations at runtime. We anticipate several opportunities for compiler optimization after this kind of refinement is performed, such as global code motion to minimize the overhead of consistency-related operations.
Figure 6: Example of a refinement and optimization.
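The refined operations can be sketched as follows. This is a hedged illustration, not the code of Figure 6 (which is not reproduced in this text): the class name `RefinedLCCache` is hypothetical, lock and unlock are shown only as comments, and the asynchronous write-back is modelled by a pending list that is drained by `sync_writeback`. A line is kept dirty until its write-back has completed so that a later refresh cannot discard the local value.

```python
class RefinedLCCache:
    """LC cache exposing refresh / write-back / sync-writeback separately."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.state = {}      # address -> "clean" | "dirty"; absent means invalid
        self.value = {}
        self.pending = []    # addresses whose write-back has been issued

    def refresh(self, addr):
        if self.state.get(addr) == "clean":   # same rule as for acquire
            del self.state[addr]
            del self.value[addr]

    def read(self, addr):
        if addr not in self.state:            # miss: fetch from memory
            self.value[addr] = self.memory.get(addr)
            self.state[addr] = "clean"
        return self.value[addr]

    def write(self, addr, value):
        self.value[addr] = value
        self.state[addr] = "dirty"

    def write_back(self, addr):
        if self.state.get(addr) == "dirty" and addr not in self.pending:
            self.pending.append(addr)         # issue the write-back; completion comes later

    def sync_writeback(self):
        for addr in self.pending:             # wait for every issued write-back
            self.memory[addr] = self.value[addr]
            self.state[addr] = "clean"
        self.pending.clear()


memory = {"X": 0}
cache = RefinedLCCache(memory)
# lock(X) would go here
cache.refresh("X")
a = cache.read("X")
cache.write("X", a + 1)
cache.write_back("X")
cache.refresh("X")            # no effect: the line is dirty
b = cache.read("X")
cache.sync_writeback()
# unlock(X) would go here
```

If the reads were conditional, the corresponding `refresh` calls would simply be skipped on the paths that do not read X, which is where the saving over an unconditional self-invalidation at acquire comes from.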
Related Work
Cache-coherence protocols have been studied widely, and readers will find an excellent introduction in [14] and the references therein. We have also given a brief review of issues related to this paper in Section 4. We focus our discussion of related work in this section on relaxed memory models, which attempt to relax the restrictions on memory access ordering imposed by the sequential consistency (SC) model so as to enhance performance.
In order to alleviate the SC performance limitations, designers have proposed models that guarantee the SC interface for a restricted set of programs, but allow optimizations to be applied safely. The looser models share the common intuition that if a program has enough synchronization then the program can appear to execute in a sequentially consistent manner without hardware support for sequential consistency. In other words, the results of every run of a "properly synchronized" program are guaranteed to be consistent with some sequentially consistent run of the program. We present these relaxed consistency models in the chronological order in which they appeared in the literature: Release Consistency (1990) in section 5.
Release Consistency
The goal of Release Consistency (RC) is "to exploit additional information about shared accesses to develop a memory consistency model that allows for more efficient implementations" [12]. The RC model distinguishes between ordinary shared memory accesses and synchronization instructions, and further distinguishes between acquire and release synchronization operations.
For example, in the RC protocol, a release triggers a write of a shared variable and an acquire triggers the read of a shared variable. The purpose of a release is to inform other processes that all accesses that precede it (in program order) have completed. Similarly, the purpose of an acquire is to await such a signal from another processor before initiating any further accesses.
RC guarantees sequential consistency for a specific class of programs which it classifies as "properly labeled". Intuitively, a program is properly labeled if there is enough synchronization so that, for all legal interleavings of accesses, pairs of conflicting ordinary accesses are separated by a release-acquire chain, i.e., the program has no data races.
The RC condition imposes certain limitations on how memory operations should be ordered. In the DASH implementation of RC [18], each write operation must be explicitly acknowledged to ensure atomicity and thus satisfy the memory coherence condition. In addition, the release operation contains an implicit "fence" operation which ensures that all previous memory operations are completed before subsequent operations can be issued.
Lazy Release Consistency
Lazy Release Consistency (LRC) can be viewed as an extension to RC aimed at reducing the number of messages and the amount of data exchanged in a distributed shared-memory system implemented in software [16]. The basic motivation for the algorithm is the observation that cross-processor consistency information only needs to be propagated at acquire synchronization points, at which point RC requires all ordinary accesses that precede the corresponding release to be performed with respect to the acquiring processor. While the eager counterparts of LRC (see footnote 6) make modifications globally visible at the time of a release, LRC exploits the intuition that only the processor that acquires a variable needs to see all modifications that precede the acquire. So, whereas in an eager implementation consistency broadcasts are made at release points and may involve all processors, in the lazy implementation the broadcasts are made at the points of acquire, and involve only the acquiring and releasing processors.
In the DSM implementation of LRC [16], coherence actions only take place at the acquire points, while release operations involve no coherence actions. It is found that both the number of messages and the amount of data exchanged are generally smaller for LRC than for an eager implementation of RC.
Entry Consistency and the Midway System
Entry Consistency (EC) [2] can also be viewed as a further relaxed extension of RC. The basic difference is the following: whereas in RC a synchronization object (variable) protects access to all shared data, in EC there is an explicit correspondence between synchronization variables and the shared data they guard. As in LRC, modifications are propagated at an acquiring synchronization, but now only the shared data that the synchronization variable guards is guaranteed to be consistent at that point. This correspondence between shared and synchronization data, which is implicit in the structure of a parallel program, is required by EC to be made explicit to the compiler and run-time system. An aggressive implementation can make use of this information to reduce the number of consistency messages (e.g., cache invalidations and/or updates) flowing across the system.
As in previous models, programs that include all the necessary labeling information and have no data races (i.e., are "properly synchronized") observe a sequentially consistent shared memory. Measurements made on the implementation of EC on Midway [2] show that a program written for EC requires substantially fewer consistency transactions than stronger models such as RC. However, the caching protocol is still essentially based on a "single-ownership" model.
Dag Consistency
Blumofe et al. [3] defined the dag consistency memory model for deterministic spawn-sync multithreaded programs in which dynamic program execution is modeled as a computation dag (directed acyclic graph) of non-suspensive "threads". Threads are created as follows. Main program execution begins in a single thread. A spawn (fork) statement creates a procedure call in a new thread that can execute concurrently with the caller. A sync (join) statement terminates the executing thread and creates a new thread for the computation that follows the sync statement. This new thread must wait for all threads spawned by the previous thread to terminate before it can start execution. Each vertex in the computation dag corresponds to a distinct thread. An edge in the computation dag represents a partial order constraint, i.e., an edge exists between the caller thread and the callee thread in a spawn statement and between each previously spawned thread and the continuation thread in a sync statement. Thus, the computation dag defines a partial order on threads. The computation dag is an abstraction used in defining dag consistency, and is not actually computed at runtime.
The shared memory of a multithreaded computation is said to be dag-consistent if the following two conditions hold:
1. When thread i reads a memory location, it receives a value that was written by some thread j such that $i \not\prec j$ in the computation dag.
2. For any three threads i, j, k such that $i \prec j \prec k$, if k reads a location written by both i and j then the value read by thread k is not the one written by thread i.
These conditions for dag consistency appear to be identical to the conditions for location consistency defined in [9, 10] when applied to the special case of deterministic spawn-sync programs considered in [3]. For deterministic programs, the set of most recent writes for a location in the LC model will always either be an empty set or a singleton set.
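The two dag-consistency conditions can be checked mechanically on a small computation dag. The sketch below assumes the conditions are read with a strict precedence relation on threads (reachability along dag edges): a reader may not observe a writer that it precedes, and a writer is masked when another writer lies between it and the reader. The function names and the adjacency-dictionary representation are hypothetical.

```python
def precedes(edges: dict, a, b) -> bool:
    """True if there is a non-empty path from thread a to thread b in the dag."""
    stack = list(edges.get(a, ()))
    seen = set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False


def legal_writers(edges: dict, writers: set, reader) -> set:
    """Writer threads whose value the reader may receive under dag consistency."""
    legal = set()
    for w in writers:
        if precedes(edges, reader, w):        # condition 1: reader must not precede writer
            continue
        masked = any(                         # condition 2: no writer strictly in between
            precedes(edges, w, m) and precedes(edges, m, reader)
            for m in writers if m != w
        )
        if not masked:
            legal.add(w)
    return legal
```

For a chain a, b, c, d in which a, b and d write the location and c reads it, only b is a legal writer: d is excluded because c precedes it, and a is masked by b.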
Conclusions
In this paper, we argued that the past trend of including the memory coherence assumption in all memory consistency models should be re-examined. We believe that this assumption imposes serious limitations in the design of scalable shared-memory multiprocessors. We propose that memory consistency models should instead be based on the partial order execution semantics of parallel programs without relying on the memory coherence assumption.
We defined a new memory model called Location Consistency (LC) in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write operations and synchronization operations. We established the usefulness and robustness of the LC model by proving the weakness, equivalence, monotonicity and non-intrusive properties for the LC model. We introduced a new cache consistency protocol for the LC model. This LC protocol is simple and scalable. It is unique in its support of the LC model without incurring any overhead of bus snooping or of maintaining directories.
