Abstract. This paper gives a correctness proof for the on-chip COMA cache coherence protocol that supports the Microgrid of microthreaded architecture, a multi-core architecture capable of integrating hundreds to hundreds of thousands of processors on a single silicon chip. We use the Abstract State Machine (ASM) as a theoretical framework for the specification of the on-chip COMA cache coherence protocol. We show that the protocol obeys the Location Consistency model proposed by Gao and Sarkar.
Introduction
A number of computer system architecture and implementation issues, such as long wire delay, heat dissipation and memory synchronization, have driven computer architecture to an inevitable transition from single-core to multicore processor design. The Microgrid of microthreaded architecture [14, 3, 4, 13] is designed to possess thousands of simple on-chip processing cores, while providing scalable throughput both on and off chip. The microthreaded architecture can perform an explicit context switch during long-latency operations such as memory accesses, without wasting processor time.
The shift from off-chip to on-chip multiprocessing allows the cache coherence protocol to operate at a higher clock rate. In addition, the capability of tolerating long memory access latency in the microthreaded architecture helps us revive a paradigm used in earlier parallel computers, such as the Kendall Square KSR1 [5]. We introduce a Cache Only Memory Architecture (COMA) [9] for the on-chip cache system. In COMA, all the memory modules can be considered as large caches, called Attraction Memory (AM). Data is stored by cacheline, but a line has no fixed home location. As in COMA, a piece of data in on-chip COMA can be replicated and migrated dynamically between caches. The main difference is that a traditional COMA system holds all data in the system without a backing store, while the on-chip COMA has an off-chip backing store for data, reached through an interface for storing incoming data. The reader is referred to [17] for more detail.
Although the on-chip COMA shares some properties and structure with the traditional COMA, the underlying consistency models and supporting cache coherence protocols are largely different. To balance programming complexity and execution efficiency, a number of memory consistency models [12] have been proposed. The most commonly used memory consistency model is the sequential consistency (SC) model given by Lamport [16]. In this model, memory operations performed by the processors are serialized. Since SC requires all the processors to observe the write requests in some unique order, atomic broadcast communication is generally required for implementing SC, which may severely impair cache throughput and execution efficiency. Furthermore, the assumption of a universal order poses fundamental obstacles to defining a scalable and efficient view of memory consistency in a computer system. A number of models that relax SC, such as release consistency, lazy release consistency, entry consistency and dag consistency, have been proposed in [9, 15, 1, 2]. Location Consistency (LC), proposed by Gao and Sarkar in [7, 8], is considered the weakest memory model to date. In the LC model, memory operations performed by processors need not be seen in the same order by all processors, and therefore there can be multiple legal values for a memory location at the same time. The on-chip COMA cache coherence protocol is designed to obey this consistency model. In its cache system, multiple legal values of a memory location are stored in different caches. This significantly reduces the consistency-related traffic in the network, since a read operation can read a legal value from a local cache or from the main memory (in the case that there are no legal values available in the caches).
In this paper, we give a correctness proof for the design of the on-chip COMA cache coherence protocol. We show that our protocol does not rely on the memory coherence assumption, and therefore it does not satisfy the SC and SC-derived models. However, it obeys the LC model of Gao and Sarkar. Indeed, our protocol is strictly stronger than the LC model. We use the Abstract State Machine (ASM) [10, 11] as a theoretical framework for the specification and verification of our protocol.
Location consistency
In this section, we follow Gao and Sarkar [7, 8] to define location consistency with respect to the microthreaded architecture.
Program model
Our program model consists of two memory operations and two synchronization operations whose descriptions are as follows:
- Memory read: If thread $T_i$ needs to read a value from memory location $L$, it performs a read($T_i$, $L$) operation, which is also represented by the notation read $L$ in thread $T_i$'s instruction sequence.
- Memory write: If thread $T_i$ needs to write the value $v$ to location $L$, then it must wait for all read operations issued by $T_i$ and its subthreads on location $L$ to be complete and then performs a write($T_i$, $v$, $L$) operation, which is also represented by the notation $L := v$ in thread $T_i$'s instruction sequence.
- Thread creation: If thread $T_i$ needs to create a family of threads, then it must wait for all write operations issued by $T_i$ and its subthreads to be complete and then performs a create($T_i$, $F$) operation, where $F$ is a sequence of threads. This operation is represented by the notation create($F$) in $T_i$'s instruction sequence. We note that every thread $T_j$ of $F$ is a subthread of $T_i$, and all subthreads of $T_j$ are also subthreads of $T_i$.
- Barrier synchronization: If thread $T_i$ needs to identify the termination of a specified family of threads, it performs a sync($T_i$, $F$) operation, where $F$ is the specified family. This operation is represented by the notation sync($F$) in thread $T_i$'s instruction sequence. The instructions after sync($F$) in thread $T_i$ must wait until all write operations of the threads in $F$ and their subthreads are complete.
State update for a memory location
In the LC model, the state of a memory location is a partially ordered multiset of memory and synchronization operations. Given a memory location $L$, the state of $L$ is a partially ordered multiset (pomset) $\mathrm{state}(L) = (S, \prec)$, where $S$ is a multiset and $\prec$ is a partial order on $S$. Each element of $S$ is a memory operation or a synchronization operation involving location $L$. Two elements of $S$ can have the same value; they are still distinguished by the partial order. For two operations $e_1, e_2 \in S$ such that $(e_1, e_2) \in \prec$, we say that $e_1$ is a predecessor of $e_2$. Initially, the state of a memory location is empty. For an operation $e$, we write $\mathrm{thread}(e)$ for the thread involved in operation $e$, i.e. $\mathrm{thread}(e) = T_i$ where e ∈ {write(
The state of a memory location $L$ is updated when a memory operation on the location $L$ or a synchronization operation is performed. The new operation is inserted into the current multiset of the state. The precedence relation (the partial order $\prec$) is updated by the following rules:
1. All operations in the multiset issued by the same thread as the new operation are predecessors of the new operation.
2. The thread creation operation that created the thread containing the new operation is a predecessor of the new operation.
3. If the new operation is a barrier synchronization operation, then all operations issued by the threads involved in the barrier synchronization are predecessors of the new operation.
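Let $L$ be a memory location with the current state $(S, \prec)$. The state update of $L$ with operation $e$ is defined as follows: $S := S \cup \{e\}$. Moreover,
1. if $e$ is a memory operation then $\prec := \mathrm{trans}(\prec \cup \{(e', e) \mid e' \in S \wedge e' \neq e : \mathrm{thread}(e') = \mathrm{thread}(e)\} \cup \{(e', e) \mid e' \in S : e' = \mathrm{create}(T_i, F) \wedge \mathrm{thread}(e) \in F\})$;
2. if $e = \mathrm{create}(T_i, F)$ then $\prec := \mathrm{trans}(\prec \cup \{(e', e) \mid e' \in S \wedge e' \neq e : \mathrm{thread}(e') = \mathrm{thread}(e)\})$;
3. if $e = \mathrm{sync}(T_i, F)$ then $\prec := \mathrm{trans}(\prec \cup \{(e', e) \mid e' \in S \wedge e' \neq e : \mathrm{thread}(e') = \mathrm{thread}(e)\} \cup \{(e', e) \mid e' \in S : \mathrm{thread}(e') \in F\})$.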
The function trans maintains the transitivity of the precedence relation $\prec$. Let $\prec$ be a binary relation over a set $S$. Then $\mathrm{trans}(\prec) = \prec \cup \{(e_1, e_2) \mid \exists e_1, e_2, e' \in S : (e_1, e') \in \prec \wedge (e', e_2) \in \prec\}$.
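The three update rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the protocol specification: the class `Op`, its field names and the `uid` field used to tell apart equal elements of the multiset are hypothetical, and `trans` is iterated to a fixed point so that the result is fully transitive.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Op:
    uid: int                              # hypothetical id: separates equal multiset elements
    kind: str                             # "read", "write", "create" or "sync"
    thread: str                           # thread issuing the operation
    value: Optional[int] = None           # value written (writes only)
    family: FrozenSet[str] = frozenset()  # F, for create and sync operations


def trans(order):
    """Apply the one-step rule repeatedly until no new pair appears (assumption)."""
    closure = set(order)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def update(S, order, e):
    """Insert e into the state (S, order) and return the new state."""
    # Rule 1: earlier operations of the same thread precede e.
    new = {(p, e) for p in S if p != e and p.thread == e.thread}
    if e.kind in ("read", "write"):
        # Rule 2: the create operation whose family contains e's thread precedes e.
        new |= {(p, e) for p in S if p.kind == "create" and e.thread in p.family}
    elif e.kind == "sync":
        # Rule 3: operations of the threads being synchronized precede e.
        new |= {(p, e) for p in S if p.thread in e.family}
    return S | {e}, trans(order | new)


# T0 creates {T1, T2}, both write to the location, then T0 synchronizes on them.
fam = frozenset({"T1", "T2"})
c = Op(0, "create", "T0", family=fam)
w1 = Op(1, "write", "T1", value=1)
w2 = Op(2, "write", "T2", value=2)
s = Op(3, "sync", "T0", family=fam)

S, order = set(), set()
for e in (c, w1, w2, s):
    S, order = update(S, order, e)
```

In this run the two writes stay unordered with respect to each other, while both precede the synchronization operation.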
State observability for a memory location
The state of a memory location in the LC model can be observed via read operations. Let $L$ be a memory location with $\mathrm{state}(L) = (S, \prec)$, and $r \in S$ a read operation on $L$. The most recent predecessor write with respect to $r$ is a write operation $w \prec r$ such that there is no other write operation $w' \in S$ satisfying $w \prec w' \prec r$. The read operation $r$ reads a legal value $v$ if there is a write operation $w$ such that $w = \mathrm{write}(T, v, L)$, and
1. $w$ is the most recent predecessor write with respect to $r$, or
2. $r$ and $w$ are unordered, i.e. $(w, r) \notin \prec$.
The set $V(r)$ is the set of all legal values returned by $r$.
Finally, we recall the definition of the Location Consistency from [8] as follows: A multiprocessor system is location consistent if for any read operation R with target location L of any execution of a program on the system, R always returns one legal value.
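As an illustration of the set of legal values $V(r)$ defined above, the following Python sketch computes it from an explicit precedence relation. The function name `legal_values` and the encoding of writes as a dictionary are assumptions made for the example. The sketch reads "unordered" as "neither operation precedes the other", which coincides with $(w, r) \notin \prec$ whenever $r$ has no successors in the state.

```python
def legal_values(writes, order, r):
    """Return V(r) for the read r.

    writes: dict mapping a write identifier to the value it wrote
    order:  set of pairs (a, b) meaning a precedes b, assumed transitive
    """
    preds = {w for w in writes if (w, r) in order}
    # Most recent predecessor writes: no other predecessor write lies in between.
    recent = {w for w in preds
              if not any((w, w2) in order for w2 in preds if w2 != w)}
    # Writes that are unordered with respect to r.
    unordered = {w for w in writes
                 if (w, r) not in order and (r, w) not in order}
    return {writes[w] for w in recent | unordered}


# w0 precedes w1, w1 precedes r, and w2 is unordered with respect to r.
writes = {"w0": 0, "w1": 1, "w2": 2}
order = {("w0", "w1"), ("w1", "r"), ("w0", "r")}
values = legal_values(writes, order, "r")
```

Here the value written by `w0` is not legal for `r` because `w1` overwrites it before the read, whereas the values of `w1` and `w2` are both legal.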
3 The on-chip COMA cache coherence protocol
Threads are distributed to processors for their execution. One or more threads can be executed on a processor. Each processor may cache values for many memory locations. In particular, a processor has a cache consisting of a number of cachelines. Note that a cache can be connected to more than one processor. The value of a memory location is cached in a cacheline. The values of a memory location stored in different caches can be different, since they are not updated at the same time. The on-chip COMA cache coherence protocol is designed to maintain location consistency in the cache system of the microthreaded architecture. This section briefly introduces the protocol. For simplicity, we assume that a cache is connected to one processor only, and that a cacheline in the cache coherence protocol stores a unique value.
In the protocol, caches are connected in a directed ring network which has a directory holding the information about all the data available on the ring. Only the directory has access to the main memory. Hence, any load from or write-back to the main memory must be handled through this node. The on-chip COMA cache coherence protocol is based on a MOSI variation in which a cacheline has four main states, MODIFIED, OWNER, SHARED and INVALID, and three temporary states, READ PENDING, READ PENDING I and WRITE PENDING (see Fig. 1), whose descriptions are given below:
- INVALID: If a cacheline is in the INVALID state, it has no valid data;
- MODIFIED: If a cacheline is in the MODIFIED state, it has the exclusiveness of the data;
- OWNER: If a cacheline is in the OWNER state, it has the ownership of the data, and there can be another cacheline in the system that has valid data;
- SHARED: If a cacheline is in the SHARED state, it has valid data but no ownership of the data;
- READ PENDING: If a cacheline is in the READ PENDING state, it is waiting for valid data to be loaded;
- READ PENDING I: If a cacheline is in the READ PENDING I state, it has received an invalidation request while waiting for valid data to be loaded. After the valid data is loaded, its state will become INVALID;
- WRITE PENDING: If a cacheline is in the WRITE PENDING state, it is waiting for the exclusiveness of the data.
A cache can handle two kinds of requests: local requests and network requests. Local requests are memory operations issued by processors, while network requests (or messages) occur during the communication between caches and have a higher priority to be considered than local ones. We note that since caches are connected in a directed ring, network requests can only be sent or passed to the next cache on the ring. The types of requests of the on-chip COMA cache coherence protocol are:
- LR (Local Read): the type of a read operation issued by a processor;
- LW (Local Write): the type of a write operation issued by a processor;
- RS (Remote Read to SHARED state): issued by a cache to ask for valid data when it receives an LR request but has no valid data;
- SR (to SHARED state Read Reply): issued by a cache when it receives an RS request and has valid data;
- IV (InValidation): issued by a cache when it receives an LW request. The cache wants to obtain the ownership of the data, and therefore tries to invalidate all other copies of the data on the ring network;
- WB (Write Back to main memory): issued by a cache to write back dirty data (whose cacheline is in the MODIFIED or OWNER state) to the main memory before the ejection of the cacheline;
- eject: issued by a cache to notify the directory that a cacheline in the SHARED state has been ejected.
The specification of the protocol
This section specifies the on-chip COMA cache coherence protocol in the Abstract State Machine (ASM) framework [10, 11] . The protocol is considered as an ASM whose transition rules represent the behavior of the protocol.
Vocabulary
We assume the existence of a fixed set Thread of threads, a fixed set Processor of processors, a fixed set Location of memory locations, a fixed set Operation of operations, a fixed set Message of messages, and a fixed set Data of data values. The undefined value or attribute of an object is denoted by undef. A thread T has an attribute proc ∈ Processor that identifies the processor to which T is distributed. Let Type = {LR, LW, CRE, SYNC, RS, SR, IV, WB, eject}. An operation e ∈ Operation has four attributes type ∈ Type, thread ∈ Thread, val ∈ Data and loc ∈ Location to characterize the type, the thread, the data and the memory location involved in the operation. For instance, for a write operation e = write(T, v, L), e.type = LW, e.thread = T, e.val = v and e.loc = L.
By the assumption above, each processor has exactly one cache, and vice versa. We can therefore assume that messages are issued by processors as well. Hence, a message m ∈ Message has three attributes type ∈ Type, val ∈ Data and loc ∈ Location to characterize the type, the data and the memory location involved in the message. Moreover, it has an attribute source ∈ Processor to characterize the processor that originally sent out the request. We denote the empty message by noMess. For each processor P and for each location l ∈ Location, the pair (P, l) represents a unique cacheline whose description is given by the following functions:
- state(P, l) ∈ {undef, INVALID, MODIFIED, OWNER, SHARED, READ PENDING, READ PENDING I, WRITE PENDING}. Initially, state(P, l) = undef;
- cacheDirty?(P, l), to characterize whether the cacheline holds dirty data or not. Initially, cacheDirty?(P, l) = false;
- Pending?(P, l) = if state(P, l) = READ PENDING or state(P, l) = READ PENDING I or state(P, l) = WRITE PENDING then true else false. This function determines whether or not the cacheline is waiting for data to be loaded or for the ownership of the data;
- Eject(P, l) = {state(P, l) := undef, cacheDirty?(P, l) := false, cacheValid?(P, l) := false, state(P, P.curOp.loc) := INVALID}. This function makes room for loading or writing data for the memory location currently concerned by P;
- Invalidate(P, l) = {state(P, l) := INVALID, cacheDirty?(P, l) := false, cacheValid?(P, l) := false}.
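The cacheline functions above can be mirrored by a small Python sketch. It is only an illustration under assumptions: the class name `Cache`, the use of dictionaries keyed by location, and the choice to model undef as a missing key are not part of the specification.

```python
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    MODIFIED = auto()
    OWNER = auto()
    SHARED = auto()
    READ_PENDING = auto()
    READ_PENDING_I = auto()
    WRITE_PENDING = auto()


PENDING = {State.READ_PENDING, State.READ_PENDING_I, State.WRITE_PENDING}


class Cache:
    """Per-processor view of the cachelines (P, l); P is implicit."""

    def __init__(self):
        self.state = {}   # location -> State; a missing key stands for undef
        self.dirty = {}   # location -> bool, mirrors cacheDirty?
        self.valid = {}   # location -> bool, mirrors cacheValid?

    def pending(self, loc):
        # Pending?(P, l): the line waits for data or for ownership.
        return self.state.get(loc) in PENDING

    def invalidate(self, loc):
        # Invalidate(P, l)
        self.state[loc] = State.INVALID
        self.dirty[loc] = False
        self.valid[loc] = False

    def eject(self, loc, cur_loc):
        # Eject(P, l): drop line l and set up an INVALID line for the
        # location of the current operation (cur_loc stands for P.curOp.loc).
        self.state.pop(loc, None)
        self.dirty[loc] = False
        self.valid[loc] = False
        self.state[cur_loc] = State.INVALID


cache = Cache()
cache.state["a"] = State.SHARED
cache.valid["a"] = True
cache.eject("a", "b")   # make room for location "b"
```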
Let dir be the directory of the ring network, which can access the main memory and holds information about the data available on the ring network. A processor P ∈ Processor has the following attributes:
- cacheOccupied? ∈ {true, false}, to determine whether or not all cache entries of P are occupied. This function is monitored by the execution environment;
- id ∈ N, to define the index of P;
- neighbor ∈ Processor ∪ {dir}, to characterize the next node of P on the ring network;
- ejectee ∈ Location, to characterize the memory location to be ejected to free a cache entry when all cache entries of P are occupied. This function is monitored by the execution environment, which satisfies the condition that state(P, P.ejectee) ≠ undef;
- Return ∈ Data × Location, to return a read value asked for by an LR request;
- curMess ∈ Message, to characterize the current network request to be handled by P;
- curOp ∈ Operation, to characterize the current operation performed by P in the current step. This function is a dynamic function and is monitored by the execution environment;
- nextOp ∈ Operation, to indicate the next operation performed by P. This function is a dynamic function and is monitored by the execution environment.
The directory dir has the following attributes:
- MMVal : Location → Data, to determine the value of a location stored in the main memory;
- neighbor ∈ Processor, to characterize the next node of dir on the ring network;
- curMess ∈ Message, to characterize the current network request to be handled by dir;
- cacheCounter : Location → N, to determine the number of valid caches for a memory location on the ring network. Initially, for all l ∈ Location, dir.cacheCounter(l) = 0. This counter is updated as follows. When the directory receives an IV request, meaning that a cache wants to obtain the ownership of the data, the counter is set to 1. When the directory receives an RS (or SR) request from (or for) a processor whose cacheline is not in the READ PENDING I state, meaning that a cache wants to have valid data, the counter is increased by 1. When the directory receives a WB (or eject) request, meaning that a cacheline with valid data has been ejected, the counter is decreased by 1.
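The counter updates summarized above can be written down as a short Python sketch. It follows the prose summary literally and is only an approximation: the transition tables decide exactly when the RS and SR cases fire, and the flag `requester_in_read_pending_i` is a hypothetical stand-in for however the directory learns the state of the requesting cacheline.

```python
from collections import defaultdict


class Directory:
    def __init__(self):
        # cacheCounter: number of valid caches per location, initially 0.
        self.cache_counter = defaultdict(int)

    def observe(self, msg_type, loc, requester_in_read_pending_i=False):
        """Update the counter for loc when a message of msg_type arrives."""
        if msg_type == "IV":
            # A cache obtains ownership: exactly one valid copy remains.
            self.cache_counter[loc] = 1
        elif msg_type in ("RS", "SR"):
            # A cache obtains valid data, unless it was invalidated meanwhile.
            if not requester_in_read_pending_i:
                self.cache_counter[loc] += 1
        elif msg_type in ("WB", "eject"):
            # A cacheline with valid data has been ejected.
            self.cache_counter[loc] -= 1


d = Directory()
d.observe("RS", "x")                                    # counter for x: 1
d.observe("SR", "x")                                    # counter for x: 2
d.observe("SR", "x", requester_in_read_pending_i=True)  # unchanged
d.observe("IV", "x")                                    # reset to 1
d.observe("eject", "x")                                 # back to 0
```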
There are also three auxiliary functions needed for the specification of the protocol:
- SendMess(P, messType, val, loc) = {if messType = SR then mess.source := P.curMess.source else mess.source := proc, mess.type := messType, mess.val := val, mess.loc := loc, P.neighbor.curMess := mess, P.curMess := noMess}, to send a message from node P to the next node (P.neighbor) on the ring;
- PassMess(P) = {P.neighbor.curMess := P.curMess, P.curMess := noMess}, to pass the current message of node P to the next node (P.neighbor) on the ring;
- readPending?(P, loc) = if state(P, loc) = READ PENDING or state(P, loc) = READ PENDING I then true else false.
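To make the message-passing functions concrete, here is a hedged Python sketch of SendMess and PassMess on a three-node directed ring. The classes `Node` and `Message` are hypothetical, noMess is modelled as `None`, and the sketch reads proc in SendMess as the sending node P, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


class Node:
    def __init__(self, name):
        self.name = name
        self.neighbor = None   # next node on the directed ring
        self.cur_mess = None   # None stands for noMess


@dataclass
class Message:
    type: str
    val: Optional[int]
    loc: str
    source: Node


def send_mess(p, mess_type, val, loc):
    # A read reply (SR) keeps the source of the request being answered;
    # any other message is stamped with the sender (assumed meaning of proc).
    source = p.cur_mess.source if mess_type == "SR" else p
    p.neighbor.cur_mess = Message(mess_type, val, loc, source)
    p.cur_mess = None


def pass_mess(p):
    # Forward the current message unchanged to the next node.
    p.neighbor.cur_mess = p.cur_mess
    p.cur_mess = None


# Ring a -> b -> c -> a: a asks for x, b forwards, c replies with value 7.
a, b, c = Node("a"), Node("b"), Node("c")
a.neighbor, b.neighbor, c.neighbor = b, c, a
send_mess(a, "RS", None, "x")
pass_mess(b)
send_mess(c, "SR", 7, "x")
```

After these three steps the reply sits at node `a` and still names `a` as its source, which is how a requester can recognize the reply to its own request.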
Transition rules
The behavior of the on-chip COMA protocol is represented as an ASM module whose transition rules are given in Tables 1 through 8. We will sometimes shorten self.curMess, self.cacheOccupied?, self.ejectee, self.neighbor, self.curOp, self.nextOp, self.id, self.MMVal and self.cacheCounter to curMess, cacheOccupied?, ejectee, neighbor, curOp, nextOp, id, MMVal and cacheCounter.
With reference to Table 1, we first explain how a processor P reacts when it receives an LR (Local Read) request. As mentioned earlier, network requests have higher priority than local ones. Thus, this local read request is only considered when there is no network request available for P, i.e. P.curMess = noMess. If there is no cache entry set up yet for the memory location involved in the request (state(P, P.curOp.loc) = undef), then P first checks whether all cache entries are occupied. If so (P.cacheOccupied? = true), P has to eject a cacheline determined by the execution environment (P.ejectee with state(P, P.ejectee) ≠ undef). If the ejectee has dirty data, then this data is written back to the main memory by sending a WB (Write Back) message to the directory. If the ejectee has valid (but not dirty) data, then P sends out an eject message to notify the directory. If the ejectee is in a pending state, then the removal is also pending. P then removes the ejectee and sets up another cacheline for the location concerned (state(P, P.curOp.loc) = INVALID).
If the cacheline for the location contains valid data, then P just sends back the value stored in the cacheline, and the read operation is considered complete. If it contains no valid data (state(P, P.curOp.loc) = INVALID), P sends an RS request to the other processors to ask for valid data.
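The local read handling described above can be sketched as follows. This is a simplified illustration, not a transcription of Table 1: the function `handle_local_read`, the dictionary layout of a cacheline and the callback `pick_ejectee` (standing in for the execution environment) are hypothetical. The sketch assumes that no network request is waiting, omits the case of a pending ejectee, and assumes that the line moves to READ PENDING once the RS request is sent.

```python
def handle_local_read(cache, loc, capacity, pick_ejectee):
    """Return the list of actions taken for a local read of loc.

    cache: dict mapping a location to {"state": str, "val": value, "dirty": bool};
           a missing key stands for state undef.
    """
    actions = []
    if loc not in cache:                        # no entry set up for loc yet
        if len(cache) >= capacity:              # cacheOccupied?
            victim = pick_ejectee(cache)        # chosen by the environment
            line = cache.pop(victim)
            if line["dirty"]:
                actions.append(("WB", victim, line["val"]))   # write back dirty data
            elif line["state"] == "SHARED":
                actions.append(("eject", victim, None))       # notify the directory
        cache[loc] = {"state": "INVALID", "val": None, "dirty": False}
    line = cache[loc]
    if line["state"] in ("MODIFIED", "OWNER", "SHARED"):
        actions.append(("return", loc, line["val"]))          # read completes locally
    elif line["state"] == "INVALID":
        line["state"] = "READ_PENDING"                        # assumed state change
        actions.append(("RS", loc, None))                     # ask the ring for data
    return actions


first = lambda c: next(iter(c))   # trivial ejectee choice for the example
cache = {"a": {"state": "MODIFIED", "val": 5, "dirty": True}}
hit = handle_local_read(cache, "a", 1, first)
miss = handle_local_read(cache, "b", 1, first)
```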
With reference to Table 2, we now explain how a processor P reacts when it receives an LW (Local Write) request. As in the previous case, this request is only considered when no network request is available. Moreover, if there is no cacheline for the location yet, P has to set one up for writing the new data, as in Table 1. If there is already an available cacheline, the data is simply overwritten. If the cacheline is not waiting for the exclusiveness of the data (i.e. it is not in the WRITE PENDING state), P sends out an IV request to the other processors to ask for the exclusiveness of the data, and the state of the cacheline becomes WRITE PENDING. We impose the following condition on all write operations:
Condition 1. For every processor P, if the current operation of P is a write operation (P.curOp = w), then all the read operations performed by w.thread and its subthreads on the same location (w.loc) must be complete.
Table 3 gives the transition rules for a node P receiving an RS (Remote Read to SHARED state) request. If P originally sent out the request and it has valid data, then P just sends back the value stored in its cacheline. All the read operations performed by P on the same location are considered complete. Otherwise, if P is the directory and there are no caches on the ring having valid data (P.cacheCounter(curMess.loc) = 0), then it sends out the value stored in the main memory together with the read reply SR. The counter of valid caches for the location involved in the message is increased by 1 if the requesting cacheline was not in the READ PENDING I state. If P is a processor that has valid data, then it sends out the cached value together with the reply SR. Note that if the cacheline is in the MODIFIED state, then its state becomes OWNER. In the remaining cases, P just passes the message to the next node on the ring.
Table 4 gives the transition rules for a node P receiving an SR (to SHARED state Read Reply) request. If P is waiting for valid data (state(self, curMess.loc) ∈ {READ PENDING, READ PENDING I}), then P just overwrites the data.
The state of the cacheline concerned becomes SHARED if it was READ PENDING, and becomes INVALID otherwise. If P is waiting for this reply, then it just sends back the data involved in the reply. All the read operations performed by P on the same location are considered complete. Otherwise, P passes the message to the next node on the ring network. Note that if P is the directory and the cacheline waiting for this reply is not in the READ PENDING I state, then the counter of valid caches for the memory location involved in the message is increased by 1.
Table 5 presents the transition rules for a node P receiving an IV (Invalidation) request. If P originally sent out the request and is waiting for the exclusiveness of the data, then it obtains the exclusiveness of the data. The state of the cacheline concerned becomes MODIFIED. All the write operations performed by P on the same location are considered complete. If P did not send out the request but is also waiting for the exclusiveness of the data, then a race occurs.
In this case, we compare the indexes of P and of the processor that sent out the request. If P has the smaller index, then it has to give up the exclusiveness of the data, and the state of the cacheline concerned becomes INVALID. If P is waiting for valid data (state(P, P.curMess.loc) = READ PENDING), then the state of the cacheline concerned becomes READ PENDING I. In the remaining defined states, the state is reset to INVALID. Finally, if P is the directory, then the counter of valid caches for the location involved in the message is reset to 1.
With reference to Table 6, we explain how a node P reacts when it receives a WB (Write Back to the main memory) request. If P is the directory, then P updates the memory value with the data involved in the message. The counter of valid caches for the location concerned is decreased by 1. If P is waiting for valid data, then, as in the case of receiving an SR request, P just updates the cached value with the data value involved in the WB message. The state of the cacheline concerned in P becomes SHARED if it was in the READ PENDING state, and it becomes INVALID otherwise. P then passes the message to the next node on the ring.
Table 7 presents the transition rules for a node P receiving an eject message. They are similar to the transition rules in Table 6, except that P does not update the value stored in the main memory.
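The reaction of a cacheline to an IV request, including the race between two writers described for Table 5, can be sketched as a small state function. This is a hedged illustration rather than the content of Table 5: the function name `handle_iv` is hypothetical, and two cases that the prose does not spell out are treated as assumptions, namely that the processor with the larger index keeps waiting in WRITE PENDING and that a line already in READ PENDING I stays there.

```python
def handle_iv(state, my_id, source_id):
    """Return (new_state, write_complete) for one cacheline.

    state:     current state of the line at this processor, or None for undef
    my_id:     index of the processor receiving the IV request
    source_id: index of the processor that originally sent the IV request
    """
    if state == "WRITE_PENDING":
        if source_id == my_id:
            return "MODIFIED", True        # own request came back: exclusiveness gained
        if my_id < source_id:
            return "INVALID", False        # race lost: the smaller index gives up
        return "WRITE_PENDING", False      # assumption: the larger index keeps waiting
    if state == "READ_PENDING":
        return "READ_PENDING_I", False     # remember the invalidation until data arrives
    if state in (None, "READ_PENDING_I"):
        return state, False                # assumption: nothing to invalidate
    return "INVALID", False                # MODIFIED, OWNER, SHARED, INVALID


# Processors 1 and 2 both wait for exclusiveness; the request of 2 reaches 1.
loser = handle_iv("WRITE_PENDING", my_id=1, source_id=2)
# The same request later returns to its sender.
winner = handle_iv("WRITE_PENDING", my_id=2, source_id=2)
```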
In Table 8, we provide the transition rules for a processor P in the case that its current operation is a thread creation or a synchronization operation. We impose the following conditions to ensure that these operations are treated in the right order.
Condition 2.
1. For every processor P, if the current operation of P is a creation operation (P.curOp = c), then all the write operations performed by the creating thread c.thread and its subthreads must be complete.
2. For every processor P, if the current operation of P is a synchronization operation (P.curOp = sync(F)), then all the write operations performed by the threads in F and their subthreads must be complete.
The on-chip COMA cache coherence protocol obeys LC
In this section, we show that the on-chip COMA cache coherence protocol obeys the LC model, i.e. a read operation $r$ always returns a value belonging to the set $V(r)$ defined in Section 2. By the ASM Lipari Guide [11], we lose no generality by proving correctness of an arbitrary linearization of a run of a distributed ASM. Hence, let ρ be a linearization of an arbitrary distributed run of the on-chip COMA cache coherence protocol. We adapt the definition of state update for a memory location in Section 2.2 as follows. When the current operation of a processor P concerning a memory location $L$ is updated by a move P.curOp := P.nextOp in ρ, the state of memory location $L$ is updated as $S := S \cup \{\text{P.curOp}\}$. We say that:
- A processor P performs a read $r$ at a move $P_r$ if P.curOp = $r$;
- A processor P completes a read $r$ and reads value $v$ at a move $C_r$ if cacheValid?(P, r.loc) holds at $P_r$, $C_r = P_r$ and $v$ = cacheVal(P, r.loc), or $C_r$ is the first move after $P_r$ at which P.curMess = mess with mess.source = P and either mess.type = RS, cacheValid?(P, r.loc) and $v$ = cacheVal(P, r.loc), or mess.type = SR and $v$ = mess.val.
Theorem 1. In ρ, let $C_r$ be a move at which a processor P reads value $v$ for a read operation $r$. Then $v$ is a legal value returned by $r$, i.e. $v \in V(r)$.
Proof. See Appendix A.
6 Relation with the standard consistency models
Our consistency model is weaker than the SC model
Sequential consistency requires all memory operations to be executed in some sequential order, and the operations within a process to be executed in program order. In the microthreaded architecture, which adopts the location consistency model, memory accesses to different locations do not conform to any order. To illustrate that the model does not satisfy SC, we recall the standard example from [7].
Example 1. Let threads $T_1$ and $T_2$ be distributed to two different processors. $T_1$ first writes 1 to the shared variable $x$ and then reads the value of the shared variable $y$ (read $r_1$). Symmetrically, $T_2$ writes 1 to the shared variable $y$ and then reads the value of $x$ (read $r_2$). Initially, $x = y = 0$. Under the SC and SC-derived models, the operations from $T_1$ and $T_2$ are seen in the same order by both processors. Hence, the case that both read operations $r_1$ and $r_2$ return 0 is prohibited by the SC and SC-derived models. However, this can happen under the on-chip COMA cache coherence protocol: $r_1$ and $r_2$ can simply return the initial values of $y$ and $x$, which are 0.
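The claim about SC in Example 1 can be checked mechanically. The following Python sketch, given only as an illustration, enumerates every interleaving of the two threads that respects program order, executes each one on a single shared memory, and collects the possible results of the two reads. The encoding of the threads as tuples and the function name `sc_outcomes` are assumptions of the sketch.

```python
from itertools import combinations

# Each thread is a list of (kind, location, argument) in program order.
T1 = [("write", "x", 1), ("read", "y", "r1")]
T2 = [("write", "y", 1), ("read", "x", "r2")]


def sc_outcomes(t1, t2):
    """Return the set of (r1, r2) results over all sequentially consistent runs."""
    outcomes = set()
    n = len(t1) + len(t2)
    # Choose which positions of the global order are taken by t1;
    # the remaining positions are filled by t2, both in program order.
    for slots in combinations(range(n), len(t1)):
        it1, it2 = iter(t1), iter(t2)
        mem, regs = {"x": 0, "y": 0}, {}
        for i in range(n):
            kind, loc, arg = next(it1) if i in slots else next(it2)
            if kind == "write":
                mem[loc] = arg
            else:
                regs[arg] = mem[loc]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes


results = sc_outcomes(T1, T2)
```

The pair (0, 0) does not occur among the results, since in any single global order at least one write precedes both reads; this is exactly the outcome that the protocol permits and SC forbids.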
Our model is stronger than the strong LC model
Having proved that our on-chip COMA system complies with the LC model, we show in this section that our system is not strongly location consistent [7]. We consider the following example.
Example 2. Thread $T_0$ creates two separate thread families consisting of $T_1$, $T_2$ and $T_3$, $T_4$ respectively. Threads $T_1$ and $T_2$ perform write operations on location $L$, and $T_3$ and $T_4$ perform read operations on $L$. We assume that threads $T_1$, $T_2$, $T_3$ and $T_4$ are running on different processors. Under the strong location consistency definition, the reads $r_0$ and $r_1$ can return different values, each taken from either of the two nondeterministic writes in $T_1$ and $T_2$. However, in our system, after the synchronization on $T_1$ and $T_2$, only one of the two values written by $T_1$ and $T_2$ will be alive. Thus, $r_0$ and $r_1$ cannot observe different values left by the two write operations. Hence, we conclude that our memory system implementing location consistency is not strongly location consistent.
Concluding remarks
In this paper, the on-chip COMA cache coherence protocol has been formally specified in the ASM framework. We gave a proof of the correctness of the coherence protocol. Furthermore, we showed that our memory system is weaker than the SC and SC-derived models. It complies with location consistency but it is not strongly location consistent. Our work is part of a project investigating microthreading in a collaboration between the Computer Systems Architecture group and Sectie Software Engineering at the University of Amsterdam. Other research on the verification of the on-chip COMA cache coherence protocol uses the state enumeration method [18]. Once the cache behavior is described in the Murphi language [6], the Murphi program can automatically explore and examine all the reachable system states. However, given the number of cache states and request types, the exploration runs the risk of state explosion even when only a relatively small number of cache modules is included in the system.
