Abstract. In this paper we propose a methodology for verifying the sequential consistency of caching algorithms. The scheme combines timestamping and an auxiliary history table to construct a serial execution "matching" any given execution of the algorithm. We believe that this approach is applicable to an interesting class of sequentially consistent algorithms in which the buffering of cache updates allows stale values to be read from cache. We illustrate this methodology by verifying the high level specifications of the lazy caching and ring algorithms.
In shared memory multiprocessor systems a memory consistency model specifies how memory operations will appear to execute to the programmer. The closer the memory consistency model forces the shared memory to behave as a serial memory system, that is, a system in which all operations are performed atomically and directly on memory with no buffering or caching (Figure 1(a)), the easier it is for the programmer to write correct code for the system. However, the stricter the memory model, the more hardware and compiler optimizations are disallowed. Sequential consistency is an intuitive memory model in which "the result of any execution is the same as if the [memory] operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program" [24]. Sequential consistency is a relatively restrictive model when compared with the more relaxed memory models (such as partial or total store ordering, or release consistency) which are supported by some commercially available architectures (e.g. PowerPC, SPARC, Digital Alpha) [1].
Many sequentially consistent models implement coherence, an even stricter consistency model. Whereas an execution is sequentially consistent if all of the processors' local views can be interleaved to form a single serial behavior, regardless of the relative ordering of events at different processors, coherence requires that the events, as ordered globally, be a trace of serial memory [2].
To prove sequential consistency of a proposed memory implementation M it suffices to construct, for every execution of M, a matching serial execution S such that all operations in S read and write the same values as in that execution. However, the creation of such a "witness" serial execution may require that a potentially unbounded number of operations be re-ordered. In fact, the problem of verifying sequential consistency is known to be undecidable [3]. Thus, unlike coherence, which can often be verified quite easily, sequential consistency does not comfortably fit the pattern of standard refinement techniques (trace inclusion, bisimulation, testing preorder). The non-coherent lazy caching algorithm was therefore proposed by Rob Gerth as an example on which different refinement methods can be tried [15], and in 1999 a special edition of Distributed Computing was devoted to this project [13].

Fig. 1. Architecture of (a) a serial memory and (b) the lazy caching algorithm.

Research supported in part by a grant from the German-Israel bi-national GIF foundation and a gift from Intel.
In this paper we present a proof methodology which involves timestamping the cache reads and shared memory updates of an execution and placing them in a history table. Intuitively, every processor $P_i$ has a cache $C_i$ which contains a subset of the values in the shared memory at some time $t_i \le t_G$, where $t_G$ is the global system time. All writes to memory occurring in the interval $(t_i, t_G]$ have not yet been applied to $C_i$. The local time $t_i$ is precisely the time at which the global memory had contents consistent with $C_i$. We timestamp instructions with the local time (and other information, in order to create a total ordering between instructions executing at the same local time) and place them in a history table ordered by timestamp. The history table contains sufficient information for a matching serial execution to be built, and hence for the algorithm to be proved sequentially consistent.
We believe that this methodology is suitable for the verification of the sequential consistency of many non-coherent memory models, as demonstrated by our applying this proof method, using the PVS [27] theorem prover, to two examples, lazy caching [2, 15] and a ring algorithm [6]. While this methodology is theoretically applicable to coherent snoopy protocols, we believe that it is more complicated than is required for such algorithms. Current work considers increasing the automation of deductive proofs, and we hope later to consider the application of the methodology to other classes of caching algorithms.
Fig. 2. Lazy Caching transitions
The paper is structured as follows: In Section 1 we describe the lazy caching algorithm. In Section 2 we explain how timestamping and the history table are used to derive a serial execution. In Section 3 we define the ring algorithm and describe how it fitted into our methodology. Section 4 discusses related work and in Section 5 we summarize our conclusions.
Lazy Caching
The "lazy cache algorithm" [2] is a sequentially consistent protocol in which cache updates can be postponed, and writes are buffered, allowing processors to access stale cache data.
As illustrated in Figure 1(b), the system consists of n processors, $P_1, \ldots, P_n$, with each $P_i$ owning a cache $C_i$, and FIFO in- and out-queues $In_i$ and $Out_i$, respectively. We have further associated with each processor an unbounded instruction list, containing instructions of the form "read a" and "write a, d". Instructions in the instruction list are executed sequentially, with a program counter, $pc_i$, pointing to the next instruction.
A processor $P_i$ initiates a write event $w_i$ by placing a record containing the instruction's address and new value at the tail of $Out_i$. When this record reaches the head of $Out_i$ it can be popped off and the memory write $mw_i$ occurs. That is, the shared memory is updated, and a new record containing the address and value is placed in the in-queue $In_j$ of every processor $P_j$. The copy placed in $In_i$ is starred. When the entry at the head of $In_i$ is popped off, a cache update $cu_i$ occurs, and $C_i$ is updated with the value recorded in the $In_i$ entry.
A read event $r_i$ can be performed if the address a requested is in the cache, $Out_i$ is empty and $In_i$ does not contain any starred entries. The value read is that in the cache. We note that this value may differ from that in the memory if a write to a is buffered in $In_i$. Locations which are not currently in cache can be brought into the cache by placing the memory value in the in-queue in a memory read ($mr_i$) action, and can be summarily evicted by cache invalidation ($ci$). In our interleaving model at any step a processor can either initiate a read or write (if one is enabled), pop an entry off its in- or out-queue if they are nonempty, initiate a cache update, invalidate a cache entry, or idle ($i$). The system is parameterized by the number of processors and there is no restriction on the maximum size of the queues, the address space, or the set of memory values. Our model, summarized in Figure 2, very closely resembles that of Gerth [15]. The reader is referred to this paper, or our PVS source files [4], for more information.

[Fig. 3: (a) an example lazy caching execution fragment on processors P1 to P5; (b) a matching serial ordering of its reads and memory writes.]
An example execution fragment

In Figure 3(a) we consider a very small execution sequence which illustrates the non-coherent nature of the lazy caching algorithm. We assume that address a has initial value 0. Process $P_1$ initiates a write of 6 to a, placing the tuple (a, 6) on its out-queue. Process $P_2$ then initiates a write of 8 to a. Process $P_2$ pops (a, 8) off $Out_2$, in a memory write $mw_2$ action, pushing the (address, data) tuple onto the in-queues of all processors. Sometime thereafter action $mw_1$ also occurs. Process $P_3$ reads the value of 0 for a, updates its cache with 8, and then reads 8 as the value of a, while the write of 6 is buffered. Process $P_4$ updates its cache with both values before reading a as 6; process $P_5$ reads a as 0.
We note that the memory is updated in the opposite order to that in which the writes were initiated, and thus a has the final value of 6. Furthermore, processors $P_3$ and $P_5$ read stale values for a after $P_4$ has read the new value.
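The transitions of this section and the example fragment can be replayed on a small executable model. The following Python sketch is only an illustration and is not the PVS model used in the proof: the class name `LazyCache`, its method names, and the choice that every cache initially holds a copy of the whole memory are assumptions, and memory reads, cache invalidation and idling are omitted. The interleaving in `replay_example` is one interleaving consistent with the description above.

```python
from collections import deque


class LazyCache:
    """Toy model of the lazy caching transitions (all names are illustrative)."""

    def __init__(self, n, addresses, initial=0):
        self.mem = {a: initial for a in addresses}
        # Assumption: every cache starts with a copy of the initial memory.
        self.cache = [dict(self.mem) for _ in range(n)]
        self.out_q = [deque() for _ in range(n)]  # entries: (addr, value)
        self.in_q = [deque() for _ in range(n)]   # entries: (addr, value, starred)

    def write(self, i, addr, value):
        """w_i: buffer the write at the tail of Out_i."""
        self.out_q[i].append((addr, value))

    def memory_write(self, i):
        """mw_i: pop the head of Out_i, update memory, notify every in-queue."""
        addr, value = self.out_q[i].popleft()
        self.mem[addr] = value
        for j, queue in enumerate(self.in_q):
            queue.append((addr, value, j == i))   # the writer's own copy is starred

    def cache_update(self, i):
        """cu_i: pop the head of In_i and apply it to C_i."""
        addr, value, _starred = self.in_q[i].popleft()
        self.cache[i][addr] = value

    def can_read(self, i, addr):
        """r_i is enabled if addr is cached, Out_i is empty, In_i has no starred entry."""
        return (addr in self.cache[i]
                and not self.out_q[i]
                and not any(starred for _, _, starred in self.in_q[i]))

    def read(self, i, addr):
        """r_i: return the (possibly stale) cached value."""
        if not self.can_read(i, addr):
            raise RuntimeError("read not enabled")
        return self.cache[i][addr]


def replay_example():
    """Replay the example fragment; processors P1..P5 are indices 0..4."""
    s = LazyCache(n=5, addresses=["a"])
    P1, P2, P3, P4, P5 = range(5)
    s.write(P1, "a", 6)
    s.write(P2, "a", 8)
    s.memory_write(P2)                      # memory now holds a = 8
    s.memory_write(P1)                      # memory now holds a = 6
    s.cache_update(P4)
    s.cache_update(P4)
    reads = [("P4", s.read(P4, "a"))]       # P4 sees the newest value
    reads.append(("P3", s.read(P3, "a")))   # stale: both updates still buffered
    s.cache_update(P3)
    reads.append(("P3", s.read(P3, "a")))   # the write of 6 is still buffered
    reads.append(("P5", s.read(P5, "a")))   # stale: nothing applied yet
    return reads, s.mem["a"]
```

Calling `replay_example()` returns the values read by $P_4$, $P_3$ and $P_5$ together with the final memory value of a, so the stale reads discussed above can be observed directly.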
Creating a Serial Execution
To prove an algorithm sequentially consistent we show that each of its executions has an equivalent serial execution. In the serial execution all operations are executed directly on memory, in some sequential order, and the operations of each individual processor are in program order, where "read" and "write" instructions correspond to r and mw events. It is shown that reads in the two executions return the same value, and the final memory values are identical.
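For small traces, the conditions on a witness serial execution can be checked by direct replay. The sketch below is an illustration only: the function name `check_serial` and the tuple layout `(processor, program index, kind, address, value)` are assumptions made for this example. For a read, `value` is the value observed in the original execution; for a write, it is the value written.

```python
def check_serial(serial, initial=0):
    """Replay a candidate serial execution on a plain (serial) memory.

    Returns (ok, memory): ok is True when every processor's operations appear
    in program order and every read returns the value it observed originally.
    """
    memory = {}
    last_index = {}
    for proc, index, kind, addr, value in serial:
        # Program order: indices of one processor must strictly increase.
        if index <= last_index.get(proc, -1):
            return False, memory
        last_index[proc] = index
        if kind == "w":
            memory[addr] = value
        elif memory.get(addr, initial) != value:
            return False, memory              # a read saw a different value
    return True, memory


# Serial ordering for the example fragment: r3(0), r5(0), mw2(8), r3(8), mw1(6), r4(6).
witness = [
    ("P3", 0, "r", "a", 0),
    ("P5", 0, "r", "a", 0),
    ("P2", 0, "w", "a", 8),
    ("P3", 1, "r", "a", 8),
    ("P1", 0, "w", "a", 6),
    ("P4", 0, "r", "a", 6),
]

# Ordering the writes as they were initiated (6 before 8) is not a witness.
swapped = [
    ("P3", 0, "r", "a", 0),
    ("P5", 0, "r", "a", 0),
    ("P1", 0, "w", "a", 6),
    ("P2", 0, "w", "a", 8),
    ("P3", 1, "r", "a", 8),
    ("P4", 0, "r", "a", 6),
]
```

Here `check_serial(witness)` accepts the ordering and ends with a = 6, while `check_serial(swapped)` rejects it because the read of 6 by P4 would follow the write of 8.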
Logical time
Each processor has a view of memory which is consistent with the values memory had at some time in the past: It sees the memory as it was before it was modified by the last x writes, these being the writes which are buffered in the in-queue.
The global time $t_G$ is determined by an auxiliary global clock, and is initially zero. Every time a memory write occurs the global time is incremented by one. Each processor has an auxiliary local clock which counts the number of writes which have been applied to its cache. This clock gives its local time. It is updated each time the processor performs a cache update which was initiated by a memory write. These cache updates are termed countable. (In order to distinguish countable cache updates from those initiated by memory reads, we add an auxiliary processor id field to in-queue records. An entry is the result of a memory read exactly if the processor id in the record is that of the processor and the record is not starred.) The processor has a view of memory consistent with the values that memory held when the global time was the current local time of the processor.
Every read (r) or memory write (mw) event in the system is given a unique timestamp when it occurs. The timestamp is a tuple $(t, r, id)$, where t is the local time at which the event occurs, r is the number of reads which this processor has performed since the last countable cache update, and id is the identifier of the processor that initiated the read/write. Time 0 is the time given to all reads of the initial, unmodified memory. For every $t_i > 0$ the "smallest" timestamp with time $t_i$ will always be a memory write (mw), as the reads field of a timestamp is zero exactly when it represents a memory write operation. Since the local clocks are incremented every time that a countable cache update is performed, there is only one memory write at time $t_i$ and all other operations timestamped with $t = t_i$ are reads. As they are all reads from the same memory, with no intervening writes, they will return the same value irrespective of the ordering between them. However, it is desirable that the program order of each processor be maintained, and this is done by the reads field of the timestamp. The id field of the timestamp is used to order operations at the same local time by different processors. The relative ordering of these operations is unimportant, and ours is one of a number of possibilities.
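The clocks and the lexicographic order on timestamps can be sketched as follows. This is an illustration under stated assumptions rather than the PVS definitions: the names `Timestamp`, `Clocks` and `example_timestamps` are invented here, and a memory write is assumed to be stamped with the global time after the increment, which is what makes it the smallest timestamp at that time.

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    """(t, r, id); tuples compare lexicographically, giving the total order."""
    t: int    # logical time of the event
    r: int    # reads since the last countable cache update (0 for a memory write)
    id: int   # identifier of the initiating processor


class Clocks:
    """Auxiliary global clock plus one local clock and read counter per processor."""

    def __init__(self, processor_ids):
        self.global_time = 0
        self.local_time = {p: 0 for p in processor_ids}
        self.reads = {p: 0 for p in processor_ids}

    def memory_write(self, pid):
        # Assumption: the write is stamped with the incremented global time.
        self.global_time += 1
        return Timestamp(self.global_time, 0, pid)

    def countable_cache_update(self, pid):
        self.local_time[pid] += 1
        self.reads[pid] = 0

    def read(self, pid):
        self.reads[pid] += 1
        return Timestamp(self.local_time[pid], self.reads[pid], pid)


def example_timestamps():
    """Timestamps for the example fragment (processors numbered 1 to 5)."""
    c = Clocks(range(1, 6))
    stamps = {}
    stamps["mw2(8)"] = c.memory_write(2)
    stamps["mw1(6)"] = c.memory_write(1)
    c.countable_cache_update(4)
    c.countable_cache_update(4)
    stamps["r4(6)"] = c.read(4)
    stamps["r3(0)"] = c.read(3)
    c.countable_cache_update(3)
    stamps["r3(8)"] = c.read(3)
    stamps["r5(0)"] = c.read(5)
    return stamps
```

Sorting the events with `sorted(stamps, key=stamps.get)` places the two reads of the initial value first, followed by each memory write and the reads that observe it.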
[Fig. 4(a): construction of the history table for the example, with columns Instruction, Action, Timestamp (t, r, id), the local state (t, a, r) of each of P1 to P5, and Global Memory.]

These counters and timestamps are variants of Lamport clocks [23]. However, in our system each processor updates its clock independently, without reading the timestamps on incoming messages.
Extracting a serial execution from the history table
The history table is an ordered list of entries sorted in non-decreasing order of timestamp. Since memory writes always have a greater timestamp than any other elements in the table at the time they occur, they are appended to its end. Reads, however, may be inserted in the middle of the history table. The function size(H) returns the number of entries in H. For every $x \le \mathrm{size}(H)$, $H[x]$ refers to the x'th entry of H.
In Figure 4 (a) we revisit the example of Section 1, showing how the history table would be constructed. For each processor the table records its local time t, the value it stores for a, and r, the number of reads it has performed since the last countable cache update. The timestamp column indicates the timestamp of the entry which is added to the history table at the step in which it is added. Time progresses from top to bottom in the table.
A serial execution can be derived from the history table such that the i'th entry in the history table corresponds to the i'th operation in the serial execution.
It is proved that in this serial execution every processor issues its instructions in the same order as in the original execution, all reads return the same values as in the lazy caching execution, and the final memory values are the same as in the original execution.
In Figure 4(b) we present the history table built in the example of Figure 4(a), with entries ordered by timestamp. The table illustrates all the fields in the history table. Figure 4(c) illustrates the serial execution which is derived.
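The way entries are placed in the history table, and the way the serial execution is read off it, can be sketched with a sorted list. This is a minimal illustration, not the PVS data structure: the class `HistoryTable`, its methods, the string event labels and the 1-based indexing of `entry` are assumptions, and the insertion order in `build_example` is one interleaving consistent with the example.

```python
import bisect


class HistoryTable:
    """Entries kept in non-decreasing timestamp order (illustrative sketch)."""

    def __init__(self):
        self._stamps = []   # timestamps (t, r, id), kept sorted
        self._events = []   # event labels, parallel to _stamps

    def insert(self, stamp, event):
        """Insert an event at its timestamp position; return the 1-based position."""
        pos = bisect.bisect_left(self._stamps, stamp)
        self._stamps.insert(pos, stamp)
        self._events.insert(pos, event)
        return pos + 1

    def size(self):
        return len(self._stamps)

    def entry(self, x):
        """H[x] for 1 <= x <= size(H)."""
        return self._stamps[x - 1], self._events[x - 1]

    def serial_execution(self):
        """The i'th entry of the table is the i'th serial operation."""
        return list(self._events)


def build_example():
    """Insert the example's events in the order in which they occur."""
    h = HistoryTable()
    positions = [
        h.insert((1, 0, 2), "mw2(8)"),   # memory writes are appended at the end
        h.insert((2, 0, 1), "mw1(6)"),
        h.insert((2, 1, 4), "r4(6)"),
        h.insert((0, 1, 3), "r3(0)"),    # stale reads land before the writes
        h.insert((1, 1, 3), "r3(8)"),
        h.insert((0, 1, 5), "r5(0)"),
    ]
    return h, positions
```

In this sketch each memory write is inserted at position size(H), while the stale reads of $P_3$ and $P_5$ are inserted ahead of writes that are already in the table.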
The proof
The auxiliary history list H, the memHist array and the readValues arrays are intrinsic to the presented proof. Each processor has a readValues array which maps instruction indices to values. Every time a read operation occurs the value read is stored in the relevant entry of the readValues array. This array is later used to ensure that the lazy caching and serial executions return identical values for every read. The memHist array is a history of memory contents, where memHist[t] is a copy of the shared memory at global time t. In addition, memHist also stores for every time t the processor id and program counter for the instruction that updated memory from memHist[t - 1] to memHist[t]. We also found it useful to add auxiliary fields to the in- and out-queue entries: in addition to the address, value and "*" fields, we added auxiliary fields recording the processor id and program counter of the related instruction, and the global time at which the related event occurs. We note that this time field is not used to update the processors' local clocks, or any other variables. Some of the data structures are detailed in Figure 5.
In order to construct the serial execution we prove a one to one relationship between executed operations and history table entries. The bulk of the proof effort involved manually defining properties of the lazy caching algorithm and then proving their invariance in the PVS [27] system. We list some of the invariants used in the proof.

- For all $0 < r < r_x$, there is an entry $H[z]$, $z < x$, timestamped $(t_x, r, id_x)$ in $H$.

4. [...] (That is, every read in the two systems returns the same value.)
5. The program counter of processor $P_i$ at the end of the sequential execution, $S[\mathrm{size}(H)].pc_i$, is equal to $L.pc_i$ if $L.Out_i$ is empty, and to the (auxiliary) program counter field in the top $L.Out_i$ entry otherwise.

We prove inductively that for every reachable lazy caching state L there is an S such that the predicate holds for (L, S): We first prove that the predicate holds for the initial states of the two systems, and then that if it holds for (L, S), then for any L' such that (L, L') is a lazy caching transition, we can build an S' such that it holds for (L', S'). From parts (1) and (2) of the predicate, S records a legal serial execution. Given that $L.memHist[t_G]$ is proved to equal $L.Mem$, the current lazy caching memory, from (3) we can deduce that the memory values in the two systems agree. From (4) we prove that both systems return the same value for every read.
We complete the proof by showing that the lazy caching system can always progress meaningfully.
The Ring Algorithm
In order to test the applicability of our methodology we applied it also to a model based on Collier's ring algorithm [6]:
Processors $P_0, \ldots, P_{n-1}$ are connected in a ring, with $P_i$ sending messages only to its successor, $P_{(i+1) \bmod n}$. The channels between every two successive processors are FIFO queues of messages. Processor $P_0$ is designated the supervisor.
If processor $P_i$, $i \ne 0$, wants to perform a write of value v to address a, it sends to its successor a WriteRequest(a, v) message and enters a waiting state. This write request is passed around the ring until it reaches the supervisor. The supervisor updates memory with this address and value, and then sends a WriteReturn(a, v) message. On receiving a WriteReturn message all processors update their caches, and then pass it on to their successor. Process $P_i$ also releases itself from its waiting state and can proceed. When the write return reaches the supervisor, it is removed from the system.
A processor can execute a read instruction if the address is in its cache. Otherwise it sends a ReadRequest, which the supervisor answers with a ReadReturn. After thus bringing the address into the cache, the read can be executed.

Fig. 6. An example configuration in the ring algorithm
The supervisor accesses memory directly (its local cache is the "shared memory") and never issues ReadRequest or WriteRequest messages. On performing a write it sends a WriteReturn message so that all other caches can be updated.
This model fits neatly into our framework. As in the lazy caching example, cache reads and updates to the shared memory are entered into the history table when they occur. (In this algorithm the memory update occurs when a WriteReturn is initiated by the supervisor.) The supervisor increments its local clock when it sends a WriteReturn, and all other processors increment their local clocks on receiving the WriteReturn. The local time of the supervisor is the global system time. The local time $t_i$ of $P_i$ is the global time minus the number of WriteReturn messages on channels between $P_0$ and $P_i$. An example configuration is given in Figure 6.
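The relation between the local clocks and the messages in flight can be illustrated with a small simulation. This sketch is not the verified model: the names `Ring`, `local_time` and `invariant_holds` are invented for this example, only WriteReturn messages are modelled (requests and read traffic are omitted), and `channels[k]` is assumed to be the FIFO queue from $P_k$ to its successor.

```python
from collections import deque


def local_time(global_time, channels, i):
    """t_i = global time minus the WriteReturns on channels between P_0 and P_i."""
    in_flight = sum(1 for k in range(i)
                    for msg in channels[k] if msg[0] == "WriteReturn")
    return global_time - in_flight


class Ring:
    """Ring of n processors; P_0 is the supervisor (illustrative sketch)."""

    def __init__(self, n):
        self.n = n
        self.channels = [deque() for _ in range(n)]  # channels[k]: P_k -> P_(k+1) mod n
        self.clock = [0] * n                         # clock[0] is the global time

    def supervisor_write(self, addr, value):
        """The supervisor updates memory and initiates a WriteReturn."""
        self.clock[0] += 1
        self.channels[0].append(("WriteReturn", addr, value))

    def deliver(self, k):
        """Deliver the head of channels[k] to its receiver."""
        msg = self.channels[k].popleft()
        receiver = (k + 1) % self.n
        if receiver != 0:
            self.clock[receiver] += 1            # receiver applies the write
            self.channels[receiver].append(msg)  # and passes it on
        # At the supervisor the WriteReturn is removed from the system.

    def invariant_holds(self):
        """Every local clock agrees with the formula for t_i."""
        return all(self.clock[i] == local_time(self.clock[0], self.channels, i)
                   for i in range(self.n))
```

After any sequence of `supervisor_write` and `deliver` steps in this sketch, `invariant_holds()` compares each processor's counter with the value given by the formula for $t_i$.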
Related Works
Various methodologies, ranging from CSP [5, 9] to abstraction [16] and model checking [19], have been used to verify lazy caching. The primary difficulty in verifying lazy caching seems to be that at the time that memory is updated by a write in the lazy caching system, it is not known how many reads of the stale value will still occur. That is, nondeterministic choices in the abstract (serial) system occur earlier than in the concrete (lazy caching) system. One solution is to input the computation of the concrete system into a transducer, which queues segments of the concrete computation until they can be matched with an abstract execution [21]. Similarly, [19] proposes a finite state observer that observes and re-orders the memory operations, while [22] uses an auxiliary queue to record writes which have updated memory but have not yet updated the cache.
Step-wise refinement, in which the lazy caching system is transformed in a number of steps to a serial system, is used in [5] and [22]. Composition [20] and abstraction [16] are two other methodologies proposed, while in [9] decomposition is coupled with the use of CSP to prove trace inclusion.
The paper introducing lazy caching [2] presents a semantic proof that it is sequentially consistent. A WriteCounter is used to assign sequence numbers to updates of the shared memory. Reads are assigned numbers according to the last write which the processor has popped off its in-queue. An auxiliary Hist variable is used, with semantics similar to that of our memHist variable.
Of the above mentioned verification efforts only [19] has been mechanized at all. The model-checking verification in [19] is of a restricted system in which there is no out-queue and the in-queue is of size at most one. Given the problems of state explosion, it is unclear how a more detailed system could be verified. It is claimed that the type of abstractions that are used in [16] could be computed algorithmically, thus partially mechanizing this proof.
Timestamping, using variants of logical Lamport clocks [23], has been used to verify various memory consistency models [7, 8]. The algorithms are verified at a lower level than we have considered, including message passing protocols. Timestamping is used to divide logical time into coherence epochs, intervals of logical time in which a node has read-only or read-write access to a block of data. Thus, it is possible for one epoch to contain multiple, or no, stores. Furthermore the same write can be given different timestamps when it is used to update different caches. In contrast, in our timestamping each memory update is identified with an epoch and has a unique timestamp. This underscores a difference in our approaches to memory consistency: whether block control or memory contents are the primary concern. The difference in emphasis is appropriate given the different levels (high level versus message passing) at which verification occurs, and the different algorithms considered. The proofs presented are entirely manual.
Theorem proving has been used by Park and Dill [11, 12] and Stoy et al. [28] to verify cache coherence protocols at the message passing level. Park and Dill aggregate the steps of each transaction in the implementation into a single atomic transition in the specification. A commit point is identified for each transition, and the aggregation function intuitively is a function completing committed instructions. This methodology has been used to effectively verify a detailed model of the complex FLASH protocol. However, it is unclear how it could be used in our examples, where instructions may commit out of order (a read instruction may return an older value than a previous read, by another processor, for the same address). In [28] a PVS [27] implementation of Lamport's TLA [25] is used. Queues are drained to empty them of messages, and an abstraction function is used to show refinement between two protocols.
A lot of research has been done on using model checking to verify cache coherence protocols. However, due to the difficulties of verifying large systems, many of these methodologies are restricted. For example, the "test model-checking" of [17] is incomplete, and the work by Delzanno, Pong and Dubois [10, 14] based on FSMs is only appropriate to coherent algorithms. Lazic [26] shows that data independence theorems can be used to make model checking of cache protocols more tractable.
Our construction of a serial execution is reminiscent of work by Glusman and Katz [18]. They allow independent operations to be re-ordered to create a convenient computation. Our "convenient" serial execution is not only a reordering of the events, but also a change in the nature of the occurring events.
There are more points of similarity between our work and those mentioned above. The auxiliary variables in [22, 19] perform some of the functions of our history table. While timestamping has been used previously in verifying cache consistency protocols [8], the similarities between this work and ours are in the terminology more than the semantics. Our timestamping is closer in meaning to the WriteCounter variable in [2]. Their Hist variable is also similar to our memHist variable. However, the proof in [2] is "on a semantical level and not grounded in a refinement methodology" [15]. By creating a full timestamping scheme, and using a history table, we have developed a formal verification framework which allows mechanical verification, and can easily be applied to different verification problems.
The centrality of the history table, and the method in which it is coupled with timestamping, is new, and provides a relatively simple proof which is amenable to mechanical verification. We believe that mechanical verification provides a higher degree of confidence than pen and paper proofs, and testifies to a relatively simple and natural methodology.
Conclusion
In this paper we present a refinement methodology for the verification of sequential consistency. Given that the general problem is known to be undecidable, our proof method cannot be complete. However, we believe that there is a class of "difficult", non-coherent algorithms to which this methodology is suited, as illustrated by the successful verification of the lazy caching and ring algorithms.
We take cache reads and shared memory updates to be the important events to be recorded, and show that a correct ordering of these events allows the construction of a matching serial execution. While the idea of using timestamps (or, more generally, Lamport clocks) to order events is far from new, the timestamping that we have devised is particularly well suited to sequential consistency. It allows us to give a relative order (timestamp) to an "important event", when it occurs, relative to all past and possible future such events in the system. The history table provides a means of dynamically ordering these events, so that a serial execution can be extracted.
The methodology is sound: when it is applied, a corresponding serial execution can be built. Since all steps are mechanically verified in the PVS theorem prover, this gives a very solid proof of sequential consistency.
The major drawback of this methodology is the large amount of human effort required (several person-weeks), devoted primarily to deriving the invariant properties and directing the theorem prover. We are currently researching techniques to increase the automation of the proofs, and hope later to consider the extension of our methodology to other classes of algorithms.
