Abstract. This paper describes an ongoing effort to verify the cache coherence protocol of the IEEE/ANSI Standard for Scalable Coherent Interface using the Murφ verification system. A model of the typical set protocol was constructed in the Murφ description language. This model was augmented with a specification of properties necessary for cache coherence. The Murφ verification system automatically checks whether all reachable states in the model satisfy the given specification. Although verification is still under way, we have already found several errors in the C-code defining the protocol. Finally, we discuss the experience gained in the verification project.
Introduction
The IEEE/ANSI Standard for Scalable Coherent Interface (SCI) includes a cache coherence protocol for distributed shared-memory multiprocessors. Designing a complex protocol, like this cache coherence protocol, is a challenging and difficult task. It is very hard for a designer to predict all possible interactions among the distributed system components. One way a computer can support the designer is by simulating random executions of the system. However, especially in complex systems, there is a high probability that simulation misses the executions that contain errors. In contrast, an automatic verifier tries to examine all states reachable from a given start state. The biggest obstacle in this exhaustive approach is the often unmanageably large number of reachable states, the "state explosion problem".
We are currently using the Murφ verification system developed at Stanford to find errors in the SCI cache coherence protocol. In prior work, the Murφ system was successfully applied to several industrial protocols [2, 3, 9, 14]. For verifying the SCI cache coherence protocol, the typical set protocol was modeled with the Murφ description language. This model was augmented with a specification of necessary conditions for cache coherence. Together, the model and the specification form the input file for the verifier and consist of 3700 lines of Murφ code (not counting comments). The Murφ verification system automatically checks by explicit state enumeration whether all reachable states in the model satisfy the given specification (model checking). Although verification is still under way, we have already found several errors in the C-code defining the protocol. Several of these errors affect both the typical set protocol and the full set protocol.
* A preliminary version of this paper was presented at the 2nd International Workshop on SCI-based High-Performance Low-Cost Computing, Santa Clara University, 1995.
** Ulrich Stern was supported during this research by a scholarship from the German Academic Exchange Service (DAAD-Doktorandenstipendium HSP-II).
To alleviate the state explosion problem, the model was made scalable. By simply changing constant declarations, one can change the size of the model and, with it, the number of reachable states. Since verification is usually possible only for "down-scaled" models, one cannot guarantee design correctness. Thus, we consider formal verification only as a debugging tool. However, since verification is exhaustive for the down-scaled model of the system, it is likely to catch errors that are missed during simulation.
We tried to make the Murφ description of the model similar to the SCI C-code to avoid introducing additional errors in the translation process. At the same time, however, we had to abstract away from low-level details of the protocol that are not important for the verification. The resulting description should be easy to understand for someone familiar with the C-code. Furthermore, we tried to make it easy to add new features of the SCI cache coherence protocol to the current Murφ description.
The C-code includes a multi-threaded execution environment, which leaves many state variables of the SCI protocol implicit. These variables now occur explicitly in our Murφ description, which should help in implementations of the protocol. The Murφ system contains a simulator as well, so executions can be run without constructing a multi-threaded execution environment.
The Scalable Coherent Interface is specified in the IEEE Standard 1596-1992 [7]. An easy-to-read introduction to the SCI cache coherence protocol can be found in [11]. An overview of the SCI and related standards projects is given in [6]. Previous work on formally verifying the SCI cache coherence protocol was done by Gjessing et al. [4, 5]. However, they did not use automatic methods and did not report finding any errors.
The paper is organized as follows. Sections 2 and 3 present an overview of the Murφ verification system and the SCI cache coherence protocol, respectively. The modeling of the SCI cache coherence protocol is described in Sect. 4, while the specification of cache coherence properties can be found in Sect. 5. In Sect. 6, we report on some of the errors we found in the protocol and how they were fixed. The experience gained during the course of the verification project is discussed in Sect. 7. Finally, Sect. 8 contains some suggestions for future work.
The Murφ Verification System
The Murφ language is a simple high-level language for describing nondeterministic finite-state machines. Many features of the language are familiar from conventional programming languages. Its unique features, not found in a "typical" high-level language, can be described as follows:
- The parallel composition of two processes in Murφ is done by simply using the union of the rules of the two processes as rules for the composition. Each process can take any number of steps (actions) between the steps of the other. The resulting computational model is that of asynchronous, interleaving concurrency. Parallel processes communicate via shared variables. There are no special language constructs for communication.
- The Murφ language supports scalable models. In a scalable model, one can change the size of the model by simply changing constant declarations. This down-scaling capability is important to reduce the number of reachable states and thus make verification feasible. In many cases, the errors in a system are also found in the down-scaled system. For example, in our SCI model the number of processors is scalable and defined by a constant.
- The Murφ verifier supports automatic symmetry reduction of models by special language constructs [8, 9]. For example, if we have two processors, the state where processor one is the head and processor two is the tail of a sharing list is, for verification purposes, the same as the state where processor one is the tail and processor two is the head.
- There are several ways the Murφ verifier detects design errors. First, the description is checked for deadlocks. Second, there is an assert statement, which causes the verifier to print an error trace if the assertion condition is violated. An error trace is a sequence of states from the start state to a state exhibiting the problem. Besides the assert statement, Murφ has an error statement that always prints an error trace. In the SCI model, for example, the error statement was used in the default case of switch statements to check for illegal cache line states. Finally, one may specify invariants (Boolean conditions) that have to be true in every reachable state. For example, invariants were added to the SCI model to specify cache coherence properties.
With the methods for detecting design errors described above, one is not able to specify fairness properties. Thus, livelocks cannot be detected and forward progress cannot be guaranteed with Murφ. This limitation will be lifted in the future. Note that the system whose state graph is shown in Fig. 1 has one deadlock (s4) and one livelock (s6, s8), assuming that the system should always return to the start state (s0).
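The checks described above (deadlock detection, invariants and error traces) can be illustrated with a minimal sketch of explicit state enumeration. This is not Murφ code and not the verifier's actual implementation: the names `check`, `trace_to` and `TOY_RULES` are hypothetical, rules are modeled as (name, guard, action) triples, and a deadlock is simplified to "no rule is enabled".

```python
from collections import deque

def trace_to(state, parent):
    """Rebuild the path from the start state to `state` via parent links."""
    trace = []
    while state is not None:
        trace.append(state)
        state = parent[state]
    return trace[::-1]

def check(start, rules, invariants):
    """Breadth-first search over all reachable states.

    rules:      list of (name, guard, action); action returns the successor state.
    invariants: list of (name, predicate) that must hold in every reachable state.
    Returns (verdict, error_trace); the trace is empty when no error is found.
    """
    parent = {start: None}          # doubles as the set of visited states
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for name, holds in invariants:
            if not holds(state):
                return ("invariant violated: " + name, trace_to(state, parent))
        successors = [action(state) for _, guard, action in rules if guard(state)]
        if not successors:          # simplified deadlock: no enabled rule
            return ("deadlock", trace_to(state, parent))
        for nxt in successors:
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return ("ok", [])

# Hypothetical toy model: a counter that steps 0 -> 1 -> 2 -> 3 and then resets.
TOY_RULES = [
    ("step", lambda n: n < 3, lambda n: n + 1),
    ("reset", lambda n: n == 3, lambda n: 0),
]
```

Because the search is breadth-first, the first error state reached is at minimal distance from the start state, so the returned trace is as short as possible. Dropping the `reset` rule from the toy model turns state 3 into a deadlock.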
Overview of the Protocol
Shared-memory multiprocessors are commonly deemed to be easier to program than distributed multiprocessors, where the communication takes place via message passing. However, the latter are easier to implement in hardware. A solution to this problem is a distributed shared-memory multiprocessor, which provides shared memory at the software level, while the actual hardware implementation is a distributed message passing system. The IEEE Standard for Scalable Coherent Interface (SCI) includes a protocol for maintaining cache coherence among the distributed components in such a distributed shared-memory multiprocessor.
An SCI node may contain a processor, consisting of (multiple) execution units and a cache, and may contain a memory. The SCI nodes communicate via transactions, each consisting of a request packet and a response packet. In this simplified description, echo packets are not taken into account. A distributed shared-memory multiprocessor can be assembled out of these nodes.
The SCI Standard consists of both an English language description and an accurate definition in the C programming language. This C-code was also used for debugging the protocol when it was developed. Therefore, the C-code contains a multi-threaded execution environment for running simulations of the protocol. The SCI Standard contains many options that can each be enabled or disabled in actual hardware implementations. Thus, the protocol can be tailored to meet the needs of a specific implementation. Furthermore, two subsets of the (full set) cache coherence protocol, the minimal set and the typical set protocol, are defined for reducing the complexity of early implementations.
In a cache coherent SCI system, where snooping is not possible, a list of all caches that have a copy of a memory line has to be maintained for each memory line. In an SCI system, this "sharing list" is distributed among the system components. This is illustrated in Fig. 2. The left-hand side of this figure shows a sharing list of cache lines in processors B and C and the corresponding memory line. The pointers for the sharing lists are stored in additional bits (tags) in each memory and cache line. The current states of the memory and cache lines are also stored in these tags.
We now give an example of a typical execution sequence in the cache coherence protocol. If processor A on the left-hand side of Fig. 2 is executing a Load instruction and wants to read data from the memory line that is shared by processors B and C, it first issues an mread64 request packet to the memory and is notified in the response packet that processor B has the data. Assume the data in processor B's cache line is modified. Then, processor A sends a cread64 request packet to processor B's cache, obtains the data in the response packet and becomes the new head (owner) of the sharing list, as shown on the right-hand side of Fig. 2.
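The Load example above can be sketched as operations on a distributed doubly-linked list. The sketch below is a strong simplification under stated assumptions: the classes `CacheLine` and `MemoryLine`, the field names `forw`, `back` and `head`, and the function bodies are hypothetical; cache and memory tag states are omitted; and the requester always fetches the data from the old head, as in the example where the old head's copy is modified. Only the packet names mread64 and cread64 come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class CacheLine:
    data: Optional[int] = None
    forw: Optional[str] = None   # next cache toward the tail
    back: Optional[str] = None   # previous cache toward the head (None for the head)

@dataclass
class MemoryLine:
    data: int
    head: Optional[str] = None   # first cache in the sharing list

def mread64(mem: MemoryLine, requester: str):
    """Memory side: report the old head and record the requester as new head."""
    old_head = mem.head
    mem.head = requester
    return old_head, mem.data

def cread64(caches: Dict[str, CacheLine], old_head: str, requester: str) -> int:
    """Cache side: the old head returns its data and points back to the requester."""
    line = caches[old_head]
    line.back = requester
    return line.data

def load(mem: MemoryLine, caches: Dict[str, CacheLine], requester: str) -> None:
    """Requester side: two transactions, after which the requester is the head."""
    old_head, mem_data = mread64(mem, requester)
    line = caches[requester]
    if old_head is None:
        line.data = mem_data                      # no sharers: data comes from memory
    else:
        line.data = cread64(caches, old_head, requester)
        line.forw = old_head
    line.back = None

def sharing_list(mem: MemoryLine, caches: Dict[str, CacheLine]) -> List[str]:
    """Walk the forward pointers from the memory's head pointer to the tail."""
    order, node = [], mem.head
    while node is not None:
        order.append(node)
        node = caches[node].forw
    return order
```

Starting from a list B, C with memory pointing to B, a call `load(mem, caches, "A")` yields the list A, B, C. In the real protocol the two transactions are not executed atomically, which is exactly why interleavings with other processors have to be explored.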
In the typical set protocol, five instructions are defined by which a processor may access the shared memory. In addition to executing a Load or Store instruction, a processor may Delete itself from a sharing list, Flush (i.e. purge) the whole sharing list or Lock the memory line. According to the standard, these instructions are executed in four phases, namely allocate, setup, execute and cleanup.
The three distinct behaviors of processors, caches and memories are defined separately from each other in the C-code. According to this definition, the routines implementing cache and memory behavior are executed atomically. However, a routine modeling the execution of an instruction by a processor may be executed non-atomically. For example, after processor A in the above example has sent its mread64 request packet to the memory, processor B may start a Delete instruction, processor C may continue its Lock instruction in progress, etc.
The Modeling of the Protocol
The model of the SCI cache coherence protocol was constructed in three steps. These steps are described in the following subsections.
Abstraction
The goal of the abstraction or modeling was to extract the details of the SCI Standard that are important for the cache coherence protocol. Equivalently, this means that unnecessary details of the standard were omitted. Figure 3 shows an abstraction (model) of the SCI configuration. Details of the internal structures of the SCI nodes (processors and memories) are omitted. The transfer cloud connecting the system components is reliable. However, the order of packets is not preserved. Echo packets were not modeled, so a transaction consists of a request packet and a response packet. Only those fields of request and response packets that are actually used in the cache coherence part of the SCI Standard were modeled. In fact, chapter 4 of the SCI Standard [7] uses a similar abstraction to describe the cache coherence protocol.
Simplification
The simplifications were needed to make the model construction possible in a "finite" amount of time. The most significant of these simplifications was not modeling the full set cache coherence protocol, but restricting ourselves to the typical set protocol. In addition, only three of the processor/cache options were implemented in our model, namely DIRTY, FRESH and MODS. For the memory, the option MOP_FRESH was selected. The coherent instructions Load (with fetch options CO_FETCH, CO_LOAD and CO_STORE), Store, Delete and Flush were implemented.
Another simplification was not to model DMA reads and writes, which are "allowed" in the typical set protocol. Furthermore, strong ordering constraints were assumed, so pipelining during the cleanup phase of an instruction was disabled. Finally, only one execution unit is attached to each processor.
Implementation
As mentioned before, scalability is crucial for successful verification. In implementing the model, we kept the following parameters scalable: the number of processors, the number of lines in each cache, the number of memories, the number of lines in each memory and the number of different data values. Besides that, each SCI processor/cache option, each instruction and each fetch option can be enabled or disabled by simply changing a constant declaration.
The model can be explained by three different types of behaviors or processes, namely memory, cache and processor. There can be many individual processes of each of these three types. For example, there is one individual process of type memory for each memory in the system. The model consists of all the resulting processes running (asynchronously) in parallel.
- Each memory has a simple request/response behavior, i.e. if there is a request packet for the memory in the transfer cloud, the memory reacts by sending a response packet. This is done atomically. However, before the memory responds to a request in the transfer cloud, any other process in the model may be active. The implementation in Murφ uses one rule, whose condition is true iff there is a request for the particular memory in the transfer cloud. The action of the rule deletes the request from the transfer cloud, updates the data field and tags of the accessed memory line and sends out a response on the transfer cloud.
- Each cache has a simple request/response behavior, similar to that of the memory.
- The processor behavior in our model is more complicated. A processor arbitrarily chooses a coherent instruction (Load, Store, Flush or Delete) to execute next when the preceding instruction has completed. If the new instruction is, for example, a Load, the processor also chooses an arbitrary source address, an arbitrary cache line for cache misses and an arbitrary fetch option. Thus, we verified the cache coherence protocol while an arbitrary program is running in each processor. When a coherent instruction is in progress, the processor may send out a request on the transfer cloud to a cache or memory several times, each time waiting for the corresponding response. During this waiting time, any other system component may be active. Consequently, almost all SCI C-code routines describing the processor behavior may be executed non-atomically. For example, Table 1 shows a hierarchy of routines that may be called in a Flush instruction (main routine TypicalExecuteFlush()). The last routine called (CommonTransaction()) sends out the request packet and waits for the response packet. Thus, it is interruptible. Consequently, all routines shown in the table are interruptible, since each of them calls a subroutine that may be interrupted. To implement routines that can be re-entered, the corresponding Murφ routines had to store their current state in special global variables. Corresponding state variables also have to occur in a hardware implementation of the protocol. Since they are explicit in our Murφ model, this model could be useful for hardware designers as well.
Finally, the implementation of the transfer cloud is described. We assumed that pipelining is disabled. Then, each processor can only have one outstanding request and (later) one non-processed response. Thus, in our implementation each processor has, as part of its state variables, a record for the "outgoing" request packet and another one for the "incoming" response packet. Each memory, for example, scans the request packet records of all processors to see whether there is a request addressed to it in the transfer cloud.
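The following sketch illustrates two ideas from the description above: an interruptible processor routine whose progress is stored in an explicit state variable, and a transfer cloud represented by one request record and one response record per processor. All names (`Packet`, `Processor`, `Memory`, `phase`, `processor_step`, `memory_step`) are hypothetical, the instruction is reduced to a single transaction, and tags and cache behavior are left out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Packet:
    target: str
    payload: int

@dataclass
class Processor:
    phase: str = "idle"                 # explicit state of the instruction in progress
    request: Optional[Packet] = None    # record for the outgoing request packet
    response: Optional[Packet] = None   # record for the incoming response packet
    value: Optional[int] = None         # data obtained by the last completed read

@dataclass
class Memory:
    name: str
    data: int

def processor_step(p: Processor, target: str) -> bool:
    """One atomic step of a read; returns True if a step was taken."""
    if p.phase == "idle":
        p.request = Packet(target, 0)   # send the request, then suspend
        p.phase = "waiting"
        return True
    if p.phase == "waiting" and p.response is not None:
        p.value = p.response.payload    # resume where the routine left off
        p.response = None
        p.phase = "idle"
        return True
    return False                        # still waiting: other processes may run

def memory_step(mem: Memory, processors: List[Processor]) -> bool:
    """Atomic memory rule: serve one request addressed to this memory, if any."""
    for p in processors:
        if p.request is not None and p.request.target == mem.name:
            p.request = None                            # delete the request
            p.response = Packet("processor", mem.data)  # send the response
            return True
    return False
```

Between the two processor steps, any number of steps of other processes may be interleaved; the `phase` field plays the role of the special global variables that make the routine re-entrant.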
Specifying Cache Coherence
The cache coherence property was specified in our Murφ model in two different ways:
- First, the SCI C-code includes many assert statements to catch errors while running simulations with the built-in execution environment. Furthermore, the C-code contains several statements for the detection of memory-tag and cache-tag inconsistencies. We tried to include as many of these self-checks in our Murφ model as possible.
- Second, we added invariants to the Murφ model to specify cache coherence more accurately. These invariants impose local conditions on the elements of sharing lists. We give two examples to clarify this:
  - If a cache line is in an unmodified stable state (see footnote 5), e.g. CS_ONLY_CLEAN, the data value in the cache line must be the same as the one in the corresponding memory line.
  - If a cache line is the head of a stable sharing list (see footnote 6), e.g. CS_HEAD_DIRTY, there must be a successor in the sharing list having the same memory address and pointing back to the head.
Even though our conditions specifying cache coherence are not sufficient conditions for cache coherence, we expect them to catch many of the errors that could occur. Furthermore, we are currently working to make the specification more accurate. In the process of specifying cache coherence with invariants, we first attempted relatively straightforward conditions. If these conditions were violated by any execution, we checked whether a protocol error had been detected or whether a legal state violated our conditions. In the latter case, the conditions were relaxed to take this state into account.
We would also like to specify fairness properties. For example, a processor that starts a Load instruction should eventually get a copy of the data and finish the Load instruction. As mentioned in Sect. 2, specifying fairness properties is not possible in the current version of Murφ.
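The two example invariants can be written as Boolean predicates over a global state. The sketch below is illustrative only: it assumes a hypothetical state layout in which `memory` maps an address to a data value and each cache line is a dictionary with the keys `state`, `addr`, `data`, `forw` and `back`; the state names are shortened stand-ins for the SCI cache-tag states, and only one representative state is checked per invariant.

```python
def clean_lines_match_memory(memory, caches):
    """Unmodified stable lines must hold the same data as the memory line."""
    return all(
        line["data"] == memory[line["addr"]]
        for line in caches.values()
        if line["state"] == "ONLY_CLEAN"
    )

def head_has_successor(caches):
    """A head of a stable sharing list must have a successor for the same
    address whose back pointer refers to the head."""
    for name, line in caches.items():
        if line["state"] != "HEAD_DIRTY":
            continue
        succ = caches.get(line["forw"])   # None if the forward pointer is unset
        if succ is None or succ["addr"] != line["addr"] or succ["back"] != name:
            return False
    return True
```

Both predicates are local in the sense used above: each one inspects a single cache line together with its memory line or its immediate successor, rather than the sharing list as a whole.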
Errors Found During Verification
All the errors found so far occurred in system configurations with only two processors with one cache line each, one memory with one address and one data value ("zero bits of data"), after examining a few thousand states in time on the order of minutes. Furthermore, only the protocol self-checks copied from the SCI C-code were triggered. None of our invariants was violated.
The largest example we ran had three processors with one cache line each, one memory with one address and two data values. The Load (fetch option CO_LOAD), Store, and Delete instructions were enabled. The cache/processor options DIRTY, FRESH and MODS were selected. The Murφ verifier examined 5.8 million states in 6.4 h, running on a Sun SPARCstation 20 and using 61 bytes per state. However, this example revealed no new errors.
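As a rough back-of-the-envelope reading of these figures, the following sketch derives the implied size of the state table and the average exploration rate. The function names are hypothetical, and the result ignores any overhead beyond the stated 61 bytes per state.

```python
def state_table_bytes(states: int, bytes_per_state: int) -> int:
    """Memory needed to store every reachable state once."""
    return states * bytes_per_state

def states_per_second(states: int, hours: float) -> float:
    """Average exploration rate over the whole run."""
    return states / (hours * 3600)

table = state_table_bytes(5_800_000, 61)   # 353,800,000 bytes, roughly 354 MB
rate = states_per_second(5_800_000, 6.4)   # roughly 252 states per second
```

These numbers illustrate why both memory and run time limit the size of the models that can be verified exhaustively.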
We also ran examples in which we used more than one memory, address or cache line. None of these examples revealed new errors in the protocol. We only sometimes found errors in our Murφ model that were due to incorrect translation from the C-code to Murφ.
5 See Table 4-3 in [7] for a list of all stable cache-tag states.
6 See Table 4-4 in [7] for a list of all stable sharing lists.
All protocol errors we found can be divided into three different classes, which are described in the following subsections. For each class, error examples are given. The full error list was sent to the SCI code-bugs reflector.
Omissions in the typical set protocol
The typical set protocol can be considered a simplification of the full set protocol. All errors in the first class have in common that some program segments were missing in the C-code of a routine of the typical set protocol but not in the corresponding routine of the full set protocol. Thus, these errors were easy to fix. The missing program segments were copied from the full set routine into the corresponding typical set routine.
For example, a processor executing a Load instruction may set the current cache line state to CI_ONLY_EXCL. This intermediate cache line state is not considered in the routine TypicalLoad(), which reports an error instead. However, the routine FullLoad() considers this case and correctly changes the cache line state to CS_ONLY_DIRTY.
Uninitialized variables
At some places in the SCI C-code uninitialized variables are accessed. During simulation runs using the C-code execution environment, these variables were presumably initialized to zero by code generated by the C-compiler { thus causing no problem. However, hardware implementations are less error-prone if all initializations are made explicit.
Instead of describing the situations in which access to uninitialized variables occurs, we give two examples where variables have to be set to a defined value to avoid problems later.
- First, the routine MemoryAccessCoherent() should not return without setting the command nullified (cn) bit in the response packet to a defined value. We added an assignment to set the cn bit to zero by default.
- Second, the routine CacheRamAccess() should set the command.cmd field in the response packet to the default value SC_RESP00. This is especially important since the routine CommonTransaction() copies the incoming data into the cache line data field depending on the command.cmd field of the incoming packet.
Logical protocol errors
The logical protocol errors we found required more subtle changes in the cache coherence protocol. These errors can be characterized as revealing flaws in the logical structure of the protocol. Note that this error class and the previous one affect the correct operation not only of the typical set protocol, but also of the full set protocol.
So far, we have found the two logical protocol errors explained in the following. Only the first error has been fixed; the second one is currently being discussed with SCI working group members.
- When a Flush instruction is in progress in a processor, the current cache line state may be set to CS_HX_INVAL_OX. Assume a second processor that is a "TAIL_VALID" member of the sharing list now also starts a Flush instruction. Then, it sends a cread00.CC_PREV_VTAIL request to the first cache. When the first cache tries to respond to this request, an assert statement is violated in the cache's routine CacheTagUpdate() because none of the CacheTag...Update() routines has processed the request. The error can be fixed by adding CS_HX_INVAL_OX to the "blocking states" in the routine CacheTagBasicUpdate().
- While the errors described so far occurred with the three processor/cache options DIRTY, FRESH and MODS enabled, the following error was found with only the options FRESH and MODS enabled. Table 2 shows the trace for this error, consisting of actions, starting from a state where both processors have invalid caches and ending in the error state. Usually, the protocol leaves several choices at each state for the successor state. Thus, the longer an error trace, the less likely it is that the error is detected by simulation. The error trace in Table 2 was found by breadth-first search and is therefore as short as possible.
Conclusion
The most important experiences gained in verifying the SCI cache coherence protocol can be summarized as follows:
- The abstraction done in the modeling was relatively simple and straightforward. Modeling at a higher level of abstraction, for example by mapping the many possible cache line states onto fewer abstract states, would have incurred the problem of comparing the abstract model with the real protocol defined in the C-code. To avoid this (severe) difficulty, we opted for simple abstractions.
- A careful implementation of the model in Murφ is important in fighting the state explosion problem. For example, we were able to reduce the number of reachable states by a factor of over 20 by just setting all state variables whose current values were no longer needed to fixed values.
- The Murφ system for formal verification should be viewed as a debugging tool. Verification was only possible for down-scaled versions of the model and thus total correctness cannot be guaranteed.
- It seems to be advantageous to design and specify complex protocols with the help of formal verification tools. First, the quality of the system is increased. Second, the time-consuming task of translating the description of the system into a "formal model" would be eliminated. Finally, our Murφ description is deemed to be easier to implement in hardware and not more complicated to understand than the original C-code.
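The effect of resetting variables whose values are no longer needed can be seen on a toy model. The sketch below is a hypothetical illustration, not the SCI model: a process picks an arbitrary scratch value when it starts an operation and never reads that value again after finishing. Without the reset, stale scratch values distinguish otherwise identical idle states.

```python
def reachable(start, successors):
    """Collect all states reachable from `start` by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in successors(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def make_model(num_values, reset_dead):
    """Return the successor function of a toy process with state (phase, tmp)."""
    def successors(state):
        phase, tmp = state
        if phase == "idle":
            # Start an operation with an arbitrary scratch value.
            return [("busy", v) for v in range(num_values)]
        # Finish the operation; tmp is dead from here until it is rewritten.
        return [("idle", 0 if reset_dead else tmp)]
    return successors

without_reset = len(reachable(("idle", 0), make_model(4, reset_dead=False)))  # 8 states
with_reset = len(reachable(("idle", 0), make_model(4, reset_dead=True)))      # 5 states
```

With several such variables per process and several processes, the savings multiply, which is consistent with the large reduction factor reported above.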
8 Future Work
Our model of the SCI cache coherence protocol could be extended in several ways. However, one should keep in mind that these extensions worsen the state explosion.
- First, the model could be enlarged to cover the full set cache coherence protocol and all of the processor/cache options defined in the SCI Standard. Furthermore, the extensions of the SCI cache coherence protocol currently under development (for an overview see [10]) could be included in the model.
- Second, in the current version of our model, the processor/cache options have to be enabled or disabled by hand and they are identical for all nodes. For automatic verification, they should be selected automatically, arbitrarily and separately for each node.
- Finally, the model could be altered to allow multiple execution units for each processor and pipelining during the cleanup phase of an instruction. However, unlike the first two suggestions, this would require significant changes in the current model.
In our verification project, some errors were revealed that had not been found before. Verification methods are able to help in constructing better systems, but they have to keep pace with the increasing size of the systems. There are several ways in which the verifiable size of the model could be increased. First, it might be possible to abstract the sharing list from a low-level doubly-linked list to an abstract list. This way, the number of reachable states could be reduced. Second, there are some ways to increase the number of explorable states in the current Murφ system. Examples would be state compression [13] and on-the-fly methods [12]. Finally, symbolic methods to represent the set of reachable states [1] could yield further progress in the SCI verification.
