We offer a solution to the problem of verifying formal memory models of processors by combining the strengths of model-checking and a formal testing procedure for parallel machines. We characterize the formal basis for abstracting the tests into test automata and associated memory rule safety properties whose violations pinpoint the ordering rule being violated. Our experimental results on Verilog models of a commercial split transaction bus demonstrate the ability of our method to effectively debug design models during early stages of their development.
Introduction
The fundamentally important problem [AG96] of verifying whether a given memory system model (or "a memory system") provides a formal memory model (or "memory model") appears in a number of guises. CPU designers are interested in knowing whether aggressive execution techniques, such as speculative issue of memory operations, violate sequential consistency; I/O bus designers are interested in knowing the exact semantics of shared accesses provided by split I/O transactions [Cor97]; even designers of multi-threaded languages such as Java that support shared updates [GJS96] are interested in this problem. Formal verification methods are ideally suited for this problem because: (i) the semantics of memory orderings are too subtle to be fathomed through informal reasoning alone; (ii) ad hoc testing methods cannot provide assurance that the desired memory model has been implemented. Unfortunately, despite the central importance of this problem and the large body of formal methods research in this area, there is still no single formally based method that the designer of a realistic multiprocessor system can use on a detailed design model to quickly find violations in the design. In this paper we describe such a method, called test model-checking.
Test model-checking formally adapts to the realm of model-checking a formally based architectural testing method called Archtest. Archtest has been successfully used on a number of commercial multiprocessors [Col] by running a suite of test programs on them. Archtest is an incomplete testing method in that it does not, under all circumstances, detect violations of memory orderings [Col92]. Nevertheless, its tests have been shown to be incisive in practice [Col]. Most importantly, the formal theory of memory ordering rules developed by Collier in [Col92] forms the basis for Archtest, which means that whenever a violation is detected by Archtest, there is a formal line of reasoning leading back to the precise cause.
Being based on Archtest, test model-checking is also incomplete. However, none of the (presumed) complete alternatives to date have been shown to be practical for verifying large designs. For example, [PD96] involves the use of manually guided mechanical theorem proving. Even approaches based on conventional model-checking are impossibly difficult to use in practice. For example, the assertions pertaining to the sequential consistency of lazy caching [Ger95], a simple memory system, expressed in various temporal logics (by [Gra94] in ∀CTL* [CES86] and [LLOR97] in TLA [Lam94]) are highly complex. We do not believe that descriptions of this style will scale up. On the other hand, the test model-checking method has not only been able to comfortably handle the memory system defined by the symmetric multiprocessor (SMP) bus called Runway [BCS96, GGH+97], used by Hewlett-Packard in their high-end machines, but has also discovered many subtle bugs in the early models of this bus that we created. Our model includes a number of details such as split transactions, out-of-order transaction completions, and even an element of speculative execution. The errors we made in capturing these details could well have been made in an actual industrial context. We believe that with growing system complexity, the role of debugging methods that are effective and formally based will only grow in significance, regardless of whether the methods are complete or not.
Test model-checking has a number of other desirable features. It involves model-checking a fixed set of safety properties for each formal memory model; these properties are very nearly independent of the actual memory system model being tested. This fixed nature greatly facilitates the use of test model-checking within the design cycle, where debugging is most effective, design changes are frequent, and time-consuming alterations to the properties being verified following design changes would be frowned upon (test model-checking will not need such alterations). Also, the formal adaptation of the tests of Archtest made in test model-checking can be verified once and for all, thanks to the fixed set of tests used in test model-checking (we describe and argue the correctness of these abstractions later). Finally, in test model-checking, a memory model is viewed as a collection of simpler ordering rules, and for each constituent ordering rule, a specific property is tested on the memory system. We found that this significantly helps compartmentalize errors, as opposed to the non-intuitive error traces that can result from conventional model-checking, which can be very difficult to understand for non-trivial memory systems.
Test model-checking is also a more effective debugger for memory models than Archtest in a formal sense. The tests of Archtest are straight-line programs of length k, one per node. Such programs execute on various nodes of the multiprocessor concurrently. The recommendation accompanying Archtest is that users run the tests for as large a k as is feasible, because then the chances of being scheduled according to different interleavings (by the underlying operating system, memory controller arbiter, etc.) increase. In adapting the tests of Archtest, test model-checking gives the effect of choosing k = ∞. Thus, we cover all possible schedules. The subtle bugs detected by test model-checking on realistic examples that are reported in Section 5 corroborate our intuition that test model-checking is indeed an effective debugging tool for memory models.
To reiterate, our specific contributions in this paper are: (i) the adaptation of a formal testing method for memory models to model-checking, which can be applied during the design of modern microprocessors whose memory systems are very complex; (ii) a formal characterization (accompanied by proofs) of how the tests of the testing method are abstracted and turned into a fixed set of safety properties that are then model-checked; and (iii) experimental results on three examples using the VIS model-checker, the last example being much larger than any previously reported.
Related Work
In [Gra94], abstract interpretation [CC77] is employed to reduce infinite-system verification to finite ∀CTL* model-checking. The authors apply this technique to verify the sequential consistency of lazy caching with unbounded queues. They recognize that to get an exact characterization of sequential consistency involving only the observable event names, one needs full second order logic [Gra94]. To be able to express sequential consistency in ∀CTL*, they give a stronger characterization of sequen- [Col92]. One of our contributions is that we answer these questions by elaborating on the theoretical as well as practical aspects of test model-checking.
In [PD96], the authors use a method called aggregation on a distributed shared memory coherence protocol used in an experimental multiprocessor, to arrive at a simplified model of system behavior. Their technique involves manual theorem proving. The works in [HMTLB95] and [DPN93] are aimed at verifying that synchronization routines work correctly under various memory models, where the memory models themselves are described using finite-state operational models. They do not address the problem of establishing the memory models provided by detailed memory subsystem designs, which is our contribution. In [GK97, GK94], the authors analyze the problem of deciding whether a given set of traces is sequentially consistent. Our approach differs in two respects. First, we are interested in proving that detailed models of memory systems are correct, while they obtain traces (presumably from actual machines) and analyze them for sequential consistency. Second, our method is more useful for CPU designers as it can give feedback during early phases of the design, pinpointing which ordering rules are violated (if any).
Overview of Archtest
Archtest is based on the theory presented in [Col92] that formally defines and characterizes architectural rules obeyed by memory subsystems of multiprocessors. Although these rules are elemental, in realistic memory systems the rules manifest in compound form. Obeying a compound rule is tantamount to obeying all the constituent elemental rules; violating a compound rule is tantamount to violating any of the constituent elemental rules. Four crucial elemental ordering rules are:
Rule of Computation (CMP): This is a basic rule defining how the terminal value of each operand is calculated from the initial values of the operand. Though most of the literature on memory architectures implicitly assumes this rule, we will often keep it explicit in our discussions.
Rule of Write Order (WO): Same as RO with 'write' substituted for 'read'.
Rule of Write Atomicity (WA): A write operation becomes visible to all processes instantaneously. More precisely, one conceptual store S_i is associated with each processor node P_i. Then, for each write operation W, one write event W_i is defined per store S_i. WA guarantees that there are no i, j and no event e such that e is before W_i and after W_j.
The test of Archtest for the compound rule consisting of the elemental rules CMP, RO, and WO, denoted A(CMP, RO, WO), is shown in Figure 2. The Atomic condition watches for the possibility that a write operation from P1 and a write operation from P4 appear to have finished in different orders to P2 and P3. Since test programs such as Test 2 are meant to be run on real machines, there cannot be any real guarantee that the particular interleavings that reveal violations (such as violations of WA, watched for by condition Atomic) will indeed happen. To allow for as many interleavings as possible, Archtest recommends that its tests be run for large values of k. With test model-checking, we effectively run the tests for k = ∞, as will be elaborated shortly.
Test model-checking
Test model-checking converts the tests of Archtest to corresponding memory rule test automata ("test automata") that drive the model of the memory system being examined. In our experiments, we use the Verilog language supported by VIS [Ver] to capture the memory system models as well as the test automata. The conditions corresponding to each compound memory rule being tested are turned into corresponding memory rule safety properties that are checked by the VIS tool. The reader may look ahead to Section 4.1 to see which compound rules define sequential consistency [Lam79]. In the remainder of this section, we explain the assumptions under which we formally derive test automata and memory rule safety properties, followed by a description of how they are derived for specific cases.
Assumptions about memory systems realized in hardware
Memory systems realized in hardware, as well as finite-state models thereof, are assumed to be data independent; i.e., the control logic of the system moves data around, and does not base its control-point settings on the data values themselves. We also assume that the system is address semi-dependent [HB95], i.e., the control logic can at most compare two addresses for equality or inequality and base its actions on the outcome of this test. These assumptions are standard, and form the basis for defining test automata as well as memory rule safety properties.
Creation of test automata
As illustrated in Figure 2(b), we obtain test automata for various memory models by finitely abstracting the data used in the tests of Archtest, using non-determinism to justify the abstraction. For example, we abstract the specific activities of process P1 of Figure 2(a) into that of (non-deterministically) writing all possible ascending values over {0,1}, as shown in P1 of Figure 2(b). Also, since we cannot store infinite arrays in creating process P2, we turn P2 and the corresponding memory rule safety property into an automaton that checks that the array values read are monotonically increasing. This, in turn, can be performed using just two consecutive array values x1 and x2 that are nondeterministically recorded by P2. Hence, the memory rule safety property we model-check for is: P2 in final state ⇒ x2 ≥ x1.
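As an illustration only, the abstraction described above can be mimicked with a small explicit-state search in Python. This is a sketch under stated assumptions, not the Verilog/VIS encoding used in our experiments: the two memory models (`serial_memory`, `reordering_memory`), the state encoding, and the function `check_monotonic` are all hypothetical names introduced here. P1 writes an ascending sequence over {0,1}, P2 nondeterministically records two consecutive reads x1 and x2, and the search checks that x2 ≥ x1 whenever P2 is in its final state.

```python
from collections import deque

def serial_memory():
    """Hypothetical model: a single cell; writes take effect immediately."""
    def write(m, v):
        return [v]                      # list of possible successor states
    def read(m):
        return m
    def internal(m):
        return []                       # no internal steps
    return 0, write, read, internal

def reordering_memory(depth=2):
    """Hypothetical buggy model: buffered writes may drain in any order."""
    def write(m, v):
        cell, buf = m
        return [(cell, buf + (v,))] if len(buf) < depth else []
    def read(m):
        return m[0]
    def internal(m):
        cell, buf = m
        return [(buf[i], buf[:i] + buf[i + 1:]) for i in range(len(buf))]
    return (0, ()), write, read, internal

def check_monotonic(model):
    """Explore all interleavings; True iff 'P2 final => x2 >= x1' always holds."""
    init, write, read, internal = model
    # State: (value P1 is writing, P2 phase, x1, x2, memory state).
    # P2 phase: 0 = reading without recording, 1 = x1 recorded, 2 = final.
    start = (0, 0, None, None, init)
    seen, frontier = {start}, deque([start])
    while frontier:
        w, phase, x1, x2, m = frontier.popleft()
        if phase == 2 and x2 < x1:
            return False                # safety property violated
        succ = []
        # P1 writes its current value, or switches from 0 to 1 and writes 1.
        for v in {w, 1}:
            succ += [(v, phase, x1, x2, m2) for m2 in write(m, v)]
        # P2 records two consecutive reads (unrecorded reads are self-loops).
        if phase == 0:
            succ.append((w, 1, read(m), x2, m))
        elif phase == 1:
            succ.append((w, 2, x1, read(m), m))
        # Internal steps of the memory system.
        succ += [(w, phase, x1, x2, m2) for m2 in internal(m)]
        for s in succ:
            if s not in seen:
                seen.add(s)
                frontier.append(s)
    return True
```

Under these assumptions, `check_monotonic(serial_memory())` holds, while `check_monotonic(reordering_memory())` fails because a buffered 1 can reach the cell before a buffered 0. The state space stays finite because only the values 0 and 1 and two recorded reads are tracked, which is the point of the abstraction.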
We now provide a justification that these abstractions preserve the memory rule safety properties, i.e., for the same memory system model, a violation of a condition occurs in a test of Archtest for k = ∞ iff the same violation occurs in model-checking the corresponding memory rule safety property when test automata are used to drive the memory system model. To keep the presentation simple, we formally argue how the test automata find every violation present in the test of Archtest with k = ∞; the opposite direction of the iff, i.e., how a test of Archtest with k = ∞ finds the violations found by the test automata, is easy to see because the test automata just appear as a "stuttering" of the test of Archtest. For example, the actions of P1 in Figure 2 when Test 1 for k = ∞ detects a violation. Note that the nondeterminism employed in constructing test automata enables P1 and P2 to guess the right value of and i corresponding to the violation.
Abstracting Test 2
The test automaton for Test 2 is shown in Figure 5. In this automaton, P1 and P4 write all possible ascending sequences over {0,1} in A and B respectively. Each processor independently and nondeterministically decides to switch from writing 0 to writing 1. Modifications similar to those in Test 1 are applied to P2 and P3 also, to (nondeterministically) decide which U[i], V[i] pair and X[j], Y[j] pair are recorded in u, v and x, y. The memory rule safety property corresponding to condition Atomic is: P2 and P3 in their final states ⇒ v x ∨ y u. As was explained in Section 3.2 for Test 1, our abstraction avoids having to remember the entire extent of the arrays U, V, X, and Y. (In Test 2, one has to check for Monotonic also; this is done similarly to Test 1.)
Abstracting Test 3
We now discuss a test for the elemental ordering rule Program Order (PO), which is somewhat more complex than the previous two tests. PO We obtain the test automaton and the memory rule safety property for Test 3 of Figure 6(a) as illustrated in Figure 6(b). P1 executes a pair of instructions, a write to A followed by a read from B, infinitely often. The value written to A is 0 for some iterations and is nondeterministically changed to 1. P2 runs similarly. P1 nondeterministically selects one write-read instruction pair. It assigns the value written to A to j and the value read from B to y. Similarly, P2 updates i and x. The dashed edges in Figure 6 show when x, y, i, j are updated. The memory rule safety property corresponding to condition PO Cross is: P1 and P2 in their final states ⇒ (x j ∨ y i) ∧ (x j ∨ y i). We can show that this abstraction preserves PO Cross by an argument similar to that for Test 1 and Test 2.
Case Studies
To demonstrate the effectiveness of our approach, we verified three different memory systems, namely serial memory, lazy caching, and a simplified version of the Runway bus, all using VIS [Ver]. These three memory systems are described in some detail below, along with some of the subtle bugs that we could detect using test model-checking. Details of all our experiments can be obtained from the Web [Mok] or by contacting the authors.
Sequential consistency and serial memory protocol
A sequentially consistent memory system [Lam93] requires that there be a single self-consistent trace t of memory operations that, when projected onto the memory operations of each individual processor P_i (R_i(a, d) and W_i(a, d) for processor i), is in program order for P_i. As suggested in [Col92], we can show that sequential consistency is A(CMP, PO, WA).
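To make the definition concrete, the following Python sketch searches by brute force for such a single trace, given the operations each processor performed. It is an illustration only and not part of test model-checking: the function name `sequentially_consistent`, the tuple encoding `('R' | 'W', address, data)`, and the assumption that every location initially holds 0 are all introduced here. The search is exponential in the worst case and is meant only for very small examples.

```python
def sequentially_consistent(histories, initial=0):
    """histories: one list per processor of ('R' | 'W', address, data),
    given in program order. Returns True iff some interleaving exists in
    which every read returns the most recently written value."""
    n = len(histories)
    failed = set()                       # (positions, memory) states already refuted

    def search(pos, mem):
        # pos[i] = number of operations of processor i already in the trace.
        if all(pos[i] == len(histories[i]) for i in range(n)):
            return True
        key = (pos, tuple(sorted(mem.items())))
        if key in failed:
            return False
        failed.add(key)
        for i in range(n):
            if pos[i] == len(histories[i]):
                continue
            kind, a, d = histories[i][pos[i]]
            nxt = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
            if kind == 'W':
                if search(nxt, {**mem, a: d}):
                    return True
            elif mem.get(a, initial) == d:   # a read must see the latest write
                if search(nxt, mem):
                    return True
        return False

    return search((0,) * n, {})
```

For instance, with P1 performing W(A,1); R(B,0) and P2 performing W(B,1); R(A,0), no such trace exists, so the function returns False; if P2 instead reads 1 from A, a trace exists and it returns True.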
As [Col92] does not list a single compound test to check for A(CMP, PO, WA), we can use the following two tests that are available: "Test 7", which tests for A(CMP, RO, WO, WA), and Test 3, which tests for A(CMP, PO). This combination is exactly equivalent to testing sequential consistency because PO implies RO and WO (as formally defined in [Col92]). For every memory system we consider, these two tests are model-checked separately, and the results are summarized in Figure 9.
Serial memory and Lazy caching
The serial memory protocol for n processors and a memory is shown in Figure 7. Serial memories are often used to define SC operationally. The lazy caching protocol [Ger95], shown in Figure 8, also implements sequential consistency, and is geared towards a bus-based architecture. The memory interface still consists of reads and writes; however, caches C_i are interposed between the shared memory Mem and the processors P_i. Each cache C_i contains a part of the memory Mem and has two queues associated with it: an out-queue Out_i in which P_i's write requests are buffered and an in-queue In_i in which the pending cache updates are stored. These queues model the asynchronous behavior of write events in a sequentially consistent memory. A write event W_i(a, d) doesn't have an immediate effect. Instead, a request (d, a) is placed in Out_i. When the write request is taken out of the queue by an internal memory-write event MW_i(a, d), the memory is updated and a cache update request (d, a) is placed in every in-queue. This cache update is eventually removed by an internal cache update event CU_j(a, d), as a result of which the cache C_j gets updated. Cache evictions are modeled by internal cache invalidate events: CI_i can arbitrarily remove locations from cache C_i. Caches are filled both as the delayed result of write events and through internal memory-read events, MR(a, d). The latter events model the effect of a cache miss: in that case the read event stalls until the location is copied from the memory. A read event R_i(a, d), predictably, stalls until a copy of location a is present in C_i, but also until the copy contains a correct value in the following sense: SC demands that a processor P_i reads the value at a location a that was recently written by P_i unless some other processor updated a in the meantime.
Hence, a read event R_i(a, d) cannot occur unless all pending writes in Out_i are processed, as well as the cache update requests from In_i that correspond to writes of P_i. For this reason, such cache update requests are marked (with a ⋆).
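The events of the lazy caching protocol described above can be summarized in a short executable sketch. This is a simplified Python rendering for illustration, not our VIS Verilog model: the class `LazyCaching` and its method names are hypothetical, queues are unbounded, and a memory-read is assumed to enqueue the copied value in In_i as an unmarked entry (the description above only says that the location is copied from memory).

```python
from collections import deque

class LazyCaching:
    """Sketch of the lazy caching events; all names here are illustrative."""

    def __init__(self, n, addresses, initial=0):
        self.mem = {a: initial for a in addresses}
        self.cache = [dict() for _ in range(n)]    # C_i
        self.out = [deque() for _ in range(n)]     # Out_i
        self.inq = [deque() for _ in range(n)]     # In_i

    def write(self, i, a, d):            # W_i(a, d): only buffer the request
        self.out[i].append((d, a))

    def mem_write(self, i):              # MW_i(a, d): update Mem, notify caches
        d, a = self.out[i].popleft()
        self.mem[a] = d
        for j, q in enumerate(self.inq):
            q.append((d, a, j == i))     # marked only in the writer's in-queue

    def mem_read(self, i, a):            # MR_i(a, d): assumed to go through In_i
        self.inq[i].append((self.mem[a], a, False))

    def cache_update(self, j):           # CU_j(a, d): apply oldest pending update
        d, a, _marked = self.inq[j].popleft()
        self.cache[j][a] = d

    def cache_invalidate(self, i, a):    # CI_i: evict a location
        self.cache[i].pop(a, None)

    def read(self, i, a):                # R_i(a, d): returns d, or None if stalled
        stalled = (a not in self.cache[i]
                   or len(self.out[i]) > 0
                   or any(marked for _, _, marked in self.inq[i]))
        return None if stalled else self.cache[i][a]
```

In this sketch a processor that has just written stalls on its reads until its own write has passed through Out_i and the marked entry in In_i, whereas another processor may keep reading a stale cached value until it processes its own cache update; this is the "lazy" behavior the queues are meant to model.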
Runway
Our third example, called Runway, is modeled after a commercial bus used to interconnect processors and a memory controller to form a multiprocessor system. The behavior of this memory system is described in some detail in [GGH+97]. The complexity of this protocol stems from many sources, a few of which are elaborated here (see [GGH+97] for more details). First, the queues in the clients introduce decoupled execution, leading to a large number of "otherwise equivalent" states. Next, the control mechanism is very complex, for many reasons, including: (i) lines can be obtained in various sharing modes such as read-shared-private and read-private; (ii) line states can be eagerly promoted to private before the data actually arrives (concurrent dirtying is merged in when the data arrives); (iii) hit-after-miss situations can be speculatively processed and unrolled when invalidated. Though we did not try to model each of these features in full, we did include a modicum of these aggressive features in our VIS model, which in fact occupies more than 2,000 lines of VIS Verilog code.
Verification results
The tables in Figure 9 show the execution time for model-checking our Serial memory, Lazy caching and Runway models for the tests of A(CMP, PO) and A(CMP, RO, WO, WA); the number of states and the BDD size are also reported. Note the large number of states and small BDD size for Lazy Caching (compared to Runway), which are respectively due to the queues and the low complexity of the control logic. On the other hand, observe that the very complex control logic of the Runway model causes the large BDD size, which in turn results in high VIS run-time to finish searching correct models. However, in all our experiments, whenever there was any memory ordering rule violation in our model, test model-checking detected it quickly (on the order of minutes). A very desirable feature one can provide in a tool based on test model-checking is a menu of previously generated test automata for the various compound rules in [Col92], using which designers can probe their model. We now summarize an insidious bug in our models that was revealed using test model-checking.
Figure 9: Verification results using VIS on a SPARC ULTRA-1 with 512 MB memory.
Description of a Bug: The following bug in our model of Lazy Caching was caught by a violation of Test 3. The bug was in the queues used by Lazy Caching, which were implemented as shift registers.
We forgot to shift the ⋆-bit in In_i when the processor P_i receives a cache update from the In_i queue. With this bug it is possible that the In_i queue is not ⋆-ed when it should be, and consequently reads in P_i may bypass writes. This results in a violation of PO. This is a difficult bug to catch because its detection involves understanding the complex feedback among all components of the protocol (queues, memory, and caches). Moreover, this bug is interesting because it violates PO but doesn't violate WA. This is so because only write-read (WR) order is affected by this bug. Our technique effectively caught this bug: the PO condition did not pass when we model-checked the model for Test 3 (for A(CMP, PO)). However, Test 7 for A(CMP, RO, WO, WA) (note that it doesn't involve PO) passes! This shows the futility of ad hoc testing methods: one could apply subjective criteria to consider a test similar to Test 7 to be sufficiently incisive, when in fact it fails to account for a crucial ordering relation such as PO. The distinctive advantage of a formally based testing method such as Archtest, namely that it covers various compound memory ordering rules, should be apparent. We found a more complex bug in our model of the Runway bus. Details are too involved to report here, but can be easily obtained from the web site [Mok].
Conclusion and Future Plans
We presented a new approach to verify multi-processors for formal memory models, which combines two existing powerful techniques: model-checking, and the testing method of Archtest. From our results, we conclude that test model-checking can be of great value in detecting bugs during early stages of the design cycle of modern microprocessors whose memory subsystems are complex. Our results on our model of the HP PA/Runway bus attest to this.
So far we have identified the rules and corresponding tests for sequential consistency. We are currently working on identifying similar rules and tests for other well-known formal memory models such as TSO, PSO, and RMO [AG96] that are described in the SPARC V9 architecture manual [WG94]. This work may involve defining new rules as well as new tests corresponding to them.
We are currently working to formulate some reasonable assumptions about the memory system model under which the tests administered by our test automata can be rendered complete. Also, for a limited class of models, model-checking the test for some small value of k might actually be sufficient. Our initial attempts in this direction are encouraging.
