as a systematic way to decompose the proof of correctness of pipelined microprocessors. The central idea is to construct the abstraction function using completion functions, one per unfinished instruction, each of which specifies the effect (on the observables) of completing the instruction. However, its applicability depends on the fact that the implementation "commits" the unfinished instructions in the pipeline in program order. In this paper, we extend the completion functions approach to the case where this is not true and demonstrate it on an implementation of Tomasulo's algorithm without a reorder buffer. The approach leads to an elegant decomposition of the proof of the correctness criterion, does not involve the construction of an explicit intermediate abstraction, makes heavy use of an automatic case-analysis strategy based on decision procedures and rewriting, and addresses both safety and liveness issues.
Introduction
For formal verification to be successful in practice, it is not only important to raise the level of automation but also essential to develop methodologies that scale verification to large state-of-the-art designs. One of the reasons for the relative popularity of model checking in industry is that it is automatic when readily applicable. A technology originating from the theorem proving domain that can potentially provide a similarly high degree of automation in verification is one that makes heavy use of decision procedures for the combined theory of boolean expressions with uninterpreted functions and linear arithmetic [CRSS94, BDL96]. Just as model checking suffers from a state-explosion problem, a verification strategy based on decision procedures suffers from a "case-explosion" problem. That is, when applied naively, the sizes of the terms generated and the number of cases examined during validity checking explode. Just as compositional model checking provides a way of decomposing the overall proof and reducing the effort for an individual model checker run, a practical methodology for decision procedure-centered verification must prescribe a systematic way to decompose the correctness assertion into smaller problems that the decision procedures can handle.
In [HSG98], we proposed such a methodology for pipelined processor verification called the Completion Functions Approach. The central idea behind this approach is to define the abstraction function¹ as a composition of a sequence of completion functions, one for every unfinished instruction, in their program order. A completion function specifies how a partially executed instruction is to be completed in an atomic fashion, that is, the desired effect on the observables of completing that instruction, assuming those ahead of it in the program order are completed. Given such a definition of the abstraction function in terms of completion functions, the methodology prescribes a way of organizing the verification into proving a hierarchy of verification conditions. The methodology has the following attributes:
- The verification proceeds incrementally, making debugging and error tracing easier.
- The verification conditions and most of the supporting lemmas (such as the lemma on the correctness of the feedback logic) needed to support the incremental methodology can be generated systematically.
- Every generated verification condition and lemma can be proved, often automatically, using a strategy based on decision procedures and rewriting.
- The verification avoids the construction of an explicit intermediate abstraction as well as the large amount of manual effort required to construct it.

In summary, the completion functions approach strikes a balance between full automation that (if at all possible) can potentially overwhelm the decision procedures, and a potentially tedious manual proof. This methodology is implemented using PVS [ORSvH95] and was applied (in [HSG98]) to three processor examples: DLX [HP90], dual-issue DLX, and a processor that exhibited limited out-of-order execution capability. The proof decomposition that this method achieves and the verification conditions generated in the DLX example are illustrated in Figure 1.
Later, we extended the methodology to verify a truly out-of-order execution processor with a reorder buffer [HSG99]. We observed that regardless of how many instructions are pending in the reorder buffer, the instructions can only be in one of a few (a small finite number of) distinct states, and exploited this fact to provide a single compact parameterized completion function applicable to all the pending instructions in the reorder buffer. The proof was decomposed on the basis of how an instruction makes a transition from its present state to the next state.
However, the applicability of the completion functions approach depends on the fact that the implementation "commits" the unfinished instructions in the pipeline in program order. The abstraction function is defined by composing the completion functions of the unfinished instructions in the program order too. Because of this, it is possible to relate the effect of completing instructions one at a time in the present and the next states and incrementally build the proof of the commutative diagram (see Figure 1). Also, one can provide, for every unfinished instruction, an "abstract" state where the instructions ahead of it are completed. This fact is useful in expressing the correctness of the feedback logic. If instructions were to commit out of order, it would not be possible to use these ideas.

¹ Our correctness criterion is based on an abstraction function, as are most others.
A processor implementing Tomasulo's algorithm without a reorder buffer executes instructions in the data-flow order, possibly committing them to the register file in an out-of-order manner. Hence, the basic premise of the completion functions approach (that instructions commit in the program order) is not true in this case. The implementation maintains the identity of the latest instruction writing a particular register. Those instructions issued earlier and not the latest ones to write their respective destinations, on completing their execution, only forward the results to other waiting instructions but do not update the register file. Observe that it is difficult to support branches or exceptions in such an implementation. (In an implementation supporting branches or exceptions, the latest instruction writing a register cannot be easily determined.)
In this paper, we extend the completion functions approach to be applicable in such a scenario. Instead of defining the completion function to directly update the observables, we define it to return the value an instruction computes in the various states. The completion function for a given instruction recursively completes the instructions it is dependent on to obtain its source values. The abstraction function is defined to assign to a register the value computed by the latest instruction writing that register. We show that this modified approach leads to a decomposition of the overall proof of correctness, and we make heavy use of an automatic case-analysis strategy in discharging the different obligations in the decomposition. The proof does not involve the construction of an explicit intermediate abstraction. Finally, we address the proof of liveness properties too.
The rest of the paper is organized as follows: In Section 2, we describe our processor model. Section 3 describes our correctness criteria. This is followed by the proof of correctness in Section 4. We compare our work with others in Section 5 and finally provide the conclusions.

(z and m are parameters to our implementation model.) A register translation table (RTT) maintains the identity of the latest pending instruction writing a particular register (the identity is a "tag", in this case the reservation station index). A scheduler controls the movement of the instructions through the execution pipeline (such as being dispatched, executed, etc.) and its behavior is modeled in the form of axioms (instead of a concrete implementation). Instructions are fetched from the instruction memory (using a program counter which then is incremented); the implementation also takes a no_op input, which suppresses an instruction fetch when asserted.
An instruction is issued by allocating a free reservation station for it (New_slot). No instruction is issued if all the reservation stations are occupied or if no_op is asserted. The RTT entry corresponding to the destination of the instruction is updated to reflect the fact that the instruction being issued is the latest one to write that register. If the source operands are not being written by previously issued pending instructions (checked using the RTT), then their values are obtained from the register file; otherwise the tags of the instructions providing the source operands are maintained in the reservation station allocated to the instruction. An issued instruction monitors the execution units to see if they produce the values it is waiting for, by comparing the tags it is waiting on with the tags of the instructions producing the results. An instruction can be dispatched when its source operands are ready and the corresponding execution unit is free. The Dispatch? and Dispatch_slot outputs from the scheduler (each an m-wide vector) determine whether or not to dispatch an instruction to a particular execution unit and the reservation station index from which to dispatch. Dispatched instructions get executed after a non-deterministic amount of time as determined by the scheduler output Execute?. At a time determined by the Write_back? output of the scheduler, an execution unit writes back its result, which is forwarded to other waiting instructions. A register updates its value with this result only if its RTT entry matches the tag of the instruction producing the result, and then clears its RTT entry. Finally, when an instruction is written back, its reservation station is freed.
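The issue and write-back rules above can be illustrated with a small executable sketch. It is not the PVS model: the names `Station`, `issue` and `write_back` are hypothetical, opcodes and execution units are left out (the caller supplies each result), and tag 0 is assumed to mean "no pending writer" or "operand ready".

```python
from dataclasses import dataclass

@dataclass
class Station:
    """One reservation station; a src_ptr of 0 means that operand is ready."""
    valid: bool = False
    dest: int = 0
    src_ptr1: int = 0
    src_val1: int = 0
    src_ptr2: int = 0
    src_val2: int = 0

def issue(rf, rtt, rs, tag, dest, src1, src2):
    """Allocate free station `tag`; read operand tags/values, then claim `dest`."""
    assert not rs[tag].valid
    rs[tag] = Station(True, dest, rtt[src1], rf[src1], rtt[src2], rf[src2])
    rtt[dest] = tag  # this instruction is now the latest writer of dest

def write_back(rf, rtt, rs, tag, result):
    """Forward `result` to waiting stations; update rf only for the latest writer."""
    for st in rs.values():
        if st.valid and st.src_ptr1 == tag:
            st.src_ptr1, st.src_val1 = 0, result
        if st.valid and st.src_ptr2 == tag:
            st.src_ptr2, st.src_val2 = 0, result
    dest = rs[tag].dest
    if rtt[dest] == tag:
        rf[dest], rtt[dest] = result, 0
    rs[tag] = Station()  # free the reservation station

rf, rtt = [0, 1, 20, 30], [0, 0, 0, 0]
rs = {1: Station(), 2: Station()}
issue(rf, rtt, rs, 1, dest=1, src1=2, src2=3)  # older writer of r1
issue(rf, rtt, rs, 2, dest=1, src1=3, src2=2)  # latest writer of r1
write_back(rf, rtt, rs, 2, 10)  # latest writer finishes first and updates r1
write_back(rf, rtt, rs, 1, 50)  # older writer: result is forwarded only
```

In this run the two instructions commit out of program order, and the older writer's result never reaches the register file because its tag no longer matches the RTT entry.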
At the specification level, the state is represented by a register file, a program counter and an instruction memory. In one clock cycle, an instruction is fetched from the instruction memory and executed, its result is written back to the register file, and the program counter is incremented.
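A minimal sketch of one specification step follows. The instruction encoding `(op, dest, src1, src2)` and the `alu` table are assumptions made for illustration; the source does not fix an instruction format.

```python
def a_step(rf, pc, imem, alu):
    """Fetch imem[pc], execute it, write the result back, and increment pc."""
    op, dest, src1, src2 = imem[pc]
    new_rf = list(rf)
    new_rf[dest] = alu[op](rf[src1], rf[src2])
    return new_rf, pc + 1

# Hypothetical two-operation ALU and a two-instruction program.
alu = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}
imem = [("add", 1, 2, 3), ("sub", 1, 3, 2)]
rf, pc = [0, 1, 20, 30], 0
rf, pc = a_step(rf, pc, imem, alu)  # r1 := r2 + r3
rf, pc = a_step(rf, pc, imem, alu)  # r1 := r3 - r2
```

The instruction memory is passed in unchanged, reflecting that it is never modified.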
Our Correctness Criteria
Intuitively, a pipelined processor is correct if the behavior of the processor starting in a flushed state (i.e., no partially executed instructions), executing a program and terminating in a flushed state is emulated by an ISA level specification machine whose starting and terminating states are in direct correspondence through projection. This criterion is shown in Figure 3(a), where I_step is the implementation transition function, A_step is the specification transition function and projection extracts those implementation state components visible to the specification (i.e., observables). This criterion can be proved by an easy induction on n once the commutative diagram condition (due to Hoare [Hoa72]) shown in Figure 3(b) is proved on a single implementation machine transition (and a certain other condition discussed in the next paragraph holds).
The criterion in Figure 3(b) states that if the implementation machine starts in an arbitrary reachable state impl_state and the specification machine starts in a corresponding specification state (given by an abstraction function ABS), then after executing a transition their new states correspond. Further, ABS must be chosen so that for all flushed states fs the projection condition ABS(fs) = projection(fs) holds. The commutative diagram uses a modified transition function A_step', which denotes zero or more applications of A_step, because an implementation transition from an arbitrary state might correspond to executing in the specification machine zero instructions (e.g., if the implementation machine stalls without fetching an instruction) or more than one instruction (e.g., if multiple instructions are fetched in a cycle). The number of instructions executed by the specification machine is provided by a user-defined synchronization function on implementation states. One of the crucial proof obligations is to show that this function does not always return zero (No indefinite stutter obligation). One also needs to prove that the implementation machine will eventually reach a flushed state if no more instructions are inserted into the machine, to make sure that the correctness criterion in Figure 3(a) is not vacuous (Eventual flush obligation). In addition, the user may need to discover invariants to restrict the set of impl_state considered in the proof of Figure 3(b) and prove that this set is closed under I_step.
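The commutative diagram and the projection condition can be checked exhaustively on a toy machine. The sketch below is only an illustration under assumed names: the implementation state is a pair `(count, pending)` holding at most one buffered "increment" instruction, and the specification state is a single counter. It is unrelated to the Tomasulo model itself.

```python
def commutes(q, inp, i_step, a_step, abs_fn, sync):
    """Check ABS(I_step(q, inp)) == A_step applied sync(q, inp) times to ABS(q)."""
    spec = abs_fn(q)
    for _ in range(sync(q, inp)):
        spec = a_step(spec)
    return abs_fn(i_step(q, inp)) == spec

# Toy implementation: commit the pending increment, then maybe fetch a new one.
def i_step(q, fetch):
    count, pending = q
    return (count + pending, 1 if fetch else 0)

def a_step(count):      # the specification executes one increment atomically
    return count + 1

def abs_fn(q):          # complete the pending instruction, if any
    return q[0] + q[1]

def projection(q):      # the observable part of the implementation state
    return q[0]

def sync(q, fetch):     # number of specification steps for this transition
    return 1 if fetch else 0

states = [(c, p) for c in range(4) for p in (0, 1)]
diagram_ok = all(commutes(q, f, i_step, a_step, abs_fn, sync)
                 for q in states for f in (False, True))
flushed_ok = all(abs_fn(q) == projection(q) for q in states if q[1] == 0)
```

Here `sync` returns zero exactly when no instruction is fetched, so the specification machine stutters on those transitions.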
Proof of Correctness
We introduce some notation which will be used throughout this section: q represents the implementation state, s the scheduler output, i the processor input, rf(q) the register file contents in state q and next(q,s,i) the "next state" after an implementation transition. "Primed" variables will be used to refer to the value of a given variable in the next state. Also, we identify an instruction in the processor by its reservation station index (i.e., instruction rsi means the instruction at reservation station index rsi). When the instruction in question is clear from the context (say rsi), we use just rs_op to refer to its opcode instead of rs_op(q)(rsi). (rs_op' will refer to rs_op(next(q,s,i))(rsi).) The PVS specifications and the proof scripts can be found at [Hos99].
Specifying the completion functions
An instruction in the processor can be in one of the following three states: issued, dispatched or executed. (Once written back, it is no longer present in the processor.) We formulate predicates describing an instruction in each of these states and specify the value an instruction computes in each of these states. The definition of the completion function is shown in 1.
In this implementation, when an instruction is in the executed state, the result value is available in the eu_result field of the execution unit, so Value_executed returns this value. We specify Value_dispatched along the same lines. When an instruction is in the issued state, it may be waiting for its source operands to get ready. In determining the value computed by such an instruction, we need the source operands, which we specify as follows: when rs_src_ptr1 is zero, the first source operand is ready and its value is available in rs_src_value1; otherwise its value is obtained by completing the instruction it is waiting on (rs_src_ptr1 points to that instruction). The second source operand is specified similarly.
To specify the completion function, we added three auxiliary variables. The first one maintains the index of the execution unit an instruction is dispatched to. Since the completion function definition is recursive, one needs to provide a measure function to show that the function is well-defined; the other two auxiliary variables are for this purpose. We should prove that instructions producing the source values for a given instruction rsi have a lower measure than rsi. So we assign a number rs_instr_num to every instruction that records the order in which it is issued, and this is used as the measure function. (The counter that is used in assigning this number is the third auxiliary variable.)
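The recursive structure of the completion function and the role of the measure can be sketched as follows. This is an illustration, not the PVS definition: the `Instr` record and the `ALU` table are hypothetical, and the dispatched state is folded into the "not yet executed" case on the assumption that a dispatched instruction has both operands ready.

```python
from dataclasses import dataclass

# Hypothetical operations; the proof itself treats them as uninterpreted.
ALU = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

@dataclass
class Instr:
    op: str
    src_ptr1: int       # 0 means the operand is ready in src_val1
    src_val1: int
    src_ptr2: int
    src_val2: int
    instr_num: int      # issue order, used as the recursion measure
    executed: bool = False
    eu_result: int = 0

def op_val(rs, rsi, ptr, val):
    """Source operand of rsi: a ready value, or the completed producer's value."""
    if ptr == 0:
        return val
    assert rs[ptr].instr_num < rs[rsi].instr_num  # measure strictly decreases
    return complete_instr(rs, ptr)

def complete_instr(rs, rsi):
    """Value that the instruction in station rsi computes."""
    ins = rs[rsi]
    if ins.executed:
        return ins.eu_result
    a = op_val(rs, rsi, ins.src_ptr1, ins.src_val1)
    b = op_val(rs, rsi, ins.src_ptr2, ins.src_val2)
    return ALU[ins.op](a, b)

rs = {
    1: Instr("add", 0, 2, 0, 3, instr_num=1),  # 2 + 3, both operands ready
    2: Instr("mul", 1, 0, 0, 4, instr_num=2),  # waits on station 1, times 4
}
value = complete_instr(rs, 2)
```

The assertion in `op_val` plays the part of the termination argument: every recursive call is made on an instruction that was issued strictly earlier.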
Constructing the abstraction function
The register translation table maintains the identity of the latest pending instruction writing a particular register. The abstraction function is constructed by updating every register with the value obtained by completing the appropriate pending instruction, as shown in 2. The synchronization function returns zero if the no_op input is asserted or if there is no free reservation station to issue an instruction, and one otherwise.

We generate the different cases of the induction argument (as will be detailed shortly) based on how an instruction makes a transition from its present state to its next state. This is shown in Figure 4, where we have identified the conditions under which an instruction changes its state. For example, we identify the predicate Dispatch_trans?(q,s,i,rsi) which takes the instruction rsi from the issued state to the dispatched state. In this implementation, this predicate is true when there is an execution unit for which the Dispatch? output from the scheduler is true and the Dispatch_slot output is equal to rsi. The other "trans" predicates are defined similarly.
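The abstraction and synchronization functions described above can be sketched in a few lines. The function names and argument shapes are assumptions for illustration; `complete_instr` stands for any function returning the value a pending instruction computes, and a zero RTT entry is taken to mean that no pending instruction writes the register.

```python
def abs_rf(rf, rtt, complete_instr):
    """Abstract register file: each register gets its latest writer's value."""
    return [rf[r] if rtt[r] == 0 else complete_instr(rtt[r])
            for r in range(len(rf))]

def sync(no_op, rs_valid, new_slot):
    """Number of specification steps matched by one implementation step."""
    return 0 if no_op or rs_valid[new_slot] else 1

completed = {2: 42}  # stand-in: station 2 is assumed to compute 42
abstract = abs_rf([7, 8, 9], [0, 2, 0], completed.__getitem__)
steps = [sync(False, {1: False}, 1),   # free slot, no no_op: issue
         sync(True, {1: False}, 1),    # no_op asserted: stutter
         sync(False, {1: True}, 1)]    # slot occupied: stutter
```

Only register 1 has a pending writer in this example, so it is the only entry whose abstract value differs from the register file.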
Having defined these predicates, we prove that they indeed cause instructions to take the transitions shown. Consider a valid instruction rsi in the issued state, i.e., issued_pred(q,rsi) holds. We prove that if Dispatch_trans?(q,s,i,rsi) is true, then after an implementation transition, rsi will be in the dispatched state (i.e., dispatched_pred(next(q,s,i),rsi) is true) and remains valid. (This is shown as a lemma in 4.) Otherwise (if Dispatch_trans?(q,s,i,rsi) is false), we prove that rsi remains in the issued state in next(q,s,i) and remains valid. There are three other similar lemmas for the other transitions. The sixth lemma is for the case when an instruction rsi in the executed state is written back. It states that rsi is no longer valid in next(q,s,i).

Now we come back to the details of the same_result lemma. In proving this lemma for an instruction rsi, one needs to assume that the lemma holds for the two instructions producing the source values for rsi (details will be presented later). So we do an induction on rsi with rs_instr_num as the measure function. As explained earlier in Section 4.1, instructions producing the source values (rs_src_ptr1 and rs_src_ptr2 when non-zero) have a lower measure than rsi. The induction argument is based on a case analysis on the possible state rsi is in, and whether or not it makes a transition to its next state. Assume the instruction rsi is in the issued state. We prove the induction claim separately in the two cases, Dispatch_trans?(q,s,i,rsi) being true or false. (The proof obligation for the first case is shown in 5.) We have similar proof obligations for rsi being in other states. In all, the proof decomposes into six proof obligations.

We sketch the proof of the issued_to_dispatched_induction lemma. We refer to the goal that we are proving, Complete_instr(...) = Complete_instr(...), as the consequent. We expand the definition of the completion function corresponding to rsi on both sides of the consequent.
In q, rsi is in the issued state and in next(q,s,i), it is in the dispatched state; this follows from the issued_to_dispatched lemma. After some rewriting and simplifications in PVS, the left hand side of the consequent simplifies to Value_issued(q,rsi) and the right hand side simplifies to Value_dispatched(next(q,s,i),rsi).
(The proofs of all the obligations are similar up to this point. After this point, the proof depends on the particular obligation being proved, since different invariants are needed for the different obligations.) The proof now proceeds by expanding the definitions of Value_issued and Value_dispatched, using the necessary invariants and simplifying. We use the PVS strategy (apply (then* (repeat (lift-if)) (bddsimp) (ground) (assert))) to do the simplifications by automatic case-analysis (many times, simply assert will do).
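The six-way case split behind the induction (three instruction states, each with its "trans" predicate true or false) can be enumerated with a small sketch. The state label `written_back` is a hypothetical name for an instruction that has left the processor.

```python
NEXT_STATE = {"issued": "dispatched", "dispatched": "executed",
              "executed": "written_back"}

def next_state(state, trans):
    """State of an instruction after one step; `trans` is its trans predicate."""
    return NEXT_STATE[state] if trans else state

# One proof obligation per (current state, transition taken or not) pair.
obligations = [(state, trans) for state in NEXT_STATE for trans in (True, False)]
```

Each pair in `obligations` corresponds to one induction lemma of the kind sketched above, for example `("issued", True)` for the issued-to-dispatched case.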
We illustrate the proof of another lemma, issued_remains_induction (shown in 6), in greater detail, pointing out how the feedback logic gets verified. As above, the proof obligation reduces to showing that Value_issued(q,rsi) and Value_issued (next(q,s,i) is zero, then it implies that in the current cycle, the instruction pointed to by rs_src_ptr1 completes its execution and forwards its result to rsi. So it is easy to prove that rs_src_value1' (the value actually written back in the implementation) is the same as the expected value Complete_instr(q,rs_src_ptr1(q)(rsi)). If rs_src_ptr1' is non-zero, then one can conclude from the induction hypothesis that rs_src_ptr1 computes the same value in q and in next(q,s,i).
Proving the commutative diagram

Consider the case when no new instruction is issued in the current cycle, that is, the synchronization function returns zero. The commutative diagram obligation in this case is shown in 7.

7
    % sch_rs_slot (i.e., scheduler output New_slot) is valid means no
    % free reservation stations.
    commutes_no_issue: LEMMA
      (no_op?(i) OR rs_valid(q)(sch_rs_slot(s)))
        IMPLIES rf(ABS(q)) = rf(ABS(next(q,s,i)))
We expand the definition of ABS (shown in 2) and consider a particular register r. This again leads to three cases, as in the correctness of op_val1_same. Consider the case when rtt (i.e., rtt(q)(r)) is zero. We then show that rtt' is zero too and the values of register r match in q and next(q,s,i). Consider the case when rtt is non-zero. Then rtt' may or may not be zero. If rtt' is zero, then it implies that in the current cycle, the instruction pointed to by rtt completes its execution and writes its result to r. It is easy to show that this value written into r is the same as the expected value Complete_instr(q,rtt(q)(r)). If rtt' is non-zero, then we use the same_result lemma to conclude that the same value is written into r in q and next(q,s,i).
The case when a new instruction is issued is similar to the above, except when r is the destination register of the instruction being issued. We show that in state next(q,s,i), the new instruction is in the issued state, its operands as given by op_val1 and op_val2 equal the ones given by the specification machine, and the value written into r by the implementation machine equals the value given by the specification machine.
The program counter pc is incremented whenever an instruction is fetched. This is the only way pc is modified. So proving the commutative diagram for pc is simple. The commutative diagram proof for the instruction memory is trivial since it is not modified at all.
The invariants needed

We describe in this section all seven invariants needed by our proof. We do not have a uniform strategy for proving all these invariants, but we use the automatic case-analysis strategy shown earlier to do the simplifications during the proofs.
Two of the invariants are related to rs_instr_num and instr_counter, the auxiliary variables introduced for defining a measure for every instruction. The first invariant states that the measure of any instruction (rs_instr_num) is less than the running counter (instr_counter). The second one states that for any instruction, if the source operands are not ready, then the measure of the instructions producing the source values is less than the measure of the instruction. The need for these was realized when we decided to introduce the two auxiliary variables mentioned above. Two other invariants are related to rs_exec_ptr, the auxiliary variable that maintains the index of the execution unit an instruction is dispatched to. The first invariant states that, if rs_exec_ptr is non-zero, then that execution unit is busy and its tag (which records the instruction executing in the unit) points to the instruction itself. The second invariant states that, whenever an execution unit is busy, the instruction pointed to by its tag is valid and that instruction's rs_exec_ptr points to the execution unit itself. These invariants are very similar to ones we needed in an earlier verification effort [HSG99]. Two other invariants characterize when an instruction is valid. The first one states that for any register, the instruction pointed to by rtt is valid. The second one states that for any given instruction, the instructions pointed to by rs_src_ptr1 and rs_src_ptr2 are valid. The final invariant we needed was that rs_exec_ptr for any instruction is non-zero if and only if rs_disp? (a boolean variable that says whether or not an instruction is dispatched) is true. The need for these three invariants was realized during the proofs of other lemmas/invariants.
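Four of the seven invariants (the two on the measure and the two on validity) can be phrased as an executable check on a single state. The sketch below is illustrative only: the `Instr` record and `invariants_hold` are hypothetical names, a zero pointer is taken to mean "ready" or "no pending writer", and the execution-unit invariants are omitted because execution units are not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    valid: bool
    instr_num: int
    src_ptr1: int = 0   # 0 means the operand is ready
    src_ptr2: int = 0

def invariants_hold(rs, rtt, instr_counter):
    """Check the measure and validity invariants on one implementation state."""
    for ins in rs.values():
        if not ins.valid:
            continue
        if not ins.instr_num < instr_counter:
            return False  # measure must stay below the running counter
        for ptr in (ins.src_ptr1, ins.src_ptr2):
            if ptr != 0 and not (rs[ptr].valid
                                 and rs[ptr].instr_num < ins.instr_num):
                return False  # producers must be valid and issued earlier
    # Every non-zero rtt entry must point to a valid instruction.
    return all(tag == 0 or rs[tag].valid for tag in rtt)

rs = {1: Instr(True, 1), 2: Instr(True, 2, src_ptr1=1), 3: Instr(False, 0)}
good = invariants_hold(rs, rtt=[0, 2, 0], instr_counter=3)
bad = invariants_hold(rs, rtt=[0, 3, 0], instr_counter=3)  # rtt names a free station
```

In the proof these properties are shown to be preserved by every transition; the check above only tests them on given states.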
PVS proof timings:
The proofs of all the lemmas and the invariants discussed so far take about 500 seconds on a 167 MHz Ultra Sparc machine.²
Other obligations - liveness properties
We provide a sketch of the proof that the processor eventually gets flushed if no more instructions are inserted into it. The proof that the synchronization function eventually returns a nonzero value is similar. The proofs involve a set of obligations on the implementation machine, a set of fairness assumptions on the inputs to the implementation and a high level argument using these to prove the two liveness properties. All the obligations on the implementation machine are proved in PVS. In fact, most of them are related to the "instruction state" transitions shown in Figure 4, and the additional obligations needed (not proved earlier) take only about 15 seconds on a 167 MHz Ultra Sparc machine. We now provide a sketch of the high level argument, which is being formalized in PVS.
Proof sketch: The processor is flushed if for all registers r, rtt(q)(r) = 0.
First, we show that "any valid instruction in the dispatched state eventually goes to the executed state and remains valid" and "any valid instruction in the executed state eventually gets written back and its reservation station will be freed". Consider a valid instruction rsi in the dispatched state. If in state q, Execute_trans?(q,s,i,rsi) is true, then rsi goes to the executed state in next(q,s,i) and remains valid (refer to Figure 4). Otherwise it continues to be in the dispatched state and remains valid. We observe that when rsi is in the dispatched state, the scheduler inputs that determine when an instruction should be executed are enabled, and these remain enabled as long as rsi is in the dispatched state. By a fairness assumption on the scheduler, it eventually decides to execute the instruction (i.e., Execute_trans?(q,s,i,rsi) will be true) and in next(q,s,i), the instruction will be in the executed state and be valid. By a similar argument, it eventually gets written back and the reservation station gets freed.

² The manual effort involved in doing the proofs was one person week. The authors had verified a processor with a reorder buffer earlier [HSG99] and most of the ideas/proofs carried over to this example.
Second, we show that "every busy execution unit eventually becomes free and stays free until an instruction is dispatched on it". This follows from the observation that whenever an execution unit is busy, the instruction occupying it is in the dispatched/executed state and that such an instruction eventually gets written back (first observation above).

Third, we show that "a valid instruction in the issued state will eventually go to the dispatched state and be valid". Here, the proof is by induction (with rs_instr_num as the measure), since an arbitrary instruction rsi could be waiting for two previously issued instructions to produce its source values. Consider a valid instruction rsi in the issued state. If the source operands of rsi are ready, then we observe that the scheduler inputs that determine dispatching remain asserted as long as rsi is not dispatched. Busy execution units eventually get free and remain free until an instruction is dispatched on them (second observation above). So by a fairness assumption on the scheduler, rsi eventually gets dispatched. If a source operand is not ready, then the instruction producing it has a lower measure. By the induction hypothesis, it eventually goes to the dispatched state and eventually gets written back (first observation), forwarding the result to rsi. By a similar argument as above, rsi eventually gets dispatched.

Finally, we show that "the processor eventually gets flushed". We observe that every valid instruction in the processor eventually gets written back, freeing its reservation station (third and first observations). Since no new instructions are being inserted, free reservation stations remain free. Whenever rtt(q)(r) is non-zero, it points to an occupied reservation station. Since all reservation stations eventually get free, all rtt entries become zero and the processor is flushed.
Related Work
The problem of verifying the control logic of out-of-order execution processors has received considerable attention in the last couple of years, using both theorem proving and model checking approaches. In particular, prior to our work, one theorem prover based and three model checking based verifications of a similar example (a processor implementing Tomasulo's algorithm without a reorder buffer) have been carried out.
The theorem prover based verification reported in [AP98] is based on refinement and the use of a "predicted value". They introduce this "predicted value" as an auxiliary variable to help in comparing the implementation against its specification without constructing an intermediate abstraction. However, there is no systematic way to generate the invariants and the obligations needed in their approach, and they do not address the liveness issues needed to complete the proof.
A model checking based verification of Tomasulo's algorithm is carried out in [McM98]. He uses compositional model checking and aggressive symmetry reductions to manually decompose the proof into smaller correctness obligations via refinement maps. Setting up the refinement maps requires information similar to that provided by the completion functions, in addition to some details of the design. However, the proof is dependent on the configuration of the processor (number of reservation stations etc.) and also on the actual arithmetic operators.
Another verification of Tomasulo's algorithm is reported in [BBCZ98], where they combine symbolic model checking with uninterpreted functions. They introduce a data structure called a reference file for representing the contents of the register file. While they abstract away from the data path, the verification is for a fixed configuration of the processor and there is no decomposition of the proof.
Yet another verification, based on assume-guarantee reasoning and refinement checking, is presented in [HQR98]. The proof is decomposed by providing the definitions of suitable "abstract" modules and "witness" modules. However, the proof can be carried out for a fixed small configuration of the processor only.
Finally, verification of a processor model implementing Tomasulo's algorithm with a reorder buffer, exceptions and speculative execution is carried out in [SH98]. Their approach relies on constructing an explicit intermediate abstraction (called MAETT) and expressing invariant properties over it. Our approach avoids the construction of an intermediate abstraction and hence requires significantly less manual effort.
Conclusion
We have shown in this paper how to extend the completion functions approach to be applicable in a scenario where the instructions are committed out of order, and illustrated it on a processor implementation of Tomasulo's algorithm without a reorder buffer. Our approach led to an elegant decomposition of the proof based on the "instruction state" transitions and did not involve the construction of an intermediate abstraction. The proofs made heavy use of an automatic case-analysis strategy and addressed both safety and liveness issues.
We are currently developing a PVS theory of the "eventually" temporal operator to mechanize the liveness proofs presented here. We are also working on extending the completion functions approach further to verify a detailed out-of-order execution processor (with a reorder buffer) involving branches, exceptions and speculative execution.
