INTRODUCTION
In this chapter, we extend a set of algebraic tools for microprocessors (see Fox and Harman [1996a]) to model superscalar microprocessor implementations, and apply them to a case study. In superscalar microprocessors, the timing of events in an implementation can be substantially different from that of the architecture that they implement. We develop the existing correctness models to accommodate the more advanced timing relationships of superscalar processors, and consider formal verification. We illustrate our tools and techniques with an in-depth treatment of an example superscalar implementation, first seen in a simpler form in Fox and Harman [1996a].
We are particularly interested in models of time and temporal abstraction. Clocks divide time into (not necessarily equal) segments, defined by the natural timing of the computational process of a device: for example, the execution of machine instructions, or some system clock. We formally relate clocks by surjective, monotonic maps called retimings. In the case of superscalar microprocessors, the normal relationship between 'architectural time' and 'implementation time' is complicated by the fact that events that are distinct in time at the architectural level can occur simultaneously at the implementation level.
Interesting recent work on pipelined microprocessors includes Windley and Coe [1994] on UINTA, a processor of moderate complexity, and its verification in HOL (Gordon and Melham [1993]); and Miller and Srivas [1995a], Miller and Srivas [1995b] on AAMP5, a more complex processor, and its verification in PVS. In both UINTA and AAMP5, the development and consideration of timing abstraction is conceptually similar to our own. However, we put less emphasis on using software tools to automate verification of case studies, and focus more on mathematical models. Also of interest is Melham [1993], which again has a somewhat similar model of time. More recently, superscalar processors have been addressed: in particular, the increased complexity of verification in the face of complex timing behaviour (Windley and Burch [1996], Burch [1996]). We consider this question further in §5 and §9.1.
Other, earlier, work on microprocessors includes the following. Gordon's Computer, since considered, in various forms, by others (for example, Stavridou [1993]). Viper, further considered in Arora et al. [1993]. Landin's SECD machine (Landin [1963]), considered in Birtwistle and Graham [1990]. A PDP-11-based processor, the FM8501, and its more advanced successor, the FM9001 (see also Bose and Johnson [1993]).
The structure of this paper is as follows. In §2 we introduce the basic iterated map model of a microprocessor. In §3 we consider how we may express the correctness of one model of a (non-superscalar) microprocessor with respect to another, at a different level of abstraction, when both are represented as iterated maps. In §4 we informally introduce the fundamental aspects of superscalar microprocessors. In §5 we consider how our correctness model from §3 must be modified for superscalar microprocessors. In §6 we introduce a simple machine architecture. In §7 we informally introduce ACS, a superscalar implementation of our simple architecture. In §8 we formalise ACS in detail. Although we include the majority of the formal representation of ACS, space considerations force us to omit certain parts. A full treatment can be found in Fox [1997] . Finally, in §9, we consider the correctness of ACS, and the problems of the formal verification of superscalar processors.
BASIC MODELS OF MICROPROCESSORS
In general, we model a microprocessor using an iterated map State : T × STATE → STATE, of the form:

State(0, state) = init(state),
State(t + 1, state) = next(State(t, state)),

where:
• T is a copy of the natural numbers N, representing discrete time intervals, or clock cycles.
• STATE is the state-set of the microprocessor. Generally, this will be a Cartesian product of components representing registers, memory, etc.
• init : STATE → STATE is an initialisation function, that enforces internal consistency of the initial state of the microprocessor (e.g. given memory m, program counter pc and instruction register ir, we expect: ir = m(pc)) and acts as an invariant in formal verification: see §9. In the case of an architectural-level model, init will often be the identity function. Furthermore, when considering init as an invariant, it should be as weak as possible ( §9).
• next : STATE → STATE is the next-state function, determining state evolution.
For simplicity, we have chosen to omit inputs and outputs as they are not needed in our case study. In practice, their inclusion causes no difficulty.
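The iterated map model can be sketched in a few lines of Python. This is a minimal illustration, not the formal algebra: the helper `iterated_map` and the toy machine (`toy_init`, `toy_next`, with a state of the form `(pc, ir, mem)`) are hypothetical names introduced here, and the toy machine only exists to show init enforcing the consistency condition ir = m(pc).

```python
from typing import Callable, TypeVar

S = TypeVar("S")


def iterated_map(init: Callable[[S], S],
                 nxt: Callable[[S], S]) -> Callable[[int, S], S]:
    """Build State : T x STATE -> STATE from init and next."""
    def state(t: int, s: S) -> S:
        current = init(s)        # State(0, s) = init(s)
        for _ in range(t):       # State(t + 1, s) = next(State(t, s))
            current = nxt(current)
        return current
    return state


# Hypothetical toy machine: a state is (pc, ir, mem), mem is a tuple of words.
def toy_init(s):
    pc, _ir, mem = s
    return (pc, mem[pc], mem)    # enforce the invariant ir = mem[pc]


def toy_next(s):
    pc, _ir, mem = s
    new_pc = (pc + 1) % len(mem)
    return (new_pc, mem[new_pc], mem)


toy_state = iterated_map(toy_init, toy_next)
```

Calling `toy_state(0, s)` on an inconsistent state `s` returns the repaired state, which is the sense in which init acts as an invariant; choosing `init` to be the identity gives the architectural-level case mentioned above.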
The choice of T , STATE , init and next controls the level of abstraction of the formal representation. For example, the choice of clock T controls the level of timing abstraction. We can choose clock cycles of T to represent system clock cycles, which would be appropriate in the case of a low-level representation of an implementation; or we could choose cycles of T to represent instruction execution, with each clock cycle lasting precisely one instruction. This latter choice (an instruction clock ) would be more appropriate for a high-level, architectural description. Notice in the latter case that clock cycles would typically vary in length, since in general different instructions will have different execution times. Additionally, the choice of STATE controls the level of data abstraction. If we wish to represent a microprocessor at the architectural level, we will choose STATE to represent those components visible to the programmer. If we wish to represent an implementation, STATE will additionally include components not visible to the programmer (e.g. buffer registers, cache memories, etc.).
In addition to timing and data abstraction, we can consider structural abstraction, where a formal representation is sub-divided into component parts, representing the physical structure of the implementation. We may partition, or decompose, state-set STATE, iterated map State and next-state function next to reflect both the physical partitioning of the microprocessor, and the conceptual sub-tasks that must be performed in instruction execution.
We may consider many different levels of abstraction when modeling microprocessors. However, we will restrict our attention to two: the programmer's model PM, corresponding to the user-visible architectural level, and the abstract circuit model AC, corresponding to a high-level view of the implementation, commonly called the organisation.
SIMPLE CORRECTNESS MODELS
Given two descriptions of microprocessors

State PM : T × STATE PM → STATE PM and State AC : S × STATE AC → STATE AC,

how do we formulate the statement: "State AC correctly implements State PM"?
Figure 1: A simple retiming.
Retimings
First, we must consider how we can relate times on two different clocks. Given two clocks T and S, a function λ : S → T is called a retiming if it is: (i ) monotonic, ensuring time always runs forwards on T and S; and (ii ) surjective, ensuring that each time t ∈ T corresponds with some time s ∈ S. We denote the set of all such retimings by Ret(S, T ). In the case of microprocessors, we can construct state-dependent retimings λ : STATE → Ret(S, T ) that are functions of the state-set of a microprocessor representation. For example, in the case that T represents an instruction clock, and S a system clock, then λ would map times on S to the time on T corresponding to the execution of the current machine instruction. A simple retiming is illustrated in Figure 1 . We can build a number of formal tools based on retimings. In this paper, we require the following.
• The immersion: λ : T → S, defined by
• The start function start :
Further discussion of retimings can be found in , Harman and Tucker [1990] , and .
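The following Python sketch illustrates a retiming and the two tools listed above. It rests on assumptions: the immersion is taken to be the least s ∈ S with λ(s) = t, and the start function is taken to be start(λ)(s) = immersion(λ)(λ(s)), which is how these notions are usually set up for retimings. The helper `make_retiming` and its `durations` parameter are hypothetical, introduced only to build an example.

```python
from typing import Callable, Sequence

Retiming = Callable[[int], int]


def make_retiming(durations: Sequence[int]) -> Retiming:
    """Retiming lam : S -> T where instruction t lasts durations[t] >= 1 cycles.

    Assumption for the sketch: every instruction beyond the listed ones
    takes exactly one cycle, so lam is total, monotonic and surjective.
    """
    def lam(s: int) -> int:
        t = 0
        for d in durations:
            if s < d:
                return t
            s -= d
            t += 1
        return t + s
    return lam


def immersion(lam: Retiming) -> Callable[[int], int]:
    """Assumed definition: the least s in S with lam(s) = t."""
    def lam_bar(t: int) -> int:
        s = 0
        while lam(s) != t:       # terminates because lam is surjective
            s += 1
        return s
    return lam_bar


def start(lam: Retiming) -> Callable[[int], int]:
    """Assumed definition: first S-cycle of the T-cycle containing s."""
    lam_bar = immersion(lam)
    return lambda s: lam_bar(lam(s))


# Three instructions taking 3, 1 and 2 system clock cycles.
lam = make_retiming([3, 1, 2])
cycle_to_instruction = [lam(s) for s in range(8)]
```

Under these assumptions, the times s with `start(lam)(s) == s` are exactly the cycles at which one instruction has ended and the next begins.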
Correctness Statement
We construct a commutative diagram representing the correctness of State AC with respect to State P M as follows:
where:
• λ : STATE AC → Ret(S, T ) is a state-dependent retiming, mapping the system clock and state of State AC to the instruction clock of State P M .
• ψ : STATE AC → STATE PM is a projection function, discarding the elements of STATE AC not visible in STATE PM.
We say that State AC is correct with respect to State PM if the above diagram commutes, for all times s ∈ S such that:
that is, for all times corresponding with the end of a machine instruction, and for all states state ∈ STATE AC .
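Read as an equation, commutativity of such a diagram would usually take the following shape. This is a hedged reconstruction from the maps λ and ψ introduced above, and it assumes the start function is defined so that start(λ)(s) = s holds exactly at instruction boundaries; it should not be taken as the definitive statement.

```latex
\[
  \psi\bigl(State_{AC}(s,\, state)\bigr)
  \;=\;
  State_{PM}\bigl(\lambda(state)(s),\; \psi(state)\bigr)
\]
\[
  \text{for all } state \in STATE_{AC}
  \text{ and all } s \in S
  \text{ with } start(\lambda(state))(s) = s .
\]
```

The left-hand side runs the implementation and then discards hidden components; the right-hand side discards them first and then runs the architecture for the retimed number of instruction cycles.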
SUPERSCALAR MICROPROCESSORS
A pipelined processor allows new instructions to commence before their predecessors have finished execution. For example, instruction i may be fetched while instruction i − 1 is being decoded, i − 2 is being executed, and i − 3's result written back to registers or memory. Clearly, it is necessary to ensure that the relationships, or dependencies, between instructions permit this (see, for example, Stone [1993]).
Pipelined processors permit several instructions to be in different stages of execution simultaneously. Superscalar processors extend this by allowing more than one instruction to be processed at each stage. Several instructions may start execution simultaneously, and finish together, or even out of program order. To achieve this the processor must contain multiple pipeline units at each stage. Machines capable of parallel instruction execution have existed since the 1960s: for example, the CDC 6600 and the IBM 360/91 (Anderson et al. [1967]). These machines were not superscalar, though they did contain parallel functional units: necessary because of the advanced pipelining techniques they used. The IBM 360/91 used Tomasulo's algorithm for the issuing logic (Tomasulo [1967]), commonly used in modern superscalar microprocessors for scheduling instruction execution. Superscalar processors first appeared in the late 1980s: for example, the IBM RISC System/6000 [1991]. There are five types of dependency: the first three apply to both superscalar and pipelined processors.
• Data dependency occurs when an instruction needs the result of a preceding instruction.
• Procedural dependency occurs when a branch instruction disrupts the normal (sequential) flow of execution; requiring any work done on subsequent instructions to be discarded. This can be a particularly significant source of delay in the case of conditional branches, where the outcome may not be known until late in the pipeline.
• Resource conflicts occur when two instructions simultaneously require the same hardware resources (functional units, etc.). This problem can often be reduced by duplicating hardware, at a cost. For example, providing two addition units removes the resource conflict between a pair of add instructions.
The remaining two dependencies apply only to superscalar processors.
• Antidependency occurs when an instruction overwrites the arguments of a preceding instruction. This is significant if instructions are allowed to execute out of order.
• Output dependency occurs when two instructions wish to store results at the same destination, which, again, is significant if instructions are allowed to execute out of order.
These dependencies must be resolved if the processor is to function correctly: that is, to be functionally indistinguishable from a non-superscalar sequential architecture model. In a sequential architecture, each instruction is assumed to finish before its successor. This is a natural model for instruction-level computation at the PM level of abstraction. Furthermore, it is also the model used by older implementations of currently-popular architectures. In order to preserve the natural model, and maintain backward compatibility, it is necessary for superscalar processors to retain, or be able to reconstruct, a state which corresponds exactly with a state of the PM level. A precise architectural state is a processor state which meets this condition. A processor can generate precise states if dependencies are correctly resolved and results are written to PM-level state components in program order. This is particularly important in the case of exceptions. For example, if instruction i causes an exception because of an error, it is reasonable to expect that instruction i + 1 has not yet executed.
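The three register-based dependencies described above (data, anti and output) can be detected by comparing the read and write sets of a pair of instructions. The sketch below is illustrative only: the `Instr` record and the `dependencies` function are hypothetical names, and procedural dependencies and resource conflicts are deliberately left out because they depend on control flow and on the hardware available.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Instr:
    reads: FrozenSet[str]    # locations the instruction takes arguments from
    writes: FrozenSet[str]   # locations the instruction stores results to


def dependencies(earlier: Instr, later: Instr) -> Set[str]:
    """Classify how `later` depends on `earlier` (earlier in program order)."""
    found: Set[str] = set()
    if earlier.writes & later.reads:
        found.add("data")     # later needs a result of earlier
    if earlier.reads & later.writes:
        found.add("anti")     # later overwrites an argument of earlier
    if earlier.writes & later.writes:
        found.add("output")   # both store to the same destination
    return found


# add r1, r2, r3 followed by add r3, r4, r1
first = Instr(reads=frozenset({"r1", "r2"}), writes=frozenset({"r3"}))
second = Instr(reads=frozenset({"r3", "r4"}), writes=frozenset({"r1"}))
```

Here `second` reads `r3`, which `first` writes (a data dependency), and writes `r1`, which `first` reads (an antidependency), so the pair cannot safely be reordered without further support.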
There are a number of techniques used to maximise throughput, whilst observing dependencies and maintaining a precise architectural state. The most common method is some form of register renaming, where additional registers are used to resolve instruction dependencies, and to temporarily store results. A common form of register renaming is a reorder buffer. This consists of a circular buffer of registers, used to (temporarily) store the results of computations before they are committed, or retired, to PM-level state components. Whenever an instruction is dispatched for execution, the next available slot in the buffer is reserved to store the result. Results are inserted into the buffer, in the appropriate location, when they become available. They are then removed from the head of the buffer, and are stored in architectural components, in program order. For example, suppose instructions i, i + 1 and i + 3 generate results at time t, but that i + 2 has yet to finish. The results of i and i + 1 can be transferred to PM-level components, but i + 3 must wait until i + 2 has also finished. This ensures a precise architectural state: if the machine is interrupted at this point, execution will restart with instruction i + 2, and i + 3 will be re-executed. In order to speed up execution, implementations generally permit bypassing. That is, results in the reorder buffer can be used by subsequent instructions (subject to dependencies) before they are moved to the relevant PM-level components.
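The reorder buffer behaviour described above can be sketched as a small circular buffer. This is a simplified illustration under stated assumptions: the class name `ReorderBuffer` and its methods `reserve`, `complete` and `commit` are hypothetical, results are opaque values, and bypassing is not modelled.

```python
from typing import Any, List, Optional, Tuple


class ReorderBuffer:
    """Circular buffer of result slots, committed strictly from the head."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots: List[Optional[Tuple[bool, Any]]] = [None] * size
        self.head = 0     # oldest reserved slot
        self.count = 0    # number of reserved slots

    def reserve(self) -> int:
        """Called at dispatch: reserve the next free slot, in program order."""
        if self.count == self.size:
            raise RuntimeError("reorder buffer full: dispatch must stall")
        pos = (self.head + self.count) % self.size
        self.slots[pos] = (False, None)
        self.count += 1
        return pos

    def complete(self, pos: int, result: Any) -> None:
        """Called when a functional unit produces the result for slot pos."""
        if self.slots[pos] is None:
            raise ValueError("slot was not reserved")
        self.slots[pos] = (True, result)

    def commit(self) -> List[Any]:
        """Remove finished results from the head, stopping at the first gap."""
        committed: List[Any] = []
        while self.count > 0:
            done, result = self.slots[self.head]
            if not done:
                break
            committed.append(result)
            self.slots[self.head] = None
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return committed
```

Replaying the example from the text: with four reserved slots, completing the slots for i, i + 1 and i + 3 and then calling `commit()` releases only the results of i and i + 1; the result of i + 3 stays buffered until i + 2 completes.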
SUPERSCALAR CORRECTNESS
The correctness model in §3 is applicable to simple, microprogrammed implementations, and to more complex pipelined implementations, where each system clock cycle corresponds with the termination of at most one machine instruction. In a superscalar implementation, it is possible for multiple instructions to terminate on a single clock cycle. That is, there may be cycles of clock T that correspond with no cycle of clock S. Hence there is no retiming from S to T . To solve this problem, we introduce a new retirement clock R. Cycles of clock R mark the committal of one or more machine instructions. We can construct two retimings λ 1 : T → R, mapping instruction clock cycles to retirement clock cycles, and λ 2 : S → R, mapping system clock cycles to retirement clock cycles. We can construct the map ρ : S → T from system clock cycles to instruction clock cycles by composition:
Function ρ is illustrated in Figure 2 . Note that ρ is not a retiming since it need not be surjective. If, for example, instructions i and i+1 commit simultaneously, it is not meaningful to talk about the correctness of State AC after instruction i, since there is no time at which instruction i has terminated, and instruction i + 1 has not.
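A small Python sketch makes the failure of surjectivity concrete. It assumes that ρ is formed by applying λ2 and then the immersion of λ1 (the least instruction cycle mapped to a given retirement cycle), which is the composition the types S → R and T → R suggest; the particular retimings `lam1` and `lam2` are invented for the example.

```python
from typing import Callable

Retiming = Callable[[int], int]


def immersion(lam: Retiming) -> Callable[[int], int]:
    """Assumed definition: the least argument that lam maps to r."""
    def lam_bar(r: int) -> int:
        t = 0
        while lam(t) != r:       # terminates because lam is surjective
            t += 1
        return t
    return lam_bar


def compose_rho(lam1: Retiming, lam2: Retiming) -> Callable[[int], int]:
    """rho : S -> T, assumed to be immersion(lam1) after lam2."""
    lam1_bar = immersion(lam1)
    return lambda s: lam1_bar(lam2(s))


def lam1(t: int) -> int:
    """T -> R: instructions 0 and 1 commit together in retirement cycle 0."""
    return 0 if t <= 1 else t - 1


def lam2(s: int) -> int:
    """S -> R: system cycles 0..2 all belong to retirement cycle 0."""
    return 0 if s <= 2 else s - 2


rho = compose_rho(lam1, lam2)
visible_instruction_times = [rho(s) for s in range(5)]
```

In this example instruction time 1 is never in the image of `rho`: there is no system clock cycle at which instruction 0 has committed and instruction 1 has not.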
PM: A SIMPLE ARCHITECTURE
To illustrate our algebraic tools, we will introduce a simple machine architecture PM, consisting of: separate data and instruction memories; a set of registers; and five instructions, with the following informal meanings.
• add reg_a, reg_b, reg_c - Add register reg_a to register reg_b, store the result in register reg_c and increment the program-counter.
• branch addr - If register reg_0 is zero then add the program-counter to addr and store the result in the program-counter; otherwise increment the program-counter.
• load reg_a, addr - Load the contents of the data memory, at location addr, into register reg_a, and increment the program-counter.
• store reg_a, addr - Store the contents of register reg_a in the data memory at location addr, and increment the program-counter.
• set reg_a, val - Store the constant val in register reg_a, and increment the program-counter.
We will first describe the state algebra of P M, defining the state-space, clock and state function. We then describe the next state algebra, defining the initialisation and the next-state function.
The P M state algebra is as follows:
where $\vec{state}$ is a tuple of type STATE PM. We assume all states are valid initial states for time zero: hence there is no need for an initialisation function. Each cycle of clock T corresponds with one machine instruction.
The state-space of the architecture is STATE PM = Mem × Mem × PC × Reg.
The various components, and subcomponents, of ST AT E PM are defined as follows.
+ is the PM register width),
and W_n = Bit^n (n ∈ N+) is the set of n-bit words (Bit = {0, 1}). The typical state element will be of the form (mp, md, pc, reg) ∈ STATE PM where: mp ∈ Mem is the program memory; md ∈ Mem is the data memory; pc ∈ PC is the program-counter; and reg ∈ Reg is the set of registers.
The P M next-state algebra is as follows:
The function · + 1 is the successor clock cycle function which enumerates time from the initial cycle 0 ∈ T . The next-state function next PM is defined as follows
where ins = mp(pc) is the next instruction to be executed. There are six cases; one for each type of instruction with the exception of branch, which has two cases (taken and not taken). The generic notation r [v/a] is an abbreviation for functions of the form:
The functions op : R → OP , ra, rb, rc : R → RC , and addr, val : R → PC extract the op-code, register and address/value fields of an instruction, respectively.
The definitions of op, ra, rb, rc, addr and val are omitted. The pad function simply extends bit strings by padding with zeros. The definition is omitted.
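The informal instruction meanings given earlier in this section can be turned into an executable sketch of the PM next-state function. This is an illustration under assumptions, not the formal definition: instructions are represented as Python tuples rather than bit-field words (so op, ra, rb, rc, addr, val and pad are not modelled), `WORD_BITS` is an assumed word width with wrap-around arithmetic, unwritten data memory reads as zero, and `update` plays the role of the substitution notation r[v/a], read in the usual way as "r with a mapped to v".

```python
from typing import Dict, NamedTuple, Tuple

WORD_BITS = 32                    # assumed word width, not fixed by the text
MASK = (1 << WORD_BITS) - 1


class State(NamedTuple):
    mp: Tuple[tuple, ...]         # program memory: one tuple per instruction
    md: Dict[int, int]            # data memory
    pc: int                       # program-counter
    reg: Dict[int, int]           # registers


def update(r: Dict[int, int], v: int, a: int) -> Dict[int, int]:
    """r[v/a]: a copy of r that maps a to v and agrees with r elsewhere."""
    new = dict(r)
    new[a] = v
    return new


def next_pm(s: State) -> State:
    ins = s.mp[s.pc]              # ins = mp(pc), the next instruction
    op = ins[0]
    inc = (s.pc + 1) & MASK
    if op == "add":
        _, ra, rb, rc = ins
        total = (s.reg[ra] + s.reg[rb]) & MASK
        return s._replace(pc=inc, reg=update(s.reg, total, rc))
    if op == "branch":
        _, addr = ins
        if s.reg[0] == 0:         # branch taken
            return s._replace(pc=(s.pc + addr) & MASK)
        return s._replace(pc=inc) # branch not taken
    if op == "load":
        _, ra, addr = ins
        return s._replace(pc=inc, reg=update(s.reg, s.md.get(addr, 0), ra))
    if op == "store":
        _, ra, addr = ins
        return s._replace(pc=inc, md=update(s.md, s.reg[ra], addr))
    if op == "set":
        _, ra, val = ins
        return s._replace(pc=inc, reg=update(s.reg, val & MASK, ra))
    raise ValueError(f"unknown instruction: {ins!r}")
```

The six return paths correspond to the six cases mentioned above: one per instruction, with branch split into taken and not taken. Iterating `next_pm` from an initial state gives the PM state function, with each application standing for one cycle of the instruction clock T.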
ACS: AN INFORMAL DESCRIPTION
We introduce an implementation ACS of P M that will include component parts, typical of a superscalar microprocessor, in a somewhat simplified form. We will permit out-of-order instruction issue and execution, and will enforce in-order instruction retirement by means of a scoreboarding algorithm, implemented using a re-order buffer, thus preserving a precise architectural state. We will permit a maximum of four instructions to execute simultaneously, dependencies permitting.
The implementation presented lacks some features, such as register renaming, present in many superscalar machines (see for example Smith and Sohi [1995] ). Instructions can execute out-of-order and are committed in-order. Thornton's algorithm is used to resolve dependencies instead of the more complex Tomasulo's algorithm (Weiss and Smith [1984] ). In addition, we do not permit bypassing of results from the reorder buffer. However, our example does have the essential property of a superscalar implementation: the ability to commit more than one instruction in a single cycle. The intention is to present a pedagogical example, that is not obscured by unnecessary complexity. None of the omitted features would be difficult to introduce. First we describe the physical structure of the microprocessor and its component parts. Then we consider the conceptual operations performed by groups of physical components, operating together. We discuss the relationship between physical components and conceptual operations in §8.
Processor Organisation
The ACS processor consists of the eight physical units shown in Figure 3 . We briefly describe each unit below.
1. The Instruction Cache stores machine instructions, performing the same function as the PM program memory. For simplicity, we will assume that the instruction cache is the same size as the PM program memory. In reality, it will be much smaller, and an attempt to fetch an instruction not in the cache will cause a cache miss. This will cause a block of memory containing the missing instruction to be fetched into the instruction cache.
Steps will need to be taken to decide which cache replacement strategy to use, and the operation of the processor may stall while the new instruction is fetched. Modeling this algebraically causes no difficulties. However, in ACS, it will complicate the representation in an unhelpful way, and is hence omitted.
2. The Instruction Buffer contains a fetch program counter and a buffer. The buffer stores a small number of instructions, fetched from the instruction cache using the fetch program counter. The buffer maintains a reserve of instructions available for processing in the event of a cache miss. We have eliminated cache misses in ACS, but the instruction buffer is a necessary part of a superscalar processor, and hence we have retained it.
3. The Decode Unit breaks down instructions into component parts: operation codes, and operand specifiers. The ACS processor has a very simple, fixed instruction format, and hence the decode unit is very simple. Real processors, with more complex instruction sets, will generally require a more sophisticated decode unit.
4. The Issue Unit contains four reservation stations (Tomasulo [1967]), one for each functional unit. These schedule program execution, by determining dependencies, and then forward instructions that are free to execute to the appropriate functional units.
5. The Functional Units compute results based on instructions and their operands. There are two adders (allowing add instructions to execute in parallel), a load-store unit and a branch unit.
6. The Reorder Buffer Unit stores instruction results prior to their committal to PM-level components (data cache, registers and program counter). Results from instructions that finish execution out of program order are held in a buffer until their predecessors have finished execution, at which point they are written back (committed) to the appropriate PM-level components.
7. The Register Unit contains the PM registers, the PM program counter, and a reset flag for clearing the pipeline. It also contains an array of bits indicating which registers are currently in use, and a list of memory locations currently in use.
8. The Data Cache performs the same function as the PM data memory. As with the instruction cache, the data cache is assumed to be the same size as the P M model data memory, eliminating cache misses. Again, there is no difficulty in modeling a smaller cache algebraically; though we would have to deal with the additional question of the cache write strategy, since, unlike the instruction cache, it is possible to write to the data cache.
Each of these units is involved in one or more processor operations.
Processor Operations
There are six operations involved in instruction execution, outlined below.
1. Instruction Fetch. Instructions are read from the current fetch program counter value, which attempts to predict the future evolution of the PM program counter, assuming there are no branches taken. Clearly, any taken branches will render subsequently-fetched instructions invalid. To simplify ACS, branches are not treated differently: that is, branches are predicted as not taken. Correctly predicting branch destinations significantly improves performance, because branching to the 'wrong' destination requires the pipeline to be emptied. In practice, it may be more effective to predict branches to be taken; and maintaining a history of previous behaviour would certainly be better. Modifying ACS to accommodate either of these (or other) strategies would present no difficulties.
3. Instruction Dispatch. When resources are available, instructions are dispatched (added) to the end of the appropriate reservation station's buffer, and a place is reserved in the reorder buffer to hold the result. Instructions can then be issued (removed), in any order, from the reservation station, when all dependencies are resolved.
4. Instruction Execution. The implementation contains two adders, a load-store unit and a branch unit, for executing: add and set; load and store; and branch instructions respectively. Two add instructions may be executed simultaneously, or one add and one set.
5. Instruction Reorder. Instruction results (from the functional units) are inserted into their pre-reserved slots in the reorder buffer.
6. Instruction Committal. Once results reach the top of the reorder buffer they are removed and committed to PM state components (register, program counter or data cache). This means that the PM state components always contain a precise architectural state, because instruction i is not allowed to commit until all instructions i′, i′ < i, have generated results in the reorder buffer, and have themselves committed.
ACS: A FORMAL DESCRIPTION
We will formally represent ACS as follows.
• We describe the state algebra of ACS in §8.1. This algebra defines the state-space, AC clock and state function of ACS.
• We then describe the next state algebra in §8.2. This algebra defines the initialisation and next-state functions of ACS. There is one next-state function for each of the physical units in ACS.
• Finally, we describe the processor operation algebra in §8.3. Each of the next-state functions for the units described in §8.2 is defined in terms of processor operations, for example, instruction decode. Each operation will typically affect a number of processor units.
The relationships between processor operations and units are shown in Figure 4. Operations are represented by rectangular boxes and machine units are represented by rounded boxes. For example, the operation Execute affects the issue unit (from which instructions are removed) and the functional units (where instruction results are computed). Execute receives input from the issue unit, the register unit and the data cache.
The State Algebra
The state algebra for the ACS processor consists of carrier sets for time S and the state of the processor STATE , and the state function State ACS , which returns a new state of the processor, given a time and an initial state.
The state function is defined below.
The hidden function next ACS : STATE → STATE is defined below. 
Processor Initialisation and Next-State Functions
The next-state algebra for ACS consists of carrier sets for time S and the state of the processor STATE , together with carrier sets for individual units (the Cartesian product of which constitutes the overall state set STATE ). The operations in the next-state algebra are the successor function for time, the initialisation function for the processor init ACS , and the individual next-state functions for each of the units. The components of this algebra are defined in the following sections.
Processor State-Space and Clock
The machine state-space is the Cartesian product of the state-spaces of the eight main ACS units.
A typical vector will be of the form.
Figure 5 gives a pictorial representation of the AC state of the processor. The state-space of ACS is hierarchical in that each physical unit has its own state, which is constructed of simpler state sets, that may in turn be made up of yet more primitive components. The full name of a state component is of the form
· ready is the ready bit of the first operand, of the first adder reservation station in the issue unit. To simplify the definitions, and where no ambiguity results (the vast majority of cases), we will omit name elements.
The state-space of ACS makes heavy use of buffers and lists. We assume the existence of a general-purpose finite buffer algebra, and a general-purpose finite list algebra, which we will not define here: see Fox [1997] . In general, Buffer (bufsize,Data) and List (bufsize,Data) are the sets of finite buffers and lists respectively, of size bufsize containing elements from the set Data. We will informally introduce buffer and list operations as required.
In Figure 5, we imply concrete buffer and list operations, based on register files, with head and tail pointers (and usage bits in the case of lists). Only in the case of Dispatch (§8.3.3) do we require such a concrete definition. Given we are specifying hardware, this is not unreasonable.
The processor state-space is parameterised by eight constants, defining the sizes of the buffers and reservation stations in the processor: ibufsize = instruction buffer entries; decsize = decode unit entries; addsize1 = reservation station entries for the first adder; addsize2 = reservation station entries for the second adder; brsize = reservation station entries for the branch unit; lsrsize = reservation station entries for the load-store unit; reordersize = reorder buffer entries; and memusize = memory address usage entries.
These constants would be instantiated for any given concrete implementation of the processor and, in practice, are limited by technological constraints.
The clock S synchronises the processor's operations and is not a measure of absolute time: that is, it is not a system clock. The clock rate of an implementation of ACS would be determined by the minimum speed of the processor units: the maximum time taken for the slowest unit to reach its next state. It is also assumed that all of the processor's functional units will compute results in one clock cycle. In a real processor, each of the individual processor units may have their own pipelines. Typically, functional units for, say, floating point operations may take some cycles to compute. We have chosen to ignore some technological realities in order to simplify the example. For example, no limit is placed on the number of instructions that can be committed in any one cycle. In practice, bus width and cache/memory bandwidth would impose a limit of (currently) a few instructions per cycle.
Processor Initialisation
A strong definition for the processor initialisation function init ACS is:
The function above simply sets the reset flag (in the register unit) to one, effectively emptying the pipeline. While simple, this definition of initialisation causes problems when considering the correctness of ACS: see §9. Ideally, we would like the weakest possible initialisation function; that is, one which only modifies the initial state if it is internally inconsistent. Unfortunately, given the complexity of ACS, such an initialisation function is difficult to define: see §9 and Fox [1997].

Figure 5: The main components of the ACS processor.
The Instruction Cache
The state of the instruction cache is Icache = [PC → Reg]. Since we do not permit modifications to the program, the instruction cache does not change state. Hence there is no next-state function.
The Instruction Buffer
The state of the instruction buffer is:
with component names:
The fetch buffer $\vec{instbuf}$ stores instruction words fetched from the instruction cache. The register fetchpc holds the address of the next instruction to be fetched, 'guessing' the future value of the PM program counter by assuming the absence of branches.
The next-state function Ibuffer : STATE → Ibuffer is defined below.
If the pipeline is not being reset then instruction buffer entries at the head of $\vec{ibuffer}$ are removed by the operation Decode : Ibuffer × Decode → Ibuffer × Decode (§8.3.2), and new ones inserted at the tail of $\vec{ibuffer}$ by the operation Fetch : Icache × Ibuffer → Icache × Ibuffer (§8.3.1). If the pipeline is being reset (i.e. after a branch), then the instruction buffer is cleared with the buffer operation Reset, which empties the buffer. It is then filled with entries fetched from the instruction cache using the current PM program counter value pc, rather than the fetch program counter fetchpc. Note that <Reset($\vec{instbuf}$), pc> is a tuple of type Ibuffer. We use <···> to identify tuples, here and elsewhere, to avoid confusion over the numbers and types of arguments of functions.
The projection function $\pi_{\vec{ibuffer}}$ : Ibuffer × Decode → Ibuffer is defined as follows:
In general, projection functions of the form π x project out the state element named x. We will omit the definition of such functions from now on. The hidden function IBuffMrg : Ibuffer × Ibuffer → Ibuffer combines the results of instruction decode and instruction fetch.
The fetch program counter is affected by the Fetch operation but not by the Decode operation; hence $fetchpc_2$ is discarded. Instruction buffers $\vec{instbuf}_1$ and $\vec{instbuf}_2$ are combined using the buffer function Merge, which concatenates the contents of two buffers.
The Decode and Dispatch Unit
The state of the decode and dispatch unit is:
with component names: ra, rb, rc, addr) .
The fields correspond with instruction op-code, registers and address/value field respectively. The next-state function DecodeDisp. : STATE → Decode is defined below.
If the pipeline is not being reset then entries are dispatched to the issue unit Issue by the operation Dispatch : Decode × Issue × ReorderBuf × Registers → Decode × Issue × ReorderBuf × Registers (§8.3.3), and decoded from the instruction buffer Ibuffer by Decode, with results combined by Merge. If the pipeline is being reset then the decode and dispatch unit is emptied by the buffer operation Reset.
The Issue Unit
The state of the issue unit is:
where $AddIssue1 = List_{(addsize_1, AddEntry)}$, $AddIssue2 = List_{(addsize_2, AddEntry)}$, $BrIssue = List_{(brsize, BrEntry)}$, $LsrIssue = List_{(lsrsize, LsrEntry)}$, and $AddEntry = RegOpnd^2 \times RC \times W_{\log_2(reordersize)}$,
with component names:
The register operand entries each consist of a ready flag ready, a register location sr, and a machine word opnd. If the flag ready is set then the operand will already be stored in opnd, otherwise it will be fetched later from register location sr.
The branch unit reservation station consists of a flag ready, a machine word opnd, a reorder buffer offset offset, a memory address addr, and an assigned location in the reorder buffer reorderpos. If the flag ready is set then opnd will contain the current contents of register zero. Otherwise, this will be fetched later. The offset field is used to ensure the pc-relative branch is made to the correct address. If the reorder buffer is not empty when the branch is dispatched then the PM program counter value upon dispatch will be inconsistent with the branch instruction address. The offset field compensates for this.
A load-store reservation station entry consists of a register operand $\overrightarrow{regopnd}$ (with the same structure as an adder reservation station's operand entry), an address operand $\overrightarrow{memopnd}$, a memory destination address da, a destination register dr, a flag load, and an assigned location in the reorder buffer reorderpos.
$$
Issue(\overrightarrow{state}) =
\begin{cases}
IssueRmv(<\pi_{\overrightarrow{issue}}(Dispatch(\overrightarrow{decode}, \overrightarrow{issue}, \overrightarrow{reorderbuf}, \overrightarrow{registers}, Dcache))>,\ <\pi_{execs}(Execute(\overrightarrow{issue}, \overrightarrow{registers}, Dcache))>), & \text{if } reset = 0; \\
(Reset(\overrightarrow{addissue}_1),\ Reset(\overrightarrow{addissue}_2),\ Reset(\overrightarrow{brissue}),\ Reset(\overrightarrow{lsrissue})), & \text{otherwise.}
\end{cases}
$$
IssueRmv(<
Each reservation station has the entries of executed instructions removed by the list operation Remove, which removes a specified element from a list.
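The removal step can be illustrated with a minimal Python sketch. It assumes that reservation station locations are 1-based and that location 0 means 'no instruction executed' (consistent with the test toexec > 0 used for the adder units); the names remove and issue_rmv are illustrative stand-ins for Remove and IssueRmv, not the definitions from Fox [1997].

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def remove(entries: Sequence[T], loc: int) -> List[T]:
    """Drop the entry at 1-based position loc; loc == 0 removes nothing."""
    return [e for i, e in enumerate(entries, start=1) if i != loc]


def issue_rmv(issue: Tuple[Sequence[T], ...],
              execs: Tuple[int, ...]) -> Tuple[List[T], ...]:
    """Apply remove to each reservation station, using the location
    reported for that station by the execute stage (assumed pairing)."""
    return tuple(remove(station, loc) for station, loc in zip(issue, execs))
```

For example, issue_rmv((["a1", "a2"], ["b1"], [], ["l1", "l2"]), (1, 0, 0, 2)) removes the first entry of the first station and the second entry of the last, leaving the other two stations untouched.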
The Functional Units
The states of the functional units are:
where
load, done, reorderpos).
The adder functional units consist of a result word result, a destination register address dest, a flag done, and a reorder buffer location reorderpos. The flag done is set when the unit has finished computing a result to be inserted into the reorder buffer. The branch functional unit consists of a memory address result, flags taken and done, and a reorder buffer location reorderpos. The flag taken is set when the condition for the branch holds. The flag done serves the same purpose as that in the adder functional units. The load-store functional unit consists of a result word result, a destination address dest, flags load and done, and a reorder buffer location reorderpos. The load flag distinguishes between load and store instructions; the done flag serves the same purpose as in the adder and branch units.
The next-state function Functional : STATE → Functional is defined below. If the pipeline is not being reset then instructions are executed, otherwise all registers are set to zero. Note that clearing just the done flags would be sufficient to reset each functional unit, since this would prevent results from the functional units being moved to the reorder buffer (§8.3.5).
The Reorder Buffer Unit
The state of the reorder buffer unit is:
word).
The reorder buffer entry consists of a three-bit content-type word type, a destination word dest, and a result word word.
Valid type values will be represented by unique constants wait, skip, dcache, reg and count.
The next-state function Reorder : STATE → ReorderBuf is defined below.
If the pipeline is not being flushed, entries are removed from the buffer by the operation Commit : ReorderBuf × Registers × Dcache → ReorderBuf × Registers × Dcache (§8.3.6), place-holding entries are added (to hold future results) by Dispatch, and available results from the functional units are inserted into their assigned entries by the operation ReorderInsert : ReorderBuf × Functional → ReorderBuf (§8.3.5). If the pipeline is being flushed then the reorder unit is cleared using the buffer operation Reset.
The Register Units
The state of the register unit is:
This unit consists of the register array reg, a register usage table regu, a memory address usage list $\overrightarrow{memu}$, a flag reset, and the program counter pc. The components regu and $\overrightarrow{memu}$ are used to keep track of registers and memory locations that are to be written to by instructions that have been dispatched and have not yet committed. This information is used to resolve dependencies in instruction dispatch.
The next-state function Registers : STATE → Registers is defined below.
If the pipeline is not being reset, then the register usage table and memory usage list are updated by Dispatch. After this, all components may be altered by Commit. Dispatch and commit results are combined by simple composition. If the pipeline is being reset then the register usage table and the memory usage list are cleared, and the reset flag is set to zero. The hidden constant emptyregu ∈ [RC → Bit] is defined by
The Data Cache
The state of the data cache is Dcache = [PC → Reg], and the next-state function Dcache : STATE → Dcache is defined below.
If the pipeline is not being reset then the data cache is only affected by Commit. If the pipeline is being reset, the data cache remains unchanged.
Processor Operations
The next-state functions for each of the physical units of ACS are defined in terms of conceptual operations. The six operation stages form the ACS Operations algebra, defined below.
Algebra ACS Operations Carrier Sets
Icache, Ibuffer, Decode, Issue, Functional, ReorderBuf, Registers, Dcache
The components of this algebra are defined in the following sections.
Instruction Fetch
The operation Fetch : Icache × Ibuffer → Icache × Ibuffer is defined below.
Instructions are repeatedly fetched from the instruction cache, and added to the instruction buffer using the buffer operation Push, until the instruction buffer is full (tested by the buffer operation Full). In ACS, the operation is bounded by the size of the instruction buffer. In a real microprocessor, the bandwidth of the bus between the instruction cache and the instruction buffer will also bound the number of instructions that can be transferred in a clock cycle.
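The fetch loop described above might be sketched as follows. This is an illustration under stated assumptions, not the ACS definition: the instruction cache is modelled as a dictionary, the program counter is assumed to advance by one per instruction word, unmapped addresses are assumed to read as zero, and the parameter capacity stands in for the size of the instruction buffer.

```python
from typing import Dict, List, Tuple


def fetch(icache: Dict[int, int], instbuf: List[int], fetchpc: int,
          capacity: int) -> Tuple[List[int], int]:
    """Push instruction words onto the buffer until it is full.

    Returns the new buffer contents and the updated fetch program counter.
    The cache is read-only, so it is not returned.
    """
    buf = list(instbuf)                      # leave the caller's buffer intact
    while len(buf) < capacity:               # plays the role of 'not Full(buf)'
        buf.append(icache.get(fetchpc, 0))   # plays the role of Push
        fetchpc += 1                         # assumed word-addressed increment
    return buf, fetchpc
```

With a four-word cache {0: 10, 1: 11, 2: 12, 3: 13}, a buffer already holding [10], fetchpc = 1 and capacity 3, the call returns ([10, 11, 12], 3).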
Instruction Decode
The operation Decode : Ibuffer × Decode → Ibuffer × Decode is defined below.
Instructions are removed from the instruction buffer using the buffer operation Pop, decoded by DecodeEntry, and then added to the decode-dispatch unit buffer using the buffer operation Push, until either the instruction buffer is emptied (tested by the buffer operation Empty), or the decode-dispatch unit buffer is full. The operation is bounded by the sizes of these two buffers. The hidden function DecodeEntry : Reg → OP × $RC^3$ × PC decodes an instruction word.
DecodeEntry(ibufentry) = (op(ibufentry), ra(ibufentry), rb(ibufentry), rc(ibufentry), addr(ibufentry)).
There are five fields: an op-code, three register addresses and a data memory address; each of these fields is extracted using the functions op, ra, rb, rc and addr from §6.
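A small Python sketch of the decode stage follows. The bit layout used by decode_entry (a 32-bit word with a 4-bit op-code, three 4-bit register fields and a 16-bit address) is purely hypothetical, since the field extraction functions of §6 are not reproduced here; dec_capacity is likewise an assumed parameter for the size of the decode-dispatch buffer.

```python
from typing import List, NamedTuple, Tuple


class DecEntry(NamedTuple):
    op: int
    ra: int
    rb: int
    rc: int
    addr: int


def decode_entry(word: int) -> DecEntry:
    """Split an instruction word into its five fields (hypothetical layout)."""
    return DecEntry(
        op=(word >> 28) & 0xF,
        ra=(word >> 24) & 0xF,
        rb=(word >> 20) & 0xF,
        rc=(word >> 16) & 0xF,
        addr=word & 0xFFFF,
    )


def decode(instbuf: List[int], decbuf: List[DecEntry],
           dec_capacity: int) -> Tuple[List[int], List[DecEntry]]:
    """Pop, decode and push until the instruction buffer is empty
    or the decode-dispatch buffer is full."""
    ibuf, dbuf = list(instbuf), list(decbuf)
    while ibuf and len(dbuf) < dec_capacity:
        dbuf.append(decode_entry(ibuf.pop(0)))   # Pop from head, Push at tail
    return ibuf, dbuf
```

The loop makes the two bounds mentioned in the text explicit: it stops as soon as either buffer condition fails.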
Instruction Dispatch
The operation Dispatch : Decode × Issue × ReorderBuf × Registers → Decode × Issue × ReorderBuf × Registers affects the decode-dispatch unit, the issue unit, the reorder buffer and the register unit.
1. The decode-dispatch unit buffer is popped.
2. A reservation station entry is constructed by AddIssEnt, and pushed onto the reservation station list of the first adder. This entry includes the address of the next available reorder buffer entry (reorderbuf · tail + 1). Note, it is assumed that the concrete implementation of the buffer described in Fox [1997] is being used.
3. An entry for the result of the addition in the reorder buffer (labeled 'wait') is pushed onto the reorder buffer.
The destination register rc is marked in use (
The hidden function AddIssEnt : DecEntry × Registers × $W_{\log_2(reordersize)}$ → AddEntry is defined below.

$$
AddIssEnt(\overrightarrow{decentry}, \overrightarrow{registers}, reorderpos) = (DispatchOperand(ra, \overrightarrow{registers}),\ DispatchOperand(rb, \overrightarrow{registers}),\ rc,\ reorderpos).
$$
The operands ra and rb are fetched (if ready; see DispatchOperand below), the destination register is set to rc and the reorder position becomes the successor to the current reorder buffer tail.
The hidden function DispatchOperand : RC × Registers → RegOpnd is defined below.
DispatchOperand(sr,
If the source register sr is reserved (that is, the register sr is already in use), then the ready bit is set to zero and sr is stored in the reservation station; the contents of sr will be fetched later. If the register is not in use then the ready bit is set to one, and the contents of sr are stored in the reservation station.
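The behaviour of DispatchOperand, and its use in building an adder reservation station entry, can be sketched in Python. This is a reading of the prose description only: the register unit is reduced to a register array reg and a usage table regu, the placeholder value 0 stored in opnd for a reserved register is an assumption, and the function names are illustrative.

```python
from typing import NamedTuple, Sequence, Tuple


class RegOpnd(NamedTuple):
    ready: int   # 1 if opnd already holds the operand value
    sr: int      # source register location
    opnd: int    # operand word (meaningful only when ready == 1)


def dispatch_operand(sr: int, reg: Sequence[int],
                     regu: Sequence[int]) -> RegOpnd:
    """Fetch the operand now if sr is free, otherwise defer the fetch."""
    if regu[sr] == 1:                 # register reserved by an earlier instruction
        return RegOpnd(0, sr, 0)      # opnd placeholder is an assumed value
    return RegOpnd(1, sr, reg[sr])    # operand available at dispatch time


def add_iss_ent(ra: int, rb: int, rc: int, reg: Sequence[int],
                regu: Sequence[int],
                reorderpos: int) -> Tuple[RegOpnd, RegOpnd, int, int]:
    """Build an adder reservation station entry from decoded fields."""
    return (dispatch_operand(ra, reg, regu),
            dispatch_operand(rb, reg, regu),
            rc, reorderpos)
```

For instance, with reg = [0, 7, 9] and regu = [0, 1, 0], register 1 is reserved, so its operand is deferred, while register 2 is read immediately.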
The hidden function CanDispatchAdd1 :
In order to dispatch an add instruction to the first adder unit:
1. there must be an add instruction at the top of the (non empty) decode unit;
2. the first adder reservation station and reorder buffer must not be full;
3. the destination register for the add must not be in use; and
4. the first adder reservation station must have fewer or the same number of instructions pending execution as the second adder station, or the second adder reservation station must be full.
This prevents the same instruction from being dispatched to both adder reservation stations.
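The four dispatch conditions can be collected into a single predicate. The Python sketch below is an illustration only: the op-code constant ADD, the representation of the decode unit as a list with its top at index 0, and the explicit length and size parameters are all assumptions made for the example.

```python
from typing import NamedTuple, Sequence

ADD = 1  # hypothetical op-code for add instructions


class DecEntry(NamedTuple):
    op: int
    ra: int
    rb: int
    rc: int
    addr: int


def can_dispatch_add1(decode: Sequence[DecEntry], add1_len: int,
                      add2_len: int, reorder_len: int,
                      regu: Sequence[int], addsize1: int, addsize2: int,
                      reordersize: int) -> bool:
    """Check conditions 1 to 4 for dispatching to the first adder."""
    if not decode or decode[0].op != ADD:                    # condition 1
        return False
    if add1_len >= addsize1 or reorder_len >= reordersize:   # condition 2
        return False
    if regu[decode[0].rc] != 0:                              # condition 3
        return False
    # Condition 4: prefer the less loaded station unless the second is full.
    return add1_len <= add2_len or add2_len >= addsize2
```

Condition 4 is what balances load between the two adders: the first adder accepts the instruction only when it is no busier than the second, or when the second cannot accept anything.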
The process of dispatching branch, store, load and set instructions, and of dispatching add instructions to the second adder unit, is similar, and we omit the definitions.
Instruction Execution
The operation Execute : Issue × Registers × Dcache → $\mathbb{N}^4$ × Functional returns a 4-tuple of reservation station locations, representing instructions that have been executed, together with the state of each of the functional units, and is defined below.
The functions Adder, Branch, and LoadStore each return a reservation station location and functional unit state as a pair. The projection
projects out the reservation station location, and the projection
projects the functional unit state.
Executing Add and Set Instructions
The hidden function Adder is defined below.
$Adder : (List_{(addsize_1, AddEntry)} \cup List_{(addsize_2, AddEntry)}) \times Registers \to \{0, \ldots, \max(addsize_1, addsize_2)\} \times Adder$,
where $toexec = Find(\overrightarrow{addissue}, CanExecuteAdd_{regu})$ is the reservation station location of an add instruction that may be able to execute. The list operation Find is defined in Fox [1997]. If an add instruction is able to execute (toexec > 0), then toexec is returned together with the state of the adder functional unit after executing reservation station entry toexec. Note that the oldest instructions in a reservation station list will be executed first, subject to dependencies: see the definition of Find in Fox [1997].
At this stage of the instruction pipeline, set instructions are indistinguishable from add instructions of the form 0 + x: see Fox [1997] . Therefore, we need take no special steps to execute them.
The hidden function AddExecute : AddEntry × Reg → Adder is defined below.
AddExecute(
The two register operands are added, the destination address dr and reorder unit position are copied, and the done flag is set to one. The hidden function GetOperand : RegOpnd × Reg → Reg is defined below.
GetOperand(
If ready = 1 then the operand has already been fetched, and is stored in opnd. Otherwise the operand is fetched from the appropriate PM-level register $reg_{sr}$. The family of sets CanExecuteAdd of executable add instructions is defined below.
and
An add instruction may be executed if both its register operands are ready.
The function Ready : RegOpnd × [RC → Bit] → B is defined below.
Ready(
The operand is ready if it has already been fetched (ready = 1) or it is no longer being used as the destination of a waiting instruction (regu(sr) = 0). Note, if an add instruction is using the same register as destination and operand, then the ready bit will not be set until after its operands have been dispatched: see DispatchAdd1, AddIssEnt and DispatchOperand in §8.3.3. Therefore, an instruction will not attempt to wait for itself to commit before being executed. The process of executing branch, store and load instructions is similar, and we omit the definitions.
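The execution of add instructions described in this section can be drawn together in one Python sketch. It follows the prose rather than the formal definitions: find_executable approximates Find by scanning the station oldest-first, the 32-bit WORD_MASK is an assumed word size, and the idle unit state (0, 0, 0, 0) returned when nothing can execute is an assumption.

```python
from typing import NamedTuple, Sequence, Tuple

WORD_MASK = 0xFFFFFFFF  # hypothetical 32-bit machine word


class RegOpnd(NamedTuple):
    ready: int
    sr: int
    opnd: int


class AddEntry(NamedTuple):
    a: RegOpnd
    b: RegOpnd
    dr: int
    reorderpos: int


def ready(o: RegOpnd, regu: Sequence[int]) -> bool:
    """Operand already fetched, or its source register no longer reserved."""
    return o.ready == 1 or regu[o.sr] == 0


def get_operand(o: RegOpnd, reg: Sequence[int]) -> int:
    """Use the stored operand if fetched, else read the register now."""
    return o.opnd if o.ready == 1 else reg[o.sr]


def find_executable(station: Sequence[AddEntry], regu: Sequence[int]) -> int:
    """1-based location of the oldest executable entry, or 0 if none."""
    for loc, e in enumerate(station, start=1):
        if ready(e.a, regu) and ready(e.b, regu):
            return loc
    return 0


def adder(station: Sequence[AddEntry], reg: Sequence[int],
          regu: Sequence[int]) -> Tuple[int, Tuple[int, int, int, int]]:
    """Return (toexec, (result, dest, done, reorderpos))."""
    toexec = find_executable(station, regu)
    if toexec == 0:
        return 0, (0, 0, 0, 0)          # assumed idle state, done = 0
    e = station[toexec - 1]
    result = (get_operand(e.a, reg) + get_operand(e.b, reg)) & WORD_MASK
    return toexec, (result, e.dr, 1, e.reorderpos)
```

An entry whose first operand waits on a reserved register is skipped in favour of a younger entry whose operands are both ready, which is the out-of-order behaviour the reservation stations are meant to provide.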
Filling the Reorder Buffer
The operation ReorderInsert : ReorderBuf × Functional → ReorderBuf is defined below.
ReorderInsert(
Results are inserted into the reorder buffer from each of the four functional units. The hidden function ReorderAdd : ReorderBuf × Adder → ReorderBuf is defined below.
ReorderAdd(
The reorder buffer entry consists of (i) a destination flag, set to reg in the case of an add or set instruction; (ii) the (padded) register destination; and (iii) the result of the addition (result). The position at which the entry is inserted within the reorder buffer (reorderpos) is stored within the adder functional unit. If no instruction has been executed by the addition unit (done = 0), the reorder buffer is unchanged. The buffer operation Insert is defined in Fox [1997].
The hidden function ReorderBranch : ReorderBuf × Branch → ReorderBuf is defined below. There are two types of reorder buffer entries for branch instructions. A branch which is taken is flagged with count, whereas a branch which is not taken is flagged skip. In the case of a taken branch, the branch unit result (i.e. the branch address) is stored in the destination field of the reorder buffer entry.
The hidden function ReorderLsr : ReorderBuf × LoadStore → ReorderBuf is defined below.
ReorderBranch($\overrightarrow{reorderbuf}$, $\overrightarrow{branch}$) = Insert($\overrightarrow{reorderbuf}$, (count,
ReorderLsr(
and load = 0;
otherwise.
If the instruction executed was a load then the reorder buffer entry is flagged with reg. If the instruction was a store then the reorder buffer entry is flagged with dcache.
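The way results from the four functional units are written into the reorder buffer can be sketched in Python. The sketch follows the prose descriptions: the buffer is modelled as a dictionary from reorder positions to (type, dest, word) entries, padding of destinations is ignored, the zero values used for unused fields are assumptions, and insert is a stand-in for the buffer operation Insert of Fox [1997].

```python
from typing import Dict, Tuple

WAIT, SKIP, DCACHE, REG, COUNT = "wait", "skip", "dcache", "reg", "count"
Entry = Tuple[str, int, int]          # (type, dest, word)
Buf = Dict[int, Entry]


def insert(buf: Buf, pos: int, entry: Entry) -> Buf:
    """Return a copy of buf with entry stored at position pos."""
    new = dict(buf)
    new[pos] = entry
    return new


def reorder_add(buf: Buf, unit: Tuple[int, int, int, int]) -> Buf:
    result, dest, done, pos = unit
    return insert(buf, pos, (REG, dest, result)) if done == 1 else buf


def reorder_branch(buf: Buf, unit: Tuple[int, int, int, int]) -> Buf:
    result, taken, done, pos = unit
    if done != 1:
        return buf
    if taken == 1:                    # branch address goes in the dest field
        return insert(buf, pos, (COUNT, result, 0))
    return insert(buf, pos, (SKIP, 0, 0))


def reorder_lsr(buf: Buf, unit: Tuple[int, int, int, int, int]) -> Buf:
    result, dest, load, done, pos = unit
    if done != 1:
        return buf
    return insert(buf, pos, (REG if load == 1 else DCACHE, dest, result))


def reorder_insert(buf: Buf, add1, add2, branch, lsr) -> Buf:
    """Insert the results of all four units, one after another."""
    buf = reorder_add(reorder_add(buf, add1), add2)
    return reorder_lsr(reorder_branch(buf, branch), lsr)
```

Each unit writes only to the position it was assigned at dispatch, so the order in which the four insertions are composed does not affect the outcome provided those positions are distinct.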
Instruction Committal
The map $\psi : STATE_{ACS} \to STATE_{PM}$ is defined below. Dcache, pc, reg) ,
The duration functions $dur_1 : STATE \to T^+$ and $dur_2 : STATE \to S^+$, where $T^+ = \{t \in T \mid t > 0\}$ and $S^+ = \{s \in S \mid s > 0\}$, are defined below.
Duration function $dur_1$ counts the number of instructions committed for each cycle of clock R. Duration function $dur_2$ counts the number of system clock cycles for each cycle of event clock R. The function Committed : $STATE_{ACS} \to T^+$ gives the number of instructions committed from a given state.
Committed(
The function CanCmt : ReorderBuf → B is defined below.
$$
CanCmt(\overrightarrow{reorderbuf}) =
\begin{cases}
tt, & \text{if not } Empty(\overrightarrow{reorderbuf}) \text{ and } \pi_{type}(Top(\overrightarrow{reorderbuf})) \neq wait; \\
ff, & \text{otherwise.}
\end{cases}
$$
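A brief Python sketch of the commit test follows. It assumes that the entry at the top of the reorder buffer can be committed once its type is something other than wait, a wait entry being a placeholder pushed at dispatch whose result has not yet arrived. The helper ready_head_count is hypothetical: it merely counts leading non-placeholder entries and is not the function Committed, which must also account for branches and resets.

```python
from typing import List, Tuple

WAIT = "wait"
Entry = Tuple[str, int, int]   # (type, dest, word)


def can_cmt(reorderbuf: List[Entry]) -> bool:
    """True if the buffer is non-empty and its top entry holds a result."""
    return bool(reorderbuf) and reorderbuf[0][0] != WAIT


def ready_head_count(reorderbuf: List[Entry]) -> int:
    """Hypothetical helper: number of consecutive committable entries
    at the head of the buffer."""
    count = 0
    buf = list(reorderbuf)
    while can_cmt(buf):
        buf.pop(0)
        count += 1
    return count
```

Because entries leave the buffer strictly from the top, a single waiting entry blocks every younger result behind it, which is how in-order committal is preserved.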
VERIFYING ACS
Space precludes a full formal discussion of the normal process of verifying the correctness of a microprocessor, expressed as an iterated map in the form above. However, informally, the argument proceeds as follows. Microprocessors expressed as iterated maps are functions only of their initial state, some number of clock cycles, and (possibly) some inputs. They do not depend on the numeric value of time. That is, given an initial state $\sigma_0$, and neglecting inputs, suppose we run a microprocessor representation F for $t_1 + t_2$ clock cycles, and finish (at time $t = t_1 + t_2 - 1$) in state $\sigma_n$. We would reach the same state $\sigma_n$ if we first ran F for $t_1$ cycles, reaching state $\sigma_t$, reset time to zero, and then ran F for $t_2$ cycles, now starting in state $\sigma_t$.
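The time-invariance property described above can be checked on a toy example. In the Python sketch below, step is a made-up next-state function on (pc, acc) pairs and has nothing to do with ACS; it only illustrates that running an iterated map for $t_1 + t_2$ cycles agrees with running it for $t_1$ cycles and then, from the state reached, for a further $t_2$ cycles.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def iterate(step: Callable[[S], S], t: int, state: S) -> S:
    """Apply the next-state function t times, starting from state."""
    for _ in range(t):
        state = step(state)
    return state


def step(state: Tuple[int, int]) -> Tuple[int, int]:
    """Toy next-state function: advance pc and accumulate its old value."""
    pc, acc = state
    return pc + 1, acc + pc


sigma0 = (0, 0)
whole = iterate(step, 5, sigma0)                     # t1 + t2 = 5 cycles
split = iterate(step, 3, iterate(step, 2, sigma0))   # t1 = 2, then t2 = 3
```

Here whole and split are both (5, 10). The property holds for any next-state function of this shape, because the state reached after $t_1$ cycles carries everything the remaining $t_2$ cycles depend on.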
Logically extending this argument, given a P M model 
(pc) = ir).
This can significantly simplify formal verification. A more formal discussion can be found in Fox and Harman [1996b], and a full account in Fox [1997] (including the conditions $STATE_{PM}$, $STATE_{ACS}$, λ and ψ must satisfy). The same simplification has also been observed, within the framework of their own formalisms, by others working on microprocessor verification; for example, Windley and Coe [1994], Miller and Srivas [1995b], Miller and Srivas [1995a], Windley and Burch [1996], Burch [1996].
There are several difficulties in the case of superscalar microprocessors.
1. The size of the state-space makes establishing that $State_{PM}(1, \psi(\overrightarrow{state})) = \psi(State_{AC}(\lambda(\overrightarrow{state})(1), \overrightarrow{state}))$ difficult, simply because of the number of cases to consider. A large proportion of the possible cases will be disallowed by $init_{AC}$; but even so, the number remaining is very large (Fox [1997]).
2. The complexity of the relationships within the state-space makes it difficult to construct an appropriately-weak initialisation function. The current initialisation function ( §8.2.2) for ACS simply resets the pipeline, and is very strong. However, given the size of the state-space, the complexity of the relationships between state components, and the consequent number of possible, consistent values for each of the state elements, a weak initialisation function is extremely complex (Fox [1997] ). In particular, it is not the case that only instructions currently being executed affect the state of the pipeline. Previously-committed instructions affect, for example, the lengths of queues in the reservation stations, and thus the route through the pipeline of subsequent instructions.
3. As well as being complex to construct and check, such an initialisation function will consume considerable resources in automated verification attempts because of the need to check
It is not clear how to address these problems, though a technique for the systematic generation of initialisation functions (termed invariants) has been considered, which may be helpful with 2. The problems essentially stem from the 'complexity' of the state-space, where 'complexity' in this context is some measure of (i) the number of separate state components; and (ii) the relationships between the state components. Point (i) leads to a large number of cases to be checked. Point (ii) makes establishing whether the processor is in a legal state, corresponding to one of the cases, complex and time consuming.
We are considering two possible approaches to reducing state-space complexity. The first is to make concessions in the implementation that reduce the complexity of the state-space. This could obviously affect performance negatively and, at first sight, seems to imply a return to less advanced implementations. However, this need not be the case. The aim is more subtle than simply making the state-space 'smaller': recall that a monolithic memory can be very large, yet is conceptually simple. It may be possible to reduce the number of discrete state components, and the complexity of their interrelationships, while still maintaining the potential for high instruction throughput. The second approach involves inserting a new level of abstraction between the current PM and AC levels. By doing this, it may be possible to conceal some of the complexities of the current AC level and hence simplify the representation of the processor. It would, of course, still be necessary to verify that the AC level is correct with respect to this new level of abstraction. However, we could consider each of the physical units of the processor in isolation, which would make verification more tractable.
CONCLUDING REMARKS
We have shown that the algebraic tools developed for representing simple, non-superscalar microprocessor implementations are equally applicable to complex superscalar examples. The algebraic techniques are not specific to any particular software, but are adaptable to a range of currently-available tools. We have developed, in considerable detail, a superscalar implementation. We have formulated the correctness conditions for the implementation with respect to an architecture. We have also briefly discussed the problems of formal verification. Future work will consider simplifying formal verification, by studying possible alternative techniques for reducing complexity. In addition, we intend to consider levels of abstraction higher than the current PM level. Stephenson [1996] considers algebraic models of high-level languages, compilers and abstract machine languages in a form very similar to our models of hardware. We wish to bridge the gap between abstract machine languages and the current PM level, in order to construct a unified algebraic model of computer systems, from high-level languages to abstract hardware.
