Modern microprocessors require an immense investment of time and effort to create and verify, from the high-level architectural design downwards. We are exploring ways to increase the productivity of design engineers by creating a domain-specific language for specifying and simulating processor architectures. We believe that the structuring principles used in modern functional programming languages, such as static typing, parametric polymorphism, first-class functions, and lazy evaluation, provide a good formalism for such a domain-specific language, and have made initial progress by creating a library on top of the functional language Haskell. We have specified the integer subset of an out-of-order, superscalar DLX microprocessor, with register renaming, a reorder buffer, a global reservation station, multiple execution units, and speculative branch execution. Two key abstractions of this library are the signal abstract data type (ADT), which models the simulation history of a wire, and the transaction ADT, which models the state of an entire instruction as it travels through the microprocessor.
Introduction
Modern microprocessor technologies have substantially increased processor performance. For example, pipelining allows a processor to overlap the execution of several instructions at once. With superscalar execution, multiple instructions are read per clock cycle. Out-of-order execution, where some instructions that logically come after a given instruction may be executed before the given instruction, can also greatly increase processor speed [6]. All of these technologies dramatically increase design complexity. In fact, creating and verifying these designs is a significant proportion of the total microprocessor development lifecycle. As the number of possible gates in future microprocessors increases exponentially, so too does design complexity.

* Supported by a graduate research fellowship from the NSF.
† Supported by a grant from Intel and a contract with Air Force Material Command, F19628-93-C-0069.
At OGI, we have developed the Hawk language for building executable specifications of microprocessors, concentrating on the level of micro-architecture. In the long term we plan for Hawk to be a standalone language. In the meantime we have embedded our language into Haskell, a strongly-typed functional language with lazy demand-driven evaluation, first-class functions, and parametric polymorphism [5, 12].
The library makes essential use of these features. As an example, we have used Hawk to specify and simulate the integer portion of a pipelined DLX microprocessor [4]. The DLX is a complete microprocessor and is a widely used model among researchers. Several DLX simulators exist, as well as a version of the GNU C compiler that generates DLX assembly instructions. The processor includes the most common instructions found in commercial RISC processors. Our specification, including data and control hazard resolution, is only two pages of Hawk code. A non-pipelined version of the processor was specified in half of a page.
In this report, we introduce the concepts behind Hawk. Rather than attempting a detailed explanation of the whole of the DLX with all of its inherent complexity, we have chosen to exhibit the techniques on a considerably simplified model. A corresponding annotated specification of the DLX itself can be found in [13].
The Hawk Library
We start with a simple example that introduces several functions used in later examples. Consider the resettable counter circuit of Figure 1.
The reset wire is Boolean valued, while the other wires are integer valued. Of course, in silicon, integer-valued wires are represented by a vector of Boolean wires, but as a design abstraction, a Hawk user may choose to use a single wire. The circuit counts and outputs the number of clock cycles since reset was last asserted.
Signals
Notice that there is no explicit clock in the diagram. Rather, each wire in the diagram carries a signal (integer or Boolean valued), which is an implicitly clocked value. The output of a circuit only changes between clock cycles. We build signals using an abstract type constructor called Signal. As a mental model we could think of a value of type Signal a as a function from integers to values of type a.
type Signal a = Int -> a
The integers denote the current time, measured as the number of clock cycles since the start of the simulation. Circuits and components of circuits are represented as functions from signals to signals. This view of signals is used extensively in the hardware verification community [9, 14]. Equivalently, we can think of signals as infinite sequences of values.
In the resettable counter example above, the constant 0 circuit outputs zero on every clock cycle. The select component chooses between its inputs on each clock cycle depending on the value of reset. If reset is asserted on a given cycle (has value True), then the output is equal to select's top input, in this case zero. If reset is not asserted, then its output is the value of its bottom input. In either case, select's output is the output of the entire circuit, as well as the input to the increment component, which simply adds 1 to its input. The output of increment is fed into the delay component. A delay component outputs whatever was on its input in the previous clock cycle: it "delays" its input by one cycle. However, on the first clock cycle of the simulation there is no previous input, so on the first cycle delay outputs whatever is on its init input, which is zero in this circuit.
Components
The components used in the resettable counter are trivial examples of the sorts of things provided by the Hawk library, but let's look at a specification of each component in turn.
The select component has type Signal Bool -> Signal a -> Signal a -> Signal a. This declares select to be a function. In a Hawk declaration, anything to the left of an arrow is a function argument. Thus, the expression select bs xs ys, where bs is a Boolean signal, and xs and ys are signals of type a, will return an output signal of type a. The values of the output signal are drawn from xs and ys, decided each clock tick by the corresponding value of bs. For example, if bs = True, False, True, False, ...; xs = x1, x2, x3, x4, ...; and ys = y1, y2, y3, y4, ...; then select bs xs ys is equal to the signal x1, y2, x3, y4, ... .
Hawk treats functions as first-class values, allowing them to be passed as arguments to other functions or returned as results. First-class functions allow us to specify a generic lift primitive, which "lifts" a normal function from type a to type b into a function over the corresponding signal types, Signal a and Signal b.

The delay function takes an initial value of type a and an input signal of type Signal a, and returns a value of type Signal a (the input arguments are in reverse order from the diagram). At clock cycle zero, the expression delay initVal xs returns initVal. Otherwise the expression returns whatever value xs had at the previous clock cycle. This function can thus propagate values from one clock cycle to the next. Note that delay is polymorphic, and can be used to delay signals of any type.
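The components described in this section can be sketched over lazy lists. This is a minimal model rather than the Hawk library itself: it assumes a signal is an infinite Haskell list with one element per clock cycle, and takeS is a hypothetical helper for observing a prefix of a signal.

```haskell
-- Assumption: a signal is modelled as a lazy list, one element per clock cycle.
type Signal a = [a]

-- constant: the same value on every clock cycle.
constant :: a -> Signal a
constant = repeat

-- lift: apply an ordinary function to a signal, cycle by cycle.
lift :: (a -> b) -> Signal a -> Signal b
lift = map

-- select: on each cycle take the value of xs when bs is True, else that of ys.
select :: Signal Bool -> Signal a -> Signal a -> Signal a
select = zipWith3 (\b x y -> if b then x else y)

-- delay: the initial value on cycle zero, then the input one cycle late.
delay :: a -> Signal a -> Signal a
delay initVal xs = initVal : xs

-- increment: add one to an integer signal, built from lift.
increment :: Signal Int -> Signal Int
increment = lift (+ 1)

-- takeS (hypothetical helper): observe the first n cycles of a signal.
takeS :: Int -> Signal a -> [a]
takeS = take
```

In this model, takeS 4 (select (cycle [True, False]) [1 ..] [10, 20 ..]) evaluates to [1, 20, 3, 40], which mirrors the x1, y2, x3, y4 pattern of the select example.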
Using the components
Once we have defined primitive signal components like the ones above, we can define the resettable counter. The resetCounter definition takes reset as a Boolean signal, and returns an integer signal. The reset signal is passed into select. On every clock cycle where reset returns True, select outputs 0, otherwise it outputs the result of the delay function. On the first clock cycle delay outputs 0, and thereafter outputs the result of whatever increment output was on the previous clock cycle. The output of the whole circuit is the output of the select function, here called output. Notice that output is used twice in this function: once as the input to increment, and once as the result of the entire function. This corresponds to the fact that the output wire in Figure 1 is split and used in two places. Whenever a wire is duplicated in this fashion, we must use a where clause in Hawk to name the wire.
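The counter described above can be written out as follows. This is a sketch under the assumption that signals are lazy lists; the component definitions are restated so that the fragment is self-contained, and they are illustrative rather than the library's own.

```haskell
-- Assumption: a signal is modelled as a lazy list, one element per clock cycle.
type Signal a = [a]

constant :: a -> Signal a
constant = repeat

select :: Signal Bool -> Signal a -> Signal a -> Signal a
select = zipWith3 (\b x y -> if b then x else y)

delay :: a -> Signal a -> Signal a
delay initVal xs = initVal : xs

increment :: Signal Int -> Signal Int
increment = map (+ 1)

-- resetCounter: counts clock cycles since reset was last asserted.
-- The wire named output feeds both the result and the increment component.
resetCounter :: Signal Bool -> Signal Int
resetCounter reset = output
  where
    output = select reset (constant 0) (delay 0 (increment output))
```

With reset asserted only on cycle 3, the first six cycles of resetCounter are 0, 1, 2, 0, 1, 2.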
Recursive Definitions
There is something else curious about the output variable. It is being used recursively in the same place it is being defined! Most languages only allow such recursion for functions with explicit arguments. In Hawk, one can also define recursive data-structures and functions with implicit arguments, such as the one above.
If we didn't have this ability, we would have had to define resetCounter as follows:
resetCounter reset = output
  where output time = select reset (constant 0) (delay 0 (increment output)) time
Every time we have a cycle in a circuit, we have to create a local recursive function, passing an explicit time parameter. This breaks the abstraction of the Signal ADT. In fact, in the real implementation of signals, we don't use functions at all. We use infinite lists instead. Each element of the list corresponds to a value at a particular clock cycle; the first list element corresponds to the first clock cycle, the second element to the second clock cycle, and so on. By storing signals as lazy lists, we compute a signal value at a given clock cycle only once, no matter how many times it is subsequently accessed.
Haskell allows recursive definitions of abstract data structures because it is a lazy language, that is, it only computes a part of a data structure when some client code demands its value. It is lazy evaluation that allows Haskell to simulate infinite data structures, such as infinite lists.
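To make the contrast concrete, the explicit-time definition can be run under the function model of signals. The sketch below assumes Signal a is a function from clock cycle to value, as in the mental model given earlier; sample is a hypothetical helper. Every lookup at time t walks back through all earlier cycles, which is the recomputation that the lazy-list representation avoids.

```haskell
-- Assumption: the "mental model" of a signal as a function from time to value.
type Signal a = Int -> a

constant :: a -> Signal a
constant x = \_ -> x

lift :: (a -> b) -> Signal a -> Signal b
lift f s = \t -> f (s t)

select :: Signal Bool -> Signal a -> Signal a -> Signal a
select bs xs ys = \t -> if bs t then xs t else ys t

-- delay: the initial value at time zero, otherwise the input at the previous cycle.
delay :: a -> Signal a -> Signal a
delay initVal xs = \t -> if t == 0 then initVal else xs (t - 1)

increment :: Signal Int -> Signal Int
increment = lift (+ 1)

-- The cycle in the circuit forces a local recursive function with an explicit time argument.
resetCounter :: Signal Bool -> Signal Int
resetCounter reset = output
  where
    output time = select reset (constant 0) (delay 0 (increment output)) time

-- sample (hypothetical helper): observe cycles 0 .. n-1 of a signal.
sample :: Int -> Signal a -> [a]
sample n s = map s [0 .. n - 1]
```

Here sample 6 (resetCounter (== 3)) gives [0, 1, 2, 0, 1, 2], the same behaviour as a list-based counter, but with nothing shared between successive lookups.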
A Simple Microprocessor
As we noted in the introduction, the DLX architecture is too complex to explain in fine detail in an introductory report. Thus for pedagogical purposes we show how to use similar techniques to specify a simple microprocessor called SHAM (Simple HAwk Microprocessor). We begin with the simplest possible SHAM architecture (unpipelined), and then add features: pipelining, and a memory cache.
The unpipelined SHAM diagram is shown in Figure 2. The microprocessor consists of an ALU and a register file. The ALU recognizes three operations: ADD, SUB, and INC. The ADD and SUB operations add and subtract, respectively, the contents of the two ALU inputs. The INC operation causes the ALU to increment its first input by one and output the result. The register file contains eight integer registers, numbered R0 through R7. Register R0 is hardwired to the value zero, so writes to R0 have no effect. The register file has one write-port and two read-ports. The write-port is a pair of wires: the register to update, called writeReg, and the value being written, called writeContents. The input to each read-port is a wire carrying a register name. The contents of the named read-port registers are output every cycle along the wires contentsA and contentsB. If a register is written to and read from during the same clock cycle, the newly written value is reflected in the read-port's output. This is consistent with the behavior of most modern microprocessor register files. SHAM instructions are provided externally; in our drive for simplicity there is no notion of a program counter. Each instruction consists of an ALU operation, the destination register name, and the two source register names. For each instruction the contents of the two source registers are loaded into the ALU's inputs, and the ALU's result is written back into the destination register.
Unpipelined SHAM Specification
Let us assume we have already specified the register file and ALU. The regFile specification takes a write-port input, two read-port inputs, and returns the corresponding read-port outputs. The alu specification takes a command signal and two input signals, and returns a result signal. Given these signatures and the previous definition of delay, it is easy in Hawk to specify an unpipelined version of SHAM, called sham1.

The definition of sham1 takes a tuple of signals representing the stream of instructions, and returns a pair of signals representing the sequence of register assignments generated by the instructions. The first three lines in the body of sham1 read the source register values from the register file and perform the ALU operation. The next two lines delay the destination register name and ALU output, in effect returning the values of the previous clock cycle. The delayed signals become the write-port for the register file. It is necessary to delay the write-port since modifications to the register file logically take effect for the next instruction, not the current one.
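The description of sham1 can be sketched as follows. This is an illustration under several assumptions: signals are lazy lists, and the bodies of regFile and alu are simple stand-ins written for this example (the text only assumes that such components exist). The names aluInputA, aluInputB, aluOutput and the primed variants are hypothetical.

```haskell
type Signal a = [a]

data Reg = R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7 deriving (Eq, Show)
data Cmd = ADD | SUB | INC deriving (Eq, Show)

delay :: a -> Signal a -> Signal a
delay initVal xs = initVal : xs

-- regFile (stand-in): the write is applied before the reads of the same cycle,
-- all registers start at zero, and writes to R0 are ignored.
regFile :: (Signal Reg, Signal Int) -> Signal Reg -> Signal Reg -> (Signal Int, Signal Int)
regFile (writeReg, writeContents) readA readB =
  unzip (go (const 0) writeReg writeContents readA readB)
  where
    go regs (w : ws) (v : vs) (a : ras) (b : rbs) =
      let regs' r = if r == w && w /= R0 then v else regs r
      in (regs' a, regs' b) : go regs' ws vs ras rbs
    go _ _ _ _ _ = []

-- alu (stand-in): one operation per clock cycle.
alu :: Signal Cmd -> Signal Int -> Signal Int -> Signal Int
alu = zipWith3 step
  where
    step ADD x y = x + y
    step SUB x y = x - y
    step INC x _ = x + 1

-- sham1: read the sources, run the ALU, and delay the result into the write-port.
sham1 :: (Signal Cmd, Signal Reg, Signal Reg, Signal Reg) -> (Signal Reg, Signal Int)
sham1 (cmd, destReg, srcRegA, srcRegB) = (destReg', aluOutput')
  where
    (aluInputA, aluInputB) = regFile (destReg', aluOutput') srcRegA srcRegB
    aluOutput  = alu cmd aluInputA aluInputB
    destReg'   = delay R0 destReg
    aluOutput' = delay 0 aluOutput
```

The recursion between regFile and the delayed write-port is well defined because delay produces its first element without inspecting its input.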
Pipelining
Suppose we wanted to increase SHAM's performance by doubling the clock frequency. We will assume that, while sham1 could perform both the register file and ALU operations within one clock cycle, with the increased frequency it will take two clock cycles to perform both functions serially. We use pipelining to increase the overall performance. While the ALU is working on instruction n, the register file will be writing the result of instruction n - 1 back into the appropriate register, and simultaneously reading the source registers of instruction n + 1.
But now consider a sequence of instructions in which an ADD instruction that writes R2 is immediately followed by a SUB instruction that reads R2 and R5.
When the ADD instruction is in the ALU stage, the SUB instruction is in the register-fetch stage. But one of the registers that is being fetched (R2) has not been written back into the register file yet, because the ALU is still calculating the result. The SUB instruction will read an out-of-date value for R2. This is an example of a data hazard, where naive pipelining can produce a result different from the unpipelined version of a microprocessor. To resolve this hazard, we will first add bypass logic to the pipeline, then later abstract away from this added inconvenience.

Figure 3 contains the diagram of a pipelined version of SHAM with bypass logic. By the time the source operands to the SUB instruction (R2 and R5) are ready to be input into the ALU, the up-to-date value for R2 is stored in the delay circuit between the ALU and the register file's write-port. The bypass logic uses this stored value of R2 as the input to the ALU, rather than the out-of-date value read from the register file. The bypass logic examines the incoming instructions to determine when this is necessary.

The Hawk specification, sham2, has a data-flow section and a control logic section. The first two lines after the where keyword read the contents of the source registers from the register file. The next four lines delay the source register contents, the ALU command, and the destination register name by one cycle. The two select commands decide whether the delayed values should be bypassed. The decision is made by the Boolean signals validA and validB, which are defined in the control logic section. The next line performs the ALU operation. The last two lines in the data-flow section delay the ALU result and the destination register. The delayed result, called aluOut', is written back into the register file in the register named by destReg'', as indicated in the first two lines of the section. The control logic section determines when to bypass the ALU inputs.
The signals validA and validB are set to True whenever the corresponding ALU input is up-to-date. The definition of these signals uses the function noHazard, which tests whether the previous instruction's destination register name matches a source register name of the current instruction. If they do, then the function returns False. The exception to this is when the destination register is R0. In this case the ALU input is always up-to-date, so noHazard returns True.
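The pipeline with explicit bypass logic can be sketched along the lines of the description above. This is a reconstruction under assumptions, not the authors' listing: signals are lazy lists, regFile and alu are simple stand-ins, the initial values passed to delay are arbitrary choices, and noHazard is written here as a pointwise test on register names.

```haskell
type Signal a = [a]

data Reg = R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7 deriving (Eq, Show)
data Cmd = ADD | SUB | INC deriving (Eq, Show)

delay :: a -> Signal a -> Signal a
delay initVal xs = initVal : xs

select :: Signal Bool -> Signal a -> Signal a -> Signal a
select = zipWith3 (\b x y -> if b then x else y)

-- regFile (stand-in): write before read, registers start at zero, R0 is never written.
regFile :: (Signal Reg, Signal Int) -> Signal Reg -> Signal Reg -> (Signal Int, Signal Int)
regFile (writeReg, writeContents) readA readB =
  unzip (go (const 0) writeReg writeContents readA readB)
  where
    go regs (w : ws) (v : vs) (a : ras) (b : rbs) =
      let regs' r = if r == w && w /= R0 then v else regs r
      in (regs' a, regs' b) : go regs' ws vs ras rbs
    go _ _ _ _ _ = []

alu :: Signal Cmd -> Signal Int -> Signal Int -> Signal Int
alu = zipWith3 step
  where
    step ADD x y = x + y
    step SUB x y = x - y
    step INC x _ = x + 1

-- noHazard: True when the ALU input read for src is up to date with respect to
-- the previous instruction's destination register.
noHazard :: Reg -> Reg -> Bool
noHazard src prevDest = prevDest == R0 || prevDest /= src

sham2 :: (Signal Cmd, Signal Reg, Signal Reg, Signal Reg) -> (Signal Reg, Signal Int)
sham2 (cmd, destReg, srcRegA, srcRegB) = (destReg'', aluOut')
  where
    -- Data flow
    (aluInputA, aluInputB) = regFile (destReg'', aluOut') srcRegA srcRegB
    aluInputA'  = delay 0 aluInputA
    aluInputB'  = delay 0 aluInputB
    cmd'        = delay ADD cmd
    destReg'    = delay R0 destReg
    aluInputA'' = select validA aluInputA' aluOut'
    aluInputB'' = select validB aluInputB' aluOut'
    aluOut      = alu cmd' aluInputA'' aluInputB''
    aluOut'     = delay 0 aluOut
    destReg''   = delay R0 destReg'
    -- Control logic: compare each source with the previous destination, one cycle late.
    validA = delay True (zipWith noHazard srcRegA destReg')
    validB = delay True (zipWith noHazard srcRegB destReg')
```

On a program with back-to-back dependencies, the register assignments produced by this sketch are those of an unpipelined machine, shifted by one extra cycle of latency.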
Transactions
The definition of sham2 highlights a difficulty of many such specifications. Although the data-flow section is relatively easy to understand, the control logic section is far from satisfactory. In fact, it often takes nearly as many lines of Hawk code to specify the control logic as it does to specify the data flow, and mistakes in the control logic may not be easy to spot. We need a more intuitive way of defining control logic sections in microprocessors.
We use a notion of transactions within Hawk to specify the state of an entire instruction as it travels through the microprocessor (similar in spirit to Aagaard and Leeser [1]). A transaction holds an instruction's source operand values, the ALU command, and the destination operand value. An operand is a pair containing a register and its value. Values can either be Unknown or they can be known, e.g. Val 7. For example, the instruction R3 <- R2 ADD R1, when it has completed, would be encoded as follows (assume that register R2 holds the value 3, and R1 holds 4):

Trans (R3, Val 7) ADD [(R2, Val 3), (R1, Val 4)]

This expression states that register R3 should be assigned the value 7 as a result of adding the contents of registers R2 and R1.
Not all of the register values in a transaction are known in the early stages of the pipeline. When a register name does not have an associated value yet, it is assigned the value Unknown. For example, if the above instruction had not yet reached the ALU stage, the destination value in the corresponding transaction would still be Unknown.
Changes to handle transactions
We change the regFile and alu functions so that they take and return transactions. Consider regFile applied to a write-transaction whose destination operand is (R1, Val 4) and to a read-transaction for the instruction R3 <- R2 ADD R1 whose operand values are all Unknown. Further, assume that register R1 is assigned 20 and R2 is assigned 3 before regFile's application. Then regFile will update R1 to contain 4 from the write-transaction, and will output a new transaction that is identical to the read-transaction, except that all of the source registers have been assigned current values from the register file:

Trans (R3, Unknown) ADD [(R2, Val 3), (R1, Val 4)]

The revised alu function takes a transaction whose source operands have values, performs the appropriate operation, and outputs a modified transaction whose destination field has been filled in. Thus if the ADD transaction above were given to alu, it would return:

Trans (R3, Val 7) ADD [(R2, Val 3), (R1, Val 4)]

The real benefit of transactions comes from specifying more complex micro-architectures, as we shall see next.
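The transaction encoding and the two revised components can be illustrated on single transactions. The data declarations below are an assumption consistent with the examples in the text (in particular, the list of source operands is a guess at the representation), and writeBack, readSources and aluStep are hypothetical per-cycle helpers rather than Hawk functions.

```haskell
data Reg = R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7 deriving (Eq, Show)
data Cmd = ADD | SUB | INC deriving (Eq, Show)

-- A value is either not yet known or a known integer.
data Value = Unknown | Val Int deriving (Eq, Show)

-- An operand pairs a register name with its value.
type Operand = (Reg, Value)

-- Destination operand, command, source operands.
data Transaction = Trans Operand Cmd [Operand] deriving (Eq, Show)

-- Register contents, modelled as a lookup function.
type Regs = Reg -> Int

-- writeBack: apply a write-transaction's destination operand; R0 is never written.
writeBack :: Transaction -> Regs -> Regs
writeBack (Trans (d, Val v) _ _) regs
  | d /= R0 = \r -> if r == d then v else regs r
writeBack _ regs = regs

-- readSources: fill in every source operand with the current register contents.
readSources :: Regs -> Transaction -> Transaction
readSources regs (Trans dest cmd srcs) =
  Trans dest cmd [(r, Val (regs r)) | (r, _) <- srcs]

-- aluStep: fill in the destination value once the source values are known.
aluStep :: Transaction -> Transaction
aluStep (Trans (d, _) cmd srcs) = Trans (d, result) cmd srcs
  where
    result = case (cmd, map snd srcs) of
      (ADD, [Val x, Val y]) -> Val (x + y)
      (SUB, [Val x, Val y]) -> Val (x - y)
      (INC, Val x : _)      -> Val (x + 1)
      _                     -> Unknown
```

Starting from registers with R1 = 20 and R2 = 3, applying writeBack for a transaction that writes (R1, Val 4) and then readSources to the ADD transaction fills its sources with Val 3 and Val 4, and aluStep then sets the destination to Val 7.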
SHAM2 with Transactions
Transactions are designed to contain the necessary information for concisely specifying control logic. The control logic needs to determine when an instruction's source operand is dependent on another instruction's destination operand. To calculate the dependency, the source and destination register names must be available. The transaction carries these names for each instruction. Because of this additional information, bypass logic is easily modeled with a combinator called bypass. The bypass function usually just outputs its first argument. Sometimes, however, the second argument's destination operand name matches one or more of the first argument's source operand names. In this case, the source operand's state values are updated to match the destination operand state value. The updated version of the first argument is then returned.
So if at clock cycle n the first argument to bypass is [...]

One special case in bypass's functionality is when a source register is R0. Since R0 is a constant register, it does not get updated.

The pipelined version of SHAM with bypass logic is now straightforward. Notice that no explicit control logic is needed, as all the decisions are taken locally in the bypass operations. The first line takes instr and fills in its source operand fields from the register file. The filled-in transaction is delayed by one cycle in the second line. In the third line bypass is invoked to ensure that all of the source operands are up-to-date. Finally the transaction result is computed by alu and delayed one cycle so that the destination operand can be written back to the register file.
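A transaction-based pipeline in the style just described can be sketched as follows. Everything here is an assumption made for illustration: signals are lazy lists, the Transaction type and the nop transaction are guesses at the representation, regFile and alu are stand-ins, and sham2Trans is a hypothetical name for the pipelined specification.

```haskell
type Signal a = [a]

data Reg = R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7 deriving (Eq, Show)
data Cmd = ADD | SUB | INC deriving (Eq, Show)
data Value = Unknown | Val Int deriving (Eq, Show)
type Operand = (Reg, Value)
data Transaction = Trans Operand Cmd [Operand] deriving (Eq, Show)

-- nop: a transaction with no visible effect, since its destination is R0.
nop :: Transaction
nop = Trans (R0, Val 0) ADD [(R0, Val 0), (R0, Val 0)]

delay :: a -> Signal a -> Signal a
delay initVal xs = initVal : xs

-- regFile (stand-in): apply the write-transaction, then fill in the read-transaction's sources.
regFile :: Signal Transaction -> Signal Transaction -> Signal Transaction
regFile = go (const 0)
  where
    go regs (w : ws) (r : rs) = fill regs' r : go regs' ws rs
      where regs' = write w regs
    go _ _ _ = []
    write (Trans (d, Val v) _ _) regs
      | d /= R0 = \x -> if x == d then v else regs x
    write _ regs = regs
    fill regs (Trans dest cmd srcs) = Trans dest cmd [(x, Val (regs x)) | (x, _) <- srcs]

-- alu (stand-in): fill in the destination value from known source values.
alu :: Signal Transaction -> Signal Transaction
alu = map step
  where
    step (Trans (d, _) cmd srcs) = Trans (d, result cmd (map snd srcs)) cmd srcs
    result ADD [Val x, Val y] = Val (x + y)
    result SUB [Val x, Val y] = Val (x - y)
    result INC (Val x : _)    = Val (x + 1)
    result _ _                = Unknown

-- bypass: copy the second argument's destination value into matching sources
-- of the first argument; sources naming R0 are never updated.
bypass :: Signal Transaction -> Signal Transaction -> Signal Transaction
bypass = zipWith step
  where
    step (Trans dest cmd srcs) (Trans (d, v) _ _) = Trans dest cmd (map update srcs)
      where
        update (r, old)
          | r == d && r /= R0 = (r, v)
          | otherwise         = (r, old)

sham2Trans :: Signal Transaction -> Signal Transaction
sham2Trans instr = aluOut'
  where
    readyInstr  = regFile aluOut' instr
    readyInstr' = delay nop readyInstr
    aluInput    = bypass readyInstr' aluOut'
    aluOut      = alu aluInput
    aluOut'     = delay nop aluOut
```

The hazard decisions live entirely inside bypass, so the body of sham2Trans contains only data flow.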
Hazards
There are some microprocessor hazards that cannot be handled through bypassing. For example, suppose we extended the SHAM architecture to process load and store instructions. A load instruction might load the contents of the address pointed to by R2 into R3. A store instruction might store the contents of R2 into the address pointed to by R5. A block diagram of the extended SHAM architecture is shown in Figure 5. There is now a load/store pipeline stage after the ALU stage. However, this introduces a new problem. Suppose SHAM executes a load into R2 immediately followed by an ADD instruction that reads R2.
These two instructions have a data hazard, just as before, but we cannot use bypassing to resolve it. Bypassing depends on having a value to bypass at the beginning of a clock cycle, but R2's value won't be known until the end of the cycle, after the memory contents have been retrieved from the memory cache. To resolve this hazard, we have to stall the pipeline at the register-fetch stage. When the first instruction has reached the end of the ALU stage, the second instruction will have reached the end of the register-fetch stage. At this point the delay circuits between the register-fetch stage and the ALU stage are overridden; on the next clock cycle they instead output the equivalent of a no-op instruction. The register-fetch stage itself re-reads the second instruction on the next clock cycle. In effect, the pipeline stall inserts a no-op instruction between the two instructions involved in the hazard.
Now when the ADD instruction is about to be processed by the ALU, the load instruction has already completed the memory stage. R2's value is held in the pipeline registers after the memory stage, so bypass logic can be used to bring the ALU's input up-to-date. In order to stall correctly, we have to re-read the second instruction. Thus stalling reduces the performance of the pipeline.
Hawk Specification of Extended SHAM
In this section we will give more evidence of the simplifying power of transactions by specifying the extended SHAM architecture. The load/store extension significantly complicates the control logic for the SHAM architecture. We shall see that transactions hold up well when we must add stalling logic to the pipeline. To start, we need to add the commands LOAD and STORE to the Cmd type.
We also need to define some additional Hawk circuits. The first circuit, defaultDelay, augments the normal delay circuit so that when a stall hazard is detected, the augmented circuit will output a default value on the next clock cycle, rather than its current input value.

Since the pipeline can stall, we need a way to ask for the same instruction two cycles in a row. The instrCache function takes a Boolean signal and returns the current transaction. Whenever the argument signal is True, then on the next cycle instrCache returns the same transaction as it did for the current clock cycle. Otherwise, it returns the next transaction as normal.
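The two stalling circuits can be sketched over lazy lists. The signatures below are assumptions: the text does not give the argument order of defaultDelay, and here instrCache receives the program as an explicit list argument, which is a simplification introduced for this example.

```haskell
-- Assumption: a signal is modelled as a lazy list, one element per clock cycle.
type Signal a = [a]

-- defaultDelay: like delay, but a True on the stall signal at cycle t makes the
-- output at cycle t+1 the default value instead of the input at cycle t.
defaultDelay :: a -> Signal Bool -> Signal a -> Signal a
defaultDelay def stall xs = def : zipWith (\s x -> if s then def else x) stall xs

-- instrCache: emit the program one instruction per cycle; a True on the stall
-- signal at cycle t repeats the same instruction at cycle t+1.
instrCache :: [a] -> Signal Bool -> Signal a
instrCache [] _ = []
instrCache (i : is) stalls = i : next stalls
  where
    next (s : ss) = instrCache (if s then i : is else is) ss
    next []       = []
```

For example, defaultDelay 0 [False, True, False] [1, 2, 3] gives [0, 1, 0, 3], and instrCache "abc" [False, True, False, False, False] gives "abbc": the stall on cycle 1 replays the second instruction.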
We also need a circuit, mem, that actually performs the loads and stores. On those clock cycles where the input transaction is anything but a load or store transaction, the mem function simply returns the transaction unchanged. On loads, mem updates the destination operand of the input transaction, based on the input load address. On stores, mem updates its internal memory array according to the address and contents given in the input transaction. The destination operand value is set to zero.
We also define a new Hawk function, transHazard, that returns True whenever its two transaction arguments would cause a hazard, if the first transaction preceded the second transaction in a pipeline.

In the extended Hawk specification using transactions, the register-fetch stage retrieves the instruction and fills in its source operands from the register file. The register-fetch pipeline register delays the transaction by one clock cycle, although if there is a load hazard, the register instead outputs a nop instruction on the next cycle. The ALU stage first updates the source operands of the stored transaction with the results of the two preceding transactions (memOut' and aluOut') by invoking bypass twice. It then performs the corresponding ALU operation, if any, on the transaction and stores it in the ALU-stage pipeline register. The memory stage again updates the stored transaction with the immediately preceding transaction, performs any required memory operation, and stores the transaction. The stored transaction is written back to the register file on the next clock cycle. The control logic section determines whether a load hazard exists for the current transaction, that is, whether the immediately preceding transaction was a load instruction that is in hazard with the current transaction.
As we can see, the body of the specification remains manageable. The small control logic section to detect load hazards is straightforward and is a minority of the overall specification. In contrast, an equivalent specification of this pipeline where the components of each transaction were explicitly represented contained over three times as many source lines. The lower-level specification's control section was almost as large as the data-flow section, and not nearly as intuitive.
We feel the transaction ADT is close to the level of abstraction design engineers use informally when reasoning about microprocessor architectures.
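The load-hazard test at the heart of the control logic can be sketched on single transactions. The Transaction type is an assumed representation, transHazard is written here as a pointwise function rather than a circuit over signals, and isLoad and loadHazard are hypothetical helper names.

```haskell
data Reg = R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7 deriving (Eq, Show)
data Cmd = ADD | SUB | INC | LOAD | STORE deriving (Eq, Show)
data Value = Unknown | Val Int deriving (Eq, Show)
type Operand = (Reg, Value)
data Transaction = Trans Operand Cmd [Operand] deriving (Eq, Show)

-- transHazard: True when the first transaction's destination register is read
-- by the second transaction. R0 never causes a hazard because it is constant.
transHazard :: Transaction -> Transaction -> Bool
transHazard (Trans (d, _) _ _) (Trans _ _ srcs) =
  d /= R0 && d `elem` map fst srcs

-- isLoad (hypothetical helper): does the transaction carry a LOAD command?
isLoad :: Transaction -> Bool
isLoad (Trans _ cmd _) = cmd == LOAD

-- loadHazard: the preceding transaction is a load whose result the current one needs.
loadHazard :: Transaction -> Transaction -> Bool
loadHazard prev cur = isLoad prev && transHazard prev cur
```

Over list-based signals, the per-cycle stall signal would then be zipWith loadHazard applied to the delayed and the current transaction streams. An ordinary ALU dependency satisfies transHazard but not loadHazard, so it is left to bypass.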
Modelling the DLX
Using techniques comparable to those described in this report we have modeled several DLX architectures:
- An unpipelined version, where each instruction executes in one cycle.
- A pipelined version where branches cause a one-cycle pipeline stall.
- A more complex pipelined version with branch prediction and speculative execution. Branches are predicted using a one-level branch target buffer. Whenever the guess is correct, the branch instruction incurs no pipeline stalls. If the guess is incorrect, the pipeline stalls for two cycles.
- An out-of-order, superscalar microprocessor with speculative execution. The microarchitecture contains a reorder buffer, register alias table, reservation station, and multiple execution units. Mispredicted branches cause speculated instructions to be aborted, with execution resuming at the correct branch successor.
The microarchitectural specification for the unpipelined DLX is written in a quarter page of uncommented source code; the most complicated pipelined version takes up just over half a page.
Executing the model
We used the GNU C compiler that generates DLX assembly to test our specifications on several programs. These test cases include a program that calculates the greatest common divisor of two integers, and a recursive procedure that solves the towers of Hanoi puzzle.
We have not made detailed simulation performance measurements yet. Although we plan to test Hawk on several benchmark programs, we do not expect to break simulation-speed records. Hawk is built on top of a lazy functional language, which imposes some performance costs. Transactions also perform some runtime tests that are "compiled away" in a lower-level pipeline specification. While it would be nice to get high performance, Hawk is primarily a specification language, and only secondarily a simulation tool. Our main interest is in using Hawk to formally verify microarchitectures, while at the same time retaining the ability to directly execute Hawk programs on concrete test cases.
Related Work
There are several research areas that bear a relation to this work, in particular, modeling specific application domains with Haskell, and modeling hardware in various programming languages. We will pick an example or two from these two categories.
Haskell has been used to directly model hardware circuits at the gate level. O'Donnell [10] has developed a Haskell library called Hydra that models gates at several levels of abstraction, ranging from implementations of gates using CMOS and NMOS pass-transistors, up to abstract gate representations using lazy lists to denote time-varying values. Hydra has been used to teach advanced undergraduate courses on computer design, where students use Hydra to eventually design and test a simple microprocessor. Hydra is similar to Hawk in many ways, including the use of higher-order functions and lazy lists to model signals. However, Hydra does not allow users to define composite signal types, such as signals of integers or signals of transactions. In Hydra, these composite types have to be built up as tuples or lists of Boolean signals. While this limitation does not cause problems in an introductory computer architecture course, composite signal types significantly reduce specification complexity for more realistic microprocessor specifications.
There are many other languages for specifying hardware circuits at varying levels of abstraction. The most widely used such languages are Verilog and VHDL. Both of these languages are well suited for their roles as general-purpose, large-scale hardware design languages with fine-grained control over many circuit properties. Both of these languages are more general than Hawk in that they can model asynchronous as well as synchronous circuits. However, Verilog and VHDL are large languages with complex semantics, which makes circuit verification more difficult. Also, neither of these languages supports polymorphic circuits or higher-order circuit combinators as well as Hawk does.
The Ruby language, created by Jones and Sheeran [7], is a specification and simulation language based on relations, rather than functions. Ruby is more general than Hawk in that relations can describe more circuits than functions can. On the other hand, existing Ruby simulators require Ruby relations to be causal, i.e. to be implementable as functions. Thus Hawk is equal in expressive power to currently executable Ruby programs. In addition, much of Ruby's emphasis is on circuit layout. There are combinators to specify where circuits are located in relation to each other and to external wires. Hawk's emphasis is on behavioral correctness, so we do not need to address layout issues.
Two other languages that are strongly related are HML [8] and MHDL [2]. HML is a hardware modeling language based on the functional language ML. It also has higher-order functions and polymorphic types, allowing many of the same abstraction techniques that are used in Hawk, with similar safety guarantees. On the other hand, HML is not lazy, so it does not easily allow the recursive circuit specifications that turned out to be key in specifying micro-architectures. The goal of HML is also rather different from Hawk's, concentrating on circuits that can be immediately realized by translation to VHDL.
MHDL is a hardware description language for describing analog microwave circuits, and includes an interface to VHDL. Though it tackles a very different part of the hardware design spectrum, like Hawk, MHDL is essentially an extended version of Haskell. The MHDL extensions have to do with physical units on numbers, and universal variables to track frequency, time, and so on.
Future Directions
We have just completed the specification of a superscalar version of DLX, with speculative and out-of-order instruction execution. The use of transactions has scaled well to this architecture; it turns out that superscalar components like reservation stations and reorder buffers are naturally expressed as queues of transactions.
Beyond this, we intend to push in a number of directions.
- We hope to use Hawk to formally verify the correctness of microprocessors through the mechanical theorem prover Isabelle [11]. Isabelle is well-suited for Hawk; it has built-in support for manipulating higher-order functions and polymorphic types. It also has well-developed rewriting tactics. Thus simplification strategies for functional languages like partial evaluation and deforestation [3] can be directly implemented. We also expect that transactions will aid the verification process. Transactions make explicit much of the pipeline state needed to prove correctness. In lower-level specifications this data has to be inferred from the pipeline context.
- We are also working on a visualization tool which will enable the microprocessor engineer to inspect values passing along internal wires.
- We have made initial progress on formally extracting stand-alone control logic from the transaction-based models of pipelines. Stand-alone control logic may be more amenable to conventional synthesis techniques.
