This paper presents a new approach for automatically pipelining sequential circuits. The approach repeatedly extracts a computation from the critical path, moves it into a new stage, then uses speculation to generate a stream of values that keep the pipeline full. The newly generated circuit retains enough state to recover from incorrect speculations by flushing the incorrect values from the pipeline, restoring the correct state, then restarting the computation.
INTRODUCTION
This paper presents a new algorithm for automatically pipelining sequential circuits. The algorithm is based on speculation and uses state retention and recovery to respond to incorrect speculations. The paper also presents two extensions to the basic approach: generating stall logic to avoid incorrect speculations and the associated area penalty, and generating forwarding logic to increase the throughput of the resulting circuit.
Our algorithm starts with a non-pipelined or insufficiently pipelined specification of a circuit and repeatedly shortens its clock cycle by extracting a computation from the critical path and moving it into a new pipeline stage. The new stage precomputes the result of the selected expression and passes it to the computation of the next stage that uses it. To keep the pipeline full, the new stage must produce the next value of the expression before the final values of the variables it accesses become available. Our algorithm achieves this goal by speculating on the values of these variables. If the speculation is incorrect, the circuit restores its state to match the state before the speculation, flushes the incorrect values from the pipeline, then restarts the computation.

*This research was supported in part by NSF Grant CCR-702297.

ISSS'01, October 1-3, 2001, Montreal, Québec, Canada. Copyright 2001 ACM.
Our algorithm uses several techniques to improve the quality of the pipelined circuit. If the amount of state necessary to recover from an incorrect speculation is excessive, our algorithm can generate stall logic that causes the pipeline stage to stall until the new values are available. This technique eliminates the need for retaining recovery state, as the execution of the pipeline stage will never need to roll back. Our algorithm also generates circuits that forward the correct value to preceding pipeline stages. This technique increases the throughput of the circuit by reducing the amount of time that the circuit spends recovering from incorrect speculations or waiting for correct values to become available.
We have built a prototype implementation of our algorithm. Using our synthesizer [13] as backend, this implementation generates synthesizable Verilog at the RTL level. We have used our implementation to automatically generate pipelined versions of several circuits. Our results show that our automatically generated pipelined circuits are competitive with hand-generated versions.
This paper makes the following contributions:

Approach: It presents a new approach for automatically pipelining sequential circuits. This approach repeatedly extracts a computation from the critical path and moves it into a new stage. This stage uses speculation to generate a stream of values that keep the pipeline full. The approach reduces the clock cycle and increases the throughput of a circuit.

Algorithm: It presents a pipelining algorithm that implements our approach. It also presents two extensions to the approach: stalling, which reduces the amount of area that would otherwise be required to respond to incorrect speculations; and forwarding, which increases throughput either by replacing values produced by incorrect speculations with correct values or by making new values available earlier to the stall logic.

Experimental Results: It presents experimental results that demonstrate the viability of the approach in practice.

The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 illustrates how a system is specified using rewrite rules and gives an example of what the starting and derived circuit specifications may look like. Section 4 presents the pipelining algorithm. Section 5 presents the experimental results. Section 6 draws the conclusions.
RELATED WORK
Many high-level synthesis systems focus on the automatic generation of highly efficient pipelined designs. Most of this work is primarily concerned with functional pipelining. Many synthesis tools target instruction-set architectures [10, 4, 3, 14, 15, 1, 9, 2]; our tool, on the other hand, targets the more general class of sequential circuits. Other approaches start with a C program [8, 5]. Kroening and Paul [11] describe a method of automating the generation of stall and forward logic starting from a given sequential machine. Starting from hardware that is initially partitioned into pipeline stages, the algorithm produces a circuit that can stall in any arbitrary stage while keeping the other stages running, if possible. To implement forwarding, the designer has to specify the registers holding intermediate results that need to be forwarded to previous stages in the pipeline. The goal of our research, in contrast, is to completely automate the pipelining transformation starting from a non-pipelined or insufficiently pipelined specification.
Retiming [12] optimally pipelines combinatorial circuitry. Architectural retiming [6] adds a negative/normal register pair on a latency-constrained path, effectively pipelining the logic without adding latency. It implements the negative register by either precomputation or prediction of the value that it produces.
Research by Hoe and Arvind [7] and Shen and Arvind [16] has introduced an approach to describe, verify and synthesize processors based on term rewriting systems (TRS). They do not implement automatic pipelining, but their specification language offers capabilities in this direction comparable to those of our language.
EXAMPLE
The text inside the boxes in Figure 2 and Figure 1 presents a specification written in our high-level description language. The figures also contain a graphical representation that we find useful in explaining our example. We first present a non-pipelined datapath, then a simple, three-stage, linear pipelined datapath that our algorithm can automatically derive from the non-pipelined specification. Section 4 shows the intermediate specifications at each step of the algorithm. We chose to present this three-stage linear pipeline because of its simplicity; our algorithm is capable of generating deep pipelines and is not specific to this particular class of circuits.
The designer specifies the circuit using two kinds of information:
State Declarations: The designer specifies the state of the system as a set of typed variable declarations.
Module Specification: The designer specifies the behavior of each module as a set of update rules. Modules communicate by reading and writing shared state, in particular FIFO queues.

Figure 2 and Figure 1 show the functional modules in our example. We next illustrate the conceptual model of execution in our system by discussing the operation of the rules in our example. The rules describe the structure of the hardware pipeline and not the program that executes on it. To keep the example clear, the instruction set contains only an INC instruction, which increments the value in its single register argument, and a JRZ instruction, which tests the value in its register argument and, if the value is zero, jumps to the location in its location argument.
[Figure: Non-pipelined Specification]
[Figure: Three-stage Pipelined Specification]
The condition for the rule in module IFM of Figure 1 is true, which means that the rule is always enabled. When it executes, it fetches an instruction from the instruction memory and inserts it into the instruction queue iq. It also increments the program counter pc to set up the next fetch. The two rules in the module ROFM remove instructions from iq, fetch the register operands, and insert them into rq. The enabling condition of the first rule is <INC r> = head(iq) and notin(rq, <INC r ->). If the instruction at the head of iq is an INC, the first clause matches and binds r to the register name argument of the INC instruction. The second clause, notin(rq, <INC r ->), uses the binding of r to check for a read after write (RAW) hazard caused by a pending instruction in rq that will write the register r. In this case, the machine delays the operand fetch so that it fetches the value after the write (this translates into stalling).
The clause notin(rq, <INC r ->) checks to make sure that there is no such instruction in rq, and the rule as a whole is enabled and can execute only if there is no hazard.
The other rules perform similar actions. The update iq/rq = nil clears the queue(s) iq/rq.
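The rule semantics described above can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the tuple encoding of instructions, the `step` and `notin` helpers, the bound check on the instruction memory, and the assumption that all enabled rules fire together on the state at the start of a clock cycle are ours.

```python
from collections import deque


def notin(queue, opcode, reg):
    """True if no entry in `queue` is an `opcode` instruction on register `reg`."""
    return all(not (e[0] == opcode and e[1] == reg) for e in queue)


def step(state):
    """Fire every rule enabled on the current state (assumed one clock cycle)."""
    im, pc = state["im"], state["pc"]
    iq, rq, rf = state["iq"], state["rq"], state["rf"]
    new_iq, new_rq, new_rf, new_pc = deque(iq), deque(rq), dict(rf), pc

    # IFM rule: its condition is `true`; the bound check is our assumption
    # so that a finite instruction memory can be simulated.
    if pc < len(im):
        new_iq.append(im[pc])
        new_pc = pc + 1

    # ROFM rule for INC: <INC r> = head(iq) and notin(rq, <INC r ->)
    if iq and iq[0][0] == "INC" and notin(rq, "INC", iq[0][1]):
        r = iq[0][1]
        new_iq.popleft()                  # iq = tail(iq)
        new_rq.append(("INC", r, rf[r]))  # rq = insert(rq, <INC r rf[r]>)

    # Execute rule for INC: <INC r v> = head(rq)
    if rq and rq[0][0] == "INC":
        _, r, v = rq[0]
        new_rf[r] = v + 1                 # rf = rf[r->v+1]
        new_rq.popleft()                  # rq = tail(rq)

    return {"im": im, "pc": new_pc, "iq": new_iq, "rq": new_rq, "rf": new_rf}


# Two back-to-back INC instructions on the same register: the second one
# waits in iq until the first has left rq.
state = {"im": [("INC", "r1"), ("INC", "r1")], "pc": 0,
         "iq": deque(), "rq": deque(), "rf": {"r1": 0}}
for _ in range(5):
    state = step(state)
```

In this run the second INC is held back for one cycle by the `notin` clause, which is the stalling behavior the enabling condition encodes.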
Basic Approach
The pipelining algorithm starts with a non-pipelined or insufficiently pipelined specification and automatically generates a highly-pipelined, functionally equivalent specification. The algorithm repeatedly extracts an expression from a target module and creates a new module to compute the value of the expression at each clock cycle. It then uses a stream to transport the computed values from the new module into the target module from which the expression was extracted. The length of the stream is conceptually unbounded. The synthesis algorithm we developed in [13] implements all the streams in the final specification as finite hardware buffers. This algorithm operates on the resulting specification once the pipelining algorithm has finished. Our pipelining algorithm transforms the target module so that it reads the value of the expression from the stream instead of computing its value. Because this transformation splits computations across multiple clock cycles, it may reduce the clock cycle of the circuit and increase its throughput.
In general, the extracted subcomputation may depend on values that are not available until after it must produce the new value. The compiler therefore speculates on the values that the subcomputation uses. If the speculation is incorrect, the circuit restores the values of any incorrectly updated variables and restarts the computation from the restored state. To enable the restoration, the transformed specification inserts the old values of any potentially incorrectly updated variables into the new stream. When the circuit encounters an incorrect speculation, it extracts these values from the stream and uses them to restore any incorrectly updated variables to their correct values.

The algorithm consists of seven phases for each further pipelining decision. The steps are illustrated using our example.

The selection of the target expression is driven by an analysis of the combinational path lengths in the circuit. The algorithm repeatedly determines the critical path, then chooses a target expression on this critical path. Inserting the computation of the target expression into a different stage of the pipeline removes the expression from the critical path, shortening its length. A more general approach could be implemented that uses a wider set of paths than the critical one(s) in deciding which expression to select as target. In addition to selecting the target expression automatically, our implemented system also allows the designer to drive the pipelining process by manually selecting the target expression. For Figure 2 the algorithm selects im[pc]; for Figure 6 it selects rf[r].

The pipelining algorithm will move the computation of the target expression into a new module. This module will compute the value of the target expression in a clock cycle before the final values of the involved variables have been determined. The module therefore speculates on the final values of these variables, using the speculated values to compute the value of the target expression.
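The critical-path-driven selection of a target expression can be sketched as a longest-path computation over a graph of combinational dependencies. This is only an illustration under stated assumptions: the node names, the delay values, and the heuristic of picking the slowest node on the critical path are hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter


def critical_path(delays, deps):
    """Return (path, length) of the longest combinational path.

    delays: node -> combinational delay (hypothetical units)
    deps:   node -> list of nodes whose outputs it consumes
    """
    arrival, best_pred = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(deps).static_order():
        preds = deps.get(node, [])
        pred = max(preds, key=lambda p: arrival[p], default=None)
        best_pred[node] = pred
        arrival[node] = delays[node] + (arrival[pred] if pred is not None else 0.0)
    end = max(arrival, key=arrival.get)
    length = arrival[end]
    path = []
    while end is not None:
        path.append(end)
        end = best_pred[end]
    path.reverse()
    return path, length


def choose_target(delays, deps):
    """Assumed heuristic: pick the slowest expression on the critical path."""
    path, _ = critical_path(delays, deps)
    return max(path, key=delays.get)


# Hypothetical delays for a fetch/decode/write-back datapath.
delays = {"pc": 0.0, "im[pc]": 4.0, "decode": 2.0, "pc+1": 1.0, "rf_write": 1.5}
deps = {"im[pc]": ["pc"], "decode": ["im[pc]"], "pc+1": ["pc"], "rf_write": ["decode"]}
target = choose_target(delays, deps)
```

With these made-up delays the instruction memory read dominates the path, so it is chosen as the target, which mirrors the choice of im[pc] in the example.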
There are two kinds of speculation:
Control: Speculate on which rule will fire. For the involved variable pc in Figure 2, we speculate that the first rule will fire. This choice implies that the new value of pc is pc+1. For iq in Figure 6, we speculate that the first rule in the rightmost box will fire, so the new speculated value of iq is tail(iq).
Data Hazard: Speculate on the absence of data hazards. For the involved variable rf in Figure 6, we speculate that there will be no writes to rf[r].

[...] Figure 2 and target expression im[pc]. Figure 1 presents the results of this transformation for the specification in Figure 6 and target expression rf[r]. The algorithm transforms the specification to detect incorrect speculations and, when necessary, use the values in the stream to restore the correct state of the system and clear the stream. The system will therefore restart from a consistent state.
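Control speculation with detection, restoration and flushing can be sketched for the INC/JRZ instruction set of the example. This is a simplified model under our own assumptions, not the generated circuit: the fetch stage speculates that the next value of pc is pc+1 and stores the old pc in each stream entry, and the execute stage uses that saved value to detect a wrong speculation, set pc to its correct value, and clear the stream. The function names and the tuple encoding are hypothetical.

```python
from collections import deque


def run_reference(im, rf):
    """Non-pipelined execution: one instruction at a time."""
    rf, pc = dict(rf), 0
    while pc < len(im):
        ins = im[pc]
        if ins[0] == "INC":
            rf[ins[1]] += 1
            pc += 1
        else:  # ("JRZ", reg, target)
            pc = ins[2] if rf[ins[1]] == 0 else pc + 1
    return rf


def run_speculative(im, rf):
    """Two-stage model: speculative fetch into a stream, then execute."""
    rf, pc, stream, flushes = dict(rf), 0, deque(), 0
    while pc < len(im) or stream:
        head = stream.popleft() if stream else None

        # Fetch stage: speculate that pc becomes pc+1 and save the old pc.
        fetched, next_pc = None, pc
        if pc < len(im):
            fetched, next_pc = (im[pc], pc), pc + 1

        # Execute stage: compute the actual successor of the saved pc.
        mispredicted = False
        if head is not None:
            ins, old_pc = head
            actual = old_pc + 1
            if ins[0] == "INC":
                rf[ins[1]] += 1
            elif rf[ins[1]] == 0:
                actual = ins[2]
            if actual != old_pc + 1:
                mispredicted, next_pc = True, actual

        if mispredicted:
            stream.clear()  # flush wrongly fetched entries, including this cycle's
            flushes += 1
        elif fetched is not None:
            stream.append(fetched)
        pc = next_pc
    return rf, flushes


program = [("INC", "a"), ("JRZ", "z", 3), ("INC", "a"), ("INC", "b")]
regs = {"a": 0, "b": 0, "z": 0}
final_rf, flushes = run_speculative(program, regs)
```

Both functions reach the same register state for this program; the speculative version additionally reports how many times the stream had to be flushed.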
For the non-pipelined example in Figure 2, Figure 5 shows the resulting specification after the algorithm executes this step. After updating all the rules that read the target expression, eliminate all the fields of stream entries that were saved and never used again.
As we notice in Figure 5, field a of iq is never used, so we can safely remove it from the stream to obtain the specification in Figure 6.
Optimizations
We next present how our algorithm generates logic that implements two techniques, stalling and forwarding, which can improve the quality and performance of the automatically pipelined circuit.
We first present the circuit that responds to incorrect speculations by restoring saved state. Figure 7 shows the transformation schema for a rule R_i: e = head(str) and P_i → A_i that reads some target expression TE. e = head(str) and P_i is the enabling condition of R_i; both clauses are optional. A missing clause reads as a true clause. A_i consists of all the updates performed by rule R_i.
In Figure 7, IV stands for the set of involved variables of TE and W for the corresponding set of updated variables. IV is the disjoint union of two subsets: IV_ctrl, the set of involved variables on which the algorithm applies control speculation, and IV_DH, the set of involved variables on which it applies data hazard speculation. The speculated value of TE is TE', and the set of speculated values for the elements of IV_ctrl is IV'_ctrl. A_i.upd(IV) returns the set of values to which A_i updates the involved variables in IV. A_i(IV) returns the updates in A_i that write the involved variables in IV. newstr is the newly generated stream of values for target expression TE. dataHazard(E) returns true if head(newstr) writes the expression E in the current clock cycle and at least one entry in tail(newstr) reads it.
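The definition of dataHazard(E) can be written down directly as a predicate. The sketch below is illustrative only: the `writes` and `reads` helpers, which map a stream entry to the expressions it writes or reads, and the tuple encoding of entries are our assumptions.

```python
def data_hazard(expr, newstr, writes, reads):
    """True if head(newstr) writes `expr` and some entry of tail(newstr) reads it.

    `writes` and `reads` are hypothetical callables mapping a stream entry to
    the set of expressions it writes or reads.
    """
    if not newstr:
        return False
    head, tail = newstr[0], newstr[1:]
    return expr in writes(head) and any(expr in reads(entry) for entry in tail)


# Illustrative entries of the form (opcode, register): in this toy encoding
# an INC both reads and writes rf[register].
def writes(entry):
    return {f"rf[{entry[1]}]"}


def reads(entry):
    return {f"rf[{entry[1]}]"}


hazard = data_hazard("rf[r1]", [("INC", "r1"), ("INC", "r1")], writes, reads)
```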
Assume we apply this transformation to the specification in Figure 6, for TE = rf[r]. Figure 8 presents the rules for the INC instruction that this transformation generates. The first clause of each newly derived rule reads the head of rq and binds r, x and s to the arguments of the INC instruction. The clause notin(tail(rq), <- r - ->) checks whether there is an entry in tail(rq) that reads rf[r]. This is the check for a data hazard on rf[r] for the INC instruction. In case of a hazard (the second rule in Figure 8), all the updated variables, in our case iq, have to be restored to their old values. We would like to avoid having to store and carry the whole instruction stream iq along, up to the restoration point. In this case, a viable approach is stalling the pipeline stage that reads rf[r] until all data hazards are cleared.
Stalling
Let S be the set of all target expressions on which the algorithm speculates. For TE ∈ S, let {R_2} be the set of rules that generate the speculated values of TE and let this stream be Q_1. Let {Q_n} be the set of streams read by the rules that write TE. To eliminate the need to restore state in case of a failed speculation on TE, the algorithm modifies the preconditions of all rules in {R_2} to check that either 1) no item in any stream from Q_1 to Q_n will generate a write to TE or 2) all items that update TE write the same known value. This approach implements the stalling mechanism; Figure 1 presents its results for our example.
Let R21 and R22 be the two rules that result from splitting the rule in Figure 6 that handles INC instructions. The check for data hazards is now handled by R21, which will not fire and read the target expression rf[r] until all the previous rules writing rf[r] have done so:
R21: <INC r> = head(iq) and notin(rq, <INC r ->) → iq = tail(iq), rq = insert(rq, <INC r rf[r]>);
Rule R22 no longer needs to check for data hazards for the current target expression, in our example rf[r].
There is no on-the-fly speculation regarding the absence of a data hazard, and therefore no need to save the instruction queue in the newly generated stream of values for rf[r].
The derived R22 will have the form below:
R22: <INC r v> = head(rq) → rf = rf[r->v+1], rq = tail(rq);
Stalling trades the potentially higher throughput of speculative execution for a smaller circuit area. This trade-off requires a policy to decide when it is better to stall the pipeline and when it is better to speculate. The decision depends primarily on the accuracy of the predictions and the amount of state that needs to be saved for restoration purposes. Also, if all the rules ahead in the pipeline will update TE with the same known value, the circuit can safely generate the next value of TE even if some rule will write TE. Therefore, a stalling check replacing a data hazard speculation waits until TE is hazard-free; a stalling check replacing a control speculation waits until all the previous rules in the pipeline can only update TE with the same known value.
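The two conditions under which a stalled stage may proceed can be sketched as a single check. This is a minimal illustration, assuming that the pending writes ahead in the pipeline are visible as (written expression, value) pairs with `None` standing for a value that is not yet known; the function name and this encoding are hypothetical.

```python
def stall_check(target, pending_writes):
    """Decide whether the stage that reads `target` may proceed.

    pending_writes: list of (written_expression, value_or_None) for the items
    ahead in the pipeline; None marks a value that is not yet known.
    Returns (may_proceed, known_value).
    """
    values = [v for name, v in pending_writes if name == target]
    if not values:
        # Condition 1: no pending item writes the target (hazard-free).
        return True, None
    if all(v is not None for v in values) and len(set(values)) == 1:
        # Condition 2: every pending write stores the same known value.
        return True, values[0]
    return False, None  # otherwise the stage stalls


may_proceed, known = stall_check("rf[r1]", [("rf[r1]", 3), ("rf[r1]", 3)])
```

When the second condition holds, the returned value is the one the stage can safely use in place of the stored value of the target expression.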
Forwarding
Regardless of whether the algorithm speculates on the value of TE or stalls the pipeline, waiting for its correct value to become available, generating forwarding logic may increase the throughput of the circuit. To implement forwarding, the algorithm replaces the obsolete values of TE in Q_1 with their correct, updated values. Figure 10 shows how the technique updates a rule to implement the bypass. updatedTE stands for the newly computed, correct value of TE.

[Rule listing recovered from the figures; it matches the two INC rules described for Figure 8]
<INC r x s> = head(rq) and notin(tail(rq), <- r - ->) → rf = rf[r->x+1], rq = tail(rq);
<INC r x s> = head(rq) and ¬notin(tail(rq), <- r - ->) → iq = tail(s), rq = nil, rf = rf[r->x+1];

Forwarding transforms the two rules handling INC instructions in Figure 8 into the single new rule in Figure 9. This rule produces a circuit that updates all of the entries in rq produced by rules that accessed rf[r] with the new correct value x+1.
[Figure 10: forwarding transformation schema, partially recovered]
R21: e1 = head(str) and dataHazard(TE) → newstr = insert(newstr, <e, updatedTE W>),
R21: e1 = head(str) and no dataHazard(TE) → newstr = insert(newstr, <e, TE[IV'_ctrl/IV_ctrl] W>),
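The effect of forwarding on the example can be sketched as follows: when an INC on register r leaves the head of rq, its result is written to the register file and also replaces the stale operand in every remaining rq entry that read rf[r]. This is a hedged illustration; the function name and the (opcode, register, operand) entry encoding are our assumptions, not the generated rule of Figure 9.

```python
from collections import deque


def execute_with_forwarding(rq, rf):
    """Execute the INC at the head of rq and forward its result.

    rq entries are (opcode, register, operand) tuples, where the operand was
    read from rf[register] when the entry was created.
    """
    _, r, x = rq.popleft()
    new_value = x + 1
    rf[r] = new_value
    # Bypass: overwrite the obsolete operand of later entries for the same register.
    for i in range(len(rq)):
        op_i, r_i, _ = rq[i]
        if r_i == r:
            rq[i] = (op_i, r_i, new_value)
    return new_value


rf = {"r1": 0, "r2": 5}
rq = deque([("INC", "r1", 0), ("INC", "r1", 0), ("INC", "r2", 5)])
while rq:
    execute_with_forwarding(rq, rf)
```

Without the bypass, the second INC on r1 would increment the stale operand 0 and the register would end at 1 instead of 2; with forwarding neither a flush nor a stall is needed in this sketch.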
EXPERIMENTAL RESULTS
We have implemented the pipelining algorithm within our prototype synthesizer, which generates synthesizable Verilog implementations at the RTL level. We then compared the results obtained by our algorithm against a hand-written version that implements the same basic functionality as our example processor. We wrote a non-pipelined specification of a 32-bit datapath processor with a complete instruction set² and ran it through our pipelining algorithm with all pipeline buffers set to depth one. The resulting pipelined specification was then fed into the synthesizer to obtain a Verilog model for it. This model was then synthesized using the Synopsys Design Compiler to an industry-standard 0.25 micron standard cell process. To serve as a reference point, we also synthesized, in the same environment, the Santa Clara University SCU RTL 98 DSP, a hand-written (in Verilog), standard 32-bit fixed-point DSP that implements the same basic functionality. Our automatically pipelined version achieved a clock frequency of 88.9 MHz, as opposed to 90.9 MHz for the hand-pipelined version; the synthesized areas were virtually identical.

² The instruction set contains load, store, jump, ALU, multiply and variable shift operations, but no division.
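For comparison in terms of clock period, the two reported frequencies can be converted as follows. This is a worked calculation that uses only the two figures quoted above; the symbols are introduced here for illustration.

```latex
\[
T_{\text{auto}} = \frac{1}{88.9\ \text{MHz}} \approx 11.25\ \text{ns},
\qquad
T_{\text{hand}} = \frac{1}{90.9\ \text{MHz}} \approx 11.00\ \text{ns}
\]
\[
\frac{T_{\text{auto}}}{T_{\text{hand}}} = \frac{90.9}{88.9} \approx 1.022
\]
```

The automatically pipelined design therefore has a clock period roughly 2.2% longer than the hand-pipelined reference.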
It took us approximately fifteen minutes to write the specification for the non-pipelined processor and less than one minute to run it through our pipelining algorithm. Our specification contains 7 lines for state declarations and 10 lines of rule definitions for module specifications. Our automatically generated implementation consists of about 1200 lines of synthesizable Verilog. We tested the generated Verilog model using the Cadence NCVerilog simulator.
CONCLUSIONS
This paper presents a new approach for automatically pipelining sequential circuits: repeatedly extract a computation from the critical path, move it into a new stage, then use speculation to generate a stream of values that keep the pipeline full. We also present extensions that integrate stalling and forwarding into this basic approach. Our experimental results provide encouraging evidence that the approach can deliver efficient pipelined implementations.
