The quality of high-level synthesis results is strongly dependent on the concurrency that can be found in designs. In this paper we introduce the Ephemeral History Register (EHR), a new primitive state element that enables concurrent scheduling of arbitrary rules in a rule-based design framework. The key properties of the EHR are that it allows multiple operations to write to the same state simultaneously, and that the EHR maintains a history of all writes that occur within a clock cycle. Using the EHR, we present an algorithm that takes as input a design and a desired schedule, and produces a functionally equivalent design that satisfies the desired concurrency and ordering of operations. A processor pipeline is used to illustrate the effectiveness of the EHR and scheduling algorithm, and shows how this approach significantly improves on previous synthesis algorithms for rule-based designs.
Introduction
There is a need for a new hardware design and synthesis approach to address the growing complexity of hardware designs. Desired properties of such an approach are (i) the input language must have well-defined execution semantics that bridge the gap between specification, design and formal verification, (ii) the methodology should encourage correct-by-construction designs and (iii) performance must match the designer's expectations. We use guarded atomic actions, which we also refer to as rules, as a basis for our synthesis framework because they address all three of these properties. Guarded atomic actions have a strong semantic foundation [1-4] and previous work has shown how they can be applied to hardware specification [5] [6] [7], synthesis [8-10] and verification [11, 12].
Much of the research related to hardware synthesis from rule-based descriptions has focused on achieving maximal concurrent scheduling of rules within each clock cycle. Although many of the synthesis results have produced hardware that is comparable to hand-coded RTL Verilog, large designs sometimes exhibit unpredictable and at times poor performance due to scheduling inefficiencies. Additionally, the only way a designer can achieve a desired schedule is often to assert scheduling properties that carry proof obligations. If the designer makes an error in this process, then the design will not only exhibit poor performance, but it might also become functionally incorrect. This paper presents scheduling for predictable performance by introducing new scheduling algorithms that are based on a new primitive state element, the Ephemeral History Register (EHR). Besides improving performance, the scheduling algorithms also remove the possibility of introducing functional errors when guiding the rule scheduling process. If the designer does not request the correct scheduling properties, then the design might not achieve the desired performance, but it will still produce a behavior that can be explained as some sequential firing of rules. This is important for achieving correct-by-construction designs and for remaining within the formal framework of guarded atomic actions.
The basic idea behind the new algorithms and the EHR came from the realization that none of the current
Paper organization:
The next section reviews previous synthesis methodologies for guarded atomic actions and illustrates an example where this approach is not sufficient to produce the desired performance. Section 3 introduces the Ephemeral History Register and shows how it can be used to achieve the desired scheduling performance. Section 4 presents a new scheduling algorithm that accepts scheduling constraints as input and, using the EHR, transforms the design to meet the scheduling requirements. We conclude in Section 5.
Rule-based hardware synthesis
This section reviews the execution model of atomic actions and outlines the synthesis approach of Hoe and Arvind [8, 9]. We then present a processor pipeline that uses a FIFO built from primitive registers for its pipeline stages. This example demonstrates the need for a new scheduling approach because the previous algorithms are unable to derive sufficient concurrency in rule firings.
Atomic Action Execution Model
Each atomic action (or rule) consists of a body and a guard. The body describes the execution behavior of the rule if it is enabled. The guard (or predicate) specifies the condition that needs to be satisfied for the rule to be executable. We write rules in the form: 
Synthesizing Rules into RTL Hardware
There is a straightforward translation from rules into hardware. Assuming all state is accessible (no port contention), each π and δ can be easily implemented as combinational logic. A hardware scheduler and control circuit then needs to be added so that in every cycle the scheduler dynamically picks one δ function whose corresponding π condition is satisfied. At the end of the cycle the control circuit updates the state of the system with the result of the selected δ function. The cycle time in such a synthesis is determined by the slowest π and the slowest δ functions.
Although correct, such an implementation has unsatisfactory performance because it is often possible to execute several rules simultaneously such that the result of the execution continues to match an execution in which the selected rules are applied in some sequential order. Thus, the challenge in generating efficient hardware from sets of atomic actions is to generate a scheduler which in every cycle picks a maximal set of rules that can be executed simultaneously. In this paper we assume that each rule executes within a single cycle, but we are also investigating implementations where the execution of a rule may stretch over multiple cycles. Figure 1 shows the circuit that is generated in Hoe's synthesis flow. The predicates (π_i's) are computed for each rule using a combinational circuit. The scheduler is designed to select a maximal subset of applicable rules with the constraint that the outcome of a scheduling step can be explained as atomic firing of rules in some sequence. Based on which rules the scheduler chooses to enable (φ_i's), the selector block then combines the update functions (δ_i's) from the chosen rules and updates the current state with the resulting values. With some small differences that are pointed out in Section 4, we use the same model for circuit generation in our new synthesis flow. The main difference in our flow lies in the fact that we introduce the EHR state element and that we use a new scheduling algorithm, not in the circuit generation.
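The one-rule-per-cycle reference behavior described at the start of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `step`, `run`, and `gcd_rules` are hypothetical, the scheduler simply picks the first enabled rule, and the swap/subtract GCD rules are a stand-in example that does not come from this paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
# A rule is (name, guard pi, update function delta); all names are illustrative.
Rule = Tuple[str, Callable[[State], bool], Callable[[State], State]]


def step(state: State, rules: List[Rule]) -> Tuple[State, Optional[str]]:
    """One clock cycle: fire the first rule whose guard holds, if any."""
    for name, guard, delta in rules:
        if guard(state):
            return delta(state), name
    return state, None  # no rule enabled, state is unchanged


def run(state: State, rules: List[Rule], max_cycles: int = 1000):
    """Step until no rule is enabled (or max_cycles is reached)."""
    trace = []
    for _ in range(max_cycles):
        state, fired = step(state, rules)
        if fired is None:
            break
        trace.append(fired)
    return state, trace


# Hypothetical example: GCD expressed as two guarded atomic actions.
gcd_rules: List[Rule] = [
    ("swap", lambda s: s["a"] > s["b"] and s["b"] != 0,
     lambda s: {"a": s["b"], "b": s["a"]}),
    ("sub", lambda s: s["a"] <= s["b"] and s["b"] != 0,
     lambda s: {"a": s["a"], "b": s["b"] - s["a"]}),
]

final_state, fired_rules = run({"a": 6, "b": 15}, gcd_rules)
```

Any concurrent schedule must produce a state that some sequence of such single-rule steps could also have produced; that is the correctness criterion the rest of the paper relies on.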
Rule-based processor pipeline
The example we use throughout this paper is derived from a standard 5-stage processor pipeline [15]. In a rule-based design environment such a processor is typically expressed using one rule to specify the behavior of each of the five pipeline stages. FIFOs are used as pipeline stages so that the stages can be described and scheduled independently of one another. In contrast to traditional hardware descriptions, this description style lends itself to proving the correctness of the processor implementation [5].
In this paper we focus on the interaction of rules. The question that needs to be examined is: do current scheduling algorithms allow these three rules to execute concurrently? Hoe and Arvind [9] showed that efficient hardware can be generated from such a description, but required that a special FIFO primitive be used. However, if the FIFO is constructed from primitive registers, then as we show in the next subsection, the compiler derives an unsatisfactory schedule. Bluespec [16] and [17] improved on Hoe's scheduling, but only by supporting scheduling annotations that carried proof obligations. An error in these annotations could easily lead to a functionally incorrect design. Thus, there clearly exists a need for improved scheduling algorithms. The new algorithms that we later present solve this scheduling problem while fitting into the semantic and language framework of [16] and [17]. We should note that we use the FIFO because it is a simple example that exhibits many of the problems with the current scheduling algorithms.
As designs get larger, many similar problems have arisen. Introducing new primitives, as would be required in Hoe's flow, or requiring annotations with proof obligations, as would be required in [17], is generally not a satisfactory solution.
Pipeline Analysis using CF and SC
Below we provide the code for a possible implementation of the processor pipeline FIFO. This FIFO contains only a single element and is constructed from two registers: data and full. The data register holds the contents of the FIFO element. The full register is true if the FIFO is full, that is, the data register contains valid data. The FIFO contains the standard enqueue, dequeue, and first methods, along with a bypass method which is intended for bypass logic. The first and bypass methods are written using exactly the same code since they both return the contents of the FIFO. We later show how we can move the bypass function to appear to execute later in time so that it observes values written by the enqueue method. Each method has a when condition, which must be true for a rule to be able to call the method.
For example, a rule cannot call enqueue if the FIFO is full.
The FIFO is expressed as a set of module methods, which for the purpose of this paper can be assumed to be flattened into the rules that call them during the compilation process. This means that each method body is inlined into the body of the rule that calls it, and each method "when" condition is conjoined with the rule predicate. We use the method notation
Clearly, SC (sequentially composable) and CF (conflict-free) analysis are not sufficient to create a FIFO that works as a proper pipeline register.
Sufficient concurrency is not found and the value being inserted into the FIFO cannot be forwarded to another rule that executes within the same cycle. We have experimented with other FIFO designs and believe that no FIFO can be constructed from existing primitives and achieve the desired properties using the past rule-based synthesis framework. Similar issues arise in more complex designs. This problem is what motivated us to find an alternate scheduling algorithm. The reader will note that neither CF nor SC permits values to be forwarded from one rule to another. Hence, using the SC and CF scheduling methods it is impossible to correctly schedule or create a bypass function that returns the value being enqueued into the FIFO.
The Ephemeral History Register (EHR)
The basic idea behind the new scheduling approach is to permit the forwarding of values between rules that execute within the same cycle. If one rule is writing to a register, and another rule is reading the same register, then the value that is written can be forwarded to the rule that is reading the register. In this case it appears as though the rule that is reading the register executes after the rule that is writing the register, even though they are executing within the same cycle. In essence, we are dividing each clock cycle into sub-cycles and assigning rules to particular sub-cycles. The new state element is shown in Figure 3. We call it the Ephemeral History Register because it maintains a history of all writes that occur to the register within a clock cycle. Each of the values that were written can be read through one of the read interfaces. However, the history is lost at the beginning of the next cycle. We refer to the superscript index of a method as its version. For example, write^2 is version 2 of method write. The reader will note that each write method has two signals associated with it (x and en). The x input is the data input and could be a bus. The en input is a control input that indicates the method is being called and should execute. A value is not written unless the associated en signal is asserted.
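A small behavioral model may help make the versioned interfaces concrete. The sketch below is an assumption-laden illustration, not the circuit of Figure 3: it assumes versions are numbered from 0, that read^j observes the most recent enabled write^k with k < j (falling back to the stored value), and that the highest-version enabled write is the one committed at the clock edge. The class name `EHR` and the `tick` method are hypothetical.

```python
class EHR:
    """Behavioral model of an Ephemeral History Register (assumed semantics)."""

    def __init__(self, init, versions=2):
        self.value = init                  # value stored at the start of the cycle
        self.versions = versions           # number of write interfaces
        self._writes = [None] * versions   # per-version history within this cycle

    def read(self, j):
        """read^j: newest enabled write^k with k < j, else the stored value."""
        if not 0 <= j <= self.versions:
            raise ValueError("no such read version")
        for k in range(j - 1, -1, -1):
            if self._writes[k] is not None:
                return self._writes[k][0]
        return self.value

    def write(self, j, x, en=True):
        """write^j: record x in the history only if en is asserted."""
        if en:
            self._writes[j] = (x,)  # tuple so that a written None is still visible

    def tick(self):
        """Clock edge: commit the last write in version order, drop the history."""
        self.value = self.read(self.versions)
        self._writes = [None] * self.versions


r = EHR(5)
r.write(0, 7)            # some rule calls write^0
old_view = r.read(0)     # a rule scheduled "before" the writer sees 5
new_view = r.read(1)     # a rule scheduled "after" the writer sees 7
r.tick()                 # history is discarded, 7 becomes the stored value
```

In hardware the forwarding is combinational, so call order does not matter; in this sequential model a read must be evaluated after the lower-version writes it is meant to observe.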
FIFO implementation using the EHR
This section shows how the EHR can be used to implement a FIFO with the desired scheduling properties. The FIFO bypass method also has the desired property of returning the value being inserted into the FIFO. The implementation is based on the FIFO code from Section 2. However, rather than use the standard register primitive for the full and data registers, this implementation uses the EHR for both state elements. Otherwise, the only changes that are made to the code are that version numbers of the register method calls are changed. In Section 4 we show how these changes can be automatically derived through new scheduling algorithms.
In order to achieve the proper processor pipeline schedule we require the FIFO methods to satisfy the following scheduling properties:
(first < dequeue) < enqueue < bypass
It is important to note that each method individually behaves precisely as it did in the original implementation. This is important for maintaining the semantic model of guarded atomic action execution. Only when methods are called together does the behavior change, and such behavior is valid in both cases only when it can be explained as a sequential execution of the two methods (or rules). For example, the bypass method returns the current state of the data register if the FIFO contains valid data (the full register is set) and no other FIFO methods are called. However, if the enqueue method is called simultaneously, then the bypass method now returns (forwards) the value being enqueued. Previously, these two methods could not be called simultaneously. The circuit that is synthesized from the new FIFO description is shown in Figure 4. The rdy signals correspond to the values of the method when conditions. This circuit behaves precisely as the designer would expect and allows the processor pipeline to be scheduled with the desired performance. After optimizing the constant inputs to the muxes, the circuit is also equivalent to what a designer would have implemented as a pipeline stage in an RTL implementation. Thus, using the EHR as a new primitive state element, we are able to build designs that are more efficient than was previously possible in a rule-based synthesis environment. The use of the EHR is not only advantageous when designing FIFOs, but improves scheduling in more complex designs as well.
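The behavior described above can be tried out with a small Python model. This is a sketch under assumptions rather than the paper's FIFO listing: the compact `EHR` class assumes 0-based versions where read^j observes write^k for k < j, and the particular version numbers used in `EhrFifo` are one hand-chosen assignment consistent with the ordering first < dequeue < enqueue < bypass. All class and method names are hypothetical.

```python
class EHR:
    """Compact EHR model (assumed: read^j observes write^k for k < j)."""

    def __init__(self, init, versions):
        self.value, self.writes = init, [None] * versions

    def read(self, j):
        for k in range(j - 1, -1, -1):
            if self.writes[k] is not None:
                return self.writes[k][0]
        return self.value

    def write(self, j, x):
        self.writes[j] = (x,)

    def tick(self):
        # Commit the highest-version write, then discard the history.
        self.value = self.read(len(self.writes))
        self.writes = [None] * len(self.writes)


class EhrFifo:
    """Single-element FIFO; version numbers are an assumed assignment."""

    def __init__(self):
        self.data = EHR(None, versions=1)
        self.full = EHR(False, versions=2)

    # Each can_* method models the "when" condition of the method.
    def can_first(self):
        return self.full.read(0)

    def first(self):
        assert self.can_first()
        return self.data.read(0)

    def can_dequeue(self):
        return self.full.read(0)

    def dequeue(self):
        assert self.can_dequeue()
        self.full.write(0, False)

    def can_enqueue(self):
        return not self.full.read(1)   # observes a same-cycle dequeue

    def enqueue(self, x):
        assert self.can_enqueue()
        self.data.write(0, x)
        self.full.write(1, True)

    def can_bypass(self):
        return self.full.read(2)       # observes a same-cycle enqueue

    def bypass(self):
        assert self.can_bypass()
        return self.data.read(1)       # forwards the value being enqueued

    def tick(self):
        self.data.tick()
        self.full.tick()


fifo = EhrFifo()
fifo.enqueue(1)
fifo.tick()
# One cycle with all four methods, called in the schedule order:
head = fifo.first()
fifo.dequeue()
fifo.enqueue(2)
forwarded = fifo.bypass()
fifo.tick()
```

Because this model is sequential, the methods have to be invoked in the order of the schedule within a cycle; in the synthesized circuit the same forwarding happens through combinational paths.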
Flexible Scheduling Algorithm
We have shown that the Ephemeral History Register is a powerful new primitive element that allows the scheduling of designs to be improved. However, as the example in Section 3 showed, the designer still has to alter the design to achieve a desired schedule. In this framework the changes a designer has to make to satisfy scheduling constraints are limited to changing the version of method calls to EHR instances. However, this is a tedious process, and if not performed correctly can lead to designs with poor performance, or even functionally incorrect designs. The ability to easily experiment with different schedules is an important component of architectural and implementation exploration. This led us to develop a new scheduling algorithm that takes a set of rules (or methods) along with a set of scheduling constraints as input. The scheduling constraints specify desired concurrency and execution ordering among rules. As output the algorithm produces a transformed design that satisfies the scheduling constraints and is functionally equivalent to the original design. In general, any scheduling constraint can be satisfied, provided that sufficient resources (ports) are available.
There are two key components to our scheduling algorithm. The PROMOTE procedure takes a rule as input and transforms it into a new rule that is functionally equivalent to the original rule, but appears to execute later in time relative to other rules; that is, it executes in a later sub-cycle. The PROMOTE procedure accomplishes this by selectively increasing the version of method calls to EHR instances. The TSCHEDULE function is the top-level procedure that takes a set of rules and the scheduling constraints as input and produces a new design that meets the scheduling requirements. It achieves this by repeatedly calling PROMOTE on select rules so as to advance them in time until they meet the scheduling constraints. After a finite number of calls to PROMOTE, the design is guaranteed to satisfy the constraints.
The next two subsections explain these two procedures. We then show how the procedures can be used to automatically derive the design that satisfies the designer's expectation for the original processor pipeline.
PROMOTE
The idea behind the PROMOTE procedure is that we can increase the version of calls to EHR methods without altering the behavior of a rule or method call. By increasing the version we can reduce conflicts with other rules that are calling methods from the same EHR. For example, if R1 calls write^0 of an EHR and R2 calls read^0 of the same EHR, then we can achieve R1 < R2 without altering the behavior of either rule by promoting the read^0 call to read^1. However, if R2 also calls write^0, then promoting read^0 to read^1 would alter the behavior of the rule since it would now forward the value from write^0 to its own call to read^1. The way to avoid this would be to also increase the version of write^0 to write^1.
The PROMOTE procedure, as described below, accepts a rule (R) or method as input. It also accepts as input one of the method calls (x^i) that appear in the rule. The goal of the algorithm is to increase the version of x^i to x^{i+1} without altering the behavior of the rule. It returns a new rule that increases the version of x^i along with a minimal number of other method calls while maintaining the same functionality as the original rule.
One simple approach to increasing the version of an EHR method call without altering the rule behavior is to increase by the same amount all versions of calls to the same EHR within the rule. However, this is not efficient since there are cases where only a subset of the calls to an EHR need to have their version increased.
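The simple (conservative) approach just described is easy to sketch. The representation below is hypothetical: a rule is modeled as a list of `Call` records naming the EHR instance, the method, and its version, and `promote_all` raises every call to one EHR by the same amount. It deliberately does not compute the minimal set that the full PROMOTE procedure selects.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Call:
    ehr: str      # name of the EHR instance being accessed (illustrative)
    method: str   # "read" or "write"
    version: int  # superscript index of the method


def promote_all(rule: List[Call], ehr: str, amount: int = 1) -> List[Call]:
    """Raise the version of every call to `ehr` in the rule by `amount`.

    Relative versions within the rule are unchanged, so the rule cannot
    start forwarding one of its own writes to one of its own reads.
    """
    return [
        replace(c, version=c.version + amount) if c.ehr == ehr else c
        for c in rule
    ]


# Hypothetical rule that reads and writes `full` and reads `data`.
r2 = [Call("full", "read", 0), Call("full", "write", 0), Call("data", "read", 0)]
r2_promoted = promote_all(r2, "full")
```

Here the read and the write of `full` move together from version 0 to version 1, mirroring the write^0/read^0 example above, while the call to `data` is left alone.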
Step 3 in the PROMOTE algorithm spells out the precise requirements for increasing the version numbers. The guiding principle is that the version number should only be increased if it helps scheduling or if it needs to be increased to avoid altering the behavior of the rule.
PROMOTE(R, x^i) =
Assume R is a rule (or method) that makes calls to the set of methods X = {x_0, x_1, x_2, ...}, x^i ∈ X, and x^i is a call to an EHR method.
1) Let B be the minimal subset of X such that x^i ∈ B and such that all method calls y in (R - B) satisfy one of the properties: y < x^i, or x^i < y and x^{i+1} < y.
2) For each method call x^i ∈ B, replace the call to x^i in R by a call to x^{i+1}.
TSCHEDULE
The TSCHEDULE procedure takes a sequence of rules as input and transforms them into new rules that are functionally equivalent to the original rules. The new rules have the scheduling property that they can all execute simultaneously and appear to execute in the order they were listed in the input. We first show how the TSCHEDULE algorithm can be applied to a pair of rules and then show how it can be generalized to an arbitrary number of rules.
In the two-input case, the TSCHEDULE procedure accepts two rules (R_x and R_y) as input and produces a new rule R_y' that satisfies the property R_x < R_y'. The procedure always succeeds at achieving the scheduling requirement and guarantees that R_y' has the same behavior as R_y. Functional correctness is guaranteed since the only transformation that is applied to R_y is promotion, which we showed above does not alter the behavior of a rule. It is also clear that we eventually achieve the desired scheduling relationship between R_x and R_y'. The reason for this is that we repeatedly promote elements (either calls to EHR methods, or the methods of modules that both R_x and R_y call). If sufficient promotion occurs, then everything in R_y' must appear to execute "later" than R_x, and hence R_x < R_y' must be true eventually.
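The termination argument above can be illustrated with a conservative two-rule loop. This is not the TSCHEDULE listing from the paper; it is a sketch that assumes EHR semantics in which read^j observes write^k for k < j and the highest-version write is committed, and it promotes all of R_y's calls to a shared EHR together instead of choosing a minimal set. The names `Call`, `ordered`, and `tschedule_pair` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Call:
    ehr: str      # EHR instance name (illustrative)
    method: str   # "read" or "write"
    version: int


def ordered(rx: List[Call], ry: List[Call], ehr: str) -> bool:
    """True if, on this EHR, ry appears to execute after rx (assumed rules)."""
    xs = [c for c in rx if c.ehr == ehr]
    ys = [c for c in ry if c.ehr == ehr]
    for x in xs:
        for y in ys:
            # ry must observe (reads) or override (writes) every write of rx.
            if x.method == "write" and y.version <= x.version:
                return False
            # rx must not observe a write made by ry.
            if x.method == "read" and y.method == "write" and y.version < x.version:
                return False
    return True


def tschedule_pair(rx: List[Call], ry: List[Call]) -> List[Call]:
    """Return ry' with rx < ry', promoting whole EHRs of ry as needed."""
    shared = sorted({c.ehr for c in rx} & {c.ehr for c in ry})
    for ehr in shared:
        # Each pass raises ry's versions by one, so the loop must terminate
        # once they exceed every version used by rx.
        while not ordered(rx, ry, ehr):
            ry = [replace(c, version=c.version + 1) if c.ehr == ehr else c
                  for c in ry]
    return ry


# Hypothetical flattened rules: one enqueues, the other calls bypass.
enq = [Call("full", "read", 0), Call("data", "write", 0), Call("full", "write", 0)]
byp = [Call("full", "read", 0), Call("data", "read", 0)]
byp_scheduled = tschedule_pair(enq, byp)
```

Only R_y is rewritten, and only by uniform promotion, which matches the argument that promotion preserves a rule's behavior; a real implementation would also have to check that the EHR instances provide enough ports for the resulting versions.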
TSCHEDULE(R_x, R_y) = ...
Processor pipeline scheduling
The TSCHEDULE and PROMOTE algorithms are best illustrated through an example. In Table 1 we show a simulation of the algorithm applied to the processor pipeline. As input the TSCHEDULE algorithm takes the three rules we were concerned with.
Conclusion
In this paper we presented a new scheduling algorithm for rule-based designs that significantly improves on previous methods. These are general algorithms that can be applied to many designs. They are particularly useful for large designs that require flexibility in scheduling without risking incorrect functional behavior.
The algorithms are made possible through the use of a new state element, the Ephemeral History Register. This new primitive element makes forwarding of values from one rule to another possible, while maintaining the semantics of guarded atomic actions. As an example of the power of the EHR and scheduling algorithm, we presented a processor pipeline and showed that we were able to build the pipeline FIFOs using the EHR, something that previously could not be done using only primitive elements. This FIFO was interesting because it was implemented using only a single storage element, allowed simultaneous enqueue and dequeue, and allowed the value that was being enqueued to be bypassed to another rule.
The scheduling algorithms are useful because they allow a designer to precisely specify how rules should be scheduled. The compiler then takes these requirements and transforms the design to meet the constraints. By providing incorrect constraints, the designer might not achieve the desired performance, but will never cause the design to become functionally incorrect. This contrasts with the previous compilation flow where a designer had to compile a design and then observe the scheduling results. It was often difficult to understand what was limiting the scheduling performance, and once the scheduling problem was discovered, the code had to be rewritten to achieve the desired performance. This was a time-consuming and error-prone process.
Both the EHR and the scheduling algorithms are powerful new mechanisms that we have shown to be practical through the processor pipeline example.
