Most hardware description frameworks, whether schematic or textual, use cooperating finite state machines (CFSM) as the underlying abstraction. In the CFSM framework, a designer explicitly manages the concurrency by scheduling the exact cycle-by-cycle interactions between multiple concurrent state machines. Design mistakes are common in coordinating interactions between two state machines because transitions in different state machines are not semantically coupled. It is also difficult to modify one state machine without considering its interaction with the rest of the system. This paper presents a method for hardware synthesis from an "operation centric" description, where the behavior of a system is described as a collection of "atomic" operations in the form of rules. Typically, a rule is defined by a predicate condition and an effect on the state of the system. The atomicity requirement simplifies the task of hardware description by permitting the designer to formulate each rule as if the rest of the system is static.
Introduction

Operation-Centric Hardware Descriptions
Digital hardware designs inherently embody highly concurrent behaviors. Any non-trivial design invariably consists of a collection of cooperating finite state machines (CFSM). Hence, most hardware description frameworks, whether schematic or textual, use CFSM as the underlying abstraction. In a CFSM framework, a designer explicitly manages the concurrency by scheduling the exact cycle-by-cycle interactions between multiple concurrent state machines. Design mistakes are common in coordinating interactions between two state machines because transitions in different state machines are not semantically coupled. It is also difficult to modify one state machine without considering its interaction with the rest of the system.
This paper presents a method for hardware synthesis from an "operation centric" description, where the behavior of a system is described as a collection of "atomic" operations in the form of rules. Typically, a rule is defined by a predicate condition and an effect on the state of the system. In an execution, a rule "reads" the state of the system in one step, and if enabled, the effect of the rule updates the state in the same step. If several rules are enabled at the same time, any one of the rules can be nondeterministically selected to update the state in one step, and afterwards, a new step begins with the updated state. The atomicity requirement simplifies the task of hardware description by permitting the designer to formulate each rule as if the rest of the system is static.
Describing the instruction reorder buffer (ROB) of a modern out-of-order microprocessor poses a great challenge if concurrency needs to be managed explicitly. An operation-centric description captures the behavior of an ROB more perspicuously as a collection of rules for operations like dispatch, complete, commit, etc. [1].
For example, the dispatch operation is specified to take place if there exists an instruction that has all of its operands and is waiting to execute, and furthermore, the execution unit needed by the instruction is available. The effect of the dispatch operation is to send the instruction to the execution unit. The rule specification of the dispatch operation does not have to include information about how to resolve potential conflicts arising from the concurrent execution with other operations.
The sequential and atomic interpretation of a description does not prevent a legal implementation from executing several rules concurrently in a clock cycle, provided some sequential execution of those rules can reproduce the behavior of the concurrent execution. In fact, detecting and scheduling valid concurrent execution of rules is the central issue in hardware synthesis from operation-centric descriptions.
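The execution model described above can be sketched in a few lines of Python. This is only an illustrative model of the semantics, not TRSPEC or anything produced by TRAC: the `Rule` class, the `step` and `run` helpers, and the toy GCD rules are all assumptions introduced for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]

@dataclass
class Rule:
    name: str
    predicate: Callable[[State], bool]  # firing condition on the current state
    effect: Callable[[State], State]    # computes the updated state

def step(state: State, rules: List[Rule], rng: random.Random) -> State:
    """One atomic step: fire a single enabled rule, chosen nondeterministically."""
    enabled = [r for r in rules if r.predicate(state)]
    if not enabled:
        return state                    # no rule applies; the state is unchanged
    return rng.choice(enabled).effect(state)

def run(state: State, rules: List[Rule], seed: int = 0, max_steps: int = 10_000) -> State:
    """Repeat atomic steps until no rule is enabled (or max_steps is reached)."""
    rng = random.Random(seed)
    for _ in range(max_steps):
        if not any(r.predicate(state) for r in rules):
            break
        state = step(state, rules, rng)
    return state

# Hypothetical toy system (not from the paper): Euclid's GCD as two rules.
gcd_rules = [
    Rule("swap", lambda s: s["a"] < s["b"],
         lambda s: {"a": s["b"], "b": s["a"]}),
    Rule("subtract", lambda s: s["a"] >= s["b"] and s["b"] != 0,
         lambda s: {"a": s["a"] - s["b"], "b": s["b"]}),
]

result = run({"a": 15, "b": 6}, gcd_rules)
```

Each rule is written as if nothing else in the system changes while it fires, which is the simplification the atomicity requirement is meant to provide.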
Behavioral descriptions typically describe hardware, or hardware/software systems, as multiple threads of computation that communicate via a message-passing or shared-memory paradigm [13, 4, 14, 17, 5]. As in CFSM frameworks, designers of behavioral descriptions still need to manage the interactions between concurrent computations explicitly. In reconfigurable computing, both sequential and parallel programming paradigms have been used to capture functionalities for hardware implementation.
Paper Organization: This section introduced the concept and advantages of operation-centric hardware description. The next section presents an example. Section 3 explains the synthesis of operation-centrically described hardware, while Section 4 explains the concurrent scheduling of conflict-free rules. Section 5 presents a comparison of designs synthesized from operation-centric descriptions vs. hand-coded RTL descriptions. Section 6 summarizes the key contributions of this paper.
An Operation-Centric Example
2.1 Description of a Pipelined Processor
We describe a two-stage pipelined processor where a pipeline buffer is inserted between the fetch stage and the execute stage. We use a bounded FIFO of unspecified size to model the pipeline buffer. The FIFO provides the isolation to allow the operations in the two stages to be described independently. Although the description reflects an asynchronous and elastic pipeline, our synthesis can infer a legal implementation that is fully synchronous and has stages separated by simple registers.
Our operation-centric description framework borrows the notation of Term Rewriting Systems (TRS) [2]. A two-stage pipelined processor can be specified as a TRS whose terms have the signature Proc(pc,rf,bf,imem,dmem). The five fields of the processor term are: pc, the program counter; rf, the register file (an array of integer values); bf, the pipeline buffer (a FIFO of fetched instructions); imem, the instruction memory (an array of instructions); and dmem, the data memory (an array of integer values).
Instruction fetching in the fetch stage can be described by the Fetch rule, which enqueues the instruction that pc addresses in imem into bf and increments pc.
The Fetch rule performs a weak form of branch speculation by always incrementing pc. Consequently, in the execute stage, if a branch is resolved to be taken, besides setting pc to the branch target, all speculatively fetched instructions in bf need to be discarded. In this pipeline description, the Fetch rule and an execute rule can be ready to fire simultaneously. Even though conceptually only one rule should be fired in each step, an implementation of this processor description must carry out the effect of both rules in the same clock cycle. Without concurrent execution, the implementation does not behave like a pipeline. However, the implementation must also ensure that a concurrent execution of multiple rules produces the same result as a sequential execution. In particular, consider the concurrent firing of the Fetch rule and the Bz-Taken Exec rule. Both rules affect pc and bf. In such a case, the implementation has to guarantee that these rules fire in some sequential order. The choice of ordering determines how many bubbles are inserted after a taken branch, but it does not affect the processor's ability to correctly execute a program.
State-Transformer View
In a TRS, the state of the system is represented by a collection of values, and a rule rewrites values to values. Given a collective state value s, a TRS rule computes a new value s' such that
$s' = \text{if } \pi(s) \text{ then } \delta(s) \text{ else } s$
where the $\pi$ function captures the firing condition and the $\delta$ function captures the effect of a rule. It is also possible to view a rule as a state-transformer in a state-based system.
In this paper, we are going to concentrate on the synthesis of state-based systems with three types of state elements: registers (R), arrays (A) and FIFOs (F). The state elements are depicted in Figure 1. A register can store an integer value up to a specified maximum word size. The value stored in a register can be referenced using the side-effect-free get() query and updated to v using the set(v) action. The entry of an array can be referenced using the side-effect-free a-get(idx) query and updated to v using the a-set(idx,v) action. The oldest value in a FIFO can be referenced using the side-effect-free first() query, and can be removed by the deq() action. A new value v can be added to a FIFO using the enq(v) action. In addition, the contents of a FIFO can be cleared using the clear() action. The status of a FIFO can be queried using the side-effect-free notfull() and notempty() queries.
In the state-transformer view, the applicability of a rule is determined by computing the $\pi$ function on the current state. However, the next-state logic consists of a set of actions that alter the contents of the state elements to match $\delta(s)$. The processor rules in Section 2.1 can be restated in terms of actions:
Null actions, represented as $\epsilon$, on a state element are omitted from the action list above. The complete list of actions implied by the Add Execute rule is $\alpha_{Add} = (\alpha_{pc}, \alpha_{rf}, \alpha_{bf}, \alpha_{imem}, \alpha_{dmem})$ where $\alpha_{pc}$, $\alpha_{imem}$ and $\alpha_{dmem}$ are $\epsilon$'s.
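The three state element types and their queries and actions can be modeled as small Python classes. This is a sketch only. The method names follow the text (with `a_get`/`a_set` standing in for a-get/a-set), while the `fetch` function is an assumed encoding of the Fetch rule as a firing condition plus a list of actions, based on the prose description rather than on the original rule text.

```python
from collections import deque

class Register:
    def __init__(self, value=0):
        self._v = value
    def get(self):                 # side-effect-free query
        return self._v
    def set(self, v):              # action
        self._v = v

class Array:
    def __init__(self, size, fill=0):
        self._cells = [fill] * size
    def a_get(self, idx):          # side-effect-free query
        return self._cells[idx]
    def a_set(self, idx, v):       # action
        self._cells[idx] = v

class Fifo:
    def __init__(self, depth):
        self._depth = depth
        self._q = deque()
    def first(self):               # oldest value, side-effect-free
        return self._q[0]
    def deq(self):                 # remove the oldest value
        self._q.popleft()
    def enq(self, v):              # add a new value
        if not self.notfull():
            raise OverflowError("enq on a full FIFO")
        self._q.append(v)
    def clear(self):
        self._q.clear()
    def notfull(self):
        return len(self._q) < self._depth
    def notempty(self):
        return len(self._q) > 0

def fetch(pc, bf, imem):
    """Assumed Fetch rule: returns True if the rule was applicable and fired."""
    if not bf.notfull():           # firing condition: room in the pipeline buffer
        return False
    cur = pc.get()                 # read every value from the current state first
    inst = imem.a_get(cur)
    pc.set(cur + 1)                # then apply all actions of the rule together
    bf.enq(inst)
    return True
```

Reading all values before applying any action mirrors the requirement that a rule observes a single state and updates it in one atomic step.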
Hardware State Machine Synthesis
Implementing an operation-centric TRS description as a finite-state machine (FSM) involves combining the actions of all rules to form the FSM's next-state logic. The actions of a rule need to be qualified by the rule's $\pi$ signal. For performance reasons, an implementation should carry out multiple rules concurrently while still maintaining a behavior that is consistent with a sequential execution of the atomic operations that the rules represent. We will describe such a concurrent scheduler in the next section.
One straightforward implementation of an ATS is an FSM that executes one transition per clock cycle. The elements of S are the state of the FSM. The transitions in X are combined to form the next-state logic of the FSM in three steps.
Step 1: All value expressions in the ATS are mapped to combinational signals on the current state of the state elements. In particular, this step creates a set of signals, $\pi_{T_1}, \ldots, \pi_{T_M}$, that are the $\pi$ signals of transitions $T_1, \ldots, T_M$ of an M-transition ATS. The logic mapping in this step assumes all required combinational resources are available. RTL optimizations can be employed to simplify the combinational logic and to share duplicated logic.
Step 2: A scheduler is created to generate the set of arbitrated enable signals, $\phi_{T_1}, \ldots, \phi_{T_M}$, based on $\pi_{T_1}, \ldots, \pi_{T_M}$. The reference implementation scheduler asserts only one $\phi$ signal in each clock cycle, reflecting the selection of one applicable transition. A priority encoder is a valid scheduler for the reference implementation.
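A priority-encoder scheduler of this kind can be modeled as a function from the applicability signals to one-hot enable signals. The sketch below is a behavioral model in Python rather than synthesizable RTL, and the function name and list-of-booleans representation are assumptions made for illustration.

```python
def priority_encoder(pi):
    """Map applicability signals to enable signals, asserting at most one.

    pi[i] is True when transition i is applicable in the current state.
    The lowest-indexed applicable transition is the one that is enabled.
    """
    phi = [False] * len(pi)
    for i, applicable in enumerate(pi):
        if applicable:
            phi[i] = True
            break                  # fixed priority: stop at the first match
    return phi
```

Because the priority order is fixed, the same transition always wins when several are applicable, so the resulting implementation realizes only one of the behaviors that the nondeterministic description allows.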
Step 3: Figure 4 illustrates the merge circuit for a register that can be affected by the set actions from two transitions. The scheme assumes at most one transition's action needs to be applied to a particular element in a clock cycle. Furthermore, all the actions of a selected transition should be enabled in the same clock cycle to achieve the appearance of an atomic transition. The merge circuits for the three state element types are given next as RTL equations. For each R, the set of transitions that update R is $\{ T_{x_i} \mid \alpha^{R}_{T_{x_i}} = set(exp_{x_i}) \}$ where $\alpha^{R}_{T_{x_i}}$ is the action by $T_{x_i}$ on R.
The reference implementation is deterministic. In other words, the implementation can only embody one of the behaviors allowed by the ATS. Thus, the implementation can enter a livelock if the ATS depends on non-determinism to make progress. The reference implementation can use a round-robin priority encoder to ensure weak fairness, that is, if a transition remains applicable for a sufficient number of consecutive cycles then it is guaranteed to be selected at least once.
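Two pieces of this step lend themselves to a short behavioral sketch: merging the set actions of several transitions onto one register, and a round-robin encoder that provides weak fairness. Both are Python models written for illustration under the stated assumption that at most one writer is enabled per cycle. They are not the paper's RTL equations, and the names `merge_register` and `RoundRobinEncoder` are hypothetical.

```python
def merge_register(current, requests):
    """Next value of a register given (phi, value) pairs from its writers.

    Assumes the scheduler enables at most one writing transition per cycle.
    """
    asserted = [value for phi, value in requests if phi]
    if len(asserted) > 1:
        raise ValueError("more than one set action enabled in the same cycle")
    return asserted[0] if asserted else current   # hold the value when idle

class RoundRobinEncoder:
    """Priority encoder whose highest-priority position rotates each cycle."""

    def __init__(self, n):
        self.n = n
        self.start = 0             # index that has top priority this cycle

    def select(self, pi):
        phi = [False] * self.n
        for k in range(self.n):
            i = (self.start + k) % self.n
            if pi[i]:
                phi[i] = True
                # the selected transition gets the lowest priority next time
                self.start = (i + 1) % self.n
                break
        return phi
```

With this rotation, a transition that stays applicable is passed over at most n - 1 times before it is selected, which is one way to obtain the weak-fairness property described above.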
Although the semantics of an ATS require an execution in sequential and atomic update steps, a hardware implementation can exploit the underlying parallelism and execute multiple transitions concurrently in one clock cycle. For a pipelined processor, it is necessary to execute transitions for different pipeline stages concurrently to achieve pipelined execution.
Concurrent Scheduling of Conflict-Free Transitions
In a multiple-transitions-per-cycle implementation, the state transition in each clock cycle must correspond to a sequential execution of the ATS transitions in some order. If two transitions $T_a$ and $T_b$ become applicable in the same clock cycle when S is in state s, $\pi_{T_a}(\delta_{T_b}(s))$ or $\pi_{T_b}(\delta_{T_a}(s))$ must be true for an implementation to correctly select both transitions for execution. Otherwise, executing both transitions would be inconsistent with any sequential execution in two atomic update steps.
There are two approaches to execute the actions of $T_a$ and $T_b$ in the same clock cycle. The first approach cascades the combinational logic from the two transitions. However, arbitrary cascading does not always improve circuit performance since it may lead to a longer cycle time. In our approach, $T_a$ and $T_b$ are executed in the same clock cycle only if the correct final state can be reconstructed from an independent and parallel evaluation of their combinational logic on the same starting state. This section develops a scheduling algorithm based on the conflict-free relationship (<>CF). <>CF is a symmetrical relationship that imposes a stronger requirement than necessary for executing two transitions concurrently. However, the symmetry of <>CF permits a straightforward implementation that concurrently executes multiple transitions if they are pairwise <>CF. An analysis based on the Sequential Composability (<SC) relationship can further increase hardware concurrency [10]. The intuition behind <SC, an asymmetrical relationship, is that concurrent execution does not need to produce the same result as all possible sequential executions, just one.
Conflict-Free Transitions
The conflict-free relationship and the parallel composition function PC are defined in Definition 1 and Definition 2.
Definition 1 (Conflict-Free Relationship)
Two transitions Ta and Tb are said to be conflict-free (Ta <>CF Tb) if pcR(aRl,bRI) 
Static Deduction of <>CF
The scheduling algorithm given in this section can work with a conservative test for <>CF, that is, if the test fails to identify a pair of transitions as <>CF, the algorithm might generate a less optimal, but still correct implementation. A static determination of <>CF can be made by comparing the domains and ranges of the transitions. The domain of a transition is the set of state elements in S "read" by the expressions in either $\pi$ or $\alpha$. The domain of a transition can be further sub-classified as $\pi$-domain and $\alpha$-domain depending on whether the state element is read by the $\pi$-expression or an expression in $\alpha$. The range of a transition is the set of state elements in S that are acted on by $\alpha$. For this analysis, the head and the tail of a FIFO are considered to be separate elements. Using D(T) and R(T), a sufficient condition that ensures two transitions are <>CF is given by the following theorem.
Theorem 2 (Sufficient Condition for <>CF)
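Given $T_a$ and $T_b$, $(D(T_a) \cap R(T_b) = \emptyset) \wedge (D(T_b) \cap R(T_a) = \emptyset) \wedge (R(T_a) \cap R(T_b) = \emptyset)$ implies $T_a$ <>CF $T_b$.
If the domain and range of two transitions do not overlap, then the two transitions do not have any data dependences. Since their ranges do not overlap, a valid parallel composition of $\alpha_{T_a}$ and $\alpha_{T_b}$ must exist.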
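The domain and range comparison amounts to three set-disjointness checks. The Python sketch below illustrates that test. The `Transition` class and the domain and range sets listed for the three processor rules are assumptions inferred from the prose (for example, that the clear() action affects both the head and the tail of bf), not values taken from TRAC.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    name: str
    domain: frozenset   # state elements read by the pi or alpha expressions
    range_: frozenset   # state elements acted on by alpha

def conflict_free(ta, tb):
    """Conservative static test: True only when no domain/range overlap exists."""
    return (ta.domain.isdisjoint(tb.range_)
            and tb.domain.isdisjoint(ta.range_)
            and ta.range_.isdisjoint(tb.range_))

# Assumed domains and ranges; the FIFO head and tail are separate elements.
fetch = Transition("Fetch",
                   frozenset({"pc", "imem", "bf.tail"}),
                   frozenset({"pc", "bf.tail"}))
add_exec = Transition("Add Exec",
                      frozenset({"bf.head", "rf"}),
                      frozenset({"rf", "bf.head"}))
bz_taken = Transition("Bz-Taken Exec",
                      frozenset({"bf.head", "rf"}),
                      frozenset({"pc", "bf.head", "bf.tail"}))
```

Under these assumed sets, `conflict_free(fetch, add_exec)` holds because fetching touches only the tail of the buffer while execution touches only the head, whereas `conflict_free(fetch, bz_taken)` fails because both transitions act on pc and the tail of bf.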
Definition 3 (Mutually Exclusive Relationship)
If two transitions never become applicable on the same state, then they are said to be mutually exclusive, i.e., $T_a$ <>ME $T_b$ if $\forall s.\ \neg(\pi_{T_a}(s) \wedge \pi_{T_b}(s))$. Two transitions that are <>ME satisfy the definition of <>CF trivially. An exact test for <>ME requires determining the satisfiability of the expression $(\pi_{T_a}(s) \wedge \pi_{T_b}(s))$.
Fortunately, the $\pi$ expression is usually a conjunction of relational constraints on the current values of state elements.
A conservative test that scans two $\pi$ expressions for contradicting constraints on any one state element works well in practice.
Figure 5 gives the corresponding conflict graph where two nodes are connected if they are not <>CF, i.e. two unconnected nodes $T_i$ and $T_j$ imply $T_i$ <>CF $T_j$. The conflict graph has three connected components, corresponding to the three <>CF scheduling groups. The $\phi$ signals corresponding to $T_1$, $T_4$ and $T_6$ can be generated using a priority encoding of their corresponding $\pi$'s. Scheduling group 2 also requires a scheduler to ensure $\phi_{T_2}$ and $\phi_{T_5}$ are not asserted in the same clock cycle. However, $\phi_{T_3} = \pi_{T_3}$ without any arbitration.
Enumerated Scheduler: Scheduling group 1 in Figure 5 contains three transitions $\{T_1, T_4, T_6\}$ such that $T_1$ <>CF $T_6$ but neither $T_1$ nor $T_6$ is <>CF with $T_4$. Although the three transitions cannot be scheduled independently of each other, $T_1$ and $T_6$ can be selected together as long as $T_4$ is not selected in the same clock cycle. This selection is valid because $T_1$ and $T_6$ are <>CF between themselves and every transition selected by the other groups. In general, the scheduler for each group can independently select multiple transitions that are pairwise <>CF within the scheduling group. Scheduling group 1 can use an enumerated encoder (Figure 6) that allows $T_1$ and $T_6$ to execute concurrently. The construction of an enumerated encoder is not necessarily unique. For instance, in this example, row "011" in Figure 6 could also contain the data value "001".
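The partitioning into scheduling groups and the selection of pairwise conflict-free transitions within a group can be sketched as follows. This is an illustrative Python model, not the encoder that TRAC generates: the conflict edges are an assumption chosen to be consistent with the groups named in the text, and the greedy fixed-priority selection is only one of several valid choices.

```python
def scheduling_groups(nodes, conflicts):
    """Connected components of the conflict graph, one list per group."""
    adj = {n: set() for n in nodes}
    for a, b in conflicts:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for n in nodes:
        if n in seen:
            continue
        seen.add(n)
        stack, comp = [n], []
        while stack:                       # depth-first traversal
            cur = stack.pop()
            comp.append(cur)
            for nb in sorted(adj[cur]):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(sorted(comp))
    return groups

def select(group, pi, conflicts):
    """Greedily enable applicable transitions that are pairwise conflict-free."""
    conflict_set = {frozenset(c) for c in conflicts}
    chosen = []
    for t in group:                        # list order acts as the priority
        if pi[t] and all(frozenset((t, c)) not in conflict_set for c in chosen):
            chosen.append(t)
    return chosen

# Assumed conflict edges, consistent with the three groups described above.
nodes = ["T1", "T2", "T3", "T4", "T5", "T6"]
conflicts = [("T1", "T4"), ("T4", "T6"), ("T2", "T5")]
groups = scheduling_groups(nodes, conflicts)
```

Each group can then be scheduled on its own. When every transition in group 1 is applicable, this greedy choice enables T1 and T6 together and holds back T4, whereas a different priority order could legitimately enable T4 alone.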
Performance Gain
When X can be partitioned into scheduling groups, the partitioned scheduler is smaller and faster than the monolithic encoder used in the reference implementation. The partitioned scheduler also reduces wiring cost and delay since $\pi$'s and $\phi$'s of unrelated transitions are not brought together for arbitration.
The property of the parallel composition function ensures that transitions are <>CF only if their actions on state elements do not conflict. Hence, the state update logic from the reference implementation can be used with a <>CF scheduler without any modification, and consequently, combinational delay of the next-state logic is not increased by this optimization. All in all, the <>CF-scheduled implementation achieves better performance than the reference implementation by allowing more transitions to execute in a clock cycle without increasing the cycle time.
Synthesis Results
The synthesis procedures in the previous section have been implemented in the Term Rewriting Architectural Compiler (TRAC). TRAC accepts TRSPEC descriptions and outputs synthesizable structural descriptions in the Verilog Hardware Description Language [18]. The TRSPEC language is an adaptation of TRS for operation-centric hardware description [11]. This section discusses the synthesis of a five-stage pipelined implementation of the MIPS R2000 ISA (as described in [12]). The TRSPEC description implements all of the MIPS R2000 integer ISA except: multiply/divide; partial-word or non-aligned load/stores; coprocessor interfaces; privileged and exception modes. The delay semantics of the memory load and branch/jump instructions have also been removed. The TRSPEC description can be compiled by TRAC into a synthesizable Verilog RTL description in less than 15 seconds on a 266 MHz Pentium II processor. The TRAC-generated Verilog description can then be compiled by Synopsys Design Compiler to target both Synopsys CBA and LSI Logic 10K Series technology libraries.
Input and Output
The example from Section 2.1 described a simple processor whose instruction memory and data memory are storage arrays internal to the system. The description can be synthesized, as is, to a processor with an internal instruction ROM and an internal data RAM. However, as a realistic design for synthesis, the MIPS processor accesses external memory through input and output ports. TRSPEC allows I/O semantics to be assigned to terms as part of the type definition for a term.
Synchronous Pipeline Synthesis
As in the processor from Section 2.1, the MIPS processor is described as an asynchronous and elastic pipeline. The description of the processor does not depend on the exact depth of the pipeline FIFOs. This allows TRAC to instantiate one-deep FIFOs, i.e. a single register, as pipeline buffers. Flow control logic is added to ensure a FIFO is not overflowed or underflowed by enqueue and dequeue actions. In a naive construction, the one-deep FIFO is full if its register holds valid data; the FIFO is empty if its register holds a bubble. With only local flow control between neighboring stages, the overall pipeline would contain a bubble in every other stage in a steady-state execution. For example, if pipeline buffers K and K + 1 are occupied and buffer K + 2 is empty in some clock cycle, the operation in stage K + 1 would be enabled to advance at the clock edge, but the operation in stage K is held back because buffer K + 1 appears full during the clock cycle. The operation in stage K is not enabled until the next clock cycle when buffer K + 1 has been emptied.
TRAC creates flow control logic that includes a combinational multi-stage feedback path that propagates from the last pipeline stage to the first pipeline stage. The cascaded feedback scheme shown in Figure 7 allows stage K to advance both when pipeline buffer K + 1 is actually empty and when buffer K + 1 is going to be dequeued at the coming clock edge. This scheme allows the entire pipeline to advance.
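The difference between local flow control and cascaded feedback can be seen in a small cycle-level simulation of a chain of one-deep FIFOs. The model below is a Python sketch under simplifying assumptions that are not stated in the paper: the source always has an item to offer, the sink always accepts, and stage latency is ignored. The function name and its parameters are hypothetical.

```python
def simulate(n_buffers, cycles, cascaded):
    """Count the items leaving the last buffer of a chain of one-deep FIFOs."""
    valid = [False] * n_buffers            # valid[k]: buffer k holds data
    delivered = 0
    for _ in range(cycles):
        can_enq = [False] * n_buffers
        will_deq = [False] * n_buffers
        # Evaluate from the last buffer back to the first, like the feedback path.
        for k in reversed(range(n_buffers)):
            downstream_ready = True if k == n_buffers - 1 else can_enq[k + 1]
            will_deq[k] = valid[k] and downstream_ready
            if cascaded:
                # accept new data if empty OR about to be dequeued this cycle
                can_enq[k] = (not valid[k]) or will_deq[k]
            else:
                # naive: accept new data only if currently empty
                can_enq[k] = not valid[k]
        delivered += int(will_deq[-1])
        # Clock edge: apply all dequeues and enqueues at once.
        nxt = []
        for k in range(n_buffers):
            will_enq = can_enq[0] if k == 0 else will_deq[k - 1]
            nxt.append((valid[k] and not will_deq[k]) or will_enq)
        valid = nxt
    return delivered

naive = simulate(4, 20, cascaded=False)
feedback = simulate(4, 20, cascaded=True)
```

In this model the naive scheme settles into a pattern with a bubble in every other buffer and delivers an item every second cycle, while the cascaded scheme fills the pipeline and then delivers an item every cycle.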
Analysis and Discussion
The table in Figure 8 summarizes the pre-layout area and speed estimates reported by Synopsys. The row labeled "TRSPEC" characterizes the implementation synthesized from the TRSPEC description. The row labeled "Hand-coded RTL" characterizes the implementation synthesized from a hand-coded Verilog description of the same microarchitecture. The data indicates that the TRSPEC description results in an implementation that is similar in size and speed to the result of the hand-coded Verilog description. This similarity should not be surprising because, after all, both descriptions are describing the same microarchitecture, albeit using very different design abstractions and methodologies. The same conclusion has also been reached on comparisons of other designs and when we targeted the designs for implementation on FPGAs [10]. The TRSPEC and the hand-coded Verilog descriptions are similar in length (790 vs. 930 lines of source code), but the TRSPEC description was developed in less than one day (eight hours), whereas the hand-coded Verilog description required nearly five days to complete. The TRSPEC description can be translated in a literal fashion from an ISA manual, whereas the hand-coded Verilog description has a much weaker correlation to the ISA specification. The hand-coded RTL description also requires circuit implementation information, which the RTL designer has to improvise. This not only creates more work for the RTL designer but also creates more opportunities for error. In a TRSPEC design flow, the designer can rely on TRAC to correctly supply the implementation-related information.
Conclusion
The operation-centric view of hardware has existed in many forms of informal hardware specification, usually to convey high-level architectural concepts. This research improves the usefulness of an operation-centric hardware description by developing a formal description framework and by enabling automatic synthesis to an efficient circuit implementation. The results of this paper show that an operation-centric framework offers a significant reduction in design time and effort without loss in implementation quality.
