Retiming is an optimization technique for synchronous circuits introduced by Leiserson and Saxe in 1983. Although powerful, retiming is not very widely used because it does not satisfactorily handle circuits whose registers have load enable, synchronous and asynchronous set/clear inputs. We propose an extension of retiming whose basis is the characterization of registers into register classes. The new approach, called multiple-class retiming, handles circuits with an arbitrary number of register classes. We present results on a set of industrial FPGA designs showing the effectiveness and efficiency of multiple-class retiming.
Introduction
Retiming is a powerful optimization technique for synchronous circuits that was introduced by Leiserson and Saxe in 1983 [8]. It consists of moving the sequential elements (registers) in a circuit while preserving its I/O behavior. Retiming can be used (1) to reduce the clock period of a circuit (minperiod retiming) and (2) to reduce its number of registers while achieving a given clock period (minarea retiming); the latter is of most practical interest.
Since the seminal work by Leiserson and Saxe [8, 9], many researchers have contributed to the theoretical and practical aspects of retiming. Originally designed to handle edge-triggered flip-flops, retiming has been extended to also handle multi-phase level-clocked latches [6, 10]. Efficient implementations [16, 12, 11] have made retiming applicable to large circuits. Important contributions have been made to apply retiming to circuits with reset states [19, 4, 18, 13]. Finally, it has been shown that retiming can be used together with existing combinational optimization techniques [14, 2, 3, 15, 5] to further improve circuit performance.
Despite its proven effectiveness and efficiency, retiming has not been very widely used in industrial logic synthesis tools. One of the main technical reasons for this is that most available retiming packages do not satisfactorily handle the circuits that engineers really design today. In practice, these packages work well only on circuits whose registers have neither synchronous or asynchronous set/clear inputs nor a synchronous load enable input.
However, most modern technologies offer registers with asynchronous and/or synchronous reset inputs, as well as a synchronous load enable input (also called clock enable). For instance, every logic block in a Xilinx XC4000 FPGA contains two D-type edge-triggered flip-flops with asynchronous reset and synchronous load enable inputs which can be connected to arbitrary signals [20]. As shown by the results presented in Section 6, fully exploiting these capabilities is absolutely mandatory to achieve high design quality. This is illustrated in Fig. 1 on a circuit that has two registers with load enable inputs. To apply existing retiming approaches, complex registers are transformed into simple registers with some additional logic to implement the synchronous load enable and reset behaviors. This transformation turns circuit a) into c), which is larger than a). Note that a register with asynchronous reset input has no equivalent synchronous circuit with a simple register and additional logic. Moving the simple registers forward results in circuit d). It can be seen that applying this retiming step results in an additional area cost of two registers and two multiplexors.
Camposano and Ploger showed [1] that registers can be moved together with their load enable inputs if they are connected to the same load enable signal. For instance, both registers in Fig. 1a) have the same synchronous load enable signal and thus can be moved forward together with their EN input to produce circuit b), which is much smaller than circuit d). Similar conditions for registers with asynchronous and synchronous reset inputs were presented by Singhal et al. [18]. However, both works only discuss the conditions for a single retiming step and do not present a comprehensive approach for computing a retiming solution. A first general approach to this problem was proposed by Legl et al. in [7], but they did not present any implementation showing that minperiod and minarea could both be solved in an effective and efficient way.
In this paper we present a practical and comprehensive approach called multiple-class retiming, or mc-retiming, which allows us to efficiently and effectively compute a minperiod or minarea retiming solution for circuits designed with a variety of different registers. Mc-retiming is an extension of retiming that manipulates complex registers. The registers are classified into register classes which are used to determine how far backward and forward each register can be moved in the circuit. This information is then used to map the problem of multiple-class retiming into a basic retiming problem which can be efficiently solved using existing retiming approaches. Thus, the big advantage of mc-retiming is that it can reuse many of the efficient techniques available for basic retiming. After giving background information on basic retiming in Section 2, we introduce in Section 3 the multiple-class retiming problem using a retiming graph model in which we classify the registers into register classes. In Section 4 we show how to map the multiple-class retiming problem into a basic retiming problem. Section 5 presents an efficient implementation of multiple-class retiming in which we reuse existing basic retiming approaches. Finally, we present in Section 6 experimental results obtained with this implementation on a set of industrial FPGA designs.
Basic Retiming
The basic retiming approach presented by Leiserson and Saxe [9] handles sequential circuits whose registers are controlled by a single clock and possibly have reset values. A sequential circuit is represented by a vertex-weighted, edge-weighted, directed graph G = (V, E, d, w), called a retiming graph. Each combinational gate and each primary input and output port is modeled by a vertex v ∈ V. An edge e_{uv} models a connection from an output of gate u to an input of gate v, passing through an arbitrary number of registers. A retiming of G is subject to circuit constraints and, for a target clock period, to period constraints. The satisfiability of these constraints and an appropriate set of retiming values can be efficiently computed, e.g., by the FEAS algorithm [9]. Using the FEAS algorithm and binary search, it is easy to compute the minimum feasible clock period φ_min.
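The feasibility test and the binary search mentioned above can be sketched as follows. This is a minimal illustration of FEAS in the spirit of [9], not the implementation used in this work. It assumes integer gate delays, a graph given as a dictionary from edges (u, v) to register counts, and a plain integer binary search; the names `clock_period`, `feas` and `min_period` are hypothetical.

```python
from collections import defaultdict


def clock_period(vertices, edges, delay, r):
    """Return (period, arrival) for the graph retimed by r.

    edges maps (u, v) to the register count w(e); the retimed
    weight of an edge is w(e) + r[v] - r[u].
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for (u, v), w in edges.items():
        if w + r[v] - r[u] == 0:  # register-free edge: combinational
            succ[u].append(v)
            indeg[v] += 1
    arrival = {v: delay[v] for v in vertices}
    ready = [v for v in vertices if indeg[v] == 0]
    seen = 0
    while ready:  # topological sweep over the register-free subgraph
        u = ready.pop()
        seen += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if seen != len(vertices):
        raise ValueError("combinational cycle in retimed graph")
    return max(arrival.values()), arrival


def feas(vertices, edges, delay, c):
    """Return a retiming with clock period <= c, or None if none is found."""
    r = {v: 0 for v in vertices}
    for _ in range(len(vertices) - 1):
        _, arrival = clock_period(vertices, edges, delay, r)
        for v in vertices:
            if arrival[v] > c:  # too slow: pull one register backward over v
                r[v] += 1
    period, _ = clock_period(vertices, edges, delay, r)
    return r if period <= c else None


def min_period(vertices, edges, delay):
    """Binary search for the smallest feasible integer clock period."""
    zero = {v: 0 for v in vertices}
    lo = max(delay.values())
    hi, _ = clock_period(vertices, edges, delay, zero)
    while lo < hi:
        mid = (lo + hi) // 2
        if feas(vertices, edges, delay, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo, feas(vertices, edges, delay, lo)


# Toy ring: host h, two gates of delay 3, two registers on the edge b -> h.
V = ["h", "a", "b"]
E = {("h", "a"): 0, ("a", "b"): 0, ("b", "h"): 2}
D = {"h": 0, "a": 3, "b": 3}
period, r = min_period(V, E, D)
```

On this toy ring the unretimed period is 6; moving one register backward across gate b separates the two gates.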
To solve the minimum area retiming problem, Leiserson and Saxe introduced a cost function that takes into account the possible sharing of the registers on the different fanout edges of each vertex [9]. This cost function, together with the circuit and period constraints, forms a special integer linear program (ILP) whose solution can be computed using a minimum-cost flow algorithm [9]. Recently, very efficient reduction techniques have been presented for this ILP formulation, resulting in a significant speedup [16, 12, 11].
Multiple-Class Retiming
This section shows how a circuit with complex registers can be retimed without transforming these registers into simple registers and additional logic. It introduces register classes and explains how classes are added to a retiming graph.
Retiming Circuits with Multiple-Class Registers
Most sequential elements in synchronous circuits can be represented by the generic register shown in Fig. 2a). Each register has a signal connected to the data input D, the data output Q, and to the clock input. Additionally, a register can have inputs SS or SC and AS or AC, which allow the register to be set or cleared synchronously and asynchronously, and a synchronous load enable input EN. If a register has, e.g., no load enable capability, then the synchronous load enable input EN of the generic register is deactivated by connecting it to a signal representing the constant 1. Generic registers must fulfill certain conditions to be moved across a combinational logic gate. In general, such a retiming step is valid if it yields a circuit which is a sufficiently old replacement [8] of the original circuit. It has been shown in [1] that, for registers with synchronous load enable inputs, moving a layer of registers across a gate is valid if all registers are connected to the same load enable signal. The same condition holds for the clock inputs of the registers, because it is necessary to preserve the temporal equivalence of the circuit [17]. Registers with reset inputs can be moved if the reset signals are equivalent [18].
Since the validity of moving registers depends on the connected control signals, we classify the registers of a circuit using the signals connected to the control inputs.
Definition 1 (Register Class) A register class C is characterized by a tuple (clk, load, rsync, rasync) of signals.
A register l belongs to class C iff each signal connected to its control inputs is logically equivalent to the corresponding signal of the class. Two registers are said to be compatible iff they belong to the same register class.
It follows from this definition that a layer of registers can be moved across a logic gate if all registers are compatible.
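Definition 1 and the compatibility test can be illustrated with a small data model. This is a sketch under simplifying assumptions: signals are identified by name, so logical equivalence is approximated by name equality, and an unused reset input is represented by the constant "0" (the text only specifies the constant 1 for an unused EN input). The names `RegisterClass`, `Register` and `compatible` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterClass:
    """Tuple (clk, load, rsync, rasync) of control signals."""
    clk: str
    load: str = "1"    # constant 1: no load enable capability
    rsync: str = "0"   # assumed encoding of "no synchronous reset"
    rasync: str = "0"  # assumed encoding of "no asynchronous reset"


@dataclass(frozen=True)
class Register:
    name: str
    cls: RegisterClass


def compatible(registers):
    """A layer of registers may move across a gate only if all share one class."""
    return len({reg.cls for reg in registers}) <= 1


r1 = Register("r1", RegisterClass(clk="clk", load="en"))
r2 = Register("r2", RegisterClass(clk="clk", load="en"))
r3 = Register("r3", RegisterClass(clk="clk"))  # no load enable
```

Here `r1` and `r2` are compatible, while `r1` and `r3` are not because their load enable signals differ.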
Retiming Graph for Multiple-Class Circuits
A circuit which contains multiple register classes is called a multiple-class circuit. Since the validity of moving registers in a multiple-class circuit directly depends on which classes these registers belong to, we have to model the class information in the retiming graph. In particular, it is no longer sufficient to store only the number of registers w(e) on an edge e of the retiming graph, as the registers on the edge may belong to different classes. Therefore, we introduce a modified retiming graph G_mc = (V, E, d, C) which we call a multiple-class retiming graph or, in short, an mc-graph. Fig. 2b) shows how a generic register is modeled in the mc-graph. Instead of a weight w(e), we attach to e a sequence of registers C(e) = [l_1, ..., l_w(e)]. l_1 corresponds to the register closest to the source of the edge, while l_w(e) is the register closest to the sink of the edge. The superscript C at a register l^C denotes the class to which it belongs. In the presence of reset inputs, a register is labeled with appropriate values s, a ∈ {0, 1, -} which specify the synchronous and asynchronous reset values of the register, respectively. For each control signal, except the clock signals, we introduce an output vertex in the mc-retiming graph and an edge from the vertex generating the signal to the corresponding output vertex. This is necessary to ensure that these signals are handled correctly during retiming.
A valid mc-retiming step for a vertex v can be performed as depicted in Fig. 3 . For instance, for a forward mc-retiming step at vertex v, there must be a complete layer of compatible registers at the sink of the fanin edges of v. The last registers of the fanin edges are removed, and a new layer of registers with the same register class is inserted at the source of the fanout edges of v.
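A forward mc-retiming step of this kind can be sketched on a toy mc-graph. The representation is an assumption made for illustration: each edge (u, v) maps to a list of class labels, index 0 being the register closest to the source, and reset values are ignored. The function name `forward_step` is hypothetical.

```python
def forward_step(edges, v):
    """Try one forward mc-retiming step at vertex v.

    edges maps (u, v) to a list of class labels ordered from source to sink.
    Returns True and updates edges in place if the step is valid.
    """
    fanin = [e for e in edges if e[1] == v]
    fanout = [e for e in edges if e[0] == v]
    if not fanin or not fanout:
        return False
    if any(not edges[e] for e in fanin):
        return False  # no complete register layer at the inputs of v
    classes = {edges[e][-1] for e in fanin}
    if len(classes) != 1:
        return False  # the layer contains incompatible registers
    cls = classes.pop()
    for e in fanin:
        edges[e].pop()           # remove the last register of each fanin edge
    for e in fanout:
        edges[e].insert(0, cls)  # insert a register at the source of each fanout edge
    return True


same = {("a", "v"): ["C1"], ("b", "v"): ["C1"], ("v", "o"): []}
mixed = {("a", "v"): ["C1"], ("b", "v"): ["C2"], ("v", "o"): []}
moved_same = forward_step(same, "v")
moved_mixed = forward_step(mixed, "v")
```

The step succeeds on `same` and leaves one register of class C1 on the fanout edge; on `mixed` it is rejected and the graph is unchanged.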
As in the basic retiming approach, we define a retiming for an mc-graph as an integer-valued vertex labeling r : V → Z. An mc-retiming r is legal for a multiple-class circuit if it can be implemented by a sequence of valid mc-retiming steps.
Mapping Multiple-Class to Basic Retiming
This section presents the simple mechanisms that allow us to map the problem of retiming a multiple-class circuit onto the basic retiming problem which can then be solved efficiently by existing approaches to basic retiming.
Multiple-Class Retiming Constraints
A legal mc-retiming can only move layers of compatible registers. These mc-retiming bounds can be used to express the conditions for a mc-retiming r to be legal:
As in basic retiming, the circuit constraints ensure that retiming does not create negative edge weights. In addition, the class constraints guarantee that at each vertex v only valid mc-retiming steps are performed. Thus, we can consider a legal mc-retiming to be a legal basic retiming with additional constraints set on the retiming values.
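For orientation, the two kinds of constraints can be written out as follows. This is a plausible form rather than a quotation: it assumes the Leiserson and Saxe convention in which r(v) counts the registers moved backward across v, and it writes the forward and backward mc-retiming bounds as $r_{mc}^{\min}(v)$ and $r_{mc}^{\max}(v)$, a notation chosen here for illustration.

```latex
\begin{align*}
\text{circuit constraints:} \quad & w(e_{uv}) + r(v) - r(u) \ge 0
    && \forall\, e_{uv} \in E \\
\text{class constraints:} \quad & r_{mc}^{\min}(v) \le r(v) \le r_{mc}^{\max}(v)
    && \forall\, v \in V
\end{align*}
```

Under this reading the class constraints are simple bounds on each retiming value, which is why a legal mc-retiming can be treated as a legal basic retiming with additional constraints.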
The mc-retiming bounds can be easily computed on the mc-graph. Instead of traversing the register layers reachable in the transitive fanin or fanout of a vertex, we adopt a different procedure which was proposed in [7]. In order to compute the backward mc-retiming bounds, we move registers backward as long as we can apply valid mc-retiming steps in the graph. While doing so, we count the number of registers which are moved across each vertex. When no more valid backward moves are possible, the mc-retiming graph is maximally backward retimed, and the number of registers moved across each vertex v is equal to the backward mc-retiming bound r_mc^max(v). Similarly, to compute the forward mc-retiming bounds, we move the registers forward as far as possible using valid mc-retiming steps only. In the maximally forward retimed graph, the negative number of registers moved across a vertex v equals the forward mc-retiming bound r_mc^min(v).
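The bound computation described above can be sketched as a fixed-point loop. The sketch assumes the same toy representation as before (each edge maps to a list of class labels ordered from source to sink), treats port vertices as immovable by leaving them out of `movable`, ignores reset values, and adds a step limit as a safeguard. The names `_step` and `mc_bounds` are hypothetical.

```python
import copy


def _step(edges, v, backward):
    """Apply one valid mc-retiming step at v if possible; return success."""
    fanin = [e for e in edges if e[1] == v]
    fanout = [e for e in edges if e[0] == v]
    src, dst = (fanout, fanin) if backward else (fanin, fanout)
    if not src or not dst or any(not edges[e] for e in src):
        return False
    take = 0 if backward else -1  # register closest to v on each source edge
    classes = {edges[e][take] for e in src}
    if len(classes) != 1:
        return False  # incompatible layer: step is not valid
    cls = classes.pop()
    for e in src:
        edges[e].pop(take)
    for e in dst:
        if backward:
            edges[e].append(cls)     # new register sits next to v on the fanin edge
        else:
            edges[e].insert(0, cls)  # new register sits next to v on the fanout edge
    return True


def mc_bounds(edges, movable, max_steps=10_000):
    """Return {v: (forward_bound, backward_bound)} for the movable vertices."""
    moved = {}
    for backward in (True, False):
        g = copy.deepcopy(edges)
        count = {v: 0 for v in movable}
        steps, progress = 0, True
        while progress:
            progress = False
            for v in movable:
                while _step(g, v, backward):
                    count[v] += 1
                    steps += 1
                    progress = True
                    if steps > max_steps:
                        raise RuntimeError("step limit reached")
        moved[backward] = count
    return {v: (-moved[False][v], moved[True][v]) for v in movable}


# Vertex u fans out to x and y; pi, po1 and po2 are ports and stay fixed.
same = {("pi", "u"): [], ("u", "x"): ["C1"], ("u", "y"): ["C1"],
        ("x", "po1"): [], ("y", "po2"): []}
mixed = {("pi", "u"): [], ("u", "x"): ["C1"], ("u", "y"): ["C2"],
         ("x", "po1"): [], ("y", "po2"): []}
bounds_same = mc_bounds(same, ["u", "x", "y"])
bounds_mixed = mc_bounds(mixed, ["u", "x", "y"])
```

With a single class on both fanout edges the layer can move backward across u once; with two different classes it cannot, so the backward bound of u drops to 0.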
Note that we do not consider reset values while computing the retiming bounds. Although this may result in maximal backward retiming bounds which cannot actually be achieved due to justification conflicts, we decided to ignore reset values for two reasons. First, it was shown in [13] that retiming constraints which guarantee justifiable reset values are generally not unique, resulting in a large number of different constraint sets. Thus, in order to find the optimal solution, a retiming must be computed for each constraint set. Second, the backward justification of reset values can be computationally very expensive. Thus, we want to justify only those backward retiming steps which are actually required by the retiming solution. Our experiments have shown that the number of required backward retiming steps is usually much smaller than the number of retiming steps performed during maximal backward retiming.
Thus, by not considering reset states we compute a unique set of class constraints. Only when implementing the retiming solution do we compute equivalent reset states and take appropriate action in case of a justification conflict. Section 5.2 gives more details on how we compute equivalent reset states.
Register Sharing for Multiple-Class Registers
Minimum area retiming requires that we take register sharing at the gate outputs into account to get a correct area estimation. The problem here is that if we directly apply the cost function introduced by Leiserson and Saxe [9] to count registers in the mc-graph, it produces a register count that is smaller than the actual count, because registers belonging to different classes cannot be shared. In the example in Fig. 4a) we would report a shared register count of 2. But the registers of class C1 and C2 cannot be shared, so the area cost is actually 3.
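The underestimation can be reproduced with a small count. The sketch below compares the basic sharing cost (the longest register sequence among the fanout edges) with a class-aware count in which two registers share only if their whole class prefixes agree. The fanout contents are an assumed stand-in for Fig. 4a), and the function names `basic_shared_count` and `class_aware_count` are hypothetical.

```python
def basic_shared_count(fanout_seqs):
    """Sharing cost of basic retiming: the deepest fanout edge sets the count."""
    return max((len(seq) for seq in fanout_seqs), default=0)


def class_aware_count(fanout_seqs):
    """Count one register per distinct class prefix (nodes of a prefix tree)."""
    prefixes = set()
    for seq in fanout_seqs:
        for depth in range(1, len(seq) + 1):
            prefixes.add(tuple(seq[:depth]))
    return len(prefixes)


# Two fanout edges: the first layer is shared, the second layer is not.
fanout = [["C1", "C1"], ["C1", "C2"]]
basic = basic_shared_count(fanout)
actual = class_aware_count(fanout)
```

For this fanout the basic cost function reports 2 registers, while 3 are needed because the second layer mixes classes C1 and C2.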
Recently
0/1-MILP retiming formulation which is much more expensive to solve than a minimum-cost flow problem. We suggest a new approach in which the graph is modified so that the register count is no longer underestimated by the sharing cost function of Leiserson and Saxe. The resulting problem can still be solved using an efficient minimum-cost flow algorithm.
In an mc-graph, the sharing cost function underestimates the register count if registers of different classes appear in a register layer on the fanout edges of a vertex. In Fig. 4a) the second register layer gives an example of this case. In order to detect these cases, we make the following two observations. First, any register layer which results from a forward move across a multiple-fanout vertex can be unrestrictedly shared at the fanout edges because all inserted registers belong to the same class. Second, any register layer which can be moved backward across a multiple-fanout vertex can also be shared. Otherwise, it could not be moved backward. Thus, the shared register count is potentially wrong only for those registers which are in their maximal backward position. Fig. 4b) shows the example mc-graph with its registers in the maximal backward position. The backward mc-retiming bounds are depicted at the vertices. In order to estimate the shared register count, we heuristically identify the largest number of sharable registers and separate them from the remaining registers. The set of sharable registers is found by traversing the register layers from the sources to the sinks of the fanout edges. At each layer, we select the registers that constitute the largest set of compatible registers. Then, we proceed to the next layer using only the edges of the recently selected registers. In Fig. 4b), all registers on the left side of the cutline can be shared, while the registers on the right side of the cutline cannot be shared with any register on the left side.
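The layer-by-layer selection can be sketched as follows. The sketch assumes that the fanout edges of one vertex are given as a dictionary from sink name to a list of class labels ordered from source to sink, taken in maximal backward position; the function returns, per edge, how many leading registers end up on the sharable side of the cutline. The name `cutline_positions` and the example data are hypothetical.

```python
from collections import Counter


def cutline_positions(fanout_seqs):
    """Return {edge: number of leading registers left of the cutline}."""
    cut = {edge: 0 for edge in fanout_seqs}
    active = list(fanout_seqs)
    layer = 0
    while True:
        # Only edges that still hold a register in this layer take part.
        candidates = [e for e in active if len(fanout_seqs[e]) > layer]
        if not candidates:
            break
        counts = Counter(fanout_seqs[e][layer] for e in candidates)
        best_class, _ = counts.most_common(1)[0]  # largest compatible set
        # Continue with the edges of the selected registers only.
        active = [e for e in candidates if fanout_seqs[e][layer] == best_class]
        for e in active:
            cut[e] = layer + 1
        layer += 1
    return cut


fanout = {
    "v1": ["C1", "C1"],
    "v2": ["C1", "C2"],
    "v3": ["C1", "C1"],
    "v4": ["C2"],
}
cut = cutline_positions(fanout)
```

In this example the first layer selects class C1 on three edges and the second layer selects C1 on two of them, so the second register on the edge to v2 and the register on the edge to v4 fall to the right of the cutline.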
Our goal is to forbid the registers that are at the right of the cutline from moving onto the fanout edges of u, where they would be considered as sharable by the area cost function. To do this, we introduce a separation vertex s_i with zero delay on each edge e_{u v_i} along the cutline. Thereafter, each non-sharable register is placed on the edge of a single-fanout vertex and is thus counted as one register. We prevent the non-sharable registers from moving backward across the separation vertices by specifying appropriate backward retiming bounds. If w_b(e_{s_i v_i}) denotes the weight of the edge e_{s_i v_i} after maximal backward retiming, then the backward retiming bound r_mc^max(s_i) of a vertex s_i is given by (3). Informally, if we rewind the maximal backward retimed graph to its starting position, then r_mc^max(s_i) is the number of registers that have to pass the cutline in order to undo the maximal backward retiming at vertex v_i. Using this procedure, we also find how the initial registers must be distributed on the edges e_{u s_i} and e_{s_i v_i}. Fig. 4c) shows how the initial mc-graph is finally modified to account for multiple-class register sharing. Note that each register which enters an edge e_{s_i v_i} from vertex v_i during retiming is immediately passed to edge e_{u s_i} as long as r(s_i) < r_mc^max(s_i). This is because a register placed on e_{u s_i} has a lower cost than a register placed on e_{s_i v_i}.
The above transformation is performed at each multiple-fanout vertex before solving the minarea retiming problem. It must be noted that there are certain situations where the register count is overestimated by our approach. If, e.g., in Fig. 4b) the registers on the edge e_{u v_4} swap their classes, then the first registers of e_{u v_3} and e_{u v_4} could be shared. This is not detected by our sharing model, because it separates only the largest set of sharable registers at a multiple-fanout vertex. However, these cases occur only if registers are in maximal backward position, which does not seem to happen very often in practice. Furthermore, it is more desirable to overestimate the area during retiming than to underestimate it.
Efficient Implementation
A technically relevant implementation of multiple-class retiming must be able to compute a minimum area retiming for a minimum feasible clock period. This is achieved by performing the following steps, which summarize the overall mc-retiming approach:
1. Generate the mc-graph G_mc from the circuit description.
2. Derive the retiming bounds r_mc^max(v) and r_mc^min(v) using maximal backward and forward retiming, respectively.
3. Modify the retiming graph so as to improve the estimation of the shared register count during minarea retiming.
4. Compute a minimum period retiming subject to the retiming bounds to get the minimum feasible clock period.
5. Compute a minimum area retiming subject to the minimum feasible clock period.
6. Relocate the registers in the circuit according to the computed retiming. Thereby, compute an equivalent synchronous and asynchronous reset state.
We have already discussed Steps 1-3 in the previous sections. These steps are performed very fast, especially since we do not consider reset states during maximal backward retiming. In the remainder of this section we focus on how to efficiently compute the retiming solutions and the equivalent reset states for the retimed multiple-class circuit.
Computing a Multiple-Class Retiming Solution
The previous sections show that we can view the mc-retiming problem on the mc-graph G_mc as a basic retiming problem where upper and lower bounds are imposed on the retiming values. Additionally, the graph G_mc is modified by introducing separation vertices to provide a more reasonable estimation of the shared multiple-class register count. Thus, the mc-retiming problem can be solved by any retiming approach as long as the retiming bounds are satisfied.
We implemented basic minperiod and minarea retiming using the efficient algorithms presented by Shenoy and Rudell [16]. These algorithms, however, cannot directly handle retiming bounds set on vertices. To overcome this limitation, we rewrite the corresponding class constraints in (2) as a set of difference constraints using the retiming value of the host vertex. We can assume r(v_h) to be 0, since registers are not allowed to move across inputs and outputs of the circuit. Thus, from (2) we obtain difference constraints between r(v) and r(v_h) for every vertex v. The cost coefficient c(v) of the minarea objective is determined for each vertex v according to the sharing cost model of Leiserson and Saxe [9]. In order to solve the minimum period retiming problem of Step 4, the cost function is omitted and the minimum clock period φ_min resulting in a feasible set of constraints is determined by binary search.
Note that the number of class constraints is small compared to the possibly huge set of period constraints. The algorithm presented in [16] already makes use of efficient techniques to reduce the number of period constraints, many of which are redundant. We expect to further reduce the overall number of constraints by using the technique proposed by Maheshwari and Sapatnekar [12, 11]. They showed that additional bounds on retiming values can be effectively used to further prune the set of constraints, resulting in a much smaller ILP.
Computing Equivalent Reset States
Our technique for reset state computation is similar to the one proposed by Even et al. [4]. They move registers across several logic gates and then compute new reset values using forward implication or backward justification across the retimed logic gates. These steps are iterated until all registers are in their final position.
Since backward justification can be very expensive, our idea is to break down the justification task into justification steps as easy to execute as possible, as long as this provides a valid solution. Only if this simple approach fails to find a justification, do we perform a possibly more expensive justification. This mechanism is the following. Like [4] , we concurrently compute a new reset state while moving registers into their final position. However, we compute new reset values each time a layer of registers is moved across a gate, which means that we just have to justify across one gate at a time, which is usually not expensive. This operation has been implemented using BDDs.
In a backward justification step we select as many don't cares for the reset values as possible. This helps to avoid conflicts in subsequent backward justification steps and also improves the register sharing potential. If a justification conflict occurs, we try to resolve the conflict by a global justification step. In this case, we trace the conflicting registers back to their original positions together with other registers involved in moving backward the conflicting registers. Then, we try to compute a justification for the larger portion of logic gates. On success, we update the reset values and proceed. If we cannot resolve the conflict by global justification, the retiming solution cannot be implemented, and we have to compute a new retiming solution. Beforehand we set an upper retiming bound on the vertex where the conflict occurred such that the non-justifiable backward move is no longer allowed. Fig. 5 illustrates our approach with an example. The numbers above the gates denote the retiming values to be applied to the circuit. The first two steps consist of moving the registers across the NAND gate v3 and the inverter v4, and local justification produces reset values for the registers inserted on the fanin edges of v3 and v4 (see Fig. 5a)). The following backward move across the AND gate v2 produces a conflict due to the different reset values on the fanout edges. Therefore, the registers are traced back to the original registers and a global justification is performed across gates v2, v3, and v4, as depicted in Fig. 5b).
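A local backward justification step across a single gate can be sketched by brute force. This is an illustration only: the implementation described here uses BDDs, whereas the sketch enumerates all input cubes over {0, 1, "-"} and keeps one with the most don't cares. The function names `merge_fanout_values` and `justify` are hypothetical.

```python
from itertools import product


def merge_fanout_values(values):
    """Combine the reset values on the fanout edges; None signals a conflict."""
    fixed = {v for v in values if v != "-"}
    if len(fixed) > 1:
        return None  # e.g. one fanout register resets to 0, another to 1
    return fixed.pop() if fixed else "-"


def justify(gate, n_inputs, out_value):
    """Find input reset values (with most don't cares) that imply out_value."""
    if out_value == "-":
        return ("-",) * n_inputs
    best = None
    for cube in product((0, 1, "-"), repeat=n_inputs):
        free = [i for i, x in enumerate(cube) if x == "-"]
        valid = True
        for fill in product((0, 1), repeat=len(free)):
            bits = list(cube)
            for i, b in zip(free, fill):
                bits[i] = b
            if gate(*bits) != out_value:
                valid = False  # some completion gives the wrong output
                break
        if valid and (best is None or len(free) > best.count("-")):
            best = cube
    return best


def and2(a, b):
    return a & b


def inv(a):
    return 1 - a


ones = justify(and2, 2, 1)      # both inputs must reset to 1
zero = justify(and2, 2, 0)      # one input at 0 suffices, the other is free
inv_in = justify(inv, 1, 0)
conflict = merge_fanout_values([0, 1])
```

The conflict case mirrors the situation at the AND gate v2 in Fig. 5: if the registers on the fanout edges carry different reset values, no single output value exists to justify, and a global justification is attempted instead.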
In the experiments, our approach has proven to be very efficient. In fewer than 1% of all justification steps did we have to resort to a global justification step in order to resolve a conflict. More impressively, we never encountered an example where we actually had to compute a new retiming solution due to a non-resolvable conflict. This shows that in practice computing equivalent reset states can be done in reasonable time using rather simple methods.
Results
We have developed a software package implementing the multiple-class retiming that we have presented here. As mentioned in Section 5, this package has been built on top of the efficient basic minimal period and minimal area retiming engine presented in [16]. Backward justification has been implemented using BDDs. This section presents the experimental setup that we have used to evaluate this new package, and then gives the results that we have obtained on real-life industrial circuits using this setup. Note that it would not make sense to give results on standard benchmark circuits, like the ISCAS circuits, because they do not contain complex registers and are also not available as RT-level HDL source, from which we could derive complex registers using an HDL analyzer.
The multiple-class retiming package has been integrated in an existing state-of-the-art logic synthesis system for FPGAs. This system provides us with scripts to perform logic synthesis, optimization and mapping of circuits for minimal area as well as for minimal area for best delay. Both the logic optimization and mapping are architecture specific, i.e., they both make use of the specific features of the target FPGA architecture to produce higher quality results. For instance, when mapping logic and arithmetic operators on a Xilinx XC4000E [20], the system makes use of the hardwired carry chain logic to get the best performance.
Each circuit used here is an industrial circuit described at the RT-level in VHDL or in Verilog. This source code is first run through an HDL analyzer which produces a technology independent gate level netlist. Remarkable elements of this netlist are the registers, which can have a synchronous load enable input EN, as well as synchronous SS/SC and asynchronous AS/AC set/clear inputs. Table 1 gives the areas and delays of each circuit after optimization, mapping, place and route onto a Xilinx XC4000E, using the minimal area for best delay script. Since registers on an XC4000E do not have synchronous set/clear inputs, all such inputs inferred by the HDL analyzer are decomposed into additional logic beforehand. Column #FF is the number of registers in the circuit. Column #LUT is the number of lookup tables (LUT) in the mapped circuit, and Delay is the minimal period of the circuit. This delay is the maximal delay over all combinational paths in the circuit, computed after place and route using the Xilinx timing analyzer [20].
In order to evaluate the new retiming package, the mapping script was modified to include a retiming step. A command "retime" was inserted after the circuit has been completely mapped. The circuit is then seen as a netlist of Xilinx primitives, e.g., LUTs, carry chains, and special buffers. We decided to run retiming at this point because it allows us to compute delays for the combinational gates that are as close as possible to the actual delays in the FPGA. This is particularly important when dealing with carry chains, for instance. The command "retime" is run with the minimal area for best delay objective. A command "remap" was also added to the script to remap the combinational part of the circuit after retiming. Table 2 presents the results obtained with this modified script.
The first part provides information about the retiming process itself, while the second part provides information about its effect. Column #Class is the number of classes in the mc-graph of the circuit. In column #Step the first number is the total number of layers of registers that have been actually moved in the circuit. The second number is the total number of all possible valid mc-retiming steps in the mc-graph, computed during the maximal backward and forward retiming phase. Column #FF is the number of registers in the retimed circuit, #LUT its number of LUTs, and Delay its maximal combinational delay. Finally, Rlut and Rdelay are the ratio of columns #LUT and Delay, respectively, over the corresponding columns of Table 1 .
First of all, the overall "retime" command finished for all circuits within 60 seconds of CPU time on a Sun UltraSPARC (333 MHz), showing the efficiency of our retiming approach. On average, about 90% of the time was used by the basic retiming approach, and 7% of the time was spent in register relocation and reset state computation. Only 3% of CPU time was used for building the mc-graph, computing the classes and retiming bounds, and modifying the graph for register sharing. This shows that the computational overhead caused by the multiple-class extension is small. Note that for all processed designs, the number of register layers actually moved is much smaller than the number of layers that can possibly be moved. Also note that over 99% of all needed backward justifications could be performed locally. This means that the cost of backward retiming is kept as low as possible. Although of course this cannot always be the case, we think that this is still very encouraging in practice.
Retiming proves to be fairly effective, with the largest delay reductions being obtained for the three largest circuits (C4, C6, C10).
The penalty incurred on the combinational area by the process is nonexistent or very small for a majority of the designs, although 3 out of the 10 designs see their number of LUTs grow by more than 10%. The penalty on the number of registers is more significant, with an average ratio of 1.10. In another experiment we compared the results presented in Table 2 with the results we obtain if we do not preserve the load enable inputs for retiming. In order to do so, we added at the beginning of the script a command that decomposes the synchronous load enable inputs of all the registers in the design. The results for this script are presented in Table 3. The first part of the table gives the number of registers, the number of LUTs, and the maximal delay in the circuit mapped using this modified script. These values are then compared with the values presented in the preceding tables. The column Rdelay2 shows that there is only one circuit (C4) for which the resulting delay is better than the one given in Table 2. This can happen since, after decomposing the load enable inputs, there may be fewer restrictions on moving registers around, resulting in a better delay improvement. For circuit C4 this comes, however, with a very significant area penalty of 32% more registers and 25% more LUTs. For all other designs, the delay in the retimed design is larger than the one reported in Table 2. Overall, after decomposing the load enable inputs, retiming produces circuits that are 21% faster than the original circuits, but with 17% more registers and 10% more LUTs, while using the load enables during multiple-class retiming produces circuits that are 22% faster than the original circuits, with 10% more registers and 3% fewer LUTs.
Conclusion
In this paper we have presented an extension of the basic retiming algorithm which allows us to apply retiming on circuits designed to take advantage of the complex registers available in modern hardware technologies, such as registers with synchronous load enable, and synchronous and asynchronous set/clear inputs.
We have implemented this new retiming algorithm, called multiple-class retiming, integrated it in a state-of-the-art FPGA synthesis environment, and reported results obtained on a set of industrial FPGA designs. We think that these results are quite encouraging, because they show that the computational overhead caused by the extension is very small compared with its benefits on the processed circuits.
