We propose an exact clustering-with-retiming algorithm to minimize the clock period of sequential circuits. Without moving flip-flops (FF's) by retiming, conventional clustering algorithms can only handle the combinational parts of a circuit and therefore cannot achieve the best cycle time. Pan et al. [2] have proposed an algorithm that is optimal under the unit gate delay model. We propose a more powerful and faster algorithm that produces optimal results even under the more realistic general gate delay model. Experimental results show that our algorithm is twice as fast as Pan's.
Introduction
Circuit clustering groups the cells of a design into macros to satisfy constraints such as area or pin limitations [3] [4] [5]. However, this often induces large interconnect delays between macros. Avoiding performance degradation is therefore a major objective of circuit clustering.
Retiming, which repositions flip-flops (FF's) while preserving circuit functionality, can be used to shorten the clock period [1]. Traditional clustering techniques, which do not consider retiming, cannot achieve the optimal performance for sequential circuits [4] [5]. These methods often treat a sequential circuit as a set of combinational parts by dropping all FF's and then clustering each combinational part independently. If we appropriately relocate FF's by retiming while clustering a circuit, we can achieve better performance.
Pan et al. have proposed an approach that combines retiming and clustering [2]. Under the unit gate delay model [4], their algorithm achieves the optimal clock period. Under the general gate delay model, it obtains a near-optimal clock period, within the maximum delay of any gate in the circuit of the optimum. They use a labeling technique to achieve the retiming effect and to integrate it with clustering.
Pan's algorithm cannot produce optimal results under the general delay model because it does not find the best labeling. If the relocated position for an FF, computed during labeling, is occupied by a gate, the labeling value of this gate has to be modified.
This work was supported in part by the Science Council, R.O.C., under a contract no. NSC88-2215-E-007-012.
In this paper, we propose a new algorithm to cluster circuits with retiming using a new labeling method. This algorithm not only achieves the optimal clock period under the general delay model but also uses less time than Pan's. Experimental results show that the average ratio of the run time used by Pan's algorithm to ours is 2:1.
The rest of this paper is organized as follows. Preliminaries are described in Section 2. How to enhance Pan's labeling method is presented in Section 3. For the convenience of illustrating our clustering algorithm, we first review Pan's algorithm in Section 4. Our algorithm is introduced in Section 5. Section 6 presents some experimental results. Finally, Section 7 draws some concluding remarks.
delay, D, which is also a given parameter. We define the function (u) to denote the interconnect delay from u to a cluster. Because we assume there is no inter-cluster delay for PI's and PO's, (u) is zero if u is a PI or the cluster is a PO; otherwise it is D. A node can be duplicated, without changing functionality, to optimize the clock period. The cycle time may differ from the original because of retiming. The clustering problem in this paper is as follows:
Problem 1 Given a sequential circuit G and a target clock period c, find a clustered circuit Gr with (Gr) less than or equal to c, if such a circuit exists.
3 The Weakness of Pan's Approach
In this section, we explain why Pan's approach cannot obtain the optimal clock period under the general delay model, and we give our solution. Pan's approach does not find the best labeling. The algorithm labels each node in the circuit with an "l-value", defined as the weight of the longest path from the PI's to the node using the "w1 . If the l-value of a PO is greater than c, there exists no clustered circuit Gr with (Gr) less than or equal to c.
However, this labeling method is not adequate if we want to find the optimal solution under the general delay model. In effect, it attempts to relocate the ith FF to the position at which the propagation delay from the PI's equals i·c. Unfortunately, this position is very likely to fall in the middle of a gate. Since a gate cannot be split, we can only push this FF to the front of the gate to satisfy the timing constraint c. Thus this kind of approach can only guarantee a solution that is within the maximum delay of any gate in the circuit of the optimum.
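The effect described above can be illustrated with a toy sketch. It considers only a single chain of gates, with no clustering and no interconnect delay, and it is not the labeling procedure of [2]. The function name `stage_delays` and the example delays are hypothetical and are not taken from Fig. 2.

```python
def stage_delays(gate_delays, c):
    """Split a chain of gates into register-to-register stages.

    Assumption for illustration: the i-th FF is aimed at the point whose
    delay from the input is i*c; if that point lies inside a gate, the FF
    is pushed to the front of that gate.
    """
    # Cumulative delay at every gate boundary along the chain.
    bounds = [0]
    for d in gate_delays:
        bounds.append(bounds[-1] + d)
    total = bounds[-1]

    cuts = []
    i = 1
    while i * c < total:
        target = i * c
        # Latest gate boundary at or before the ideal position; when the
        # target is inside a gate, this is the front of that gate.
        cut = max(b for b in bounds if b <= target)
        if cut > 0 and cut not in cuts:
            cuts.append(cut)
        i += 1

    edges = [0] + cuts + [total]
    return [b - a for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    # Ideal FF position 5 lies inside the second gate (delay 4),
    # so the FF moves to position 3 and the next stage becomes longer than c.
    print(stage_delays([3, 4, 3], 5))
```

In this toy setting the longest stage can exceed c, but always by less than the largest gate delay, which mirrors the bound stated above.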
In our approach, we follow the method proposed in [6] to modify the l-value. The l-value of a node v can be computed by using the l-value of its fan-in node u by
Fig. 2 shows a simple example in which Pan's approach cannot achieve the optimal clock period but ours can. Fig. 2 (a) is the original circuit, showing the gate delay on each node. The minimal timing constraint that Pan's labeling can achieve is 5. The l-value and the corresponding retiming value r(v) for each node are listed in Fig. 2 (b). However, the critical path delay of the retimed circuit is 7. On the other hand, the minimal timing constraint that our labeling can achieve is 6. As shown in Fig. 2 (c), the critical path delay of the circuit retimed according to our labeling is 6, which is the optimal clock period.
4 Outline of Pan's Algorithm
To illustrate our algorithm, we outline in this section the algorithm proposed by Pan et al. [2]. Their algorithm consists of two phases. The first is the labeling phase, which computes the l-value and generates the corresponding cluster for each node. The second phase connects all nodes and clusters, retimes the clustered graph, and then merges clusters to reduce area. Clusters have to be merged because the algorithm generates one cluster for each node during the first phase. We address the labeling phase in this paper; the second phase can be found in [2].
As we have illustrated in the previous section, the l-value of a PO cannot be greater than the target clock period c.
To reduce the search space, Pan's algorithm only considers "simple clustered circuits". A simple clustered circuit satisfies four conditions:
1. Each cluster has only one node that can output signals to the outside of the cluster. Thus we can name a cluster Cv if it outputs from node v.
it is updated. If no more l-values can be updated, the algorithm terminates. Otherwise, the failure condition will be met for some PO with its l-value greater than the target clock period c.
In addition, for run-time efficiency, we adopt a circuit traversal method different from Pan's, using FIFO's to incorporate our labeling technique.
The procedure PansLabel is a variation of the Bellman-Ford algorithm [1, 10]. We develop an algorithm similar to the retiming algorithm proposed by Chen [6]. Chen's algorithm uses a FIFO to store the nodes to be updated rather than iteratively traversing the whole circuit. Experimental results show that Chen's algorithm runs much faster than the Bellman-Ford-like retiming algorithm [6]. Thus, in our labeling procedure, we use a queue, called queue1, to store the nodes whose new l-values are to be calculated. Initially, all PI's are put into queue1. Nodes are then retrieved from queue1 one at a time. We update the l-value of a node if the calculated value is greater than the present one. We then put all nodes reachable from the updated node into queue1 if they have a chance to be updated. The procedure stops when queue1 is empty or the failure condition is detected.
v may have a chance to be updated, we use another queue, named queue2, to help traverse the sub-circuit starting from v. If the l-value of a node y computed according to the updated l-value of v is greater than the original value, y is put into queue2 for further traversal. The candidate l-value (i.e., l'(v)) of a node y computed according to the l-value of node v is stored in labelmatrix[v][y]. The l-value is calculated according to Equation 1. Meanwhile, y is also put into queue1 whenever it is put into queue2, since its l-value has a chance to be updated. Routine TightenBound, shown in Fig. 7, calculates the possible l-value of y associated with Cy when y is retrieved from queue1.
To analyze the time complexity, we first examine how many times a node is visited in procedure Label. A node v is visited only when the l-value of another node u is updated and there is a path from u to v. The increasing amount for an updated l-value is at least c= , where is the minimal positive difference between any two gate delays.
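The single-queue traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure Label itself: it omits clustering, queue2, labelmatrix and TightenBound, and the update rule `label[u] + delay[v] - c * w` is a generic retiming-style relaxation standing in for Equation 1. The function name `fifo_label`, the edge format `(u, v, w)` with `w` the number of FF's on the edge, and the per-node update cap used to force termination are all hypothetical additions. Unlike the text, only the updated node is queued, and its fan-outs are examined when it is retrieved.

```python
from collections import deque


def fifo_label(nodes, edges, delay, pis, pos, c):
    """Worklist longest-path labeling; returns labels, or None on failure."""
    fanout = {v: [] for v in nodes}
    for u, v, w in edges:  # w = number of FF's on edge (u, v), an assumption
        fanout[u].append((v, w))

    label = {v: float("-inf") for v in nodes}
    updates = {v: 0 for v in nodes}
    queue = deque()
    queued = set()
    for p in pis:  # initially, all PI's are put into the queue
        label[p] = delay[p]
        queue.append(p)
        queued.add(p)

    while queue:
        u = queue.popleft()
        queued.discard(u)
        for v, w in fanout[u]:
            candidate = label[u] + delay[v] - c * w  # assumed update rule
            if candidate > label[v]:  # update only if the value increases
                label[v] = candidate
                updates[v] += 1
                # Failure: a PO label exceeds c, or labels keep growing
                # (the cap is an added safeguard, not from the paper).
                if (v in pos and candidate > c) or updates[v] > len(nodes):
                    return None
                if v not in queued:
                    queue.append(v)
                    queued.add(v)
    return label


if __name__ == "__main__":
    nodes = ["pi", "a", "b", "po"]
    edges = [("pi", "a", 0), ("a", "b", 0), ("b", "po", 0), ("b", "a", 1)]
    delay = {"pi": 0, "a": 2, "b": 3, "po": 0}
    print(fifo_label(nodes, edges, delay, ["pi"], {"po"}, 5))
    print(fifo_label(nodes, edges, delay, ["pi"], {"po"}, 4))
```

Compared with sweeping every node in each round, a worklist of this kind only revisits nodes whose fan-in labels have actually changed, which is the motivation given above for using a FIFO.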
Experimental Results
We have implemented our algorithm in C and embedded it in the SIS package [7]. We have also implemented Pan's algorithm for comparison. We ran the experiments on an UltraSparc-2 machine with 2GB of memory. M is set to a quarter of the total circuit area, and D to twice the average gate delay.
Experimental results on the ISCAS89 benchmark suite [9] are listed in Table 1. All the circuits are technology mapped by SIS using a 0.5um library from TSM-
The final clock periods are listed in the "(Gr)" columns. The results confirm that the clock period obtained by Pan's algorithm is bounded by c plus the maximal gate delay and may not be optimal. In 11 of the 26 cases, Pan's results are sub-optimal; these are shown in bold in the table.
The run time in seconds is listed in the "CPU" columns, and the ratio of the time used by Pan's algorithm to ours is listed in the "Pan's/Ours CPU" column. Our algorithm is slightly slower than Pan's in only two cases (s1423 and s9234). The average ratio of the run time used by Pan's algorithm to ours is 2:1.
The experimental results show that our algorithm is valuable: it is exact, yet it runs faster than the previous non-optimal heuristic.
Conclusions
We have presented an exact algorithm to cluster sequential circuits with retiming to obtain the optimal clock period. We pointed out why a previous work, Pan's approach, could not produce the optimal solution under the general delay model. We modified Pan's labeling method and proposed an exact algorithm that uses two queues to enhance run-time efficiency. Experimental results show that the average ratio of the run time used by Pan's algorithm to ours is 2:1.
These techniques can also be used for technology mapping with retiming. We will study this problem in the future.
