INTRODUCTION
The International Technology Roadmap for Semiconductors (ITRS'2002 update) [18] predicts that by 2016 there will be over ten billion transistors integrated on a single chip, with an on-chip local clock frequency of 28GHz in the 22nm technology. It was shown in [3] that even with the use of new interconnect materials and aggressive interconnect optimization, the delay of a 2cm global interconnect still remains around 500ps. This implies that multicycle communications over the long interconnects are required for multi-gigahertz synchronous designs.
Retiming is a powerful sequential optimization technique used to minimize the clock period (delay) or the number of flipflops (FFs) by relocating the FFs (it changes the netlist) while preserving the functionality of the circuit [11]. The potential of retiming over the global interconnects needs to be considered during the global placement stage, as the placement results define the interconnects. Given a netlist of the circuit, the existing placement algorithms place the gates and blocks so that certain objectives, e.g., wirelength minimization, delay minimization, and routing congestion minimization, can be achieved. They cannot change the netlist of the circuit. However, the benefit of considering retiming during the placement stage is significant and can be illustrated by the simple motivational example shown in Figure 1.

Consider the circuit G shown in Figure 1. For Placement 1, when retiming is applied (Figure 1(c)) by moving flipflop F1 from the fanin of gate a to its fanouts, the longest path delay is unchanged. On the other hand, if the placer is aware of the retiming possibility along the path $I_1 \to F_1 \to a \to d \to O_2$, it will identify the critical paths in terms of retiming potential and try to shorten them, thus generating Placement 2 (shown in Figure 1(d)) with a path delay of 5. However, after retiming is applied to Placement 2 (shown in Figure 1(e)), the delay can be further reduced to 3. This shows the necessity of considering retiming during the placement stage in order to hide long interconnect latency for performance optimization, because not all the long interconnects are problematic. Only those long interconnects which have to be crossed in a single clock cycle are problematic and should be optimized during the placement stage. In the above example, path $p_r = I_2 \to c \to d \to O_2$ has to be crossed in a single clock cycle, while path $p = I_1 \to F_1 \to a \to d \to O_2$ can be crossed in two clock cycles if retiming is applied. Therefore path $p_r$ is a "bad" interconnect, while path $p$ is not.
Although Placement 1 is better than Placement 2 in terms of delay before retiming is performed, Placement 2 is better than Placement 1 in terms of retiming possibility, as it leads to better performance after retiming. The critical path $p_r$ (the "bad" interconnects) in the retimed circuit is shortened in Placement 2, while the apparent critical path (the critical path before retiming) is shortened in Placement 1.

There are two kinds of approaches, iterative and simultaneous, for integrating retiming with placement or floorplanning in the physical design phase. In the iterative approach [21, 13, 14], placement and retiming are performed alternately until the timing constraints are met. That is, wirelength-driven or timing-driven placement is first performed on the given netlist to minimize the total wirelength and/or delay, and retiming is then performed on the placed circuit based on the layout information. In the simultaneous approach [5, 2, 20], placement or floorplanning is performed in such a way that the placement or floorplanning engine is aware of the retiming possibility and thus reduces the lengths of the interconnects which are critical in terms of retiming potential. In the simultaneous approach, the placer with retiming awareness will shorten the "bad" interconnects in terms of retiming potential and thus produce a solution similar to Placement 2 in Figure 1. In the iterative approach, the traditional timing-driven placer will shorten the critical path in the pre-retimed circuit and thus produce a solution similar to Placement 1 in Figure 1. Another iteration of placement can then be performed on the retimed circuit (Figure 1(c)) to further shorten path $p_r$. However, this is not as efficient as the simultaneous approach, which can lead to the best solution in one round.
Clearly, the simultaneous approach has the advantage of efficiently generating better results. However, the existing simultaneous approaches have their limitations. In [2], retiming is integrated into the floorplanning stage based on the theory that as long as the target retiming preserves the number of FFs for every loop in a circuit whose underlying graph is strongly connected, there exists a valid finite sequence of retiming operations which can reach this target, thus making the loops become critical in terms of delay minimization when pipelining is to be performed afterwards. Edges in the critical loops are given high weights during the partitioning-based floorplanning stage to promote clustering of loops within a single partition, such that the clock period can be minimized (the clock period is determined by the ratio of the loop delay to the register count in the loop). However, in addition to the high complexity of such a method, there may not be enough global interconnects seen by a floorplanner, which could limit the retiming potential. In [20], retiming is integrated into a simulated annealing (SA)-based placement for FPGA designs. At each temperature, critical cycle and slack analysis is performed so that the SA-based placement engine can be aware of the retiming possibility and thus reduce the lengths of the critical cycles. Their placement method, however, is based on the flat netlist and may have difficulty in handling large-scale designs. In [5], sequential timing analysis (Seq-TA), formerly called RTA, is proposed and integrated with a multilevel partitioning-based physical planner, GEO. By doing that, GEO can generate results with retiming awareness. However, the proposed Seq-TA has one limitation: it cannot handle gates (or clusters of gates) with multiple outputs, which makes it difficult to apply directly to the multilevel framework.
As a result, in GEO Seq-TA is always performed on the original single-output gate-level circuits, greatly affecting its efficiency.
In this paper, we present a practical solution for simultaneous retiming and multilevel global placement for performance optimization. Our contributions are (i) we generalize the Seq-TA to handle the gates/clusters with multiple outputs; (ii) we integrate retiming into the multilevel placement framework in order to efficiently handle large-scale designs and provide two speed-up techniques for Seq-TA to be efficiently integrated with the placement process.
The remainder of this paper is organized as follows. Section 2 reviews the related work and defines the terminology. Section 3 describes the generalized sequential timing analysis for gates/clusters with multiple outputs. Section 4 describes the overall flow of integration. The experimental results are shown in Section 5, followed by the conclusions and ongoing work in Section 6.
REVIEW OF RELATED WORK

Retiming
Retiming is a sequential optimization technique that relocates the sequential elements in a circuit without changing the behavior of the circuit. It moves the FFs across the combinational elements to optimize the clock period, the number of FFs, or power. In [11], Leiserson and Saxe first proposed a graph-theoretic model for a synchronous circuit. In this model, a circuit consisting of functional elements and globally clocked registers is transformed into a finite vertex-weighted, edge-weighted directed graph $G = (V, E, d, w)$, where $d(v)$ is the delay of vertex $v$ and $w(e)$ is the number of registers on edge $e$.
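The graph model above can be illustrated with a small sketch. It assumes the standard Leiserson-Saxe convention that a retiming $r$ changes the register count of an edge $e = (u, v)$ to $w_r(e) = w(e) + r(v) - r(u)$; the function names and the toy circuit are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

def clock_period(delay, edges):
    """Longest combinational path delay of a retiming graph.

    delay: dict vertex -> gate delay d(v)
    edges: list of (u, v, w), where w is the number of FFs on edge u -> v
    Only edges with w == 0 are combinational; they must form a DAG.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    arrival = dict(delay)  # arrival time at the output of each vertex
    ready = [v for v in delay if indeg[v] == 0]
    visited = 0
    while ready:  # topological traversal of the register-free subgraph
        u = ready.pop()
        visited += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(delay):
        raise ValueError("combinational cycle (a loop without any FF)")
    return max(arrival.values())

def retime(edges, r):
    """Apply retiming r: w_r(e) = w(e) + r(v) - r(u) for e = (u, v)."""
    retimed = []
    for u, v, w in edges:
        w_r = w + r.get(v, 0) - r.get(u, 0)
        if w_r < 0:
            raise ValueError("illegal retiming: negative register count")
        retimed.append((u, v, w_r))
    return retimed

# Toy circuit: pi -> [FF] -> a -> b -> po, with gate delays of 2.
delay = {"pi": 0, "a": 2, "b": 2, "po": 0}
edges = [("pi", "a", 1), ("a", "b", 0), ("b", "po", 0)]
before = clock_period(delay, edges)
# r(a) = -1 moves the FF from the fanin of gate a to its fanout.
after = clock_period(delay, retime(edges, {"a": -1}))
```

In this toy circuit, moving the FF across gate a splits the two-gate combinational path into two single-gate paths.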
c-Retiming
Pan proposed c-retiming [15], a continuous version of retiming where the value assigned to a vertex can be a real number. To compute a c-retiming for a target clock period $\phi$, another edge weight $w_1(e) = -\phi \cdot w(e) + d(v)$ is assigned to each edge $e = (u, v)$, where $w(e)$ is the number of FFs on $e$ and $d(v)$ is the combinational delay of vertex $v$. The l-value of a node is defined as the weight of the longest path from the PIs to this node using the new edge weighting method. In a sequential circuit, if there is a PO whose l-value is greater than $\phi$, the circuit cannot be retimed to a clock period of $\phi$. If, on the other hand, the l-values of all the POs are not greater than $\phi$, the circuit can be retimed to a clock period less than $\phi + K$, where $K$ is the maximum gate delay in the circuit. Because c-retiming can be computed much more efficiently than retiming and can be converted to a retiming by simple rounding, it has been combined with other optimization and synthesis techniques, such as FPGA mapping [17, 7], performance-driven clustering [16, 4], and partitioning [6], for a tight integration. Moreover, c-retiming is used as the basis for the sequential timing analysis described in the next subsection.
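As a minimal sketch of the l-value computation, the code below runs a Bellman-Ford style longest-path relaxation. It assumes the edge weight $-\phi \cdot w(e) + d(v)$ for an edge $e = (u, v)$ and an l-value of zero at every PI; the function names and the example circuits are hypothetical.

```python
def l_values(delay, edges, pis, phi):
    """Longest-path labels from the PIs under weight -phi*w(e) + d(v).

    Returns a dict vertex -> l-value, or None if the labels keep growing
    after |V| passes, which indicates a positive cycle (phi too small).
    """
    neg_inf = float("-inf")
    label = {v: neg_inf for v in delay}
    for p in pis:
        label[p] = 0.0
    for _ in range(len(delay)):
        changed = False
        for u, v, w in edges:
            if label[u] == neg_inf:
                continue  # u is not yet reached from any PI
            candidate = label[u] - phi * w + delay[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None

def is_feasible(delay, edges, pis, pos, phi):
    """phi passes the test if no PO has an l-value greater than phi."""
    label = l_values(delay, edges, pis, phi)
    return label is not None and all(label[p] <= phi for p in pos)

# Chain pi -> [FF] -> a -> b -> po with gate delays of 2.
delay = {"pi": 0, "a": 2, "b": 2, "po": 0}
chain = [("pi", "a", 1), ("a", "b", 0), ("b", "po", 0)]
# Loop a -> b -> [FF] -> a: loop delay 4 with a single register.
loop = [("pi", "a", 0), ("a", "b", 0), ("b", "a", 1), ("b", "po", 0)]

chain_ok = is_feasible(delay, chain, ["pi"], ["po"], 2)
chain_bad = is_feasible(delay, chain, ["pi"], ["po"], 1)
loop_ok = is_feasible(delay, loop, ["pi"], ["po"], 4)
loop_bad = l_values(delay, loop, ["pi"], 3)
```

For the loop example, any target below 4 makes the cycle weight positive, so the relaxation never converges and the target is rejected.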
Sequential Timing Analysis (Seq-TA)
In [5], the concept of sequential timing analysis (formerly called RTA) was proposed. It defines the SAT of a vertex v in terms of its fan-in vertices and, similarly, the SRT of v in terms of its fan-out vertices. From the SAT and SRT values of the vertices, the $\epsilon$-network for the given sequential circuit is derived and used for net weighting during the partitioning phase in GEO. Seq-TA is very helpful for placers (floorplanners) because it identifies the critical paths with retiming potential and makes the placer aware of them. On one hand, by minimizing those critical paths, the placer (floorplanner) can achieve further delay reduction in anticipation of retiming. On the other hand, retiming does not need to be performed during the placement (floorplanning), saving a great deal of runtime.
SEQUENTIAL TIMING ANALYSIS FOR COMPLEX NETWORKS
The first contribution of this work is to extend the sequential timing analysis to circuits consisting of multi-output gates. An example of SO-gates vs. MO-gates is shown in Figure 2. In Figure 2(a), all the gates have a single output; gates G1, G3, and G5 have a propagation delay of 1, while gates G2 and G4 have a propagation delay of 2. The corresponding retiming graph is shown in Figure 2(b). When we cluster gates G1, G3, G2, and G4 into a cluster G0 (which can be regarded as an MO-gate), shown in Figure 2(c), G0 has not only multiple outputs but also non-uniform input-output propagation delays. Its corresponding retiming graph is shown in Figure 2(d). It is therefore necessary to extend Seq-TA to the complex network, so that Seq-TA can be integrated with more optimization processes.
Generalized c-Retiming
When the MO-gates are combinational logic, i.e., there is no sequential logic (FFs) inside, we can generalize c-retiming for them. In the retiming graph of a complex network, each gate (with either a single output or multiple outputs) corresponds to a vertex with internal edges from its inputs to its outputs according to the logic dependence. Each PI corresponds to a vertex with zero propagation delay, one input, and one output. Each PO corresponds to a vertex with zero propagation delay, one input, and one output. An example is shown in Figure 3. The logic dependence is shown by the internal edges (the dashed edges in Figure 3) of $v$. An internal edge from an input $v_I$ to its reachable output $v_O$ is denoted as $e^v_{I \to O}$, and its delay is denoted as $d(e^v_{I \to O})$. An edge $e$ of the retiming graph (called an external edge and shown by a solid line in Figure 3) connects an output of a vertex and an input of another vertex, and its delay is denoted as $d(e)$.

DEFINITION 3. A simple network is a circuit network that con…

If $\phi_l$ is the minimum clock period such that the $l_2$-value labels of all the POs are not greater than $\phi_l$, then $\phi_l$ is the lower bound of the feasible clock period achieved by retiming. If $\phi_u$ is the minimum clock period such that the $l_1$-value labels of all the POs are not greater than $\phi_u$, then $\phi_u + K$ is the upper bound of the minimum feasible clock period achieved by retiming, where $K$ is the maximum input-output delay of all the gates/clusters. For circuits without MO-gates, $l_1$-value labeling and $l_2$-value labeling are equivalent, and $l_1$-value labeling is equivalent to the SAT defined in Section 2.3.
Under the scenario of multilevel placement, the netlist of the original circuit is clustered to form a coarser netlist. Let $C$ be a circuit which consists only of SO-gates. Let $C_c$ be the circuit derived from $C$ by clustering SO-gates into MO-gates. Due to the page limit, we omit the proofs of the above theorems; please refer to [8] for details.
Generalized Sequential Timing Analysis
Based on the generalized c-retiming, we can generalize the sequential timing analysis for the complex network by using the $l_2$-value labeling to compute SAT and SRT values for the inputs and outputs of vertices. The SAT is defined for each output $v_O$ of vertex $v$ (shown in Figure 4) in the retiming graph $G$.

Though the $l_2$-value c-retiming cannot be converted to a retiming, it gives us enough information to catch the critical "sequential path" that should be minimized during the placement phase, so that the placement solution can produce the best result achievable by retiming. As the RTA algorithm [5] does, we still use the Bellman-Ford variant shortest path algorithm to determine whether the target clock period $\phi$ is feasible under the $l_2$-value labeling and to compute the generalized SAT and SRT for the inputs and outputs of the vertices. We start with the initialization of the SAT and SRT values. The SATs of all the PI outputs are set to zero, while those of the outputs of other vertices are set to $-\infty$. The SRTs of all the PO inputs are set to $\phi$, while those of the inputs of other vertices are set to $\infty$. During one iteration of relaxation, we visit the vertices and iteratively update the SAT and SRT values of their inputs and outputs. The iteration stops when the SATs and SRTs converge to their maximum and minimum values, respectively, or when there is one PO input whose SAT is greater than $\phi$. A binary search is performed to find the minimum feasible clock period under the $l_2$-value labeling. Based on Seq-TA-M, we can identify the $\epsilon$-network, which is defined to be a subcircuit consisting of the external edges whose slacks are smaller than or equal to $\epsilon$ in the current placement. The $\epsilon$-network consists of critical interconnects, in terms of the retiming potential, that deserve attention from the placement engine for optimization.
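The last two steps, the binary search over the clock period and the extraction of the $\epsilon$-network, can be sketched as follows. This is only an illustration: the feasibility test is passed in as a callback standing in for the relaxation, feasibility is assumed to be monotonic in $\phi$, and the tolerance, the function names, and the sample slack values are hypothetical.

```python
def min_feasible_period(is_feasible, lo, hi, tol=1e-3):
    """Binary search for the smallest feasible clock period in [lo, hi].

    Assumes is_feasible(hi) holds and that feasibility is monotonic:
    if phi is feasible, every larger phi is feasible as well.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_feasible(mid):
            hi = mid  # mid works, try a smaller period
        else:
            lo = mid  # mid fails, the answer lies above it
    return hi

def epsilon_network(slack, eps):
    """External edges whose slack is at most eps.

    slack: dict mapping an external edge to its slack in the
    current placement.
    """
    return {edge for edge, s in slack.items() if s <= eps}

# Stand-in feasibility test with a known threshold of 3.2.
phi_min = min_feasible_period(lambda phi: phi >= 3.2, 0.0, 10.0)

# Hypothetical slacks on three external edges.
slack = {("a", "b"): 0.0, ("b", "c"): 0.4, ("c", "d"): 2.5}
critical = epsilon_network(slack, eps=0.5)
```

Each probe of the binary search costs one full feasibility test, which is why the number of probes matters for the overall runtime.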
INTEGRATION OF RETIMING WITH MULTILEVEL PLACEMENT FRAMEWORK
The second contribution of our work is to provide a solution for integrating retiming into a multilevel placement framework, together with two speed-up techniques for Seq-TA-M.
The multilevel optimization method is very powerful in solving problems with high computational complexity. It includes two phases, a coarsening phase and a refinement phase. We integrate retiming with a multilevel coarse placement [1] by performing Seq-TA-M to identify critical nets and assigning higher weights to them. A simulated annealing-based placement engine minimizes the weighted wirelength to reduce the length of the critical path and thus the delay. The integrated placement algorithm, called mPG-rt, consists of a bottom-up coarsening phase and a top-down placement phase. The overall flow is shown in Figure 5.
During the bottom-up coarsening phase, we build a coarser-level netlist (graph) $L_i$ from level $L_{i-1}$ by performing clustering until we reach level $L_t$, where the number of clusters is within a certain range, so that the SA-based placement can be efficiently performed.
We use the FirstChoice (FC) clustering algorithm [9], as it experimentally generates a better hierarchy for global placement [1]. FFs are not allowed to be clustered at the levels where Seq-TA-M is to be performed.

During the top-down placement refinement phase, at level $L_i$ where $i > k$, i.e., where FFs may be clustered, we perform static timing analysis and assign weights to nets according to their criticality to reduce the longest path delay at the current level. This helps reduce the runtime of Seq-TA-M at the finer levels, as it gives Seq-TA-M a placement with a decent path delay to start from during the binary search for the minimum feasible clock period. At level $L_i$ where $i \le k$, i.e., where FFs are not clustered, we build a retiming graph, perform Seq-TA-M once at each temperature, and weight each net according to its criticality in terms of the retiming potential. Except for the top level, where a full-scale SA process is performed, the SA process starts from a low temperature at the following levels to save runtime [1]. At the finest level, after the SA-based placement is finished, final retiming is performed and FFs are inserted as proposed in [5]. A low-temperature SA process may be required to legalize the retimed placement.
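The per-level choices in the top-down phase can be summarized in a short sketch. It assumes that levels are indexed from the coarsest (top) level down to a finest level 0; the function name and the string labels are hypothetical and do not come from the mPG-rt implementation.

```python
def placement_schedule(top_level, k):
    """List (level, timing analysis, SA start) from coarsest to finest.

    Levels above k may contain clustered FFs, so static timing analysis
    is used there; levels at or below k keep FFs unclustered and use the
    sequential timing analysis (Seq-TA-M).
    """
    plan = []
    for i in range(top_level, -1, -1):
        analysis = "static" if i > k else "sequential"
        # Only the top level runs the full-scale SA process.
        sa_start = "full" if i == top_level else "low-temperature"
        plan.append((i, analysis, sa_start))
    return plan

schedule = placement_schedule(top_level=3, k=1)
```

With `top_level=3` and `k=1`, levels 3 and 2 use static timing analysis and levels 1 and 0 use the sequential analysis.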
Speed up the Sequential Timing Analysis
Although Seq-TA-M is polynomial, its complexity can still be up to $O(|V||E|)$ when determining whether the target delay is feasible under the $l_2$-labeling and computing SAT and SRT.

Additionally, the binary search for finding the minimum feasible clock period under the $l_2$-labeling may also be time-consuming. When we perform Seq-TA-M, we visit the vertices in their pseudo-topological order, as it is shown in [15] that the relaxation then converges much faster than with a random order. Furthermore, we provide two methods to further speed up the Seq-TA-M process. The first method is called "single-$\phi$": instead of doing a binary search to find the minimum feasible clock period for the SAT, SRT, and slack computation, we just use the longest path delay $D$ of the current placement to calculate the SAT, SRT, and slack. Because $D$ is the longest path delay of the given placement, it is always a feasible clock period. The second method is called "early abortion": the relaxation is stopped after a limited number of iterations instead of being run to full convergence.
Net Weighting
At each temperature in the SA process, once either static timing analysis or sequential timing analysis is performed, the criticality of the nets is obtained and translated into weights on the nets. When static timing analysis is performed, we use the PATH algorithm [10] to compute the weights of the nets. When sequential timing analysis is performed, we use the net weighting methods proposed in [12, 19] to compute the weights. The delay cost of an edge in the timing graph (or retiming graph) is the product of the edge delay and its weight. Delay-Cost is the sum of the delay costs of all the edges/connections in the timing graph (or retiming graph). Wire-Cost is the sum of the bounding box lengths of all the nets. The overall cost is a weighted sum of Wire-Cost and Delay-Cost defined as

$$\text{cost} = \alpha \cdot \frac{\text{Delay-Cost}_{current}}{\text{Delay-Cost}_{previous}} + (1 - \alpha) \cdot \frac{\text{Wire-Cost}_{current}}{\text{Wire-Cost}_{previous}}$$

where $\alpha$ is a user-defined value which trades off wirelength against delay. We set it to 0.5.
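A minimal sketch of this cost evaluation follows. It assumes the normalized form $\alpha \cdot$ (Delay-Cost ratio) $+ (1 - \alpha) \cdot$ (Wire-Cost ratio), and it takes the bounding box length of a net to be its half-perimeter; both are assumptions, and all function names are hypothetical.

```python
def delay_cost(edges):
    """Sum of delay * weight over all timing-graph edges.

    edges: iterable of (edge_delay, edge_weight) pairs.
    """
    return sum(d * w for d, w in edges)

def wire_cost(nets):
    """Sum of bounding box lengths, taken here as half-perimeters.

    nets: iterable of nets, each a list of (x, y) pin positions.
    """
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def overall_cost(delay_cur, delay_prev, wire_cur, wire_prev, alpha=0.5):
    """Weighted sum of the two costs, each normalized by its previous value."""
    return (alpha * delay_cur / delay_prev
            + (1.0 - alpha) * wire_cur / wire_prev)

# A move that cuts the delay cost by 10% but lengthens the wires by 10%.
cost = overall_cost(delay_cur=90.0, delay_prev=100.0,
                    wire_cur=110.0, wire_prev=100.0)
```

Normalizing each term by its previous value keeps the two costs on a comparable scale, so that $\alpha$ acts as a relative trade-off rather than a unit conversion.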
EXPERIMENTAL RESULTS
We implemented our algorithm mPG-rt in C++/STL and tested it on a Sun Blade 1000 workstation running at 750MHz frequency.
The benchmark consists of 5 ISCAS circuits and 4 large-scale industrial designs which were used in [5].¹ The delay model we used is the same as that in [5].² The circuit characteristics are listed in Table 1.
Impact of Speedup Techniques of Seq-TA
We tested our two speed-up techniques, "single-$\phi$" and "early abortion," for Seq-TA-M on some of the largest circuits on an 8x8 global placement grid. The results are shown in Table 2. For each circuit we adopted "single-$\phi$" with full relaxation, "early abortion" with 5, 15, and 30 iterations, and the full relaxation.
From this table it can be seen that using a limited number of iterations can greatly reduce the runtime with a reasonable quality loss. It becomes even more efficient when the circuit size increases. Therefore, in all of our experiments shown in the next subsection, we used the "early abortion" scheme with 30 iterations.

¹ Currently the large-scale placement benchmarks in the public domain are mainly for wirelength-driven placement and lack the functionality information of the cells/blocks which is required by the retiming operation.

² We did not compare mPG-rt with GEO because GEO did not consider wirelength minimization and a direct comparison in terms of wirelength may not be very meaningful.

Table 3: Impact of simultaneous retiming and placement.
Impact of Simultaneous Retiming and Placement
Our multilevel coarse placement can be used as a wirelength-driven placer when the cost is set entirely to the Wire-Cost (which is mPG). We ran mPG [1] followed by retiming and a post legalization on the global placement. We also ran mPG-rt followed by the same post legalization procedure. We compared the results generated by mPG followed by retiming with those generated by mPG-rt to show the impact of simultaneous retiming and placement. We also report the delay of the placement results generated by mPG before retiming to show the impact of retiming on placement. The comparison results are shown in Table 3.
It can be seen that (i) retiming can improve the performance by 14% on average when it is applied after placement; (ii) our simultaneous approach for retiming and placement can outperform the two-step approach (placement followed by retiming) by 10% on average in terms of delay with 10% wirelength increase, demonstrating the necessity of such integration.
CONCLUSIONS AND ONGOING WORK
We proposed a practical solution for integrating retiming into the multilevel global placement for large-scale designs. We extended the available sequential timing analysis to handle gates/clusters with multiple outputs, and integrated it into a multilevel SA-based placement framework for performance optimization. We also provided two speed-up techniques to enable it to be efficiently integrated into the placement engine. Experimental results show that such an approach is efficient compared with the two-step approach (placement followed by retiming). We are currently working on large benchmarks to further test our approach, and plan to use a performance-driven clustering algorithm in the coarsening phase instead of connectivity-driven clustering algorithms.

³ It is done by a timing-driven SA-based refinement on the finest level in the mPG framework.
