At the integration scale of System-On-Chips (SOCs), the conflicts between communication and computation will become prominent even on a chip. A large fraction of system time will shift from computation to communication. In synchronous systems, a large amount of communication time is spent on multiple-clock-period wires. In this paper, we explore retiming to pipeline long interconnect wires in SOC designs. Behaviorally, it means that both computation and communication are rescheduled for parallelism. The retiming is applied to a netlist of macro-blocks, where the internal structures may not be changed and flip-flops may not be insertable on some wire segments. This problem is different from that on a gate-level netlist and is formulated as a wire retiming problem. A theoretical treatment and a polynomial time algorithm are presented in the paper. Experimental results show the benefits and effectiveness of our approach.
cores or regular-structured blocks such as memories, (combinational) buffers or flip-flops may not be inserted everywhere [20]. We will incorporate buffering position restrictions in our solution.
The contributions of this paper are in multiple aspects. First, timing macro-models of both combinational and sequential blocks are established to facilitate retiming on them. Second, flip-flop location restrictions within blocks and on the wires through blocks are handled uniformly. Finally, a polynomial time algorithm is designed for the wire retiming problem that has both block and wire delays.
Problem formulation
We consider wire pipelining via retiming on a SOC design with a given block placement (also known as a floorplan) and a global routing of the global wires. Such a design is depicted in Figure 1. Here, we have five blocks with the floorplan and the global routing of global wires. Each wire has an arrow to indicate the signal direction, and a weight to specify how many flip-flops are on the wire. Wires with weight 0 have the weight omitted. For example, the wire from u to v has 1 flip-flop, while the wire from v to u has 0 and that from v to w has 2. Some segments of a wire may not accommodate any buffer or flip-flop because they run over macro-blocks that do not allow transistors to be added.
In order to take the delays within each block into consideration, timing models are used to specify the timing behavior of each block. Due to the increasing popularity of SOC design and IP-core based design, there has recently been increasing research activity on timing models for macro-blocks [7, 16, 10]. If a block is a pure combinational circuit such as an ALU or a multiplier, then a minimum and a maximum delay from each input pin to every output pin can be used to characterize the timing behavior of the block. This is the traditional approach. Recent research has mainly focused on sequential blocks [7, 16, 10]. Generally speaking, if there are combinational paths from an input pin to an output pin, a minimum and maximum delay pair is used to characterize the delay on each path. If an input pin has a path to a flip-flop in the block, the arrival time of the input pin must be constrained to satisfy the set-up condition. Finally, if an output pin has paths from flip-flops in the block, then the arrival time of the pin is given by the path delays.
Traditional retiming is applied to logic-level netlists that are composed of simple gates. In our application, the netlist is composed of macro-blocks. Since a combinational block can be viewed as a complex gate, moving a flip-flop over it is simply justified. Now we will show that retiming can be generalized to sequential blocks.
Lemma 1. In a SOC design composed of macro-blocks, a flip-flop can be moved from every input to every output of a block, or vice versa, without changing the function of the design.
However, in order for a retiming procedure to move flip-flops over sequential blocks, the timing model for a sequential block must be a graph that connects its inputs to its outputs. The traditional approach of treating sequential inputs as primary outputs and sequential outputs as primary inputs cuts off the connections through the block and hence makes it impossible to move flip-flops over the sequential block.
We will now consider how flip-flops can be moved over timing models of macro-blocks. When the block is a combinational circuit, as shown in Figure 2 (a), we can use edges between inputs and outputs to represent the path delays between them, as shown in Figure 2 (b). If we place these edges in the traditional retiming formulation, flip-flops can be moved over them. However, there are two caveats. First, in traditional retiming, flip-flops may be placed on any edge. In order to avoid flip-flops being placed on the edges we introduced in timing models, we require that the retiming tags of the input and the output connected by such an edge be the same. This means that the number of flip-flops moved over the input is the same as the number moved over the output. Therefore, no flip-flop will be left in between. Second, depending on the structure of the circuit within the block, an input may not have a path to every output. For example, in Figure 2 (a), inputs a and b do not connect to output y, and input c does not connect to output x. If the macro-block is substituted by the edges in its timing model, then the number of flip-flops moved over x may be different from that over y. On the other hand, if a block is treated as a super-gate, it is required that the retiming tags of the inputs and the outputs are all the same, and this may give suboptimal solutions. Since the delay edges represent the path topology of the circuit, our timing model for macro-blocks gives us more flexibility.
The timing model of a macro-block (whether it is combinational or sequential) is composed of a set of edges on which no flip-flop can be placed, and the timing model of a net is composed of a set of edges, some of which accommodate buffers and flip-flops while others forbid them. After applying these models, our problem is therefore represented as a directed graph with two types of edges: one allows buffer and flip-flop insertions and the other forbids them.
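The effect of the per-edge tag constraint can be illustrated with a minimal Python sketch. It assumes that the block of Figure 2 has combinational paths a to x, b to x, and c to y (inferred from the statement about which pins are not connected); the function name `tag_groups` is hypothetical and not part of the paper. Pins joined by a forbidden timing-model edge are merged, so each resulting group must share one retiming tag.

```python
def tag_groups(paths):
    """Group pins that must share a retiming tag.

    paths: iterable of (input_pin, output_pin) pairs, one per
    forbidden timing-model edge (assumed connectivity, see lead-in).
    """
    parent = {}

    def find(p):
        # Union-find lookup with path halving.
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, o in paths:
        # Equal tags across each forbidden edge: merge the two pins.
        parent[find(i)] = find(o)

    groups = {}
    for p in list(parent):
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Assumed path topology for the block in Figure 2.
paths = [("a", "x"), ("b", "x"), ("c", "y")]
groups = tag_groups(paths)
print(groups)
```

Under this assumed topology the pins fall into two independent groups, so the tag of x may differ from the tag of y, whereas a super-gate model would force all five pins into a single group.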
Since the nets we are considering are global interconnections among macro-blocks, they are relatively long. On those wire segments where buffers and flip-flops are allowed, we can use optimal buffering to make their delays linear in their lengths [17]. Therefore, we assume that the delays on buffer-allowable edges are linear.
In summary, in a graph model of the problem, a vertex is used to represent a source or sink of a net, the input or 
Notations and constraints
Before discussing the solutions to the two problems, we will first select some essential notations to help us to clearly state the requirements for a solution.
From the formulations of the problems, we already have a
We will follow the convention of Leiserson and Saxe [14] and use an integer variable r(u) to represent the number of flip-flops moved from the outgoing edges of a vertex u to its incoming edges. In [14], these variables were sufficient to specify a retiming solution since there was no wire delay and a solution only needed to tell whether a flip-flop was on a wire. In fact, it is not the absolute values of these variables but their differences that are important, since the number of flip-flops on a given edge (u, v) after retiming is given by
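For reference, the standard Leiserson and Saxe expression for the retimed flip-flop count on an edge (u, v) is w(u, v) + r(v) - r(u), which matches the sign of the (r(v) - r(u)) term used later in the proof. The sketch below applies it to a hypothetical three-vertex graph; the symbol w for the original edge weights and the function names are assumptions made for illustration.

```python
def retimed_weights(w, r):
    """Flip-flop count per edge after retiming.

    w: dict mapping edge (u, v) to its original flip-flop count
       (assumed notation).
    r: dict mapping vertex to its integer retiming tag.
    """
    return {(u, v): k + r[v] - r[u] for (u, v), k in w.items()}


def is_legal(weights):
    # A retiming is only meaningful if no edge ends up with a
    # negative number of flip-flops.
    return all(k >= 0 for k in weights.values())


# Hypothetical example graph and tags.
w = {("u", "v"): 1, ("v", "u"): 0, ("v", "w"): 2}
r = {"u": 0, "v": -1, "w": 0}
new_w = retimed_weights(w, r)
print(new_w, is_legal(new_w))
```

In this example the single flip-flop on the cycle between u and v moves from one edge to the other, and the cycle total is unchanged, which is why only differences of tags matter.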
However, in our problem, a retiming solution must include the positions of flip-flops on each wire. Because retiming can change the number of flip-flops in the system, it is not even known how many flip-flops there will be after retiming. Fortunately, we can overcome the problem by only specifying the arrival time of every vertex with respect to a clock period. For each vertex $u \in V$, we use t(u) to represent its arrival time with respect to the nearest flip-flop on its incoming paths.
Given t(u), the positions of flip-flops directly fanning into u can be found, and the positions of other flip-flops can also be computed.
Using these notations, the requirements for a retiming solution can be stated as follows. First, to ensure that no flip-flop is inserted on forbidden edges, we need
Then, to ensure the availability of flip-flops, it must be true that
The following inequalities will guarantee that the arrival times are all achievable.
Finally, the set-up conditions at the flip-flop inputs are equivalently stated in the following inequalities.
As we can see, the fixed period wire retiming problem is actually a mixed integer linear programming problem given by (1)-(5).
We also need to set t(u) = 0 for every primary input u, and r(u) = 0 for every primary input or output u. A general mixed integer linear programming problem is NP-hard. However, this problem can be solved in polynomial time.
3.2
From T 2
Since a gate (with only one output) only gives a directed tree with forbidden edges, Lemma 3 subsumes previous results [18, 15, 4]. However, our result also shows that it can be extended to circuits with complex blocks such as multipliers and adders where each output depends on all inputs. Fortunately, the forbidden edges introduced by a net (as shown in Figure 4) are always a directed forest, and thus will not cause any trouble.
When the topology of forbidden edges does not satisfy the condition in Lemma 3, the optimal clock period may not be upper bounded by T1 + T2. As an example, consider the circuit shown in Figure 5.
Our approach to finding a tighter upper bound is as follows. First, we find an optimal retiming solution without considering forbidden edges. This can be done based on the computation of T2. Then a local adjustment that moves flip-flops out of forbidden edges is done to get a feasible solution. The objective in this step is to keep the increase of the clock period as small as possible. From any set of forbidden edges that form a complete bipartite graph, any flip-flop can be moved out with an increase of at most T1 to the period. To move a flip-flop out of a forbidden edge in a non-complete bipartite graph, other flip-flops may have to be moved over the block, so the increase of the period could be larger. However, the local adjustment will keep the number of flip-flops moved out of a non-forbidden edge as small as possible. In the example in Figure 5 (a), an optimal retiming without considering forbidden edges is shown in Figure 5 (b) and has a period of 34. The local adjustment will move two flip-flops out from O1 and two from I2. Therefore, our upper bound will be 3 x 34 = 102.
Fixed period wire retiming
From now on, we will consider how to check whether a clock period T that satisfies the above lower bounds can be realized by retiming. We will use an approach similar to Leiserson and Saxe [14] to solve this problem in polynomial time.
For any path p in the graph, we define the sequential delay sd(p) as follows
where T is the clock period. We define the sequential delay sd(u, v) between any two vertices u, v to be $sd(u, v) = \max_{p \in u \leadsto v} sd(p)$.
Then, consider the following set of inequalities
Using the definition of the sequential delay, for any edge $(u, v) \in E_2$, (8) means that
Thus, (8) implies (2). However, the importance of (8) goes beyond that. Under the lower bound condition of (6), formulas (1) and (8)
The other direction is more difficult, since besides the number of flip-flops on each edge we also need to find the positions for those flip-flops such that the delay between any two consecutive ones is upper bounded by the clock period. In other words, we need to prove that there exists a solution for (3)-(5) determined by the solution of (1), (6) and (8).
Since (6) is true, once (1) and (8) have a solution, we can always compute a solution for all t(u) satisfying (3), (4) and $t(u) \ge 0$ by applying the Bellman-Ford algorithm [6], because it will not report a positive cycle under (6). Meanwhile, for any $v \in V$ with $t(v) > 0$, there must exist at least one u such that $(u, v) \in E$ and $t(v) - t(u) = sd(u, v) - (r(v) - r(u))T$.
Since we have set t(u) = 0 for every primary input, we now prove that such a solution also satisfies the requirement $t(u) \le T$ for all $u \in V$.
For the sake of contradiction, assume there exists a vertex v such that $t(v) > T$. Starting from v, we trace back along those critical edges (u, v) such that $t(v) - t(u) = sd(u, v) - (r(v) - r(u))T$ until we reach a vertex u with t(u) = 0. In the worst case, u is a primary input. Now for t(u) and t(v), we have
Hence $t(v) \le T$, which contradicts our assumption.
$O(N|E|)$, where N is the product of the out-degrees of all the vertices in G. In [9], some popular algorithms that were widely used in the CAD community were systematically compared, and the comprehensive experimental results revealed that Howard's algorithm was by far the fastest algorithm, though the only known bound on its running time is exponential. In our implementation, we adopt an improved version of Howard's algorithm [3, 9]. After T2 is obtained, we compute a tight upper bound of the feasible clock periods in the way described in section 3.2.
In the second step, a binary search is used to find the optimal clock period. Given a particular T, we first need to apply Johnson's all-pairs shortest path algorithm [6] to compute all $sd(u, v) = \max_{p \in u \leadsto v} sd(p)$. Based on the results, we create a new graph rG incorporating all the vertices in V and edges (u, v) with weight $\lceil sd(u, v)/T \rceil - 1$ if there exists a path from u to v in G, for all $u \ne v \in V$. We also need to introduce a virtual vertex M in rG, as well as zero-weight directed edges from each primary output (PO) to it and from it to each primary input (PI). Then we apply the Bellman-Ford algorithm to check if there exists a positive cycle in rG. When it terminates, we can decide how to adjust the two bounds accordingly. Note that once rG is created, its structure is kept throughout the rest of the algorithm, and its edge weights are recomputed and updated every time T is changed.
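The second step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the pairwise sequential delays are supplied by a hypothetical callback `sd_for(T)` instead of being computed with Johnson's algorithm, a positive cycle in rG is taken to mean that T is too small, and the search runs over real-valued periods with a tolerance `tol`. All function names are hypothetical.

```python
import math


def build_retiming_edges(sd, T, pis, pos, virtual="M"):
    """Edges of rG: weight ceil(sd(u, v) / T) - 1 for each connected pair,
    plus zero-weight edges PO -> M and M -> PI through the virtual vertex."""
    edges = [(u, v, math.ceil(d / T) - 1) for (u, v), d in sd.items() if u != v]
    edges += [(po, virtual, 0) for po in pos]
    edges += [(virtual, pi, 0) for pi in pis]
    return edges


def has_positive_cycle(edges):
    """Bellman-Ford style longest-path relaxation from an implicit source."""
    vertices = {x for u, v, _ in edges for x in (u, v)}
    dist = {x: 0 for x in vertices}
    for _ in range(len(vertices)):
        changed = False
        for u, v, w in edges:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False  # converged, so no positive cycle exists
    return True  # still relaxing after |V| passes


def search_period(lo, hi, sd_for, pis, pos, tol=1e-3):
    """Binary search; assumes lo is infeasible or a lower bound, hi is feasible."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        edges = build_retiming_edges(sd_for(mid), mid, pis, pos)
        if has_positive_cycle(edges):
            lo = mid  # assumed: positive cycle means T is too small
        else:
            hi = mid
    return hi


# Hypothetical two-vertex loop; sd is held constant here for simplicity,
# whereas in the paper sd depends on T.
demo_sd = lambda T: {("a", "b"): 25.0, ("b", "a"): 5.0}
best = search_period(1.0, 100.0, demo_sd, pis=[], pos=[])
print(round(best, 2))
```

Keeping the edge list structure fixed and only recomputing weights for each new T, as the text describes, would avoid rebuilding rG inside the loop; the sketch rebuilds it for brevity.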
In the third and last step, we use the optimal clock period and the corresponding r(u) computed in step two to calculate t(u) for all $u \in V$. Due to Theorem 1, a feasible solution for t(u) is guaranteed.
To distinguish it from the graph G, we call rG the retiming graph and its edges retiming edges. We now present two pruning techniques which help to reduce the redundancy of (8). By deleting the redundant inequalities, we can reduce the number of retiming edges dramatically. Although this does not improve the asymptotic complexity, it substantially reduces the running time in practice.
We implemented the algorithm on a PC with two 2.4GHz/512K Xeon CPUs and 1GB RAM. We performed retiming on the ISCAS-89 benchmark suite. In the absence of delay information for the ISCAS-89 circuits, we randomly assign delay values between 1.0 and 2.0 units to gates (we treat them as macro-blocks) and between 0.2 and 5.0 to wires. Since we are focusing on the chip level, the delay range is intentionally chosen so that the wire delay is commensurate with, or even many times larger than, the block delay. To further test the cases with non-complete bipartite ("non-CB" in Table 1) blocks, we apply hMETIS [12] to partition a circuit into groups. All edges inside a group are then treated as forbidden edges. The number of partitions of a circuit is denoted as "No.Part" in Table 1. We did not further apply our timing model to the partitions when generating the results. In addition, the lower bound T2 of each circuit is reported for comparison with the optimal clock periods we computed.
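The random delay assignment used in the experimental setup can be sketched as below. The text only states the ranges, so the uniform distribution, the fixed seed, and the function name `assign_random_delays` are assumptions made for illustration.

```python
import random


def assign_random_delays(blocks, wires, seed=0):
    """Draw a delay for every block and wire from the stated ranges.

    Assumptions: uniform sampling and a fixed seed for repeatability.
    """
    rng = random.Random(seed)
    block_delay = {b: rng.uniform(1.0, 2.0) for b in blocks}  # gates / macro-blocks
    wire_delay = {w: rng.uniform(0.2, 5.0) for w in wires}    # global wires
    return block_delay, wire_delay


# Hypothetical tiny netlist.
blocks = ["g1", "g2", "g3"]
wires = [("g1", "g2"), ("g2", "g3"), ("g3", "g1")]
block_delay, wire_delay = assign_random_delays(blocks, wires)
print(block_delay, wire_delay)
```

With these ranges a wire can be slower than any block by a factor of up to 5, which reflects the intent that wire delay be commensurate with or larger than block delay.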
Pruning and optimization
In our practical implementation, we found that the benefit of the first pruning technique presented in section 4.2 did not offset its time cost. The $\Theta(|V|^3)$ complexity of the first technique is tight and hence becomes the dominant part of the total running time. Therefore, we only apply the second pruning technique and compare the running times of the two schemes mentioned in section 4.2. Experimental results show that the performance of the two schemes is at the same level. Thus, we only report the running time of the second scheme in Table 1.
Conclusions and future work
The wire retiming problem, which relocates existing flip-flops to multiple-clock-period interconnections, is formulated with the help of the block timing models. Similar to Leiserson and Saxe [14], a set of integer difference inequalities is shown to be both necessary and sufficient, which gives a polynomial time algorithm. Our current and ongoing work is to investigate a more efficient algorithm similar to [19] for the wire retiming problem. The preliminary results are very encouraging.
