As CMOS technology scales into the nanometer regime, the clock frequencies and die sizes of ICs have been increasing steadily [5]. Today, global wires that require multiple clock cycles to propagate an electrical signal are prevalent in many deep sub-micron designs. Efforts have been made to pipeline these long wires by introducing registers along the global paths in order to reduce the impact of dominant wire delay [2, 8].
INTRODUCTION
As CMOS technology continues to scale down, the clock frequencies and die sizes of integrated circuits are increasing steadily. Data show that the frequencies of high-performance ICs have doubled every process generation while die sizes have increased by about 25% [5]. With such short cycles and long interconnects, it is not uncommon for a global signal to take multiple clock periods to travel across a chip. To alleviate this problem, we need to be able to insert registers to pipeline long global interconnects [2, 8]. However, arbitrary insertion of registers along a wire is forbidden because it would change the original functionality of the circuit. Instead, retiming, a sequential circuit optimization technique that relocates registers without affecting circuit functionality, can be applied [9, 10].
Retiming is a sequential circuit optimization technique that relocates registers while maintaining the functionality of the circuit. This powerful technique can be used to optimize the clock period, the number of registers, or a combination of both in a sequential circuit. Since retiming was first formulated a decade ago, much effort has been made to improve the efficiency of the technique [15] or to apply it in areas such as physical planning [4], circuit partitioning [12], power reduction [14] and testability [6]. Recently, there have been research efforts to address the retiming problem with dominant global wire delays [17, 1]. In contrast to the traditional setting of retiming, where wire delay is ignored, these new methods handle gate and wire delay simultaneously. Their assumption that the delay of a wire grows linearly with its length applies to our case of global interconnects, since optimal buffering can be performed.
However, the placement of registers after retiming is a new challenge. A net is represented as a branch of edges in the retiming graph model, which carries no information about the topology of the routes or the positions of the registers, so it is unknown whether the clock period obtained from retiming can be realized in the design. Being able to insert registers into a placement solution so as to realize a retiming solution and preserve the corresponding clock period is important; otherwise the retiming optimization is meaningless. Meanwhile, minimizing the number of registers used is also essential, as a register is usually several times larger than a simple gate regardless of the process technology being used.
Even though there are several previous works on post-retiming register placement, many of them suffer from over-simplification when wire delay dominates. For example, in [16], the authors assume that a register is located at the geometric center of the connected gates. This assumption is natural, but the clock period resulting from retiming is then easily violated. A similar problem occurs in [11], in which the authors determine the position of a register such that the sum of the lengths of the nets connected to that register is minimized. Although both methods provide a fast estimate of register positions, the optimized clock period obtained from retiming cannot be achieved in the layout when interconnect delay is a dominant factor of the circuit delay. A more sophisticated method is needed to tackle this problem.
In this paper, we study the problem of realizing a retiming solution on a global netlist by inserting registers in placement to achieve the target clock period. This problem involves two main sub-problems, namely, topology finding and register placement. As mentioned before, a net is modeled as a branch of edges in the retiming graph. Topology finding is the problem of determining an optimal sharing of the registers among the fanout edges of a net, given the geometric positions of the connected gates, such that the optimal clock period obtained from retiming is preserved. After obtaining the topology tree of a net, we need to find an appropriate position for each register given the constraints in placement (some occupied areas do not allow register insertion); this problem is known as register placement. Given a placement (we used standard cell design in our experiments) and a retiming solution (we used the technique in [1] to generate the retiming solutions in our experiments), our proposed algorithm inserts registers into the placement solution to preserve the clock period as much as possible. Notice that our algorithm does not depend on the retiming algorithm being used, as long as that algorithm considers the wire delay of global nets and gives a retiming solution with a target clock period, retiming labels and the maximum arrival time at each gate output.
Our algorithm can find the optimal topology, i.e., one using the minimum number of registers while preserving the clock period, for nets with 4 or fewer pins. Since nets with 4 or fewer pins constitute, on average, over 90% of the nets in a circuit, the proposed algorithm performed well in the experiments. Nearly all the nets had their best topology found and their registers inserted successfully while maintaining the clock period.
The remainder of this paper is organized as follows. We present the problem statement in Section 2. The problem of retiming with gate and wire delay is briefly reviewed in Section 3. In Section 4, our proposed algorithm for topology finding and register placement will be presented. Experimental results are presented in Section 5 and a conclusion follows in Section 6.
PROBLEM FORMULATION
Problem Statement. Given a placed sequential circuit and a retiming solution, i.e., an optimal clock period $\mathit{clk}$, a retiming label $r(v)$ at each gate $v$ and the maximum arrival time $a(v)$ at the output of each gate $v$, we want to insert registers into the circuit layout to realize the retiming solution while preserving the clock period $\mathit{clk}$ as much as possible. We represent the circuit as a graph $G(V, E)$, where each vertex $v \in V$ corresponds to a combinational gate, and each directed edge $e_{uv} \in E$ represents a connection from the output of gate $u$ to the input of gate $v$. Let $w(u, v)$ be the number of registers along the edge $e_{uv}$, and let $d_{uv}$ be the wire delay of edge $e_{uv}$ if no register lies along the edge. The wire delay $d_{uv}$ is assumed to be proportional to the shortest Manhattan distance between $u$ and $v$. Now consider a net $N(s, D, L)$, where $s$ denotes the driving gate, $D$ denotes the set of all driven gates, and $L$ denotes the set of interconnections between $s$ and each of the gates $d_i \in D$. Obviously, $\{s\} \cup D \subseteq V$ and $L \subseteq E$. For each edge $e_{sd_i} \in L$, the value $w_r(s, d_i)$ represents the number of registers along the edge $e_{sd_i}$ after retiming. The problem is to insert the minimum number of registers for this net into the circuit according to the retiming solution such that the clock period is preserved as much as possible. This problem comprises two main sub-problems known as topology finding and register placement. Topology finding is the problem of finding a topology $\Upsilon_N$ of net $N$, given the exact geometric positions of the gates, such that the minimum number of registers is used and the target clock period is preserved. Register placement is the problem of finding the corresponding position for each register given the topology $\Upsilon_N$ of net $N$.
A topology $\Upsilon_N = (P, K)$ is a tree (an acyclic graph with no designated root yet) that describes the structure of net $N$ on the plane. Each node $p \in P$ corresponds to either a combinational gate or a register, and each edge $k_{uv} \in K$ represents a physical connection between node $u$ and node $v$. Each node $p \in P$ that has only one adjacent node in $\Upsilon_N$, i.e., $\deg(p) = 1$, represents a combinational gate, while an internal node $p \in P$ that has more than one adjacent node, i.e., $\deg(p) > 1$, represents a register. Fig. 2 shows an example of a 4-pin net in which each source-to-sink edge has one register after retiming. There are five possible register sharing topologies in this example: (a) all the edges share a single register (maximum sharing), as shown in fig. 3; (b) each edge has its own register (no sharing), as shown in fig. 4; (c) in the remaining three equivalent cases, two of the edges share a single register while the third has a separate one, as shown in fig. 5. Although we can always identify the topology tree that has the maximum sharing of registers for a net, it is not always possible to place the registers on a chip using that topology while preserving the given clock period. Using case (a) in fig. 3 as an example, suppose the clock period resulting from retiming, $\mathit{clk}$, equals 1.5 units and the positions of gates $u$, $a$, $b$ and $c$ are $(0, 0)$, $(-3, 0)$, $(0, 3)$ and $(3, 0)$ respectively, as depicted in fig. 6. It is impossible to share a single register among the three edges without clock violation: every sink lies at Manhattan distance $3 = 2 \times \mathit{clk}$ from $u$, so the register on each edge must sit exactly halfway, and the three midpoints are distinct. Three separate registers have to be allocated and inserted exactly at $(-1.5, 0)$, $(0, 1.5)$ and $(1.5, 0)$ for edges $e_{ua}$, $e_{ub}$ and $e_{uc}$ respectively in order to satisfy $\mathit{clk}$. Even if we have a feasible topology tree, it can happen that the suggested position for a register has been occupied by some other gates, i.e., the target area is blocked, and we have to look for another appropriate position.
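The infeasibility in the example of fig. 6 can be checked numerically. The following is a minimal sketch, assuming (as the example implies) that wire delay equals Manhattan length, that gate delays are ignored, and that a register must lie within distance clk of both the driver and each sink it serves. The function names are hypothetical and are not part of the paper.

```python
def manhattan_ball(center, radius):
    """L1 ball as (u_lo, u_hi, v_lo, v_hi) in rotated coordinates u = x + y, v = x - y."""
    x, y = center
    u, v = x + y, x - y
    return (u - radius, u + radius, v - radius, v + radius)


def common_region(balls):
    """Intersection of rotated rectangles, or None if they share no point."""
    u_lo = max(b[0] for b in balls)
    u_hi = min(b[1] for b in balls)
    v_lo = max(b[2] for b in balls)
    v_hi = min(b[3] for b in balls)
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)


def to_xy(u, v):
    """Map rotated coordinates back to (x, y)."""
    return ((u + v) / 2, (u - v) / 2)


clk = 1.5
driver = (0, 0)
sinks = {"a": (-3, 0), "b": (0, 3), "c": (3, 0)}

# One register shared by all three edges: it must be within clk of every gate.
all_balls = [manhattan_ball(driver, clk)] + [manhattan_ball(p, clk) for p in sinks.values()]
print("shared by all:", common_region(all_balls))

# One register per edge: it must be within clk of the driver and of that sink.
for name, pos in sinks.items():
    region = common_region([manhattan_ball(driver, clk), manhattan_ball(pos, clk)])
    print(name, to_xy(region[0], region[2]))
```

In rotated coordinates a Manhattan ball becomes an axis-aligned square, so the feasibility test reduces to interval intersection. For this example each per-edge region degenerates to a single point, which is why the register positions are forced.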
Section 4 will address how a feasible topology tree can be found and how the positions of the registers can be finalized.
REVIEW OF RETIMING WITH GATE AND WIRE DELAY
In this section, we briefly review the technique of retiming with gate and wire delay in [1], which we employed to generate the retiming solutions on which our proposed register insertion algorithm works. In [1], it is assumed that wire delay is proportional to the length of the wire segment and that each gate is associated with a gate delay. A sequential circuit is modeled as a retiming graph $G(V, E)$ as usual, except that each edge has an additional attribute of wire delay. In addition, a variable $a(v)$ is defined as the maximal arrival time at the output of gate $v$ for all $v \in V$.
Two approaches were proposed in [1], one solving the problem optimally and the other near-optimally. Both approaches are based on a transformation of the problem into a system of linear difference inequalities such that the Bellman-Ford algorithm can be used to test the feasibility of a trial clock period. Together with binary search, this yields the optimal or a near-optimal clock period clk, the retiming labels and the value of a(v) for all v ∈ V.
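The general pattern of testing a trial clock period with Bellman-Ford and searching over it can be sketched as follows. This is only an illustration of the pattern: the actual inequalities of [1] are not reproduced in this paper, so `toy_constraints` is a hypothetical stand-in, and the sketch assumes that feasibility is monotone in the clock period.

```python
def feasible(num_vars, constraints):
    """Test a system of difference constraints with Bellman-Ford.

    constraints: list of (i, j, c), each meaning x_j - x_i <= c.
    Returns True if the system has a solution (no negative cycle).
    """
    # Starting every distance at 0 models a virtual source linked to all variables.
    dist = [0.0] * num_vars
    for _ in range(num_vars):
        changed = False
        for i, j, c in constraints:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
                changed = True
        if not changed:
            return True
    return False  # still relaxing after num_vars passes: negative cycle


def min_feasible_clk(build_constraints, num_vars, lo, hi, tol=1e-3):
    """Binary search for the smallest feasible clock period in [lo, hi].

    Assumes hi is feasible and that feasibility is monotone in the period.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(num_vars, build_constraints(mid)):
            hi = mid
        else:
            lo = mid
    return hi


def toy_constraints(clk):
    """Hypothetical two-variable system that is feasible exactly when clk >= 3."""
    return [(0, 1, clk - 5), (1, 0, 2)]


print(min_feasible_clk(toy_constraints, 2, 0.0, 10.0))
```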
Since the near-optimal approach runs much faster than the optimal one, we chose the former method to produce the retiming solutions. Our register insertion algorithm will insert the registers with clock preservation accordingly.
PLACEMENT OF REGISTERS AFTER RETIMING

Topology Finding
In this section, an algorithm is proposed to find the topology of a net given the constraints in placement such that maximum sharing of registers is achieved and the clock period is preserved. This method can find the optimal topology for a net with 4 or fewer pins, and can give near-optimal solution for a net with 5 or more pins according to the experimental results.
Algorithm Description
Given a net N(s, D, L), a clock period clk, and the maximal arrival time at the output of gate v, a(v), we can obtain a feasible topology tree of N, ΥN , as described below.
First, we construct the best possible topology $\Upsilon_N^{opt}$ for $N$, i.e., a topology having the minimum number of internal nodes (an internal node represents a register). Obviously, the number of internal nodes in $\Upsilon_N^{opt}$ equals $Q = \max_{d_i \in D} \{w_r(s, d_i)\}$, where $w_r(s, d_i)$ denotes the number of registers on the edge $e_{sd_i}$ after retiming. We label each internal node as $r_i$, representing the $i$-th register on the net from the source $s$, for $1 \le i \le Q$. An example of the retiming graph model and the corresponding best possible topology $\Upsilon_N^{opt}$ for a 4-pin net is shown in fig. 7.
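The construction of the best possible topology can be sketched as below. Since fig. 7 is not reproduced here, the sketch assumes one natural reading of the text: the registers form a chain from the source, and a sink whose edge carries w registers hangs off the w-th register of that chain. The function name and the node labels are hypothetical.

```python
def best_topology(source, wr):
    """Build the maximum-sharing topology for one net.

    wr maps each sink to the number of registers on its edge after retiming.
    Returns (Q, edges), where edges are (parent, child) pairs directed away
    from the source along the chain source -> r1 -> ... -> rQ.
    """
    Q = max(wr.values())
    chain = [source] + [f"r{i}" for i in range(1, Q + 1)]
    edges = list(zip(chain, chain[1:]))
    for sink, w in wr.items():
        # chain[0] is the source and chain[w] is register r_w.
        edges.append((chain[w], sink))
    return Q, edges


Q, edges = best_topology("s", {"d1": 1, "d2": 2, "d3": 2})
print(Q, edges)
```

With this layout every sink with w registers on its edge sees exactly r1 through rw on its path from the source, so the number of internal nodes is Q.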
We call the region of the plane where a register $r$ can be placed the candidate region of $r$, denoted by $C(r)$. For consistency, the candidate region $C(v)$ of a combinational gate $v$ is the position of $v$ itself, i.e., its coordinates $(x_v, y_v)$, since $v$ is fixed after placement. The $\delta$-extended region of a region, denoted by $R^{+\delta}(\cdot)$, is the region of the plane at a distance $\delta$ or less from some point in that region, where the distance between two points is their shortest Manhattan distance.
In addition, we define an adjacent-gate region for each node $p$ in a topology tree, denoted by $A(p)$, as a $\delta$-extended region of its candidate region $C(p)$, i.e., $A(p) = R^{+\delta}(C(p))$, where $\delta$ is defined differently for different types of nodes. Physically, $A(p)$ is the region of the plane that encompasses all the possible positions for an adjacent gate of $p$ in the net. The value of $\delta$ for $A(p)$ is therefore chosen as follows. If node $p$ is an internal node, $\delta$ equals $\mathit{clk}$. If node $p$ represents a driven gate, $\delta$ equals $a(p) - d_p$, where $a(p)$, given by the retiming solution, is the maximum arrival time at the output of gate $p$ and $d_p$ is the gate delay of $p$. Otherwise, node $p$ represents the driving gate, and we set $\delta$ to $\mathit{clk} - a(p)$. Notice that all these regions are 45°-rotated rectangles on the rectilinear plane because of the Manhattan distance measurement.
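One convenient way to work with these regions is to store them in rotated coordinates u = x + y and v = x - y, where a 45°-rotated rectangle becomes an axis-aligned one and the Manhattan distance becomes max(|Δu|, |Δv|). The sketch below is an illustrative data structure under that assumption; the class and function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RotRect:
    """A 45-degree-rotated rectangle stored as intervals in u = x + y and v = x - y."""
    u_lo: float
    u_hi: float
    v_lo: float
    v_hi: float

    @staticmethod
    def from_point(x: float, y: float) -> "RotRect":
        return RotRect(x + y, x + y, x - y, x - y)

    def extend(self, delta: float) -> "RotRect":
        """Delta-extended region: all points within Manhattan distance delta."""
        return RotRect(self.u_lo - delta, self.u_hi + delta,
                       self.v_lo - delta, self.v_hi + delta)

    def intersect(self, other: "RotRect") -> Optional["RotRect"]:
        u_lo, u_hi = max(self.u_lo, other.u_lo), min(self.u_hi, other.u_hi)
        v_lo, v_hi = max(self.v_lo, other.v_lo), min(self.v_hi, other.v_hi)
        if u_lo > u_hi or v_lo > v_hi:
            return None
        return RotRect(u_lo, u_hi, v_lo, v_hi)

    def contains(self, x: float, y: float) -> bool:
        return self.u_lo <= x + y <= self.u_hi and self.v_lo <= x - y <= self.v_hi


def adjacent_gate_delta(kind: str, clk: float, a: float = 0.0, gate_delay: float = 0.0) -> float:
    """Extension distance for A(p), following the three node types in the text."""
    if kind == "register":
        return clk
    if kind == "driven":
        return a - gate_delay
    if kind == "driving":
        return clk - a
    raise ValueError(kind)


ball = RotRect.from_point(0, 0).extend(2)
print(ball.contains(1, 1), ball.contains(2, 1))
```

With this representation, extension and intersection are both constant-time interval operations, which is what the topology-finding step needs repeatedly.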
Starting from the best possible topology $\Upsilon_N^{opt}$, we modify the topology incrementally until an optimal feasible topology $\Upsilon_N$ is obtained for net $N$. First, we choose the node that represents the driving gate $s$ as the root of $\Upsilon_N^{opt}$ and direct all the edges away from $s$. Then we process each internal node $r_i$ in $\Upsilon_N^{opt}$ from $i = Q$ down to $i = 1$, i.e., from the register furthest from the driving gate $s$ to the closest one, in the following manner.
For each internal node $r_i$ with a set of children $q_1, \ldots, q_m$, find a minimal set of overlapping regions among the $A(q_j)$ for $1 \le j \le m$, denoted by $Y_{min} = (y_1, \ldots, y_k)$, such that the union of the elements in $Y_{min}$ covers at least one point from each region $A(q_j)$. For each $y_l$ in $Y_{min}$, we call the number of regions in which $y_l$ covers at least one point the size of $y_l$, denoted by $s(y_l)$. The elements of $Y_{min}$ are sorted in non-ascending order of their sizes from 1 to $k$, using a greedy procedure ALGSETY as described below.
Procedure ALGSETY(ri, ΥN);
begin
  overlapped := a boolean flag;
  Ymin ← ∅;
  add A(q1) to Ymin, i.e., y1 ← A(q1);
  for j = 2 to m
    overlapped ← false;
Notice that the union of the elements in $Y_{min}$ covers at least one point from each region $A(q_j)$, for $1 \le j \le m$. Next, we remove all the edges from $r_i$ to its children $q_1, \ldots, q_m$ in $\Upsilon_N^{opt}$ and split the node $r_i$ into $k$ new internal nodes $n_1, \ldots, n_k$, where node $n_l$ corresponds to element $y_l$ in $Y_{min}$ for $1 \le l \le k$. In addition, we assign region $y_l$ as the candidate region of $n_l$, i.e., $y_l = C(n_l)$, for all $l$.
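Because the listing of ALGSETY above is cut off after the loop header, the following sketch shows only one plausible greedy reading of it, not the authors' exact procedure: each adjacent-gate region is merged into the first existing group whose running overlap it intersects, and otherwise starts a new group. It also assigns every child to exactly one group, which merges the grouping step with the edge assignment described next. Regions are tuples (u_lo, u_hi, v_lo, v_hi) in rotated coordinates, an assumed representation, and the function names are hypothetical.

```python
def intersect(r, s):
    """Intersection of two rectangles given as (u_lo, u_hi, v_lo, v_hi), or None."""
    u_lo, u_hi = max(r[0], s[0]), min(r[1], s[1])
    v_lo, v_hi = max(r[2], s[2]), min(r[3], s[3])
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)


def greedy_groups(regions):
    """First-fit grouping of adjacent-gate regions.

    regions: list of rectangles, one per child q_j.
    Returns a list of (overlap_region, member_indices), largest group first.
    """
    groups = []  # each entry is [running_overlap, member_indices]
    for j, region in enumerate(regions):
        for group in groups:
            common = intersect(group[0], region)
            if common is not None:
                group[0] = common
                group[1].append(j)
                break
        else:
            # No existing overlap region touches this one: open a new group.
            groups.append([region, [j]])
    groups.sort(key=lambda g: len(g[1]), reverse=True)
    return [(g[0], g[1]) for g in groups]


A = (0, 2, 0, 2)
B = (1, 3, 1, 3)
C = (10, 12, 10, 12)
print(greedy_groups([C, A, B]))
```

Each returned group corresponds to one new internal node, with the overlap region serving as its candidate region.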
Starting from the $y_l$ whose $s(y_l)$ is the largest in $Y_{min}$, add an edge from $n_l$ to each $q_j$ that has no parent node and whose $A(q_j)$ overlaps $y_l$. Repeat this step until all $y_l$ have been processed. Finally, add an edge from the parent node of $r_i$ to every newly generated internal node $n_l$; $r_i$ can then be removed from the topology tree. The above operations are described in the procedure ALGMODITREE below.
Procedure ALGMODITREE(ri, Ymin, ΥN);
begin
  remove all the edges from ri to its children q1, ..., qm in ΥN;
  instantiate k new internal nodes n1, ..., nk, where k = |Ymin|;
  assign region yl as the candidate region of nl, i.e., yl = C(nl), for all l;
  for l = 1 to k
    for j = 1 to m
      if (yl ∩ A(qj) ≠ ∅ and qj has no parent node)
        add an edge from nl to qj;
      end if;
    end for;
    add an edge from the parent node of ri to nl;
  end for;
  remove ri;
  OUTPUT(ΥN);
end.
After visiting all the internal nodes $r_i$ in $\Upsilon_N^{opt}$ and modifying the topology as described above, we obtain a new topology tree $\Upsilon_N$ such that the clock period clk is preserved. The whole algorithm for topology finding of a net $N$ is described in the procedure ALGTOPOTREE.
Procedure ALGTOPOTREE(N);
begin
  construct the best possible topology ΥN opt for net N;
  ΥN ← ΥN opt;
  for i = Q to 1
    Ymin ← ALGSETY(ri, ΥN);
    ΥN ← ALGMODITREE(ri, Ymin, ΥN);
  end for;
  OUTPUT(ΥN);
end.
Optimality Proof
To prove the correctness of the above algorithm, we use the following three lemmas. The proofs of the first two lemmas are omitted due to space limitations.
Lemma 1. Given a set of $n$ 45°-rotated rectangles $R_1, \ldots, R_n$ on a rectilinear plane, if $R_1 \cap \ldots \cap R_n \neq \emptyset$, then $R^{+x}(R_1) \cap \ldots \cap R^{+x}(R_n) \neq \emptyset$, where $x$ is a non-negative real number.
Lemma 2. Given a set of $n$ 45°-rotated rectangles $R_1, \ldots, R_{n-1}$ and $S$ on a rectilinear plane, if $S \cap R_i \neq \emptyset$ for $1 \le i \le n-1$ and $R_1 \cap \ldots \cap R_{n-1} \neq \emptyset$, then $S \cap (R_1 \cap \ldots \cap R_{n-1}) \neq \emptyset$.
Lemma 3. Given two 45°-rotated rectangles, $A$ and $B$, on a rectilinear plane, we denote the $n$ times clk-extended regions of $A$ and $B$ as $A_n$ and $B_n$ respectively, i.e., $A_n = R^{+(n \times \mathit{clk})}(A)$ and $B_n = R^{+(n \times \mathit{clk})}(B)$. Suppose $A \cap B = R_{AB} \neq \emptyset$; we denote the $n$ times clk-extended region of $R_{AB}$ by $(R_{AB})_n$, i.e., $(R_{AB})_n = R^{+(n \times \mathit{clk})}(R_{AB})$. It is claimed that if there exists a point $x \in A_n \cap B_n$, then $x \in R^{+\mathit{clk}}((R_{AB})_{n-1})$, for all $n \ge 1$.
Proof. We prove by induction on $n$. Base case: Consider the case when $n = 1$. Suppose $x \in A_1 \cap B_1$; the clk-extended region from the position of $x$ is given by $R^{+\mathit{clk}}(x)$. Obviously, $R^{+\mathit{clk}}(x) \cap A_0 \neq \emptyset$ and $R^{+\mathit{clk}}(x) \cap B_0 \neq \emptyset$ because $x \in A_1 \cap B_1$. Since $A_0 \cap B_0 \neq \emptyset$ (because $A_0 = A$, $B_0 = B$ and $A \cap B \neq \emptyset$), $R^{+\mathit{clk}}(x) \cap (A_0 \cap B_0) \neq \emptyset$ by Lemma 2. Therefore, $x \in R^{+\mathit{clk}}((R_{AB})_0)$ and the claim is true for $n = 1$.
Inductive step: Assume that the claim is true for $n = j - 1$, where $j$ is an integer with $j \ge 2$, i.e., if there exists a point $x \in A_{j-1} \cap B_{j-1}$, then $x \in R^{+\mathit{clk}}((R_{AB})_{j-2})$. Consider the case when $n = j$. Take a point $x \in A_j \cap B_j$ and denote the clk-extended region from its position by $R^{+\mathit{clk}}(x)$. Obviously,
Theorem 1. The proposed algorithm finds a topology that maximizes the sharing of registers for an i-pin net, where 2 ≤ i ≤ 4, and the given clock period clk is preserved.
Proof. We prove the three possible cases one-by-one.
Case 1: i = 2. This case is trivial because a 2-pin net has only one source $s$, one sink $t_1$ and one edge $e_{st_1}$, so there are no other edges to share registers with. The algorithm starts from the furthest internal node $r_Q$ and takes the adjacent-gate region of $t_1$, $A(t_1) = R^{+(a(t_1) - d_{t_1})}(C(t_1))$, as the candidate region of $r_Q$, i.e., $C(r_Q) = A(t_1)$. Next, the algorithm processes node $r_{Q-1}$ and takes the adjacent-gate region of $r_Q$, $A(r_Q) = R^{+\mathit{clk}}(C(r_Q))$, as the candidate region of $r_{Q-1}$, i.e., $C(r_{Q-1}) = A(r_Q)$.
By substitution, $C(r_{Q-1})$ can be represented as an extended region from the position of the sink $t_1$, as $C(r_{Q-1}) = R^{+((a(t_1) - d_{t_1}) + \mathit{clk})}(C(t_1))$. The algorithm repeats the above steps until it reaches the first internal node $r_1$, where $C(r_1) = R^{+((a(t_1) - d_{t_1}) + (Q-1) \times \mathit{clk})}(C(t_1))$. Since the retiming solution is valid, the distance between $s$ and $t_1$ will not exceed $(\mathit{clk} - a(s)) + ((Q - 1) \times \mathit{clk}) + (a(t_1) - d_{t_1})$. Therefore, the algorithm will find the candidate regions for every register and return the best possible topology when it terminates.
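The arithmetic of the 2-pin case can be illustrated with a short sketch. It computes the radius of each candidate region C(r_i) around the sink and checks the distance bound stated above. All numeric values are made up for illustration, the function names are hypothetical, and delays are assumed to be expressed in the same units as Manhattan distance.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def candidate_radii(Q, clk, a_t, d_t):
    """Radius of C(r_i) around the sink t1: (a(t1) - d_t1) + (Q - i) * clk."""
    return {i: (a_t - d_t) + (Q - i) * clk for i in range(1, Q + 1)}


def source_within_bound(s, t, Q, clk, a_s, a_t, d_t):
    """Check dist(s, t1) <= (clk - a(s)) + (Q - 1) * clk + (a(t1) - d_t1)."""
    bound = (clk - a_s) + (Q - 1) * clk + (a_t - d_t)
    return manhattan(s, t) <= bound


# Illustrative values only: clk = 3, two registers on the edge.
radii = candidate_radii(Q=2, clk=3, a_t=2, d_t=1)
print(radii)
print(source_within_bound((0, 0), (4, 2), Q=2, clk=3, a_s=1, a_t=2, d_t=1))
```

The radii grow by one clock period per register when moving from the sink toward the source, which mirrors the substitution carried out in the proof.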
Case 2: i = 3. Given a 3-pin net, let $s$ be the source, and $t_1$ and $t_2$ be the two sinks. Let $w_r(s, t_1)$ and $w_r(s, t_2)$ be $p$ and $q$ respectively, where $1 \le p \le q$. Suppose that there exists a topology tree of maximum register sharing for the 3-pin net such that the first $k$ registers, where $1 \le k \le p$, are shared (notice that if the $k$-th register can be shared, the $h$-th register can be shared for $1 \le h \le k$), and that the algorithm cannot find such a topology.
Since the algorithm cannot find that optimal topology, it must fail to find an overlapping region for the $k$-th register to be shared. At the point of failure, the algorithm finds that the regions $R^{+((a(t_1) - d_{t_1}) + \mathit{clk} \times (p-k-1))}(t_1)$ and $R^{+((a(t_2) - d_{t_2}) + \mathit{clk} \times (q-k-1))}(t_2)$ do not overlap. However, these two regions encompass all the possible positions for the placement of the $k$-th register from $t_1$ and $t_2$ respectively such that the clock period clk is not violated. Therefore, if the $k$-th register can be shared as assumed, it must lie within both regions and the algorithm must be able to find it, which is a contradiction.
Case 3: i = 4. Given a 4-pin net, let $s$ be the source, and $t_1$, $t_2$ and $t_3$ be the three sinks. Let $w_r(s, t_1)$, $w_r(s, t_2)$ and $w_r(s, t_3)$ be $p$, $q$ and $r$ respectively, where $1 \le p \le q \le r$. Suppose the algorithm is attempting to share the $k$-th register where $1 \le k \le p$, i.e., it is trying to find a minimal subset of the overlapping regions that covers all the extended regions $R^{+((a(t_1) - d_{t_1}) + \mathit{clk} \times (p-k-1))}(t_1)$, $R^{+((a(t_2) - d_{t_2}) + \mathit{clk} \times (q-k-1))}(t_2)$ and $R^{+((a(t_3) - d_{t_3}) + \mathit{clk} \times (r-k-1))}(t_3)$, denoted by $A$, $B$ and $C$ respectively. Notice that we only consider $k \le p$ and assume that the three paths from $s$ to $t_1$, $t_2$ and $t_3$ are not merged yet (i.e., no sharing of registers from $k + 1$ to $r$), because otherwise the situation falls into case 1 or case 2 discussed above. There are 4 distinct cases. First, if $A$, $B$ and $C$ are disjoint, the $k$-th register cannot be shared; the algorithm introduces three new internal nodes to represent the registers and continues with the next internal node $r_{k-1}$. Second, if $A$, $B$ and $C$ overlap with each other, the $k$-th register can be shared among $t_1$, $t_2$ and $t_3$; the algorithm introduces a single internal node to represent the register and continues. The correctness of the algorithm in these two cases is trivial and will not be elaborated here.
The third case is, without loss of generality, that $A \cap B \neq \emptyset$ and $B \cap C \neq \emptyset$ but $A \cap C = \emptyset$. Denote the region $A \cap B$ as $R_{AB}$ and the region $B \cap C$ as $R_{BC}$. There are three possible options that the algorithm can choose from when evaluating the $k$-th register: (1) it does not share the $k$-th register and introduces three different registers for the sinks; (2) it shares the $k$-th register between $t_1$ and $t_2$ but uses a separate one for $t_3$; (3) it shares the $k$-th register between $t_2$ and $t_3$ but uses a separate one for $t_1$. Our algorithm will choose arbitrarily either (2) or (3), as the numbers of adjacent-gate regions covered by $R_{AB}$ and $R_{BC}$ are the same, but it will never choose (1). We assume that the algorithm chooses (2) in the following analysis.
First, we compare the choices of (1) and (2). Notice that (1) can be better than (2) only when the three separate paths can be merged together at a subsequent step of processing register $h$, where $1 \le h \le k$, while the combined path of $t_1$ and $t_2$ and the path of $t_3$ cannot be merged at the $h$-th register. We are going to show that this will not happen.
If we choose (1), suppose that there exists a point $x$ on the plane such that $x \in A_j \cap B_j \cap C_j$, where $A_j$, $B_j$ and $C_j$ represent the $j$ times clk-extended regions of $A$, $B$ and $C$ respectively, during a subsequent step of processing register $h$ where $1 \le h \le k$. By Lemma 3, $x \in R^{+\mathit{clk}}((R_{AB})_{j-1})$, where $(R_{AB})_{j-1}$ is the $(j - 1)$ times clk-extended region from $R_{AB}$. This means that if it is possible to share the $h$-th register among the three edges without sharing the $k$-th register in the first place, then by choosing (2), i.e., sharing the $k$-th register between $t_1$ and $t_2$, the algorithm will also be able to share the $h$-th register among the edges. Therefore, (2) is better than (1) because it shares more registers.
Next, we compare the choices of (3) and (2) similarly. Suppose we choose (3) and there exists a point $x$ on the plane such that $x \in A_j \cap (R_{BC})_j$, where $A_j$ and $(R_{BC})_j$ represent the $j$ times clk-extended regions of $A$ and $R_{BC}$ respectively, during a subsequent step of processing register $h$ where $1 \le h \le k$. Obviously, there exists a point $y$ covered by $A_j \cap B_j \cap C_j$, i.e., $y \in A_j \cap B_j \cap C_j$. By Lemma 3, $y \in R^{+\mathit{clk}}((R_{AB})_{j-1})$, i.e., $y \in R^{+\mathit{clk}}((R_{AB})_{j-1}) \cap C_j$, so the $h$-th register will also be shared among the three edges by choosing (2). Therefore, (2) is no worse than (3). As a result, the algorithm will find the optimal solution by choosing arbitrarily either (2) or (3) (using the greedy algorithm).
Finally, if only one pair of the regions overlaps while the third region is disjoint from both, i.e., $A \cap B \neq \emptyset$ but $A \cap C = \emptyset$ and $B \cap C = \emptyset$, the analysis is similar to the third case above.
Register Placement
In this section, we discuss how registers are actually placed using the topology tree yielded from the algorithm discussed in the previous subsection. Using an idea similar to the technique in [13, 7] , the positions of the registers are determined.
Since some of the chip area is occupied by standard cells, we need to know where on the chip a register can be placed. To tackle this problem, we divide the chip into a mesh of $m \times n$ grids. For each grid $g_{ij}$, we keep track of its center coordinates, $(x_{g_{ij}}, y_{g_{ij}})$, and the size of the free space in the grid, $f(g_{ij})$. We finalize the position of a register in the following manner.
Given a topology tree $\Upsilon_N$, choose arbitrarily an internal node $r$ to be the root of $\Upsilon_N$, and direct the edges of $\Upsilon_N$ away from $r$. Starting from the root $r$, we choose the grid whose center is contained in $C(r)$, i.e., the candidate region for placing the register $r$, and which has the largest free space available. We denote this grid as $g(r)$. If $f(g(r)) \ge z$, where $z$ denotes the size of a register, we take the center of $g(r)$ as the position of the register $r$. Otherwise, we allow a controlled degree of inaccuracy by extending $C(r)$ one grid width further, i.e., to $R^{+gw}(C(r))$, where $gw$ represents the width of a grid, and repeat the same process using $R^{+gw}(C(r))$ instead of $C(r)$ in the search for a feasible grid for placing register $r$. If no such grid is found, we report that this register cannot be placed. This could happen because the register count may increase greatly after retiming.
Let q1, . . . , qm be the set of internal nodes which are the children of r in a topology tree ΥN . After fixing the position of r, register qj, for 1 ≤ j ≤ m, is placed arbitrarily in its candidate region C(qj) provided that it is at a distance of clk or less from r. After visiting all the internal nodes of ΥN , the position of each register is located.
Suppose we have a 3-pin net $N(s, D, L)$ whose topology tree $\Upsilon_N$ is shown in fig. 8. The topology tree $\Upsilon_N$ shows that the two driven gates $d_1$ and $d_2$ share two registers, represented by the internal nodes $r_1$ and $r_2$. In this example, we assume that clk = 3 units. Consider a $5 \times 5$ mesh as shown in fig. 9, where the positions of the driving gate $s$ and the two driven gates, $d_1$ and $d_2$, are assumed to be the centers of the grids containing them, i.e., gate $s$ is located at $(4, 0)$, gate $d_1$ is located at $(0, 4)$ and gate $d_2$ is located at $(2, 4)$. Suppose $\Upsilon_N$ is rooted at node $r_1$ and the algorithm has fixed its position at $(1, 0)$; let us examine how the position of $r_2$ is determined.
The candidate region $C(r_2)$ of $r_2$ covers the centers of grids $g_{03}$, $g_{04}$, $g_{12}$, $g_{13}$, $g_{14}$, $g_{23}$ and $g_{24}$. Starting from the position of register $r_1$, the algorithm expands a rectangle of distance clk from it, denoted by $R^{+\mathit{clk}}(r_1)$ as shown. Next, the algorithm finds that $C(r_2) \cap R^{+\mathit{clk}}(r_1)$ is not empty and covers the centers of grids $g_{12}$ and $g_{13}$, which are the candidate positions of register $r_2$. If the free space of $g_{12}$ is no smaller than that of $g_{13}$, i.e., $f(g_{12}) \ge f(g_{13}) \ge z$, the algorithm assigns the center of $g_{12}$ as the position of register $r_2$.
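The selection step in this example can be reproduced with a short sketch. It assumes that grid g_ij has its center at (i, j), which is consistent with the gate coordinates given above, and it uses made-up free-space values and a made-up register size z. The function names are hypothetical.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def pick_grid(candidate_centers, parent_pos, clk, free_space, z):
    """Choose a grid for a child register.

    Keeps the candidate centers within Manhattan distance clk of the parent,
    then returns the one with the largest free space of at least z (or None).
    """
    reachable = [g for g in candidate_centers if manhattan(g, parent_pos) <= clk]
    usable = [g for g in reachable if free_space.get(g, 0) >= z]
    chosen = max(usable, key=lambda g: free_space[g]) if usable else None
    return reachable, chosen


# Centers covered by C(r2) in the example: g03, g04, g12, g13, g14, g23, g24.
c_r2 = [(0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
r1_pos = (1, 0)

# Hypothetical free-space values and register size.
free = {(1, 2): 5, (1, 3): 4}
reachable, chosen = pick_grid(c_r2, r1_pos, clk=3, free_space=free, z=2)
print(reachable, chosen)
```

Under these assumptions only g12 and g13 lie within distance clk of r1, matching the text, and g12 is selected because it has more free space.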
EXPERIMENTAL RESULTS
We performed retiming and our proposed register placement algorithm on the ISCAS89 benchmark suite. The program was implemented in C and run on a 1.5GHz Intel Pentium IV processor with 256KB cache and 512MB RAM.
Figure 9: An illustration of how the final position for the register r2 is determined.
In our experiments, we implemented the circuits using a 0.35µm CMOS standard cell library from Austria Micro Systems, and Silicon Ensemble was used to lay out the design with a setting of 50% row utilization. Gate delays were taken from the data book while wire lengths were estimated using the Manhattan distance between the connected cells. We scaled the wire delay according to [3], in which a 1mm wire was assumed to have a delay of approximately 150ps. The size of a grid was set to twice that of a D-type flip flop. During the placement of a register, we allowed an error of one grid width, i.e., the width of a D-type flip flop. The results are shown in Table 1. The first column gives the name of each circuit. The second column shows the number of logical registers in the retiming graph model after retiming. The number of registers increased after retiming for most of the circuits because the retiming method that we used did not minimize the number of registers as one of its objectives. Although this increase in register count makes register placement harder for our algorithm, it is not the main concern addressed in this paper.
The third column shows the minimum possible number of registers required after sharing, i.e., assuming that every net could be realized using the best topology. The fourth column shows the number of registers actually inserted by our proposed algorithm. The numbers in the fourth column are the same as those in the third column except for circuits s3271 and s4863. This observation shows that almost all the nets in our test cases could have their registers placed using the best topology, indicating that our proposed algorithm can very often find a near-optimal solution for register insertion.
The fifth column shows the statistics of the number of nets containing 4 or fewer edges with registers whereas the sixth column shows the number of nets having 5 or more edges with registers. The seventh column shows the number of registers that are placed within their candidate regions while the eighth column shows the number of registers that are placed outside their candidate regions but with a controlled error range (one grid size). As we can see, all the registers are placed in their candidate regions successfully in all the test cases. Finally, the CPU runtime is shown in the last column.
CONCLUSION
In this paper, we have proposed an algorithm that solves the problem of register insertion on global wires given a placed sequential circuit and a retiming solution. The proposed algorithm preserves a given clock period, within a controlled error, using as few registers as possible, in contrast to many previous works on post-retiming register placement that suffer from clock period violation. In addition, the algorithm is proved to give the optimal topology for nets with 4 or fewer pins. Since such nets make up about 90% of the nets in a sequential circuit on average, the algorithm performs well in most situations.
Together with any powerful retiming methods which are designed to handle global netlist with block and wire delays, our proposed algorithm can be applied to locate where a register should be inserted to pipeline long global interconnects such that the target clock is preserved. This is particularly useful in today's designs in which multiple clock cycles are required to propagate a signal across a global wire.
Improvements can be made to handle the situation in which there is no room in the candidate region of a register. Instead of scanning the neighboring grids for free space, incremental shifting and reshuffling of cells can be performed to free up contiguous space for register insertion.
