For many years, CMOS process scaling has allowed a steady increase in the operating frequency and integration density of integrated circuits. Only recently, however, have we reached a point where it takes several clock cycles for global signals to traverse a complex digital system such as a modern microprocessor. Thus, interconnect latency must be taken into account in current and future design tools at the architectural as well as the synthesis level. To this purpose, this work proposes a new latency-aware technique for the performance-driven concurrent insertion of flip-flops and repeaters in VLSI circuits. Evidence showing an exponential increase in the number of pipelined interconnects with process scaling, for high-performance microprocessors as well as high-end ASICs, is also presented. This increase indicates that a radical change in current design methodologies is needed to cope with this emerging problem.
Introduction
Repeater insertion is a technique extensively used to reduce the delay of interconnects and improve their noise characteristics, particularly when signals are distributed over long distances on a chip. An elegant dynamic programming algorithm was proposed in [1] to determine optimal repeater assignments for the candidate locations of a given interconnect topology. Several other works based on the same technique, [3-7] to cite a few, have also been proposed, incorporating other optimization steps such as noise or area minimization, wire sizing, etc. All of these works, however, only consider the case where a signal is required to arrive at its destination within a single clock cycle. On the other hand, in complex digital systems with relatively large die area operating at very high frequency, as in the case of modern high-performance microprocessors such as the Itanium processor [9], many global signals traveling across the chip need several clock cycles to reach their destinations, thus requiring the adoption of pipelined interconnects, i.e. latent wiring structures in which normal repeaters are interleaved with sequential elements such as latches and flip-flops. Current scaling trends indicate that this phenomenon will be accentuated in future process generations. As can be seen in Figure 1, the frequency of high-performance microprocessors approximately doubles every process generation, in part due to shorter gate pipelines [10]. Moreover, as shown in Figure 2, the die size also tends to increase by about 25% per generation, taking advantage of increased complexity and level of integration. As a result, the number of clock cycles needed to cross the die is bound to increase.
In this situation, at least two new challenges are faced by micro-architects and circuit designers: a) the accurate prediction of the minimum latency that can be achieved between the blocks of a design, given the available routing resources of a CMOS process, and b) the performance-driven insertion of repeaters and flip-flops in a large number of pipelined nets where interconnect and functional latency constraints are specified by the micro-architects. To this purpose, in this paper we propose a new methodology for the performance-driven concurrent flip-flop and repeater insertion in latency-constrained and unconstrained VLSI interconnects. In the unconstrained case, the problem is solved optimally with the goal of minimizing the overall interconnect latency, i.e. the latency at the most latent receiver. In the constrained case, the insertion problem is solved by specifying the target latency for each driver-receiver pair of a net.

Clock and Routing Grids
The clock distribution network of a chip is modeled as a regular grid of n independent domains $A_i$ distributed over the die. The clock skew is then represented by an upper triangular matrix $\Sigma$, where an element $\sigma_{i,i}$ indicates the skew within domain $A_i$ and an element $\sigma_{i,j}$ indicates the skew between domains $A_i$ and $A_j$. The location of the nodes of every topology $\Theta$ is then constrained to the centers of the tiles of a finer regular routing grid superimposed on the clock grid. This scenario is depicted graphically in Figure 4. An important requirement on the routing grid is that its tiles should be small enough to allow an effective insertion of repeaters and flip-flops. In particular, this can be achieved by choosing a tile size two or more times shorter than a given process-dependent repeater critical length $l_c$, defined here as the typical distance between the repeaters of a delay-optimized two-pin interconnect routed over the most resistive metal layer used. Let $L_c = l_c / l_t$, with $l_t$ the tile size, be this distance measured in terms of routing tiles. It has been shown [8] that repeater delay is quite insensitive to local displacement; therefore the center of a relatively small grid tile provides a good approximation for the real location of a repeater positioned anywhere within the tile.
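The skew lookup and the tile-size requirement described above can be illustrated with a minimal Python sketch. The function names and the list-of-lists storage of the skew matrix are assumptions made here for illustration; they are not taken from the original work.

```python
from typing import List


def skew(sigma: List[List[float]], i: int, j: int) -> float:
    """Return the clock skew between domains i and j (0-based indices).

    sigma is assumed to be stored as an upper triangular matrix:
    sigma[i][i] is the skew within domain i, and sigma[i][j] with i < j
    is the skew between domains i and j. Entries below the diagonal
    are never read.
    """
    lo, hi = (i, j) if i <= j else (j, i)
    return sigma[lo][hi]


def tile_size_ok(critical_length: float, tile_size: float) -> bool:
    """Check that the tile is at least two times shorter than the
    repeater critical length, as required for the routing grid."""
    return tile_size <= critical_length / 2.0
```

Because the matrix is upper triangular, the lookup simply orders the two domain indices before reading the entry, so `skew(sigma, 2, 0)` and `skew(sigma, 0, 2)` return the same value.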
Routing Resource Allocation
Interconnects routed over a topology $\Theta$ can be designed by allocating to its nodes and branches routing resources such as wires of given metal layer, length, and width, and repeater gates of given size. Since this paper focuses on the repeater insertion problem, for simplicity we assume that all branches $b_{u,v}$ allocate wires of given length $l_{u,v}$ with the same metal layer and fixed width. A repeater assignment $A_\Theta$ over topology $\Theta$ is then defined as a set of labels $\alpha_{u,v}$ and $\alpha_u$, where $\alpha_{u,v} = g_k$ corresponds to the assignment of a repeater $g_k$, taken from a given gate library G, to branch $b_{u,v}$ right after node $n_u$. On the other hand, a value $\alpha_{u,v} = 0$ indicates that no repeater is inserted. For a node $n_u$ branching to two children $n_v$ and $n_z$ through branches $b_{u,v}$ and $b_{u,z}$, $\alpha_u = \alpha_{u,v} \cup \alpha_{u,z}$, whereas if $n_u$ has one child $n_v$ only, $\alpha_u = \alpha_{u,v}$. Finally, if $n_u$ is a leaf, $\alpha_u = 0$. A possible assignment for the topology of Figure 3 is shown in Figure 5.
Wire and Gate Modeling
A wire of length $l_{u,v}$ routed over branch $b_{u,v}$ is modeled with a resistance $R_{u,v}$ connected between nodes $n_u$ and $n_v$, and two capacitances of value $C_{u,v}$ connected between $n_u$ and ground and between $n_v$ and ground, respectively. If the wire has distributed resistance $R_l$ and distributed capacitance $C_l$, the lumped $R_{u,v}$ and $C_{u,v}$ can be calculated as $R_{u,v} = R_l \, l_{u,v}$ and $C_{u,v} = 0.5 \, C_l \, l_{u,v}$.
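As a small illustration of the lumped wire model above, the following Python sketch computes the two lumped parameters from the distributed ones. The function and parameter names are hypothetical, and units are left to the caller.

```python
from typing import Tuple


def lumped_wire(r_per_length: float, c_per_length: float,
                length: float) -> Tuple[float, float]:
    """Return (R, C) of the pi-model of a wire.

    R is the lumped resistance between the two end nodes.
    C is the capacitance placed at EACH end of the wire, i.e. half of
    the total wire capacitance.
    """
    r = r_per_length * length
    c = 0.5 * c_per_length * length
    return r, c
```

For example, a wire of length 3 with distributed resistance 2 and distributed capacitance 4 per unit length yields a lumped resistance of 6 and a capacitance of 6 at each end.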
The wire model of a simple topology with three nodes and two branches is shown in Figure 6. A repeater $g_k$ is modeled as a buffer gate with input capacitance $load(g_k)$ and input-to-output delay $delay(g_k, c_{out})$, expressed as a function of the capacitance $c_{out}$ present at its output.
Similarly, if $g_k$ is a clocked device, it is modeled as a D-type flip-flop with input capacitance $load(g_k)$, clock-to-output delay $delay(g_k, c_{out})$, and set-up time $T_{su}(g_k)$.
Interconnect Cover
Let us assume that a repeater assignment is given for a routing tree $\Theta$ along with timing constraints at its leaves in terms of input load capacitance and required propagation time. Under these circumstances, the timing at the root of each sub-tree $\Theta_i$ of $\Theta$ needed to satisfy those constraints can be expressed by a 4-tuple $\gamma_i = (c_i, r_i, \lambda_i, \alpha_i)$. Here, $c_i$ is the input capacitance seen at the root, $r_i$ is the required arrival time after the positive edge of a clock signal $\varphi$ with period $T_\varphi$, $\lambda_i$ is the interconnect latency, defined as the maximum number of clocked repeaters crossed when going from the root of $\Theta_i$ to its leaves, and $\alpha_i$ is the repeater assignment at $n_i$. Since every leaf $n_u$ of $\Theta$ is also the elementary tree $\Theta_u = (\{n_u\}, \emptyset)$, the interconnect constraints at the receivers are also specified using a 4-tuple $\gamma_u = (c_u, r_u, 0, 0)$. Because $\gamma$ specifies the timing constraints and the allocated resources of an interconnect mapped onto the topology of sub-tree $\Theta$, we call $\gamma$ an interconnect implementation or cover of $\Theta$. In general, a sub-tree $\Theta_i$ will have multiple feasible covers specifying different timing and resource assignments. For convenience, these covers are grouped by latency in the ordered set $\Gamma_i = \{\Gamma_i^m, \ldots, \Gamma_i^n\}$ with $m, n \ge 0$ and $m \le n$, where $\Gamma_i^k = \{(c_1, r_1, k, \alpha_1), \ldots, (c_a, r_a, k, \alpha_a)\}$ is a set of covers of $\Theta_i$ with the same latency k. In the case of a sub-tree $\Theta_u$ rooted at a branching point of degree two, $\gamma_{u,v}$ and $\gamma_{u,z}$ are used to denote the covers of the sub-tree components $\Theta_{u,v}$ and $\Theta_{u,z}$, respectively.
Cover Computation
If we assume that the covers at the leaves of $\Theta$ are given as constraints and that an assignment $A_\Theta$ is known, the cover of every sub-tree $\Theta_i$ can be recursively determined, from leaves to root, using a hierarchical delay model such as Elmore [2]. This is accomplished here by the three operations wire, repeat and join sketched in Figure 7. Specifically, if a node $n_u$ branches to node $n_v$ with a known $\gamma_v$ through branch $b_{u,v}$, cover $\gamma_u = \gamma_{u,v}$ is first calculated through operation wire, where $\gamma_v$ is back-propagated to $n_u$ by inserting a wire on branch $b_{u,v}$ and using the Elmore delay. It must be noted that, unlike [1], covers $\gamma$ are here constrained by a fixed given clock cycle $T_\varphi$. Therefore, only covers with non-negative required time are generated. After operation wire, if a gate must be inserted at $n_u$, operation repeat is called. For the reason just stated, a non-clocked repeater is inserted only if the required time at its input is zero or positive. In that case the new required time and the input capacitance of g are stored in the cover, while the latency remains unchanged. Similarly, a clocked repeater is inserted only if the slack at its output is non-negative. In particular, this slack is computed from the required time of the wire by subtracting the delay of the flip-flop and the term $\sigma_{m,n}$ that models the skew of the clock signal $\varphi$ as defined in Section 2.2. Here, $A_m$ is the clock domain where the flip-flop g is located and $A_n$ is chosen among the domains of the upstream flip-flops so as to consider the most pessimistic value of $\sigma_{m,n}$. If the slack is not negative, the required time at the input of the gate is set to the period of the clock minus the set-up time of gate g, and the latency of the new cover is increased by one. When two covers $\gamma_{u,v}$ and $\gamma_{u,z}$ are back-propagated to a branch node $n_u$ of degree two via operations wire and repeat, cover $\gamma_u$ is calculated by means of operation join. Here, the input capacitance is the sum of the loads seen at the two branches and the required time is the minimum of those of $\gamma_{u,v}$ and $\gamma_{u,z}$ to account for the worst case, whereas, according to its definition, the latency of the joined cover is the maximum of the latencies at the merging branches.
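The three operations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Cover` and `Gate` types, the linear gate delay model (intrinsic delay plus output resistance times load), and all parameter names are hypothetical, and the repeater assignment field of the cover is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cover:
    c: float        # input capacitance seen at the root of the sub-tree
    r: float        # required arrival time after the clock edge
    latency: int    # clocked repeaters crossed down to the farthest leaf


@dataclass(frozen=True)
class Gate:
    load: float          # input capacitance
    d0: float            # intrinsic delay (assumed linear delay model)
    r_out: float         # output resistance (assumed linear delay model)
    clocked: bool = False
    setup: float = 0.0   # set-up time, meaningful only for flip-flops

    def delay(self, c_out: float) -> float:
        return self.d0 + self.r_out * c_out


def wire(cov: Cover, r_w: float, c_w: float) -> Optional[Cover]:
    """Back-propagate a cover through a pi-model wire using Elmore delay.

    r_w is the lumped wire resistance, c_w the capacitance at each end.
    Returns None when the required time would become negative.
    """
    r = cov.r - r_w * (c_w + cov.c)
    if r < 0:
        return None
    return Cover(cov.c + 2.0 * c_w, r, cov.latency)


def repeat(cov: Cover, g: Gate, t_clk: float,
           skew: float = 0.0) -> Optional[Cover]:
    """Insert gate g in front of a cover, or return None if infeasible."""
    slack = cov.r - g.delay(cov.c) - (skew if g.clocked else 0.0)
    if slack < 0:
        return None
    if g.clocked:
        # A flip-flop starts a new clock cycle upstream.
        return Cover(g.load, t_clk - g.setup, cov.latency + 1)
    return Cover(g.load, slack, cov.latency)


def join(a: Cover, b: Cover) -> Cover:
    """Combine the covers of two branches meeting at a node."""
    return Cover(a.c + b.c, min(a.r, b.r), max(a.latency, b.latency))
```

Returning `None` for infeasible covers keeps the clock-cycle constraint local to each operation, so a caller can simply discard such results while building a cover set.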
Cover Optimality
When multiple covers are computed for a sub-tree $\Theta$, only those non-inferior covers that can lead to optimal solutions at the root of $\Theta$ need to be saved. A principle for cover optimality was introduced in [1] to prune cover sets of their inferior solutions. We extend here those concepts to the general case of non-zero latency in the following property. In the cases of properties 4.a and 4.b, any gate driving a sub-tree $\Theta$ with cover $\gamma'$ will always have an input required time worse than that of the same gate driving $\Theta$ with $\gamma$, while having the same input capacitance and the same latency. On the other hand, when $\gamma'$ and $\gamma$ have identical input capacitance and required time, as in property 4.c, $\gamma'$ is also inferior if the value of a user-specified cost function associated with the routing resources allocated in $\Theta$ by $\gamma'$, e.g. repeater area, is greater than that of $\gamma$. Finally, when $\gamma'$ has latency higher than that of $\gamma$, as in property 4.d, $\gamma'$ is inferior for the same reasons as in 4.a and 4.b and, when it has identical input capacitance and required time, because $\gamma'$ covers sub-tree $\Theta$ with the same timing as $\gamma$ but wastes an unnecessary extra clock cycle.
Property 4 Cover Inferiority
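A pruning step based on cover inferiority can be sketched as below. The dominance rule coded here is one reading of the discussion of cases 4.a through 4.d and should be taken as an assumption: a cover is treated as inferior to another when it is no better in latency, input capacitance and required time, and is either strictly worse in at least one of them or, with all three equal, has a higher cost. The `Cover` type and the scalar `cost` field are hypothetical.

```python
from typing import List, NamedTuple


class Cover(NamedTuple):
    c: float        # input capacitance
    r: float        # required time
    latency: int    # interconnect latency
    cost: float     # user-specified cost, e.g. repeater area (assumed scalar)


def is_inferior(x: Cover, y: Cover) -> bool:
    """True if cover x is inferior to cover y under the assumed rule."""
    no_better = x.latency >= y.latency and x.c >= y.c and x.r <= y.r
    if not no_better:
        return False
    strictly_worse = x.latency > y.latency or x.c > y.c or x.r < y.r
    return strictly_worse or x.cost > y.cost


def prune(covers: List[Cover]) -> List[Cover]:
    """Keep only the covers that are not inferior to any other cover."""
    return [x for x in covers
            if not any(is_inferior(x, y) for y in covers)]
```

This quadratic filter favors clarity over speed; an implementation working on sorted cover lists could prune in a single pass.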
Assignment for Minimum Latency
The repeater insertion problem for the design of interconnects with minimum latency, under the definitions of Section 2, is stated here with the following formulation. Given an interconnect topology $\Theta$ mapped onto a routing grid and a clock grid with skew matrix $\Sigma$, timing constraints at the receivers in terms of $\gamma_u = (c_u, r_u, 0, 0)$, and a library G of clocked and non-clocked repeaters, find a set of optimal covers $\Gamma_1$ with minimum latency at the driver of $\Theta_1 = \Theta$, according to property 4. This is accomplished here by calling the algorithm MiLa, whose pseudo-code is outlined in Figure 8, with argument $\Theta_1$. Notice that minimizing the latency in $\Gamma_1$ corresponds to minimizing the signal latency at the most latent receiver of the net. Also notice that the latency values $\lambda_u$ at the receivers are here set to zero only for convenience. The optimal covers at each node of the tree are computed recursively after multiple nested calls, starting from the leaves and ending at the root, traversing the tree in a depth-first fashion. At any call, if $\Theta_u$ is a leaf, the given constraint at the corresponding receiver, $\Gamma_u = \{(c_u, r_u, 0, 0)\}$, is returned.
If the root of $\Theta_u$ is connected to a single branch $b_{u,v}$, in line 2.1 the algorithm is called again to compute the optimal covers $\Gamma_v$ of the next sub-tree $\Theta_v$; then in line 2.2 such covers are propagated to node $n_u$ by inserting wires. Next, in loop 2.4 an additional cover is inserted in $\Gamma_u$ for each repeater in G. To do this, all the covers previously computed in line 2.2 are repeated using the same repeater gate g by calling operation repeat, thus generating a new set $\Gamma'$. Inferior covers are then deleted according to property 4, leaving $\Gamma'$ with only one optimal cover for every available sub-set $\Gamma'^k \subseteq \Gamma'$ with latency k, since its covers originated from the same repeater. Finally, on line 2.5, $\Gamma_u$ is updated by adding the repeated covers.
Section 3 of the MiLa algorithm computes the optimal covers of sub-tree $\Theta_u$ when its root is connected to two branches $b_{u,v}$ and $b_{u,z}$. A methodology for joining covers without taking latency into account was given in [1] and more formally in [3]. The same method, extended to the latency case, is implemented in the function merge outlined in Figure 9.
Here, only the elements of $\Gamma_{u,v}$ and $\Gamma_{u,z}$ with the same latency are joined, using a technique similar to the merging of two sorted lists. In this case we assume that the covers in sets $\Gamma_{u,v}$ and $\Gamma_{u,z}$ are sorted first by increasing latency and then, within each latency, by increasing required time and capacitance. It is apparent that a cover not featuring such a monotonically increasing behavior would be an inferior one according to property 4 and would then be deleted from its set prior to the merge. On line 3.1 MiLa calls itself twice to compute the covers of sub-trees $\Theta_{u,v}$ and $\Theta_{u,z}$. As illustrated in line 3.2, the corresponding sets $\Gamma_{u,v}$ and $\Gamma_{u,z}$ will in general be composed of an arbitrary number of sub-sets $\Gamma^k$ of different latency k, where k is equal to the maximum latency at the root of each branch. Similarly to the cover latency $\lambda$ defined in Section 2, the signal latency at any node of an interconnect can be defined as the number of flip-flops crossed to reach the node starting from the driver, where the signal latency is zero. While the algorithm is geared towards minimizing the signal latency at the most latent receiver, it must also determine the signal latency at all other receivers such that optimal covers are obtained and propagated back to the driver. To do so, we must join all combinations of the sub-sets $\Gamma_{u,v}^k \subset \Gamma_{u,v}$ and $\Gamma_{u,z}^h \subset \Gamma_{u,z}$, so that for each couple $(\Gamma_{u,v}^k, \Gamma_{u,z}^h)$ a new joined sub-set $\Gamma_u^q \subset \Gamma_u$ is generated with function merge, where $q = \max(k, h)$. [...] contains covers with maximum latency. Therefore, on line 3.4.1, only the covers of $\Gamma_{u,z}$ need to be shifted by latency k to consider all possible cases. After determining the optimal covers $\Gamma_u$ for every case of branching degree at the root of $\Theta_u$, in line 4 set $\Gamma_u$ is pruned of its inferior elements according to property 4. Finally, in line 5 the optimal set $\Gamma_u$ is returned. It is worth noting that the application of property 4 has the important consequence of limiting the returned set $\Gamma_u$ to covers with at most two values of latency, that is $\Gamma_u = \{\Gamma_u^m, \Gamma_u^n\}$. Such a property, also experimentally verified, holds assuming that library G contains at least one flip-flop with input capacitance equal to or lower than the input capacitance of all non-clocked repeaters. After the last call to MiLa returns, the optimal covers $\gamma_1$ in $\Gamma_1$ for the whole interconnect are computed, and the repeater assignments and the signal latencies at every receiver are found by traversing the tree from the root on.
The optimality of the algorithm is proved by induction on the sequence of recursive computations of cover sets $\Gamma_u$ generated by the depth-first traversal induced by the first call on $\Theta_1$. Therefore, assuming that the given covers $\Gamma_u$ at the receivers are optimal, we only have to prove the optimality of the covers produced by one recursive call to section 2 or 3 of the algorithm. However, both sections 2 and 3 produce a cover set containing among its elements all possible optimal solutions according to the problem formulation. The optimality of all the covers of $\Gamma_u$ is then ensured by property 4, which eliminates any inferior element.
[Pseudo-code fragment of the GiLa algorithm: if $\Theta_u$ is a leaf then $\Gamma_u = \{(c_u, r_u, \lambda_u, 0)\}$; ...; 2.1 $\Gamma_v$ = GiLa($\Theta_v$); ReFlop($\Theta_u$, extraflops); ...]
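The joining of two same-latency cover lists "similar to the merging of two sorted lists" can be illustrated with the sketch below. It follows the classic two-pointer scheme of the dynamic programming approach of [1] and is an assumption about how such a merge may look, not a transcription of the merge function of Figure 9. Covers are reduced to hypothetical (capacitance, required time) pairs, both lists are assumed to be non-inferior and sorted by increasing capacitance and required time, and the latency shared by the two lists carries over unchanged to the result.

```python
from typing import List, Tuple

TimingPair = Tuple[float, float]  # (input capacitance, required time)


def merge_same_latency(a: List[TimingPair],
                       b: List[TimingPair]) -> List[TimingPair]:
    """Join two sorted, non-inferior cover lists of equal latency.

    Each emitted pair sums the capacitances of the two branches and takes
    the minimum required time. The list whose current element limits the
    required time is advanced, since pairing that element with any later
    element of the other list would only add capacitance.
    """
    out: List[TimingPair] = []
    i = j = 0
    while i < len(a) and j < len(b):
        (ca, ra), (cb, rb) = a[i], b[j]
        out.append((ca + cb, min(ra, rb)))
        if ra <= rb:
            i += 1
        if rb <= ra:
            j += 1
    return out
```

The merge runs in time linear in the total number of covers and produces at most as many joined covers as the two input lists contain together, which is what keeps the cover sets small at branching nodes.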
Experimental Results
Instead of directly implementing the basic algorithms, we have chosen to verify our methodology by applying it to a more complex case where repeater insertion is performed simultaneously with interconnect topology synthesis. To this purpose we have incorporated our repeater insertion strategy and the P-TreeAT [3] routing tree construction technique in two new algorithms, referred to here as Flop-Tree-ML and Flop-Tree-GL, based on the MiLa and GiLa algorithms, respectively.
Aside from testing the correctness of the proposed methodology, this has also served to verify that our technique is indeed amenable to being employed in other interconnect design algorithms based on the same dynamic programming style of [1], extending them to the broader case of clocked repeater insertion. In the following, both Flop-Tree-ML and Flop-Tree-GL are experimentally verified.
Set-up
In our experiments, the proposed algorithms are applied to perform [...]
Scaling the devices and interconnects by a factor S = 0.7 from the original 0.18 µm process, first to a 0.13 µm process and then to a 0.09 µm one. Parasitics are derived from the ITRS [12] roadmap. Notice that using a 0.18 µm design as a base for this study leads to optimistic projections, since the number of global signals is likely to increase with newer processes, leading to larger repeater counts.
Scaling the die size and clock frequency following: N) nominal scaling, representative of microprocessor shrinks, where die size scales down by $S^2$ and frequency scales by $S^{-1}$, and T) trend scaling, where die size scales by 1.25 and frequency scales by 2, as indicated by the current microprocessor trends of Figure 1 and Figure 2 [10]. In addition, we also follow a third scaling rule, C) constant scaling, representative of high-performance ASICs, in which the die size remains the same.
More interestingly, the total number of flip-flops increases by about 70% each process generation, but in this case also because of frequency scaling. Moreover, the total area of all repeaters decreases following the die size shrink. As can be seen, the run time of Flop-Tree-ML tends to decrease as the number of flip-flops goes up. This is due to the fact that in general more covers (the average number of covers per net is reported in column 10) are pruned as a consequence of the insertion of a flip-flop than because of the insertion of a non-clocked repeater. At the other extreme, using scaling pattern TT, which corresponds to a scaling trend that high-performance microprocessors have been able to sustain so far, the die size increases by 25% and the frequency doubles each generation. For this reason, the increase in repeaters is here more dramatic: every process generation the total number of repeaters goes up by a factor of about 2.5X, while the number of flip-flops increases by about 7 times. Moreover, the percentage of pipelined interconnects (column $\lambda > 0$) is 9% and 42% for the 0.13 µm and 0.09 µm processes, respectively. For the scaling pattern CN, representative of high-end ASICs, the increase in the number of flip-flops is 2.16X and 2.9X for the scaled 0.13 µm and 0.09 µm processes, respectively.
For the convenience of the reader, the total number of inserted clocked and non-clocked repeaters (Rptrs) and the total number of flip-flops (Flops) are plotted in logarithmic scale in Figure 12 and Figure 13, respectively, for the six scaling patterns of Table 1. In Figure 12, as expected, for all scaling patterns the total number of repeaters increases exponentially, independently of frequency, with an increase rate growing from a minimum corresponding to the nominal die size scaling rule to a maximum corresponding to the trend die size scaling rule. Similarly, Figure 13 shows that the number of inserted flip-flops also grows exponentially for all scaling patterns. This time, however, the increase rate depends on both die size and frequency.
Figure 12. Increase in total number of repeaters and flip-flops for six scaling patterns across three process generations.
Figure 13. Increase in number of flip-flops for six scaling patterns across three process generations.
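To make the notion of exponential growth concrete, the short Python sketch below compounds a constant per-generation factor, which is what appears as a straight line on a logarithmic plot. The helper names are hypothetical, and the factors used in the usage note are the approximate per-generation values quoted in the text for pattern TT, not additional measurements.

```python
def growth(factor_per_generation: float, generations: int) -> float:
    """Cumulative growth after a number of process generations,
    assuming the same multiplicative factor at every generation."""
    return factor_per_generation ** generations


def scaled_feature(base_um: float, s: float, generations: int) -> float:
    """Feature size after repeatedly scaling by the factor s."""
    return base_um * s ** generations
```

With a factor of about 2.5 per generation, `growth(2.5, 2)` gives a 6.25-fold increase in repeaters after two generations, and with a factor of about 7, `growth(7.0, 2)` gives a 49-fold increase in flip-flops. Similarly, `scaled_feature(0.18, 0.7, 1)` and `scaled_feature(0.18, 0.7, 2)` round to 0.13 and 0.09, matching the process nodes used in the experiments.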
Latency Constrained Pipelined Interconnect Synthesis
In this experiment we test the functionality of Flop-Tree-GL by applying it to the latency-constrained repeater insertion of the same test case used in the previous experiment. The set-up of the experiment is the same as the previous one, with the exception that now latency constraints are given at each net driver-receiver pair. In particular, the constrained repeater insertion is performed on the scaled test case corresponding to the scaling pattern TT in Table 1. The latency constraints are generated by randomly adding 1 or 2 extra clock cycles to the receiver latency values computed by MiLa in the previous experiment in all nets with $\lambda > 0$. In practice, in the case of a microprocessor, such an increase could be attributed to conservative latency values set at the architectural level.
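The constraint generation just described can be sketched as follows. The data layout (a dictionary from net name to the list of minimum receiver latencies), the function name, the seeded generator, and the choice of drawing the extra cycles independently for each receiver are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def make_latency_constraints(min_latency: Dict[str, List[int]],
                             seed: int = 0) -> Dict[str, List[int]]:
    """Derive latency constraints from minimum receiver latencies.

    Nets whose latency (the largest receiver latency) is zero are left
    unchanged; in all other nets every receiver gets 1 or 2 extra clock
    cycles, chosen at random. Each net is assumed to have at least one
    receiver.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    constraints: Dict[str, List[int]] = {}
    for net, latencies in min_latency.items():
        if max(latencies) > 0:
            constraints[net] = [lat + rng.choice((1, 2)) for lat in latencies]
        else:
            constraints[net] = list(latencies)
    return constraints
```

Fixing the seed makes the generated constraints repeatable, which is convenient when comparing several runs on the same test case.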
The results of the experiment are reported in Table 2, where normalized values refer to the REF experiment of Table 1. As can be seen, in all three runs the number of inserted flip-flops increases due to the higher latency constraints with respect to the values of Table 1. Nevertheless, the total number of flip-flops and repeaters does not substantially vary. Intuitively, this can be explained by the fact that the inserted extra flip-flops take the place of normal non-clocked repeaters. As expected, for all runs, the run time increases due to the extra calls to routine ReFlop.
