Abstract
In this paper we focus on routing techniques for optimizing clock signals in small-cell (e.g., standard-cell, sea-of-gates, etc.) ASICs. In previously reported work, the routing of the clock net has been performed using ordinary global routing techniques based on a minimum or minimal Steiner tree that have little understanding of clock routing problems. We present a novel approach to clock routing that all but eliminates clock skew and yields excellent phase delay results for a wide range of chip sizes, net sizes (pin count), minimum feature sizes, and pin distributions on both randomly created and standard industrial benchmarks.
For certain classes of pin distributions we have proven theoretically and observed experimentally a decrease in skew with an increase in net size. In practice, we have observed a two to three order of magnitude reduction in skew when compared to a minimum rectilinear spanning tree.
Introduction
In today's highly competitive IC marketplace, company survival necessitates product differentiability. Product differentiability may be engendered in many ways, several of which include: increased performance (e.g. lower power, faster timing, etc.), lower cost, and more features.
Previous work in clock optimization has been contributed by several authors. H-trees have been recognized for years as a technique to help reduce the skew in synchronous systems [FK82] [KGE82] [DFW84] [BWMSS] [WF83]. For regular structures such as systolic arrays the H-tree works well to reduce skew, but in the general case asymmetric distributions of clock pins are common and the H-tree is not as effective for clock routing. The large size of the clock net has led some researchers [DFW84] [Mij87] to perform buffer optimization within the clock distribution tree. More recently, [BWMSS] [...] However, in all previous work the routing of the clock net is performed using ordinary routing techniques. This causes non-optimal clock behavior, and as region size or the number of pins in the net increases, the undesirable behavior is exacerbated.
In this paper, we focus exclusively on routing techniques for optimizing the clock signal in VLSI circuits. We demonstrate the superiority of our algorithm over standard routing techniques for examples of widely ranging size. In section two the preliminaries necessary for understanding the paper are presented. Following this, the problem is defined in section three. Section four illustrates the algorithm for clock routing and section five discusses theoretical results. Next, in section six, practical considerations are discussed. In section seven the experimental results are presented, and in section eight possible avenues for future work and conclusions regarding the approach are discussed.
Preliminaries
The majority of digital chips are synchronous in nature. Synchronous designs are often modeled as a Moore finite state machine for the purposes of analysis. The topological requirement imposed on such a finite state machine is that all closed signal paths must contain at least one synchronizing element. Satisfaction of this constraint has several benefits, two of which are: the assurance of deterministic behavior if the physical aspects of the design are correct, and the elimination of the requirement that the combinational logic be free of transients as long as next state sampling is performed after the longest path has settled to its final value [MC80]. For simplicity, and without loss of generality, let the synchronizing elements be edge-triggered. Furthermore, let $CP$ denote the clock period, $d_L$ the largest path delay through the combinational logic, $t_{SKEW}$ the clock skew, $t_{SU}$ the set-up time of the edge-triggered synchronizing elements, and $t_{CQ}$ the delay from the synchronizing element's clock pins to the Q output pins. In order to guarantee that no long-path timing violations occur in the design, the following equation must be satisfied
$$CP \ge d_L + t_{SKEW} + t_{SU} + t_{CQ} \tag{1}$$
This expression demonstrates the important relationship between the clock period, the longest path delay, and the clock skew.
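As a small numerical illustration of the long-path constraint in Equation 1, the sketch below evaluates the minimum clock period for a given skew and the skew that a given clock period can absorb. The function names and all numeric values are assumptions chosen for illustration only; they do not come from the paper.

```python
def min_clock_period(d_l, t_skew, t_su, t_cq):
    """Smallest CP that satisfies CP >= d_L + t_SKEW + t_SU + t_CQ."""
    return d_l + t_skew + t_su + t_cq


def skew_budget(cp, d_l, t_su, t_cq):
    """Largest skew that Equation 1 tolerates for a given clock period."""
    return cp - d_l - t_su - t_cq


# Hypothetical numbers, all in picoseconds (not taken from the paper).
CP = 10_000
D_L, T_SU, T_CQ = 8_000, 500, 700

print(min_clock_period(D_L, 500, T_SU, T_CQ))  # period needed with 500 ps of skew
print(skew_budget(CP, D_L, T_SU, T_CQ))        # skew that still fits within CP
```

Working in integer picoseconds keeps the arithmetic exact; any consistent time unit would do.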
The two timing related clock parameters that one must consider for high-performance design are clock skew and phase delay. Clock skew is defined to be the maximum difference in arrival times at any two similarly clocked clock pins. Clock skew is caused by several phenomena: asymmetric routes to the clocked elements, differing interconnect line parameters, different delays through the clock distribution elements, and different device threshold voltages for the clock distribution logic. Equation 1 illustrates the important relationship between skew and the longest combinational logic delay. As skew increases with the clock period held fixed, the efficiency of the digital system is reduced because valuable computation time is "stolen" from the total cycle time. Frequently in high-performance design environments, skew is constrained to be less than five percent of the clock period. Thus, in a 100 MHz design skew would be constrained to be less than 500 ps. In this paper, routing techniques to help achieve this goal will be presented.
27th ACM/IEEE Design Automation Conference®, Paper 34.3. © 1990 IEEE 0738-100X/90/0006/0573 $1.00
Figure 1: Relationship between $T_H$, $T_L$, and $CP$ with 50% duty cycle
Phase delay may be defined to be the maximum delay to any synchronizing clock pin. The same phenomena causing skew also contribute to phase delay. It is convenient to consider phase delay to consist of two components: an intrinsic cell delay contributed by the externally driven clock pad, and the time to charge ($t_{CH}$) or discharge the clock net. Expressions for phase delay may be defined as
Figure 1 illustrates equations 2 and 3. Phase delay affects chip-to-chip interfaces by appearing as inter-chip skew, and in worst-case scenarios it may leave inadequate time to charge and discharge the clock net. For example, for a clock signal with a duty cycle of 50%, the high and low portions of the clock period, $T_H$ and $T_L$, must satisfy the following constraints
[...] explicitly minimized, should it be a problem, circuit techniques that insert delay to equalize arrival times [...] passed to the global router as blockages. The determined routes are constructed so that the clock signal behavior is optimized.
To understand the consequences of decisions made during physical design, one must model the interconnect parasitics that load the clock tree. We make the reasonable assumption that in a high-performance design environment, the interconnects are realized with aluminum due to its excellent conducting properties. Interconnect resistance $R_{int}$ is determined using the following expression, where $\rho$ is the resistivity of aluminum (3 $\mu\Omega$-cm)
In these expressions $K_c$ is a constant that is inserted to account for fringing effects, and can be calculated using the two-dimensional analysis of [DS80]. It is assumed to be 2.0 in our calculations. $l_s$ represents the line spacing, $t_{ox}$ represents the thickness of the field oxide, and $\epsilon_{ox}$ is the permittivity of the oxide. All lengths used to calculate resistance and capacitance are based on Manhattan distances.
Based on estimates for $R_{int}$ and $C_{int}$, simple and accurate interconnect delay estimates may be calculated using the first-order moment of the impulse response, which has also been called Elmore's delay [RPH83] [Elm48]. The interconnects are treated as distributed RC trees and the rectilinear segments comprising the interconnect are modeled using their equivalent T-network representation.
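As a worked illustration of the first-order delay estimate, consider a single interconnect segment with total resistance $R_{int}$ and total capacitance $C_{int}$ that drives a lumped load. The load symbol $C_{load}$ and the delay symbol $t_d$ are assumptions introduced here for illustration. In the T-network model the resistance is split into two halves with the full segment capacitance at the midpoint, and the Elmore delay is the sum, over each resistor on the path, of that resistance times the capacitance downstream of it:

```latex
t_d \;=\; \frac{R_{int}}{2}\,\bigl(C_{int} + C_{load}\bigr) \;+\; \frac{R_{int}}{2}\,C_{load}
    \;=\; R_{int}\left(\frac{C_{int}}{2} + C_{load}\right)
```

Because $R_{int}$ and $C_{int}$ both scale with segment length, the wire's own contribution to this estimate grows quadratically with length.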
Problem Definition
Given the IC's placement, the locations of blockages on the routing layers, the positions of all clock pins on the clock net, and the location of the clock pad along the periphery of the chip, the problem may be defined as follows: construct a clock tree consisting only of Manhattan segments that optimizes the clock skew, wirelength and phase delay subject to the blockages on the routing layers. In general, one seeks to minimize clock skew and the routing wirelength, subject to constraints on phase delay. Formally, clock optimization could be formulated as follows
Here $(x_c(S), y_c(S))$ represents the center of mass of the set of points $S$. We shall use the notation $S_x(S)$ or simply $S_x$ to denote the ordered set of points obtained by ordering the set $S$ by increasing $x$ coordinate, i.e. $x_i \le x_j$ if $s_i, s_j \in S_x(S)$ and $i < j$. Similarly, $S_y(S)$ or simply $S_y$ represents the ordered set of points obtained by ordering the set $S$ by increasing $y$ coordinate. Define
$$S_L(S) = \{ s_i \in S_x \mid i \le \lceil n/2 \rceil \} \tag{12}$$
$$S_R(S) = \{ s_i \in S_x \mid \lceil n/2 \rceil < i \le n \} \tag{13}$$
$$S_B(S) = \{ s_i \in S_y \mid i \le \lceil n/2 \rceil \} \tag{14}$$
$$S_T(S) = \{ s_i \in S_y \mid \lceil n/2 \rceil < i \le n \} \tag{15}$$
The sets $S_L$ and $S_R$ represent the division of $S$ into two sets about the median $x$ coordinate of the set of points. These sets partition the original region in the $x$ dimension into two sub-regions with approximately equal numbers of elements in each sub-region. In fact, $\bigl|\,|S_L| - |S_R|\,\bigr| \le 1$. Similarly, $S_B$ and $S_T$ represent the division of $S$ into two sets about the median $y$ coordinate of the set of points. The basic algorithm first splits $S$ into two sets (arbitrarily in the $x$ direction or $y$ direction). Assume that a split of $S$ into $S_L$ and $S_R$ is made. Then, the algorithm routes from the center of mass of $S$ to each of the centers of mass of $S_L$ and $S_R$ respectively. The regions $S_L$ and $S_R$ are then recursively split in the $y$ direction (the direction opposite to the previous one). Thus, splits alternating between $x$ and $y$ are introduced on the set of points recursively until there is only one point in each sub-region. The pseudocode for the algorithm is given in Figure 2.
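The recursive splitting described above can be sketched in a few lines of Python. This is a minimal illustration written from the prose description, not a transcription of the paper's Figure 2. The function names, the tie-breaking rule used when sorting, and the representation of the tree as a list of point-to-point connections are all assumptions, and blockages are ignored.

```python
from math import ceil


def center_of_mass(points):
    """Mean x and mean y of a non-empty list of (x, y) pins."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def mmm(points, axis=0):
    """Basic median split; returns a list of (from_point, to_point) connections.

    axis 0 splits about the median x coordinate, axis 1 about the median y.
    Ties are broken by the other coordinate (an assumption made here so the
    result is deterministic).
    """
    if len(points) <= 1:
        return []
    ordered = sorted(points, key=lambda p: (p[axis], p[1 - axis]))
    half = ceil(len(ordered) / 2)          # the first ceil(n/2) pins form one side
    root = center_of_mass(points)
    connections = []
    for sub in (ordered[:half], ordered[half:]):
        connections.append((root, center_of_mass(sub)))
        connections.extend(mmm(sub, 1 - axis))  # alternate the cut direction
    return connections


def manhattan_length(connections):
    """Total rectilinear length of all connections."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in connections)


pins = [(0, 0), (0, 2), (2, 0), (2, 2)]   # hypothetical pin positions
tree = mmm(pins)
print(len(tree), manhattan_length(tree))
```

A region that holds a single pin has that pin as its center of mass, so the last connection on each branch ends at the pin itself.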
Improvements
The simple algorithm described above yields good results, but there is room for further improvement. In the following discussion we define a cut in the x direction to mean a split resulting in a left region and a right region. A cut in the y direction implies a split resulting in a top and a bottom region. Consider the set of points S shown in Figure 3. If we make a cut in the x direction and then recursively split the left and right regions in the y direction, we get the result shown in Figure 3(a). Clearly, there is skew between points PLT and PR. However, if we reverse the cut directions, i.e., split in the y direction first followed by a split in the x direction, we get the result shown in Figure 3(b), which has no skew between the endpoints. This example illustrates the need for making a good choice of cut direction at each level of the recursion tree. We make the choice by a one-level look-ahead technique. Given a region to be split, the algorithm makes an x direction cut followed by a y direction cut on the resulting left and right regions. It also makes a y direction cut followed by an x direction cut. The skews for each of the configurations are compared and the cut direction that minimizes skew between its current endpoints is chosen. The method of estimating the skew between the endpoints is described in the following section.
Figure 4: Clock tree representation used for delay calculations
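The one-level look-ahead can be sketched as follows. This is a simplified illustration under a loudly stated assumption: it scores each cut order by the spread of Manhattan path lengths to the current endpoints, whereas the paper drives the look-ahead with delay estimates. The pin coordinates and all function names are hypothetical and are not the points of Figure 3.

```python
from math import ceil


def center(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def split(points, axis):
    """Median split on `axis` (0 = x, 1 = y); ties broken by the other coordinate."""
    ordered = sorted(points, key=lambda p: (p[axis], p[1 - axis]))
    half = ceil(len(ordered) / 2)
    return ordered[:half], ordered[half:]


def lookahead_skew(points, axis):
    """Spread of path lengths after cutting on `axis`, then on the other axis.

    Path length stands in for delay here (a simplification, see the lead-in).
    Requires at least two points.
    """
    root = center(points)
    lengths = []
    for region in split(points, axis):
        first = manhattan(root, center(region))
        if len(region) == 1:
            lengths.append(first)
            continue
        for sub in split(region, 1 - axis):
            lengths.append(first + manhattan(center(region), center(sub)))
    return max(lengths) - min(lengths)


def choose_cut(points):
    """Return 0 for an x cut or 1 for a y cut, whichever looks better one level ahead."""
    return 0 if lookahead_skew(points, 0) <= lookahead_skew(points, 1) else 1


pins = [(0, 0), (0, 4), (4, 1), (4, 3)]   # hypothetical example
print(lookahead_skew(pins, 0), lookahead_skew(pins, 1), choose_cut(pins))
```

For these four hypothetical pins the x-first order leaves a path-length imbalance while the y-first order balances all four endpoints, so the y cut is chosen. Swapping `manhattan` sums for a delay estimator would bring the sketch closer to the method the text describes.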
Delay Calculation
We use the Penfield-Rubinstein [RPH83] algorithm for calculating delays to the endpoints in the grown clock tree. The resistance and capacitance of the tree segments are modeled by a T-network model. Consider the tree shown in Figure 4. Because of a property of the center of mass (see next section), the lengths of tree segments from the center of mass of a region to each of its two sub-regions will always be equal and symmetric. The delay from $s$ to the endpoint $s_7$ is calculated as [...]
where $C_g$ is the gate capacitance at the clock pin (assumed equal for all clock pins), $C_l$ is the capacitance per unit length and $R_l$ is the resistance per unit length. Note that the delay to any endpoint depends on the lengths of other segments connecting different endpoints, so it is important to use delay estimates to drive the look-ahead rather than length calculations. The complexity of the algorithm with look-ahead is $O(n \log n)$ where $n$ is the number of clock pins. For a detailed analysis of the complexity, the interested reader is referred to [JSKSO]. We shall refer to the algorithm as the Method of Means and Medians (MMM) in the following text.
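The dependence of one endpoint's delay on other segments can be seen in a small first-order delay calculation over a tree. The sketch below assumes each wire segment is a T-network (half the resistance, the full segment capacitance at the midpoint, half the resistance), a gate capacitance at every leaf, and an ideal driver at the root. The `Node` class, the function names, the tree shape, and the unit parameter values are all hypothetical; this is not the tree of Figure 4.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    length: float = 0.0                 # wire length from the parent to this node
    children: list = field(default_factory=list)


def downstream_cap(node, c_l, c_g):
    """Capacitance hanging below `node`: leaf gate caps plus wire caps."""
    if not node.children:
        return c_g
    return sum(c_l * ch.length + downstream_cap(ch, c_l, c_g)
               for ch in node.children)


def elmore_delays(root, r_l, c_l, c_g):
    """First-order (Elmore) delay from the root to every leaf."""
    delays = {}

    def visit(node, t):
        if not node.children:
            delays[node.name] = t
        for ch in node.children:
            r, c = r_l * ch.length, c_l * ch.length
            # T-network segment: R/2 sees (c + below), the other R/2 sees (below)
            visit(ch, t + r * (0.5 * c + downstream_cap(ch, c_l, c_g)))

    visit(root, 0.0)
    return delays


def example_tree(last_leaf_length):
    """Two-level binary tree; only the wire to leaf s4 is adjustable."""
    return Node("s", 0.0, [
        Node("a", 2.0, [Node("s1", 1.0), Node("s2", 1.0)]),
        Node("b", 2.0, [Node("s3", 1.0), Node("s4", last_leaf_length)]),
    ])


balanced = elmore_delays(example_tree(1.0), r_l=1.0, c_l=1.0, c_g=1.0)
stretched = elmore_delays(example_tree(3.0), r_l=1.0, c_l=1.0, c_g=1.0)
print(balanced)
print(stretched)
```

Lengthening only the wire to `s4` also raises the delay to its sibling `s3`, because the shared segment above them now charges more capacitance. A comparison based on path length alone would report no change for `s3`.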
Theoretical Results
In this section we state some key results that motivate the Method of Means and Medians. For the sake of brevity we have omitted proofs of the propositions. A detailed treatment of the proofs leading to the results can be found in [JSKSO]. The first result is that after splitting a region into two sub-regions, the lengths of segments from the center of mass of the region to each of its sub-regions are equal and symmetric. Next we establish a bound on the total wirelength for a gridded distribution of points and compare it with the wirelength for a minimum rectilinear Steiner tree spanning those points. We also present an interesting result which claims that increasing the number of points within a region reduces the skew. Finally, we show that the algorithm with one level of look-ahead runs in time $O(n \log n)$ where $n$ is the number of clock pins. All our theoretical results corroborate our experimental results (Section 7).
Theorem 5.1 Given a set of points $S = \{s_1, s_2, \ldots, s_n\}$, where $n$ is an even integer,
$$|x_c(S) - x_c(S_L(S))| + |y_c(S) - y_c(S_L(S))| = |x_c(S) - x_c(S_R(S))| + |y_c(S) - y_c(S_R(S))|$$
A similar result holds between $S$, $S_B(S)$ and $S_T(S)$. The significance of the above result is that at every split in the algorithm, the lengths to each of the sub-regions are always equal. Note that as we move deeper into the clock tree, the segments become shorter. Thus, at the topmost level of the clock tree, when the segments are longest, we ensure exact balance and no skew.
Lemma 5.2 Given a distribution of $n$ points on a uniform grid, where $n = 4^k$, $k$ is an integer $\ge 1$, within a region of side 1.0 unit, the total wirelength of the tree produced by the basic algorithm grows as $\frac{3}{2}\sqrt{n}$.
Theorem 5.3 The wirelength for a minimum rectilinear Steiner tree spanning a set of $n$ uniformly spaced points on a grid, where $n = 4^k$, $k$ an integer $\ge 1$, within a region of side 1.0 unit, grows as $\sqrt{n} + 1$. This is also the largest possible wirelength for a rectilinear Steiner tree for a distribution of $n$ points in a unit square. Any other distribution of $n$ points within the unit grid will yield a smaller total wirelength.
These results indicate that the wirelength of the clock tree is within a constant factor of $\frac{3}{2}$ compared to a minimum rectilinear Steiner tree for the particular distribution of points. We conjecture that the worst-case wirelength for the Method of Means and Medians is [...] [JSKSO]. Thus even in this case the total wirelength is still a constant times the wirelength for the largest minimum rectilinear Steiner tree.
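The growth rate for the gridded case can be checked by summing segment lengths level by level. The derivation below is a sketch under an assumption that is not stated in the text: the $n = 4^k$ points sit at the centers of the cells of a $2^k \times 2^k$ grid covering the unit square, and the first cut is in the $x$ direction. The symbol $W(n)$ for total wirelength is introduced here for illustration. At the $j$-th $x$ cut there are $4^j$ square regions of side $2^{-j}$, each contributing two segments of length $2^{-j}/4$; at the following $y$ cut there are $2 \cdot 4^j$ regions, each again contributing two segments of length $2^{-j}/4$.

```latex
W(n) \;=\; \sum_{j=0}^{k-1} \left( 4^{j}\cdot\frac{2^{-j}}{2} \;+\; 2\cdot 4^{j}\cdot\frac{2^{-j}}{2} \right)
     \;=\; \sum_{j=0}^{k-1} \left( 2^{\,j-1} + 2^{\,j} \right)
     \;=\; \frac{3}{2}\left(2^{k} - 1\right)
     \;=\; \frac{3}{2}\left(\sqrt{n} - 1\right)
```

For $n = 4$ this gives a total of $\frac{3}{2}$: one half unit for the top-level pair of segments and one unit for the four final connections.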
We define the sparsity $p$ of a distribution of points in a region to be the total number of points in the region divided by the area of the region. It is a measure of the average number of points per unit area. The next result concerns the variation of skew with sparsity.
Theorem 5.4 For a uniformly randomly distributed set of points inside a box of side $n$ units with sparsity $p$, the expected maximum difference in length from the center to any endpoint for the basic algorithm is proportional to [...].
This result indicates that as the number of points within the region is increased, the skew between the endpoints is reduced. Our experimental results support this claim.
Theorem 5.5 The algorithm with one-level look-ahead runs in time $O(n \log n)$, where $n$ is the number of points in the region.
The algorithm is fast: the running time for routing a region with 4096 points on a DECStation 3100 computer (14 MIPS) was less than a second of CPU time. Therefore, speed is not an issue when running the algorithm on practical examples.
Practical Considerations
In practice one would like to route the clock net with minimum wirelength while satisfying a prespecified tolerance on clock skew and phase delay. We have developed a hybrid clock routing algorithm that performs MMM to a certain depth, i.e., until the chip has been divided into a number of regions each containing less than a certain number of clock pins. Then standard routing techniques are applied to each of the remaining sub-regions. The depth at which the transition from MMM to standard techniques occurs and the number of pins within each sub-region that are routed using standard methods are a function of the amount of skew tolerable.
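The partitioning half of the hybrid scheme can be sketched as a median split that stops early. The code below is an illustration written from the description above, not the authors' implementation. The parameter name `pin_limit`, the tie-breaking rule, and the return format are assumptions, and the conventional router that would handle each remaining sub-region is deliberately left out.

```python
from math import ceil


def center(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mmm_until(points, pin_limit, axis=0):
    """Median-split until every region holds fewer than `pin_limit` pins.

    Returns (trunk, regions): `trunk` is the list of center-to-center
    connections made by the splitting, and `regions` is the list of pin
    groups left for a conventional router (not implemented here).
    """
    if pin_limit < 2:
        raise ValueError("pin_limit must be at least 2")
    if len(points) < pin_limit:
        return [], [points]
    ordered = sorted(points, key=lambda p: (p[axis], p[1 - axis]))
    half = ceil(len(ordered) / 2)
    trunk, regions = [], []
    for sub in (ordered[:half], ordered[half:]):
        trunk.append((center(points), center(sub)))
        sub_trunk, sub_regions = mmm_until(sub, pin_limit, 1 - axis)
        trunk += sub_trunk
        regions += sub_regions
    return trunk, regions


# Hypothetical 4 x 4 grid of pins.
pins = [(x, y) for x in range(4) for y in range(4)]
trunk, regions = mmm_until(pins, pin_limit=5)
print(len(trunk), [len(r) for r in regions])
```

With `pin_limit=2` the split continues down to single pins and reduces to the basic algorithm. A larger limit hands bigger groups to the conventional router, which is the knob the text ties to the tolerable skew.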
Experimental Results
As a test of the effectiveness of MMM it was run on twenty random examples and the MCNC industrial benchmarks Primary1 and Primary2.
The twenty random examples had uniform pin distributions in a square region. For the twenty examples, four equal-sized chips with 16, 32, 64, 256 or 512 pins were generated. For comparative purposes, we routed the same pin distributions using a minimum rectilinear spanning tree (MST) algorithm. As shown in [Han66], the ratio of the length of a minimum rectilinear spanning tree and an optimal rectilinear Steiner tree is bounded by a factor of $\frac{3}{2}$. SPICE [Nag75] files were generated for all examples based on Manhattan geometries, and the interconnect was driven by a single I/O buffer pad with equivalent drive of ten times the minimum sized inverter cell in a 2 µm design style. To model gate loading, a capacitance of value 0.3 pF was placed at the leaves of the clock distribution tree.
Figure 8 shows the results of running the hybrid algorithm to varying depths on the MCNC benchmark chip Primary2. On the x-axis we have plotted the number of pins in each region that were routed using standard techniques. The origin of the x-axis corresponds to routing using MMM for all the pins. The rightmost point on the axis corresponds to routing the clock net using a minimum spanning tree. Thus the depth to which MMM was applied decreases as we move towards the right of the figure. The solid line shows decreasing wirelength (normalized) while the dotted line shows increasing skew with decreasing depth. This provides the designer with the opportunity to arrive at a compromise between the excellent skew of MMM and the low wirelength of a minimum spanning tree.
Figures 9 and 10 show MMM's routing results for the MCNC Primary1 and Primary2 benchmarks respectively. The skew introduced for each of these examples was 31 ps and 260 ps respectively. Primary1 had 269 clock pins and Primary2 had 603 clock pins. Both placements were obtained using PROUD [TICH88]. It is interesting to note that Primary2's placement exhibited an asymmetric clock pin distribution while Primary1's remained relatively uniform. However, the asymmetry was not enough to deter MMM from yielding excellent results.
Figures 11 and 12 show the voltage waveforms at the furthest and closest pins from the clock driver for Primary1 when routed using MST and MMM respectively. The skew introduced by MST was 4.7 ns, and the routing to the furthest point was so poor (in terms of timing behavior) that the pin was unable to charge to the supply voltage. Note that the skew generated by MMM is 31 ps and is barely distinguishable in Figure 11.
Conclusions and Future Work
We have presented an approach to clock routing that is clearly superior to simple-minded clock routing based on a minimum spanning tree. While high-performance industrial designs are unlikely to have clock routing performed using such a simple approach as MST, the quality of the results generated by MMM is exceptional. The approach has all but eliminated clock skew and yielded excellent phase delay results for a wide range of chip sizes, net sizes (pin count), technologies, and pin distributions on both randomly created and industrial benchmarks.
Future work will address clock tree buffer optimization and give consideration to blockages and routing congestion during the growth of the clock tree. Additionally, the impact of the approach on wirability and chip area will be investigated.
[Fis89] has shown that clock skew may be used to decrease the clock period of a system by introducing relative delays between the arrival times of the clock signal at the clocked pins. We are considering techniques to implement this idea.
