In this paper we focus on routing techniques for optimizing clock signals in small-cell (e.g., standard-cell, sea-of-gates, etc.) ASICs. In previously reported work, the routing of the clock net has been performed using ordinary global routing techniques based on a minimum spanning tree or a minimal Steiner tree, which have little understanding of clock routing problems. We present a novel approach to clock routing that all but eliminates clock skew and yields excellent phase delay results for a wide range of chip sizes, net sizes (pin count), minimum feature sizes, and pin distributions on both randomly created and standard industrial benchmarks. For certain classes of pin distributions we have proven theoretically and observed experimentally a decrease in skew with an increase in net size. In practice, we have observed a two to three order of magnitude reduction in skew when compared to a minimum rectilinear spanning tree.
Introduction
In today's highly competitive IC marketplace, company survival necessitates product differentiability. Product differentiability may be engendered in many ways, several of which include: increased performance (e.g., lower power, faster timing, etc.), lower cost, more features, or faster time to market. Thus, design techniques that enhance chip timing performance are of fundamental importance to the IC community.
The clock is the essence of a synchronous digital system. Physically, the clock is distributed from an external pad to all similarly clocked synchronizing elements through a distribution network that includes clock distribution logic and interconnects. It serves to unify the physical and temporal design representations by determining the precise instants in time that the digital machine changes state. Because the clock is important, optimization of the clock signal can have a significant impact on the chip's cycle time, especially in high-performance designs. Non-optimal clock behavior is caused by either of two phenomena: the routing to the chip's synchronizing elements, or the non-symmetric behavior of the clock distribution logic.
Previous work in clock optimization has been contributed by several authors. H-trees have been recognized for years as a technique to help reduce the skew in synchronous systems [FK82] [KGE82] [DFW84] [BWMSS] [WF83]. For regular structures such as systolic arrays the H-tree works well to reduce skew, but in the general case asymmetric distributions of clock pins are common and the H-tree is not as effective for clock routing. The large size of the clock net has led some researchers [DFW84] [Mij87] to perform buffer optimization within the clock distribution tree. More recently, [BWMSS] has provided an analysis of the clock lines that considers the transmission line properties of the clock net. [BBB+89] have presented an approach for ASIC clock distribution that integrates buffer optimization into place and route algorithms, while [RS89] and [FP86] have presented approaches that consider macrocell clock distribution. However, in all previous work the routing of the clock net is performed using ordinary routing techniques. This causes non-optimal clock behavior, and as region size or the number of pins in the net increases, the undesirable behavior is exacerbated. In this paper, we focus exclusively on routing techniques for optimizing the clock signal in VLSI circuits. We demonstrate the superiority of our algorithm over standard routing techniques for examples spanning a wide range of sizes.
In section two the preliminaries necessary for understanding the paper are presented. Following this, the problem is defined in section three. Section four illustrates the algorithm for clock routing and section five discusses theoretical results. Next, in section six, practical considerations are discussed. In section seven the experimental results are presented, and in section eight possible avenues for future work and conclusions regarding the approach are discussed.
Preliminaries
The majority of digital chips are synchronous in nature. Synchronous designs are often modeled as a Moore finite state machine for the purposes of analysis. The topological requirement imposed on such a finite state machine is that all closed signal paths must contain at least one synchronizing element. Satisfaction of this constraint has several benefits, two of which are: the assurance of deterministic behavior if the physical aspects of the design are correct, and the elimination of the requirement that the combinational logic be free of transients as long as next state sampling is performed after the longest path has settled to its final value [MC80]. For simplicity, and without loss of generality, let the synchronizing elements be edge-triggered. Furthermore, let $CP$ denote the clock period, $d_L$ the largest path delay through the combinational logic, $t_{SKEW}$ the clock skew, $t_{SU}$ the set-up time of the edge-triggered synchronizing elements, and $t_{CQ}$ the delay from the synchronizing element's clock pins to the Q output pins. In order to guarantee that no long-path timing violations occur in the design, the following inequality must be satisfied: $CP \geq d_L + t_{SKEW} + t_{SU} + t_{CQ}$ (1). This expression demonstrates the important relationship between the clock period, the longest path delay, and the clock skew.
The two timing-related clock parameters that one must consider for high-performance design are clock skew and phase delay. Clock skew may be defined to be the maximum difference in arrival times at any two similarly clocked clock pins. Clock skew is caused by several phenomena: asymmetric routes to the clocked elements, differing interconnect line parameters, different delays through the clock distribution elements, and different device threshold voltages for the clock distribution logic. Equation 1 illustrates the important relationship between skew and the longest combinational logic path delay. As skew increases with the clock period held fixed, the efficiency of the digital system is reduced because valuable computation time is "stolen" from the total cycle time. Frequently in high-performance design environments, skew is constrained to be less than five percent of the clock period. Thus, in a 100 MHz design (a 10 ns period) skew would be constrained to be less than 500 ps. In this paper, routing techniques to help achieve this goal will be presented.
Phase delay may be defined to be the maximum delay to any synchronizing clock pin. The same phenomena causing skew also contribute to phase delay. It is convenient to consider phase delay to consist of two components: an intrinsic cell delay $t_{IH}$ or $t_{IL}$ contributed by the externally driven clock pad, and the time to charge or discharge the clock net, $t_{CH}$ or $t_{CL}$. Expressions for phase delay may be defined as $t_{CL} < T_L$ (5). These expressions are necessary but not sufficient conditions to guarantee proper clocking. In worst-case situations, $t_{CH}$ and $t_{CL}$ could conceivably constrain the clock period, so it is necessary to make provisions for minimizing phase delay.
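The two definitions above, together with the five percent rule of thumb, can be illustrated with a short calculation. This is only a sketch: the function names and the sample arrival times are hypothetical and are not taken from the paper.

```python
def skew_budget(clock_hz, fraction=0.05):
    """Skew budget in seconds: a fixed fraction of the clock period."""
    return fraction / clock_hz


def skew_and_phase_delay(arrival_times):
    """Return (skew, phase delay) for a list of clock arrival times.

    Skew is the largest difference between any two arrival times;
    phase delay is the largest arrival time.
    """
    latest = max(arrival_times)
    return latest - min(arrival_times), latest


# Hypothetical arrival times (seconds) at three clock pins.
arrivals = [1.20e-9, 1.35e-9, 1.28e-9]
skew, phase = skew_and_phase_delay(arrivals)
budget = skew_budget(100e6)          # 5% of a 10 ns period
within_budget = skew < budget
```

With these made-up numbers the skew is about 150 ps, which falls inside the 500 ps budget of a 100 MHz design.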
To this point, we have tacitly assumed that only one clock exists. However, in CMOS design styles it is commonplace to design with more than one clock. The ideas presented in this paper may be easily extended to the case of multiple clocks by treating the clock nets independently. When multiple clocks are present, inter-clock and intra-clock skew and phase delay must be considered. Independent treatment of the different clock nets will address intra-clock problems by minimizing the phase delay of each clock net and minimizing the skew between similarly clocked synchronizing elements. Inter-clock phase delay is minimized since intra-clock phase delay is minimized. While inter-clock skew is not explicitly minimized, should it be a problem, circuit techniques that insert delay to equalize arrival times across different clock nets will ameliorate the problem. Hereafter, for purposes of simplicity, attention will be restricted to single clock designs.
Prior to delving into the clock routing algorithm, it is necessary to define its role in the context of the overall design flow. Traditional ASIC design proceeds from logic to physical design. Physical design consists of three classical steps: placement, global routing, and detailed routing. In our proposed design flow a clock routing step is interposed between placement and global routing. Presently, during the clock routing step, the global and detailed routes of the clock net are determined and passed to the global router as blockages. The determined routes are constructed so that the clock signal behavior is optimized.
To understand the consequences of decisions made during physical design, one must model the interconnect parasitics that load the clock tree. We make the reasonable assumption that in a high-performance design environment, the interconnects are realized with aluminum due to its excellent conducting properties. Interconnect resistance is determined using the expression $R_{int} = \rho L / (W H)$, where $\rho$ is the resistivity of aluminum (3 μΩ·cm), $L$ is the interconnect length, $W$ the interconnect width, and $H$ the interconnect thickness. Interconnect capacitance $C_{int}$ is modeled using a simple parallel-plate model given by
In these expressions $K_c$ is a constant that is inserted to account for fringing effects, and can be calculated using the two-dimensional analysis of [DSSO]. It is assumed to be 2.0 in our calculations. $L_s$ represents the line spacing, $t_{ox}$ represents the thickness of the field oxide, and $\epsilon_{ox}$ is the permittivity of the oxide. All lengths used to calculate resistance and capacitance are based on Manhattan distances.
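As a rough numerical illustration of the resistance model, the sketch below assumes the usual uniform-conductor form R = ρL/(WH) and the 3 μΩ·cm resistivity quoted for aluminum. The function name and the wire dimensions are hypothetical; the capacitance expression is not reproduced here.

```python
RHO_ALUMINUM = 3e-8  # ohm-metres, i.e. 3 micro-ohm-cm


def wire_resistance(length, width, thickness, rho=RHO_ALUMINUM):
    """Resistance in ohms of a uniform wire; all dimensions in metres."""
    return rho * length / (width * thickness)


# Hypothetical wire: 1 mm long, 2 um wide, 1 um thick.
r_int = wire_resistance(1e-3, 2e-6, 1e-6)
```

For these dimensions the result is about 15 Ω, and it scales linearly with the Manhattan length of the route.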
Based on estimates for $R_{int}$ and $C_{int}$, simple and accurate interconnect delay estimates may be calculated using the first-order moment of the impulse response, which has also been called Elmore's delay [RPH83] [Elm48]. The interconnects are treated as distributed RC trees and the rectilinear segments comprising the interconnect are modeled using their equivalent T-network representation.
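The Elmore estimate on an RC tree can be sketched as follows. Each wire segment is replaced by a T-network (half of the segment resistance, the full segment capacitance to ground, then the other half of the resistance). The data layout (`parent`, `seg_r`, `seg_c`, `load_c`) is an illustrative assumption rather than the paper's implementation, and driver resistance is ignored.

```python
def elmore_delays(parent, seg_r, seg_c, load_c):
    """First-order (Elmore) delay from the root to every node of an RC tree.

    parent[i] : index of the parent of node i (node 0 is the root, parent -1);
                nodes must be numbered so that parent[i] < i.
    seg_r[i]  : resistance of the wire from parent[i] to i.
    seg_c[i]  : capacitance of that wire.
    load_c[i] : lumped capacitance attached at node i (e.g. a clock pin).
    """
    n = len(parent)
    # down[i]: capacitance at node i plus everything below it,
    # excluding the wire that feeds node i.
    down = list(load_c)
    for i in range(n - 1, 0, -1):
        down[parent[i]] += down[i] + seg_c[i]

    delay = [0.0] * n
    for i in range(1, n):
        half_r = seg_r[i] / 2.0
        # T-network: the first half resistor charges the wire capacitance and
        # everything downstream; the second half charges only the downstream part.
        delay[i] = (delay[parent[i]]
                    + half_r * (seg_c[i] + down[i])
                    + half_r * down[i])
    return delay


# Symmetric two-branch tree: both leaves should see the same delay.
balanced = elmore_delays([-1, 0, 0], [0, 10, 10], [0, 2, 2], [0, 1, 1])
# Two segments in series with a single load at the end.
chain = elmore_delays([-1, 0, 1], [0, 10, 10], [0, 2, 2], [0, 0, 1])
```

Because each delay depends on all capacitance hanging below every resistor on the path, changing one branch of the tree changes the delay to pins on other branches.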
where $WL$ equals the total wirelength and $NR$ equals the number of no routes.
The Algorithm
The Basic Algorithm
The algorithm, which we call the Method of Means and Medians (MMM), is conceptually simple and yields theoretical results which are intuitively pleasing. Let $S = \{s_1, s_2, \ldots, s_n\}$ be the set of points in the plane which represent the clock pins. Each $s_i$ is a couple $(x_i, y_i)$. Define $x_c(S) = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $y_c(S) = \frac{1}{n}\sum_{i=1}^{n} y_i$, so that $(x_c(S), y_c(S))$ represents the center of mass of the set of points $S$. We shall use the notation $S_x(S)$ or simply $S_x$ to denote the ordered set of points obtained by ordering the set $S$ by increasing x coordinate, i.e. $x_i \leq x_j$ if $s_i, s_j \in S_x(S)$ and $i < j$. Similarly, $S_y(S)$ or simply $S_y$ represents the ordered set of points obtained by ordering the set $S$ by increasing y coordinate.
The sets $S_L$ and $S_R$ represent the division of $S$ into two sets about the median x coordinate of the set of points. These sets partition the original region in the x dimension into two sub-regions with an approximately equal number of elements in each sub-region. In fact, $\left|\,|S_L| - |S_R|\,\right| \leq 1$. Similarly, $S_B$ and $S_T$ represent the division of $S$ into two sets about the median y coordinate of the set of points.
The basic algorithm first splits $S$ into two sets (arbitrarily in the x direction or y direction). Assume that a split of $S$ into $S_L$ and $S_R$ is made. Then, the algorithm routes from the center of mass of $S$ to each of the centers of mass of $S_L$ and $S_R$ respectively. The regions $S_L$ and $S_R$ are then recursively split in the y direction (the direction opposite to the previous one). Thus, splits alternating between x and y are introduced on the set of points recursively until there is only one point in each sub-region. The pseudo-code for the algorithm is given in the accompanying figure.
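The recursion described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `center_of_mass`, `mmm` and `wirelength` are hypothetical, the first half of an odd-sized set is assumed to receive the extra point, and each route is recorded as a point-to-point segment whose Manhattan length is taken as its wirelength.

```python
def center_of_mass(pts):
    """Mean x and mean y of a non-empty list of (x, y) points."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


def mmm(pts, axis=0):
    """Basic Method of Means and Medians.

    Splits `pts` about the median coordinate along `axis` (0 = x, 1 = y),
    routes from the center of mass of the region to the center of mass of
    each half, and recurses with the cut direction alternated.
    Returns a list of segments ((x0, y0), (x1, y1)).
    """
    if len(pts) <= 1:
        return []
    ordered = sorted(pts, key=lambda p: p[axis])
    half = (len(ordered) + 1) // 2        # assumption: extra point goes first
    root = center_of_mass(pts)
    segments = []
    for part in (ordered[:half], ordered[half:]):
        segments.append((root, center_of_mass(part)))
        segments += mmm(part, 1 - axis)
    return segments


def wirelength(segments):
    """Total Manhattan length of a list of segments."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in segments)


# Four pins on the corners of a unit square.
pins = [(0, 0), (0, 1), (1, 0), (1, 1)]
tree = mmm(pins)
total = wirelength(tree)
```

For the four-corner example the result is an H-shaped tree of six segments with total length 3.0, and every pin lies at the same path length from the center.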
Improvements
The simple algorithm described above yields good results, but there is room for further improvement. In the following discussion we define a cut in the x direction to mean a split resulting in a left region and a right region. A cut in the y direction implies a split resulting in a top and a bottom region. 
Delay equalization look-ahead
Consider the example shown in Figure 3, where S is the clock source. If we make a cut in the x direction and then recursively split the left and right regions in the y direction, we get the result shown in Figure 3(a). Clearly, there is skew between points PLT and PRT. However, if we reverse the cut directions, i.e., split in the y direction first followed by a split in the x direction, we get the result shown in Figure 3(b), which has no skew between the endpoints. This example illustrates the need for making a good choice of cut direction at each level of the recursion tree. We make the choice by a one-level look-ahead technique.
Given a region to be split, the algorithm makes an x direction cut followed by a y direction cut on the resulting left and right regions. It also makes a y direction cut followed by an x direction cut. The skews for the two configurations are compared and the cut direction that minimizes skew between its current endpoints is chosen. The method of estimating the skew between the endpoints is described in the following section.
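The look-ahead choice can be sketched as below. To keep the example short, the skew estimate used here is the spread of Manhattan path lengths from the region's center of mass to the four grandchild centers; this is a deliberate simplification and a stand-in for the delay-based estimate that the text calls for, and a delay estimator can be passed in through the hypothetical `estimate` parameter. All function names are illustrative.

```python
def com(pts):
    """Center of mass of a non-empty list of (x, y) points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def dist(a, b):
    """Manhattan distance between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def split(pts, axis):
    """Split about the median coordinate along `axis` (0 = x, 1 = y)."""
    ordered = sorted(pts, key=lambda p: p[axis])
    half = (len(ordered) + 1) // 2
    return ordered[:half], ordered[half:]


def two_level_spread(pts, axis):
    """Spread of path lengths to the grandchild centers when cutting on
    `axis` first and on the other axis second (a length-based proxy)."""
    root = com(pts)
    lengths = []
    for child in split(pts, axis):
        if not child:
            continue
        c = com(child)
        for grand in split(child, 1 - axis):
            if grand:
                lengths.append(dist(root, c) + dist(c, com(grand)))
    return max(lengths) - min(lengths)


def choose_cut(pts, estimate=two_level_spread):
    """Return 0 (cut in x first) or 1 (cut in y first), whichever the
    estimator scores lower."""
    return min((0, 1), key=lambda axis: estimate(pts, axis))


pins = [(0, 0), (0, 4), (2, 1), (2, 2)]
best_axis = choose_cut(pins)
```

For these four pins an x-first cut leaves a spread of 1.5 between endpoints while a y-first cut leaves 0.5, so the y-first cut is selected.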
Delay Calculation
We use the Penfield-Rubinstein [RPH83] algorithm for calculating delays to the endpoints in the grown clock tree. The resistance and capacitance of the tree segments are modeled by a T-network model. Consider the tree shown in Figure 4. Because of a property of the center of mass (see next section), the lengths of tree segments from the center of mass of a region to each of its two sub-regions will always be equal and symmetric. The delay from $s$ to the endpoint $s_7$ is calculated as
[Equation: delay $\delta_{s \to s_7}$ expressed in terms of $R_l$, $C_l$, $C_g$ and the segment lengths of Figure 4.]
where $C_g$ is the gate capacitance at the clock pin (assumed equal for all clock pins), $C_l$ is the capacitance per unit length and $R_l$ is the resistance per unit length. Note that the delay to any endpoint depends on the lengths of other segments connecting different endpoints, so it is important to use delay estimates to drive the look-ahead rather than length calculations. The complexity of the algorithm with look-ahead is $O(n \log n)$, where $n$ is the number of clock pins. For a detailed analysis of the complexity, the interested reader is referred to [JSK90]. We shall refer to the algorithm as the Method of Means and Medians (MMM) in the following text.
Theoretical results
In this section we state some key results that motivate the Method of Means and Medians. For the sake of brevity we have omitted proofs of the propositions. A detailed treatment of the proofs leading to the results can be found in [JSK90]. The first result is that after splitting a region into two sub-regions, the lengths of segments from the center of mass of the region to each of its sub-regions are equal and symmetric. Next we establish a bound on the total wirelength for a gridded distribution of points and compare it with the wirelength for a minimum rectilinear Steiner tree spanning those points. We also present an interesting result which claims that increasing the number of points within a region reduces the skew. Finally, we show that the algorithm with one level of look-ahead runs in time $O(n \log n)$, where $n$ is the number of clock pins. All our theoretical results corroborate our experimental results (Section 7).
, $S_B(S)$ and $S_T(S)$.
The significance of the above result is that at every split in the algorithm, the lengths to each of the sub-regions are always equal. Note that as we move deeper into the clock tree, the segments become shorter. Thus, at the topmost level of the clock tree, when the segments are longest, we ensure exact balance and no skew.
Lemma 5.2 Given a distribution of $n$ points on a uniform grid, where $n = 4^k$, $k$ is an integer $\geq 1$, within a region of side 1.0 unit, the total wirelength of the tree produced by the basic algorithm grows as $\frac{3}{2}\sqrt{n}$.
Theorem 5.3
The wirelength for a minimum rectilinear Steiner tree spanning a set of $n$ uniformly spaced points on a grid, where $n = 4^k$, $k$ an integer $\geq 1$, within a region of side 1.0 unit, grows as $\sqrt{n} + 1$. This is also the largest possible wirelength for a rectilinear Steiner tree for a distribution of $n$ points in a unit square. Any other distribution of $n$ points within the unit grid will yield a smaller total wirelength.
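As a quick sanity check on the $\sqrt{n} + 1$ growth, assume (this layout is an assumption, not part of the theorem statement) that the $n = 4^k$ points form a $\sqrt{n} \times \sqrt{n}$ array spanning the full unit square, so that neighboring points are $1/(\sqrt{n} - 1)$ apart. A tree that joins the points using $n - 1$ nearest-neighbor edges then has total length:

```latex
\[
(n-1)\cdot\frac{1}{\sqrt{n}-1}
  = \frac{(\sqrt{n}-1)(\sqrt{n}+1)}{\sqrt{n}-1}
  = \sqrt{n}+1 .
\]
```

For example, $n = 16$ gives $15 \cdot \tfrac{1}{3} = 5 = \sqrt{16} + 1$.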
These results indicate that the wirelength of the clock tree is within a constant factor of $\frac{3}{2}$ compared to a minimum rectilinear Steiner tree for the particular distribution of points.
We conjecture that the worst-case wirelength for the
Method of Means and Medians is
Thus even in this case the total wirelength is still a constant times the wirelength for the largest minimum rectilinear Steiner tree. We define the sparsity $p$ of a distribution of points in a region to be the total number of points in the region divided by the area of the region. It is a measure of the average number of points per unit area. The next result concerns the variation of skew with sparsity.
Theorem 5.4 For a uniformly randomly distributed set of points inside a box of side n units with sparsity $p$, the expected maximum difference in length from the center to any endpoint for the basic algorithm is proportional to
This result indicates that as the number of points within the region is increased, the skew between the endpoints is reduced. Our experimental results support this claim.
Theorem 5.5 The algorithm with one level look-ahead runs in time $O(n \log n)$, where $n$ is the number of points in the region.
The algorithm is fast: the running time for routing a region with 4096 points on a DECstation 3100 computer (14 MIPS) was less than a second of CPU time. Therefore, speed is not an issue when running the algorithm on practical examples.

Practical Considerations
In practice one would like to route the clock net with minimum wirelength while satisfying a prespecified tolerance on clock skew and phase delay. We have developed a hybrid clock routing algorithm that performs MMM to a certain depth, i.e., until the chip has been divided into a number of regions each containing less than a certain number of clock pins. Then standard routing techniques are applied to each of the remaining regions. The depth at which the transition from MMM to standard techniques occurs and the number of pins within each sub-region that are routed using standard methods are a function of the amount of skew tolerable.
Another practical concern is the degradation in the clock waveform. Usually, buffers are inserted into the clock tree to regenerate the clock waveform. We propose two strategies to deal with the placement of buffer cells. The first is to pre-place buffers at symmetric locations on the chip that coincide with expected locations of the centers of mass of the sub-regions to be driven by the buffers. Then, during the clock routing, detours are made to the pre-placed buffers so that they may drive the clock pins. These buffers then act as centers from which a clock tree is grown using the Method of Means and Medians. The second strategy is to insert buffers after placement at optimal locations determined by the Method of Means and Medians. The expected perturbation to the placement would be small considering the size and number of buffers relative to the total number of cells.
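One way the hybrid scheme might look in code is sketched below. Several choices here are assumptions made only for illustration: the "standard routing technique" is taken to be a Manhattan-distance minimum spanning tree built with Prim's algorithm, each small region's tree is attached at the region's center of mass, and the threshold parameter `max_pins` and all function names are hypothetical.

```python
def com(pts):
    """Center of mass of a non-empty list of (x, y) points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def dist(a, b):
    """Manhattan distance between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mst_edges(points):
    """Prim's algorithm under Manhattan distance, grown from points[0]."""
    in_tree, rest, edges = [points[0]], list(points[1:]), []
    while rest:
        a, b = min(((p, q) for p in in_tree for q in rest),
                   key=lambda e: dist(e[0], e[1]))
        edges.append((a, b))
        in_tree.append(b)
        rest.remove(b)
    return edges


def hybrid(pts, max_pins, axis=0):
    """MMM splits until a region holds at most `max_pins` pins, then a
    spanning tree connects that region's center of mass to its pins."""
    if len(pts) == 1:
        return []
    root = com(pts)
    if len(pts) <= max_pins:
        return mst_edges([root] + list(pts))
    ordered = sorted(pts, key=lambda p: p[axis])
    half = (len(ordered) + 1) // 2
    edges = []
    for part in (ordered[:half], ordered[half:]):
        edges.append((root, com(part)))
        edges += hybrid(part, max_pins, 1 - axis)
    return edges


def wirelength(edges):
    """Total Manhattan length of a list of edges."""
    return sum(dist(a, b) for a, b in edges)


pins = [(0, 0), (0, 1), (1, 0), (1, 1)]
pure_mmm = hybrid(pins, max_pins=1)      # MMM all the way down
all_mst = hybrid(pins, max_pins=4)       # whole net routed as one region
```

Raising `max_pins` moves the transition point higher in the tree, so more of the net is routed by the spanning-tree step and less by MMM.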
Experimental Results
As a test of the effectiveness of MMM it was run on twenty random examples and the MCNC industrial benchmarks Primary1 and Primary2. The twenty random examples had uniform pin distributions in a square region. For the twenty examples, four equal-sized chips with 16, 32, 64, 256 or 512 pins were generated. For comparative purposes, we routed the same pin distributions using a minimum rectilinear spanning tree (MST) algorithm. As shown in [Han66], the ratio of the length of a minimum rectilinear spanning tree and an optimal rectilinear Steiner tree is bounded by a factor of $\frac{3}{2}$. SPICE [Nag75] files were generated for all examples based on Manhattan geometries, and the interconnect was driven by a single I/O buffer pad with equivalent drive of ten times the minimum sized inverter cell in a 2 μm design style. To model gate loading, a capacitance of value 0.3 pF was placed at the leaves of the clock distribution tree.
We compared the skew, phase delay and wirelength as a function of the number of pins for MMM and MST.
Additional experiments comparing skew as a function of chip size and minimum feature size may be found in [JSK90]. An extended evaluation of the results of all experiments is also given in [JSK90]. To determine the relationship between skew and the number of pins with chip size fixed at 25 mm², MMM and MST were compared to one another, with the result appearing in Figure 5. Interestingly, the skew decreased with increasing number of pins for MMM and grew linearly for MST. Similarly, phase delay versus the number of pins for a chip size of 25 mm² was compared for MST and MMM. Again, MMM displayed a clear advantage, with its growth in phase delay appearing to be sub-linear and MST approximating linear growth. These results can be seen in Figure 6. The dramatic improvements in clock skew and phase delay are paid for in terms of total wirelength. To illustrate this, the average wirelength for all examples was plotted against the number of pins in Figure 7.
The experimental results corroborate the theoretical $\sqrt{n}$ relationship between wirelength and number of points $n$, with the difference appearing as a constant factor. Thus, improvements in clock behavior are accompanied by an increase in the clock net's wirelength.
Figure 8 shows the results of running the hybrid algorithm to varying depths on the MCNC benchmark chip Primary2. On the x-axis we have plotted the number of pins in each region that were routed using standard techniques. The origin of the x-axis corresponds to routing using MMM for all the pins. The rightmost point on the axis corresponds to routing the clock net using a minimum spanning tree. Thus the depth to which MMM was applied decreases as we move towards the right of the figure. The solid line shows decreasing wirelength (normalized) while the dotted line shows increasing skew with decreasing depth. This provides the designer with the opportunity to arrive at a compromise between the excellent skew of MMM and the low wirelength of a minimum spanning tree.
Figures 9 and 10 show MMM's routing results for the MCNC Primary1 and Primary2 benchmarks respectively. The skew introduced for each of these examples was 31 ps and 260 ps respectively. Primary1 had 269 clock pins and Primary2 had 603 clock pins. Both placements were obtained using PROUD [TKH88]. It is interesting to note that Primary2's placement exhibited an asymmetric clock pin distribution while Primary1's remained relatively uniform. However, the asymmetry was not enough to deter MMM from yielding excellent results.
Figures 11 and 12 show the voltage waveforms at the furthest and closest pins from the clock driver for Primary1 when routed using MST and MMM respectively. The skew introduced by MST was 4.7 ns, and the routing to the furthest point was so poor (in terms of timing behavior) that the pin was unable to charge to the supply voltage. Note that the skew generated by MMM is 31 ps and is barely distinguishable in Figure 11.
Conclusions and Future Work
We have presented an approach to clock routing that is clearly superior to simple-minded clock routing based on a minimum spanning tree. While high-performance industrial designs are unlikely to have clock routing performed using such a simple approach as MST, the quality of the results generated by MMM is exceptional. The approach has all but eliminated clock skew and yielded excellent phase delay results for a wide range of chip sizes, net sizes (pin count), technologies, and pin distributions on both randomly created and industrial benchmarks.
Future work will address clock tree buffer optimization and give consideration to blockages and routing congestion during the growth of the clock tree. Additionally, the impact of the approach on wirability and chip area will be investigated.
[Fis89] has shown that clock skew may be used to decrease the clock period of a system by introducing relative delays between the arrival times of the clock signal at the clocked pins. We are considering techniques to implement this idea.
