Abstract
Introduction
In the multi-gigahertz design era, clock design plays a crucial role in determining chip performance and facilitating timing and design convergence. Since clock-skew directly translates to cycle-time penalty, zero-skew clock-tree designs have received significant attention. Various approaches for zero-skew clock-tree construction have been proposed [13] [2] [8]. As technology feature size shrinks, process-variation induced clock-skew dominates the clock-skew in manufactured chips. Process-variation comes from various manufacturing procedures such as lithography, etching, and chemical-mechanical polishing, and is usually difficult to predict. It is observed in [7] that process-variation induced skew can reach 10% of the clock-delay.
The total clock-delay is dominated by interconnect-delay in a clock-tree. To reduce such delay, buffer-insertion, buffer-sizing, and wire-sizing have been widely adopted. These techniques not only improve the robustness of a clock-tree but also reduce power-consumption by avoiding the use of wide wires and thereby reducing clock-tree capacitance. An excellent survey of interconnect optimization techniques can be found in [5]. Buffer-insertion, buffer-sizing, and wire-sizing techniques for delay and power optimization have been proposed in two major categories. In the first category, buffer-insertion, buffer-sizing, and wire-sizing are formulated as optimization problems in which the maximum delay at each leaf node is constrained [6] [4] [3]. The second category, based on van Ginneken's algorithm, uses a bottom-up dynamic programming technique to find optimal solutions for a subtree and propagate the solutions up toward the root node [8] [14] [9] [1].
Recent work [12] uses a new approach to tackle wire-sizing problems for zero-skew clock-tree optimization. Since a zero-skew clock-tree can be characterized by its clock-delay and capacitance, the proposed algorithm captures the zero-skew design-space of each subtree with a two-dimensional region, or DC region, where the Y-axis and X-axis coordinates are the clock-delay and capacitance values. Every point inside a DC region represents one or more zero-skew designs, or embeddings, of the subtree. Sampling techniques are used to capture the irregular shapes of the DC regions. The DC region of a clock-tree is constructed from the sampled subtree DC regions in a bottom-up fashion. A top-down embedding selection algorithm is then used to generate the embeddings that yield the chosen clock-delay and capacitance values from the DC regions. The major advantages of the proposed algorithm are that it finds $\epsilon$-optimal solutions and takes pseudo-polynomial runtime and memory usage. However, it does not consider buffer-insertion and buffer-sizing, which are more efficient in reducing clock-delays.
In this paper, we extend the work of [12] and propose a zero-skew buffered clock-tree synthesis flow that generates process-variation robust and low-power clock-trees from a given set of clock sink nodes. First, we adopt the BB+DME [2] algorithm to generate the initial routings. The initial routings are then optimized for process-variation and power with a novel algorithm that performs simultaneous buffer-insertion/sizing and wire-sizing within the zero-skew design-space. The zero-skew design-space of each buffered subtree is represented by a three-dimensional DC region, and sampling techniques are extended to capture it. To handle inverter-insertion, which is a more area-efficient alternative to buffer-insertion [11], and the signal-polarity issue, two sets of DC regions are used. After the bottom-up DC region construction, a top-down embedding selection algorithm is used to generate embeddings from the DC regions.
The rest of this paper is organized as follows. In Section 2, the delay and power models and the DC region approach are discussed. Section 3 introduces the proposed clock-tree synthesis flow and the simultaneous buffer-insertion/sizing and wire-sizing algorithm. The top-down embedding selection algorithm is proposed in Section 4. Experimental results are discussed in Section 5, and Section 6 concludes this work.
Preliminaries

Delay and Power Models
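Interconnect-delay and gate-delay are the two delay components in a clock-tree. In this paper, interconnects and buffers are modeled with the resistance-capacitance (RC) model, with delays computed according to the Elmore delay model. A clock-tree with given routing rooted at node $v$ is denoted $T_v$. For a wire with length $l$ and width $w$, the wire resistance is $r_0 l / w$ and the wire capacitance is $c_0 l w$, where $r_0$ and $c_0$ are the unit-square wire resistance and unit-area wire capacitance. For a buffer with width $w_b$, the gate capacitance is $c_b w_b$ and the output resistance is $r_b / w_b$, where $c_b$ and $r_b$ are unit-width gate capacitance and resistance. The gate is modeled as a ramp voltage source with an intrinsic delay of $t_c$.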
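As a concrete illustration of the delay model, the following minimal sketch computes the Elmore delay of a single wire driving a lumped load. It assumes the common distributed-RC form in which a wire of length $l$ and width $w$ has resistance $r_0 l / w$ and capacitance $c_0 l w$; the function names are hypothetical, and the default values of `r0` and `c0` are the ones quoted in the experimental section.

```python
def wire_rc(length_um, width_um, r0=0.03, c0=2e-16):
    """Return (resistance in ohms, capacitance in farads) of one wire segment.

    Assumed model: R = r0 * l / w and C = c0 * l * w.
    """
    resistance = r0 * length_um / width_um
    capacitance = c0 * length_um * width_um
    return resistance, capacitance


def elmore_wire_delay(length_um, width_um, load_cap, r0=0.03, c0=2e-16):
    """Elmore delay of a distributed RC wire driving a lumped load capacitance."""
    r_wire, c_wire = wire_rc(length_um, width_um, r0, c0)
    # Half of the wire's own capacitance plus the full load is charged through R.
    return r_wire * (c_wire / 2.0 + load_cap)
```

For example, `elmore_wire_delay(1000.0, 1.0, 1e-13)` evaluates a 1000 µm wire of width 1 µm driving a 100 fF load, which comes to roughly 6 ps under these assumptions.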
The clock-tree power consumption is generally modeled as $P = fCV^2 + P_s + P_l$, where $f$ is the switching frequency, $V$ is the voltage swing, and $C$ is the sum of all interconnect capacitance, buffer gate capacitance, and sink loads. $P_s$ accounts for the buffer short-circuit power and $P_l$ accounts for the leakage power. Since $P_s$ and $P_l$ are usually much smaller than $fCV^2$, which alone is a reasonably accurate measure of the total power consumption in clock-tree designs [10], we adopt the simplified model $P = fCV^2$ in this paper.
The DC Region Approach
In the Elmore delay model each subtree is characterized by the total capacitance seen at its root node. Consider the DC regions of the two subtrees $T_{v_l}$ and $T_{v_r}$ obtained by considering only wire-sizing. The DC region of a level-1 node such as $\Omega_{v_l}$ is a curve, and the DC regions of higher-level nodes have complex shapes. From the DC regions we can learn the minimum-delay and minimum-power achieved by wire-sizing. The Y-coordinate of the bottom-most point in a DC region is the minimum-delay and the X-coordinate of the left-most point in a DC region is the minimum-power. A clock-tree is composed of two branches. The left branch of $T_v$ is enclosed in the dashed circle in Figure 1 and is denoted $T^{+}_{v_l}$. Its DC region, the branch DC region, can be obtained by applying a transformation on $\Omega_{v_l}$. The zero-skew constraint requires the delays along both branches to be the same, thus $\Omega_v$ can be obtained by an equi-delay merge operation on the left and right branch DC regions.
Generating embeddings from the DC regions is the inverse process. First the target clock-delay and capacitance, $d_t$ and $c_t$, of an embedding are chosen. The capacitance is then split between the two branches. The DC regions of the left and right subtrees are scanned, and feasible target clock-delays and capacitances of the subtrees are selected. From the selected points in the branch DC regions and the subtree DC regions, the wire widths above the subtrees can be determined. The process is illustrated in Figure 2.
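The equi-delay merge can be sketched as follows, assuming (as an illustration, not the authors' data structure) that each sampled branch DC region is stored as a dictionary mapping a clock-delay scan-line index to a list of capacitance segments `(c_lo, c_hi)`. Because both branches must have the same delay and their capacitances add at the merge point, segments on the same scan-line are combined by adding their end points. The function name `equi_delay_merge` is hypothetical.

```python
def equi_delay_merge(left, right):
    """Merge two sampled branch DC regions under the zero-skew constraint.

    left, right: dict mapping scan-line index -> list of (c_lo, c_hi) segments.
    Only scan-lines present in both branches can yield zero-skew embeddings.
    """
    merged = {}
    for delay in left.keys() & right.keys():
        # Any capacitance on a left segment can pair with any capacitance on a
        # right segment, so each pair of segments yields a summed segment.
        sums = [(l_lo + r_lo, l_hi + r_hi)
                for l_lo, l_hi in left[delay]
                for r_lo, r_hi in right[delay]]
        merged[delay] = sorted(sums)
    return merged
```

Using integer scan-line indices rather than floating-point delays as keys avoids relying on exact float equality when matching the two branches.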
Simultaneous Buffer-Insertion/Sizing and Wire-Sizing
In our zero-skew buffered clock-tree synthesis flow the BB+DME [2] algorithm is used to generate initial routings. A set of clock sink nodes is first recursively bi-partitioned according to the locations and the capacitances of the sinks such that the two partitioned subsets have balanced numbers of sink nodes and balanced total capacitances. The balanced bipartition continues until each subset contains only one sink node, at which point an abstract binary-tree topology is obtained. The merging segments, which are sets of zero-skew merging points, are constructed for each internal node in a bottom-up fashion. Finally a set of merging points is selected in a top-down fashion and an initial clock-tree is generated. We consider buffer-insertion above all merging points and obtain a buffered clock-tree.
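The recursive bipartition step can be illustrated with a simplified stand-in. The sketch below is not the BB heuristic of [2]: it only balances the number of sinks by splitting at the median along the longer side of the bounding box, and it ignores sink capacitances. Sinks are assumed to be `(x, y, cap)` tuples and the function name is hypothetical.

```python
def balanced_bipartition(sinks):
    """Build an abstract binary topology by recursive median splits.

    sinks: non-empty list of (x, y, cap) tuples.
    Returns a leaf tuple (x, y, cap) or a pair (left_subtree, right_subtree).
    """
    if len(sinks) == 1:
        return sinks[0]
    xs = [s[0] for s in sinks]
    ys = [s[1] for s in sinks]
    # Cut across the longer dimension of the bounding box.
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(sinks, key=lambda s: s[axis])
    mid = len(ordered) // 2
    return (balanced_bipartition(ordered[:mid]),
            balanced_bipartition(ordered[mid:]))
```

A fuller version would also weigh total capacitance when choosing the cut, as the text describes.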
A buffered clock-tree $T_v$ can be characterized by three parameters as shown in Figure 3. The first two parameters, $d_v$ and $c_v$, are the clock-delay and the capacitance seen at the root node $v$. The total switching capacitance of $T_v$ is the sum of $c_v$ and $c_{sv}$, the capacitance shielded from $v$ by the buffers. Thus we can represent the zero-skew design-space with a three-dimensional DC region, where the coordinates of the Z-axis, X-axis, and Y-axis are the clock-delay, the capacitance seen at the root node, and the shielded capacitance.
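A point of the three-dimensional DC region and its power under the simplified model $P = fCV^2$ can be sketched as below. The class and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferedPoint:
    """One zero-skew design point of a buffered subtree."""
    delay: float         # clock-delay d_v (Z-axis)
    root_cap: float      # capacitance seen at the root, c_v (X-axis)
    shielded_cap: float  # capacitance hidden behind buffers, c_sv (Y-axis)

    def switching_cap(self):
        # Total switching capacitance is c_v + c_sv.
        return self.root_cap + self.shielded_cap

    def power(self, freq_hz, vdd):
        # Simplified dynamic power model P = f * C * V^2.
        return freq_hz * self.switching_cap() * vdd ** 2
```

Keeping $c_v$ and $c_{sv}$ separate matters because only $c_v$ loads the upstream wire and buffer, while both contribute to power.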
Significant extensions are needed to handle the three-dimensional DC regions. Due to space limitations, we only highlight the important extensions in the following subsections.
Branch DC Regions and Projected Scan-Line Sampling
The details of transforming subtree DC regions to a branch DC region for wire-sizing are available in [12]. To capture the irregular shapes of the branch DC regions, sampling on wire-width is applied and the intersections of the branch DC regions with a set of clock-delay scan-lines are obtained. The branch DC regions can then be represented by a set of horizontal segments. Since the branch DC region is the projection of a zero-skew design-space, we refer to the project-sample-scan procedure as projected scan-line sampling. The transformation is defined by the following equations obtained directly from the Elmore delay model:
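For orientation, the unbuffered transformation can be sketched in the following form. This is a reconstruction under stated assumptions rather than a quotation: it assumes a wire of length $l$ and width $w$ has resistance $r_0 l / w$ and capacitance $c_0 l w$, and it uses the superscript $+$ to mark branch quantities, consistent with the $c^{+}_{sv}$ notation in this section. The numbering follows the references to (1)-(2) elsewhere in the text.

```latex
\begin{align}
d^{+}_{v} &= d_v + \frac{r_0 l_v}{w_e(v)}
             \left( \frac{c_0 l_v w_e(v)}{2} + c_v \right), \tag{1} \\
c^{+}_{v} &= c_v + c_0 l_v w_e(v). \tag{2}
\end{align}
```

Expanding (1) gives $d^{+}_{v} = d_v + r_0 c_0 l_v^2 / 2 + r_0 l_v c_v / w_e(v)$, so under these assumptions only the last term depends on the wire width.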
The edge connecting $v$ to its parent node is $e_v$, and the length of $e_v$ is $l_v$. The wire-width of $e_v$ is $w_e(v)$, with $w_m \le w_e(v) \le w_M$. Since wire-sizing does not affect $c_{sv}$, we can set $c^{+}_{sv} = c_{sv}$ and use projected scan-line sampling to obtain three-dimensional sampled branch DC regions.
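The intersection of a transformed subtree segment with one clock-delay scan-line can be sketched in closed form. The sketch assumes the Elmore wire model $R = r_0 l / w$, $C = c_0 l w$, so that the branch delay is $d + r_0 c_0 l^2 / 2 + r_0 l c / w$ and the branch capacitance is $c + c_0 l w$. It solves for the width directly instead of sampling widths as the text describes, which is a simplification for illustration; all function names are hypothetical.

```python
def wire_transform(d_sub, c_sub, width, length, r0=0.03, c0=2e-16):
    """Forward transformation of a subtree point through a wire (assumed model)."""
    d_branch = d_sub + (r0 * length / width) * (c0 * length * width / 2.0 + c_sub)
    c_branch = c_sub + c0 * length * width
    return d_branch, c_branch


def branch_segment_on_scanline(d_sub, c_lo, c_hi, d_target, length,
                               w_min=0.3, w_max=3.0, r0=0.03, c0=2e-16):
    """Branch capacitances reachable at delay d_target from one subtree segment.

    The subtree segment lies at delay d_sub with capacitance in [c_lo, c_hi].
    Returns (c_branch_lo, c_branch_hi), or None if the scan-line is unreachable.
    """
    slack = d_target - d_sub - 0.5 * r0 * c0 * length ** 2
    if slack <= 0.0:
        return None
    # Required width is w = r0 * length * c / slack, which is linear in c,
    # so the width bounds translate into bounds on the subtree capacitance.
    c_a = max(c_lo, w_min * slack / (r0 * length))
    c_b = min(c_hi, w_max * slack / (r0 * length))
    if c_a > c_b:
        return None
    # c_branch = c + c0 * length * w = c * (1 + r0 * c0 * length^2 / slack).
    gain = 1.0 + r0 * c0 * length ** 2 / slack
    return (c_a * gain, c_b * gain)
```

Each returned interval is one horizontal segment of the sampled branch DC region; segments from different subtree scan-lines that land on the same target scan-line can then be combined.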
If a buffer is inserted above a subtree, then the transformation is defined by the following equations:
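A plausible form of the buffered transformation is sketched below. It is a reconstruction under stated assumptions, not a quotation: the buffer $b_v$ is assumed to sit at node $v$ and drive the subtree, with the wire $e_v$ above it; a buffer of width $w_b(v)$ is assumed to have gate capacitance $c_b w_b(v)$ and output resistance $r_b / w_b(v)$; and the wire model is $R = r_0 l / w$, $C = c_0 l w$.

```latex
\begin{align}
d^{+}_{v} - d_v &= t_c
  + \frac{r_0 l_v}{w_e(v)} \left( \frac{c_0 l_v w_e(v)}{2} + c_b w_b(v) \right)
  + \frac{r_b}{w_b(v)}\, c_v, \tag{3} \\
c^{+}_{v} &= c_0 l_v w_e(v) + c_b w_b(v), \tag{4} \\
c^{+}_{sv} &= c_{sv} + c_v. \tag{5}
\end{align}
```

In this form the subtree capacitance $c_v$ no longer appears in $c^{+}_{v}$: it is hidden behind the buffer and moves into the shielded capacitance.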
Equation (3) defines the difference of clock-delays between a branch and its subtree, which is composed of the buffer intrinsic delay, the delay caused by wire and buffer capacitance, and the buffer-delay caused by driving the shielded capacitance. The width of the buffer above $v$ is $w_b(v)$. Equations (4) and (5) define the changes of the capacitance seen at the root node and of the shielded capacitance. Figure 4 shows the steps of projected scan-line sampling and equi-delay merge in obtaining the sampled DC region of an unbuffered two-level clock-tree.
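The effect of buffer-width sampling on one subtree point can be sketched as follows. The formulas are an assumed reading of the description above (buffer at node $v$ with the wire above it, gate capacitance $c_b w_b$, output resistance $r_b / w_b$), and the function names are hypothetical. Default parameter values are those quoted in the experimental section.

```python
def buffered_branch(d_sub, c_sub, c_shielded, wire_width, buf_width, length,
                    r0=0.03, c0=2e-16, rb=100.0, cb=40e-15, tc=30e-12):
    """Return (delay, root capacitance, shielded capacitance) of a buffered branch."""
    wire_res = r0 * length / wire_width
    wire_cap = c0 * length * wire_width
    gate_cap = cb * buf_width
    delay = (d_sub + tc                                   # intrinsic buffer delay
             + wire_res * (wire_cap / 2.0 + gate_cap)     # wire driving the gate
             + (rb / buf_width) * c_sub)                  # buffer driving subtree
    # The subtree capacitance becomes shielded; the root sees wire plus gate.
    return delay, wire_cap + gate_cap, c_shielded + c_sub


def sweep_buffer_widths(d_sub, c_sub, c_shielded, wire_width, length,
                        wb_min=1.0, wb_max=10.0, samples=4, **params):
    """Evaluate the branch for evenly spaced buffer widths (samples >= 2)."""
    step = (wb_max - wb_min) / (samples - 1)
    return [buffered_branch(d_sub, c_sub, c_shielded, wire_width,
                            wb_min + i * step, length, **params)
            for i in range(samples)]
```

Sweeping the width shows the trade-off the sampling has to capture: a wider buffer drives the subtree faster but presents more gate capacitance to the wire above it.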
Inverter-Insertion
In practice, inverter-insertion uses less area and is preferred over buffer-insertion. With an inverter inserted, the clock-phase is inverted. This is accommodated in our model by keeping two DC regions and two branch DC regions at each node $v$, one for each signal polarity.
From this point on in this paper the term buffer refers to inverter. Figure 5 shows the steps of projected scan-line sampling in obtaining the sampled branch DC region from the sampled DC region. The complete bottom-up DC region construction algorithm is presented in Algorithm 1.
Complexity
Algorithm 1 Construct DC($T_v$)
Input: a clock-tree $T_v$ with given routing rooted at node $v$
Output: DC regions and branch DC regions of $T_v$
if $v$ is a leaf node then [...]
Assume a clock-tree has $n$ nodes and the numbers of samples used for clock-delay scan-lines, buffer-width, and shielded capacitance are $p$, $q$, and $r$. A sampled DC region is represented by $r$ groups of horizontal segments lying on X-Z planes, with each group containing $p$ segments. To generate a branch DC region, $q$ curves are generated from each segment, and $p \cdot q \cdot r$ curves are used to intersect with the $p \cdot r$ scan-lines to generate another $r$ groups of segments. Thus the complexity of our algorithm is $O(np^2qr^2)$ and the memory usage is $O(npr)$. By exploiting the properties of (3)-(5), the runtime can be reduced to approximately $O(npqr^2)$.
Note that there can be more than one segment on a clock-delay scan-line. However, the gaps between the segments tend to be filled quickly as we move upward toward the root node. For example, multiple segments can overlap and become a single segment either when we create the sampled branch DC regions or when we merge the sampled branch DC regions with the equi-delay merge operator. In practice, the number of segments on a scan-line is always less than four, and we exclude it from the complexity analyses.
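The way overlapping segments on one scan-line collapse into a single segment can be sketched with a standard interval union. The function name is hypothetical, and segments are assumed to be `(c_lo, c_hi)` pairs on the same clock-delay scan-line.

```python
def merge_overlapping(segments):
    """Union of capacitance segments on one scan-line.

    Segments that overlap or touch are fused, so the list stays short
    even after many transformations and merges.
    """
    merged = []
    for lo, hi in sorted(segments):
        if merged and lo <= merged[-1][1]:
            prev_lo, prev_hi = merged[-1]
            merged[-1] = (prev_lo, max(prev_hi, hi))
        else:
            merged.append((lo, hi))
    return merged
```

Keeping each scan-line as a short sorted list of disjoint segments is what makes the per-node memory proportional to the number of scan-lines rather than to the number of individual solutions.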
Compared with approaches based on van Ginneken's algorithm, which suffer from exponential growth in the size of the solution set, our approach only takes polynomial runtime and memory usage. This advantage comes from projected scan-line sampling. Since each line segment captures an infinite number of solutions, our approach encodes more information with the same amount of memory and results in better runtime and memory usage.
Slew-Rate Control and Useful-Skew
One of the purposes of buffer-insertion is to adjust the clock slew-rate. If the load capacitance of a buffer is too large, the output signal has slow rise and fall times, which in turn increases the short-circuit power of its downstream buffers. One way to control the slew-rate is to limit the load capacitance such that the slew-rate of the buffer is bounded to a desired value. This constraint is accounted for in our approach by limiting $c_v$ in the bottom-up phase and discarding the portion of the design-space with slow slew-rates. In the bottom-up phase, the DC regions can also become large because of embeddings with excessive buffers, which have large clock-delays and total capacitances. By setting upper limits on $d_v$ and $(c_{sv} + c_v)$, those ill-buffered embeddings are excluded from our DC regions.
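The pruning described above can be sketched as a filter over sampled segments. The representation is an assumption made for illustration: each segment is a tuple `(delay, c_lo, c_hi, c_shielded)`, and the three limits correspond to the bound on $c_v$ for slew-rate control and the upper limits on $d_v$ and $(c_{sv} + c_v)$. The function name is hypothetical.

```python
def prune_segments(segments, c_load_max, d_max, c_total_max):
    """Drop or clip segments that violate the slew, delay, or capacitance limits."""
    kept = []
    for delay, c_lo, c_hi, c_shielded in segments:
        if delay > d_max:
            continue  # excessive clock-delay, typically over-buffered
        # Clip the root capacitance by the slew limit and by the total budget.
        hi = min(c_hi, c_load_max, c_total_max - c_shielded)
        if hi >= c_lo:
            kept.append((delay, c_lo, hi, c_shielded))
    return kept
```

Because a segment is clipped rather than discarded whenever part of it is still feasible, no valid embedding is lost by the pruning.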
Recently, useful-skew [15] concepts have been widely proposed to speed up timing convergence in order to compensate for timing uncertainties in physical layouts. Our algorithm accommodates DC region generation for a useful-skew design-space by assigning useful-skew values to $d_v$ at each leaf node. Furthermore, in low-power designs not requiring high clock-frequencies, DC regions can be generated for a bounded-skew design-space to allow more aggressive power-optimization. In this case, the skew-bounds are assigned to $d_v$. $\Omega_v$ of a leaf node becomes a vertical segment, of which the Y-coordinates of the end points are the maximum and minimum acceptable skews and the X-coordinate is the sink capacitance.
Embedding Selection
The top-down embedding selection algorithm is illustrated by the following five steps:
[...] $T^{+}_{v_r}$ by breaking $\hat{c}_v$ and $\hat{c}_{sv}$, respectively, into $\hat{c}$ [...] (1)-(2) or (3)-(5) are satisfied.
• To satisfy (1)-(2), $b_{v_l} = \phi$, $\hat{c}_{sv_l} = \hat{c}^{+}_{sv_l}$, and $w_e(v_l)$ is determined by $\hat{c}_{v_l}$.
• To satisfy (3)-(5), $\hat{c}_{v_l} = \hat{c}$ [...]
To highlight the purpose of each step: Step 1 determines the target clock-delay, the capacitance seen at the root node, and the shielded capacitance; Step 2 splits the capacitance budgets between the two branches $T^{+}_{v_l}$ and $T^{+}_{v_r}$; and Step 3 determines the buffer widths for $b_v$ and the wire widths for $e_v$.
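For the unbuffered case, recovering a wire width from the chosen capacitances can be sketched as below. It assumes the wire capacitance model $c_0 l w$, so that the branch capacitance equals the subtree capacitance plus $c_0 l_v w_e(v)$; the function name is hypothetical.

```python
def wire_width_from_caps(c_branch, c_sub, length, c0=2e-16, w_min=0.3, w_max=3.0):
    """Wire width implied by a branch/subtree capacitance pair, or None.

    Assumed relation: c_branch = c_sub + c0 * length * width.
    """
    width = (c_branch - c_sub) / (c0 * length)
    if not (w_min <= width <= w_max):
        return None  # this split of the capacitance budget is infeasible
    return width
```

A `None` result signals that the top-down selection has to try a different split of the capacitance budget between the two branches.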
Experimental Results
We implement our algorithm in C++ and run the program on a 1.2GHz Pentium IV PC with 1GB of memory. The benchmarks r1-r5 are taken from [13]. All simulations use $r_0 = 0.03\,\Omega$, $c_0 = 2 \times 10^{-16}$ F, $w_m = 0.3\,\mu$m, and $w_M = 3\,\mu$m. The parameters of the buffers are $c_b = 40$ fF, $r_b = 100\,\Omega$, $t_c = 30$ ps, $w_{bm} = 1$, and $w_{bM} = 10$, and the channel length equals 0.3 µm. The initial routings are generated by the BB+DME [2] algorithm. The sampling resolution is set to $p = q = r = 64$.
Table 1 shows the minimum-delay and minimum-power solutions captured by projected scan-line sampling. The delay gains are the initial delays divided by the optimized delays, and the load gains are the load reductions divided by the initial loads. With simultaneous buffer-insertion, buffer-sizing, and wire-sizing, the minimum-power embeddings consume 23% to 34% less power than their initial routings, and their delays improve by 2X to 24X. Minimum-delay solutions have more than 2X speedup compared with minimum-power solutions. The power difference between minimum-delay and minimum-power solutions decreases with larger circuits and drops to 5% in r5. Compared with wire-sizing alone, which achieves an average of 3.3X delay reduction with 21.6% more power (with 1µm ≤ w ≤ 4µm) [12], simultaneous buffer-insertion/sizing and wire-sizing is not only much more efficient in reducing clock-delay but also more power efficient.
Table 1: Clock-delay and power consumption before and after buffer-insertion/sizing and wire-sizing ($w_m = 0.3\,\mu$m, $w_M = 3\,\mu$m, $w_{bm} = 1$, $w_{bM} = 10$, $p = q = r = 64$). The delays shown are the Elmore-delays multiplied by $\ln 2$.
Figure 6 shows the DC region of r5. Figure 7(a) shows the clock-delay and power trade-off curves of r1-r5. The power difference between minimum-delay and minimum-power embeddings is relatively small compared with their clock-delay difference. Figure 7(b) shows the worst-case process-variation induced skews of the minimum-delay and minimum-power embeddings of r1-r5 with four systematic process-variation models. In these models, cross-chip feature-size variation increases linearly from 0µm to 0.03µm, affecting wire-widths, buffer-widths, and buffer channel-lengths. The four process-variation models are: top-to-bottom, bottom-to-top, left-to-right, and right-to-left. Not surprisingly, process-variation induced skew is highly correlated to clock-delay, and Figure 7(a) is a good approximation for process-variation and power trade-offs. For low-power applications we can trade clock-delay and process-variation robustness for power by choosing minimum-power embeddings. For high performance designs minimum-delay embeddings are in general preferable for their higher process-variation tolerance and near optimal power-consumption.
Conclusion and Future Work
We present a novel zero-skew buffered clock-tree synthesis algorithm using projected scan-line sampling that utilizes simultaneous buffer-insertion/sizing and wire-sizing. Compared with the previous work that uses wire-sizing alone, our algorithm reduces clock-delay significantly and uses less power. Experimental results show that our algorithm provides process-variation, clock-delay, and power trade-offs with efficient runtime and memory usage. In high performance designs minimum-delay embeddings are better due to their robustness to process-variation and their near optimal power-consumption.
We are currently extending our algorithm for zero-skew gated clock-tree optimization. We also plan to investigate the impacts of different clock-routings on clock-delay, power-consumption, and robustness to process-variation.
