Abstract: Clock networks contribute a significant fraction of dynamic power and can be a limiting factor in high-performance CPUs and SoCs. The need for multi-objective optimization over a large parameter space and the increasing impact of process variation make clock network synthesis particularly challenging. In this work, we develop new modeling techniques and algorithms, as well as a methodology, for clock power optimization subject to tight skew constraints in the presence of process variations. Key contributions include a new time-budgeting step for clock-tree tuning, accurate optimizations that satisfy budgets, and modeling and optimization of variational skew. Our implementation, Contango 2.0, outperforms the winners of the ISPD 2010 clock-network synthesis contest on 45nm benchmarks from Intel and IBM.
I. INTRODUCTION
Processor-based systems have fueled the development of electronics since the 1960s. PCs were the main driver of growth in electronics in the 1990s, and in the 2000s mobile phones and other battery-powered consumer devices became a significant market segment, followed by automotive electronics. The emphasis in CPU design has shifted from high performance to power-performance-cost trade-offs, including the advent of multicore CPUs and the growing popularity of low-power ARM CPUs. In the netbook market, the low-power 1.6GHz Atom CPU from Intel is currently competing with ARM's multicore 2GHz Cortex-A9 CPUs and the 1GHz Cortex-A8, but 98% of the world's mobile phones rely on ARM-based CPUs [13], which currently offer better power-performance-cost trade-offs than Intel CPUs [24].
ARM cores often drive system-on-chip (SoC) designs, laid out using low-power ASIC methodologies. Such methodologies perform automated clock-tree synthesis after placement, whereas traditional high-performance CPU methodologies predesign clock networks and use active deskewing to lower clock skew and susceptibility to process variations [19]. Clock trees are more susceptible to variations than meshes (common in CPUs), but are 2-4 times more power-efficient. This is significant because clock distribution networks and corresponding sequential elements consume up to 70% of CPU power and can affect power-performance comparisons between CPUs [20].
Recent developments in embedded CPU design stress the need for low-power clock trees, yet also impose stringent skew limits, especially in the presence of process, voltage and temperature (PVT) variation for sub-45nm CMOS technology. Previous clock-tree methodologies rely on symmetric and regular tree topologies, such as H-trees and fishbones [1, Chapter 43], which do not require sophisticated design algorithms (see Section II). However, these topologies experience difficulties with layout obstacles, non-uniform sink distributions, and varied sink capacitances. Fully-automated clock-tree synthesis supported by commercial EDA tools offers clear advantages in terms of capacitance, but may not be able to ensure sufficiently low skew for use in a 2GHz CPU. For example, the authors of [17] report clock trees generated by Cadence tools with skew that is orders of magnitude higher than the single-ps skew provided by clock meshes.
In this paper, we pursue the following research questions.
• How far can the skew of a high-performance clock tree be optimized?
• How can one minimize the impact of PVT variations in a clock tree?
• Given a single-picosecond skew requirement, how competitive are clock trees with clock meshes?
Our approach to answering these questions is inspired by the ISPD 2010 clock-network synthesis contest, which used several 2GHz CPU benchmarks from IBM and Intel to compare tools submitted by 10 teams across the world (downselected from 20 initial registrants). To evaluate the quality of the clock networks, difficult slew and skew constraints were checked against 45nm Monte-Carlo SPICE simulations that modeled PVT variations. Clock networks that cleared all constraints were compared by their total capacitance, a proxy for dynamic power. In this context, we developed a suite of algorithms for the design and thorough optimization of clock trees. The results of the ISPD 2010 contest offer a rare opportunity to compare multiple strategies for clock-network synthesis: the third-place team used symmetric trees [23], the second-place team used clock meshes, and our team won the contest by optimizing clock trees built by the DME algorithm [2], [6]. Specific innovations in our comprehensive methodology for clock-network synthesis include:
• The notion of local-skew slack for clock trees.
• A tabular technique to estimate the impact of variations on skew between two sinks.
• A path-based technique to enhance the robustness of a clock tree to PVT variations.
• A time-budgeting algorithm for clock-tree tuning that distributes delay targets to individual edges of the tree so as to improve skew with minimal power resources. This algorithm can be used in the context of PVT variations and is not specific to our methodology.
• Fine tuning of optimized clock trees by gentle wire snaking, sufficiently accurate to satisfy delay budgets.
| Processor | Year | Node (nm) | Freq. (MHz) | Clock topology | Deskew | Skew (ps) |
|---|---|---|---|---|---|---|
| IBM S/390 | 1997 | 200 | 400 | tree | - | 30 |
| IBM Power4 | 2002 | 180 | 1300 | tree+grid | - | 25 |
| Alpha 21264 | 1998 | 350 | 600 | grid | - | 65 |
| Pentium 2 | 1997 | 350 | 300 | spine | - | 140 |
| Pentium 3 | 1999 | 250 | 650 | spine | active | 15 |
| Pentium 4 | 2001 | 180 | 2000 | spine | active | 16 |
| Itanium | 2000 | 180 | 800 | tree+grid | active | 28 |
| Itanium 2 | 2003 | 130 | 1500 | tree+grid | fuse | 24 |
| ISPD 2010 | 2010 | 45 | 2000 | tree | - | 7.5 |

TABLE I: CLOCK NETWORKS IN INDUSTRY CPUS [1, CHAPTER 43] AND ISPD 2010 BENCHMARKS FROM INTEL AND IBM (TABLE II).
Our empirical results are compared to those of the winners of the ISPD 2010 clock-network contest, where each team violated prescribed skew constraints (7.5 ps in most cases) on at least some benchmarks in the presence of variations. In contrast, the results reported in this paper satisfy skew constraints on every benchmark. Our clock trees have 4.2× smaller capacitance than clock meshes produced by CNSrouter [25], while exhibiting smaller skew.
The remainder of this paper is organized as follows. Section II covers background and prior work. Section III describes optimization objectives and variation modeling. Section IV explains initial tree construction with buffering. Section V details the techniques for robustness improvements. Section VI outlines our skew optimization techniques. Our empirical results are described in Section VII. Conclusions are given in Section VIII.
II. BACKGROUND AND PRIOR WORK
Clock networks in microprocessors. A variety of clock network topologies and deskewing techniques were developed for microprocessors. Table I [25].
The deferred-merge embedding (DME) algorithm was proposed in [2], [6] based on the concept of merging segments. Timing optimization based on Elmore delay was also incorporated into DME algorithms. DME algorithms assume a binary clustering of clock sinks. Such clustering can be found by a recursive horizontal-vertical partitioning algorithm called the method of means and medians (MMM) in [11] or the geometric matching algorithm (GMA) in [5]. Other simple algorithms for clock-tree synthesis are discussed in [14]. A Dynamic Nearest-Neighbor Algorithm to generate tree topology and Walk-Segment Breadth First Search for routing and buffering were proposed in [22]. A three-stage CLR-driven CTS flow based on an obstacle-avoiding balanced clock-tree routing algorithm, monotonic buffer insertion, as well as wire sizing and wire snaking was proposed in [15]. A Dual-MST geometric matching approach was proposed in [16] for topology construction, along with recursive buffer insertion and a way to handle blockages.
SoC methodologies often spend significant effort dealing with hundreds of layout obstacles, while CPU layouts include very few obstacles. However, skew constraints are more difficult in CPU clock synthesis. Because of these differences and due to the incorporation of process variation into the ISPD 2010 contest, most of the above techniques were not adopted by the contestants.
III. MODELING AND OBJECTIVES
Before introducing our clock-tree synthesis methodology and specific optimizations in Sections IV-VI, we review key optimization objectives (global and local skew), define the notion of local-skew slack, and propose a simple yet effective model of process variation.
A. Global and local skew
Common terminology and notation are introduced next.
Definition 1: Given a clock tree $\Psi$, let $\lambda(s_i)$ be the clock latency (insertion delay) at sink $s_i \in \Psi$. Then the skew between two sinks $s_i$ and $s_j \in \Psi$ is defined as
$$\mathrm{skew}^{\Psi}(s_i, s_j) = |\lambda(s_i) - \lambda(s_j)|.$$
Global skew is defined as
$$\omega^{\Psi} = \max_{s_i, s_j \in \Psi} \mathrm{skew}^{\Psi}(s_i, s_j) = \max_{s_i \in \Psi} \lambda(s_i) - \min_{s_i \in \Psi} \lambda(s_i).$$
Nominal values of $\mathrm{skew}^{\Psi}(s_i, s_j)$ and $\omega^{\Psi}$ are computed neglecting the impact of variations. Global skew can be improved by decreasing $\max_{s_i \in \Psi} \lambda(s_i)$ (speeding up the slowest sinks) or increasing $\min_{s_i \in \Psi} \lambda(s_i)$ (delaying the fastest sinks). Previous publications on clock network synthesis focused on reducing global skew with or without the presence of variations [2], [3], [10], [12], [14]-[16], [22], [26]. However, in a large clock network, skew between adjacent and connected sinks is a more meaningful optimization objective [8], [18]. Local skew is defined by restricting eligible sink pairs to be within distance $\Delta > 0$, which is determined for a given circuit after timing-driven placement.
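Definition 2: Given a clock tree $\Psi$ and a local skew distance bound $\Delta > 0$, let $\mathrm{dist}(s_i, s_j)$ be the Manhattan distance between sinks $s_i$ and $s_j \in \Psi$. Then the worst local skew [25] is defined as
$$\omega^{\Psi}_{\Delta} = \max_{\mathrm{dist}(s_i, s_j) < \Delta} \mathrm{skew}^{\Psi}(s_i, s_j).$$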
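As a small illustration of Definitions 1 and 2, the sketch below computes pairwise, global and worst local skew from a table of sink latencies. The sink names, latencies and positions are made up for the example, and the strict inequality on distance is an assumption about how "within distance ∆" is meant.

```python
from itertools import combinations


def skew(lat, i, j):
    """Nominal skew between sinks i and j: |latency(i) - latency(j)|."""
    return abs(lat[i] - lat[j])


def global_skew(lat):
    """Global skew: largest sink latency minus smallest sink latency."""
    return max(lat.values()) - min(lat.values())


def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def worst_local_skew(lat, pos, delta):
    """Worst skew over sink pairs whose Manhattan distance is below delta."""
    worst = 0.0
    for i, j in combinations(lat, 2):
        if manhattan(pos[i], pos[j]) < delta:  # assumed strict bound
            worst = max(worst, skew(lat, i, j))
    return worst


# Hypothetical sinks: latencies in ps, positions in arbitrary length units.
lat = {"a": 500.0, "b": 504.0, "c": 490.0}
pos = {"a": (0, 0), "b": (10, 0), "c": (100, 100)}
print(global_skew(lat))                # 14.0
print(worst_local_skew(lat, pos, 50))  # 4.0
```

In this toy instance the global skew is set by the distant sink c, while the only pair closer than ∆ = 50 contributes a local skew of 4 ps, which is the quantity a local-skew constraint would check.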
Reducing skew down to single picoseconds in the presence of variations may require a significant increase in power consumption. Since more than 30% of total power in modern microprocessors is consumed by clock networks, minimizing clock-network capacitance is as important as skew minimization. Modern circuit designs can tolerate a certain amount of clock skew, so power can be reduced provided that the skew of the clock network remains below a given bound, even in the presence of variations.
Definition 3: Consider a clock tree $\Psi$, a local skew distance bound $\Delta > 0$, a variation model $\nu$ and a target yield $0 < y \le 1$. Let $\Psi_{\nu}$ be the clock tree $\Psi$ with variation $\nu$ and $f(t)$ be the cumulative distribution function of $\omega^{\Psi_{\nu}}_{\Delta}$. Then the worst local skew with variation is defined as
$$\omega^{\Psi}_{\Delta,\nu,y} = \min \{\, t : f(t) \ge y \,\}.$$
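One way to read Definition 3 in practice is as an empirical quantile of Monte-Carlo results: the smallest skew value that at least a fraction y of the simulated trees stay at or below. The sketch below assumes that reading; the sample values are invented and the function name is hypothetical.

```python
import math


def worst_local_skew_with_variation(samples, y):
    """Smallest t such that at least a fraction y of the samples are <= t."""
    if not samples or not (0.0 < y <= 1.0):
        raise ValueError("need at least one sample and 0 < y <= 1")
    ordered = sorted(samples)
    k = math.ceil(y * len(ordered))  # number of samples that must be covered
    return ordered[k - 1]


# Hypothetical worst-local-skew values (ps) from ten Monte-Carlo runs.
mc = [5.1, 6.3, 4.8, 7.9, 5.5, 6.0, 5.2, 6.8, 5.9, 6.1]
print(worst_local_skew_with_variation(mc, 0.95))  # 7.9
print(worst_local_skew_with_variation(mc, 0.5))   # 5.9
```

With only ten samples the 95% quantile coincides with the maximum; a real flow would use far more runs so that the tail is resolved.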
Viewing the local skew limit $\Omega_{\Delta}$ as a design constraint (see Table II), we pursue the following goals.
1) Building variation-tolerant clock networks with $\omega^{\Psi}_{\Delta,\nu,y} < \Omega_{\Delta}$, subject to slew constraints.
2) Minimizing clock-tree power.
B. Local-skew slack
Given a clock tree with known sink latencies, one can optimize it using delay budgets derived from the sink- and edge-slack calculation [14, Section 3], followed by global skew optimization to reduce global skew below $\Omega_{\Delta}$. This strategy is sound because local skew $\omega_{\Delta}$ cannot exceed global skew. However, global skew optimizations attempt to reduce skew between sinks more distant than $\Delta$, which may require an unnecessary increase in power.
To tune the clock tree on a tight power budget, we propose the concept of local-skew slack.
Definition 4: Given a clock tree $\Psi$ and local-skew constraints $\Omega_{\Delta}$, the local-skew slack $\sigma(s)$ for a sink $s \in \Psi$ is the minimum amount of additional delay in picoseconds for $s$, so that the tree satisfies $\omega^{\Psi}_{\Delta} < \Omega_{\Delta}$.
It is used in Algorithm 1 to calculate $\sigma(s)$ for every sink. This algorithm uses $\mathrm{varEst}(s_i, s_j) = 0$ in the absence of variations, and otherwise the definition in Section III-C.
Once local-skew slacks $\sigma(s)$ are computed for all sinks, we define the local-skew slack of a tree edge $e$ as the smallest slack of a downstream sink. Edge slacks in the entire tree can be computed by one recursive tree traversal in linear time, giving the optimal amount of tuning to improve worst local skew [14, Section 3]. Figure 1 illustrates the computation of local-skew slack for sinks and edges.
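The edge-slack rule above can be sketched as one bottom-up pass over the tree. The second function shows one plausible way to turn edge slacks into per-edge delay budgets (each edge receives its slack minus what its ancestors already provide); that top-down step is our assumption about the budgeting in [14], not a transcription of it. The tree encoding and the names `edge_slacks` and `edge_budgets` are hypothetical, and sink slacks are taken as given.

```python
def edge_slacks(children, sink_slack, root):
    """Slack of the edge entering each node: min slack of its downstream sinks."""
    slack = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:  # a leaf is a sink
            slack[node] = sink_slack[node]
        else:
            slack[node] = min(visit(k) for k in kids)
        return slack[node]

    visit(root)
    return slack


def edge_budgets(children, slack, root):
    """Delay to add on each edge so that every sink receives exactly its slack."""
    budget = {}
    stack = [(root, 0.0)]  # (node, delay already assigned on edges above it)
    while stack:
        node, assigned = stack.pop()
        budget[node] = slack[node] - assigned
        for k in children.get(node, []):
            stack.append((k, slack[node]))
    return budget


# Hypothetical tree: root r drives internal node u and sink s3; u drives s1, s2.
children = {"r": ["u", "s3"], "u": ["s1", "s2"]}
sink_slack = {"s1": 4.0, "s2": 6.0, "s3": 0.0}
slack = edge_slacks(children, sink_slack, "r")
budget = edge_budgets(children, slack, "r")
print(slack["u"], budget["u"], budget["s1"], budget["s2"])  # 4.0 4.0 0.0 2.0
```

Here the 4 ps shared by s1 and s2 is placed once on the edge into u, and only the remaining 2 ps for s2 is placed on its own edge, so shared delay is not paid for twice.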
C. Modeling process variation
Designing low-capacitance low-skew clock trees without considering process, voltage and temperature variations often results in significant skew in each chip. However, variation-aware optimization has not been explored until recently and requires reliable estimation techniques. Monte-Carlo simulations are slow and not suitable for clock network optimization. Instead, we develop a tabular technique to account for variation in single-shot timing analysis.
When two sinks can be connected by a short path in the tree, variation of skew between them is small. On the other hand, variational skew between sinks that are geometrically close can be significant if the unique tree-path between them is long. This is illustrated in Figure 2 .
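To make the distinction between geometric distance and tree-path length concrete, the sketch below computes the wire length and buffers on the unique tree path between two sinks. The parent-pointer encoding, the per-edge buffer lists and all numbers are hypothetical; counting only buffers strictly below the lowest common ancestor is also an assumption.

```python
def ancestors(parent, n):
    """Chain of nodes from n up to the root, inclusive."""
    chain = [n]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def path_stats(parent, edge_len, edge_bufs, a, b):
    """Wire length, buffer count and buffer types on the tree path a..b.

    edge_len[n] and edge_bufs[n] describe the edge from n up to parent[n].
    """
    up_a, up_b = ancestors(parent, a), ancestors(parent, b)
    common = set(up_a) & set(up_b)
    # Nodes strictly below the lowest common ancestor; their parent edges
    # form the path between a and b.
    below = [n for n in up_a + up_b if n not in common]
    length = sum(edge_len[n] for n in below)
    bufs = [t for n in below for t in edge_bufs.get(n, [])]
    return length, len(bufs), bufs


# Hypothetical tree: s1 and s3 share parent u; s2 hangs under v; u, v under r.
parent = {"s1": "u", "s3": "u", "s2": "v", "u": "r", "v": "r"}
edge_len = {"s1": 50, "s3": 60, "s2": 40, "u": 900, "v": 950}
edge_bufs = {"u": [8, 8], "v": [8, 8, 4]}
print(path_stats(parent, edge_len, edge_bufs, "s1", "s3"))  # (110, 0, [])
print(path_stats(parent, edge_len, edge_bufs, "s1", "s2"))  # (1940, 5, [8, 8, 8, 8, 4])
```

Even if s1 and s2 were placed next to each other on the die, their tree path runs through the root and crosses five buffers, which is the situation in which variational skew can be large.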
Our key insight is that the impact of variations on skew between two sinks is closely correlated with tree path length and how the tree path is buffered. Therefore, for a given technology node, buffer library, wires and variation model, we propose to build a look-up table with comprehensive information regarding the worst-case variation on skew for various paths between pairs of sinks. 
[Algorithm 1 (local-skew slack computation): only the closing lines survive: SinkQ.enqueue($s_j$); end if; end for; end while]
Definition 5: Given a technology node $T$, buffer and wire library $B$, variation model $\nu$ and desired yield $0 < y \le 1$, let $\Xi_{T,B,\nu,y}[w, b, t]$ be the variation-estimation table which returns the worst-case increase in skew (with probability $y$) between two sinks connected by a tree path of length $w$ with $b$ buffers and the buffer type $t$. When multiple buffer types are used in the tree path, $t$ is the smallest type in the tree path, so as to avoid under-estimation of variation.
To build the table, we generated a large number of test trees on public CNS benchmarks and randomly generated benchmarks. The initial tree-construction method explained in Section IV is used with various buffer types to produce the test trees. The number of Monte-Carlo SPICE simulations is determined based on the given variation model $\nu$. Variational skew between any two sinks during the simulations is recorded in the table, classified by $w$, $b$ and $t$. The table is later restructured to represent a probability density function for each $(w, b, t)$ entry, so that it can be queried with yield $y$. Building the variation-estimation table requires extensive simulations, but once the table is built, it can be used for many clock trees.
To determine the impact of variation on skew between sinks in a clock tree, a function $\mathrm{varEst}(s_i, s_j)$ is defined as follows. Given a clock tree $\Psi$ and a variation table $\Xi_{T,B,\nu,y}$, let $L(s_i, s_j)$ be the total length of wires, $b_n(s_i, s_j)$ be the total number of buffers and $b_t(s_i, s_j)$ be the largest buffer type in the tree path between two sinks $s_i$ and $s_j \in \Psi$. The variation table is accessed by the function
$$\mathrm{varEst}(s_i, s_j) = \Xi_{T,B,\nu,y}\big[L(s_i, s_j),\, b_n(s_i, s_j),\, b_t(s_i, s_j)\big].$$
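A minimal sketch of such a table is given below, assuming path lengths are bucketed at a fixed step and each bucket stores the skew increases observed in simulation. The class name, the bucketing scheme and the sample values are all hypothetical; the source does not specify how entries are indexed or interpolated.

```python
import math
from collections import defaultdict


class VariationTable:
    """Toy variation-estimation table keyed by (length bucket, buffers, type)."""

    def __init__(self, length_step):
        self.length_step = length_step    # assumed bucket width for path length w
        self.samples = defaultdict(list)  # key -> observed skew increases (ps)

    def _key(self, w, b, t):
        return (int(w // self.length_step), b, t)

    def record(self, w, b, t, skew_increase):
        """Store one Monte-Carlo observation for a path of class (w, b, t)."""
        self.samples[self._key(w, b, t)].append(skew_increase)

    def lookup(self, w, b, t, y):
        """Smallest recorded value covering a fraction y of the observations."""
        data = sorted(self.samples.get(self._key(w, b, t), []))
        if not data:
            raise KeyError("no simulations recorded for this path class")
        return data[math.ceil(y * len(data)) - 1]


def var_est(table, length, n_bufs, buf_type, y):
    """Variation estimate for one sink pair, given its path statistics."""
    return table.lookup(length, n_bufs, buf_type, y)


table = VariationTable(length_step=100)
for value in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 9.0]:
    table.record(1940, 5, 8, value)  # invented observations for one path class
print(var_est(table, 1990, 5, 8, y=0.9))  # 5.0
```

The query for length 1990 falls into the same bucket as the recorded length 1940, so the expensive simulations are paid once and reused for every tree path of the same class.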
To estimate the impact of variations when optimizing clock trees, we utilize $\mathrm{varEst}()$ when computing local-skew slack for each sink (Algorithm 1). Without considering variations, it is sufficient to satisfy $\mathrm{skew}(s_i, s_j) < \Omega_{\Delta}$ for all pairs of sinks within $\Delta$. However, in the presence of variations, we have the following result.
IV. INITIAL TREE CONSTRUCTION AND BUFFER INSERTION
We invoke the unmodified ZST-DME algorithm [4], [10] and perform initial buffer insertion to minimize source-to-sink Elmore delay, rather than skew or capacitance [7], [21]. Elmore delay is too inaccurate for skew optimization, but our approach creates significant room for tuning the clock tree by delaying fast paths [14]. In the presence of layout obstacles, proper obstacle handling is required to avoid violations due to obstacles. The ISPD 2010 benchmarks include obstacles over which wire routing is possible but buffer insertion is not allowed. We adapted a simple and robust technique for obstacle avoidance in clock trees from [14], which repairs obstacle violations in the trees obtained by the ZST-DME algorithm.
When multiple wire types are available, the choice of wires affects both total power and susceptibility to variations. Under tight skew constraints in high-performance CPU designs, thicker wires (on a given metal layer) are preferable because they limit the impact of variations and still allow for future power-performance trade-offs by wire sizing. In less aggressive ASIC and SoC designs, power optimization may motivate thinner wires. But upsizing wires in a reasonably tuned clock tree may be of limited use because it increases capacitance, potentially leading to slew violations.
Selecting buffer types for initial buffer insertion is also important. Given an initial tree without buffers $\Psi_0$, let $t(s_i, s_j)$ be the type of buffer required for the tree path between two sinks $s_i$ and $s_j \in \Psi_0$ to satisfy $\mathrm{varEst}(s_i, s_j) < \Omega_{\Delta}$. $t(s_i, s_j)$ can be found from the variation-estimation table $\Xi_{T,B,\nu}$ with $L(s_i, s_j)$. Since $b_n(s_i, s_j)$ is not available at this step, it is difficult to find the exact required $t(s_i, s_j)$. However, because $b_n(s_i, s_j)$ and $L(s_i, s_j)$ are highly correlated, $b_n(s_i, s_j)$ can be estimated as the average number of buffers corresponding to $L(s_i, s_j)$. Once $b_n(s_i, s_j)$ is estimated, $t(s_i, s_j)$ can be computed as described in Section V. The initial buffer type $t_0$ for a given initial tree is computed as
$$t_0 = \operatorname{Avg}_{s_i, s_j \in \Psi_0}\, t(s_i, s_j).$$
Once $t_0$ is determined, we adopt the fast variant of van Ginneken's algorithm from [21] for initial buffer insertion. $b_n(s_i, s_j)$ for all $s_i, s_j \in \Psi$ is determined after initial buffer insertion, and a more accurate $t(s_i, s_j)$ can be obtained. For sink pairs that do not satisfy $\mathrm{varEst}(s_i, s_j) < \Omega_{\Delta}$, we use the robustness-improvement algorithm from Section V to ensure that the tree eventually satisfies $\omega^{\Psi}_{\Delta,\nu,y} < \Omega_{\Delta}$.
V. ROBUSTNESS IMPROVEMENTS
The initial buffer insertion algorithm cannot accurately estimate the buffer types required to meet local-skew constraints for a given initial tree. Therefore robustness improvement must follow initial buffer insertion, so that $\omega^{\Psi}_{\Delta,\nu,y} < \Omega_{\Delta}$ holds after all the skew optimization techniques are applied.
In an ideal situation in which we can reduce all the skew down to 0, $\mathrm{varEst}(s_i, s_j) < \Omega_{\Delta}$ for all $s_i, s_j \in \Psi$ is sufficient to satisfy $\omega^{\Psi}_{\Delta,\nu,y} < \Omega_{\Delta}$. In practice we must estimate the nominal local skew $\mathrm{skew}^{\Psi}_{est}$ after accurate optimizations, which we upper-bound by 5ps based on experience.
The target buffer type for the tree path between sinks $s_i$ and $s_j$, $t(s_i, s_j)$, can be computed as the smallest $t$ such that
$$\Xi_{T,B,\nu,y}\big[L(s_i, s_j),\, b_n(s_i, s_j),\, t\big] < \Omega_{\Delta} - \mathrm{skew}^{\Psi}_{est}.$$
In this way, the smallest buffer type that satisfies $\mathrm{varEst}(s_i, s_j) < \Omega_{\Delta} - \mathrm{skew}^{\Psi}_{est}$ is selected to reduce capacitance. Once $t(s_i, s_j)$ is determined, the buffers in the tree path between sinks $s_i$ and $s_j$ are substituted with type $t(s_i, s_j)$ buffers. This step is repeated for all eligible pairs of sinks within distance $\Delta$.
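The selection rule above amounts to a linear scan over buffer types in increasing size. The sketch below assumes that types are numbered so that a larger number means a larger, more variation-tolerant buffer, and it uses an invented closed-form stand-in (`toy_lookup`) for the variation table; neither assumption comes from the source.

```python
def target_buffer_type(var_lookup, types, length, n_bufs, omega, skew_est=5.0):
    """Smallest buffer type t with var_lookup(length, n_bufs, t) < omega - skew_est.

    Returns None when no type in the library meets the bound.
    """
    bound = omega - skew_est
    for t in sorted(types):  # smallest (lowest-capacitance) type first
        if var_lookup(length, n_bufs, t) < bound:
            return t
    return None


def toy_lookup(w, b, t):
    """Invented stand-in for the table: variation shrinks as buffers grow."""
    return 0.004 * w / t + 0.1 * b


# Local skew limit 7.5 ps and the 5 ps nominal-skew allowance from the text.
print(target_buffer_type(toy_lookup, [1, 2, 4, 8], 1000, 4, omega=7.5))  # 2
print(target_buffer_type(toy_lookup, [1, 2, 4, 8], 5000, 4, omega=7.5))  # None
```

A `None` result signals that buffer upsizing alone cannot meet the bound for that pair, in which case some other measure would be needed; what the authors do in that case is not stated in this section.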
VI. SKEW OPTIMIZATIONS
In this section, several local skew optimization techniques are described. Each technique is designed to reduce skew under different circumstances, but the primary objective is to optimize the skew of a given tree to below the local skew limit in the presence of variations. The target tuning amount for each edge of the tree can be determined by local-skew slack, including the variation modeling described in Section III.
A. Wire snaking
Wire sizing and wire snaking are popular techniques for skew optimization and are often able to reduce global or local skew down to the practical skew limit. In this context, however, we exclude wire sizing because narrowing down a wire in the middle of a clock tree is risky due to the impact of variations. We extend the wire snaking technique from [14] to improve its speed and accuracy, while limiting its use of routing resources.
The optimal tuning amount for each edge can be obtained by the top-down slack computation explained in Section III-B. Let $T_{target}(e)$ be the amount of time in ps by which the edge $e$ must be delayed to achieve legal $\omega_{\Delta}$ under local skew constraints. $L_{sn}(e)$ denotes the length of the wire determined by the wire snaking algorithm to delay the edge $e$ by $T_{target}(e)$. Let $T_{actual}(e)$ be the amount of time in ps by which the edge $e$ is actually delayed by a wire of length $L_{sn}(e)$. Ideally, the wire snaking algorithm estimates $L_{sn}(e)$ so that $T_{target}(e) = T_{actual}(e)$; $L_{ideal}(e)$ is the length which satisfies $T_{target}(e) = T_{actual}(e)$. The total additional capacitance from wire snaking, $TotalCap_{sn}$, is
$$TotalCap_{sn} = \sum_{e \in \Psi} \kappa\big(L_{sn}(e)\big),$$
where $\kappa(w)$ denotes the capacitance of a wire $w$, and the ideal total additional capacitance $TotalCap_{ideal}$ is
$$TotalCap_{ideal} = \sum_{e \in \Psi} \kappa\big(L_{ideal}(e)\big).$$
In practice, $T_{actual}(e) \ne T_{target}(e)$ unless extensive SPICE simulations are performed to find $L_{sn}(e)$, which is unrealistic in terms of runtime for a clock network synthesis flow. When $T_{actual}(e) < T_{target}(e)$, another round of wire snaking is required to bring $T_{actual}(e)$ closer to $T_{target}(e)$. $L^{i}_{sn}(e)$ denotes the length of the wire determined at the $i$-th iteration of the wire snaking algorithm to delay the edge $e$. If the gap between $T^{i}_{actual}(e)$ and $T^{i}_{target}(e)$ is too big, more iterations will be needed for $T_{actual}(e)$ to approach $T_{target}(e)$. We improve the accuracy of wire snaking in two ways.
Delay model for wire snaking. To keep $T^{i}_{actual}(e) \le T^{i}_{target}(e)$ with optimal quality, we define $\alpha$, where
$$\alpha \cdot T^{i}_{target}(e) \le T^{i}_{actual}(e) \le T^{i}_{target}(e).$$
The wire snaking algorithm aims for $T^{i}_{actual}(e)$ to satisfy the above inequality with the highest $\alpha$ possible. When $\alpha$ is specified, the worst-case number of wire snaking iterations $N$ required to bring $T_{actual}(e)$ to within error rate $\varepsilon$ of $T_{target}(e)$ is
$$N = \left\lceil \log_{1-\alpha} \varepsilon \right\rceil.$$
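The iteration bound can be checked numerically under one natural reading: if every round of snaking achieves at least a fraction α of the delay still missing, the remaining gap shrinks by a factor of at most (1 − α) per round, so N is the smallest integer with (1 − α)^N ≤ ε. The sketch below assumes that reading. The value ε = 1% is our assumption, since the text does not state the error rate that was used.

```python
import math


def snaking_iterations(alpha, eps):
    """Smallest N with (1 - alpha)**N <= eps, assuming 0 < alpha, eps < 1."""
    if not (0.0 < alpha < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("alpha and eps must lie strictly between 0 and 1")
    return math.ceil(math.log(eps) / math.log(1.0 - alpha))


def worst_case_rounds(target, alpha, eps):
    """Simulate the worst case: each round adds only alpha of the missing delay."""
    remaining, rounds = target, 0
    while remaining > eps * target:
        remaining -= alpha * remaining
        rounds += 1
    return rounds


for alpha in (0.6, 0.7):
    print(alpha, snaking_iterations(alpha, 0.01), worst_case_rounds(100.0, alpha, 0.01))
# 0.6 6 6
# 0.7 4 4
```

With the assumed ε = 1%, α = 60% gives N = 6 and α = 70% gives N = 4, which matches the range 4 ≤ N ≤ 6 quoted for the achieved α values.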
Closed-form delay models like Elmore delay are not accurate enough to keep $T^{i}_{actual}(e) \le T^{i}_{target}(e)$ and $\alpha$ high. To enhance the quality of estimation by the wire snaking algorithm, look-up tables for $L^{i}_{sn}(e)$ are built by performing a set of SPICE simulations for each technology environment, which includes the technology model, the types of buffers and wires, and the variation specification. In the simulations, $T^{i}_{actual}(e)$ is measured for different snaking lengths at various node locations in various types of clock trees. The results of the simulations are stored in a look-up table, used by wire snaking during local skew optimization. We achieved $\alpha$ values between 60% and 70%, therefore $4 \le N \le 6$. Only one technology environment was used at the ISPD 2010 CNS contest, requiring a single set of simulations.
Optimal node selection for wire snaking. Figure 3 compares two different styles of wire snaking. Figure 3(b) illustrates the undesired delay of sinks after wire snaking on non-buffer-output nodes. The capacitance and resistance added by wire snaking affect the driving buffer, which results in additional delay of slow sinks. Wire snaking at buffer output nodes, as in Figure 3(c), is much more accurate than wire snaking at an arbitrary branch. Limiting wire snaking to buffer output nodes reduces the number of SPICE calls required for clock-tree tuning. It also reduces the number of simulations for building the look-up table by limiting the number of target nodes to be tested.
Wire snaking usually increases the slew rate at the input nodes of downstream buffers. To prevent slew violations, the slew rates of downstream buffers are checked, and if the worst slew rate is more than 70% of the given slew limit, the target node is excluded from wire snaking.
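The two node-selection rules in this subsection (snake only at buffer outputs, and skip nodes whose downstream slew is already above 70% of the limit) can be expressed as a simple filter. The dictionary fields and the function name below are hypothetical; the 100ps limit in the example is the contest value reported in Section VII.

```python
def eligible_for_snaking(node, slew_limit_ps, margin=0.70):
    """True if the node is a buffer output with enough slew headroom."""
    if not node["is_buffer_output"]:
        return False
    # Worst slew among downstream buffer inputs; no downstream buffers -> 0.
    worst = max(node["downstream_slews_ps"], default=0.0)
    return worst <= margin * slew_limit_ps


# Hypothetical candidate nodes.
nodes = [
    {"name": "n1", "is_buffer_output": True, "downstream_slews_ps": [55.0, 62.0]},
    {"name": "n2", "is_buffer_output": True, "downstream_slews_ps": [74.0]},
    {"name": "n3", "is_buffer_output": False, "downstream_slews_ps": [40.0]},
]
print([n["name"] for n in nodes if eligible_for_snaking(n, 100.0)])  # ['n1']
```

Node n2 is rejected because 74ps exceeds 70% of the 100ps limit, and n3 because it is not a buffer output.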
B. Delay buffer insertion
[Figure residue: panels (a), (b), (c), annotated with sink latencies 470ps, 500ps, 490ps, 505ps, 490ps, 500ps]
The local skew of a sink cluster driven by the same final buffer is often negligible. However, highly unbalanced sink capacitances or layout obstacles in those clusters can result in significant local skew. An alternative technique is needed because wire snaking in Section VI-A is inapplicable. In this case, inserting a buffer at the target node is very efficient for two reasons. First, skew can be reduced by the delay of the inserted buffer. Second, further precise wire snaking is possible because the inserted buffer isolates the target node from the remainder of the cluster. Let $W(B)$ be the set of sinks driven by a final buffer $B$ and $d(B)$ be the delay of the buffer $B$. Delay buffer insertion is required if there exists
For each path from the buffer to the sinks, inserting at most one buffer is sufficient, since the wire snaking algorithm in Section VI-A can be invoked again at the output node of inserted buffers. Figure 4 illustrates the delay buffer insertion algorithm followed by wire snaking. When a delay buffer is inserted, it is placed at the node so that the input capacitance of the delay buffer is comparable to the sum of downstream sink and wire capacitance of the target node; thus sink latency in the other path changes very little (see Figure 4(b)).
VII. EMPIRICAL VALIDATION
Our implementation, Contango 2.0, is written in C++ and is based on our software Contango 1.0 [14] that shared the first place at the ISPD 2009 clock-network synthesis contest. Contango 2.0 was the sole winner of the ISPD 2010 contest, but we now report significantly stronger results.
ISPD 2010 benchmarks. Table II lists the statistics of all benchmarks from the ISPD 2010 contest. The contest limited slew to 100ps, and all reported clock networks satisfy this constraint. Slews in Contango 2.0 trees do not exceed 81ps. Table III compares Contango 2.0 with CNSrouter and NTUclock. Clock networks produced by our software have smaller capacitance than those of CNSrouter and NTUclock, on average by 4.22× and 4.13× respectively. The contest imposed local skew constraints with yield y = 95%. Our clock trees always achieve yield above 95%, while CNSrouter violates yield constraints on three benchmarks and NTUclock on all benchmarks except one. All three teams satisfied the 12-hour runtime limit for all benchmarks. Our data suggest that wire snaking usually increases wire length by 1-3% (5.43% in one case), which is small enough that the negative effects of wire snaking can be neglected. Figure 7 compares probability density functions (pdf) produced by Monte-Carlo SPICE simulations of our clock trees to those of clock meshes produced by CNSrouter. One such clock tree is illustrated in Figure 5.
Despite the dramatic differences in network topology and total capacitance between trees and meshes, some of the plots in Figure 7 bear striking resemblance (cns01, cns02, cns04, cns05). To explain this phenomenon, we recall that meshes cannot be buffered directly and are therefore driven by a buffered clock tree. Such a clock tree can be constructed by the same DME algorithm that we use, which is why the pdf profiles in Figure 7 reflect the pointset of sink locations. Apparently, the mesh does not significantly change this profile.
Power versus robustness to variations. Figure 6 describes experiments on benchmark ispd10cns08 with different local skew constraints. When tight local skew constraints are given, large buffers are required to ensure robustness to variations, increasing the capacitance of the clock tree. On the other hand, a large portion of capacitance can be saved when local skew constraints are loose. To clarify the impact of variation, we plot variational skew (y-axis), defined as $\omega_{\Delta,\nu,y} - \omega_{\Delta}$ for $\Delta$, $\nu$, $y$ from Table II.
VIII. CONCLUSIONS
Power-performance-cost trade-offs are becoming a major issue in modern high-performance CPU clock designs. Mesh structures often sacrifice power to improve robustness to variations. We propose a tree solution for CPU clock routing that improves power consumption under tight skew constraints in the presence of variations. To this end, we introduce the notion of local-skew slack for clock trees, a model for variational skew, a path-based technique to enhance robustness, a new time-budgeting algorithm for clock-tree tuning and accurate optimizations that satisfy budgets. We have shown that clock trees can be tuned to have nominal
