Abstract: Clock trees are commonly used to deliver clock signals to sequential elements in circuits. However, by construction, tree structures are inherently prone to failure caused by variations. The robustness of a clock tree can be improved by inserting redundancy in the form of cross links or multilevel fusion trees. Such near-tree structures can provide robustness at low cost. In this paper, we establish that the locations of the inserted redundancy are crucial in providing cost-effective robustness. We present two methods to systematically insert redundancy. The redundancy is realized by either inserting cross links or performing local merges. Moreover, we present a vertex reduction method that reduces the amount of redundancy that needs to be inserted in our near-tree structures. Empirical results show that our structures are more robust to variations and have lower power consumption compared to the state-of-the-art clock networks. Furthermore, our near-tree structures provide smooth trade-offs between cost and robustness, reducing clock skews by 11%-39% at an expense of 3%-68% higher power consumption.
I. INTRODUCTION

RELIABILITY and low power consumption are the primary objectives in clock network synthesis. Both the reliability and power consumption are determined by the choice of the clock network topology. Clock trees are sensitive to process, voltage, and temperature (PVT) variations. While nontree structures are more robust to PVT variations, they consume more power than tree structures. In 2009 and 2010, the International Symposium on Physical Design (ISPD) held two clock contests [1]; both contests focused on constructing robust and low-power clock networks. Based on the two contests, several tree structures [2]-[4], nontree structures [5], [6], and near-tree structures [7], [8] have been proposed.
Nontree structures such as meshes [5] , [6] may be robust, but they are also power hungry. Near-tree structures, i.e., structures that are close to being a tree, are alternatives to nontree structures. By inserting a small number of redundant paths to a clock tree, the robustness can be improved significantly, with a small increase in power consumption. The inserted redundancy in previous studies has been in the form of cross links [7] , [9] - [11] or multilevel fusion trees [8] .
In [7] , it was shown that cross links placed between internal nodes in a clock tree seemed to be more effective when compared to those inserted between leaf nodes. In [8] , additional clock trees were constructed to connect sensitive locations in an existing clock tree. Next, these additional trees were fused with the original clock tree into a "multilevel fusion tree."
This paper provides further insight into the advantages and limitations of near-tree structures in variations-aware clock network synthesis. Based on our analysis, we find that many cross links inserted using the method in [7] may provide only limited improvements in robustness. The redundancy inserted in [8] is effective in improving the robustness of the network, but is very costly in terms of power consumption.
We also present an illustrative study to show the effect of inserting different types of redundancy at different locations in a tree topology. Based on this study, a new family of cross link insertion techniques is proposed. The cross links are inserted between internal nodes that are distant in the tree topology. We refer to these clock trees with cross links as cross-linked networks (CLNs). Moreover, we propose a new near-tree structure called the locally merged network (LMN). In this near-tree structure, internal nodes are merged to insert redundancy. Both of our proposed near-tree structures are constructed by inserting redundancy at locations in a clock tree that are more prone to failure caused by variations. In addition, we propose a vertex reduction (VR) method; this method reduces the amount of redundancy that needs to be inserted for both CLNs and LMNs.
Experimental results reveal that our near-tree structures are more robust compared to the clock networks constructed in previous studies, and at the same time consume less power. This paper is organized as follows: the problem formulation is described in Section II. Previous works on near-tree structures are reviewed in Section III. In Section IV, a case study is presented. The synthesis details of the proposed near-tree structures are provided in Section V. The VR method is presented in Section VI. Finally, the experimental results and conclusions are presented in Sections VII and VIII, respectively.
II. PROBLEM DEFINITION
This paper considers a variations-aware clock synthesis problem that is based on the problem formulation in the 2010 ISPD contest [1] and the extension of the variations model in [3]. The objective in the considered problem is to construct a clock network that consumes the least amount of power while meeting timing constraints under variations and also blockage constraints. A clock network consists of a source delivering a synchronizing signal to a set of sequential elements called sinks. The source-sink connections are realized using devices and wires, where a device is either an inverter or a buffer, i.e., two inverters connected in series. The wires and devices available are specified in a wire and device library, respectively.
0278-0070 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The difference in arrival time of the clock signal between a pair of sinks is the clock skew. Pairs of sinks that require the clock skew to be limited are said to be sequentially related. In this paper, sink pairs that are spatially separated by less than a specified local skew distance [1] are considered to be sequentially related.
The timing constraints consist of a network skew constraint and a slew constraint. The network skew is computed by performing 500 Monte Carlo simulations of the network using NGSPICE [12] , with the network subject to supply voltage and wire width variations as follows.
1) The nominal supply voltage V_nom of each device is affected by ±V voltage variations.
2) The nominal wire width w_nom of each wire is affected by ±w wire width variations.
The skew between a sequentially related sink pair is called the local clock skew. In each Monte Carlo simulation, the worst local clock skew (wLCS) is recorded. The skew of the clock network is defined to be the 95th percentile of the 500 recorded wLCSs, denoted as 95%-skew. Each clock network must meet a specific constraint on the 95%-skew. Moreover, at any point in the clock network, in all of the 500 Monte Carlo simulations, the signal transition time, or slew, must meet a specific slew constraint. The power consumption of a clock network is defined as the sum of the average rising and falling power consumption in the 500 Monte Carlo simulations. To enable comparison with previous studies, the capacitive cost, i.e., the total capacitance of the clock network, is also reported.
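The skew metric defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation flow used in the experiments: the arrival times would come from NGSPICE but are passed in here as plain lists, and the Manhattan distance metric, the nearest-rank percentile, and all function names are assumptions made for the example.

```python
from itertools import combinations

def related_pairs(sink_xy, local_skew_distance):
    """Index pairs of sinks closer than the local skew distance.
    Manhattan distance is an assumption; the text only says 'spatially separated'."""
    pairs = []
    for i, j in combinations(range(len(sink_xy)), 2):
        (xi, yi), (xj, yj) = sink_xy[i], sink_xy[j]
        if abs(xi - xj) + abs(yi - yj) < local_skew_distance:
            pairs.append((i, j))
    return pairs

def worst_local_skew(arrival, pairs):
    """wLCS of one Monte Carlo run: largest skew over sequentially related pairs."""
    return max(abs(arrival[i] - arrival[j]) for i, j in pairs)

def skew_95(runs, pairs):
    """95th percentile of the per-run wLCS values (nearest-rank rule, assumed)."""
    wlcs = sorted(worst_local_skew(arrival, pairs) for arrival in runs)
    rank = (95 * len(wlcs) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return wlcs[rank - 1]
```

With 500 runs, the nearest-rank rule used here returns the 475th smallest wLCS; other percentile conventions would interpolate between neighboring samples.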
We follow the approach in [3] to generate the supply voltage variations. All inverters placed at the same location experience the same supply voltage variations. In [3], this approach is called the single location single voltage (SLSV) variations model. In the original ISPD variations model, each inverter experienced a different supply voltage. In such a model, the effect of the supply voltage variations can be averaged out by placing many inverters in parallel, as observed in [3] and [6]. In [3], the ISPD and SLSV variations models are compared and analyzed in detail. It was concluded that the robustness of a clock network under the ISPD model can be misleading. By placing multiple inverters in parallel, which effectively is a form of device sizing, an unrealistic trade-off between cost and robustness is obtained as each inverter experiences a different supply voltage in the ISPD model. Therefore, for the remainder of this paper we choose to focus on the SLSV variations model. The SLSV variations are much more severe compared to the ISPD variations; therefore, if we can meet skew constraints under the SLSV variations model we could easily meet skew constraints under the ISPD variations model [3].
Fig. 1. A clock tree connecting a source to a set of sinks (A-H). The sink pairs with arrows between them are considered to be sequentially related. The LCA of some sink pairs is shown. Sequentially related sink pairs with an LCA close to the root of the clock tree are called critical sink pairs. In this example, the most critical sink pair is the pair (D, E).
III. PREVIOUS WORK
Many studies have been performed in the field of clock network synthesis [2] , [5] - [11] , [13] , [14] . As previous studies have revealed that near-tree structures can have robust performance even under PVT variations, we concentrate the attention of this paper on such structures. Specifically, we review the methods of constructing clock trees with cross links [7] , [9] - [11] and clock trees with multilevel fusion trees [8] . First, we introduce some definitions and notation.
An example of a clock network in the form of a clock tree is shown in Fig. 1. Moreover, a clock tree can be decomposed into stages as shown in Fig. 1. Each stage consists of a set of dc-connected subtrees (DCCSs). Each DCCS is formed by a subtree and a corresponding driving device, as shown in Fig. 1. Each sink pair has a lowest common ancestor (LCA) [8], as illustrated in Fig. 1. The skew introduced by variations between a pair of clock sinks is expected to be greater if the sinks are more distant in the topology, i.e., they have an LCA that is closer to the root of the clock tree. Sink pairs that are sequentially related are illustrated with a double-ended arrow. Pairs of sinks that are sequentially related and have an LCA high in the tree are called critical sink pairs in [8].
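The notion of criticality described above can be illustrated by ranking sequentially related sink pairs by the depth of their LCA. The following is a minimal sketch; the parent-map representation, the function names, and the balanced eight-sink tree are hypothetical and do not reproduce the exact topology of Fig. 1.

```python
def path_to_root(parent, node):
    """Nodes visited when walking from `node` up to the root (inclusive)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def lca(parent, a, b):
    """Lowest common ancestor of nodes a and b."""
    ancestors = set(path_to_root(parent, a))
    return next(n for n in path_to_root(parent, b) if n in ancestors)

def rank_by_criticality(parent, related_pairs):
    """Sort pairs so that those with an LCA closest to the root come first."""
    def lca_depth(pair):
        return len(path_to_root(parent, lca(parent, *pair))) - 1
    return sorted(related_pairs, key=lca_depth)

# Hypothetical balanced tree over sinks A-H (an assumption for illustration).
PARENT = {
    "root": None, "n1": "root", "n2": "root",
    "n3": "n1", "n4": "n1", "n5": "n2", "n6": "n2",
    "A": "n3", "B": "n3", "C": "n4", "D": "n4",
    "E": "n5", "F": "n5", "G": "n6", "H": "n6",
}
```

In this hypothetical tree, the pair (D, E) meets only at the root and is therefore ranked as the most critical, ahead of (B, C) and (A, B).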
A. Cross Link Insertion
Cross links were first proposed to be inserted between nodes in an unbuffered clock tree [9], followed by a bottom-up tuning procedure for each inserted link. (Extensions to insert cross links in a buffered clock tree were proposed in [11] and [15].) The tuning of the entire tree is done by reconstructing the same topology with the same bottom-up approach as in deferred-merge embedding (DME) [16]-[18], but with the following change: the capacitance of a node with a cross link is replaced with the capacitance of the node plus half of the capacitance of the cross link. The basic idea is that a cross link can be modeled as two capacitors at the two ends of a resistor. Moreover, two nodes with zero skew will also have zero skew after adding a resistor between them, implying that the resistor of a cross link can be added after the tree construction.
Fig. 2. Two near-tree structures constructed using methods in [7] and [8]. (a) Clock tree with the cross links inserted as in [7]. (b)-(d) Construction of a multilevel fusion tree in [8].
A measure indicating how effectively a cross link reduces the sensitivity of skew to variations was introduced in [9]. The measure was called the α-rule, which considered the resistance ratio α = R_link/R_loop, obtained by dividing the resistance of the cross link by the resistance of the loop constructed in the clock tree with the cross link. In [9], it was concluded that the smaller the α-ratio, the better the skew sensitivity reduction.
In [7], a cross link is considered for insertion when two subtrees, each driven by a device, are merged. The cross link is inserted to connect two points on the two wires, called stem wires, that connect the devices to the subtrees. An illustration of the resulting structure is shown in Fig. 2(a). As the insertion is performed beneath the devices, the insertion step can be easily integrated into the process of constructing a clock tree, avoiding a reconstruction of the network.
B. Multilevel Fusion Trees
In [8] , a tree topology is first synthesized. Additional trees are then constructed and fused with the original tree. Each additional tree is constructed to connect a subset of the original sinks. The subsets are created by first identifying critical clock sink pairs using a statistical analysis. The critical clock sink pairs are then clustered if they have the same LCA in the original clock tree. For each subset, a tree is constructed and fused into the original tree at the LCA of the subset. This is illustrated in Fig. 2(b)-(d) , where additional trees are constructed to improve the robustness of clock sink pairs (A, B), (C, D), and (E, F). The generation of additional trees ends when the construction meets the desired design requirements. Empirical results show that the methodology can significantly reduce the skew sensitivity. Compared to the initial tree, the capacitance may increase by 60% in order to satisfy stricter skew constraints. Nonetheless, the cost of multilevel fusion trees is far less than that of meshes.
IV. CASE STUDY
We perform a case study to better understand how redundancy improves the robustness of a clock network to PVT variations. First, the experimental setup is described. Next, we draw some conclusions about cost-effective robustness. Based on these conclusions, the cross-linked structures in [7] and multilevel fusion trees in [8] are analyzed. Finally, we present a family of CLNs and another near-tree structure called the LMN.
A. Experimental Set-Up
As in [8] , we choose to study critical sink pairs because their long distances in the clock tree topology make these pairs prone to synchronization error. To study the effects of the two different methods in [7] and [8] , we assume that we have a symmetric clock tree. To obtain a simplified experimental model, we consider an abstraction of a critical sink pair from the symmetric clock tree. The abstraction is performed by removing everything but the paths connecting the critical pair to the source in the topology, as shown in Fig. 3 . The figure shows a clock tree driven by three stages of devices, which are numbered in a bottom-up fashion. With this abstraction, we obtain a simple structure with only two paths, each driven by three stages of devices, of which the top one is the clock source.
Using this abstraction strategy, we obtain the two reference structures shown in Fig. 4. The two structures consist of two and eight clock sinks, respectively. Each clock sink in the reference structures is sequentially related with only its neighbors, as illustrated with the double-ended arrows. Each clock sink is driven by an eight-stage path, with each stage consisting of a device driving a long wire. The source then drives the two or eight paths through a symmetric tree. While the reference structures are approximations of critical clock sink pairs found in a clock tree, they can illustrate the characteristics of different robustness enhancement techniques.
The 45 nm technology provided in the 2010 ISPD contest is used in this experiment. The characteristics of the devices in the benchmark suite can be found in Table I . Evaluation is based on 500 Monte Carlo NGSPICE simulations under the SLSV variations model.
In [3] , buffers have been shown to be more effective in handling SLSV variations. In this paper, devices in the form of buffers are used. In this experiment, a buffer is formed by connecting seven parallel inv1's in series with ten parallel inv1's. Each stage consists of a 1 000 000 nm wire. Each clock sink has a capacitance of 300 fF and the clock sinks are placed 400 000 nm apart. Both reference structures and all enhanced structures in this section have a nominal skew below 1 ps. 
B. Trade-Off Between Cost and Robustness
The two-sink reference structure is enhanced by inserting a cross link. Four different locations for cross link insertion are considered, as shown in Fig. 5(a). In the figure, we classify each of the potential locations for a cross link with an LCA distance (LCAD). The LCAD of a cross link is defined to be the number of buffers that must be passed through in the tree topology to connect an end point of the cross link to the LCA of the two endpoints of the cross link. The improvement in robustness of inserting the cross link at different locations is determined by performing 500 Monte Carlo NGSPICE simulations with SLSV variations on each of them. The results of the simulations can be found in Fig. 5(b).
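The LCAD definition above can be made concrete with a small sketch. The parent-map representation and the node names are hypothetical; counting the buffers strictly below the LCA, and taking the larger count when the two endpoints differ, are assumptions made for asymmetric cases that the definition does not cover.

```python
def path_to_root(parent, node):
    """Nodes visited when walking from `node` up to the root (inclusive)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def lcad(parent, buffers, u, v):
    """Number of buffers between a cross link endpoint and the LCA of (u, v)."""
    up_u = path_to_root(parent, u)
    up_v = path_to_root(parent, v)
    common = set(up_u) & set(up_v)
    lca = next(n for n in up_u if n in common)

    def buffers_below_lca(path):
        # Count buffer nodes from the endpoint up to, but excluding, the LCA.
        return sum(1 for n in path[:path.index(lca)] if n in buffers)

    return max(buffers_below_lca(up_u), buffers_below_lca(up_v))

# Hypothetical two-path structure: the source drives two buffered paths.
PARENT = {
    "src": None,
    "bL1": "src", "wL1": "bL1", "bL2": "wL1", "wL2": "bL2",
    "bR1": "src", "wR1": "bR1", "bR2": "wR1", "wR2": "bR2",
}
BUFFERS = {"bL1", "bL2", "bR1", "bR2"}
```

In this toy structure, a link between the wires below the first buffers has an LCAD of 1, and a link between the wires below the second buffers has an LCAD of 2.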
We observe that inserting any cross link improves the robustness compared to the original tree structure. Also, as the cross link is placed closer to the sinks, the skew is successively reduced. Based on this, we speculate that redundancy may improve the robustness more effectively when it is introduced closer to the sinks. To grasp how the introduction of redundancy increases the capacitive cost of a design, the capacitance distribution of a regular clock tree is studied. A clock tree is constructed on benchmark circuit ispd10cns01 from the ISPD 2010 contest [1]. The cost at each stage is determined as the sum of sink, wire, and output capacitances of the driving devices: 71% and 12% of the total capacitance are located at the bottom-most stage (stage 1) and at the second bottom-most stage (stage 2), respectively; stages 3, 4, and 5 or above account for only 6%, 4%, and 7% of the total capacitance, respectively.
Based on these observations, we face a dilemma. To achieve high robustness, redundancy has to be introduced as close as possible to the clock sinks. At the same time, the additional cost of construction at the bottom layer is high.
We propose a trade-off by introducing the redundancy close to the devices driving the bottom-most stage. At this location, the redundancy is still relatively close to the clock sinks, but the cost will be significantly lower compared to that of adding redundancy at the sink level.
C. CLN
In this section, we study the cross link insertion method in [7] and propose some alternative locations for the placement of cross links. The eight-sink reference structure in Fig. 4 (b) is used as a baseline for our analysis. We start by evaluating the reference structure with Monte Carlo simulations using the SLSV variations model. From the histogram in Fig. 6 , the 95%-skew of the eight-sink reference structure is determined to be 8.72 ps under SLSV variations.
In [7], cross links are considered for insertion under the driving devices of DCCSs. When subtrees are merged above the driving devices, cross links are inserted between the subtrees below the driving devices. We insert cross links in the eight-sink reference structure using the same method. The clock tree with the mimicked cross link insertion in [7] is shown in Fig. 7(a).
In [7] , all the cross links have an LCAD of 1. In Fig. 5 , it seems as if cross links with a higher LCAD may be more effective. Based on this observation, we propose a new cross link insertion technique. Cross links are inserted under the driving devices of the bottom-most stage; this insertion technique maximizes the LCAD number of each inserted cross link. At the same time, the cross links are not inserted at the sink level, thereby limiting the cost of the cross link insertion. An illustration of the proposed CLN structure is shown in Fig. 7(b) .
The structures in (a) and (b) of Fig. 7 are evaluated with Monte Carlo simulations to see how the cross links affect the robustness. The clock tree with cross links in Fig. 7 (a) has a 95%-skew of 8.84 ps, which is actually worse than the reference tree structure. The proposed CLN in Fig. 7 (b) has a 95%-skew of 3.31 ps, an improvement of 5.41 ps compared with the reference tree.
We observe that there are cross links of different LCAD in Fig. 7 (b). We view this as an opportunity for optimization; in Section V-B we investigate whether some of the links with the lowest LCAD can be removed to reduce cost while not affecting the robustness too negatively.
D. Multilevel Fusion Tree Structures
Now we study the multilevel fusion trees proposed in [8] .
Redundancy is again inserted into the larger eight-sink reference structure to mimic the method in [8] . For the eight-sink reference structure, we assume that every two adjacent clock sinks form a critical clock sink pair. As none of the pairs share LCA, they cannot be clustered, and an additional tree has to be constructed for each pair. The additional trees are fused into the original structure at their respective LCAs, as shown in Fig. 8(a) .
The histogram in Fig. 8(c) shows the distribution of wLCS obtained from Monte Carlo simulations under the SLSV model. The 95%-skew decreases from 8.72 to 4.14 ps compared with the reference tree structure. The structure seems to be very effective in improving the robustness. However, the redundancy will be costly in terms of capacitance as the insertion is performed between every pair of adjacent sinks.
E. LMN
We propose the construction of an LMN, shown in Fig. 8(b). The structure will have lower cost compared with multilevel fusion trees for two main reasons. First, the redundancy is not inserted between the sinks of the clock tree. Second, the added redundancy is merged into the original structure closer to the sinks as compared to Fig. 8(a). Now, we outline how such a structure can be created: First, the bottom-most DCCSs are constructed. Next, a pair of DCCSs that share a sequentially related pair of clock sinks are merged using a so-called local merge operation. Each local merge operation creates a new root. After this, we construct a normal tree structure on top of the set of new roots. The structure in Fig. 8(b) should have similar robustness to that in Fig. 8(a), as the redundancy is inserted relatively close to the sinks.
Our proposed structure is evaluated with Monte Carlo simulations under the SLSV model. A 95%-skew of 5.57 ps is obtained for the structure, which is lower than the 8.72 ps skew of the eight-sink reference structure and somewhat higher than the 4.14 ps skew of the multilevel fusion tree structure. The skew distribution is rather irregular under the SLSV variations model. We suspect that the main reasons for that are: 1) the higher degree of asymmetry of the proposed structure and 2) the two clock sinks at the extreme left and right are each driven directly by only "one" branch. The relatively worse performance compared with the multilevel fusion tree is expected because there are now fewer redundant paths to the clock sinks. In other words, we have traded off performance for lower cost. However, we shall show in the experimental results (see Section VII) that we actually obtain both lower cost and better robustness.
V. NEAR-TREE METHODOLOGY
In this section, we describe how our two proposed near-tree structures, i.e., the CLN and LMN structures, are constructed. The flow for constructing a CLN structure is shown in Fig. 9(a) , with the details given in Section V-B.
First, the bottom-most subtrees of a CLN are constructed (see Fig. 9(c) and Section V-A1) as in the construction of a clock tree. Subtrees are built by iteratively merging smaller subtrees together. Next, an optional VR operation can be applied. The VR method aims to reduce the cost by reducing the number of subtrees in the first stage. The optional VR is shown in Fig. 9(d) and explained in Section VI. After this, a sequential relation graph (SRG) is constructed to guide the insertion of the cross links. Using the SRG, cross links are inserted between the subtrees as described in Section V-B2. After the insertion of the cross links, the driving devices of the first stage are inserted (see Section V-A2). The remainder of the structure is constructed as a clock tree, by alternating between merging subtrees and inserting devices, using the clock tree construction flow given in Section V-A and shown in Fig. 9(c).
An LMN structure is constructed in a bottom-up manner using the flow shown in Fig. 9(b) , with the description given in Section V-C. Here, the bottom-most subtrees are again generated through iterative merging operations. The optional VR technique can then be applied before inserting the driving devices of the first stage. Next, an SRG is again constructed to guide the insertion of redundancy. In the LMN structure, the redundancy is placed above the driving devices using local merges as outlined in Section V-C1. After this introduction of redundancy, the clock tree construction subflow is adopted to construct the remainder of the network.
A. Flow for Clock Tree Construction
A clock tree is constructed by iteratively performing tree merging and device insertion as follows.
1) Tree Merging: To construct subtrees, the greedy DME approach using a nearest-neighbor graph (NNG) [19] is deployed under the Elmore delay model. In this NNG, subtrees are vertices. The cost of each edge between two vertices indicates the cost of merging the two corresponding subtrees. The cost of an edge is defined to be the wire length needed to perform the merge as in [19] .
Iteratively, two subtrees are merged together to form a larger subtree. We alternate between two different methods of selecting the two subtrees that are to be merged. The first method finds the least cost edge in the NNG and merges the two corresponding subtrees [19]. In the second method, the subtree with the smallest delay is found. This subtree is merged with its least cost neighbor in the NNG [11]. As a pair of subtrees are merged, the corresponding nodes are deleted from the NNG, and the newly formed subtree is inserted into the NNG. However, if the newly merged subtree violates the slew constraint, denoted as S_slew, the two subtrees are unmerged and locked, i.e., they are not considered for further merging. When all subtrees are locked from further merging, a device is inserted to drive each subtree to ensure that the slew is acceptable.
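The alternating selection and locking loop described above can be sketched as follows. This is a simplified illustration, not the synthesis flow itself: the `merge` function is a toy stand-in for a zero-skew DME merge, the NNG is replaced by a brute-force search over all pairs, and `slew_ok` is a hypothetical callback standing in for the slew model and NGSPICE check.

```python
from itertools import combinations

def cost(a, b):
    """Merge cost: Manhattan wire length between subtree roots (toy model)."""
    return abs(a["pos"][0] - b["pos"][0]) + abs(a["pos"][1] - b["pos"][1])

def merge(a, b):
    """Toy stand-in for a zero-skew merge; not an Elmore-accurate DME merge."""
    return {
        "pos": ((a["pos"][0] + b["pos"][0]) / 2, (a["pos"][1] + b["pos"][1]) / 2),
        "delay": max(a["delay"], b["delay"]) + cost(a, b),
        "cap": a["cap"] + b["cap"],
    }

def greedy_merge(subtrees, slew_ok):
    """Merge until every subtree is locked; returns the locked subtrees."""
    active, locked = list(subtrees), []
    use_least_cost = True
    while len(active) > 1:
        if use_least_cost:
            # Method 1: globally cheapest pair.
            a, b = min(combinations(active, 2), key=lambda p: cost(p[0], p[1]))
        else:
            # Method 2: smallest-delay subtree and its cheapest neighbor.
            a = min(active, key=lambda t: t["delay"])
            b = min((t for t in active if t is not a), key=lambda t: cost(a, t))
        use_least_cost = not use_least_cost
        merged = merge(a, b)
        active = [t for t in active if t is not a and t is not b]
        if slew_ok(merged):
            active.append(merged)
        else:
            locked.extend([a, b])  # unmerge and lock both subtrees
    return locked + active
```

Each subtree returned by `greedy_merge` would then receive a driving device; a production implementation would maintain the NNG incrementally instead of rescanning all pairs.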
The first-order slew model in [20] is used to estimate the slew of a subtree. If an estimated slew is close to the constraint S_slew, NGSPICE is invoked to compute the slew accurately. As each subtree is constructed to have zero skew, the root-to-sink delay (RTSD) of each subtree is unique and is defined to be the delay of the subtree. Next, we attempt to balance the delay of all the DCCSs, because if the DCCSs have similar delay, it is less likely that detour wiring will be needed in the next stage. Moreover, these DCCSs are expected to be affected by variations more similarly. To balance the delays of the DCCSs, a piece of stem wire is inserted between the subtree and the driving device of each DCCS.
The stem wire insertion is performed as follows: First the DCCS s min with the minimum delay is found. Next, the longest possible stem wire is inserted under the driver of DCCS s min without violating the slew constraint. Finally, a stem wire is inserted on each of the other subtrees to match the delay of s min without violating the slew constraint.
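The stem wire balancing step can be sketched under the Elmore delay model, where a stem wire of length L with unit resistance r and unit capacitance c that drives a capacitance C adds a delay of r L (c L / 2 + C). The sketch below is a hedged illustration: the slew constraint is abstracted into a hypothetical maximum stem length `max_len`, the effect of the added load on the driver delay is ignored, and all names are assumptions.

```python
import math

def stem_delay(length, r, c, load_cap):
    """Elmore delay added by a stem wire driving `load_cap`."""
    return r * length * (c * length / 2.0 + load_cap)

def stem_length_for(delta, r, c, load_cap):
    """Stem length whose Elmore delay equals `delta` (positive quadratic root)."""
    if delta <= 0:
        return 0.0
    rc = r * load_cap
    return (-rc + math.sqrt(rc * rc + 2.0 * r * c * delta)) / (r * c)

def balance(dccs, r, c, max_len):
    """dccs: list of (delay, load_cap) pairs. Returns one stem length per DCCS."""
    i_min = min(range(len(dccs)), key=lambda i: dccs[i][0])
    d_min, cap_min = dccs[i_min]
    # The minimum-delay DCCS gets the longest stem allowed by the slew limit.
    target = d_min + stem_delay(max_len, r, c, cap_min)
    lengths = []
    for i, (delay, cap) in enumerate(dccs):
        if i == i_min:
            lengths.append(max_len)
        else:
            # Match the target delay, but never exceed the slew-limited length.
            lengths.append(min(max_len, stem_length_for(target - delay, r, c, cap)))
    return lengths
```

When the clipping to `max_len` is active for some DCCS, its delay falls short of the target, which mirrors the fact that balancing is only attempted within the slew constraint.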
This ends the construction of the current stage. The inserted driving devices will be the sinks in the construction of the next stage.
B. Flow for CLN Construction
In this section, the construction of the CLN structure proposed in Section IV-C is described. First, the bottom-most subtrees (or stage-1 subtrees) are constructed using merging until they are all locked. Here, the optional VR method, which we will explain in Section VI, could be applied. Next, an SRG is constructed to guide the cross link insertion.
1) SRG Construction:
To systematically introduce redundancy in a clock network, we introduce the concept of an SRG. An SRG consists of vertices and edges. An SRG vertex is created for each tree. An edge is added between two vertices if the two corresponding trees are sequentially related. Two trees are sequentially related if they share a sequentially related sink pair, i.e., one of the sinks in the sink pair is located in the first tree, and the second sink in the pair is located in the other tree. A set of trees and the corresponding SRG is shown in Fig. 10 . (Similarly, we can also define an SRG based on DCCSs. Such an SRG will be used in the construction of LMNs in Section V-C.)
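The SRG definition above translates directly into a small construction routine. The sketch below assumes a hypothetical mapping from each sink to the tree (or DCCS) that contains it; the function and variable names are illustrative.

```python
def build_srg(tree_of_sink, related_sink_pairs):
    """Build the SRG: one vertex per tree, one edge per sequentially related tree pair.

    tree_of_sink: dict mapping a sink name to the name of the tree containing it.
    related_sink_pairs: iterable of (sink, sink) pairs that are sequentially related.
    """
    vertices = set(tree_of_sink.values())
    edges = set()
    for s, t in related_sink_pairs:
        a, b = tree_of_sink[s], tree_of_sink[t]
        if a != b:  # pairs inside the same tree do not create an edge
            edges.add(frozenset((a, b)))
    return vertices, edges
```

Using a set of `frozenset` edges means that several related sink pairs spanning the same two trees still produce a single SRG edge.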
2) Insertion of Cross Links: A cross link is inserted between every pair of subtrees that share an edge in the SRG. The cross links are inserted one-by-one as follows: An edge e that is connected to the subtree with the least delay is first found. If multiple such edges exist, we pick the edge that has the smallest total RTSD of the two subtrees incident on the edge.
Cross links are the most effective when inserted between nodes of equal delay. Therefore, the subtree that is connected to edge e and has the least delay is extended to match the delay of the other subtree using a piece of stem wire (using the Elmore delay model). Next, the edge e is removed from the SRG and is replaced with a cross link. The method in [9] is used to perform the insertion, i.e., half of the capacitance of the cross link is added to each of the two subtrees, and the resistor of the cross link is added after the entire clock tree has been constructed. This process is repeated until all edges in the SRG have been replaced with cross links.
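The edge selection order and the bookkeeping of the insertion step can be sketched as follows. This is a simplified illustration: every cross link is given the same hypothetical capacitance `link_cap`, the stem wire extension is modeled only as raising the delay of the faster subtree to that of the slower one, the capacitance of the stem wire itself is not modeled, and remaining ties are broken by subtree name.

```python
def insert_cross_links(delay, cap, srg_edges, link_cap):
    """Replace SRG edges with cross links one by one.

    delay, cap: dicts mapping subtree name to RTSD and root capacitance.
    srg_edges: iterable of (u, v) subtree name pairs.
    Returns the insertion order and the updated delay and capacitance maps.
    """
    delay, cap = dict(delay), dict(cap)
    remaining = {frozenset(e) for e in srg_edges}
    order = []

    def key(edge):
        u, v = sorted(edge)
        # Least-delay endpoint first; ties broken by the smallest total RTSD.
        return (min(delay[u], delay[v]), delay[u] + delay[v], u, v)

    while remaining:
        edge = min(remaining, key=key)
        u, v = sorted(edge)
        # Stem wire on the faster subtree matches the delay of the slower one.
        delay[u] = delay[v] = max(delay[u], delay[v])
        # Half of the link capacitance is added to each side, as in [9].
        cap[u] += link_cap / 2.0
        cap[v] += link_cap / 2.0
        remaining.remove(edge)
        order.append((u, v))
    return order, delay, cap
```

The resistors of the cross links are not represented here because, as stated above, they are added only after the entire clock tree has been constructed.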
After all the cross links have been inserted, a driving device is inserted to drive each of the subtrees. Next, the regular clock tree construction flow is used to construct the remainder of the tree structure as shown in Fig. 9 .
3) CLN Sparsification: Fig. 11 shows the LCAD of each cross link. A cross link with a higher LCAD seems to be more important, as it connects two subtrees that are further apart in the topology, as seen in Fig. 11 . This also relates well with the α-rule presented in [9] , the observation in [8] , and the experiment in Fig. 5 .
To trade off capacitance cost and robustness, we may choose to insert only the most important cross links. We present a family of sparsification schemes based on the LCADs of the cross links. A clock tree that has all cross links with an LCAD equal to or higher than X is classified as a CLN-X. For example, a CLN-2 corresponds to a clock tree in which all cross links with an LCAD of two or higher are present. The lower X is, the more cross links a clock tree will have. We consider three sparsification schemes, namely CLN-1, CLN-2, and CLN-3.
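The CLN-X classification amounts to a threshold filter on the LCAD of each cross link. A minimal sketch, assuming that each link is stored as a hypothetical (u, v, lcad) tuple:

```python
def cln(links, x):
    """Keep only the cross links whose LCAD is at least x (the CLN-x link set)."""
    return [link for link in links if link[2] >= x]

# Hypothetical links between stage-1 subtrees, annotated with their LCAD.
LINKS = [("s1", "s2", 1), ("s2", "s3", 2), ("s3", "s4", 1), ("s4", "s5", 3)]
```

By construction, the CLN-3 link set is contained in the CLN-2 set, which in turn is contained in the CLN-1 set; the rebalancing that follows link removal is not part of this sketch.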
To construct the different CLN structures, a clock tree with all cross links present is constructed. Next, the LCAD of each cross link is determined. With the LCAD of each cross link computed, it is easy to remove the appropriate cross links. Finally, the tree requires some tuning because of the imbalance induced by the link removal. We use a method similar to that in [11] to perform the rebalancing. The rebalancing is required because current is no longer needed to charge the capacitance of a removed cross link, which was accounted for in the initial synthesis.
C. Flow for LMN Construction
The construction of the proposed LMNs is described in this section and illustrated in Figs. 12 and 13. The bottom-most DCCSs (or stage-1 subtrees, each with a driving device) are constructed using tree merging, VR, and device insertion. Next, an SRG based on DCCSs (instead of trees as in Section V-B1) is constructed to guide the local merges between DCCSs that share an edge in the SRG.
1) Local Merges:
The insertion of the redundancy is performed as follows: Each DCCS is split [8], [11] into as many parts as the number of adjacent nodes the corresponding vertex has in the SRG. Splitting means that the input capacitance of the driver of each DCCS is divided evenly among the parts. Each of these parts is called a split-DCCS. For every edge in the SRG, a zero-skew merge is performed on the two corresponding split-DCCSs. We call such a merge a local merge.
In Fig. 12(a) an SRG is shown. In (b), split-DCCSs are formed, and each is merged with an adjacent split-DCCS. Fig. 13(a) shows four DCCSs that are sequentially related with each other, and Fig. 13(b) shows five DCCSs that are sequentially related with immediate neighbors. Multiple local merges are performed, which results in six new roots and four new roots in Fig. 13(a) and (b) , respectively.
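The splitting rule and the resulting number of new roots can be checked with a short sketch. The data layout and names are hypothetical, and the zero-skew merge itself is not modeled; only the even division of the driver input capacitance and the one-root-per-edge bookkeeping are shown.

```python
from collections import Counter

def local_merges(srg_edges, driver_input_cap):
    """Split each DCCS by its SRG degree and create one new root per SRG edge.

    srg_edges: list of (u, v) DCCS name pairs.
    driver_input_cap: dict mapping a DCCS name to its driver input capacitance.
    Returns a list of new roots, each recording the two split-DCCS capacitances.
    """
    degree = Counter()
    for u, v in srg_edges:
        degree[u] += 1
        degree[v] += 1
    # Each split-DCCS receives an equal share of the driver input capacitance.
    split_cap = {d: driver_input_cap[d] / degree[d] for d in degree}
    return [{"edge": (u, v), "caps": (split_cap[u], split_cap[v])}
            for u, v in srg_edges]
```

Four mutually related DCCSs give six SRG edges and hence six new roots, whereas five DCCSs related only to their immediate neighbors give four, matching the counts stated for Fig. 13.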
2) "Tree" Construction: The new root nodes in Fig. 12(b) are then used to create an NNG, and iterative tree merging is performed as in Section V-A1 on top of the new roots to produce a number of locked subtrees. We call this tree merging because the structure on top of the new roots is a tree; if the local merges are included, it may be a near-tree structure. The tree merging process of the new roots in the top row of Fig. 13 produces the locked subtrees in the middle row of Fig. 13. In Fig. 13(a), the six new roots result in a single locked subtree, whereas two locked subtrees are obtained when the tree merging is applied to the four new roots in Fig. 13(b).
Next, an optional and iterative sparsification process can be applied to each of the locked subtrees, before inserting the driving devices of the second stage. Above these driving devices, the remainder of the structure is constructed as a regular clock tree. The obtained structure is denoted as an LMN structure, or if the sparsification is applied, a sparsified LMN (S-LMN).
3) LMN Sparsification: After tree construction, there may exist multiple paths from a device to a DCCS driven by it, as shown in the middle row of Fig. 13. Sparsification is performed to ensure a unique path from the driving device of the subtrees in the second stage to each DCCS of the bottommost stage. This condition may initially not be satisfied because a single DCCS of the first stage may have been split into multiple parts, i.e., split-DCCSs; next, many of these split-DCCSs originating from the same DCCS may have been merged together to form a subtree at the second stage, resulting in nonunique paths to that DCCS. Naturally, the proposed sparsification implies a trade-off between robustness and cost. Sparsification increases the vulnerability to wire variations experienced along the sparsified portion of the network. Nevertheless, as the alternative paths are intentionally kept relatively short, the robustness is not affected too negatively.
The sparsification is performed on the locked subtrees one at a time. All sink nodes of one locked subtree, which are all the split-DCCSs in the locked subtree, are placed in an NNG. We then apply the subtree merging process in Section V-A1. However, different split-DCCSs that originate from the same DCCS are joined before the merging process starts. When sparsification is applied to the structures in the middle row of Fig. 13, the structures in the bottom row of Fig. 13 are obtained. A single tree is obtained in (a) and two trees are created in (b) of Fig. 13. It is important to note that the tree merging in the tree construction step is performed on all the new roots after local merges, whereas the tree merging in the sparsification step is performed on the sinks of a locked subtree. The newly constructed subtrees will have a single path to each sink. However, in a few adverse cases, a newly constructed sparsified subtree may have significant detour wiring and cannot meet the slew constraint. In these cases, the original unsparsified version is kept instead.
After all second-stage subtrees have been sparsified, the wire capacitance of the subtrees in the second stage may have been reduced. This presents an opportunity to merge some of the sparsified subtrees and, if necessary, perform an additional sparsification process. This iterative process stops when either additional merges result in a slew violation or sparsification is not successful. At this point, the second-stage devices are inserted as described in Section V-A2. Next, the remainder of the structure is constructed as a clock tree.
VI. VR FLOW
Both methods in Section V require an SRG to insert redundancy. In this section, a VR method is presented. The VR method aims to reduce the number of SRG vertices or subtrees in stage one. By reducing the number of subtrees, less redundancy needs to be inserted. The slew is the factor that determines if a pair of subtrees can be merged.
In the VR method, subtrees are reconstructed to meet slew constraints. The motivation behind the reconstruction is that the slew of a zero-skew subtree correlates with the root-to-sink (Elmore) delay (RTSD) of the subtree [20]. By exploiting the inherent quadratic nature of interconnect delays, the RTSD of some subtrees can be reduced substantially using a small amount of additional wire. Specifically, the reconstruction step alters the abstract topology and merges partial subtrees closer to the final root of a subtree.
The flow of the VR method is outlined in Fig. 14. The input to the VR method is a set of subtrees. In an enumeration step, all pairs of subtrees are merged as shown in Fig. 14 and described in Section VI-A. Next, the enumerated (paired) subtrees are reconstructed to meet the slew constraints. The details of the reconstruction step are given in Section VI-B. Subtrees that cannot be reconstructed to meet the slew constraint are removed, as shown in Fig. 14 with dashed crosses.
Fig. 15. Illustration of the flow of the reconstruction step. First, a subtree is broken into smaller subtrees. Next, subtrees are remerged to a single subtree but with a smaller RTSD. Finally, the remerged subtree (if any) that meets the slew constraint and is of least cost is kept in the pruning step.
Lastly, a nonoverlapping set of reconstructed subtrees is selected to be kept, as explained in the selection step in Section VI-C. In the example in Fig. 14 , the number of subtrees is reduced from four to three.
A. Enumeration
The input to the VR method is a set of zero-skew subtrees. We denote such a set of N subtrees as a subtree-set. First, the N(N − 1)/2 possible pairs of subtrees are formed. Next, each of these pairs of subtrees is merged into a paired-tree. Denote the set containing all of these paired-trees as P. Next, the slew of every paired-tree in P is computed and compared with the slew constraint S_slew. If the violation of the slew constraint is less than a parameter S_vio, it is likely that the slew can be legalized in the reconstruction step, and therefore the pair is kept. If the slew exceeds this relaxed slew constraint, the pair is removed from P.
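The enumeration step can be sketched as a filter over all pairs. The sketch below assumes a caller-supplied function `merged_slew(a, b)` that returns the slew of the paired-tree; that function, the toy slew model, and all numeric values are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def enumerate_pairs(subtrees, merged_slew, s_slew, s_vio):
    """Keep the pairs whose slew violation is less than s_vio."""
    kept = []
    for a, b in combinations(subtrees, 2):
        violation = merged_slew(a, b) - s_slew
        if violation < s_vio:
            kept.append((a, b))
    return kept


# Toy model (assumption): a subtree is a load value and the slew of a
# paired-tree is simply the sum of the two loads.
subtrees = [30, 40, 50, 90]
pairs = enumerate_pairs(subtrees, lambda a, b: a + b, s_slew=100, s_vio=10)

print(len(list(combinations(subtrees, 2))), "pairs formed,", len(pairs), "kept")
```

With N = 4 subtrees, N(N − 1)/2 = 6 pairs are formed, and only the pairs within the relaxed constraint move on to reconstruction.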
B. Reconstruction
In the reconstruction step, each paired-tree in P is reconstructed to attempt to meet the slew constraint. According to the slew model in [20] , the slew of a zero-skew subtree depends on the total capacitance of the subtree and the RTSD of the subtree. The reconstruction step attempts to reduce the slew by decreasing the RTSD at the expense of increasing the total capacitance.
The reconstruction step is illustrated in Fig. 15 and consists of breaking, remerging, and pruning. Each paired-tree p is broken k_max times into smaller subtrees. After k breaks, the subtree-set S_k has k + 1 subtrees, with 1 ≤ k ≤ k_max. In this way, k_max different subtree-sets S_k are formed. Next, each subtree-set S_k is remerged into a subtree p_k. The index k denotes how many times the initial subtree p was broken before it was remerged. Of the k_max remerged subtrees generated from the same paired-tree p, only the one subtree that meets the slew constraint (if any) and is of least "cost" is kept.
1) Breaking: Consider a paired-tree p. Next, consider breaking the subtree into two separate subtrees. The single possible breaking point is illustrated with a solid red circle in Step 1 of Fig. 16. After the break has been performed, the two obtained subtrees form the subtree-set S_1.
Fig. 16. Illustration of how a subtree p is broken thrice and remerged to construct the subtree p_3 with the MRTSD.
Algorithm 1 Breaking a Subtree
Input: A paired-tree p
Output: A series of subtree-sets S_1, ..., S_{k_max}
Next, consider performing another break. There are two alternative subtrees that could be broken, each illustrated with a red circle in Step 2 of Fig. 16. Each of the "break" alternatives would create a subtree-set consisting of three subtrees. We pick the break that generates the subtree-set such that, if the subtrees in the subtree-set were remerged into a single subtree T, the subtree T would have the minimum RTSD (MRTSD) possible. The break resulting in the MRTSD is illustrated with a solid red circle, and the unselected break is illustrated with a dashed red circle. The obtained set of three subtrees forms the subtree-set S_2. This process is repeated k_max times to iteratively break the current subtree-set into more subtrees, creating a total of k_max different subtree-sets. We set k_max = 7 in this paper based on a run-time performance evaluation. The subtree-breaking approach outlined above is detailed in Algorithm 1 and illustrated in Fig. 16. In the algorithm, s_left is the left child and s_right is the right child of a subtree s.
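The greedy breaking procedure described above can be sketched as follows. This is a reading of the prose, not a transcription of Algorithm 1. It assumes a subtree is either a leaf (a string) or a tuple `(s_left, s_right)`, and that a function `get_mrtsd` is supplied by the caller; the stand-in `toy_mrtsd` used in the example is purely hypothetical.

```python
def break_subtree(p, get_mrtsd, k_max=7):
    """Return the series of subtree-sets S_1, S_2, ... obtained from p."""
    current = [p]
    series = []
    for _ in range(k_max):
        candidates = []
        for i, s in enumerate(current):
            if isinstance(s, tuple):  # internal node: (s_left, s_right)
                s_left, s_right = s
                candidates.append(current[:i] + [s_left, s_right] + current[i + 1:])
        if not candidates:
            break  # only leaves remain, nothing left to break
        # Keep the break whose subtree-set has the smallest MRTSD.
        current = min(candidates, key=get_mrtsd)
        series.append(current)
    return series


def leaf_count(s):
    return leaf_count(s[0]) + leaf_count(s[1]) if isinstance(s, tuple) else 1


def toy_mrtsd(subtree_set):
    # Stand-in for get-MRTSD (assumption): larger subtrees are slower.
    return max(leaf_count(s) for s in subtree_set)


p = ((("a", "b"), "c"), ("d", "e"))
series = break_subtree(p, toy_mrtsd)
print([len(s_k) for s_k in series])  # [2, 3, 4, 5]
```

Each subtree-set S_k holds k + 1 subtrees. In this small example the procedure stops after four breaks because only leaves remain.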
A key component of the breaking algorithm in Algorithm 1 is the get-MRTSD function. The get-MRTSD function computes the MRTSD among the trees that could be constructed from a subtree-set S. Let |S| be the number of subtrees in S, S[k] be the kth subtree in S, and t_MRTSD be the value returned by the function.
In Algorithm 2, the t_MRTSD of a subtree-set is found by merging every possible pair of subtrees in the subtree-set and determining the maximum delay among all merged pairs. The maximum over all the merged pairs is required because we are attempting to join all the subtrees in the subtree-set into a single subtree T with the MRTSD. The key to understanding the get-MRTSD function is to recognize that the eventual subtree T with the MRTSD constructed from a subtree-set contains all the subtrees of the subtree-set, including the pair of subtrees that has the maximum RTSD, i.e., t_MRTSD, when merged. However, the subtrees in this pair may be merged directly with each other or indirectly, i.e., with other trees first, in T. Consequently, the MRTSD of the subtree-set S cannot be less than the t_MRTSD of the merged paired-trees in S. Therefore, t_MRTSD is a lower bound for the MRTSD of the subtree-set S.
Moreover, for a subtree-set S, a subtree T with an RTSD equal to the lower bound t_MRTSD can always be constructed. Consider extending a piece of wire from the root of each subtree in the subtree-set such that the RTSD of each subtree is t_MRTSD. This extends the merging segment [17] of each of the subtrees to a corresponding merging region (MR), which is illustrated in Fig. 17(a). We call the intersection of all the MRs the target region (TR). If there exists a nonempty TR, a subtree with an RTSD of t_MRTSD can be constructed by connecting all the extended pieces of wire to any one point within the TR. Next, we show that the TR is nonempty. As each pair of subtrees in S can be merged with a delay of at most t_MRTSD, the MRs must pairwise intersect.
Now it remains to show that pairwise intersection implies the existence of a nonempty common intersection. The collection of MRs can be viewed as two interval graphs because all MRs are rectangles tilted at a 45° angle. The interval graphs can be obtained by rotating the MRs by 45° and using the left and right x-coordinates and the top and bottom y-coordinates of the rectangles to define the intervals in the horizontal and vertical directions. A direct application of Helly's theorem [21] to the intervals in each direction implies that a nonempty common intersection exists. For example, if MR_3 in Fig. 17(a) is to pairwise overlap with both MR_1 and MR_2, it must overlap with the common intersection of MR_1 and MR_2. This proves the correctness of the get-MRTSD function, i.e., for a subtree-set S there exists an MRTSD subtree T with an RTSD of t_MRTSD computed by the get-MRTSD function in Algorithm 2.
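The lower-bound argument and the interval view of the MRs can be checked numerically on a small example. The sketch below makes two simplifying assumptions that are not in the paper: delay is linear in wire length (instead of Elmore delay), and each subtree root is a single point `(x, y)` with its own RTSD `d`, so that an MR is a Manhattan ball of radius t − d. All function names and values are hypothetical.

```python
from itertools import combinations


def merged_rtsd(a, b):
    """RTSD of a zero-skew merge of a and b under a linear delay model."""
    (xa, ya, da), (xb, yb, db) = a, b
    dist = abs(xa - xb) + abs(ya - yb)
    # Balanced merge, or a detour on the faster side if one delay dominates.
    return max(da, db, (da + db + dist) / 2)


def get_mrtsd(subtree_set):
    """Largest pairwise merge delay, i.e., the lower bound t_MRTSD."""
    return max(merged_rtsd(a, b) for a, b in combinations(subtree_set, 2))


def target_region(subtree_set, t):
    """Intersect the MRs in rotated coordinates u = x + y, v = x - y."""
    u_lo = v_lo = float("-inf")
    u_hi = v_hi = float("inf")
    for x, y, d in subtree_set:
        r = t - d  # remaining wire budget of this subtree
        u, v = x + y, x - y
        u_lo, u_hi = max(u_lo, u - r), min(u_hi, u + r)
        v_lo, v_hi = max(v_lo, v - r), min(v_hi, v + r)
    if u_lo > u_hi or v_lo > v_hi:
        return None  # empty intersection
    return (u_lo, u_hi, v_lo, v_hi)


def target_point(region):
    """Middle point of the TR, mapped back to (x, y)."""
    u_lo, u_hi, v_lo, v_hi = region
    u, v = (u_lo + u_hi) / 2, (v_lo + v_hi) / 2
    return ((u + v) / 2, (u - v) / 2)


subtrees = [(0, 0, 2.0), (4, 0, 0.0), (0, 6, 1.0)]
t = get_mrtsd(subtrees)
region = target_region(subtrees, t)
print(t, region, target_point(region))
```

In rotated coordinates each MR becomes an axis-aligned box, so the common intersection is obtained by intersecting intervals in u and in v separately. For this example the TR is nonempty at t_MRTSD and becomes empty for any smaller delay budget.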
2) Remerging: In this section, we explain how the subtree-sets S_k created in the breaking step are remerged into a subtree T with an RTSD of t_MRTSD. An overview of the remerging is provided in Algorithm 4.
A subtree with the MRTSD can be constructed from a subtree-set as follows. First, the value t_MRTSD of the MRTSD is determined using get-MRTSD. Next, using the delay value t_MRTSD, the target point (TP) of the subtree-set is found using Algorithm 3: the TR is found as explained before, and the middle point of the TR is defined to be the TP of the subtree-set, as shown in Fig. 17(a). In fact, any point in the TR could have been selected to be the TP. Now, an NNG is created and subtrees are merged as in Section V-A1. However, we introduce a target constraint. The target constraint dictates that the subtree resulting from a merge operation in the NNG must be able to reach the TP using a piece of wire without exceeding the RTSD of t_MRTSD. Restricting the TR to a single TP allows us to specify the target constraint easily.
If a pair of subtrees cannot be merged using a standard zero-skew merge because of the target constraint, a set of n_max intermediate merging locations (IMLs) between the zero-skew merging point (ZSMP) and the TP is enumerated, as shown in Fig. 17. We set n_max = 10 based on a performance analysis. Next, the subtrees are merged at the IML closest to the ZSMP that does not violate the target constraint.
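The choice of a merging location can be sketched as a scan from the ZSMP toward the TP. The paper does not state how the IMLs are placed, so the sketch assumes they are evenly spaced on the straight line between the two points, with the last IML at the TP. The predicate `meets_target` and the numbers in the example are hypothetical.

```python
def pick_merge_location(zsmp, tp, meets_target, n_max=10):
    """Return the location closest to the ZSMP that meets the target constraint."""
    if meets_target(zsmp):
        return zsmp  # a standard zero-skew merge is possible
    for i in range(1, n_max + 1):
        f = i / n_max
        iml = (zsmp[0] + f * (tp[0] - zsmp[0]), zsmp[1] + f * (tp[1] - zsmp[1]))
        if meets_target(iml):
            return iml
    return None  # no candidate satisfies the constraint


# Toy target constraint (assumption): the merged subtree has 4.5 units of
# wire budget left, so it must lie within Manhattan distance 4.5 of the TP.
tp = (10.0, 0.0)
within_budget = lambda p: abs(p[0] - tp[0]) + abs(p[1] - tp[1]) <= 4.5
loc = pick_merge_location((0.0, 0.0), tp, within_budget)
print(loc)
```

Scanning from the ZSMP outward keeps the merge as close as possible to the zero-skew solution, so only as much extra wire is spent as the target constraint demands.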
3) Pruning: The remerged subtrees p_k have a decreasing RTSD in k and most likely an increasing capacitance in k. The slew of each remerged subtree is computed; the subtree (if any) that meets the slew constraint with the least cost is kept, and the other subtrees are removed. The capacitive cost of a subtree p_k is defined to be the sum of the wiring capacitance of the subtree and the capacitance of the minimum driving device that can drive the subtree. Let p_0 be the two paired subtrees that are merged (and then broken, remerged, and pruned) to create p_k. When we compute the cost of p_0, we add the capacitance of a wire equal to the distance separating the two unmerged subtrees to the wiring and device capacitance of the two respective subtrees. Fig. 18(a) shows that the RTSD of a reconstructed subtree can be reduced significantly. In Fig. 18(b), the slew versus cost trade-off curve is shown for the different reconstructions of a paired-tree p_0.
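The pruning rule amounts to a constrained minimum. The following sketch assumes each remerged subtree p_k is summarized by its slew and its capacitive cost; the dictionary layout and the sample numbers are hypothetical and chosen only to mimic slew falling and cost rising with k.

```python
def prune(candidates, s_slew):
    """Keep the least-cost candidate that meets the slew constraint, if any."""
    feasible = [c for c in candidates if c["slew"] <= s_slew]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["cost"])


# Made-up remerged subtrees p_1..p_3.
candidates = [
    {"k": 1, "slew": 112.0, "cost": 50.0},
    {"k": 2, "slew": 99.0, "cost": 54.0},
    {"k": 3, "slew": 93.0, "cost": 61.0},
]
best = prune(candidates, s_slew=100.0)
print(best)
```

Here p_1 violates the slew constraint, and p_2 is preferred over p_3 because it is feasible at a lower cost.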
This process is performed for all of the paired-trees p in P. Finally, each paired-tree p has been reconstructed into one new subtree with an associated cost.
C. Selection
In this step, only reconstructed subtrees that meet the slew constraint remain. However, one original subtree may be present in many pairs, and it can only be merged with another subtree once. A subset of nonoverlapping pairs of subtrees must be selected as shown in Fig. 14 . Such selected pairs are replaced with the corresponding reconstructed subtrees. Note that some subtrees may not be part of any pair. The original subtree is kept unchanged in that case.
Pairs are selected such that as many pairs as possible can be merged. If there are multiple alternatives of selecting pairs such that the same number of subtrees are obtained, the solution with the least cost is selected. We solve a weighted matching problem to select the pairs.
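The selection criterion can be illustrated with an exhaustive search that is only practical for a handful of subtrees; it is a stand-in for the weighted matching formulation, not the solver used in the paper. It assumes the total cost is the sum of the costs of the selected reconstructed pairs and of the subtrees left unpaired. All names and cost values are hypothetical.

```python
def select_pairs(subtrees, pair_cost, single_cost):
    """Maximize the number of pairs; break ties by the least total cost."""

    def solve(remaining):
        if not remaining:
            return (0, 0.0, [])
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unpaired.
        n, c, sel = solve(rest)
        best = (n, c + single_cost[first], sel)
        # Option 2: pair `first` with any feasible partner.
        for other in rest:
            key = frozenset((first, other))
            if key in pair_cost:
                n, c, sel = solve(tuple(s for s in rest if s != other))
                cand = (n + 1, c + pair_cost[key], [(first, other)] + sel)
                if (-cand[0], cand[1]) < (-best[0], best[1]):
                    best = cand
        return best

    return solve(tuple(subtrees))


# Made-up example: D is in no feasible pair, and A, B, C overlap pairwise.
single_cost = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0}
pair_cost = {
    frozenset("AB"): 18.0,
    frozenset("BC"): 17.0,
    frozenset("AC"): 19.0,
}
n_pairs, total_cost, selected = select_pairs("ABCD", pair_cost, single_cost)
print(n_pairs, total_cost, selected)
```

In this example only one pair can be selected, so the four subtrees are reduced to three, and the cheapest of the three candidate pairs is chosen.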
VII. EXPERIMENTAL RESULTS
In this section, our experimental results are presented. We have implemented our clock network synthesis tool in C++. The tool is run on a quad-core 3.1 GHz Linux machine with 7.7 GB of memory.
The ISPD 2010 benchmark suite is used to evaluate the proposed algorithms. We refer to those benchmarks as BM01-BM08. In the 2010 ISPD contest, the skew constraint is set to 7.5 ps and the local skew distance is 600 000 nm on BM01-02 and BM04-08. On BM03, the skew constraint is set to 4.99 ps and the local skew distance is 375 000 nm. The slew constraint is set to 100 ps on all the benchmarks. The supply voltage variations are set to ±7.5% and the wire width variations are set to ±5% around the nominal values. In [3], the ISPD constraints were adopted for the SLSV variations model. In this paper, which is evaluated with the SLSV model, we also adopt the ISPD parameter settings and constraints.
There are some common "issues" with how experimental results have been presented in variation-aware clock network synthesis. First, the robustness of the clock networks, in terms of 95%-skew, is reported in some works using only the ISPD variations model, which can be misleading [3], [6]. Second, many works do not present a good reference for their proposed optimization approaches. For example, when cross links are inserted in a clock tree in [7], the robustness and cost of a clock tree with no cross links inserted are not presented. Lastly, in nontree structures, the capacitance cost may not be an accurate estimate of the actual power consumption. We address the above issues in the following way.
1) We evaluate the robustness with the SLSV variations model.
2) The individual impact of each optimization step is demonstrated. Our clock network synthesis tool is used to create clock trees to use as a "fair" reference in terms of robustness and cost.
3) Both the actual power consumption and the capacitive cost of each clock network are reported.
To evaluate the robustness of each clock network, we report the clock network skew, which is denoted 95%-skew. Power denotes the sum of the average rising and falling power consumption in the 500 Monte Carlo simulations. To facilitate comparison with some previous work, we also report the capacitance cost of each of the networks.
The run-time results for [3] and [6] in Table II and [8] in Table III are obtained directly from the respective papers. When compared with existing studies on tree-based or near-tree-based approaches, the run-times of this paper are quite reasonable. However, as the results are obtained on different machines, we do not make any claims with respect to the run-times.
A. Performance of the Near-Tree Structures
In Table II , a complete set of experimental results is presented. Tree is a stand-alone default clock tree. CLNs are denoted as CLN-1, CLN-2, and CLN-3. The CLN structures constructed using the VR method are called VR-CLN-1, VR-CLN-2, and VR-CLN-3. The LMN structures constructed with and without sparsification are denoted S-LMN and LMN, respectively. When the VR technique is applied on the S-LMN structure, VR-S-LMN is obtained.
In Fig. 19 the normalized average robustness and power consumption of the different structures are displayed. From this figure we can observe that the CLN and the LMN structures are significantly more robust compared to the tree structures but consume more power. As the capacitance cost and power consumption correlate well, it seems the amount of short circuit current caused by redundant paths is negligible. Now, we more closely examine the performance of the various CLN structures in Fig. 19 . Consider the CLN structures in increasing number of cross links, i.e., tree, VR-CLN-3, CLN-3, VR-CLN-2, CLN-2, VR-CLN-1, and CLN-1. Here, tree is a clock tree with no cross links, and CLN-1 is the near-tree with the most cross links. It is not surprising that the normalized skew is decreasing with the number of cross links inserted and that the power consumption is increasing with more cross links inserted. When we compare the CLN structures with and without the VR method applied, we find that the VR method reduces the cost at the expense of some robustness. The trade-off seems very acceptable. It is very promising that the average trade-off between cost and robustness seems relatively smooth. Overall, the skew is improved by 11%-39%, at an expense of 3%-40% increase in power consumption, for the CLN structures.
Next, we shift our attention to the performance of the LMN structures in Fig. 19. Consider the following order: tree, VR-S-LMN, S-LMN, and LMN structures. With respect to this order, the normalized skew is decreasing and the power consumption is increasing, as expected. The sparsification and the VR method both reduce the cost at the expense of robustness. The LMN has 35% and 39% higher power consumption compared with the S-LMN and the VR-S-LMN, respectively. Conversely, the skews of the S-LMN and the VR-S-LMN are 15% and 19% higher, respectively, compared with the LMN.
On average, the two near-tree structures seem to achieve very similar cost-robustness trade-offs. In reality, we find that the two techniques are advantageous for different benchmarks. In general, the LMN structures perform very well on the large benchmarks such as BM01-02, and the CLN structures perform better on BM03-08. Specifically, it appears that the cross links cannot improve the robustness of the largest benchmarks, i.e., BM01-02.
One reason for the difference in performance is that the distance between the subtrees that share an edge in the SRG is larger on BM01-02 than on BM03-08. On BM01 and BM02, the average Manhattan distances are 981 and 1049 μm, respectively. On BM03-08, the average distances range between 395 and 717 μm. In [7], it was observed that long cross links (>900 μm) had limited impact on the robustness. We speculate that this may be why the CLN structures do not have a smooth robustness-cost trade-off on BM01-02.
Now, we examine the trade-off between the trees, the CLN structures, and the LMN structures on BM04 more closely. The robustness and power consumption of each structure are shown in Fig. 20. Every structure has a unique capacitance cost (we use capacitance to facilitate comparison with previous work) and 95%-skew. As the proposed methods are heuristics, we do not expect perfect trade-off curves between the different structures on each individual benchmark. For example, CLN-3 is expected to have lower skew compared with VR-CLN-3, but on BM04 we can see in Fig. 20 that the skew is actually a little higher. However, a fairly consistent trend of trade-off between robustness and power is observed across the entire benchmark suite.
If a structure has both higher skew and higher capacitive cost compared to another structure, we say the former is dominated by the latter. In Fig. 20, the nondominated structures are tree, VR-CLN-3, VR-S-LMN, S-LMN, CLN-2, and CLN-1. We observe that both LMN structures and CLN structures are among the nondominated structures. Depending on the skew constraint, a different near-tree topology would be advantageous. For example, if a network skew (or 95%-skew) of 4.70, 6.00, or 8.00 ps is required, then CLN-1, VR-S-LMN, or tree, respectively, is the structure that meets the skew constraint with the least power consumption. None of our topologies dominates the other topologies in all instances.
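The dominance relation defined above can be sketched as a simple filter. The structure names and the (skew, cost) values below are made up for illustration and do not correspond to the data in Fig. 20.

```python
def nondominated(structures):
    """Return the names of structures not dominated by any other structure.

    `structures` maps a name to a (skew, cost) tuple. A structure is
    dominated if another one has both lower skew and lower cost.
    """
    keep = []
    for name, (skew, cost) in structures.items():
        dominated = any(
            s < skew and c < cost
            for other, (s, c) in structures.items()
            if other != name
        )
        if not dominated:
            keep.append(name)
    return keep


# Hypothetical (skew in ps, normalized cost) pairs.
structures = {
    "S1": (8.0, 1.00),
    "S2": (6.0, 1.20),
    "S3": (6.5, 1.30),
    "S4": (4.7, 1.60),
}
print(nondominated(structures))  # ['S1', 'S2', 'S4']
```

S3 is dropped because S2 achieves lower skew at lower cost; the remaining structures form the trade-off curve from which a designer would pick the cheapest one meeting a given skew constraint.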
A limitation of the scope of this paper is that statistical static timing analysis is not performed to determine the best topology and the exact amount of redundancy that needs to be inserted to meet a specific robustness constraint.
B. Comparison With Previous Work
Now, we compare our work with the works in [3] and [6]. In Table II, it can be observed that on average our structures perform better than the solutions in [3] and [6], in terms of both capacitance cost and robustness. In Fig. 20, it can be observed that the solutions in [3] and [6] are dominated by our near-tree structures. Now, we examine some of our results in more detail. It is very promising that our near-tree structures are scalable to meet the constraints on the largest benchmarks BM01 and BM02. In fact, our LMN can meet the skew constraint with less than five times the capacitance compared with the mesh in [6]. On the smallest benchmark the CLN-1 structure can achieve much better robustness at a much better cost compared to previous works. In fact, CLN-1 can meet a 4.99 ps skew constraint on all five benchmarks BM03-08, which was a question raised at the ISPD 2010 contest [1].
As [8] reported results only on BM08, we limit our comparison with [8] to that circuit. Lee and Markov [8] constructed their clock networks using a single device size to avoid averaging out variations. In terms of variations, using a single device size is in fact equivalent to the SLSV model. In [8], results are reported with both uniform and Gaussian distributions. Under the uniform distribution, we compare the work in [8] with [3], [6], and a select subset of our near-tree structures. We obtain the skew values in Table III using the SLSV model with the same uniform variations as in [8]. In addition, we generate the supply voltage variations using the same Gaussian distribution as in [8]. Table III shows that our near-tree structures are advantageous under both variation models.
VIII. CONCLUSION
In this paper, we have proposed two near-tree structures and a general cost reduction method. Based on the experimental results, it appears that the robustness of a clock tree can be improved using different near-tree structures. In the future, we will explore mixing different types of redundancy in the same clock network and determining the amount of redundancy to be inserted based on the benchmark constraints.
