Abstract. In VLSI digital circuits, the clock network plays an important role in the overall performance of the chip. Clock skew and power dissipation are two major concerns in clock network synthesis. During topology generation, the locations of buffers and gates are usually not available, so despite local optimization, the global performance is limited. In this paper, a novel approach to topology generation with concurrent gate insertion is proposed. Meanwhile, a strict clock slew constraint is enforced with comprehensive buffer insertion techniques. Clock gating reduces the switched capacitance of the clock tree at an acceptable extra cost in the controller tree. Experimental results show that our approach performs well in reducing both clock skew and power dissipation.
Introduction
Clock signals are employed in VLSI digital systems to synchronize the active components of a design. Clock skew minimization has been a popular research topic over the past decades. Some early works [1, 2] mainly concentrated on evenly distributing the wirelength between the source and each terminal to approximate delay equalization. Afterwards, delay balancing [3] using the Elmore delay model [4] became prevalent because it provides more accurate timing information. The deferred-merge embedding (DME) technique proposed in [5] achieves zero clock skew with minimal wirelength. For topology generation, algorithms were proposed for unbuffered and ungated clock trees in [6], and for buffered but ungated clock trees in [7]. In the ISPD 2009 clock network synthesis contest [8], an objective related to voltage variation, named Clock Latency Range (CLR), was formulated. Subsequent work [9] addressed this objective.
The clock network contributes twenty to fifty percent of the total power usage [10]. Clock gating is an effective approach to power reduction in sequential circuits. The principal idea is to turn off idle modules and tree sections in order to cut unnecessary switching power. Clock gating can be applied at the logic level [11], the register-transfer level [12] and the architecture level [13]. Nevertheless, besides logical information, the physical location of the modules should also be taken into account to avoid wirelength overhead and the resulting waste of power. Several works considered both logical and physical concerns. The algorithm in [14] constructed a clock tree topology that takes advantage of the activity patterns of modules. Activity similarity was considered in [15]. A gating method for microprocessor design was proposed in [16]; it constructs the topology bottom-up with the objective of minimizing switched capacitance. Further, [17] discussed a comprehensive technique with a recursive computation of the effective switched capacitance and solution sampling on the merging segment set.
In this paper, we propose a novel synthesizer that constructs a binary clock tree bottom-up and optimizes clock skew and power dissipation simultaneously. The topology generator produces a buffered and gated clock tree, and the clock gates are inserted concurrently. The major advantage of our work is that the downstream masking information of subtrees is taken into account during each merging step. Our topology generation builds on the dual-MST algorithm [9], with a cost function extended for power awareness. In addition, we enforce a stricter slew constraint along the whole clock network, which tightens the constraints on buffer and gate locations. The experimental results show that our method can reduce the power consumption of the clock network with proper gate insertion. Meanwhile, the clock skew and PVT variation can still be maintained within an acceptable range.
The rest of the paper is organized as follows. Preliminaries on tree construction and capacitance are discussed in section 2. Section 3 presents our approach in detail, including power aware topology generation with concurrent buffer and gate insertion. Experimental results are shown in section 4. Finally, we conclude in section 5.
Preliminaries

Clock Tree and Controller Tree
Let $T = \{V, E\}$ denote the clock tree. $V = \{v_i \mid i = 1, 2, \dots, m_v\}$ is the set of nodes, and $E = \{e_j \mid j = 1, 2, \dots, m_v - 1\}$ is the set of clock edges, where $e_j$ connects node $v_j$ to its parent. Let $|e_j|$ denote the length of edge $e_j$. The root node has no edge assigned. Let $G = \{g_i \mid i = 1, 2, \dots, m_v - 1\}$ denote the set of gates; gate $g_j$ lies on edge $e_j$ and masks node $v_j$ directly. We use $S = \{v_k \mid k = 1, 2, \dots, m_s\}$ (where $m_s < m_v$) to denote the set of modules (the sinks, or leaf nodes). The remaining $(m_v - m_s)$ nodes are called internal nodes. The root is at level 0, and node $v_i$ is at level $n_i$ if there are $n_i$ edges on the path from $v_i$ to the root. Moreover, we assume that the topology of the clock tree is full binary: every internal node has exactly two children. The skew of $T$ is the difference between the longest and the shortest signal delay from the source to any sink.

As proposed in [16], we assume that the control logic is located at the center of the chip. Star routing is applied in the controller tree, denoted $T_{ctr}$. A control edge $EN_i$ in $T_{ctr}$ transmits the enable signal to the respective gate $g_i$ on edge $e_i$ of the clock tree $T$. An example of a clock tree $T$ and its controller tree $T_{ctr}$ is shown in figure 1.

During the operation of a circuit, each module has active and idle times, which are specified as activity patterns. The activity patterns can be obtained by simulating the design at the behavioral level [14]. Let $A_i$ denote the activity pattern of node $v_i$. It is a binary string in which 1s indicate the active periods and 0s indicate the idle periods of a sink or an internal node. If $v_i$ is a sink node, $A_i$ is read directly from the benchmark file. Otherwise, $v_i$ is an internal node with two children $v_L$ and $v_R$.
The clock signal at $v_i$ must be enabled whenever its left or right child is active. Therefore, $A_i$ is calculated by performing a bitwise OR on the activity patterns of $v_L$ and $v_R$, that is, $A_i = A_L \cup A_R$. An example of bottom-up activity pattern propagation is shown in figure 2.
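The bottom-up propagation described above can be sketched in a few lines of Python. This is only an illustration: the node names, the dictionary-based tree representation and the six-cycle patterns are hypothetical and are not taken from the benchmarks.

```python
def merge_activity(a_left: str, a_right: str) -> str:
    """Bitwise OR of two equal-length activity patterns given as '0'/'1' strings."""
    if len(a_left) != len(a_right):
        raise ValueError("activity patterns must have the same length")
    return "".join("1" if l == "1" or r == "1" else "0"
                   for l, r in zip(a_left, a_right))


def propagate(tree: dict, sink_patterns: dict, node: str) -> str:
    """Return the activity pattern of `node`, computing internal nodes bottom-up."""
    if node in sink_patterns:          # sink: pattern is given directly
        return sink_patterns[node]
    left, right = tree[node]           # internal node: OR of both children
    return merge_activity(propagate(tree, sink_patterns, left),
                          propagate(tree, sink_patterns, right))


# Hypothetical full binary tree: v7 is the root, v1..v4 are sinks.
tree = {"v5": ("v1", "v2"), "v6": ("v3", "v4"), "v7": ("v5", "v6")}
sinks = {"v1": "110000", "v2": "100100", "v3": "000011", "v4": "000110"}
root_pattern = propagate(tree, sinks, "v7")
```

In this toy input the root stays idle only in the cycle where all four sinks are idle, which is why gating closer to the sinks can mask more cycles than gating near the root.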
Let $P(A_i)$ denote the activity of node $v_i$, and $P_{tr}(A_i)$ its transition probability. Both are calculated from the activity pattern of $v_i$. The specific equations are given below
where 
Switched Capacitance
The power consumed by CMOS circuits consists of two components: static power and dynamic power. The static power is mostly determined by the feature size and other technology parameters. Therefore, in this paper we only consider dynamic power minimization. The dynamic power $P$ is proportional to $\alpha C f V_{dd}^2$, where $C$ is the total load capacitance of the circuit, $f$ is the frequency of the clock signal, $V_{dd}$ is the supply voltage, and $\alpha$ is the number of switching events in each clock cycle. For the clock tree $\alpha = 2$, because there is one rising and one falling edge in each clock period; for the controller tree $\alpha = 1$. Since $f$ and $V_{dd}$ are constant parameters of the digital circuit, we can use the switched capacitance as a measure of the power usage. Assume that a subtree $T_i$ rooted at $v_i$ has a gate $g_i$ inserted, and the controller tree is denoted as $T_{ctr}$.
The power consumption of a clock network is directly proportional to the average switched capacitance per clock cycle. The total switched capacitance is contributed by the gated and buffered clock tree $T$ and the controller tree $T_{ctr}$. In order to reduce the switching activity, modules and clock tree sections can be disabled by clock gates during their inactive clock periods. From the above example, we can see that the original capacitance of node
, the capacitance will be reduced with the insertion of $g_i$. A power aware clock tree topology with proper buffer and gate insertion efficiently reduces the switched capacitance, and hence the power usage of the circuit. Given the physical locations of the modules together with the models of wire, buffer and gate, the objective of our work is to construct a buffered and gated clock network and a controller network such that the average switched capacitance is minimized, subject to two constraints: nominal zero skew and a maximum slew rate.
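To make the trade-off between the clock tree and the controller tree concrete, the following Python sketch estimates the switched capacitance of one gated subtree. The exact definitions used in this paper are not reproduced here, so everything in the sketch is an assumption: the activity is taken as the fraction of active cycles, the transition probability as the fraction of adjacent cycle pairs in which the pattern changes, and the gated cost as the activity-weighted subtree capacitance plus the transition-weighted capacitance of the enable wire. The function names and numbers are hypothetical.

```python
def activity(pattern: str) -> float:
    """Assumed P(A): fraction of cycles in which the node is active."""
    return pattern.count("1") / len(pattern)


def transition_probability(pattern: str) -> float:
    """Assumed P_tr(A): fraction of adjacent cycle pairs where the pattern toggles."""
    if len(pattern) < 2:
        return 0.0
    toggles = sum(1 for a, b in zip(pattern, pattern[1:]) if a != b)
    return toggles / (len(pattern) - 1)


def gated_switched_cap(pattern: str, c_subtree: float, c_enable: float) -> float:
    """Assumed model: the subtree switches only while enabled, and the enable
    wire in the controller tree switches whenever the pattern toggles."""
    return activity(pattern) * c_subtree + transition_probability(pattern) * c_enable


# Hypothetical values in fF: a 100 fF subtree and an 8 fF enable wire.
ungated = 100.0
gated = gated_switched_cap("11000", 100.0, 8.0)
saves_power = gated < ungated
```

Under this simplified model a gate pays off only when the subtree is idle often enough to outweigh the enable wire; a subtree that is always active gains nothing from gating.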
Methodology
We build our clock tree with the dual-MST construction method [9], which yields a clock tree that is close to fully symmetric. In this paper, the method is improved with a new cost function that takes both distance and power saving into account. As a result, the resulting topology achieves both low power usage and small clock skew. A recursive buffer and clock gate insertion method is developed for bottom-up merging. Blockage handling is also included, because buffers and gates cannot be placed inside blockage regions. The Elmore model [4] is applied for clock delay computation, and the DME technique [18] is applied for wirelength minimization. Thus, a segment rather than a point is used to represent the set of merging locations, and deferred embedding is applied to reduce the total wirelength.
Power Aware Topology Generation
In order to save power, nodes with more similar activity patterns should have a higher priority to be matched. Let $v_a$ and $v_b$ be a pair of nodes, as shown in figure 2. If the corresponding activity patterns $A_a$ and $A_b$ are similar, the resulting activity $A_i$ has a shorter active period and incurs a smaller power cost. Besides the activity patterns, an estimate of the merging cost $P_{wr}(v_a, v_b)$ is also required. This can be determined in multiple ways. For instance, we can actually merge the two nodes to obtain the exact connection information. However, this requires exact buffer insertion and wire balancing, which is time-consuming. Instead, we develop a new method for estimating the potential switched capacitance. The Manhattan distance between nodes $v_a$ and $v_b$ is denoted by $D(v_a, v_b)$. The Elmore delay difference of these two nodes is denoted by $DLY(v_a, v_b)$. The delay and power consumption per unit wirelength are denoted by $\rho_D$ and $\rho_P$ respectively, and are computed in advance by simulation. If
is smaller than $D(v_a, v_b)$, then the two nodes can be merged without wire snaking, and the corresponding equation for the power cost is shown below
Otherwise, snaking is required, as shown in the following equation
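The two cases can be illustrated with a rough Python sketch. It does not reproduce the paper's equations; it is one plausible reading under stated assumptions: the delay difference is converted into an equivalent wirelength by dividing by $\rho_D$, that length is compared with the Manhattan distance, and the power cost is $\rho_P$ times the wirelength actually needed. Any activity weighting is omitted, and the function name and numeric values are hypothetical.

```python
def manhattan(p, q):
    """Manhattan distance D between two points given as (x, y) tuples."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def estimate_merge_power(pos_a, pos_b, delay_a, delay_b, rho_d, rho_p):
    """Return (estimated power cost, snaking_needed) for merging two nodes.

    Assumption: delay grows linearly with wirelength at rate rho_d, so a delay
    difference DLY can be balanced with DLY / rho_d units of extra wire.
    """
    dist = manhattan(pos_a, pos_b)
    balance_len = abs(delay_a - delay_b) / rho_d
    snaking = balance_len > dist
    # Without snaking the direct connection suffices; with snaking the wire
    # must be long enough to absorb the whole delay difference.
    wire_len = balance_len if snaking else dist
    return rho_p * wire_len, snaking


# Hypothetical inputs: distance 7, rho_d = 0.5 delay units per unit length.
cost_direct, snake_direct = estimate_merge_power((0, 0), (3, 4), 10.0, 12.0, 0.5, 2.0)
cost_snaked, snake_snaked = estimate_merge_power((0, 0), (3, 4), 10.0, 15.0, 0.5, 2.0)
```

The point of such an estimate is that it avoids an actual merge with buffer insertion and wire balancing while still penalizing pairs whose delay imbalance would force extra wire.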
An improved power aware dual-MST geometric matching technique is developed for topology construction; a definition of one iteration of geometric matching can be found in [2]. The detailed description is shown in procedure 1. It is a weighted perfect matching approach. The cost of matching two nodes $v_i$ and $v_j$ is denoted as $f_c(e_{i,j})$. Let $M$ denote the matching result of $G$. $M$ is composed of a group of edges and is a subset of $E$. The maximal pairing cost of $M$ is denoted as $C_{max}$ and defined as below. We approach a symmetric clock tree by reducing $C_{max}$ at each level. The merging cost $f_c(v_a, v_b)$ is shown as below. $\alpha$ and $\beta$ are the weights of the Manhattan distance and the estimated power cost, respectively.
With this weighted cost function, node pairs with more similar switching activity and a shorter distance have a higher priority to be matched. Because our topology generation is based on concurrent gate insertion, the downstream information of the two merging nodes is accurate.
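The matching step can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation. It assumes that $f_c$ is a linear combination $\alpha D + \beta P$, uses the fraction of active cycles of the merged pattern as a hypothetical stand-in for the power estimate, and realizes the dual-MST split by running Kruskal's algorithm until $|V| - 2$ edges are inserted. Buffers, gates and merging segments are ignored.

```python
from itertools import combinations


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def split_by_dual_mst(nodes, cost):
    """Kruskal's algorithm stopped after |V| - 2 edges, which leaves two trees."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    added = 0
    for u, v in sorted(combinations(nodes, 2), key=lambda e: cost(*e)):
        if added == len(nodes) - 2:
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            added += 1
    root = find(nodes[0])
    first = [v for v in nodes if find(v) == root]
    second = [v for v in nodes if find(v) != root]
    return first, second


def partition(nodes, cost, pairs):
    """Recursively split the node set and record matched pairs in `pairs`."""
    if len(nodes) <= 1:
        return
    if len(nodes) == 2:
        pairs.append((nodes[0], nodes[1]))
        return
    first, second = split_by_dual_mst(nodes, cost)
    if len(first) % 2 == 1 and len(second) % 2 == 1:
        # Both halves are odd: match the cheapest pair across the cut first.
        m, n = min(((u, v) for u in first for v in second),
                   key=lambda e: cost(*e))
        pairs.append((m, n))
        first.remove(m)
        second.remove(n)
    partition(first, cost, pairs)
    partition(second, cost, pairs)


# Hypothetical sinks on a unit square with four-cycle activity patterns.
pos = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}
act = {"a": "1100", "b": "1100", "c": "0011", "d": "0011"}
ALPHA, BETA = 1.0, 10.0  # assumed weights of distance and power estimate


def merged_activity(u, v):
    """Stand-in power estimate: fraction of active cycles after the OR merge."""
    merged = [x == "1" or y == "1" for x, y in zip(act[u], act[v])]
    return sum(merged) / len(merged)


def f_c(u, v):
    return ALPHA * manhattan(pos[u], pos[v]) + BETA * merged_activity(u, v)


pairs = []
partition(list(pos), f_c, pairs)
```

In this toy input the activity term steers the matching toward sinks with identical patterns, so `a` is paired with `b` and `c` with `d`, even though other neighbors are equally close.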
Concurrent Gate and Buffer Insertion
Procedure 1. Partition(G)
Require: $G = \{V, E\}$ is a complete graph, $E$ is sorted in ascending order of $f_c(e_{i,j})$.
    if $|V| \le 1$ then
        return;
    else if $|V| = 2$ then
        merge($v_1$, $v_2$); return;
    else
        Build the dual-MST with $|V| - 2$ edges inserted.
        Two subgraphs $G' = \{V', E'\}$ and $G'' = \{V'', E''\}$ are generated.
        Two minimum spanning trees $st'$ and $st''$ for $V'$ and $V''$ are generated.
        if $|V'|$ is odd and $|V''|$ is odd then
            $e_{m,n} = \arg\min_{e_{i,j}} \{ f_c(e_{i,j}) \mid \forall n_i \in V', \forall n_j \in V'' \}$;
            merge($v_m$, $v_n$);
            remove $v_m$ from $V'$; remove $v_n$ from $V''$;
            remove $e_{m,x}$ from $E'$, $\forall x \in V'$; remove $e_{n,y}$ from $E''$, $\forall y \in V''$;
        end if
        partition($G'$); partition($G''$);
        return;
    end if

A recursive buffer and gate insertion technique is developed to meet three objectives: (1) the slew rate constraint, (2) clock skew minimization and (3) power usage reduction. Buffers supply drive strength to restrict the signal transition time, and clock gate insertion reduces the switched capacitance by disabling idle sections. Simulating the signal slew during synthesis takes too much time and is impractical. Hence we build look-up tables in advance for slew reference; they estimate the driving ability under diverse circumstances. We model the buffers and gates with the corresponding attributes for Elmore delay computation. Previous work [19] already proposed constructing a buffered clock tree with zero clock skew. In our work, we apply a similar approach to both buffer and clock gate insertion. The input and output capacitance and the resistance of the buffers and clock gates are obtained first. The delay of wires, buffers and clock gates can then be computed with the Elmore RC model.
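As a small worked example of the Elmore RC computation mentioned above, the Python sketch below evaluates the delay of one buffered stage: a buffer driving a uniform wire that ends in a lumped load. The wire constants are the unit resistance and capacitance reported in the experimental setup; the buffer resistance, intrinsic delay, wire length and load are hypothetical values chosen for illustration, and the buffer is assumed to follow a simple intrinsic-delay-plus-$R_b C$ model.

```python
R_WIRE_OHM_PER_NM = 0.0003   # unit wire resistance from the experimental setup
C_WIRE_FF_PER_NM = 0.00016   # unit wire capacitance from the experimental setup
FS_PER_PS = 1000.0           # ohm * fF = femtoseconds; convert to picoseconds


def wire_delay_ps(length_nm: float, load_ff: float) -> float:
    """Elmore delay of a uniform wire driving a lumped load capacitance."""
    r = R_WIRE_OHM_PER_NM * length_nm
    c = C_WIRE_FF_PER_NM * length_nm
    # A distributed wire sees half of its own capacitance plus the full load.
    return r * (c / 2.0 + load_ff) / FS_PER_PS


def buffer_delay_ps(d_b_ps: float, r_b_ohm: float, load_ff: float) -> float:
    """Assumed buffer model: intrinsic delay plus driver resistance times load."""
    return d_b_ps + r_b_ohm * load_ff / FS_PER_PS


def stage_delay_ps(d_b_ps, r_b_ohm, length_nm, load_ff):
    """Buffer driving a wire and a sink: the buffer sees wire plus load capacitance."""
    wire_cap = C_WIRE_FF_PER_NM * length_nm
    return (buffer_delay_ps(d_b_ps, r_b_ohm, wire_cap + load_ff)
            + wire_delay_ps(length_nm, load_ff))


# Hypothetical stage: 100 um of wire, 10 fF sink, 100 ohm buffer with 20 ps delay.
wire_only = wire_delay_ps(100_000, 10.0)
full_stage = stage_delay_ps(20.0, 100.0, 100_000, 10.0)
```

The same two functions cover clock gates if a gate is modeled like a buffer with its own input capacitance, driver resistance and internal delay.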
In our work, we try to keep the number of buffer and gate levels on every source-to-sink clock path exactly the same. During bottom-up binary merging, we first examine the gate levels of the two downstream subtrees. If they differ by two or more, a penalty cost is applied, and such a matching result will probably be discarded because of its large cost. Buffer levels are balanced accordingly. This level balancing significantly reduces the clock skew and the negative effect of signal variation.
Here we describe our gate insertion technique for a given matching result. We first define three kinds of gate insertion: virtual gate insertion at the upstream level, temporal gate insertion at the current level, and no gate insertion. Temporal insertion is controlled by the balancing of gate levels and is further divided into two kinds of single gate insertion and one kind of back-to-back double gate insertion. A gate is assumed to be inserted as close as possible to the internal merging node in order to minimize the switched capacitance. Since the DME technique is applied in our work, we take the middle point of the merging segment as the gate location. The three options are compared by their resulting switched capacitance, denoted $SC_{vir}$, $SC_{tmp}$ and $SC_{non}$, respectively. If the virtual insertion or no insertion yields the smallest switched capacitance, then leaving out the gate at the current level costs less than temporal insertion, and we discard any gate insertion at the current level. Otherwise, temporal gate insertion gives the smallest switched capacitance, and we accept the insertion of gates.
An example is shown in figure 2. The activity $A_i$ equals $A_a \cup A_b$. The edges connecting the two nodes to the merging node are denoted as $e_a$ and $e_b$, and $C_{e_a}$ and $C_{e_b}$ are their corresponding capacitance costs. The equations for the three resulting switched capacitances are shown as below
Note that we only give the equation of $SC_{tmp}$ for a single gate insertion at node $v_a$. The other two equations can be derived in a similar way.
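The selection rule among the three options can be written as a short Python sketch. It takes the three switched capacitance values as already computed inputs, since their equations are not reproduced here. The function name, the returned labels and the tie-breaking choice (a tie is resolved in favor of not inserting a gate) are assumptions made for illustration.

```python
def choose_gate_insertion(sc_vir: float, sc_tmp: float, sc_non: float) -> str:
    """Decide whether to insert a gate at the current level.

    A gate is accepted only when temporal insertion gives strictly the
    smallest switched capacitance; otherwise no gate is placed here and the
    decision is left to the upstream level.
    """
    if sc_tmp < min(sc_vir, sc_non):
        return "insert"
    return "skip"


# Hypothetical switched capacitance values in fF.
decision_gate = choose_gate_insertion(sc_vir=52.0, sc_tmp=47.5, sc_non=60.0)
decision_skip = choose_gate_insertion(sc_vir=41.0, sc_tmp=47.5, sc_non=60.0)
```

Because the comparison is made at every merging step, a gate that is skipped at one level can still be realized at the parent level if the virtual option was the cheapest.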
Experimental Results
In this section, our experimental results are presented. We implemented our clock network synthesizer in the C programming language. The binary is executed on a Linux machine with an Intel Core2 Quad 2.4 GHz CPU and 4 GB of memory. The benchmark circuits used in the experiments are from the ISPD 2009 CNS contest [8]. Detailed information on the benchmark circuits is shown in table 2.

In our experiments, one type of wire and one type of buffer are used in the clock tree synthesizer. The unit resistance of the wire is 0.0003 Ω/nm, and the unit capacitance of the wire is 0.00016 fF/nm. The configuration of the buffer in different sizes is shown in table 1. In our synthesizer, the maximum buffer size is set to 6, so we list the attributes of buffer sizes up to 6. The corresponding attributes of a gate are listed in the last row of table 1. This table is generated from our SPICE simulation statistics. $C_b$ is the input capacitance, $R_b$ is the driver resistance and $d_b$ is the internal delay of a buffer. During the evaluation, the supply voltage is set to $V_{dd} = 1.0$ V. The PTM model applied in our simulation is at the 45 nm scale.

A summary of the performance of our clock tree after the insertion of clock gates is shown in table 3. We run our program with different values of $\alpha$ and $\beta$ for topology tuning. The clock skew (SKEW), total capacitance (TC), optimal capacitance (OSC), switched capacitance (SC) and CPU time are listed. The units are picoseconds (ps) for SKEW, seconds for CPU and femtofarads (fF) for capacitance. TC denotes the original total capacitance of the clock tree without gate insertion. OSC denotes the resulting capacitance after disabling all the idle periods at each node. SC denotes the resulting switched capacitance of our gated clock tree. SC is smaller than TC in most cases, which indicates an effective power reduction in our gated clock tree construction.
The nominal skew of each clock tree is zero. Additionally, we use NGSPICE for further evaluation to obtain an accurate skew estimate, as listed in the table. The activity patterns of all the sinks are generated according to the instruction and RTL description used in [16]. The length of the activity pattern is 10000 for every benchmark. Previous works applied a loose constraint on slew or driving strength, for instance, $\le 20 \times C_g$ for a buffer or gate insertion in [17, 16]. The work in [14] did not involve clock routing and synthesis. In our program, however, the transition time (slew) is kept under 100 ps throughout the whole clock network, so more buffers are inserted to satisfy this rule. For these reasons, a direct comparison with previous works is very difficult to include in this paper. We expect the power cost of our work to be larger than that of the previous ones, but the signal transition time is more consistent, which makes the result more practical in use. Generally, in our work the switched capacitance is reduced by around 10% with the insertion of clock gates. Meanwhile, the clock skew is only about 20 ps on average. The runtime of our program is less than 3 seconds, which represents good efficiency.
Conclusion
In conclusion, power saving and clock skew are two major concerns in clock network synthesis. A power aware topology generation method with concurrent buffer and gate insertion is proposed in this paper. It is developed to optimize the clock skew and the power dissipation of a clock distribution network simultaneously. Experimental results show that, with proper clock gate insertion, our method reduces the switched capacitance, and hence the power consumption, of the clock network by around 10%. Meanwhile, the clock skew can still be maintained within an acceptable range.
