Abstract
Introduction
Clock gating is an effective way of reducing power dissipation in digital circuits. In a typical synchronous circuit, e.g. a general-purpose microprocessor, only a portion of the circuit is active at any given time. By shutting down the idle modules, we can prevent the circuit from consuming unnecessary power. In addition, we can shut down a portion of the clock tree by masking off the clock at an internal node of the tree using an AND-gate. This prevents unnecessary switching in the clock tree and saves power there, in addition to the power saved in the modules.
In this paper, we address an instance of the gated clock routing problem. In our gated clock tree, we insert gates immediately after every internal node of the clock tree to minimize the dynamic power consumption. These gates also serve as buffers and can be sized to adjust the phase delay of the clock signal. They are turned on and off by control signals generated by a centralized gate controller. An instance of the gated clock tree is shown in Figure 1, where the sinks correspond to the locations of modules and the Steiner nodes are the internal nodes of the clock tree.
A gate in the clock tree must be enabled (i.e. the control signal is true) whenever any of its descendant gates are enabled. This suggests that the control signal of a gate is the OR function of the control signals of its descendant gates.
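As an illustration of this OR relationship, the following minimal Python sketch computes the enable value of every node in a small tree from the set of active modules. The `Node` class, the `enable_signals` function, and the node names are hypothetical and serve only as an example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Node:
    """A clock-tree node: a sink (module) if it has no children, else a Steiner node."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def enable_signals(root: Node, active: Set[str]) -> Dict[str, bool]:
    """Return EN for every node: True iff some module in its subtree is active."""
    en: Dict[str, bool] = {}

    def visit(node: Node) -> bool:
        children = [c for c in (node.left, node.right) if c is not None]
        if not children:
            # A sink is enabled exactly when its module is active.
            en[node.name] = node.name in active
        else:
            # Visit every child first (no short-circuit), then OR the results.
            results = [visit(c) for c in children]
            en[node.name] = any(results)
        return en[node.name]

    visit(root)
    return en
```

With sinks M1 and M2 merged at S1, and S1 merged with M3 at S2, activating only M3 enables M3 and S2 while S1 and its sinks stay off.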
In [5], a gated clock tree topology construction based on module activity patterns was suggested. The authors used high-level synthesis information to extract the activity patterns. However, they did not consider the routing of the clock tree and the control signals, nor the resulting power dissipation and area. In contrast, our method considers all of these. In addition, we propose a method for clock tree construction based on the instruction statistics of the processor, which can be extracted from instruction level simulation of the processor with a number of benchmark programs. The instruction statistics are used to extract the activities of the nodes and the switching activities of the control signals, as will be discussed in the following sections. More precisely, we will investigate how the probabilistic information (instruction statistics) and the geometrical information (sink locations) are used to guide the low power clock routing.
The remainder of this paper is organized as follows. Section 2 gives the terminology and the precise problem statement. Section 3 describes how the probabilities of the gate control signals are calculated. Section 4 presents the clock tree construction algorithm. Sections 5 and 6 show our experimental results and conclusions.
Problem Definition
We assume that the topology of the clock tree is full binary, that is, every non-leaf node has exactly two children. However, the tree is not necessarily balanced (the depths of the leaf nodes may differ). We assume that the controller is located at the center of the chip. The control signal routing is a star routing, as shown in Figure 1. We denote the controller tree as $S$. There may be a load capacitance associated with each node of the clock tree. Including the node capacitance $C_i$ at node $v_i$, the switched capacitance of $e_i$ is given by
The total switched capacitance in the clock tree is therefore
Switched capacitance in the controller tree
Similarly, from Equation (1) 
The objective of our gated clock routing is to find trees $T$ and $S$ so as to minimize the total switched capacitance $W(T) + W(S)$ subject to zero skew constraints. Notice that the signal probability of $EN_i$ determines the switched capacitance in the clock tree, whereas its transition probability determines the switched capacitance in the controller tree.
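The dependence on the two probabilities can be made concrete with a small Python sketch. It assumes, up to constant factors, that an edge's switched capacitance is its wire capacitance plus the attached node or gate capacitance, weighted by $P(EN_i)$ in the clock tree and by $P_{tr}(EN_i)$ in the controller tree. This weighting, the `GatedEdge` record, and the parameter names `c_wire` and `c_gate` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class GatedEdge:
    clock_wire_len: float  # |e_i|: length of the clock-tree edge behind the gate
    node_cap: float        # C_i: load capacitance at node v_i
    ctrl_wire_len: float   # length of the star-routed enable wire to the gate
    p_en: float            # P(EN_i): signal probability of the enable
    p_tr_en: float         # P_tr(EN_i): transition probability of the enable


def switched_capacitance(edges: Iterable[GatedEdge],
                         c_wire: float, c_gate: float) -> Tuple[float, float]:
    """Return (W_T, W_S) under the assumed probability-weighted model."""
    edges = list(edges)
    # Clock tree: an edge switches only while its gate is enabled.
    w_t = sum((c_wire * e.clock_wire_len + e.node_cap) * e.p_en for e in edges)
    # Controller tree: an enable wire switches only when EN_i changes value.
    w_s = sum((c_wire * e.ctrl_wire_len + c_gate) * e.p_tr_en for e in edges)
    return w_t, w_s
```

The quantity to minimize is then `w_t + w_s`, which makes the trade-off visible: adding a gate lowers the first sum but adds a term to the second.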
Computation of $P(EN_i)$ and $P_{tr}(EN_i)$
To calculate $W(T)$ and $W(S)$, we need to compute the signal probabilities $P(EN_i)$ and the transition probabilities $P_{tr}(EN_i)$. Let $P(M_i)$ be the probability that $M_i$ is active (i.e. $M_i$ receives the clock signal). Suppose $v_i$ has modules $M_1, M_2, \dots, M_l$ at the leaves. If any of these modules is active, then $EN_i$ must be turned on. Thus $P(EN_i)$ is given by

$$P(EN_i) = P(M_1 \lor M_2 \lor \dots \lor M_l).$$

To find $P_{tr}(EN_i)$, we need the module activation statistics over consecutive clock cycles. Let $AT(M_i)$ be a two-bit activation tag which represents the activity of module $M_i$ in two consecutive clock cycles. Cases 1 and 4 do not cause a transition of $EN_i$, so we only need to consider cases 2 and 3 for the computation of the transition probability.
If Register Transfer Level (RTL) simulation is used to find these probabilities, a huge number of clock-by-clock module usages have to be recorded. Certainly, the time complexity will be very large, especially for general-purpose microprocessors. So we propose a method for computing activities using more efficient instruction level simulation of the processor and knowledge about the RTL description of the processor.
RTL description
For simplicity, we assume that the microprocessor has four instructions and six modules throughout the rest of this section. The RTL description of each instruction tells us what modules are used to execute each instruction. For example, we may have the following RTL description of the instructions.
[Table 1: Instruction, Used Modules]

[...] However, the instruction stream can be very long. To obtain accurate instruction statistics, we may need millions of instructions: because some instructions are rarely executed, the stream must be very long to yield reasonable probability values for them. Therefore, the above brute-force method is very expensive.
To overcome this problem, we propose a method that computes all the necessary probabilities from the tables that can be generated by scanning the instruction stream just once.
Table-driven probability computation
The Instruction Frequency Table (IFT) lists the probability that each instruction is executed on average. By scanning the previous instruction stream, we obtain the IFT in Table 2. [...] Table 2, which is 0.55. Any signal probability $P(EN_i)$ can be found using Table 1 and Table 2 without rescanning the instruction stream. It was shown in [4] that the time complexity of computing this probability is $O(KL)$, where $K$ is the total number of instructions and $L$ is the maximum number of modules used by any instruction ($K = L = 4$ in our example).

[...] Table (IMATT) lists $AT(M_i)$ for every possible combination of two consecutive instructions. In addition, the IMATT keeps the probability that the two instructions occur in two consecutive clock cycles. By scanning the previous instruction stream, we obtain the IMATT in Table 3. In Table 3, if the corresponding modules' activation tags cause $EN_i$ to make a transition, the probability on that row should be added to the transition probability of $EN_i$. The time complexity of computing the transition probability of $EN_i$ is $O(K^2 N)$, where $K$ is the total number of instructions and $N$ is the total number of modules.
[Table 2: IFT (column: Instruction)]

[Table 3: IMATT (Instruction Transition-Module Activation ...)]
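The table-driven idea can be sketched in Python as follows. The sketch assumes one instruction per clock cycle and collapses the per-module activation tags into a single comparison of the enable value in the two cycles. The function names and the `used` mapping (instruction to set of used modules, playing the role of Table 1) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_tables(stream: List[str]) -> Tuple[Dict[str, float], Dict[Tuple[str, str], float]]:
    """Scan the stream once: instruction frequencies and consecutive-pair frequencies."""
    if len(stream) < 2:
        raise ValueError("need at least two instructions")
    ift = Counter(stream)                       # role of the IFT
    pairs = Counter(zip(stream, stream[1:]))    # role of the pair probabilities in the IMATT
    ift_p = {instr: k / len(stream) for instr, k in ift.items()}
    pair_p = {pair: k / (len(stream) - 1) for pair, k in pairs.items()}
    return ift_p, pair_p


def signal_probability(subtree: Set[str], used: Dict[str, Set[str]],
                       ift_p: Dict[str, float]) -> float:
    """P(EN): total frequency of instructions using at least one module in the subtree."""
    return sum(p for instr, p in ift_p.items() if used[instr] & subtree)


def transition_probability(subtree: Set[str], used: Dict[str, Set[str]],
                           pair_p: Dict[Tuple[str, str], float]) -> float:
    """P_tr(EN): total frequency of consecutive pairs in which EN changes value."""
    total = 0.0
    for (first, second), p in pair_p.items():
        en_first = bool(used[first] & subtree)
        en_second = bool(used[second] & subtree)
        if en_first != en_second:   # a 0->1 or 1->0 transition of the enable
            total += p
    return total
```

Once `build_tables` has run, both probabilities can be evaluated for any candidate subtree without touching the instruction stream again, which is what makes repeated evaluation during tree construction affordable.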
Clock Tree Construction
Delay modeling
To estimate the phase delay of the clock tree, we use the Elmore delay model as was used in [6] for zero-skew clock routing. Inserting gates reduces the subtree capacitance in the Elmore delay computation, thereby reducing the phase delay.
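The effect of a gate on the Elmore delay can be illustrated with a short Python sketch. It assumes a simple gate model in which a gated edge presents only the gate input capacitance `c_gate` to the upstream tree and drives its own wire through an output resistance `r_gate`. This gate model and all names below are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    wire_len: float = 0.0   # length of the wire from the parent to this node
    load_cap: float = 0.0   # load capacitance at this node
    gated: bool = False     # assumed: a gate sits at the top of this node's edge
    children: List["TreeNode"] = field(default_factory=list)


def subtree_cap(node: TreeNode, c_wire: float, c_gate: float) -> float:
    """Capacitance at `node` and below, excluding the wire that feeds it."""
    return node.load_cap + sum(edge_cap(ch, c_wire, c_gate) for ch in node.children)


def edge_cap(node: TreeNode, c_wire: float, c_gate: float) -> float:
    """Capacitance the parent sees when looking into the edge that feeds `node`."""
    if node.gated:
        return c_gate  # the gate hides the wire and the subtree behind it
    return c_wire * node.wire_len + subtree_cap(node, c_wire, c_gate)


def edge_delay(node: TreeNode, r_wire: float, c_wire: float,
               c_gate: float, r_gate: float) -> float:
    """Elmore delay across the edge feeding `node` (plus gate drive, if gated)."""
    c_sub = subtree_cap(node, c_wire, c_gate)
    delay = r_wire * node.wire_len * (c_wire * node.wire_len / 2 + c_sub)
    if node.gated:
        delay += r_gate * (c_wire * node.wire_len + c_sub)
    return delay
```

For a two-edge chain, gating the lower edge shrinks the capacitance that the upper edge must charge, so the delay of the upper edge drops even though the lower edge itself is unchanged.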
Minimum switched capacitance heuristic
A bottom-up merging phase followed by a top-down placement phase is commonly used in clock routing. In [2], the merging sector is a line segment with slope ±1 that represents the possible locations of the Steiner node at which two subtrees are merged. These merging sectors are found in bottom-up fashion, and the actual locations of the Steiner nodes are then determined in top-down fashion (see the example in Figure 2). The nearest-neighbor heuristic of [3] greedily merges the two nodes whose merging sectors are geometrically closest. Our method is also greedy, but the merging sequence is determined by the switched capacitance.
Let $ms(v_i)$ be the merging sector of $v_i$. Suppose we try to merge $(ms(v_i), ms(v_j))$ and the root of the merged tree is $v_k$. We can uniquely determine $|e_i|$ and $|e_j|$ such that the zero skew constraint is satisfied. As mentioned before, we assume that the gate controller is located at the center of the chip. Let this center be CP. To compute the switched capacitance in an edge of the controller tree, we need to estimate the distance between the gate location (the location of the Steiner node) and the CP. Since we do not know the exact locations of the Steiner nodes during the bottom-up phase, we approximate the edge length of the controller tree as the distance between the CP and the middle point of the merging sector. Let this distance be $dist(CP, mid(ms(v_j)))$.
The switched capacitance SC after the merge of $(ms(v_i), ms(v_j))$ is given by Equation (3). When we merge subtrees bottom-up, we choose the pair of sectors whose merge results in the smallest switched capacitance as given in Equation (3). The detailed algorithm for the clock tree construction is similar to that of [4]. We summarize the algorithm outline below (SC is short for switched capacitance).
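For reference, the way the zero skew constraint fixes the two edge lengths can be sketched in Python using the standard Elmore-delay balance condition for two subtrees joined by a wire. The sketch ignores the gates at the top of each edge, so it is a simplification and not necessarily the exact computation used here. The function name and parameters are hypothetical.

```python
from typing import Tuple


def zero_skew_split(t1: float, c1: float, t2: float, c2: float,
                    length: float, r_wire: float, c_wire: float) -> Tuple[float, float]:
    """Split a wire of total `length` between subtree 1 and subtree 2.

    t1, t2: delays from each subtree root to its sinks.
    c1, c2: capacitances of the two subtrees.
    Returns (len1, len2) such that the Elmore delays through both branches match.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    # Solve r*x*l*(c*x*l/2 + c1) + t1 == r*(1-x)*l*(c*(1-x)*l/2 + c2) + t2 for x.
    num = (t2 - t1) + r_wire * length * (c2 + c_wire * length / 2)
    den = r_wire * length * (c_wire * length + c1 + c2)
    x = num / den
    if not 0.0 <= x <= 1.0:
        # The subtrees are too unbalanced for this wire; one branch would
        # need to be elongated, which this sketch does not handle.
        raise ValueError("balance point lies outside the wire")
    return x * length, (1.0 - x) * length
```

Two identical subtrees give an even split, while a slower or heavier subtree pulls the merge point toward itself.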
PROCEDURE
[...] The repeat loop iterates $N$ times and, within each iteration, the dominating cost is the probability computation, which takes $O(K^2 N)$. So the overall complexity is $O(B + K^2 N^2)$.
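The greedy loop can be sketched generically in Python. The cost of a candidate merge is passed in as a callback, so the sketch does not commit to a particular form of the switched capacitance. The names `greedy_merge`, `merge_cost`, and `merge` are hypothetical, and the exhaustive pair search is chosen for clarity rather than efficiency.

```python
from itertools import combinations
from typing import Callable, List, TypeVar

T = TypeVar("T")


def greedy_merge(subtrees: List[T],
                 merge_cost: Callable[[T, T], float],
                 merge: Callable[[T, T], T]) -> T:
    """Repeatedly merge the pair of subtrees with the smallest merge cost."""
    if not subtrees:
        raise ValueError("need at least one subtree")
    active = list(subtrees)
    while len(active) > 1:
        # Evaluate every remaining pair and pick the cheapest merge.
        a, b = min(combinations(active, 2), key=lambda pair: merge_cost(*pair))
        active.remove(a)
        active.remove(b)
        active.append(merge(a, b))   # the merged subtree competes in later rounds
    return active[0]
```

Swapping `merge_cost` for a geometric distance turns the same loop into a nearest-neighbor style heuristic, which makes the difference between the two approaches a matter of the cost function alone.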
Reduction of Gates
Inserting gates at every node of the clock tree may result in a large area and increase the complexity of the control circuit and of the routing of the enable signals. In particular, since the enable signals are star-routed, their area can exceed that of the clock tree routing if we do not reduce the number of gates. There are cases in which inserting a gate hardly reduces the switched capacitance. We can think of three cases in which a node does not need a gate.
1. Activity of the node is close to 1,
2. Switched capacitance of the node is very small,
3. Activity of the parent node is almost the same as activity of the node.
Case (1) is obvious since there is no time frame during which the node can be shut off. In case (2), the node's switched capacitance is so small that having a gate can only reduce switched capacitance marginally. In case (3), there is very little increase in activity when we go up from the node to its parent. In this case, it is not necessary for both the node and its parent to have gates. Only the parent will have a gate, and the resulting switched capacitance is at most slightly higher than when both nodes have gates.
However, these gate removal schemes may remove so many gates in the tree that the phase delay of the clock signal may increase rapidly. So we included a rule that enforces gate insertion regardless of those three schemes whenever the subtree capacitance of the node reaches, say, $20C_g$, where $C_g$ is the input capacitance of a gate.
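The three removal cases and the forced-insertion rule can be combined into a single decision function, sketched below in Python. The thresholds `activity_max`, `sc_min`, and `activity_gap_min` are hypothetical tuning parameters. Only the factor of 20 on the gate input capacitance comes from the rule above.

```python
def needs_gate(activity: float, parent_activity: float,
               switched_cap: float, subtree_cap: float, c_gate: float,
               activity_max: float = 0.95, sc_min: float = 1.0,
               activity_gap_min: float = 0.05,
               cap_limit_factor: float = 20.0) -> bool:
    """Decide whether a node keeps its gate under the reduction heuristic."""
    # Forced insertion: too much capacitance below would hurt the phase delay.
    if subtree_cap >= cap_limit_factor * c_gate:
        return True
    # Case 1: the node is almost always active, so it can rarely be shut off.
    if activity >= activity_max:
        return False
    # Case 2: there is too little switched capacitance to be worth gating.
    if switched_cap <= sc_min:
        return False
    # Case 3: the parent's activity is nearly the same, so its gate suffices.
    if parent_activity - activity <= activity_gap_min:
        return False
    return True
```

Loosening or tightening the three thresholds is one way to control how many gates survive, which is the knob varied when studying the number of gates.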
Experimental Results
We implemented our algorithm in C++ on a Sun Sparc 20 workstation. For the sink locations (module locations) and the sink load capacitances, we used the benchmarks r1-r5 from [6]. The instruction stream and the modules used by each instruction are generated according to a probabilistic model of the CPU when it executes typical programs. The benchmark characteristics are shown in Table 4. The length of the instruction stream was 100,000 instructions for all the benchmarks. On average, each instruction uses about 40% of the modules in all the benchmarks (this can be seen in the column labeled Ave((I i ))). That is, about 40% of the modules are active at any given time on average. Note that, as a result, the power consumption of the gated clock tree will be at least 40% of that of the ungated clock tree.
Buffered clock tree vs. gated clock tree
The buffered clock tree is commonly used in current clock routing. Here, the buffered clock tree is constructed using the nearest-neighbor heuristic, and the size of a buffer is assumed to be half the size of an AND-gate. The comparison among the buffered clock tree, the gated clock tree, and the gated clock tree with the gate reduction heuristic is shown in Figure 3.
As can be seen from the figure, if the gate reduction heuristic is not applied, the gated clock routing is worse than the conventional buffered clock routing. The major overhead in switched capacitance and area comes from the star routing. After the gate reduction, the gated clock routing consumes about 30% less power than the buffered clock routing. There is still, however, an area overhead.
Impact of average module activity
If the average activity is too high, then there is little room for power savings. The average module activity vs. switched capacitance is shown in Figure 4 . As the average module activity increases, the power consumption difference between the two routing methods diminishes. Thus the gated clock routing is more effective when the module activity is low. 
Optimum number of gates
If there are many gates, the switched capacitance in the clock tree is reduced, but the switched capacitance and the area of the controller tree are increased. Conversely, if there are too few gates, the switched capacitance in the clock tree increases. Intuitively, there is an optimum number of gates that minimizes the total switched capacitance. This is shown in Figure 5.
We controlled the number of gates by giving different parameters in the gate reduction heuristic. When there are many gates, the controller tree dominates the switched capacitance and the area. As the number of gates is reduced, the switched capacitance in the controller tree is reduced but that of the clock tree is increased. In the figure, the optimum gate reduction for lowest power is at 55%.
Conclusion
We presented a gated clock routing method that achieves lower switched capacitance than buffered clock trees. We presented a clock topology generation heuristic based on the module activities and the sink locations. We proposed a method to find the signal probability and the transition probability of the gate control signals from tables generated from the instruction stream.
Our experimental results showed that there is an optimum number of gates for lowest switched capacitance. Our results help a designer choose a trade-off among the power, the area, and the complexity of the routing.
In this paper, we assumed a centralized gate controller. However, we are also investigating a distributed controller to reduce the star routing area. This is illustrated in Figure 6.
Assume that the chip is square and its side is of length D. 
