Runtime leakage control techniques, such as power gating (PG) and body biasing (BB), have traditionally been applied in a coarse-grained manner. To enable more aggressive leakage reduction, researchers are seeking ways to control leakage with finer granularity. This paper proposes two novel methods, circuit clustering for temporal idleness exploitation and circuit clustering for spatial idleness exploitation, that systematically reduce the granularity of leakage control and improve leakage reduction. A second contribution is a quantitative study of leakage saving and control cost under different granularities. With this study, designers can trade off leakage saving against control cost and decide the optimum granularity for leakage control. A heuristic algorithm has been developed to automate the two circuit clustering methods and determine the optimum granularity for any given circuit. The analysis and experiments in this paper are mainly based on reverse body biasing (RBB). They are also applicable to PG by modifying the cost function.
INTRODUCTION
MOSFET scaling into the deep sub-100nm regime has resulted in a significant increase in leakage power consumption. In particular, in the 45nm technology generation and beyond, leakage power consumption will catch up with, and may even dominate, dynamic power consumption [1]. Subthreshold leakage, gate leakage and band-to-band tunneling (BTBT) leakage are the three main components of the total leakage power consumption.
Many leakage control techniques have been introduced and studied so far. They can be classified into two groups: runtime techniques and design-time techniques. Runtime leakage control (RTLC) techniques, such as input vector control, power gating (PG) and body biasing (BB) [1], tune the circuit into a low-leakage state at runtime when the circuit is idle. RTLC techniques have proven to be very effective. They are extensively studied in academia and widely used in industry.
Most current RTLC designs adopt a block-level, or coarse-grained, approach. To achieve more aggressive leakage reduction, a fine-grained approach has recently been proposed [2, 3, 4, 5, 6, 7]. Take PG as an example. Figure 1 compares coarse-grained PG with fine-grained PG. Coarse-grained PG uses a single footer (or multiple distributed footers with a connected virtual ground) to control the leakage of the whole circuit. In contrast, fine-grained PG has an individual footer for each gate, and each footer can be turned on and off separately. Fine-grained PG has several advantages over coarse-grained PG. It is easy to synthesize and causes less ground bounce. Most importantly, it is expected to allow better utilization of circuit slack [2], since each individual footer is controllable.
Figure 1: Coarse-Grained Power Gating and Fine-Grained Power Gating
Various aspects of the fine-grained approach have been studied, such as sizing [2], wake-up time [3], robustness issues [4] and circuit clustering schemes [5, 6, 7]. We focus on circuit clustering schemes in this paper. Bhunia et al. [5] proposed a Shannon expansion based clustering method to apply PG. Leinweber et al. [6] improved the method in [5] by performing hypergraph partitioning before Shannon expansion. Usami et al. [7] proposed using clock gating signals as references to cluster a circuit into several PG domains, each associated with one clock gating signal. When clock gating is applied, PG is applied to the corresponding domain at the same time to reduce leakage. However, these three studies left two problems unaddressed: 1) They used an arbitrary granularity for circuit clustering and did not quantify leakage saving at different granularities. 2) They did not consider the energy and area overhead of controlling PG. Their schemes can therefore lead to negative net energy saving and significant extra area overhead. Specifically, the scheme in [7] can yield negative energy saving if the clock gating signals flip frequently, and the schemes in [5, 6] can yield negative energy saving if the internal mutual exclusion signals flip frequently. In addition, area overhead of up to 36% was reported in [5, 6].
In this paper, we propose two novel circuit clustering methods, namely clustering for temporal idleness exploitation and clustering for spatial idleness exploitation, to enable leakage reduction with finer granularity. However, instead of simply pursuing the fine-grained approach, we propose the concept of an optimal-grained approach. We observe that as the granularity becomes finer, both leakage saving and control cost increase. In many cases, control cost increases more rapidly than leakage saving. Therefore, there should exist a trade-off point at which the leakage saving and control cost are balanced. We call this point "the optimum granularity", and the method to achieve it "the optimal-grained approach". Figure 2 illustrates the trade-off among the coarse-grained, fine-grained and optimal-grained approaches. Determining this optimum granularity requires quantifying leakage saving and control cost, so we derive accurate models for the leakage saving and control cost of applying RTLC with different granularities. To the best of our knowledge, such a quantitative study has not been reported in the existing literature.
Figure 2: Coarse-grained, Fine-grained and Optimal-grained
Both power gating (PG) and body biasing (BB) are effective techniques. In this paper, we choose BB as the target technique for two reasons: 1) PG has the output-floating problem, while BB does not. Isolation cells need to be inserted at the interface of different PG domains [8] to prevent sneak current.
This adds significant area and power overhead when PG is applied with finer granularity. 2) PG has the state-retention problem, while BB does not. Since fine-grained leakage reduction is intended to exploit short idle periods, including those that occur while the circuit is in active mode, state retention is another obstacle for PG with finer granularity. However, the two circuit clustering methods and the quantitative models proposed in this paper are also applicable to PG. When they are applied to PG, the cost of isolation and data retention cells should be added to the cost function. BB can be implemented as forward body biasing (FBB) or reverse body biasing (RBB). This paper chooses RBB for demonstration.
Our optimal-grained approach is as follows. As shown in Figure 3, for any given circuit, we first analyze its potential idleness based on circuit topology and input statistics. Next, we analyze how to cluster the circuit to fully exploit the potential temporal and spatial idleness. The leakage saving and control cost are estimated for different clustering schemes with different granularities. Based on these estimates, designers can determine the optimum granularity at which to apply RBB and obtain the optimum clustering scheme. This whole process is implemented by an algorithm based on simulated annealing, so the results are actually near-optimal values. In this paper, we focus on circuit clustering methods only, and do not discuss the generation of RBB control signals. We assume that a workload predictor is used to guide the mode switching of RBB. Detailed mechanisms of this predictor are explained in [9]. In brief, the workload predictor(s) sends out RBB enable signals only when the slack in the circuit workload is predicted to be larger than the energy breakeven time (EBT) of the circuit, or of a circuit cluster. (EBT is defined as the minimum time that a circuit, or a gate, must stay in low-leakage mode for the leakage saving to compensate the energy penalty of the mode transition [10]. It is an important design parameter for power management systems.) The rate of successful prediction depends on workload regularity.
Figure 3: Optimum Granularity for RTLC
This paper is organized as follows. Section 2 derives the basic leakage saving model of RBB as a foundation for the quantitative study. Section 3 proposes the two circuit clustering methods and performs a quantitative analysis of leakage saving and control cost. Section 4 formulates the optimum granularity problem and presents a heuristic algorithm to obtain a near-optimum granularity. Section 5 shows the experimental results. Finally, Section 6 concludes the paper.
LEAKAGE SAVING MODELING OF RBB
RBB can be implemented in a discrete manner using Vth-hopping, or in a continuous manner using dynamic Vth scaling [1]. Here we choose Vth-hopping since it is easy to implement. A basic Vth-hopping scheme is illustrated in Figure 4. When the controller decides to apply RBB to the circuit, it connects the circuit body to the bias voltage sources (VP for PMOS, −VN for NMOS) by turning on the switch transistors S1 and S3 and turning off the switch transistors S2 and S4. The Vth of each transistor in the circuit then increases due to the body effect. As a result, subthreshold leakage is reduced by the higher Vth. Meanwhile, the BTBT leakage slightly increases as a side effect.
In the following, we show how the leakage saving of PMOS-RBB is modeled. NMOS-RBB can be modeled in a similar way. For a particular gate (g) in the circuit, assume that its original subthreshold leakage current with zero body bias is Is, and its original BTBT leakage is Ib. When PMOS-RBB is applied, the PMOS body voltage switches to VP. This results in a reduction in the subthreshold leakage of g [11]:
Meanwhile, the BTBT leakage increases [11] :
where Ks and Kb are technology-dependent parameters. For an idle period of length T, the leakage energy saving (Sg) of g by applying RBB is:
The RBB mode transition incurs an energy overhead (Og) for charging the PMOS body capacitance (Cg) of gate g:
Thus, the net leakage saving (Eg(T)) is the leakage energy saving minus the transition overhead: $E_g(T) = S_g - O_g$ (5)
Equation 5 is applicable when g has only one state while RBB is applied. In a runtime environment, the circuit can receive different input patterns, and thus g can be in multiple states while RBB is applied. In [12], Xu et al. presented a method to factor in the impact of input patterns for PG. A similar idea can be used for RBB. Assume that at runtime the circuit receives U typical input patterns, and that the probability of occurrence of each input pattern is Pu (u = 1..U). For each input pattern u, the net leakage saving of gate g is $E_g^u(T)$. Then, according to [12], the average net leakage saving $\bar{E}_g(T)$ of g over all U input patterns is the probability-weighted sum: $\bar{E}_g(T) = \sum_{u=1}^{U} P_u E_g^u(T)$ (6)
Similarly, the EBT (Bg) of g can be calculated by:
For the whole circuit, the net leakage saving is the sum of the savings of all gates: $E_{ckt}(T) = \sum_{g} \bar{E}_g(T)$ (8)
The EBT of the whole circuit can be obtained by:
With the workload predictor, the circuit is put into RBB mode only when the expected idleness is larger than a threshold time Tth, so the leakage saving for idle periods up to Tth is zero. Hence Equation 8 needs to be modified into a discontinuous function: $E_{ckt}(T) = 0$ for $T \le T_{th}$, and $E_{ckt}(T) = \sum_{g} \bar{E}_g(T)$ for $T > T_{th}$ (10)
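As a rough illustration of the model in this section, the following Python sketch treats the leakage energy saved by a gate as linear in the idle time, so that each gate is described by a saving rate and a one-off transition overhead. The linear form, the `Gate` class and all function names are assumptions made for this sketch; the exact expressions for the subthreshold and BTBT terms are given in [11] and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    saving_rate: float  # assumed: leakage energy saved per cycle while in RBB mode
    overhead: float     # O_g: one-off energy for charging the body capacitance


def average_gate(pattern_probs, pattern_gates):
    """Probability-weighted average of per-pattern parameters (valid because E is linear in T)."""
    rate = sum(p * g.saving_rate for p, g in zip(pattern_probs, pattern_gates))
    overhead = sum(p * g.overhead for p, g in zip(pattern_probs, pattern_gates))
    return Gate(rate, overhead)


def net_saving(gate, idle):
    """Net saving of one gate for an idle period: saving minus transition overhead."""
    return gate.saving_rate * idle - gate.overhead


def breakeven_time(gate):
    """EBT of a gate: the idle time at which the net saving crosses zero."""
    return gate.overhead / gate.saving_rate


def circuit_breakeven_time(gates):
    """EBT of the whole circuit: total overhead divided by total saving rate."""
    return sum(g.overhead for g in gates) / sum(g.saving_rate for g in gates)


def circuit_net_saving(gates, idle, threshold):
    """Circuit saving with a predictor threshold: zero unless idle exceeds the threshold."""
    if idle <= threshold:
        return 0.0
    return sum(net_saving(g, idle) for g in gates)
```

With this sketch, setting `threshold` to `circuit_breakeven_time(gates)` corresponds to the coarse-grained policy discussed in the next section, where the whole circuit enters RBB mode only when the idle period exceeds the circuit EBT.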
LEAKAGE SAVING AND CONTROL COST WITH FINER GRANULARITY
At runtime, not all the gates in a circuit are functioning at all times. The idleness of some gates gives us opportunities to apply RBB to them to save leakage. The amount of idleness that we can exploit determines how much leakage we can save. In this section, we introduce two circuit clustering methods to aggressively exploit the temporal and spatial idleness of a runtime circuit.
Circuit Clustering for Temporal Idleness Exploitation
Temporal idleness is caused by slack in the circuit workload. To exploit temporal idleness, the conventional coarse-grained approach applies RBB to the whole circuit when the slack is larger than the EBT of the circuit. Temporal idleness exploitation can be improved by applying RBB with finer granularity. To explain this, we start from the following observation. When RBB is applied, the leakage saving and the energy penalty for mode transition can vary significantly from gate to gate. Some gates have high leakage saving and low penalty, while others have low leakage saving but high penalty. To quantify this variation, we analyze the EBT of each gate in the ISCAS85 benchmark circuit C880. In Figure 5, the X-axis is the EBT value, and the Y-axis is the number of gates whose EBT falls into the same range. The EBT varies by up to 70× across gates, from 4 to 280 clock cycles. This indicates that, when RBB is applied, some gates with high leakage saving but low penalty achieve net energy saving after just 4 cycles, whereas some gates with low leakage saving but high penalty only save energy after 280 cycles. The vertical line (Bckt = 14 cycles) in Figure 5 is the EBT of the whole circuit. The coarse-grained approach puts the whole circuit into RBB mode only when the slack is larger than Bckt. However, there are two disadvantages in doing so:
1) For those gates (GL) whose EBT is larger than Bckt, entering RBB mode at Bckt yields negative net leakage saving, and thus causes more energy consumption.
2) For those gates (GS) whose EBT is smaller than Bckt, entering RBB mode at Bckt yields positive net leakage saving. However, their leakage saving potential is not fully exploited, since each gate in GS could enter RBB mode earlier.
These two disadvantages can be reduced if RBB is applied to GL and GS separately. GS can enter RBB mode earlier than Bckt to exploit more idleness, while GL can enter RBB mode later than Bckt to avoid unnecessary mode switching. This shows the potential of improving temporal idleness exploitation with finer granularity. We call this method circuit clustering for temporal idleness exploitation (TIE). Ideally, each gate in the circuit should enter RBB mode individually, whenever the circuit idleness is larger than the EBT of the gate. This is essentially the fine-grained approach. It maximally exploits the temporal idleness of each gate and thus yields the upper bound. By substituting Tth with the gate EBT (Bg) in Equation 10, we can quantify this upper bound as:
The coarse-grained approach yields the lower bound:
If the circuit is clustered into M partitions, denote the EBT of each cluster as Bm (m = 1..M). Then the net leakage saving of M-way clustering is:
Figure 6 shows the net leakage saving as a function of idle time for 2/3/4-way TIE clustering of C880. Leakage saving improves significantly as the granularity becomes finer. With 4-way TIE clustering, however, the leakage saving is already very close to the theoretical upper bound, so reducing the granularity further yields only minor improvement in temporal idleness exploitation.
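The relationship between the coarse-grained lower bound, the fine-grained upper bound and M-way TIE clustering can be sketched in a few lines of Python. This is a simplified illustration, not the exact model of this paper: it assumes each gate is a hypothetical `(saving_rate, overhead)` pair with saving linear in idle time, and that each cluster enters RBB mode only when the idle period exceeds the cluster's own EBT.

```python
def cluster_saving(cluster, idle):
    """Net saving of one cluster that enters RBB only if idle exceeds its own EBT.

    cluster: list of (saving_rate, overhead) pairs, one per gate (assumed model).
    Because the cluster EBT is overhead / rate, gating on idle > EBT is the same
    as clamping the net saving at zero.
    """
    rate = sum(r for r, _ in cluster)
    overhead = sum(o for _, o in cluster)
    return max(0.0, rate * idle - overhead)


def clustered_saving(clusters, idle):
    """Net saving of an M-way clustering: the sum over independently controlled clusters."""
    return sum(cluster_saving(c, idle) for c in clusters)


def coarse_saving(gates, idle):
    """Lower bound: the whole circuit is one cluster."""
    return clustered_saving([list(gates)], idle)


def fine_saving(gates, idle):
    """Upper bound: every gate is its own cluster."""
    return clustered_saving([[g] for g in gates], idle)
```

For four illustrative gates `[(2.0, 8.0), (1.0, 10.0), (0.5, 20.0), (0.25, 35.0)]` and an idle period of 20 cycles, the coarse-grained saving is 2.0, while both the fine-grained saving and a 2-way split into the first two and last two gates give 42.0. In this toy case a single extra cluster already reaches the upper bound, which mirrors the saturation seen in Figure 6. Note that the sketch ignores control cost, so finer clustering can never reduce the saving here.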
Circuit Clustering for Spatial Idleness Exploitation
Spatial idleness refers to cases where the whole circuit has a continuous workload, but a subset of the circuit is idle. This phenomenon is due to two features of circuit topology: partial input dependency and mutual exclusion.
Partial input dependency means that some parts of a circuit depend only on a subset of the circuit's primary inputs, instead of all of them. For example, Figure 7 shows a channel interrupt controller (ISCAS85 C432). This circuit is divided into five modules (M1 to M5). As we can see, M1 does not depend on channels B and C, and M2 does not depend on channel C. At runtime, it is possible that channels A and E are continually switching, while channels B and C remain static. In this case, although the whole circuit has a continuous workload, modules M1 and M2 are not functioning. They have spatial idleness due to their partial input dependency. For spatial idleness exploitation (SIE), we can cluster the circuit into three partitions, as shown in Figure 7. The improvement in leakage saving from SIE clustering is reflected by the variable T in Equation 10. When the whole circuit is clustered into partitions, each partition depends on fewer inputs than the whole circuit, so the probability that it is idle increases. To quantify this, let Rn denote the probability that primary input n is idle. Assume that the circuit has Nckt primary inputs and, for simplicity, that all primary inputs share the same idle probability R. Then, treating the inputs as independent, the probability that all primary inputs, and hence the whole circuit, are idle is R^Nckt. For a workload period of length TW, the total idle time of the circuit is TW × R^Nckt. Based on Equation 10, the net leakage saving of the whole circuit during TW without clustering is:
Equation 14 essentially factors in both TIE (by controlling the variable Tth of Equation 10) and SIE (by controlling the variable T of Equation 10). It is our overall optimization goal in Section 4. Here, however, in order to study pure spatial idleness, we fix the variable Tth at Bckt to disable TIE. Equation 14 also gives the lower bound of SIE. Again, the upper bound is given by the fine-grained approach. Assume that each gate g depends on Ng inputs. Then the upper bound can be obtained by:
For M-way SIE clustering, assume that each cluster m depends on Nm inputs. Its net leakage saving is:
Figure 8 shows the net leakage saving for 2/3/4-way SIE clustering of C432. The X-axis is the probability (R^Nckt) that all primary inputs are idle. As shown in Figure 8, the leakage saving improvement is significant when the inputs have medium activity (idle probability between 0.4 and 0.8).
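The effect of partial input dependency on idleness can be illustrated with a short Python sketch. It only computes the idle probability and the expected idle time of a cluster, assuming independent primary inputs with a common idle probability R; the function names and the numbers in the note below are hypothetical and are not taken from the benchmark circuits.

```python
def idle_probability(per_input_idle, num_inputs):
    """Probability that all inputs a cluster depends on are idle (independent inputs assumed)."""
    return per_input_idle ** num_inputs


def expected_idle_time(workload_len, per_input_idle, num_inputs):
    """Expected total idle time of a cluster within a workload period of the given length."""
    return workload_len * idle_probability(per_input_idle, num_inputs)


def cluster_idle_times(workload_len, per_input_idle, inputs_per_cluster):
    """Expected idle time of each cluster, given how many primary inputs each one depends on."""
    return [expected_idle_time(workload_len, per_input_idle, n)
            for n in inputs_per_cluster]
```

For instance, with R = 0.5 a cluster that depends on 2 inputs is idle with probability 0.25, whereas a cluster that depends on 4 inputs is idle with probability 0.0625. A cluster with fewer dependent inputs therefore accumulates more idle time over the same workload period, which is the source of the SIE improvement.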
The second type of spatial idleness, mutual exclusion, means that two parts of a circuit never work concurrently under any conditions. A simple example is a MUX. When one channel of the MUX is selected, the outputs of the unselected channels are not used. If the select signals remain the same for a period longer than the EBT of the unselected channels, the unselected channels can be put into low-leakage mode. However, to do this, we need to monitor and predict the steering signals (the select signals for MUXes). This turns out to be difficult for gates inside a circuit. For example, Figure 9 shows an inside cluster (M1), which does not receive any primary inputs.
Figure 9: Monitoring and Prediction of Mutual Exclusion Steering Signals
To predict the mutual exclusion of M1, we need to monitor the internal steering signals. This is difficult to implement for two reasons: 1) Predictors need to be inserted inside the circuit. This causes large overhead and disturbs the original design.
2) The steering signals inside a circuit may not have regularity, so the prediction may not be accurate. Hence in our method, we only exploit mutual exclusion for the clusters, whose steering signals are primary inputs, such as M2 in Figure 9 . The studies in [5, 6] 
Control Cost With Finer Granularity
The control cost of RBB consists of three parts: the cost of the predictor(s), the cost of the switch transistors (S1 to S4 in Figure 4), and the routing area for the bias voltage rails. The work in [9] reported that the predictor has very small area and power consumption. Here we focus on the other two costs.
Area and Power Cost of Switch Transistors
The area of the switch transistors (Aswt) is determined by the required speed of mode transition, i.e. the wake-up time (TWUT). In [13], the wake-up time is calculated as the charge stored in the body capacitance (Cbody) divided by the Ion of the switch transistors:
Ion is proportional to Aswt. So Equation 17 tells us that, for a fixed TWUT, Aswt is a linear function of the body capacitance of the circuit. When the circuit is clustered into M partitions, the total body capacitance summed over all clusters does not change. Hence the total area of all switch transistors does not change either. So the area and power cost of the RBB switch transistors are roughly constant, regardless of granularity.
Routing Cost for Bias Voltage Rails
Since each cluster enters RBB mode separately, it requires its own pair of virtual bias rails. For a standard cell design, this is just a pair of metal strips between each row of cells [2]. Narendra et al. [14] reported the routing area to be 2% of the total chip area. For M-way clustering, M pairs of metal strips are required between each row of cells. For example, Figure 10 demonstrates that for 2-way clustering, the routing area is simply doubled. (Only NMOS is shown here.) Denote the routing area cost of a pair of virtual bias rails as Aral. The total area cost of an M-way clustering can be estimated as: $A_M = M \times A_{ral}$ (18)
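The two cost arguments of this section can be checked numerically with a small Python sketch. It assumes that the stored charge is the body capacitance times the bias voltage, that Ion scales linearly with switch area through a hypothetical constant `ion_per_area`, and that routing area grows by one rail pair per cluster; these are illustrative assumptions rather than the exact expressions of [13] and [14].

```python
def switch_area(body_cap, bias_voltage, wakeup_time, ion_per_area):
    """Switch area needed to move the body charge within the wake-up time.

    Assumed model: wakeup_time = (body_cap * bias_voltage) / Ion, with
    Ion = ion_per_area * area, solved for the area.
    """
    return body_cap * bias_voltage / (wakeup_time * ion_per_area)


def total_switch_area(cluster_caps, bias_voltage, wakeup_time, ion_per_area):
    """Total switch area when each cluster gets its own switch transistors."""
    return sum(switch_area(c, bias_voltage, wakeup_time, ion_per_area)
               for c in cluster_caps)


def rail_area(num_clusters, area_per_rail_pair):
    """Routing area for M-way clustering: one pair of bias rails per cluster."""
    return num_clusters * area_per_rail_pair
```

Because the switch area is linear in body capacitance, splitting a circuit of capacitance 10 into clusters of 4 and 6 leaves `total_switch_area` unchanged, while `rail_area` doubles when going from one cluster to two. This is why only the rail routing enters the area term of the granularity trade-off.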
DESIGN AUTOMATION FOR DETERMINING THE OPTIMUM GRANULARITY
In this section, we first formulate the optimum granularity problem. Then we implement a heuristic algorithm to obtain a near-optimum granularity for any given circuit.
Problem Formulation
As discussed in Section 3.3, the power consumption of the control circuits remains roughly the same regardless of granularity, while the area consumption is a linear function of granularity. So the trade-off between leakage saving and control cost comes down to the trade-off between energy saving and area consumption. To make this trade-off, weights need to be assigned to energy and area. Assume that the cost of unit energy (area) consumption is WE (WA). The optimum granularity (Mopt) is determined by:
where MAX(E_{Mopt+1}) (MAX(E_{Mopt})) is the maximum net leakage saving that can be achieved with granularity Mopt+1 (Mopt), and A_{Mopt+1} (A_{Mopt}) is the area cost of granularity Mopt+1 (Mopt) from Equation 18. Equation 19 guarantees that the optimum granularity is the finest granularity whose weighted leakage saving is larger than its weighted area cost. To obtain MAX(E_M), we need to know how to achieve the maximum net leakage saving for a given granularity M. We call this the optimum clustering problem. We present an integer linear programming (ILP) formulation for the optimum clustering problem based on Equation 16. Assume that the circuit is clustered into M partitions, and let the variable Xgm denote the assignment of gate g to cluster m. Specifically, Xgm = 1 if gate g is assigned to cluster m, and Xgm = 0 otherwise.
The optimization goal is to maximize EM in Equation 16:
Determine : Xgm
where Bm is the EBT of cluster m, and Nm is the number of primary inputs that cluster m depends on. Nm can be calculated from Dg(n), the dependency vector of gate g, whose entry for primary input n is 1 if g depends on n and 0 otherwise.
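One plausible reading of the granularity criterion described above is a marginal comparison: keep refining while the weighted gain in maximum leakage saving from one more cluster exceeds the weighted increase in area. The Python sketch below implements that reading. The function name, the list-based interface and the marginal form of the test are assumptions of this sketch, and the sample numbers are illustrative rather than measured.

```python
def optimum_granularity(max_saving, area_cost, w_energy, w_area):
    """Return the finest granularity M whose marginal weighted saving beats its marginal weighted area.

    max_saving[i] and area_cost[i] hold MAX(E_M) and A_M for M = i + 1.
    This marginal stopping rule is an assumed interpretation of the criterion.
    """
    m = 1
    while m < len(max_saving):
        gain = w_energy * (max_saving[m] - max_saving[m - 1])
        cost = w_area * (area_cost[m] - area_cost[m - 1])
        if gain <= cost:
            break  # one more cluster no longer pays for its extra bias rails
        m += 1
    return m


def linear_area_costs(max_clusters, area_per_rail_pair):
    """Area cost for M = 1..max_clusters, assuming one rail pair per cluster."""
    return [m * area_per_rail_pair for m in range(1, max_clusters + 1)]
```

With savings `[10.0, 16.0, 19.0, 20.0, 20.5]`, a rail-pair cost of 2.0 and equal weights, the sketch stops at M = 3, because the fourth cluster adds only 1.0 of saving against 2.0 of area. Lowering the area weight to 0.2 makes every refinement worthwhile and the result becomes M = 5, which shows how the weights WE and WA steer the trade-off.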
Algorithm to Find A Near-optimum Granularity
The ILP formulation is guaranteed to produce the optimal solution, but it is time consuming for large circuits. Instead, a simulated annealing (SA) based heuristic algorithm can be used to obtain near-optimum results. The maximization goal in Equation 21 can be used directly to generate the cost function. However, one problem in doing so is that Equation 21 tries to optimize both temporal and spatial idleness exploitation. This may cause SA to take a long time to converge, since a good clustering for TIE may not be suitable for SIE. To address this issue, our algorithm optimizes only one of TIE and SIE at a time, and then accepts whichever partition yields the better leakage saving. The second feature of our algorithm is its divide-and-conquer strategy. Once an M-way clustering scheme is generated, it is fixed, and the (M+1)-way clustering is performed on top of the existing M-way clustering scheme. Finally, our algorithm to obtain a near-optimum granularity is as follows.
The first and second parameters of SA are the initial partition and the best partition found, respectively. The third and fourth parameters are the idle probability and the EBT of each gate in Equation 21. By fixing the third parameter as R^Nckt, SA maximizes TIE. By fixing the fourth parameter as Bckt, SA maximizes SIE.
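To make the role of simulated annealing concrete, the following Python sketch anneals a gate-to-cluster assignment for the TIE objective only. It is a generic SA loop written for illustration, not the algorithm listing of this paper: the linear `(saving_rate, overhead)` gate model, the single-gate move, the geometric cooling schedule and all parameter defaults are assumptions.

```python
import math
import random


def partition_saving(gates, assignment, num_clusters, idle_periods):
    """Total net saving over sample idle periods; each cluster is gated by its own EBT."""
    total = 0.0
    for m in range(num_clusters):
        rate = sum(r for (r, _), a in zip(gates, assignment) if a == m)
        overhead = sum(o for (_, o), a in zip(gates, assignment) if a == m)
        for idle in idle_periods:
            total += max(0.0, rate * idle - overhead)
    return total


def anneal(gates, num_clusters, idle_periods,
           steps=2000, temp=10.0, cooling=0.995, seed=0):
    """Search for a gate-to-cluster assignment with high TIE saving."""
    rng = random.Random(seed)
    current = [0] * len(gates)  # start from the coarse-grained partition
    current_val = partition_saving(gates, current, num_clusters, idle_periods)
    best, best_val = current[:], current_val
    for _ in range(steps):
        # Move: reassign one randomly chosen gate to a random cluster.
        cand = current[:]
        cand[rng.randrange(len(gates))] = rng.randrange(num_clusters)
        cand_val = partition_saving(gates, cand, num_clusters, idle_periods)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_val = cand, cand_val
            if current_val > best_val:
                best, best_val = current[:], current_val
        temp *= cooling
    return best, best_val
```

A call such as `anneal(gates, 2, [20.0])` returns the best assignment found and its saving. Since the search starts from the single-cluster partition and tracks the best value seen, the result is never worse than the coarse-grained saving, but it is only near-optimal in general. An SIE variant would replace `partition_saving` with an objective built from the per-cluster idle probabilities.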
EXPERIMENTAL RESULTS
We run HSPICE simulations on ISCAS85 benchmark circuits with a 32nm predictive technology [15]. The clock frequency is 1 GHz. The bias voltages for Vth-hopping are set to enable a 25× total leakage reduction. The switch transistors are sized to ensure that the wake-up time of each circuit is 0.5 clock cycle. For each benchmark, 1000 random input patterns are applied. These patterns are generated with 70% idleness probability for all primary inputs (R^Nckt = 0.7). Table 1 shows the experimental results for 10 benchmark circuits with different granularities (M = 1 to 5). For each granularity M, 'T' means that TIE clustering is accepted as the Mth partition, and 'S' means that SIE clustering is accepted. The percentages shown in Table 1 are the net leakage saving percentages, calculated by dividing the simulated net leakage saving by the simulated original leakage. The last column gives the improvement in net leakage saving for 5-way clustering (M = 5). In Table 1, 5-way clustering achieves a 14% to 47% improvement in net leakage saving, and causes 12% area overhead. A single TIE clustering achieves up to 10% improvement (C6288). A single SIE clustering achieves up to 26% improvement (C5315). For a runtime circuit, the actual amount of temporal and spatial idleness that we can exploit depends on its input characteristics. To demonstrate this, we change the input idleness probability (R^Nckt) and observe how the improvement of a single clustering varies. As shown in Figure 11, the improvement varies significantly with the input idle probability. Figure 11 also shows that applying RBB with finer granularity is especially efficient for medium input activity (0.4 < R^Nckt < 0.8). This is because when the idle probability is high (> 0.8), the circuit is already idle most of the time, and reducing granularity will not yield significant improvement.
When the idle probability is low (< 0.4), the circuit has such a heavy workload that there is not much idleness to exploit with finer granularity. The last row of Table 1 shows the total area cost (AM) of each granularity for C7552. The area costs of the other benchmark circuits are similar. Figure 12 shows the leakage saving improvement versus the extra area cost at different granularities for C7552.
CONCLUSION
This paper studies aggressive runtime leakage control with finer granularity. Our first contribution is the proposal of two novel circuit clustering methods for temporal and spatial idleness exploitation. Unlike the conventional coarse-grained and fine-grained approaches, our method seeks the optimum granularity that balances leakage saving and control cost. The key to making this trade-off is the quantification of leakage saving and control cost at different granularities. So our second contribution is the model derivation, as well as the formulation of the optimum granularity problem. Experimental results based on reverse body biasing demonstrate the effectiveness of the two circuit clustering methods. An algorithm has been implemented to automate these two methods and obtain a near-optimum granularity.
