Sleep transistors are effective in reducing dynamic and leakage power. The cluster-based design was proposed to reduce the sleep transistor area by clustering gates to minimize the simultaneous switching current per cluster and then inserting one sleep transistor per cluster. In this paper, we propose a novel distributed sleep transistor network (DSTN), and show that DSTN is intrinsically better than the cluster-based design in terms of sleep transistor area and circuit performance. We reveal properties of optimal DSTN designs, and then develop an efficient algorithm for gate-level DSTN synthesis. The algorithm obtains DSTN designs with up to 70.7% sleep transistor area reduction compared to cluster-based designs. Furthermore, we present custom layout designs to verify the area reduction by DSTN.
INTRODUCTION
Lowering the supply voltage is effective for power reduction because dynamic power consumption depends quadratically on the supply voltage. To compensate for the performance loss due to a lower supply voltage, the transistor threshold voltage Vt should also be reduced, which causes an exponential increase in the sub-threshold leakage current [1]. Multi-threshold CMOS (MTCMOS, see figure 1) has been introduced, with low-Vt modules connected to ground through high-Vt transistors called sleep transistors [2]. The sleep transistor is turned off to reduce dynamic and leakage power in the standby mode, and is turned on to retain functionality in the active mode.
Figure 1: MTCMOS circuit structure
In this paper, we propose a novel distributed sleep transistor network (DSTN) with inherent advantages in area and performance compared to module-based and cluster-based sleep transistor designs [3, 4]. We discuss background knowledge in Section 2, introduce the concept of DSTN in Section 3, and propose a gate-level DSTN synthesis methodology in Section 4. We present experiments on gate-level synthesis and custom layout design in Section 5 and conclude in Section 6. Proofs of all theorems can be found in the technical report [5].
BACKGROUND
When sleep transistors are absent, the propagation delay of a CMOS gate can be approximated by

$$T_d \propto \frac{C_L V_{dd}}{(V_{dd} - V_{tL})^{\alpha}} \tag{1}$$

where CL is the load capacitance, VtL is the threshold voltage in the low-Vt module, and α is the velocity saturation index for modeling short-channel effects [6]. When the sleep transistor is present and its source-drain voltage drop is Vst, the gate propagation delay increases to

$$T_d^{sleep} \propto \frac{C_L V_{dd}}{(V_{dd} - V_{st} - V_{tL})^{\alpha}} \tag{2}$$

In order to measure the increase in propagation delay, the following performance loss (PL) is defined [4]:

$$PL = 1 - \frac{T_d}{T_d^{sleep}} \tag{3}$$
According to the analysis in [4], for PL = δ, we have
where Ist is the switching current in the low-Vt module, VtH is the threshold voltage of the sleep transistor and is higher than VtL in the low-Vt module (we assume VtL = 350mV and VtH = 500mV in this paper), and Rst is the channel resistance of the sleep transistor in the linear-operation region. Therefore, because the maximum simultaneous switching current (MSSC) of the whole circuit, MSSC(ckt), cannot exceed the sum of the cluster MSSCs, the module-based design saves sleep transistor area compared to the cluster-based design.
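As a worked sketch of how the PL constraint turns into a transistor size, assume α ≈ 1 (a simplification adopted here for illustration) and the usual linear-region approximation for the channel resistance. The symbol μ_n C_ox (process transconductance) is not defined in the text above and is introduced here as an assumption; PL is taken as the relative delay increase 1 − T_d/T_d^sleep under the alpha-power-law delay model.

```latex
\begin{align*}
PL &= 1 - \frac{T_d}{T_d^{sleep}}
    = 1 - \left(1 - \frac{V_{st}}{V_{dd} - V_{tL}}\right)^{\alpha}
    \approx \frac{V_{st}}{V_{dd} - V_{tL}} \quad (\alpha \approx 1) \\
PL = \delta \;&\Rightarrow\; V_{st} = \delta\,(V_{dd} - V_{tL}) \\
R_{st} &= \frac{V_{st}}{I_{st}} = \frac{\delta\,(V_{dd} - V_{tL})}{I_{st}} \\
R_{st} &\approx \frac{1}{\mu_n C_{ox}\,(W/L)_{st}\,(V_{dd} - V_{tH})} \\
\Rightarrow\; (W/L)_{st} &= \frac{I_{st}}{\delta\,\mu_n C_{ox}\,(V_{dd} - V_{tL})(V_{dd} - V_{tH})}
\end{align*}
```

Under these assumptions the required W/L grows linearly with the current the sleep transistor must carry, so comparing design styles reduces to comparing the switching currents they must be sized for.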
Let virtual-ground wires be the interconnects connecting the sleep transistor to low-Vt gates. The above analysis does not consider the virtual-ground wires. The module-based design, however, leads to long virtual-ground wires, as pointed out in [4]. The increased resistance of virtual-ground wires has to be compensated by more area in the sleep transistor. Such overhead can be avoided by having a local sleep transistor per cluster, and sleep transistor area can be further reduced by clustering gates to minimize the MSSC in the cluster. Minimizing MSSC introduces extra constraints for placement, and may conflict with timing-driven placement. In the next section, we propose our DSTN design, and show that DSTN has a reduced area for both sleep transistors and virtual-ground wires, and is compatible with timing-driven placement. Because the cluster-based design is better than the module-based design [4], we compare DSTN mainly with the cluster-based design in the rest of the paper.
SLEEP TRANSISTOR NETWORK
We illustrate the cluster-based sleep transistor design in figure 2(a), where gates in a cluster are connected to the sleep transistor for this cluster by virtual-ground wires. Virtual-ground wires of different clusters are not connected. By adding more wires to form a mesh containing all virtual-ground wires, we obtain the DSTN structure in figure 2(b). We assume that all sleep transistors share a common control signal in both designs. We will show that DSTN reduces the sleep transistor area compared to the cluster-based design. The area saving can be explained by the discharging current balance phenomenon. As shown in figure 3, the switching current in module 2 is larger than those in module 1 and module 3. When the discharging current flows through the sleep transistors, the voltage drop across sleep transistor 2 tends to be larger than the voltage drops across sleep transistors 1 and 3, which causes part of the current from module 2 to flow to transistors 1 and 3. The total area of all the sleep transistors in DSTN can thus be significantly reduced in the presence of such discharging current balance. However, owing to the parasitic resistance and capacitance in virtual-ground wires, the total transistor area should be larger than the following
which is the optimum area for the single sleep transistor in the module-based design introduced in [3], and also the ideal total sleep transistor area in DSTN. The routing area overhead is a crucial aspect for all three types of sleep transistor design because every gate in the circuit has to be connected to a sleep transistor. Different sleep transistor designs impose different requirements for routing in terms of wire length and wire size. We assume in this paper that sleep transistors are connected to the ideal ground. Although DSTN and the module-based design may have the same topology for virtual-ground wires, the wire size for DSTN is found to be smaller due to the proximity of sleep transistors. On the other hand, DSTN needs more virtual-ground wire segments than the cluster-based design. In the DSTN layout illustrated in figure 4, the dotted lines are virtual-ground wires inside modules and are required by both DSTN and the cluster-based design, while the solid lines are the additional virtual-ground wires needed by DSTN. These solid lines are short in a compact layout. When the chip has a few "isolated" compact layout regions, such as IP blocks in system-on-chip designs, we can simply apply an individual DSTN inside each IP block without introducing extra long virtual-ground wires.
Furthermore, introducing clustering methodologies in the sleep transistor design can affect placement. A good clustering solution that minimizes the cluster MSSC is crucial to reducing sleep transistor area in the cluster-based design. Such clustering helps DSTN as well. However, the experiments presented in Section 5 show that DSTN without cluster current minimization achieves significant sleep transistor area reduction compared to the cluster-based design with cluster current minimization. Due to the adverse effect of MSSC minimization on timing-driven placement, we suggest not applying cluster current minimization to DSTN.
GATE LEVEL DSTN DESIGN
In this section, we first present the DSTN modeling, then formulate and solve the DSTN sizing problem. In order to compare different design styles, we will also introduce a rigorous algorithm for cluster-based sleep transistor design.
DSTN modeling
We model both sleep transistors and virtual-ground wires as resistors. Therefore, DSTN can be modeled as a resistance network shown in figure 5, with resistance Rst for a transistor and Ri for a virtual-ground interconnect. Note that Ri is needed to accurately model the discharge current balance. Exact estimation of Ri, however, requires detailed layout information. In gate level design, we assume that Ri is uniform for each wire. Specifically, we assume that the wire resistance is 0.05Ω/µm. We consider virtual-ground wires that are 200µm and 1000µm long, i.e., we consider Ri = 10Ω and 50Ω, respectively. Given our assumption that each cluster has about six gates (decided by the typical sleep transistor size in section 4.2), 200µm is a conservative estimation for virtual-ground wires between clusters, and 1000µm serves as the worst-case scenario to analyze the impact of Ri.
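The resistance-network model can be made concrete with a small nodal-analysis sketch. It is only an illustration: it assumes a chain of tapping points rather than the mesh of figure 5, treats each module as an ideal current source, and uses hypothetical names (`solve_linear`, `dstn_chain_voltages`) and made-up current values that do not come from the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def dstn_chain_voltages(r_st, r_i, currents):
    """Virtual-ground voltages for a chain of tapping points (assumed topology).

    r_st[k]     : sleep transistor resistance from node k to ideal ground (ohm)
    r_i         : uniform wire resistance between adjacent nodes (ohm)
    currents[k] : discharging current injected at node k (A)
    """
    n = len(currents)
    g = [[0.0] * n for _ in range(n)]  # nodal conductance matrix
    for k in range(n):
        g[k][k] += 1.0 / r_st[k]
    for k in range(n - 1):
        gw = 1.0 / r_i
        g[k][k] += gw
        g[k + 1][k + 1] += gw
        g[k][k + 1] -= gw
        g[k + 1][k] -= gw
    return solve_linear(g, currents)


# Illustrative values only: three equal transistors, the middle module draws more.
r_st = [218.0, 218.0, 218.0]
currents = [1e-3, 3e-3, 1e-3]
isolated = [r * i for r, i in zip(r_st, currents)]   # cluster-based: no sharing
meshed = dstn_chain_voltages(r_st, 10.0, currents)   # DSTN with Ri = 10 ohm
print("isolated drops (V):", isolated)
print("DSTN drops (V):    ", meshed)
```

With the wires in place, part of the middle module's current is diverted to its neighbours, so the largest voltage drop is lower than in the isolated case; this is the discharging current balance that the model is meant to capture.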
DSTN sizing
Problem formulation
We assume in this paper that the topology of DSTN is defined a priori, and formulate the following DSTN sizing problem:

Note that DSTN sizing is totally different from sleep transistor sizing in the cluster-based design. The size of a sleep transistor in the cluster-based design is solely determined by the MSSC of the cluster it serves. Owing to discharging current balance in DSTN, the size of a sleep transistor in DSTN depends on the current going through the cluster it serves, the adjacent clusters, and even non-adjacent clusters. This makes the DSTN sizing problem much harder than the sizing problem of the cluster-based design. More precisely, DSTN can be modeled by a resistance network, and accurate transistor sizing can then be obtained by algorithms similar to the P/G sizing algorithms in [8]. We expect that well-designed heuristics can also lead to good solutions, and more efficiently. We reveal below a few important properties in order to develop effective heuristics.
Properties
Note that our properties are based on an important observation about the resistance network: Ri is normally much smaller than Rst. The channel resistance of the transistor in the linear-operation region is
We assume VtH = 500mV, W/L = 6 for a typical sleep transistor in DSTN, and Vdd = 1.3V in 100nm technology.
Thus, the typical resistance value for Rst is around 218Ω. On the other hand, a 200µm long virtual-ground wire has Ri of about 10Ω in 100nm technology. Therefore, it is reasonable to assume that Rst is much larger than Ri.
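The arithmetic behind this comparison can be sketched as follows, assuming the usual linear-region approximation Rst ≈ 1/(μnCox · (W/L) · (Vdd − VtH)). The value of μnCox is not given in the text; the constant below is a hypothetical value chosen only so that the example lands near the quoted 218Ω. The wire figures use the 0.05Ω/µm assumption from Section 4.1.

```python
def channel_resistance(mu_n_cox, w_over_l, vdd, vth):
    """Linear-region channel resistance (ohm), assuming R = 1/(k * W/L * (Vdd - Vt))."""
    return 1.0 / (mu_n_cox * w_over_l * (vdd - vth))


def wire_resistance(ohm_per_um, length_um):
    """Resistance of a uniform virtual-ground wire (ohm)."""
    return ohm_per_um * length_um


# Assumed process transconductance in A/V^2 (hypothetical, not from the paper).
MU_N_COX = 956e-6

r_st = channel_resistance(MU_N_COX, w_over_l=6.0, vdd=1.3, vth=0.5)
r_i_short = wire_resistance(0.05, 200.0)    # conservative inter-cluster wire
r_i_long = wire_resistance(0.05, 1000.0)    # worst-case wire

print(f"Rst ~ {r_st:.0f} ohm, Ri(200um) = {r_i_short:.0f} ohm, "
      f"Ri(1000um) = {r_i_long:.0f} ohm, Rst/Ri = {r_st / r_i_short:.1f}")
```

Even in the 1000µm worst case the wire resistance stays well below the transistor resistance, which is the observation the properties below rely on.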
Theorem 1. Assuming Ri = 0 and PL = δ, the total transistor area in DSTN is determined by:
When Ri = 0, all sleep transistors in DSTN can be viewed as one single transistor with channel resistance and (W/L) of:
Because the current of the entire circuit goes through this single transistor, the following equation holds:
Combining (12) and (11) leads to (9). We can also prove:
Theorem 2. To maintain PL as a constant, the total area of DSTN increases when Ri increases.
As Ri increases, the effective resistance seen by the current source at each tapping point increases. Thus, the voltage drop in the sleep transistor increases when the current is constant. To maintain PL as a constant, the sleep transistor resistance has to be decreased, which results in more area consumption in DSTN.
The total area of DSTN can be roughly determined by Theorems 1 and 2 together. If Ri = 0, the total area of the DSTN is given by (9). Because Ri > 0 in practice, Theorem 2 implies that the total transistor area in DSTN must be larger than the value in (9). Nevertheless, the increase in effective resistance at each tapping point is limited because Ri is much smaller than Rst. The increase of transistor area in DSTN is therefore limited.
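The argument behind Theorem 1 can be sketched as follows. This is a sketch under stated assumptions, not the proof in [5]: it assumes α ≈ 1, so that PL = δ corresponds to a virtual-ground voltage of δ(Vdd − VtL), and the linear-region approximation for channel resistance, with μ_n C_ox introduced here as an assumed process parameter.

```latex
% With R_i = 0 all tapping points form one node, so the sleep transistors are in parallel.
\begin{align*}
\frac{1}{R_{eq}} &= \sum_i \frac{1}{R_{st_i}}
                  = \mu_n C_{ox}\,(V_{dd} - V_{tH}) \sum_i (W/L)_{st_i} \\
\mathrm{MSSC}(ckt)\cdot R_{eq} &= V_{st} = \delta\,(V_{dd} - V_{tL}) \\
\Rightarrow\; \sum_i (W/L)_{st_i} &=
  \frac{\mathrm{MSSC}(ckt)}{\delta\,\mu_n C_{ox}\,(V_{dd} - V_{tL})(V_{dd} - V_{tH})}
\end{align*}
```

The total depends only on the MSSC of the whole circuit, not on how the current is split among clusters, which is why the ideal DSTN area matches the single-transistor module-based optimum.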
Theorem 3. Assuming that the current Ii flowing into each tapping point ti is constant and that the total area of the DSTN is given, every transistor sti serving ti should be sized proportionally to the current Ii in order to minimize the maximum voltage drop among all sleep transistors.
Note that Theorem 3 describes an ideal case for allocating area to individual transistors in DSTN. Although the current at each tapping point ti is not constant in real designs, Theorem 3 helps guide the design of our DSTN transistor sizing scheme below.
Algorithm
The overall flow of the sleep transistor sizing scheme is as follows. We first calculate MSSC(ckt), for example by a genetic algorithm [9]. We then compute the total area in DSTN according to the following formula:
Our experiments show that β should range from 0.05 to 0.5, and that a larger β should be used for a bigger circuit. Finally, according to Theorem 3, the total DSTN area is allocated to each sleep transistor sti in proportion to the corresponding cluster MSSC.
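The final allocation step can be sketched as below. The total W/L budget is taken as an input rather than computed, and the function names and numeric values are hypothetical; the helper `own_current_drops` assumes each transistor carries only its own cluster's current and that resistance is inversely proportional to W/L.

```python
def allocate_sizes(total_w_over_l, cluster_mssc):
    """Split a total W/L budget across sleep transistors in proportion to cluster MSSC."""
    total_current = sum(cluster_mssc)
    if total_current <= 0:
        raise ValueError("cluster MSSC values must sum to a positive current")
    return [total_w_over_l * i / total_current for i in cluster_mssc]


def own_current_drops(sizes, cluster_mssc, k=1.0):
    """Voltage drop I_i * R_i with R_i = 1 / (k * size_i); k is an assumed constant.

    Requires every size to be positive (i.e. every cluster MSSC is non-zero).
    """
    return [i / (k * s) for i, s in zip(cluster_mssc, sizes)]


# Illustrative values only.
cluster_mssc = [1e-3, 3e-3, 2e-3]          # A
sizes = allocate_sizes(18.0, cluster_mssc)  # total W/L budget of 18
drops = own_current_drops(sizes, cluster_mssc)
print("sizes:", sizes)
print("drops:", drops)
```

With proportional sizing every transistor sees the same drop for its own cluster's current, so no single transistor dominates the maximum voltage drop; this is the intuition behind using Theorem 3 as the allocation rule.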
Cluster based sleep transistor insertion
The total area of sleep transistors for the cluster-based sleep transistor design is proportional to $\sum_i \mathrm{MSSC}(c_i)$. A cluster-based design methodology has been proposed with placement constraints [4]. In this paper, we aim to reach the maximum potential of sleep transistor area reduction. Therefore, we propose to apply simulated annealing (SA) to minimize $\sum_i \mathrm{MSSC}(c_i)$ without placement constraints. In SA, each cluster is associated with a cost equal to its MSSC. The cost for the entire circuit is the sum of the costs of all clusters. The objective is to minimize the cost for the entire circuit. We take advantage of the freedom that a gate can be assigned to any cluster. Specifically, in each move two gates are randomly picked from two clusters and exchanged. We start SA from a temperature of 100 and terminate at 0.1. The number of moves at each temperature is 200 times the number of clusters in the circuit. After these moves, the temperature is decreased by a factor of 0.9.
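A minimal sketch of this annealing schedule is given below. Two parts are assumptions rather than details from the text: the acceptance rule (the standard Metropolis criterion) and the cost function `cluster_cost`, a toy stand-in that takes the peak over discrete time slots of the summed gate currents instead of a real MSSC estimate. All names and data are hypothetical.

```python
import math
import random


def cluster_cost(cluster, profiles):
    """Toy MSSC stand-in: peak over time slots of the summed gate currents."""
    if not cluster:
        return 0.0
    slots = len(next(iter(profiles.values())))
    return max(sum(profiles[g][t] for g in cluster) for t in range(slots))


def anneal_clusters(clusters, profiles, moves_factor=200, seed=0):
    """Swap-based SA: T from 100 down to 0.1, cooling 0.9, moves_factor * #clusters moves per T.

    Needs at least two non-empty clusters. Returns (best_clusters, best_cost).
    """
    rng = random.Random(seed)
    clusters = [list(c) for c in clusters]
    costs = [cluster_cost(c, profiles) for c in clusters]
    best = ([list(c) for c in clusters], sum(costs))
    temp = 100.0
    while temp > 0.1:
        for _ in range(moves_factor * len(clusters)):
            a, b = rng.sample(range(len(clusters)), 2)
            i = rng.randrange(len(clusters[a]))
            j = rng.randrange(len(clusters[b]))
            clusters[a][i], clusters[b][j] = clusters[b][j], clusters[a][i]
            new_a = cluster_cost(clusters[a], profiles)
            new_b = cluster_cost(clusters[b], profiles)
            delta = (new_a + new_b) - (costs[a] + costs[b])
            # Metropolis acceptance (assumed; the text does not state the rule).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                costs[a], costs[b] = new_a, new_b
                if sum(costs) < best[1]:
                    best = ([list(c) for c in clusters], sum(costs))
            else:  # reject: undo the swap
                clusters[a][i], clusters[b][j] = clusters[b][j], clusters[a][i]
        temp *= 0.9
    return best


# Illustrative data: g1/g2 switch in slot 0, g3/g4 in slot 1.
profiles = {"g1": [1, 0], "g2": [1, 0], "g3": [0, 1], "g4": [0, 1]}
best_clusters, best_cost = anneal_clusters([["g1", "g2"], ["g3", "g4"]], profiles)
print(best_clusters, best_cost)
```

Because gates are only exchanged, cluster sizes stay fixed throughout the search, which matches the move described above; the best configuration seen is tracked separately since SA may end on a worse one.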
Cluster MSSC calculation
The primary objective of MSSC calculation is to search the input vector space to identify the maximum switching current value. Algorithms based on genetic algorithms (GA) [9] and on automatic test pattern generation (ATPG) [10] have been developed for MSSC estimation. We employ a GA to calculate the MSSC for the entire module in this paper. However, a GA is inefficient for calculating the MSSC of a large number of clusters. Therefore, we propose an efficient heuristic algorithm for cluster MSSC calculation in this section. The reader who is only interested in experiments may skip Section 4.4.
MSSC estimation searches for the maximum current value considering both switching time and input vector. In order to simplify the problem, we first solve the MSSC estimation problem at a fixed time, that is, we first estimate MSSC(c, t) based on a small number of random simulations. For example, suppose we want to estimate the maximum current for the cluster of gates G1 to G7. We first simulate the cluster for a number of random input vectors. The switching activities at time t for all gates in all simulations can be encoded in a table as shown in figure 6, where 1 stands for switching and 0 means no switching. For example, row S1 (i.e., simulation S1) means that G1, G2 and G6 switch while G3, G4, G5 and G7 do not switch.

Although G3, G4, G5 and G7 do not switch simultaneously with G1, G2 and G6 in S1, they may switch simultaneously with G1, G2 and G6 under other input vectors. In this case, the switching current at those input vectors is larger than the one in S1. We want to capture this potential and expand the list of simultaneously switching gates as much as we can. We illustrate the idea of expanding simultaneous switching lists by using list S1, in which the simultaneously switching gates are G1, G2 and G6. Instead of checking whether G4 can switch simultaneously with G1, G2 and G6, we check whether all the three-gate combinations that include G4, i.e., G1G2G4, G1G4G6 and G2G4G6, can switch simultaneously. As shown in figure 6, G1G2G4, G2G4G6 and G1G4G6 do occur in S3, S30 and S33, respectively. Thus, G4 has a large potential to switch simultaneously with G1, G2 and G6. G4 is then set to be switching in S1 and the switching current of G4 is added to the total switching current of S1. The switching list for each simulation is expanded until no more expansion is possible. The maximum current value among all the simulations is MSSC(c, t).

Overall, our method for cluster MSSC estimation contains two phases. In the first phase, we carry out a number of random simulations and choose the time (called the peak time) of the peak current of each simulation. In the second phase, we apply the above MSSC(c, t) estimation at every peak time.
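The list-expansion step for MSSC(c, t) can be sketched as follows. The sketch makes two assumptions that the text leaves open: the combinations checked always have the same size as the current list, and they are checked against the originally observed rows only. The table is a hypothetical four-row excerpt built from the combinations named above (it is not the table of figure 6), and each gate is given a unit current.

```python
def expand_row(row, observed, all_gates):
    """Grow one simulation's switching list by the combination check."""
    current = set(row)
    changed = True
    while changed:
        changed = False
        for g in sorted(all_gates - current):
            # Replace each member in turn by g: all |current|-sized combinations containing g.
            combos = [(current - {x}) | {g} for x in current]
            if combos and all(any(c <= obs for obs in observed) for c in combos):
                current.add(g)
                changed = True
                break  # restart with the enlarged list
    return current


def estimate_mssc_at_t(observed, gate_current):
    """Maximum summed current over all expanded switching lists at a fixed time."""
    all_gates = set(gate_current)
    best = 0.0
    for row in observed:
        expanded = expand_row(row, observed, all_gates)
        best = max(best, sum(gate_current[g] for g in expanded))
    return best


# Hypothetical excerpt: rows S1, S3, S30, S33 as sets of switching gates.
observed = [
    {"G1", "G2", "G6"},   # S1
    {"G1", "G2", "G4"},   # S3
    {"G2", "G4", "G6"},   # S30
    {"G1", "G4", "G6"},   # S33
]
gate_current = {f"G{k}": 1.0 for k in range(1, 8)}  # unit currents (assumed)

print(sorted(expand_row(observed[0], observed, set(gate_current))))
print(estimate_mssc_at_t(observed, gate_current))
```

For this excerpt, S1 grows from {G1, G2, G6} to include G4, as in the example in the text, and no further gate qualifies because no four-gate combination containing it was observed.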
EXPERIMENT RESULTS
Gate level synthesis
All proposed algorithms have been implemented inside the SIS [11] environment. We use ISCAS benchmark circuits and report experimental results in Table 1. A gate-level simulator has also been implemented to calculate voltage and current waveforms. Parameters needed to simulate a circuit, such as gate delay, load capacitance, and switching current, are all extracted from SPICE simulations and built into tables. Results from our simulator are within 20% of SPICE simulation results, but the simulator is much faster than SPICE. This simulator was used to verify the gate-level synthesis in this sub-section.
We first compare the area (i.e., transistor width) used by DSTN and cluster-based design (CB-STD), respectively. We measure area by the total channel width of sleep transistors. One can see that DSTN uses significantly smaller area than CB-STD does. On average, the area reduction is 49.8%. Because we do not consider the delay constraint during placement for CB-STD, we obtain a lower bound of the cluster MSSC in a timing-driven placement and in turn a lower bound of the sleep transistor area in CB-STD. Therefore, the area reduction by DSTN would be larger compared to CB-STD if considering practical placement constraints.
We then compare performance loss. We have used extensive random simulations to verify the quality of both sizing schemes. Specifically, 10,000 random simulations have been conducted for each circuit to calculate its maximum PL (MPL for short). For DSTN, the peak current for each module in each simulation is applied to the resistance network as the current source. We compute the transistor channel resistance by (8), and use Ri = 10Ω and 50Ω for virtual-ground wires. The resulting resistance network is solved by a sparse linear equation solver integrated with SIS. The calculated voltages at tapping points are used to compute the performance loss via (3). Note that the resulting MPL value in Table 1 is an upper bound of MPL for the following reasons: (i) the above Ri values are conservative, as discussed in Section 4.1; (ii) the peak currents of individual modules normally occur at different times, but we assume that all peak currents occur at the same time in our experiment.
The same random simulations have been applied to calculate MPL in CB-STD, where PL is calculated via (6). Although the peak current value is also used to calculate MPL, it does not overestimate the PL because each module discharges through only one sleep transistor in CB-STD. Instead, ignoring the resistance of the virtual-ground wires in CB-STD leads to a lower-bound estimate.
As shown in Table 1, when Ri = 10Ω (a conservative case as discussed in Section 4.1), the MPL of DSTN is on average 10% smaller than that of CB-STD. When Ri = 50Ω, i.e., an extreme worst-case scenario as discussed in Section 4.1, the MPL of DSTN is about 6% worse than that of CB-STD. However, Ri is normally 5-15Ω in 100nm technology when the cluster size is 6. Furthermore, the MPL presented in Table 1 is an upper bound of the real MPL in DSTN, and a lower bound of the real MPL in CB-STD. Therefore, it is fair to say that DSTN is better than CB-STD in terms of MPL.
Note that the MPL values for both DSTN and CB-STD are larger than 5%, the PL bound in our experiments. This is because the current values in a large number of random simulations may be bigger than the estimated cluster MSSC. This under-design can easily be removed by scaling up the estimated MSSC.
Custom Layout Design
The exact evaluation of most parameters, such as PL and transistor area, can only be obtained after layout design. Therefore, we implement and compare three layout designs for a 4-bit carry-lookahead (CLA) adder: a sleep-transistor-free (ST-free) design, a cluster-based sleep transistor design (CB-STD), and DSTN.
The three layout designs are implemented as follows. First, an ST-free layout, consisting of four sum modules and one CLA module, is implemented. Then, a CB-STD layout is implemented by partitioning each module into 2-3 clusters and serving each cluster with one sleep transistor. Sleep transistor sizes are determined by SPICE simulations to keep PL below 5%. Finally, we implement a DSTN design by serving the entire CLA adder with six distributed sleep transistors. All these sleep transistors are connected

As shown in Table 2, compared to the ST-free design, both CB-STD and DSTN achieve significant leakage current reduction, but DSTN is approximately five times better than CB-STD. Both CB-STD and DSTN increase the critical path delay, but DSTN has a much smaller delay than CB-STD. DSTN has a transistor area several times smaller than CB-STD. These comparisons are consistent with the previous theoretical analysis and experimental results.
CONCLUSION AND FUTURE WORK
Sleep transistors are effective in reducing both dynamic and leakage power. We have proposed a novel distributed sleep transistor network (DSTN), and have shown that DSTN has reduced area, less supply voltage drop, and no conflict with timing-driven placement when compared to existing module-based and cluster-based sleep transistor structures. We have revealed several properties of the optimal solution to the DSTN sizing problem, and have proposed an effective and efficient DSTN sizing algorithm based on these properties. Based on the experimental comparison with a rigorous cluster-based design, DSTN assuming conservative virtual-ground wires achieves on average 49.8% sleep transistor area reduction and leads to less performance loss. With these advantages, DSTN can be used to implement power gating for reducing dynamic and leakage power [12].
Sleep transistor can be viewed as an essential part of the power/ground network. We assume that the power/ground
