The Multi-Threshold CMOS (MTCMOS) 
INTRODUCTION
Leakage power in modern CMOS VLSI circuits has become comparable to dynamic power dissipation. Typically, the subthreshold leakage current dominates the device off-state leakage because low-$V_{th}$ transistors are employed in logic cell blocks to maintain the circuit switching speed in spite of decreasing $V_{DD}$ levels [1]. The Multi-Threshold CMOS (MTCMOS) technique can significantly reduce the subthreshold leakage currents during the circuit sleep (standby) mode by adding high-$V_{th}$ power switches (sleep transistors) to low-$V_{th}$ logic cell blocks [2] [3]. This is because the stacked high-$V_{th}$ sleep transistor, connected to the bottom of the pull-down network of all logic cells in the circuit, acts as a high-resistance element during the sleep mode, which limits the leakage current from the $V_{DD}$ line to the ground line. At the same time, because of the stack effect, the subthreshold leakage of the low-$V_{th}$ transistors in the logic block itself goes down. Ideally, this leakage reduction is achieved with small performance degradation because, during the active mode of the circuit, the sleep transistor is fully on (i.e., it operates in the linear region), and thus all low-$V_{th}$ logic cells in the MTCMOS logic block can switch very fast.
Unfortunately, the situation is different in real designs. More precisely, during the active mode of the circuit operation, the high-$V_{th}$ sleep transistor acts as a small linear resistance placed at the bottom of the transistor stack to ground, causing the propagation delay of the cells in the logic block to increase. In addition, the virtual ground network itself acts as a distributed RC network, which causes the voltage of the virtual ground node to rise even further, thereby degrading the switching speed of the logic cells even more (cf. Figure 1). The former effect is a function of the size of the sleep transistor, whereas the latter effect is a function of the physical distance of the logic cell from the sleep transistor. Figure 1(a) depicts a logic block, LB, in which a group of low-$V_{th}$ logic cells are first connected to the virtual ground node and then through a high-$V_{th}$ sleep transistor, S, to the actual ground, GND [2]. Figure 1(b) models the virtual ground interconnection and the high-$V_{th}$ sleep transistor, which behaves like a linear resistor in the active mode of the circuit operation [4], as resistors $R_i$ and $R_s$, respectively. The virtual ground is at voltage $V_x$ above the actual ground, i.e.,

$$V_x = I \cdot (R_s + R_i)$$

where $I$ is the current flowing through the virtual ground sub-network and the sleep transistor. The voltage drop across $R_s + R_i$ reduces the gate overdrive voltage of MTCMOS logic cells (i.e., their $V_{gs}$ value) from $V_{DD}$ to $V_{DD} - V_x$.
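As a rough numeric illustration of the resistive model described above, the following sketch evaluates the virtual ground voltage as the product of the current and the series resistance $R_s + R_i$, and the remaining gate drive $V_{DD} - V_x$. The 1 V supply and the 300 Ω sleep transistor resistance are the values quoted in the experimental section; the 0.2 mA current and the wire resistances are hypothetical values chosen only for illustration.

```python
def virtual_ground_voltage(current_a, r_sleep_ohm, r_wire_ohm):
    """Ohmic drop across the sleep transistor (R_s) and the virtual ground wire (R_i)."""
    return current_a * (r_sleep_ohm + r_wire_ohm)


def remaining_gate_drive(v_dd, v_x):
    """Gate-to-source voltage left for an NMOS pull-down whose source sits at V_x."""
    return v_dd - v_x


V_DD = 1.0        # supply voltage in volts (value used in the experiments)
R_S = 300.0       # sleep transistor on-resistance in ohms (value used in the experiments)
CURRENT = 0.2e-3  # hypothetical discharge current in amperes

for r_i in (0.0, 50.0, 200.0):  # hypothetical wire resistances in ohms
    v_x = virtual_ground_voltage(CURRENT, R_S, r_i)
    v_gs = remaining_gate_drive(V_DD, v_x)
    print(f"R_i = {r_i:6.1f} ohm  V_x = {v_x * 1e3:6.1f} mV  V_gs = {v_gs:.3f} V")
```

The farther a cell sits from its sleep transistor, the larger $R_i$ becomes and the less gate drive is left, which is the effect the placement method targets.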
In this paper, we present an optimal algorithm for placing sleep transistors for the standard cell-based layout design, which minimizes the performance degradation of MTCMOS circuits due to the interconnect resistance of the virtual ground network. We discuss previous works in section 2, introduce the problem formulation and the proposed method in section 3 and 4, respectively. Then, we present experiment results in section 5 and conclude in section 6.
PRIOR WORK
Optimal sizing of the sleep transistors (STs) has been actively researched since the MTCMOS technology was introduced. One of the most conservative approaches dedicates one ST to each logic cell and optimizes the individual STs, which is known as "fine-grained" leakage control [5] [6]. This approach makes it easier to do RT-level sign-off by using standard static timing analysis techniques, but it tends to incur large area overhead [7]. Other approaches [8] [9] group logic gates based on their current discharge patterns in a given switching cycle so that sleep transistors of gates with mutually exclusive current discharge patterns are merged together, thereby reducing the area overhead. In [8], the authors first size the sleep transistor of each logic cell so as to impose an upper bound on the performance degradation of the cell during the active mode of the circuit operation. Next, they calculate the discharge current patterns of all logic cells in the circuit based on a unit delay model, and finally, merge the sleep transistors of cells with non-overlapping discharge current waveforms. In [9], the authors use a more precise delay model to do the same steps, presenting several heuristic techniques for efficient gate clustering. In [10], the authors use average current consumption values of logic gates to determine the sleep transistor width needed to satisfy a required circuit speed, based on the assumption that the circuit speed depends only weakly on the circuit operating pattern for sufficiently large sleep transistor sizes. In other approaches [11] [12], the authors selectively apply the MTCMOS technology to gates belonging to timing-critical paths of a circuit, where low-$V_{th}$ gates are only used in the timing-critical paths.

Recently, a number of approaches have begun to consider the interconnect resistance of the virtual ground line as one of the crucial factors that affect the performance of MTCMOS circuits.
In [13], the authors model both sleep transistors and virtual ground interconnects as resistors, and thereby consider a resistive-only network when computing sizes of the distributed sleep transistors. In [14] [15], the authors show that the delay and output slew of a gate with the MTCMOS technology increase linearly with virtual ground length, and then take this parameter into account when modeling the delay of MTCMOS gates.

This paper addresses the automatic placement of sleep transistors considering the interconnect resistance of the virtual ground in standard cell-based layout design. In [7] [16], the authors provide design methodologies that treat an ST as a standalone library cell and then calculate and allocate a number of sleep transistor cells (STCs) on each cell row. The STCs are then placed at one or the other corner of each row on a row-by-row basis. In these methods, all cells in a row share all of the STCs and the virtual ground line in the same row. The main advantages of these approaches are that (a) they are fully compatible with the existing standard-cell physical design flow and (b) they result in a shorter re-activation time when exiting the sleep state [16]. However, these approaches do not consider the voltage drop due to the virtual ground interconnect in deciding the location of STCs. In practice, they tend to over-estimate the size of sleep transistors needed on each cell row so as to compensate for the voltage drop due to virtual ground interconnect resistance. With continued process scaling, the wire segments in the ground network become more resistive, causing more area overhead. In addition, the performance degradation of a logic cell that is far away from the STCs may greatly affect the overall circuit performance if the logic cell happens to lie on a timing-critical path of the circuit.
Therefore, in this paper, we shall address the question of how to place STCs for the standard cell-based layout design so that the performance degradation of MTCMOS circuits due to the interconnect resistance of the virtual ground line is minimized.
PROBLEM FORMULATION
The problem of sleep transistor distribution can be described as follows. Given a placement of an MTCMOS design in a row-based layout and the number of sleep transistor cells allocated to each row, the objective is to determine an optimal placement of the sleep transistor cells such that the performance degradation of the MTCMOS circuit due to the voltage drop of the virtual ground line is minimized.
Let us focus on the sleep transistor distribution in row $y$. Let $C=\{c_i,\ i=1,2,\ldots,n\}$ denote the set of logic cells in row $y$, and $S=\{s_j,\ j=1,2,\ldots,m\}$ denote the set of STCs to be placed on the row, where $n$ and $m$ denote the cell and STC counts, respectively. We define the performance loss of a logic cell $c_i \in C$ in the active mode as follows:

$$PL(c_i) = \frac{\Delta T(c_i)}{SL(c_i)}$$

where $\Delta T(c_i)$ denotes the additional propagation delay of cell $c_i$ that results from the resistance of the virtual ground, and $SL(c_i)$ is the slack available for cell $c_i$ before inserting the STCs. The slack times are calculated by doing static timing analysis (STA) on the circuit. Using the well-known alpha-power delay model [17], the cell propagation delay in the presence of a sleep transistor and virtual ground interconnects is given by

$$T(c_i) = \frac{K \, C_L \, V_{DD}}{\left(V_{DD} - I\,(R_s + R_i) - V_{thL}\right)^{\alpha}} \tag{2}$$

where $K$ is a proportionality constant, $C_L$ is the load capacitance of the cell, $V_{thL}$ is the threshold voltage of the low-$V_{th}$ transistors, and $\alpha$ is the velocity saturation index.

$R_i$ depends on the distances between $c_i$ and the STCs $s_j \in S$ on this cell row. Accordingly, $\Delta T(c_i)$ is a function $f$ of these distances:

$$\Delta T(c_i) = f\left(|x_{c_i} - x_{s_1}|,\ |x_{c_i} - x_{s_2}|,\ \ldots,\ |x_{c_i} - x_{s_m}|\right) \tag{4}$$

where $x_{c_i}$ and $x_{s_j}$ denote the horizontal coordinates of logic cell $c_i$ and STC $s_j$, respectively. Furthermore, to a first order, $\Delta T(c_i)$ is dominated by the position of the sleep transistor that is closest to $c_i$. Let $d_i$ denote the minimal distance between $c_i$ and any of the sleep transistors, i.e.,

$$d_i = \min_{1 \le j \le m} |x_{c_i} - x_{s_j}|$$

We may approximate equation (4) by

$$\Delta T(c_i) \approx g(d_i) \tag{6}$$

Now, in equation (2), $R_i$ may be replaced with $r \cdot d_i$, where $r$ is the wire resistance per unit length of the virtual ground network. By taking the first-order Taylor series expansion of the right-hand side of equation (2) around $d_i = 0$, we can obtain function $g$ in equation (6):

$$g(d_i) \approx \frac{\alpha \, K \, C_L \, V_{DD} \, I \, r}{\left(V_{DD} - I R_s - V_{thL}\right)^{\alpha+1}} \cdot d_i$$

That is,

$$\Delta T(c_i) \approx \kappa \cdot d_i, \qquad \kappa = \frac{\alpha \, K \, C_L \, V_{DD} \, I \, r}{\left(V_{DD} - I R_s - V_{thL}\right)^{\alpha+1}}$$

where $\kappa$ is a proportionality coefficient. Thus the performance loss can be written as

$$PL(c_i) = \frac{\kappa \cdot d_i}{SL(c_i)}$$

We next define a cost figure, $\Phi$, for the STC placement:

$$\Phi = \max_{1 \le i \le n} PL(c_i)$$

It is important to balance the current flowing through each sleep transistor in the active mode. Otherwise, a sleep transistor that carries too much current will significantly increase the virtual ground voltage in its physical neighborhood, and hence it will decrease the performance of the cells in the region. Thus, the optimization problem for STC distribution may be formulated as

$$\text{Minimize } \Phi \tag{11}$$

$$\text{subject to } I_{s_j} \le I_{smax}, \quad \forall s_j \in S \tag{12}$$

where $I_{s_j}$ denotes the total current passing through STC $s_j$ and $I_{smax}$ is an upper bound on the current flowing through the STC. (The rationale is that when it was decided that the cell row should have $m$ sleep transistors on it, that decision was in part based on the assumption that no sleep transistor will have to carry more than $I_{smax}$.) The minimum $\Phi$ value will be denoted by $\Phi_{min}$.
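To make the formulation concrete, the sketch below evaluates the cost figure of a given STC placement on one row. It assumes, as in the text above, that the performance loss of a cell is proportional to its distance to the nearest STC divided by its slack, and that the cost figure is the worst loss over the row. The coefficient `kappa`, the function names, and the sample data are illustrative assumptions, and the current constraint (12) is not checked here.

```python
def nearest_distance(x_cell, stc_positions):
    """d_i: distance from a cell to its closest sleep transistor cell."""
    return min(abs(x_cell - x_s) for x_s in stc_positions)


def performance_loss(x_cell, slack, stc_positions, kappa=1.0):
    """PL(c_i) = kappa * d_i / SL(c_i); slack must be positive."""
    return kappa * nearest_distance(x_cell, stc_positions) / slack


def placement_cost(cells, stc_positions, kappa=1.0):
    """Phi: the worst performance loss over all cells of the row."""
    return max(performance_loss(x, sl, stc_positions, kappa) for x, sl in cells)


# Hypothetical row: (x coordinate, slack) for each logic cell.
row = [(0.0, 2.0), (10.0, 1.0), (20.0, 4.0), (30.0, 0.5)]

print(placement_cost(row, [0.0, 30.0]))   # both STCs at the row corners -> 10.0
print(placement_cost(row, [10.0, 30.0]))  # one STC moved next to a low-slack cell -> 5.0
```

In this toy row, moving one STC from the corner to the cell with little slack halves the cost figure, even though the number of STCs is unchanged.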
PROPOSED APPROACH
In this section, we present an optimal solution to the STC distribution problem where $m<n$. For $m \ge n$, it is easy to see that $\Phi_{min}$ can be achieved trivially by enforcing $d_i = 0$ for every cell, i.e., by placing an STC at the location of each logic cell. [...]

The solution with a minimal $\Psi$ value in space $\Sigma$ is an optimal solution to the STC distribution problem defined by equations (11) and (12).

Proof: Let $\sigma^*$ denote the solution in space $\Sigma_m$ that has the minimal $\Psi$ value. Assume $\sigma_{opt}$ is an optimal solution to the problem defined by equations (11) and (12). Based on $\sigma_{opt}$, we can construct a valid partition $P$ of set $C$ as follows. First, we sort the sleep transistors in increasing order of their coordinates. Next, each cell is assigned to the section corresponding to its closest sleep transistor. Let $m'$ denote the number of sections in $P$, where $m' \le m$. Notice that $m'$ may be less than $m$ because it is possible that a sleep transistor is not the closest one to any cell. Similar to the proof of Lemma 1, we can easily show that [...]

[...] When $n$ and $m$ are large, the brute-force approach will be impractical.
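A quick count suggests why exhaustive search does not scale. The sketch assumes that a brute-force method would enumerate every way of cutting the $n$ sorted cells of a row into $m$ non-empty contiguous sections, which gives the binomial coefficient $\binom{n-1}{m-1}$ candidates. This enumeration scheme and the function name are assumptions made for illustration, since the description of the brute-force method is not available in the text.

```python
from math import comb


def contiguous_partitions(n_cells, m_sections):
    """Number of ways to cut n ordered cells into m non-empty contiguous sections."""
    if m_sections < 1 or m_sections > n_cells:
        return 0
    return comb(n_cells - 1, m_sections - 1)


for n, m in ((10, 3), (50, 5), (200, 10)):
    print(f"n = {n:3d}, m = {m:2d}: {contiguous_partitions(n, m)} candidate partitions")
```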
We present an efficient dynamic programming approach to search for the optimal solution. Before describing the approach, we first re-formulate the current balance constraint on STCs to facilitate its incorporation into the proposed approach. By assuming a uniform current distribution in the row, the maximal current constraint can be described as a limit on the number of cells that any section in a valid partition can include. Let $pn_j$ denote the number of cells in section $j$ and $pn_{max}$ denote the preset upper bound of $pn_j$. Constraint (12) then becomes

$$pn_j \le pn_{max}, \quad j = 1,2,\ldots,m \tag{14}$$

Let $\Gamma(i,j,k)$ denote the minimal cost of serving cells $c_i$ through $c_j$ with $k$ sections, each section being served by one STC. Then

$$\Gamma(1,j,k) = \min_{\max(j-(k-1)\,pn_{max},\,1)\ \le\ l\ \le\ \min(pn_{max},\, j-k+1)} \max\big(\Gamma(1,\,j-l,\,k-1),\ \Gamma(j-l+1,\,j,\,1)\big) \tag{15}$$

$$\Gamma(i,j,1) = \min_{x_s} \max_{i \le t \le j} PL(c_t) \tag{16}$$

where, in equation (16), $PL(c_t)$ is the performance loss of cell $c_t$ when the single STC of the section is placed at coordinate $x_s$.

In equation (15), $l$ denotes the number of cells in the rightmost section between $c_1$ and $c_j$. According to constraint (14), $l$ should be no greater than $pn_{max}$; on the other hand, $l$ has to be large enough so that the remaining cells can be put into the remaining $k-1$ sections. In addition, $l$ may not exceed $j-k+1$, so that at least one cell is left for each of the remaining $k-1$ sections.

The pseudo-code for the optimal STC distribution algorithm is presented in Figure 2. First, Equation (16) is solved for all $\Gamma(i,j,1)$ values, $1 \le i \le j \le n$, which are stored in a table with a dimension of $n \times n$. The proposed dynamic programming approach utilizes a table $\Gamma(j)$, $1 \le j \le n$, where each entry holds the computed value of $\Gamma(1,j,k)$ during the $k$-th iteration, and a table $I(j,k)$, where each entry stores the $l$ value in equation (15) that leads to the minimal $\Gamma(1,j,k)$. Lines 3 to 9 embody the dynamic programming iterations that update the values in tables $\Gamma(j)$ and $I(j,k)$. Decrementing $j$ from $n$ to 1 ensures that the entries of table $\Gamma(j)$ updated in the $k$-th iteration are not read again within the same iteration, and thus enables an in-place operation. The procedure in Line 11 traces back through the pointers held in table $I(j,k)$ and works as follows. Starting from entry $I(n,m)$, read the value in the $I(j,k)$ entry. Assume $I(j,k)=l$; then assign cells $c_{j-l+1}$ to $c_j$ to section $p_k$. Next, go to entry $I(j-l,k-1)$ and section $p_{k-1}$. This process continues until $k=1$. Finally, after partition $P$ is constructed, the coordinates of the sleep transistors are obtained by solving Equation (16) over each corresponding section in $P$.

Figure 2. STC Distribution Algorithm

    INPUT: x_ci, i=1,2,…,n …
    1.  solve Equation (16) for Γ_1(i, j) = Γ(i, j, 1), ∀i, j, 1≤i≤j≤n
    2.  Γ(j) ← Γ(1, j, 1), ∀j, 1≤j≤n
    3.  for k = 2:m
    4.    for j = n:1   // j decrement
    5.      t ← ∞
    6.      for l = max(j−(k−1)pn_max, 1) : min(pn_max, j−k+1)
    7.        t ← min(t, max(Γ(j−l), Γ_1(j−l+1, j)))
    8.        if t was lowered in line 7 then I(j, k) ← l
    9.      Γ(j) ← t
    10. …
    11. trace back through I(j, k) to construct partition P
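The dynamic program described above can be sketched in Python as follows. This is an illustrative reading of the recurrence, not the authors' implementation. It assumes strictly positive slacks, takes the performance loss of a cell as its distance to the section's STC divided by its slack (the proportionality coefficient is set to 1), and requires every section to hold between 1 and `pn_max` cells. For the one-section subproblem it swaps in a closed-form pairwise rule: the best single location balances the two cells whose intervals are hardest to cover, giving cost $|x_a - x_b| / (SL_a + SL_b)$ for the worst pair. All function and variable names are hypothetical.

```python
from math import inf


def pair_cost(a, b):
    """Worst loss of two cells (x, slack) served by one optimally placed STC."""
    return abs(a[0] - b[0]) / (a[1] + b[1])


def best_position(section):
    """Return (cost, x) of the min-max STC location for one section of cells."""
    cost, x = 0.0, section[0][0]
    for p, a in enumerate(section):
        for b in section[p + 1:]:
            c = pair_cost(a, b)
            if c > cost:  # the worst pair fixes both the cost and the location
                cost = c
                x = (a[0] * b[1] + b[0] * a[1]) / (a[1] + b[1])
    return cost, x


def distribute_stcs(cells, m, pn_max):
    """Split sorted cells into m contiguous sections (each 1..pn_max cells)
    so that the worst loss is minimal. Returns (cost, positions, sections)."""
    cells = sorted(cells)
    n = len(cells)
    if not 1 <= m <= n or m * pn_max < n:
        raise ValueError("need 1 <= m <= n and m * pn_max >= n")
    # g1[i][j]: one-section cost for cells i..j (0-based, inclusive).
    g1 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            new_pairs = [pair_cost(cells[a], cells[j]) for a in range(i, j)]
            g1[i][j] = max([g1[i][j - 1]] + new_pairs)
    # gamma[j]: best cost for the first j cells with k sections (starts at k = 1).
    gamma = [inf] * (n + 1)
    for j in range(1, min(pn_max, n) + 1):
        gamma[j] = g1[0][j - 1]
    choice = [[0] * (n + 1) for _ in range(m + 1)]
    choice[1] = list(range(n + 1))  # with one section, it holds all j cells
    for k in range(2, m + 1):
        for j in range(n, 0, -1):  # descending j keeps gamma[j - l] at level k - 1
            t, best_l = inf, 0
            lo, hi = max(j - (k - 1) * pn_max, 1), min(pn_max, j - k + 1)
            for l in range(lo, hi + 1):
                cand = max(gamma[j - l], g1[j - l][j - 1])
                if cand < t:
                    t, best_l = cand, l
            gamma[j], choice[k][j] = t, best_l
    # Trace back the section sizes from the last section to the first.
    sections, j = [], n
    for k in range(m, 0, -1):
        l = choice[k][j]
        sections.append(cells[j - l:j])
        j -= l
    sections.reverse()
    return gamma[n], [best_position(s)[1] for s in sections], sections


# Hypothetical row of five cells with equal slack: (x coordinate, slack).
row = [(0.0, 1.0), (10.0, 1.0), (20.0, 1.0), (50.0, 1.0), (60.0, 1.0)]
cost, positions, sections = distribute_stcs(row, m=2, pn_max=3)
print(cost, positions)  # 10.0 [10.0, 55.0]
```

The table of one-section costs is filled in cubic time here for brevity. The section loop itself runs in time proportional to $m \cdot n \cdot pn_{max}$, in line with the in-place update described in the text.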
EXPERIMENTAL RESULTS
We used ISCAS benchmark circuits and SIS to generate optimized gate-level netlists; all benchmarks were first optimized by using the SIS "script.rugged" and then mapped by timing-driven technology mapping onto an industrial-strength 90nm ASIC design library. All benchmarks were run on a Sun UltraSPARC II machine. We set the high $V_{th}$ values for PMOS and NMOS as -303mV and 260mV, and the low $V_{th}$ values for PMOS and NMOS as -250mV and 200mV, respectively.
In our experiments, we first generated detailed row-based cell placements for the benchmark circuits by using the timing-driven placer of [18]. We then calculated the number of sleep transistor cells to be placed on each row based on the Average Current Method (ACM) of [10], where the size of every sleep transistor cell is determined so that its "on" resistance, $R_s$, comes out to be about 300Ω. Next, we applied the proposed Optimal ST Distribution technique (from here on we call it OSTD) to each placement row, followed by a layout adjustment to remove overlaps between logic cells and sleep transistor cells.
We compare OSTD with the sleep transistor cell placement method used in [7] [16], where all sleep transistor cells are located at the corners of each row so that they do not disturb the existing placement (from here on we call this method CSTD, for Corner-based ST Distribution), in terms of the total wire length and critical path delay. In addition, we implement another sleep transistor cell placement method, which distributes the sleep transistor cells uniformly on the row, with equal spacing between sleep transistors on the same row (from here on we call this method USTD, for Uniform ST Distribution). We also compare OSTD with USTD in terms of the critical path delay.

To obtain these results, we perform global routing after the placement of sleep transistor cells so that the wiring loads of all nets may be accurately calculated. Next, we extract cell locations and interconnect parasitics and input the whole extracted netlist to HSPICE to measure the critical path delay of the benchmark circuits. To minimize the simulation time, we run STA before HSPICE simulation to identify the set of PI-PO critical paths and then apply input vectors that cause the propagation of an event along these paths. We therefore run HSPICE simulation only for the input vectors producing the transitions along the STA-identified timing-critical paths.

Figure 3 presents transient simulations of the virtual ground line of the first row in the layout of circuit C7552, operating at a supply voltage of 1V. Compared to CSTD, OSTD reduces the virtual ground bounce by about 28%, from 250mV to 180mV. This reduction of virtual ground bounce tends to improve the performance of the circuit.

Table 1 shows comparison results between OSTD and CSTD in terms of critical path delay and wire length. For each benchmark circuit, we generated 10 different placement solutions (corresponding to different random seeds for the timing-driven placer). We therefore also obtained 10 different sets of wire lengths and critical path delays, and report in Table 1 only the mean value of each figure of merit. Based on these results, we conclude that OSTD reduces the critical path delay by an average of 11% at the cost of an average increase of 0.7% in the total wire length for the benchmark circuits. The increased total wire length is caused by pushing or pulling some logic cells during the layout adjustment step needed to remove the overlaps between the sleep transistors and logic cells. The last column in Table 1 shows the runtime of the OSTD algorithm for the benchmark circuits.

Table 2 shows comparison results between OSTD and USTD in terms of the critical path delay. OSTD reduces the critical path delay by an average of 3.8% compared to USTD. Notice that we limited ourselves to small benchmark circuits because of the limited capacity of HSPICE for simulating large circuits. We expect that the advantage of our proposed method (OSTD) over CSTD and USTD becomes more pronounced as the number of logic cells in the circuit increases.
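For intuition about the two baselines, the sketch below generates corner-based and uniform STC positions for one row and compares their worst distance-over-slack figure. The exact CSTD and USTD placement rules are not spelled out in the text, so splitting the STCs between the two row ends and placing uniform STCs at the centers of equal-width segments are assumptions, as are the function names and the sample row. The numbers it prints are toy values and do not correspond to the reported measurements.

```python
def corner_positions(row_start, row_end, m):
    """CSTD-like baseline: the m STCs are split between the two row ends."""
    return [row_start] * (m - m // 2) + [row_end] * (m // 2)


def uniform_positions(row_start, row_end, m):
    """USTD-like baseline: one STC at the center of each of m equal-width segments."""
    step = (row_end - row_start) / m
    return [row_start + (k + 0.5) * step for k in range(m)]


def worst_loss(cells, positions):
    """Largest (distance to nearest STC) / slack over all cells (x, slack)."""
    return max(min(abs(x - p) for p in positions) / slack for x, slack in cells)


# Hypothetical row spanning 0..60 with five equal-slack cells.
row = [(0.0, 1.0), (10.0, 1.0), (20.0, 1.0), (50.0, 1.0), (60.0, 1.0)]

print("corner :", worst_loss(row, corner_positions(0.0, 60.0, 2)))   # 20.0
print("uniform:", worst_loss(row, uniform_positions(0.0, 60.0, 2)))  # 15.0
```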
CONCLUSION
An optimal sleep transistor cell placement methodology for MTCMOS circuits was presented. The presented algorithm provides optimal locations of sleep transistor cells for standard cell-based layout design so that the performance degradation of the MTCMOS circuit due to the interconnect resistance of the virtual ground network is minimized.
