ABSTRACT
INTRODUCTION
Leakage power consumption is a growing concern in integrated circuit design. Nanometer CMOS transistors are characterized by significant sub-threshold and gate leakage currents [1], and feature size scaling is exacerbating this problem. In the absence of revolutionary technology advances (e.g., high-k dielectrics, new transistor structures), design techniques to reduce leakage power are now critical. As a result, leakage reduction has recently become a cross-cutting issue at all levels of abstraction [2], from device to architecture. In today's technologies (i.e., 90nm), sub-threshold leakage currents are still dominant with respect to gate currents (although the trend shows that the latter grow more rapidly as technology scales). Thus, this paper specifically addresses the sub-threshold component of the overall leakage current. A number of leakage reduction techniques start from the observation that sub-threshold current in a stack of OFF transistors is greatly reduced with respect to the single transistor case. This is due to the exponential decrease of sub-threshold currents with decreasing gate-source voltage Vgs. While Vgs = 0 for a single OFF transistor, it becomes negative for the top transistors in a stack. As a consequence, leakage current is effectively cut off for the entire stack. Quantitative analyses reported in the literature [3] show that leakage current can be decreased by one order of magnitude by simply stacking two transistors.
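To make the exponential dependence on Vgs concrete, the following is a minimal sketch based on the textbook first-order sub-threshold model, in which the current scales as exp(Vgs / (n·vT)). It is not taken from this paper or from [3]. The slope factor `n = 1.5` and the 100 mV source rise are illustrative assumptions, and the function name is hypothetical. The model ignores the drain-induced barrier lowering and body effect that also contribute to the stacking benefit.

```python
import math

# Thermal voltage kT/q at roughly 300 K, in volts.
THERMAL_VOLTAGE = 0.02585


def subthreshold_ratio(delta_vgs, n=1.5, v_t=THERMAL_VOLTAGE):
    """Return I(Vgs = -delta_vgs) / I(Vgs = 0) for the first-order model.

    delta_vgs: how far (in volts) the gate-source voltage drops below zero,
               e.g. because the intermediate node of a stack rises.
    n:         sub-threshold slope factor (assumed value, process dependent).
    """
    return math.exp(-delta_vgs / (n * v_t))


if __name__ == "__main__":
    # Assumed example: the internal stack node settles about 100 mV above ground.
    ratio = subthreshold_ratio(0.1)
    print(f"leakage scaled by a factor of {ratio:.3f}")
```

Under these assumed values, a negative Vgs of about 100 mV alone reduces the modeled current by more than a factor of ten, which is consistent in magnitude with the one-order-of-magnitude reduction quoted above.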
Clearly, the main problem with transistor stacking is that the effective resistance of a series connection of transistors is higher than that of a single transistor, and therefore adding transistors in the pull-down and/or pull-up of logic CMOS gates significantly decreases their switching speed. To reduce the performance impact associated with transistor stacking, a common technique is to connect a number of CMOS gates to a virtual ground node, which is then connected to the ground node through a large sleep transistor, whose gate is driven by a sleep-control signal. When the transistor is OFF, leakage is reduced for all gates connected to the virtual ground. At the same time, when the transistor is ON, its large size guarantees a highly conductive path for the discharge currents coming from the gates. Even more importantly, the capacitance of the virtual ground greatly helps the dynamic performance of the gates, by providing a low-impedance AC path to ground. Despite these clear advantages, the shared sleep transistor approach faces several challenges. First, sleep transistors have a significant cost in terms of area. Second, and most important, they slow down standard CMOS gates. We distinguish two main speed effects: the slow-down of power-gated logic cells when the circuit is active (active slow-down), caused by the increased pull-up/pull-down resistance, and the re-activation delay for re-enabling a set of powered-down cells. While huge sleep transistors controlling a large number of cells are desirable for minimizing active slow-down (thanks to the virtual ground effect), they are obviously very expensive in terms of area and re-activation delay. Distributed sleep transistor approaches [7, 8] represent a compromise solution. Smaller clusters of cells can be gated with smaller sleep transistors, which can be more easily embedded in unused spaces of existing layouts.
Furthermore, it is easier to individually select the size of the sleep transistors to provide localized and fine-tunable control on re-activation delay. In this paper, we contribute a complete methodology for layout-aware, distributed sleep transistor insertion for cell clusters that have physical proximity. Our insertion style is fully compatible with industry-standard row-based layout styles and the supporting design tools. Sleep transistor cells are chosen from a library of cells that has been designed for high layout efficiency. These cells are inserted at the boundaries of existing cell rows, causing minimal disruption in placement and routing. Selection of the most appropriate sleep transistor cell size to control each group of cells is driven by the models of [10] . Furthermore, we present a novel gate clustering algorithm that groups together sets of cells to be controlled by the same sleep transistor; the cost function used by the algorithm to select the cells that have to be gated is layout-aware, i.e., it takes advantage of cell placement information. The algorithm accounts for constraints on area overhead, active slow-down and re-activation delay: It selects for power gating the subsets of cells that give maximum power reduction, without exceeding user-specified bounds for delay and area costs. The effectiveness of the proposed methodology has been benchmarked on a set of design examples for which a physical implementation has been obtained through commercial EDA tools; the results we have achieved show a reduction of leakage power ranging from 74% to 83%, depending on the circuit. It is important to stress the point that, thanks to the strategy used for gate clustering, the optimized designs have a tightly controlled delay and area penalty. Therefore, the user is allowed to explore the trade-off between leakage reduction and delay or area overhead. The remainder of the paper is organized as follows. 
In Section 2 we briefly review previous work on leakage reduction techniques. Section 3 highlights the sleep transistor insertion methodology. Section 4 describes the layout-aware cell selection algorithm. Section 5 provides experimental results obtained on a set of benchmark circuits, while Section 6 closes the paper.
PREVIOUS WORK
Several approaches for successfully minimizing sub-threshold leakage power dissipation in stand-by mode have been presented in the literature. In [4], a Variable-VTH (VTCMOS) strategy is adopted in order to cut off leakage current. In particular, it applies back-gate bias by exploiting body effect. This requires modification to cell libraries and, above all, specific technology support [5]. Other approaches are Dual-VTH strategies, which perform leakage power reduction by partitioning a circuit into critical and non-critical path regions. Subsequently, low-VTH and high-VTH transistors are used for implementing gates in the critical and non-critical regions, respectively [6]. The shortcoming of this approach is that many circuits may have a significant number of critical paths. As a consequence, high-VTH transistors may be used for too small a percentage of gates to yield a significant leakage power reduction. Furthermore, supporting multiple thresholds increases the complexity of the fabrication process and may create difficulties from the tool support perspective. A popular approach for stand-by power reduction is the adoption of emerging Multi Threshold CMOS (MTCMOS) technologies [7, 8]. They reduce stand-by power consumption by inserting a high-VTH cut-off MOSFET (i.e., a sleep transistor) in series with the initial low-VTH circuit. Hence, sub-threshold leakage current is reduced by the sleep transistor while performance loss is controlled. The latter happens thanks to two factors: First, the sleep transistor can be made very large (i.e., with low resistance), because it is shared among many cells; second, the large capacitance of the net connecting the cells and the sleep transistors provides a low-impedance AC discharge path, i.e., a virtual ground for the transient currents created by the switching gates. MTCMOS techniques present two drawbacks.
First, they still require process modifications for supporting the high-VTH of the sleep MOSFET. Second, when a circuit is deactivated by power gating, it takes a non-negligible amount of time to wake up and re-activate it, simply because the large sleep transistor must be switched on and it must initially discharge the large virtual ground capacitance. The first drawback is eliminated if the sleep transistor is fabricated with the same threshold as the other transistors in the circuit. Even though leakage reduction is less substantial, the stacking effect still provides significant benefits. To address the second limitation, several distributed sleep transistor approaches have been proposed, where multiple smaller sleep transistors are instantiated. The main advantage of distributed sleep transistor implementations is a faster re-activation time when exiting the sleep state. Unfortunately, most techniques presented in the past work at the logic and circuit level, and thus they do not fully take into account the information about the placement of the logic cells. This is a serious drawback, because connecting cells that are placed far apart to the same virtual ground and sleep transistor can cause severe wiring congestion. The only two approaches available in the literature that account for cell placement are [7, 8]. However, they both assume a full-custom design style, where single transistors can be arbitrarily placed inside the chip. In the sequel, we describe a distributed sleep transistor implementation style which is fully compatible with standard-cell physical design tools that support row-based layouts, where logic gates are placed in rows of adjacent cells with connection channels between rows.
AUTOMATIC STI METHODOLOGY
Most approaches for distributed sleep transistor insertion (STI), including those that account for physical information (i.e., cell placement) [7, 8], are characterized by a significant cost, both in area and delay, that is associated with the instantiation of the sleep transistor cells. In this section, we describe an automatic methodology for distributed STI that allows the designer to keep the area and delay overhead under control, thanks to an accurate analysis of the circuit layout to be optimized. The entry point of the flow is a circuit for which placement is already done using a row-based style. We assume that all the cells in the circuit can potentially be controlled by sleep transistors that cut off the sub-threshold leakage currents when the cells are in stand-by mode. The control signal that drives the sleep transistors is assumed to be available from some external module (e.g., a microprocessor). Sleep transistors are inserted on a row-by-row basis, at the boundaries of each row, as shown in Figure 1, and they are connected to a common virtual ground. The sleep transistors are picked from a library that contains devices of different sizes, driving strengths and speed, fully compliant with the cells belonging to the technology library; the sleep transistor cells in the library have been designed and fully characterized using the procedure of [10]. The number and the position of the cells driven by each sleep transistor are selected through the algorithm described in Section 4, which accounts for the area and delay overhead allowed by the user specification. In the remainder of this section, we briefly highlight the principles that allow our algorithm to tightly control the area and delay penalties caused by distributed STI.
Controlling Area Penalty
In a row-based layout style, the floorplan of a circuit is partitioned into rows separated by routing regions, known as channels. If a few metal layers are supposed to be used for routing, the interconnect scheme of the design can be completed thanks to the routing resources provided by such regions. This may be true even if an aggressive over-the-cell routing style (four metal layers or more) is adopted, since interconnects might be so complex as to require more horizontal routing resources. In order to satisfy performance constraints and facilitate routability, it is common practice to place cells after channel heights are fixed and the number and positions of cell sites for each row are determined. Clearly, this leads to the presence of empty spaces (white spaces) which are allocated between cells mainly for alleviating local wiring congestion (see Figure 2-a). We propose to take advantage of part of the area of such empty regions for sleep transistor insertion, in accordance with the wiring congestion tolerance. The presence of inter-row spacings eases the use of such a strategy. In fact, since the heights of the channels are fixed before placement, they might not be fully exploited by the router, which would instead utilize the channels to the maximum extent, leaving several white spaces in the layout rows. The amount of available space for each layout row is determined (see Figure 2-b) by performing row compaction (according to the congestion tolerance) and used for accommodating the sleep transistors. Confining the implementation of the sleep transistors to the space that becomes available after compaction would have the desirable effect of zeroing the area overhead, that is, the layout after STI would have the same size as the original one. However, this solution may be overly conservative, as it may prevent the possibility of power-gating the majority of the cells in the row.
In fact, the larger the number of cells in a row that are controlled by the sleep transistor, the larger the size of the transistor to be inserted (to preserve the active slow-down factor). In addition to that, not all the available space reclaimed through compaction can be used by the sleep transistor cells; some spacing has to be maintained between sleep transistors and standard cells in order to avoid undesirable electrical phenomena. Since in a row-based design cells are placed by abutment, if such a space is not maintained, an electrical contact between the ground of a sleep transistor and the virtual ground of the adjacent cell (if gated) is generated with the undesirable pitfall of shorting the sleep transistor ground and thus nullifying its stacking effect. In order to increase the potential of STI (i.e., the possibility of power-gating many cells in a row), we can trade transistor size for area overhead. Rows can be widened by a certain (tightly controlled) amount in order to allow the insertion of larger transistors, thus enabling the gating of more cells in the row (see Figure 2 -c).
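The area budgeting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SleepCell` record, the function names, the `reclaimable` fraction standing in for the congestion tolerance, and the `isolation_gap` parameter are all hypothetical, and widths are expressed in arbitrary units along the row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SleepCell:
    name: str
    width: float        # footprint along the row (same unit as the row width)
    max_current: float  # largest current the cell can sustain


def sleep_cell_budget(row_width: float, cell_widths: List[float],
                      area_overhead: float = 0.0,
                      isolation_gap: float = 0.0,
                      reclaimable: float = 1.0) -> float:
    """Width available in one row for a sleep transistor cell.

    reclaimable:   assumed fraction of the white space that compaction may
                   recover without exceeding the congestion tolerance.
    area_overhead: allowed row widening as a fraction of the row width.
    isolation_gap: spacing kept between sleep cell and standard cells.
    """
    white_space = row_width - sum(cell_widths)
    reclaimed = reclaimable * white_space
    widening = area_overhead * row_width
    return max(0.0, reclaimed + widening - isolation_gap)


def pick_sleep_cell(library: List[SleepCell],
                    budget: float) -> Optional[SleepCell]:
    """Pick the strongest library cell that fits in the budget, if any."""
    fitting = [cell for cell in library if cell.width <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda cell: cell.max_current)
```

With `area_overhead=0.0` the sketch corresponds to the zero-overhead case, in which only the space reclaimed by compaction is used; a positive value models the controlled row widening of Figure 2-c.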
Controlling Delay Penalty
In order to minimize the leakage power consumption, assuming that enough area slack is available, all cells in the circuit should be power-gated. Unfortunately, this solution would imply a delay increase that would go far beyond the intrinsic performance penalty caused by STI (i.e., active slow-down, which is related to the size of the sleep transistors). In fact, the re-activation time needed by the sleep transistors, when they change from the off-state to the on-state, may be longer than the response time of many cells in the design (especially those placed close to the circuit primary inputs). This is mainly true if the activation of all the gates within the circuit is influenced by the inserted sleep transistors. In other words, when the circuit changes from the stand-by mode to the active mode, a penalty in delay corresponding to the sleep transistors' re-activation times must be paid. Such a penalty can be traded for a smaller reduction of the sub-threshold leakage current in stand-by mode by limiting the number of cells that will be power-gated. In particular, it is possible to trade (or even nullify) the re-activation delay penalty by preventing the power gating of some (all) of the cells whose arrival times are shorter than the re-activation delay of the sleep transistors. Figure 3 shows an example of how cells to which power gating is not applied are selected based on timing information; shaded gates have arrival times that are shorter than the re-activation delay required by the sleep transistor that is supposed to control them. Avoiding power-gating of all the shaded cells will ensure a zero re-activation delay overhead.
The fact that, for a given constraint on the re-activation delay, not all the cells in a row are power-gated may provide a further benefit of the application of the proposed methodology; in particular, the size of the sleep transistor in that row may end up being smaller than that of the transistor that would be able to control all the gates in the row. This would have the twofold advantage of reducing the active slow-down delay overhead observed in normal active operation (although it will never become zero), and of increasing the opportunities for further row compaction.
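The timing-based exclusion described in this section can be illustrated with a short sketch. The function name, the dictionary of arrival times and the `allowed_overhead` parameter are assumptions introduced for illustration; in the actual flow the arrival times come from timing analysis, and the re-activation delay from the characterized sleep transistor cells.

```python
from typing import Dict, List


def gateable_cells(arrival_times: Dict[str, float],
                   reactivation_delay: float,
                   allowed_overhead: float = 0.0) -> List[str]:
    """Return the cells that can be power-gated within the delay budget.

    A cell is kept out of power gating when its arrival time is shorter
    than the re-activation delay of the sleep transistor, unless the user
    tolerates a re-activation overhead that covers the difference.
    """
    return [
        name
        for name, arrival in arrival_times.items()
        if arrival + allowed_overhead >= reactivation_delay
    ]
```

Calling the function with `allowed_overhead=0.0` mirrors the zero re-activation overhead case of Figure 3, where every shaded cell is left ungated.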
GATE CLUSTERING
The objective of the clustering procedure is to identify groups of cells that will be power-gated by the same sleep transistor cell. In particular, the clustering algorithm we have implemented takes into account both the physical positions of the cells in the layout and their timing paths. The pseudo-code of the proposed algorithm is shown in Figure 4. On the basis of the previous considerations, gates closer to primary outputs (hence with longer timing paths) are good candidates to be clustered, since the sleep transistor gating them will already be turned on when their inputs become stable. Initially, timing information about each gate of the layout is captured and listed in decreasing timing order (Lines 1-2). Then, the algorithm proceeds one layout row at a time (for loop of Line 3). The available space for row i after compaction is computed (Line 4); further space is also added according to the area overhead allowed by the user (Line 5) and the sleep transistor of the appropriate size is retrieved from the library (Line 6). The maximum current sustainable by the chosen transistor is calculated in Line 7; the cell selection process performs a gate-by-gate exploration of each row (while loop of Line 8), starting from the cell with the longest timing path and going back towards the primary inputs (Line 9). If more than one cell is available, the algorithm selects the one with maximum leakage current (Lines 10-14). For each selected gate, the impact of the gate itself on the sleep device re-activation time is evaluated (Line 15) and the remaining current at the sleep transistor is computed (Line 16). If the required re-activation time has not been violated and the sleep transistor is able to sustain the current associated with the selected gate (Line 17), the gate is added to the current cluster (Line 18) and the exploration goes on.
Otherwise (Line 19) the cluster is complete for the i-th row and therefore it is added to the overall list of clusters (Line 20) before the procedure continues with the next row.
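The per-row selection loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the pseudo-code of Figure 4: the `Gate` and `SleepCell` records are hypothetical, the sleep cell for each row is assumed to have been chosen already, and the linear `reactivation_time` model is a placeholder for the characterized models of [10].

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gate:
    name: str
    arrival: float  # arrival time from timing analysis
    leakage: float  # stand-by leakage current of the cell
    current: float  # current the sleep transistor must sustain for it


@dataclass
class SleepCell:
    name: str
    max_current: float    # maximum sustainable current
    base_delay: float     # assumed re-activation delay with no load
    delay_per_amp: float  # assumed delay increase per unit of load


def reactivation_time(cell: SleepCell, load: float) -> float:
    """Placeholder linear model; the real model comes from characterization."""
    return cell.base_delay + cell.delay_per_amp * load


def cluster_row(gates: List[Gate], cell: SleepCell,
                delay_overhead: float = 0.0) -> List[Gate]:
    """Greedily select the gates of one row gated by the given sleep cell."""
    # Longest timing path first; ties broken by the larger leakage current.
    ordered = sorted(gates, key=lambda g: (-g.arrival, -g.leakage))
    cluster: List[Gate] = []
    load = 0.0
    for gate in ordered:
        new_load = load + gate.current
        wake_up = reactivation_time(cell, new_load)
        fits_current = new_load <= cell.max_current
        fits_timing = wake_up <= gate.arrival + delay_overhead
        if not (fits_current and fits_timing):
            break  # the cluster for this row is complete
        cluster.append(gate)
        load = new_load
    return cluster


def cluster_layout(rows: List[Tuple[List[Gate], SleepCell]],
                   delay_overhead: float = 0.0) -> List[List[Gate]]:
    """Process the layout one row at a time and collect the clusters."""
    return [cluster_row(gates, cell, delay_overhead) for gates, cell in rows]
```

The greedy loop stops at the first gate that violates either constraint, mirroring the description in which the cluster for the row is closed and the procedure moves to the next row.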
EXPERIMENTAL RESULTS
The viability and effectiveness of the proposed sleep transistor insertion methodology have been assessed on a set of logic blocks that are part of an industrial design provided by STMicroelectronics. The standard cell library we used for our experiments is the 130nm HCMOS9 provided by STMicroelectronics. The gate clustering algorithm was run with a zero-overhead constraint on the re-activation delay, thus ensuring that the overall performance degradation was never higher than the intrinsic 5% introduced in active-mode operation by the insertion of the sleep transistor. On the other hand, an area overhead of up to 5% w.r.t. the original circuits was tolerated. This value was determined after analyzing the sensitivity of leakage power to area overhead for some of the benchmark circuits we have considered.
The results of our analysis indicated that widening the layout rows by more than 5% did not provide further leakage power savings, as growing the size of the transistors did not lead to the consideration of more cells for power-gating. This was clearly a consequence of the zero-overhead constraint posed on the re-activation delay; no other cells could be power-gated without introducing a timing violation, even if a larger transistor had been inserted. Post-layout simulation was performed to obtain leakage power consumption and timing information for each circuit. The power results, expressed in mW, for all the experiments are collected in Table 1. In particular, columns Orig and Opt report, for the original and for the minimum leakage circuits, the leakage power (PL), the dynamic and internal power (Pdyn+int), the total power (Ptot) and the corresponding savings and penalties. Leakage power savings are, on average, around 80%. The average penalty in dynamic and internal power introduced by the sleep transistors and the extra routing is around 10%. This leads to overall power savings, averaged over all the benchmarks, of 19%. Area results are summarized in Table 2, which reports the number of gates of the original circuits (column Gates), the number of inserted sleep transistor cells (column Sleep), the area of the original (column Area Orig) and of the minimum leakage (column Area Opt) circuits and the percentage of area overhead due to the sleep transistor insertion. We observe that, although the area overhead constraint has been set to 5%, the average area increase was only 2.5%.
This is because not all the cells in each row can be power-gated under the zero-overhead constraint on re-activation delay. As a result, some of the sleep transistors have been down-sized, since the currents they need to sustain are lower than what was initially planned, and a further step of layout compaction has allowed us to recover some additional area. For the sake of completeness, we conclude this section by reporting, for one of the benchmarks (i.e., Block6), partial snapshots of the layouts for the original circuit (Figure 5) and for the minimum leakage implementation (Figure 6). Only the upper-left corner is shown; the full layouts are omitted to keep the images readable. On the left-hand side of Figure 6 the inserted sleep transistor cells are clearly visible. Next to the sleep transistor cells, the "column" of empty slots left between the sleep transistors and the standard cells for isolation purposes is also visible.
Table 2. Area results (columns: Benchmark, Gates, Sleep, Area Orig, Area Opt).

CONCLUSION
Leakage power consumption is becoming dominant in deep sub-micron CMOS technologies, and different approaches for limiting it are now appearing in the scientific literature.
In this paper, we have presented a novel methodology for sub-threshold leakage power reduction based on the idea of inserting distributed sleep transistors into standard-cell circuits with the purpose of cutting off the leakage current. We have presented an algorithm for gate clustering that allows selective power-gating of circuit cells and we have validated it on a set of benchmark circuits using an industry-strength design flow. Experimental data show leakage power reductions around 80% (total power savings, accounting for cell dynamic and internal power, are around 19%), with a circuit delay increase of 5% caused by active-mode slow-down due to the insertion of the sleep transistors and an average area overhead around 2.5%.
ACKNOWLEDGMENT
This work is supported, in part, by STMicroelectronics and by Intel Corp. The authors would like to thank Antonio Remollino for his valuable help in sleep transistor cell design.
