Abstract-Power efficiency and variability are currently the main concerns in nanometer-scale CMOS technology. Both issues have been widely studied and described in the literature, and various options for managing them independently are available. Unfortunately, their exacerbation in sub-40 nm processes will require new design solutions for concurrent optimization. This paper moves towards this objective and presents a new, fully-automated design methodology, based on the Monitor and Control paradigm, that improves the timing yield of a system by using traditional power-gating (PG) as a knob for controlling power consumption and performance. In particular, the design and implementation of tunable-size sleep transistors is described, as well as a methodology for inserting them in a row-based layout. To keep under control both the area and the power overhead that come from the insertion of the sleep transistors, this paper also proposes a new strategy for clustering and power-gating only the timing-critical cells. Experimental results show that the proposed approach guarantees 100% timing yield with average leakage-power savings of about 29%.
I. INTRODUCTION

A. Motivation
There is a wide consensus within the semiconductor industry that traditional bulk CMOS scaling will continue for a few years [1]. Although researchers are actively pursuing alternatives to bulk silicon CMOS for beyond the 22 nm node (e.g., III-V and germanium channels, FinFETs and tri-gate devices, fully depleted SOI), and novel devices, materials and architectures are being considered (e.g., "Beyond-CMOS" [1] emerging technologies such as carbon nanotubes, spin-based devices, ferro-magnetic logic, atomic switches, or nano-electro-mechanical-system (NEMS) switches), cost motivations will push industry to stick to bulk planar CMOS as much as possible [2].
The debate about which of these technologies will prevail is still open, and it will be clear only when the economic relevance of each option is assessed. Therefore, we have to expect that in the near future traditional challenges of scaled planar CMOS devices will still be relevant and that they will become more and more difficult to solve.
Among the most critical design challenges for sub-32 nm CMOS devices in the 2009 ITRS Roadmap [1] , two are particularly critical: 1) static power consumption and 2) manufacturability.
Managing static power is somewhat easier, since it does not require any change in the design paradigm. Static (or leakage) power can be regarded simply as a new design objective, and it can be managed by: 1) Devising new design strategies (e.g., power-gating, multi-design) or 2) providing new technological knobs to control leakage (e.g., control via body voltage).
Manufacturability, conversely, is a metric of a different nature; it is affected by scaling mostly in terms of yield reduction for various sources of variation. Variations can be viewed as effects that cause (directly or indirectly) device parameters to deviate significantly from their nominal values. Managing variations is particularly difficult because some variations are random, some are deterministic, and some change over time more or less quickly. This characteristic has one major consequence: Managing variability requires a new design paradigm. In fact, traditional design techniques cannot naturally handle statistical or time-varying device parameters and metrics. Such a design for variability (DFV) approach also provides the benefit of encompassing the optimization of traditional metrics. As a consequence of parameter deviation, these metrics become variable as well. For instance, leakage power or performance become themselves statistical and/or time-varying quantities. Variations can be broadly classified according to their nature (statistical versus deterministic), their spatial reach (local or global), and their temporal rate of change (static or dynamic). This classification includes any type of variation occurring in nano-scale design. Fig. 1 summarizes the typical variations arising in typical nano-scale systems. Whatever the type of variation, it is essential to manage it. There are two design paradigms to mitigate the effects of variations.
The first goes under the broad name of "correct-by-construction." It includes all those techniques in which the variability of a given design is either mitigated or amortized, rather than eliminated: from traditional design for manufacturability (DFM) approaches [3] - [ Rules layout design), to statistical design approaches, in which the circuit parameters are handled as statistical distributions and the circuit is designed to meet a constraint on yield rather than an exact value [6].
The second design paradigm, instead, includes adaptive strategies, which attempt to solve the variability issues by "sensing and correcting" the desired parameters using various knobs that affect them. The most popular type of adaptive design techniques uses the so-called "always-correct" approach, in which dynamic correction is always effective and errors are not contemplated. These schemes are also called monitor & control (M&C) strategies, to emphasize their analogy with closed-loop control systems. The concept is depicted in Fig. 2 .
Since monitoring is based on estimates of the monitored quantities, some design margins are needed for safe operation. An alternative class of adaptive design strategies uses what is normally called the "fail and correct" approach. The idea is to let a system fail and then recover it from the failure, to achieve correct operation. These approaches eliminate the need for design margins, but require proper handling of error conditions. Dedicated error detectors [7] are used to communicate the occurrence of some critical timing exceptions and activate a procedure to restore the correct data and operating environment. A common practice is to use variable-latency units whose throughput can be dynamically adjusted [8].
M&C strategies have proven to be extremely flexible and they are usually preferred to error-tolerant approaches. Their applicability, however, requires solving some critical design issues. First, the use of proper monitoring units, which can probe the system in real time, is mandatory. The choice of the most appropriate monitoring approach depends on the metric to be controlled. When considering timing yield, which is the case for this work, measuring the propagation delay on the critical path is the most direct choice. Notice that the monitor is in charge of predicting the occurrence of a timing error, rather than detecting it. To address this task, classical solutions consist of a critical path replica, which mimics the actual critical path, and a phase detector, which recognizes when the delay is approaching a timing-safe threshold [9]. Depending on the process corner, the operating point, and the actual workload, many different paths may be critical. This requires the use of multiple replicas distributed across the chip. However, due to high area and power implementation costs, their number is usually very limited. Other techniques suggest directly monitoring the functional paths at the flip-flop level by means of in situ error prediction cells. The most common implementation consists of a flip-flop with increased setup time (the crystal ball flip-flop) in parallel to the regular flip-flop [10]. When the timing path becomes too critical, a timing violation occurs first in the crystal ball flip-flop, thus signaling that the circuit is close to a failure.
The second issue is the identification of an effective control strategy. Given the nature of the metrics to be observed (normally, power and performance), most solutions use dynamic voltage scaling (DVS) [11] , or adaptive body bias (ABB) [9] as control knobs. Although both DVS and ABB are effective in adjusting circuit performance, as a side-effect they have a dramatic impact on the power consumption of the circuit: Dynamic power is quadratically related to the supply voltage, while sub-threshold leakage current shows an exponential relationship to the body voltage. This makes their implementation energy inefficient.
Moreover, the simultaneous presence of multiple supply voltage and body bias domains directly impacts the physical design of the system, and it usually requires customized processes that have to be handled manually.
As a result, these issues seriously complicate the implementation of M&C strategies in standard design flows, thus reducing their effective utilization as a fully-automated design strategy.
B. Contribution
In this paper, we focus on static statistical variations arising from process variability, and propose an M&C design methodology which aims at improving the timing yield of a system by using an unconventional knob for power and performance control: the sleep transistor, which is used in traditional implementations of power-gating for leakage power reduction [12]. We show that sleep transistors, if properly designed, can act as a natural supply-voltage regulator for the power-gated circuit. The advantages with respect to knobs based on explicit setting of voltages are many: low cost, easy automation into standard design flows, and a natural combination of leakage power reduction and variation control. The use of a tunable-size sleep transistor as a control knob is not new; previous works have demonstrated these features [13], [14].
Our contribution is more on the design-automation side, and provides a fully-automated methodology which shows that tunable-size sleep transistors can truly work in a realistic design flow as a post-silicon variability control mechanism. Specifically, we propose the following:
• design of a tunable-size sleep transistor using the design rules of a technology library, and its encapsulation as a regular library cell;
• a sleep transistor insertion methodology, in which the tunable-size sleep transistor is inserted at the layout row level. This allows a finer control of the circuit while reducing the design overheads;
• a clustered power-gating strategy in which the tunable-size sleep transistor is used to power-gate only the timing-critical cells, whereas non-critical ones are power-gated using a conventional non-tunable transistor. This minimizes the power and area overhead by concentrating tunability only where needed.

Experimental results on a suite of public-domain benchmarks, mapped onto an industrial 65 nm technology, show that our design methodology guarantees 100% timing yield and an average area saving of 34.5% with respect to a fully power-gated approach [14]. This translates into a 29% saving in total leakage power during the standby mode of the circuits.
II. BACKGROUND AND RELATED WORK
A. Process Variability
In older fabrication processes, die-to-die (DTD) variations were the only sources of variability, while in current processes within-die (WID) variations have also become important in determining the frequency and the power of a circuit. WID variations are mostly due to systematic and random variations of several physical device parameters, such as the concentration of doping atoms in the substrate, the effective channel length, the channel width, and the oxide thickness. This results in a shift of the electrical characteristics of the transistors, like the threshold voltage and the maximum current density, with a significant impact on the power dissipation and the performance of a manufactured circuit: a 20X variation in leakage power for a 1.5X variation in delay between fast and slow dies has been reported in the literature [15].
Device parameters (and, as a consequence, design metrics) have to be treated as random variables, whose probability distribution function (PDF) depends on the actual fabrication process. Fig. 3 pictorially summarizes the relation between variability and yield. Deviation from the ideal case (i.e., where a given parameter has a nominal, deterministic value) is represented by a PDF with a given shape (a bell-shaped one in the example shown in the figure), with average value coincident with the nominal value of the parameter; variability is related to the variance of the PDF (e.g., the range of values within $\pm 3\sigma$). We assume that some design margin is provided, and that values slightly exceeding the nominal value (up to the upper bound marked in the figure) can be considered as acceptable. The area below the PDF and delimited by this upper bound (the striped region in the figure) represents the yield, that is, the probability that the parameter does not exceed the upper bound, which can be immediately obtained from the cumulative distribution function of the parameter. For a fixed upper bound, as the variability (and thus the variance of the distribution) increases, the yield will accordingly become smaller.
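As a minimal illustration of the relation between variance and yield, the following Python sketch evaluates the yield as the cumulative distribution function of a normally distributed parameter at a fixed upper bound. The normal shape follows the bell-shaped example of Fig. 3; the function names and the numerical values (a 100 ps nominal delay and a 110 ps bound) are hypothetical and chosen only for illustration.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def timing_yield(upper_bound, nominal, sigma):
    """Probability that the parameter does not exceed the upper bound."""
    return normal_cdf(upper_bound, nominal, sigma)

# Hypothetical values: nominal delay 100 ps, acceptable upper bound 110 ps.
# The bound is fixed, so a larger sigma gives a smaller yield.
for sigma in (5.0, 10.0, 20.0):
    y = timing_yield(110.0, 100.0, sigma)
    print(f"sigma = {sigma:4.1f} ps -> yield = {y:.3f}")
```

With the bound held fixed, the printed yield decreases monotonically as the standard deviation grows, which is the qualitative behavior described above.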
B. Power-Gating (PG)
Power-gating (PG) can be considered as a coarse-grain generalization of the MTCMOS technique (Fig. 4). It is based on the principle of adding header (or footer) switch devices, also called sleep transistors, in series with the pull-up (or pull-down) network of logic gates [12]. By turning off the sleep transistors, the resulting virtual rail, also called virtual-$V_{DD}$ (virtual-GND), can be disconnected from the actual $V_{DD}$ (GND), thereby decreasing the leakage power component due to sub-threshold current during the idle state of the circuit; conversely, during the active mode, the sleep transistors are fully turned on and they are transparent from a functional perspective, even though they slow down the gates to which they are connected.
While the original power-gating version uses both header and footer switches, the same leakage savings can be achieved using only one of the two options. The choice depends on several design aspects. From the area viewpoint, it is better to use nMOS sleep transistors, which show a better conductance. For this reason, footer power-gating has been historically very popular in the literature [12], [16], [17]. However, considering other factors like leakage, noise on power/ground rails, and ease of implementation, the use of header pMOS has also become popular [18], [19]. In this paper, we opted for a footer implementation; all the concepts also apply to header power-gating.
Although based on a simple and intuitive concept, the actual implementation of power-gating requires special attention to several design issues. In decreasing order of abstraction level, the first important issue is whether the gating is applied to the entire logic block (full power-gating), or only to a subset of the cells (typically, the non-timing-critical ones; partial PG). This raises another challenging issue, namely the clustering of the circuit into critical and non-critical regions.
Secondly, an effective use of power-gating requires a proper sizing of the sleep transistor [20] . While a small sleep transistor may unacceptably slow down the power-gated cells in the active mode due to its high resistance, a larger one implies a larger area and a significant energy cost to drive it. This issue is described in deeper detail in Section III.
Another issue is related to the physical implementation of power-gating at the layout level. An important factor is the granularity at which sleep transistors are inserted. Granularity may range from individual cells [21] , to large chip sub-units, in which very large sleep transistors are placed on the root of the power distribution networks of large chip areas. Regardless of the insertion granularity, an effective power-gating design should maintain compatibility with the standard row-based design style. For this purpose, a sleep transistor is typically designed by connecting a number of small transistors in parallel in a multi-finger style. State-of-the-art approaches [22] are based on the use of custom switch cells, which belong to dedicated power-gating libraries and are connected in parallel in order to obtain a transistor array. The array can be implemented either in a ring-style, where the sleep transistor array surrounds the circuits forming a ring, or in a distributed-style, where the sleep transistors are distributed throughout the power-gated region together with the standard cells [16] or in dedicated rows [22] .
C. Related Work
Section I has already provided an overview and taxonomy of typical DFV solutions. In this section, we focus more on the M&C solutions, and in particular on the implementation factors that limit their integration in automated design flows. The two milestone techniques in the field of M&C are dynamic voltage scaling (DVS) [23] and adaptive body biasing (ABB) [9]. DVS compensates timing variations by adjusting the supply voltage $V_{DD}$ (a higher $V_{DD}$ reduces the delay), while ABB recovers circuit performance by controlling the threshold voltage $V_{th}$ of active devices (forward-body-biasing (FBB) reduces $V_{th}$, which in turn makes the circuit faster).
Both techniques have proven to be effective knobs for increasing the timing yield, and they have been successfully adopted in several industrial designs; however, they show several limitations, which make their optimization difficult to automate. First and foremost, both DVS and ABB drastically affect the power consumption: a higher supply voltage induces a quadratic increase of the dynamic power, while FBB results in higher junction capacitance, higher forward source-body junction current [24], and larger sub-threshold current [25].
Second, employing multiple voltage or multiple biasing domains requires DC/DC converters and voltage regulators, decoupling capacitors, as well as customized design and test strategies, which are technology dependent [26]. For instance, ABB requires a triple-well process and additional on-chip power distribution networks for the body voltage, which occupy additional area and compete for precious routing resources. Third, the scalability of DVS and ABB is questionable. As technology scales down, the gap between $V_{DD}$ and $V_{th}$ reduces and even small variations of $V_{DD}$ and/or $V_{th}$ may have huge effects; thus, controlling these parameters may require high accuracy and high resolution, which conflict with classical discretized voltage regulators. Furthermore, it has been demonstrated that working with a higher $V_{DD}$ induces faster device degradation due to aging effects like electromigration, hot carrier injection (HCI) and negative bias temperature instability (NBTI) [27].
III. SLEEP TRANSISTORS: A LOW-POWER CONTROL KNOB
Process variations affect the electrical parameters of the active devices, such as the channel length and the threshold voltage. This results in a variation of the actual current capability of the active transistors, whose speed may change significantly. From the circuit standpoint, the effects of variability can be easily sampled and quantified by measuring the variation of the active current $I_{on}$ drained from the supply rail: the slower the circuit, the smaller $I_{on}$, and vice versa. When the sleep transistor is on (active state of the system), due to its channel resistance $R_{ST}$, it can be considered as a sort of current-to-voltage transducer, which transforms the current $I_{on}$ into a voltage drop $V_{VGND} = R_{ST} \cdot I_{on}$ across the sleep transistor (Fig. 5). $R_{ST}$ is inversely proportional to the width of the sleep transistor. The actual ground terminal of the logic block (the virtual ground) is now at a potential $V_{VGND}$, which is equivalent to working with a reduced supply voltage: the circuit virtually operates with an equivalent supply voltage of $V_{DD} - V_{VGND}$; this reduced drive directly affects the overall speed of the gated block.
It is worth emphasizing that, unlike methods that directly control the supply voltage, like DVS, the equivalent voltage reduction appears only when the circuit is switching, namely, when the gated cells inject the drawn current through the channel resistance of the sleep transistor. Once all the switching activity is completed, the current falls to zero and so does the virtual-ground voltage. In this sense the supply voltage is automatically adapted to the load, and there is no need for a voltage regulator.
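The current-to-voltage behavior of the footer can be sketched with a first-order model. The Python snippet below is only an illustration: it assumes that the channel resistance scales as a unit resistance divided by the transistor width, and it ignores the feedback of the reduced supply on the drawn current. The function names and all numerical values (supply voltage, unit resistance, current) are hypothetical.

```python
def virtual_ground_voltage(i_active, r_unit, width):
    """Voltage drop across the footer: V = I * R, with R = r_unit / width.

    Assumption: the channel resistance is inversely proportional to the
    width, expressed here in multiples of a unit-size transistor.
    """
    return i_active * (r_unit / width)

def equivalent_supply(v_dd, i_active, r_unit, width):
    """Supply voltage effectively seen by the gated cells."""
    return v_dd - virtual_ground_voltage(i_active, r_unit, width)

# Hypothetical values: 1.0 V supply, 100 ohm unit resistance, 1 mA active current.
for width in (1, 2, 4, 8):
    print(width, equivalent_supply(1.0, 1e-3, 100.0, width))

# With no switching activity the drop vanishes and the full supply is restored.
print(equivalent_supply(1.0, 0.0, 100.0, 1))
```

A wider transistor gives a smaller drop and therefore a larger equivalent supply, while zero current restores the full supply without any regulator.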
By modulating the width of the sleep transistor it is therefore possible to change the operating point of the entire circuit and compensate delay variations with a very simple mechanism and with an arbitrarily fine granularity. Specifically, a smaller (larger) channel resistance translates into a smaller (larger) virtual-ground voltage, and thus into a faster (slower) circuit; this is denoted in Fig. 5 by expressing the virtual ground voltage as a function of the sleep transistor width. A quantitative evaluation of the above dependency is given in Table I, which shows the slowest circuit instance (152.7 ps) when a transistor of nominal size W is used for gating, i.e., the same conditions under which the nominal delay is measured. This instance is approximately 17% slower than the nominal one. If we then upsize the sleep transistor for such a circuit instance, we see that, as expected, we can progressively speed up the instance. The allowed delay margin therefore determines the maximum range of tunability.
For instance, if we tolerate a 5% delay increase over the nominal delay, then we need a sleep transistor as large as 16 W to achieve such a speedup of the worst case: this will cause the slowest instance to have a delay of 134.9 ps. If a 10% margin is allowed, then a 4 W transistor size will be enough. Similar considerations apply if fast corners must also be considered, for example, to satisfy power constraints. The fast corners result in fact from device parameters that are good for performance but not for dynamic and/or static power (e.g., larger devices or a smaller threshold voltage). In such cases, we can slow down fast instances by downsizing the sleep transistor until the delay of the best-case instance is within a given margin. In our methodology both sleep transistor upsizing and downsizing are feasible, although we mostly focus on delay yield, and therefore the rest of the description refers to transistor upsizing.
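The selection of the upsizing factor for a given margin can be sketched as a simple table lookup. In the Python sketch below, only the 1 W (152.7 ps) and 16 W (134.9 ps) entries come from the text; the 2 W, 4 W, and 8 W delays are hypothetical placeholders chosen to be consistent with the 5% and 10% cases discussed above, and the nominal delay is an assumption derived from the "approximately 17% slower" statement.

```python
# Assumption: nominal delay derived from "approximately 17% slower than nominal".
NOMINAL_DELAY_PS = 152.7 / 1.17

# Delay of the slowest instance versus sleep transistor size (multiples of W).
# Only the 1 W and 16 W entries come from the text; the others are hypothetical.
DELAY_BY_SIZE_PS = {1: 152.7, 2: 147.0, 4: 142.0, 8: 138.5, 16: 134.9}

def smallest_size_meeting_margin(margin):
    """Return the smallest size whose delay is within (1 + margin) of nominal.

    Returns None when no size in the table satisfies the margin.
    """
    limit_ps = NOMINAL_DELAY_PS * (1.0 + margin)
    for size in sorted(DELAY_BY_SIZE_PS):
        if DELAY_BY_SIZE_PS[size] <= limit_ps:
            return size
    return None

print(smallest_size_meeting_margin(0.05))  # 16 with the table above
print(smallest_size_meeting_margin(0.10))  # 4 with the table above
```

Scanning the sizes in increasing order returns the cheapest transistor that still recovers the required delay, which is the criterion that keeps the area and leakage overhead of the tunable device as low as possible.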
IV. CLUSTERED POWER GATING FOR CONTROL OF VARIABILITY
A. Power-Gating Scenarios
The previous section has shown a tunable-width power-gating implementation for managing variability. However, given the required ranges of the width, for some adverse samples this device might result in a significant area overhead, which in turn causes extra static and dynamic power consumption. The sleep transistor itself will in fact leak when turned off, and its leakage is proportional to its size; a very large sleep transistor may nullify the leakage power savings obtained by gating the circuit. In fact, if we consider the example described in the previous section, in order to bring the circuit back within the desired delay margin of 5%, a total sleep transistor width that is 16 times larger (16 W) than the nominal case is required.
Such a large area and power overhead is unavoidable if we use a single sleep transistor to power-gate the entire circuit. This overhead can be, however, mitigated by recognizing that tunability is required only for the cells that determine the critical path(s). All other cells can be power gated with a regular, non-tunable, small-sized sleep transistor. In fact, even if these cells are powered with a lower equivalent supply voltage, their increase in propagation delay will not reflect into an increase of the actual delay of the circuit. Similarly, if process variations affect some of these cells, their delay increase will not slow down the critical paths.
Paths (and the related cells) with a slack smaller than a given threshold should be considered as potentially critical. Independently of how these two sets (critical and non-critical) are selected, the idea is to identify two clusters that use separate sleep transistors. We call this approach clustered tunable PG (clustered TPG).
A pictorial description of the three power-gating approaches is given in Fig. 6 .
The figure on the left represents traditional PG, in which a single transistor of fixed size is used to gate the entire block. Regular TPG, shown in the middle of the figure, uses a single tunable transistor for the whole block: the transistor will be set to the proper width according to the detected delay value. As discussed in Section III, in order to compensate the worst-case variation, the tunable sleep transistor must be sized for the largest width of the tuning range, even if a smaller width will be chosen. In clustered TPG, the potentially critical cells (as defined above) identify a cluster $C$ that is gated with a tunable transistor (see Fig. 6 on the right). The rest of the cells in the circuit form a cluster NC gated by a regular transistor of fixed size. It is important to emphasize that, even when a logic path crosses both the NC and $C$ clusters, with cells in the NC cluster driving others in the $C$ cluster, no voltage-level shifter is necessary, since the actual supply voltage of the $C$ cluster is equivalent to $V_{DD} - V_{VGND}$, where $V_{VGND}$ is the virtual-ground voltage, as explained in Section III, and is automatically adapted to the load.
Concerning total power, the net effect is that clustered TPG is much more energy-efficient than full TPG as a consequence of the reduced equivalent total sleep transistor width, while keeping the same delay range capability. Clustered TPG lies between regular power-gating (minimum power but no variability control) and full TPG (required delay range with largest power). This is summarized in Fig. 7, where the three solutions are compared in the delay/power space, and the total sleep transistor width is used as a proxy of the total energy.
B. Methodology
Having two different power-gating domains, i.e., two different virtual-ground rails in the same circuit, requires a dedicated methodology that is aware of the design constraints imposed by the sleep transistor insertion strategy. Since having a dedicated sleep transistor for each cell (fine-grain power-gating) would require customized standard cell libraries (unavailable in most of the commercial design kits), we adopted a coarse-grain insertion methodology, called row-based power-gating [28]. In this strategy, the gating granularity is a row of the standard-cell layout; given our two-cluster methodology, it has to be decided which rows (rather than single cells) should be assigned to which cluster ($C$ or NC).
To support this approach, we implemented the framework presented in Fig. 8 . The starting point is a design in which the placement has already been executed according to a row-based style.
In the first step, we run statistical static timing analysis (SSTA) in order to identify the timing-critical paths, namely, those paths that may induce overall performance degradation in the presence of process variations. The collected timing reports feed the second step and are matched with the physical information of the layout in order to identify the critical rows (and thus define the two clusters). Once the $C$ and NC clusters have been formed, the technique proposed in [29] is adopted in the third step to determine the maximum active current that each cluster injects into its own virtual-ground rail, $I_{on,C}$ and $I_{on,NC}$. The obtained values are used in the fourth stage as upper bounds to define the nominal width of the sleep transistors. More specifically, we define the nominal width of a sleep transistor as the one that guarantees, in the absence of process variations, a user-defined maximum virtual-ground voltage when the maximum active current is flowing. This identifies the nominal operating point of the circuit. As a final stage, both sleep transistors are inserted into the netlist of the placed design: a fixed-size transistor for the non-critical cluster and a variable-width sleep transistor structure for the critical cluster.
1) Statistical Static Timing Analysis:
Statistical static timing analysis is carried out to characterize performance in the presence of process variations. We adopted a Monte Carlo (MC) based statistical sampling approach, where device parameters are stochastic variables described by a probability density function (PDF) that depends on the fabrication process. During each MC simulation sample, the propagation delay of the circuit is computed using traditional static timing analysis tools. As its main output, the Monte Carlo analysis provides data about the path delay distribution. Although the fabrication process may affect several device parameters, we opted for a simplified model in which the variation of the effective gate length $L_{eff}$ is considered as a comprehensive source of variation for the delay of the gates. We assumed a normally distributed $L_{eff}$, as follows:

$$L_{eff} = L_{nom} + \Delta L \qquad (1)$$

where $L_{nom}$ is the nominal value of the channel length, and $\Delta L$ is a zero-mean normal distribution with variance $\sigma^2$. The granularity of our statistical analysis is the gate level, namely, we consider an independent random variable $\Delta L$ for each gate in the netlist. This means that all the transistors belonging to a gate have the same $L_{eff}$, and that the mismatch effects between adjacent devices are ignored.
This model allows taking into account spatial correlations among the gates of the same die. To this purpose, we used the grid model of [30] as a simple yet effective way to solve the problem.
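The gate-level sampling described above can be sketched in a few lines of Python. This is a simplified illustration, not the flow used in this work: it draws one independent channel-length sample per gate, omits the spatial-correlation grid model, and replaces the static timing analysis tool with a hypothetical first-order model in which gate delay scales linearly with the effective channel length. All names and numerical values are assumptions.

```python
import random

def sample_path_delay(gate_delays_ps, l_nom_nm, sigma_nm, rng):
    """One Monte Carlo sample of a path delay.

    Each gate gets its own channel-length sample (nominal plus a zero-mean
    normal deviation). Hypothetical model: delay scales linearly with length.
    """
    total = 0.0
    for d_nom in gate_delays_ps:
        l_eff = l_nom_nm + rng.gauss(0.0, sigma_nm)
        total += d_nom * (l_eff / l_nom_nm)
    return total

def critical_probability(gate_delays_ps, clock_ps, slack_threshold_ps,
                         l_nom_nm=65.0, sigma_nm=2.0, samples=10000, seed=0):
    """Estimated probability that the path slack falls below the threshold."""
    rng = random.Random(seed)  # fixed seed for reproducible estimates
    hits = 0
    for _ in range(samples):
        delay = sample_path_delay(gate_delays_ps, l_nom_nm, sigma_nm, rng)
        if clock_ps - delay < slack_threshold_ps:
            hits += 1
    return hits / samples

# Hypothetical five-gate path with 20 ps nominal delay per gate.
path = [20.0] * 5
print(critical_probability(path, clock_ps=100.0, slack_threshold_ps=0.0))
print(critical_probability(path, clock_ps=200.0, slack_threshold_ps=0.0))
```

A path whose estimated probability is nonzero for the chosen slack threshold would be treated as potentially critical; a path with ample nominal slack yields an estimate of zero.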
2) Identifying Critical Rows: We define a statistically critical path as a sensitizable path with a nonzero probability of having a slack smaller than a given threshold $S_{th}$. Assume that there are $N$ such paths for a given value of $S_{th}$, and let these paths be $P_1, \ldots, P_N$. Given the list of the layout rows $R_1, \ldots, R_M$, each row $R_i$ hosting gates $g_{i,j}$, with $j = 1, \ldots, n_i$, we mark as critical those rows that include at least one gate belonging to a statistically critical path $P_k$. All the critical rows are then grouped into the critical cluster $C$:

$$C = \{ R_i \mid \exists\, j, k : g_{i,j} \in P_k \} \qquad (2)$$

The non-critical cluster NC will be made up of the remaining rows, i.e., $NC = \{R_1, \ldots, R_M\} \setminus C$.
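The row-marking rule can be expressed directly with set operations. The Python sketch below assumes that rows are given as a mapping from row name to the set of gate names placed in that row, and that each statistically critical path is given as a set of gate names; these data structures and all identifiers are hypothetical.

```python
def cluster_rows(rows, critical_paths):
    """Split layout rows into a critical and a non-critical cluster.

    rows: dict mapping a row name to the set of gate names it hosts.
    critical_paths: list of sets of gate names, one set per critical path.
    A row is critical if it hosts at least one gate on a critical path.
    """
    critical_gates = set().union(*critical_paths)
    critical = {name for name, gates in rows.items() if gates & critical_gates}
    non_critical = set(rows) - critical
    return critical, non_critical

# Hypothetical placement: three rows and two statistically critical paths.
rows = {
    "row0": {"g1", "g2"},
    "row1": {"g3"},
    "row2": {"g4", "g5"},
}
paths = [{"g1", "g4"}, {"g4", "g5"}]

critical, non_critical = cluster_rows(rows, paths)
print(sorted(critical))      # ['row0', 'row2']
print(sorted(non_critical))  # ['row1']
```

Note that a single critical gate is enough to pull its whole row into the critical cluster, which is the price of using the row as the gating granularity.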
3) Sleep Transistor Sizing:
A first preliminary step for the estimation of the sleep transistor width of a cluster is the calculation of its discharge current $I_{on}$. The estimation of the active current is not a trivial task. In fact, an erroneous estimation of $I_{on}$ translates to a sub-optimal sleep transistor sizing, which may result in area and power increase or, even worse, in timing violations during the active mode [28].
Once we have the on-currents of the $C$ and NC clusters ($I_{on,C}$ and $I_{on,NC}$, respectively), it is possible to correctly define the nominal sleep transistor channel resistance $R_{ST}$. As reported in (3), designers usually define a maximum voltage drop across the sleep transistor that is used as a constraint

$$R_{ST} = \frac{V_{max}}{I_{on}} \qquad (3)$$

where $V_{max}$ is the maximum voltage drop allowed during the active mode across the sleep transistor (that is, the virtual ground voltage). Considering that each sleep transistor operates in the resistive region, its nominal width $W$ can be evaluated using (4):

$$W = \frac{L}{R_{ST}\, \mu\, C_{ox}\, (V_{DD} - V_{th})} \qquad (4)$$

where $L$ is the length of the transistor, $\mu$ is the carrier mobility, $C_{ox}$ is the oxide capacitance and $V_{th}$ is the threshold voltage. In practice, a lookup table is maintained with the channel width for the corresponding input, due to the nonlinear relation between the on-resistance and the drain-to-source voltage. Note that both the threshold voltage and the gate length of the sleep transistors are defined at the library level, according to the process optimization phase. This allows the same $V_{th}$ and $L$ to be used for all the transistors in the chip.
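The sizing step can be sketched numerically with the standard first-order expression for a MOS transistor in the resistive region. The Python snippet below is an illustrative assumption rather than the sizing procedure of this work (which relies on a lookup table): the channel resistance is taken as the allowed voltage drop divided by the on-current, and the width follows from the linear-region resistance formula. All device values are hypothetical, in SI units.

```python
import math

def nominal_resistance(v_drop_max, i_on):
    """Channel resistance that gives the allowed drop at the given current."""
    return v_drop_max / i_on

def nominal_width(r_st, length, mobility, c_ox, v_dd, v_th):
    """Width from the first-order resistive-region model.

    Assumption: R = L / (mu * C_ox * W * (V_DD - V_th)), solved for W.
    """
    return length / (r_st * mobility * c_ox * (v_dd - v_th))

# Hypothetical values: 50 mV allowed drop, 10 mA on-current, 65 nm length,
# mobility 0.03 m^2/(V s), oxide capacitance 0.015 F/m^2, 1.0 V supply, 0.3 V threshold.
r_st = nominal_resistance(0.05, 10e-3)
width_m = nominal_width(r_st, 65e-9, 0.03, 0.015, 1.0, 0.3)
print(f"R_ST = {r_st:.2f} ohm, W = {width_m * 1e6:.1f} um")
```

In this model the width grows linearly with the on-current and inversely with the allowed drop, which is why an overestimated current directly inflates area and leakage, and an underestimated one risks timing violations.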
4) Design of the Tunable Sleep Transistor Cell:
Borrowing the concept of a variable-width sleep transistor architecture [13] , we designed a self-contained tunable sleep transistor cell that contains 4 different-sized parallel sleep transistors driven by dedicated control NAND gates (Fig. 9) . The fanout load of each NAND is distributed through an optimal sized buffer, i.e., an inverter chain. The sleep signal, which is in charge of defining the operating mode of the gated circuit, is provided by an external power-management unit and it is fed to the sleep transistor through the NAND gates.
Each NAND gate receives a configuration bit ($b_i$, in Fig. 9), whose value can enable (logic 1) or disable (logic 0) the corresponding transistor. Hence, when the sleep signal is 0 (i.e., during the active power mode), only those transistors that are enabled contribute to the total sleep transistor width

$$W_{cell} = \sum_{i=0}^{3} b_i \cdot w_i \qquad (5)$$

where the term $w_i$, which gives a different weight to each bit, is defined by the actual size of the transistors.
The obtained cell, which is fully compliant with the design rules imposed by the silicon vendor, represents a modular solution for power-gating architectures, where a sleep transistor is built up using several sleep transistor cells connected in parallel. 
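The relation between the configuration word and the enabled width can be sketched as follows. The binary-weighted transistor widths below are purely an assumption for illustration; the actual sizes are those of the cell in Fig. 9, and the function name is hypothetical.

```python
# Hypothetical per-transistor widths in um, least significant bit first.
# Binary weighting is an assumption, not the sizing used in the paper.
CELL_WIDTHS_UM = (1.0, 2.0, 4.0, 8.0)


def enabled_width(config_word, widths=CELL_WIDTHS_UM):
    """Sum the widths of the transistors enabled by a word such as '1000'.

    The word is written most significant bit first, so its last
    character controls widths[0].
    """
    if len(config_word) != len(widths) or set(config_word) - {"0", "1"}:
        raise ValueError(f"invalid configuration word: {config_word!r}")
    bits = [int(ch) for ch in reversed(config_word)]
    return sum(bit * width for bit, width in zip(bits, widths))


for word in ("1000", "1010", "1111"):
    print(word, enabled_width(word))
```

With four control bits the cell exposes 16 configurations, and the all-zero word disables every transistor in the cell.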
5) Sleep Transistor Insertion:
As described in the previous paragraph, the sleep transistor of the critical cluster is implemented by means of modular tunable sleep transistor cells connected in parallel. In order to guarantee the maximum dynamic range, all the programmable sleep transistor cells are centered in their middle configuration, namely, the configuration word is set to 1000 in the nominal case. Recalling (5), the resulting number of cells to be inserted is given by

$$N_{cells} = \left\lceil \frac{W_{ST}}{W_{cell}(1000)} \right\rceil \tag{6}$$

where $W_{ST}$ is the nominal width obtained from (4) and $W_{cell}(1000)$ is the width of a single cell in its nominal configuration. The same scheme is used for the non-critical cluster, but using non-tunable sleep transistor cells of fixed size [22].
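The cell count can be illustrated with a short sketch. It assumes that the count is obtained by rounding up the ratio between the required nominal width and the width of one cell in its nominal configuration; the rounding rule, the function name, and the numbers are assumptions made for the example.

```python
import math


def cells_needed(required_width_um, nominal_cell_width_um):
    """Number of parallel tunable cells that cover the required width.

    Assumption: every cell sits in its nominal (middle) configuration
    and the count is rounded up so the width constraint is never missed.
    """
    if required_width_um <= 0 or nominal_cell_width_um <= 0:
        raise ValueError("widths must be positive")
    return math.ceil(required_width_um / nominal_cell_width_um)


# Hypothetical numbers: 44.4 um required, 8 um per cell at word 1000.
n_cells = cells_needed(44.4, 8.0)
print(n_cells)
```

Centering the cells on the middle word leaves room both to add width (to recover timing) and to remove it (to save leakage) after fabrication.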
Following the strategy proposed in [29], the sleep transistor cells are then placed in dedicated layout rows, called sleep rows, that can be added at the top/bottom of the design, thus enabling design automation support for semi-custom ASICs. This allows minimal layout disruption due to the sleep transistor insertion. Regarding the design of the virtual-ground rails, it is important to underline that having two different clusters implies the routing of two independent rails. This may drastically complicate the final design phases of the back-end flow. However, it has been proven in [29] that the row-based methodology is very efficient in the case of clustering based on multiple virtual-ground rails.
The power grids at the upper metal layers of the layout are modified so that the virtual ground lines are interlaced between the ground lines, thereby making all the virtual ground lines accessible to all the rows. Vertical vias are used to connect the rows to the proper virtual-ground grid. It is important to notice that, in the case of adjacent rows that belong to different clusters, i.e., to different power-gating domains, we need to split the common ground in order to allow them to have different ground paths. Fig. 10 shows an overview of the proposed layout strategy.
V. EXPERIMENTAL RESULTS
A. Characterization of the Tunable Sleep Transistor Cell
In order to validate the proposed methodology, we designed and implemented a tunable sleep transistor cell using a 65 nm technology by STMicroelectronics. The cell is made up of four sleep transistors, whose characteristics are reported in Table II. The NAND gates are minimum sized and they show symmetrical low-to-high and high-to-low propagation delays. The inverter chains are used as buffers to drive the high-fanout gate terminals of the sleep transistors in order to guarantee a small turn-on/off time. Fig. 11 shows the normalized resistance for the 16 different configurations. The data have been extracted from the current-voltage characteristics of the cell. To build the I-V curve, we used an equivalent-circuit load connected to the virtual-ground terminal of the cell [31]. The circuit is made up of a set of inverters that switch simultaneously, thus emulating a worst-case peak current load. Each inverter is configured with the minimum size and is connected to an equivalent fanout capacitance of 1. By varying the number of inverters, we can generate different realistic injected current loads through the sleep transistor. The plot clearly shows that the relation between resistance and the equivalent transistor width is not linear.
B. Experimental Setup
Synthesis Flow: Fig. 12 presents the flow which was used to run all the experiments with the circuit benchmarks. The flow is basically divided into two parts, composed of the following steps.
• Synthesis, Placement, Gate-level Static Timing Analysis (STA) and Critical-rows identification: These steps are run once for each design.
• Monte Carlo (MC) SPICE simulations and Transistor-level STA: This phase can be run several times for each design, once for each sleep transistor size.
In the first part of the flow, the design is synthesized and placed using a standard design flow, and afterwards, a static timing analysis is performed. Synopsys Design Compiler, IC Compiler and PrimeTime tools are used to execute these steps, respectively. After timing analysis, the design is submitted to a newly created PrimeTime command, called identify_critical_rows, whose function is to identify the critical rows in the design, based on the timing analysis results and a user-defined criterion of criticality (i.e., a percentage of slack). For example, if the criticality percentage is defined as 10%, critical rows are those that have at least one cell with timing slack smaller than 10% of the clock period. This command generates a report that contains information about the critical rows and their respective cells. The algorithm used to identify whether a given row is critical or not takes into consideration the timing analysis performed on the cells of each row. If any of the cells in a row is part of a critical path, then this row is marked as critical. In this case, critical paths are the ones that have a timing slack lower than a given percentage of the clock period used to synthesize the design, as described in (2). Fig. 13 presents a visual output of the identify_critical_rows command for the b03 ITC'99 benchmark, where the dark rows correspond to those belonging to the critical cluster.
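The row classification described above can be sketched in a few lines. This is only an illustrative Python model of the selection rule, not the PrimeTime command itself; the input format (one tuple per cell with its row and worst slack) and the function name are assumptions.

```python
from collections import defaultdict


def classify_rows(cell_slacks, clock_period, criticality=0.10):
    """Split layout rows into critical and non-critical sets.

    cell_slacks: iterable of (row_id, cell_name, worst_slack) tuples,
    a hypothetical stand-in for the gate-level STA report.
    A row is critical if at least one of its cells has a slack smaller
    than criticality * clock_period.
    """
    threshold = criticality * clock_period
    critical = defaultdict(list)
    all_rows = set()
    for row_id, cell_name, slack in cell_slacks:
        all_rows.add(row_id)
        if slack < threshold:
            critical[row_id].append(cell_name)
    non_critical = sorted(all_rows - set(critical))
    return dict(critical), non_critical


# Hypothetical STA data for a 2.0 ns clock (threshold = 0.2 ns).
report = [
    (0, "U1", 0.50), (0, "U2", 0.10),
    (1, "U3", 0.90),
    (2, "U4", 0.05),
]
critical_rows, non_critical_rows = classify_rows(report, clock_period=2.0)
print(critical_rows, non_critical_rows)
```

Because the decision is taken per row, a single critical cell is enough to move its whole row into the critical cluster, which keeps the virtual-ground routing row-based.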
The second part of the flow regards simulation of the design and the analysis of the variability effects. For these steps, Synopsys HSIM and NanoTime tools are used, respectively. The simulation is run as a set of 100 SPICE-level MC simulations used to emulate the effects of process-variations in the design. These simulations provide information about power consumption and variation on the virtual ground power lines, both caused by process variations.
Once the MC simulations are executed, a SPICE-level static timing analysis is run, using the variation results obtained during the simulation step. Timing results are recorded and later analyzed to verify if the process-variation generates timing violations.
Variability Assessment: The process-variation (PV) model used in the experiments is based on a variation in the effective length of the transistors in the various cells of the library. The variation in the transistors of a given cell leads to timing variations in the cell, and therefore, to timing variations in the whole path that contains the cell.
To run the experiments and collect information about the effects of variability in the designs, Monte Carlo simulations were executed at transistor level using a Gaussian distribution to set up the channel length values of the transistors in each Monte Carlo run. We used the Gauss function defined in HSIM, agauss(0.06, 0.014, 2), which defines a Gaussian distribution with parameters $\mu = 0.06$ (the nominal channel length) and $2\sigma = 0.014$ (the absolute variation at $2\sigma$). This distribution leads to a variation of about 10% in the timing slack of the paths, relative to the slack measured using the mean value of the distribution.
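For readers unfamiliar with the agauss(nominal, abs_variation, sigma) convention, the sketch below reproduces the sampling with the Python standard library. It assumes that the absolute variation is specified at the given sigma level, so that the standard deviation is abs_variation / sigma; the function name and the seed are arbitrary choices for the example.

```python
import random
import statistics


def sample_channel_lengths(n_runs, nominal=0.06, abs_variation=0.014,
                           sigma_level=2, seed=0):
    """Draw one channel length per Monte Carlo run.

    Assumption: as in agauss(nominal, abs_variation, sigma), the
    absolute variation is reached at `sigma_level` standard deviations.
    """
    std_dev = abs_variation / sigma_level
    rng = random.Random(seed)  # fixed seed for repeatable runs
    return [rng.gauss(nominal, std_dev) for _ in range(n_runs)]


lengths = sample_channel_lengths(100)
print(f"mean = {statistics.mean(lengths):.4f}, "
      f"std = {statistics.stdev(lengths):.4f}")
```

With these parameters, about 95% of the sampled lengths fall within 0.014 of the nominal value.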
C. Simulation Results
Table III summarizes the results obtained from the simulation on a subset of the ITC'99 benchmarks [32] .
For each benchmark, the first two columns report the total number of layout rows (column Rows Count), and the percentage of critical rows, i.e., the percentage of rows that belong to the critical cluster (column Critical Rows). The column Timing Yield Full-power-gating contains the timing yield when the benchmarks are power-gated using a classical approach, namely, a single non-tunable sleep transistor for the whole circuit (full-power-gating). The timing yield is calculated as the percentage of samples that fall below a timing upper bound. The latter is defined as $D_{max} = (1 + \alpha)\,D_{nom}$, where $D_{nom}$ is the nominal delay used as constraint for the synthesis and $\alpha$ is a user-defined parameter (0.1 for this set of experiments, i.e., a 10% design margin). The following set of four columns reports the sleep transistor figures in the case of TPG architectures, for both regular and clustered TPG. We show in particular the total equivalent width required to achieve a 100% timing yield. For both schemes, the absolute width in µm and the relative bit configuration of the tunable sleep transistor (Fig. 9) are reported. It is worth emphasizing that although we focus here on timing yield, as mentioned in Section III, our approach also supports transistor down-sizing. For the sake of generality, we use as nominal width the central bit configuration, i.e., 1000. Finally, the last column reports the total leakage saving of the clustered TPG versus the regular one. A general observation that can be drawn from the analysis of the data is that, as anticipated, clustered TPG achieves the same yield improvement with a much smaller equivalent transistor width, that is, area (% on average), which translates into a comparable leakage saving (% on average). Notice also that we do not force or change the cell placement; therefore, the clustered version of TPG has no hidden area overhead.
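The timing-yield figure used in Table III can be computed as in the following sketch. It assumes that the upper bound is $(1 + \alpha)$ times the nominal delay, consistent with the 10% design margin, and that a sample lying exactly on the bound counts as passing; the function name and the delay values are hypothetical.

```python
def timing_yield(delays, nominal_delay, alpha=0.1):
    """Percentage of Monte Carlo samples that meet the timing bound.

    Assumption: the bound is (1 + alpha) * nominal_delay and a sample
    exactly on the bound is counted as passing.
    """
    if not delays:
        raise ValueError("at least one sample is required")
    bound = (1.0 + alpha) * nominal_delay
    passing = sum(1 for delay in delays if delay <= bound)
    return 100.0 * passing / len(delays)


# Hypothetical critical-path delays (ns) for a 1.0 ns nominal delay.
samples = [1.00, 1.05, 1.20, 1.09]
print(timing_yield(samples, nominal_delay=1.0))
```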
A non-obvious observation that might be misleading regards the tuning bit configurations. Although in some cases the regular TPG has a smaller bit configuration than the clustered one (e.g., 1010 versus 1111 for one of the benchmarks), the scale of width values is not the same. In other terms, the reference (nominal) size of the tunable sleep transistor in the regular TPG is larger (because it applies to the whole circuit) than for the clustered one (applied to critical cells only). Therefore, a larger width ratio in the latter case actually corresponds to a smaller width (176.8 versus 302.0 µm).
The table also suggests that the smaller the percentage of critical rows, the more efficient the clustered approach can be. Circuit b18, for instance, shows the smallest percentage of critical rows (42%), thus providing the best chance to apply the clustering approach: total sleep transistor cell area falls from 1793.6 µm to 1007.3 µm.
Fig. 14 shows, in graphical form, the yield improvement for increasing bit configurations, for the b03 benchmark. Each configuration shows the histogram of the 100 Monte Carlo samples: black (grey) bars denote values below (beyond) the threshold (horizontal dotted line). The numbers at the bottom of the bars denote the number of samples in each bin. The leftmost plot represents the nominal case (1000). In this case, 17% of the samples are outside the margin. As the number of enabled transistors inside each tunable cell increases, the number of paths that violate the timing constraint progressively decreases, and becomes 0 (100% yield) for the configuration 1101.
As a final analysis of the benefits achievable by clustered TPG, Fig. 15 shows the (normalized) yield-leakage tradeoff of regular and clustered TPG. The curves refer to the average over all benchmarks of Table III. We notice that the leakage benefit is roughly independent of the yield level: the ratio between the two curves is approximately 1.3 for all values of the leakage.
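The tuning step illustrated for b03 amounts to finding the smallest configuration word whose Monte Carlo samples all meet the timing bound. The sketch below shows one way to express that search; it assumes that a larger binary word enables a larger width, and both the function name and the delay data are hypothetical.

```python
def smallest_passing_config(delays_by_config, bound, target_yield=100.0):
    """Return the smallest configuration word that reaches the target yield.

    delays_by_config maps words such as '1000' to lists of sampled delays.
    Assumption: words are ordered by their binary value, i.e., a larger
    word means a wider (and leakier) sleep transistor.
    """
    for word in sorted(delays_by_config, key=lambda w: int(w, 2)):
        delays = delays_by_config[word]
        passing = sum(1 for delay in delays if delay <= bound)
        if 100.0 * passing / len(delays) >= target_yield:
            return word
    return None  # no configuration meets the target


# Hypothetical delays (ns) against a 1.1 ns bound.
mc_results = {
    "1000": [1.00, 1.20],
    "1100": [1.00, 1.12],
    "1101": [1.00, 1.05],
    "1111": [0.90, 1.00],
}
print(smallest_passing_config(mc_results, bound=1.1))
```

Stopping at the first passing word keeps the enabled width, and hence the standby leakage, as low as the yield target allows.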
VI. CONCLUSION
Static power consumption and manufacturability will be the most critical design challenges that will be addressed in the near future of planar CMOS devices.
Controlling static power consumption can be considered as a new design target and can be done by identifying new design strategies, such as power-gating and multi-design, or by using new technological knobs, such as control via body voltage.
Manufacturability, on the other hand, is sensitive to different sources of variation that may affect it in terms of reduced yield. Such variations can be of different nature (random, deterministic, and time-varying) and can be classified in different broad classes (see Section I for reference).
In this paper, we focused on static, statistical variations coming from process variability. We proposed a Monitor and Control design methodology that improves the timing yield of a system by using sleep transistors (used in the traditional power-gating paradigm) as power and performance control knob. In particular, we proposed: 1) a new design of a tunable-size sleep transistor and its implementation as a new cell in an industrial technology library; 2) a methodology for inserting, in a row-based fashion, the tunable-size sleep transistors; and 3) a clustered power-gating strategy for power-gating, by means of tunable-size sleep transistors, the timing critical cells only, for minimizing area and power overhead.
Experimental results have shown that the proposed design methodology guarantees 100% timing yield with leakage power savings of about 29% during the standby periods. This translates to higher fabrication yield and higher energy efficiency of the fabricated chips, two metrics that are major concerns for battery-powered devices, whose production covers most of the portable electronics market.
ACKNOWLEDGMENT
The author would like to thank A. Tosson for the technical contribution in designing the tunable power-gating cells.
