Abstract: Power reduction in CMOS platforms is essential for any application technology. This need is a direct result of both lateral scaling (smaller features at higher density) and vertical scaling (shallower junctions and thinner layers). To achieve this power reduction, solutions based on process-device and process-integration improvements, on careful layout modification and on circuit design are in use. However, the drawbacks of these solutions, in terms of greater manufacturing complexity (and higher cost) and speed degradation, call for "optimized" solutions. This paper reviews the issues associated with transistor scaling and related solutions for leakage and power reduction in terms of topological design rules and layout optimization for digital and analog transistors. For standard cells and SRAM cells, leakage-aware layout optimization techniques that consider transistor configuration, stressors, line-edge roughness and more are presented. Finally, different techniques for leakage and power reduction at the circuit level are discussed.
Introduction
Transistor scaling, which has driven CMOS technology for the last 45 years, has steadily increased transistor density. However, the power consumption and the leakage current of scaled-down transistors increase rapidly, and thus some "classical" scaling rules, such as gate oxide thinning, can no longer be maintained. The increased number of transistors per chip, together with the reduction in die size, leads to a rapid increase in power. Because the device clock frequency has increased with each new generation while the power supply has not been scaled down at the same ratio, dynamic power is now the dominant power factor for 65 nm platforms (Table 1, Figure 1). To overcome this problem, foundries in most cases offer platforms with several technologies. The different "technologies" refer to the thin-oxide transistor parameters and the operating voltage for the related application (Table 1). In most cases, the SL (Standard Logic for General Purposes) technology will have an FEOL (Front-End-of-Line) with thinner gate oxide, lower operating voltage, higher drive currents and lower threshold voltages compared to the Low-Power (LP) technology. Interactive audio and/or video mobile platforms have both high dynamic and high static power consumption. To meet these opposing demands, "mixing technology" with a "triple-gate" oxide has also been proposed [1, 2]. We will discuss this solution later on. The BEOL (Back-End-of-Line, metal and dielectric layers) and most of the analog passive components, such as resistors, junction varactors and thick-oxide MOSFET varactors, are common to all technologies on the same platform.
Figure 1. Total power ratio evolution vs. platform (node) and technology (application) [2] (left); "SL" refers to "Standard Logic for General Purposes" and "LP" to "Low-Power". IBM leakage values for high-performance and LP technologies for the different platforms [6] (right).
Table 1 contains a short list of benchmark specifications for 65 nm down to 32 nm platforms, including supply voltage (V dd), gate oxide thickness, drive current (I on) and sub-threshold leakage (I sub). As can be seen, typical I sub values go up by several orders of magnitude as technology is scaled down, whereas V dd is reduced by only 30%. This is the main challenge discussed in this paper. We will use Table 1 throughout this paper for technology comparison. It is interesting to compare the IBM I sub for HP (high-performance) and LP technologies, as shown in Figure 1, with the data in Table 1: for 65 nm, the HP leakage is close to that of SL/LVt, and the LP leakage is close to that of LP/LVt. For 45 nm, IBM uses a high-k dielectric, and the leakage values are lower by a factor of 5 compared to LP/HVt in [4].
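There are several sources of power dissipation (P) in digital CMOS circuits [7]:
$$P = P_{short} + P_{switch} + P_{static} = I_{sc} V_{dd} + \alpha C_L V_{dd}^2 f + I_{leak} V_{dd}$$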
The first term, P short, is the power consumed during the gate voltage transient time. In CMOS technology it is related only to the direct-path short-circuit current (I sc), which flows when both the NMOS and PMOS transistors are simultaneously active, conducting current directly from the supply V dd to ground or V ss.
The second term, P switch, is the dynamic component of the switching power due to charging and discharging of the load, where C L is the total loading capacitance, f is the clock frequency and α is the average switching activity factor (a typical value for α is 20% for logic blocks in 65 nm technology [8]). Some techniques for P switch reduction are described in the next section.
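As a rough numerical illustration of the short-circuit and switching terms described above, together with the static leakage term, the sketch below evaluates P short = I sc · V dd, P switch = α · C L · V dd² · f and P static = I leak · V dd. The function name and all numerical values are illustrative assumptions and are not taken from Table 1 or from [7].

```python
def cmos_power(i_sc, c_load, v_dd, freq, alpha, i_leak):
    """Return (P_short, P_switch, P_static) in watts.

    i_sc   : average short-circuit current [A]
    c_load : total switched load capacitance [F]
    v_dd   : supply voltage [V]
    freq   : clock frequency [Hz]
    alpha  : average switching activity factor (0..1)
    i_leak : total leakage current [A]
    """
    p_short = i_sc * v_dd
    p_switch = alpha * c_load * v_dd ** 2 * freq
    p_static = i_leak * v_dd
    return p_short, p_switch, p_static


# Illustrative (assumed) values: 1 pF switched load, 1 GHz clock, alpha = 20 %.
p_short, p_switch, p_static = cmos_power(
    i_sc=1e-6, c_load=1e-12, v_dd=1.0, freq=1e9, alpha=0.2, i_leak=1e-7
)
print(f"P_short={p_short:.2e} W, P_switch={p_switch:.2e} W, P_static={p_static:.2e} W")
```

Because P switch scales with V dd², lowering the supply from 1.0 V to 0.7 V in this sketch cuts the switching term roughly in half, which is why supply scaling is the most effective lever on dynamic power.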
Imperfect cut-off of the transistor leads to leakage (I leak) and power dissipation (P static) even without any switching activity. As the number of gates increases, both the total capacitance and the channel width contribute to the increase in leakage.
This paper is organized as follows: we start with a general review of the transistor leakage components related to scaling. We then describe some layout parameters and related design rules that affect leakage, and give some guidelines for layout optimization for power (or leakage) reduction. In Section 3, we describe some operational and layout considerations for power reduction in SRAM. Finally, techniques for power reduction at the transistor and circuit levels are discussed for both core and SRAM.
Transistors Leakage Components
The main feature of transistor scaling is the reduction in V dd, the threshold voltage (V t), effective channel length, T ox and doping levels and depth (Table 1). In this section, we discuss the dependence of the transistor leakage components on these parameters.
As analyzed in [7, 9, 10], the overall leakage current can be divided into several components (Figure 2) that arise under different bias conditions. At very low gate voltage, a potential difference between source and drain still results in a sub-threshold static leakage current, I sub. Among the many parameters, the dependence of I sub on the threshold voltage (V t) and the operating temperature is the most significant: I sub falls exponentially with increasing V t and with decreasing temperature. Basically, lower channel doping, shorter effective channel length and larger transistor width will reduce V t and increase I sub. In addition, the body factor and the DIBL (Drain-Induced Barrier Lowering) parameters, which depend on the 1D and 2D doping profiles of the V t-adjust, halo/pocket and extension implants, will also affect I sub. Transistor scaling also means shallower and more abrupt extensions and S/D junctions. Although more abrupt junctions improve the short-channel effect, the rising doping concentrations and the high electric field (>10^6 V/cm) across the reverse-biased p-n junction lead to leakage due to Band-To-Band Tunneling (BTBT) [9]. A higher gate-to-drain voltage increases the vertical field in the drain depletion layer and reduces the depletion width at the gate-drain overlap area, resulting in Gate-Induced Drain Leakage (GIDL) [9]. To achieve good V t control and to reduce the I sub leakage, the dopant concentration near the surface is kept high. However, an increase of the drain voltage lowers the potential barrier for the majority carriers at the source side, thus leading to "additional" I sub leakage and to punchthrough.
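The exponential dependence of I sub on V t and temperature can be illustrated with the textbook sub-threshold expression I sub = I 0 · exp((V gs − V t)/(n · kT/q)) · (1 − exp(−V ds/(kT/q))). This is a minimal sketch: the prefactor I 0 and the slope factor n are assumed values, and the temperature dependence of V t and I 0 themselves is ignored.

```python
import math

K_B_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge [V/K]


def thermal_voltage(temp_k):
    """Thermal voltage kT/q in volts."""
    return K_B_OVER_Q * temp_k


def i_sub(v_gs, v_ds, v_t, temp_k=300.0, n=1.5, i0=1e-7):
    """Textbook sub-threshold current [A]; n and i0 are assumed fitting values."""
    v_th = thermal_voltage(temp_k)
    return i0 * math.exp((v_gs - v_t) / (n * v_th)) * (1.0 - math.exp(-v_ds / v_th))


def swing_mv_per_decade(temp_k=300.0, n=1.5):
    """Sub-threshold swing: gate voltage needed to change I_sub by 10x [mV/dec]."""
    return 1e3 * n * thermal_voltage(temp_k) * math.log(10)


low_vt = i_sub(v_gs=0.0, v_ds=1.0, v_t=0.3)
high_vt = i_sub(v_gs=0.0, v_ds=1.0, v_t=0.4)
hot = i_sub(v_gs=0.0, v_ds=1.0, v_t=0.3, temp_k=400.0)
print(f"swing = {swing_mv_per_decade():.1f} mV/dec")
print(f"I_sub ratio for a 100 mV lower V_t: {low_vt / high_vt:.1f}x")
print(f"I_sub ratio 400 K vs 300 K (fixed V_t): {hot / low_vt:.1f}x")
```

With these assumed parameters, each ~90 mV of additional V t buys about one decade of leakage reduction at room temperature.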
[Figure 2: leakage current components of a MOS transistor. Recovered labels: HCI, I gate, junction, punchthrough, I sub.]
In [7], the gate leakage is approximated using the transistor width W and two constants, K 1 and K 2, which can be extracted experimentally:
$$I_{gate} = K_1 W \left(\frac{V_{dd}}{T_{ox}}\right)^2 \exp\left(-K_2 \frac{T_{ox}}{V_{dd}}\right)$$
Figure 3 describes the dependence of the gate leakage on the gate oxide thickness. The exponent is much more dominant than the (V dd /T ox ) part in the pre-exponent.
Figure 3 (caption, partially recovered): ... Table 1. For the same effective oxide thickness, the gate leakage is lower by ~3 orders of magnitude compared to oxynitrided thermal oxide.
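To illustrate why the exponent dominates the pre-exponential factor, the sketch below assumes the commonly used form I gate = K 1 · W · (V dd /T ox)² · exp(−K 2 · T ox /V dd). Both the functional form as written here and the constants K 1 and K 2 are assumptions chosen for illustration; they are not fitted to any technology in Table 1.

```python
import math


def gate_leakage(width, v_dd, t_ox, k1, k2):
    """Gate tunneling leakage under the assumed model.

    width : transistor width (arbitrary units)
    v_dd  : supply voltage [V]
    t_ox  : gate oxide thickness [nm]
    k1    : assumed prefactor constant
    k2    : assumed exponent constant [V/nm]
    """
    return k1 * width * (v_dd / t_ox) ** 2 * math.exp(-k2 * t_ox / v_dd)


# Hypothetical constants, for illustration only.
K1, K2 = 1.0, 10.0

thick = gate_leakage(width=1.0, v_dd=1.0, t_ox=2.0, k1=K1, k2=K2)
thin = gate_leakage(width=1.0, v_dd=1.0, t_ox=1.6, k1=K1, k2=K2)

pre_exp_gain = (2.0 / 1.6) ** 2          # contribution of the (V_dd/T_ox)^2 term
exp_gain = math.exp(K2 * (2.0 - 1.6))    # contribution of the exponential term
print(f"total increase: {thin / thick:.1f}x "
      f"(pre-exponent {pre_exp_gain:.2f}x, exponent {exp_gain:.1f}x)")
```

In this example, a 20% thinner oxide raises the pre-exponential factor by only ~1.6× while the exponential term grows by more than 50×.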
For 130 nm, I sub, GIDL and junction leakage account for ~95% of the overall leakage, and I gate for <5%. For 90 nm, I gate is ~40% and for 65 nm it is >90%. Note that these percentages refer to leakage at room temperature. As the temperature goes up, both I sub and the junction leakage become more dominant [2]. Another factor that affects the ratio between the different components is the V t target: in a multi-V t technology with, for example, three V t types, the high-V t (HVt) transistor will have 25% of its leakage due to I gate, 25% due to diodes and ~50% due to I sub. The regular (or standard V t, SVt) transistor will have <5% for I gate and diodes and ~90% for I sub. In low-V t transistors, I sub is dominant (>98%) [1]. The 32 nm SL (Standard Logic for General Purposes) foundry technology node is the first to use a high-k material, which allows I gate to be reduced while keeping good gate control of the channel. A reduction of about 3 orders of magnitude in I gate can be achieved for the same effective oxide thickness (Figure 3).
In addition to gate current due to tunneling, Hot Carrier Injection (HCI) at the channel pinch-off area leads to impact ionization and leakage injection into the gate oxide.
Another aspect of scaling is the increase of inter-die thermal gradients due to the increase of the local power densities. Higher thermal gradients increase the voltage drop due to increased leakage. This voltage drop affects the clock skew. Kawa [11] found voltage drops of 12% and 16% for 30° thermal gradients for the 0.18 μm and 0.13 μm technology nodes, respectively.
Transistor and Cell Level Leakage Analysis and Optimization
Topological Design Rules and Layout Optimization
In addition to the continued reduction in transistor dimensions with scaling, the transistor configuration (or "transistor layout") used by standard cells becomes more and more complex. In this section, we discuss some of the dependence of transistor leakage on the layout "style". Although the number of different functions supported has hardly changed over the years, the number of different cell types has increased by ~×1.2 at every technology node (Figure 4). The gate density increases by a factor of ×2, as required by basic scaling. More cell types, with more demanding design rules, increase the challenge of reducing the dependence of leakage on layout. Another aspect of the analysis of complex topography is the OPC (Optical Proximity Correction) implemented by the semiconductor foundry after the design is completed and prior to mask making. Basically, during OPC, small corrections are made to the design by attaching (or removing) small polygons. OPC is applied to the active area (AA), poly, all the metal layers and, in the advanced platforms (≤90 nm), also to contacts and vias. Figure 5 shows a snapshot from a standard cell library used in mass production, together with the "on-silicon" shapes, based on modeling that takes into consideration the OPC and the manufacturing photolithography illumination conditions. About 20 TDRs (Topological Design Rules) are needed for drawing the cells shown in Figure 5. Among them, several rules have a direct relation to the transistor leakage and should therefore be optimized for low-power design. The analysis below covers some of these layout rules, which are listed in Table 2.
GC.D.1 and transistor configuration:
Several papers have already discussed the effect of the distance between the poly (over STI) and the related AA. If the poly is too close and rounded, it may affect the transistor gate length [12]; for that reason, a larger distance is always recommended where possible. In terms of process and leakage interaction, the distance GC.D.1 also affects the exact location of the spacer/AA corner (Figure 6). If the spacer corner is located too close to the AA/STI boundary, damage to the silicon substrate during the spacer etch-back can cause junction leakage. The example in Figure 6 shows the N+/WP junction leakage as a function of GC.D.1, using a dedicated test structure consisting of a diffusion comb interdigitated with a poly-over-STI comb. As can be seen, if the distance is large enough, the leakage is low and almost the same as that of a junction without nearby poly. However, for too short a distance, both the leakage and the leakage spread increase. Figure 6 also shows the dependence of the leakage value on the number of diffusion corners. Naturally, the higher the number of corners, the higher the junction leakage (for the same value of GC.D.1). Therefore, for low-power design, relaxing GC.D.1 and avoiding complex transistors with a large number of AA corners is recommended.
The transistor leakage also depends on the complex AA/poly configuration. For the analysis, a study methodology was developed [13, 14] consisting of systematic Edge-Contour Extraction (ECE) from transistors sampled along the manufacturing line. In general, the SEM (Scanning Electron Microscopy) ECE algorithm is based on CAD (GDS) to SEM pattern recognition, followed by initial and final 2D edge extraction. About 3000 transistors were measured for the analysis. Device modeling (based on SPICE simulation) was then used to predict the nominal values as well as the variability of I on and I sub. The SEM analysis was done with measurement steps of 2 nm, so for every transistor gate the min/max, average and standard deviation of the width and length were measured and calculated. I on was calculated from W avg and L avg, the average width and length of every transistor, respectively. I sub was calculated from L min, σ L and W avg, where L min and σ L are the minimum gate length and the related standard deviation of every transistor. More details on this calculation method are given in [15]. The I on /I sub characteristic was used to compare the performance of different transistor configurations. Generally, a shorter gate length resulted in higher drive current and higher leakage current. The I on /I sub chart (Figure 7) makes it possible to identify configurations that yield lower leakage current for the same drive current. This is the main advantage of using the I on /I sub chart instead of looking at I on or I sub separately for variability analysis. Analysis of the different clusters in the I on /I sub chart using the Calibre DFM (Design-for-Manufacturing) property (Mentor Graphics) showed that each cluster is related to a different transistor configuration. The most frequent configurations (Figure 8) were configuration 4 (37%), configuration 1 (32%), configuration 8 (11%) and configuration 10 (8%).
In the remaining 12% of transistors, 18 different configurations were defined [14].
[Figure 8: SEM micrographs of Configurations 1, 4, 8 and 10.]
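The per-gate statistics used in the contour-based analysis above (W avg and L avg for I on; L min, σ L and W avg for I sub) can be sketched as follows. The function name, the sample values and the use of the population standard deviation are assumptions for illustration; the actual extraction flow is described in [13-15].

```python
from statistics import mean, pstdev


def gate_statistics(lengths_nm, widths_nm):
    """Summarize gate length/width samples taken along one transistor gate.

    lengths_nm : gate length measured at successive slices along the width [nm]
    widths_nm  : gate width measured at successive slices along the length [nm]
    """
    l_min = min(lengths_nm)
    sigma_l = pstdev(lengths_nm)  # population std. dev. is an assumption here
    return {
        "L_avg": mean(lengths_nm),       # used with W_avg for the I_on estimate
        "L_min": l_min,
        "sigma_L": sigma_l,
        "L_for_Isub": l_min + sigma_l,   # (L_min + sigma_L), as used for I_sub in [15]
        "W_avg": mean(widths_nm),
    }


# Hypothetical contour samples for one gate (2 nm measurement steps assumed).
stats = gate_statistics(lengths_nm=[60, 62, 58, 60], widths_nm=[200, 202, 198])
print(stats)
```

The resulting L avg/W avg and (L min + σ L)/W avg pairs would then be passed to a device model to obtain one point on the I on /I sub chart per transistor.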
Configuration 1 consists of a U-shaped AA with an isolated poly gate. This configuration is known to have higher AA width variability [15, 16]. The transistor under inspection in configuration 4 does not have any AA or poly corners close to the gate area, and its poly can be referred to as "semi-dense". Configurations 8 and 10 are very similar: both have a poly bend at the minimum design-rule distance from the gate area, and the poly gate can be referred to as isolated. The only difference between these two configurations is the local surrounding area, which is very similar but not identical. The electrical performance of each configuration is shown in Figure 9. Configuration 4 shows the lowest I sub, with about 20% lower leakage compared to configuration 8 or 10. This better performance can be attributed to the lack of AA and poly corners near the transistor gate, as well as to the semi-dense poly line. On the other hand, configurations 8 and 10 show the worst performance: because of the isolated poly line, L min was small and yielded high leakage current. In addition, the poly bend and the related OPC may affect the transistor width (as well as the transistor minimum length), as suggested by the SEM micrograph (Figure 8). It is clear from the I on /I sub chart that these two configurations had the highest (worst) ratio and are therefore less recommended for low-power or low-leakage applications. Configuration 1 also shows a poor I on /I sub ratio, correlated with the AA width spread as well as with the isolated poly gate, as can be clearly seen in the SEM micrograph (Figure 8).
[Figure: layout sketch with labels GC.D.1, L and J.]
Configurations 8 and 10 were also studied in [14]. A large array of standard cells was OPC-treated and then subjected to "silicon simulation" of the optical and etch manufacturing conditions. After the physical parameters were extracted from the "on-silicon" structures, device simulation was performed.
Some correlation was found between the I on and I sub values and the ratio J/L (Figure 10), where J is the length of the poly running parallel to the AA and L is the length of the poly line. The parameters L and J determine how close the poly corners are to each other. Close corners cause the OPC corrections to interfere with each other, so that the channel length profile undershoots on the jog side of the channel. This poly line-width undershoot increases I sub, because I sub is a function of L min.
CS.D.1/2:
Contacts too close to the transistor gate may see a higher electric field between the gate and the drain and, as a result, may lead to higher I sub (Figure 11). This is why, in some cases, CS.D.2 for thick-oxide MOSFETs, which use a V dd of 3.3 V and have a larger electric field between the contact and the gate, uses larger distances than the thin-oxide rule (CS.D.1). In addition, contacts too close to the gate increase the overall gate capacitance and degrade the transistor switching speed. Process improvements for this rule for leakage reduction result mostly from contact etch profile optimization and some selective OPC. A special test chip was proposed for monitoring CS.D.1 leakage levels [17]. A comparison among several vendors of standard cells using a "ranking methodology" was presented in [18]. The ranking rule was based on fab manufacturing data regarding the physical and electrical sensitivity of the structure to the design rule type and value. The overall design score was calculated using the ranking rule and its "weight". Table 3 (taken from [18]) shows the results for 4 different standard cell libraries from 3 different vendors. Vendor C yielded the highest score for both GC.D.1 and CS.D.1. Vendor B2 received the lowest score for these two parameters.
High LER (Line-Edge Roughness) and high LWR (Line-Width Roughness) also degrade the transistor leakage current. If we assume that the poly is composed of N segments in series, each having a length l i, then the overall I sub of the transistor at V gs close to 0 V will be [19]:
where l is constant. The first term in the summation will be that of L min (the minimum gate length of the specific transistor). This segment will have much higher leakage than the other terms because of the exponential dependence. This also explains the decision to use (L min + σ L) for the I sub calculation [15]. The natural conclusion is that higher transistor variability also means higher power consumption. Kim et al. [20] found that for poly gates with widths of 80 nm~90 nm, increasing the LWR from <7.1 nm to 14~21 nm increased I sub by 1.5~2 orders of magnitude. LER is a strong function of the imaging conditions. In poly layer photolithography, the poly is "dark" and the poly space is "clear" or "bright". In order to improve image fidelity and reduce variability, the transition from bright to dark needs to be steep. To reduce LWR, it is recommended to have a fixed (and optimal) space between the poly gates, either to the neighboring transistor or to a dummy transistor. The size of this optimal space is set by the imaging conditions used by the technology: the wavelength, the numerical aperture, the illumination conditions and the photoresist conditions such as thickness and viscosity. Standard cell libraries use the minimum poly width of the technology for almost all gates. However, the position of the different transistors over the AA cannot be fixed, because in some cases contacts are located in between. In addition, if the library supports multi-V t, the distance between different types of transistors should be maintained. This "fixed space" is the basis for using regular and gridded design with restrictive design rules (RDR), which were introduced in 45 nm and below platforms.
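The effect of the segment model on leakage can be illustrated numerically. The sketch below assumes that each poly segment contributes a leakage proportional to exp(−l i / l c), with l c a characteristic length; this specific form, the value of l c and the segment lengths are assumptions for illustration and are not the exact expression of [19].

```python
import math


def segmented_leakage(segment_lengths_nm, l_char_nm=5.0, i0=1.0):
    """Sum of per-segment leakage contributions under the assumed model.

    Each segment contributes i0 * exp(-l_i / l_char); l_char_nm and i0 are
    hypothetical fitting constants.
    """
    return sum(i0 * math.exp(-l / l_char_nm) for l in segment_lengths_nm)


# Two gates with the same average length (60 nm) but different roughness.
smooth = [60.0, 60.0, 60.0, 60.0]
rough = [50.0, 60.0, 60.0, 70.0]

i_smooth = segmented_leakage(smooth)
i_rough = segmented_leakage(rough)

# Share of the total carried by the shortest (L_min) segment of the rough gate.
share_lmin = math.exp(-min(rough) / 5.0) / i_rough

print(f"rough / smooth leakage: {i_rough / i_smooth:.2f}x")
print(f"share of the L_min segment: {share_lmin:.0%}")
```

Because the exponential is convex, the gain from the shorter segment outweighs the saving from the longer one, so the rough gate leaks more than the smooth gate at the same average length, and the L min segment carries most of the current.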
Ban and Pan [21] proposed an algorithm for LER-aware poly optimization that minimizes the LER-related leakage by setting an optimal space. The procedure placed poly gates at the best locations and introduced dummy poly to eliminate boundary conditions. As an example, 6 cells simulated under 32 nm technology conditions showed leakage reduction of up to 47% (average 40%). In another work [22], Ban et al. presented a layout optimization based on a comprehensive sensitivity metric that seamlessly incorporates proximity effects and process variations. Based on that information, standard cell layout optimization (poly gate and AA layout adjustments) is performed to minimize the delay at nominal and corner conditions. Using a 45 nm standard cell library, they demonstrated a leakage reduction of 7~91% at that corner.
Leakage Reduction in Transistor Level-CMOS and SRAM
Threshold voltage reduction is the simplest way to increase the transistor overdrive and reduce propagation delay. However, V t reduction means an exponential increase of I sub (Figure 12). By using very high V t values for non-critical paths, the leakage can be reduced by 2~3 orders of magnitude. Figure 13 shows the V t scaling from the 0.25 μm platform down to 65 nm for standard (or regular) V t. As can be seen, V t values were no longer reduced beyond 90 nm. For Low Power, a V t higher by 100 mV~200 mV was used. Basically, V t can be changed by doping adjustments (of the channel and/or the SDE, Source-Drain Extensions), by adjustment of the effective gate oxide thickness and/or the work-function difference of the gate electrode (in the case of metal gates), or by body biasing. In multiple-V t technology, also known as "Multiple Threshold Voltage CMOS" (MTCMOS, or dual-V t CMOS, DVTCMOS), two (or more) types of transistors are fabricated: standard (SVt) or regular-V t and high-V t (HVt). In most cases, this is done with two additional V t implant masks or two additional SDE implant masks. In high-density standard cells, this technique can be limited when "mixing" standard cell libraries, because of the layout design rules related to the other layers. For SRAMs, the mask data preparation done by the foundry assigns the relevant HVt implant masks to the SRAM array as well. If the design has no HVt, the dedicated VNS (V t implant for nMOS SRAM) mask described above (or another VPS mask) is used. Another way to adjust V t is through the V t roll-off behavior. However, modern CMOS devices use high doping levels of halo implants in order to reduce the V t roll-off. In addition, if a larger L is used, the gate capacitance also increases.
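The idea of using high-V t devices only on non-critical paths can be sketched with a toy slack-based assignment. This is a simplified illustration, not a published algorithm: it treats the timing slack of each gate as independent (a real flow would re-run timing analysis after each swap), and the gate list, delay penalty and leakage ratio are hypothetical.

```python
def assign_vt(gates, delay_penalty_ps, leak_ratio):
    """Assign HVt to every gate whose slack can absorb the HVt delay penalty.

    gates            : list of dicts with 'name', 'slack_ps', 'leak_nA' (at SVt)
    delay_penalty_ps : assumed extra delay of an HVt gate versus SVt
    leak_ratio       : assumed SVt/HVt leakage ratio (e.g., 100 for 2 decades)

    Returns (assignment dict, total leakage in nA).
    """
    assignment = {}
    total_leak = 0.0
    for gate in gates:
        if gate["slack_ps"] >= delay_penalty_ps:
            assignment[gate["name"]] = "HVt"
            total_leak += gate["leak_nA"] / leak_ratio
        else:
            assignment[gate["name"]] = "SVt"   # critical: keep the fast device
            total_leak += gate["leak_nA"]
    return assignment, total_leak


# Hypothetical netlist fragment.
gates = [
    {"name": "a", "slack_ps": 0.0, "leak_nA": 10.0},   # on the critical path
    {"name": "b", "slack_ps": 50.0, "leak_nA": 10.0},  # plenty of slack
    {"name": "c", "slack_ps": 5.0, "leak_nA": 20.0},   # not enough slack
]
assignment, total = assign_vt(gates, delay_penalty_ps=20.0, leak_ratio=100.0)
print(assignment, f"total leakage = {total:.1f} nA")
```

In this example only gate b can be moved to HVt, so the total leakage drops from 40 nA to about 30 nA without touching the critical path.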
"Mixed" technology refers to the case of having simultaneously the Standard Logic transistors and the Low-Power transistors, having a different gate oxide thickness. In the example of the 65 nm platform described in Table 1 , the two technologies can be used separately or "mixed" together [1, 2] . These combinations have a triple-gate oxide and it is not manufacturing friendly due to an additional mask penalty, between +1 mask for the gate oxide process only, and up to +8 masks for the case V t and SDE implants also need to be separated. In addition, the complexity of having an oxide-strip at a very small window, the additional thermal budget, and the fact that there are two different gates with a close thickness target are problematic due to the oxidation kinetics [25] . However, it was successfully developed for 28 nm technology, having gate oxides thickness of 16A for Low Power Standby (1.1 V) and 13.5 A for Low Power (0.8 V) [26] . In summary, this combination is one of the ways to reduce the overall circuit leakage, but it introduces many process challenges and has a high cost penalty.
It is known that back bias (body bias, or reverse body biasing, RBB) can modify the transistor V t. However, higher body bias increases GIDL, degrades V t variability and, in multi-V t transistors, induces a body-biasing sensitivity that depends on V t [1]. In this case, closely located transistors cannot share the same N or P wells, and triple wells are therefore needed. Such wells have a high area penalty because of the additional layout design rules. It is important to note that, while the reverse bias increases V t, it also increases the junction current and decreases the junction capacitance. To overcome the performance degradation of RBB caused by the increase in GIDL, DIBL and BTBT currents, Jeon et al. [27] proposed a standby leakage power reduction technique based on an optimal body bias voltage. This voltage was determined by the ratio of I sub and the band-to-band tunneling current (I BTBT). For circuit implementation, they proposed a control system that includes a leakage monitoring circuit, a current comparator and a charge pump. The leakage monitoring circuit feeds both I sub and I BTBT into the current comparator, which increases or decreases the body voltage applied to the chip core by the charge pump. Implementation of this technique on ISCAS85 benchmark circuits in 32 nm MOSFET technology yielded a 400~1500× leakage reduction. Yasuda et al. [28] succeeded in reducing the sensitivity of the body biasing to the threshold voltage by careful channel and gate engineering: they increased the channel contour doping (by adjusting the punchthrough implant dose and energy) and shifted the channel away from the surface (buried channel). Taking advantage of the V t shift caused by the work-function modulation of the Hf-based gate dielectric, the peak concentration of the channel impurity profile was positioned in a deeper channel region, away from the surface, without lowering the V t.
The main drawback of reducing the leakage by increasing the channel doping (raising V t) is the reduction in transistor current due to the lower overdrive, which degrades the delay time. For an inverter, the delay time is given by:
where ν is a fitting constant (that is correlated to the velocity saturation index), μ, ε ox , W eff and L eff are the channel mobility, the gate oxide dielectric constant, the effective width and length of the transistors, respectively.
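The trade-off between V t and delay can be illustrated with an alpha-power-law style estimate, t d ∝ C L · V dd / (k · (V dd − V t)^ν), where k lumps together μ, ε ox /T ox and W eff /L eff. This form is an assumption used here for illustration and may differ in detail from the expression above; C L, k and the value ν = 1.3 are hypothetical.

```python
def inverter_delay(c_load, v_dd, v_t, nu=1.3, k_drive=1.0):
    """Alpha-power-law style delay estimate (arbitrary units).

    c_load  : load capacitance
    v_dd    : supply voltage [V]
    v_t     : threshold voltage [V]
    nu      : velocity-saturation index (assumed value, between 1 and 2)
    k_drive : lumped drive factor, mu * (eps_ox / T_ox) * (W_eff / L_eff)
    """
    if v_dd <= v_t:
        raise ValueError("v_dd must exceed v_t for the gate to switch")
    return c_load * v_dd / (k_drive * (v_dd - v_t) ** nu)


d_low_vt = inverter_delay(c_load=1.0, v_dd=1.0, v_t=0.3)
d_high_vt = inverter_delay(c_load=1.0, v_dd=1.0, v_t=0.4)
print(f"delay penalty for +100 mV V_t at V_dd = 1 V: {d_high_vt / d_low_vt:.2f}x")
```

With these assumed values, raising V t by 100 mV costs roughly 20% in delay, whereas the same V t shift reduces the sub-threshold leakage by about an order of magnitude.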
The 90 nm technology was the first node in which performance enhancement was achieved using stressors [29]. These stressors can be STI [30], Stress Memorization layers [31], the nitride located under D1 that is also used as the Contact Etch-Stop Layer (cESL) [32], and eSiGe (elevated SiGe) [33]. Stress induced by the salicided source-drain active area can also improve performance [34]. Basically, stress induced in the channel can improve or degrade the carrier mobility and, as a result, change the transistor currents. The level of improvement (or degradation) depends on the level of the induced stress, on its type (tensile or compressive) and on the direction of the strain induced in the silicon. For example, compressive stress induced by STI along the x-axis (along the channel length) will improve the drive current of pMOS transistors. However, compressive stress along the y-axis (along the channel width) will degrade the drive current of the same pMOS transistor, and because of that, higher tensile stress along the y-axis resulting from AA salicidation will improve the current. In the work reported in [35], all the stress techniques listed above were used in a 90 nm platform. The overall current improvement was up to 15%. It consisted of improvements due to cESL (~7%), STI (~7%) and salicidation (~5%). Naturally, the improvements from the individual stress components are not "cumulative".
Figure 14. I sub /I on curves for nMOS (left) and pMOS (right) transistors, with and without induced stress. PSS is "Process Strained Si", which includes cESL, STI and the silicided layer. Data are from [35].
The main advantage of the different mobility enhancement techniques is the increase in drive current without leakage degradation (Figure 14). Based on that, with careful layout modification the leakage current can be improved while the other parameters are kept in place. As a basic example, assume a transistor with a specific gate length that yields a given drive current and leakage on the I on /I sub chart. Increasing the gate length will reduce both the leakage and the drive current. However, by using stressors, we can restore the drive current to its original value while still keeping the low leakage level. In the example of Figure 14, the leakage of the nMOS can be reduced by ~60% while keeping the same drive current. In addition, the I sub reduction is observed to taper off quickly with longer gate length. It is important to note that stress can also increase junction-related leakage. Wang et al. [36] studied the effect of mechanical uniaxial stress on junctions fabricated in 65 nm technology. They found that for junctions in nMOS, where BTBT is the major component of the junction leakage, the dependence on stress (generated by tensile cESL) is weak, and even with high-stress layers (thicker, pMOSFET as for 45 nm technology) the junction leakage degradation was <7%. For nMOS, higher tensile stress reduced the leakage. However, for junctions in pMOS, where the stress was generated by both compressive cESL and eSiGe elevated S/D, and where the leakage mechanism was based on both BTBT and generation current, high stress would degrade the leakage by up to 25%. These facts should be taken into consideration for future technologies. Examples of layout modifications that affect stress are:
• Extending the AA (source and drain) edges beyond the gate (GC.X.1), which changes the STI-induced stress (Figure 15). The stress range and magnitude are up to GC.X.1 = 1.3 μm and <10%, respectively;
• Re-placement of contacts with respect to their distance to the gate (CS.D.1). This is because contacts "punch" through the cESL layer and release some of the stress. For this reason, re-setting the source and drain contact pitch may also improve performance [37];
• Poly space between transistor fingers (without contacts) [38]. A smaller space also means a narrower cESL layer, a narrower eSiGe stressor trench, or both. In [3], up to 7.8% degradation was seen for different poly spaces;
• Location of the tensile/compressive nitride cESL boundary over STI (Figure 16), separating the nMOSFET and pMOSFET [38].
The first research work to tackle timing closure for standard cells through layout modifications, using the active-area-dependent mobility of strained silicon, was reported in [38]. In that work, GC.S.1 was adjusted to modify the stress induced by the eSiGe stressor. Later, Joshi et al. [39] developed a methodology for stress-aware layout optimization under the constraints that the cell area does not change, the switching delays are similar or smaller, and the leakage is lower. Because dual-V t (HVt, LVt) was available, the algorithm also "assigned" the optimal V t type per case, together with the layout optimization. This approach was used successfully for a 65 nm technology design with the following parameters: V dd = 1 V, nMOS_HVt = 334 mV, pMOS_HVt = −391 mV, nMOS_LVt = 243 mV, pMOS_LVt = −280 mV. The I on and I sub ratios for LVt/HVt were ×1.24/×16 for the nMOS and ×1.32/×29 for the pMOS. The stress-aware layout modification included changes in GC.X.1, CS.D.1 and the CS pitch, and setting the location of the tensile/compressive nitride cESL boundary over STI (Figure 17). A comparison was also made between using dual-V t with a single thin-oxide thickness only and using dual-V t with stress-aware layout modification. The analysis showed that for the same delay time, up to 34% reduction of leakage was obtained. For the same leakage values, up to 10% delay time reduction was achieved using this methodology. The results of Table 4 [39] clearly show that the combined approach significantly improved the leakage power while keeping the same delay time. Improvement in critical delay time at iso-leakage was also seen in comparison with the dual-V t-only approach. The maximum leakage improvement was 38.5%, with an average value of 23.8%.
Table 4. Improvement in leakage and delay, comparing the dual-V t (HVt/LVt) approach to dual-V t with stress-aware layout optimization, based on data from [39]. Twelve different circuits were used, with gate counts from 166 up to 37,560.
Low Power Consideration for SRAM
Technology scaling decreased the overall SRAM area by a factor of 2 (or more) for each generation (Figure 18). The 0.13 μm platform was the first in which two bit-cells were used by foundries for high volume manufacturing: 2.43 μm 2 , which is a direct shrink from 0.18 μm, and 2.14 μm 2 , for high-density low-leakage applications. Down to 80 nm, a 6-T (six transistors) SRAM bit-cell of type A to D was used [40]. The 65 nm foundry technology [41] introduced a new layout configuration that did not have any AA or poly corners that could be rounded, as explained above. This "thin" vertical height also reduced the bit-line loads and improved noise immunity. For 45 nm or 32 nm technologies, the straight poly lines could also be supported with line-cut double-patterning [42]. The total leakage in SRAM is roughly expressed as [1] :
where I sub_latch and I sub_PU are the subthreshold currents of the nMOS latch and the pMOS Pull-Up transistor, respectively. I gate_latch and I gate_PU are the gate currents of the nMOS latch and the pMOS Pull-Up, respectively. In order to reduce off-state leakage, in many cases the SRAM array has a higher V t . This is most important for the nMOS pull-down (PD) and costs an additional VNS mask (a dedicated V t implant for the SRAM nMOS). Note that this mask also increases the threshold voltage of the nMOS Pass-Gate (PG). The penalty is that both the write delay and the read delay increase. In some cases, another additional mask is used to increase the V t of the pMOS Pull-Up (PU) transistors, which reduces the V dd -to-ground leakage, but with a write delay penalty [43]. The higher V t also improves the static noise margin (SNM) of the cell, which allows a reduced β ratio (or cell ratio), defined as β = (width/length of nMOS PD)/(width/length of nMOS WL), where WL denotes the word-line (PG) transistor. The reduction of β improves the cell read current [9].
A major contributor to leakage in SRAMs is the gate-to-channel leakage of the PD nMOS transistors in the "ON" state. An increase of the gate oxide thickness can reduce this leakage (which has an exponential behavior, see Equation (2) above). However, the gate oxide thickness is set by the logic transistors (both nMOS and pMOS). Solutions like "multiple" gate oxide thicknesses ([43], called MoxCMOS in [9], dual-T ox CMOS or DTOCMOS in [44]) were also proposed. Advanced technologies, which use high-k gate oxide materials, achieve reduced gate leakage for the same effective gate oxide thickness (see Figure 3 and Table 1). Yasuda et al. [28] reported that by replacing the SiON gate dielectric with HfSiON, where both have the same effective oxide thickness, the gate leakage components in (3) become negligible and the total stand-by power consumption is reduced by a factor of 5. In addition, Yang et al. [45] reported that for 32 nm Low Power technology, the adoption of a gate-first Hf-based high-k process improved V t mismatch by 50% (compared to 45 nm technology), due to a thicker gate oxide that provided better channel control. The V t mismatch improvement reduced the SRAM soft fail rate.
One of the SRAM scaling parameters is the space between the nMOS AA and the pMOS AA, which affects all A-D types of 6-T "tall" SRAMs [40]. This space is composed of AA.D.3 + AA.E.3, where AA.D.3 is the distance between WN and N+ in WP, and AA.E.3 is the enclosure of WN around P+ in WN (see Figure 19). Based on Table 2, the values for these two rules are scaled down by a factor of ~0.7. However, the limiting factor for nMOS-pMOS AA space reduction is punchthrough. Figure 19 shows a typical layout and SEM top-view micrograph of a 6-T SRAM of type D. Assuming that the distance is 2 × 0.22 = 0.44 μm, leakage measurements for the standard photolithography conditions show an increase of the leakage value when this distance is reduced by 2 × 0.04 μm (to 2 × 0.18 = 0.36 μm). For stability testing, a process window having a larger AA by 0.015 μm reduced the minimum space between diffusions to 2 × (0.18 − 0.015) = 0.33 μm. As seen from Figure 19, the leakage goes up. For more scalability, both the N-Well and P-Well tub profiles, as well as the STI depth and slope, need to be optimized. The area reduction of SRAM requires more aggressive design rules than those allowed by the platform design kit. For AA, poly and contacts, these "violations" are mostly related to the enclosure of AA and poly around contacts, CS.D.1, poly end-caps (GC.X.2) and, as explained above, the distance from N+/PW to P+/WN. All these "violations" demand a careful OPC treatment. In most cases, foundries use a dedicated OPC treatment for the SRAM array. For SRAM bit-cells having an area of 2.14 μm 2 and used in the 0.13 μm platform LL (low leakage) technology, an overall typical cell current of 5 pA/cell (Max < 10 pA/cell) was achieved using a dedicated OPC.
Circuit Level Techniques for Power and Leakage Reduction
This paper focuses on device and process level power reduction techniques, and therefore circuit level solutions will be covered mostly from the design-rules point of view. Power reduction techniques at the circuit level are listed in [46]. For mobile applications, where the product is in standby mode most of the time, the most effective way is to cut the leakage by switching off the inactive circuits. The basic method is to insert a power switch in series between a digital circuit block and its supply line (Figure 20). When entering the sleep mode, the gate of this power supply switch transistor is raised above V dd to decrease I sub , which depends exponentially on V gs (gate-to-source voltage). The drawback of this V dd "boosting" is that I gate also increases exponentially (2), and the gate oxide may wear out. Naturally, an "optimal" boosted voltage should be applied. In [47], a circuit was presented that automatically biases the gate of the power switch transistor to its minimal leakage point and efficiently compensates for temperature and corner variations. If dual-V t or triple gate oxides are used, the power switch transistor will have the HVt and the thick oxide. Another way is to use reverse biasing (with SVt or LVt) to obtain lower leakage and reasonable performance, as explained above.

Figure 20. Schematic of the power switch transistor used to cut the supply to the logic circuit in sleep mode (left). Stack and sub-stack of a NAND3 (right). It is recommended that the same transistor type (pMOS in this case) be used in the parallel structure.
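The trade-off between I sub and I gate for the boosted power switch can be illustrated with a minimal sketch. It assumes that both components are pure exponentials of the boost voltage (the gate voltage above V dd ); the function names, pre-factors and slopes are hypothetical and are not taken from [47].

```python
import math

def total_leakage(v_boost, i_sub0, s_sub, i_gate0, s_gate):
    """Total off-state leakage (A) of the power switch versus boost voltage (V).

    i_sub0, i_gate0: leakage components at zero boost (hypothetical values).
    s_sub, s_gate:   voltage per e-fold change of each component (hypothetical).
    """
    i_sub = i_sub0 * math.exp(-v_boost / s_sub)    # subthreshold leakage falls with boost
    i_gate = i_gate0 * math.exp(v_boost / s_gate)  # gate leakage rises with boost
    return i_sub + i_gate

def optimal_boost(i_sub0, s_sub, i_gate0, s_gate):
    """Boost voltage that minimizes the two-exponential model (clamped at 0 V)."""
    # Setting d(total)/dv = 0 gives exp(v/s_sub + v/s_gate) = (i_sub0*s_gate)/(i_gate0*s_sub).
    v = (s_sub * s_gate / (s_sub + s_gate)) * math.log(
        (i_sub0 * s_gate) / (i_gate0 * s_sub))
    return max(v, 0.0)

# Illustrative (assumed) numbers: 1 nA subthreshold and 10 pA gate leakage at zero boost.
params = (1e-9, 0.04, 1e-11, 0.10)
v_opt = optimal_boost(*params)
```

In a real design the optimum shifts with temperature and process corner, which is why an adaptive bias circuit such as the one in [47] is attractive; the closed form here only shows why a minimum exists.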
In this section, a short review of circuit solutions for digital design is presented. Afterwards, we focus on design optimization to reduce the leakage of large SRAM arrays.
For CMOS: the I sub flowing through a stack of series-connected transistors is reduced when more than one transistor in the stack is turned off. For example, the leakage of a two-transistor stack is one order of magnitude less than the leakage of a single transistor. This effect is known as the stacking effect [9] or self-reverse biasing [48]. The leakage reduction takes place because the voltage level of the intermediate node (between the two transistors) is positive. This leads to a negative V gs and a negative V bs (body-to-source potential), and also to a reduction in V ds (drain-to-source voltage). All of these yield a lower I sub . For example, in 3-input NAND gate stacks that were simulated using 65 nm technology with 17 Å gate oxide thickness, turning off 1 transistor reduced I sub by 23% (by 7% for 2 turned-off and ~4% for all 3 turned-off) [49].
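The self-reverse biasing mechanism described above can be reproduced with a minimal numerical sketch. It assumes a simplified subthreshold model with DIBL and body-effect terms, and every device parameter below is hypothetical (not taken from [9,48,49]). The intermediate node voltage is found by equating the currents of the two off transistors.

```python
import math

V_THERMAL = 0.0259  # thermal voltage kT/q near room temperature (V)

def i_sub(vgs, vds, vsb, i0=1e-7, vt0=0.3, n=1.5, dibl=0.1, body=0.2):
    """Simplified subthreshold current (A); all default parameters are hypothetical."""
    vt = vt0 - dibl * vds + body * vsb  # DIBL lowers V t, source-to-body bias raises it
    return (i0 * math.exp((vgs - vt) / (n * V_THERMAL))
            * (1.0 - math.exp(-vds / V_THERMAL)))

def stack_leakage(vdd, tol=1e-12):
    """Two series nMOS with both gates at 0 V: return (middle-node voltage, leakage)."""
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        vx = 0.5 * (lo + hi)
        top = i_sub(vgs=-vx, vds=vdd - vx, vsb=vx)  # upper device, source at vx
        bottom = i_sub(vgs=0.0, vds=vx, vsb=0.0)    # lower device, source grounded
        if top > bottom:
            lo = vx  # more current flows in than out, so the node settles higher
        else:
            hi = vx
    vx = 0.5 * (lo + hi)
    return vx, i_sub(vgs=0.0, vds=vx, vsb=0.0)

vdd = 1.0
i_single = i_sub(vgs=0.0, vds=vdd, vsb=0.0)
vx, i_stack = stack_leakage(vdd)
```

The middle node settles at a small positive voltage, so the upper transistor sees a negative V gs , a negative V bs and a reduced V ds at the same time, and the stack current comes out well below the single-transistor value.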
Sill et al. [44] performed a simulation analysis for selecting the best transistor type, using both dual-V t and dual-T ox (DVTCMOS and DTOCMOS). From the results, they extracted two design rules for transistor stacks:
• The delay rule: within a mixed stack, the L-V t transistor (with low V t doping and thin oxide) has to be placed as close as possible to the gate output to achieve the best time delay;
• The leakage rule: within a mixed stack, the H-V t transistor (with high V t doping and thick oxide) has to be placed at the end of the stack (away from the output) to achieve the best leakage.
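The two rules above can be sketched as a simple ordering function. This is a literal reading of the rules for illustration only: the function name and interface are hypothetical, and the sketch is not the selection procedure used in [44].

```python
def assign_stack(n_transistors, n_high_vt):
    """Return device types ordered from the gate output (T1) to the far end of the stack.

    'L' = L-V t device (low V t doping, thin oxide),
    'H' = H-V t device (high V t doping, thick oxide).
    """
    if not 0 <= n_high_vt <= n_transistors:
        raise ValueError("n_high_vt must be between 0 and n_transistors")
    n_low = n_transistors - n_high_vt
    # Delay rule: L devices go nearest the output.
    # Leakage rule: H devices go farthest from the output.
    return ["L"] * n_low + ["H"] * n_high_vt

nand3_two_high = assign_stack(3, 2)  # T1 is 'L', T2 and T3 are 'H'
```

How many H-V t devices a given gate can tolerate still has to come from timing analysis; the function only fixes their position once that number is chosen.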
Using these recommended rules, a library of ten standard gates in 65 nm technology was created. The example below shows different possible realizations of a NAND3 (Figure 20), with leakage and performance results (Table 5) given relative to the case where all transistors are made with LVt and low T ox . A delay improvement of 6% with the same leakage value (compare #3 and #2) was achieved by placing the H-V t transistor in the center (T2), which allowed the L-V t transistor to be close to the output (T1), as defined by the delay rule. A leakage improvement of 20% with the same gate delay time (compare #4 and #3) was achieved by placing the two H-V t transistors far from the output (T2, T3), as defined by the leakage rule. More details on static leakage reduction through simultaneous V t , T ox and transistor state assignment can be found in [50].

Table 5. Comparison of possible mixed-gate realizations, based on data for NAND3 from [44]. "H" means an H-V t transistor (having high V t doping and thick oxide), "L" means an L-V t transistor (having low V t doping and thin oxide). Refer to Figure 20 for the nMOS transistor locations.

Supply voltage reduction is also an effective method for switching power reduction, due to the quadratic dependence (1). Following the basic scaling rules, the supply voltage should be reduced by a factor κ in order to maintain a constant electric field. However, although V dd reduction yields lower dynamic power consumption, it also degrades the circuit performance, a degradation that cannot be compensated by V t reduction. Morifuji et al. [8] analyzed the dependence of the total power consumed by 1 M gates at 105 °C on delay time. A 65 nm platform was used, and the gate delay time was calculated by CV/I for an inverter with FO = 3. The total power was estimated using (1) with a clock frequency of 2 GHz and a switching activity of 20%. The implicit variables were V dd and V t .
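The quadratic dependence of switching power on supply voltage, and the CV/I delay estimate, can be illustrated with a short sketch. It assumes the common form P = α·C·V dd ²·f for the switching term; the per-gate capacitance of 1 fF is a hypothetical value chosen only to make the arithmetic concrete and does not come from [8].

```python
def dynamic_power(activity, c_switched, vdd, freq):
    """Switching power in watts, using P = activity * C * Vdd^2 * f."""
    return activity * c_switched * vdd ** 2 * freq

def gate_delay(c_load, vdd, i_on):
    """First-order CV/I gate delay in seconds."""
    return c_load * vdd / i_on

# 1 M gates, assumed 1 fF switched capacitance per gate (hypothetical),
# 20% switching activity and 2 GHz clock as in the text.
C_TOTAL = 1_000_000 * 1e-15
p_nominal = dynamic_power(0.2, C_TOTAL, 1.0, 2e9)  # at Vdd = 1.0 V
p_scaled = dynamic_power(0.2, C_TOTAL, 0.8, 2e9)   # at Vdd = 0.8 V
saving = 1.0 - p_scaled / p_nominal                # fractional power saving
```

A 20% reduction in V dd removes about 36% of the switching power in this model, but the drive current drops at the same time, so the CV/I delay has to be re-evaluated with the I on at the lower supply.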
For high-speed demands (Figure 21), V t should be reduced, which causes the standby power to increase. The dotted line is the boundary where the dominant power changes from being mostly active power to being mostly standby power, depending on the operation frequency and the switching activity. Based on that, it is proposed [8] that in an SoC (System-on-Chip) composed of different circuits, each circuit may have an optimized V dd (and V t target) according to its needs. For example, in a logic core or in clocks with 100% duty, which require high speed and have high activity, V dd (and V t ) will be scaled down aggressively (see point "S" in Figure 21). On the other hand, logic with low frequency or low activity will have a higher V dd and a higher V t (see point "L" in Figure 21).

Figure 21. Estimated total power consumption for 1 M gates at 105 °C as a function of the delay of FO3 inverters, simulated using 65 nm technology [8]. "S" is the working point with low V t and low V dd that provides high speed, and "L" is the working point with high V t and high V dd that minimizes leakage.
For SRAM Cell: SRAM cell stability can be observed using its eye (or "butterfly") curve, whose size gives the Static-Noise-Margin (SNM). Basically, the SNM degrades for lower V dd , lower V t or lower β. However, in the case of a memory array (where many cells are connected together on a single bit-line), a lower V t will increase the leakage current. When the leakage current becomes comparable to the cell current (which is reduced due to the lower V dd ), the array will fail. Therefore, both a small leakage of the transfer gate and a large cell current are required. This can be partially achieved by a longer gate length for the PG transistor and a wider width for the PD transistor. An optimal V t /V dd combination can be found after setting β [8]. Dual-power solutions [51] show a power reduction of 20% to 40% [26].
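The bit-line failure condition described above can be turned into a rough sizing check. The sketch below assumes that the summed transfer-gate leakage of the cells sharing a bit-line must stay below the cell read current divided by a safety margin; the function name, the margin of 10 and the example currents are hypothetical.

```python
def max_cells_per_bitline(i_cell, i_leak_per_cell, margin=10.0):
    """Largest cell count whose summed leakage stays below i_cell / margin.

    i_cell:          read current of the accessed cell (A)
    i_leak_per_cell: transfer-gate leakage of one unselected cell (A)
    margin:          required ratio of cell current to total leakage (assumed)
    """
    if i_cell <= 0 or i_leak_per_cell <= 0 or margin <= 0:
        raise ValueError("currents and margin must be positive")
    return int(i_cell / (margin * i_leak_per_cell))

# Illustrative (assumed) values: 20 uA read current.
n_low_leak = max_cells_per_bitline(20e-6, 1e-9)    # 1 nA leakage per cell
n_high_leak = max_cells_per_bitline(20e-6, 10e-9)  # 10 nA per cell after lowering V t
```

Lowering V t raises the per-cell leakage and lowering V dd reduces the cell current, so both changes shrink the number of cells that a single bit-line can support.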
In [52], two design techniques that reduce the static power dissipation through I gate and I sub reduction were presented. The first one (titled "PP-SRAM" in [7]) is based on replacing the nMOS pass transistors with pMOS and resetting the transistor widths and V t levels. This new configuration showed a 26% reduction in gate leakage current, a 37% reduction in power dissipation and a 15% improvement in SNM. However, the cell area increased by 16.5%. The second configuration (titled "IWL-VC SRAM" in [7]) is based on an improvement of the dynamic voltage scaling method titled NC-SRAM in [53]. Basically, this method uses two nMOS pass transistors (NC1 and NC2), which provide different ground levels, and reduces the gate leakage by ~50% and the power dissipation by ~57%. In [7], a third pass transistor is added to reduce the gate voltage of the pass-gate (word-line) transistor, yielding another 16% leakage reduction. For both the NC-SRAM and the IWL-VC SRAM, since only 2 or 3 transistors are added per row, the area penalty is negligible. More design techniques for SRAM power reduction can be found in [46, 54].
Summary and Conclusions
The rapid reduction in transistor dimensions results in increased leakage and power dissipation. This demands efforts in several aspects. At the transistor level, the increased leakage has different origins, and therefore its reduction requires careful new process integration, including novel materials. Another aspect is the dependence of leakage on the layout, which gives the possibility to reduce leakage by clever layout optimization. Some layout-aware procedures, including automated tools for leakage reduction, were proposed. Finally, some circuit-based solutions linked to layout design rules were described.
The presented analysis revealed a correlation between leakage and transistor configurations. Guidelines for leakage reduction based on the use of different stressors, the dependence of leakage on LER, etc. were specified. For SRAMs, different circuit level techniques, like multi-V t , multi-T ox , body bias adjustment and power switching, were discussed as possible approaches for leakage reduction.
