ABSTRACT Power and energy minimization is a critical concern for the battery life, reliability, and yield of many minimum-sized SRAMs. In this paper, we extend VAR-TX, our previously proposed hybrid analytical-empirical model for minimizing and predicting the delay and delay variability of SRAMs, to an enhanced version, exVAR-TX, which minimizes and predicts the power/energy and power/energy variability of a 16-nm 6T-SRAM under the influence of the three major types of variations: Fabrication, Operation, and Implementation. Using exVAR-TX for architectural optimization [exhaustively computing and comparing the range of feasible architectures subject to interdie (die-to-die/D2D) and intradie (within-die/WID) process and operation variations (PVT), electromigration (EM), negative bias temperature instability (NBTI), and soft-errors, among others], on top of deploying the most recent state-of-the-art mitigation techniques, we show that the energy and energy-delay-product (EDP) of a 64KB 16-nm 6T-SRAM could be reduced by ∼12.5X and ∼33%, respectively, as compared to existing conventional designs.
I. INTRODUCTION
As in most of the arena of digital design, reduction of power dissipation in memories is becoming of premier importance. Memory designers have been remarkably good at keeping dissipation in check, even while increasing memory capacity by about 10 orders of magnitude over the last 40+ years. Yet, the challenges are mounting. Portable applications using the newer technology nodes (at or below 32-nm) are significantly reducing how much power memory may consume. At the same time, technology scaling, with its reduction in supply and threshold voltages and its deterioration of transistor off-current, causes the standby power of the memory to rise [3]. Design for low-power, energy-efficient SRAMs has become a necessity, as the SRAM subsystem is one of the largest components (∼70%) in today's system-on-a-chip and high-performance VLSI circuits, a main contributor to the overall power consumption of the system, and therefore one of the components most vulnerable to the effects of power variations.
The trends of scaling device sizes, very low threshold voltages, and ultra-thin gate oxides have increasingly been challenged by variability and, therefore, by reliability-related issues such as line edge roughness, intrinsic parameter fluctuation, random dopant fluctuation, and oxide thickness fluctuation. These unwanted variations, in turn, can result in excessive subthreshold leakage, reverse diode leakage, and power dissipation, which undesirably affect the stability and energy consumption [4]-[6].
The primary goal of low-power, high-speed circuit design differs from application to application. In the case of battery-operated portable systems, such as cellular phones and laptops, the primary goal is to keep the battery lifetime and weight reasonable. For non-battery systems, such as workstations and multimedia digital signal processors, the primary goal is to reduce the overall system cost (cooling, packaging, and energy bill) while ensuring long-term device reliability [7]. These different requirements are the driving factors for low-power, high-speed electronic system design.
Interdie (D2D) and intradie (WID) process variations have become significant as a result of continued technology scaling into the deep submicron region [8], [9]. The International Technology Roadmap for Semiconductors (ITRS) [10] predicts that over the next decade, performance and power consumption variation will increase by up to 66 and 100 percent, respectively. Variations can stem from semiconductor manufacturing processes, ambient conditions, device aging, and, in the case of multi-sourced systems, vendors [11].
System design typically assumes guard-banding, a method that ensures reliable and consistent components over all operation and fabrication corners. However, there are many associated costs from over-design, such as chip area, complexity, performance, and/or power consumption. The overheads of guard-banding are reduced, but not eliminated, through the practice of binning, where manufacturers market parts with considerable post-manufacturing variability as different products. However, even with guard-banding, binning, and dynamic voltage and frequency scaling (DVFS), variability is inherently present in any set of manufactured chips [12] .
Many modelling techniques have been proposed to minimize the negative impact of process, operation, environmental effects, and/or aging-related variations on energy efficiency, reliability, and yield in SRAM and cache, including chip-area models [13] , [14] , power/leakage models [12] , [14] - [22] , access-time models [23] - [25] , and failure probability models [24] - [28] . Newer techniques can also be used to combat process variations such as adaptive body biasing (ABB) [29] or chip-by-chip resource resizing in various micro-architectural structures [30] . However, these either have inherent costs, must be applied with great caution, or require modification of the chip architecture. Such costly complications demonstrate the importance of inexpensive and early modelling to determine an optimal design that will allow the SRAM to improve its energy efficiency while being more tolerant of variations (Section II.D).
In several recent models [22], ultra-low voltage designs of 8T-SRAM memories have been proposed that position the operating voltage (V DD ) near or below the threshold voltage (V th ), known as the near- or sub-threshold region. Although potentially applicable to some low-speed applications (such as processors for sensor networks used for habitat or health monitoring, and biomedical equipment such as implantable devices and hearing aids [31]), ultra-low voltage design might not be useful for many existing and future high-speed, low-power, small-area multimedia applications. Near- or sub-threshold designs inevitably make the digital and memory circuits highly susceptible to additional constraints such as higher sensitivity to process, voltage, and temperature (PVT) variations, lower cell stability, smaller voltage margin, larger area, and prevailing leakage current [31].
Similarly, several other power-reduction approaches have been pursued to minimize the overheads of guard-banding. One such approach (an ongoing research effort called UNO) [32] has made rather successful progress in compensating for the effects of hardware variability through software controls, though without yet showing its commercial applicability to 6T-SRAMs below 22-nm. While these and numerous other studies have sought to minimize SRAM energy consumption (mostly at older technology nodes), with only one or two specific mitigation techniques deployed in their designs, exhaustive exploration of all feasible architectures subject to all recent major variability issues to achieve optimum energy efficiency has rarely been carried out.
The various mitigation techniques incorporated into a design, each with a different specific goal, have the potential to partially or completely neutralize each other's positive impact or, in the worst-case scenario, even to conflict with each other. For example, where one technique suggests using minimum-width bitlines to reduce the SRAM cell layout and improve its functional performance, another suggests widening the bitlines to reduce the current density [33], as discussed in Section III.D.
We believe that simultaneous adaptation of effective mitigation techniques for process variations, environmental factors, and aging effects, as discussed throughout this paper, will allow a more realistic prediction of energy efficiency of a robust 6T-SRAM design.
We propose a cross-layer approach that starts from the impact of the major variability types on the static and dynamic power of nano-scaled CMOS transistors and extends up to circuit-level analysis that considers the layout of the memory array. This approach enables us to consider the effect of major factors such as supply voltage and process variation on reliability issues such as IR drop, temperature, NBTI, EM, soft-error rate (SER), and static noise margin (SNM), which individually and collectively affect the power consumption of the cell and, therefore, of the entire memory. Our simulation results show that, when combined with the selected recent mitigation techniques discussed in Sections II and III, our proposed architectural optimization methodology exVAR-TX can considerably reduce the power variation and total power consumption (expressed in terms of energy-delay-product (EDP) and energy consumption), as we will show later in Table 2 and Fig. 10, respectively.
In this paper we undertake an exploration of feasible variability-aware, reliability enhanced architectures for 16-nm 6T-SRAM for energy efficiency.
The main contributions of this paper include:
1. Proposing exVAR-TX: our newly enhanced, statistical-timing-analysis-based, hybrid analytical-empirical model that helps predict the variation of leakage and power due to process, operational, and aging variation in memory design for 16-nm 6T-SRAMs, while providing the groundwork for its straightforward extension to other types of memory such as 8T-, 10T-, or multi-ported SRAM, cache, and CAM in future work.
2. Providing an inexpensive first-order solution to mitigate the effects of increasing process variations in upcoming technology nodes to achieve energy efficiency given such constraints as speed, area, and yield.
3. Presenting a comprehensive, yet compact, analysis of the prominent power- and performance-related variability issues affecting the battery life, reliability, and/or cost of next-generation SRAM designs, to hopefully increase the success of future circuit designs.
The rest of this paper is organized as follows. Section II reviews the static and dynamic power consumption and power minimization techniques for conventional 6T-SRAM at the transistor, cell, and circuit level, followed by an overview of our proposed enhanced model and its optimization capability. Section III presents an analysis of critical reliability issues impacting SRAM leakage/dynamic power, yield, and energy efficiency. Major reliability issues such as IR drop, temperature, NBTI, EM, and soft-errors, as well as their associated mitigation techniques, are discussed in this section. Section IV illustrates the experimental simulation results coupled with their associated analysis. Section V presents conclusions and possible future work. The Appendix provides background on the sources of power dissipation in memories as well as the techniques to reduce the retention current of the SRAM memories discussed in this paper.
II. OVERVIEW OF OUR MODEL
A. SRAM
The six-transistor-cell static random access memory (6T-SRAM) (Fig. 1) is the conventional choice for most on-chip memory designs. We do not change the circuit topology from the standard six-transistor cell because previous studies discourage this [34], [35]. With power applied, SRAM provides permanent data storage subject to the different components of leakage current shown in Fig. 1 and briefly discussed in Section II.B.2.
B. POWER
1) IMPACT OF SUPPLY VOLTAGE
Over the last several decades, device dimensions have been scaled down continuously, but power supply and operational voltages have not scaled accordingly. The reduced supply voltage that is typical for deep submicron technologies can be attributed, mainly, to the necessity of keeping both hot-carrier effects and energy consumption under control [36] .
Although indiscriminately reducing the supply voltage to about twice the threshold voltage (V DD ≤ ∼2V th ) has a positive impact on energy dissipation and actually increases the gain of the inverter (the main component of the SRAM cell) in the transition region, we do not choose to operate all of our digital circuits at these low supply voltages for the following main reasons [37]:
• Reducing the supply voltage is absolutely detrimental to the delay of the gate.
• The DC characteristic becomes increasingly sensitive to variations in the device parameters, such as the transistor threshold, once supply voltages and intrinsic voltages become comparable.
• Scaling the supply voltage means reducing the signal swing. While this typically reduces the internal noise in the system (such as caused by crosstalk), it makes the design more sensitive to external noise sources that do not scale.
In short, the largest hindrances to low supply voltages are increased delay, leakage and uncertainty.
2) IMPACT OF LEAKAGE CURRENT
The total leakage in a cell principally consists of the subthreshold leakage, the gate leakage, and the junction band-to-band tunneling leakage through different transistors in the cell [38] (Fig. 1). The most important source of leakage current is the subthreshold current of the transistors, especially since the accelerated adoption of high-k gate dielectrics was set to reduce gate leakage 100-fold [39].
Leakage current is influenced by such uncertainty factors as threshold voltage, physical channel dimensions, channel/surface doping profile, drain/source junction depth, gate oxide thickness, supply voltage, data statistics, temperature, and variations or temporal changes in all these parameters [40] .
The main challenge in the design of low-voltage, variability-aware SRAMs is to reduce leakage at the minimum cost to performance and area. To maintain performance as the supply voltage is reduced, it is essential that the device threshold scales as well. This trade-off is not without penalty: reducing the threshold voltage increases the subthreshold leakage current exponentially [41].
In Eq. (1), S is the slope factor of the device. The exponential increase in inverter leakage for decreasing thresholds is illustrated in Fig. 2 . Leakage currents are a concern particularly for designs that feature intermittent computational activity separated by long periods of inactivity. For example, the processor in a cellular phone remains in idle mode for a majority of the time. Such an idle processor with a low threshold voltage design would have a larger leakage as compared to the idle processor with a high threshold voltage design.
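As a minimal illustration of this exponential dependence (a sketch only, not the exVAR-TX implementation), the standard first-order subthreshold relation I ∝ 10^((V GS − V th )/S) can be evaluated numerically. The function name, the slope factor of 100 mV/decade, and the normalized prefactor are assumptions chosen for illustration; the drain-voltage dependence is neglected.

```python
def subthreshold_leakage(v_th, slope=0.1, i0=1.0, v_gs=0.0):
    """Relative subthreshold current, I = i0 * 10**((v_gs - v_th) / slope).

    v_th  : threshold voltage in volts
    slope : slope factor S in volts per decade (0.1 V/decade is an assumed value)
    i0    : normalization current (assumed 1.0, so results are relative)
    v_gs  : gate-source voltage; 0 V models an "off" transistor
    """
    return i0 * 10.0 ** ((v_gs - v_th) / slope)


if __name__ == "__main__":
    # Each 100 mV reduction in V_th raises the off-current by one decade.
    for v_th in (0.5, 0.4, 0.3, 0.2):
        print(f"V_th = {v_th:.1f} V -> relative leakage {subthreshold_leakage(v_th):.1e}")
```

With these assumed values, lowering the threshold by one slope factor S multiplies the off-state leakage by ten, which is the trend Fig. 2 illustrates for the inverter.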
The constraint on the leakage sets a minimum voltage for digital operation in a given technology. Going beyond that requires an aggressive energy-management strategy.
Addressing uncertainty typically requires margins to ensure functionality and correctness. This then introduces techniques to reduce these margins while guaranteeing correct operation. Both leakage and uncertainty management can be performed at many layers of the design hierarchy. There have been a few major techniques commonly used to reduce the transistor leakage current in caches and/or SRAMs [40] :
• Gated-V SS: using a high-threshold "header/footer" transistor to disconnect a cell, row, or way in the cache from V DD (Fig. 12), at the cost of losing the cell's stored data.
• Multi-threshold CMOS (MTCMOS): as explained in the next section.
• Drowsy caches: lowering V DD to approximately 1.5 times V th to preserve the cell's stored data, and its recently introduced cousin "dark silicon" (not all parts of the chip are simultaneously powered on at nominal voltage) [42].
These are discussed in further detail in our previous work [40] and by others [42].
3) PERFORMANCE
To satisfy the requirements of high performance during active periods and low leakage during standby, several process modifications or leakage-control techniques have been introduced in CMOS processes. These techniques discussed in depth by Rabaey et al. [41] as well as others [40] can be summarized as:
Multi-threshold CMOS (MTCMOS): a device with low threshold is used on critical paths to achieve [45] . The exVAR-TX is a tool for exploring such architectural variations.
4) POWER DISSIPATION
Power consumption is becoming a limiting factor on how much memory can be integrated on a single die. As mentioned above, as in logic circuits, reducing the voltage levels is one of the most effective techniques to reduce power dissipation in memories. In contrast to logic, however, voltage scaling might run out of steam earlier. Data retention and reliability issues make the scaling of the voltages to the deep sub-1-V level quite a challenge. In addition, careful control of the capacitance and switching activity, as well as minimization of the on-time of the peripheral components, is essential [3].
a: SRAM ACTIVE POWER REDUCTION
To obtain a fast read, the voltage swing on the bitline is made as small as possible, typically between 50 mV and 150 mV. The resulting signal is transmitted to the sense amplifier for restoration. Since the signal is developed as a result of the ratio operation of the bitline load and the cell transistor, a current flows through the bitline when the wordline is activated. Limiting the activation time of the wordline and the bitline swing helps to keep the active dissipation of SRAMs low. Signal-to-noise issues ultimately determine how short the wordline activation time, how small the bitline swing, and how short the on-time of the sense amplifier can be made [3].
The situation is worse for the write operation, since BL and BL-bar have to make a full excursion. Reduction of the core voltage (in addition, of course, to the bitline partitioning approach described in the Appendix) is the most effective remedy for this. Ultimately, the reduction of the core voltage is limited by the mismatch between the paired MOS transistors in the SRAM cell. Even when designed to be completely identical, the MOS transistors in the cell differ because of process and device variations. The ever-increasing V th mismatch is the most important culprit. It is caused by implant non-uniformities, channel length and width variations, and even random microscopic fluctuations of dopant atoms in very small devices. Even if process engineers manage to keep the threshold variations constant, their relative importance is rapidly increasing in light of decreasing supply and threshold voltages. The mismatch between the transistors causes the cell to become asymmetrical, biasing it toward either the 1 or the 0 state. This makes the cell substantially less reliable in the presence of noise and during read operations. Stringent control of the MOSFET characteristics, either at process time or at run time using techniques such as body biasing, is essential in the low-voltage operation mode [3]. If this cannot be accomplished, boosted voltage schemes on the selected wordline (V WL > V DD ) at run time, extra redundancy, and error correction are needed.
Reducing the supply voltage also impacts the memory access time. Lowering the device threshold is only an option if the resulting increase in leakage is dealt with effectively, especially in non-active cells, as further discussed in the following section.
b: SRAM STATIC POWER REDUCTION
In principle, an SRAM array should not have any static power dissipation. Yet, the leakage current of the cell transistors is becoming a major source of retention current (the current required to maintain cell values). While this static current has historically been kept low, a considerable increase has been observed in recent embedded memories due to subthreshold leakage. This is illustrated in Fig. 3, which plots the standby current of the same 64KB embedded 6T-SRAM memory implemented in a 45-nm and a 16-nm CMOS process. An increase of the static current by a factor of almost 12 for the same supply voltage can be observed. Fig. 3 shows that the static power, which used to be negligible (as compared to the dynamic power) in older nodes, has gained importance over the last few years. Technology scaling is increasing both the absolute and relative contribution of static power dissipation [40]. According to the ITRS [10], in the next several processor generations, leakage may constitute a much bigger portion of the total power dissipation than in today's processor generation (unless new innovative techniques prevent this undesirable rise in leakage current). Techniques to reduce the retention current of existing SRAM memories are thus necessary, as discussed in the Appendix.
c: POWER/ENERGY MINIMIZATION TECHNIQUES
With a fixed architecture of the data-path, speed, area, and power can be traded off through the choice of the supply voltages, transistor thresholds, and device sizes. This opens the door for a large variety of power minimization techniques, which are summarized in Table 1 [46] .
As shown in Table 1 , lowering the supply voltage is a very attractive technique for power reduction: It not only reduces the energy consumed per transition in a quadratic way-albeit at the expense of performance-but also reduces the leakage current. On the other hand, increasing the threshold voltage mainly reduces the leakage component.
Therefore, the low-threshold devices are preferably used in timing-critical paths, while the high-threshold devices are used for data storage. The thresholds between the low and high are used anywhere else, such as in peripheral circuitries, as we pointed out in Section II.B.3.
TABLE 1. Power minimization techniques [46].
Considering the control knobs used to keep power and energy consumption in check, it should now be clear that, compared to the effect of supply voltage scaling on energy efficiency, the effect of the threshold voltage is not straightforward. Increasing the threshold voltage decreases the amount of leakage current exponentially. However, an increased threshold voltage degrades performance exponentially too. Consequently, the impact of a threshold voltage change on static energy is determined by the ratio of the reduced leakage current to the increased operating delay. If the gain in leakage reduction is larger than the loss in performance, the overall energy efficiency improves by replacing the existing transistors with higher-V th transistors. Conversely, if the impact of delay degradation exceeds the gain in leakage suppression, the energy efficiency improves when lower-V th devices are adopted. This counter-intuitively implies that increasing the threshold voltage of high-V th devices beyond an optimal point results in higher (rather than lower) energy (E) consumption, simply because the operation with higher-V th devices takes a longer time (t) than the operation with lower-V th devices.
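The existence of an optimal threshold voltage can be illustrated with a small numerical sketch. It is not the model used in this paper: it assumes an exponential subthreshold leakage law with slope factor S and, as a stand-in for the delay dependence, an alpha-power-law delay t ∝ V DD /(V DD − V th )^α. All parameter values and function names are hypothetical. Static energy per operation is taken as E = V DD · I leak · t.

```python
import math


def static_energy(v_th, v_dd=0.7, slope=0.1, alpha=1.3):
    """Relative static energy per operation, E = V_DD * I_leak * t.

    Assumed models (illustrative only):
      I_leak ~ 10**(-v_th / slope)            exponential subthreshold leakage
      t      ~ v_dd / (v_dd - v_th)**alpha    alpha-power-law gate delay
    """
    leakage = 10.0 ** (-v_th / slope)
    delay = v_dd / (v_dd - v_th) ** alpha
    return v_dd * leakage * delay


def best_v_th(v_dd=0.7, slope=0.1, alpha=1.3, steps=2000):
    """Grid search for the V_th in (0, v_dd) that minimizes static energy."""
    candidates = [v_dd * i / steps for i in range(1, steps)]
    return min(candidates, key=lambda v: static_energy(v, v_dd, slope, alpha))


if __name__ == "__main__":
    v_opt = best_v_th()
    # Setting d(ln E)/d(V_th) = 0 for the assumed models gives the closed form below.
    v_closed = 0.7 - 1.3 * 0.1 / math.log(10.0)
    print(f"grid optimum: {v_opt:.4f} V, closed form: {v_closed:.4f} V")
```

Under these assumptions the energy falls as V th rises until the delay penalty takes over, after which any further increase in V th raises the static energy, which is the counter-intuitive behavior described above.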
Transistor sizing is another effective power-reduction method for data-paths, where the majority of energy is consumed inside the block rather than in driving the external load [46]. As the effective capacitance (C EFF ) is the product of the physical capacitance (C L ), determined by sizing, and the switching activity (P 0→1 ), minimization of both factors is another power-reduction method (C EFF = P 0→1 C L ). This observation leads to an interesting design challenge [46]. Assume we have to minimize the energy dissipation of a circuit with a specified lower bound on the performance. An attractive approach is to lower the supply voltage as much as possible and to compensate for the loss in performance by increasing the transistor sizes. Yet, the latter causes the capacitance to increase. At a low enough supply voltage, the capacitance increase may start to dominate and cause energy to increase with a further drop in the supply voltage.
A wide variety of logic and architectural optimizations exist that reduce the activity at no expense in performance. One such technique is resource allocation: using a dedicated, specialized, non-multiplexed CMOS circuit, rather than a multi-task multiplexed circuit, for such specific functions as counting (in control and/or peripheral units) or logical-effort drivers (for wordlines), at negligible extra cost in area. This is because multiplexing multiple operations on a single hardware unit has a detrimental effect on power consumption. Besides increasing the physical capacitance, it can also increase the switching activity. Another technique is providing glitch-free paths through path balancing, as dynamic hazards such as glitching contribute to the dissipation in complex structures such as memories [46], [47].
It is also beneficial to use low-swing busses for the short paths (i.e., in the periphery), as shorter paths essentially waste energy because there is no reward for finishing early. Busses typically present the largest capacitance, and it is more advantageous to lower the supply voltage on bus drivers than on any other logic gate. Since busses have the largest capacitance, the payoff in power savings is the largest for the same increase in delay [46].
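The trade-off between supply scaling and up-sizing can be made concrete with a short sketch. It uses the relation C EFF = P 0→1 C L from the text together with the standard expression for dynamic energy per cycle, E = C EFF V DD ^2; the function names and the numerical values are illustrative assumptions, not measured data.

```python
def effective_capacitance(p_switch, c_load):
    """C_EFF = P_0->1 * C_L (switching activity times physical capacitance)."""
    return p_switch * c_load


def dynamic_energy(p_switch, c_load, v_dd):
    """Average dynamic energy per cycle, E = C_EFF * V_DD**2 (relative units)."""
    return effective_capacitance(p_switch, c_load) * v_dd ** 2


if __name__ == "__main__":
    base = dynamic_energy(p_switch=0.5, c_load=1.0, v_dd=1.0)
    # Hypothetical case 1: V_DD lowered by 20%, devices up-sized so C_L grows 1.3x.
    mild = dynamic_energy(p_switch=0.5, c_load=1.3, v_dd=0.8)
    # Hypothetical case 2: same V_DD, but the up-sizing needed costs 1.6x in C_L.
    heavy = dynamic_energy(p_switch=0.5, c_load=1.6, v_dd=0.8)
    print(f"relative energy: mild up-sizing {mild / base:.3f}, "
          f"heavy up-sizing {heavy / base:.3f}")
```

In the first hypothetical case the quadratic voltage benefit outweighs the added capacitance; in the second, the capacitance growth dominates and the energy ends up above the baseline.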
Although effective, all such mitigation techniques are themselves subject to variations and uncertainty. This further highlights the value of our proposed exVAR-TX methodology, which can find the optimum variability-aware architecture with the simultaneous adaptation of multiple reliability-boosting techniques while minimizing variability and energy dissipation within a specified lower bound on performance. The optimization capability of exVAR-TX can help diminish, or possibly even eliminate, the need to modify the chip architecture because of variations and/or the strained-silicon effect [88]. Such modifications are commonly carried out through the growing deployment of self-tracking and adaptive design techniques, which improve yield by tracking and subsequently mitigating the factors responsible for post-silicon variation.
C. VARIATION
Systematic and random variations in process, supply voltage, and temperature (P, V, T) are a major challenge to the future design of SRAM caches for high-performance microprocessors. Technology scaling beyond 90-nm has caused higher levels of device parameter variation, which have changed the design problem from deterministic to probabilistic over the last few years [51], [52]. This variation will be even higher going towards the 16-nm or 8-nm technology nodes. The demand for lower power requires supply voltage scaling, and thus voltage variations are becoming a significant part of the overall challenge [9]. Similarly, the quest for increasing operating frequency has resulted in significantly higher junction temperatures and within-die temperature variation.
In addition, new device and interconnect architectures as well as new methods of fabricating them are changing sources and nature of variability [32] .
The sources of hardware variation can be classified into four groups: 1) Manufacturing, 2) Environment, 3) Vendor, 4) Aging. These types of variations are explained in further detail by Samandari-Rad et al. [2] , Dutt et al. [28] , Gupta et al. [32] , and Reddi et al. [35] , among others.
D. ENHANCED MODEL OVERVIEW
1) ASSUMPTIONS AND IMPLEMENTATION VARIATION
We extend our VAR-TX model, previously used to analyze D2D and WID device threshold voltage (V th ), channel length (L), and supply voltage (V DD ) variations, to predict the delay and delay variation [1], [2]. The newly enhanced model, "exVAR-TX", additionally allows efficient computation of the leakage and power distribution of the SRAM array using leakage and power sensitivities, respectively. As with the delay case in VAR-TX [1], in exVAR-TX we assume independent WID variations of 8.98% for V th , 4.84% for L, and 2% for V DD , and independent D2D variations of 4.01% for V th and L, and 2% for V DD , based on ITRS predictions [10].
Whereas our within-die systematic threshold voltage variation (V thWID,sys ) is inversely related to the square root of L × W [1], [94], the major source of our within-die random threshold voltage variation (V thWID,rand ) is random dopant fluctuation, which dominates our overall within-die variation of the threshold voltage (V thWID ) because of the rather heavy doping concentration of the transistor channel needed to control short-channel effects [1], [40]. To avoid degrading the static noise margin (SNM), which would impact the stability of our planar CMOS 6T-SRAM (beyond the conventional level [1]), we have restrained the down-scaling of V DD to the range shown in Table 4, depending on the mode of operation at a given time.
exVAR-TX uses static power sensitivity for gates on noncritical paths, and dynamic power sensitivity for gates in critical paths. Both of these estimations use a quad-tree structure [48] in a methodology explained in our previous work [1] for estimation of access-time and its variation, D2D variability, WID variability, and combined WID and D2D delay sensitivity and delay variation.
Given input specifications of area and speed constraints, SRAM size and shape, number of columns, and word-size, exVAR-TX enables the user to predict the leakage/power and leakage/power variability (in addition to predicting delay and delay variation) in upcoming 16-nm on-chip conventional 6T-SRAMs.
The abstract steps of the derivation process for our noncritical path-based leakage and critical path-based dynamic power approach to statistical timing analysis, using the enhanced exVAR-TX include:
1. Compute the per-gate power sensitivities and store them in tables.
2. Compute the D2D component of the non-critical and the critical path for static and dynamic power, respectively.
3. Express the WID component of the non-critical and critical path for static and dynamic power variation, respectively, as an analytical expression of the device parameter variation.
4. Combine the two components, D2D and WID, of the leakage (or the power) variations to obtain the joint leakage (or the joint power) distribution.
5. Optimize the leakage (or the power) through the examination of all feasible architectures to achieve maximum energy efficiency for a given minimum performance and yield.
The details of our D2D and WID modelling, as well as the independent variation values assigned to each of V th , L, and V DD based on ITRS [10] predictions, are explained in our previous work [1].
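Step 4 can be illustrated with a first-order sketch. This is a simplification and not the quad-tree formulation of [1]: it assumes that the quoted WID and D2D percentages are relative standard deviations, that all contributions are independent, and that the power responds linearly to each parameter through a sensitivity coefficient. The sensitivity values in the usage example are hypothetical.

```python
import math

# Relative variations quoted in the text, expressed as fractions.
# Treating them as standard deviations is an assumption of this sketch.
SIGMA_WID = {"vth": 0.0898, "l": 0.0484, "vdd": 0.02}
SIGMA_D2D = {"vth": 0.0401, "l": 0.0401, "vdd": 0.02}


def combined_sigma(sensitivity, sigma_d2d=SIGMA_D2D, sigma_wid=SIGMA_WID):
    """First-order relative sigma of power from independent D2D and WID terms.

    sensitivity maps a parameter name to d(ln P)/d(ln x); for independent
    contributions the variances add:
        var = sum_x s_x**2 * (sigma_d2d_x**2 + sigma_wid_x**2)
    """
    variance = 0.0
    for name, s in sensitivity.items():
        variance += s ** 2 * (sigma_d2d[name] ** 2 + sigma_wid[name] ** 2)
    return math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical sensitivities of leakage power to each parameter.
    example = {"vth": -3.0, "l": -1.0, "vdd": 2.0}
    print(f"relative sigma of power: {combined_sigma(example):.3f}")
```

A linearized combination of this kind is only reasonable for small variations; the Monte Carlo comparison described next is what checks the accuracy of the actual model.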
We verified the accuracy of the newly enhanced model assumptions through Monte Carlo simulations and validated the model optimization capability by comparing our leakage/power results with those obtained by the IBM group [33] and others [19], [49], [50]. In order to incorporate not just the correlated, but also the uncorrelated and partially correlated paths, we considered more than one hundred different selected worst-case critical paths spanning different regions in our quad-tree modeling [1]. We found these worst-case critical paths from experimental results, similar to the method introduced by Bowman et al. [8]. Our Monte Carlo simulations show that this approach produces accurate results, as discussed in Section IV.
2) OPTIMIZATION
In addition to computing the leakage/power of a given SRAM system with reliability-boosting schemes in place, exVAR-TX performs exhaustive computations and comparisons based on the user's entries (e.g., SRAM size, word-size). Using its embedded library of lookup tables (constructed from the linearized device/cell leakage/power for different configurations), it selects, from the modelled alternatives, the minimum-leakage/power architecture/organization that satisfies the desired speed and area requirements. exVAR-TX completes this first phase of our optimization within thirty seconds, even for large SRAM circuits with a very large number of critical parameter fluctuations. exVAR-TX also provides a measure of the expected variability in this minimum leakage/power.
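The first optimization phase amounts to a constrained exhaustive search, which can be sketched as follows. The data structure, field names, and numbers are hypothetical placeholders for the lookup-table estimates; the sketch shows only the selection logic, not how exVAR-TX computes the estimates.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One hypothetical SRAM organization with precomputed estimates."""
    name: str
    power: float  # estimated leakage/power, arbitrary units
    delay: float  # estimated access time, arbitrary units
    area: float   # estimated area, arbitrary units


def pick_min_power(candidates: Iterable[Candidate],
                   max_delay: float,
                   max_area: float) -> Optional[Candidate]:
    """Return the lowest-power candidate meeting both constraints, or None."""
    feasible = [c for c in candidates
                if c.delay <= max_delay and c.area <= max_area]
    return min(feasible, key=lambda c: c.power) if feasible else None


if __name__ == "__main__":
    # Made-up organizations of the same capacity (rows x columns).
    options = [
        Candidate("256x2048", power=1.00, delay=5.0, area=2.0),
        Candidate("512x1024", power=0.50, delay=9.0, area=2.0),
        Candidate("1024x512", power=0.80, delay=4.0, area=2.5),
    ]
    print(pick_min_power(options, max_delay=6.0, max_area=3.0))
```

Because every candidate is evaluated, the search returns the global minimum among the modelled alternatives, at the cost of a run time that grows linearly with the number of candidates.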
Once the best architecture is found by our exVAR-TX modeling methodology in the first phase of the optimization process, we additionally perform some minor tuning by hand (e.g., adjusting the bitline/interconnect width and pitch) in the second phase of our optimization process to further maximize the energy-efficiency (if possible at all) while still meeting the reliability, performance and yield requirements.
III. RELIABILITY ISSUES IMPACTING SRAM POWER
As the semiconductor industry continues to push the limits of sub-micron technology, manufacturing reliable SRAM arrays with small yield losses becomes progressively more difficult [33] .
As a result, it is imperative for designers to adopt variability-aware memory management techniques to minimize power consumption while meeting the increased system performance/responsiveness requirement. This section briefly discusses the most prominent failure mechanisms that limit the reliability and power of modern SRAMs and provides the most relevant effective mitigation techniques pertinent to 16-nm 6T-SRAMs.
A. IR DROP AND Ldi/dt
Power distribution systems are essentially resistive and inductive in nature. As with most conducting metals, the resistivity of the power distribution network (PDN) increases linearly with the surrounding temperature, which rises with power dissipation. Therefore, in order to accurately characterize the performance of high-power integrated circuits (ICs), packages and printed circuit boards (PCBs), it is essential to account for electrical and thermal effects and the couplings between them [53].
The increase in clock frequency and edge rates, as well as the continuous downscaling of feature size and three-dimensional interconnect technologies in high-speed systems, has made power integrity (PI) and signal integrity (SI) the major culprits causing chip failures [54].
Two of the major PI issues are the IR drop [53] and the inductive loss Ldi/dt [55]. IR drop is caused by the finite resistivity (R) of metals and the current (I) drawn from the power and ground (P/G) planes. Ldi/dt is caused primarily by high-frequency dynamic switching, high inductive loads, and package inductance.
Chip designs housing SRAMs are typically more susceptible to IR drop than to Ldi/dt, especially when the integrated circuit (IC) supply voltage V DD decreases with the scaling of silicon processes. Due to the proportional dependency of IR drop to the resistance of the power/ground plane, IR drops will increase when the resistance of the power/ground plane increases with the shrinking of complex geometries.
To further complicate matters, the temperature rise caused by current-carrying interconnects also has a tremendous impact on IC performance and reliability. Current flow in a very large-scale integration interconnect causes power dissipation, which is referred to as Joule heating or self-heating. The Joule heating effects become increasingly significant with the shrinking scale of the silicon process because of the increase of on-chip power density, the inclusion of more metal layers with higher densities, and the use of dielectric materials with lower thermal conductivities. Therefore, the temperature effects on the electrical performance of the three-dimensional system should also be carefully considered in the electrical design [56].
Moreover, owing to the linear temperature dependency of metal resistivity, the non-uniform temperature distribution affects the electrical performance of the power delivery network (PDN) and substantially increases the IR drops in the power/ground planes [57]. In particular, modules that draw more current from the PDN, owing to higher power demands, will suffer worse IR drop effects. Furthermore, the decreases in IC supply voltage, together with larger IR drops, will significantly reduce the noise margins, consequently making electronic devices more vulnerable to direct current (DC) noise. To simulate this closed coupling loop between electrical and thermal analyses, a non-overlapping, non-conformal domain decomposition method (DDM) for thermal-aware DC IR drop co-analysis has recently been proposed [54]. Fig. 4 illustrates the overall concept and the methodology used to mitigate much of the IR drop reliability issue, prevent performance loss, and minimize the total power of circuits subject to IR drop. Fig. 4 presents a flow diagram for the thermal-aware IR drop co-analysis. First, the IR drop/conduction analysis is performed at room temperature to obtain the initial voltage distribution. Subsequently, the computed dissipation owing to DC voltage leakage power is imposed as the heat source for steady-state thermal analysis. Moreover, the power profiles P = J · E of the chip and memory modules are included as heat sources in the temperature computation. Once the temperature distribution within the device is calculated, it is used to update the material properties, specifically the conductivity (resistivity) of the conducting materials, since electrical resistance has a linear temperature dependence. For instance, when the temperature of the device increases from room temperature to 80 °C, the electrical resistivity will increase by more than 40 percent.
Consequently, the updated values of the resistivity within the device lead to a new/updated voltage distribution. The fully coupled multi-physics, electric and thermal analysis is repeated until the temperature-dependent IR drop and thermal distributions converge with negligible errors. Interested readers are referred to Shao et al. [53] , [58] for details.
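The coupled loop described above can be sketched for a single lumped resistive segment. This is a deliberately simplified stand-in for the field-solver-based DDM co-analysis of [54]: the lumped thermal resistance `theta_th`, the linear coefficient `alpha`, and all numeric values are hypothetical and chosen only to show the fixed-point iteration.

```python
def thermal_aware_ir_drop(i_load, r0, alpha, theta_th, t_ref=25.0,
                          t_amb=25.0, tol=1e-6, max_iter=100):
    """Iterate between an electrical and a thermal solve until the
    temperature converges, for one lumped resistive segment.

    r0       resistance at t_ref (ohm)
    alpha    linear temperature coefficient of resistance (1/degC)
    theta_th lumped thermal resistance (degC/W), an illustrative assumption
    """
    temp = t_amb
    for _ in range(max_iter):
        r = r0 * (1.0 + alpha * (temp - t_ref))   # update material property
        power = i_load ** 2 * r                   # Joule heating, P = I^2 R
        new_temp = t_amb + theta_th * power       # steady-state thermal solve
        converged = abs(new_temp - temp) < tol
        temp = new_temp
        if converged:
            break
    r = r0 * (1.0 + alpha * (temp - t_ref))
    return i_load * r, temp                       # (IR drop in V, temperature in degC)

# Hypothetical values: 2 A through 10 mOhm, alpha = 0.004 /degC, 50 degC/W.
drop, temp = thermal_aware_ir_drop(2.0, 0.010, 0.004, 50.0)
```

The converged IR drop is larger than the room-temperature value `i_load * r0`, which is the effect the co-analysis is meant to capture.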
To reduce the effect of temperature and V DD variation in our design, we first determined the resistive and inductive loads and identified the routes with major current draw during the design. We then eliminated most of the temperature variation, the static IR drop of the supply voltage, and the dynamic Ldi/dt variation by adding decoupling capacitors (at the cost of increased gate oxide area and consequently increased oxide leakage) in the empirically found crucial locations of our SRAM circuits [9]. After reducing the initial temperature and V DD variation, we used our thermal-aware CAD design tools, similar to the method discussed above, for further tuning to achieve maximum energy-delay efficiency.
B. TEMPERATURE
Power density in modern integrated circuits (ICs) continues to increase rapidly. In turn, larger power densities result in higher peak temperatures which can reduce chip reliability and further increase leakage power consumption.
Junction leakage currents are caused by thermally generated carriers, which increase exponentially with junction temperature. For example, at 85 °C (a commonly imposed upper bound for junction temperatures in commercial hardware), the leakage currents increase by a factor of 60 over their room-temperature values [37]. It is mainly because of this temperature sensitivity, together with related mechanisms such as hot carrier injection (HCI), IR drop, and NBTI, that keeping the overall operating temperature of newer circuits low has become a mandatory goal. This is accomplished by limiting the power dissipation of the circuit using thermal-aware CAD design tools, temperature control techniques (such as throttling), chip packages that support efficient heat removal mechanisms, or a combination of these, as the temperature is a strong function of dissipated heat and its removal mechanism. Fig. 5 explains the throttling method, used by many commercial processors to control the maximum temperature and within-die temperature variations. When the temperature rises above the maximum limit set by the frequency and power, the operating frequency is lowered, followed by the die V DD . Subsequently, the power consumption drops, which lowers the on-die temperature. When the die comes out of power saving mode, V DD is raised, followed by the frequency [9]. The consideration of the hot carrier effect in our modeling, along with several other temperature mitigation techniques such as temperature-insensitive voltage design (VINS), tunable replica circuits (TRC), and reverse temperature dependence (TRD), among others, is discussed in Samandari-Rad and Hughey [40].
C. NBTI
With the continuous scaling of CMOS technology, negative bias temperature instability (NBTI) is emerging as one of the major reliability degradation mechanisms [59]. NBTI is an aging effect which gradually increases the threshold voltage V th of pMOS transistors when they are negatively biased (i.e., V GS = −V DD ) under high operating temperatures, thus increasing the gate delay. Reducing the increased V th of the affected pMOS transistors (through such techniques as negative V BS ), while helpful in keeping NBTI in check, is harmful for leakage power, especially in this era of growing process and device variations [49].
With the introduction of High-K Metal gates, a new degradation mechanism, positive bias temperature instabilities (PBTI), has also appeared. The PBTI affects the NMOS transistor when positively biased [2] , [60] . Since, in this particular case, no interface states are generated and 100% of the V th degradation may be recovered, the impact of PBTI is not as severe as that of NBTI and thus ignored in this paper.
Traditional worst-case design leads to an overly pessimistic estimation. Instead, statistical static timing analysis (SSTA) is an effective technique to evaluate the increasing variations [61]. One such technique, portable to our own model for process and operation variations, was recently proposed by Chen et al. [49]. By utilizing a dual V DD and dynamic V DD scaling scheme, we were able to mitigate NBTI-induced degradation by an average of at least 50% (compared with the pessimistic guard-banding and single V DD scaling approach), reduce power/energy, and predict the lifetime of a 6T-SRAM cell simultaneously. The high V DD is used for devices in critical paths to ensure their performance, as well as to compensate for NBTI-induced degradation, while the low V DD is used for devices on non-critical paths to reduce power. During circuit operation, the optimal V DD values are dynamically determined according to the aging-aware timing constraint, as explained in detail by Chen et al. [49]. The calculated power is the average power during the overall circuit lifetime, which is proportional to the energy consumption.
D. EM
Electromigration (EM) is a major failure mechanism that greatly affects the long term reliability of interconnects and, therefore, VLSI chips [62] . In modern technologies, EM not only affects the power and ground (P/G) lines [63] but also the bitlines of SRAM arrays as they also carry unidirectional currents [50] .
EM is a gradual material transport caused by the movement of Cu+ ions due to the momentum transfer between electrons and metal atoms. As the transported matter accumulates, a void may nucleate and grow, leading to an interconnect failure [64]. Therefore, EM-caused wire damage occurs when a large-density unidirectional current flows on a wire. Lines with large bi-directional currents will hardly suffer EM failure due to the healing effect caused by inverse momentum transfer [64]. Thus, typical IC signal lines, except for SRAM bitlines, are usually immune to EM failures. In contrast, P/G lines are more prone to EM.
In SRAMs, current flows on a bitline during both write and read operations. The write currents are mostly balanced and do not contribute to EM, but the read currents are unbalanced (unidirectional) and, therefore, contribute to EM.
When a read operation is performed, the bitline is charged by the same PMOS transistor placed at one end and discharged by an SRAM cell placed anywhere within the bitline range, leading to a unidirectional current path. Since the current of read operation is a dominant component, it is possible that bitlines may be affected by EM [50] .
The subthreshold leakage current has a similar unidirectional path. It consists of leakage current from all SRAM cells connected to a single bitline and its value is greatly affected by the threshold voltage and channel length of each bitline access transistor. Since these two parameters (V th , L) are significantly affected by process variation, it is clear that their impact on subthreshold leakage current may result in both bitline EM reliability degradation and SRAM functional failure.
Even though affected by EM, lines do not break immediately and can still function correctly for some time. This time period is called the EM lifetime. The mean time to failure (MTTF) is well predicted by Black's equation [65],
$$\mathrm{MTTF} = A\, j^{-n} \exp\!\left(\frac{E_a}{kT}\right) \qquad (2)$$
where A is a constant, j is current density, Ea is activation energy, k is Boltzmann's constant, T is temperature and n is an experimental constant whose value is between 1 and 2.
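As a small numerical illustration, Black's equation in its usual form, MTTF = A · j^(−n) · exp(Ea/(kT)), can be used to compare two operating points, since the constant A cancels in a ratio. The activation energy of 0.9 eV and the operating points below are hypothetical values chosen for the example, not parameters of this work.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def black_mttf(a, j, n, ea_ev, temp_k):
    """Black's equation: MTTF = A * j**(-n) * exp(Ea / (k*T))."""
    return a * j ** (-n) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))

def mttf_ratio(j1, t1_k, j2, t2_k, n, ea_ev):
    """MTTF(j2, T2) / MTTF(j1, T1); the constant A cancels."""
    return black_mttf(1.0, j2, n, ea_ev, t2_k) / black_mttf(1.0, j1, n, ea_ev, t1_k)

# Hypothetical comparison: doubling the current density at a fixed
# temperature of 85 degC (358.15 K) with n = 2.
ratio = mttf_ratio(1.0, 358.15, 2.0, 358.15, n=2.0, ea_ev=0.9)
```

With n = 2, doubling the current density at constant temperature cuts the predicted lifetime to one quarter, which is why the bitline width (and hence j) matters for EM reliability.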
There is also a lower bound for the length of wires affected by EM, Eq. (3) [66]. It is known as the Blech length, and the probability that a wire shorter than this bound will fail due to EM is extremely small. In such wires, the mechanical stress buildup causes a reverse migration process, which reduces or even compensates the effective material flow caused by EM. The Blech length is stated in terms of a critical product jL crit of current density and interconnect length (via to via), derived from the equilibrium condition between the electron-collision-induced atom flux (EM) and the back-stress flux. The critical value for a copper dual damascene structure is around 0.37 A/µm [67], and any given interconnect line with jL greater than this value will suffer EM failure. In our paper, we assume 0.37 A/µm as the threshold value between mortal and immortal wires.
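The mortal/immortal classification reduces to comparing a wire's jL product with the critical value. The following sketch shows that check; the wire geometry (36 nm wide, 72 nm thick) and the 10 µA current are hypothetical inputs used only to exercise the function.

```python
JL_CRIT_A_PER_UM = 0.37  # critical jL product assumed in the text (A/um)

def is_immortal(current_a, width_um, thickness_um, length_um,
                jl_crit=JL_CRIT_A_PER_UM):
    """Return True if the wire's jL product does not exceed jl_crit."""
    j = current_a / (width_um * thickness_um)   # current density in A/um^2
    return j * length_um <= jl_crit

# Hypothetical wire: 36 nm wide, 72 nm thick, carrying 10 uA.
short_wire_ok = is_immortal(10e-6, 0.036, 0.072, 50.0)    # 50 um long
long_wire_ok = is_immortal(10e-6, 0.036, 0.072, 200.0)    # 200 um long
```

For this geometry the critical length is a little under 100 µm, so the 50 µm segment is classified as immortal and the 200 µm segment as mortal.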
This assumption allows us to use Eq. (3) [33] for upper bound and Eq. (4) [33] for lower bound in conjunction with our proposed process variation model exVAR-TX and the scheme outlined by Zhong et al. [33] to find an optimal width of bitlines (36 nm) and P/G lines (36 * 2 nm) in our design for 16-nm 64KB 6T-SRAM that minimizes failure probability by trading off EM and functional failure to achieve maximum possible SRAM yield.
where W cell is the width of SRAM cell, W space is the width of space between bitlines, W DRAM ,1/2 is the DRAM half pitch (the minimum width of the space between bitlines and P/G lines already decided by the technology) and K is the current ratio between P/G lines and bitlines (we set K to 2.2, based on our empirical results).
E. MEMORY ERRORS
Memory faults can be classified by their persistence (temporal behaviors) and their causes. With respect to persistence, a memory fault can be permanent or transient.
Permanent faults persist indefinitely in the system after occurrence, while transient faults manifest for a relatively short period of time after occurrence. Causes of memory faults can be hard or soft. Hard faults are static and caused by device failure or wear-out failure. In contrast, soft faults (also called soft-errors) are dynamic and are typically caused by the proton-induced, neutron, and most importantly alpha particle radiation coming from the operating environment [28] , [68] .
Memories are increasingly becoming the most vulnerable components of systems-on-chip to soft-errors, since they use minimum-size transistors and may occupy the majority of on-chip real estate [10], [11]. Aggressive voltage scaling makes the situation even worse, as it reduces the capacitance that keeps the charge in a single cell and therefore further increases its vulnerability to low-energy alpha particles or cosmic rays [69]. As alpha or cosmic particles come into contact with a silicon device, the probability of a single event failure increases with decreasing charge within a memory cell [69], causing transient faults such as single-event upsets (SEU) or multiple-event-upsets (MEU) in memories.
The root of most types of SRAM error and reliability problems is the cell static noise margin (SNM). At low supply voltages, noise margins are reduced, increasing susceptibility to data corruption (exhibiting dynamic and random behavior) caused by environmental factors. Therefore, variability in cell noise margins requires a statistical approach to designing a reliable memory array and to choosing the minimum supply voltage, which must be increased to maintain yield under large variations [28].
In short, there is a tradeoff between power (as it depends on supply voltage) and fault tolerance overheads (in terms of area, performance, and power).
Several successful attempts have been made to achieve reliability through redundancy [70] , partitioning applications' address space [71] , and using simple [27] or sophisticated [73] fault tolerance schemes, among others.
At the cell and circuit levels, engineers have designed larger memory cells using more transistors and/or larger transistors to increase mean noise margins and/or reduce margin variability, but these come at the cost of reduced area efficiency and sometimes power. Several of these cell- and circuit-level techniques include 8T [22], [74], [75], 10T [76], and Schmitt Trigger (ST) [77] SRAM cells.
At the architectural level, several commonly-used techniques have been proposed to improve the reliable memory design. One example is fault-tolerant memory designs-which have often used simple techniques such as adding redundant rows/columns to the memory array [78] or applying memory down-sizing techniques by disabling a faulty row or cache line (block) [79] .
Information redundancy via error coding is another example that helps improve the reliability of memory components. Along the same lines, a wide range of error detection codes (EDC) and error correction codes (ECC) have been used [80]. Typically, EDCs are simple parity codes, while the most common ECCs use Hamming codes [81].
For an in-depth discussion and the simulation results supporting our "memory errors" topic, the interested reader is referred to the references used throughout this section.
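As a concrete instance of a Hamming-type ECC, the following minimal sketch implements the textbook Hamming(7,4) single-error-correcting code. It is included only to illustrate the principle; practical memory ECCs operate on wider words, and this is not the scheme evaluated in this work. The function names are hypothetical.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit; return (data_bits, syndrome)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome

# Example: a single-event upset flips bit 5 of a stored codeword.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1
recovered, position = hamming74_decode(stored)
```

The syndrome directly gives the position of the flipped bit, so a single upset per codeword is corrected at the cost of three check bits per four data bits.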
IV. SIMULATION RESULTS AND ANALYSIS
As discussed earlier, SRAM energy is determined by the choices of threshold voltage, transistor sizing, and supply voltage impacting leakage current, dynamic current and critical path delay. Reliability-related schemes also play an important role in performance boosting and energy reduction. The use of an optimum architecture after incorporating the two former steps makes further improvement to the energy efficiency. In this section, we will illustrate the effects of these parameters on SRAM energy.
First, in the next four sections (IV.A-IV.D), we illustrate the impact of process and operation variation on leakage and power, show the associated Monte Carlo simulation results for verification, and demonstrate the validation of the architectural optimization capability of exVAR-TX. Then, in Section IV.E, we show the energy-delay-product (EDP) of a 16-nm inverter, followed by the comparison between the EDP of different 16-nm 64KB SRAM architectures, which further validates the positive impact of selecting the optimum architecture for maximum EDP efficiency. Finally, in Section IV.F, we progressively illustrate the impact of the power minimization techniques discussed throughout this paper and highlight the additional energy reduction that can be obtained by selecting the optimum architecture for given SRAMs with certain specifications.
Since the READ access is the dominant operation [33] , we only show our READ access related simulation results in this paper, for simplicity, at no significant loss in the accuracy of our findings.
We used the Arizona State University (ASU) Predictive Technology Model (PTM) [85] and the mixed-signal UltraSim simulator (MMSIM72-UltraSim64, Cadence Inc.) to produce our results.
A. STATISTICAL ESTIMATION AND DISTRIBUTION OF LEAKAGE CURRENT
As mentioned in Section I, the D2D and WID variation in the electrical characteristics (especially variations of V th ) of the transistors of a cell results in significant variation of the leakage of the cell. These parameter variations must be taken into account when approximating leakage current and power. D2D variation can be characterized as a global mean and variance, while WID variation can be characterized as the deviation occurring spatially within any one die. The mean (µ leakCELL ) and standard deviation (σ leakCELL ) of the leakage of an inactive cell (a cell that is not located on the critical path), considering V th , L, and V DD fluctuation, can be obtained using our proposed exVAR-TX model. Since the sub-threshold leakage is an exponential function of the threshold voltage, we have assumed a lognormal PDF to describe the distribution of the cell leakage [83]. Fig. 6 shows that the lognormal distribution model, with the mean and the standard deviation estimated using a method similar to the one shown in our previous work [1] and outlined in Section II.D, closely follows our Monte Carlo simulation results. If we assume that the different cells are independent and identically distributed random variables (i.e., the mean and the standard deviation of the leakage of all the cells are the same), we can compute the estimated total SRAM array leakage through Eq. (5), where I leakCELL is the random variable representing the leakage of a cell, N CELLS is the total number of cells in the non-critical path, N ROW is the number of rows, and N COL and N RC are the number of columns and redundant columns, respectively. Applying the Central Limit Theorem [83], the distribution of the total leakage can be approximated as a Gaussian with a mean (µ Leak ) and standard deviation (σ Leak ) given by Eq. (6).
To quantify the effect of the leakage distribution on the statistical design of the SRAM array, we have defined the probability (P LeakMem ) that the total memory leakage will meet a given bound, as expressed in Eq. (7).
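Under the i.i.d. assumption stated above, the sum of N CELLS cell leakages has mean N CELLS · µ leakCELL and standard deviation sqrt(N CELLS) · σ leakCELL, and the probability of meeting a leakage bound follows from the Gaussian CDF. The sketch below illustrates this; the relation N CELLS = N ROW · (N COL + N RC) and the names `i_max`, `array_leakage_stats`, and `prob_leakage_below` are assumptions made for the example rather than the paper's exact formulation.

```python
import math

def array_leakage_stats(mu_cell, sigma_cell, n_row, n_col, n_rc):
    """Mean and sigma of the total leakage of i.i.d. cells.

    Assumption: n_cells = n_row * (n_col + n_rc).
    """
    n_cells = n_row * (n_col + n_rc)
    mu_leak = n_cells * mu_cell
    sigma_leak = math.sqrt(n_cells) * sigma_cell
    return mu_leak, sigma_leak

def prob_leakage_below(i_max, mu_leak, sigma_leak):
    """P(total leakage < i_max) for a Gaussian total leakage."""
    z = (i_max - mu_leak) / sigma_leak
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical normalized numbers: 256 x 256 cells, no redundant columns,
# per-cell leakage with mean 1.0 and sigma 0.5 (arbitrary units).
mu, sigma = array_leakage_stats(1.0, 0.5, 256, 256, 0)
p_meet = prob_leakage_below(mu + 2.0 * sigma, mu, sigma)
```

Note that the relative spread sigma/mu of the array shrinks as 1/sqrt(N CELLS) under this independence assumption; correlated (D2D) variation would not average out in the same way.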
B. IMPACT OF TEMPERATURE (T) ON LEAKAGE POWER
Temperature effects are important because leakage current depends exponentially on temperature, and future operating temperatures may exceed 100 °C [10]. Fig. 7 shows how the relative leakage power changes as a function of temperature, for different threshold voltages at 125 °C. Leakage power increases dramatically with temperature (∼4 times from 27 °C to 125 °C). In addition, we observe that the leakage dependence on the threshold voltage is significant: the leakage changes significantly across the different V th values (different lines in Fig. 7). Such large variation in leakage current/power due to temperature, as well as the near-exponential dependency of leakage current on varying manufacturing process parameters (e.g., gate length and threshold voltage), is expected to increase with technology generations and to further degrade the performance and yield of 16-nm SRAMs.
The estimation/measurement of temperature variation in our model is taken into account by observing the changes in voltage and current, as the known (calibrated) voltage applied to a device can be dropped by a heated interconnect and/or a heated device due to a known (calibrated) current passing through them. The changes in voltage and current of a device are strongly related to the changes in the device threshold voltage and charge carriers effective mobility resulting from changes in temperature. We have used the temperature/aging model card of Virtuoso UltraSim Simulator (a Cadence CAD tool) to capture the variations in threshold voltage, mobility, voltage, and active/leakage current due to increase in temperature. These topics and their associated temperature estimation/measurement methods are discussed in detail by Avenas et al. [89] and others [2] , [90] - [93] .
C. STATISTICAL ESTIMATION AND DISTRIBUTION OF POWER
As shown in Eq. (8), each of the components of the total power (P dyn , P dp , and P stat ) is influenced by process and operation parameters (V th , L, and V DD ) and their variation. For example, the dynamic power variation is influenced quadratically by V DD and linearly by the capacitance C L , which in turn is influenced by the channel dimensions (W and L) and by the oxide thickness (t ox ) [40]. In Eq. (8), f 0→1 represents the frequency of the energy-consuming transitions (these are 0 → 1 transitions for static CMOS). In other words, f 0→1 is equivalent to P 0→1 f , where f now represents the maximum possible event rate of the inputs (which is often the clock rate) and P 0→1 is the probability that a clock event results in a 0 → 1 (or power-consuming) event at the output of the gate. Fig. 8 shows the probability distribution of the total power for three different architectures of a 16-nm 64KB SRAM, with the assumed parameter variations presented in Section II.D. Comparing the PDF traces of the total power of the 16-nm node with those of the 45-nm and 180-nm nodes (not shown) reveals that the power variation of the SRAM increases with technology scale-down. While the overall sigma for the same SRAM size and architecture is only about 3.1% for the 180-nm and 6.2% for the 45-nm technology, it is about 10.2% for the 16-nm node.
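For reference, the three power components named above are commonly written in the following textbook form for static CMOS. This is an assumed reconstruction and not necessarily the exact form of Eq. (8); the symbols t_sc (duration of the direct-path conduction), I_peak (peak direct-path current), and I_leak (leakage current) are introduced here only for illustration.

```latex
\[
\begin{aligned}
P_{tot} &= P_{dyn} + P_{dp} + P_{stat} \\
        &= C_L V_{DD}^{2}\, f_{0\to1}
         + t_{sc}\, V_{DD}\, I_{peak}\, f_{0\to1}
         + V_{DD}\, I_{leak},
\qquad f_{0\to1} = P_{0\to1}\, f .
\end{aligned}
\]
```

In this form the quadratic dependence of the dynamic term on V DD and its linear dependence on C L are explicit, while the static term scales with V DD through the leakage current.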
D. VALIDATION OF MODEL OPTIMIZATION
To quantify the power saving improvement of our proposed approach, Fig. 8 compares, for a given 16-nm 64KB 6T-SRAM, the probability density function (PDF) of our optimal power design P arc,op with both the PDF of our worst power design P arc,wo and the PDF of our reference power design P arc,ref . The reference design uses the square architecture (256 rows by 256 columns) that the IBM group [33] used to optimize their bitline width, without exploring the range of all feasible architectures.
Within the notation of this paper, the IBM group architecture is
Fig. 8 illustrates that, by choosing the optimum architecture in an SRAM design, the power (and therefore the yield) can be improved by up to about 25% with respect to prior square models such as those proposed by the IBM group [33] and others [23] - [25].
E. ENERGY-DELAY PRODUCT
To maximize the energy efficiency of an SRAM, both leakage/power reduction and frequency/performance improvement need to be achieved at the same time. We illustrated the frequency/performance improvement in our previous work, and illustrate the leakage/power minimization in this work.
A metric that combines a measure of performance and energy is the energy-delay product (EDP) [37]:
$$\mathrm{EDP} = \mathrm{PDP} \times t_p = \frac{C_L V_{DD}^{2}}{2}\, t_p \qquad (10)$$
where PDP and t p are the power delay product and the propagation delay of a logic gate, respectively. It is worth analyzing the voltage dependence of the EDP. Higher supply voltages reduce delay but increase energy, and the opposite is true for low voltages. An optimum operation point should therefore exist. Assuming that NMOS and PMOS transistors have comparable threshold and saturation voltages and the devices remain in velocity saturation, we can write a simplified version of the propagation delay expression [37]:
$$t_p \approx \frac{\theta\, C_L V_{DD}}{V_{DD} - V_{Te}} \qquad (11)$$
where V Te = V th + V DSAT /2 and θ is a technology parameter. Combining Eqs. (10) and (11) yields
$$\mathrm{EDP} = \frac{\theta\, C_L^{2}\, V_{DD}^{3}}{2\,(V_{DD} - V_{Te})} \qquad (12)$$
The optimum supply voltage can be obtained by taking the derivative of Eq. (12) with respect to V DD and equating the result to 0. The result is
$$V_{DD,opt} = \frac{3}{2}\, V_{Te} \qquad (13)$$
The remarkable outcome of this analysis is the low value of the supply voltage that simultaneously optimizes performance and energy. For submicron technologies, with thresholds in the range of 0.45 V, the optimum supply is situated around 0.8 V.
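The location of the optimum can be checked numerically. The sketch below assumes the simplified textbook expressions EDP = (C L V DD² / 2) · t p and t p = θ C L V DD / (V DD − V Te), and scans the supply voltage on a grid; the V DSAT value of 0.17 V, the normalization θ = C L = 1, and the function names are illustrative assumptions.

```python
def edp(v_dd, v_te, theta=1.0, c_load=1.0):
    """EDP = (C_L * V_DD^2 / 2) * t_p, with t_p = theta*C_L*V_DD/(V_DD - V_Te)."""
    t_p = theta * c_load * v_dd / (v_dd - v_te)
    return 0.5 * c_load * v_dd ** 2 * t_p

def optimum_vdd(v_te, v_max=2.0, steps=20000):
    """Grid search (above V_Te) for the supply voltage that minimizes EDP."""
    candidates = [v_te + (v_max - v_te) * (i + 1) / steps for i in range(steps)]
    return min(candidates, key=lambda v: edp(v, v_te))

# Hypothetical device values: V_th = 0.45 V and V_DSAT = 0.17 V.
v_te = 0.45 + 0.17 / 2
v_opt = optimum_vdd(v_te)
```

The grid minimum falls at 1.5 times V Te, matching the closed-form result of setting the derivative of V DD³ / (V DD − V Te) to zero; neither θ nor C L shifts the location of the optimum.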
For the generic 16-nm CMOS process presented in this paper, the optimum supply is situated around 0.82 V, using the above computation method. The simulated graphs of Fig. 9, which plot normalized delay, energy, and EDP, closely match this result, as the optimum supply voltage in the plot is about 0.84 V. The charts clearly illustrate the trade-off between delay and energy. When the supply voltage is at or above 0.9 V, dynamic energy is dominant compared to leakage energy. Therefore, lowering the supply voltage reduces overall energy consumption. As expected, the minimum point of each device selection occurs where the supply voltage is around 0.84 V. Table 2 compares the mean, sigma, and 3 sigma of the energy-delay product (EDP) of the optimum, worst, and three other architectures that fall between the optimum and worst architectures against the EDP of our reference (IBM group [33]). The difference between the mean, sigma, and 3 sigma of the worst and optimum cases clearly emphasizes the crucial role of selecting an optimum architecture in EDP improvement. It is worth mentioning that Arc Optim is not necessarily the fastest possible architecture. In fact, it is 5 to 10 percent slower than the maximum possible speed architecture, Arc 1, considering the variations. This is because, as compared to Arc 1, the Arc Optim architecture has a lower total capacitance, which reduces the total power, among other differences.
F. COMBINATION EFFECT OF POWER REDUCTION AND PERFORMANCE BOOSTING TECHNIQUES
In this section, we demonstrate that SRAM energy consumption is determined not only by the choices of supply and threshold voltages but also by the exploitation of reliability-related circuits and SRAM architecture choices (including organization selection, transistor sizing, interconnect fine-tuning, decoder and periphery circuit types, and so on). Fig. 10 compares the normalized progressive energy savings for different combinations of parameter selections (Case 1 to Case 8, defined in Table 3), considering both power and performance of a 16-nm 64KB SRAM at 85 °C. Table 4 provides the parameter summary for our energy analysis simulation.
Several observations could be made from Fig. 10 .
It is interesting to observe that Case 1 (1. HVt-SV DDNBst-ArcSq), which uses a high V th for all the transistors in the circuit to achieve the lowest possible leakage current as compared to the other seven cases, consumes the highest total energy. This is because, with high-V th devices on the critical paths, the increase in the read delay outweighs the decrease in the leakage current, which, in turn, results in more energy consumed overall. The majority of the energy savings of Case 4 (which uses dual V DD ) as compared to Cases 1-3 (which do not use dual V DD ) comes from static power reduction. The additional energy savings of Case 5 is attributed to the use of multiple V th /V DD (instead of single or dual V th /V DD ) for the reasons explained in Section II.B.4. Similarly, the energy saving of Case 6 is higher than that of Case 5 due to the addition of reliability/performance/power boosting circuits to the Case 5 circuit. Finally, the selection of the optimum architecture saves some additional energy, as shown by Case 8. The corresponding improvements are ∼1.67X (i.e., 1.00 to 0.60) due to the use of dual V DD , ∼1.76X (i.e., 0.60 to 0.34) due to the deployment of multiple V th /V DD , ∼2.27X (i.e., 0.34 to 0.15) due to the contribution of reliability/performance/power boosting techniques, and an additional ∼1.88X (i.e., 0.15 to 0.08) due to the selection of the optimum architecture, for a total net improvement of ∼12.5X (i.e., 1.00 to 0.08), which is in line with our overall energy saving expectations. Reducing the voltage below twice the lowest V th (i.e., VDD low = 2 × LVth) is not likely to be useful, as the yield quickly drops off and the power savings exhibit diminishing returns with respect to the baseline/reference SRAMs. Subsequently, no minimum energy point was found in the near- or sub-threshold region, since the 6T-SRAM fails to operate before that region is reached.
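The step-by-step factors quoted above follow directly from the normalized energies, and their product equals the overall improvement. The short calculation below reproduces that arithmetic from the five values stated in the text; the labels are shorthand introduced for this example.

```python
# Normalized energies quoted in the text for the successive steps.
normalized_energy = [1.00, 0.60, 0.34, 0.15, 0.08]
labels = ["dual VDD", "multiple Vth/VDD", "boosting circuits", "optimum architecture"]

# Improvement factor of each step = energy before / energy after.
step_gain = {
    label: before / after
    for label, before, after in zip(labels, normalized_energy, normalized_energy[1:])
}

# Overall improvement from the first to the last configuration.
total_gain = normalized_energy[0] / normalized_energy[-1]

for label, gain in step_gain.items():
    print(f"{label}: {gain:.2f}X")
print(f"total: {total_gain:.1f}X")
```

Because each factor is a ratio of consecutive values, the product of the four step gains telescopes to the total gain of 1.00/0.08.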
Note that the highest energy efficiency is not always the best in terms of leakage. Therefore, careful device selections have to be made depending upon the system requirements. If an SRAM stays in an idle or sleep mode for the majority of its lifetime, the leakage current becomes more significant than the energy efficiency during computational operations. However, the energy efficiency will be more significant if the SRAM workload becomes substantial. The energy overhead of utilizing the mitigation techniques discussed above is negligible compared to the improvement they make. The total area overhead ranged between 2% and 5%, depending on the selected SRAM architecture and on how aggressively the reliability-, power-, and performance-boosting techniques had to be applied to meet the required set of specifications.
No optimum architecture for power minimization of different SRAM sizes suffered more than 5% to 10% performance degradation for achieving an energy savings of at least 10X (similar to that shown in Fig. 10 ), compared with the highest speed possible architecture of the same SRAM size at the full V DD of 0.9 V. Finally, power savings were almost always better for larger SRAMs. This was in line with our expectations, as larger caches/SRAMs can afford to put more blocks/banks to sleep mode, using lower V DD .
V. CONCLUSION AND FUTURE WORK
In this work, we explored power reduction techniques for SRAM energy efficiency. We discussed and illustrated how to improve the energy efficiency of variability-aware 6T-SRAMs with all the major reliability improvement techniques in place. Optimal combinations of device, cell, and architectural organization in terms of energy efficiency were obtained from the exhaustive empirical results of our simulations.
We showed that proper architectural selection with major reliability-boosting schemes applied to the design can improve the energy and energy-delay-product (EDP) of 64KB 16-nm 6T-SRAM by ∼12.5X and ∼33%, respectively, as compared to the baseline/reference SRAMs at 0.9 V.
We achieved the desired, predictable power savings across the board with no more than 10% performance, 5% area, and 2% yield penalties in the most demanding power minimization SRAM configuration, even though we do not claim the lowest VDD min compared to other works (e.g., [20]-[22]).
We verified our simulation results by Monte Carlo analysis and validated the findings presented in this paper by comparing our simulation results with those of the IBM group as well as with our own results for different SRAM sizes (8KB to 512KB) and various architectures.
We hope that this power-related work, which supplements our previously published performance-related work [2], makes an effective case for further research into the computing paradigm with the goal of improving the energy efficiency of systems while lowering design cost, improving yield, and recovering the performance lost to conventional guardbanding techniques.
Additional future directions may include an investigation of more sophisticated 3D architectures and the associated evaluation of system-wide power and energy impacts, as well as a broader design space exploration involving multi-core systems with consideration of cache coherence.
Moreover, it is worth mentioning that recent developments in Fin-Shaped Field Effect Transistor (FinFET) [21], [68] and fully depleted silicon on insulator (FD-SOI) [86] devices, as well as emerging memory technologies [87] such as phase change memory, spintronic memory, resistive memory (memristor), ferroelectric memory, resistive RAM (ReRAM), magnetic RAM (MRAM), and its newer cousin spin-transfer torque random access memory (STT-RAM), have the potential to ease or even resolve some of the major challenges discussed in this paper. With continued investment from both academia and industry over the next few years, these emerging memories promise to improve many aspects of the present-day and future memory hierarchy, offering high integration density, large capacity, close-to-zero standby power, and good resilience to soft-errors.
Beyond the development of robust and scalable devices, the unique characteristics of these emerging memories, such as read-write asymmetry, stochastic programming behavior, and the tradeoffs among performance, power, and data retention, could introduce plenty of opportunities and challenges for novel circuit designs, architectures, system organizations, and management strategies. These active research topics have created an urgent need for the modeling, analysis, design, and application of emerging memory technologies across multiple technology scales to accelerate their development and adoption.
APPENDIX
This appendix briefly discusses components of total power, partitioning of the memory, sources of power dissipation in memories, and techniques to reduce the retention current of existing SRAM memories to provide important background to the paper.
A. COMPONENTS OF TOTAL POWER
As briefly presented in Section IV.C (Eq. (8), repeated here as Eq. (14) for convenience), the total power of a CMOS inverter can be expressed as the sum of its three components [37]:
$$P_{total} = P_{dyn} + P_{dp} + P_{stat} \qquad (14)$$
The power dissipation of an SRAM (using CMOS circuits) is dominated by the dynamic dissipation that results from charging and discharging capacitances.
Dynamic power (P dyn ) is caused by switching activity in CMOS circuits.
Direct-path power (P dp ) is caused by the current flowing between V DD and GND for a short period of time (t s ) during switching, while the NMOS and PMOS transistors are conducting simultaneously. The direct-path power dissipation is proportional to the switching activity. With threshold voltages scaling at a slower rate than the supply voltage, the direct-path consumption can be kept within bounds by careful design in newer deep submicron technologies, and thus can be ignored for first-order estimation.
Static power (P stat ) is due to leakage current in the quiescent state of a circuit.
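As a rough illustration of how the three components combine, the sketch below uses the standard first-order textbook expressions for a CMOS gate: P dyn = C L V DD^2 f, P dp = V DD I peak t s f, and P stat = V DD I leak. These forms, the parameter names, and all numeric values are assumptions chosen for illustration; they are not taken from Eq. (8)/(14) or from the simulations reported in this paper.

```python
from dataclasses import dataclass


@dataclass
class GateParams:
    """Hypothetical first-order parameters of a single CMOS gate."""
    c_load: float    # switched load capacitance C_L [F]
    v_dd: float      # supply voltage V_DD [V]
    f_switch: float  # effective switching frequency f [Hz]
    i_peak: float    # peak direct-path current I_peak [A]
    t_s: float       # duration of the direct-path current pulse t_s [s]
    i_leak: float    # quiescent leakage current I_leak [A]


def power_components(p: GateParams) -> dict:
    """Return dynamic, direct-path, static, and total power in watts."""
    p_dyn = p.c_load * p.v_dd ** 2 * p.f_switch        # capacitive switching
    p_dp = p.v_dd * p.i_peak * p.t_s * p.f_switch      # short-circuit current
    p_stat = p.v_dd * p.i_leak                         # leakage in quiescence
    return {
        "dyn": p_dyn,
        "dp": p_dp,
        "stat": p_stat,
        "total": p_dyn + p_dp + p_stat,
    }


# Illustrative values only (not extracted from any 16-nm model card).
example = GateParams(
    c_load=1e-15, v_dd=0.9, f_switch=1e9,
    i_peak=10e-6, t_s=5e-12, i_leak=10e-9,
)
result = power_components(example)

for name, value in result.items():
    print(f"P_{name:<5s} = {value:.3e} W")
```

With these assumed numbers the dynamic term dominates and the direct-path term is a small fraction of it, which matches the qualitative statements above; other parameter choices, such as long idle periods, can make the static term dominant.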
B. PARTITIONING OF THE MEMORY
A proper division of the memory into submodules can help confine active power dissipation to limited areas of the overall array. Memory units that are not in use should consume only the power necessary for data retention. Memory partitioning is accomplished by reducing m (the number of cells on a wordline), and/or n (the number of cells on a bitline). By dividing the wordline into several sub-wordlines that are enabled only when addressed, the overall switched capacitance per access is reduced. In some sense, this scheme is nothing more than a multistage hierarchical row decoder [3] . This approach is quite popular in SRAM memories, as illustrated and discussed in our previous work [40] .
In a similar way, partitioning of the bitline reduces the capacitance switched at every read/write operation. An approach that is often used in SRAM memories is the partially activated bitline. The bitline is partitioned into multiple sections (typically 2 or more). All these sections share a common sense amplifier, column decoder, and I/O module. This approach has helped us to reduce our bitline capacitance (C BL ) by a factor of 2 to 10, depending on the number of bitline segments on each of the bitlines, in our 16-nm 6T-SRAMs.
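The effect of bitline partitioning on the switched capacitance can be sketched with an idealized model. It assumes that the bitline capacitance is spread evenly over the attached cells, so that activating one of k equal segments switches roughly C BL /k, plus an optional fixed overhead for the shared global wiring. The function names, the overhead term, and the numeric values are hypothetical and are not taken from our 16-nm design.

```python
def switched_bitline_cap(c_bl_full, segments, c_overhead=0.0):
    """Capacitance switched per access when one of `segments` equal
    bitline sections is activated (idealized, hypothetical model).

    c_bl_full  -- capacitance of the unpartitioned bitline [F]
    segments   -- number of equal bitline sections (>= 1)
    c_overhead -- fixed capacitance of shared global wiring [F]
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")
    return c_bl_full / segments + c_overhead


def reduction_factor(c_bl_full, segments, c_overhead=0.0):
    """Ratio of unpartitioned to partitioned switched capacitance."""
    return c_bl_full / switched_bitline_cap(c_bl_full, segments, c_overhead)


# Illustrative numbers: a 100 fF bitline, with and without 2 fF of overhead.
C_BL = 100e-15
for k in (1, 2, 4, 10):
    ideal = reduction_factor(C_BL, k)
    with_overhead = reduction_factor(C_BL, k, c_overhead=2e-15)
    print(f"segments={k:2d}  ideal={ideal:5.2f}X  "
          f"with overhead={with_overhead:5.2f}X")
```

In this model the ideal reduction equals the number of segments, and any fixed overhead makes the benefit of additional segments taper off, so the achievable factor depends on the layout as well as on the segment count.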
C. SOURCES OF POWER DISSIPATION IN MEMORIES
The power consumption in a memory chip can be attributed to three major sources: the memory cell array, the decoders (row, column, block), and the periphery. A unified active power equation for a modern CMOS memory of m columns and n rows is approximately given by [72] (see Fig. 11 ).
FIGURE 11. Sources of power dissipation in semiconductor memory [72]. f is the operating frequency.
As should be expected, the power dissipation is proportional to the size of the memory (n, m). Dividing the memory into subarrays and keeping n and m small is essential to keeping power within bounds. This approach is obviously only effective if the standby dissipation of inactive memory modules is substantially lower.
In general, the power dissipation of the memory is dominated by the array. The active power dissipation of the peripheral circuits is small compared with the other components. Their standby power can be high, however, requiring that circuits such as sense amplifiers be turned off when not in use. The decoder charging current is also negligibly small in modern SRAMs, especially if care is taken that only one out of the n or m nodes is charged at every cycle.
Reduction of power dissipation in memories is worthy of a book on its own. In this paper, we have limited ourselves to an enumeration of major techniques that are generally applicable, and refer the interested reader to [3] and [72] for more details and depth.
D. TECHNIQUES TO REDUCE THE RETENTION CURRENT OF EXISTING SRAM MEMORIES
• Turning off unused memory blocks. Memory functions such as caches do not fully use the available capacity for most of the time. Disconnecting unused blocks from the supply rails using high-threshold switches reduces their leakage to very low values. Obviously, the data stored in the memory is lost in this approach (see Fig. 12(a) ).
• Increasing the thresholds by using body biasing. Negative bias of the non-active cells increases the thresholds of the devices and reduces the leakage current.
• Inserting extra resistance in the leakage path. When data retention is necessary, the insertion of a low-threshold switch in the leakage path provides a means to reduce leakage current while keeping the data intact (see Fig. 12(a) ). The low-threshold device leaks on its own, which is sufficient to maintain the state in the memory. At the same time, the voltage drop over the switch introduces a ''stacking effect'' in the memory cells connected to it: a reduction of V GS combined with a negative V BS results in a substantial drop in leakage current [3] .
• Lowering the supply voltage. Fig. 12 indicates that the leakage current is a strong function of V DD . An effective means to reduce leakage during standby is to lower the supply rails to a value that keeps the leakage within bounds, while ensuring data retention (see Fig. 12(b) ).
The lower bound of the data retention voltage (the voltage that still maintains the stored value) is determined by device variations. A supply voltage as low as 60 mV (excluding noise margin) is sufficient to maintain retention in a standard 16-nm CMOS process. The combined effect of reduced supply voltage and reduced leakage current is a powerful way of addressing standby power dissipation in SRAM memories [3] . It is worth mentioning that the increasing impact of leakage combined with reduced supply voltages has, in recent years, reduced the gap between SRAM and DRAM to a certain extent. When standby power is a dominant concern, nonvolatile memories offer a viable and attractive alternative.
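The combined effect of a lower supply voltage and the resulting lower leakage current can be illustrated with a first-order subthreshold model. The sketch assumes that, with V GS = 0 and V DS = V DD , the leakage current changes by one decade for every S/λ d volts of supply change, where S is the subthreshold swing and λ d is the DIBL coefficient. This model ignores gate and junction leakage, and the default coefficients and reference values are hypothetical rather than fitted to Fig. 12 or to any 16-nm process data.

```python
def subthreshold_leakage(v_dd, i_ref, v_ref, dibl=0.1, swing=0.09):
    """Leakage current [A] at supply v_dd, scaled from a reference point.

    v_dd  -- standby supply voltage [V]
    i_ref -- leakage current measured at v_ref [A]
    v_ref -- reference supply voltage [V]
    dibl  -- assumed DIBL coefficient lambda_d [V/V]
    swing -- assumed subthreshold swing S [V/decade]
    """
    return i_ref * 10.0 ** (dibl * (v_dd - v_ref) / swing)


def standby_power(v_dd, i_ref, v_ref, dibl=0.1, swing=0.09):
    """Standby power [W] = supply voltage times leakage at that voltage."""
    return v_dd * subthreshold_leakage(v_dd, i_ref, v_ref, dibl, swing)


# Illustrative reference point: 1 nA of cell leakage at the full 0.9 V supply.
V_REF = 0.9
I_REF = 1e-9
p_full = standby_power(V_REF, I_REF, V_REF)

for v in (0.9, 0.6, 0.45, 0.3):
    p = standby_power(v, I_REF, V_REF)
    print(f"V_DD={v:4.2f} V  P_standby={p:.3e} W  "
          f"relative to full supply={p / p_full:.3f}")
```

Because both factors of the product fall together, the standby power in this model drops faster than linearly with V DD , which is the qualitative point made above; the actual lower limit is set by the data retention voltage, not by this model.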
