Abstract-Over the last decade, the design of ultra-low-power digital circuits in the subthreshold regime has been driven by the quest for minimum energy per operation. In this contribution, we observe that operating at the minimum-energy point is not straightforward, as design constraints from real-life applications have an important impact on energy. Therefore, we introduce the alternative concept of practical energy, taking functional-yield and throughput constraints on minimum V dd into account. In this context, we demonstrate for the first time the detrimental impact of DIBL on minimum V dd.
I. INTRODUCTION
In the context of green computing, lowering the supply voltage V dd is an efficient technique to drastically reduce dynamic energy per operation at the expense of increased delay [1]. However, as the delay increases, so does the execution time of the operation. This implies a static-energy overhead, which results from the integration of the leakage power over the execution time of the operation. It has thus been shown that energy per operation can be minimized by operating at an optimal V dd, which is often in the subthreshold region. Over the last decade, minimum energy has become a popular research direction [2]-[8] for Ultra-Low-Power (ULP) applications such as RFID tags, micro-sensor networks or implanted biomedical devices, which typically require low-to-medium data throughputs. Recently, minimum-energy subthreshold operation has also been proposed as a ULP mode in general-purpose portable systems with Dynamic Frequency Voltage Scaling (DFVS) [9]-[10].
The concept of minimum energy relies on 3 assumptions:
• the optimal V dd is high enough to ensure robust operation of the circuit;
• the circuit delay is low enough to support the required data throughput;
• the circuit is powered down after completion of the operation, with no energy consumption in sleep mode.
With these assumptions, the concept of minimum energy is quite theoretical. In real-life applications, V dd cannot be scaled to arbitrarily low values without raising robustness issues, especially with the high process variability of nanoscale technologies [3]. Moreover, even with a low data throughput requirement, ULP applications do have timing constraints. Although very loose, they have to be taken into account. Finally, powering down the circuit is not straightforward in low-cost, area/volume-constrained ULP applications. The circuit can be externally shut down, but this requires extra off-chip components with cost and volume overheads, and the circuit state has to be saved in a memory with a high energy overhead. On-chip power-down can be achieved through the multi-V t (MTCMOS) technique with power-gating devices (sleep mode). Nevertheless, this technique adds design costs at the architecture (management of active and sleep modes), gate (addition of special flip-flops) and layout (separate routing of the power grid for logic and flip-flops) levels. Moreover, Seok et al. showed in [5] that sleep-mode energy cannot be overlooked and that the impact of power-gating devices on delay is higher in the subthreshold regime.
In this contribution, we introduce the alternative concept of practical energy per operation, which takes robustness and throughput constraints into account. It brings a powerful analysis framework of energy efficiency, which treats data throughput as an input variable dictated by the application. We use the 45nm predictive technology model (PTM) from Arizona State University [11] and a benchmark 8-bit RCA multiplier to show that practical energy in low-throughput applications can be far higher than the minimum energy level because of high static energy and a robustness-limited minimum operating V dd. To solve these issues, a subthreshold processor in [12] uses MTCMOS power gating with device width upsize, whereas a subthreshold processor in [13] uses reverse body biasing and channel length upsize. It is thus not clear which technique is the most efficient. We use the proposed framework to carry out for the first time an in-depth study of such techniques to pull practical energy toward the minimum-energy point.
This paper is organized as follows. In Section II, the conventional concept of minimum energy is briefly reviewed and applied to the benchmark circuit. Constraints on minimum operating V dd are presented in Section III. The impact on practical energy is then investigated in Section IV. Finally, minimization of practical energy is discussed in Section V.
II. CONVENTIONAL CONCEPT OF MINIMUM ENERGY
The energy per operation of a circuit is the sum of the dynamic energy E dyn due to switched capacitances and the static energy E stat due to the leakage current I leak flowing through the devices. In the conventional approach, static energy is computed by the integration of I leak during the actual execution of the operation i.e. over a time period equal to the delay of the circuit critical path. Energy is thus conventionally expressed as:
where N sw is the number of switched nodes to perform the operation, C L the typical node capacitance and Del the circuit delay. Minimum energy is often achieved when operating the circuit in the MOSFET subthreshold region [2]-[8]. Subthreshold drain current is expressed as:
where I 0 is a reference current proportional to W/L eff, which depends exponentially on the threshold voltage V t. S is the subthreshold swing, η the Drain-Induced Barrier Lowering (DIBL) coefficient and U th the thermal voltage, close to 26 mV at room temperature. At low V dd, subthreshold leakage current is the dominant leakage source and static energy can be expressed as:
where L D is the logic depth. Energy per operation of a benchmark 8-bit RCA multiplier has been extracted with this conventional approach from Spice simulation with the 45nm PTM model (T ox = 1.1nm, L eff = 17.5nm and nominal V dd = 1V). A triple-V t technology is considered with 0.27V low-, 0.37V std- and 0.46V high-V t devices (extracted at V ds = 1V). For the sake of generality, std-V t devices at 25 °C are considered unless otherwise specified. Simulated energy per operation for a pseudo-random input pattern is shown vs. V dd in Fig. 1. Under the typical process corner, minimum energy occurs at an optimal V dd of 0.3V.
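The trade-off behind the minimum-energy point can be illustrated with a minimal numerical sketch. It assumes the commonly used subthreshold current form I sub = I 0 · 10^((V gs + η V ds)/S) · (1 − exp(−V ds/U th)), a gate delay proportional to C L V dd/I on, and an energy E = N sw C L V dd² + V dd I leak Del. These forms are assumptions consistent with the symbols defined above, not necessarily the exact expressions behind Fig. 1. S and η are the values quoted in Section III for this technology, whereas N_SW, N_GATES, C_L, L_D and I_0 are hypothetical values chosen for illustration only.

```python
import math

# Technology values quoted in the text (45nm, std-Vt)
S = 0.096       # subthreshold swing [V/decade]
ETA = 0.160     # DIBL coefficient [V/V]
U_TH = 0.026    # thermal voltage at room temperature [V]

# Hypothetical circuit parameters (illustration only)
N_SW = 200      # switched nodes per operation
N_GATES = 1000  # gates contributing to leakage
C_L = 1e-15     # typical node capacitance [F]
L_D = 30        # logic depth of the critical path
I_0 = 2e-12     # reference current [A]


def i_sub(v_gs, v_ds):
    """Assumed subthreshold drain current model."""
    return I_0 * 10 ** ((v_gs + ETA * v_ds) / S) * (1.0 - math.exp(-v_ds / U_TH))


def delay(v_dd):
    """Critical-path delay: L_D gates, each taking about C_L * V_dd / I_on."""
    return L_D * C_L * v_dd / i_sub(v_dd, v_dd)


def energy(v_dd):
    """Return (E_dyn, E_stat), with leakage integrated over the circuit delay."""
    e_dyn = N_SW * C_L * v_dd ** 2
    i_leak = N_GATES * i_sub(0.0, v_dd)
    e_stat = v_dd * i_leak * delay(v_dd)
    return e_dyn, e_stat


def optimal_vdd(lo=0.10, hi=0.60, step=0.005):
    """Grid search for the supply voltage that minimizes E_dyn + E_stat."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return min(grid, key=lambda v: sum(energy(v)))


if __name__ == "__main__":
    v_opt = optimal_vdd()
    e_dyn, e_stat = energy(v_opt)
    print(f"optimal V_dd ~ {v_opt:.3f} V, E_dyn = {e_dyn:.2e} J, E_stat = {e_stat:.2e} J")
```

With these placeholder values the search returns an interior optimum, but its location depends on the assumed ratio of leaking gates to switched nodes and should not be compared with Fig. 1.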
In nanoscale technologies, process variability cannot be overlooked especially when focusing on subthreshold design, [14] . Second, variability of critical dimensions is considered through L g variability. We model L g variations as a normal distribution with 3σ /μ equal to 20%. As L g variations have a strong spatial correlation and the benchmark circuit is quite small, we consider only one normally-distributed L g variable for all the devices of the circuit. When taking variability into account, static energy is far higher, resulting in a 90mV-higher optimal V dd with a detrimental impact on minimum energy per operation, as studied in [7] .
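The L g variability model described above can be sketched as follows. This is a minimal illustration, assuming a nominal length equal to the quoted L eff of 17.5nm (the nominal L g itself is not stated here, so this value is a placeholder) and one random draw per circuit instance, shared by all of its devices.

```python
import random
import statistics

L_NOM_NM = 17.5                 # placeholder nominal length [nm]
SIGMA_NM = 0.20 * L_NOM_NM / 3  # from 3*sigma/mu = 20 %


def sample_circuit_lengths(n_circuits, seed=0):
    """Draw one gate length per circuit instance (fully correlated within a circuit)."""
    rng = random.Random(seed)
    return [rng.gauss(L_NOM_NM, SIGMA_NM) for _ in range(n_circuits)]


if __name__ == "__main__":
    lengths = sample_circuit_lengths(10_000)
    print(f"mean = {statistics.mean(lengths):.2f} nm, "
          f"sigma = {statistics.stdev(lengths):.2f} nm")
```

Each sampled length would then be applied to every device of one circuit instance before re-running the energy extraction.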
III. CONSTRAINTS ON MINIMUM SUPPLY VOLTAGE
In the previous section, minimum energy was computed assuming that V dd can be reduced to an arbitrarily low value. In real-life applications, this is clearly not the case, as too low a V dd deteriorates functional yield and limits circuit throughput. In this section, we examine the minimum practical V dd that meets both constraints.
A. Functional yield constraint
Functional yield is an important issue in subthreshold circuits because subthreshold V dd implies a low I on/I off ratio and high current variability [8]. Current variability can in turn degrade the output logic level of a gate, so that it is not recognized as the correct logic level by the next gate. Some gates can thus exhibit functional failure, leading to poor functional yield of the subthreshold circuit. The functional yield of digital gates is estimated with a method inherited from the extraction of SRAM-cell static noise margins (SNM), based on a statistical computation of the SNM of coupled gates depicted in Fig. 2. The resulting failure rate is shown in Fig. 2 (right); it increases dramatically when V dd is lowered below 0.3V. In order to ensure a functional yield of 99.9%, i.e. to keep the functional failure rate below 0.1%, V dd has to be kept at a minimum of 0.25V. This result could be surprising, as circuits operating at less than 0.2V have been demonstrated in previous articles on subthreshold logic (e.g. [2]). However, these circuits were implemented in 0.15/0.18μm technologies. In 45nm technology, things are different. First, process variability is higher. Second, Hanson et al. show that SNM and thus functional yield are reduced with technology scaling due to increasing subthreshold swing S [15]. Finally, we observed for the first time that SNM and functional yield are considerably worsened by the DIBL effect, which increases with technology scaling [16]. For the considered 45nm technology, we have the following values: S=96mV/dec and η=160mV/V. We carried out the functional yield calculation with a long-channel ideal S (77mV/dec) and without DIBL effect (η=0). Fig. 2 (right) shows that the functional failure rate is considerably reduced under these conditions.
In order to validate these observations, let us get a qualitative insight into the impact of these effects on functional yield through SNM. To this end, we examine the output level of a subthreshold inverter in DC operation, assuming an ideal high input level V dd. The output is V OL ≈ 0. The pull-down NMOS in ON state has V gs = V dd and V ds ≈ 0. The pull-up PMOS in OFF state has V gs = 0 and |V ds| ≈ V dd. The DIBL effect acts as a systematic detrimental V t mismatch between the ON device with low V ds and the OFF device with V ds ≈ V dd. The output voltage V OL suffers from this and increases somewhat. We also consider variability through a detrimental V t mismatch of 1σ. V OL can be found by equating the currents of the pull-down and pull-up devices:
with
For very low V dd values, V OL depends on F SNM, as 10^(−F SNM) is no longer negligible compared to 1. It follows that an increase in S implies a deterioration of V OL and thus of SNM. Variability directly affects V OL through σ Vt in F SNM. Moreover, the DIBL effect has an important impact through η, which further lowers F SNM. At V dd = 0.3V, an η value of 200 mV/V has the same impact on V OL as a V t mismatch of 60mV or an increase of S by 25%. Subthreshold-swing improvement and DIBL mitigation are thus very important for the robustness of subthreshold logic circuits in nanometer technologies.
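The dependence of V OL on S, σ Vt and η can be made explicit with a short derivation. The following is a sketch under stated assumptions, not necessarily the exact expressions used in this section: the subthreshold current is taken as I sub = I 0 · 10^((V gs + η V ds)/S) · (1 − e^(−V ds/U th)), the NMOS and PMOS devices share the same I 0, the 1σ mismatch raises the OFF-device current by a factor 10^(σ Vt/S), V OL is much smaller than V dd, and V dd is several times U th.

```latex
\begin{align*}
I_{N} &= I_0 \, 10^{(V_{dd} + \eta V_{OL})/S} \left(1 - e^{-V_{OL}/U_{th}}\right)
       \approx I_0 \, 10^{V_{dd}/S} \left(1 - e^{-V_{OL}/U_{th}}\right), \\
I_{P} &= I_0 \, 10^{(\eta (V_{dd} - V_{OL}) + \sigma_{Vt})/S} \left(1 - e^{-(V_{dd} - V_{OL})/U_{th}}\right)
       \approx I_0 \, 10^{(\eta V_{dd} + \sigma_{Vt})/S}, \\
I_{N} = I_{P} \;&\Rightarrow\; 1 - e^{-V_{OL}/U_{th}} = 10^{-F_{SNM}},
\qquad F_{SNM} = \frac{V_{dd}\,(1 - \eta) - \sigma_{Vt}}{S}, \\
V_{OL} &= -U_{th} \ln\!\left(1 - 10^{-F_{SNM}}\right).
\end{align*}
```

In this form, F SNM is lowered by η V dd/S and by σ Vt/S. At V dd = 0.3V, η = 200 mV/V gives η V dd = 60mV, which acts exactly like a 60mV V t mismatch, and reducing the numerator from V dd to 0.8 V dd is equivalent to multiplying S by 1.25, in line with the equivalences quoted above.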
B. Throughput constraint
Even ULP circuits have timing constraints, raised by the application. Circuits have to be fast enough to support the required data throughput, which ranges from medium to low values. Minimum V dd thus depends on the application through the throughput requirement. In this contribution, we consider 10 kOp/s and 10 MOp/s as the respective lower and upper bounds of the throughput region of interest for ULP applications. Fig. 3 shows the simulated delay of the benchmark multiplier, which depends exponentially on V dd, as subthreshold drain current does, up to 0.4V. Above 0.4V, increasing V dd reduces circuit delay less efficiently because devices leave the subthreshold regime.
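How a throughput requirement translates into a minimum V dd can be sketched numerically. The delay model below is an assumption (a purely subthreshold form, delay ∝ V dd/I on, so it does not reproduce the flattening above 0.4V), and L_D, C_L and I_0 are hypothetical values; only S and η come from the text.

```python
S = 0.096     # subthreshold swing [V/decade], from the text
ETA = 0.160   # DIBL coefficient [V/V], from the text

# Hypothetical circuit parameters (illustration only)
L_D = 30      # logic depth
C_L = 1e-15   # node capacitance [F]
I_0 = 2e-12   # reference current [A]


def delay(v_dd):
    """Assumed subthreshold delay: L_D * C_L * V_dd / I_on, with V_gs = V_ds = V_dd.

    The (1 - exp(-V_ds/U_th)) factor of the current model is neglected.
    """
    i_on = I_0 * 10 ** ((1.0 + ETA) * v_dd / S)
    return L_D * C_L * v_dd / i_on


def min_vdd_for_throughput(ops_per_s, lo=0.05, hi=1.0, iters=60):
    """Bisection for the lowest V_dd whose delay fits in one period.

    Relies on delay() decreasing with V_dd over [lo, hi].
    """
    period = 1.0 / ops_per_s
    if delay(lo) <= period:
        return lo
    if delay(hi) > period:
        raise ValueError("throughput not reachable within the search range")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delay(mid) <= period:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for tp in (1e4, 1e7):  # bounds of the ULP throughput range considered here
        print(f"{tp:.0e} Op/s -> V_dd >= {min_vdd_for_throughput(tp):.3f} V")
```

Because the assumed delay falls exponentially with V dd, three decades of throughput translate into a comparatively small change in the required supply voltage.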
IV. PRACTICAL ENERGY PER OPERATION
In the conventional computation of minimum energy from Section II, the period to perform the operation was assumed to be equal to the circuit delay. In practice, static energy due to I leak is consumed over the whole period determined by the applied throughput. For the sake of generality, let us first assume in this section that the circuit has no power-down feature to reduce static energy after completion of the operation. This will be addressed in Section V. Under this assumption, practical energy can be expressed as:
where Per is the period to achieve the operation, i.e. the inverse of data throughput. From Eq. (5), it is clear that lowering V dd is always beneficial to minimize practical energy. However, it was shown in Section III that V dd cannot be reduced to arbitrarily-low values because it raises robustness and speed issues. Therefore, lowest practical energy is reached when operating at the minimum V dd that meets both functional-yield and throughput constraints (just-in-time execution). Minimum V dd is represented in Fig. 4 (left) vs. throughput. Notice once again that 3σ worst-case delay is chosen for verifying the throughput constraint. From this figure, the throughput space can be divided into two regions depending on the constraint that gives the highest limit for minimum V dd . For the considered 45nm technology with std-V t devices, minimum V dd is limited by functional yield when throughput constraint is lower than 300 kOp/s. Practical energy per operation of the benchmark multiplier under minimum V dd from Fig. 4 (left) has been simulated. It is shown in Fig. 4 (right) vs. required throughput, dictated by the application. Throughput space can once more be divided into two new regions, depending on the dominating energy source. For the considered technology, above 7 MOp/s, dynamic energy dominates and the circuit is "energy efficient" because consumed energy actually contributes to perform the operation. However, when throughput is lower than 7 MOp/s, static energy dominates and the circuit is thus "energy inefficient". Moreover, for throughput below 300 kOp/s, static energy dramatically increases as V dd cannot be reduced further because of functional-yield constraint. There is a minimum practical energy point, which corresponds exactly to the minimum energy point in the conventional approach of Section II. Indeed, when minimum V dd is throughput-limited, E stat from conventional and practical approaches are the same because Del = Per from Eq. (1) and (5). 
Minimum practical energy occurs when the required throughput leads to a V dd value equal to the optimal V dd of the minimum-energy point from Fig. 1. Minimum energy per operation is thus reached in practice at only one particular data throughput. As throughput is determined by the application, it cannot be tuned by the designer, except with architectural modifications. This shows that minimum energy as described in previous articles on subthreshold logic is a theoretical concept. The minimum-energy point can thus only be reached in practice through optimization techniques.
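The just-in-time rule can be sketched as follows: pick the lowest V dd that satisfies both constraints, then integrate leakage over the full period rather than over the delay, i.e. E = N sw C L V dd² + V dd I leak Per (an assumed form, consistent with the definitions of Per and Del above). The 0.25V yield limit, S and η come from the text; the delay and leakage models and all circuit parameters are hypothetical, and the 3σ worst-case delay margin is not modeled.

```python
S = 0.096        # subthreshold swing [V/decade], from the text
ETA = 0.160      # DIBL coefficient [V/V], from the text
V_YIELD = 0.25   # minimum V_dd for 99.9 % functional yield [V], from the text

# Hypothetical circuit parameters (illustration only)
N_SW = 200       # switched nodes per operation
N_GATES = 1000   # gates contributing to leakage
C_L = 1e-15      # node capacitance [F]
L_D = 30         # logic depth
I_0 = 2e-12      # reference current [A]


def delay(v_dd):
    """Assumed subthreshold delay of the critical path."""
    return L_D * C_L * v_dd / (I_0 * 10 ** ((1.0 + ETA) * v_dd / S))


def i_leak(v_dd):
    """Assumed total leakage: OFF devices with V_gs = 0 and V_ds = V_dd."""
    return N_GATES * I_0 * 10 ** (ETA * v_dd / S)


def min_vdd(ops_per_s, step=0.001):
    """Lowest V_dd meeting both the yield and the throughput constraints."""
    period = 1.0 / ops_per_s
    v_dd = V_YIELD
    while delay(v_dd) > period:
        v_dd += step
    return v_dd


def practical_energy(ops_per_s):
    """Return (V_dd, E_dyn, E_stat) with leakage integrated over the period."""
    period = 1.0 / ops_per_s
    v_dd = min_vdd(ops_per_s)
    e_dyn = N_SW * C_L * v_dd ** 2
    e_stat = v_dd * i_leak(v_dd) * period
    return v_dd, e_dyn, e_stat


if __name__ == "__main__":
    for tp in (1e4, 1e6, 1e7):
        v, e_dyn, e_stat = practical_energy(tp)
        print(f"{tp:.0e} Op/s: V_dd = {v:.3f} V, "
              f"E_dyn = {e_dyn:.2e} J, E_stat = {e_stat:.2e} J")
```

With these placeholder values, static energy dominates at 10 kOp/s, where V dd is pinned at the yield limit, while dynamic energy dominates at 10 MOp/s. The crossover throughputs depend entirely on the assumed parameters.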
Let us summarize our observations. When looking at practical energy per operation, application throughput space can be divided into three regions:
• energy-efficient R1 region, where dynamic energy dominates;
• energy-inefficient R2 region, where static energy dominates and minimum V dd is limited by the throughput constraint;
• energy-inefficient R3 region, where minimum V dd is limited by the functional-yield constraint.
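This three-way split can be written as a small decision rule. The sketch below is only an illustration: the function name and its arguments are hypothetical, and it assumes that the yield-limited case is tested first, so that any throughput at which the functional-yield constraint sets V dd is labeled R3.

```python
def classify_region(e_dyn, e_stat, v_yield_limit, v_throughput_limit):
    """Label an operating point R1, R2 or R3.

    e_dyn, e_stat: dynamic and static energy per operation at minimum V_dd [J]
    v_yield_limit: minimum V_dd imposed by functional yield [V]
    v_throughput_limit: minimum V_dd imposed by the throughput constraint [V]
    """
    if v_yield_limit >= v_throughput_limit:
        return "R3"  # V_dd limited by functional yield
    if e_dyn >= e_stat:
        return "R1"  # energy efficient: dynamic energy dominates
    return "R2"      # static energy dominates, V_dd limited by throughput


if __name__ == "__main__":
    print(classify_region(1e-14, 1e-13, v_yield_limit=0.25, v_throughput_limit=0.10))
```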
As shown in Fig. 5 , operating conditions do not change the picture but shift the throughput limits between these regions. A temperature increase implies a higher minimum V dd to get sufficient functional yield, thereby extending R3. In combination with the increase of reference current I 0 with temperature (V t lowering), this translates into a static energy overhead in R3. Static energy in R2 increases too due to subthreshold swing degradation.
When considering clock-gated circuits, the activity factor α f is reduced. Fig. 5 shows that static energy is identical whereas dynamic energy is lower. This extends R2 region and increases the difference between minimum energy level and practical energy in R3.
Pulling practical energy toward minimum energy point requires different achievements for each application throughput region. In R1 region, subthreshold current has to be increased in order to reduce delay and thus lower V dd while meeting throughput constraint. In R2 region, subthreshold current has to be reduced at the expense of increased delay and thus V dd to meet throughput constraint. In R3 region, subthreshold current has to be reduced and functional yield has to be increased in order to lower V dd .
Looking at the throughput region of interest for ULP applications (≈ 10 kOp/s to 10 MOp/s) in Fig. 4 (right), we see that the circuit is nearly always energy inefficient because static energy dominates (R2 and R3). Therefore, in the next section, we address techniques to improve energy efficiency in the R2 and R3 throughput regions.
V. MINIMIZING PRACTICAL ENERGY
In this section, we investigate the minimization of practical energy by circuit design techniques. Technology optimizations on one side and architectural optimizations on the other side are beyond the scope of this article. We first examine V t selection in a multi-V t technology. We then focus on run-time leakage reduction techniques: body biasing as in [13] and MTCMOS power gating as in [12]. We finally investigate device upsizing schemes to increase robustness: device width [12] and channel length [13] upsizes.
A. V t selection
Nanoscale technologies often feature devices with various V t values to trade off speed against leakage power. Here, we consider a triple-V t 45nm technology with 0.27V low-, 0.37V std- and 0.46V high-V t devices. A higher V t is achieved through increased channel doping, which results in lower subthreshold current. However, higher doping also leads to a higher subthreshold swing because of higher depletion capacitance in the channel, and to higher variability due to random doping fluctuation through σ Vt, calculated from [14].
Minimum V dd is represented in Fig. 6 (left) for the 3 device types. Increasing V t through channel doping results in a higher minimum V dd to meet the throughput constraint, as it increases circuit delay through lower subthreshold current. Moreover, high-V t devices have a higher minimum V dd to meet the 99.9% functional-yield constraint because of higher variability and subthreshold swing. Notice that higher doping also decreases the DIBL effect, but the impact of this reduction on SNM is negligible compared to the impact of higher variability and subthreshold swing. Simulated practical energy of the benchmark multiplier is shown in Fig. 6 (right). The use of low-V t devices reduces dynamic energy, as it allows V dd to be lowered while meeting the throughput constraint. Low-V t devices are thus suited to the upper part of the throughput space of ULP applications. The use of high-V t devices reduces static energy and is thus suited to the lower part of the throughput space. In effect, a shift in V t results in a shift of the practical energy curve vs. throughput. V t selection thus extends the throughput range where practical energy is close to the minimum-energy point. It is worth noticing that the lowest minimum energy is achieved with low-V t devices, as minimum energy is proportional to S² [15] and increases with variability [7].
B. Leakage reduction techniques
There are several ways to reduce subthreshold leakage current. Let us focus on 2 of them. The first technique we consider is the application of a Reverse Body Bias (RBB). This technique reduces MOSFET subthreshold leakage with a cost overhead because it requires a triple-well process and the generation of 2 bias voltages (−RBB and V dd + RBB). In such low-voltage ULP applications, we consider a maximum affordable RBB of -0.6V. Minimum operating V dd and the corresponding practical energy per operation vs. the applied RBB are plotted in Fig. 7 in the R3 (10 kOp/s) and R2 (1 MOp/s) throughput regions. In R3, RBB increases minimum V dd because the higher subthreshold swing degrades functional yield. Nevertheless, practical energy is lowered thanks to leakage reduction but stays 25× higher than the minimum energy level. In R2, RBB also increases minimum V dd, as circuit delay increases due to reduced subthreshold current. There is only a small energy gain, because what is gained in static energy is lost in dynamic energy as minimum V dd increases. There is thus an optimum RBB in R2 close to -0.2V.
The second technique we consider is MTCMOS power gating: the addition of a high-V t NMOS footer to cut off leakage by entering sleep mode when the operation is completed. As explained in the introduction, this technique has a high design cost. As shown in Fig. 7 (left) for several footer widths (relative to the total NMOS width of the circuit), the MTCMOS technique increases the functional-yield-limited minimum V dd in R3 because it introduces a voltage shift of the circuit GND node, thereby reducing its effective V dd. MTCMOS efficiently reduces practical energy, which tends toward the minimum energy level when decreasing footer width. However, there is a lower bound on the footer width, as a narrower footer increases the GND voltage shift and could cause logic-level integrity issues at the interface between the MTCMOS and non-MTCMOS (flip-flops) parts of the circuit.
This results in a practical energy limit equal to 4× the level of the minimum-energy point. In R2, minimum V dd is raised because MTCMOS has a delay penalty. Notice that, in R2, minimum V dd does not lead to optimal energy. Indeed, raising V dd somewhat reduces static energy in active mode thanks to delay reduction. For the considered benchmark circuit, the optimal V dd, which is plotted in Fig. 7, is about 70mV higher than minimum V dd. The corresponding optimal energy is reduced to a value close to minimum energy. However, notice that the energy of the flip-flops, which cannot be power-gated, and the power-down/wake-up energies are not considered and would imply energy overheads.
C. Device upsizing schemes
Upsizing the devices reduces the variability induced by random doping fluctuation, which is proportional to 1/√(WL). The device dimensions that can be upsized at layout level are the width and the length. First, we consider an upsize of the device width to lower minimum operating V dd through a functional-yield increase, as proposed in [8]. Minimum V dd and the corresponding practical energy extracted from simulations are shown in Fig. 8, again for the R3 and R2 throughput regions. In both regions, minimum V dd is reduced by width upsize thanks to functional-yield improvement (R3) and delay variability mitigation (R2). However, the corresponding practical energy increases because the increase in switched capacitances and leakage currents outweighs the benefit of V dd reduction.
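The 1/√(WL) dependence gives a quick estimate of how much an upsize reduces random-doping variability. A minimal sketch, with a hypothetical function name and arguments, taking upsize factors relative to the minimum-size device:

```python
import math


def sigma_vt_scale(width_factor=1.0, length_factor=1.0):
    """Relative sigma_Vt after upsizing, assuming sigma_Vt ~ 1/sqrt(W*L)."""
    if width_factor <= 0 or length_factor <= 0:
        raise ValueError("upsize factors must be positive")
    return 1.0 / math.sqrt(width_factor * length_factor)


if __name__ == "__main__":
    print(sigma_vt_scale(width_factor=2.0))   # doubling W
    print(sigma_vt_scale(length_factor=2.0))  # doubling L
```

Under this scaling alone, doubling W or doubling L reduces σ Vt by the same factor of 1/√2, about 0.71; any difference between the two schemes therefore has to come from other effects.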
Second, the channel length can be upsized to improve subthreshold operation, as it not only mitigates variability but also improves subthreshold swing. Moreover, we observe that DIBL mitigation makes length upsize much more efficient than width upsize in reducing minimum V dd in R3, as shown in Fig. 8 (left). In addition, subthreshold swing improvement combined with this DIBL mitigation and V t roll-off considerably lowers energy through leakage reduction. Doubling the effective channel length reduces practical energy by a factor of 50, with a small area overhead (only a 17.5nm upsize). In R2, minimum V dd increases, as a length upsize has a detrimental impact on delay. At the same time, subthreshold swing improvement increases energy efficiency with a small capacitance overhead. Taken together, these effects yield an optimal channel length, which reduces practical energy by 55%, to below the level of the minimum-energy point.
VI. CONCLUSION
In this contribution, we introduced a framework to analyze practical energy per operation of subthreshold circuits, taking functional yield and application throughput constraints into account. Throughput space can be divided in 3 regions depending on the dominating energy source and the limiting constraint on minimum V dd . We showed that practical energy in 45nm technology can be far higher than minimum energy if required throughput is very low (10 kOp/s, R3 region).
We investigated practical energy minimization through reverse body biasing, MTCMOS power gating, V t selection and device upsize. We showed for the first time their impact on minimum operating V dd. As shown in Table I, the traditionally used MTCMOS technique yields practical energy close to minimum energy at medium throughput (1 MOp/s, R2 region). However, it gives poor results at low throughputs, with a practical energy more than 20× the minimum energy level. Channel length upsize is shown to be the most efficient technique thanks to subthreshold swing improvement and to variability and DIBL mitigation. It reduces practical energy to less than 2.1× minimum energy at low throughputs and it even yields a practical energy 30% lower than minimum energy for medium throughputs, at a small area cost. Additionally, we demonstrated for the first time the detrimental impact of DIBL on minimum operating V dd.
