Sub-threshold operation is a compelling approach for energy-constrained applications, but increased sensitivity to variation must be mitigated. We explore variability metrics and the variation sensitivity of stacked device topologies. We show that upsizing is necessary to achieve robustness at reduced voltages and propose a design methodology to meet yield constraints. The need for upsizing imposes an energy overhead, influencing the optimal supply voltage to minimize energy. Finally, we characterize performance variability by summing delay distributions of each stage in an arbitrary critical path and achieve results accurate to within 10% of Monte Carlo simulation.
INTRODUCTION
In sub-threshold circuits, the power supply is set below the transistor threshold voltage VT to obtain energy savings when speed is not the primary constraint [1]. The authors of [2], [3] derived analytical expressions for the optimum VDD to minimize energy in sub-threshold and showed its dependence on major circuit parameters. Sub-threshold circuits rely on leakage currents that are exponentially dependent on VT and are therefore more sensitive to process variation than traditional above-threshold designs.
It was suggested in [4] that minimum size devices are theoretically optimal for minimizing energy in sub-threshold. However, minimum size devices have increased sensitivity to VT variation because σVT is roughly proportional to $(WL)^{-1/2}$.
If a minimum size circuit does not function at the optimum VDD due to degraded logic output swing, it is necessary to upsize devices to improve robustness at the expense of increased energy consumption. Therefore, variability must be considered when analyzing the minimum energy operating point.
Previous work in [5] addresses intra-die variation by providing statistical models for energy and delay of an inverter chain in sub-threshold. An empirical expression for the optimum voltage is shown as a function of logic depth, assuming complete functionality at Vmin. Work in [6] presents a unified delay variability expression for strong- and weak-inversion and applies it to a NAND gate. Researchers have also proposed various approaches to optimize delay yield by tuning VDD/VT or choosing gates of different drive strengths, for example in [7]. However, functional yield was not considered until [8], [9], which address unsatisfactory VOH and VOL in sub-threshold inverters whose output levels are degraded by leaking devices, such as in a register file. Body biasing is another option for mitigating variation in sub-threshold [10] when a triple-well process is available.
We address inter- and intra-die variation and show that functionality in sub-threshold circuits may be compromised without proper design for variations. We first explore variability metrics for the inverter and logic gates with stacked devices, and propose a metric to size logic gates for a fixed failure rate under process variation. We then examine the energy versus VDD profile given the failure rate constraint and find the optimum sizing and supply voltage. We present an efficient methodology to model delay variability of a chain of logic gates and characterize the effect of yield-based sizing constraints on performance variability.
VARIABILITY METRICS AND DEVICE SIZING
A commonly used expression for sub-threshold current is given by [11]

$$I_{sub} = I_o \, e^{\frac{V_{GS} - V_T + \eta V_{DS}}{n V_{th}}} \left(1 - e^{-\frac{V_{DS}}{V_{th}}}\right)$$

where n is the sub-threshold swing factor, $V_{th}$ the thermal voltage, and η the DIBL coefficient. The nominal current scales linearly with W/L, while the standard deviation of the VT distribution scales as $(WL)^{-1/2}$, so larger devices show lower sub-threshold current variation. This section explores how sizing affects variability in output swing and active current in the inverter and stacked device topologies.
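The two sizing trends above can be illustrated numerically. The sketch below assumes the standard form I_sub = I_o·exp((V_GS − V_T + ηV_DS)/(nV_th))·(1 − exp(−V_DS/V_th)) and a σVT that falls as the inverse square root of width at fixed length. All parameter values (`i_o`, `n`, `eta`, `sigma_vt_min`, the 0.40 V threshold) are hypothetical placeholders, not values from this work.

```python
import math

V_TH = 0.0259  # thermal voltage kT/q near room temperature, in volts


def i_sub(v_gs, v_ds, v_t, i_o=1e-6, n=1.5, eta=0.1):
    """Sub-threshold drain current in amperes (illustrative parameter values)."""
    exponent = (v_gs - v_t + eta * v_ds) / (n * V_TH)
    return i_o * math.exp(exponent) * (1.0 - math.exp(-v_ds / V_TH))


def sigma_vt(width_norm, sigma_vt_min=0.030):
    """sigma_VT (volts) of a device `width_norm` times minimum width, fixed length."""
    return sigma_vt_min / math.sqrt(width_norm)


nominal = i_sub(v_gs=0.24, v_ds=0.24, v_t=0.40)
for w in (1, 2, 4):
    s = sigma_vt(w)
    slow = i_sub(0.24, 0.24, 0.40 + s)  # device whose VT is one sigma high
    print(f"W = {w}x: sigma_VT = {s * 1e3:.1f} mV, "
          f"I(+1 sigma) / I(nominal) = {slow / nominal:.2f}")
```

The ratio printed for each width moves toward 1 as the device is upsized, which is the reduction in current spread that the text attributes to larger W·L. The linear growth of the nominal current with W/L is left out here because it cancels in the ratio.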
Logic Gate Output Swing
In the sub-threshold regime, the ratio of active to idle currents in a logic gate is much lower than in strong inversion. If, for example, process variation strengthens NMOS relative to PMOS, a pull-up network will not be able to drive the logic gate output fully to VDD because of idle leakage in the pull-down network. This degradation in gate output swing is illustrated in Figure 1(a). The solid line shows the voltage transfer characteristic (VTC) of a minimum size inverter in a 65nm technology at a skewed global process corner. Dashed lines plot the VTCs when random local VT mismatch is applied to the inverter. One case shows a severely degraded VOL, which can cause a functional error if it is above the input low threshold (VIL) of the succeeding gate. Therefore, VT variation significantly impacts circuit functionality in deeply scaled technologies.
A consistent metric is necessary to determine whether a logic gate has sufficient VOL and VOH levels. Arbitrary limits, such as 10% and 90% of VDD, do not scale well across global process corners. For example, at the strong-PMOS weak-NMOS corner, strong leakage through PMOS raises VOL of all gates above ground. This also shifts VTCs to the right, and thus logic gates can tolerate a higher VOL in the preceding gate. Instead of arbitrary limits, we propose using butterfly plots to verify output voltage levels, specifically in the context of standard cell design.
Use of the Butterfly Plot
To verify VOL of a given gate, we superimpose its VTC with the mirrored VTC of NOR, since the latter has the most stringent VIL requirement from stacked devices in the pull-up network and parallel devices in the pull-down. Similarly, we verify VOH using the NAND VTC, which has the worst case VIH.
In Figure 1(b), a NAND gate has sufficient output swing such that VOL−NAND produces a logic high output in a succeeding NOR gate. In contrast, the NAND gate in Figure 1(c) exhibits VOL−NAND = 65mV and produces a NOR output of 136mV, close to mid-rail and thus causing logic failure.
A gate with failing output levels is analogous to a 6T SRAM cell displaying negative static noise margin (SNM), in that the butterfly plots for both cases do not contain an inscribed square. Therefore, we can also apply the method of [12] to find the side of the largest inscribed square, illustrated in Figure 1(b). Figure 1(d) shows an equivalent circuit for this measurement on two back-to-back logic gates. Because the VTC is input-dependent, all inputs are varied simultaneously to obtain the worst case VIH and VIL.
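A butterfly plot loses a lobe, and with it the inscribed square, when the two superimposed curves intersect only once instead of three times. The sketch below tests for that condition by counting solutions of x = f_load(f_gate(x)). It does not compute the side of the inscribed square as in [12]. The logistic `vtc` model and every numeric level in it are hypothetical stand-ins for simulated VTCs.

```python
import numpy as np

VDD = 0.24  # supply voltage in volts


def vtc(v_in, v_ol, v_oh, v_m, k=150.0):
    """Toy inverting VTC: logistic fall from v_oh to v_ol around switching point v_m."""
    return v_ol + (v_oh - v_ol) / (1.0 + np.exp(k * (v_in - v_m)))


def count_intersections(gate_vtc, load_vtc, n_points=2001):
    """Count crossings of y = gate_vtc(x) with the mirrored curve x = load_vtc(y)."""
    x = np.linspace(0.0, VDD, n_points)
    h = load_vtc(gate_vtc(x)) - x      # zero exactly where the two curves intersect
    signs = np.sign(h)
    signs = signs[signs != 0]          # skip grid points that land on a crossing
    return int(np.count_nonzero(signs[1:] != signs[:-1]))


def has_both_lobes(gate_vtc, load_vtc):
    """Three intersections (two stable, one metastable) mean both lobes exist."""
    return count_intersections(gate_vtc, load_vtc) >= 3


def load(v):        # succeeding gate with a low switching point (strict VIL)
    return vtc(v, v_ol=0.005, v_oh=0.235, v_m=0.06)


def healthy(v):     # gate under test with near-rail output levels
    return vtc(v, v_ol=0.005, v_oh=0.235, v_m=0.12)


def degraded(v):    # gate under test whose VOL has risen to 65 mV
    return vtc(v, v_ol=0.065, v_oh=0.235, v_m=0.12)


print(has_both_lobes(healthy, load), has_both_lobes(degraded, load))
```

With these toy curves the healthy gate yields three intersections and the degraded gate only one, so the second case would be counted as a logic failure.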
It was shown in [13] that the SNM of two back-to-back gates G1 and G2 is equal to the maximum noise that can be applied to all gates in an infinitely long chain of alternating G1 and G2, before logic failure occurs. Thus when verifying a standard cell G using the butterfly plot, we essentially assume that all logic paths in a synthesized circuit are composed of alternating G and NAND3 gates with the same two skewed VTCs. To accurately model the failure rate of a custom-designed logic path, we would plot VTCs of all gates and trace the signal propagation through the path. Exact modeling is not possible for standard cell design where the target circuit is unknown. Therefore, although the butterfly plot does not reflect the exact mismatch conditions in a circuit, it does provide a guideline for sizing standard cells consistently to account for local variation.
Failure Rate From Insufficient Output Swing
We now define logic failure as having no inscribed square in the butterfly plot and measure how the failure rate varies with VDD and device sizing. To consider logic gates with up to three stacked devices, we verify the INV, NAND2, and NOR2 gates against NAND3 and NOR3, which give the most stringent VIH and VIL requirements respectively. Sizing of NAND3 and NOR3 are fixed to provide a starting point for designing the remaining gates.
The failure rate is estimated from a 5k-point Monte Carlo simulation at worst case temperature. VT of transistors in the gate under test and global (inter-die) process conditions are randomized such that the Monte Carlo runs are analogous to sampling logic gates across multiple dies. Figure 3 plots the failure rate versus device width (normalized to minimum size) of INV, NAND2, and NOR2, with VDD set at 240mV for demonstration. In the inverter, both device sizes are varied simultaneously. In NAND2 and NOR2, the critical two-transistor stack is varied while the two parallel devices are kept constant. The failure rates also decay exponentially with widths. By increasing the device width or VDD, the failure rate can be made to approach 0.
Noise Margin in Registers
The concept of noise margin is also relevant in sub-threshold register design, where data retention is a particular challenge. Dynamic registers suffer from charge leakage, which worsens in sub-threshold due to slow circuit speeds. Therefore, we consider the static transmission-gate based register. Similar to SRAM cells, the data retention capability of the register is reflected in the hold static noise margin of its cross-coupled inverters. Figure 4 shows the equivalent circuit for measuring the register SNM, accounting for the voltage drop across T2 and the worst case leakage across T1. This circuit is used in a Monte Carlo simulation while varying the VT of each transistor and inter-die process conditions. Figure 2(b) plots the resulting failure rate in the cross-coupled inverters. Similar to the case of logic gates, the failure rate decays exponentially toward zero when either width or VDD is increased.
Current Variability
In addition to output swing, active current variability is another metric of interest since it relates directly to variation in propagation delay. With the common assumption that VT is normally distributed, sub-threshold current can be modeled as a lognormal random variable. From the property of lognormal distributions, the coefficient of variation of active current is given by

$$\frac{\sigma_{I_{sub}}}{\mu_{I_{sub}}} = \sqrt{e^{\left(\frac{\sigma_{V_T}}{n V_{th}}\right)^2} - 1} \tag{3}$$
It was observed in [5] that as VDD reduces, the sub-threshold swing factor n decreases. This leads to higher uncertainty in the sub-threshold current through a single device. To examine the impact of topology, Figure 5 plots simulated σIsub/μIsub versus device width for static CMOS primitives consisting of one to three devices in series. Variability decreases with larger widths as expected. Stacked device topologies clearly display lower spread in active currents.
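The lognormal property used here can be checked with a short numerical experiment. If VT is normal with standard deviation σVT, then ln(I_sub) is normal with standard deviation σVT/(nV_th), and the coefficient of variation is sqrt(exp((σVT/(nV_th))²) − 1). The sketch below compares that closed form with a sampled estimate; the values of σVT and n are hypothetical.

```python
import numpy as np

V_TH = 0.0259  # thermal voltage in volts


def current_cov(sigma_vt, n):
    """Closed-form sigma/mu of a lognormal current for normally distributed VT."""
    s = sigma_vt / (n * V_TH)            # standard deviation of ln(I_sub)
    return float(np.sqrt(np.exp(s ** 2) - 1.0))


def current_cov_mc(sigma_vt, n, samples=200_000, seed=0):
    """Sampled sigma/mu of the current, normalized to its nominal value."""
    rng = np.random.default_rng(seed)
    d_vt = rng.normal(0.0, sigma_vt, samples)    # VT deviation from nominal
    current = np.exp(-d_vt / (n * V_TH))
    return float(current.std() / current.mean())


for n in (1.5, 1.3):
    print(f"n = {n}: closed form = {current_cov(0.020, n):.3f}, "
          f"sampled = {current_cov_mc(0.020, n):.3f}")
```

A smaller n gives a larger coefficient of variation for the same σVT, which matches the observation that current uncertainty grows as VDD is reduced.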
Constant Yield Device Sizing
We now address the issue of device sizing for single and stacked device topologies, given the metrics of output swing and current variability. In above-threshold design, series devices are sized to give equivalent resistance as the inverter. However, in sub-threshold design when the objective is to minimize energy, device sizes should be kept as small as possible while satisfying variability constraints.
Compared to a single device, stacked devices display lower current spread but higher uncertainty in output levels, which may lead to functional errors. Reducing the error rate clearly takes precedence, so output swing rather than current variability should be considered first in sizing decisions.
The output swing failure rate versus width plot of Figure 3 illustrates a sizing methodology for single and stacked devices. Suppose we constrain all topologies to have the same failure rate, or interchangeably, a constant yield. We obtain the required device sizes by drawing a horizontal line at the desired failure rate, then finding where this line intersects the failure curve and the corresponding x-axis value. In Figure 3, a target failure rate of 0.13% requires a single and 2-stack NMOS to be sized at 2 and 4.43 times minimum width respectively. 1-PMOS is sized the same as 1-NMOS as both devices are varied together in simulation. The 2-stack sizing here can be used for any static CMOS gate with two series NMOS, since it was derived from NAND2 where two leaking parallel PMOS give the worst case VOL.
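The graphical procedure of drawing a horizontal line across the failure curve amounts to inverting the curve at the target failure rate. A minimal sketch is shown below, assuming the simulated failure rate falls monotonically with width and interpolating in the log domain because the decay is roughly exponential. The sample points are synthetic and are not the data of Figure 3.

```python
import numpy as np


def width_for_failure_rate(widths, failure_rates, target):
    """Smallest normalized width whose interpolated failure rate meets `target`.

    Assumes `widths` is increasing and `failure_rates` is strictly decreasing.
    """
    widths = np.asarray(widths, dtype=float)
    log_rates = np.log(np.asarray(failure_rates, dtype=float))
    log_target = np.log(target)
    if log_target < log_rates.min():
        raise ValueError("target lies below the simulated range; simulate larger widths")
    if log_target >= log_rates.max():
        return float(widths.min())       # the smallest simulated device already passes
    # np.interp needs increasing x values, so reverse both arrays
    return float(np.interp(log_target, log_rates[::-1], widths[::-1]))


# Synthetic failure curve: exponential decay with normalized width (hypothetical)
widths = np.arange(1, 7)
rates = 0.05 * np.exp(-0.8 * (widths - 1))
w_required = width_for_failure_rate(widths, rates, target=0.0013)
print(f"required width = {w_required:.2f} x minimum")
```

Repeating the lookup for each topology and each supply voltage would produce a table of constant-yield widths.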
Because the failure rate reduces at higher VDD, the required size for a given yield constraint also decreases. The resulting energy trade-off will be analyzed in Section 3.1. Table 1 lists device widths for a constant failure rate of 0.13% while VDD is varied at 20mV intervals. 0.13% represents the 3σ tail of a normal distribution and is chosen for demonstration. It should be noted that such a target allows sizing logic gates consistently, but does not relate in a straightforward way to the failure rate of a circuit built from these gates. As mentioned previously, this value is a pessimistic estimate because it assumes that every second gate in the circuit is NAND3 or NOR3. Furthermore, failing logic gates tend to cluster on die at process corners. 
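The 0.13% figure can be checked directly as the one-sided tail of a standard normal distribution beyond three standard deviations. The helper name `upper_tail` is illustrative.

```python
import math


def upper_tail(k_sigma):
    """P(Z > k_sigma) for a standard normal random variable Z."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2.0))


print(f"{upper_tail(3.0):.4%}")
```

The result is about 0.135%, which the text rounds to 0.13%.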
MINIMUM ENERGY OPERATION
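The total energy per operation consumed by an arbitrary circuit is modeled in [2] as

$$E_T = E_{DYN} + E_L = C_{eff} V_{DD}^2 + W_{eff} \, I_{leak} \, V_{DD} \, t_d \, L_{DP} \tag{4}$$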
EDYN and EL model the dynamic switching and leakage energy per cycle respectively. Ceff and Weff denote the average total switched capacitance and normalized width contributing to leakage current. td and Ileak represent the delay and leakage current of a characteristic inverter, while LDP is the logic depth in terms of the inverter delay. As VDD decreases, EDYN is lowered quadratically. The leakage current reduces because of DIBL, but td goes up exponentially at sub-threshold voltages and causes a similar increase in leakage energy. The two opposing trends give rise to an optimal supply voltage VDDopt at which total energy is minimized, assuming the circuit is functional.
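The opposing trends can be reproduced with a small numerical sweep. The sketch below evaluates E_T = C_eff·VDD² + W_eff·I_leak·VDD·t_d·L_DP on a grid of supply voltages, using simplified exponential models for t_d and I_leak. Every parameter value is a hypothetical placeholder chosen only so that the minimum falls in the sub-threshold range; none is fitted to the 65nm process discussed here.

```python
import numpy as np

V_TH = 0.0259  # thermal voltage in volts


def energy_per_op(vdd, c_eff=1e-13, w_eff=1000.0, l_dp=30.0,
                  k=1.0, c_g=1e-15, i_o=1e-6, v_t=0.40, n=1.5, eta=0.1):
    """Return (E_DYN, E_L) in joules; all default values are illustrative."""
    i_on = i_o * np.exp((vdd - v_t) / (n * V_TH))             # active current
    t_d = k * c_g * vdd / i_on                                # characteristic inverter delay
    i_leak = i_o * np.exp((-v_t + eta * vdd) / (n * V_TH))    # idle current with DIBL
    e_dyn = c_eff * vdd ** 2
    e_leak = w_eff * i_leak * vdd * t_d * l_dp
    return e_dyn, e_leak


vdd = np.linspace(0.15, 0.50, 351)
e_dyn, e_leak = energy_per_op(vdd)
e_total = e_dyn + e_leak
vdd_opt = float(vdd[np.argmin(e_total)])
print(f"VDDopt = {vdd_opt * 1e3:.0f} mV")
```

Below the optimum the leakage term dominates because t_d grows exponentially; above it the quadratic switching term dominates. Making `c_eff` and `w_eff` functions of `vdd` is one way to fold a sizing constraint into the same sweep.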
Section 2 has shown that functionality is no longer guaranteed at low supply voltages when VT variation is significant. Reducing the probability of logic failure requires either upsizing devices or increasing VDD, which must be considered when finding VDDopt. This can be accounted for within the framework of [2] by treating Ceff and Weff as functions of VDD. The resulting energy versus VDD characteristics of an inverter chain and a 32-bit Kogge-Stone adder are simulated in a 65nm process and presented as examples. Figure 6 plots Ceff and Weff versus VDD for the Kogge-Stone adder under two sizing schemes. The solid line corresponds to designs satisfying an upper bound on the output swing failure rate, derived from the constant yield sizing of Table 1. The dashed line indicates an adder with only minimum size devices. Note that Weff is obtained by normalizing the adder leakage current to that of a characteristic inverter [2]. DIBL affects leakage through the two circuits differently as VDD decreases, causing a slight increase in Weff in this case. VDDcrit denotes the critical operating voltage at which minimum size devices can be used to satisfy the yield constraint. When VDD ≥ VDDcrit, the circuits under both schemes are identical.
Minimum Energy Point with Yield Constraint
It should be noted that once the yield constraint is set, VDDcrit can be found immediately from Table 1 and the topology of a given circuit. For example, a circuit without stacked devices does not require upsizing when VDD ≥ VDDcrit = 300mV. In contrast, a circuit with stacks of two NMOS has VDDcrit = 340mV. The switching, leakage, and total energy of the inverter chain and adder are then calculated according to Equation 4. Figure 7(a) plots the energy versus VDD characteristic of the inverter chain at nominal process and temperature. Total energy in both the constant yield and minimum sized chains is dominated by the dynamic component. Therefore, the optimum supply voltage of the minimum size chain (dashed line) is the lowest VDD at which yield constraints are met. By definition, this is equal to VDDcrit. In the constant yield sizing scheme (solid line), reducing the supply below VDDcrit necessitates an increase in device widths. The resulting rise in Ceff dominates total energy. In this situation, there is no benefit from upsizing in order to operate at lower VDD. The optimum operating point is with minimum sizing at the lowest VDD permitted by the failure rate constraint.
When the minimum size circuit does have a local minimum in its energy characteristic, three scenarios exist depending on the relationship between VDDcrit and the optimum VDD of the constant yield (VDDopt−CY ) and minimum sizing (VDDopt−MS) schemes.
Case 1) VDDopt−MS > VDDcrit: No upsizing is required to operate at the minimum energy point, so a minimum sized circuit at VDDopt−MS yields optimum energy.
Case 2) VDDopt−MS < VDDopt−CY < VDDcrit: A minimum size circuit cannot operate at VDDopt−MS without violating failure rate constraints. A circuit suitably upsized to operate at VDDopt−CY yields optimum energy while satisfying yield requirements.
Case 3) VDDopt−MS < VDDopt−CY = VDDcrit: At VDDcrit, the circuits under both sizing schemes are identical. Therefore a minimum size circuit operating at VDDcrit provides minimum energy.
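The three scenarios reduce to a short decision rule once the two optima and VDDcrit are known. The sketch below is one way to encode it. The function name and return format are illustrative, and treating VDDopt−MS equal to VDDcrit as Case 1 is an assumption, since the text states that case with a strict inequality.

```python
def select_operating_point(vdd_opt_ms, vdd_opt_cy, vdd_crit):
    """Return (case, supply voltage in volts, sizing scheme) for the scenarios above."""
    if vdd_opt_ms >= vdd_crit:
        # Case 1: the unconstrained optimum already meets the yield constraint
        return 1, vdd_opt_ms, "minimum"
    if vdd_opt_cy < vdd_crit:
        # Case 2: upsizing to run below VDDcrit still lowers total energy
        return 2, vdd_opt_cy, "constant-yield upsized"
    # Case 3: the constrained optimum sits at VDDcrit, where both schemes coincide
    return 3, vdd_crit, "minimum"


# Voltages quoted in the text for the 32-bit Kogge-Stone adder
print(select_operating_point(vdd_opt_ms=0.28, vdd_opt_cy=0.30, vdd_crit=0.34))
```

For the adder voltages quoted in the text (280mV, 300mV, and 340mV) the rule selects Case 2.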
An example of case 2 is seen in Figure 7(b) for a synthesized 32-bit Kogge-Stone adder with interconnect parasitics extracted from layout. Ignoring failure rate constraints, the minimum size adder (dashed line) has an optimum supply voltage of VDDopt−MS = 280mV. When we account for failure rate constraints, the effect of constant yield sizing (solid line) is to add energy overhead when VDD < VDDcrit. This shifts the local minimum to the right, hence VDDopt−CY > VDDopt−MS. Here VDDopt−CY is also below VDDcrit; therefore, the adder with constant yield sizing at VDDopt−CY = 300mV consumes 10.1% less energy than a minimum size adder at VDDcrit = 340mV. In this example, constant yield sizing results in a small reduction in energy due to the shallow minimum of the energy versus VDD curve.
PERFORMANCE VARIABILITY

Delay Variability Modeling
Circuits in sub-threshold display significantly higher delay variability than those in above-threshold operation, so proper modeling is essential for timing verification. This section presents a methodology to efficiently model the delay distribution of a chain of logic gates. Using this model, we characterize the delay variability of the Kogge-Stone adders of Section 3.1.
From [2], the delay of a sub-threshold logic gate can be modeled as

$$t_d = \frac{K \, C_g \, V_{DD}}{I_o \, e^{\frac{V_{GS} - V_T}{n V_{th}}}}$$
where K is a delay fitting parameter, Cg is the output capacitance, and the denominator models the gate active current. Both the active current and td are lognormally distributed with the same σ parameter. Therefore, delay variability is also given by Equation 3. It depends on σVT, which decreases as $(WL)^{-1/2}$, and the sub-threshold swing factor n, which decreases with VDS. To the first order, σ/μ does not depend on input slew or load capacitance.
The critical path delay in sub-threshold is a sum of lognormal random variables (RVs), typically approximated as another lognormal RV. Authors of [5] derived an expression for the propagation delay of a chain of identical inverters using the Wilkinson approximation. Here we employ the Schwartz-Yeh method [14] to model the sum of nonidentically distributed lognormal RVs. The delay of an arbitrary critical path can then be obtained by summing the pre-characterized distributions of each logic gate in the path.
The Schwartz-Yeh method is an iterative algorithm for calculating the sum of lognormal RVs that requires much less computation time than Monte Carlo simulation. The modeling methodology using this algorithm is described as follows:
1) Characterize mean delay and standard deviation (μgate, σgate) of each logic gate in a cell library, under one input slew and output load condition.
2) Simulate the (N-stage) critical path of interest at the nominal process corner and without VT variation. The delay of the j-th stage in the critical path gives μj−path, for j=1 to N.
3) For each gate j in the critical path, let σj−path = σj−gate × μj−path/μj−gate, where σj−gate and μj−gate are characterized in 1). Since the delay variability σj/μj is approximately constant across input slew and load conditions, this scales the pre-characterized standard deviation of each gate to the input slew and load conditions in the actual critical path.
4) μj−path and σj−path characterize the distribution of each stage, and are input to the Schwartz-Yeh algorithm to generate the delay distribution of the entire critical path.
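Steps 1) to 4) can be sketched in a few lines. To keep the example short, the final step below does not implement Schwartz-Yeh. It substitutes the simpler Fenton-Wilkinson moment-matching approximation, which fits a single lognormal to the summed mean and variance and assumes independent stage delays. The library entries and stage delays are hypothetical numbers in nanoseconds.

```python
import math

# Step 1: hypothetical library characterization, (mean, sigma) of gate delay in ns
LIBRARY = {"INV": (10.0, 3.0), "NAND2": (14.0, 3.5), "NOR2": (16.0, 4.0)}


def path_delay_stats(path):
    """Mean and sigma of path delay; `path` lists (gate, nominal stage delay in ns)."""
    mu_total, var_total = 0.0, 0.0
    for gate, mu_path in path:                        # step 2: nominal stage delays
        mu_gate, sigma_gate = LIBRARY[gate]
        sigma_path = sigma_gate * mu_path / mu_gate   # step 3: keep sigma/mu constant
        mu_total += mu_path
        var_total += sigma_path ** 2                  # stages assumed independent
    return mu_total, math.sqrt(var_total)


def lognormal_params(mu, sigma):
    """Parameters (m, s) of ln(X) for a lognormal X with the given mean and sigma."""
    s2 = math.log(1.0 + (sigma / mu) ** 2)
    return math.log(mu) - 0.5 * s2, math.sqrt(s2)


# Step 4 (moment matching in place of Schwartz-Yeh) on a three-stage path
path = [("INV", 12.0), ("NAND2", 21.0), ("NOR2", 20.0)]
mu, sigma = path_delay_stats(path)
m, s = lognormal_params(mu, sigma)
print(f"path mean = {mu:.1f} ns, sigma = {sigma:.2f} ns, ln-domain s = {s:.3f}")
```

The matched lognormal can then be used to read off tail quantiles of the path delay. A Schwartz-Yeh implementation would replace `lognormal_params` with its iterative pairwise combination of stages.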
The above methodology is applied to a three-stage chain consisting of INV-NAND-NOR and to the critical path of a 32-bit Kogge-Stone adder at 300mV. Table 2 compares statistical model results with a 1k-point Monte Carlo simulation randomizing VT of all transistors. The model estimates the mean and standard deviation of the path delay to within a few percent of the Monte Carlo results. This shows that keeping σ/μ constant provides a good approximation. This method is used to characterize the delay distribution of 1) the 32-bit adder with constant yield sizing at VDDopt−CY = 300mV, and 2) the adder with minimum size devices at VDDcrit = 340mV. Table 3 shows that the first adder exhibits larger mean and 3σ delay, since VDDopt−CY < VDDcrit. However, the delay variability of the two adders is comparable, indicating that upsized devices in the first adder offset increased variability from operating at a lower supply voltage.
Energy Variability
From a 1k-point Monte Carlo simulation, we characterize the energy distribution of the adder with constant yield sizing at VDDopt−CY and the other with minimum size devices at VDDcrit. As suggested in [5] , the switched capacitance is verified to vary negligibly with VT mismatch and is treated as deterministic. Figure 8(a) shows that even though the former adder employs larger devices, it displays lower mean leakage current due to DIBL, and lower variability as an additional benefit. The first adder exhibits lower mean total energy but higher variability in Figure 8(b) . The latter effect results from the delay term in leakage energy having larger mean and standard deviation at 300mV compared to 340mV. Note that the leakage component is a product of two dependent lognormal RVs, so ET is not strictly lognormally distributed.
CONCLUSION
In this paper, we have examined the effect of variation and sizing on single and stacked device topologies in sub-threshold circuits. Compared to a single device, stacked devices exhibit lower current variability but a higher probability of logic failure from insufficient output swing. We introduced the use of butterfly plots to verify logic gates as well as registers against process variation, and showed that upsizing is necessary to mitigate degraded output levels. The need for upsizing to meet a given yield constraint imposes an energy overhead and impacts the optimum sizing and supply voltage at which energy is minimized. We presented a methodology to model delay variation in an arbitrary critical path using the delay distribution of each stage. Finally, we compared the delay and energy variability of the proposed sizing scheme with a minimum size circuit, and showed that energy reduction is possible without compromising yield or performance variability.
