Subthreshold circuit design is a compelling method for ultra-low power applications. However, subthreshold designs show dramatically increased sensitivity to process variations due to the exponential relationship of subthreshold drive current with V th variation. In this paper, we present an analysis of subthreshold energy efficiency considering process variation, and propose methods to mitigate its impact. We show that, unlike superthreshold circuits, random dopant fluctuation is the dominant component of variation in subthreshold operation. We investigate how this variability can be ameliorated with proper circuit sizing and choice of circuit logic depth. We then present a statistical analysis of the energy efficiency of subthreshold circuits considering process variations. We show that the energy optimal supply voltage increases due to process variations and study its dependence on circuit parameters. We verify our analytical models against Monte Carlo SPICE simulations and show that they accurately predict the minimum energy and energy optimal supply voltage. Finally, we use the developed statistical energy model to determine the optimal pipelining depth in subthreshold designs.
Introduction
In subthreshold circuit design the supply voltage is less than the threshold voltage, allowing for ultra low power circuit operation. A number of successful subthreshold designs have been presented in the literature [1] [2] . Using subthreshold design, it is expected that energy efficiency in the range of 1pJ / instruction can be achieved [3] , hence enabling low performance applications powered by energy scavenging. In addition, wide range dynamic voltage scaling has been proposed [4] where processors can scale from high performance superthreshold operation to ultra low power subthreshold operation depending on workload.
In previous work [4] [5], a minimum energy voltage (V min ) for CMOS subthreshold operation was demonstrated. Scaling the voltage supply below V min ceases to reduce energy per operation due to the dominance of leakage in this voltage regime, combined with the exponential increase of circuit delay with supply voltage [4] [5] . However, the proposed analyses do not account for the impact of process variation. It is well known that subthreshold designs have dramatically increased sensitivity to process variations since drive current is exponentially dependent on threshold voltage [2] . We observe that variations in gate delay can be as high as 300% from nominal, creating a significant challenge for subthreshold circuit design. It is therefore difficult to meet design specification predictably without dramatic overdesign which wastes energy efficiency. In this paper, we therefore analyze the impact of process variation on subthreshold design and propose methods to mitigate its effect.
We first analyze the impact of different sources of process variations on subthreshold circuit delay. We show that random dopant fluctuations (RDF) [6] become the dominant source of variation in subthreshold operation, in contrast to superthreshold operation where geometric variations (e.g., in L eff ) are equally important. Due to the independent nature of RDF variations it is possible to reduce their impact on circuit performance through averaging. Hence, we show how careful circuit sizing and choice of logic depth can reduce timing variability (3σ/µ) to below 30% with appropriate design choices. We then analyze the energy efficiency of subthreshold designs while capturing the impact of process variations. We derive statistical expressions of circuit delay and static and dynamic power consumption and propose both analytical and numerically-derived expressions for the minimum energy and V min as a function of circuit parameters. We show that the method in [4] , which ignores process variations, can underestimate V min by as much as 78mV for small devices, corresponding to a 40% underestimation.
Using the newly developed model, we then study the dependence of the minimum energy and V min on design parameters such as the circuit logic depth, the number of critical paths in the circuit and the switching activity rate. Finally, we apply our model to a pipeline depth study. We show that the energy optimal pipeline depth in subthreshold designs increases from 10 fanout-of-four inverter (FO4) delays under nominal process conditions to 15 FO4 delays when process variations are considered. The rest of the paper is organized as follows. Section 2 presents key observations in subthreshold circuit variability. In Section 3, we derive statistical models of circuit delay and power under process variations and use these to derive our analytical model of the minimum energy and V min . In Section 4, we verify our analytical model against Monte Carlo SPICE simulations and examine some trends to provide useful insights on how to design subthreshold circuits efficiently. Section 5 explores energy efficiency from the perspective of pipelining, which includes process variation impact and latch overhead. Finally, Section 6 concludes the paper.
Variability Impact on Subthreshold Circuits
It is well known that process variability impact is magnified in subthreshold operation due to the exponential impact of V th and L eff on subthreshold drive current. However, little analysis has been performed to investigate the dominant components of variability in subthreshold circuits and other key trends. In this section we make several key observations about subthreshold circuit robustness based on SPICE simulations using an industrial 130nm technology. First, we point out that random dopant fluctuations (RDF) dominate geometric variations, particularly in channel length. This occurs since the channel length variation dependency of V th stems from drain induced barrier lowering (DIBL), which reduces at low operating voltages. As a result, the magnitude of V th variation arising from channel length uncertainty rapidly falls off as V dd reduces. However, since on current (I on ) at low voltages becomes more sensitive to V th fluctuations (exponentially dependent in subthreshold), the net result is that I on variation due to DIBL remains roughly constant or slightly increases. On the other hand, the uncertainty in V th due to RDF is independent of V dd and solely a function of channel area [7] . Therefore, I on variation resulting from RDF becomes the dominant component as V dd nears V th as shown in Figure 1 .
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. We also observe that I on variation due to RDF continues to increase even in the subthreshold region as seen in Figure 1 . As EQ1 [8] [10] shows, given a fixed amount of RDF V th variation we expect a constant amount of I on variation in the subthreshold region regardless of V dd . In practice, however I on uncertainty grows since subthreshold swing (S S ) improves in the subthreshold region as V dd reduces.
Considering that RDF dominates uncertainty in subthreshold circuits, we can address variability in this case through device sizing which reduces RDF. Furthermore, larger logic depths can serve to average out timing variations since stage delays are effectively independent. Figure 2 shows the 3σ/µ delay variation of an inverter chain versus the number of inverters (n) and inverter size (W) with Monte Carlo SPICE simulations. Interconnect loading for each stage is modeled by a lumped capacitance (50fF). As W or n increases, the relative variation becomes smaller, as expected. Figure 2 shows that by using sufficient logic depth and transistor sizing variability can be reduced to as little as 30%. In addition to selecting an appropriate logic depth, latch-based design (opposed to edge-triggered flipflops) can enable time borrowing which gives more room to average out RDF variations, effectively increasing n. The impact of logic depth is further investigated in Section 5 which studies the optimal pipeline depth for energy efficiency in subthreshold circuits.
Subthreshold Statistical Analysis
In order to estimate the energy consumption under process variation, we need to statistically model both delay and power. In order to make the problem tractable, we choose to set up our target circuit with p identical inverter chains, each composed of n inverters. However, the analysis can be extended to more general gates as well. V th typically follows a normal distribution; from EQ1 subthreshold oncurrent and propagation delay therefore exhibit lognormal distributions. Section 3.1 focuses on statistical subthreshold delay modeling. We first estimate the sum of lognormal gate delays to obtain the path delay (Section 3.1.1) and then find the circuit delay by taking the maximum of path delays (Section 3.1.2). This section also includes our mathematical formulation of the greatest of lognormal random variables (RVs). Section 3.2 details the power/energy and V min analysis under process variation based on delay models in Section 3.1.
Subthreshold Propagation Delay Analysis

Subthreshold Delay Formulation
Let t di be the delay of the i th (i=1,2,...p) path. In this case, the final circuit delay t dm can be expressed as (EQ 2)
Using t inv,j as the j th gate delay in a certain path, t di can be further written as
where η is the delay factor arising from a non-step actual input [4] , and C S is the switching load capacitance of each gate. Subthreshold current can be expressed as [8] (EQ 4) Let I s0 be (EQ 5)
then the on and off current can be rewritten as , (EQ 6) V th has a normal distribution, hence I on,j has a lognormal distribution. Here we consider I s0 as deterministic assuming that L eff variation in the denominator can be expressed in V th . Substituting I on into EQ3, we obtain (EQ 7)
We can see that t di is the sum of several lognormally distributed RVs. From [9] , the sum of several lognormal RVs can be approximated by another lognormal RV. We choose to match the first and second moment of the LHS and RHS in EQ7 as suggested in [9] . After some derivation, we arrive at (EQ 8)
where .
This derivation finds the corresponding normal RVs (ln t di ) mean and standard deviation instead of µ(t di ) and σ(t di ). We will need this 
I subvth V gs V th -
information to find the max of t di . In the next section we describe a method to estimate the mean and standard deviation of t dm from t di .
Greatest of t di using lognormal RVs
We now seek the mean and standard deviation of the greatest of p lognormal RVs. Introducing notation, suppose u has a normal distribution with mean and standard deviation of µ 0 and σ 0 . Then X=exp(u) has a lognormal distribution. The mean and standard deviation of X are , .
The probability that X Z is smaller than x is (EQ 14)
Since we are primarily concerned with delay variation due to RDF, we can assume that X A and X B are independent RVs, and the pdf of X Z is (EQ 15) From the raw moments, the µ and σ of X Z are given by , (EQ 17) There are usually many more than two critical paths in a welldesigned circuit, thus we need to be able to estimate the max of an arbitrary number of lognormal RVs. One method is to apply the above approach iteratively.
(EQ 18)
This approach is based on the assumption that the max of two lognormal RVs is another lognormal; we will show the error of this approach later in this section. To summarize, the detailed steps in computing X M are listed in ALGORITHM 1.
We find that the error incurred by the assumptions taken is very small for the random variables that we are concerned with. We show in Figure 3 that our proposed iterative method provides good accuracy over a wide range of p.
In the special case of identical paths, the greatest of p identical lognormal RVs as in our case can be approximated with a closed-form expression:
If u M , the greatest of normal RVs, can be estimated with a normal RV, we can assume X M has a lognormal distribution and find its µ and σ from EQ10. It then follows that we need to find µ and σ of u M . Based on the expression of the maximum of two normal RVs [11] , we can derive the maximum of p identical normal RVs for u M as (see Appendix B):
where , .
(EQ 21) In EQ20, r' is replaced by r in the original derivation. However this does not serve as a good approximation and r' is found via curve fitting to provide better accuracy. With µ uM , σ uM , and EQ10, we can find µ and σ expressions for X M .
We plot the relative error in standard deviation using the iterative MAX_OF_LOGNORMAL and analytical methods in Figure 3 for a 
value of σ 0 corresponding to minimum sized devices. Both methods provide sufficient accuracy in calculating standard deviation. For the error in computing the mean of X M , the analytical approach typically yields errors on the order of 3-5% for the cases studied. The iterative method is superior in this respect, with error less than 1% (note that curve fitting of r in the µ term of EQ20 could be used to further improve the accuracy of the analytical approach). Since the mean is usually much larger than the standard deviation for X M with large p, the iterative approach is preferable and is used in the remainder of the paper. Furthermore, while the analytical approach is only applicable to the greatest of p identical paths, the iterative approach is general. The analytical approach yields the same trends as those described below and thus provides a simple and efficient way to shed insight on the impact of variability on subthreshold circuits.
Combining the above analysis with Section 3.1.1, we can obtain µ and σ for t dm . With µ and σ of t dm in hand we can find the operating speed based on worst-case delay which is typical practice in ASIC designs. The worst delay tdly is (EQ 22) where conf is the confidence σ value. We use conf=3 in this paper unless otherwise specified.
Statistical Energy and V min Modeling
Total energy consumption during signal propagation is the sum of active and leakage energy. In our energy modeling, we treat the switching energy deterministically. This is reasonable since switching energy has only linear dependencies and therefore smaller variation compared to leakage energy. Again, we consider worst case leakage energy across all chips as the leakage energy. This is done by taking the µ+conf * σ of leakage power and tdly.
Total leakage current I leak,total can be expressed as (EQ 23) where N=n*p is the total number of gates in the circuit. Then the worst case leakage current is:
The worst case total energy across many dies is (EQ 25) where α is the activity rate. The energy expression without considering variation is [4] (EQ 26) where t d,nom is the nominal delay of the inverter chain with n inverters and I leak0 is the leakage current per gate. Comparing EQ25 and EQ26, the only difference lies in I leak,M and tdly. Therefore, we introduce a statistical adjustment factor A stat to consider both statistical terms:
(EQ 27)
Since subthreshold swing is a function of V dd (a quadratic function serves as a good estimation), the closed-form expression for nominal V min,nom in [4] is no longer accurate. We empirically find the following equation to be a good approximation for V min,nom with η=2.7.
(EQ 28)
Multiplying n by A stat , we find V min,stat under process variation. Using the analytical expression for tdly from Section 3.1, A stat can be written as:
( EQ 29) where r and r' are the same as in EQ21, .
We now clarify the modeling of V th variation in EQ30. We model the total V th variance as the sum of RDF and L eff components. We neglect spatial correlation in σ Vth , Leff since the σ Vth , RDF component dominates in this application space, making the error incurred by ignoring spatial correlations small. σ Vth , Leff is modeled as the sum of intra-die and inter-die variation. The RDF component is proportional to the inverse of the square root of channel area [7] . Since NMOS and PMOS are sized differently, we take the average k Vth of NMOS and PMOS with results showing good accuracy compared to SPICE.
(EQ 30)
Model Verification and Discussion
We simulate the circuit configuration of Section 3 (p identical paths of n inverters) in SPICE using an industrial 130nm technology with nominal V th of~350mV. Simulated and modeled results are seen in Figure 4 showing good fit. Inverters use 0.4µm wide NMOS with beta ratio of 1.4. Figure 4 shows that ignoring process variations underestimates V min . In particular, deterministic analysis does not predict a V min (or V min =0) for n<15 and α=1. ∆V min (the difference between V min in deterministic and statistical models) shrinks with increasing logic depth. This follows from larger logic depths enhancing averaging, reducing the spread in timing and leakage energy.
After confirming the accuracy of our model, we apply it to determine the dependency of V min on critical path count p. Results are shown in Figure 5 and Figure 6 . Figure 5 shows that V min first reduces and then rises slightly as critical path count increases. The reason for this is shown in Figure 6 , which shows the worst case delay tdly and average worst case leakage per gate (I leakM /N) versus critical path count. Note that tdly increases with more critical paths while average leakage becomes smaller. When p is not large, the reduction in average leakage dominates the increase in tdly and V min decreases. However, when p becomes large the average leakage stabilizes while tdly continues to increase, yielding a small rise in V min . Figure 7 shows the statistical and nominal V min with activity rate α. When α is high, the nominal V min model predicts a V min of zero. However, the new statistical model confirms that V min exists. We also show that the V min converges towards the nominal case when sizes increase. As shown, 1µm (W nmos ) inverters show smaller V min since it has less variation from RDF. All three curves follow the same V min trend as α decreases. This is due to the fact that α has no interaction with process variation and thus affects V min in the same manner for both nominal and statistical cases.
With our statistical delay model for t dm , some important results can be derived. Figure 8 shows the constant 3σ/µ delay contours in the logic depth and device size for a circuit block with 10 critical paths. Points farther from the origin exhibit less variation, whether due to device upsizing or increased logic depth. For the same amount of relative timing variability, we can compensate for small sizes with larger logic depth and vice versa. Notice that as the target 3σ/µ becomes smaller (<20%), it becomes difficult to achieve with reasonable sizes or logic depth. However, in Figure 8 the total number of paths is 10. With a larger number of critical paths, 3σ/µ naturally reduces as the tail of t dm shrinks.
Optimal Pipeline Depth Investigation
Reference [4] showed that energy efficiency improves with increasing switching activity (α) since devices are then performing useful work and not simply contributing to leakage energy. This suggests that a subthreshold design should be aggressively pipelined to raise α and improve energy efficiency. As with any design optimization, increasing the number of pipeline stages will reach its limits. In particular, latch energy overhead will eventually overtake the advantage offered by high activity rates.
The situation is further complicated when considering variability. As shown in Figure 4 , V min increases significantly when variation is included, but ∆V min in Figure 4 , and subsequently the change in energy, decreases as the logic depth increases. Designing longer paths therefore clearly reduces the effects of individual gate variations on total path delay and energy and we can limit delay and energy variation by increasing logic depth.
In light of process variation, a tradeoff exists between raising α through aggressive pipelining and reducing variation by increasing logic depth. In order to quantify this tradeoff, we examine a simple circuit consisting of 120 inverters with an FO4 load at each node, partitioned into a variable number of pipeline stages. As in Section 4, we use NMOS width of 0.4µm with a beta ratio of 1.4.
For each pipeline depth studied, we seek to minimize the energy consumed per operation. This is fundamentally different than typical superthreshold pipeline studies [12] [13] since throughput is not a primary concern in subthreshold circuits. To find the minimum energy at each pipeline depth, we simulate one pipeline stage across a range of voltages. For each voltage, we find the smallest clock period that guarantees operation for the given pipeline stage and then simulate a single switching event during that clock period. The energy consumed during this switching operation is then multiplied by the number of pipeline stages to yield total pipeline energy. We perform this simulation with and without variation considered (both RDF and geometric variations). To account for variation, we perform a Monte Carlo analysis with 1000 iterations for each clock period and energy calculation, and find the minimum µ+3 * σ value. Figure 9 shows both minimum energy per operation (E min ), and the corresponding supply voltage (V min ), as a function of logic depth. We observe a large increase in E min as a result of variation. Figure 9 highlights the fact that this energy penalty can be minimized by careful sizing of logic chains. For the nominal case, the energy minimum occurs at a logic depth of 10 inverters. The minimum when considering variation is noticeably shifted toward larger logic depths; approximately 15 inverters. Selection of V dd is also critical to minimizing the effects of variation. V min increases by roughly 60 mV across the range of logic depths presented in Figure 9 when variation is included. We begin to see that by designing longer logic paths and increasing supply voltage, the effects of variation can be minimized. Now consider energy consumption when a circuit is designed without considering variation. For example, if a logic path is designed with 10 inverters per pipeline stage and a supply voltage of 130 mV, as suggested by the nominal results in Figure 9 , a designer will expect the circuit to consume 15 fJ per operation. Monte Carlo SPICE simulations show that process variation causes the worst-case energy to be 31.8 fJ. If the circuit is instead designed with 15 inverters per pipeline stage and a supply voltage of 210 mV (conditions leading to minimal energy when variation is included), the worstcase energy consumption is 24.2 fJ, a 24% reduction in maximum energy per operation. This comparison is summarized in Table 1 .
Conclusions
This paper considers the impact of process variation on subthreshold circuits. We first make several observations about the nature of variation in subthreshold operation and how it fundamentally differs from superthreshold operation. We then derive statistical models of subthreshold circuit delay, power and energy efficiency and verify these using SPICE. With a new statistical model for the minimum energy point V min , we show that a previous nominal model underestimates V min by up to 78mV for small devices. Based on the observation that random dopant fluctuations dominate variability in subthreshold circuits, we suggest design strategies to maintain reasonable variability levels, e.g., <30%. Finally, we explore the role of pipelining in the energy efficiency of subthreshold circuits. We observe a 24% energy reduction when properly considering process variation during microarchitectural planning. 
+ ---------------------------------------
        = µ y µ 0 1 π -------σ 0 ⋅ + = σ y σ 0 1 1 π --- - ⋅ = µ j µ j 1 - 1 π -------σ j 1 - ⋅ + = σ j σ j 1 - 1 1 π --- - ⋅ = j 1 2 …r , , =
