This paper investigates subthreshold voltage operation of digital circuits. Starting from the previously known single supply voltage for minimum energy per cycle, we further lower the energy consumption by using dual subthreshold supplies. Level converters, commonly used in the above threshold design, are found to be unacceptably slow for subthreshold voltage operation. Therefore, special constraints are used to eliminate level converters. We give a new mixed integer linear program (MILP) that automatically and optimally assigns gate voltages, avoids the use of level converters, and holds the minimum critical path delay, while minimizing the total energy per cycle. Using examples of a 16-bit ripple-carry adder and a 4 × 4 multiplier we show energy savings of 23% and 5%, respectively. The latter is a worst case example because most paths are critical. Alternatively, for the same energy as that of single below-threshold supply, an optimized dual voltage design can operate at 3 to 4 times higher clock rate. Also, we show energy saving up to 22.2% from the minimum energy point over ISCAS'85 benchmark circuits. The MILP optimization with special consideration for level converters is general and applicable to any supply voltage range.
INTRODUCTION
The ubiquitous era of emerging portable devices demands long battery lifetime as a primary design goal. Subthreshold circuit design can reduce energy per cycle by one or more orders of magnitude by scaling the power supply voltage (V_dd) below the device threshold voltage (V_th). Ultra-low power applications such as micro-sensor networks, pacemakers, and many portable devices operate under extreme energy constraints for long battery lifetime [17-19]. Subthreshold circuit design is suitable for such emerging energy-constrained applications [4, 12, 28, 34, 36, 37, 42]. As the power supply voltage is scaled down below the device threshold voltage, the subthreshold current ever so slowly charges and discharges nodes according to the logic function of the circuit [36]. Despite its very high energy efficiency, subthreshold design has been applied only in niche markets because of its low performance.
Ultra-dynamic voltage scaling (UDVS) can provide useful system applications by switching between a highly energy-efficient subthreshold V_dd mode and a normal above-threshold V_dd mode [3]. The normal or subthreshold mode may be chosen according to the workload of the system. According to the available literature, most low-power techniques exploit time slack on non-critical paths of a circuit to reduce power consumption without performance loss. These techniques have been applied to circuits operating at the nominal supply voltage by sizing device widths, using multi-V_th devices, or using multiple V_dd [23, 30, 38]. For subthreshold circuits, sizing device widths affects the correct logic function of CMOS circuits at low supply voltage [36]. The multi-V_th technique does not adequately utilize the time slack in the subthreshold regime [1], because semiconductor foundries normally provide standard cell libraries with two to three fixed V_th values, namely, high V_th, standard V_th, and low V_th, for low-power design. Gate delay depends exponentially on V_th in a subthreshold circuit. Therefore, we cannot utilize all possible time slack on non-critical paths in a subthreshold circuit without further manipulation of these device threshold voltages.
The multi-V_dd technique has been widely implemented for two supply voltages [20]. The dual-V_dd design is best suited for exploiting the time slack in a subthreshold circuit as well [13, 15, 16]. Although the gate delay depends exponentially on V_dd in the subthreshold region, it may be possible to find an optimal lower supply voltage for the available time slack in the circuit. A DC-to-DC voltage converter [26] will then allow the voltage management. Utilizing the time slack for dual-V_dd assignment can give valuable energy saving at small extra cost in physical design.
There are two scenarios for applying dual-V_dd design to subthreshold circuits in energy-constrained low-performance applications. Consider a digital circuit working in an absolutely minimum energy consumption mode. The supply voltage for such an operation is known to be in the subthreshold range [36]. We can further reduce the energy consumption without changing the performance by assigning an extra lower supply voltage only to gates on non-critical paths. Alternatively, the subthreshold circuit can be sped up several times by selecting two supply voltages, one of which is higher than the optimal single V_dd. A small energy increase from the absolute minimum energy point of a subthreshold circuit can notably improve performance [7]. In this scenario, the dual-V_dd design keeps the energy consumption close to that of the single-V_dd minimum energy point but operates at a higher speed, obtained by using the higher supply for gates on critical paths.
Our contribution provides a framework for finding the optimal dual-V_dd assignment in a subthreshold circuit with a given speed requirement. The design procedure formulates mixed integer linear programs (MILP) that, given today's computing capabilities, can deal with moderately large circuit complexity [8]. In a dual voltage circuit, signal level converters are considered essential. A level converter simply changes a logic 1 level from one V_dd voltage to another V_dd voltage. Even though level converters insert delays and consume power [24, 39], in their absence certain interfaces become unsatisfactory. This is because the logic 1 level produced by a gate has the same voltage level as its V_dd. However, for a proper switching operation a receiving gate with a different V_dd requires the input signal to match its own V_dd. In particular, driving a high-V_dd gate with a low voltage signal causes high leakage and long delay. We characterize all multi-level interfaces, and our MILP contains constraints to eliminate interfaces where level converters may otherwise be essential.
The paper is organized as follows. Section II introduces properties of subthreshold operating circuits with key terms. In Section III, we extend the existing dual-V_dd techniques of above-threshold operation, clustered voltage scaling (CVS) [32] and extended CVS (ECVS) [33], to the subthreshold regime. New MILP solutions are presented in Section IV. Section V reports SPICE simulation results to validate the MILP solutions. Finally, a conclusion of this work appears in Section VI.
SUBTHRESHOLD CIRCUITS
Before discussing the original method of dual-V_dd optimization of subthreshold circuits, we briefly summarize the properties of subthreshold circuits in terms of functional operation and failure, performance, and energy, on which some of the earliest work was reported by Swanson and Meindl [31] and Vittoz and Fellrath [35].
Minimum Operating Voltage
For the correct functional operation of a subthreshold logic circuit, the supply voltage V_dd should be higher than a certain minimum voltage (V_min). The theoretical V_min is given as [22, 40],

$$V_{min} = 2 V_T \ln\left(1 + \frac{S}{V_T \ln 10}\right) \qquad (1)$$
where V_T = kT/q is the thermal voltage, k = 1.381 × 10^−23 J/K is Boltzmann's constant, T is the absolute temperature in kelvin, and q = 1.602 × 10^−19 C is the electronic charge. At 300 K (room temperature), V_T = 26 mV. S is the slope of the drain-to-source current I_ds in the subthreshold region, usually referred to as the subthreshold swing. For example, for 0.18 µm technology S ≈ 90 mV/decade [40]. That means that a 90 mV reduction in V_gs will reduce I_ds by a factor of 10. That gives V_min = 48 mV at 300 K.
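The numbers quoted above can be checked with a short calculation. The sketch below assumes the commonly cited closed form V_min = 2 V_T ln(1 + S / (V_T ln 10)), which reproduces the 48 mV figure for S = 90 mV/decade; the function names are illustrative only.

```python
import math

K_BOLTZMANN = 1.381e-23  # Boltzmann's constant, J/K (value from the text)
Q_ELECTRON = 1.602e-19   # electronic charge, C (value from the text)


def thermal_voltage(temp_kelvin: float) -> float:
    """Thermal voltage V_T = kT/q in volts."""
    return K_BOLTZMANN * temp_kelvin / Q_ELECTRON


def v_min(swing_v_per_decade: float, temp_kelvin: float = 300.0) -> float:
    """Theoretical minimum supply voltage in volts.

    Assumed form: V_min = 2 * V_T * ln(1 + S / (V_T * ln 10)).
    """
    v_t = thermal_voltage(temp_kelvin)
    return 2.0 * v_t * math.log(1.0 + swing_v_per_decade / (v_t * math.log(10.0)))


V_T_300K = thermal_voltage(300.0)   # about 26 mV
V_MIN_180NM = v_min(0.090)          # S = 90 mV/decade, about 48 mV

print(f"V_T   = {V_T_300K * 1e3:.1f} mV")
print(f"V_min = {V_MIN_180NM * 1e3:.1f} mV")
```

A larger subthreshold swing S, as expected for scaled technologies, raises the computed V_min.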
We define the on-current I_on as I_ds for V_gs = V_ds = V_dd and the off-current I_off as I_ds for V_gs = 0 and V_ds = V_dd in the subthreshold region (V_dd < V_th). According to Ref. [9], S degrades with the downscaling of CMOS technology, which means that the reduced ratio of I_on to I_off will cause smaller noise margins and possible functional logic failures at or below V_min. Figure 1 shows the simulation of a chain of 1,000 inverters for different supply voltages and inputs of logic 0 (ground) and logic 1 (V_dd). The circuit was simulated in 90 nm CMOS with the Predictive Technology Model (PTM) [43] using the HSPICE simulator [10]. The minimum operating voltage for the inverter chain is found to be 80 mV to guarantee a 10% to 90% output voltage swing. As we move from the input toward the output we observe degrading logic levels, which eventually stabilize between depths of 10 and 20. This means that the logic 0 and 1 levels stabilize close to the ground and supply voltages, respectively, and do not continue to degrade with the logic depth of the circuit. The data in Table I, obtained directly from the HSPICE [10] simulation, clearly show this.
A new noise-tolerant circuit design was proposed based on differential Schmitt trigger gates [6]. This approach increases noise immunity for low-voltage subthreshold circuits compared to standard CMOS subthreshold logic circuits.
Rise and fall transition times for the outputs of inverters are not degraded through a 1,000-inverter chain at lower supply voltages, as shown in Figure 2. These simulation results show that subthreshold logic circuits function correctly above the minimum energy point.
Delay
The delay of a gate in a subthreshold circuit can be simply formulated from the gate delay equation,

$$t_d = \frac{K C_L V_{dd}}{I_{on}} \qquad (2)$$
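where K is a fitting parameter and C_L is the load capacitance of the gate. We replace I_on with the subthreshold drain current (I_sub),

$$I_{sub} = I_o \, 10^{(V_{gs} - V_{th} + \eta V_{ds})/S} \left(1 - e^{-V_{ds}/V_T}\right) \qquad (3)$$

where η is the drain-induced barrier lowering (DIBL) coefficient and I_o is the drain current at V_gs = V_th. When V_gs = V_ds = V_dd ≫ V_T (≈26 mV at 300 K), we get the gate delay as,

$$t_d = \frac{K C_L V_{dd}}{I_o \, 10^{(V_{dd} - V_{th} + \eta V_{dd})/S}} \qquad (4)$$

Thus, t_d is exponentially dependent on V_dd, V_th, η, and S.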
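To make the exponential sensitivity concrete, the following sketch evaluates a standard subthreshold current model with a DIBL term and the delay that results when V_gs = V_ds = V_dd ≫ V_T. The model form is the usual textbook expression and is an assumption here; all parameter values except V_th = 0.29 V are hypothetical.

```python
import math


def subthreshold_current(v_gs, v_ds, v_th, i_o, eta, swing, v_t=0.026):
    """Assumed model: I_sub = I_o * 10^((V_gs - V_th + eta*V_ds)/S) * (1 - exp(-V_ds/V_T))."""
    exponent = (v_gs - v_th + eta * v_ds) / swing
    return i_o * 10.0 ** exponent * (1.0 - math.exp(-v_ds / v_t))


def gate_delay(v_dd, v_th, i_o, eta, swing, k_fit=1.0, c_load=1e-15):
    """Delay t_d = K * C_L * V_dd / I_on, dropping the (1 - exp) factor for V_dd >> V_T."""
    i_on = i_o * 10.0 ** ((v_dd - v_th + eta * v_dd) / swing)
    return k_fit * c_load * v_dd / i_on


# Hypothetical device parameters; only v_th is taken from the NMOS value quoted in the paper.
PARAMS = dict(v_th=0.29, i_o=1e-7, eta=0.1, swing=0.09)

# Slowdown when the supply drops from 250 mV to 200 mV.
SLOWDOWN = gate_delay(0.20, **PARAMS) / gate_delay(0.25, **PARAMS)
print(f"delay(200 mV) / delay(250 mV) = {SLOWDOWN:.2f}")
```

With these assumed values a 50 mV reduction in V_dd slows the gate by roughly a factor of three, which is the behaviour the dual-V_dd optimization has to trade against energy.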
Energy
Energy per cycle of a circuit is a key parameter for energy efficiency in ultra-low power applications. Because computing workload is characterized in terms of clock cycles, this measure directly relates energy consumption to the workload. Before considering the energy consumed by a circuit, we start by examining the total energy per cycle (E_tot) of a single gate, which is composed of dynamic energy (E_dyn) and leakage energy (E_leak):
where α_0→1 is the low-to-high transition activity for the gate output node and P_leak is the static leakage power. I_off is the static leakage current, obtained from (3):
Successful hardware implementations of single-voltage subthreshold circuits have been reported. An FFT chip was built by Wang and Chandrakasan [37] in 180 nm CMOS technology and was shown to work with V_dd = 350 mV at a clock rate of 10 kHz. The threshold voltage was 450 mV. The power consumption of the chip was 0.6 µW. Subthreshold voltage processor chips have been built and tested by Seok et al. [28]. A subthreshold SRAM (256 kb) in 65 nm CMOS has been reported by Calhoun and Chandrakasan.
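The existence of a minimum energy point in the subthreshold range follows from the trade-off described in this section: dynamic energy falls quadratically with V_dd, while leakage energy per cycle grows because the cycle time stretches exponentially. The sketch below illustrates this under stated assumptions: dynamic energy α C_L V_dd², leakage energy I_off V_dd T_c, a cycle time equal to a chain of `depth` identical gates, and hypothetical parameter values throughout. It is not the characterization flow used in the paper.

```python
def energy_per_cycle(v_dd, alpha=0.2, c_load=1e-15, depth=50, k_fit=1.0,
                     v_th=0.29, i_o=1e-7, eta=0.1, swing=0.09):
    """Energy per cycle of one gate in joules (assumed model, hypothetical parameters)."""
    # On and off currents from the subthreshold model, assuming V_dd >> V_T.
    i_on = i_o * 10.0 ** ((v_dd - v_th + eta * v_dd) / swing)
    i_off = i_o * 10.0 ** ((eta * v_dd - v_th) / swing)
    # Cycle time set by a critical path of `depth` identical gates.
    t_cycle = depth * k_fit * c_load * v_dd / i_on
    e_dyn = alpha * c_load * v_dd ** 2
    e_leak = i_off * v_dd * t_cycle
    return e_dyn + e_leak


# Sweep the supply with a 10 mV resolution, as in the cell characterization.
SWEEP = [mv / 1000.0 for mv in range(100, 401, 10)]
V_OPT = min(SWEEP, key=energy_per_cycle)
print(f"minimum energy point near V_dd = {V_OPT:.2f} V")
```

Raising `alpha` moves the minimum toward lower voltages, and raising `depth` moves it higher, because leakage is integrated over a longer cycle.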
DUAL-V_dd SCHEME FOR SUBTHRESHOLD OPERATION
Scaling V_dd down reduces both dynamic power and static leakage power in a circuit, but it also reduces performance. To reduce power consumption without degrading performance, a multi-V_dd technique exploits time slack and assigns a lower voltage V_DDL to gates on non-critical paths.
As shown in Figure 3(a), the clustered voltage scaling (CVS) algorithm [32] does not allow V_DDL cells to feed directly into V_DDH cells, and so level conversion is implemented inside the flip-flop (LCFF) [11]. This topological limitation prevents full use of the time slack that exists in a circuit. The extended clustered voltage scaling (ECVS) in Figure 3(b) eliminates this constraint by inserting a level converter (LC) at each V_DDL cell feeding into a V_DDH cell. ECVS gives better power saving than CVS, but each LC adds power and delay overheads.
Without a level converter, the low-to-high output transition delay of the second-stage inverter in Figure 4 is not affected by the input voltage swing V_DDL from the previous stage, because the delay of the pull-up PMOS depends only on its own power supply V_DDH [27]. During the high-to-low output transition of the second inverter, the pull-down NMOS delay is affected by both the input swing V_DDL and the power supply V_DDH. Therefore, a lower input swing reduces the discharge current through the NMOS, which increases the pull-down delay. Because the pull-up PMOS in the inverter cannot be shut off completely by the lower input swing level, severe DC current from the power supply V_DDH induces higher static leakage power consumption.
In subthreshold operation, the lower input swing exponentially increases the delay (4) of the driven gate. We investigate the delay and leakage power penalty from a lower input swing voltage. For simplicity, in this paper, we use only four types of cells, namely, INV, NAND2, NAND3 and NOR2, to synthesize example circuits. For cell characterization, all simulation results are from HSPICE using PTM 90 nm CMOS. The device threshold voltages are V_th,PMOS = 0.21 V and V_th,NMOS = 0.29 V at nominal V_dd = 1.2 V and room temperature (300 K).
Various input and output configurations interfacing gates in dual-V_dd assignments are shown in Figure 5. Table II summarizes the delay and static leakage power for each case, where V_DDH = 250 mV and V_DDL = 200 mV such that the entire operation is in the subthreshold region. The difference between LL and HH delays shows that gate delay (4) is exponentially sensitive to the power supply voltage, while P_leak changes less.
In Table II, as expected, HL delays for NAND2 and NAND3 gates are lower than those for the LL configuration because of smaller discharging time constants. However, that is not the case for INV and NOR2 gates, which are faster in the LL configuration. This speed increase is due to a higher logic 0 level for the LL configuration in charging time. In the case of leakage power for HL, all gates suppress the leakage current through the pull-up PMOS (V_gs > 0) from the power supply. Severe increases of delay and power in the dual-V_dd scheme come from LH, which is prohibited in the CVS methodology and is allowed in ECVS with an LC. However, the common above-threshold LC in Figure 3(c) cannot be used because of its unacceptable delay overhead, in addition to its power overhead.
From Table III, the LC delay penalty in subthreshold operation is around 80 fanout-of-four (FO4) inverter delays, which exceeds the clock cycle time of a low-voltage shallow-pipelined microprocessor (40 FO4 delays) or ASIC processor (44 FO4 delays) [5, 29]. A new LC design suitable for subthreshold circuits may be needed but is out of the scope of the present work. In the next section, we include additional constraints in the MILP that will not allow the LH configuration (similar to CVS) for energy optimization.
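The interface naming used in this section (first letter for the driver supply, second for the receiver supply) lends itself to a simple netlist check. The sketch below uses a hypothetical four-gate netlist and helper names of our own choosing to flag LH interfaces, the only configuration that would call for a level converter.

```python
def interface_type(driver_supply: str, receiver_supply: str) -> str:
    """Return 'HH', 'HL', 'LH' or 'LL' for supplies given as 'H' (V_DDH) or 'L' (V_DDL)."""
    return driver_supply + receiver_supply


def find_lh_interfaces(supply, fanout):
    """List (driver, receiver) pairs where a V_DDL gate drives a V_DDH gate."""
    return [
        (driver, receiver)
        for driver, receivers in fanout.items()
        for receiver in receivers
        if interface_type(supply[driver], supply[receiver]) == "LH"
    ]


# Hypothetical supply assignment and connectivity, for illustration only.
SUPPLY = {"a": "H", "b": "L", "c": "H", "d": "L"}
FANOUT = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}

VIOLATIONS = find_lh_interfaces(SUPPLY, FANOUT)
print(VIOLATIONS)
```

An assignment is acceptable under the CVS-style restriction only when this list is empty.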
MILP FOR V_DDL ASSIGNMENT
In this section, we design minimum energy circuits with dual-V_dd assignment using mixed integer linear programming (MILP) [8]. First, the optimal (i.e., minimum energy per cycle) supply voltage (V_opt) for single-V_dd operation is determined. The critical path delay (or clock cycle time) of this design is used as the timing requirement for the dual voltage design. Thus, the MILP automatically applies the higher supply voltage V_DDH = V_opt to gates on critical paths to maintain the performance, and finds an optimal lower supply voltage V_DDL assigned to gates on non-critical paths to reduce the total energy consumption by a global optimization considering all possible V_DDL. This differs from the backward traversal CVS heuristic algorithms, which tend to be non-optimal. Note that more paths may now have delays that are either equal or close to the critical path delay.
Let X_i be an integer variable that is 0 for V_DDH or 1 for V_DDL for the power supply assignment of gate i. Let T_c be a predetermined critical path delay for the circuit. The optimal minimum energy voltage assignment problem is formulated as an MILP model:
E_tot,i for V_DDL and V_DDH are given by (5a) and (5b).
Subject to timing constraints:
$$T_i \le T_c \quad \forall i \in \text{all primary output gates} \qquad (11)$$

Subject to topological constraints:
In the above constraints, T_i is the latest arrival time at the output of gate i corresponding to a primary input event [25]. As mentioned in Section III, the unacceptable delay penalty of an asynchronous LC prohibits its use in a dual-V_dd scheme in the subthreshold region. The MILP model does not allow a V_DDL cell to drive a V_DDH cell as its fanout gate, on account of topological constraint (12), as shown in Figure 6. Thus, the LH configuration of Figure 5(d) never occurs in the optimized circuit. Within the given timing constraint T_c, originally obtained for the best energy per cycle for single subthreshold V_DDH operation, the MILP searches recursively for the best V_DDL such that the energy per cycle is further reduced to a true minimum. The MILP needs to run multiple times to search for the optimal V_DDL, but we have presented a new MILP algorithm that runs only once to find the best V_DDL in Refs. [13, 14].
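The structure of the optimization (a binary X_i per gate, arrival-time constraints bounded by T_c at primary outputs, and a topological constraint that forbids a V_DDL gate driving a V_DDH gate) can be illustrated on a toy circuit. The sketch below is not an MILP solver: it substitutes exhaustive enumeration, which is only practical for a handful of gates. The netlist, delays, and energies are hypothetical.

```python
from itertools import product

# Hypothetical toy netlist: gate -> driving gates (primary inputs omitted).
# Keys are listed in topological order.
FANIN = {
    "g1": [], "g2": ["g1"], "g3": ["g2"],  # critical chain of three gates
    "g4": [], "g5": ["g4"],                # shorter path with some slack
    "g6": [],                              # single-gate path
}
PRIMARY_OUTPUTS = ["g3", "g5", "g6"]

# Assumed per-gate characterization in arbitrary units; index 0 = V_DDH, 1 = V_DDL.
DELAY = (1.0, 2.0)
ENERGY = (1.0, 0.6)


def arrival_times(x):
    """Latest arrival time T_i at each gate output for assignment x (gate -> 0 or 1)."""
    t = {}
    for gate, drivers in FANIN.items():
        t[gate] = max((t[d] for d in drivers), default=0.0) + DELAY[x[gate]]
    return t


def is_feasible(x, t_c):
    """Check the topological (no LH interface) and timing (T_i <= T_c at outputs) constraints."""
    for gate, drivers in FANIN.items():
        if any(x[d] == 1 and x[gate] == 0 for d in drivers):
            return False
    t = arrival_times(x)
    return all(t[po] <= t_c for po in PRIMARY_OUTPUTS)


def optimize(t_c):
    """Return (energy, assignment) with minimum total energy, or None if infeasible."""
    gates = list(FANIN)
    best = None
    for bits in product((0, 1), repeat=len(gates)):
        x = dict(zip(gates, bits))
        if not is_feasible(x, t_c):
            continue
        energy = sum(ENERGY[x[g]] for g in gates)
        if best is None or energy < best[0]:
            best = (energy, x)
    return best


BEST_ENERGY, BEST_X = optimize(t_c=3.0)
print(BEST_ENERGY, BEST_X)
```

With T_c equal to the all-V_DDH critical path delay, only g5 and g6 can move to V_DDL. Assigning g4 to V_DDL would either violate timing or create an LH interface into g5. Repeating `optimize` over a list of candidate V_DDL characterizations would mirror the repeated MILP runs described above.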
RESULTS
As mentioned before, we use only four simple basic cells (INV, NAND2, NAND3 and NOR2) for synthesizing two example circuits, a 16-bit ripple carry adder and a 4 × 4 multiplier, and ISCAS'85 benchmark circuits in PTM 90 nm CMOS technology. The delay, capacitance and average leakage power of these four basic cells are characterized for the MILP model by scaling V_dd with a 10 mV resolution in HSPICE simulations. Switching activity is the average number of low-to-high transitions at circuit nodes, which is calculated using a logic simulator with randomly generated input vectors. The same randomly generated vectors are applied as input signals to the circuit in the HSPICE simulation that measures energy consumption. As shown in Figure 7, our example circuit, embedded in a test bench, is driven by randomly generated high input swing flip-flops. Two subthreshold voltages may be provided by a DC-to-DC voltage converter [20, 26, 37]. The energy per cycle is measured for the combinational circuit, excluding flip-flops. From Figure 8(a), the minimum energy point for a 16-bit ripple carry adder with an activity factor α = 0.21 is 9.65 fJ at V_dd = 0.21 V. The clock frequency was found to be 2.15 MHz. With dual-V_dd assignment, the optimized circuit with V_DDH = 0.21 V and V_DDL = 0.14 V reduces the energy per cycle by up to 23.6% while retaining the same performance. This energy reduction is shown by the downward arrow in Figure 8(b).
Consider again the minimum energy per cycle (9.65 fJ) operation of the 16-bit ripple-carry adder circuit with a single subthreshold voltage of 0.21 V and a clock frequency of 2.15 MHz.
In an alternative design, we may hold the minimum energy constant and improve the performance. From the MILP results in Table IV, we find that operation with two voltages, 0.27 V (V_DDH) and 0.19 V (V_DDL), consumes 9.42 fJ, which is just under the minimum energy, but has a clock frequency of 8.41 MHz. This, as shown by the right arrow in Figure 8(b), is about a 4× speed improvement.
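The two adder scenarios can be put side by side with a few lines of arithmetic. The sketch below uses only figures quoted in the text (9.65 fJ, 2.15 MHz, 23.6%, 9.42 fJ, 8.41 MHz); the dual-V_dd energy at unchanged speed is derived here from the quoted percentage rather than taken from a table.

```python
E_MIN_FJ = 9.65          # single-V_dd minimum energy per cycle, fJ
F_MIN_MHZ = 2.15         # clock frequency at the minimum energy point, MHz
SAVING_SAME_SPEED = 0.236  # quoted energy reduction with V_DDH = 0.21 V, V_DDL = 0.14 V
E_FAST_FJ = 9.42         # dual-V_dd energy with V_DDH = 0.27 V, V_DDL = 0.19 V, fJ
F_FAST_MHZ = 8.41        # clock frequency of the faster dual-V_dd design, MHz

# Scenario 1: same clock rate, lower energy (derived value).
E_DUAL_SAME_SPEED_FJ = E_MIN_FJ * (1.0 - SAVING_SAME_SPEED)

# Scenario 2: roughly the same energy, higher clock rate.
SPEEDUP = F_FAST_MHZ / F_MIN_MHZ
ENERGY_MARGIN = (E_MIN_FJ - E_FAST_FJ) / E_MIN_FJ

print(f"same speed : {E_DUAL_SAME_SPEED_FJ:.2f} fJ per cycle")
print(f"same energy: {SPEEDUP:.2f}x faster, {ENERGY_MARGIN * 100:.1f}% below E_min")
```

The ratio 8.41 / 2.15 is about 3.9, which is the "about 4×" improvement cited above.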
As a worst case example, a path-balanced 4 × 4 multiplier reduces the energy per cycle to 5% below the minimum energy point with V_DDH = 0.17 V and V_DDL = 0.12 V, with no loss of performance. For better performance, the 4 × 4 multiplier can operate at 1.67 MHz, up from a clock frequency of 1 MHz at the single-V_dd minimum energy point, when two supply voltages 0.19 V (V_DDH) and 0.13 V (V_DDL) are provided; the energy then rises slightly above the minimum.
The two example circuits using dual-V_dd show that performance improves substantially for a circuit with large positive slack. Figures 9(a) and (b) illustrate the gate slack distribution [13, 14] of a 16-bit ripple carry adder and a 4 × 4 multiplier, respectively, for single-V_dd and dual-V_dd (optimized) designs at the minimum energy point. The energy savings at minimum energy operating points using dual-V_dd are obtained from HSPICE simulations for ISCAS'85 benchmark circuits, as shown in Table V. The optimized c880 (an 8-bit ALU) shows a 22.2% energy saving as the best case. The energy saving for c6288 (a 16 × 16 multiplier) is only about 2.1%. The gate slack distributions for c880 and c6288 are shown in Figure 10.
Logic function failure occurs at 0.08 V in NAND3, so the lowest possible V_DDL assignment in the MILP optimization is 0.09 V. This minimum operating voltage guarantees a 10% to 90% output voltage swing for all four cells over the full range of operational voltages used. Figure 11 shows sample signal waveforms from an optimized 16-bit ripple carry adder circuit for V_DDH = 0.11 V and V_DDL = 0.09 V. This has V_DDL assigned to cells on a non-critical path that leads to the least significant sum bit (s1). The output flip-flop (s1q) holds correct signal values at the minimum operating voltage on positive clock edges.
When V_DDH is 100 mV, it is approaching the lower end of its range, beyond which the circuit would fail to operate. The MILP now has limited choices for a solution and gives a V_DDL that provides a smaller energy saving. The 16-bit ripple carry adder has better energy reduction because it can utilize more time slack from non-critical paths compared to the 4 × 4 multiplier with more balanced paths. The gate delay in subthreshold operation increases exponentially with reducing supply voltage, which forces the optimal V_DDL close to V_DDH.

Table V. Energy saving with optimal V_DDL for given V_DDH (minimum energy operating point) in ISCAS'85 benchmark circuits for PTM 90 nm CMOS. (Column headings include: benchmark circuit, total gates, activity.)

Even though the MILP model only allows the HL configuration and eliminates the use of LC for a dual-V_dd circuit block, level conversion may be needed at outputs to match signal levels across block-to-block connections of a system. The differential cascode voltage switch (DCVS) based level converter of a normal standard cell library in Figure 3(c) is unsuitable for subthreshold operation because of its delay and power overheads (Section III). To avoid this problem, our design refrains from using level converters while taking the resulting penalty in energy saving into account. For level converting, we always assign V_DDH to primary output (PO) gates before the output flip-flops at multiple voltage boundaries between circuit blocks. The PO gates driven by V_DDL cells are found to correctly execute their logic functions if, for a given V_DDH, V_DDL is bounded as shown in Figure 12. This lowest possible V_DDL raises the minimum operating voltage for the dual voltage optimized circuit block. The optimal V_DDL in the MILP model can be higher than its true optimal value in order to suppress the DC leakage power of the LH-configured PO gates. The two small example circuits, a 16-bit ripple-carry adder and a 4 × 4 multiplier, show reduced average energy savings of 11.9% and 2.6%, respectively. The penalty in energy saving from level converting may be negligible for a large system in which most blocks would operate at V_DDL and only a few need V_DDH.
CONCLUSION
In this paper, we investigate the validity of dual-V_dd assignment for a bulk CMOS subthreshold circuit. Some applications in the market may need minimum energy consumption without a performance concern. This work could provide a framework for solving those design problems. For a wide range of speed requirements, the MILP determines globally minimum energy optimized circuit configurations by assigning an extra supply voltage V_DDL to gates on non-critical paths. A 16-bit ripple carry adder shows on average 20.5% reduced energy consumption while maintaining the same performance as the original single-V_dd circuit. The worst case example of a 4 × 4 multiplier still gives on average a 4.9% reduction. Further, allowing a small increase in the energy consumption can significantly speed up the subthreshold operation of a logic circuit. The methodology of dual-V_dd assignment is valid for substantial speed-up without energy increase, as well as for energy reduction below the minimum achievable in a single voltage circuit. With the proposed MILP, ISCAS'85 benchmark circuits could save up to 22.2% (c880) energy per cycle.
The MILP techniques of this paper are not restricted to subthreshold operation alone. When a higher performance, impossible to achieve in the subthreshold region, is required, we would then obtain two above-threshold voltages that satisfy the performance criteria and minimize the energy per cycle [13, 14]. There may be potential for greater energy saving as circuit size increases, because a larger critical path delay leads to greater slack for many gates. The process variation of the device threshold voltage (V_th) can seriously affect a subthreshold voltage design, and this needs to be studied especially for nanometer technologies [21, 41]. Higher leakage technologies may display higher speed in the subthreshold region because the logic operation relies on leakage currents. These aspects of dual-V_dd design in the subthreshold region are worth exploring in the future.
