Abstract-This paper presents a method for minimum energy digital CMOS circuit design using dual subthreshold supply voltages. The stringent energy budget and moderate speed requirements of some ultra low power systems may not be best satisfied just by scaling a single supply voltage. Optimized circuits with dual supply voltages provide an opportunity to resolve these demands. The delay penalty of a traditional level converter is unacceptably high when the voltages are in the subthreshold range. In the present work level converters are not used, and special multiple logic-level gates are used only when, after accounting for their cost, they offer an advantage. Starting from a lowest per cycle energy design whose single supply voltage is in the subthreshold range, a new mixed integer linear program (MILP) finds a second lower supply voltage optimally assigned to gates with time slack. The MILP accounts for the energy and delay characteristics of logic gates interfacing two different signal levels. New types of linearized AND and OR constraints are used in this MILP. We show energy savings of up to 24.5% over the best available designs of ISCAS'85 benchmark circuits.
I. Introduction
Subthreshold circuits offer a promising solution for implementing highly energy-constrained systems for remote or mobile applications. When we scale the power supply voltage (V dd ) below the device threshold voltage (V th ), the subthreshold current ever so slowly charges and discharges nodes for the circuit's logic function [20] . The weak driving current limits the performance but minimum energy operation of the circuit is achieved with reduced dynamic and leakage power resulting in long battery life [11] .
A successful subthreshold design is possible in clock ranges of low to medium frequencies for biomedical and micro-sensor network applications [8] , [16] , [21] . Ultra dynamic voltage scaling (UDVS) [3] can provide more opportunity to spread subthreshold circuit design in various applications by switching between a nominal voltage high performance mode and an energy efficient subthreshold mode according to the system workload. Without the performance requirement, a subthreshold circuit can operate at its minimum energy (E min ) operating point that is somewhat above the absolute minimum voltage (V min ) [22] that would guarantee the correct logic function. Some applications that require moderate speed may not aggressively scale the supply voltage down to the minimum energy point to maintain the performance. Near-threshold circuit design is another choice to cover a wider range of system performances for applications with tolerable energy increase (∼2X) from E min by scaling V dd to near V th [5] . Technology down-scaling improves the speed of a subthreshold circuit, but greater variability may adversely affect E min for extremely small feature size [2] .
Utilizing the time slack for dual V dd is a well-known technique for a circuit operating with nominal V dd for reducing the power consumption with small extra cost in physical design [18] , [19] . However, operation in the subthreshold voltage region has been long predicted and since verified [20] . Most previous works in subthreshold circuit design only used a single supply voltage scaled down to reduce the energy consumption without considering the time slack. The authors of [10] derived a MILP algorithm to minimize the energy consumption of a subthreshold logic circuit using dual V dd . Their work limits full use of the time slack by topological constraints considering multiple voltage boundaries without level converters. Thus, the energy saving of dual V dd design is not achieved as much as expected.
In the present work, we are motivated to exploit full time slack on non-critical paths in a subthreshold circuit using multiple logic-level gates to further reduce E min at its original speed or alternatively have the circuit operate at a higher speed holding the energy consumption close to E min . Figure 1 shows the benefit of dual voltage design for a 32-bit ripple carry adder in 90nm CMOS technology operating in the subthreshold regime. Energy per cycle for the optimized dual voltage design (E dual ) is reduced ∼0.67X from E min that is obtained by scaling down a single supply voltage to its minimum energy operating point at V dd =0.31V. This 32-bit ripple carry adder can also operate ∼7X faster with same energy as E min in another dual voltage design using V dd =0.45V. Finding an optimal lower supply voltage (V DDL ) for a given higher supply voltage (V DDH ) and its assignments is the main problem in dual voltage design. We formulate a mixed integer linear program (MILP) to solve this problem with multiple logic-level gates considering multiple voltage boundaries.
The paper is organized as follows. Section II introduces the dual voltage design from the literature and considers the cost of level converting in the subthreshold regime. In Section III, we present new mixed integer linear program (MILP) models for dual voltage design with multiple logic-level gates. Section IV reports SPICE simulation results to validate the MILP solution. Finally, conclusions and future work are given in Section V.
II. Dual Voltage Design and Level Converters in Subthreshold Regime
In a dual voltage design, assigning the lower supply voltage (V DDL ) only to gates on non-critical paths reduces both dynamic and static leakage power of the circuit. The higher supply voltage (V DDH = V dd ) is assigned to gates on critical paths to maintain the overall circuit performance. By utilizing the time slack, we ensure that there is no performance loss. However, an asynchronous level converter (ALC) is considered essential to suppress DC leakage current and guarantee the correct switching of a V DDH gate driven by a low voltage input signal. The level converting cost, in turn, reduces the power saving of the dual V dd scheme.
Clustered voltage scaling (CVS) [18] and extended clustered voltage scaling (ECVS) [19] algorithms are the two main heuristic methods of assigning dual supply voltages to gates in a circuit. CVS assigns V DDL to gates with positive time slack, working from primary outputs toward primary inputs. By grouping gates into V DDH and V DDL clusters, it does not allow V DDL gates to feed directly into V DDH gates. The V DDH cluster is always located upstream as signals flow. This topological constraint reduces the potential power saving from full use of the time slack that exists inside a circuit. Asynchronous level converters are not needed inside a combinational circuit block, but level converting flip-flops (LCFF) are needed in sequential elements. No power or delay overheads from ALCs exist in CVS. To remove the topological constraint in CVS, ECVS inserts an ALC at each point where a V DDL gate drives a V DDH gate, so that V DDL can be assigned to more gates with time slack. This gives more power saving than CVS.
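The CVS topological constraint can be illustrated with a minimal sketch. This is a simplified illustration, not the algorithm of [18]: the function names, the toy netlist, and the gate delays are all hypothetical, and timing is rechecked by recomputing the longest path rather than by tracking per-gate slack. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

def critical_delay(fanins, delay):
    """Longest path delay through a DAG; fanins[g] lists the gates driving g."""
    arrival = {}
    for g in TopologicalSorter(fanins).static_order():
        arrival[g] = delay[g] + max((arrival[f] for f in fanins[g]), default=0.0)
    return max(arrival.values())

def cvs_assign(fanins, d_high, d_low, t_c):
    """Simplified CVS-style pass: visit gates from outputs toward inputs and
    move a gate to VDDL only if (a) every gate it drives is already VDDL and
    (b) the circuit still meets the cycle time t_c."""
    fanouts = {g: [] for g in fanins}
    for g, drivers in fanins.items():
        for f in drivers:
            fanouts[f].append(g)
    low = set()
    order = list(TopologicalSorter(fanins).static_order())
    for g in reversed(order):  # reverse topological order: outputs first
        if any(s not in low for s in fanouts[g]):
            continue  # g would be a VDDL gate driving a VDDH gate
        trial = {x: d_low[x] if (x in low or x == g) else d_high[x] for x in fanins}
        if critical_delay(fanins, trial) <= t_c:
            low.add(g)
    return low

# Hypothetical netlist: a -> c -> d is critical; b drives both d and e.
fanins = {"a": [], "b": [], "c": ["a"], "d": ["c", "b"], "e": ["b"]}
d_high = {g: 1.0 for g in fanins}  # assumed gate delay at VDDH (arbitrary units)
d_low = {g: 1.5 for g in fanins}   # assumed gate delay at VDDL
print(cvs_assign(fanins, d_high, d_low, t_c=3.0))
```

In this toy case gate `b` has time slack but stays at V DDH because it drives the V DDH gate `d`, which is the kind of unused slack that ECVS and multiple logic-level gates aim to recover.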
We apply the dual voltage technique to subthreshold supply combinational circuits. To maximize energy saving from the time slack, a level converter is still considered essential. In Figure 2 , two traditional ALCs, a differential cascode voltage switched (DCVS) level converter and a pass gate (PG) level converter, are shown. The PG level converter consumes less energy than the DCVS level converter due to fewer devices in it and reduced contention [13] . Compared to the delay of a circuit operating with nominal V dd , the delay of a subthreshold circuit increases exponentially as supply voltage V dd reduces [20] . This means that the time slack is consumed quickly even when a V DDL quite close to V DDH is assigned to gates on non-critical paths. With such a delay characteristic, the delay overhead of the ALC is more critical for implementing a dual V dd design in the subthreshold regime. We use the HSPICE simulator [7] to size the two ALCs so as to reduce their delay in the subthreshold region. The Predictive Technology Model (PTM) for 90 nm CMOS [23] was used in the simulations. Table I shows the delay penalty of the two optimized ALCs in a range of 28∼ 60× INV(FO4) delay, where INV(FO4) is the delay of a standard inverter with fanout of four. The normal ALC delay is considered as 2× INV(FO4) delay [4] for a nominal supply voltage. A low voltage microprocessor has ∼ 400× INV(FO4) delay for a single pipeline stage. The microprocessor operating in subthreshold region would prefer shallow pipeline to mitigate variability and a 40× INV(FO4) delay is considered as a typical design case [17] . To reduce the delay penalty of level converting, we need to investigate alternative approaches to remove ALCs without topological constraints in the dual V dd design. As discussed in the literature, two types of logic gate designs have the capability to handle multiple logic levels.
Among these, the embedded logic level converting circuit [13] may not be a good choice because the previous ALC structures, when integrated in logic gates, will not reduce the overall delay penalty. A level-shifter free design using dual V th [4] places high V th devices in the pull-up PMOS network of a logic gate to suppress DC static leakage with low input signals as shown in Figure 3 . This causes the rise time of the gate to increase, thus the overall level shifting logic gate delay is larger than that of a normal gate (PMOS V th = 0.21V). As shown in Table II , the delay penalty of these multiple logic-level gates is much less than that of standard ALCs in the subthreshold region. Within some range of low input voltages close to V dd , a multiple logic-level INV consumes less leakage power than a standard INV, whose leakage increases as the low input voltage goes down in Figure 4 . Considering the delay and power overheads, we are compelled to use the multiple logic-level gates instead of ALCs in our dual voltage design.
III. MILP for Dual Voltage Design with Multiple Logic-Level Gates
In this section, we design minimum energy circuits with dual V dd assignments without ALCs using mixed integer linear programming (MILP) [6] . Multiple logic-level gates eliminate the use of ALCs and allow V DDL gates to drive V DDH gates with affordable overheads in terms of delay and leakage power in a combinational circuit. First, the performance requirement (critical path delay T c ) of a system is given, and V DDH is determined to satisfy the system speed (or clock cycle time). The MILP automatically assigns the predetermined V DDH to gates on critical paths to maintain the performance and finds the optimal V DDL for gates on non-critical paths to reduce the total energy consumption (i.e., minimum energy per cycle) by a global optimization. In contrast, CVS and ECVS are heuristic algorithms that tend to be non-optimal, because of the backward traversal from primary outputs through gates with time slack for assigning the lower supply voltage V DDL .
Assuming that gates become active once per clock cycle, the total energy per cycle (E tot ) is given by the following equations [20] :
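A form of (1) consistent with the symbol definitions that follow is sketched here. It is a reconstruction from those definitions rather than the original typesetting; in particular, writing C sw as an activity-weighted sum over gates and taking the clock cycle time equal to T c are assumptions.

```latex
\begin{equation}
\begin{aligned}
E_{dyn}  &= C_{sw}\, V_{dd}^{2}, \qquad C_{sw} = \sum_{\text{gates}} \alpha_{0\to 1}\, C_{load} \\
E_{leak} &= P_{leak}\, T_{c} \\
E_{tot}  &= E_{dyn} + E_{leak}
\end{aligned}
\tag{1}
\end{equation}
```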
where α 0→1 is the low to high transition activity for the gate output node and C load is the load capacitance of the gate. In (1), dynamic energy (E dyn ) quadratically depends on scaling the power supply voltage V dd with the total switched capacitance C sw of a circuit, while the leakage energy (E leak ) is linearly proportional to leakage power P leak during a clock cycle.
Before we formulate the MILP model of the optimal minimum energy V DDL assignment, all variables and constant parameters in the MILP model are presented here:
• V v : supply voltage integer variable that is 1 when candidate voltage v, within the span of scaled supply voltages, is one of the two selected supplies (V DDH or V DDL ).
• X i,v : voltage assignment integer variable that is 1 for gate i with supply voltage v.
• F i,v : fan-in integer variable that is 1 for gate i having at least one fan-in gate that is powered by supply voltage v.
• P i,v : penalty integer variable that is 1 when gate i is driven by low input voltage v.
• T i : latest arrival time variable at gate i output from primary input events.
• α i : low to high transition activity of gate i.
• V dd,v : supply voltage value of v.
• C i,v : load capacitance of gate i with supply voltage v.
• P leak,i,v : leakage power of gate i with supply voltage v.
• P leako,i,v : leakage power overhead of multiple logic-level gate i driven by low input voltage v.
• td i,v : gate delay of gate i with supply voltage v.
• tdo i,v : gate delay overhead of multiple logic-level gate i driven by low input voltage v.
• N i : number of inputs for gate i.
• T c : critical path delay of a circuit.
• G tot : total number of gates in a circuit.
• V nom : nominal supply voltage value (1.2V) for 90nm CMOS.
The optimal V DDL assignment for the minimum energy design is modeled by MILP equations:
where V min is the minimum operating voltage for the correct logic function of a gate with subthreshold supply voltage and V low is the lowest input voltage to keep 10% to 90% output voltage swing for a logic gate when V DDH is predetermined. The timing constraints are [14] :
∀i ∈ all gates, ∀j ∈ all fanin gates of gate i (3)
Penalty condition:
∀j ∈ all fanin gates of gate i
Dual supply voltages selection:
As mentioned before, T c is given by the performance requirement. Therefore, V DDH is selected from (9) within the span of scaled supply voltages. In the dual power supply constraints, the MILP chooses only two supply voltages, the given V DDH and the optimal V DDL , and each gate in the circuit must then be assigned to one of them from (11); we use a bin-packing technique [1] . The penalty condition tests for the existence of a V DDH gate driven by at least one V DDL fan-in gate from (5) (Boolean OR) and (6) (Boolean AND). The nonlinear Boolean functions are expressed as linear constraints. When a penalty exists, P i,V DDL becomes 1 and (7) allows low voltage inputs to drive a V DDH gate by replacing it with a multiple logic-level gate. While assigning V DDL to gates with time slack, the MILP checks for timing violations against the clock time using the timing constraints (3) and (4). Cost function (2) balances both delay and leakage penalties of the multiple logic-level gates.
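One standard way to express Boolean OR and AND over binary variables as linear inequalities is sketched below and checked by exhaustive enumeration. This is an illustration only: the exact forms of (5) and (6) may differ, and the helper names are hypothetical. In the notation above, `f` would play the role of F i,V DDL (the OR over fan-in assignments X j,V DDL ) and `p` the role of P i,V DDL (the AND of X i,V DDH and F i,V DDL ).

```python
from itertools import product

def or_feasible(f, xs):
    """Linear constraints that force binary f to equal OR(xs):
    f >= x_j for every j, and f <= sum of all x_j."""
    return all(f >= x for x in xs) and f <= sum(xs)

def and_feasible(p, a, b):
    """Linear constraints that force binary p to equal AND(a, b):
    p <= a, p <= b, and p >= a + b - 1."""
    return p <= a and p <= b and p >= a + b - 1

def check_linearization(n_inputs=3):
    """Enumerate every binary input and confirm that the only feasible
    value of the dependent variable is the Boolean result."""
    for xs in product((0, 1), repeat=n_inputs):
        if [f for f in (0, 1) if or_feasible(f, xs)] != [int(any(xs))]:
            return False
    for a, b in product((0, 1), repeat=2):
        if [p for p in (0, 1) if and_feasible(p, a, b)] != [a & b]:
            return False
    return True

print(check_linearization())
```

Because each inequality is linear in the binary variables, constraints of this shape can be passed directly to an MILP solver without introducing products of variables.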
IV. Results
All simulation results are from SPICE using PTM 90nm CMOS at room temperature (300K). The CMOS device threshold voltages are V th,pmos = 0.21V and V th,nmos = 0.29V at nominal V dd = 1.2V. For simplicity, we use only four types of basic standard cells, namely, INV, NAND2, NAND3, and NOR2, to synthesize ISCAS'85 benchmark circuits. Therefore, only four types of multiple logic-level gates are used, each with a high PMOS threshold voltage (V th,pmos = 0.29V) assigned to the pull-up PMOS network of the basic cell.
We assume that randomly generated input signals with high input voltage V DDH drive all primary inputs of the circuit. Two subthreshold supply voltages, V DDH and V DDL , can be provided by a voltage scalable DC to DC converter [15] . We also assume that combinational benchmark circuits have no restriction on the primary output voltage level, which may be either V DDH or V DDL . In reality, level converting flip-flops (LCFF) [18] can be placed at low voltage primary outputs as the sequential elements of the design.
The MILP algorithm of Section III is applied to find the optimal V DDL for the benchmark circuits with given performance (i.e., V DDH ) in the subthreshold region. Table III shows SPICE simulation results for single V dd total energy per cycle as a reference and dual V dd optimized energy per cycle with the optimal V DDL selection. Activity α is the average number of low to high transitions at circuit nodes and V DDL is the optimal low voltage supply corresponding to V DDH . Multiple logic-level gates were not required for c432, c499 and c1355, and therefore, there were no V DDH gates driven by V DDL gates in the optimized circuits; they were the same as in [10] . From (7), the MILP algorithm automatically determines whether or not a multiple logic-level gate is to be used based upon the benefit of energy saving. The design of c3540 shows that the energy saving of the dual V dd circuit is improved 15.7% more than [10] . Multiple logic-level gates remove topological constraints and allow V DDL gates to drive V DDH gates. Thus, the MILP can assign V DDL to more gates on non-critical paths and further increase energy saving as expected. For the dual V dd design with multiple logic-level gates, the best case is about 24.5% energy reduction for c880 (8-bit ALU). Another circuit, c6288 (a 16×16 multiplier), has only 3.8% reduction. There is little benefit of dual V dd design for c432, c499, and c1355, where most paths are balanced. The optimized circuits show an energy saving of 14.0% on average, even though this includes the path balanced circuits. Figure 5 shows the gate slack distributions obtained from static timing analysis [9] of the single V dd and dual V dd designs of c880. Clearly, it is the large number of gates with large slack in the single V dd design that allows many low V dd assignments.
The energy saving from dual voltage design depends on the time slacks of gates. In subthreshold region it is also affected by the number of V DDL gates driven by V DDH gates. Leakage current of PMOS devices in a V DDL gate is suppressed by high voltage input signal from a V DDH gate, because the source to gate voltage, V sg , in PMOS devices is negative. The leakage energy is comparable to dynamic energy in subthreshold region. This leakage reduction is another benefit of dual voltage design for low voltage circuits. The dual voltage technique for a nominal voltage circuit is mainly applied for dynamic power saving, while leakage power saving is considered negligible [12] .
V. Conclusion and Future Work
This paper presents dual voltage design in the subthreshold regime. Level converters are eliminated and special multiple logic-level gates are used instead. This approach is particularly beneficial for subthreshold voltage operation. A new MILP is devised to find an optimal low supply voltage below a given subthreshold supply voltage. The given supply voltage is chosen for the minimum energy per cycle for any single voltage. When paired with the lower voltage from the MILP, the energy is further reduced. The MILP optimally selects the boundaries between the supply voltage domains to position multiple logic-level gates. With this MILP, ISCAS'85 benchmark circuits could save up to 24.5% energy per cycle. Notably, the energy per cycle for these designs is always less than the absolute minimum energy point for the circuit for single voltage operation. Alternatively, the MILP can trade energy reduction for speed increase without letting the energy rise. For large circuits, the MILP may suffer from an unacceptably long run-time as the optimization algorithm for dual V dd design has exponential-time complexity. Gate slack analysis [9] provides an opportunity to reduce the time complexity to 
