We investigate a design strategy for subthreshold circuits focusing on energy-consumption minimization and yield maximization under process variations. The design strategy is based on the following findings related to the operation of low-power CMOS circuits: (1) The minimum operation voltage (V DD min ) of a circuit is dominated by flip-flops (FFs), and V DD min of an FF can be improved by upsizing a few key transistors, (2) V DD min of an FF is stochastically modeled by a log-normal distribution, (3) V DD min of a large circuit can be efficiently estimated by using the above model, which eliminates extensive Monte Carlo simulations, and (4) improving V DD min may substantially contribute to decreasing energy consumption. The effectiveness of the proposed design strategy has been verified through circuit simulations on various circuits, which clearly show the design tradeoff between voltage scaling and transistor sizing.
Introduction
Having more than one processor on an electronic device has already become commonplace with the advancement of integrated circuit technology. The number of such devices that intelligently support our daily lives is still rapidly increasing. Among others, sensor devices, which gather and process information from our surrounding environment, are attracting greater interest [1]-[3]. Applications of those devices span vast areas, from health care, to environmental monitoring, to animal behavior tracking [4]. Sensor devices are required to achieve the lowest energy consumption to maintain the required data processing within a limited power supply capacity. Energy harvesting from, e.g., solar light [5] or vibrations [6] may be used to assist a battery, but the energy density and availability of such sources are usually strictly limited. Both extremely low energy consumption and low peak power consumption are important for these types of circuits.
Supply voltage scaling is one of the most efficient techniques for reducing energy consumption because dynamic power scales quadratically with voltage. Lowering supply voltage, in general, reduces both peak power dissipation and total energy consumption. Subthreshold operation, which utilizes low supply voltage near or below transistors' threshold voltages, is considered a promising technique that can substantially reduce power dissipation and energy consumption [7]. Subthreshold operation has been widely explored in recent years. However, its design methodology is not fully established because it faces several challenging issues. Minimum operation voltage [8] is one of the most important issues. Each circuit has its own minimum operation voltage (V DD min ), below which the circuit does not function as designed. Given that the minimum operation voltage is highly sensitive to process variation, it varies from chip to chip. It is difficult, in particular in the case of large-scale logic circuits, to determine a design-dependent optimal supply voltage so that power consumption is minimized and yield is maximized [9]. Another, but related, issue is leakage energy. As the supply voltage decreases, the delay of each gate increases exponentially, which lengthens operation time and thus exponentially increases leakage energy.
The goal of this work is to establish a design strategy for subthreshold circuits that minimizes energy consumption and maximizes yield. This paper presents the following results based on extensive circuit simulations using transistor models and standard cells made using a commercial 65-nm CMOS technology.
• Robustness in low voltage operation is dominated mainly by flip-flops (FFs), and the robustness is improved drastically by avoiding the use of small transistors.
• The minimum operation voltage (V DD min ) distribution of an FF is closely approximated by a log-normal distribution. Therefore, the V DD min distribution of a large-scale circuit can be quickly and accurately estimated through analytical calculations. This modeling enables us to eliminate Monte Carlo circuit simulations of the large-scale circuit, which are infeasible in many cases.
• Energy minimization and yield maximization are achieved simultaneously by transistor sizing of an FF, and by finding an operation voltage that achieves minimum energy consumption under a V DD min constraint.
Examples of the proposed energy minimization will be demonstrated through designs of three modules in a JPEG encoder, which differ in terms of the total FF count and the ratio between FFs and combinational logic gates. Experimental results show that the choice of suitable operation voltages as well as transistor sizes in an FF can lead to a 20% energy savings for one of the three modules, a DCT circuit, as compared with the design using a nonoptimized FF.
The remainder of this paper is organized as follows. Section 2 reviews the issues in low supply voltage circuit design, e.g., leakage energy and V DD min . In Sect. 3, the failure mode in low-voltage operation is studied through circuit simulations of simple circuits, and a method for V DD min estimation is proposed. Section 4 proposes a design strategy for subthreshold circuits, which achieves both energy minimization and yield maximization. In Sect. 5, its effectiveness is demonstrated through an example design of a JPEG encoder. Finally, Sect. 6 draws conclusions.
Copyright © 2012 The Institute of Electronics, Information and Communication Engineers
Design Issues in Low Voltage Operation

Energy Consumption
Power dissipation of a CMOS circuit is expressed by the following equation [10] :
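A form of Eq. (1) that is consistent with the three terms and the symbols defined in the text can be written as follows. The first and third terms follow directly from those definitions; the prefactor of the short-circuit term (here α f) is an assumption.

```latex
P = \alpha f C V_{DD}^{2} + \alpha f Q_{SC} V_{DD} + I_{LEAK} V_{DD} \tag{1}
```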
Here, α is the switching factor, f is the clock frequency, C is the load capacitance, Q SC is the electric charge consumed as a short-circuit current, I LEAK is the leakage current, and V DD is the supply voltage. The first term represents dynamic power dissipation owing to charging and discharging of the load capacitance. The second term represents dynamic dissipation by a short-circuit current. The third term represents static dissipation from the leakage current. Here, power dissipation by the short-circuit current, the second term in Eq. (1), is relatively smaller than the others and the duration of the short-circuit current is short. Hence, we will focus on the first and the last terms of the equation in this paper. From Eq. (1), we observe that the power P is reduced by lowering α, f, C, or V DD . Among others, supply voltage scaling is extremely effective because it is related to all the terms in Eq. (1) and the first term will be reduced quadratically. This paper focuses on near- or subthreshold operations where V DD is set around or lower than the threshold voltages of the transistors. To enable subthreshold operations, however, several issues should be considered.
First, supply voltage scaling reduces the overdrive voltage of a transistor and thus reduces its driving current. As a consequence, gate delay increases and circuit performance degrades. Figure 1 shows oscillation periods of a ring oscillator (RO) composed of 101 inverters as a function of the supply voltage. In the low-voltage region, delay increases exponentially as the supply voltage is lowered.
Second, although the power dissipation from the leakage current, the third term in Eq. (1), becomes small as the supply voltage is lowered, the delay increase is more significant since it has an exponential dependence on the supply voltage. As a result, the leakage energy increases when the supply voltage is scaled to a very low value. These effects result in the existence of a minimum point in the plot of total energy consumption against supply voltage [11] , as shown in Fig. 2 . When we pursue energy efficiency, it is desirable for circuits to operate as close to the minimum energy voltage (V ENE min ) as possible.
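The existence of a minimum-energy voltage can be illustrated with a toy model. The sketch below is not the simulation setup of this paper: it assumes a normalized dynamic energy proportional to V DD squared and a leakage energy equal to the dynamic energy multiplied by k_leak · exp(−V DD / n_vt), which mimics the exponential growth of delay at low voltage. The function names and the parameters k_leak, n_vt, and c_eff are hypothetical values chosen only for illustration.

```python
import math


def energy_per_operation(vdd, k_leak=1e4, n_vt=0.036, c_eff=1.0):
    """Toy normalized energy per operation at supply voltage vdd (volts).

    Assumed model (not taken from the paper): dynamic energy is c_eff * vdd**2,
    and leakage energy is that value scaled by k_leak * exp(-vdd / n_vt),
    reflecting the exponential delay increase in the subthreshold region.
    """
    dynamic = c_eff * vdd ** 2
    leakage = dynamic * k_leak * math.exp(-vdd / n_vt)
    return dynamic + leakage


def minimum_energy_voltage(v_lo=0.15, v_hi=1.2, step=0.001):
    """Grid-search the supply voltage that minimizes the toy energy model."""
    n = int(round((v_hi - v_lo) / step))
    grid = [v_lo + i * step for i in range(n + 1)]
    return min(grid, key=energy_per_operation)


v_ene_min = minimum_energy_voltage()
print(f"toy V_ENEmin = {v_ene_min * 1000:.0f} mV")
```

With these hypothetical parameters, the dynamic term dominates at high voltage and the leakage term dominates at very low voltage, so the total has an interior minimum.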
Minimum Operation Voltage
Each circuit has its own minimum operation voltage (V DD min ), below which the circuit fails to work correctly. Because of this, not all circuits are able to work at their minimum energy voltage V ENE min . Figure 3 shows V DD min of five RO circuits, which are composed of inverter, 2NAND, 4NAND, 2NOR, or 4NOR gates, respectively. The number of stages is 101 for all ROs and the results are based on circuit simulations. Error bars show the maximum and the minimum V DD min over 100 circuit simulation runs with threshold voltage variation applied. Figure 3 indicates that V DD min of a combinational circuit is affected by process variation and that high fan-in gates tend to have higher V DD min . Assuming that the operating frequency for a given V DD is determined at the design phase, V ENE min is relatively insensitive to threshold voltage variation since power consumption is averaged over the chip. In contrast, V DD min is strongly affected by it. It is desirable that V DD min is made sufficiently smaller than V ENE min by design. This can be achieved by appropriately sizing transistors in standard logic gates.
Yield Modeling at Low-Voltage Operation
Investigation of the Cause of Errors
A logic circuit fails to work correctly at low voltages for various reasons, such as attenuation of signal amplitude in combinational circuits or write/read failures in FFs. Ensuring correct operation of circuits and improving yield without significantly increasing energy consumption requires clarifying the mechanism of the circuit errors.
To investigate the cause of such errors in low-voltage operation, we first analyze a simple circuit of the target process technology illustrated in Fig. 4 . This is a 1-bit counter that consists of an odd number of 4NOR gates (the worst case logic gate in Fig. 3 ) and an FF. By cascading k-4NOR gates, the circuit models a path of a combinational circuit that has a logic depth of k. Using this circuit, we can observe behavior of both combinational and sequential circuit components in low supply voltages.
V DD min of this circuit is evaluated by circuit simulations. Starting from 1.2 V, the supply voltage is gradually lowered until the FF produces an incorrect output. Transistor models of a commercial 65-nm technology are used. Area-dependent threshold voltage variation [12] has been considered in the simulations. To observe the influence of the logic depth of combinational circuits, k has been altered from 3 to 31. In this study, we excluded timing errors when V DD min is considered, since setup errors can be avoided by applying a sufficiently low clock frequency and hold errors can be avoided by logic synthesis, i.e., removal of short paths.
Figure 5 shows a transistor-level schematic diagram of an FF used in this analysis. For FFs in standard cell libraries that are intended to be used at nominal voltages, it is common for transistors of very small gate width W to be used as feedback and clock inverters. Assuming this to be a major cause of failures in the subthreshold region, we conducted simulations to see how much improvement against process variation is obtained through enlarging the widths of these small transistors. Figure 6 shows the results, including the cases where transistor-gate widths are all 1.5× and 2×, respectively, larger than their original sizes. Here, the transistor sizing is limited only to the small nMOS and pMOS transistors that comprise the logic gates marked "x" in Fig. 5. We can clearly see that the distributions successfully move to lower voltages as we increase transistor sizes. Although these transistors play important roles in an FF (e.g., holding data and distributing the clock signal), their default sizes are designed to be small to save area and power. Given that the most influential process variation is threshold voltage variation from random-dopant fluctuation, they are highly sensitive to process variation. It is concluded that V DD min will be significantly improved by upsizing small transistors in FFs.
Figure 6 also indicates that the logic depth does not affect V DD min very much for different channel widths. The FF can be considered as the dominant factor that determines V DD min .
Efficient Method for Estimating V DD min
Considering the above results, we propose a method to estimate V DD min for a large circuit. In general, carrying out a circuit simulation on a very large circuit is time consuming. It takes an impractically long time to obtain a statistical distribution of V DD min by repeating the simulation to reflect process variations. It is desirable to develop a way to estimate V DD min of a large circuit by using simulation results of smaller and simpler circuits.
It is obvious that all FFs in a circuit must work correctly to perform correct operation. Consequently, the probability that the target circuit works correctly at a supply voltage of V DD = V i is given by the following equation:
$$P_{V_i} = \left(P_{V_i\mathrm{FF}}\right)^{N} \qquad (2)$$
where N is the number of FFs and P V iFF is the probability that the circuit shown in Fig. 4 works correctly at the supply voltage of V i . From Eq. (2), even if P V iFF is close to 1, P V i becomes exponentially small when N, the number of FFs in a target circuit, is very large. Figure 7 shows the influence of N on P V i . When N = 10^3, for example, P V iFF must be accurate to about five decimal places to estimate the V DD min that guarantees 95% yield. This means that we need to know the distribution of V DD min very accurately, especially around the high-V DD region where the lowest bound of the operational voltage is determined. Otherwise, the estimation of V DD min loses confidence and thus a large guard-band would be required for the minimum operation voltage.
Fig. 7 The probability that the target circuit works correctly at a supply voltage V DD = V i for different N and P V iFF . N is the number of FFs in the target circuit and P V iFF is the probability that the circuit shown in Fig. 4 works correctly at the supply voltage of V i .
Figure 8 shows the distribution of V DD min of differently sized FFs. We set the CLK and DATA signals so that both setup and hold times are satisfied. To obtain smooth and reliable histograms, we simulated the circuit 10,000 times. The tail region may need to be simulated very accurately when N is large. Instead of increasing the number of simulations, we propose modeling the V DD min of the circuit by using log-normal distributions fitted to the experimental results of Fig. 8. The approximations using the proposed model are also presented in the same figure. We find very good matches between the corresponding simulation results and the proposed models.
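The sensitivity of the circuit yield to small changes in the per-FF probability, noted above for N = 10^3, can be checked numerically. This is a minimal sketch that assumes Eq. (2) has the product form implied by the text (every one of the N FFs must work, each with the same independent probability); the function name circuit_yield is hypothetical.

```python
def circuit_yield(p_ff, n_ff):
    """Probability that all n_ff FFs work, each with independent probability p_ff."""
    return p_ff ** n_ff


# Per-FF probabilities that differ only in the fifth decimal place
for p_ff in (0.9999, 0.99995, 0.99999):
    print(f"P_FF = {p_ff}: circuit yield for N = 1000 is {circuit_yield(p_ff, 1000):.4f}")
```

Under this assumption, a change of 5 × 10^-5 in the per-FF probability moves the yield of a 1000-FF circuit across the 95% line, which is why the upper tail of the V DD min distribution has to be known accurately.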
Using this model, V DD min of general circuits can be estimated accurately. The two shape parameters of the log-normal distribution p(V) can be obtained by fitting the histogram of V DD min as in Fig. 8. Then, P V iFF is given as the following cumulative distribution:
$$P_{V_i\mathrm{FF}} = \int_{0}^{V_i} p(V)\,dV \qquad (3)$$
Fig. 9 Illustrative example of the proposed design strategy to find the optimal V DD for energy minimization under a yield constraint.
Substituting Eq. (3) into Eq. (2), we can quickly estimate V DD min of a general circuit without executing Monte Carlo simulations.
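The substitution of the log-normal cumulative distribution into the circuit-level yield can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the log-normal parameters MU and SIGMA are hypothetical placeholders rather than fitted values from Fig. 8, FFs are assumed to fail independently, and the bisection search and all function names are choices made for this example.

```python
import math


def lognormal_cdf(v, mu, sigma):
    """CDF of a log-normal distribution with log-domain mean mu and std sigma."""
    if v <= 0.0:
        return 0.0
    z = (math.log(v) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def circuit_yield_at(v, n_ff, mu, sigma):
    """Yield of a circuit with n_ff independent FFs at supply voltage v."""
    return lognormal_cdf(v, mu, sigma) ** n_ff


def circuit_vddmin(target_yield, n_ff, mu, sigma, lo=0.0, hi=1.2, tol=1e-6):
    """Lowest supply voltage (volts) meeting target_yield, found by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if circuit_yield_at(mid, n_ff, mu, sigma) >= target_yield:
            hi = mid  # mid is feasible, so search lower voltages
        else:
            lo = mid  # mid is infeasible, so search higher voltages
    return hi


# Hypothetical parameters: median FF V_DDmin of 200 mV, log-domain sigma of 0.15
MU, SIGMA = math.log(0.20), 0.15
for n_ff in (100, 1000, 10000):
    v = circuit_vddmin(0.99, n_ff, MU, SIGMA)
    print(f"N = {n_ff}: V_DDmin for 99% yield is about {v * 1000:.0f} mV")
```

Each estimate needs only a few dozen evaluations of a closed-form expression, which illustrates why the analytical approach is fast compared with Monte Carlo simulation of the full circuit.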
Design Strategy for Energy Minimization
In this section, we propose a design strategy for minimum energy consumption by subthreshold circuits while satisfying a given yield constraint. Our strategy consists of the following steps.
Step 1: Estimate the required FF yield P V iFF to satisfy a circuit yield requirement.
Step 2: Estimate the minimum supply voltage that satisfies the FF yield P V iFF .
Step 3: Obtain the V DD -energy curve.
Step 4: Find the optimal V DD for the subthreshold circuit.
These steps are illustrated in Fig. 9 . The vertical axes of Figs. 9(a) and (b) are equal, and, hence, the scale of the vertical axis is omitted for (b). Similarly, the horizontal axes of Figs. 9(b) and (c) are equal, and thus the horizontal axis of (c) is omitted.
Step 1: Estimation of P V iFF
First, we translate the required circuit yield into the yield of an individual FF. The inputs of this step are the required circuit yield P V i and the number of FFs in the circuit, N. The output is P V iFF . This translation can be carried out by solving Eq. (2) inversely,
$$P_{V_i\mathrm{FF}} = \left(P_{V_i}\right)^{1/N}$$
For example, to guarantee an 80% yield when the target circuit has 100 FFs, the required yield for the FF cell is 0.8^{1/100} ≈ 0.9978. An example of this translation is graphically presented in Fig. 9(a).
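The translation in Step 1 is a one-line calculation. The sketch below assumes the product form of Eq. (2), so that the required per-FF yield is the N-th root of the circuit yield; the function name required_ff_yield is hypothetical.

```python
def required_ff_yield(circuit_yield, n_ff):
    """Per-FF yield needed so that n_ff independent FFs meet circuit_yield."""
    return circuit_yield ** (1.0 / n_ff)


# The example from the text: 80% circuit yield with 100 FFs
print(round(required_ff_yield(0.80, 100), 4))  # 0.9978
```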
Step 2: Estimation of Minimum V DD to Satisfy FF Yield
Next, we calculate the minimum V DD to satisfy the required FF yield obtained in the previous step. The input of this step is P V iFF and the output is V DD min . This is carried out according to the following substeps.
1. Derive a log-normal distribution model p(V) of the FFs used in the circuit. The parameters of the log-normal distribution are determined so that the model best approximates the histograms of a Monte Carlo simulation, as shown in Fig. 8. The Monte Carlo simulation is computationally intensive, but it is necessary only when we introduce a new FF cell.
2. Obtain the cumulative distribution P V iFF by Eq. (3).
3. Find V DD min by reversely looking up the cumulative distribution P V iFF . An example of the look-up is presented as a dotted line in Fig. 9(b).
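The three substeps above can be sketched in a few lines. This example makes several assumptions: the log-normal parameters are estimated by taking the mean and standard deviation of the logarithm of the samples, which is one common fitting method and not necessarily the histogram fit used in this paper; the sample list is a tiny hypothetical stand-in for thousands of Monte Carlo results; and the function names are illustrative.

```python
import math
from statistics import NormalDist, fmean, stdev


def fit_lognormal(vddmin_samples):
    """Estimate log-normal parameters (mu, sigma) from V_DDmin samples in volts."""
    logs = [math.log(v) for v in vddmin_samples]
    return fmean(logs), stdev(logs)


def vdd_for_ff_yield(mu, sigma, p_ff):
    """Reverse look-up of the log-normal CDF: voltage at which the FF yield is p_ff."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p_ff))


# Hypothetical samples; a real fit would use the full Monte Carlo histogram
samples = [0.20, 0.25, 0.3125]
mu, sigma = fit_lognormal(samples)
print(f"V_DDmin for FF yield 0.9978: {vdd_for_ff_yield(mu, sigma, 0.9978) * 1000:.0f} mV")
```

Because the fit is done once per FF cell, the reverse look-up can be reused for any circuit built from that cell.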
Step 3: Obtaining the V DD -Energy Curve
We then obtain the V DD -energy curve as in Fig. 9(c) . This is carried out according to the following two steps.
1. For each supply voltage, find the maximum operating frequency f max of the circuit. In this paper, we assume that f max for each supply voltage is determined at the design phase. In other words, we do not assume runtime clock frequency adjustment considering a process variation of the fabricated chip. Note that the critical path at each supply voltage must be chosen because the critical path may be supply voltage dependent. Calculation of f max can be conducted, for example, by utilizing static timing analysis (STA) and its library dedicated to low supply voltages. A sufficiently large number of top-n worst paths at the nominal supply voltage may also be used. In our case, 100 candidate paths are extracted using STA at nominal voltage, and then the actual worst-path delay at each supply voltage is calculated by circuit simulations on the candidate paths.
2. For each supply voltage, derive the energy consumption of the circuit at f max . Unlike V DD min , energy consumption is less sensitive to random process variation because energy variation is averaged over a large number of logic gates in the circuit. This enables us to obtain the energy consumption through only a single run of circuit simulation at the typical process condition.
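The per-voltage selection of the critical path described in substep 1 can be sketched as follows. The delay values, path names, and the function fmax_per_voltage are hypothetical; in practice the delays would come from circuit simulations of the candidate paths extracted by STA.

```python
def fmax_per_voltage(path_delays):
    """Map each supply voltage to (critical path name, f_max in Hz).

    path_delays: {vdd_volts: {path_name: delay_seconds}} for the candidate paths.
    """
    result = {}
    for vdd, delays in path_delays.items():
        critical = max(delays, key=delays.get)  # slowest path at this voltage
        result[vdd] = (critical, 1.0 / delays[critical])
    return result


# Hypothetical delays in which the critical path changes with supply voltage
candidate_delays = {
    1.2: {"path_a": 2.0e-9, "path_b": 1.8e-9},
    0.4: {"path_a": 1.0e-6, "path_b": 1.3e-6},
}
for vdd, (path, fmax) in fmax_per_voltage(candidate_delays).items():
    print(f"V_DD = {vdd} V: critical path {path}, f_max = {fmax:.3e} Hz")
```

Keeping a pool of candidate paths rather than a single nominal critical path is what allows the worst path to be re-selected at each voltage.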
Step 4: Finding the Optimal V DD
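Given the yield constraint, the supply voltage cannot be lower than V DD min . From the results of Steps 2 and 3, either of the following two cases applies. If V DD min is higher than V ENE min , the supply voltage cannot be lowered to the energy minimum point, and V DD min is the optimal supply voltage. If V DD min is not higher than V ENE min (the case shown in Fig. 9(c)), the supply voltage can be lowered to the energy minimum point V ENE min , which is the energy-optimized supply voltage.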
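The selection in Step 4 reduces to taking the larger of the two voltages, provided the energy curve rises monotonically above V ENE min (an assumption of this sketch). The example uses the DCT values reported later in this paper (V DD min of 418, 323, and 281 mV from Table 3 and V ENE min near 300 mV from Fig. 11(b)); the function name optimal_vdd is hypothetical.

```python
def optimal_vdd(vdd_min_mv, v_ene_min_mv):
    """Optimal supply voltage in mV under a yield constraint.

    Assumes energy increases monotonically for voltages above v_ene_min_mv,
    so the best feasible point is the larger of the two voltages.
    """
    return max(vdd_min_mv, v_ene_min_mv)


# DCT circuit: V_DDmin per FF size, with V_ENEmin near 300 mV
dct_vdd_min = {"x1": 418, "x1.5": 323, "x2": 281}
dct_optimal = {size: optimal_vdd(v, 300) for size, v in dct_vdd_min.items()}
print(dct_optimal)
```

Only the ×2 FF reaches the energy minimum point here; the other two sizes are limited by V DD min.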
Experimental Designs of Subthreshold Circuits
In this section, we apply the proposed design strategy to experimental subthreshold circuits for validation. The circuits used are the following: discrete cosine transform (DCT), quantization (Q), and variable codeword length coding (VLC). These are chosen so that the ratios of FFs to their combinational counterparts differ, because this ratio affects how the minimum energy voltage is determined. Table 1 shows the number of FFs and the total number of cell instances in these circuits. These circuits will be optimized by the selection of FFs and supply voltage to satisfy a predefined yield constraint with minimum energy consumption. In Sect. 5.1, the V DD min estimation proposed as Step 1 in the previous section is applied to these circuits. The scalability of the proposed model will be verified because these circuits have different numbers of FFs. In Sect. 5.2, Steps 2-4 of the proposed design strategy are applied. In addition, insight into energy optimization by selecting FFs of different transistor sizes using the proposed design strategy is presented in Sect. 5.3.
Estimation of V DD min
First, we evaluate the proposed estimations of V DD min . Then the estimations are compared with the results of full circuit simulations. The estimated V DD min distributions using the proposed method for differently sized FFs are shown in Fig. 10 .
To evaluate the accuracy of these estimations, we also show the histograms of V DD min of these circuits obtained through Monte Carlo circuit simulations. "MC" in Fig. 10 shows the V DD min distributions where 100 simulation runs are performed for FFs with different transistor W's; ×1, ×1.5, and ×2. To obtain accurate V DD min through simulation, activation of all paths is needed. Since the initial states of FFs are reset to 0's, we inject test patterns that include a lot of 1's in a binary representation (e.g., negative numbers in the 2's complement, etc.), in the early stage of the test vector to toggle these FFs from 0 to 1, followed by zero-rich patterns to toggle these FFs again to 0.
Among these simulations, no timing constraint is assumed, which is the same approach as the former simulations for Fig. 8. By comparing the distributions in Fig. 10, we can observe that the V DD min distributions of all circuits are well approximated by the proposed model. This indicates that V DD min for large circuits can be estimated stochastically by tractable circuit simulations only. Table 2 compares the required time to obtain the V DD min distributions by the proposed method and the Monte Carlo simulation. For all three circuit modules, the proposed method requires less than one second to obtain each distribution, whereas it takes several days if we resort to brute-force Monte Carlo simulations. Note that, to apply the proposed method, the V DD min distribution of an FF shown in Fig. 8 is required. It takes about 5 hours in our case. Once we obtain the V DD min distribution of an FF, we can estimate any circuit in a very short time by using Eq. (2).
Table 2 Required processing time to obtain V DD min distribution for each circuit.
Module | Proposed method (Est. in Fig. 10) | Monte Carlo (MC in Fig. 10)
VLC | less than 1 sec | 7 days
DCT | less than 1 sec | 4 days
Q | less than 1 sec | 2 days
In this section, an equal yield requirement of 99% is assumed for all the circuit modules. In this condition, V DD min estimations are obtained as shown in Table 3 .
Evaluation of Energy Consumption and Finding Optimal V DD
Next, we consider the energy consumption of the circuits by applying Steps 2-4 of the proposed design strategy. Figure 11 shows the total energy required to process one block (8 × 8 pixels) of an image obtained through full circuit simulations. From Fig. 11, it is observed that V ENE min of the DCT circuit is relatively low, while that of the VLC circuit is relatively high. Since it is known that V ENE min is strongly affected by the switching factor α [13], it is inferred that α of the DCT circuit is relatively high, while that of the VLC circuit is relatively low. The difference in the ratio of dynamic and static energy changes V ENE min of a circuit. For example, the DCT circuit employs a pipelined architecture, and thus most of the FFs in the circuit are activated at almost every clock cycle. On the other hand, for the VLC circuit, we applied various input vectors, most of which include many 0's in higher order components. These are realistic inputs for a VLC circuit of a JPEG encoder. Circles in this figure indicate V DD min values that guarantee the given yield requirement, as shown in Table 3.
(1) VLC
From Table 3, V DD min values for the VLC circuit are 398, 308, and 267 mV using the FFs of ×1, ×1.5, and ×2, respectively. From Fig. 11(a), the VLC circuit has its own V ENE min near 400 mV in all cases. Thus, the optimal V DD is V ENE min in all cases, since V ENE min is higher than V DD min .
(2) DCT
From Table 3, V DD min values for the DCT circuit are 418, 323, and 281 mV using the FFs of ×1, ×1.5, and ×2, respectively. From Fig. 11(b), the DCT circuit has its own V ENE min near 300 mV in all cases. Thus, the optimal V DD is V ENE min in the case of the ×2 FF, whereas the optimal V DD is dominated by V DD min in the other cases since V DD min is higher than V ENE min .
(3) Quantization
From Fig. 11(c) , the quantization circuit has its own V ENE min near 350 mV in all cases. This indicates that in ×1.5 and ×2 cases, we may set the supply voltage at 350 mV for minimizing energy, whereas we need to operate at 432 mV in the ×1 case.
Energy Optimization with FF Sizing
As we have previously seen, V DD is an important parameter for optimizing energy consumption. From Fig. 11 , we can see that the transistor size of the FFs is another important parameter.
As described in Sect. 2, energy consumption takes a minimum value at a supply voltage V ENE min , and we can operate the circuit at V ENE min when V DD min is lower than V ENE min to meet the given yield constraint. As we have seen in Sect. 3.1, upsizing transistors in the FFs improves V DD min at the penalty of increased energy consumption. Upsizing transistors increases (1) dynamic power owing to increased parasitic capacitance, and (2) leakage power owing to decreased resistance. This energy increase should be smaller than the energy savings obtained by lowering the supply voltage. To investigate this trade-off, we review Fig. 11 again.
In the VLC circuit, V DD min is lower than V ENE min even in the case of the ×1 FF. Thus, there is no benefit of using larger FFs. In the DCT circuit, V DD min using the ×1 FF is higher than V ENE min and transistor sizing contributes to V DD lowering. However, using the ×2 FF is inefficient because it results in excessive energy consumption in total. From Fig. 11(b) , we can see that operating at 323 mV using the ×1.5 FF is the best choice among the three sizes. It achieves an energy savings of about 20 % compared with the original design of 418 mV using the ×1 FF. Also, in the quantization circuit, V DD min using the ×1 FF is higher than V ENE min . By comparing energies at 432 mV for the ×1 FF, at 350 mV for the ×1.5 FF, and at 350 mV for the ×2 FF, we can determine that using the ×1.5 FF at 350 mV is the best choice.
The major drawbacks of transistor sizing are the area and performance overheads. As for the area overhead, our ×2 FF cell occupies about 21% more area than the ×1 FF cell. Assuming that all ×1 FF cells in the DCT circuit are replaced by ×2 FF cells, the area occupied by the FF cells (22.3% of the circuit) will increase by 21%, resulting in an overall area overhead of 22.3% × 21% = 4.68%. Similarly, the overall area overheads for the VLC and Q circuits are even smaller than that of the DCT circuit, at 0.40% and 4.49%, respectively.
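The area arithmetic above can be reproduced directly. The sketch assumes that only the FF cells grow and that all of them grow by the same 21%. The FF area fractions for the VLC and Q circuits are not stated in the text; the values printed for them are back-calculated from the reported overheads under that same assumption. Function names are hypothetical.

```python
def overall_area_overhead(ff_area_fraction, ff_cell_growth):
    """Circuit-level area overhead when only the FF cells grow."""
    return ff_area_fraction * ff_cell_growth


def implied_ff_fraction(overall_overhead, ff_cell_growth):
    """FF area fraction implied by a reported overall overhead (an inference)."""
    return overall_overhead / ff_cell_growth


dct = overall_area_overhead(0.223, 0.21)
print(f"DCT overall overhead: {dct * 100:.2f}%")  # 4.68%

# Back-calculated FF area fractions, assuming the same 21% cell growth
for name, overhead in (("VLC", 0.0040), ("Q", 0.0449)):
    print(f"{name} implied FF area fraction: {implied_ff_fraction(overhead, 0.21) * 100:.1f}%")
```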
As for performance, we generated simulation-based shmoo plots for the DCT circuit using ×2 FF cells and ×1 FF cells with the supply voltage ranging from 1.2 V to 0.275 V. The observed performance degradation in this experiment is less than 1% at all voltages.
Conclusion
In this paper, we investigated a design strategy for subthreshold circuits by simultaneously considering energy minimization and yield maximization.
Regarding yield maximization, minimum supply voltages for three circuits designed by using various transistors of different gate width are efficiently estimated by the proposed method without performing full circuit simulations. It is shown that V DD min is dominated by FFs rather than their combinational counterpart, and thus V DD min of a large circuit can be stochastically estimated from the distribution of V DD min of a very simple circuit, which can be analyzed easily. We also found that the distributions of V DD min of the simple circuit are all well approximated by log-normal distributions.
Regarding energy minimization, it is shown that two key voltages are important: the minimum energy consumption voltage and V DD min . The relative magnitude of these voltages determines the optimal operation voltage in terms of energy consumption. It is also demonstrated that transistor sizing in an FF achieves energy reduction by lowering V DD min without significantly increasing energy consumption. Furthermore, gate widths of transistors in FFs are altered as an additional optimization parameter using curves of V DD min versus energy.
In the experiments throughout this paper, we did not consider die-to-die process variation, i.e., we assumed that the mean of the threshold voltage distribution equals the typical threshold voltage. In [8], it is reported that die-to-die variation has a strong impact on V DD min , especially when the die falls into the FS or SF corner. The proposed design strategy is applicable to such process corners if we simply perform a simulation similar to that of Fig. 8 using the threshold voltage distribution of the process corner to be considered.
