Subthreshold circuit designs are popular for ultra-low power applications in which minimum energy consumption is the primary concern. However, because of their weak driving current, these circuits generally suffer from severe performance degradation. In this paper, we therefore analyze the performance of a near-threshold circuit (NTC), which retains the excellent energy efficiency of the subthreshold design while improving the performance to a certain extent. A modified row-based dual V_dd 4-operand carry save adder (CSA) design is reported using 45 nm technology. To assess the effectiveness of near-threshold operation, the 4-operand CSA design is compared with other design styles. Simulation results obtained at a frequency of 20 MHz show that the proposed CSA design consumes 3.009 × 10^−7 W of average power (P_avg), which is almost 90.9% lower than that of the conventional CSA design, whereas, in terms of the maximum delay at the output, the proposed CSA design provides a 44.37% improvement over the subthreshold CSA design.
Introduction
Subthreshold digital circuit design is a well-established technique for implementing highly energy-constrained, ultra-low power applications such as implanted sensors, pacemakers, and mobile peripheral processors [1, 2]. Its primary limitation, which restricts its use to low-performance systems, is the weak driving current. For subthreshold or near-threshold operation, the MOS transistor is given a gate-to-source voltage that is either below or close to the threshold voltage (V_th) of the device. Likewise, the supply voltage (V_dd) can be scaled below V_th or set close to V_th. This makes it possible to minimize power consumption, which leads to a longer battery lifetime [2]. However, this advantage in energy consumption comes at the cost of performance, mainly because the load capacitances of the circuit are charged and discharged (as the logic state changes) by the weak subthreshold leakage current [3]. It has been observed that a notable improvement in the performance of a CMOS circuit is possible if a small sacrifice in energy consumption is accepted [3]. This observation motivates the increasing use of near-threshold circuits (NTCs). More precisely, a circuit that operates with a supply voltage equal to or slightly greater than V_th is called an NTC [4].
Assigning dual supply voltages (dual V_dd) to a CMOS circuit can be very effective in reducing both the dynamic and the leakage power [5, 6]. The scheme provides the higher supply voltage (V_ddH) to the timing-critical logic gates, whereas the noncritical logic gates are driven by a lower supply voltage (V_ddL). With this dual V_dd technique, it is therefore possible to reduce the overall power consumption without degrading the performance of the circuit too much [2, 4]. However, using V_ddH to speed up the timing-critical gates and V_ddL to minimize the power of the noncritical gates normally requires additional level-shifters, which cause extra power consumption as well as area overhead [7]. In an NTC, the key advantage is that the values of V_ddH and V_ddL are very close to each other, and such a small difference between the two supply voltages can eliminate the need for voltage level-shifters [4]. Thus, by properly selecting the subset of logic gates that is assigned V_ddH, we can significantly improve the performance of the circuit at an affordable power cost [4].
Although dual V_dd assignment is attractive for NTCs, it may incur extra cost in the physical design implementation, in the form of routing overhead [4]. To reduce this routing overhead, we may adopt row-based dual V_dd assignment, in which the rows of the circuit are prioritized by their timing criticality: the rows lying on the critical path are driven by V_ddH, while the remaining rows are supplied with V_ddL [4]. In this work, to evaluate the effectiveness of row-based dual V_dd assignment for NTCs, the scheme is implemented on an example circuit, the 4-operand CSA described in [8]. The rest of the paper is organized as follows. Section 2 introduces several design issues for subthreshold circuits. Section 3 presents the row-based dual V_dd assignment for a 4-operand CSA, whereas Section 4 illustrates the near-threshold operation of the 4-operand CSA and its performance analysis. Section 5 concludes the work.
Subthreshold Circuit Design Issues

Modeling the Minimum Energy Point.
In subthreshold operation (V_dd < V_th), the current that flows through the channel of a transistor is mainly due to diffusion [9]. To estimate the minimum energy point of a subthreshold circuit, we can use the current model that serves as the basis for the entire analysis [9]. Assuming that the total drain current in the subthreshold regime is equal to the subthreshold current (I_sub), and taking n as the subthreshold slope factor (n = 1 + C_d/C_ox), V_T as the thermal voltage (V_T = kT/q), η as the linearized drain-induced barrier lowering (DIBL) coefficient, and S as the subthreshold slope, I_sub can be represented as [3]
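A form of this model that is consistent with the symbols defined above is sketched here. It is the expression commonly used in the subthreshold design literature and is offered as an assumption, not as a verbatim copy of the equation in [3]; the relation S = n V_T ln 10 is the standard link between the slope factor and the subthreshold slope.

```latex
\[
I_{sub} = I_0 \, 10^{\frac{V_{gs} - V_{th} + \eta V_{ds}}{S}}
          \left(1 - e^{-V_{ds}/V_T}\right),
\qquad S = n V_T \ln 10 \tag{1}
\]
```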
where I_0 denotes the drain current at V_gs (gate-to-source voltage) equal to V_th (threshold voltage), and the V_ds (drain-to-source voltage) dependence in the "quasi-saturation" region is modeled using η [9]. For a subthreshold circuit, the gate delay is expressed as follows [3]:
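A delay expression that matches the parameters named in the surrounding text is sketched below, assuming that the gate output is driven by its subthreshold on-current. The symbols t_d (gate delay), K (delay fitting parameter), and C_L (output load capacitance) are notation chosen for this sketch.

```latex
\[
t_d = \frac{K \, C_L \, V_{dd}}{I_{sub}} \tag{2}
\]
```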
where K denotes the delay fitting parameter and C_L gives the output load capacitance of the gate. For V_gs = V_ds = V_dd ≫ V_T, we can rewrite (2) as
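Under the same assumptions, setting V_gs = V_ds = V_dd in the standard subthreshold current expression makes the roll-off factor (1 − e^{−V_dd/V_T}) approach 1, which gives the sketch below. The notation t_d, K, C_L, η, and S is assumed here, with S = n V_T ln 10.

```latex
\[
t_d = \frac{K \, C_L \, V_{dd}}
           {I_0 \, 10^{\frac{(1 + \eta) V_{dd} - V_{th}}{S}}}
    = \frac{K \, C_L \, V_{dd}}
           {I_0 \, \exp\!\left(\frac{(1 + \eta) V_{dd} - V_{th}}{n V_T}\right)} \tag{3}
\]
```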
Thus, the propagation delay of the gate depends exponentially on V_dd as well as on V_th. Next, the total energy consumed per cycle by a single gate (assuming rail-to-rail swing, i.e., V_gs = V_dd for the "ON" current) can be expressed as [3]
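The two energy components named in the text that follows suggest the usual decomposition, written here as a sketch; the symbol E_total is notation assumed for this block.

```latex
\[
E_{total} = E_{dynamic} + E_{leak} \tag{4}
\]
```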
where E_dynamic = α_(0→1) · C_L · V_dd^2 and E_leak = I_leakage × V_dd × t_d, with t_d being the gate delay. Here I_leakage denotes the leakage current, whereas α_(0→1) gives the low-to-high switching activity at the output of the gate [3].
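The trade-off between the dynamic and leakage terms can be explored numerically. The following is a minimal sketch, assuming the standard exponential subthreshold current model and a delay of the form K·C_L·V_dd/I_on. Apart from the NMOS threshold of 0.469 V quoted later in the paper, every parameter value (I_0, n, η, K, C_L, the activity factor) is a hypothetical placeholder, and all function names are illustrative.

```python
import math

THERMAL_VOLTAGE = 0.0259  # V_T = kT/q near room temperature, in volts


def subthreshold_current(v_gs, v_ds, i0, v_th, n, eta):
    """Exponential subthreshold current with DIBL and small-V_ds roll-off."""
    exponent = (v_gs - v_th + eta * v_ds) / (n * THERMAL_VOLTAGE)
    rolloff = 1.0 - math.exp(-v_ds / THERMAL_VOLTAGE)
    return i0 * math.exp(exponent) * rolloff


def gate_delay(v_dd, k, c_load, i0, v_th, n, eta):
    """Delay t_d = K * C_L * V_dd / I_on with V_gs = V_ds = V_dd."""
    i_on = subthreshold_current(v_dd, v_dd, i0, v_th, n, eta)
    return k * c_load * v_dd / i_on


def energy_per_cycle(v_dd, alpha, k, c_load, i0, v_th, n, eta):
    """Total energy: alpha * C_L * V_dd^2 plus I_leakage * V_dd * t_d."""
    t_d = gate_delay(v_dd, k, c_load, i0, v_th, n, eta)
    # Leakage of the "OFF" transistor: V_gs = 0, V_ds = V_dd.
    i_leak = subthreshold_current(0.0, v_dd, i0, v_th, n, eta)
    e_dynamic = alpha * c_load * v_dd ** 2
    e_leak = i_leak * v_dd * t_d
    return e_dynamic + e_leak


def minimum_energy_point(v_dd_values, **params):
    """Return the (V_dd, energy) pair with the lowest energy in the sweep."""
    pairs = ((v, energy_per_cycle(v, **params)) for v in v_dd_values)
    return min(pairs, key=lambda pair: pair[1])


# Hypothetical placeholder parameters (only v_th is taken from the paper).
PARAMS = dict(alpha=0.1, k=20.0, c_load=1e-15, i0=1e-7,
              v_th=0.469, n=1.5, eta=0.1)
# Sweep 0.10 V to 0.45 V in 10 mV steps, staying below V_th.
SWEEP = [mv / 1000.0 for mv in range(100, 451, 10)]

v_opt, e_opt = minimum_energy_point(SWEEP, **PARAMS)
print(f"minimum energy {e_opt:.3e} J at V_dd = {v_opt:.2f} V")
```

With these placeholder values the leakage term dominates at very low V_dd (because the delay grows exponentially) and the dynamic term dominates at higher V_dd, so the sweep has an interior minimum. The location of that minimum depends entirely on the assumed parameters and should not be read as a result of the paper.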
Optimum Sizing of the Various Logic Gates
Subthreshold Voltage Transfer Characteristics (VTC) of the CMOS Inverter Circuit.
For the 45 nm technology node, the SPICE model used for simulation has a threshold voltage of 0.469 V for the NMOS and −0.418 V for the PMOS. Figure 1 shows the voltage transfer characteristic (VTC) curves of an inverter circuit in which the supply voltage is varied from 0 to 0.4 V (in steps of 0.1 V) to inspect the behavior of the circuit in the subthreshold region. It is observed that, when the ratio of the PMOS width (W_p) to the NMOS width (W_n) is around 4 : 1, there is a sharp transition at the output whenever the input crosses the V_dd/2 level.
Subthreshold XOR Gate Using Transmission Gate Logic.
Figure 2(a) shows the conventional transmission-gate-based 8-transistor XOR that works at ultra-low voltages [9]. The use of transmission gates in the design helps to balance the number of parallel devices operating at the minimum voltage [9].
Subthreshold Operation of a Two-Inverter Chain or a Buffer Circuit.
Here, a two-inverter chain (a buffer circuit) is simulated first with a single V_dd and then with a dual V_dd (where V_dd1 is 0.4 V and V_dd2 is 0.8 V). In the first case, with V_dd set to 0.4 V and a frequency of 200 MHz, we considered different W_p/W_n values (maintaining the above-mentioned ratio) for the transistors used in the buffer circuit. For a gate length (L) of 45 nm and W_p/W_n = 800 nm/200 nm, we found that the P_avg of the circuit is 3.528 × 10^−8 W and the maximum delay is 1.378 × 10^−10 s. With dual V_dd assignment in any CMOS circuit, the major problem occurs when a gate with a low output swing drives a high-V_dd gate. Whenever a high-voltage gate has to be driven by a low-voltage gate, a level converter (LC) is therefore normally required [3]; the LC shifts the voltage from the lower level to the higher one. However, as LCs do not implement any logic function, the use of a large number of LCs in a circuit ultimately results in area as well as energy overhead [7].
To mitigate this issue, the use of a second threshold voltage for the PMOS transistors in the higher voltage gates (those driven by the lower voltage gates) has been described in [7]. We follow a similar concept here (as shown in Figure 3), except that we raise the threshold of those PMOS transistors by increasing their gate lengths [10]. The overall performance of this buffer circuit with a dual V_dd is summarized in Table 1.
From Table 1, it can be seen that the best-case results are obtained when the gate length of the PMOS transistor in the higher voltage inverter is set to 90 nm.
Obtaining the V_dd,optimum for a Full Adder Circuit.
First, the full adder (FA) circuit of Figure 4 is driven by a single V_dd [11, 12], with inputs at a frequency of 200 MHz. This FA circuit, which has no buffer circuits at its sum and carry outputs, is hereafter called FA1 unless otherwise mentioned. To find the V_dd,optimum for FA1, we varied V_dd from 0.1 V to 0.8 V (in steps of 0.1 V) and measured the changes in the leakage energy and the maximum delay (Table 2).
It is observed that, for V_dd between 0.4 V and 0.6 V, the leakage energy (the product of the leakage current, V_dd, and the delay) is minimum. However, since the dynamic energy increases with V_dd, we have chosen V_dd = 0.4 V as the V_dd,optimum for the FA1 circuit.
Figure 4: Design of the full adder circuit, which is used in FA1 and FA2 blocks [11].
Next, the same full adder circuit of Figure 4 is provided with two buffer circuits at its sum and carry outputs. In each buffer, the first inverter is driven by the supply V_dd1, whereas the second is driven by the supply V_dd2. As mentioned in Section 2.2, the gate length of the PMOS transistor of the second inverter is L = 90 nm. This FA circuit, which is supplied with a dual V_dd, is hereafter referred to as FA2. Table 3 shows the performance of the FA2 circuit when V_dd1 is set to 0.4 V, the frequency is 200 MHz, and V_dd2 is varied from 0.4 V to 0.8 V.
From Table 3, it can be inferred that, in terms of the power-delay product, the best-case result is obtained when V_dd1 = 0.4 V and V_dd2 = 0.5 V.
Row-Based Dual V_dd Assignment for a 4-Operand CSA
Figure 5 shows a 4-operand CSA, in which four 4-bit binary numbers can be added with an initial carry-in [8]. The upper two rows of the circuit (as shown in Figure 5) form the 4-bit CSA, whereas the third row serves as the carry propagation adder (CPA) [8]. To fine-tune the performance, we can opt for near-threshold operation of this example circuit by selectively using V_ddH for the gates on the critical path and V_ddL for the remaining gates, which reduces the overall power consumption [4]. The dotted line in Figure 5 denotes the critical path of the circuit. From the viewpoint of physical design implementation, row-based dual V_dd assignment is adopted here. For that purpose, the entire circuit is partitioned into three clusters of rows. The first cluster is formed from the rows that are not time-critical (and are hence driven by V_ddL), whereas the third cluster is formed from the rows that are time-critical (and are hence driven by V_ddH). The row in the second cluster should be populated with gates that can interface between the row at V_ddH and the row at V_ddL.
With this notion, in our modified 4-operand CSA design (as illustrated in Figure 5), row1 is driven by V_ddL (= 0.4 V), row2 is driven by a dual supply of V_ddL (= 0.4 V) and V_ddH (= 0.5 V), and row3 is driven by V_ddH (= 0.5 V) only. As the basic building blocks of the CSA design, we have used FA1 blocks for both row1 and row3 and FA2 blocks for the intermediate row2.
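The three-row organization can be illustrated with a small functional model. This is a minimal sketch of a generic 4-operand carry-save adder (two carry-save rows followed by a ripple carry-propagate row), not a netlist of the circuit in Figure 5: the exact wiring, the point where the initial carry-in is injected, and all names such as `ROW_PLAN` and `four_operand_csa` are assumptions made for illustration. Only the supply voltages and the FA1/FA2 cell assignment per row are taken from the text.

```python
# Row-to-supply plan as described in the text (voltages in volts).
ROW_PLAN = {
    "row1": {"cell": "FA1", "supplies_v": (0.4,)},      # V_ddL only
    "row2": {"cell": "FA2", "supplies_v": (0.4, 0.5)},  # dual supply
    "row3": {"cell": "FA1", "supplies_v": (0.5,)},      # V_ddH only
}


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def carry_save_row(a, b, c):
    """A row of independent full adders, applied bitwise to whole words.

    Returns a sum word and a carry word (already shifted left by one),
    such that a + b + c == sum_word + carry_word.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (a & c)) << 1
    return sum_word, carry_word


def carry_propagate_row(a, b, width):
    """Ripple-carry adder built from full adders (the CPA row)."""
    result, carry = 0, 0
    for bit in range(width):
        s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
        result |= s << bit
    return result | (carry << width)


def four_operand_csa(w, x, y, z, carry_in=0, width=4):
    """Add four width-bit operands and a carry-in using two CSA rows and a CPA."""
    s1, c1 = carry_save_row(x, y, z)    # row1
    c1 |= carry_in                      # the shifted carry word has a free LSB
    s2, c2 = carry_save_row(s1, c1, w)  # row2
    return carry_propagate_row(s2, c2, width + 2)  # row3


print(four_operand_csa(15, 15, 15, 15, carry_in=1))  # prints 61
```

The carry-save rows have no carry chain, so their delay does not grow with the word length; only the final carry-propagate row ripples. This is consistent with that row being treated as time-critical and assigned V_ddH in the scheme described above.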
Near-Threshold Operation of the Proposed Scheme of CSA Design and Its Performance Analysis
When the conventional CSA design of [8] is simulated with a larger supply voltage (V_dd = 1 V) at a frequency of 20 MHz, the maximum delay is 2.071 × 10^−10 s. For subthreshold operation (V_dd = 0.4 V) of the same circuit, the power consumption falls drastically, but the maximum delay rises to 7.774 × 10^−9 s. As a result, the application of subthreshold design is mostly limited to low-performance systems. To retain the excellent energy efficiency of the subthreshold design while significantly boosting the speed of operation, we can explore near-threshold operation of the design [13], which is what we have done in this work. To evaluate the effectiveness of near-threshold operation of our modified CSA design, it is compared with the conventional CSA design as well as the subthreshold CSA design (as shown in Table 4).
At a frequency of 20 MHz, the proposed CSA design consumes 3.009 × 10^−7 W of P_avg, which is almost 90.9% lower than that of the conventional CSA design. In terms of the delay at the output, the proposed CSA design provides a 44.37% improvement in maximum delay over the subthreshold CSA design. Figure 6 illustrates the variation in power consumption for all three design styles at frequencies of 20 MHz, 50 MHz, and 200 MHz.
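The percentages quoted above can be related to the absolute figures given in the text with a short calculation. This is a minimal sketch: the two "implied" quantities are back-computed from the stated percentages and are not entries read from Table 4, and the function and variable names are illustrative.

```python
def percent_reduction(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference


# Figures stated in the text (20 MHz operation).
T_CONVENTIONAL = 2.071e-10  # maximum delay at V_dd = 1 V, in seconds
T_SUBTHRESHOLD = 7.774e-9   # maximum delay at V_dd = 0.4 V, in seconds
P_PROPOSED = 3.009e-7       # average power of the proposed design, in watts

# How much slower the subthreshold design is than the conventional one.
slowdown = T_SUBTHRESHOLD / T_CONVENTIONAL

# Values implied by the quoted percentages (back-computed, not from Table 4).
implied_t_proposed = T_SUBTHRESHOLD * (1.0 - 44.37 / 100.0)
implied_p_conventional = P_PROPOSED / (1.0 - 90.9 / 100.0)

print(f"subthreshold slowdown factor: {slowdown:.1f}")
print(f"implied proposed maximum delay: {implied_t_proposed:.3e} s")
print(f"implied conventional average power: {implied_p_conventional:.3e} W")
```

The slowdown factor of the subthreshold design relative to the conventional one follows directly from the two delays stated in the text, whereas the implied delay and power figures are only as accurate as the rounded percentages they are derived from.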
Following are the key points regarding the performance of the modified 4-operand CSA design presented in this work.
Advances in Electrical Engineering
(i) The first is the flexibility to choose any higher supply voltage (as required) for the gates on the critical path. Where the speed of the subthreshold circuit is important, it can be tuned by increasing V_ddH, even up to 0.8 V.
(ii) From Figure 6, it can be inferred that the proposed scheme works well not only at an operating frequency of 20 MHz but also at higher frequencies.
(iii) Row-based dual V_dd assignment has been incorporated in the proposed CSA design to facilitate the physical design implementation.
(iv) Where performance tuning is not required and P_avg is the only concern, we may simply use the subthreshold CSA design, for which the power consumption is minimum (Table 4).
(v) Lastly, the upper and lower limits of P_avg are obtained with V_dd = 1 V and V_dd = 0.4 V, respectively. At these two V_dd limits we correspondingly get the lowest and the highest maximum delay.
The proposed CSA design lies between these two limits, with the major advantage of a flexible choice of V_ddH (which can be used for performance tuning).
Conclusion
In this work we have focused mainly on the performance analysis of a row-based dual V_dd CSA design that operates in the near-threshold voltage regime. For that purpose, we used two supply voltages: V_ddH (= 0.5 V) and V_ddL (= 0.4 V). The entire circuit is partitioned into clusters of rows, and all the logic gates that reside in a particular cluster are driven by a single supply (either V_ddH or V_ddL). A comparison among the different design styles for a 4-operand CSA has also been presented. From the results obtained, we infer that near-threshold operation of the proposed CSA design can be very effective in reducing the overall energy consumption, like a subthreshold design. At the same time, it is useful for tuning the performance of the circuit so that the maximum delay at the output is reduced.
