Synchronous circuit designs consist of combinational logic stages for computation, together with registers and latches clocked by a globally synchronizing periodic signal, the clock. This clock signal is distributed throughout the circuit to synchronize the data computation mechanisms and to ensure correct timing of circuit operation. Although synchronous circuits are easy to implement, the global distribution of a high-speed clock to all parts of the design becomes challenging as design complexity increases [13,14].

Asynchronous circuits offer low dynamic power dissipation because they are active only during computation and otherwise remain in standby mode. The clock skew problem is avoided, as these circuits do not require a synchronizing signal (clock) to control their operation. In addition, a significant drop in static and dynamic power dissipation has been observed when adaptive voltage scaling is applied to these self-timed systems. Early logic completion sensing in asynchronous systems provides fast operation [15], and a further benefit is gained by using dynamic logic to implement the computational circuits.

Given current trends in VLSI design, asynchronous circuits provide the following advantages:

(i) The absence of global clock signal provides the benefit of higher throughput and low power consumption as compared to their synchronous counterparts [13].

(ii) The average speed of computation increases [14].

A body sensor network [IEEE 802.15] is a wireless communication network consisting of assistive devices that are of prime importance in medical applications. The delay-critical and power-hungry blocks in these assistive devices must be designed to consume less power, have low latency and occupy less area on chip. In this paper, we present a qualitative as well as a quantitative analysis of an asynchronous pipelined adder design with two recent computation completion sensing approaches, one based on Pseudo NMOS logic and the other based on the C-element. Compared with the C-element based approach, the Pseudo NMOS based completion sensing approach provides a maximum improvement of 76.92% in critical path delay, at a supply voltage of 1.2 V, and a maximum drop in power dissipation of 85.60%, at a supply voltage of 1.1 V. Even at a low voltage of 0.8 V, there is a significant improvement in speed and power, of 75.64% and 74.79% respectively. Since the adder is the most widely used component in present-day assistive devices, this analysis acts as a pointer for the application of asynchronous pipelined circuits with the efficient Pseudo NMOS based completion sensing approach in low-voltage, low-power rehabilitative devices.

1. Introduction

The application of Body Area Networks (BANs) [IEEE 802.15] in health care systems is a growing field of research. BANs consist of miniature devices such as sensors, transceivers, batteries and embedded processors [1–4]. SAR (specific absorption rate) constraints have to be kept in mind while designing these devices, hence the need for energy-efficient miniature devices [5–8]. The current trend towards device miniaturization has prompted researchers to re-evaluate VLSI design techniques. In the ASICs, DSPs and embedded processing units that are part of BANs, adders are the key design elements [9]. Hence, designing an adder for these applications that incurs minimal delay, operates at ultra-low power and occupies less area on chip is a challenge [10–12]. Fig. 1 shows a typical BAN employed for health care systems. The PDA collects the data sent by the sensor nodes and sends it over the internet for interpretation by monitoring stations, for maintaining a medical database and for other healthcare management services [6].

Digital electronic designs are broadly classified as asynchronous (clock-less) designs and synchronous (clocked) designs. Synchronous circuit designs consist of combinational logic stages and clocked storage elements (registers and latches).

Delay-insensitive circuits are a class of self-timed circuits that use the handshaking signals “request” and “acknowledgement” to synchronize computational logic completion with the data flow. These robust delay-insensitive circuits are also adaptable to variations in process metrics: voltage, temperature and pressure [16,17], which is a boon for assistive devices. We have incorporated the delay-insensitivity concept by using four-phase dual-rail data encoding for the implemented pipelined adder.

2. Pipelining

To achieve high performance, digital electronic systems use pipelining. Compared with non-pipelined systems, pipelining increases system throughput via parallel task execution [18]. Since healthcare applications are battery powered, a low-power, low-voltage design is needed that is on par with current nanometre technologies [5,6]. Two pipeline styles are described in the literature:

2.1. Synchronous pipelines

These pipelines place registers between the computational blocks of a complex design and use a global periodic clock signal for synchronization. Fig. 2 depicts the basic structure of a synchronous pipeline with two pipeline stages, where clk is the global synchronization signal, R1, R2 and R3 are the storage registers, and CB is the computational block.

2.2. Asynchronous pipelines

These non-clocked pipelines avoid the use of a clock signal. Hence, a data communication protocol must be employed to coordinate the computational blocks in the pipeline. Bidirectional communication is used, implemented by a handshaking protocol in which req and ack are the handshaking signals. Fig. 3 depicts an asynchronous pipeline design structure, where req is the initiating signal that starts the computation in the CB (computational block) and ack is the acknowledgement sent by the receiver on completion of the computation.
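
The req/ack exchange described above can be illustrated with a minimal sketch. It assumes the four-phase (return-to-zero) protocol that the adder in this paper uses; the `Channel` class and the event labels are hypothetical names introduced only for illustration and do not model any circuit delays.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical model of one req/ack handshake channel."""
    req: int = 0
    ack: int = 0
    trace: list = field(default_factory=list)

    def log(self, event):
        # Record the event together with the current (req, ack) levels.
        self.trace.append((event, self.req, self.ack))


def four_phase_transfer(ch):
    """Perform one data transfer using a four-phase handshake."""
    ch.req = 1
    ch.log("sender raises req (data valid)")
    ch.ack = 1
    ch.log("receiver raises ack (computation complete)")
    ch.req = 0
    ch.log("sender lowers req (return to zero)")
    ch.ack = 0
    ch.log("receiver lowers ack (ready for next item)")
    return ch.trace


channel = Channel()
for event, req, ack in four_phase_transfer(channel):
    print(f"req={req} ack={ack}  {event}")
```

Each transfer ends with both signals low again, which is why the channel can immediately accept the next data item without any global clock.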

The benefits of Asynchronous Pipeline Design are:

(i) Multiple data items processing [18].
(ii) By default, Underflow and Overflow conditions are controlled, leading to the benefit of automatic flow control.
(iii) Low dynamic power consumption (power consumption is mainly due to switching activity).

The usage of Dynamic logic in Asynchronous pipelines combines the high speed benefits of dynamic logic and low power benefits of asynchronous circuits together with high throughput advantage of pipelining concept leading to high performance, battery powered designs suited for assistive devices.

2.3. Dynamic logic based pipelines

The classical Dynamic pipeline design, PS0 Pipeline was proposed by Williams and Horowitz [19]. These pipelines make use of implicit latching function of dynamic circuits, hence registers are not needed in between computational stages. Moreover asynchronous design methodology results in simpler pipeline implementations [20].

Self-Timed dynamic pipelines provide the following advantages:

(i) Latch elimination (static logic based pipelines use latch).
(ii) Minimal on chip area overhead.
(iii) Decrease in Critical data-path delay.
(iv) Lower power consumption as compared to static logic based pipelines.

2.3.1. The classical dynamic pipeline – PS0 pipeline

It is a self-timed pipeline sans explicit latches and is based on dynamic logic, proposed by Williams and Horowitz [19]. The PS0 pipeline structure shown in Fig. 4 below consists of:

CB – Computational Block.
ack_nxt – acknowledgement signal going to the predecessor block.
ack_pre – acknowledgement signal received from the successor block.
CSC – Completion Sensing Circuit.
ack1, ack2 – Intermediate acknowledgement signals.

The working of PS0 Pipeline can be described as follows:

(i) Stage 1 computes.
(ii) Stage 2 computes.
(iii) Stage 3 computes.
(iv) CSC of stage 3 sends the acknowledgement signal ack1 indicating the computation completion and hence initiates the precharging mechanism for stage 2.
(v) Stage 2 gets precharged.
(vi) CSC of stage 2 detects that precharge has been completed and therefore sends a signal (ack2) to enable evaluation of stage 1.

An asynchronous pipelined domino logic based full adder has been implemented using the dynamic PS0 pipeline style described above. The four-phase dual-rail protocol has been used for data encoding. In this protocol, the handshake signal is combined with the dual-rail encoded data. Thus, randomly occurring variations in the critical delay of the data path can be dealt with effectively. Its delay-insensitive operation provides reliable sender-to-receiver communication, making it a highly robust protocol [15]. The PS0 pipeline avoids explicit C-elements by relying on the input that changes last; the C-element is therefore replaced by a wire, subject to various timing assumptions [21]. This potential advantage of the PS0 pipeline design has been used to generate high-performance topologies suited for BAN applications.
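
As a minimal sketch of the four-phase dual-rail encoding mentioned above, the following assumes the conventional code in which each bit is carried on a (true, false) rail pair, with (0, 0) serving as the spacer between successive data items. The function names are hypothetical, and the rail order simply follows the `_t`/`_f` naming used for the adder signals.

```python
SPACER = (0, 0)  # both rails low: no data (return-to-zero phase)


def encode(bit):
    """Encode one bit as a (true_rail, false_rail) pair."""
    return (1, 0) if bit else (0, 1)


def is_valid(rails):
    """A codeword is valid when exactly one rail is high."""
    return rails in ((1, 0), (0, 1))


def decode(rails):
    """Return the bit value, or None while the rails hold the spacer."""
    if rails == SPACER:
        return None
    if not is_valid(rails):
        raise ValueError(f"illegal dual-rail state: {rails}")
    return 1 if rails == (1, 0) else 0


def four_phase_stream(bits):
    """Yield the rail states for a bit sequence: data, spacer, data, ..."""
    for bit in bits:
        yield encode(bit)
        yield SPACER


print(list(four_phase_stream([1, 0])))
```

Because validity is visible on the data wires themselves, a receiver can detect completion without a separate timing reference, which is what the completion sensing circuits exploit.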

The CSC block has been replaced by two different completion sensing circuits [15,23]. A domino logic based Computational Block (CB) has been used to assess the behaviour of the pipeline in the presence of each completion sensing circuit. The performance of the pipelined adder has been analyzed with both CSCs, and a comparison of throughput, power dissipation, critical path delay and power delay product is presented.

3. Completion sensing circuits

3.1. CSC1

This completion sensing circuit is a C-element based 2-bit Completion detector circuit [15]. It uses two-input NOR gates, to which dual rail output combinations are fed as inputs and corresponding output is obtained. The outputs of NOR-gates act as 1-bit completion detection signals for individual dual rail outputs. They are then fed into a C-element to combine all the 1-bit completion detection signals into a single acknowledgement signal to be fed to next pipeline stage.

The equation of a two input C-element used here:

\[ Y' = i_1i_2 + i_1Y + i_2Y \]  

where \( i_1, i_2 \) are the inputs to C-element and \( Y \) is the output.

The working of CSC1 is depicted in Fig. 5(c).
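
The behaviour of CSC1 can be sketched at the logic level as follows. This is an illustrative model only: it assumes the two dual-rail outputs are the sum and carry pairs of the adder stage, and it assumes the detector output is high when both outputs hold the spacer and low when both are valid. The actual signal polarity in Fig. 5 may differ, and the class and function names are hypothetical.

```python
def c_element(i1, i2, y):
    """Next state of a two-input C-element: Y' = i1*i2 + i1*Y + i2*Y."""
    return (i1 & i2) | (i1 & y) | (i2 & y)


def nor2(a, b):
    """Two-input NOR: 1 only when both rails are low (spacer)."""
    return 1 - (a | b)


class CompletionDetector2Bit:
    """Hypothetical logic-level model of a 2-bit C-element based detector."""

    def __init__(self):
        self.y = 1  # assume the stage starts precharged (all rails low)

    def update(self, sum_rails, carry_rails):
        d_sum = nor2(*sum_rails)      # 1-bit completion signal for sum
        d_carry = nor2(*carry_rails)  # 1-bit completion signal for carry
        self.y = c_element(d_sum, d_carry, self.y)
        return self.y


det = CompletionDetector2Bit()
for sum_rails, carry_rails in [((0, 0), (0, 0)),   # precharged
                               ((1, 0), (0, 0)),   # only sum valid
                               ((1, 0), (0, 1)),   # both valid
                               ((0, 0), (0, 1)),   # only sum precharged
                               ((0, 0), (0, 0))]:  # both precharged
    print(sum_rails, carry_rails, "->", det.update(sum_rails, carry_rails))
```

The C-element holds its previous output while the two 1-bit signals disagree, so the combined signal changes only after both outputs have become valid or both have returned to the spacer.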

3.2. CSC2

This completion sensing method [23] uses ratioed pseudo-NMOS logic. Pseudo-NMOS circuits are motivated by the fact that PMOS transistors have low carrier mobility; in a standard CMOS design they must therefore be made wider than NMOS transistors to achieve comparable rise and fall times, which results in higher input capacitance. A pseudo-NMOS design replaces the PMOS pull-up network with a single weak PMOS load, thereby reducing the load capacitance and increasing overall performance [24]. For implementing critical wide NOR functions, such gates prove to be the best in terms of overall design metrics [25]. Hence, an efficient implementation of a wide NOR gate has been used to generate the triggering signal (acknowledgement) for subsequent stages of the pipeline, instead of a large fan-in NOR gate. As all the pull-down transistors are connected in parallel, this technique avoids the high fan-in problems of Pseudo-NMOS gates. The PMOS transistor acts as a load resistor and has been designed to be weak for proper functioning of the circuit. The advantages of this technique are:

1) Faster switching activity due to less capacitive loading on input signals.
2) Increase in on chip circuit density (due to less no. of transistors required for implementing the logic).
3) Reduced static power dissipation. Generic Pseudo-NMOS logic based designs [25,26] suffer from a major disadvantage, namely large static power dissipation. This has been mitigated by using the ack_pre signal as the enable input to one NMOS transistor in the pull-down network and to the weak PMOS transistor of the pull-up network, which prevents the weak PMOS transistor from being always ON and thereby reduces the static power dissipation.

4) Minimal Area Overhead leading to area efficient design.

Fig. 6(a) shows the general block diagram of CSC2. The MOS-level circuit schematic is shown in Fig. 6(b) and the layout design for this CSC is depicted in Fig. 6(c).

4. Implementation

The computational block (CB) is a 1-bit dual-rail domino full adder structure [25] that constitutes each stage of the dynamic pipeline. An N-bit adder can be implemented as required by cascading N such stages. The CB acts as the DUT (device under test) for our simulations. It has been designed using domino CMOS logic (improved dynamic logic with no erroneous states at the output) [26] and employs the four-phase dual-rail protocol for handshaking when combined with the CSCs. In battery-powered applications, circuits should be designed to incur minimum delay overhead and occupy less area on chip. Dynamic CMOS circuits thus provide an alternative to static (steady-state) circuits in terms of speed, area and system latency [26]. The CB implemented here uses the ack_pre (precharge) signal for internal synchronization and dynamic CMOS operation. Computation completion is detected using the acknowledgement signal from the event termination sensing topologies that work in combination with the CB, so handshaking is performed effectively. The inputs to the CB are the dual-rail data inputs \(a_t, b_t, \text{cin}_t, \text{cin}_f\) (carry inputs), which result in the dual-rail outputs \(\text{sum}_t, \text{sum}_f\) (sum) and \(\text{carry}_t, \text{carry}_f\) (carry outputs). The block diagram of a single pipeline stage is shown in Fig. 7(a) and the MOS-level implementation of the CB is depicted in Fig. 7(b).

Table 1 lists the characteristics of the CB at a temperature of 25 °C, where

- \(T_{\text{plh}}\) – the delay incurred when the output signal rises to 50% of vdd (low-to-high transition).
- \(T_{\text{phl}}\) – the delay incurred when the output signal falls to 50% of vdd (high-to-low transition).
- \(T_{\text{pd}}\) – average propagation delay.
- \(P_{\text{diss}}\) – power dissipation.

5. Simulation and results

The SPICE-level simulations were carried out using HSpice (© Avant! Corporation) in 90 nm TSMC technology, with supply voltages ranging from vdd = 0.8 V to 1.2 V and the temperature kept constant at 25 °C. We have considered a 3-stage pipelined adder for performance evaluation of the synchronous and asynchronous versions of the dual-rail adder. Fig. 8 presents a comparison of total power dissipation and critical path delay for the asynchronous pipelined adder and its synchronous counterpart.

As observed from this graph, power dissipation has increased and delay has decreased with increase in supply voltage. The asynchronous pipelined system performs better than the synchronous system as depicted by the decrease in power and delay values from the graph. Hence, it will provide high throughput as compared to its synchronous counterpart. Moreover, its energy efficient operation (due to low power dissipation) depicts its ability to be incorporated into assistive technology devices.

The asynchronous domino logic based pipelined adder has been implemented with two different completion detection approaches. We have considered a three-stage pipelined system for performance analysis. It has been simulated with the above-mentioned process parameters, and the performance of both completion sensing approaches has been compared by analyzing the power dissipation, worst-case delays, latency and throughput of the pipelined system. Table 2 shows the variation of power dissipation and critical path delay for both CSCs. The power delay product is also calculated at all values of vdd to ascertain the overall performance of the circuit designs.

5.1. Power dissipation

As observed from Table 2, the highest power dissipation occurs in both circuits at the maximum supply voltage, \(vdd = 1.2 \text{ V}\). CSC2 dissipates 74.79% less power than CSC1 at \(vdd = 0.8 \text{ V}\) and 84.35% less at \(vdd = 1.2 \text{ V}\). At \(vdd = 0.9 \text{ V}, 1.0 \text{ V}, 1.1 \text{ V}\), the reduction in power dissipation for CSC2 is 81.51%, 84.18% and 85.60% respectively. Thus, the maximum drop in power dissipation is observed at \(vdd = 1.1 \text{ V}\). CSC1 dissipates more power because of its more complex logic circuit, which involves a larger number of transistors. Moreover, human intervention should be avoided in such devices, so low-power operation is mandatory [9]. Thus, CSC2 should be preferred for designing battery-powered assistive devices. Fig. 9(a) depicts the variation of power dissipation with supply voltage for both designs.
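
The percentage figures quoted above follow directly from the power values listed in Table 2. A minimal sketch of the arithmetic is given below; the helper name `percent_reduction` is introduced here only for illustration, and small differences in the last digit relative to the text can arise from rounding.

```python
# Power dissipation in microwatts from Table 2: vdd -> (CSC1, CSC2).
POWER_UW = {
    0.8: (26.54, 6.69),
    0.9: (48.42, 8.95),
    1.0: (73.21, 11.58),
    1.1: (100.09, 14.41),
    1.2: (110.62, 17.31),
}


def percent_reduction(baseline, improved):
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


reductions = {vdd: percent_reduction(csc1, csc2)
              for vdd, (csc1, csc2) in POWER_UW.items()}

# Supply voltage at which CSC2 saves the most power relative to CSC1.
best_vdd = max(reductions, key=reductions.get)

for vdd, value in reductions.items():
    print(f"vdd = {vdd:.1f} V: CSC2 dissipates {value:.2f}% less power")
print(f"largest reduction at vdd = {best_vdd:.1f} V")
```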

Table 1

<table>
<thead>
<tr>
<th colspan="6">Characteristics of 1-bit dual-rail domino full adder structure (HSPICE 90 nm technology)</th>
</tr>
<tr>
<th>T</th>
<th>\(T_{\text{plh}}\)</th>
<th>\(T_{\text{phl}}\)</th>
<th>\(T_{\text{pd}}\)</th>
<th>\(P_{\text{diss}}\) (max., at vdd = 1.2 V)</th>
<th>\(P_{\text{diss}}\) (min., at vdd = 0.8 V)</th>
</tr>
</thead>
<tbody>
<tr>
<td>25 °C</td>
<td>0.05 ns</td>
<td>0.04 ns</td>
<td>0.03 ns</td>
<td>113.47 μW</td>
<td>29.3 μW</td>
</tr>
</tbody>
</table>

5.2. Critical data – path delay

The critical path delay analysis has been carried out by determining the critical path in all the pipelined designs. The average propagation delay was calculated for each design from \(T_{\text{plh}}\) and \(T_{\text{phl}}\) of the ack_pre and ack_nxt signals, for both pipelined dynamic logic based adders, with CSC1 and with CSC2. Additional loads were added at each node to calculate the delay of the circuit components. As observed from Table 2, a significant drop in delay was observed at vdd = 1.2 V for CSC2, whose delay was 76.92% lower than that of CSC1 at the same voltage. At vdd = 0.9 V, 1.0 V, 1.1 V the improvements in speed are 76.62%, 68.51% and 68% respectively. The worst-case delay of CSC2 (at vdd = 0.8 V) was 75.64% lower than that of CSC1 at the same supply voltage. Thus CSC2 should be preferred for high-speed operation. Fig. 9(a) depicts the variation of critical path delay with supply voltage for both designs.

5.3. Power-delay product

The prime design goal of high-performance systems is to achieve a low power delay product (PDP). To analyze the energy dissipated per switching event, the power-delay product has been calculated for all supply voltages from vdd = 0.8 V to 1.2 V. This product captures the tradeoff between the delay incurred and the power dissipated in a design. Dynamic circuits provide high-speed operation at the cost of higher power dissipation, but the use of asynchronous design methodology has proved beneficial in lowering the power delay product. As observed from Table 2, the asynchronous fine-grain dynamic pipelined adder structure dissipates the highest energy (per switching event) at vdd = 1.1 V when CSC1 is used for completion detection; at this voltage the PDP of CSC2 is 95.39% lower than that of CSC1. At vdd = 0.9 V, 1.0 V, 1.2 V, the reduction in PDP for CSC2 with respect to CSC1 is 95.67%, 95.02% and 96.38% respectively. Thus, the maximum drop in PDP for CSC2 over CSC1, 96.38%, occurs at vdd = 1.2 V. The PDP of CSC2 was 93.85% lower than that of CSC1 at the lowest considered voltage (vdd = 0.8 V), indicating its potential advantage in BAN applications. Fig. 9(b) depicts the variation of power delay product with supply voltage for both CSCs.
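
The power-delay product is simply the product of the delay and power columns of Table 2, and a delay in ns multiplied by a power in μW gives an energy in fJ. The following sketch recomputes the PDP and its reduction from those two columns; the dictionary layout and function names are illustrative assumptions. Because the tabulated delays and powers are rounded, the recomputed products can differ from the tabulated PDP in the second decimal place.

```python
# (delay in ns, power in microwatts) from Table 2, per supply voltage.
TABLE2 = {
    0.8: {"CSC1": (0.78, 26.54), "CSC2": (0.19, 6.69)},
    0.9: {"CSC1": (0.77, 48.42), "CSC2": (0.18, 8.95)},
    1.0: {"CSC1": (0.54, 73.21), "CSC2": (0.17, 11.58)},
    1.1: {"CSC1": (0.50, 100.09), "CSC2": (0.16, 14.41)},
    1.2: {"CSC1": (0.39, 110.62), "CSC2": (0.09, 17.31)},
}


def pdp_fj(delay_ns, power_uw):
    """Power-delay product in femtojoules (1 ns * 1 uW = 1 fJ)."""
    return delay_ns * power_uw


def pdp_reduction(vdd):
    """Percentage by which the PDP of CSC2 is lower than that of CSC1."""
    pdp1 = pdp_fj(*TABLE2[vdd]["CSC1"])
    pdp2 = pdp_fj(*TABLE2[vdd]["CSC2"])
    return 100.0 * (pdp1 - pdp2) / pdp1


for vdd in TABLE2:
    pdp1 = pdp_fj(*TABLE2[vdd]["CSC1"])
    pdp2 = pdp_fj(*TABLE2[vdd]["CSC2"])
    print(f"vdd = {vdd:.1f} V: PDP CSC1 = {pdp1:.2f} fJ, "
          f"CSC2 = {pdp2:.2f} fJ, reduction = {pdp_reduction(vdd):.2f}%")
```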

5.4. Throughput and latency

Throughput and per-stage latency are critical design parameters of prime concern for designers because they depict whether a circuit is suited for certain applications or not. For evaluating the throughput and latency we have taken different values of vdd spanning from 0.8 V to 1.2 V at T = 25°C. The variation of throughput with variation in supply voltage and the variation of per stage forward latency with increasing supply voltages has been depicted graphically. By replacing CSC block with above implemented completion sensing circuits, throughput has been compared for both the designs to ascertain their performance. The PS0 pipeline structure has been considered for implementing the adder designs with both completion sensing approaches. The parameter required to determine the throughput is the cycle time which is the time required for one computation cycle of a pipeline.

The cycle time for a PS0 pipeline is given by [18,19]:

\[ T_{\text{cycle}} = 3T_{CB} + T_{pre} + 2T_{CSC} \qquad (2) \]

The forward latency per stage is given by $L = T_{CB}$ [18,19], which is the same for both implemented adders.

In Eq. (2), $T_{CSC}$ is the time required for completion detection by the CSC and $T_{pre}$ is the precharge time.
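
To make the relationship between cycle time and throughput concrete, the sketch below evaluates the PS0 cycle time expression $3T_{CB} + T_{pre} + 2T_{CSC}$ and takes throughput as its reciprocal. The timing values passed in the example are hypothetical placeholders chosen only to exercise the formula; they are not measured results from this work.

```python
def ps0_cycle_time(t_cb, t_pre, t_csc):
    """PS0 cycle time: 3*T_CB + T_pre + 2*T_CSC (all in the same unit)."""
    return 3 * t_cb + t_pre + 2 * t_csc


def throughput_gsps(cycle_time_ns):
    """Throughput in giga-samples per second for a cycle time in ns."""
    return 1.0 / cycle_time_ns


def forward_latency(t_cb):
    """Per-stage forward latency of a PS0 stage, L = T_CB."""
    return t_cb


# Hypothetical timing values in ns (placeholders, not simulation data).
t_cb, t_pre, t_csc = 0.03, 0.05, 0.10

cycle = ps0_cycle_time(t_cb, t_pre, t_csc)
print(f"cycle time  = {cycle:.3f} ns")
print(f"throughput  = {throughput_gsps(cycle):.2f} Gsps")
print(f"fwd latency = {forward_latency(t_cb):.3f} ns")
```

Since $T_{CSC}$ enters the cycle time twice, a faster completion sensing circuit shortens the cycle directly, while the forward latency $T_{CB}$ is unaffected by the choice of CSC.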

The Fig. 10(a) depicts the latency variation with supply voltage and Fig. 10(b) depicts throughput variation with supply voltage for both schemes.

Observed values of latency from the Fig. 10(a) and throughput from Fig. 10(b) lead to the following facts:

Table 2

<table>
<thead>
<tr>
<th>VDD (volts)</th>
<th>CSC</th>
<th>Delay (ns)</th>
<th>Power (μW)</th>
<th>Power delay product (fJ)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.8</td>
<td>CSC1</td>
<td>0.78</td>
<td>26.54</td>
<td>20.77</td>
</tr>
<tr>
<td></td>
<td>CSC2</td>
<td>0.19</td>
<td>6.69</td>
<td>1.27</td>
</tr>
<tr>
<td>0.9</td>
<td>CSC1</td>
<td>0.77</td>
<td>48.42</td>
<td>37.28</td>
</tr>
<tr>
<td></td>
<td>CSC2</td>
<td>0.18</td>
<td>8.95</td>
<td>1.61</td>
</tr>
<tr>
<td>1.0</td>
<td>CSC1</td>
<td>0.54</td>
<td>73.21</td>
<td>39.53</td>
</tr>
<tr>
<td></td>
<td>CSC2</td>
<td>0.17</td>
<td>11.58</td>
<td>1.96</td>
</tr>
<tr>
<td>1.1</td>
<td>CSC1</td>
<td>0.50</td>
<td>100.09</td>
<td>50.04</td>
</tr>
<tr>
<td></td>
<td>CSC2</td>
<td>0.16</td>
<td>14.41</td>
<td>2.30</td>
</tr>
<tr>
<td>1.2</td>
<td>CSC1</td>
<td>0.39</td>
<td>110.62</td>
<td>43.14</td>
</tr>
<tr>
<td></td>
<td>CSC2</td>
<td>0.09</td>
<td>17.31</td>
<td>1.55</td>
</tr>
</tbody>
</table>

- Forward latency depends on the supply voltage; it decreases as the supply voltage increases.
- The lowest per-stage forward latency, 0.03 ns, is observed at vdd = 1.2 V.
- CSC2 gave the best throughput results, indicating the high performance of this completion sensing circuit in dynamic pipeline design. The maximum throughput achievable using CSC2 was 3.31 Gsps, observed at vdd = 1.2 V, which was 198.67% higher than that using CSC1 at the same voltage, but at the cost of high power dissipation.
- CSC2 gave 178.59%, 134.39% and 142.5% improvements in throughput at supply voltages vdd = 0.9 V, 1.0 V, 1.1 V respectively. Even at a low voltage of 0.8 V, the pipelined adder with CSC2 gave a throughput of 1.37 Gsps, which was 162% higher than the throughput obtained with CSC1 as the completion detection approach. Therefore a significant improvement in throughput is achievable even at low voltages, making this design suitable for assistive devices.

5.5. Circuit complexity

In current deep sub-micron CMOS technologies, where large numbers of transistors are integrated on a single chip, circuit complexity is a major design metric. It is measured as the total number of transistors required to implement a logic function. For the pipelined circuit designed above, the number of transistors per stage of the dynamic pipeline has been counted; this number is low because dynamic CMOS logic has been used to implement the computational block. As the graph shows, the circuit complexity (in terms of number of transistors) is highest for CSC1.

In terms of transistor count, CSC2 outshines CSC1 depicting the inherent advantage of Pseudo NMOS based design of reducing the transistor count. The Fig. 11(a) depicts the transistor count for both schemes.

5.6. Layout area

The VLSI circuit layout has been designed in accordance with standard design rules. A two-metal-layer layout has been designed for each of the completion sensing circuits, and DRC (Design Rule Check) and LVS (Layout vs. Schematic check) were performed to verify that the circuits at the layout level match those at the schematic level. Table 3 shows the layout area occupied by a standard CSC cell design.

The area requirement is higher for CSC1 because of its more complex circuitry, which results in a large area overhead. Hence, in terms of area, the Pseudo NMOS based CSC2 outperforms the C-element based CSC1. Fig. 11(b) depicts the layout area comparison for CSC1 and CSC2.

6. Conclusion

The BANs operate with stringent timing requirements. Hence, real-time operation of BANs requires low latency components. The simulation results depict that the asynchronous pipelined adder design with CSC2 (Pseudo NMOS based completion sensing approach) achieves a 75.64% improvement in operating speed and 162% improvement in throughput at a low voltage of 0.8 V. Moreover, energy efficiency, which is depicted by power delay product, has improved by 93.85% at this voltage level. Thus, CSC2 is able to achieve high performance with low power consumption and lower silicon area requirement as compared to CSC1, thereby depicting its ability to be incorporated into battery powered assistive devices for BAN health care systems.

References


Table 3

<table>
<thead>
<tr>
<th>Implementation style</th>
<th>Layout area (Mμm²)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CSC1</td>
<td>0.0115</td>
</tr>
<tr>
<td>CSC2</td>
<td>0.0095</td>
</tr>
</tbody>
</table>

Fig. 10. (a) Latency vs. vdd. (b) Throughput vs. vdd.

Fig. 11. (a) No. of transistors per stage. (b) Layout area comparison.


