In this technical note, we compare the design metrics of various quasi-delay-insensitive (QDI) asynchronous adders corresponding to diverse architectures. QDI adders are robust, and the objective of this note is to identify those QDI adders which are suitable for low power/energy and low area. This information could be valuable in a resource-constrained, low power VLSI design scenario. Non-QDI adders are excluded from the comparison since they are not robust, although they may have optimized design metrics. All the QDI adders were realized using a 32/28nm CMOS process.
Introduction
The 2017 edition of the International Roadmap for Devices and Systems [1] suggests that asynchronous design could be a potential solution to address the increasing power/energy consumption of a digital circuit or system. Substantiating this, in [2], a 128-point, 16-bit, radix-8 fast Fourier transform (FFT) processor was implemented in the robust QDI asynchronous design style and compared with a conventional synchronous FFT processor implementation, with both realized using a 65nm CMOS process. It was noted that the QDI FFT processor is 34 times more energy-efficient than its synchronous equivalent. The QDI design style is a promising alternative to the synchronous design style, and different types of QDI implementations exist. [...] 4-phase return-to-zero (RTZ) and 4-phase return-to-one (RTO) handshaking, and suggests which QDI adders are preferable for low power and less area. Section 5 finally concludes this note.
Nomenclature
• CLA - Carry Lookahead Adder
• BCLA - Block CLA
• BCLARC - BCLA with Redundant Carry
• BCLG - Block Carry Lookahead Generator
• BCLGRC - BCLG with Redundant Carry
• CCLA - Conventional CLA
• CSLA - Carry Select Adder
• CT - Cycle Time
• PCTP - Power-Cycle Time Product
• RCA - Ripple Carry Adder
• SBFA - Single-Bit Full Adder (which is the conventional full adder)
• DBFA - Dual-Bit Full Adder (i.e., an integration of two SBFAs as one 2-bit adder)
• RCA-SBFA - RCA constructed using SBFAs
• RCA-DBFA - RCA constructed using DBFAs
• Hybrid RCA - RCA constructed using DBFAs and SBFAs
• Hybrid BCLA-RCA - Constructed using a mix of BCLA and RCA-SBFA
• Hybrid BCLARC-RCA - Constructed using a mix of BCLARC and RCA-SBFA
• QDI - Quasi-Delay-Insensitive
• RTZ - Return-To-Zero
• RTO - Return-To-One
QDI Circuits - Background
The design fundamentals of QDI circuits are discussed here to provide a background.
Data Encoding, Handshaking and Timing Parameters
The general schematic of a QDI circuit stage encompassing delay-insensitive data encoding and 4-phase handshaking is shown in Figure 1a, based on the transmitter-receiver analogy. The corresponding technical schematic is shown in Figure 1b.
In Figure 1b, the current stage and next stage registers are analogous to the transmitter and the receiver shown in Figure 1a, and a QDI circuit is sandwiched between the current stage and the next stage register banks. Each register bank comprises a series of registers, with one register allotted to each rail of a dual-rail encoded data input. Each register is a 2-input Muller C-element [14]. The C-element outputs 1 if all its inputs are 1, and outputs 0 if all its inputs are 0. If the inputs to a C-element are not identical, the C-element retains its existing steady state. The circles marked 'C' represent the C-elements in the figures.
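The hold behaviour of the C-element described above can be sketched with a small behavioural model. This is only an illustrative Python sketch and not a gate-level description; the class name `CElement` and the assumed reset state of 0 are hypothetical and do not come from this note.

```python
class CElement:
    """Behavioural model of a Muller C-element (hypothetical helper)."""

    def __init__(self, initial=0):
        # Assumed reset value; a real circuit would define its own reset state.
        self.state = initial

    def update(self, *inputs):
        """Apply new input values and return the resulting output."""
        if not inputs:
            raise ValueError("a C-element needs at least one input")
        if all(v == 1 for v in inputs):
            self.state = 1          # all inputs are 1: output becomes 1
        elif all(v == 0 for v in inputs):
            self.state = 0          # all inputs are 0: output becomes 0
        # Otherwise the inputs differ and the previous output is retained.
        return self.state


c = CElement()
outputs = [c.update(1, 0), c.update(1, 1), c.update(0, 1), c.update(0, 0)]
```

In this sketch the output follows the inputs only when they agree, which is the property the registers and completion detectors rely on.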
In Figure 1, (X1, X0), (Y1, Y0) and (Z1, Z0) represent the dual-rail encoded primary inputs of the corresponding single-rail inputs X, Y and Z. According to delay-insensitive dual-rail data encoding and the 4-phase RTZ handshaking [9], an input W is encoded as (W1, W0) where W = 1 is represented by W1 = 1 and W0 = 0, and W = 0 is represented by W0 = 1 and W1 = 0. Both these assignments are called data. The assignment W1 = W0 = 0 is called the spacer, and the assignment W1 = W0 = 1 is deemed illegal since the coding scheme should be complete [15] and unordered [16] to maintain the delay-insensitivity.
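As a minimal sketch of the RTZ dual-rail code just described, the following Python functions encode a single-rail bit as (W1, W0) and decode a rail pair back. The function names `rtz_encode` and `rtz_decode` are hypothetical and chosen only for illustration.

```python
RTZ_SPACER = (0, 0)   # W1 = W0 = 0
RTZ_ILLEGAL = (1, 1)  # W1 = W0 = 1 is not a valid codeword


def rtz_encode(bit):
    """Return (W1, W0) for a single-rail bit under RTZ dual-rail encoding."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit == 1 else (0, 1)


def rtz_decode(rails):
    """Return the bit for a data codeword, or None for the spacer."""
    rails = tuple(rails)
    if rails == RTZ_ILLEGAL:
        raise ValueError("illegal codeword (1, 1)")
    if rails == RTZ_SPACER:
        return None
    if rails not in ((1, 0), (0, 1)):
        raise ValueError("rails must be a pair of 0/1 values")
    return rails[0]  # W1 carries the value: (1, 0) -> 1, (0, 1) -> 0
```

For example, `rtz_decode(rtz_encode(1))` returns 1, while `rtz_decode((0, 0))` returns `None` to signal the spacer.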
The application of input data to a QDI circuit which adheres to the 4-phase RTZ handshaking follows the sequence of data-spacer-data-spacer, and so forth. The application of data is followed by the application of the spacer, which implies that there is an interim RTZ phase between the successive applications of input data. The interim RTZ phase ensures robust data communication (handshaking) between the transmitter and the receiver. The RTZ handshake protocol is specified by the following four steps:
• First, the dual-rail data bus specified by (X1, X0), (Y1, Y0) and (Z1, Z0) assumes the spacer, and therefore the acknowledgment input (ACKIN) is equal to binary 1. After the transmitter transmits data, rising signal transitions (i.e., binary 0 to 1) occur on one rail of each dual-rail pair across the entire dual-rail data bus
• Second, the receiver would receive the data sent and drive the acknowledgment output (ACKOUT) to 1. ACKIN is the Boolean complement of ACKOUT and vice-versa
• Third, the transmitter waits for ACKIN to become 0 and would subsequently reset the dual-rail data bus, i.e., the dual-rail data bus assumes the spacer again
• Fourth, after an unbounded (but finite and positive) time duration, the receiver would drive ACKOUT to 0 and then ACKIN would assume 1. With this, a single data transaction is said to be completed and the QDI circuit is permitted to start the next data transaction
According to dual-rail data encoding and the 4-phase RTO handshaking [17], an input V is encoded as (V1, V0), where V = 1 is represented by V1 = 0 and V0 = 1, and V = 0 is represented by V0 = 0 and V1 = 1. Both these assignments are called data. The assignment V1 = V0 = 1 is called the spacer, and the assignment V1 = V0 = 0 is deemed illegal to maintain the delay-insensitivity.
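The RTO codewords above can be placed side by side with the RTZ codewords given earlier. The Python sketch below tabulates both codes as (rail1, rail0) pairs and checks that, for the codewords listed in this note, each RTO codeword is the bitwise complement of its RTZ counterpart. This is only an observation about the listed codewords; the names `RTZ_CODE`, `RTO_CODE`, `invert` and `rto_encode` are hypothetical.

```python
# Codewords as stated in the text, written as (rail1, rail0).
RTZ_CODE = {"1": (1, 0), "0": (0, 1), "spacer": (0, 0), "illegal": (1, 1)}
RTO_CODE = {"1": (0, 1), "0": (1, 0), "spacer": (1, 1), "illegal": (0, 0)}


def invert(rails):
    """Complement every rail of a codeword."""
    return tuple(1 - r for r in rails)


def rto_encode(bit):
    """Return (V1, V0) for a single-rail bit under RTO dual-rail encoding."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return RTO_CODE[str(bit)]


# True when every listed RTO codeword is the complement of the RTZ one.
codes_are_complementary = all(
    invert(RTZ_CODE[name]) == RTO_CODE[name] for name in RTZ_CODE
)
```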
The application of input data to a QDI circuit conforming to the 4-phase RTO handshaking follows the sequence of spacer-data-spacer-data, and so forth. It may be noted that there is an interim RTO phase between the successive applications of input data. The interim RTO phase ensures a proper and robust data communication between the transmitter and the receiver. The RTO handshaking process is specified by the following four steps:
• First, ACKIN is equal to binary 1. After the transmitter transmits the spacer, rising signal transitions (i.e., binary 0 to 1) occur on all the rails of the entire dual-rail data bus
• Second, the receiver would receive the spacer sent and drive ACKOUT to 1
• Third, the transmitter waits for ACKIN to become 0 and would then transmit the data through the dual-rail data bus
• Fourth, after an unbounded (but finite and positive) time duration, the receiver would drive ACKOUT to 0 and subsequently ACKIN would assume 1. With this, a single data transaction is said to be completed and the QDI circuit is permitted to start the next data transaction
In a QDI circuit, the time taken to process the data in the datapath, highlighted by the red dashed line in Figure 1b, is called the forward latency, and the time taken to process the spacer is called the reverse latency. Since there is an intermediate RTZ or RTO phase between the application of two input data sequences, the cycle time (CT) is the sum of the forward and reverse latencies.
The CT of a QDI circuit is the equivalent of the clock period of a synchronous circuit. The CT governs the speed at which new data can be input to a QDI circuit.
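The two four-step protocols and the definition of the CT can be tied together in a short sketch. The Python code below lists the bus word, ACKOUT and ACKIN after each of the four phases of one transaction, and computes the CT as the sum of the forward and reverse latencies. The function names and the tuple layout of the trace are assumptions made for illustration, and the trace models only the ordering of events, not their timing.

```python
def four_phase_trace(bits, protocol="RTZ"):
    """List (event, bus, ACKOUT, ACKIN) for one transaction of `bits`."""
    if protocol == "RTZ":
        data = tuple((1, 0) if b else (0, 1) for b in bits)
        spacer = ((0, 0),) * len(bits)
        first, second = ("data", data), ("spacer", spacer)
    elif protocol == "RTO":
        data = tuple((0, 1) if b else (1, 0) for b in bits)
        spacer = ((1, 1),) * len(bits)
        first, second = ("spacer", spacer), ("data", data)
    else:
        raise ValueError("protocol must be 'RTZ' or 'RTO'")
    (name1, word1), (name2, word2) = first, second
    # ACKIN is always the complement of ACKOUT.
    return [
        (name1 + " sent", word1, 0, 1),
        (name1 + " acknowledged", word1, 1, 0),
        (name2 + " sent", word2, 1, 0),
        (name2 + " acknowledged", word2, 0, 1),
    ]


def cycle_time(forward_latency, reverse_latency):
    """CT is the sum of the forward and reverse latencies."""
    return forward_latency + reverse_latency


rtz = four_phase_trace([1, 0], "RTZ")
rto = four_phase_trace([1, 0], "RTO")
```

Under RTZ handshaking ACKOUT rises after the data is received, whereas under RTO handshaking it rises after the spacer is received, which matches the step lists above.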
The gate-level details of example completion detectors corresponding to RTZ and RTO handshaking are shown at the bottom of Figure 1b, within the dotted green boxes. The completion detector indicates, i.e., acknowledges, the receipt of all the primary inputs given to a QDI circuit stage. In the case of 4-phase RTZ handshaking, ACKOUT is produced by using a 2-input OR gate to combine the respective dual rails of each encoded primary input and then synchronizing the outputs of all the 2-input OR gates using a C-element or a tree of C-elements. In the case of 4-phase RTO handshaking, ACKOUT is produced by using a 2-input AND gate to combine the respective dual rails of each encoded primary input and subsequently synchronizing the outputs of all the 2-input AND gates using a C-element or a tree of C-elements.
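The completion detectors described above can be sketched behaviourally as follows. This Python sketch makes one simplifying assumption: the tree of C-elements is modelled as a single multi-input C-element, so the internal states of a real tree are not represented. The function names `c_element` and `completion_detect` are hypothetical.

```python
def c_element(inputs, previous):
    """Multi-input C-element: follow the inputs when they agree, else hold."""
    if all(v == 1 for v in inputs):
        return 1
    if all(v == 0 for v in inputs):
        return 0
    return previous


def completion_detect(bus, previous_ack, protocol="RTZ"):
    """Return ACKOUT for a bus given as a list of (rail1, rail0) pairs."""
    if protocol == "RTZ":
        per_input = [r1 | r0 for r1, r0 in bus]   # 2-input OR per input
    elif protocol == "RTO":
        per_input = [r1 & r0 for r1, r0 in bus]   # 2-input AND per input
    else:
        raise ValueError("protocol must be 'RTZ' or 'RTO'")
    return c_element(per_input, previous_ack)
```

For instance, under RTZ handshaking `completion_detect([(1, 0), (0, 0)], 0)` stays at 0 because only one input has arrived, and the output becomes 1 only once every input carries data.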
QDI Circuits
QDI circuits are robust and are classified into three types: strong-indication [18, 19], weak-indication [18, 20], and early output [21] circuits. The input-output timing relations of [...] Strong-indication circuits wait to receive all the primary inputs (data and spacer) and, after receiving them, process them to produce the required primary outputs (data and spacer respectively). On the other hand, weak-indication circuits can produce all but one of the primary outputs after receiving a subset of the primary inputs. Nevertheless, only after receiving the last primary input would they produce the last primary output. Weak-indication may be enabled locally or globally, and it has been shown in [22, 23] [...]
A connection of strong-indication sub-circuits may not result in a strong-indication circuit; rather, a weak-indication circuit may result. For example, if two strong-indication full adders are connected, it could result in a weak-indication 2-bit RCA. This is because if all the inputs to one of the full adders are provided, the corresponding sum and carry output bits of that full adder could be produced regardless of the non-arrival of inputs to the other full adder in the RCA. However, only after all the inputs to the other full adder are provided would its corresponding sum and carry output bits be produced. This scenario is characteristic of weak-indication.
When comparing the strong- and weak-indication circuit types, the latter is preferable [24, 25] because of the strict timing restrictions inherent in the former. Especially for implementing arithmetic functions, the weak-indication type is preferable to the strong-indication type for the following reasons: i) strong-indication arithmetic circuits tend to encounter worst-case forward and reverse latencies for the application of data and spacer, and therefore the CT of strong-indication arithmetic circuits is always the maximum (i.e., worst-case timing); ii) weak-indication arithmetic circuits may encounter data-dependent forward and reverse latencies, or a data-dependent forward latency and a constant reverse latency, and so the CTs of weak-indication arithmetic circuits are usually less than those of strong-indication arithmetic circuits.
An early output circuit is, however, more relaxed compared to its strong- and weak-indication counterparts. After receiving a subset of the primary inputs (data or spacer), an early output circuit can produce all the primary outputs (data or spacer respectively). This implies that the late arriving primary inputs may not be acknowledged by the circuit. However, this does not cause any concern because isochronic fork assumptions are imposed on all the primary inputs, and all the primary inputs are provided to the completion detector that precedes the early output circuit, as seen in Figure 1b. Hence, the acknowledgment of the late arriving primary inputs by the completion detector also implies the receipt of those primary inputs by the QDI circuit. Thus, the problem of wire orphans, i.e., unacknowledged signal transitions on wires due to late arriving inputs, is overcome through the assumption of isochronic forks which is imposed on all the primary inputs.
Either the data may be produced early, or the spacer may be produced early in an early output circuit. Accordingly, an early output circuit is categorized as early set or early reset kind.
The early set and early reset behaviours of early output circuits are highlighted by the dotted green ovals in Figures 2a and 2b. An early output RCA is preferable to a strong-indication and a weak-indication RCA for achieving improved optimizations in speed and power/energy. In general, an early output circuit can achieve enhanced optimizations in the design metrics compared to the strong- and weak-indication counterparts.
In a QDI circuit, the logic decomposition should be performed safely [26, 27] [...] The signal transitions have to occur monotonically throughout an entire QDI circuit, from the first logic level, which receives the primary inputs, up to the last logic level, which produces the primary outputs [32]. The signal transitions should be seen as either rising or falling throughout an entire QDI circuit. In general, the signal transitions will be rising (i.e., binary 0 to 1) for the application of data and falling (i.e., binary 1 to 0) for the application of spacer in a QDI circuit that corresponds to RTZ handshaking. On the other hand, the signal transitions will be rising for the application of spacer and falling for the application of data in a QDI circuit that corresponds to RTO handshaking.
For monotonicity of signal transitions, the monotonic cover constraint [9] should be incorporated into a QDI logic description. For example, if a QDI logic function is expressed in the sum-of-products form, only one product term should be activated for the application of an input data, i.e., the product terms comprising the sum-of-products expression of a QDI logic function should be mutually orthogonal (also called disjoint), and the logical conjunction of any two product terms in a QDI logic function should yield zero. Thus, a QDI logic function is ideally expressed in the disjoint sum-of-products form [33] [34] [35], which consists of mutually disjoint product terms to satisfy the monotonic cover constraint. An example illustration of the monotonic cover constraint is given in Section 2.2 of [36], and an interested reader may refer to the same. Embedding the monotonic cover constraint and performing safe QDI logic decomposition are vital to the correct implementation of a QDI circuit.
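The disjointness requirement can be checked mechanically. The Python sketch below is a simplified, single-rail illustration (an actual QDI function would be written over dual-rail literals): it tests whether at most one product term is activated for every input assignment, using the carry output of a full adder as a hypothetical example. The helper names and the two example expressions are assumptions and are not taken from [36].

```python
from itertools import product


def term_active(term, assignment):
    """A product term is a dict mapping variable name -> required value."""
    return all(assignment[var] == val for var, val in term.items())


def is_disjoint_sop(terms, variables):
    """True if no input assignment activates more than one product term."""
    for values in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if sum(term_active(t, assignment) for t in terms) > 1:
            return False
    return True


def same_function(terms_a, terms_b, variables):
    """True if both sum-of-products expressions compute the same function."""
    for values in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        a = any(term_active(t, assignment) for t in terms_a)
        b = any(term_active(t, assignment) for t in terms_b)
        if a != b:
            return False
    return True


VARS = ("X", "Y", "Z")
# Carry = XY + XZ + YZ: the terms overlap, e.g. at X = Y = Z = 1.
carry_sop = [{"X": 1, "Y": 1}, {"X": 1, "Z": 1}, {"Y": 1, "Z": 1}]
# Carry = XY + XY'Z + X'YZ: the same function with mutually disjoint terms.
carry_dsop = [{"X": 1, "Y": 1},
              {"X": 1, "Y": 0, "Z": 1},
              {"X": 0, "Y": 1, "Z": 1}]
```

Here `is_disjoint_sop(carry_sop, VARS)` is false while `is_disjoint_sop(carry_dsop, VARS)` is true, although both expressions describe the same carry function.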
Incorporating the monotonic cover constraint in a QDI logic function would cause the activation of just one signal propagation path from a primary input to a primary output for the application of input data. This is useful to facilitate the proper acknowledgment of signal transitions throughout an entire QDI circuit, thus avoiding the likelihood of any gate orphan occurrence(s). Gate orphans are troublesome unlike wire orphans as they may affect the robustness of a QDI circuit and if they are imminent, restricting them from affecting the circuit robustness may require incorporating additional timing assumptions which are likely to be sophisticated and may also be practically difficult to realize [37] .
Design Metrics of QDI Adders
Several 32-bit QDI adders, which correspond to generic architectures such as RCA, CSLA, CCLA and BCLA, were physically realized using a 32/28nm CMOS technology [38], corresponding to both RTZ and RTO handshaking. To transform a QDI circuit corresponding to RTZ handshaking into one that corresponds to RTO handshaking and vice-versa, some rules have been defined in [58], and the proofs for these are given in [59]. Only the 2-input C-element was custom-realized, by modifying the AO222 gate, to implement the QDI adders. A typical-case PVT specification of the high Vt standard digital cell library with a supply voltage of 1.05V and an operating junction temperature of 25°C was considered for the implementations and simulations.
The registers and completion detectors associated with the QDI adders are maintained the same with respect to RTZ and RTO handshaking, separately. This implies that the differences between the simulation results of the QDI adders are attributable to the differences between their logic compositions.
About 2000 (random) input vectors encompassing data and spacer, which separately correspond to RTZ and RTO handshaking were used to verify the functionalities of the adders.
The input vectors corresponding to RTZ and RTO handshaking bear a logical equivalence. The functional simulations of the QDI adders were successfully performed and their respective switching activities were captured, which were subsequently used to estimate the average power dissipation. Synopsys EDA tools were used to estimate the design metrics of the adders. Default wire loads were automatically included while performing the simulations. A virtual clock was used to constrain the input and output ports of the QDI adders, and it did not consume any power.
The design metrics estimated include forward and reverse latencies, CT, area, and average power dissipation. The forward latency of a QDI circuit is similar to the critical path delay of a synchronous circuit and it is directly estimated. The reverse latencies of some QDI adders may differ from their forward latencies. This may be evident from Figure 3. The reverse latencies of QDI adders were estimated from the gate-level simulation timing data, and this method was followed for RTZ and RTO handshaking.
The estimated design metrics of various QDI adders corresponding to RTZ handshaking are given in Table 1, and the design metrics corresponding to RTO handshaking are given in Table 2. Adder legends are provided in the second column of Tables 1 and 2 to help with the discussion. The related literature references pertaining to the QDI adders are given in Tables 1 and 2. RCAs (i.e., RCA-SBFAs) utilising the early output full adders of [40] [41] [42] are excluded from the comparison since these RCAs are relative-timed [43]. Relative-timed asynchronous circuits are not QDI and they are non-robust since they usually incorporate additional timing assumptions with respect to sequencing the arrival of internal signals besides the assumption of isochronic forks, and such additional timing assumptions may not be easy to realize. Also, the dual-bit full adders (DBFAs) proposed in our earlier works [56] and [57], which incorporate dual- [...] FA refers to the Full Adder and SL refers to the Sum Logic (i.e., FA without the carry output).
Referring to Tables 1 and 2, Z1 (O1) is an RCA constructed using the strong-indication full adder of [44], Z2 (O2) is an RCA constructed using the strong-indication full adder of [45], and Z3 (O3) is an RCA constructed using the strong-indication full adder of [26]. Z4 (O4) is an RCA constructed using the weak-indication full adder of [45], and Z5 (O5) is an RCA constructed using the weak-indication full adder of [46]. The (worst-case) forward and reverse latencies of Z1 (O1), Z2 (O2), Z3 (O3), Z4 (O4) and Z5 (O5) are governed by the longest signal propagation path shown in violet in Figure 3a. Z6 (O6), Z7 (O7) and Z8 (O8) are RCAs which are constructed using the weak-indication full adders of [47] and [48] and the early output full adder of [21] respectively. The forward and reverse latencies of Z6 (O6), Z7 (O7) and Z8 (O8) are governed by the signal propagation paths highlighted in blue and red respectively in Figure 3b.
Note that in the case of Z6 (O6), Z7 (O7) and Z8 (O8), their reverse latency is a constant, which is typically governed by two full adder stages, while their forward latency is input data-[...] Figure 3d, while the reverse latency is the same as in Figure 3c. Z12 (O12) is similarly an improved, i.e., a hybrid RCA version of Z11 (O11), in that a 2-bit least significant RCA comprising two single-bit full adders (SBFAs) of [21] is used to replace a least significant DBFA of [51]. However, this does not result in a reduction of the forward latency; in fact, the forward latency of Z12 (O12) is slightly greater than that of Z11 (O11), while their reverse latencies are the same. This implies that a hybrid RCA may not always lead to a reduction in the CT compared to a regular RCA. This was also noticed when comparing the hybrid BCLARC-RCAs with the BCLARCs of [39], where the latter are more beneficial than the former in terms of the CT.
Z13 (O13) and Z14 (O14) are the non-uniform input-partitioned and uniform input-partitioned CSLAs presented in [52]. These CSLAs were constructed using the early output full adder of [21] and a strong-indication 2:1 multiplexer of [60]. Of the two CSLAs, the 32-bit CSLA based on an 8-8-8-8 uniform input-partition was found to have better design metrics than the non-uniform input-partitioned CSLA. The longest signal propagation paths corresponding to the forward and reverse latencies are shown in blue and red in Figure 3f [...] of [36], and Z27 (O27) and Z28 (O28) of [39]. The longest signal propagation paths corresponding to forward and reverse latencies in the case of a BCLA are represented by the dotted blue and red lines in Figure 3g, while the longest signal propagation paths corresponding to forward and reverse latencies in the case of a BCLARC are represented by the dotted blue and red lines in Figure 3i. By comparing Figures 3g and 3i, it is clear that the reverse latency of a BCLARC is much less than the reverse latency of a BCLA, and this is because of the introduction of the redundant carry output logic.
A conventional QDI CCLA was presented in [54], which is represented as Z19 and O19 in Tables 1 and 2. While the forward latency of the CCLA is less compared to the forward latencies of BCLAs, unfortunately, the reverse latency of the CCLA is the same as the forward latency, as seen from Figure 3h. As a result, the CT of the CCLA is greater than the CTs of BCLAs, BCLARCs and hybrid BCLARC-RCAs.
To further reduce the forward latency of BCLARCs, a small-size RCA may be of use in the least significant adder bit positions. This leads to a hybrid BCLARC-RCA architecture, which is represented by Z24 (O24) to Z26 (O26) and Z29 (O29) to Z31 (O31) in Tables 1 and 2. Z24 (O24), Z25 (O25) and Z26 (O26) embed a least significant 4-bit, 8-bit and 12-bit RCA respectively. Likewise, Z29 (O29), Z30 (O30) and Z31 (O31) embed a least significant 4-bit, 8-bit and 12-bit RCA respectively. However, whether the hybrid BCLARC-RCA architecture is useful or not has to be verified via timing analysis. In the case of the BCLARC of [36], the replacement of a least significant 4-bit BCLARC by a 4-bit RCA enables a small reduction in the forward latency and also the CT (the characteristic of which is portrayed by Figure 3j), while the use of a higher order RCA is found to degrade the forward latency. However, in the case of the BCLARC proposed in [39], a hybrid BCLARC-RCA configuration does not lead to any improvement in the CT.
Overall, when considering all the QDI adders given in Tables 1 and 2 , it becomes clear that in terms of the CT, the BCLARC proposed in our latest work [39] (represented by Z28 and O28) is better optimized compared to the rest.
The CT governs the speed of a QDI circuit that utilizes delay-insensitive data encoding and 4-phase handshaking, and the power-cycle time product (PCTP) governs the low power/low energy aspect. Hence, the PCTPs of the QDI adders were calculated and then normalized. The normalization was performed such that the highest PCTP among the set of QDI adders corresponding to a handshake protocol was normalized to 1, and the actual PCTPs [...] CT predominantly influences the PCTP of QDI adders. This is because the average power dissipations of the QDI adders are nearly the same, which in turn is because all the QDI adders satisfy the monotonic cover constraint [9]. The average power of the QDI adders is confined to small ranges of 151µW (i.e., 2161µW to 2312µW) in the case of RTZ handshaking and 146µW (i.e., 2157µW to 2303µW) in the case of RTO handshaking. Hence, the PCTP largely reflects the CT, as evident from the curves in Figures 4a and 4b, and Figures 5a and 5b. The normalized PCTP plots reveal that Z28 (O28) of [39] is more energy-efficient than the rest, as was found to be the case with the CT from Tables 1 and 2.
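The PCTP calculation and its normalization can be sketched as follows. The Python code computes the PCTP as average power multiplied by CT and then scales the values so that the highest PCTP becomes 1; it assumes the remaining PCTPs are divided by that same highest value. The adder names and the power and CT figures in the example are hypothetical placeholders, not values from Tables 1 and 2.

```python
def normalized_pctp(adders):
    """Map adder name -> PCTP scaled so that the highest PCTP equals 1.

    `adders` maps a name to (average_power_uW, cycle_time_ns).
    """
    pctp = {name: power * ct for name, (power, ct) in adders.items()}
    highest = max(pctp.values())
    return {name: value / highest for name, value in pctp.items()}


# Hypothetical example values chosen only to show the calculation.
example = {
    "adder_A": (2200.0, 10.0),  # PCTP = 22000
    "adder_B": (2300.0, 5.0),   # PCTP = 11500
}
result = normalized_pctp(example)
```

Because the average power varies only slightly between adders, the ordering of the normalized PCTPs in such a calculation is driven mainly by the CT, as noted above.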
Lastly, in terms of area, Z8 (O8), which is based on the early output full adder of [21], occupies relatively less silicon and dissipates less average power compared to the rest. The full adder of [21] requires less area compared to the full adders of [26, 44, 45, 46, 47, 48]. Even with respect to a synchronous design, the RCA architecture occupies less area and dissipates less power than the other adder architectures [62, 63], and this is found to hold good for a QDI design.
However, in terms of the CT, Z8 (O8) is 29.3% (22.5%) more expensive than Z28 (O28). As a result, in terms of the energy (PCTP), Z8 (O8) is 27.7% (21.1%) more expensive compared to Z28 (O28).
Conclusions
This technical note has summarized the design metrics of various QDI adders presented in the literature by considering an example 32-bit addition. Area is generally less of a concern than speed and energy. However, if area becomes an overarching concern, then Z8 (O8) of [21] is preferable. Nevertheless, it was observed that Z28 (O28) of [39] is preferable for implementing high-speed and energy-efficient QDI asynchronous addition based on RTZ (RTO) handshaking.
