# Ultra-Low Voltage 4-to-2 Compressors for Near-$V_{th}$ Computing

Anuradha C. Ranasinghe, Sabih H. Gerez

\*Chair of Computer Architecture for Embedded Systems, Faculty of Electrical Engineering, University of Twente

5, Drienerlolaan, 7522 NB Enschede, Netherlands

Email: rnsburg@gmail.com, s.h.gerez@utwente.nl

Abstract—This paper presents two novel circuit arrangements for an ultra-low voltage, low-power 4-to-2 compressor targeting the typical near-$V_{th}$ application domain. A hybrid logic style is used to improve energy efficiency by reducing the parasitics in the circuit blocks. The proposed structures are evaluated against prevalent compressors in terms of their typical figures of merit and noise immunity. Extensive post-layout simulations in a 65-nm bulk CMOS process technology show that the better of the two arrangements is 35% more power efficient, 3.4% faster, 8% more area efficient and 37% better in PDP at a $V_{DD}$ of 0.4 V compared to the most appealing implementations in the literature.

## I. INTRODUCTION

With the escalating performance demands of applications in the internet-of-things (IoT) domain, operating them within a reasonable power budget becomes a daunting task. Portable, battery-powered and ubiquitous IoT systems rely heavily on the lifetime of the energy source, and energy efficiency has therefore become the primary design goal in their implementation [1]. Many modern systems in the IoT domain put high demands on the performance of arithmetic blocks for computation-intensive applications such as image processing and deep neural networks [2]–[6]. Operating these blocks in the near-$V_{th}$ regime is an attractive way to exploit better energy-performance trade-offs in the target application domain [6].

Most general-purpose arithmetic blocks process partially computed data in a reduction tree, in either a carry-save or a carry-propagate manner. Such applications can benefit significantly from an efficient adder-compressor circuit because it dominates the reduction tree. Traditionally, CMOS-based full adders have been employed in various multiplier structures [7]–[9]. To lower the latency of the accumulation stage, the 4-to-2 compressor (4T2C) was proposed; in most cases it reduces the critical-path delay of the reduction tree by 1, 2 and 3 XOR gate delays for the 8-bit, 16-bit and 32-bit versions of multipliers, respectively [10]–[12]. However, the standard CMOS-based implementation of the 4T2C presented in [13]–[15], which generally requires 72 transistors, is neither energy efficient nor well suited to near-$V_{th}$ operation.

Non-CMOS implementations [16], [17] employing pass-transistor and double pass-transistor logic (DPL) multiplexers have shown an 18% performance improvement and a 19% reduction in transistor count (58). Radhakrishnan *et al.* [18] proposed a unique structure which requires only 28 transistors, the most compact design reported to date. However, this circuit severely



Fig. 1: (a) 4-to-2 Compressor Logic Decomposition (b) Carry-save adder tree wiring

suffers at low $V_{DD}$ due to its regenerative feedback loops [19]. Chang *et al.* [19] improve this circuit and propose three alternative designs based on DPL and complementary pass-transistor logic (CPL) that can withstand near-$V_{th}$ operation. These structures require 44, 46 and 52 transistors, respectively. Yuan *et al.* [20] propose two designs: one combining full-swing and non-full-swing XOR-XNR blocks to lower the transistor count, and a second using transmission-gate logic (TGL). They require 44 and 56 transistors, respectively. Only the latter survives aggressive PVT variations. The alternatives in [21] and [22] are hybrid and standard CMOS implementations, which typically require more transistors than the aforementioned designs. Recently, [23] proposed a novel structure to improve the critical-path delay of the 4T2C.

This work reports two 4T2C arrangements based on a novel XOR-XNR circuit targeting near/sub-$V_{th}$ operation. A hybrid logic style is exploited to improve the energy efficiency of the compressor without compromising the other figures of merit.

## II. 4-TO-2 COMPRESSOR ARCHITECTURE

A 4T2C has five inputs ($x_1$–$x_4$ and *cin*) received from the preceding level of the adder tree. The output *sum* has the same weight as inputs $x_1$–$x_4$, while the final carry-out (*carry*) is weighted one bit higher in the binary order, as shown in Fig. 1(b). The intermediate carry-out *co*, in contrast, is derived from the addition of the first three inputs. In general, the *sum* and *carry* signals are propagated vertically or diagonally to the subsequent level of the adder tree, while *co* is propagated horizontally as *cin* to the compressor of the next higher-order bit in the same level. Conventionally, the Boolean expression of



Fig. 2: SXRG based on (a) CPL [19] (b) DPL [19] (c) IRFL [19] (d) 6TXI [23] (e) Proposed ISGS (f) Proposed PV1 (g) Proposed PV2

4T2C is written as follows [18]–[21]:

$$co = (x_1 \oplus x_2)x_3 + \overline{(x_1 \oplus x_2)}x_1 \tag{1}$$

$$sum = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus cin \tag{2}$$

$$carry = (x_1 \oplus x_2 \oplus x_3 \oplus x_4)cin + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)}x_4 \tag{3}$$
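The arithmetic meaning of these expressions can be checked exhaustively. The following is a minimal bit-level sketch, not the authors' circuit: it assumes *carry* in its multiplexer form (select *cin* when $x_1 \oplus x_2 \oplus x_3 \oplus x_4 = 1$, otherwise $x_4$), and the function names `compressor_4to2` and `check_all_inputs` are illustrative. It verifies that $x_1 + x_2 + x_3 + x_4 + cin = sum + 2(carry + co)$ for all 32 input combinations.

```python
from itertools import product


def compressor_4to2(x1, x2, x3, x4, cin):
    """Bit-level 4-to-2 compressor; returns (sum, carry, co) as 0/1 integers."""
    xor12 = x1 ^ x2
    # eq. (1): co is x3 when x1 != x2, otherwise x1 (majority of x1, x2, x3)
    co = x3 if xor12 else x1
    xor1234 = xor12 ^ x3 ^ x4
    # eq. (2): five-input XOR
    s = xor1234 ^ cin
    # carry in multiplexer form (assumption stated in the lead-in)
    carry = cin if xor1234 else x4
    return s, carry, co


def check_all_inputs():
    """Return True when every input combination preserves the arithmetic value."""
    for bits in product((0, 1), repeat=5):
        s, carry, co = compressor_4to2(*bits)
        if sum(bits) != s + 2 * (carry + co):
            return False
    return True


if __name__ == "__main__":
    print(check_all_inputs())
```

Note that *co* does not depend on *cin* in this formulation, which is what prevents a carry from rippling along a row of compressors.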

From eq. (1)-(3), the path from the inputs to *sum*/*carry* generation evidently becomes the critical path of the circuit. This is equivalent to 3 XOR (or MUX) gate delays in the CPL [19], DPL [19], hybrid [17]–[19] and TGL [20] implementations. Arasteh *et al.* [23] exploit the relationship between one of the primary inputs (i.e. $x_1$) and the *sum*/*carry* signals to reduce the critical-path delay. However, the transistor-level implementation still has to go through 3 XOR (or MUX) stages, although its critical path is shortened by the drain resistance of one transmission-gate pair. Moreover, the outputs of this design could be susceptible to spurious activity because the control signals to the final *sum-carry* generation (SCG) branches are poorly synchronized. The novel arrangements proposed in this paper are based on eq. (1)-(3) and are evaluated in detail in the next section.
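The horizontal *co*-to-*cin* wiring described at the start of this section can also be illustrated at word level. The sketch below is an assumption-laden behavioural model, not a description of any layout in this paper: `reduce_four_operands` is a hypothetical helper that places one compressor per bit position, feeds each *co* to the next higher-order cell, and shifts each *carry* one position up. The four operands are then represented by a sum word, a carry word and the final *co* leaving the row.

```python
def compressor_4to2(x1, x2, x3, x4, cin):
    """Bit-level 4-to-2 compressor; returns (sum, carry, co)."""
    xor12 = x1 ^ x2
    co = x3 if xor12 else x1
    xor1234 = xor12 ^ x3 ^ x4
    return xor1234 ^ cin, (cin if xor1234 else x4), co


def reduce_four_operands(a, b, c, d, width):
    """Compress four unsigned width-bit operands into (sum_word, carry_word, co_out)."""
    sum_word = 0
    carry_word = 0
    cin = 0  # the least significant cell has no horizontal carry-in
    for i in range(width):
        bits = [(v >> i) & 1 for v in (a, b, c, d)]
        s, carry, co = compressor_4to2(*bits, cin)
        sum_word |= s << i
        carry_word |= carry << (i + 1)  # carry is weighted one bit higher
        cin = co                        # co feeds the next higher-order cell
    return sum_word, carry_word, cin    # last co has weight 2**width


if __name__ == "__main__":
    s, c, co = reduce_four_operands(13, 7, 9, 15, width=4)
    # The three outputs together carry the same value as the four inputs.
    print(s + c + (co << 4) == 13 + 7 + 9 + 15)
```

A final carry-propagate adder would then add the two words; that stage is outside the scope of this sketch.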

## III. PROPOSED 4-TO-2 COMPRESSORS

Fig. 2(a)-(d) illustrate the essential building blocks of ultra-low voltage 4T2Cs in the literature. The dashed arrow in each figure represents the critical path of the circuit. Note that implementations based on regenerative feedback loops (RFL) [18] and on non-full-swing operation are not considered in this analysis because of their severe performance drop at lower $V_{DD}$. Fig. 2(a)-(c) depict the simultaneous XOR-XNR generation (SXRG) modules based on the CPL, DPL and hybrid logic styles proposed in [19], respectively. These blocks correspond to the XOR<sub>1</sub>-XOR<sub>3</sub>/XOR<sub>4</sub> stages of Fig. 1(a). Owing to their complementary nature, CPL and DPL typically require 10-12 transistors for SXRG, including the input inverters. Alternatively, Fig. 2(c) uses an *improved* version of RFL (IRFL) with 10 transistors for SXRG, which has more symmetrical path delays than the CPL and DPL versions. Moreover, it has no inverters in its critical path. Since CPL is mainly composed of NMOS pass transistors, it is more energy efficient than DPL and IRFL owing to the lower switching gate capacitance ($C_g$) of NMOS devices.

The transmission-gate pairs in the DPL version minimize the equivalent drain resistance in each path, and the outputs are free of $C_g$s local to the SXRG. This makes the DPL version faster at the expense of more transistors. Although IRFL is more balanced, the input combinations leading to regeneration (i.e. $x_1 \neq x_2$) can still result in slower transitions at the outputs. The enclosed portion of Fig. 2(d) [23] for SXRG is straightforward: it employs a 6-transistor XNR circuit followed by an inverter, as depicted by I1-I2 and Q1-Q4. Henceforth, this will be referred to as 6TXI. Note that the rest of Fig. 2(d) corresponds to the XOR<sub>3</sub> and MUX<sub>2</sub> stages of Fig. 1(a). Although 6TXI requires only 8 transistors, synchronization between the XOR and XNR signals is difficult to achieve at lower supply voltages because of the PMOS load of I2. This is most prominent at worst-case corner (SS) operation. Among all the SXRG modules, the parasitic coupling to an output node is largest in Fig. 2(d).

Fig. 2(e)-(g) illustrate the circuit blocks of the proposed 4T2Cs. Fig. 2(e) represents the intermediate signal generation stage (ISGS), which can be combined with Fig. 2(f) or Fig. 2(g) to construct two full-fledged compressors, PV1 and PV2. The latter provides additional signal buffering to the output loads. The novelty of the circuit lies in a new SXRG circuit, denoted by I1 and Q1-Q7 in Fig. 2(e). As can be seen, the XOR output within the SXRG circuit drives only a single NMOS $C_g$ (Q5) in addition to the drain-source parasitics ($C_d$ and $C_s$) of Q1-Q4. This should result in a faster XOR operation compared to the SXRGs of IRFL and 6TXI. It is also more energy efficient than CPL, DPL and IRFL owing to the reduced transistor count. Among all the SXRG modules, the proposed one provides a balanced trade-off between circuit speed and power consumption while improving signal synchronization. The parasitic decomposition corresponding



Fig. 3: SXRG Circuits: (a) Equivalent RC network (b) Critical path delays (SS,-40°C)

to the critical path of Fig. 2(e) SXRG is shown in Fig. 3(a).  $C_L$ ,  $C_{INV}$  and  $R_{INV}$  represent the output loads and the inverter parasitics respectively. The rest of the parasitics are from  $C_{d/s}$  which are state dependent. From the Elmore delay model [24], the critical path delays of SXRG can be derived as follows:

$$V_{out}(t) = V_{DD}\left(1 - e^{-t/(R_{eq}C_{eq})}\right) \tag{4}$$

for the 50% $V_{DD}$ level of the rise/fall transitions,

$$\tau_{pd\_xor_1} \approx 0.69 (R_{D1} \parallel R_{D2}) (C_{d/s} + C_{g5\_off} + C_{L1}) \tag{5}$$

$$\tau_{pd\_xnr_1} \approx \tau_{xor|Q5} + 0.69 \{ R_{INV} (C_{INV} + C_{s7\_on}) + (R_{INV} + R_{D7}) (C_{d/s} + C_{d5\_off} + C_{L2}) \} \tag{6}$$

where $\tau_{xor|Q5}$ is the time taken for the $xor_1$ signal to reach the $V_{th}$ of Q5; $\tau_{xor|Q5}$ is lower than $\tau_{pd\_xor_1}$. Fig. 3(b) compares the critical-path delays of the stand-alone SXRG circuits based on post-layout simulations at the worst-case corner (SS, -40°C). The drawn geometries of the PMOS and NMOS devices were $0.22\,\mu\text{m}/0.06\,\mu\text{m}$ and $0.15\,\mu\text{m}/0.06\,\mu\text{m}$, respectively. PV1 represents the proposed SXRG; as depicted, it is comparable to CPL and faster than both 6TXI and IRFL down to 0.7V. At 0.6V, PV1 is faster than 6TXI only, which is mainly caused by the slow PMOS devices in its critical path.
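The 0.69 factor in eq. (5)-(6) is $\ln 2$: setting $V_{out}(t) = V_{DD}/2$ in eq. (4) gives $t = R_{eq}C_{eq}\ln 2$. The sketch below evaluates eq. (4)-(6) numerically under that reading. All resistance and capacitance values are hypothetical placeholders chosen only to exercise the formulas; they are not extracted from the layouts in this paper, and the function names are illustrative.

```python
import math

LN2 = math.log(2.0)  # about 0.69, the factor appearing in eq. (5)-(6)


def v_out(t, vdd, r_eq, c_eq):
    """Eq. (4): first-order RC step response."""
    return vdd * (1.0 - math.exp(-t / (r_eq * c_eq)))


def t50(r_eq, c_eq):
    """Time at which eq. (4) reaches 50% of V_DD."""
    return LN2 * r_eq * c_eq


def parallel(r_a, r_b):
    """Equivalent resistance of two resistors in parallel."""
    return r_a * r_b / (r_a + r_b)


def tau_pd_xor1(r_d1, r_d2, c_ds, c_g5_off, c_l1):
    """Eq. (5): single lumped node driven through R_D1 || R_D2."""
    return t50(parallel(r_d1, r_d2), c_ds + c_g5_off + c_l1)


def tau_pd_xnr1(tau_xor_q5, r_inv, c_inv, c_s7_on, r_d7, c_ds, c_d5_off, c_l2):
    """Eq. (6): Elmore sum over the inverter node and the output node."""
    elmore = r_inv * (c_inv + c_s7_on) + (r_inv + r_d7) * (c_ds + c_d5_off + c_l2)
    return tau_xor_q5 + LN2 * elmore


if __name__ == "__main__":
    # Hypothetical values: resistances in ohms, capacitances in farads.
    xor_delay = tau_pd_xor1(80e3, 80e3, 0.4e-15, 0.2e-15, 1e-15)
    xnr_delay = tau_pd_xnr1(0.3 * xor_delay, 60e3, 0.5e-15, 0.2e-15,
                            80e3, 0.4e-15, 0.2e-15, 1e-15)
    print(f"xor1: {xor_delay * 1e12:.1f} ps, xnr1: {xnr_delay * 1e12:.1f} ps")
```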

To evaluate the overall performance, we consider six existing 4T2Cs as baselines against PV1 and PV2. These are based on [19] (CPL, DPL and IRFL), [20] (TGL), [23] (6TXI) and standard CMOS cells. Five of these and the proposed versions were derived from eq. (1)-(3) and implemented according to Fig. 1(a). In PV1 and PV2, the intermediate *xor<sub>i</sub>/xnr<sub>i</sub>* signals are capacitively terminated at each transmission gate, as depicted in Fig. 2(f)-(g), thereby avoiding cascaded resistive paths from the inputs to the output loads. Additionally, this provides proper signal synchronization between the SCG branches, while the $C_g$s of the TGL pairs filter out glitches that could stem from asymmetric path delays of the SXRG.

Noise immunity is crucial for near- $V_{th}$  operation as the transistor on/off current ratio is severely degraded by low  $V_{DD}$ . The effect can be quantified by analyzing the ISGS for the



Fig. 4: Worst-case  $\Delta V_{sw}$  scenario of ISGS

input scenario leading to the worst-case swing degradation ($\Delta V_{sw}$), as shown in Fig. 4. Note that only the ISGS is considered here, as $\Delta V_{sw}$ occurs from the inputs to *xor*<sub>3</sub>, which is eventually terminated at the $C_g$ of the transmission-gate pairs. In Fig. 4, all the sub-threshold leakage paths are denoted by $I_n$, where *n* represents the respective component of the ISGS in Fig. 2(e). The leakage through $I_4$ and $I_{10}$ is minimal because their resulting $V_{GS} < 0$ due to the swing reduction at their drain-source nodes. Assuming the device geometries given earlier, $\Delta V_{sw}$ at *xor*<sub>1</sub> and *xnr*<sub>1</sub> are given by [25]–[27]:

$$\Delta V_{OH\_xor_1} = R_{D3}(I_1 + I_2) = v_t \left(\frac{\beta_n}{\beta_p} e^{\frac{-(V_{SG} - V_{thp})}{\eta_p v_t}} + 1\right) \tag{7}$$

$$\Delta V_{OL\_xnr_1} = R_{D5}(I_6 + I_7) = v_t \frac{2\beta_p}{\beta_n} e^{\frac{-(V_{GS} - V_{thn})}{\eta_n v_t}} \tag{8}$$

where  $\beta_n$  ( $\beta_p$ ) is the strength of NMOS (PMOS) transistor which can be tuned by adjusting device aspect ratio. From (7), similarly for *xor*<sub>3</sub>:

$$\Delta V_{OH\_xor_3} = R_{D13}(I_{12} + I_{14} + I_{15}) = \frac{v_t}{\beta_p} \Big\{ 2\beta_n + \beta_p e^{\left[\frac{-(V_{xor_2} - V_{xor_1}) - V_{thp}}{\eta_p v_t}\right]} \Big\} \cdot e^{\left[\frac{-[(V_{xor_1} - V_{xor_2}) - V_{thp}]}{\eta_p v_t}\right]} \tag{9}$$

From eq. (7)-(8), it is evident that the $\beta$ mismatch ($\beta_n \gg \beta_p$) should be kept to a minimum for better $\Delta V_{sw}$. Moreover, internal nodes formed by multiple drain-source junctions, as well as regenerative feedback loops, are highly vulnerable to $\Delta V_{sw}$. Hence, eq. (9) further justifies the necessity of terminating the *xor<sub>i</sub>/xnr<sub>i</sub>* signals at the SCG branches; otherwise, $V_{OH\_xor_3}$ degrades further through the resistive paths of the SCG. Ideally, the impact of $\Delta V_{sw}$ on the noise margin of the compressor should be minimal, and the low and high noise margins should remain balanced at low $V_{DD}$ [28].
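To get a feel for how the $\beta$ ratio enters eq. (7)-(8), the expressions can be evaluated directly. The sketch below is a plain transcription of those two closed forms with the thermal voltage computed as $v_t = kT/q$. The device parameters in the example call (threshold voltages, slope factors $\eta$, $\beta$ values, gate drive) are hypothetical and are treated as magnitudes; they are not the parameters of the 65-nm process used in this paper.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin):
    """v_t = kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELEMENTARY_CHARGE


def delta_v_oh_xor1(beta_n, beta_p, v_sg, v_thp, eta_p, temp_kelvin=300.0):
    """Eq. (7): high-level swing degradation at xor1."""
    v_t = thermal_voltage(temp_kelvin)
    ratio = beta_n / beta_p
    return v_t * (ratio * math.exp(-(v_sg - v_thp) / (eta_p * v_t)) + 1.0)


def delta_v_ol_xnr1(beta_n, beta_p, v_gs, v_thn, eta_n, temp_kelvin=300.0):
    """Eq. (8): low-level swing degradation at xnr1."""
    v_t = thermal_voltage(temp_kelvin)
    ratio = 2.0 * beta_p / beta_n
    return v_t * ratio * math.exp(-(v_gs - v_thn) / (eta_n * v_t))


if __name__ == "__main__":
    # Hypothetical near-threshold operating point (all values are assumptions).
    for beta_ratio in (1.0, 2.0, 4.0):
        d_oh = delta_v_oh_xor1(beta_ratio, 1.0, v_sg=0.4, v_thp=0.35, eta_p=1.3)
        d_ol = delta_v_ol_xnr1(beta_ratio, 1.0, v_gs=0.4, v_thn=0.3, eta_n=1.3)
        print(f"beta_n/beta_p={beta_ratio}: dVOH={d_oh * 1e3:.2f} mV, "
              f"dVOL={d_ol * 1e3:.2f} mV")
```

In this transcription a larger $\beta_n/\beta_p$ raises the eq. (7) term and lowers the eq. (8) term, which is one way to read the recommendation to keep the two strengths matched.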

## IV. PERFORMANCE EVALUATION

The 4T2Cs in section III were implemented in a 65-nm bulk CMOS process technology, and the post-layout simulations were carried out in the Cadence Virtuoso analog environment. Equal effort was made to optimize all cell layouts by sharing drain-source regions and minimizing intra-cell ME-2 routing. Fig. 5(a)-(f) illustrate the figures of merit of all the designs. The same device geometries as in section III were used for all implementations except CMOS. Note that the power consumption was measured with a uniformly distributed random input sequence of 1000 patterns, at two frequencies: 100 MHz for the 0.6V-1.2V range and 10 MHz for the 0.4V-0.5V range. In delay profiling, the minimum frequency was set to 10 MHz. Power, delay and power-delay product (PDP) figures are based on typical corner operation (TT, 25°C, $C_L=1fF$). PV1 and 6TXI are the most power-efficient designs and are comparable across 0.6V-1.2V, as shown in Fig. 5(a). However, the delay profile in Fig. 5(b) deviates from this trend: PV1 is the fastest option, followed by CPL. Even though CPL requires more transistors than PV1, the dominance of NMOS



Fig. 5: 4T2C Merits: (a)-(c) Power, Delay, PDP (0.6V-1.2V, 100MHz) (d) PDP (0.4V-0.5V, 10MHz) (e) Core Area/(#)Transistor Count (f) PVT Variation (0.6V/SS/-40°C)

devices in the CPL critical path reduces the influence of parasitics. Interestingly, CPL is followed by PV2 and then DPL. Even though the standalone DPL SXRG was more robust (Fig. 3(b)), the compressor composed of DPL SXRGs is slower due to cascaded resistive paths and increased coupling parasitics. When the operating region shifts from near-$V_{th}$ (0.5V) to sub-$V_{th}$ (0.4V), only the PV1, PV2, CPL and CMOS versions sustain their operation under the given constraint. At 0.4V (not shown), PV1 (15.43 nW) and PV2 (18 nW) were 34.6% and 24% more power efficient than CPL (23.61 nW). However, PV1 (52.9 ns) was 3.4% faster while PV2 (60.3 ns) was 9.1% slower than CPL (54.8 ns). Nonetheless, the proposed versions outperform all other implementations in terms of PDP across a wider operating range (0.4V-1.2V), as shown in Fig. 5(c)-(d). Specifically, at 0.4V the PDPs of PV1 (0.817 fJ) and PV2 (1.08 fJ) were 37% and 16.3% better than that of CPL (1.295 fJ).
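The 0.4V figures quoted above can be cross-checked with a few lines of arithmetic, since PDP is simply power multiplied by delay. The sketch below uses only the power and delay values stated in the text; the helper names `pdp_fj` and `improvement_pct` are illustrative, and improvements are computed relative to the CPL baseline.

```python
# Reported 0.4 V figures from the text: (power in nW, delay in ns).
REPORTED = {
    "PV1": (15.43, 52.9),
    "PV2": (18.0, 60.3),
    "CPL": (23.61, 54.8),
}


def pdp_fj(power_nw, delay_ns):
    """Power-delay product in fJ (nW * ns = 1e-18 J, so divide by 1000)."""
    return power_nw * delay_ns / 1000.0


def improvement_pct(value, baseline):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    cpl_power, cpl_delay = REPORTED["CPL"]
    cpl_pdp = pdp_fj(cpl_power, cpl_delay)
    for name in ("PV1", "PV2"):
        power, delay = REPORTED[name]
        pdp = pdp_fj(power, delay)
        print(f"{name}: PDP={pdp:.3f} fJ, "
              f"power gain={improvement_pct(power, cpl_power):.1f}%, "
              f"PDP gain={improvement_pct(pdp, cpl_pdp):.1f}%")
```

The products agree with the quoted PDP values to within rounding of the published power and delay figures.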

Fig. 5(e) shows the core area of each implementation. The layout irregularities (i.e. diffusion breaks) of non-conventional logic styles require a relatively larger core area compared to the superior standard CMOS layouts of the commercial cell library. One exception is the DPL version, in which the drain-source junctions are shared by multiple MOSFETs. Hence, PV1 (17 $\mu$m<sup>2</sup>) is only 8% more area efficient than DPL (18.5 $\mu$m<sup>2</sup>).



Fig. 6: Noise margin variation (TT, 25°C) : (a) NMH (b) NML

The influence of process and mismatch variation (PVT) under extreme conditions (0.6V, SS, -40°C, $C_L=1fF$) is depicted in Fig. 5(f). From 500 Monte Carlo iterations, it was observed that only the PV1, PV2, CPL and CMOS versions can withstand aggressive PVT variations under the given constraint. In terms of outliers beyond the $\mu$+3$\sigma$ limit, CPL has the worst score despite its smaller $\mu$ (20.5 ns) owing to fewer PMOS devices. Conversely, PV1 ($\mu$=21.2 ns) and PV2 ($\mu$=23.0 ns) are closer to the standard CMOS version ($\mu$=19.9 ns) in terms of outliers.
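The outlier criterion used above can be made concrete with a short post-processing sketch. It counts the samples of a delay population that exceed $\mu + 3\sigma$. The delay samples here are synthetic, drawn from a log-normal distribution with an arbitrary spread purely for illustration; they are not the Monte Carlo results of this paper, and the function names are hypothetical.

```python
import math
import random
import statistics


def outliers_beyond_3sigma(samples):
    """Return (mu, sigma, list of samples above mu + 3*sigma)."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    limit = mu + 3.0 * sigma
    return mu, sigma, [s for s in samples if s > limit]


def synthetic_delays(n, median_ns, spread, seed):
    """Synthetic, right-skewed delay samples in ns (illustrative only)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(math.log(median_ns), spread) for _ in range(n)]


if __name__ == "__main__":
    delays = synthetic_delays(500, median_ns=21.0, spread=0.15, seed=1)
    mu, sigma, outliers = outliers_beyond_3sigma(delays)
    print(f"mu={mu:.2f} ns, sigma={sigma:.2f} ns, outliers={len(outliers)}")
```

A right-skewed distribution is chosen for the synthetic data because near-threshold delay depends exponentially on threshold-voltage variation, so a long upper tail is the case of interest.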

The noise margin variation of the compressors, obtained from the unity-gain points [25], is depicted in Fig. 6(a)-(b). Except for IRFL, all versions including PV1/PV2 show balanced noise margins comparable to those of the standard CMOS version across a wider $V_{DD}$ range. IRFL fails to achieve this because of the internal voltage drops across the cascaded PMOS devices for the input scenario $x_1=\_\_$, $x_n|_{n:[2,4]}=cin=0$.
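As an illustration of the unity-gain criterion, the sketch below extracts noise margins from a voltage transfer curve (VTC) by locating the two input voltages where the slope equals -1. The tanh-shaped VTC is a hypothetical stand-in for a simulated curve, not a device model, and the sketch assumes one common convention: $V_{OH}$ and $V_{OL}$ are taken as the outputs at the unity-gain inputs $V_{IL}$ and $V_{IH}$, so that NMH $= V_{OH} - V_{IH}$ and NML $= V_{IL} - V_{OL}$.

```python
import math


def vtc(v_in, vdd, gain):
    """Hypothetical inverter-like transfer curve; slope at mid-rail is -gain."""
    half = vdd / 2.0
    return half * (1.0 - math.tanh(gain * (v_in - half) / half))


def noise_margins(transfer, vdd, steps=20000):
    """Return (NMH, NML) from the unity-gain points of `transfer` on [0, vdd]."""
    dv = vdd / steps
    v_il = v_ih = None
    for i in range(1, steps):
        v = i * dv
        slope = (transfer(v + dv) - transfer(v - dv)) / (2.0 * dv)
        if slope <= -1.0:
            if v_il is None:
                v_il = v  # first input where the gain magnitude reaches 1
            v_ih = v      # last input where the gain magnitude is still >= 1
    if v_il is None:
        raise ValueError("transfer curve never reaches unity gain")
    v_oh = transfer(v_il)
    v_ol = transfer(v_ih)
    return v_oh - v_ih, v_il - v_ol


if __name__ == "__main__":
    vdd = 0.4  # volts, hypothetical supply
    nmh, nml = noise_margins(lambda v: vtc(v, vdd, gain=5.0), vdd)
    print(f"NMH={nmh * 1e3:.1f} mV, NML={nml * 1e3:.1f} mV")
```

For a symmetric curve such as this one the two margins coincide; an imbalance between them is the kind of behaviour reported above for IRFL.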

## V. CONCLUSION

This paper presented two novel circuit arrangements for an ultra-low voltage 4T2C targeting the typical near/sub-$V_{th}$ application domain. From extensive post-layout simulations in a 65-nm technology, the better of the proposed versions was demonstrated to be 35% more power efficient, 3.4% faster, 8% more area efficient and 37% better in PDP at a $V_{DD}$ of 0.4 V compared to state-of-the-art 4T2Cs. These gains do not come at the expense of noise margin and PVT immunity, which were comparable to those of the standard CMOS version across a wider operating range.

#### ACKNOWLEDGMENT

The research reported in this paper is supported by the Dutch NWO Applied and Engineering Sciences program *ZERO: Towards Energy Autonomous Systems for IoT* and by Dialog Semiconductor B.V., The Netherlands.

#### REFERENCES

- [1] J. Prummel, et al., "A 10 mW Bluetooth low-energy transceiver with on-chip matching," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 12, pp. 3077–3088, 2015.
- [2] P. N. Whatmough, S. K. Lee, D. Brooks, and G.-Y. Wei, "DNN engine: A 28-nm timing-error tolerant sparse deep neural network processor for iot applications," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 9, pp. 2722–2731, 2018.
- [3] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini, "PULP: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision," *Journal of Signal Processing Systems*, vol. 84, no. 3, pp. 339–354, 2016.
- [4] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gürkaynak, and L. Benini, "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 25, no. 10, pp. 2700–2713, 2017.
- [5] F. Glaser, G. Haugou, D. Rossi, Q. Huang, and L. Benini, "Hardware-accelerated energy-efficient synchronization and communication for ultra-low-power tightly coupled clusters," in *2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. IEEE, 2019, pp. 552–557.
- [6] A. Pullini, D. Rossi, I. Loi, G. Tagliavini, and L. Benini, "Mr. Wolf: An energy-precision scalable parallel ultra low power SoC for IoT edge processing," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 7, pp. 1970–1981, 2019.
- [7] S. S. Mahant-Shetti, P. T. Balsara, and C. Lemonds, "High performance low power array multiplier using temporal tiling," *IEEE Transactions on very large scale integration (VLSI) systems*, vol. 7, no. 1, pp. 121–124, 1999.
- [8] A. A. Farooqui and V. G. Oklobdzija, "General data-path organization of a MAC unit for VLSI implementation of DSP processors," in *ISCAS'98. Proceedings of the 1998 IEEE International Symposium on Circuits and Systems (Cat. No. 98CH36187)*, vol. 2. IEEE, 1998, pp. 260–263.
- [9] A. A. Del Barrio, R. Hermida, and S. Ogrenci-Memik, "A combined arithmetic-high-level synthesis solution to deploy partial carry-save radix-8 Booth multipliers in datapaths," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 66, no. 2, pp. 742–755, 2018.
- [10] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach," *IEEE Transactions on computers*, vol. 45, no. 3, pp. 294–306, 1996.
- [11] W. C. Yeh and C.-W. Jen, "High-speed Booth encoded parallel multiplier design," *IEEE transactions on computers*, vol. 49, no. 7, pp. 692–701, 2000.
- [12] Z. Huang and M. D. Ercegovac, "High-performance low-power left-to-right array multiplier design," *IEEE Transactions on computers*, vol. 54, no. 3, pp. 272–283, 2005.
- [13] M. Santoro and M. Horowitz, "SPIM: A pipelined 64×64-bit iterative multiplier," *IEEE journal of solid-state circuits*, vol. 24, no. 2, 1989.
- [14] M. Nagamatsu, S. Tanaka, J. Mori, K. Hirano, T. Noguchi, and K. Hatanaka, "A 15-ns 32×32-b CMOS multiplier with an improved parallel structure," *IEEE Journal of Solid-State Circuits*, vol. 25, no. 2, pp. 494–497, 1990.
- [15] J. Mori, "A 10ns 54×54-b parallel structured full array multiplier with 0.5u CMOS technology," *IEEE Journal of Solid-State Circuits*, vol. 26, no. 4, 1991.
- [16] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4 ns CMOS 54×54-b multiplier using pass-transistor multiplexer," *IEEE Journal of Solid-State Circuits*, vol. 30, no. 3, pp. 251–257, 1995.
- [17] S.-F. Hsiao, M.-R. Jiang, and J.-S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers," *Electronics Letters*, vol. 34, no. 4, pp. 341–343, 1998.
- [18] D. Radhakrishnan and A. Preethy, "Low power CMOS pass logic 4-2 compressor for high-speed multiplication," in *Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems (Cat. No. CH37144)*, vol. 3. IEEE, 2000, pp. 1296–1298.
- [19] C.-H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 51, no. 10, pp. 1985–1997, 2004.

- [20] S. Yuan, "4-2 compressor of fast Booth multiplier for high-speed RISC processor," *International journal of electronics*, vol. 94, no. 9, pp. 869–875, 2007.
- [21] S. Veeramachaneni, K. M. Krishna, L. Avinash, S. R. Puppala, and M. Srinivas, "Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressors," in 20th International Conference on VLSI Design held jointly with 6th International Conference on Embedded Systems (VLSID'07). IEEE, 2007, pp. 324–329.
- [22] A. Pishvaie, G. Jaberipur, and A. Jahanian, "Improved CMOS (4:2) compressor designs for parallel multipliers," *Computers & Electrical Engineering*, vol. 38, no. 6, pp. 1703–1716, 2012.
- [23] A. Arasteh, M. H. Moaiyeri, M. Taheri, K. Navi, and N. Bagherzadeh, "An energy and area efficient 4:2 compressor based on FinFETs," *Integration*, vol. 60, pp. 224–231, 2018.
- [24] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, *Digital integrated circuits*. Prentice hall Englewood Cliffs, 2002, vol. 2.
- [25] M. Alioto, "Understanding DC behavior of subthreshold CMOS logic through closed-form analysis," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 57, no. 7, pp. 1597–1607, 2010.
- [26] —, "Ultra-low power VLSI circuit design demystified and explained: A tutorial," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 59, no. 1, pp. 3–29, 2012.
- [27] B. H. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-threshold sram in 65-nm CMOS," *IEEE Journal of solid-state circuits*, vol. 41, no. 7, pp. 1673–1679, 2006.
- [28] N. Reynders and W. Dehaene, Ultra-Low-Voltage Design of Energy-Efficient Digital Circuits. Springer, 2015.