# Improved Toolchain-Compatible Standard Cells with 5% - 36% Lower EDP for Super Threshold Operation in 65nm Low-Power CMOS Technology

S. Yadav, Radio Systems Group, University of Twente, The Netherlands, s.yadav@utwente.nl

A.B.J. Kokkeler, Radio Systems Group, University of Twente, The Netherlands, a.b.j.kokkeler@utwente.nl

M.S. Oude Alink, Integrated Circuit Design Group, University of Twente, The Netherlands, m.s.oudealink@utwente.nl

Abstract—High performance and energy efficiency are very crucial aspects in e.g. the field of edge computing where a tight power budget constrains the device operation. Different logic families were explored over the years to design standard cells with higher performance and/or lower power while keeping the noise immunity and the compatibility with design automation tools intact. Hybrid pass transistor logic with static CMOS output (HPSC) seems to be promising and is exploited in this paper to design low energy, high performance and toolchain-compatible standard cells without compromising on noise immunity and chip area. This paper presents a 2/3-input XOR cell, a 2/3-input XNOR cell, two variants of a half adder cell, a full adder cell and two variants of a 1-bit multiply-accumulate combinational cell based on a combination of HPSC and static CMOS logic in a commercial 65nm Low-Power CMOS technology. Post-layout simulations over all the process-voltage-temperature corners show a 4.7% - 35.7% lower energy-delay product with significant improvement in the propagation delay of the proposed cells.

*Index Terms*—Standard Cell Optimization, Cell Design, Logic Design, Super-Threshold Operation, Leakage Power, Propagation Delay, HPSC, PDP, EDP, Digital, CMOS

## I. INTRODUCTION

Energy efficiency of edge devices like wearables and Internet-of-Things (IoT) nodes operating on small batteries or energy harvesting circuits is of utmost importance when it comes to their operation within a tight power budget.

An encryption algorithm like AES-256 used for data security involves XOR-dominated functions [1]. Digital signal processing (DSP) algorithms like filtering use multiply-accumulate (MAC) operations. A neural network involves matrix-vector multiplications (MVMs), another example in which the MAC is the dominant operation [2]. Improvements in the energy efficiency of these operations are needed to keep up with the growing needs of such applications.

Semiconductor manufacturing companies typically provide libraries which contain 1000+ atomic operations that designers can use to implement arbitrary digital circuits. Because designing circuits manually is complex and error-prone, designers rely on electronic design automation (EDA) tools to synthesize and lay out the circuits, which makes toolchain compatibility of these libraries essential. The operations present in these libraries are implemented as unit cells at the schematic and physical layout level using field effect transistors (FETs). These cells are "power-performance-area" optimized and characterized for operation in the super-threshold region of the FETs [3].

The conditions to achieve the lowest energy consumption for a given cell / circuit are defined by a minimum energy point (MEP) which is located in the sub / near threshold regions of FETs depending on the frequency of operation [4], [5]. One way to reach the MEP is by lowering the supply voltage, but this comes at the cost of slower operation, reduced noise margin and highly increased PVT sensitivity. A complete redesign of standard cells for sub / near threshold operation is required to eliminate these effects [4].

Over the years, logic families like static CMOS logic, dynamic logic, complementary pass transistor logic (CPL), swing restored CPL, double pass logic (DPL), transmission gate (TG) logic, HPSC and many more have been explored to design smaller, faster and low-power digital cells [6]-[8]. Complex digital circuits require strong logic levels and high noise immunity, which CPL and dynamic logic do not provide. Even with a level restoration circuit, as seen in swing restored CPL, the driving capabilities are limited, leading to slower cells. Additionally, the absence of a driver (similar to an inverter in static CMOS logic) in the CPL, swing restored CPL, DPL and TG logic families leads to an input capacitance that is a function of the total output capacitance of the cell. When cells based on these families are cascaded, the input capacitance keeps varying, which makes it difficult for the EDA tools to estimate the timing and power of (complex) digital circuits. In contrast to these logic families, HPSC logic is based on TG / DPL logic but puts an additional inverter at the output. This inverter provides the desired driving capabilities and the termination needed to achieve a fixed input capacitance, at the cost of additional transistors to generate complementary inputs. Additionally, the noise immunity of HPSC is comparatively higher due to the absence of weak / floating nodes, which makes it a good candidate for designing standard cells apart from the well-known static CMOS logic family.

This publication is part of the project Analog Approximate Accelerators (AAA) with project number 17703 of the research programme OTP which is (partly) financed by the Dutch Research Council (NWO).

In order to achieve energy efficiency, a combination of HPSC and static CMOS logic is explored in this paper to design 9 toolchain-compatible standard cells with higher operating speed, lower area and, above all, lower energy-delay product (EDP). Compatibility with toolchains makes them drop-in replacements for their commercial standard cell library counterparts.

This article is organised as follows: Section II presents the methodology and the schematics of the proposed cells. Section III presents the post-layout simulation results and discussion. Section IV concludes this article.

## II. STANDARD CELL DESIGN

A hybrid approach combining the HPSC and static CMOS logic families is used here to design energy-efficient cells. This paper proposes a 2/3-input XOR cell (XOR2 / XOR3), a 2/3-input XNOR cell (XNOR2 / XNOR3), 2 layout variants of a half adder cell (HA1 and HA2), a full adder cell (FA) and 2 variants of a 1-bit combinational MAC cell (MAC1 and MAC2). These cells can be used to design complex computational units for the aforementioned applications like AES-256, digital filtering, MVMs, etc. The Boolean equations governing the functionality of these cells are:

$$\begin{split} &XOR2/XOR3{:}\ Z = A \oplus B \ / \ A \oplus B \oplus C \\ &XNOR2/XNOR3{:}\ Z = \overline{A \oplus B} \ / \ \overline{A \oplus B \oplus C} \\ &HA1/HA2{:}\ (S_{out}, C_{out}) = (A \oplus B \ , \ A \cdot B) \\ &FA{:}\ (S_{out}, C_{out}) = (A \oplus B \oplus C_{in} \ , \ A \cdot B + C_{in} \cdot (A \oplus B)) \\ &MAC1{:}\ (S_{out}, C_{out}) = ((A \cdot B) \oplus S_{in} \ , \ (A \cdot B) \cdot S_{in}) \\ &MAC2{:}\ (S_{out}, C_{out}) = ((A \cdot B) \oplus S_{in} \oplus C_{in} \ , \\ &S_{in} \cdot C_{in} + A \cdot B \cdot (S_{in} \oplus C_{in})) \end{split}$$

where 'A', 'B', 'C', 'S<sub>in</sub>' and 'C<sub>in</sub>' are inputs to the cells and 'Z', 'S<sub>out</sub>' and 'C<sub>out</sub>' are outputs of the cells. 'C<sub>in</sub>' stands for old carry input and 'S<sub>in</sub>' stands for old sum input. MAC1 is a 3-input complex gate whereas MAC2 is a 4-input complex gate having an extra additive input. The schematics for the proposed XOR2 / XNOR2 cells, XOR3 / XNOR3 cells, HA1 / HA2 cells, FA cell, MAC1 cell and MAC2 cell are presented in Fig. 1(a), Fig. 1(b), Fig. 1(c), Fig. 1(d), Fig. 1(e) and Fig. 1(f) respectively.
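The Boolean equations above can be checked exhaustively with a few lines of Python. The following is a minimal behavioural sketch, not part of the original design flow; the function names (`ha`, `fa`, `mac1`, `mac2`, `encodes_sum`) are illustrative choices. It confirms that each (S<sub>out</sub>, C<sub>out</sub>) pair encodes the arithmetic result S<sub>out</sub> + 2·C<sub>out</sub> expected from an adder or a 1-bit multiply-accumulate.

```python
from itertools import product

# Behavioural models of the cell equations (inputs and outputs are 0 or 1).
def xor2(a, b): return a ^ b
def xnor2(a, b): return 1 - (a ^ b)
def xor3(a, b, c): return a ^ b ^ c
def xnor3(a, b, c): return 1 - (a ^ b ^ c)

def ha(a, b):
    return (a ^ b, a & b)

def fa(a, b, cin):
    return (a ^ b ^ cin, (a & b) | (cin & (a ^ b)))

def mac1(a, b, sin):
    p = a & b                      # 1-bit product A.B
    return (p ^ sin, p & sin)

def mac2(a, b, sin, cin):
    p = a & b                      # 1-bit product A.B
    return (p ^ sin ^ cin, (sin & cin) | (p & (sin ^ cin)))

def encodes_sum(cell, arithmetic, n_inputs):
    """True if S_out + 2*C_out equals the arithmetic result for all inputs."""
    for bits in product((0, 1), repeat=n_inputs):
        s_out, c_out = cell(*bits)
        if s_out + 2 * c_out != arithmetic(*bits):
            return False
    return True

if __name__ == "__main__":
    print(encodes_sum(ha, lambda a, b: a + b, 2))
    print(encodes_sum(fa, lambda a, b, c: a + b + c, 3))
    print(encodes_sum(mac1, lambda a, b, s: a * b + s, 3))
    print(encodes_sum(mac2, lambda a, b, s, c: a * b + s + c, 4))
```

All four checks pass, which shows that MAC1 behaves as a half adder on the product A·B and S<sub>in</sub>, and MAC2 as a full adder on A·B, S<sub>in</sub> and C<sub>in</sub>.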

In order to achieve energy efficiency and faster performance, five key concepts are used.

- **No weak '0' or '1' logic anywhere:** In order to propagate strong logic '1' and strong logic '0', and achieve high noise immunity with no signal degradation, both P-type and N-type FETs should be used. This can for example be observed from Box A, B, C and D in Fig. 1(a) where VDD is passed through P-type FETs and VSS is passed through N-type FETs. In Box C of Fig. 1(a), both P-type and N-type FETs are used to pass strong logic levels for input signal A and A' whether they are '0' or '1'.

- **Reduced FET stack:** Reducing the size of FET stacks can speed up the performance. Instead of stacking 2 P-type and 2 N-type FETs, see box F in Fig. 1(a), only 1 P-type and 1 N-type FET is sufficient, see box C in Fig. 1(a), to get the same functionality by allowing the input signal to pass through the source / drain terminal of the FETs. This can also be observed from Fig. 1(b), Fig. 1(c), Fig. 1(d), Fig. 1(e) and Fig. 1(f) where TG logic is used to pass the input signal through the source / drain terminal of the FETs in order to achieve the desired functionality with reduced FET stacks.
- **Output driver:** Presence of an output driver, see box D of Fig. 1(a), can provide the required termination to achieve fixed input capacitance and driving capabilities to the cell leading to toolchain compatibility along with fast output rise and fall times, and thus high-performing cells.
- **Smart generation of the inverted signal:** Generating inverted signals without using additional inverters, see Box B for signal B' in Fig. 1(a), as compared to the alternative implementation in Box E, saves 2 FETs and thus also reduces the input capacitance of the cell.
- **Reduced capacitance:** The reduced FET stack and the absence of additional inverters reduce the input and internal node capacitances of the cell, leading to lower $CV^2$ energy.
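As a reminder of why the last concept matters, the energy drawn from the supply when a node of capacitance $C$ is charged from 0 to $V_{DD}$ follows the textbook relation below. The 1 fF value is purely illustrative and is not taken from the proposed cells; only the 1.2V supply matches the design.

$$E_{0 \rightarrow 1} = C \cdot V_{DD}^2, \qquad C = 1\,\text{fF},\ V_{DD} = 1.2\,\text{V} \ \Rightarrow \ E_{0 \rightarrow 1} = 1.44\,\text{fJ}$$

At a fixed supply voltage this energy scales linearly with $C$, so every reduction of the switched capacitance translates directly into a proportional reduction of the dynamic switching energy of that node.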

The proposed cells were designed for super-threshold operation at 1.2V in Cadence® schematic and layout editor using high threshold voltage (HVT) FETs from a commercial 65nm CMOS LP process. The cells pass the design rule check (DRC) without violations. Calibre provided by Mentor Graphics® was used to extract the post-layout netlist with parasitic information. These cells adhere to the layout constraints (layer dimensions and using Metal M1 routing only) defined for a 65nm 7-track standard cell (both single and double height), which makes them drop-in replacements for their commercial standard cell library equivalents.

## III. RESULTS AND DISCUSSION

The proposed digital cells were simulated using the testbench presented in Fig. 2 which is similar to the testbench discussed in [8] and [9]. The inputs to the driving inverters (X1 driving strength [10]) were generated and provided using a Verilog-A module with a rise and fall transition time of 50ps measured from 10% to 90% of the signal level. The generated input patterns cover all the combinations required to observe all the rising and falling transitions at the output for all the proposed cells. The somewhat idealized input from the Verilog-A module passes through the inverter chain and generates a realistic input for the design under test (DUT). The output of the DUT is loaded with inverters of X4 driving strength (Fan-Out-of-4 (FO4) load) [11]. This testbench is kept identical for both delay and power simulations of all the proposed cells.
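The source does not describe how the input patterns were generated. One possible way to enumerate the stimulus needed to observe every rising and falling output transition is sketched below, assuming that only one input toggles at a time (a common characterization convention, adopted here as an assumption). The helper name `output_arcs` is hypothetical.

```python
from itertools import product

def output_arcs(func, n_inputs):
    """List single-input toggles that change the output of `func`.

    Each arc is (toggled input index, vector before, vector after, direction).
    """
    arcs = []
    for idx in range(n_inputs):
        for before in product((0, 1), repeat=n_inputs):
            after = list(before)
            after[idx] ^= 1                 # toggle exactly one input
            after = tuple(after)
            z0, z1 = func(*before), func(*after)
            if z0 != z1:                    # keep only toggles seen at the output
                direction = "rise" if z1 > z0 else "fall"
                arcs.append((idx, before, after, direction))
    return arcs

if __name__ == "__main__":
    xor2_arcs = output_arcs(lambda a, b: a ^ b, 2)
    carry_arcs = output_arcs(lambda a, b: a & b, 2)   # carry output of a half adder
    print(len(xor2_arcs), len(carry_arcs))
    for arc in xor2_arcs:
        print(arc)
```

Under this assumption, every single-input toggle of an XOR gate reaches the output, whereas the carry output of a half adder only switches when the other input is held at '1'.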



Fig. 1. (a) Proposed XOR2 and XNOR2 cells (with alternate implementations for sub-blocks) (b) Proposed XOR3 and XNOR3 cells (c) Proposed HA1 and HA2 cells (d) Proposed FA cell (e) Proposed MAC1 cell (f) Proposed MAC2 cell

$$P_{\text{avg}} = \text{VDD} \cdot \frac{1}{T} \int_{0}^{T} (i_{\text{VDD}} + i_{\text{in1}} + i_{\text{in2}} - i_{\text{out1}} - i_{\text{out2}}) dt \quad (1)$$

The average power ($P_{avg}$) consumed by a cell over a time duration $T$ is calculated using Eq. 1, which includes the current flowing from the supply ($i_{VDD}$) and from the input terminals ($i_{in1}$, $i_{in2}$). The load-dependent output currents ($i_{out1}$, $i_{out2}$) are subtracted in Eq. 1 so that they do not count towards the cell power. If the load is changed, the rise / fall time of the output signal changes along with the internal short-circuit currents. The propagation delay is the time from the input transition at the 50% voltage level to the output transition at the 50% voltage level.
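Eq. 1 and the 50% delay definition can be applied to sampled simulation waveforms as sketched below. This is a minimal illustration, assuming the waveforms are available as plain Python lists on a common time axis and using trapezoidal integration with linear interpolation. The function names and the data layout are assumptions, not the post-processing actually used by the authors.

```python
def trapz(t, y):
    """Trapezoidal integral of samples y over time points t."""
    return sum(0.5 * (y[k] + y[k + 1]) * (t[k + 1] - t[k])
               for k in range(len(t) - 1))

def average_power(t, i_vdd, i_in, i_out, vdd=1.2):
    """Eq. 1: VDD times the time-averaged net current.

    i_in and i_out are lists of waveforms (one list of samples per terminal).
    """
    net = [i_vdd[k] + sum(w[k] for w in i_in) - sum(w[k] for w in i_out)
           for k in range(len(t))]
    return vdd * trapz(t, net) / (t[-1] - t[0])

def crossing_time(t, v, level):
    """First time at which v crosses `level`, by linear interpolation."""
    for k in range(len(t) - 1):
        lo, hi = v[k], v[k + 1]
        if lo != hi and (lo - level) * (hi - level) <= 0:
            return t[k] + (level - lo) * (t[k + 1] - t[k]) / (hi - lo)
    raise ValueError("waveform never crosses the requested level")

def propagation_delay(t, v_in, v_out, vdd=1.2):
    """Delay between the 50% crossings of the input and the output."""
    return crossing_time(t, v_out, vdd / 2) - crossing_time(t, v_in, vdd / 2)

if __name__ == "__main__":
    t = [0.0, 1.0, 2.0, 3.0, 4.0]           # arbitrary time units
    v_in = [0.0, 1.2, 1.2, 1.2, 1.2]        # rising input
    v_out = [1.2, 1.2, 0.0, 0.0, 0.0]       # falling output
    print(propagation_delay(t, v_in, v_out))
```

In practice the first output crossing after the relevant input edge should be selected per transition; the sketch simply takes the first crossing in the record.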

The proposed cells were simulated for the corners (SS, TT, FF over different voltages and temperatures) as defined in Table I. Transistor count, layout area, average leakage power, propagation delay, power-delay product (PDP) and energy-delay product (EDP) were simulated and calculated for the proposed cells and compared against the functionally equivalent cells from a commercial 65nm LP HVT standard cell library in the exact same technology. The relative improvements are reported in Table I. The absolute values for elementary cells XOR2, XNOR2, XOR3, XNOR3, HA1, HA2 and FA cells are not reported to avoid publication of confidential data. Additional information regarding the leakage power for the complex MAC1 and MAC2 cells is presented in Table II.
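The source does not spell out the PDP and EDP formulas; the sketch below uses the conventional definitions (PDP as average power times propagation delay, EDP as PDP times propagation delay) and shows how a relative improvement is obtained. All numerical values in the example are hypothetical and serve only to illustrate the arithmetic.

```python
def pdp(p_avg, t_pd):
    """Power-delay product: energy per operation (J) from power (W) and delay (s)."""
    return p_avg * t_pd

def edp(p_avg, t_pd):
    """Energy-delay product (J*s): PDP multiplied by the delay once more."""
    return pdp(p_avg, t_pd) * t_pd

def relative_change(new, ref):
    """Change of `new` with respect to `ref` in percent (negative = reduction)."""
    return 100.0 * (new - ref) / ref

if __name__ == "__main__":
    # Hypothetical reference and proposed cell with equal power, 10% less delay.
    p_ref, t_ref = 1.0e-6, 100e-12
    p_new, t_new = 1.0e-6, 90e-12
    print(relative_change(pdp(p_new, t_new), pdp(p_ref, t_ref)))
    print(relative_change(edp(p_new, t_new), edp(p_ref, t_ref)))
```

Because the delay enters the EDP quadratically, a delay reduction is weighted more heavily in the EDP than in the PDP, which is why a faster cell can show a larger EDP improvement than PDP improvement.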

Fig. 2. Test bench for functional and electrical simulations

All the proposed cells use fewer transistors than their standard cell library equivalents, and a layout area reduction is observed for the XOR2, XNOR2, XOR3, XNOR3 and HA2 cells. Due to the more complex routing of signals through the gate as well as the source / drain terminals of the FETs, this area improvement is not present for the HA1, FA, MAC1 and MAC2 cells. The effect of routing complexity can be clearly observed from the HA variants in Fig. 3. A double-height variant using only Metal M1 routing (HA1) does not lead to an area improvement, even though the number of transistors is reduced. When the routing constraints are relaxed and Metal M2 is also used, the single-height variant of the half adder cell (HA2) can be realized, leading to an area improvement of 10%, see Fig. 3 and Table I. Similarly, the FA and MAC2 cells designed with Metal M1 routing show a layout area increase of 16.7% and 3.4% respectively.

In terms of propagation delay, see Table I, the 'Carry' outputs of the FA and MAC1 cells exhibit some deterioration. All the remaining cells achieve a delay improvement at all the process corners, with a 0.7% minimum improvement at the SS corner for the HA1 cell and a 31% maximum improvement for the MAC2 cell at the FF corner.

EDP is considered to be a better metric in comparison to the PDP when cells / circuits need to be evaluated or compared for energy-efficiency [12]. The EDP values from Table I show that all the cells can provide higher energy efficiency as compared to their standard cell library equivalents. The XNOR3 cell achieves the lowest EDP improvement of 4.7% at the SS corner, whereas the MAC2 cell achieves the highest EDP improvement of 35.7% at the FF corner. Additionally, from the leakage power numbers mentioned in Table II, the absolute worst and absolute best case leakage power for the MAC1 cell show an improvement of 10.9% and 2.1% respectively, whereas the MAC2 cell shows 17.2% and 19.4% improvement respectively.

Based on the results obtained, the proposed combination of HPSC and static CMOS logic simultaneously improves performance and energy-efficiency of (complex) standard cells without compromising on area or noise immunity. If routing constraints are relaxed to use Metal M2, an area improvement can also be achieved with a lower transistor count for more complex standard cells. The proposed alternative standard cells can be used by upcoming edge devices to achieve higher performance at lower energy consumption.

#### TABLE I

IMPROVEMENTS OBSERVED BASED ON THE POST-LAYOUT SIMULATIONS CARRIED OUT FOR XOR2, XNOR2, XOR3, XNOR3, HA1, HA2, FA, MAC1 AND MAC2 CELLS IN COMPARISON TO THEIR STANDARD CELL LIBRARY EQUIVALENTS

| Parameter | Corner | XOR2 | XNOR2 | XOR3 | XNOR3 | HA1 | | HA2 | | FA | | MAC1 | | MAC2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Transistor count <sup>a</sup> | | -2 | -2 | -4 | -4 | -2 | | -2 | | -2 | | -5 | | -4 | |
| Area (%) <sup>a</sup> | | -5 | -5 | -5.6 | -5.6 | 0 | | -10 | | +16.7 | | 0 | | +3.4 | |
| Output | | | | | | Sum | Carry | Sum | Carry | Sum | Carry | Sum | Carry | Sum | Carry |
| Propagation delay (%) <sup>ab</sup> | SS <sup>d</sup> | -19.7 | -17.5 | -10.7 | -10.2 | -12.9 | -0.7 | -12.0 | -8.7 | -12.0 | +9.6 | -17.5 | +4.8 | -26.9 | -17.6 |
| Propagation delay (%) <sup>ab</sup> | TT <sup>e</sup> | -17.1 | -15.3 | -10.4 | -10.0 | -12.8 | -1.0 | -11.8 | -8.3 | -12.0 | +4.6 | -15.5 | +3.5 | -30.2 | -21.6 |
| Propagation delay (%) <sup>ab</sup> | FF <sup>f</sup> | -15.3 | -13.9 | -8.9 | -8.8 | -12.1 | -0.8 | -11.3 | -7.5 | -12.4 | +3.4 | -14.7 | +0.3 | -31.0 | -23.0 |
| EDP (%) <sup>abc</sup> | SS | -21.4 | -24.5 | -11.6 | -4.7 | -19.1 | | -20.8 | | -13.4 | | -24.0 | | -31.7 | |
| EDP (%) <sup>abc</sup> | TT | -19.9 | -23.0 | -12.0 | -8.1 | -19.6 | | -21.0 | | -12.6 | | -22.3 | | -34.5 | |
| EDP (%) <sup>abc</sup> | FF | -19.3 | -22.3 | -11.0 | -7.7 | -19.8 | | -21.2 | | -13.2 | | -21.7 | | -35.7 | |

<sup>a</sup> Results normalized with respective values for standard cell equivalents

<sup>b</sup> Absolute worst from all input combinations / transitions

<sup>c</sup> Calculations done considering slower output path (if exists)

<sup>d</sup> SS = SS process corner ; 1.08V ; +125C

<sup>e</sup> TT = TT process corner ; 1.2V ; +25C

<sup>f</sup> FF = FF process corner ; 1.32V ; -40C

#### TABLE II

LEAKAGE POWER SIMULATED FOR PROPOSED MAC1 AND MAC2 CELLS IN COMPARISON TO THEIR STANDARD CELL LIBRARY EQUIVALENTS

| Parameter | Corner | Proposed | Std. lib | Change <sup>d</sup> |
|---|---|---|---|---|
| **MAC1 cell** | | | | |
| Leakage power (pW) | SS <sup>dh</sup> | 192.18 | 202.38 | -5.0% |
| Leakage power (pW) | TT <sup>eh</sup> | 65.3 | 71.97 | -9.3% |
| Leakage power (pW) | FF <sup>fh</sup> | 137.97 | 161.44 | -14.5% |
| Leakage power (pW) | Worst <sup>b</sup> | 238.2 | 267.3 | -10.9% |
| Leakage power (pW) | Best <sup>i</sup> | 56.03 | 57.23 | -2.1% |
| **MAC2 cell** | | | | |
| Leakage power (pW) | SS | 304.31 | 344.17 | -11.6% |
| Leakage power (pW) | TT | 104.10 | 116.68 | -10.8% |
| Leakage power (pW) | FF | 209.79 | 252.09 | -16.7% |
| Leakage power (pW) | Worst | 349 | 432.6 | -17.2% |
| Leakage power (pW) | Best | 83.2 | 103.3 | -19.4% |

<sup>a b c d e f g</sup> Refer to the table notes of Table I

<sup>h</sup> Average value calculated for all input combinations

<sup>i</sup> Absolute best from all input combinations / transitions

Fig. 3. (a) Double-height variant of half adder cell (HA1) with Metal M1 routing only (b) Single-height variant of half adder cell (HA2) with Metal M1 and M2 routing. M2 routing is highlighted.
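The 'Change' column of Table II is a relative difference with respect to the standard cell library value. As a small worked example, the sketch below recomputes it for the MAC1 rows of Table II; the dictionary layout and the helper name `percent_change` are illustrative choices.

```python
# MAC1 leakage power in pW from Table II: corner -> (proposed, standard library)
MAC1_LEAKAGE_PW = {
    "SS": (192.18, 202.38),
    "TT": (65.3, 71.97),
    "FF": (137.97, 161.44),
    "Worst": (238.2, 267.3),
    "Best": (56.03, 57.23),
}

def percent_change(proposed, reference):
    """Relative change in percent; negative values mean lower leakage."""
    return 100.0 * (proposed - reference) / reference

# Round to one decimal, as reported in the table.
changes = {corner: round(percent_change(prop, ref), 1)
           for corner, (prop, ref) in MAC1_LEAKAGE_PW.items()}

if __name__ == "__main__":
    for corner, change in changes.items():
        print(f"{corner}: {change:+.1f}%")
```

The recomputed values reproduce the MAC1 entries of the 'Change' column.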

## IV. CONCLUSION

Energy efficiency is an essential aspect of modern day digital circuits. It is difficult to meet the tight power budget when the edge devices operate on energy harvesting circuits or with small batteries. Changes are required at the cell level to improve the energy efficiency of the design. A hybrid approach involving static CMOS logic and HPSC logic was explored to design standard cells with higher performance, lower area and lower EDP. Based on the post-layout simulations carried out for the proposed toolchain-compatible XOR / XNOR cells, a half adder cell, a full adder cell and two 1-bit combinational MAC cells, an improvement of 4.7% - 35.7% is observed in EDP with lower propagation delays in comparison to the standard cell equivalents from the library provided by the semiconductor manufacturing company, without sacrificing area and noise immunity.

#### REFERENCES

- [1] N. Su, Y. Zhang, and M. Li, "Research on data encryption standard based on AES algorithm in internet of things environment," in *2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC)*, 2019, pp. 2071–2075.
- [2] M. Paliy, S. Strangio, P. Ruiu, T. Rizzo, and G. Iannaccone, "Analog vector-matrix multiplier based on programmable current mirrors for neural network integrated circuits," *IEEE Access*, vol. 8, pp. 203525–203537, 2020.
- [3] D. Judy and V. S. K. Bhaaskaran, "Review and Analysis of the Impacts and Effects on Low Power VLSI Circuits Operating in Subthreshold Regime," *International Journal of Engineering and Technology*, vol. 5, pp. 3870–3883, Oct. 2013.
- [4] J. L. B. Peje, H. H. L. Ho, F. Barot, M. F. G. Bautista, C. C. E. Misagal, J. R. E. Hizon, and L. P. Alarcon, "An ultra low-voltage standard cell library in 65-nm CMOS process technology," in *TENCON 2014 - 2014 IEEE Region 10 Conference*, 2014, pp. 1–6.
- [5] Y. Chen, Y. Nie, and H. Jiao, "An Ultralow-Power 65-nm Standard Cell Library for Near/Subthreshold Digital Circuits," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 30, no. 5, pp. 676–680, 2022.
- [6] M. Hasan, M. J. Hossein, U. K. Saha, and M. S. Tarif, "Overview and Comparative Performance Analysis of Various Full Adder Cells in 90 nm Technology," in 2018 4th International Conference on Computing Communication and Automation (ICCCA), 2018, pp. 1–6.
- [7] R. Zimmermann and W. Fichtner, "Low-power logic styles: CMOS versus pass-transistor logic," *IEEE Journal of Solid-State Circuits*, vol. 32, no. 7, pp. 1079–1090, 1997.
- [8] M. Zhang, J. Gu, and C.-H. Chang, "A novel hybrid pass logic with static CMOS output drive full-adder cell," in 2003 IEEE International Symposium on Circuits and Systems (ISCAS), vol. 5, 2003, pp. V–V.
- [9] H. Naseri and S. Timarchi, "Low-Power and Fast Full Adder by Exploring New XOR and XNOR gates," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 26, no. 8, pp. 1481–1493, 2018.
- [10] J. Ebergen, J. Gainsley, and P. Cunningham, "Transistor sizing: how to control the speed and energy consumption of a circuit," in *10th International Symposium on Asynchronous Circuits and Systems, 2004. Proceedings.*, 2004, pp. 51–61.
- [11] M. Rashid and A. Muhtaroğlu, "Power delay product optimized hybrid full adder circuits," in 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), 2017, pp. 1–4.
- [12] B. Steinbach, "Recent progress in the boolean domain." [Online]. Available: https://www.cambridgescholars.com/product/978-1-4438-5638-6