# Novel Ultra-Low-Voltage Flip-Flops: Near-$V_{th}$ Modeling and VLSI Integration

Anuradha C. Ranasinghe, Sabih H. Gerez

Chair of Computer Architecture for Embedded Systems, Faculty of Electrical Engineering, University of Twente 5, Drienerlolaan, 7522 NB Enschede, Netherlands Email: rnsburg@gmail.com, s.h.gerez@utwente.nl

Abstract—This paper presents two novel ultra-low-voltage (ULV) Single-Edge-Triggered flip-flops (SET-FF) based on the True-Single-Phase-Clocking (TSPC) scheme. By exploiting the TSPC principle, the overall energy efficiency is improved compared to traditional flip-flop designs while providing the fully static, contention-free functionality required for ULV operation. At the 0.5V near-$V_{th}$ level in 65nm bulk CMOS technology, the proposed SET-FFs demonstrate energy-efficiency improvements of up to 11-45% and 7-20% at 0% and 100% data activity rates compared to the best known SET-FFs. The proposed SET-FFs can safely operate down to a supply voltage of 0.24V without corrupting the rail-to-rail voltage levels at their internal nodes. The integration of the proposed SET-FFs in a 320-bit parallel shift register reduced the clock network power by up to 33% and the register power by 17-39% compared to state-of-the-art and commercial standard cells at the near-$V_{th}$ level. In addition to these merits, with the aid of parasitic modeling, this paper re-evaluates the vital performance metrics of SET-FFs in the near-$V_{th}$ voltage domain, improving their characterization accuracy and enabling VLSI integration for commercial end-use.

Index Terms—Single Edge Triggering, True Single Phase Clocking, Near/Sub-V<sub>th</sub>, Low Power Clock Network, Clock Slope Sensitivity, Ultra-Low-Voltage

## I. INTRODUCTION

Flip-flops are essential sequential elements in synchronous designs, facilitating data storage and data synchronization in digital hardware [1]. Modern VLSI architectures typically require hundreds of thousands of flip-flops, given their complex pipeline structures [2]. Unlike the datapaths, the clock network's constraints are often stringent, and its 100% activity rate causes a significant portion of the total power ($\sim$40%) [3] to be consumed by the clock distribution. Hence, great emphasis has been placed in the past few years on improving flip-flops and thereby the overall system-level energy efficiency [4], [5].

The most well-established flip-flops are the single-edge-triggered flip-flops (SET-FF) due to their simple structure and compatibility with the standard ASIC design flow. SET-FFs sample data at either the positive or the negative edge of the clock signal. The conventional and most common design of this kind is TGL24 [1], which can be found in commercial standard cell libraries. Due to the heavy clock loading of this design, transistor-level optimization of SET-FFs has been studied in recent research [4]–[6]. The recurring concept of true-single-phase-clocking (TSPC) has replaced the conventional dual-phase-clocking (DPC) scheme of the traditional designs. Unlike DPC, TSPC does not require complementary clock signals and does not need local clock buffers, leading to a twofold reduction in clock load [5]. The total power dissipation of a flip-flop is given by:

$$P_{tot} = V_{DD}^{2} \left\{ f_{clk} \cdot (C_{clk} + C_{ff,clk}) + f_{dat} \cdot C_{ff,dat} \right\} \tag{1}$$

where $f_{clk}$ and $f_{dat}$ represent the clock and the average data frequencies, and $C_{clk}$, $C_{ff,clk}$ and $C_{ff,dat}$ are the capacitive loads seen at the clock input, the internal clock paths and the data paths, respectively. In a nutshell, TSPC aims to minimize the first two of these capacitances, $C_{clk}$ and $C_{ff,clk}$.
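As a rough illustration of eq. (1), the sketch below evaluates the total power for one flip-flop before and after a reduction of the clock-related capacitances. All capacitance values and the data frequency are hypothetical placeholders chosen for readability; they are not taken from the designs in this paper.

```python
def flip_flop_power(vdd, f_clk, c_clk, c_ff_clk, f_dat, c_ff_dat):
    """Total dynamic power of one flip-flop per eq. (1), in watts (SI units in)."""
    return vdd ** 2 * (f_clk * (c_clk + c_ff_clk) + f_dat * c_ff_dat)


# Hypothetical operating point and capacitances, for illustration only.
VDD = 0.5            # near-threshold supply (V)
F_CLK = 100e6        # clock frequency (Hz)
F_DAT = 50e6         # average data frequency (Hz), an assumed value

baseline = flip_flop_power(VDD, F_CLK, 2e-15, 4e-15, F_DAT, 3e-15)
# Same cell with both clock-related capacitances halved (assumed).
lighter_clock = flip_flop_power(VDD, F_CLK, 1e-15, 2e-15, F_DAT, 3e-15)

print(f"baseline:      {baseline * 1e9:.1f} nW")
print(f"lighter clock: {lighter_clock * 1e9:.1f} nW")
```

Because the clock term is weighted by $f_{clk}$, which is at least as high as $f_{dat}$, a reduction in $C_{clk}$ or $C_{ff,clk}$ tends to pay off more than the same reduction in $C_{ff,dat}$.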

In addition to the clock-load reduction, fully static and contention-free operation is crucial for ultra-low-voltage (ULV) operability. SET-FFs based on complementary master-slave latches have always been good candidates for ULV operation [7] compared to alternatives such as pulse-triggered [8], [9] and C-element [10] based designs. A comprehensive analysis of their shortcomings against master-slave designs is given in [7], [11]. Despite these merits, certain prevalent master-slave SET-FFs such as XCFF [12], ACFF [13], TCFF [14] and 18T-FF [4] still suffer at ULV levels owing to issues such as contention and lack of PVT resilience, which have already been discussed in [5]. The TSPC24 [6] and TSPC18 [5] designs were found to be superior in many aspects and hence will be used as the baselines in this paper.

Conventional CAD tools fall short when it comes to the characterization and evaluation of non-conventional cell topologies. In evaluating certain parameters such as hold, recovery and removal times, the standard measurement scheme (i.e., the 10% delay degradation point [15]) might fail to converge for certain designs and under certain operating conditions. Besides, the selection of proper input stimuli for characterization plays a vital role at near-$V_{th}$ levels. Therefore, some alternative measurement schemes and the near-$V_{th}$ parasitic modeling [16]–[18] will be studied in this paper.

To address all these concerns, we present more energy-efficient versions of TSPC SET-FFs, achieving full-fledged functionality at ULV levels. The typical SET-FF operation and the issues of prevalent designs are detailed in Section II. Section III revisits the vital parameters of the flip-flops and the near-$V_{th}$ modeling aimed at ULV operation. Section IV details the cell-level merits of the SET-FFs based on Spice simulations. An accurate cell characterization strategy based on composite current source (CCS) modeling is presented



Fig. 1: (a) SET-FF conceptual block diagram (b) TGL24 SET-FF (c) The clock overlap issue of TGL24 (SS, -40°C) at 0.4V (d) TSPC18 [5] worst-case setup(blue)/hold(red) paths

in Section V. In Section VI, the merits of the SET-FFs are evaluated in a higher-level integration (i.e., a 320-bit shift register) through extensive post-layout simulations in 65nm bulk CMOS process technology. Finally, the conclusions are drawn in Section VII.

## II. MASTER-SLAVE LATCH SET-FFS

This section revisits the working principle and the critical features of the master-slave SET-FFs. Fig. 1(a) depicts a simplified block diagram of a SET design with multiplexer (MUX) based latches that operate in a complementary fashion. The data (D) captured at the negative clock phase (blue) will be immediately available at Q output by the next positive edge of the clock. Besides capturing, the input MUX should latch the captured data during the positive phase of the clock.

The conventional TGL24 [1] based on this master-slave arrangement is depicted in Fig. 1(b). This arrangement with input/output buffering is often found in commercial standard cell libraries. The clock overlap issue is a major concern in this design as depicted in Fig. 1(c). At ULV levels, a possible overlap between the negative (CKB) and the positive (CKI) phases of the clock can lead to data transparency (the orange arrow of Fig. 1(b)) between the capturing latch and the output MUX as well as between node N and L. Steeper clock transitions can alleviate this failure. However, this eventually increases the gate parasitics [16] of the local clock buffers and therefore the clock tree power.

With the aid of TSPC, this barrier can be overcome. Since TSPC does not require dual-phase clocking, it provides steeper clocked transitions ideal for ULV operation. Kim *et al.* [6] and Yunpeng *et al.* [5] proposed TSPC24 and TSPC18 designs



Fig. 2: PVT resilience and the  $3\sigma$  yields (%) of 3 SET-FFs in terms of  $T_{D\text{-}Q,opt}$  against 0.5V/TT/25°C and  $C_L$ =10fF conditions.

based on this theory and their functionality has been proven to be superior at ULV levels. From TSPC24 to TSPC18, the number of clocked transistors has been reduced from 5 out of 24 transistors to 4 out of 18. This is a significant improvement. The TSPC18 design with its worst-case setup (blue) and hold (red) paths is depicted in Fig. 1(d).

This design is superior in many aspects compared to other state-of-the-art designs. However, there is still room for improvement in two vital parameters: the hold time and the PVT resilience. Due to the long hold path M13 $\rightarrow$ M2 $\rightarrow$ M6 indicated in red (obtained from simulations), the circuit performance under extreme ULV conditions becomes less favorable. This, in turn, affects the optimal D-to-Q delay $(T_{D-Q,opt})$ and the clock-slope sensitivity [11] of the flip-flop (discussed in Section III-D). The PVT resilience of the TGL24, TSPC24 and TSPC18 designs based on 1000 Monte Carlo iterations at typical near-$V_{th}$ conditions is depicted in Fig. 2. As evident from the mean $(\mu)$ and the standard deviation ($\sigma$) values of $T_{D-Q,opt}$, TSPC24 is the most PVT-resilient design despite its very high clock load [5]. The latest TSPC18 design demonstrates the poorest $3\sigma$ yield despite its superiority. In a nutshell, this paper aims to overcome these drawbacks while providing a better energy profile than the latest TSPC18 design.

## III. TSPC17 - PROPOSED FULLY-STATIC SET-FFS

The proposed SET-FFs based on complementary TSPC latches are depicted in Fig. 3(a)-(b). The left (M1-M10) and the right (M11-M17) portions of each circuit represent the negative (master) and the positive (slave) edge triggering latches of the SET-FF, respectively. The specific features of this design can be summarized as follows.

### A. Fully Static Operation

Similar to TSPC18 [5], both TSPC17-V1/2 circuits have fully static storage elements in each latch. For instance, when D=0/CLK=0, the master (left) latch captures $\overline{D}$ ('1') during the negative phase of the clock and stores it at node SP through M1 and M2. During this phase, node LKP is at logic '1' through M7, and therefore M4 is turned on. As soon as the positive phase of the clock arrives (CLK=1), M7 turns off and the logic '1' at node SP turns on the M8-M10 path, pulling node LKP to logic '0'. In this way, the original D ('0') is transferred



Fig. 3: Proposed versions with worst-case setup(blue)/hold(red) paths indicated: (a) TSPC17-V1 (b) TSPC17-V2 (c) Clock slope sensitivity issue propagation in TSPC17-V2 (d) The worst-case setup/hold path parasitic decomposition of TSPC17-V1/V2 (e)  $T_{D-Q}/T_{clk-Q}$  vs setup time  $(T_{SU})$  of TSPC18 at 0.5VDD.

to node LKP. Furthermore, this new LKP value helps to sustain the original  $\overline{D}$  ('1') value at node SP through M3 during the positive clock phase. This functionality is the same for both versions.

Similarly, when D=1/CLK=0, node SP reaches $\overline{D}$ ('0') through M4 and M5, so that M8 is turned off. The original logic '1' stored at LKP (through M7) keeps M4 conducting through M9-M10 in version 1 and only through M9 in version 2 during the positive phase, preserving the SP node at logic '0'. This eventually sustains logic '1' at the LKP node through M6 during this period. To fully satisfy the static operation, M4 and M9 should be made smaller than M8 and M10 so that the logic '1' at node SP when D=0/CLK=0 will not be reset during the next positive clock phase. This can be easily achieved by setting the W/L ratio of M4 to $\sim$2/3-1/2. The circuit is fully functional even for the standard minimum device sizing ($0.15\,\mu$m/$0.06\,\mu$m) in 65nm technology.

The slave latch (M11-M15) in both circuits is similar and transfers the stored value at node LKP to the output Q. The two circuits are only different by the arrangement of M9-M10. The clock load of the two versions is 4 and 5 gate inputs respectively. The slower hold path issue in TSPC18 is overcome in TSPC17-V1/2 designs by decoupling the D input transistors from the clocked transistors in the master latch, further improving the PVT immunity at ULV levels. This is indicated by the red arrows in Fig. 3(a)-(b).

### B. Setup/Hold Time and $T_{D-Q}/T_{clk-Q}$ Delay Specifications

In the constraint measurements, the worst-case specifications should always be obtained from both '0' and '1' data inputs at the specific clock edge of the flip-flop. For instance, the parasitics along the M1-M2 path define the worst-case setup specification in both versions. Similarly, $M10\rightarrow M9\rightarrow M4\rightarrow M6$ in TSPC17-V1 and $M9\rightarrow M4\rightarrow M6$ in TSPC17-V2 define the hold specifications. The improvement to these specifications is only restricted by the M4-M9 device sizing requirement stated in Section III-A. With the aid of parasitic modeling [16], the decomposed setup and hold paths of both circuits are depicted in Fig. 3(d).

Here, the charging and the discharging times of node SP through the given paths are considered as the minimum time delays for the D input to become stable and for M6 to properly latch the LKP value.  $R_D$ ,  $C_{d/s}$  represent the drain resistance and the parasitic capacitances (off/on state) of the MOSFETs respectively [16]. With the aid of the modeling scheme presented in [16], the minimum setup/hold requirements can be written as follows:

$$\begin{aligned}
t_{setup} &= 0.69 \left[ R_{D1}C_1 + (R_{D1} + R_{D2})C_2 \right] \\
t_{hold\_V1} &= 0.69 \left[ R_{D10}C_5 + (R_{D10} + R_{D9})C_6 + (R_{D10} + R_{D9} + R_{D4})C_7 \right] \\
t_{hold\_V2} &= 0.69 \left[ R_{D9}C_3 + (R_{D9} + R_{D4})C_4 \right]
\end{aligned} \tag{2}$$
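The three expressions in eq. (2) share one pattern: each capacitance is weighted by the sum of the resistances between it and the start of the path. The minimal sketch below evaluates that pattern for an RC ladder. The resistance and capacitance values are hypothetical and equal for every stage, so only the relative ordering of the results is meaningful.

```python
def rc_ladder_delay(resistances, capacitances):
    """0.69 * sum_k (R_1 + ... + R_k) * C_k, the form used in eq. (2)."""
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    upstream_r = 0.0
    total = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r          # resistance accumulated up to this node
        total += upstream_r * c  # contribution of this node's capacitance
    return 0.69 * total


# Hypothetical values: 100 kOhm per device, 1 fF per node.
R, C = 100e3, 1e-15
t_setup = rc_ladder_delay([R, R], [C, C])           # R_D1, R_D2 with C_1, C_2
t_hold_v1 = rc_ladder_delay([R, R, R], [C, C, C])   # R_D10, R_D9, R_D4 with C_5..C_7
t_hold_v2 = rc_ladder_delay([R, R], [C, C])         # R_D9, R_D4 with C_3, C_4

print(f"t_setup   = {t_setup * 1e12:.0f} ps")
print(f"t_hold_V1 = {t_hold_v1 * 1e12:.0f} ps")
print(f"t_hold_V2 = {t_hold_v2 * 1e12:.0f} ps")
```

With identical stages, the three-device hold path of TSPC17-V1 comes out longer than the two-device path of TSPC17-V2, which is the qualitative trend the equations express.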

From [18], the expressions for the $R_D$s at near-$V_{th}$ levels can be written as:

$$R_{D\_P} \approx \frac{L}{(\lambda_{DS} + \lambda_{BS})WI_O} e^{\frac{-\left[(1+\lambda_{DS} + \lambda_{BS})V_{DD} - |V_{thp0}|\right]}{\eta v_T}} \tag{3}$$

$$R_{D\_N} \approx \frac{L}{(\lambda_{DS} + \lambda_{BS})WI_O} e^{\frac{-\left[(1+\lambda_{DS} + \lambda_{BS})V_{DD} - |V_{thn0}|\right]}{\eta v_T}} \tag{4}$$
where $\lambda_{BS/DS}$ represent the body and the DIBL effects of the MOSFETs. Other terms have their usual meanings [18]. Note that the term $\lambda_{BS}$ vanishes in the transistors whose $V_{BS}$=0 (i.e., M1, M10). The hold time of TSPC17-V1 is slightly higher than that of TSPC17-V2, while both have similar worst-case setup times. More importantly, the operation of the transistors can be used to predict the nature of the hold constraints. For instance, when D=1/CLK=0 in TSPC17-V2, the corresponding value at node LKP is already latched through M5$\rightarrow$M4$\rightarrow$M6. As soon as CLK=1, the same condition is maintained through M9$\rightarrow$M4$\rightarrow$M6. The time difference between these two operations is very small, so TSPC17-V2 should exhibit a negative or very small hold time. For TSPC17-V1, the hold time should be larger than that of TSPC17-V2, or even slightly positive, due to the longer hold path M10$\rightarrow$M9$\rightarrow$M4$\rightarrow$M6.
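To make the supply-voltage dependence of eqs. (3)-(4) concrete, the sketch below evaluates the drain-resistance expression at a few supply voltages. Every device parameter here is a hypothetical placeholder (not extracted from the 65nm process), and the thermal voltage is taken as roughly 25.9 mV near room temperature.

```python
import math

THERMAL_VOLTAGE = 0.0259  # kT/q near room temperature (V)


def drain_resistance(vdd, vth0, width, length, i0, lam_ds, lam_bs, eta):
    """Near-threshold drain resistance following the form of eqs. (3)-(4)."""
    lam = lam_ds + lam_bs
    exponent = -((1.0 + lam) * vdd - abs(vth0)) / (eta * THERMAL_VOLTAGE)
    return length / (lam * width * i0) * math.exp(exponent)


# Hypothetical device parameters, for illustration only.
params = dict(vth0=0.45, width=0.28e-6, length=0.06e-6, i0=1e-6,
              lam_ds=0.1, lam_bs=0.05, eta=1.4)

for vdd in (0.5, 0.4, 0.3):
    print(f"VDD = {vdd:.1f} V -> R_D = {drain_resistance(vdd, **params):.3e} ohm")
```

The exponential term dominates: in this form, each 100 mV drop in $V_{DD}$ multiplies $R_D$ by $e^{(1+\lambda_{DS}+\lambda_{BS}) \cdot 0.1 / (\eta v_T)}$, which is why the setup and hold paths of eq. (2) slow down so sharply towards the threshold voltage.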

Another crucial parameter that is often overlooked in flip-flops is the data-to-Q delay $(T_{D-Q})$. Some authors only consider the clk-to-Q delay $(T_{clk-Q})$ to evaluate circuit performance, although it is not recommended as a rigorous metric for flip-flop performance [5], [11], [19]. On the other hand, $T_{clk-Q}$ is an important parameter for CAD STA tools and is therefore utilized in cell characterization rather than $T_{D-Q}$. In this work, we consider the minimum $T_{D-Q}$ ($T_{D-Q,opt}$) as in [11] for the cell-level performance and energy measurements in Section IV. $T_{clk-Q}$ will be used along with the setup times in ASIC library generation. The relationship between these two parameters is highlighted in Fig. 3(e). As depicted, $T_{D-Q}$ gradually converges to the minimum $T_{D-Q,opt}$ point and after that it increases exponentially. $T_{clk-Q}$ is almost constant until its tripping point, after which it also converges exponentially to $T_{D-Q}$ due to the metastability of the flip-flop. Traditionally, the 10% delay degradation point (10% of the constant $T_{clk-Q}$, as shown in Fig. 3(e)) [15] defines the setup time $T_{SU}$ of the flip-flop.
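The two operating points described above can be located numerically on a swept curve. The sketch below uses a toy $T_{clk-Q}$ model (flat far from the clock edge, rising exponentially near it) and assumes the usual definition $T_{D-Q} = T_{SU} + T_{clk-Q}$. The model shape and all of its constants are assumptions for illustration and are not fitted to Fig. 3(e).

```python
import math

T_NOMINAL = 1.0   # flat clk-to-Q delay far from the clock edge (ns), assumed
T_KNEE = 0.2      # skew where degradation becomes visible (ns), assumed
TAU = 0.05        # steepness of the degradation (ns), assumed


def clk_to_q(t_su):
    """Toy clk-to-Q delay (ns) as a function of the data-to-clock skew."""
    return T_NOMINAL * (1.0 + math.exp(-(t_su - T_KNEE) / TAU))


def d_to_q(t_su):
    """D-to-Q delay, assuming T_D-Q = T_SU + T_clk-Q."""
    return t_su + clk_to_q(t_su)


skews = [k * 0.0005 for k in range(2001)]   # sweep 0 .. 1 ns

# Skew that minimizes the D-to-Q delay, and the minimum itself.
t_su_opt = min(skews, key=d_to_q)
t_dq_opt = d_to_q(t_su_opt)

# Smallest skew whose clk-to-Q delay is within 10% of the flat value.
t_su_10pct = min(s for s in skews if clk_to_q(s) <= 1.1 * T_NOMINAL)

print(f"T_D-Q,opt = {t_dq_opt:.3f} ns at skew {t_su_opt:.4f} ns")
print(f"10% degradation setup time = {t_su_10pct:.4f} ns")
```

In a real flow, `clk_to_q` would be replaced by measured or simulated delays; the two search steps stay the same.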

### C. Contention Free Operation

In the proposed TSPC17-V1/V2 versions, the excitation of the internal nodes (SP, LKP) occurs in different clock phases, which avoids contention in the master latch. For instance, the M7 and M8-M10 paths, or the M1-M2 and M9-M10 paths, never operate simultaneously. Similarly, in the slave latch, the latching transistors M14 and M15 are blocked by M11 and M12 to avoid possible race conditions at the output node. In this way, possible short-circuit paths are avoided and the robustness of the circuits is improved at ULV levels.

### D. Clock Load and Clock Slope Sensitivity

Both clock load and the clock slope sensitivity of flip-flops are vital parameters to the clock network design [11]. The TSPC relaxes the sizing of clocked transistors and in fact, minimizes the overall clock load. A higher clock slope sensitivity [19], on the other hand, requires larger clock buffers and this parameter was shown to be worst for the traditional TGL24 [19] design owing to the dual-phase clocking. However, this has not been evaluated for TSPC SET-FFs in previous literature. In a nutshell, the effective capacitive load seen at the clock input and the clock slope sensitivity determine the strength of the required clock buffers in the clock network design. In cell characterization, the sensitivity is usually quantified as the clk-to-Q rise/fall propagation delay against different clock slew rates and output loads.

Fig. 3(c) illustrates the clock slope sensitivity propagation of TSPC17-V2 from the master latch output to the entire slave latch. The opened switches denote off-state MOSFETs (i.e., M7) in the circuit. The inertial delay of a clocked transistor is defined as the delay of its gate input signal in activating the transistor. This depends entirely on the local parasitics [16] and the driving capability [19] of the clock drivers. The inertial delay is denoted by $\tau_D$ [16] and is evidently degraded by larger clock slopes. Considering the CLK=$\uparrow$/D=0/Q$\rightarrow$0 operation of Fig. 3(c), the $T_{clk-Q}$ delay through M10 in the slave circuit can be computed as [16]:

$$\begin{aligned}
t_{clk-M10-Q} = {} & \tau_{D10} + 0.69 \left[ R_{D10} C_{eq1} + (R_{D10} + R_{D8})(C_{eq2} + C_{g11} + C_{g12}) \right. \\
& \left. {} + R_{D11}(C_{eq3} + C_{g16} + C_{g17}) + R_{D17}(C_{eq4} + C_L) \right]
\end{aligned} \tag{5}$$

where the $C_{eq}$s represent equivalent drain/source parasitics and $t_{r/f}$ represents the rise/fall time of the clock signal. Here it is assumed that the parasitic time constants (0.69RC) to the M11-M12 and M16-M17 gate terminals are larger than their $\tau_{D}$s, and hence their $\tau_{D}$s are not added to the equation. Moreover, from [16], the time-variant nature of $R_{D10}$ and $C_{eq1}$ due to the degraded clock is given by:

$$C_{eq1} = \int_{t_1}^{t_2} C_k \, dt, \quad R_{D10} = \int_{t_1}^{t_2} R_k \, dt; \quad t_{r/f} = t_2 - t_1 \tag{6}$$

It is evident from eq. (6) that the increased $t_{r/f}$ due to the clock slope degradation eventually increases the $\tau_{D10}$, $R_{D10}$ and $C_{eq1}$ parameters in eq. (5) and therefore results in a larger $T_{clk-Q}$.

## IV. CELL-LEVEL PERFORMANCE EVALUATION

This section evaluates the cell-level figures of merit of the SET-FFs discussed in this paper. Prevalent designs such as TGL24 [1], TSPC24 [6] and TSPC18 [5] are compared to the designs proposed in this paper. All the designs were implemented in 65nm bulk CMOS technology, with layouts and parasitics extracted in the Cadence Virtuoso environment. The simulations were carried out in the Cadence Spectre Spice environment. In all SET-FFs, the PMOS/NMOS width ratio of the internal devices was set to $0.28\,\mu$m/$0.2\,\mu$m and that of the output driver (inverter) to $0.45\,\mu$m/$0.3\,\mu$m, respectively. This specification is sufficient to withstand ULV operation and brings fairness to the evaluation. The worst-case input scenario for each metric of each SET-FF is considered. For instance, under typical conditions (TT, 25°C) at 100% data activity rate, TGL24 consumes its highest power when it samples logic '1'. Similarly, its worst-case $T_{D-Q,opt}$ is observed when CLK=$\uparrow$/D=1.

### A. Power, Delay and PDP of SET-FFs

Fig. 4(a)-(b) depict the true power consumption of SET-FFs against 100% and 0% data activity rates ( $\alpha$ ) at 100 MHz, under typical (TT/25°C) conditions. The measurement setup is depicted in Fig. 4(d). When  $\alpha$ =100% (Fig. 4(a)), proposed versions (TSPC17-V1/2) consume the lowest power. They are followed by TSPC18, TSPC24 and TGL24, consuming 12-16%, 32-37% and 46-47% more power respectively at 1.2-0.5V levels. Interestingly, when  $\alpha$ =0% (Fig. 4(b)), TSPC18 outperforms both proposed versions by  $\sim$ 16-24% at 0.5-1.2V



Fig. 4: (a)-(b) Power consumption at  $\alpha$ =100-0% (c) Worst-case  $T_{D-Q,opt}$  (d) True power measurement (excluding drivers and loads) (e)-(f) PDP (Power× $T_{D-Q,opt}$ ) at  $\alpha$ =100-0% —All measurements under TT, 25°C conditions.

levels. Still the proposed versions are superior to TSPC24 and TGL24, consuming 14-18% and 14-22% less power at the same voltage levels. When the flip-flops are under a considerable switching activity rate, the proposed TSPC17-V1/2 designs are the best option for low power.

From Fig. 4(c), the TSPC24 is evidently the slowest design. The traditional TGL24 outperforms all other designs in most of the cases except at 0.5V. Proposed versions outperform TSPC18 and TSPC24 in all levels except at 0.5V. More specifically at 0.9V, TSPC17-V1/2 are 4.6%/25.7% faster than TSPC18/TSPC24 designs and 10.5% slower than TGL24. All in all, the proposed versions showcase balanced trade-offs between power-delay profiles.

The overall energy efficiency of SET-FFs in terms of power-delay-product (PDP) is depicted in Fig. 4(e)-(f). This is calculated as the product of power and $T_{D-Q,opt}$ [11]. When $\alpha$=100%, similar to the power profile, the proposed versions outperform all other designs. More specifically, TSPC17-V1 is 4.6%, 10.5%, 37% and 45% better than the TSPC17-V2, TSPC18, TSPC24 and TGL24 designs at 0.5V, respectively. Similar savings can be observed at 1.2V as well. However, when $\alpha$=0%, TSPC18 becomes the most energy-efficient design owing to its better delay and lower power at zero activity rates. Except for this case, the TSPC17-V1/2 versions demonstrate better or comparable energy efficiency to other designs at all voltage levels. Even though TSPC18 exhibits favorable numbers when $\alpha$=0%, such scenarios rarely exist.
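The PDP comparison reduces to two small computations: the product itself and the relative saving against a reference design. The sketch below shows both; the power and delay figures are hypothetical placeholders and are not values read from Fig. 4.

```python
def power_delay_product(power_w, t_dq_opt_s):
    """PDP in joules: average power times the optimal D-to-Q delay."""
    return power_w * t_dq_opt_s


def saving_percent(candidate, reference):
    """Relative improvement of candidate over reference, in percent."""
    return 100.0 * (reference - candidate) / reference


# Hypothetical cell-level figures, for illustration only.
pdp_candidate = power_delay_product(100e-9, 2.0e-9)   # 100 nW, 2.0 ns
pdp_reference = power_delay_product(120e-9, 2.2e-9)   # 120 nW, 2.2 ns

print(f"candidate PDP: {pdp_candidate * 1e18:.0f} aJ")
print(f"reference PDP: {pdp_reference * 1e18:.0f} aJ")
print(f"saving: {saving_percent(pdp_candidate, pdp_reference):.1f} %")
```

Because PDP multiplies the two metrics, a design that is slightly slower can still come out ahead if its power advantage is larger, which is the trade-off discussed for the proposed versions.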

### B. Clock Slope Sensitivity and $T_{clk-Q}$ Degradation

The impact of the clock slope sensitivity on $T_{clk-Q}$ was outlined in Section III-D. Fig. 5 gives a deeper look into the impact of the clock slope and the output load on the propagation delays of flip-flops at the 0.5V typical near-$V_{th}$ level. This impact has been quantified by the increase of the $T_{clk-Q}$ propagation delays in Fig. 5(a)-(b).

In [19], the clock slope sensitivity has only been evaluated as the factor of increase of the $T_{clk-Q}$ delay ($\Delta T_{clk-Q}$). However, this does not accurately reflect the impact of the clock slope on the clock network design. In a realistic design environment, the absolute $T_{clk-Q}$ values are more meaningful than $\Delta T_{clk-Q}$ for determining the required clock buffer strength. Fig. 5(a) illustrates this behavior against different clock slopes and output loads. Here, the FO1 load ($C_L$) represents the capacitive load of a $\times 1$ inverter at the Q output. Similarly, the FO1 slope represents the clock slew when a $\times 1$ clock driver drives a $\times 1$ inverter load. From Fig. 5(a), the lowest $T_{clk-Q}$ delay can be observed in the proposed TSPC17-V2 version for all slope/loading conditions. This is followed by TSPC17-V1, TSPC24, TSPC18 and finally TGL24. More specifically, when $C_L=1\times FO_1$ and $12\times FO_1$, the proposed versions are roughly 37-43% and 18-20% faster than the traditional TGL24 version. In the same figure, it can be observed that, when the clock slope increases from $2\times FO_1$ to $560\times FO_1$ at $C_L=1\times FO_1$, the $\Delta T_{clk-Q}$ of the TSPC17-V1 and TGL24 designs has increased by 0.35 ps from 2.25 ns (13.5%) and 0.31 ps from 3.9 ns (7.2%), respectively. For other loading conditions, this difference is smaller.

Fig. 5(b) represents the absolute values of $T_{clk-Q}$ for all slope/load conditions. As depicted, TSPC18 is the slowest design in most cases. The fastest design is TSPC17-V2, and it is followed by the rest as in the previous case. The normalized



Fig. 5: (a)-(b) Impact of the clock slope and the output load ( $C_L$ ) of SET-FFs on  $T_{clk-Q}$  (c)  $\Delta T_{clk-Q}$  against clock slope and  $C_L$ 

$T_{clk-Q}$ to the corresponding $T_{clk-Q,min}$ of each loading condition, or in other words $\Delta T_{clk-Q}$, is given in Fig. 5(c). Despite the faster $T_{clk-Q}$ operation, the proposed designs now show a higher clock slope sensitivity (normalized to $1\times FO_1$ slope/load) compared to the other designs. However, this should not be mistaken for an inferiority, as the absolute values matter most in a realistic situation.

### C. PVT Resilience of TSPC17-V1/2

Fig. 6 depicts the PVT variations of the proposed SET-FFs. Recall that the setup paths of the proposed designs are identical to that of the TSPC18 version. As anticipated in Section II, both TSPC17-V1/2 designs exhibit better $\mu$, $\sigma$ and yields compared to the latest TSPC18 design (see Fig. 2) thanks to their faster hold paths. However, for the mentioned device aspect ratios (internal=$0.28\,\mu$m/$0.2\,\mu$m and output=$0.45\,\mu$m/$0.3\,\mu$m), the TGL24 and TSPC24 designs still demonstrate better resilience despite their circuit overhead. All in all, the proposed designs deliver balanced trade-offs between the PVT resilience and the power consumption.

## V. NEAR-$V_{th}$ CHARACTERIZATION STRATEGY

The accuracy at ULV levels is paramount for the characterization and VLSI integration of flip-flops. Even though the CCS scheme [20] is recommended for accurate timing, power or noise modeling, the automated characterization has posed a significant impediment to its adoption [21] and hence requires



Fig. 6: PVT resilience and the  $3\sigma$  yields (%) of TSPC17-V1/2 against 0.5V/TT/25°C and C<sub>L</sub>=10fF conditions



Fig. 7: CCS driver (blue) / receiver (red) model with RC networks

a pre-characterization flow. In particular, the following measurement scenarios should be re-evaluated to improve the characterization accuracy of ULV flip-flops.

### A. Stimuli Generation in CCS Driver Model

Generally, a cell is characterized for different input slew and output load combinations, similar to the case in Fig. 5(a)-(b). In reality, the input signal waveform to the cell may arrive from any driving cell. Hence, the fast ramp stimulus used in typical cell characterization may differ significantly from the waveform of the actual pre-driver cell even when both have the same slew value. This is vital for flip-flops due to their impact on the design constraints. To minimize this error, it is recommended to use an averaged waveform signal [15] based on the empirical values of two extremes: a fast ramp function with no RC network and a slow/exponential function with a significant RC network. This situation is illustrated in Fig. 7. Here, the intrinsic parasitics of the CCS driver are denoted by $R_D$ and $C_{eq}$. $C_{slew}$ is a purely capacitive load used to adjust the input slew value to the cell. $R_{Poly}$, $C_{Cont}$ and $C_{G\_CK}$ represent the poly-silicon



Fig. 8: Data (red) sweeping towards the clock signal (blue) and the glitch-peak criteria at  $0.5 \text{V/TT}/25^{\circ}\text{C}$ .



Fig. 9: The parasitic behavior of MOSFETs during the switching at near- $V_{th}$  levels (off $\rightarrow$ on; rise/fall time =  $t_1$ - $t_2$ ) [16].

wire, the metal1-poly contact and the intrinsic parasitics at the CCS receiver. The waveforms Ramp and Exp represent the two extreme stimuli depending on the depth of the $R_M/C_M$ network and the drivers. The averaged piece-wise linear (PWL) waveform is represented in green. Commercial characterization tools (e.g., Cadence Liberate) can be instructed to pre-calculate the extremes and the average based on all possible driver/wire-load conditions. For instance, a given slew can be generated in Ramp and Exp fashions by using the strongest-driver/fastest RC wire-load and the weakest-driver/slowest RC wire-load combinations, respectively. The use of such an analytical waveform instead of an active driver avoids the impact of secondary effects, such as the Miller effect, on the circuit.
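The averaging of the two extreme stimuli can be sketched as follows. This is an illustrative construction under stated assumptions, not the procedure of any particular characterization tool: the slew is assumed to be measured between 10% and 90% of $V_{DD}$, the Ramp extreme is an ideal linear ramp, the Exp extreme is a single-pole RC response, and the average is a simple point-wise mean of the two.

```python
import math


def averaged_pwl(slew, vdd=0.5, points=11):
    """Point-wise mean of a linear ramp and an RC exponential sharing one 10-90% slew.

    Returns a list of (time, voltage) pairs usable as a PWL stimulus.
    """
    t_ramp = slew / 0.8            # 0-100% duration of a ramp with this 10-90% slew
    tau = slew / math.log(9.0)     # RC time constant with the same 10-90% slew
    t_end = 5.0 * tau              # the exponential has essentially settled by 5*tau
    pwl = []
    for k in range(points):
        t = t_end * k / (points - 1)
        ramp = vdd * min(t / t_ramp, 1.0)
        expo = vdd * (1.0 - math.exp(-t / tau))
        pwl.append((t, 0.5 * (ramp + expo)))
    return pwl


# Hypothetical 1 ns slew at a 0.5 V supply.
for t, v in averaged_pwl(1e-9):
    print(f"{t * 1e9:6.3f} ns  {v:.4f} V")
```

The resulting pairs can be written out as a PWL source for the simulator, so that the characterized cell sees a waveform between the two extremes without an active driver in the test bench.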

### B. Constraint Measurements

Typical flip-flop constraints include setup/hold, recovery/removal (asynchronous) and minimum pulse width measurements. Traditionally, these measurements were quantified by the 10% delay degradation method [15], [22], [23], similar to $T_{SU}$ in Fig. 3(e). However, in certain cases and for certain designs, the binary search algorithms [24] in the general setup cannot converge to a solution. For instance, the hold time of TSPC17-V2 is quite small, and therefore sweeping D with respect to CK (clock) results in a diminutive resolution in the clk-Q delays. In such a case, capturing the 10% output degradation point is difficult, and the output glitch peak of the flip-flop, which occurs before that point, can be considered as an alternative. Fig. 8 illustrates this behavior of the TSPC17-V2 design, in which the data pulse (red) is swept towards the rising edge of the clock (blue) signal. In this particular case, even if the 10% degradation point is found by an excessive number of bisection runs, the resulting hold time is too small to be meaningful. Instead, the 10% $V_{DD}$ crossing point of the output glitch peak is used. The removal constraint may utilize this method for the asynchronous (set/reset) inputs as well.
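A bisection search on the glitch-peak criterion can be sketched as below. The function `glitch_peak` is a hypothetical stand-in for a circuit simulation run: it returns a toy output glitch amplitude that shrinks as the data transition moves away from the clock edge. Its shape and constants are assumptions for illustration and do not model TSPC17-V2.

```python
import math

VDD = 0.5  # supply voltage (V)


def glitch_peak(offset_ns):
    """Toy stand-in for a simulation: output glitch peak (V) vs. data-to-clock offset."""
    return VDD / (1.0 + math.exp((offset_ns - 0.05) / 0.01))


def bisect_constraint(measure, threshold, passing, failing, tol=1e-4):
    """Shrink the [failing, passing] bracket until it is narrower than tol.

    `passing` must satisfy measure(passing) <= threshold and `failing` must not.
    Returns the tightest offset known to pass.
    """
    while abs(passing - failing) > tol:
        mid = 0.5 * (passing + failing)
        if measure(mid) <= threshold:
            passing = mid
        else:
            failing = mid
    return passing


# Constraint: the output glitch peak must stay at or below 10% of VDD.
hold_time = bisect_constraint(glitch_peak, 0.1 * VDD, passing=0.2, failing=-0.1)
print(f"hold constraint = {hold_time * 1e3:.1f} ps")
```

The same search loop applies to the 10% delay degradation criterion by swapping the measured quantity and the threshold; only the pass/fail test changes.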

### C. The Input Capacitance Measurements

In ASIC libraries, the capacitance measurements are stored as rise/fall dynamic capacitance under a specific input pin of

TABLE I: 320-bit Shift Register Gate-Level Power ($\mu$W) and Clk<sub>max</sub> (ns)

| Design    | Power_SP ($\mu$W) | Power_GL ($\mu$W) | Clk_SP (ns) | Clk_GL (ns)   |
|-----------|-------------------|-------------------|-------------|---------------|
| TSPC17-V1 | 41.40             | 41.60 (0.48%)     | 10.70       | 10.66 (0.37%) |
| TSPC17-V2 | 41.30             | 40.80 (1.21%)     | 11.10       | 11.00 (0.72%) |
| TSPC18    | 43.10             | 42.20 (2.09%)     | 11.00       | 11.10 (0.90%) |
| TSPC24    | 57.50             | 56.33 (2.03%)     | 12.50       | 12.50 (1.18%) |
| TGL24     | 67.76             | 67.97 (0.31%)     | 10.35       | 10.54 (1.80%) |

TABLE II: 320-bit Shift Register Post-Layout Power ($\mu$W) and Clk<sub>max</sub> (ns)

| Design    | Power ($\mu$W), $\alpha = 100\% \rightarrow 20\%$ | Clk_max (ns) | Clk_load (fF), gate/wire | Area ($\mu\text{m}^2$) |
|-----------|---------------------------------------------------|--------------|--------------------------|------------------------|
| TSPC17-V1 | 57.86 (30%) → 39.4 (30%)                          | 14.8         | 683/177                  | 3001                   |
| TSPC17-V2 | 58.90 (29%) → 43.5 (22%)                          | 14.5         | 793/186                  | 2980                   |
| TSPC18    | 60.15 (27%) → 38.7 (31%)                          | 14.3         | 755/187                  | 3101                   |
| TSPC24    | 82.74 → 56.0                                      | 15.0         | 1125/230                 | 4338                   |
| TGL24     | 76.05 (8.1%) → 54.6 (2.6%)                        | 14.9         | 398/159                  | 3304                   |

a cell. The accuracy of these measurements is vital for flip-flops, as they directly impact the clock network power. Typical cell characterization measures the instantaneous dynamic capacitance of an input during the switching time of the MOSFET [25]. This is fair for a time-invariant response (i.e., for $C_{Cont}$ and $C_{M}$ in Fig. 7), which is not the case for the intrinsic parasitics ($C_{G\_CK}$ in Fig. 7) owing to the weak channel formation and fringing effects [18] at near-$V_{th}$ levels. The behavior of the MOSFET's intrinsic parasitics in 65nm and 40nm nodes [16] is depicted in Fig. 9. Instead of the instantaneous capacitance, the effective average capacitance during the $t_1$-$t_2$ period should be used in measurements. For $C_{G\_CK}$, this can be calculated as:

$$C_{av} = \frac{1}{\frac{dV(t)}{dt}} \int_{t_1}^{t_2} i(t) dt$$
 (7)

where i(t) and dV(t)/dt represent the CCS receiver injected current and the slew rate of the CCS driver waveform respectively. For better accuracy, the CCS capacitance measurement thresholds should be at 1% and 99% of the input  $V_{\rm DD}$  level.
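As a minimal sketch of how such an effective capacitance could be extracted from simulated waveforms, the Python snippet below integrates the receiver current between the 1% and 99% crossings of a rising input ramp and divides the collected charge by the voltage swing. For a linear ramp this equals the time average of $i(t)/(dV/dt)$ over $t_1$-$t_2$. This normalization, the helper name `effective_capacitance` and the synthetic waveform are illustrative assumptions and are not taken from the authors' characterization scripts.

```python
import numpy as np

def effective_capacitance(t, v, i, vdd, lo=0.01, hi=0.99):
    """Charge-averaged input capacitance over the lo*vdd..hi*vdd window
    of a monotonically rising input ramp (hypothetical helper)."""
    t, v, i = (np.asarray(x, dtype=float) for x in (t, v, i))
    # Keep only the samples inside the measurement window [t1, t2].
    window = (v >= lo * vdd) & (v <= hi * vdd)
    tw, vw, iw = t[window], v[window], i[window]
    # Trapezoidal integration of i(t) dt gives the injected charge.
    charge = np.sum(0.5 * (iw[1:] + iw[:-1]) * np.diff(tw))
    # Dividing by the voltage swing yields the average capacitance.
    return float(charge / (vw[-1] - vw[0]))

# Synthetic check: a 1 ns linear ramp to 0.5 V driving a voltage-dependent
# capacitance C(v) = 1 fF + 2 fF * v/VDD (values chosen for illustration).
VDD, T_RAMP = 0.5, 1e-9
t = np.linspace(0.0, T_RAMP, 1001)
v = VDD * t / T_RAMP
i = (1e-15 + 2e-15 * v / VDD) * (VDD / T_RAMP)  # i = C(v) * dV/dt
c_av = effective_capacitance(t, v, i, VDD)
print(f"C_av = {c_av * 1e15:.3f} fF")
```

In this synthetic case the instantaneous capacitance ranges from roughly 1 fF at the start of the window to roughly 3 fF at its end, so a single-instant measurement would depend strongly on where it is sampled, whereas the averaged value does not.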

# VI. VLSI INTEGRATION OF ULV SET-FFS

This section summarizes the merits of the SET-FFs and the proposed characterization strategy in a large-scale integration. First, we present a lightweight experiment that evaluates the characterization accuracy of the SET-FFs with gate-level parasitics only. The 320-bit shift register (16 parallel chains of 20 FFs each) presented in [5] is used for this experiment. We use Cadence Liberate for cell characterization (including intra-cell parasitics), Synopsys Design Compiler for logic synthesis and Synopsys PrimeTime for the gate-level delay/power (Clk\_GL/Power\_GL) measurements. The latter are compared to the Cadence Spectre Spice simulations (Clk\_SP/Power\_SP) in Table I for the different SET-FFs under 0.5V/TT/25°C conditions. Note that all 16 chains work concurrently at a 100% data activity rate. The power consumption was measured at 71 MHz, and the observations reflect the cell-level merits of each SET-FF (Section IV-A) in the shift register integration. More importantly, the error between the gate-level digital simulations and the Spice-level analog simulations was maintained at $\leq 2\%$ for all the designs thanks to the adopted characterization strategy.





| SS/-40°C | 0.5V/TT/25°C |      |
|----------|--------------|------|
| 0.24V    | 3.0          | 0.14 |
| 0.24V    | 2.5          | 0.02 |
| 0.28V    | 3.0          | 2.80 |
| 0.28V    | 4.2          | 0.25 |
| 0.30V    | 1.8          | 0.83 |

Fig. 10: (a) Layout of the 320-bit shift register (clock tree highlighted) (b) Worst-case power break-down of the shift-register (0.5V/TT/25 $^{\circ}$ C) (c)  $V_{DD,min}$  and setup/hold specifications of the SET-FFs at worst and typical corners (setup/hold at  $C_L$ =FO4 and D/CLK slopes=1ns).

# A. 320-bit Shift Register: SET-FFs in Isolation

A more comprehensive analysis based on the fully placed and routed designs in the Cadence Innovus environment is given in Table II. It accounts for both the cell-level (.lib) and interconnect-level (.spef) parasitics of the shift registers, with the numbers obtained from less cumbersome digital simulations. In the physical design experiment, the maximum transition of the data paths ( $Tr_{max\_d}$ ), the maximum transition of the clock network ( $Tr_{max\_clk}$ ) and the clock skew ( $t_{skew}$ ) were set to 20%, 10% and 10% of the clock period (15 ns, i.e. a 66.66 MHz clock) respectively under 0.5V/TT/25°C conditions. Hold buffers were not needed for this experiment. However, additional buffers were used to fix the transition/capacitance violations in the data paths. For the clock-tree synthesis, only the balanced inverters (CKINVX) in the cell library were used.
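Reading the three percentages as fractions of the clock period (an interpretation, since transition and skew limits are times), the 66.66 MHz target corresponds approximately to the following constraint values:

$$T_{clk} = \frac{1}{66.66\,\text{MHz}} \approx 15\,\text{ns}, \qquad Tr_{max\_d} = 0.2\,T_{clk} \approx 3\,\text{ns}, \qquad Tr_{max\_clk} = t_{skew} = 0.1\,T_{clk} \approx 1.5\,\text{ns}$$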

The clock load (Clk\_load) in Table II comprises the total gate input load of the clock buffers and standard cells (gate) and the wire load of the clock line (wire). The TSPC17-V1, TSPC17-V2, TSPC18, TSPC24 and TGL24 designs required 104, 117, 121, 180 and 70 clock-tree components respectively. Despite the local clock buffer usage in the TGL24 design (Fig. 1(b)), its clock input pin presents just an inverter load. Therefore, the TGL24 shift register requires considerably fewer clock network inverters/buffers during clock-tree synthesis than the other designs, which results in the lowest gate load of 398 fF in the clock line. Conversely, TSPC24 required the highest number owing to its higher clock input load and higher $T_{clk-Q}$ sensitivity. These experimental results confirm that the absolute $T_{clk-Q}$ value has more impact on the clock network power than $\Delta T_{clk-Q}$, as mentioned in Section IV-B.
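To make the gate/wire split of Table II easier to compare, the following Python sketch sums the two components per design and attaches a first-order $C V_{DD}^2 f$ switching-power estimate for the clock line. The estimate is an assumption for illustration only: it uses the 0.5V supply and the 66.66 MHz clock target, and it ignores cell-internal clock power (such as the local clock buffers of TGL24), short-circuit power and leakage, so it is not expected to reproduce the clock network power break-down of Fig. 10(b). The names `CLK_LOAD_FF` and `clock_line_estimate` are hypothetical.

```python
# Clock-line loads from Table II in fF: (gate, wire).
CLK_LOAD_FF = {
    "TSPC17-V1": (683, 177),
    "TSPC17-V2": (793, 186),
    "TSPC18": (755, 187),
    "TSPC24": (1125, 230),
    "TGL24": (398, 159),
}
VDD = 0.5        # V, supply voltage of the experiment
F_CLK = 66.66e6  # Hz, clock target of the physical design (assumed here)

def clock_line_estimate(gate_ff, wire_ff, vdd=VDD, f_clk=F_CLK):
    """Return (total load in fF, first-order C*V^2*f power in uW)."""
    total_ff = gate_ff + wire_ff
    # fF -> F (1e-15), then W -> uW (1e6).
    power_uw = total_ff * 1e-15 * vdd ** 2 * f_clk * 1e6
    return total_ff, power_uw

estimates = {name: clock_line_estimate(*load)
             for name, load in CLK_LOAD_FF.items()}
for name, (total_ff, power_uw) in estimates.items():
    print(f"{name:10s} {total_ff:5d} fF  ~{power_uw:5.2f} uW")
```

The totals show that TGL24 has the smallest external clock-line load and TSPC24 the largest, in line with the number of clock-tree components each design required.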

Moreover, the complexity of the TSPC24 cell slightly increases the routing congestion and therefore leads to a slightly higher wire load in the clock line. This is reflected in the area as well. All designs have comparable critical path delays, with TSPC18 being the fastest. This observation is aligned with the $T_{D-Q}$ delays in Fig. 4(c). The proposed designs consume the lowest power in the shift register. The TSPC17-V1, TSPC17-V2 and TSPC18 designs consume 30%, 29% and 27% less power than the most power-hungry design (TSPC24) at a 100% data activity rate. The observations are similar for the 20% activity rate. As anticipated, these designs are even superior to the conventional TGL24 design in the commercial standard cell library. For the 320-bit shift register, a clear correlation between the gate-level and post-layout power numbers can be seen. The layout of the 320-bit shift register is illustrated in Fig. 10(a).
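The bracketed percentages in Table II can be recomputed as power reductions relative to TSPC24 at each activity rate. The short Python sketch below does this with the tabulated values; the names `POWER_UW` and `saving_percent` are hypothetical.

```python
# Post-layout power from Table II in uW: (alpha = 100%, alpha = 20%).
POWER_UW = {
    "TSPC17-V1": (57.86, 39.4),
    "TSPC17-V2": (58.90, 43.5),
    "TSPC18": (60.15, 38.7),
    "TSPC24": (82.74, 56.0),
    "TGL24": (76.05, 54.6),
}

def saving_percent(power, reference):
    """Power reduction of a design relative to the reference, in percent."""
    return 100.0 * (reference - power) / reference

ref_100, ref_20 = POWER_UW["TSPC24"]
savings = {
    name: (saving_percent(p100, ref_100), saving_percent(p20, ref_20))
    for name, (p100, p20) in POWER_UW.items()
    if name != "TSPC24"
}
for name, (s100, s20) in savings.items():
    print(f"{name:10s} {s100:5.1f}% at alpha=100%, {s20:5.1f}% at alpha=20%")
```

Small deviations from the tabulated percentages are possible because the power values in the table are themselves rounded.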

A detailed breakdown of the worst-case power ( $\alpha$ =100%) for this experiment is given in Fig. 10(b); the total power is given in Table II. The lowest clock network power is reported by the TSPC17-V1 and TSPC18 designs, followed by TSPC17-V2. This is expected, since TSPC17-V2 requires one additional clocked transistor (see Fig. 3(a)-(b)). TGL24 and TSPC24 come next in this category. More specifically, TSPC17-V1's clock network power is 13.6%, 32.5% and 32.2% lower than that of TSPC17-V2, TSPC24 and TGL24 respectively, while being comparable to that of TSPC18. In the register power category, TSPC17-V2 saves roughly 11.4%, 24.6%, 39.8% and 24% of register power compared to the TSPC17-V1, TSPC18, TSPC24 and TGL24 designs respectively. All in all, the TSPC17-V1/2 designs show their superiority in different categories.

Additional specifications related to the SET-FFs are summarized in Fig. 10(c). V<sub>DD,min</sub> represents the lowest supply voltage at which the circuit still functions correctly. The proposed versions can safely operate down to 0.24V at the SS/-40°C corner owing to the faster setup/hold paths of the circuits. This is also supported by the worst-case setup and hold values in the table. TSPC17-V1/2 report the smallest (positive) hold values and will therefore relax the hold buffer requirement in complex digital designs.

# CONCLUSION

This work presented novel ultra-low-voltage (ULV) SET-FFs based on the True-Single-Phase-Clocking scheme to improve the clock network and register power efficiency in digital subsystems. In addition to the power saving, the proposed designs provide fully static, contention-free functionality to satisfy ULV operation. At the 0.5V near- $V_{th}$ level in 65nm bulk CMOS technology, the proposed SET-FFs demonstrate energy-efficiency improvements of 11-45% and 7-20% at 0% and 100% data activity rates compared to prevalent and commercial SET-FFs. In addition to these cell-level merits, this paper revisits critical parameters of SET-FFs in the near- $V_{th}$ voltage domain and provides a strategy that brings their characterization accuracy to within 2% of Spice-level simulations. This further enables the VLSI integration of these non-conventional cells for commercial end-use. The integration of the proposed SET-FFs in a 320-bit parallel shift register demonstrated reductions of up to 33% in clock network power and 17-39% in register power compared to the state-of-the-art and commercial standard cells at typical near- $V_{th}$ levels.

#### ACKNOWLEDGMENT

The research reported in this paper is supported by the Dutch NWO Applied and Engineering Sciences program ZERO: Towards Energy Autonomous Systems for IoT and by Dialog Semiconductor B.V., The Netherlands.

The experimental data and the cell-characterization scripts for this experiment can be accessed at: https://dx.doi.org/10.21227/0a2t-jg52

#### REFERENCES

- [1] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, *Digital Integrated Circuits*. Prentice Hall, Englewood Cliffs, 2002, vol. 2.
- [2] J. L. Shin, R. Golla, H. Li, S. Dash, Y. Choi, A. Smith, H. Sathianathan, M. Joshi, H. Park, M. Elgebaly et al., "The next generation 64b SPARC core in a T4 SoC processor," *IEEE Journal of Solid-State Circuits*, vol. 48, no. 1, pp. 82–90, 2012.
- [3] H. Kawaguchi and T. Sakurai, "A reduced clock-swing flip-flop (RCSFF) for 63% power reduction," *IEEE Journal of Solid-State Circuits*, vol. 33, no. 5, pp. 807–811, 1998.
- [4] F. Stas and D. Bol, "A 0.4-V 0.66-fJ/cycle retentive true-single-phase-clock 18T flip-flop in 28-nm fully-depleted SOI CMOS," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 65, no. 3, pp. 935–945, 2017.
- [5] Y. Cai, A. Savanth, P. Prabhat, J. Myers, A. S. Weddell, and T. J. Kazmierski, "Ultra-low power 18-transistor fully static contention-free single-phase clocked flip-flop in 65-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 2, pp. 550–559, 2018.
- [6] Y. Kim, W. Jung, I. Lee, Q. Dong, M. Henry, D. Sylvester, and D. Blaauw, "27.8 A static contention-free single-phase-clocked 24T flip-flop in 45nm for low-power applications," in *2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*. IEEE, 2014, pp. 466–467.
- [7] Y. Lee, G. Shin, and Y. Lee, "A Fully Static True-Single-Phase-Clocked Dual-Edge-Triggered Flip-Flop for Near-Threshold Voltage Operation in IoT Applications," *IEEE Access*, vol. 8, pp. 40232–40245, 2020.
- [8] N. Nedovic, M. Aleksic, and V. G. Oklobdzija, "Conditional techniques for low power consumption flip-flops," in *ICECS 2001. 8th IEEE International Conference on Electronics, Circuits and Systems (Cat. No. 01EX483)*, vol. 2. IEEE, 2001, pp. 803–806.
- [9] J. F. Lin, "Low-power pulse-triggered flip-flop design based on a signal feed-through," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 22, no. 1, pp. 181–185, 2013.
- [10] S. Lapshev and S. R. Hasan, "New low glitch and low power DET flip-flops using multiple C-elements," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 63, no. 10, pp. 1673–1681, 2016.
- [11] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part II—Results and figures of merit," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 19, no. 5, pp. 737–750, 2010.
- [12] A. Hirata, K. Nakanishi, M. Nozoe, and A. Miyoshi, "The cross charge-control flip-flop: A low-power and high-speed flip-flop suitable for mobile application SoCs," in *Digest of Technical Papers. 2005 Symposium on VLSI Circuits, 2005*. IEEE, 2005, pp. 306–307.
- [13] C. K. Teh, T. Fujita, H. Hara, and M. Hamada, "A 77% energy-saving 22-transistor single-phase-clocking D-flip-flop with adaptive-coupling configuration in 40nm CMOS," in *2011 IEEE International Solid-State Circuits Conference*. IEEE, 2011, pp. 338–340.

- [14] N. Kawai, S. Takayama, J. Masumi, N. Kikuchi, Y. Itoh, K. Ogawa, A. Ugawa, H. Suzuki, and Y. Tanaka, "A fully static topologically-compressed 21-transistor flip-flop with 75% power saving," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 11, pp. 2526–2533, 2014.
- [15] E. Salman and E. Friedman, *High performance integrated circuit design*. McGraw Hill Professional, 2012.
- [16] A. C. Ranasinghe and S. H. Gerez, "Glitch-Optimized Circuit Blocks for Low-Power High-Performance Booth Multipliers," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 9, pp. 2028–2041, 2020.
- [17] A. C. Ranasinghe and S. H. Gerez, "Ultra-Low Voltage 4-to-2 Compressors for Near-Vth Computing," in *2020 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 2020, pp. 1–5.
- [18] A. C. Ranasinghe and S. H. Gerez, "MEPNTC: A Standard-Cell Library Design Scheme Extending the Minimum-Energy-Point Operation of Near-Vth Computing," in *2020 IEEE 38th International Conference on Computer Design (ICCD)*. IEEE, 2020, pp. 96–104.
- [19] M. Alioto, E. Consoli, and G. Palumbo, "Flip-flop energy/performance versus clock slope and impact on the clock network design," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 57, no. 6, pp. 1273–1286, 2009.
- [20] M. S. M. Lee and C. N. J. Liu, "CCS Timing Library Characterization Guidelines, 2008," *IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences*, vol. 93, no. 3, pp. 595–606, 2010.
- [21] K. Chopra, C. Kashyap, H. Su, and D. Blaauw, "Current source driver model synthesis and worst-case alignment for accurate timing and noise analysis," in ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, 2006, pp. 45–50.
- [22] A. Millán, J. Juan, M. J. Bellido, P. Ruiz-de Clavijo, and D. Guerrero, "Characterization of normal propagation delay for delay degradation model (DDM)," in *International Workshop on Power and Timing Modeling, Optimization and Simulation*. Springer, 2002, pp. 477–486.
- [23] M. J. Bellido, J. J. Chico, and M. Valencia, *Logic-timing simulation and the degradation delay model*. Imperial College Press, 2005.
- [24] Star-HSPICE Manual, 1998, Avant Corp., Fremont, CA.
- [25] P. Feldmann, S. Abbaspour, D. Sinha, G. Schaeffer, R. Banerji, and H. Gupta, "Driver waveform computation for timing analysis with multiple voltage threshold driver models," in 2008 45th ACM/IEEE Design Automation Conference. IEEE, 2008, pp. 425–428.