In this paper, simple circuit techniques for designing efficient pulse-triggered flip-flops are presented. The proposed approach aims at considerably alleviating the detrimental effects of current contention mechanisms occurring at critical switching nodes of the circuits. In this way, both the latency and the power consumption of pulse-triggered flip-flops are reduced. The proposed approach is assessed by means of simulations in a commercial 90-nm ST CMOS technology. When applied to some recently proposed implicit pulse-triggered flip-flop architectures, the suggested design strategy improves speed by up to 13% and lowers the power-delay product by up to 14%. Moreover, the process variation tolerance is considerably improved.
INTRODUCTION
It is well known that choosing the appropriate flip-flop (FF) topology is of fundamental importance in the design of synchronous digital systems (such as microprocessors). From a timing perspective, FF latency consumes an increasingly large portion of the clock cycle time as the operating frequency increases. 1 Moreover, in order to sustain the trend of high performance and throughput, a large number of FFs is usually employed for extensive pipelining of datapath sections, so that the power dissipation of FFs has a deep impact on the power characteristics of the whole system. 2 Both master-slave (MS-) and pulse-triggered (P-) FFs are commonly used in contemporary digital systems. 3 4 Conventional MS-FFs consist of two latches, called master and slave, respectively. These FFs are characterized by a positive setup time requirement that increases the data-to-output delay. 5 P-FFs are considered to be an interesting alternative to MS-FFs. Their operation is based on the generation of a narrow transparency window around the rising (or falling) clock edge. In this way, a near-zero or even negative setup time is allowed and a smaller data-to-output delay is achieved. Moreover, since the operation of a P-FF requires only a single latch, as opposed to the two latches needed in conventional MS configurations, the logic complexity and the number of stages of the circuit are reduced, thus leading to lower power consumption. 5
Email: lanuzza@deis.unical.it
Depending on the method used to generate the transparency window, P-FFs are classified into two categories: explicit P-FFs (EP-FFs) and implicit P-FFs (IP-FFs). 3 In EP-FFs, an external pulse generator circuit is used to generate the pulse signal, thus triggering the transparency window of the latching structure. A single pulse generator can be shared among neighboring latches, as in the Itanium 2 microprocessor. 6 The advantage of this solution is that the power overhead of the pulse generator is distributed across many FFs; however, due to the increased capacitive load, the generation of a precisely timed pulse signal can be difficult to manage in practice. This issue is further complicated in the presence of process and environmental variations. IP-FFs eliminate the need to distribute the pulse signal by incorporating the pulse generator into their structures. In this way, better control of the transparency window width is enabled (so that a very narrow transparency window can be produced), usually at the expense of some speed penalty in comparison to EP-FF structures. 7
In the following, we will mainly focus on the design of IP-FFs. A design solution to achieve faster and lower-power IP-FFs is proposed here. When applied to some state-of-the-art IP-FF circuits, 7-9 the proposed approach reduces the data-to-output delay by 8% to 13% and the power-delay product (PDP) by 10% to 14%, while assuring the lowest power consumption for different input data switching activities and without any significant penalty in terms of occupied silicon area. Moreover, the suggested circuit modifications lead to higher process variation tolerance in terms of delay and PDP.
The remainder of this paper is organized as follows: Section 2 surveys recently proposed low-power and high-speed IP-FFs. 7-9 Section 3 suggests simple circuit techniques to increase the speed and reduce the power consumption of IP-FFs. Section 4 deals with comparative evaluations performed using the commercial 90 nm ST Microelectronics CMOS technology. Finally, Section 5 concludes the paper.
IMPLICIT PULSED FLIP FLOPS
Figure 1(a) shows the single-ended conditional capturing energy recovery (SCCER) IP-FF. 8 The circuit mainly consists of two branches, sharing two clocked transistors (N1-N2) in their discharging paths, and a simple latch (P3/N6-P4/N7) which preserves the output signal (Q) while the FF is insensitive to the input data signal (D). The left branch of the circuit is responsible for capturing the high logic value of the input data signal, whereas the low logic value of D is captured by the right branch. The actual latching occurs only during the 1-1 overlap between the CLK and CLKB signals. After the rising edge of CLK and before CLKB falls low, N1 and N2 are simultaneously turned on for a short period of time, thus triggering the FF transparency window. To save power, the SCCER design exploits the conditional capturing technique. 5 This is implemented by using the feedback signal Q_fdbk to control the N3 transistor. Only when Q is low (i.e., Q_fdbk is high) does a high logic value on the input signal D cause node X to be discharged during the FF transparency window. This in turn causes Q (Q_fdbk) to change from the low (high) to the high (low) logic value and, consequently, N3 to be turned off. As a result, node X is prevented from discharging in succeeding clock cycles as long as D is stable at the high logic value. In this way, power is saved. As highlighted in Figure 1(a), the worst-case delay of the SCCER design occurs when the output signal Q undergoes a low-to-high transition. In such a condition, node X has to be discharged through four stacked transistors (N1-N4), which have to win the contention with the always "on" pull-up transistor P1. Properly sized pull-down circuitry is thus needed to ensure that node X can be discharged. This implies the use of large N1-N4 devices and/or a weak inverter I1 to widen the transparency window. However, the sizing of the N1-N4 transistors is constrained by robustness issues.
When most of the devices in the left branch discharging path are ON simultaneously, significant charge sharing may occur. 8 In an extreme case, this can lead to an unwanted switching of the output signal outside the FF transparency window. Upsizing P1 helps in reducing the effect of charge sharing but, at the same time, increases the current contention at node X when it has to be discharged. Indeed, the preferred option 7 is to use a minimum-sized pull-up PMOS P1 (thus reducing the adverse impact on FF speed and power), while avoiding excessively large transistors in the pull-down path to guarantee a proper level of immunity to charge sharing effects. 11
As X is discharged, P2 is turned on, thus charging node Q to the VDD voltage level. However, because the NMOS N7 does not turn off instantaneously, there is another current contention mechanism occurring between P2 and N7 (due to the crossbar current flowing through P2 and N7) during the low-to-high transition of the output signal Q. This mechanism also degrades FF delay and power dissipation.
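The conditional capturing behavior described above can be illustrated with a purely logical sketch. The Python model below is an idealized abstraction and not the authors' circuit: it ignores timing, sizing, charge sharing and contention, and the function names `sccer_step` and `count_x_discharges` are hypothetical. It only shows that node X discharges when D is high and Q is low during a transparency window, and that the feedback suppresses further discharges while D stays high.

```python
def sccer_step(d, q):
    """One transparency window of an idealized SCCER-like flip-flop.

    d, q: 0/1 values of the data input and of the stored output.
    Returns (new_q, x_discharged).
    """
    q_fdbk = 1 - q  # feedback signal modeled as the complement of Q
    # Left branch: X discharges only if D is high and Q_fdbk enables N3.
    x_discharged = (d == 1 and q_fdbk == 1)
    if x_discharged:
        return 1, True   # X low turns on P2, which pulls Q high
    if d == 0:
        return 0, False  # right branch captures the low logic value
    return q, False      # D high and Q already high: X stays precharged


def count_x_discharges(data, q0=0):
    """Apply a sequence of D values, one per clock cycle, starting from q0."""
    q, discharges = q0, 0
    for d in data:
        q, x = sccer_step(d, q)
        discharges += int(x)
    return q, discharges
```

Under these assumptions, a constant-high input sequence discharges X only once, whereas an alternating sequence discharges X on every captured rising value of D.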
The low-power conditional pulse enhanced FF (CPEFF), proposed by Hwang et al., 7 is shown in Figure 1(b). A pseudo-explicit pulse generator, consisting of an inverter (I1) and a two-input pass transistor logic (PTL)-based AND gate, is used to trigger the FF transparency window. Such circuitry produces a pulse with a diminished voltage swing. Only when X is discharging is the pulse signal at node Z raised to the VDD voltage level through P3, thus enhancing the strength of transistor N1. Such a clocking mechanism reduces the number of stacked NMOS devices in the left branch but, as a side effect, a larger NMOS device (i.e., N4) is needed on the discharging path of the right branch to compensate for the weakened action of transistor N1 during the capturing of the low logic value on the input signal D. Note that the CPEFF also suffers from the current contention drawbacks already described for the SCCER design.
The solution proposed by Zhao et al. 9 preserves the simple clocking structure of the SCCER design and exploits the conditional data mapping technique 10 to control the discharge path of the left branch of the circuit. The resulting FF, called clocked pair shared FF (CPSFF), 9 is shown in Figure 1(c). On the basis of the D, Q and Q_fdbk signals, the conditional data mapper switches off the discharge path of the left branch when a high logic value has to be maintained on the output. This avoids redundant transitions at node X, while the height of the NMOS stack in the left branch of the circuit is reduced. Unfortunately, the current contention at node X and the crossbar current, occurring at the beginning of the low-to-high output signal transition, continue to negatively impact the speed and power of the CPSFF design.
In this work, the SCCER, CPEFF and CPSFF circuits, later used as the reference designs for the comparative analysis described in Section 4, were optimized with the objective of achieving a tradeoff between power consumption and D-to-Qbar delay, i.e., minimizing the product of the two terms. The sizes of the devices belonging to the reference circuits are reported in Figures 1(a)-(c). There, channel widths are normalized to the minimum value W min (= 0.12 μm) imposed by the 90 nm ST Microelectronics CMOS technology. Where not indicated, devices were sized with the minimum channel length L min (= 0.1 μm).
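The optimization objective and the width normalization mentioned above can be written down as a short sketch. Only the value of W min comes from the text; the helper names (`width_um`, `pdp_fj`, `best_sizing`) and any power or delay figures passed to them are hypothetical and are not simulation results from this work.

```python
W_MIN_UM = 0.12  # minimum channel width stated in the text, in micrometers


def width_um(normalized_width):
    """Convert a width expressed in multiples of W min into micrometers."""
    return normalized_width * W_MIN_UM


def pdp_fj(power_uw, delay_ps):
    """Power-delay product in femtojoules (1 uW * 1 ps = 1e-3 fJ)."""
    return power_uw * delay_ps * 1e-3


def best_sizing(candidates):
    """Pick the candidate with the smallest PDP.

    candidates: dict mapping a sizing label to (power_uw, delay_ps).
    """
    return min(candidates, key=lambda name: pdp_fj(*candidates[name]))
```

For instance, with made-up candidates `{"a": (20.0, 150.0), "b": (25.0, 110.0)}`, the second sizing would be preferred because its PDP is lower even though its power is higher.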
PROPOSED DESIGN APPROACH
In this section, some circuit modifications are suggested to alleviate the effects of the current contention mechanisms occurring during the worst-case output switching of IP-FFs. Figure 2(a) illustrates the proposed design strategy when applied to the SCCER circuit. As clarified in the following, the same approach can also be profitably used for all the other circuits described in Section 2.
The contention reduced SCCER (CR-SCCER) circuit of Figure 2(a) replaces the always "on" pull-up transistor of the left branch with a parallel pull-up PMOS network (P1-P5) driven by the feedback signals Q and Q_fdbk. In this way, node X is correctly maintained at VDD when the FF is not transparent to the data input, whereas the current contention at node X is reduced when it has to be discharged during the FF transparency window. In fact, if a rising transition of the output signal has to occur (i.e., as a consequence of a rising transition of the input data signal which has to be captured during the FF transparency window), the discharge of node X, determined by the pull-down network of the left branch, is counteracted by the charging current flowing through P1 only at the beginning of the falling transition of node X. As X is lowered below the VDD − Vth,P2 voltage level, P2 is turned on and the Q voltage starts to rise towards VDD. As a consequence, the source-to-gate voltage of P1 (Vsg,P1) is gradually reduced, thus weakening the action of the device P1. Due to the delay introduced by the inverter P3/N6, P1 is completely switched off while P5 is not yet turned on. This means that there is a portion of time, during the FF transparency window, in which no charging current flows through the pull-up network of the left branch. As a result, the "fighting" problem at node X is greatly alleviated, with a positive impact on the switching speed of node Q. Additionally, the rising transition of node Q is favored by the NMOS transistor N8 added to the pull-down network of the inverter P4/N7. This device, controlled by node X, allows the NMOS pull-down network N7-N8 to be quickly turned off. In this way, the crossbar current flowing through P2 and N7-N8 is quickly zeroed, with a benefit in terms of charging speed for node Q.
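The interval in which neither pull-up device conducts can be visualized with a small logic-level timeline. The sketch below is an idealization under stated assumptions, not a circuit simulation: Q is treated as an ideal step, Q_fdbk as its complement delayed by the inverter delay, and each PMOS as a switch that conducts exactly when its gate is at logic low. The gate assignment (P1 driven by Q, P5 driven by Q_fdbk) is inferred from the description above, and the function names are hypothetical.

```python
def pull_up_states(q_rise_time, inverter_delay, t_end):
    """Return a list of (t, p1_on, p5_on) for an idealized rising Q.

    Times are integers in arbitrary units; t runs from 0 to t_end - 1.
    """
    states = []
    for t in range(t_end):
        q = 1 if t >= q_rise_time else 0
        # Q_fdbk falls only after the feedback inverter has responded.
        q_fdbk = 0 if t >= q_rise_time + inverter_delay else 1
        p1_on = (q == 0)       # P1 gate assumed to be driven by Q
        p5_on = (q_fdbk == 0)  # P5 gate assumed to be driven by Q_fdbk
        states.append((t, p1_on, p5_on))
    return states


def contention_free_window(states):
    """Time points at which neither pull-up device conducts."""
    return [t for t, p1_on, p5_on in states if not (p1_on or p5_on)]
```

With a rise at t = 5 and an inverter delay of 3 units, the model reports no pull-up current at t = 5, 6 and 7; with zero inverter delay the window disappears, which is why the delay of the feedback inverter matters in this picture.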
It is worth noting that, since current contentions are considerably reduced during the critical output switching, the suggested approach is also beneficial in terms of power dissipation, especially when the critical delay path is frequently exercised (i.e., for high input data activities). Figures 2(b)-(c) show the contention reduced versions of the CPEFF and CPSFF designs, respectively. In order to provide a direct evaluation of the impact of the proposed design approach, the original pull-down transistor sizing was maintained. Instead, the pull-up PMOS devices belonging to the left branch of the circuits were sized at 1.2 W min to assure a robustness against charge sharing effects similar to that of the original designs. The additional NMOS device driven by node X was sized with the minimum channel width.
Note that the suggested design strategy increases the capacitive loads on nodes X, Q and Q_fdbk. However, the speed advantages brought by the temporary switching off of the pull-up network of the left branch, when X is discharging during the FF transparency window, largely outweigh the adverse impact of the increased parasitic capacitance on node X. Moreover, the devices controlled by nodes Q and Q_fdbk have minimal impact on the capacitive load of such nodes due to their reduced sizing. Overall, the suggested approach improves the speed and power consumption of the CR-FF designs.
COMPARATIVE RESULTS
The Cadence Spectre tool was employed for the comparative analysis, using the simulation setup shown in Figure 3. In order to obtain accurate results, the IP-FFs were simulated in a realistic environment, where input buffers drive the FF inputs (CLK and D), and the outputs (Q and Qbar) each drive a 20 fF load capacitance. An extra capacitance of 3 fF is placed after the clock driver. 7 The clock frequency was set to 500 MHz and the power supply voltage to 1 V.
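To put the setup values in perspective, the following back-of-the-envelope sketch computes the clock period and the ideal switching power of a single 20 fF output load. It uses the textbook relation that each low-to-high transition of a capacitive node draws C·V² from the supply; the data transition probability is an assumed parameter, and the result covers only the external load, not the power of the flip-flop itself or any simulated figure from this work. Function names are hypothetical.

```python
def clock_period_ns(freq_hz):
    """Clock period in nanoseconds."""
    return 1e9 / freq_hz


def load_switching_power_w(cap_f, vdd_v, freq_hz, transition_prob):
    """Average supply power spent charging one capacitive output load.

    transition_prob: assumed fraction of clock cycles in which the
    output toggles. Over a long run, half of the toggles are rising
    edges, and only rising edges draw C * V^2 from the supply.
    """
    rising_per_cycle = transition_prob / 2.0
    return rising_per_cycle * cap_f * vdd_v ** 2 * freq_hz
```

With the values of the setup (20 fF, 1 V, 500 MHz), the clock period is 2 ns and an output that toggles every cycle would dissipate about 5 μW in its load alone under this idealization.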
The characteristics of the analyzed IP-FFs are compared in Table I in terms of total gate area, optimum setup [...] The CPEFF and the CR-CPEFF designs have a longer hold time than the other designs. This is because the pulse enhancement mechanism requires a more prolonged availability of the input data signal. 7 The SCCER and CPSFF circuits shorten the hold time by about 43% and 48%, respectively, with respect to the CPEFF design. In any case, when compared to their conventional counterparts, the CR-FFs always reduce the hold time requirements.
Due to the reduced current contentions during the worst-case output switching, all the CR-FFs shorten the D-to-Qbar delay in comparison to their conventional counterparts. More precisely, at the optimum setup time, the CR-SCCER, CR-CPEFF and CR-CPSFF designs improve the D-to-Qbar delay by about 10%, 13% and 8% when compared to the SCCER, CPEFF and CPSFF circuits, respectively. As shown in Figure 4, this speed advantage is maintained for different setup times. Even though the optimum setup time of the CR-FFs is slightly increased in comparison to that of their conventional counterparts, the CR-FFs retain the ability to work correctly for negative setup times. This provides the soft clock edge property, which helps to overcome clock-skew-related cycle time loss. 11 12 Therefore, the CR-FFs present proper timing characteristics for high-performance applications.
As the power dissipated in a FF depends on the input data activity, five different input patterns were considered to evaluate the power consumption behavior of the compared designs. The considered patterns present 0% (all-zero or all-one), 25%, 50%, and 100% data transition probabilities, respectively. Both the clock buffer power and the total power (including the power consumed in the latches and in the data and clock drivers) are reported in Table I. Again, the reduced current contentions lead the CR-FFs to achieve slightly better power results in comparison to their conventional counterparts, especially for the highest input data activities. It is worth noting that, if power becomes the primary concern, the increased speed brought by the suggested design approach could be traded off for additional power saving by reducing the sizing of the pull-down devices in the left branch of the CR-FF circuits.
Due to the improved power and timing characteristics, the CR-FFs achieve significantly better PDP results in comparison to the previously proposed designs. Figure 5 illustrates the curves of PDP versus setup time. At the optimum setup time, CR-SCCER, CR-CPEFF and CR-CPSFF reduce the PDP by about 13%, 14% and 10% when compared to SCCER, CPEFF and CPSFF, respectively. Therefore, the CR-FFs are confirmed to be a good design option for low-power and high-performance applications.
Table II gives the leakage power of all FF designs in stand-by mode (i.e., when the clock is gated), considering different combinations of clock and input/output data signals. It should be noted that, despite the increased number of transistors, the leakage power is not significantly degraded by the proposed circuit modifications (leakage variations are always below 1%).
Figure 6 depicts the minimum D-to-Qbar delay data obtained by simulating the compared designs for different process corners.
It is easy to observe that the speed advantages of the CR-FFs remain fairly constant over the different process corners, except for the SS corner, where no significant speed improvements were observed. As shown in Figure 7, a similar behavior was observed in terms of PDP. Pulsed FFs are usually very sensitive to random process variability. 13 14 For this reason, the tolerance to process uncertainties was analyzed for all the compared circuits. [...] Table I), without any setup time margin. In Table III, mean (μ) and standard deviation (σ) values are reported for the D-to-Qbar delay and PDP, respectively. As expected, the suggested approach leads the CR-FFs to achieve better mean D-to-Qbar delay and PDP with respect to their conventional counterparts. The reduced current contentions during the worst-case output switching also lead the CR-FFs to achieve higher process variation tolerance. This is confirmed by the standard deviation (σ) values for the D-to-Qbar delay and PDP, which are considerably reduced. More precisely, the reduction of σ_Delay ranges between 13% (CR-CPSFF vs. CPSFF) and 17% (CR-CPEFF vs. CPEFF). At the same time, the reduction of σ_PDP is between 15% (CR-CPSFF vs. CPSFF) and 20% (CR-CPEFF vs. CPEFF).
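The kind of summary reported in Table III can be reproduced from raw samples with a few lines of Python. The sketch below is only illustrative: the helper names are hypothetical, it assumes that the statistical analysis yields a list of per-run delay or PDP samples, and it uses the sample standard deviation, since the text does not state which estimator was used. It contains no data from this work.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of values."""
    return statistics.mean(samples), statistics.stdev(samples)


def relative_reduction(reference, improved):
    """Percentage by which `improved` is lower than `reference`."""
    return 100.0 * (reference - improved) / reference
```

For example, if a reference design had a delay standard deviation of 20 (in any unit) and its modified version 17, `relative_reduction(20.0, 17.0)` would report a 15% reduction, which is how figures such as the σ reductions quoted above are to be read.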
CONCLUSION
State-of-the-art implicit pulsed flip-flops suffer from current contentions occurring at critical switching nodes when the worst-case delay path is exercised. This has an adverse impact on the data-to-output delay and power consumption of the circuits. In this paper, simple circuit techniques are proposed to considerably alleviate the detrimental effects of the current contentions occurring during the worst-case output switching. When applied to state-of-the-art implicit pulse-triggered flip-flops, the suggested approach improves the data-to-output delay by up to 13% and reduces the power-delay product by up to 14%, without any significant penalty in terms of occupied silicon area. Moreover, the reduced current contentions improve the process variation tolerance of the modified designs. Interestingly, the suggested approach can be easily combined with several low-power techniques, including low-swing and double-edge clocking, to design more effective pulse-triggered flip-flops.
