Abstract: Dual-edge-triggered (DET) synchronous operation is a very attractive option for low-power, high-performance designs. Compared to conventional single-edge synchronous systems, DET operation is capable of providing the same throughput at half the clock frequency. This can lead to significant power savings on the clock network, which is often one of the major contributors to total system power. However, in order to implement DET operation, special registers need to be introduced that sample data on both clock edges. These registers are more complex than their single-edge counterparts, and often suffer from a certain amount of clock-overlap between the main clock and the internally generated inverted clock. This overlap can cause contention inside the cell and lead to logic failures, especially when operating at scaled power supplies and under the process variations that characterize nanometer technologies. This paper presents a novel, static DET flip-flop (DET-FF) with a true-single-phase clock that completely avoids clock-overlap hazards by eliminating the need for an inverted clock edge for functionality. The proposed DET-FF was implemented in a standard 40 nm CMOS technology, showing full functionality at low-voltage operating points, where conventional DET-FFs fail. Under a near-threshold, 500 mV supply voltage, the proposed cell also provides a 35% lower CK-to-Q delay and the lowest power-delay product compared to all considered DET-FF implementations.
I. INTRODUCTION
The design of energy-efficient circuits remains one of the main challenges in the field of digital integrated systems [1]. A large portion of the power dissipated in VLSI architectures is attributed to clock distribution, consuming as much as 45% of the total system power [2]. Clock networks are characterized by a 100% activity factor, charging and discharging their parasitic capacitors during each cycle, and thereby leading to power dissipation that is directly proportional to clock frequency. For this reason, high-speed, high-throughput designs are especially affected by this issue.
One well-known approach for reducing the clock power is dual-edge-triggered (DET) synchronous operation. By sampling data on both the rising and falling edges of the clock, the clock frequency can be reduced by 50% without changing the system throughput. This directly cuts the power dissipation of the clock network in half, leading to significant overall system power savings. However, implementation of DET operation requires the introduction of registers that sample, store, and propagate their input at both clock edges. While these dual-edge-triggered flip-flops (DET-FFs) are more complex and generally larger than their single-edge-triggered (SET) counterparts, they can be designed to be more energy-efficient [3], thereby providing additional power savings.
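To make the frequency argument concrete, the following minimal sketch evaluates the standard switching-power relation P = α·C·VDD²·f for a clock network at full and at half clock frequency. The capacitance, supply voltage and frequency values are illustrative assumptions and are not taken from this work.

```python
def clock_power(c_clk_farads, vdd_volts, f_hz, activity=1.0):
    """Dynamic switching power P = activity * C * VDD^2 * f."""
    return activity * c_clk_farads * vdd_volts ** 2 * f_hz


# Hypothetical values, chosen only for illustration.
C_CLK = 50e-12   # total clock-network capacitance [F]
VDD = 0.5        # supply voltage [V]
F_SET = 1e9      # clock frequency of a single-edge-triggered design [Hz]

p_set = clock_power(C_CLK, VDD, F_SET)
p_det = clock_power(C_CLK, VDD, F_SET / 2)  # DET: same throughput, half the clock
ratio = p_det / p_set

print(f"SET clock power: {p_set * 1e3:.2f} mW")
print(f"DET clock power: {p_det * 1e3:.2f} mW")
print(f"DET / SET ratio: {ratio:.2f}")
```

Because the activity factor of the clock network is 100%, the ratio depends only on the frequency, whatever capacitance and supply voltage are assumed.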
The implementation of storage cells that are triggered on both clock edges is a well-researched subject. Many solutions for the design of DET-FFs have been proposed [4]-[8]. The most popular of these cells is the transmission-gate latch-MUX (DET-TGLM) [4] due to its simple implementation that is based on two latches and an output multiplexer (MUX). An alternative configuration can be assembled by replacing the transmission gates with C²MOS gates [11], resulting in the C²MOS latch-MUX (DET-C²MOSLM) [5]. A different approach is to generate a short pulse on each clock edge, thereby realizing a pulse-triggered DET-FF, as shown in [6]. More advanced DET-FFs that limit the switching activity through pulse generation and precharge conditions are the conditional discharge flip-flop (DET-CDFF) [7] and the symmetric pulse generator flip-flop (DET-SPGFF) [8].
While these topologies have been demonstrated in various applications, few of them have been examined in deeply scaled process technologies under voltage scaling, which is commonly used for the implementation of energy-efficient systems. In particular, in the presence of considerable process variation, the use of both clock phases usually introduces some extent of clock-overlap, which can lead to race conditions and other detrimental circuit behavior. For example, when considering the traditional DET-TGLM, process, voltage and temperature (PVT) variations can cause this overlap to increase to a point where the currently held data is overwritten, resulting in a fatal logic error.
Contribution: in this paper, we solve this clock-overlap problem by presenting the first static true-single-phase-clock (TSPC) DET-FF. By implementing the cell with TSPC circuits and an internal dual-feedback mechanism, completely static operation is achieved, enabling robust operation under voltage scaling and process variations. To demonstrate its functionality in nanoscaled technologies, the cell was implemented in a 40 nm CMOS process, showing full functionality at a near-threshold, 500 mV supply voltage (VDD) under extensive Monte Carlo (MC) statistical simulations. In addition to being the only topology to continue to operate robustly under these conditions, the proposed cell also provides the lowest CK-to-Q delay (tcq) and the best power-delay product (PDP) when compared to other leading DET-FF solutions.
Outline: the rest of this paper is organized as follows. Section II presents the clock-overlap hazard in the traditional DET-TGLM circuit. The proposed static dual-edge-triggered flip-flop with true-single-phase clock (SDET-TSPCFF) is presented in Section III to address this hazard and enable low-voltage operation. Section IV provides simulation results for the proposed cell and a comparison with other popular DET-FF implementations. Finally, the conclusions are reported in Section V.
II. CLOCK-OVERLAP FAILURE RISK IN DET-TGLM CELLS
This section explains the risk of failure in DET-FFs due to clock-overlap. The DET-TGLM gate was chosen to demonstrate this hazard, as it is the most popular DET-FF implementation. Accordingly, a brief overview of the DET-TGLM cell is provided in Section II-A, followed by a detailed analysis of the risk of failure due to clock-overlap in Section II-B. Note that while this discussion is specific to the DET-TGLM cell, a similar analysis applies to many other DET-FF implementations.
A. Overview of the DET-TGLM
Among the various DET-FFs, the DET-TGLM is one of the most commonly implemented topologies, primarily due to its simple structure and straightforward behavior. Two values are stored internally in two separate latches that are connected through an output MUX, as illustrated in Fig. 1. The latches are implemented with input transmission gates (M3-M4 and M13-M14), inverters (M5-M6 and M15-M16) and clocked inverters (M7-M10 and M17-M20), while the MUX is exclusively composed of transmission gates (M11-M12 and M21-M22). Each latch is transparent during a different phase of the clock, and the value stored in the opaque latch is passed through the MUX to the output.
In addition to its simple structure, this cell has been shown to be one of the most energy-efficient DET-FFs for high-speed operation [3]. During its respective transparent window, each latch passes the input data to its cascaded transmission gate (SNP and SNN in Fig. 1), such that the data only needs to propagate through the output MUX on the next clock edge. This provides a short tcq, which makes the DET-TGLM suitable for high-frequency applications. Finally, this circuit does not rely on pulse-triggered circuits or precharge (dynamic) conditions, such as those required by [6]-[8], making it less sensitive to variations from technology and voltage scaling.
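The latch-MUX behavior described above can be sketched with an idealized, zero-delay model. The class and function names are hypothetical, the output inversion of the real cell is ignored, and the phase assignment (top latch opaque and selected while the clock is low) follows the example used in Section II-B. This is a behavioral illustration only, not a transistor-level description of the DET-TGLM.

```python
class DetLatchMux:
    """Idealized, zero-delay model of a latch-MUX DET flip-flop."""

    def __init__(self):
        self.snp = 0  # storage node of the top latch
        self.snn = 0  # storage node of the bottom latch

    def evaluate(self, ck, d):
        """Update the transparent latch and return the MUX output."""
        if ck:
            self.snp = d      # top latch is transparent while CK is high
            return self.snn   # MUX selects the opaque bottom latch
        self.snn = d          # bottom latch is transparent while CK is low
        return self.snp       # MUX selects the opaque top latch


def run(ff, samples):
    """Apply (ck, d) pairs in order and collect the output after each one."""
    return [ff.evaluate(ck, d) for ck, d in samples]


# The clock toggles on every step, so every step after the first is a clock edge.
stimulus = [(0, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
outputs = run(DetLatchMux(), stimulus)
print(outputs)
```

After each clock edge, the output equals the value that D held during the preceding clock phase, for rising and falling edges alike, which is the dual-edge sampling behavior.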
B. Clock-Overlap Failure Risk
The static operation of the DET-TGLM provides inherent robustness; however, one problematic feature remains: its dependence on both clock phases for functionality. To accommodate this need, the inverted clock signal is internally generated with an inverter (M25-M26). A second inverter (M27-M28) is implemented to internally buffer the input clock and ensure a controlled and fast slew rate of CKI. Due to the intrinsic delay of the second inverter, CKB and CKI share the same value during an interval of time that is ideally equal to this delay. The time during which both clock signals are high is defined as positive clock-overlap (PCO), while negative clock-overlap (NCO) occurs when both clock signals are low. These overlap phases occur immediately after each transition of the clock signal that is used to generate the inverted one. Since both clock signals are equal during such an overlap, there is always one type of transistor (either NMOS or PMOS) turned on in each transmission gate of the MUX (M11-M12 and M21-M22). A conducting path is therefore generated between the inputs of the MUX in the DET-TGLM, causing an internal race between the values that are stored in the two latches. This clock-overlap time is heavily dependent on PVT variations and wire parasitics. If the overlap is too large, the voltage value stored in one latch will overwrite the value stored in the other latch, resulting in a storage failure. Fig. 2 demonstrates the behavior of a DET-TGLM gate during a typical hazardous event. In this example, the clock is initially low and a logic-0 is stored at SNP and passed through the top transmission gate to the output. During this phase, the bottom latch is transparent, passing a logic-1 from the input (D) to SNN. After the rising edge of the clock, both CKI and CKB are low during the NCO, and the PMOS transistor in each transmission gate is conducting.
If the NMOS that is pulling down SNP (M6) drives more current than the PMOS that is driving SNN (M15) and if the overlap time is sufficiently long, the voltage value on SNN will drop until it is overwritten by a logic-0 through the cross-coupled feedback of the bottom latch. Following the overlap period, this logic-0 value is latched and driven through the MUX to provide the wrong value at the output. The transient waveforms of SNN and QB during a failing event are shown as a solid line in Fig. 2, while a case where the circuit overcomes the hazard is shown with a dotted line. The same failure risk can be studied for the case where CKB and CKI are high, and a logic-0 value stored at SNP is the critically affected value.
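The timing of the overlap phases can be sketched with a simple first-order model. The sketch assumes that the two clock inverters form a chain (CK to CKB to CKI) with ideal, fixed delays; this chain and all delay values are assumptions made for illustration, chosen to be consistent with the NCO that follows a rising clock edge in the example above. In a real cell, the overlap duration varies with PVT conditions and wire parasitics.

```python
def overlap_windows(ck_edges, t_inv1_ps, t_inv2_ps):
    """Return (kind, start_ps, end_ps) for the overlap after each CK edge.

    Assumed chain: CK -> inverter 1 -> CKB -> inverter 2 -> CKI.
    """
    windows = []
    for t_edge_ps, direction in ck_edges:
        start = t_edge_ps + t_inv1_ps  # CKB switches first
        end = start + t_inv2_ps        # CKI follows one inverter delay later
        # Rising CK edge: CKB falls while CKI is still low, so both are low (NCO).
        # Falling CK edge: CKB rises while CKI is still high, so both are high (PCO).
        kind = "NCO" if direction == "rise" else "PCO"
        windows.append((kind, start, end))
    return windows


# Hypothetical numbers: a 500 MHz clock (1000 ps half period) and
# inverter delays of 50 ps and 30 ps.
edges = [(0, "rise"), (1000, "fall")]
windows = overlap_windows(edges, t_inv1_ps=50, t_inv2_ps=30)
for kind, start, end in windows:
    print(f"{kind}: {start} ps to {end} ps ({end - start} ps)")
```

In this model the overlap duration equals the delay of the second inverter, so any variation that slows this inverter directly lengthens the window in which the two latches contend.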
As previously described for the case of NCO, a failure occurs when the voltage at SNN drops below a critical threshold that results in a latched logic-0 level. To evaluate the probability of such an occurrence, we employ statistical MC simulations, applying global and local process variations to a DET-TGLM gate during an NCO phase. Fig. 3 displays the obtained distribution of the minimum voltage level of SNN for 10,000 MC samples applied to a DET-TGLM implemented in a standard 40 nm CMOS process. The simulations were run with a near-threshold VDD of 500 mV at 125 °C. Out of the 10 k samples, 15 resulted in a storage failure, as can be seen by the non-zero probability of voltage levels centered around 0 V. In addition, the failure threshold can be estimated at 0.199 V, which is the minimum voltage level for a stored logic-1 that is still overcome by the gate without causing a failure. However, it is clear from the presented distribution and the large number of failures that the DET-TGLM is not a viable candidate for near-threshold operation.
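To put the observed failure count into perspective, the following sketch converts it into a per-cell failure rate and then estimates the probability that at least one flip-flop fails in a larger design. The design size of 1,000 flip-flops and the assumption of statistically independent cells are illustrative and are not part of the reported simulations.

```python
def failure_rate(failures, samples):
    """Fraction of Monte Carlo samples that resulted in a storage failure."""
    return failures / samples


def prob_any_failure(p_fail, n_cells):
    """Probability that at least one of n_cells fails, assuming independence."""
    return 1.0 - (1.0 - p_fail) ** n_cells


rate = failure_rate(15, 10_000)         # counts reported for the DET-TGLM
p_any = prob_any_failure(rate, 1_000)   # hypothetical design with 1,000 flip-flops

print(f"Per-cell failure rate: {rate:.4%}")
print(f"P(at least one failure in 1,000 cells): {p_any:.1%}")
```

Even a per-cell failure rate well below 1% makes a failure likely once the cell is instantiated many times, which is why the observed count is regarded as unacceptable.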
III. THE PROPOSED SDET-TSPCFF
In the previous section, the traditional DET-TGLM gate was shown to be unsuitable for near- or sub-threshold operation in scaled technologies, due to the risk of clock-overlap failures. In order to overcome these risks, we propose a fully-static, TSPC alternative to the DET-TGLM and other dual-phase solutions. Other TSPC DET-FFs have been shown in the past [6]-[10]; however, these gates all rely on temporary dynamic storage [9], [10] and/or generated pulses [6]-[8], which make them sensitive to both process variations and voltage scaling.
The schematic of the proposed SDET-TSPCFF is shown in Fig. 4. Similar to other latch-MUX DET-FFs, new data is written to an internal storage node during one clock phase and subsequently latched and driven to the output following the clock transition. This is achieved without the need for an inverted clock signal by implementing the storage elements with a pair of TSPC latch-MUX branches (M1-M18 and M19-M36). These branches are loosely based on the classic TSPC latch [12] with the addition of two internal feedback mechanisms that ensure strong data levels and fully-static operation to enable robust, low-voltage functionality.
To further explain the circuit operation and its feedback mechanisms, we will focus on the top branch in Fig. 4 (M1-M18), with the opposite branch operating in a completely symmetric fashion. When CK is high, devices M1-M8 act as a buffer, passing the value at D to SNP. This buffer does not encounter any contention with other parts of the circuit, as M5, M10 and M11 are all cut off. In addition, in this state, the output of the top branch presents a high impedance to Q, as M17 cuts off the pull-up to this node and M12 pulls down the gate of M18, cutting off the pull-down to the output. When CK goes low, the current state of SNP is latched, since M7 cuts off the pull-down and M2 cuts off the pull-down path to DBP, disabling a pull-up through M6. It is essential to ensure that DBP does not drift and possibly turn on M6; therefore, a feedback path from SNP to M4 maintains a logic-1 at DBP if SNP was latched at 0. Moreover, in this state, devices M9-M15 comprise a cross-coupled inverter that holds the level at SNP through a strong positive-feedback loop. Finally, devices M16-M18 function as a tri-state inverter, selectively and robustly passing the stored value to the output.
While the 36 transistors required to implement this gate exceed the 28 required by the DET-TGLM and the transistor count of many of the other DET-FFs, the additional area enables static, overlap-contention-free operation, thereby providing variation-tolerant functionality at scaled supply voltages and for advanced process technologies.
IV. SIMULATIONS AND RESULTS
The performance of the proposed cell is evaluated considering two groups of simulations. First, the resilience of the storage cell against failures is tested with MC simulations to show its robustness. Second, the SDET-TSPCFF is characterized in terms of speed and power consumption in order to compare it with the other popular static DET-FF implementations. All circuits were implemented with standard-VT transistors in a 40 nm CMOS technology for comparison.
In the first considered testbench, all possible combinations of data are written inside the storage cell, and subsequently checked at the output during the next clock phase. The output value is continuously sampled and failures are reported if it differs from the expected value. Both process and mismatch variations are taken into account while running MC simulations. Furthermore, near-threshold operation is targeted by setting VDD to 500 mV. This set of runs is executed for each of the following temperatures: 0 °C, 25 °C and 125 °C. An example of the family plots obtained through a set of simulations is shown in Fig. 5. The proposed cell provided the correct output for all 10 k samples at each temperature point, indicating robust functionality under these operating conditions.
In order to evaluate the performance of the proposed cell, it was compared with other latch-MUX based storage cells that do not rely on pulse generation. The characterization of the storage cells is performed using the testbench proposed in [13], where several state-of-the-art DET-FFs are simulated and compared. All the simulations were applied to 40 nm implementations of the considered circuits with VDD = 0.5 V, at 25 °C and at a typical process corner. The frequency of the input clock is 500 MHz, corresponding to a cell throughput of 1 GHz with data activity of 25%.
The results are summarized in Table I, showing that in addition to solving the clock-overlap failures, the proposed SDET-TSPCFF also provides the lowest CK-to-Q delay (tcq). The DET-C²MOSLM shows the worst performance in terms of speed and dynamic power consumption, as the presence of four stacked transistors severely compromises its performance at near-threshold operation. Therefore, much wider transistors are required in order to operate correctly under these conditions, resulting in a severe area and power consumption penalty. The advantage in tcq of the proposed circuit as compared to the DET-TGLM cell is due to the reduced conductivity of the DET-TGLM's transmission gates at scaled supply voltages. The SDET-TSPCFF also provides a lower clock load compared to the DET-TGLM, defined as the number of minimum-size transistors controlled by a clock signal. The leakage and total power consumption of the presented cell are slightly higher than those of the DET-TGLM; however, its PDP is lower, confirming that the SDET-TSPCFF is the best option in terms of energy-efficiency. Note that in any case, the only fully functional cell at this operating point is the proposed SDET-TSPCFF, and therefore, it is the unequivocal choice for DET operation in low-power, nanoscaled systems targeted at near-threshold operation.
Table I notes: 1) [13]. 2) At 500 MHz input clock frequency and 25% data activity. 3) PDP = tcq · Ptot.
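The PDP comparison can be illustrated with the definition PDP = tcq · Ptot given in the notes to Table I. The cell names and the delay and power values in the sketch below are placeholders and are not the values of Table I; they were chosen only to show how a cell with slightly higher total power can still achieve the lower PDP when its tcq is sufficiently shorter.

```python
def pdp(t_cq_s, p_tot_w):
    """Power-delay product: PDP = tcq * Ptot, in joules."""
    return t_cq_s * p_tot_w


# Placeholder (tcq [s], Ptot [W]) pairs; NOT the values reported in Table I.
cells = {
    "cell_A": (400e-12, 2.0e-6),  # lower power, longer CK-to-Q delay
    "cell_B": (300e-12, 2.4e-6),  # higher power, shorter CK-to-Q delay
}

results = {name: pdp(t_cq, p_tot) for name, (t_cq, p_tot) in cells.items()}
best = min(results, key=results.get)

for name, value in results.items():
    print(f"{name}: PDP = {value * 1e15:.2f} fJ")
print(f"Lowest PDP: {best}")
```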
V. CONCLUSIONS
This paper presented a novel dual-edge-triggered flip-flop topology to solve the inherent clock-overlap risk in the majority of the previously presented DET-FFs. The failure risk due to clock-overlap was demonstrated on a popular DET-TGLM gate, showing an unacceptable error rate at near-threshold voltages in a 40 nm CMOS process. The proposed fully-static true-single-phase-clock DET-FF was shown to be fully functional at a similar operating point, under local and global process variations and over a wide range of temperatures. In addition, the proposed cell was found to provide the best performance and energy-efficiency among static DET-FF options.
