Abstract-Deeply pipelined systems require flip-flops with low latency and power consumption. Often, the flip-flop must supply both inverted and non-inverted signals to subsequent logic. Generating both outputs at the same time improves performance by equalizing the worst-case delays. In this paper, we present a novel differential flip-flop for deeply pipelined systems. The circuit uses cross-coupled p-transistors as pull-up devices to achieve high energy efficiency. We simulated the design in 90-nm CMOS technology to determine the delay and power consumption. We then repeated the analysis with four other differential flip-flops that produce symmetric outputs. The proposed design achieves the best power-delay product of the five alternatives.
I. INTRODUCTION
Pipelining is an important design technique to realize high-performance digital systems. In a pipelined system, a large combinational logic block is broken down into a series of smaller blocks separated by pipeline registers. Composed of several flip-flops in parallel, these pipeline registers synchronize the flow of data from one stage to the next. Deeply pipelined systems break down the combinational logic block to a greater extent, so that each pipeline stage encapsulates a simple operation. As one example, we have integrated deep pipelining into the design of a medium-grain reconfigurable architecture for digital signal processing [1]. This architecture features an array of 4-bit cells and a hierarchical interconnection network. Each cell and interconnection level forms one pipeline stage.
Clearly, deeply pipelined systems face several challenges: numerous pipeline registers, a large clock distribution network, and increased power consumption [2]. The delay overhead of the pipeline registers also becomes more significant as the number of stages increases and clock frequencies scale upward. The later in the circuit. Circuit styles such as pass-transistor logic also achieve better performance with symmetric signals. Thus, differential flip-flops are a reasonable choice for deeply pipelined systems. Fig. 1 depicts a basic differential flip-flop. The design works with input data D and D̄, output data Q and Q̄, and a differential clock consisting of Ck and its complement. Two pairs of minimum-size inverters serve as memory elements. When Ck is low, data from the inputs overwrites the first memory element. The rising edge of Ck causes the data to overwrite the second memory element and propagate to the outputs. We use this simple design in later analysis as a baseline for comparison.
Researchers have proposed a number of flip-flops that provide complementary outputs. We limit our focus here to circuits that generate Q and Q̄ at the same time, and operate statically for the best noise tolerance. One such design is the static single-transistor clocked (SSTC) flip-flop [3]. Fig. 2 illustrates how this design consists of a master portion and a slave portion. The role of the master portion is to assert the set signal S or the reset signal R when Ck is low. The slave portion uses these signals to change the outputs when Ck is high. The extra inverter and n-transistors in the master portion reset S and R if the inputs change while Ck is high. Notice that the slave portion maintains the current state when S and R are both low. Unlike the basic flip-flop, the SSTC does not require a differential clock.
Another design taken from the literature is the sense amplifier flip-flop (SAFF1) in [4]. As shown in Fig. 3 [...] to ground. The slave portion then uses these signals to set the outputs of the flip-flop. A feedback loop maintains the current state of the outputs when S and R are high.
A slightly different sense-amplifier flip-flop (SAFF2) appears in [5]. This version uses the same master portion, but reduces the number of transistors in the slave portion. The slave portion now contains a memory element implemented by a pair of inverters. The S, S̄, R, and R̄ signals control four transistors that set the state of the memory element. As with the SAFF1, the SAFF2 only requires a single clock. Fig. 4 depicts the design.
In this paper, we propose a novel differential flip-flop for deeply pipelined systems. The circuit achieves high energy efficiency by using cross-coupled p-transistors as pull-up devices. Section II describes the design and gives a simulation verifying its functionality. Section III compares the delay and power consumption of the flip-flop with the four alternatives described above. Finally, Section IV summarizes the results of the analysis and gives some concluding remarks.
II. CIRCUIT DESIGN
Fig. 5 illustrates the proposed differential flip-flop. The circuit consists of identical master and slave latches. When Ck is low, the input data overwrites the contents of the master latch. When Ck is high, the master latch overwrites the contents of the slave latch. The design is fully static, so Ck can run at any frequency up to the maximum.
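As a purely behavioral illustration of the master-slave operation described above, the following Python sketch models the two latches as ideal storage elements with zero delay. The class and method names are hypothetical, and the model says nothing about transistor-level effects such as the cross-coupled pull-ups, sizing, or timing.

```python
class MasterSlaveFlipFlop:
    """Idealized model of a positive-edge-triggered master-slave flip-flop.

    Assumption: both latches are ideal (no delay, no contention).
    """

    def __init__(self, initial=0):
        self.master = initial  # value stored in the master latch
        self.slave = initial   # value stored in the slave latch (drives Q)

    def step(self, ck, d):
        """Apply clock level `ck` (0 or 1) and data `d` (0 or 1)."""
        if ck == 0:
            # Ck low: the input data overwrites the master latch,
            # while the slave latch holds its state.
            self.master = d
        else:
            # Ck high: the master latch overwrites the slave latch,
            # while the master ignores further changes on the input.
            self.slave = self.master
        return self.slave, 1 - self.slave  # (Q, complement of Q)


ff = MasterSlaveFlipFlop(initial=0)
trace = [ff.step(ck, d) for ck, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]]
```

In this trace the third step changes the data while Ck is high, and the outputs stay at their previous values until the next rising edge.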
The numbers in the figure denote the transistor widths with respect to the minimum size. As shown, the master latch contains minimum-size n-transistors that overwrite the stored value when Ck is low, and maintain the current state when Ck is high. The slave latch operates in a complementary manner. The latches also include cross-coupled p-transistors as pull-up devices. These transistors improve noise tolerance by providing full-rail swing at the inverter inputs. Since the minimum-size p-transistors are several times weaker than the n-transistors in the write path, writing new data into the latches does not consume much power. Changing the data again while Ck is high has no effect on Q and Q̄. However, the voltage levels of D and D̄ do deteriorate slightly when Ck falls low and the external circuitry changes the state of the master latch.
Unlike most of the other designs described in Section I, the proposed flip-flop uses a differential clock. This property might seem disadvantageous, since distributing a global signal requires significant power. However, the system can use the circuit in Fig. 7(a) [6]. For comparison, Fig. 7(b) illustrates a single-ended clock buffer. A circuit simulation of the differential clock generator appears in Fig. 8. In addition, the flip-flops have approximately equal driving capabilities.
To characterize the delay parameters of the flip-flops, we applied the methodology described in [7] and [8]. Referring to Fig. 8, the clock-output delay tCk-Q depends on tD-Ck and tCk-D, which describe how long the input data remains stable before and after the clock edge. Decreasing the window of stability increases the clock-output delay until the flip-flop no longer captures the correct value. The setup time is the value of tD-Ck that minimizes the sum tD-Ck + tCk-Q. We call the corresponding value of tCk-Q the output delay, and the sum the total delay. The total delay places an upper limit on the clock rate in a pipelined system. The hold time is the value of tCk-D that minimizes the sum tCk-D + tCk-Q. The hold time is often negative, meaning that the input data can transition before the clock edge without changing the sampled value.
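The setup-time definition above amounts to a one-dimensional minimization over a sweep of data-to-clock offsets. The Python sketch below illustrates the idea; the function name is hypothetical and the sweep values are invented for illustration only (they are not measurements from this study). It assumes every pair in the sweep corresponds to a simulation in which the flip-flop captured the correct value.

```python
def find_setup_time(sweep):
    """Return (setup time, output delay, total delay) in ps.

    `sweep` is a list of (t_d_ck, t_ck_q) pairs in ps, one per simulation
    run that captured the correct value. The setup time is the t_d_ck
    that minimizes t_d_ck + t_ck_q.
    """
    t_d_ck, t_ck_q = min(sweep, key=lambda pair: pair[0] + pair[1])
    return t_d_ck, t_ck_q, t_d_ck + t_ck_q


# Hypothetical sweep: as the data arrives closer to the clock edge,
# the clock-output delay grows, slowly at first and then sharply.
example_sweep = [
    (100.0, 60.0),
    (80.0, 61.0),
    (60.0, 63.0),
    (40.0, 68.0),
    (30.0, 80.0),
    (20.0, 110.0),
]

setup_time, output_delay, total_delay = find_setup_time(example_sweep)
```

The same routine applies to the hold time if the pairs hold (tCk-D, tCk-Q) instead.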
We also measured the power consumption of the five flip-flops during the simulations. We included the contributions of the input and output buffers in the testbench (Fig. 9), since different designs have different internal loads. The power consumption of a flip-flop depends on the utilization, or the probability that the data changes within a given clock cycle. We determined the power consumption at 0% and 100% utilization, and averaged the two values to find the power consumption for random data. Finally, we multiplied this number by the total delay to compute the power-delay product. This parameter measures the energy efficiency of the circuit. Table I presents the results of the simulations. As shown, the proposed flip-flop achieves above-average to excellent results in almost all parameters. For example, only the SSTC requires fewer transistors.
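The power-delay product described above can be written out as a short calculation. The following Python sketch is illustrative only: the function name is hypothetical and the input values are placeholders, not entries from Table I. With power in µW and delay in ps, the product comes out in attojoules (1 µW × 1 ps = 10^-18 J).

```python
def power_delay_product(p_0pct_uw, p_100pct_uw, total_delay_ps):
    """Return (average power in µW, power-delay product in aJ).

    The power for random data is taken as the mean of the values at
    0% and 100% utilization, i.e. the 50% utilization case.
    """
    p_avg_uw = (p_0pct_uw + p_100pct_uw) / 2.0
    return p_avg_uw, p_avg_uw * total_delay_ps


# Placeholder inputs chosen for easy arithmetic, not measured data.
p_avg, pdp = power_delay_product(
    p_0pct_uw=4.0, p_100pct_uw=20.0, total_delay_ps=100.0
)
```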
The setup time varies widely between the five alternatives: from -9.0 ps for the SAFF2 to 99.5 ps for the SSTC. The proposed design falls in the middle at 42.4 ps. The hold time ranges from -20.9 ps for the basic flip-flop to 45.1 ps for the SSTC. The proposed design has a hold time of -18.4 ps, very close to the basic flip-flop. The output delay does not show as much variation as the setup time or the hold time, although the proposed flip-flop has the best value at 66.7 ps. As a result, the SAFF2 has the lowest total delay, the SSTC has the highest, and the proposed flip-flop again falls in the middle. The experimental results show a clear tradeoff between delay and power. The power consumption at 0% utilization is the lowest for the SSTC, and the highest for the SAFF1 and SAFF2. At 100% utilization, the proposed flip-flop has the lowest power consumption at 20.4 µW. This design also emerges on top for the case of 50% utilization. A plot of the total delay versus the power consumption appears in Fig. 12. Each vertical bar denotes the range of power consumption from 0% to 100% utilization.
We also compared the power consumption of the single-ended and differential clock buffers in Fig. 7. Driving an equivalent load of eight flip-flops, the single-ended circuit consumed 36.8 µW, whereas the differential circuit consumed 48.6 µW. The difference, 11.8 µW, is small when divided among the eight flip-flops. Hence, the differential clock generator carries only a small penalty in incremental power consumption.
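The amortization argument above reduces to a single division. The Python sketch below uses the two buffer power figures and the load of eight flip-flops quoted in the text; the function name is hypothetical, and the even split of the extra power across the flip-flops is an assumption of this illustration.

```python
def clock_penalty_per_flip_flop(p_single_uw, p_diff_uw, n_flip_flops):
    """Extra clock-buffer power (µW) attributed to each flip-flop,
    assuming the difference is shared equally among the loads."""
    return (p_diff_uw - p_single_uw) / n_flip_flops


# Values quoted in the text: 36.8 µW (single-ended), 48.6 µW
# (differential), each driving eight flip-flops.
penalty_uw = clock_penalty_per_flip_flop(36.8, 48.6, 8)
```

This works out to roughly 1.5 µW of additional clock power per flip-flop.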
IV. CONCLUSION
In this paper, we have proposed a novel differential flip-flop for deeply pipelined systems. We compared the circuit with four well-known alternatives, including the SSTC and two versions of the SAFF. Circuit simulations in 90-nm CMOS demonstrated that the proposed design had above-average delay characteristics and the best power consumption. Thus, the design also achieved the best power-delay product. The low transistor count and extensive use of minimum-size devices translate into low area overhead. In addition, all nodes have full rail-to-rail swing, increasing the noise immunity in high-performance datapaths. While the proposed design does require a separate circuit to generate the differential clock inputs, we found that the additional power consumed by this clock generator is very small when amortized among multiple flip-flops.
One difference between this study and related work such as [9] is that the flip-flops do not drive a large output load. Thus, we could use small transistors throughout the five circuits. Many other studies assume that the flip-flops are part of a standard-cell library, and hence size the transistors for greater driving capacity. It is possible that the relative performance of the five designs might be different in this case. However, the cited study also found that the SAFF achieved better performance than the SSTC, so we expect that the proposed design would still achieve good results.
In future work, we will conduct further tests to analyze the proposed flip-flop under different operating conditions. For example, we will determine how efficiently the design can drive large output loads, or function with different transistor sizes. We will also explore the effect of misaligned inputs or outputs on the overall performance. In addition, we plan to implement the flip-flops on a prototype chip and compare the physical measurements with the simulated results.
As a final remark, we have integrated the proposed differential flip-flop into a pipelined medium-grain reconfigurable architecture [10]. The low transistor count, small power-delay product, and large noise margins make the design ideal for shift registers as well as pipeline latches.
