Abstract-In this paper, we present a differential current-mode pulsed flip-flop (DCMPFF) for low-power clock distribution using a representative 45nm CMOS technology. Experimental results show that the DCMPFF has 47% faster clock-to-output (CLK-Q) delay than a traditional voltage-mode (VM) pulsed flip-flop. When the DCMPFF is integrated with a differential current-mode clock distribution, the differential technique saves 62% and 17% power compared to a conventional VM and a previous current-mode (CM) clock network, respectively.
I. INTRODUCTION
The clock distribution network (CDN) is the most crucial network in synchronous VLSI design, as it is the basic signaling network for every synchronous block and strongly affects overall system power and performance. In terms of signaling type, clocking can be either voltage-mode (VM) or current-mode (CM). Although VM clocking is widely used due to its compatibility with standard VM logic networks, CM clocking can play an important role in low-power systems. CM signaling offers many potential advantages over VM techniques, such as higher operating speed [1], low-voltage operation [2], and ease of processing [3].
Global interconnect power and latency are increasing in traditional VM signaling schemes [4]. Systems-on-chips (SOCs) add more functionality, which means chip sizes stay roughly constant while wire length increases relative to the chip's planar dimensions. Because of this, the latency of RC lines grows linearly with wire length [4] even when properly sized repeaters are used. An immediate solution is to use wide wires, but this results in higher energy per bit because of the large rail-to-rail voltage swing. An alternative signaling scheme such as CM, however, can eliminate transmission line repeaters in addition to decreasing the necessary voltage swing, which significantly reduces power [5]-[8].
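The latency argument above can be illustrated with a first-order RC model. The following is a minimal sketch, not taken from the paper: the per-millimeter resistance and capacitance, the repeater spacing, and the repeater delay are hypothetical values chosen only to show the scaling trends. It uses the common approximation that the 50% delay of a distributed RC line is about 0.38 times the product of its total resistance and capacitance.

```python
def unrepeated_delay(length_mm, r_per_mm, c_per_mm):
    """Approximate 50% delay of a distributed RC line (about 0.38 * R * C)."""
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def repeated_delay(length_mm, r_per_mm, c_per_mm, segment_mm, t_repeater):
    """Delay of the same line with identical repeaters every segment_mm."""
    n_segments = length_mm / segment_mm
    per_segment = unrepeated_delay(segment_mm, r_per_mm, c_per_mm) + t_repeater
    return n_segments * per_segment


if __name__ == "__main__":
    R_PER_MM = 100.0      # ohm/mm, hypothetical
    C_PER_MM = 0.2e-12    # F/mm, hypothetical
    SEGMENT_MM = 0.5      # repeater spacing, hypothetical
    T_REPEATER = 10e-12   # s per repeater, hypothetical
    for length in (1.0, 2.0, 4.0):
        bare = unrepeated_delay(length, R_PER_MM, C_PER_MM)
        rep = repeated_delay(length, R_PER_MM, C_PER_MM, SEGMENT_MM, T_REPEATER)
        print(f"{length:.0f} mm: unrepeated {bare * 1e12:.1f} ps, "
              f"repeated {rep * 1e12:.1f} ps")
```

Under these assumptions the unrepeated delay grows with the square of the wire length, while the repeated delay grows linearly, which is the best case that the text refers to.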
We can categorize signaling as differential or non-differential (single-ended). Differential clocks use two wires to send a pair of complementary clock signals. Differential signaling is more robust than single-ended signaling against electromagnetic interference, supply voltage fluctuations, and other sources of common-mode noise. Differential CM (DCM) signaling has better noise immunity compared to a single-ended CM scheme [7], [9]. However, this comes at the cost of double the wiring resources and increased wiring complexity.
In this paper, we extend the de novo CM clocking concept [5] to implement and analyze the first DCM clock distribution and a new DCM pulsed D-type FF. The clock (CLK) input to the FF is a CM receiver and the data input (D) and output (Q) are VM. In particular, the key contributions of this paper are:
• The first demonstration of a differential current-mode clocked FF.
• The first demonstration of a symmetric H-tree differential current-mode CDN.
• The effective integration of the DCM FF with VM CMOS logic.
The rest of the paper is organized as follows: Section II gives a brief overview of some existing signaling schemes. Section III and Section IV propose our DCM FF and CDN, respectively. Section V compares our new FF and CDN with existing schemes. Section VI investigates the noise and reliability of the proposed system. Finally, Section VII concludes the paper.
II. OVERVIEW OF EXISTING SIGNALING SCHEMES
Unlike traditional buffer-based interconnect signaling, DCM signaling uses a differential CM transmitter (Tx) that sends complementary current pulses at a very low voltage swing into a pair of interconnect wires. The interconnect is held at roughly the same voltage and is unbuffered. At the receiving end, a differential CM receiver (Rx) senses the two complementary currents and ideally converts them into two differential voltages or a single-ended, full-swing output voltage. A typical non-clock differential CM signaling scheme is shown in Figure 1 [7]. This scheme uses a self-level-converted driver circuit that limits the output voltage swing. Finally, two diode-connected transistor pairs drive the interconnect. However, this kind of driver does not provide sufficient driving capability for large loads and is highly sensitive to noise [10]. This scheme uses a low-swing differential CM Rx circuit [7]. In order to increase the robustness of the design, the Rx uses both a common-gate and a common-source amplifier configuration. However, the Rx consumes a significant amount of static power due to double current-mirror stages.
Another previous work used differential current-sensing for the interconnect signaling based on a modified clamped bit-line sense amplifier (MCBLSA) Rx [9]. It utilized the traditional "fanout of four" (FO4) sizing rule for a CMOS buffer chain to design the driver. However, there is no real guideline for designing the Tx for different sized interconnects. Moreover, the Tx drives static current into the interconnect even though the current is useful during only a fraction of the cycle, which results in additional power consumption. The Rx circuit requires an equalizing signal that creates a metastable phase, while the differential input currents break this metastability and help the Rx to produce two complementary outputs. A self-level-converted driver circuit transmits two low-swing voltages and the Rx circuit amplifies the difference between them to reproduce the full-swing output voltage [7].
However, this scheme suffers significant static power loss in the metastable phase and may also switch next-stage buffers or latches [11]. The previous differential current-mode schemes, however, were one-to-one data connections, whereas clock networks are, by definition, a one-to-many signal distribution. A one-to-many CM clocking scheme based on a CM current-pulsed FF [5] offered large CDN power savings compared to a VM scheme. However, it consumes high static power and is highly susceptible to noise. Our differential CM scheme addresses these issues.
III. DIFFERENTIAL CURRENT-MODE PULSED FLIP-FLOP
We propose the first differential CM pulsed FF (DCMPFF) in Figure 2(a). The DCMPFF extends the previous single-input-current CM pulsed FF (CMPFF) [5] to have two complementary input currents, I(IN+) and I(IN-). These inputs can be either positive or negative depending on the current direction. However, the DCMPFF is only sensitive when I(IN+) has a push-current and I(IN-) has a pull-current, which mimics edge-triggered behavior.
The DCMPFF has a current-comparator (CC) with two reference voltage generators, an inverter-amplifier (amp), an output stage, and a static storage cell. An enable (EN) signal activates the DCMPFF while the CC uses the push-pull current as the input clock to provide a full-swing output voltage depending on the data input.
A reference voltage generator is built using a diode-connected PMOS-NMOS pair (or polysilicon resistors) as shown in Figure 2(a). The two reference voltage generators create two static currents in PMOS M2 and NMOS M3 and also provide a low-impedance input. The CC compares the differential current using the inverting amp (M6-M7) at node C. After the two-stage amplification, a buffer provides the drive required to generate a full-swing local clock pulse (CLKP) that activates the output stage. A feedback connection to M5 limits the CLKP pulse to less than 50% duty cycle. A transmission gate output stage latches data into a storage cell.
The use of a differential input current is more robust to noise than a single-ended scheme, as discussed and analyzed further in Section VI. The complementary push-pull currents also help simplify the design of the current Tx, which can generate the currents from a single input voltage. The CC compares two complementary currents that are combined using an inverter amplifier, which enables smaller transistors in the CC (M2-M3) compared to the prior single-ended CMPFF CC [5]. Due to the lower logical effort of M2-M3, the DCMPFF requires less input current and consumes less power.
Representative simulation waveforms of the proposed DCMPFF are shown in Figure 2(b) and confirm the internal current-to-voltage conversion. The internally generated CLKP signal triggers the data storage, which is enabled with EN. The amplitude of the two input currents affects the FF performance by changing the operating point of M2-M3.
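The capture condition described in this section can be summarized with a purely behavioral model. The sketch below is an illustration under stated assumptions, not the transistor-level circuit: the class name, the sign convention (positive current means push, negative means pull), and the 1.0µA detection threshold are all hypothetical. It also does not model the feedback path that shortens CLKP; the data input is simply sampled whenever a valid push-pull pair is present.

```python
from dataclasses import dataclass


@dataclass
class BehavioralDCMPFF:
    """Behavioral stand-in for the DCMPFF (hypothetical model, not the circuit)."""
    threshold_ua: float = 1.0  # hypothetical detection threshold in microamps
    q: int = 0                 # stored output value

    def step(self, i_in_plus_ua, i_in_minus_ua, d, en=1):
        # Assumed sign convention: positive current = push, negative = pull.
        push = i_in_plus_ua >= self.threshold_ua
        pull = i_in_minus_ua <= -self.threshold_ua
        # Capture D only when enabled and a valid push-pull pair is present.
        if en and push and pull:
            self.q = d
        return self.q


if __name__ == "__main__":
    ff = BehavioralDCMPFF()
    print(ff.step(+1.8, -1.8, d=1))        # push-pull pulse: captures D
    print(ff.step(-1.8, +1.8, d=0))        # reversed currents: holds Q
    print(ff.step(+1.8, -1.8, d=0, en=0))  # disabled: holds Q
    print(ff.step(+1.8, -1.8, d=0))        # push-pull pulse: captures D
```

The reversed-current case corresponds to the opposite clock phase, during which the flip-flop holds its state.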
IV. DIFFERENTIAL PULSED CURRENT TRANSMITTER AND DISTRIBUTION
A differential clocking scheme requires a differential current transmitter (DCTx) that can efficiently provide differential push-pull current into the interconnect and distribute enough current to each sink. The DCTx is a voltage-to-current converter that receives a traditional voltage-mode clock (CLK) from a PLL and converts it into a complementary push-pull current signal with minimal voltage swing in the interconnect line. The entire proposed scheme with the DCMPFF, DCTx, and CDN is shown in Figure 3(a). The DCM scheme is based on a CDN that has similar impedance at each branch, resulting in equal current to each DCMPFF. The proposed DCTx extends the previously reported pulsed current Tx [5] by using two extra inverters and an extra driver circuit (M3-M4) to generate two complementary currents. The second (differential) current has the same amplitude, with a phase difference of one inverter delay.
In order to have equal differential currents, the DCTx uses similar sizes for the M1-M2 and M3-M4 drivers. The driver sizes are adjusted to compensate for current loss in the long transmission line and to supply the required amount of current to each sink. Appropriate sizing of the wires is important for both the reliability and the performance of the CDN. A narrow or highly resistive network will produce a distorted output current, while a wide network has low resistance and avoids electromigration problems.
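A rough current budget follows from the equal-impedance assumption. The sketch below assumes ideal, equal current division among the sinks and lumps all line losses into a single hypothetical `loss_fraction` parameter; the function names are illustrative. The example call uses the 1024 sinks and the ±1.8µA amplitude reported in Section V, on the assumption that every sink needs that amplitude.

```python
def sink_current_ua(tx_current_ua, n_sinks, loss_fraction=0.0):
    """Current reaching each sink under ideal, equal division."""
    delivered = tx_current_ua * (1.0 - loss_fraction)
    return delivered / n_sinks


def required_tx_current_ua(per_sink_ua, n_sinks, loss_fraction=0.0):
    """Transmitter current needed so that each sink receives per_sink_ua."""
    return per_sink_ua * n_sinks / (1.0 - loss_fraction)


if __name__ == "__main__":
    # 1024 sinks at 1.8 uA each, lossless line (idealized).
    print(required_tx_current_ua(1.8, 1024))
    # Same budget with a hypothetical 10% loss in the line.
    print(required_tx_current_ua(1.8, 1024, loss_fraction=0.10))
```

In the lossless case the transmitter must source about 1.84mA per polarity; any real loss in the line raises that figure, which is why the driver sizes are adjusted for it.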
V. SIMULATION RESULTS AND ANALYSIS
The circuits are simulated in HSpice with a 45nm CMOS technology model [12]. In order to compare the power, performance, and area, we implemented several designs in layout: a master-slave D flip-flop (MSDFF), a CMPFF [5], and the proposed DCMPFF. The layout areas, nominal CLK-Q delay, data-to-Q (D-Q) delay, and total power are listed in Table I. The performance of the FFs was evaluated considering clock frequencies from 2-5GHz and a 1V supply voltage. The power considers input data at 100% activity with a four-FF load. The DCMPFF consumes 6% less silicon area compared to the previous CMPFF and uses 23 transistors, while the MSDFF and CMPFF use 20 and 25 transistors, respectively.
Table I shows the nominal CLK-Q delay for both high-to-low and low-to-high Q transitions. Compared to the previous single-ended CMPFF, which needs an input current of ±2.3µA amplitude, the DCMPFF achieves its nominal CLK-Q delay with only ±1.8µA and a 70ps pulse width. Clearly, the DCMPFF has a lower CLK-Q delay than the CMPFF but is only slightly slower than the MSDFF.
For each FF, we measured the setup time (t_s) and hold time (t_h). Both use the common definition: the time margin at which the CLK-Q delay increases by 10% beyond nominal. The t_s and t_h of the DCMPFF are −20ps and 95ps, respectively. The t_s of the DCMPFF is 1.95× lower than that of the traditional MSDFF, while the t_h of the DCMPFF is 1.34× higher than that of the CMPFF. We also measure the D-Q delay of each FF. The D-Q delay of the DCMPFF is 66% faster than that of the VM MSDFF.
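The 10% criterion for setup time can be expressed as a simple sweep. The sketch below is illustrative only: `toy_clk_q_delay` is a hypothetical piecewise-linear stand-in for a simulated CLK-Q delay curve, with a made-up 50ps nominal delay, and the offset convention (positive means data arrives before the clock) is an assumption. In practice each call would be replaced by a circuit simulation at that data-to-clock offset.

```python
def find_setup_time(clk_q_delay, nominal_ps, offsets_ps, degrade=0.10):
    """Smallest data-to-clock lead time whose CLK-Q delay stays within
    (1 + degrade) * nominal, sweeping from early data toward late data."""
    limit = nominal_ps * (1.0 + degrade)
    last_pass = None
    for t in sorted(offsets_ps, reverse=True):  # data early -> data late
        if clk_q_delay(t) > limit:
            break                               # first offset that fails
        last_pass = t
    return last_pass


def toy_clk_q_delay(offset_ps):
    """Hypothetical delay curve: flat at 50 ps, then rising 2 ps per ps
    once the data arrives more than 15 ps after the clock."""
    if offset_ps >= -15:
        return 50.0
    return 50.0 + 2.0 * (-15 - offset_ps)


if __name__ == "__main__":
    print(find_setup_time(toy_clk_q_delay, 50.0, range(-40, 51)))
```

With this toy curve the sweep returns a negative value, meaning data may arrive after the clock edge; this is the same qualitative behavior as the negative t_s reported for the pulsed DCMPFF. The hold time can be found with the mirrored sweep.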
We measured the total power consumption of each FF considering the input clock and data switching. For VM FFs, we used a traditional approach [13]. For CM FFs, we used a CM Tx that can produce the required amount of current and the bias voltage to drive the CM FF. First, we measure the total power consumption including the Tx and CM FFs. Then we remove the FFs to measure the Tx power. The difference between these two results is the CM FF power.
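The subtraction procedure for CM FF power amounts to the following arithmetic. This is a minimal sketch; the function name and the numeric values in the example are hypothetical and are not measurements from the paper.

```python
def cm_ff_power(p_total_w, p_tx_only_w, n_ffs=1):
    """Per-FF power from two runs: Tx with FFs attached, and Tx alone."""
    return (p_total_w - p_tx_only_w) / n_ffs


if __name__ == "__main__":
    # Hypothetical readings: 120 uW with four FFs attached, 80 uW for Tx alone.
    print(cm_ff_power(120e-6, 80e-6, n_ffs=4))
```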
In the power measurement, we also consider both the static and dynamic power of the VM and CM FFs. At a 2GHz clock frequency, the DCMPFF consumes 39.3% and 4.6% more power compared to the MSDFF and the traditional PFF, respectively. However, the power consumption of the DCMPFF is comparable to that of the MSDFF at 5GHz. At the same frequency, the DCMPFF consumes 33% and 41% less power compared to the traditional PFF and the CMPFF [5], respectively.
In order to validate the functionality of the DCTx and the proposed DCMPFF in a CDN, we implemented an equal-impedance binary-tree network spanning 1mm × 1mm. Each branch of the clock tree is modeled as a lumped 3-component Π-model, and the branches are then connected together to make a distributed CDN model. The functional simulation results with the resulting output current are shown in Figure 3(b).
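The way lumped Π segments compose into a distributed model can be sketched as follows. This is an illustration under assumptions: the per-millimeter wire parameters are hypothetical, the driver is treated as ideal, and the Elmore delay is used only as a sanity check on the composition, not as a metric used in the paper.

```python
def pi_segment(r_per_mm, c_per_mm, length_mm):
    """Lumped 3-component Pi model: (C/2, R, C/2) for one wire segment."""
    r = r_per_mm * length_mm
    c = c_per_mm * length_mm
    return (c / 2.0, r, c / 2.0)


def elmore_delay(segments, c_load=0.0):
    """Elmore delay to the far end of a chain of Pi segments (ideal driver)."""
    delay = 0.0
    r_upstream = 0.0
    for i, (_c_near, r, c_far) in enumerate(segments):
        r_upstream += r
        # Capacitance at the node after this resistor: this segment's far cap
        # plus the next segment's near cap, or the load at the end of the chain.
        node_cap = c_far
        if i + 1 < len(segments):
            node_cap += segments[i + 1][0]
        else:
            node_cap += c_load
        delay += r_upstream * node_cap
    return delay


if __name__ == "__main__":
    R_PER_MM, C_PER_MM = 100.0, 0.2e-12  # hypothetical wire parameters
    one = [pi_segment(R_PER_MM, C_PER_MM, 1.0)]
    four = [pi_segment(R_PER_MM, C_PER_MM, 0.25)] * 4
    print(elmore_delay(one), elmore_delay(four))
```

Both chains describe the same 1mm wire and give the same Elmore delay of half the product of total resistance and total capacitance, which suggests that chaining per-branch Π segments is a consistent way to build the distributed model.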
Our CDN analysis uses a 5-level H-tree distributed over a 7.69mm × 7.69mm area for both the single-ended CM and VM CDNs, but buffers drive the VM CDN instead of the CM Tx circuit. In order to minimize later-stage short-circuit power and any timing violations, the VM buffered network is optimized for an output clock slew of less than 10% of the minimum operating clock period. In the differential CDN, two such tree networks are routed. All CDNs drive 1024 FFs. Table II shows the power breakdown of the VM, CM, and DCM CDN simulations for clock frequencies ranging from 2-5GHz. On average, our DCM CDN consumes less power than both the single-ended CM and VM CDNs at all frequencies. The VM CDN consumes more power than the CM/DCM CDNs because of its full voltage swing (0-to-Vdd), whereas the CM/DCM CDNs have negligible voltage swing, as shown in Figure 3(b). The proposed DCM CDN consumes less power than the CM CDN because of the high static power consumption of the CMPFFs. As expected, at low frequency the total power of the DCMPFFs is higher than in the VM case. However, at high frequencies, the power of the DCMPFFs is lower than that of both VM FFs. The VM interconnect power dominates the CM/DCM FF power even at low frequencies. The real advantage, however, is that the DCM CDN power does not increase with frequency like the VM CDN power. Since the fluctuation of the common-mode voltage is relatively small, the dynamic power consumption of the DCM CDN is negligible. At 2GHz in particular, the DCM CDN system exhibits 16% to 47% total power savings compared to the single-ended CM and VM CDNs. As expected, the power savings increase to 17% to 72% at the higher 5GHz clock frequency.
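The frequency dependence discussed above follows from the first-order expression for switching power. The sketch below is an illustration, not a reproduction of Table II: the total wire capacitance and the 50mV swing are hypothetical, and only the 1V supply and the 2-5GHz range come from the text. It assumes that a full-swing clock net dissipates C·Vdd²·f and that a low-swing net drawing its charge from the same supply dissipates C·Vdd·Vswing·f.

```python
def vm_dynamic_power(c_total_f, vdd_v, freq_hz):
    """Switching power of a full-swing clock net: C * Vdd^2 * f."""
    return c_total_f * vdd_v ** 2 * freq_hz


def low_swing_dynamic_power(c_total_f, vdd_v, v_swing_v, freq_hz):
    """Switching power when the wire swings only v_swing: C * Vdd * Vswing * f."""
    return c_total_f * vdd_v * v_swing_v * freq_hz


if __name__ == "__main__":
    C_TOTAL = 10e-12   # F, hypothetical total wire capacitance
    VDD = 1.0          # V, supply used in the simulations
    V_SWING = 0.05     # V, hypothetical residual swing on a CM wire
    for f_ghz in (2, 3, 4, 5):
        f = f_ghz * 1e9
        vm = vm_dynamic_power(C_TOTAL, VDD, f)
        cm = low_swing_dynamic_power(C_TOTAL, VDD, V_SWING, f)
        print(f"{f_ghz} GHz: full swing {vm * 1e3:.1f} mW, "
              f"low swing {cm * 1e3:.2f} mW")
```

In this model the full-swing power scales linearly with frequency, while the low-swing term stays small across the range, so the total for a CM or DCM network is set mainly by static current.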
VI. NOISE AND RELIABILITY

A. Supply Voltage Fluctuation
We studied the response of the proposed DCM scheme to supply voltage variation. We considered a ±10% voltage fluctuation from the nominal supply voltage. The delay variation for the traditional buffered VM scheme ranges from -21ps to 12ps compared to the nominal delay. The delay variation in the single-ended CM scheme ranges from -23ps to 28ps. The proposed DCM scheme has a delay variation from -23ps to 22ps compared to the nominal-voltage delay.
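The reported ranges can be compared directly with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper names are illustrative. Relative to the single-ended CM scheme, the positive excursion of the DCM scheme drops from 28ps to 22ps (about 21%) and the peak-to-peak spread from 51ps to 45ps (about 12%). Reading the 21% figure in the conclusion as the positive excursion is an assumption, not something the text states.

```python
RANGES_PS = {"VM": (-21, 12), "CM": (-23, 28), "DCM": (-23, 22)}


def peak_to_peak(delay_range):
    """Total spread between the fastest and slowest corner, in ps."""
    lo, hi = delay_range
    return hi - lo


def percent_reduction(reference, value):
    """How much smaller value is than reference, in percent."""
    return 100.0 * (reference - value) / reference


if __name__ == "__main__":
    for name, r in RANGES_PS.items():
        print(f"{name}: peak-to-peak {peak_to_peak(r)} ps, positive excursion {r[1]} ps")
    cm, dcm = RANGES_PS["CM"], RANGES_PS["DCM"]
    print(f"positive excursion reduction vs CM: {percent_reduction(cm[1], dcm[1]):.1f}%")
    print(f"peak-to-peak reduction vs CM: "
          f"{percent_reduction(peak_to_peak(cm), peak_to_peak(dcm)):.1f}%")
```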
B. Process Sensitivity
It is impractical to analytically predict the behavior of a large network because the mismatch errors of individual devices combine. Hence, the resiliency of the proposed DCM scheme is demonstrated through non-uniform Monte-Carlo simulation of process variation and mismatch. The proposed DCMPFF has a mean CLK-Q delay of 48ps with a standard deviation of 7ps in 1000 runs. This result is better than that of the recently reported CMPFF, which has a mean CLK-Q delay of 55ps with a standard deviation of 7.4ps in 1000 runs.
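The post-processing of such a Monte-Carlo run is a simple summary over the per-run delays. The sketch below is illustrative only: the real samples would come from the circuit simulator, so seeded Gaussian draws are used here as a hypothetical stand-in, with the reported mean and standard deviation as their parameters.

```python
import random
import statistics


def summarize_delays(samples_ps):
    """Return (mean, sample standard deviation) of CLK-Q delays in ps."""
    return statistics.mean(samples_ps), statistics.stdev(samples_ps)


def synthetic_runs(mean_ps, sigma_ps, n_runs, seed=0):
    """Hypothetical stand-in for simulator output: seeded Gaussian draws."""
    rng = random.Random(seed)
    return [rng.gauss(mean_ps, sigma_ps) for _ in range(n_runs)]


if __name__ == "__main__":
    for name, mean_ps, sigma_ps in (("DCMPFF", 48.0, 7.0), ("CMPFF", 55.0, 7.4)):
        mean, sigma = summarize_delays(synthetic_runs(mean_ps, sigma_ps, 1000))
        print(f"{name}: mean {mean:.1f} ps, std {sigma:.1f} ps")
```

Because the samples are synthetic, the printed values only approximate the parameters they were drawn from; they do not add evidence beyond the reported figures.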
VII. CONCLUSION
In this paper, we presented a DCM distribution as an alternative to conventional repeater-based VM or CM distribution. The proposed DCM scheme uses a new DCMPFF which is 47% faster, consumes 33% less power, and requires 9% less silicon area compared to a traditional PFF at 5GHz. The proposed DCM scheme saves 41% to 72% power compared to a traditional single-ended VM clock at 2−5GHz and consumes 17% less power on average compared to a previously reported single-ended CM scheme. Additionally, it has 21% less delay variation due to supply voltage fluctuation.
