DCMCS: Highly Robust Low-Power Differential Current-Mode Clocking and Synthesis
with wire length [5] despite using properly sized repeaters. An immediate solution is to use wide wires, but this results in higher energy per bit because of the large rail-to-rail voltage swing. An alternative signaling scheme such as current-mode (CM) signaling, however, can eliminate transmission-line repeaters while also decreasing the necessary voltage swing, which significantly reduces power [6]-[9]. We can categorize signaling as differential or nondifferential (single ended). Differential clocks use two wires to send a pair of complementary clock signals. Differential signaling has higher reliability under electromagnetic interference, supply voltage fluctuations, and other sources of common-mode noise compared to single-ended signaling [10]-[13]. Differential CM (DCM) signaling has better noise immunity compared to a single-ended CM scheme [8], [14], [15]. However, this comes at the cost of double wiring resources and increased wiring complexity. As a result, the traditional clock routing techniques are limited to single-ended clocking [16]-[19].
In the early years, CM signaling was applied to off-chip interconnects [20]. However, over the past decade, increasing attention has been paid to on-chip CM signaling. Researchers have shown tremendous power-performance improvement over VM signaling by applying CM signaling to a symmetric network [6]-[8], [21].
In this paper, we extend the de novo CM clocking concept [6] to implement and analyze the first DCM clock distribution and a new DCM-pulsed D-type flip-flop (FF). The clock (CLK) input to the FF is a CM receiver (Rx) and the data input (D) and output (Q) are VM. In addition, we propose the first electromigration (EM) aware DCM clocking and synthesis (DCMCS) methodology applicable to any network (symmetric or asymmetric). In particular, the key contributions of this paper are as follows:
1) the first demonstration of a DCM clocked FF;
2) the first demonstration of a symmetric H-tree DCM CDN;
3) the effective integration of the DCM FF with VM CMOS logic;
4) the first demonstration of DCM clocking on industrial testbenches;
5) the first demonstration of EM aware wire sizing for DCM clocking.
The rest of this paper is organized as follows. Section II gives a brief overview of some existing signaling schemes. Sections III and IV propose our DCM FF and CDN, respectively. Section V introduces the automatic DCM CDN generation technique. Section VI compares our new FF and CDN with existing schemes. Section VII investigates the noise and reliability of the proposed system. Finally, Section VIII concludes this paper.
Fig. 1. Self-level-converted driver circuit transmits two low-swing voltages and the Rx circuit amplifies the difference between them to reproduce the full-swing output voltage [8].
Fig. 2. Clamped bitline sense amplifier Rx-based DCM scheme uses a factor of four sizing rule in cascaded inverters that drive the long interconnect [14].
II. OVERVIEW OF EXISTING SIGNALING SCHEMES
Unlike traditional buffer-based interconnect signaling, DCM signaling uses a DCM transmitter (Tx) that sends complementary current pulses at a very low-voltage swing into a pair of interconnect wires. The interconnect is roughly held at the same voltage and is unbuffered. At the receiving end, a DCM Rx senses the two complementary currents and ideally converts them into two differential voltages or a single-ended, full-swing output voltage. A typical nonclock DCM signaling scheme is shown in Fig. 1 [8]. This scheme uses a self-level-converted driver circuit that limits the output voltage swing. Finally, two diode-connected transistor pairs drive the interconnect. However, this kind of driver does not provide sufficient driving capability for large loads and is highly sensitive to noise [22]. This scheme uses a low-swing DCM Rx circuit [8]. In order to increase the robustness of the design, the Rx uses both a common-gate and a common-source amplifier configuration. However, the Rx consumes a significant amount of static power due to double current-mirror stages.
Another prior strategy that uses differential current sensing for interconnect signaling is shown in Fig. 2 [14]. The scheme is based on a modified clamped bitline sense amplifier Rx [14]. It utilizes the traditional "fan-out of four" sizing rule for a CMOS buffer chain to design the driver. However, there is no real guideline to design the Tx for different-sized interconnects. Moreover, the Tx drives static current into the interconnect, while the current is useful during only a fraction of the cycle, which results in additional power consumption. The Rx circuit requires an equalizing signal that creates a metastable phase, while the differential input currents break this metastability and help the Rx to produce two complementary outputs. However, this scheme suffers significant static power loss in the metastable phase and also may switch the next stage's buffer or latches [23].
The previous DCM schemes, however, were one-to-one data connections, whereas clock networks are, by definition, a one-to-many signal distribution. A one-to-many CM clocking scheme based on a CM current-pulsed FF [6] offers a large CDN power savings compared to a VM scheme. However, it consumes high static power and is highly susceptible to noise. Our DCM scheme addresses these issues.
III. DIFFERENTIAL CURRENT-MODE PULSED FLIP-FLOP
We propose the first DCM-pulsed FF (DCMPFF) in Fig. 3(a). The DCMPFF extends the previous single input current CM pulsed FF (CMPFF) [6], [24] to have two complementary input currents, I(IN+) and I(IN-). These inputs can be either positive or negative depending on the current direction; however, the DCMPFF is sensitive only when I(IN+) has a push current and I(IN-) has a pull current to mimic an edge-triggered behavior.
The DCMPFF has a current comparator (CC) with two reference voltage generators, an inverter amplifier (amp), an output stage, and a static storage cell. An enable (EN) signal activates the DCMPFF, while the CC uses the push-pull current as an input clock to provide a full-swing output voltage depending on the data input.
A reference voltage generator is built using a diode-connected pMOS-nMOS pair (or polysilicon resistors), as shown in Fig. 3(a). The two reference voltage generators create two static currents in pMOS M2 and nMOS M3 and also provide a low-impedance input. The CC compares the differential current using an inverting amp (M6 and M7) at node C. After the two-stage amplification, a buffer provides the required drive to generate a full-swing local clock pulse (CLKP) that activates the output stage. A feedback connection to M5 limits the CLKP pulse to less than 50% of duty cycle. A transmission gate output stage latches data into a storage cell.
The use of a differential input current is more robust to noise compared to a single-ended scheme, which will be discussed and analyzed further in Section VII. The complementary push-pull currents also help simplify the design of the current Tx, which can generate the currents from a single-input voltage.
The CC compares two complementary currents, which are combined using an inverter amplifier that enables smaller transistors in the CC (M2 and M3) compared to the prior single-ended CMPFF CC [6] . Due to the lower logical effort of M2 and M3, the DCMPFF requires less input current and consumes less power. The representative simulation waveforms of the proposed DCMPFF are shown in Fig. 3(b) and confirm the internal current-to-voltage conversion. The internally generated CLKP signal triggers the data storage, which is enabled with EN. The amplitude of the two input currents affects the FF performance by changing the operating point of M2 and M3.
Clock gating is a common technique to reduce CDN power [25]. One of the major advantages of the DCMPFF is that it has an embedded active-low EN signal, which can be used to perform clock gating in a DCM CDN.
IV. DIFFERENTIAL-PULSED CURRENT TRANSMITTER AND DISTRIBUTION
A differential clocking scheme requires a differential current Tx (DCMTx) that can efficiently provide differential push-pull current into the interconnect and distribute enough current to each sink. The DCMTx is a voltage-to-current converter that receives a traditional VM clock (CLK) from a PLL and converts it into a complementary push-pull current signal with minimal voltage swing in the interconnect line. The entire proposed scheme with the DCMPFF, DCMTx, and CDN is shown in Fig. 4(a). The DCM scheme is based on a CDN that has similar impedance at each branch, resulting in equal current to each DCMPFF.
The proposed DCMTx extends the previously reported pulsed current Tx [6] by using two extra inverters and an extra driver circuit (M3 and M4) to generate two complementary currents. The second (differential) current has the same amplitude with one inverter delay of phase difference.
In order to have equal differential current, the DCMTx uses similar sizes for the M1/M2 and M3/M4 driver pairs. The driver sizes are adjusted to compensate for current loss in the long transmission line and to supply the required amount of current to each sink. It is important to size the wires appropriately for both reliability and performance of the CDN. A narrow or highly resistive network will produce distorted output current, while a wide network would be low resistance and not have EM problems.
V. DCM CLOCKING AND SYNTHESIS (DCMCS)
The existing CM and DCM clocking schemes are applicable only to symmetric H-tree networks, while researchers very recently demonstrated a single-ended CM clock synthesis (CMCS) methodology [26] and efficiently applied that to CM clocking in asymmetric networks. However, it ignores the EM effect in wire sizing. Similar to CMCS, the proposed DCM clocking and synthesis (DCMCS) methodology utilizes DCMTx sizing by computing the total admittance (Y_T) of an entire clock network with the DCMPFFs as
where C_{w,j} is the wire capacitance of wire j, α_i is the admittance factor of sink/FF i, and β is a constant. The first part of (1) represents the total input admittance of each DCMPFF, while the latter part represents the total wire admittance of the network. In addition, the proposed DCMCS methodology incorporates EM aware wire sizing to improve the reliability of the design. Fig. 5 shows the DCM CDN generation methodology. Algorithm 1 presents the pseudocode of our DCMCS flow for the entire clock network. The algorithm takes any clock network, EM constraints, or maximum current density (J_max) from International Technology Roadmap for Semiconductors (ITRS) [27] for the corresponding technology, initial wire width (Wire_width) [28], and minimum wire width (W_min) as inputs and returns an EM-aware DCM CDN. In order to implement the testbench/asymmetric networks, the clock tree is routed utilizing zero-skew DME methodology [17], while the final tree nodes are connected with DCMPFFs (Line 6). DCM clocking scheme uses a single differential current Tx to drive the clock network and the DCMPFFs. The DCMCS algorithm calculates Y_T of the network (Line 7) in the totalAdmittance(Tree) method, which applies (1). Then, it determines the initial Tx sizing (T_init) of the network (Line 8) using sizeTx(Y_T). It runs a transient simulation [simulateTransient()] and uses calculateSkew() to measure the initial skew (S_init) (Lines 9 and 10). T_best and (S_best, S_new) are set to the initial values of T_init and S_init, respectively (Line 11). The initial Tx sizing value is also stored in two temporary variables (T_newUp and T_newDown). Then, we recursively size up (increase Tx size 1% from initial sizing) and size down (decrease Tx size 1% from
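The Tx sizing steps described in this section can be sketched in Python as follows. This is an illustrative sketch only, not the authors' Algorithm 1. Three things are assumptions: the additive form used in `total_admittance` (the sum of the α_i plus β times the sum of the C_{w,j}), the linear `size_tx` mapping with its constant `k`, and the stopping rule (stop when neither a 1% larger nor a 1% smaller Tx lowers the skew). The `simulate_skew` callable is a hypothetical stand-in for a transient simulation followed by a skew measurement.

```python
from typing import Callable, Sequence, Tuple


def total_admittance(alphas: Sequence[float],
                     wire_caps: Sequence[float],
                     beta: float) -> float:
    """ASSUMED form of Y_T: sink admittance factors plus beta * total wire cap."""
    return sum(alphas) + beta * sum(wire_caps)


def size_tx(y_total: float, k: float = 1.0) -> float:
    """Hypothetical linear mapping from total admittance to initial Tx size."""
    return k * y_total


def tune_tx(t_init: float,
            simulate_skew: Callable[[float], float],
            step: float = 0.01,
            max_iters: int = 50) -> Tuple[float, float]:
    """Search around t_init in 1% steps and keep the size with the lowest skew."""
    t_best = t_init
    s_best = simulate_skew(t_init)        # initial skew S_init
    t_up = t_down = t_init                # T_newUp and T_newDown
    for _ in range(max_iters):
        t_up *= 1.0 + step                # size up by 1%
        t_down *= 1.0 - step              # size down by 1%
        improved = False
        for candidate in (t_up, t_down):
            s_new = simulate_skew(candidate)
            if s_new < s_best:
                t_best, s_best = candidate, s_new
                improved = True
        if not improved:                  # ASSUMED stopping rule
            break
    return t_best, s_best


if __name__ == "__main__":
    # Toy skew model (made up): skew is minimal at a Tx size of 1.05.
    toy_skew = lambda t: abs(t - 1.05) * 100.0
    print(tune_tx(1.0, toy_skew))
```

In a real flow, `simulate_skew` would wrap the circuit simulator, so each candidate size costs one transient run; the `max_iters` bound keeps that cost predictable.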
VI. SIMULATION RESULTS AND ANALYSIS
The circuits are simulated in HSPICE with a 45-nm CMOS technology model [29] . In order to compare the power, performance, and area, we implemented several designs in layout: a master-slave D FF (MSDFF), a CMPFF [6] , and the proposed DCMPFF. The layout areas, nominal CLK-Q delay, data-to-Q (D-Q) delay, and total power are listed in Table I . The performance of the FFs was evaluated considering clock frequencies from 1-5 GHz and a 1-V supply voltage. The power considers input data at 100% activity with a four-FF load. 
A. DCMPFF Results
The DCMPFF consumes 6% less silicon area compared to the previous CMPFF and uses 23 transistors, while the MSDFF and CMPFF use 20 and 25 transistors, respectively. Fig. 6 shows the layout of the proposed DCMPFF. The CLK-Q delays of the FFs are measured under relaxed timing conditions for both the VM and CM instances. In other words, the data are stable sufficiently before the arrival of the VM clock edge or the CM input current pulse. Table I shows the nominal CLK-Q delay for both high-to-low and low-to-high Q transitions. Compared with the previous single-ended CMPFF, which needs an input current of ±2.3-μA amplitude, the DCMPFF reaches its nominal CLK-Q delay with only ±1.8 μA and a 70-ps pulsewidth. Clearly, the DCMPFF has a lower CLK-Q delay than the CMPFF but is only slightly slower than the MSDFF. For each FF, we measured the setup time (t_s) and hold time (t_h). These use the common definition as the time margin that causes a CLK-Q delay increase of 10% beyond nominal. t_s and t_h of the DCMPFF are −20 and 95 ps, respectively. The setup time of the DCMPFF is 1.95× lower than the traditional MSDFF, while t_h of the DCMPFF is 1.34× higher than the CMPFF. We also measure the D-Q delay of each FF. The D-Q of the DCMPFF is 66% faster than the VM MSDFF.
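The 10% criterion used for the setup time can be illustrated with a short Python sketch. It assumes a sweep of (data-to-clock margin, CLK-Q delay) pairs is already available from simulation; the function name `setup_time` and the sweep values below are hypothetical and are not measured data from this work.

```python
from typing import Iterable, Optional, Tuple


def setup_time(sweep: Iterable[Tuple[float, float]],
               nominal_delay: float,
               tolerance: float = 0.10) -> Optional[float]:
    """Return the smallest margin (ps) whose CLK-Q delay stays within
    (1 + tolerance) * nominal_delay, scanning from relaxed to tight timing.

    A positive margin means the data arrive before the clock event;
    a negative margin means the data arrive after it.
    """
    limit = nominal_delay * (1.0 + tolerance)
    result = None
    # Walk from the most relaxed margin toward tighter ones.
    for margin, delay in sorted(sweep, reverse=True):
        if delay > limit:
            break            # first point that violates the 10% criterion
        result = margin
    return result


if __name__ == "__main__":
    # Made-up sweep: (margin in ps, CLK-Q delay in ps).
    sweep = [(100.0, 48.0), (50.0, 48.0), (0.0, 49.0), (-20.0, 52.0), (-30.0, 60.0)]
    print(setup_time(sweep, nominal_delay=48.0))
```

The hold time can be extracted the same way by sweeping how long the data stay stable after the clock event instead of how early they arrive.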
We measured the total power consumption of each FF considering the input clock and data switching. For VM FFs, we used a traditional approach [31] . For CM FFs, we used a CM Tx that can produce the required amount of current and the bias voltage to drive the CM FF. First, we measure the total power consumption, including the Tx and CM FFs. Then, we remove the FFs to measure the Tx power. The difference between these two results is the CM FF power.
In the power measurement, we also consider both static and dynamic power of VM and CM FFs. At 1-GHz clock frequency, the DCMPFF consumes 40% and 9.6% more power compared to the MSDFF and Tra. PFF, respectively. However, the power consumption of the DCMPFF is comparable to an MSDFF at 5 GHz. At the same frequency, the DCMPFF consumes 33% and 41% less power compared to the Tra. PFF and CMPFF [6], respectively. At low frequencies, the DCMPFF consumes higher power than the VM Tra. PFF and MSDFF due to a high static power overhead. However, the dynamic power of the CM FFs increases with frequency at a slower rate than that of the VM FFs, as shown in the bottom two rows of Table I.
B. H-Tree Distribution
In order to validate the functionality of the DCMTx and the proposed DCMPFF in a CDN, we implemented an equal-impedance binary-tree network spanning 1 mm × 1 mm. Each branch of the clock tree is modeled as a lumped 3-component π-model and then connected together to make a distributed CDN model. The interconnect unit capacitance and resistance values are for 45-nm CMOS technology [29]. The functional simulation results with the resulting output current are shown in Fig. 4(b).
For initial results, our CDN analysis uses a 5-level H-tree distributed in 7.69 mm × 7.69 mm area for both the single-ended CM and VM CDN, but buffers drive the VM CDN instead of the CM Tx circuit. In order to minimize the later stages' short-circuit power and any timing violation, the VM-buffered network is optimized for an output clock signal slew with less than 10% of minimum operating clock period. In the differential CDN, two such tree networks are routed. All CDNs drive 1024 FFs. Table II shows the power breakdown of the VM, CM, and DCM CDN simulations for clock frequencies ranging from 1-5 GHz. On average, our DCM CDN consumes less power than both the single-ended CM and VM CDN for all frequencies. The VM CDN consumes more power than the CM/DCM CDNs because of its full voltage swing (0-to-Vdd), whereas the CM/DCM CDN has negligible voltage swing, as shown in Fig. 4(b). The proposed DCM CDN consumes less power than the CM CDN due to the high static power consumption in the CMPFFs.
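As a minimal sketch of the branch model, a lumped 3-component π-segment places the branch resistance between two shunt capacitors that each carry half of the branch capacitance. The function name `pi_segment` and the unit values in the example are illustrative assumptions, not the 45-nm parameters used in this work.

```python
from typing import Tuple


def pi_segment(length_um: float,
               r_per_um: float,
               c_per_um: float) -> Tuple[float, float, float]:
    """Return (C_near, R_series, C_far) for one wire branch.

    length_um : branch length in micrometers
    r_per_um  : wire resistance per micrometer (ohm/um)
    c_per_um  : wire capacitance per micrometer (fF/um)
    """
    r_series = r_per_um * length_um          # total series resistance
    c_total = c_per_um * length_um           # total wire capacitance
    return c_total / 2.0, r_series, c_total / 2.0


if __name__ == "__main__":
    # Made-up unit values: 0.5 ohm/um and 0.2 fF/um over a 100-um branch.
    print(pi_segment(100.0, 0.5, 0.2))
```

Chaining one such segment per tree branch, with the far capacitor of a parent sharing a node with the near capacitors of its children, gives a distributed network model of the kind described above.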
As expected at low frequency, the total power of the DCMPFF system is comparable to the VM cases, as shown in Fig. 7 . This is because, at low frequencies, the DCMPFF consumes higher power than the VM FFs. However, at high frequencies, the power of DCMPFFs is lower than both the VM FFs, while the power of CMPFFs is higher than the proposed DCMPFFs due to the large static power consumption. The VM interconnect power dominates the CM/DCM FF power even at low frequencies. The real advantage, however, is that the DCM CDN power does not increase with frequency like the VM CDN power. Since the fluctuation of common-mode voltage is relatively small, the dynamic power consumption of the DCM CDN is negligible. At 1 GHz in particular, the DCM CDN system exhibits 5%-22% total power savings compared to different single-ended CM/VM CDNs. As expected, the power saving increases to 24%-72% at the high 5-GHz clock frequency.
C. ISPD Testbench Results
It is clear from Sections VI-A and VI-B that the proposed DCMPFF and the DCM CDN consume lower power than the other VM FFs and VM CDN at higher frequencies (i.e., 5-GHz clock). However, at low 1-GHz clock frequency, the DCMPFF consumes higher power than the VM FFs, resulting in smaller power savings in an H-tree distribution. Hence, it is important to show the effectiveness of the proposed scheme at low 1-GHz frequency on industrial testbenches. For this, we used ISPD 2009 [32] and ISPD 2010 [28] testbenches.
The clock tree and the DCM FFs are driven by a single DCM Tx at the root. The DCM Tx, the tree, and the DCM FFs compose the entire DCM CDN. Fig. 8(a) and (b) shows the resulting DME-routed bufferless DCM CDN for the ISPD 2009 benchmark circuit f11 and the ISPD 2010 benchmark circuit 05, respectively. In the proposed DCMCS scheme, the total power consumption includes the DCM Tx power, the parasitic power, and the total DCM FF power. The VM clocking uses the same minimum wire length DME network [17] ; however, we inserted buffers to meet the slew and skew constraints [16] . In addition, the final tree nodes are connected with the VM FFs.
The proposed DCM clocking consumes lower power than the buffered VM MSDFF and Tra. PFF-based clocking schemes (Tables III and IV). In particular, the proposed DCM clocking saves more than 77% and 40% power compared to the MSDFF system using the ISPD 2009 and 2010 networks, respectively. In addition, the DCMPFF-based clocking saves 79% and 51% power compared to the Tra. PFF-based system using ISPD 2009 and 2010 networks, respectively. As suggested in Section VI-B, the proposed DCM clocking is expected to save considerably more power at higher frequencies.
In addition to power, the proposed DCM clocking has 7.7 and 11.3 ps lower average clock skew compared to the traditional-buffered VM scheme on the ISPD 2009 and 2010 networks, respectively. Table V shows the overall power-performance comparison of existing VM and CM and the proposed DCM clocking schemes. The proposed DCM clocking saves 43% and 62% average power compared to the CMPFF system using the ISPD 2009 and 2010 networks, respectively. This is primarily due to the large static power of CMPFF. In addition to power, the proposed DCM clocking has 11.0 and 15.1 ps lower average clock skew compared to the previous CM scheme.
VII. NOISE AND RELIABILITY

A. Jitter Analysis
In scaled technology, it becomes increasingly difficult to ensure the correctness of the multigigahertz clock signal. One of the main reasons is the presence of clock jitter. Depending on the measurement techniques, jitter can be categorized as period jitter, cycle-to-cycle jitter, long-term jitter, phase error, and time-interval error. However, it has been shown that these jitters are mathematically related to each other [33]; hence, we measured the period jitter to show the robustness of DCM clocking compared with the other clocking schemes. For this analysis, we considered supply voltage-induced noise in the voltage-controlled oscillator of the clock PLL and measured the clock period over 1000 random samples. For the traditional-buffered VM scheme, the standard deviation (σ) of the period is 1.55 ps and the peak-to-peak jitter is 5.5 ps. The σ for the single-ended CM scheme is 1.47 ps and the peak-to-peak jitter is 3.7 ps. The proposed DCM scheme exhibits a σ of 1.46 ps and a peak-to-peak jitter of 1.46 ps, the lowest of the three schemes.
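The two period-jitter figures reported here can be computed from a list of measured clock periods as sketched below. The function name `period_jitter` and the sample values are illustrative assumptions; the sketch uses the population standard deviation, and whether the original analysis used the population or the sample estimator is not stated.

```python
import statistics
from typing import Sequence, Tuple


def period_jitter(periods_ps: Sequence[float]) -> Tuple[float, float]:
    """Return (sigma, peak_to_peak) of the measured clock periods in ps."""
    sigma = statistics.pstdev(periods_ps)               # population std deviation
    peak_to_peak = max(periods_ps) - min(periods_ps)    # worst-case spread
    return sigma, peak_to_peak


if __name__ == "__main__":
    # Made-up periods around a 1-GHz clock (1000 ps nominal).
    print(period_jitter([999.0, 1000.0, 1001.0, 1000.0]))
```

With 1000 samples, as in the experiment above, the population and sample estimators differ by well under 0.1%, so the choice has little effect on the reported σ.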
B. Supply Voltage Fluctuation
We studied the response of the proposed DCM scheme to supply voltage variation. We considered a ±10% voltage fluctuation from the nominal supply voltage. The delay variation for a traditional-buffered VM scheme ranges from −21 to 12 ps compared to the nominal delay. The delay variation in a single-ended CM scheme ranges from −23 to 28 ps. The proposed DCM has delay variation from −23 to 22 ps compared to the nominal voltage delay.
C. Electromigration
Since we used homogeneous wires from root to sinks for all the clock networks, the root wire carries the maximum current. The VM CDN maximum current density is 0.53 MA/cm². As expected, the proposed DCM CDN requires less current compared to the single-ended CM CDN. The maximum current density of the DCM CDN in the root wire is 0.24 MA/cm², which is less than the 0.275 MA/cm² of the single-ended CM CDN. This more than satisfies the ITRS suggestion that current density be limited to 1.5 MA/cm² and relieves the EM threat to the proposed CDN wire sizing.
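The EM check behind these numbers reduces to comparing the current density J = I / (W × T) of a wire against the allowed maximum. The Python sketch below shows that comparison and the corresponding minimum wire width. The function names, the wire thickness, and the example current are assumptions chosen for illustration; only the 1.5-MA/cm² limit comes from the text.

```python
J_MAX_MA_PER_CM2 = 1.5  # ITRS limit quoted in the text


def current_density(i_amps: float, width_um: float, thickness_um: float) -> float:
    """Current density in MA/cm^2 for a rectangular wire cross section."""
    area_cm2 = (width_um * 1e-4) * (thickness_um * 1e-4)   # um -> cm
    return i_amps / area_cm2 / 1e6                         # A/cm^2 -> MA/cm^2


def min_em_width(i_amps: float, thickness_um: float,
                 j_max: float = J_MAX_MA_PER_CM2) -> float:
    """Smallest width (um) that keeps the current density at or below j_max."""
    width_cm = i_amps / (j_max * 1e6 * thickness_um * 1e-4)
    return width_cm * 1e4                                  # cm -> um


if __name__ == "__main__":
    # Made-up example: 1 mA through a 0.5-um-wide, 0.2-um-thick wire.
    j = current_density(1e-3, 0.5, 0.2)
    print(j, j <= J_MAX_MA_PER_CM2, min_em_width(1e-3, 0.2))
```

An EM-aware sizing step would take the larger of this width and the minimum width allowed by the technology for each wire.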
D. Process Sensitivity
It is impractical to analytically predict the behavior of a large network given the combined mismatch errors of individual devices; even the behavior of a small SRAM cell or FF is intractable to model analytically under such variations. However, using Monte Carlo (MC) simulation, the impact of these random parameter variations on FF functionality and performance can be studied. Hence, the resiliency of the proposed DCM scheme is demonstrated through nonuniform MC simulation of process variation and mismatch. The result of this experiment is shown in Fig. 9. The proposed DCMPFF has a mean CLK-Q delay of 48 ps with a standard deviation of 7 ps in 1000 runs. This is better than the recently reported CMPFF, which has a mean CLK-Q delay of 55 ps with a standard deviation of 7.4 ps in 1000 runs.
E. Threshold Voltage Mismatch
In scaled technologies, the circuits are highly sensitive to intradie (process) variation such as threshold voltage (V_th) variation. The CDN can experience large delay variation or skew due to V_th variation. In order to quantify this timing uncertainty, we analyzed the proposed DCM CDN and a traditional PFF-based buffered VM CDN, as shown in Fig. 10(a) and (b), respectively. In addition, we considered ss-ff corners. Unlike a traditional skew computation, we considered delay variation in the FF's outputs to include the FF's V_th variation. The proposed DCM CDN has 41-ps skew. The buffered VM scheme has 43-ps skew due to the presence of buffers in the VM clock tree.
F. Loading Effect
We studied the loading effect of different FFs by changing the driving load of each FF. For any reliable design, it is expected that the FF power performance will linearly increase with the increase of FF load. Fig. 11 shows the result of these experiments. Fig. 11(a) and (b) shows the CLK-Q delay and power consumption of the proposed DCMPFF and Tra. PFF, respectively. Clearly, the proposed DCMPFF's CLK-Q delay and power increase linearly with the increase of FF load and ensure the scalability of the proposed design.
VIII. CONCLUSION
In this paper, we presented a DCM distribution as an alternative to conventional repeater-based VM or CM distribution. The proposed DCM scheme uses a new DCMPFF, which is 47% faster, consumes 33% less power, and requires 9% less silicon area compared to a traditional PFF at 5 GHz. When applied to a symmetric H-tree network, the proposed DCM scheme saves 5% to 72% power compared to a traditional single-ended VM clock at 1-5 GHz and consumes 26% less power on average compared to a previously reported single-ended CM scheme. At the same frequency range, the proposed scheme saves 48% and 53% average power compared to the MSDFF and Tra. PFF-based systems, respectively. In addition, in this paper, we presented the highly robust low-power DCMCS methodology. The proposed scheme saves 79% and 51% average power compared to the traditional-buffered synthesized VM scheme using ISPD 2009 and ISPD 2010 testbenches, respectively. In addition, the DCMCS scheme exhibits 7.7 and 11.3 ps lower average clock skew compared to a VM scheme using the ISPD 2009 and ISPD 2010 testbenches, respectively. Additionally, it has 21% less delay variation due to supply voltage fluctuation.
