Abstract-In a high-performance VLSI design, the clock network consumes a significant amount of power. Most existing methodologies use voltage-mode (VM) signaling, and these clock distributions lose a large amount of dynamic power charging and discharging the large global clock capacitance. New circuit approaches for current-mode (CM) clocking save significant clock power, but they have been limited to symmetric networks, while most application-specific integrated circuits have asymmetric clock distributions. In this paper, we propose the first CM clock synthesis (CMCS) methodology to reduce the overall clock network power with low skew. The method integrates with traditional clock routing, which is followed by transmitter and receiver sizing. We validate the proposed methodology on the ISPD 2009 and 2010 industrial benchmarks using extracted SPICE models; the benchmarks span areas of 1.4-275.6 mm^2 and contain 81-2249 sinks. In 45-nm CMOS technology simulations at clock frequencies of 1-3 GHz, the methodology saves 39%-84% average power with similar skew on the benchmarks. In addition, the CMCS methodology takes 2.4-9.1× less running time and consumes 20%-26% less transistor area than synthesized, buffered VM clock distributions.
I. INTRODUCTION
CLOCK distribution networks (CDNs) have a tremendous impact on overall dynamic power and performance in VLSI systems. As technology progresses, distributing the clock is becoming increasingly challenging.
Many researchers have already proposed different ways to reduce CDN power [1]-[7]. In addition to power, a large body of work has investigated signal integrity issues due to process variation and noise [8]-[10]. Researchers mostly improved these attributes while treating the power budget as a primary constraint [11]. All of these CDN efforts to improve signal integrity and power are based on traditional voltage-mode (VM) signaling.
As an extension of VM signaling, a wide range of research has been conducted on low-voltage swing signaling [1], differential signaling [12]-[16], pseudodifferential signaling, and incremental signaling [17]. The latter two schemes were limited to nonclock signal transmission but achieved significant power and performance improvement over full-swing VM schemes. VM CDNs require clock buffers, and the placement of these buffers can disturb timing and require improved clock synthesis methodologies to tackle skew and variability [5], [18]. A current-mode (CM) signal, however, does not need distributed buffers and reduces the timing uncertainties related to process variation and noise [7], [17], [19]. CM signaling has extremely low voltage swing, which enables low dynamic power, and also has higher transmission speed than its VM counterpart [17], [20]. In addition to power, CM signaling offers superior signal integrity and low switching and substrate noise compared with VM schemes [17].
Recently, attractive circuit techniques for CM clock distribution have been proposed that offer low-power and high signal integrity using current-pulsed flip flops [3] , [7] . However, these schemes were only suitable for symmetric (i.e., equal impedance) clock networks and failed to provide evidence that CM clocking can apply to a real clock network. The primary reason is the lack of existing automation tools to process CM clocks instead of traditional VM clocks. Balancing insertion delay in a VM clock network is not the same as balancing impedances for CM clocking. Prior VM algorithms relied on buffers to do this and are not applicable to CM clocks. In our proposed scheme, we present the first methodology to distribute CM clock signals in real clock networks [21] , [22] using a standard-cell design style. Our major contributions are as follows:
1) the first clock synthesis methodology to create nonsymmetric CM clocks;
2) the first demonstration of CM clocking on industrial benchmarks;
3) the first standard-cell methodology to utilize CM latch/flip-flop input impedance to minimize skew.
Sections II and III present a brief description of previously reported CM signaling schemes and the motivation of the CM clocking issues. In Section IV, a tuning method is proposed for CM clocks along with a thorough analysis of CM pulsed flip-flop properties and design using them. Section V presents results comparing the proposed CM clocking scheme with existing buffer-based VM CDNs and Section VI concludes this paper.
II. BACKGROUND
CM is widely used for global signaling, especially in high-speed serial links for network buses, memory buses, and multiprocessor interconnection networks [23]. However, at low frequencies, CM signaling consumes large overall power, due to high static power consumption. On the other hand, CMOS logic utilizes VM signaling due to its low static power. CMOS current steering logic has been shown to be robust against digital switching noise, but consumes too much static power [24].
Fig. 1. Previously reported CM clocking scheme saves significant CDN power and exhibits high robustness to noise and variation; however, it is limited to symmetric clock networks [7].
A traditional, point-to-point CM scheme requires a CM transmitter (Tx) and a receiver (Rx) circuit. A Tx circuit ideally converts a VM signal into a CM signal while the Rx circuit does the opposite. There have been prior works on these point-to-point networks for both off-chip [25] and on-chip [26] signaling. However, they have not considered point-to-many distribution as needed by clock networks.
One CM clocking scheme for point-to-many clock networks demonstrated significant power and performance improvement over traditional VM clock schemes, as shown in Fig. 1 [7] . This scheme is based on a low-power CM flip flop (CM FF) and efficiently applied CM clocking in a hand-designed multilevel H-tree network. The CM-FF-based design used a NAND-NOR Tx that sent a current pulse converted from a single source VM signal. The Tx generated and transmitted the current pulse, which was synchronized with the rising edge of the input VM clock signal at the Tx. This enabled an edge triggered operation of the Rx circuit in CM FFs. In addition to low power, this scheme showed significant noise robustness compared with the existing VM clocking schemes. However, the work neglected to demonstrate the CM pulsed scheme in a real asymmetric clock network. This needs a new methodology due to CM design issues.
III. CURRENT-MODE CLOCKING ISSUES
The trip current of a CM FF is the minimum current needed to deposit enough charge at a CM FF input so that it can store a new value. The clock tree itself remains in steady state at roughly V_dd/2 and the current pulse arrives nearly instantaneously. Therefore, delay-induced skew is not a major issue, unlike in VM clocks. In a CM clock, however, an equal amount of current is needed at each FF to prevent timing skew within the CM FF. The main complication is that the duration and peak, and hence total charge, of the current pulse must be within bounds.
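As a rough illustration of the charge scale involved, assume an ideal rectangular pulse (an assumption made here only for the arithmetic; the real pulse shape differs) with the nominal 3-μA amplitude and 70-ps pulsewidth quoted for the CM FF library analysis in Section V-B. The symbols I_pulse and t_pw are labels introduced for this sketch:

```latex
Q \approx I_{\text{pulse}} \, t_{\text{pw}}
  = (3\,\mu\text{A}) \times (70\,\text{ps})
  = 2.1 \times 10^{-16}\,\text{C}
  = 0.21\,\text{fC}
```

Under this idealization, a change in either the peak or the duration of the pulse changes the delivered charge proportionally, which is why both must stay within bounds.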
Balancing the impedance at each wire branch is not a trivial task, because it depends on the input impedance of the FF inputs. Prior VM methods could decouple downstream impedance using buffers but CM has an advantage in performance and power by not using buffers. In addition, the Tx at the root determines the steady-state voltage of the clock network, which defines the bias point of the FF clock input.
The FF input impedance changes depending on the input current and the bias point set by the Tx, which effectively means that the CM FF changes input impedance during a typical clock pulse when there are slight bias fluctuations. The current steered at each branching point depends on each branch's impedance but this, in turn, depends on the downstream FFs and the current that is steered to them. Because of this challenge, previous CM clocking has been restricted to symmetric H-trees [7] , [26] .
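The current steering at a branching point can be sketched with a simple current divider. This is a minimal illustration that assumes fixed, real-valued branch admittances; it deliberately ignores the dependence of the CM FF admittance on the current it receives, which is the circular dependence described above. The function name and the numeric values are hypothetical.

```python
def split_current(total_current, branch_admittances):
    """Divide a current among parallel branches in proportion to admittance."""
    total_y = sum(branch_admittances)
    if total_y <= 0:
        raise ValueError("total admittance must be positive")
    return [total_current * y / total_y for y in branch_admittances]


# Hypothetical node: 6 uA arrives and splits into two downstream branches.
balanced = split_current(6e-6, [1.0e-3, 1.0e-3])    # equal admittances
unbalanced = split_current(6e-6, [1.2e-3, 0.8e-3])  # mismatched admittances
print(balanced, unbalanced)
```

With mismatched admittances the two branches receive unequal currents, which is the mismatch that the sizing stages in Section IV are meant to compensate.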
As a result of trip current mismatch, the internal CM FF voltage pulse (CLKP) can vary in the time domain and results in clock skew. This inaccuracy can increase quickly in larger asymmetric networks with large variation in current at the sinks. In the worst case, a CM FF may not respond if the trip current is insufficient, which can result in a functional failure. Hence, it is desirable to use an automated synthesis tool not only for the automation of the routing and impedance balancing but also to ensure the electrical correctness and functionality.
VM clock synthesis techniques typically use Elmore delay models for initial clock routing and then insert and balance buffers to constrain the network's slew rates. Since the Elmore delay model is based on the charging/discharging of a capacitance through a resistance, it is not suitable for CM synthesis, because CM clocking maintains a steady-state voltage in the entire clock network. Elmore delay-based clock routing balances delays in clock branches, which is not the same as balancing impedances. However, it is a reasonable starting point and can be compensated for by appropriately sizing the Tx and the Rx circuitry in the CM FF.
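For reference, the Elmore delay used by VM routing can be sketched for an unbranched RC ladder as follows. This is a textbook formulation, not code from the proposed flow; the function name, the lumping of each capacitance at the far end of its resistor, and the example values are assumptions for illustration.

```python
def elmore_delays(segments, sink_cap=0.0):
    """Elmore delay at each node of an unbranched RC ladder.

    segments: list of (resistance_ohm, capacitance_farad) pairs ordered from
    driver to sink, with each capacitance lumped at the far end of its resistor.
    sink_cap: extra load capacitance at the last node.
    """
    caps = [c for _, c in segments]
    if caps:
        caps[-1] += sink_cap
    delays = []
    accumulated = 0.0
    for k, (r, _) in enumerate(segments):
        # Each resistor charges all capacitance downstream of it.
        accumulated += r * sum(caps[k:])
        delays.append(accumulated)
    return delays


# Two hypothetical 100-ohm, 10-fF wire segments driving a 5-fF sink.
delays = elmore_delays([(100.0, 10e-15), (100.0, 10e-15)], sink_cap=5e-15)
print(delays)
```

The model is built entirely on charging capacitance through resistance, which is why it balances delays rather than the branch impedances that matter for CM clocking.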
To demonstrate the skew improvement after the proposed Tx sizing and CM FF sizing stages, we performed synthesis and simulation of the different routing techniques in Fig. 2 on a four-sink, asymmetric CM clock distribution using the previously reported CM Tx and FF circuits [7]. A symmetric H-tree network does not work well with asymmetric distributions, because it routes to fixed locations determined by the size of the H-tree. This results in a large 19.1-ps skew, as shown in Fig. 2(a) [7], [26]. Using a deferred merge-embedding (DME) methodology and CM clocking, we observed a better, but still considerable, 14.8-ps skew, as shown in Fig. 2(b). The skew improvement is due to the balanced RC product in each subtree. Using our proposed iterative Tx sizing methodology with a DME tree, we observe an improvement to 3.1-ps skew, as shown in Fig. 2(c). Sizing the Rx in the CM FF further improves the impedance matching and compensates for skew using the clock-to-internal voltage pulse (CLK-CLKP) delay of the CM FF. Using this technique along with the Tx sizing, the skew is 1.6 ps, as shown in Fig. 2(d). However, this is a small four-sink motivational example that is intended to illustrate the principle of this paper. It is not meant to be the verification of the methodology, which is reserved for the benchmark results in Section V.
In addition to skew, CM clocking is expected to have lower jitter-induced timing uncertainty than a VM scheme due to the absence of buffers in the CM CDN; jitter will not be addressed further in this paper.
Fig. 2. Both symmetric and DME VM synthesis techniques introduce large skews (19.1 and 14.8 ps, respectively) when directly applied to asymmetric CM clock distributions; however, DME with Tx sizing or combined Tx/Rx sizing can improve the clock skew to 3.1 and 1.6 ps, respectively, with almost equal power consumption in each case.
Our research provides an automated methodology for the Tx and CM FF Rx sizing. It is worth mentioning that the proposed methodology is in stark contrast to the existing impedance balancing VM schemes [8], [27], where clustering and load balancing were achieved using wire and/or buffer sizing [27]. Even timing model-independent schemes utilized extra wires and dummy sinks to balance the network [8], but these schemes are only suitable for buffered VM clocking, since the CM FFs also have varying input impedance.
IV. PROPOSED CURRENT-MODE CLOCK SYNTHESIS
The reliability and overall performance of a CM clocking scheme depends greatly on the Tx and Rx/CM FF circuits and their transistor sizes. The advantage, however, is a tremendous amount of power savings with similar skews compared with existing buffered VM clocking methodologies.
The overview of the proposed CM clock synthesis (CMCS) scheme is shown in Fig. 3 , which starts with a traditional DME tree construction. While this is not exactly optimal for impedance matching, it is generally a good starting point. It is followed by a stage of Tx sizing to determine the appropriate bias voltage of the network and then an iterative skew improvement through Rx sizing in the CM FFs.
A. CM Pulsed Current Transmitter Sizing
The proposed CM clock networks are unbuffered and driven at the root by a CM Tx [7]. The CM Tx generates a push/pull current and the devices are sized so that the network maintains a steady-state bias voltage. Since the Tx is large, it may have several exponentially tapered stages of buffers driving it, which are included in our later results. The detailed algorithm for our CM pulsed current Tx sizing is presented in Algorithm 1.
Algorithm 1 Current Transmitter Sizing
We performed a wide range of simulations on networks of different sizes and topologies to relate the Tx sizing to the total capacitive admittance (Y_T) of the network. The result of these experiments is shown in Fig. 4. The relationship between Y_T and the Tx size is highly linear.
In order to relate the total driving load/impedance to the Tx size, we calculate the total impedance of the network. However, it is customary to use admittance, which is simply the inverse of impedance, for parallel networks. The total admittance of a network is proportional to the current, as shown in Fig. 4. We calculate the total admittance of a CDN by considering the total FF load and the RC network. The input admittance of a CM FF is (1) where g_m1 and g_m2 are the transconductances of the receiving transistors, μ_n and μ_p are the mobilities of nMOS and pMOS transistors, and C_ox is the gate oxide capacitance. The aspect-ratio (AR = W/L = width/length) of Mr1-Mr2 in 
where C_w,j is the wire capacitance of wire j, α_i is the admittance factor of sink i, and β is a constant. We can utilize the linearity of Y_T and Tx size to parameter fit β as a starting point. The error bounds suggest that a ±12% range around the starting point should be considered during optimization. The α_i values are optimized later in Section V-B when we select CM FF library cells with varying AR sizes. The first part of (2) ensures the total required current at each sink, while the latter part helps the Tx to sustain the V_dd/2 voltage and accounts for the fraction of energy loss due to nonideal voltage swing on the interconnect. Empirically, the Tx sizing is convex, so we used steepest descent search to find the best size. The Tx sizing algorithm first calculates Y_T of the network (Line 4) in the totalAdmittance(Tree) method, which applies (2). Then, it determines the initial Tx sizing (T_init) of the network (Line 5) using sizeTransmitter(Y_T). It runs a transient simulation (simulateTransient()) and uses calculateSkew() to measure the initial skew (S_init) (Lines 6 and 7). T_best and S_best are set to the initial values (T_init and S_init), respectively (Line 8). The T_init value is also stored in two temporary variables (T_newUp and T_newDown).
After this, the algorithm sweeps up and down from T_init with a step size of δs, which is assumed to be 1% of T_init, using two independent loops (Lines 9-24). The change in Tx device sizes also changes the network bias voltage and the input current of a CM FF, which effectively changes the CLK-CLKP delay of the FF in Fig. 5. In addition, the DME-based tree does not guarantee equal impedance of each branch, resulting in CLK-CLKP delay mismatch. This can change the skew of the network, so it is imperative to calculate the new skew with the resized Tx. During each iteration, the algorithm compares the new simulated skew (S_new) with the previous best skew and retains the best skew (S_best) along with the corresponding Tx size (T_best). The algorithm terminates if there is no improvement in skew. This proposed Tx sizing methodology worked with every network we tested, and our experimental results in Section V-C will show the quality.
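The up/down sweep described above can be sketched as follows. This is a minimal illustration, not a reproduction of Algorithm 1: the callable `simulate_skew` is a hypothetical stand-in for the transient simulation and skew measurement, the `max_steps` guard is an added safety limit, and the toy skew function at the end is invented only to exercise the loop.

```python
def size_transmitter(t_init, simulate_skew, step_fraction=0.01, max_steps=1000):
    """Sweep the Tx size up, then down, from t_init and keep the lowest skew.

    simulate_skew: callable mapping a Tx size to a skew value (hypothetical
    stand-in for a transient simulation followed by a skew measurement).
    """
    step = step_fraction * t_init            # 1% of the initial size by default
    t_best, s_best = t_init, simulate_skew(t_init)
    for direction in (+1, -1):               # two independent sweeps
        t_new = t_init
        for _ in range(max_steps):
            t_new += direction * step
            if t_new <= 0:
                break
            s_new = simulate_skew(t_new)
            if s_new >= s_best:
                break                        # no improvement: end this sweep
            t_best, s_best = t_new, s_new
    return t_best, s_best


# Toy convex skew curve with its minimum at size 107 (arbitrary units).
t_best, s_best = size_transmitter(100.0, lambda t: abs(t - 107.0) + 2.0)
print(t_best, s_best)
```

Stopping a sweep at the first non-improving step relies on the empirical convexity noted in the text; on a non-convex skew curve this early exit could miss a better size further away.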
B. Receiver/CM FF Sizing Methodology
To aid skew optimization, we utilize a small set of predesigned CM FF library cells with different input impedances. The input impedance is changed by varying the AR of the input reference-voltage generator (Mr1-Mr2) diode-connected inverter circuits in Fig. 5, as modeled in (1). However, it is necessary to have equal AR for both the input reference-voltage generator and the local reference-voltage generator (Mr3-Mr4) to measure the correct trip current of a CM FF. Because of that, we change the AR of both voltage generators simultaneously. This results in a voltage variation at the input of the current comparator and can move the bias point. The variation of bias voltage also varies the CLK-CLKP delay of the CM FF. These results are shown later in Section V-B.
The proposed CM FF sizing methodology balances the root-to-sink admittance of an unbalanced tree by selecting among the available CM FF library cells. Since these cells have different admittances, they have differing internal CLK-CLKP delays, which can be used to balance any skew. We approach the CM FF sizing problem by starting with a median CLK-CLKP delay FF and replacing those that have lower or higher impedance (with faster or slower versions), respectively.
Algorithm 2 CM Pulsed FF Sizing

Algorithm 3 Finding Critical Sinks
The detailed algorithm for our sizing is shown in Algorithm 2. The FFs are initially set to the median size to allow them to be made faster/slower. After a transient simulation, the algorithm calculates S_init (Lines 4 and 5) and sets S_best as S_init (Line 6). We search over the sinks' timing information and determine the set of sinks that need improvement in findCriticalSinks() (Line 8). Then, the algorithm iteratively resizes the critical CM FFs until we meet our skew bound (SB) (Lines 7-24).
The findCriticalSinks() function identifies the largest cluster of FFs in any SB window as the "good" sinks (Lines 8-13). Choosing the window with the largest number of sinks ensures that the fewest CM FFs are returned in the critical sink set C and need to be adjusted in Algorithm 2. These "critical" sinks are outside the optimal window and can be either too fast or too slow.
Algorithm 3 has a worst case runtime complexity of O(n^2), where n is the number of sinks. However, the SB is small and we only look into the set of sinks within an SB, which severely limits the second n. This makes the proposed algorithm linear in practice. In addition, using a linear time maximal sum algorithm [28], the proposed Algorithm 3 could be sped up to O(n). However, the runtime is dominated by simulation and not the algorithm itself, so we did not do this.
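The idea of picking the largest cluster of sinks within one SB window can be sketched as below. This is not Algorithm 3 itself: it swaps in a sort followed by a two-pointer scan (O(n log n) overall), and the function name, return format, and example delays are assumptions made for illustration.

```python
def find_critical_sinks(delays, skew_bound):
    """Return (good, fast, slow) lists of sink indices.

    good: the largest group of sinks whose delays fit in one window of width
    skew_bound; fast/slow: sinks arriving before/after that window.
    """
    if skew_bound < 0:
        raise ValueError("skew_bound must be non-negative")
    order = sorted(range(len(delays)), key=lambda i: delays[i])
    best_lo, best_hi = 0, 0                  # half-open range into `order`
    lo = 0
    for hi in range(len(order)):
        # Shrink the window from the left until it fits within the bound.
        while delays[order[hi]] - delays[order[lo]] > skew_bound:
            lo += 1
        if hi + 1 - lo > best_hi - best_lo:
            best_lo, best_hi = lo, hi + 1
    return order[best_lo:best_hi], order[:best_lo], order[best_hi:]


# Hypothetical sink delays in ps with a 10-ps skew bound.
good, fast, slow = find_critical_sinks([100, 104, 130, 106, 90, 103], 10)
print(sorted(good), fast, slow)
```

In this example four sinks fall inside the best window, so only one early sink and one late sink are returned as critical.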
During each iteration of Algorithm 2, we calculate the maximum delay (d_max) and the minimum delay (d_min) of the "good" sinks (Lines 9 and 10). Then, two consecutive loops iterate over the fast and slow critical sinks, respectively, and choose a faster/slower CM FF from the library cells (Lines 11-16). A transient simulation calculates the new skew (S_new) and stores the minimum value to S_best after comparison (Lines 17-23).
The proposed CM FF sizing algorithm converges to a minimum skew after either no skew improvement is seen or the SB is achieved. It is worth mentioning that the CM FF is sized to meet the SB for a fixed Tx size, which was determined in the previous stage, so the Tx is not resized after the Rx stage. In addition, the CM FFs are very fast and Algorithm 1 ensures proper functionality of each FF by properly sizing the CM pulsed current Tx. FF metastability is usually due to the input arriving during a clock transition. Our CM FF still has setup and hold times like VM FFs to avoid any such problems.
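The iterative cell selection can be sketched with a toy delay model. Everything below is an illustrative assumption rather than Algorithm 2 itself: the 13 library cells are assumed to be uniformly spaced by 5 ps across ±30 ps, each sink's CLKP delay is modeled as a fixed arrival time plus the chosen cell's offset (so the coupling between cell choice and network currents is ignored), and early sinks are given slower cells while late sinks are given faster ones.

```python
CELL_OFFSETS_PS = [5 * k for k in range(-6, 7)]   # 13 hypothetical cells
MEDIAN = 6                                        # index of the 0-ps cell


def good_window(delays, skew_bound):
    """Return (d_min, d_max) of the largest group of delays within skew_bound."""
    ordered = sorted(delays)
    best_lo, best_hi, lo = 0, 0, 0                # inclusive indices
    for hi in range(len(ordered)):
        while ordered[hi] - ordered[lo] > skew_bound:
            lo += 1
        if hi - lo > best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return ordered[best_lo], ordered[best_hi]


def size_flip_flops(arrival_ps, skew_bound, max_iters=20):
    """Choose one library cell per sink so the CLKP skew approaches skew_bound."""
    cells = [MEDIAN] * len(arrival_ps)

    def clkp_delays(selection):
        return [a + CELL_OFFSETS_PS[c] for a, c in zip(arrival_ps, selection)]

    def skew(selection):
        d = clkp_delays(selection)
        return max(d) - min(d)

    best_cells, s_best = list(cells), skew(cells)
    for _ in range(max_iters):
        if s_best <= skew_bound:
            break                                 # skew bound met
        delays = clkp_delays(cells)
        d_min, d_max = good_window(delays, skew_bound)
        for i, d in enumerate(delays):
            if d < d_min and cells[i] < len(CELL_OFFSETS_PS) - 1:
                cells[i] += 1                     # too early: next slower cell
            elif d > d_max and cells[i] > 0:
                cells[i] -= 1                     # too late: next faster cell
        s_new = skew(cells)
        if s_new >= s_best:
            break                                 # no improvement
        best_cells, s_best = list(cells), s_new
    return best_cells, s_best


# Hypothetical arrival times in ps with a 5-ps skew bound.
cells, skew_ps = size_flip_flops([100, 102, 118, 90, 101], skew_bound=5)
print(cells, skew_ps)
```

In the actual flow each iteration would be followed by a transient simulation, since resizing a CM FF also changes its input impedance and therefore the current steered to other sinks.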
V. SIMULATION RESULTS
A. Simulation Setup
We implemented the proposed CMCS scheme in C++ and Python. Simulations were run on an Intel Core i5-3570 Ivy Bridge 3.4-GHz quad-core processor. We validate the proposed methodology using the 45-nm ISPD 2009 and 2010 industrial benchmarks [21], [22]. The ISPD 2009 benchmarks are derived from real IBM application-specific integrated circuit designs. These benchmark circuits span areas of 50.4-275.6 mm^2 and consist of 81-623 evenly/unevenly distributed sinks with equal or unequal sink capacitances. The ISPD 2010 benchmarks are derived from real IBM and Intel microprocessor designs. The 2010 benchmark circuits span areas of 1.4-91.0 mm^2 and consist of 981-2249 nonuniformly distributed sinks with different loadings. Our designs were optimized for a 1-V supply voltage and clock frequencies from 1 to 3 GHz. Traditionally, 5%-10% of the clock period is allocated for clock skew, so we used a clock SB of 70 ps for a 1-GHz clock frequency. Traditionally, worst case slew rate is defined as 10% of the clock period. For the proposed CM clocking schemes, we used 10% SB. It is worth mentioning that at steady state the CM clock tree remains at roughly V_dd/2; hence, we only considered the worst case slew rate at the CLKP signal of the CM FF. The CM Tx and Rx/FF [7] were designed using the FreePDK 45-nm CMOS technology [29]. We used HSPICE to measure power and performance for all results.
The clock tree is routed with minimum wire length by incorporating balanced bipartition with DME [9], [10] and the final tree nodes are connected to the CM FFs. The clock tree and the CM FFs are driven by a single pulsed current Tx. In addition, we followed the ISPD 2010 high-performance clock network synthesis contest guideline to model the clock network as a distributed RC model [21], [22]. The CM Tx, tree, and the CM FFs compose the entire CM network. Fig. 6 shows the resulting DME routed bufferless CM CDN for the ISPD 2010 benchmark circuit 06. In the proposed CMCS scheme, the total power consumption includes the CM pulsed Tx power, parasitic power, and the total CM FF power.
The CMCS methodology uses library cells of the CM FF with different ARs, and hence different input impedances and CLK-CLKP delays, resulting in "slower" and "faster" FFs. Here, "faster" and "slower" FFs refer to smaller and larger CLK-CLKP delays, respectively. We calculate global clock skew at the FF's internal clock pins (CLKP), so that the changes in CLK-CLKP delay are included in the skew component of timing constraints and do not change the setup time and hold time.
It would be interesting to compare the CMCS results with the ISPD 2009 and ISPD 2010 winners' results. However, the winning teams consider local skew minimization, resulting in wire snaking and extra buffers. For example, using the 01 benchmark circuit, the ISPD 2010 winning team used 198.3-pF capacitance, while the implemented VM network requires 93.7-pF capacitance. Overall, the ISPD 2009 and ISPD 2010 winners consume significantly more capacitance, resulting in more than double the power consumption of our implemented buffered VM networks; hence, we excluded the ISPD winners' results from our final comparison.
Since the previous Tx sizing methodology [7] does not work with asymmetric networks, we used a state-of-the-art buffered VM methodology for comparison. The VM tree is routed using a common industry method with minimum wire length [9] , [10] and the buffers are inserted to meet the skew and slew constraints (10% of the clock period) [30] . For the VM buffered network, the total power consumption includes CDN buffer power, clock tree parasitic power, and VM pulsed FF [31] power. Both the VM and CM schemes receive a traditional voltage clock from a PLL/CLK divider at the root. The input CLK signal slew rate is 10% of the CLK period.
B. CM FF Library Cells
Similar to a VM FF, in the CM case, we considered the 50% ideal input current (3 μA) transition to the 50% Q transition as the CLK-to-Q delay of the CM FF. For setup (t_s) and hold (t_h) times, we used the common definition as the time margin that causes a CLK-to-Q delay increase of 10% beyond nominal. t_s and t_h of the median size CM FF are −15.8 and 46.6 ps, respectively. Fig. 7 shows an analysis of the CM FF library cells with the nominal input current of ±3 μA and 70-ps pulsewidth. In this analysis, we vary the AR of the CM FF reference-voltage generators and measure the corresponding CLK-CLKP delay. We observed a linear relationship between CLK-CLKP delay and AR. In particular, the CLK-CLKP delay of the CM FF increases with AR because of the increasing input impedance, as shown in (1). Hence, we utilized this characteristic to build our CM FF library cells with different CLK-CLKP delays. It is worth mentioning that, similar to an FF output (Q) signal, the CLKP acts as both terminal and voltage pulses.
In order to tackle skew issues, the proposed CMCS utilized 13 CM FF library cells (a median size, six faster, and six slower) with ±30-ps CLK-CLKP delay variation from the nominal delay value. It is expected that the use of differently sized CM FFs requires different FF areas and may add area overhead to the overall design. However, it is possible to have zero area overhead for different size FFs. Fig. 8 shows the layout of the fastest, median, and slowest CLK-CLKP delay CM FFs. In Fig. 8, P_n and N_n indicate the sizing reference of pMOS and nMOS, respectively, corresponding to the reference-voltage generator of the median size CM FF. We laid out the CM FF in such a way that we can adjust the sizing of the CM FF reference-voltage generator without changing the overall CM FF area. Since each FF uses the standard-cell height, we can adjust the AR by using vertical empty space for a slower CM FF (larger transistors) or decrease the transistor size in the opposite direction (for a faster CM FF), as shown in Fig. 8(a) and (c), respectively. This requires no placement legalization.
We characterized the register stage of each CM FF considering maximum driving load. In addition, the CLKP signal has fixed loading from transistors M4, M7, and M10, as shown in Fig. 5. If the CLKP signal meets the slew rate constraint, there is no slew rate violation at the CM FF output (Q) signal.
C. Results and Comparisons
Table I, and up to 67% power in Table II. In a CM scheme, most of the power is static power consumed by the CM FFs and there are no CDN buffers, so it is highly insensitive to frequency [7]. Because of this, CM clocking saves more power at higher frequencies, which is extremely important in multigigahertz designs. Fig. 9 shows the efficiency of the proposed CMCS methodology compared with the VM buffered scheme at higher frequencies using ISPD 2009 benchmark circuit s4r3. In particular, the power saving of the CM methodology increases from 68% (at 1 GHz) to 84% (at 3 GHz) compared with the VM scheme.
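The trend of growing power savings with frequency can be checked with a back-of-the-envelope model. The sketch below assumes an idealization that is not claimed by the measurements: VM clock power scales linearly with frequency while CM clock power stays constant. The function name is hypothetical; the 68% figure is the 1-GHz saving reported for benchmark s4r3.

```python
def projected_saving(saving_at_ref, freq_ratio):
    """Projected power saving at freq_ratio times the reference frequency.

    Idealized assumption: VM power scales linearly with frequency and
    CM power does not change with frequency.
    """
    cm_over_vm_at_ref = 1.0 - saving_at_ref
    return 1.0 - cm_over_vm_at_ref / freq_ratio


# Project the 1-GHz saving (68%) to 3 GHz under the idealized model.
saving_3ghz = projected_saving(0.68, 3.0)
print(round(100 * saving_3ghz, 1))
```

The idealized projection comes out at about 89%, somewhat above the 84% reported at 3 GHz, which suggests that the real CM power is not perfectly flat with frequency or that the VM power does not scale strictly linearly.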
2) Skew Comparison: The proposed algorithm reduces skew by Tx and CM FF sizing while ensuring correct functionality. The CMCS methodology resulted in proper functionality in all of the asymmetric networks. The skew slightly degraded on average in both the 2009 and 2010 benchmarks, but the skew results were better on some benchmarks, as shown in Tables I and II. These skew levels are well within tolerable limits of 5%-10% of the clock period and are, therefore, not a concern, especially considering the large power consumption savings. In addition, each scheme uses a different methodology and the response to optimization is not predictable. This is common with any sort of heuristic optimization algorithm, which may end up in a solution that is closer to or further from optimal. However, overall, the proposed CM scheme has only 3.3- and 3.9-ps average skew difference compared with the VM scheme for the ISPD 2009 and ISPD 2010 test benches, respectively.
Fig. 9. CM clocking is highly insensitive to frequency; as a result, it exhibits more power saving at higher frequencies. For example, using ISPD 2009 benchmark circuit s4r3, the power saving of the CM methodology increases from 68% (at 1 GHz) to 84% (at 3 GHz) compared with the VM scheme.
3) Run-Time Comparison: Most high-performance CDNs use HSPICE simulation instead of approximate analytical models, such as Elmore delay in traditional clock tree synthesis (CTS) algorithms. However, HSPICE simulation requires significant simulation time compared with a traditional CTS algorithm. Tables I and II show the results based on accurate HSPICE simulation for both VM and CM methodologies for fair comparison of quality of results and run time.
The run time of the CMCS methodology is significantly less than that of the VM methodology. This is because the proposed scheme requires fewer iterations, since it does not use buffers that need to be sized. Overall, the run time on the benchmarks is 2.4-9.1× less on average, as shown in Tables I and II.
4) Silicon Area Comparison: Similar to previous CM clocking systems, the proposed CMCS scheme uses a bufferless CDN. However, the Tx circuit has a few buffers for the internal delay chain and to drive the large Tx transistors. Fig. 10 shows a representative comparison of the VM buffered total area with the CM total area. The CM total area includes the overhead of the resized FFs and the Tx. When considering only the CM Tx and VM buffer areas, CM clocking saves up to 73% transistor area compared with the VM scheme. Overall, using the proposed CMCS methodology on the ISPD 2009 and ISPD 2010 benchmarks, CM clocking saves 21% average silicon area compared with the VM scheme.
VI. CONCLUSION
We have presented the first CMCS methodology. The proposed methodology used Tx sizing and Rx sizing in the CM FFs to ensure correct functionality and reduce skew. The proposed methodology saved 39%-84% average power with similar skews on industrial benchmarks. In addition, the methodology used 2.4-9.1× less run time and up to 26% lower silicon area compared with the buffered VM networks.
