We present an unconventional clock distribution network that emphasizes flexibility and layout independence. Its modular, standard-cell approach suits a variety of applications and clock domain shapes and sizes, and it compensates for intra-die temperature and process variation. The network provides control over regional clock skew, permits use in beneficial-skew applications, and facilitates silicon debug. By adding routing to the serial clock network, we permit post-silicon resizing and reshaping of clock domains. Defective sections of the clock network can be bypassed, which gives the network a post-silicon repair capability.
INTRODUCTION
In deep sub-micron technologies, device and interconnect variation introduces an ever-increasing amount of uncertainty that must be addressed [1], particularly in clock distribution networks (CDNs).
We present a CDN that differs radically from standard designs: it uses a serial approach that tolerates clock buffer mismatch and allows clock domains to be reshaped post-silicon. The system provides the benefits of closed-loop CDNs, using an active synchronization stage to eliminate clock skew between regional clocks, while avoiding many of their pitfalls. At run time, the all-digital circuitry operates open loop, which provides a low-power operating mode that is simple to implement.
SERIAL CLOCK NETWORKS
Our serial clock distribution network aligns each local clock to half the phase difference between two reference clocks traveling in opposite directions. This averaging technique was first proposed by Grover et al. [2] and has been used in [3, 4]. Our clock network is the first to use it to mitigate mismatch effects in a clock network.
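The effect of the averaging can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the reverse reference is the forward reference returned from the far end of the thread after a fixed turnaround delay. The function names, the segment delays and the picosecond unit are hypothetical and are not taken from the design. Under these assumptions, the midpoint between the forward and reverse arrival times is identical at every tap when each forward segment matches its reverse segment, regardless of how long the individual segments are.

```python
from itertools import accumulate


def arrival_times(fwd_delays, rev_delays, turnaround=0.0):
    """Arrival times of the forward and reverse reference edges at each tap.

    fwd_delays[k] is the forward delay from tap k to tap k+1 and
    rev_delays[k] is the reverse delay from tap k+1 back to tap k.
    Assumption: the reverse edge is the forward edge returned at the last tap.
    """
    n_taps = len(fwd_delays) + 1
    fwd = [0.0] + list(accumulate(fwd_delays))
    rev = [0.0] * n_taps
    rev[-1] = fwd[-1] + turnaround
    for k in range(n_taps - 2, -1, -1):
        rev[k] = rev[k + 1] + rev_delays[k]
    return fwd, rev


def local_clock_times(fwd_delays, rev_delays, turnaround=0.0):
    """Local clock edge at each tap: midpoint of the two reference edges."""
    fwd, rev = arrival_times(fwd_delays, rev_delays, turnaround)
    return [(f + r) / 2 for f, r in zip(fwd, rev)]


def skew(times):
    """Largest difference between any two local clock edges."""
    return max(times) - min(times)


# Hypothetical, deliberately unequal segment delays (ps).
segments = [30.0, 55.0, 42.0, 80.0, 37.0]
matched = local_clock_times(segments, segments, turnaround=10.0)

# Same thread, but the reverse path of segment 1 is 2 ps faster.
mismatched = local_clock_times(segments, [30.0, 53.0, 42.0, 80.0, 37.0],
                               turnaround=10.0)

print(skew(matched), skew(mismatched))
```

In this model, a mismatch between the forward and reverse delay of one segment shifts the local clocks beyond that segment by half the mismatch, which is why the matching of co-located segment pairs, rather than their absolute delay, determines the skew.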
Concept
The underlying concept of our clock network is shown in Figure 1 for n taps. All the taps are connected together as a thread using a pair of wires to propagate forward and reverse reference clock signals, creating a clock domain with the required shape and size. While there is more than one method to perform the required averaging, we employ a technique that delays the forward clock through two identical delay lines. The signal between these delay lines is used as the local clock. A simple phase detector is required to determine which reference signal edge occurs first.
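The two-delay-line scheme described above can be sketched behaviorally as follows. This is a simplified model under stated assumptions, not the implemented circuit: the delay lines are assumed to be digitally programmable in uniform steps, and the names `tune_delay_code`, `local_clock_edge`, `step` and `max_code` are hypothetical. The phase detector is modeled only as a comparison of which edge occurs first.

```python
def tune_delay_code(fwd_edge, rev_edge, step, max_code):
    """Find the delay code shared by the two identical delay lines.

    The forward edge passes through both lines (2 * code * step of delay).
    The modeled phase detector reports whether that twice-delayed edge still
    occurs before the reverse edge; the code is raised until it no longer does.
    All times are in the same (hypothetical) unit, e.g. picoseconds.
    """
    for code in range(max_code + 1):
        twice_delayed = fwd_edge + 2 * code * step
        if twice_delayed >= rev_edge:  # reverse edge no longer occurs later
            return code
    raise ValueError("delay line range is too short for this tap")


def local_clock_edge(fwd_edge, code, step):
    """The local clock is taken between the two delay lines."""
    return fwd_edge + code * step


# Hypothetical (forward, reverse) edge times at three taps of one thread, in ps.
taps = [(0, 498), (30, 468), (244, 254)]
step = 5
for fwd_edge, rev_edge in taps:
    code = tune_delay_code(fwd_edge, rev_edge, step, max_code=63)
    print(code, local_clock_edge(fwd_edge, code, step))
```

Because both lines share one code, the tap between them sits at the midpoint of the two reference edges to within one delay step in this model. Once the codes are found they can be held fixed, which corresponds to the open-loop run-time mode.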
Our dual-reference-signal averaging clock network simplifies clock network design because it places no constraints on the location of clock regions or on the clock path taken between regions. Our network can be implemented with standard cell components and allows portions of the clock network to be modified without complete reconstruction. The technique is easily ported to other technologies since the characteristics of devices and interconnect matter less than how well they match.
Reconfigurability
Reconfiguring clock domains post-silicon is not easy for typical IC clock networks. The clock threads in our serial network can be reconfigured using routing switches between local clock regions. This functionality is impossible when clock signals are broadcast through an integrated circuit, as is the case with clock trees. The extent of flexibility is variable: it can range from connecting a shared resource synchronously between two domains to a full-fledged multi-clock mesh in which local taps can be arbitrarily connected to any clock in the system. Devadas et al. [5] have used a bidirectional mesh similar to ours for data networks, but our application of this approach to clock networks is unique.
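As a rough illustration of how routing switches define clock domains, the following Python sketch models each thread as a chain of taps. The representation is an assumption made for this example only: `sources` maps a clock name to the first tap it drives, and `next_tap` records, for each tap, which tap its routing switch forwards the reference signals to. None of these names come from the design.

```python
def trace_thread(first_tap, next_tap):
    """Follow the switch settings from the first tap to the end of the thread."""
    thread, seen = [], set()
    tap = first_tap
    while tap is not None:
        if tap in seen:
            raise ValueError(f"routing loop at tap {tap!r}")
        seen.add(tap)
        thread.append(tap)
        tap = next_tap.get(tap)
    return thread


def clock_domains(sources, next_tap):
    """Map each clock to the ordered list of taps on its thread."""
    return {clk: trace_thread(first, next_tap) for clk, first in sources.items()}


def bypass(next_tap, bad_tap):
    """Return switch settings that route around a defective tap.

    Whichever tap fed bad_tap now feeds bad_tap's successor instead.
    (A defective first tap would also need a change to `sources`,
    which this sketch does not handle.)
    """
    successor = next_tap.get(bad_tap)
    return {tap: (successor if nxt == bad_tap else nxt)
            for tap, nxt in next_tap.items() if tap != bad_tap}


# Hypothetical two-domain configuration.
sources = {"clkA": "a0", "clkB": "b0"}
next_tap = {"a0": "a1", "a1": "a2", "a2": None, "b0": "b1", "b1": None}
print(clock_domains(sources, next_tap))

# Resize post-silicon: hand tap a2 over from clkA to clkB.
resized = dict(next_tap, a1=None, b1="a2")
print(clock_domains(sources, resized))

# Repair post-silicon: route around tap a1.
print(clock_domains(sources, bypass(next_tap, "a1")))
```

Changing a single switch entry moves a tap between domains, and redirecting one switch removes a defective tap from its thread, which is the kind of change a broadcast clock tree cannot accommodate.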
W and L represent transistor width and length, respectively, and D represents the distance between devices. The $A_P^2$ term models the distance variance and the $S_P^2$ term models the discrete variance. Minimizing process variance requires using sufficiently large transistors to decrease discrete variance and locating the centers of devices as close together as possible, using centroid layouts, to minimize the distance variance. Equation 1 can be extended to these centroid layouts:
where $D_x$ and $D_y$ are the horizontal and vertical distances between devices and $D_w$ is the wafer diameter [7]. Even though process gradients are never perfect planes, Equation 2 shows that placing clock buffers close to each other will result in much better matching than the dispersed clock buffers typical of current CDNs, which do not allow them to be co-located. The clock buffers that require matching in our clock network are adjacent (Figure 1).
As clock drivers get further apart, the potential mismatch increases. The total distance-related skew accumulates through every level of clock buffers. In a symmetric tree structure, the worst-case skew occurs between diagonally opposite local buffers, since the driver pairs on those two paths are furthest apart at every level (Figure 2). The skew performance of our serial clock network depends instead on the matching of the forward and reverse reference signal segments between adjacent clock regions. Since co-located clock drivers are inherently well matched, they are tolerant of distance-related skew. By the same argument, distance-related interconnect variance is also suppressed by our system.
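The geometric part of this argument can be checked with a small Python sketch. It assumes an idealized symmetric tree on a square die in which each driver feeds four drivers at the centers of its sub-quadrants; this quad-branching layout, the unit die side and the function names are illustrative assumptions, not a description of any particular clock tree. The sketch only computes separations between corresponding drivers and makes no claim about the resulting delay mismatch.

```python
import math


def driver_positions(path, side=1.0):
    """Driver positions along one root-to-leaf path of the assumed tree.

    path is a sequence of quadrant choices (sx, sy), each component -1 or +1.
    The root sits at the die center and is not included in the result.
    """
    x = y = side / 2
    offset = side / 4
    positions = []
    for sx, sy in path:
        x += sx * offset
        y += sy * offset
        positions.append((x, y))
        offset /= 2
    return positions


def level_separations(path_a, path_b, side=1.0):
    """Distance between the two paths' drivers at each tree level."""
    return [math.dist(p, q)
            for p, q in zip(driver_positions(path_a, side),
                            driver_positions(path_b, side))]


levels = 3
# Paths to diagonally opposite corners of the die.
corner_a = [(-1, -1)] * levels
corner_b = [(+1, +1)] * levels
print(level_separations(corner_a, corner_b))

# Two sibling leaves that share every driver except the last one.
sibling = [(-1, -1)] * (levels - 1) + [(+1, -1)]
print(level_separations(corner_a, sibling))
```

For the diagonal pair the separation grows at every level and approaches the die diagonal, whereas sibling leaves share all but their last driver. In the serial network, by contrast, the forward and reverse segments that must match are placed side by side at every tap.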
Increased power density in ICs can cause significant cross-die temperature fluctuation, or so-called "hot spots", that alters transistor and interconnect behavior. Power supply variation can also modify the delay of clock buffers, creating clock skew. Placing devices that require matching close together exposes them to the same power supply and temperature environment, so our system can be synchronized to correct for these conditions locally, whereas the distributed buffers of traditional clock trees cannot.
SIMULATION RESULTS
Our clock network has been designed in TSMC's 180 nm standard process using Cadence Virtuoso. Extracted layout simulations show that our proof-of-concept design can operate with clock signals between 500 MHz and 2.5 GHz and provides a sub-15 ps skew bound for 6 clock regions (Figure 3).
CONCLUSION
The system provides multi-point active skew compensation and a power-saving open-loop operating mode. Our cell-based approach to clock distribution allows components to be designed independently and to be moved around conveniently, since the clock network can be modified with a simple change in the number or location of the clock taps. The digitally programmable delay lines allow the system to accommodate blocks with different tree depths and latencies.
Using a dual-reference-signal averaging technique allows designers to delay clock tuning and provides additional debug and repair capability to the clock network. Programmable repeater stages allow us to redirect clocks post-silicon. Using a serial approach minimizes the total clock line length, reducing the total clock load and potentially clock power. By placing clock buffers close together and using centroid layout techniques, it is possible to practically eliminate distance-induced variation. Clock buffers and delay lines will exhibit similar temperature and power supply characteristics, allowing our clock network to compensate for temperature and long-term power supply fluctuation.
[7] B. Linares-Barranco and T. Serrano-Gotarredona, "Cheap and easy systematic CMOS transistor mismatch characterization," Proc. ISCAS 1998, pp. 466-469. 
