IEEE 2006 Custom Integrated Circuits Conference (CICC)
GHz Serial Passive Clock Distribution in VLSI Using Bidirectional Signaling
V. Prodanov and M. Banu
 
MHI Consulting, LLC, New Jersey, USA
 
Abstract-We introduce a serial passive clock distribution
technique allowing efficient and accurate skew removal at any 
arbitrary clock drop point. The passive transmission medium
may be on-chip electrical transmission lines built in current IC
technology or possible optical wave-guides in future
developments. The proposed technique is naturally insensitive to
practical losses and other non-ideal effects and is capable
of covering large chip areas.
I. CONVENTIONAL VLSI CLOCK DISTRIBUTION 
The common method of distributing GHz clock signals in
VLSI is by active trees. These are hierarchical structures of
wires and repeaters, which compensate for the limited bandwidths
of the wires [1],[2],[3]. Ideally, the signals at the top of the
tree are synchronous by physical symmetry, but practical errors
introduce skews. Table I, using data compiled by Rusu [3],
shows practical skews of 60 ps or more, independent of
technology scaling (the quoted papers are from different years and
technologies) and produced mostly by device mismatches and
supply voltage variations [4],[5].
For clocking above 1 GHz, regional de-skewing is used at
the expense of added complexity. Specialized circuits
mitigate the skew problem [6], [7], but are difficult to migrate
to new technology nodes or new products. In some cases, e.g.
FPGAs, de-skewing may not be economical.
TABLE I
Performance of skew-corrected active-tree clock distribution schemes
(Data compiled by Rusu [3])
Author        Source     Before De-Skewing   After De-Skewing
Geannopoulos  ISSCC-98   60 ps               15 ps
Rusu          ISSCC-00   110 ps              28 ps
Kurd          ISSCC-01   64 ps               16 ps
Stinson       ISSCC-03   60 ps               7 ps
Jitter is another important limitation of active trees. It is
dominated by the jitter of the repeater, which is proportional to
the repeater delay [5]. Since the total repeater delay is about
half of the overall tree delay [6], large trees have high jitter.
II. CLOCK DISTRIBUTION USING TRANSMISSION LINES
Recent research results have demonstrated clock
distribution networks based on transmission lines (TL)
[8],[9],[10]. Here the “wires” are larger in size, operating in
“LC-mode” with low loss. The clock signals propagate over
much larger distances compared to the conventional tree,
without the need for repeaters. The total delay, jitter, and power
dissipation are reduced considerably. For example, a tapered
two-level H-tree was demonstrated at 5 GHz with 20 ps
uncompensated overall skew [8]. This experiment had only
16 clocking zones over a 10 mm × 10 mm chip area, with a TL
characteristic impedance at the top of the tree of only about
4 Ω. Expanding this H-tree to the 4-5 levels needed in practice
would be rather difficult.
A superior use of TLs was reported in [9] and [10], where
the circuits achieve about 1 ps skews at 10 GHz, with very low
power dissipation, while being theoretically capable of covering
practical-size clocking regions. Both schemes use many
coupled oscillators interconnected into two-dimensional grids.
Traveling-wave and standing-wave configurations are used,
respectively. One limitation of these approaches is having a
“hard-wired” operating frequency, determined by the TL
characteristics and the capacitive loading. Furthermore, the
behavior of such complex autonomous clocking networks in a
noisy chip environment is largely unknown. This includes the
possibility of exciting multiple oscillatory modes, creating
potential reliability problems.
The traveling/standing wave concepts have additional
limitations. Under traveling-wave conditions, the signals
derived at different locations along the TL have a constant
magnitude but different phases, which vary linearly with
position. De-skewing the clock drop-points is mandatory but
fundamentally difficult without an absolute phase reference.
Under standing-wave conditions, signals extracted at different
locations along the TL have the same phase but different 
magnitudes. These vary as a sinusoidal function with position
creating sizable regions where the signal is too small for
practical clock extraction.
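The contrast between the two regimes can be illustrated numerically. The following is a minimal sketch, assuming an ideal lossless line and a normalized wavelength; the function names are hypothetical and are not taken from [9] or [10].

```python
import cmath
import math

def traveling_wave(x, wavelength):
    """Phasor of a unit-amplitude forward wave at position x (lossless line)."""
    beta = 2 * math.pi / wavelength
    return cmath.exp(-1j * beta * x)

def standing_wave(x, wavelength):
    """Phasor sum of equal forward and backward waves, i.e. 2*cos(beta*x)."""
    beta = 2 * math.pi / wavelength
    return cmath.exp(-1j * beta * x) + cmath.exp(1j * beta * x)

WAVELENGTH = 1.0  # normalized; an assumption made for illustration only

for x in (0.0, 0.1, 0.2, 0.25, 0.3):
    tw = traveling_wave(x, WAVELENGTH)
    sw = standing_wave(x, WAVELENGTH)
    print(f"x = {x:.2f}  traveling: |V| = {abs(tw):.3f}, "
          f"phase = {math.degrees(cmath.phase(tw)):7.1f} deg   "
          f"standing: |V| = {abs(sw):.3f}")
```

The traveling-wave magnitude stays at 1 while its phase falls linearly with x. The standing-wave phasor is purely real, so its phase does not vary between nulls, but its magnitude collapses near a quarter wavelength.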
Our new concept for TL-based clock distribution matches the
jitter and power dissipation performance of
[9] and [10] by similarity of construction, but unlike these
techniques, it is an open-loop method (no oscillators)
generating constant-phase and constant-magnitude clock
signals simultaneously. In addition, it is a wideband design
not requiring resonating inductors as in [11]. Next, we
introduce our technique systematically, first with a review of
single-line passive serial distribution, then describing the
concepts of bidirectional signaling and average time
extraction, and finally, developing our simplest and most
promising clocking scheme using analog multipliers. 
III. PASSIVE SERIAL DISTRIBUTION 
The classical active H-tree to be replaced is shown in Fig.
1(a). Our scheme uses passive serial connectivity as illustrated
in Fig. 1(b). An on-chip transmission line, properly terminated
at both ends, meanders over the area to be clocked, with
buffers tapping the line and deriving local clock signals.
Ignoring for the moment the major issue of variable skew
along the line as in [9], we notice that there are fundamental
reasons why this connectivity is ideally suited for clock
distribution.
Fig. 1. Clock distribution networks: a) active “H” tree; b) passive serial.
1-4244-0076-7/06/$20.00 ©2006 IEEE P-21-1
First, the serial connectivity allows full topological 
flexibility. There are many ways to connect the local clocking
regions without any need for symmetry or regular patterns as
in a tree. Second, as discussed before, the passive nature of 
TLs eliminates the need for repeaters, with minimum power
dissipation and jitter.  Third, not having branches as in [8] 
maintains a constant characteristic impedance.
Next, we discuss the loading effects. The input impedance
of the buffers tapping the line is purely capacitive in practice.
The finite resistance “seen” into the transistor gates at the
clock frequency fclock is negligible because the clock frequency
is always low compared to the MOSFET transit frequency
fT. For example, 10 FO4 clocking gives an fT/fclock ratio of about
40. Therefore, the buffer loading will not introduce any
significant signal loss. However, it is important to avoid
reflections, which could introduce magnitude variations.
If the buffers have small input transistors the loading
effects are negligible. This would be a first design choice. For
larger buffer input transistors, the effective characteristic
impedance of the line changes locally, but if all buffers are
equal in size, there is only a global effective characteristic
impedance change. This can be easily compensated by
adjusting the line terminations accordingly by design. A more
sophisticated solution would set the termination impedance
adaptively by monitoring the signal levels at the TL ends,
intentionally placed in close proximity to each other.
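The global impedance shift caused by equal, evenly spaced buffers can be estimated with the standard periodically loaded line approximation, which treats the tap capacitance as smeared along the line. The sketch below assumes the tap spacing is much shorter than a wavelength. The propagation velocity and the 10 fF buffer capacitance are illustrative assumptions, not values from this work; only the 80 Ω unloaded impedance and the 1 mm tap pitch are taken from the simulations described later.

```python
import math

def loaded_line(z0, velocity, c_buf, spacing):
    """Effective impedance and velocity of a line loaded with c_buf every `spacing` metres.

    Uses L' = Z0 / v and C' = 1 / (Z0 * v) for the unloaded line and adds the
    buffer capacitance per unit length to C'.
    """
    l_per_m = z0 / velocity
    c_per_m = 1.0 / (z0 * velocity)
    c_total = c_per_m + c_buf / spacing
    z_eff = math.sqrt(l_per_m / c_total)
    v_eff = 1.0 / math.sqrt(l_per_m * c_total)
    return z_eff, v_eff

Z0 = 80.0          # ohms, unloaded characteristic impedance
VELOCITY = 1.5e8   # m/s, assumed propagation velocity
C_BUF = 10e-15     # farads per tap, assumed buffer input capacitance
SPACING = 1e-3     # metres between taps

z_eff, v_eff = loaded_line(Z0, VELOCITY, C_BUF, SPACING)
print(f"effective impedance: {z_eff:.1f} ohm, effective velocity: {v_eff:.3e} m/s")
```

Under these assumptions the terminations would be designed for the lower effective impedance rather than for the unloaded value.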
The remaining major issue is the severe problem of
variable skews in serial distribution. These skews are
predictable due to the constant speed of EM waves (“light”) in the
TL. However, this does not help much with de-skewing locally
extracted clocks. In the absence of a locally available absolute
reference, which is the very purpose of clock distribution in
the first place, the only theoretical option is to synchronize
adjacent areas in a hierarchical manner with a large number of
PLLs or DLLs. The resulting complexity and performance
limitations would likely be worse than for active trees. We
will show that there is another much simpler solution to this
synchronization problem if the single TL is replaced by two
TLs.
III. BIDIRECTIONAL SIGNALING AND AVERAGE TIME EXTRACTION
Consider two identical TLs side-by-side, as shown in Fig.
2, with two respective pulses originating from opposite ends at
the same time. If the bi-directional pulses traveling through
the TLs are sensed at any coordinate x, two respective skews
T1 and T2 are detected. The simple operations of adding or
averaging these skew numbers yield absolute results
independent of x. For example, the ½ (T1 + T2) average
equals the time of flight of either signal to the center
of the line, which is a constant as illustrated in Fig. 2.
Likewise, the (T1 + T2) sum represents the total propagation
time over the length L. If multiple points in the line extract
and process the bi-directional skews as mentioned, all local
clock signals would be phase synchronous.  This concept, to
be called BDS for Bi-Directional Signaling, enables serial
passive clocking theoretically. A circuit extracting the
absolute reference from the BDS line will be called an
Average Time Extractor or ATE. 
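The position independence of the sum and the average can be checked with a few lines of arithmetic. This is a minimal sketch assuming two ideal, identical, lossless lines of length L with pulses launched simultaneously from opposite ends; the line length and propagation velocity below are illustrative assumptions.

```python
def arrival_times(x, length, velocity):
    """Arrival times at coordinate x of pulses launched at t = 0 from x = 0 and x = length."""
    t1 = x / velocity             # pulse traveling in the +x direction
    t2 = (length - x) / velocity  # pulse traveling in the -x direction
    return t1, t2

LENGTH = 32e-3    # metres, assumed line length
VELOCITY = 1.5e8  # m/s, assumed propagation velocity

for x_mm in (0, 4, 16, 28, 32):
    t1, t2 = arrival_times(x_mm * 1e-3, LENGTH, VELOCITY)
    average_ps = (t1 + t2) / 2 * 1e12
    print(f"x = {x_mm:2d} mm  T1 = {t1 * 1e12:6.1f} ps  "
          f"T2 = {t2 * 1e12:6.1f} ps  average = {average_ps:6.1f} ps")
```

T1 and T2 individually change with x, while the printed average stays at the time of flight to the center of the line.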
It is important to notice that the BDS principle holds even
if one adds an arbitrary fixed delay between the starting times
of the two signals. Since there is no need to keep the two
sources in a predetermined phase relationship, one of them, 
i.e., the one at the “far end” could be replaced by a PLL or
DLL regenerating the signal coming from the first line. This
decouples the two ends allowing complete freedom in line
placement. Furthermore, the “far-end” source may be
eliminated altogether and the signal just wrapped around, as
shown in Fig. 3a.
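A short derivation makes both statements explicit. It is a sketch under the assumptions of a uniform propagation velocity v, a line of length L, a fixed launch offset Δ for the second signal, and a negligible turn-around length in the folded case; these symbols are introduced here for illustration.

```latex
% Two sources, second one delayed by a fixed offset \Delta:
T_1(x) = \frac{x}{v}, \qquad
T_2(x) = \Delta + \frac{L - x}{v}
\quad\Longrightarrow\quad
\tfrac{1}{2}\bigl(T_1 + T_2\bigr) = \frac{L}{2v} + \frac{\Delta}{2}

% Folded line: the wrapped-around signal is the special case \Delta = L/v:
T_2(x) = \frac{L}{v} + \frac{L - x}{v} = \frac{2L - x}{v}
\quad\Longrightarrow\quad
\tfrac{1}{2}\bigl(T_1 + T_2\bigr) = \frac{L}{v}
```

In both cases the x terms cancel, so the offset only shifts the common reference and does not disturb synchronism between drop points.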
The realization of the ATE function would run into
fundamental difficulties (non-causal system) if we had only a
single pair of pulses. Fortunately, the distributed clock is a
periodic signal and therefore the ATE can average “future”
values from the previous clock period. Many ATE 
configurations are possible. The block diagram in Fig. 3b 
shows a topology based on a classical DLL. The feedback
controls the delay elements forcing the total delay to be equal 
to the time interval between the arrival times of the two input 
pulses. In lock state, the center tap of the delay line extracts
Fig. 2. World-line (time vs. space) diagram of a BDS system.
Authorized licensed use limited to: Cal Poly State University. Downloaded on February 8, 2010 at 16:16 from IEEE Xplore. Restrictions apply. 
Fig. 3. Conceptual schematic of pulse-reference BDS system:
a) folded-line architecture; b) possible ATE implementation.
either analog or digital. For the latter, the DLL could be
disabled after lock.  A transistor-level circuit for this ATE and
a similar PLL-based topology were designed in 0.13 µm TSMC
CMOS, and correct operation was verified in simulations at
multi-GHz frequencies. We defer any quantitative claims until
test chips are fabricated and tested.
The ATE function as defined for periodic signals cannot
distinguish between delays separated by an integer number of
clock periods. This gives rise to a phase reversal ambiguity
problem. In other words, while all ATEs connected to the
BDS line are synchronized in terms of clock transitions, some
ATEs will have inverted signals with respect to others,
depending on the ATE position along the TL. This consistent
phase inversion error, occurring over large TL sections, can be
easily detected during design and corrected with the addition
of inverters. If it is desirable to allow blind ATE placement
with no attention to potential phase reversals, it is possible to
design simple circuits that would correct this error
automatically. For example, at the boundary of each clocking
region, a phase detector would check if the two regional
clocks are in phase or out of phase and communicate this to an
appropriate local clock driver through a single control bit. This
one-time operation at turn-on is considerably easier than the
complicated boundary phase comparison and correction for
clock-transition edge alignment already used in VLSI with
high-precision DLLs.
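The origin of the half-period ambiguity can be sketched numerically. The model below is an illustrative assumption, not the transistor-level ATE: each arrival time is observable only modulo the clock period, and the ATE averages the two observed values. The 500 ps period and 1300 ps total flight time are arbitrary example numbers in integer picoseconds.

```python
def ate_edge_time(t1, t2, period):
    """Average of two arrival times that are each observable only modulo `period`."""
    return ((t1 % period) + (t2 % period)) / 2.0

def is_inverted(edge_time, reference_edge_time, period):
    """True if the extracted edge is offset by half a period from the reference."""
    offset = (edge_time - reference_edge_time) % period
    return offset == period / 2.0

PERIOD_PS = 500         # assumed clock period
TOTAL_FLIGHT_PS = 1300  # assumed T1 + T2, constant along the line

reference = ate_edge_time(0, TOTAL_FLIGHT_PS, PERIOD_PS)
for t1 in range(0, TOTAL_FLIGHT_PS + 1, 130):
    t2 = TOTAL_FLIGHT_PS - t1
    edge = ate_edge_time(t1, t2, PERIOD_PS)
    state = "inverted" if is_inverted(edge, reference, PERIOD_PS) else "in phase"
    print(f"T1 = {t1:4d} ps  T2 = {t2:4d} ps  extracted edge = {edge:5.1f} ps  {state}")
```

Every tap yields an edge at the same instant or exactly half a period away, which corresponds to aligned transitions with a possible polarity inversion over sections of the line.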
A more problematic practical issue is the phase error
introduced by pulse shape distortion due to dispersion or any
other mechanism. This is an important potential problem for
any ATE processing high-frequency pulses. As discussed
previously, the clocking speed is low compared to the
transistor switching speed, but the demands of modern
clocking systems are still very stringent. Next, we show a superior
solution to the clock extraction problem from a BDS line,
which is considerably simpler than the ATEs discussed so far,
has no phase ambiguity, and does not rely on edge detection.
This is accomplished by using sinusoidal BDS and multipliers
for clock extraction.




For sinusoidal signals, delay and phase summation are
equivalent operations. Indeed, multiplying two sinusoidal
signals of frequency f and phases ϕ1 and ϕ2 yields a DC term
and a sinusoidal term of frequency 2f and phase (ϕ1+ϕ2). But,
ϕ1 and ϕ2 are proportional to corresponding time delays. 
Therefore, the ATE functionality is automatically included in
the operation of analog signal multiplication.
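The underlying identity is the standard product-to-sum formula. The derivation below is a sketch that assumes unit amplitudes and the sign convention that a delay T_i appears as a phase lag; T_1 and T_2 are the arrival delays of the two signals at a given tap.

```latex
\cos(2\pi f t + \varphi_1)\,\cos(2\pi f t + \varphi_2)
  = \tfrac{1}{2}\cos(\varphi_1 - \varphi_2)
  + \tfrac{1}{2}\cos\bigl(2\pi(2f)\,t + \varphi_1 + \varphi_2\bigr)

% With \varphi_i = -2\pi f\,T_i:
\varphi_1 + \varphi_2 = -2\pi f\,(T_1 + T_2)
```

Because T_1 + T_2 does not depend on position, the phase of the 2f term is identical at every tap, while the position-dependent difference appears only in the DC term.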
We can configure a very efficient BDS clock distribution
system as in Fig. 4a. A single sinusoidal signal of frequency f
enters the TL at node A, passes sequentially through B, C and
D and exits into a termination resistor at node E. Analog
multipliers are connected as shown. After multiplication,
phase synchronous local clocks are derived at points AA, BB,
CC, or any additional similar points.  
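The behavior of this arrangement can be sketched with an idealized model: a lossless, reflection-free folded line on which each multiplier sees the outgoing signal delayed by x/v and the returning signal delayed by (2L - x)/v. The function name, the propagation velocity, and the sampling scheme are assumptions for illustration and do not reproduce the circuit-level simulations reported here.

```python
import cmath
import math

def second_harmonic_phasor(x, length, velocity, freq, samples=256):
    """Complex amplitude of the 2*freq component of the multiplier output at tap x.

    The product of the outgoing and returning sinusoids is sampled over one
    period of `freq` and correlated against exp(-j*2*omega*t).
    """
    omega = 2 * math.pi * freq
    period = 1.0 / freq
    t_out = x / velocity                 # delay on the outgoing run
    t_ret = (2 * length - x) / velocity  # delay on the returning run
    acc = 0j
    for n in range(samples):
        t = n * period / samples
        product = math.cos(omega * (t - t_out)) * math.cos(omega * (t - t_ret))
        acc += product * cmath.exp(-1j * 2 * omega * t)
    return 2 * acc / samples

LENGTH = 32e-3    # metres, one run of the folded line
VELOCITY = 1.5e8  # m/s, assumed propagation velocity
FREQ = 2e9        # Hz, distributed frequency; local clocks are at 2 * FREQ

for x_mm in (0, 8, 16, 24, 32):
    p = second_harmonic_phasor(x_mm * 1e-3, LENGTH, VELOCITY, FREQ)
    print(f"x = {x_mm:2d} mm  |2f term| = {abs(p):.3f}  "
          f"phase = {math.degrees(cmath.phase(p)):7.2f} deg")
```

In this ideal model the magnitude and phase of the 2f term are the same at every tap, so the local clocks come out phase synchronous without any per-tap calibration.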
Extracting a clock signal at twice the transmitted
frequency is beneficial since the line loss for half-rate 
distribution is lower (skin-effect limited operation). Half-rate
distribution has been used successfully in commercial 
microprocessors [7], albeit with a conventional active tree.
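The size of this benefit can be estimated from the usual square-root frequency dependence of skin-effect loss. This is a rough sketch assuming conductor loss dominates and dielectric loss is negligible; the attenuation constant α and the constant k are introduced here for illustration.

```latex
\alpha(f) \approx k\sqrt{f}
\quad\Longrightarrow\quad
\frac{\alpha(f/2)}{\alpha(f)} = \frac{1}{\sqrt{2}} \approx 0.71
```

Under this assumption, distributing the clock at half rate lowers the line loss in dB by roughly 29% for the same line length.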
Unlike using DLLs or PLLs, analog multiplication is a
simple memory-less (non-dynamic) operation. This classical,
well-understood circuit function can be implemented in any
semiconductor technology [12]. The DC term resulting from
the multiplication can be easily removed through AC coupling
or standard DC removal feedback techniques.
Extensive transient and harmonic-balance simulations of
the proposed sinusoidal BDS were performed using the
Mentor Graphics Eldo-RF simulator. We modeled the BDS
line with Level-4 TL elements, assuming an on-chip Cu
microstrip design with 10 µm width, 10 µm dielectric (SiO2)
thickness and 5 µm metal thickness. The characteristic
impedance was approximately 80 Ω. The multipliers were
modeled using 120 nm TSMC CMOS technology files. The
total BDS line length in our simulations was 32 mm (64 mm
total TL length) with 32 local clocking points, one per
millimeter. Such a BDS line can provide synchronous clocks
to a 32 mm² area of any shape with 1 mm × 1 mm granularity.
Four such lines, centrally fed, would cover a 128 mm² area.
Fig. 4b shows a typical simulation result for 2 GHz
distribution and 4 GHz local clocks. For clarity, only 3 out of
32 clock signals are shown. We notice the large skews at
points B, C, D, and E compared to A, consistent with finite
signal propagation speed. After multiplication, all 4 GHz local
clocks are in phase synchronism. The different DC levels, a 
byproduct of the multiplication, are intentionally kept for
demonstration purposes. They would be removed in practice,
as discussed previously.
Fig. 4. Sinusoidal BDS system: a) block diagram (local clock outputs AA, BB, CC); b) simulation results (time axis from 0 to 1.0 ns).
Several important non-ideal conditions are included in
these simulations.  First, we have realistic TL loss:
approximately 6 dB from point A to point E. This loss has
little effect on the extracted clocks, confirming a theoretical 
observation that the extracted clock is independent of TL loss
and position. We expect no phase errors due to signal
loss, but at first guess we would expect to see magnitude
errors. However, even the magnitude is insensitive to loss. This
important property is a consequence of the fact that the TL
loss in dB is linear with distance. As multiplication is
equivalent to addition in dB, a loss-error cancellation occurs at
every clock drop point. The slight, inconsequential magnitude
error visible in Fig. 4b is due to reflections created by large
loading mismatches introduced in the line in simulations.
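The cancellation can be verified with a small calculation. The sketch below assumes a uniform loss per unit length, no reflections, and a folded line in which the tap at x sees the signal after x and after 2L - x of travel; the function name is hypothetical. The 6 dB end-to-end loss and the 32 mm run length follow the simulation setup described above.

```python
def tap_amplitudes(x, line_length, total_loss_db):
    """Relative amplitudes of the outgoing and returning signals at tap x.

    The TL is 2 * line_length long and loses total_loss_db (in dB) from the
    source to the termination, with loss in dB linear in distance.
    """
    db_per_m = total_loss_db / (2 * line_length)
    a_out = 10 ** (-db_per_m * x / 20)
    a_ret = 10 ** (-db_per_m * (2 * line_length - x) / 20)
    return a_out, a_ret

LINE_LENGTH = 32e-3   # metres, one run of the folded line
TOTAL_LOSS_DB = 6.0   # dB from point A to point E

for x_mm in (0, 8, 16, 24, 32):
    a_out, a_ret = tap_amplitudes(x_mm * 1e-3, LINE_LENGTH, TOTAL_LOSS_DB)
    print(f"x = {x_mm:2d} mm  outgoing = {a_out:.3f}  returning = {a_ret:.3f}  "
          f"product = {a_out * a_ret:.3f}")
```

The individual amplitudes vary by up to a factor of about two along the line, while their product, which sets the multiplier output level, is the same at every tap.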
Second, notice that all signal traces are somewhat thick.
This is because each one is composed of twenty superimposed
traces resulting from a 25% termination resistance sweep. The
synchronization errors are clearly small, showing that our
technique is resilient to termination errors.
Finally, we mention that at every TL-tapping point (there
are 64 such points) the multipliers load the line with the
equivalent of four minimum CMOS inverters.
V. CONCLUSION
We have introduced a general VLSI clock distribution
concept using bidirectional signaling on an integrated
waveguide. This concept may be implemented with several
technology and circuit choices, such as using electrical TLs in
current IC technologies or optical guides in future IC
technologies, and performing clock extraction based on pulse
arrival-time averages or analog multiplication. The most
attractive practical approach for current IC technology is using
electrical transmission lines and analog multipliers.
Simulations and theoretical arguments show that this very
simple architecture is insensitive to all major practical
non-ideal effects and has optimum power dissipation and jitter.
ACKNOWLEDGEMENTS
We thank Applied Materials Inc. and especially Dr. M.
Pinto, Dr. L. West and Dr. D. Eaglesham for their support of
this work. We also thank Dr. B. Ackland of Nobel Device
Technologies, our co-author of the BDS concept discovery, for
many useful discussions.
REFERENCES
[1] E. Friedman, “Clock Distribution Networks in Synchronous Digital
Integrated Circuits,” Proc. IEEE, Vol. 89, No. 5, pp. 665-692, May 2001.
[2] A. Mulé, E. Glytsis, T. Gaylord, J. Meindl, “Electrical and Optical Clock
Distribution Networks for Gigascale Microprocessors,” IEEE Trans. on
VLSI Systems, Vol. 10, No. 5, pp. 582-592, Oct. 2002.
[3] S. Rusu, “Clock Generation and Distribution for High-Performance
Processors,” invited presentation, International Symposium on
System-on-Chip, 2004.
[4] G. Geannopoulos, X. Dai, “An Adaptive Digital Deskewing Circuit for
Clock Distribution Networks,” International Solid-State Circuits
Conference Dig. Tech. Papers, pp. 400-401, Feb. 1998.
[5] D. Harris and S. Naffziger, “Statistical Clock Skew Modeling with Data
Delay Variation,” IEEE Trans. VLSI Systems, Vol. 9, No. 6, pp. 888-898,
Dec. 2001.
[6] J. Warnock et al., “The circuit and physical design of the POWER4
microprocessor,” IBM Journal of R&D, Vol. 46, Nov. 2002.
[7] P. Mahoney, E. Fetzer, B. Doyle, S. Naffziger, “Clock Distribution on a
Dual-Core Multi-Threaded Itanium®-Family Processor,” International
Solid-State Circuits Conference Dig. Tech. Papers, pp. 292-293, Feb.
2005.
[8] M. Mizuno et al., “On-chip Multi-GHz Clocking with Transmission
Lines,” International Solid-State Circuits Conference Dig. Tech. Papers,
2000.
[9] J. Wood, T. Edwards and S. Lipa, “Rotary traveling-wave oscillator
arrays: a new clock technology,” IEEE Journal of Solid-State Circuits,
Vol. 36, pp. 1654-1665, Nov. 2001.
[10] F. O’Mahony, P. Yue, M. Horowitz and S. Wong, “A 10-GHz Global
Clock Distribution Using Coupled Standing-Wave Oscillators,” IEEE
Journal of Solid-State Circuits, Vol. 38, No. 11, pp. 1813-1820, Nov.
2003.
[11] S. Chan, K. Shepard, P. Restle, “Uniform-Phase Uniform-Amplitude
Resonant Load Global Clock Distributions,” IEEE Journal of Solid-State
Circuits, Vol. 40, pp. 102-109, Jan. 2005.
[12] G. Han and E. Sanchez-Sinencio, “CMOS Transconductance
Multipliers: A Tutorial,” IEEE Trans. Circuits and Systems - II, Vol. 45,
No. 12, pp. 1550-1563, Dec. 1998.