Zero skew clock routing for fast clock tree generation by Reaz, Mamun Ibn et al.
ZERO SKEW CLOCK ROUTING FOR FAST CLOCK TREE GENERATION 
 
M. B. I. Reaz1, Nowshad Amin2, M. I. Ibrahimy1, F. Mohd-Yasin3, A. Mohammad3 
 
1Department of Electrical and Computer Engineering, International Islamic University Malaysia, 53100 
Kuala Lumpur, Malaysia 
2Dept. of Electrical, Electronic and System Engineering, Faculty of Engineering, National University of 
Malaysia, 43600 UKM, Bangi, Selangor, Malaysia 
3Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia 
 
ABSTRACT 
 
A Zero Skew Clock Routing Methodology has been 
developed to help design team speed up their clock tree 
generation process. The methodology works by breaking up 
the clock net into smaller partitions, then inserting clock 
buffers to drive each portion, and lastly, routing the 
connection from original clock source to each newly 
inserted clock buffers with zero skew. A few Perl scripts 
and a new Visual Basic based routing tool have been 
developed to support the methodology implementation. The 
routing algorithm used in this tool is based on the Exact 
Zero Skew Routing Algorithm. The methodology has been 
tested using a real design database and resulting in a 
significant improvement in the through put time required to 
complete the clock tree generation. This improvement is 
attributed to the ability to generate clock tree on much 
smaller portions of clock nets that supports of speeding up 
the clock tree generation process in IC design. 
 
Index Terms— Zero skew, Clock routing, Clock tree 
generation, IC design 
 
1. INTRODUCTION 
 
At a deep submicron (DSM) silicon technology levels, it 
is now feasible to integrate all major function of an end 
product in a single system-on-a-chip (soc) silicon die [1]. 
However, almost all of these chips are still designed based 
on synchronous clock design. Physically, the clock signal is 
distributed from an external pad to all similarly clocked 
synchronizing elements through a distribution network that 
includes clock distribution logic and interconnects. It serves 
to unify the design by determining the precise instance in 
time that the synchronizing elements change state.  In order 
to improve chip performance, i.e. to make the chip operates 
at faster clock speed; the chip designer must ensure that the 
clock signal is distributed to all the synchronizing elements 
in the design at the same time. Non-optimal clock behavior 
is typically caused by either one of these two phenomena: 
the unbalanced routing to the chip’s synchronizing elements, 
or the non-symmetric behavior of the clock distribution 
logic [2].  The problem is further complicated by the 
massive size of modern chip design and the fact that the 
design usually utilizes more than one clock signal to sync up 
its function. Each of these clocks may operate at different 
frequencies [3]. therefore the common challenge faced by 
all chip designers is how can they create a perfect clock 
distribution scheme that could guarantee all the timing 
elements scattered all over the chip are activated by the 
same edge of the clock signal simultaneously. The 
maximum difference between the arrival times from the 
clock source to the elements is defined as the clock skew [4, 
5]. The ideal case is to have all the elements sync up at the 
same time, which would mean no skew.  
The benefits of a zero skew clock distribution are well 
known and have been proven to affect the performance of a 
chip significantly. Many researches have pointed out that 
minimizing clock skew is the most important task when 
designing a clock network. Works, such as [2, 6, 7], have 
led to many other studies and researches on how to balance 
a clock network and achieve the zero clock skew target. 
More recent studies, such as [8, 9] also reveal that reducing 
clock skew can help to minimize power consumption of 
large chips, which is very important for chips that are 
designed for low power application (longer battery life) such 
as mobile computers and cell phones. 
 
2. THE CLOCK TREE GENERATION PROBLEM 
AND THE PROPOSED SOLUTION 
 
As the chip size grows (in term of increasing number of 
functional elements), the clock nets get bigger as well. 
Clock nets have bigger fan-out, and have to be distributed 
over larger areas. It has been observed that existing clock 
tree generation (CTG) tool performance has deteriorated as 
the clock net fan out size grow bigger and bigger. The tool 
execution time is getting slower and slower, and it is taking 
more iteration before the user can get a result that matches 
the desired outcome. It is suspected that the tool may have 
hit the limit of its computation capability, and is not able to 
handle nets with large fan-out. Although the CTG tool has 
been used extensively and successfully before, this 
capability limitation is slowing the design process. 
000023
CCECE/CCGEI May 5-7 2008 Niagara Falls. Canada
978-1-4244-1643-1/08/$25.00      2008 IEEE
Since the problem is the CTG tool capacity, the logical 
solution is to break the clock nets into smaller parts. This 
can be accomplished by partitioning the chip into several 
pseudo-partitions at the layout level, based on the cells 
placement, as opposed to   partitioning at the RTL level, 
which will be a real design partitioning effort. Partitioning at 
RTL level will involve some changes to the chip 
architecture and would increase the complexity of this 
solution.  
The pseudo-partitioning task will be executed after the 
completion of cells placement stage. The design engineer 
can output the placement information of the chip and 
analyze and decide the best way to partition each clock net.  
The engineer must also make sure that the size of each 
pseudo-partition should be well below the upper limit of 
what the CTG tool is capable of handling.   
For each of the pseudo-partition, a new buffer will be 
inserted to act as a new clock source point for each pseudo-
partition. Each pseudo-partition will have its own clock 
source point, and that clock point will drive a smaller 
number of fan-out. The location of the clock source point 
will be greatly influenced by the floor plan of the chip and 
the placement location of the fan-out cells as well. Since the 
clock net is smaller, the CTG tool should perform well 
within expectation. 
 
3. REVIEWS ON CLOCK ROUTING ALGORITHMS 
 
There are many clock routing algorithms that have been 
proposed throughout the years. The aim of all of these 
algorithms is to deliver a zero or minimal skew clock 
routing and distribution. 
One of the earliest clock routing algorithms is the H-
Tree clock routing algorithm. While the H-Tree clock 
routing algorithm is widely used in the IC industry [10], 
their application is not suitable for IC design because cells 
placement in IC design is not symmetrical. This fact is also 
supported by the Power PC microprocessor design team. 
The team has conducted experiment on their chip and 
concluded that clock tree generation methodology with 
balance router ensure quick turnaround and produce more 
than 40% improvement over H-Trees [11].  
The Method of Means and Median (MMM) algorithm 
has been proposed to overcome the H-Tree Algorithm 
shortcoming [2], but it also has its weaknesses. The 
difference in wire length from the clock source to each leaf 
cell can be as large as half the diameter of the chip. The 
algorithm effectiveness depends heavily on how the cut 
direction is decided, thus adding lethargy to the algorithm 
implementation.   
The Recursive Geometric Matching (RGM) algorithm 
was proposed in 1991 [6]. It has been proposed to fix the 
unbalanced wire length problem seen in the MMM 
algorithm. This algorithm always yields clock tree with 
perfectly balance path length for trees of two, three, or four 
terminals, but does not necessarily minimize the skew as it 
did not consider the cell loading. Like the MMM algorithm, 
the algorithm cannot guarantee a balanced wire length on 
certain cases, and requires work around that reduces its 
effectiveness. 
The Exact Zero Skew (EZS) algorithm was first 
proposed in 1991, by Dr Ren-Song Tsay, from IBM’s 
(International Business Machine) T.J. Watson Research 
Center [7]. It considers delay balance instead of the wire 
length balance has been proposed to deliver better skew 
performance. The EZS algorithm adopts a bottom up 
process similar to that of the RGM algorithm. This 
algorithm achieves delay balance by elongating wires that 
have smaller delays, therefore, does not require any look 
ahead techniques that can slow down the execution. 
In order to have a fair comparison, simulation exercise 
to test out the three different algorithms, namely, the MMM, 
the RGM, and the EZS algorithm, to route a clock net with 8 
leaf cells has been performed. The wire used for routing was 
1 micron wide and the wire resistance per unit length was 
1m and wire capacitance per unit length was 0.075 fF. 
Each unit length was equivalent to 1-micron meter. The die 
area was 8000 microns by 8000 microns. The overall 
comparison is summarized in the Table I.  
The simulation results clearly show that the EZS 
algorithm gives the smallest skew (0.004233 ps) and its 
amount is negligible. However, in this test case, the 
maximum path delay and the amount of wire length used to 
route the nets using EZS algorithm are only slightly lower 
compared to the other algorithm (3.7% difference in 
maximum path delay difference and 0.4% in wire length). 
This is consistent with what is reported in [7]. Based on all 
these considerations, the EZS algorithm is recommended 
and proposed to be deployed in the implementation of the 
zero skew clock net routing tools.  
 
Table I 
SUMMARY OF THE COMPARISON 
Algorithm Max Path 
Delay (ps) 
Skew (ps) Total Wire 
Length (μm) 
MMM 16.78469 1.883 88000 
RGM 16.21875 0.285 88000 
EZS 15.61857 0.004233 87610 
 
4. IMPLEMENTATION OF THE SOLUTION 
 
The proposed solution is called Zero Skew Clock Routing 
(ZSCR) methodology. The methodology is presented in the 
flowchart shown in Fig. 1. It can actually be categorized in 
three stages. The first stage is the pseudo-partitioning stage. 
In this stage, the user analyzes the placement results and 
identifies how to pseudo-partition the clock networks. The 
user has to also determine the location of the new clock 
source for each pseudo-partition. The second stage is the 
clock trunk routing stage. For this stage, a software program 
based on the EZS algorithm had been created to perform the 
zero skew routing between the original clock sources to the 
000024
new pseudo-partition clock sources. This part of the net is 
referred to as the clock trunk as it now became the main 
branch of the clock net. The third stage is the database 
update stage. In this stage, the results from stage 1 and stage 
2 are integrated back into the database. This methodology is 
implemented in three stages. 
 
 
Fig. 1. Zero Skew Clock Routing Flowchart 
 
4.1. The Pseudo-partitioning Stage 
 
In the first stage, a couple of Perl scripts were written to 
help the user extract and analyze placement data as well as 
determine the pseudo-partition boundary and the location of 
the new clock source. A set of guidance on the decision 
making process of determining the pseudo-partition 
boundary and clock buffers location was given as well. 
 
4.2. The Clock Trunk Routing Stage 
 
For this stage, a software tool called Mojiii was created to 
perform the zero skew routing between the original clock 
sources to the new pseudo-partition clock sources. The 
routing algorithm used in this tool is based on the EZS 
algorithm that was authored by Dr Ren-Song Tsay, from 
IBM’s (International Business Machine) T.J. Watson 
Research Center.  The tool is written using Microsoft Visual 
Basic. The main input to this tool is the location of the 
original and pseudo-partition clock sources and the input pin 
capacitance of the pseudo-partition clock source. One of the 
outputs from the tool is a command file that instructs how to 
implement the routing in automatic place and route-
graphical user interface (APR-GUI). 
 
4.3. Updating Database 
 
In the third and final stage, another set of Perl Scripts was 
coded to integrate back the results from stage 1 and stage 2 
back into the database.  
After completing all the three stages, the user continues 
with the normal clock tree generation process. Since the 
clock nets have been partitioned into smaller portions, the 
clock tree generation process completes faster and requires 
less iteration to meet the skew and delay specification. 
 
5. RESULTS AND DISCUSSION 
 
5.1. Results of Calibration Test Cases 
 
To calibrate the tool accuracy and validate the legality of the 
methodology and its results, a couple of test cases have been 
conducted. The test cases use the gClk400m and sR32k nets. 
The nets are chosen because of their placement distribution 
reflect typical clock distribution on the chip.  Net gClk400m 
has 2231 fan-out or leaf cells and net sR32k has 687 leaf 
cells. The number of leaf cells in this two test case is not 
very high, to allow faster evaluation, less tedious and more 
accurate result validation. The clock frequency for 
gClk400m is 400 MHz, the clock period is 2.5 ns, and the 
maximum skews allowed are set to 5% of the clock period, 
which are 125 ps in this case. For the sR32k net, the clock 
frequency is 32 MHz, the clock period is 31.25 ns, and the 
maximum skew allowed is also set to 5% of the clock 
period, which are 1562.5 ps in this case. 
The net is routed in Mojiii to ensure zero skew routing 
from the clock source to the new buffers. The resistance per 
square length value is set to 0.001 m/μm and the 
capacitance per length is set to 0.075 fF/μm. These numbers 
are derived from the resistance per square value and the 
capacitance per square value of the routing layer, assuming 
that all the routing will be done with 1 μm wire. The results 
of the routing are shown in the Fig. 2 for gClk400m net and 
in the Fig. 3 for sR32k net. 
From the results calculated in Mojiii, the path delay for 
net gClk400m, from the clock source to its leaf cells is 
4.283 ps, with zero (negligible) skew and for net sR32k 
from the clock source to its leaf cells is 6.560 ps, with zero 
(negligible) skew, as expected. 
The routing for both nets is then implemented on the 
real database using two different routing layers. The results 
are shown in the Table II and III. 
 
000025
 
Fig. 2. Routing Results for Net gClk400m 
 
 
Fig. 3. Routing Results for Net sR32k 
 
Another item that needs to be evaluated is the total time 
needed to generate the clock tree using this new 
methodology. The results are shown in Table IV. The 
throughput time is about 13 to 18% better for these test 
cases. However, it is believed that it will affect more 
significantly on larger clock nets. 
 
Table II 
EXTRACTED RESULTS FOR NET GCLK400M 
Cell X Y Input 
Capacitance 
(fF) 
Output to 
input pin 
path delay  
(ps) 
gClk400m 2504 3 - - 
gLB 1545 1150 70 4.928844 
gLT 1445 3600 70 4.950264 
gRB 3910 1855 70 4.841353 
gRT 3998 4795 70 4.862765 
Longest Delay 4.950264 
Shortest Delay 4.841353 
Maximum Skew 0.108911 
Clock Period (400 MHz) 2500  
Percentage of skew over clock period 0.00436 % 
 
Table III 
EXTRACTED RESULTS FOR NET SR32K 
Cell X Y Input 
Capacitance 
(fF) 
Output to 
input pin 
path delay  
(ps) 
sR32k 5 4006 - - 
BL1 1720 950 70 7.437194 
BL2 1550 1800 70 7.474003 
BR 3050 1700 70 7.371831 
ST2 3500 3100 70 7.338573 
ST1 4250 4850 70 7.304272 
Longest Delay 7.470272 
Shortest Delay 7.304272 
Maximum Skew 0.1657316 
Clock Period (32 MHz) 31250  
 
Table IV 
THROUGHPUT-TIME COMPARISON 
Net Original 
(minutes) 
Mojiii 
(minutes) 
Improvement 
gClk400m 261 226 13.4% 
sR32k 145 119 17.93% 
 
5.2. Results of Evaluation on a Chip Database 
 
The ZSCR methodology was deployed on one of the chips. 
The chip was fabricated in 0.35 micron (Leff=0.25 microns), 
2.5V CMOS technology with 6 metal layers. The chip die 
size was 7.2 mm x 5.5 mm and contained 2.8 million 
transistors.  The clock distribution network has to serve a 
large number of macro cells (18) along with approximately 
32,000 master slave latch-pairs.  The frequency of the chip’s 
main clock is 250 MHz. 
The main clock was first balanced using the original 
CTG tool. Then the balancing exercise was repeated using 
the ZSCR Methodology.  The results from both run are 
compared in Table V. 
000026
 
TABLE V 
RESULTS COMPARISON 
Flow Maximum 
Path Delays 
(ns) 
Maximum 
Skew (ps) 
Throughput 
Time 
(days) 
Original 12.5621 180 14 days 
Zero 
Skew 
12.2873 174 7 days 
 
Based on the results, the maximum path delays and 
maximum skews show some improvement, but are not 
significantly different. This is expected as the main 
underlying tools for balancing the clock net is still the same, 
which is the CTG tool. However, there is a significant 50% 
improvement in the through put time. This improvement is 
attributed to the ability to run CTG on much smaller 
portions of clock nets and to the ability to run it 
concurrently, i.e. running five smaller nets on five 
workstations in parallel.  
The results of this evaluation validate that the ZSCR 
methodology has been correctly implemented and has 
produced valid results. It also confirms that the deployment 
of the methodology has achieved its goal of speeding up the 
clock tree generation process in IC design. 
 
6. CONCLUSION 
 
The ZSCR methodology has been shown to have helped the 
design team improve their overall throughput time when 
balancing the main clock network on the chip. There are still 
many areas of improvement that can be implemented. 
The pseudo-partitioning stage can be further automated 
to eliminate the need for the user to manually analyze the 
placement. The efficiency of stage two of the methodology 
can be further improved to provide more accurate delay 
estimations and address other realistic layout concerns. One 
of the layout concerns is that the EZS algorithm does not 
strive for wire length minimization. Another area of 
improvement that should be implemented is wire-sizing. 
 
7. REFERENCES 
 
[1] P. Rashinkar, P. Paterson, L. Singh, “System-on-a-Chip 
Verification: Methodology and Techniques,” New York: Kluwer 
Academic Publishers, 2001. 
 
[2] M. A. B. Jackson, A. Srinivasan, E. S. Kuh, “Clock Routing for 
High-Performance ICs,” Proceedings of the ACM/IEEE Design 
Automation Conference, pp.573-579, June 1990. 
 
[3] M. C. Chi, S. H. Huang, “A Reliable Clock Tree Design 
Methodology for ASIC Designs,” Chung Yuan Journal, Vol. 28, 
No. 3, pp. 115-122, 2000. 
 
[4] T. Chao, Y. Hsu, J. Ho, “Zero Skew Clock Routing with 
Minimum Wirelength,” IEEE Transaction on Circuits and Systems, 
Vol. 39 No. 11, pp. 799-814, 1992. 
 
[5] Y. P. Chen, D. F. Wong, “An Algorithm for Zero Skew Clock 
Tree Routing with Buffer Insertion,” Proceedings of the European 
Design and Test Conference, pp. 230-236, March 1996. 
 
[6] A. Kahng, J. Cong, G. Robins, “High Performance Clock 
Routing Based on Recursive Geometric Matching,” Proceeding of 
the ACM/IEEE Design Automation Conference, pp.322-327, June 
1991. 
 
[7] R. S. Tsay, “Exact Zero Skew,” Proceedings of the IEEE 
International Conference on Computer Aided Design, pp.336-339, 
November 1991. 
 
[8] S. Pullela, N. Menezes, L. T. Pillage, “Low Power IC Clock 
Tree Design,” Proceedings of the Custom Integrated Circuits 
Conference, pp.263-266, May 1995. 
 
[9] R. Y. Chen, N. Vijaykrishnan, M. J. Irwin, “Clock Power 
Issues in System-on-a-Chip Designs,” Proceedings of the IEEE 
Computer Society Workshop on VLSI '99.  pp. 48-53, April 1999. 
 
[10] H. Bakoglu, J. T. Walker, J. D. Meindl, “A Symmetric Clock 
Distribution Tree and Optimized High Speed Interconnections for 
Reduced Clock Skew in ULSI and WSI Circuit,” Proceedings of 
the IEEE International Conference on Computer Design, pp. 118-
122, October 1986. 
 
[11] K. M. Carrig, A. M. Chu, F. D. Ferraiolo, J. G. Petrovick, P. 
A. Scott, and R.J. Weiss, “A clock Methodology for High-
Performance Microprocessors,” Proceeding of the IEEE Custom 
Integrated Circuits Conference, pp. 119-122, May 1997. 
000027
Intentional Blank Page
000028
