High performance IC clock networks with grid and tree topologies by Lu, Jianchao
High Performance IC Clock Networks with Grid and Tree
Topologies
A Thesis





in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
in Electrical and Computer Engineering
May 2011
© Copyright 2011
Jianchao Lu. All Rights Reserved.
Table of Contents
List of Tables
List of Figures
Abstract
1. Introduction
1.1 Problem Statement
1.2 Contributions of this Work
1.2.1 Contributions on Clock Mesh Synthesis
1.2.2 Contributions on ROA Synthesis
1.2.3 Contributions on Clock Tree Optimization
1.3 Organization of the Dissertation
2. Overview of Clock Networks
2.1 The Properties of Clock Network
2.1.1 The Timing Requirement
2.1.2 The Power Dissipation
2.2 Clock Tree
2.2.1 The Tree Topology
2.2.2 Polarity Assignment on Clock Tree
2.3 Clock Mesh Network
2.3.1 Switching Capacitance on the Mesh Network
2.3.2 Clock Skew on Clock Mesh Network
2.4 Rotary Oscillator Array (ROA) Synthesis
2.4.1 ROA Design Considerations
2.4.2 Capacitive Load Balancing
2.4.3 ROA Clock Routing
2.4.4 Unbuffered Tapping Wires
3. Clock Mesh Synthesis: Timing Slack Aware Incremental Register Placement with Clock Mesh Synthesis
3.1 Introduction
3.2 Methodology
3.2.1 Generating the Feasible Moving Regions (FMR)
3.2.2 Mesh Wire Generation and Placement
3.2.3 Timing Slack Aware Incremental Register Placement Avoiding Overlapping
3.2.4 Top Level Tree Generation
3.2.5 Discussions on Routing Congestion
3.3 Experimental Results
3.3.1 Improvements and Trade-offs
3.3.2 Power Density Optimization and Routing Congestion
3.4 Conclusions
4. Clock Mesh Synthesis: Gated Local Trees and Activity Driven Register Clustering
4.1 Introduction
4.2 Methodologies
4.2.1 Generating the Feasible Moving Regions (FMR)
4.2.2 Activity Driven Register Clustering
4.2.3 Incremental Register Placement
4.2.4 Clock Mesh Synthesis with Gated Local Trees
4.3 Experimental Results
4.4 Conclusions
5. ROA Synthesis: Steiner Tree Based Rotary Clock Routing
5.1 Introduction
5.2 Methodology
5.2.1 Register Clustering
5.2.2 Target Delay Calculation
5.2.3 Bounded Skew Cost Matrix Construction
5.2.4 Assignment Problem Formulation
5.2.5 Balanced Tapping Points Assignment Algorithm
5.3 Time Complexity Analysis
5.4 The Results of the Previous Works
5.5 Experimental Results
5.6 Conclusion
6. Clock Buffer Polarity Assignment: Capacitive Load Awareness and Skew Tuning
6.1 Clock Buffer Polarity Assignment
6.2 Motivation
6.2.1 The Capacitive Load of Clock Buffers
6.2.2 Capacitive Load on Peak Current
6.3 Problem Formulation
6.4 Polarity Assignment
6.4.1 Selection of the Negative Polarity Buffers/Inverters
6.4.2 Weight of the Clock Buffers/Inverters
6.4.3 Polarity Assignment by Lexi-search Algorithm
6.4.4 The Area Definition
6.4.5 The Comparison of Lexi-search Algorithm with Dynamic Programming Algorithm
6.4.6 Experimental Results
6.4.7 Discussions on the Peak Current Weight
6.5 Heuristic Tuning for Skew Minimization
6.5.1 Performing Skew Tuning at the Post-PA Stage
6.5.2 Skew Tuning Scheme
6.5.3 Skew Tuning Experimental Results
6.6 The Comparison with Previous Work
6.7 Conclusion
7. Clock Polarity Assignment: A Reconfigurable Flow for Clock Gated Designs
7.1 Introduction
7.2 Preliminaries
7.2.1 XOR Gate Configuration for Polarity Assignment
7.2.2 The Characteristics of the Buffer and the XOR Gates
7.2.3 Reconfigurable Polarity Assignment
7.3 XOR Gates Insertion on Clock Tree
7.4 Polarity Assignment on Sink Level XOR Gates
7.4.1 Problem Formulation
7.4.2 A Greedy Polarity Assignment Method
7.4.3 Discussion on the Polarity Assignment Method
7.5 Polarity Assignment on Non-sink Level XOR Gates
7.5.1 Problem Formulation
7.5.2 Polarity Assignment
7.6 Discussions on the Methodology
7.7 Experimental Results
7.7.1 The Design of DET-FFs
7.7.2 Polarity Assignment with XOR Gates
7.7.3 Comparison to Conventional Polarity Assignment with Buffer and Inverter Replacement
7.7.4 Trade-off Analysis
7.8 Conclusions
8. Conclusions and Future Directions
8.1 Conclusions
8.1.1 Conclusions on Clock Mesh Synthesis
8.1.2 Conclusions on Rotary Oscillator Arrays
8.1.3 Conclusions on Clock Polarity Assignment
8.2 Future Directions
8.2.1 Clock Mesh Synthesis
8.2.2 Rotary Oscillator Arrays (ROA)
8.2.3 Clock Polarity Assignment
Bibliography
Vita
List of Tables
3.1 The linear programming formulation for incremental register placement.
3.2 Experimental Set 1: Wirelength comparison using the optimized mesh size in [29].
3.3 Experimental Set 2: Wirelength comparison using the proposed optimized mesh size.
3.4 Total capacitance and clock skew of the synthesized mesh network.
3.5 Trade-off analysis.
3.6 Power density and routing congestion.
4.1 The linear programming formulation for incremental register placement.
4.2 The comparison of switching capacitance and clock skew.
4.3 The trade-off effects.
5.1 Balanced assignment problem (BAP) formulation.
5.2 Linear bottleneck assignment problem (LBAP) formulation.
5.3 Wirelength, cap balancing, frequency variation and skew results.
6.1 Peak current reduction problem formulation for one area.
6.2 Peak current curve fitting error.
6.3 Notations.
6.4 Benchmark information.
6.5 Peak current reduction on the largest ISCAS’89 benchmark circuits (Typical operating condition).
6.6 Peak current reduction on the largest ISCAS’89 benchmark circuits (Worst operating condition).
6.7 Peak current reduction problem formulation for one area considering the small current peak.
6.8 The optimization level w/ and w/o considering the small current peak.
6.9 An illustration of polarity assignment and skew tuning.
6.10 Skew information after polarity assignment with skew tuning.
6.11 Peak current improvements compared to the peak current of the original clock trees (typical corner and worst corner, respectively).
6.12 The performance comparison of different polarity assignment methods.
7.1 The logic function of XOR.
7.2 XOR gates inserted at the sink level [48].
7.3 XOR gates inserted at the non-sink level.
7.4 Numerical peak current reduction comparison of the proposed method and the optimal MIP.
7.5 The characteristics of the DFFX1 and the designed DET-FF driving a capacitive load of 60 fF
7.6 Reconfigurable polarity assignment with XOR gates inserted at the sink level.
7.7 Reconfigurable polarity assignment with XOR gates inserted at the non-sink level.
7.8 Peak current comparison for clock trees with XORs without clock gating (without polarity assignment).
7.9 Peak current comparison for clock trees with XORs with clock gating.
7.10 Skew degradation for different polarity assignment methods.
7.11 Area increase and power saving information compared with the traditional clock tree.
7.12 Area and power increase information compared with the traditional clock tree with DET-FF.
List of Figures
1.1 The IC design flow.
2.1 An illustration of the register to register timing path.
2.2 An illustration of the clock tree.
2.3 Two symmetrical tree topologies [41].
2.4 Transient simulation of buffers BUFX4 and INVX4 from a 90 nm library.
2.5 Clock mesh network.
2.6 Rotary Oscillator Array (ROA).
2.7 Simulated clock signals for unbalanced capacitance distribution on the five (5) rings of an ROA. Note the distortion of the rectangular clock waveform and the 30.3% variation of clock frequency over different rings.
3.1 The methodology flow.
3.2 The construction of feasible moving regions (FMR).
3.3 The feasible moving regions and mesh tracks.
3.4 The overlap illustration.
3.5 The incremental register placement illustration for ISCAS’89 s35932.
4.1 The illustration of clock mesh networks.
4.2 The methodology flow.
4.3 An illustration of the merging region construction.
4.4 The overlap illustration.
5.1 ROA network illustration.
5.2 The physical design flow employing ROA technology.
5.3 The illustration of tapping wire connection.
6.1 RC model of buffers and wires.
6.2 Peak current on vdd/gnd rails of BUFX4 with capacitive load.
6.3 Clock tree in s15850 and local areas of P/G straps.
6.4 HSPICE simulation of buffer BUFX4 with cap load and curve fitting.
6.5 The runtime and optimality comparison of Dynamic Programming (DP) approach and the Lexi-Search Algorithm (LSA).
6.6 Delays vs. cap load for BUFX4 and IBUFX4 under typical and worst operating conditions.
6.7 A clock branch illustration.
6.8 Peak current improvements (Not considering cap load, considering cap load and considering cap load with skew tuning method).
6.9 The time interval graph on one local area of the clock tree in s13207.
7.1 Buffer and XOR gate simulation in HSPICE.
7.2 The schematics of the buffer and XOR gate in [73].
7.3 The peak current analysis of the buffer and XOR gate.
7.4 The illustration of the clock network with clock gating.
7.5 A clock tree synthesized with XOR gates.
Abstract
High Performance IC Clock Networks with Grid and Tree Topologies
Jianchao Lu
Advisor: Baris Taskin, Ph.D.
In this dissertation, an essential step in the integrated circuit (IC) physical de-
sign flow—the clock network design—is investigated. Clock network design entails
a series of computationally intensive, large-scale design and optimization tasks for
the generation and distribution of the clock signal through different topologies. The
lack or inefficacy of the automation for implementing high performance clock net-
works, especially for low-power, high speed and variation-aware implementations, is
the main driver for this research. The synthesis and optimization methods for the
two most commonly used clock topologies in IC design—the grid topology and the
tree topology—are primarily investigated.
The clock mesh network, which uses the grid topology, has very low skew vari-
ation at the cost of high power dissipation. Two novel clock mesh network design
methodologies are proposed in this dissertation in order to reduce the power dissipation.
These are the first methods known in the literature that combine clock mesh
synthesis with incremental register placement and clock gating for power saving pur-
poses. The application of the proposed automation methods on the emerging resonant
rotary clocking technology, which also has the grid topology, is investigated in this
dissertation as well.
The clock tree topology has the advantage of lower power dissipation compared to
other traditional clock topologies (e.g. clock mesh, clock spine, clock tree with cross
links) at the cost of increased performance degradation due to on-chip variations. A
novel clock tree buffer polarity assignment flow is proposed in this dissertation in
order to reduce these effects of on-chip variations on the clock tree topology. The
proposed polarity assignment flow is the first work that introduces post-silicon, dy-
namic reconfigurability for polarity assignment, enabling clock gating for low power
operation of the variation-tolerant clock tree networks.

1. Introduction
The clock distribution network, in conjunction with the clock generation circuitry,
defines the synchronization quality of a synchronous system. The clock distribution
network consumes a significant portion of the overall power budget which can be as
much as 40% of the total power dissipation [60]. Clock network synthesis tools
automate the design of a clock distribution network that satisfies a given synchronization
quality (timing budget) and a power budget. Depending on these timing
and power budgets, and the clocking technology, clock networks can be implemented
with one or a combination of many structures such as a clock tree, a clock mesh or a
rotary oscillator array (ROA), etc. Most of the existing industrial clock network syn-
thesis tools provide a sophisticated, robust clock tree generation process. Recently,
industrial tools have started supporting clock mesh network generation as well, although
with limited flexibility in design and power optimization. The automation for
rotary clocking has only been performed at the academic level [31, 32, 35, 82], and no
commercial options exist for this process.
In this dissertation, the electronic design automation methodologies for the two
clock network topologies—the grid topology (clock mesh network and rotary oscillator
arrays) and the tree topology (clock tree)—are investigated. The clock mesh network
is commonly used in high-end microprocessor design as the clock mesh provides
low variations and high reliability [22, 63, 64, 75, 85]. The rotary oscillator array is a novel
and attractive alternative to the traditional clock network due to its high frequency
and low power dissipation characteristics [84, 86, 89]. Rotary oscillator arrays and
clock mesh networks are similar in having a grid topology backbone, and requiring the
connections of registers to these topologies with stub wires. The clock tree is a low








Figure 1.1: The IC design flow.
methods are investigated in this dissertation to reduce the on-chip variations for the
clock tree topology.
1.1 Problem Statement
The clock network consists of interconnect wires and buffers that deliver the clock
signal to every synchronous component in the system. Clock network synthesis
is an essential step in the IC physical design flow, as shown in Figure 1.1.
In a synchronous system, each synchronous component (i.e., register) has a timing
requirement, i.e., a specification of when the clock signal generated by a phase-locked
loop (PLL) should arrive at the synchronous component. Clock network
synthesis is the stage in the design flow where the timing requirements are interpreted
and the clock network is generated using interconnect wires and buffers that deliver
the clock signals to the synchronous components while satisfying the timing requirements.
In most designs, the timing requirement is zero skew, which requires the clock
signal to arrive at each synchronous component at the same time.
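The zero (or bounded) skew requirement can be illustrated with a small sketch. It assumes skew is measured as the spread between the latest and earliest clock arrival times over all registers; the function names, the picosecond unit and the sample arrival times are hypothetical and serve only as an illustration.

```python
def clock_skew(arrival_times_ps):
    """Return the global skew: latest minus earliest clock arrival time (ps)."""
    if not arrival_times_ps:
        raise ValueError("at least one arrival time is required")
    return max(arrival_times_ps) - min(arrival_times_ps)


def meets_skew_bound(arrival_times_ps, bound_ps=0.0):
    """Check a bounded-skew requirement; bound_ps = 0 means zero skew."""
    return clock_skew(arrival_times_ps) <= bound_ps


# Hypothetical clock arrival times (ps) at three registers.
balanced = [100.0, 100.0, 100.0]
unbalanced = [100.0, 112.0, 95.0]

print(clock_skew(balanced))              # zero-skew case
print(clock_skew(unbalanced))            # spread between 95 ps and 112 ps
print(meets_skew_bound(unbalanced, 20))  # bounded-skew check with a 20 ps bound
```

A zero skew specification corresponds to a bound of 0, while a bounded skew specification relaxes the bound to a small positive value.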
In addition to the timing requirements, the power dissipation on the clock network
is another design objective. The dynamic power dissipation on the clock network
is significant due to two reasons: 1. The clock network is a global interconnect
network of wires and buffers with large capacitance. 2. The clock signal on the large
capacitive network switches at every clock cycle. Recent research shows that the
power consumed on the clock network can be as much as 40% or even more of the
total power dissipation [60].
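The two reasons above can be related through the standard first-order model of CMOS dynamic power. The symbols below (activity factor, clock network capacitance, supply voltage and clock frequency) are notation assumed here for illustration and are not taken from the dissertation.

```latex
P_{\mathrm{dyn}} = \alpha \, C_{\mathrm{clk}} \, V_{DD}^{2} \, f_{\mathrm{clk}},
\qquad \alpha_{\mathrm{clock}} = 1
```

Under this model, where the activity factor counts charging transitions per cycle, the clock net charges and discharges its full capacitance once every cycle, so its activity factor is 1. Typical data nets switch far less often, which is why a large clock capacitance translates so directly into power.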
For these reasons, zero (bounded) skew clock network synthesis with
power optimization is considered the main problem in industrial EDA tool development
for the clock network synthesis stage of the IC physical design flow. In this
dissertation, zero (bounded) skew and low power clock network synthesis methodologies
for the two different types of clock network topologies, clock grid (clock mesh
and rotary oscillator array) and clock tree, are investigated.
The clock mesh network structure is the industry standard for high-end microprocessors
and a center of interest in academia as well. The rotary oscillator
array structure is an emerging technology of interest for longer term integration into
microelectronics design. The clock tree consumes less power than the clock mesh
but has relatively higher skew variation. The clock tree is the industry standard for
application specific integrated circuits (ASICs) operating at lower frequencies than
the high-end microprocessors. These three clock structures are primarily investigated
in this dissertation.
1.2 Contributions of this Work
The major contributions of the work are the novel methodologies for synthesis and
optimization of the clock networks with different topologies. The proposed improve-
ments are significant on the topics of low power synthesis methodologies for clock
mesh, low power routing methodologies for rotary oscillator arrays (ROA) and low
variation methodologies for clock tree optimization, addressing the specific challenges
of each clock network topology.
1.2.1 Contributions on Clock Mesh Synthesis
The clock mesh network has low skew variation by design but the major drawback
is its high power dissipation. Different methodologies [2, 29, 61, 67, 81] have been proposed in
recent years for reducing the power dissipation on the clock mesh network. However,
none of these proposed methodologies has considered integrating placement, local
clock trees and clock gating with clock mesh synthesis to reduce the power dissipa-
tion. In this dissertation, the following methodologies with practical significance are
proposed:
(i) A novel clock mesh network synthesis approach which generates an improved
mesh size with registers placed incrementally considering the timing slack on
the data paths and the non-uniform grid wire placement is proposed [43, 46].
This is the first work that combines placement with clock mesh synthesis to
achieve significant power savings.
(ii) A clock mesh network synthesis method which enables clock gating on the local
sub-trees in order to reduce the clock power dissipation is proposed [45]. This is
the first work known in literature that introduces clock gating into clock mesh
synthesis. Clock gating, a popular technique previously inapplicable on a clock
mesh network, is shown to be applicable at the local tree level of the clock mesh
network, leading to significant power savings, which will advance the clock mesh
synthesis research in academia and industry.
1.2.2 Contributions on ROA Synthesis
ROA is a novel clock network with a grid topology which has low power and high
performance synchronization (i.e. high frequency) characteristics. In this work, a
novel rotary clock network routing method [44] is proposed for the low-power reso-
nant rotary clocking technology which guarantees: 1. The balanced capacitive load
driven by each of the tapping points on the rotary rings, 2. Customized bounded
clock skew among all the registers on chip, 3. A sub-optimally minimized total wire-
length of the clock wire routes. The proposed methodology has significantly less
capacitance compared to the best known previous studies. The proposed design au-
tomation methodology is the first efficient automation tool for the local tree routing of
an ROA design. This routing solution conserves capacitive load balancing to maintain
the quality of oscillation and reduces wirelength to limit routing congestion.
1.2.3 Contributions on Clock Tree Optimization
The frequent and synchronized switching activity of the large capacitive load on
a clock distribution network causes a large amount of current to be drawn from the
power supply rails. This sudden and frequent surge of current causes fluctuations in
voltage levels. Clock buffer polarity assignment is demonstrated to be an effective
way of reducing this peak current, which is one of the biggest sources of variation on
the design area. In this dissertation, two polarity assignment methods are proposed:
(i) A clock polarity assignment method that considers the impacts of i) The output
capacitive load on the peak current drawn by the sink level clock buffers, and
ii) The buffer/inverter replacement scheme of polarity assignment on the timing
accuracy is proposed [47, 50].
(ii) A novel clock polarity assignment flow which introduces post-silicon reconfigura-
bility is proposed. The proposed method inserts XOR gates at one level of the
clock tree to facilitate the polarity assignment. The polarities of the XOR gates
can be reconfigured for different modes of clock gating (sleep mode, busy mode,
etc.) such that further reduction of the peak current can be achieved [48, 49, 51].
This is the first work which allows reconfigurable polarity assignment to be per-
formed dynamically at run-time as opposed to traditional static assignment
methods.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows. In Chapter 2, clock network
topologies and their synthesis processes are introduced. In Chapter 3, the novel clock
mesh synthesis method, which integrates the register placement with clock mesh gen-
eration for power saving purposes, is proposed. In Chapter 4, another clock mesh
synthesis method is proposed which uses local gated clock trees to connect the reg-
isters to mesh grid wires to reduce power dissipation. In Chapter 5, the routing
methodology for the ROA network is proposed which uses Steiner tree-like connections to
connect registers to tapping points to reduce the total routing wirelength. In Chap-
ter 6, the improved polarity assignment method for on-chip peak current reduction
is proposed. In Chapter 7, the new polarity assignment flow is proposed which uses
XOR gates as clock buffers to enable reconfigurable polarity assignment at run-time.
The conclusions and the future directions are discussed in Chapter 8.
2. Overview of Clock Networks
The clock network delivers the clock signal from a single source to every syn-
chronous component of a system. Traditionally, the clock tree network [79] is used
for this purpose since the tree topology uses minimal routing resources while satisfy-
ing the zero or low skew requirements in the timing budget. With technology scaling,
on-chip variations are becoming more significant [53]; thus, the clock skew introduced
by the delay variation on different tree branches is no longer negligible. For
this reason, clock network topologies using redundant metal wires to eliminate delay
variation are introduced. The clock spines [39] were used in high performance
microprocessors in the early 2000s; they utilize metal wires called spines to short the
registers in a local area to minimize the local clock skew introduced by delay variation.
Later on, cross links [62] were proposed, which insert minimal additional metal wires on
an already designed clock tree to reduce the clock skew on critical branches. Both
the clock spine and cross links are beneficial for reducing the local skew variation but
not the global skew variation. Clock mesh network [22, 63, 64, 75, 85] is very popu-
lar in modern high-end micro-processor design as it limits the global clock skew by
shorting all the registers using horizontal and vertical metal wires across the design
area. Besides these traditional clock networks with trees and grids, more exotic clock
structures such as the Rotary Oscillator Arrays (ROA) [84] exist that provide low
power and high performance characteristics. These exotic structures are promising
for the application in future ICs. In this chapter, the preliminaries about the clock
mesh, ROA and the clock tree networks are presented.
2.1 The Properties of Clock Network
The two primary concerns when designing the clock networks are the timing and
power requirements. The synchronous systems typically have well defined states and
each synchronous component¹ changes states at the same time. This requires the
clock signal to arrive at all the synchronous components (registers) at the same time,
which is the timing requirement for the clock network. There are multiple ways
to design a clock network such that the timing requirement is satisfied. However,
achieving minimal power dissipation while guaranteeing the timing requirement is
not a trivial task and is the major objective in clock network design.
2.1.1 The Timing Requirement
Although it is desirable that the clock signal arrives at each register at the same
time, it is not always possible due to the existence of the on-chip variations. The
clock skew $t_{if}$ between the registers $R_i$ and $R_f$ is defined as the clock arrival time
difference between the two registers:
\[ t_{if} = t_i - t_f, \tag{2.1} \]
where $t_i$ and $t_f$ are the clock arrival times of the registers $R_i$ and $R_f$, respectively.
The global clock skew $t_{skew}$ is defined as the maximum clock arrival time difference
between all register pairs:
\[ t_{skew} = \max_{\forall R_i, R_f} (t_i - t_f). \tag{2.2} \]
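As an illustration of Equations (2.1) and (2.2), the short Python sketch below computes a pairwise skew and the global skew from a set of clock arrival times. The register names and arrival times are hypothetical values chosen only for this example. Since the maximum of the arrival time difference over all ordered register pairs equals the latest arrival time minus the earliest, the global skew is computed that way.

```python
def pairwise_skew(arrival, i, f):
    """Clock skew t_if = t_i - t_f between registers i and f (Eq. 2.1)."""
    return arrival[i] - arrival[f]


def global_skew(arrival):
    """Global skew (Eq. 2.2): the largest t_i - t_f over all ordered pairs,
    which equals the latest arrival time minus the earliest one."""
    times = list(arrival.values())
    return max(times) - min(times)


# Hypothetical clock arrival times in picoseconds (illustrative only).
arrival = {"R1": 102.0, "R2": 98.5, "R3": 105.0}

print(pairwise_skew(arrival, "R1", "R2"))
print(global_skew(arrival))
```

Note that the pairwise skew is signed, whereas the global skew is non-negative by construction.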
The functionality of a synchronous system can be guaranteed as long as the timing
requirement of each local data path is satisfied. A local data path Ri → Rf as
shown in Figure 2.1 consists of two registers Ri(nitial) and Rf(inal) and a combinational
logic block. The minimum and maximum propagation delays on the combinational
block are denoted by $D^{if}_{PMin}$ and $D^{if}_{PMax}$, respectively. The clock-to-output delay of a
register $R_i$ is denoted by $D^{i}_{CQ}$, whereas $S_f$ is the setup time of the register $R_f$. The
parameters $t_i$ and $t_f$ represent the clock delays to registers $R_i$ and $R_f$, respectively.
¹ In this dissertation, a synchronous component refers to a register without loss of generality.





















Figure 2.1: An illustration of the register to register timing path.
The timing analysis of a synchronous circuit is performed by satisfying the setup
and hold timing constraints for each local data path:
\[ t_i + D^{i}_{CQ} + D^{if}_{PMax} \le t_f + T - S_f - L_{if}, \tag{2.3} \]
\[ t_i + D^{i}_{CQ} + D^{if}_{PMin} \ge t_f. \tag{2.4} \]
For zero clock skew systems, which are the norm for synchronous circuits, clock delays
$t_i$ and $t_f$ are identical, simplifying the timing constraints. Thus, if the sum of the
maximum data propagation time $D^{if}_{PMax}$, the clock-to-output delay $D^{i}_{CQ}$ of the
register $R_i$ and the setup time $S_f$ of the register $R_f$ is greater than the operating
clock period, a timing violation occurs [14]. After the clock period $T$ is chosen, the
placement and routing of the circuit should guarantee that the setup and hold constraints
for each datapath are satisfied with a non-negative timing slack [28]. The setup timing
slack is more critical since the hold violations can be fixed by inserting delays on a
datapath [77]. The timing slack Lif on each timing path Ri → Rf of the circuit can
be calculated as:
\[ L_{if} = T - S_f - D^{i}_{CQ} - D^{if}_{PMax}. \tag{2.5} \]
A non-negative timing slack on each local data path guarantees the timing of the
circuit and improves the tolerance to variations as the timing degradation due to
variations can be potentially accommodated within the available timing slack.
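The slack computation in Equation (2.5) can be sketched in a few lines of Python. The function and variable names below, as well as all delay values, are assumptions made for this illustration and do not come from the benchmark circuits. The sketch evaluates the setup slack of each local data path under zero clock skew and lists the paths with negative slack.

```python
def setup_slack(T, S_f, D_cq, D_pmax):
    """Setup timing slack L_if of Eq. (2.5) for a zero clock skew system."""
    return T - S_f - D_cq - D_pmax


# Hypothetical values in picoseconds (illustrative only).
T = 1000  # clock period
paths = {
    ("R1", "R2"): {"S_f": 30, "D_cq": 50, "D_pmax": 800},
    ("R2", "R3"): {"S_f": 30, "D_cq": 50, "D_pmax": 950},
}

slacks = {path: setup_slack(T, **delays) for path, delays in paths.items()}
violations = [path for path, slack in slacks.items() if slack < 0]

print(slacks)
print(violations)
```

A path with positive slack can absorb some timing degradation, which is the margin that later chapters use when registers are moved incrementally.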
2.1.2 The Power Dissipation
The clock network consumes as much as 40% of the total power dissipation on a
design [60]. Most of the power dissipation on a clock distribution network is dynamic
power dissipation. This is because the clock signal on the clock network switches
at every clock cycle (whereas the logic signal only switches when necessary). Thus,
designing the clock network with fewer metal wires (i.e. less interconnect wirelength) and
fewer buffers is desirable.
The dynamic power dissipation P of a clock network is represented as:
\[ P = \alpha C V^{2} f, \tag{2.6} \]
where α, C, V and f are the switching activity, total capacitance, voltage level and
operating frequency of the clock network, respectively. The voltage level and the
operating frequency are determined by the particular semiconductor technology (e.g. IBM 90nm)
and the design requirement, respectively. As a result, in most of the previous works
on clock network synthesis, the power dissipation is measured by the total switching
capacitance αC.









Figure 2.2: An illustration of the clock tree.
In the proposed methodologies, the switching capacitance αC is also considered as
the metric to measure the level of power dissipation.
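The reasoning behind using αC as the power metric can be checked numerically with Equation (2.6). The following Python sketch uses hypothetical capacitance, voltage and frequency values (none are taken from this dissertation) and shows that, at a fixed voltage and frequency, the ratio of the dynamic power of two clock networks equals the ratio of their switching capacitances.

```python
def dynamic_power(alpha, C, V, f):
    """Dynamic power P = alpha * C * V^2 * f of Eq. (2.6), in watts."""
    return alpha * C * V ** 2 * f


def switching_capacitance(alpha, C):
    """Switching capacitance alpha * C, used as the power metric."""
    return alpha * C


# Hypothetical operating point fixed by technology and design requirement.
V = 1.0      # volts
f = 1.0e9    # hertz

# Two hypothetical clock networks (capacitance in farads).
net_a = {"alpha": 1.0, "C": 100e-12}
net_b = {"alpha": 1.0, "C": 72e-12}

p_a = dynamic_power(net_a["alpha"], net_a["C"], V, f)
p_b = dynamic_power(net_b["alpha"], net_b["C"], V, f)
sc_a = switching_capacitance(**net_a)
sc_b = switching_capacitance(**net_b)

# Both ratios agree because V and f cancel out.
print(p_b / p_a, sc_b / sc_a)
```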
2.2 Clock Tree
An illustration of the clock tree is presented in Figure 2.2. The clock tree delivers
the clock signal from a single source to the sink registers through interconnect wires
and buffers using a tree topology. The advantage of the tree topology is that it saves
power because the tree topology uses fewer wire resources than other topologies.
2.2.1 The Tree Topology
An ideal tree topology is the H-tree topology [27, 38]. The H-tree topology is
symmetrical and has the same path length from the single source to the sink registers
as shown in Figure 2.3(a). A similar structure is the X-tree topology [4] as shown in






Figure 2.3: Two symmetrical tree topologies [41].
the clock sinks to be distributed evenly on the chip area. This may not always be
possible for performance optimization. The asymmetrical tree topology which still
guarantees the zero skew requirement is proposed in [79]. This asymmetrical tree
topology is commonly used in VLSI design for delivering zero (or bounded) skew
signals to multiple components as shown in Figure 2.2. In this dissertation, the clock
tree topology refers to this common asymmetrical topology.
2.2.2 Polarity Assignment on Clock Tree
In a clock tree, the clock buffers are distributed throughout the chip, facilitating
the well-controlled delivery of the clock signal globally to the synchronous compo-
nents. The peak current on the power/ground straps of a synchronous system typi-
cally occurs at the rising/falling edge of the clock signal, when a large number of gates
and registers switch simultaneously [54]. The peak current drawn by the clock tree
buffers is critical due to all the clock buffers switching at every clock period, whereas
the switching activities of the logic gates are much more infrequent. The peak current
behavior of a clock buffer presented in Figure 2.4(a) is obtained by performing an
HSPICE simulation on a clock buffer BUFX4 from a 90nm technology library [73]. The
buffer under test is set to drive a 50fF capacitive load, which is an approximation of
the capacitive load of an average size clock tree branch with fanout registers and wire
in the 90nm technology [73]. As shown in Figure 2.4(a), the peak current on the vdd
and gnd rails occurs at the rising and falling edges of the clock signal, respectively.
The current on the vdd/gnd rails during the period when no switching occurs is very
limited. If all the clock buffers on a clock tree have the same polarity, the current
on vdd rails at the clock rising edge and the current on gnd rails at the clock falling
edge accumulate and a large peak current can be observed. A negative polarity clock
buffer IBUFX4 is simulated by HSPICE and the result is shown in Figure 2.4(b). It is
observed that the peak current of the negative polarity clock buffer on the vdd and
gnd rails occurs at the falling and rising edge of the clock source signal, respectively.
The basic idea of polarity assignment is to reduce the peak current by permitting the
clock buffers to switch at the opposite clock polarities, distributing the peak currents
on the vdd/gnd rails on the rising edge and falling edge of the clock signal. Earlier
polarity assignment methods, such as [54, 66], target polarity assignment on the whole
clock tree. In [13, 37, 65], polarity assignment is considered only on the sink level of
the clock tree, as the majority of the capacitive load is on the sink level nodes. Only
considering polarity assignment at the sink level has the added advantage of low clock
skew degradation.
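The basic idea of sink level polarity assignment can be illustrated with a toy Python sketch. It is not one of the methods proposed in this dissertation or in the cited works: it assumes, purely for illustration, that the peak current of a sink level buffer is proportional to its capacitive load, and it uses a simple greedy heuristic (largest load first, assigned to the lighter group) to split the buffers between positive and negative polarity. The load values are hypothetical.

```python
def assign_polarity(loads_fF):
    """Greedily split sink buffers into positive and negative polarity groups
    so that the summed load (a stand-in for peak current) is balanced.
    Returns (positive_indices, negative_indices, positive_total, negative_total)."""
    order = sorted(range(len(loads_fF)), key=lambda k: loads_fF[k], reverse=True)
    positive, negative = [], []
    pos_total, neg_total = 0, 0
    for k in order:
        # Place the next largest load into the currently lighter group.
        if pos_total <= neg_total:
            positive.append(k)
            pos_total += loads_fF[k]
        else:
            negative.append(k)
            neg_total += loads_fF[k]
    return positive, negative, pos_total, neg_total


# Hypothetical sink buffer loads in fF (illustrative only).
loads = [50, 40, 35, 30, 20, 10]
pos, neg, pos_total, neg_total = assign_polarity(loads)

same_polarity_peak = sum(loads)            # all buffers switch on one edge
mixed_polarity_peak = max(pos_total, neg_total)
print(same_polarity_peak, mixed_polarity_peak)
```

In this simplified model, the worst-case load switching on a single clock edge drops from the full sum to roughly half of it. A practical method must also account for the timing impact of replacing buffers with inverters, which this sketch ignores.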
2.3 Clock Mesh Network
In high performance microprocessor design, the clock distribution network is typ-
ically synthesized with redundancy in order to reduce on-chip variations. Clock
mesh [22, 63, 64, 75, 85] is the commonly used clock structure with redundancy. The

























































(b) Negative polarity buffer.
Figure 2.4: Transient simulation of buffers BUFX4 and INV X4 from a 90 nm
library.
mesh grid wires (redundancy), the stub wires connecting the sink registers to the
mesh grid wires and the top level buffered clock tree that drives the capacitive mesh
grids. The redundancy in a clock mesh network permits a low global clock skew varia-
tion. The low global clock skew is achieved at the expense of high power consumption
compared to tree or other structures because of the excessive wires (mesh grid and
stub wires) and buffers used as well as the short circuit power introduced.
The existing industrial tools synthesize the clock mesh network with very limited
flexibility and power optimization. Due to the popularity of clock mesh in very large
scale microprocessors, design automation efforts are made in the area of clock mesh
synthesis and optimization [2, 29, 61, 67, 81]. In [81], buffer driver insertion and sizing
are studied as well as the mesh reduction for power savings. In [61], an optimal
mesh size selection method under skew constraints is proposed. The method in [61]
encompasses buffer placement and sizing as well as reduction of mesh wires. In [67],
Steiner tree-like connections between registers and meshes are created, instead of
connecting registers to the mesh individually, such that the total stub wires are








Figure 2.5: Clock mesh network.
and [29], non-uniform clock meshes are explored in order to reduce mesh wirelength
and the power consumption. In [2], the timing delays of the combinational logic
paths are considered when building the clock mesh such that the grid density can be
adjusted based on the timing criticality. In [29], the stub wires connected between
sink registers and the clock meshes are reduced by allowing an incremental movement
of the mesh grid.
2.3.1 Switching Capacitance on the Mesh Network
In this work, the switching capacitance is adopted as a measurement of dynamic
power dissipation on the clock network. Assuming the capacitance $c^{t}_{i}$ is the total stub
wire capacitance on the sub-tree $t_i$ and the capacitance $c^{r}_{k}$ is the input capacitance
of the register $r_k$, the total switching capacitance $c_{total}$ (excluding the top level clock
tree) on a mesh network can be calculated as:










where $\alpha_i$ and $C_{ICG}$ are the switching factor of the sub-tree $t_i$ and the capacitance of
the clock gating cell, respectively.
In order to reduce the power consumption on the clock mesh, the proposed meth-
ods in [2, 61, 81] reduce the mesh wirelength while the methods in [29, 67] reduce the
stub wirelength. The method proposed in Chapter 3 of this dissertation reduces both
the mesh and stub wires through the integrated placement and clock network synthesis
approach. In particular, the stub wirelength is reduced further than in [29, 67] as
most of the stub wires are eliminated through incremental placement. The mesh wires
are reduced through the mesh size selection and the mesh wire placement methods.
Without clock gating, the switching capacitance of the clock network contributed
by the mesh and stub wires is proportional to the total wirelength. However, when
clock gating is considered, the switching capacitance can be reduced by reducing both the
switching factor and the wirelength. The method proposed in Chapter 4 reduces the
stub wirelength and the switching factor by register clustering, Steiner tree-like stub
wire connections and clock gating to reduce the total switching capacitance.
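The effect of clock gating on the switching capacitance of a local sub-tree can be sketched as follows. The per-sub-tree expression used here is an assumption made for illustration, namely that the stub wire and register capacitances are scaled by the switching factor while the clock gating cell capacitance is added unscaled; it is one plausible form and is not claimed to be the exact expression used in this dissertation. All numerical values are hypothetical.

```python
def subtree_switching_cap(alpha, stub_len_um, n_regs,
                          c_wire_per_um, c_reg, c_icg=0.0):
    """Switching capacitance of one local sub-tree (assumed model):
    alpha * (stub wire cap + register input caps) + gating cell cap."""
    gated_load = stub_len_um * c_wire_per_um + n_regs * c_reg
    return alpha * gated_load + c_icg


# Hypothetical parameters (capacitances in fF, lengths in um).
C_WIRE = 0.2   # wire capacitance per um
C_REG = 2.0    # input capacitance of one register
C_ICG = 3.0    # capacitance of one clock gating cell

# Ungated sub-tree: switches every cycle (alpha = 1), no gating cell.
ungated = subtree_switching_cap(1.0, 200, 8, C_WIRE, C_REG)
# Gated sub-tree: active a quarter of the time, pays for the gating cell.
gated = subtree_switching_cap(0.25, 200, 8, C_WIRE, C_REG, C_ICG)

print(ungated, gated)
```

Under this model, gating pays off only when the reduction from the switching factor exceeds the added gating cell capacitance, which is why reducing both the stub wirelength and the switching factor matters.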
2.3.2 Clock Skew on Clock Mesh Network






where $t^{buf}_{skew}$, $D_{mesh}(d_{max})$ and $D_{stub}(L^{max}_{stub})$ are the skew introduced by the buffer
drivers of the mesh, the maximum delay on the mesh from a buffer driver to a stub
wire tapping point and the maximum delay from a tapping point to the sink registers,
respectively. In Equation (2.8), the skew introduced by the buffer drivers $t^{buf}_{skew}$ (the
first term) can be compensated using the prescribed skew tree generation method [11]
when synthesizing the top level clock tree. Increasing the number of buffer drivers
also improves $t^{buf}_{skew}$ by improving the driving strength of the mesh, however,
at the cost of increased power consumption. Inserting more buffer drivers also reduces
the second term $D_{mesh}(d_{max})$. The skew introduced by the third term $D_{stub}(L^{max}_{stub})$
is affected by the capacitance and the topology of the sub-tree which connects the
tapping points to the sink registers.
In the proposed methods, the skew introduced by the third term in Equation (2.8)
is primarily considered as the first two terms can be optimized during the top level
tree generation. As such, in the clock mesh synthesis methods, the skew requirement
refers to the skew introduced by the third term. The proposed method guarantees
the skew introduced by the local gated sub-trees is within a given limit.
2.4 Rotary Oscillator Array (ROA) Synthesis
Rotary oscillator array (ROA) is an attractive alternative to traditional clock networks
due to its high performance and low power dissipation characteristics [84]. The
rotary clocking technology is traditionally implemented with a regular array (grid)
topology of oscillatory rings called the rotary oscillator array (ROA) [84] as shown
in Figure 2.6. The rotary rings are generated on the cross-connected transmission
lines (Möbius strips) formed by regular IC interconnects. An oscillation on a rotary
ring can start spontaneously upon any noise event or be stimulated by a start-up circuit
for controlled operation, resulting in a continuous traveling wave [84]. Distributed
CMOS inverters placed uniformly along the transmission lines in anti-parallel con-
figuration adiabatically save power, amplify signals and ensure the rotational clock.
Such rotary oscillator generated square waves present the controllable skew proper-
ties. The traveling oscillation of the signal around the rings causes a varying phase
depending on where the clock signal is sampled. A fixed number of locations on the
































Figure 2.6: Rotary Oscillator Array (ROA).
Despite this varying, controllable skew property, the implementation of zero or
non-zero clock skew synchronized systems is equally feasible using the ROA network [32].
The phase of the clock signal on the ring and the delay of the tapping
wires are summative for delivering the clock phase to the synchronous components.
This can be used to deliver a non-zero or zero clock skew. In the experiments in [32], the
trade-off between zero and non-zero skew synchronization using ROA is analyzed.
2.4.1 ROA Design Considerations
The ROA structure as shown in Figure 2.6 has an array of rings distributed all
over the design area in a checker board manner. The clock signals of various phases
are evenly distributed along each ring. The ROA acts as the global clock signal
distribution network. The registers are connected to the ROA through metal wires
which are considered as the local distribution network.
In ROA, there exists one tapping point on each ring to deliver the same phase clock
signal. In order to guarantee zero skew synchronization and wire minimization, the
registers are typically connected (tapped) to the nearest tapping point which has the
desired phase [31]. The drawbacks of this method are that the registers are connected
to the ROA using individual connections, which leads to long tapping wirelength,
routing congestion and clock signal distortion. Moreover, the capacitive load of each
tapping point is expected to be balanced as the frequency of each ring is affected
by the load capacitance. Thus it is desirable to have a method which efficiently
encapsulates the implementation for 1) operational characteristics such as zero clock
skew synchronization of the synchronous components [31] and 2) implementation
requirements such as capacitive load balancing of each tapping point [32, 84].
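The drawback of tapping each register individually to the nearest tapping point can be seen in a small Python sketch. The coordinates, capacitance values and function names are hypothetical, and Manhattan distance is assumed as the tapping wirelength. Each ring contributes one tapping point with the same phase, every register connects to its nearest one, and the resulting load per tapping point is accumulated.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points, in um."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def tap_nearest(registers, taps):
    """Connect every register to its nearest same-phase tapping point."""
    return {name: min(taps, key=lambda t: manhattan(pos, taps[t]))
            for name, pos in registers.items()}


def load_per_tap(assignment, registers, taps, c_reg, c_wire_per_um):
    """Capacitive load seen at each tapping point: register input caps
    plus the capacitance of the individual tapping wires."""
    load = {t: 0.0 for t in taps}
    for name, t in assignment.items():
        wire_len = manhattan(registers[name], taps[t])
        load[t] += c_reg + c_wire_per_um * wire_len
    return load


# Hypothetical placement (um) and capacitances (fF), illustrative only.
taps = {"ringA": (0, 0), "ringB": (100, 0)}
registers = {"r1": (10, 0), "r2": (20, 10), "r3": (30, 0), "r4": (90, 0)}

assignment = tap_nearest(registers, taps)
load = load_per_tap(assignment, registers, taps, c_reg=2.0, c_wire_per_um=0.2)
print(assignment)
print(load)
```

Because the registers cluster near one ring, that ring's tapping point carries several times the load of the other, which is the imbalance a routing method needs to correct.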
In an ROA, there is an infinite number of tapping points to deliver a clock signal
with different phases. For practical concerns, a preselected number of locations on
the rotary rings is designated as the tapping points with high granularity, uniformly
distributed clock phases. The interconnects used to build the Möbius ring are wide
enough such that the wire resistance is negligible. The capacitive load balancing of
each tapping point is important as an imbalance can lead to a distortion in the signal
waveform as shown in Figure 2.7.
2.4.2 Capacitive Load Balancing






where LT and CT are the total inductance and total capacitance respectively, along
the path of the rotary signal on the ring. The total inductance LT depends on the
Figure 2.7: Simulated clock signals for unbalanced capacitance distribution on the
five (5) rings of an ROA. Note the distortion of the rectangular clock waveform and
the 30.3% variation of clock frequency over different rings.










where Ctra, Cinv, Creg, and Cwire are the capacitances contributed by the transmis-
sion line, inverter pairs, registers, and the register tapping wires, respectively. The
operating frequency of an ROA depends on the total estimated inductance LT and
capacitance CT in the system as estimated in (2.9). All the rotary rings on the ROA
have the same perimeter and the structural properties. Hence, the inductance of
each rotary ring on the ROA is identical. Ctra and Cinv are identical for each ROA
ring. Creg and Cwire depend on the number of registers connected to each ring as
well as their physical proximity to the ring as the proximity affects the tapping wire-
length (thus, Cwire). This potential variation on the total capacitance of each ring of
an ROA negatively affects the oscillation quality [32].
HSPICE simulations are performed to observe the effects of an unbalanced capac-
itance distribution on the frequency of the rotary rings of the ROA. In simulations,
the U-element from HSPICE is used to model the lossy transmission line [19, 68, 86].
The circuit is set up in SPICE to simulate five (5) rings in an ROA. In this demonstrative
setup, total capacitive loads of 10pF, 20pF, 30pF, 40pF and 50pF are
modeled for rotary rings Ring1 through Ring5, respectively. For simplicity, the total
capacitance of each ring is uniformly distributed to the tapping points within the corresponding
ring. The clock waveforms observed for this setup are shown in Figure 2.7.
Across the different rings of the ROA, a maximum variation of 30.3% in frequency is
observed from 1.28GHz to 1.84GHz. Note that, in addition to the unmatched fre-
quencies, the oscillations are distorted due to the high capacitance imbalance across
the rings of the ROA. When the synchronous components (i.e. registers) are con-
nected to different rotary rings—in order to satisfy the skew requirements—similar
variations might occur. Hence, the balanced capacitive loading across the rotary rings
of the ROA is an important requirement in guaranteeing stable rotary oscillations.
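The dependence of the ring frequency on the capacitive load can be illustrated with a first-order estimate. The Python sketch below assumes the estimate commonly cited in the rotary clocking literature, f ≈ 1/(2√(L_T C_T)), as the form of the frequency relationship; the inductance and the fixed transmission line and inverter capacitance are hypothetical values. It is not intended to reproduce the HSPICE results above, only to show the trend when the per-ring loads differ.

```python
import math


def rotary_frequency(L_T, C_T):
    """Assumed first-order estimate f = 1 / (2 * sqrt(L_T * C_T)).
    L_T in henries, C_T in farads, result in hertz."""
    return 1.0 / (2.0 * math.sqrt(L_T * C_T))


def frequency_spread(freqs):
    """Relative variation between the fastest and slowest ring."""
    return (max(freqs) - min(freqs)) / max(freqs)


# Hypothetical ring parameters (illustrative only).
L_T = 1.0e-9        # total ring inductance
C_FIXED = 50.0e-12  # transmission line plus inverter capacitance per ring

# Register and tapping wire loads per ring, mirroring the 10pF to 50pF setup.
loads = [10e-12, 20e-12, 30e-12, 40e-12, 50e-12]

unbalanced = [rotary_frequency(L_T, C_FIXED + c) for c in loads]
balanced = [rotary_frequency(L_T, C_FIXED + sum(loads) / len(loads))
            for _ in loads]

print(frequency_spread(unbalanced))
print(frequency_spread(balanced))
```

When the same total load is distributed evenly across the rings, the estimated frequencies coincide and the spread vanishes, which is the motivation for capacitive load balancing.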
The capacitive imbalance can be eliminated by using varactors, as suggested in [84]
and [42]. The varactors enable post-silicon tunability of the capacitance on the ring,
at the expense of the increased area to place the varactors. Another drawback of varactors
is their impact on the oscillation frequency through the relationship in (2.9):
the frequency of the system can be compromised if the change in the value of capacitance
is high. A better approach is to use design automation efforts to 1) accurately
model the parasitics of the rotary clock network [33] and 2) utilize routing to balance
the capacitive load and satisfy the oscillation frequency through (2.9) [35].
2.4.3 ROA Clock Routing
The grid-based ROA topology is placed during the floorplanning stage of the
physical design flow. The perimeter of each ring is computed by (2.9) based on
the desired clock frequency. Traditionally, ROA is composed of regular (e.g. square
boundary for the Möbius ring) rotary rings, which are replicated to have a footprint
equal to that of the floorplan. In [30], the use of custom-shaped rotary rings to
constitute the ROA is proposed, with a reduced tapping wirelength. The new rotary
clock network routing method proposed in this work focuses on the regular ring ROAs
for experimentation. An extension to custom rotary ring ROAs is trivial.
The ROA clock routing involves connecting the placed registers to the tapping
points on the pre-placed rings through tapping wires to achieve goals such as zero
skew, capacitive load balancing at the tapping points and wire minimization. A
previous work in [32] presents an ROA routing scheme by using an integer linear
programming (ILP) formulation. Two functional drawbacks in [32] are that the ILP
formulation is proposed for non-zero clock skew synchronization and the registers
are tapped individually to the tapping points. The ILP solution in [32] also has a
performance drawback for (non-typical) large scale implementations, as ILP runtimes
can become prohibitive. The previous work in [76] combines the tree network and
rotary ring together to reduce the tapping wirelength. However, zero clock skew and
capacitive load balancing are not guaranteed. The previous work in [35] considers
zero-skew synchronization and capacitive load balancing at the same time but the
method uses an excessive amount of wire such that the clock signal may be severely
distorted. A post-mesh routing method is proposed in [67] in order to connect the
registers to the clock mesh grids using a tree structure which is similar to the problem
presented in this work. However, this method in [67] is inapplicable to the ROA clock
routing because the size of the rings in an ROA is determined by the operating
frequency. Thus, the ring size may be significantly larger than a clock mesh grid,
which would lead to a large clock skew if the tree structure is not built in a zero skew
manner.
2.4.4 Unbuffered Tapping Wires
The tapping wires that are used to connect the synchronous components to the
rotary clock network also contribute to the operation of the rotary clock network.
Particularly for implementations where the tapping wires are long, the parasitics of
the tapping wires impact the rotary oscillation as well as the power consumption of
the clock network. Previous works, such as [78] and [82], aim to minimize these tapping
wires, partially to eliminate this impact. The silicon implementation of an FIR
filter in [87] with rotary clocking adopts a different strategy of buffering the tapping
wires similar to [23]. This enables predictability of the capacitive load on the ring
and simplifies the design process.
The power savings of the rotary clocking are due to the adiabatic switching of
the oscillating signal on the ROA rings. When the registers are connected to the
tapping points using unbuffered tapping wires, the adiabatic switching of the clock
signal can be extended to the tapping wires to adiabatically switch Cwire and Creg,
thereby preserving the power savings. However, if the buffers are inserted on tapping
wires, the buffers isolate their fanouts from the adiabatic switching on the ring. As
a result, the buffers charge and discharge their output load (a portion of Cwire and
Creg that depends on where the buffers are inserted) dynamically through the vdd and
gnd connections, thereby undesirably increasing the power dissipation.
3. Clock Mesh Synthesis: Timing Slack Aware Incremental Register
Placement with Clock Mesh Synthesis
In this chapter, a clock mesh planning and synthesis method is proposed which
significantly reduces the power dissipation on the network while considering the power
density and timing slack simultaneously. The proposed method is performed at the
post placement stage and consists of three major steps: 1) Feasible moving region
construction of each register considering timing slack, 2) Mesh grid wire generation
and placement, 3) Incremental register placement for stub wire minimization considering
power density and timing slack. The advantages of the proposed method are
the reduced power dissipation (28% on average on the benchmark circuits), the
optimized power density and the guaranteed non-negative timing slack. These advantages
come at the cost of a decreased timing slack (1.1% of the clock period) and
an increased logic wirelength (+5.9%) on the benchmark circuits.
This chapter is organized as follows. In Section 3.1, a brief introduction about
clock mesh network and the proposed method is presented. In Section 3.2, the proposed
methods are introduced. In Section 3.3, the experimental results are summarized.
This chapter is concluded in Section 3.4.
3.1 Introduction
In the traditional integrated circuit design flow (Figure 1.1 on page 2), the place-
ment and clock network synthesis stages are performed sequentially. The primary
objectives of placement do not include the optimization of register placement for the
succeeding clock network synthesis stage. This may result in a low quality (e.g. power
consuming) clock network after synthesis. It is desirable to combine the placement
and clock network synthesis stages to provide a better physical design. The novel
approach of synthesizing clock mesh network proposed in this chapter combines the
placement and clock network synthesis stages of the traditional IC physical design
flow through incremental placement. In this method, both the registers and the
mesh wires are incrementally placed towards each other considering the timing slack
and power density such that the total stub wirelength is significantly reduced. It is
demonstrated in the experiments that the routing congestion is not worsened if the
registers are placed to the closest mesh grid wires. Moreover, a more favorable (sparse
and non-uniform) grid is selected automatically with limited skew degradation. The
advantages of the clock mesh network generated by the proposed method are the
following:
(i) The power consumption of the clock mesh network is reduced compared to the
networks generated by previous clock mesh design methods due to the sparse
mesh network and the reduced stub wirelength,
(ii) The skew variation of the clock network is comparable to a uniform, full mesh
grid clock network, despite the sparser mesh grid,
(iii) The non-negative timing slack of the circuit is preserved after the incremental
register placement; the worst case power density is increased by only a limited
amount and the routing congestion is not affected.
The trade-offs of the proposed method are the changing timing slack and logic
routing wirelength. The timing slack is guaranteed to be non-negative by the proposed
method with minimal slack decrease. The logic routing wirelength is increased by only
5.9% on average on the benchmark circuits, which is very limited.
3.2 Methodology
The proposed method flow is presented in Figure 3.1. The proposed method takes
an existing placement result as the input. A static timing analysis is performed to
identify the timing slack of each data path. Based on this information, the feasible
moving regions of each register are created. The non-uniform clock mesh is generated
and placed in order to simultaneously reduce the mesh wires and stub wires according
to the feasible moving regions of the registers. The registers are then incrementally
moved based on the timing and the clock mesh placement. To finalize, the buffer
drivers of the clock mesh are inserted and the top level clock tree is generated. The
(b) The feasible moving region of a register within a timing
budget.
Figure 3.2: The construction of feasible moving regions (FMR).
3.2.1 Generating the Feasible Moving Regions (FMR)
The proposed method suggests the incremental placement of the registers towards
the clock meshes. The timing slacks are considered during the incremental placement
(movement) of the registers in order to guarantee the functional correctness of the
design. The feasible moving region (FMR) of each register is thus defined based on
the timing path to guide the register clustering and incremental placement. Note
that the timing slack of a register-to-register path Ri → Rf is associated with the
physical paths on the register-to-register (timing) path. The incremental placement
of the registers affects the locations of the registers but not the combinational logic
gates constituting the physical paths. Consequently, incremental register placement
changes the slack of the entire timing path; however, only the physical paths at the
fanout of the initial register Ri and the fanin of the final register Rf are affected. The
remaining physical paths between the combinational gates remain unaffected. To this








as illustrated in Figure 3.2(a). $D^{i}_{fo_k}$ is the wire delay from the output of the register $R_i$
to the input of the $k$th fanout gate of the register $R_i$. $D^{if}_{m_k}$ is the gate and wire delay
from the input of the $k$th fanout gate of the register $R_i$ to the input of the fanin gate
of register $R_f$. $D^{f}_{fi_k}$ is the gate and wire delay from the input of the fanin gate of
register $R_f$ to the input of the register $R_f$.
The timing slack of the local data-path can be re-written as:
$$L_{if} = T - S_f - D^{i}_{CQ} - D^{if}_{m_k} - D^{f}_{fi_k} - D^{i}_{fo_k}. \qquad (3.2)$$
At the post placement stage, the clock period $T$ and each part of the original data-path
delay $D^{if}_{m_k}$, $D^{i}_{fo_k}$ and $D^{f}_{fi_k}$ are known. In order to guarantee the functional correctness
under variation, the timing slack $L_{if}$ of each register timing path $R_i \to R_f$ should
be non-negative (or greater than a positive value specified by the designer). If the
register $R_i$ is moved, the fanout wirelength $w^{i}_{fo_k}$ of the register $R_i$ will change; thus the
delay $D^{i}_{fo_k}$ on the $k$th fanout wire and the clock-to-output delay $D^{i}_{CQ}$ of the register
will change as the load capacitance changes. For the same reason, if the location
of the register $R_f$ is changed, the delay $D^{f}_{fi_k}$ on the fanin path of the register $R_f$
changes.





are monotonically increasing functions of fanout
wirelength $w^{i}_{fo_k}$ and fanin wirelength $w^{f}_{fi}$, respectively. Given a positive slack, a
maximum fanout wirelength $W^{i}_{fo_k}$ and fanin wirelength $W^{i}_{fi}$ can be calculated for
each register $R_i$. As long as the Manhattan distance from the register $R_i$ to each
corresponding fanout gate is less than the maximum fanout wirelength $W^{i}_{fo_k}$ and the
Manhattan distance from the register $R_i$ to the fanin gate is less than the maximum
fanin wirelength $W^{i}_{fi}$, the timing slack of the register $R_i$ is guaranteed to be feasible
with the adopted timing models.
The feasible moving regions for each fanout and fanin gate of register $R_i$ are
created based on $W^{i}_{fo_k}$ and $W^{i}_{fi}$, respectively. For instance, at the location of the $k$th
fanout gate of register $R_i$, a tilted rectangle with radius $W^{i}_{fo_k}$ is created as shown
in Figure 3.2(b) such that the Manhattan distance from the register $R_i$ to the gate
is equal to $W^{i}_{fo_k}$ on the boundary of the region. As long as the register is placed
within the created tilted rectangle region, the timing slack on the $k$th fanout path is
satisfied. For each fanout and fanin gate of register $R_i$, a feasible rectangle region is
created. The shaded overlapping region of all the tilted rectangle regions of the fanin
and fanout gates of register $R_i$ is defined as the feasible moving region of register $R_i$
as shown in Figure 3.2(b). Note that the moving region construction is only
approximate: when one register is moved, the slack of every other register which has
a data-path to the moved register changes. In other words, the feasible region of
movement generated at this stage for each register $R_i$ is valid only when the rest of the
registers are unmoved. Thus, in this stage, only the so called feasible moving region
Figure 3.3: The feasible moving regions and mesh tracks.
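The construction in Section 3.2.1 can be illustrated with a short sketch. It is not the implementation used in this work; it only assumes that each fanin or fanout gate contributes a Manhattan ball (the tilted rectangle) and relies on the identity |dx| + |dy| = max(|dx + dy|, |dx - dy|), so that in the rotated coordinates u = x + y and v = x - y every tilted rectangle becomes an axis-aligned square and the overlap becomes a simple interval intersection. The names `Diamond`, `feasible_moving_region` and `contains` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Diamond:
    """Manhattan ball |x - cx| + |y - cy| <= radius around one gate (assumed model)."""
    cx: float
    cy: float
    radius: float  # maximum allowed wirelength to this gate


Region = Tuple[float, float, float, float]  # (u_min, u_max, v_min, v_max)


def feasible_moving_region(diamonds: Iterable[Diamond]) -> Optional[Region]:
    """Intersect all diamonds in rotated coordinates u = x + y, v = x - y.

    Returns None when the tilted rectangles have no common point.
    """
    u_min = v_min = float("-inf")
    u_max = v_max = float("inf")
    for d in diamonds:
        u_c, v_c = d.cx + d.cy, d.cx - d.cy
        u_min, u_max = max(u_min, u_c - d.radius), min(u_max, u_c + d.radius)
        v_min, v_max = max(v_min, v_c - d.radius), min(v_max, v_c + d.radius)
    if u_min > u_max or v_min > v_max:
        return None
    return (u_min, u_max, v_min, v_max)


def contains(region: Region, x: float, y: float) -> bool:
    """Check whether a candidate register location lies inside the region."""
    u, v = x + y, x - y
    return region[0] <= u <= region[1] and region[2] <= v <= region[3]


# Two gates at (0, 0) and (4, 0), each allowing a wirelength of 4.
fmr = feasible_moving_region([Diamond(0, 0, 4), Diamond(4, 0, 4)])
print(fmr, contains(fmr, 2, 0))
```

Working in the rotated frame keeps the region a rectangle after every intersection, which is convenient when the region is later tested for overlap with horizontal and vertical mesh tracks.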
3.2.2 Mesh Wire Generation and Placement
The clock mesh network consists of the mesh network with grids and the stub wires
connecting all the registers and the top level clock tree. In the proposed method, the
mesh grid and the stub wires are generated after building the feasible moving regions
of each register Ri. The objective of the mesh wire generation is to generate a sparse
grid which guarantees a low clock skew. The proposed method allows a non-uniform
clock mesh in order to reduce the stub wirelength as well as the number of grids.
On the chip area, the mesh tracks are created, where each mesh track represents
a possible placement of the clock mesh grids as shown in Figure 3.3. The mesh
tracks Mhi and Mvi represent the possible horizontal and vertical mesh locations,
respectively. The mesh tracks are defined at the floorplanning stage through the
uniformly distributed power rails. In this work, the mesh generation and placement
problem is formulated as a weighted set cover problem. The weighted set cover
problem is defined as [20]:
Given a universe U and a family of subsets Si where each subset Si has
a positive weight, find the series of sets whose union is U and the total
weights of these sets are minimized.
In the mesh wire generation problem, each horizontal and vertical mesh track $M_{hi}$
and $M_{vi}$ is considered as a set. The registers are the elements and the universe of
the problem is all the registers. A register $R_i$ is included in a set $M_{hi}$ ($M_{vi}$) if the
feasible moving region $FR_i$ of the register overlaps with the mesh track $M_{hi}$ ($M_{vi}$).
It is assumed that if the feasible moving region $FR_i$ of a register overlaps with
a mesh track, the register can be moved close to the mesh track. The objective is to
find the minimum number of mesh wires such that all the registers can be moved
close to these mesh wires. The problem is equivalent to finding the minimum weight
set cover for the register universe given the subsets $M_{hi}$ and $M_{vi}$.





where $WT_{M_{hi}R_k}$ is the weight of each register. The weight of each register when
connecting to $M_{hi}$ ($M_{vi}$) is defined as:
$$WT_{M_{hi}R_k} = \mathrm{dist}(R_k, M_{hi}). \qquad (3.4)$$
Equation (3.4) states that the weight of the register $R_k$ in set $M_{hi}$ ($M_{vi}$) is given
by the distance of the register $R_k$ to the mesh track. As such, the solution favors
less incremental movement of the registers. Solving the set cover problem identifies
the mesh tracks which lead to the minimum cost (total weight) in incremental register
placement.
The minimum weight set cover problem is a well-known NP-hard problem [20]. In
this work, a simple yet effective greedy approximation algorithm from [20] is applied.
The algorithm greedily adds the set (horizontal or vertical mesh wire) with the minimum
amortized cost into the solution at each iteration until the sets in the solution
include all the registers. The amortized cost is defined as the cost of the set $|M_{hi}|$ ($|M_{vi}|$)
divided by the number of new elements added when the set $M_{hi}$ ($M_{vi}$) is chosen. The
sets ($M_{hi}$ and $M_{vi}$) in the solution are the mesh wires that are generated and placed
for the non-uniform clock mesh network.
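The greedy selection described above can be sketched as follows. This is a generic greedy weighted set cover, not the code used in this work: the function name, the track names and the cost values are illustrative assumptions, and the cost of each track is simply passed in (for instance, the sum of the register weights of Equation (3.4)).

```python
from typing import Dict, List, Set


def greedy_weighted_set_cover(
    universe: Set[str],
    subsets: Dict[str, Set[str]],
    cost: Dict[str, float],
) -> List[str]:
    """Pick mesh tracks until every register is covered.

    At each iteration the track with the minimum amortized cost
    (cost divided by the number of newly covered registers) is added.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for name, members in subsets.items():
            gain = len(members & uncovered)
            if gain == 0:
                continue  # this track covers nothing new
            ratio = cost[name] / gain
            if ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("some registers are not reachable from any mesh track")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical example: five registers, four candidate mesh tracks.
tracks = {
    "Mh1": {"R1", "R2", "R3"},
    "Mv1": {"R3", "R4"},
    "Mv2": {"R4", "R5"},
    "Mh2": {"R1", "R2", "R3", "R4", "R5"},
}
track_cost = {"Mh1": 3.0, "Mv1": 1.0, "Mv2": 2.0, "Mh2": 10.0}
print(greedy_weighted_set_cover({"R1", "R2", "R3", "R4", "R5"}, tracks, track_cost))
```

In this example the expensive track `Mh2` covers every register on its own but is never selected, because three cheaper tracks reach the same coverage at a lower amortized cost.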
3.2.3 Timing Slack Aware Incremental Register Placement Avoiding Overlapping
Although the feasible moving regions for the registers are generated as explained in
Section 3.2.1, aggressively moving the registers onto a mesh segment inside the moving
region does not always guarantee the timing slack requirement. This is because
moving one register may negatively affect the timing slack, and thus the feasible
moving region, of the other registers which have a timing path to it. Moreover, moving
the registers incrementally close to the mesh may introduce overlapping. The
incremental register placement is therefore formulated as a linear program
and solved optimally without timing violation or register overlapping, considering
routing congestion and power density.
Objective
Each register is moved towards the mesh wire assigned in the set cover solution.
If more than one mesh wire overlaps with one register (i.e., with its feasible moving
region), the register is assigned to the closest mesh wire in order to reduce the moving
distance of the incremental placement. The stub wirelength $w^{i}_{stub}$ of each register $R_i$
is the minimum distance from the location of the register $R_i$ to the assigned mesh wire:
$$w^{i}_{stub} = \begin{cases} |x_{R_i} - x_{M_{vj}}|, & \text{if } R_i \text{ connects to } M_{vj}, \\ |y_{R_i} - y_{M_{hj}}|, & \text{if } R_i \text{ connects to } M_{hj}, \end{cases} \qquad (3.5)$$






During this incremental register placement process, the setup timing slack (3.2)





are functions of the wire capacitance
and the capacitive load. The wire capacitance and the capacitive load of the gate
depends on the wire length. In this work, the delay change on the wire and the gate


























G are the slopes
of the wire delay versus wire capacitance curve, register delay versus capacitive load
curve and the fanin gate delay of register Rf versus the capacitive load, respectively.
The parameters $D^{i}_{R0}$ and $D^{f}_{G0}$ are the clock-to-output delay and the gate delay when
the capacitive load is zero (0). The wirelengths $w^{i}_{fo_k}$ and $w^{f}_{fi}$ can be estimated using
the distances of the fanout and fanin gate to the register:
$$w^{i}_{fo_k} = |x_{R_i} - x^{i}_{fo_k}| + |y_{R_i} - y^{i}_{fo_k}|, \qquad (3.10)$$
$$w^{f}_{fi} = |x_{R_f} - x^{f}_{fi}| + |y_{R_f} - y^{f}_{fi}|, \qquad (3.11)$$
where $x_{R_i}$ and $y_{R_i}$, $x^{i}_{fo_k}$ and $y^{i}_{fo_k}$, and $x^{f}_{fi}$ and $y^{f}_{fi}$ are the x and y coordinates of the
register $R_i$, of the $k$th fanout gate of register $R_i$, and of the fanin gate of the
register $R_f$, respectively. In design automation, the delay
is typically modeled quadratically, e.g. Elmore delay [25]. Alternative higher order
delay modeling can be performed for accuracy. However, the proposed design method
is based on linear programming; thus, the linear delay model is selected to generate
linear constraints. The selected linear approximation guarantees that the estimated
(linear) delay is always higher than that of the higher order models, which is conservative
but guarantees timing slack. The difference in overestimation is available as a positive
timing slack after incremental placement, which is favorable for a practical operation.
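To make the interaction between Equations (3.2) and (3.10) concrete, the following sketch evaluates the slack of one path before and after a register move. It is only an illustration under stated assumptions: the linear model used here (clock-to-output delay `d_r0 + k_r * C`, fanout wire delay `k_w * C`, wire capacitance `cap_per_um * wirelength`) is one plausible form of a linear delay model, and all parameter names and numeric values are hypothetical rather than taken from this work.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) points, as in Equation (3.10)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def slack_after_move(period, setup, d_mid, d_fanin, reg_xy, fanout_xy,
                     d_r0, k_r, k_w, cap_per_um):
    """Slack of one register-to-register path, following Equation (3.2).

    Assumed linear model (hypothetical parameters):
      wire capacitance    C    = cap_per_um * wirelength
      clock-to-output     D_CQ = d_r0 + k_r * C
      fanout wire delay   D_fo = k_w * C
    d_mid and d_fanin are the parts of the path the move does not change.
    """
    c_wire = cap_per_um * manhattan(reg_xy, fanout_xy)
    d_cq = d_r0 + k_r * c_wire
    d_fo = k_w * c_wire
    return period - setup - d_cq - d_mid - d_fanin - d_fo


common = dict(period=1000.0, setup=50.0, d_mid=600.0, d_fanin=40.0,
              fanout_xy=(30.0, 20.0), d_r0=80.0, k_r=1.5, k_w=2.0,
              cap_per_um=0.2)
before = slack_after_move(reg_xy=(0.0, 0.0), **common)
after = slack_after_move(reg_xy=(-30.0, -20.0), **common)  # moved away from the gate
print(before, after)
```

Because every term is linear in the wirelength, a lower bound on the slack translates directly into an upper bound on the wirelength, which is what allows the timing requirement to be written as a linear constraint.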
Non-overlapping constraints
Simultaneous with these requirements in timing, the physical requirement of
preventing the overlapping of registers is considered. As shown in Figure 3.4, let the
length and width of a register be $L_r$ and $W_r$, respectively. One of the following four
constraints must be satisfied to avoid overlapping between each pair of registers $R_i$ and $R_j$:
$$x_{R_i} - x_{R_j} \ge W_r, \qquad (3.12)$$
$$x_{R_j} - x_{R_i} \ge W_r, \qquad (3.13)$$
$$y_{R_i} - y_{R_j} \ge L_r, \qquad (3.14)$$
$$y_{R_j} - y_{R_i} \ge L_r. \qquad (3.15)$$
Figure 3.4: The overlap illustration.
These constraints prevent the horizontal and vertical overlapping of register
placement based on the register length and width. The constraints (3.12) and (3.13) are
mutually exclusive, as are the constraints (3.14) and (3.15). In order to keep the
problem a linear program, only one of the four constraints is placed in the LP
formulation for each pair of registers $R_i$ and $R_j$. To this end, the following flow
is proposed to generate and reduce the overlapping avoidance constraints to one:
(i) Construct the stub wire minimization problem formulation without the overlap-
ping avoidance constraints, where only the timing constraints are considered.
Solve the formulation to obtain the incremental register placement results.
(ii) In the incremental register placement result, if two registers Ri and Rj are non-
overlapping, add the constraint which has the maximum left hand side value
among Equations (3.12–3.15) to the overlapping avoidance formulation.
(iii) In the incremental register placement result, if two registers Ri and Rj are
overlapping, add the constraint which has the maximum left hand side value
for the original locations of the registers Ri and Rj in Equations (3.12 – 3.15)
to the overlapping avoidance formulation.
(iv) If two registers are placed on two different grids which do not have any intersec-
tion (e.g. two parallel horizontal or vertical grids), the overlapping avoidance
constraints between the two registers can be eliminated.
In brief, the LP is first solved without considering the physical overlapping avoidance
constraints (step (i)). If the registers are placed on non-intersecting grids, the
overlapping avoidance constraint is unnecessary as the registers will never overlap (step (iv)).
Otherwise, the most conservative constraint is added to the LP, which is solved again for the
optimal placement without physical overlapping of registers (steps (ii) and (iii)).
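Steps (ii) and (iii) of the flow above can be sketched in a few lines. The sketch assumes that register positions are given as the centers of equally sized registers and that two registers overlap when they are closer than $W_r$ horizontally and closer than $L_r$ vertically; the function names and the string labels returned are hypothetical.

```python
def overlaps(p, q, width, length):
    """True if two equally sized registers centered at p and q overlap (assumed test)."""
    return abs(p[0] - q[0]) < width and abs(p[1] - q[1]) < length


def pick_constraint(orig_i, orig_j, lp_i, lp_j, width, length):
    """Return the label of the single constraint kept for the pair (R_i, R_j).

    Step (ii): if the LP result does not overlap, use the LP positions.
    Step (iii): otherwise fall back to the original positions.
    In both cases the constraint with the largest left hand side is chosen.
    """
    if overlaps(lp_i, lp_j, width, length):
        (xi, yi), (xj, yj) = orig_i, orig_j
    else:
        (xi, yi), (xj, yj) = lp_i, lp_j
    lhs = {
        "3.12": xi - xj,
        "3.13": xj - xi,
        "3.14": yi - yj,
        "3.15": yj - yi,
    }
    return max(lhs, key=lhs.get)


# The LP result overlaps, so the original positions decide: R_j lies to the right.
print(pick_constraint((0, 0), (10, 1), (0, 0), (1, 0.5), width=2, length=3))
```

Choosing the constraint with the largest left hand side keeps the relative order the two registers already have, so the added constraint is the one least likely to force a large additional movement.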
Power density constraints
The power density constraints are optional and generated in order to guarantee
the power density in local areas. If two or more registers with large power dissipation
are close to each other, there might be a local area with high power density. In order
to guarantee the registers with large power dissipation are not close to each other,
additional distance constraints are generated.
In modeling the power density, the power dissipation of each register is taken
from the IC Compiler report (alternative methods can be used without loss
of generality). The registers are sorted in descending order of power dissipation. For
the registers with power dissipation greater than a threshold, the distance of the
register from other registers is constrained to be larger than a user specified margin.
For instance, if the power dissipation of register $R_i$ is greater than the threshold, the
constraints in Equations (3.12)-(3.15) are re-written as:
$$x_{R_i} - x_{R_j} \ge 2 \times W_r, \qquad (3.16)$$
$$x_{R_j} - x_{R_i} \ge 2 \times W_r, \qquad (3.17)$$
$$y_{R_i} - y_{R_j} \ge 2 \times L_r, \qquad (3.18)$$
$$y_{R_j} - y_{R_i} \ge 2 \times L_r, \qquad (3.19)$$
where the registers are not permitted to be closer than $2 \times W_r$ or $2 \times L_r$ for horizontal
or vertical direction constraints, respectively.
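As a small illustration of how the separation factor could be chosen per register pair, the sketch below marks the registers whose power exceeds a fraction of the maximum and returns the multiplier applied to $W_r$ and $L_r$. The 70% threshold is the value used later in Section 3.3.2 and the factor 2 follows Equations (3.16)-(3.19); the function names and the rule of applying the margin when either register of the pair is above the threshold are assumptions of this sketch.

```python
def high_power_registers(power, threshold_ratio=0.7):
    """Registers whose power exceeds threshold_ratio times the maximum power."""
    threshold = threshold_ratio * max(power.values())
    return {name for name, p in power.items() if p > threshold}


def separation_factor(ri, rj, hot, margin=2.0):
    """Multiplier k applied to W_r and L_r for the pair (ri, rj).

    Assumption: the enlarged margin applies when either register is hot.
    """
    return margin if ri in hot or rj in hot else 1.0


# Hypothetical per-register power values (arbitrary units).
power = {"R1": 10.0, "R2": 6.0, "R3": 8.0, "R4": 2.0}
hot = high_power_registers(power)
print(sorted(hot), separation_factor("R1", "R2", hot), separation_factor("R2", "R4", hot))
```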
LP formulation
The linear programming formulation is presented in Table 3.1. The objective of
the formulation is to minimize the total stub wirelength connecting the registers to
the mesh wires by incrementally moving the registers. The timing constraints and the
overlapping constraints are generated. Note that $\mathrm{xdist}(a, b)$ and $\mathrm{ydist}(a, b)$ represent
the distance between nodes $a$ and $b$ in the horizontal direction and vertical direction,
respectively. The constraints on $\mathrm{xdist}(a, b)$ and $\mathrm{ydist}(a, b)$ are used to linearize
the distance constraints. For each register, a constraint is generated to restrain its x
value (if the register is assigned to a horizontal mesh segment) or y value (if the register
is assigned to a vertical mesh segment) for congestion. For each register pair, at most
one constraint among the last four constraints of power density limitation appears in
the LP formulation. By solving the formulation, the optimal locations $(\hat{x}_{R_i}, \hat{y}_{R_i})$ of
each register $R_i$ and the corresponding total stub wirelength are obtained.
Table 3.1: The linear programming formulation for incremental register placement.
































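$w^{i}_{stub} = \mathrm{xdist}(R_i, M_{vj})$ (or $\mathrm{ydist}(R_i, M_{hj})$), $\forall R_i$
$\mathrm{xdist}(R_i, M_{vj}) \ge x_{R_i} - x_{M_{vj}}$, $\forall R_i$
$\mathrm{xdist}(R_i, M_{vj}) \ge x_{M_{vj}} - x_{R_i}$, $\forall R_i$
$\mathrm{ydist}(R_i, M_{hj}) \ge y_{R_i} - y_{M_{hj}}$, $\forall R_i$
$\mathrm{ydist}(R_i, M_{hj}) \ge y_{M_{hj}} - y_{R_i}$, $\forall R_i$
$w^{i}_{fo_k} = \mathrm{xdist}(R_i, fo_k) + \mathrm{ydist}(R_i, fo_k)$, $\forall R_i$
$\mathrm{xdist}(R_i, fo_k) \ge x_{R_i} - x_{fo_k}$, $\forall R_i$
$\mathrm{xdist}(R_i, fo_k) \ge x_{fo_k} - x_{R_i}$, $\forall R_i$
$\mathrm{ydist}(R_i, fo_k) \ge y_{R_i} - y_{fo_k}$, $\forall R_i$
$\mathrm{ydist}(R_i, fo_k) \ge y_{fo_k} - y_{R_i}$, $\forall R_i$
$w^{f}_{fi} = \mathrm{xdist}(R_f, fi) + \mathrm{ydist}(R_f, fi)$, $\forall R_f$
$\mathrm{xdist}(R_f, fi) \ge x_{R_f} - x_{fi}$, $\forall R_f$
$\mathrm{xdist}(R_f, fi) \ge x_{fi} - x_{R_f}$, $\forall R_f$
$\mathrm{ydist}(R_f, fi) \ge y_{R_f} - y_{fi}$, $\forall R_f$
$\mathrm{ydist}(R_f, fi) \ge y_{fi} - y_{R_f}$, $\forall R_f$
$x_{R_i} - x_{R_j} \ge k \times W_r$,
or $x_{R_j} - x_{R_i} \ge k \times W_r$,
or $y_{R_i} - y_{R_j} \ge k \times L_r$,
or $y_{R_j} - y_{R_i} \ge k \times L_r$.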
The overlapping between the registers is considered in the LP formulation. In the
implementation presented in this manuscript, the overlapping between the registers
and the logic gates due to the incremental placement is resolved by placement legal-
ization using IC Compiler. However, the overlapping between registers and gates can
be avoided by allowing the registers to be placed on white space only.
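The way the paired inequalities on xdist and ydist stand in for an absolute value can be seen from a short worked derivation. It is a standard linearization argument, given here for the stub wire of a register assigned to a vertical mesh wire; the auxiliary variable $d$ and the numeric values are illustrative only.

```latex
\begin{align*}
d \ge x_{R_i} - x_{M_{vj}} \ \text{ and } \ d \ge x_{M_{vj}} - x_{R_i}
  &\iff d \ge |x_{R_i} - x_{M_{vj}}| , \\
\min d \ \text{ subject to both inequalities}
  &\implies d^{*} = |x_{R_i} - x_{M_{vj}}| .
\end{align*}
% Numeric check with x_{R_i} = 12 and x_{M_{vj}} = 15:
% the inequalities read d >= -3 and d >= 3,
% so the smallest feasible value is d = 3 = |12 - 15|.
```

Since the objective minimizes the stub wirelength, the auxiliary variable is pushed down onto the absolute value at the optimum. For the fanout and fanin wirelengths the same pair of inequalities only gives a lower bound on the distance variables, which is sufficient whenever the wirelength is bounded from above: such a bound can be met exactly when the true Manhattan distance satisfies it.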
3.2.4 Top Level Tree Generation
The mesh driver insertion process of the proposed method adopts a set cover
solution similar to that in [81]. The set cover formulation yields a clock driver placement
and sizing solution in order to drive the registers on the mesh for a given global clock skew
requirement. The buffered clock tree is generated using the method in [15] where the
mesh drivers are considered as the sinks of the clock tree.
3.2.5 Discussions on Routing Congestion
Besides power density, another potential trade-off due to the incremental register
placement method is the introduced routing congestion. Two different assignment
methods are applied in an attempt to reduce the routing congestion:
(i) The registers are assigned to mesh segments considering the balanced fanin and
fanout pins on each mesh segment.
(ii) The registers are assigned to mesh segments considering only the distance from
the mesh segment.
In the first assignment method, the numbers of fanin and fanout pins are used as direct
indicators of the routing demand in one area. In the second assignment method, the
incremental movement of registers is considered to be detrimental to the routing
congestion established by the initial placement procedure. As such, excessive movement
is prevented by constraining the incremental register placement with factors such as
timing slack and power density.
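The second assignment method, together with the stub wirelength of Equation (3.5), can be illustrated with a minimal sketch. Mesh wires are represented here as hypothetical `(orientation, coordinate)` pairs, where `"v"` denotes a vertical wire at a given x coordinate and `"h"` a horizontal wire at a given y coordinate; this representation and the function names are assumptions made for the example.

```python
def stub_wirelength(reg_xy, wire):
    """Stub wirelength of Equation (3.5) for one register and one mesh wire."""
    x, y = reg_xy
    orientation, coord = wire
    return abs(x - coord) if orientation == "v" else abs(y - coord)


def assign_to_closest_wire(reg_xy, candidate_wires):
    """Assignment method (ii): pick the candidate wire with the shortest stub."""
    return min(candidate_wires, key=lambda w: stub_wirelength(reg_xy, w))


# A register at (10, 4) whose feasible moving region overlaps three mesh wires.
wires = [("v", 7.0), ("h", 5.0), ("v", 20.0)]
best = assign_to_closest_wire((10.0, 4.0), wires)
print(best, stub_wirelength((10.0, 4.0), best))
```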
3.3 Experimental Results
The proposed algorithm flow is implemented in C++. The top level clock tree is
generated using a buffered DME algorithm [15, 18] to drive the mesh grid. The clock
Table 3.2: Experimental Set 1: Wirelength comparison using the optimized mesh size
in [29].
Circuit  [29] (Set 1)                 Proposed method              Improvement
         Grid   Stub   Mesh   Total   Grid   Stub   Mesh   Total   Stub   Total
                (µm)   (µm)   (µm)           (µm)   (µm)   (µm)
s13207   8*8    3281   4848   8129    6*7    389    3938   4327    88.1%  46.8%
s15850   8*8    2226   4062   6288    5*4    178    2285   2463    91.9%  60.8%
s35932   12*12  10112  10871  20983   11*7   985    8157   9142    90.3%  56.4%
s38417   12*12  8839   10794  19633   10*9   1252   8546   9798    85.8%  50.1%
s38584   11*11  8533   12668  21201   12*7   674    10941  11615   92.1%  45.2%
Avg.                                                               89.6%  51.9%
Table 3.3: Experimental Set 2: Wirelength comparison using the proposed optimized
mesh size.
Circuit  [29] (Set 2)                 Proposed method              Improvement
         Grid   Stub   Mesh   Total   Grid   Stub   Mesh   Total   Stub   Total
                (µm)   (µm)   (µm)           (µm)   (µm)   (µm)
s13207   6*7    4012   3938   7950    6*7    389    3938   4327    90.3%  45.6%
s15850   5*4    3683   2285   5968    5*4    178    2285   2463    95.2%  58.7%
s35932   11*7   12867  8157   21024   11*7   985    8157   9142    92.3%  56.5%
s38417   10*9   11440  8546   19986   10*9   1252   8546   9798    89.1%  50.9%
s38584   12*7   9173   10941  20114   12*7   674    10941  11615   92.6%  42.3%
Avg.                                                               91.9%  50.8%
mesh networks with a top level tree are translated into the ISPD’10 clock network
contest format [74] and simulated using Ngspice with a 45nm PTM model card. IC
Compiler of Synopsys is used to perform the initial placement, timing slack analysis,
routing and power estimation. The LP formulations are solved by the online solvers
Feaspump and SCIP from [21]. Since the benchmark circuits provided by the ISPD’10
clock network contest do not have any logic gate information, the benchmark circuits
used in the experiments are the five largest circuits from the ISCAS’89 benchmark suite.
As reference, note that the register count of the largest ISCAS’89 circuit is on the
same order as that of the ISPD’10 contest benchmark (1728 vs. 2249). The runtime of the
Table 3.4: Total capacitance and clock skew of the synthesized mesh network.
Circuit  Set 1             Set 2             Proposed method   Imprv. (Set 1)    Imprv. (Set 2)
         Skew    Cap       Skew    Cap       Skew    Cap       Skew    Cap       Skew    Cap
         (ps)    (fF)      (ps)    (fF)      (ps)    (fF)      (ps)              (ps)
s13207   0.7     5656      1.2     4138      0.3     3127      -0.4    44.7%     -0.9    24.4%
s15850   0.8     4935      1.2     2748      0.8     1837      0.0     62.8%     -0.4    33.1%
s35932   0.7     13169     2.1     9975      1.8     7112      1.1     46.0%     -0.3    28.7%
s38417   1.4     12791     1.2     9803      0.9     7280      -0.5    43.1%     -0.3    25.7%
s38584   1.1     13531     2.9     10417     0.8     7428      -0.3    45.1%     -2.1    28.7%
Average                                                        0.0     48.3%     -0.8    28.1%
proposed method on all the benchmark circuits is less than 7 minutes on a standard
2.2GHz Linux machine using the online solver [21].
3.3.1 Improvements and Trade-offs
The proposed mesh generation and incremental register placement methods are
compared against the non-uniform mesh placement method proposed in [29]. The
iterative k-means method in [29] is implemented in C++. Since the mesh size selection
and mesh wire placement are integrated into the proposed set cover solution, the mesh
network generated by the proposed method using the mesh size in [29] is not available.
Thus, two sets of comparisons are performed:
Set 1 The mesh network generated by the proposed method is compared to the mesh
network generated by the method in [29], using the optimized mesh size in [29].
Set 2 The mesh network generated by the proposed method is compared to the mesh
network generated by the method in [29], using the mesh size optimized by the
proposed method.
These comparisons are performed in order to demonstrate the wire reduction effects
of the proposed method.
The results of the first set of comparisons are summarized in
Table 3.2. For the same circuits, the proposed method typically generates a sparser
mesh network. The stub wirelength is reduced by 89.6% and the mesh wirelength is
reduced by 24.4%, on average. The total wire reduction on the mesh network is 51.9%
on average.
The results of the second set of comparisons are summarized
in Table 3.3. The stub wirelength reduction relative to the method in [29] is more
significant, at 91.9% on average. This is because the method proposed in this chapter
performs incremental register placement towards the mesh grids, whereas the method in [29]
keeps the placement intact. Since both methods use the same mesh grid sizes in this
set, there is no mesh wirelength reduction. The total wire reduction is 50.8% on
average.
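The percentages quoted above follow directly from the wirelengths in Table 3.2. The short script below recomputes a few of them as a worked check; the dictionary simply transcribes the stub and mesh columns of Table 3.2, and the names `SET1` and `reduction` are chosen for this example only.

```python
# circuit: (stub_ref, mesh_ref, stub_new, mesh_new) in µm, transcribed from Table 3.2
SET1 = {
    "s13207": (3281, 4848, 389, 3938),
    "s15850": (2226, 4062, 178, 2285),
    "s35932": (10112, 10871, 985, 8157),
    "s38417": (8839, 10794, 1252, 8546),
    "s38584": (8533, 12668, 674, 10941),
}


def reduction(ref, new):
    """Relative reduction in percent."""
    return 100.0 * (ref - new) / ref


mesh = [reduction(m_ref, m_new) for _, m_ref, _, m_new in SET1.values()]
total = [reduction(s_ref + m_ref, s_new + m_new)
         for s_ref, m_ref, s_new, m_new in SET1.values()]

# Stub and total reduction for s13207, then the averages over the five circuits.
print(round(reduction(3281, 389), 1), round(total[0], 1))
print(round(sum(mesh) / len(mesh), 1), round(sum(total) / len(total), 1))
```

The mesh wirelength average is not listed in Table 3.2 itself; it is obtained here by averaging the per-circuit reductions of the Mesh columns.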
The power and global clock skew are compared between the proposed method and
the method in [29], and the results are summarized in Table 3.4. The power reduction
is presented as the reduction in total switching capacitance, including the wire capacitance
and the buffer capacitance on the clock mesh and the top level clock tree. Compared to the mesh
generated using the method in [29] with the mesh size in [29] (Set 1), a 48.3% reduction
in the total capacitance is observed. This reduction is achieved through the wire
reduction and the smaller top level clock tree (fewer buffer drivers and thus a smaller
clock tree). The clock skew on the proposed mesh network is similar (0ps change on
average, ranging from -0.5ps to +1.1ps) to the previous work in [29]. Compared to the
mesh generated using the method in [29] but with the mesh size optimized by the
proposed method (Set 2), the power reduction is 28.1% on average. The clock
skew is reduced by 0.8ps on average using the proposed method due to the reduced
stub wirelength.
The trade-offs of the proposed method in logic wire routing and timing slack
are analyzed using IC Compiler. The trade-offs are summarized in
Table 3.5. It is observed that the timing slack is reduced by 22ps (7.3%) on average
using the proposed method, which is very limited compared to the original timing
slack before applying the proposed clock mesh synthesis. A non-negative timing slack
is guaranteed in the proposed formulation. The logic wire routing is only increased
by 5.9% on average due to the incremental placement of the registers, which is very
limited compared to the power saving on the clock network.
Table 3.5: Trade-off analysis.
Ckt.     Slack information                               Logic wire
         Pre-syn   Post-syn   Slack decr.   Decr.        incr.
         (ps)      (ps)       (ps)          (%)
s13207   297       272        25            8.4%         7.7%
s15850   213       180        33            15.5%        8.3%
s35932   277       265        12            4.3%         3.6%
s38417   647       612        35            5.4%         9.0%
s38584   113       110        3             2.7%         0.7%
Average                       22            7.3%         5.9%
The placement results for the circuit ISCAS’89 s35932 before and after the proposed
clock mesh network synthesis are illustrated in Figure 3.5(a) and Figure 3.5(b),
respectively. The mesh grid calculated by the proposed method is 7 × 11. It
is visually observed that the registers (highlighted as dark blue boxes; the light green
boxes represent the logic gates) are placed on the wires of a 7 × 11 grid.
(a) Before incremental register placement. (b) After incremental register placement.
Figure 3.5: The incremental register placement illustration for ISCAS’89 s35932.
3.3.2 Power Density Optimization and Routing Congestion
The experiments presented in Section 3.3.1 are performed with the
single objective of power minimization through clock mesh network wirelength
minimization. As discussed in Section 3.2.3 and Section 3.2.5, however, the proposed
methods potentially affect the power density and routing congestion due to the
incremental placement of the registers. In this section, the power density optimization
constraints are considered. In this experiment, the threshold for adding the power
density constraints is set to 70% of the maximum power dissipation over all the
registers.
In Table 3.6, the worst case power densities before the incremental placement,
and after the incremental placement without and with the power density
optimization constraints, are summarized in the columns Org., no PD and PD aware,
respectively. It is observed that the worst case power density after the incremental
placement is improved by 26.7% on average when the power density constraints are
applied (PD aware versus no PD). It is important to note that on the two benchmark
circuits s38417 and s38584, the worst case power density is much larger than on the
smaller circuits due to the level shifter cells. In
these cases, the worst case power density is caused by these level shifter cells, and not by
the registers that are incrementally placed. Thus the incremental register placement
does not affect the worst case power density in these cases.
The second assignment method discussed in Section 3.2.5 is adopted in the
experiments as it provides much better routing congestion results. The routing congestion
results are summarized in Table 3.6. The routing congestion is measured by the
routing overflow reported by IC Compiler. In IC Compiler, the chip area is divided into small
routing cells and each routing cell has a certain number of routing tracks. In the
experimental results, the total routing overflow, which is the sum of the overflow at all
the routing cells, is reported. It is shown that the routing overflow is not affected by
the incremental placement. This is because the register movement is kept minimal by
moving the registers to the closest mesh segments. On the four circuits s13207, s35932,
s38417 and s38584, less than 0.47% of all the routing cells have routing overflow, with
an overflow of one per routing cell on average, which is a very limited number. On circuit s15850,
the original routing overflow is already high. However, the register placement does
not increase the routing overflow, which indicates that the design planning on
circuit s15850 is the main reason for the high routing congestion.
3.4 Conclusions
A clock mesh network synthesis flow is proposed in this chapter. In this flow, the
power density and timing slack aware incremental register placement with the non-
uniform clock mesh generation methods are proposed. The clock mesh synthesis flow
is demonstrated to have significantly (48%) less total switching capacitance on the
clock distribution network than the previous work. The method is able to reduce the
Table 3.6: Power density and routing congestion.
Ckt.     Power density                                 Routing congestion
         Org.       no PD      PD aware   Impro.       Org.    no PD   PD aware
         (µW/µm2)   (µW/µm2)   (µW/µm2)
s13207   9.7        16.5       10.5       36.5%        59      71      57
s15850   8.4        13.6       10.3       24.0%        1253    1049    1256
s35932   14.5       19.1       15.4       19.5%        63      72      68
s38417   120.4      -          -          -            115     100     74
s38584   277.0      -          -          -            243     234     259
Average                                   26.7%
power dissipation without skew degradation. Moreover, it is demonstrated that no
additional routing congestion is introduced as the incremental register placement is minimal.
The experimental results show that the methodology flow is effective and can be easily
integrated into existing industrial physical design flows.
4. Clock Mesh Synthesis: Gated Local Trees and Activity Driven
Register Clustering
In Chapter 3, the clock mesh synthesis is combined with incremental register
placement to reduce the stub wirelength. Other power saving techniques including
Steiner tree based local tree synthesis, clock gating and register clustering are not considered.
In this chapter, a clock mesh network synthesis method is proposed which enables
clock gating on the local sub-trees in order to reduce the clock power dissipation.
Clock gating is performed with a register clustering strategy that considers both
i) the similarity of switching activities between registers in a local area and ii) the
timing slack on every local data path of the design area. This is the first work known
in literature that encapsulates the efficient implementation of the gated local trees
and activity driven register clustering with timing slack awareness for clock mesh
synthesis. Experimental results show that with gated local tree and activity driven
register clustering, the switching capacitance on the mesh network can be reduced by
22% with limited skew degradation. The proposed method has two synthesis modes,
a low power mode and a high performance mode, to serve different design purposes.
This chapter is organized as follows. Section 4.1 gives a brief introduction to
the proposed method. Section 4.2 introduces the proposed methodologies for
clock mesh synthesis. Section 4.3 summarizes the experimental results.
Section 4.4 concludes the chapter.
4.1 Introduction
A clock mesh network, by design, has a very low global clock skew (variation). As
such, the clock mesh network is popular in high-end microprocessors and consequently,
many previous design automation methods are developed in the area of clock mesh
synthesis and optimization [2, 17, 29, 46, 61, 67, 81]. As explained in Chapter 3, the
methods proposed in these previous works aim to reduce the power dissipation given
a practical skew requirement. For instance, the methods in [2, 81] and the methods
in [29, 67] aim to reduce the mesh grid wires and stub wires, respectively, whereas the
methods in [17, 61] and Chapter 3 aim to reduce the sum of the mesh grid wires and
stub wires.
Although optimizing for power dissipation, none of the previous works has considered
the commonly used power saving techniques for clock tree networks, such as clock
gating [26] and register clustering [16], on meshes. In a clock mesh network,
clock gating is potentially applicable only on the local connections between the mesh
grid wires and the sink registers. In the previous works, the stub wires which connect
the grid wires to the sink registers are considered buffer-less, so clock gating is
inapplicable. A significant percentage of the switching capacitance is at the sinks of
the clock network; thus, clock gating on the local trees of a clock mesh is beneficial.
In the proposed method, the sink registers are connected using local steiner trees
and the insertion of integrated clock gating (ICG) cells is considered for power saving
purposes.
In most of the previous works, sink registers are connected to the mesh grid wires
individually. In [67], steiner tree connections are used to connect registers to the
mesh grid wires. In the proposed method, the steiner tree connection is used to
connect registers and thus the clock routing wirelength can be reduced by clustering
registers in a local area. Since inserting an ICG cell occupies chip area, it is desirable
that the number of inserted ICG cells is minimal. The register clustering based on
the switching activity and timing slack information is considered to further reduce



















(b) Clock mesh network with gated local trees.
Figure 4.1: The illustration of clock mesh networks.
requirement. The advantages of the clock mesh network generated by the proposed
method are the following:
(i) The power consumption of the clock mesh network is reduced compared to
previous clock mesh design methods due to the combination of clock gating,
steiner tree connection and the register clustering.
(ii) The non-negative timing slack of the circuit is preserved after the incremen-
tal register placement. The slack decrease tolerance can be specified by the
designer.
(iii) The incremental register placement is performed in local areas only, which pre-
serves the placement optimization in terms of timing and routing.
4.2 Methodologies
The proposed method generates a clock mesh network with gated local trees as
shown in Figure 4.1(b). The local trees connect registers with similar switching
activities together. The proposed method consists of four major steps as shown in
Figure 4.2:
(i) Build the feasible moving regions of each register based on the timing slack of
each local data path on the design.
(ii) Based on the feasible moving regions of each register, cluster the registers with
small distance and similar switching activity together.
(iii) Incrementally move the registers in the same clusters towards each other guar-
anteeing non-negative timing slack.
(iv) Generate the clock network with local trees and perform ICG insertion to save
power dissipation.
















Figure 4.2: The methodology flow.
4.2.1 Generating the Feasible Moving Regions (FMR)
The proposed method incrementally places the registers of the same cluster towards
each other in order to reduce the number of inserted ICG cells. The timing slacks
are considered during the incremental placement (movement) of the registers in order
to guarantee the functional correctness of the design. The feasible moving
region (FMR) of each register is thus defined based on the timing path to guide
the register clustering and incremental placement. The same feasible moving region
construction procedure as described in Section 3.2.1 is adopted in this work.
4.2.2 Activity Driven Register Clustering
After generating the feasible moving regions of the registers, the registers are
clustered based on the distance between the feasible moving regions (FMR) of registers,
the similarity in switching activities and the total switching capacitance after clustering. In
a later stage, the registers in the same cluster are incrementally moved close to each
other to save routing wirelength subject to the timing slack constraints. The registers in
the same cluster are driven by a single ICG cell.
Cluster merging in local areas
Initially, each register on the design area is a cluster by itself. During the cluster
merging step, registers inside one local area with FMRs close to each other and similar
switching activities are merged together. In this method, the local area is defined as
one grid box, that is, only the clusters inside one grid box are allowed to be merged.
In the merging process, two capacitance cost metrics are defined for each cluster
$G_i$: the switching capacitance $c^s_i$ and the un-buffered capacitance $c^u_i$. The switching
capacitance $c^s_i$ of a cluster is the minimum capacitance after making a clock gating









































Figure 4.3: An illustration of the merging region construction.
cell (if any) of the cluster. The switching capacitance $c^s_i$ is equal to the un-buffered
capacitance $c^u_i$ if clock gating does not reduce the total switching capacitance.
At the beginning, the merging regions of the registers (initial clusters) are the
feasible moving regions (FMR) of the registers created in the previous stage. The
merging cost is defined as the minimum total switching capacitance after merging the
two clusters. At each merging step, the switching capacitance and the un-buffered
capacitance are updated for the newly merged cluster. For instance, if two clusters $G_i$

















where $C$ and $d_{ij}$ are the unit wire capacitance and the minimum distance between
the merging regions $MR_i$ and $MR_j$ of clusters $G_i$ and $G_j$, respectively.
The merging regions are physically constructed as shown in Figure 4.3. Assume
$MR_i$ and $MR_j$ are the merging regions of the clusters $G_i$ and $G_j$, respectively.
The two clusters are merged to form a new cluster $G_v$. Without loss of generality,
assume each merging region is a tilted rectangle region. Each rectangle can thus be
represented by its four edges $y = x + k^i_{p0}$, $y = x + k^i_{p1}$, $y = -x + k^i_{n0}$
and $y = -x + k^i_{n1}$, where two lines have the slope +1 and two lines have the slope −1.
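In the two tilted rectangle regions (two merging regions $MR_i$ and $MR_j$), there are
four edges with the same slope +1 and different k-values $k^i_{p0}$, $k^i_{p1}$,
$k^j_{p0}$ and $k^j_{p1}$.
Also, there are four edges with the same slope −1 and different k-values $k^i_{n0}$, $k^i_{n1}$,
$k^j_{n0}$ and $k^j_{n1}$. The region constrained by the two lines that have the k-values in the
middle of $k^i_{p0}$, $k^i_{p1}$, $k^j_{p0}$ and $k^j_{p1}$ for positive slope edges and the two lines that have
the k-values in the middle of $k^i_{n0}$, $k^i_{n1}$, $k^j_{n0}$ and $k^j_{n1}$ for negative slope edges is the
merging region $MR_v$ for the newly merged cluster $G_v$. For instance, in Figure 4.3,
the region constrained by the four lines $y = x + k^i_{p1}$, $y = x + k^j_{p0}$, $y = -x + k^i_{n1}$ and
$y = -x + k^j_{n0}$ is the merging region $MR_v$ for the new cluster $G_v$.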
In the above example, the merging region construction is explained for the case of
merging two non-overlapping tilted rectangle regions. In fact, the above method can be
applied to the merging of any lines, points or tilted rectangle regions with or without
overlaps. Note that a line segment is a merging region with either two positive
edges having the same k-value or two negative edges having the same k-value. A
point is a merging region with two positive edges having the same k-value and two
negative edges having the same k-value. The proposed merging method guarantees
that at any point inside the newly merged region, the sum of the minimum distances
from the point to the merging region $MR_i$ and the merging region $MR_j$ is equal to the
minimum distance $d_{ij}$ between $MR_i$ and $MR_j$, which guarantees the minimum
un-buffered capacitance $c^u_v$ for the cluster $G_v$. The merging region construction procedure
greedily forms new clusters such that the total switching capacitance of each cluster
is minimized.
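The merging region construction can be illustrated with a minimal Python sketch. It assumes that each tilted rectangle is stored as a tuple (kp0, kp1, kn0, kn1) holding the intercepts of its slope +1 and slope −1 edges; the function names merge_regions and min_distance and this tuple layout are hypothetical and are not taken from the implementation of this work. The sketch keeps the two middle k-values in each direction, in agreement with the example of Figure 4.3, and also evaluates the minimum Manhattan distance between two regions.

```python
def merge_regions(mr_i, mr_j):
    """Merge two tilted rectangles given as (kp0, kp1, kn0, kn1).

    kp0/kp1 are the intercepts of the slope +1 edges (y = x + k) and
    kn0/kn1 are the intercepts of the slope -1 edges (y = -x + k).
    """
    kp = sorted([mr_i[0], mr_i[1], mr_j[0], mr_j[1]])
    kn = sorted([mr_i[2], mr_i[3], mr_j[2], mr_j[3]])
    # Keep the two middle k-values in each direction.
    return (kp[1], kp[2], kn[1], kn[2])


def min_distance(mr_i, mr_j):
    """Minimum Manhattan distance between two tilted rectangles."""
    def gap(a0, a1, b0, b1):
        lo_a, hi_a = min(a0, a1), max(a0, a1)
        lo_b, hi_b = min(b0, b1), max(b0, b1)
        return max(0, lo_b - hi_a, lo_a - hi_b)

    # |dx| + |dy| equals max(|d(y - x)|, |d(y + x)|), so the larger gap decides.
    return max(gap(mr_i[0], mr_i[1], mr_j[0], mr_j[1]),
               gap(mr_i[2], mr_i[3], mr_j[2], mr_j[3]))
```

A point or a line segment is handled by the same functions, since it is simply a region whose two k-values coincide in one or both directions.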
Register cluster generation
In order to generate the register clusters and guarantee that the incremental register
placement is minimal, the merging of the clusters is restrained to be within one grid
box. The clustering algorithm is an iterative algorithm that is performed for each
grid box. At each iteration, the two clusters with the minimum merging cost, defined
as the total switching capacitance after merging, are merged. After each merging step,
a gating decision is made on the newly merged cluster to determine whether to insert
an ICG cell for reducing $c^s_v$. Then these clusters are connected to the mesh grid to
generate a complete mesh routing solution. The total switching capacitance of all the
clusters and stub wires is calculated as $c^{total}_{prev}$. The merging stops when the merging of
the clusters does not reduce the total switching capacitance inside the grid box. The
algorithm is presented in Algorithm 1.
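The greedy loop for a single grid box can be sketched in Python as follows. This is a simplified illustration and not the implementation of this work: clusters are reduced to points instead of merging regions, stub wires to the mesh are ignored, and the gating model (an ICG cell scales the un-buffered capacitance by the cluster activity and adds a fixed cost) as well as the constants UNIT_WIRE_CAP and ICG_CAP are assumptions made only for the example.

```python
from itertools import combinations

UNIT_WIRE_CAP = 0.2  # hypothetical capacitance per unit wirelength
ICG_CAP = 1.0        # hypothetical always-switching capacitance of one ICG cell


def switching_cap(c_u, activity):
    """Gating decision: use an ICG only if it lowers the switching capacitance."""
    return min(c_u, activity * c_u + ICG_CAP)


def merge(a, b):
    """Merge clusters a and b, each given as (x, y, c_u, activity)."""
    dist = abs(a[0] - b[0]) + abs(a[1] - b[1])
    c_u = a[2] + b[2] + UNIT_WIRE_CAP * dist
    activity = 1 - (1 - a[3]) * (1 - b[3])  # assumes independent enable signals
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, c_u, activity)


def total_cap(clusters):
    return sum(switching_cap(c[2], c[3]) for c in clusters)


def cluster_grid_box(registers):
    """Greedily merge the clusters of one grid box while the total keeps dropping."""
    clusters = list(registers)
    while len(clusters) > 1:
        # Pick the pair whose merged cluster has the least switching capacitance.
        a, b = min(combinations(clusters, 2),
                   key=lambda p: switching_cap(*merge(*p)[2:]))
        candidate = list(clusters)
        candidate.remove(a)
        candidate.remove(b)
        candidate.append(merge(a, b))
        if total_cap(candidate) >= total_cap(clusters):
            break  # merging no longer reduces the switching capacitance
        clusters = candidate
    return clusters
```

For example, two nearby registers with low activity and one distant register with high activity end up as two clusters, because merging the distant register would add more wire capacitance than gating can save.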
High performance mode
The proposed clustering method is developed with the power dissipation mini-
mization objective. As such, the number of ICGs inserted is limited, which causes
clock skew (e.g. between local clusters with and without clock gating ICGs). A
Algorithm 1 Register clustering algorithm.
Input: Merging region $MR_i$ for each register $R_i$ and grid size $M \times N$.
Output: The cluster set $G = \{G_1, G_2, \ldots, G_m\}$
1: Initialize each register $R_i$ as a cluster $G_i$;
2: for each grid box $B_k$ do
3:   Initialize the cluster set $G_{B_k} = \{G_x \mid R_x \in B_k\}$;
5:   while $c^{total}_{cur} \le c^{total}_{prev}$ do
6:     $c^{total}_{prev} = c^{total}_{cur}$;
7:     Find the clusters $G_i$ and $G_j$ in $G_{B_k}$ such that $c^s_v$ is the minimum;
8:     if $c^u_v > C_{limit}$ then
9:       break;
10:     end if
11:     Generate merging region $MR_v$ from $MR_i$ and $MR_j$;
14:     $G_v = G_i \cup G_j$;
18:   $G = G \cup G_{B_k}$;
19: end for
method to reduce the clock skew is to insert an ICG cell on each cluster. This alternative
method potentially increases the area and power, but balances the clock skew.
In order to reduce the clock skew for this method (inserting gates on all clusters),
another requirement is to have a relatively balanced capacitance for each gate. The
merging cost is changed to the un-buffered capacitance instead of the switching capacitance,
and a capacitance limit $C_{limit}$ is placed on each cluster as shown in Step 8 of
Algorithm 1. Note that the capacitance limit $C_{limit}$ is defined based on the skew
requirement. In this chapter, the method described in Section 4.2.2 is considered the
low power (LP) mode while the variation discussed in this section is considered the
high performance (HP) mode of the proposed method.
4.2.3 Incremental Register Placement
During the register clustering phase (Section 4.2.2), the registers are clustered but
their positions are not changed. In this step, the registers are incrementally placed
considering the timing slack of the design. Since moving one register potentially
changes the feasible moving regions of the other registers that have a path to the
moved register, the incremental placement is a combinatorial optimization problem.
The problem is solved using a linear programming formulation. The scalability of the
solution and the runtime are not major concerns, as the size of each individual linear
programming formulation is limited by the number of clock sinks in each grid box.
The objective and the constraints of the formulation are explained in the following
sections.
Objective
The objective is to minimize the distance between the registers inside the same
cluster. This is because the registers of the same cluster are merged during the
local tree generation, and this will reduce the routing wirelength during the local tree








$\mathrm{dist}(R_i, R_j) = |x_{R_i} - x_{R_j}| + |y_{R_i} - y_{R_j}|$. (4.4)
Timing constraints
The same timing constraints and delay modeling as discussed in Section 3.2.3 are
adopted in the linear programming formulation to guarantee the functionality of the























$w^i_{fo_k} = |x_{R_i} - x^i_{fo_k}| + |y_{R_i} - y^i_{fo_k}|$, (4.8)
$w^f_{fi} = |x_{R_f} - x^f_{fi}| + |y_{R_f} - y^f_{fi}|$. (4.9)
Physical Constraints
Along with these timing requirements, the physical requirement of
preventing register overlaps is considered, which is similar to the
non-overlapping constraints presented in Section 3.2.3. As shown in Figure 4.4, let the















Figure 4.4: The overlap illustration.
One of the following four overlap avoidance constraints has to be satisfied in
order to guarantee that there is no overlap between each pair of registers $R_i$ and $R_j$:
$x_{R_i} - x_{R_j} \ge W_r$, (4.10)
$x_{R_j} - x_{R_i} \ge W_r$, (4.11)
$y_{R_i} - y_{R_j} \ge L_r$, (4.12)
$y_{R_j} - y_{R_i} \ge L_r$. (4.13)
These constraints prevent the horizontal and vertical overlapping of register placement
based on the register length and width. The constraints (4.10) and (4.11) are mutually
exclusive, as are the constraints (4.12) and (4.13). In order to form a linear
programming formulation for the problem, only one of the four constraints is placed in
the formulation for each pair of registers $R_i$ and $R_j$. To this end, the constraints
are generated based on the original relative positions of the registers. For instance,
if registers $R_i$ and $R_j$ are within one cluster and their original coordinates satisfy
$y_i < y_j$, Equation (4.13) is set as the non-overlapping
constraint for these two registers. Constraints on the y axis are preferred as the
height of the cell is often smaller. These constraints consider only the overlaps between
registers. The overlaps between registers and logic gates are resolved using placement
legalization.
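The selection of a single non-overlapping constraint per register pair can be sketched as a small Python function. The function name pick_overlap_constraint and the tie-breaking rule (falling back to an x-axis constraint only when the two registers share the same original y coordinate) are assumptions made for illustration; the text above only states that y-axis constraints are preferred.

```python
def pick_overlap_constraint(ri, rj):
    """Pick one of constraints (4.10)-(4.13) from the original placement.

    ri and rj are (x, y) tuples holding the original register coordinates.
    """
    (xi, yi), (xj, yj) = ri, rj
    if yi < yj:
        return "(4.13): y_Rj - y_Ri >= L_r"
    if yi > yj:
        return "(4.12): y_Ri - y_Rj >= L_r"
    # Same original row: fall back to an x-axis constraint (assumed tie-break).
    if xi < xj:
        return "(4.11): x_Rj - x_Ri >= W_r"
    return "(4.10): x_Ri - x_Rj >= W_r"
```

Because exactly one inequality is emitted per pair, the disjunction of the four constraints never enters the formulation and the problem stays a linear program.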
Linear programming formulation
The overall linear programming formulation is presented in Table 4.1. The ob-
jective of the formulation is to minimize the distance between registers inside the
same cluster. The timing constraints and the overlapping constraints are generated.
Note that xdist(a, b) and ydist(a, b) represent the distance between nodes a and b on
Table 4.1: The linear programming formulation for incremental register placement.


































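$w^i_{fo_k} = \mathrm{xdist}(R_i, fo_k) + \mathrm{ydist}(R_i, fo_k)$, $\forall R_i$
$\mathrm{xdist}(R_i, fo_k) \ge x_{R_i} - x_{fo_k}$, $\forall R_i$
$\mathrm{xdist}(R_i, fo_k) \ge x_{fo_k} - x_{R_i}$, $\forall R_i$
$\mathrm{ydist}(R_i, fo_k) \ge y_{R_i} - y_{fo_k}$, $\forall R_i$
$\mathrm{ydist}(R_i, fo_k) \ge y_{fo_k} - y_{R_i}$, $\forall R_i$
$w^f_{fi} = \mathrm{xdist}(R_f, fi) + \mathrm{ydist}(R_f, fi)$, $\forall R_f$
$\mathrm{xdist}(R_f, fi) \ge x_{R_f} - x_{fi}$, $\forall R_f$
$\mathrm{xdist}(R_f, fi) \ge x_{fi} - x_{R_f}$, $\forall R_f$
$\mathrm{ydist}(R_f, fi) \ge y_{R_f} - y_{fi}$, $\forall R_f$
$\mathrm{ydist}(R_f, fi) \ge y_{fi} - y_{R_f}$, $\forall R_f$
$x_{R_i} - x_{R_j} \ge W_r$,
or $x_{R_j} - x_{R_i} \ge W_r$,
or $y_{R_i} - y_{R_j} \ge L_r$,
or $y_{R_j} - y_{R_i} \ge L_r$.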
the horizontal direction and vertical direction, respectively. The constraints on
xdist(a, b) and ydist(a, b) are used to linearize the distance constraints. For each
register pair, at most one constraint among the last four constraints presented as
“or” appears in the linear programming formulation. By solving the formulation, the
optimal locations $(\hat{x}_{R_i}, \hat{y}_{R_i})$ of each register $R_i$ for the distance minimization of the
registers in the same cluster are obtained.
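As a short worked illustration of the linearization (a generic identity, not a constraint copied from Table 4.1), consider a single horizontal distance between nodes $a$ and $b$. An auxiliary variable $t$, a hypothetical name standing in for xdist(a, b), is bounded from below by both signed differences:

```latex
\begin{aligned}
t &\ge x_a - x_b, \\
t &\ge x_b - x_a
\end{aligned}
\quad\Longrightarrow\quad t \ge |x_a - x_b| .
```

When smaller values of $t$ are never penalized by the objective or by the remaining constraints, $t$ settles at $|x_a - x_b|$ in an optimal solution, so the absolute value is captured by linear inequalities alone.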
4.2.4 Clock Mesh Synthesis with Gated Local Trees
Given the new locations of the sink registers and the clusters generated, the clock
mesh network is synthesized. The grid size of the final mesh network is the same as
the grid size during the register clustering stage which is optimized using the method
Algorithm 2 Clock mesh synthesis.
Input: The cluster set $G = \{G_1, G_2, \ldots, G_m\}$ and the locations of all the registers.
Output: The clock mesh network.
1: for each cluster $G_k$ do
2:   Initialize each register $R_i$ in $G_k$ as sub-tree root $t_i$ such that $c^u_i = c_i$, $d_i = 0$;
3:   Update sub-tree roots set $T = \{t_x \mid R_x \in G_k\}$;
4:   while $|T| > 1$ do
5:     Find the newly merged sub-tree root $t_v$ from merging $t_i$ and $t_j$ such that the
       delay $d_v$ is the minimum;
6:     Generate sub-tree $t_v$, update $d_v$ and $c^u_v$;
10:   Connect the root of $t_v$ to the closest grid segment;
11:   Top-down Embedding from root $t_v$ [18];
12: end for
13: Insert buffer drivers at the intersections of the grid wires;
14: Generate the top level buffered clock tree;
in [61]. The local clock tree generation stage is similar to the traditional method for
zero skew clock trees in [79] except that the merging cost is defined as the delay from
the root of the newly merged tree to the sink registers. The algorithm is described
in Algorithm 2. In this algorithm, only the sink registers inside the same cluster
are allowed to be merged. The whole clock mesh network is generated in the order
of i) mesh grid wire generation, ii) gated local tree generation for each cluster and
connecting the gated local trees to the mesh grid wires, and iii) top level clock tree
generation.
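The zero skew merging step that underlies the local tree generation can be sketched in Python. The sketch uses the classic closed-form merge under the Elmore delay model, with wire elongation when one sub-tree is too slow to be balanced within the given distance; it is a textbook formulation rather than the code of this work, and the function name zero_skew_merge and the default unit resistance r and capacitance c are hypothetical. The returned delay is the quantity used as the merging cost in Algorithm 2.

```python
import math


def zero_skew_merge(t_a, c_a, t_b, c_b, length, r=0.1, c=0.2):
    """Merge two sub-trees with zero skew under the Elmore delay model.

    t_a, t_b: root-to-sink delays; c_a, c_b: sub-tree capacitances;
    length: Manhattan distance between the two sub-tree roots.
    Returns (wire_to_a, wire_to_b, merged_delay, merged_capacitance).
    """
    num = t_b - t_a + r * length * (c_b + c * length / 2)
    den = r * length * (c_a + c_b + c * length)
    if num < 0:
        # Sub-tree a is slower: tap at its root and elongate the wire to b.
        la = 0.0
        lb = (math.sqrt((r * c_b) ** 2 + 2 * r * c * (t_a - t_b)) - r * c_b) / (r * c)
    elif num > den:
        # Sub-tree b is slower: tap at its root and elongate the wire to a.
        lb = 0.0
        la = (math.sqrt((r * c_a) ** 2 + 2 * r * c * (t_b - t_a)) - r * c_a) / (r * c)
    else:
        x = num / den if den > 0 else 0.0  # fraction of the wire on a's side
        la, lb = x * length, (1 - x) * length
    delay = r * la * (c * la / 2 + c_a) + t_a
    cap = c_a + c_b + c * (la + lb)
    return la, lb, delay, cap
```

Evaluating this function for every candidate pair inside a cluster and selecting the pair with the smallest returned delay corresponds to the pair selection of Step 5.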
4.3 Experimental Results
The proposed algorithm flow is implemented in C++. The top level clock tree
is generated using a buffered DME algorithm [15, 18] to drive the mesh grid. IC
Compiler of Synopsys is used to perform the initial placement and routing. The
linear programming formulations are solved by the on-line solvers Feaspump and SCIP
from [21]. Since the benchmark circuits provided by the ISPD’10 clock network
contest do not have any logic gate information, the benchmark circuits used in the
experiments are the five largest circuits from the ISCAS’89 benchmark suite. As a reference,
note that the register count of the largest ISCAS’89 circuit is on the same order as that of
the ISPD’10 contest benchmark (1728 vs. 2249).
The switching capacitance of the clock mesh network synthesized by the proposed
method is compared against the method in [61]. The previous work in [61] is imple-
mented such that it generates a mesh network with minimum capacitance under a
skew requirement using uniform mesh grid wires. The mesh reduction is not consid-
ered in the experimental results. However, the same mesh reduction method (or any
other alternative) can be applied on the proposed method. In the proposed method,
two synthesis modes are implemented:
(i) Low Power (LP) Mode: The proposed method inserts clock gating cells on the
local trees only if it reduces the switching capacitance given a skew requirement.
(ii) High Performance (HP) Mode: The proposed method inserts clock gating cells
and buffers on all the local sub-tree roots to balance the clock skew introduced
by the buffering elements (ICG or buffer).
The experimental results are summarized in Table 4.2. In the experiments, the
whole clock networks including gated local trees, mesh grid wires, buffer drivers and
top level clock trees are synthesized using the proposed method. The generated trees
are transformed into the equivalent RC model, following the same procedure as in the ISPD’10
clock network contest, and Ngspice is used to analyze the final clock skew.
By using the LP mode, the switching capacitance is reduced by 22.1% on average compared to
the previous work in [61]. The clock skew is increased in the LP mode because the
clock gating cells are not inserted on all the sub-trees. The skew is increased by 4.5%
Table 4.2: The comparison of switching capacitance and clock skew.
                   Switching capacitance (pF)             Global clock skew (ps)
Circuit   Grid     [61]    LP      Redn.    HP      Redn.    [61]    LP      Incr.   HP      Incr.
s13207    7*7      17.5    13.4    22.9%    15.5    10.9%    10.3    33.3    4.6%    15.5    1.1%
s15850    7*7      13.8    10.5    23.1%    12.3    10.1%    5.2     35.3    6.0%    15.0    2.0%
s35932    14*14    47.2    36.9    21.6%    42.9    9.1%     14.7    33.7    3.8%    17.0    0.5%
s38417    14*14    44.8    35.1    21.5%    40.9    8.5%     14.1    34.4    4.1%    17.8    0.8%
s38584    11*11    32.1    25.2    21.4%    29.0    9.2%     14.5    35.3    4.2%    16.4    0.4%
Average                            22.1%            9.5%                     4.5%            0.9%
of the clock period (500ps) compared to the previous method. However, the overall
clock skew is limited to be within only 7% of the clock period. By using the HP
mode, the switching capacitance reduction is 9.5%, which is less than LP mode. The
skew degradation is only 4.6ps, which is 0.9% of the clock period. The clock networks
synthesized using HP mode have very low clock skew degradation and still achieve a
reasonable power reduction.
The timing slack is guaranteed to be non-negative (or a positive value specified
by the designer) by the linear programming formulation. In practice, the timing slack
might be improved by the incremental register placement. In the experiments,
four out of the five benchmark circuits show timing slack improvements ranging from
2ps to 22ps. Only one out of the five circuits has a timing slack reduction, of 16ps. On
average, the timing slack is improved by 8ps due to the register placement. The
trade-off of applying the proposed method is the increased cell area due to the
insertion of ICG cells, as shown in Table 4.3. The second and third columns show the
number of gates inserted for the two different modes. The HP mode inserts more
gating cells and thus has a larger area increase, which is 4.7% on average. However, it
is observed that the average area increase on the experimental circuits is within 5%
of the cell area, which is very limited. As clock gating is done independently for
each grid box, this relative area overhead is not expected to increase with circuit scaling. The
Table 4.3: The trade-off effects.
           Gates insertion      Area increase        Reg move
Circuit    LP (#)    HP (#)     LP (%)    HP (%)     (×Wr)
s13207     224       342        3.5%      5.3%       3.1
s15850     198       263        2.8%      3.7%       2.6
s35932     669       957        4.4%      6.3%       2.0
s38417     605       862        4.0%      5.7%       2.2
s38584     434       600        1.8%      2.5%       3.6
Average                         3.3%      4.7%       2.7
incremental register movement is constrained to be within a grid box, which suggests
very limited register displacement. It is observed that the average register movement
is only $2.7 W_r$, where $W_r$ is the width of the registers.
4.4 Conclusions
A low power clock mesh synthesis method is proposed which allows clock gating
on local trees and clusters the registers considering the switching activities and the
timing slacks of all the local data-paths. This is the first work on the topic of clock
mesh synthesis that considers clock gating at local trees. The proposed method has
two modes: low power mode and high performance mode. The power reduction is
realized by register clustering and clock gating. The proposed method is a promising
and practical way of generating clock mesh networks for high performance ICs.
5. ROA Synthesis: Steiner Tree Based Rotary Clock Routing
Although the Rotary Oscillator Array (ROA) is a novel clock structure whose
characteristics differ significantly from those of the traditional clock mesh network,
the two are similar in that both the ROA and the clock mesh network have a grid
topology, which suggests similar routing methodologies to connect registers to the
grid. In this chapter, a novel rotary clock network routing method is proposed for the
low-power resonant rotary clocking technology to build a distribution network that
defines and maintains a stable oscillation through design automation. In particular,
the method guarantees: 1. a balanced capacitive load driven by each of the tapping
points on the rotary rings, 2. a customized bounded clock skew among all the registers
on chip, 3. a sub-optimally minimized total wirelength of the clock wire routes. The
proposed routing method is tested with the ISPD clock network contest and IBM r1-r5
benchmarks. The experimental results show that the capacitive load imbalance is
very limited, demonstrating high robustness. The total wirelength is reduced by 64.2%
compared to the best previous work known in the literature through the combination of
steiner tree routing and the assignment of trees to the tapping points. The average
clock skew simulated using HSPICE is only 8.8ps when the bounded skew target is
set to 10.0ps.
This chapter is organized as follows. In Section 5.1, a brief introduction about
the proposed method is presented. In Section 5.2, the problem formulations and the
proposed methodology are presented. The time complexity of the proposed method
is analyzed in Section 5.3. Some discussions about the previous works are presented
in Section 5.4. The experimental results are presented in Section 5.5. The conclusions are
presented in Section 5.6.
5.1 Introduction
Resonant rotary clocking technology is an attractive alternative to traditional
clocking due to its high frequency and low power dissipation characteristics
[84, 86, 89]. Early implementations of rotary clocking, including the pioneering
implementation in [84] and the silicon-based implementation in [42], advocated the use of
varactors to balance the capacitive load requirements and tune the frequency. Varactors
are hard to implement in IC design and can occupy chip area. Furthermore, even
in the presence of these varactors, synchronous components need to be connected to the
rotary rings efficiently. Recent work in [31] has adopted a different scheme, where
the capacitive loading and frequency tuning of rotary rings are performed through
design automation. Novel automation methods are proposed to create custom topologies
of rotary rings and tapping wires to route the synchronous components to the
rotary clock distribution ring. However, this technology requires novel electronic
design automation (EDA) routines within the physical design flow. It is desirable
to have a method which efficiently encapsulates the implementation of the recently
identified 1) operational characteristics such as zero clock skew synchronization [31]
and 2) implementation requirements such as capacitive load balancing [32, 84].
Earlier design automation methods [82, 87] focus on eliminating or accounting
for the non-zero clock skew, as it was commonly believed that rotary clocking only
provides a non-zero clock skew synchronization. It is shown in [31] that a zero skew
rotary clock can be efficiently implemented with resonant clocking. However, in [31],
the capacitive load balancing requirement is not considered in achieving the zero
skew synchronization. The rotary clock signal can be severely distorted due to an
uneven capacitance distribution [32, 84]. In [82], the number of registers per ring is
limited with an upper bound in an effort to have a uniform register distribution on
the rotary ring. Capacitive load balancing requires a more comprehensive analysis,
as the tapping wires and the anti-parallel inverters on the rotary ring contribute to
the capacitive load as well. In [35], the zero skew and capacitive load balancing are
considered but the method leads to an undesirable amount of tapping wires.
Another major need in rotary clock routing automation is the minimization of
the tapping wirelength used to connect the synchronous components to the rotary
rings. In [82], the tapping wirelength is minimized by incrementally placing the
synchronous components closer to the tapping points. In [31, 35], the integer linear
programming (ILP) framework is used to compute the optimal tapping wirelength
given the placement of the synchronous components. In these previous works [31, 32,
35, 82, 87], the synchronous components are connected to the rings using individual
tapping wires. In [76], it is shown that rotary clocking with the tree based sub-
networks reduces the overall wirelength. However, the capacitive load balancing is
not considered in building the tree sub-networks in [76].
In this work, a novel rotary clock network routing method is proposed where
the synchronous components are connected (e.g. tapped) to the rotary oscillator
array (i.e. ROA) using a forest of steiner trees satisfying all objectives simultaneously.
The novelties and advantages of the proposed method are:
(i) The capacitive load driven by each ring on the rotary clock distribution grid
(i.e. ROA) is balanced,
(ii) Bounded clock skew is guaranteed under the Elmore delay model and it can be
extended to other delay models,
(iii) The total wirelength connecting the tapping points to the registers is minimized
by using the steiner trees.
(iv) This is the first polynomial time solution for rotary clock routing which achieves































































(b) ROA network with registers routed in a
steiner tree manner.
Figure 5.1: ROA network illustration.
5.2 Methodology
The proposed rotary clock network routing simultaneously targets bounded skew,
capacitive load balancing and wirelength minimization. The proposed method is a
polynomial-time algorithm performed at the post-placement stage in the physical
design flow as shown in Figure 5.2. At this stage, an ROA is already created and
the register placement locations are known. The proposed method first clusters
the registers to be routed with steiner routing while guaranteeing zero clock skew
within each cluster and providing the capacitance balancing between all the clusters.
Next, the tapping points are connected to the sub-tree roots to achieve the goal










Figure 5.2: The physical design flow employing ROA technology.
5.2.1 Register Clustering
The unbuffered zero skew steiner tree based network generation is a well-studied
problem [7, 10, 24, 79]. The deferred merge embedding (DME) solution [7] is adapted
in this work with a modified merging cost function to achieve the targeted design
goals. The DME algorithm is modified to generate a forest of unbuffered steiner
trees (with multiple sub-tree roots) as opposed to a single unbuffered steiner tree.
In this chapter, a novel clustering method is proposed which achieves capacitance
balancing and zero skew at the same time. This is unlike [76], which uses min-
cut partition to identify the registers that are connected to the same tree (without
capacitive load balancing). The input to the proposed clustering algorithm is the
placement information of the circuit and the number of tapping points on the ROA.
The number of clusters is equal to the number of tapping points such that a one-
to-one connection is derived for the purpose of capacitive load balancing. Thus, the
desired features of the proposed clustering algorithm are:
(i) The estimated total capacitance of each cluster should be balanced,
(ii) The registers in each cluster should have zero skew,
(iii) The total wirelength used to connect the registers inside each cluster should be
minimized.
A method is proposed to generate a forest of steiner trees as the register clusters which
achieves the three goals. In the forest, each register cluster is connected through a
steiner tree. Similar to DME [7], the proposed method has two phases: The bottom
up clustering phase and the top down embedding phase. The bottom up clustering
algorithm is shown in Algorithm 3. Similar to DME [7], the merging of sub-trees guar-
antees the zero skew from each new sub-tree root to its sink registers. Unlike DME [7],
where the merging cost is the distance between the sinks, the cost of merging sub-trees
is selected as the total capacitance of the newly merged sub-trees. Consequently, if a
newly merged sub-tree has the least total capacitance from root to sinks (registers)
over all the other possible newly merged trees, this sub-tree is created. This merging
scheme greedily balances the capacitance among the register clusters. Unlike DME,
the bottom up clustering terminates when the number of sub-trees is equal to the
number of tapping points. Next, the top down embedding [7] proceeds from each of
the sub-tree roots to determine the locations of the root and the internal nodes of
each sub-tree, identical to [7].
The total capacitance of each sub-tree is balanced as the merging cost selected is
the total capacitance of the sub-trees formed by merging pairs during the bottom-
up phase. Merging with reduced capacitance cost also helps to reduce the total
wirelength as the capacitance is directly proportional to the wirelength in general.
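The bottom up clustering phase can be sketched in Python as follows. The sketch is a simplification under stated assumptions: each sub-tree root is reduced to a single point placed at the midpoint of its children, whereas the method above keeps merging segments and balances the wire lengths for zero skew, and the unit wire capacitance C0 and the function names are hypothetical.

```python
from itertools import combinations

C0 = 0.2  # hypothetical unit wire capacitance


def merged_cap(a, b):
    """Total capacitance of the sub-tree formed by merging a and b."""
    dist = abs(a[0] - b[0]) + abs(a[1] - b[1])
    return a[2] + b[2] + C0 * dist


def cluster_registers(registers, num_tapping_points):
    """Bottom up phase: merge until one sub-tree per tapping point remains.

    Each entry of registers is a tuple (x, y, capacitance).
    """
    roots = list(registers)
    while len(roots) > num_tapping_points:
        # Select the pair whose merged sub-tree has the least total capacitance.
        a, b = min(combinations(roots, 2), key=lambda p: merged_cap(*p))
        roots.remove(a)
        roots.remove(b)
        roots.append(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, merged_cap(a, b)))
    return roots
```

Because the cheapest merge is always taken first, small sub-trees tend to be combined before large ones grow further, which is the greedy balancing effect described above.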
5.2.2 Target Delay Calculation
As the root of the created sub-trees may not reside on the tapping points, extra
(i.e. tapping) wires are needed to connect the tree roots to the tapping points as
shown in Figure 5.3. Assuming the delay from the root of a sub-tree i to all its sink
Algorithm 3 Register clustering (Bottom up phase).
Input: Register locations $s_i(x, y)$ for each register $s_i$ and number of available tapping
points $N$.
Output: The set of sub-tree roots $R$.
1: Initialize the sub-tree roots set $R = \{s_1, s_2, \ldots, s_n\}$;
2: while $|R| > N$ do
3:   Find the sub-tree roots $s_i$ and $s_j$ in $R$ such that the newly merged sub-tree $s_v$
     has the least total capacitance;
4:   Create new sub-tree $s_v$ by merging sub-trees $s_i$ and $s_j$;
5:   $R = R - \{s_i\}$, $R = R - \{s_j\}$, $R = R \cup \{s_v\}$;
6: end while
registers is di, and the phase delay of a tapping point j is pj, the Elmore delay dij





+ Ci) + di + pj (5.1)
where lij is the tapping wirelength needed to connect the root of sub-tree i to the
tapping point j, and Ci is the total capacitance of the sub-tree i. The parameters r0 and
c0 are the unit wire resistance and unit wire capacitance, respectively. In order to
guarantee zero skew among all the registers, all delays dij should be equal to a target
delay D:
\[ \frac{r_0 c_0}{2} l_{ij}^2 + r_0 C_i l_{ij} + d_i + p_j - D = 0. \tag{5.2} \]
If the target delay D is high, a lot of extra wiring is required to maintain zero clock
skew and the total wirelength increases. If the target delay D is low, there might
be no one-to-one matching of sub-tree roots to tapping points that
guarantees zero skew. To this end, an optimal solution is proposed to calculate the
target delay D.
Figure 5.3: The illustration of tapping wire connection.
In order to solve the problem of zero skew synchronization and capacitive load
balancing, a cost matrix C is introduced, where each element Cij represents
the sum of the wire capacitance used to connect a sub-tree root i to a tapping
point j and the total capacitance of the sub-tree i:
\[ C_{ij} = c_0 l_{ij} + C_i. \tag{5.3} \]
The tapping wirelength lij is calculated using Equation (5.2). There are two
cases in which the solution lij is discarded and the corresponding Cij is set
to ∞ (no connection available between sub-tree root i and tapping point j):
1. The quadratic equation does not have a positive solution;
2. The quadratic equation has a positive solution lij, but lij is smaller than the shortest
possible wire necessary to connect the root of sub-tree i and the tapping point j.
Both cases imply that the delay target cannot be satisfied by connecting a tapping
wire between the root of the sub-tree i and the tapping point j.
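As an illustration of how one cost matrix entry could be evaluated, the following
Python sketch solves Equation (5.2) for its positive root and applies the two discard
cases. The function names and the argument min_dist (the shortest wire that can
physically connect root i to tapping point j) are hypothetical, and consistent units
are assumed for all parameters.

```python
import math


def tapping_wirelength(D, d_i, p_j, C_i, r0, c0):
    """Positive root of (r0*c0/2)*l^2 + r0*C_i*l + (d_i + p_j - D) = 0, or None."""
    a = 0.5 * r0 * c0
    b = r0 * C_i
    c = d_i + p_j - D
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None
    l = (-b + math.sqrt(disc)) / (2.0 * a)  # the larger of the two roots
    return l if l > 0 else None


def cost_entry(D, d_i, p_j, C_i, r0, c0, min_dist):
    """C_ij = c0*l_ij + C_i, or infinity when the pair cannot meet target D."""
    l = tapping_wirelength(D, d_i, p_j, C_i, r0, c0)
    if l is None or l < min_dist:  # case 1 or case 2 above
        return math.inf
    return c0 * l + C_i
```

Since r0, c0 and Ci are positive, a positive root exists only when D exceeds di + pj,
which is why a low target delay leaves many entries at ∞.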
Algorithm 4 Optimal target delay calculation.
Input: Dmax, Dmin, K.
Output: The optimal target delay D∗.
1: Dt[i] = Dmin + (2i + 1)(Dmax − Dmin)/(2K), i ∈ {0, 1, ..., K − 1}, kmax = K − 1,
kmin = 0;
2: while kmax − kmin > 1 do
3: kmedian = ⌊(kmin + kmax)/2⌋;
4: Construct cost matrix C using delay target Dt[kmedian];
5: if C has a perfect matching then
6: kmax = kmedian;
7: else
8: kmin = kmedian;
9: end if
10: end while
11: Return Dt[kmin] as D∗;
In Algorithm 4, a binary search is performed to find the optimal target delay
value D∗ for wirelength minimization. A perfect matching implies that, under the
current delay target, there exists a one-to-one routing solution that connects each
sub-tree root to a different tapping point. The parameters Dmax and Dmin are the
maximum and minimum delays of connecting a sub-tree root i to a tapping point j,
respectively. A user-defined integer K sets the granularity of the search.
The interval between Dmin and Dmax is equally partitioned into K small
intervals, and the midpoint of each small interval is stored in a vector Dt of size K.
The optimal target delay D∗ is the output of the algorithm.
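The search can be sketched in Python as follows. This is an interpretation rather than
a transcription of Algorithm 4: the index bookkeeping differs, the sketch returns the
smallest candidate delay for which the feasibility test succeeds, and it assumes that
feasibility is monotone (any delay above a feasible one is also feasible). The callback
has_perfect_matching is a hypothetical stand-in for constructing the cost matrix and
testing it for a perfect matching.

```python
def optimal_target_delay(d_min, d_max, k, has_perfect_matching):
    """Return the smallest candidate delay accepted by has_perfect_matching.

    Assumes monotone feasibility. Returns None if even the largest
    candidate is rejected.
    """
    step = (d_max - d_min) / k
    dt = [d_min + (i + 0.5) * step for i in range(k)]  # interval midpoints
    if not has_perfect_matching(dt[-1]):
        return None
    lo, hi = 0, k - 1  # invariant: dt[hi] is feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if has_perfect_matching(dt[mid]):
            hi = mid
        else:
            lo = mid + 1
    return dt[hi]
```

The feasibility test is called at most about log2(K) + 1 times, which is the source of
the log K factor in the complexity of Algorithm 4.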
5.2.3 Bounded Skew Cost Matrix Construction
The target delay is calculated under the zero skew assumption. However,
the routing wirelength and the capacitive load balancing can be further improved by
relaxing the zero skew constraint to a bounded skew constraint, for two
reasons: 1. The additional tapping wire required is reduced by relaxing the
delay requirement. 2. More connections between tapping points and sub-tree roots may
become available.
Assuming the skew bound is set to B, the delay upper bound Du and lower





, respectively. Each ring in the ROA is a rectangle
with four edges, as shown in Figure 5.3. For illustration purposes and without
loss of generality, each tapping point is restricted to move along
only one edge of its ring. As shown in Figure 5.3, the upper left corner of the ring
is set as the reference tapping point for tapping point j, with phase pj0. The tapping
point can be moved along the top edge of the ring to a distance x from the reference
tapping point. A different x results in a different delay from the tapping point to
the registers, as shown in Equation (5.5), as well as a different phase delay pj at the
tapping point, which changes at a rate α as shown in Equation (5.6). The following
formulation finds the value of x that minimizes lij such that the delay dij to
the sink registers is within the tolerable delay range:
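\[ \min \; l_{ij} \tag{5.4} \]
\[ \text{s.t.} \quad d_{ij} = r_0 l_{ij} \left( \frac{c_0 l_{ij}}{2} + C_i \right) + d_i + p_j, \tag{5.5} \]
\[ p_j = p_{j0} + \alpha x, \tag{5.6} \]
\[ l_{ij} = |x_{r_i} - x_{tp_{j0}} - x| + |y_{r_i} - y_{tp_{j0}}|, \tag{5.7} \]
\[ d_{ij} \in [D_l, D_u]. \tag{5.8} \]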
The only variable in the quadratic formulation is x. In implementation, this
single-variable optimization problem can be solved in constant time. Note
that the value lij may be discarded based on the same criteria as explained in
Section 5.2.2. An lij is calculated for each sub-tree i and each tapping point j to
construct the bounded skew cost matrix C as explained in Section 5.2.2.
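The formulation in Equations (5.4) to (5.8) can be explored numerically with the
short Python sketch below. It is only an illustration: instead of the constant-time
solution mentioned above, it scans a fixed number of sample positions along the top
edge, and the names best_tap_offset, edge_len and samples are hypothetical.

```python
def best_tap_offset(xr, yr, xt0, yt0, edge_len, C_i, d_i, p_j0,
                    alpha, r0, c0, D_l, D_u, samples=1001):
    """Scan x in [0, edge_len]; return (l_ij, x) with minimal l_ij, or None."""
    best = None
    for s in range(samples):
        x = edge_len * s / (samples - 1)
        l = abs(xr - xt0 - x) + abs(yr - yt0)        # Equation (5.7)
        p = p_j0 + alpha * x                         # Equation (5.6)
        d = r0 * l * (c0 * l / 2.0 + C_i) + d_i + p  # Equation (5.5)
        if D_l <= d <= D_u and (best is None or l < best[0]):
            best = (l, x)                            # feasible per Equation (5.8)
    return best
```

A return value of None corresponds to an entry of ∞ in the bounded skew cost matrix.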
5.2.4 Assignment Problem Formulation
The problem of balancing the capacitance on each tapping point is formulated
as finding a balanced assignment of sub-tree roots to tapping points in which the
difference between the maximum and minimum capacitive loads is minimized. The
problem formulation is shown in Table 5.1, with this difference as the minimization
objective. The binary variable xij indicates whether the sub-tree i and the
tapping point j are connected. The constraints in Table 5.1 force each sub-tree root
to connect to exactly one tapping point and each tapping point to have exactly one
sub-tree connected to it. Theoretically, more than one tree can be connected to a
tapping point to satisfy the design objectives. In the proposed solution, however,
the number of sub-trees is selected to be identical to the number of tapping points
to guarantee capacitive load balancing; thus, a perfect matching is implied. This
formulation is known as the balanced assignment problem (BAP) [8, 9, 59].
5.2.5 Balanced Tapping Points Assignment Algorithm
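A two-step algorithm is applied to solve the problem shown in Table 5.1. The
first step solves the Linear Bottleneck Assignment Problem (LBAP)
presented in Table 5.2. The goal of the LBAP is equivalent to minimizing the
maximum capacitive load over all the tapping points. The cost matrix C defined in
Section 5.2.2 is constructed under the bounded skew constraint such that each entry Cij
represents the total capacitance cost (sub-tree and tapping wire) of connecting a
sub-tree i to a tapping point j, when lij is feasible.
Table 5.1: Balanced assignment problem (BAP) formulation.
\[ \min \; \Big( \max_{i,j} \{ C_{ij} : x_{ij} = 1 \} - \min_{i,j} \{ C_{ij} : x_{ij} = 1 \} \Big) \]
\[ \text{s.t.} \quad \sum_{i} x_{ij} = 1, \; \forall j \]
\[ \sum_{j} x_{ij} = 1, \; \forall i \]
\[ x_{ij} \in \{0, 1\} \]
Table 5.2: Linear bottleneck assignment problem (LBAP) formulation.
\[ \min \; \max_{i,j} \, C_{ij} x_{ij} \]
\[ \text{s.t.} \quad \sum_{i} x_{ij} = 1, \; \forall j \]
\[ \sum_{j} x_{ij} = 1, \; \forall i \]
\[ x_{ij} \in \{0, 1\} \]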
The LBAP shown in Table 5.2 is solved by Algorithm 5. Note that the
matrix G[x] in Algorithm 5 is created from matrix C with the same dimensions, such
that all the edges with a weight greater than a capacitance threshold x are deleted.
The algorithm uses a binary search to find the minimum threshold x such
that a feasible assignment exists in G[x]. A feasible assignment is a one-to-one
perfect matching of the sub-tree roots to the tapping points in the current matrix.
A modified version of the Hungarian algorithm [6] is applied to find an assignment in the
cost matrix. The output of the algorithm is an optimal assignment (i.e. the values
of xij) of the sub-tree roots to the tapping points, which gives the connections
for which the maximum capacitive load over all the tapping points is minimized.
Algorithm 5 The bottleneck assignment algorithm [9].
Input: Cost matrix C = [Cij]n×n.
Output: Optimum assignment permutation: i → φ(i).
1: Cmin = min_{i,j}(Cij), Cmax = max_{i,j}(Cij)
2: if Cmin = Cmax then
3: Return any assignment (permutation φ);
4: else
5: while C′ = {Cij : Cmin < Cij < Cmax} ≠ ∅ do
6: Find the median Cmedian of all the elements in C′;
7: Set threshold(Cmedian, Cmin, Cmax);
8: end while
9: Set threshold(Cmin, Cmin, Cmax);
10: Find an assignment (permutation φ) in G[Cmax];
11: end if
Procedure Set threshold(Cmedian, Cmin, Cmax)
1: Construct the cost matrix G[Cmedian];
2: if G[Cmedian] contains an assignment then
3: Cmax = Cmedian
4: else
5: Cmin = Cmedian
6: end if
The second step applies Algorithm 6 to obtain an optimal solution of the balanced
assignment problem. The output of Algorithm 5, which is the
optimal bottleneck assignment and the corresponding capacitance value, is used as the
input of Algorithm 6. The algorithm iteratively decreases the difference between the
minimum and maximum capacitance values while checking whether an assignment exists
between the two values. The algorithm returns the optimal balanced assignment, for
which the difference between the maximum and minimum capacitance values is minimized.
Therefore, the optimal capacitive-load-balanced solution of connecting the sub-tree
roots to the tapping points is obtained, which also satisfies the skew requirement
without the excessive use of tapping wires.
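The two steps can be illustrated with the following Python sketch. It is not a
transcription of Algorithms 5 and 6 and does not use the modified Hungarian algorithm:
feasibility is tested with a simple augmenting-path bipartite matching, the bottleneck
value is found by binary search over the sorted distinct costs, and the balanced
solution is found by a two-pointer sweep over lower and upper thresholds. All function
names are hypothetical, and only the optimal objective values are returned.

```python
import math


def has_assignment(cost, lo, hi):
    """True if a perfect matching exists using only entries with lo <= c <= hi."""
    n = len(cost)
    match_col = [-1] * n  # match_col[j] = row currently matched to column j

    def augment(i, seen):
        for j in range(n):
            if lo <= cost[i][j] <= hi and not seen[j]:
                seen[j] = True
                if match_col[j] < 0 or augment(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    return all(augment(i, [False] * n) for i in range(n))


def bottleneck_value(cost):
    """Smallest threshold t such that entries <= t admit a perfect matching."""
    values = sorted({c for row in cost for c in row if c != math.inf})
    lo, hi = 0, len(values) - 1
    if not values or not has_assignment(cost, -math.inf, values[hi]):
        return None
    while lo < hi:  # binary search on the sorted distinct costs
        mid = (lo + hi) // 2
        if has_assignment(cost, -math.inf, values[mid]):
            hi = mid
        else:
            lo = mid + 1
    return values[hi]


def balanced_spread(cost):
    """Smallest (max - min) over all perfect matchings, or None if none exists."""
    values = sorted({c for row in cost for c in row if c != math.inf})
    best, hi = None, 0
    for lo in range(len(values)):
        hi = max(hi, lo)
        while hi < len(values) and not has_assignment(cost, values[lo], values[hi]):
            hi += 1
        if hi == len(values):
            break  # no matching for this lower threshold or any larger one
        spread = values[hi] - values[lo]
        if best is None or spread < best:
            best = spread
    return best
```

For the cost matrix [[4, 9, 1], [2, 8, 7], [3, 6, 5]], the bottleneck value is 6, while
the most balanced assignment uses the entries 4, 7 and 6, for a spread of 3. The example
shows that the bottleneck-optimal and balance-optimal assignments can differ.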
5.3 Time Complexity Analysis
The proposed clustering algorithm is a polynomial-time greedy algorithm. The
balanced assignment algorithm solves the problem of assigning sub-tree roots to tap-
ping points with balanced total capacitance optimally in polynomial time as well.
Theorem 5.3.1. The time complexity of the bottom-up clustering algorithm shown
in Algorithm 3 is O((|S| − N)|S|^2), where S is the set of registers and N is the number
of tapping points.
Algorithm 6 The balanced assignment algorithm [8].
Input: Cost matrix C = [Cij]n×n and the optimal permutation i → φ(i) obtained from
Algorithm 5.
Output: Balanced assignment (permutation i → φ∗(i)).
1: Cmin = min_i(Ciφ(i)), Cmax = max_i(Ciφ(i)). If Cmin = Cmax, return the current
assignment permutation φ; else go to Step 2.
2: Copy C to C′ and delete in C′ all elements {Cij : Cij ≤ Cmin or Cij > Cmax, i, j ∈ N}.
If an assignment permutation φ exists in C′: if cost(φ) < cost(φ∗), set φ∗ = φ;
go to Step 1. If no assignment exists, go to Step 3.
3: If Cmax = max_{i,j}(Cij), return the current best assignment permutation as φ∗.
Otherwise, increase Cmax to the next larger Cij in C and go to Step 2.
Proof. Step 1 in Algorithm 3 takes O(|S|) time. Step 3 has time complexity O(|S|^2),
as it has to traverse all the possible pairs to determine the minimum cost pair. The
merging process in Step 4 takes constant time. The loop is executed |S| − N
times. The total complexity is therefore O((|S| − N)|S|^2).
Theorem 5.3.2. The time complexity of Algorithm 4 is O(N^3 log K) [8].
Proof. The main loop of Algorithm 4 from Step 2 to Step 10 is executed O(log K)
times. The time complexity of constructing the cost matrix C is O(N^2). The time
complexity of judging whether the cost matrix C has a perfect matching is O(N^3). The
overall complexity of the algorithm is O(N^3 log K).
Theorem 5.3.3. The time complexity of Algorithm 5 is O(N^3 log N).
Proof. The main part of Algorithm 5 is the loop from Step 5 to Step 8. A binary
search is applied to find the minimum threshold. Since the cost matrix size
is N × N, the loop is executed log(N^2) = 2 log N times in the worst case. In each
iteration, a modified Hungarian algorithm [6] is executed once to judge whether an
assignment exists. The time complexity of the modified Hungarian algorithm is O(N^3). The
overall time complexity of Algorithm 5 is thus O(N^3 log N).
Theorem 5.3.4. The time complexity of Algorithm 6 is O(N^4) [8].
The overall time complexity of the proposed method is O(N^4 + |S|^3). However,
since the capacitance balancing results are already good for the benchmark circuits
before applying the proposed balanced assignment algorithm, the balanced assignment
algorithms (Algorithms 5 and 6) and the delay tuning algorithm (Algorithm 4) can
be treated as optional steps in the proposed work, in which case the time complexity
drops to O(|S|^3).
5.4 The Results of the Previous Works
Zero (bounded) skew synchronization and capacitive load balancing for ROA have
been previously studied in [31, 32, 34, 35]. In [31] and [34], the proposed methods
aim to achieve zero skew and bounded skew, respectively, but capacitive load
balancing is not considered. In [32], the capacitive load is balanced but the clock
skew is not considered. The method in [35] is the only previous work that achieves
capacitive load balancing and zero (bounded) skew at the same time. However,
the method in [35] uses an excessive amount of tapping wire for two reasons:
1. The individual connection between the tapping points and the registers; 2. The
attempt to balance the clock skew using tapping wires. Among the previous works
targeting zero skew synchronization, the method in [31] obtains the best result in
terms of total wirelength. Thus, in the experimental results, the proposed method is
compared against the results produced by [31] in terms of wirelength, capacitive load
balancing and global clock skew.
5.5 Experimental Results
The proposed Steiner-tree-based rotary clock routing method is implemented in
C++ and Matlab. The experiments are performed on the ISPD clock network contest
and IBM r1-r5 benchmark circuits. An industrial 90nm technology library is used for
routing and skew analysis. The rotary clock network is simulated using HSPICE to
observe the global clock skew. Note that in order to compute the oscillation frequencies,
the length units of the r1-r5 benchmark circuits are scaled empirically to match
the rotary ring dimensions in [84]. In particular, a perimeter of
50,000 units for r1 corresponds to a perimeter of 3200µm. The skew bound for the
method is set to 10ps for all the benchmark circuits, which is a tight skew
bound at the operating frequency of 4GHz.
After applying the proposed clustering and capacitive load balancing algorithms,
the results for wirelength, capacitive load balancing, frequency variation and global
clock skew are presented in Table 5.3. By applying the proposed method, the total
wirelength is reduced by an average of 64.2% compared to [31]. The wirelength savings
are achieved through the Steiner routing and the bounded skew constraints.
In [31], the capacitance imbalance is not considered and is therefore significant
(87.2%), as only the skew and wirelength are kept small. Using the proposed method, two
different sets of capacitance imbalance results are compared. The first set is obtained
through a minimum cost assignment of tapping points to sub-tree roots after applying the
clustering algorithm, which has a capacitance imbalance of 42.4%. This result is better
than [31] due to the greedy capacitance balancing effect at the clustering stage. The
second set is obtained through the proposed balanced assignment algorithm
(Table 5.1), which reduces the capacitance imbalance to 19.4%. All the capacitance
imbalance percentages using the proposed method are acceptable to establish and
maintain a stable resonant oscillation, which is demonstrated by HSPICE simulations.
HSPICE simulations are performed on the clock distribution networks of the r1-r5
circuits generated by the proposed method to show the generated clock signal. It is
observed in the HSPICE simulations that all the ROA routings generated by the
proposed method have very low frequency variation. However, for the routing method
in [31], four benchmark circuits (ispd01, ispd02, ispd03, ispd07) have a large frequency
variation due to capacitance imbalance.
The HSPICE simulations are also used to report the global clock skew of the
ROA networks. The clock skew of the ROA rings generated by the proposed method
is compared against that of the ROA rings generated by the methodology in [31]. The
average skew on all the benchmark circuits using the proposed method is only 8.8ps,
smaller than the average skew produced by [31] (15.5ps). This is because the
total capacitive load of each ring using the proposed method is much smaller and less
imbalanced. Note that for the benchmark circuits ispd01, ispd02, ispd03 and ispd07,
the ROA routings generated using the method in [31] have significant imbalances of
the load capacitance, which cause a severe frequency variation. Thus, the skews of
these four benchmark circuits cannot be reported.
5.6 Conclusion
In conclusion, a clock network routing method is proposed for the resonant rotary
clock that considers all operational characteristics and implementation requirements.
The registers are routed using a forest of Steiner trees while guaranteeing zero (bounded)
skew and capacitive load balancing of each Steiner tree. A balanced assignment
algorithm is applied to connect the tapping points to the Steiner trees to maintain the
capacitive load balancing of each tapping point and the bounded clock skew of all the
registers. The total wirelength is minimized and the method is 64.2% better in terms
of the wirelength than the best known previous work in the literature. The simulated
6. Clock Buffer Polarity Assignment: Capacitive Load Awareness and
Skew Tuning
Compared to the grid topology, the tree topology is advantageous in that it uses
minimal routing resources to deliver the clock signal to all the synchronous
components. Thus, it dissipates significantly less power than the traditional clock mesh
network. The trade-off is that the tree topology has less tolerance to on-chip
variations. Clock buffer polarity assignment [54] is demonstrated to be an effective
way to reduce the on-chip variations introduced by peak current. Two novel clock
buffer polarity assignment methods are proposed in Chapters 6 and 7, respectively,
to further reduce the on-chip variations introduced by the peak current.
A clock polarity assignment method is proposed in this chapter that reduces the
peak current on the vdd/gnd rails of an integrated circuit. The impacts of i) the
output capacitive load on the peak current drawn by the sink level clock buffers,
and ii) the buffer/inverter replacement scheme of polarity assignment on the timing
accuracy, are considered in the formulation. The proposed sink-level-only polarity
assignment is performed by a lexi-search algorithm in order to balance the peak
current on the clock tree. Most of the previous polarity assignment methods that do
not include clock tree re-synthesis lead to an undesirable increase in the worst corner
clock skew. To address this, a skew tuning scheme is proposed that reduces the clock skew
through polarity refinement and not through clock tree re-synthesis. The proposed
polarity assignment method with the skew tuning scheme is implemented within an
industrial design flow for practicality. Experimental results show that the worst case
peak current drawn by the clock tree can be reduced by an average of 36.5%. The
worst corner clock skew is increased from 60.7ps to 76.2ps by applying the proposed
polarity assignment method. The proposed skew tuning scheme reduces the worst
case clock skew from 76.2ps to 61.5ps on average with a limited degradation in the
peak current improvement (36.5% to 31.2% on average).
In Section 6.1, the clock buffer polarity assignment methods are introduced and
reviewed. In Section 6.2, the motivation regarding peak current and polarity assignment
is presented. In Section 6.3, the problem formulation is presented. In Section 6.4,
the polarity assignment method is proposed and the corresponding experimental
results are summarized. In Section 6.5, the proposed skew tuning method is presented
and the corresponding experimental results are summarized. The comparison of the
proposed method with the previous work is presented in Section 6.6. The work is
summarized in Section 6.7.
6.1 Clock Buffer Polarity Assignment
On-chip variations and low power design schemes greatly affect the operation and
reliability of modern circuits. For instance, on-chip variations and the
down-scaling of power supply voltages lead to narrower noise margins, which cause circuits to
be more susceptible to power/ground noise [60]. The importance of the analysis and
reduction of the power/ground noise is amplified due to such narrower noise margins.
The clock tree has a significant impact on the power/ground noise since all the nodes on
the clock tree switch during each cycle. The power/ground noise manifests itself as
the peak current drawn from the vdd/gnd rails, which can be hazardous to the circuit
operation. The polarity of the switching activity on the clock tree nodes affects the
peak current. A clock buffer has a positive polarity if its output switches in the same
direction as the clock source. A clock buffer has a negative polarity if its output
switches in the opposite direction of the clock source [54]. Reducing the peak current
on the vdd/gnd rails by reducing the number of simultaneously switching clock tree
nodes is called clock buffer polarity assignment.
Clock buffer polarity assignment for peak current reduction was first introduced
in [54]. The work in [54] proposes a method to reduce the peak current by using
opposite-phase buffers on the clock tree. The method does not consider the peak
current as a local effect. Furthermore, the clock skew induced by the delay difference
of inverters and buffers used in the polarity assignment is not considered. Later on,
an alternative method of performing the polarity assignment is proposed in [66], which
presents three assignment algorithms. These methods assign polarity on the whole
clock tree (i.e. sink and non-sink buffers) and might lead to a large clock skew in
the worst case. In [13], a clock polarity assignment which not only reduces the peak
current but also minimizes the clock skew on the clock tree is devised. In [65], an
approach which minimizes the power/ground noise while keeping a zero skew clock
tree through a clock tree re-synthesis is proposed. The method has a better effect on
the power/ground noise reduction but leads to an increased wirelength (5%) due to the
clock tree re-synthesis. Another polarity assignment method is proposed in [37] that
addresses the increased wirelength problem by eliminating the clock tree re-synthesis
step while simultaneously considering buffer sizing. The method in [37] sacrifices the
accuracy of timing in order to simultaneously perform polarity assignment and buffer
sizing. As such, the method in [37] might lead to negative timing slack. The clock
polarity assignment methods can also be categorized based on when they are applied
in the physical design flow. The method in [54] is often used in the pre-layout stage,
where the input and the output of the method are the circuit netlist. In contrast,
the methods in [13, 37, 65, 66] are applied at the post-layout stage, where the input
of the method is the original clock tree generated by a conventional physical design
tool. The method proposed in this work belongs to the second category.
These previously proposed polarity assignment methods share two major drawbacks.
First, the differences among clock buffers in the peak current drawn from the vdd/gnd
rails, which depend on the buffer type, size and output
capacitive load, have not been considered. HSPICE simulations on a 90nm technology
library are used to demonstrate this fact. Second, although the differences in the timing
delays of the buffers and inverters are considered in some works, including [13] and [37],
timing closure cannot be guaranteed after the polarity assignment. This is because the
replacement of buffers and inverters on one clock tree branch may negatively affect
the delay of the other clock tree branches. This second fact, for instance, is not
considered in the simplified timing model used in [37]. The polarity assignment
method proposed in this work overcomes the two drawbacks in the previous works.
The proposed polarity assignment method with the clock skew tuning scheme has
two steps in order to meet timing closure:
(i) Step 1: Performing a clock buffer polarity assignment which accounts for the dif-
ferent peak currents drawn from the vdd/gnd rails of the different buffers/inverters
with different capacitive loads.
(ii) Step 2: Performing a polarity refinement with timing analysis to finalize the
clock skew in order to meet the timing requirement.
The polarity assignment is performed on the clock sink layer only similar to [13, 37], as
the total peak current caused by this layer is the most dominant. Polarity assignment
at the sink level only has the added advantage of eliminating clock skew degradation
due to clock polarity assignment, which is critical when the timing characteristics
of each clock buffer and branch are considered accurately. A heuristic clock skew
tuning scheme for the timing closure is proposed as the second step. Consequently,
the proposed methods lead to the clock trees that are improved in the peak current


















Figure 6.1: RC model of buffers and wires.
6.2 Motivation
In this section, the motivation of the proposed polarity assignment method is
presented, particularly addressing the contributions to improve the accuracy and
efficiency of the process.
6.2.1 The Capacitive Load of Clock Buffers
An RC model of a clock branch with wires and buffers is presented in
Figure 6.1 [12]. Clock tree branches are typically implemented with wider metal wires,
which limits the wire resistance Rwire. The capacitive load of a clock buffer Bi can
be estimated as the sum of the output interconnect capacitance Cwire and the input
capacitances Cbuf, Cinv and Creg of the fanout buffers, inverters and
registers, respectively:
\[ C^{i}_{load} = C^{i}_{wire} + \sum_{k \in F^{i}_{buf}} C^{k}_{buf} + \sum_{k \in F^{i}_{inv}} C^{k}_{inv} + \sum_{k \in F^{i}_{reg}} C^{k}_{reg}, \tag{6.1} \]
where $F^{i}_{buf}$, $F^{i}_{inv}$ and $F^{i}_{reg}$ are the fanout buffers, inverters and registers of the
buffer Bi, respectively.
6.2.2 Capacitive Load on Peak Current
At clock switching instants, the buffers (of positive and negative polarity) draw peak
currents from the vdd/gnd rails to provide a full-swing operation. The relationship
between the peak current and the capacitive load of a clock buffer can be estimated
as:
\[ I_{peak} = (C_{load} + C_{internal}) \times \frac{dV}{dt}, \tag{6.2} \]
where Ipeak is the peak current on the power rail of the clock buffer, Cload is the
external output load capacitance, Cinternal is the internal output capacitance of the
clock buffer as shown in Figure 6.1, and dV/dt is the slew rate of the clock buffer output
signal, which is itself a function of the capacitive load. In order to obtain the actual
relationship between the peak current and the capacitive load, an HSPICE simulation
is performed. The peak current on the vdd/gnd rail is observed to be a monotonically
increasing function of the capacitive load, as shown in Figure 6.2. This change in peak
current with the capacitive load is significant due to the wide range of output capacitive
loads seen by the buffers of a clock tree. For instance, in the clock tree synthesized
with IC Compiler [72] for s38584 of the ISCAS'89 benchmark circuits using BUFX4
elements only, the capacitive loads of different clock sink buffers range from 32fF to
95fF. The corresponding peak current on the vdd rail ranges from 0.93mA to
1.30mA. The peak current versus capacitive load curve of each clock
buffer from a cell library can be obtained, similar to the one shown in Figure 6.2, by
performing a limited number of simulations.


























Figure 6.2: Peak current on vdd/gnd rails of BUFX4 with capacitive load.
6.3 Problem Formulation
The peak current on the vdd/gnd rails of a design is a local effect. Each local
area represents the set of gates and registers which connect to the same vdd/gnd
rails and contribute to a local peak current [55, 88]. Thus, non-overlapping areas
can be defined for local peak current reduction. In the presented experiments, the
local areas are defined as evenly sized areas as shown in Figure 6.3, which reflect the
power/ground straps defined on the die. The extension of the proposed method to
arbitrarily shaped local areas is straightforward.
The problem of minimizing peak current on each local area is formally defined as
follows:
Given the areas for local supply rails, placement and a synthesized clock
tree of an IC, compute the polarity of each sink node of the clock tree to
minimize the total peak currents on the supply rails of each local area.
Figure 6.3: Clock tree in s15850 and local areas of P/G straps.
The peak current of a clock buffer occurs at the rising and the falling edges of the
clock signal as shown in Figure 2.4 on page 14. Thus, the problem of minimizing the
total peak current is equivalent to minimizing the peak current on the vdd/gnd rails
at the clock rising and falling edges for each local area. For instance, for the clock
tree shown in Figure 6.3, the goal of the proposed method is to minimize the worst
case peak current on the vdd/gnd rails of the nine local areas A1–A9 at the rising
and falling edges of the clock signal.
The proposed method aims to minimize the worst case peak current over all the


















The superscript i represents the index for a buffer Bi. The subscripts vdd and gnd
represent the vdd and gnd rails, respectively. The subscripts ↑ and ↓ represent the
time instants at the rising and falling edges of the source clock signal, respectively. The
terms $I^{i}_{vdd\uparrow}$, $I^{i}_{vdd\downarrow}$, $I^{i}_{gnd\uparrow}$ and $I^{i}_{gnd\downarrow}$ are the peak currents Ipeak obtained from buffer
simulation as illustrated in Figure 2.4 on page 14. The formulation presents the
problem as minimizing the worst peak current over the clock rising and falling edges at
the vdd/gnd rails of each local area. For a positive polarity clock buffer,
the peak current terms $I^{i}_{vdd\downarrow}$ and $I^{i}_{gnd\uparrow}$ can be ignored as the currents $I^{i}_{vdd\uparrow}$ and $I^{i}_{gnd\downarrow}$
are more critical, as shown in Figure 2.4(a). Similarly, for a negative polarity clock
buffer, the peak current terms $I^{i}_{vdd\uparrow}$ and $I^{i}_{gnd\downarrow}$ can be ignored as the currents $I^{i}_{vdd\downarrow}$
and $I^{i}_{gnd\uparrow}$ are more critical.

























In Equation (6.4), Pi indicates the polarity of the buffer: Pi = 0 indicates positive
polarity and Pi = 1 indicates negative polarity. The objective in Equation (6.4) is
reduced compared to the objective in Equation (6.3) due to the reduced number of
peak currents to observe.
Furthermore, it is observed that the four peak current terms shown in
Equation (6.3) and Equation (6.4) are not totally independent. Optimizing the peak
current on vdd rails optimizes the peak current on gnd rails as well. This is because
the peak currents on the vdd and gnd rails for positive and negative clock buffers
occur at opposite input clock edges, as shown in Figure 2.4 on page 14. Moreover,
optimizing the peak current on rising edges of the clock signal optimizes the peak
current on falling edges of the clock signal as well. The optimization problem
becomes minimizing the peak current on the worst rail (either vdd or gnd) or worst



























The objective of the proposed peak current minimization problem in Equation (6.5)
implies the simultaneous minimization of the peak currents in each local area. The
peak current optimization can be performed separately for each area as well, with-
out impacting the optimality of the solution. This is possible as the local areas are
non-overlapping and the peak current reduction in each area is independent. Conse-
quently, the problem of peak current minimization for one local area can be formulated
as shown in Table 6.1, which is to be performed on each area. The formulation in
Table 6.1 is in the form of a Time Minimizing Assignment Problem (TMAP) [59].
The TMAP problem can be solved optimally by a lexi-search algorithm [3, 52].
Table 6.1: Peak current reduction problem formulation for one area.
$$\min \; \max\left( \sum_{B_i \in A_j} (1 - P_i)\, I^i_{+},\; \sum_{B_i \in A_j} P_i\, I^i_{-} \right)$$
s.t. $P_i \in \{0, 1\}, \forall B_i \in A_j$
6.4 Polarity Assignment
In this section, the optimal solution to the polarity assignment problem that con-
siders the impacts of the output capacitive load on each clock sink buffer is presented.
6.4.1 Selection of the Negative Polarity Buffers/Inverters
Without loss of generality, assume that the original clock tree is synthesized exclu-
sively with (positive polarity) buffers. A negative polarity clock inverter needs to be
selected for each clock sink buffer for the purpose of polarity assignment (buffer/inverter
replacement). The criteria for choosing the negative polarity inverter for each clock
sink are: 1. Little or no skew degradation after replacing the existing clock buffers
with inverters; 2. Least delay difference with the original sink buffer after replace-
ment; 3. Least peak current weight. Based on these criteria, a negative polarity
clock inverter type is chosen for each sink level clock buffer to perform the polarity
assignment.
6.4.2 Weight of the Clock Buffers/Inverters
In order to develop a mathematical formulation, the weight of each individual
buffer and inverter in a clock tree is defined as the peak current it draws from
the power rails. To this end, the peak current curve of each buffer/inverter type
versus the capacitive load is approximated by a polynomial function. Through
experimentation, a third order polynomial is found to be sufficient for the
selected cell library. The peak current simulation and the curve fitting for a
particular buffer size are shown in Figure 6.4. The standard deviations of the
curve fitting error relative to the mean values for different sized buffers are
summarized in Table 6.2. The average error of the third order polynomial curve
fitting is only 1.0%. The order of the polynomial can be increased if higher
accuracy is required.
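As an illustration of this weight model, the following minimal Python sketch evaluates a third order polynomial of the capacitive load and reports the mean relative error against a few reference points. The coefficient values, the sample points and the function names are hypothetical placeholders chosen for the sketch; they are not taken from the cell library or from the HSPICE data of Figure 6.4.

```python
def peak_current_weight(coeffs, cap_load):
    """Evaluate c0 + c1*C + c2*C^2 + c3*C^3 with Horner's rule.

    coeffs is (c0, c1, c2, c3); cap_load is the output capacitive load C.
    """
    result = 0.0
    for c in reversed(coeffs):
        result = result * cap_load + c
    return result


def mean_relative_error(coeffs, samples):
    """Mean of |fit - simulated| / simulated over (cap_load, simulated) pairs."""
    errors = [
        abs(peak_current_weight(coeffs, cap) - sim) / sim
        for cap, sim in samples
    ]
    return sum(errors) / len(errors)


# Hypothetical cubic coefficients (c0..c3) and hypothetical reference points.
example_coeffs = (0.50, 0.04, -2.0e-4, 1.0e-6)
example_samples = [(10.0, 0.88), (50.0, 2.13), (100.0, 3.49)]

if __name__ == "__main__":
    for cap, sim in example_samples:
        fit = peak_current_weight(example_coeffs, cap)
        print(f"C={cap:6.1f}  fit={fit:.3f}  reference={sim:.3f}")
    err = mean_relative_error(example_coeffs, example_samples)
    print(f"mean relative error = {err:.2%}")
```

In a flow of this kind, one such coefficient tuple would be stored per buffer/inverter type, so that the weight of a cell can be looked up from its extracted load without rerunning the simulation.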
6.4.3 Polarity Assignment by Lexi-search Algorithm
The polarity assignment method using Lexi-Search Algorithm (LSA) is presented
in Algorithm 7, which uses the notations proposed in Table 6.3.
Figure 6.4: HSPICE simulation of buffer BUFX4 with cap load and curve fitting.
Table 6.2: Peak current curve fitting error.
Type        X2      X4      X8      X16     X32     Avg
Buffers     0.6%    2.0%    2.9%    1.1%    0.6%    1.4%
Inverters   0.2%    0.4%    1.1%    0.7%    0.3%    0.5%
Avg         0.4%    1.2%    2.0%    0.9%    0.5%    1.0%
The lexi-search algorithm starts with an initial assignment of polarities to clock
sink buffers stored in vector $w_m$. The worst case peak current corresponding to the
initial assignment is set as the upper bound of the worst case peak current $I^{min}_{peak,m}$.
The algorithm then iteratively switches the polarity of each buffer to search for the
optimal polarity assignment. During the search process, the value of $I^{cur}_{peak,m}$ is
updated every time a buffer is assigned a polarity, according to the polarity
assigned buffers in stack $pw_m$. After each optimization iteration, the value of
$I^{cur}_{peak,m}$ becomes the new upper bound of the peak current $I^{min}_{peak,m}$. The initial
polarity assignment is found with a heuristic algorithm as shown in Algorithm 8,
which is an integral part of the lexi-search process.
Table 6.3: Notations used in Algorithm 7.
$w_m$               A static vector of size $n_m$ (number of buffers in area $A_m$)
                    which indicates the polarity assignment of each buffer for the
                    current best peak current value in area $A_m$. The value at
                    each entry of the vector $w_m$ is 0 or 1, representing positive
                    and negative polarity, respectively.
$pw_m$              A stack which stores the current polarity assigned buffers and
                    their polarities in area $A_m$. The stack $pw_m$ is empty at the
                    beginning. After each buffer $B_i$ is assigned a polarity,
                    the buffer with its polarity information is pushed onto the top
                    of the stack $pw_m$.
$I^{min}_{peak,m}$  The upper bound of the total peak current weight in area $A_m$.
                    The value of $I^{min}_{peak,m}$ is updated during runtime. It always
                    stores the best total peak current weight found so far.
$I^{cur}_{peak,m}$  The current total peak current weight corresponding to the
                    polarity assignment in $pw_m$ in area $A_m$. The value of $I^{cur}_{peak,m}$
                    is updated during runtime. It stores the total peak current
                    weight according to the current polarity assignment.
$i$                 The buffer index at the top of the stack $pw_m$.
$P_i$               The polarity (0 indicates positive polarity, 1 indicates
                    negative polarity) of buffer $i$ in stack $pw_m$.
6.4.4 The Area Definition
In the proposed method, the polarity assignment method is performed on each
local area. The area definition is based on the power network and the connections
of the clock sink buffers to the power network similar to the definition in [55]. The
polarity assignment is performed at the post-placement stage when the power network
routing is already completed. Thus, the areas for polarity assignment are clearly
defined at the polarity assignment stage. The buffers impact the peak current on
the power pads that are in the closest physical proximity more significantly than
the other pads on the power grid. Such an assumption enables the definition of local
areas, which is also adopted in prior art such as [65] and [37] to model the locality
effect of the peak currents. As a result, minimizing the peak current in each local
area is considered sufficient for significant power noise reduction.
Algorithm 7 The polarity assignment method using Lexi-Search Algorithm (LSA).
Input: $I^i_+$ and $I^i_-$ for each buffer $B_i$, set $A_m$ for each local area
Output: $P_i$ for each buffer $B_i$
1: for each area $A_m$ do
2:   Find an initial polarity assignment of buffers in area $A_m$ using the heuristic
     approach presented in Algorithm 8. Assign $w_m$ and $I^{min}_{peak,m}$ with the initial
     assignment and the corresponding minimum peak current value, respectively.
     $n_m$ is set to the number of buffers in area $A_m$. Index the buffers in area $A_m$
     from 0 to $n_m - 1$.
3:   Set $i = 0$, $P_i = 0$. Push buffer $B_i$ onto stack $pw_m$.
4:   4.a: If $P_i = 0$, pop buffer $B_i$ from stack $pw_m$. Set $P_i = 1$. Go to step 5.
     4.b: If $P_i = 1$, pop buffer $B_i$ from stack $pw_m$, set $i = i - 1$. Go to step 6.
5:   Assign buffer $B_i$ with polarity $P_i$. Push buffer $B_i$ onto stack $pw_m$. Compute
     the current total peak current value $I^{cur}_{peak,m}$ corresponding to $pw_m$.
     5.a: If $I^{cur}_{peak,m} < I^{min}_{peak,m}$, go to step 7.
     5.b: If $I^{cur}_{peak,m} \geq I^{min}_{peak,m}$ and $i > 0$, go to step 4.
     5.c: If $I^{cur}_{peak,m} \geq I^{min}_{peak,m}$ and $i = 0$, go to step 8.
6:   6.a: If $i > 0$ and each buffer in $pw_m$ is assigned negative polarity, go to step 7.
     6.b: If $i > 0$ and not all buffers in $pw_m$ are assigned negative polarity, go to
          step 4.
     6.c: If $i = 0$ and the polarity of buffer $B_0$ is positive ($P_0 = 0$), go to step 4.
     6.d: If $i = 0$ and the polarity of buffer $B_0$ is negative ($P_0 = 1$), go to step 8.
7:   7.a: If $i < n_m - 1$, set $i = i + 1$, set $P_i = 0$, go to step 5.
     7.b: If $i = n_m - 1$, a new polarity assignment is generated. Since this new
          assignment has $I^{cur}_{peak,m} < I^{min}_{peak,m}$, set $I^{min}_{peak,m} = I^{cur}_{peak,m}$ and set $w_m$ to be
          the new polarity assignment. Also set $pw_m = \emptyset$, go to step 3.
8:   Report the optimal assignment in area $A_m$ for peak current as $w_m$.
9: end for
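The search of Algorithm 7 can be illustrated for one local area with the minimal Python sketch below. It is a simplified restatement under stated assumptions, not the implementation used in this work: it uses a recursive depth-first enumeration in lexicographic order with bound-based pruning instead of the explicit stack and step numbering, and it starts from the all-positive assignment as the initial upper bound rather than from the heuristic of Algorithm 8. The function name is hypothetical; the weights in the usage example are those of the illustration in Table 6.9.

```python
def lexi_search_polarity(i_pos, i_neg):
    """Return (worst_peak, polarities) minimizing max(sum of I+, sum of I-).

    i_pos[k] / i_neg[k]: peak current weight of buffer k when it has
    positive / negative polarity. polarities[k] is 0 (positive) or 1 (negative).
    """
    n = len(i_pos)
    best_assignment = [0] * n      # plays the role of the vector w_m
    best_peak = sum(i_pos)         # upper bound: every buffer positive (assumption)
    current = []                   # plays the role of the stack pw_m

    def search(k, sum_pos, sum_neg):
        nonlocal best_peak, best_assignment
        # Partial sums only grow, so a partial assignment that already
        # reaches the bound cannot lead to a better complete assignment.
        if max(sum_pos, sum_neg) >= best_peak:
            return
        if k == n:
            best_peak = max(sum_pos, sum_neg)
            best_assignment = list(current)
            return
        for polarity in (0, 1):    # lexicographic order: 0 before 1
            current.append(polarity)
            if polarity == 0:
                search(k + 1, sum_pos + i_pos[k], sum_neg)
            else:
                search(k + 1, sum_pos, sum_neg + i_neg[k])
            current.pop()

    search(0, 0, 0)
    return best_peak, best_assignment


if __name__ == "__main__":
    weights_pos = [112, 109, 108, 108, 77, 112]
    weights_neg = [130, 126, 125, 125, 101, 129]
    print(lexi_search_polarity(weights_pos, weights_neg))
    # -> (351, [0, 0, 1, 1, 1, 0])
```

Because the bound is tightened every time a better complete assignment is found, stopping the enumeration early still leaves a feasible assignment in `best_assignment`.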
Algorithm 8 Heuristic algorithm for obtaining a good initial assignment.
Procedure init assignment(m)
Input: $I^i_+$ and $I^i_-$ for each buffer $B_i$ in local area $A_m$
Output: Pi for each buffer Bi and the corresponding peak current value
1: Set Um = {Bi|Bi ∈ Am}. Set Pi = 0, ∀Bi ∈ Am.
2: Find buffers Bi and Bj which lead to the minimum peak currents on the vdd and






Ik−. Pi = 0, Pj = 1, Um = Um − {Bi},
Um = Um − {Bj}.
3: repeat




















6: Pi = 0, Um = Um − {Bi}.
7: else
8: Pj = 1, Um = Um − {Bj}.
9: end if
10: until Um = ∅
6.4.5 The Comparison of Lexi-search Algorithm with Dynamic Programming Algorithm
In [37], the polarity assignment problem is formulated as a knapsack problem and
solved by a pseudo-polynomial dynamic programming algorithm. In this work, the
polarity assignment problem is solved by a lexi-search algorithm. The advantages of
the lexi-search algorithm compared with the dynamic programming approach are:
(i) The lexi-search algorithm obtains the initial upper bound (a feasible solution
for the problem) in O(n log n) time, and very often the initial upper bound is
very close to the final optimal solution [56]. This is critical for designs with a
very high number of buffers in each local area, which cannot be solved optimally.
(ii) After obtaining the initial upper bound, the solution is improved during each
iteration. Thus, an upper bound on the runtime can be set for the lexi-search
algorithm. If the runtime exceeds the tolerance, the algorithm can be terminated
and a good sub-optimal solution is obtained. In general, the optimal solution of
the lexi-search algorithm is often obtained in early iterations [56]. In contrast,
the dynamic programming approach obtains a feasible result only when the
algorithm finishes running.
(iii) The runtime of the dynamic programming approach is affected by the desired
accuracy because the complexity of the algorithm is determined by the sum of the
weights of all the buffers. In contrast, the runtime of the lexi-search algorithm
is not affected by the accuracy.
Figure 6.5: The runtime and optimality comparison of Dynamic Programming (DP)
approach and the Lexi-Search Algorithm (LSA).
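To make point (iii) concrete, the following Python sketch shows a generic pseudo-polynomial dynamic program for the same min-max objective. It is an assumption-laden illustration and is not the algorithm of [37]: weights are assumed to be non-negative integers, and the table is indexed by the positive-side sum, so its size (and hence the runtime) grows with the sum of the weights. The function name is hypothetical.

```python
def dp_polarity(i_pos, i_neg):
    """Minimum over all polarity assignments of max(sum of I+, sum of I-).

    best_neg[s] holds the smallest negative-side sum among assignments of
    the buffers processed so far whose positive-side sum is exactly s.
    Weights must be non-negative integers.
    """
    inf = float("inf")
    total_pos = sum(i_pos)
    best_neg = [inf] * (total_pos + 1)
    best_neg[0] = 0
    for w_pos, w_neg in zip(i_pos, i_neg):
        nxt = [inf] * (total_pos + 1)
        for s, neg in enumerate(best_neg):
            if neg == inf:
                continue
            # Option 1: give this buffer positive polarity.
            if neg < nxt[s + w_pos]:
                nxt[s + w_pos] = neg
            # Option 2: give this buffer negative polarity.
            if neg + w_neg < nxt[s]:
                nxt[s] = neg + w_neg
        best_neg = nxt
    return min(max(s, neg) for s, neg in enumerate(best_neg) if neg != inf)


if __name__ == "__main__":
    weights_pos = [112, 109, 108, 108, 77, 112]
    weights_neg = [130, 126, 125, 125, 101, 129]
    print(dp_polarity(weights_pos, weights_neg))  # -> 351
```

Representing the same weights with one more decimal digit (multiplying them by ten) makes the table ten times larger, which is the accuracy dependence described in (iii).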
The optimality of the dynamic programming approach and the lexi-search approach
is the same. However, the lexi-search algorithm can trade optimality for runtime.
The runtime and the optimality of the lexi-search method and the dynamic
programming method are compared using a C++ program. The results are summarized
in Figure 6.5. Note that the runtime granularity in C++ is 10ms. The algorithm
typically finishes in 1ms for a small number of buffers. The runtime is thus
calculated by running the algorithm 10000 times and then dividing the total
runtime by 10000. In Figure 6.5(a), the x-axis is the number of sink buffers in a
typical local area, ranging from two to ten for polarity assignment. Note that
these buffer numbers are reasonable as local areas are preferably small such that
the local peak currents are balanced. The y-axis is the runtime in logarithmic
scale for the dynamic programming (DP) approach and the lexi-search algorithm
(LSA). The runtimes of the lexi-search algorithm terminated after one to four
optimization iterations are labeled LSA 1 to LSA 4, respectively. It is observed
that the lexi-search algorithm is faster than the dynamic programming in [37] if
the number of sink buffers is less than six. However, if the lexi-search is
terminated in early optimization iterations (e.g. 3 optimization iterations), the
lexi-search algorithm is faster for all buffer numbers of interest and the
solution remains close to optimal, as shown in Figure 6.5(b). The corresponding
optimality of the two methods is summarized in Figure 6.5(b). The x-axis is the
number of sink buffers ranging from two to ten for polarity assignment. The y-axis
is the optimality of the results. The peak current is scaled such that “1”
represents the optimal peak current. Thus the DP approach and the LSA approach run
to completion always provide the optimal solution “1”. If the LSA approach is
terminated in early optimization iterations, the solution may not be optimal, but
the solution improves with more optimization iterations. This trend is observed in
Figure 6.5(b). Even when the lexi-search algorithm is terminated after the first
optimization iteration, the result is only 2.2% off the optimal solution on
average. The runtime for one iteration, on the other hand, is significantly
reduced compared to the DP approach. With two or more optimization iterations, the
results are very close to the optimal solution, and the runtime is reduced
significantly.
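The measurement procedure described above (repeating a short computation many times and dividing the total by the repetition count) can be sketched as follows. The experiments in this work use a C++ program; this Python sketch only illustrates the averaging idea, and the function name and the measured workload are hypothetical.

```python
import time


def average_runtime(func, repetitions=10000):
    """Average wall-clock time of one call to func, in seconds.

    Timing many repetitions as one block avoids the problem that a single
    call may be shorter than the resolution of the timer.
    """
    start = time.perf_counter()
    for _ in range(repetitions):
        func()
    elapsed = time.perf_counter() - start
    return elapsed / repetitions


if __name__ == "__main__":
    # Hypothetical workload standing in for one polarity assignment run.
    avg = average_runtime(lambda: sorted(range(100), reverse=True))
    print(f"average runtime per call: {avg * 1e6:.2f} us")
```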
In summary, although the lexi-search is exponential in the worst case, the runtime
of the lexi-search may be better when the number of buffers is small, which is the
case in polarity assignment as the local areas are preferably small. Moreover,
the lexi-search algorithm can terminate in early iterations to save runtime with
very limited optimality degradation. As noted earlier, from a general perspective,
the lexi-search algorithm is also important in providing an alternative solution
method to the peak current minimization problem, particularly for research that
builds upon the original formulations in the prior art. For instance, one
contribution of this work is the recognition of the peak current change with buffer
type, size and capacitive load. The TMAP provides a framework to mathematically
formulate the proposed model for optimization.
Table 6.4: Benchmark information.
Circuit   # cells   # sinks   # buffers   # levels   runtime (s)
s13207    2876      638       48          3          < 1
s15850    1895      496       37          3          < 1
s35932    6140      1728      131         3          1
s38417    6337      1636      128         3          1
s38584    13281     1426      109         3          1
6.4.6 Experimental Results
The proposed clock buffer polarity assignment methodology is evaluated on a suite
of benchmark circuits from ISCAS’89. The Synopsys design flow is adopted in the
experiments and a 90nm cell library [73] is used to synthesize the benchmark
circuits. The benchmark circuits are synthesized using Design Compiler [69].
Clock tree synthesis is performed using the physical design tool IC Compiler [72].
The proposed polarity assignment method is implemented in C++ and applied to the
physically synthesized circuits. After polarity assignment, extraction is
performed using Star-RCXT [71], followed by a post layout simulation using the
HSPICE simulator in Nanosim [70].
Table 6.5: Peak current reduction on the largest ISCAS’89 benchmark circuits
(typical operating condition).
          Org                Not considering cap load    Considering cap load
Circuit   I (mA)  tcs (ps)   I (mA)  Impro.  tcs (ps)    I (mA)  Impro.  Add.    tcs (ps)
s13207    8.4     26.8       6.7     20.2%   25.5        5.8     31.0%   13.4%   26.1
s15850    10.3    28.8       7.0     32.0%   29.5        6.5     36.9%   8.5%    30.4
s35932    18.1    38.9       12.4    31.8%   38.7        11.7    35.4%   5.6%    38.8
s38417    24.1    29.8       17.0    29.5%   29.0        15.3    36.5%   10.0%   25.3
s38584    18.5    53.4       12.8    30.8%   51.8        11.3    38.9%   11.7%   49.0
Avg.      -       35.5       -       29.6%   34.9        -       36.5%   9.8%    33.9
Details about the benchmark circuits, the synthesized clock trees and the runtimes
for polarity assignment on these trees are presented in Table 6.4. In particular, the
columns #cells, #sinks, #buffers and #levels present the number of cells in the
design, the number of clock tree sinks, the number of buffers in the clock tree and
the number of levels in the clock tree, respectively. The column runtime shows
the runtime of solving the polarity assignment on a 2.2GHz processor by applying
the proposed algorithms. The results show that on all the benchmark circuits, the
runtime of the proposed algorithms is at most about one second for the small clock
trees necessary to synchronize the academic benchmarks.
The proposed polarity assignment method is implemented and evaluated with and
without considering the capacitive load. The peak current and skew changes are
reported for both the typical and worst case PVT corners. These two corners are
investigated separately as the propagation delays through the buffers behave
differently at these corners. In practice, a designer can choose to focus on one
or both of the corners in the clock polarity assignment process. The peak current
reduction results for the typical and worst operating conditions are summarized in
Table 6.5 and Table 6.6, respectively. The multi-column “Org” summarizes the
results of the circuits where no polarity assignment is performed. The columns
“I (mA)” present the worst case peak current on the vdd/gnd rails of the clock
tree in a design. The columns “tcs (ps)” present the global clock skew of the
benchmark circuit. The improvement in the “Not considering cap load” case is
computed with respect to the “Org” case. The improvements in the “Considering cap
load” case are computed with respect to the “Org” and the “Not considering cap
load” cases. It is observed that an average improvement of 29.6% and 29.2% on the
peak current is achieved by applying the proposed method without considering
capacitive load in the typical corner and worst corner, respectively. The
additional improvement by considering the capacitive load of each clock buffer is
summarized in Column “Add.”. The additional improvement is calculated as the
improvement percentage of the peak current considering the capacitive load over
the peak current without considering the capacitive load (thus, it is not
additive). By considering the capacitive load of each clock buffer in the
computation of polarity assignment, a further improvement of 9.8% and 9.2% on
average is achieved for the worst case peak current reduction in the typical
corner and the worst corner, respectively. The overall peak current reduction is
36.5% and 35.9% for the typical corner and worst corner, respectively. These
enhanced improvements are achieved by considering the impacts of capacitive load
on peak current.
Table 6.6: Peak current reduction on the largest ISCAS’89 benchmark circuits
(worst operating condition).
          Org                Not considering cap load    Considering cap load
Circuit   I (mA)  tcs (ps)   I (mA)  Impro.  tcs (ps)    I (mA)  Impro.  Add.    tcs (ps)
s13207    8.0     50.8       6.3     21.3%   64.3        5.2     35.0%   17.5%   65.4
s15850    9.8     52.5       6.7     31.6%   99.6        6.2     36.7%   7.5%    73.7
s35932    17.3    66.6       11.9    31.2%   76.8        11.1    35.8%   6.7%    77.3
s38417    22.8    50.2       15.6    31.6%   64.4        15.0    34.2%   3.9%    63.3
s38584    17.7    83.5       12.3    30.5%   89.9        11.0    37.9%   10.6%   101.1
Avg.      -       60.7       -       29.2%   79.0        -       35.9%   9.2%    76.2
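The relation between the “Impro.” and “Add.” columns can be checked with a short worked example. The Python sketch below uses the s13207 row of Table 6.5 (8.4mA, 6.7mA and 5.8mA); the function and variable names are hypothetical. It also shows why the percentages are not additive: the two reductions compose multiplicatively.

```python
def improvement(before, after):
    """Percentage reduction of `after` relative to `before`."""
    return (before - after) / before * 100.0


# s13207, typical corner (Table 6.5), peak currents in mA.
org, no_cap, with_cap = 8.4, 6.7, 5.8

impro_no_cap = improvement(org, no_cap)       # "Impro." without cap load
impro_with_cap = improvement(org, with_cap)   # "Impro." with cap load
additional = improvement(no_cap, with_cap)    # "Add."

# Composition of the two reductions (multiplicative, not additive).
composed = (1.0 - (1.0 - impro_no_cap / 100.0) * (1.0 - additional / 100.0)) * 100.0

if __name__ == "__main__":
    print(round(impro_no_cap, 1), round(impro_with_cap, 1), round(additional, 1))
    # -> 20.2 31.0 13.4
    print(round(composed, 1))
    # -> 31.0
```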
As a part of the clock polarity assignment, buffers and inverters are replaced with
inverters and buffers, respectively, which have different timing characteristics.
This leads to a potential clock skew degradation on the clock tree. As observed
from the experimental results in Table 6.5, the clock skews under the typical
operating condition are maintained after polarity assignment. However, the clock
skews after polarity assignment under the worst operating condition increase, as
reported in Table 6.6. This is observed even though the inverters that replace the
buffers are chosen to have the smallest delay difference with the replaced buffers,
as discussed in Section 6.4.1. The average worst corner clock skews before and
after polarity assignment are 60.7ps and 76.2ps, respectively. This is because the
delay difference between buffers and inverters is larger in the worst corner than
in the typical corner, as shown in Figure 6.6. The worst corner clock skews without
and with considering the capacitive load (79.0ps vs. 76.2ps on average) are similar
when applying the proposed method (which chooses the negative polarity
buffers/inverters to have a delay similar to the one originally on the clock tree),
as shown in Table 6.6. Thus, considering the capacitive load in peak current
minimization does not worsen the global clock skew.
6.4.7 Discussions on the Peak Current Weight
In Figure 2.4 on page 14, it is observed that for a positive polarity buffer, the
peak current on the vdd rail at the input falling edge and the peak current on the
gnd rail at the input rising edge are not zero. There is a small current peak at
the clock edge opposite to the one where the actual peak current occurs [e.g.
$I_{vdd\downarrow}$ and $I_{gnd\uparrow}$ for a positive polarity buffer as shown in Figure 2.4(a), and
$I_{vdd\uparrow}$ and $I_{gnd\downarrow}$ for a negative polarity buffer as shown in Figure 2.4(b)]. In the
proposed method and the previous works [13, 37, 65], the small current peak is
ignored during polarity assignment as it does not define the “peak” value. In this
section, numerical results are generated to demonstrate that ignoring the small
current peak does not significantly degrade the optimization level.
Figure 6.6: Delays vs. cap load for BUFX4 and IBUFX4 under typical and worst
operating conditions.
Let $I^i_{+s}$ and $I^i_{-s}$ be the weights of the small current peak for buffer $B_i$ with
positive and negative polarity, respectively. Integrating the small current peak,
the formulation in Table 6.1 is re-written as the formulation in Table 6.7.
Table 6.7: Peak current reduction problem formulation for one area considering the
small current peak.
$$\min \; \max\left( \sum_{B_i \in A_j} \left( (1 - P_i)\, I^i_{+} + P_i\, I^i_{-s} \right),\; \sum_{B_i \in A_j} \left( P_i\, I^i_{-} + (1 - P_i)\, I^i_{+s} \right) \right)$$
s.t. $P_i \in \{0, 1\}, \forall B_i \in A_j$
The numerical peak current results shown in Table 6.8 are computed with the
ILP formulations in Table 6.1 and Table 6.7, which demonstrate the difference of
the results obtained from these formulations. The results computed from the ILPs
are more descriptive than HSPICE simulations in terms of observing the optimality
of these formulations because the HSPICE simulations consider other factors that
might not be modeled accurately in the formulations. To this end, the columns “w/”
and “w/o” summarize the worst case peak current with and without considering the
small current peak, respectively. The large current peak and the small current peak
are obtained through HSPICE simulations. The small current peak is around 30%
of the large current peak in the simulated buffers and inverters. It is observed that
without considering the small current peak, the optimization level for the worst case
peak current is degraded by 2.0% only, which is very limited. Thus, the small current
peak can be ignored during the polarity assignment for reduction in the problem size,
which improves scalability.
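The effect of the small current peak on the objective can be illustrated with the Python sketch below. It rests on two assumptions that are not confirmed by the text: the objective is taken to be the larger of two sums, where each buffer adds its large peak to the sum of its own polarity and its small peak to the other sum, and the small peak is modelled as a fixed fraction (`small_ratio`) of the large peak, motivated by the roughly 30% ratio reported above. The weights are those of the illustration in Table 6.9 and the function name is hypothetical.

```python
def worst_peak(i_pos, i_neg, polarities, small_ratio=0.0):
    """Worst of the two rail/edge sums for a given polarity assignment.

    polarities[k] is 0 (positive) or 1 (negative). With small_ratio = 0
    the small current peaks are ignored.
    """
    sum_a = 0.0  # large peaks of positive buffers + small peaks of negative ones
    sum_b = 0.0  # large peaks of negative buffers + small peaks of positive ones
    for w_pos, w_neg, p in zip(i_pos, i_neg, polarities):
        if p == 0:
            sum_a += w_pos
            sum_b += small_ratio * w_pos
        else:
            sum_b += w_neg
            sum_a += small_ratio * w_neg
    return max(sum_a, sum_b)


if __name__ == "__main__":
    weights_pos = [112, 109, 108, 108, 77, 112]
    weights_neg = [130, 126, 125, 125, 101, 129]
    assignment = [0, 0, 1, 1, 1, 0]
    print(worst_peak(weights_pos, weights_neg, assignment))  # -> 351.0
    print(round(worst_peak(weights_pos, weights_neg, assignment, 0.3), 1))  # -> 450.9
```

Evaluating a candidate assignment under both settings in this way is one simple means of estimating how much is lost by optimizing without the small peaks.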
Table 6.8: The optimization level w/ and w/o considering the small current peak.
Circuit   w/ (mA)   w/o (mA)   Degradation
s13207    4.91      5.00       1.8%
s15850    5.18      5.30       2.3%
s35932    12.81     13.20      3.0%
s38417    13.50     13.81      2.3%
s38584    13.11     13.15      0.3%
Average   -         -          2.0%
6.5 Heuristic Tuning for Skew Minimization
As observed in the experimental results, the skew is maintained after polarity
assignment under typical operating conditions. However, under the worst case
operating conditions, the clock skew is increased due to two primary reasons:
1. The delay difference between buffers and inverters is larger in the worst
corner; 2. The slopes of the delay vs. capacitive load curves for buffers/inverters
are greater in the worst corner for the technology library used, as shown in
Figure 6.6. A heuristic skew tuning stage is proposed for skew minimization, which
is effective for all operating conditions but particularly so for the worst
operating condition.
The motivation for skew tuning after polarity assignment is presented in Sec-
tion 6.5.1. The proposed skew tuning scheme is presented in Section 6.5.2. The
experimental results are summarized in Section 6.5.3.
6.5.1 Performing Skew Tuning at the Post-PA Stage
The skew tuning procedure of the proposed method is performed at the post
polarity assignment stage. This is because when a buffer is replaced by an inverter
(or vice versa) or by a different size buffer, the delays from the clock source to
all the other buffers at the same level, which share a parent buffer with the
replaced buffer, change. This fact is ignored in previous works, including [37],
which sacrifice timing accuracy to propose optimal solutions to simplified problem
models. A clock tree illustration is shown in Figure 6.7 to demonstrate this timing
inaccuracy. The parent buffer P1 drives the sink buffers B1–B4. Assume originally
the four sink buffers B1–B4 are size X2. An HSPICE simulation on the clock tree
demonstrates that the delays from the input of the buffer P1 to the output of the
sink buffers B1–B4 are 322ps. If buffer B2 is sized up to X16, the delays at the
output of buffers B1, B3 and B4 increase to 331ps, which increases the delay of the
unchanged buffers by 9ps. If buffers B2 and B3 are sized up to X16, the delays at
the output of buffers B1 and B4 increase to 340ps. If three buffers B2–B4 are sized
up to X16, the delay at the output of buffer B1 increases to 350ps. Although the
size and capacitive load of buffer B1 stay the same during the sizing process, the
delay at the output of buffer B1 changes due to the changes in the sibling buffer
sizes. This is because when the sibling buffers change their types or sizes, their
input capacitance changes, and thus the output capacitive load of their parent
buffer P1 changes. As such, the delays on all the branches which have the same
parent buffer P1 are affected.
Based on the above reasoning, the changes in the buffer types and sizes affect the
clock skew during the polarity assignment process. These clock skew changes include
delay changes on the clock branches where no polarity assignment is performed. As
a result, the delay of each clock branch is difficult to predict during the polarity
assignment stage (this is the source of timing inaccuracy in previous works). To this
end, the skew is tuned after the polarity assignment in order to finalize the global
clock skew.
[Figure: parent buffer P1 driving four sink buffers B1, B2, B3 and B4, each of size X2]
Figure 6.7: A clock branch illustration.
6.5.2 Skew Tuning Scheme







The global clock skew $t_{skew}$ is defined as
$$t_{skew} = \max_i t_i - \min_j t_j \qquad (6.6)$$
where $t_i$ is the clock arrival time at the sinks on a clock branch i. The clock tree
branches i and j, which have the maximum arrival time $t_i$ and minimum arrival
time $t_j$, respectively, are called the critical clock branches. In the experiments,
it is observed that the critical paths always involve two opposite polarity
branches. Since the polarity assignment is blind to the clock arrival times, it is
highly possible that the latest clock arrival times become worse while the earliest
clock arrival times remain unchanged. Thus, if the polarities of the two critical
clock tree branches are switched, the global skew is likely to be reduced. This
process is illustrated with the following example. Assume there are 6 (six) sink
buffers in the clock tree of the original design. The clock arrival times $t_i$ and
the peak current weights of these sink buffers are summarized in the rows titled
“Before polarity assignment” in Table 6.9. Note that the numbers in Table 6.9 are
empirical but the results can be generalized.
According to Equation (6.6), the critical clock branches before polarity assignment
are branch 3 and branch 6 with a global clock skew of t3 − t6 = 250 − 200 = 50ps.
After applying the proposed polarity assignment method, 3 (three) out of the 6 (six)
buffers, indexed 3–5, are assigned negative polarity and thus are replaced by
inverters. The corresponding clock arrival times of the six buffers after polarity
assignment become 220ps, 210ps, 300ps, 280ps, 290ps and 200ps, respectively. The
critical clock branches after polarity assignment are still 3 and 6, and the clock
skew increases to t3 − t6 = 300 − 200 = 100ps. The clock arrival time and peak
current information after polarity assignment are shown in the rows titled “After
optimal polarity assignment” in Table 6.9. The sink of the clock branch 3 has
negative polarity and the sink of the clock branch 6 has positive polarity. If the
polarities of the sink buffers of these two critical clock branches are switched,
the polarity of sink buffer 3 becomes positive and the polarity of sink buffer 6
becomes negative. As a result, their clock arrival times become 250ps and 250ps,
respectively. The critical clock branches now become branch 5 and branch 2 and the
corresponding global clock skew is reduced to t5 − t2 = 290 − 210 = 80ps. The
critical clock branches in this setting, 2 and 5, also have opposite clock
polarities. If the clock sink buffer polarities on these critical clock branches
are switched, the clock skew can be further reduced.
Table 6.9: An illustration of polarity assignment and skew tuning.
Before polarity assignment
Buffer         1     2     3     4     5     6
$I^i_+$        112   109   108   108   77    112
$I^i_-$        130   126   125   125   101   129
Polarities     +     +     +     +     +     +
ti (ps)        220   210   250   230   240   200
Skew           $t_{skew} = \max_i t_i - \min_j t_j = t_3 - t_6 = 50$ps
Peak current   $I^{total}_{peak} = \sum_{i=1}^{6} I^i_+ = 626$
After optimal polarity assignment, (3, 4, 5) → −
Buffer         1     2     3     4     5     6
Polarities     +     +     -     -     -     +
ti (ps)        220   210   300   280   290   200
Skew           $t_{skew} = \max_i t_i - \min_j t_j = t_3 - t_6 = 100$ps
Peak current   $I^{total}_{peak} = \max\left(\sum_{i \in \{1,2,6\}} I^i_+, \sum_{i \in \{3,4,5\}} I^i_-\right) = 351$
After skew tuning, (3, 4, 5) → +, (1, 2, 6) → −
Buffer         1     2     3     4     5     6
Polarities     -     -     +     +     +     -
ti (ps)        270   260   250   230   240   250
Skew           $t_{skew} = \max_i t_i - \min_j t_j = t_1 - t_4 = 40$ps
Peak current   $I^{total}_{peak} = \max\left(\sum_{i \in \{3,4,5\}} I^i_+, \sum_{i \in \{1,2,6\}} I^i_-\right) = 385$
The clock arrival time and the peak current information after skew tuning are
presented in rows titled “After skew tuning” in Table 6.9. The clock skew after the
skew tuning is 40ps which is a 60% improvement over the initial assignment with
skew of 100ps.
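The worked example of Table 6.9 can be replayed with the minimal Python sketch below. It follows the example in the text (the polarities of the two critical branches are swapped while they are opposite and the skew keeps decreasing) and makes one strong simplifying assumption: each branch has a fixed arrival time for each polarity, so the sibling-load effect that requires incremental timing analysis in the actual flow is ignored. The function names and the `eps` argument (standing for the user defined peak current bound) are hypothetical.

```python
import math


def arrival_times(t_pos, t_neg, pol):
    """Arrival time of each branch for the given polarities (0 = +, 1 = -)."""
    return [tn if p else tp for tp, tn, p in zip(t_pos, t_neg, pol)]


def worst_peak(i_pos, i_neg, pol):
    """max(sum of I+ over positive buffers, sum of I- over negative buffers)."""
    return max(sum(w for w, p in zip(i_pos, pol) if p == 0),
               sum(w for w, p in zip(i_neg, pol) if p == 1))


def tune_skew(t_pos, t_neg, i_pos, i_neg, pol, eps=math.inf):
    """Swap polarities of the critical branches while the skew improves."""
    pol = list(pol)
    peak_opt = worst_peak(i_pos, i_neg, pol)   # peak before tuning
    while True:
        times = arrival_times(t_pos, t_neg, pol)
        cur_skew = max(times) - min(times)
        late = times.index(max(times))         # latest critical branch
        early = times.index(min(times))        # earliest critical branch
        if pol[late] == pol[early]:
            break                              # same polarity: nothing to swap
        trial = list(pol)
        trial[late], trial[early] = pol[early], pol[late]
        t_trial = arrival_times(t_pos, t_neg, trial)
        if (max(t_trial) - min(t_trial) >= cur_skew
                or worst_peak(i_pos, i_neg, trial) > peak_opt + eps):
            break                              # no gain or peak bound violated
        pol = trial
    times = arrival_times(t_pos, t_neg, pol)
    return pol, max(times) - min(times), worst_peak(i_pos, i_neg, pol)


if __name__ == "__main__":
    t_pos = [220, 210, 250, 230, 240, 200]     # Table 6.9, positive polarity
    t_neg = [270, 260, 300, 280, 290, 250]     # Table 6.9, negative polarity
    i_pos = [112, 109, 108, 108, 77, 112]
    i_neg = [130, 126, 125, 125, 101, 129]
    print(tune_skew(t_pos, t_neg, i_pos, i_neg, [0, 0, 1, 1, 1, 0]))
    # -> ([1, 1, 0, 0, 0, 1], 40, 385)
```

With the default unbounded `eps`, the sketch passes through skews of 100ps, 80ps, 60ps and 40ps before stopping, and the peak current weight rises from 351 to 385, as in Table 6.9.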
As shown in Table 6.9, the clock tree with skew tuning has a global skew comparable
to the skew of the original clock tree (40ps versus 50ps). Skew tuning is achieved
at a trade-off against the peak current. The peak current on the vdd/gnd rails of
each local area may not be minimal after skew tuning because of the polarity
switching for skew refinement. In the presented example, the peak current value
after skew tuning increases from 351 to 385 as shown in Table 6.9. In order to
avoid a large peak current increase, a user defined parameter ε is introduced to
guarantee that the worst case peak current on the skew tuned circuit is never more
than ε larger than the optimal peak current. In practice, ε can be determined by
the level of vdd/gnd noise that is tolerable, where a higher ε permits a higher
vdd/gnd noise.
The replacement of a buffer with an inverter or vice versa on the clock branch i
not only affects the delay of the clock branch i but also affects the delay of all
the other branches which share the same parent buffer as the replaced buffer. Thus,
an incremental timing analysis is performed after each iteration of the clock
buffer polarity switching. The heuristic algorithm utilizes the timing analysis
tool embedded in IC Compiler. In each iteration, the proposed method performs
timing analysis and identifies the critical path on the clock tree. This heuristic
method for skew tuning is presented in Algorithm 9. The algorithm iteratively
switches the polarity of the critical clock branches, starting with the buffer
nodes with the largest delay difference, and performs static timing analysis
during each iteration. The method terminates when there is no more reduction in
clock skew or the peak current increase is higher than the user defined
parameter ε. Although simple, the algorithm is very effective, as will be presented
in Section 6.5.3. This simplicity is desirable because it adds minimal algorithmic
complexity to the peak current minimization process.
Algorithm 9 Heuristic algorithm for skew tuning.
Input: The polarity assigned design
1: Initialize $t^i_{skew'} = \infty$ for each area $A_i$
2: repeat
3:   Perform timing analysis and obtain the current $t^i_{skew}$ (clock skew in each local
     area $A_i$).
4:   for all Area $A_i$ do
5:     Compute $t^i_{skew}$ for Area $A_i$.
6:     if $t^i_{skew} > t^i_{skew'}$ or $I_{cur} > I_{min} + \epsilon$ then
7:       Undo the previous buffer replacement.
8:       Mark area $A_i$ as optimized.
9:     else
10:      Find clock branch m such that $t_m = \max_{k \in \text{Area } A_i} t_k$ and clock branch n such
         that $t_n = \min_{k \in \text{Area } A_i} t_k$.
11:      if $P_m = 0$ and $P_n = 1$ then
12:        Let $P_m = 1$, $P_n = 0$, replace buffers accordingly.
13:      else if $P_m = 1$ and $P_n = 1$ then
14:        Find l such that $t_l = \max_{k \in \text{Area } A_i, P_k = 0} t_k$.
15:        Let $P_l = 1$, $P_n = 0$, replace buffers accordingly.
16:      else if $P_m = 0$ and $P_n = 0$ then
17:        Find l such that $t_l = \min_{k \in \text{Area } A_i, P_k = 1} t_k$.
18:        Let $P_m = 1$, $P_l = 0$, replace buffers accordingly.
19:      else
20:        Mark area $A_i$ as optimized.
21:      end if





25: until All areas are optimized.
6.5.3 Skew Tuning Experimental Results
The experimental setup is the same as presented in Section 6.4.6. The benchmark
circuits are synthesized to have a clock period of 2ns (500MHz). The input of the
skew tuning experiments is the polarity assigned design. The proposed skew tuning
scheme shown in Algorithm 9 is implemented in Perl and Tcl scripts to interoperate
with IC Compiler. In the experiments, the parameter ε is chosen to be infinity in
order to optimize for clock skew.
The skew tuning results are summarized in Table 6.10 and Table 6.11. In Table 6.10,
the column groups “Typical corner skew info” and “Worst corner skew info”
present the clock skew of the original clock tree (Org.) and of the polarity assigned
design before (Pre tune) and after (Post tune) skew tuning, under the typical corner
and the worst corner, respectively. It is observed that the worst corner clock skew of the polarity
assigned circuit can be effectively reduced by applying the proposed skew tuning
method. In particular, an average worst corner clock skew of 61.5ps is observed on the polarity
assigned circuit after the proposed skew tuning method, which is close to the clock
skew (60.7ps) without any polarity assignment (only a 1.3% increase). The typical
corner clock skews are largely unaffected by the skew tuning method (33.9ps and 32.9ps on
average before and after skew tuning, respectively). In Table 6.11, the column groups
“Typical corner improvement level” and “Worst corner improvement level” present
the worst case peak current improvement of the polarity assigned design before and
after skew tuning, under the typical corner and the worst
corner, respectively. It is observed that although the peak current improvement level
is decreased by the skew tuning, the improvement is still better than the polarity
assignment method without considering capacitive load (31.2% vs. 29.6% and 30.7%
vs. 29.2% for typical corner and worst corner, respectively).
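The quoted averages and the 1.3% figure can be checked directly from the worst corner columns of Table 6.10. The snippet below is only an arithmetic check; the variable names are illustrative and the values are copied from the table.

```python
# Worst corner skew values (ps) from Table 6.10, in benchmark order
# s13207, s15850, s35932, s38417, s38584.
worst_org = [50.8, 52.5, 66.6, 50.2, 83.5]
worst_post = [52.9, 67.0, 68.5, 50.4, 68.6]


def mean(values):
    return sum(values) / len(values)


avg_org = mean(worst_org)
avg_post = mean(worst_post)
# Relative skew increase of the tuned, polarity assigned design.
increase_pct = 100.0 * (avg_post - avg_org) / avg_org
print(round(avg_org, 1), round(avg_post, 1), round(increase_pct, 1))
```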
Table 6.10: Skew information after polarity assignment with skew tuning.
Circuit   Typical corner skew info (ps)      Worst corner skew info (ps)
          Org.    Pre tune   Post tune       Org.    Pre tune   Post tune
s13207    26.8    26.1       23.5            50.8    65.4       52.9
s15850    28.8    30.4       26.8            52.5    73.7       67.0
s35932    38.9    38.8       38.4            66.6    77.3       68.5
s38417    29.8    25.3       26.4            50.2    63.3       50.4
s38584    53.4    49.0       49.2            83.5    101.1      68.6
Average   35.5    33.9       32.9            60.7    76.2       61.5
Table 6.11: Peak current improvements compared to the peak current of the original
clock trees (typical corner and worst corner, respectively).
Circuit   Typical corner improvement level   Worst corner improvement level
          Pre tuning   Post tuning           Pre tuning   Post tuning
s13207    31.0%        22.5%                 35.0%        21.1%
s15850    36.9%        32.4%                 36.7%        32.6%
s35932    35.4%        35.1%                 35.8%        34.4%
s38417    36.5%        29.3%                 34.2%        30.1%
s38584    38.9%        36.8%                 37.9%        35.5%
Average   36.5%        31.2%                 35.9%        30.7%

































Figure 6.8: Peak current improvements (Not considering cap load, considering cap
load and considering cap load with skew tuning method).
6.6 The Comparison with Previous Work
The previous work in [37] is one of the most recent polarity assignment methods
and uses buffer sizing to control clock skew when performing polarity assignment.
The method in [37] first finds a feasible time interval (i.e. global clock skew) assuming
an inaccurate timing model during buffer sizing or buffer/inverter replacement. Then
the polarity assignment is performed on the sink buffers, guaranteeing the worst case
global skew only under that inaccurate timing model. To illustrate this, the method
in [37] is implemented and tested on the same benchmark circuits under typical
operating conditions. The HSPICE simulation results of the two methods are compared
in Table 6.12. The time interval selected for the method in [37] is the smallest time
interval where a complete feasible interval exists. The complete feasible interval is
defined as the time interval where there exists at least a buffer and an inverter type
for each sink buffer to have delay in the chosen time interval. It is observed that on
the benchmark circuits s15850 and s38417, the peak current is similar for the two
methods. The method in [37] is slightly better for s38417, as a few buffers are sized down
in [37]. However, the proposed method is significantly better on the benchmark
circuits s13207, s35932 and s38584. This is because some of the buffers/inverters are
sized up using the method in [37] to meet the skew requirement. The skew is increased due
to the lack of consideration of the buffer input capacitance when performing buffer
sizing in the inaccurate timing model used in [37].
In short, the method in [37] is very effective in most cases; however, it does not
guarantee a good solution in some cases. This is because the time interval selection
algorithm proposed in [37] may choose a time interval where the buffers are larger.
Although larger buffers are faster with large capacitive load in general, the larger
buffers may be slower when the capacitive load is small because of the larger intrinsic
capacitance. As a result, the time interval selection algorithm in [37] may (incorrectly)





























Figure 6.9: The time interval graph on one local area of the clock tree in s13207.
choose a larger buffer to guarantee the time interval requirement and thus the peak
current may even increase after polarity assignment. This is demonstrated by the
implementation of the method in [37] performed in this work. On all the benchmark circuits
synthesized with IC Compiler, the capacitive load of the sink buffers of the clock tree
ranges from 20fF to 90fF, including the capacitance of the wires and the fanout
registers. In this capacitive load range, it is observed that a buffer with a larger
size is not necessarily a faster buffer. For instance, the delays of the inverting
buffers IBUFX2, IBUFX4, IBUFX8 and IBUFX16 with the same capacitive
load of 60fF are 112ps, 102ps, 115ps and 125ps, respectively, which do not decrease
monotonically with buffer size.
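The non-monotonic relation between buffer size and delay can be made concrete with the four delays quoted above. The following Python sketch only restates those numbers; the dictionary layout and variable names are illustrative assumptions.

```python
# Delays (ps) quoted in the text for a 60 fF load.
delay_ps = {"IBUFX2": 112, "IBUFX4": 102, "IBUFX8": 115, "IBUFX16": 125}
# Drive strength taken from the cell name suffix.
drive = {"IBUFX2": 2, "IBUFX4": 4, "IBUFX8": 8, "IBUFX16": 16}

fastest = min(delay_ps, key=delay_ps.get)
largest = max(drive, key=drive.get)

# Check whether delay never increases as the drive strength grows.
by_size = sorted(delay_ps, key=drive.get)
delay_decreases_with_size = all(
    delay_ps[a] >= delay_ps[b] for a, b in zip(by_size, by_size[1:])
)
print(fastest, largest, delay_decreases_with_size)
```

At this load the fastest cell is not the largest one, so a selection rule that assumes "larger is faster" would pick a slower cell that also draws more current.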
The time interval selection procedure is demonstrated in Figure 6.9. There are
11 (eleven) clock tree sink buffers in one local area of the benchmark circuit s13207.
Table 6.12: The performance comparison of different polarity assignment methods.
Circuit   Peak current reduction (mA)    Skew values (ps)
          Org.    [37]    Proposed       Org.    [37]    Proposed
s13207    8.4     14.2    5.8            26.8    33.7    23.5
s15850    10.3    6.9     6.9            28.8    27.3    26.8
s35932    18.1    16.8    11.7           38.9    39.6    38.4
s38417    24.1    16.0    16.8           29.8    28.3    26.4
s38584    18.5    22.2    11.9           53.4    53.4    49.2
Average                                  35.5    36.5    32.9
The original clock buffers are chosen to be BUFX4 and the original local clock
skew is 26ps. It is observed that the larger buffers may have larger delays given
the capacitive load. The time interval selection algorithm proposed in [37] finds a
complete feasible time interval where there exist at least a buffer and an inverter type
guaranteeing (using the inaccurate timing models) a pre-selected skew, searching from
a larger delay to a smaller delay. In this case, assuming the local skew target is set
to 30ps, the complete feasible time interval selected is the one shown in the figure
where minimum delay Dmin = 248ps and maximum delay Dmax = 278ps. In this
feasible time interval, BUFX32 and INVX32 are used and thus a larger peak current
is introduced after polarity assignment. This situation occurs on three of the five
benchmark circuits tested. In contrast, using the proposed method in this work, the
positive polarity buffers originally on the clock tree are not sized. Only the negative
polarity buffers are sized. This guarantees that the peak current after polarity assignment
is never worse than that of the original clock tree, as the original clock tree configuration
is within the solution space of the polarity assignment of the proposed method.
6.7 Conclusion
A clock polarity assignment method is proposed which takes into consideration
the capacitive load and the peak current impact of each clock buffer. Due to the corrected
timing models, the polarity assignment and buffer sizing cannot be performed
simultaneously, as in the previous work. Instead, a two step solution is proposed. The
polarity assignment method utilizes a lexi-search algorithm in order to optimally as-
sign the clock buffer polarities. The proposed method can be applied on any existing
clock tree, preserving the topology and existing optimizations. The implementation
in an industrial design flow is performed in order to demonstrate the practicality of
the clock buffer polarity assignment method. The analysis and experimental results
show that the worst case peak current on the clock tree of all the benchmark circuits
can be reduced by an average of 36.5%. A skew tuning scheme is proposed to effec-
tively reduce the worst case clock skew induced by polarity assignment. The peak
current reduction is 31.2% after the skew tuning.
7. Clock Polarity Assignment: A Reconfigurable Flow for Clock Gated
Designs
The polarity assignment methods presented in Chapter 6 and the previous works [13,
37, 47, 48, 54, 65, 66] use buffers and inverters to configure the polarities of different
clock branches. In these methods, the polarity assignment cannot be reconfigured
after fabrication. In this chapter, a novel clock polarity assignment flow which intro-
duces post-silicon reconfigurability is proposed. The proposed method inserts XOR
gates at one level of the clock tree to facilitate the polarity assignment. The polarity
of the XOR gates can be reconfigured for different modes of clock gating (sleep mode,
busy mode, etc.) such that further reduction of the peak current can be achieved.
Experimental results show that the worst case peak current on a clock tree can be
reduced by 33.3% by assigning polarity to XOR gates at the sink level of the clock
tree. An additional 12.8% reduction in the worst case peak current can be achieved
by reconfiguring the polarity assignment based on the clock gating information. The
proposed flow increases the area by 7.1% but reduces both the total power consump-
tion by 23.8% and the global skew increase (due to polarity assignment) from 19.3ps
to 8.8ps. The insertion of XOR gates at the non-sink nodes is also studied to further
reduce the global skew increase and the area overhead.
This chapter is organized as follows. In Section 7.1, a brief introduction about the
proposed method is presented. In Section 7.2, the preliminaries of clock gating and
the proposed reconfigurability are described. In Section 7.3, the proposed clock tree
synthesis scheme with XOR gate insertion is presented. In Sections 7.4 and 7.5, the
polarity assignment methods for XOR gates inserted at the sink and non-sink level
of the clock tree, respectively, are presented. The methodology is discussed in
Section 7.6. In Section 7.7, the experimental results
of the proposed method are presented. The chapter is summarized in Section 7.8.
7.1 Introduction
The clock tree is a significant source of peak current due to all the sink buffers
switching simultaneously at the clock edges. Reducing the peak current induced by
the sink buffers of the clock tree is demonstrated to be an effective method of reduc-
ing the peak current on the chip [13]. There are a number of effective methods for
reducing the peak current on the clock tree. In [5, 36, 40, 83], clock skew scheduling
is applied to eliminate the simultaneous switching of the sink registers. Clock skew
scheduling, however, is not preferred in mainstream design flows due to the difficulty
it introduces at the verification stage. Polarity assignment methods are proposed for peak current
reduction under zero clock skew in [13, 37, 47, 48, 54, 65, 66] and Chapter 6. The po-
larity assignment and the clock skew scheduling methods can be combined to further
reduce the peak current on chip.
As discussed in Chapter 6, the polarity assignment methods typically involve re-
placing buffers with inverters or vice versa on an existing clock tree. Although the
input to the polarity assignment problem is a zero skew clock tree, skew may be
induced by the replacement of buffers with inverters and the replacement of posi-
tive edge-triggered flip-flops with negative edge-triggered flip-flops. Furthermore, cell
overlapping may occur as the dimensions of the buffers and inverters are non-identical.
This necessitates an incremental placement which is undesirable as the quality of the
clock tree is compromised with such a placement change. In this work, a new method
of polarity assignment is proposed, which eliminates the need to replace buffers with
inverters and positive edge-triggered DFFs with negative edge-triggered DFFs. The
method utilizes XOR gates as the buffering elements on one level of a clock tree network,
which are reconfigurable for clock polarity. Reconfigurable polarity assignment
is particularly beneficial for gated-clock trees. It is shown for the first time in this
work that the reconfigurability of polarity assignment for gated-clock circuits is essential
as the peak current profiles change with each clock gating event. The clock tree
synthesis with polarity assignment method proposed in this work has the following
features and novelties:
(i) The proposed method eliminates the replacement of buffers and registers, and
thus, the need for incremental placement. The polarity assigned tree has less
clock skew degradation and power dissipation at the expense of limited area
increase.
(ii) The proposed method has the reconfigurability feature in assigning polarity
during runtime for clock gating.
(iii) Any clock polarity assignment method that works on the sink level buffers can
be applied on the clock tree with the proposed reconfigurability feature.






























Figure 7.1: Buffer and XOR gate simulation in HSPICE. (a) Buffer BUFX2 simulation. (b) Positive polarity XORX2 gate simulation. (c) Negative polarity XORX2 gate simulation.
7.2 Preliminaries
An HSPICE simulation result on the current drawn from the vdd/gnd rails of a
clock buffer is shown in Figure 7.1(a). In the simulation, a clock buffer BUFX2 from
a 90nm library [73] is used. The peak current on the power rails [i(vdd)] occurs at the
rising edge of the clock signal. The peak current on the ground rails [i(gnd)] occurs
at the falling edge of the clock signal. Polarity assignment intentionally permits
some buffers to switch in the same direction as the clock signal and the other buffers
to switch in the opposite direction as the clock signal such that the current will
not accumulate on vdd or gnd rails during rising or falling edge of the source clock
signal, respectively. The polarity assignment methods, as demonstrated in previous
works [13, 37, 47, 54, 65, 66], involve replacing buffers by inverters or vice versa on the
clock tree. However, such a replacement scheme not only increases the clock skew
but also affects the robustness of the clock tree because of the mismatch of the clock
buffers after polarity assignment. The clock skew degradation can be so dramatic
that some previous methods have advocated clock tree re-synthesis after polarity
assignment [65]. The proposed clock tree synthesis scheme with a level built with
XOR gates eliminates the necessity to perform incremental placement or clock tree
re-synthesis and enables post-silicon reconfigurability.
7.2.1 XOR Gate Configuration for Polarity Assignment
In the proposed clock tree synthesis scheme, XOR gates are inserted at one level
of the clock tree (sink or non-sink) to enable the reconfigurability of the polarity
assignment. Consider a two-input XOR gate with inputs A and B and output Q.
Let the input terminal B be driven by the clock signal clk. The polarity of the clock
signal at the output of the XOR gate is reconfigurable through the control input A,
as shown in Table 7.1.
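The polarity control described above follows directly from the XOR truth table: with the control input at 0 the output follows the clock, and with the control input at 1 the output is the inverted clock. A minimal Python sketch, in which the function name `xor_clock` and the sampled waveform are illustrative assumptions:

```python
def xor_clock(control_a: int, clk: int) -> int:
    """Output Q of a two-input XOR with control input A and clock on input B."""
    return control_a ^ clk


clk_wave = [0, 1, 0, 1]  # sampled source clock levels
positive = [xor_clock(0, c) for c in clk_wave]  # A = 0: Q follows clk
negative = [xor_clock(1, c) for c in clk_wave]  # A = 1: Q is inverted clk
print(positive, negative)
```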
The peak current characteristics of the XOR gate XORX2 from the 90nm technology
library [73] are similar to those of the buffer gate BUFX2 when buffering the clock signal on
the clock tree, as demonstrated in Figure 7.1.
Table 7.1: The logic function of XOR.

































(b) The XOR gate XORX2.
Figure 7.2: The schematics of the buffer and XOR gate in [73].
7.2.2 The Characteristics of the Buffer and the XOR Gates
In addition to enabling reconfigurability, the XORX2 gate is advantageous in
terms of the peak current drawn from the power supplies. This can be explained using the
schematics of the buffer gate BUFX2 and the XOR gate XORX2 in Figure 7.2.
The XOR gate contains three stages in terms of transistor sizes: the input inverter
(0.5X), the XOR logic (1X) and the driver (2X), whereas the buffer gate
contains two: the input inverter (1X) and the driver (2X). The transistors of
each stage are twice the size of those in the previous stage. For XORX2 and BUFX2, the driver
stages in the two gates are of the same size (i.e. 2X). Thus, the output driving





























































Figure 7.3: The peak current analysis of the buffer and XOR gate. (b) The XORX2 peak current decomposition.
The peak current drawn from the vdd rail of each gate can be separated into
two parts: 1. The current drawn by the input and logic stages and 2. The current
drawn by the driver stage. The current drawn from the vdd rail by the transistors
is primarily determined by three factors: 1. The $W/L$ ratio of the transistors: A larger
$W/L$ causes a larger current [60]; 2. The capacitive load of the transistors: The peak
current is larger with a larger capacitive load (confirmed by HSPICE simulations);
3. The slew rate of the input signal of the transistors: A larger slew rate causes a
larger peak current (confirmed by HSPICE simulations). The peak current drawn
by the drivers of XORX2 and BUFX2 is dominant as the drivers have the largest $W/L$
ratio and the drivers typically drive a larger capacitive load. Given the same input
signal to BUFX2 and XORX2, the slew rate of the input signal at the driver stage
of XORX2 is smaller than the slew rate of the input signal at the driver stage of
BUFX2. This is because the driver of XORX2 is driven by a node loaded with the drain
capacitance of two pmos and two nmos transistors whereas the driver of BUFX2 is driven by a
node loaded with the drain capacitance of one pmos and one nmos transistor. Thus the peak current at




































Figure 7.4: The illustration of the clock network with clock gating. (c) The circuit for clock gating with reconfigurable polarity assignment.
The above reasoning is confirmed by HSPICE simulations as shown in Figure 7.3(a)
and 7.3(b), for BUFX2 and XORX2, respectively. In each figure, the waveform on the
top is the total peak current drawn from vdd rail by the gate. The waveform in the
middle presents the peak current drawn by the driver stage of the gate. The waveform
at the bottom shows the peak current drawn by the other stages (input inverter stage
for BUFX2, input inverter stage and XOR logic stage for XORX2 ). It is observed
that the input inverter stage of BUFX2 draws a peak current of 143.5µA [i(inv)]
while the input inverter and the XOR logic stages of XORX2 draw a peak current of
137.5µA [i(xor)]. The driver of BUFX2 draws a peak current of 699.4µA [i(drv)]
from the vdd rail while the driver of XORX2 draws a peak current of 595.4µA [i(drv)].
The drivers of the two gates contribute most of the peak current drawn from the
vdd rails, and the total peak current i(tot) is 749.4µA and 644.1µA for BUFX2
and XORX2, respectively. Note that the analysis and simulations are performed only
on the vdd rail; a similar analysis and similar simulations can be applied on the gnd rail
as well.
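As a quick arithmetic check on the simulated values quoted above, the relative peak current reduction of XORX2 over BUFX2 can be computed as follows. The variable names are illustrative; the numbers are copied from the text.

```python
# Simulated vdd-rail peak currents quoted in the text (microamperes).
bufx2 = {"input": 143.5, "driver": 699.4, "total": 749.4}
xorx2 = {"input_and_logic": 137.5, "driver": 595.4, "total": 644.1}

reduction_pct = 100.0 * (bufx2["total"] - xorx2["total"]) / bufx2["total"]
driver_share_buf = 100.0 * bufx2["driver"] / bufx2["total"]
driver_share_xor = 100.0 * xorx2["driver"] / xorx2["total"]
print(round(reduction_pct, 1), round(driver_share_buf, 1), round(driver_share_xor, 1))
```

Note that the per-stage peaks do not add up to the total peak for either gate, presumably because the stages do not reach their peaks at the same instant.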
In summary, the peak current of a logic gate is determined jointly by its switching
activity and its transistor-level structure. Although XORX2 has more
transistors, its peak current is less than that of BUFX2. The trade-off of XOR insertion
is the potential area overhead compared to the buffers. These effects are investigated
thoroughly in this work to identify critical trade-offs in power, area and peak current.
7.2.3 Reconfigurable Polarity Assignment
Clock gating is a commonly used technique for reducing the power consumption
on the clock tree. Previous polarity assignment methods ignore clock gating, which is
not optimal due to the dynamic changes in the switching activity of the clock sink
buffers after gating, as demonstrated in Figure 7.4(a). Let the sink buffers B1–B3 be
assigned with negative polarity and the sink buffers B4–B6 be assigned with positive
polarity after the polarity assignment. When clock gating occurs, the buffers B2
and B3 are gated and the peak current reduction effect is no longer optimal. The
conventional clock polarity assignment methods do not allow reconfiguring the
polarity assignment during runtime due to the following difficulties:
(i) The polarity assignment is achieved through replacing buffers with inverters or
vice versa at the CTS stage. This procedure cannot be performed post-silicon.
(ii) In order to guarantee the synchronicity of the registers, the type of the regis-
ters (positive or negative edge triggered) has to be determined based on the
polarity assignment at the pre-silicon stage.
By inserting the XOR gate at one level of the clock tree, the first difficulty is
overcome because the polarity of the output signal of the XOR gate can be controlled
by one control input signal. The second difficulty is overcome by using the double edge
triggered flip-flops (DET-FF) [58, 80]. Although the polarity of the clock
signal at the sink level may differ after polarity assignment, as long as the clock edges (either
rising or falling edge) arrive at the same time, the synchronicity of the circuits is
maintained. The implementation with DET-FFs also has the added advantage of
operating with half the clock frequency, which leads to significant power savings.
In the proposed design, DET-FFs are used as registers instead of single edge triggered
flip-flops (SET-FF) as shown in Figure 7.4(b). Control inputs of the XOR gates
are assigned binary 0s and 1s to perform polarity assignment. After clock gating
on XORs X2 and X3, the control input to one of the positive polarity XORs (X4, X5,
X6) can be assigned to 1 in order to balance the peak currents on the vdd/gnd rails.
The proposed reconfigurable polarity assignment method is applicable to clock-
gated designs, which already include the clock gating logic. Similar to the clock gating
system [57], a control block for clock gating with reconfigurable polarity assignment
is necessary as shown in Figure 7.4(c). The desirable polarities corresponding to
the gating logic are precomputed and stored in the polarity assignment logic. The
clock gating logic triggers the corresponding polarity assignment to be applied on
the control input of the XOR gates. The area and routing overhead for the polarity
assignment logic is appended to the overheads planned for clock gating logic.
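The precomputed polarity table can be pictured as a lookup from gating mode to XOR control inputs. The Python sketch below reuses the six-gate example from the text (X1 to X3 negative, X4 to X6 positive, X2 and X3 gated). It is illustrative only: the function `rebalance`, the mode names and the assumption that every active XOR gate contributes the same peak current are hypothetical simplifications, not the thesis's polarity assignment logic.

```python
def rebalance(polarity, gated):
    """Flip control inputs of active gates until the counts of active
    positive (0) and negative (1) gates differ by at most one.
    Assumes equal peak current weight per gate (a simplification)."""
    pol = list(polarity)
    active = [i for i in range(len(pol)) if i not in gated]
    pos = [i for i in active if pol[i] == 0]
    neg = [i for i in active if pol[i] == 1]
    while len(pos) - len(neg) > 1:
        i = pos.pop()
        pol[i] = 1
        neg.append(i)
    while len(neg) - len(pos) > 1:
        i = neg.pop()
        pol[i] = 0
        pos.append(i)
    return pol


base = [1, 1, 1, 0, 0, 0]                  # control inputs of X1..X6
modes = {"busy": set(), "sleep": {1, 2}}   # indices of gated XORs (X2, X3)
# Precomputed table: gating mode -> control inputs to apply.
polarity_table = {name: rebalance(base, gated) for name, gated in modes.items()}
print(polarity_table)
```

In the "sleep" entry one of the positive gates X4 to X6 is switched to 1, so the two active polarities are balanced again, matching the example in the text.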
7.3 XOR Gates Insertion on Clock Tree
Different effects are observed when the XOR gates are inserted at different clock
tree levels. If the XOR gates are inserted at the sink level of the clock tree, a larger
overhead, possibly in terms of delay, area and power (depending on the cell library),
can be observed. On the other hand, if the XOR gates are inserted at the non-sink
level of the clock tree, the overhead is reduced but there is less granularity in polarity
assignment. This is because if a non-sink XOR gate is assigned to one polarity, all its
sink buffers have the same polarity as the driving XOR gate. The change in insertion
delay (decrease or increase) cannot be generalized as the delay through the XOR gate
depends on the capacitive load and input slew at the particular sink or non-sink.
For simplicity of the analysis and without loss of generality, the chip areas in the
experiments are evenly partitioned for the vdd/gnd straps and the polarity assignment
is performed on each local area individually. In Figure 7.5, a clock tree example
on s15850 from the ISCAS’89 benchmark circuits is illustrated. The chip area in
Figure 7.5 is evenly partitioned into four areas labeled A1 through A4 based on the
vdd/gnd straps in physical design.
If the XOR gates are inserted at the sink level only, the polarity assignment can
be performed separately per area. However, if the XOR gates are inserted at a non-sink
level, it is possible for some XOR gates to drive sink buffers in different areas
such that the polarity of one XOR gate simultaneously affects the peak currents in
different areas. For instance, in Figure 7.5, it is observed that the highlighted non-
sink XOR gate (shown as a black square) in area A1 drives sink buffers (shown as
black triangles) located in both area A1 and A2. Thus, assigning polarities for XOR
gates inserted at the sink level and non-sink level of the clock tree are two different
problems to be addressed in Sections 7.4 and 7.5, respectively.
7.4 Polarity Assignment on Sink Level XOR Gates
If the XOR gates are inserted at the sink level of the clock tree, any polarity
assignment method, including the proposed method in Chapter 6, can be applied. In
this work, a fast greedy algorithm previously discussed in Chapter 6 is adopted.
7.4.1 Problem Formulation
Similar to the buffer/inverter based polarity assignment problem presented in
Section 6.3, the problem definition of the sink level XOR polarity assignment is:
Given a synthesized clock tree with one level of XOR gates inserted, the
definition of each local area Ak for power/ground network connections
and XOR gate locations, compute the polarity Pi of each XOR gate Xi
such that a minimum worst case peak current is observed on vdd/gnd rails.
Figure 7.5: A clock tree synthesized with XOR gates.
The binary values 0 and 1 of the polarity variable $P_i$ represent the positive polarity
and the negative polarity, respectively. Let $I^i_{v+}$ and $I^i_{g+}$ represent the
peak current on vdd rails and gnd rails for XOR gate $X_i$ with positive polarity,
respectively. Let $I^i_{v-}$ and $I^i_{g-}$ represent the peak current on vdd rails and gnd rails
for XOR gate $X_i$ with negative polarity, respectively. The objective of the peak

























The objective of the formulation is to reduce the worst case peak current on the
vdd and gnd rails over all local areas A1–A4. As explained in Section 6.3, since the
variables in each of the variable pairs $\{I^i_{v+}, I^i_{g+}\}$ and $\{I^i_{v-}, I^i_{g-}\}$ are associated with
each other, optimizing the peak current on vdd rails optimizes the peak current on
gnd rails as well. Moreover, optimizing the peak current on rising edges of the clock
signal optimizes the peak current on falling edges of the clock signal as well. The
optimization problem becomes minimizing the peak current on the worst rail (either
vdd or gnd) or worst clock edge (either rising or falling edge). The objective is thus
simplified as shown in Table 7.2. The parameters $I^i_+$ and $I^i_-$ are calculated as:











The problem aims to minimize the worst case peak current over all the areas by
assigning the polarities of XOR gates.
Table 7.2: XOR gates inserted at the sink level [48].













s.t. Pi ∈ {0, 1},∀ Xi
7.4.2 A Greedy Polarity Assignment Method
To assign the polarity efficiently, the greedy algorithm presented in Algorithm 10
is used. Algorithm 10 is the same as Algorithm 8 in Chapter 6,
which greedily assigns an unassigned XOR gate which has the least positive or nega-
Algorithm 10 Greedy sink level polarity assignment [48].
Input: The peak current weights $I^i_+$ and $I^i_-$ of each XOR gate $X_i$
Output: The polarity $P_i$ of each XOR gate $X_i$
1: for each area $A_k$ do







$I^l_-$. Set $P_i = 0$, $P_j = 1$. $U_k = U_k - \{X_i\}$, $U_k = U_k - \{X_j\}$;
3: while $U_k \neq \emptyset$ do





















6: Set Pj = 1, Uk = Uk − {Xj};
7: else




tive polarity peak current weight with the corresponding polarity such that the worst
case peak current is minimized. The peak current weights $I^i_+$ and $I^i_-$ for each XOR
gate depend on the size of the gate and the capacitive load at the output. The
algorithm returns the polarity $P_i$ of each XOR gate $X_i$. The algorithm obtains a
feasible polarity assignment in O(N log N) time, where N is the number of XOR
gates.
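A greedy assignment of this kind for a single area can be sketched in Python as follows. This is not a reproduction of Algorithm 10: the comparison rule is borrowed from the surviving condition of Algorithm 11 (assign the negative candidate when the positive total plus its smallest remaining weight is at least the negative total plus its smallest remaining weight), the symmetric else-branch and the omission of the initial seeding step are assumptions, and the two heaps with lazy deletion are an implementation choice made here to stay within O(N log N).

```python
import heapq


def greedy_polarity(weights):
    """weights: list of (I_plus, I_minus) peak current weights, one per XOR gate.
    Returns (polarity list, positive total, negative total) for one area."""
    n = len(weights)
    polarity = [None] * n
    pos_heap = [(w[0], idx) for idx, w in enumerate(weights)]
    neg_heap = [(w[1], idx) for idx, w in enumerate(weights)]
    heapq.heapify(pos_heap)
    heapq.heapify(neg_heap)
    pos_sum = neg_sum = 0.0
    for _ in range(n):
        # Lazy deletion: drop heap entries of gates that are already assigned.
        while polarity[pos_heap[0][1]] is not None:
            heapq.heappop(pos_heap)
        while polarity[neg_heap[0][1]] is not None:
            heapq.heappop(neg_heap)
        w_pos, i = pos_heap[0]  # unassigned gate with the smallest I_plus
        w_neg, j = neg_heap[0]  # unassigned gate with the smallest I_minus
        if pos_sum + w_pos >= neg_sum + w_neg:
            polarity[j] = 1
            neg_sum += w_neg
        else:
            polarity[i] = 0
            pos_sum += w_pos
    return polarity, pos_sum, neg_sum


assignment, total_pos, total_neg = greedy_polarity([(10, 10)] * 4)
print(assignment, total_pos, total_neg)
```

With four identical gates the sketch alternates polarities and splits the current evenly between the two polarities.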
7.4.3 Discussion on the Polarity Assignment Method
The polarity assignment formulation in Table 7.2 was originally proposed by the
authors in [48]. A peak current improvement of 47% is reported in [48]. This level
of peak current improvement is possible through polarity assignment on a modified
buffered clock tree synthesis method from [15]. The XOR gate inserted trees in this
work (as opposed to [48]) are synthesized with IC Compiler, using no clock tree
re-synthesis. Thus, the level of improvement on the peak current is expected to be
smaller than in [48], but comparable to other polarity assignment methods without clock
tree re-synthesis such as [13, 37, 47]. Other polarity assignment methods [13, 37, 47,
65] can be applied on the XOR based clock tree for peak current reduction without
loss of generality.
7.5 Polarity Assignment on Non-sink Level XOR Gates
A greedy method which assigns polarity on the clock tree with XOR gates inserted
at the non-sink level is developed based on the method in Section 7.4.
7.5.1 Problem Formulation
Each of the non-sink level XOR gates has several descendent sink buffers. These
buffers, although sharing the same ancestor XOR gate, do not necessarily reside in
the same local area. As a result, assigning the polarity of one XOR gate may affect
the polarity of the sink buffers in different areas.
In this problem, the peak current weights $I^i_+$ and $I^i_-$ for positive and negative
XOR gate polarity, respectively, are represented by M-tuples instead of numbers, where M is
the number of local areas on chip. For instance, the chip area in Figure 7.5 is divided
into four areas so the $I^i_+$ and $I^i_-$






























where the set Si includes all the sink buffers Bj, which are the descendent of the XOR








− represents the peak
current weight in the local area k of the XOR gate Xi. The peak current weight in
one local area is calculated as the sum of the peak current weight of the sink buffers in
that local area that are the descendants of the XOR gate. The problem formulation
for this polarity assignment problem is presented in Table 7.3.
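The M-tuple weight of a non-sink XOR gate, as described above, is the per-area sum of the weights of its descendant sink buffers. A minimal Python sketch, assuming hypothetical dictionary inputs (`xor_sinks`, `sink_area`, `sink_weight`) and made-up example values:

```python
def area_weights(xor_sinks, sink_area, sink_weight, num_areas):
    """Return {xor gate: tuple of per-area peak current weights}.

    xor_sinks:   XOR gate name -> list of descendant sink buffer names.
    sink_area:   sink buffer name -> index of the local area it resides in.
    sink_weight: sink buffer name -> peak current weight of that buffer.
    """
    result = {}
    for xor, sinks in xor_sinks.items():
        per_area = [0.0] * num_areas
        for sink in sinks:
            per_area[sink_area[sink]] += sink_weight[sink]
        result[xor] = tuple(per_area)
    return result


# One XOR gate in a chip with four areas; B3 sits in a different area than B1, B2.
tuples = area_weights(
    xor_sinks={"X1": ["B1", "B2", "B3"]},
    sink_area={"B1": 0, "B2": 0, "B3": 1},
    sink_weight={"B1": 2.0, "B2": 3.0, "B3": 4.0},
    num_areas=4,
)
print(tuples)
```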
Table 7.3: XOR gates inserted at the non-sink level.






























Pi ∈ {0, 1}, ∀Xi ∈ E.
In Table 7.3, the parameters $I^+_{A_k}$ and $I^-_{A_k}$ represent the total positive polarity and
negative polarity peak current for area $A_k$, respectively. The set E is the set of all
the non-sink XOR gates. The objective of the problem is still to minimize the worst
case peak current over all the areas.
7.5.2 Polarity Assignment
To solve the polarity assignment formulation in Table 7.3 efficiently, a
greedy polarity assignment method is presented in Algorithm 11, which is similar to
Algorithm 10 discussed in Section 7.4.2. The proposed polarity assignment first
assigns the polarity of the XOR gates whose descendant sink buffers are all in the same
area. The algorithm then greedily assigns the polarities of the XOR gates which
drive sink buffers in different areas so as to minimize the peak current. The algorithm
returns the polarity $P_i$ of each XOR gate $X_i$ in set E in O(N log N) time.
Algorithm 11 Greedy non-sink level polarity assignment.
Input: The peak current weights $I^i_+$ and $I^i_-$ of each XOR gate $X_i$
Output: The polarity $P_i$ of each XOR gate $X_i$
1: Set $U = \{X_i \mid X_i \in E\}$
2: for each area $A_k$ do







$I^l_-(k)$. Set $P_i = 0$, $P_j = 1$. $U_k = U_k - \{X_i\}$, $U_k = U_k - \{X_j\}$, $U = U - \{X_i, X_j\}$, $I^+_{A_k} = I^i_+(k)$, $I^-_{A_k} = I^i_-(k)$;
4: while $U_k \neq \emptyset$ do
5: Find buffer i such that $I^i_+ = \min_{X_l \in U_k}$






6: if $I^+_{A_k} + I^i_+(k) \geq I^-_{A_k} + I^j_-(k)$ then
7: Set $P_j = 1$, $U_k = U_k - \{X_j\}$, $U = U - \{X_j\}$, $I^-_{A_k} \mathrel{+}= I^j_-(k)$;
8: else









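14: if $\max_{\forall A_k} [I^+_{A_k} + I^i_+(k)] \geq \max_{\forall A_k} [I^-_{A_k} + I^i_-(k)]$ then
15: Set $P_i = 1$, $I^-_{A_k} \mathrel{+}= I^i_-(k)$, $\forall A_k$;
16: else
17: Set $P_i = 0$, $I^+_{A_k} \mathrel{+}= I^i_+(k)$, $\forall A_k$;
18: end if
19: end for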
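The final loop of Algorithm 11 (lines 14 to 17), which handles XOR gates driving sink buffers in several areas, can be sketched in Python as follows. The function name, the tuple-based inputs and the example values are hypothetical; the sketch assumes the per-area totals left by the single-area phase are passed in as lists.

```python
def assign_multi_area(gates, pos_totals, neg_totals):
    """Greedy polarity choice for XOR gates whose sinks span several areas.

    gates:       list of (name, I_plus tuple, I_minus tuple), one entry per area.
    pos_totals:  per-area positive polarity totals accumulated so far.
    neg_totals:  per-area negative polarity totals accumulated so far.
    Returns (polarity dict, updated positive totals, updated negative totals).
    """
    pos = list(pos_totals)
    neg = list(neg_totals)
    polarity = {}
    for name, w_pos, w_neg in gates:
        worst_if_pos = max(p + w for p, w in zip(pos, w_pos))
        worst_if_neg = max(n + w for n, w in zip(neg, w_neg))
        if worst_if_pos >= worst_if_neg:
            polarity[name] = 1  # negative polarity gives the lower worst case
            neg = [n + w for n, w in zip(neg, w_neg)]
        else:
            polarity[name] = 0
            pos = [p + w for p, w in zip(pos, w_pos)]
    return polarity, pos, neg


result, pos_after, neg_after = assign_multi_area(
    gates=[("Xa", (3, 2), (3, 2)), ("Xb", (1, 5), (1, 5))],
    pos_totals=[10, 4],
    neg_totals=[6, 8],
)
print(result, pos_after, neg_after)
```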
7.6 Discussions on the Methodology
The proposed methodology flow has two parts: (1) clock tree synthesis with XOR
gates inserted at one level of the clock tree as clock buffers; (2) polarity assignment
on the clock tree. During clock tree generation, XOR gates are inserted instead of
clock buffers on one level of the clock tree (sink or non-sink level). Existing
clock tree synthesis tools can be readily adjusted to accomplish this task. The
clock topology generation with XOR gate insertion is performed at the pre-routing
stage. The polarity assignment is reconfigurable and is performed after the routing stage
Table 7.4: Numerical peak current reduction comparison of the proposed method and
the optimal MIP.
Circuit    Sink level XOR insertion      Non-sink level XOR insertion
           MIP      Alg. 10   Offset     MIP      Alg. 11   Offset
s13207     6233     6259      0.4%       5601     5724      2.2%
s15850     4485     4505      0.4%       3715     3840      3.4%
s35932     11231    11370     1.2%       10676    10876     1.9%
s38417     14200    14312     0.8%       11087    11347     2.3%
s38584     14968    15139     1.1%       14663    14911     1.7%
Avg.                          0.8%                          2.3%
to guarantee the optimal assignment, as in this stage the logic routing, clock routing
and power network routing are known. The clock polarity assignment can also be
performed post-silicon to take into account the on-chip variations.
The polarity assignment problems presented in Tables 7.2 and 7.3 are NP-hard
problems. The two proposed sub-optimal heuristic methods have a complexity of
O(N log N), where N is the number of XOR gates. Although heuristic in nature,
the proposed methods closely approximate the optimal results on the benchmark
circuits. The proposed methods are compared against the optimal
Mixed Integer Programming (MIP) solutions on the benchmark circuits and the results
are summarized in Table 7.4. The MIP formulations in Tables 7.2 and 7.3 are
solved using the online solver SCIP [21]. It is observed that the proposed methods
are off from the optimal results by only 0.8% and 2.3% on average when using
Algorithm 10 and Algorithm 11, respectively.
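The offsets in Table 7.4 can be recomputed from the listed peak current values. The short Python sketch below assumes the offset is defined as the relative excess of the heuristic result over the MIP optimum, which is consistent with the tabulated percentages; the names `TABLE_7_4`, `offset_pct` and `average_offsets` are introduced here for illustration only.

```python
from typing import Dict, Tuple

# Peak current values copied from Table 7.4:
# circuit -> (MIP sink, Alg. 10 sink, MIP non-sink, Alg. 11 non-sink)
TABLE_7_4: Dict[str, Tuple[int, int, int, int]] = {
    "s13207": (6233, 6259, 5601, 5724),
    "s15850": (4485, 4505, 3715, 3840),
    "s35932": (11231, 11370, 10676, 10876),
    "s38417": (14200, 14312, 11087, 11347),
    "s38584": (14968, 15139, 14663, 14911),
}


def offset_pct(optimal: float, heuristic: float) -> float:
    """Relative excess of the heuristic result over the optimum, in percent."""
    return 100.0 * (heuristic - optimal) / optimal


def average_offsets(table: Dict[str, Tuple[int, int, int, int]]) -> Tuple[float, float]:
    """Average offset for sink level and non-sink level insertion."""
    sink = [offset_pct(mip, alg) for mip, alg, _, _ in table.values()]
    non_sink = [offset_pct(mip, alg) for _, _, mip, alg in table.values()]
    return sum(sink) / len(sink), sum(non_sink) / len(non_sink)


sink_avg, non_sink_avg = average_offsets(TABLE_7_4)
print(f"{sink_avg:.1f}% {non_sink_avg:.1f}%")  # prints "0.8% 2.3%"
```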
7.7 Experimental Results
The proposed clock tree synthesis scheme and the polarity assignment methods are
applied on the ISCAS’89 circuits. The circuits are logically and physically synthesized
using Design Compiler and IC Compiler, respectively. The operating frequency is
set to 500 MHz without any timing violations. The clock tree synthesis with XOR
gates inserted on one level of the clock tree is performed by IC Compiler. The
routed benchmark circuits are extracted using StarRCXT. The polarity computation
methods are implemented using C++ and the polarities are implemented using Tcl
in IC Compiler. The peak current reduction effects are simulated using the HSPICE
simulator in Nanosim.
In order to obtain as fair a comparison as possible between the XOR based polarity
assignment flow and the conventional buffer/inverter based polarity assignment
flow, XOR gates and buffers/inverters with similar slew rates are used in clock tree
synthesis. To this end, HSPICE simulations are performed on XOR gates
and buffers/inverters of different sizes in the selected cell library [73]. It is observed in
the experiments that the buffer of size X2 has the closest slew rate to the XOR of
size X2 under the same capacitive load (i.e., 8.3gV/µs vs. 7.8gV/µs driving the same
capacitive load of 60 fF). Thus, in the experiments, the power and peak current of
the circuits with clock trees synthesized using BUFX2 only and with clock trees synthesized
using BUFX2 and XORX2 (at the sink or one non-sink level only) are compared.
7.7.1 The Design of DET-FFs
The proposed method requires XOR gate insertion and the use of DET-FFs.
There are XOR gates of similar driving strength to buffers in the adopted cell
library [73] but there are no DET-FFs. Therefore, DET-FFs with characteristics
similar to the single-edge-triggered flip-flops (SET-FFs) in the library are
custom designed. The DET-FF gate is built based on the design in [58]. The cell
height of the designed DET-FF is restricted to 2.88µm, which is the height of the
standard cells of the 90nm library used [73]. In order to obtain a slew rate similar to that of
the SET-FFs in the cell library, the inverter at the last stage of the DET-FF is re-sized.
Note that in the cell library [73], the design of DFFX1 (a SET-FF) also uses a large
inverter at its last stage; thus, this is a standard procedure to increase the driving
strength.
The characteristics of the DFFX1 in the standard cell library and the designed
DET-FF are compared in Table 7.5. All the information is obtained by driving a
60fF capacitive load at the same operating frequency of 500MHz. It is observed in
the largest benchmark circuit s38584 that the average output load of the DFFX1 is
16.5fF. Thus, an overly conservative capacitive load of 60fF is chosen for comparison.
Although the total area of the DET-FF is larger than that of the DFFX1 in the library [73],
the DET-FF reduces the power consumption. In addition to this small power saving
in each register, the total power consumption of the clock tree is further reduced
by halving the operating frequency. Moreover, the clk-to-q delay
of the DET-FF is smaller than that of DFFX1, which is similar to the result in [58].
The decreased delay potentially leads to an improved timing slack.
Table 7.5: The characteristics of the DFFX1 and the designed DET-FF driving a
capacitive load of 60 fF.
Regs type   Area (µm²)   Slew (gV/µs)   Delay (ps)   Power (µW)
DFFX1       24.9         4.85           244          105.8
DET-FF      28.0         4.80           160          91.5
7.7.2 Polarity Assignment with XOR Gates
The peak current reduction effects of reconfigurable polarity assignment for the
clock tree with XOR gates inserted at the sink level are summarized in Table 7.6.
The number of sink clock tree buffers and the number of XOR gates inserted are
Table 7.6: Reconfigurable polarity assignment with XOR gates inserted at the sink
level.
Circuit information       Without clock gating  With clock gating
Circuit  # of ct  # of    Org    PA     Imp.    Gated   RePa   Incr.   Total
         bufs     XORs    (mA)   (mA)           (mA)    (mA)   Imp.    Imp.
s13207   144      107     24.3   16.1   33.7%   14.7    12.2   17.3%   50.0%
s15850   108      83      12.1   9.0    25.6%   7.9     7.0    10.7%   42.0%
s35932   393      288     44.1   29.0   34.2%   25.8    23.1   10.4%   47.5%
s38417   374      273     55.2   35.6   35.5%   32.3    28.4   12.1%   48.6%
s38584   320      238     59.3   37.1   37.5%   31.7    27.4   13.5%   53.7%
Avg.                                    33.3%                  12.8%   48.4%
presented in the columns “# of ct bufs” and “# of XORs”, respectively. The peak
currents of the XOR-inserted clock tree before and after polarity assignment
are presented in columns “Org (mA)” and “PA (mA)”, respectively. The peak
current reduction is 33.3% on average, as summarized in column “Imp.”. The
column “Gated (mA)” presents the peak current of the polarity assigned
XOR inserted clock trees when clock gating occurs. In this experiment, a 20% chance
of clock gating is simulated. The column “RePa (mA)” presents the peak current
value when polarity re-assignment is applied to the gated polarity assigned clock tree.
An additional 12.8% improvement is observed by applying the reconfigurable polarity
assignment, which results in an overall peak current reduction of 48.4%. Note that
the improvement percentage may vary as the clock gating percentage changes.
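As a worked example of how the percentage columns of Table 7.6 relate to the current columns, the following definitions are inferred from the column names and the tabulated values; they are an assumption of this note rather than definitions stated in the text. The symbols $I_{\mathrm{Org}}$, $I_{\mathrm{PA}}$ and $I_{\mathrm{RePa}}$ denote the entries of the “Org”, “PA” and “RePa” columns.

```latex
\[
\text{Imp.} = \frac{I_{\mathrm{Org}} - I_{\mathrm{PA}}}{I_{\mathrm{Org}}},
\qquad
\text{Total Imp.} = \frac{I_{\mathrm{Org}} - I_{\mathrm{RePa}}}{I_{\mathrm{Org}}}
\]
\[
\text{s13207:}\quad
\frac{24.3 - 16.1}{24.3} \approx 33.7\%,
\qquad
\frac{24.3 - 12.2}{24.3} \approx 49.8\%
\]
```

The second value is close to the 50.0% listed for s13207; the small difference presumably comes from the rounding of the tabulated currents. Under this reading, the total improvement is measured against the original peak current, which is why it is not the plain sum of the “Imp.” and “Incr. Imp.” columns.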
A polarity assignment algorithm using an optimal, dynamic programming-based
method is proposed in [47]. The average level of peak current improvement reported
in [47] is 35%. Thus, the level of peak current improvement (33.3%) of the pro-
posed greedy algorithm when applied on the XOR-gate inserted tree is comparable
to previous art when applied on the conventional buffer-based tree.
Similarly, the peak current information for the clock tree with XOR gates inserted
at the non-sink level is presented in Table 7.7. In the experiments, the XORX2 gates
Table 7.7: Reconfigurable polarity assignment with XOR gates inserted at the
non-sink level.
Circuit information              Without clock gating  With clock gating
Circuit  # of ct  # sink  # of   Org    PA     Imp.    Gated   RePa   Incr.   Total
         bufs     bufs    XORs   (mA)   (mA)           (mA)    (mA)   Imp.    Imp.
s13207   133      107     18     23.1   15.0   35.1%   12.8    11.4   10.8%   50.5%
s15850   101      83      14     13.1   9.7    25.9%   7.8     6.8    11.2%   47.4%
s35932   368      288     48     48.4   30.9   36.2%   27.4    22.8   16.6%   52.8%
s38417   360      273     46     46.1   30.4   34.2%   25.6    21.7   15.3%   53.0%
s38584   339      238     40     73.5   45.5   38.1%   39.5    35.3   10.6%   51.9%
Avg.                                           33.9%                  12.9%   51.1%
are inserted at the first non-sink level. The clock gating percentage is set to 20% for
consistency. The peak current reduction is observed to be 33.9% on average. The
reconfigurable polarity assignment permits an additional improvement of 12.9% after
clock gating, bringing the total improvement to 51.1%. Note that in the experimental
results, the peak current on the clock tree with XOR gates inserted at the sink
level (48.4% on average) is less than the peak current on the clock tree with XOR
gates inserted at the non-sink level. This is because the peak current of the XOR gate is
less than the peak current of the clock buffers in the selected cell library [73].
7.7.3 Comparison to Conventional Polarity Assignment with Buffer and
Inverter Replacement
In Section 7.7.2, the efficacy of reconfigurable polarity assignment is demonstrated
using the newly designed XOR-based tree as the comparison basis. Two other en-
lightening comparisons are against the basis of the conventional buffered clock tree
built in IC Compiler with and without clock gating. Comparison against a buffered
clock tree without clock gating demonstrates the effects of XOR insertion on the clock
tree. Comparison against a buffered clock tree with clock gating demonstrates the
efficacy of the proposed method when compared to the common art.
The peak current reduction effects without clock gating are compared in Table 7.8.
In the data fields of Table 7.8, the results before and after the “/” are the corresponding
peak current or improvement values for the traditional clock trees with
single-edge-triggered flip-flops (SET-FFs) and the traditional clock trees with DET-FFs,
respectively. It is observed that, despite the lack of clock gating, the conventional
buffer/inverter based polarity assignment reduces the peak current by 30.3% (SET-FFs)
and 25.7% (DET-FFs). For the XOR gate
inserted trees, the peak current values for the sink level and non-sink level insertion
are reduced by 35.3% and 34.2%, respectively, compared to the peak current value on
the conventional buffered clock tree using SET-FFs without any polarity assignment.
The peak current values for the sink level and non-sink level XOR gate insertion are
reduced by 39.2% and 38.2%, compared to the peak current value on the conventional
buffered clock tree using DET-FFs without any polarity assignment. Compared to
the conventional polarity assigned (buffer/inverter replaced) clock tree using
SET-FFs, the XOR based trees (without clock gating) have 7.4% (18.5% if compared with
the buffered tree using DET-FFs) and 5.4% (16.6% if compared with the buffered tree using
DET-FFs) less peak current after polarity assignment when the XOR gates are inserted at
the sink and non-sink level, respectively. This improvement is partially due to the
low peak currents drawn by the XOR gates and partially due to the different clock trees.
The peak current reduction effects with clock gating are summarized in Table 7.9.
As in Table 7.8, the results before and after the “/” are the
corresponding peak current or improvement values for the traditional clock trees
with SET-FFs and the traditional clock trees with DET-FFs, respectively. After
the reconfigurable polarity assignment at the clock gating events, the maximum peak
currents are reduced by 19.1% (29.4% compared to the traditional clock trees with
DET-FFs) and 20.9% (30.9% compared to the traditional clock trees with DET-FFs)
with XOR gates inserted at the sink and non-sink level, respectively, relative to the clock
Table 7.8: Peak current comparison for clock trees with XORs without clock
gating (without polarity assignment).
Circuit  Bufs/invs                          Sink lv. XOR insertion           Non-sink lv. XOR insertion
         Org.       PA         Imprv.       PA     Over Org.    Over PA      PA     Over Org.    Over PA
         (mA)       (mA)                    (mA)                             (mA)
s13207   22.6/23.6  16.5/17.9  26.9%/23.9%  16.1   28.5%/31.5%  2.2%/9.9%    15.0   33.4%/36.2%  8.9%/16.1%
s15850   13.7/14.4  9.5/10.7   30.6%/25.3%  9.0    34.3%/37.4%  5.3%/16.2%   9.7    29.2%/32.6%  -2.1%/9.7%
s35932   49.6/56.5  34.4/41.8  30.6%/26.0%  29.0   41.5%/48.7%  15.7%/30.7%  30.9   37.7%/45.4%  10.3%/26.2%
s38417   47.9/51.6  34.6/40.3  27.8%/21.9%  35.6   25.7%/31.0%  -2.9%/11.6%  30.4   36.6%/41.1%  12.3%/24.6%
s38584   69.1/70.7  44.4/48.7  35.7%/31.2%  37.1   46.3%/47.6%  16.5%/23.8%  45.5   34.2%/35.7%  -2.5%/6.5%
Avg.                           30.3%/25.7%         35.2%/39.2%  7.4%/18.5%          34.2%/38.2%  5.4%/16.6%
Table 7.9: Peak current comparison for clock trees with XORs with clock gating.
Circuit  Bufs/invs             Sink lv. XOR insertion     Non-sink lv. XOR insertion
         Org.       Gated      Gated+RePA   Over Gated    Gated+RePA   Over Gated
         (mA)       (mA)       (mA)                       (mA)
s13207   22.6/23.6  14.6/16.0  12.2         16.4%/23.8%   11.4         21.4%/28.3%
s15850   13.7/14.4  7.8/9.4    7.0          10.2%/25.6%   6.8          11.8%/27.0%
s35932   49.6/56.5  29.9/35.5  23.1         22.8%/34.9%   22.8         23.8%/35.7%
s38417   47.9/51.6  32.2/37.1  28.4         12.0%/23.5%   21.7         32.7%/41.5%
s38584   69.1/70.7  41.5/45.3  27.4         34.0%/39.5%   35.3         15.0%/22.1%
Avg.                                        19.1%/29.4%                20.9%/30.9%
gated buffer/inverter based polarity assigned circuit. These additional improvements
are achieved by the reconfigurable polarity assignment enabled by the XOR gate
insertion, which demonstrates the advantage of the proposed flow.
The skew and skew degradation of the buffer/inverter replacement based polarity
assignment flow in previous works are compared against the proposed XOR based
polarity assignment flow in Table 7.10. The polarity assignment degrades the clock
skew for both the conventional buffer/inverter trees and the XOR gate inserted trees,
albeit at different rates. It is observed that unlike the conventional buffer/inverter
replacement based polarity assignment, the skew degradation is very limited for the
proposed XOR gate based clock tree synthesis scheme. The average global skews of the clock
trees synthesized with BUFX2 buffers, sink level XORX2 gates and non-sink level
XORX2 gates are 55.0ps, 51.7ps and 45.1ps, respectively. As discussed in Section 7.2,
in the buffer/inverter based polarity assignment flow, an inverter with a similar delay
that also satisfies the slew rate and power constraints is used when performing the
polarity assignment. With the buffer/inverter based polarity assignment flow, the
skew increases to 74.3ps due to the unavoidable delay difference between buffers and
inverters of the same size. With the XOR based polarity assignment flow, the
polarity is assigned without buffer/inverter replacement, so the skew increase
is determined by the difference in propagation delay between the positive and negative
polarity configurations of the XOR gates. The XOR gate should have similar delays for
the positive and negative polarity configurations, which is often the case in cell
libraries. The skews after polarity assignment with XOR gates inserted at the sink
and non-sink levels are increased by 8.8ps and 1.7ps, respectively, to
60.5ps and 46.8ps. The skew increase is smaller when the XOR gates are
inserted at the non-sink level of the clock tree because the capacitive loads of the
non-sink level XOR gates are generally smaller and the XORX2 gate has a smaller
delay difference between the positive and negative polarity configurations when the
capacitive load is smaller.
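The positive and negative polarity configurations of an XOR gate can be illustrated with a short logic-level Python sketch. It relies only on the standard XOR identity (a control input of 0 passes the clock unchanged, a control input of 1 inverts it) and assumes that the control input is tied to the polarity value, with 0 denoting positive polarity; the function names are hypothetical and no delay or current behavior is modeled.

```python
from typing import List


def xor_clock_buffer(clk: int, polarity: int) -> int:
    """Logic value at the XOR output for a clock value and a polarity control bit."""
    return clk ^ polarity


def drive(clk_samples: List[int], polarity: int) -> List[int]:
    """Apply the XOR gate to a sampled clock waveform."""
    return [xor_clock_buffer(c, polarity) for c in clk_samples]


clock = [0, 1, 0, 1]
positive = drive(clock, 0)  # behaves like a buffer
negative = drive(clock, 1)  # behaves like an inverter
print(positive, negative)   # prints "[0, 1, 0, 1] [1, 0, 1, 0]"
```

Because the same gate realizes both configurations, changing the polarity only changes the control bit, which is why no buffer/inverter replacement is needed in the XOR based flow.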
Despite the global skew being around 40-70ps in the proposed XOR-based trees,
the peak current reduction is achieved primarily through polarity assignment and not through
clock skew scheduling. This is because the duration (greater than 100ps) of the current
peak is larger than these global skew values, as illustrated in Figure 7.1. Note that
Table 7.10: Skew degradation for different polarity assignment methods.
Circuit  Bufs/invs                     Sink lv. XOR insertion        Non-sink lv. XOR insertion
         Pre PA    Post PA   Incr.     Pre PA    Post PA   Incr.     Pre PA    Post PA   Incr.
         (ps)      (ps)      (ps)      (ps)      (ps)      (ps)      (ps)      (ps)      (ps)
s13207   47.3      68.6      21.3      56.2      68.7      12.5      37.5      41.0      3.5
s15850   43.7      62.0      18.3      34.8      36.6      1.8       32.2      33.4      1.2
s35932   57.5      74.6      17.1      48.6      58.6      10.0      48.2      48.8      0.6
s38417   55.7      75.7      20.0      57.2      67.0      9.8       38.4      41.4      3.0
s38584   70.6      90.8      20.2      61.8      71.5      9.7       69.0      69.5      0.5
Avg.     55.0      74.3      19.3      51.7      60.5      8.8       45.1      46.8      1.7
the lower global skews for the proposed XOR-based clock trees do not aid in the peak
current reduction compared to the traditional buffer based trees. Nonetheless, the
peak current reduction is superior.
7.7.4 Trade-off Analysis
The area and power information of the polarity assigned trees using the buffer/inverter
replacement based flow and the XOR based flow is compared in Tables 7.11 and 7.12.
In Table 7.11, the area and power information of the XOR based flow is compared
to the buffer/inverter based flow with single-edge-triggered flip-flops (SET-FFs).
It is observed that the XOR based flow permits power savings compared to the
buffer/inverter replacement based flow. This is because the frequency of the XOR
based clock tree can be halved due to the DET-FFs triggering at both the
rising and falling clock edges. Moreover, the DET-FF consumes less power than the
regular flip-flop, as listed in Table 7.5. However, due to the XOR gates and the DET-FFs
occupying a larger area, the overall chip area is increased by 7.1% and 5.9% on
average by inserting XOR gates at the sink and non-sink level, respectively.
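The effect of halving the clock frequency can be sketched with the standard first-order model of dynamic switching power. The symbols below (activity factor $\alpha$, switched clock capacitance $C_{\mathrm{clk}}$, supply voltage $V_{DD}$, clock frequency $f$) are introduced here for illustration and are not notation from this chapter; the model ignores short-circuit and leakage power and assumes the switched capacitance is unchanged.

```latex
\[
P_{\mathrm{clk}} \approx \alpha \, C_{\mathrm{clk}} \, V_{DD}^{2} \, f
\quad\Longrightarrow\quad
P_{\mathrm{clk}}^{\mathrm{DET}} \approx \alpha \, C_{\mathrm{clk}} \, V_{DD}^{2} \, \frac{f}{2}
= \frac{1}{2} \, P_{\mathrm{clk}}
\]
```

The savings reported in Table 7.11 (23.8% and 25.1% on average) are smaller than one half, presumably because the tabulated power includes components that do not scale with the clock frequency in this way.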
In Table 7.12, the area and power information of the XOR based flow is compared
to the buffer/inverter based flow with DET-FFs. In this comparison, both the XOR
Table 7.11: Area increase and power saving information compared with the traditional
clock tree.
Circuit  Buf/inv          Sink level XOR insertion        Non-sink level XOR insertion
         Area     Pwr.    Area     Incr.  Pwr.   Sav.     Area     Incr.  Pwr.   Sav.
         (µm²)    (mW)    (µm²)           (mW)            (µm²)           (mW)
s13207   36660    16.2    39500    7.7%   12.0   25.5%    39038    6.5%   11.6   28.0%
s15850   39820    12.2    41936    5.3%   9.3    23.0%    41597    4.5%   9.1    25.2%
s35932   86821    51.6    94685    9.1%   38.5   25.4%    93517    7.7%   38.4   25.6%
s38417   85783    55.3    93097    8.5%   42.6   23.0%    91881    7.1%   41.9   24.4%
s38584   136496   55.5    142912   4.7%   43.1   22.3%    141921   4.0%   43.0   22.4%
Avg.                               7.1%          23.8%             5.9%          25.1%
Table 7.12: Area and power increase information compared with the traditional clock
tree with DET-FF.
Circuit  Buf/inv          Sink level XOR insertion        Non-sink level XOR insertion
         Area     Pwr.    Area     Incr.  Pwr.   Incr.    Area     Incr.  Pwr.   Incr.
         (µm²)    (mW)    (µm²)           (mW)            (µm²)           (mW)
s13207   38638    11.8    39500    2.2%   12.0   2.5%     39038    1.0%   11.6   -1.7%
s15850   41358    9.2     41936    1.4%   9.3    1.1%     41597    0.6%   9.1    -1.1%
s35932   92178    38.4    94685    2.7%   38.5   0.3%     93517    1.5%   38.4   0.0%
s38417   90855    41.9    93097    2.5%   42.6   1.7%     91881    1.1%   41.9   0.0%
s38584   140916   43.0    142912   1.4%   43.1   0.2%     141921   0.7%   43.0   0.0%
Avg.                               2.0%          1.2%              1.0%          -0.6%
based flow and the buffer/inverter based flow use DET-FFs as sink registers. The
power increase due to XOR gate insertion at the sink level is 1.2% on average, which is very
limited. When the XOR gates are inserted at the non-sink level of the clock tree, the
power does not increase: Table 7.12 reports an average change of -0.6%. The area
increase is very limited at 2.0%
and 1.0% with XOR gates inserted at the sink and non-sink level, respectively, since
the area overhead is only caused by the XOR gate insertion. In summary, the power and
area overhead caused by the XOR gate insertion is very limited, which demonstrates
the practicality of the proposed method.
7.8 Conclusions
This chapter presents a novel polarity assignment flow which inserts XOR gates at
either the sink or non-sink level of the clock tree based on the design requirement. The
proposed clock tree synthesis scheme together with the polarity assignment methods
is able to integrate the reconfigurability of polarity assignment into a design such that
a further reduction in peak current is achieved when clock gating occurs.
An important future application of the proposed method is to reduce the peak
power of a design dynamically. Peak power is harmful to the performance of a design
as it potentially increases the temperature in local areas. By identifying the spatial
and temporal characteristics of the peak power, the polarities of the XOR gates can
be dynamically assigned at run time to reduce the worst case peak power.
8. Conclusions and Future Directions
Due to the significance of the clock network in a synchronous system, the synthesis
and optimization of the clock network is a very active topic in industry and academia.
In this dissertation, various novel methodologies to generate high performance, low
power and low variation clock networks are proposed, which are summarized in
Section 8.1. In Section 8.2, the future directions on clock network synthesis are presented.
8.1 Conclusions
The conclusions on the proposed methodologies for different clock topologies are
summarized in the following sections.
8.1.1 Conclusions on Clock Mesh Synthesis
In Chapter 3, the proposed clock mesh network synthesis method reduces the
total switching power on the mesh network by 50% compared to the best work known
in the literature [29]. The trade-off of the proposed method is the decreased timing
slack. This is the first work that combines clock mesh synthesis with incremental
register placement for power saving purposes. It is also beneficial for reducing the
global clock skew since the stub wirelengths are very limited. This work is published
in [43, 46].
In Chapter 4, local Steiner trees are considered during the clock mesh synthesis
stage. These local Steiner trees enable clock gating for the clock mesh network, which
reduces the power on the clock mesh by 22%. This is the first work that
introduces clock gating into the clock mesh network. This work is currently under
review [45].
The proposed methods can be seamlessly integrated with the industrial design
flow, which demonstrates the high practicality of the proposed methods.
8.1.2 Conclusions on Rotary Oscillator Arrays
In Chapter 5, the proposed routing methodologies for rotary oscillator arrays
reduce the total routing wirelength by 60% compared to the best work known in the
literature [31]. This is the first work which connects the registers to the tapping points
using Steiner tree connections while guaranteeing zero skew and capacitive load
balancing.
This work is significant in that it is the first efficient Steiner tree routing
implementation for ROA synthesis which also guarantees timing. The implementation of
this work is adopted as the design automation tool for the ongoing ROA
research [1]. This work is published in [44].
8.1.3 Conclusions on Clock Polarity Assignment
In Chapter 6, a polarity assignment method with improved peak current results
is presented, which is published in [47, 50]. In Chapter 7, a novel polarity assignment
flow using XOR gates is proposed, which is published in [48, 49, 51]. This is the first
work that uses XOR gates as buffering elements on the clock tree to facilitate
polarity assignment. The proposed method is able to reconfigure the polarities of
different clock sinks at runtime during clock gating events.
The methods are implemented within the industrial design flow. It
is demonstrated that the proposed polarity assignment methods can be seamlessly
integrated with the industrial design flow as additional optimization stages or as a minor
modification to the existing flow.
8.2 Future Directions
Although these methodologies achieve significant results, future directions
are proposed to further improve the synthesis and optimization of these clock
topologies. The future directions for the different clock topologies are presented in the
following sections.
8.2.1 Clock Mesh Synthesis
Clock mesh synthesis is a very popular topic in industry and academia due to
its wide application. The previous works and the proposed methods present
significant improvements in this area. However, there is still plenty of room for
improving the synthesis and optimization of clock mesh networks. One possible
future direction is to further combine the placement stage
with the clock mesh synthesis stage. Instead of performing incremental register placement
with clock mesh synthesis, combining the whole placement stage with the clock mesh
synthesis stage to optimize multiple objectives (power, timing, routing congestion)
simultaneously is appealing.
8.2.2 Rotary Oscillator Arrays (ROA)
For ROA synthesis, a future direction is to insert buffers on the sub-trees which
connect the registers to the ROA to improve the slew rate of the delivered clock signal.
More power dissipation is expected when buffers are inserted on the sub-trees. Thus
the trade-off between the slew rate of the clock signal and the power dissipation of
the ROA network needs to be investigated.
8.2.3 Clock Polarity Assignment
A future direction on polarity assignment is to investigate the effects of polarity
assignment on power/ground networks, instead of balancing the peak
current by looking only at the clock buffers. Combining the power/ground network
synthesis and analysis with the polarity assignment is expected to yield further
peak current reduction.
Bibliography
[1] VLSI Lab at Drexel University. http://ece.drexel.edu/faculty/taskin/wiki/vlsilab/
-index.php/Research.
[2] A. Abdelhadi, R. Ginosar, A. Kolodny, and E. G. Friedman. Timing-driven
variation-aware nonuniform clock mesh synthesis. In Proceedings of the Great
Lakes Symposium on VLSI (GLSVLSI), pages 15–20, May 2010.
[3] S. Arora and M. Puri. A variant of time minimizing assignment problem. Euro-
pean Journal of Operational Research (EJOR), 110:314–325, October 1998.
[4] H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-
Wesley, Reading, MA, 1990.
[5] L. Benini, P. Vuillod, A. Bogliolo, and G. D. Micheli. Clock skew optimization for
peak current reduction. Journal of VLSI Signal Processing Systems, 16(2/3):114–
130, June 1997.
[6] P. E. Black. Munkres’ assignment algorithm. Dictionary of Algorithms and Data
Structures, May 2006.
[7] K. Boese and A. Kahng. Zero-skew clock routing trees with minimum wirelength.
In Proceedings of the IEEE International ASIC Conference and Exhibit, pages
17–21, Sep. 1992.
[8] R. E. Burkard and E. Cela. Linear assignment problems and extensions, 1998.
[9] R. E. Burkard, M. Dell’Amico, and S. Martello. Assignment Problems. SIAM,
1st edition, 2009.
[10] T.-H. Chao, Y.-C. H. Hsu, J.-M. Ho, K. D. Boese, and A. B. Kahng. Zero
skew clock routing with minimum wirelength. IEEE Transactions on Circuit
and Systems (TCAS), 39:799–814, 1992.
[11] R. Chaturvedi and J. Hu. An efficient merging scheme for prescribed skew clock
routing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
13(6):750–754, June 2005.
[12] C.-P. Chen, C. C. N. Chu, and D. F. Wong. Fast and exact simultaneous gate
and wire sizing by lagrangian relaxation. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 18(7):1014–1025, July 1999.
[13] P.-Y. Chen, K.-H. Ho, and T. Hwang. Skew aware polarity assignment in clock
tree. In Proceedings of the IEEE/ACM International Conference on Computer-
aided Design (ICCAD), pages 376–379, 2007.
[14] W.-K. Chen, editor. The VLSI Handbook. CRC Press, 1st edition, 1999.
[15] Y. Chen and D. F. Wong. An algorithm for zero-skew clock tree routing with
buffer insertion. In Proceedings of the European Conference on Design and
Test (ED&TC), pages 230–236, March 1996.
[16] Y. Cheon, P.-H. Ho, A. Kahng, S. Reda, and Q. Wang. Power-aware placement.
In Proceedings of ACM/IEEE Design Automation Conference (DAC), pages 795–
800, June 2005.
[17] M. Cho, D. Z. Pan, and R. Puri. Novel binary linear programming for high per-
formance clock mesh synthesis. In Proceedings of the IEEE/ACM International
Conference on Computer-aided Design (ICCAD), pages 438–443, 2010.
[18] J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. Tsao. Bounded-skew clock and
steiner routing. ACM Transactions on Design Automation of Electronic Systems
(TODAES), 3(3):341–388, 1998.
[19] V. H. Cordero and S. P. Khatri. Clock distribution scheme using coplanar trans-
mission lines. In Proceedings of the Design, Automation and Test in Europe
(DATE), pages 985–990, Mar. 2008.
[20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to
Algorithms. MIT Press, 2nd edition, 2001.
[21] J. Czyzyk, M. Mesnier, and J. More. The NEOS server. IEEE Journal on
Computational Science and Engineering, 5(3):68–75, July 1998.
[22] M. Desai, R. Cvijetic, and J. Jensen. Sizing of clock distribution networks for high
performance cpu chips. In Proceedings of the ACM/IEEE Design Automation
Conference (DAC), pages 389–394, June 1996.
[23] A. Drake, K. Nowka, T. Nguyen, J. Burns, and R. Brown. Resonant clock-
ing using distributed parasitic capacitance. IEEE Journal of Solid-State Cir-
cuits (JSSC), 39(9):1520–1528, Sep. 2004.
[24] M. Edahiro. A clustering-based optimization algorithm in zero-skew routing.
In Proceedings of the ACM/IEEE Design Automation Conference (DAC), pages
612–616, June 1993.
[25] W. Elmore. The transient response of damped linear networks with particular
regard to wideband amplifiers. Journal of Applied Physics (AIP), 19(1):55–63,
January 1948.
[26] A. Farrahi, C. Chen, A. Srivastava, G. Tellez, and M. Sarrafzadeh. Activity-
driven clock design. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 20(6):705–714, June 2001.
[27] A. L. Fisher and H. T. Kung. Synchronizing large systolic arrays. In SPIE, pages
44–52, May 1982.
[28] E. G. Friedman. Clock Distribution Networks in VLSI Circuits and Systems.
IEEE Press, 1995.
[29] M. R. Guthaus, G. Wilke, and R. Reis. Non-uniform clock mesh optimization
with linear programming buffer insertion. In Proceedings of the ACM/IEEE
Design Automation Conference (DAC), pages 74–79, June 2010.
[30] V. Honkote and B. Taskin. Custom rotary clock router. In IEEE International
Conference on Computer Design (ICCD), pages 114–119, Oct. 2008.
[31] V. Honkote and B. Taskin. Zero clock skew synchronization with rotary clock-
ing technology. In International Symposium on Quality of Electronic Design
(ISQED), pages 588–593, Mar. 2009.
[32] V. Honkote and B. Taskin. Analysis, design and simulation of capacitive load
balanced rotary oscillatory array. In International Conference on VLSI De-
sign (VLSID), pages 218–223. IEEE Computer Society, Jan. 2010.
[33] V. Honkote and B. Taskin. PEEC based parasitic modeling for power analysis
on custom rotary rings. In ACM/IEEE International Symposium on Low-Power
Electronics and Design (ISLPED), pages 111–116, Aug. 2010.
[34] V. Honkote and B. Taskin. Skew analysis and bounded skew constraint method-
ology for rotary clocking technology. In International Symposium on Quality
Electronic Design (ISQED), pages 413–417, Mar. 2010.
[35] V. Honkote and B. Taskin. Skew-aware capacitive load balancing for low-power
zero clock skew rotary oscillatory array. In IEEE International Conference on
Computer Design (ICCD), pages 209–214, Oct. 2010.
[36] S.-H. Huang, C.-M. Chang, and Y.-T. Nieh. Fast multi-domain clock skew
scheduling for peak current reduction. In Proceedings of the Asia and South
Pacific Design Automation Conference (ASPDAC), pages 254–259, Jan. 2006.
[37] H. Jang and T. Kim. Simultaneous clock buffer sizing and polarity assignment
for power/ground noise minimization. In Proceedings of the ACM/IEEE Design
Automation Conference (DAC), pages 794–799, July 2009.
[38] S. Y. Kung and R. J. Gal-Ezer. Synchronous versus asynchronous computation
in very large scale integration (vlsi) array processors. In SPIE, volume 341, pages
53–65, May 1982.
[39] N. Kurd, J. Barkarullah, R. Dizon, T. Fletcher, and P. Madland. A multigiga-
hertz clocking scheme for the Pentium(R) 4 microprocessor. IEEE Journal of
Solid-State Circuits (JSSC), 36(11):1647–1653, Nov. 2001.
[40] W.-C. D. Lam, C.-K. Koh, and C.-W. A. Tsao. Power supply noise suppression
via clock skew scheduling. In Proceedings of the International Symposium on
Quality of Electronic Design (ISQED), pages 355–360, 2002.
[41] L.-T. Wang, Y.-W. Chang, and K.-T. Cheng. Electronic Design Automation: Synthesis,
Verification, and Test. Morgan Kaufmann, 2009.
[42] G. Le Grand de Mercey. A 18GHz rotary traveling wave VCO in CMOS with I/Q
outputs. In Proceedings of the European Solid-State Circuits Conference (ESS-
CIRC), pages 489–492, Sep. 2003.
[43] J. Lu, Y. Aksehir, and B. Taskin. Register on mesh (ROME): A novel approach
for clock mesh network synthesis. In Proceedings of the IEEE International
Symposium on Circuits and Systems (ISCAS), pages 1219–1222, May 2011.
[44] J. Lu, V. Honkote, X. Chen, and B. Taskin. Steiner tree based rotary clock
routing with bounded skew and capacitive load balancing. In Proceedings of the
Design, Automation and Test in Europe (DATE), pages 455–460, Mar. 2011.
[45] J. Lu, X. Mao, and B. Taskin. Clock mesh synthesis with gated local trees and
activity driven register clustering. Submitted to IEEE/ACM International
Conference on Computer-aided Design (ICCAD), 2011.
[46] J. Lu, X. Mao, and B. Taskin. Timing slack aware incremental register placement
with non-uniform grid generation for clock mesh synthesis. In Proceedings of
the International Symposium on Physical Design (ISPD), pages 131–138, March
2011.
[47] J. Lu and B. Taskin. Clock buffer polarity assignment considering capacitive
load. In International Symposium on Quality Electronic Design (ISQED), pages
765–770, March 2010.
[48] J. Lu and B. Taskin. Clock tree synthesis with XOR gates for polarity as-
signment. In Proceedings of the IEEE Computer Society Annual Symposium on
VLSI (ISVLSI), July 2010.
[49] J. Lu and B. Taskin. Reconfigurable clock polarity assignment for peak cur-
rent reduction of clock-gated circuits. In Proceedings of the IEEE International
Symposium on Circuits and Systems (ISCAS), pages 1940–1943, May 2011.
[50] J. Lu and B. Taskin. Clock buffer polarity assignment with skew tuning. ACM
Transactions on Design Automation of Electronic Systems (TODAES), In pre-
print.
[51] J. Lu, Y. Teng, and B. Taskin. A reconfigurable clock polarity assignment flow for
clock gated designs. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, In pre-print.
[52] F. M. Muller, M. M. Camozzatoa, and O. C. B. de Araujo. Exact algorithms for
the imbalanced time minimizing assignment problem. In Brazilian Symposium
on Graphs, Algorithms and Combinatorics, pages 122–125, April 2001.
[53] S. Nassif. Design for variability in DSM technologies [deep submicron
technologies]. In IEEE International Symposium on Quality Electronic Design (ISQED),
pages 451–454, 2000.
[54] Y.-T. Nieh, S.-H. Huang, and S.-Y. Hsu. Minimizing peak current via opposite-phase
clock tree. In Proceedings of the ACM/IEEE Design Automation Conference
(DAC), pages 182–185, June 2005.
[55] J. Oh and M. Pedram. Multi-pad power/ground network design for uniform
distribution of ground bounce. In Proceedings of the ACM/IEEE Design Au-
tomation Conference (DAC), pages 287–290, June 1998.
[56] S. Pandit and K. Srinivas. A lexisearch algorithm for traveling salesman problem.
In IEEE International Joint Conference on Neural Networks, pages 2521–2527,
Nov. 1991.
[57] M. Pedram and A. Abdollahi. Low power RT-level synthesis techniques - a tutorial.
In IEE Proceedings on Computers and Digital Techniques, volume 152, pages
333–343, May 2005.
[58] M. Pedram, Q. Wu, and X. Wu. A new design for double edge triggered flip-flops.
In Proceedings of the Asia and South Pacific Design Automation Conference (AS-
PDAC), pages 417–421, Jan. 1998.
[59] D. W. Pentico. Assignment problems: A golden anniversary survey. European
Journal of Operational Research, 176:774–794, January 2007.
[60] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolić. Digital Integrated Circuits: A
Design Perspective. Prentice Hall, 2nd edition, 2003.
[61] A. Rajaram and D. Pan. MeshWorks: An efficient framework for planning,
synthesis and optimization of clock mesh networks. In Asia and South Pacific
Design Automation Conference (ASPDAC), pages 250–257, Jan. 2008.
[62] A. Rajaram and D. Z. Pan. Variation tolerant buffered clock network synthesis
with cross links. In Proceedings of the International Symposium on Physical
Design (ISPD), pages 157–164, 2006.
[63] P. Restle, C. Carter, J. Eckhardt, B. Krauter, B. McCredie, K. Jenkins,
A. Weger, and A. Mule. The clock distribution of the Power4 microprocessor. In
Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC),
volume 1, pages 144–145, Feb. 2002.
[64] P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A.
Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler,
C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petrovick, B. L. Krauter, and
B. D. McCredie. A clock distribution network for microprocessors. IEEE Journal
of Solid-State Circuits (JSSC), 36:792–799, 2001.
[65] Y. Ryu and T. Kim. Clock buffer polarity assignment combined with clock tree
generation for power/ground noise minimization. In Proceedings of the ACM/IEEE
International Conference on Computer-aided Design (ICCAD), pages 416–419,
November 2008.
[66] R. Samanta, G. Venkataraman, and J. Hu. Clock buffer polarity assignment for
power noise reduction. In Proceedings of IEEE/ACM International Conference
on Computer-aided Design (ICCAD), pages 558–562, November 2006.
[67] R. S. Shelar. An algorithm for routing with capacitance/distance constraints
for clock distribution in microprocessors. In Proceedings of the International
Symposium on Physical Design (ISPD), pages 141–148, Mar. 2009.
[68] Synopsys. HSPICE Signal Integrity User Guide, 2009.
[69] Synopsys Inc. Design Compiler User Guide, b-2008.09 edition, September 2008.
[70] Synopsys Inc. Nanosim User Guide, b-2008.09 edition, September 2008.
[71] Synopsys Inc. Star-RCXT User Guide, b-2008.12 edition, December 2008.
[72] Synopsys Inc. IC Compiler User Guide: Implementation, b-2008.09-sp4 edition,
March 2009.
[73] Synopsys Inc. Synopsys 90nm Generic Library, 2009.
[74] C. N. Sze. ISPD 2010 high performance clock network synthesis contest: benchmark
suite and results. In Proceedings of the International Symposium on Physical
Design (ISPD), page 143, 2010.
[75] S. Tam, J. Leung, R. Limaye, S. Choy, S. Vora, and M. Adachi. Clock gener-
ation and distribution of a dual-core Xeon processor with 16MB L3 cache. In
Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC),
pages 1512–1521, Feb. 2006.
[76] B. Taskin, J. Demaio, O. Farell, M. Hazeltine, and R. Ketner. Custom topol-
ogy rotary clock router with tree subnetworks. ACM Transactions on Design
Automation of Electronic Systems, 14(3):1–14, 2009.
[77] B. Taskin and I. S. Kourtev. Delay insertion method in clock skew scheduling. In
Proceedings of the International Symposium on Physical Design (ISPD), pages
47–54, April 2005.
[78] B. Taskin, J. Wood, and I. S. Kourtev. Timing-driven physical design for VLSI
circuits using resonant rotary clocking. In IEEE International Midwest Sym-
posium on Circuits and Systems (MWSCAS), volume 1, pages 261–265, Aug.
2006.
[79] R.-S. Tsay. Exact zero skew. In IEEE International Conference on Computer-
Aided Design (ICCAD), pages 336–339, Nov. 1991.
[80] S. H. Unger. Double-edge-triggered flip-flops. IEEE Transactions on Computers
(IETC), 30:447–451, June 1981.
[81] G. Venkataraman, Z. Feng, J. Hu, and P. Li. Combinatorial algorithms for fast
clock mesh optimization. IEEE Transactions on Very Large Scale Integration
Systems (TVLSI), 18(1):131–141, Jan. 2010.
[82] G. Venkataraman, J. Hu, and F. Liu. Integrated placement and skew optimiza-
tion for rotary clocking. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, 15(2):149–158, Feb. 2007.
[83] A. Vittal, H. Ha, F. Brewer, and M. Marek-Sadowska. Clock skew optimiza-
tion for ground bounce control. In Proceedings of the IEEE/ACM International
Conference on Computer-aided Design (ICCAD), pages 395–399, Nov. 1996.
[84] J. Wood, T. Edwards, and S. Lipa. Rotary traveling-wave oscillator arrays: a
new clock technology. IEEE Journal of Solid-State Circuits, 36(11):1654–1665,
Nov. 2001.
[85] T. Xanthopoulos, D. Bailey, A. Gangwar, M. Gowan, A. Jain, and B. Prewitt.
The design and analysis of the clock distribution network for a 1.2 GHz Alpha
microprocessor. In Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), pages 402–403, Feb. 2001.
[86] Z. Yu and X. Liu. Power analysis of rotary clock. In Proceedings of the IEEE
Computer Society Symposium on VLSI, pages 150–155, May 2005.
[87] Z. Yu and X. Liu. Design of rotary clock based circuits. In 44th ACM/IEEE
Design Automation Conference (DAC), pages 43–48, June 2007.
[88] M. Zhao, Y. Fu, V. Zolotov, S. Sundareswaran, and R. Panda. Optimal place-
ment of power supply pads and pins. In Proceedings of the ACM/IEEE Design
Automation Conference, pages 165–170, 2004.
[89] C. Zhuo, H. Zhang, R. Samanta, J. Hu, and K. Chen. Modeling, optimization
and control of rotary traveling-wave oscillator. In IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pages 476–480, Nov. 2007.
Vita
Jianchao Lu was born in Beijing, China. He received a Bachelor’s degree in Electronics
and Information Engineering from Zhejiang University, Hangzhou, China in
2007. He received a Master’s degree in Computer Engineering from Drexel University,
Philadelphia, PA in 2009. During his PhD study, he focused on the synthesis
and optimization of high performance clock networks including clock mesh, rotary
oscillator arrays (ROA) and clock tree. His research interests include physical design
methodologies, clock network synthesis, timing/power analysis and optimization, and
parallel CAD algorithms.
Jianchao served as a teaching and research assistant at Drexel University. As a
teaching assistant, he participated in teaching and developing the undergraduate-level
VLSI design course at Drexel University. As a research assistant, he has published
several papers in prestigious journals and conferences including TVLSI, TODAES,
ISPD, DATE, ISCAS, ISVLSI, ISQED, and VLSID. He has represented Drexel
in various programming and poster contests at the ICCAD and DAC conferences. He is
going to join Synopsys in Mountain View, CA upon graduation in 2011 to continue
his research in the physical design field with an emphasis on routing methodologies.

