Wave-Pipelined Multiplexed (WPM) Routing for Gigascale Integration (GSI) by Joshi, Ajay Jayant
 

















In Partial Fulfillment 
Of the Requirement for the Degree of 







School of Electrical and Computer Engineering 





Copyright © 2006 by Ajay Joshi
 



















Dr. Jeffrey Davis, Advisor 
School of Electrical and Computer 
Engineering 
Georgia Institute of Technology 
 
Dr. James Meindl 
School of Electrical and Computer 
Engineering 
Georgia Institute of Technology 
 
Dr. Sung Kyu Lim 
School of Electrical and Computer 
Engineering 
Georgia Institute of Technology 
 
Dr. Linda Milor 
School of Electrical and Computer 
Engineering 
Georgia Institute of Technology 
 
Dr. Paul Kohl 
School of Chemical and Biomolecular 
Engineering 
Georgia Institute of Technology 
 




































 I would like to take this opportunity to thank my PhD advisor, Dr. Jeffrey Davis, 
for his support and guidance. His infinite patience and constant encouragement 
have made it possible for me to complete my PhD. Working with Dr. Davis has been a 
memorable experience that I will cherish for years to come. I would also like to express 
my gratitude to Dr. Sung Kyu Lim for advising me in exploring the physical design 
limits on WPM routing. In addition, I am grateful to Dr. James Meindl, Dr. Linda Milor 
and Dr. Paul Kohl for their invaluable guidance in my research. 
 My fellow researchers, Ragu, Pranav, Gerald, Heather and Deepak, have played a 
significant role in my life at Georgia Tech. Our discussions, both technical and 
non-technical, have made my time at Georgia Tech a thoroughly enjoyable experience. 
 I would like to express my gratitude to my family for their unconditional 
love and support, which helped me successfully complete the PhD program. Finally, I would 
like to thank my wife, Vinita. Her undying love and support have made it possible for me 
to get through this important phase in my life. 
 
TABLE OF CONTENTS 
 
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

1. INTRODUCTION AND BACKGROUND
1.1. Introduction
1.2. The interconnect problem
1.3. Solutions to the interconnect problem
1.3.1. Low-k dielectric and low resistivity interconnect material
1.3.2. Repeater insertion
1.3.3. Wave-pipelining
1.3.4. Wire sharing
1.4. Proposed research
1.5. Summary of chapters

2. WAVE-PIPELINED MULTIPLEXED (WPM) ROUTING CIRCUIT DESIGN AND OPTIMIZATION
2.1. Introduction
2.2. Interconnect idleness distribution
2.3. WPM logic design and delay constraints
2.4. WPM circuit design
2.5. Validation of WPM circuit design
2.6. Optimization of WPM circuit design
2.6.1. Minimum wire area optimization
2.6.2. Low power and low area design
2.6.3. Minimum power design
2.6.4. High performance design
2.6.5. Comparison of WPM circuit optimizations
2.7. Summary
 




3.2. Crosstalk and dynamic delay
3.3. Power supply noise
3.4. Clock skew
3.5. Wave-pipelined encoded (WPE) routing
3.6. Comparison between WPM and WPE routing
3.7. Summary
 
4. DESIGN AND IMPLEMENTATION OF A MULTILEVEL INTERCONNECT 




4.2. Compact models versus circuit simulation tools
4.3. Existing tools for system-level analysis
4.3.1. System simulators using analytical circuit models
4.3.2. System simulator using circuit simulation tools
4.4. Design and implementation of HR-MINDS
4.5. Extensions to the design capabilities of MINDS
4.6. Comparison between MINDS and HR-MINDS
4.7. Validation of HR-MINDS
4.8. Summary
 
5. SYSTEM-LEVEL IMPACT OF WPM ON A FULL-CUSTOM DESIGN
5.1. Introduction
5.2. Design methodology for full-custom WPM n-tier interconnect network
5.3. Wire-area-optimized WPM design methodology
5.4. Design of power-centric WPM methodology
5.4.1. Core-area-centric low power design
5.4.2. Wire-coupling-centric low power design
5.5. Design of performance-centric WPM methodology
5.6. Summary

6. SYSTEM-LEVEL IMPACT OF WPM ON A SEMI-CUSTOM DESIGN
6.1. Introduction
6.2. Design methodology for semi-custom WPM n-tier interconnect network
6.3. Design of power-centric methodology
6.3.1. Core area-centric low power design
6.3.2. Wire-coupling-centric low power design
6.4. Design of performance-centric methodology
6.4.1. Core-area-centric high performance design
6.4.2. Wire coupling-centric high performance design
6.5. Summary

7. PHYSICAL DESIGN LIMITS AND OPPORTUNITIES FOR APPLICATION OF WPM ROUTING
7.1. Introduction
7.2. Design of source-sink proximity and run-length proximity constraints
7.3. Placement problem formulation
7.3.1. GORDIAN placement algorithm
7.3.1.1. Net list and cell description
7.3.1.2. Fixed cells
7.3.1.3. Problem formulation
7.3.1.4. Solution
7.3.2. Overlap removal and compaction algorithm
7.3.3. Determination of WPM nets
7.3.4. Simulated annealing
7.4. Results
7.5. Application of WPM routing to SPARC64 processor





8. CONCLUSIONS AND FUTURE WORK
8.1. Conclusions
8.2. Future work
8.2.1. Multi-slot routing on multi-source multi-sink nets
8.2.2. Routing algorithms to study run-length proximity
8.2.3. Tolerance of WPM routing to manufacturing and process variations
8.2.4. Enhancements to second generation HR-MINDS

REFERENCES




LIST OF TABLES 
 
1.1 MOSFET and interconnect latency for 1.0 µm, 100 nm and 35 nm technology generations
2.1 Delay in ns obtained for various stages and inverter sizes in an inverter chain
2.2 Comparison between conventional design and high performance design
2.3 Comparison between different design styles using WPM routing
3.1 Comparison of interconnect delay for conventional repeater insertion and staggered repeater insertion using HSPICE
3.2 Clock skew tolerance of the WPM circuit calculated using HSPICE
3.3 Encoded signals for various data combinations
3.4 Comparison of conventional, WPM and WPE routing techniques
4.1 MINDS vs HR-MINDS
4.2 Interconnect dimensions for the different metal levels of the SPARC64 processor
4.3 Comparison of the SPARC64 design and HR-MINDS design
5.1 Number of metal levels for the various wire sharing efficiencies. Conventional case = 9.3 (~10) metal levels
5.2 Power dissipation for the various wire sharing efficiencies. Conventional case = 19.51 W
5.3 Multilevel interconnect network design parameters for a conventional system and a WPM system
5.4 Multilevel interconnect network design parameters for a conventional system and a WPM system
5.5 Interconnect capacitance for the longest interconnect in conventional and WPM systems
 
6.1 Multilevel interconnect network design parameters for a conventional design, core 




6.2 Multilevel interconnect network design parameters for a conventional design, core area-centric design and wire coupling-centric design
7.1 Benchmark circuits used to study WPM routing
7.2 Interconnect pitch used to design the SPARC64 microprocessor [61]
7.3 Interconnect delay for different interconnect lengths
7.4 Interconnect delay and pulse width normalized to clock for different interconnect lengths
7.5 Comparison of different design styles





LIST OF FIGURES 
 
1.1 Distributed RC model of an interconnect without repeaters
1.2 Distributed RC model of an interconnect with repeaters
1.3 A wave-pipelined interconnect
1.4 Global bus shared between multiple cores
1.5 Sample layout of a network on chip (NoC) configuration
2.1 Interconnect delay normalized to clock period for different interconnect lengths and stochastic wire distribution
2.2 ‘Interconnect delay + minimum sustainable pulse width’ normalized to clock period for different interconnect lengths and stochastic wire distribution
2.3 Two uni-directional interconnects
2.4 A single WPM uni-directional interconnect
2.5 Schematic diagram for conventional routing
2.6 Schematic diagram for WPM routing
2.7 Circuitry used to generate the φmin signal from global clock
2.8 Pulse width vs interconnect length
2.9 Pulse width / wire delay vs number of repeaters
2.10 Delay circuitry using a series of inverters
2.11 Delay obtained at various stages in an inverter chain
2.12 3D plot showing the variation of delay with inverter stages and ratio of size of larger inverter to the smaller inverter in a stage
2.13 HSPICE generated timing waveforms of wave-pipelined multiplexed circuit for two interconnects satisfying the delay constraint in equation (2.5)
 
2.14 HSPICE generated timing waveforms of wave-pipelined multiplexed circuit for 




2.15 Cross-sectional view of conventional design and WPM design
2.16 WPM design - interconnect area vs number of repeaters
2.17 WPM design - transistor area vs number of repeaters
2.18 WPM design - total power vs number of repeaters
2.19 Low power and low area design - interconnect area vs number of repeaters
2.20 Low power and low area design - transistor area vs number of repeaters
2.21 Low power and low area design - total power vs number of repeaters
2.22 Minimum power design - total power vs number of repeaters
2.23 Minimum power design - transistor area vs number of repeaters
2.24 High performance design using WPM routing (Cross-sectional view of the global tier)
3.1 Interconnect system with 5 interconnects and 2 ground planes
3.2 Different switching patterns in the 5 interconnect system
3.3 WPM circuit and timing diagram showing the necessary pulse width to tolerate crosstalk noise
3.4 Staggered repeater insertion
3.5 Minimum pulse widths required to avoid loss of data integrity due to crosstalk noise (fixed interconnect length)
3.6 Minimum pulse widths required to avoid loss of data integrity due to crosstalk noise (fixed interconnect dimensions)
 




3.8 Schematic diagram of the WPE routing circuit
3.9 HSPICE validation of WPE interconnect
3.10 Transistor area vs number of repeaters
3.11 Total power vs number of repeaters
3.12 WPE design - transistor area vs number of repeaters
3.13 WPE design - total power vs number of repeaters
4.1 Interconnect delay vs interconnect width
4.2 Interconnect delay vs interconnect length
4.3 Flowchart for design flow of HR-MINDS - no repeaters
4.4 Flowchart for design flow of HR-MINDS - with repeaters
5.1 Demand function variance for different cutoff lengths
5.2 Variation of reduction in the required wire area with different cutoff lengths for various wire sharing efficiencies
5.3 Variation of percent increase in the dynamic power with different cutoff lengths for different wire sharing efficiencies
5.4 Percent reduction in wire area when WPM routing is applied to interconnects satisfying delay constraint in equation (2.5)
5.5 Percent increase in dynamic power when WPM routing is applied to interconnects satisfying delay constraint in equation (2.5)
5.6 Percent reduction in power dissipation and core area varying with cutoff length
5.7 Total power dissipation in a conventional and a WPM system with core area reduction
5.8 Percent reduction in power and percent increase in wire spacing for a power-centric design using core area reduction
5.9 Total power dissipation in a conventional and a WPM system with wire spacing and dielectric thickness increase
 
6.1 Power reduction in a core area-centric design





6.3 Power reduction in a wire coupling-centric design (no change in the inter-level dielectric thickness)
6.4 Power reduction in a wire coupling-centric design
6.5 Power components in a conventional and wire coupling-centric designs
6.6 Percent improvement in performance for a performance-centric design using core area reduction
6.7 Percent reduction in power in a performance-centric design using core area reduction
6.8 Percent improvement in performance of a performance-centric design using wire-spacing increase
6.9 Percent increase in power in a performance-centric design using wire spacing increase
7.1 Source-sink proximity for WPM design
7.2 Run-length proximity for WPM design
7.3 Placement strategy to determine WPM interconnects
7.4 Wire sharing efficiency for various benchmarks
7.5 Percent reduction in wirelength for different benchmarks
7.6 Percent increase in transistor area for various benchmark circuits
7.7 Percent increase in leakage and dynamic power for various benchmarks
7.8 Abstraction of the floor plan showing a subset of macrocells and the corresponding inter-macrocell interconnects
 
7.9 “Delay + pulse width” normalized to clock period for interconnects between 






The main objective of this research is to develop a pervasive wire sharing technique that 
can be easily applied across the entire range of on-chip interconnects in a very large scale 
integration (VLSI) system. A wave-pipelined multiplexed (WPM) routing technique that 
can be applied to both intra-macrocell and inter-macrocell interconnects is proposed in this 
thesis. It is shown that an extensive application of the WPM routing technique can 
provide significant advantages in terms of area, power and performance. In order to study 
the WPM routing technique, a hierarchical approach is adopted. Circuit-level, 
system-level and physical-level analyses are completed to explore the limits of and 
opportunities for applying WPM routing to current VLSI and future gigascale integration (GSI) systems. 
Design, verification and optimization of the WPM circuit and measurement of its 
tolerance to external noise constitute the circuit-level analysis. The physical-level study 
involves designing wire sharing-aware placement algorithms to maximize the advantages 
of WPM routing. A system-level simulator that designs the entire multilevel interconnect 
network is developed to perform the system-level analysis. The effect of WPM routing on 












The semiconductor industry has exhibited exponential growth over the last three to 
four decades. The inventions of the solid-state transistor in 1948 and integrated circuit 
technology in 1958 have been key driving forces in the emergence of the semiconductor 
field. Gordon Moore developed an empirical formula in 1965 that states that the number 
of transistors on a chip will double every two years [1] [2] [3]. Satisfying Moore’s law 
has often served as the barometer of a company’s progress in the semiconductor industry. 
The use of technology scaling and different process/material solutions has enabled 
companies to satisfy Moore’s law. However, with the trend towards deep sub-micron 
(DSM) technologies, it has become a great challenge to double the number of transistors 
on a chip every two years. Various innovative solutions have been proposed in both 
academia and industry to maintain this accelerated growth. Collaboration among a 
consortium of industrial, academic and government organizations has resulted in the 
development of the biennial International Technology Roadmap for Semiconductors 






the next fifteen years [4]. The ITRS projects that it will be possible to design and 
manufacture multi-billion transistor systems by the end of the current decade. Intel has 
already designed and manufactured the next generation product of the Itanium family 
containing 1.72 billion transistors [5]. This continuous increase in transistor density with 
each new technology generation has created new challenges for design and 
manufacturing in the DSM regime. One main challenge addressed in this 
dissertation is the wiring of these billion-transistor systems. 
  
1.2. The interconnect problem 
Constant field scaling of transistors is one of the most widely adopted strategies to 
improve performance and productivity of a semiconductor system [6]. Individual 
transistors are becoming smaller and faster, and are dissipating lower power with each 
new technology generation because constant field scaling reduces transistor size and the 
power-delay product. On the other hand, interconnect scaling at constant wire length 
increases the distributed resistance-capacitance product and hence results in larger 
interconnect latency [6]. Table 1.1 compares the metal oxide 
semiconductor field effect transistor (MOSFET) switching delay with the RC response time 
of a 1 mm long interconnect for different technology generations, after [6]. It can be 
observed from the table that in going from 1.0 µm technology to 100 nm technology, 
the 1 mm long interconnect changes from being 20 times faster to 6 times 
slower than the MOSFET. In addition, for 35 nm technology, the interconnect is about 
100 times slower than the MOSFET. Hence, the performance of future gigascale 






Table 1.1: MOSFET and interconnect latency for 1.0 µm, 100 nm and 35 nm technology 
generations. 

Technology generation     MOSFET switching delay      RC response time 
                          (td = CV / I) (ps)          (Lint = 1.0 mm) (ps) 
1.0 µm (Al, SiO2)         ~ 20                        ~ 1 
100 nm (Cu, k = 2.0)      ~ 5                         ~ 30 
35 nm (Cu, k = 2.0)       ~ 2.5                       ~ 250 
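
The ratios implied by Table 1.1 can be checked with a short calculation. The following is a minimal sketch that only divides the approximate table entries; the dictionary layout and the function name are illustrative assumptions, not part of the original analysis.

```python
# Approximate values transcribed from Table 1.1, in picoseconds.
TABLE_1_1 = {
    "1.0 um (Al, SiO2)": {"mosfet_ps": 20.0, "rc_1mm_ps": 1.0},
    "100 nm (Cu, k = 2.0)": {"mosfet_ps": 5.0, "rc_1mm_ps": 30.0},
    "35 nm (Cu, k = 2.0)": {"mosfet_ps": 2.5, "rc_1mm_ps": 250.0},
}


def interconnect_to_mosfet_ratio(entry):
    """Return the RC response time divided by the MOSFET switching delay."""
    return entry["rc_1mm_ps"] / entry["mosfet_ps"]


ratios = {name: interconnect_to_mosfet_ratio(e) for name, e in TABLE_1_1.items()}

for name, ratio in ratios.items():
    if ratio < 1:
        print(f"{name}: 1 mm interconnect is {1 / ratio:.0f}x faster than the MOSFET")
    else:
        print(f"{name}: 1 mm interconnect is {ratio:.0f}x slower than the MOSFET")
```

With the table entries as given, the 1 mm interconnect goes from 20 times faster than the MOSFET (1.0 µm) to 6 times slower (100 nm) and 100 times slower (35 nm).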
 
 The increase in transistor count increases the number of interconnects that need to 
be routed. In addition, high clock frequencies necessitate reverse scaling of global and 
semi-global interconnects so that they satisfy the timing constraints. As a result, there is 
an increase in the number of metal levels with each new technology generation [4], 
resulting in a non-trivial increase in the manufacturing cost. It is therefore essential to 
look at interconnect design techniques that will reduce the impact of multilevel 
interconnect networks on the power, performance and cost of the entire system. A 
hierarchical framework of these limits and opportunities for the design of interconnects in 
the future GSI systems has been described in [7] [8]. This hierarchical approach provides 
a comprehensive overview of the design challenges that will be faced by the design 
engineers in future technology generations. 
 
1.3. Solutions to the interconnect problem 
Various interconnect design techniques have been proposed in order to reduce the 
effect of interconnect design on the overall system performance and manufacturing cost. 
Some of the solutions include using low-k dielectric to reduce switching capacitance, 
reducing interconnect resistivity to reduce interconnect resistance, inserting repeaters on 






improve interconnect throughput and sharing wires among multiple source-sink pairs 
(e.g. wire buses and network-on-chip (NoC)) to reduce channel count in global metal 
levels. 
 
1.3.1. Low-k dielectric and low resistivity interconnect material 
 
The ground capacitance and coupling capacitance of an interconnect are directly 
proportional to the permittivity of the dielectric. The use of a low-k dielectric reduces the 
total ground and coupling capacitance, which reduces interconnect and device power 
dissipation and RC interconnect delay. For example, in Intel’s 90 nm technology 
generation, carbon-doped silicon dioxide is used to achieve a bulk dielectric constant of 
2.9 [9], in order to reduce interconnect capacitance. With the reduction in interconnect 
switching capacitance, for a fixed delay constraint, it is possible to reduce wire size to 
reduce the wiring area. In addition, for a fixed wiring area constraint, it is possible to 
reduce the interconnect delay to obtain greater system performance. 
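
Because capacitance is proportional to permittivity, the benefit of a low-k dielectric can be estimated with a simple ratio. The sketch below assumes a reference dielectric constant of 3.9 for undoped silicon dioxide (a typical textbook value, not stated in the text) and assumes that geometry and wire resistance are unchanged; the function name is hypothetical.

```python
K_SIO2 = 3.9   # assumption: typical relative permittivity of undoped SiO2
K_LOW_K = 2.9  # bulk dielectric constant quoted in the text for the 90 nm process


def capacitance_scale(k_new, k_ref):
    """Ratio of new to reference capacitance for identical wire geometry."""
    return k_new / k_ref


scale = capacitance_scale(K_LOW_K, K_SIO2)

# With resistance held fixed, the RC delay scales by the same factor,
# and so does the capacitive switching energy per transition.
print(f"relative capacitance and RC delay: {scale:.2f}")
print(f"reduction: {100 * (1 - scale):.0f}%")
```

Under these assumptions the capacitance, and hence the RC delay at fixed resistance, drops by roughly a quarter.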
 Similarly, the interconnect resistance is directly proportional to the resistivity of 
the interconnect material. Hence, the use of dual damascene copper, which has low 
resistivity, reduces the interconnect resistance, which in turn reduces the RC interconnect 
delay. For example, in Intel’s 90 nm technology generation, 7 layers of copper wires with 
copper vias are used to reduce interconnect delay and to increase resistance to 
electromigration [9]. As in the case of the low-k dielectric solution, for a given operating 
frequency, the reduction in RC interconnect delay allows the use of narrower wires 
for data transmission, thus reducing the interconnect routing area. The reduction in 






opportunity to further reduce both wire-limited and transistor-limited die area and power 
dissipation. 
 
1.3.2. Repeater insertion 
 Repeaters can be inserted on interconnects to reduce the dependency of 
interconnect delay on interconnect length [10], [11]. Figure 1.1 and Figure 1.2 show a 
distributed RC model of an interconnect without repeaters and with repeaters, 
respectively. It is shown in [10] and [11] that insertion of an optimal number and optimal 
size of equi-spaced repeaters on interconnects improves the interconnect delay to 
interconnect length relationship from quadratic to linear. The expression given in [11] for 
the 50% time delay of an interconnect when an optimal number and optimal size of equi-
spaced repeaters are placed on the interconnect is 
$T = 2.46\sqrt{R_o R_{int} C_o C_{int}}$      (1.1) 
where Ro is the output resistance of a minimum sized inverter, Co is the input capacitance 
of a minimum sized inverter, Rint is the interconnect resistance and Cint is the interconnect 
capacitance. 
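
The linear dependence on length can be illustrated numerically. The sketch below reads equation (1.1) as T = 2.46 * sqrt(Ro * Rint * Co * Cint), which is the dimensionally consistent form, and uses hypothetical device and per-millimetre wire values chosen only for illustration; none of these numbers come from the thesis.

```python
import math


def optimal_repeater_delay(r_o, c_o, r_int, c_int):
    """50% delay with optimal repeaters, reading equation (1.1) with a square root."""
    return 2.46 * math.sqrt(r_o * r_int * c_o * c_int)


# Hypothetical values (assumptions for illustration only).
R_O = 10e3          # output resistance of a minimum-sized inverter, ohms
C_O = 1e-15         # input capacitance of a minimum-sized inverter, farads
R_PER_MM = 100.0    # wire resistance per millimetre, ohms
C_PER_MM = 0.2e-12  # wire capacitance per millimetre, farads


def delay_for_length(length_mm):
    """Delay of an optimally repeated wire whose R and C both grow with length."""
    r_int = R_PER_MM * length_mm
    c_int = C_PER_MM * length_mm
    return optimal_repeater_delay(R_O, C_O, r_int, c_int)


for length in (1.0, 2.0, 4.0):
    print(f"{length:.0f} mm: {delay_for_length(length) * 1e12:.1f} ps")
```

Since Rint and Cint are each proportional to length, their product under the square root is proportional to length squared, so doubling the length doubles the delay instead of quadrupling it.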
 The expressions for the optimal number (kopt) and optimal size (hopt) of repeaters 





















Figure 1.1: Distributed RC model of an interconnect without repeaters. 
 
Figure 1.2: Distributed RC model of an interconnect with repeaters. 
 
 Though inserting an optimal number and optimal size of repeaters minimizes the 
delay, [12] has shown that only a 10% increase in delay is observed when 50% of the 
optimal number of repeaters is inserted. Similarly, [13] has shown that Bakoglu’s expression for 
optimal repeater size overestimates the required transistor size: a repeater whose size is 
43% smaller than the optimal size incurs only a 6.8% delay penalty. Thus, inserting a 
suboptimal number and suboptimal size of repeaters may be more advantageous. 
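
The shallow delay penalty for using fewer repeaters can be reproduced with a first-order model. The sketch below assumes the commonly cited Bakoglu-style 50% delay expression for a wire split into k segments driven by repeaters h times the minimum size; the 0.7 and 0.4 coefficients, the closed-form optima and the component values are assumptions that do not appear in the text, and k is treated as a continuous quantity.

```python
import math


def repeated_wire_delay(k, h, r_o, c_o, r_int, c_int):
    """50% delay of a wire with k repeaters of size h (assumed Bakoglu-style model)."""
    return k * (0.7 * (r_o / h) * (c_int / k + h * c_o)
                + (r_int / k) * (0.4 * c_int / k + 0.7 * h * c_o))


def optimal_repeaters(r_o, c_o, r_int, c_int):
    """Closed-form optimum (k_opt, h_opt) of the model above."""
    k_opt = math.sqrt(0.4 * r_int * c_int / (0.7 * r_o * c_o))
    h_opt = math.sqrt(r_o * c_int / (r_int * c_o))
    return k_opt, h_opt


# Hypothetical values (assumptions for illustration only).
R_O, C_O = 10e3, 1e-15      # minimum inverter: ohms, farads
R_INT, C_INT = 1e3, 2e-12   # whole wire: ohms, farads

k_opt, h_opt = optimal_repeaters(R_O, C_O, R_INT, C_INT)
t_opt = repeated_wire_delay(k_opt, h_opt, R_O, C_O, R_INT, C_INT)
t_half = repeated_wire_delay(0.5 * k_opt, h_opt, R_O, C_O, R_INT, C_INT)
penalty = t_half / t_opt - 1.0

print(f"delay penalty with half the optimal repeater count: {100 * penalty:.1f}%")
```

In this model, halving the repeater count raises the delay by roughly 11%, in line with the approximately 10% figure quoted from [12], and the delay at the optimum agrees with the 2.46 coefficient of equation (1.1). The model is not expected to reproduce the 6.8% repeater-size figure from [13], which is based on a different analysis.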
 A direct result of repeater insertion is the improvement in interconnect 
performance because of the reduction in interconnect delay. For a fixed interconnect 
performance, the use of repeaters also enables scaling of interconnect dimensions and 
device sizes, which could help reduce area and power of interconnect circuits. Thus, the 
reduction in interconnect delay because of repeater insertion enables design of a 
minimum power, minimum area or maximum performance interconnect circuit. Various 
optimization techniques have been proposed in [14], [15], [16], [17] to optimize different 
combinations of power, area and performance of interconnect circuits using repeater 
insertion. In the case of large macrocells, repeater insertion on interconnects can be used 






frequency, minimum number of metal levels or minimum power dissipation. These 
optimization strategies are extensively discussed in [12]. 
 
1.3.3. Wave-pipelining 
With future GSI systems operating at tens of gigahertz, global interconnects will 
require more than one clock cycle to transmit data across the chip [4]. Hence, the 
microarchitecture design of future systems must account for the multiple clock cycles that 
are necessary for across-chip communication on global interconnects. This presents an 
opportunity to pipeline data over these global interconnects and achieve higher 
throughput. Though the wave-pipelining paradigm was proposed almost 36 years ago 
[18], wave-pipelining of interconnects has become popular in the recent past [19], [20], 
[21]. The pipelining of global interconnects by inserting repeaters and/or flip-flops that 
temporarily store data and drive the next interconnect segment is shown in [22]. As a 
result, there can be multiple signals on a single interconnect at any instant of time 
resulting in a high interconnect throughput. In addition, the wave-pipelining technique 
also provides an opportunity for the designers to optimize the interconnect power and 
area [23]. Figure 1.3 shows an interconnect that is wave-pipelined using 
repeaters/flip-flops. It can be observed from Figure 1.3 that there are multiple signals, i.e. S1, S2, …, 
Sn+1, on the interconnect at an instant of time, resulting in a high interconnect throughput. 
The number of repeaters across which one signal spans in a wave-pipelined interconnect 
is dependent on the interconnect length and the number of repeaters that are inserted on 










Figure 1.3: A wave-pipelined interconnect. 
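
The throughput benefit of wave-pipelining can be sketched with simple arithmetic. The example below assumes an idealized wire with a fixed end-to-end delay and a new signal launched every launch period, ignoring the delay variations that complicate real designs; the function names and the numbers are hypothetical.

```python
import math


def signals_in_flight(wire_delay_ps, launch_period_ps):
    """Largest number of signals simultaneously travelling on the wire."""
    return math.ceil(wire_delay_ps / launch_period_ps)


def throughput_gain(wire_delay_ps, launch_period_ps):
    """Wave-pipelined throughput (one signal per launch period) relative to
    sending one signal at a time (one signal per wire delay)."""
    return wire_delay_ps / launch_period_ps


# Hypothetical example: a 250 ps global wire driven with a 100 ps launch period.
print(signals_in_flight(250, 100))  # 3
print(throughput_gain(250, 100))    # 2.5
```

In this idealized picture, a wire whose delay spans several launch periods carries several signals at once, which is the source of the higher throughput described above.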
 
 Data can be synchronized with the help of storage buffers at the sender and 
receiver side of the wave-pipelined circuit. However, synchronization becomes difficult 
due to the delay variations resulting from crosstalk noise, power supply noise, clock skew, 
clock jitter, process parameter variations, temperature-induced delay changes, 
data-dependent variations etc. [24]. In addition, the high operating frequencies of future GSI 
systems decrease the allowable margin of delay variation, making synchronization all 
the more complicated. Phase-locked loops (PLLs) and the clock-and-data recovery 
mechanism proposed in [26] can be used to correctly sample the data at the receiver side. 
A skew-insensitive receiver circuit for wave-pipelined interconnects proposed in [27] can 
also be used to successfully transmit and receive data between different clock domains. 
However, all these data receiving techniques significantly increase the overhead circuitry. 
 
1.3.4. Wire sharing 
The sharing of a single wire among multiple source-sink pairs helps reduce the 
number of routing channels, which in turn decreases the total interconnect routing area of 
a system. This reduction in routing area can help reduce the metal count and/or power 






designs are commonly used to implement wire sharing in very large scale integration 
(VLSI) systems. 
In the case of bus architectures, a global bus is shared between multiple cores of a 
system. A static priority based bus design [30] or a time division multiplexed shared bus 
design [30] can be adopted to implement bus architectures. Figure 1.4 shows a bus 










Figure 1.4: Global bus shared between multiple cores. 
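
The two bus-sharing schemes named above can be sketched in a few lines. This is a minimal illustration, assuming one bus grant per clock cycle, round-robin slot assignment for the time-division multiplexed case and lower core index meaning higher priority for the static priority case; these conventions and all function names are assumptions, not details taken from [30].

```python
def tdm_owner(cycle, n_cores):
    """Core that owns the bus in a given cycle under round-robin time slots."""
    return cycle % n_cores


def tdm_wait(core, cycle, n_cores):
    """Cycles a core must wait, starting at `cycle`, until its next slot."""
    return (core - cycle) % n_cores


def static_priority_grant(requests):
    """Grant the bus to the requesting core with the lowest index, or None."""
    for core, requesting in enumerate(requests):
        if requesting:
            return core
    return None


n_cores = 4
worst_case_wait = max(tdm_wait(core, cycle, n_cores)
                      for core in range(n_cores)
                      for cycle in range(n_cores))

print(tdm_owner(5, n_cores))                        # core 1 owns cycle 5
print(worst_case_wait)                              # n_cores - 1 cycles
print(static_priority_grant([False, True, True]))   # core 1 wins over core 2
```

Under round-robin slots the worst-case wait grows linearly with the number of cores, and under static priority a low-priority core can wait for as long as higher-priority cores keep requesting the bus.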
 
The biggest advantage of bus architectures is the simplicity of logic. The bus can 
be easily shared among multiple cores using the static priority or time-division 
multiplexed approach. It reduces the interconnect count and provides sufficient 
delineation between different cores of the system. However, given the design approach, a 
core may have to wait for a significant number of clock cycles before it gets control of 
the bus. At the same time, these bus architectures are restricted to global interconnects, 






The NoC paradigm has been extensively discussed in [31], [32], [33], [34] and 
[35], where the data is exchanged between the various source-sink pairs using either a 
circuit-switched or a packet-switched interconnect network. The proposed techniques use 
complex algorithms for correct scheduling and routing of data signals on the shared 
interconnect network between various intellectual property (IP) cores. A typical layout of 
the IP cores in a NoC configuration [31], [32] is shown in Figure 1.5. The IP cores are 
arranged in a mesh-type structure and each IP core has a switch (S) associated with it. 
This switch acts as a liaison between the IP core and the remaining system. Each IP core 
















Figure 1.5: Sample layout of a network on chip (NoC) configuration. 
 
 Similar to bus architectures, NoC design reduces the number of routing channels 
and provides delineation between the different cores of a system. NoC configuration 
can be easily applied to a system-on-chip (SoC) design. However, NoC design 






large overhead in terms of area and power and it may also lead to a possible loss in 
performance. 
 Most of the wire sharing techniques that have been proposed so far are primarily 
focused on sharing of global interconnects that connect the various cores/macrocells in a 
VLSI system. It is difficult to apply them to intra-core/intra-macrocell interconnects 
because of the complexity involved in implementing these techniques. Thus, these 
techniques do not harness the complete potential of wire sharing. A pervasive 
interconnect design technique that uses a combination of wave-pipelining and wire 
sharing using repeaters is proposed in this thesis. This technique can be easily applied to 
both intra-core/intra-macrocell and inter-core/inter-macrocell interconnects and provides 
significant advantages in terms of power, area and cost. 
 
1.4. Proposed research 
The objective of this proposed research is to develop a new VLSI interconnect 
design technique that could help overcome the detrimental effects of VLSI interconnects 
on power, area and performance of application specific integrated circuit (ASIC) and 
microprocessor products. This new technique is referred to as wave-pipelined 
multiplexed (WPM) routing, and this interconnect technique can potentially be used as 
pervasively as VLSI interconnect repeater circuits. The goal of this research is to 
rigorously study the limits and opportunities for application of the WPM routing 
technique at the circuit level, the system level and the physical design level in current and 
future digital systems. This research seeks to make contributions in the fields of physical 






The circuit-level study includes design, verification and optimization of the WPM 
circuit. The tolerance of WPM circuits to crosstalk noise, power supply noise and clock 
skew is also studied. The physical-design limits on the application of the WPM routing 
technique are investigated with the help of the GORDIAN placement algorithm, compaction 
techniques and simulated annealing. The system-level analysis involves the study of 
application of the WPM routing technique to full-custom and semi-custom/ASIC 
systems. In order to perform this system-level analysis, a multilevel interconnect network 
design simulator that is interfaced with HSPICE and RAPHAEL is developed. In 
addition, the opportunities for application of the WPM routing technique to current VLSI 
systems and future gigascale integration (GSI) systems are explored. 
 
1.5. Summary of chapters 
In order to study the limits and opportunities of application of the WPM routing 
technique, a hierarchical approach is adopted. The WPM routing technique is studied at 
the circuit level, the system level and the physical design level. A detailed description of 
this study of WPM routing at the various levels of the hierarchy is presented in the 
following chapters. 
The circuit-level analysis of the WPM routing technique is presented in Chapter 
2. The scope of implementation of wire sharing in modern day VLSI systems is explored. 
The WPM logic design and delay constraints to maintain interconnect performance are 
explained here. This chapter also includes a detailed description of the design and 






Any new circuit design technique needs to work under both best-case and worst-case 
scenarios. Chapter 3 explains the tolerance level of the WPM circuit to external 
noise. The tolerance of the WPM circuit to crosstalk noise, power supply noise and clock 
skew is discussed. A wave-pipelined encoded (WPE) routing solution that is less 
susceptible to external noise than WPM routing is also proposed. 
To study WPM routing at the next level of the hierarchy, i.e., the system level, an 
interconnect network design simulator is developed. The first generation of this simulator 
was developed based on the design methodology in [12]. In order to make the design 
methodology more accurate, the simulator is interfaced with circuit and electromagnetic 
simulation tools. A detailed description of the second generation of this system simulator 
is presented in Chapter 4. 
For the system-level analysis of the WPM routing technique, a full-custom and a 
semi-custom interconnect network are designed. For the full-custom design approach, 
there is significant flexibility in the choice of multilevel interconnect network design 
parameters. Wire-area-centric, power-centric and performance-centric methodologies 
are adopted to study WPM routing. All these methodologies are discussed in detail in 
Chapter 5. Similarly, for the semi-custom design approach, where the choice of design 
parameters is less flexible, macrocell-area-centric, power-centric and 
performance-centric design approaches are developed; they are described in Chapter 6. 
The physical design limits of WPM routing are elucidated in Chapter 7. The 
source-sink proximity constraint and run-length proximity constraint that are necessary to 
reduce overhead due to WPM routing are defined in this chapter. Placement algorithms 






routing. The advantages of using WPM routing on these benchmark circuits are presented 
in this chapter. The opportunities to apply WPM routing to a 64-bit SPARC64 processor 
are also explored in this chapter. In addition, a study of WPM design based approach 
versus material solutions for future GSI systems is completed. This analysis of limits and 
opportunities to apply WPM routing to current and future systems designed using 
integrated circuit technology is part of this chapter. 
Finally, the salient features and future work for this research are presented in 
Chapter 8. Based on the completed research certain conclusions are drawn concerning the 
WPM routing technique, and these conclusions are listed in this chapter. For the future, 
the WPM routing technique can be extended to operate as a multi-slot routing technique 
on multi-source multi-sink nets. Using routing algorithms, the run-length proximity 
among interconnects can be determined to evaluate the scope of WPM routing. 
Manufacturing variations play a crucial role in the correct functioning of a circuit. The 
tolerance of WPM routing to manufacturing and process variations could also be 







WAVE-PIPELINED MULTIPLEXED (WPM) 






 The ever-increasing transistor count in current and future digital systems has 
made it necessary to have a large number of interconnects for data transmission between 
and within a myriad of logic and memory macrocells. This global and semi-global 
interconnect complexity has resulted in systems whose performance is being increasingly 
restricted by interconnect delay and signal integrity [6]. In addition, the increase in the 
number of interconnects has resulted in an increase in the number of metal layers for 
every new technology generation, which introduces a non-trivial increase in the 
manufacturing cost of the system [4]. It is, therefore, imperative to investigate novel 
VLSI interconnect network designs that most efficiently utilize the available wiring 






 A new wave-pipelined multiplexed (WPM) routing technique is proposed in this 
thesis that takes advantage of the intra-clock period interconnect idleness to reduce the 
impact of interconnects on system performance and cost. The existence of intra-clock 
period idleness on the non-critical path interconnects in a multilevel interconnect network 
is discussed in Section 2.2. A detailed description of the WPM logic design and the 
necessary delay constraints to maintain high performance is presented in Section 2.3. The 
WPM circuit design and its validation are presented in Section 2.4 and Section 2.5, 
respectively. Finally, the optimization of the WPM circuit to minimize interconnect area, 
transistor area and power dissipation, and improve interconnect performance is discussed 
in Section 2.6. 
 
2.2. Interconnect idleness distribution 
 It is a premise of this work that a tier will consist of a pair of orthogonal routing 
levels and all the interconnects on a given tier have approximately the same pitch value, 
and this pitch value for a tier is chosen such that the longest interconnect on a tier 
satisfies a predefined delay constraint. As a result, the shorter interconnects on a tier 
remain idle for part of the clock period. The wave-pipelined multiplexed (WPM) routing 
technique uses this intra-clock period interconnect idleness to more efficiently 
communicate binary signals. Specifically, WPM takes advantage of this intra-clock 
period idleness to send a second signal during the same clock period.  
 To illustrate the amount of wire idleness that is present in a current VLSI system, 
a system-level simulator similar to [12] is used to simulate a 40 million transistor logic 






area. All interconnects on a tier have approximately the same wiring pitch and this pitch 
is proportional to the length of the longest interconnect on that tier [12].  
 Figure 2.1 shows interconnect delay normalized to the clock period for all 
interconnect lengths on different wire tiers of this simulated logic core. In addition, the 
stochastic interconnect demand function [36] for this system is also plotted as a function 
of interconnect length in Figure 2.1. The interconnect demand function gives a 
cumulative distribution of the interconnect lengths in a macrocell. The interconnect 
lengths are in units of average gate pitches. For this case study, 1 gate pitch = $\sqrt{A_c / N}$, 
where $A_c$ = area of the core and $N$ = number of gates in the core. 
 It can be observed from Figure 2.1 that the multilevel interconnect network has 
been designed such that the longest interconnect on each tier requires a maximum of 80% 
of the clock period for data transfer from source to sink. The extra 20% of the clock 
period accounts for clock skew and provides the necessary guardband to ensure a timely 
transfer of data from source to sink. It can be calculated from Figure 2.1 that 67% of 
wires with length greater than 0.1 mm require less than 60% of the available clock period 
for data transmission. WPM routing takes advantage of the resulting intra-clock period 
idle time and sends a second signal during this idle portion in a wave-pipelined fashion. 
In fact, to fully utilize WPM routing, it is recommended that most semi-global and 
global wires be designed at the RTL stage such that they have pipeline stages. 
This would be the most significant constraint on the microarchitecture that would 































































 
Figure 2.1: Interconnect delay normalized to clock period for different interconnect 
lengths and stochastic wire distribution. 
 
2.3. WPM logic design and delay constraints 
 A wave-pipelining technique similar to [37] is adopted for sending multiple 
signals on an interconnect in a clock period. The primary motivation behind using wave-
pipelining instead of time division multiplexing is to increase the number of interconnects 
that can be designed as WPM interconnects. If simple time division multiplexing is used, 
then there will be only one signal on the interconnect at any instant of time. The signals 
will be transmitted on the positive and negative edges of the clock cycle, and only those 
interconnects that have a latency of less than half a clock cycle can be designed as WPM 






bits of data on an interconnect at any instant of time. This increases the opportunity for 
wire sharing and for designing interconnects using WPM routing. 
 As explained in [37], the minimum sustainable pulse width (tpulse) that can travel down 
a repeater interconnect circuit without any loss of signal integrity is given by 
equation (2.1).  








                                                (2.1) 
where 
   $\sigma_{RC_{seg}} = R_t C_t + R_t C_{seg} + C_t R_{seg} + 0.4 R_{seg} C_{seg}$        (2.2) 
   1 1.01
4
t seg seg t seg seg
t seg seg t seg seg
R C R C R C
K














       (2.4) 
Here υ1 is the fraction of the full supply voltage at the output of the first repeater 
segment, Δrepeater is the delay of a repeater, Rseg is the resistance of the interconnect 
segment, Cseg is the switching capacitance of the interconnect segment, Rt is the output 
resistance of the repeater driver and Ct is the input capacitance of the repeater driver. 
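As a rough numerical illustration of equation (2.2), the sketch below evaluates the segment time constant as the sum of the four RC products formed from the quantities defined above, with a 0.4 weight on the distributed Rseg Cseg term. The function name and the component values are hypothetical; they are chosen only to exercise the formula and are not taken from the simulated logic core.

```python
def sigma_rc_seg(r_t, c_t, r_seg, c_seg):
    """Time constant of one repeater segment, following the form of equation (2.2).

    r_t, c_t     : output resistance (ohm) and input capacitance (F) of the repeater
    r_seg, c_seg : resistance (ohm) and capacitance (F) of the wire segment
    """
    return r_t * c_t + r_t * c_seg + c_t * r_seg + 0.4 * r_seg * c_seg


# Hypothetical values, used only to show how the terms combine.
example_sigma = sigma_rc_seg(r_t=1e3, c_t=2e-15, r_seg=500.0, c_seg=100e-15)
print(f"sigma_RCseg = {example_sigma * 1e12:.1f} ps")
```

For these illustrative values the term Rt Cseg dominates, which is typical when the driver resistance is larger than the segment resistance.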
 In WPM routing, two signals are transmitted in one clock period. The first signal 
is scheduled at the beginning of the clock period and the second signal is scheduled after 
tpulse seconds. Both signals will arrive at the respective sinks within a single clock period 












where tlatency is the 50% latency of the wire channel and tclk is the clock period. A 
guardband tguardband is provided to account for any variations in delay and clock skew. The 
condition in equation (2.5) ensures that the second signal reaches the appropriate sink 
before the end of the current clock period. The interconnects that satisfy delay constraint 
in equation (2.5) can be classified as single stage WPM (SSWPM) interconnects. Figure 
2.2 shows the plot of ‘interconnect delay + pulse width’ normalized to clock period  for 
different interconnect lengths in a 40 million transistor logic core. In addition, the 
corresponding stochastic interconnect demand function [36] for this system is also plotted 
as a function of wire length. The shaded regions in Figure 2.2 illustrate the range of 
interconnects to which the WPM routing technique can be applied without any loss of 






















































Legend: (interconnect delay + pulse width) normalized to clk period; demand function 
Shaded region: range of wires having (T + min. pulse) less than 80% clk period
 
Figure 2.2: ‘Interconnect delay + minimum sustainable pulse width’ normalized to clock 







 For the longer interconnects that do not satisfy the SSWPM delay constraint 
given by equation (2.5), the WPM technique is further modified. Even in this case, the 
first signal is sampled and transmitted at the beginning of the clock cycle (t = 0) and the 
second signal is sampled and transmitted at t = tpulse. However, the two signals do not 
reach the appropriate sinks before the end of the current clock cycle and hence, they are 
available to the receiver side circuitry only after t = tclk (i.e., during the second clock cycle). 
Since we have assumed that all the circuits of our system sample data at the beginning of 
the clock period, the data sent at t = 0 and t = tpulse is used only at t = 2 * tclk. As a result, 
there is an increase in the signal latency; however, there is no change in the wire bit rate.  
 Even if the first set of signals does not reach the respective sinks at t = tclk, the 
second set of signals can be scheduled at t = tclk without losing signal integrity due to the 
wave-pipelining effect. The second set of signals will reach the respective sinks in the 
third clock cycle and by that time the first set of signals would have already been used by 
the receiver side circuitry. The second set of signals can be used at t = 3 * tclk. Therefore, 
signals can be transmitted at the source side in every clock cycle and sampled at the sink 
side in every clock cycle; hence the overall throughput performance of the wire is 
maintained.  
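The launch and use times described in the two paragraphs above can be listed with a short script. This is only a sketch of the timing argument: the function name, the normalized clock period and the pulse width are hypothetical, and the model assumes an ideal clock with no skew.

```python
def two_cycle_latency_schedule(n_cycles, t_clk, t_pulse, latency_cycles=2):
    """Return (launch_p0, launch_p1, use_time) for each source-side clock cycle.

    latency_cycles=2 models the longer interconnects discussed above, whose data
    are consumed two clock edges after the cycle in which they were launched.
    """
    events = []
    for n in range(n_cycles):
        launch_p0 = n * t_clk                     # first signal, start of cycle n
        launch_p1 = n * t_clk + t_pulse           # second signal, t_pulse later
        use_time = (n + latency_cycles) * t_clk   # clock edge at which both are used
        events.append((launch_p0, launch_p1, use_time))
    return events


# Normalized example: t_clk = 1.0, t_pulse = 0.2 (illustrative values).
schedule = two_cycle_latency_schedule(n_cycles=3, t_clk=1.0, t_pulse=0.2)
for launch_p0, launch_p1, use_time in schedule:
    print(f"launch at {launch_p0:.1f} and {launch_p1:.1f}, used at {use_time:.1f}")
```

Consecutive use times are one clock period apart, which is the sense in which the throughput is maintained even though the latency grows to two cycles.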
 Since the latency is two clock cycles, the shared interconnect, for this case, can 
have a total delay of more than one clock cycle. Hence, the timing constraint in equation 













 For SSWPM interconnects, the interconnects were designed such that they would 
have a maximum delay of tclk (considering delay variations due to external noise). 
However, under the new constraint in equation (2.6), tpulse and tlatency can be larger. This 
provides an opportunity whereby it might be possible to further reduce silicon area and 
wire area by application of WPM routing technique. The interconnects designed to satisfy 
the delay constraint in equation (2.6) can be classified as double stage WPM (DSWPM) 
interconnects. 
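The distinction between the two classes can be expressed as a simple check. The sketch below is illustrative only: it ASSUMES that constraints (2.5) and (2.6) bound the sum of tlatency and tpulse by one and two clock periods, respectively, less the guardband, which is consistent with the surrounding description but may differ in detail from the exact equations. The function name and the normalized example values are hypothetical.

```python
def classify_wpm(t_latency, t_pulse, t_clk, t_guardband):
    """Classify a candidate shared wire as SSWPM, DSWPM or not shareable.

    ASSUMPTION: the single-stage and double-stage constraints are taken as
        t_latency + t_pulse <= 1 * t_clk - t_guardband   (single stage)
        t_latency + t_pulse <= 2 * t_clk - t_guardband   (double stage)
    """
    total = t_latency + t_pulse
    if total <= t_clk - t_guardband:
        return "SSWPM"
    if total <= 2 * t_clk - t_guardband:
        return "DSWPM"
    return "not shareable"


# Times normalized to the clock period, with a 20% guardband (illustrative).
short_wire = classify_wpm(t_latency=0.5, t_pulse=0.2, t_clk=1.0, t_guardband=0.2)
long_wire = classify_wpm(t_latency=1.1, t_pulse=0.2, t_clk=1.0, t_guardband=0.2)
too_long = classify_wpm(t_latency=1.7, t_pulse=0.2, t_clk=1.0, t_guardband=0.2)
print(short_wire, long_wire, too_long)
```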
 In the two wiring nets shown in Figure 2.3, assuming that there exists some intra-
clock period idleness, the signals that were being sent on dedicated interconnects can be 
sent over a shared interconnect as shown in Figure 2.4. A 2:1 multiplexer and 1:2 
demultiplexer are required for correct scheduling and routing of the input and output 
signals over the shared resource. Additionally, some buffers are also used at the receiver 












































Figure 2.4: A single WPM uni-directional interconnect. 
 
2.4. WPM circuit design 
 Figures 2.5 and 2.6 show the schematic diagrams of the circuitry required for 
conventional routing and WPM routing, respectively. Pipeline registers are used at the 
source and sink side in both the routing techniques for data storage. A simple low 















Figure 2.6: Schematic diagram for WPM routing. 
 
 For conventional routing, a driver, a receiver and a suboptimal number [12] and 
suboptimal size [13] of repeaters are used. A suboptimal number of repeaters is inserted 
because [12] shows that inserting 50% of the optimal number of repeaters imposes only a 
10% performance penalty. Repeater sizing is assumed to be suboptimal because Bakoglu’s 
expression [10] for optimal sizing of the repeaters overestimates the required transistor 
size [13]. A repeater whose size is 43% smaller than the optimal size incurs only a 6.8% 
delay penalty. Each repeater consists of an inverter pair. For WPM routing, a 2:1 
multiplexer and a 1:2 demultiplexer are placed at the input and output, respectively, of 
the shared wire. Buffers are used at both the outputs of the demultiplexer to maintain 
signal integrity and to hold the received value dynamically. 
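The mild cost of halving the repeater count can be checked with a back-of-the-envelope model. The sketch below uses the classic Bakoglu-style lumped delay expression for a repeated wire as a stand-in; the model actually used in [12] may differ, and the device and wire parameters are hypothetical. The repeater count is treated as a continuous variable for simplicity.

```python
import math


def repeated_wire_delay(k, h, r0, c0, r_int, c_int):
    """50% delay of a wire split into k segments, each driven by a repeater
    that is h times the minimum size (Bakoglu-style lumped model)."""
    return k * (0.7 * (r0 / h) * (c_int / k + h * c0)
                + (r_int / k) * (0.4 * c_int / k + 0.7 * h * c0))


def optimal_repeaters(r0, c0, r_int, c_int):
    """Delay-optimal repeater count and size for the model above."""
    k_opt = math.sqrt(0.4 * r_int * c_int / (0.7 * r0 * c0))
    h_opt = math.sqrt(r0 * c_int / (r_int * c0))
    return k_opt, h_opt


# Hypothetical minimum-inverter and wire parameters (ohm, farad).
r0, c0 = 10e3, 1e-15
r_int, c_int = 2e3, 2e-12

k_opt, h_opt = optimal_repeaters(r0, c0, r_int, c_int)
t_opt = repeated_wire_delay(k_opt, h_opt, r0, c0, r_int, c_int)
t_half = repeated_wire_delay(0.5 * k_opt, h_opt, r0, c0, r_int, c_int)
penalty_half_count = t_half / t_opt - 1.0
print(f"delay penalty with 50% of the optimal repeaters: {penalty_half_count:.1%}")
```

Under this model the penalty comes out at roughly 11%, close to the 10% figure quoted from [12], and it depends only on the fraction of the optimal count that is used, not on the particular parameter values.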
 The signals P0 and P1 from the two different sources are applied to the 
two input lines of a 2:1 multiplexer, respectively. A signal (φmin) that has the same period 
as the global clock but remains at logic 1 only for a duration of tpulse, calculated 
using [37], is applied to the select line of the multiplexer. When φmin is high 
(beginning of the clock cycle; t = 0), P0 is sampled by transmission gate A and 





Size of transistors for mux-demux circuit - W/L = h/4, where h is the size 






transmission gate B and transmitted. At the receiver end, a locally generated φmin is 
delayed to give φ1 and φ2, and these delayed clocking signals are used to sample the data 
received on the shared wire by controlling the transmission gates C and D, respectively. 
Finally, Line_out is the signal that is transmitted over the shared interconnect and is 
given as input to the demultiplexer on the receiver side. 
 It is assumed that the signal φmin is generated using the global clock that is 
distributed across the entire chip. Hence, the signal φmin can be easily generated in any 
region of the chip. Figure 2.7 shows the logic diagram of the circuitry used to generate 
the signal φmin. The global clock signal φclk is delayed and this delayed signal is ANDed 
with the global clock to generate the signal φmin. The sizes of the inverters used to delay 
the global clock should be suitably chosen to ensure the required delay is obtained using 
a small number of inverters. Figure 2.8 shows a plot of the duty cycle of φmin against 
interconnect length, up to 1 cm, for various interconnect dimensions. Here it is assumed 
that an optimal number [10] and optimal size [10] of repeaters are inserted on the 
interconnect. It can be observed from Figure 2.8 that the minimum pulse width varies 
little with interconnect length. As a result, the 
circuitry required to generate the φmin signal need not be separately designed for each WPM 
interconnect. In addition, this circuitry can be shared among multiple WPM interconnects 
in close vicinity. This reduces the total overhead required for WPM routing.  
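The pulse generation described above can be mimicked with a small discrete-time model. This is only a sketch: it ASSUMES that the delayed copy of the clock is inverted before the AND operation (for example, by an odd number of inverters in the delay chain), so that φmin is high from the rising clock edge for a duration equal to the chain delay. The function name, the time-step resolution and the 50% clock duty cycle are illustrative choices.

```python
def generate_phi_min(clk, delay_steps):
    """Return phi_min[i] = clk[i] AND NOT clk[i - delay_steps].

    clk is a list of 0/1 samples; samples before the start of the list are
    taken as 0. ASSUMPTION: the delayed clock is inverted before the AND.
    """
    phi_min = []
    for i, level in enumerate(clk):
        delayed = clk[i - delay_steps] if i >= delay_steps else 0
        phi_min.append(level & (1 - delayed))
    return phi_min


# Two clock periods of 10 time steps each, 50% duty cycle (illustrative).
clk = ([1] * 5 + [0] * 5) * 2
phi_min = generate_phi_min(clk, delay_steps=2)
duty_cycle = sum(phi_min) / len(phi_min)
print(phi_min)
print(f"duty cycle of phi_min = {duty_cycle:.2f}")
```

In this model the duty cycle of φmin equals the chain delay divided by the clock period, provided the delay is shorter than the high phase of the clock.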
 Moreover, there is an opportunity for a new type of repeater optimization to 
minimize the overall latency of a WPM channel. Figure 2.9 shows a plot of pulse width 
and wire delay varying with the number of repeaters. Here a 1 cm interconnect having a 






width model in [37], while the wire delay is calculated using delay models in [10]. As can 
be observed from the figure, the minimum required pulse width decreases as the number 
of repeaters increases. On the other hand, the wire delay first decreases and then 
increases with the number of repeaters. It can be observed from the figure that the wire 
delay is minimum when 10 repeaters are inserted on the wire. However, for WPM routing 
we are concerned with the sum of the pulse width and wire delay as explained in equation 
(2.5). Figure 2.9 shows that the sum of pulse width and wire delay is minimum when 11 
repeaters are inserted on the interconnect. Thus, it can be concluded that in the case of WPM 
routing, inserting the optimal number of repeaters may not necessarily minimize the latency 
of the two signals that will be transmitted over the WPM interconnect. On the contrary, 
by having a repeater count slightly higher than the optimal value it is possible to minimize 































Pulse width (interconnect width = 1e-5 cm)
Pulse width (interconnect width = 2e-5 cm)
Pulse width (interconnect width = 3e-5 cm)
 


































Legend entry: (Pulse width + wire delay)
Plot parameters: Length = 1 cm; Width = Height = Spacing = Dielectric thickness = 400 nm
 







 Figure 2.6 shows two delay circuits at the receiver side. Delay circuitry 1 
delays the signal φmin to give φ1 such that signal P0 gets sampled by transmission gate C 
as soon as it reaches the input (Line_out) of the demultiplexer. The second signal P1 
follows P0 on the shared wire with a time difference of tpulse. Hence, delay circuitry 2 
further delays φ1 to give φ2 such that transmission gate D samples signal P1 at the 
appropriate time. It should be noted that only one of the two transmission gates C and D 
is ON during sampling of signals received on the shared interconnect.  
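The relationship between the two receiver strobes can be sketched as a pair of time windows. The example assumes that φ1 and φ2 are pure delayed copies of φmin, so each is high for tpulse, and that the windows are half-open intervals; the function names and the picosecond values are hypothetical.

```python
def receiver_strobes(t_arrival_ps, t_pulse_ps):
    """Return the (start, end) windows in ps during which gates C and D conduct.

    phi_1 opens when P0 reaches Line_out; phi_2 is phi_1 delayed by t_pulse.
    """
    phi_1 = (t_arrival_ps, t_arrival_ps + t_pulse_ps)
    phi_2 = (phi_1[0] + t_pulse_ps, phi_1[1] + t_pulse_ps)
    return phi_1, phi_2


def overlaps(a, b):
    """True if the half-open windows a and b share any instant."""
    return a[0] < b[1] and b[0] < a[1]


# Illustrative numbers: P0 arrives 400 ps after launch, t_pulse = 150 ps.
phi_1, phi_2 = receiver_strobes(t_arrival_ps=400, t_pulse_ps=150)
print(phi_1, phi_2, overlaps(phi_1, phi_2))
```

Because φ2 starts exactly where φ1 ends, the two windows never overlap, which matches the requirement that only one of the gates C and D is ON at a time.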
 These delay circuitries, 1 and 2, that are used to delay the signal φmin are designed 
as a chain of CMOS inverters. The series consists of a small inverter driving a large 
inverter in one stage. Each stage drives another stage as shown in Figure 2.10. This 
enables a large delay with a small number of inverters. These delay circuits could be 
shared among multiple shared interconnects to distribute the resulting overhead. 
Depending on the interconnect delay, the signals φ1 and φ2 can be appropriately branched 




 
Figure 2.10: Delay circuitry using a series of inverters. 
 
 Figure 2.11 shows the delay that can be obtained at various stages in an inverter 
chain for 100 nm technology. Here it is assumed that a minimum sized inverter is driving 






the plot the size of all the larger inverters is the same. It can be observed from the plot 
that a minimum delay of 0.01 ns can be obtained using this delay chain. For interconnects 
having a delay of less than 0.01 ns this delay chain cannot be used. Delays larger than 





















h = ratio of the size of the NMOS of the larger inverter to that of the smaller inverter in a delay stage
 
Figure 2.11: Delay obtained at various stages in an inverter chain. 
 
 The delay circuitry that is required for the WPM routing technique adds to the 
overhead. The power dissipation and area of this overhead circuitry are directly 
proportional to the sizes and count of the transistors. The desired delay of the φmin signal can 
be obtained either by increasing the number of stages and using smaller inverters or by 
increasing the size of the larger inverter in each stage and using fewer stages. 






desired delay while minimizing the power and area. In addition to optimizing the sizes 
and number of stages for minimizing power dissipation, using a lower supply voltage for 
the delay circuits will also help in reducing their power dissipation and obtaining a larger 
delay. Figure 2.12 shows a 3-D plot of delay against the number of stages and the ratio of 
the width of the larger inverter to that of the smaller inverter of a stage in an inverter chain. 
Table 2.1 gives the data that is plotted in Figure 2.12, where ‘h’ gives the ratio of the width 
of the larger inverter to that of the smaller inverter in a stage of the inverter chain. 
 
Figure 2.12: 3d plot showing the variation of delay with inverter stages and ratio of size 














Table 2.1: Delay in ns obtained for various stages and inverter sizes in an inverter chain. 
Stage   h = 1   h = 10   h = 20   h = 30   h = 40   h = 50
1       0.01    0.06     0.13     0.20     0.26     0.33
2       0.04    0.15     0.29     0.43     0.57     0.72
3       0.07    0.24     0.45     0.67     0.89     1.10
4       0.10    0.33     0.61     0.91     1.19     1.48
5       0.13    0.41     0.78     1.14     1.51     1.87
6       0.15    0.50     0.94     1.37     1.82     2.26
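The trade-off between the number of stages and the inverter size can be explored directly from the data in Table 2.1. The sketch below picks the configuration that reaches a target delay at the lowest cost, where the cost is ASSUMED to be the total inverter width in units of the minimum size, i.e. stages × (1 + h). This proxy is a simplification of the statement that power and area scale with transistor size and count; the function name and the target value are hypothetical.

```python
# Delay in ns from Table 2.1, indexed by h; list position = stage number - 1.
DELAY_NS = {
    1:  [0.01, 0.04, 0.07, 0.10, 0.13, 0.15],
    10: [0.06, 0.15, 0.24, 0.33, 0.41, 0.50],
    20: [0.13, 0.29, 0.45, 0.61, 0.78, 0.94],
    30: [0.20, 0.43, 0.67, 0.91, 1.14, 1.37],
    40: [0.26, 0.57, 0.89, 1.19, 1.51, 1.82],
    50: [0.33, 0.72, 1.10, 1.48, 1.87, 2.26],
}


def cheapest_chain(target_ns):
    """Return (cost, stages, h, delay_ns) of the cheapest tabulated chain whose
    delay is at least target_ns, or None if no tabulated chain is long enough.

    ASSUMPTION: cost = stages * (1 + h), a crude proxy for area and power.
    """
    best = None
    for h, delays in DELAY_NS.items():
        for index, delay in enumerate(delays):
            if delay >= target_ns:
                stages = index + 1
                candidate = (stages * (1 + h), stages, h, delay)
                if best is None or candidate < best:
                    best = candidate
                break  # adding more stages at this h only increases the cost
    return best


choice = cheapest_chain(0.4)
print(choice)
```

For a 0.4 ns target the proxy favors five stages with h = 10 over two stages with h = 30, illustrating that several small stages can be cheaper than a few large ones under this cost assumption.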
 
2.5. Validation of WPM circuit design 
 In order to validate the nominal WPM circuit, it is simulated using HSPICE. 
Figure 2.13 shows the timing waveforms, generated using HSPICE, for the two data 
signals sent over a 0.5 cm long shared SSWPM interconnect in a single clock cycle. A 
pitch of 1.05 µm is used for this interconnect. The pitch value is selected based on the 
interconnect network design obtained for the 40 million transistor logic core described in 
Section 2.2. Signal P0 sends bit stream 0110010 while signal P1 sends bit stream 
0110110. When φmin goes high, the transmission gate A samples and transmits the signal 
P0 over the shared interconnect. When φmin goes low, the input signal P1 is sampled and 
transmitted by the transmission gate B. At the receiver side, whenever φ1 is high, 
transmission gate C samples the data at the input of the demultiplexer (Line_out) and 
gives it as output OP0. At this time transmission gate D is OFF. When φ2 goes high, 
transmission gate D samples and routes data on the shared wire to the appropriate sink. 
This corresponds to signal OP1. It can be observed from Figure 2.13 that both input 
signals, P0 and P1, reach the appropriate sinks within one clock cycle, and are read in 
correctly by the positive edge triggered pipeline registers. HSPICE simulation shows that 
the delay of multiplexer and demultiplexer is small. The delay due to the 2:1 multiplexer 






The delay of the multiplexer and demultiplexer is less than 10% of the allowable wire 
delay of 550 ps. 
 For the DSWPM interconnects that do not satisfy the delay constraint in equation 
(2.5) the same circuit in Figure 2.6 is used. As explained in Section 2.3, the latency of the 
signals will be two clock cycles and the constraint in equation (2.6) is used. For example, 
if the signal P0 is sampled at t = 0, then signal P1 will be sampled at t = tpulse by the 
multiplexer. Assuming the first datum of P0 reaches Line_out at t = 1.5 * tclk (accounting 
for any clock skew and guardband), the first datum of P1 will reach Line_out at t = 1.5 * 
tclk + tpulse. These signals will be used by the appropriate receiver side circuits at t = 2 * 
tclk. Meanwhile, the second datum on P0 and P1 will be sampled and transmitted at t = tclk 
and t = tclk + tpulse respectively, and will be used by the receiver side circuits at t = 3 * tclk. 
Here, the delay circuits will have to be suitably designed so that φ1 will go high at t = 1.5 
* tclk to sample P0 and φ2 will go high at t = 1.5 * tclk + tpulse to sample P1. 
 Figure 2.14 shows the HSPICE waveforms of a WPM interconnect of length 0.7 
cm. This interconnect does not satisfy the timing constraint given in equation (2.5) when 
the same pitch as that for 0.5 cm interconnects in the earlier case is used. Hence, as 
described in the earlier section, this interconnect is redesigned so as to satisfy timing 
constraint given in equation (2.6). The new pitch is 0.586 µm. This helps reduce the 
wire and silicon area. P0 and P1 are sampled and transmitted by the multiplexer when 
φmin goes high and low, respectively. As can be seen from Figure 2.14, both the signals 
require more than one clock cycle to reach the input of the demultiplexer (Line_out). 
When φ1 is high, data at Line_out is sampled and transmitted to give OP0. On the other 






data, OP0 and OP1, is used two clock cycles after it is transmitted at the source. The 
second set of data is scheduled at t = tclk and is used by the receiver side circuitry at t = 3 
* tclk. The delays due to the 2:1 multiplexer and the 1:2 demultiplexer are 14.3 ps and 30 ps, 
respectively, which are again very small compared to the allowable interconnect delay. 
Thus, though the overall latency of the system increases, the total communication 









































Figure 2.13: HSPICE generated timing waveforms of wave-pipelined multiplexed circuit 













































Figure 2.14: HSPICE generated timing waveforms of wave-pipelined multiplexed circuit 
for two interconnects designed using the delay constraint in equation (2.6). 
 
2.6. Optimization of WPM circuit design 
 As described in Section 2.3, two dedicated interconnects can be replaced by a 
single shared interconnect using WPM routing. The elimination of one interconnect frees 
up some routing area that can be harnessed in a variety of ways. For example, the spacing 
between the interconnects can be increased so as to fill up all available routing area. This 
increase in wire spacing decreases the coupling capacitance between the neighboring 
interconnects. As a result, smaller sized drivers and receivers can be used, resulting in a 
decrease in the total device capacitance. Hence, there is an opportunity to reduce both 






 To illustrate WPM optimization, two dedicated interconnects, each of length 1.0 
cm and designed to operate at 1.3 GHz, are considered. It is assumed that these two 
interconnects have active lines as their neighbors as shown in Figure 2.15. The 
dimensions of these two interconnects are designed such that they require 70% of the 
clock period for data transmission. It is assumed that a buffer of 20% of the clock period 
is necessary to account for clock skew and signal guardbands. Even though each 
interconnect remains idle for only 10% of the clock period, this time is enough to 
schedule a second signal in a pipelined fashion, using the WPM technique, on either 
interconnect without any loss of throughput performance. Hence, we use WPM routing 
and replace these two interconnects by a single shared WPM interconnect, which uses the 
overhead circuitry described in Section 2.4. The WPM design will have a single shared 








W = Wire width
S = Wire spacing
T = Wire thickness
H = Dielectric thickness
P = Wire pitch
De = Dedicated wire
Sh = Shared wire using WPM
Ne = Active neighboring wire
Spwire = S/W
Hoxide = H/W
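For the 1.3 GHz case study introduced above, the timing budget implied by the stated percentages can be worked out with a few lines of arithmetic. The function and variable names are illustrative only.

```python
def timing_budget_ps(f_clk_hz, transmit_fraction, guardband_fraction):
    """Split one clock period (in ps) into transmission, guardband and idle time."""
    t_clk = 1e12 / f_clk_hz
    t_transmit = transmit_fraction * t_clk
    t_guard = guardband_fraction * t_clk
    t_idle = t_clk - t_transmit - t_guard
    return t_clk, t_transmit, t_guard, t_idle


# 1.3 GHz clock, 70% of the period for transmission, 20% for skew and guardband.
t_clk, t_transmit, t_guard, t_idle = timing_budget_ps(1.3e9, 0.70, 0.20)
print(f"clock period : {t_clk:.1f} ps")
print(f"transmission : {t_transmit:.1f} ps")
print(f"guardband    : {t_guard:.1f} ps")
print(f"idle         : {t_idle:.1f} ps")
```

The idle window is about 77 ps out of a clock period of roughly 769 ps.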













2.6.1. Minimum wire area optimization 
 This optimization represents the classic application of WPM, which reduces wire 
area only, and serves as a baseline for comparison with the other optimizations in this section. 
Figure 2.16 shows the variation in the interconnect area with the number of repeaters for 
the conventional design with two dedicated interconnects and WPM design with a single 
shared interconnect. One can observe from Figure 2.16 that a simple application of WPM 
(no change in wire spacing or dielectric thickness) decreases the total wire area by 50%. 
This reduction in interconnect count decreases repeater count, and therefore decreases 
active transistor area. Figure 2.17 shows the variation in the transistor area for different 
numbers of repeaters. Even with the WPM overhead, one can get more than a 15% reduction in 
the transistor area at the optimal design point. 
 An increase in the static power and the dynamic power of the system is observed 
after application of WPM routing. The dynamic power increases due to the increase in 
the activity factor of the shared resources. In addition, the switching capacitance of the 
overhead circuitry required for implementing the WPM routing also contributes to the 
power equation. As a result, there is an increase in the total power dissipated by the wire-
area-centric WPM design. At the minimum wire area design point, close to 20% increase 




























Conv Spwire = 1.0 Hoxide = 1.0
WPM Spwire = 1.0 Hoxide = 1.0
WPM Spwire = 1.2 Hoxide = 1.0































Conv Spwire = 1.0 Hoxide = 1.0
WPM Spwire = 1.0 Hoxide = 1.0
WPM Spwire = 1.2 Hoxide = 1.0
































Conv Spwire = 1.0 Hoxide = 1.0
WPM Spwire = 1.0 Hoxide = 1.0
WPM Spwire = 1.2 Hoxide = 1.0






Figure 2.18: WPM design - total power vs number of repeaters. 
 
2.6.2. Low power and low area design 
 The elimination of interconnects resulting from WPM increases available routing 
area and provides an opportunity to increase wire spacing, which can result in lower wire 
capacitance and smaller driver sizes. Furthermore, because of crosstalk constraints, 
increasing wire spacing can enable an increase in dielectric thickness, which reduces the 
effective ground capacitance. The wire spacing and dielectric thickness are increased in 
the same proportion, such that the crosstalk constraints and processing constraints are not 
violated. Figures 2.19, 2.20 and 2.21 show the variation in interconnect area, transistor 
area and power with the number of repeaters for a low power design. The increase in wire 
spacing and dielectric thickness decreases interconnect capacitance, which decreases 






interconnect can be proportionately decreased so that the delay will be equal to 70% of 
the clock period (as in the conventional design). This enables the use of smaller sized 
drivers/receivers. The resulting decrease in interconnect capacitance and device 
capacitance decreases the total power dissipated by the system creating an opportunity for 
a low power design. For Spwire = 1.5 and Hoxide = 1.5, a 6% reduction in power can be 
observed at the optimal point in Figure 2.18. The decrease in interconnect count and 
driver/receiver sizes decreases the interconnect area and transistor area, respectively, of 
the system. A 44% decrease in interconnect area and 29% decrease in transistor area can 























[Plot legends: Conv (Spwire = 1.0, Hoxide = 1.0); WPM (Spwire = 1.5, Hoxide = 1.5). Annotation at the optimal point: 29% reduction in transistor area]












2.6.3. Minimum power design 
 It is possible to further increase wire spacing such that the interconnect area of the 
WPM circuit is the same as that of the conventional circuit. This would represent a 
minimum power design for the 2-wire circuit. In our case study, for Spwire = 4.9 and 
Hoxide = 2.0 the total interconnect area of WPM circuit is equal to the interconnect area 
of the conventional circuit. Here, it is assumed that Hoxide can have a maximum value of 
2.0 due to manufacturing constraints. Figure 2.22 shows the variation in power with the 
number of repeaters for the conventional design and the minimum power design. Even with the 
inclusion of WPM overhead circuits, the WPM design dissipates 26% less power than the 
conventional circuit at the optimal point. Figure 2.23 shows the transistor area for the 
conventional and WPM interconnect designs. As expected, a significant reduction in 
transistor area is observed for the minimum power design. At the optimal point, 41% 
























[Plot legend: Power, conventional design (Spwire = 1.0, Hoxide = 1.0); Power, minimum power design (Spwire = 4.9, Hoxide = 2.0). Annotation at the optimal point: 26% reduction in power]
 

























[Plot legend: Transistor area, conventional design; Transistor area, minimum power design]













2.6.4. High performance design 
 The WPM routing technique can also be optimized to improve the performance of 
the interconnect circuit for a given area constraint. The performance of large systems can 
be limited either by the delay of the longest interconnect on the chip, which is routed on the 
global tier, or by the logic critical path. Let us assume that the system performance is 
limited by the interconnect delay. In such a case, for a given wire area, WPM routing can 
be used to reduce the interconnect delay and, in turn, improve system performance. 
 The interconnects routed on a global tier have a high aspect ratio [9]. The 
elimination of an interconnect using WPM routing frees up a significant amount of routing 
area, which can be used to reduce the switching capacitance of the WPM 
interconnect by increasing the interconnect spacing. Figure 2.24 shows the redesign approach 
that can be adopted for the interconnects in the global tier to improve performance. As 
can be seen from Figure 2.24, the width of the WPM interconnect is increased to be equal 
to the height of the interconnect. This increase in interconnect cross-section reduces the 
interconnect resistance. In addition, the spacing between the interconnects is also 
increased, which helps in reducing the interconnect coupling capacitance. Though the 
ground capacitance increases due to the increase in interconnect width, there is a reduction in 






[Panel labels: conventional design of global interconnect; high performance design]










Figure 2.24: High performance design using WPM routing (Cross-sectional view of the 
global tier) 
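
The effect of the wider cross-section on the interconnect resistance can be illustrated with a minimal sketch. It assumes a rectangular wire with R = rho * L / (W * H); the function name, the resistivity value and the 0.2 µm starting width are illustrative assumptions and are not taken from the case study. Only the resulting ratio is meaningful, and it depends solely on the aspect ratio.

```python
def wire_resistance(resistivity, length, width, height):
    """Resistance of a rectangular wire: R = rho * L / (W * H)."""
    return resistivity * length / (width * height)


RHO = 2.2e-8        # ohm*m, assumed copper-like resistivity (illustrative)
LENGTH = 1.0e-2     # m, a 1.0 cm interconnect
ASPECT_RATIO = 1.8  # height / width, the aspect ratio assumed in the text
WIDTH = 0.2e-6      # m, hypothetical width of the conventional wire
HEIGHT = ASPECT_RATIO * WIDTH

r_conventional = wire_resistance(RHO, LENGTH, WIDTH, HEIGHT)
# High performance design: the width is increased to equal the height.
r_high_perf = wire_resistance(RHO, LENGTH, HEIGHT, HEIGHT)

resistance_ratio = r_high_perf / r_conventional  # equals 1 / ASPECT_RATIO
print(f"R(high performance) / R(conventional) = {resistance_ratio:.3f}")
```

Under these assumptions the resistance falls by the aspect ratio, i.e. to roughly 56% of its conventional value for an aspect ratio of 1.8.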
  
To quantify the performance improvement, a 1.0 cm long interconnect was designed to 
operate at 1.3 GHz using 100 nm technology. A suboptimal number [12] and suboptimal 
size [13] of repeaters are inserted on the interconnects. An aspect ratio of 1.8 [9] is 
assumed and the wire spacing is assumed to be equal to the wire width for the 
conventional design. Table 2.2 shows a comparison between the conventional design and 
the high performance design. While calculating the delay of the high performance 
design, the delay of the overhead circuitry is also considered. It can be observed from the 
table that a 74% improvement in performance is obtained. In addition, a 29% reduction in 










Table 2.2: Comparison between conventional design and high performance design 

















0.691 0.613 3.72e-6 2.26 0.334 
 
2.6.5. Comparison of WPM circuit optimizations 
 The advantages obtained using WPM routing in terms of interconnect area, 
transistor area and power dissipation for different designs are summarized in Table 2.3. A 
simple application of WPM routing reduces interconnect area and transistor area while 
maintaining performance. However, there is an increase in power. If the interconnect area 
is increased to be equal to that of the conventional design, then for a low power and low area 
design a significant reduction in transistor area (41%) and power (26%) is obtained while 
maintaining performance. Similarly, for the high performance design, WPM routing can 
result in a 74% improvement in latency performance while still reducing transistor area 
by 29%.  
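
The normalized entries of Table 2.3 map directly onto the percentages quoted above. The short sketch below converts a normalized metric into a percentage change relative to the conventional design. The column names are an assumption inferred from the percentages in the text (the first column is taken to be interconnect area, the second transistor area and the last power); the helper name is hypothetical.

```python
def percent_change(normalized_value):
    """Percentage change relative to the conventional design (normalized to 1.0)."""
    return (normalized_value - 1.0) * 100.0


# Normalized values from Table 2.3; the column names are assumed.
designs = {
    "WPM balanced design": {
        "interconnect_area": 0.5, "transistor_area": 0.85, "power": 1.2,
    },
    "WPM low power and low area design": {
        "interconnect_area": 1.0, "transistor_area": 0.59, "power": 0.74,
    },
}

changes = {
    name: {metric: round(percent_change(value)) for metric, value in metrics.items()}
    for name, metrics in designs.items()
}
for name, metrics in changes.items():
    print(name, metrics)
```

A negative value denotes a reduction and a positive value an increase with respect to the conventional design.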
 














design 1.0 1.0 1.0 1.0 
WPM balanced design 0.5 0.85 1.0 1.2 
WPM low power and low area design 1.0 0.59 1.0 0.74 
WPM high 








A novel wire sharing technique called wave-pipelined multiplexed (WPM) 
routing is proposed that can be seamlessly incorporated into existing global and semi-
global pipelines. It is shown that the WPM routing technique takes advantage of the 
inherent intra-clock period interconnect idleness and transmits two data signals over a 
shared interconnect in a single clock period. A detailed description of the circuit that is 
required for implementing WPM routing is presented. The validation of this WPM circuit 
using HSPICE is also demonstrated here.  
The reduction of interconnect count from two to one using WPM routing provides 
direct advantages in terms of power and area. The WPM interconnect can be easily 
optimized to reduce interconnect area, transistor area and power dissipation. Using case 
studies, it is shown that a simple application of WPM routing reduces the interconnect 
area by 50% and the transistor area by 15%. The reduction in interconnect area provides 
an opportunity to increase the spacing between the interconnects and the dielectric thickness. A 
suitable increase in the wire spacing and dielectric thickness reduces the interconnect area 
by 44%, the transistor area by 29% and the total power (dynamic + leakage) dissipation by 6%. 
On the other hand, if the spacing between the WPM interconnect and the neighboring 
interconnects is increased such that there is no change in wire area, then a 26% reduction 
in power dissipation and a 41% reduction in transistor area are observed. The performance 
of the WPM interconnect can be improved by suitably changing the interconnect 
dimensions. Close to 74% improvement in performance can be obtained using WPM 
routing. For a fixed wire area, WPM circuits can be used to reduce the wire delay by 50% 






of this technique into a traditional VLSI design flow, this implementation has the 
potential to be a pervasive routing technique that can be applied to both inter-core and 








IMPACT OF VARIATIONS ON WAVE-





 The WPM routing technique can provide significant advantages in terms of 
interconnect area, transistor area, power dissipation and interconnect performance. 
However, to harness these advantages it is imperative for the circuit to function correctly 
under both best-case and worst-case conditions. The performance of an interconnect 
circuit can be severely affected by crosstalk noise, power supply noise, clock skew, 
etc. It is therefore necessary to provide adequate guardbands and, at times, even 
overdesign the WPM circuit to a certain extent to ensure correct circuit operation under 
different non-ideal conditions. 
 The tolerance levels of the WPM circuit to external noise are discussed in this 
chapter. As described in Section 2.4, the circuitry for WPM routing uses static CMOS 
logic, dynamic latch and transmission gate designs. The WPM routing technique requires 






shared wire. This delay matching is directly affected by delay variations resulting from 
crosstalk noise and power supply noise. The tolerance of the WPM circuit to these delay 
variations is discussed in Sections 3.2 and 3.3, respectively. The signal φmin, which is used 
to correctly sample the data at the sender side and the receiver side, is generated from the 
global clock. The existence of clock skew in the different clock domains in a high 
performance system can affect the correct operation of the WPM circuit. The effect of 
clock skew on the WPM circuit is explored in Section 3.4. Finally, a wave-pipelined 
encoded (WPE) routing technique is developed as a part of this research that can be used 
to reduce the effect of external noise on the wire sharing circuit. A detailed discussion of 
the WPE routing technique is presented in Section 3.5. 
 
3.2. Crosstalk and dynamic delay 
 The interconnects in a multilevel interconnect network are routed orthogonally on 
adjacent routing levels. The capacitive coupling between interconnects routed on adjacent 
routing channels may result in noise transients that cause variations in the latency of the 
interconnects. As the WPM routing technique is highly dependent on the delay 
constraints given by equations (2.5) and (2.6), it is necessary to study the impact of 
crosstalk noise on the delay of the shared wires. 
 In order to analyze the impact of crosstalk on WPM routing, an interconnect 
system consisting of five interconnects and two non-ideal ground planes is considered. 
The ground capacitances and mutual capacitances are determined using the RC2 models 
of RAPHAEL. The interconnect system that is used to study the capacitive crosstalk 






capacitances are denoted by Cg and Cm, respectively. The two ground planes can be 
replaced by orthogonal active lines; however, [38] has shown that the assumption of ground 









Figure 3.1: Interconnect system with 5 interconnects and 2 ground planes. 
 
 Depending on the way an interconnect network is designed, the neighboring lines 
of an interconnect can be active lines or ground lines. To provide the most general 
analysis, we will have three different types of switching patterns for this five interconnect 
system. The three different switching patterns are shown in Figure 3.2. For the first 
switching pattern, only the center interconnect C switches and all the neighboring 
interconnects are ground lines. In the second switching pattern, all the interconnects 
switch in the same direction, while for the third switching pattern alternate interconnects 
switch in opposite directions. The figure also shows the effective switching capacitance 
of the center interconnect. As can be seen from Figure 3.2, the switching capacitance is 
maximum for pattern (3), i.e. when the adjacent interconnects are switching in opposite 






hand, the switching capacitance is minimum for pattern (2) resulting in a minimum 
interconnect latency. 
[Switching capacitance labels: pattern (1) Cint = 2Cg + 2Cm; pattern (2) Cint = 2Cg]



















Figure 3.2: Different switching patterns in the 5 interconnect system. 
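
The effective switching capacitance of the center interconnect for the three patterns can be written as a small lookup. This is a minimal sketch: the function name is hypothetical, Cg and Cm are the ground and mutual capacitances defined above, and the numeric values in the example are arbitrary illustrative units.

```python
def switching_capacitance(pattern, c_ground, c_mutual):
    """Effective switching capacitance of the center interconnect C.

    pattern 1: only C switches, neighbors are quiet   -> 2Cg + 2Cm
    pattern 2: all lines switch in the same direction -> 2Cg
    pattern 3: alternate lines switch oppositely      -> 2Cg + 4Cm
    """
    mutual_multiplier = {1: 2, 2: 0, 3: 4}
    if pattern not in mutual_multiplier:
        raise ValueError("pattern must be 1, 2 or 3")
    return 2 * c_ground + mutual_multiplier[pattern] * c_mutual


# Illustrative (hypothetical) capacitances in arbitrary units.
C_G, C_M = 1.0, 2.0
capacitances = {p: switching_capacitance(p, C_G, C_M) for p in (1, 2, 3)}
print(capacitances)
```

Pattern (2) gives the smallest capacitance and hence the best-case latency, while pattern (3) gives the largest capacitance and the worst-case latency.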
 
  As described in Section 2.4, at the receiver side of the WPM circuit, the signal 
φmin is delayed to give φ1 and φ2 which are used to sample the signal received at the input 
of the demultiplexer. In order to ensure that the received signal is correctly sampled, the 
signal φ1 should be wide enough to account for any variations in interconnect latency. To 
account for latency variations due to capacitive crosstalk the signal φ1 (and hence signal 
φmin) should be at least as wide as the difference between the worst-case delay and best-
case delay of the interconnect.   
 Using Sakurai’s model for interconnect delay [39], the delay of an interconnect 
having a resistance R, a capacitance C and k repeaters can be given by 
$$\tau_{int} = (1.02 R_{seg} C_{seg} + 2.3 R_t C_{seg} + 2.3 R_t C_t + 2.3 R_{seg} C_t) \cdot k, \tag{3.1}$$
where Rseg (= R/k) is the resistance of the interconnect segment, Cseg (= C/k) is the 
switching capacitance of the interconnect segment, Rt is the output resistance of the 
repeater driver and Ct is the input capacitance of the repeater driver. For pattern (2) the 






(2Cg + 4Cm) / k. Assuming the repeaters have been designed for worst-case delay for both 
patterns, the minimum width of signal φ1 to maintain signal integrity can be given by the 
difference between the worst-case and best-case delay, or 
$$\tau_{WPM} = [(1.02 R_{seg} + 2.3 R_t)(2C_g + 4C_m - 2C_g)/k] \cdot k, \tag{3.2}$$
$$\tau_{WPM} = (1.02 R_{seg} + 2.3 R_t)(4C_m). \tag{3.3}$$
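
As a numerical check, equation (3.1) can be evaluated for the best-case and worst-case switching capacitances and the difference compared against equation (3.3). The sketch below is only illustrative: the function names are hypothetical, and the resistance, capacitance and repeater values are assumed numbers chosen to exercise the formulas, not data from this work.

```python
import math


def repeated_wire_delay(r_total, c_total, k, r_t, c_t):
    """Equation (3.1): delay of a wire split into k repeated segments."""
    r_seg = r_total / k
    c_seg = c_total / k
    per_segment = (1.02 * r_seg * c_seg + 2.3 * r_t * c_seg
                   + 2.3 * r_t * c_t + 2.3 * r_seg * c_t)
    return per_segment * k


def crosstalk_pulse_width(r_total, k, r_t, c_m):
    """Equation (3.3): worst-case minus best-case delay."""
    r_seg = r_total / k
    return (1.02 * r_seg + 2.3 * r_t) * 4.0 * c_m


# Hypothetical values: total wire resistance, total ground and mutual
# capacitances, number of repeaters, repeater resistance and capacitance.
R, C_G, C_M = 2000.0, 0.5e-12, 0.4e-12
K, R_T, C_T = 10, 100.0, 20e-15

best = repeated_wire_delay(R, 2 * C_G, K, R_T, C_T)               # pattern (2)
worst = repeated_wire_delay(R, 2 * C_G + 4 * C_M, K, R_T, C_T)    # pattern (3)
tau_wpm = crosstalk_pulse_width(R, K, R_T, C_M)

print(f"best = {best:.3e} s, worst = {worst:.3e} s, tau_WPM = {tau_wpm:.3e} s")
```

The terms of (3.1) that do not contain Cseg cancel in the difference, which is why (3.3) depends only on Rseg, Rt and Cm.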
 The minimum pulse width [37] of the signal φmin that is necessary to ensure data 









     (3.4) 
where 
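$$\sigma_{RC_{seg}} = R_t C_t + R_t C_{seg} + C_t R_{seg} + 0.4 R_{seg} C_{seg}, \tag{3.5}$$
$$K_1 = 1.01 \, \frac{R_t C_{seg} + R_{seg} C_t + R_{seg} C_{seg}}{R_t C_{seg} + R_{seg} C_t + \frac{\pi}{4} R_{seg} C_{seg}}, \tag{3.6}$$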














     (3.7) 
and υ1 = fraction of the full supply voltage at the output of the first repeater segment.  
Let us assume that φn is the control signal that is used for sampling data. It needs to have a 
large enough pulse width to avoid timing errors due to dynamic delay effects on the data 
line. If the pulse width τWPM,limit given in (3.4) is more than that given by equation (3.3), 
then the WPM circuit designed for the same interconnect dimensions and driver-receiver 
sizes will not be affected by crosstalk noise, i.e. φn = τWPM,limit. On the other hand, if the 






signal φn should have a pulse width at least equal to that given by equation (3.3) to avoid 
any loss of signal integrity due to noise on the data line (i.e. φn = τWPM). 
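
The selection rule for the pulse width of φn reduces to taking the larger of the two constraints. A minimal sketch follows; the function names and the picosecond values are hypothetical and serve only to show both outcomes.

```python
def control_pulse_width(tau_wpm_limit, tau_wpm):
    """Pulse width of phi_n: the larger of the minimum transmittable pulse
    width (3.4) and the crosstalk-induced delay spread (3.3)."""
    return max(tau_wpm_limit, tau_wpm)


def is_crosstalk_limited(tau_wpm_limit, tau_wpm):
    """True when the delay spread, not the minimum pulse, sets the width."""
    return tau_wpm > tau_wpm_limit


# Hypothetical pulse widths in picoseconds.
case_a = control_pulse_width(tau_wpm_limit=120, tau_wpm=90)   # limit dominates
case_b = control_pulse_width(tau_wpm_limit=120, tau_wpm=200)  # crosstalk dominates
print(case_a, case_b)
```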
 Figure 3.3 shows the WPM circuit and its corresponding timing diagram. In the 
timing diagram φs (x=0) and φs (x=l) give the data signal at the beginning and the end of the 
interconnect, respectively. In case of no delay variations, the interconnect delay will be 
equal to τint(ideal). In such a case, the minimum pulse width of τWPM, limit can be used at 
both the sender and the receiver side. On the other hand, in case of delay variations due to crosstalk 
noise, the best-case delay and worst-case delay are given by τint(best case) and τint(worst case), 























Figure 3.3: WPM circuit and timing diagram showing the necessary pulse width to 
tolerate crosstalk noise. 
 
 The difference in the worst-case interconnect delay and the best-case interconnect 
delay can be fairly large. Hence, the pulse width of the control signal φn can be quite 
large resulting in a small fraction of interconnects that can be classified as SSWPM 
interconnects, and as a result, there will be less opportunity to apply WPM routing. In 
order to avoid such a case, a staggered repeater insertion technique after [40] can be 







[Panel labels: similar switching; opposite switching]
 
Figure 3.4: Staggered repeater insertion. 
 
 If opposite switching signals are transmitted on adjacent interconnects, then half 
of the interconnect segment experiences similar switching on the neighboring 
interconnect, while the other half experiences opposite switching. This staggered repeater 
insertion significantly reduces the worst-case delay because of the reduction of net 
crosstalk currents. In case similar switching signals are transmitted on adjacent 
interconnects, the neighboring interconnects with staggered repeaters still experience 
some coupling, and the delay is more than the case where repeaters are inserted in 
conventional fashion. Thus, the worst-case delay using staggered repeater insertion is less 
than that exhibited for conventional repeater insertion for pattern (3) in Figure 3.2 and the 
best-case delay using staggered repeater insertion is more than that exhibited by 
conventional repeater insertion for pattern (2) in Figure 3.2. Table 3.1 shows this 
comparison between the similar switching and opposite switching delay for conventional 
and staggered repeater insertion determined using HSPICE for a 1.0 cm long 






resistive and the inductance can be ignored in this analysis. However, inductance and 
inductive coupling are included in the HSPICE simulations. The difference between the 
worst-case and best-case delay for staggered repeater insertion is less than the difference 
between the delays exhibited by conventional repeater insertion by more than 60%. So, if 
a staggered repeater insertion technique is adopted, a smaller control pulse for the φn 
signal will be sufficient for correct sampling of data at the receiver side. This would 
significantly increase the number of interconnects that can be designed as SSWPM 
interconnects.  
 
Table 3.1: Comparison of interconnect delay for conventional repeater insertion and 
staggered repeater insertion using HSPICE. 















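Wire width (cm)   Conventional, similar (s)   Conventional, opposite (s)   Staggered, similar (s)   Staggered, opposite (s)
1.00E-05          1.18E-09                    1.59E-09                     1.21E-09                 1.37E-09
2.00E-05          5.64E-10                    7.99E-10                     5.76E-10                 6.68E-10
3.00E-05          3.43E-10                    5.21E-10                     3.78E-10                 4.41E-10
4.00E-05          2.49E-10                    3.86E-10                     2.79E-10                 3.25E-10
5.00E-05          2.06E-10                    3.05E-10                     2.20E-10                 2.55E-10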
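
The reduction in delay spread can be recomputed from the entries of Table 3.1. The sketch below assumes the column order wire width (cm), conventional similar switching, conventional opposite switching, staggered similar switching and staggered opposite switching (delays in seconds); this ordering is an assumption based on the discussion above. The 20% guardband follows the definition of τWPM used for Figures 3.5 and 3.6, and the helper names are hypothetical.

```python
GUARDBAND = 0.20  # 20% guardband added to the delay spread

# Rows of Table 3.1 with the assumed column order:
# (width_cm, conv_similar, conv_opposite, stag_similar, stag_opposite)
TABLE_3_1 = [
    (1.00e-05, 1.18e-09, 1.59e-09, 1.21e-09, 1.37e-09),
    (2.00e-05, 5.64e-10, 7.99e-10, 5.76e-10, 6.68e-10),
    (3.00e-05, 3.43e-10, 5.21e-10, 3.78e-10, 4.41e-10),
    (4.00e-05, 2.49e-10, 3.86e-10, 2.79e-10, 3.25e-10),
    (5.00e-05, 2.06e-10, 3.05e-10, 2.20e-10, 2.55e-10),
]


def summarize(row):
    """Delay spread, guardbanded pulse width and reduction for one row."""
    width, conv_best, conv_worst, stag_best, stag_worst = row
    conv_spread = conv_worst - conv_best
    stag_spread = stag_worst - stag_best
    return {
        "width_cm": width,
        "conv_pulse": (1.0 + GUARDBAND) * conv_spread,
        "stag_pulse": (1.0 + GUARDBAND) * stag_spread,
        "reduction": 1.0 - stag_spread / conv_spread,
    }


results = [summarize(row) for row in TABLE_3_1]
for r in results:
    print(f"{r['width_cm']:.2e} cm: conv {r['conv_pulse']:.2e} s, "
          f"stag {r['stag_pulse']:.2e} s, reduction {r['reduction']:.0%}")
```

Under the assumed column order, every row shows a reduction in delay spread of more than 60% for staggered repeater insertion.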
 
 Figure 3.5 and Figure 3.6 show a plot of the pulse width τWPM that is necessary to 
avoid any loss of signal integrity due to crosstalk noise when repeaters are inserted in a 
conventional and a staggered fashion. Here, τWPM is equal to the difference between 
the worst-case and best-case interconnect delay plus a 20% guardband. In Figure 3.5, the 
interconnect length is maintained at 1.0 cm, while the interconnect width is varied from 
0.1 µm to 0.9 µm. The aspect ratio is assumed to be 1.0 and the spacing between the 
wires is assumed to be equal to the width of the wires. A suboptimal number [12] and 

























[Plot legend: pulse width (conventional); pulse width (staggered); minimum pulse. Length = 1 cm; suboptimal number and size of repeaters]
 
Figure 3.5: Minimum pulse widths required to avoid loss of data integrity due to crosstalk 


























[Plot legend: pulse width (conventional); pulse width (staggered); minimum pulse. Width = 2e-5 cm; suboptimal number and size of repeaters]
 
Figure 3.6: Minimum pulse widths required to avoid loss of data integrity due to crosstalk 
noise (fixed interconnect dimensions). 
  
 For small wire widths, the necessary pulse width for a staggered repeater insertion 
is more than the minimum pulse width that can travel across the interconnect. Beyond a 
width of 0.2 µm the two plots cross and the pulse width becomes limited by the minimum 
pulse width τWPM, limit that can travel across the interconnect. Thus, if we use staggered 
repeater insertion then for interconnect widths larger than 0.2 µm there would be no loss 
of signal integrity due to capacitive coupling between neighboring interconnects. As we 
go to larger wire widths, the necessary pulse width plot for conventional repeater insertion 
crosses the plot for minimum pulse width. Hence, for a wire width of 0.7 µm and above 







 Similarly in Figure 3.6, the interconnect width is fixed at 0.2 µm and the 
interconnect length is varied from 0.1 cm to 1.3 cm. Here too, a suboptimal number [12] 
and suboptimal size [13] of repeaters are inserted on these interconnects. It can be seen 
from this plot that for shorter interconnects the required pulse width for the control signal 
φn is limited by the minimum pulse width τWPM, limit. For interconnects that are 0.9 cm and 
longer, the pulse width required will be limited by the pulse width for staggered repeater 
insertion. 
 Thus, the pulse width of the control signal that will be used for correct sampling 
of the data at the receiver side and at the sender side (the data pulse width is equal to the 
pulse width of the sampling signal) can be increased to prevent any loss of data integrity 
due to the crosstalk noise. The pulse width should be at least as large as the difference 
between the best-case delay and the worst-case delay of the interconnect. 
 
3.3. Power supply noise 
 The simultaneous switching in neighboring circuits within a short duration of time 
can cause considerably large current spikes, which increase the IR drop and result in 
L(di/dt) noise over the power supply network [11]. The power supply noise can reduce the 
supply voltage, which may significantly affect the latency of interconnects with repeaters 
due to the reduction in the transistor drive current. The WPM circuit design uses delay 
matching at the receiver side to correctly sample the data received at the input of the 
demultiplexer and route it to the appropriate sink. Hence, it is imperative to study the 






 To study the impact of power supply noise on the overall delay of the 
interconnect, a 1.0 cm long interconnect with 0.2 µm width is modeled using RC 
modeling. The interconnect aspect ratio is 1.0 and the interconnect spacing is equal to the 
interconnect width. A suboptimal number [12] and suboptimal size [13] of repeaters are 
inserted on these interconnects. The circuit is assumed to be driven by a supply voltage of 
1.2 V with a ± 10% variation. 
 A lower supply voltage reduces the transistor drive current, which increases the 
delay of the interconnect circuit. On the other hand, if the supply voltage increases, the 
drive current increases, which reduces the wire delay. For the WPM routing technique, we are 
concerned with the difference between the worst-case delay and the best-case delay for a 
given supply voltage. The variations in supply voltage will have a similar effect on both the 
best-case delay and the worst-case delay, i.e. for a lower supply voltage, both the worst-case 
delay and the best-case delay will increase. Similarly, for an increase in the supply voltage, both 
the worst-case delay and the best-case delay will reduce. Hence, there is minimal change in the 
difference between the best-case delay and the worst-case delay. Figure 3.7 shows the 
various pulse widths necessary for the three different cases as the supply voltage varies. 






























Figure 3.7: Minimum pulse widths required to avoid loss of data integrity due to power 
supply noise. 
 
 As can be seen from Figure 3.7, the minimum pulse width is less than the pulse 
width for staggered repeater insertion which is less than the pulse width for conventional 
repeater insertion. Power supply variation has the same effect on all three trends. For the 
given interconnect length and interconnect width, the difference between the best-case 
delay (1.32 V) and worst-case delay (1.08 V) is much less than the minimum pulse width. 
Thus, the delay variations due to power supply noise do not significantly affect delay 
matching at the receiver side. A staggered repeater insertion technique can be used to 
increase the number of SSWPM interconnects. Some interconnects will need to be 
designed as DSWPM interconnects. A significant effort at the design stage will be 
necessary to account for the increase in interconnect latency for these DSWPM 






will be significantly large and only a small fraction of interconnects can be designed as 
SSWPM interconnects. 
 The power supply noise has a similar effect on the delay circuitry. As the drive 
current is directly proportional to the supply voltage, a reduction in the supply voltage 
will increase the charge-discharge time of the transistors in the inverter chain, resulting in 
an increase in the delay of φmin. Similarly, if the supply voltage increases, the drive 
current increases, charge-discharge time of the transistors reduces, which results in a 
decrease in the delay of φmin. In order for the delayed φmin signal to correctly sample the 
signal at the receiver side, it is imperative that the variation in the 
delay of φmin is minimal. A guardband equal to 20% of the minimum pulse width is introduced 
while calculating the pulse width of the signal φmin that is generated locally at the receiver 
side. This guardband ensures that the signals are correctly sampled by the demultiplexer 
even if there are variations in the delaying of signal φmin. 
 
3.4. Clock skew 
 With the continuous increase in the operating frequency of future GSI systems 
[4], it is difficult to maintain the clock signals in phase with each other in different 
regions of a system. This has resulted in the existence of multiple clock domains in high 
performance digital systems. A significant amount of clock skew can exist among 
different clock domains. The signal φmin that is used for sampling of data at both the 
sender and the receiver side is generated by simply ANDing the global clock signal and a 






generated, which might result in a delay mismatch at the receiver side. Thus, clock skew 
can have an adverse effect on the working of the WPM circuit. 
 To analyze the tolerance of WPM routing to clock skew, it is assumed that if a 
leading edge of global clock input to a region arrives after the expected global reference 
leading edge, then the global clock in that clock domain has positive clock skew. On the 
other hand, if the global clock input to a region arrives before the expected time, then the 
region has negative clock skew. Though the existence of clock skew can be determined 
with respect to a global reference signal, the relative clock skew between two clock 
domains may be much more or much less than the clock skew between individual regions 
and the global reference. 
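
The sign convention and the notion of relative skew can be made concrete with a short sketch. The function names and the picosecond values are hypothetical; skews are measured against the global reference, with a late-arriving clock edge counted as positive.

```python
def relative_skew(sender_skew, receiver_skew):
    """Skew of the receiver clock relative to the sender clock.

    Both arguments are skews with respect to the global reference
    (positive = edge arrives late, negative = edge arrives early).
    """
    return receiver_skew - sender_skew


def sampling_window_shift(sender_skew, receiver_skew):
    """Direction in which the receiver sampling window moves."""
    rel = relative_skew(sender_skew, receiver_skew)
    if rel > 0:
        return "late"
    if rel < 0:
        return "early"
    return "aligned"


# Hypothetical skews in picoseconds.
larger = relative_skew(sender_skew=-40, receiver_skew=40)   # exceeds either skew
smaller = relative_skew(sender_skew=40, receiver_skew=50)   # below either skew
print(larger, smaller, sampling_window_shift(-40, 40))
```

The two cases show that the skew between a pair of domains can be larger or smaller than the skew of either domain against the global reference.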
 In case of WPM routing, if there is positive clock skew in the global clock at the 
receiver side with respect to the global clock at the sender side, then the signal φmin that is 
generated using the global clock will be delayed. As a result, the sampling window will 
get delayed and it will not correctly sample the signal at receiver side. Similarly, if the 
clock skew is negative, then the φmin signal will be generated early and it will reach the 
sampling circuitry early, which will again result in an incorrect sampling of data. This 
will significantly affect the signal integrity in a WPM circuit. 
 To study the tolerance level of an interconnect circuit to clock skew, a 0.5 cm 
interconnect is designed for 100 nm technology. The interconnect dimensions are chosen 
such that it would operate with a 1.3 GHz clock and will require 70% of the clock period for 
data transmission. Out of the remaining 30% of the clock period, 20% will serve as the 
guardband and 10% will be enough to schedule a second signal over the WPM 






[13] of repeaters are inserted on the interconnect. Clock skew is deliberately introduced 
in between the clock signals that are used to generate the φmin signal at the sender side 
and receiver side. As a result, the signal φmin gets delayed or arrives early at the input of 
the sampling circuitry.  
 The upper and lower limits of clock skew tolerance are determined based on 
signal integrity analysis. Within this range, the data signal integrity is maintained. Table 
3.2 shows the maximum positive and negative clock skew that can be tolerated by the 
WPM circuit. The table also shows the skew tolerance as a percentage of the clock 
period. On average, the WPM circuit has a skew tolerance of 10% for both positive and 
negative clock skew. 
 










Clock period (ns)   +ve clock skew (ns)   -ve clock skew (ns)   +ve clock skew as a percentage of clock period   -ve clock skew as a percentage of clock period
0.7692              0.085                 0.08                  11.05                                            10.4
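
The percentages in the table follow from the clock period of a 1.3 GHz clock. The sketch below reproduces them; it assumes that the skew entries are expressed in nanoseconds, and the function name is hypothetical.

```python
CLOCK_FREQUENCY_HZ = 1.3e9
clock_period_ns = 1e9 / CLOCK_FREQUENCY_HZ  # about 0.7692 ns


def skew_percent(skew_ns):
    """Clock skew expressed as a percentage of the clock period."""
    return 100.0 * skew_ns / clock_period_ns


positive_pct = skew_percent(0.085)  # maximum tolerated positive skew
negative_pct = skew_percent(0.08)   # maximum tolerated negative skew
print(f"period = {clock_period_ns:.4f} ns, "
      f"+ve = {positive_pct:.2f}%, -ve = {negative_pct:.2f}%")
```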
 
 The tolerance level of the WPM circuit to clock skew can be improved by 
designing a circuit that does not depend on the global clock to generate φmin. If a local 
signal can be used instead of the global clock to generate φmin, it will significantly help in 
making the WPM circuit more robust. 
 
 
3.5. Wave-pipelined encoded (WPE) routing 
As we saw in Sections 3.2-3.4, the working of a WPM circuit can become 






The main reason for this can be attributed to the delay matching that is necessary at the 
receiver side circuitry to correctly sample the data. Various encoding techniques have 
been proposed in an attempt to reduce the effect of external noise and improve the overall 
performance of an interconnect circuit. For example, an encoding strategy for on-chip 
buses has been proposed by [41] to reduce LC crosstalk. Similarly, a bus encoding 
scheme that corrects errors and avoids crosstalk is proposed in [42]. A transition-encoded 
dynamic bus technique has been proposed in [43], that is primarily used to reduce 
interconnect delay and improve performance. 
Similar to these encoding techniques, in order to improve the robustness of the 
WPM routing technique, instead of sending the actual data on the shared interconnect, 
encoded data can be transmitted. As the receiver side circuitry receives the expected 
encoded signals, it can easily sample and decode the signals and route the appropriate 
data to the correct sinks. This new routing technique described in this section will be 
referred to as wave-pipelined encoded (WPE) routing technique. 
As we replace two interconnects by a single shared interconnect, there can be only 
four possible data combinations. A different encoded signal is assigned to each of these 
data combinations. The data combinations and the corresponding encoded data signals 
















Table 3.3: Encoded signals for various data combinations. 





1 positive edge 
 
1 0     1 positive edge and 1 negative edge 
1 1     2 positive edges and 1 negative edge 
 
 
 The circuitry that is required at the sender side to generate the encoded signals can 
be shared by multiple WPE interconnects in a region. At the receiver side, flip flops are 
used to correctly sample the positive and negative edges of the encoded data signal. This 
encoded data signal is decoded and the decoded data is then routed to sinks. The biggest 
advantage of the WPE routing technique is that no delay matching is necessary at the 
receiver side to sample the data received on the shared interconnect. The WPE routing 
technique requires no custom design so that the required WPE circuits can be easily 
inserted at the sender and the receiver side of the shared interconnect. 
 As can be observed from Table 3.3, in the worst case, two pulses will need to be 
transferred over the shared interconnect to account for two positive edges and one 
negative edge. Both these pulses need to have a width at least equal to the minimum 
pulse width given by [37]. However, as the first pulse always has to be positive, the 






width [37] at the beginning of the clock to correctly detect the positive pulse. Essentially, 
for transmitting data logic 0 followed by logic 1, a negative pulse of minimum required 
width followed by a positive pulse of minimum required width is required. In addition to 
that, since flip flops are used at the receiver side, the setup and hold time of the last flip 
flop needs to be accounted for. Hence, for the WPE interconnect to maintain performance, the 
delay constraint for the WPE circuit can be given by 
$$\frac{t_{latency} + 3 \, t_{pulse} + t_{setup} + t_{hold} + t_{guardband}}{t_{clk}} < 1 \tag{3.8}$$
 If one compares this delay constraint to the delay constraint of WPM routing 
given by equation (2.5), it can be observed that fewer interconnects can be 
designed as SSWPM as a significant amount of interconnect idleness is necessary. The 
interconnects that do not satisfy the delay constraint given by equation (3.8) can be 
redesigned like DSWPM while accounting for increase in interconnect latency. These 
interconnects will need to satisfy the delay constraint 
\frac{t_{latency} + 3\,t_{pulse} + t_{setup} + t_{hold} + t_{guardband}}{t_{clk}} < 2      (3.9) 
 The schematic diagram of the logic required to implement WPE routing is shown 
in Figure 3.8. A 4:1 multiplexer is required at the sender side to correctly send the 
encoded data depending on the actual data that needs to be transmitted. As mentioned 
earlier, the circuits required to generate these encoded signals can be shared between 
multiple WPE interconnects. At the receiver side, three flip flops are required to sample 
the encoded data. Flip flop A detects the first positive edge, flip flop B the negative edge 
and flip flop C the second positive edge. As can be observed from the circuit, the 






edge in the encoded signal. Similarly, the triggering of flip flop C depends on the output 
of flip flop B and the existence of a second positive edge in the encoded signal. This 
ensures flip flop B does not trigger before triggering of flip flop A, and flip flop C does 
not trigger before triggering of flip flop A and flip flop B. This is necessary for correct 
decoding of the signal received on the shared interconnect. The outputs of the flip flops 
are cleared to zero at the beginning of every clock cycle by the positive edge of the clock. 
This ensures that the outputs of the flip flops have logic one values only when they are 









































Figure 3.8: Schematic diagram of the WPE routing circuit. 
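
The gated triggering of flip flops A, B and C described above can be illustrated with a small behavioral model. This is only a sketch, not the circuit of Figure 3.8: the function name `sample_wpe_edges`, the representation of the line as a list of 0/1 samples, and the assumption that the line starts each cycle at logic 0 are all hypothetical choices made for illustration. The mapping from the three flip flop outputs back to data bits is not modeled here.

```python
def sample_wpe_edges(samples):
    """Behavioral model of receiver flip flops A, B and C for one clock cycle.

    `samples` is a sequence of 0/1 line levels seen during the cycle.
    Assumption (hypothetical): the line starts the cycle at logic 0.
    """
    a = b = c = 0        # outputs are cleared at the start of every clock cycle
    previous = 0
    for level in samples:
        rising = previous == 0 and level == 1
        falling = previous == 1 and level == 0
        if rising and not a:
            a = 1        # A: first positive edge
        elif falling and a and not b:
            b = 1        # B: negative edge, only after A has triggered
        elif rising and b and not c:
            c = 1        # C: second positive edge, only after A and B
        previous = level
    return a, b, c
```

For example, a line that stays low leaves all three outputs at zero, while a positive pulse followed by a second rising edge sets all three.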
 
 In addition to the flip flops, some additional logic circuits are required at the 
receiver side. These logic circuits are needed to ensure correct triggering of the flip flops. 
As compared to the circuitry required at the receiver side in the WPM routing technique, 
a larger amount of circuitry is required at the receiver side in the WPE routing technique. 






overhead circuit in WPE routing also increases the power (dynamic and leakage) 
dissipated by the system. As a large amount of overhead circuitry is required for 
implementing WPE routing, the advantages of this design technique can be observed only 
in long global and semi-global interconnects. In case of short local interconnects, the 
required overhead may prove disadvantageous and adversely affect the overall 
performance of the interconnect circuit. 
 In order to validate the circuit for WPE routing, it is designed using CMOS 
technology and is simulated using HSPICE for 100 nm technology. A 0.5 cm long 
interconnect having interconnect width of 0.8 µm is considered. Two randomly generated 
bit streams are given as input to the select lines (p0 and p1) of the 4:1 multiplexer at the 
input side. The terminal line_out gives the signal that is received at the input of the 
receiver side circuitry. It can be observed from Figure 3.9 that the exact encoded signal 
that is transmitted at the sender side is received at the input of the receiver side circuitry. In 
addition, the data given as input is available for use at the output terminals (Op0 and 
Op1) at the beginning of the next clock cycle. Some glitches are observed at the output 
terminals of the WPE circuit, but these do not affect the operation of the circuit. This is 







Figure 3.9: HSPICE validation of WPE interconnect. 
 
3.6. Comparison between WPM and WPE routing 
To compare the WPM interconnect design and the WPE interconnect design, two 
1.0 cm long interconnects are simulated using HSPICE to operate at 1.3 GHz using 100 
nm technology. An optimal number and optimal size [10] of repeaters are inserted on 






interconnects require 70% of the clock period for data transmission. Out of the remaining 
30%, 20% provides the necessary guardband and 10% of idleness is sufficient to send a 
second signal in a wave-pipelined fashion. Thus, these two interconnects can be easily 
replaced by a WPM interconnect. For the WPE interconnect, the delay constraints are 
stricter than those for the WPM interconnect. Hence, the interconnect pitch is chosen such that the 
interconnect delay is 50% of the clock cycle. Out of the remaining 50%, 20% will serve as 
the guardband and 30% will provide the necessary time frame to transmit zero or more 
pulses (encoded signals). Spwire (Wire spacing/Wire width) and Hoxide (Dielectric 
thickness/Wire width) are increased to 1.5 as it gives a low power design using WPM 
routing. A 44% and 30% reduction in interconnect area is observed for WPM and WPE 
design, respectively. Figure 3.10 shows the variation of transistor area with the number of 
repeaters for the conventional design, WPM design and WPE design. As can be observed 
from the figure, the WPE routing technique significantly increases the total transistor 
area. At the optimal point, WPM reduces transistor area by 29%, while WPE increases 
transistor area by 12%. The variation of power dissipation with the number of repeaters for the 
three different designs is shown in Figure 3.11. WPE dissipates more power than 
the conventional design even if Spwire and Hoxide are increased to 1.5. At the optimal 
design point, WPM reduces power dissipation by 6% while WPE increases power 





























[Figure legend: WPM (Spwire = 1.5, Hoxide = 1.5); WPE (Spwire = 1.5, Hoxide = 1.5)]






















[Figure legend: WPM (Spwire = 1.5, Hoxide = 1.5); WPE (Spwire = 1.5, Hoxide = 1.5); Conv (Spwire = 1.0, Hoxide = 1.0). Annotations: 4% increase in power; 6% reduction in power]
 







 Thus, though the WPE design reduces interconnect area by 30%, it significantly 
increases power and transistor area even for an Spwire and Hoxide of 1.5. The wire 
spacing and dielectric thickness need to be increased further to reduce power dissipation 
and transistor area. To avoid any increase in interconnect area, the wire spacing is 
increased such that the new wire area is equal to the wire area of the conventional design. 
Here, Hoxide is increased to its maximum possible value of 2.0, which is set by manufacturing 
constraints. Figures 3.12 and 3.13 show the variation of transistor area and power with the 
number of repeaters for such a case. At the optimal point, transistor area reduces by 4% 
and power dissipation by 18%. Thus, with the help of WPE routing, for no change in 























[Figure legend: Conv (Spwire = 1.0, Hoxide = 1.0); WPE (Spwire = 3.7, Hoxide = 2.0). Annotation: 4% reduction in transistor area]
 

























[Figure legend: Conv (Spwire = 1.0, Hoxide = 1.0); WPE (Spwire = 3.7, Hoxide = 2.0). Annotation: 18% reduction in power]
 
Figure 3.13: WPE design – total power vs number of repeaters. 
 
 Though the WPE routing design offers fewer advantages as compared to the 
WPM routing design in terms of area and power, the WPE routing design is much more 
robust and tolerant to variations as compared to WPM routing. The main reason for this 
can be attributed to the fact that no delay matching is necessary at the receiver side for 
WPE routing. As a result, even if there are any delay variations due to capacitive crosstalk 
or power supply noise, the signals will be correctly sampled at the receiver side as long as 
the WPE interconnect is designed for worst-case delay. Low clock skew has minimal 
impact on the WPE routing technique. In case of a large clock skew between the sender 
side and receiver side global clock, WPM routing will fail as the signal received on the 
WPM interconnect will not be correctly sampled at the receiver side. On the other hand, 






edges and does not require any delay matching. The global clock is required to clear the 
flip flop outputs at the beginning of each clock cycle in WPE routing. For long 
interconnects, it is highly unlikely that the clock skew will be equal to the interconnect 
delay. Hence, the flip flop outputs will be cleared in time. As long as the WPE 
interconnect is designed for worst-case delay and it satisfies equation (3.8), the WPE 
routing technique can easily tolerate variations resulting from crosstalk noise, power 
supply noise and clock skew. 
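
The delay constraints of equations (3.8) and (3.9) can be checked numerically with a short helper. The sketch below assumes the constraint has the form (t_latency + 3*t_pulse + t_setup + t_hold + t_guardband) / t_clk < 1 for a single-cycle WPE interconnect and < 2 for one redesigned with an extra cycle of latency. The function name `wpe_cycles_needed` and the timing values in the usage note are hypothetical and serve only as an illustration.

```python
def wpe_cycles_needed(t_latency, t_pulse, t_setup, t_hold, t_guardband, t_clk):
    """Return 1 if the assumed form of (3.8) holds, 2 if only (3.9) holds,
    and None if neither constraint is met. All times share one unit."""
    total = t_latency + 3 * t_pulse + t_setup + t_hold + t_guardband
    if total < t_clk:
        return 1   # single-cycle WPE interconnect, equation (3.8)
    if total < 2 * t_clk:
        return 2   # extra cycle of latency, equation (3.9)
    return None    # not a candidate for WPE routing under these constraints
```

With illustrative values in picoseconds (a 1000 ps clock, 500 ps latency, 80 ps minimum pulse width, 20 ps setup, 20 ps hold and a 200 ps guardband), the total is 980 ps and the single-cycle constraint is met. Raising the latency to 700 ps pushes the interconnect into the two-cycle case.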
Using WPE routing instead of WPM routing may prove most advantageous in 
case of long interconnects. In case of long interconnects, the elimination of an 
interconnect eliminates a significant number of repeaters, which can justify using the 
large overhead circuitry for WPE routing. Thus, there will be minimal to no increase in 
the transistor area and power. On the contrary, by optimizing the spacing between the 
WPE interconnect and its neighbors, it may be possible to reduce transistor area and 
power dissipation, and improve performance. A comparison between conventional 


















Design                         Conventional   WPM                        WPE
[...]                          [...]          [...] circuit, φmin        Multiplexer, signal encoding
                                              generation circuit         circuits, flip-flops
Overhead transistor area       None           Low                        High
[...]                          Easy           Difficult                  Easy
[...]                          1.0            0.5                        0.5
Power (normalized)             1.0            1.2                        1.38
Transistor area (normalized)   1.0            0.85                       1.14
[...]                          1.0            1.0                        1.0
Power (normalized)             1.0            0.74                       0.82
Transistor area (normalized)   1.0            0.69                       0.96
 
Both the WPM and WPE techniques can be adopted to implement wire sharing. Each 
technique has its pros and cons. The WPM circuit provides significant benefits in terms of 
area and power, but it is susceptible to delay variations and increases design complexity. 






provide the same magnitude of advantages as the WPM routing technique. Depending on 
the interconnect circuit design at hand, the designer can choose one of the two techniques. 
 
3.7. Summary 
The tolerance levels of WPM routing to variations due to crosstalk noise, power 
supply noise and clock skew are presented in this chapter. As delay matching is necessary 
at the receiver side of a WPM interconnect, delay variations resulting from capacitive 
crosstalk noise and power supply noise can affect the signal integrity. It is shown that by 
designing the WPM circuit for worst-case wire delay and having the pulse width for the 
sampling signal to be at least equal to the difference between the best-case delay and the 
worst-case delay, the signal can be correctly sampled at the receiver side and will not be 
affected by capacitive crosstalk. Similarly, in case of power supply noise, the variations 
in supply voltage affect the interconnect delay. However, these delay variations are much 
smaller compared to those due to crosstalk noise. So by designing the WPM circuit for 
worst-case delay and choosing a suitable pulse width for signal φmin the signal integrity 
can be maintained. The clock skew that can exist between two different regions of a 
VLSI system can have a detrimental effect on the WPM circuit. A case study presented in 
this chapter shows that the WPM circuit can tolerate an average of 10% positive and 
negative clock skew. 
A new wire sharing technique called wave-pipelined encoded (WPE) routing that 
is less susceptible to the variations in VLSI design is proposed. This wire sharing 
technique sends encoded signals on the shared interconnect which are correctly sampled 






technique shows that the WPE technique requires more overhead circuitry as compared to 
the WPM technique. However, the WPE routing technique is more robust and can easily 
tolerate variations due to crosstalk noise, power supply noise and clock skew. Depending 
on the interconnect circuit at hand, a choice between WPM routing and WPE routing 
needs to be made to elicit maximum advantage in terms of power and area while 






 CHAPTER 4  
DESIGN AND IMPLEMENTATION OF A 
MULTILEVEL INTERCONNECT NETWORK 






In order to study the impact of WPM routing on the performance and cost of a 
digital GSI product, it is necessary to simulate the entire multilevel interconnect network. 
A multilevel interconnect network design simulator (MINDS) [12] currently exists and 
has elements that can be used to accomplish this task. MINDS uses a set of compact 
models and optimizes the various interconnect design parameters to formulate an optimal 
multilevel interconnect network. As part of this research, a second generation of this 
multilevel interconnect network design simulator is developed that interfaces with circuit 






Section 4.2 illustrates the need for interfacing simulation tools with MINDS by 
comparing the use of compact models and simulation tools to optimize various 
interconnect design parameters. An overview of other system-level simulation tools, like 
GENESYS, RIPE, BACPAC, etc., that are being used in the research community is 
presented in Section 4.3. A detailed description of the design methodology and the 
additional features added to the second generation of MINDS is presented in Section 4.4 
and Section 4.5, respectively. A comparison between the results obtained using the two 
generations of MINDS is elucidated in Section 4.6, while Section 4.7 involves the 
validation of the second generation MINDS. 
 
4.2. Compact models versus circuit simulation tools 
 The formulation of compact expressions to model the behavior of a circuit is 
important in the study of new circuit designs. This formulation of compact models 
enables a quick analysis of the new design methodology and it also provides physical 
insight into the behavior of the circuit which leads to a thorough understanding of the 
design technique. However, given the fact that there can be hundreds of parameters that 
can affect the working of a circuit, it becomes difficult to formulate expressions that will 
include a term for each parameter that affects the circuit in the compact expression. This 
introduces inaccuracies in the results, and the behavior of the circuit as predicted by a 
compact model may not always be the same as the actual behavior. 
 Figure 4.1 shows a plot of the 50% interconnect delay normalized to clock period 
versus the interconnect width for four different cases. Here, a 1.0 cm long interconnect 






optimal number and optimal size of repeaters [10] are inserted on the interconnect. The 
interconnect is assumed to have an aspect ratio of 1.0, and the spacing between the 
interconnects and dielectric thickness is assumed to be equal to the interconnect width. 
As can be seen from Figure 4.1, four different cases are considered. For the first 
case, the interconnect switching capacitance is determined using Chern’s models [44] and 
the 50% interconnect delay is calculated using the expressions derived by Bakoglu in 
[11]. For case two, the interconnect capacitance and delay models derived by Sakurai in 
[39] are used, while for the third case, the switching capacitance models of Sakurai [39] 
and the delay expression derived in [11] are used. For the fourth case, simulation tools 
are used to calculate both interconnect switching capacitance and 50% interconnect 
delay. The interconnect switching capacitance is calculated using RC2 models in 
RAPHAEL and the interconnect delay is calculated using level 49 HSPICE models. 
Figure 4.1 shows that the compact models severely underestimate the total interconnect 
delay. There is almost a 100% difference between the delay estimated by the compact 
models and the delay evaluated by simulation tools. Figure 4.2 shows a similar plot, 
where the interconnect width is fixed at 0.1 µm and the interconnect length is varied from 
0.1 cm to 1.4 cm. Here too, the compact models underestimate the 50% interconnect 
delay.  
The main reason for this difference between the compact models and simulation 
tools is the manner in which transistor resistance and capacitance are modeled. 
Compact models assume a linear relationship between the size of the transistor and its 
resistance/junction capacitance: the resistance is assumed to be inversely proportional to 






the transistor size. However, if transistors of different sizes are considered and the 
relationship between a transistor size and its junction capacitance and resistance is 
determined using HSPICE, then it is observed that the relationship is not linear. In 
addition, a transistor exhibits non-linear current characteristics, as opposed to the linear current 
characteristics of a resistor in the RC model used in [10]. Hence, the results obtained using 




















[Figure 4.1 annotation: Interconnect length = 1 cm; optimal number and size of repeaters]
 



























[Figure 4.2 annotation: Interconnect width = 1e-5 cm; optimal number and size of repeaters]
 
Figure 4.2: Interconnect delay vs interconnect length. 
 
  Various simulation tools are available that can be used to model the behavior of a 
circuit. These simulation tools consider the effect of almost all the parameters and 
enable a complex but accurate analysis of the behavior of a circuit. However, because 
a simulation tool considers the effect of almost all parameters, it has a higher 
execution time than the compact models. This higher execution time may deter 
system designers from using simulation tools to determine initial estimates for various 
design parameters. Hence, this work recommends using a combination of 
compact models and simulation tools to study new interconnect circuit design techniques. 








4.3. Existing tools for system-level analysis 
 A large number of system-level simulators have been developed to explore the 
effects of the various design strategies on current VLSI and future GSI systems. The 
primary motivation behind these simulators is to gain a quick and accurate estimate of the 
advantages of using new circuit design techniques on large systems. All these system-level 
simulators use either only analytical models for circuit-level analysis or only circuit 
simulation tools. 
 
4.3.1. System simulators using analytical circuit models 
 The biggest advantage of using analytical models for circuit analysis is that they 
enable a rapid exploration of the design space. Various design methodologies using 
complex analytical models have been proposed and simulated to explore the design space 
of future GSI systems. For example, Stanford University System Performance Simulator 
(SUSPENS) [11] predicts chip area, power dissipation and clock frequency of a system 
using technology, design and packaging as the input parameters. However, it does not 
account for on-chip cache, memory structure and details of multilayer interconnect 
structure and clock distribution. 
 Generic System Simulator (GENESYS) [46] is more comprehensive, as it 
incorporates limits from various levels of hierarchy to evaluate key performance metrics 
and examine the effects of technological and architectural changes on future GSI 
systems. It captures a broad range of material, device, circuit and interconnect parameters 






 Rensselaer Interconnect Performance Estimator (RIPE) [47] uses data from the ITRS 
roadmap to explore the effects of tradeoffs in interconnect design and technology on 
integrated circuit (IC) performance. It accounts for memory and multilayer interconnect 
structure. It calculates the maximum clock frequency as a measure of performance, 
wireability of the chip and power dissipation. A newer version of RIPE can assess the 
extent to which a particular design is affected by the limits set by line width 
wireability, random defects, signal integrity and electromigration [48]. 
 Berkeley Advanced Chip Performance Calculator (BACPAC) [49] predicts a 
system design using interconnect, device and system-level parameters. It performs a 
comprehensive analysis of delay, noise, power, wireability and yield using various 
analytical models to generate an optimal system design. It is applicable to both ASIC and 
microprocessor systems.  
 The GSRC Technology Extrapolation (GTX) system [50] provides an open 
portable framework for specification and comparison of alternative modeling choices. It 
allows the user to choose/edit inference chains and collection of rules, define new 
parameters and rules, and request specific studies like parameter optimization. It builds a 
universal modeling environment to help the user identify issues related to current and 
future systems.  
 A Multilevel Interconnect Network Design Simulator (MINDS) is developed 
based on an optimal n-tier multilevel interconnect architecture design methodology 
proposed in [12]. This design methodology optimizes the interconnect cross-sectional 
dimensions on each tier, and computes the logic macrocell area, cycle time, power 






parameters. The proposed predictive design methodology also helps define the process 
technology parameters for future generations of microprocessors and ASICs.  
 An estimation method for global net-length distribution based on the models of 
netlist, placement and routing has been used to develop a Global Interconnect 
Distribution Estimator (GLIDE) [51]. Here, the heterogeneous Rent’s rule is used for 
computing the netlist information and the random walk technique is used to obtain the 
placement and routing information for every fan-out. 
 
4.3.2. System simulator using circuit simulation tools 
 The use of simulation tools enables the formulation of a more rigorous design 
methodology that could generate a realistic and accurate framework to study new design 
techniques. It is possible to study the effect of almost all the parameters that can affect a 
particular design technique with the help of simulation tools.  
 A SPICE-based interconnect planning tool for on-chip interconnects called 
Network-on-Chip Interconnect Calculator (NoCIC) has been proposed in [52]. This tool 
enables designers to assess the impact of various interconnect circuit designs and 
understand the tradeoffs involved to achieve better a priori interconnect planning in a 
network-on-chip design. This tool creates a database of interconnect designs using 
detailed circuit-level simulations, and then depending on the user input the appropriate 
data points are selected from the database.  
 Though this approach is promising, it is not exhaustive. It is difficult to maintain 
an extensive database for the various design approaches across multiple technology 






IPs in a NoC. It would be preferable to have a tool that can perform a real-time SPICE 
analysis of the entire interconnect network.  
 A new contribution of this research is to use a combination of compact models 
and circuit and electromagnetic simulation tools in order to rapidly generate an accurate 
system-level simulation tool to study new design methodologies. To study the WPM 
routing technique at the system level a multilevel interconnect network design simulator 
is developed that uses a combination of compact models, HSPICE and RAPHAEL. This 
simulation tool is referred to as HR-MINDS. 
 
4.4. Design and implementation of HR-MINDS 
 The first generation multilevel interconnect network design simulator (MINDS) 
for GSI is designed based on the design methodology proposed in [12]. The simulator 
uses the number of gates in a macrocell, Rent’s parameters, operating frequency, macrocell 
area and interconnect aspect ratios as inputs and designs the entire multilevel interconnect 
network for the macrocell. It calculates the interconnect dimensions for each interconnect 
tier, dynamic power dissipated by the macrocell, maximum operating frequency of the 
system and the area occupied by the transistors. 
 The design methodology described in [12] uses compact models for evaluating 
various design variables. However, comparison between the results obtained from 
MINDS and HSPICE simulations in Section 4.2 shows that the compact models 
underestimate the interconnect delay resulting in an underestimation of the pitch for 
various interconnect lengths. Hence, HSPICE and RAPHAEL are integrated with 






determine the optimal pitch value (i.e. interconnect delay and available metal level area 
are simultaneously optimized) for the different interconnect lengths based on the 
performance constraints of the GSI system, while RAPHAEL is used to extract 
inductance and capacitance values for the different interconnects. The new simulator 
designed using the revised design methodology for generating the interconnect network 
design is called HR-MINDS. The new interconnect network design methodology in HR-
MINDS follows the same set of steps for MINDS as given in [12]. The main difference is 
that MINDS uses compact models for all calculations while HR-MINDS uses a 
combination of compact models and simulation tools. 
 A stochastic wire length distribution [36] is used to obtain an a priori estimate of 
interconnect lengths in a logical block while designing the n-tier interconnect network in 
MINDS. The interconnect density function i(l) that describes the macrocell wiring given 
in [36] is:  
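  Region 1:  1 \le l \le \sqrt{N_g}
    i(l) = \frac{\alpha k}{2} \Gamma \left( \frac{l^3}{3} - 2\sqrt{N_g}\, l^2 + 2 N_g\, l \right) l^{2p-4}
  Region 2:  \sqrt{N_g} \le l \le 2\sqrt{N_g}
    i(l) = \frac{\alpha k}{6} \Gamma \left( 2\sqrt{N_g} - l \right)^3 l^{2p-4}
(4.1) 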
where  
 l = Interconnect length in gatepitches; 
 Ng = Number of logic gates; 






 k = Rent’s coefficient; 
 α= Fraction of sink terminals in the macrocell; 
 Γ= Normalizing factor. 
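
As an illustration, the two-region density function can be evaluated with a few lines of code. The sketch below assumes the closed form of [36] as it is commonly stated, with Region 1 equal to (αk/2)Γ(l³/3 − 2√Ng·l² + 2Ng·l)·l^(2p−4) and Region 2 equal to (αk/6)Γ(2√Ng − l)³·l^(2p−4). The normalizing factor Γ is passed in as an input rather than computed, and the function and argument names are hypothetical.

```python
import math

def interconnect_density(l, n_gates, p, k, alpha, gamma):
    """Interconnect density i(l) for a length l in gate pitches.

    Assumes the two-region closed form of the stochastic wire length
    distribution; gamma (the normalizing factor) is supplied by the caller.
    """
    root_n = math.sqrt(n_gates)
    if l < 1 or l > 2 * root_n:
        return 0.0                      # outside both regions
    tail = l ** (2 * p - 4)
    if l <= root_n:                     # Region 1
        poly = l ** 3 / 3 - 2 * root_n * l ** 2 + 2 * n_gates * l
        return alpha * k / 2 * gamma * poly * tail
    # Region 2
    return alpha * k / 6 * gamma * (2 * root_n - l) ** 3 * tail
```

A useful sanity check on this form is that the two regions agree at l = √Ng and that the density falls to zero at l = 2√Ng.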
 In HR-MINDS, in addition to the option of using the stochastic interconnect 
distribution models [36], a wire distribution input by the user can also be used. The wire 
distribution should give the interconnect lengths and the number of interconnects of each 
length in the system. Using this input data, HR-MINDS can design the entire multilevel 
interconnect network for the system. 
 Shorter wires are routed on the lowest tier (collection of levels with the same 
wiring pitch) and have the smallest pitch value, and successively longer wires go on 
upper tiers with progressively larger pitch values. Interconnects on adjacent metal layers 
are routed orthogonally and they have the same wiring pitch. An interconnect tier is 
formed by grouping pairs of metal layers that have the same wiring pitch. 
 The key difference between the two design methodologies lies in how the pitch for the 
longest interconnect on a tier is determined. As explained in the design 
methodology in [12] for MINDS, in case of the design with no repeaters, the wiring pitch 
of the local tier is equal to twice the minimum feature size; for non-local tiers the wiring 
pitch is obtained by equating the resistance capacitance (RC) time delay of the longest 
interconnect to an acceptable fraction of the cycle time [53], [39]. The RC interconnect 






















\tau = \frac{\beta}{f_c} \approx \ldots       (4.2) 
where 
 τ = Interconnect time delay; 
 β = interconnect time delay expressed as a fraction of the cycle time (1/fc); 
 fc = operating frequency; 
 ρ = resistivity of metal; 
 εr = permittivity of the dielectric; 
 εo= permittivity of free space; 
 pt = interconnect pitch for the longest interconnect length on the tth tier; 
 Am = die area; 
 Lt = interconnect length of the longest interconnect on the tth tier. 
 On the other hand, for the design with repeaters, the expression for the time delay 
of an interconnect, when the number and size of equi-spaced repeaters is optimal and the 
cumulative delay of the repeaters and interconnect segments is minimum, derived by 
Bakoglu [11], [54] is given as: 
\tau = \frac{\beta}{f_c} = \ldots                (4.3) 
where 
Ro = Output resistance of a minimum sized inverter; 
Co = Input capacitance of a minimum sized inverter. 
 Though inserting an optimal number of repeaters on each interconnect provides 
the best performance, inserting a fraction of the optimal number of repeaters works out to 






shown in [12]. It can be observed that inserting 50% of the optimal number of repeaters 
imposes only a 10% performance penalty. Hence, [12] inserts a suboptimal number of 
repeaters on the interconnect and calculates the required pitch value using: 
\tau = \frac{\beta}{f_c} = \ldots     (4.4) 
where ζ is the ratio of the number of repeaters inserted on a wire to the optimal number of 
repeaters for that wire. 
 In the new design approach for HR-MINDS, HSPICE and RAPHAEL are used to 
calculate the required pitch value for a given interconnect length. The RC2 solver in 
RAPHAEL is used to extract the inductance and capacitance values of the interconnect 
under consideration. An RLC model of the interconnect is generated based on the design 
parameters, and the interconnect delay is evaluated using HSPICE for various pitch 
values. For the selected wire pitch value, the interconnect delay is equal to an acceptable 
fraction of the cycle time [39], [53]. 
 Similar to the design methodology in [12], a suboptimal number of repeaters is 
inserted on each interconnect. In addition, according to [13], Bakoglu’s expression for 
optimal repeater size [10] overestimates the required transistor size, resulting in an 
unrealistic design point. So a suboptimal number of repeaters, each having a suboptimal size, 
is inserted onto the interconnects. 
 The suboptimal value for the number of inserted repeaters is chosen as (1/ 2 ) 
times Bakoglu’s expressions [10] for optimal number of repeaters. Thus, the suboptimal 











= ⋅       (4.5) 
where 
Cint = interconnect capacitance; 
Rint = interconnect resistance. 
 Suboptimal sizes for the driver, receiver and repeaters are chosen as (1/ 2 ) 
times Bakoglu’s expressions [10] for optimal repeater size. Thus, the suboptimal repeater 









h       (4.6) 
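
The scaling of Bakoglu's optimal values can be sketched as follows. The code assumes the standard closed-form expressions for the optimal number of repeaters, sqrt(0.4·Rint·Cint / (0.7·Ro·Co)), and the optimal repeater size, sqrt(Ro·Cint / (Rint·Co)). The scaling factor applied to both is left as the parameter `scale` rather than fixed to a particular value, and all function names are hypothetical.

```python
import math

def bakoglu_optimal(r_int, c_int, r0, c0):
    """Optimal repeater count and size (in multiples of a minimum inverter),
    using the standard Bakoglu closed-form expressions."""
    k_opt = math.sqrt(0.4 * r_int * c_int / (0.7 * r0 * c0))
    h_opt = math.sqrt(r0 * c_int / (r_int * c0))
    return k_opt, h_opt

def suboptimal_repeaters(r_int, c_int, r0, c0, scale):
    """Scale both optimal values by `scale` (0 < scale <= 1) to obtain a
    suboptimal repeater count and size."""
    k_opt, h_opt = bakoglu_optimal(r_int, c_int, r0, c0)
    return scale * k_opt, scale * h_opt
```

For instance, with illustrative values Rint = 1750 Ω, Cint = 0.1 pF, Ro = 1 kΩ and Co = 1 fF, the optimal number of repeaters evaluates to 10, and a scale of 0.5 would give 5 repeaters.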
 While calculating the interconnect pitch, the design methodology assumes a unity 
aspect ratio i.e. H = S = W = T = Pt/2, where W is the metal width, T is the metal 
thickness, S is the spacing between the interconnects, H is the height of the inter-level 
dielectric and Pt is the tier pitch. However, there is flexibility to choose aspect ratios 
greater than unity. The wiring efficiency factor [55], [11] is assumed to be constant at 
40% for all the levels in the case study. This factor accounts for via blockage, power and 
ground lines and routing efficiency of the CAD tools.  
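
The unity aspect ratio assumption fixes all four cross-sectional dimensions once the tier pitch is known, which in turn fixes the wire resistance per unit length, ρ/(W·T) = 4ρ/Pt². The following sketch illustrates this; the function names are hypothetical, and the resistivity used in the usage note is only an illustrative value for copper.

```python
def tier_geometry(pitch):
    """Cross-sectional dimensions for a unity aspect ratio: H = S = W = T = Pt/2."""
    half = pitch / 2
    return {"W": half, "T": half, "S": half, "H": half}

def resistance_per_length(resistivity, pitch):
    """Wire resistance per unit length, rho / (W * T), for the geometry above."""
    dims = tier_geometry(pitch)
    return resistivity / (dims["W"] * dims["T"])

def usable_wiring_area(level_area, efficiency=0.40):
    """Area of a metal level available for signal wiring, given the
    wiring efficiency factor (40% in the case study)."""
    return efficiency * level_area
```

With a pitch of 2e-5 cm the wire width is 1e-5 cm, and a resistivity of 1.7e-6 Ω·cm gives a resistance of about 1.7e4 Ω/cm.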
The following assumptions are made while integrating HSPICE and RAPHAEL with 
MINDS: 
1. Given that a longer execution time will be required due to the integration of 
HSPICE and RAPHAEL with MINDS, a dynamic lookup table method is adopted 
to reduce the total execution time. The lookup table stores an interconnect length 
and its corresponding optimal wiring pitch that satisfies the performance 






length of a new interconnect under consideration is within 50 gate pitches (shorter or 
longer) of an entry in the lookup table, then the pitch value is read 
from that entry in the lookup table. 
2. The total time delay of the interconnect is the time required for the input signal to 
reach 50% of the Vdd at the input of the receiver. 
3. [12] assumes the longest length of an interconnect in a system to be √N gate 
pitches, where N is the number of gates in the system. For the revised version 
of MINDS, the longest interconnect length is assumed to be 2*√N gate pitches. 
Though the probability of having an interconnect of length 2*√N gate pitches is 
very small, the methodology considers a worst-case scenario where the system design 
has a global bus traveling from one corner of the system to its diagonally opposite 
corner. 
4. The interconnect time delay expressed as a fraction of the cycle time i.e. beta (β), 
is assumed to be 0.25 for shorter interconnects (bottom most tier) as they would 
most likely constitute the critical path. For the longer interconnects β is assumed 
to be 0.8 as these longer interconnects would primarily be used for cross chip 
communication. 
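
The dynamic lookup table of assumption 1 can be sketched as a small cache placed in front of the expensive pitch solver. This is a minimal illustration under stated assumptions: the class name `PitchLookupTable`, the callback `solve_pitch` (standing in for the HSPICE and RAPHAEL based pitch evaluation) and the inclusive treatment of the 50 gate pitch window are all hypothetical.

```python
class PitchLookupTable:
    """Cache of (interconnect length, pitch) pairs, lengths in gate pitches."""

    def __init__(self, tolerance=50):
        self.tolerance = tolerance   # reuse window in gate pitches
        self.entries = []            # list of (length, pitch) tuples

    def find(self, length):
        """Return a stored pitch whose length is within the window, else None."""
        for stored_length, pitch in self.entries:
            if abs(length - stored_length) <= self.tolerance:
                return pitch
        return None

    def get_pitch(self, length, solve_pitch):
        """Reuse a nearby entry if one exists; otherwise run the solver
        once and store its result for later lengths."""
        pitch = self.find(length)
        if pitch is None:
            pitch = solve_pitch(length)
            self.entries.append((length, pitch))
        return pitch
```

In this sketch, a request for a length of 1040 gate pitches after one for 1000 gate pitches reuses the stored pitch, so the solver is invoked only when no entry lies within the window.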
 
 The revised design algorithm uses a bisection method for evaluating the pitch for 
a given interconnect length to further reduce the execution time. Starting with an upper 
limit of 2e-2 cm and a lower limit of 2e-5 cm for the pitch, the midpoint of this range is 
considered as the first estimate of the pitch for the interconnect under consideration. This 






the interconnect satisfies the timing constraint, i.e. the absolute value of the difference 
between the total time delay of the interconnect and (0.8*clock period) is less than 2.5%. 
If the timing constraint is not satisfied, then depending on whether the interconnect delay 
is less than or greater than (0.8*clock period), the range of pitch values to be considered for 
the next iteration is appropriately reduced. The midpoint of this new range is considered 
as the estimate of the pitch value for the next iteration. Thus, the range of the pitch values 
is reduced for each new iteration until the midpoint of the range satisfies the timing 
constraint. In the case of shorter interconnects, the wiring pitch that satisfies the timing 
constraint evaluates to less than twice the feature size. For these interconnects, the wiring 
pitch is set to a default minimum value of twice the feature size. Figures 4.3 and 4.4 show 
simple flowcharts for the HR-MINDS design flow for the case without repeaters and 
with repeaters, respectively. 
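
The bisection search described above can be sketched as follows. This is an illustration under stated assumptions, not the HR-MINDS implementation: HR-MINDS obtains the delay from HSPICE and RAPHAEL, whereas the hypothetical `wire_delay_s` below is a toy distributed-RC expression with assumed resistivity and capacitance values. The 2.5% tolerance is interpreted here as relative to the target delay, and the lower pitch limit of 2e-5 cm is taken to be twice a 100 nm feature size.

```python
def wire_delay_s(pitch_cm, length_cm, rho_ohm_cm=1.7e-6, cap_f_per_cm=2e-12):
    """Toy distributed-RC delay; a stand-in for the simulator-based delay.

    Assumes wire width = wire thickness = pitch / 2 and a fixed capacitance
    per unit length. All numeric values here are illustrative assumptions.
    """
    width = thickness = pitch_cm / 2.0
    res_per_cm = rho_ohm_cm / (width * thickness)
    return 0.38 * res_per_cm * cap_f_per_cm * length_cm ** 2


def find_pitch_cm(length_cm, clock_hz, beta=0.8, min_pitch_cm=2e-5,
                  max_pitch_cm=2e-2, rel_tol=0.025, max_iter=100):
    """Bisection on the pitch until the delay is within rel_tol of the target."""
    target = beta / clock_hz
    # Short wires: the minimum pitch already meets the timing target.
    if wire_delay_s(min_pitch_cm, length_cm) <= target:
        return min_pitch_cm
    if wire_delay_s(max_pitch_cm, length_cm) > target:
        raise ValueError("target delay cannot be met within the pitch range")
    lo, hi = min_pitch_cm, max_pitch_cm
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        delay = wire_delay_s(mid, length_cm)
        if abs(delay - target) / target < rel_tol:
            return mid
        if delay > target:
            lo = mid   # too slow: a larger pitch is needed
        else:
            hi = mid   # faster than needed: a smaller pitch suffices
    raise RuntimeError("bisection did not converge")
```

For example, `find_pitch_cm(2.19, 1.3e9)` searches the full range for a long global wire, while `find_pitch_cm(0.01, 1.3e9, beta=0.25)` returns the default minimum pitch because the short wire already meets its budget there.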
 
[Figure 4.3 flowchart: HR-MINDS design flow without repeaters] 
Input: N, f, Ac, p, k 
Step 1: t = 1; calculate L1 and p1 with τ(L1) = 0.25τ 
Step 2: t = t + 1; calculate Lt and pt with τ(Lt) = 0.8τ 
Decision: Is Lt = … ? (Y / N) 
Output: Lt and pt for all tiers, power, transistor area 
Legend: 
N = Number of gates 
f = Operating frequency 
Ac = Area of the system 
p, k = Rent’s co-efficients 
t = tier count 
Lt = Longest interconnect on tier t 
pt = Pitch for tier t (no repeaters) 












Step 1: Start with the design with no repeaters 
Step 2: t = ttop; redesign t to calculate Lt, prt 
Step 3: t = t - 1; redesign all upper tiers to calculate Lt, prt for t to ttop 
Decision: Is nrep = nmax or Atran = αAc? (Y / N) 
Output: Lt, prt, power, transistor area 
Legend: 
ttop = Tier count for no repeater case 
prt = Pitch for tier t (with repeaters) 
nrep = Number of repeaters 
Atran = Transistor area (logic + rep + WPM) 
α = Transistor placement efficiency 
 
 
Figure 4.4: Flowchart for design flow of HR-MINDS – with repeaters. 
  
4.5. Extensions to the design capabilities of MINDS 
 In addition to interfacing MINDS with HSPICE and RAPHAEL, several new 
features are also added to HR-MINDS. HR-MINDS provides the facility to generate a 
metal-level-count-centric design in addition to the macrocell-area-centric approach in 
MINDS. In the macrocell-area-centric design methodology, the macrocell area is fixed 
and the number of metal levels required for routing the interconnects is 
calculated. Here, the topmost metal layers may be partially filled. A placement efficiency 
of 60% is assumed while calculating the total area occupied by the transistors. It is 
ensured that the area required for transistors is less than 60% of the macrocell area. The 
transistor area is calculated based on the average interconnect length and the number of 






transistors is recommended. On the other hand, in case of the metal-level-count-centric 
design, the number of metal levels is fixed. So, the macrocell area is appropriately 
adjusted to ensure that all the metal levels are used to their maximum capacity. Here too, 
a bisection method is adopted to converge on a correct value for macrocell area. While 
creating the metal-level-count-centric design, it is also ensured that there is enough 
silicon area for the transistors.  
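
The area search for the metal-level-count-centric design can be sketched in the same way. The code below is a hedged illustration only: `levels_required` is a hypothetical toy model (metal levels falling as the inverse square root of the macrocell area) calibrated so that 1.125 cm2 corresponds to 8 levels, the example of Section 4.6. In HR-MINDS this quantity would come from the full network design.

```python
import math


def levels_required(area_cm2, calib=8.0 * math.sqrt(1.125)):
    """Toy model: required metal levels fall as 1/sqrt(macrocell area).

    The constant is chosen only so that 1.125 cm^2 maps to 8 levels; it is
    an assumption made for this example.
    """
    return calib / math.sqrt(area_cm2)


def size_macrocell_cm2(target_levels, transistor_area_cm2, placement_eff=0.6,
                       lo=0.01, hi=100.0, tol=1e-4, max_iter=200):
    """Bisection on the macrocell area until the metal levels are just filled."""
    for _ in range(max_iter):
        area = 0.5 * (lo + hi)
        levels = levels_required(area)
        if abs(levels - target_levels) < tol:
            break
        if levels > target_levels:
            lo = area   # wiring does not fit: enlarge the macrocell
        else:
            hi = area   # levels under-used: shrink the macrocell
    else:
        raise RuntimeError("bisection did not converge")
    # Silicon check: transistors may occupy at most placement_eff of the area.
    if transistor_area_cm2 > placement_eff * area:
        raise ValueError("not enough silicon area for the transistors")
    return area
```

Calling `size_macrocell_cm2(8, 0.322)` recovers an area close to 1.125 cm2 under this toy model; raising an error when the silicon check fails is a design choice of the sketch, not a documented behavior of the simulator.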
 The original version of MINDS is designed for 100 nm technology. With the shift 
towards a new technology generation every 2-3 years, it is imperative to have a simulator 
that can generate designs across multiple technology generations. HR-MINDS is 
extended to design the multilevel interconnect network for 130 nm, 100 nm, 70 nm and 
45 nm. The HSPICE models are obtained from [56]. The simulator can be easily 
extended to design the interconnect networks for technology generations beyond 45 nm. 
The option to design across multiple generations provides a better perspective into the 
design opportunities for future technology generations. 
 Leakage power is expected to become a significant proportion of the total power 
dissipated by a GSI system for deep sub-micron technologies. Hence, MINDS is also 
extended to calculate leakage power along with the dynamic power dissipated by the 
system. A model to calculate short-circuit power is also incorporated into the system. The 
total power dissipated by a digital VLSI system is given by the sum of dynamic power 
PD, leakage power PL and short-circuit power PSC. The dynamic power dissipated by the 
system is calculated using the same models as those used in the first generation of 
MINDS. These models are described in [12].  The models used to calculate leakage 











[Equation 4.7: leakage power PL as a function of Vdd, Ioff, noff, Wp and Wn]      (4.7) 
where Ioff is the subthreshold leakage current per unit transistor width at room 
temperature and noff is the total number of transistors in the macrocell that are in the OFF 
state. Wp (Wn) denotes the width of the PMOS (NMOS) transistor. The value of Ioff is chosen 
based on the predictions in [4] for the different technology generations. 
 Short-circuit power is calculated using the model in [14] and is given by 
$$P_{SC} = \frac{1}{2}\,\alpha\,V_{dd}\,I_{sc}\,t_r \sum_{i=1}^{n_{inv}} W_{n_i}$$      (4.8) 
where Isc is the short-circuit current per unit transistor width at room temperature, ninv is 
the number of inverters in a macrocell and Wn is the width of NMOS in an inverter. 
Based on [14], the value of Isc is chosen to be 65µA/µm. In equation (4.8), assuming 
symmetric transitions, tr denotes the time for input voltage of inverter to rise from Vtn to 
Vdd -Vtp, where Vtn and Vtp are the threshold voltages of NMOS and PMOS respectively. 










      (4.9) 
where τ represents the RC time constant of the circuit. 
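
As an illustration of how such a rise time can be related to τ, the sketch below assumes that the inverter input charges as a single-pole exponential, V(t) = Vdd(1 - exp(-t/τ)). This charging model, the function name and the example voltages are assumptions made here; they are not necessarily the form used in equation (4.9).

```python
import math


def rise_time(tau, vdd, vtn, vtp):
    """Time for an exponentially charging node to go from vtn to vdd - vtp.

    Assumes V(t) = vdd * (1 - exp(-t / tau)); returns a value in the same
    time unit as tau.
    """
    if not (0.0 < vtn < vdd - vtp < vdd):
        raise ValueError("expected 0 < vtn < vdd - vtp < vdd")
    t_start = tau * math.log(vdd / (vdd - vtn))   # input reaches vtn
    t_end = tau * math.log(vdd / vtp)             # input reaches vdd - vtp
    return t_end - t_start


# Hypothetical example: vdd = 1.2 V and both thresholds at 0.3 V.
print(rise_time(1.0, 1.2, 0.3, 0.3))
```

Under this assumption the two logarithms combine to τ ln((Vdd - Vtn)/Vtp), so the rise time scales linearly with the RC time constant.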
 HR-MINDS also provides the flexibility to develop a semi-customized design and 
a fully-customized design. The semi-custom design is driven by the foundry 
specifications, and hence the interconnect dimensions and metal level count are fixed. As a 






levels for the given interconnect dimensions. In addition, since the interconnect 
dimensions are fixed, the maximum operating frequency of the system is calculated based 
on whether it is limited by the critical path delay or the delay of the longest interconnect. 
On the other hand, for the full-custom design, there is significant flexibility in the 
choice of design parameters. The user needs to specify the operating frequency and 
transistor count. The metal count, macrocell area and interconnect dimensions that meet 
the delay constraints are appropriately calculated. 
 Because HR-MINDS uses simulation tools, any new circuit design technique can be 
easily studied with it. The new circuit can be studied under variations of the 
design parameters, which gives a broader outlook on the effect of applying new 
design styles to large systems. 
 
4.6. Comparison between MINDS and HR-MINDS 
 In order to compare the multilevel interconnect network designed by MINDS and 
HR-MINDS, a macrocell having 40 million transistors is simulated. It is assumed that the 
macrocell operates at 1.3 GHz. For 8 metal levels, HR-MINDS calculates the required 
macrocell area as 1.125 cm². Using the same macrocell area, the interconnect network is 
again designed using MINDS. Table 4.1 shows a comparison of the metal level count, 
transistor area and power dissipation of the interconnect networks designed using 









Table 4.1: MINDS vs HR-MINDS. 
Design parameter           MINDS                HR-MINDS 
Inputs 
  Number of transistors    40 million           40 million 
  Macrocell area           1.125 cm²            1.125 cm² 
  Operating frequency      1.3 GHz              1.3 GHz 
--------------------------------------------------------------- 
  Number of metal levels   4.92                 8.00 
  Transistor area          0.348 cm²            0.322 cm² 
  Power dissipation        15.51 W (dynamic)    29.84 W (dynamic … 




 As can be seen from the table, the number of required metal levels calculated by 
MINDS for the same macrocell area is much smaller than that calculated by HR-MINDS. The 
transistor (logic + repeater) area calculated by both simulators is comparable and is less 
than 60% of the macrocell area. HR-MINDS calculates dynamic power, leakage power 
and short-circuit power. 
 
4.7. Validation of HR-MINDS 
 To validate the interconnect network designed using HR-MINDS, the design from 
HR-MINDS is compared with the logic unit designed for the SPARC64 processor. A detailed 
description of the SPARC64 processor is presented in [61]. The SPARC64 processor has 
a total of 190 million transistors with approximately 19 million in logic circuits. The 
processor is designed using 130 nm technology and operates at 1.3 GHz. A supply voltage 






area of 2.9 cm². The interconnect dimensions (pitch and thickness) used in the different 
metal layers are given in Table 4.2. 
 
Table 4.2: Interconnect dimensions for the different metal levels of the SPARC64 
processor (130 nm technology). 
Metal layer    Pitch (nm)    Thickness (nm) 
1              450           200 
2              450           200 
3              450           225 
4              450           225 
5              900           450 
6              900           450 
7              1800          900 
8              1800          900 
 
 In order to validate the design methodology of HR-MINDS, the logic unit in the 
SPARC64 processor is considered. A digital system having 20 million transistors is 
simulated using HR-MINDS. The multilevel interconnect is designed as a semi-custom 
interconnect network as the interconnect dimensions used in the SPARC64 processor are 
used for the system designed using HR-MINDS. Rent’s values of k = 4.0 and p = 0.6 are 
used in the HR-MINDS design. Table 4.3 shows a comparison between the SPARC64 
design and a design obtained using HR-MINDS. 
 
Table 4.3: Comparison of the SPARC64 design and HR-MINDS design. 
Design parameter         SPARC64 processor       HR-MINDS design 
Technology generation    130 nm                  130 nm 
Number of transistors    19M                     20M 
Operating frequency      1.3 GHz                 1.3 GHz (max. 1.44 GHz) 
Macrocell area           0.66 cm²                0.7304 cm² 
Power dissipation        34.7 W (entire chip)    36.76 W 
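
The relative differences implied by Table 4.3 can be checked with a few lines of arithmetic. The snippet is only a convenience sketch; the helper name is hypothetical and the numbers are copied from the table.

```python
def percent_difference(value, reference):
    """Difference of value from reference, as a percentage of reference."""
    return 100.0 * (value - reference) / reference


# Macrocell area: HR-MINDS estimate versus the SPARC64 logic unit (cm^2).
area_gap = percent_difference(0.7304, 0.66)

# Clock: maximum frequency predicted by HR-MINDS versus the target (GHz).
freq_margin = percent_difference(1.44, 1.3)

print(f"area gap = {area_gap:.1f}%, frequency margin = {freq_margin:.1f}%")
```

Both figures come out slightly below 11%, which is the size of the area overestimate and of the frequency guardband in this comparison.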






 As can be observed from Table 4.3, the design obtained using HR-MINDS is 
similar to the SPARC64 design. For the design generated using HR-MINDS, the 
operating frequency of the system is determined by the logic critical path delay. The 
maximum operating frequency predicted by HR-MINDS is 1.44 GHz, which is higher 
than 1.3 GHz and provides a sufficient guardband for any variations in clock 
distribution. Thus, the HR-MINDS design can easily operate at 1.3 GHz. The macrocell 
area for the logic units of the SPARC64 processor is estimated using the die micrograph. 
The area estimated using HR-MINDS is only about 11% larger than the area estimate for 
the SPARC64 design. As far as power dissipation is concerned, it is difficult to compare 
the SPARC64 design and the HR-MINDS design, as the power dissipation for the SPARC64 
processor includes the power dissipated by the logic, memory, signal lines, clock lines and 
power lines. The SPARC64 processor also uses various low power techniques such as static 
design, input gating of unused units, clock gating in the L2 cache unit, clock tree design 
and a low parasitic capacitance 130 nm CMOS SOI process [61]. HR-MINDS does not 
include these low power approaches in its design methodology. The power dissipation 
calculated by HR-MINDS only includes the power dissipated by the logic unit and its 
signal lines. However, it is possible to extend HR-MINDS to include these 




A comparison between compact models and simulation tools shows that there is 






therefore recommended in this chapter that using a combination of compact models and 
simulation tools enables a quick and accurate exploration of future GSI designs. The 
design and implementation of the second generation multilevel interconnect network 
design simulator is presented in this chapter. The second generation simulator is 
interfaced with HSPICE and RAPHAEL to design interconnect circuits. In addition, 
extensions to the first generation simulator in terms of metal-level-count-centric design, 
power calculations and design for multiple technology generations are also presented. A 
comparison of the two generations of multilevel interconnect simulator and the validation 













In addition to studying a new circuit design at the circuit level, it is imperative to 
determine the effect of application of the new design technique on a large system. In the 
hierarchical approach adopted to study WPM routing, the second level of hierarchy 
involves the system-level analysis. Here the entire multilevel interconnect network of a 
VLSI system is designed using HR-MINDS and the advantages of applying WPM 
routing are studied. At the system level, depending on the design flow, a full-custom or 
semi-custom multilevel interconnect network design can be generated. In case of a full-
custom interconnect design there is a significant amount of flexibility in the choice of 
interconnect design parameters. The design methodology used for a full-custom WPM n-
tier interconnect network is presented in Section 5.2. Given the flexibility in design 
choice it is possible to generate a wire-area-centric WPM design, power-centric WPM 






design, where the macrocell area is fixed, is presented in Section 5.3. The advantages in 
terms of reduction in metal area using WPM routing are also presented here. For the 
power-centric design explained in Section 5.4, it is possible to formulate a core-area-
centric low power design or a wire-coupling-centric low power design. A significant 
reduction in power using WPM routing is observed in a low power design. A 
performance-centric design approach is discussed in Section 5.5. 
 
5.2. Design methodology for full-custom WPM n-tier interconnect network 
 
 As described in chapter 4, the multilevel interconnect network for a fully-
customized system can be easily designed using the multilevel interconnect network 
design simulator that uses HSPICE and RAPHAEL (HR-MINDS). Using the number of 
gates, core area and operating frequency as inputs, the entire n-tier interconnect 
architecture can be designed using a stochastic interconnect distribution [36] and 
simulation tools. As described in [12], the interconnect length assignment on each tier is 
determined by equating the area available for wiring to the area required for wiring.  
 Because wire dimensions are not fixed in a fully-customized multilevel 
interconnect network, there is a significant amount of flexibility for WPM design to 
optimize wire area, power, or performance. For example, it is possible to create a wire-
area-optimized WPM design, where the total core area remains constant. The number of 
metal levels in this design approach is variable and is appropriately calculated depending 
on the core area. Since the application of WPM technique reduces interconnect count, 






routing. It is also possible to formulate a power-optimized WPM design methodology, 
where the number of metal levels remains fixed and the core area and/or the wire spacing 
is optimized to completely utilize the freed up wire routing tracks. This reduction in core 
area or increase in wire spacing can be used to reduce power dissipation. A final design 
option is the performance-optimized design. Here too, the reduction in core area and/or 
increase in wire spacing can be used to improve the clock frequency of the system. 
 The design flow adopted for the full-custom WPM interconnect network is the same 
as that shown in the two flowcharts in Figures 4.3 and 4.4 in Section 4.4. Using a 
multilevel WPM interconnect network without repeaters as the base case, the multilevel 
WPM interconnect network with repeaters is then designed. Since a probabilistic 
interconnect distribution technique is adopted to design the interconnect network, it is not 
possible to determine the exact number of interconnects that satisfy the proximity 
constraints described in Section 7.2. While designing the WPM interconnect network, a 
wire sharing efficiency factor is considered. This wire sharing efficiency factor quantifies 
the fraction of the number of interconnects that satisfy the proximity constraints and can 
be designed using WPM routing. The wire sharing efficiency factors for some sample 
benchmark circuits are presented in Section 7.4. In addition, it is not always advantageous 
to apply WPM routing to shorter interconnects, as the resulting reduction in interconnect 
length may not justify the required overhead circuitry. Hence, a cutoff length is 
considered. The cutoff length gives the lower limit of the range of interconnects that can 
be designed using WPM routing. Thus, the WPM routing technique is applied to only 







5.3. Wire-area-optimized WPM design methodology 
 For the wire-area-optimized design methodology, core area is fixed. The 
application of WPM routing reduces the number of interconnects that need to be routed. 
Because the core area is fixed and the number of metal levels is variable, the reduction in 
interconnect count due to WPM routing results in a reduction in the number of metal levels 
that are required for routing the interconnects. 
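
The effect of the cutoff length and the wire sharing efficiency on the number of wires to be routed can be illustrated with a small sketch. It is not the HR-MINDS procedure, which works on a stochastic interconnect distribution; here an explicit list of lengths is used, the function name is hypothetical, and the number of signals multiplexed onto one shared wire (`signals_per_wire`, set to 2) is an assumption of the example.

```python
def wires_to_route(lengths_cm, cutoff_cm, sharing_eff, signals_per_wire=2):
    """Number of physical wires left after sharing.

    lengths_cm: one entry per point-to-point interconnect.
    sharing_eff: fraction of wires at or above the cutoff that can be shared.
    signals_per_wire: signals multiplexed onto one shared wire (assumed).
    """
    eligible = sum(1 for length in lengths_cm if length >= cutoff_cm)
    shared_signals = sharing_eff * eligible
    wires_saved = shared_signals * (1.0 - 1.0 / signals_per_wire)
    return len(lengths_cm) - wires_saved


# Hypothetical population: 100 short wires and 100 long wires.
lengths = [0.05] * 100 + [0.5] * 100
print(wires_to_route(lengths, cutoff_cm=0.084, sharing_eff=0.6))
```

In this example 60 of the 100 long wires are shared in pairs, so 30 wires are saved and 170 remain; a lower cutoff or a higher sharing efficiency removes more wires and therefore lowers the metal level count.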
 In order to study the wire-area-centric approach for WPM design, a digital system 
consisting of 40 million logic transistors is designed using 100 nm technology parameters 
with a 3-input (six transistor) NAND gate chosen to represent the average standard gate. 
Copper and a low-k (εr= 2.0) dielectric material are used to design the multilevel 
interconnect architecture. The system is assumed to have a die area of 1.2 cm² and to 
operate at 1.3 GHz. A suboptimal number of repeaters is inserted, as [12] shows that 
inserting 50% of the optimal number of repeaters imposes only a 10% performance penalty. 
Repeaters are inserted beginning from the topmost tier and are successively inserted on 
lower tiers based on the availability of free silicon area. Repeater sizing is assumed to be 
suboptimal because Bakoglu’s expression [10] for optimal sizing of the repeaters 
overestimates the required transistor size [13]. A repeater whose size is 43% smaller than 
the optimal size yields only a 6.8% delay penalty. The multiplexers and demultiplexers for the shared wires 
are inserted with the repeaters while checking for the availability of silicon area. Signal 
integrity analysis using HSPICE simulations demonstrates that the transistor sizing for 
the multiplexer and demultiplexer pair can be smaller than that of the driver and receiver 
sizing. Hence, the increase in silicon area due to the WPM overhead is minimal. For our 






Thus, various wire routing patterns and/or source-sink pair placement patterns can be 
considered while determining the system-level impact of the WPM technique.  
 Figure 5.1 shows the demand function [36] curves for the various cutoff lengths, 
for a 40 million transistor system. A 60% wire sharing efficiency has been used while 
plotting these demand curves. As expected, for lower cutoff lengths the demand curve 
saturates at lower demand function values. This saturation of demand curves at lower 
values indicates that a smaller number of interconnects need to be routed resulting in a 




























[Figure 5.1 plot: demand function curves for a 40 million transistor system] 
 
Figure 5.1:  Demand function variance for different cutoff lengths. 
 
 Table 5.1 shows the number of metal levels that would be required for designing the 






consideration the longest interconnect length is 2.19 cm (twice the die edge) and the 
number of metal levels required is 9.3 (~10) for the conventional case. Table 5.2 shows 
the total power dissipation for the various cutoff lengths. As expected, the power 
dissipation increases towards lower cutoff lengths because of the additional overhead 
circuitry. The conventional design dissipates 19.51 W at 1.3 GHz. 
 
Table 5.1: Number of metal levels for the various cutoff lengths and wire sharing 
efficiencies. Conventional case = 9.3 (~10) metal levels. 
Cutoff (in cm) 100% 60% 50% 40% 30% 20% 
0.084 6.39 8.00 8.004 8.05 8.18 8.53 
0.169 6.86 8.003 8.05 8.10 8.35 8.57 
0.254 7.12 8.01 8.09 8.22 8.40 8.59 
0.339 7.41 8.05 8.13 8.33 8.54 8.68 
0.424 8.00 8.12 8.31 8.49 8.52 8.79 
0.509 8.01 8.25 8.40 8.50 8.62 8.76 
0.593 8.03 8.29 8.36 8.51 8.71 8.80 
0.678 8.13 8.42 8.56 8.66 8.76 8.81 
0.763 8.25 8.57 8.69 8.71 8.79 8.84 
0.848 8.45 8.65 8.72 8.80 8.88 8.96 
0.933 8.45 8.84 8.87 8.98 9.02 9.05 
2.121 9.20 9.26 9.24 9.26 9.25 9.27 




















Table 5.2: Power dissipation (in W) for the various cutoff lengths and wire sharing 
efficiencies. Conventional case = 19.51 W. 
Cutoff (in cm) 100% 60% 50% 40% 30% 20% 
0.084 21.00 20.34 20.16 20.00 19.88 19.75 
0.169 20.35 19.99 19.89 19.80 19.73 19.65 
0.254 20.19 19.87 19.81 19.74 19.68 19.62 
0.339 20.01 19.77 19.72 19.67 19.63 19.59 
0.424 19.92 19.70 19.67 19.64 19.60 19.57 
0.509 19.84 19.67 19.64 19.61 19.58 19.55 
0.593 19.75 19.63 19.61 19.59 19.57 19.54 
0.678 19.66 19.59 19.58 19.56 19.55 19.53 
0.763 19.61 19.57 19.56 19.55 19.54 19.53 
0.848 19.58 19.55 19.54 19.54 19.53 19.52 
0.933 19.57 19.55 19.54 19.53 19.53 19.52 
2.121 19.51 19.51 19.51 19.51 19.51 19.51 
2.190 19.51 19.51 19.51 19.51 19.51 19.51 
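
The percentages discussed for Tables 5.1 and 5.2 follow directly from the tabulated values. The sketch below works through the 0.084 cm cutoff row; the helper names are hypothetical, and rounding partially filled levels up to whole metal levels is an assumption suggested by the "9.3 (~10)" notation.

```python
import math

CONV_LEVELS = 9.3      # conventional design, Table 5.1
CONV_POWER_W = 19.51   # conventional design, Table 5.2

# Entries for the 0.084 cm cutoff row: efficiency -> (metal levels, power in W)
ROW_0084 = {"100%": (6.39, 21.00), "60%": (8.00, 20.34)}


def percent_reduction(new, ref):
    return 100.0 * (ref - new) / ref


def percent_increase(new, ref):
    return 100.0 * (new - ref) / ref


def summarize(levels, power_w):
    """Relative savings in metal levels and relative cost in power."""
    return {
        "levels_reduction": percent_reduction(levels, CONV_LEVELS),
        "whole_levels_reduction": percent_reduction(
            math.ceil(levels), math.ceil(CONV_LEVELS)),
        "power_increase": percent_increase(power_w, CONV_POWER_W),
    }


for eff, (levels, power_w) in ROW_0084.items():
    print(eff, {k: round(v, 1) for k, v in summarize(levels, power_w).items()})
```

For 60% wire sharing efficiency the fractional level count drops by about 14%, which corresponds to 20% when counted in whole metal levels (10 to 8), for a power increase of a little over 4%.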
 
 Figure 5.2 shows the percent reduction in the required wire area versus cutoff 
length for various wire sharing efficiencies. As can be seen from the plot, a higher wire 
sharing efficiency yields a larger reduction in the required wire area, as 
the WPM technique can be applied to a larger number of interconnects. In addition, the 
percent reduction is higher for lower cutoff lengths. Close to 20% reduction in the total 
number of metal levels can be obtained for a wire sharing efficiency of 60%. 
 The percent increase in the dynamic power of the system because of the 
application of the wire sharing technique, for various wire sharing efficiencies, is shown 
in Figure 5.3. The increase in the dynamic power is primarily due to the overhead 
circuitry required for implementing wire sharing. There is no reduction in dynamic power 
because of the elimination of the dedicated interconnects and the repeaters on those 
interconnects, due to the proportional increase in the activity factor of the shared 
resources that replace the dedicated interconnects. For a wire sharing efficiency of 60%, a 






observed. Both Figure 5.2 and Figure 5.3 also plot the case where the WPM technique can 
be applied to all the interconnects, i.e. 100% wire sharing efficiency. One can get more 
than 30% reduction in the required wire area for around 8% increase in the dynamic 
power. There is no increase in the core area of the system in spite of the increase in the 
transistor area as the core area is wire-limited and there is sufficient white space to insert 





























[Figure 5.2 plot: curves for 100%, 60%, 50%, 40%, 30% and 20% wire sharing efficiency; 
40 million transistor system, die area = 1.2 cm², operating frequency = 1.3 GHz] 
 
Figure 5.2: Variation of reduction in the required wire area with different cutoff lengths 


































[Figure 5.3 plot: curves for 100%, 60%, 50%, 40%, 30% and 20% wire sharing efficiency; 
40 million transistor system, die area = 1.2 cm², operating frequency = 1.3 GHz] 
 
Figure 5.3:  Variation of percent increase in the dynamic power with different cutoff 
lengths for different wire sharing efficiencies. 
 
  Though the benefits of using the WPM routing technique are obvious, the 
interconnects designed using the delay constraint in equation (2.6) may require some 
redesign at the RTL stage to account for the increase in latency. Depending on the design 
approach, these changes may or may not be trivial. If the changes are going to be non-
trivial, then it may be advantageous to apply WPM only to those interconnects that 
satisfy the delay constraint given by equation (2.5). Figure 5.4 shows the reduction in 
wire area when WPM routing is applied to only SSWPM interconnects, and SSWPM and 
DSWPM interconnects. Here, the WPM routing technique is applied to the interconnects 
routed on tier 2 and above. Since all interconnects may not satisfy the proximity 






there is larger reduction in wire area and more increase in power, when WPM routing is 
applied to both SSWPM and DSWPM interconnects. 
 A significant fraction of interconnects on a tier satisfy the delay constraint given 
in equation (2.5). Close to 15% reduction in the required wire area is obtained for a wire 
sharing efficiency of 60% when WPM routing is applied to only SSWPM interconnects. 
The increase in power due to application of WPM routing technique is shown in Figure 
5.5. For a wire sharing efficiency of 60% there is more than 4% increase in power when 
SSWPM interconnects are designed using WPM routing. As can be seen from Figure 5.4 
and 5.5, there is not much difference between the reduction in wire area and increase in 





























[Figure 5.4 plot legend: Type-I interconnects; Type-I and Type-II interconnects] 
 
Figure 5.4: Percent reduction in wire area when WPM routing is applied to interconnects 






























[Figure 5.5 plot legend: Type-I interconnects; Type-I and Type-II interconnects] 
 
Figure 5.5: Percent increase in dynamic power when WPM routing is applied to 
interconnects satisfying delay constraint in equation (2.5). 
 
 However, it is highly unlikely that the wire sharing technique can be applied to all the 
interconnects, owing to physical layout constraints. If the WPM design approach is 
incorporated in the CAD layout algorithms, then it might be possible to gain the maximum 
reduction in the number of metal layers for a small increase in power. Such a WPM-aware 
placement strategy is discussed in Chapter 7. 
  
5.4. Design of power-centric WPM methodology 
 In case of the power-centric design methodology, WPM routing is primarily used 
to maintain or reduce the total power dissipated by a digital system. If the number of 






wire spacing or reduce core area, to reduce power dissipation. Two design methodologies, 
core-area-centric low power design and wire-coupling-centric low power design, are 
generated. 
 To understand the power-centric design methodology, consider a digital logic 
core consisting of 40M CMOS transistors and designed using 100nm technology 
parameters with a 3-input (six transistors) NAND gate chosen to represent the average 
standard gate. Copper and low-k (εr = 2.0) dielectric material are used to design the 
multilevel interconnect stack. A suboptimal number of repeaters are inserted on the 
interconnects as [12] shows that inserting 50% of Bakoglu’s optimal number of repeaters 
[10] results in only 10% performance penalty. Repeater sizing is assumed to be 
suboptimal because Bakoglu’s expression [10] for optimal sizing of repeaters 
overestimates the required transistor size [13]. A repeater whose size is 43% smaller than 
the optimal size yields only 6.8% delay penalty. Repeaters are inserted beginning from 
the topmost tier and are successively inserted on the lower tiers based on the availability 
of silicon area. The core area for the system is optimized so that an even number of metal 
levels is used for routing interconnects and the transistor area is less than 60% of the 
core area.  
 The logic core is simulated using the multilevel interconnect network design 
simulator described in Chapter 4. For the case study, the core is assumed to operate at 
1.3 GHz, the number of metal levels is fixed at 8 and a wiring efficiency of 40% is assumed. 
A core area of 1.125 cm² is required to completely utilize the available routing tracks in 8 






Hoxide (Dielectric thickness / Wire width) = 1.0 are used across all metal levels for the 
conventional design. 
 
5.4.1. Core-area-centric low power design 
 The application of WPM routing reduces the number of interconnects that need to 
be routed and frees up some routing channels on the metal levels. For the first design 
type, the wire-limited core area is reduced to fill up all the metal levels. This reduction in 
core area reduces the interconnect lengths and interconnect dimensions, which in turn 
reduces the interconnect switching capacitance and facilitates the use of smaller drivers 
and receivers. As a result, a reduction in the total power dissipated by the system is 
observed. Figure 5.6 shows the variation of percent reduction in power dissipation and 











































[Figure 5.6 plot: percent reduction in power and percent reduction in macrocell area; 
40M transistor logic core, 8 metal levels, 1.3 GHz clock frequency, Spwire = Hoxide = 1.0] 
 
Figure 5.6: Percent reduction in power dissipation and core area varying with cutoff 
length. 
 
 As can be seen from Figure 5.6, close to 30% reduction in power dissipation and 
close to 60% reduction in the core area can be obtained for a cutoff 
length of 0.08215 cm. As the cutoff length increases, the percent reduction in power 
dissipation and core area decreases. Table 5.3 shows the values for the various interconnect 
parameters for the conventional system and the WPM system using core area reduction. It 
can be observed from the table that there is significant reduction in the interconnect 
dimensions and repeater sizes after application of WPM routing. As a result, there is 
significant reduction in power dissipation. Figure 5.7 shows the reduction in the 
individual components of power. The interconnect network is designed for 100 nm 
technology, so as expected the dynamic power is more than leakage power which is more 






dynamic power, 30.58% reduction in leakage power and 34.72% reduction in the 
short-circuit power can be obtained by using WPM routing for a cutoff length of 0.08215 
cm. 
Table 5.3: Multilevel interconnect network design parameters for a conventional system 
and a WPM system. 






Length of the 
longest 
interconnect on 













Conventional system: 40M transistors; Suboptimal number of repeaters, Core area = 1.125 cm2;  
Operating frequency = 1.3 Ghz; Number of metal levels = 8. 
4 2 2.1213 0.718 0.718 34844 120.60 
3 2 0.7125 0.235 0.235 297848 41.73 
2 2 0.2840 0.138 0.138 0 0 




WPM system: 40M transistors; Suboptimal number of repeaters, Core area = 0.486 cm2;  
Operating frequency = 1.3 Ghz; Number of metal levels = 8; Cutoff length = 0.08215 cm. 
4 2 1.3940 0.448 0.448 34643 79.45 
3 2 0.3986 0.133 0.133 395272 23.73 
2 2 0.0957 0.100 0.100 0 0 
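
The core areas quoted in Table 5.3 can be cross-checked against the reduction in core area and the longest interconnect on the top tier. This is only a consistency sketch with hypothetical helper names; it assumes a square core whose longest wire runs corner to corner along a Manhattan route.

```python
import math

CONV_AREA_CM2 = 1.125   # conventional system, Table 5.3
WPM_AREA_CM2 = 0.486    # WPM system, Table 5.3


def percent_reduction(new, ref):
    return 100.0 * (ref - new) / ref


def corner_to_corner_cm(area_cm2):
    """Twice the edge of a square core of the given area."""
    return 2.0 * math.sqrt(area_cm2)


area_cut = percent_reduction(WPM_AREA_CM2, CONV_AREA_CM2)
top_conv = corner_to_corner_cm(CONV_AREA_CM2)
top_wpm = corner_to_corner_cm(WPM_AREA_CM2)
print(f"core area reduction = {area_cut:.1f}%")
print(f"top-tier length: {top_conv:.4f} cm -> {top_wpm:.4f} cm")
```

The computed lengths agree with the tier 4 entries of the table (2.1213 cm and 1.3940 cm) to within rounding, and the area reduction of about 57% matches the "close to 60%" read from Figure 5.6.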

































5.4.2. Wire-coupling-centric low power design 
  In the second design approach, instead of reducing the core area, the wire spacing 
between the interconnects is increased to fill up all the metal levels. The dielectric 
thickness is increased in proportion to the wire spacing to maintain a constant crosstalk 
ratio. An increase in wire spacing decreases the coupling capacitance between the 
interconnects, which in turn can be used to reduce wire thickness and driver sizes. As 
discussed previously, this results in a decrease in both dynamic and static power 
dissipation. 
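
Why scaling the dielectric thickness together with the wire spacing keeps the crosstalk ratio constant can be seen from a simple parallel-plate picture. The sketch below is an illustrative model only: it ignores fringing fields, takes the crosstalk ratio to be the coupling capacitance divided by the total capacitance (an assumption of this example), and uses hypothetical function names. The scaling factor 1.82113 is the value reported later in this section for the WPM system.

```python
def capacitances(width, thickness, spacing, h_oxide, eps=1.0):
    """Parallel-plate estimates per unit length (arbitrary units).

    c_ground: wire to the conducting plane across the dielectric.
    c_couple: wire to one same-level neighbour.
    Fringing fields are ignored; this is an illustrative model only.
    """
    c_ground = eps * width / h_oxide
    c_couple = eps * thickness / spacing
    return c_ground, c_couple


def crosstalk_ratio(width, thickness, spacing, h_oxide):
    """Coupling capacitance as a fraction of the total (assumed definition)."""
    c_ground, c_couple = capacitances(width, thickness, spacing, h_oxide)
    return c_couple / (c_couple + c_ground)


k = 1.82113  # spacing and dielectric scaling reported for the WPM system
base = crosstalk_ratio(1.0, 1.0, 1.0, 1.0)
scaled = crosstalk_ratio(1.0, 1.0, k, k)
print(base, scaled)
```

In this model both capacitance components shrink by the same factor when the spacing and the dielectric thickness are scaled together, so their ratio is unchanged while the total switched capacitance falls.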
 After application of WPM routing, the entire multilevel interconnect architecture 
of the logic core is redesigned with increased wire spacing; however, it is constrained by 
the same core area, operating frequency and metal level count as the conventional design. 
Figure 5.8 shows the variation of power reduction and wire spacing increase for the 40M 
transistor logic core with cutoff lengths. Here too, the cutoff length is the lower limit of 
the range of interconnect lengths to which the WPM routing technique is applied. It is 
assumed that 60% of all wires greater than the cutoff length have been implemented 
using WPM routing. As can be seen from Figure 5.8, close to 30% reduction in the power 
dissipation can be obtained for more than 80% increase in the wire spacing for a cutoff 
length of 0.08215 cm. As the cutoff length increases, the percent reduction in power and 












































[Figure 5.8 plot: percent reduction in power and percent increase in wire spacing; 
40M transistor logic core, 8 metal levels, 1.3 GHz clock frequency, die area = 1.125 cm²] 
 
Figure 5.8: Percent reduction in power and percent increase in wire spacing for a 
power-centric design using increased wire spacing. 
 
 Table 5.4 shows the various interconnect parameter values required for designing 
the conventional system with 40M transistors, and the corresponding WPM system with 
increased wire spacing and dielectric thickness. For this WPM system, it is possible to 
increase the Spwire to 1.82113 in all metal levels by using WPM routing. Hoxide is also 
increased to 1.82113 to maintain conventional crosstalk constraints. In addition to 
increasing the wire spacing, the wire width is optimized to satisfy the performance 
requirements. This facilitates the use of smaller drivers and receivers. All this contributes 
to a reduction in the total power dissipated by the system. It can be seen from Table 5.4 
that there is almost 30% reduction in the total power of the system. Figure 5.9 shows the 
power components of a conventional system and the corresponding WPM system with 






routing reduces all three components of power. The dynamic power reduces by 26.75%, 
leakage power by 26.96% and the short circuit power by 31.48% after application of 
WPM routing. 
 
Table 5.4: Multilevel interconnect network design parameters for a conventional system 
and a WPM system. 






Length of the 
longest 
interconnect on 













Conventional system: 40M transistors; Suboptimal number of repeaters; Core area = 1.125 cm²;  
Operating frequency = 1.3 GHz; Number of metal levels = 8. 
4 2 2.1213 0.718 0.718 34844 120.60 
3 2 0.7125 0.235 0.235 297848 41.73 
2 2 0.2840 0.138 0.138 0 0 




WPM system: 40M transistors; Suboptimal number of repeaters; Core area = 1.125 cm²;  
Operating frequency = 1.3 GHz; Number of metal levels = 8; Cutoff length = 0.08215 cm. 
4 2 2.1213 0.564 1.03 30749 83.67 
3 2 0.6487 0.177 0.322 311069 26.30 
2 2 0.1975 0.100 0.182 0 0 
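
As a quick consistency check on Table 5.4, the sketch below recomputes the spacing-to-width ratio per tier and compares it with the Spwire value of 1.82113 quoted in the text. It assumes that the fourth and fifth entries of each row are wire width and wire spacing in the same unit; that column reading and all variable names are illustrative assumptions, not part of the original table.

```python
# Rows copied from Table 5.4: tier -> (width, spacing).
# Assumption: columns 4 and 5 of each row are wire width and wire spacing.
conventional_rows = {4: (0.718, 0.718), 3: (0.235, 0.235), 2: (0.138, 0.138)}
wpm_rows = {4: (0.564, 1.03), 3: (0.177, 0.322), 2: (0.100, 0.182)}

SPWIRE_QUOTED = 1.82113  # value stated in the text for the WPM system


def spacing_ratio(rows):
    """Return spacing/width for each tier."""
    return {tier: spacing / width for tier, (width, spacing) in rows.items()}


conv_ratio = spacing_ratio(conventional_rows)
wpm_ratio = spacing_ratio(wpm_rows)

for tier in sorted(wpm_ratio, reverse=True):
    print(f"tier {tier}: conventional {conv_ratio[tier]:.3f}, WPM {wpm_ratio[tier]:.3f}")
```

Under this reading, every WPM tier lands within rounding distance of the quoted ratio, which corresponds to the "more than 80%" spacing increase reported for Figure 5.8.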































Figure 5.9: Total power dissipation in a conventional and a WPM system with wire 
spacing and dielectric thickness increase. 
 
 
 Although it is recommended to increase the dielectric thickness in the same ratio as 
the wire spacing to maintain a constant crosstalk ratio, this is not always possible 
due to manufacturing constraints. In such a case, the dielectric thickness is increased to the 
maximum value permitted by the manufacturing limits. Because only the mutual 
capacitance is reduced, and the ground capacitance is not reduced in the same proportion, 
slightly larger drivers and receivers are required. In spite of this, a significant reduction in 
power dissipation can be observed. 
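
The effect of scaling wire spacing with and without the dielectric thickness can be illustrated with a first-order parallel-plate estimate. This is only a sketch: the thesis extracts capacitance with RAPHAEL, whereas the model below ignores fringing fields, and the definition of the crosstalk ratio as coupling / (coupling + ground) is an assumption made here for illustration.

```python
def capacitances(width, thickness, spacing, dielectric_height, eps=1.0):
    """First-order parallel-plate capacitance per unit length (fringing ignored).

    Returns (coupling, ground) for a wire with two lateral neighbours and
    ground planes above and below. All dimensions are in the same unit.
    """
    coupling = 2.0 * eps * thickness / spacing
    ground = 2.0 * eps * width / dielectric_height
    return coupling, ground


def crosstalk_ratio(coupling, ground):
    """Assumed definition: share of the total capacitance that is coupling."""
    return coupling / (coupling + ground)


k = 1.8  # illustrative scale factor for spacing (and optionally dielectric height)
base = capacitances(1.0, 1.0, 1.0, 1.0)
both_scaled = capacitances(1.0, 1.0, k, k)
spacing_only = capacitances(1.0, 1.0, k, 1.0)

for name, caps in [("base", base), ("both scaled", both_scaled), ("spacing only", spacing_only)]:
    print(f"{name:13s} total={sum(caps):.3f} crosstalk ratio={crosstalk_ratio(*caps):.3f}")
```

In this simplified model, scaling both dimensions keeps the crosstalk ratio fixed while lowering the total capacitance; scaling the spacing alone lowers the crosstalk ratio but leaves the ground capacitance untouched, so the total capacitance falls less. This matches the qualitative argument that somewhat larger drivers are needed when the dielectric thickness cannot follow the spacing.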
 As can be observed in this section, a higher reduction in power is obtained for the 
core area reduction case as compared to the power reduction obtained when wire spacing 
and dielectric thickness are increased. This higher reduction in power can be attributed to 






compared to reduction in the interconnect switching capacitance due to increase in wire 
spacing. Table 5.5 shows a comparison of the interconnect capacitance determined using 
RAPHAEL for the longest interconnect length of the three different designs. As can be 
seen from the table, the interconnect capacitance for the longest interconnect is less for 
the core area-centric design as compared to the wire spacing-centric design. The smaller 
switching capacitance also enables usage of smaller drivers for core area-centric low 
power design. As a result, one can observe more reduction in power dissipation for the 
core area-centric low power design as compared to the wire spacing-centric low power 
design. 
 
Table 5.5: Interconnect capacitance for the longest interconnect in conventional and WPM 
systems. 









Conventional                2.12   1.0     1.1e-12   2.34e-12 
WPM – Core area-centric     1.394  1.0     1.1e-12   1.53e-12 
WPM – Wire spacing-centric  2.12   1.8211  7.73e-13  1.64e-12 
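
The rows of Table 5.5 can be cross-checked with a short calculation. The sketch assumes the four numeric columns are, in order, interconnect length (cm), normalized wire spacing, capacitance per unit length (F/cm) and total capacitance (F); this column reading is an inference from the values, not something stated in the table itself.

```python
# Rows copied from Table 5.5.
# Assumed column meaning: (length_cm, normalized_spacing, cap_per_cm, total_cap).
rows = {
    "Conventional": (2.12, 1.0, 1.1e-12, 2.34e-12),
    "WPM core-area-centric": (1.394, 1.0, 1.1e-12, 1.53e-12),
    "WPM wire-spacing-centric": (2.12, 1.8211, 7.73e-13, 1.64e-12),
}


def relative_error(length_cm, cap_per_cm, total_cap):
    """Relative gap between length * per-unit-length capacitance and the tabulated total."""
    return abs(length_cm * cap_per_cm - total_cap) / total_cap


errors = {name: relative_error(l, c, t) for name, (l, _, c, t) in rows.items()}
baseline_total = rows["Conventional"][3]
reduction_pct = {
    name: 100.0 * (baseline_total - t) / baseline_total
    for name, (_, _, _, t) in rows.items()
}

for name in rows:
    print(f"{name:26s} mismatch={errors[name]:.4f} reduction={reduction_pct[name]:.1f}%")
```

Under this reading the product of length and per-unit-length capacitance reproduces the last column to within rounding, and the core-area-centric design shows the larger drop in total capacitance because it shortens the wire rather than lowering its capacitance per unit length.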
 
5.5. Design of performance-centric WPM methodology 
 As described in Section 5.3, the application of WPM routing reduces the number 
of interconnects that need to be routed. The core-area-centric design or wire-spacing-centric 
design described in Section 5.4 can be used to improve system performance. The system 
performance depends on the delay of the longest interconnect or the critical path delay, 
whichever is larger. For the same interconnect dimensions as the conventional design, 






spacing can be increased to fill up the metal levels. As a result there is a reduction in 
switching capacitance (interconnect and transistor) which reduces interconnect delay and 
critical path delay. This can be used to improve system performance. However, since the 
interconnect dimensions do not change, the performance-centric approach becomes a 
semi-custom design approach. A detailed description of a performance-centric design for a 
semi-custom system is given in Section 6.4.  
 
5.6. Summary 
 A design methodology to generate a fully customized multilevel interconnect 
network using HR-MINDS is presented in this chapter. It is demonstrated that a pervasive 
application of WPM routing across the entire multilevel interconnect network can 
provide significant advantages in terms of interconnect area and power. Using 
HR-MINDS, it is possible to formulate a wire-area-centric WPM design, a power-centric WPM 
design or a performance-centric WPM design. In the case of a wire-area-centric WPM design, 
the application of WPM routing enables a significant reduction in metal area for a small 
increase in power. It is demonstrated that for a 40 million transistor system a 15% 
reduction in the interconnect area can be obtained for a 4% increase in the power 
dissipation. In the case of a power-centric design, for the core area-centric low power design, 
where the core area is reduced to utilize all routing channels, a reduction of close to 30% in 
power dissipation (dynamic + leakage) can be observed. In addition, the core area can 
also be reduced by almost 60%. Thus, this creates a win-win situation in terms of power 
and area. On the other hand, in case of the wire-coupling-centric design, 30% reduction 






routing channels, there is significant reduction in capacitive crosstalk that enables 
reduction in power. The performance-centric design approach necessitates keeping the 
interconnect dimensions fixed after applying WPM routing. Hence, the design flow 
changes to that of a semi-custom design. A detailed description of the performance-













With the ASIC design market having experienced significant growth in the past decade, it 
is imperative to explore the limits and opportunities of applying WPM routing to ASIC 
products. ASIC design adopts a semi-custom design flow for designing the 
multilevel interconnect network. The design flow adopted for developing a semi-custom 
multilevel interconnect network using HR-MINDS is discussed in Section 6.2 of this 
chapter. Using this semi-custom design approach, it is possible to develop a power-centric 
WPM design methodology and a performance-centric WPM design methodology. In the 
power-centric approach, WPM routing makes it possible to reduce the power dissipation of a VLSI 
system by reducing the core area or increasing the wire spacing. A detailed discussion of 
these two low power approaches is presented in Section 6.3. Section 6.4 discusses the 






centric design, significant improvement in performance of a VLSI system can be 
obtained using WPM routing. 
 
6.2. Design methodology for semi-custom WPM n-tier 
interconnect network 
 
 A semi-custom interconnect network design is driven by the interconnect 
dimensions specified by the foundry. For a semi-custom digital system, the interconnect 
dimensions and the number of metal levels are fixed. As a result, the core area is 
completely dependent on the number of interconnects that need to be routed for that core. 
A semi-custom interconnect network design is a core area-centric design by default as the 
core area needs to be appropriately chosen to ensure there is enough routing space for all 
the interconnects and at the same time there is a complete utilization of the available 
routing area on all the metal levels. The maximum operating frequency of a digital 
system depends on the delay of the longest interconnect on each tier or on the logic 
critical path delay, whichever is greater. The logic critical path delay is determined using 
the number of gates in the critical path and average wire length. For a semi-custom 
interconnect network design, HR-MINDS uses interconnect dimensions and number of 
gates as input. A bisection method is adopted to calculate the optimal core area required 
for the design. For the optimal core area all the available routing area is completely 
utilized. HR-MINDS performs a series of calculations for each tier and estimates the 
interconnect distribution on each tier, transistor area (logic + repeaters) and power 
(dynamic + leakage + short-circuit) dissipated by the system. Given the fact that the die 






for placing the logic transistors and repeaters. However, there may be cases where a core 
area may be transistor-limited and not wire-limited. In such a case the core area may have 
to be appropriately increased and it may result in an under-utilization of the routing area 
available in all the metal layers. 
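
The bisection step described above can be sketched as follows. This is a minimal illustration, not the HR-MINDS implementation: the supply and demand functions are hypothetical toy models (only the 8 metal levels and 40% wiring efficiency are taken from the case study in this chapter), and the search assumes that the routing slack changes sign exactly once inside the bracket.

```python
def bisect_core_area(wire_area_demand, routing_supply, lo, hi, tol=1e-6):
    """Find the core area at which demanded wire area equals available routing area.

    Assumes routing_supply(area) - wire_area_demand(area) is negative at `lo`,
    positive at `hi`, and crosses zero once in between.
    """
    def slack(area):
        return routing_supply(area) - wire_area_demand(area)

    if slack(lo) > 0 or slack(hi) < 0:
        raise ValueError("bracket does not contain the wire-limited optimum")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slack(mid) < 0:   # not enough routing room: the core must grow
            lo = mid
        else:                # spare routing room: the core can shrink
            hi = mid
    return 0.5 * (lo + hi)


# Toy models, hypothetical and for illustration only.
def toy_supply(area, metal_levels=8, wiring_efficiency=0.4):
    """Routing area offered by all metal levels at the given core area."""
    return wiring_efficiency * metal_levels * area


def toy_demand(area, k=3.0):
    """Wire lengths scale with the core side, so demand grows as sqrt(area)."""
    return k * area ** 0.5


optimal_area = bisect_core_area(toy_demand, toy_supply, lo=0.01, hi=10.0)
print(f"wire-limited core area: {optimal_area:.6f}")
```

For a transistor-limited core, one plausible extension is to take the larger of this wire-limited area and the area needed for the logic transistors and repeaters, which leaves part of the routing area unused as described above.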
 The basic design methodology for the semi-custom WPM interconnect network is 
the same as that for the full-custom WPM interconnect network. Using WPM routing, it 
is possible to adopt a power-centric design methodology or a performance-centric design 
methodology for a semi-custom interconnect network. In the power-centric design 
methodology, the power dissipation can be reduced without any loss of performance. 
As the number of interconnects that need to be routed decreases after the 
application of WPM routing, the core area can be reduced or the wire spacing can be 
increased to fill up the freed-up routing area, which reduces the total switching 
capacitance of the system. This reduction in switching capacitance enables the use of 
smaller drivers and receivers, which reduces the total dynamic power and static power of the 
system. 
 WPM routing can also be used to improve performance by adopting the same 
strategy as for the power-centric design, i.e. a reduction in core area or an increase in wire 
spacing. The reduction in switching capacitance due to core area reduction or wire 
spacing increase reduces the interconnect delay and the critical path delay. This reduction 
in delay improves the performance of the system. However, this improvement in 








6.3. Design of power-centric methodology 
 A power-centric WPM design methodology involves reducing the core area or 
increasing the wire spacing to reduce power dissipation while ensuring 
complete utilization of all the metal levels. A core-area-centric low power design 
methodology and a wire-coupling-centric low power design methodology are formulated 
to quantify the advantages of WPM routing. 
 In order to quantify the power reduction using the WPM routing technique, a 40M 
transistor ASIC logic core is simulated using HR-MINDS. The core is designed using 
100 nm technology parameters with a 3-input (six transistors) NAND gate chosen to 
represent the average standard gate. Copper and low-k (εr = 2.0) dielectric material are 
used to design the multilevel interconnect stack. A suboptimal number of repeaters is 
inserted on the interconnects. Repeater sizing is also assumed to be suboptimal. HSPICE 
and RAPHAEL are used to model interconnect transients. For the case study, the core is 
assumed to have 4 tiers (each tier has two metal layers). In the multilevel interconnect 
network, the interconnect pitch is doubled for every successive pair of metal levels as 
recommended in [55]. Hence, the four tiers are assumed to have interconnect pitches of 
2F, 4F, 8F and 16F, where F is the feature size. A wiring efficiency of 40% is assumed 
[55], and wire spacing, metal thickness and dielectric thickness are assumed to be equal to 
wire width across all metal levels. The operating frequency of the ASIC core is given by 
the minimum of the following two values: the inverse of the critical path delay and the inverse of 
the delay of the longest wire. For the case study, the ASIC core is operated at 75% of the 
maximum possible frequency. The remaining 25% provides the necessary guardband to 






 The application of the WPM routing technique reduces the interconnect count. Hence, the 
entire multilevel interconnect network is redesigned. While redesigning the multilevel 
interconnect network, it is ensured that the inverse of the delay of the longest 
interconnect and the inverse of the critical path delay are both greater than the operating 
frequency of the conventional design. In addition, it is also ensured that the core area 
remains wire-limited and does not become transistor-limited. Though it may be possible 
to apply the WPM routing technique to a large number of interconnects, it is not 
advantageous due to the resulting overhead circuitry. Hence, a cutoff length, which gives 
the lower limit of the range of interconnect lengths that use the WPM routing technique, is 
considered. In addition, to account for the proximity constraints described in Section 7.2, it is 
assumed that 60% of all the wires above the cutoff length have been designed using WPM 
routing. 
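
The frequency rule and the redesign check used in this case study can be written as a short sketch. The function names and the delay values are hypothetical; only the min-of-inverse-delays rule, the 75% operating point and the requirement that both inverse delays exceed the conventional operating frequency come from the text.

```python
def max_frequency(critical_path_delay_s, longest_wire_delay_s):
    """Maximum frequency is set by the slower of the two paths."""
    return min(1.0 / critical_path_delay_s, 1.0 / longest_wire_delay_s)


def operating_frequency(critical_path_delay_s, longest_wire_delay_s, derate=0.75):
    """Operate at a fraction of the maximum frequency to leave a guardband."""
    return derate * max_frequency(critical_path_delay_s, longest_wire_delay_s)


def meets_conventional_frequency(critical_path_delay_s, longest_wire_delay_s, f_conventional_hz):
    """Redesign check: both inverse delays must exceed the conventional frequency."""
    return (1.0 / critical_path_delay_s > f_conventional_hz
            and 1.0 / longest_wire_delay_s > f_conventional_hz)


# Hypothetical delays, chosen so that the derated frequency comes out at 1.5 GHz.
f_op = operating_frequency(critical_path_delay_s=0.4e-9, longest_wire_delay_s=0.5e-9)
print(f"operating frequency: {f_op / 1e9:.2f} GHz")
print(meets_conventional_frequency(0.4e-9, 0.5e-9, 1.5e9))
print(meets_conventional_frequency(0.4e-9, 0.8e-9, 1.5e9))
```

In the example the longest wire, not the critical path, sets the maximum frequency; lengthening its delay to 0.8 ns makes the redesign check fail.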
 
6.3.1. Core-area-centric low power design 
 In case of the core-area-centric low power design, the core area is optimized to fill 
up the free routing area created due to application of WPM routing. The wire spacing and 
wire thickness are maintained equal to the wire width across all metal levels. Though the 
reduction in the core area does not reduce interconnect capacitance per unit length, it 
reduces the interconnect lengths, which decreases the total interconnect switching 
capacitance. In addition, the reduction in wire lengths makes it possible to use a smaller 
number of repeaters, which reduces the device switching capacitance. There is no 
reduction in the repeater sizes as Bakoglu’s expressions [10] for optimal number of 






core is observed due to a reduction in the switching capacitance. In addition, the 
reduction in repeater count also decreases the leakage power and short-circuit power of 
the core. 
 Figure 6.1 shows the percent reduction in the total power dissipation and core area 
for various cutoff lengths, when the core designed using WPM routing is operated at the 
same frequency as the conventional design. A reduction of more than 14% in total power and 
almost 28% in core area is observed for a cutoff length of 0.1 cm. As the cutoff 
length increases, the percent reduction in total power decreases and the core area increases. 
Thus the core area-centric design creates a win-win situation where it is possible to 





























[Figure plot. Axes: percent reduction in power and percent reduction in core area. Annotations: 40 million transistor core; Spwire = 1.0; 1.5 GHz operating frequency.]
 







 Figure 6.2 shows the power dissipation of the individual components of power. A 
significant reduction in all the three components of power is observed. The dynamic 
power reduces by 14.21%, leakage power by 13.09% and short-circuit power by 19.64%. 
Table 6.1 shows the interconnect design parameters for the conventional design and the 
low power design using core area reduction. It can be observed from the table that the 
repeater sizes for the interconnects for both the conventional design and the low power 
design are the same. However, the number of repeaters required for the low power design 
is less than that for the conventional design. In addition, a significant reduction in power 


































Table 6.1: Multilevel interconnect network design parameters for a conventional design 
and a core area-centric design. 






Length of the 
longest 
interconnect 














Conventional design: 40M transistors; Suboptimal number of repeaters; Core area = 1.677609 cm²;  
Operating frequency = 1.5 GHz; Number of metal levels = 8 
4 2 2.5904 0.8 0.8 35152 136.88 
3 2 0.8492 0.4 0.4 140610 68.44 
2 2 0.4324 0.2 0.2 562439 34.22 




Core area-centric design: 40M transistors; Suboptimal number of repeaters; Core area = 1.208463 cm²;  
Operating frequency = 1.5 GHz; Number of metal levels = 8; Cutoff length = 0.1003 cm 
4 2 2.1985 0.8 0.8 25322 136.88 
3 2 0.6631 0.4 0.4 101288 68.44 
2 2 0.3022 0.2 0.2 405152 34.22 
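
The figures quoted for the core area-centric design can be reproduced from Table 6.1 with a few lines of arithmetic. The sketch assumes that the sixth entry of each row is the repeater count on that tier, a reading suggested by the discussion of repeater numbers in the text; the dictionary layout and names are illustrative.

```python
# Values copied from Table 6.1.
# Assumption: the sixth entry of each row is the number of repeaters on that tier.
conventional = {
    "core_area_cm2": 1.677609,
    "repeaters": {4: 35152, 3: 140610, 2: 562439},
}
core_area_centric = {
    "core_area_cm2": 1.208463,
    "repeaters": {4: 25322, 3: 101288, 2: 405152},
}


def percent_reduction(before, after):
    return 100.0 * (before - after) / before


area_reduction = percent_reduction(
    conventional["core_area_cm2"], core_area_centric["core_area_cm2"]
)
repeater_reduction = {
    tier: percent_reduction(conventional["repeaters"][tier], core_area_centric["repeaters"][tier])
    for tier in conventional["repeaters"]
}

print(f"core area reduction: {area_reduction:.2f}%")
for tier, value in sorted(repeater_reduction.items(), reverse=True):
    print(f"tier {tier} repeater count reduction: {value:.2f}%")
```

The core area reduction comes out just under 28%, in line with the value read from Figure 6.1, and under the assumed column reading the repeater count on each tier falls by nearly the same percentage.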







6.3.2. Wire-coupling-centric low power design 
 In a wire-coupling-centric design methodology, the spacing between all wires is 
increased to fill up all metal levels. The increase in wire spacing decreases the coupling 
capacitance of the interconnects, which reduces the interconnect power. This decrease in 
interconnect capacitance also enables the use of smaller drivers and receivers. This 
reduces the total power (static + dynamic) of the devices. A 40M transistor logic core 
similar to the one in Section 6.3 is simulated to determine the advantages of WPM 
routing in a wire coupling-centric design. In this case, for a fixed core area after 
application of WPM routing the wire spacing is proportionately increased to utilize all 
available wire area. 
 Figure 6.3 shows the percent reduction in power and percent increase in wire 
spacing versus cutoff lengths, when the wire-coupling-centric design is operated at the 






between the wires can be increased by more than 35% to fill up all metal levels. This 
reduces the interconnect coupling capacitance by more than 20%. This helps reduce delay 
variations and power, and improve performance. More than 6% reduction in the total 
power can be observed for a cutoff length of 0.1 cm. The reduction in power decreases as 
the cutoff length increases.  
 In a variation of the wire coupling-centric design methodology, the 
inter-level dielectric thickness is increased in the same proportion as the wire spacing to 
maintain a constant crosstalk ratio. This increase in the inter-level dielectric thickness, in addition to 
the wire spacing, further reduces the total switching capacitance and driver/receiver sizes, 
which reduces power dissipation. Here too, a 40M transistor system is designed using 
HR-MINDS. Figure 6.4 shows the percent reduction in the power dissipation and the 
corresponding percent increase in the wire spacing and dielectric thickness. As expected, 
a higher reduction in power dissipation is observed when the inter-level dielectric 
thickness is increased in the same ratio as the wire spacing. A power reduction of close to 
16% can be obtained for a cutoff length of 0.1 cm, and the wire spacing/inter-level 
dielectric thickness has to be increased by more than 35% to fill up all the metal levels. 
As the cutoff length increases, the number of interconnects that can be designed using 
WPM routing decreases and the percent reduction in power decreases. 
 In the case of the wire coupling-centric design, there is no change in the core area. 
However, there is a reduction in the total area occupied by the transistors. As a result, there 
is free silicon area available to insert decoupling capacitors, which can reduce power 






between neighboring interconnects. Thus, the wire coupling centric design can not only 

































[Figure 6.3 plot. Axes: percent reduction in power and percent increase in wire spacing. Annotations: 40 million transistor core; 1.5 GHz operating frequency; core area = 1.677 sq cm.]
 
Figure 6.3: Power reduction in a wire coupling-centric design. (No change in the inter-







































[Figure 6.4 plot. Axes: percent reduction in power and percent increase in wire spacing and inter-level dielectric thickness.]
 
Figure 6.4: Power reduction in a wire coupling-centric design. 
 
 Figure 6.5 shows the power components of the conventional design and wire-
coupling-centric (both cases) design for a cutoff length of 0.1 cm. It can be observed 
from Figure 6.5 that the application of WPM routing significantly reduces all three 
components of power. 
 Table 6.2 shows the various interconnect design parameters for the conventional 
design and the wire-coupling-centric (both cases) low power design. In both low power 
designs, the total wire switching capacitance is less than in the conventional design. In the 
wire-coupling-centric design, fewer and smaller repeaters are required. It can 
be observed from Table 6.2 that both low power designs require a smaller number of 





















Wire coupling-centric design (Hoxide = 1.0)










Figure 6.5: Power components in conventional and wire coupling-centric designs. 
 
Table 6.2: Multilevel interconnect network design parameters for a conventional design 
and wire coupling-centric design (both cases). 






Length of the 
longest 
interconnect 














Conventional design: 40M transistors; Suboptimal number of repeaters; Core area = 1.677609 cm²;  
Operating frequency = 1.5 GHz; Number of metal levels = 8 
4 2 2.5904 0.8 0.8 35152 136.88 
3 2 0.8492 0.4 0.4 140610 68.44 
2 2 0.4324 0.2 0.2 562439 34.22 




Wire coupling-centric design: 40M transistors; Suboptimal number of repeaters; Core area = 1.677609 cm²; 
Increase in wire spacing only, no change in inter-level dielectric 
Operating frequency = 1.5 GHz; Number of metal levels = 8; Cutoff length = 0.1003 cm. 
4 2 2.5904 0.8 1.085 28778 132.03 
3 2 0.7812 0.4 0.542 115113 66.01 
2 2 0.3561 0.2 0.271 460451 33.00 




Wire coupling-centric design: 40M transistors; Suboptimal number of repeaters; Core area = 1.677609 cm²; 
 Wire spacing and inter-level dielectric increased in same proportion 
Operating frequency = 1.5 GHz; Number of metal levels = 8; Cutoff length = 0.1003 cm. 
4 2 2.5904 0.8 1.085 27127 124.46 
3 2 0.7812 0.4 0.542 108510 62.23 
2 2 0.3561 0.2 0.271 434039 31.11 
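
The trends in Table 6.2 can be summarized numerically. The sketch below assumes the last four entries of each row are wire width, wire spacing, repeater count and repeater size; this is an inferred reading of the columns, and the helper names are illustrative.

```python
# tier -> (width, spacing, repeater_count, repeater_size), copied from Table 6.2.
# Assumption: the last four entries of each row carry these meanings.
conventional = {
    4: (0.8, 0.8, 35152, 136.88),
    3: (0.4, 0.4, 140610, 68.44),
    2: (0.2, 0.2, 562439, 34.22),
}
spacing_only = {
    4: (0.8, 1.085, 28778, 132.03),
    3: (0.4, 0.542, 115113, 66.01),
    2: (0.2, 0.271, 460451, 33.00),
}
spacing_and_dielectric = {
    4: (0.8, 1.085, 27127, 124.46),
    3: (0.4, 0.542, 108510, 62.23),
    2: (0.2, 0.271, 434039, 31.11),
}


def percent_change(before, after):
    return 100.0 * (after - before) / before


def summarize(design):
    """Per-tier percent change of a design relative to the conventional one."""
    summary = {}
    for tier, (_, spacing, count, size) in design.items():
        _, base_spacing, base_count, base_size = conventional[tier]
        summary[tier] = {
            "spacing_increase": percent_change(base_spacing, spacing),
            "repeater_count_change": percent_change(base_count, count),
            "repeater_size_change": percent_change(base_size, size),
        }
    return summary


summary_spacing_only = summarize(spacing_only)
summary_both = summarize(spacing_and_dielectric)
for tier in (4, 3, 2):
    print(tier, summary_spacing_only[tier], summary_both[tier])
```

Under this reading, the wire spacing grows by a little over 35% on every tier in both cases, while the repeater count and size drop further when the inter-level dielectric is scaled together with the spacing.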









6.4. Design of performance-centric methodology 
 In addition to reduction in power dissipation, the WPM routing technique can also 
be used to improve the performance of the system. As explained in Section 6.3, the 
reduction in interconnect count enables a reduction in the core area and/or increase in 
wire spacing. This reduces the total switching capacitance of both interconnects and 
drivers/receivers, which reduces delay and improves performance of the system. Thus, a 
core-area-centric or wire-coupling-centric design methodology to improve system 
performance is developed. 
 
6.4.1. Core-area-centric high performance design 
 In order to quantify the possible performance improvements using WPM routing, 
similar to Section 6.3, a 40M transistor random logic core is designed using HR-MINDS. 
The design methodology adopted here is the same as that used for the power-centric 
design. Using WPM routing, the core area is reduced or wire spacing is increased to 
improve the system performance. As described earlier, the performance of the system is 
limited by the delay of the longest interconnect or the logic critical path delay, whichever 
is larger. 
 Figure 6.6 shows the percent improvement in performance as a function of cutoff 
length. In the case of core area reduction, the reduction in interconnect lengths reduces the 
interconnect delay, which results in performance improvement. However, as the cutoff 
length is reduced below a threshold value, the system performance becomes limited by the 
critical path. As a result, there is less improvement in performance beyond the 






limited space available to increase the size of the gates in the critical path. Hence, the 
critical path delay does not reduce in the same proportion as the core area. As a result, a 
smaller performance improvement of the system is observed. As can be observed from the 
figure, a maximum improvement in performance of 4% can be obtained for a threshold 
cutoff length of around 0.6 cm. The performance improvement is smaller for cutoff lengths 
above or below the threshold value.  
 Because the core area is reduced to improve performance, there is also an 
opportunity to reduce the total power dissipation as described in Section 6.3. Figure 6.7 
shows the reduction in power dissipation for a performance-centric design. The trend for 
the power reduction is a mirror image of the trend for performance improvement. For 
higher cutoff lengths, as the operating frequency is higher than that of the conventional 
design, there is an increase in power dissipation. The maximum increase in power dissipation 
occurs at the threshold cutoff length of 0.6 cm, where there is maximum 
improvement in performance. Close to a 7% increase in power dissipation is observed at a 
cutoff length of 0.6 cm. As the cutoff length reduces below 0.6 cm, the percent reduction 
in power increases. This is because the operating frequency reduces in addition to the 
reduction in the total switching capacitance of the system due to core area reduction. A 
15% reduction in power dissipation is observed at a cutoff length of 0.025 cm. At this 











































Figure 6.6: Percent improvement in performance for a performance-centric design using 





































6.4.2. Wire-coupling-centric high performance design 
 In the case of the wire-coupling-centric design, the reduction in coupling capacitance due 
to the increase in wire spacing and inter-level dielectric thickness reduces the 
interconnect delay, which in turn improves performance. The inter-level dielectric thickness and 
wire spacing are increased in the same proportion to maintain a constant crosstalk ratio. 
Here too, a 40M transistor logic core is simulated to study the performance improvement using 
WPM routing. Figure 6.8 shows the percent improvement in performance as a function of 
the cutoff length. The performance improvement increases steadily as the cutoff length is 
reduced, unlike the core-area-centric high performance design, where the performance 
improvement reduces as the cutoff length is reduced below the threshold cutoff length of 
0.6 cm. This is because for the wire-coupling-centric design the core area remains 
constant. Hence, even if the operating frequency becomes limited by the critical path, 
there is enough silicon area for large gates in the critical path. This can reduce the critical 
path delay and improve system performance. As can be seen from Figure 6.8, more than 
7% improvement in performance can be obtained for a cutoff length of 0.025 cm. As the 
cutoff length increases, the percent improvement in performance decreases. 
 As expected, the change in operating frequency results in a change in power 
dissipation of the system. Figure 6.9 shows the percent reduction in power varying with 
the cutoff length. For higher cutoff lengths, the performance is dependent on the delay of 
the longest interconnect, and hence there is an increase in the total power dissipated by 
the system due to the increase in operating frequency. As the cutoff length reduces below 
a threshold value, the system performance becomes limited by the critical path. In 






dielectric thickness is large, resulting in a low switching capacitance. As a result, there is 
a smaller increase in power dissipation. In fact, for a cutoff length of 0.025 cm 
there is more than 3% reduction in power dissipation. For cutoff lengths of less than 0.1 
cm a reduction in power is observed.  
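
The interplay between frequency and switching capacitance described above can be made explicit with the standard first-order expression for dynamic power. This is a simplified sketch: it assumes a fixed supply voltage and activity factor, and it ignores the leakage and short-circuit components that HR-MINDS also reports.

```latex
P_{\mathrm{dyn}} = \alpha \, C_{\mathrm{sw}} \, V_{dd}^{2} \, f
\quad\Longrightarrow\quad
\frac{P'_{\mathrm{dyn}}}{P_{\mathrm{dyn}}}
  = \left(1 + \frac{\Delta f}{f}\right)
    \left(1 + \frac{\Delta C_{\mathrm{sw}}}{C_{\mathrm{sw}}}\right)
```

Under these assumptions, a performance-centric design dissipates less dynamic power than the conventional design only when the fractional reduction in switching capacitance (negative \(\Delta C_{\mathrm{sw}}\)) outweighs the fractional increase in operating frequency, which is consistent with the sign change in power observed as the cutoff length is reduced.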
 Thus, the WPM routing technique can be used to generate a performance-centric design. 
A change in power dissipation is a by-product of the performance-centric design 
approach. A careful choice of cutoff lengths for WPM routing can provide significant 
















































































 The multilevel interconnect network of an ASIC design provides many 
opportunities to apply WPM routing. A semi-custom WPM design approach can be 
adopted for ASIC products, and it can provide significant advantages in terms of power 
and performance. This semi-custom design approach is discussed in this chapter. For a 
power-centric semi-custom WPM design, it is possible to adopt a core-area-centric or 
wire-coupling-centric approach. In the case of the core-area-centric low power design, close 
to 15% reduction in power can be obtained in addition to more than 25% reduction in 
core area using WPM routing. On the other hand, in case of the wire-coupling-centric 






20% reduction in interconnect capacitive coupling. For the performance-driven approach 
too a core-area-centric or a wire-coupling-centric design can be developed. Here, the 
performance of a system is determined using the logic critical path delay and delay of the 
longest interconnect on the chip. A performance improvement of 4% and 7% can be 
obtained using the core-area-centric high performance design and wire-coupling-centric 







PHYSICAL DESIGN LIMITS AND 






The third level in the hierarchical approach adopted to study WPM routing 
involves exploring the limits and opportunities of WPM routing at the physical design 
level. The application of WPM routing is driven by the physical placement of the sources 
and sinks, and by the routing of the interconnects in a circuit. Depending on the physical design 
of a circuit, certain dedicated interconnects can be identified and redesigned as WPM 
interconnects. While choosing which dedicated interconnects should be converted to 
WPM interconnects, it is imperative to ensure that there is minimal deviation from the 
existing routing solution. 
In order to perform a comprehensive analysis of any new design technique it is 






both current and future systems. In case of current systems, the existence of design data 
makes it easier to determine opportunities. On the other hand, an extrapolation 
methodology needs to be adopted in case of future systems. Even if a rigorous analysis is 
not possible, first order estimations on the limits and opportunities go a long way in 
understanding the new design technique better and adapting it to suit the future designs. 
Section 7.2 describes the source-sink proximity and run-length proximity constraints 
that are defined to minimize any overhead due to WPM routing. To determine the scope 
of WPM routing, benchmark circuits are placed using the GORDIAN placement algorithm 
[58] and simulated annealing. A detailed description of the GORDIAN placement 
algorithm and simulated annealing, and of their implementation, is given in Section 7.3. The 
advantages in terms of interconnect length, transistor area and power that are obtained 
using WPM routing for benchmark circuits are presented in Section 7.4. A description of 
the application of WPM routing to a SPARC64 processor is presented in Section 7.5. The 
SPARC64 processor is a high performance system, and the application of WPM routing 
can help the design in terms of wire area reduction and power reduction. To explore the 
opportunities in applying WPM routing to future design systems, a comparison between 
the WPM-based design approach and material solutions for future designs is performed. 











7.2. Design of source-sink proximity and run-length proximity 
constraints 
 
The possibility of using shared wiring resources instead of dedicated 
interconnects is determined based on the physical placement of a given source-sink pair, 
or based on the routing of the interconnects such that a given pair of interconnects have 
shared run length. Figure 7.1 shows two interconnects, A and B that have the sources and 
sinks close to each other. For application of the WPM technique, the sources should be at 
a distance less than ‘r’ from each other and the sinks should be at a distance less than ‘r’ 
from each other. Here the distance ‘r’ is proportional to the average of the two 
interconnect lengths and is chosen such that any required deviation in routing will have 
minimal impact on delay. If any source-sink pairs satisfy this physical constraint then, 
depending upon the existence of wire idleness, the two dedicated interconnects can be 
replaced by a single shared wiring resource with the insertion of WPM overhead 
circuitry. 
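
The source-sink proximity test can be sketched as below. The sketch assumes Manhattan distances and a proportionality constant `alpha` relating 'r' to the average interconnect length; both choices, along with the `Net` type, are hypothetical, since the text only states that 'r' is proportional to the average of the two lengths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Net:
    source: tuple  # (x, y) of the driver
    sink: tuple    # (x, y) of the receiver


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def net_length(net):
    """Manhattan length of a two-pin net (an assumed routing model)."""
    return manhattan(net.source, net.sink)


def satisfies_source_sink_proximity(a, b, alpha=0.1):
    """True if the sources are within r of each other and so are the sinks.

    r = alpha * (average of the two net lengths); alpha is a hypothetical constant.
    """
    r = alpha * 0.5 * (net_length(a) + net_length(b))
    return manhattan(a.source, b.source) < r and manhattan(a.sink, b.sink) < r


net_a = Net(source=(0, 0), sink=(100, 0))
net_b = Net(source=(2, 3), sink=(98, 4))   # close to net_a at both ends
net_c = Net(source=(0, 20), sink=(100, 20))  # parallel but too far away

print(satisfies_source_sink_proximity(net_a, net_b))
print(satisfies_source_sink_proximity(net_a, net_c))
```

A pair that passes this geometric test is only a candidate for sharing: as noted above, the replacement also depends on the existence of wire idleness.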
On the other hand, Figure 7.2 shows two interconnects, A and B, that have a shared 
run length and are at a distance less than ‘d’ from each other. Here, the distance ‘d’ is 
proportional to the length of the longer interconnect. In this case, one can replace the 
shorter interconnect A and part of the longer interconnect B by a shared wire. The data 
that was earlier transmitted over the dedicated wires will now be transmitted partially 
over the shared wire and partially over the dedicated wire. For this second physical 
constraint, the interconnects can also be of equal length. As long as they have some 














[Figure labels: "Set of sources", "Set of sinks"]



























In order to determine the limits and opportunities of applying WPM routing to 
current and future design technologies, it is imperative to place and route benchmark 
circuits and determine whether any source-sink or run-length proximity exists. The 
placement and routing algorithms can also be modified to force source-sink and 
run-length proximity among potential WPM interconnects. 
 
7.3. Placement problem formulation 
In order to determine the source-sink proximity, a two-step approach is adopted. 
The GORDIAN placement algorithm [58] is used to determine the initial placement 
solution. A trickle-down method is then adopted for compacting the solution obtained 
using GORDIAN placement. The output of the compaction algorithm is studied to 
determine the interconnects that satisfy the source-sink proximity constraint. Circuits 
in the Berkeley Logic Interchange Format (BLIF), with sizes varying from a few hundred 
to thousands of gates, are used as the sample circuits. After determining the WPM nets 
in the placement solution, further processing is done using simulated annealing to find 
additional opportunities for WPM routing. Figure 7.3 shows the steps that are 
followed in the placement strategy. 
 
7.3.1. GORDIAN placement algorithm 
 A detailed description of the GORDIAN algorithm for placement is presented in 
[58]. The GORDIAN algorithm uses an iterative approach, and each iteration consists of 
two steps: global optimization and partitioning. For the first iteration, the entire core area 






coordinates of the movable modules in the system are calculated. The region is then 
partitioned and each module is assigned to the appropriate partition based on its 
coordinates. This enables the formulation of new constraints for the global optimization 
step (one constraint for each partition). During each subsequent iteration of the GORDIAN 
algorithm, the placement coordinates for the movable cells in each region are calculated 
using global optimization. This is followed by a partitioning step where, depending on 
their placement coordinates, the cells are assigned to the appropriate partition. These 
two-step iterations are performed until each partition contains a maximum of 15 movable cells. 
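
The iteration structure described above (optimize, partition, repeat until no partition holds more than 15 movable cells) can be sketched as below. This is a simplified illustration, not the GORDIAN implementation of [58]: the median split, the alternating cut direction and the `global_optimize` callback (a stand-in for the quadratic programming step, defaulting to a no-op) are assumptions made for the example.

```python
MAX_CELLS_PER_PARTITION = 15  # termination bound quoted in the text


def bisect(cells, axis):
    """Split a {name: (x, y)} dict into two halves along the given axis."""
    ordered = sorted(cells, key=lambda name: cells[name][axis])
    half = len(ordered) // 2
    return ({n: cells[n] for n in ordered[:half]},
            {n: cells[n] for n in ordered[half:]})


def partition_until_small(cells, global_optimize=lambda region: region):
    """Alternate (placeholder) optimization and bisection until every
    region contains at most MAX_CELLS_PER_PARTITION cells."""
    regions = [cells]
    level = 0
    while any(len(r) > MAX_CELLS_PER_PARTITION for r in regions):
        next_regions = []
        for region in regions:
            region = global_optimize(region)  # stand-in for the QP step
            if len(region) > MAX_CELLS_PER_PARTITION:
                # hypothetical rule: alternate vertical and horizontal cuts
                next_regions.extend(bisect(region, axis=level % 2))
            else:
                next_regions.append(region)
        regions = next_regions
        level += 1
    return regions
```

With 100 cells, this sketch stops after three levels with eight regions of 12 or 13 cells each.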
 
7.3.1.1. Net list and cell description 
 The GORDIAN algorithm requires a net list and a cell description as inputs. The 
net list can be written as a binary relation, as described in [58]: 
T ⊆ N × M, where N and M are the index sets of the nets and the modules, respectively. A 
connection of net υ to module µ is represented by (υ, µ) ∈ T; the set of modules 
connected by the net υ is Mυ = {µ ∈ M | (υ, µ) ∈ T}. 
 The dimensions of each cell and the cell type (fixed or movable) are also required. 
These inputs can be obtained from a cell library. It is imperative to have an accurate 
description of the cells for efficient placement. 
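
As a small illustration of the net list relation, the sketch below stores T as a set of (net, module) pairs and derives the module set of a net from it. The net and module names are made up for the example.

```python
# T is a subset of N x M: each pair (net, module) records one connection.
T = {
    ("n1", "a"), ("n1", "b"),
    ("n2", "b"), ("n2", "c"), ("n2", "d"),
}


def modules_of_net(relation, net):
    """Return the set of modules mu with (net, mu) in the relation."""
    return {mu for (nu, mu) in relation if nu == net}


def nets_of_module(relation, module):
    """Return the set of nets nu with (nu, module) in the relation."""
    return {nu for (nu, mu) in relation if mu == module}
```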
 
7.3.1.2. Fixed cells 
 The GORDIAN algorithm requires the fixed cell coordinates as input as it uses 
these fixed cells as anchor points while calculating the coordinates for the movable cells 






coordinates are calculated in such a way that the fixed cells are uniformly distributed 
around the periphery of the core area. This ensures that the movable cells get uniformly 
distributed across the core area and they do not over populate any particular region in the 
core area. Based on the number of fixed cells and the core area, the spacing between the 
fixed cells is calculated. 
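
One possible way to compute such uniformly spaced peripheral positions is sketched below. The starting corner, the counter-clockwise direction and the rectangular core are assumptions for illustration; the text only states that the spacing follows from the number of fixed cells and the core area.

```python
def place_fixed_cells(num_fixed, width, height):
    """Distribute num_fixed points uniformly along the boundary of a
    width x height core, starting at the lower-left corner and walking
    counter-clockwise. Returns (spacing, list of (x, y) positions)."""
    perimeter = 2.0 * (width + height)
    spacing = perimeter / num_fixed
    positions = []
    for i in range(num_fixed):
        s = i * spacing  # distance walked along the boundary
        if s < width:                          # bottom edge
            pos = (s, 0.0)
        elif s < width + height:               # right edge
            pos = (width, s - width)
        elif s < 2 * width + height:           # top edge
            pos = (width - (s - width - height), height)
        else:                                  # left edge
            pos = (0.0, height - (s - 2 * width - height))
        positions.append(pos)
    return spacing, positions
```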
 
7.3.1.3. Problem formulation 
 The GORDIAN placement algorithm formulates placement as a quadratic programming 
problem. The problem formulation is based on the circuit connectivity and 
the results obtained from the partitioning step in the earlier iteration. As described in [58], 
the objective function of the global optimization step is based on the rubber band lengths 
of the nets. The length Lυ of a net υ is measured by the sum of the squared distances from 
its pins to the net's center coordinates (xυ, yυ): 
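$$L_\upsilon = \sum_{\mu \in M_\upsilon} \left[ (x_\mu + \xi_{\upsilon\mu} - x_\upsilon)^2 + (y_\mu + \eta_{\upsilon\mu} - y_\upsilon)^2 \right] \tag{7.1}$$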
where $(\xi_{\upsilon\mu}, \eta_{\upsilon\mu})$ are the coordinates of the pin that connects module µ to net υ, 
relative to the center coordinates $(x_\mu, y_\mu)$ of module µ. Based on the importance of the 
net, a weight $w_\upsilon$ is assigned to the net. Thus the objective function is the weighted sum 
of the squared rubber band lengths of the nets: 
$$\Phi = \sum_{\upsilon \in N} w_\upsilon L_\upsilon \tag{7.2}$$
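
The net length and the weighted objective can be evaluated directly, as in the sketch below. One assumption is made loudly here: the net's center coordinates are taken to be the centroid of its pins, which the text does not state explicitly. The data layout (dictionaries keyed by module and net names) is also hypothetical.

```python
def pin_positions(net_pins, modules):
    """Absolute pin coordinates: module center plus relative pin offset."""
    return [(modules[m][0] + xi, modules[m][1] + eta)
            for m, (xi, eta) in net_pins]


def net_length(net_pins, modules):
    """Sum of squared distances from the pins to the net center.
    Assumption: the net center is the centroid of the pins."""
    pins = pin_positions(net_pins, modules)
    cx = sum(x for x, _ in pins) / len(pins)
    cy = sum(y for _, y in pins) / len(pins)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pins)


def objective(nets, weights, modules):
    """Weighted sum of the squared rubber band lengths of all nets."""
    return sum(weights[name] * net_length(pins, modules)
               for name, pins in nets.items())
```

For two modules centered at (0, 0) and (4, 0) with pin offsets (1, 0) and (-1, 0), the pins sit at x = 1 and x = 3, the center is at x = 2, and the net length is 2.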
 Due to the net model chosen in GORDIAN placement [58], it is possible to 
replace each net by all two-point connections of its pins. The objective function can 






$$\Phi(x, y) = \frac{1}{2} x^T C x + d_x^T x + \frac{1}{2} y^T C y + d_y^T y \tag{7.3}$$
 The vectors x and y denote the x and y coordinates of the m movable modules 
$\mu \in M_m \subset M$ in the m-dimensional vector space $R^m$. The matrix C and the vectors 
$d_x$ and $d_y$ are set up based on the procedures defined in [58]. As can be observed from 
equation (7.3), the objective function separates into two sub-objective functions: one for 
the x coordinates and the other for the y coordinates.  
 The constraints for the quadratic programming problem are based on the center of 
gravity. During every iteration, a center of gravity constraint is formulated for each 
partition, for the x coordinates and for the y coordinates. As described in [58], the 
constraint equation for x coordinates in region ρ at the lth level of partition ($q = 2^l$) can 
be given as follows 
$$A \cdot x = u \tag{7.4}$$
where the entries $a_{\rho\mu}$ of the $q \times m$ matrix A are given by 
$$a_{\rho\mu} = \begin{cases} F_\mu / \sum_{\mu' \in M_\rho} F_{\mu'}, & \text{if } \mu \in M_\rho \\ 0, & \text{otherwise} \end{cases} \tag{7.5}$$
Here $F_\mu$ is the area of the module µ and $M_\rho$ is the set of modules assigned to region ρ. 
The entry $u_\rho$ of u in equation (7.4) is the x coordinate of the center of gravity of region ρ. 
Similarly, for the y coordinates the center of gravity constraint is $A \cdot y = v$. If we 
combine the objective function and the center of gravity constraint, the linearly 
constrained quadratic programming problem (LQP) for the x 







$$\text{LQP}: \min \{ \Phi(x) = \frac{1}{2} x^T C x + d_x^T x \mid A x = u \} \tag{7.6}$$
$$\text{LQP}: \min \{ \Phi(y) = \frac{1}{2} y^T C y + d_y^T y \mid A y = v \} \tag{7.7}$$
Thus, we can see that the x coordinates and y coordinates can be separately optimized. 
This significantly reduces the problem size and hence the time of execution. Depending 
on the level of partition, the size of matrix A  will change.  
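
The center of gravity constraint matrix can be assembled from the module areas and the current partition, as in the following sketch. The list-based data layout is an assumption for illustration: each row holds the area fractions of the modules in one region, so a row times the coordinate vector gives the area-weighted mean coordinate of that region.

```python
def center_of_gravity_matrix(areas, regions):
    """Build the q x m matrix A with one row per region.

    areas:   list of module areas F_mu, indexed by module number
    regions: list of lists of module indices (one list per region)
    Entry (rho, mu) is F_mu / (total area of region rho) if mu is in
    region rho, and 0 otherwise.
    """
    m = len(areas)
    rows = []
    for members in regions:
        member_set = set(members)
        total = sum(areas[mu] for mu in member_set)
        rows.append([areas[mu] / total if mu in member_set else 0.0
                     for mu in range(m)])
    return rows


def apply(matrix, vector):
    """Matrix-vector product, used here to evaluate A x."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]
```

Each row sums to one, and the number of rows grows with the number of regions, which is why the size of A depends on the partition level.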
 
7.3.1.4. Solution 
 A Lagrangian multiplier method is adopted to solve the quadratic programming 
problem as described in [59]. The Lagrangian function is first defined and then the 
gradient of the Lagrangian function is equated to zero to calculate the feasible Kuhn-
Tucker point for this programming problem. The Lagrangian function of LQP in equation 
(7.6) can be formulated as follows 
$$\min \{ \Phi(x) = \frac{1}{2} x^T C x + d_x^T x - \lambda^T (A x - u) \} \tag{7.8}$$
where the λ vector is called the Lagrangian multiplier. Equating the gradients of this 
Lagrangian equation with respect to x and λ to zero gives us two equations 
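$$C x + d_x - A^T \lambda = 0 \tag{7.9}$$
$$A x - u = 0 \tag{7.10}$$
These two equations can be rearranged to form a linear system of the form Ax = b. 
$$\begin{bmatrix} C & -A^T \\ -A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = -\begin{bmatrix} d_x \\ u \end{bmatrix} \tag{7.11}$$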






 This system of linear equations can be solved using LU factorization. The 
solution of this system gives the x coordinates of the movable cells that minimize the 
objective function, together with the corresponding values of λ. The y coordinates of the 
movable cells can be evaluated in the same way. 
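
A toy instance of this linear system can be assembled and solved as sketched below. Plain Gaussian elimination with partial pivoting is used here in place of a stored LU factorization (the two are closely related, but this is a substitution made for brevity), and exact rational arithmetic keeps the example easy to check. The matrices in the usage note are made-up values, not data from the benchmarks.

```python
from fractions import Fraction


def solve_linear(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [Fraction(0)] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def solve_lqp(C, d, A, u):
    """Assemble [[C, -A^T], [-A, 0]] [x; lam] = -[d; u] and solve it."""
    m, q = len(C), len(A)
    K = [[C[i][j] for j in range(m)] + [-A[r][i] for r in range(q)]
         for i in range(m)]
    K += [[-A[r][j] for j in range(m)] + [0] * q for r in range(q)]
    rhs = [-v for v in d] + [-v for v in u]
    z = solve_linear(K, rhs)
    return z[:m], z[m:]
```

For C = [[2, -1], [-1, 2]], d = [0, -3], A = [[1/2, 1/2]] and u = [3], the solution is x = (5/2, 7/2) with λ = 3, and the mean of the two coordinates equals the prescribed center of gravity.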
 However, in the case of large circuits, the matrices in equation (7.11) will be very 
large and, as a result, the execution time will be very high. This can be avoided by taking 
advantage of the fact that the matrix C does not change across iterations. An 
optimization technique described in [59] is adopted to reduce the execution time: the LU 
factorization of matrix C is inherited by the next partition level. The leftmost matrix in 







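$$\begin{bmatrix} C & -A^T \\ -A & 0 \end{bmatrix} = \begin{bmatrix} LU & -A^T \\ -A & 0 \end{bmatrix} = \begin{bmatrix} L & 0 \\ L_n & l \end{bmatrix} \begin{bmatrix} U & U_n \\ 0 & u \end{bmatrix} \tag{7.12}$$
$L_n$, l, $U_n$ and u can be calculated using the following equations. 
$$L U_n = -A^T \tag{7.13}$$
$$L_n U = -A \tag{7.14}$$
$$L_n U_n + l u = 0 \tag{7.15}$$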
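
Equations (7.13) to (7.15) can be checked on a small example with the sketch below, which reuses an existing factorization C = LU and computes only the new blocks. It is an illustration under assumptions: a Doolittle factorization without pivoting is used (adequate for the small positive definite C in the usage note), and the helper names are hypothetical. Note that `u` here is the lower-right triangular factor of (7.12), not the center of gravity vector of (7.4).

```python
from fractions import Fraction


def lu_nopivot(M):
    """Doolittle LU without pivoting: M = L U, L unit lower triangular."""
    n = len(M)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    U = [[Fraction(v) for v in row] for row in M]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def forward_solve(T, B):
    """Solve T X = B for X, where T is lower triangular."""
    n, cols = len(T), len(B[0])
    X = [[Fraction(0)] * cols for _ in range(n)]
    for j in range(cols):
        for i in range(n):
            s = sum(T[i][k] * X[k][j] for k in range(i))
            X[i][j] = (B[i][j] - s) / T[i][i]
    return X


def extend_lu(L, U, A):
    """Return U_n, L_n, l, u of (7.12) given C = L U and the matrix A."""
    neg_At = [[-v for v in row] for row in transpose(A)]
    U_n = forward_solve(L, neg_At)                 # (7.13): L U_n = -A^T
    # (7.14): L_n U = -A is equivalent to U^T L_n^T = -A^T
    L_n = transpose(forward_solve(transpose(U), neg_At))
    # (7.15): l u = -(L_n U_n), factorized again without pivoting
    S = [[-v for v in row] for row in matmul(L_n, U_n)]
    l, u = lu_nopivot(S)
    return U_n, L_n, l, u
```

For C = [[2, -1], [-1, 2]] and A = [[1/2, 1/2]], the sketch gives U_n = (-1/2, -3/4), L_n = (-1/4, -1/2), l = 1 and u = -1/2, and multiplying the two block factors reproduces the leftmost matrix of (7.12).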







[Equation (7.16)]
[Equation (7.17)]
As described in [59], the optimization described above reduces the timing complexity of 






 The output of the GORDIAN algorithm is based on the center of gravity 
constraint. As a result, the placement solution may have a significant amount of overlap 
among the cells, and the solution will not be in the form of standard row cells. In order to 
remove the overlap and make the solution more compact, some post-processing is 
necessary.  
 
7.3.2. Overlap removal and compaction algorithm 
 The cell coordinates obtained using the GORDIAN algorithm are not integers. 
Hence, the coordinates are rounded to the nearest integer values. In order to 
remove the overlap among the different cells, the overlapping cells are moved to the 
nearest vacant cell location. Starting with the cell location on the immediate right, all the 
surrounding locations are explored in a clockwise fashion to search for a vacant slot. 
This is done to avoid any significant increase in the wirelength. If no 
vacant slot is found, the search circumference is increased to find 
the nearest vacant slot. After this overlap removal process, each location 
has at most one cell. 
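
The overlap removal search can be sketched as follows. Several details are assumptions made for the example: cells occupy unit slots on a `cols` x `rows` grid, rounded coordinates are clipped to the grid, y grows upward (so "clockwise from the right" proceeds right, below, left, above), and cells are processed in input order. Python's built-in `round` resolves exact halves to the even integer.

```python
def ring_clockwise(cx, cy, r):
    """Slots at Chebyshev distance r from (cx, cy), starting at the slot
    on the immediate right and proceeding clockwise (y grows upward)."""
    pts = [(cx + r, cy - k) for k in range(0, r + 1)]            # right edge, downward
    pts += [(cx + r - k, cy - r) for k in range(1, 2 * r + 1)]   # bottom edge, leftward
    pts += [(cx - r, cy - r + k) for k in range(1, 2 * r + 1)]   # left edge, upward
    pts += [(cx - r + k, cy + r) for k in range(1, 2 * r + 1)]   # top edge, rightward
    pts += [(cx + r, cy + r - k) for k in range(1, r)]           # right edge, back to start
    return pts


def remove_overlap(coords, cols, rows):
    """Map each cell to a distinct integer slot near its rounded position."""
    occupied = set()
    placed = {}
    for cell, (x, y) in coords.items():
        col = min(max(round(x), 0), cols - 1)
        row = min(max(round(y), 0), rows - 1)
        slot = (col, row)
        radius = 0
        while slot in occupied:
            radius += 1  # enlarge the search circumference
            if radius > max(cols, rows):
                raise ValueError("no vacant slot left on the grid")
            for cand in ring_clockwise(col, row, radius):
                inside = 0 <= cand[0] < cols and 0 <= cand[1] < rows
                if inside and cand not in occupied:
                    slot = cand
                    break
        occupied.add(slot)
        placed[cell] = slot
    return placed
```

With three cells that all round to slot (1, 1), the first keeps (1, 1), the second moves to the slot on the immediate right, (2, 1), and the third moves to the next slot in clockwise order, (2, 0).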
 The next step in post-processing is compaction. The output obtained from the overlap 
removal step will have many empty slots. In order to reduce the white space in the core 
area, a compaction algorithm is used. Here an iterative trickle-down mechanism is 
adopted. Starting with the topmost row, for each iteration, the cells in each row are 
moved to the lower row if a vacant slot exists and the cell does not need to be moved by a 
distance more than the predefined offset. At the end of the iteration, existence of vacant 






appropriate cells in each row are shifted to the left and the number of columns is reduced. 
These steps of row reduction and column reduction are iteratively followed until a 
compact solution is obtained. 
 
7.3.3. Determination of WPM nets 
 The output obtained from the compaction algorithm is then used to determine the 
opportunities to use WPM routing. Several conditions need to be satisfied in order for a 
pair of nets to be redesigned as WPM nets. These conditions are 
• The sources and the sinks of the two nets should satisfy the source-sink proximity 
• The nets should not be part of the critical path of the sample circuit 
• The nets should have a length greater than a threshold interconnect length. Here 
the threshold length is determined based on the total area occupied by all the cells. 
Assuming a square placement for all the cells, the threshold length is assumed to 
be half the edge size. 
• If a source/sink of a net is part of another WPM net, then the current net cannot be 
redesigned as WPM net. 
• If a net is already designed as a WPM net, it cannot be paired up again with any 
other net to form a WPM net. 
• If the two nets have any common sources/sinks then they cannot be paired up to 
be redesigned as a WPM net. 
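
The conditions listed above can be combined into a simple pairing routine, sketched below. It is illustrative only: each net is reduced to a single source and a single sink, the nets are paired greedily in input order, and the proximity test is passed in as a hypothetical callback `proximity_ok`; none of these choices is stated in the text.

```python
import math


def threshold_length(cell_areas):
    """Half the edge of a square whose area equals the total cell area."""
    return 0.5 * math.sqrt(sum(cell_areas))


def select_wpm_pairs(nets, cell_areas, proximity_ok):
    """Greedily pair nets that satisfy all WPM conditions.

    nets: {name: {"source": cell, "sink": cell,
                  "length": float, "critical": bool}}
    proximity_ok(net_a, net_b) -> bool is the source-sink proximity test.
    """
    l_min = threshold_length(cell_areas)
    candidates = [name for name, net in nets.items()
                  if not net["critical"] and net["length"] > l_min]
    used_cells = set()   # sources/sinks already part of a WPM net
    paired = set()       # nets already redesigned as WPM nets
    pairs = []
    for i, a in enumerate(candidates):
        ends_a = {nets[a]["source"], nets[a]["sink"]}
        if a in paired or ends_a & used_cells:
            continue
        for b in candidates[i + 1:]:
            ends_b = {nets[b]["source"], nets[b]["sink"]}
            if b in paired or ends_b & used_cells or ends_a & ends_b:
                continue
            if proximity_ok(nets[a], nets[b]):
                pairs.append((a, b))
                paired.update((a, b))
                used_cells |= ends_a | ends_b
                break
    return pairs
```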
 
 The pair of nets that satisfy these conditions are chosen to be redesigned as WPM 






source-sink pair of one of the nets is eliminated and the source-sink pair of the other net 
is replaced by supernodes. A supernode consists of the drivers or receivers of the original 
nets and the overhead circuitry required for implementing WPM routing. Hence, the 
supernodes will have an area larger than the original source-sink pair. As a result, when 
supernodes are inserted the length of some of the cell rows will increase. To make the 
solution compact, the compaction algorithm described in Section 7.3.2 is used again. 
 
7.3.4. Simulated annealing 
 Though the above set of steps helps determine the nets that can be 
redesigned as WPM nets, other nets may exist that could be 
redesigned as WPM nets but were filtered out because they did not satisfy one or 
more of the conditions described in Section 7.3.3. In order to increase the number of nets 
that can be redesigned as WPM nets, an iterative improvement algorithm called simulated 
annealing is used. The simulated annealing algorithm uses a cooling schedule to decide 
on the operations to be performed on the system. At high temperatures, it accepts all 
moves that improve the current state of the system and, with a certain probability, some 
moves that worsen the current system state. At lower 
temperatures, essentially only moves that improve the system state are accepted, as the 
probability of accepting bad moves is reduced. The basic premise of this algorithm is that 
it is necessary to accept bad moves at the initial stage to avoid getting stuck in a local 
minimum of the given objective function. The simulated annealing algorithm continues until 






(temperature cools down to the lowest possible value). A detailed description of the 
simulated annealing algorithm is given in [60].  
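
A generic simulated annealing loop of this kind is sketched below. The Metropolis acceptance rule, the geometric cooling schedule and all parameter values are common textbook choices assumed here for illustration; the schedule actually used in this work is the one described in [60]. The toy cost function (wirelength of two-pin nets for cells placed in a single row) and the swap move are likewise hypothetical.

```python
import math
import random


def accept(delta_cost, temperature, rng):
    """Always accept improvements; accept a worsening move with
    probability exp(-delta / T) (Metropolis rule, assumed here)."""
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)


def anneal(state, cost, neighbor, t_start=10.0, t_end=0.01,
           alpha=0.9, moves_per_t=50, seed=0):
    """Geometric cooling from t_start down to t_end; returns the best state seen."""
    rng = random.Random(seed)
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            if accept(candidate_cost - current_cost, t, rng):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # cool down
    return best, best_cost


def row_wirelength(order, nets):
    """Toy cost: total distance between connected cells in one row."""
    position = {cell: i for i, cell in enumerate(order)}
    return sum(abs(position[a] - position[b]) for a, b in nets)


def swap_two(order, rng):
    """Toy move: swap two randomly chosen cells."""
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order
```

A call such as `anneal(start, lambda o: row_wirelength(o, nets), swap_two)` returns a permutation of the starting row whose wirelength is never worse than that of the starting placement, since the best state seen is tracked separately from the current state.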
 In the case of the WPM technique, the placement solution obtained using the 
GORDIAN algorithm and compaction is used as the starting point, and the cells are 
randomly moved and/or swapped to determine if there is any increase in the opportunity 
to apply WPM routing. The objective function here checks for any changes in the 
wirelength while moving/swapping the cells. Depending on the status of the cooling 
schedule, a move/swap that increases wirelength may or may not be accepted. The final 
output of the simulated annealing algorithm is the placement solution containing original cells and 




[Figure: placement flow, recovered step labels: "Compaction", "Determination of potential WPM interconnects"]












To determine the advantages of WPM routing, BLIF benchmarks are placed and 
studied. As described in the earlier section, the GORDIAN algorithm and the compaction 
algorithm are used to place the cells. This is followed by two post-processing steps to 
determine WPM nets. Figure 7.4 shows the percentage of interconnects that satisfy 
source-sink proximity. This percentage contributes to the wire sharing efficiency factor 
used in Chapters 5 and 6. It can be observed from the figure that as the benchmark circuit 
size increases, there is an increase in the wire sharing efficiency. This can be attributed to 
































The key motivation for applying WPM routing is to reduce the total wirelength of 
the circuit. Assuming there would be no change in the aspect ratio of the interconnect 
architectures, and the wire spacing and dielectric thickness are equal to the wire width, 
the total switching capacitance of the interconnects will be directly proportional to the 
wirelength. Hence, a reduction in wirelength reduces the total switching capacitance, 
which in turn reduces the total power dissipated by the interconnects. The reduction in 
wirelength also reduces the number of repeaters that need to be inserted on the 
interconnects. However, some overhead circuitry is required to implement WPM routing. 
Thus, the application of WPM routing changes the total transistor area occupied by the 
system.  
Table 7.1 shows a list of the different benchmark circuits, the number of cells and the 
corresponding wirelengths after application of the GORDIAN algorithm, compaction, 
simulated annealing and arrangement of the cells in standard row cells. Figure 7.5 shows the 
reduction in wirelength after application of WPM routing for the various benchmarks. 
For smaller circuits having a few hundred cells, there is less opportunity to apply WPM 
routing. As a result, fewer interconnects can be designed as WPM 
interconnects, which lowers the percent reduction in wirelength. As we go 
towards larger systems having thousands of cells, there are more interconnects that can be 
redesigned as WPM interconnects, thus resulting in a larger reduction in wirelength. On 
average, about 6% reduction in wirelength is observed. If WPM routing is applied to even 
larger circuits having millions of cells, there will be more opportunities for WPM routing, 
which will provide a further reduction in wirelength. Figure 7.6 shows the increase in transistor 






overhead circuitry required for WPM routing. Since the benchmark circuits considered 
here are small (a few thousand gates), the interconnect lengths are fairly small and no 
repeaters are inserted on the interconnects. Hence, elimination of interconnects does 
not eliminate any repeaters and so there is no reduction in transistor area. On average, 
there is a 4% increase in the transistor area of the system. In the case of larger circuits 
having millions of cells, repeaters may be inserted on the global interconnects and, if these 
interconnects are redesigned as WPM interconnects, a reduction in transistor area may be 
observed. 
In the case of power dissipation, there is an increase in the activity factor of the 
shared resources when WPM routing is used. In addition, overhead circuits are added 
for WPM routing. As a result, there is an increase in the dynamic power dissipated by the 
system. The leakage power dissipated by a circuit is proportional to the total device count 
and device sizes. The application of WPM routing increases the overall device count. As 
a result, an increase in the total leakage power dissipated by the system is observed. It can 
be observed from Figure 7.7 that, on average, a 4% increase in leakage power and a 7% 
increase in dynamic power are obtained for the benchmark circuits. As with transistor area, 
for larger benchmark circuits with millions of cells, the application of WPM routing will 
















Benchmark circuit   Number of cells   Total wirelength (using GORDIAN algorithm,
                                      compaction and simulated annealing)
s641                433               512
s713                447               534
s1494               661               789
s5378               3027              3561
s9234               5844              6984
s13207              8727              10278
s15850              10397             12242






























































































Thus, it can be observed that using just the source-sink proximity constraint, some 
advantages in terms of reduction in wirelength can be obtained. Though there is an 
increase in transistor area and power for these benchmark circuits, larger benchmark 
circuits can exhibit a reduction in transistor area and power due to the elimination of 
repeaters on the eliminated interconnects. In addition, larger benchmark circuits having 
millions of transistors are expected to be wire-limited. Hence, there will not be any 
increase in the total core area after addition of the overhead circuitry.  
In addition to source-sink proximity, the determination of interconnects that 
satisfy run-length proximity can also increase the number of interconnects that are eligible 
for WPM routing. Here too, simulated annealing can be used to further increase the number 
of interconnects that can be redesigned as WPM interconnects. This increases the wire 
sharing efficiency factor that is used in Chapters 5 and 6. 
 
7.5. Application of WPM routing to SPARC64 processor 
 In order to study the limits and opportunities of applying WPM routing to current 
high-performance systems, a SPARC64 processor designed using 130 nm technology is 
considered. This processor has 190 million transistors in total (19 million logic 
transistors) and requires 8 metal levels for routing the interconnects. A detailed 
description of this 1.3 GHz fifth-generation SPARC64 microprocessor design is given in 
[61]. Using the die micrograph in [61], the approximate lengths of the interconnects between 
the floating point (FP) macrocell and the Load/Store (LS) macrocell, and the fixed point 






respectively. It is assumed that the interconnects travel from the center of one macrocell 
to the center of the other macrocell.  
 Given that it is a 64-bit microprocessor and it has 2 FP units, one can assume that 
there will be 4 read ports (therefore 4 x 64 interconnects) and 2 write ports (therefore 2 x 
64 interconnects) on the FP macrocell that send/receive data from the LS unit. In 
addition to these data lines, there will be control lines to send and receive 
various handshaking signals between the two macrocells; however, these control lines 
have been ignored for this case study. Thus there will be a total of 384 interconnects (set 
A) between the two macrocells. Similarly, one can assume that there will be 384 
interconnects (set B) between the FX macrocell and the LS macrocell. 
 In order to determine any existence of wire idleness, the interconnects in set A 
and set B are modeled using Level 49 HSPICE parameters for 130 nm technology [56]. 
The interconnect pitch and thickness values for the processor design are obtained from 
[61] and are shown in Table 7.2. A suboptimal number of repeaters [12], having 
suboptimal size [13], are inserted on the wires. The processor design in [61] has a die size 
of 1.81 cm x 1.599 cm and hence, the interconnects of length 1.023 cm and 0.75 cm are 
assumed to be global interconnects that are routed on metal level 7 or 8. The 
interconnect width is therefore taken to be 900 nm [61]. 
 Table 7.3 shows the interconnect delay for the two interconnect lengths, obtained 
using HSPICE. The delay for the interconnect of length 0.75 cm is just 0.427 ns, i.e. 0.55 
times the clock period, and from [37] the minimum pulse width evaluates to 0.131 ns. The 
sum of interconnect delay and minimum pulse width is 0.557 ns which is less than 0.8 






hand, the interconnect of length 1.023 cm has a delay of 0.585 ns which is 0.76 times the 
clock period. The minimum sustainable pulse width evaluates to 0.129 ns using [37] for 
this case. Hence, it does not satisfy the delay constraint in equation (2.5). 
 
Table 7.2: Interconnect pitch used to design the SPARC64 microprocessor [61]. 
130 nm technology: interconnect pitch and thickness (in nm)
Metal layer   Pitch   Thickness
1             450     200
2             450     200
3             450     225
4             450     225
5             900     450
6             900     450
7             1800    900
8             1800    900
 
Table 7.3: Interconnect delay for different interconnect lengths. 










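Interconnect set   Length (cm)   Delay (ns)   Minimum pulse width (ns)   (Delay + pulse width) / clock period
A                  1.023         0.585        0.129                      0.928
B                  0.75          0.427        0.131                      0.725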
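
The normalized "delay + pulse width" figures can be reproduced with a few lines of arithmetic, as sketched below. The function name is hypothetical, and the sketch only computes the normalized value for the 1.3 GHz clock; the exact form of the constraint in equation (2.5) is not restated here, so no pass/fail threshold is applied.

```python
CLOCK_FREQUENCY_GHZ = 1.3  # operating frequency of the processor


def normalized_delay_plus_pulse(delay_ns, min_pulse_ns,
                                f_ghz=CLOCK_FREQUENCY_GHZ):
    """(interconnect delay + minimum pulse width) / clock period."""
    clock_period_ns = 1.0 / f_ghz  # about 0.769 ns at 1.3 GHz
    return (delay_ns + min_pulse_ns) / clock_period_ns


set_a = normalized_delay_plus_pulse(0.585, 0.129)  # 1.023 cm interconnect
set_b = normalized_delay_plus_pulse(0.427, 0.131)  # 0.75 cm interconnect
```

Rounded to three decimals, the results are 0.928 for set A and 0.725 for set B, matching the last column of the table above.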
 
 Figure 7.8 shows the abstraction of the floor plan of the microprocessor described 
in [61]. WPM routing can be applied to all interconnects in set B if they satisfy the 
proximity constraints. One can then reduce the number of routing channels by 50% 
without any loss of throughput performance, while the latency constraint of the signal is 
maintained. For interconnects in set A, though the constraint in equation (2.5) is not 
satisfied, WPM routing can still be applied to all interconnects and the routing 






Just as a traditional 2-stage pipeline increases the latency to twice the clock period, here 
the latency would increase to twice the clock period but the throughput performance 
would be maintained. Interconnects of set A could require some redesign at the RTL 
stage to account for this data latency change. Once the system is verified to work using 
shared interconnects for set B, then WPM could be seamlessly incorporated at the logic 











Figure 7.8: Abstraction of the floor plan showing a subset of macrocells and the 
corresponding inter-macrocell interconnects. 
 
 Similar to the interconnects between the FP unit and LS unit or FX unit and LS 
unit, interconnects between various macrocells can be redesigned using WPM routing. As 
described earlier in this section, the new interconnect layout using WPM routing may or 
may not require any additional changes at the RTL stage. Table 7.4 shows a list of 
interconnects between different macrocells that can be designed using WPM routing, 
their lengths, delay and the minimum pulse width that can travel over the interconnect 






are inserted on the wires. As all these interconnects are inter-macrocell interconnects, it is 
assumed that these interconnects will be routed on the topmost tier (metal level 7 or 8). It 
can be observed from the table that the interconnect lengths vary over a range of 0.205 
cm to 1.102 cm. Since all these interconnects are being routed over the same tier, the 
shorter interconnects will provide more opportunities for WPM routing as they will have 
less delay.  
 Figure 7.9 shows a plot of the “delay + pulse width” normalized to clock period 
for the interconnects running between different macrocells. Out of the ten possible inter-
macrocell interconnect sets, seven sets satisfy the delay constraint in equation (2.5). The 
WPM routing technique can be easily applied to these interconnects without any loss of 
performance or any changes at the RTL level. For the remaining three sets, WPM routing 
technique can be used, but there will be an increase in latency and some changes at the 
RTL level may be necessary to account for this increase in latency. However, there will 
be no change in the throughput performance of the system. 
 














Macrocell pair   Length (cm)   Delay (ns)   Minimum pulse width (s)   (Delay + pulse width) / clock period
FX and LS        0.75          0.427        1.65E-10                  0.769053
FP and LS        1.023         0.585        1.50E-10                  0.955612
I-unit and LS    0.38          0.229        1.67E-10                  0.514483
MMU and LS       0.205         0.16         9.76E-11                  0.334829
L1D and LS       0.43          0.3          1.01E-10                  0.521462
L1I and LS       0.645         0.383        1.42E-10                  0.683058
BTAC and LS      0.75          0.427        1.65E-10                  0.769053
FP and L1I       1.102         0.616        2.48E-10                  1.123524
FX and L1I       0.855         0.471        1.88E-10                  0.856728




























































































Figure 7.9: “Delay + pulse width” normalized to clock period for interconnects between 
various macrocell pairs. 
 
 Table 7.4 only shows the opportunities that are available for using WPM routing 
for the inter-macrocell interconnects. In addition to these inter-macrocell interconnects, 
WPM routing can also be applied to intra-macrocell interconnects. This further 
increases the number of interconnects that can be designed using WPM routing and can 
provide significant advantages in terms of wire area and/or power dissipation. 
 Given that there are numerous opportunities to apply WPM routing to 
the existing layout of the SPARC64 processor, it may prove even more advantageous if 
WPM-aware placement and routing algorithms are adopted for generating the physical 
layout of the SPARC64 processor. The WPM routing technique can be used as a constraint 
in the quadratic programming problem (QPP) during the initial placement, instead of looking for potential WPM interconnects 






methodology, it can provide tremendous advantages in terms of power, area and 
performance. 
  
7.6. WPM based design approach vs material solutions for 
future systems 
 
 The use of new manufacturing materials is one of the most widely used 
techniques to improve quality of future designs. Though the advantages of material 
variations seem to be obvious, these solutions are not always feasible due to various 
manufacturing constraints. As a result, it is imperative to look at alternate design 
solutions that can provide the same amount of benefits using the existing material 
solutions. 
 In the case of a multilevel interconnect network, the overall network performance 
depends on the interconnect delay, which is limited by the material used to manufacture 
the interconnects, the dielectric material used between the interconnects and the 
interconnect dimensions. For fixed interconnect dimensions, the interconnect performance 
can be improved by using interconnect materials with low resistivity to reduce the 
interconnect resistance and/or using a low-k dielectric between the interconnects to 
reduce the interconnect switching capacitance. This helps in improving performance and/or 
reducing power dissipation. Since the interconnect dimensions are chosen based on the 
delay constraints, for constant performance the use of a low-k dielectric and a low-resistivity 
interconnect material enables the use of interconnects with smaller pitch values.  
 Though these benefits in performance, area and power of interconnects with the 






elegant circuit design techniques like WPM routing, while keeping the interconnect 
resistivity and dielectric material unchanged. 
 To compare the advantages in the interconnect network obtained using a material 
change with those obtained using new circuit design techniques, a SPARC64 processor 
designed using 130 nm technology and operating at 1.3 GHz is considered. The SPARC64 
processor has 19 million (~20 million) logic transistors [61]. The total number of 
interconnects that are required to connect the different source-sink pairs is calculated 
using the stochastic interconnect distribution described in [36]. The multilevel 
interconnect network required for routing the different interconnects is designed using 
HR-MINDS. The interconnect dimensions (wire width and wire spacing) and tier count are 
obtained from [61]. The material choices for interconnects and transistors are based on 
the ITRS [4] projections. 
 A multilevel interconnect network is designed using a dielectric constant of 3.0 
and an interconnect resistivity of 2.2 µohm-cm. This design serves as the base case design. 
A second design is created with the same values for the design parameters as the base 
case, except for the dielectric constant, which is 2.7. 
Similarly, a third multilevel interconnect network is designed where all the design 
parameters have the same values as the base case, but the WPM routing 
technique is used for routing the interconnects instead of the conventional 
interconnect routing technique. Table 7.5 shows a comparison of the 
design parameter values for the three designs. The detailed design of 








Table 7.5: Comparison of different design styles. 
Design parameter           Conventional design   Conventional design (low-k)   WPM design
Rho                        2.2 µohm-cm           2.2 µohm-cm                   2.2 µohm-cm
Dielectric constant        3.0                   2.7                           3.0
No. of transistors         20M                   20M                           20M
Operating frequency        1.3 GHz               1.3 GHz                       1.3 GHz
Max. operating frequency   1.445 GHz             1.469.61 GHz                  1.389 GHz
Macrocell area             0.7304 cm2            0.7304 cm2                    0.5369 cm2
Transistor area            0.42 cm2              0.387 cm2                     0.431 cm2
Power                      36.76 W               33.08 W                       33.39 W
 
Table 7.6: Detailed design of the multilevel interconnect network for the different designs. 
Conventional design (rho = 2.2 µohm-cm and dielectric constant = 3.0) 






Length of the 
longest 
interconnect 











1 2 0.016477 2.00E-05 2.50E-05 0 0 
2 2 0.145693 2.25E-05 2.25E-05 0 0 
3 2 0.409672 4.50E-05 4.50E-05 50819 91.8 
4 2 1.709278 9.00E-05 9.00E-05 12705 
64.91 
183.61 
Conventional design (rho = 2.2 µohm-cm and dielectric constant = 2.7) 
1 2 0.016477 2.00E-05 2.50E-05 0 0 
2 2 0.145693 2.25E-05 2.25E-05 0 0 
3 2 0.409672 4.50E-05 4.50E-05 48211 87.09 
4 2 1.709278 9.00E-05 9.00E-05 12053 
61.58 
174.18 
WPM design (rho = 2.2 µohm-cm and dielectric constant = 3.0) 
1 2 0.009653 2.00E-05 2.50E-05 0 0 
2 2 0.089583 2.25E-05 2.25E-05 25116 45.9 
3 2 0.308494 4.50E-05 4.50E-05 28805 91.8 




 As can be observed from Table 7.5, the use of a lower dielectric constant does not 
reduce the macrocell area. This is because the macrocell area is wire-limited. On the other 
hand, in the case of the WPM design, since the number of interconnects reduces, routing 
channels in the different metal levels get freed up and this provides an opportunity to reduce the 






WPM design, a reduction of close to 27% in macrocell area can be observed. As far as the
performance of the system is concerned, a 1.6% improvement in performance is observed
as the dielectric constant is reduced. For the WPM design, a reduction in macrocell
area reduces the interconnect lengths and hence the interconnect delay. However, there is
a loss in performance because the system performance becomes limited by the critical path
delay as the macrocell area is reduced. Although there is a loss of performance for the WPM
design, the maximum possible performance of the WPM design still exceeds the desired
performance of the system. As expected, the transistor area decreases with a lower dielectric
constant because smaller drivers/receivers are required due to the reduction in the interconnect
switching capacitance. In the case of the WPM design, an increase in the transistor area is
observed due to the overhead area required for WPM routing. However, the total transistor
area is less than 60% of the macrocell area. This ensures that the macrocell area does not
become transistor-limited. As far as the power dissipation is concerned, the
power reduction obtained using a lower dielectric constant (10%) is comparable to the
power reduction obtained using the WPM design (9.1%). The power reduction in the WPM
design is due to the reduction in the interconnect lengths and the driver-receiver sizes.
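The area and power percentages quoted above can be checked against the macrocell area and power rows of Table 7.5. The following Python sketch is only an illustrative check. The dictionary layout and the helper name percent_reduction are assumptions introduced here and are not part of the design simulator.

```python
# Macrocell area (cm^2) and power (W) copied from Table 7.5.
designs = {
    "conventional": {"macrocell_area_cm2": 0.7304, "power_w": 36.76},
    "low_k": {"macrocell_area_cm2": 0.7304, "power_w": 33.08},
    "wpm": {"macrocell_area_cm2": 0.5369, "power_w": 33.39},
}


def percent_reduction(baseline, value):
    """Return the reduction of `value` relative to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Compare the low-k and WPM designs against the conventional design.
base = designs["conventional"]
reductions = {
    name: {key: percent_reduction(base[key], d[key]) for key in base}
    for name, d in designs.items()
    if name != "conventional"
}

for name, r in reductions.items():
    print(f"{name}: area -{r['macrocell_area_cm2']:.1f}%, power -{r['power_w']:.1f}%")
```

With the tabulated values, the sketch gives a macrocell area about 26.5% smaller for the WPM design, and power reductions of about 10.0% (low-k) and 9.2% (WPM), which is close to the figures discussed above.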
 Thus, the improvements in performance, area and power dissipation that are obtained
using new material solutions can also be obtained using new design techniques in place
of the conventional approaches. Instead of pushing towards the fundamental design
limits through material changes to improve system design, using new and elegant
design techniques can provide the same advantages








 The limits of physical design and opportunities for application of WPM routing to
current and future digital systems are presented in this chapter. Using quadratic
programming, simulated annealing and Lagrangian multiplier methods, placement
solutions for BLIF benchmark circuits are evaluated. The placement solutions are then
studied for opportunities to apply WPM routing. Various factors such as source-sink
proximity, interconnect length and criticality of the interconnect are considered while
pairing up two dedicated interconnects and replacing them with a single shared WPM
interconnect. For BLIF benchmark circuits, it is observed that the wirelength reduction
obtained with WPM routing increases steadily as the size of the benchmark circuit
increases. For circuits having close to 20,000 cells, a 9% reduction in wirelength is
observed. Extrapolating this trend, a higher reduction in wirelength can be expected for
larger benchmark circuits, as there will be more opportunities for application of WPM
routing.
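The pairing step summarized above can be illustrated with a minimal Python sketch. It is not the procedure used in this chapter. The Net class, the greedy first-fit strategy, the max_dist threshold and the rule that a pair saves the length of its shorter wire are all simplifying assumptions made here for illustration. The sketch ignores the extra wiring needed to reach the second source and sink and the WPM circuit overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Net:
    """Hypothetical two-terminal interconnect on a placed design."""
    name: str
    source: tuple  # (x, y) of the driver
    sink: tuple    # (x, y) of the receiver
    critical: bool = False


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def length(net):
    return manhattan(net.source, net.sink)


def pair_nets(nets, max_dist):
    """Greedily pair non-critical nets whose sources and sinks are both close."""
    unpaired = [n for n in nets if not n.critical]
    pairs = []
    while unpaired:
        a = unpaired.pop(0)
        for b in unpaired:
            if (manhattan(a.source, b.source) <= max_dist
                    and manhattan(a.sink, b.sink) <= max_dist):
                pairs.append((a, b))
                unpaired.remove(b)
                break
    return pairs


def wirelength_reduction(nets, pairs):
    """Fraction of total wirelength saved if each pair keeps only its longer wire."""
    total = sum(length(n) for n in nets)
    saved = sum(min(length(a), length(b)) for a, b in pairs)
    return saved / total


nets = [
    Net("n1", (0, 0), (10, 0)),
    Net("n2", (0, 1), (10, 1)),
    Net("n3", (5, 5), (5, 9), critical=True),
    Net("n4", (20, 20), (25, 20)),
]
pairs = pair_nets(nets, max_dist=2)
print([(a.name, b.name) for a, b in pairs], wirelength_reduction(nets, pairs))
```

In this toy placement only n1 and n2 satisfy the proximity test, so one of the two 10-unit wires is removed from a total wirelength of 29 units.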
 The opportunities to apply WPM routing to a SPARC64 processor are also 
explored in this chapter. It can be concluded from the analysis that there are significant 
opportunities to apply both SSWPM and DSWPM routing techniques to the inter-core 
interconnects in the existing design of the SPARC64 processor without losing 
performance. This provides significant advantages in terms of reduction in wire area, 
which can be harnessed to reduce power dissipation and improve performance. 
 A comparison between the interconnect solutions using new material and the 
solutions that use efficient design techniques is completed. A 20 million transistor system 






original system is redesigned using low-k dielectric and using WPM routing. The
solution using new materials provides direct advantages in terms of reduction in power
and improvement in performance. A 10% reduction in power and a 1.6% improvement in
performance are observed for the same core area. On the other hand, for the approach
using efficient design techniques (WPM routing), a 27% reduction in macrocell area and
a 9.1% reduction in power dissipation are observed with no loss in performance. Thus, it is
possible to derive comparable advantages using new material solutions and new
design techniques. This would help in reducing the pressure on process engineers for











 The key conclusions derived from this research are presented in this chapter. In
addition, the possible opportunities for refinement and extension of this research are
discussed.
 
8.1. Conclusions 
 The primary objective of this research is to develop a new very large scale
integration (VLSI) interconnect design technique that could help to overcome the
detrimental effects of VLSI interconnects on power, area and performance of application
specific integrated circuit (ASIC)/microprocessor products. This new interconnect design
technique is referred to as wave-pipelined multiplexed (WPM) routing, and is put forth
as a design technique that could be used as pervasively as VLSI interconnect repeater
circuits. The goal of this thesis is to rigorously study the limits and
opportunities for application of the WPM routing technique at the circuit level, the






seeks to make contributions in the fields of circuit design, system-level simulation of 
multilevel interconnect architecture and physical design. 
 To conclude this dissertation, the main contributions of this research are 
summarized as follows: 
1. Wave-pipelined multiplexed (WPM) routing technique 
A unique WPM routing technique that takes advantage of the intra-clock period wire 
idleness is proposed. Using this wire routing technique it is possible to send multiple 
signals over a single interconnect in one clock cycle in a wave-pipelined fashion. 
Due to its simplicity and robustness of application, this WPM technique could be 
easily incorporated in the new GSI systems without any architectural changes. This 
technique has the potential to become a pervasive routing technique that can be 
easily applied to both inter-core and intra-core interconnects in any SoC or 
microprocessor design. 
2. Optimization of the WPM interconnect
A direct advantage of the application of the WPM routing technique is the reduction in
the number of interconnects. This reduction in interconnect count reduces wire area,
and the freed area can be used to reduce interconnect capacitance and hence
power dissipation. For a simple two-interconnect system having 1.0 cm long
interconnects, the application of WPM routing reduces wire area by 50% and
transistor area by 15%, but increases power by 20%. By increasing the wire spacing
to 1.5 times the interconnect width and the dielectric thickness in the same
proportion to maintain constant crosstalk ratio, a 44% reduction in wire area, 29%






the wire spacing is increased so as to fill up all the emptied wire area, then a 26%
reduction in power dissipation and a 41% reduction in transistor area are observed. The
performance of an interconnect can also be improved using WPM routing. Taking
advantage of the reduction in wire count, the interconnect dimensions can be
suitably increased. For a 1.0 cm long global interconnect, close to 74% improvement
in performance is observed using WPM routing.
3. Tolerance of WPM routing to external noise 
It is imperative for any new circuit design to have high tolerance levels to external 
noise. The tolerance of WPM routing technique to crosstalk noise, power supply 
noise and clock skew is analyzed. In case of crosstalk noise, the WPM circuit can 
tolerate crosstalk as long as the pulse width used for sampling (at both sender and 
receiver side) is at least equal to the difference between the worst-case delay and 
best-case delay. A staggered repeater insertion technique can be adopted to increase 
the tolerance of WPM circuit to crosstalk noise. The tolerance of WPM routing to 
power supply noise is studied by inducing ±10% variations in the supply voltage. 
These power supply fluctuations affect the interconnect delay. However, these delay 
fluctuations are much smaller than the pulse width used for sampling. Thus, the 
WPM circuit can easily tolerate power supply noise. The use of a global clock to
generate the sampling pulse makes the WPM circuit susceptible to clock skew. A 0.5
cm WPM interconnect, designed to operate at 1.3 GHz using 100 nm technology








4. Wave-pipelined encoded (WPE) routing
A new data routing technique called wave-pipelined encoded (WPE) routing,
which is more robust than WPM routing, is proposed. Instead of transmitting
actual data over the shared interconnect, encoded data is transmitted. As a result, the
receiver side circuitry receives one of the expected data values, which is decoded and
routed to the appropriate sink. The WPE routing technique requires a significant
amount of overhead circuitry to encode and decode the data, as compared to the
WPM routing technique. As a result, for an individual 2-wire system having 1 cm
long interconnects, a significant increase in power dissipation and transistor area is
observed on application of WPE routing. However, if the spacing between the wires
is increased so as to fill up all empty wire area, a 4% reduction in transistor area and
an 18% reduction in power dissipation are observed.
5. Second generation multilevel interconnect network design simulator – HR-MINDS 
To study the application of the WPM routing technique at the system level, the 
second generation multilevel interconnect network design simulator is developed. A 
comparison between compact models and simulation tools demonstrates that 
compact models cannot always accurately predict the behavior of a 
device/interconnect. Hence, this second generation simulator uses system-level 
interconnect prediction techniques and simulation tools to design the entire 
multilevel interconnect network for a given full-custom/semi-custom design. The 
simulator is interfaced with HSPICE and RAPHAEL to accurately simulate 
interconnect transients. The accuracy of the simulator is validated by designing the 






6. Full-custom WPM multilevel interconnect network design
A full-custom design provides a significant amount of flexibility in terms of choice of
design parameters for a multilevel interconnect network. A pervasive application of
the WPM routing technique enables redesign of the entire multilevel interconnect
network and provides advantages in terms of reduction of wire area and reduction in
power dissipation. With the help of HR-MINDS, for a wire-area-centric approach,
close to 15% reduction in the wire area with only 4% increase in power and virtually
no loss in communication performance can be observed by application of this WPM
technique to a 40 million transistor system. On the other hand, for a power-centric
approach, a core-area-centric low-power design of a 40 million transistor system
shows a 30% reduction in power in addition to a reduction of close to 60% in core
area. Similarly, for a 40 million transistor system, a 30% reduction in power
is observed when the core area is kept fixed and the wire spacing is increased to fill
up the empty wire area. For the power-centric approach, too, there is no loss of
performance.
7. Semi-custom WPM multilevel interconnect network design 
Similar to the full-custom WPM design, the semi-custom WPM design is studied 
using HR-MINDS. A pervasive application of WPM routing enables reduction in 
wire area, which can be harnessed to reduce power dissipation and/or improve 
performance by reducing core area or increasing wire spacing. In a 40 million 
transistor system, for a core-area-centric design, the application of WPM routing 
reduces power by 14% and core area by 28% if the WPM design is operated at the 






frequency, a 6% reduction in power and 20% reduction in interconnect coupling 
capacitance are observed in a wire-coupling-centric design. For the performance-
centric approach, it is possible to improve performance of a 40 million transistor 
system by 4% and 7% using core area reduction and wire spacing increase, 
respectively. 
8. Physical design limits for WPM
The physical design limits for application of WPM routing are determined. The
quadratic programming approach, Lagrangian multiplier method and simulated
annealing are adopted to determine the placement solution of BLIF benchmark
circuits. Using these placement solutions, the opportunities for application of WPM
routing are studied. For benchmark circuits having close to 20K cells, a 10% reduction
in wirelength is observed. For larger systems having millions of transistors, there
will be more opportunity to apply WPM routing, and a larger reduction in wirelength
can be expected.
9. Application of WPM routing to a SPARC64 processor
The opportunity to apply WPM routing to an existing processor is explored. A
SPARC64 processor designed using 130 nm technology is considered for the case
study. There is significant opportunity for applying WPM routing to inter-core
interconnects without any change in performance of the SPARC64 processor.
Both SSWPM and DSWPM interconnect design techniques can be adopted to
redesign the inter-core interconnects. The application of WPM routing reduces
inter-core interconnect count without any loss in performance. The wire spacing between






10. Material solutions versus efficient design techniques for future GSI systems
A comparison between using new material solutions and using efficient design
techniques for future GSI design is presented. A 20 million transistor system is
considered and it is redesigned using a low-k dielectric as one option and using the
WPM routing technique as the second option. For the first option, a 10% reduction
in power and a 1.6% improvement in performance are observed with no change in core
area. On the other hand, for the second option, a 27% reduction in core area and a
9.1% reduction in power dissipation are observed with no loss of performance. Thus,
both options provide similar advantages and, hence, developing efficient design
techniques to improve future designs can reduce the pressure on process engineers
to develop new material solutions.
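The crosstalk condition stated in contribution 3 (the sampling pulse width must be at least the difference between the worst-case and best-case interconnect delays) can be expressed as a simple check. The following Python sketch is illustrative only. The function names and the delay values in picoseconds are hypothetical and are not taken from the simulations in this thesis.

```python
def delay_spread(worst_case_delay, best_case_delay):
    """Difference between worst-case and best-case interconnect delay."""
    return worst_case_delay - best_case_delay


def tolerates_crosstalk(pulse_width, worst_case_delay, best_case_delay):
    """True if the sampling pulse is at least as wide as the delay spread."""
    return pulse_width >= delay_spread(worst_case_delay, best_case_delay)


# Hypothetical values in picoseconds, chosen only to exercise the condition.
print(tolerates_crosstalk(pulse_width=150, worst_case_delay=400, best_case_delay=300))
print(tolerates_crosstalk(pulse_width=150, worst_case_delay=400, best_case_delay=200))
```

In the first call the delay spread (100 ps) fits within the pulse width, so the condition holds. In the second call the spread (200 ps) exceeds the pulse width and the condition fails.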
 
8.2. Future work 
 The future work of this thesis involves extending the WPM routing technique to 
operate as a multi-slot routing technique on multi-source multi-sink nets, using routing 
algorithms to study the run-length proximity constraints, evaluating the tolerance of 
WPM routing to manufacturing and process variations, and improving the accuracy of 
the second generation multilevel interconnect network design simulator. 
 
8.2.1. Multi-slot routing on multi-source multi-sink nets 
 The WPM routing technique proposed in Chapter 2 is a two-slot routing
technique and can be applied only to two-terminal nets. This routing technique cannot






insignificant. A routing technique can be developed that can transmit data to multiple
sinks on a shared interconnect in a wave-pipelined fashion. In addition, depending on the
intra-clock period idleness, the routing technique can be further enhanced to enable
multi-slot routing, in which more than two signals are transmitted over the shared
interconnect. This multi-slot routing technique will provide significant advantages in
terms of reduction in wire area, which can be further used to reduce power and/or improve
performance. In fact, the multi-slot routing technique can be extended to design high
throughput interconnect networks where data transmission rates are multiple times the
clock frequency.
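The idea of multi-slot routing can be pictured with a purely behavioral Python sketch, in which each clock cycle on the shared wire is divided into n slots and each source owns one slot. This is an abstract model under assumptions made here (ideal timing, one value per source per cycle, hypothetical function names). It does not model the wave-pipelined circuit itself.

```python
def multiplex(samples_per_source):
    """Interleave one value per source into consecutive slots of each clock cycle."""
    n_cycles = len(samples_per_source[0])
    wire = []
    for cycle in range(n_cycles):
        for source in samples_per_source:
            wire.append(source[cycle])
    return wire


def demultiplex(wire, n_slots):
    """Recover each source's stream by taking every n-th value from its slot."""
    return [wire[slot::n_slots] for slot in range(n_slots)]


def slot_rate_hz(clock_hz, n_slots):
    """Data transmission rate on the shared wire for n slots per clock cycle."""
    return clock_hz * n_slots


sources = [[1, 2], [3, 4], [5, 6]]  # three sources, two clock cycles each
wire = multiplex(sources)
print(wire, demultiplex(wire, 3), slot_rate_hz(1.3e9, 3))
```

With three slots at a 1.3 GHz clock, the shared wire in this model carries data at three times the clock rate, which is the sense in which the transmission rate becomes a multiple of the clock.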
 
8.2.2. Routing algorithms to study run-length proximity 
 A study of the source-sink proximity in BLIF benchmark circuits using 
GORDIAN placement [58] is presented in Chapter 7. As explained in the same chapter, 
in addition to source-sink proximity, interconnects satisfying run-length proximity 
constraints can also be redesigned as WPM interconnects. Various routing algorithms 
like maze routing, Steiner tree based routing etc. can be used to route the interconnects of 
benchmark circuits and evaluate the opportunities to apply WPM routing. The set of all 
interconnects that satisfy source-sink proximity and those that satisfy run-length 
proximity on different benchmark circuits can be used to determine the average value for 









8.2.3. Tolerance of WPM routing to manufacturing and process variations
 
 The effect of crosstalk noise, power supply noise and clock skew on the WPM
routing technique is discussed in Chapter 3. In addition to these, manufacturing
variations such as deviations in transistor and interconnect dimensions, and process-related
variations such as changes in dielectric constant, resistivity and threshold voltage, can also
affect the working of the WPM circuit. A comprehensive analysis of the effect of these
variations on the WPM circuit is necessary. Using Monte Carlo simulations, a study of
the effect of manufacturing and process variations on the WPM circuit can be completed.
The primary objective of this analysis would be to ensure correct transmission of data
signals over the shared WPM interconnect when subject to manufacturing and process
variations.
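A Monte Carlo study of this kind could be organized along the lines of the following Python sketch. It is a toy model and not a proposed methodology. It assumes that the interconnect delay scales with the product of resistivity and dielectric constant (as in a distributed RC wire), that both vary independently with Gaussian statistics, and that a trial passes when the delay stays within a hypothetical timing margin of its nominal value. Transistor variations and all numerical values are outside this model or are placeholders.

```python
import random


def simulate_yield(nominal_delay_ps, margin_ps, sigma_rho, sigma_eps,
                   trials=10000, seed=0):
    """Fraction of trials in which the delay deviation stays within the margin."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    passed = 0
    for _ in range(trials):
        # Relative variation of resistivity and dielectric constant (assumed Gaussian).
        rho_scale = 1.0 + rng.gauss(0.0, sigma_rho)
        eps_scale = 1.0 + rng.gauss(0.0, sigma_eps)
        # Assumed RC-dominated wire: delay proportional to rho * epsilon.
        delay = nominal_delay_ps * rho_scale * eps_scale
        if abs(delay - nominal_delay_ps) <= margin_ps:
            passed += 1
    return passed / trials


# Placeholder numbers: 300 ps nominal delay, 30 ps margin.
for sigma in (0.01, 0.05, 0.2):
    print(sigma, simulate_yield(300.0, 30.0, sigma, sigma))
```

A fuller study would replace the analytical delay expression with circuit simulation of the WPM sender and receiver and would include transistor dimension and threshold voltage variations.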
 
8.2.4. Enhancements to second generation HR-MINDS 
 There is scope to further improve the accuracy of the second generation system
simulator used to design the multilevel interconnect network. As explained in
Chapter 4, analytical models are used to determine the power dissipated by the system
under consideration. The capabilities of HSPICE to determine power dissipation can be
used to calculate the power dissipation more accurately. The simulator uses a wire efficiency
factor to account for vias, power lines and ground lines. To make the multilevel
interconnect network design more accurate, models for vias, power lines and ground lines
can be included in the design flow. In addition, with the trend towards multi-core systems,
the simulator can also be enhanced to design the multilevel interconnect network for a multi-







[1] G. Moore, “Cramming more components onto integrated circuits,” Electronics 
Magazine, vol. 38, pp. 114-117, Apr. 1965. 
 
[2] G. Moore, “Progress in digital integrated electronics,” Proc. IEDM, pp. 11-13, 
1975. 
 
[3] S. Borkar, “Obeying Moore’s law beyond 0.18 micron,” Proc. ASIC/SoC 2000, 
pp. 26-31. 
 
[4] International Technology Roadmap for Semiconductors (ITRS), 2004 update. 
[Online document], Available HTTP: http://public.itrs.net. Accessed: July 2004 
 
[5] C. McNairy and R. Bhatia, “Montecito: a dual core, dual-thread Itanium 
processor,” Proc. IEEE Micro, vol. 25, no. 2, pp. 10-20, Mar.-Apr. 2005. 
 
[6] J. Meindl, J. Davis, P. Zarkesh-Ha, C. Patel, K. Martin, and P. Kohl, 
“Interconnect opportunities for gigascale integration,” IBM Journal of Research
and Development, vol. 46, no. 2/3, pp. 245-255, 2002.
 
[7] J. Davis, R. Venkatesan, A. Kaloyeros, M. Bylansky, S. Souri, K. Banerjee, K. 
Saraswat, A. Rahman, A. Reif and J. Meindl, “Interconnect limits on gigascale 
integration (GSI) in the 21st century,” Proc. IEEE, vol. 89, pp. 305-324, Mar. 
2001. 
 
[8] J. Meindl, “Low power microelectronics: retrospect and prospect,” Proc. IEEE, 
vol. 83, pp. 619-635, Apr. 1995. 
 
[9] S. Thompson, N. Anand, M. Armstrong, C. Auth, B. Arcot, M. Alavi, P. Bai, J. 
Bielefeld, R. Bigwood, J. Brandenburg, M. Buehler, S. Cea, V. Chikarmane, C. 
Choi, R. Frankovic, T. Ghani, G. Glass, W. Han, T. Hoffmann, M. Hussein, P. 
Jacob, A. Jain, C. Jan, S. Joshi, C. Kenyon, J. Klaus, S. Klopcic, J. Luce, Z. Ma, 
B. Mcintyre, K. Mistry, A. Murthy, P. Nguyen, H. Pearson, T. Sandford, R. 
Schweinfurth, R. Shaheed, S. Sivakumar, M. Taylor, B. Tufts, C. Wallace, P. 
Wang, C. Weber, and M. Bohr, “A 90 nm logic technology featuring 50 nm 
strained silicon channel transistors, 7 layers of Cu interconnects, low k ILD and 1 
µm2 SRAM cell,” Proc. IEDM, pp. 61-64, Dec. 2002. 
 
[10] H. Bakoglu and J. Meindl, “Optimal interconnection circuits for VLSI,” IEEE 
Trans. Electron Devices, vol. ED-32, pp.903-909, May 1985. 
 








[12] R. Venkatesan, J. Davis, K. Bowman and J. Meindl, “Optimal n-tier multilevel 
interconnect architectures for gigascale integration (GSI),” IEEE Trans. VLSI 
Systems, vol. 9, no. 6, pp.899-912, Dec 2001. 
 
[13] Y. Cao, C. Hu, X. Huang, A. Kahng, S. Muddu, D. Stroobandt and D. Sylvester, 
“Effects of global interconnect optimizations on performance estimation of deep 
submicron design,” Proc. ICCAD 2000, pp. 56-61.
 
[14] K. Banerjee and A. Mehrotra, “A power-optimal repeater insertion methodology
for global interconnects in nanometer designs,” IEEE Trans. Electron Devices, 
vol. 49, no. 11, pp. 2001-2007, Nov. 2002. 
 
[15] V. Adler and E. Friedman, “Repeater design to reduce delay and power in 
resistive interconnect,” IEEE Trans. Circuits Syst. I, vol. 45, pp. 607–616, May 
1998. 
 
[16] A. Nalamalpu and W. Burleson, “A practical approach to DSM repeater insertion: 
Satisfying delay constraints while minimizing area and power,” Proc. ASIC/SOC 
2001, pp. 152–156. 
 
[17] G. Garcea, N. van der Meijs, R. Otten, “Simultaneous analytical area and power 
optimization for repeater insertion,” Proc. ICCAD 2003, pp. 568-573. 
 
[18] L. Cotton, “Maximum rate pipelined systems,” Proc. AFIPS Spring Joint 
Computer Conference 1969, pp. 581-586. 
 
[19] A. Singh, A. Mukherjee and M. Marek-Sadowska, “Interconnect pipelining in a 
throughput-intensive FPGA architecture,” Proc. FPGA 2001, pp. 153-160.
 
[20] M. Hashimoto, A. Tsuchiya and H. Onodera, “On-chip global signaling by wave-
pipelining,” Proc. IEEE Topical Meeting on Electrical Performance of Electrical 
Packaging 2004, pp. 311-314. 
 
[21] L. Zhang, Y. Hu and C. Chen, “Statistical timing analysis in sequential circuit for 
on-chip global interconnect pipelining,” Proc. DAC 2004, pp. 904-907. 
 
[22] P. Cocchini, “Concurrent flip-flop and repeater insertion for high performance 
integrated circuits,” Proc. ICCAD 2002, pp. 268-273. 
 
[23] V. Deodhar and J. Davis, “Voltage scaling, wire sizing and repeater insertion 
design rules for wave-pipelined VLSI global interconnect circuits,” Proc. ISQED 
2005, pp. 592-597. 
 
[24] V. Deodhar, “Throughput-centric wave-pipelined interconnect circuits for 








[25] W. Burleson, M. Ciesielski, F. Klass and W. Liu, “Wave-pipelining: A tutorial 
and research survey,” IEEE Trans. VLSI Systems, vol. 6, no. 3, pp.464-474, Sept. 
1998. 
 
[26] L. Zhang, Y. Hu and C. Chen, “Wave-pipelined on-chip global interconnect,” 
Proc. ASP-DAC 2005, pp. 127-132. 
 
[27] V. Deodhar and J. Davis, “Designing for signal integrity in wave-pipelined SoC 
global interconnects,” Proc. SOC 2005, pp. 207-210. 
 
[28] CoreConnect Bus Architecture. [Online document] Available HTTP: http://www-
306.ibm.com/chips/techlib/techlib.nsf/products/CoreConnect_Bus_Architecture. 
Accessed: October 2005. 
 
[29] AMBA Bus Architecture. [Online document] Available HTTP: 
http://www.arm.com/products/solutions/AMBA_Spec.html. Accessed: October 
2005 
 
[30] K. Lahiri, A. Raghunathan and G. Lakshminarayana, “LOTTERYBUS: A new 
high-performance communication architecture for system-on-chip designs,” Proc. 
DAC, 2001, pp. 15-20. 
 
[31] P. Pande, C. Grecu, M. Jones, A. Ivanov and R. Saleh, “Performance evaluation 
and design trade-offs for network-on-chip interconnect architectures,” IEEE 
Trans. Computers, vol. 54, no.8, pp. 1025-1040, Aug. 2005. 
 
[32] S. Kumar, A. Jantsch, J. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja 
and A. Hemani, “A network on chip architecture and design methodology,” Proc. 
IEEE Comp Soc Annual Symposium on VLSI 2002, pp. 105-112. 
 
[33] J. Liu, L-R. Zheng, D. Pamunuwa and H. Tenhunen, “A global wire planning 
scheme for network-on-chip,” Proc. ISCAS 2003, vol.4, pp IV-892 – IV-895. 
 
[34] P. Bhojwani and R. Mahapatra, “Interfacing cores with on-chip packet switched 
networks,” Proc. VLSI design 2003, pp. 382-387. 
 
[35] J. Liu, M. Shen, L-R. Zheng and H. Tenhunen, “System level interconnect design 
for network-on-chip using interconnect IPs,” Proc. SLIP 2003, pp. 117-124. 
 
[36] J. Davis, V. De and J. Meindl, “A stochastic wire-length distribution for gigascale
integration (GSI)—Parts I and II,” IEEE Trans. Electron Devices, vol. 45, no. 3, 







[37] V. Deodhar and J. Davis, “Optimization for throughput performance for low
power VLSI interconnects,” IEEE Trans. VLSI Systems, vol. 13, no. 3, pp. 308-318,
Mar. 2005.
 
[38] A. Naeemi, “Analysis and optimization for global interconnects for gigascale 
integration (GSI),” Ph.D. Thesis, Georgia Institute of Technology, Atlanta, 2003. 
 
[39] T. Sakurai, “Closed-form expressions for interconnection delay, coupling, and 
crosstalk in VLSI’s,” IEEE Trans. Electron Devices, vol. 40, no. 1, Jan. 1993. 
 
[40] J. Xu and W. Wolf, “A Wave-pipelined on-chip interconnect structure for 
networks-on-chips,” Proc. HOTi 2003, pp. 10-14. 
 
[41] J. Huang, S. Tu and J. Jou, “On-chip bus encoding for LC cross-talk reduction,” 
Proc. VLSI-DAT 2005, pp. 233-236. 
 
[42] K. Patel and I. Markov, “Error-correction and crosstalk avoidance in DSM 
busses,” IEEE Trans. VLSI Systems, vol. 12, no. 10, pp. 1076-1080, Oct. 2004.
 
[43] M. Anders, N. Rai, R. Krishnamurthy and S. Borkar, “A transition-encoded 
dynamic bus technique for high-performance interconnects,” IEEE Trans. Solid 
State Circuits, vol. 38, no. 5, pp. 709-714, May 2003. 
 
[44] J. H. Chern, J. Huang, L. Arledge, P-C Li and P. Yang, “Multilevel metal 
capacitance models for CAD design synthesis systems,” IEEE Electron Device 
Letters, vol. 13, No. 1, pp. 32-34, Jan. 1992. 
 
[45] R. Venkatesan, “Multilevel interconnect architectures for gigascale integration 
(GSI),” Ph.D. Thesis, Georgia Institute of Technology, Atlanta, 2003. 
 
[46] L. Codrescu, S. Nugent, J. Meindl and D. Wills, “Modeling technology impact on 
cluster microprocessor performance,” IEEE Trans. VLSI Systems, vol. 11, no. 5, 
pp.909-920, Oct. 2003. 
 
[47] B. M. Geuskens and K. Rose, “Modeling Microprocessor Performance,” Reading: 
Boston, MA: Kluwer Academic, 1998. 
 
[48] R. Mangaser, C. Mark, and K. Rose, “Interconnect constraints on BEOL 
manufacturing,” in Proc. Adv. Semicond. Manufact. Conf., 1999, pp. 304–308. 
 
[49] D. Sylvester and C. Hu, “Analytical modeling and characterization of deep-
submicrometer interconnect,” Proc. IEEE, vol.89, no. 5, pp. 634-664, May 2001. 
 
[50] Y. Cao, C. Hu, X. Huang, A. Kahng, I. Markov, M. Oliver, D. Stroobandt and D. 






extrapolation in the GTX system,” IEEE Trans. VLSI Systems, vol. 11, no. 1, 
pp.3-14, Feb. 2003. 
 
[51] P. Zarkesh-Ha, J. Davis and J. Meindl, “Prediction of net-length distribution for 
global interconnects in a heterogeneous system-on-a-chip,” IEEE Trans. VLSI 
Systems, vol. 8, no. 6, pp.649-659, Dec. 2000. 
 
[52] V. Venkatraman, A. Laffely, J. Jang, H. Kukkamalla, Z. Zhu and W. Burleson, 
“NoCiC: A spice-based interconnect planning tool emphasizing aggressive on-
chip interconnect circuit methods,” Proc. SLIP 2004, pp. 69-75. 
 
[53] J. A. Davis, “A hierarchy of interconnect limits for gigascale integration,” Ph.D. 
dissertation, Georgia Inst. Technology, Atlanta, GA, 1999.
 
[54] W. E. Donath, “Wire length distribution for placement of computer logic,” IBM J. 
Res. Development, vol.2, no.3, pp. 152-155, May 1981. 
 
[55] G. A. Sai-Halasz, “Performance trends in high end processors,” Proc. IEEE, 
vol.83, pp. 18-34, Jan. 1995. 
 
[56] Berkeley Predictive Technology Model (BPTM) (http://www-
device.eecs.berkeley.edu/~ptm/introduction.html). Accessed: July 2004. 
 
[57] M. Pedram, “Leakage power modeling and minimization,” Tutorial, ICCAD 2004, 
Nov. 2004. 
 
[58] J. Kleinhans, G. Sigl, F. Johannes and K. Antreich, “GORDIAN: VLSI placement 
by quadratic programming and slicing optimization,” IEEE Trans. CAD vol. 10, 
no. 3, pp. 356-365, Mar. 1991. 
 
[59] W. Hou, X. Hong, W. Wu and Y. Cai, “FaSa: A fast and stable quadratic 
placement algorithm,” Proc. IEEE, pp. 1391-1395, 2002. 
 
[60] N. Sherwani, “Algorithms for VLSI physical design automation,” Reading: 
Kluwer Academic Publishers, 1999. 
 
[61] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. 
Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. 
Yamashita and H. Sugiyama, “A 1.3-GHz fifth-generation SPARC64 
microprocessor,” IEEE Journal of Solid State Circuits, vol. 38, no. 11, pp. 1896-






LIST OF PUBLICATIONS 
 
[1] A. Joshi and J. Davis, “A 2-slot time-division multiplexing (TDM) interconnect 
network for gigascale integration (GSI),” Proc. IEEE/ACM SLIP Workshop 2004, 
pp. 64-68. 
 
[2] A. Joshi and J. Davis, “Wave-pipelined 2-slot time division multiplexed (WP/2-
TDM) routing,” Proc. GLSVLSI 2005, pp.446-451. 
 
[3] A. Joshi and J. Davis, “Wave-pipelined multiplexed (WPM) routing for gigascale 
integration (GSI),” IEEE Trans. VLSI Systems, vol. 13, no. 8, pp. 899-910, August 
2005.  
 
[4] A. Joshi and J. Davis, “Gigascale ASIC/SoC design using wave-pipelined 
multiplexed (WPM) routing,” Proc. IEEE-SOCC 2005, pp. 139-142. 
 
[5] J. Davis, V. Deodhar and A. Joshi, “The impact of wave pipelining on future 
interconnect technologies,” Proc. AMC 2005. 
 
[6] A. Joshi, V. Deodhar and J. Davis, “Low power multilevel interconnect networks 
using wave-pipelined multiplexed (WPM) routing,” Proc. Intl. Conference on 
VLSI Design 2006, pp. 773-776. 
  
[7] D. Sekar, R. Venkatesan, K. Bowman, A. Joshi, J. Davis and J. Meindl, “Optimal 










Ajay Joshi was born in Mumbai, India. He received his Bachelor of Engineering (B.E.)
degree (2001) in computer engineering from the University of Mumbai. In 2001, he joined
the Advanced Interconnect Modeling and Design (AIMD) research group led by Dr.
Jeffrey Davis at the Georgia Institute of Technology (Georgia Tech). He received his Master
of Science (M.S.) degree (2003) in electrical engineering from Georgia Tech. In 2003, he
worked as an intern with the Post Silicon Validation Debug group at Intel
Corporation, Santa Clara, CA, where he was responsible for developing two system
debug tools. Over the past four years, he has co-authored seven publications in
international conferences and refereed journals. His research interests include designing
low-power and high-performance interconnect routing algorithms and system-level
analysis.
 
