University of Central Florida

STARS
Retrospective Theses and Dissertations
2000

Signal integrity in deep submicron CMOS chip design
Jignesh Suresh Sonchhatra
University of Central Florida

Part of the Systems and Communications Commons

Find similar works at: https://stars.library.ucf.edu/rtd
University of Central Florida Libraries http://library.ucf.edu
This Masters Thesis (Open Access) is brought to you for free and open access by STARS. It has been accepted for
inclusion in Retrospective Theses and Dissertations by an authorized administrator of STARS. For more information,
please contact STARS@ucf.edu.

STARS Citation
Sonchhatra, Jignesh Suresh, "Signal integrity in deep submicron CMOS chip design" (2000). Retrospective
Theses and Dissertations. 1989.
https://stars.library.ucf.edu/rtd/1989

SIGNAL INTEGRITY IN DEEP SUBMICRON CMOS
CHIP DESIGN

By

JIGNESH SURESH SONCHHATRA
B.S(EE), University of Bombay, India, 1997

A thesis submitted in partial fulfillment of the requirements
For the degree of Master of Science
In the college of Engineering and Computer Science
At the University of Central Florida
Orlando, Florida

Fall Term
2000

Major Professor

Dr. J. S. Yuan

ABSTRACT
Advancement in CMOS technology has become a driving force in the advancement of
today's IC design arena. In the past few years, considerable research has been done on
the CMOS devices and circuits. Constant efforts have been made to realize smaller and
smaller devices by reducing the channel length of the transistors and scaling down
various other device parameters. Consequently, various problems have arisen such as
interconnect delay, signal integrity and signal coupling.

The purpose of this thesis is to review and understand current problems in IC design and
come up with various solutions to them. Efforts have been made to propose a model of
interconnect which demonstrates the effects of parasitic components on the chip. Signal
coupling effects have been demonstrated by simulating various RC and RLC models for
interconnects. The impact of parasitic inductance on the performance of the chip is
understood with the help of simulation results.

The design of an eight bit shifter is realized using Clockless Asynchronous and Clocked
Boolean design techniques. Both the chips are placed and routed using Silicon Ensemble,
a high-tech CAD tool from Cadence Design Systems. Various optimization techniques
have been applied to both the prototypes and a detailed comparison has been done
considering factors such as area of the chip, total length of the interconnect, row
utilization, chip congestion etc. Based on these results, it was found that Clocked Boolean
Shifter was compact, compared to its Asynchronous counterpart design. However,
Asynchronous Clockless architectures are well recommended where complex chip
functionalities are intended to be integrated without much of the hassles of timing
problems.

To my parents with love ...

ACKNOWLEDGEMENTS
Graduate student life at UCF Orlando has been a rich, rewarding experience and I will
sorely miss it once I leave this illustrious university. I have been fortunate enough to
meet one of the greatest goals in my life: getting an advanced education in a high-tech
field such as VLSI design in Electrical Engineering, a dream which I have cherished through
many years of my undergraduate and graduate studies. Over the years, I had the privilege
of meeting many distinguished faculty members and intelligent colleagues, and I learnt
much in the process. It is a pleasure to acknowledge and express my gratitude to those
who have had an impact on me, both during my time here and in times past.

I am deeply grateful to Professor J. S. Yuan, not only for being my advisor all through my
graduate education, but also for being a godfather in the making of my career in the field
of my interest. It is solely because of his guidance and research interests that my
interest in this great field was enhanced. While pursuing my Masters and doing research
with him, I got a distinguished opportunity to advance myself in various cutting-edge
research areas such as analog, digital and mixed circuit design and high performance chip
design. He gave me guidance and moral support when I was a young graduate student
and gave me freedom as I progressed through the learning and research process. My
weekly meetings with him were never wasted, whether we were talking about my
research, career choices, latest technological trends or about other cultures, my group
being multicultural. He has been an excellent instructor for the classes I took from him
and an excellent advisor, who always gave all his graduate and post-graduate students
the freedom to work at their own schedule. Most amazingly, he did all of this while managing
three different research teams under him with distinguished attention to individual
progress and team progress, with all the additional curriculum work that entailed,
being a distinguished professor in the EECS department at UCF. Nonetheless, he made a
very good manager. Thanks Dr. Yuan, you made graduate school worth it.

I thank Professors DeMara and K. Sundaram for serving on my thesis committee, for
providing timely guidance and feedback on my thesis, and for encouraging me with
great moral support.

This work is a part of a project with Theseus Logic Inc., a company with innovative
ideas about clockless logic design, located in Orlando, Florida. Working on this project
was a real IC design experience. I would like to thank Theseus Logic Inc., especially
Ken Meekins, for all his time and invaluable technical expertise.

I would like to thank Don Harper, our system administrator, for providing technical
support with high-tech CAD tools to our lab and making it easier for me to use those
tools.

It is impossible to completely describe how important my family was in a single
paragraph, so I won't even try. Needless to say, without their efforts, hard work and
encouragement, it would have been impossible for me to pursue advanced education in a foreign
country, leaving them back home to miss me and, though it was difficult,
morally supporting me across thousands of miles. My parents and siblings were a never-ending
source of support and encouragement at every phase of my education. They
always told me that anything was possible and nothing was impossible, and that to gain
something, a person has to sacrifice something. I knew they would always be there for
me if I needed them, which encouraged me to advance myself in every venture of my life.
Mom, Dad, Ketal, Khushbu, thanks for everything.

And last, but by no means least, I would like to thank Ms. Swarupa Purandare (Kuku,
that's what I call her with love!) who is not only my younger sister, but a best friend and
next-of-kin, for providing me with everything that it takes to compose a Masters thesis,
apart from technical know-how.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 - INTRODUCTION
    Motivation
    Interconnect Scaling
        Local and Global wires

CHAPTER 2 - SIGNAL INTEGRITY
    Introduction to Signal Integrity
    Clock Skew in Synchronous designs
    Other Timing problems
    Clock Synchronization in Synchronous designs
        Single Phase clocking
        On-chip multiple Clock generation
    Clock Phase synchronization in Multiphase environments
        Adaptive Skew Control
        Phase Locked Loop
        Delay Locked Loop
    Coupling Capacitance and Noise
    Crosstalk
        Analytical modeling of Crosstalk Noise
            Model derivation
    Dynamic delay
    Low Power Synchronous and Asynchronous designs
    Capacitive Load of the network
    Power Reduction schemes in Synchronous systems
        Design of Clock drivers
        Reduced Clock swing

CHAPTER 3 - INTERCONNECT MODELING IN DSM
    Introduction to the topic
    Differences in On-Chip Inductance Consideration
    Inductive Reactance
    Frequency dependent Resistance and Inductance
    When do we need to consider On-Chip Inductance?
    On-chip design solutions to cope with Inductive Coupling
        Dedicated Ground wires
        Differential Signals
        Buffer Insertion
        Splitting Wires
        Continuous Power/Ground planes

CHAPTER 4 - REAL CHIP DESIGN
    ASIC design flow
    Floorplanning
    Clock planning in Synchronous designs
    Placement
    Routing
        Global routing
        Measurement of Interconnect delay
        Global routing methods
        Power routing
    Circuit Extraction

CHAPTER 5 - CONCLUSION
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
LIST OF REFERENCES

LIST OF TABLES

Table 1.1  Ideal Scaling of MOS transistors follows these generalized rules
Table 1.2  Common Scaling scenarios for local and global interconnect
Table 5.1  Comparison between Boolean and NCL Shifter Chips

LIST OF FIGURES

Figure 2.1  Explaining Clock Skew
Figure 2.2(a)  Distributed clock driving network
Figure 2.2(b)  Clock trunk approach
Figure 2.3(a)  Example in which the data delay exceeds a clock period
Figure 2.3(b)  Timing diagram for Figure 2.2(a)
Figure 2.4  Single-phase clock system and its timing diagram
Figure 2.5  Example of a scan-chain in a complex VLSI circuit
Figure 2.6  Two-phase clock system
Figure 2.7(a)  Two-phase clock generator
Figure 2.7(b)  Clock waveforms
Figure 2.8  Basic concept for phase-locked loop
Figure 2.9  Multi-clock generator, based on PLL
Figure 2.10  Clock phase synchronization to compensate for different clock skews inside different cores
Figure 2.11  Graph of Coupling capacitance with technology
Figure 2.12  Simplified circuit schematic
Figure 2.13  The impact of Dynamic delay on Coupling capacitance
Figure 2.14  Power Dissipation vs. Clock frequency
Figure 2.15  The clock network
Figure 2.16  Principle and CMOS implementation of complementary-phase clock driver
Figure 3.1  Plot of resistance and inductive reactance vs. length of the wire
Figure 3.2  Two signal wires and one ground wire
Figure 3.3(a)  Two interconnect structures (with no ground)
Figure 3.3(b)  Two interconnect structures (with ground plane)
Figure 3.4(a)  Four signal wires on top of orthogonal wires
Figure 3.4(b)  Cross-sectional view with parasitic capacitances
Figure 3.5  The equivalent RLC circuit for the four-wire structure
Figure 3.6(a)  Original structure with orthogonal layer
Figure 3.6(b)  The conductors replaced by resistors in Figure 3.6(a)
Figure 3.6(c)  The resistors are neglected
Figure 3.6(d)  Extra caps are introduced between wires after network reduction
Figure 3.7  CMOS inverter as our basic component for simulations
Figure 3.8(a)  Circuit Schematic for simulation with RC interconnect model
Figure 3.8(b)  Simulation results of circuit schematic Figure 3.8(a)
Figure 3.8(c)  Magnified view of Figure 3.8(b)
Figure 3.9  Simple RLC circuit model for an interconnect line
Figure 3.10(a)  A CMOS inverter driving the equivalent characteristic impedance of an RLC transmission line
Figure 3.10(b)  A CMOS inverter driving an RC approximation of a transmission line
Figure 3.11(a)  Circuit Schematic for simulation with RLC interconnect model
Figure 3.11(b)  Simulation results of circuit schematic Figure 3.11(a)
Figure 3.11(c)  Magnified view of Figure 3.11(b)
Figure 3.12  Two wires: Wire 1 is driven by a driver and Wire 2 provides current return
Figure 3.13  Buffer insertion in an RLC line for the minimization of inductive coupling as well as capacitive coupling and propagation delay
Figure 3.14(a)  A signal wire has been split into two 2.5 µm parallel wires
Figure 3.14(b)  A 7.5 µm wide wire
Figure 3.14(c)  A signal wire has been split into four parallel wires
Figure 3.14(d)  A signal wire split with a dedicated ground wire
Figure 3.15  Continuous power/ground planes on-chip
Figure 4.1  ASIC design flow
Figure 4.2(a)  Initial floorplan generated by floorplanning tool
Figure 4.2(b)  Estimated placement for flexible blocks
Figure 4.2(c)  Moving blocks to improve the floorplan
Figure 4.2(d)  Updated display showing the reduced congestion
Figure 4.3(a)  Initial floorplan generated by the floorplanning tool
Figure 4.3(b)  An estimated placement for flexible blocks A and C
Figure 4.3(c)  Moving blocks to improve the floorplan
Figure 4.3(d)  Updated display showing reduced congestion after changes
Figure 4.4  Routing a T-junction between two channels in two-level metal
Figure 4.5  Defining the channel routing order for a slicing floorplan using slicing tree
Figure 4.6(a)  A nonslicing floorplan with a cyclic constraint
Figure 4.6(b)  Alternative solution to Figure 4.6(a)
Figure 4.6(c)  Sliceable floorplan
Figure 4.7  Channel definition
Figure 4.8  Clock distribution
Figure 4.9  A clock tree
Figure 4.10(a)  Floorplan view of NCL Shifter with Power and Input/Output pins Placed
Figure 4.10(b)  Floorplan view of Clocked Boolean Shifter with Power, Input/Output pins and Cells Placed
Figure 4.11  Placement using trees on graph
Figure 4.12(a)  Interconnect length measurement using Complete-graph measure
Figure 4.12(b)  Interconnect length measurement using the Half-perimeter measure
Figure 4.13(a)  NCL Shifter with Power, Input/Output pins and Cells Placed. Each Row of Cells is connected with Power Rings
Figure 4.13(b)  Clocked Boolean Shifter with Power, Input/Output pins and Cells Placed. Each Row of Cells is connected with Power Rings
Figure 4.14  Measuring the delay of a net with the help of a simple circuit
Figure 4.15(a)  The complete NCL Shifter Chip
Figure 4.15(b)  The complete Clocked Boolean Shifter Chip
Figure 4.16  Optimized NCL Shifter Chip with 90% Row Utilization

CHAPTER 1
INTRODUCTION

Motivation

Advances in CMOS technology over the past 25 years have led to tremendous growth in
the performance of integrated circuits. The concept of CMOS scaling refers to the
miniaturization of MOS transistors in a systematic manner such that the new, smaller
devices are faster, more power efficient, and reliable. There are several branches of
scaling, including ideal, constant-voltage, and quasi-ideal. In general, if the MOSFET
channel length (Lchannel) and oxide thickness (tox) are scaled proportionately, faster
devices can be obtained. In today's technology, the supply voltage has dropped
considerably. As a result, efforts have been made to reduce the threshold voltage
(Vth) to maintain sufficient current drive with the dropping supply voltage. The most
fundamental limitations to scaling devices are the exponential increase in leakage
current with reduced Vth and the difficulty of fabricating ultra-thin gate oxides. However,
the performance improvement of CMOS devices is still expected to continue in the time
ahead. A scaling factor, S, defines the change in a certain physical parameter (e.g.
threshold voltage Vth) from one technology generation to the next. Changing the channel
length Lchannel from 0.35 µm to 0.25 µm gives S a value of approximately 0.71 for Lchannel.
A typical scale factor is 0.7.

The delay of a transistor, $\tau_{delay}$, can be modeled to first order using the expression:

$$\tau_{delay} = \frac{C_L \, V_{swing}}{I_{drive}} \qquad (1.1)$$

Here $C_L$ is the load capacitance, $V_{swing}$ is the voltage swing of interest (50% of the supply
voltage), and $I_{drive}$ is the drive current of the device. By examining how these three
parameters scale from one process to the next, we can estimate how device delay will be
affected as well.
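
As a rough numerical illustration of Equation (1.1), the following Python sketch evaluates the first-order delay for one generation and for the next. The component values are hypothetical and chosen only for illustration; the sketch also assumes that the load capacitance, the supply voltage and the device width all shrink by S while the drive current per unit width stays constant.

```python
def gate_delay(c_load, v_dd, i_drive):
    """First-order delay of Eq. (1.1): tau = C_L * V_swing / I_drive.

    V_swing is taken as 50% of the supply voltage, as in the text.
    """
    v_swing = 0.5 * v_dd
    return c_load * v_swing / i_drive


# Hypothetical starting point (illustrative values, not from the thesis).
C_LOAD = 10e-15      # load capacitance, farads
V_DD = 2.5           # supply voltage, volts
I_PER_UM = 600e-6    # drive current per micron of device width, amperes
WIDTH_UM = 1.0       # device width, microns
S = 0.7              # typical scale factor

tau_old = gate_delay(C_LOAD, V_DD, I_PER_UM * WIDTH_UM)

# Next generation: capacitance, supply and width shrink by S, while the
# drive current per unit width is held constant (assumption stated above).
tau_new = gate_delay(C_LOAD * S, V_DD * S, I_PER_UM * WIDTH_UM * S)

print(f"old delay: {tau_old * 1e12:.1f} ps")
print(f"new delay: {tau_new * 1e12:.1f} ps")
print(f"ratio:     {tau_new / tau_old:.2f}")
```

Under these assumptions the ratio of the two delays equals the scale factor S, independent of the particular starting values.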

Table 1.1 presents a generalized look at ideal scaling for deep submicron (DSM, feature
size < 0.35 µm) processes. It has been documented that, beginning at about the 0.35 µm
generation, further increases in drive current (normalized to device width) are difficult to
achieve because of velocity saturation, mobility degradation, and parasitic source-drain
resistance [1.2]. However, even with constant drive current we see that transistor delay
decreases by S each generation, yielding faster devices with each technology shrink. In
addition, power consumption per device is reduced approximately quadratically due to
the drop in gate capacitance, provided the channel length, Lchannel, the channel width, W, and the
oxide thickness, tox, are scaled proportionately and combined with lower supply voltages. In
the quest for additional functionality, chip area is rising slowly with technology
advancement. Thus it is anticipated that the total device power consumption will increase
as a result of efforts to realize smaller and faster devices.

Scaled Parameter            Ideal Scaling Factor
W, L, Tox, Xj               S
Substrate Doping, Nsub      1/S
Voltages: Vth, Vdd          S
Idsat (for fixed W/L)       S
Cgate                       S
Ron (∝ Vdd / Idsat)         1
τ (∝ CV/I)                  S
Power (∝ CV^2 f)            S^2
Power · τ                   S^3
Area / Device               S^2
Power / Area                1

Table 1.1 Ideal Scaling of MOS transistors follows these generalized rules
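
The derived rows of Table 1.1 follow from the primary ones by simple exponent bookkeeping. The sketch below is a minimal illustration, not part of the original work: each quantity is represented by the exponent n in S^n, and the operating frequency is assumed to scale as 1/τ.

```python
# Each quantity is represented by its exponent n in S**n under ideal scaling.
# Primary rows taken from Table 1.1.
base = {"W": 1, "L": 1, "C_gate": 1, "V": 1, "I_dsat": 1}

derived = {}
# R_on is proportional to V / I
derived["R_on"] = base["V"] - base["I_dsat"]
# tau is proportional to C * V / I
derived["tau"] = base["C_gate"] + base["V"] - base["I_dsat"]
# Power is proportional to C * V^2 * f, with f assumed to scale as 1 / tau
derived["power"] = base["C_gate"] + 2 * base["V"] - derived["tau"]
# Power-delay product
derived["power_delay"] = derived["power"] + derived["tau"]
# Area per device is proportional to W * L
derived["area"] = base["W"] + base["L"]
# Power density
derived["power_density"] = derived["power"] - derived["area"]

for name, exponent in derived.items():
    print(f"{name:14s} scales as S^{exponent}")
```

An exponent of 0 corresponds to an entry of 1 in the table, that is, a quantity that does not change from one generation to the next.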

Interconnect Scaling

In direct contrast to the performance advantages inherent in MOS transistor scaling is the
phenomenon of interconnect "reverse" scaling. Reverse scaling refers to the concept that
smaller interconnections actually yield larger delays due to the rapidly shrinking
cross-sectional area of the wire that is used to conduct current.

Local and Global Wires

Interconnect scaling can be broken down into two distinct components, local and global
scaling. The distinction between the two can be made by first defining a local wire as a
connection within a functional unit that spans only a small number of gate dimensions
(known as a gate pitch). These local wires tend to be on the length scale of 50-500 µm in
current technologies (0.18 µm to 0.25 µm). Global wires serve to connect separate
functional units and can have significantly larger wirelengths. One way to view this is that the
length scale of local wires is set by the size of an individual gate, which is very small. On
the other hand, global wirelengths are set by both the size of a functional unit and the size
of the entire chip, since they must span at least one functional unit and could attain a
chip-side in length for the limiting cases. These definitions serve to highlight the primary
difference between the two major types of on-chip wiring.

Interconnect scaling approaches for local and global wires are summarized in Table 1.2.
Two common scenarios are shown for both local and global wires; the ideal scaling rules
are valid for both types of wires. In the case of local interconnections, a quasi-ideal
approach to scaling can be taken which scales vertical and horizontal backend parameters
differently. Also, global wiring may employ a constant-dimension type of 'scaling' rather
than the ideal case.

                               Local Wiring                                 Global Wiring
                               Ideal Scaling  Quasi-ideal Scaling           Ideal Scaling  Constant Dimension Scaling
Linewidth & Spacing            S              S                             S              1
Wire Thickness                 S              √S                            S              1
ILD Thickness                  S              √S                            S              1
Wirelength                     S              S                             1/√S           1/√S
Resistance (per unit length)   1/S^2          1/S^(3/2)                     1/S^2          1
Capacitance (per unit length)  1              ~1 (possible small increase)  1              1
RC delay                       1              √S                            1/S^3          1/S
Current density                1/S            1/√S                          1/S            S

Table 1.2 Common Scaling scenarios for local and global interconnect

The ideal scaling rules for interconnect basically serve to provide sufficient packing
density for highly integrated designs. Since gates are being rapidly scaled down in size,
more and more wires are needed for communication. Therefore, S decreases both
linewidth and spacing in each generation. The wire and inter-level dielectric (ILD)
thicknesses are also reduced by S for ease of process integration and roughly constant
capacitances (per unit length). Local wires see a reduction in the line length by the scale
factor S. This is due to the reduction in gate pitch by S2 . The wirelength reduction in local
interconnects results in a constant RC delay.

In quasi-ideal scaling, the vertical dimensions are scaled more slowly than the horizontal
dimensions, resulting in a tall and narrow geometry over time. For example, with S = 0.7
the line and ILD thicknesses are scaled by only 0.837. Starting with a square wire, after
two generations, the scaled wire is 43% taller than it is wide. The performance
advantages of this scaling scenario include the preservation of packing density since the
horizontal dimensions (width and space) are still fully scaled. In addition, the quadratic
increase in resistance per unit length is dampened by the slightly larger line thickness.
This leads to a better RC delay scale factor of √S, which more closely tracks the increase
in transistor switching speed.
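
The figures quoted for quasi-ideal scaling can be checked with a few lines of Python. This is only a sketch: it assumes a wire that starts out square, a capacitance per unit length that stays constant, and a resistance per unit length that is inversely proportional to the wire cross-section.

```python
import math

S = 0.7                      # horizontal scale factor per generation
vertical = math.sqrt(S)      # vertical (thickness) scale factor, about 0.837

# Start from a square wire (normalized width = thickness = 1)
# and apply two generations of quasi-ideal scaling.
width, thickness = 1.0, 1.0
for _ in range(2):
    width *= S
    thickness *= vertical

aspect_ratio = thickness / width
print(f"vertical scale factor:  {vertical:.3f}")
print(f"aspect ratio after two generations: {aspect_ratio:.2f}")

# RC delay factor for one generation of a local wire:
# (R per unit length) * (C per unit length) * length^2
r_per_length = 1.0 / (S * vertical)   # cross-section shrinks by S * sqrt(S)
c_per_length = 1.0                    # assumed roughly constant
length = S                            # local wirelength scales by S
rc_factor = r_per_length * c_per_length * length ** 2
print(f"RC delay factor: {rc_factor:.3f} (sqrt(S) = {math.sqrt(S):.3f})")
```

The aspect ratio after two generations is 0.7 / 0.49, which is the 43% figure quoted above, and the RC delay factor reduces to √S.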

There are two major problems with the quasi-ideal scaling approach. First, the increase in
line thickness results in a higher aspect ratio (defined as the ratio of line thickness to
linewidth) which yields more coupling capacitance (Cc) to the neighboring wires. The
rise in Cc leads to enhanced coupled noise effects which degrade both signal integrity and
delay predictability. The second problem is that manufacturing high aspect ratio lines is
difficult since a deep and narrow trench must be completely filled with metal to eliminate
possible opens or resistance fluctuations. For this reason, lines with aspect ratios of
greater than 2 or 2.5 can be hard to reliably mass-produce.

The ideal scaling scenario for global wires is set by the chip-side length (as well as
functional unit size), which is not shrinking as gate dimensions are. As a result, the global
RC delay scales as 1/S^3 rather than remaining constant. To arrive at this figure, global
wirelength is assumed to scale up as 1/√S, or by about 20% per generation. Using a typical
scale factor of 0.7, this translates to an unacceptable 192% rise in RC delay from one
technology to the next. In order to combat these rising RC delays, a constant dimension
scaling approach to global wires has been suggested in [1]. Since the primary back-end
parameters are not reduced, it is not actually an approach to scaling. The main concept is that
by maintaining wide and thick wires at the upper metallization levels, a low RC product
can be kept for global wires. As seen in Table 1.2, the global RC product only rises by
1/S in this scenario and, combined with the introduction of new interconnect materials, can
be further limited. A positive side effect of constant dimension scaling is the suppression
of noise effects since line spacing is not reduced. The foremost drawback to this approach
is the potential lack of routing resources since packing density is being sacrificed for the
sake of performance. The different length scaling of global and local wires is the main
reason behind the evolving paradigm shift in chip design from device-centric to
interconnect-centric.
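
The two global-wire scenarios can be compared numerically. The helper `rc_scale` below is a hypothetical name introduced only for this sketch; it assumes the resistance per unit length is inversely proportional to the wire cross-section and that the capacitance per unit length is constant.

```python
import math


def rc_scale(width, thickness, length, c_per_length=1.0):
    """Relative RC delay of a wire: (R/len) * (C/len) * len^2,
    with R/len taken as 1 / (width * thickness)."""
    return c_per_length * length ** 2 / (width * thickness)


S = 0.7
global_length = 1.0 / math.sqrt(S)   # global wires grow by about 20%

# Ideal scaling: width and thickness both shrink by S.
ideal = rc_scale(S, S, global_length)

# Constant dimension scaling: width and thickness are left unchanged.
constant_dim = rc_scale(1.0, 1.0, global_length)

print(f"wirelength growth:        {(global_length - 1) * 100:.0f}%")
print(f"ideal scaling RC rise:    {(ideal - 1) * 100:.0f}%")
print(f"constant dimension rise:  {(constant_dim - 1) * 100:.0f}%")
```

With S = 0.7 the ideal scenario gives a factor of 1/S^3 and the constant dimension scenario a factor of 1/S, which matches the corresponding RC delay entries of Table 1.2.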

CHAPTER 2
SIGNAL INTEGRITY

Introduction

The increase in complexity of ICs over the last decades has enabled the integration of
millions of transistors on a single die. This has resulted in complete systems on a chip
(SoC). Today's designers have been quite successful in designing sophisticated ICs. With
the increasing complexity of the chips, various problems have arisen, such as
crosstalk, noise, clock skew, supply network and supply decoupling, interference and
EMC. The increased manifestation of these effects on a chip is threatening the signal
integrity of deep-submicron ICs in different ways. Many of the problems are related to
the back-end of the manufacturing process, the formation of interconnections, which
is starting to dominate the IC's performance and signal integrity.

Clock skew in Synchronous Designs

Clock skew is the difference in path delays from a clock input to each of the clock's
loads. For example, Figure 2.1 shows that it takes 4 ns for MY_CLK to travel to FF1, but
it takes 12 ns to travel to FF2. In this case, there is 8 ns of positive clock skew. If the
register-to-register delay between FF1 and FF2 is less than 8 ns, then the delayed clock
edge will end up clocking the DATA into FF2 on the first clock pulse, as if FF1 were
not in the picture. This will, of course, cause some undesirable circuit behavior. If the
delays were reversed (that is, MY_CLK reached FF2 first), then it can cause FF2 to
become metastable and produce unpredictable outputs. These tiny differences in
propagation delay, when compounded across all the clock nets in a complex digital
system, often lead to unacceptable degradations in overall system-timing margins. The
problems created by clock skew are independent of the clock speed. A design with clock
skew will fail at 10 Hz or at 100 MHz.
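
The race condition described above can be written as a simple check. This is a simplified sketch: the function names are hypothetical, and setup time, hold time and clock-to-output delay of the flip-flops are ignored. The register-to-register delays of 5 ns and 10 ns are illustrative values only.

```python
def clock_skew(arrival_launch_ns, arrival_capture_ns):
    """Skew between the clock arrival at the capturing flip-flop (FF2)
    and at the launching flip-flop (FF1), in nanoseconds."""
    return arrival_capture_ns - arrival_launch_ns


def races_through(skew_ns, reg_to_reg_delay_ns):
    """True if new data launched by FF1 reaches FF2 before the delayed
    clock edge arrives there, so FF2 captures it one cycle too early."""
    return reg_to_reg_delay_ns < skew_ns


# Clock arrival times from the example: 4 ns at FF1, 12 ns at FF2.
skew = clock_skew(4.0, 12.0)
print(f"clock skew: {skew} ns")

# Hypothetical register-to-register delays.
for delay in (5.0, 10.0):
    status = "race-through" if races_through(skew, delay) else "safe"
    print(f"register-to-register delay {delay} ns: {status}")
```

Note that the clock period does not appear anywhere in the check, which is why this kind of failure is independent of the clock speed.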

Figure 2.1 Explaining Clock Skew

With the increasing density of VLSI chips and the use of several pipeline stages in
hardware design, clock skew becomes a dominating factor in determining the clock
period of synchronous digital systems. There are two sources of clock skew: different
lengths of clock paths and/or different loads of the clock drivers.

Different lengths of the clock paths cause different total resistance and capacitance of the
clock tracks. This results in different RC times for the clock signals. Although these
clocks are routed in metal, they can still show a mutual clock skew of hundreds of
picoseconds to even several nanoseconds in a very badly routed clock system. Care must
be taken because most flip-flops in a library show clock skew sensitivities. Verification with
modern tools is therefore a must to guarantee proper operation of synchronous VLSI designs.
Different loads of the clock phase drivers are another potential cause of clock skew. Figure
2.2(a) shows a distributed clock driving network.

Figure 2.2 (a) Distributed clock driving network (clock tree approach)

Although clock signals φ1, φ2 and φ3 should be equal, they may show some differences
because of different loads. These can originate from a different number of latches and
flip-flops that they have to drive, or from a different capacitive load of the tracks. Suppose
capacitance C3 (which represents all latches and flip-flops connected to φ3) is larger than
C2. When Q2 is fed to the S input of FF3, it might simultaneously ripple through this
flip-flop as well, when its clock (φ3) is delayed to a certain extent with respect to φ2. Such
anomalous circuit behavior reduces the voltage margin with which the circuit should
normally operate.

In the clock tree approach, it is extremely important that the clock branches are equally
loaded (balanced clock tree). This must be verified by tools, particularly in high
performance complex circuits. Current tools offer well-balanced clock tree synthesis,
which enhances the quality of clock timing. An important advantage of this clock tree
approach is the distribution of the different small clock drivers over the logic blocks. The
use of distributed clock drivers puts the clock drivers right where they are
needed. Distributed clock drivers keep the current loops short, and they do not switch
simultaneously; their switching is spread over a small time frame. Moreover, they can use the
intrinsic decoupling capacitance which is available in a logic standard cell block. This
reduces the di/dt fluctuations, which are responsible for most of the supply/ground
bounce in VLSI designs.

In many designs, a clock trunk approach as shown in Figure 2.2(b) is still used, with only
one global clock. In synchronous designs, the total dissipation of the clock-related circuit
may vary from 10% to even 50% of the total IC dissipation. It is obvious, then, that the
clock system will also generate a large part of the total supply bounce. On the other hand,
clock skew is now only determined by the delay across the clock wire. However, for
design integrity, the balanced clock-tree approach is preferred. Large library
blocks in particular, such as memories and microprocessor cores, may include their own clock drivers
and/or routing. If this is the case, communication between these blocks and the rest of the
chip may become very critical and must be thoroughly verified and simulated. Because
these blocks have different internal clock delays, these delays must be compensated for in
the clock architecture. This can be done either by using compensating delays (which must
be simulated per logic or memory block) or by using a PLL per logic or memory block,
which can synchronize the clock arrival times at the flip-flops.

Figure 2.2 (b) Clock trunk approach

Other Timing Problems

In clocked low power CMOS circuits, some logic blocks (or sometimes even the
complete chip) may often be inactive for certain periods of time. Such a chip may contain
different clock domains, of which the mode of operation (active or stand-by) is controlled
by a gated clock. In such cases, the main clock is used as input to some logic gate to
perform some logic function on the clock signal (gated clock). Figure 2.3 shows an
example of this.

When the delay between the clock φ and clock φ' is longer than the data delay between
the output Q1 of one flip-flop and the input D2 of the next flip-flop, this "new" data
sample will be clocked into this flip-flop by the "old" clock and a race will occur.

Timing problems could also occur when the data delay (caused by the logic and
interconnection delay) between two successive latches or flip-flops becomes equal to or
larger than one clock period. Figure 2.3 shows an example of this:

Figure 2.3 (a) Example in which the data delay exceeds a clock period and (b) its
corresponding timing diagram

When the total propagation time through the logic from Q1 to D2 exceeds the clock
period, the data at D2 can arrive after the sample period of flip-flop FF2 has been
terminated. It will then be sampled in the next clock period, resulting in improper
synchronization. Timing simulation to find critical delay paths is therefore a must in
CMOS VLSI design and is a part of the design flow.

Clock Synchronization in Synchronous Designs

With IC complexities exceeding tens of millions of transistors, the total effort required to
complete such complex VLSI designs is immense. This stimulates the reuse of
certain logic blocks (IP cores) and memories. Current heterogeneous systems on chip,
most of which are synchronous designs, may not only incorporate many clock domains, but can also be
built up from cores that are designed at different sites or by different vendors with different
specifications. Because each core has a different clock skew from the core's clock input
terminal to its farthest flip-flop, the clock phase of each core has to be
synchronized with the main clock.

Current VLSI chips may contain several thousand to several hundred thousand
latches, and the total wire length of the clock signals may exceed several meters. To
achieve high system performance, the clock frequency is often maximized. This
combination (a large clock load and a relatively high clock frequency) is the cause of
many on-chip timing problems. There are many different clocking strategies for
synchronous logic.


Single Phase Clocking

Figure 2.4 Single-phase clock system and its timing diagram

From Figure 2.4, we can derive that the minimum cycle time is given by:

\[ t_{min} = t_{ff} + t_{logic} + t_{su} + t_{skew} \tag{2.1} \]

where

t_ff is the flip-flop delay from clock to output
t_logic is the propagation delay through the logic
t_su is the setup time of the data of flip-flop FF2
t_skew is the maximum amount of time that the clock of flip-flop FF2 can be earlier than
that of flip-flop FF1.

In particular, t_logic, which is dominant in equation 2.1, must be carefully simulated to be
sure that the required frequency (clock period) will be achieved.
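As a quick illustration of equation (2.1), the following sketch adds up the four contributions and converts the result into a maximum clock frequency. The function names and the delay values are hypothetical and serve only to show how the terms combine.

```python
def min_cycle_time_ns(t_ff, t_logic, t_su, t_skew):
    """Equation (2.1): minimum clock period of a single-phase system (all values in ns)."""
    return t_ff + t_logic + t_su + t_skew


def max_clock_mhz(t_min_ns):
    """Highest clock frequency (MHz) that still respects the minimum period (ns)."""
    return 1e3 / t_min_ns


# Hypothetical delays, not taken from any measured design
t_min = min_cycle_time_ns(t_ff=0.5, t_logic=6.0, t_su=0.3, t_skew=0.2)
print(f"t_min = {t_min:.2f} ns, f_max = {max_clock_mhz(t_min):.1f} MHz")
```

With these values the logic delay accounts for most of the period, which is the situation the text describes when it calls t_logic dominant.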

In many designs, the (pipeline and/or scan) registers are implemented by using series
connections of flip-flops. Especially in the scan mode during testing, flip-flops are
directly connected to other flip-flops. In Figure 2.5, a flip-flop of logic block 1 can be
directly connected to a flip-flop of logic block 2.

Figure 2.5 Example of a scan-chain in a complex VLSI circuit

With a direct connection, the propagation time of the data between these flip-flops can be
very short. As the clock is routed through these blocks automatically, its time of arrival at
the first flip-flop of logic block2 can be later than the arrival time of the data. This will
result in a race, which can also occur in two-phase clocked registers. Therefore, each
register should be carefully checked with respect to the above critical timing situation. If
necessary, additional delay by using several inverters should be included in the critical
path in the scan chain.

One way to design synchronous systems with potentially less critical timing is to use
two-phase clock flip-flops. The non-overlapping property of the clocks prevents transparency
either within a flip-flop or between flip-flops. The non-overlapping time between the
clocks in Figure 2.6 allows larger clock skew, although at the cost of performance.

Figure 2.6 Two-phase clock system

The non-overlapping time must be added to the cycle time and it therefore reduces the
performance. Because two clock signals must be routed through the chip, the routing area
of the chip will be increased with respect to single-phase clocking.

Figure 2.6 shows a synchronous two-phase clock system. When φ1 is high, the master is
listening and the slave is talking. When φ2 is high, the master is talking and the slave is
listening. When the difference in clock delays (clock skew) is larger than τ_non-overlap,
both master and slave will be talking or listening at the same time. This results in bad
communication and incorrect operation.
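The condition described above reduces to a comparison between the clock skew and the non-overlap time. A minimal sketch, assuming both quantities are known in nanoseconds (the function names and values are hypothetical):

```python
def skew_margin_ns(t_non_overlap_ns, skew_ns):
    """Remaining margin before the two phases effectively overlap."""
    return t_non_overlap_ns - abs(skew_ns)


def phases_overlap(t_non_overlap_ns, skew_ns):
    """True when the skew exceeds the non-overlap time, so master and slave conflict."""
    return skew_margin_ns(t_non_overlap_ns, skew_ns) < 0


print(phases_overlap(0.5, 0.3))   # False: skew is absorbed by the non-overlap time
print(phases_overlap(0.5, 0.8))   # True: master and slave are transparent together
```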

Additional margin can be created by increasing the non-overlapping time. In a two-phase
clocked system, the clock signals are usually generated by a central clock generator
circuit. Figure 2.7 shows a two-phase clock generator circuit.

Figure 2.7 (a) Two-phase clock generator (b) Clock waveforms

Especially in very large circuits, the transistors of the final driving inverters can be
very large (W/L ratios of several thousands) to be able to drive several hundreds of
picofarads of clock load. In most cases, the predrivers (INV4 and INV5) already consist of
large transistors. The transistors of drivers 4, 5, 6 and 7 are implemented in the layout
by connecting several parallel transistor structures to achieve the appropriate
W/L ratios.

During the layout, the metal wires between these parallel transistors can be drawn such
that they are not at minimum spacing with respect to neighboring wires.

On-Chip Multiple Clock Generation

On-chip multiples of the clock can be generated by phase-locked loops (PLLs). Figure
2.8 shows a basic phase-locked loop concept.

[Block diagram: input f_in, phase detector/amplifier, loop filter, oscillator, frequency divider; output n·f_in]

Figure 2.8 Basic concept for a phase-locked loop

The Voltage-Controlled Oscillator (VCO) is basically an oscillator whose frequency is
determined by an externally applied voltage (current-controlled oscillators, CCOs, are also
used). This frequency is a multiple of that of the input. The phase detector is sensitive
to differences in phase of the input and VCO signals. A small shift in the frequency of the
input signal changes the control voltage of the VCO, which then brings the VCO
frequency back to the same value as that of the input signal. Thus, the VCO remains
locked to the input. Based on this principle, a PLL can be used to generate an output
frequency which is a multiple of the input frequency. The output frequency equals n
times the input frequency. As current complex ICs require many different clock domains,
multiple frequencies must be generated on chip. Figure 2.9 shows an example of a
multi-clock generator based on a PLL.

Figure 2.9 Multi-clock generator, based on PLL

In this example, the PLL output frequency equals n × m1 × f_in. Using different divisions
(mi), many different clocks can be generated. The PLL, by nature, automatically keeps
these clocks in phase with the input.
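The frequency plan of such a multi-clock generator can be tabulated in a few lines. This is a sketch under an explicit assumption: each clock CLK_i is taken to be the PLL output divided by its own divider m_i, which is one plausible reading of Figure 2.9 and not something the text states. The reference frequency and divider values are hypothetical.

```python
def pll_output_hz(f_in_hz, n, m1):
    """PLL output frequency n * m1 * f_in, as stated in the text."""
    return n * m1 * f_in_hz


def divided_clocks_hz(f_pll_hz, dividers):
    """Assumption: clock i is the PLL output divided by m_i."""
    return {f"CLK{i}": f_pll_hz / m for i, m in enumerate(dividers, start=1)}


f_pll = pll_output_hz(10e6, n=4, m1=2)        # hypothetical 10 MHz reference
print(divided_clocks_hz(f_pll, [2, 4, 8]))
```

Under this assumption CLK1 runs at n × f_in, and the other clocks are integer fractions of the same PLL output, so they stay phase-related to the input.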

Clock-Phase Synchronization In Multiple Core Environments

Because of differences in the clock arrival times at the flip-flops of different cores, these
delays must be compensated for, to allow proper communication between different cores.
We will introduce some of the many methods of clock synchronization.

1) Adaptive Skew Control
In this approach, the clock network of each core (domain) is extensively simulated. The
clock skew in each core is then made equal to the worst case clock skew by using a chain
of inverters. The length of this inverter chain is then adapted to the required additional
delay in the specific core clock path.
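The padding step can be sketched as follows, assuming the simulated clock delay of each core and the delay of one inverter are known. The core names, delay values and the function name are hypothetical, and clock polarity (odd versus even inverter counts) is ignored in this simplified version.

```python
def padding_plan(core_delays_ps, t_inv_ps):
    """For each core, return (inverters to add, residual mismatch in ps).

    Every core is padded toward the worst-case (largest) clock delay.
    """
    worst = max(core_delays_ps.values())
    plan = {}
    for core, delay in core_delays_ps.items():
        missing = worst - delay                 # extra delay this core needs
        count = round(missing / t_inv_ps)       # nearest whole number of inverters
        plan[core] = (count, missing - count * t_inv_ps)
    return plan


cores = {"cpu": 900, "dsp": 650, "sram": 400}   # hypothetical clock delays in ps
print(padding_plan(cores, t_inv_ps=50))
```

The residual value shows how much skew remains after padding, which is bounded by half an inverter delay in this scheme.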

2) Phase Locked Loop
We can also use the PLL concept for this purpose. The PLL property of locking one
signal phase to the phase of another reference signal makes the PLL also suitable for the
compensation of clock skew in different cores, see Figure 2.10.

Figure 2.10 Clock phase synchronization to compensate for different clock skews inside
different cores

Node A represents the clock terminal of the core, and node B represents the clock
terminal of the actual flip-flop in that core. The clock phase at the flip-flop will then be
locked to the input reference signal, which is usually the chip's main clock. In this way,
the clock tree delay (which might be different in all cores) can be easily compensated for.
Moreover, when the frequency dividers in Figure 2.10 are made programmable, then the
same PLL can be used in all cores, even when they run at different frequencies. In
synchronous designs, sometimes the reusable cores are only available with fixed
instances and only in the GDSII (layout description) format. In these cases, the clock tree
must be thoroughly simulated and a delay chain, which mimics the core's internal clock
delay path, replaces the clock tree between nodes A and B (Figure 2.10) in the feedback
path. The PLL must be placed outside the core. Disadvantages with the use of PLLs are:
• Because of high internal frequencies, PLLs can consume relatively high power.
• PLLs are difficult to start and stop. The start-up takes a relatively long time.

3) Delay-Locked Loop
The delay of the delay line can be controlled by the output voltage of the integrator. The
output signal is delayed over one complete clock period with respect to the input. If the
delay is less, then the phase detector produces a signal which increases the delay of the
delay line, via the integrator. The output signal in such a DLL has the same frequency as
the input, so it can't be used to multiply the frequency. Because the VCO or CCO in a
PLL generates frequencies that depend on the supply voltage, clock jitter can occur when
there is supply noise. Also, the delay in a DLL is susceptible to supply noise. Control of
the clock jitter is therefore one of the most important constraints in the design of a PLL
and DLL. For the synchronization of the clock phases of all cores in a heterogeneous
chip, each core needs its own PLL (DLL).

Coupling Capacitance And Noise

In the quasi-ideal scaling approach to local interconnect, wires become taller and
narrower.

[Plot: wire pitch (nm), aspect ratio, and Cc/Ctotal (%) versus technology generation (μm)]

Figure 2.11 Coupling capacitance is an increasing portion of the interconnect
capacitance. Technology trends taken from [1]

The result of this scaling scenario is a marked increase in coupling capacitance in the
DSM regime. Figure 2.11 shows the rise in Cc/Ctotal for scaled processes, where Ctotal is
the total wire capacitance (consisting of capacitance to upper and lower ground planes as
well as Cc). Currently, for minimum pitch wires, Cc accounts for roughly 70% of the total
wire capacitance. This capacitive coupling to neighboring wires means that voltage
changes on one wire can adversely affect the voltage level of another wire.

Noise in a digital system can be defined as anything that causes a node to deviate from
Vdd or ground when it should otherwise have a stable high or low value. Coupling
capacitance is a source of noise in that it causes such deviations to occur. Noise sources
in turn lead to signal integrity problems, and we now introduce two such problems caused
by Cc: crosstalk and dynamic delay.

Crosstalk

Crosstalk noise results when a quiet victim line is acted upon by one or more neighboring
aggressor lines. The coupling capacitance between the victim and aggressor lines is
partially charged by the driving gates of the aggressor, which yields an unwanted voltage
spike on the victim line. If this voltage spike is large enough, it can cause logic faults
(especially in dynamic circuits or pass-transistor logic). In the case of bootstrapped noise,
where the victim voltage level goes above Vdd or below 0 V, there can be reliability
concerns due to enhanced device stress (hot carriers) and possible forward-biased
drain-substrate p-n junctions. The magnitude of the voltage spike, Vx, is a complex function of
driver strength, Cc, fan-out capacitance, and wiring resistance. To the first order,
however, Vx can be viewed as proportional to the ratio of Cc to Ctotal. As we have already
discussed, this ratio is rising in modern processes due to tight pitches and high aspect ratio
lines. Therefore, we anticipate a rise in crosstalk noise in scaled technologies.

Analytical Modeling of Crosstalk Noise

On-chip crosstalk is a major concern in ULSI circuits due to scaling linewidths,
increasing aspect ratios and larger die size. Also, due to reduced noise margins and larger
ground bounce, noise issues become even more important. However, with several million
nets in a modern design, detailed simulation of crosstalk noise on each net is highly
inefficient. A rapid and accurate crosstalk estimation technique is needed to quickly
screen nets that violate noise margins.

The most basic crosstalk model is the charge-sharing model, presented in many circuit
textbooks and review papers [1,3]. This simple model considers only the ratio of coupling
capacitance to total lumped capacitance. This model has been found to introduce
extremely large errors (> 500% ).
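For reference, the charge-sharing estimate amounts to a capacitive divider. The sketch below assumes the usual form Vx = Vdd · Cc / (Cc + Cground); the function name and the capacitance values are hypothetical, with Cc chosen as 70% of the total wire capacitance to match the figure quoted earlier.

```python
def charge_sharing_noise(v_dd, c_c, c_ground):
    """Peak victim noise from capacitive division alone (no resistances, no timing)."""
    return v_dd * c_c / (c_c + c_ground)


# Hypothetical wire: Cc is 70% of the total capacitance
print(charge_sharing_noise(v_dd=1.8, c_c=70e-15, c_ground=30e-15))
```

Because the victim driver resistance and the aggressor rise time do not appear in this expression, it says nothing about how quickly the victim driver restores the node, which is consistent with the large errors reported for this model.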

Most high-level electronic-design-automation (EDA) tools that account for crosstalk
noise also consider the capacitance of the signal lines only. This simplification can
produce significant errors that may lead to unpredicted circuit failures under certain
conditions. A more accurate model needs to account for the various resistive components
of the circuit. In [1], a closed-form model based on RC transmission line analysis is
presented, but driver modeling is not discussed and the analysis is limited to step inputs.
Two other models [1] approximate the driver with a resistor and a voltage source, but
signal line resistance is neglected. In DSM designs, line resistance is appreciable and
cannot be ignored.

Realizing the importance of crosstalk analysis in DSM technology, attempts have been
made to present various models for crosstalk. One such general closed-form crosstalk
model is described here. This model takes into account the driver strength and line
resistance. The model is simple, accurate and provides an excellent basis for a crosstalk
analysis.

Model Derivation

Crosstalk is a complex form of electromagnetic coupling between two or more
conducting lines. In order to obtain a tractable analytical expression for peak crosstalk
noise, several assumptions have been made. First, the aggressor gate is modeled as a
ramped voltage source with rise/fall time t_risetime. This rise time can be obtained by the
use of a gate-level timing simulator or through the use of analytical expressions such as
those presented in [1]. A second assumption is that all interconnect and load capacitances
(including fan-out gates and drain junction capacitances) are modeled as a lumped
capacitance to ground, excluding the coupling capacitance. Finally, the victim line driver
is modeled as an effective resistance. This resistance is equal to the inverse of the slope
of a device's Id-Vd curve at the origin and needs to be found just once for a given
technology, as it scales linearly with device width.

Figure 2.12 Simplified circuit schematic

The resulting equivalent circuit is shown in Figure 2.12. In the figure, Ra describes the
aggressor line resistance while Rv is equal to the sum of victim driver resistance and line
resistance. The aggressor gate is modeled as a voltage source with ramp rate Vdd/t_risetime,
and crosstalk is denoted as Vx. Ca and Cv are the sum of all ground capacitances for the
aggressor and victim respectively. These totals include drain junction capacitance, fanout
capacitance, and interconnect capacitance to upper and lower ground planes.

From circuit analysis principles, the victim line voltage Vx is found to be:

(2.2)    0 ≤ t ≤ t_risetime

(2.3)    t ≥ t_risetime

In these equations, t_risetime is the rise/fall time at the output of the aggressor driver and τ0,
τ1, and τ2 represent different time constants of the equivalent circuit. The time constants
are defined as:

\[ \tau_0 = \left[ \left[ R_a (C_a + C_c) + R_v (C_v + C_c) \right]^2 - 4 R_v R_a (C_v C_c + C_v C_a + C_c C_a) \right]^{1/2} \]

\[ \tau_1 = \frac{2 R_v R_a (C_v C_c + C_v C_a + C_c C_a)}{R_a (C_a + C_c) + R_v (C_v + C_c) + \tau_0} \]

(2.4)

The peak value of crosstalk noise, Vmax, can be found by differentiating equation (2.3),
as this peak always occurs at t > t_risetime.

(2.6)

where

(2.7)

(2.8)

For slow rise times (t_risetime >> τ2), (2.6) can be seen to approach a limiting value of
(Rv × Cc × Vdd)/t_risetime. Equation (2.6) reduces to:

(2.9)

This model allows for different values of Ca and Cv, which is almost always the case in
actual designs. Also, there are many cases where an aggressor line runs parallel to a
victim for less than the full length of either line. The general nature of this model can
handle such cases. Thus, the model can be seen to have wider applicability than those
presented previously.
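The behavior of this model can be explored numerically. The sketch below is not a transcription of equations (2.2) to (2.9). It assumes the circuit described in the text (a ramp source driving Ra into a node with Ca to ground, Cc from that node to the victim node, and Cv in parallel with Rv from the victim node to ground) and uses a ramp response derived for that assumed circuit. The expression for tau2 is an assumption chosen so that tau1 · tau2 equals the quadratic coefficient of the circuit. All function names and component values are hypothetical.

```python
import math


def time_constants(ra, rv, ca, cv, cc):
    """Return (tau0, tau1, tau2) for the assumed two-node RC circuit."""
    b = ra * (ca + cc) + rv * (cv + cc)
    a = rv * ra * (cv * cc + cv * ca + cc * ca)
    tau0 = math.sqrt(b * b - 4.0 * a)
    tau1 = 2.0 * a / (b + tau0)
    tau2 = 2.0 * a / (b - tau0)   # assumed counterpart of tau1 (tau1 * tau2 == a)
    return tau0, tau1, tau2


def _ramp(t, k, taus):
    """Victim response to an unbounded ramp; k = Rv * Cc * Vdd / t_rise."""
    tau0, tau1, tau2 = taus
    if t <= 0.0:
        return 0.0
    decay = tau2 * math.exp(-t / tau2) - tau1 * math.exp(-t / tau1)
    return k * (1.0 - decay / tau0)


def victim_voltage(t, t_rise, vdd, ra, rv, ca, cv, cc):
    """Ramp that stops at t_rise = ramp minus the same ramp delayed by t_rise."""
    taus = time_constants(ra, rv, ca, cv, cc)
    k = rv * cc * vdd / t_rise
    return _ramp(t, k, taus) - _ramp(t - t_rise, k, taus)


def _log_expm1(x):
    """log(exp(x) - 1) without overflow for large x."""
    return x + math.log1p(-math.exp(-x))


def peak_noise(t_rise, vdd, ra, rv, ca, cv, cc):
    """Return (t_peak, v_peak) by setting dVx/dt = 0 for t > t_rise."""
    _, tau1, tau2 = time_constants(ra, rv, ca, cv, cc)
    log_ratio = _log_expm1(t_rise / tau1) - _log_expm1(t_rise / tau2)
    t_peak = tau1 * tau2 / (tau2 - tau1) * log_ratio
    return t_peak, victim_voltage(t_peak, t_rise, vdd, ra, rv, ca, cv, cc)


# Hypothetical net: 100 ohm aggressor line, 1 kohm victim driver plus line
net = dict(ra=100.0, rv=1000.0, ca=50e-15, cv=50e-15, cc=100e-15)
t_peak, v_peak = peak_noise(t_rise=100e-12, vdd=1.8, **net)
print(f"peak of {v_peak:.3f} V at {t_peak * 1e12:.1f} ps")
```

For rise times much longer than tau2, the peak returned by this sketch approaches Rv · Cc · Vdd / t_rise, in line with the slow-rise-time limit discussed above.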

Dynamic delay

This is another form of noise caused by Cc. While fundamentally related to the crosstalk
noise problem, dynamic delay specifically refers to situations where the victim line is
switching rather than static. Because a victim line has substantial capacitance
to neighboring wires rather than to ground, the delay of the victim system (driver + wire)
becomes a function of neighboring wire switching activity. When the nearby aggressor
lines are static, Cc acts nearly as a ground capacitance. However, if the victim and
aggressor switch simultaneously, the capacitance between them sees a different voltage
swing, ΔV, than in the static case. For instance, if two capacitively coupled nodes have
the same voltage swing, ΔV across the capacitance is 0 V and it is not charged or
discharged at all. If the same two nodes have complementary voltage responses, then the
total voltage swing across the capacitor is 2 × V, where V is the voltage swing of each
waveform. This is equivalent to switching the positive and negative electrodes of a
capacitor: the effective voltage swing is 2 × V. Figure 2.13 shows these two scenarios.

[Figure panels: same-direction switching, no potential difference across Cc, no charge required; opposite-direction switching, each electrode of Cc is switched by Vdd, so ΔV across Cc is 2 × Vdd]

Figure 2.13 The impact of dynamic delay is related to the switching relationship between
the victim and aggressors. Cc experiences different voltage swings depending on signal
integrity

According to:

\[ Q = C \cdot V \tag{2.10} \]

the previous two cases will require significantly different amounts of charge supplied to
the coupling capacitance due to the different voltage swings. The result of this argument is
that when aggressors and victims switch in opposite directions, a larger amount of charge
needs to be supplied from the drivers to switch Cc than in the static case. This yields a
higher delay for the same circuit configuration, the only difference being the neighboring
signal activity. It is difficult to perform timing analysis of dynamic delay. Neighboring
wires must be examined to determine the likelihood of simultaneous switching, which
greatly increases the complexity of the timing analysis problem.
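The charge argument can be made concrete by converting the charge delivered into Cc into an equivalent grounded capacitance seen by the victim driver. This is a minimal sketch using only Q = C · V; the function names and capacitance values are hypothetical, and full rail-to-rail swings are assumed.

```python
def victim_driver_charge(c_c, dv_victim, dv_aggressor):
    """Q = C * V across Cc: the change in (V_victim - V_aggressor) sets the charge."""
    return c_c * (dv_victim - dv_aggressor)


def effective_load(c_ground, c_c, v_dd, dv_aggressor):
    """Equivalent grounded capacitance for a victim that rises by v_dd."""
    return c_ground + victim_driver_charge(c_c, v_dd, dv_aggressor) / v_dd


V_DD, C_GND, C_C = 1.8, 30e-15, 70e-15       # hypothetical values
for label, dv in (("same direction", V_DD), ("static", 0.0), ("opposite", -V_DD)):
    print(f"{label:>14}: {effective_load(C_GND, C_C, V_DD, dv) * 1e15:.0f} fF")
```

The coupling capacitance therefore contributes between zero and twice its value to the victim's load, depending on what the neighbor does, which is why the delay of the same wire varies with neighboring activity.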

Low Power Synchronous and Asynchronous Designs

In present day synchronous VLSI systems, the clock distribution network may drive
thousands of registers, creating a large capacitive load that must be sourced efficiently.
Furthermore, each transition of the clock signal changes the state of each capacitive node
within the clock distribution network, in contrast to the switching activity in
combinational logic blocks, where the change of logic state is dependent on the logic
function. The combination of large capacitive loads and a continuous demand for higher
clock frequencies has led to an increasingly larger proportion of the total power of a
system being dissipated within the clock distribution network, to about 25% of the total
power.

The primary component of power dissipation in most CMOS based digital circuits is
dynamic power. It is possible to reduce the CV²f dynamic power by lowering the clock
frequency, the power supply and/or the capacitive load of the clock distribution network.
Lowering the clock frequency, however, conflicts with the primary goal of developing
high speed VLSI systems. Therefore, low dynamic power dissipation is best achieved by
employing certain design techniques that minimize the power supply and/or the
capacitive load.

To achieve this, clock buffers and pipeline registers can be designed such that
the clock distribution network operates at half the power supply swing, reducing the
power dissipated in the clock tree by 60% without compromising the clock frequency of
the circuit.

Other approaches exist for reducing the power dissipated within a clock distribution
network. These approaches reduce power by decreasing the total effective capacitance
required to implement a clock tree. This approach may enable us to reduce the power by
about 10-25% with no degradation in clock frequency.

Introducing pipelining in a data path based circuit is an efficient and customary way to
increase the throughput rate of the circuit and thus the computational power of
the data path. Circuits featuring a high degree of pipelining inherently have short and
simple critical paths. The short paths are essential for achieving a high throughput rate,
but they also allow the circuits to be realized with minimum sized devices, which is a
prerequisite for low power dissipation. Moreover, the inherent locality of the logical
paths between the pipeline registers considerably reduces the power dissipation of the
logic circuits due to glitches.

So the principle of pipelining, applied to increase the throughput rate, may at the same
time reduce the power dissipated in the logic circuitry resulting in a reduction of the total
power dissipation. However, the large number of pipeline registers controlled by one or
more clock signals results in a considerable capacitive load in the clock network
requiring an optimization of the degree of pipelining.

The energy for charging the clock network each clock cycle is delivered by the clock
system of the circuit. An appropriate measure for the power efficiency of a circuit is the
specific power dissipation p, defined as the total power dissipation P divided by the clock
frequency 1/t_clock (which is related to the critical path delay) and divided by the number of
equivalent gates G of the circuit:

\[ p = \frac{P \cdot t_{clock}}{G} \tag{2.11} \]

[Plot legend: non-pipelined and gate arrays; RAM dominated; optimally pipelined without clock; optimally pipelined; weakly pipelined]

Figure 2.14 Power Dissipation vs. Clock frequency

Figure 2.14 shows the power dissipation per gate vs. the clock frequency. It shows that if
the power dissipated in the clock network is not taken into account, the specific power
dissipation of optimally pipelined circuits is an order of magnitude lower than the
specific power dissipation of non-pipelined circuits or gate array implementations.

However, the power dissipated in the clock system of even optimally pipelined [3]
high-throughput VLSI circuits may typically be as large as the power dissipated in the logic
circuitry. But even if the additional power dissipation for the clock system is also taken
into account (Figure 2.14), the specific power dissipation of highly pipelined circuits is
still about a factor of 5 better than that of non-pipelined circuits. Nevertheless, the clock system
of such circuits still holds a great potential to reduce the total power dissipation.

Capacitive Load of the Clock Network

The expansive clock network is loaded with the gate capacitances (Cg) of the clocked
transistors of the pipeline registers. Another capacitive part of the clock network is the
junction capacitance of the source-drain regions at the output nodes of the clock drivers
(Cj). The wiring capacitance of the clock network can be separated into the capacitance of
the global clock network (Cwg) and the capacitance of the local clock network (Cwl). This
is illustrated in Figure 2.15. The total capacitive load CL of the clock network can be
written as:

\[ C_L = C_g + C_j + C_{wg} + C_{wl} \tag{2.12} \]

The power needed for switching the clock network 1/t times per second is then given by:

\[ P_{cn} = \frac{1}{t} \cdot V_{dd}^2 \cdot C_L \tag{2.13} \]

Apparently, the dominant part of the capacitive load is the capacitance of the local clock
network (Cwl).
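A short sketch of how the capacitive load and the resulting clock network power combine, summing the four contributions named above and applying equation (2.13). The function names and the capacitance, voltage and period values are hypothetical placeholders, not measured data; the local wiring share is simply chosen to be the largest, as the text suggests.

```python
def clock_load(c_g, c_j, c_wg, c_wl):
    """Total load C_L: gate, junction, global-wire and local-wire capacitance."""
    return c_g + c_j + c_wg + c_wl


def clock_network_power(c_load, v_dd, t_clk):
    """Equation (2.13): power to switch C_L once per clock period t_clk."""
    return v_dd ** 2 * c_load / t_clk


# Hypothetical capacitances in farads
parts = {"Cg": 40e-12, "Cj": 10e-12, "Cwg": 20e-12, "Cwl": 80e-12}
c_l = clock_load(parts["Cg"], parts["Cj"], parts["Cwg"], parts["Cwl"])
for name, value in parts.items():
    print(f"{name}: {100 * value / c_l:.0f}% of C_L")
print(f"P = {clock_network_power(c_l, v_dd=3.3, t_clk=10e-9):.3f} W")
```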

[Diagram labels: global clock network, local clock network, logic blocks, latches, PHI master and PHI slave clock phases]

Figure 2.15 The clock network

Power Reduction in Synchronous Systems

1) Design of Clock Drivers

Since the power Pcs delivered by the clock drivers of the clock system can be a
considerable part of the total power dissipation P of the circuit, the design of the drivers
must also be optimized for minimum power dissipation. CMOS clock drivers consist of a
chain of cascaded drivers with increasing driving capability. An important design
parameter for such driver chains is the stage ratio α, defined as the ratio of the widths of
the transistors of successive driver stages:

\[ \alpha = \frac{W_p^{y}}{W_p^{y-1}} = \frac{W_n^{y}}{W_n^{y-1}} \tag{2.14} \]

W_p^y is the width of the p-channel transistor of the driver of stage y in the chain and W_n^y is
the width of the associated n-channel transistor.

In a first approximation, the wiring capacitances between successive driver stages are
neglected. Then the parasitic capacitance of an intermediate node of the driver chain can
be taken proportional to the widths of the transistors interacting with this node. Thus the
capacitances of two successive intermediate nodes will also have a ratio equal to the stage
ratio α, and an approximate value for the power dissipation of the complete clock system
can be given:

\[ P_{cs} = \frac{1}{t} \cdot V_{dd}^2 \cdot C_L \cdot \left( 1 + \frac{1}{\alpha} + \frac{1}{\alpha^2} + \dots + \frac{1}{\alpha^N} \right) \tag{2.15} \]

CL is the capacitive load for the clock driver, Vdd is the supply voltage, t is the clock
period and N is the total number of stages in the driver chain. N is determined by the
desired input capacitance C1 of the first stage of the driver chain and the stage ratio α:

\[ N = \frac{\log(C_L / C_1)}{\log(\alpha)} \tag{2.16} \]

The power efficiency of the clock system can then be defined as:

\[ \eta_p = \frac{1}{1 + \frac{1}{\alpha} + \frac{1}{\alpha^2} + \dots + \frac{1}{\alpha^N}} \tag{2.17} \]

which, for a given ratio $C_L/C_1$, is only a function of a. The factor $1 - 1/\eta$ characterizes the
additional power dissipation overhead needed in the clock driver circuits for charging and
discharging the clock network. It follows that, in order to minimize the
power dissipation of the clock system, the stage ratio of the drivers should be chosen as
large as possible, which means that the number of drivers in the chain should be
minimized. However, another design parameter is the transition time of the clock signals.
For complementary and single-phase clock systems, which require smaller clock transition
times than non-overlapping clock systems, it may not be possible to apply stage ratios
much larger than a = 3. The delay time of the complete driver chain must also be taken
into account, but it is well known that the total delay increases only slightly with
increasing stage ratio a. So, depending on the relative importance of the power dissipation
and delay time of the clock drivers, the optimum stage ratio will be in the range
a = 3 to 10.
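The trade-off described above can be explored numerically. The following is a minimal sketch, assuming that (2.15) is a geometric series with terms 1, 1/a, ..., 1/a^N and that N is rounded to the nearest integer; the load, input capacitance, supply voltage, and clock period are hypothetical values chosen only for illustration.

```python
import math


def number_of_stages(c_load, c_first, a):
    """Equation (2.16): N = log(C_L / C_1) / log(a)."""
    return math.log(c_load / c_first) / math.log(a)


def clock_system_power(c_load, vdd, period, a, n_stages):
    """Equation (2.15), assuming the series runs from 1 to 1/a**N."""
    series = sum(a ** -k for k in range(n_stages + 1))
    return vdd ** 2 * c_load * series / period


def driver_overhead(a, n_stages):
    """Extra power of the driver chain relative to charging C_L alone."""
    return sum(a ** -k for k in range(1, n_stages + 1))


if __name__ == "__main__":
    # Hypothetical values: 100 pF clock load, 0.1 pF first-stage input,
    # 3.3 V supply, 10 ns clock period.
    c_load, c_first, vdd, period = 100e-12, 0.1e-12, 3.3, 10e-9
    for a in (3, 5, 10):
        n = round(number_of_stages(c_load, c_first, a))
        p = clock_system_power(c_load, vdd, period, a, n)
        print(f"a={a:2d}  N={n}  P_cs={p * 1e3:.2f} mW  "
              f"overhead={driver_overhead(a, n):.3f}")
```

Under these assumptions the overhead term shrinks as a grows, which matches the statement that a large stage ratio minimizes the power dissipated in the driver chain.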

2) Reduced Clock Swing

The power needed for charging and discharging the clock network 1/t times per second
is given by (2.15). If the clock swing $V_{clock}$ can be chosen different from the supply
voltage $V_{dd}$,

$$V_{clock} = r \cdot V_{dd}$$

(2.18)

with r < 1, then the power dissipation of a clock system supplied from $V_{dd}$ would be

$$P_{cs}' = \frac{1}{t} \cdot C_L \cdot V_{dd} \cdot V_{clock}$$

(2.19)

or, with (2.18),

$$P_{cs}' = \frac{1}{t} \cdot C_L \cdot r \cdot V_{dd}^2$$

(2.20)

A reduction of $V_{clock}$ results in a proportional reduction of $P_{cs}'$. The reduced clock
swing, however, reduces the saturation current of the transistors connected to the clock phases:

$$I_{sat} \propto W \cdot (V_{clock} - V_{th})^{y} \quad \text{with } 1 < y < 2$$

(2.21)

(with W and $V_{th}$ the channel width and threshold voltage of the transistor involved; y = 2
for conventional CMOS and y = 1 for sub-micrometer CMOS).

As a consequence of the reduced saturation current, the delay time of the pipeline
registers will increase. In order to retain the original throughput rate, the channel width W
of the clocked transistors will have to be resized according to:

$$W' = W \cdot \frac{(V_{dd} - V_{th})^{y}}{(V_{clock} - V_{th})^{y}}$$

(2.22)

This will increase the gate capacitance portion of $C_L$. But since this gate capacitance is
only a small part of $C_L$, a reduction of the power dissipation can be
obtained even for y = 2. For y = 1, the required increase of the channel width and gate
capacitance is smaller and the power dissipation saving is correspondingly higher.
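As a rough illustration of the resizing rule, the sketch below evaluates the width ratio W'/W for the two limiting exponents. It assumes the rule has the form W' = W ((V_dd - V_th) / (V_clock - V_th))^y; the voltages used in the example are hypothetical.

```python
def resized_width(width, vdd, vclock, vth, y):
    """Width needed to restore the saturation current at reduced clock swing.

    Assumes W' = W * ((Vdd - Vth) / (Vclock - Vth)) ** y.
    """
    if vclock <= vth:
        raise ValueError("clock swing must exceed the threshold voltage")
    return width * ((vdd - vth) / (vclock - vth)) ** y


if __name__ == "__main__":
    # Hypothetical values: 3.3 V supply, half-swing clock, 0.6 V threshold.
    for y in (1, 2):
        scale = resized_width(1.0, 3.3, 1.65, 0.6, y)
        print(f"y={y}: W'/W = {scale:.2f}")
```

The required widening is much smaller for y = 1 than for y = 2, which is why the saving is larger for sub-micrometer devices.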

If the clock drivers could be supplied with $V_{clock}$ instead of $V_{dd}$ (e.g., from an external
source), then the power dissipation in the clock system is:

$$P_{cs}'' = \frac{1}{t} \cdot C_L \cdot V_{clock}^2$$

(2.23)

or, with (2.18),

$$P_{cs}'' = \frac{1}{t} \cdot C_L \cdot r^2 \cdot V_{dd}^2$$

(2.24)

In this case, due to the quadratic dependence of $P_{cs}''$ on r, a more significant saving of
power dissipation of the clock system on the chip could be obtained. However, the power
dissipated in the additional external source or level conversion circuit, which is equal to

$$P_{ext} = \frac{1}{t} \cdot C_L \cdot V_{clock} \cdot (V_{dd} - V_{clock})$$

(2.25)

or, with (2.18),

$$P_{ext} = \frac{1}{t} \cdot C_L \cdot r \cdot (1 - r) \cdot V_{dd}^2$$

(2.26)

is only moved off-chip and not removed from the total system power dissipation, which
is an important cost factor for, e.g., portable systems. So even if the clock system is
supplied from a reduced supply voltage, the total power dissipation at the system level is
still given by (2.20).
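The three supply scenarios can be compared with a short calculation. This is a sketch only: the function names are illustrative, and the off-chip term is computed as the difference between (2.20) and (2.24), which is what the statement about the total system power implies.

```python
def full_swing_power(c_load, vdd, period):
    """Clock network power at full swing: C_L * Vdd**2 / t."""
    return c_load * vdd ** 2 / period


def reduced_swing_power_from_vdd(c_load, vdd, period, r):
    """Equation (2.20): swing r*Vdd, drivers still supplied from Vdd."""
    return r * c_load * vdd ** 2 / period


def reduced_swing_power_on_chip(c_load, vdd, period, r):
    """Equation (2.24): drivers supplied directly with V_clock = r*Vdd."""
    return r ** 2 * c_load * vdd ** 2 / period


def off_chip_power(c_load, vdd, period, r):
    """Power left in the external source: (2.20) minus (2.24)."""
    return (reduced_swing_power_from_vdd(c_load, vdd, period, r)
            - reduced_swing_power_on_chip(c_load, vdd, period, r))


if __name__ == "__main__":
    # Hypothetical values: 100 pF load, 3.3 V supply, 10 ns period, half swing.
    c_load, vdd, period, r = 100e-12, 3.3, 10e-9, 0.5
    base = full_swing_power(c_load, vdd, period)
    on_chip = reduced_swing_power_on_chip(c_load, vdd, period, r)
    print(f"on-chip saving at r={r}: {100 * (1 - on_chip / base):.0f}%")
```

With a half-swing clock (r = 0.5) the on-chip term falls to one quarter of the full-swing value, which is consistent with the reduction of about 75% quoted for the half-swing driver circuit.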

To achieve a considerable power reduction in synchronous circuits, the circuit
shown in Figure 2.16 could be used. It is basically a circuit for on-chip generation
of reduced-swing clock signals, requiring no extra supply voltage. This circuit makes it
possible to achieve a true reduction of power dissipation as described in (2.24). The circuit
exploits the principle of charge sharing between the capacitances of complementary clock
phases and additionally reduces the swing of the clock signals to approximately half the
supply voltage. In this way, both principles proposed to reduce the power dissipation in the
clock system are combined in a complementary non-overlapping clock system for which the
principle of distributed registers can be exploited, allowing a hardware-efficient
implementation of the pipeline latches as simple transmission gates, as described before.


Figure 2.16 Principle and CMOS implementation of complementary-phase clock driver

The circuit of Figure 2.16 requires no additional overhead compared to the standard
solution. Also, the widths of the clock driver transistors can be retained, so the silicon
area remains unaffected. While a power dissipation reduction of about 75% can be
achieved, the penalty in the throughput rate can be compensated by resizing the width of
the clocked transistors of the latches according to (2.22). This will increase the gate
capacitance part of the clock load, but will only slightly increase the total clock
capacitance.

CHAPTER 3
INTERCONNECT MODELING IN DSM

Introduction

Modern technology has demonstrated a relentless trend toward faster circuits, larger die
sizes, shorter rise times, and smaller pulse widths. Simultaneously, the minimum feature
sizes have been aggressively decreased. Therefore, interconnect delay plays an important
role in the overall system delay. The effects of interconnect are manifold. For instance, it
is one of the causes of clock skew in synchronous systems. Therefore, while timing issues
like clock skew are inevitably important, accurate modeling of interconnects has also
become very important for predicting realistic delay and for performing timing analysis
of the chip. Also, on-chip inductance effects, such as delay increase, overshoot, and
inductive crosstalk, can no longer be ignored. For performance considerations, some
global signal and clock wires are routed with large widths and thicknesses at the top metal
levels to minimize interconnect delays. This decreases the resistance of the wires,
which makes the wire impedance due to inductance comparable to that due to resistance.
As in [8], the self-inductance of a straight wire with length $L_{wire}$, width W, and
thickness $t_{int}$ can be estimated with the following formula:

$$L_{self}(\mathrm{nH}) = 2 L_{wire} \cdot \left( \ln\left(\frac{2 L_{wire}}{W + t_{int}}\right) + 0.5 - k \right)$$

(3.1)

where k depends on W and $t_{int}$, with 0 < k < 0.0025, and $L_{wire}$, W, and $t_{int}$ are in cm.
Using this inductance estimate, Figure 3.1 compares, for lengths ranging from 100 to
10,000 microns, the impedance due to inductance, $2\pi f L$, with the resistance,
$\rho L_{wire}/(W t_{int})$, at f = 1 GHz for Al wire with resistivity $\rho = 3.85 \times 10^{-6}$ Ω-cm,
widths of 1.25, 2.5, 5.0, and 10.0 µm, and a thickness of 1.9 microns, corresponding to a
typical top-level metallization thickness in 0.18 µm technology. Also shown in Figure 3.1
are the inductance values. As can be seen, the inductive line impedance is larger than the
line resistance for widths wider than 2.5 µm, and the inductances may be larger than
several nH for most cases and are much larger than the typical bonding wire inductances
of IC packages.
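The comparison behind Figure 3.1 can be reproduced approximately from (3.1). The sketch below is illustrative only: it assumes the form L_self = 2 L_wire (ln(2 L_wire / (W + t_int)) + 0.5 - k) with k set to zero (the text only bounds k between 0 and 0.0025), and it uses the resistivity, thickness, and widths quoted above.

```python
import math

RHO_AL_OHM_CM = 3.85e-6   # Al resistivity quoted in the text, in ohm-cm
UM_TO_CM = 1e-4           # micrometers to centimeters


def self_inductance_nh(length_cm, width_cm, thickness_cm, k=0.0):
    """Equation (3.1); k = 0 is an assumption (0 < k < 0.0025 in the text)."""
    return 2.0 * length_cm * (
        math.log(2.0 * length_cm / (width_cm + thickness_cm)) + 0.5 - k)


def wire_resistance_ohm(length_cm, width_cm, thickness_cm,
                        rho_ohm_cm=RHO_AL_OHM_CM):
    """DC resistance rho * L / (W * t)."""
    return rho_ohm_cm * length_cm / (width_cm * thickness_cm)


def inductive_reactance_ohm(inductance_nh, freq_hz):
    """Reactance 2*pi*f*L with L given in nH."""
    return 2.0 * math.pi * freq_hz * inductance_nh * 1e-9


if __name__ == "__main__":
    length = 2000 * UM_TO_CM      # example length: 2000 um
    thickness = 1.9 * UM_TO_CM
    for width_um in (1.25, 2.5, 5.0, 10.0):
        width = width_um * UM_TO_CM
        l_nh = self_inductance_nh(length, width, thickness)
        x = inductive_reactance_ohm(l_nh, 1e9)
        r = wire_resistance_ohm(length, width, thickness)
        print(f"W={width_um:5.2f} um  L={l_nh:.2f} nH  "
              f"X={x:.1f} ohm  R={r:.1f} ohm")
```

The 2000 µm length is an arbitrary choice within the range plotted in Figure 3.1; other lengths can be substituted directly.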

As the clock frequency increases and the rise times decrease, electrical signals comprise
more and more high-frequency components; for a 600 MHz clock with a 100 ps rise time,
at least 15% of its frequency components are above 3.4 GHz. This frequency is defined
as the significant frequency of the pulse.

With the increase in chip size, it is fairly typical that many wires are long and run in
parallel, which increases the inductive crosstalk. Actually, the inductive crosstalk
increases with the parallel length and saturates near 1 cm length.

With the push for performance, some low-resistivity metals have been explored to replace
Al in order to minimize wire RC delays. Copper, for instance, has low resistivity.


Figure 3.1 Resistance and inductive reactance of wires at 1 GHz for different lengths and
widths. (The superlinear behavior of the inductive reactances can be seen because, as the
length increases, their curves move further away from the R (W = 2.5 µm) curve.) [8]

Because of copper's low resistivity, the wire inductive reactance would be larger than
the resistance. Many companies have started using copper as an interconnect material.

Therefore, circuit designers have to consider on-chip inductance in order to ensure the
functional correctness and performance of their designs. When we compute the
inductance, the most important thing to know is where the return currents are. The current
flowing on a wire has to come back to the wire's driver to close the loop. The impedance of
the receiver at the wire's far-end terminal is only one part of the current loop. The
electrical behavior of the current loop is determined instead by the whole loop's
reactance. The self-inductance of the wire is affected by the inductive coupling from the
return currents, so the inductive reactance is a function of the size of the loop. The smaller
the loop, the smaller the inductance. Currents always return through the path of smallest
impedance. Therefore, minimizing inductance requires the complete consideration of R, L, and
C.

Differences in On-Chip Inductance Consideration

Previously, inductances were considered only for off-chip interconnects, such as those
traces on PCB and bonding wires in IC packages. When we shift our attention to the
inductance of on-chip wires, some differences in consideration need to be observed.

We need to consider the internal inductance for on-chip wires. Because the skin depths at
the frequencies that we consider are comparable to the wire thickness or width, most
electrical currents flow inside the wires. On the other hand, the skin depths are much
smaller than the dimensions of off-chip interconnects or planes; hence, we can assume
the currents flow on the surface of those interconnects or planes.
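To get a feel for the numbers, the skin depth can be estimated with the standard expression for a non-magnetic conductor, delta = sqrt(rho / (pi f mu_0)). The sketch below is illustrative; it assumes a relative permeability of 1 and reuses the Al resistivity of 3.85 x 10^-6 Ω-cm quoted earlier in this chapter.

```python
import math

MU_0 = 4e-7 * math.pi  # permeability of free space, H/m


def skin_depth_m(resistivity_ohm_m, freq_hz):
    """Skin depth of a non-magnetic conductor: sqrt(rho / (pi * f * mu_0))."""
    return math.sqrt(resistivity_ohm_m / (math.pi * freq_hz * MU_0))


if __name__ == "__main__":
    rho_al = 3.85e-8  # 3.85e-6 ohm-cm expressed in ohm-m
    for f in (1e9, 3.4e9, 10e9):
        print(f"f={f / 1e9:4.1f} GHz  skin depth = "
              f"{skin_depth_m(rho_al, f) * 1e6:.2f} um")
```

At 1 GHz this gives a skin depth of roughly 3 µm, the same order as the 1.9 µm top-level metal thickness, which supports treating the current as flowing inside on-chip wires.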

Figure 3.2 Two signal wires and one ground wire

To understand how large the internal inductances can be, let us consider the interconnect
structure shown in Figure 3.2. RC2 is a two-dimensional field solver for capacitance,
inductance, and resistance; RI3 is a three-dimensional inductance solver that takes the
skin depths at different frequencies into consideration. The per-unit-length inductance
matrix L' is determined by RC2 through the inversion of the per-unit-length free-space
capacitance matrix C' as shown in equation (3.2):

$$L' = \frac{1}{c^2} \cdot [C']^{-1}$$

(3.2)

where c is the speed of light in free space. Therefore, the inductance given by RC2 is
only the external inductance. This is because the capacitance calculation assumes
that all the fields are outside the conductors. Inverting the capacitance matrix to
obtain the inductance matrix assumes that the signal propagates on the surface of the
conductors. Table 3.1 summarizes the inductance values at different frequencies. As the
frequency increases, the difference between the RC2 and RI3 results becomes
smaller and smaller, because the skin depth gets smaller and smaller.
At very high frequencies, both results will converge.

Due to the lack of highly conductive ground planes on chip, the mutual couplings between
wires cover long ranges and decrease very slowly with increasing spacing. Figure 3.3 shows
two interconnect structures, (a) and (b). Structure (a) does not have a ground plane, while
(b) has a ground plane 1.2 µm above the wires.

Figure 3.3 Two interconnect structures: (a) has no ground plane and (b) has a ground
plane 1.2 µm above the wires. All the wires are 2000 µm long and the spacing is varied.
The leftmost wire in (a) is a ground wire.

To calculate the mutual inductances between the wires, we can vary the spacing between
them for both structures and use RI3. The coupling coefficients found are summarized
in Table 3.2. We see that the inductive coupling decreases very slowly without a ground
plane, for structure (a). As the spacing increases by two orders of magnitude, from 1 µm
to 100 µm, the coupling coefficient decreases by only 28%. On the other hand, for
structure (b), with a ground plane, the coupling coefficient decreases almost linearly with
the increase of spacing. At 1 µm apart, the coupling with a ground plane is only 40% of
that without a ground plane.

The inductance of on-chip wires is not scalable with length. As demonstrated in equation
(3.1), the self-inductance of an on-chip wire increases with length at an n log n rate, which
is also shown in Figure 3.1. However, as can be seen in Figure 3.1, beyond a long length
(8000 µm) the inductance curves look like linear curves, owing to the property of the n
log n function, so that inductance appears to be scalable. But since most on-chip wires
are much shorter than 8000 µm, scalability cannot be applied.
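The n log n growth can be made explicit from (3.1). As a short worked step, assuming the estimate has the form L_self = 2 L_wire (ln(2 L_wire / (W + t_int)) + 0.5 - k) and that k stays the same for both lengths (an assumption, since k depends on the geometry), doubling the length gives:

```latex
\begin{aligned}
L_{self}(2 L_{wire})
  &= 4 L_{wire} \left( \ln\frac{4 L_{wire}}{W + t_{int}} + 0.5 - k \right) \\
  &= 2\, L_{self}(L_{wire}) + 4 L_{wire} \ln 2
\end{aligned}
```

The extra term $4 L_{wire} \ln 2$ (in nH, with $L_{wire}$ in cm) is why doubling the length more than doubles the inductance. Its share relative to $2 L_{self}(L_{wire})$ shrinks as the logarithm grows, which is consistent with the nearly linear appearance of the curves at long lengths.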

Unfortunately, no good approximation formula exists for mutual inductances of two
parallel lines of unequal lengths or unequal dimensions. The mutual coupling for them
needs to be determined by a field solver. The non-scalability of inductance is due to the
nonuniformity of self or mutual coupling near both terminals of the wires. Because the
inductive couplings extend to very long ranges, the terminal effect compared to the whole
only diminishes with a very long length. This is different from the coupling capacitances
of on-chip wires. The couplings between wires or at terminals extend only to very short
ranges. Therefore, after a short length, the coupling capacitances become scalable.

Inductive Reactance

The inductive reactance $j\omega L$ is directly proportional to the signal frequency; therefore,
in order to estimate the significance of the inductance impact, designers need to know the
magnitudes of the signal at different frequencies. At high frequencies, where the
inductance effects become significant, the amplitude spectra of trapezoidal pulses with
different pulse widths and rise times are first derived. A spectrum shows the distribution of
the components of a pulse over different frequencies. The spectrum indicates to designers
whether inductance needs to be considered. The importance of wire inductive reactance
may also be determined by looking at the ratio between $j\omega L$ and R, where ω is 2π times
the significant frequency, and L and R are the per-unit-length wire inductance and
resistance, respectively. If the ratio is greater than one, inductance effects will be very
significant.
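The ratio test can be sketched in a few lines. Note the assumption: the significant frequency is taken here as 0.34 divided by the rise time, a rule of thumb that reproduces the 3.4 GHz figure quoted earlier for a 100 ps rise time but is not stated explicitly in this text. The per-unit-length values in the example are hypothetical.

```python
import math


def significant_frequency_hz(rise_time_s, factor=0.34):
    """Significant frequency, assuming the rule f_s = 0.34 / t_rise."""
    return factor / rise_time_s


def reactance_to_resistance_ratio(l_per_len, r_per_len, freq_hz):
    """Ratio (2*pi*f*L) / R for per-unit-length L and R."""
    return 2.0 * math.pi * freq_hz * l_per_len / r_per_len


if __name__ == "__main__":
    f_s = significant_frequency_hz(100e-12)
    # Hypothetical per-unit-length values: 13.7 nH/cm and 40.5 ohm/cm.
    ratio = reactance_to_resistance_ratio(13.7e-9, 40.5, f_s)
    print(f"f_s = {f_s / 1e9:.1f} GHz, wL/R = {ratio:.2f}")
```

A ratio above one flags the net as one where inductance effects are expected to be significant.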

Frequency Dependent Resistance And Inductance

Both the resistance and inductance of an interconnect are frequency dependent at high
frequency. A conventional circuit simulator, such as SPICE, is not capable of handling
frequency-dependent RLC elements; only the resistance and the inductance at one
frequency can be considered. Many different models can be formulated to simulate
the effect of inductance at high frequency. One such model is described here, as
shown in Figure 3.4.


Figure 3.4 (a) Four signal wires on top of orthogonal wires ; (b) Cross-section view with
parasitic capacitances.

There are four signal wires, each of which is assumed to be 5 µm wide, 1.5 µm thick, and
2000 µm long, and all of which lie on top of a layer of orthogonal wires with equal width
and spacing, representing a layer of almost 100% routing density. As mentioned earlier,
since inductance is not scalable, there are two ways to proceed: either determine
the inductance at a particular interconnect length empirically (using tools like FASTHENRY,
developed at MIT), or assume the inductance values by obtaining the parasitic
capacitance values from extraction tools like Cadence Hyperextract and carefully
comparing the inductive and capacitive reactances at high frequency. In this work, we
have adopted the second approach.

To combine the resistance, inductance, and capacitances, the wires have to be partitioned
into sections. The capacitors are inserted at section points. Figure 3.5 shows N
equal-length sections. To achieve enough accuracy, the section length should not be too long.
Here we use 250 µm sections, i.e., eight sections (also in agreement with the modeling
methodology used in Cadence tools). The R and L for each section are equal to the wire's
total parasitic values divided by N (i.e., 8). The C for each section is equal to the
per-unit-length parasitic capacitance multiplied by the section length. Half of each
section's capacitance is inserted at each of the two terminals of the section.
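The partitioning rule can be sketched as follows. The function name and the example parasitic values are hypothetical; only the 2000 µm length, the eight sections, and the half-capacitance-at-each-terminal rule come from the text.

```python
def partition_wire(r_total, l_total, c_per_um, length_um, n_sections):
    """Split a wire into equal RLC sections.

    Returns (section length, R per section, L per section, node capacitances).
    Each section places half of its capacitance at each of its two terminals,
    so interior nodes receive contributions from two adjacent sections.
    """
    section_len = length_um / n_sections
    r_sec = r_total / n_sections
    l_sec = l_total / n_sections
    c_sec = c_per_um * section_len
    node_caps = [0.0] * (n_sections + 1)
    for i in range(n_sections):
        node_caps[i] += c_sec / 2.0
        node_caps[i + 1] += c_sec / 2.0
    return section_len, r_sec, l_sec, node_caps


if __name__ == "__main__":
    # Hypothetical totals: 8 ohm, 2.8 nH, 0.2 fF/um for a 2000 um wire.
    sec_len, r_sec, l_sec, caps = partition_wire(8.0, 2.8e-9, 0.2e-15, 2000.0, 8)
    print(f"section length = {sec_len} um, R = {r_sec} ohm, L = {l_sec:.3e} H")
    print("node capacitances (F):", caps)
```

The end nodes carry half a section's capacitance and the interior nodes a full section's worth, so the total wire capacitance is preserved.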

Now, as shown in Figure 3.5, we depart from the traditional approach of treating the
capacitors C10, C20, C30, and C40 as grounded capacitors. This is because, in reality, most
current flowing into the orthogonal wires has to travel a long way to their drivers or
receivers and then through these devices to the ground node, which makes the ground
node assumption too optimistic.

Figure 3.5 The equivalent RLC circuit for the four-wire structure[8]

Instead, we use the approach demonstrated in Figure 3.6 to eliminate the
orthogonal layer and to distribute the capacitances into extra parasitic capacitances
between signal wires. In reality, an orthogonal wire is a series of resistors. Since the wire
spacing is small, the resistances are small and can be neglected. The resulting model is
shown in Figure 3.6 (d), where

(3.3)

i,j = 1..4
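One way to picture this network reduction is the standard star-to-mesh transformation for capacitors that share a floating common node. The sketch below assumes that this is the reduction intended, giving extra capacitances C'ij = Ci0 Cj0 / (C10 + C20 + C30 + C40); this form is an assumption and is not taken from the text. The capacitance values are hypothetical.

```python
def star_to_mesh(c_star):
    """Replace capacitors from each wire to a floating common node
    by equivalent wire-to-wire capacitors.

    Assumes C'_ij = C_i0 * C_j0 / sum_k C_k0 (star-to-mesh transformation).
    Keys are 1-based wire index pairs (i, j) with i < j.
    """
    total = sum(c_star)
    n = len(c_star)
    return {(i + 1, j + 1): c_star[i] * c_star[j] / total
            for i in range(n) for j in range(i + 1, n)}


if __name__ == "__main__":
    # Hypothetical capacitances C10..C40 to the orthogonal layer, in fF.
    extra = star_to_mesh([20.0, 20.0, 20.0, 20.0])
    for pair, value in sorted(extra.items()):
        print(f"C'{pair[0]}{pair[1]} = {value:.2f} fF")
```

Each extra capacitance would be added in parallel with the direct coupling capacitance already present between the same pair of signal wires.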

Figure 3.6 (a) Original structure with orthogonal layer [8]

Figure 3.6 (b) The conductor is replaced by resistors [8]

Figure 3.6 (c) The resistors are neglected [8]

Figure 3.6 (d) Extra caps are introduced between wires after network reduction [8]

For our simulations, we employed both the RC and RLC models described above. We
applied a pulse signal with a 1 ns period, with the delay time, rise time, and fall time each
set to 50 ps, and simulated the models using the Cadence Spectre simulator.

As shown in Figure 3.7, the basic circuit component for all our simulations was a simple
CMOS inverter. We used BSIM3v3 models for our transistors in the inverter circuit.


Figure 3.7 CMOS inverter as our basic component for simulations

We first simulated a circuit using RC model of interconnect. We applied a pulse signal to
the aggressor circuit and kept the victim net silent. We modeled signal coupling effect
using two coupling capacitors between the aggressor and the victim nets as distributed
coupling capacitances. Figure 3.8(a) shows the basic circuit for simulation using RC
interconnect model.


Figure 3.8 (a) Circuit showing the RC interconnect model, with the aggressor activated by a
pulse signal and the victim net silent, used to see the effect of the RC interconnect model
on the delay and of the coupling capacitance on the signal coupling.

Figure 3.8(b) shows the simulation results. In the waveforms,
OUTPUT_1 is the output at the aggressor net and OUTPUT_2 is the output at the victim
net. Due to the signal coupling effect, the switching of the aggressor net induces a signal
on the victim net even though the victim is not activated, which results in
OUTPUT_2. The waveforms also show the delay in the outputs. Note that the outputs show
an overdamped response, as discussed earlier in the discussion of the interconnect models.
Figure 3.8(c) shows a magnified view of Figure 3.8(b). As can be seen from Figure
3.8(c), a delay of 29.542 ps was obtained.

[Plot: waveforms INPUT_1, INPUT_2, OUTPUT_1, and OUTPUT_2 versus time (s).]

Figure 3.8 (b) Output waveforms of the circuit of Figure 3.8(a), showing the effect of the RC
interconnect model on the delay and of the coupling capacitance on the signal coupling on
the victim.

[Plot: magnified waveforms INPUT_1, INPUT_2, OUTPUT_1, and OUTPUT_2 versus time (s), with delay cursor readout.]

Figure 3.8 (c) Magnified view of Figure 3.8 (b) shown to calculate the delay in the
outputs at the aggressor and the victim nets. The outputs show overdamped response as
can be seen from the peaks at the termination of the output waveforms. The delay as
calculated from the waveforms is 29.542 ps.

When Do We Need To Consider On-Chip Inductance ?

From our simulation results and the discussion of the significant inductance impacts
in the previous section, we definitely do not want to overlook potential
inductance problems in our designs. Several methods or figures of merit to
characterize the importance of on-chip inductance have been published in technical
papers. They can be used to answer the question of when we need to consider on-chip
inductance. Circuit designers rely on these rules to know whether the consideration of
inductance is necessary in their design verification. Because accurate and stable RLC
delay and crosstalk estimations still rely on computation-intensive simulations,
determining which nets require the special consideration of inductance is important in
order to verify a chip's signal integrity efficiently. Based on effective inductance
screening, designers may apply RC delay and crosstalk prediction methods to the remaining nets.

Basically, the interconnect is treated as a uniform RLC transmission line. Based on
transmission line analysis, the following figure of merit for an interconnect of length
$L_{wire}$ is introduced. Inductance needs to be considered if

$$\frac{t_{risetime}}{2\sqrt{L C}} < L_{wire} < \frac{2}{R}\sqrt{\frac{L}{C}}$$

(3.4)

where R, L, and C are the per-unit-length resistance, self-inductance, and coupling
capacitance, respectively, and $t_{risetime}$ is the rise time of the signal at the input of the
CMOS circuit driving the interconnect.
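The screening rule lends itself to a direct check per net. The following is a minimal sketch, assuming the window of (3.4) reads t_risetime / (2 sqrt(LC)) < L_wire < (2/R) sqrt(L/C); the function names and the example per-unit-length values are hypothetical.

```python
import math


def inductance_window(r, l, c, rise_time):
    """Length window of (3.4) for per-unit-length R, L, C.

    Returns (lower, upper) in the same length unit used for R, L, and C.
    """
    lower = rise_time / (2.0 * math.sqrt(l * c))
    upper = (2.0 / r) * math.sqrt(l / c)
    return lower, upper


def needs_inductance(length, r, l, c, rise_time):
    """True if the wire length falls inside the window of (3.4)."""
    lower, upper = inductance_window(r, l, c, rise_time)
    return lower < length < upper


if __name__ == "__main__":
    # Hypothetical values per cm: 40.5 ohm/cm, 13.7 nH/cm, 2 pF/cm; 50 ps rise.
    r, l, c, t_rise = 40.5, 13.7e-9, 2e-12, 50e-12
    lower, upper = inductance_window(r, l, c, t_rise)
    print(f"window: {lower:.3f} cm < L_wire < {upper:.3f} cm")
    print("0.2 cm wire needs inductance:", needs_inductance(0.2, r, l, c, t_rise))
```

Nets whose length falls outside the window can be left to RC-only analysis, which is the purpose of the screening.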

Equation (3.4) is the conjunction of two rules. The first rule, $L_{wire} < (2/R)\sqrt{L/C}$,
identifies the lengths for which the transmission line's equivalent RLC circuit is not
overdamped. The equivalent circuit in Figure 3.9 is created by using a single-RLC-section
approximation, where $L_t = L \cdot L_{wire}$, $R_t = R \cdot L_{wire}$, and $C_t = C \cdot L_{wire}$.
The poles of the circuit are

$$s_{1,2} = \frac{-R_t \pm \sqrt{R_t^2 - 4 L_t / C_t}}{2 L_t}$$

(3.5)

and the damping factor ζ is

$$\zeta = \frac{R \cdot L_{wire}}{2} \sqrt{\frac{C}{L}}$$

(3.6)

Figure 3.9 Simple RLC circuit model for an interconnect line

If ζ is greater than one, the equivalent circuit is overdamped and has small inductance
effects. The greater the value of ζ, the more accurate the RC model becomes. On the
other hand, as ζ becomes less than one, the circuit becomes underdamped, the poles
become complex, and oscillations occur. In that case, inductance cannot be neglected.
Therefore, we have the condition $L_{wire} < (2/R)\sqrt{L/C}$, under which inductance needs to be
considered.
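The link between the damping factor and the nature of the poles can be checked numerically. This sketch assumes the single-section model is a series R_t and L_t driving C_t, with characteristic polynomial L_t C_t s^2 + R_t C_t s + 1, and a damping factor of (R L_wire / 2) sqrt(C / L); the numbers in the example are hypothetical and unitless.

```python
import cmath
import math


def damping_factor(r, l, c, length):
    """Damping factor (R * L_wire / 2) * sqrt(C / L) for per-unit-length R, L, C."""
    return 0.5 * r * length * math.sqrt(c / l)


def poles(r, l, c, length):
    """Poles of the single-section RLC model (roots of Lt*Ct*s^2 + Rt*Ct*s + 1)."""
    rt, lt, ct = r * length, l * length, c * length
    root = cmath.sqrt(rt ** 2 - 4.0 * lt / ct)
    return (-rt + root) / (2.0 * lt), (-rt - root) / (2.0 * lt)


if __name__ == "__main__":
    for length in (0.5, 1.0, 2.0):
        zeta = damping_factor(2.0, 1.0, 1.0, length)
        p1, p2 = poles(2.0, 1.0, 1.0, length)
        kind = "underdamped" if zeta < 1.0 else "not underdamped"
        print(f"length={length}: zeta={zeta:.2f} ({kind}), poles={p1}, {p2}")
```

The poles acquire a nonzero imaginary part exactly when the damping factor drops below one, which is the regime where the RC approximation breaks down.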

The second rule in (3.4), $t_{risetime}/(2\sqrt{LC}) < L_{wire}$, is introduced to ensure the
waveform agreement between the analytical solutions of the transmission line's
characteristic impedance approximation and its RC approximation. Figure 3.10 shows
these two approximations, with the driving circuit considered. The characteristic
impedance of an RLC transmission line looks like a resistance in series with a
capacitance, as defined by:

$$Z_0 = R_0 + \frac{1}{j \omega C_0}$$

(3.7)

where

(3.8)

(3.9)

Both $R_0$ and $C_0$ monotonically decrease with increasing frequency and saturate to the
asymptotic values given by

$$R_{0asym} = \sqrt{\frac{L}{C}}$$

(3.10)

$$C_{0asym} = \frac{2\sqrt{L C}}{R}$$

(3.11)

when the frequency is beyond R/L. They are used in the circuit of Figure 3.10 (a).

Figure 3.10 (a) A CMOS inverter driving the equivalent characteristic impedance of an
RLC transmission line. (b) A CMOS inverter driving an RC approximation of a
transmission line.
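The series resistance and capacitance seen in the characteristic impedance can be evaluated numerically. The sketch below assumes the usual expression Z_0 = sqrt((R + jωL) / (jωC)) for a line with negligible shunt conductance and extracts R_0 as the real part and C_0 from the imaginary part; the asymptotes sqrt(L/C) and 2 sqrt(LC)/R follow from expanding this expression for ω much larger than R/L. Values in the example are normalized and hypothetical.

```python
import cmath
import math


def characteristic_impedance(r, l, c, omega):
    """Z0 = sqrt((R + j*w*L) / (j*w*C)), assuming negligible shunt conductance."""
    return cmath.sqrt((r + 1j * omega * l) / (1j * omega * c))


def series_rc_equivalent(r, l, c, omega):
    """Return (R0, C0) such that Z0 = R0 + 1 / (j*w*C0)."""
    z0 = characteristic_impedance(r, l, c, omega)
    return z0.real, -1.0 / (omega * z0.imag)


def asymptotes(r, l, c):
    """High-frequency limits: R0 -> sqrt(L/C), C0 -> 2*sqrt(L*C)/R."""
    return math.sqrt(l / c), 2.0 * math.sqrt(l * c) / r


if __name__ == "__main__":
    r, l, c = 1.0, 1.0, 1.0  # normalized per-unit-length values
    print("asymptotes:", asymptotes(r, l, c))
    for omega in (0.1, 1.0, 10.0, 1000.0):
        r0, c0 = series_rc_equivalent(r, l, c, omega)
        print(f"omega={omega:7.1f}  R0={r0:.4f}  C0={c0:.4f}")
```

Sweeping ω upward shows both quantities falling toward their asymptotes once ω exceeds R/L.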

Figure 3.11(a) shows the basic circuit for simulation using the RLC model of interconnect. In
this circuit, the aggressor is activated with a 1 ns period pulse signal and the victim is kept
silent. The purpose of simulating this circuit was to understand the effect of on-chip
inductance on the delay, in addition to the delay caused by interconnect
capacitance. Figure 3.11(b) shows the simulation results. Figure 3.11(c) is a magnified
view of Figure 3.11(b), used to calculate the delay. Note the difference in the delay obtained
by simulating the RC model and the RLC model of interconnect. The additional
delay in the RLC model can be attributed to inductance.


Figure 3.11 (a) Circuit showing the RLC interconnect model, with the aggressor activated
by a pulse signal and the victim net silent, used to see the effect of the RLC interconnect
model on the delay and of the coupling capacitance on the signal coupling.

[Plot: waveforms INPUT_1, INPUT_2, OUTPUT_1, and OUTPUT_2 versus time (s).]

Figure 3.11 (b) Output waveforms of circuit Figure 3.11 (a).

[Plot: magnified waveforms INPUT_1, INPUT_2, OUTPUT_1, and OUTPUT_2 versus time (s), with delay cursor readout.]

Figure 3.11 (c) Magnified view of Figure 3.11 (b) shown to calculate the delay in the
outputs at the aggressor and the victim nets. The outputs show underdamped response as
can be seen from the peaks at the termination of the output waveforms. The delay as
calculated from the waveforms is 163.241 ps.

From the simulation results, the delay at the output of the aggressor and victim nets was
calculated to be 163.241 ps, compared to the delay of 29.542 ps obtained by simulating
the RC interconnect model. This additional delay is caused by the on-chip inductance.
Therefore, in high-speed digital circuits and systems, it is essential to consider
the inductance while modeling the interconnect.

On-Chip Design Solutions To Cope With Inductive Coupling

From the previous discussions, we can understand that the inductance effects, such as
large oscillations and crosstalk noise, are mainly caused by oversized drivers and
low-resistance interconnects. Knowing these facts, we need to be aware of some chip design
solutions to minimize the inductance effects in advanced high-speed designs where
oversized drivers and low-resistance wires are employed.

1) Dedicated Ground Wires

Although the partial inductance is determined only by the geometry, the loop inductance
is determined not only by the geometry but also by the current returns. Because the
current has to return to the sources (the driver source or the noise source), what really
matters to delays and crosstalk noise is the loop inductance. The smaller the loop
inductance, the smaller the delay or noise. We would like to present a small example to
explain this.

Figure 3.12 Two wires: Wire1 is driven by a driver and Wire2 provides the current return

Consider Figure 3.12. It shows a current loop composed of two parallel wires. Wire1 is
driven by a driver and Wire2 is the path for the current return. The loop inductance
$L_{loop}$ can be calculated as follows:

$$L_{loop} = L_{11} + L_{22} - L_{12} - L_{21}$$

(3.12)

where $L_{11}$ and $L_{22}$ are self-inductances, and $L_{12}$ and $L_{21}$ are mutual inductances.
Since the currents in the two wires flow in opposite directions, minus signs are in front of
the mutual terms. From these minus signs, we know that if the coupling between the signal
wire and its return path is strong, the loop inductance is small; therefore, the return path
should be as close as possible. It is also obvious from equation (3.12) that the loop
inductance is small if each self-inductance is small, which can be achieved by having
multiple return paths or by having wider return paths. Hence, the conclusion drawn from
studying (3.12) is that we need to have as many return paths as possible and they should
be as close as possible. In general, other signal wires may not be good return paths
because they may switch in the same direction. Furthermore, the currents returning through
signal wires have to go through devices, which may add a couple of hundred ohms of
impedance. Therefore, the best return paths are nearby dedicated ground wires.
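The effect of the mutual terms on the loop inductance can be illustrated with a small calculation. This sketch assumes the loop inductance is the sum of the two self-inductances minus the two mutual inductances, as the description of the minus signs indicates; the inductance values are hypothetical.

```python
def loop_inductance(l11, l22, l12, l21=None):
    """Loop inductance of a signal wire and its return: L11 + L22 - L12 - L21.

    If l21 is omitted it is taken equal to l12 (reciprocal coupling).
    """
    if l21 is None:
        l21 = l12
    return l11 + l22 - l12 - l21


if __name__ == "__main__":
    # Hypothetical partial inductances in nH: same wires, two return spacings.
    far_return = loop_inductance(2.7, 2.7, 1.2)
    near_return = loop_inductance(2.7, 2.7, 2.0)
    print(f"far return:  {far_return:.2f} nH")
    print(f"near return: {near_return:.2f} nH")
```

A closer return path raises the mutual inductance and therefore lowers the loop inductance, which is the argument for placing dedicated ground wires next to fast signals.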

Although we mentioned that wider return paths are desired due to their small
self-inductance, the inductance does not decrease very fast with the increase of width. Using
more dedicated ground wires is preferred. That is, if we have the wiring resources to
afford twice the width of ground wires, we should still use the same width but add one
more ground wire in between signal wires. Of course, the width cannot be too small;
otherwise, the large line resistance may overshadow its ground-return effect and the
ground wire may not provide enough shielding against inductive crosstalk.

2) Differential Signals

In order to provide enough nearby return paths for fast signals, the differential signal
technique has been used. The neighboring signals of a high-speed bus are arranged to switch
in different directions. If designers know the signals' probabilistic switching pattern, then
the signals tending to switch in the same direction must be interleaved with signals tending
to switch in the opposite direction. For example, both a signal and its logical inverse are
used. However, we need to avoid placing the signal and its complement right next to each
other, because doing so will slow down the signal when the capacitive coupling between them
is strong. They should be placed with one or two other signals in between, which shields
the capacitive coupling but still provides enough return paths through differential signals.

The worst delay or crosstalk case happens when all the aggressors are switching in the
same direction. Employing differential signals may avoid this situation.

3) Buffer Insertion

Optimum buffer insertion is another effective technique to reduce inductive coupling, as
well as delay and capacitive coupling, in an RC or RLC tree that can be formed by one or
multiple sources and one or more sinks. The originally longer current path is
shortened by guiding the current through the repeaters for a return, resulting in a smaller
current loop and hence smaller inductive coupling. We will explain this with the help of
Figure 3.13.


Figure 3.13 Buffer insertion in an RLC line for the minimization of inductive coupling as
well as capacitive coupling and propagation delay.

In this Figure, we illust~ate the buffers evenly inserted into a uniform RLC line and the
RLC each segment. Since L is a nonscalable function of length, as mentioned in the
previous sections,

Lseg

=

L1

when n =1 and

Lseg

=

2

L 1/k<nJ )

when n >1 ins!ead of Lseg =

Li/n, where k is between 2 and 2.5, depending on the surrounding interconnect

"1(\

environment and its own length. When the total length is about few thousand
micromet~rs to 8 mm, k is approximately equal to 2.3. Therefore, when more buffers are
inserted into the original RLC line; each segment has smaller Lseg as compared to L 1/n
and therefore a smaller effect from inductance on propagation delay. The mutual
coupling term, Mseg, has a similar dependence on length, Mseg = M 1/k

2
(n/ )

when n >1, if

there are coupled signal lines. Henc~, with inserted buffers, inductive coupling decreases
slightly faster than a linear rate. Each segment becomes more an RC line than an RLC
line, because L per segment scales down at a rate faster than the Rand C that are linearly
scaled with length. However, when inductance effect has significant impact on delay and/
or crosstalk, the RLC-induced crosstalk is higher than the RC-induced crosstalk before a
certain length (we call it inductance-critical length) and smaller after the length . Both
RLC-induced and RC-induced crosstalk tend to saturate when the total interconnect
length is long, and in general the RLC-induced one is smaller than the RC-induced one at
saturation. The basic reason for the above phenomenon is that the inductive crosstalk
initially induced a larger opposite inductive voltage on the victim wire due to Ldi/dt. The
inductive crosstalk is in the opposite direction of the capacitive crosstalk under the
assumption that all the aggressors switch in the same direction. Then the capacitive
crosstalk (caused by Cdv/dt) dominates on the victim signal with a larger dv/dt (due to
victim signals pulled down by inductive crosstalk and therefore larger differential voltage
with respect to the aggressors) compared to the RC-induced crosstalk alone. Therefore,
the RLC-induced crosstalk is larger up to a certain length. However, due to faster
charging during capacitive crosstalk (Cdv/dt) in the RLC case, the slew rate of the signal
is slower after the inductance-critical length and therefore the RLC-induced crosstalk
has in general a smaller saturation voltage.
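The scaling of the per-segment inductance described above can be checked numerically. The following is a minimal sketch, assuming the relation quoted in the text (segment inductance equal to L1 divided by k raised to n/2 for n > 1) and a normalized total inductance; the function names are illustrative and not part of the original work.

```python
def segment_inductance(l_total, n, k=2.3):
    """Per-segment self-inductance after splitting a line into n buffered segments.

    Uses the relation quoted in the text: L_seg = L_1 for n = 1 and
    L_seg = L_1 / k**(n/2) for n > 1, with k between 2 and 2.5.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return l_total
    return l_total / k ** (n / 2)


def linear_segment_inductance(l_total, n):
    """Naive linear scaling L_1 / n, included only for comparison."""
    return l_total / n


if __name__ == "__main__":
    L1 = 1.0  # normalized total inductance (assumed value)
    for n in (1, 2, 4, 8):
        print(n, segment_inductance(L1, n), linear_segment_inductance(L1, n))
```

With k = 2.3 the nonlinear estimate falls below the linear one for every n > 1, which is the behaviour the paragraph above relies on.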

4) Splitting Wires

The large oscillations caused by inductance effects prohibit us from using very wide
wires. Although wire sizing for RC interconnects has been a successful technique to
achieve higher performance, it may not be valid for future advanced high-speed designs
where thick or low-resistive metals are used. In order to minimize the line reactances,
wires wider than the skin depth need to be split as shown in the Figure 3.14(a) and (c). At
high frequencies, the majority of the currents are conducted around the left and right
corners of the wire, resulting in a conducting width of about two skin depths. Wires wider
than that do not achieve much further reduction in high-frequency resistance. Moreover,
for thick wires, the low-frequency self-inductance does not decrease much with the
increase of width. It only decreases by 10% with each doubling of the width. At high
frequencies, the reduction on self-inductance will be even smaller due to the skin effect.
Therefore, widening the wires does not seem to be effective in reducing line reactance.
Furthermore, wide wires have stronger couplings between them, causing larger crosstalk
noise and oscillation. On the other hand, splitting a wide wire into several, say N, parallel
wires, each about two skin depths wide, may reduce the total reactance by a factor of N.
That is a huge improvement. The width of each split wire is (1/N)th of the original width.
The spacing between split wires also helps to reduce inductance.
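To get a feel for how many wires a wide conductor would be split into, the sketch below estimates the skin depth and the resulting wire count. It assumes the standard skin-depth expression for a good conductor, which is not given in the text, and the resistivity and frequency are illustrative values only; the function names are hypothetical.

```python
import math

MU_0 = 4 * math.pi * 1e-7  # permeability of free space in H/m


def skin_depth(resistivity, frequency, mu_r=1.0):
    """Skin depth in meters, using delta = sqrt(rho / (pi * f * mu_0 * mu_r)).

    This standard formula is an assumption added for illustration.
    """
    return math.sqrt(resistivity / (math.pi * frequency * MU_0 * mu_r))


def split_count(total_width, resistivity, frequency):
    """Number N of parallel wires, each roughly two skin depths wide."""
    return max(1, round(total_width / (2 * skin_depth(resistivity, frequency))))


if __name__ == "__main__":
    rho_cu = 1.68e-8  # ohm-meter, a typical handbook value for copper (assumed)
    f = 1e9           # 1 GHz, chosen only as an example
    print(skin_depth(rho_cu, f))          # meters
    print(split_count(10e-6, rho_cu, f))  # wires for a 10 um wide conductor
```

Rounding to the nearest integer is a design choice of this sketch; a real layout would also have to respect the minimum width and spacing rules of the process.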

Combining this wire-splitting technique and the dedicated-ground-wire approach
introduced earlier, we may use the interconnect structure shown in Figure 3.14(d),
where each split wire is guarded by ground wires. This technique may be used for
noise-critical signals and clock signals. An interesting point to investigate is whether Figure
3.14(d) is faster than Figure 3.14(c) when the number of split wires is small. Figure
3.14(d) is definitely better in terms of noise immunity but it uses almost twice the wiring
resource. In such cases, it is always better to simulate both structures for a particular
application. Experimental results [8] show that Figure 3.14(c) is better in the sense that
45% of the total current flows through the center two wires, thus giving it an
improvement over Figure 3.14(d) in delay and almost twice as efficient a use of routing
resources. However, as the number of split wires increases, the performance of (c) is
expected to decrease due to insufficient ground return.

Figure 3.14 (a) A signal wire has been split into two 2.5 µm parallel wires


Figure 3.14 (b) A 7.5 µm wide wire

Figure 3.14 (c) A signal wire has been split into four parallel wires

Figure 3.14 (d) A signal wire has been split into four parallel wires, and in between every
two wires, a dedicated ground wire is inserted.


5) Continuous Power/Ground Planes

[Figure 3.15 panel labels: impedance-controlled low-loss lines; conventional multilayer interconnects]

Figure 3.15 Continuous power/ground planes on-chip provide impedance-controlled low-loss signal lines.

Inductive coupling can be greatly reduced if continuous power/ground planes are
employed on-chip, as shown in Figure 3.15. This is due to the fact that the image current
can flow in the opposite direction on the power/ground planes directly underneath the
current flows on the signal lines. This greatly reduces the excessive inductive coupling
that occurs when signals travel a long distance for current return without planes nearby.

CHAPTER 4
REAL DESIGN

In this chapter, we present our ASIC design flow, describing the design of an
asynchronous chip and its synchronous prototype using Cadence tools. The primary
goal of this work is to design asynchronous and synchronous protocols of an eight-bit
shifter and to evaluate the designs in terms of factors such as area, power
consumption, component count, interconnect length, and chip density. On the basis
of the design, we will try to conclude which protocol is better, with appropriate
justification.

Figure 4.1 below shows the ASIC design flow of the work presented in this chapter.
As the figure shows, the first stage of the design is the synthesized netlist in Verilog
description format. The netlist is basically a logical description of the various components,
macros, and modules used in the design and of how they are connected to each other by
the signals used to transfer data among them. It is important to note that the
netlist must be completely in gate-level format before it is imported into the layout tools, since
the layout and placement tools can understand only the flattened hierarchy of the design
description. After the netlist is successfully imported into the design database, a
floorplan of the chip is created. Floorplanning is a mapping between the logical
description and the physical description of the design. It allows us to predict the
interconnect delay by estimating interconnect length. Various parameters such as the
aspect ratio of the chip, I/O-to-core distance, row utilization, etc. can be initialized and
expected results can be calculated at the floorplanning stage.

SYNTHESIZED NETLIST
↓
FLOORPLANNING OF THE DESIGN
↓
PLACEMENT OF THE MODULES
↓
ROUTING OF THE DESIGN
↓
ON-CHIP PARASITIC EXTRACTION
↓
EXPORTING THE GDSII FILE TO THE LAYOUT SOFTWARE

Figure 4.1 ASIC design flow


Once satisfactory results are obtained from the design statistics, we can move on to the
placement stage, where the cells are placed in the rows. Efficient placement can be
achieved using automatic placement or by manually placing cells. The signal lines
are routed along with the power lines on the chip, thus providing the Vdd and Vss
paths to all the cells in the chip. After routing, the chip is checked for violations.
Violations may be physical or logical. Finally, the summary of the
design statistics is obtained to check various design aspects against the specifications
of the design. In this chapter, we explain the design flow in detail, one design
stage at a time.

As we scale down the feature size of the chip, the interconnect delay and the gate delay
decrease. Floorplanning allows us to predict this interconnect delay by estimating
interconnect length. Floorplanning is a mapping between the logical description (the
netlist) and the physical description (the floorplan).

The goals of floorplanning are to:
• arrange the blocks on a chip
• decide the location of the I/O pads
• decide the location and number of the power pads
• decide the type of power distribution
• decide the location and type of clock distribution

The objectives of floorplanning are to minimize the chip area and to minimize delay.

There are in fact a lot of technical terms associated with floorplanning. Let us try to
understand what these terms mean before we look at the different kinds of
floorplans and their impact on the chip.

[Figure 4.2 labels: flexible standard-cell blocks (not yet placed); flexible standard-cell blocks (with estimated placement); core boundary; fixed blocks; terminal, pin, or port location]

Figure 4.2 Floorplanning a cell-based ASIC. (a) Initial floorplan generated by the
floorplanning tool. Two of the blocks are flexible (A and C) and contain rows of standard
cells (unplaced). A pop-up window shows the status of block A. (b) An estimated
placement for flexible blocks A and C. The connector positions are known and a rat's nest
display shows the heavy congestion below block B. (c) Moving blocks to improve the
floorplan. (d) The updated display shows the reduced congestion after the changes. [9]

Figure 4.2 shows an initial random floorplan generated by a floorplanning tool. Two of
the blocks, A and C, are standard-cell areas. These are flexible blocks (or variable blocks)
because, although their total area is fixed, their shape (aspect ratio) and connector
locations may be adjusted during the placement step. The dimensions of the fixed blocks
can only be modified when they are created. We may force logic cells to be in selected
flexible blocks by seeding. We choose seed cells by name. For example, NCL_RAMx
would select all the logic cells whose names start with NCL_RAM to be placed in one
flexible block. Seeding may be hard or soft. A hard seed is fixed and not allowed to move
during the remaining floorplanning and placement steps. A soft seed is an initial
suggestion only and can be altered if necessary by the floorplanner. We may also use
seed connectors within flexible blocks, forcing certain nets to appear in a specified order
or location at the boundary of a flexible block.

The floorplanner can complete an estimated placement to determine the positions of
connectors at the boundaries of the flexible blocks. Figure 4.2(b) illustrates a rat's nest
display of the connections between blocks. Connections are shown as bundles between
the centers of blocks or as flight lines between connectors. Figures 4.2(c) and (d) show
how we can move the blocks in a floorplanning tool to minimize routing congestion.

Usually, the aspect ratio of the chip is 1. So we need to control the aspect ratio of our
floorplan to fit our chip into the die cavity (a fixed-size hole, usually square) inside a
package. Figures 4.3(a)-(c) show how we can rearrange our chip to achieve a square
aspect ratio. Figure 4.3(c) also shows a congestion map, another form of routability
display. There is no standard measure of routability. Generally the interconnect channels
have a certain channel capacity; that is, they can handle only a fixed number of
interconnects.

[Figure 4.3 legend: routing congestion shaded at 200%, 100%, and 50%]

Figure 4.3 Congestion analysis. (a) The initial floorplan with a 2:1.5 die aspect ratio. (b)
Altering the floorplan to give a 1:1 chip aspect ratio. (c) A trial floorplan with a
congestion map. Blocks A and C have been placed so that we know the terminal positions
in the channels. Shading indicates the ratio of channel density to the channel capacity.
Dark areas show regions that cannot be routed because the channel congestion exceeds
the estimated capacity. (d) Resizing flexible blocks A and C alleviates congestion. [9]

One measure of congestion is the difference between the number of interconnects that we
actually need, called the channel density, and the channel capacity. Another measure,
shown in Figure 4.3(c), uses the ratio of the channel density to the channel capacity. With
practice, we can create a good initial placement by floorplanning and a pictorial display.
This is one area where the human ability to recognize patterns and spatial relations is
currently superior to a computer program's ability. During floorplanning we assign
the areas between blocks that are to be used for interconnect. This process is known as
channel definition or channel allocation. Figure 4.4 shows a T-shaped junction between
two rectangular channels and illustrates why we must route the stem of the T before the
bar. The general problem of choosing the order of rectangular channels to route is
channel ordering.
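The two congestion measures mentioned above (the difference and the ratio between channel density and channel capacity) can be written down in a few lines. This is a minimal sketch; the channel names and numbers are hypothetical and are not taken from the designs in this chapter.

```python
def congestion_excess(channel_density, channel_capacity):
    """Difference measure: interconnects needed minus interconnects available."""
    return channel_density - channel_capacity


def congestion_ratio(channel_density, channel_capacity):
    """Ratio measure: channel density divided by channel capacity."""
    if channel_capacity <= 0:
        raise ValueError("channel capacity must be positive")
    return channel_density / channel_capacity


def unroutable_channels(channels):
    """Names of channels whose density exceeds their capacity (ratio above 100%)."""
    return [name for name, (density, capacity) in channels.items()
            if congestion_ratio(density, capacity) > 1.0]


# Hypothetical channels given as (density, capacity) pairs.
channels = {"ch1": (12, 20), "ch2": (20, 20), "ch3": (30, 15)}

if __name__ == "__main__":
    for name, (d, c) in channels.items():
        print(name, congestion_excess(d, c), f"{congestion_ratio(d, c):.0%}")
    print(unroutable_channels(channels))
```

A ratio above 100% corresponds to the dark regions of a congestion map, where the estimated capacity is exceeded.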

Figure 4.5 shows a floorplan of a chip containing several blocks. Suppose we cut along
the block boundaries, slicing the chip into two pieces (Figure 4.5a). Then suppose we can
slice each of those pieces into two. If we can continue in this fashion until all the blocks
are separated, then we have a slicing floorplan (Figure 4.5b). Figure 4.5c shows how the
sequence we use to slice the chip defines the hierarchy of the blocks. Reversing the
slicing order ensures that we route the stems of all the channel T-junctions first. In
complex designs, it is better to have a slicing floorplan to achieve good routability of the
chip.
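The idea that the routing order is the reverse of the slicing order can be illustrated with a small slicing tree. The sketch below assumes a tree encoded as nested tuples of the form (cut number, first piece, second piece) with block names as leaves; the tree shown is hypothetical and is not the one in the figure. A post-order traversal visits every later cut before the earlier cut that contains it.

```python
def routing_order(node):
    """Return cut numbers in channel-routing order using a post-order traversal.

    A node is either a block name (a leaf, given as a string) or a tuple
    (cut_number, first_piece, second_piece).
    """
    if isinstance(node, str):  # a single circuit block has no channel to route
        return []
    cut, first, second = node
    return routing_order(first) + routing_order(second) + [cut]


# Hypothetical slicing tree: cut 1 separates block A from the rest, and so on.
tree = (1, "A", (2, "B", (3, "C", (4, "D", "E"))))

if __name__ == "__main__":
    print(routing_order(tree))
```

For this tree the cuts were made in the order 1, 2, 3, 4, and the traversal returns them in the opposite order, so each stem channel is routed before the channel it ends on.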

Figure 4.6 shows a floorplan that is not a slicing structure. We cannot cut the chip all the
way across with a knife without chopping a circuit block in two. This means we cannot
route any of the channels in this floorplan without routing all of the other channels first.
We say there is a cyclic constraint in this floorplan.

[Figure 4.4 labels: channel B; block 3; "Now we can adjust channel B."]

Figure 4.4 Routing a T-junction between two channels in two-level metal. The dots
represent logic cell pins. (a) Routing channel A (the stem of the T) first allows us to
adjust the width of channel B. (b) If we route channel B first (the top of the T), this fixes
the width of channel A. We have to route the stem of a T-junction before we route the
top. [9]

[Figure 4.5 labels: route channels in this order; circuit block; cut number]

Figure 4.5 Defining the channel routing order for a slicing floorplan using a slicing tree.
(a) Make a cut all the way across the chip between the circuit blocks. Continue slicing
until each piece contains just one circuit block. Each cut divides a piece into two without
cutting through a circuit block. (b) A sequence of cuts: 1, 2, 3, and 4 that successively
slices the chip until only circuit blocks are left. (c) The slicing tree corresponding to the
sequence of cuts gives the order in which to route the channels: 4, 3, 2 and finally 1. [9]

There are two solutions to this problem. One is to move the blocks until we obtain a
slicing floorplan. The other solution is to allow the use of L-shaped, rather than
rectangular, channels (or areas with fixed connectors on all sides, called switch boxes). We need
an area-based router rather than a channel router to route L-shaped regions or switch
boxes.

Figure 4.6 Cyclic constraints. (a) A nonslicing floorplan with a cyclic constraint that
prevents channel routing. (b) In this case, it is difficult to find a slicing floorplan without
increasing the chip area. (c) This floorplan may be sliced (with initial cuts 1 or 2) and has
no cyclic constraints, but it is inefficient in area use and will be very difficult to route.

Figure 4.7(a) displays the floorplan of the ASIC shown in Figure 4.3. We can remove the
cyclic constraint by moving the blocks again, but this increases the chip size. Figure
4.7(b) shows an alternative solution. We merge the flexible standard cell areas A and C.
We can do this by selective flattening of the netlist. Sometimes flattening can reduce the
routing area because routing between blocks is usually less efficient than routing inside
the row-based blocks. Figure 4.7(b) shows the channel definition and routing order for
our chip.

[Figure 4.7 labels: cyclic constraint; channel number (in routing order)]

Figure 4.7 Channel definition. (a) We can eliminate the cyclic constraint by merging the
blocks A and C. (b) A slicing structure. [9]

Clock Planning in Boolean (Synchronous) Designs

Figure 4.8a shows a clock spine routing scheme with all clock pins driven directly from
the clock driver. Figure 4.8b shows a clock spine for a cell-based ASIC. Figure 4.8c
shows the clock-driver cell, often a part of a special clock-pad cell. Figure 4.8d illustrates
clock skew and clock latency. Since all clocked elements are driven from one net with a
clock spine, skew is caused by differing interconnect lengths and loads. If the clock-driver
delay is much larger than the interconnect delays, a clock spine achieves
minimum skew but with long latency.

[Figure 4.8 labels: clock spine; base cells; block connector; CLK; latency]

Figure 4.8 Clock distribution. (a) A clock spine for a gate array. (b) A clock spine for a
cell-based ASIC. (c) A clock spine is usually driven from one or more clock-driver cells.
Delay in the driver cell is a function of the number of stages and the ratio of output to
input capacitance for each stage (taper). (d) Clock latency and clock skew. We would like
to minimize both latency and skew. [9]

Clock skew represents a fraction of the clock period that we cannot use for computation.
A clock skew of 500 ps with a 200 MHz clock means that we waste 500 ps every 5 ns
clock cycle, or 10% of performance. Latency can cause a similar loss of performance at
the system level when we need to resynchronize our output signals with a master system
clock.
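The skew figure quoted above is simple to reproduce. The following sketch computes the fraction of the clock period lost to skew; the function name is illustrative.

```python
def skew_loss_fraction(skew_seconds, clock_hz):
    """Fraction of each clock period that is unavailable because of skew."""
    return skew_seconds * clock_hz  # equivalent to skew / period


if __name__ == "__main__":
    # 500 ps of skew on a 200 MHz clock (5 ns period), as in the text.
    print(f"{skew_loss_fraction(500e-12, 200e6):.0%}")
```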

Figure 4.8c illustrates the construction of a clock-driver cell. The delay through a chain of
CMOS gates is minimized when the ratio between the input capacitance C1 and the
output (load) capacitance C2 is about 3 (exactly e ≈ 2.7, an exponential ratio, if we
neglect the effect of parasitics). This means that the fastest way to drive a large load is to
use a chain of buffers with their input and output loads chosen to maintain this ratio, or
taper. This is not necessarily the smallest or lowest-power method, though.

Suppose we have an ASIC with the following specifications:
• 40,000 flip-flops
• Input capacitance of the clock input to each flip-flop is 0.025 pF
• Clock frequency is 200 MHz
• VDD = 3.3 V
• Chip size is 20 mm on a side
• Clock spine consists of 200 lines across the chip
• Interconnect capacitance is 2 pF cm⁻¹

In this case the clock-spine capacitance is CL = 200 × 2 cm × 2 pF cm⁻¹ = 800 pF. If we
drive the clock spine with a chain of buffers with taper equal to e ≈ 2.7, and with a
first-stage input capacitance of 0.025 pF, we will need

$$\ln\left(\frac{800 \times 10^{-12}}{0.025 \times 10^{-12}}\right) = 10.4 \text{ or 11 stages} \tag{4.1}$$
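The stage count in Equation 4.1 follows directly from the taper and the capacitance ratio. Below is a minimal sketch of that arithmetic, assuming an ideal tapered chain with no parasitics; the function names are illustrative.

```python
import math


def spine_capacitance_pf(lines, length_cm, cap_pf_per_cm):
    """Total clock-spine capacitance in pF."""
    return lines * length_cm * cap_pf_per_cm


def buffer_stages(c_load, c_in, taper=math.e):
    """Ideal (non-integer) number of stages so that each stage sees the given taper."""
    return math.log(c_load / c_in) / math.log(taper)


if __name__ == "__main__":
    c_spine = spine_capacitance_pf(200, 2, 2)   # 800 pF
    stages = buffer_stages(c_spine, 0.025)      # about 10.4
    print(c_spine, round(stages, 1), math.ceil(stages))
```

Rounding up to a whole number of stages gives the 11 stages quoted in the text.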

The power dissipated charging the input capacitance of the flip-flop clock is $fCV^2$, or

$$P_1 = (4 \times 10^4)(200\ \text{MHz})(0.025\ \text{pF})(3.3\ \text{V})^2 = 2.178\ \text{W} \tag{4.2}$$

or approximately 2 W. This is only a little more than the power dissipated driving the 800
pF clock-spine interconnect, which we can calculate as follows:

$$P_2 = (200)(200\ \text{MHz})(20\ \text{mm})(2\ \text{pF cm}^{-1})(3.3\ \text{V})^2 = 1.7424\ \text{W} \tag{4.3}$$
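Both power figures come from the same switching-power expression, the number of switched capacitances times f, C, and the square of the supply voltage. The sketch below reproduces them; the function name is illustrative.

```python
def dynamic_power(count, f_hz, c_farads, vdd):
    """Switching power of `count` identical capacitances: count * f * C * V^2."""
    return count * f_hz * c_farads * vdd ** 2


if __name__ == "__main__":
    # 40,000 flip-flop clock inputs of 0.025 pF each.
    p_flops = dynamic_power(40_000, 200e6, 0.025e-12, 3.3)
    # 200 spine lines, each 2 cm long at 2 pF/cm, i.e. 4 pF per line.
    p_spine = dynamic_power(200, 200e6, 2.0 * 2e-12, 3.3)
    print(round(p_flops, 4), round(p_spine, 4))
```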

All of this power is dissipated in the clock-driver cell. The worst problem, however, is the
enormous peak current in the final inverter stage. If we assume the needed rise time is
0.1 ns, the peak current would have to approach

$$I = \frac{(800\ \text{pF})(3.3\ \text{V})}{0.1\ \text{ns}} \tag{4.4}$$

Clearly such a current is not possible without extraordinary design techniques. Clock
spines are used to drive loads of 100-200 pF but, as is apparent from the power
dissipation problems of this example, it would be better to find a way to spread the power
dissipation more evenly across the chip.
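Evaluating Equation 4.4 shows how large the peak current is. The sketch below simply computes the charge delivered divided by the rise time for the values given in the text; the function name is illustrative.

```python
def peak_current(c_farads, v_swing, rise_time_s):
    """Average current needed to charge C through v_swing within rise_time_s."""
    return c_farads * v_swing / rise_time_s


if __name__ == "__main__":
    # 800 pF spine, 3.3 V swing, 0.1 ns rise time.
    print(f"{peak_current(800e-12, 3.3, 0.1e-9):.1f} A")
```

The result is on the order of tens of amperes, which is why the text calls such a driver impractical.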

[Figure 4.9 labels: taper; clock tree inside block F]

Figure 4.9 A clock tree. (a) Minimum delay is achieved when the taper of successive
stages is about 3. (b) Using a fanout of three at successive nodes. (c) A clock tree for the
cell-based ASIC of Figure 4.8(b). We have to balance the clock arrival times at all of the
leaf nodes to minimize clock skew. [9]

We can design a tree of clock buffers so that the taper of each stage is e ≈ 2.7 by using a
fanout of three at each node, as shown in Figure 4.9(a) and (b). The clock tree, shown in
Figure 4.9(c), uses the same number of stages as a clock spine, but with a lower peak
current for the inverter buffers. Figure 4.9(c) illustrates that we now have another
problem: we need to balance the delay through the tree carefully to minimize clock skew.

Designing a clock tree that balances the rise and fall times at the leaf nodes has the
beneficial side-effect of minimizing the effect of hot-electron wearout. This problem
occurs when an electron gains enough energy to become "hot" and jump out of the
channel into the gate oxide. The trapped electrons change the threshold voltage of the
device and this alters the delay of the buffers. As the buffer delays change with time, this
introduces unpredictable skew. The problem is worst when the n-channel device is
carrying maximum current with a high voltage across the channel- this occurs during the
rise- and fall-time transitions. Balancing the rise and fall times in each buffer means that
they all wear out at the same rate, minimizing any additional skew.

A phase-locked loop (PLL) is an electronic flywheel that locks in frequency to an input
clock signal. The input and output frequencies may differ in phase, however. This means
that we can, for example, drive a clock network with a PLL in such a way that the output
of the clock network is locked in phase to the incoming clock, thus eliminating the
latency of the clock network. A PLL can also help to reduce random variation of the
input clock frequency, known as jitter, which, since it is unpredictable, must also be
discounted from the time available for computation in each clock cycle.

We began our design of an eight-bit shifter by first synthesizing and simulating the
VHDL code for the design using the Synopsys Design Compiler tool. After thoroughly
checking the design and optimizing it at the pre-layout stage, we created the design netlist
in the form of Verilog files (.v files). Next, we imported this design netlist into the Cadence
DSM environment using the Cadence Silicon Ensemble tool. Before moving on to the
floorplanning stage, we have to make sure that the proper library files and technology files
are also imported so that the design modules can be mapped to them. Next, we compile the
design netlist files. After successfully compiling the design netlists, we reach the
floorplanning stage. The floorplanning stage is the most important stage in the design flow
because at this stage we can estimate the statistics of the final chip by initializing
certain design parameters. For both our designs, that is, the Clocked Boolean Shifter and
the NCL Shifter, we set the I/O-to-core distance to 25 µm and the row utilization
factor to 85% to allow some extra space for easy routing of interconnects and to avoid
excessive congestion in the chip. We obtained the following estimated statistics of the
designs:
Expected Chip Statistics          NCL Shifter      Boolean Shifter
Estimated Area of the Chip        9636.171 µm²     4729.162 µm²
Number of Standard Cell Rows      15               4
Number of Cells                   166              97

Figures 4.10(a) and 4.10(b) show the floorplanned views of the NCL and Boolean
chips, respectively.

Figure 4.10(a) Floorplan View of NCL Shifter With Power and Input/Output Pins Placed

Figure 4.10(b) Floorplan View of Clocked Boolean Shifter With Power, Input/Output Pins and
Cells Placed

Placement

After completing the floorplan we can begin placement of the logic cells within the
flexible blocks. Placement is much more suited to automation than floorplanning, which is
why we need measurement techniques and algorithms. After the floorplanning and
placement of the chip, we can predict both the intrablock and the interblock capacitances.
This allows us to return to logic synthesis with more accurate estimates of the capacitive
loads that each logic cell must drive.

Most ASICs currently use three to five levels of metal for signal routing. With three
layers of metal, we route within the rectangular channels using the first metal layer for
horizontal routing, parallel to the channel spine, second metal layer for the vertical
direction and the third one again in the horizontal direction. The maximum number of
horizontal interconnects that can be placed side by side, parallel to the channel spine, is
called channel capacity. Vertical interconnects use feedthroughs to cross logic cells.

Placement Goals and Objectives
The goal of placement is to arrange all the logic cells within the flexible blocks on a chip.
The objectives of placement are to:
• Guarantee the router can complete the routing step
• Minimize all the critical net delays
• Make the chip as dense as possible
• Minimize the power dissipation
• Minimize crosstalk between signals
• Minimize the total estimated interconnect length

To determine the quality of placement, we need to be able to measure it. We need an
approximate measure of interconnect length, closely correlated with the final interconnect
length, that is easy to calculate.

[Figure 4.11 labels: expanded view of part of flexible block A; channels; rows of standard cells; cell instance name; minimum rectilinear Steiner tree, L = 16]

Figure 4.11 Placement using trees on graphs [9]

The graph structures that correspond to making all the connections for a net are known as
trees on graphs. A special class of trees, Steiner trees, minimizes the total length of the
interconnect, and these trees are central to ASIC routing algorithms. Figure 4.11 shows the
minimum Steiner tree. This type of tree uses diagonal connections; we want to solve a
restricted version of this problem, using interconnects on a rectangular grid. This is called
rectilinear routing.

The minimum rectilinear Steiner tree (MRST) is the shortest interconnect using a rectangular grid.
The determination of the MRST is in general an NP-complete problem, which means it is
hard to solve. For small numbers of terminals heuristic algorithms do exist, but they are
expensive to compute. We need to estimate the length of the interconnect only. Two
approximations to the MRST are shown in Figure 4.12.

[Figure 4.12 labels: complete-graph measure, L = 44/2 = 22; half-perimeter measure, L = 28/2 = 14]

Figure 4.12 Interconnect-length measures. (a) Complete-graph measure. (b) Half-perimeter measure. [9]

The complete graph has connections from each terminal to every other terminal. The
complete-graph measure adds all the interconnect lengths of the complete-graph
connection together and then divides by n/2, where n is the number of terminals. We can
justify this since, in a graph with n terminals, (n-1) interconnects will emanate from each
terminal to join the other (n-1) terminals in a complete-graph connection. That makes
n(n-1) interconnects in total. However, we have then counted each connection twice, so
there are one-half this many, or n(n-1)/2, interconnects needed for a complete-graph
connection. Now we actually only need (n-1) interconnects to join n terminals, so we
have n/2 times as many interconnects as we really need. Hence we divide the total net
length of the complete-graph connection by n/2 to obtain a more reasonable estimate of
minimum interconnect length. Figure 4.12(a) shows a complete-graph measure.
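The complete-graph measure described above can be sketched in a few lines. The code assumes that the length of each connection is the rectilinear (Manhattan) distance between its two terminals, which is an assumption of this example; the terminal coordinates are hypothetical.

```python
from itertools import combinations


def complete_graph_measure(terminals):
    """Sum of all pairwise rectilinear distances, divided by n/2."""
    n = len(terminals)
    if n < 2:
        return 0.0
    total = sum(abs(x1 - x2) + abs(y1 - y2)
                for (x1, y1), (x2, y2) in combinations(terminals, 2))
    return total / (n / 2)


if __name__ == "__main__":
    net = [(0, 0), (2, 0), (2, 3)]  # hypothetical terminal positions
    print(complete_graph_measure(net))
```

For a two-terminal net the divisor n/2 is 1, so the measure reduces to the distance between the two terminals.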

The bounding box is the smallest rectangle that encloses all the terminals. The half-perimeter
measure is one-half the perimeter of the bounding box. For nets with two or
three terminals, corresponding to a fanout of one or two (which usually includes over
50% of all the nets on a chip), the half-perimeter measure is the same as the minimum
Steiner tree. For nets with four or five terminals, the minimum Steiner tree is between
one and two times the half-perimeter measure. For a circuit with m nets, using the half-perimeter
measure corresponds to minimizing the cost function,

$$\tau_{D4} = \sum_{k=1}^{4} R_{k4} C_k \tag{4.5}$$
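The half-perimeter measure described above is straightforward to compute from the bounding box of a net. The sketch below also sums the measure over several nets as one possible placement cost; treating the plain sum as the cost is an assumption of this example, and the net coordinates are hypothetical.

```python
def half_perimeter(terminals):
    """Half the perimeter of the smallest rectangle enclosing all terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_half_perimeter(nets):
    """Sum of the half-perimeter measure over all nets (assumed cost)."""
    return sum(half_perimeter(terminals) for terminals in nets)


if __name__ == "__main__":
    nets = [
        [(0, 0), (3, 4)],          # two-terminal net
        [(0, 0), (2, 0), (2, 3)],  # three-terminal net
    ]
    print([half_perimeter(n) for n in nets], total_half_perimeter(nets))
```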

Figure 4.13(a) shows the NCL shifter with power, I/O pins and cells placed. Note that each
row of NCL cells is connected with power rings running vertically. Figure 4.13(b) shows
the corresponding view of its Boolean shifter counterpart.

Figure 4.13(a) NCL Shifter With Power, Input/Output Pins and Cells Placed. Each Row of NCL
Cells is Connected with Power Rings.

Figure 4.13(b) Clocked Boolean Shifter With Power, Input/Output Pins and Cells Placed. Each
Row of Boolean Cells is Connected with Power Rings.

Routing

Once the designer has floorplanned a chip and the logic cells within the flexible blocks
have been placed, it is time to make the connections by routing the chip. Routing is
divided into Global routing and Detailed Routing.

Global Routing

A global router does not make any connections, it just plans them. The input to the global
router is a floorplan that includes the locations of all the fixed and flexible blocks; the
placement information for flexible blocks; and the locations of all the logic cells. The
goal of global routing is to provide complete instructions to the detailed router on where
to route every net. The objectives of global routing are to:
• Minimize the total interconnect length.
• Maximize the probability that the detailed router can complete the routing.
• Minimize the critical path delay.

In both floorplanning and placement, with minimum interconnect length as an objective,
it is necessary to find the shortest total path length connecting a set of terminals. It is here
that the half-perimeter measure can be used to approximate the shortest length of the
interconnect. Floorplanning and placement both assume that interconnect may be put
anywhere on a rectangular grid, since at this point nets have not been assigned to the
channels, but the global router must use the wiring channels and find the actual path.
Often the global router needs to find a path that minimizes the delay between two
terminals.

Measurement of Interconnect Delay (Theoretical)

Floorplanning and placement need a fast and easy way to estimate the interconnect delay
in order to evaluate each trial placement; often this is a predefined look-up table. After
placement, the logic cell positions are fixed and the global router can afford to use better
estimates of the interconnect delay. In this section, we present one good
method of estimating the interconnect delay. Figure 4.14 shows the example circuit
used to illustrate interconnect delay calculation using the Elmore constant.

The problem is to find the voltages at the inputs to logic cells B and C, taking into account
the parasitic resistance and capacitance of the metal interconnect. Figure 4.14(c) models
logic cell A as an ideal switch with a pull-down resistance equal to $R_{pd}$ and models the
metal interconnect using resistors and capacitors for each segment of the interconnect.

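The Elmore constant for node 4 (labeled $V_4$) in the network shown in Figure 4.14(c) is

$$\tau_{D4} = \sum_{k=1}^{4} R_{k4} C_k = R_{14} C_1 + R_{24} C_2 + R_{34} C_3 + R_{44} C_4 \tag{4.6}$$

where

$$R_{14} = R_{pd} + R_1, \quad R_{24} = R_{pd} + R_1, \quad R_{34} = R_{pd} + R_1 + R_3, \quad R_{44} = R_{pd} + R_1 + R_3 + R_4 \tag{4.7}$$

In Equation 4.7 above, notice that $R_{24} = R_{pd} + R_1$ (and not $R_{pd} + R_1 + R_2$) because $R_1$ is the
resistance to $V_0$ (ground) shared by node 2 and node 4.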

Assume that we have the following parameters for the layout shown in Figure 4.14(b):

• m2 resistance is 50 mΩ/square

• m2 capacitance is 0.2 pF mm^-1

• 4X inverter delay is 0.02 ns + 0.5 CL ns (CL is in pF)

• Delay is measured using 0.35/0.65 output trip points

• m2 minimum width is 3λ = 0.9 µm

• 1X inverter input capacitance is 0.02 pF

First, the pull-down resistance, $R_{pd}$, of the 4X inverter is found. If we model the gate with
a linear pull-down resistor, $R_{pd}$, driving a load $C_L$, the output waveform is $e^{-t/(C_L R_{pd})}$. The
output reaches 63 percent of its final value when $t = C_L R_{pd}$, because $1 - e^{-1} = 0.63$. Then,
because the delay is measured with a 0.65 trip point, the constant 0.5 ns pF^-1 = 0.5 kΩ is
very close to the equivalent pull-down resistance. Thus, $R_{pd} \approx 500\ \Omega$.

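The values of the segment resistances and capacitances are calculated as below:

$$R_1 = R_2 = (0.1\ \mathrm{mm})(50 \times 10^{-3}\ \Omega)/(0.9\ \mu\mathrm{m}) = 6\ \Omega$$
$$R_3 = (1\ \mathrm{mm})(50 \times 10^{-3}\ \Omega)/(0.9\ \mu\mathrm{m}) = 56\ \Omega$$
$$R_4 = (2\ \mathrm{mm})(50 \times 10^{-3}\ \Omega)/(0.9\ \mu\mathrm{m}) = 112\ \Omega$$
$$C_1 = (0.1\ \mathrm{mm})(0.2\ \mathrm{pF\,mm^{-1}}) = 0.02\ \mathrm{pF}$$
$$C_2 = (0.1\ \mathrm{mm})(0.2\ \mathrm{pF\,mm^{-1}}) + 0.02\ \mathrm{pF} = 0.04\ \mathrm{pF} \qquad (4.8)$$
$$C_3 = (1\ \mathrm{mm})(0.2\ \mathrm{pF\,mm^{-1}}) = 0.2\ \mathrm{pF}$$
$$C_4 = (2\ \mathrm{mm})(0.2\ \mathrm{pF\,mm^{-1}}) + 0.02\ \mathrm{pF} = 0.42\ \mathrm{pF}$$

Now we can calculate the path resistance values, $R_{ki}$ (with $R_{ki} = R_{ik}$):

$$R_{14} = 500\ \Omega + 6\ \Omega = 506\ \Omega$$
$$R_{24} = 500\ \Omega + 6\ \Omega = 506\ \Omega$$
$$R_{34} = 500\ \Omega + 6\ \Omega + 56\ \Omega = 562\ \Omega \qquad (4.9)$$
$$R_{44} = 500\ \Omega + 6\ \Omega + 56\ \Omega + 112\ \Omega = 674\ \Omega$$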
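The segment values above follow directly from the sheet resistance, the capacitance per unit length and the wire width. The sketch below repeats that arithmetic; the constant and function names are our own and not taken from any tool. Note that the exact value for the 2 mm segment is about 111 Ω, which the text rounds to 112 Ω (twice the rounded 1 mm value).

```python
SHEET_RES_OHM_PER_SQ = 50e-3  # m2 sheet resistance, ohms per square
CAP_PF_PER_MM = 0.2           # m2 capacitance per unit length, pF/mm
WIDTH_UM = 0.9                # minimum m2 width, micrometres
GATE_CAP_PF = 0.02            # 1X inverter input capacitance, pF


def segment_resistance(length_mm):
    """Resistance in ohms: number of squares times sheet resistance."""
    squares = length_mm * 1000.0 / WIDTH_UM
    return squares * SHEET_RES_OHM_PER_SQ


def segment_capacitance(length_mm, load_pf=0.0):
    """Wire capacitance in pF plus any gate load at the end of the segment."""
    return length_mm * CAP_PF_PER_MM + load_pf


r1 = segment_resistance(0.1)               # about 5.6 ohms, quoted as 6
r3 = segment_resistance(1.0)               # about 55.6 ohms, quoted as 56
c2 = segment_capacitance(0.1, GATE_CAP_PF)  # 0.04 pF
c4 = segment_capacitance(2.0, GATE_CAP_PF)  # 0.42 pF
print(round(r1), round(r3), round(c2, 2), round(c4, 2))
```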

[Figure 4.14, panels (a) to (c): inverter A (4X) driving logic cells B and C (1X); layout
with 0.1 mm and 1 mm interconnect segments marked; RC model showing the pull-down
resistance of inverter A and the resistance of the interconnect segments, with nodes V0
to V4.]

Figure 4.14 Measuring the delay of a net with the help of a simple circuit. [9]

Finally, we can calculate Elmore's constants for node 4 and node 2 as follows:

$$\tau_{D4} = (506)(0.02) + (506)(0.04) + (562)(0.2) + (674)(0.42) = 425\ \mathrm{ps} \qquad (4.10)$$

$$\tau_{D2} = (500 + 6 + 6)(0.04) + (500 + 6)(0.02 + 0.2 + 0.42) = 344\ \mathrm{ps}$$

and

$$\tau_{D4} - \tau_{D2} = (425 - 344) = 81\ \mathrm{ps} \qquad (4.11)$$

A lumped-delay model neglects the effects of interconnect resistance and simply sums all
the node capacitances as follows:

$$\tau_D = R_{pd}(C_1 + C_2 + C_3 + C_4) = (500)(0.02 + 0.04 + 0.2 + 0.42) = 340\ \mathrm{ps} \qquad (4.12)$$

Comparing equations 4.10-4.12, we can see that the delay of the inverter can be assigned
as follows : 20 ps (due to pull-down resistance and the output capacitance), 4 ps (due to
the interconnect from A to B), and 65 ps (due to the interconnect from A to C). We can
see that the error from neglecting interconnect resistance can be important.
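The Elmore and lumped-delay figures above can be reproduced with a short sketch. It assumes the tree topology implied by the path resistances in the text: R1 runs from the driver to node 1, R2 from node 1 to node 2, R3 from node 1 to node 3, and R4 from node 3 to node 4. The data structures and function names are our own. With resistances in ohms and capacitances in pF, the products come out in picoseconds.

```python
R = {"pd": 500.0, 1: 6.0, 2: 6.0, 3: 56.0, 4: 112.0}  # ohms
C = {1: 0.02, 2: 0.04, 3: 0.2, 4: 0.42}               # pF

# Resistors on the path from ground (through the driver) to each node.
PATH = {
    1: {"pd", 1},
    2: {"pd", 1, 2},
    3: {"pd", 1, 3},
    4: {"pd", 1, 3, 4},
}


def shared_resistance(k, i):
    """R_ki: total resistance common to the paths to node k and node i."""
    return sum(R[r] for r in PATH[k] & PATH[i])


def elmore(i):
    """Elmore constant of node i in ps: sum over k of R_ki * C_k."""
    return sum(shared_resistance(k, i) * C[k] for k in C)


def lumped():
    """Lumped-delay estimate in ps: driver resistance times total capacitance."""
    return R["pd"] * sum(C.values())


tau4, tau2, tau_lumped = elmore(4), elmore(2), lumped()
# Unrounded results are about 425.8, 344.3 and 340.0 ps;
# the text quotes 425, 344 and 340 ps.
print(round(tau4, 2), round(tau2, 2), round(tau_lumped, 2))
```

The gap between the Elmore constant of node 4 and the lumped estimate is the contribution of the interconnect resistance that the lumped model ignores.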

Global Routing Methods

Global routing cannot use the interconnect-length approximations, such as the
half-perimeter measure, that were used in placement. At this stage of the design, the router
needs the actual path instead of the approximate path length. Present-day tools use
routers based on solutions to the tree-on-graph problem. One such approach to
global routing takes each net in turn and calculates the shortest path using tree-on-graph
algorithms with the added restrictions of the available channels. This process is known as
sequential routing. As sequential routing proceeds, some channels become more
congested, since they hold more interconnects than others. On chips with a definite
channel capacity, each channel can handle only a certain number of interconnects, and
the router must resolve the resulting congestion.
One solution is order-independent routing, wherein a global router proceeds by
routing each net, ignoring how crowded the channels are. Whether a particular net is
processed first or last does not matter; the channel assignment will be the same. In
order-independent routing, after all the interconnections are assigned to channels, the global
router returns to those channels that are the most crowded and reassigns some
interconnects to other, less crowded, channels. This is also a form of sequential routing.

In contrast to sequential routing, which handles one net at a time, hierarchical routing
handles all nets on the chip at the same time. The global-routing problem is made more
tractable by dividing the chip area into levels of hierarchy. By considering only one level
of hierarchy at a time, the size of the problem is reduced at each level. Starting at the
whole chip, or highest level, and proceeding down to the logic cells is the top-down
approach. Most present-day tools use this methodology for routing.
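The sequential routing described in this section can be sketched as follows. This is a minimal illustration under our own assumptions, not the algorithm of any particular tool: the channel graph, the net names and the capacity are hypothetical, and a plain shortest-path search stands in for the tree-on-graph algorithms. Each net is routed in turn along the shortest path through channels that still have free capacity, so in this variant the result depends on the order in which nets are processed.

```python
import heapq


def shortest_path(graph, usage, capacity, src, dst):
    """Dijkstra search that skips channels already filled to capacity."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, length in graph[u]:
            if usage.get(frozenset((u, v)), 0) >= capacity:
                continue  # this channel is full
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None  # no route left within the channel capacities
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def route_sequentially(graph, nets, capacity):
    """Route nets one at a time, recording how many nets use each channel."""
    usage, routes = {}, {}
    for name, (src, dst) in nets.items():
        path = shortest_path(graph, usage, capacity, src, dst)
        routes[name] = path
        if path:
            for u, v in zip(path, path[1:]):
                edge = frozenset((u, v))
                usage[edge] = usage.get(edge, 0) + 1
    return routes


# Hypothetical channel graph: a short route A-B-D and a longer route A-C-D.
graph = {
    "A": [("B", 1), ("C", 2)],
    "B": [("A", 1), ("D", 1)],
    "C": [("A", 2), ("D", 2)],
    "D": [("B", 1), ("C", 2)],
}
nets = {"n1": ("A", "D"), "n2": ("A", "D"), "n3": ("A", "D")}
routes = route_sequentially(graph, nets, capacity=1)
print(routes)
```

With one track per channel, the first net takes the short route, the second is pushed onto the longer one, and the third cannot be routed at all, which is the congestion problem the text describes.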

Power Routing

Power buses have to be sized according to the current they will carry. Too much current in a
power bus can lead to failure through a mechanism known as electromigration. The
required power-bus widths can be estimated automatically from library information, from
a separate power simulation tool, or by entering the power-bus widths into the routing
software by hand. Many routers use a default power-bus width so that it is quite easy to
complete routing. For a direct current, the mean time to failure (MTTF) due to
electromigration is experimentally found to obey the following equation:

$$\mathrm{MTTF} = A J^{-2} \exp(E / kT) \qquad (4.13)$$

where J is the current density; E is the activation energy, approximately 0.5 eV; k is
Boltzmann's constant; and T is the absolute temperature in kelvins.
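The practical consequence of this relation for bus sizing can be seen with a small sketch. It assumes the standard form of Black's equation, MTTF = A J^-2 exp(E/kT), in which lifetime falls as temperature rises. The prefactor A is process dependent and unknown here, so it is set to 1 and only ratios are meaningful. The function name is our own.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K


def relative_mttf(current_density, temp_kelvin, activation_ev=0.5, a=1.0):
    """Relative MTTF from Black's equation; A is an arbitrary prefactor."""
    return (a * current_density ** -2
            * math.exp(activation_ev / (K_BOLTZMANN_EV * temp_kelvin)))


# Doubling the bus width halves the current density for the same current.
narrow = relative_mttf(current_density=2.0, temp_kelvin=350.0)
wide = relative_mttf(current_density=1.0, temp_kelvin=350.0)
gain = wide / narrow
print(gain)
```

Because the current density enters as an inverse square, halving it extends the expected lifetime fourfold at a fixed temperature, which is why widening a power bus is an effective remedy.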

In our designs, we have chosen the width of the power bus and ground bus to be 10 µm
since it fulfills the above criteria. Figure 4.15(a) shows the complete NCL chip and
Figure 4.15(b) shows the complete Clocked Boolean Shifter chip. Both designs
are routed using Warp-route, which has an advantage over routing the design step by step
using Global and Final routing. In both our chips, we set the Row Utilization Factor
to 85%, which is a standard value. As can be seen from Figure 4.15(a) and Figure
4.15(b), the Clocked Boolean Shifter chip is more compact than its NCL
counterpart. Although the Boolean chip is more area efficient, it requires Timing
Analysis; the NCL chip, being clockless, has the advantage that it does not.

Figure 4.16 shows the optimized NCL shifter chip with a Row Utilization Factor of 90%.
This saved some area compared with the previous NCL design.

Figure 4.15(a) The Complete NCL Shifter Chip

Figure 4.15(b) The Complete Clocked Boolean Shifter Chip

Figure 4.16 Optimized NCL Shifter Chip With 90% Core Row Utilization

Circuit Extraction

After the design is completely routed with no violations, the exact length and position of
each interconnect for every net is known. The parasitic capacitance and resistance
associated with each interconnect, via, and contact can now be calculated. This data is
generated by a circuit-extraction tool in one of these forms: RSPF, SPF and DSPF. For
our designs, we have this data in RSPF format (see Appendix).

CHAPTER 5
CONCLUSION

This thesis has discussed a number of important IC design issues. In the past few years,
CMOS technology has advanced by leaps and bounds. In this work, the
concept of scaling has been introduced.

Miniaturization of devices and circuits has given rise to some unavoidable problems.
It is very important for an IC designer to understand the concept of
scaling and apply it carefully while designing any circuit or system, since improper
scaling of device or circuit parameters could lead to totally undesirable results.
The relationships among the various parameters and their dependencies on one another have to
be understood before scaling. We have introduced three types of scaling: ideal,
constant-voltage and quasi-ideal scaling. For high packing density, we prefer the quasi-ideal
scaling approach, since the vertical dimensions are scaled more slowly than the horizontal
dimensions, resulting in tall and narrow geometry. On the other hand, we could trade off
some packing density and suppress noise effects by adopting the constant dimension
scaling approach. Which scaling approach to employ depends on the complexity of the
circuit or system.

Signal integrity plays a very significant role in today's IC design arena.
With smaller and faster devices and increasingly complex circuits and systems, signal
integrity has become a focus of attention for today's designers. In this work, efforts have
been made to demonstrate the major signal integrity issues. Clock skew is one of the major
issues in clocked synchronous systems. Clock skew can be caused by different
lengths of the clock paths to different parts of the system or by different loads on the clock
drivers. For design integrity, a balanced Clock Tree approach is employed, wherein we try
to make the clock signal reach different parts of the circuit or system at the same time by
inserting buffers in the shorter paths from the clock source. This is called Clock
Synchronization. Generating single-phase, two-phase and multiple clock sources is
another way to address clock skew problems.

In today's denser ICs, where interconnects are routed very close to one another, there is a
high probability of signal coupling. This occurs due to the parasitic coupling
capacitances between closely routed interconnects. The coupling capacitance, Cc, accounts
for roughly 70% of the total wire capacitance. This leads to crosstalk and noise. It is very
important to understand the signal coupling effect and be able to derive an appropriate
model for crosstalk and noise. A good design is one which is power efficient. The
combination of large capacitive loads and a continuous demand for higher frequencies of
operation has led to an increasingly large proportion of the total power of the system,
about 25%, being dissipated within the clock distribution network. One
solution to this problem is a technique for designing clock buffers and pipeline
registers such that the clock distribution network operates at half the power supply swing,
thereby reducing the power dissipated in the clock tree by 60% without compromising
the clock frequency of the circuit. Properly designed clock drivers and/or reduced clock
swing could lead to a system with considerable clock power reduction.

In deep submicron technology, the interconnect delay contributes about 30% of the delay in
the chip. Therefore, it has become necessary to come up with an appropriate
model for on-chip interconnects which can demonstrate effects such as signal coupling,
delay and crosstalk. Interconnect modeling has been given prime importance in this work.
A novel model of interconnect has been proposed. Simulations using RC and RLC
models have been carried out using Cadence tools, and they support our results. Our simulation
results show that there is a considerable amount of signal coupling between two
interconnect lines running parallel to each other. In DSM technology, there is a need
to understand the impact of inductance (in addition to coupling capacitance) on delay and
crosstalk at high frequencies of operation. It is necessary to include inductance in the
interconnect model at high frequencies.

Asynchronous architectures and systems are gaining ground in modern
VLSI design. In this work, efforts have been made to demonstrate the complete
ASIC design flow. The back-end process of IC design is gaining
importance, since it is in this part of the ASIC design flow that designers
have considerable freedom to meet the specifications of the system, depending upon their
approach to the design. In this work, the design of an eight-bit shifter is presented using
the Synchronous Clocked Boolean approach and the Asynchronous Clockless Null Convention
Logic approach. These designs are the result of a real design project sponsored by
Theseus Logic Inc., Orlando. A detailed comparison of the two
design approaches is given in Table 5.1 below.

                                          BOOLEAN              NCL

 1] NO. OF MACROS                         48                   97
 2] NO. OF COMPONENTS                     80                   255
 3] NUMBER OF PINS                        276                  962
 4] NUMBER OF NETS                        71                   216
 5] AVG. NO. OF PINS/NET                  3.89                 4.45
 6] NUMBER OF ROUTING TRACKS AVAILABLE    201                  348
 7] NUMBER OF GCELLS/LAYER                110                  323
 8] CHIP AREA (SQ. DBU)                   7327200000           21859560000
 9] CELL AREA UTILIZATION                 14.7 %               37.47 %
10] % ROW SPACE                           93.5                 86.97
11] LENGTH OF SPECIAL WIRING (microns)    485.2                2976.8
12] LENGTH OF REGULAR WIRING (microns)    2621.36              11013.46
13] TOTAL WIRELENGTH (microns)            3106.56              13990.26

Table 5.1 Comparison between Eight-Bit NCL and Boolean chips

Table 5.1 compares various design statistics of these designs, obtained from the
design report files. The comparison shows that the Synchronous
Clocked Boolean design is much more area efficient: its chip area is roughly one third
that of its Asynchronous counterpart. It can therefore be concluded that for small designs
(500-2000 gates), it is advisable to adopt the Synchronous design approach to realize
compact and area-efficient chips. For very complex and large systems, the choice between these
approaches depends upon the application and remains debatable. However, it would
be quite feasible to use Asynchronous Clockless architectures where complex chip
functionalities are to be integrated without the hassle of timing
problems.
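The size difference between the two chips can be checked directly from the figures in Table 5.1. The sketch below only divides the reported values; the dictionary keys are our own labels.

```python
# Values copied from Table 5.1 (areas in square DBU, lengths in microns).
boolean = {"chip_area": 7327200000, "total_wirelength": 3106.56, "nets": 71}
ncl = {"chip_area": 21859560000, "total_wirelength": 13990.26, "nets": 216}

# Ratio of the NCL figure to the Boolean figure for each statistic.
ratios = {key: ncl[key] / boolean[key] for key in boolean}

for key, value in ratios.items():
    print(f"{key}: NCL is {value:.2f}x the Boolean design")
```

The NCL chip occupies about three times the area, has about three times as many nets, and uses about four and a half times the wire length of the Boolean chip.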


APPENDIX A
***SILICON_ENSEMBLE DESIGN SUMMARY REPORT***

The following is a design summary report file obtained from
Cadence for our design of eight-bit NCL Shifter Chip.

Design name: EU_ALU_SHIFTER
Report file name: CHIP2_WARPROUTED.summary
Number of macros: 97
Number of components: 255
Number of pins: 962
Number of regular pins: 630
Number of special pins: 332
Number of unused pins: 0
Number of nets: 216
Average number of pins per net: 4.45
Number of subnets: 0
Number of routing tracks available: 348
Number of GCELLS per layer: 323

** NET STATISTICS OF PIN COUNTS
Number of 2-pin nets: 119
Number of 3-pin nets: 63
Number of 4-pin nets: 25
Number of 5-pin nets: 1
Number of 6-pin nets: 1
Number of 17-pin nets: 2
Number of 18-pin nets: 2
Number of 22-pin nets: 1
Number of 166-pin nets: 2

** MACRO USAGE STATISTICS
Macro_name                  Sites required   Site type   # instances
#SHIFTEROUT00_METAL1        0                            1
#SHIFTEROUT01_METAL1        0                            1
#SHIFTEROUT10_METAL1        0                            1
#SHIFTEROUT11_METAL1        0                            1
#SHIFTEROUT20_METAL1        0                            1
#SHIFTEROUT21_METAL1        0                            1
#SHIFTEROUT30_METAL1        0                            1
#SHIFTEROUT31_METAL1        0                            1
#SHIFTEROUT40_METAL1        0                            1
#SHIFTEROUT41_METAL1        0                            1
#SHIFTEROUT50_METAL1        0                            1
#SHIFTEROUT51_METAL1        0                            1
#SHIFTEROUT60_METAL1        0                            1
#SHIFTEROUT61_METAL1        0                            1
#SHIFTEROUT70_METAL1        0                            1
#SHIFTEROUT71_METAL1        0                            1
#EU_FLAG_C8_REG0_METAL1     0                            1
#EU_FLAG_C8_REG1_METAL1     0                            1
#SELFLAGS_METAL1            0                            1
#C_REG0_METAL1              0                            1
#C_REG1_METAL1              0                            1
#OPER0_METAL1               0                            1
#OPER1_METAL1               0                            1
#OPER2_METAL1               0                            1
#OPER3_METAL1               0                            1
#OPER4_METAL1               0                            1
#OPER5_METAL1               0                            1
#OPER6_METAL1               0                            1
#CW_SHIFTIN_MUX0_METAL1     0                            1
#CW_SHIFTIN_MUX1_METAL1     0                            1
#SHIFTH0_METAL1             0                            1
#SHIFTN0_METAL1             0                            1
#SHIFTN1_METAL1             0                            1
#SHIFTV0_METAL1             0                            1
#SHIFTV1_METAL1             0                            1
#SHIFTZ0_METAL1             0                            1
#SHIFTZ1_METAL1             0                            1
#SHIFTC0_METAL1             0                            1
#SHIFTC1_METAL1             0                            1
#SHIFT_OUT00_METAL1         0                            1
#SHIFT_OUT01_METAL1         0                            1
#SHIFT_OUT10_METAL1         0                            1
#SHIFT_OUT11_METAL1         0                            1
#SHIFT_OUT20_METAL1         0                            1
#SHIFT_OUT21_METAL1         0                            1
#SHIFT_OUT30_METAL1         0                            1
#SHIFT_OUT31_METAL1         0                            1
#SHIFT_OUT40_METAL1         0                            1
#SHIFT_OUT41_METAL1         0                            1
#SHIFT_OUT50_METAL1         0                            1
#SHIFT_OUT51_METAL1         0                            1
#SHIFT_OUT60_METAL1         0                            1
#SHIFT_OUT61_METAL1         0                            1
#SHIFT_OUT70_METAL1         0                            1
#SHIFT_OUT71_METAL1         0                            1
#ACCREG00_METAL1            0                            1
#ACCREG01_METAL1            0                            1
#ACCREG02_METAL1            0                            1
#ACCREG03_METAL1            0                            1
#ACCREG10_METAL1            0                            1
#ACCREG11_METAL1            0                            1
#ACCREG12_METAL1            0                            1
#ACCREG13_METAL1            0                            1
#ACCREG20_METAL1            0                            1
#ACCREG21_METAL1            0                            1
#ACCREG22_METAL1            0                            1
#ACCREG23_METAL1            0                            1
#ACCREG30_METAL1            0                            1
#ACCREG31_METAL1            0                            1
#ACCREG32_METAL1            0                            1
#ACCREG33_METAL1            0                            1
#ABUS00_METAL1              0                            1
#ABUS01_METAL1              0                            1
#ABUS02_METAL1              0                            1
#ABUS03_METAL1              0                            1
#ABUS10_METAL1              0                            1
#ABUS11_METAL1              0                            1
#ABUS12_METAL1              0                            1
#ABUS13_METAL1              0                            1
#ABUS20_METAL1              0                            1
#ABUS21_METAL1              0                            1
#ABUS22_METAL1              0                            1
#ABUS23_METAL1              0                            1
#ABUS30_METAL1              0                            1
#ABUS31_METAL1              0                            1
#ABUS32_METAL1              0                            1
#ABUS33_METAL1              0                            1
#ALUOUT00_METAL1            0                            1
#ALUOUT01_METAL1            0                            1
TH11X0                      4                core        18
TH12X0                      4                core        33
TH13X0                      5                core        2
TH14X0                      6                core        4
TH22X0                      9                core        65
TH33W2X0                    13               core        33
TH33X0                      11               core        4
TH34W22X0                   18               core        7

** UTILIZATION OF ALL ROW TYPES
Type          Number    Length     Area          %_Row_Space
core Rows     15        1471500    9417600000
core Cells    166       1279800    8190720000    86.97

Area of chip: 21859560000 (square DBU)
Area required for all cells: 8190720000 (square DBU)
Area utilization of all cells: 37.47%

** LAYOUT RESULTS
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: PLACE FAILS
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:09
PLACE: CPU time used 0:00:01
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: PLACE FAILS
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:04
PLACE: CPU time used 0:00:01
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####

PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: PLACE FAILS
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:05
PLACE: CPU time used 0:00:01
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: PLACE FAILS
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:03
PLACE: CPU time used 0:00:01
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: PLACE FAILS
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:04
PLACE: CPU time used 0:00:01
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: PLACE FAILS
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:06
PLACE: CPU time used 0:00:01
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Reading back the DEF
PLACE: PLACE finishes successfully
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:15
PLACE: CPU time used 0:00:05
WROUTE: Configuration file noConfigFile
WROUTE: ##### Welcome to WROUTE ####
WROUTE: Dump the LEF file
WROUTE: Dump the DEF file
WROUTE: Reading back the WRoute DB: wroute.wdb
WROUTE: Reading routing segments ...
WROUTE: Reading net violation infos ...
WROUTE: Reading global router congestion map ...
WROUTE: WROUTE finishes successfully


WROUTE: Elapsed date 0
WROUTE: Elapsed time 0:00:19
WROUTE: CPU time used 0:00:10

** LAYER INFORMATION
Total layers: 12
Routing layers: 5
Layer information by layer number:
1 ==> METAL1
prefers horizontal routing
2 ==> METAL2
prefers vertical routing
3 ==> METAL3
prefers horizontal routing
4 ==> METAL4
prefers vertical routing
5 ==> METAL5
prefers horizontal routing
6 ==> POLY1
can't route
7 ==> VIA12
can't route
8 ==> VIA23
can't route
9 ==> VIA34
can't route
10 ==> VIA45
can't route
11 ==> OVERLAP
can't route
12 ==> VIRTUAL
can't route
Layers in process order (top to bottom):
OVERLAP
METAL5
VIA45
METAL4
VIA34
METAL3
VIA23
METAL2
VIA12
METAL1
POLY1
VIRTUAL
Layers in routing order (top to bottom):
METAL5
METAL4
METAL3
METAL2
METAL1

APPENDIX B
***SILICON_ENSEMBLE WIRING REPORT***

The following is the wiring report file which gives us the
information about the wire used in the design. This file
contains information about our eight-bit NCL Shifter chip.
Design name: EU_ALU_SHIFTER
Report file name: CHIP2_WARPROUTED.wires

** (only DETAILED wiring are reported for REGULAR nets)
Total vias in regular wiring: 1032
Total segments in regular wiring: 1479
Total vias in special wiring: 40
Total segments in special wiring: 56

LAYER name: METAL1
Total wire length: 2973.66 microns
Length of regular wires: 489.66 microns
Length of special wires: 2484.00 microns
LAYER name: METAL2
Total wire length: 5177.62 microns
Length of regular wires: 4684.82 microns
Length of special wires: 492.80 microns
LAYER name: METAL3
Total wire length: 4513.52 microns
Length of regular wires: 4513.52 microns
Length of special wires: .00 microns
LAYER name: METAL4
Total wire length: 1325.46 microns
Length of regular wires: 1325.46 microns
Length of special wires: .00 microns

Total wire length in regular wiring: 11013.46 microns
Total wire length in special wiring: 2976.80 microns
Total wirelength in regular+special wiring: 13990.26 microns
Timing place slope of wire X: .254
Timing place intercept of wire X: .795
Timing place slope of wire Y: .279
Timing place intercept of wire Y: .547

APPENDIX C
***SILICON_ENSEMBLE DESIGN SUMMARY REPORT***

APPENDIX C
, ***SILICON_ENSEMBLE DESIGN SUMMARY REPORT***
The following is a design summary report file obtained from
Cadence for our design of eight-bit Boolean Shifter Chip.

Design name: cpu_eu_SHIFTER
Report file name: boolean_shifter_wrouted.summary
Number of macros: 48
Number of components: 80
Number of pins: 276
Number of regular pins: 194
Number of special pins: 82
Number of unused pins: 0
Number of nets: 71
Average number of pins per net: 3.89
Number of subnets: 0
Number of routing tracks available: 201
Number of GCELLS per layer: 110

** NET STATISTICS OF PIN COUNTS
Number of 2-pin nets: 40
Number of 3-pin nets: 23
Number of 4-pin nets: 2
Number of 8-pin nets: 1
Number of 9-pin nets: 3
Number of 41-pin nets: 1
Number of 43-pin nets: 1

** MACRO USAGE STATISTICS
Macro_name                    Sites required   Site type   # instances
#eu_flag_c8_reg_METAL1        0                            1
#eu_ccr_c_reg_METAL1          0                            1
#eu_acc_reg[7]_METAL1         0                            1
#eu_acc_reg[6]_METAL1         0                            1
#eu_acc_reg[5]_METAL1         0                            1
#eu_acc_reg[4]_METAL1         0                            1
#eu_acc_reg[3]_METAL1         0                            1
#eu_acc_reg[2]_METAL1         0                            1
#eu_acc_reg[1]_METAL1         0                            1
#eu_acc_reg[0]_METAL1         0                            1
#eu_abus[7]_METAL1            0                            1
#eu_abus[6]_METAL1            0                            1
#eu_abus[5]_METAL1            0                            1
#eu_abus[4]_METAL1            0                            1
#eu_abus[3]_METAL1            0                            1
#eu_abus[2]_METAL1            0                            1
#eu_abus[1]_METAL1            0                            1
#eu_abus[0]_METAL1            0                            1
#eu_alu_out0_METAL1           0                            1
#cw_shiftin_mux[1]_METAL1     0                            1
#cw_shiftin_mux[0]_METAL1     0                            1
#cw_shifter_oper[6]_METAL1    0                            1
#cw_shifter_oper[5]_METAL1    0                            1
#cw_shifter_oper[4]_METAL1    0                            1
#cw_shifter_oper[3]_METAL1    0                            1
#cw_shifter_oper[2]_METAL1    0                            1
#cw_shifter_oper[1]_METAL1    0                            1
#cw_shifter_oper[0]_METAL1    0                            1
#eu_shifter_z_bit_METAL1      0                            1
#eu_shifter_v_bit_METAL1      0                            1
#eu_shifter_c_bit_METAL1      0                            1
#eu_shift_out7_METAL1         0                            1
#eu_shift_out6_METAL1         0                            1
#eu_shift_out5_METAL1         0                            1
#eu_shift_out4_METAL1         0                            1
#eu_shift_out3_METAL1         0                            1
#eu_shift_out2_METAL1         0                            1
#eu_shift_out1_METAL1         0                            1
#eu_shift_out0_METAL1         0                            1
AOI222XL                      8                core        1
AOI22X1                       6                core        3
MX2X1                         8                core        1
NAND2X1                       3                core        16
NOR2XL                        3                core        1
OAI2BB1X1                     5                core        14
OR3X1                         6                core        1
OR4X1                         6                core        3
XOR2XL                        8                core        1

** UTILIZATION OF ALL ROW TYPES
Type          Number    Length    Area          %_Row_Space
core Rows     5         180000    1152000000
core Cells    41        168300    1077120000    93.50

Area of chip: 7327200000 (square DBU)
Area required for all cells: 1077120000 (square DBU)
Area utilization of all cells: 14.70%

** LAYOUT RESULTS
PLACE: Configuration file noConfigFile
PLACE: ##### Welcome TO PLACE####
PLACE: Dump the LEF file
PLACE: Dump the DEF file
PLACE: Reading back the DEF
PLACE: PLACE finishes successfully
PLACE: Elapsed date 0
PLACE: Elapsed time 0:00:15
PLACE: CPU time used 0:00:09
WROUTE: Configuration file noConfigFile
WROUTE: ##### Welcome to WROUTE ####
WROUTE: Dump the LEF file
WROUTE: Dump the DEF file
WROUTE: Reading back the WRoute DB: wroute.wdb
WROUTE: Reading routing segments ...
WROUTE: Reading net violation infos ...
WROUTE: Reading global router congestion map ...
WROUTE: WROUTE finishes successfully
WROUTE: Elapsed date 0
WROUTE: Elapsed time 0:00:28
WROUTE: CPU time used 0:00:15

** LAYER INFORMATION
Total layers: 12
Routing layers: 5
Layer information by layer number:
1 ==> METAL1
prefers horizontal routing
2 ==> METAL2
prefers vertical routing
3 ==> METAL3
prefers horizontal routing
4 ==> METAL4
prefers vertical routing
5 ==> METAL5
prefers horizontal routing
6 ==> POLY1
can't route
7 ==> VIA12
can't route
8 ==> VIA23
can't route
9 ==> VIA34
can't route
10 ==> VIA45
can't route
11 ==> OVERLAP
can't route
12 ==> VIRTUAL
can't route
Layers in process order (top to bottom):
OVERLAP
METAL5
VIA45
METAL4
VIA34
METAL3
VIA23
METAL2
VIA12
METAL1
POLY1
VIRTUAL
Layers in routing order (top to bottom):
METAL5
METAL4
METAL3
METAL2
METAL1

APPENDIX D
***SILICON_ENSEMBLE WIRING REPORT***

The following is the wiring report file which gives us the
information about the wire used in the design. This file
contains information about our eight-bit Boolean Shifter
chip.

Design name: cpu_eu_SHIFTER
Report file name: boolean_shifter_wrouted.wires

** (only DETAILED wiring are reported for REGULAR nets)
Total vias in regular wiring: 303
Total segments in regular wiring: 454
Total vias in special wiring: 8
Total segments in special wiring: 8

LAYER name: METAL1
Total wire length: 449.04 microns
Length of regular wires: 200.64 microns
Length of special wires: 248.40 microns
LAYER name: METAL2
Total wire length: 1481.82 microns
Length of regular wires: 1245.02 microns
Length of special wires: 236.80 microns
LAYER name: METAL3
Total wire length: 934.50 microns
Length of regular wires: 934.50 microns
Length of special wires: .00 microns
LAYER name: METAL4
Total wire length: 241.20 microns
Length of regular wires: 241.20 microns
Length of special wires: .00 microns
Total wire length in regular wiring: 2621.36 microns
Total wirelength in special wiring: 485.20 microns
Total wirelength in regular+special wiring: 3106.56 microns
Timing place slope of wire X: .601
Timing place intercept of wire X: -.309
Timing place slope of wire Y: .180
Timing place intercept of wire Y: .814

LIST OF REFERENCES

[1] Dennis Michael Sylvester, "Analytical Modeling and Characterization of Deep
Submicron Interconnect", University of California, Berkeley.
[2] Harry Veendrick, Deep-Submicron CMOS ICs: From Basics to ASICs, 1998, Kluwer
Bedrijfsinformatie b.v., Deventer, The Netherlands.
[3] F. Anceau, "A Synchronous Approach for Clocking VLSI Systems", IEEE Journal of
Solid-State Circuits, February 1982, 204-213.
[4] M. Afghahi and C. Svensson, "Performance of Synchronous and Asynchronous
Schemes for VLSI Systems", IEEE Transactions on Computers, July 1992, 397-412.
[5] Wayne Wolf, Modern VLSI Design - A Systems Approach, PTR Prentice Hall,
Englewood Cliffs, New Jersey 07632.
[6] Erik De Man and Matthias Schobinger, "Power Dissipation in the Clock System of
highly pipelined VLSI CMOS Circuits", SIEMENS AG., Corporate Research and
Development, Germany.
[7] Xavier Aragones, Jose Luis Gonzalez & Antonio Rubio, Analysis and Solutions for
Switching Noise Coupling in Mixed-Signal ICs, Kluwer Academic Publishers,
Boston.
[8] Chung-Kuan Cheng, John Lillis, Shen Lin, Norman Chang, Interconnect Analysis and
Synthesis, John Wiley & Sons, New York.
[9] Michael John Sebastian Smith, Application-Specific Integrated Circuits,
Addison-Wesley Longman, Inc.
[10] R. Jacob Baker, Harry W. Li, David E. Boyce, CMOS Circuit Design, Layout, and
Simulation, IEEE Press Series on Microelectronic Systems, Edited by Stuart K.
Tewksbury.

